Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve link checker (possible rewrite) #1838

Open
1 of 6 tasks
shcheklein opened this issue Oct 4, 2020 · 22 comments
Open
1 of 6 tasks

improve link checker (possible rewrite) #1838

shcheklein opened this issue Oct 4, 2020 · 22 comments
Assignees
Labels
dependencies Pull requests that update a dependency file (automatic) p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. 🐛 type: bug Something isn't working.

Comments

@shcheklein
Copy link
Member

shcheklein commented Oct 4, 2020

Link checker seems to be broken in certain scenarios, has false positives, noisy reporting:

  • Introduce state- report a failed link only if fails certain number of times
  • Introduce retries on certain error codes (5xx) - try 2-3 times
  • Email come with my nam every time it fails even on someone's else branch
  • Report via email is not informative, comes from my name for some reason Ivan Shcheklein
  • I've seen failures to run- on forks, on regular branches like this - https://github.com/iterative/dvc.org/runs/1206599665?check_suite_focus=true, etc
  • Twitter links are broken

Test on fork, test of force push.

@shcheklein shcheklein added 🐛 type: bug Something isn't working. p0-critical Affects users in a bad way at the moment labels Oct 4, 2020
@shcheklein
Copy link
Member Author

Another failure today: https://github.com/iterative/dvc.org/actions/runs/289778559

And I got an email from my name.

@iesahin

This comment has been minimized.

@shcheklein
Copy link
Member Author

https://dev.to/bnb/markdown-link-checking-in-github-with-actions-5dp0 - can be useful potentially

@rogermparent
Copy link
Contributor

https://dev.to/bnb/markdown-link-checking-in-github-with-actions-5dp0 - can be useful potentially

Great resource! Making our own link check is shockingly tough, so leaning on something like Linkinator would be better for everyone if we can do it.

@rogermparent
Copy link
Contributor

rogermparent commented Mar 10, 2021

I think it's worth noting that Check links with linkcheck is the most starred link check action on the marketplace by far, and it claims feats such as "It is 40 times faster than the only tool that goes to the same depth (linkchecker).".

Once we start work on this, I think it's worth following the stars and giving linkcheck the first attempt, but the retry option in linkinator looks appealing and doesn't seem to have a counterpart in linkcheck, so we should give that a trial to see if it offers advantages we want that are worth possible comparative drawbacks.

@jorgeorpinel jorgeorpinel added the dependencies Pull requests that update a dependency file (automatic) label May 6, 2021
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented May 6, 2021

This is the highest priority issue (somewhat) related to the website. I put it at the top in the project for now. Are there plans to work on it? Should we change the priority otherwise?

Or are there any other website/engine-related issues you have under your radar, e.g. maybe #1063, iterative/gatsby-theme-iterative#156 (would be my personal pick as I keep bumping with that), #2394, #1115, or anything else from here ? Cc @rogermparent @julieg18 thanks

@rogermparent
Copy link
Contributor

I don't think this has even been considered for a while, so I suppose it's not quite a p0.

Unless we decide to pick this up, it may stand to be demoted to p1- the link checker is considered important, so hopefully it doesn't fall by the wayside.

If someone were to pick the link checker back up and fix its core issues described in the OP, it may warrant a rewrite that leans more into the CLI interface and lets programs use that- using GHA Checks directly caused 90% of pain on the original run of that project.

@jorgeorpinel jorgeorpinel added p1-important Active priorities to deal within next sprints and removed p0-critical Affects users in a bad way at the moment labels May 7, 2021
@jorgeorpinel jorgeorpinel changed the title further improvements to the link checker improve link checker (possible rewrite) May 7, 2021
@iesahin
Copy link
Contributor

iesahin commented May 7, 2021 via email

@iesahin
Copy link
Contributor

iesahin commented May 7, 2021

It looks like the Rust library behind

https://github.com/lycheeverse/lychee-action

is modifiable even if the features don't cover our use cases.

@iesahin
Copy link
Contributor

iesahin commented May 7, 2021

I can also use the tool @rogermparent mentioned. It's a Dart library and though may not cover everything we need, I can use it in a custom library.

If you would like a link-checker from scratch, I can write it too.

@rogermparent
Copy link
Contributor

rogermparent commented May 7, 2021

@iesahin My plan of attack would probably be:

  1. Attempt a few already-made link checker actions (lychee-action and filiph/linkcheck). If one seems to perform well and reasonably cover our use case, just use it.
  2. If all viable solutions from step 1 still leave us wanting some extra feature, attempt to wrap and/or add the features we want to an existing library.
  3. As a last resort, write our custom logic around an existing link check library (like Lychee)
  4. No matter what, drop the responsibility of maintaining the actual link checking logic because the internet is a nightmare world of inconsistency.

While filiph/linkcheck has more stars than Lychee, I can see Lychee possibly having unique features from the old checker that we want because, as far as I understand, linkcheck parses built HTML and Lychee parses source files like Markdown.

@iesahin
Copy link
Contributor

iesahin commented May 7, 2021

These steps are similar to what I planned. @rogermparent

I'll test these 2 first and see if they are good enough. I think a thin wrapper around them may be needed at most.

@julieg18 julieg18 self-assigned this Jan 17, 2022
@julieg18
Copy link
Contributor

I started looking into other link checkers and it looks like lychee, by far, works the best. It has the most customization options and can scan several different types of files along with actual websites. As for how we should use it, I'll test out lychee-action on scanning the entire site and scanning pr differences. We may be able to use it or we'll need to write custom code that manipulates the inputs/outputs of the action.

🤔 Though @rogermparent, are we wanting to replace our custom link-check entirely or have it use something like lychee?

@rogermparent
Copy link
Contributor

are we wanting to replace our custom link-check entirely or have it use something like lychee?

My preference would be to use lychee as much as possible, but some more advanced features like running on git diffs and the whole "allow a link to fail once before reporting it" feature will probably require us to write some sort of wrapper extending lychee's features.

That said, we can stand to change/omit some aspects of the current link check that are harder to replicate- for example, instead of diff mode grabbing links directly from the diff we can just check all links in modified files, since IIRC lychee's language parsing only works on whole files and it's fast enough that the difference between solely added links and the rest of the page shouldn't be a problem.

@julieg18
Copy link
Contributor

julieg18 commented Mar 18, 2022

My preference would be to use lychee as much as possible, but some more advanced features like running on git diffs and the whole "allow a link to fail once before reporting it" feature will probably require us to write some sort of wrapper extending lychee's features.

That said, we can stand to change/omit some aspects of the current link check that are harder to replicate- for example, instead of diff mode grabbing links directly from the diff we can just check all links in modified files, since IIRC lychee's language parsing only works on whole files and it's fast enough that the difference between solely added links and the rest of the page shouldn't be a problem.

Makes sense that a wrapper would be needed! Though I'm inexperienced in this and I'm not sure what steps to take to create this "wrapper"...

First thing that comes to mind is running lychee inside of our link-check code. But is there a way to have our nodejs-based link-check run a command-line utility like lychee? I know we can run terminal commands like with exec but how would we install lychee in the first place 🤔

The other way I can think of is to use lychee-action and write steps before and after the action runs, creating the desired inputs and outputs within one github action file. But that sounds more complex than having link-check just use lychee...

Does anyone know of the best way to create this "wrapper"? cc @iterative/websites

@rogermparent
Copy link
Contributor

Does anyone know of the best way to create this "wrapper"?

The best wrapper would be no wrapper, followed by as minimal a wrapper as possible. Hopefully we'll be able to use lychee-action since it should be able to handle downloading lychee into the CI with any required setup. What we do around that depends on the feature in question.

A few examples:

If we want to run only on files added in the diff, I would think the ideal implementation would be a simple GitHub Actions workflow that gets a list of those added files via git diff --name-status and passes those to lychee-action.

For something like keeping track of consecutive fails, we could run lychee-action and just operate on its output file, using our code to simply keep a persistent count of consecutive fails and handles reporting when the amount gets unacceptable.

If we wanted to limit checked links to only those in the diff and ignore other links in modified files, we may be able to use lychee --dump to gather links without checking and pre-process them before passing the list back to lychee via lychee-action. I don't think this particular one would be necessary, but it's an option.

There's also the option of adding lychee to our build process instead of having it in a parallel Actions CI, the primary advantages being having the public folder accessible should we decide to check against build output rather than source files as well as not needing to deal with the deployment_status GHA event nor whatever is required for other build systems.

If we must drop down into a node script that executes lychee, you're correct that installing it could be an issue. From what I can tell, lychee-action actually installs lychee with a Docker image, so that's one option and probably the best way to go since lychee doesn't appear to be in the Ubuntu repos.

@julieg18
Copy link
Contributor

The best wrapper would be no wrapper, followed by as minimal a wrapper as possible. Hopefully we'll be able to use lychee-action since it should be able to handle downloading lychee into the CI with any required setup. What we do around that depends on the feature in question.

Thanks for the info and examples! I'll start working on a github action that works with lychee-action.

@julieg18
Copy link
Contributor

julieg18 commented Apr 5, 2022

Experimented with lychee-action! Looks like we'll be able to use it for checking prs (with the help of a github action like get-changed-files) and the entire repo!

But it has a couple bugs:

  • some github links fail due to 429 Too many requests errors. It also appears that we get some meetup and notion link failures due to this as well.
  • fake links (ones found in code examples for example) are checked, though I think our link-checker has the same issue

The main issue is the 429, and I've got a couple of ideas on how we could fix this:

  • By default lychee retries a link 3 times and waits 1 second before retrying. We could increase the retries and wait time to stop the 429s but this would increase the wait time of the action since it would take longer to check actually broken links. But lychee is pretty fast, the overall wait time still might not be too long 🤔
  • Take the output of lychee-action and retry the links ourselves with link-checker or some other package that checks links. This could save us some time, but may overcomplicate things since we'd have to be running two actions for checking links 🤔

@rogermparent
Copy link
Contributor

rogermparent commented Apr 5, 2022

  • some github links fail due to 429 Too many requests errors. It also appears that we get some meetup and notion link failures due to this as well.

It's unfortunate that Lychee seemingly doesn't have per-URL configuration; we got around this with the current link-check by limiting concurrency to 1 and rate-limiting to one GH check per second, but doing so with Lychee would require we either wrap it or limit the rate/concurrency of all links.

Looks like there's another workaround for GitHub specificially where you can use an access token to authenticate and ease the rate limiting greatly. This works for GitHub specifically, but thinking about this happening with other sites is a little worrying.

  • fake links (ones found in code examples for example) are checked, though I think our link-checker has the same issue

We only handle this with the current link checker by using a lot of exclusions, which should be able to translate into Lychee well enough

@julieg18
Copy link
Contributor

julieg18 commented Apr 5, 2022

Looks like there's another workaround for GitHub specificially where you can use an access token to authenticate and ease the rate limiting greatly. This works for GitHub specifically, but thinking about this happening with other sites is a little worrying.

We are using the GITHUB_TOKEN that is provided when it comes to gh actions, but would a different one change the outcome?

@rogermparent
Copy link
Contributor

We are using the GITHUB_TOKEN that is provided when it comes to gh actions, but would a different one change the outcome?

I didn't notice that, I see the lychee docs suggest using the GITHUB_TOKEN from Actions so I'd imagine it's supposed to work. Maybe a more proper token would give us more leeway but I doubt it. Might be worth a try if we get a bot token for the automatic publish Issue.

@rogermparent
Copy link
Contributor

I think a big improvement we can get is focusing on CLI, as we can go from there to any other reporting mechanism like GitHub Actions and simplify away the issue of having multiple build types for CLI and GHA as well as possible usage on other platforms.

@dberenbaum dberenbaum added p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. and removed p1-important Active priorities to deal within next sprints labels Feb 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file (automatic) p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. 🐛 type: bug Something isn't working.
Projects
None yet
Development

No branches or pull requests

6 participants