
[Feature]: Resume Stopped Crawl #1753

Open
Shrinks99 opened this issue Apr 27, 2024 · 1 comment
Assignees: Shrinks99
Labels: back end (Requires back end dev work), front end (Requires front end dev work)

Shrinks99 (Member) commented Apr 27, 2024

Context

Currently, once a crawl is stopped, that's it! Users cannot pick up where they left off, which results in a few points of friction:

Crawling Large Websites

When crawling a large website (>20,000 pages), users are limited to the first n pages depending on their plan. If the crawler finds more pages than that, there is pretty much no way to capture them, as the workflow settings dictate that the crawl must start from the seed URL. In theory one might be able to add every already-crawled URL as an exclusion; in practice this would be ridiculous.

Picking up Next Month

Our customers have a set number of execution minutes per month, and while running into the execution minute limit might suggest purchasing additional time, simply waiting until the next month is just as valid an option.

Requirements

  • For Stopped crawls, give users the option to "Resume"
    • This will resume crawling and inherit the crawl queue of the stopped archived item.
    • It will create a new archived item in the workflow containing the newly crawled content.
    • It will not capture any of the pages from previous related stopped items (related to [Feature]: Only Archive New URLs #1372).
      • If the first item was stopped and resumed, and the resulting second crawl was also stopped and resumed, the third crawl should not capture any pages from the first or second crawls (see the sketch after this list).
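
As a rough illustration of that last bullet (not part of the original request), here is a minimal Python sketch of how a resume could walk the chain of stopped items, union their captured URLs into a "seen" set, and filter the inherited queue so no earlier pages are re-captured. All names here (ArchivedItem, resumed_from, captured_urls, pending_queue) are hypothetical and do not correspond to Browsertrix APIs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ArchivedItem:
    """One archived item (crawl) in a workflow; all field names are hypothetical."""
    item_id: str
    captured_urls: set[str] = field(default_factory=set)    # pages captured before the stop
    pending_queue: list[str] = field(default_factory=list)  # URLs still queued when stopped
    resumed_from: str | None = None                         # id of the item this crawl resumed


def resume_chain(items: dict[str, ArchivedItem], stopped_id: str) -> tuple[list[str], set[str]]:
    """Walk the resume chain backwards from the stopped item.

    Returns the queue to resume from (inherited from the most recently stopped item,
    minus anything already captured) and the union of URLs captured by every item in
    the chain, which the new crawl should treat as already seen and skip.
    """
    seen: set[str] = set()
    current: str | None = stopped_id
    while current is not None:
        item = items[current]
        seen |= item.captured_urls
        current = item.resumed_from
    queue = [url for url in items[stopped_id].pending_queue if url not in seen]
    return queue, seen


if __name__ == "__main__":
    items = {
        "crawl-1": ArchivedItem("crawl-1", {"https://example.com/", "https://example.com/a"},
                                ["https://example.com/b"]),
        "crawl-2": ArchivedItem("crawl-2", {"https://example.com/b"},
                                ["https://example.com/c", "https://example.com/a"],
                                resumed_from="crawl-1"),
    }
    queue, seen = resume_chain(items, "crawl-2")
    # The third crawl starts from /c only; /a was captured by crawl-1, /b by crawl-2.
    print(queue)         # ['https://example.com/c']
    print(sorted(seen))
```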
Shrinks99 added the front end (Requires front end dev work) and back end (Requires back end dev work) labels on Apr 27, 2024
Shrinks99 self-assigned this on Apr 27, 2024
dla-kramski commented
I would like to express my strong support for this feature request.

Our current use case is a literary forum with more than 500,000 articles, i.e., significantly more than 50,000 web pages. At the moment, I can only imagine switching to the highest-priced plan for the months in which we need such crawls.

Doing a full crawl every time is a huge waste of resources.

Projects: Status: Todo
Development: No branches or pull requests
2 participants