
[Feature]: Resume Stopped Crawl #1753

Open
Shrinks99 opened this issue Apr 27, 2024 · 1 comment
Assignees: Shrinks99
Labels: back end (Requires back end dev work), front end (Requires front end dev work)

Shrinks99 (Member) commented Apr 27, 2024

Context

Currently, once a crawl is stopped, that's it! Users cannot pick up where they left off, which results in a few points of friction:

Crawling Large Websites

When crawling a large website (>20,000 pages), users are limited to the first n pages depending on their plan. If the crawler finds more pages than that, there is pretty much no way to capture them, as the workflow settings dictate that the crawl must start from the seed URL. In theory one might be able to add every already-crawled URL as an exclusion; in practice this would be ridiculous.

Picking up Next Month

Our customers have a set number of execution minutes per month, and while running into the execution minute limit might suggest purchasing additional time, simply waiting until the next month is just as valid an option.

Requirements

  • For Stopped crawls, give users the option to "Resume"
    • This will resume crawling and inherit the crawl queue of the stopped archived item.
    • It will create a new archived item in the workflow containing the newly crawled content.
    • It will not capture any of the pages from previous related stopped items (related to [Feature]: Only Archive New URLs #1372).
      • If the first item was stopped and resumed, and the resulting second crawl was also stopped and resumed, the third crawl should not capture any pages from the first or second crawls (see the sketch after this list).
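
As a rough illustration of that last bullet (not part of the original request), here is a minimal Python sketch of how a resume could walk the chain of stopped items, union their captured URLs into a "seen" set, and filter the inherited queue so no earlier pages are re-captured. All names here (ArchivedItem, resumed_from, captured_urls, pending_queue) are hypothetical and do not correspond to Browsertrix APIs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ArchivedItem:
    """One archived item (crawl) in a workflow; all field names are hypothetical."""
    item_id: str
    captured_urls: set[str] = field(default_factory=set)    # pages captured before the stop
    pending_queue: list[str] = field(default_factory=list)  # URLs still queued when stopped
    resumed_from: str | None = None                         # id of the item this crawl resumed


def resume_chain(items: dict[str, ArchivedItem], stopped_id: str) -> tuple[list[str], set[str]]:
    """Walk the resume chain backwards from the stopped item.

    Returns the queue to resume from (inherited from the most recently stopped item,
    minus anything already captured) and the union of URLs captured by every item in
    the chain, which the new crawl should treat as already seen and skip.
    """
    seen: set[str] = set()
    current: str | None = stopped_id
    while current is not None:
        item = items[current]
        seen |= item.captured_urls
        current = item.resumed_from
    queue = [url for url in items[stopped_id].pending_queue if url not in seen]
    return queue, seen


if __name__ == "__main__":
    items = {
        "crawl-1": ArchivedItem("crawl-1", {"https://example.com/", "https://example.com/a"},
                                ["https://example.com/b"]),
        "crawl-2": ArchivedItem("crawl-2", {"https://example.com/b"},
                                ["https://example.com/c", "https://example.com/a"],
                                resumed_from="crawl-1"),
    }
    queue, seen = resume_chain(items, "crawl-2")
    # The third crawl starts from /c only; /a was captured by crawl-1, /b by crawl-2.
    print(queue)         # ['https://example.com/c']
    print(sorted(seen))
```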
Shrinks99 added the front end (Requires front end dev work) and back end (Requires back end dev work) labels on Apr 27, 2024
Shrinks99 self-assigned this on Apr 27, 2024
dla-kramski commented
I would like to express my strong support for this feature request.

Our current use case is a literary forum with more than 500,000 articles, i.e., significantly more than 50,000 web pages. At the moment, I can only imagine switching to the highest-priced plan for the months in which we need such crawls.

Doing a full crawl every time is a huge waste of resources.

Projects: Status: Todo
Development: No branches or pull requests
2 participants