Support incremental crawling #749
Thanks for opening this issue. I don't remember the specifics, but if
Thank you for your reply. I have used this switch (-resume), but it only works for resuming interrupted crawls. When I modify my urls.txt file, Katana is not able to perform incremental crawling.
@xqbumu, @Mzack9999,
This is definitely an interesting feature, but I'm not sure it can be fully applied to the crawling process. While it's easy to mimic by avoiding overwriting existing files, skipping already-crawled branches requires more thought: it can't simply be based on the existence of the file, since, for example, the crawl would end at the very beginning because the root branch already exists. Maybe a better strategy can be adopted, for example:
What do you think?
Thank you for your response. My initial expectation was to be able to continue crawling the remaining links after an interruption, but the re-crawling strategy you describe here would go further and strengthen the ability to resume crawling. As for the re-crawling strategy, in addition to considering the depth of the links, it could also decide based on the modification time of the crawled files, since it is easier to determine whether data has been updated from a timestamp. This is just my personal opinion, and I welcome your guidance.
Let's begin with this idea and then gradually develop it further. What do you say?
Please describe your feature request:
With the current logic, setting -srd in Katana lets you save the crawled content. However, when running it a second time, the content in that directory is cleared. I hope incremental crawling can be supported, which means:
Describe the use case of this feature:
Replacing the -nc (no-clobber) option in wget; see its use cases.
Refer: https://github.com/projectdiscovery/katana/blob/main/pkg/output/output.go#L120