Support incremental crawling #749
Thanks for opening this issue. I don't remember the specifics, but if
Thank you for your reply. I have used this switch (-resume), but it only works for resuming interrupted crawls. When I modify my urls.txt file, Katana is not able to perform incremental crawling.
@xqbumu, @Mzack9999,
This is definitely an interesting feature, but I'm not sure it can be fully applied to the crawling process. While it's easy to mimic by avoiding overwriting existing files, skipping already-crawled branches requires more thought: it can't simply be based on the existence of the file, since, for example, the crawl would end at the very beginning because the root branch already exists. Maybe a better strategy can be adopted, for example:
What do you think?
Thank you for your response. My initial expectation was to be able to continue crawling the remaining links after an interruption, but the re-crawling strategy you describe here would go further and strengthen the ability to resume crawling. As for the re-crawling strategy, in addition to considering the depth of the links, it could also decide based on the modification time of the crawled files, since it is easier to determine whether data has been updated from a timestamp. This is just my personal opinion, and I welcome your guidance.
Let's begin with this idea and then gradually develop it further. What do you say?
Please describe your feature request:
With the current logic, setting -srd in Katana lets you save the crawled content. However, when running it a second time, the content in that directory is cleared. I hope incremental crawling can be supported, which means:
Describe the use case of this feature:
Replacing the -nc (no-clobber) option in wget; see its use cases.
Refer: https://github.com/projectdiscovery/katana/blob/main/pkg/output/output.go#L120