Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP]Cookiejars exposed #6218

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

GeorgeA92
Copy link
Contributor

Aimed to fix #1878
based on suggestion from #1878 (comment)

Copy link

codecov bot commented Feb 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.67%. Comparing base (6f73dc0) to head (b8f8960).
Report is 258 commits behind head on master.

Current head b8f8960 differs from pull request most recent head d8fb9d2

Please upload reports for the commit d8fb9d2 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6218      +/-   ##
==========================================
+ Coverage   88.48%   88.67%   +0.18%     
==========================================
  Files         160      161       +1     
  Lines       11607    11792     +185     
  Branches     1883     1912      +29     
==========================================
+ Hits        10271    10457     +186     
+ Misses       1009     1007       -2     
- Partials      327      328       +1     
Files Coverage Δ
scrapy/downloadermiddlewares/cookies.py 96.33% <100.00%> (+0.13%) ⬆️

... and 14 files with indirect coverage changes

@wRAR
Copy link
Member

wRAR commented Feb 9, 2024

I haven't read the original ticket recently, but why is this feature optional?

@GeorgeA92
Copy link
Contributor Author

GeorgeA92 commented Feb 23, 2024

This is how it works on current version of PR

script_sample.py
import scrapy
from scrapy.crawler import CrawlerProcess

class Quotes(scrapy.Spider):
    name = "quotes"; custom_settings = {"DOWNLOAD_DELAY": 1}

    def start_requests(self):
        yield scrapy.Request(url='https://quotes.toscrape.com/login', callback=self.login)

    def login(self, response):
        self.logger.info(self.cookie_jars[None]) # scrapy.http.cookies.CookieJar object
        self.logger.info(self.cookie_jars[None].jar) # http.cookiejar object

        locale_cookie = self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")
        locale_cookie.value = locale_cookie.value.upper()
        self.logger.info(self.cookie_jars[None].jar)

if __name__ == "__main__":
    p = CrawlerProcess(); p.crawl(Quotes); p.start()
log_output (fragment)
2024-02-23 10:51:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/login> (referer: None)
2024-02-23 10:51:27 [quotes] INFO: <scrapy.http.cookies.CookieJar object at 0x00000217DB719B40>
2024-02-23 10:51:27 [quotes] INFO: <CookieJar[<Cookie session=eyJjc3JmX3Rva2VuIjoiSnFQQU9GTGt1amZzZ3J3UVdHeGV6WFR2UnBpY0Job1NWS3liWmxhblVISXROREVtQ2RZTSJ9.Zdhqng.8uQzjuvDfOcNJHV7luY5Na6C1N0 for quotes.toscrape.com/>]>
2024-02-23 10:51:27 [quotes] INFO: <CookieJar[<Cookie session=EYJJC3JMX3RVA2VUIJOISNFQQU9GTGT1AMZZZ3J3UVDHEGV6WFR2UNBPY0JOB1NWS3LIWMXHBLVISXROREVTQ2RZTSJ9.ZDHQNG.8UQZJUVDFOCNJHV7LUY5NA6C1N0 for quotes.toscrape.com/>]>
2024-02-23 10:51:27 [scrapy.core.engine] INFO: Closing spider (finished)

@Gallaecio Gallaecio requested a review from kmike February 23, 2024 17:14
@Gallaecio
Copy link
Member

Gallaecio commented Feb 23, 2024

I‘m slightly hesitant about setting a spider attribute from a middleware, and I wonder if maybe it should be set from a different place or in a different place (e.g. the crawler), but in general I’m fine with the approach.

@kmike Any thoughts on the general approach? Should @GeorgeA92 go on with tests and docs?

@kmike
Copy link
Member

kmike commented Mar 17, 2024

Hey! My main worry is the obscure API, which we'd need to document & support in the future. It'd require good documentation to explain a line like

self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")

It also need an access to a private property (._cookies).

@GeorgeA92
Copy link
Contributor Author

Hey! My main worry is the obscure API, which we'd need to document & support in the future. It'd require good documentation to explain a line like

self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")

It also need an access to a private property (._cookies).

Another option is to update scrapy.http.Cookies.CookieJar class to add.. more convenient way to interact with Cookiejar

class CookieJar:
def __init__(self, policy=None, check_expired_frequency=10000):
self.policy = policy or DefaultCookiePolicy()
self.jar = _CookieJar(self.policy)
self.jar._cookies_lock = _DummyLock()
self.check_expired_frequency = check_expired_frequency
self.processed = 0
def extract_cookies(self, response, request):
wreq = WrappedRequest(request)
wrsp = WrappedResponse(response)
return self.jar.extract_cookies(wrsp, wreq)

@GeorgeA92 GeorgeA92 changed the title Cookiejars exposed [WIP]Cookiejars exposed Jun 7, 2024
@@ -307,7 +307,7 @@ Does Scrapy manage cookies automatically?
Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them
back on subsequent requests, like any regular web browser does.

For more info see :ref:`topics-request-response` and :ref:`cookies-mw`.
For more info see :ref:`topics-request-response` , :ref:`cookies-mw` and :ref:`cookiejars`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For more info see :ref:`topics-request-response` , :ref:`cookies-mw` and :ref:`cookiejars`.
For more info see :ref:`topics-request-response`, :ref:`cookies-mw` and :ref:`cookiejars`.

@@ -293,6 +293,52 @@ Here's an example of a log with :setting:`COOKIES_DEBUG` enabled::
[...]


.. _cookiejars:

Direct access to cookiejars from spider
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this section could go before COOKIES_ENABLED?


Direct access to cookiejars from spider
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On some cases it is required to directly set specific values to existing cookiejar
Copy link
Member

@Gallaecio Gallaecio Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
On some cases it is required to directly set specific values to existing cookiejar
In some cases it is required to directly set specific values in an existing
cookiejar.

Or maybe something like:

Suggested change
On some cases it is required to directly set specific values to existing cookiejar
To manually set cookies in a cookie jar:

@Gallaecio
Copy link
Member

@GeorgeA92 @kmike @wRAR What about an API like this?

# cookie middleware code

from scrapy import Request

from http.cookies imort SimpleCookie


_UNSET = object()


# We define a Cookie class that we can extend in the future, based on the
# Python API for server-side cookie handling.
class Cookie(SimpleCookie):
    pass
    

# We define functions, in line with the get_retry_request approach, that can be 
# used to easily interact with the cookies of a request. We extract the cookie
# jar ID and domain from the request, the user indicates the key and optionally
# the path.
    
def get_cookie(request: Request, key: str, path="/") -> Cookie:
    ...
    

def pop_cookie(request: Request, key: str, path="/", default=_UNSET) -> Cookie:
    ...
    

def set_cookie(request: Request, cookie: Cookie) -> None:
    ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose cookiejars
4 participants