[WIP]Cookiejars exposed #6218

GeorgeA92 · 2024-02-09T12:16:39Z

Aimed to fix #1878
based on suggestion from #1878 (comment)

codecov · 2024-02-09T12:18:31Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.67%. Comparing base (6f73dc0) to head (b8f8960).
Report is 258 commits behind head on master.

❗ Current head b8f8960 differs from pull request most recent head d8fb9d2

Please upload reports for the commit d8fb9d2 to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6218      +/-   ##
==========================================
+ Coverage   88.48%   88.67%   +0.18%     
==========================================
  Files         160      161       +1     
  Lines       11607    11792     +185     
  Branches     1883     1912      +29     
==========================================
+ Hits        10271    10457     +186     
+ Misses       1009     1007       -2     
- Partials      327      328       +1

Files	Coverage Δ
scrapy/downloadermiddlewares/cookies.py	`96.33% <100.00%> (+0.13%)`	⬆️

... and 14 files with indirect coverage changes

wRAR · 2024-02-09T12:36:27Z

I haven't read the original ticket recently, but why is this feature optional?

scrapy/downloadermiddlewares/cookies.py

GeorgeA92 · 2024-02-23T10:28:20Z

This is how it works on current version of PR

script_sample.py

import scrapy
from scrapy.crawler import CrawlerProcess

class Quotes(scrapy.Spider):
    name = "quotes"; custom_settings = {"DOWNLOAD_DELAY": 1}

    def start_requests(self):
        yield scrapy.Request(url='https://quotes.toscrape.com/login', callback=self.login)

    def login(self, response):
        self.logger.info(self.cookie_jars[None]) # scrapy.http.cookies.CookieJar object
        self.logger.info(self.cookie_jars[None].jar) # http.cookiejar object

        locale_cookie = self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")
        locale_cookie.value = locale_cookie.value.upper()
        self.logger.info(self.cookie_jars[None].jar)

if __name__ == "__main__":
    p = CrawlerProcess(); p.crawl(Quotes); p.start()

log_output (fragment)

2024-02-23 10:51:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/login> (referer: None)
2024-02-23 10:51:27 [quotes] INFO: <scrapy.http.cookies.CookieJar object at 0x00000217DB719B40>
2024-02-23 10:51:27 [quotes] INFO: <CookieJar[<Cookie session=eyJjc3JmX3Rva2VuIjoiSnFQQU9GTGt1amZzZ3J3UVdHeGV6WFR2UnBpY0Job1NWS3liWmxhblVISXROREVtQ2RZTSJ9.Zdhqng.8uQzjuvDfOcNJHV7luY5Na6C1N0 for quotes.toscrape.com/>]>
2024-02-23 10:51:27 [quotes] INFO: <CookieJar[<Cookie session=EYJJC3JMX3RVA2VUIJOISNFQQU9GTGT1AMZZZ3J3UVDHEGV6WFR2UNBPY0JOB1NWS3LIWMXHBLVISXROREVTQ2RZTSJ9.ZDHQNG.8UQZJUVDFOCNJHV7LUY5NA6C1N0 for quotes.toscrape.com/>]>
2024-02-23 10:51:27 [scrapy.core.engine] INFO: Closing spider (finished)

Gallaecio · 2024-02-23T17:17:01Z

I‘m slightly hesitant about setting a spider attribute from a middleware, and I wonder if maybe it should be set from a different place or in a different place (e.g. the crawler), but in general I’m fine with the approach.

@kmike Any thoughts on the general approach? Should @GeorgeA92 go on with tests and docs?

kmike · 2024-03-17T17:22:24Z

Hey! My main worry is the obscure API, which we'd need to document & support in the future. It'd require good documentation to explain a line like

self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")

It also need an access to a private property (._cookies).

GeorgeA92 · 2024-04-12T16:36:21Z

Hey! My main worry is the obscure API, which we'd need to document & support in the future. It'd require good documentation to explain a line like

self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")

It also need an access to a private property (._cookies).

Another option is to update scrapy.http.Cookies.CookieJar class to add.. more convenient way to interact with Cookiejar

scrapy/scrapy/http/cookies.py

Lines 18 to 30 in 1d11ea3

    
           class CookieJar: 
        
               def __init__(self, policy=None, check_expired_frequency=10000): 
        
                   self.policy = policy or DefaultCookiePolicy() 
        
                   self.jar = _CookieJar(self.policy) 
        
                   self.jar._cookies_lock = _DummyLock() 
        
                   self.check_expired_frequency = check_expired_frequency 
        
                   self.processed = 0 
        
               def extract_cookies(self, response, request): 
        
                   wreq = WrappedRequest(request) 
        
                   wrsp = WrappedResponse(response) 
        
                   return self.jar.extract_cookies(wrsp, wreq)

Gallaecio · 2024-06-11T06:56:06Z

docs/faq.rst

@@ -307,7 +307,7 @@ Does Scrapy manage cookies automatically?
 Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them
 back on subsequent requests, like any regular web browser does.

-For more info see :ref:`topics-request-response` and :ref:`cookies-mw`.
+For more info see :ref:`topics-request-response` , :ref:`cookies-mw` and :ref:`cookiejars`.


Suggested change

For more info see :ref:`topics-request-response` , :ref:`cookies-mw` and :ref:`cookiejars`.

For more info see :ref:`topics-request-response`, :ref:`cookies-mw` and :ref:`cookiejars`.

Gallaecio · 2024-06-11T06:57:14Z

docs/topics/downloader-middleware.rst

@@ -293,6 +293,52 @@ Here's an example of a log with :setting:`COOKIES_DEBUG` enabled::
    [...]


+.. _cookiejars:
+
+Direct access to cookiejars from spider


Maybe this section could go before COOKIES_ENABLED?

Gallaecio · 2024-06-11T06:58:45Z

docs/topics/downloader-middleware.rst

+
+Direct access to cookiejars from spider
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+On some cases it is required to directly set specific values to existing cookiejar


Suggested change

On some cases it is required to directly set specific values to existing cookiejar

In some cases it is required to directly set specific values in an existing

cookiejar.

Or maybe something like:

Suggested change

On some cases it is required to directly set specific values to existing cookiejar

To manually set cookies in a cookie jar:

Gallaecio · 2024-06-11T08:00:41Z

@GeorgeA92 @kmike @wRAR What about an API like this?

# cookie middleware code

from scrapy import Request

from http.cookies imort SimpleCookie


_UNSET = object()


# We define a Cookie class that we can extend in the future, based on the
# Python API for server-side cookie handling.
class Cookie(SimpleCookie):
    pass
    

# We define functions, in line with the get_retry_request approach, that can be 
# used to easily interact with the cookies of a request. We extract the cookie
# jar ID and domain from the request, the user indicates the key and optionally
# the path.
    
def get_cookie(request: Request, key: str, path="/") -> Cookie:
    ...
    

def pop_cookie(request: Request, key: str, path="/", default=_UNSET) -> Cookie:
    ...
    

def set_cookie(request: Request, cookie: Cookie) -> None:
    ...

cookiejars exposed

fe2979e

Gallaecio reviewed Feb 9, 2024

View reviewed changes

scrapy/downloadermiddlewares/cookies.py Outdated Show resolved Hide resolved

cookiejars exposed

b8f8960

Gallaecio requested a review from kmike February 23, 2024 17:14

Cj-Malone mentioned this pull request Mar 21, 2024

Add Anti-Bot Detection middleware alltheplaces/alltheplaces#7349

Open

GeorgeA92 changed the title ~~Cookiejars exposed~~ [WIP]Cookiejars exposed Jun 7, 2024

Georgiy Zatserklianyi added 2 commits June 7, 2024 14:48

cookiejars exposed, docs added

11022ac

cookiejars exposed, docs added

d8fb9d2

Gallaecio reviewed Jun 11, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP]Cookiejars exposed #6218

[WIP]Cookiejars exposed #6218

GeorgeA92 commented Feb 9, 2024

codecov bot commented Feb 9, 2024 •

edited

wRAR commented Feb 9, 2024

GeorgeA92 commented Feb 23, 2024 •

edited

Gallaecio commented Feb 23, 2024 •

edited

kmike commented Mar 17, 2024

GeorgeA92 commented Apr 12, 2024

Gallaecio Jun 11, 2024

Gallaecio Jun 11, 2024

Gallaecio Jun 11, 2024 •

edited

Gallaecio commented Jun 11, 2024

	For more info see :ref:`topics-request-response` , :ref:`cookies-mw` and :ref:`cookiejars`.
	For more info see :ref:`topics-request-response`, :ref:`cookies-mw` and :ref:`cookiejars`.

-On some cases it is required to directly set specific values to existing cookiejar
+In some cases it is required to directly set specific values in an existing
+cookiejar.

	On some cases it is required to directly set specific values to existing cookiejar

	To manually set cookies in a cookie jar:

[WIP]Cookiejars exposed #6218

Are you sure you want to change the base?

[WIP]Cookiejars exposed #6218

Conversation

GeorgeA92 commented Feb 9, 2024

codecov bot commented Feb 9, 2024 • edited

Codecov Report

wRAR commented Feb 9, 2024

GeorgeA92 commented Feb 23, 2024 • edited

Gallaecio commented Feb 23, 2024 • edited

kmike commented Mar 17, 2024

GeorgeA92 commented Apr 12, 2024

Gallaecio Jun 11, 2024

Choose a reason for hiding this comment

Gallaecio Jun 11, 2024

Choose a reason for hiding this comment

Gallaecio Jun 11, 2024 • edited

Choose a reason for hiding this comment

Gallaecio commented Jun 11, 2024

codecov bot commented Feb 9, 2024 •

edited

GeorgeA92 commented Feb 23, 2024 •

edited

Gallaecio commented Feb 23, 2024 •

edited

Gallaecio Jun 11, 2024 •

edited