Feature: Scrapy plugin for Pydoll (scrapy-pydoll) #248

@thalissonvs

Description

Make it trivial to use Pydoll inside Scrapy without custom glue code. The plugin should let a spider opt-in per request to drive a headless tab, run small actions (clicks, waits), and return a rendered HtmlResponse that plays nicely with Scrapy selectors. It should feel like standard Scrapy, just powered by Pydoll when needed.

Proposed API

  • Installable optional plugin: pip install scrapy-pydoll
  • Enable via settings:
PYDOLL_ENABLED = True
PYDOLL_CONCURRENCY = 2
PYDOLL_BROWSER_OPTIONS = { "geolocation": "GB", "headless": True }
  • Per-request opt-in (meta) or helper Request:
yield scrapy.Request(
    url,
    meta={
        "pydoll": {
            "actions": [
                {"type": "wait", "for": "networkidle"},
                {"type": "click", "selector": "#show-more"},
            ],
            "timeout": 15000,
        },
        "cookiejar": "sessionA",
    },
    callback=self.parse_page,
)

# or
yield PydollRequest(url, actions=[...], timeout=15000)
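The PydollRequest helper could be a thin subclass that folds its keyword arguments into the same meta dict the plain scrapy.Request form uses, so both entry points hit one code path. A minimal sketch of that packing step (function and argument names here are assumptions, not a final API):

```python
# Hypothetical helper showing how PydollRequest(url, actions=..., timeout=...)
# could map onto the plain scrapy.Request form. A real PydollRequest subclass
# would call this and pass the result to scrapy.Request.__init__.

def build_pydoll_meta(actions=None, timeout=None, cookiejar=None):
    """Pack Pydoll options into a Scrapy request meta dict."""
    pydoll = {}
    if actions:
        pydoll["actions"] = list(actions)  # copy so callers can reuse the list
    if timeout is not None:
        pydoll["timeout"] = timeout
    meta = {"pydoll": pydoll}
    if cookiejar is not None:
        meta["cookiejar"] = cookiejar  # session reuse key, as in the example above
    return meta
```

With this in place, `yield scrapy.Request(url, meta=build_pydoll_meta(actions=[...], timeout=15000), callback=...)` and the PydollRequest form stay equivalent by construction.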

Requirements (MVP)

  • Deterministic rendered HtmlResponse compatible with .css() / .xpath()
  • Wait strategies: networkidle, selector, sleep(ms)
  • Small action set: click, type, scroll
  • Per-request headers/cookies merged with Pydoll context
  • Session reuse by cookiejar; graceful shutdown on spider_closed
  • Timeouts and retry exhaustion surfaced as IgnoreRequest (or a similar Scrapy exception)
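Because the MVP action set is small and closed, the plugin could validate actions up front so a malformed action fails fast at scheduling time instead of mid-render. A sketch of one possible schema check (the required keys per action type are inferred from the examples in this issue, not from a published spec):

```python
# Hypothetical validation for the MVP action set (wait, click, type, scroll).
# Required keys are assumptions based on the action examples above.
REQUIRED_KEYS = {
    "wait": {"for"},              # e.g. {"type": "wait", "for": "networkidle"}
    "click": {"selector"},        # e.g. {"type": "click", "selector": "#show-more"}
    "type": {"selector", "text"},
    "scroll": set(),              # scroll could default to scrolling to the bottom
}

def validate_actions(actions):
    """Return the actions unchanged, or raise ValueError on the first bad one."""
    for i, action in enumerate(actions):
        kind = action.get("type")
        if kind not in REQUIRED_KEYS:
            raise ValueError(f"action {i}: unknown type {kind!r}")
        missing = REQUIRED_KEYS[kind] - action.keys()
        if missing:
            raise ValueError(f"action {i}: missing keys {sorted(missing)}")
    return actions
```

Running this inside the downloader handler (or in PydollRequest.__init__) would surface bad spider code as an immediate ValueError rather than a confusing timeout.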

Follow-ups

  • Optionally attach Markdown (return_markdown=True) once the exporter exists
  • Network record on error (integration with recorder feature)
  • Page bundle snapshot on exception for offline debugging
  • WebPoet/Scrapy-Poet provider to inject a Tab or rendered HTML

Example Spider

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products",
            meta={"pydoll": {
                "actions": [{"type": "wait", "for": "networkidle"}],
                "timeout": 15000
            }},
            callback=self.parse_list
        )

    def parse_list(self, response):
        for href in response.css(".item a::attr(href)").getall():
            yield scrapy.Request(
                response.urljoin(href),
                meta={"pydoll": {"actions": [{"type": "click", "selector": "#accept"}]}},
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }

Metadata

Labels

enhancement (New feature or request), future planning (Ideas or features proposed for future development)