Labels: enhancement (New feature or request), future planning (Ideas or features proposed for future development)
Description
Make it trivial to use Pydoll inside Scrapy without custom glue code. The plugin should let a spider opt in per request to drive a headless tab, run small actions (clicks, waits), and return a rendered HtmlResponse that plays nicely with Scrapy selectors. It should feel like standard Scrapy, just powered by Pydoll when needed.
Proposed API

- Installable optional plugin:

```bash
pip install scrapy-pydoll
```

- Enable via settings:

```python
PYDOLL_ENABLED = True
PYDOLL_CONCURRENCY = 2
PYDOLL_BROWSER_OPTIONS = {"geolocation": "GB", "headless": True}
```

- Per-request opt-in (meta) or helper Request:
```python
yield scrapy.Request(
    url,
    meta={
        "pydoll": {
            "actions": [
                {"type": "wait", "for": "networkidle"},
                {"type": "click", "selector": "#show-more"},
            ],
            "timeout": 15000,
        },
        "cookiejar": "sessionA",
    },
    callback=self.parse_page,
)

# or
yield PydollRequest(url, actions=[...], timeout=15000)
```
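For the helper form, one plausible sketch: this assumes `PydollRequest` is plain sugar that packs its arguments into `meta["pydoll"]`, so both spellings hit the same code path (the class and the 30000 ms default are illustrative, not existing API):

```python
import scrapy


class PydollRequest(scrapy.Request):
    """Sugar over scrapy.Request: packs Pydoll options into meta["pydoll"]."""

    def __init__(self, url, actions=None, timeout=30000, **kwargs):
        meta = kwargs.pop("meta", None) or {}
        meta["pydoll"] = {"actions": actions or [], "timeout": timeout}
        super().__init__(url, meta=meta, **kwargs)
```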
Requirements (MVP)

- Deterministic rendered `HtmlResponse` compatible with `.css()` / `.xpath()`
- Wait strategies: `networkidle`, `selector`, `sleep(ms)`
- Small action set: `click`, `type`, `scroll`
- Per-request headers/cookies merged with the Pydoll context
- Session reuse keyed by `cookiejar`; graceful shutdown on `spider_closed`
- Timeouts and retries surfaced as `IgnoreRequest` or similar (see the sketch after this list)
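For the MVP, most of this could hang off a single downloader middleware. A minimal sketch, assuming a hypothetical `render_with_pydoll()` helper that stands in for the real tab-driving logic (the class name and helper are illustrative, not existing Pydoll API):

```python
import asyncio

from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse


async def render_with_pydoll(url, actions):
    """Hypothetical: open a Pydoll tab, run the declared actions, return HTML bytes."""
    raise NotImplementedError  # stands in for the real tab-driving logic


class PydollDownloaderMiddleware:
    async def process_request(self, request, spider):
        config = request.meta.get("pydoll")
        if config is None:
            return None  # not opted in: fall through to Scrapy's normal download

        try:
            body = await asyncio.wait_for(
                render_with_pydoll(request.url, config.get("actions", [])),
                timeout=config.get("timeout", 30000) / 1000,
            )
        except asyncio.TimeoutError:
            # Surfacing the timeout as IgnoreRequest routes the failure
            # to the request's errback.
            raise IgnoreRequest(f"Pydoll render timed out: {request.url}")

        # Returning a response from process_request short-circuits the
        # download, so Scrapy selectors see the rendered DOM.
        return HtmlResponse(
            request.url, body=body, encoding="utf-8", request=request
        )
```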
Follow-ups

- Optionally attach Markdown (`return_markdown=True`) once the exporter exists
- Network recording on error (integration with the recorder feature)
- Page bundle snapshot on exception for offline debugging
- WebPoet/scrapy-poet provider to inject a `Tab` or rendered HTML (rough shape sketched after this list)
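A rough shape for the provider follow-up, assuming scrapy-poet's documented provider interface (`PageObjectInputProvider` with `provided_classes`); the `PydollHtml` injectable is hypothetical, and it reuses the hypothetical `render_with_pydoll()` helper from the middleware sketch above:

```python
from dataclasses import dataclass

from scrapy import Request
from scrapy_poet.page_input_providers import PageObjectInputProvider


@dataclass
class PydollHtml:
    """Hypothetical injectable: the Pydoll-rendered HTML for a request."""
    url: str
    html: str


class PydollHtmlProvider(PageObjectInputProvider):
    provided_classes = {PydollHtml}

    async def __call__(self, to_provide, request: Request):
        # render_with_pydoll is the hypothetical helper sketched above.
        body = await render_with_pydoll(request.url, actions=[])
        return [PydollHtml(url=request.url, html=body.decode("utf-8"))]
```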
Example Spider

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products",
            meta={"pydoll": {
                "actions": [{"type": "wait", "for": "networkidle"}],
                "timeout": 15000,
            }},
            callback=self.parse_list,
        )

    def parse_list(self, response):
        for href in response.css(".item a::attr(href)").getall():
            yield scrapy.Request(
                response.urljoin(href),
                meta={"pydoll": {"actions": [{"type": "click", "selector": "#accept"}]}},
                callback=self.parse_item,
            )

    def parse_item(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
```
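Since the MVP proposes surfacing timeouts as `IgnoreRequest`, failures would be observable through Scrapy's standard errback hook. A minimal sketch:

```python
import scrapy
from scrapy.exceptions import IgnoreRequest


class ExampleSpiderWithErrback(scrapy.Spider):
    name = "example_errback"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products",
            meta={"pydoll": {"timeout": 15000}},
            callback=self.parse_list,
            errback=self.on_render_error,
        )

    def parse_list(self, response):
        self.logger.info("rendered %s", response.url)

    def on_render_error(self, failure):
        # Scrapy calls the errback when a downloader middleware raises
        # IgnoreRequest, so render timeouts are observable per request.
        if failure.check(IgnoreRequest):
            self.logger.warning("Pydoll render failed: %s", failure.request.url)
```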