Commit b13fc13

docs: Update SitemapRequestLoader documentation (#1520)
- based on #1516
1 parent 138cf82 commit b13fc13

3 files changed: 14 additions, 3 deletions

docs/guides/architecture_overview.mdx

Lines changed: 1 addition & 1 deletion
@@ -291,7 +291,7 @@ Request loaders provide a subset of <ApiLink to="class/RequestQueue">`RequestQue
 
 - <ApiLink to="class/RequestLoader">`RequestLoader`</ApiLink> - Base interface for read-only access to a stream of requests, with capabilities like fetching the next request, marking as handled, and status checking.
 - <ApiLink to="class/RequestList">`RequestList`</ApiLink> - Lightweight in-memory implementation of `RequestLoader` for managing static lists of URLs.
-- <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> - Specialized loader for reading URLs from XML sitemaps with filtering capabilities.
+- <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> - A specialized loader that reads URLs from XML and plain-text sitemaps following the [Sitemaps protocol](https://www.sitemaps.org/protocol.html) with filtering capabilities.
 
 ### Request managers

docs/guides/request_loaders.mdx

Lines changed: 8 additions & 2 deletions
@@ -31,7 +31,7 @@ The [`request_loaders`](https://github.com/apify/crawlee-python/tree/master/src/
 And specific request loader implementations:
 
 - <ApiLink to="class/RequestList">`RequestList`</ApiLink>: A lightweight implementation for managing a static list of URLs.
-- <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink>: A specialized loader that reads URLs from XML sitemaps with filtering capabilities.
+- <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink>: A specialized loader that reads URLs from XML and plain-text sitemaps following the [Sitemaps protocol](https://www.sitemaps.org/protocol.html) with filtering capabilities.
 
 Below is a class diagram that illustrates the relationships between these components and the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>:
 
@@ -130,7 +130,13 @@ To enable persistence, provide `persist_state_key` and optionally `persist_reque
 
 ### Sitemap request loader
 
-The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> is a specialized request loader that reads URLs from XML sitemaps. It's particularly useful when you want to crawl a website systematically by following its sitemap structure. The loader supports filtering URLs using glob patterns and regular expressions, allowing you to include or exclude specific types of URLs. The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> provides streaming processing of sitemaps, ensuring efficient memory usage without loading the entire sitemap into memory.
+The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> is a specialized request loader that reads URLs from sitemaps following the [Sitemaps protocol](https://www.sitemaps.org/protocol.html). It supports both XML and plain text sitemap formats. It's particularly useful when you want to crawl a website systematically by following its sitemap structure.
+
+:::note
+The `SitemapRequestLoader` is designed specifically for sitemaps that follow the standard Sitemaps protocol. HTML pages containing links are not supported by this loader - those should be handled by regular crawlers using the `enqueue_links` functionality.
+:::
+
+The loader supports filtering URLs using glob patterns and regular expressions, allowing you to include or exclude specific types of URLs. The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> provides streaming processing of sitemaps, ensuring efficient memory usage without loading the entire sitemap into memory.
 
 <RunnableCodeBlock className="language-python" language="python">
 {SitemapExample}

src/crawlee/request_loaders/_sitemap_request_loader.py

Lines changed: 5 additions & 0 deletions
@@ -90,6 +90,11 @@ class SitemapRequestLoaderState(BaseModel):
 class SitemapRequestLoader(RequestLoader):
     """A request loader that reads URLs from sitemap(s).
 
+    The loader is designed to handle sitemaps that follow the format described in the Sitemaps protocol
+    (https://www.sitemaps.org/protocol.html). It supports both XML and plain text sitemap formats.
+    Note that HTML pages containing links are not supported - those should be handled by regular crawlers
+    and the `enqueue_links` functionality.
+
     The loader fetches and parses sitemaps in the background, allowing crawling to start
     before all URLs are loaded. It supports filtering URLs using glob and regex patterns.
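
Note on the two formats the updated docstring names: the sketch below illustrates, with the standard library only, how an XML sitemap and a plain-text sitemap differ and how glob and regex filters could be applied to the extracted URLs. It is not the loader's implementation. `extract_urls` and `is_allowed` are hypothetical helper names, the example URLs are made up, and `fnmatch` glob semantics are an assumption that may differ from the loader's own glob handling.

```python
import fnmatch
import re
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol for <urlset> and <sitemapindex>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_urls(content: str) -> list[str]:
    """Return the URLs listed in an XML or plain-text sitemap body.

    Hypothetical helper for illustration; it loads the whole body into
    memory, unlike the streaming approach the loader's docs describe.
    """
    stripped = content.lstrip()
    if stripped.startswith("<"):
        # XML sitemap: every URL sits in a namespaced <loc> element.
        root = ET.fromstring(stripped)
        return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
    # Plain-text sitemap: one URL per line, blank lines ignored.
    return [line.strip() for line in content.splitlines() if line.strip()]


def is_allowed(url: str, include_globs: list[str], exclude_regexes: list[re.Pattern]) -> bool:
    """Keep a URL if it matches any include glob and no exclude regex."""
    included = not include_globs or any(fnmatch.fnmatchcase(url, g) for g in include_globs)
    excluded = any(rx.search(url) for rx in exclude_regexes)
    return included and not excluded


XML_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/docs/a</loc></url>"
    "<url><loc>https://example.com/blog/b</loc></url>"
    "</urlset>"
)

TEXT_SITEMAP = (
    "https://example.com/docs/a\n"
    "https://example.com/docs/private/c\n"
    "\n"
    "https://example.com/blog/b\n"
)

if __name__ == "__main__":
    include = ["https://example.com/docs/*"]
    exclude = [re.compile(r"/private/")]
    for body in (XML_SITEMAP, TEXT_SITEMAP):
        kept = [u for u in extract_urls(body) if is_allowed(u, include, exclude)]
        print(kept)
```

An HTML page would not fit either branch in a useful way: it is neither a namespaced `<urlset>` nor a list of bare URLs, which matches the docstring's note that such pages belong to regular crawlers and `enqueue_links`.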
