---
title: Scraping paginated sites
description: Description
menuWeight: 8.1
paths:
    - advanced-web-scraping/scraping-paginated-sites
---

# Scraping websites with limited pagination

Limited pagination is a common practice on e-commerce sites and is becoming more popular over time. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.

![Pagination on a Google search results page]({{@asset advanced_web_scraping/images/pagination.webp}})

> In a rush? Skip the tutorial and get the [full code example](https://github.com/metalwarrior665/apify-utils/tree/master/examples/crawler-with-filters).

## [](#how-to-overcome-the-limit) How to overcome the limit

Websites usually limit the pagination of a single (sub)category to somewhere between 1,000 and 20,000 listings, while the site might have over a million listings in total. Without a proven algorithm, scraping all listings is a very manual and almost impossible task.

We will first look at a couple of ideas that don't work so well and then present the [final robust solution](#using-filter-ranges).

### [](#going-deeper-into-subcategories) Going deeper into subcategories

This is usually the first solution that comes to mind. You traverse the smallest subcategories and hope that those are below the pagination limits. Unfortunately, there are two big problems with this approach:

1. Any subcategory might be bigger than the pagination limit.
2. Some listings from the parent category might not be present in any subcategory.

While you can often manually test whether the second problem occurs on a particular site, the first problem is a hard blocker. You might get lucky and it may work on one site, but usually traversing subcategories is just not enough. It can be used as a first step of the solution but not as the solution itself.

### [](#using-filters) Using filters

Most websites also provide a way for the user to select search filters. These allow a more granular level of search than categories and can be combined with them. Common filters let you select a **color**, **size**, **location** and similar attributes.

At first, this might seem like an easy solution: enqueue all possible filter combinations and the search should become so granular that it never hits a pagination limit. Unfortunately, this solution is still far from good.

1. There is no guarantee that some products don't slip through the chosen filter combinations.
2. The resulting split might be too granular, ending up with many tiny paginations full of duplicate products. This leads to scraping a lot more pages than necessary and makes analytics much harder.

### [](#using-filter-ranges) Using filter ranges

The best option is to use only a specific type of filter that can be used as a range. The most common one is **price range**, but there may be others, such as apartment size. You can split the pagination pages so that they only contain listings within that range, e.g. products costing between $10 and $20.

This has several benefits:

1. All listings can eventually be found in a range.
2. The ranges do not overlap, so we scrape the smallest possible number of pages and avoid duplicate listings.
3. Ranges can be controlled by a generic algorithm that is simple to re-use for different sites.

## [](#splitting-pages-with-range-filters) Splitting pages with range filters

In the previous section, we analyzed different options for splitting the pages to overcome the pagination limit. We have chosen range filters as the most reliable way to do that. In this section, we will discuss a generic algorithm to work with ranges, look at a few special cases and then write an example crawler.

![An example of range filters on a website]({{@asset advanced_web_scraping/images/pagination-filters.webp}})

### [](#the-algorithm) The algorithm

The core algorithm is simple and can be used on any (even overlapping) range. This is a simplified presentation; we will discuss the details later.

1. We choose a few pivot ranges with a similar number of products and enqueue them. For example, **$0-$10**, **$10-$100**, **$100-$1000**, **$1000-$10000**, **$10000-**.
2. For each range, we open the page and check if the number of listings is below the limit. If yes, we continue to step 3. If not, we split the filter in half, e.g. **$0-$10** into **$0-$5** and **$5-$10**, and enqueue those again. We recursively repeat step **2** for each range as long as needed.
3. We now have a pagination URL that is below the limit; we enqueue it under a pagination label and start enqueuing products.

Because the algorithm is recursive, we don't need to think about how big the final ranges should be; the algorithm will find them over time.
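
Before we dive into the details, here is a minimal standalone sketch of that recursion, assuming a hypothetical `getCount` function that returns the number of listings for a given range; the full crawler implementation follows later in this tutorial.

```javascript
// Recursively split a range until every sub-range is under the pagination limit.
// `getCount` is a hypothetical async function returning the number of listings for a range.
const findRangesUnderLimit = async (range, getCount, limit) => {
    const count = await getCount(range);
    if (count <= limit) {
        return [range];
    }
    // Split the range in half and recurse into both halves
    const middle = (range.min + range.max) / 2;
    const lowerHalf = await findRangesUnderLimit({ min: range.min, max: middle }, getCount, limit);
    const upperHalf = await findRangesUnderLimit({ min: middle, max: range.max }, getCount, limit);
    return [...lowerHalf, ...upperHalf];
};
```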

### [](#special-cases-to-look-for) Special cases to look for

We have the base algorithm, but before we start coding, let's answer a few questions to get more insight.

#### [](#can-the-ranges-overlap) Can the ranges overlap?

Some sites will allow you to construct non-overlapping ranges. For example, you can set the ranges with cents, e.g. **$0-$4.99**, **$5-$9.99**, etc. If that is possible, create the pivot ranges this way, too.

Non-overlapping ranges should remove the possibility of duplicate products (unless a [listing has multiple values](#can-a-listing-have-more-values)) and result in the lowest number of scraped pages.

If the website supports only overlapping ranges (e.g. **$0-$5**, **$5-$10**), it is not a big problem. Only a small portion of the listings will be duplicates, and they can be removed using a [request queue](https://docs.apify.com/storage/request-queue).

#### [](#can-a-listing-have-more-values) Can a listing have more values?

In rare cases, a listing can have more than one value for the attribute you are filtering as a range. A typical example is [amazon.com](https://amazon.com), where each product has several offers and those offers have different prices. If any of those offers falls within the range, the product is shown.

There is no easy way to get around this, but the price range split works even with duplicate listings; just use a [JS set](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set) or the request queue to deduplicate them.
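
For example, here is a minimal sketch of deduplicating listings with a JS set before pushing them to the dataset; the `id` field is an assumption about what uniquely identifies a product on your site.

```javascript
import { Actor } from 'apify';

// Keep track of product IDs we have already stored
const seenProductIds = new Set();

const pushIfNew = async (product) => {
    if (seenProductIds.has(product.id)) {
        return; // duplicate coming from an overlapping range, skip it
    }
    seenProductIds.add(product.id);
    await Actor.pushData(product);
};
```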

#### [](#how-is-the-range-passed-to-the-url) How is the range passed to the URL?

In the easiest case, you can pass the range directly in the page's URL, for example `https://mysite.com/products?price=0-10`. Sometimes you will need to do some query composition because the price range might be encoded together with more information into a single parameter.

Some sites don't have page URLs with filters and instead load the filtered products via [XHRs](https://docs.apify.com/web-scraping-101/web-scraping-techniques#xhrs). Those can be GET or POST requests with various **URL** and **payload** syntax.

The nice thing here is that if you get to understand how their internal API works, you can have it return more products per page or extract full product details from this single request.

In addition, XHRs are smaller and faster than loading an HTML page. On the other hand, you should not abuse them (for example by setting overly large limits), as this can expose your scraper.
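
As an illustration, here is a hypothetical sketch of calling such an internal API directly; the endpoint, parameters and response shape are invented and will differ on every site.

```javascript
// Hypothetical internal API - inspect the network tab to find the real one for your site
const response = await fetch('https://www.mysite.com/api/search?min_price=0&max_price=9.99&page=1&limit=100');
const { products, totalCount } = await response.json();

console.log(`The $0-$9.99 range contains ${totalCount} products, got ${products.length} of them`);
```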

#### [](#does-the-website-show-the-number-of-products-for-each-filtered-page) Does the website show the number of products for each filtered page?

If it does, it is a nice bonus. It gives us an easy way to check if we are over or below the pagination limit and helps with analytics.

If it doesn't, we have to find a different way to check if the number of listings is within the limit. One option is to go to the last allowed page of the pagination. If that page is still full of products, we can assume the filter is over the limit.
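
Here is a minimal sketch of that check; the page size, the last allowed page number and the `.product-item` selector are assumptions about the target site.

```javascript
const PRODUCTS_PER_PAGE = 20;
const LAST_ALLOWED_PAGE = 50;

// URL of the last page the site still allows us to open for a given filter
const lastPageUrl = (filterUrl) => `${filterUrl}&page=${LAST_ALLOWED_PAGE}`;

// Call this with the Cheerio handle of that last allowed page:
// if it is still completely full, the filter most likely exceeds the limit
const isFilterOverLimit = ($) => $('.product-item').length >= PRODUCTS_PER_PAGE;
```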

#### [](#how-to-handle-open-ends-of-the-range) How to handle (open) ends of the range

Logically, every full (price) range starts at 0 and ends at infinity, but the way this is encoded will differ on each site. Either end of the price range can be closed (a concrete number) or open (unbounded). Open ranges require special handling when you split them (we will get to that).

Most sites will let you start at 0 (there might be exceptions where you will have to make the start open), so we can use just that. The high end is more complicated. Because you don't know the biggest price, it is best to leave it open and handle it specially. Internally, you can just assign `null` to the value.

Here are a few examples of a query parameter with an open and a closed high-end range:

- Open: `p:100-` (higher than 100), Closed: `p:100-200` (between 100 and 200)
- Open: `min_price=100`, Closed: `min_price=100&max_price=200`

#### [](#can-the-range-exceed-the-limit-on-a-single-value) Can the range exceed the limit on a single value?

In very rare cases, a site will have so many listings that a single value (e.g. **$100** or **$4.99**) will include a number of listings over the limit. [The basic algorithm](#the-algorithm) will recurse until the **min** value equals the **max** value and then stop, because it cannot split that single value any further.

In this rare case, you will need to combine the range with another range or other filters to get an even deeper split, as sketched below.
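
For example, a single price value could be further split by combining it with another filter such as color; the `color` parameter below is purely hypothetical and only illustrates the idea.

```javascript
// Split a single over-the-limit price value further by adding a second (hypothetical) filter
const colors = ['black', 'white', 'red', 'blue'];

const extraFilterRequests = colors.map((color) => ({
    url: `https://www.mysite.com/products?min_price=4.99&max_price=4.99&color=${color}`,
    userData: { label: 'FILTER' },
}));
// These requests would then be enqueued like any other FILTER request
```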

### [](#implementing-a-range-filter) Implementing a range filter

This section shows a simple code example implementing our solution for an imaginary website. Writing a real solution will bring up more complex problems, but the previous section should prepare you for some of them.

First, let's define our imaginary site:

- It has a single `/products` path that contains all the products we want to scrape.
- The maximum pagination limit is **1000** products.
- The site contains over a million products.
- It allows filtering over a price range with the query parameters `min_price` and `max_price`.
- If `min_price` or `max_price` is not defined, that end of the range is left open (all products below or above the other bound).
- The site allows specifying the price in cents.
- Pagination is done via the `page` query parameter.

#### [](#define-and-enqueue-pivot-ranges) Define and enqueue pivot ranges

This step is not strictly necessary, but it is useful: it saves the algorithm from starting the splitting with values that are far too large or too small.

```javascript
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const MAX_PRODUCTS_PAGINATION = 1000;

// This is just an example, choose what makes sense for your site
const PIVOT_PRICE_RANGES = [
    { min: 0, max: 9.99 },
    { min: 10, max: 99.99 },
    { min: 100, max: 999.99 },
    { min: 1000, max: 9999.99 },
    { min: 10000, max: null }, // open-ended
];

// Let's create a helper function for creating the filter URLs, you can move those to a utils.js file
const createFilterUrl = ({ min, max }) => {
    const minString = `min_price=${min}`;
    // We don't want to pass the parameter at all if it is null (open-ended)
    const maxString = max ? `&max_price=${max}` : '';
    return `https://www.mysite.com/products?${minString}${maxString}`;
};

// And another helper for getting filters back from the URL, we could also pass them in userData
const getFiltersFromUrl = (url) => {
    const min = Number(url.match(/min_price=([0-9.]+)/)[1]);
    // Max price might be empty
    const maxMatch = url.match(/max_price=([0-9.]+)/);
    const max = maxMatch ? Number(maxMatch[1]) : null;
    return { min, max };
};

// Actor setup things here
const crawler = new CheerioCrawler({
    async requestHandler(context) {
        // ...
    },
});

// Let's create the pivot requests
const initialRequests = [];
for (const { min, max } of PIVOT_PRICE_RANGES) {
    initialRequests.push({
        url: createFilterUrl({ min, max }),
        label: 'FILTER',
    });
}
// Let's start the crawl
await crawler.run(initialRequests);

await Actor.exit();
```

#### [](#define-the-logic-for-the-filter-page) Define the logic for the `FILTER` page

```javascript
import { CheerioCrawler } from 'crawlee';

// Doesn't matter what Crawler class we choose
const crawler = new CheerioCrawler({
    // Crawler options here
    // ...
    async requestHandler({ request, $ }) {
        const { label } = request;
        if (label === 'FILTER') {
            // Of course, change the selectors and make it more robust
            const numberOfProducts = Number($('.product-count').text());

            // The filter is either good enough or we have to split it
            if (numberOfProducts <= MAX_PRODUCTS_PAGINATION) {
                // We just pass the URL for scraping, we could optimize it so the page is not opened again
                await crawler.addRequests([{
                    url: `${request.url}&page=1`,
                    userData: { label: 'PAGINATION' },
                }]);
            } else {
                // Here we have to split the filter
                // To be continued...
            }
        }
        if (label === 'PAGINATION') {
            // We know we are under the limit here
            // Enqueue the next page as long as there are products
            // Enqueue or scrape products normally
        }
    },
});
```
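
The `PAGINATION` branch above is left as comments. Below is one possible sketch of it that would slot into the same `requestHandler`; the `.product-item` and `.product-title` selectors, the scraped fields and the `page` parameter handling are assumptions about the imaginary site, and `Actor` is assumed to be imported from `apify` as in the first example.

```javascript
if (label === 'PAGINATION') {
    // Scrape the products on the current page (selectors are assumptions)
    const products = $('.product-item').map((i, el) => ({
        title: $(el).find('.product-title').text().trim(),
        url: $(el).find('a').attr('href'),
    })).get();
    await Actor.pushData(products);

    // Keep enqueuing the next page until we run out of products
    if (products.length > 0) {
        const currentPage = Number(request.url.match(/page=(\d+)/)[1]);
        await crawler.addRequests([{
            url: request.url.replace(/page=\d+/, `page=${currentPage + 1}`),
            userData: { label: 'PAGINATION' },
        }]);
    }
}
```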

#### [](#split-price-filters) Split price filters

We now have the base of the crawler set up. The last missing part is the price filter splitting. Let's use a generic function for this; we can place it into the `utils.js` file.

```javascript
// utils.js
export function splitFilter(filter) {
    const { min, max } = filter;
    // Don't forget that max can be null and we have to handle that situation
    if (max && min > max) {
        throw new Error(`WRONG FILTER - min(${min}) is greater than max(${max})`);
    }

    // We create a middle value for the split. If max is null, we will use double the min as the middle value
    const middle = max
        ? min + Math.floor((max - min) / 2)
        : min * 2;

    // We have to do the Math.max and Math.min to prevent having min > max
    const filterMin = {
        min,
        max: Math.max(middle, min),
    };
    const filterMax = {
        min: max ? Math.min(middle + 1, max) : middle + 1,
        max,
    };
    // We return 2 new filters
    return [filterMin, filterMax];
}
```
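
For illustration, here is how the function splits a closed and an open-ended filter (the values are already in cents, as we will use in the next step):

```javascript
splitFilter({ min: 0, max: 999 });
// => [{ min: 0, max: 499 }, { min: 500, max: 999 }]

splitFilter({ min: 1000000, max: null });
// => [{ min: 1000000, max: 2000000 }, { min: 2000001, max: null }]
```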

#### [](#enqueue-the-filters) Enqueue the filters

Let's finish the crawler now. This code example goes inside the `else` block of the previous crawler example.

```javascript
const { min, max } = getFiltersFromUrl(request.url);
// Our generic splitFilter function doesn't account for decimal values so we will have to convert to cents and back to dollars
// Careful: max can be null (open-ended), in that case we keep it null instead of multiplying
const newFilters = splitFilter({ min: min * 100, max: max === null ? null : max * 100 });

// And we just enqueue those 2 new filters so the process will recursively repeat until all pages get to the PAGINATION phase
const requestsToEnqueue = [];
for (const filter of newFilters) {
    requestsToEnqueue.push({
        // Remember that we have to convert back from cents to dollars (and keep the open end as null)
        url: createFilterUrl({
            min: filter.min / 100,
            max: filter.max === null ? null : filter.max / 100,
        }),
        label: 'FILTER',
    });
}

await crawler.addRequests(requestsToEnqueue);
```

## [](#summary) Summary

And that's it. We have an elegant and simple solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data]({{@link expert_scraping_with_apify/saving_useful_stats.md}}). This will let you know which filters you went through and how many products each of them had.

Check out the [full code example](https://github.com/metalwarrior665/apify-utils/tree/master/examples/crawler-with-filters).
