
Commit 1d44837

Shuffle categories + create new tutorial sections
1 parent 1dbedfd commit 1d44837

39 files changed: +68 −52 lines changed

content/academy/anti_scraping.md

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ Solely based on the way how the bots operate. It comperes data-rich pages visits
 
 By definition, this is not an anti-scraping method, but it can heavily affect the reliability of a scraper. If your target website drastically changes its CSS selectors, and your scraper is heavily reliant on selectors, it could break. In principle, websites using this method change their HTML structure or CSS selectors randomly and frequently, making the parsing of the data harder, and requiring more maintenance of the bot.
 
-One of the best ways of avoiding the possible breaking of your scraper due to website structure changes is to limit your reliance on data from HTML elements as much as possible (see [API Scraping]({{@link api_scraping.md}}) and [JavaScript objects within HTML]({{@link js_in_html.md}}))
+One of the best ways of avoiding the possible breaking of your scraper due to website structure changes is to limit your reliance on data from HTML elements as much as possible (see [API Scraping]({{@link api_scraping.md}}) and [JavaScript objects within HTML]({{@link node_js/js_in_html.md}}))
 
 ### IP session consistency
 
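The advice in the changed line above, reading data from JavaScript objects embedded in the HTML rather than from rendered elements, can be sketched roughly as follows. This is only an illustration: Node 18+ (global `fetch`) and the `cheerio` package are assumed, and the URL and the JSON-LD selector are placeholders, not something referenced by this commit.

```js
// Sketch: pull structured data out of a <script> tag instead of CSS-selecting
// rendered elements, so minor layout changes don't break the scraper.
import * as cheerio from 'cheerio';

const response = await fetch('https://example.com/some-product'); // hypothetical page
const html = await response.text();

const $ = cheerio.load(html);

// Many sites embed product data as JSON-LD; parse that instead of scraping prices from markup.
const raw = $('script[type="application/ld+json"]').first().html();
if (raw) {
    const data = JSON.parse(raw);
    console.log(data.name, data.offers?.price);
}
```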

content/academy/concepts.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 ---
 title: Concepts
 description: Learn about some common yet tricky concepts and terms that are used frequently within the academy, as well as in the world of scraper development.
-menuWeight: 15
+menuWeight: 18
 category: glossary
 paths:
 - concepts

content/academy/glossary.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 ---
 title: Why a glossary?
 description: Browse important web scraping concepts, tools and topics in succinct articles explaining common web development terms in a web scraping and automation context.
-menuWeight: 13
+menuWeight: 16
 category: glossary
 paths:
 - glossary

content/academy/node_js.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+---
+title: Node.js tutorials
+description: description
+menuWeight: 14
+category: tutorials
+paths:
+- node-js
+---
+
+# Node.js Tutorials 💻📚
+
+<!-- something -->

content/academy/analyzing_pages_and_fixing_errors.md renamed to content/academy/node_js/analyzing_pages_and_fixing_errors.md

Lines changed: 4 additions & 5 deletions
@@ -1,13 +1,12 @@
 ---
 title: How to analyze and fix errors when scraping a website
 description: Learn how to deal with random crashes in your web-scraping and automation jobs. Find out the essentials of debugging and fixing problems in your crawlers.
-menuWeight: 20
-category: tutorials
+menuWeight: 14.1
 paths:
-- analyzing-pages-and-fixing-errors
+- node-js/analyzing-pages-and-fixing-errors
 ---
 
-# [](#scraping-with-sitemaps) Analyzing a page and fixing errors
+# [](#scraping-with-sitemaps) How to analyze and fix errors when scraping a website
 
 Debugging is absolutely essential in programming. Even if you don't call yourself a programmer, having basic debugging skills will make building crawlers easier. It will also help you save money by allowing you to avoid hiring an expensive developer to solve your issue for you.
 
@@ -24,7 +23,7 @@ Here are the most common reasons your working solution may break.
 - The website changes its layout or [data feed](https://www.datafeedwatch.com/academy/data-feed).
 - A site's layout changes depending on location or uses [A/B testing](https://www.youtube.com/watch?v=XDoKXaGrUxE&feature=youtu.be).
 - A page starts to block you (recognizes you as a bot).
-- The website [loads its data later dynamically]({{@link dealing_with_dynamic_pages.md}}), so the code works only sometimes, if you are slow or lucky enough.
+- The website [loads its data later dynamically]({{@link node_js/dealing_with_dynamic_pages.md}}), so the code works only sometimes, if you are slow or lucky enough.
 - You made a mistake when updating your code.
 - Your [proxies]({{@link anti_scraping/mitigation/proxies.md}}) aren't working.
 - You have upgraded your [dependencies](https://www.quora.com/What-is-a-dependency-in-coding) (other software that your software relies upon), and the new versions no longer work (this is harder to debug).
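The article this list comes from is about analyzing failing pages, and a common first step it points toward is keeping evidence of what the page looked like when the handler broke. A minimal sketch of that with plain Puppeteer (the URL and file names are placeholders, not part of the article):

```js
// Sketch: save a screenshot and the raw HTML whenever extraction fails,
// so layout changes, blocking, or dynamic loading can be diagnosed later.
import { writeFile } from 'node:fs/promises';
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();

try {
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });
    // ... extraction logic that might break when the layout changes ...
} catch (error) {
    // Keep evidence of what the page actually looked like at the moment of failure.
    await page.screenshot({ path: 'error-snapshot.png', fullPage: true });
    await writeFile('error-snapshot.html', await page.content());
    console.error('Handler failed:', error);
} finally {
    await browser.close();
}
```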

content/academy/caching_responses_in_puppeteer.md renamed to content/academy/node_js/caching_responses_in_puppeteer.md

Lines changed: 8 additions & 9 deletions
@@ -1,13 +1,12 @@
 ---
 title: How to optimize Puppeteer by caching responses
 description: Learn why it's important to cache responses in memory when intercepting requests in Puppeteer, and how to do it.
-menuWeight: 22
-category: tutorials
+menuWeight: 14.2
 paths:
-- caching-responses-in-puppeteer
+- node-js/caching-responses-in-puppeteer
 ---
 
-# [](#caching-responses-in-puppeteer) Caching responses in Puppeteer
+# [](#caching-responses-in-puppeteer) How to optimize Puppeteer by caching responses
 
 > In the latest version of Puppeteer, the request-interception function inconveniently disables the native cache and significantly slows down the crawler. Therefore, it's not recommended to follow the examples shown in this article unless you have a very specific use-case where the default browser cache is not enough (e.g. caching over multiple scraper runs)
 
@@ -17,7 +16,7 @@ For this reason, in this article, we will take a look at how to use memory to ca
 
 In this example, we will use a scraper which goes through top stories on the CNN website and takes a screenshot of each opened page. The scraper is very slow right now because it waits till all network requests are finished and because the posts contain videos. If the scraper runs with disabled caching, these statistics will show at the end of the run:
 
-![Bad run stats]({{@asset images/bad-scraper-stats.webp}})
+![Bad run stats]({{@asset node_js/images/bad-scraper-stats.webp}})
 
 As you can see, we used 177MB of traffic for 10 posts (that is how many posts are in the top-stories column) and 1 main page. We also stored all the screenshots, which you can find [here](https://my.apify.com/storage/key-value/q2ipoeLLy265NtSiL).
 
@@ -27,15 +26,15 @@ From the screenshot above, it's clear that most of the traffic is coming from sc
 
 If we go to the CNN website, open up the tools and go to the **Network** tab, we will find an option to disable caching.
 
-![Disabling cache in the Network tab]({{@asset images/cnn-network-tab.webp}})
+![Disabling cache in the Network tab]({{@asset node_js/images/cnn-network-tab.webp}})
 
 Once caching is disabled, we can take a look at how much data is transferred when we open the page. This is visible at the bottom of the developer tools.
 
-![5.3MB of data transferred]({{@asset images/slow-no-cache.webp}})
+![5.3MB of data transferred]({{@asset node_js/images/slow-no-cache.webp}})
 
 If we uncheck the disable-cache checkbox and refresh the page, we will see how much data we can save by caching responses.
 
-![642KB of data transferred]({{@asset images/fast-with-cache.webp}})
+![642KB of data transferred]({{@asset node_js/images/fast-with-cache.webp}})
 
 By comparison, the data transfer appears to be reduced by 88%!
 
@@ -93,7 +92,7 @@ page.on('response', async(response) => {
 
 After implementing this code, we can run the scraper again.
 
-![Good run results]({{@asset images/good-run-results.webp}})
+![Good run results]({{@asset node_js/images/good-run-results.webp}})
 
 Looking at the statistics, caching responses in Puppeteer brought the traffic down from 177MB to 13.4MB, which is a reduction of data transfer by 92%. The related screenshots can be found [here](https://my.apify.com/storage/key-value/iWQ3mQE2XsLA2eErL).
 
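For orientation, the interception-based caching that the article (and the `page.on('response', ...)` hunk above) describes boils down to something like the following condensed sketch. It is not the article's exact code, and, as the note at the top of this file warns, enabling request interception disables Puppeteer's native cache in recent versions, so it only pays off in specific cases such as caching across runs.

```js
// Sketch: keep fetched responses in an in-memory Map and serve them back
// through request interception so repeated assets are not downloaded again.
import puppeteer from 'puppeteer';

const cache = new Map();

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.setRequestInterception(true);

page.on('request', (request) => {
    const cached = cache.get(request.url());
    if (cached) {
        // Serve the stored body instead of hitting the network again.
        request.respond(cached);
    } else {
        request.continue();
    }
});

page.on('response', async (response) => {
    const url = response.url();
    // Cache only successful responses for static assets; skip anything already stored.
    if (cache.has(url) || response.status() !== 200) return;
    if (!/\.(js|css|png|jpe?g|webp|woff2?)(\?|$)/.test(url)) return;
    try {
        cache.set(url, {
            status: response.status(),
            headers: response.headers(),
            body: await response.buffer(),
        });
    } catch {
        // response.buffer() can fail for redirects or bodies that were already consumed.
    }
});

await page.goto('https://edition.cnn.com/', { waitUntil: 'networkidle2' });
await browser.close();
```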

content/academy/choosing_the_right_scraper.md renamed to content/academy/node_js/choosing_the_right_scraper.md

Lines changed: 4 additions & 5 deletions
@@ -1,13 +1,12 @@
 ---
 title: How to choose the right scraper for the job
 description: Understand how to choose the best scraper for your use-case by understanding some basic concepts.
-menuWeight: 23
-category: tutorials
+menuWeight: 14.3
 paths:
-- choosing-the-right-scraper
+- node-js/choosing-the-right-scraper
 ---
 
-# [](#choosing-the-right-scraper) Choosing the right scraper for the job
+# [](#choosing-the-right-scraper) How to choose the right scraper for the job
 
 There are two main ways you can proceed with building your crawler:
 
@@ -24,7 +23,7 @@ If it were only a question of performance, you'd of course use request-based scr
 
 ## [](#dynamic-pages) Dynamic pages & blocking
 
-Some websites do not load any data without a browser, as they need to execute some scripts to show it (these are known as [dynamic pages]({{@link dealing_with_dynamic_pages.md}})). Another problem is blocking. If the website is collecting a [browser fingerprint]({{@link anti_scraping/techniques/fingerprinting.md}}), it is very easy for it to distinguish between a real user and a bot (crawler) and block access.
+Some websites do not load any data without a browser, as they need to execute some scripts to show it (these are known as [dynamic pages]({{@link node_js/dealing_with_dynamic_pages.md}})). Another problem is blocking. If the website is collecting a [browser fingerprint]({{@link anti_scraping/techniques/fingerprinting.md}}), it is very easy for it to distinguish between a real user and a bot (crawler) and block access.
 
 ## [](#making-the-choice) Making the choice
 
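A quick check that often informs the choice the article discusses (request-based vs. browser-based) is whether the data you need is already in the raw HTML. A minimal sketch, assuming Node 18+ (global `fetch`); the URL and search string are illustrative only:

```js
// Sketch: fetch the page without a browser and see if the target data is in the initial HTML.
const response = await fetch('https://demo-webstore.apify.org/search/new-arrivals');
const html = await response.text();

// If the value is present here, a plain HTTP scraper may be enough;
// if it only appears after scripts run, a headless browser is likely needed.
console.log(html.includes('some product title visible in the browser'));
```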

content/academy/dealing_with_dynamic_pages.md renamed to content/academy/node_js/dealing_with_dynamic_pages.md

Lines changed: 6 additions & 7 deletions
@@ -1,13 +1,12 @@
 ---
 title: How to scrape from dynamic pages
 description: Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?
-menuWeight: 17
-category: tutorials
+menuWeight: 14.4
 paths:
-- dealing-with-dynamic-pages
+- node-js/dealing-with-dynamic-pages
 ---
 
-# [](#dealing-with-dynamic-pages) Dealing with dynamic pages
+# [](#dealing-with-dynamic-pages) How to scrape from dynamic pages
 
 <!-- In the last few lessons, we learned about Crawlee, which is a powerful library for writing reliable and efficient scrapers. We recommend reading up on those last two lessons in order to install the `crawlee` package and familiarize yourself with it before moving forward with this lesson. -->
 
@@ -17,7 +16,7 @@ In this lesson, we'll be discussing dynamic content and how to scrape it while u
 
 From our adored and beloved [Fakestore](https://demo-webstore.apify.org/), we have been tasked to scrape each product's title, price, and image from the [new arrivals](https://demo-webstore.apify.org/search/new-arrivals) page. Easy enough! We did something very similar in the previous modules.
 
-![New arrival products in Fakestore]({{@asset images/new-arrivals.webp}})
+![New arrival products in Fakestore]({{@asset node_js/images/new-arrivals.webp}})
 
 First, create a file called **dynamic.js** and copy-paste the following boilerplate code into it:
 
@@ -79,7 +78,7 @@ await crawler.run([{ url: 'https://demo-webstore.apify.org/search/new-arrivals'
 
 After running it, you might say, "Great! It works!" **But wait...** What are those results being logged to console?
 
-![Bad results in console]({{@asset images/bad-results.webp}})
+![Bad results in console]({{@asset node_js/images/bad-results.webp}})
 
 Every single image seems to have the same exact "URL," but they are most definitely not the image URLs we are looking for. This is strange, because in the browser, we were getting URLs that looked like this:
 
@@ -134,7 +133,7 @@
 
 After running this one, we can see that our results look different from before. We're getting the image links!
 
-![Not perfect results]({{@asset images/almost-there.webp}})
+![Not perfect results]({{@asset node_js/images/almost-there.webp}})
 
 Well... Not quite. It seems that the only images which we got the full links to were the ones that were being displayed within the view of the browser. This means that the images are lazy-loaded. **Lazy-loading** is a common technique used across the web to improve performance. Lazy-loaded items allow the user to load content incrementally, as they perform some action. In most cases, including our current one, this action is scrolling.
 
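Since the paragraph above attributes the missing URLs to lazy-loading triggered by scrolling, a rough, library-agnostic sketch of forcing that scroll inside a Puppeteer (or Playwright) page handler could look like this. A page of fixed height is assumed; an infinitely growing feed would need a different stop condition.

```js
// Sketch: scroll the page in steps so lazy-loaded images swap in their real
// URLs before the handler extracts them.
async function scrollToBottom(page) {
    await page.evaluate(async () => {
        while (window.scrollY + window.innerHeight < document.body.scrollHeight) {
            window.scrollBy(0, window.innerHeight);
            // Give lazy-loaded elements a moment to load after each step.
            await new Promise((resolve) => setTimeout(resolve, 250));
        }
    });
}
```

Calling `await scrollToBottom(page)` before reading the image `src` attributes gives the lazy-loaded elements time to replace their placeholder URLs.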
