Commit 442a569

Merge branch 'master' of https://github.com/apify/apify-docs into advanced-scraping-course

2 parents: 3d83134 + 9a1818b

File tree

5 files changed: +19 -14 lines


.markdownlint.json — 3 additions, 2 deletions

````diff
@@ -11,5 +11,6 @@
   "no-multiple-blanks": {
     "maximum": 2
   },
-  "no-space-in-emphasis": false
-}
+  "no-space-in-emphasis": false,
+  "link-fragments": false
+}
````
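For context, a minimal sketch of how a config like this is consumed, assuming the `markdownlint` Node API (nothing here is part of the commit; the file names are hypothetical):

```javascript
// lint.js — hypothetical usage sketch, not part of this commit
import markdownlint from 'markdownlint';
import { readFileSync } from 'fs';

// Load the repository config, including the newly disabled 'link-fragments' rule.
const config = JSON.parse(readFileSync('.markdownlint.json', 'utf8'));

// Lint a sample file; rules set to false in the config are skipped.
const results = markdownlint.sync({ files: ['README.md'], config });
console.log(results.toString());
```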

content/academy/web_scraping_for_beginners/crawling/processing_data.md — 15 additions, 3 deletions

````diff
@@ -21,7 +21,13 @@ To access the default dataset, we can use the [`Dataset`](https://crawlee.dev/a
 
 ```JavaScript
 // dataset.js
-import { Dataset } from 'crawlee';
+import { Dataset, Configuration } from 'crawlee';
+
+// Crawlee automatically deletes data from its previous runs.
+// We can turn this off by setting 'purgeOnStart' to false.
+// If we did not do this, we would have no data to process.
+// This is a temporary workaround, and we'll add a better interface soon.
+Configuration.getGlobalConfig().set('purgeOnStart', false);
 
 const dataset = await Dataset.open();
 
````
````diff
@@ -39,6 +45,8 @@ Let's say we wanted to print the title for each product that is more expensive t
 // dataset.js
 import { Dataset } from 'crawlee';
 
+Configuration.getGlobalConfig().set('purgeOnStart', false);
+
 const { items } = await Dataset.getData();
 
 let mostExpensive;
@@ -47,7 +55,7 @@ console.log('All items over $50 USD:');
 for (const { title, price } of items) {
     // Use a regular expression to filter out the
     // non-number and non-decimal characters
-    const numPrice = +price.replace(/[^0-9.]/g, '');
+    const numPrice = Number(price.replace(/[^0-9.]/g, ''));
     if (numPrice > 50) console.table({ title, price });
     if (numPrice > mostExpensive.price) mostExpensive = { title, price };
 }
````
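Unary `+` and `Number()` perform the same string-to-number conversion, so this replacement changes readability, not behavior. A quick check with a made-up price string:

```javascript
// Both expressions strip currency symbols and separators, then convert to a number.
const price = '$1,299.00'; // example value, not taken from the dataset
const viaPlus = +price.replace(/[^0-9.]/g, ''); // 1299
const viaNumber = Number(price.replace(/[^0-9.]/g, '')); // 1299
console.log(viaPlus === viaNumber); // true
```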
````diff
@@ -60,7 +68,7 @@ In our case, the most expensive product was the Macbook Pro. Surprising? Heh, no
 
 ## [](#converting-to-excel) Converting the dataset to Excel
 
-We promised that you won't need an Apify account for anything in this course, and it's true. You can use the skills learned in the [Save to CSV lesson]({{@link web_scraping_for_beginners/data_collection/save_to_csv.md}}) to save the dataset to a CSV. Just use the loading code from above, plug it in there and then open the CSV in Excel. However, we really want to show you this neat trick. It won't cost you anything, we promise, and it's a cool and fast way to convert datasets to any format.
+We promised that you won't need an Apify account for anything in this course, and it's true. You can use the skills learned in the [Save to CSV lesson]({{@link web_scraping_for_beginners/data_collection/save_to_csv.md}}) to save the dataset to a CSV. Just use the loading code from above, plug it in there and then open the CSV in Excel. However, we really want to show you this neat trick. It won't cost you anything, and it's a cool and fast way to convert datasets to any format.
 
 ### [](#get-apify-token) Getting an Apify token
 
@@ -77,6 +85,8 @@ Now that you have a token, you can upload your local dataset to the Apify platfo
 import { Dataset } from 'crawlee';
 import { ApifyClient } from 'apify-client';
 
+Configuration.getGlobalConfig().set('purgeOnStart', false);
+
 const { items } = await Dataset.getData();
 
 // We will use the Apify API client to access the Apify API.
@@ -110,6 +120,8 @@ import { Dataset } from 'crawlee';
 import { ApifyClient } from 'apify-client';
 import { writeFileSync } from 'fs';
 
+Configuration.getGlobalConfig().set('purgeOnStart', false);
+
 const { items } = await Dataset.getData();
 
 const apifyClient = new ApifyClient({
````
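The last two hunks prepend the same `purgeOnStart` workaround to the lesson's upload and download snippets. For orientation, a minimal sketch of the upload flow those context lines belong to, assuming `apify-client` v2; the token and dataset name are placeholders, and everything beyond the context lines shown above is a reconstruction, not the lesson's exact code:

```javascript
import { Dataset, Configuration } from 'crawlee';
import { ApifyClient } from 'apify-client';

// Keep the data from the previous crawl run available (see the hunks above).
Configuration.getGlobalConfig().set('purgeOnStart', false);

// Load the locally stored items from the default dataset.
const { items } = await Dataset.getData();

// We will use the Apify API client to access the Apify API.
const apifyClient = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' }); // placeholder token

// Create (or reuse) a named dataset on the Apify platform and push the items to it.
const { id } = await apifyClient.datasets().getOrCreate('my-scraped-products');
await apifyClient.dataset(id).pushItems(items);
```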

content/docs/tutorials/crawl_urls_from_a_google_sheet.md — 1 addition, 9 deletions

````diff
@@ -21,15 +21,7 @@ https://docs.google.com/spreadsheets/d/1GA5sSQhQjB_REes8I5IKg31S-TuRcznWOPjcpNqt
 
 ![Start URLs in a spreadsheet]({{@asset tutorials/images/start-urls-in-spreadsheet.webp}})
 
-You don't have to add them to the actor manually or export them as a file, only to upload to the scraper.
-
-Simply add the `/gviz/tq?tqx=out:csv` query parameter to the base part of the Google Sheet URL, right after the long document identifier.
-
-```URL
-https://docs.google.com/spreadsheets/d/1GA5sSQhQjB_REes8I5IKg31S-TuRcznWOPjcpNqtxmU/gviz/tq?tqx=out:csv
-```
-
-This gives you a URL that automatically exports the spreadsheet to CSV. Then, just click the **Link remote text file** button in the actor's input and paste the URL.
+You don't have to add them to the actor manually or export them as a file, only to upload to the scraper. Just click the **Text file** -> **Link remote text file** button in the actor's input and paste the URL.
 
 ![Link a remote text file]({{@asset tutorials/images/link-remote-file.webp}})
 
````

Two binary image files changed (38.4 KB and 6.21 KB); previews not shown.
