Downloading archived web pages from the [Wayback Machine](https://archive.org/web/).
9
9
10
-
Internet-archive is a nice source for several OSINT-information. This script is a work in progress to query and fetch archived web pages.
10
+
The Internet Archive is a useful source of OSINT information. This tool is a work in progress to query and fetch archived web pages.
11
+
12
+
This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.
11
13
12
14
## Installation
13
15
14
16
### Pip
15
17
16
18
1. Install the package <br>
17
19
```pip install pywaybackup```
18
-
2. Run the script <br>
20
+
2. Run the tool <br>
19
21
```waybackup -h```
20
22
21
23
### Manual
@@ -26,30 +28,25 @@ Internet-archive is a nice source for several OSINT-information. This script is
26
28
```pip install .```
27
29
- in a virtual env or use `--break-system-packages`
28
30
29
-
## Usage
30
-
31
-
This script allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.
32
-
33
-
### Arguments
31
+
## Arguments
34
32
35
33
-`-h`, `--help`: Show the help message and exit.
36
-
-`-a`, `--about`: Show information about the script and exit.
34
+
-`-a`, `--about`: Show information about the tool and exit.
37
35
38
-
####Required Arguments
36
+
### Required
39
37
40
38
-`-u`, `--url`: The URL of the web page to download. This argument is required.
41
39
42
40
#### Mode Selection (Choose One)
43
-
44
41
-`-c`, `--current`: Download the latest version of each file snapshot. You get a rebuild of the current website from all available files (this is not any single original state, because new and old versions are mixed).
45
42
-`-f`, `--full`: Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
46
43
-`-s`, `--save`: Save a page to the Wayback Machine. (beta)
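The difference between `--current` and `--full` can be illustrated with a small conceptual sketch. It is not the tool's implementation: the `(timestamp, url)` tuple layout and the function names `select_current` and `group_full` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical snapshot record: (timestamp "YYYYMMDDhhmmss", original URL).
Snapshot = tuple[str, str]


def select_current(snapshots: list[Snapshot]) -> dict[str, str]:
    """Keep only the newest timestamp per URL (the idea behind --current)."""
    latest: dict[str, str] = {}
    for timestamp, url in snapshots:
        # Fixed-width 14-digit timestamps compare correctly as strings.
        if url not in latest or timestamp > latest[url]:
            latest[url] = timestamp
    return latest


def group_full(snapshots: list[Snapshot]) -> dict[str, list[str]]:
    """Group URLs by timestamp, one bucket per timestamp (the idea behind --full)."""
    by_timestamp: dict[str, list[str]] = defaultdict(list)
    for timestamp, url in snapshots:
        by_timestamp[timestamp].append(url)
    return dict(by_timestamp)
```

With `select_current`, files from different capture times end up side by side, which is why the result is not any single original state of the site.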
47
44
48
-
####Optional Arguments
45
+
### Optional query parameters
49
46
50
47
-`-l`, `--list`: Only print the snapshots available within the specified range. Does not download the snapshots.
51
48
-`-e`, `--explicit`: Only download the explicitly given URL. No wildcard subdomains or paths. Use this, for example, to get root-only snapshots.
52
-
-`-o`, `--output`: The folder where downloaded files will be saved.
49
+
-`-o`, `--output`: The folder where downloaded files will be saved. Defaults to `waybackup_snapshots` in the current directory.
53
50
54
51
-**Range Selection:**<br>
55
52
Specify the range in years, or give a specific timestamp for the start, the end, or both. If you specify the `range` argument, the `start` and `end` arguments are ignored. Timestamp format: YYYYMMDDhhmmss. You can give only a year, or increase specificity by adding fields from left to right (for example, YYYYMM or YYYYMMDD).<br>
@@ -58,13 +55,36 @@ Specify the range in years or a specific timestamp either start, end or both. If
58
55
-`--start`: Timestamp to start searching.
59
56
-`--end`: Timestamp to end searching.
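The left-to-right timestamp rule can be sketched as a small check. This is an illustrative sketch, not the tool's own validation: the function name `is_valid_partial_timestamp` is hypothetical, and it assumes that only whole fields (year, month, day, hour, minute, second) may be appended.

```python
# Field widths of YYYYMMDDhhmmss, in order: year, month, day, hour, minute, second.
_FIELD_WIDTHS = (4, 2, 2, 2, 2, 2)


def is_valid_partial_timestamp(value: str) -> bool:
    """Return True if value is a left-anchored run of whole timestamp fields."""
    if not (value.isascii() and value.isdigit()):
        return False
    # Allowed lengths are the cumulative field widths: 4, 6, 8, 10, 12, 14.
    allowed_lengths = set()
    total = 0
    for width in _FIELD_WIDTHS:
        total += width
        allowed_lengths.add(total)
    return len(value) in allowed_lengths
```

Under this assumption, `2023`, `202301` and `20230115123000` are accepted, while `20231` (a half-written month) is rejected. The check covers only the shape of the value, not whether the date itself exists.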
60
57
61
-
#### Additional
62
-
63
-
-`--csv`: Save a csv file with the list of snapshots inside the output folder or a specified folder. If you set `--list` the csv will contain the cdx list of snapshots. If you set either `--current` or `--full` the csv will contain the downloaded files.
64
-
-`--no-redirect`: Do not follow redirects of snapshots. Archive.org sometimes redirects to a different snapshot for several reasons. Downloading redirects may lead to timestamp-folders which contain some files with a different timestamp. This does not matter if you only want to download the latest version (`-c`).
65
-
-`--verbosity`: Set the verbosity: json (print json response), progress (show progress bar).
66
-
-`--retry`: Retry failed downloads. You can specify the number of retry attempts as an integer.
67
-
-`--workers`: The number of workers to use for downloading (simultaneous downloads). Default is 1. A safe spot is about 10 workers. Beware: Using too many workers will lead into refused connections from the Wayback Machine. Duration about 1.5 minutes.
58
+
### Additional behavior manipulation
59
+
60
+
-**`--csv`**`<path>`:<br>
61
+
Path defaults to output-dir. Saves a CSV file with the JSON response for successful downloads. If `--list` is set, the CSV contains the CDX list of snapshots. If `--current` or `--full` is set, the CSV contains the downloaded files. Named as `waybackup_<sanitized_url>.csv`.
62
+
63
+
-**`--skip`**`<path>`:<br>
64
+
Path defaults to output-dir. Checks for an existing `waybackup_<domain>.csv` for URLs to skip downloading. Useful for interrupted downloads. Files are checked by their root-domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path.
65
+
66
+
-**`--no-redirect`**:<br>
67
+
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
68
+
69
+
-**`--verbosity`**`<level>`:<br>
70
+
Sets verbosity level. Options are `json` (prints JSON response) or `progress` (shows progress bar).
71
+
72
+
-**`--retry`**`<attempts>`:<br>
73
+
Specifies number of retry attempts for failed downloads.
74
+
75
+
-**`--workers`**`<count>`:<br>
76
+
Sets the number of simultaneous download workers. Default is 1; about 10 is a safe value. Be cautious, as too many workers may lead to refused connections from the Wayback Machine.
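The interplay of `--retry` and `--workers` can be pictured as a bounded thread pool in which each download is retried a fixed number of times. The following is a minimal sketch under stated assumptions, not the tool's actual code: `download_all` and the `fetch` callable are hypothetical, and `retries` is assumed to mean extra attempts after the first one.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], None],
    workers: int = 1,
    retries: int = 0,
) -> dict[str, bool]:
    """Call fetch(url) for every URL on at most `workers` threads.

    Returns a mapping of URL to True (succeeded) or False (all attempts failed).
    """

    def attempt(url: str) -> bool:
        # One initial attempt plus `retries` further attempts (assumption).
        for _ in range(retries + 1):
            try:
                fetch(url)
                return True
            except Exception:
                continue
        return False

    url_list = list(urls)
    # max_workers caps the number of simultaneous downloads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(attempt, url_list))
    return dict(zip(url_list, results))
```

Capping `max_workers` is what keeps the number of parallel connections low; a real client would probably also pause between retries so that a server that is refusing connections is not hit again immediately.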
77
+
78
+
**CDX Query Handling:**
79
+
-**`--cdxbackup`**`<path>`:<br>
80
+
Path defaults to output-dir. Saves the result of the CDX query as a file. Useful for downloading snapshots later and for working around connections refused by the CDX server due to too many queries. Named as `waybackup_<sanitized_url>.cdx`.
81
+
82
+
-**`--cdxinject`**`<filepath>`:<br>
83
+
Injects a CDX query file to download snapshots. Ensure the query matches the previous `--url` for correct folder structure.
84
+
85
+
### Debug
86
+
87
+
-`--debug`: If set, full traceback will be printed in case of an error. The full exception will be written into `waybackup_error.log`.
68
88
69
89
### Examples
70
90
@@ -169,5 +189,5 @@ The csv contains the json response in a table format.
169
189
170
190
## Contributing
171
191
172
-
I'm always happy for some feature requests to improve the usability of this script.
173
-
Feel free to give suggestions and report issues. Project is still far from being perfect.
192
+
I'm always happy to receive feature requests that improve the usability of this tool.
193
+
Feel free to give suggestions and report issues. The project is still far from perfect.