applications/pytorch/bert/README.md

This directory contains an implementation of BERT models in PyTorch for the IPU, leveraging the HuggingFace Transformers library. There are two examples:

1. BERT for pre-training - `run_pretraining.py`
2. BERT for SQuAD - `run_squad.py`

## Environment setup

Then, create a virtual environment, install the required packages and build with `make`:

```console
virtualenv venv -p python3.6
source venv/bin/activate
pip3 install -r requirements.txt
make
```

## Run the pre-training application

Set up your environment as explained above and run the example with the configuration of your choice.

```console
python3 run_pretraining.py --config demo_tiny_128
```

## Configurations

To see the available configurations for both SQuAD and pre-training, see the `configs.yml` file.

To see all the options available in the command line interface, use the `--help` argument:

```console
python3 run_pretraining.py --help
# or
python3 run_squad.py --help
```

## Running pre-training with checkpointing

To enable saving of model checkpoints during a run, add `--checkpoint-output-dir <path/to/checkpoint/dir>` to the command line. By default, this will save a model checkpoint at the start and end of training.

To load model weights from a checkpoint directory, use the `--pretrained-checkpoint` flag.
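
For example, a minimal sketch using the flags above (the checkpoint directory path is only a placeholder, and `demo_tiny_128` is simply the config used earlier in this README):

```console
# Save a model checkpoint at the start and end of training (the default behaviour)
python3 run_pretraining.py --config demo_tiny_128 --checkpoint-output-dir checkpoints/demo_tiny_128

# Load model weights from a previously saved checkpoint directory
python3 run_pretraining.py --config demo_tiny_128 --pretrained-checkpoint <path/to/checkpoint/dir>
```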

## Run the SQuAD application

The question answering with SQuAD example is found in the `run_squad.py` script. Like with pre-training there are SQuAD configs defined in `configs.yml`.

To run BERT-Base:

```console
python3 run_squad.py --config squad_base_384
```

For BERT-Large there is `squad_large_384`, a high-performance configuration that uses an 8-IPU pipeline, unlike the other configs, which use 4.

You will also need to specify a pre-trained checkpoint to fine-tune, using the `--pretrained-checkpoint <FILE-PATH/HF-model-hub-name>` flag.
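
For example, to fine-tune BERT-Large on SQuAD from a pre-trained checkpoint (the checkpoint argument below is a placeholder for either a file path or a HuggingFace model hub name):

```console
python3 run_squad.py --config squad_large_384 --pretrained-checkpoint <FILE-PATH/HF-model-hub-name>
```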

## Caching executables

When running the application, it is possible to save/load executables to/from a cache store. This allows for reusing a saved executable instead of re-compiling the model when re-running identical model configurations. To enable saving/loading from the cache store, use `--executable-cache-dir <relative/path/to/cache/store>` when running the application.
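
A sketch of how this could be combined with the pre-training example above (the cache directory name is arbitrary):

```console
# First run compiles the model and saves the executable into ./exe_cache
python3 run_pretraining.py --config demo_tiny_128 --executable-cache-dir ./exe_cache

# Re-running the identical configuration loads the cached executable instead of recompiling
python3 run_pretraining.py --config demo_tiny_128 --executable-cache-dir ./exe_cache
```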

## Running the entire pre-training and SQuAD pipeline

To do the same on POD64, simply append `_POD64` to the pre-training config names. To do phase 2 pre-training with a sequence length of 512, simply replace `384` with `512`.
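
For example, using placeholders rather than real config names (see `configs.yml` for those):

```console
# Phase 1 on POD64: append _POD64 to the phase 1 config name from configs.yml
python3 run_pretraining.py --config <phase-1-config>_POD64

# Phase 2 with sequence length 512: use the phase 2 config name with 384 replaced by 512
python3 run_pretraining.py --config <phase-2-512-config>
```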

## POD128 configurations

PopDist and PopRun make it possible to seamlessly launch applications on large IPU-POD systems such as POD128. Further details about them can be found in the [docs](https://docs.graphcore.ai/projects/poprun-user-guide/en/latest/index.html).
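
Purely as an illustration of the kind of distributed launch that PopRun enables (the flags and placeholders below are generic and are not taken from this repository's scripts, which should be preferred):

```console
# Illustrative only: a generic poprun launch; the provided utility scripts set the
# actual instance/replica counts and options required for POD128
poprun --num-instances <instances> --num-replicas <replicas> python3 run_pretraining.py --config <pretraining-config>
```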

We provide utility scripts to run phase 1 and phase 2 pre-training on POD128.

The result should be a folder containing directories named `AA`, `AB`, ...
Note that the number of directories depends on the parameters of the `wikipedia_extract.sh` script, and is not to be confused with alphabetical ordering of the Wikipedia articles.
In other words, you should probably not expect all of `AC`, `AD`, ... `ZX`, `ZY`, `ZZ` to be created by the script.

### 3. Pre-processing

Install the `nltk` package with `pip3 install nltk`.
Use the `wikipedia_preprocess.py` script to preprocess the extracted files.
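
A hypothetical invocation is sketched below; the argument names are illustrative assumptions, so check the script's `--help` output for its real interface:

```console
# Hypothetical flags -- the actual argument names may differ
python3 wikipedia_preprocess.py --input-file-path <path/to/extracted/files> --output-file-path <path/to/preprocessed/files>
```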

The script `create_pretraining_data.py` can accept a glob of input files to tokenize.
However, attempting to process them all at once may result in the process being killed by the OS for consuming too much memory.
It is therefore preferable to convert the files in groups. This is handled by the `./data/wikipedia_tokenize.py` script.
At the same time, it is worth bearing in mind that `create_pretraining_data.py` shuffles the training instances across the loaded group of files, so a larger group would result in better shuffling of the samples seen by BERT during pre-training.
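
As a sketch of the grouped tokenization step (the arguments shown are illustrative assumptions; consult the script itself for its real interface):

```console
# Hypothetical invocation -- the real arguments may differ; the script tokenizes the
# preprocessed files in groups so each create_pretraining_data.py call fits in memory
python3 ./data/wikipedia_tokenize.py <path/to/preprocessed/files> <path/to/vocab.txt> --sequence-length 128
```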

The tokenization depends on `tensorflow`, which can be installed with `pip3 install tensorflow`.