
Commit 97e062a

Updates with Poplar SDK 2.4 release
1 parent: 9b61ab6

19 files changed: +316 / -110 lines

applications/pytorch/bert/README.md

Lines changed: 83 additions & 38 deletions
@@ -2,7 +2,7 @@
 
 This directory contains an implementation of BERT models in PyTorch for the IPU, leveraging the HuggingFace Transformers library. There are two examples:
 
-1. BERT for pretraining - `run_pretraining.py`
+1. BERT for pre-training - `run_pretraining.py`
 2. BERT for SQuAD - `run_squad.py`
 
 ## Environment setup
@@ -14,31 +14,31 @@ Then, create a virtual environment, install the required packages and build the
 ```console
 virtualenv venv -p python3.6
 source venv/bin/activate
-pip install -r requirements.txt
+pip3 install -r requirements.txt
 make
 ```
 
-## Run the pretraining application
+## Run the pre-training application
 
 Setup your environment as explained above and run the example with the configuration of your choice.
 
 ```console
-python run_pretraining.py --config demo_tiny_128
+python3 run_pretraining.py --config demo_tiny_128
 ```
 
 ## Configurations
 
-To see the available configurations for both SQuAD and pretraining see the `configs.yml` file.
+To see the available configurations for both SQuAD and pre-training see the `configs.yml` file.
 
 To see the available options available to use in the command line interface use the `--help` argument:
 
 ```console
-python run_pretraining.py --help
+python3 run_pretraining.py --help
 # or
-python run_squad.py --help
+python3 run_squad.py --help
 ```
 
-## Running pretraining with checkpointing
+## Running pre-training with checkpointing
 
 To enable the saving of model checkpoints on a run you need to add `--checkpoint-output-dir <path/to/checkpoint/dir>` to the command line. By default this will save a model checkpoint at the start and end of training.
 
@@ -48,67 +48,89 @@ To load model weights from a checkpoint directory use the flag `--pretrained-che
 
 ## Run the SQuAD application
 
-The question answering with SQuAD example is found in the `run_squad.py` script. Like with pretraining there are SQuAD configs defined in `configs.yml`.
+The question answering with SQuAD example is found in the `run_squad.py` script. Like with pre-training there are SQuAD configs defined in `configs.yml`.
 
 To run BERT-Base:
+
 ```console
-python run_squad.py --config squad_base_384
+python3 run_squad.py --config squad_base_384
 ```
 
 For BERT-Large there is `squad_large_384`, which is a high performance large configuration that uses an 8 IPU pipeline, unlike the other configs that use 4.
 
-You will also need to specify a pretrained checkpoint to fine-tune, which is specified with the `--pretrained-checkpoint <FILE-PATH/HF-model-hub-name>` flag.
+You will also need to specify a pre-trained checkpoint to fine-tune, which is specified with the `--pretrained-checkpoint <FILE-PATH/HF-model-hub-name>` flag.
 
 ## Caching executables
 
 When running the application, it is possible to save/load executables to/from a cache store. This allows for reusing a saved executable instead of re-compiling the model when re-running identical model configurations. To enable saving/loading from the cache store, use `--executable-cache-dir <relative/path/to/cache/store>` when running the application.
 
-## Running the entire pretraining and SQuAD pipeline
+## Running the entire pre-training and SQuAD pipeline
 
 For Base on POD16:
+
 ```console
-# Phase 1 pretraining
-python run_pretraining.py --config pretrain_base_128 --checkpoint-output-dir checkpoints/pretrain_base_128
+# Phase 1 pre-training
+python3 run_pretraining.py --config pretrain_base_128 --checkpoint-output-dir checkpoints/pretrain_base_128
 
-# Phase 2 pretraining
-python run_pretraining.py --config pretrain_base_384 --checkpoint-output-dir checkpoints/pretrain_base_384 --pretrained-checkpoint checkpoints/pretrain_base_128/step_N/
+# Phase 2 pre-training
+python3 run_pretraining.py --config pretrain_base_384 --checkpoint-output-dir checkpoints/pretrain_base_384 --pretrained-checkpoint checkpoints/pretrain_base_128/step_N/
+
+# To do phase 2 pretraining with a sequence length of 512, simply replace `384` with `512`.
 
 # SQuAD fine-tuning
-python run_squad.py --config squad_base_384 --pretrained-checkpoint checkpoints/pretrain_base_384/step_N/
+python3 run_squad.py --config squad_base_384 --pretrained-checkpoint checkpoints/pretrain_base_384/step_N/
 ```
 
 For Large on POD16:
+
 ```console
 # Phase 1 pretraining
-python run_pretraining.py --config pretrain_large_128 --checkpoint-output-dir checkpoints/pretrain_large_128
+python3 run_pretraining.py --config pretrain_large_128 --checkpoint-output-dir checkpoints/pretrain_large_128
 
 # Phase 2 pretraining
-python run_pretraining.py --config pretrain_large_384 --checkpoint-output-dir checkpoints/pretrain_large_384 --pretrained-checkpoint checkpoints/pretrain_large_128/step_N/
+python3 run_pretraining.py --config pretrain_large_384 --checkpoint-output-dir checkpoints/pretrain_large_384 --pretrained-checkpoint checkpoints/pretrain_large_128/step_N/
+
+# To do the same on POD64, simply append `_POD64` to the pretraining config names. To do phase 2 pretraining with a sequence length of 512, simply replace `384` with `512`.
 
 # SQuAD fine-tuning
-python run_squad.py --config squad_large_384 --pretrained-checkpoint checkpoints/pretrain_large_384/step_N/
+python3 run_squad.py --config squad_large_384 --pretrained-checkpoint checkpoints/pretrain_large_384/step_N/
 ```
 
 To do the same on POD64, simply append `_POD64` to the pretraining config names.
 
-## Run the tests (optional)
+## POD128 configurations
+
+PopDist and PopRun allow to seamlessly launch applications on large IPU-POD systems such as POD128. Further details about them can be found in the [docs]( https://docs.graphcore.ai/projects/poprun-user-guide/en/latest/index.html).
+
+We provide utility scripts to run the phase 1 and phase 2 pretraining in POD128. They can be executed as:
+
+```console
+# Phase 1 pretraining in POD128
+bash training_scripts/pretrain_large_128_POD128.sh
+
+# Phase 2 pretraining in POD128
+bash training_scripts/pretrain_large_384_POD128.sh
+```
 
-Setup your environment and generate the sample dataset as explained above and run `python -m pytest` from the root folder.
+The resulting pretraining checkpoint can be fine-tuned for SQuAD in a POD16 as described before.
 
+## Run the tests (optional)
+
+Setup your environment and generate the sample dataset as explained above and run `python3 -m pytest` from the root folder.
 
 ## Generate sample_text dataset (optional)
 
 The sample text provided enables training on a very small dataset for small scale testing.
-For convenience it is already provided in the `/data` folder in txt and tfrecord format.
+For convenience it is already provided in the `/data` folder in `txt` and `tfrecord` format.
 In order to re-generate the sample dataset, run the following script:
 
 ```console
-python third_party/create_pretraining_data.py --input-file data/sample_text.txt --output-file data/sample_text.tfrecord --sequence-length 128 --mask-tokens 20 --duplication-factor 4 --do-lower-case --model bert-base-uncased
+python3 third_party/create_pretraining_data.py --input-file data/sample_text.txt --output-file data/sample_text.tfrecord --sequence-length 128 --mask-tokens 20 --duplication-factor 4 --do-lower-case --model bert-base-uncased
 ```
 
 ## Generate pretraining dataset (optional)
 
-The dataset used for pretraining is WIKI-103. It can be generated from a RAW dump of Wikipedia following a four step process.
+The dataset used for pretraining is WIKI-103. It can be generated from a RAW dump of Wikipedia following a five step process.
 
 ### 1. Download
 
@@ -118,13 +140,13 @@ Use the `wikipedia_download.sh` script to download the latest Wikipedia dump, ab
 ./data/wikipedia_download.sh <chosen-path-for-dump-file>
 ```
 
-Dumps are available from https://dumps.wikimedia.org/ (and mirrors) and are licensed under CC BY-SA 3.0 and GNU Free Documentation Licenses.
+Dumps are available from <https://dumps.wikimedia.org/> (and mirrors) and are licensed under CC BY-SA 3.0 and GNU Free Documentation Licenses.
 
 ### 2. Extraction
 
 In order to create the pre-training data we need to extract the Wikipedia dump and put it in this form:
 
-```
+```text
 <doc id = article1>
 Title of article 1
 
@@ -141,46 +163,69 @@ Body of article 2
 
 and so on.
 
-One of the tools that can be used to do so is WikiExtractor, https://github.com/attardi/wikiextractor.
+One of the tools that can be used to do so is WikiExtractor, <https://github.com/attardi/wikiextractor>.
+Install the WikiExtractor package with `pip3 install wikiextractor`.
+
+In order not to encounter a `UnicodeEncodeError` at this step, you may want to run these two commands first:
+
+```console
+export PYTHONIOENCODING=utf-8
+export LC_ALL=C.UTF-8
+```
 
-You can use the the `wikipedia_extract.sh` script to use WikiExtractor to extract the data dump.
+You can then use the the `wikipedia_extract.sh` script to use WikiExtractor to extract the data dump.
 
 ```console
 ./data/wikipedia_extract.sh <chosen-path-for-dump-file>/wikidump.xml <chosen-folder-for-extracted-files>
 ```
 
-The result should be a folder containing directories named `AA`, `AB`...
+The result should be a folder containing directories named `AA`, `AB`, ...
+Note that the number of directories depends on the parameters of the `wikipedia_extract.sh` script, and is not to be confused with alphabetical ordering of the wikipedia articles.
+In other words you should probably not expect all of `AC`, `AD`, ... `ZX`, `ZY`, `ZZ` to be created by the script.
 
 ### 3. Pre-processing
 
-Install nltk package with `pip install nltk`.
+Install nltk package with `pip3 install nltk`.
 Use the `wikipedia_preprocess.py` script to preprocess the extracted files.
 
 ```console
-./data/wikipedia_preprocess.py --input-file-path <chosen-folder-for-extracted-files> --output-file-path <chosen-folder-for-preprocessed-files>
+python3 ./data/wikipedia_preprocess.py --input-file-path <chosen-folder-for-extracted-files> --output-file-path <chosen-folder-for-preprocessed-files>
 ```
 
 ### 4. Tokenization
 
-The script `create_pretraining_data.py` can accept a glob of input files to tokenise. However, attempting to process them all at once may result in the process being killed by the OS for consuming too much memory. It is therefore preferable to convert the files in groups. This is handled by the `./data/wikipedia_tokenize.py` script. At the same time, it is worth bearing in mind that `create_pretraining_data.py` shuffles the training instances across the loaded group of files, so a larger group would result in better shuffling of the samples seen by BERT during pre-training.
+The script `create_pretraining_data.py` can accept a glob of input files to tokenize.
+However, attempting to process them all at once may result in the process being killed by the OS for consuming too much memory.
+It is therefore preferable to convert the files in groups. This is handled by the `./data/wikipedia_tokenize.py` script.
+At the same time, it is worth bearing in mind that `create_pretraining_data.py` shuffles the training instances across the loaded group of files, so a larger group would result in better shuffling of the samples seen by BERT during pre-training.
+
+The tokenization depends on `tensorflow` which can be installed by `pip3 install tensorflow`.
 
 sequence length 128
+
 ```console
-./data/wikipedia_tokenize.py <chosen-folder-for-preprocessed-files> <chosen-folder-for-dataset-files> --sequence-length 128 --mask-tokens 20
+python3 ./data/wikipedia_tokenize.py <chosen-folder-for-preprocessed-files> <chosen-folder-for-dataset-files> --sequence-length 128 --mask-tokens 20
 ```
 
 sequence length 384
+
+```console
+python3 ./data/wikipedia_tokenize.py <chosen-folder-for-preprocessed-files> <chosen-folder-for-dataset-files> --sequence-length 384 --mask-tokens 56
+```
+
+sequence length 512
+
 ```console
-./data/wikipedia_tokenize.py <chosen-folder-for-preprocessed-files> <chosen-folder-for-dataset-files> --sequence-length 384 --mask-tokens 56
+python3 ./data/wikipedia_tokenize.py <chosen-folder-for-preprocessed-files> <chosen-folder-for-dataset-files> --sequence-length 512 --mask-tokens 76
 ```
 
-### Indexing
+### 5. Indexing
 
-In order to use the multi-threaded dataloader, tfrecord index files need to be generated.
+In order to use the multi-threaded `dataloader`, `tfrecord` index files need to be generated.
 First install the `tfrecord` Python package into your Python environment:
 
 ```console
-pip install tfrecord
+pip3 install tfrecord
 ```
 
 Then go to the directory containing the pre-processed Wikipedia files and run:

applications/pytorch/bert/README_Benchmarks.md

Lines changed: 31 additions & 8 deletions
@@ -18,7 +18,11 @@ Run the following commands from inside the applications/pytorch/bert/ directory.
 
 Command:
 ```console
-python run_pretraining.py --config pretrain_base_128 --training-steps 10 --input-file $DATASETS_DIR/wikipedia/128/wiki_1[0-1]*.tfrecord --disable-progress-bar
+python3 run_pretraining.py \
+--config pretrain_base_128 \
+--training-steps 10 \
+--input-file $DATASETS_DIR/wikipedia/128/wiki_1[0-1]*.tfrecord \
+--disable-progress-bar
 ```
 
 ### Pretrain BERT-Base Sequence Length 384
@@ -27,7 +31,11 @@ python run_pretraining.py --config pretrain_base_128 --training-steps 10 --input
 
 Command:
 ```console
-python run_pretraining.py --config pretrain_base_384 --training-steps 10 --input-file $DATASETS_DIR/wikipedia/384/wiki_1[0-1]*.tfrecord --disable-progress-bar
+python3 run_pretraining.py \
+--config pretrain_base_384 \
+--training-steps 10 \
+--input-file $DATASETS_DIR/wikipedia/384/wiki_1[0-1]*.tfrecord \
+--disable-progress-bar
 ```
 
 ### Pretrain BERT-Large Sequence Length 128
@@ -36,17 +44,24 @@ python run_pretraining.py --config pretrain_base_384 --training-steps 10 --input
 
 Command:
 ```console
-python run_pretraining.py --config pretrain_large_128 --training-steps 10 --input-file $DATASETS_DIR/wikipedia/128/wiki_1[0-1]*.tfrecord --disable-progress-bar
+python3 run_pretraining.py \
+--config pretrain_large_128 \
+--training-steps 10 \
+--input-file $DATASETS_DIR/wikipedia/128/wiki_1[0-1]*.tfrecord \
+--disable-progress-bar
 ```
 
 #### 1 x IPU-POD64
 
 Command:
 ```console
-python run_pretraining.py --config pretrain_large_128_POD64 --training-steps 10 --input-file $DATASETS_DIR/wikipedia/128/wiki_1[0-1]*.tfrecord --disable-progress-bar
+python3 run_pretraining.py \
+--config pretrain_large_128_POD64 \
+--training-steps 10 \
+--input-file $DATASETS_DIR/wikipedia/128/wiki_1[0-1]*.tfrecord \
+--disable-progress-bar
 ```
 
-
 #### 1 x IPU-POD128
 
 #### 1 x IPU-POD128
@@ -91,7 +106,7 @@ python run_pretraining.py --config pretrain_large_128_POD64 --replication-factor
 --replicated-tensor-sharding True \
 --random-seed 1984 \
 --input-files $DATASETS_DIR/wikipedia/torch_bert/128/*.tfrecord
-```CAL_HOME}/exec_cache" python run_pretraining.py --config configs/pretrain_large_128_phase1_POD128.json --train-file "$DATASETS_DIR/tf_wikipedia/tokenised_128_dup5_mask20/*.tfrecord"
+
 ```
 
 ### Pretrain BERT-Large Sequence Length 384
@@ -100,14 +115,22 @@ python run_pretraining.py --config pretrain_large_128_POD64 --replication-factor
 
 Command:
 ```console
-python run_pretraining.py --config pretrain_large_384 --training-steps 10 --input-file $DATASETS_DIR/wikipedia/384/wiki_1[0-1]*.tfrecord --disable-progress-bar
+python3 run_pretraining.py \
+--config pretrain_large_384 \
+--training-steps 10 \
+--input-file $DATASETS_DIR/wikipedia/384/wiki_1[0-1]*.tfrecord \
+--disable-progress-bar
 ```
 
 #### 1 x IPU-POD64
 
 Command:
 ```console
-python run_pretraining.py --config pretrain_large_384_POD64 --training-steps 10 --input-file $DATASETS_DIR/wikipedia/384/wiki_1[0-1]*.tfrecord --disable-progress-bar
+python3 run_pretraining.py \
+--config pretrain_large_384_POD64 \
+--training-steps 10 \
+--input-file $DATASETS_DIR/wikipedia/384/wiki_1[0-1]*.tfrecord \
+--disable-progress-bar
 ```
 
 #### 1 x IPU-POD128

applications/pytorch/bert/args.py

Lines changed: 5 additions & 4 deletions
@@ -62,7 +62,8 @@ def parse_bert_args(args=None):
         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 
     # Execution
-    parser.add_argument("--batch-size", type=int, help="Set the micro batch-size")
+    parser.add_argument("--micro-batch-size", type=int,
+                        help="Set the micro-batch-size. This is the single forward-backward path batch-size on one replica")
     parser.add_argument("--training-steps", type=int, help="Number of training steps")
    parser.add_argument("--batches-per-step", type=int, help="Number of batches per training step")
     parser.add_argument("--replication-factor", type=int, help="Number of replicas")
@@ -208,11 +209,11 @@ def parse_bert_args(args=None):
         parser.error("checkpoint-steps must be >=1")
 
     if args.use_popdist:
-        args.global_batch_size = args.replication_factor * args.gradient_accumulation * args.batch_size * args.popdist_size
+        args.global_batch_size = args.replication_factor * args.gradient_accumulation * args.micro_batch_size * args.popdist_size
     else:
-        args.global_batch_size = args.replication_factor * args.gradient_accumulation * args.batch_size
+        args.global_batch_size = args.replication_factor * args.gradient_accumulation * args.micro_batch_size
 
-    args.samples_per_step = args.replication_factor * args.gradient_accumulation * args.batch_size * args.batches_per_step
+    args.samples_per_step = args.replication_factor * args.gradient_accumulation * args.micro_batch_size * args.batches_per_step
     args.intermediate_size = args.hidden_size * 4
 
     return args
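
As an illustration of the batch-size arithmetic behind the `--batch-size` to `--micro-batch-size` rename above, the derived quantities combine as follows. This is a minimal standalone sketch with made-up values, not code from the repository:

```python
# Illustrative values only; real values come from configs.yml or the command line.
replication_factor = 4      # data-parallel replicas
gradient_accumulation = 32  # micro-batches accumulated per weight update
micro_batch_size = 8        # samples per forward-backward pass on one replica
batches_per_step = 1        # device iterations per host call
popdist_size = 4            # PopDist instances (only relevant when launched via PopRun)

# Single-instance run (no PopDist):
global_batch_size = replication_factor * gradient_accumulation * micro_batch_size

# Multi-instance run: the global batch additionally scales with the number of instances.
global_batch_size_popdist = global_batch_size * popdist_size

# Samples consumed by one training step on the host side.
samples_per_step = replication_factor * gradient_accumulation * micro_batch_size * batches_per_step

print(global_batch_size, global_batch_size_popdist, samples_per_step)  # 1024 4096 1024
```

With these example numbers, one weight update sees 1024 samples per instance (4096 across four PopDist instances), and each host step feeds 1024 samples to the device.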

applications/pytorch/bert/checkpointing.py

Lines changed: 14 additions & 6 deletions
@@ -27,15 +27,23 @@ def checkpoints_exist(path):
     return False
 
 
-def save_checkpoint(config, model, step, metrics=None):
+def save_checkpoint(config, model, step, optimizer=None, metrics=None):
     if config.checkpoint_output_dir:
         path = os.path.join(os.path.abspath(config.checkpoint_output_dir), f"step_{step}")
         os.makedirs(path, exist_ok=True)
 
         logger(f"Saving checkpoint for step {step} to: {path}\n")
         model.save_pretrained(path)
-        torch.save({
-            "step": step,
-            "metrics": metrics,
-            "config": config
-        }, os.path.join(path, "training_state.pt"))
+        if optimizer is None:
+            torch.save({
+                "step": step,
+                "metrics": metrics,
+                "config": config
+            }, os.path.join(path, "training_state.pt"))
+        else:
+            torch.save({
+                "step": step,
+                "optimizer_state_dict": optimizer.state_dict(),
+                "metrics": metrics,
+                "config": config
+            }, os.path.join(path, "training_state.pt"))
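
With this change, `save_checkpoint` can persist the optimizer state alongside the step, metrics and config. For context, here is a rough sketch of how such a `training_state.pt` could be read back. It is illustrative only: the checkpoint path, model and optimizer below are placeholders, and the repository's own resume logic may differ:

```python
import torch

# Hypothetical checkpoint directory written by save_checkpoint above.
checkpoint_path = "checkpoints/pretrain_base_128/step_100/training_state.pt"

state = torch.load(checkpoint_path)
start_step = state["step"]
metrics = state["metrics"]
config = state["config"]

# Placeholders; in practice these must match the model/optimizer configuration
# that produced the checkpoint.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters())

# The optimizer state is only present when save_checkpoint was given an optimizer.
if "optimizer_state_dict" in state:
    optimizer.load_state_dict(state["optimizer_state_dict"])
```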
