Commit 9b61ab6

Updates with Poplar SDK 2.3 release
1 parent 788ead5

410 files changed: 20414 additions & 9952 deletions

.gitignore

Lines changed: 6 additions & 0 deletions

@@ -43,3 +43,9 @@ vars.capnp
 # Virtual environments
 *virtualenv/
 .venv/
+
+# Jupyter notebook checkpoint
+**/.ipynb_checkpoints
+nohup.*
+**/wandb
+
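
A quick way to confirm the new ignore patterns behave as intended is `git check-ignore`; the paths below are illustrative and not part of the commit.

```console
git check-ignore -v notebooks/.ipynb_checkpoints nohup.out applications/popart/bert/wandb
```

Each ignored path is reported together with the `.gitignore` pattern and line that matched it.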

README.md

Lines changed: 1 addition & 1 deletion

@@ -11,7 +11,7 @@ repository. If you are actively using this repository and want to report any iss
 
 The latest version of the documentation for the Poplar software stack, and other developer resources, is available at https://www.graphcore.ai/developer.
 
-> The code presented here requires using Poplar SDK 2.2.x
+> The code presented here requires using Poplar SDK 2.3.x
 
 Please install and enable the Poplar SDK following the instructions in the Getting Started guide for your IPU system.
 
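
Enabling the SDK typically means sourcing the Poplar and PopART enable scripts shipped with the 2.3.x download before running anything in this repository; the paths below are placeholders, so follow the Getting Started guide for your IPU system for the exact location.

```console
# Placeholder paths: substitute wherever the Poplar SDK 2.3.x tarball was unpacked
source /path/to/poplar_sdk-2.3.0/poplar-*/enable.sh
source /path/to/poplar_sdk-2.3.0/popart-*/enable.sh
```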

applications/popart/bert/Makefile

Lines changed: 1 addition & 6 deletions

@@ -7,16 +7,11 @@ custom_ops.so: custom_ops/plugin_version custom_ops/*.cpp custom_ops/workarounds
 g++ -std=c++14 -fPIC \
 -DSTATIC_VERSION=\"${shell ./custom_ops/plugin_version}\" \
 -DONNX_NAMESPACE=onnx \
-custom_ops/detach.cpp \
+custom_ops/attention_mask.cpp \
 custom_ops/disable_attn_dropout_bwd_pattern.cpp \
-custom_ops/sparse_accumulate.cpp \
-custom_ops/sparse_accumulate_pattern.cpp \
-custom_ops/embedding_gather.cpp \
 custom_ops/tied_gather.cpp \
 custom_ops/tied_gather_pattern.cpp \
-custom_ops/lamb_serialised_weight_pattern.cpp \
 custom_ops/workarounds/prevent_const_expr_folding_op.cpp \
-custom_ops/workarounds/accumulate_priority_pattern.cpp \
 -shared -lpopart -lpoplar -lpoplin -lpopnn -lpopops -lpoputil -lpoprand \
 -o custom_ops.so
 
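
The build rule itself is unchanged apart from the source list, so the custom ops are still rebuilt with a plain `make`. The load check below is a sketch that assumes the SDK has been enabled and that the resulting `custom_ops.so` sits in the current directory.

```console
make
python3 -c "import ctypes; ctypes.cdll.LoadLibrary('./custom_ops.so')"  # raises OSError if the Poplar/PopART libraries are not on the library path
```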

applications/popart/bert/README.md

Lines changed: 8 additions & 16 deletions

@@ -44,10 +44,10 @@ The following files are provided for running the BERT benchmarks.
 | --------------- | ------------------------------------------------------------ |
 | `bert.py` | Main training loop |
 | `bert_model.py` | BERT model definition |
-| `utils.py` | Utility functions |
-| `bert_data/` | Directory containing the data pipeline and training data generation <br /><br />- `dataset.py` - Dataloader and preprocessing. Loads binary files into Numpy arrays to be passed `popart.PyStepIO`, with shapes based on training options, `--batches-per-step` & `--pipeline` <br /><br /> -`create_pretraining_data.py` - Script to generate binary files to be loaded from text data |
+| `utils/` | Utility functions |
+| `bert_data/` | Directory containing the data pipeline and training data generation <br /><br />- `dataset.py` - Dataloader and preprocessing. Loads binary files into Numpy arrays to be passed `popart.PyStepIO`, with shapes to match the configuration <br /><br /> -`create_pretraining_data.py` - Script to generate binary files to be loaded from text data |
 | `configs/` | Directory containing JSON configuration files to be used by the `--config` argument. |
-| `custom_ops/` | Directory containing custom PopART operators. These are optimised parts of the graph that target Poplar/PopLibs operations directly.<br /> - `attention.cpp` - This operation is the fwd and grad implementation for multi-headed self-attention.<br/> - `detach.cpp` - This operation is an identity with no grad implementation. This allows for the embedding dictionary to only be updated by its use in the projection.<br/> -`embeddingGather.cpp` - This operation is a modification of the PopART Gather to ensure correct layout of the weights. |
+| `custom_ops/` | Directory containing custom PopART operators. These are optimised parts of the graph that target Poplar/PopLibs operations directly. |
 
 
 ## Quick start guide

@@ -161,18 +161,14 @@ For the sample text a configuration has been created - `configs/demo.json`. It
 {
 # Two layers as our dataset does not need the capacity of the usual 12 Layer BERT Base
 "num_layers": 2,
-"no_dropout": true,
 "popart_dtype": "FLOAT16",
-"loss_scaling": 1.0,
-"stochastic_rounding": true,
 # The data generation should have created 64 samples. Therefore, we will do an epoch per session.run
 "batches_per_step": 64,
-"epochs": 150,
+"training_steps": 500,
 # Here we specify the file we created in the previous step.
 "input_files": [
 "data/sample_text.bin"
-]
-"shuffle": true,
+],
 "no_validation": true
 }
 ```

@@ -183,7 +179,7 @@ Run this config:
 python3 bert.py --config configs/demo.json
 ```
 
-This will compile the graph and run for 150 epochs. At end our model should have overfit to 100% test accuracy.
+This will compile the graph and run for 500 training steps. At end our model should have overfit to 100% test accuracy.
 
 ##### View the pre-training results in Tensorboard
 

@@ -235,14 +231,10 @@ How to get the SQuAD 1.1 files required for inference is described in `bert_data
 
 To run SQuAD BERT Base inference with a sequence length of 128:
 
-`python3 bert.py --config configs/{mk1,mk2}/squad_base_128_inference.json`
+`python3 bert.py --config configs/{mk1,mk2}/squad_base_128_inf.json`
 
 and for BERT Large with a sequence length of 384:
 
-`python3 bert.py --config configs/{mk1,mk2}/squad_large_384_inference.json`
+`python3 bert.py --config configs/{mk1,mk2}/squad_large_384_inf.json`
 
 View the JSON files in configs for detailed parameters.
-
-It is also possible to run inference on the pretraining graph to validate the MLM/NSP results. It requires input files to be provided, either by adding them to the config or by using the following command-line for sequence length of 128:
-
-`python3 bert.py --config configs/{mk1,mk2}/mlm_nsp_base_128_inference.json --input-files <path_to_input_file>`
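
The hunk above keeps the existing "View the pre-training results in Tensorboard" heading. A minimal way to do that is sketched below; the `logs` directory name is an assumption, so check the config or command-line options for where the run actually writes its event files.

```console
pip install tensorboard   # if it is not already in the virtual environment
tensorboard --logdir logs --port 6006
```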

applications/popart/bert/README_Benchmarks.md

Lines changed: 91 additions & 20 deletions

@@ -23,7 +23,48 @@ python bert.py --config configs/mk2/pretrain_large_128.json --input-files=$DATAS
 
 Command:
 ```console
-python bert.py --config configs/mk2/pretrain_large_128_POD64.json --input-files=$DATASETS_DIR/wikipedia/AA/sequence_128/wiki_*_tokenised --replication 16 --wandb
+python bert.py --config configs/mk2/pretrain_large_128_POD64.json --input-files=$DATASETS_DIR/wikipedia/AA/sequence_128/wiki_*_tokenised --checkpoint-dir "checkpoint/phase1"
+```
+
+#### 1 x IPU-POD128
+export POPLAR_ENGINE_OPTIONS='{"target.hostSyncTimeout": "1200", "target.syncReplicasIndependently": "true"}'
+export HOROVOD_STALL_CHECK_TIME_SECONDS=120
+export HOROVOD_POPART_BROADCAST_TIMEOUT=120
+
+$PARTITION, $IPUOF_VIPU_API_PARTITION_ID: ID of the Pod64 reconfigurable partition
+$TCP_IF_INCLUDE: sets the default route for traffic between Poplar hosts. It should be configured for a network to which all Poplar hosts have access, and for which the interfaces only have a single IP address.
+$VIPU_SERVER_HOST: IP address as appropriate for the target hardware
+$HOSTS: IP address of the main host server
+
+Command:
+```console
+poprun -vv --num-instances=2 --num-replicas=32 \
+--num-ilds=2 \
+--ipus-per-replica=4 \
+--vipu-server-host="$VIPU_SERVER_HOST" \
+--host=$HOSTS \
+--vipu-partition=gcl128 \
+--vipu-cluster=c128 \
+--update-partition=yes \
+--remove-partition=no \
+--reset-partition=no \
+--print-topology=yes \
+--vipu-server-timeout=1200 \
+--mpi-global-args="--tag-output \
+--allow-run-as-root \
+--mca btl_tcp_if_include $TCP_IF_INCLUDE \
+--mca oob_tcp_if_include $TCP_IF_INCLUDE" \
+--mpi-local-args="-x OPAL_PREFIX \
+-x CPATH \
+-x IPUOF_VIPU_API_TIMEOUT=1200 \
+-x POPLAR_ENGINE_OPTIONS \
+-x HOROVOD_STALL_CHECK_TIME_SECONDS \
+-x HOROVOD_POPART_BROADCAST_TIMEOUT" \
+python3 bert.py --config configs/mk2/pretrain_large_128.json \
+--replication-factor 16 \
+--loss-scaling 4096 \
+--replicated-tensor-sharding True \
+--input-files $DATASETS_DIR/wikipedia/AA/sequence_128/*
 ```
 
 ### BERT-Large Phase 2 Pre-training Sequence length 384

@@ -39,7 +80,50 @@ python bert.py --config configs/mk2/pretrain_large_384.json --input-files=$DATAS
 
 Command:
 ```console
-python3 bert.py --config configs/mk2/pretrain_large_384.json --input-files=$DATASETS_DIR/wikipedia/AA/sequence_384/wiki_00_tokenised --replication 16 --loss-scaling 4096 --epochs 2 --wandb
+python bert.py --config configs/mk2/pretrain_large_384_POD64.json --input-files=$DATASETS_DIR/wikipedia/AA/sequence_384/wiki_00_tokenised --epochs 2
+```
+
+#### 1 x IPU-POD128
+
+export POPLAR_ENGINE_OPTIONS='{"target.hostSyncTimeout": "1200", "target.syncReplicasIndependently": "true"}'
+export HOROVOD_STALL_CHECK_TIME_SECONDS=120
+export HOROVOD_POPART_BROADCAST_TIMEOUT=120
+
+$PARTITION, $IPUOF_VIPU_API_PARTITION_ID: ID of the Pod64 reconfigurable partition
+$TCP_IF_INCLUDE: sets the default route for traffic between Poplar hosts. It should be configured for a network to which all Poplar hosts have access, and for which the interfaces only have a single IP address.
+$VIPU_SERVER_HOST: IP address as appropriate for the target hardware
+$HOSTS: IP address of the main host server
+
+Command:
+```console
+poprun -vv --num-instances=2 --num-replicas=32 \
+--num-ilds=2 \
+--ipus-per-replica=4 \
+--vipu-server-host="$VIPU_SERVER_HOST" \
+--host=$HOSTS \
+--vipu-partition=gcl128 \
+--vipu-cluster=c128 \
+--update-partition=yes \
+--remove-partition=no \
+--reset-partition=no \
+--print-topology=yes \
+--vipu-server-timeout=1200 \
+--mpi-global-args="--tag-output \
+--allow-run-as-root \
+--mca btl_tcp_if_include $TCP_IF_INCLUDE \
+--mca oob_tcp_if_include $TCP_IF_INCLUDE" \
+--mpi-local-args="-x OPAL_PREFIX \
+-x CPATH \
+-x IPUOF_VIPU_API_TIMEOUT=1200 \
+-x POPLAR_ENGINE_OPTIONS \
+-x HOROVOD_STALL_CHECK_TIME_SECONDS \
+-x HOROVOD_POPART_BROADCAST_TIMEOUT" \
+python3 bert.py --config configs/mk2/pretrain_large_384.json \
+--replication-factor 16 \
+--replicated-tensor-sharding True \
+--input-files $DATASETS_DIR/wikipedia/AA/sequence_384/* \
+--wandb \
+--onnx-checkpoint checkpoints/mk2/pretrain_large_rank_0/21-10-06-01-20-46/model.onnx
 ```
 
 ### BERT-Base Phase 1 Pre-training Sequence length 128

@@ -51,7 +135,6 @@ Command:
 python bert.py --config configs/mk2/pretrain_base_128.json --input-files=$DATASETS_DIR/wikipedia/AA/sequence_128/wiki_00_tokenised --epochs 1 --no-model-save --no-validation --steps-per-log 1
 ```
 
-
 ### BERT-Base Phase 2 Pre-training Sequence length 384
 
 #### 1 x IPU-POD16

@@ -67,7 +150,7 @@ python bert.py --config configs/mk2/pretrain_base_384.json --input-files=$DATASE
 
 Command:
 ```console
-python bert.py --config configs/mk2/squad_large_384.json --input-files=$DATASETS_DIR/squad/train-v1.1.json --vocab-file=$DATASETS_DIR/ckpts/uncased_L-24_H-1024_A-16/vocab.txt --no-model-save --no-validation --steps-per-log 1
+python run_squad.py --squad-do-validation False --config squad_large_384_POD16 --num-epochs 1
 ```
 
 ## Inference

@@ -87,7 +170,7 @@ This benchmark spawns multiple replicas using mpirun. To obtain the total throug
 
 Command:
 ```console
-mpirun --tag-output --allow-run-as-root --np 4 python bert.py --task=SQUAD --layers-per-ipu 24 --num-layers=24 --hidden-size=1024 --attention-heads=16 --sequence-length=128 --dtype=FLOAT16 --batches-per-step=16 --generated-data=true --no-model-save --host-embedding=NONE --minimum-latency-inference=true --input-files=$DATASETS_DIR/squad/dev-v1.1.json --inference --encoder-start-ipu=0 --use-default-available-memory-proportion=true --max-copy-merge-size=-1 --shuffle=false --micro-batch-size 1 --enable-half-partials --epochs-inference 20 --group-host-sync --no-outlining=false --steps-per-log=1
+mpirun --tag-output --np 4 --allow-run-as-root python bert.py --config configs/mk2/squad_large_128_inf.json --micro-batch-size {batchsize} --generated-data=true --epochs-inference 20 --input-files=$DATASETS_DIR/squad/dev-v1.1.json
 ```
 
 Set --micro-batch-size to 1, 2 or 3.

@@ -100,20 +183,8 @@ This benchmark spawns multiple replicas using mpirun. To obtain the total throug
 
 Command:
 ```console
-mpirun --tag-output --allow-run-as-root --np 4 python bert.py --task=SQUAD --layers-per-ipu 12 --num-layers=12 --hidden-size=768 --attention-heads=12 --sequence-length=128 --dtype=FLOAT16 --batches-per-step=16 --generated-data=true --no-model-save --host-embedding=NONE --minimum-latency-inference=true --input-files=$DATASETS_DIR/squad/dev-v1.1.json --inference --encoder-start-ipu=0 --use-default-available-memory-proportion=true --max-copy-merge-size=-1 --shuffle=false --micro-batch-size 1 --enable-half-partials --epochs-inference 10 --group-host-sync --no-outlining=false --steps-per-log=1
-```
-
-Set --micro-batch-size to 1, 2, 4, 8, 16, 32, 64 or 80.
-
-### BERT 3-layer Base Inference Sequence length 128
-
-#### 1 x IPU-M2000
-
-This benchmark spawns multiple replicas using mpirun. To obtain the total throughput, sum the reported throughputs for each iteration.
-
-Command:
-```console
-mpirun --tag-output --allow-run-as-root --np 4 python3 bert.py --task SQUAD --layers-per-ipu=3 --num-layers=3 --hidden-size=768 --attention-heads=12 --sequence-length=128 --dtype=FLOAT16 --batches-per-step=2048 --generated-data=true --no-model-save --host-embedding=NONE --low-latency-inference=false --minimum-latency-inference=true --input-files=$DATASETS_DIR/squad/dev-v1.1.json --inference --encoder-start-ipu=0 --use-default-available-memory-proportion=true --max-copy-merge-size=-1 --shuffle=false --micro-batch-size 1 --enable-half-partials --epochs-inference 10 --group-host-sync --no-outlining=false --steps-per-log=1
+mpirun --tag-output --np 4 --allow-run-as-root python bert.py --config configs/mk2/squad_base_128_inf.json --micro-batch-size {batchsize} --generated-data=true --epochs-inference 10 --input-files=$DATASETS_DIR/squad/dev-v1.1.json
 ```
 
-Set --micro-batch-size to 1, 2, 4, 8, 16, 32 or 64. Set --low-latency-inference to false or true. Set --minimum-latency-inference to true or false.
+Set --micro-batch-size to 1, 2, 4, 8, 16, 32, 64, or 80
+for micro-batch-size = 80, also set --available-memory-proportion 0.55
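
The IPU-POD128 commands added above rely on several environment variables that the notes only describe. The exports below are purely illustrative values showing the shape of that setup; every address, interface and partition name will differ on your system.

```console
# Illustrative values only: substitute the addresses and partition names for your installation
export VIPU_SERVER_HOST=10.1.3.101            # V-IPU controller host
export HOSTS=10.1.3.101,10.1.3.102            # Poplar host servers, one per poprun instance
export TCP_IF_INCLUDE=10.1.3.0/24             # network that all Poplar hosts can reach
export PARTITION=pod64_part
export IPUOF_VIPU_API_PARTITION_ID=$PARTITION
export DATASETS_DIR=/localdata/datasets
```

For reference, the flag arithmetic in both commands works out as 32 replicas x 4 IPUs per replica = 128 IPUs (one IPU-POD128), launched as 2 poprun instances of 16 replicas each, which matches the --replication-factor 16 passed to bert.py.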
