Commit 9b61ab6

Updates with Poplar SDK 2.3 release
1 parent 788ead5

410 files changed: 20414 additions & 9952 deletions

.gitignore

Lines changed: 6 additions & 0 deletions

@@ -43,3 +43,9 @@ vars.capnp
 # Virtual environments
 *virtualenv/
 .venv/
+
+# Jupyter notebook checkpoint
+**/.ipynb_checkpoints
+nohup.*
+**/wandb
+
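
A quick way to confirm the new ignore patterns behave as intended is `git check-ignore`; the paths below are illustrative and not part of the commit.

```console
git check-ignore -v notebooks/.ipynb_checkpoints nohup.out applications/popart/bert/wandb
```

Each ignored path is reported together with the `.gitignore` pattern and line that matched it.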

README.md

Lines changed: 1 addition & 1 deletion

@@ -11,7 +11,7 @@ repository. If you are actively using this repository and want to report any iss
 
 The latest version of the documentation for the Poplar software stack, and other developer resources, is available at https://www.graphcore.ai/developer.
 
-> The code presented here requires using Poplar SDK 2.2.x
+> The code presented here requires using Poplar SDK 2.3.x
 
 Please install and enable the Poplar SDK following the instructions in the Getting Started guide for your IPU system.
 
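
Enabling the SDK typically means sourcing the Poplar and PopART enable scripts shipped with the 2.3.x download before running anything in this repository; the paths below are placeholders, so follow the Getting Started guide for your IPU system for the exact location.

```console
# Placeholder paths: substitute wherever the Poplar SDK 2.3.x tarball was unpacked
source /path/to/poplar_sdk-2.3.0/poplar-*/enable.sh
source /path/to/poplar_sdk-2.3.0/popart-*/enable.sh
```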

applications/popart/bert/Makefile

Lines changed: 1 addition & 6 deletions

@@ -7,16 +7,11 @@ custom_ops.so: custom_ops/plugin_version custom_ops/*.cpp custom_ops/workarounds
 g++ -std=c++14 -fPIC \
 -DSTATIC_VERSION=\"${shell ./custom_ops/plugin_version}\" \
 -DONNX_NAMESPACE=onnx \
-custom_ops/detach.cpp \
+custom_ops/attention_mask.cpp \
 custom_ops/disable_attn_dropout_bwd_pattern.cpp \
-custom_ops/sparse_accumulate.cpp \
-custom_ops/sparse_accumulate_pattern.cpp \
-custom_ops/embedding_gather.cpp \
 custom_ops/tied_gather.cpp \
 custom_ops/tied_gather_pattern.cpp \
-custom_ops/lamb_serialised_weight_pattern.cpp \
 custom_ops/workarounds/prevent_const_expr_folding_op.cpp \
-custom_ops/workarounds/accumulate_priority_pattern.cpp \
 -shared -lpopart -lpoplar -lpoplin -lpopnn -lpopops -lpoputil -lpoprand \
 -o custom_ops.so
 
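
The build rule itself is unchanged apart from the source list, so the custom ops are still rebuilt with a plain `make`. The load check below is a sketch that assumes the SDK has been enabled and that the resulting `custom_ops.so` sits in the current directory.

```console
make
python3 -c "import ctypes; ctypes.cdll.LoadLibrary('./custom_ops.so')"  # raises OSError if the Poplar/PopART libraries are not on the library path
```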

applications/popart/bert/README.md

Lines changed: 8 additions & 16 deletions

@@ -44,10 +44,10 @@ The following files are provided for running the BERT benchmarks.
 | --------------- | ------------------------------------------------------------ |
 | `bert.py` | Main training loop |
 | `bert_model.py` | BERT model definition |
-| `utils.py` | Utility functions |
-| `bert_data/` | Directory containing the data pipeline and training data generation <br /><br />- `dataset.py` - Dataloader and preprocessing. Loads binary files into Numpy arrays to be passed `popart.PyStepIO`, with shapes based on training options, `--batches-per-step` & `--pipeline` <br /><br /> -`create_pretraining_data.py` - Script to generate binary files to be loaded from text data |
+| `utils/` | Utility functions |
+| `bert_data/` | Directory containing the data pipeline and training data generation <br /><br />- `dataset.py` - Dataloader and preprocessing. Loads binary files into Numpy arrays to be passed `popart.PyStepIO`, with shapes to match the configuration <br /><br /> -`create_pretraining_data.py` - Script to generate binary files to be loaded from text data |
 | `configs/` | Directory containing JSON configuration files to be used by the `--config` argument. |
-| `custom_ops/` | Directory containing custom PopART operators. These are optimised parts of the graph that target Poplar/PopLibs operations directly.<br /> - `attention.cpp` - This operation is the fwd and grad implementation for multi-headed self-attention.<br/> - `detach.cpp` - This operation is an identity with no grad implementation. This allows for the embedding dictionary to only be updated by its use in the projection.<br/> -`embeddingGather.cpp` - This operation is a modification of the PopART Gather to ensure correct layout of the weights. |
+| `custom_ops/` | Directory containing custom PopART operators. These are optimised parts of the graph that target Poplar/PopLibs operations directly. |
 
 
 ## Quick start guide

@@ -161,18 +161,14 @@ For the sample text a configuration has been created - `configs/demo.json`. It
 {
 # Two layers as our dataset does not need the capacity of the usual 12 Layer BERT Base
 "num_layers": 2,
-"no_dropout": true,
 "popart_dtype": "FLOAT16",
-"loss_scaling": 1.0,
-"stochastic_rounding": true,
 # The data generation should have created 64 samples. Therefore, we will do an epoch per session.run
 "batches_per_step": 64,
-"epochs": 150,
+"training_steps": 500,
 # Here we specify the file we created in the previous step.
 "input_files": [
 "data/sample_text.bin"
-]
-"shuffle": true,
+],
 "no_validation": true
 }
 ```

@@ -183,7 +179,7 @@ Run this config:
 python3 bert.py --config configs/demo.json
 ```
 
-This will compile the graph and run for 150 epochs. At end our model should have overfit to 100% test accuracy.
+This will compile the graph and run for 500 training steps. At end our model should have overfit to 100% test accuracy.
 
 ##### View the pre-training results in Tensorboard
 

@@ -235,14 +231,10 @@ How to get the SQuAD 1.1 files required for inference is described in `bert_data
 
 To run SQuAD BERT Base inference with a sequence length of 128:
 
-`python3 bert.py --config configs/{mk1,mk2}/squad_base_128_inference.json`
+`python3 bert.py --config configs/{mk1,mk2}/squad_base_128_inf.json`
 
 and for BERT Large with a sequence length of 384:
 
-`python3 bert.py --config configs/{mk1,mk2}/squad_large_384_inference.json`
+`python3 bert.py --config configs/{mk1,mk2}/squad_large_384_inf.json`
 
 View the JSON files in configs for detailed parameters.
-
-It is also possible to run inference on the pretraining graph to validate the MLM/NSP results. It requires input files to be provided, either by adding them to the config or by using the following command-line for sequence length of 128:
-
-`python3 bert.py --config configs/{mk1,mk2}/mlm_nsp_base_128_inference.json --input-files <path_to_input_file>`
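
The hunk above keeps the existing "View the pre-training results in Tensorboard" heading. A minimal way to do that is sketched below; the `logs` directory name is an assumption, so check the config or command-line options for where the run actually writes its event files.

```console
pip install tensorboard   # if it is not already in the virtual environment
tensorboard --logdir logs --port 6006
```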

applications/popart/bert/README_Benchmarks.md

Lines changed: 91 additions & 20 deletions

@@ -23,7 +23,48 @@ python bert.py --config configs/mk2/pretrain_large_128.json --input-files=$DATAS
 
 Command:
 ```console
-python bert.py --config configs/mk2/pretrain_large_128_POD64.json --input-files=$DATASETS_DIR/wikipedia/AA/sequence_128/wiki_*_tokenised --replication 16 --wandb
+python bert.py --config configs/mk2/pretrain_large_128_POD64.json --input-files=$DATASETS_DIR/wikipedia/AA/sequence_128/wiki_*_tokenised --checkpoint-dir "checkpoint/phase1"
+```
+
+#### 1 x IPU-POD128
+export POPLAR_ENGINE_OPTIONS='{"target.hostSyncTimeout": "1200", "target.syncReplicasIndependently": "true"}'
+export HOROVOD_STALL_CHECK_TIME_SECONDS=120
+export HOROVOD_POPART_BROADCAST_TIMEOUT=120
+
+$PARTITION, $IPUOF_VIPU_API_PARTITION_ID: ID of the Pod64 reconfigurable partition
+$TCP_IF_INCLUDE: sets the default route for traffic between Poplar hosts. It should be configured for a network to which all Poplar hosts have access, and for which the interfaces only have a single IP address.
+$VIPU_SERVER_HOST: IP address as appropriate for the target hardware
+$HOSTS: IP address of the main host server
+
+Command:
+```console
+poprun -vv --num-instances=2 --num-replicas=32 \
+--num-ilds=2 \
+--ipus-per-replica=4 \
+--vipu-server-host="$VIPU_SERVER_HOST" \
+--host=$HOSTS \
+--vipu-partition=gcl128 \
+--vipu-cluster=c128 \
+--update-partition=yes \
+--remove-partition=no \
+--reset-partition=no \
+--print-topology=yes \
+--vipu-server-timeout=1200 \
+--mpi-global-args="--tag-output \
+--allow-run-as-root \
+--mca btl_tcp_if_include $TCP_IF_INCLUDE \
+--mca oob_tcp_if_include $TCP_IF_INCLUDE" \
+--mpi-local-args="-x OPAL_PREFIX \
+-x CPATH \
+-x IPUOF_VIPU_API_TIMEOUT=1200 \
+-x POPLAR_ENGINE_OPTIONS \
+-x HOROVOD_STALL_CHECK_TIME_SECONDS \
+-x HOROVOD_POPART_BROADCAST_TIMEOUT" \
+python3 bert.py --config configs/mk2/pretrain_large_128.json \
+--replication-factor 16 \
+--loss-scaling 4096 \
+--replicated-tensor-sharding True \
+--input-files $DATASETS_DIR/wikipedia/AA/sequence_128/*
 ```
 
 ### BERT-Large Phase 2 Pre-training Sequence length 384

@@ -39,7 +80,50 @@ python bert.py --config configs/mk2/pretrain_large_384.json --input-files=$DATAS
 
 Command:
 ```console
-python3 bert.py --config configs/mk2/pretrain_large_384.json --input-files=$DATASETS_DIR/wikipedia/AA/sequence_384/wiki_00_tokenised --replication 16 --loss-scaling 4096 --epochs 2 --wandb
+python bert.py --config configs/mk2/pretrain_large_384_POD64.json --input-files=$DATASETS_DIR/wikipedia/AA/sequence_384/wiki_00_tokenised --epochs 2
+```
+
+#### 1 x IPU-POD128
+
+export POPLAR_ENGINE_OPTIONS='{"target.hostSyncTimeout": "1200", "target.syncReplicasIndependently": "true"}'
+export HOROVOD_STALL_CHECK_TIME_SECONDS=120
+export HOROVOD_POPART_BROADCAST_TIMEOUT=120
+
+$PARTITION, $IPUOF_VIPU_API_PARTITION_ID: ID of the Pod64 reconfigurable partition
+$TCP_IF_INCLUDE: sets the default route for traffic between Poplar hosts. It should be configured for a network to which all Poplar hosts have access, and for which the interfaces only have a single IP address.
+$VIPU_SERVER_HOST: IP address as appropriate for the target hardware
+$HOSTS: IP address of the main host server
+
+Command:
+```console
+poprun -vv --num-instances=2 --num-replicas=32 \
+--num-ilds=2 \
+--ipus-per-replica=4 \
+--vipu-server-host="$VIPU_SERVER_HOST" \
+--host=$HOSTS \
+--vipu-partition=gcl128 \
+--vipu-cluster=c128 \
+--update-partition=yes \
+--remove-partition=no \
+--reset-partition=no \
+--print-topology=yes \
+--vipu-server-timeout=1200 \
+--mpi-global-args="--tag-output \
+--allow-run-as-root \
+--mca btl_tcp_if_include $TCP_IF_INCLUDE \
+--mca oob_tcp_if_include $TCP_IF_INCLUDE" \
+--mpi-local-args="-x OPAL_PREFIX \
+-x CPATH \
+-x IPUOF_VIPU_API_TIMEOUT=1200 \
+-x POPLAR_ENGINE_OPTIONS \
+-x HOROVOD_STALL_CHECK_TIME_SECONDS \
+-x HOROVOD_POPART_BROADCAST_TIMEOUT" \
+python3 bert.py --config configs/mk2/pretrain_large_384.json \
+--replication-factor 16 \
+--replicated-tensor-sharding True \
+--input-files $DATASETS_DIR/wikipedia/AA/sequence_384/* \
+--wandb \
+--onnx-checkpoint checkpoints/mk2/pretrain_large_rank_0/21-10-06-01-20-46/model.onnx
 ```
 
 ### BERT-Base Phase 1 Pre-training Sequence length 128

@@ -51,7 +135,6 @@ Command:
 python bert.py --config configs/mk2/pretrain_base_128.json --input-files=$DATASETS_DIR/wikipedia/AA/sequence_128/wiki_00_tokenised --epochs 1 --no-model-save --no-validation --steps-per-log 1
 ```
 
-
 ### BERT-Base Phase 2 Pre-training Sequence length 384
 
 #### 1 x IPU-POD16

@@ -67,7 +150,7 @@ python bert.py --config configs/mk2/pretrain_base_384.json --input-files=$DATASE
 
 Command:
 ```console
-python bert.py --config configs/mk2/squad_large_384.json --input-files=$DATASETS_DIR/squad/train-v1.1.json --vocab-file=$DATASETS_DIR/ckpts/uncased_L-24_H-1024_A-16/vocab.txt --no-model-save --no-validation --steps-per-log 1
+python run_squad.py --squad-do-validation False --config squad_large_384_POD16 --num-epochs 1
 ```
 
 ## Inference

@@ -87,7 +170,7 @@ This benchmark spawns multiple replicas using mpirun. To obtain the total throug
 
 Command:
 ```console
-mpirun --tag-output --allow-run-as-root --np 4 python bert.py --task=SQUAD --layers-per-ipu 24 --num-layers=24 --hidden-size=1024 --attention-heads=16 --sequence-length=128 --dtype=FLOAT16 --batches-per-step=16 --generated-data=true --no-model-save --host-embedding=NONE --minimum-latency-inference=true --input-files=$DATASETS_DIR/squad/dev-v1.1.json --inference --encoder-start-ipu=0 --use-default-available-memory-proportion=true --max-copy-merge-size=-1 --shuffle=false --micro-batch-size 1 --enable-half-partials --epochs-inference 20 --group-host-sync --no-outlining=false --steps-per-log=1
+mpirun --tag-output --np 4 --allow-run-as-root python bert.py --config configs/mk2/squad_large_128_inf.json --micro-batch-size {batchsize} --generated-data=true --epochs-inference 20 --input-files=$DATASETS_DIR/squad/dev-v1.1.json
 ```
 
 Set --micro-batch-size to 1, 2 or 3.

@@ -100,20 +183,8 @@ This benchmark spawns multiple replicas using mpirun. To obtain the total throug
 
 Command:
 ```console
-mpirun --tag-output --allow-run-as-root --np 4 python bert.py --task=SQUAD --layers-per-ipu 12 --num-layers=12 --hidden-size=768 --attention-heads=12 --sequence-length=128 --dtype=FLOAT16 --batches-per-step=16 --generated-data=true --no-model-save --host-embedding=NONE --minimum-latency-inference=true --input-files=$DATASETS_DIR/squad/dev-v1.1.json --inference --encoder-start-ipu=0 --use-default-available-memory-proportion=true --max-copy-merge-size=-1 --shuffle=false --micro-batch-size 1 --enable-half-partials --epochs-inference 10 --group-host-sync --no-outlining=false --steps-per-log=1
-```
-
-Set --micro-batch-size to 1, 2, 4, 8, 16, 32, 64 or 80.
-
-### BERT 3-layer Base Inference Sequence length 128
-
-#### 1 x IPU-M2000
-
-This benchmark spawns multiple replicas using mpirun. To obtain the total throughput, sum the reported throughputs for each iteration.
-
-Command:
-```console
-mpirun --tag-output --allow-run-as-root --np 4 python3 bert.py --task SQUAD --layers-per-ipu=3 --num-layers=3 --hidden-size=768 --attention-heads=12 --sequence-length=128 --dtype=FLOAT16 --batches-per-step=2048 --generated-data=true --no-model-save --host-embedding=NONE --low-latency-inference=false --minimum-latency-inference=true --input-files=$DATASETS_DIR/squad/dev-v1.1.json --inference --encoder-start-ipu=0 --use-default-available-memory-proportion=true --max-copy-merge-size=-1 --shuffle=false --micro-batch-size 1 --enable-half-partials --epochs-inference 10 --group-host-sync --no-outlining=false --steps-per-log=1
+mpirun --tag-output --np 4 --allow-run-as-root python bert.py --config configs/mk2/squad_base_128_inf.json --micro-batch-size {batchsize} --generated-data=true --epochs-inference 10 --input-files=$DATASETS_DIR/squad/dev-v1.1.json
 ```
 
-Set --micro-batch-size to 1, 2, 4, 8, 16, 32 or 64. Set --low-latency-inference to false or true. Set --minimum-latency-inference to true or false.
+Set --micro-batch-size to 1, 2, 4, 8, 16, 32, 64, or 80
+for micro-batch-size = 80, also set --available-memory-proportion 0.55
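
The IPU-POD128 commands added above rely on several environment variables that the notes only describe. The exports below are purely illustrative values showing the shape of that setup; every address, interface and partition name will differ on your system.

```console
# Illustrative values only: substitute the addresses and partition names for your installation
export VIPU_SERVER_HOST=10.1.3.101            # V-IPU controller host
export HOSTS=10.1.3.101,10.1.3.102            # Poplar host servers, one per poprun instance
export TCP_IF_INCLUDE=10.1.3.0/24             # network that all Poplar hosts can reach
export PARTITION=pod64_part
export IPUOF_VIPU_API_PARTITION_ID=$PARTITION
export DATASETS_DIR=/localdata/datasets
```

For reference, the flag arithmetic in both commands works out as 32 replicas x 4 IPUs per replica = 128 IPUs (one IPU-POD128), launched as 2 poprun instances of 16 replicas each, which matches the --replication-factor 16 passed to bert.py.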
