## `TF_CONFIG`

Both masters and parameter servers must have the `TF_CONFIG` environment
variable set.

The `TF_CONFIG` environment variable is a json-encoded string with the addresses
of the masters and parameter servers (in the `'cluster'` key) and the
identification of the current task (in the `'task'` key).

For example:

```
import json
import os

cluster = {
    'ps': ['host1:2222', 'host2:2222'],
    'master': ['host3:2222', 'host4:2222', 'host5:2222']
}
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': cluster,
    # This process is the second master (index 1).
    'task': {'type': 'master', 'index': 1},
    'environment': 'cloud',
})
```
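
The same `'cluster'` map is shared by every machine; only the `'task'` entry
differs per process. As a sketch, reusing the example addresses above, the
first parameter server could set its `TF_CONFIG` directly from the shell:

```
# Hypothetical TF_CONFIG for the first parameter server (host1:2222).
export TF_CONFIG='{
  "cluster": {"ps": ["host1:2222", "host2:2222"],
              "master": ["host3:2222", "host4:2222", "host5:2222"]},
  "task": {"type": "ps", "index": 0},
  "environment": "cloud"
}'
```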

## Command-line flags

The following T2T command-line flags must also be set on the masters for
distributed training (an example invocation is shown below):

- `--master=grpc://$ADDRESS`
- `--worker_replicas=$NUM_MASTERS`
- `--worker_gpu=$NUM_GPUS_PER_MASTER`
- `--worker_id=$MASTER_ID`
- `--worker_job='/job:master'`
- `--ps_replicas=$NUM_PS`
- `--ps_gpu=$NUM_GPUS_PER_PS`
- `--schedule=train`
- `--sync`, if you want synchronous training, i.e. for there to be a single
  master coordinating the work across "ps" jobs. If not set, then each master
  operates independently while variables are shared on the parameter servers.

Parameter servers only need `--master=grpc://$ADDRESS` and
`--schedule=run_std_server`.
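
For concreteness, here is a rough sketch of full invocations for the example
cluster above (three masters, two parameter servers). The GPU counts, index
assignments, and trailing flags are placeholders, and `TF_CONFIG` must also be
set on each machine as described earlier:

```
# On the first master (host3:2222), assuming one GPU per machine.
t2t-trainer \
  --master=grpc://host3:2222 \
  --worker_replicas=3 \
  --worker_gpu=1 \
  --worker_id=0 \
  --worker_job='/job:master' \
  --ps_replicas=2 \
  --ps_gpu=1 \
  --schedule=train \
  --model=transformer ...

# On the first parameter server (host1:2222).
t2t-trainer \
  --master=grpc://host1:2222 \
  --schedule=run_std_server
```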

## Utility to produce `TF_CONFIG` and flags

[`t2t-make-tf-configs`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/bin/t2t-make-tf-configs)
generates the `TF_CONFIG` json strings and the above-mentioned command-line
flags for the masters and parameter servers.

Given a set of master and parameter server addresses, the script outputs, for
each job, a line with the `TF_CONFIG` environment variable and the command-line
flags necessary for distributed training. For each job, you should invoke the
`t2t-trainer` with the `TF_CONFIG` value and flags that are output.

For example:

```
TF_CONFIG=$JOB_TF_CONFIG t2t-trainer $JOB_FLAGS --model=transformer ...
```
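
To produce `$JOB_TF_CONFIG` and `$JOB_FLAGS` in the first place, the script is
given the master and parameter server addresses. As a sketch only, assuming the
script accepts `--masters` and `--ps` flags with comma-separated address lists
(check `t2t-make-tf-configs --help` for the exact flag names):

```
t2t-make-tf-configs \
  --masters='host3:2222,host4:2222,host5:2222' \
  --ps='host1:2222,host2:2222'
```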

Modify the `--worker_gpu` and `--ps_gpu` flags, which specify how many GPUs are
on each master and ps, respectively, as needed for your machine/cluster setup.

## Command-line flags for eval jobs

Eval jobs should set the following flags and do not need the `TF_CONFIG`