Commit 66e8b99

hmellor authored and georgepaw committed
Warn if local replication is dropping a batch
Summary: As of the refactor of distributed TF Keras (@christiana), entire batches may be discarded if the batches in an instance's dataset cannot be distributed evenly between its local replicas. This diff adds a warning to let the user know if any batches are being discarded for this reason. For example, if you run

```
poprun --num-instances 1 --num-replicas 2 python main.py
```

and your (non-repeating) dataset contains 781 batches, each local replica will only use 390 batches and the remaining 1 is discarded.

TF2.5 only.

Reviewers: #tensorflow, #framework_ip_review_-_any_oss_or_third-party_code_use_has_been_approved, markf, christiana

Reviewed By: #tensorflow, #framework_ip_review_-_any_oss_or_third-party_code_use_has_been_approved, christiana

Subscribers: christiana

Maniphest Tasks: T52076

Differential Revision: https://phabricator.sourcevertex.net/D60868
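The arithmetic behind the warning can be sketched in isolation. The helper below (`dropped_batches` is a hypothetical name, not part of the diff) reproduces the remainder calculation described above, assuming batches are split evenly between local replicas and the leftover is discarded:

```python
def dropped_batches(num_batches, replication_factor):
    # Batches that cannot be distributed evenly between the local
    # replicas are discarded; the remainder gives their count.
    return num_batches % replication_factor

# The scenario from the commit message: 781 batches, 2 local replicas.
print(781 // 2)                  # 390 batches per replica
print(dropped_batches(781, 2))   # 1 batch dropped
```

With a dataset length that divides evenly (e.g. 780 batches across 2 replicas), the remainder is zero and no warning would fire.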
1 parent 7393d5a commit 66e8b99

File tree

1 file changed: +7 -0 lines changed


tensorflow/python/ipu/keras/extensions/data_adapter.py

Lines changed: 7 additions & 0 deletions
```diff
@@ -159,6 +159,13 @@ def _infer_steps(self, steps, dataset):
       raise ValueError(
           "Could not infer the size of the data. You must specify the number "
           "of steps to run.")
+    if steps % self._replication_factor:
+      logging.warn(
+          "Dataset of length {} is being evenly distributed between {} "
+          "replicas. The remaining {} batch{} will be dropped.".format(
+              len(dataset), self._replication_factor,
+              steps % self._replication_factor,
+              "es" if steps % self._replication_factor > 1 else ""))
 
     return int(steps // self._replication_factor)
```
0 commit comments

Comments
 (0)