
Commit 3ebb04a

jakeh-gc authored and georgepaw committed

Expand the embedded runtime docs and improve example.

Summary: Resolves T45423
Reviewers: #tensorflow, #framework_ip_review_-_any_oss_or_third-party_code_use_has_been_approved, georgep, jamiep
Reviewed By: #tensorflow, #framework_ip_review_-_any_oss_or_third-party_code_use_has_been_approved, georgep, jamiep
Subscribers: jamiep, georgep
Maniphest Tasks: T45423
Differential Revision: https://phabricator.sourcevertex.net/D52253

1 parent 5a2fc7d commit 3ebb04a

File tree

5 files changed: +71 −5 lines changed

tensorflow/compiler/plugin/poplar/docs/embedded_application_runtime.rst

Lines changed: 71 additions & 5 deletions
@@ -37,9 +37,7 @@ engine. This object is created with a call to

.. code-block:: python

  from tensorflow.python.ipu import embedded_runtime
  ...
  context = embedded_runtime.embedded_runtime_start(
      poplar_exec_filepath, startup_inputs, engine_name)
@@ -52,17 +50,85 @@ function can be called. The context object ensures all appropriate metadata is

passed, and control dependencies are created.

.. code-block:: python

  ...
  results = embedded_runtime.embedded_runtime_call(
      call_inputs, context)
  session.run(results, feed_dict={...})

Once the IPU embedded application runtime has been created and used within the
session, the Poplar engine will be running in a background thread. This thread
can outlive the TensorFlow session.
6561

Pipelining and I/O tiles
~~~~~~~~~~~~~~~~~~~~~~~~

When running a pipelined application, or an application with I/O tiles, we must
handle the additional layer of pipelining. This is a result of there being
multiple batches of data resident in the device at the same time.

There are two ways to manage this. The first is to submit multiple requests in
parallel. The second is to provide a maximum time that the application should
wait for additional data.

Parallel requests
_________________

To ensure the application isn't starved of data, you can submit multiple
batches of data in parallel from multiple threads. These are enqueued and
processed as early as possible by the device.

When an application is pipelined, these parallel batches of data will overlap
in time as they are processed by the devices. This improves the overall
utilisation of the devices and minimises the batch latency.
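The parallel-submission pattern can be sketched with plain Python threads. This is a minimal illustration only: ``run_batch`` is a hypothetical stand-in for a call to ``embedded_runtime_call`` followed by ``session.run``, since the real calls require an IPU device.

```python
import threading
import queue

# Hypothetical stand-in for embedded_runtime_call + session.run: a device-side
# application that doubles its input batch.
def run_batch(batch):
    return [2 * x for x in batch]

results = queue.Queue()

def worker(batch):
    # Each thread submits its own batch; on real hardware these requests are
    # enqueued and overlap inside the device pipeline.
    results.put(run_batch(batch))

batches = [[1, 2], [3, 4], [5, 6], [7, 8]]
threads = [threading.Thread(target=worker, args=(b,)) for b in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results.queue))  # four doubled batches
```

With more threads in flight, the device always has a batch ready to process, which is what removes the waiting shown in the figures below.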
.. figure:: figures/threads_2.png
  :width: 75%
  :alt: Embedded runtime with two threads
  :align: center

  Embedded runtime with two threads and some waiting

.. figure:: figures/threads_4.png
  :width: 75%
  :alt: Embedded runtime with four threads
  :align: center

  The same application with four threads and no waiting
Timeout
_______

When the application is pipelined or using I/O tiles, and data starvation might
occur, the timeout option allows you to set an upper bound on the time the IPU
will wait for data.

When TensorFlow receives a Poplar callback, a timer is started. When the timer
reaches the defined timeout, a "dummy" batch of data is passed to the device.
This unblocks any pending batches that are in the device.
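The timer-and-dummy-batch mechanism can be modelled with a plain ``threading.Timer``. This is a simplified sketch of the idea, not the actual TensorFlow implementation; the names and the queue are illustrative.

```python
import threading
import queue

TIMEOUT_S = 0.05  # illustrative value; the real timeout is a runtime option

pending = queue.Queue()

def feed_dummy():
    # If no real batch arrives before the timeout, a "dummy" batch is
    # enqueued so batches already inside the device pipeline can drain.
    pending.put("dummy batch")

# Start the timer when the device signals (via a callback) that it is
# ready for more data.
timer = threading.Timer(TIMEOUT_S, feed_dummy)
timer.start()

# No real data arrives in time, so the dummy batch unblocks the pipeline.
batch = pending.get(timeout=1.0)
print(batch)
```

If a real batch had been submitted before the timer fired, the timer would be cancelled and the dummy batch never sent.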
.. figure:: figures/timeout.png
  :width: 75%
  :alt: Embedded runtime timeout
  :align: center

  An embedded runtime application triggering a 500 µs timeout
Engine restarts
_______________

The number of batches an application processes before terminating is fixed at
compile time. However, you might deliver more batches at runtime than the
application was compiled for. If this happens, the Poplar engine will be
restarted. A restart blocks enqueued items from being processed, temporarily
increasing latency.

To mitigate this, we recommend compiling the application to process as many
batches as will be required before it terminates. If the number of batches is
unknown, choose a value large enough to make restarts rare.
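The trade-off can be made concrete with a little arithmetic: each extra pass over the compiled batch count costs one restart. The function below is illustrative only; ``restart_count`` is a hypothetical helper, not part of any API.

```python
import math

def restart_count(batches_delivered, batches_compiled):
    """Number of engine restarts incurred when batches_delivered batches
    arrive at runtime and the executable was compiled to process
    batches_compiled batches per run. Illustrative arithmetic only."""
    # The engine runs once, then restarts for each further full or partial
    # pass over the compiled batch count.
    return math.ceil(batches_delivered / batches_compiled) - 1

print(restart_count(1000, 1000))   # 0: compiled count matches the workload
print(restart_count(2500, 1000))   # 2: two restarts mid-stream add latency
print(restart_count(2500, 10000))  # 0: over-provisioning avoids restarts
```

Over-provisioning the compiled batch count costs nothing at runtime beyond compile time, which is why a generous value is recommended when the workload size is unknown.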
.. figure:: figures/restart.png
  :width: 75%
  :alt: Embedded runtime engine restart
  :align: center

  An embedded runtime application triggering an engine restart, causing increased latency

Example
~~~~~~~

This example creates a very simple IPU program that doubles the input tensor.
