engine. This object is created with a call to ``embedded_runtime_start``:
.. code-block:: python

    from tensorflow.python.ipu import embedded_runtime
    ...
    context = embedded_runtime.embedded_runtime_start(
        poplar_exec_filepath, startup_inputs, engine_name)

function can be called. The context object ensures all appropriate metadata is
passed, and control dependencies are created.

.. code-block:: python

    ...
    results = embedded_runtime.embedded_runtime_call(
        call_inputs, context)
    session.run(results, feed_dict={...})

Once the IPU embedded application runtime has been created and used within the
session, the Poplar engine will be running in a background thread. This thread
can outlive the TensorFlow session.

Pipelining and I/O tiles
~~~~~~~~~~~~~~~~~~~~~~~~
When running a pipelined application, or an application with I/O tiles, we
must handle the additional layer of pipelining. This is because multiple
batches of data are resident in the device at the same time.

There are two ways to manage this. The first is to submit multiple requests
in parallel. The second is to set a maximum time that the application should
wait for additional data.

Parallel requests
_________________

To ensure the application isn't starved of data, you can submit multiple
batches of data in parallel from multiple threads. These are enqueued and
processed as early as possible by the device.

When an application is pipelined, these parallel batches of data will overlap
in time as they are processed by the devices. This improves the overall
utilisation of the devices and minimises the batch latency.

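As a minimal sketch of this pattern (assuming the ``context``,
``call_inputs`` and ``session`` objects from the snippets above, and a
hypothetical ``batches`` list with one input array per request), the call
operation can be built once and then run from several threads:

.. code-block:: python

    import threading

    # Build the call operation once; each thread then enqueues its own
    # batch through the shared session.
    results = embedded_runtime.embedded_runtime_call(call_inputs, context)

    def worker(batch):
        # session.run is thread-safe, so concurrent calls keep the
        # device supplied with data. This assumes a single input
        # placeholder in call_inputs.
        session.run(results, feed_dict={call_inputs[0]: batch})

    threads = [threading.Thread(target=worker, args=(b,)) for b in batches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
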
.. figure:: figures/threads_2.png
   :width: 75%
   :alt: Embedded runtime with two threads
   :align: center

   Embedded runtime with two threads and some waiting

.. figure:: figures/threads_4.png
   :width: 75%
   :alt: Embedded runtime with four threads
   :align: center

   The same application with four threads and no waiting

Timeout
_______
When the application is pipelined or using I/O tiles, and data starvation
might occur, the timeout option allows you to set an upper bound on the time
the IPU will wait for data.

When TensorFlow receives a Poplar callback, a timer is started. When the
timer reaches the defined timeout, a "dummy" batch of data is passed to the
device. This unblocks any pending batches that are in the device.

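A sketch of how this might be configured, assuming the timeout is exposed as
a ``timeout`` keyword argument (in seconds) on ``embedded_runtime_start``;
check the API reference for the exact parameter name and units:

.. code-block:: python

    # Assumption: `timeout` bounds how long the IPU waits for the next
    # batch before a "dummy" batch is submitted to unblock the pipeline.
    context = embedded_runtime.embedded_runtime_start(
        poplar_exec_filepath, startup_inputs, engine_name,
        timeout=0.0005)  # 500us, matching the figure below
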
.. figure:: figures/timeout.png
   :width: 75%
   :alt: Embedded runtime timeout
   :align: center

   An embedded runtime application triggering a 500us timeout

Engine restarts
_______________
The number of batches to process in an application is a compile-time
decision. However, you might later deliver more batches at runtime than the
application was compiled for. If this happens, the Poplar engine will be
restarted. A restart blocks enqueued items from being processed, temporarily
increasing latency.

To mitigate this, we recommend compiling the application to process as many
batches as required before it terminates. If the number of batches is
unknown, choose a value large enough to make restarts rare.

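As a rough illustration of that sizing decision (all numbers here are
hypothetical), the compile-time batch count can be derived from the expected
request rate and how long the application should run between restarts:

.. code-block:: python

    # Hypothetical sizing: compile for enough batches that engine
    # restarts are rare even under sustained peak load.
    peak_requests_per_second = 2000
    target_hours_between_restarts = 24
    batches_to_compile_for = (
        peak_requests_per_second * 3600 * target_hours_between_restarts)
    print(batches_to_compile_for)  # 172800000 batches per engine run
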
.. figure:: figures/restart.png
   :width: 75%
   :alt: Embedded runtime engine restart
   :align: center

   An embedded runtime application triggering an engine restart, causing
   increased latency

Example
~~~~~~~
This example creates a very simple IPU program that doubles the input tensor.