@@ -106,6 +106,16 @@ will occupy those chip-select rows.
106106This term is avoided because it is unclear when needing to distinguish
107107between chip-select rows and socket sets.
108108
109+ * High Bandwidth Memory (HBM)
110+
111+ HBM is a new memory type with low power consumption and ultra-wide
112+ communication lanes. It uses vertically stacked memory chips (DRAM dies)
113+ interconnected by microscopic wires called "through-silicon vias," or
114+ TSVs.
115+
116+ Several stacks of HBM chips connect to the CPU or GPU through an ultra-fast
117+ interconnect called the "interposer". Therefore, HBM's characteristics
118+ are nearly indistinguishable from on-chip integrated RAM.
109119
110120Memory Controllers
111121------------------
@@ -176,3 +186,113 @@ nodes::
176186 the L1 and L2 directories would be "edac_device_block's"
177187
178188.. kernel-doc :: drivers/edac/edac_device.h
189+
190+
191+ Heterogeneous system support
192+ ----------------------------
193+
194+ An AMD heterogeneous system is built by connecting the data fabrics of
195+ both CPUs and GPUs via custom xGMI links. Thus, the data fabric on the
196+ GPU nodes can be accessed the same way as the data fabric on CPU nodes.
197+
198+ The MI200 accelerators are data center GPUs. They have 2 data fabrics,
199+ and each GPU data fabric contains four Unified Memory Controllers (UMC).
200+ Each UMC contains eight channels. Each UMC channel controls one 128-bit
201+ HBM2e (2GB) channel (equivalent to 8 X 2GB ranks). This creates a DRAM
202+ data bus with a total width of 4096 bits.
203+
204+ While each UMC interfaces with a 16GB (8-high X 2GB DRAM) HBM stack, each UMC
205+ channel interfaces with 2GB of DRAM (represented as a rank).
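
The figures above can be cross-checked with a little arithmetic. The following is only a sketch: the constants are copied from the text for one MI200 GPU data fabric, and the names (``UMCS_PER_NODE``, ``node_totals``, and so on) are hypothetical, not kernel identifiers.

```python
# Figures quoted in the text for one MI200 GPU data fabric (one GPU node).
# All names here are illustrative; none of them exist in the kernel sources.
UMCS_PER_NODE = 4          # Unified Memory Controllers per data fabric
CHANNELS_PER_UMC = 8       # channels per UMC
CHANNEL_WIDTH_BITS = 128   # width of one HBM2e channel
CHANNEL_SIZE_GB = 2        # capacity behind one UMC channel (one rank)


def node_totals():
    """Derive per-UMC and per-node totals from the per-channel figures."""
    channels = UMCS_PER_NODE * CHANNELS_PER_UMC
    return {
        "channels": channels,
        "bus_width_bits": channels * CHANNEL_WIDTH_BITS,
        "umc_size_gb": CHANNELS_PER_UMC * CHANNEL_SIZE_GB,
        "node_size_gb": channels * CHANNEL_SIZE_GB,
    }


if __name__ == "__main__":
    for name, value in node_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions, the 4096-bit bus width and the 16GB per UMC follow directly from the per-channel numbers.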
206+
207+ Memory controllers on AMD GPU nodes can be represented in EDAC as follows::
208+
209+     GPU DF / GPU Node -> EDAC MC
210+     GPU UMC           -> EDAC CSROW
211+     GPU UMC channel   -> EDAC CHANNEL
212+
213+ For example, consider a heterogeneous system with one AMD CPU connected to
214+ four MI200 (Aldebaran) GPUs using xGMI.
215+
216+ Some more heterogeneous hardware details:
217+
218+ - The CPU UMC (Unified Memory Controller) is mostly the same as the GPU UMC.
219+ They have chip selects (csrows) and channels. However, the layouts are different
220+ for performance, physical layout, or other reasons.
221+ - CPU UMCs use 1 channel, so in this case UMC = EDAC channel. This follows the
222+ marketing speak: a CPU has X memory channels, etc.
223+ - CPU UMCs use up to 4 chip selects, so UMC chip select = EDAC CSROW.
224+ - GPU UMCs use 1 chip select, so UMC = EDAC CSROW.
225+ - GPU UMCs use 8 channels, so UMC channel = EDAC channel.
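
The list above can be restated as two small mapping functions. This is a minimal sketch for illustration only: ``EdacLocation``, ``cpu_to_edac`` and ``gpu_to_edac`` are hypothetical names, and the range checks assume the MI200 limits given in this document (4 UMCs per GPU node, 8 channels per GPU UMC, up to 4 chip selects per CPU UMC).

```python
from typing import NamedTuple


class EdacLocation(NamedTuple):
    """EDAC coordinates inside one memory controller (hypothetical type)."""
    csrow: int
    channel: int


def cpu_to_edac(umc: int, chip_select: int) -> EdacLocation:
    # CPU: each UMC is one EDAC channel, each chip select is one EDAC csrow.
    if umc < 0:
        raise ValueError("UMC index must be non-negative")
    if not 0 <= chip_select < 4:
        raise ValueError("CPU UMCs use up to 4 chip selects")
    return EdacLocation(csrow=chip_select, channel=umc)


def gpu_to_edac(umc: int, umc_channel: int) -> EdacLocation:
    # GPU: each UMC is one EDAC csrow, each UMC channel is one EDAC channel.
    if not 0 <= umc < 4:
        raise ValueError("an MI200 GPU node has 4 UMCs")
    if not 0 <= umc_channel < 8:
        raise ValueError("a GPU UMC has 8 channels")
    return EdacLocation(csrow=umc, channel=umc_channel)
```

Note how the UMC index lands in a different EDAC field on each side: it becomes the channel on a CPU node and the csrow on a GPU node.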
226+
227+ The EDAC subsystem provides a mechanism to handle AMD heterogeneous
228+ systems by calling system specific ops for both CPUs and GPUs.
229+
230+ AMD GPU nodes are enumerated in sequential order based on the PCI
231+ hierarchy, and the first GPU node is assumed to have a Node ID value
232+ following those of the CPU nodes after the latter are fully populated::
233+
234+ $ ls /sys/devices/system/edac/mc/
235+ mc0 - CPU MC node 0
236+ mc1 |
237+ mc2 |- GPU card[0] => node 0(mc1), node 1(mc2)
238+ mc3 |
239+ mc4 |- GPU card[1] => node 0(mc3), node 1(mc4)
240+ mc5 |
241+ mc6 |- GPU card[2] => node 0(mc5), node 1(mc6)
242+ mc7 |
243+ mc8 |- GPU card[3] => node 0(mc7), node 1(mc8)
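
The numbering shown in the listing above can be sketched as a small helper. It assumes, as in that listing, that GPU memory controllers follow the CPU ones and that every GPU card contributes two nodes; ``gpu_mc_to_card_node`` and its parameters are hypothetical names used only for illustration.

```python
def gpu_mc_to_card_node(mc_index, cpu_nodes=1, nodes_per_gpu=2):
    """Map an EDAC mc index to a (GPU card, node within card) pair.

    Assumes GPU nodes are numbered sequentially right after the CPU nodes,
    with a fixed number of nodes per card, as in the listing above.
    """
    if mc_index < cpu_nodes:
        raise ValueError(f"mc{mc_index} belongs to a CPU node")
    # Position of this node among the GPU nodes only.
    gpu_node_index = mc_index - cpu_nodes
    card, node = divmod(gpu_node_index, nodes_per_gpu)
    return card, node


if __name__ == "__main__":
    for mc in range(1, 9):
        card, node = gpu_mc_to_card_node(mc)
        print(f"mc{mc}: GPU card[{card}] node {node}")
```

With one CPU node, this reproduces the pairing in the listing: mc1 and mc2 belong to card[0], and mc7 and mc8 belong to card[3].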
244+
245+ For example, a heterogeneous system with one AMD CPU is connected to
246+ four MI200 (Aldebaran) GPUs using xGMI. This topology can be represented
247+ via the following sysfs entries::
248+
249+ /sys/devices/system/edac/mc/..
250+
251+ CPU # CPU node
252+ ├── mc 0
253+
254+ GPU Nodes are enumerated sequentially after CPU nodes have been populated
255+ GPU card 1 # Each MI200 GPU has 2 nodes/mcs
256+ ├── mc 1 # GPU node 0 == mc1, Each MC node has 4 UMCs/CSROWs
257+ │ ├── csrow 0 # UMC 0
258+ │ │ ├── channel 0 # Each UMC has 8 channels
259+ │ │ ├── channel 1 # size of each channel is 2 GB, so each UMC has 16 GB
260+ │ │ ├── channel 2
261+ │ │ ├── channel 3
262+ │ │ ├── channel 4
263+ │ │ ├── channel 5
264+ │ │ ├── channel 6
265+ │ │ ├── channel 7
266+ │ ├── csrow 1 # UMC 1
267+ │ │ ├── channel 0
268+ │ │ ├── ..
269+ │ │ ├── channel 7
270+ │ ├── .. ..
271+ │ ├── csrow 3 # UMC 3
272+ │ │ ├── channel 0
273+ │ │ ├── ..
274+ │ │ ├── channel 7
275+ │ ├── rank 0
276+ │ ├── .. ..
277+ │ ├── rank 31 # total 32 ranks/dimms from 4 UMCs
278+ ├
279+ ├── mc 2 # GPU node 1 == mc2
280+ │ ├── .. # each GPU node has 64 GB in total
281+
282+ GPU card 2
283+ ├── mc 3
284+ │ ├── ..
285+ ├── mc 4
286+ │ ├── ..
287+
288+ GPU card 3
289+ ├── mc 5
290+ │ ├── ..
291+ ├── mc 6
292+ │ ├── ..
293+
294+ GPU card 4
295+ ├── mc 7
296+ │ ├── ..
297+ ├── mc 8
298+ │ ├── ..
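
To compare a running system against the layout above, the memory controllers that EDAC has registered can be listed from user space. The sketch below only reads directory names under the sysfs path shown earlier; ``list_memory_controllers`` is a hypothetical helper, and it makes no assumption about the files inside each ``mcN`` directory.

```python
import re
from pathlib import Path


def list_memory_controllers(root="/sys/devices/system/edac/mc"):
    """Return the sorted indices N of all mcN directories under root.

    Returns an empty list if root does not exist, for example when no
    EDAC driver is loaded.
    """
    base = Path(root)
    if not base.is_dir():
        return []
    indices = []
    for entry in base.iterdir():
        match = re.fullmatch(r"mc(\d+)", entry.name)
        if match and entry.is_dir():
            indices.append(int(match.group(1)))
    # Sort numerically so that mc10 comes after mc2.
    return sorted(indices)


if __name__ == "__main__":
    for index in list_memory_controllers():
        print(f"mc{index}")
```

On the example system described above, one would expect indices 0 through 8: one CPU node followed by eight GPU nodes.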