Commit 66a69be

matrix_mul_mkl: update README (#2012)
1 parent eefbe65 commit 66a69be

Libraries/oneMKL/matrix_mul_mkl/README.md

Lines changed: 36 additions & 21 deletions
@@ -1,6 +1,6 @@
 # `Matrix Multiplication with oneMKL` Sample
 
-Matrix Multiplication with Intel® oneAPI Math Kernel Library (oneMKL) shows how to use the oneMKL optimized matrix multiplication routines.
+Matrix Multiplication with Intel® oneAPI Math Kernel Library (oneMKL) shows how to use the oneMKL optimized matrix multiplication routines, and provides a simple benchmark.
 
 | Optimized for | Description
 |:--- |:---
@@ -14,14 +14,17 @@ For more information on oneMKL and complete documentation of all oneMKL routines
 
 ## Purpose
 
-Matrix Multiplication uses oneMKL to multiply two large matrices.
-
-This sample performs its computations on the default SYCL* device. You can set the `SYCL_DEVICE_TYPE` environment variable to `cpu` or `gpu` to select the device to use.
+Matrix Multiplication uses oneMKL to multiply two large matrices and measure device performance.
 
+This sample performs its computations on the default SYCL* device. You can set the `SYCL_DEVICE_FILTER` environment variable to `cpu` or `gpu` to select the device to use.
 
 ## Key Implementation Details
 
-The oneMKL `blas::gemm` routine performs a generalized matrix multiplication operation. OneMKL BLAS routines support both row-major and column-major matrix layouts; this sample uses row-major layouts, the traditional choice for C++.
+The oneMKL `blas::gemm` routine performs a matrix multiplication operation with optional scaling and updating behavior. oneMKL BLAS routines support both row-major and column-major matrix layouts; this sample uses the default column-major layout, the traditional choice for BLAS.
+
+This sample provides a simple benchmark to test `gemm` performance on a SYCL device, and illustrates several best practices:
+- Perform a warmup run before timing, to allow oneMKL to initialize and prepare GEMM kernels for execution.
+- Pad matrix dimensions if needed to ensure data is well-aligned.
 
 ## Using Visual Studio Code* (Optional)
 
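
The column-major layout and the padding practice added in this hunk can be illustrated with a minimal sketch in plain C++. This is not the sample's code: `pad_to_multiple`, `col_major_index`, and `PaddedMatrix` are hypothetical names, and padding the leading dimension to a multiple of 16 floats (64 bytes) is an assumed choice, not a value taken from the sample.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: round n up to the next multiple of `multiple`.
std::size_t pad_to_multiple(std::size_t n, std::size_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Column-major storage: element (i, j) of a matrix whose leading
// dimension is ld lives at offset i + j * ld.
std::size_t col_major_index(std::size_t i, std::size_t j, std::size_t ld) {
    return i + j * ld;
}

// Hypothetical container: a rows x cols column-major matrix whose
// leading dimension is padded so that every column starts on a
// multiple of `elems_per_line` elements (assumed: 16 floats = 64 bytes).
struct PaddedMatrix {
    std::size_t rows;
    std::size_t cols;
    std::size_t ld;            // padded leading dimension, ld >= rows
    std::vector<float> data;   // ld * cols elements, zero-initialized

    PaddedMatrix(std::size_t r, std::size_t c, std::size_t elems_per_line = 16)
        : rows(r), cols(c), ld(pad_to_multiple(r, elems_per_line)),
          data(ld * c, 0.0f) {}

    float& at(std::size_t i, std::size_t j) {
        return data[col_major_index(i, j, ld)];
    }
};
```

With this layout, the padded `ld` rather than `rows` is what would be passed as the leading-dimension argument of a BLAS-style `gemm` call; the padding rows are never read as part of the logical matrix.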

@@ -63,27 +66,39 @@ You can remove all generated files with `make clean`.
 ### On a Windows* System
 Run `nmake` to build and run the sample. `nmake clean` removes temporary files.
 
-> **Warning**: On Windows, static linking with oneMKL currently takes a very long time due to a known compiler issue. This will be addressed in an upcoming release.
-
 ## Running the Matrix Multiplication with oneMKL Sample
 
 ### Example of Output
-If everything is working correctly, the program will generate two input matrices and call oneMKL to multiply them. It will also compute the product matrix itself to verify the results from oneMKL.
+Example output from this sample:
 
 ```
-./sgemm.mkl
-Problem size: A (8192x8192) * B (8192x8192) --> C (8192x8192)
-Benchmark interations: 100
-Device: Intel(R) Iris(R) Xe Graphics
-Launching oneMKL GEMM calculation...
-SGEMM performance : GFLOPS
-
-./dgemm.mkl
-Problem size: A (8192x8192) * B (8192x8192) --> C (8192x8192)
-Benchmark interations: 100
-Device: Intel(R) Data Center GPU Max 1100
-Launching oneMKL GEMM calculation...
-DGEMM performance : GFLOPS
+./matrix_mul_mkl single
+oneMKL DPC++ GEMM benchmark
+---------------------------
+Device: Intel(R) Iris(R) Pro Graphics 580
+Core/EU count: 72
+Maximum clock frequency: 950 MHz
+
+Benchmarking (4096 x 4096) x (4096 x 4096) matrix multiplication, single precision
+ -> Initializing data...
+ -> Warmup...
+ -> Timing...
+
+Average performance: ...
+
+./matrix_mul_mkl double
+oneMKL DPC++ GEMM benchmark
+---------------------------
+Device: Intel(R) Iris(R) Pro Graphics 580
+Core/EU count: 72
+Maximum clock frequency: 950 MHz
+
+Benchmarking (4096 x 4096) x (4096 x 4096) matrix multiplication, double precision
+ -> Initializing data...
+ -> Warmup...
+ -> Timing...
+
+Average performance: ...
 ```
 
 ### Troubleshooting
