# Benchmarking Script for Large Language Models

This script provides a unified approach to estimating performance for Large Language Models (LLMs). It leverages pipelines provided by Optimum-Intel and allows performance estimation for PyTorch and OpenVINO models with nearly identical code and pre-collected models.

### 1. Prepare Python Virtual Environment for LLM Benchmarking

```bash
python3 -m venv ov-llm-bench-env
source ov-llm-bench-env/bin/activate
pip install --upgrade pip

git clone https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai/llm_bench/python/
pip install -r requirements.txt
```

> Note:
> For existing Python environments, run the following command to ensure that all dependencies are installed with the latest versions:
> `pip install -U --upgrade-strategy eager -r requirements.txt`
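
Optionally, you can confirm the environment is usable before moving on. This quick check is not part of the original setup steps; it simply verifies that the installed packages import correctly:

```bash
# Optional sanity check: print the installed OpenVINO version and confirm optimum-intel imports
python -c "from openvino.runtime import get_version; print(get_version())"
python -c "import optimum.intel; print('optimum-intel is installed')"
```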

#### (Optional) Hugging Face Login

Log in to Hugging Face if you want to use non-public models:

```bash
huggingface-cli login
```
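
For non-interactive environments (for example, CI), the token can be passed directly instead of entering it at the prompt; `HF_TOKEN` below is an assumed environment variable holding your access token:

```bash
# Non-interactive login using a pre-generated access token (HF_TOKEN is assumed to be set)
huggingface-cli login --token "$HF_TOKEN"
```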

### 2. Convert Model to OpenVINO IR Format

The `optimum-cli` tool simplifies converting Hugging Face models to OpenVINO IR format.
- Detailed documentation can be found in the [Optimum-Intel documentation](https://huggingface.co/docs/optimum/main/en/intel/openvino/export).
- To learn more about weight compression, see the [NNCF Weight Compression Guide](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/weight-compression.html).
- For additional guidance on running inference with OpenVINO for LLMs, see the [OpenVINO LLM Inference Guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html).

**Usage:**

```bash
optimum-cli export openvino --model <MODEL_ID> --weight-format <PRECISION> <OUTPUT_DIR>

optimum-cli export openvino -h # For detailed information
```

* `--model <MODEL_ID>`: model ID for downloading from the [Hugging Face Hub](https://huggingface.co/models), or a path to a local directory containing the PyTorch model.
* `--weight-format <PRECISION>`: precision for model conversion. Available options: `fp32, fp16, int8, int4, mxfp4`.
* `<OUTPUT_DIR>`: output directory for saving the generated OpenVINO model.

**NOTE:**
- Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable this with `--weight-format fp32`.

**Example:**
```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp16 models/llama-2-7b-chat
```
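
Weight compression can be requested the same way. A sketch using the `int4` option listed above; the output directory name here is chosen for illustration only:

```bash
# 4-bit weight compression; the output directory name is illustrative
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format int4 models/llama-2-7b-chat-int4
```
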
**Resulting file structure (for the `fp16` export above):**

```console
models
└── llama-2-7b-chat
    ├── config.json
    ├── generation_config.json
    ├── openvino_detokenizer.bin
    ├── openvino_detokenizer.xml
    ├── openvino_model.bin
    ├── openvino_model.xml
    ├── openvino_tokenizer.bin
    ├── openvino_tokenizer.xml
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── tokenizer.model
```

### 3. Benchmark LLM Model

To benchmark the performance of the LLM, use the following command:

```bash
python benchmark.py -m <model> -d <device> -r <report_csv> -f <framework> -p <prompt text> -n <num_iters>
# e.g.
python benchmark.py -m models/llama-2-7b-chat/ -n 2
python benchmark.py -m models/llama-2-7b-chat/ -p "What is openvino?" -n 2
python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/llama-2-7b-chat_l.jsonl -n 2
```

**Parameters:**
- `-m`: Path to the model.
- `-d`: Inference device (default: CPU).
- `-r`: Path to the CSV report.
- `-f`: Framework (default: ov).
- `-p`: Interactive prompt text.
- `-pf`: Path to a JSONL file containing prompts.
- `-n`: Number of iterations (default: 0; the first iteration is excluded).
- `-ic`: Limit on the number of output tokens (default: 512) for text generation and code generation models.

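For instance, several of these options can be combined in a single run; the report filename below is arbitrary:

```bash
# Run 2 iterations on CPU, cap generation at 256 output tokens, and write a CSV report
python benchmark.py -m models/llama-2-7b-chat/ -d CPU -n 2 -ic 256 -r report.csv
```
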
**Additional options:**
```bash
python ./benchmark.py -h # for more information
```

#### Benchmarking the Original PyTorch Model

To benchmark the original PyTorch model, first download the model locally, then run the benchmark with PyTorch as the framework by passing `-f pt`:

```bash
# Download PyTorch Model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
# Benchmark with PyTorch Framework
python benchmark.py -m models/llama-2-7b-chat/pytorch -n 2 -f pt
```

> **Note:** If needed, you can install a specific OpenVINO version using pip:
> ```bash
> # e.g.
> pip install openvino==2024.4.0
> # Optionally, install the OpenVINO nightly package instead.
> # OpenVINO nightly is pre-release software and has not undergone full release validation or qualification.
> pip uninstall openvino
> pip install --upgrade --pre openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
> ```

### 4. Benchmark LLM with `torch.compile()`

The `--torch_compile_backend` option enables you to use `torch.compile()` to accelerate PyTorch models by compiling them into optimized kernels with a specified backend.

Before benchmarking, download the original PyTorch model locally:

```bash
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
```

To run the benchmarking script with `torch.compile()`, use the `--torch_compile_backend` option to specify the backend. You can choose between `pytorch` and `openvino` (default). Example:

```bash
python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend openvino
```
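
For comparison, the same run can target the stock `pytorch` backend instead of the default `openvino` backend:

```bash
# Same benchmark, compiled with the pytorch backend instead
python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend pytorch
```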

> **Note:** To use `torch.compile()` with CUDA GPUs, you need to install the nightly version of PyTorch:
>
> ```bash
> pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
> ```

### 5. Running on 2-Socket Platforms

The benchmarking script sets `openvino.properties.streams.num(1)` by default. For multi-socket platforms, use `numactl` on Linux or the `--load_config` option to modify this behavior.

| OpenVINO Version | Behavior |
|:-----------------|:-----------------------------------------------------------------|
| Before 2024.0.0  | `streams.num(1)`<br>executes on both sockets.                     |
| 2024.0.0         | `streams.num(1)`<br>executes on the same socket as the application is running on. |

For example, passing `--load_config config.json` with the following content results in `streams.num(1)` and execution on both sockets:
```json
{
    "INFERENCE_NUM_THREADS": <NUMBER>
}
```
`<NUMBER>` is the total number of physical cores across both sockets.
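
A minimal sketch of both approaches on Linux; the NUMA node IDs are illustrative, and `config.json` is the file shown above:

```bash
# Option A: pin the benchmark to a single socket with numactl
numactl --cpunodebind=0 --membind=0 python benchmark.py -m models/llama-2-7b-chat/ -d CPU -n 2

# Option B: let inference span both sockets via the thread-count config shown above
python benchmark.py -m models/llama-2-7b-chat/ -d CPU -n 2 --load_config config.json
```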

### 6. Execution on CPU Device

OpenVINO is built with the [oneTBB](https://github.com/oneapi-src/oneTBB/) threading library by default, while Torch uses [OpenMP](https://www.openmp.org/). Both threading libraries use ['busy-wait spin'](https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html) by default. When running the LLM pipeline on a CPU device, this causes threading overhead when switching between inference with OpenVINO (oneTBB) and postprocessing (for example, greedy search or beam search) with Torch (OpenMP).

**Alternative solutions:**
1. Use the `--genai` option, which uses the OpenVINO GenAI API instead of the Optimum-Intel API. In this case, postprocessing is executed with the OpenVINO GenAI API.
2. Without the `--genai` option (i.e., using the Optimum-Intel API), set the environment variable [OMP_WAIT_POLICY](https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html) to `PASSIVE` to disable the OpenMP 'busy-wait'; `benchmark.py` will also limit the number of Torch threads to avoid using CPU cores kept in 'busy-wait' by OpenVINO inference.
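
As an illustration of the two options above (the model path is reused from the earlier examples):

```bash
# Option 1: benchmark through the OpenVINO GenAI API
python benchmark.py -m models/llama-2-7b-chat/ -d CPU -n 2 --genai

# Option 2: keep the Optimum-Intel path, but stop OpenMP threads from busy-waiting
OMP_WAIT_POLICY=PASSIVE python benchmark.py -m models/llama-2-7b-chat/ -d CPU -n 2
```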

### 7. Additional Resources

- **Error Troubleshooting:** Check [NOTES.md](./doc/NOTES.md) for solutions to known issues.
- **Image Generation Configuration:** Refer to [IMAGE_GEN.md](./doc/IMAGE_GEN.md) for setting parameters for image generation models.

> [!IMPORTANT]
> The LLM bench code has been moved to the [tools](../../tools/llm_bench/) directory. Please use the tool from its new location.