Commit b05d8e7

[Doc] update document link (#270)

* [Doc] update document link
* [Doc] add the README.md of TraceReplay

1 parent 7a82557 · commit b05d8e7

File tree

3 files changed: +130 −2 lines


README.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
  </p>

  <p align="center">
- | <a href="docs/source/index.md"><b>Documentation</b></a> | <a href="https://modelengine-ai.net/#/ucm"><b>Website</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management/issues/78"><b>RoadMap</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management/blob/main/README_zh.md"><b>中文</b></a> |
+ | <a href="https://ucm.readthedocs.io/en/latest"><b>Documentation</b></a> | <a href="https://modelengine-ai.net/#/ucm"><b>Website</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management/issues/78"><b>RoadMap</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management/blob/main/README_zh.md"><b>中文</b></a> |
  </p>

---

README_zh.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
  </p>

  <p align="center">
- | <a href="docs/source/index.md"><b>文档</b></a> | <a href="https://modelengine-ai.net/#/ucm"><b>网站</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management/issues/78"><b>发展路线图</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management"><b>EN</b></a> |
+ | <a href="https://ucm.readthedocs.io/en/latest"><b>文档</b></a> | <a href="https://modelengine-ai.net/#/ucm"><b>网站</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management/issues/78"><b>发展路线图</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management"><b>EN</b></a> |
  </p>

---

benchmarks/README.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
# TraceReplay Benchmark Tool

TraceReplay accurately replays real-world request traces with their original timing, or dynamically generates requests from popular datasets. The tool reports comprehensive performance metrics, including Time to First Token (TTFT), Time Per Output Token (TPOT), Inter-Token Latency (ITL), End-to-End Latency, and Goodput.

---

## 1. Overview

The Trace Replay feature covers request generation, request sending and response collection, and result calculation and saving. It reproduces historical requests from a Mooncake trace file, sending each request strictly at the timestamp recorded in the trace. After execution, Trace Replay calculates key performance metrics such as Time to First Token (TTFT) and Time Per Output Token (TPOT), prints the results to the terminal, and saves them to an Excel file.

[Mooncake traces](https://github.com/kvcache-ai/Mooncake/tree/main/FAST25-release/traces) consist of two types of trace data:

* Conversation and Tool&Agent traces: sampled from one hour of online request data from different clusters.
* Synthetic traces: generated synthetically from other publicly available datasets.

For more information, see the Mooncake paper: [Mooncake-FAST25.pdf](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/Mooncake-FAST25.pdf).

Trace Replay supports two request-generation methods:

* Hash-ID-based: input tokens are generated from the input_length and hash_ids recorded in the trace file. Each hash_id corresponds to a block of 512 tokens, and the same hash_id always maps to the identical token sequence.
* Dataset-based: prompts are generated by invoking vLLM's benchmark module with the input_length from the trace file and a user-specified dataset name. This approach does not rely on the hash_ids in the trace file.
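
The deterministic hash-ID mapping can be sketched in a few lines. This is an illustrative sketch only, not TraceReplay's actual implementation; the vocabulary size and RNG-based seeding are assumptions made for the example:

```python
import random

BLOCK_SIZE = 512     # tokens per hash_id block, as described above
VOCAB_SIZE = 32000   # assumed vocabulary size, for illustration only

def tokens_for_hash_id(hash_id: int) -> list[int]:
    """Map a hash_id to a fixed block of 512 token IDs.

    Seeding the RNG with the hash_id guarantees that the same hash_id
    always yields the identical token sequence, so prefix-cache hits
    are reproducible across runs.
    """
    rng = random.Random(hash_id)
    return [rng.randrange(VOCAB_SIZE) for _ in range(BLOCK_SIZE)]

def build_prompt_tokens(hash_ids: list[int], input_length: int) -> list[int]:
    """Concatenate the block for each hash_id, truncated to input_length."""
    tokens: list[int] = []
    for hid in hash_ids:
        tokens.extend(tokens_for_hash_id(hid))
    return tokens[:input_length]
```

Because the mapping is a pure function of the hash_id, two requests sharing a hash_id prefix reproduce the same token prefix, which is what makes KV-cache behavior from the original trace observable during replay.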

Depending on the request-generation method, Trace Replay offers two modes, Trace Mode and Benchmark Mode, which the user selects via the --trace-mode parameter.

---

## 2. Parameters

| Argument | Default | Help |
|-----------|---------|---------|
| --backend | None | Backend framework type |
| --model | None | Model path |
| --host | localhost | IP address of the inference server |
| --port | None | Port number of the inference server |
| --trace-path | None | Path to the trace jsonl file |
| --trace-mode | trace | 'trace' to replay requests from cached trace files; 'benchmark' to generate requests dynamically using the benchmark module |
| --dataset-name | sharegpt | Required in benchmark mode; see the [vLLM benchmark documentation](https://github.com/vllm-project/vllm/blob/releases/v0.9.1/benchmarks/README.md) |
| --save-prompts | False | Save generated prompts with timestamps for reuse |
| --save-result | False | Save the benchmark metrics to an Excel file |
| --result-dir | None | Path to save results |

---

## 3. Examples

### 1. Download an example trace

Download a trace jsonl file from [Mooncake traces](https://github.com/kvcache-ai/Mooncake/tree/main/FAST25-release/traces). In the trace, each line is a JSON object representing a single request:

```
{
  "timestamp": 1696000000123,    // ms since epoch
  "input_length": 512,           // number of input tokens
  "output_length": 128,          // expected output tokens
  "hash_ids": [123, 456, 789]    // seed list for deterministic prompt generation
}
```
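
Given this format, the timestamp-faithful replay described in the overview can be sketched as follows. This is an illustrative sketch built on the field names shown above; `send_request` is a hypothetical callback, not part of the tool:

```python
import json
import time

def replay(trace_path: str, send_request) -> None:
    """Replay a jsonl trace, dispatching each request at its recorded offset.

    Delays are computed relative to the first request's timestamp, so the
    original inter-arrival gaps in the trace are preserved.
    """
    start_wall = time.monotonic()
    first_ts = None
    with open(trace_path) as f:
        for line in f:
            req = json.loads(line)
            if first_ts is None:
                first_ts = req["timestamp"]
            # Offset of this request within the trace, in seconds.
            due = (req["timestamp"] - first_ts) / 1000.0
            delay = due - (time.monotonic() - start_wall)
            if delay > 0:
                time.sleep(delay)
            send_request(req)
```

In practice the sends must be asynchronous (the real tool overlaps in-flight requests rather than blocking on each response); the sketch only shows the timing discipline.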

### 2. Set environment variable

Trace Replay depends on [vLLM's benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) module, which you need to download separately. Before running Trace Replay, set the path to the benchmark module via an environment variable:

```bash
export BENCHMARK_PATH="/vllm-workspace/benchmarks"
```

### 3. Basic Usage

Execute the Python script to replay a trace against a local vLLM instance:

```bash
python3 /trace_replay.py \
    --model /home/models/dsv2-lite \
    --backend vllm \
    --trace-path /conversation_trace.jsonl \
    --trace-mode trace \
    --host 127.0.0.1 \
    --port 8000 \
    --save-result \
    --save-prompts
```

### 4. Results

Successful execution produces output similar to the following:

```
============ Serving Benchmark Result ============
Successful requests:                     510
Benchmark duration (s):                  301.46
Total input tokens:                      7201515
Total generated tokens:                  185502
Request throughput (req/s):              1.69
Output token throughput (tok/s):         615.34
Total Token throughput (tok/s):          24504.02
---------------Time to First Token----------------
Mean TTFT (ms):                          20931.33
Median TTFT (ms):                        19119.63
Std TTFT (ms):                           17324.45
P25 TTFT (ms):                           4057.98
P50 TTFT (ms):                           19119.63
P75 TTFT (ms):                           33284.55
P99 TTFT (ms):                           64592.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          187.71
Median TPOT (ms):                        200.69
Std TPOT (ms):                           63.08
P25 TPOT (ms):                           144.17
P50 TPOT (ms):                           200.69
P75 TPOT (ms):                           234.55
P99 TPOT (ms):                           312.87
---------------Inter-token Latency----------------
Mean ITL (ms):                           181.20
Median ITL (ms):                         169.18
Std ITL (ms):                            133.70
P25 ITL (ms):                            86.63
P50 ITL (ms):                            169.18
P75 ITL (ms):                            230.91
P99 ITL (ms):                            647.04
----------------End-to-end Latency----------------
Mean E2EL (ms):                          86656.79
Median E2EL (ms):                        89218.82
Std E2EL (ms):                           43454.94
P25 E2EL (ms):                           53935.13
P50 E2EL (ms):                           89218.82
P75 E2EL (ms):                           120761.34
P99 E2EL (ms):                           171262.27
==================================================
```
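
The latency metrics in this report follow standard serving-benchmark definitions. A minimal sketch of how TTFT, TPOT, ITL, and E2EL can be derived from a single request's token arrival times (illustrative only, not necessarily the tool's exact computation):

```python
def compute_latency_metrics(token_times_ms: list[float]) -> dict:
    """Derive TTFT, TPOT, ITL, and E2EL for one request.

    token_times_ms holds the arrival time of each output token (ms),
    measured from when the request was sent; the first entry is the TTFT
    and the last entry is the end-to-end latency.
    """
    ttft = token_times_ms[0]
    e2el = token_times_ms[-1]
    n = len(token_times_ms)
    # Inter-token latency: gap between consecutive output tokens.
    itl = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    # TPOT averages decode time over all tokens after the first.
    tpot = (e2el - ttft) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}
```

The percentile rows in the report are then just percentiles of these per-request values across all successful requests.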
