# TraceReplay Benchmark Tool

TraceReplay accurately replays real-world request traces with their original timing, or dynamically generates requests from popular datasets. The tool reports comprehensive performance metrics, including Time to First Token (TTFT), Time Per Output Token (TPOT), Inter-Token Latency (ITL), End-to-End Latency, and Goodput.

---

## 1. Overview

The Trace Replay feature covers request generation, request sending and response receiving, and result calculation and saving. It reproduces historical requests from a Mooncake trace file, sending each request strictly according to the timestamp recorded in the trace. After execution, Trace Replay calculates key performance metrics such as Time to First Token (TTFT) and Time Per Output Token (TPOT), then prints the results to the terminal and saves them to an Excel file.
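
Conceptually, strict-timestamp replay just schedules each request at its recorded offset from the first trace entry. Below is a minimal sketch of that scheduling loop, not the tool's actual implementation; `send_request` is a placeholder for an async HTTP call to the inference server:

```python
import asyncio
import json
import time

async def replay(trace_path: str, send_request) -> None:
    """Send each trace entry at its original offset from the first timestamp."""
    with open(trace_path) as f:
        entries = [json.loads(line) for line in f]
    trace_origin_ms = entries[0]["timestamp"]   # trace time origin (ms)
    wall_origin = time.monotonic()              # wall-clock time origin (s)
    tasks = []
    for entry in entries:
        offset_s = (entry["timestamp"] - trace_origin_ms) / 1000.0
        delay = wall_origin + offset_s - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)          # wait until the recorded offset
        # Fire the request without blocking the schedule for later requests.
        tasks.append(asyncio.create_task(send_request(entry)))
    await asyncio.gather(*tasks)                # wait for all responses
```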

[Mooncake traces](https://github.com/kvcache-ai/Mooncake/tree/main/FAST25-release/traces) consist of two types of trace data:
* Conversation and Tool&Agent trace: Sampled from one hour of online request data from different clusters.
* Synthetic trace: Generated synthetically from other publicly available datasets.

For more information, please refer to the Mooncake paper: [Mooncake-FAST25.pdf](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/Mooncake-FAST25.pdf).

Trace Replay supports two request-generation methods:
* Hash ID-based: Input tokens are generated from the input_length and hash_ids recorded in the trace file. Each hash_id corresponds to a block of 512 tokens, and the same hash_id always maps to the identical token sequence (see the sketch after this list).
* Dataset-based: Prompts are generated by invoking vLLM's benchmark module using the input_length from the trace file and the user-specified dataset name. This approach does not rely on the hash_ids in the trace file.
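
One way to satisfy the "same hash_id, identical tokens" property is to seed a random generator with each hash_id and draw one 512-token block per ID, truncating the concatenation to input_length. This is a minimal sketch of that idea; the tool's actual generation scheme, `VOCAB_SIZE`, and the function name are assumptions:

```python
import random

BLOCK_SIZE = 512      # tokens per hash_id block, per the trace format
VOCAB_SIZE = 32000    # assumed tokenizer vocabulary size

def tokens_from_hash_ids(hash_ids: list[int], input_length: int) -> list[int]:
    """Deterministically expand hash_ids into input_length token IDs."""
    tokens: list[int] = []
    for hid in hash_ids:
        rng = random.Random(hid)  # same hash_id -> same 512-token block
        tokens.extend(rng.randrange(VOCAB_SIZE) for _ in range(BLOCK_SIZE))
    return tokens[:input_length]  # the final block may be only partially used
```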

Depending on the request-generation method, Trace Replay offers two modes, Trace Mode and Benchmark Mode, which can be selected via the `--trace-mode` parameter.

---

## 2. Parameters

| Argument | Default | Help |
|-----------|---------|---------|
| --backend | None | Backend framework type |
| --model | None | Model path |
| --host | localhost | IP address of the inference server |
| --port | None | Port number of the inference server |
| --trace-path | None | Path to the trace jsonl file |
| --trace-mode | trace | 'trace' to replay requests from cached trace files; 'benchmark' to generate requests dynamically using the benchmark module |
| --dataset-name | sharegpt | Dataset to use when benchmark mode is enabled; refer to the [vLLM benchmark documentation](https://github.com/vllm-project/vllm/blob/releases/v0.9.1/benchmarks/README.md) |
| --save-prompts | False | Save generated prompts with timestamps for reuse |
| --save-result | False | Save the benchmark metrics to an Excel file |
| --result-dir | None | Path to save results |

---

## 3. Example

### 1. Download example trace

You need to download a trace jsonl file from [Mooncake traces](https://github.com/kvcache-ai/Mooncake/tree/main/FAST25-release/traces). In the trace, each line is a JSON object representing a single request:

```
{
  "timestamp": 1696000000123,   // ms since epoch
  "input_length": 512,          // number of input tokens
  "output_length": 128,         // expected output tokens
  "hash_ids": [123, 456, 789]   // seed list for deterministic prompt generation
}
```
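
Before replaying, a quick pass over the file helps confirm it parses and matches expectations. An illustrative snippet (the filename is an example; field names follow the format above):

```python
import json

# Load the downloaded trace and summarize it.
with open("conversation_trace.jsonl") as f:
    entries = [json.loads(line) for line in f]

span_s = (entries[-1]["timestamp"] - entries[0]["timestamp"]) / 1000.0
print(f"requests:    {len(entries)}")
print(f"trace span:  {span_s:.1f} s")
print(f"mean input:  {sum(e['input_length'] for e in entries) / len(entries):.0f} tokens")
print(f"mean output: {sum(e['output_length'] for e in entries) / len(entries):.0f} tokens")
```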

### 2. Set environment variable

Trace Replay depends on [vLLM's benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) module, which you need to download separately. Before running Trace Replay, you must set the path to the benchmark module via an environment variable:

```bash
export BENCHMARK_PATH="/vllm-workspace/benchmarks"
```
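
A plausible way for a script to consume this variable is to prepend it to `sys.path` before importing from the benchmark module; this sketch illustrates the mechanism and is not necessarily how trace_replay.py actually does it:

```python
import os
import sys

# Make vLLM's benchmarks/ directory importable using the variable set above.
benchmark_path = os.environ.get("BENCHMARK_PATH")
if not benchmark_path or not os.path.isdir(benchmark_path):
    sys.exit("BENCHMARK_PATH is unset or does not point to a directory")
sys.path.insert(0, benchmark_path)
```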

### 3. Basic Usage

Execute the Python script to replay a trace against a local vLLM instance:

```bash
python3 /trace_replay.py \
    --model /home/models/dsv2-lite \
    --backend vllm \
    --trace-path /conversation_trace.jsonl \
    --trace-mode trace \
    --host 127.0.0.1 \
    --port 8000 \
    --save-result \
    --save-prompts
```
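
To run the same trace in Benchmark Mode instead, change `--trace-mode trace` to `--trace-mode benchmark` and select a dataset with `--dataset-name` (the default is sharegpt); input lengths still come from the trace file, while prompts are drawn from the dataset via vLLM's benchmark module.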

### 4. Results

Successful execution results in output similar to the following:

```
============ Serving Benchmark Result ============
Successful requests: 510
Benchmark duration (s): 301.46
Total input tokens: 7201515
Total generated tokens: 185502
Request throughput (req/s): 1.69
Output token throughput (tok/s): 615.34
Total Token throughput (tok/s): 24504.02
---------------Time to First Token----------------
Mean TTFT (ms): 20931.33
Median TTFT (ms): 19119.63
Std TTFT (ms): 17324.45
P25 TTFT (ms): 4057.98
P50 TTFT (ms): 19119.63
P75 TTFT (ms): 33284.55
P99 TTFT (ms): 64592.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 187.71
Median TPOT (ms): 200.69
Std TPOT (ms): 63.08
P25 TPOT (ms): 144.17
P50 TPOT (ms): 200.69
P75 TPOT (ms): 234.55
P99 TPOT (ms): 312.87
---------------Inter-token Latency----------------
Mean ITL (ms): 181.20
Median ITL (ms): 169.18
Std ITL (ms): 133.70
P25 ITL (ms): 86.63
P50 ITL (ms): 169.18
P75 ITL (ms): 230.91
P99 ITL (ms): 647.04
----------------End-to-end Latency----------------
Mean E2EL (ms): 86656.79
Median E2EL (ms): 89218.82
Std E2EL (ms): 43454.94
P25 E2EL (ms): 53935.13
P50 E2EL (ms): 89218.82
P75 E2EL (ms): 120761.34
P99 E2EL (ms): 171262.27
==================================================
```
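
As a reading aid: per the definition commonly used in vLLM's serving benchmark, a request's TPOT is derived from its end-to-end latency and TTFT as TPOT = (E2EL - TTFT) / (output_tokens - 1). Note that the reported mean TPOT is a mean over per-request values, so it will not equal the ratio of the aggregate means above. A small illustrative computation:

```python
def tpot_ms(ttft_ms: float, e2el_ms: float, output_tokens: int) -> float:
    """Per-request TPOT: decode time spread over tokens after the first."""
    return (e2el_ms - ttft_ms) / max(output_tokens - 1, 1)

# Example: 128 output tokens, 500 ms TTFT, 25 s end-to-end latency.
print(f"{tpot_ms(500.0, 25_000.0, 128):.1f} ms")  # (25000 - 500) / 127 = 192.9
```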