
Commit 84ac213

Add benchmark results (#76)

1 parent 48a8a22 commit 84ac213

File tree

2 files changed: +65 -1 lines

README.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -112,7 +112,7 @@ go to the deps/JetStream folder (downloaded during `install_everything.sh`)
 cd deps/JetStream
 wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
 export dataset_path=ShareGPT_V3_unfiltered_cleaned_split.json
-python benchmarks/benchmark_serving.py --tokenizer $tokenizer_path --num-prompts 2000 --dataset-path $dataset_path --dataset sharegpt --save-request-outputs
+python benchmarks/benchmark_serving.py --tokenizer $tokenizer_path --num-prompts 2000 --dataset-path $dataset_path --dataset sharegpt --save-request-outputs --warm-up=True
 ```
 Please look at `deps/JetStream/benchmarks/README.md` for more information.
````

benchmarks/summary.md

Lines changed: 64 additions & 0 deletions
# Benchmark results of various models

## Llama 3 - 8B

Date | Device | dtype | batch size | cache length | max input length | max output length | throughput (token/s)
---- | ------ | ----- | ---------- | ------------ | ---------------- | ----------------- | --------------------
2024-04-24 | TPU v5e-8 | bfloat16 | 128 | 2048 | 1024 | 1024 | 8249
2024-04-24 | TPU v5e-8 | int8 | 256 | 2048 | 1024 | 1024 | 10873
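Throughput in these tables is generated tokens per second over the whole benchmark run, aggregated across the v5e-8 slice. As an illustrative sketch (not the exact accounting done by `benchmark_serving.py`; `RequestResult` and its field are hypothetical names), the metric reduces to total output tokens divided by wall time:

```python
from dataclasses import dataclass

@dataclass
class RequestResult:
    """Minimal stand-in for one benchmarked request (hypothetical shape)."""
    output_tokens: int  # tokens generated for this request

def throughput_tokens_per_sec(results, wall_time_sec):
    """Aggregate decode throughput: total generated tokens / wall time."""
    total_tokens = sum(r.output_tokens for r in results)
    return total_tokens / wall_time_sec

# Example: 2000 requests averaging 256 output tokens, finishing in 62 s
results = [RequestResult(output_tokens=256) for _ in range(2000)]
print(round(throughput_tokens_per_sec(results, 62.0)))  # 2000 * 256 / 62 ≈ 8258
```

Note that per-request latency does not appear in this number: larger batch sizes can raise aggregate throughput while slowing individual requests.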
## Gemma - 7B

Date | Device | dtype | batch size | cache length | max input length | max output length | throughput (token/s)
---- | ------ | ----- | ---------- | ------------ | ---------------- | ----------------- | --------------------
2024-05-10 | TPU v5e-8 | bfloat16 | 96 | 2048 | 1024 | 1024 | 3236
2024-05-10 | TPU v5e-8 | int8 | 128 | 2048 | 1024 | 1024 | 4695
## Llama 2 - 7B

Date | Device | dtype | batch size | cache length | max input length | max output length | throughput (token/s)
---- | ------ | ----- | ---------- | ------------ | ---------------- | ----------------- | --------------------
2024-03-28 | TPU v5e-8 | bfloat16 | 96 | 2048 | 1024 | 1024 | 3663
2024-03-28 | TPU v5e-8 | int8 | 96 | 2048 | 1024 | 1024 | 4783
## Llama 2 - 13B

Date | Device | dtype | batch size | cache length | max input length | max output length | throughput (token/s)
---- | ------ | ----- | ---------- | ------------ | ---------------- | ----------------- | --------------------
2024-03-28 | TPU v5e-8 | bfloat16 | 48 | 2048 | 1024 | 1024 | 2056
2024-03-28 | TPU v5e-8 | int8 | 96 | 2048 | 1024 | 1024 | 3458
2024-03-28 | TPU v5e-8 | bfloat16 | 80 | 1280 | 1024 | 1024 | 2911
2024-03-28 | TPU v5e-8 | int8 | 96 | 1280 | 1024 | 1024 | 3938
**NOTE:** When the cache length is less than max input length + max output length, we employ *rolling window attention*.
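The rolling-window idea can be sketched as a ring buffer (a minimal illustration, not this repo's actual kernel): once the sequence outgrows the cache, the key/value entry for each new token overwrites the oldest slot at `pos % cache_length`, so attention only ever sees the most recent `cache_length` tokens.

```python
import numpy as np

def rolling_cache_update(cache, new_kv, pos):
    """Write the step-`pos` key/value entry into a fixed-size ring buffer.

    cache:  [cache_length, head_dim] ring buffer
    new_kv: [head_dim] entry for the current token
    pos:    absolute token position (may exceed cache_length)
    """
    cache_length = cache.shape[0]
    cache[pos % cache_length] = new_kv  # overwrite the oldest slot
    return cache

# With cache_length=4, after positions 0..5 only tokens 2..5 remain resident:
cache = np.full((4, 1), -1.0)
for pos in range(6):
    rolling_cache_update(cache, np.array([float(pos)]), pos)
print(cache.ravel().tolist())  # [4.0, 5.0, 2.0, 3.0]
```

This is why a 1280-token cache can still serve 1024-in / 1024-out requests in the Llama 2 - 13B rows above, trading some attention context for a larger feasible batch.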
# Instructions to reproduce

Please refer to [README.md](README.md) for instructions on how to get the model weights.

**NOTE:** Different weights can produce different benchmark results (due to generating different sentence lengths). For Llama, we used the `-chat` versions of the weights. For Gemma, we used the `-it` (instruction-finetuned) version of the weights.
## Run the server

**NOTE:** In `--platform=tpu=8` you need to specify the number of TPU devices (which is 4 for v4-8 and 8 for v5light-8).

```bash
python run_server.py --param_size=7b --batch_size=128 --max_cache_length=2048 --quantize_weights=$quantize --quantize_kv_cache=$quantize --checkpoint_path=$output_ckpt_dir --tokenizer_path=$tokenizer_path --platform=tpu=8 --model=$model_name
```

Now you can send gRPC requests to it.
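The feasible `--batch_size` for a given `--max_cache_length` is bounded by HBM, which at these settings is dominated by the KV cache. A rough back-of-the-envelope check (assumed Llama-2-7B-like shapes: 32 layers, 32 KV heads, head dim 128, bfloat16 at 2 bytes; actual models may differ, e.g. fewer KV heads):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, batch, cache_len, bytes_per_el):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * batch * cache_len * bytes_per_el

gib = kv_cache_bytes(32, 32, 128, batch=96, cache_len=2048, bytes_per_el=2) / 2**30
print(f"{gib:.0f} GiB")  # 96 GiB across the slice at batch=96, cache_len=2048
```

Halving `bytes_per_el` via `--quantize_kv_cache` roughly doubles the batch size that fits, which is consistent with the int8 rows above running at larger batches.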
# Run benchmark

Go to the `deps/JetStream` folder (downloaded during `install_everything.sh`):

```bash
cd deps/JetStream
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export dataset_path=ShareGPT_V3_unfiltered_cleaned_split.json
python benchmarks/benchmark_serving.py --tokenizer $tokenizer_path --num-prompts 2000 --dataset-path $dataset_path --dataset sharegpt --save-request-outputs --warm-up=True
```

Please look at `deps/JetStream/benchmarks/README.md` for more information.
