# Benchmark results of various models


## Llama 3 - 8B

Date | Device | dtype | batch size | cache length | max input length | max output length | throughput (tokens/s)
---- | ------ | ----- | ---------- | ------------ | ---------------- | ----------------- | ----------------------
2024-04-24 | TPU v5e-8 | bfloat16 | 128 | 2048 | 1024 | 1024 | 8249
2024-04-24 | TPU v5e-8 | int8 | 256 | 2048 | 1024 | 1024 | 10873

## Gemma - 7B

Date | Device | dtype | batch size | cache length | max input length | max output length | throughput (tokens/s)
---- | ------ | ----- | ---------- | ------------ | ---------------- | ----------------- | ----------------------
2024-05-10 | TPU v5e-8 | bfloat16 | 96 | 2048 | 1024 | 1024 | 3236
2024-05-10 | TPU v5e-8 | int8 | 128 | 2048 | 1024 | 1024 | 4695

## Llama 2 - 7B

Date | Device | dtype | batch size | cache length | max input length | max output length | throughput (tokens/s)
---- | ------ | ----- | ---------- | ------------ | ---------------- | ----------------- | ----------------------
2024-03-28 | TPU v5e-8 | bfloat16 | 96 | 2048 | 1024 | 1024 | 3663
2024-03-28 | TPU v5e-8 | int8 | 96 | 2048 | 1024 | 1024 | 4783

## Llama 2 - 13B

Date | Device | dtype | batch size | cache length | max input length | max output length | throughput (tokens/s)
---- | ------ | ----- | ---------- | ------------ | ---------------- | ----------------- | ----------------------
2024-03-28 | TPU v5e-8 | bfloat16 | 48 | 2048 | 1024 | 1024 | 2056
2024-03-28 | TPU v5e-8 | int8 | 96 | 2048 | 1024 | 1024 | 3458
2024-03-28 | TPU v5e-8 | bfloat16 | 80 | 1280 | 1024 | 1024 | 2911
2024-03-28 | TPU v5e-8 | int8 | 96 | 1280 | 1024 | 1024 | 3938

**NOTE:** When the cache length is less than max input length + max output length, we employ *rolling window attention*: once the KV cache fills, the oldest entries are overwritten, so attention only covers the most recent `cache length` tokens. In the 1280-cache rows above, for example, a sequence can reach 1024 + 1024 = 2048 tokens, so up to 768 of the oldest cache entries are overwritten during generation.


# Instructions to reproduce

Please refer to [README.md](README.md) for instructions on how to obtain the model weights.

**NOTE:** Different weights can produce different benchmark results, because they generate sentences of different lengths. For Llama, we used the `-chat` versions of the weights; for Gemma, we used the `-it` (instruction-finetuned) versions.

## Run the server

**NOTE:** The number in `--platform=tpu=8` must match the number of TPU devices: 4 for a v4-8 slice and 8 for a v5light-8.
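
The launch command below references several shell variables that are not set by the script itself. A minimal sketch of what they might hold (every value here is a placeholder, not a required name):

```bash
# Hypothetical values -- substitute your own paths and model choice.
export model_name=llama-2                       # which model family to serve
export quantize=True                            # toggles int8 weight and KV-cache quantization
export output_ckpt_dir=/path/to/converted_ckpt  # checkpoint prepared per README.md
export tokenizer_path=/path/to/tokenizer.model  # tokenizer shipped with the weights
```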

```bash
python run_server.py --param_size=7b --batch_size=128 --max_cache_length=2048 --quantize_weights=$quantize --quantize_kv_cache=$quantize --checkpoint_path=$output_ckpt_dir --tokenizer_path=$tokenizer_path --platform=tpu=8 --model=$model_name
```
Now you can send gRPC requests to the server.
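
For a quick smoke test before running the full benchmark, JetStream ships a small command-line requester. A sketch, assuming the tool is `jetstream/tools/requester.py` in the `deps/JetStream` checkout (verify the entry point and flags with `--help` before relying on them):

```bash
cd deps/JetStream
# Hypothetical invocation -- check the tool's --help for the actual flags.
python -m jetstream.tools.requester --text "Why is the sky blue?"
```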

## Run the benchmark

Go to the `deps/JetStream` folder (downloaded by `install_everything.sh`):

```bash
cd deps/JetStream
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export dataset_path=ShareGPT_V3_unfiltered_cleaned_split.json
python benchmarks/benchmark_serving.py --tokenizer $tokenizer_path --num-prompts 2000 --dataset-path $dataset_path --dataset sharegpt --save-request-outputs --warm-up=True
```
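
For a faster sanity check, you can rerun the same command with fewer prompts; only the sample size changes:

```bash
# Same flags as above, but a smaller sample for a quick smoke test.
python benchmarks/benchmark_serving.py --tokenizer $tokenizer_path --num-prompts 100 --dataset-path $dataset_path --dataset sharegpt --save-request-outputs --warm-up=True
```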
Please look at `deps/JetStream/benchmarks/README.md` for more information.