# [Purpose]
- This page shows the report produced by a default run of FMBench, to help establish the basic concepts of benchmarking.
- Below are the contents of the report.md file generated after running the basic Quickstart from the fmbench Git repository.
  - [Foundation Model benchmarking tool (FMBench) built using Amazon SageMaker](https://github.com/aws-samples/foundation-model-benchmarking-tool)

---

# Results for performance benchmarking

**Last modified (UTC): 2024-04-09 10:28:00.510046**

## Summary

We ran performance benchmarks for the `Llama2-7b` model on `ml.g5.2xlarge` and `ml.g5.xlarge` instances across multiple datasets. Based on the test results, the best price performance for the `en_1000-2000` dataset is provided by the `ml.g5.xlarge` instance type.

| Information | Value |
|-----|-----|
| experiment_name | llama2-7b-g5.xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0 |
| payload_file | payload_en_1000-2000.jsonl |
| instance_type | ml.g5.xlarge |
| concurrency | 1 |
| error_rate | 0.0 |
| prompt_token_count_mean | 1626 |
| prompt_token_throughput | 1307 |
| completion_token_count_mean | 35 |
| completion_token_throughput | 13 (tokens/sec, i.e., the number of tokens processed per second) |
| latency_mean | 2.09 |
| latency_p50 | 2.09 |
| latency_p95 | 2.09 |
| latency_p99 | 2.09 |
| transactions_per_minute | 50 |
| price_per_hour | 1.006 |
| price_per_txn | 0.000335 |
| price_per_token | 0.00000020 |

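The derived price metrics in the table can be reproduced from the raw numbers. The sketch below is an assumption inferred from the values above (not FMBench's actual source code): price per transaction is the hourly price divided by transactions per hour, and price per token is the hourly price divided by the total tokens (prompt + completion) processed per hour.

```python
# Minimal sketch reproducing the derived price metrics from the summary
# table. Formulas are assumptions inferred from the reported values,
# not FMBench's actual implementation.

price_per_hour = 1.006            # USD/hour for ml.g5.xlarge
transactions_per_minute = 50
prompt_token_throughput = 1307    # tokens/sec
completion_token_throughput = 13  # tokens/sec

# Price per transaction: hourly price / transactions per hour.
price_per_txn = price_per_hour / (transactions_per_minute * 60)
print(f"{price_per_txn:.6f}")     # 0.000335, matching the table

# Price per token: hourly price / total tokens processed per hour.
tokens_per_hour = (prompt_token_throughput + completion_token_throughput) * 3600
price_per_token = price_per_hour / tokens_per_hour
print(f"{price_per_token:.8f}")   # ~0.00000021 (the table rounds to 0.00000020)
```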

The price performance comparison for different instance types is presented below:

![Price performance comparison](business_summary.png)

The configuration used for these tests is available in the [`config`](config-llama2-7b-g5-quick.yml) file.

The cost to run each experiment is provided in the table below. The total cost for running all experiments is $0.39.

| experiment_name | instance_type | duration_in_seconds | cost |
|-----|-----|-----|-----|
| llama2-7b-g5.xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0 | ml.g5.xlarge | 626.45 | 0.18 |
| llama2-7b-g5.2xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0 | ml.g5.2xlarge | 624.54 | 0.21 |
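
The per-experiment cost is the run duration converted to hours, multiplied by the instance's hourly price. A minimal sketch of that calculation (the ml.g5.xlarge price comes from the report above; the ml.g5.2xlarge price of $1.212/hour is an assumption based on public SageMaker pricing, consistent with the table):

```python
# Minimal sketch of the cost column: duration (hours) * hourly price.

def experiment_cost(duration_in_seconds: float, price_per_hour: float) -> float:
    """Cost in USD of one experiment run."""
    return duration_in_seconds / 3600 * price_per_hour

print(round(experiment_cost(626.45, 1.006), 2))  # 0.18 (ml.g5.xlarge, price from report)
print(round(experiment_cost(624.54, 1.212), 2))  # 0.21 (ml.g5.2xlarge, assumed $1.212/hr)
```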

## Per instance results

The following table provides the best combinations for running inference for prompts of different sizes on different instance types. The following dataset(s) were used for this test: `2wikimqa_e.jsonl`, `2wikimqa.jsonl`, `hotpotqa_e.jsonl`, `hotpotqa.jsonl`, `narrativeqa.jsonl`, `triviaqa_e.jsonl`, `triviaqa.jsonl`.

|Dataset | Instance type | Recommendation |
|---|---|---|
|`payload_en_1-500.jsonl`|`ml.g5.2xlarge`|The best option for staying within a latency budget of `20 seconds` on a `ml.g5.2xlarge` for the `payload_en_1-500.jsonl` dataset is a `concurrency level of 4`. A concurrency level of 4 achieves an `average latency of 1.21 seconds`, for an `average prompt size of 207 tokens` and `completion size of 14 tokens` with `144 transactions/minute`.|
|`payload_en_1000-2000.jsonl`|`ml.g5.2xlarge`|The best option for staying within a latency budget of `20 seconds` on a `ml.g5.2xlarge` for the `payload_en_1000-2000.jsonl` dataset is a `concurrency level of 4`. A concurrency level of 4 achieves an `average latency of 3.34 seconds`, for an `average prompt size of 1626 tokens` and `completion size of 34 tokens` with `51 transactions/minute`.|
|`payload_en_2000-3000.jsonl`|`ml.g5.2xlarge`|The best option for staying within a latency budget of `20 seconds` on a `ml.g5.2xlarge` for the `payload_en_2000-3000.jsonl` dataset is a `concurrency level of 4`. A concurrency level of 4 achieves an `average latency of 4.32 seconds`, for an `average prompt size of 2538 tokens` and `completion size of 32 tokens` with `42 transactions/minute`.|
|`payload_en_500-1000.jsonl`|`ml.g5.2xlarge`|The best option for staying within a latency budget of `20 seconds` on a `ml.g5.2xlarge` for the `payload_en_500-1000.jsonl` dataset is a `concurrency level of 4`. A concurrency level of 4 achieves an `average latency of 2.77 seconds`, for an `average prompt size of 763 tokens` and `completion size of 41 tokens` with `122 transactions/minute`.|
|`payload_en_1-500.jsonl`|`ml.g5.xlarge`|The best option for staying within a latency budget of `20 seconds` on a `ml.g5.xlarge` for the `payload_en_1-500.jsonl` dataset is a `concurrency level of 4`. A concurrency level of 4 achieves an `average latency of 1.11 seconds`, for an `average prompt size of 207 tokens` and `completion size of 16 tokens` with `174 transactions/minute`.|
|`payload_en_1000-2000.jsonl`|`ml.g5.xlarge`|The best option for staying within a latency budget of `20 seconds` on a `ml.g5.xlarge` for the `payload_en_1000-2000.jsonl` dataset is a `concurrency level of 4`. A concurrency level of 4 achieves an `average latency of 3.37 seconds`, for an `average prompt size of 1626 tokens` and `completion size of 35 tokens` with `50 transactions/minute`.|
|`payload_en_2000-3000.jsonl`|`ml.g5.xlarge`|The best option for staying within a latency budget of `20 seconds` on a `ml.g5.xlarge` for the `payload_en_2000-3000.jsonl` dataset is a `concurrency level of 4`. A concurrency level of 4 achieves an `average latency of 4.4 seconds`, for an `average prompt size of 2538 tokens` and `completion size of 33 tokens` with `38 transactions/minute`.|
|`payload_en_500-1000.jsonl`|`ml.g5.xlarge`|The best option for staying within a latency budget of `20 seconds` on a `ml.g5.xlarge` for the `payload_en_500-1000.jsonl` dataset is a `concurrency level of 4`. A concurrency level of 4 achieves an `average latency of 2.76 seconds`, for an `average prompt size of 763 tokens` and `completion size of 41 tokens` with `122 transactions/minute`.|
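
Each recommendation above appears to pick the highest tested concurrency level whose average latency stays within the 20-second budget. A minimal sketch of that selection rule, assuming hypothetical per-concurrency results (only the concurrency-1 and concurrency-4 latencies come from this report; the intermediate value is illustrative):

```python
# Minimal sketch of the recommendation rule: highest tested concurrency
# whose average latency stays within the latency budget.
# The concurrency-2 latency below is hypothetical, for illustration only.

results = [
    # (concurrency level, average latency in seconds)
    (1, 2.09),
    (2, 2.80),  # hypothetical intermediate measurement
    (4, 3.37),
]

LATENCY_BUDGET_S = 20.0

best_concurrency = max(c for c, latency in results if latency <= LATENCY_BUDGET_S)
print(best_concurrency)  # 4, as recommended for payload_en_1000-2000 on ml.g5.xlarge
```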

## Plots

The following plots provide insights into the results from the different experiments run.

![Latency vs token count](latency_vs_tokens.png)
![Concurrency vs latency](concurrency_vs_latency.png)
![Concurrency vs transactions per minute](concurrency_vs_tpm.png)
![Token throughput](token_throughput.png)