@@ -35,12 +35,12 @@ Example: generate_kwargs =
 
 1. using HF accelerate
 ```shell
-python -m inference_server.cli --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
+python -m inference_server.cli --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
 ```
 
 2. using DS inference
 ```shell
-python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
+python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
 ```
 
 #### BLOOM server deployment
@@ -51,21 +51,21 @@ python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-
 
 1. using HF accelerate
 ```shell
-python -m inference_server.benchmark --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5
+python -m inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5
 ```
 
 2. using DS inference
 ```shell
-deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
+deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
 ```
 alternatively, to load model faster:
 ```shell
-deepspeed --num_gpus 8 --module inference_server.benchmark --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
+deepspeed --num_gpus 8 --module inference_server.benchmark --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
 ```
 
 3. using DS ZeRO
 ```shell
-deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
+deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
 ```
 
 ## Support
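As a side note on the commands in this diff: the `--generate_kwargs` value is a JSON string that must parse to a dictionary of generation parameters. A minimal sketch of such parsing, using a hypothetical helper name (this is not the repo's actual implementation, just an illustration of the expected argument shape):

```python
import json

# Hypothetical helper: validate a --generate_kwargs CLI value.
# JSON booleans (true/false) become Python True/False after parsing.
def parse_generate_kwargs(arg: str) -> dict:
    kwargs = json.loads(arg)
    if not isinstance(kwargs, dict):
        raise ValueError("--generate_kwargs must be a JSON object")
    return kwargs

kw = parse_generate_kwargs(
    '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
)
```

Note that the JSON is wrapped in single quotes on the command line so the shell passes it through unmodified.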