This repository was archived by the owner on Oct 9, 2024. It is now read-only.

Commit 556ccac

Reflect model_class option to server deployment (Fixes #33) (#34)
* Reflect model_class to server deployment (Fixes #31)
* Update inference_server/README.md to reflect model_class option
1 parent dffb799 commit 556ccac

2 files changed: +7, -7 lines

inference_server/README.md

Lines changed: 6 additions & 6 deletions
@@ -35,12 +35,12 @@ Example: generate_kwargs =

1. using HF accelerate
```shell
-python -m inference_server.cli --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
+python -m inference_server.cli --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
```

2. using DS inference
```shell
-python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
+python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
```

#### BLOOM server deployment
@@ -51,21 +51,21 @@ python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-

1. using HF accelerate
```shell
-python -m inference_server.benchmark --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5
+python -m inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5
```

2. using DS inference
```shell
-deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
+deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
```
alternatively, to load model faster:
```shell
-deepspeed --num_gpus 8 --module inference_server.benchmark --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
+deepspeed --num_gpus 8 --module inference_server.benchmark --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
```

3. using DS ZeRO
```shell
-deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
+deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
```

## Support

inference_server/model_handler/deployment.py

Lines changed: 1 addition & 1 deletion
@@ -87,7 +87,7 @@ def _initialize_service(self, args: argparse.Namespace):
        if args.deployment_framework in [DS_INFERENCE, DS_ZERO]:
            ports = " ".join(map(str, self.ports))

-            cmd = f"inference_server.model_handler.launch --model_name {args.model_name} --deployment_framework {args.deployment_framework} --dtype {get_str_dtype(args.dtype)} --port {ports}"
+            cmd = f"inference_server.model_handler.launch --model_name {args.model_name} --deployment_framework {args.deployment_framework} --dtype {get_str_dtype(args.dtype)} --port {ports} --model_class {args.model_class}"

            if args.max_batch_size is not None:
                cmd += f" --max_batch_size {args.max_batch_size}"
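For context, the fix forwards the user's `--model_class` choice into the command that launches the model handler, so the launched process no longer falls back to its default class. Below is a minimal, illustrative sketch of how the receiving side could consume that flag, assuming it resolves the string (e.g. "AutoModelForCausalLM") to a class on the `transformers` package; the helper name `resolve_model_class` and the argparse wiring are hypothetical, not taken from this commit.

```python
# Illustrative sketch only: turning the forwarded --model_class string into a
# transformers class. Not part of this commit's diff.
import argparse

import transformers


def resolve_model_class(name: str):
    # Hypothetical helper: look the name up on the transformers package,
    # e.g. "AutoModelForCausalLM" -> transformers.AutoModelForCausalLM.
    model_class = getattr(transformers, name, None)
    if model_class is None:
        raise ValueError(f"unknown --model_class: {name}")
    return model_class


parser = argparse.ArgumentParser()
parser.add_argument("--model_name", required=True)
parser.add_argument("--model_class", default="AutoModelForCausalLM")
args, _ = parser.parse_known_args()

# e.g. transformers.AutoModelForCausalLM.from_pretrained("bigscience/bloom")
model = resolve_model_class(args.model_class).from_pretrained(args.model_name)
```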
