Description
Proposal to improve performance
By default, vLLM collects model support info in a separate subprocess per model
(added in #9233). Specifically, this
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/registry.py#L336
_run_in_subprocess call.
This adds ~4s when running against a local SSD, and can easily double or more
against a network filesystem in some environments. Collecting the info
in-process does not seem to have adverse effects, at least based on my limited
manual testing, but I lack context on why the subprocess was introduced in the first place.
Can we make this behaviour configurable via a boolean flag or env var, so that
users could opt out? For example, a config field:

```python
collect_model_info_via_subprocess = True
```

used along these lines:

```python
if self.model_config.collect_model_info_via_subprocess:
    return _run_in_subprocess(
        lambda: _ModelInfo.from_model_cls(self.load_model_cls()))
return _ModelInfo.from_model_cls(self.load_model_cls())
```
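An env-var-driven variant could look like the sketch below. The variable name `VLLM_MODEL_INFO_IN_SUBPROCESS` is an assumption for illustration, not an env var vLLM defines today; it defaults to the current behaviour so opting out stays explicit.

```python
import os

def collect_model_info_via_subprocess() -> bool:
    # Hypothetical opt-out knob; "VLLM_MODEL_INFO_IN_SUBPROCESS" is an
    # assumed name. Anything other than "0" keeps today's default of
    # collecting model info in a subprocess.
    return os.environ.get("VLLM_MODEL_INFO_IN_SUBPROCESS", "1") != "0"
```

Defaulting to the existing behaviour keeps the change backwards compatible: only users who set the variable to `0` get in-process collection.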
(Screenshot: the latency appears in the "inspect-model" span, based on my local WIP OpenTelemetry tracing of startup.)
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
