
Conversation

Contributor

@a4lg a4lg commented Nov 29, 2025

Purpose

If the tokenizer is a Hugging Face model ID, vLLM attempts to reach Hugging Face even when the tokenizer is already available in the local cache.
This makes it impossible to specify a Hugging Face model as the tokenizer in offline mode (i.e. when HF_HUB_OFFLINE is true).

A separate commit also tweaks when the offline-mode path replacement log is emitted: only when the model/tokenizer value actually changes. Logging the replacement is not helpful when the value is unchanged (e.g. when the model is a local GGUF file).
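The replacement-plus-logging behavior described above can be sketched roughly as follows. This is a hypothetical helper with made-up names, not vLLM's actual implementation (which lives in arg_utils.py); the `resolver` callback stands in for a lookup of the cached snapshot, e.g. via huggingface_hub's snapshot_download with local_files_only=True:

```python
import os
from typing import Callable

def resolve_offline(name: str, resolver: Callable[[str], str],
                    log: Callable[[str], None] = print,
                    kind: str = "model") -> str:
    """Rewrite a Hugging Face repo ID to its local snapshot path when offline.

    `resolver` maps a repo ID to a cached local path.  The replacement is
    logged only when the value actually changes, so a local directory or a
    GGUF file path passes through silently.
    """
    if os.path.exists(name):          # already a local path: nothing to do
        return name
    resolved = resolver(name)
    if resolved != name:              # log only on an actual rewrite
        log(f"HF_HUB_OFFLINE is True, replace {kind}_id [{name}] "
            f"to {kind}_path [{resolved}]")
    return resolved
```

Applying one such helper to both `model` and `tokenizer` (instead of `model` alone) is the essence of the fix.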

Test Plan

tests/entrypoints/offline_mode/test_offline_mode.py is modified to cover this case; run it with pytest.

For easy comparison, I'll attach vllm serve-based reproduction here.

vllm serve reproduction: Setup

hf download Qwen/Qwen3-0.6B
hf download Qwen/Qwen3-4B

vllm serve reproduction: Main

HF_HUB_OFFLINE=1 vllm serve Qwen/Qwen3-0.6B --tokenizer Qwen/Qwen3-4B

Test Result

Before

> HF_HUB_OFFLINE=1 vllm serve Qwen/Qwen3-0.6B --tokenizer Qwen/Qwen3-4B
...
(APIServer pid=128760) INFO 11-28 23:31:36 [arg_utils.py:589] HF_HUB_OFFLINE is True, replace model_id [Qwen/Qwen3-0.6B] to model_path [/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca]
(APIServer pid=128760) INFO 11-28 23:31:36 [model.py:638] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=128760) INFO 11-28 23:31:36 [model.py:1756] Using max model len 40960
(APIServer pid=128760) INFO 11-28 23:31:37 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=128760) Traceback (most recent call last):
...
(APIServer pid=128760)   File "/home/user/.py/vllm-rocm-nightly/lib/python3.13/site-packages/huggingface_hub/utils/_http.py", line 106, in send
(APIServer pid=128760)     raise OfflineModeIsEnabled(
(APIServer pid=128760)         f"Cannot reach {request.url}: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable."
(APIServer pid=128760)     )
(APIServer pid=128760) huggingface_hub.errors.OfflineModeIsEnabled: Cannot reach https://huggingface.co/api/models/Qwen/Qwen3-4B: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable.

After

A warning in the middle appears unrelated to this change:

  1. Remove upstream fa checks #29471 mandated that flash_attn be installed on ROCm.
  2. The nightly ROCm toolchain + PyTorch already ships the flash attention operator.
> HF_HUB_OFFLINE=1 vllm serve Qwen/Qwen3-0.6B --tokenizer Qwen/Qwen3-4B
...
(APIServer pid=131665) INFO 11-28 23:41:18 [arg_utils.py:589] HF_HUB_OFFLINE is True, replace model_id [Qwen/Qwen3-0.6B] to model_path [/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca]
(APIServer pid=131665) INFO 11-28 23:41:18 [arg_utils.py:598] HF_HUB_OFFLINE is True, replace tokenizer_id [Qwen/Qwen3-4B] to tokenizer_path [/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-4B/snapshots/1cfa9a7208912126459214e8b04321603b3df60c]
(APIServer pid=131665) INFO 11-28 23:41:18 [model.py:638] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=131665) INFO 11-28 23:41:18 [model.py:1756] Using max model len 40960
(APIServer pid=131665) INFO 11-28 23:41:18 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=8192.
/home/user/.py/vllm-rocm-nightly/lib/python3.13/site-packages/torch/library.py:357: UserWarning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
    registered at /home/user/.py/vllm-rocm-nightly/lib/python3.13/site-packages/torch/_library/custom_ops.py:926
  dispatch key: ADInplaceOrView
  previous kernel: no debug info
       new kernel: registered at /home/user/.py/vllm-rocm-nightly/lib/python3.13/site-packages/torch/_library/custom_ops.py:926 (Triggered internally at /__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
  self.m.impl(
(EngineCore_DP0 pid=131774) INFO 11-28 23:41:20 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev379+g405f849cf.d20251128) with config: model='/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca', speculative_config=None, tokenizer='/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-4B/snapshots/1cfa9a7208912126459214e8b04321603b3df60c', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 
'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'enable_fusion': False, 'enable_attn_fusion': False, 'enable_noop': True, 'enable_sequence_parallelism': False, 'enable_async_tp': False, 'enable_fi_allreduce_fusion': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
...
(APIServer pid=131665) INFO:     Started server process [131665]
(APIServer pid=131665) INFO:     Waiting for application startup.
(APIServer pid=131665) INFO:     Application startup complete.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

a4lg added 2 commits November 28, 2025 23:19
When HF_HUB_OFFLINE is true, vLLM always rewrites the model ID/path to the
corresponding local path.  However, this does not mean the ID/path is
necessarily altered.  If `model` already points to a local directory or a
GGUF file, the value remains unchanged and there is no need to inform the
user via the log.

This change updates the offline-conversion logging to emit a message only
when the value of `model` actually changes.

Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
If `tokenizer` is a Hugging Face model ID, vLLM attempts to reach Hugging
Face even when the tokenizer is already available in the local cache.
This makes it impossible to specify a Hugging Face model as `tokenizer`
in offline mode (i.e. when `HF_HUB_OFFLINE` is true).

With this commit, vLLM performs offline path replacement also on
`tokenizer`, not only on `model`.

A test case is added because the offline-mode error only occurs when the
model and the tokenizer are different.

Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
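The rationale in the commit message above can be illustrated with a simplified stand-in (not vLLM's actual code): when `--tokenizer` is omitted, vLLM falls back to the already-replaced model path, so no hub lookup happens for the tokenizer and the bug stays invisible.

```python
from typing import Optional

def effective_tokenizer(model_value: str,
                        tokenizer_value: Optional[str]) -> str:
    """Simplified: the tokenizer defaults to the (already replaced) model."""
    return tokenizer_value if tokenizer_value is not None else model_value

# Same model and tokenizer: the model's offline replacement covers both.
assert effective_tokenizer("/cache/qwen3-0.6b", None) == "/cache/qwen3-0.6b"

# Distinct tokenizer: before the fix, this hub ID reached huggingface_hub
# unreplaced, triggering the OfflineModeIsEnabled error shown earlier.
assert effective_tokenizer("/cache/qwen3-0.6b",
                           "Qwen/Qwen3-4B") == "Qwen/Qwen3-4B"
```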
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses an issue with using a Hugging Face model as a tokenizer in offline mode. The changes ensure that when HF_HUB_OFFLINE is set, the tokenizer path is also resolved to a local cache path, preventing unnecessary and failing network requests. The logic is sound, and the addition of conditional logging to report path replacements only when they occur is a nice improvement for clarity. The new test case in test_offline_mode.py effectively covers the scenario and validates the fix. Overall, this is a well-executed change that improves the offline capabilities of vLLM.

Member

@DarkLight1337 DarkLight1337 left a comment


Thanks, LGTM

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 29, 2025 00:02
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 29, 2025
Contributor Author

a4lg commented Nov 29, 2025

TODO (maintainers): re-run the CI after #29704 is merged (the CI failure seems to be caused by #29682, which #29704 fixes).

@DarkLight1337
Member

Known failing test on main

@vllm-bot vllm-bot merged commit 762a4a6 into vllm-project:main Nov 29, 2025
50 of 52 checks passed
