
Conversation

Contributor

@a4lg a4lg commented Nov 29, 2025

Purpose

If the tokenizer is a Hugging Face model ID, vLLM attempts to reach Hugging Face even when the tokenizer is already available in the local cache.
This makes it impossible to specify a Hugging Face model as the tokenizer in offline mode (i.e. when HF_HUB_OFFLINE is true).

A separate commit also tweaks when the offline-mode path replacement log is emitted: only when the model/tokenizer value actually changes. Logging the replacement is not helpful when the value is unchanged (e.g. when the model is a local GGUF file).
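The replacement-plus-logging behavior described above can be sketched roughly as follows. This is a hypothetical helper with made-up names, not vLLM's actual implementation (which lives in arg_utils.py); the `resolver` callback stands in for a lookup of the cached snapshot, e.g. via huggingface_hub's snapshot_download with local_files_only=True:

```python
import os
from typing import Callable

def resolve_offline(name: str, resolver: Callable[[str], str],
                    log: Callable[[str], None] = print,
                    kind: str = "model") -> str:
    """Rewrite a Hugging Face repo ID to its local snapshot path when offline.

    `resolver` maps a repo ID to a cached local path.  The replacement is
    logged only when the value actually changes, so a local directory or a
    GGUF file path passes through silently.
    """
    if os.path.exists(name):          # already a local path: nothing to do
        return name
    resolved = resolver(name)
    if resolved != name:              # log only on an actual rewrite
        log(f"HF_HUB_OFFLINE is True, replace {kind}_id [{name}] "
            f"to {kind}_path [{resolved}]")
    return resolved
```

Applying one such helper to both `model` and `tokenizer` (instead of `model` alone) is the essence of the fix.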

Test Plan

tests/entrypoints/offline_mode/test_offline_mode.py is modified to cover this case; run it with pytest.

For easy comparison, I'll attach vllm serve-based reproduction here.

vllm serve reproduction: Setup

hf download Qwen/Qwen3-0.6B
hf download Qwen/Qwen3-4B

vllm serve reproduction: Main

HF_HUB_OFFLINE=1 vllm serve Qwen/Qwen3-0.6B --tokenizer Qwen/Qwen3-4B

Test Result

Before

> HF_HUB_OFFLINE=1 vllm serve Qwen/Qwen3-0.6B --tokenizer Qwen/Qwen3-4B
...
(APIServer pid=128760) INFO 11-28 23:31:36 [arg_utils.py:589] HF_HUB_OFFLINE is True, replace model_id [Qwen/Qwen3-0.6B] to model_path [/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca]
(APIServer pid=128760) INFO 11-28 23:31:36 [model.py:638] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=128760) INFO 11-28 23:31:36 [model.py:1756] Using max model len 40960
(APIServer pid=128760) INFO 11-28 23:31:37 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=128760) Traceback (most recent call last):
...
(APIServer pid=128760)   File "/home/user/.py/vllm-rocm-nightly/lib/python3.13/site-packages/huggingface_hub/utils/_http.py", line 106, in send
(APIServer pid=128760)     raise OfflineModeIsEnabled(
(APIServer pid=128760)         f"Cannot reach {request.url}: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable."
(APIServer pid=128760)     )
(APIServer pid=128760) huggingface_hub.errors.OfflineModeIsEnabled: Cannot reach https://huggingface.co/api/models/Qwen/Qwen3-4B: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable.

After

A warning in the middle appears unrelated to this change:

  1. Remove upstream fa checks #29471 mandated that flash_attn be installed on ROCm.
  2. The nightly ROCm toolchain + PyTorch already ships the flash attention operator.
> HF_HUB_OFFLINE=1 vllm serve Qwen/Qwen3-0.6B --tokenizer Qwen/Qwen3-4B
...
(APIServer pid=131665) INFO 11-28 23:41:18 [arg_utils.py:589] HF_HUB_OFFLINE is True, replace model_id [Qwen/Qwen3-0.6B] to model_path [/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca]
(APIServer pid=131665) INFO 11-28 23:41:18 [arg_utils.py:598] HF_HUB_OFFLINE is True, replace tokenizer_id [Qwen/Qwen3-4B] to tokenizer_path [/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-4B/snapshots/1cfa9a7208912126459214e8b04321603b3df60c]
(APIServer pid=131665) INFO 11-28 23:41:18 [model.py:638] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=131665) INFO 11-28 23:41:18 [model.py:1756] Using max model len 40960
(APIServer pid=131665) INFO 11-28 23:41:18 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=8192.
/home/user/.py/vllm-rocm-nightly/lib/python3.13/site-packages/torch/library.py:357: UserWarning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
    registered at /home/user/.py/vllm-rocm-nightly/lib/python3.13/site-packages/torch/_library/custom_ops.py:926
  dispatch key: ADInplaceOrView
  previous kernel: no debug info
       new kernel: registered at /home/user/.py/vllm-rocm-nightly/lib/python3.13/site-packages/torch/_library/custom_ops.py:926 (Triggered internally at /__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
  self.m.impl(
(EngineCore_DP0 pid=131774) INFO 11-28 23:41:20 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev379+g405f849cf.d20251128) with config: model='/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca', speculative_config=None, tokenizer='/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-4B/snapshots/1cfa9a7208912126459214e8b04321603b3df60c', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/user/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 
'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'enable_fusion': False, 'enable_attn_fusion': False, 'enable_noop': True, 'enable_sequence_parallelism': False, 'enable_async_tp': False, 'enable_fi_allreduce_fusion': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
...
(APIServer pid=131665) INFO:     Started server process [131665]
(APIServer pid=131665) INFO:     Waiting for application startup.
(APIServer pid=131665) INFO:     Application startup complete.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

a4lg added 2 commits November 28, 2025 23:19
When HF_HUB_OFFLINE is true, vLLM always rewrites the model ID/path to the
corresponding local path.  However, this does not mean the ID/path is
necessarily altered.  If `model` already points to a local directory or a
GGUF file, the value remains unchanged and there is no need to inform the
user via the log.

This change updates the offline-conversion logging to emit a message only
when the value of `model` actually changes.

Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
If `tokenizer` is a Hugging Face model ID, vLLM attempts to reach Hugging
Face even when the tokenizer is already available in the local cache.
This makes it impossible to specify a Hugging Face model as `tokenizer`
in offline mode (i.e. when `HF_HUB_OFFLINE` is true).

With this commit, vLLM performs offline path replacement also on
`tokenizer`, not only on `model`.

A test case is added because the offline-mode error only occurs when the
model and the tokenizer are different.

Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
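The rationale in the commit message above can be illustrated with a simplified stand-in (not vLLM's actual code): when `--tokenizer` is omitted, vLLM falls back to the already-replaced model path, so no hub lookup happens for the tokenizer and the bug stays invisible.

```python
from typing import Optional

def effective_tokenizer(model_value: str,
                        tokenizer_value: Optional[str]) -> str:
    """Simplified: the tokenizer defaults to the (already replaced) model."""
    return tokenizer_value if tokenizer_value is not None else model_value

# Same model and tokenizer: the model's offline replacement covers both.
assert effective_tokenizer("/cache/qwen3-0.6b", None) == "/cache/qwen3-0.6b"

# Distinct tokenizer: before the fix, this hub ID reached huggingface_hub
# unreplaced, triggering the OfflineModeIsEnabled error shown earlier.
assert effective_tokenizer("/cache/qwen3-0.6b",
                           "Qwen/Qwen3-4B") == "Qwen/Qwen3-4B"
```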
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses an issue with using a Hugging Face model as a tokenizer in offline mode. The changes ensure that when HF_HUB_OFFLINE is set, the tokenizer path is also resolved to a local cache path, preventing unnecessary and failing network requests. The logic is sound, and the addition of conditional logging to report path replacements only when they occur is a nice improvement for clarity. The new test case in test_offline_mode.py effectively covers the scenario and validates the fix. Overall, this is a well-executed change that improves the offline capabilities of vLLM.

Member

@DarkLight1337 DarkLight1337 left a comment


Thanks, LGTM

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 29, 2025 00:02
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 29, 2025
Contributor Author

a4lg commented Nov 29, 2025

TODO (maintainers): re-run the CI after #29704 is merged (the CI failure seems to be caused by #29682, which #29704 fixes).

@DarkLight1337
Member

Known failing test on main

@vllm-bot vllm-bot merged commit 762a4a6 into vllm-project:main Nov 29, 2025
50 of 52 checks passed
