Compare memory footprint and latency of running Llama 3.1 8B on CPU across three inference stacks:
- Hugging Face / PyTorch `transformers`
- `vllm`
- `llama.cpp` (GGUF weights)
The repository provides a ready-to-use devcontainer, standalone backend scripts, and an orchestrated runner that captures structured results for analysis.
- Install VS Code with the Dev Containers extension (or the `devcontainer` CLI) and accept access to the `meta-llama/Llama-3.1-8B` model on Hugging Face.
- Clone the repo and open it in the devcontainer:
  ```bash
  devcontainer open .
  ```

  The container now uses Ubuntu 24.04 with Python 3.12, the `uv` package manager, and tooling required to build CPU-first inference stacks.

- Authenticate with Hugging Face inside the container so downloads succeed (a quick programmatic access check is sketched after this list):
  ```bash
  huggingface-cli login
  ```
- Provision the per-backend virtual environments (re-run whenever dependencies need refreshing):
  ```bash
  python scripts/setup_virtualenvs.py
  ```

  The command prints the interpreter path for each backend. By default environments live in `/opt/venvs` inside the devcontainer (or `.venvs/` when running locally).
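If you want to confirm that the login and gated-model access are in place before a long benchmark run, a minimal check is sketched below. It uses `huggingface_hub` (assumed to be importable from whichever interpreter you run it with); this is an illustrative helper, not a script shipped in the repo.

```python
# Hypothetical helper (not part of this repo): sanity-check the Hugging Face
# token and gated-model access before launching a long benchmark.
# Assumes `huggingface_hub` is importable from the interpreter running it.
from huggingface_hub import model_info, whoami

MODEL_ID = "meta-llama/Llama-3.1-8B"

try:
    user = whoami()  # raises if `huggingface-cli login` has not been run
    print(f"Authenticated as: {user.get('name')}")
except Exception as exc:
    raise SystemExit(f"Not logged in to Hugging Face: {exc}")

try:
    info = model_info(MODEL_ID)  # raises if the gated license was not accepted
    print(f"Access OK: {MODEL_ID} ({len(info.siblings or [])} files listed)")
except Exception as exc:
    raise SystemExit(f"No access to {MODEL_ID}: {exc}")
```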
- Transformers & vLLM load weights directly from Hugging Face. Set
HF_HOMEor--download-dirif you need a custom cache location. CPU runs require ~16–24 GB of RAM for the full-precision model; consider parameter-efficient or quantized variants if memory is constrained. - llama.cpp requires a GGUF file. Either download an official GGUF release (e.g.
meta-llama/Llama-3.1-8B-Instruct-GGUF) or convertmeta-llama/Llama-3.1-8Blocally using the conversion tools in the upstreamllama.cpprepository. Place the resulting file undermodels/(or supply an absolute path when running benchmarks).
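For the pre-quantized route, the sketch below pulls a GGUF file into `models/` with `huggingface_hub.hf_hub_download`. The repo id mirrors the example above, and the filename is a hypothetical placeholder; check the repo's file listing for the quantization you actually want.

```python
# Illustrative sketch only: fetch a pre-quantized GGUF file into models/.
# The filename below is an assumption -- substitute the GGUF release and
# quantization level you actually intend to benchmark.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct-GGUF",  # example repo from the notes above
    filename="llama-3.1-8b-q4_k_m.gguf",              # hypothetical filename; check the repo listing
    local_dir="models",                               # matches the models/ convention used by the benchmarks
)
print(f"GGUF saved to {path}")
```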
Each backend owns an isolated virtual environment to avoid conflicts between PyTorch and vLLM builds. Use `scripts/setup_virtualenvs.py` to discover interpreter paths, then call the desired script with that interpreter (all scripts accept `--help` for full options):
- PyTorch / Transformers:
  ```bash
  /opt/venvs/venv-hf/bin/python scripts/benchmark_hf.py \
    --model-id meta-llama/Llama-3.1-8B \
    --max-new-tokens 128 \
    --num-threads 16
  ```
- vLLM (CPU eager mode recommended):
  ```bash
  /opt/venvs/venv-vllm/bin/python scripts/benchmark_vllm.py \
    --model-id meta-llama/Llama-3.1-8B \
    --enforce-eager \
    --num-threads 16
  ```
- llama.cpp (GGUF input required):
  ```bash
  /opt/venvs/venv-llamacpp/bin/python scripts/benchmark_llamacpp.py \
    --model-path ./models/llama-3.1-8b-q4_k_m.gguf \
    --num-threads 16
  ```
Adjust the interpreter prefix if you are running outside the devcontainer (check the `setup_virtualenvs.py` output). All scripts accept `--prompt` or `--prompt-file` for custom inputs and can emit raw JSON via `--print-json`.
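If you prefer to drive a single backend from Python instead of the shell, one possible pattern is to shell out with `--print-json` and parse the output. This sketch assumes the flag emits the result as the final JSON line on stdout and that the single-backend output uses the same metric names as the aggregated report; verify both against the actual script before relying on it.

```python
# Sketch of driving one backend programmatically; assumes --print-json writes
# a single JSON object to stdout (verify against the actual script output).
import json
import subprocess

INTERPRETER = "/opt/venvs/venv-hf/bin/python"  # from setup_virtualenvs.py output

proc = subprocess.run(
    [
        INTERPRETER, "scripts/benchmark_hf.py",
        "--model-id", "meta-llama/Llama-3.1-8B",
        "--max-new-tokens", "128",
        "--num-threads", "16",
        "--print-json",
    ],
    capture_output=True,
    text=True,
    check=True,
)

# Parse the last non-empty stdout line in case the script also logs progress.
last_line = [line for line in proc.stdout.splitlines() if line.strip()][-1]
metrics = json.loads(last_line)
print(metrics["tokens_per_second"], metrics["peak_memory_mebibytes"])
```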
`python scripts/run_all_benchmarks.py` prepares the required virtual environments on demand, orchestrates every backend, and saves timestamped reports under `artifacts/`.
Example:
```bash
python scripts/run_all_benchmarks.py \
  --llamacpp-model-path ./models/llama-3.1-8b-q4_k_m.gguf \
  --max-new-tokens 128 \
  --hf-num-threads 16 \
  --vllm-num-threads 16 \
  --llamacpp-num-threads 16 \
  --label local-test
```

Use `--backends` to run a subset (e.g. `--backends hf vllm`). The runner reuses existing environments by default; add `--venv-reinstall` to recreate them or `--skip-venv-sync` to rely on previously installed dependencies. Include `--print-json` to mirror the final summary to stdout.
The aggregated JSON includes:
- System metadata (CPU topology, RAM, Python version).
- Per-backend metrics (`load_time_s`, `generate_time_s`, `completion_tokens`, `tokens_per_second`, `peak_memory_mebibytes`).
- The generated completion text for quick sanity checks.
Saved files follow `artifacts/benchmark_<timestamp>[_label].json`. They can be post-processed with pandas or imported into dashboards.
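As a starting point for post-processing, the sketch below flattens every saved report into a `pandas` DataFrame. The nesting of per-backend metrics (assumed here to live under a top-level `backends` mapping keyed by backend name) is a guess; adjust the keys to match a real report.

```python
# Rough post-processing sketch; the JSON layout (a top-level "backends"
# mapping keyed by backend name) is assumed -- check it against a real
# artifacts/benchmark_*.json file first.
import json
from pathlib import Path

import pandas as pd

rows = []
for report in sorted(Path("artifacts").glob("benchmark_*.json")):
    data = json.loads(report.read_text())
    for backend, metrics in data.get("backends", {}).items():  # assumed key
        rows.append({
            "report": report.name,
            "backend": backend,
            "tokens_per_second": metrics.get("tokens_per_second"),
            "peak_memory_mebibytes": metrics.get("peak_memory_mebibytes"),
            "load_time_s": metrics.get("load_time_s"),
        })

df = pd.DataFrame(rows)
print(df.sort_values("tokens_per_second", ascending=False))
```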
- `.devcontainer/` – devcontainer configuration, Dockerfile, and post-create installer.
- `envs/` – dependency declarations for each virtual environment.
- `requirements.txt` – pointer to the per-backend dependency files.
- `scripts/` – entry points for individual backends, virtualenv bootstrapper, and the orchestration runner.
- `src/cpu_serving/` – reusable benchmarking utilities (memory sampling, result formatting, virtualenv helpers).
- `artifacts/` – output directory for JSON reports (kept empty via `.gitkeep`).
- Large CPU runs benefit from setting `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, and `VLLM_WORKER_CPU_THREADS`; the scripts set these automatically when `--num-threads` is provided.
- vLLM CPU support is evolving; enabling `--enforce-eager` often yields more predictable behavior at the expense of throughput.
- Peak memory estimates rely on periodic RSS sampling through `psutil`; adjust the sampling rate in `src/cpu_serving/benchmarks.py` if more precision is required (see the sketch after this list).
- The default interpreter candidates favour Python 3.12 for Transformers and Python 3.13 for vLLM when available, matching the latest supported PyTorch wheels. Override the interpreter by exporting `VIRTUALENV_HOME` and ensuring the desired binary is on `PATH`.
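For reference, the kind of periodic RSS sampling described above can be approximated in a few lines of `psutil`. This is a simplified illustration, not the implementation in `src/cpu_serving/benchmarks.py`.

```python
# Simplified illustration of periodic RSS sampling with psutil; the real
# implementation lives in src/cpu_serving/benchmarks.py and may differ.
import threading
import time

import psutil


class PeakRssSampler:
    def __init__(self, interval_s: float = 0.1):
        self._proc = psutil.Process()          # current process
        self._interval = interval_s
        self._peak_bytes = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Poll RSS until stopped, keeping the highest value observed.
        while not self._stop.is_set():
            self._peak_bytes = max(self._peak_bytes, self._proc.memory_info().rss)
            time.sleep(self._interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

    @property
    def peak_mebibytes(self) -> float:
        return self._peak_bytes / (1024 * 1024)


# Usage: wrap the expensive call; a shorter interval gives a finer estimate.
with PeakRssSampler(interval_s=0.05) as sampler:
    _ = [bytearray(32 * 1024 * 1024) for _ in range(4)]  # stand-in for model load/generate
print(f"Peak RSS ≈ {sampler.peak_mebibytes:.1f} MiB")
```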