
Conversation


@TomeHirata TomeHirata commented Jul 3, 2025

This PR enables running unit tests that require real LLM calls, which have so far been skipped in CI.
The model used for testing is configurable through the LLM_MODEL env variable; we use ollama/llama3.2:3b in the branch build to balance quality and latency. Pulling the Ollama model only adds ~13s of latency, so this PR enables real LLM tests directly in the branch build instead of deferring them to nightly tests.
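To illustrate the configuration described above, here is a minimal sketch of how a test suite might pick its model from the LLM_MODEL env variable, falling back to the Ollama model the branch build uses. The function name and env handling are illustrative, not the PR's actual code.

```python
import os

# Illustrative sketch (not the PR's actual code): pick the test model from
# the LLM_MODEL environment variable, falling back to the Ollama model
# used in the branch build.
DEFAULT_MODEL = "ollama/llama3.2:3b"

def get_test_model(env=None):
    """Return the model identifier the test suite should use."""
    env = os.environ if env is None else env
    return env.get("LLM_MODEL", DEFAULT_MODEL)
```

Passing the environment mapping in explicitly keeps the lookup easy to exercise in tests without mutating os.environ.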

@TomeHirata TomeHirata changed the title use real LLM for unit tests Use real LLM for unit tests Jul 3, 2025
@TomeHirata TomeHirata changed the title Use real LLM for unit tests Run unit tests with real LLM calls Jul 3, 2025
@chenmoneygithub

@TomeHirata This is awesome! One thing I want to discuss.

I compared the time cost of running the tests before and after this PR, and we are seeing a significant increase, from 2m to 4m:

Before: [screenshot: ~2m test run]

After: [screenshot: ~4m test run]

4m seems all right, but that's the outcome of converting only 6 test cases to Ollama, which means the suite could become much slower over time. I am thinking about the following:

  1. Use pytest markers to split tests requiring an actual LM into their own group, e.g. @pytest.mark.real_lm_call. These tests will be skipped in the normal GitHub tests and only run in the dedicated action Run Tests with Real LM. We may not need to run them across all Python versions, because they are essentially a supplement to the normal tests that checks against actual LM behavior.
  2. Prefer mocking over real LM calls in unit tests. Only when mocking becomes too complex (e.g. optimizers) or unreliable over time (e.g. streaming) should we switch to using the real LM.

Let me know what you think!
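Point 1 above could be wired up roughly like this hypothetical conftest.py sketch; the marker name, env variable, and skip logic are illustrative, not the repository's actual setup.

```python
import os
import pytest

# Hypothetical conftest.py sketch: register a real_lm_call marker and skip
# those tests unless an env flag opts in (names are illustrative).
REAL_LM_MARKER = "real_lm_call"

def should_run_real_lm(env=None):
    # The dedicated "Run Tests with Real LM" job would set this flag.
    env = os.environ if env is None else env
    return bool(env.get("RUN_REAL_LM_TESTS"))

def pytest_configure(config):
    config.addinivalue_line(
        "markers", f"{REAL_LM_MARKER}: test makes a real LLM call"
    )

def pytest_collection_modifyitems(config, items):
    if should_run_real_lm():
        return  # dedicated job: run the real-LM tests too
    skip = pytest.mark.skip(reason="requires a real LLM; run in the dedicated job")
    for item in items:
        if REAL_LM_MARKER in item.keywords:
            item.add_marker(skip)
```

With this shape, the normal CI matrix skips the marked tests automatically, and only the dedicated job sets the opt-in flag.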

@TomeHirata
Copy link
Collaborator Author

Hi, @chenmoneygithub. I agree with both points and have split the LLM call tests into a separate job. On #2, I completely agree: we should limit the use of LLM calls in unit tests because of latency and potential flakiness.
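As a generic sketch of the mocking-first approach agreed on above (names are illustrative, not DSPy's API): when the code under test delegates generation to an injected LM client, a unit test can substitute a mock instead of making a real call.

```python
from unittest.mock import MagicMock

# Illustrative code under test: delegates generation to an injected LM client.
def summarize(lm, text):
    return lm.generate(prompt=f"Summarize: {text}")

def test_summarize_uses_lm():
    # The unit test stubs the LM, avoiding latency and flakiness entirely.
    fake_lm = MagicMock()
    fake_lm.generate.return_value = "a short summary"
    assert summarize(fake_lm, "long document ...") == "a short summary"
    fake_lm.generate.assert_called_once_with(prompt="Summarize: long document ...")
```

This keeps the fast CI path deterministic, reserving real-LM coverage for the dedicated job.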


@chenmoneygithub chenmoneygithub left a comment


The setup looks good!


@chenmoneygithub chenmoneygithub left a comment


LGTM!

@TomeHirata TomeHirata merged commit b4d1a7e into stanfordnlp:main Jul 8, 2025
10 checks passed
