Exposing Behavior test run as a tool call to LLMs #949

aseembits93 · 2025-12-03T04:12:26Z

PR Type

Enhancement, Tests

Description

Expose behavioral tests as LLM tool
Add tool schema and execution registry
Provide lazy imports to avoid cycles
Add comprehensive tests for tool API

Diagram Walkthrough

flowchart LR
  LLM["LLM"] -- "calls tool" --> Exec["execute_tool()"]
  Exec -- "dispatch" --> Tool["run_behavioral_tests_tool()"]
  Tool -- "invoke" --> Runner["run_behavioral_tests()"]
  Runner -- "produce JUnit XML" --> Parser["parse_test_xml()"]
  Parser -- "results" --> Tool
  Tool -- "structured dict" --> LLM
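
The dispatch step in the diagram can be sketched as a small name-to-callable registry. This is only an illustration of the pattern, not the PR's code: the body of `run_behavioral_tests_tool` here is a hypothetical stub, and the `execute_tool(name, arguments)` signature and the `ValueError` on unknown names are assumptions.

```python
from __future__ import annotations

from typing import Any, Callable


def run_behavioral_tests_tool(test_files: list[dict[str, str]], project_root: str) -> dict[str, Any]:
    # Hypothetical stub: the real wrapper runs the tests and parses JUnit XML.
    return {"success": True, "total_tests": len(test_files), "error": None}


# Registry mapping the tool name the LLM sees to the Python callable.
TOOL_REGISTRY: dict[str, Callable[..., dict[str, Any]]] = {
    "run_behavioral_tests": run_behavioral_tests_tool,
}


def execute_tool(name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """Look up a tool by name and call it with the LLM-supplied arguments."""
    try:
        tool = TOOL_REGISTRY[name]
    except KeyError:
        # Assumed behavior: unknown tool names are rejected explicitly.
        raise ValueError(f"Unknown tool: {name!r}") from None
    return tool(**arguments)
```

Keeping the registry as plain data means a new tool only needs one extra dictionary entry, and the dispatch function stays unchanged.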

File Walkthrough

Relevant files

Enhancement

__init__.py: Lazy export of verification LLM tools (codeflash/verification/__init__.py, +31/-0)
  Add lazy attribute loader for LLM tools
  Re-export tool APIs via `__all__`

llm_tools.py: LLM tool schema and behavior test wrapper (codeflash/verification/llm_tools.py, +321/-0)
  Define JSON schema for `run_behavioral_tests`
  Implement `run_behavioral_tests_tool` wrapper
  Add tool registry and execution helpers
  Map string test types to enum

Tests

test_llm_tools.py: Tests for verification LLM tools interface (tests/test_llm_tools.py, +193/-0)
  Add tests for tool schema and registry
  Validate `execute_tool` dispatch and errors
  Run real pytest samples through tool
  Test handling of failing and invalid paths
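
The lazy attribute loader listed for `__init__.py` most likely relies on a module-level `__getattr__` (PEP 562), which defers an import until the name is first accessed and so avoids import cycles. The sketch below shows the mechanism on a throwaway module object; `make_lazy_module` and the attribute table are hypothetical and use standard-library targets in place of `codeflash` modules.

```python
import importlib
import types


def make_lazy_module(name, lazy_attrs):
    """Build a module whose attributes are imported on first access.

    lazy_attrs maps an exported name to the module that defines it.
    """
    mod = types.ModuleType(name)

    def __getattr__(attr):
        try:
            target = lazy_attrs[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        value = getattr(importlib.import_module(target), attr)
        setattr(mod, attr, value)  # cache so later lookups skip __getattr__
        return value

    # PEP 562: a __getattr__ in the module namespace handles missing attributes.
    mod.__getattr__ = __getattr__
    mod.__all__ = sorted(lazy_attrs)
    return mod


# Stand-in example: "dumps" is resolved from the json module only when used.
pkg = make_lazy_module("pkg", {"dumps": "json"})
```

In a real package the same `__getattr__` function would be defined directly at the top level of `__init__.py` rather than attached to a constructed module.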

CLAassistant · 2025-12-03T04:12:32Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Codeflash Bot does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

github-actions · 2025-12-03T04:13:26Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The mapping in `_test_type_from_string` uses the key `concolic_test` instead of `concolic_coverage_test` (already present). If callers pass `concolic_test` per earlier conventions, it's fine, but if schemas or other parts expect only `concolic_coverage_test`, the extra alias may mask typos. Conversely, if `concolic_test` was not intended as an alias, this could be a bug. Validate intended enum values and align with `TestType`.

    def _test_type_from_string(test_type_str: str) -> TestType:
        """Convert a string test type to TestType enum."""
        mapping = {
            "existing_unit_test": TestType.EXISTING_UNIT_TEST,
            "generated_regression": TestType.GENERATED_REGRESSION,
            "replay_test": TestType.REPLAY_TEST,
            "concolic_test": TestType.CONCOLIC_COVERAGE_TEST,
            "concolic_coverage_test": TestType.CONCOLIC_COVERAGE_TEST,
        }
        return mapping.get(test_type_str.lower(), TestType.EXISTING_UNIT_TEST)

Robustness

When constructing `PYTHONPATH`, the code appends `os.pathsep + project_root_path` without checking for duplication or normalizing. Also, if `PYTHONPATH` exists but is empty, the leading separator can occur. Consider normalizing and avoiding duplicates.

    # Ensure PYTHONPATH includes project root
    if "PYTHONPATH" not in test_env:
        test_env["PYTHONPATH"] = str(project_root_path)
    else:
        test_env["PYTHONPATH"] += os.pathsep + str(project_root_path)

Error Handling

The broad `except Exception` swallows all errors and returns `success: False` with minimal context. Consider logging or including more structured error info (e.g., traceback) to aid debugging, and ensure sensitive paths are handled appropriately.

    except Exception as e:
        return {
            "success": False,
            "total_tests": 0,
            "passed_tests": 0,
            "failed_tests": 0,
            "results": [],
            "stdout": "",
            "stderr": "",
            "error": str(e),
        }
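
One way to act on the Error Handling note is to keep the same result shape but attach the exception type and formatted traceback. The sketch below is an assumption about how that could look, not the PR's implementation: `error_result`, `error_type`, and `traceback` are hypothetical names, and whether full tracebacks (which include file paths) should be returned to an LLM is a separate decision.

```python
import traceback
from typing import Any, Dict


def error_result(exc: BaseException) -> Dict[str, Any]:
    """Build the failure payload, adding structured debugging context."""
    return {
        "success": False,
        "total_tests": 0,
        "passed_tests": 0,
        "failed_tests": 0,
        "results": [],
        "stdout": "",
        "stderr": "",
        "error": str(exc),
        # Hypothetical extra fields, not part of the PR's schema:
        "error_type": type(exc).__name__,
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


# Usage: wrap the risky call and convert any failure into the payload.
try:
    raise FileNotFoundError("missing.py")
except Exception as e:
    failure = error_result(e)
```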

github-actions · 2025-12-03T04:14:02Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Suggestion: Align schema enum with mapper
Impact: Medium

The accepted enum in the tool schema does not include 'concolic_test', yet the mapping supports it. This mismatch can cause user inputs to be rejected before reaching your mapper. Align the schema enum with supported inputs to prevent validation errors.

codeflash/verification/llm_tools.py [123-131]

     mapping = {
         "existing_unit_test": TestType.EXISTING_UNIT_TEST,
         "generated_regression": TestType.GENERATED_REGRESSION,
         "replay_test": TestType.REPLAY_TEST,
         "concolic_test": TestType.CONCOLIC_COVERAGE_TEST,
         "concolic_coverage_test": TestType.CONCOLIC_COVERAGE_TEST,
     }
    +...
    +RUN_BEHAVIORAL_TESTS_TOOL_SCHEMA = {
    +    "type": "function",
    +    "function": {
    +        "name": "run_behavioral_tests",
    +        "description": (
    +            "Run behavioral tests to verify code correctness. "
    +            "This executes test files using pytest or unittest and returns detailed results "
    +            "including pass/fail status, runtime information, and any errors encountered."
    +        ),
    +        "parameters": {
    +            "type": "object",
    +            "properties": {
    +                "test_files": {
    +                    "type": "array",
    +                    "description": "List of test files to run",
    +                    "items": {
    +                        "type": "object",
    +                        "properties": {
    +                            "test_file_path": {
    +                                "type": "string",
    +                                "description": "Absolute path to the test file to run",
    +                            },
    +                            "test_type": {
    +                                "type": "string",
    +                                "enum": [
    +                                    "existing_unit_test",
    +                                    "generated_regression",
    +                                    "replay_test",
    +                                    "concolic_test",
    +                                    "concolic_coverage_test",
    +                                ],
    +                                "default": "existing_unit_test",
    +                                "description": "Type of test being run",
    +                            },
    +                        },
    +                        "required": ["test_file_path"],
    +                    },
    +                },
    +                ...
    +            },
    +            "required": ["test_files", "project_root"],
    +        },
    +    },
    +}

Suggestion importance[1-10]: 7

Why: Correctly identifies a mismatch: `_test_type_from_string` accepts `concolic_test` but the tool schema enum omits it, causing premature validation failures. The proposed change is accurate and improves usability, though not critical to core functionality.

Category: General
Suggestion: Prevent PYTHONPATH duplication, ensure precedence
Impact: Low

Appending to an existing PYTHONPATH without checking for duplicates can grow unbounded across calls and may break module resolution order. Prepend the project root if not already present to ensure it takes precedence and avoid duplication.

codeflash/verification/llm_tools.py [191-196]

    -# Ensure PYTHONPATH includes project root
    -if "PYTHONPATH" not in test_env:
    -    test_env["PYTHONPATH"] = str(project_root_path)
    -else:
    -    test_env["PYTHONPATH"] += os.pathsep + str(project_root_path)
    +# Ensure PYTHONPATH includes project root once, and with precedence
    +current_pp = test_env.get("PYTHONPATH", "")
    +pp_parts = current_pp.split(os.pathsep) if current_pp else []
    +project_root_str = str(project_root_path)
    +if project_root_str not in pp_parts:
    +    test_env["PYTHONPATH"] = os.pathsep.join([project_root_str] + pp_parts) if pp_parts else project_root_str

Suggestion importance[1-10]: 6

Why: Sensible enhancement to avoid unbounded PYTHONPATH growth and ensure project root precedence. It’s a maintainability/robustness improvement; impact is moderate and the code change aligns with the existing snippet.

Category: General
Suggestion: Normalize subprocess output to text
Impact: Low

Accessing 'process.stdout' and 'process.stderr' assumes they are strings; they may be bytes or None depending on how the subprocess was run. Normalize to string to avoid type issues for JSON serialization and downstream consumers.

codeflash/verification/llm_tools.py [248-258]

    +def _to_text(s: Any) -> str:
    +    if s is None:
    +        return ""
    +    return s.decode("utf-8", errors="replace") if isinstance(s, (bytes, bytearray)) else str(s)
    +
     return {
         "success": True,
         "total_tests": len(test_results),
         "passed_tests": passed_count,
         "failed_tests": failed_count,
         "results": results_list,
    -    "stdout": process.stdout if process.stdout else "",
    -    "stderr": process.stderr if process.stderr else "",
    +    "stdout": _to_text(process.stdout),
    +    "stderr": _to_text(process.stderr),
         "error": None,
     }

Suggestion importance[1-10]: 6

Why: Normalizing `process.stdout`/`stderr` to strings improves robustness for serialization and consumer consistency. While likely already strings, handling bytes/None is a reasonable defensive improvement without altering behavior.
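
The schema/mapper mismatch raised in the first suggestion can also be avoided structurally by deriving the schema's `enum` list from the same mapping the converter uses, so the two cannot drift apart. The following is a minimal sketch under that assumption; `StubTestType`, `TEST_TYPE_BY_NAME`, `TEST_TYPE_PROPERTY`, and `parse_test_type` are hypothetical stand-ins, and the enum values are placeholders rather than those of codeflash's `TestType`.

```python
from enum import Enum


class StubTestType(Enum):
    # Stand-in for codeflash's TestType; the numeric values are placeholders.
    EXISTING_UNIT_TEST = 1
    GENERATED_REGRESSION = 2
    REPLAY_TEST = 3
    CONCOLIC_COVERAGE_TEST = 4


# Single source of truth for accepted strings, including the alias.
TEST_TYPE_BY_NAME = {
    "existing_unit_test": StubTestType.EXISTING_UNIT_TEST,
    "generated_regression": StubTestType.GENERATED_REGRESSION,
    "replay_test": StubTestType.REPLAY_TEST,
    "concolic_test": StubTestType.CONCOLIC_COVERAGE_TEST,
    "concolic_coverage_test": StubTestType.CONCOLIC_COVERAGE_TEST,
}

# The schema fragment reuses the mapping keys instead of repeating them.
TEST_TYPE_PROPERTY = {
    "type": "string",
    "enum": list(TEST_TYPE_BY_NAME),
    "default": "existing_unit_test",
    "description": "Type of test being run",
}


def parse_test_type(test_type_str: str) -> StubTestType:
    """Convert a string to the enum, falling back to EXISTING_UNIT_TEST."""
    return TEST_TYPE_BY_NAME.get(test_type_str.lower(), StubTestType.EXISTING_UNIT_TEST)
```

The silent fallback mirrors the behavior shown in the reviewer guide; raising on unknown strings instead would surface typos, at the cost of rejecting inputs the current code accepts.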

codeflash-ai · 2025-12-03T04:19:12Z

⚡️ Codeflash found optimizations for this PR

📄 128% (1.28x) speedup for `_test_type_from_string` in `codeflash/verification/llm_tools.py`

⏱️ Runtime : 3.07 milliseconds → 1.34 milliseconds (best of 128 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function _test_type_from_string by 128% in PR #949 (feat/behavior-test-as-tool) #950

If you approve, it will be merged into this PR (branch feat/behavior-test-as-tool).

first commit

6be5697

aseembits93 marked this pull request as draft December 3, 2025 04:12

github-actions bot added the Review effort 3/5 label Dec 3, 2025

codeflash-ai bot mentioned this pull request Dec 3, 2025

⚡️ Speed up function _test_type_from_string by 128% in PR #949 (feat/behavior-test-as-tool) #950

Open

3 participants
