
Conversation

@aseembits93 (Contributor) commented Dec 5, 2025

PR Type

Enhancement, Tests


Description

  • Add AI-powered code repair workflow

  • Track candidate source across pipeline

  • Return detailed test comparison diffs

  • Parse pytest failures from stdout


Diagram Walkthrough

flowchart LR
  cand["OptimizedCandidate (source tagged)"]
  proc["CandidateProcessor"]
  run["Run candidate tests"]
  cmp["Compare results (match, diffs)"]
  repairReq["AIServiceCodeRepairRequest"]
  repairAPI["/code_repair (AiServiceClient)"]
  queue["Re-queue repaired candidate"]
  refine["/refine pipeline"]

  cand -- "enqueue" --> proc
  proc -- "dequeue" --> run
  run -- "compare" --> cmp
  cmp -- "mismatch <=50%" --> repairReq
  repairReq -- "submit" --> repairAPI
  repairAPI -- "candidate (REPAIR)" --> queue
  queue -- "enqueue" --> proc
  cmp -- "match" --> refine
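
In code, the loop in the diagram reduces to roughly the following shape. This is a minimal sketch: run_candidate_tests, send_to_refine_pipeline, submit_code_repair, and REPAIR_THRESHOLD are illustrative names, not the PR's actual identifiers; only compare_test_results and its (match, diffs) tuple return are taken from the diff itself.

from collections import deque

REPAIR_THRESHOLD = 0.5  # diagram edge: re-queue for repair only when mismatch <= 50%

def process_candidates(queue: deque, original_results) -> None:
    while queue:
        candidate = queue.popleft()
        candidate_results = run_candidate_tests(candidate)  # hypothetical helper
        match, diffs = compare_test_results(original_results, candidate_results)
        if match:
            send_to_refine_pipeline(candidate)  # "/refine" edge
        elif len(diffs) / max(len(original_results), 1) <= REPAIR_THRESHOLD:
            repaired = submit_code_repair(candidate, diffs)  # "/code_repair" edge; candidate comes back tagged REPAIR
            queue.append(repaired)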

File Walkthrough

Relevant files

Enhancement (5 files)

  • aiservice.py: Add code-repair API and candidate source tagging (+63/-15)

  • models.py: Add TestDiffs, repair request, and source enum (+76/-0)

  • function_optimizer.py: Integrate repair queue, dedupe helpers, and source-aware flow (+154/-39)

  • equivalence.py: Return (match, diffs) and capture pytest errors (+58/-29)

  • parse_test_output.py: Parse pytest failures per test from stdout (+60/-0)

Tests (5 files)

  • test_codeflash_capture.py: Adapt tests to new compare_test_results API (+305/-6)

  • test_comparator.py: Update comparator tests for tuple return (+14/-7)

  • test_instrument_all_and_run.py: Adjust assertions to match new API (+16/-8)

  • test_instrumentation_run_results_aiservice.py: Update assertions and add match unpacking (+6/-4)

  • test_pickle_patcher.py: Align pickle patcher tests with new comparison (+4/-4)

mohammedahmed18 and others added 30 commits November 27, 2025 16:18
The optimized code achieves a **15% speedup** through several targeted micro-optimizations that reduce computational overhead in the parsing loop:

**Key Optimizations:**

1. **Single-pass boundary search**: Instead of checking both conditions (`start_line != -1 and end_line != -1`) on every iteration, the optimized version uses `None` values and breaks immediately when both markers are found, eliminating redundant condition checks.

2. **Fast-path string matching**: Before calling the expensive `.startswith("_______")` method, it first checks if `line[0] == "_"`, avoiding the method call for most lines that don't start with underscores.

3. **Method lookup optimization**: Pulls `current_failure_lines.append` into a local variable to avoid repeated attribute lookups in the hot loop where failure lines are processed.

4. **Memory-efficient list management**: Uses `current_failure_lines.clear()` instead of creating new list objects (`current_failure_lines = []`), reducing object allocation pressure. (Points 2-4 are combined in the sketch below.)
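
Taken together, points 2-4 give the hot loop roughly the following shape (a simplified sketch; the real function also locates the FAILURES section boundaries described in point 1):

def collect_failure_lines(lines):
    failures = {}
    current_name = None
    current_failure_lines = []
    append_line = current_failure_lines.append  # local binding avoids repeated attribute lookups

    for line in lines:
        # Fast path: a cheap first-character check guards the startswith() call.
        if line and line[0] == "_" and line.startswith("_______"):
            if current_name is not None:
                failures[current_name] = "".join(current_failure_lines)
            current_name = line.strip("_ ")
            current_failure_lines.clear()  # reuse the same list instead of allocating a new one
        elif current_name is not None:
            append_line(line + "\n")

    if current_name is not None:
        failures[current_name] = "".join(current_failure_lines)
    return failures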

**Performance Impact:**
The optimizations show the most significant gains in large-scale scenarios:
- **Large failure sets**: 14.2% faster with 500 failures, 14.0% faster with 999 failures  
- **Large output**: 29.2% faster for single failures with 1000 lines of output
- **Complex scenarios**: 22.3% faster with 50 cases having 10 lines each

**Hot Path Context:**
Based on the function reference, `parse_test_failures_from_stdout` is called from `parse_test_results`, which appears to be part of a test optimization pipeline. The function processes pytest stdout to extract failure information, making it performance-critical when dealing with large test suites or verbose test outputs. The 15% improvement becomes meaningful when processing hundreds of test failures in CI/CD environments or during iterative code optimization workflows.
…25-11-27T14.49.01

⚡️ Speed up function `parse_test_failures_from_stdout` by 16% in PR #945 (`feat/feedback-loop-for-unmatched-test-results`)
…b.com:codeflash-ai/codeflash into feat/feedback-loop-for-unmatched-test-results
…b.com:codeflash-ai/codeflash into feat/feedback-loop-for-unmatched-test-results
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
…b.com:codeflash-ai/codeflash into feat/feedback-loop-for-unmatched-test-results
…b.com:codeflash-ai/codeflash into feat/feedback-loop-for-unmatched-test-results
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Codeflash Bot seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions

github-actions bot commented Dec 5, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Robustness

The pytest failure parsing relies on regex and section heuristics; edge cases (different pytest verbosity/themes, locale, plugins altering output) may break parsing or misassociate failures. Consider guarding against None indices, normalizing headers, and adding unit tests for varied outputs.

def parse_test_failures_from_stdout(test_results: TestResults, stdout: str) -> TestResults:
    """Extract individual pytest test failures from stdout grouped by test case qualified name, and add them to the test results."""
    lines = stdout.splitlines()
    start = end = None

    for i, line in enumerate(lines):
        if FAILURES_HEADER_RE.search(line.strip()):
            start = i
            break

    if start is None:
        return test_results

    for j in range(start + 1, len(lines)):
        stripped = lines[j].strip()
        if "short test summary info" in stripped:
            end = j
            break
        # any new === section === block
        if stripped.startswith("=") and stripped.count("=") > 3:
            end = j
            break

    # If no clear "end", just grab the rest of the string
    if end is None:
        end = len(lines)

    failure_block = lines[start:end]

    failures: dict[str, str] = {}
    current_name = None
    current_lines: list[str] = []

    for line in failure_block:
        m = TEST_HEADER_RE.match(line.strip())
        if m:
            if current_name is not None:
                failures[current_name] = "".join(current_lines)

            current_name = m.group(1)
            current_lines = []
        elif current_name:
            current_lines.append(line + "\n")

    if current_name:
        failures[current_name] = "".join(current_lines)

    test_results.test_failures = failures
    return test_results
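
As an example of the unit tests suggested above, a hedged sketch (the bare TestResults() construction and the stdout fixture are assumptions, but the assertions follow from the parsing logic quoted here):

def test_parse_failures_without_summary_footer():
    stdout = (
        "=================================== FAILURES ===================================\n"
        "_______________________________ test_widget ________________________________\n"
        "    assert compute() == 2\n"
    )  # no "short test summary info" footer, so the end index must fall back to end-of-output
    results = parse_test_failures_from_stdout(TestResults(), stdout)
    assert "test_widget" in results.test_failures
    assert "assert compute() == 2" in results.test_failures["test_widget"]
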
None safety

In compare_test_results, accessing original_test_result fields before confirming both results are present can raise exceptions when either side is None. Validate both objects exist before reading attributes or add try/except around access.

    set(candidate_results.get_all_unique_invocation_loop_ids())
)
test_diffs: list[TestDiff] = []
did_all_timeout: bool = True
for test_id in test_ids_superset:
    original_test_result = original_results.get_by_unique_invocation_loop_id(test_id)
    cdd_test_result = candidate_results.get_by_unique_invocation_loop_id(test_id)
    candidate_test_failures = candidate_results.test_failures
    original_test_failures = original_results.test_failures
    cdd_pytest_error = (
        candidate_test_failures.get(original_test_result.id.test_fn_qualified_name(), "")
        if candidate_test_failures
        else ""
    )
    original_pytest_error = (
        original_test_failures.get(original_test_result.id.test_fn_qualified_name(), "")
        if original_test_failures
        else ""
    )

    if cdd_test_result is not None and original_test_result is None:
        continue
    # If helper function instance_state verification is not present, that's ok. continue
    if (
        original_test_result.verification_type
        and original_test_result.verification_type == VerificationType.INIT_STATE_HELPER
        and cdd_test_result is None
    ):
        continue
    if original_test_result is None or cdd_test_result is None:
        return False, []
    did_all_timeout = did_all_timeout and original_test_result.timed_out
Race/State

CandidateProcessor mutates shared lists of futures (refinement/repair) and internal state across threads. Ensure thread-safety when appending to self.future_all_code_repair from other threads and when clearing these lists after wait; consider locks or confining mutations to the processor's own thread.
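
One way to address this, sketched under the assumption that future_all_code_repair is a plain list (the lock and the swap-style drain below are illustrative, not the PR's code):

import threading
from concurrent.futures import Future

class CandidateProcessor:
    def __init__(self) -> None:
        self._futures_lock = threading.Lock()
        self.future_all_code_repair: list[Future] = []

    def add_repair_future(self, fut: Future) -> None:
        # Worker threads append under the lock so they cannot race with a drain.
        with self._futures_lock:
            self.future_all_code_repair.append(fut)

    def drain_repair_futures(self) -> list[Future]:
        # Atomically swap the list out rather than clearing it in place,
        # so futures appended mid-drain land in the fresh list.
        with self._futures_lock:
            drained, self.future_all_code_repair = self.future_all_code_repair, []
        return drained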

@github-actions

github-actions bot commented Dec 5, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Guard against None before access

Accessing original_test_result.id before ensuring original_test_result is not None
can raise an AttributeError. Reorder logic to guard against None before building
cdd_pytest_error/original_pytest_error. Also compute failure messages only after
both results are validated to exist.

codeflash/verification/equivalence.py [16-43]

 def compare_test_results(original_results: TestResults, candidate_results: TestResults) -> tuple[bool, list[TestDiff]]:
-    # This is meant to be only called with test results for the first loop index
     if len(original_results) == 0 or len(candidate_results) == 0:
-        return False, []  # empty test results are not equal
+        return False, []
     original_recursion_limit = sys.getrecursionlimit()
     if original_recursion_limit < INCREASED_RECURSION_LIMIT:
-        sys.setrecursionlimit(INCREASED_RECURSION_LIMIT)  # Increase recursion limit to avoid RecursionError
+        sys.setrecursionlimit(INCREASED_RECURSION_LIMIT)
     test_ids_superset = original_results.get_all_unique_invocation_loop_ids().union(
         set(candidate_results.get_all_unique_invocation_loop_ids())
     )
     test_diffs: list[TestDiff] = []
     did_all_timeout: bool = True
     for test_id in test_ids_superset:
         original_test_result = original_results.get_by_unique_invocation_loop_id(test_id)
         cdd_test_result = candidate_results.get_by_unique_invocation_loop_id(test_id)
-        candidate_test_failures = candidate_results.test_failures
-        original_test_failures = original_results.test_failures
-        cdd_pytest_error = (
-            candidate_test_failures.get(original_test_result.id.test_fn_qualified_name(), "")
-            if candidate_test_failures
-            else ""
-        )
-        original_pytest_error = (
-            original_test_failures.get(original_test_result.id.test_fn_qualified_name(), "")
-            if original_test_failures
-            else ""
-        )
 
         if cdd_test_result is not None and original_test_result is None:
             continue
-        # If helper function instance_state verification is not present, that's ok. continue
         if (
-            original_test_result.verification_type
+            original_test_result is not None
             and original_test_result.verification_type == VerificationType.INIT_STATE_HELPER
             and cdd_test_result is None
         ):
             continue
         if original_test_result is None or cdd_test_result is None:
             return False, []
+
+        candidate_test_failures = candidate_results.test_failures
+        original_test_failures = original_results.test_failures
+        test_fn_name = original_test_result.id.test_fn_qualified_name()
+        cdd_pytest_error = candidate_test_failures.get(test_fn_name, "") if candidate_test_failures else ""
+        original_pytest_error = original_test_failures.get(test_fn_name, "") if original_test_failures else ""
+
         did_all_timeout = did_all_timeout and original_test_result.timed_out
         if original_test_result.timed_out:
             continue
         superset_obj = False
Suggestion importance[1-10]: 8


Why: The code accesses original_test_result.id before verifying original_test_result is not None, which can raise an AttributeError; reordering the guards is accurate and prevents a real bug without changing behavior.

Impact: Medium
Fix Enum base class order

The Enum declaration order is invalid; in Python, StrEnum-like patterns must inherit
from str first, then Enum. Reverse the base class order to prevent runtime/type
errors when creating or comparing enum values.

codeflash/models/models.py [387-391]

-class OptimizedCandidateSource(enum.Enum, str):
+class OptimizedCandidateSource(str, enum.Enum):
     OPTIMIZE = "OPTIMIZE"
     OPTIMIZE_LP = "OPTIMIZE_LP"
     REFINE = "REFINE"
     REPAIR = "REPAIR"
Suggestion importance[1-10]: 7


Why: Inheriting from enum.Enum, str in that order is rejected by CPython's Enum machinery at class-creation time; switching to str, enum.Enum is a small but necessary fix and ensures members of OptimizedCandidateSource compare as expected against plain strings.

Impact: Medium
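
For reference, a quick self-contained demonstration of why the base-class order matters:

import enum

class OptimizedCandidateSource(str, enum.Enum):
    OPTIMIZE = "OPTIMIZE"
    REPAIR = "REPAIR"

# The str mixin makes members compare equal to plain strings:
assert OptimizedCandidateSource.REPAIR == "REPAIR"

# With the bases reversed (class Bad(enum.Enum, str)), CPython rejects the
# class definition itself with a TypeError: "new enumerations should be
# created as `EnumName([mixin_type, ...] [data_type,] enum_type)`".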
General
Harden pytest failure parsing

The parser assumes pytest "FAILURES" section and underscores headers always exist
and match once; this can misparse or raise on alternative outputs. Add basic guards
and normalize header matching to avoid incorrect slicing and ensure test_failures
remains a dict even on unexpected formats.

codeflash/verification/parse_test_output.py [519-567]

-FAILURES_HEADER_RE = re.compile(r"=+ FAILURES =+")
+FAILURES_HEADER_RE = re.compile(r"=+\s*FAILURES\s*=+")
 TEST_HEADER_RE = re.compile(r"_{3,}\s*(.*?)\s*_{3,}$")
 
 
 def parse_test_failures_from_stdout(test_results: TestResults, stdout: str) -> TestResults:
-    """Extract individual pytest test failures from stdout grouped by test case qualified name, and add them to the test results."""
+    failures: dict[str, str] = {}
+    if not stdout:
+        test_results.test_failures = failures
+        return test_results
+
     lines = stdout.splitlines()
-    start = end = None
-
+    start = None
     for i, line in enumerate(lines):
         if FAILURES_HEADER_RE.search(line.strip()):
             start = i
             break
-
     if start is None:
+        test_results.test_failures = failures
         return test_results
 
+    end = None
     for j in range(start + 1, len(lines)):
         stripped = lines[j].strip()
         if "short test summary info" in stripped:
             end = j
             break
-        # any new === section === block
         if stripped.startswith("=") and stripped.count("=") > 3:
             end = j
             break
-
-    # If no clear "end", just grap the rest of the string
     if end is None:
         end = len(lines)
 
     failure_block = lines[start:end]
-
-    failures: dict[str, str] = {}
     current_name = None
     current_lines: list[str] = []
-
     for line in failure_block:
         m = TEST_HEADER_RE.match(line.strip())
         if m:
-            if current_name is not None:
+            if current_name is not None and current_lines:
                 failures[current_name] = "".join(current_lines)
-
             current_name = m.group(1)
             current_lines = []
-        elif current_name:
+        elif current_name is not None:
             current_lines.append(line + "\n")
-
-    if current_name:
+    if current_name is not None and current_lines:
         failures[current_name] = "".join(current_lines)
 
     test_results.test_failures = failures
     return test_results
Suggestion importance[1-10]: 6


Why: The proposed normalization and guards make the failure parsing more robust to format variations; while not critical, it reduces parsing errors and keeps test_failures consistent.

Impact: Low
