Skip to content

Conversation

@camd
Copy link
Collaborator

@camd camd commented Nov 18, 2025

An experiment to see how this looks.

Here were the options Claude proposed. I went with Option 1 in this PR:

Option 1: Make Push Health use TextLogError as a fallback ✅ RECOMMENDED

  • Modify treeherder/push_health/tests.py:322-341 to query TextLogError when no FailureLine objects exist
  • Minimal changes, uses existing data
  • Pros: Quick fix, uses data that's already there
  • Cons: Mixing two data sources; TextLogError doesn't have structured test data like FailureLine

Option 2: Investigate and fix FailureLine generation

  • Determine if timeout failures create FailureLine objects
  • If yes, what action do they have? Add that action to Push Health's query
  • If no, fix the log parser to create FailureLine objects for timeouts
  • Pros: Fixes the root cause
  • Cons: Requires understanding the Mozilla test harness structured log format; may require changes upstream

Option 3: Create FailureLine objects from TextLogError

  • When TextLogError objects exist but no matching FailureLine, create synthetic FailureLine objects
  • Pros: Unifies the data sources
  • Cons: More complex; adds processing overhead

My Recommendation

Start with Option 1 (fallback to TextLogError) because:

  1. It's the fastest solution
  2. The data is already available and working in the Summary Panel
  3. It's a minimal code change to treeherder/push_health/tests.py
  4. You can later implement Option 2 as a proper fix

@camd camd self-assigned this Nov 18, 2025
@camd camd requested a review from jmaher November 18, 2025 18:25
@codecov-commenter
Copy link

codecov-commenter commented Nov 18, 2025

Codecov Report

❌ Patch coverage is 18.07229% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.10%. Comparing base (75d47a7) to head (8cf3aa9).

Files with missing lines Patch % Lines
treeherder/push_health/tests.py 18.07% 68 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9079      +/-   ##
==========================================
- Coverage   80.26%   80.10%   -0.16%     
==========================================
  Files         596      596              
  Lines       32182    32265      +83     
  Branches     3276     3269       -7     
==========================================
+ Hits        25830    25845      +15     
- Misses       6184     6252      +68     
  Partials      168      168              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Collaborator

@jmaher jmaher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that this is solving for the case where we are pending log parsing- but that is a normal thing, yet probably not so misleading. If you think this will make push health more responsive and accurate (even the perception), then let me know

total_jobs_for_type,
is_investigated,
investigated_test_id,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this end up with a duplicate entry, assuming we have:
TEST-UNEXPECTED-FAIL | test_1.js | timed out
TEST-CRASH | | test_1.js

in the end we don't need both, I will be curious where this ends up.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a look now...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we'd just get whichever one comes first. So likely the timeout. Would it be better to show the crash instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh, whatever has the most information. I think a crash is a better signal than timeout the majority of the time. I guess we could replace (add/delete) if a crash comes in after the timeout?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I made the change to use the crash, rather than timeout if there is both.

@camd
Copy link
Collaborator Author

camd commented Nov 19, 2025

it seems that this is solving for the case where we are pending log parsing- but that is a normal thing, yet probably not so misleading. If you think this will make push health more responsive and accurate (even the perception), then let me know

This is a really good point. I added logic to check if the job's log parsing is still pending before it tries to fall back to the TextLogErrors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants