-
Notifications
You must be signed in to change notification settings - Fork 907
Enhanced worker crash handling with integrated crash telemetry #5412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
- Added enhanced crash handling logic with feature flag control - Implemented dual-mode operation for Plan v7 vs Plan v8+ scenarios - Added forced completion for Plan v8+ worker crashes - Enhanced logging for crash detection and completion analysis - Added notifyServerOfWorkerCrash variable for clear intent - Maintained backward compatibility with original logic - Added comprehensive trace logging for debugging
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| { | ||
| // Direct plan event reporting for Plan v8+ worker crashes | ||
| Trace.Warning($"Plan event reporting for Plan v8+ worker crash [JobId:{message.JobId}, PlanVersion:{message.Plan.Version}, ExitCode:{returnCode}, Result:{result}]"); | ||
| await ReportJobCompletionEventAsync(message, result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we not calling CompleteJobRequestAsync inside?
Are we completely changing the agent’s behavior to not call CompleteJobRequestAsync for V8?
|
|
||
| Trace.Info($"Enhanced crash handling enabled - Normal completion crash analysis [JobId:{message.JobId}, PlanVersion:{message.Plan.Version}, IsPlanV8Plus:{isPlanV8Plus}, IsWorkerCrash:{isWorkerCrash}, ExitCode:{returnCode}, NeedsForcedCompletion:{isPlanV8Plus && isWorkerCrash}]"); | ||
|
|
||
| if (isPlanV8Plus && isWorkerCrash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if isPlanV8Plus = true and isWorkerCrash = false? did we test this behaviour?
| nameof(EnhancedWorkerCrashHandling), | ||
| "If true, enables enhanced worker crash handling with forced completion for Plan v8+ scenarios where worker crashes cannot send completion events", | ||
| new EnvironmentKnobSource("ENHANCED_WORKER_CRASH_HANDLING"), | ||
| new BuiltInDefaultKnobSource("false")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intended, not to have RuntimeKnobSource?
| if (enhancedworkercrashhandlingenabled) | ||
| { | ||
| bool isPlanV8Plus = PlanUtil.GetFeatures(message.Plan).HasFlag(PlanFeatures.JobCompletedPlanEvent); | ||
| bool isWorkerCrash = !TaskResultUtil.IsValidReturnCode(returnCode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we rename this to worker failed to send status to server?
as in that case this can be extended to any events in future
| Trace.Info("Standard completion executed successfully"); | ||
| } | ||
| } | ||
| else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please check if we could simplify this if/else
something similar to this and then use this function in main function?
private bool ShouldUseEnhancedCrashHandling(AgentJobRequestMessage message, int returnCode)
{
if (!AgentKnobs.EnhancedWorkerCrashHandling.GetValue(...).AsBoolean())
return false;
bool isPlanV8Plus = PlanUtil.GetFeatures(message.Plan).HasFlag(PlanFeatures.JobCompletedPlanEvent);
bool isWorkerCrash = !TaskResultUtil.IsValidReturnCode(returnCode);
return isPlanV8Plus && isWorkerCrash;
}
| Uri jobServerUrl = systemConnection.Url; | ||
|
|
||
| // Make sure SystemConnection Url match Config Url base for OnPremises server | ||
| if (!message.Variables.ContainsKey(Constants.Variables.System.ServerType) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic is similar to the logic in method LogWorkerProcessUnhandledException, could we please check if we can refactor this
| public static readonly Knob EnhancedWorkerCrashHandling = new Knob( | ||
| nameof(EnhancedWorkerCrashHandling), | ||
| "If true, enables enhanced worker crash handling with forced completion for Plan v8+ scenarios where worker crashes cannot send completion events", | ||
| new EnvironmentKnobSource("ENHANCED_WORKER_CRASH_HANDLING"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please convert this to runtime knob
| detailInfo = string.Join(Environment.NewLine, workerOutput); | ||
| Trace.Info($"Return code {returnCode} indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result."); | ||
| await LogWorkerProcessUnhandledException(message, detailInfo, agentCertManager.SkipServerCertificateValidation); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we have sendEvent to server method here as well, as here only we are logging if worker is terminated due to unhandled exception
Context
Design Doc: https://microsoftapc.sharepoint.com/:w:/r/teams/ADOTasksandAgents/_layouts/15/Doc.aspx?sourcedoc=%7BDFED2317-D226-4EC9-A6AB-1E3E53D91934%7D&file=worker%20crash%20handling.docx&action=default&mobileredirect=true
#AB2333253
Description
Provide a concise summary of the changes introduced in this PR.
Risk Assessment (Low / Medium / High)
Assess the risk level and justify your assessment. For example: code path sensitivity, usage scope, or backward compatibility concerns.
Unit Tests Added or Updated (Yes / No)
Indicate whether unit tests were added or modified to reflect the changes.
Additional Testing Performed
List manual or automated tests performed beyond unit tests (e.g., integration, scenario, regression).
Change Behind Feature Flag (Yes / No)
Can this change be behine feature flag, if not why?
Tech Design / Approach
Documentation Changes Required (Yes/No)
Indicate whether related documentation needs to be updated.
Logging Added/Updated (Yes/No)
Telemetry Added/Updated (Yes/No)
Rollback Scenario and Process (Yes/No)
Dependency Impact Assessed and Regression Tested (Yes/No)