@raujaiswal (Contributor) commented Nov 28, 2025

Context

Design Doc: https://microsoftapc.sharepoint.com/:w:/r/teams/ADOTasksandAgents/_layouts/15/Doc.aspx?sourcedoc=%7BDFED2317-D226-4EC9-A6AB-1E3E53D91934%7D&file=worker%20crash%20handling.docx&action=default&mobileredirect=true

#AB2333253


Description

Provide a concise summary of the changes introduced in this PR.


Risk Assessment (Low / Medium / High)

Assess the risk level and justify your assessment. For example: code path sensitivity, usage scope, or backward compatibility concerns.


Unit Tests Added or Updated (Yes / No)

Indicate whether unit tests were added or modified to reflect the changes.


Additional Testing Performed

List manual or automated tests performed beyond unit tests (e.g., integration, scenario, regression).


Change Behind Feature Flag (Yes / No)

Can this change be put behind a feature flag? If not, why?


Tech Design / Approach

  • Design has been written and reviewed.
  • Any architectural decisions, trade-offs, and alternatives are captured.

Documentation Changes Required (Yes/No)

Indicate whether related documentation needs to be updated.

  • User guides, API specs, system diagrams, or runbooks are updated.

Logging Added/Updated (Yes/No)

  • Appropriate log statements are added with meaningful messages.
  • Logging does not expose sensitive data.
  • Log levels are used correctly (e.g., info, warn, error).

Telemetry Added/Updated (Yes/No)

  • Custom telemetry (e.g., counters, timers, error tracking) is added as needed.
  • Events are tagged with proper metadata for filtering and analysis.
  • Telemetry is validated in staging or test environments.

Rollback Scenario and Process (Yes/No)

  • Rollback plan is documented.

Dependency Impact Assessed and Regression Tested (Yes/No)

  • All impacted internal modules, APIs, services, and third-party libraries are analyzed.
  • Results are reviewed and confirmed to not break existing functionality.

azure-pipelines-bot and others added 3 commits November 28, 2025 15:33
- Added enhanced crash handling logic with feature flag control
- Implemented dual-mode operation for Plan v7 vs Plan v8+ scenarios
- Added forced completion for Plan v8+ worker crashes
- Enhanced logging for crash detection and completion analysis
- Added notifyServerOfWorkerCrash variable for clear intent
- Maintained backward compatibility with original logic
- Added comprehensive trace logging for debugging
@raujaiswal marked this pull request as ready for review December 2, 2025 06:47
@raujaiswal requested review from a team as code owners December 2, 2025 06:47
@raujaiswal marked this pull request as draft December 3, 2025 09:52
@raujaiswal marked this pull request as ready for review December 5, 2025 09:39
@raujaiswal
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@raujaiswal changed the title from "Implement enhanced worker crash handling in JobDispatcher" to "Enhanced worker crash handling in JobDispatcher with added crash telemetry" Dec 5, 2025
@raujaiswal changed the title from "Enhanced worker crash handling in JobDispatcher with added crash telemetry" to "Enhanced worker crash handling with added crash telemetry" Dec 5, 2025
@raujaiswal changed the title from "Enhanced worker crash handling with added crash telemetry" to "Enhanced worker crash handling with integrated crash telemetry" Dec 5, 2025
@raujaiswal
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@raujaiswal added the misc (Miscellaneous Changes) label Dec 8, 2025

{
    // Direct plan event reporting for Plan v8+ worker crashes
    Trace.Warning($"Plan event reporting for Plan v8+ worker crash [JobId:{message.JobId}, PlanVersion:{message.Plan.Version}, ExitCode:{returnCode}, Result:{result}]");
    await ReportJobCompletionEventAsync(message, result);

Contributor

Why are we not calling CompleteJobRequestAsync inside?
Are we completely changing the agent’s behavior to not call CompleteJobRequestAsync for V8?


Trace.Info($"Enhanced crash handling enabled - Normal completion crash analysis [JobId:{message.JobId}, PlanVersion:{message.Plan.Version}, IsPlanV8Plus:{isPlanV8Plus}, IsWorkerCrash:{isWorkerCrash}, ExitCode:{returnCode}, NeedsForcedCompletion:{isPlanV8Plus && isWorkerCrash}]");

if (isPlanV8Plus && isWorkerCrash)

Contributor

What if isPlanV8Plus = true and isWorkerCrash = false? Did we test this behaviour?

    nameof(EnhancedWorkerCrashHandling),
    "If true, enables enhanced worker crash handling with forced completion for Plan v8+ scenarios where worker crashes cannot send completion events",
    new EnvironmentKnobSource("ENHANCED_WORKER_CRASH_HANDLING"),
    new BuiltInDefaultKnobSource("false"));

Contributor

Is it intentional that this knob has no RuntimeKnobSource?

if (enhancedworkercrashhandlingenabled)
{
    bool isPlanV8Plus = PlanUtil.GetFeatures(message.Plan).HasFlag(PlanFeatures.JobCompletedPlanEvent);
    bool isWorkerCrash = !TaskResultUtil.IsValidReturnCode(returnCode);

Contributor

Should we rename this to something like "worker failed to send status to server"?
That way it could be extended to other events in the future.

Trace.Info("Standard completion executed successfully");
}
}
else

Contributor

Please check whether we can simplify this if/else.

Could we extract something like the following helper and call it from the main method?

private bool ShouldUseEnhancedCrashHandling(AgentJobRequestMessage message, int returnCode)
{
    if (!AgentKnobs.EnhancedWorkerCrashHandling.GetValue(...).AsBoolean())
        return false;

    bool isPlanV8Plus = PlanUtil.GetFeatures(message.Plan).HasFlag(PlanFeatures.JobCompletedPlanEvent);
    bool isWorkerCrash = !TaskResultUtil.IsValidReturnCode(returnCode);

    return isPlanV8Plus && isWorkerCrash;
}
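
To make the branch conditions easier to reason about (including the isPlanV8Plus = true, isWorkerCrash = false case raised earlier in this review), here is a minimal standalone sketch of the same predicate. It is not the agent code: the class name CrashHandlingSketch and the three boolean parameters are hypothetical stand-ins for the knob value, the PlanFeatures.JobCompletedPlanEvent check, and TaskResultUtil.IsValidReturnCode, so the decision can be exercised without agent dependencies.

```csharp
using System;

public static class CrashHandlingSketch
{
    // Hypothetical stand-in for the suggested ShouldUseEnhancedCrashHandling helper.
    // The knob value, plan capability, and return-code validity are passed in as
    // plain booleans (an assumption made only to keep this sketch self-contained).
    public static bool ShouldUseEnhancedCrashHandling(
        bool knobEnabled, bool isPlanV8Plus, bool isValidReturnCode)
    {
        // Feature flag off: always fall back to the original completion logic.
        if (!knobEnabled)
        {
            return false;
        }

        // An invalid return code is treated as a worker crash.
        bool isWorkerCrash = !isValidReturnCode;
        return isPlanV8Plus && isWorkerCrash;
    }

    public static void Main()
    {
        // Enumerate all eight input combinations so each branch is visible.
        foreach (bool knob in new[] { false, true })
        foreach (bool v8 in new[] { false, true })
        foreach (bool valid in new[] { false, true })
        {
            bool enhanced = ShouldUseEnhancedCrashHandling(knob, v8, valid);
            Console.WriteLine(
                $"knob={knob} planV8Plus={v8} validReturnCode={valid} -> enhanced={enhanced}");
        }
    }
}
```

Only one of the eight combinations (knob enabled, Plan v8+, invalid return code) selects the enhanced path. Every other combination, including Plan v8+ with a valid return code, falls through to the standard completion logic, so those are the cases worth covering in unit tests.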

Uri jobServerUrl = systemConnection.Url;

// Make sure SystemConnection Url match Config Url base for OnPremises server
if (!message.Variables.ContainsKey(Constants.Variables.System.ServerType) ||

Contributor

This logic is similar to the logic in the LogWorkerProcessUnhandledException method. Could we check whether it can be refactored into shared code?

public static readonly Knob EnhancedWorkerCrashHandling = new Knob(
    nameof(EnhancedWorkerCrashHandling),
    "If true, enables enhanced worker crash handling with forced completion for Plan v8+ scenarios where worker crashes cannot send completion events",
    new EnvironmentKnobSource("ENHANCED_WORKER_CRASH_HANDLING"),

Contributor

Please convert this to a runtime knob.

detailInfo = string.Join(Environment.NewLine, workerOutput);
Trace.Info($"Return code {returnCode} indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result.");
await LogWorkerProcessUnhandledException(message, detailInfo, agentCertManager.SkipServerCertificateValidation);

Contributor

Shouldn't we also call the send-event-to-server method here, since this is where we log that the worker was terminated due to an unhandled exception?

Labels

internal misc Miscellaneous Changes

4 participants