Skip to content

Conversation

@Json-Andriopoulos
Copy link
Contributor

Describe changes

I implemented/fixed _ to achieve _.

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
  • IMPORTANT: I made sure that my changes are reflected properly in the following resources:
    • ZenML Docs
    • Dashboard: Needs to be communicated to the frontend team.
    • Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
    • Projects: Depending on the version dependencies, different projects might get affected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

@github-actions github-actions bot added the enhancement New feature or request label Nov 26, 2025
@Json-Andriopoulos Json-Andriopoulos force-pushed the feature/4135-generalised-early-stopping branch from 6dee130 to 9f6475d Compare November 26, 2025 07:19
@Json-Andriopoulos Json-Andriopoulos added the run-slow-ci Tag that is used to trigger the slow-ci label Nov 26, 2025
@Json-Andriopoulos Json-Andriopoulos force-pushed the feature/4135-generalised-early-stopping branch from 9f6475d to e89d745 Compare November 26, 2025 07:57
RuntimeError: If steps fail to be set to STOPPING or steps don't have heartbeat enabled.
"""

from zenml.client import Client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, I think to avoid any race conditions and make this much simpler, we should set the pipeline run status instead and adjust the server-side logic to look at this one when deciding on the heartbeat response.

@Json-Andriopoulos Json-Andriopoulos force-pushed the feature/4135-generalised-early-stopping branch from 30d9131 to 5aba3ed Compare November 26, 2025 16:16
@github-actions
Copy link
Contributor

github-actions bot commented Nov 26, 2025

ZenML CLI Performance Comparison (Threshold: 1.0s, Timeout: 60s, Slow: 5s)

❌ Failed Commands on Current Branch (feature/4135-generalised-early-stopping)

  • zenml stack list: Command failed on run 1 (exit code: 1)
  • zenml pipeline list: Command failed on run 1 (exit code: 1)
  • zenml model list: Command failed on run 1 (exit code: 1)

🚨 New Failures Introduced

The following commands fail on your branch but worked on the target branch:

  • zenml stack list
  • zenml pipeline list
  • zenml model list

Performance Comparison

Command develop Time (s) feature/4135-generalised-early-stopping Time (s) Difference Status
zenml --help 1.420556 ± 0.011662 1.423323 ± 0.009695 +0.003s ✓ No significant change
zenml model list Not tested Failed N/A ❌ Broken in current branch
zenml pipeline list Not tested Failed N/A ❌ Broken in current branch
zenml stack --help 1.425642 ± 0.007874 1.453499 ± 0.077201 +0.028s ✓ No significant change
zenml stack list Not tested Failed N/A ❌ Broken in current branch

Summary

  • Total commands analyzed: 5
  • Commands compared for timing: 2
  • Commands improved: 0 (0.0% of compared)
  • Commands degraded: 0 (0.0% of compared)
  • Commands unchanged: 2 (100.0% of compared)
  • Failed commands: 3 (NEW FAILURES INTRODUCED)
  • Timed out commands: 0
  • Slow commands: 0

Environment Info

  • Target branch: Linux 6.11.0-1018-azure
  • Current branch: Linux 6.11.0-1018-azure
  • Test timestamp: 2025-11-27T07:30:14Z
  • Timeout: 60 seconds
  • Slow threshold: 5 seconds

@Json-Andriopoulos Json-Andriopoulos force-pushed the feature/4135-generalised-early-stopping branch from 53c3806 to d817e63 Compare November 27, 2025 06:22
@Json-Andriopoulos Json-Andriopoulos force-pushed the feature/4135-generalised-early-stopping branch 8 times, most recently from 43b78fd to e4a8533 Compare December 1, 2025 08:00
@Json-Andriopoulos Json-Andriopoulos force-pushed the feature/4135-generalised-early-stopping branch from e4a8533 to 221ad75 Compare December 1, 2025 11:30
@Json-Andriopoulos Json-Andriopoulos force-pushed the feature/4135-generalised-early-stopping branch from 221ad75 to c2b3d01 Compare December 1, 2025 15:19
title="The substitutions of the step run.",
default={},
)
cached_heartbeat_threshold: Optional[int] = Field(title="", default=None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing title here I think?

Also, I'm not sure what the cached in the name here indicates? We can also just call this heartbeat_threshold right? I'm wondering whether we even need this attribute on the response, or if it's enough just on the DB schema. When we have the model already, we might as well just do .config.heartbeat_threshold I assume?

return False

if latest_heartbeat:
heartbeat_diff = datetime.now(tz=timezone.utc) - to_local_tz(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to convert to the local timezone here, and then subtract that from a utc timezone datetime?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to have both objects timezone-aware. Basically this part of the utiltiy:

if dt.tzinfo is None:
    dt = dt.replace(tzinfo=timezone.utc)

)
)

cached_heartbeat_threshold: Optional[int] = Field(nullable=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call this just heartbeat_threshold, the cached_ IMO just confuses things. This is really just the configured threshold right, whether this is a duplicate from the config doesn't really play a role when using it?

return items


def depaginate_stream(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not used anymore? Feel free to leave it in though, I've always thought we should have an option to get them as an iterator, potentially even on our client methods

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yy decided to leave it a more memory-safe option.

"substitutions": substitutions,
"cache_policy": cache_policy,
"runtime": runtime,
"heartbeat_healthy_threshold": heartbeat_healthy_threshold,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also BaseStep.with_options(...) that needs to have this new argument

self._terminated = True


def cached_is_heartbeat_unhealthy(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the cached in the name here either?

"Heartbeat for step `%s` indicates unhealthy status.",
step_name,
)
return NodeStatus.FAILED
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get to this point, it means the Kubernetes job is still actively running but for some reason the heartbeat stopped, correct? In that case, it would probably be a good idea to also stop the job from continuing?

step_run_id=step_run_info.step_run_id,
step_run_update=StepRunUpdate(
exception_info=exception_info,
status=ExecutionStatus.STOPPING,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason you update to STOPPING here instead of STOPPED?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic behind this is that step runner sets to stopping and the step launcher which could be in a different process does any post execution steps and sets to stopped. It could be set to stopped directly I guess but to me it makes more sense as status transitions go (it is truly stopped after the step-launcher handles the signal/interruption).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request run-slow-ci Tag that is used to trigger the slow-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants