@cstockton
Contributor

The systemd defaults for StartLimitInterval and StartLimitBurst are 10s and 5, with a DefaultRestartUSec of 100ms. Most services set RestartSec to 3, so under most circumstances 5 restarts take 15s and the limit of 5 starts within 10s is not exceeded. However, if other system processes (salt, cloud-init) restart the service explicitly, or recovering system services within the --before chain trigger a restart, the limit can be exceeded, causing the service to be marked as failed. Since no services mark gotrue.service as required, it will remain offline until the next explicit restart is issued.

Setting these values to 0 with Restart=always and RestartSec=3 will prevent gotrue from being marked as failed.
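
To make the timing argument concrete, here is a minimal Python sketch of a start rate limit. It is a simplified model, not systemd's code: it assumes a window that opens at the first start, admits up to `burst` starts, and resets once `interval` seconds have passed, and it treats an interval or burst of 0 as "limit disabled". The function name and the explicit restart times (1s and 2s) are hypothetical, chosen only for illustration.

```python
def first_rejected_start(start_times, interval=10.0, burst=5):
    """Return the time of the first start that exceeds the limit, or None.

    Simplified model (an assumption, not systemd's implementation):
    a window opens at the first start, admits up to `burst` starts,
    and resets once more than `interval` seconds have elapsed.
    An interval or burst of 0 disables the limit entirely.
    """
    if interval <= 0 or burst <= 0:
        return None
    window_begin = None
    count = 0
    for t in sorted(start_times):
        if window_begin is None or t - window_begin > interval:
            # Window expired (or first start): open a new one.
            window_begin, count = t, 1
        elif count < burst:
            count += 1
        else:
            # More than `burst` starts inside one window.
            return t
    return None


# Automatic restarts only: one start every 3s, as with RestartSec=3.
auto_only = [3.0 * i for i in range(10)]

# The same loop plus two hypothetical explicit restarts at 1s and 2s,
# standing in for an external tool restarting the service.
with_external = sorted(auto_only + [1.0, 2.0])

print(first_rejected_start(auto_only))                 # None
print(first_rejected_start(with_external))             # 9.0
print(first_rejected_start(with_external, 0.0, 0))     # None
```

In this model the 3s restart loop alone never fits more than four starts into a 10s window, two extra starts are enough to trip the default 10s / 5 limit, and with both values at 0 no start is ever rejected.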

@cstockton cstockton requested review from a team as code owners November 28, 2025 18:14
Chris Stockton and others added 2 commits December 1, 2025 15:59
I've noticed all non-oneshot services set a `RestartSec` of `3s` and we use the
systemd defaults of `StartLimitBurst=5` and `StartLimitInterval=10s`. Together
these mean that under typical conditions a service is restarted indefinitely
until it comes back up, because `(3s * 5) > 10s`, but it is still possible for
a service to enter a failed state in some scenarios. This change defensively
sets both limits to 0 to keep services in restart loops.
@cstockton cstockton enabled auto-merge December 2, 2025 13:18
@samrose samrose requested review from darora and pcnc December 2, 2025 13:20
Collaborator

@samrose samrose left a comment

We'll need to create a testing AMI to thoroughly test these changes out. Will request @LGUG2Z to perform these tests as he's also going to be helping us find ways to automate these testing approaches.

@samrose samrose requested a review from LGUG2Z December 2, 2025 13:52
Collaborator

@samrose samrose left a comment

When we ultimately merge this, we should bump the versions in ansible/vars.yml to create a release for these changes. This way, it will be a distinct change instead of bundled with other changes.
