Skip to content

Conversation

@AlexejPenner
Copy link
Contributor

@AlexejPenner AlexejPenner commented Nov 19, 2025

Describe changes

see this slack thread for reference

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
  • IMPORTANT: I made sure that my changes are reflected properly in the following resources:
    • ZenML Docs
    • Dashboard: Needs to be communicated to the frontend team.
    • Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
    • Projects: Depending on the version dependencies, different projects might get affected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

@github-actions github-actions bot added internal To filter out internal PRs and issues bug Something isn't working labels Nov 19, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Nov 19, 2025

Documentation Link Check Results

Absolute links check failed
There are broken absolute links in the documentation. See workflow logs for details
Relative links check passed
Last checked: 2025-12-01 08:47:46 UTC

@AlexejPenner AlexejPenner force-pushed the bugfix/helmchart-fsgroup-1000 branch from e4fab64 to 5ae4057 Compare November 19, 2025 21:11
@AlexejPenner AlexejPenner changed the title adjusted helm Adjusted helm chart to have the right permissions for PVC Nov 20, 2025
@htahir1
Copy link
Contributor

htahir1 commented Nov 21, 2025

@claude full-review

@claude
Copy link

claude bot commented Nov 21, 2025

Claude finished @htahir1's task —— View job


Full PR Review

Summary

This PR addresses PVC (Persistent Volume Claim) permission issues in the ZenML Helm chart by properly configuring the fsGroup security context. The changes are well-implemented and address a real operational issue that would cause deployment failures when using persistent volumes.

Changes Overview

Files Modified:

  • helm/values.yaml: Sets podSecurityContext.fsGroup to 1000 by default + adds comprehensive inline documentation
  • helm/README.md: Adds new "Database Persistence" section with usage guidance

Detailed Review

✅ Strengths

  1. Correct Technical Solution

    • Setting fsGroup: 1000 aligns with the container's runAsUser: 1000 (helm/values.yaml:1191)
    • The fsGroupChangePolicy: "OnRootMismatch" is an excellent optimization that prevents unnecessary permission changes on every pod restart
    • This follows Kubernetes best practices for volume permissions
  2. Excellent Documentation

    • Inline comments in values.yaml explain the "why" (permission errors) not just the "what"
    • The README section provides clear guidance with concrete examples
    • Both locations warn users about the importance of maintaining this setting when overriding
  3. Backward Compatibility

    • This change sets a default that was previously empty/commented
    • Existing deployments won't break, but new deployments will work out-of-the-box
    • Users who override podSecurityContext are properly warned in documentation

🔍 Issues & Suggestions

1. Missing Test Coverage (High Priority)

The PR checklist shows "I have added tests to cover my changes" is unchecked. While this is infrastructure/configuration code, consider:

  • Does ZenML have Helm chart tests (e.g., using helm test or chart testing)?
  • Can you add a simple validation that verifies fsGroup is set when persistence.enabled is true?
  • At minimum, manual testing should be documented in the PR description

Recommendation: Add a comment in the PR description documenting how this was tested (e.g., "Tested by deploying with persistence enabled and verifying pod can write to PVC").

2. Inconsistent Scope in Comments (Medium Priority)

The inline comment at helm/values.yaml:318-320 mentions:

"When persistence is enabled, ensure that podSecurityContext.fsGroup is set to 1000..."

But the comment at helm/values.yaml:1181 broadens the scope:

"fsGroup must be set to 1000 when using persistent volumes (database persistence or backup PVCs)"

Question: Does this fix also apply to backup PVCs (zenml.database.backupPVStorageSize)? If yes, the first comment should mention backup PVCs too. If no, clarify the scope.

Related code reference: helm/values.yaml:427 mentions:

"NOTE: you should set the podSecurityContext.fsGroup option if you are using a persistent volume to store the dump file."

This suggests backup PVCs also need this setting, so the documentation should be consistent.

3. Default Value Change Impact (Low Priority)

This changes the default from an empty podSecurityContext: {} to having explicit values. Consider:

  • Are there any environments where setting fsGroup: 1000 could cause issues?
  • What if users are running with a different UID/GID strategy?

Note: This is likely fine since the container already runs as UID 1000 (securityContext.runAsUser: 1000), but it's worth confirming there are no edge cases.

4. PR Checklist Items (Medium Priority)

Several checklist items are unchecked:

  • Tests to cover changes
  • ZenML Docs might need updates (depending on scope)
  • Dashboard communication (probably N/A)
  • Templates (probably N/A)
  • Projects (probably N/A)

Action Items:

  • Mark "Bug fix" in the types of changes section
  • Verify if ZenML Docs at docs.zenml.io need corresponding updates
  • Document testing approach
  • Mark N/A items as N/A in the PR description

📝 Code Quality

Comment Quality: ⭐⭐⭐⭐⭐

  • Comments explain why (permission errors) not just what (set fsGroup)
  • Includes context about UID 1000 and root:root ownership issues
  • Provides actionable warnings for users overriding defaults

Documentation Quality: ⭐⭐⭐⭐⭐

  • Clear example configuration in README
  • Warning about overriding podSecurityContext is prominently placed
  • Explains the consequences of not following the guidance

🔒 Security Considerations

No security concerns identified

  • Setting fsGroup is a standard security practice
  • Restricting to UID 1000 (non-root) is appropriate
  • fsGroupChangePolicy: "OnRootMismatch" is a good optimization

🎯 Testing Recommendations

  1. Test Case 1: Fresh installation with persistence enabled

    helm install test-zenml . --set zenml.database.persistence.enabled=true
    # Verify pod starts successfully and can write to PVC
  2. Test Case 2: Upgrade existing deployment

    # Test that existing deployments don't break when upgraded
    helm upgrade existing-zenml .
  3. Test Case 3: Custom podSecurityContext override

    # Ensure override without fsGroup causes expected failure (to validate docs)
    helm install test-zenml . \
      --set zenml.database.persistence.enabled=true \
      --set podSecurityContext.runAsGroup=1000
    # (intentionally omitting fsGroup to verify it fails as documented)

📋 Final Recommendations

Before Merging:

  1. ✅ Complete the PR checklist items
  2. ✅ Document testing approach in PR description
  3. ✅ Verify scope: Does this also fix backup PVC permissions? Update docs accordingly
  4. ⚠️ Consider if docs.zenml.io needs updates (check deployment guides)
  5. ⚠️ Add PR label: bug (this fixes a deployment failure issue)

Optional Improvements:

  • Consider adding a validation or warning in the Helm templates if persistence.enabled=true but fsGroup is not set
  • Add a troubleshooting section to the README for common permission errors

Verdict

Status:APPROVE with minor recommendations

The core changes are technically correct and well-documented. The primary gaps are in the PR process (checklist, testing documentation) rather than the code itself. This is a valuable fix that will prevent deployment issues for users enabling persistence.

Great work on the comprehensive inline documentation and README additions! 🎉


References:


@htahir1
Copy link
Contributor

htahir1 commented Nov 24, 2025

@AlexejPenner this needs docs changes

@htahir1
Copy link
Contributor

htahir1 commented Nov 26, 2025

@claude Add docs to this

@claude
Copy link

claude bot commented Nov 26, 2025

Claude finished @htahir1's task —— View job


Documentation Added ✅

I've added comprehensive documentation for the PVC permissions changes to the Helm deployment guide.

Changes Made

  • Read the Helm chart changes to understand what needs documenting
  • Add documentation about PVC permission requirements to deploy-with-helm.md
  • Commit and push changes

What Was Added

Added a new section "Deployment with persistent local database" to docs/book/getting-started/deploying-zenml/deploy-with-helm.md that:

  1. Explains the fsGroup requirement: Documents that podSecurityContext.fsGroup: 1000 is set by default to ensure the ZenML container (running as UID 1000) can write to persistent volumes

  2. Provides example configuration: Shows users how to enable database persistence with the proper security context

  3. Includes a warning: Uses a GitBook warning hint to alert users that if they override podSecurityContext, they must include fsGroup: 1000 or the server will fail to start due to permission errors

  4. Cross-references backup configuration: Links to the database backup and recovery section since the same requirement applies to backup PVCs

The documentation follows ZenML's style guidelines with clear, conversational language and practical examples that users can directly apply to their deployments.

--- | Branch: bugfix/helmchart-fsgroup-1000

github-actions bot and others added 2 commits November 26, 2025 09:20
Add a new section to the Helm deployment documentation explaining the
fsGroup security context requirement for persistent volumes. This
documents the default podSecurityContext.fsGroup: 1000 setting and
warns users about maintaining it when overriding podSecurityContext.

Relates to the Helm chart changes that set fsGroup by default to
prevent permission errors when using persistent volumes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Hamza Tahir <htahir1@users.noreply.github.com>
Copy link
Contributor

@stefannica stefannica left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good but didn't test it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working internal To filter out internal PRs and issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants