Conversation

Contributor

@AlexejPenner AlexejPenner commented Nov 28, 2025

Describe changes

I added a section per deployment scenario - https://zenml-io.gitbook.io/alexej/zenml-pro

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
  • IMPORTANT: I made sure that my changes are reflected properly in the following resources:
    • ZenML Docs
    • Dashboard: Needs to be communicated to the frontend team.
    • Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
    • Projects: Depending on the version dependencies, different projects might get affected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

@AlexejPenner AlexejPenner changed the base branch from develop to docs/pro-vs-oss November 28, 2025 10:33
@AlexejPenner AlexejPenner requested a review from htahir1 November 28, 2025 10:33
@github-actions github-actions bot added the internal and documentation labels Nov 28, 2025
Contributor

github-actions bot commented Nov 28, 2025

Documentation Link Check Results

Absolute links check failed
There are broken absolute links in the documentation. See workflow logs for details
Relative links check failed
There are broken relative links in the documentation. See workflow logs for details
Last checked: 2025-12-05 10:01:47 UTC

Contributor

github-actions bot commented Nov 28, 2025

🔍 Broken Links Report

Summary

  • 📁 Files with broken links: 3
  • 🔗 Total broken links: 3
  • 📄 Broken markdown links: 2
  • 🖼️ Broken image links: 1
  • ⚠️ Broken reference placeholders: 0

Details

| File | Link Type | Link Text | Broken Path |
|------|-----------|-----------|-------------|
| zenml-pro/hybrid-deployment-helm.md | 📄 | "Set up users and teams" | ../organization.md |
| zenml-pro/self-hosted-deployment.md | 🖼️ | "Self-hosted deployment architecture" | ../../.gitbook/assets/air-gapped-architecture.png |
| zenml-pro/hybrid-deployment-ecs.md | 📄 | "Set up users and teams" | ../organization.md |
📂 Full file paths
  • /home/runner/work/zenml/zenml/scripts/../docs/book/getting-started/zenml-pro/hybrid-deployment-helm.md
  • /home/runner/work/zenml/zenml/scripts/../docs/book/getting-started/zenml-pro/self-hosted-deployment.md
  • /home/runner/work/zenml/zenml/scripts/../docs/book/getting-started/zenml-pro/hybrid-deployment-ecs.md
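
To reproduce the relative-link failures locally before pushing, a check along the following lines may help. This is only a minimal sketch and not the script the workflow runs: the function name `broken_relative_links` and both regular expressions are assumptions made for illustration. It resolves each relative target against the directory of the linking file, which is why `../organization.md` inside `getting-started/zenml-pro/` is looked up as `getting-started/organization.md`.

```python
import re
from pathlib import Path

# Inline markdown links and images: [text](target) or ![alt](target "title")
LINK_RE = re.compile(r'!?\[([^\]]*)\]\(([^)\s]+)(?:\s+"[^"]*")?\)')
# Targets that start with a URL scheme (https:, mailto:, ...) are not relative
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(markdown_file: Path) -> list[tuple[str, str]]:
    """Return (link text, target) pairs whose relative target is missing."""
    broken = []
    content = markdown_file.read_text(encoding="utf-8")
    for text, target in LINK_RE.findall(content):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # absolute URL or in-page anchor
        path_part = target.split("#", 1)[0]  # drop any trailing #fragment
        # Relative targets resolve against the directory of the linking file
        if not (markdown_file.parent / path_part).exists():
            broken.append((text, target))
    return broken
```

Calling it on each of the three files listed above should return the same link text and path pairs as the report, provided the sketch's simplified link syntax covers the links in question.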

Comment on lines 72 to 93
1. **Code Execution**: You write code and run pipelines with your client SDK using Python
2. **Authentication & Token Acquisition**:
- Users authenticate via your internal identity provider (LDAP/AD/OIDC)
- The ZenML Pro control plane (running in your infrastructure) handles authentication and RBAC
- The ZenML client fetches short-lived tokens from your ZenML workspace for:
- Pushing Docker images to your container registry
- Communicating with your artifact store
- Submitting workloads to your orchestrator
- *Note: Your local Python environment needs the client libraries for your stack components*
3. **Authorization**: RBAC policies enforced by your control plane before token issuance
4. **Image & Workload Submission**: The client pushes Docker images (and optionally code if no code repository is configured) to your container registry, then submits the workload to your orchestrator
5. **Orchestrator Execution**: In the orchestrator environment within your infrastructure:
- The Docker image is pulled from your container registry
- Within the pipeline/step entrypoint, the necessary code is pulled in
- A connection to your ZenML workspace is established
- The relevant pipeline/step code is executed
6. **Runtime Data Flow**: During execution (all within your infrastructure):
- Pipeline and step run metadata is logged to your ZenML workspace
- Logs are streamed to your log backend
- Artifacts are written to your artifact store
- Metadata pointing to these artifacts is persisted in your workspace
7. **Observability**: The ZenML Pro dashboard (running in your infrastructure) connects to your workspace and uses all persisted metadata to provide you with a complete observability plane
Contributor Author

@stefannica fact check pls


The diagram above illustrates a complete air-gapped ZenML Pro deployment with all components running within your organization's VPC. This architecture ensures zero external communication while providing full enterprise MLOps capabilities.

### Architecture Components
Contributor Author

@stefannica fact check pls

- **Backup sites** for disaster recovery
- **Monitoring and alerting** for all components

## Pre-requisites
Contributor Author

@stefannica fact check pls

@AlexejPenner
Contributor Author

https://zenml-io.gitbook.io/alexej/zenml-pro - view here to see it in action

Contributor

@htahir1 htahir1 left a comment

I think it's good for a first round. Many comments apply to many pages.

Comment on lines 59 to 60
- **Vulnerability Assessment Reports** available on request
- **Software Bill of Materials (SBOM)** available on request
Contributor

@stefannica should verify this

Contributor

yes, we can provide this on request


All three deployment scenarios follow a similar pipeline execution pattern, with differences in where authentication happens and where data resides:

### Standard Data Flow Steps
Contributor

This definitely needs a diagram

Contributor Author

Agreed - we might even have one lying around somewhere


**SaaS**: Metadata is stored in ZenML infrastructure. Your ML data and compute remain in your infrastructure.

**Hybrid**: Metadata and control plane are split — authentication/RBAC happens at ZenML control plane, but all run metadata, artifacts, and compute stay in your infrastructure.
Contributor

I think the authentication bit is the most important here and isn't really elaborated, but maybe it is later?

Contributor Author

What more would you like to know about this at this stage?


You control this access by configuring appropriate cloud IAM permissions.

## Getting Started
Contributor

IMO super strange to have this whole section here...

Contributor Author

the whole section? maybe we don't need the example pipeline - but I like how it shows how quickly you're ready

Contributor

Hmm really? It's in the dashboard already when you sign up

Contributor Author

well somebody in the docs here wants to know what complexity awaits them - "Is it worth my time?"

Contributor

I'm not sure tbh

Contributor Author

in my experience these are the questions we get very early on

Contributor

github-actions bot commented Dec 3, 2025

Images automagically compressed by Calibre's image-actions

Compression reduced images by 34%, saving 132.21 KB.

| Filename | Before | After | Improvement | Visual comparison |
|----------|--------|-------|-------------|-------------------|
| docs/book/getting-started/zenml-pro/.gitbook/assets/pro-workload-managers.png | 388.76 KB | 256.55 KB | -34.0% | View diff |

383 images did not require optimisation.

Update required: Update image-actions configuration to the latest version before 1/1/21. See README for instructions.

Contributor

@stefannica stefannica left a comment

first round of reviews, more to follow...

#### ZenML Pro Client Artifacts

If you're planning on running containerized ZenML pipelines, or using other containerization related ZenML features, you'll also need to access the public ZenML client container image located [in Docker Hub at `zenmldocker/zenml`](https://hub.docker.com/r/zenmldocker/zenml). This isn't a problem unless you're deploying ZenML Pro in an air-gapped environment, in which case you'll also have to copy the client container image into your own container registry. You'll also have to configure your code to use the correct base container registry via DockerSettings (see the [DockerSettings documentation](https://docs.zenml.io/how-to/customize-docker-builds) for more information).
If you're planning on running containerized ZenML pipelines, or using other containerization related ZenML features, you'll also need to access the public ZenML client container image located [in Docker Hub at `zenmldocker/zenml`](https://hub.docker.com/r/zenmldocker/zenml). This isn't a problem unless you're deploying ZenML Pro in a Self-hosted environment, in which case you'll also have to copy the client container image into your own container registry. You'll also have to configure your code to use the correct base container registry via DockerSettings (see the [DockerSettings documentation](https://docs.zenml.io/how-to/customize-docker-builds) for more information).
Contributor

Suggested change
If you're planning on running containerized ZenML pipelines, or using other containerization related ZenML features, you'll also need to access the public ZenML client container image located [in Docker Hub at `zenmldocker/zenml`](https://hub.docker.com/r/zenmldocker/zenml). This isn't a problem unless you're deploying ZenML Pro in a Self-hosted environment, in which case you'll also have to copy the client container image into your own container registry. You'll also have to configure your code to use the correct base container registry via DockerSettings (see the [DockerSettings documentation](https://docs.zenml.io/how-to/customize-docker-builds) for more information).
If you're planning on running containerized ZenML pipelines, or using other containerization related ZenML features, you'll also need to access the public ZenML client container image located [in Docker Hub at `zenmldocker/zenml`](https://hub.docker.com/r/zenmldocker/zenml). This isn't a problem unless you're deploying ZenML Pro in an air-gapped environment, in which case you'll also have to copy the client container image into your own container registry. You'll also have to configure your code to use the correct base container registry via DockerSettings (see the [DockerSettings documentation](https://docs.zenml.io/how-to/customize-docker-builds) for more information).

The original text was actually correct here.

Choose **Self-hosted** if you need complete control with no external dependencies.

**What runs where:**
- All components: [Your infrastructure](https://docs.zenml.io/stacks) (completely isolated)
Contributor

It doesn't make sense to point to stacks here.


1. **Code Execution**: You write code and run pipelines with your client SDK using Python

2. **Token Acquisition**: The ZenML client fetches short-lived tokens from your ZenML workspace for:
Contributor

note: this only happens if you use service connectors


| Component | Location | Purpose |
|-----------|----------|---------|
| **ZenML Pro Server** | ZenML Infrastructure | Manages pipeline orchestration and metadata |
Contributor

You are using the term "zenml server" in several places, but it appears in none of the diagrams. You should probably be using "zenml workspace".


| Deployment Aspect | SaaS | Hybrid SaaS | Self-hosted |
|-------------------|------|-------------|------------|
| **ZenML Server** | ZenML infrastructure | Your infrastructure | Your infrastructure |
Contributor

Should this be workspace instead of server?


### 🚀 Production Ready

- **High availability**: Built-in redundancy for critical components
Contributor

Incorrect. I would say that the workspaces are the critical components, not the control plane or UI, and they are not under our control, so we cannot offer any such guarantees and shouldn't make such claims.

- **High availability**: Built-in redundancy for critical components
- **Automatic updates**: Control plane maintained by ZenML
- **Professional support**: Direct access to ZenML experts
- **Monitoring included**: Health checks and alerting configured
Contributor

Only partially correct. Health checks and alerting are only partially configured for the control plane. Workspaces are not covered.


### Artifact Store Access

The ZenML dashboard requires read access to your artifact store to display:
Contributor

This should say UI (not dashboard) and it requires access to more than the artifact store (log store, orchestrators; see my other comment)

- Artifact lineage graphs
- Step logs and outputs

You control this access by configuring appropriate cloud IAM permissions.
Contributor

@stefannica stefannica Dec 6, 2025

This is misleading. The truth is this: if you give your users permission to access these things, you also implicitly give the UI permission to do so. You could say "you control who can access this information in the UI by configuring appropriate ZenML Pro RBAC permissions. Cloud IAM permissions do not apply here."

Comment on lines +105 to +113
```mermaid
graph LR
A[User] -->|1. Login| B[Control Plane<br/>ZenML Infrastructure]
B -->|2. Auth Token| A
A -->|3. Access Workspace| C[Workspace<br/>Your Infrastructure]
C -->|4. Validate Token| B
B -->|5. Authorization| C
C -->|6. Execute| D[Your Resources]
```
Contributor

these do not render correctly

Contributor

@stefannica stefannica left a comment

I tried to review as much of this as I could. Half of it is pretty good, while the other half is clearly vibe-written and riddled with hallucinations and over-simplifications.

I would kindly ask you to give this another careful read yourself, check that it's factually correct based on the original docs and resources, then correct the mistakes.


| Data Type | Storage Location | Purpose |
|-----------|-----------------|---------|
| User credentials | Control Plane | Authentication only |
Contributor

The control plane doesn't store user credentials (unless you count Personal Access Tokens or API keys). It's the customer's SSO/identity provider that stores the credentials.

Comment on lines +115 to +120
1. User authenticates with ZenML control plane (SSO)
2. Control plane issues authentication token
3. User accesses workspace with token
4. Workspace validates token with control plane
5. Control plane confirms authorization (RBAC)
6. Workspace executes operations on your infrastructure
Contributor

It's a bit more complicated than this and the authentication flow varies depending on the type of authentication (web client, python client via web login flow, service account, PAT). If the point of this section is to provide a gross oversimplification of the authentication process, I think you nailed it. But you should probably mention that it's not 100% accurate (e.g. in most cases, the workspace issues its own temporary credentials, to avoid overloading the control plane by checking credentials for every API call).

I would recommend keeping details like the type of credentials being used (token) out of this oversimplified description (and the diagram too):

Suggested change
1. User authenticates with ZenML control plane (SSO)
2. Control plane issues authentication token
3. User accesses workspace with token
4. Workspace validates token with control plane
5. Control plane confirms authorization (RBAC)
6. Workspace executes operations on your infrastructure
1. User authenticates with ZenML control plane (SSO)
2. Control plane issues authentication credentials
3. User accesses workspace with credentials
4. Workspace validates credentials with control plane
5. Control plane confirms authentication and authorization (RBAC)
6. Workspace executes operations on your infrastructure
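
For readers following the thread, the simplified six-step flow and the caveat about workspace-issued temporary credentials can be sketched as a toy model. This is a hedged illustration only and not how ZenML Pro is implemented: the classes `ControlPlane` and `Workspace`, their methods, the in-memory RBAC table, and the random session strings are all hypothetical names chosen for this example. It shows the control plane being consulted once on connect, after which the workspace checks its own temporary credentials locally.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Toy stand-in: authenticates users and holds a hypothetical RBAC table."""

    rbac: dict[str, set[str]]  # user -> workspaces the user may access
    _issued: dict[str, str] = field(default_factory=dict)  # credential -> user
    validation_calls: int = 0

    def login(self, user: str) -> str:
        # Steps 1-2: the user authenticates (SSO itself is out of scope here)
        # and receives opaque credentials.
        credential = secrets.token_hex(8)
        self._issued[credential] = user
        return credential

    def validate(self, credential: str, workspace: str) -> bool:
        # Steps 4-5: confirm authentication and authorization (RBAC).
        self.validation_calls += 1
        user = self._issued.get(credential)
        return user is not None and workspace in self.rbac.get(user, set())


@dataclass
class Workspace:
    """Toy stand-in for a workspace that trusts one control plane."""

    name: str
    control_plane: ControlPlane
    _sessions: set[str] = field(default_factory=set)

    def connect(self, credential: str) -> str:
        # Step 3: the user presents control plane credentials once.
        if not self.control_plane.validate(credential, self.name):
            raise PermissionError("not authorized for this workspace")
        # The workspace then issues its own temporary credentials.
        session = secrets.token_hex(8)
        self._sessions.add(session)
        return session

    def execute(self, session: str, operation: str) -> str:
        # Step 6: later calls are checked locally, without the control plane.
        if session not in self._sessions:
            raise PermissionError("unknown or expired session")
        return f"{operation} executed in {self.name}"
```

In this sketch, a user who connects once and then issues several `execute` calls leaves `validation_calls` at 1, which is the load-reduction effect described in the comment above. Real flows differ by client type (web client, Python client, service account, PAT).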

Comment on lines +193 to +197
1. **Clients authenticate** with ZenML Control Plane (SSO) - hosted by ZenML
2. **Control Plane issues** RBAC-validated tokens to clients
3. **Clients connect** to their assigned workspace(s) in your infrastructure
4. **Workspaces validate** tokens with Control Plane (outbound-only connection)
5. **Pipelines execute** on your infrastructure resources
Contributor

This is a re-iteration of the previous section. You should merge them.

- `kubectl` configured to access your cluster
- `helm` CLI (3.0+) installed
- A domain name and TLS certificate for your ZenML server
- MySQL or PostgreSQL database (managed or self-hosted)
Contributor

Suggested change
- MySQL or PostgreSQL database (managed or self-hosted)
- MySQL database (managed or self-hosted)

Comment on lines +50 to +66
## Step 3: Create Secrets for Credentials

Create a secret for your Pro OAuth2 credentials. Ask your ZenML Solutions Architect to send you this secret:

```bash
kubectl -n zenml-hybrid create secret generic zenml-pro-credentials \
  --from-literal=ZENML_SERVER_PRO_OAUTH2_CLIENT_SECRET=<your-client-secret>
```


If using a custom TLS certificate (self-signed or from a CA), create a secret:

```bash
kubectl -n zenml-hybrid create secret tls zenml-tls \
  --cert=/path/to/tls.crt \
  --key=/path/to/tls.key
```
Contributor

Incorrect. This is handled by the helm chart. The user doesn't have to manually configure any secrets.

Comment on lines +235 to +238
1. Navigate to `https://zenml.mycompany.com` in your browser
2. You should be redirected to ZenML Cloud login
3. Sign in with your organization credentials
4. You should see your workspace listed
Contributor

This is the backwards way of doing it. You should instruct them to log in to cloud.zenml.io, access their org and then their workspace. This redirect is more of a backwards compatibility failsafe than it is an official way of accessing the workspace UI.

kubectl -n zenml-workload-manager create serviceaccount zenml-runner
```

### 2. Configure Workload Manager in Helm Values
Contributor

This section is repeated in at least 3 different places:

  • here
  • in the self-hosted docs
  • in the workload managers section

Can you please just point to the workload managers section instead of duplicating this information?

Comment on lines +332 to +338
external:
type: mysql
host: zenml-db.123456789.us-east-1.rds.amazonaws.com
port: 3306
username: admin
password: <your-rds-password>
database: zenml_hybrid
Contributor

This is hallucinated or ill-informed. Please consult the official helm chart values.

# Add other environment variables as needed
```

## Database Configuration Examples
Contributor

This entire section is unnecessary. We already have a complete section on how to configure the helm chart for OSS server deployments. Duplicating oversimplified parts of that here - and incorrectly at that - isn't going to help anyone. Better to link to the correct and fully detailed OSS helm documentation here instead.

Comment on lines +346 to +352
external:
type: mysql
host: 34.123.45.67
port: 3306
username: root
password: <your-cloud-sql-password>
database: zenml_hybrid
Contributor

Hallucinated. I won't repeat this comment....

Labels

  • documentation: Improvements or additions to documentation
  • internal: To filter out internal PRs and issues
  • vibecoded

4 participants