Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/cloud/guides/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ keywords: ['cloud guides', 'documentation', 'how-to', 'cloud features', 'tutoria
| [BYOC on AWS Observability](/cloud/reference/byoc/observability) | Deploy ClickHouse on your own cloud infrastructure |
| [BYOC Onboarding for AWS](/cloud/reference/byoc/onboarding/aws) | Deploy ClickHouse on your own cloud infrastructure |
| [BYOC security playbook](/cloud/security/audit-logging/byoc-security-playbook) | This page illustrates methods customers can use to identify potential security events |
| [ClickHouse Cloud production readiness guide](/cloud/guides/production-readiness) | Guide for organizations transitioning from quick start to enterprise-ready ClickHouse Cloud deployments |
| [ClickHouse Government](/cloud/infrastructure/clickhouse-government) | Overview of ClickHouse Government offering |
| [ClickHouse Private](/cloud/infrastructure/clickhouse-private) | Overview of ClickHouse Private offering |
| [Cloud Compatibility](/whats-new/cloud-compatibility) | This guide provides an overview of what to expect functionally and operationally in ClickHouse Cloud. |
Expand Down
180 changes: 180 additions & 0 deletions docs/cloud/guides/production-readiness.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,180 @@
---
slug: /cloud/guides/production-readiness
sidebar_label: 'Production readiness'
title: 'ClickHouse Cloud production readiness guide'
description: 'Guide for organizations transitioning from quick start to enterprise-ready ClickHouse Cloud deployments'
keywords: ['production readiness', 'enterprise', 'saml', 'sso', 'terraform', 'monitoring', 'backup', 'disaster recovery']
doc_type: 'guide'
---

# ClickHouse Cloud Production Readiness Guide {#production-readiness}

For organizations who have completed the quick start guide and have an active service with data flowing

:::note[TL;DR]
This guide helps you transition from quick start to enterprise-ready ClickHouse Cloud deployments. You'll learn how to:

- Establish separate dev/staging/production environments for safe testing
- Integrate SAML/SSO authentication with your identity provider
- Automate deployments with Terraform or the Cloud API
- Connect monitoring to your alerting infrastructure (Prometheus, PagerDuty)
- Validate backup procedures and document disaster recovery processes
:::

## Introduction {#introduction}

You have ClickHouse Cloud running successfully for your business workloads. Now you need to mature your deployment to meet enterprise production standards—whether triggered by a compliance audit, a production incident from an untested query, or IT requirements to integrate with corporate systems.

ClickHouse Cloud's managed platform handles infrastructure operations, automatic scaling, and system maintenance. Enterprise production readiness requires connecting ClickHouse Cloud to your broader IT environment through authentication systems, monitoring infrastructure, automation tools, and business continuity processes.

Your responsibilities for enterprise production readiness:
- Establish separate environments for safe testing before production deployment
- Integrate with existing identity providers and access management systems
- Connect monitoring and alerting to your operational infrastructure
- Implement infrastructure-as-code practices for consistent management
- Establish backup validation and disaster recovery procedures
- Configure cost management and billing integration

This guide walks you through each area, helping you transition from a working ClickHouse Cloud deployment to an enterprise-ready system.

## Environment strategy {#environment-strategy}

Establish separate environments to safely test changes before impacting production workloads. Most production incidents trace back to untested queries or configuration changes deployed directly to production systems.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's obvious already, but I think we should specify that a separate "environment" will be a separate Cloud service.


**Environment structure**: Maintain production (live workloads), staging (production-equivalent validation), and development (individual/team experimentation) environments.

**Testing**: Test queries in staging before production deployment. Queries that work on small datasets often cause memory exhaustion, excessive CPU usage, or slow execution at production scale. Validate configuration changes including user permissions, quotas, and service settings in staging—configuration errors discovered in production create immediate operational incidents.

**Sizing**: Size your staging service to approximate production load characteristics. Testing on significantly smaller infrastructure may not reveal resource contention or scaling issues. Use production-representative datasets through periodic data refreshes or synthetic data generation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we cross link to the scaling docs here? The reader might not know how to size their staging service


## Enterprise authentication and user management {#enterprise-authentication}

Moving from console-based user management to enterprise authentication integration is essential for production readiness.

### SSO/SAML setup {#sso-saml-setup}

Enterprise tier ClickHouse Cloud supports SAML integration with identity providers including Okta, Azure Active Directory, and Google Workspace. SAML configuration requires coordination with ClickHouse support and involves providing your IdP metadata and configuring attribute mappings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross link to the docs on how to set this up


:::note Important limitation
Users authenticated through SAML are assigned the "Member" role by default and must be manually granted additional roles by an admin after their first login. Group-to-role mapping and automatic role assignment are not currently supported.
:::

### Access control design {#access-control-design}

ClickHouse Cloud uses organization-level roles (Admin, Developer, Billing, Member) and service/database-level roles (Service Admin, Read Only, SQL console roles). Design roles around job functions applying the principle of least privilege:

- **Application users**: Service accounts with specific database and table access
- **Analyst users**: Read-only access to curated datasets and reporting views
- **Admin users**: Full administrative capabilities

Configure quotas, limits, and settings profiles to manage resource usage for different users and roles. Set memory and execution time limits to prevent individual queries from impacting system performance. Monitor resource usage through audit, session, and query logs to identify users or applications that frequently hit limits. Conduct regular access reviews using ClickHouse Cloud's audit capabilities.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tells me what to do but not how to do it. Can we cross link to docs on how to:

  • configure quotas
  • configure limits
  • configure settings profiles
  • set memory and execution time limits
  • monitor resource usage


### User lifecycle management limitations {#user-lifecycle-management}

ClickHouse Cloud does not currently support SCIM or automated provisioning/deprovisioning via identity providers. Users must be manually removed from the ClickHouse Cloud console after being removed from your IdP. Plan for manual user management processes until these features become available.

Learn more about [Cloud Access Management](/cloud/security/cloud_access_management) and [SAML SSO setup](/cloud/security/saml-setup).

## Infrastructure as code and automation {#infrastructure-as-code}

Managing ClickHouse Cloud through infrastructure-as-code practices and API automation provides consistency, version control, and repeatability for your deployment configuration.

### Terraform Provider {#terraform-provider}

Configure the ClickHouse Terraform provider with API keys created in the ClickHouse Cloud console:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where in the ClickHouse Cloud console?


```terraform
terraform {
required_providers {
clickhouse = {
source = "ClickHouse/clickhouse"
version = "~> 2.0"
}
}
}

provider "clickhouse" {
environment = "production"
organization_id = var.organization_id
token_key = var.token_key
token_secret = var.token_secret
}
```

The Terraform provider supports service provisioning, IP access lists, and user management. Note that the provider does not currently support importing existing services or explicit backup configuration. For features not covered by the provider, manage them through the console or contact ClickHouse support.

For comprehensive examples including service configuration and network access controls, see [Terraform example on how to use Cloud API](/knowledgebase/terraform_example).

### Cloud API integration {#cloud-api-integration}

Organizations with existing automation frameworks can integrate ClickHouse Cloud management directly through the Cloud API. The API provides programmatic access to service lifecycle management, user administration, backup operations, and monitoring data retrieval.

Common API integration patterns:
- Custom provisioning workflows integrated with internal ticketing systems
- Automated scaling adjustments based on application deployment schedules
- Programmatic backup validation and reporting for compliance workflows
- Integration with existing infrastructure management platforms

API authentication uses the same token-based approach as Terraform. For complete API reference and integration examples, see [ClickHouse Cloud API](/cloud/manage/api/api-overview) documentation.

## Monitoring and operational integration {#monitoring-integration}

Connecting ClickHouse Cloud to your existing monitoring infrastructure ensures visibility and proactive issue detection.

### Built-in monitoring {#built-in-monitoring}

ClickHouse Cloud provides an advanced dashboard with real-time metrics including queries per second, memory usage, CPU usage, and storage rates. Access via Cloud console under Monitoring → Advanced dashboard. Create custom dashboards tailored to specific workload patterns or team resource consumption.

:::note Common production gaps
Lack of proactive alerting integration with enterprise incident management systems and automated cost monitoring. Built-in dashboards provide visibility but automated alerting requires external integration.
:::

### Production alerting setup {#production-alerting}

**Built-in Capabilities**: ClickHouse Cloud provides notifications for billing events, scaling events, and service health via email, UI, and Slack. Configure delivery channels and notification severities through the console notification settings.

**Enterprise Integration**: For advanced alerting (PagerDuty, custom webhooks), use the Prometheus endpoint to export metrics to your existing monitoring infrastructure:

```yaml
scrape_configs:
- job_name: "clickhouse"
static_configs:
- targets: ["https://api.clickhouse.cloud/v1/organizations/<org_id>/prometheus"]
basic_auth:
username: <KEY_ID>
password: <KEY_SECRET>
```

For comprehensive setup including detailed Prometheus/Grafana configuration and advanced alerting, see the [ClickHouse Cloud Observability Guide](/use-cases/observability/cloud-monitoring#prometheus).

## Business continuity and support integration {#business-continuity}

Establishing backup validation procedures and support integration ensures your ClickHouse Cloud deployment can recover from incidents and access help when needed.

### Backup strategy assessment {#backup-strategy}

ClickHouse Cloud provides automatic backups with configurable retention periods. Assess your current backup configuration against compliance and recovery requirements. Enterprise customers with specific compliance requirements around backup location or encryption can configure ClickHouse Cloud to store backups in their own cloud storage buckets (BYOB). Contact ClickHouse support for BYOB configuration.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where can I contact ClickHouse support?


### Validate and test recovery procedures {#validate-test-recovery}

Most organizations discover backup gaps during actual recovery scenarios. Establish regular validation cycles to verify backup integrity and test recovery procedures before incidents occur. Schedule periodic test restorations to non-production environments, document step-by-step recovery procedures including time estimates, verify restored data completeness and application functionality, and test recovery procedures with different failure scenarios (service deletion, data corruption, regional outages). Maintain updated recovery runbooks accessible to on-call teams.

Test backup restoration at least quarterly for critical production services. Organizations with strict compliance requirements may need monthly or even weekly validation cycles.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


### Disaster recovery planning {#disaster-recovery-planning}

Document your recovery time objectives (RTO) and recovery point objectives (RPO) to validate that your current backup configuration meets business requirements. Establish regular testing schedules for backup restoration and maintain updated recovery documentation.

### Production support integration {#production-support}

Understand your current support tier's SLA expectations and escalation procedures. Create internal runbooks defining when to engage ClickHouse support and integrate these procedures with your existing incident management processes.

Learn more about [ClickHouse Cloud backup and recovery](/cloud/manage/backups/overview) and [support services](/about-us/support).

## Next steps {#next-steps}

After implementing the integrations and procedures in this guide, focus on optimization and operational maturity. Review query patterns and optimize table structures for performance. Implement network isolation, audit logging, and compliance controls required for your industry. Establish runbooks for common scenarios, implement automated testing for schema changes, and build internal knowledge bases for your team.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where can I learn more about how to implement network isolation, audit logging, and compliance controls required for my industry?


- Visit the cloud [resource tour](/cloud/get-started/cloud/resource-tour) for guides on [monitoring](/cloud/get-started/cloud/resource-tour#monitoring), [security](/cloud/get-started/cloud/resource-tour#security), and [cost optimization](/cloud/get-started/cloud/resource-tour#cost-optimization)

When current service tier limitations impact your production operations, consider upgrade paths for enhanced capabilities such as private networking, customer-managed encryption keys, or multi-region disaster recovery options.
6 changes: 5 additions & 1 deletion scripts/aspell-ignore/en/aspell-dict.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
personal_ws-1.1 en 3818
personal_ws-1.1 en 3822
AArch
ACLs
AICPA
Expand Down Expand Up @@ -3831,3 +3831,7 @@ SpanName
lucene
TrackedLink
eventName
deprovisioning
runbooks
severities
webhooks