98 changes: 94 additions & 4 deletions docs/book/how-to/deployment/deployment.md
@@ -148,6 +148,84 @@ curl -X POST http://localhost:8000/invoke \
-d '{"parameters": {"city": "London", "temperature": 20}}'
```

### Session-aware Invocations

Deployments support session-aware execution so that multiple `/invoke` calls can share context, which is critical for:
- LLM chat and tool-using agents
- Multi-step decision flows
- User-specific short-term state

#### Session Basics

- Each request may include a `session_id`.
- If no `session_id` is provided, the deployment can generate one and echo it back in `metadata.session_id`.
- Reusing the same `session_id` lets your steps read/write session-scoped state across invocations.
- Session state is stored separately from the LLM context window: it's your authoritative server-side memory.

**Key separation**: LLMs have a model context window (tokens in a single prompt). Session state is your durable, structured memory you choose to feed into prompts. Keeping this state compact and curated is standard best practice across frameworks like LangChain/LangGraph.

#### Configuring Sessions

Sessions are controlled via the `deployment_settings.sessions` block:

```yaml
deployment_settings:
  sessions:
    enabled: true          # enable/disable session support
    ttl_seconds: 3600      # inactivity timeout before a session expires
    max_state_bytes: 65536 # soft limit for serialized session state
```

**Fields:**

- `enabled`
  - `true` (the default; recommended for LLM/agent-style deployments)
  - `false` for fully stateless APIs
- `ttl_seconds`
  - Per-session inactivity timeout. After this, the session expires and its state can be garbage-collected.
  - Choose based on your use case (e.g. 15-30 min for chats, hours for workflows)
- `max_state_bytes`
  - A guardrail on how large the serialized `session_state` can get
  - Prevents unbounded growth, DoS-style misuse, and huge payloads when saving/reloading state

#### Using Session State in Steps

Within a deployed pipeline, steps access session state via `get_step_context()`:

```python
from zenml import step, get_step_context


@step
def chat_step(message: str) -> str:
    ctx = get_step_context()
    session_state = ctx.session_state  # session dict; persisted if sessions enabled

    history = session_state.setdefault("history", [])
    history.append({"role": "user", "content": message})

    reply = f"Echo {len(history)}: {message}"
    history.append({"role": "assistant", "content": reply})

    session_state["last_reply"] = reply
    return reply
```

If sessions are disabled, `session_state` behaves as an empty, non-persisted dict so the same code is safe to run.

**Example flow:**

```bash
# 1st call – server may generate a session_id
zenml deployment invoke my_deployment --city="London"

# 2nd call – reusing the session
zenml deployment invoke my_deployment \
  --city="Berlin" \
  --session-id="session-12345"
```

The [weather agent example](https://github.com/zenml-io/zenml/tree/develop/examples/weather_agent) shows how a real pipeline uses `session_state` to keep a running history of weather analyses across turns.

## Deployment Lifecycle

Once a Deployment is created, it is tied to the specific **Deployer** stack component that was used to provision it and can be managed independently of the active stack as a standalone entity with its own lifecycle.
@@ -415,7 +493,7 @@ curl -X POST http://localhost:8000/invoke \

## Deployment Initialization, Cleanup and State

It often happens that the HTTP requests made to the same deployment share some type of initialization or cleanup or need to share the same global state. For example:

* a machine learning model needs to be loaded in memory, initialized and then shared between all the HTTP requests made to the deployment in order to be used by the deployed pipeline to make predictions

@@ -461,6 +539,17 @@ The following happens when the pipeline is deployed and then later invoked:

This mechanism can be used to initialize and share global state between all the HTTP requests made to the deployment or to execute long-running initialization or cleanup operations when the deployment is started or stopped rather than on each HTTP request.
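The load-once-reuse-everywhere lifecycle described above can be sketched framework-agnostically. This toy class is only an illustration of the pattern (the method names mirror the `on_init`/`on_cleanup` hooks mentioned above); it is not ZenML's actual hook machinery:

```python
class ToyDeployment:
    """Illustrates init-once / cleanup-once around many request handlers."""

    def __init__(self) -> None:
        self.pipeline_state = None
        self.load_count = 0

    def on_init(self) -> None:
        # Expensive work (e.g. loading a model) happens exactly once at startup.
        self.load_count += 1
        self.pipeline_state = {"model": f"loaded-model-v{self.load_count}"}

    def handle_request(self, payload: str) -> str:
        # Every request reuses the shared state instead of reloading it.
        return f"{self.pipeline_state['model']} processed {payload}"

    def on_cleanup(self) -> None:
        # Release the shared resources once, at shutdown.
        self.pipeline_state = None


deployment = ToyDeployment()
deployment.on_init()
replies = [deployment.handle_request(p) for p in ("a", "b", "c")]
deployment.on_cleanup()
```

However many requests arrive between `on_init` and `on_cleanup`, the expensive initialization runs only once.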

{% hint style="info" %}
**Deployment State vs Session State**

ZenML deployments support two types of state:

- **`pipeline_state`**: Deployment-global state shared across all invocations (e.g., loaded models, DB clients, caches). Set via `on_init` hook, accessed via `get_step_context().pipeline_state`.
- **`session_state`**: Per-session state that persists across multiple invocations with the same `session_id` (e.g., conversation history, user context). Accessed via `get_step_context().session_state`.

Use `pipeline_state` for expensive resources you want to load once and reuse, and `session_state` for conversational or multi-turn workflows where each session needs its own memory.
{% endhint %}

## Deployment Configuration

The deployer settings cover the pipeline deployment process and the specific back-end infrastructure used to provision and manage the resources required to run the deployment servers. Independently of that, `DeploymentSettings` can be used to fully customize the deployment ASGI application itself, including:
@@ -510,13 +599,14 @@ For more detailed information on deployment options, see the [deployment setting
3. **Return Useful Data**: Design pipeline outputs to provide meaningful responses
4. **Use Type Annotations**: Leverage Pydantic models for complex parameter types
5. **Use Global Initialization and State**: Use the `on_init` and `on_cleanup` hooks along with the `pipeline_state` step context property to initialize and share global state between all the HTTP requests made to the deployment. Also use these hooks to execute long-running initialization or cleanup operations when the deployment is started or stopped rather than on each HTTP request.
6. **Keep Session State Small**: For session-aware deployments, store only compact summaries, IDs, and essential context in `session_state`. Move large artifacts (documents, embeddings, full histories) to external storage (vector stores, databases, object storage) and keep only references in session state. This matches best practices from frameworks like LangChain/LangGraph.
7. **Handle Errors Gracefully**: Implement proper error handling in your steps
8. **Test Locally First**: Validate your deployable pipeline locally before deploying to production

## Conclusion

Pipeline deployment transforms ZenML pipelines from batch processing workflows into real-time services. By following the guidelines for deployable pipelines and understanding the deployment lifecycle, you can create robust, scalable ML services that integrate seamlessly with web applications and real-time systems.

See also:
- [Steps & Pipelines](../steps-pipelines/steps_and_pipelines.md) - Core building blocks
- [Deployer Stack Component](../../component-guide/deployers/README.md) - The stack component that manages the deployment of pipelines as long-running HTTP servers
43 changes: 43 additions & 0 deletions docs/book/how-to/deployment/deployment_settings.md
@@ -127,6 +127,7 @@ Check out [this page](https://docs.zenml.io/concepts/steps_and_pipelines/configu
`DeploymentSettings` expose the following basic customization options. The sections below provide
short examples and guidance.

- sessions (session-aware invocations)
- application metadata and paths
- built-in endpoints and middleware toggles
- static files (SPAs) and dashboards
@@ -135,6 +136,48 @@ short examples and guidance.
- startup and shutdown hooks
- uvicorn server options, logging level, and thread pool size

### Sessions

Configure session support for deployments that need to maintain state across multiple invocations (e.g., LLM agents, chatbots, multi-turn workflows):

```python
from zenml.config import DeploymentSettings

settings = DeploymentSettings(
    sessions={
        "enabled": True,          # Enable session support
        "ttl_seconds": 1800,      # 30 minute session timeout
        "max_state_bytes": 32768, # 32 KB state size limit
    }
)
```

Or in YAML:

```yaml
settings:
  deployment:
    sessions:
      enabled: true
      ttl_seconds: 1800
      max_state_bytes: 32768
```

**Session Configuration Fields:**

- `enabled` (default: `True`): Enable or disable session support. When enabled, each invocation can include a `session_id` to maintain state across calls.
- `ttl_seconds` (default: `86400` / 24 hours): Inactivity timeout before a session expires. Choose based on your use case (e.g., 15-30 min for chats, hours for workflows).
- `max_state_bytes` (default: `65536` / 64 KB): Maximum size for serialized session state. This prevents unbounded growth and potential abuse.

**When to Use Sessions:**

- ✅ LLM chat and conversational agents
- ✅ Multi-step workflows requiring context
- ✅ User-specific short-term state
- ❌ Fully stateless REST APIs

For more details on using session state in your pipeline steps, see [Managing conversational session state](../steps-pipelines/advanced_features.md#managing-conversational-session-state).

### Application metadata

You can set `app_title`, `app_description`, and `app_version` to be reflected in the ASGI application's metadata:
46 changes: 46 additions & 0 deletions docs/book/how-to/steps-pipelines/advanced_features.md
@@ -674,6 +674,52 @@ def my_step(some_parameter: int = 1):
raise ValueError("My exception")
```

### Managing conversational session state

When a deployment is invoked with sessions enabled, each step can access a per-session dictionary through the step context. This is useful for LLM workflows, agents, or any pipeline that needs to remember information across `/invoke` calls.

#### Understanding Deployment State

ZenML deployments support two types of state:

- **`pipeline_state`**: Deployment-global state shared across all invocations (e.g., loaded models, DB clients, caches). Set via `on_init` hook, accessed via `get_step_context().pipeline_state`.
- **`session_state`**: Per-session state that persists across multiple invocations with the same `session_id` (e.g., conversation history, user context). Accessed via `get_step_context().session_state`.

This mirrors common LLM/agent designs: small short-term memory (session state) + external long-term memory (vector stores, databases).

#### Using Session State

```python
from zenml import step, get_step_context


@step
def agent_turn(message: str) -> str:
    ctx = get_step_context()
    session_state = ctx.session_state  # Live dict persisted after the run

    history = session_state.setdefault("history", [])
    history.append({"role": "user", "content": message})

    # Use external tools/vector DB for heavy context; keep session state light.
    # `plan_and_call_llm` is an illustrative placeholder for your own LLM call.
    reply = plan_and_call_llm(history=history[-10:], message=message)

    history.append({"role": "assistant", "content": reply})
    session_state["last_reply"] = reply
    return reply
```

#### Best Practices for Session State

- **Keep it compact**: Store summaries, pointers, IDs, and essential context only
- **Push large artifacts elsewhere**: Documents, embeddings, and full histories belong in databases, vector stores, or object storage
- **Use size guardrails**: The `deployment_settings.sessions.max_state_bytes` setting (default 64 KB) prevents unbounded growth
- **Configure TTL appropriately**: Set `ttl_seconds` based on your use case (e.g., 15-30 min for chats, hours for workflows)
- **Store references, not content**: Keep file paths, document IDs, and embedding keys in session state rather than the actual data

This approach matches best practices from frameworks like LangChain and LangGraph, where short-term working memory is kept small and structured.

If sessions are disabled for a deployment, `ctx.session_state` simply returns an empty dict, so the same code works without extra guards.
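The store-references-not-content guideline above can be sketched as follows — here `doc_store` is a plain dict standing in for whatever external storage you use (database, vector store, object storage), and all names are illustrative:

```python
def remember_document(session_state: dict, doc_store: dict, doc_id: str, content: str) -> None:
    """Keep the heavy payload outside the session; store only the reference."""
    doc_store[doc_id] = content  # external storage (DB, vector store, object store, ...)
    session_state.setdefault("doc_ids", []).append(doc_id)  # compact reference only


def recall_documents(session_state: dict, doc_store: dict) -> list[str]:
    """Resolve stored references back to full content on demand."""
    return [doc_store[d] for d in session_state.get("doc_ids", [])]


session_state: dict = {}
doc_store: dict = {}
remember_document(session_state, doc_store, "doc-1", "a very large document body ...")
assert session_state == {"doc_ids": ["doc-1"]}  # the session itself stays tiny
assert recall_documents(session_state, doc_store) == ["a very large document body ..."]
```

Whatever the size of the stored documents, the serialized session state grows only by the length of the IDs, which keeps it well under `max_state_bytes`.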

### Using Alerter in Hooks

You can use the [Alerter stack component](https://docs.zenml.io/component-guide/alerters) to send notifications when steps fail or succeed:
7 changes: 4 additions & 3 deletions examples/weather_agent/pipelines/weather_agent.py
@@ -19,7 +19,7 @@
    init_hook,
)
from starlette.middleware.gzip import GZipMiddleware
from steps import analyze_weather_with_llm, compare_city_trends, get_weather

from zenml import pipeline
from zenml.config import (
@@ -266,7 +266,7 @@ def on_shutdown(
)
def weather_agent(
    city: str = "London",
) -> tuple[Dict[str, float], str, str]:
    """Weather agent pipeline.

    Args:
Expand All @@ -277,4 +277,5 @@ def weather_agent(
"""
weather_data = get_weather(city=city)
result = analyze_weather_with_llm(weather_data=weather_data, city=city)
return weather_data, result
comparison = compare_city_trends(analysis=result)
return weather_data, result, comparison
2 changes: 2 additions & 0 deletions examples/weather_agent/run.py
@@ -14,4 +14,6 @@
# run = client.get_pipeline_run(run.id)
if run:
    result = run.steps["analyze_weather_with_llm"].output.load()
    comparison = run.steps["compare_city_trends"].output.load()
    print(result)
    print("\n" + comparison)
4 changes: 3 additions & 1 deletion examples/weather_agent/steps/__init__.py
@@ -4,9 +4,11 @@
pipeline.
"""

from .comparison import compare_city_trends
from .weather_agent import analyze_weather_with_llm, get_weather

__all__ = [
    "analyze_weather_with_llm",
    "compare_city_trends",
    "get_weather",
]
35 changes: 35 additions & 0 deletions examples/weather_agent/steps/comparison.py
@@ -0,0 +1,35 @@
"""Comparison steps for the weather agent pipeline."""

from typing import Annotated

from zenml import step
from zenml.steps import get_step_context


@step
def compare_city_trends(analysis: str) -> Annotated[str, "city_comparison"]:
"""Return how the current city compares to the previous turn.

Args:
analysis: The analysis of the current city.

Returns:
A string comparing the current city to the previous city.
"""
session_state = get_step_context().session_state
history = session_state.get("history", [])
if len(history) < 2:
return "Not enough history to compare cities yet."

current = history[-1]
previous = history[-2]
delta_temp = current["temperature"] - previous["temperature"]
delta_humidity = current["humidity"] - previous["humidity"]
delta_wind = current["wind_speed"] - previous["wind_speed"]

return (
f"Comparing {current['city']} to {previous['city']}:\n"
f"• Temperature change: {delta_temp:+.1f}°C\n"
f"• Humidity change: {delta_humidity:+.0f}%\n"
f"• Wind speed change: {delta_wind:+.1f} km/h"
)
15 changes: 13 additions & 2 deletions examples/weather_agent/steps/weather_agent.py
@@ -40,6 +40,17 @@ def analyze_weather_with_llm(
    wind = weather_data["wind_speed"]

    step_context = get_step_context()
    session_state = step_context.session_state
    history = session_state.setdefault("history", [])
    history.append(
        {
            "city": city,
            "temperature": round(temp, 2),
            "humidity": humidity,
            "wind_speed": round(wind, 2),
        }
    )
    session_state["turn_count"] = len(history)
    pipeline_state = step_context.pipeline_state

    client = None
@@ -83,7 +94,7 @@ def analyze_weather_with_llm(

    llm_analysis = response.choices[0].message.content

return f"""🤖 LLM Weather Analysis for {city}:
return f"""🤖 LLM Weather Analysis for {city} (turn {len(history)}):

{llm_analysis}

@@ -138,7 +149,7 @@ def analyze_weather_with_llm(
    if wind > 20:
        warning += " Strong winds - secure loose items."

return f"""🤖 Weather Analysis for {city}:
return f"""🤖 Weather Analysis for {city} (turn {len(history)}):

Assessment: {temp_desc.title()} weather with {humidity}% humidity
Comfort Level: {comfort}/10