
Commit 3168a2e

Authored by istio-testing, LiorLieberman, and craigbox
[release-1.26] add istio inference support blog (#16706)
* add istio inference support blog
* change description
* Update description
  Co-authored-by: Craig Box <craig.box@gmail.com>
* Update content/en/blog/2025/inference-extension-support/index.md
  Co-authored-by: Craig Box <craig.box@gmail.com>
---------
Co-authored-by: Lior Lieberman <liorlib7+riskified@gmail.com>
Co-authored-by: Craig Box <craig.box@gmail.com>
1 parent 24078b9 commit 3168a2e

File tree

3 files changed: +169 additions, 0 deletions

.spelling

Lines changed: 11 additions & 0 deletions
@@ -237,6 +237,7 @@ CFP
 Chandrasekhara
 channel
 Chaomeng
+chatbots
 Chavali
 CheckRequest
 CheckResponse
@@ -553,6 +554,7 @@ fuzzers
 gamified
 gapped
 gateway-api
+gateway-api-inference-extension
 GatewayClass
 Gather.town
 Gaudet
@@ -579,6 +581,7 @@ Gmail
 googleapis.com
 googlegroups.com
 GoTo
+GPUs
 Grafana
 grafana-istio-dashboard
 Graphviz
@@ -631,6 +634,8 @@ impactful
 incentivized
 Incrementality
 Indo-Pacific
+InferenceModel
+InferencePool
 initContainer
 initializer
 initializers
@@ -820,6 +825,8 @@ Loki
 Longmuir
 lookups
 loopback
+LoRA
+LoRA-aware
 Lua
 lucavall.in
 Lukonde
@@ -973,6 +980,7 @@ outsized
 overridden
 Ovidiu
 p50
+p90
 p99
 PaaS
 Padmanabhan
@@ -1152,6 +1160,7 @@ sharded
 Sharding
 sharding
 Shaughnessy
+Sheddable
 Shi
 Shilin
 Shivanshu
@@ -1264,6 +1273,7 @@ toolbelt
 toolchain
 topologySpreadConstraints
 touchpoints
+TPUs
 tradeoff
 tradeoffs
 Traefik
@@ -1369,6 +1379,7 @@ Virtualization
 virtualization
 VirtualService
 virtualservices-destrules
+vLLM
 VM
 vm-1
 VMs
content/en/blog/2025/inference-extension-support/index.md

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
---
title: "Bringing AI-Aware Traffic Management to Istio: Gateway API Inference Extension Support"
description: A smarter, dynamic way to optimize AI traffic routing based on real-time metrics and the unique characteristics of inference workloads.
publishdate: 2025-07-28
attribution: "Lior Lieberman (Google), Keith Mattix (Microsoft), Aslak Knutsen (Red Hat)"
keywords: [istio,AI,inference,gateway-api-inference-extension]
---

The world of AI inference on Kubernetes presents unique challenges that traditional traffic-routing architectures weren't designed to handle. While Istio has long excelled at managing microservice traffic with sophisticated load balancing, security, and observability features, the demands of Large Language Model (LLM) workloads require specialized functionality.

That's why we're excited to announce Istio's support for the Gateway API Inference Extension, bringing intelligent, model-aware and LoRA-aware routing to Istio.

## Why AI Workloads Need Special Treatment

Traditional web services typically handle quick, stateless requests measured in milliseconds. AI inference workloads operate in a completely different paradigm that challenges conventional load balancing approaches in several fundamental ways.

### The Scale and Duration Challenge

Unlike typical API responses that complete in milliseconds, AI inference requests often take significantly longer to process, sometimes several seconds or even minutes. This dramatic difference in processing time means that routing decisions have far more impact than in traditional web services. A single poorly-routed request can tie up expensive GPU resources for extended periods, creating cascading effects across the entire system.

The payload characteristics are equally challenging. AI inference requests frequently involve substantially larger payloads, especially when dealing with Retrieval-Augmented Generation (RAG) systems, multi-turn conversations with extensive context, or multi-modal inputs including images, audio, or video. These large payloads require different buffering, streaming, and timeout strategies than traditional HTTP APIs.

### Resource Consumption Patterns

Perhaps most critically, a single inference request can consume an entire GPU's resources during processing. This is fundamentally different from traditional request serving, where multiple requests can be processed concurrently on the same compute resources. When a GPU is fully engaged with one request, additional requests must queue, making each scheduling and routing decision far more impactful than it is for standard API workloads.

This resource exclusivity means that simple round-robin or least-connection algorithms can create severe imbalances. Sending requests to a server that's already processing a complex inference task doesn't just add latency; it can cause resource contention that impacts performance for all queued requests.

### Stateful Considerations and Memory Management

AI models often maintain in-memory caches that significantly impact performance. KV caches store intermediate attention calculations for previously processed tokens, serving as the primary consumer of GPU memory during generation and often becoming the most common bottleneck. When KV cache utilization approaches its limits, performance degrades dramatically, making cache-aware routing essential.

Additionally, many modern AI deployments use fine-tuned adapters like [LoRA](https://arxiv.org/abs/2106.09685) (Low-Rank Adaptation) to customize model behavior for specific users, organizations, or use cases. These adapters consume GPU memory and loading time when switched. A model server that already has the required LoRA adapter loaded can process requests immediately, while servers without the adapter face expensive loading overhead that can take seconds to complete.

### Queue Dynamics and Criticality

AI inference workloads also introduce the concept of request criticality, which is less common in traditional services. Real-time interactive applications (like chatbots or live content generation) require low latency and should be prioritized, while batch processing jobs or experimental workloads can tolerate higher latency or even be dropped during system overload.

Traditional load balancers lack the context to make these criticality-based decisions. They can't distinguish between a time-sensitive customer support query and a background batch job, leading to suboptimal resource allocation during peak demand periods.

This is where inference-aware routing becomes critical. Instead of treating all backends as equivalent black boxes, we need routing decisions that understand the current state and capabilities of each model server, including their queue depth, memory utilization, loaded adapters, and ability to handle requests of different criticality levels.

## Gateway API Inference Extension: A Kubernetes-Native Solution

The [Kubernetes Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io) has introduced solutions to these challenges, building on the proven foundation of the Kubernetes Gateway API while adding AI-specific intelligence. Rather than requiring organizations to patch together custom solutions or abandon their existing Kubernetes infrastructure, the extension provides a standardized, vendor-neutral approach to intelligent AI traffic management.

The extension introduces two key Custom Resource Definitions that work together to address the routing challenges we've outlined. The **InferenceModel** resource provides an abstraction for AI inference workload owners to define logical model endpoints, while the **InferencePool** resource gives platform operators the tools to manage backend infrastructure with AI workload awareness.

By extending the familiar Gateway API model rather than creating an entirely new paradigm, the inference extension enables organizations to leverage their existing Kubernetes expertise while gaining the specialized capabilities that AI workloads demand. This approach ensures that teams can adopt intelligent inference routing aligned with familiar networking knowledge and tooling.

Note: InferenceModel is likely to change in future Gateway API Inference Extension releases.

### InferenceModel

The InferenceModel resource allows inference workload owners to define logical model endpoints that abstract the complexities of backend deployment.

{{< text yaml >}}
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: customer-support-bot
  namespace: ai-workloads
spec:
  modelName: customer-support
  criticality: Critical
  poolRef:
    name: llama-pool
  targetModels:
  - name: llama-3-8b-customer-v1
    weight: 80
  - name: llama-3-8b-customer-v2
    weight: 20
{{< /text >}}

This configuration exposes a customer-support model that intelligently routes between two backend variants, enabling safe rollouts of new model versions while maintaining service availability.
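
Lower-priority workloads can share the same backend pool while declaring a different criticality. As a minimal sketch (the resource and model names here are hypothetical, not part of the example above), a batch workload could be marked Sheddable so it is the first traffic to be shed under overload:

{{< text yaml >}}
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: batch-summarizer        # hypothetical name, for illustration only
  namespace: ai-workloads
spec:
  modelName: batch-summarizer
  criticality: Sheddable        # may be dropped during overload, unlike Critical traffic
  poolRef:
    name: llama-pool            # shares the same backend pool as customer-support
{{< /text >}}
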
### InferencePool

The InferencePool acts as a specialized backend service that understands AI workload characteristics:

{{< text yaml >}}
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool
  namespace: ai-workloads
spec:
  targetPortNumber: 8000
  selector:
    app: llama-server
    version: v1
  extensionRef:
    name: llama-endpoint-picker
{{< /text >}}

When integrated with Istio, this pool automatically discovers model servers through Istio’s service discovery.
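
The `extensionRef` above names the Endpoint Picker (EPP) that makes per-request decisions for the pool. As a minimal, hedged sketch of what that reference resolves to, assuming the EPP is deployed in the same namespace and serves its ext-proc gRPC endpoint on port 9002 (the label and port below are illustrative; use the deployment manifests shipped with the extension in practice):

{{< text yaml >}}
apiVersion: v1
kind: Service
metadata:
  name: llama-endpoint-picker    # must match the extensionRef in the InferencePool
  namespace: ai-workloads
spec:
  selector:
    app: llama-endpoint-picker   # illustrative label for the EPP Deployment's pods
  ports:
  - name: grpc-ext-proc
    port: 9002                   # assumed ext-proc gRPC port; check the EPP deployment you use
    targetPort: 9002
{{< /text >}}
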
## How Inference Routing Works in Istio

Istio's implementation builds on the service mesh's proven traffic management foundation. When a request enters the mesh through a Kubernetes Gateway, it follows the standard Gateway API HTTPRoute matching rules. However, instead of using traditional load balancing algorithms, the backend is picked by an Endpoint Picker (EPP) service.
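
The wiring is a plain HTTPRoute whose backend is the InferencePool rather than a Service. A minimal sketch, assuming a Gateway named `inference-gateway` in the same namespace (the Gateway name and the catch-all match are assumptions for illustration; the guides linked later in this post show a complete setup):

{{< text yaml >}}
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-route
  namespace: ai-workloads
spec:
  parentRefs:
  - name: inference-gateway        # assumed Gateway name
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool          # route to the pool instead of a regular Service
      name: llama-pool
{{< /text >}}
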
The EPP evaluates multiple factors to select the optimal backend:

* **Request Criticality Assessment**: Critical requests receive priority routing to available servers, while lower criticality requests (Standard or Sheddable) may be load-shed during high utilization periods.

* **Resource Utilization Analysis**: The extension monitors GPU memory usage, particularly KV cache utilization, to avoid overwhelming servers that are approaching capacity limits.

* **Adapter Affinity**: For models using LoRA adapters, requests are preferentially routed to servers that already have the required adapter loaded, eliminating expensive loading overhead.

* **Prefix-Cache Aware Load Balancing**: Routing decisions consider distributed KV cache states across model servers, and prioritize model servers that already have the prefix in their cache.

* **Queue Depth Optimization**: By tracking request queue lengths across backends, the system avoids creating hotspots that would increase overall latency.

This intelligent routing operates transparently within Istio's existing architecture, maintaining compatibility with features like mutual TLS, access policies, and distributed tracing.
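
For example, a namespace-scoped mTLS policy like the following (a sketch; the namespace simply matches the earlier examples) applies to model servers selected by an InferencePool exactly as it does to any other workload in the mesh:

{{< text yaml >}}
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: ai-workloads
spec:
  mtls:
    mode: STRICT    # inference traffic is mutually authenticated like any other mesh traffic
{{< /text >}}
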
### Inference Routing Request Flow

{{< image width="100%"
    link="./inference-request-flow.svg"
    alt="Flow of an inference request with gateway-api-inference-extension routing."
    >}}

## The Road Ahead

The future roadmap includes Istio-related features such as:

* **Support for Waypoints**: As Istio continues to evolve toward ambient mesh architecture, inference-aware routing will be integrated into waypoint proxies to provide centralized, scalable policy enforcement for AI workloads.

Beyond Istio-specific innovations, the Gateway API Inference Extension community is also actively developing several advanced capabilities that will further enhance routing for AI inference workloads on Kubernetes:

* **HPA Integration for AI Metrics**: Horizontal Pod Autoscaling based on model-specific metrics rather than just CPU and memory.

* **Multi-Modal Input Support**: Optimized routing for large multi-modal inputs and outputs (images, audio, video) with intelligent buffering and streaming capabilities.

* **Heterogeneous Accelerator Support**: Intelligent routing across different accelerator types (GPUs, TPUs, specialized AI chips) with latency and cost-aware load balancing.

## Getting Started with Istio Inference Extension

Ready to try inference-aware routing? The implementation is officially available starting with Istio 1.27!

For installation and usage guides, please follow the Istio-specific guidance on the [Gateway API Inference Extension website](https://gateway-api-inference-extension.sigs.k8s.io/guides/#__tabbed_3_2).
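
To give a feel for the shape of the setup, the Gateway referenced by the HTTPRoute sketch earlier could be as simple as the following, using Istio's GatewayClass (listener details are illustrative; follow the linked guide for a complete, supported configuration):

{{< text yaml >}}
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway          # matches the parentRef used in the HTTPRoute sketch
  namespace: ai-workloads
spec:
  gatewayClassName: istio          # Istio's built-in GatewayClass
  listeners:
  - name: http
    port: 80
    protocol: HTTP
{{< /text >}}
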
142+
143+
## Performance Impact and Benefits
144+
145+
Early evaluations show significant performance improvements with inference-aware routing, including substantially lower p90 latency at higher query rates and reduced end-to-end tail latencies compared to traditional load balancing.
146+
147+
For detailed benchmark results and methodology, see the [Gateway API Inference Extension performance evaluation](https://kubernetes.io/blog/2025/06/05/introducing-gateway-api-inference-extension/#benchmarks) with testing data using H100 GPUs and vLLM deployments.
148+
149+
The integration with Istio's existing infrastructure means these benefits come with minimal operational overhead, and your existing monitoring, security, and traffic management configurations continue to work unchanged.
150+
151+
## Conclusion
152+
153+
The Gateway API Inference Extension represents a significant step forward in making Kubernetes truly AI-ready, and Istio's implementation brings this intelligence to the service mesh layer where it can have maximum impact. By combining inference-aware routing with Istio's proven security, observability, and traffic management capabilities, we're enabling organizations to run AI workloads with the same operational excellence they expect from their traditional services.
154+
155+
---
156+
157+
*Have a question or want to get involved? [Join the Kubernetes Slack](https://slack.kubernetes.io/) and then find us on the [#gateway-api-inference-extension](https://kubernetes.slack.com/archives/C08E3RZMT2P) channel or [discuss on the Istio Slack](https://slack.istio.io).*

content/en/blog/2025/inference-extension-support/inference-request-flow.svg

Lines changed: 1 addition & 0 deletions
