Skip to content

Commit ff87161

Browse files
committed
more
Signed-off-by: bitliu <bitliu@tencent.com>
1 parent 9046dab commit ff87161

File tree

1 file changed

+93
-39
lines changed

1 file changed

+93
-39
lines changed

website/blog/2025-10-16-mom-family.md

Lines changed: 93 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,28 @@ tags: [mom, models, routing, announcement]
1515

1616
vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources—"What's the weather?" shouldn't cost as much as "Analyze this legal contract."
1717

18+
## MoM System Card
19+
20+
A quick overview of all MoM models:
21+
22+
| Category | Model | Size | Base Model | Latency | Purpose |
23+
|----------|-------|------|------------|---------|---------|
24+
| **🧠 Intelligent Routing** | mom-brain-flash | Flash | ModernBERT | <10ms | Ultra-fast intent classification |
25+
| | mom-brain-pro | Pro | Qwen 0.6B | ~30-50ms | Balanced routing with reasoning |
26+
| | mom-brain-max | Max | Qwen 1.7B | ~50-100ms | Maximum accuracy for complex decisions |
27+
| **🔍 Similarity Search** | mom-similarity-flash | Flash | ModernBERT | <10ms | Semantic similarity matching |
28+
| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | ModernBERT | <10ms | Jailbreak/attack detection |
29+
| | mom-pii-flash | Flash | ModernBERT | <10ms | PII detection & privacy protection |
30+
| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Qwen 0.6B | ~30-50ms | Mathematics routing |
31+
| | mom-expert-math-pro | Pro | Qwen 1.7B | ~50-100ms | Advanced math with reasoning |
32+
33+
**Key Insights:**
34+
35+
- **4 Categories** × **3 Size Variants** = Flexible routing architecture
36+
- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput scenarios
37+
- **Qwen** (decoder-only) → Explainable decisions with reasoning capabilities
38+
- **Flash** models achieve 10,000+ QPS on commodity hardware
39+
1840
## The Evolution: From Encoder-Only to Mixture-of-Models
1941

2042
### Where We Started: ModernBERT Foundation
@@ -59,64 +81,80 @@ This hybrid architecture lets you choose the right tool for each job: speed when
5981

6082
## The MoM Model Family
6183

62-
### 🔒 Encoders — Speed & Safety
84+
We organize MoM models into **four categories** with **three size variants** (Flash, Pro, Max):
85+
86+
### 🧠 Intelligent Routing
87+
88+
Smart routing models with three size variants:
89+
90+
| Model | Size | Base Model | Purpose |
91+
|-------|------|------------|---------|
92+
| **mom-brain-flash** | Flash | ModernBERT | Ultra-fast intent classification (sub-10ms latency) |
93+
| **mom-brain-pro** | Pro | Qwen 0.6B | Balanced performance with reasoning capabilities |
94+
| **mom-brain-max** | Max | Qwen 1.7B | Maximum accuracy for complex routing decisions |
95+
96+
**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen 0.6B and 1.7B (decoder-only) models.
6397

64-
Fast, high-throughput models for classification and security checks:
98+
### 🔍 Similarity Search
6599

66-
| Model | Purpose |
67-
|-------|---------|
68-
| **mom-enc-class-intent-v1** | Intent/topic classification (sub-10ms latency) |
69-
| **mom-enc-guard-pii-v1** | PII detection (privacy protection) |
70-
| **mom-enc-guard-jailbreak-v1** | Jailbreak/attack detection (security) |
100+
Semantic similarity and vector search:
71101

72-
### 🧠 Decoders — Explainability
102+
| Model | Size | Base Model | Purpose |
103+
|-------|------|------------|---------|
104+
| **mom-similarity-flash** | Flash | ModernBERT | Fast semantic similarity matching for route selection |
73105

74-
When you need to understand *why* a routing decision was made:
106+
**Architecture**: Based on ModernBERT (encoder-only) for high-speed embedding generation.
75107

76-
| Model | Purpose |
77-
|-------|---------|
78-
| **mom-dec-class-intent-v1** | Intent classification with reasoning |
79-
| **mom-dec-class-intent-r1** | Higher-capacity variant for complex cases |
108+
### 🔒 Prompt Guardian
80109

81-
### 🎯 Domain Agents — Specialized Expertise
110+
Security and safety checks before routing:
82111

83-
Expert models for domain-specific routing:
112+
| Model | Size | Base Model | Purpose |
113+
|-------|------|------------|---------|
114+
| **mom-jailbreak-flash** | Flash | ModernBERT | Jailbreak/attack detection (security) |
115+
| **mom-pii-flash** | Flash | ModernBERT | PII detection (privacy protection) |
84116

85-
| Model | Domain |
86-
|-------|--------|
87-
| **mom-dec-agent-sci-v1** | Science (physics, chemistry, biology) |
88-
| **mom-dec-agent-math-v1** | Mathematics (algebra, calculus, statistics) |
89-
| **mom-dec-agent-hum-v1** | Humanities (literature, philosophy, history) |
90-
| **mom-dec-agent-soc-v1** | Social sciences (psychology, economics) |
91-
| **mom-dec-agent-law-v1** | Legal (contracts, compliance) |
92-
| **mom-dec-agent-gen-v1** | Generalist fallback |
117+
**Architecture**: Both based on ModernBERT (encoder-only) for ultra-fast security checks.
118+
119+
### 🎯 SLM Experts
120+
121+
Specialized small language models for domain-specific routing:
122+
123+
| Model | Size | Base Model | Domain |
124+
|-------|------|------------|--------|
125+
| **mom-expert-math-flash** | Flash | Qwen 0.6B | Mathematics (algebra, calculus, statistics) |
126+
| **mom-expert-math-pro** | Pro | Qwen 1.7B | Advanced mathematics with reasoning |
127+
128+
**Architecture**: Based on Qwen models (decoder-only) for domain-specific reasoning capabilities.
93129

94130
## Design Principles
95131

96-
**Safety-First**: Guardrail models (PII, jailbreak detection) run before routing—security at the edge.
132+
**Safety-First**: Prompt Guardian models (PII, jailbreak detection) run before routing—security at the edge.
97133

98-
**Speed ↔ Explainability**: Choose encoders for sub-10ms latency or decoders for transparent reasoning. Different endpoints, different SLAs.
134+
**Speed ↔ Capability**: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.
99135

100-
**Domain Expertise**: Specialized agents achieve 15-25% better accuracy on domain-specific tasks vs. generalist routing. Math queries go to math experts, legal queries to legal experts.
136+
**Domain Expertise**: SLM Expert models achieve 15-25% better accuracy on domain-specific tasks vs. generalist routing. Math queries go to math experts.
101137

102138
## How vLLM-SR Uses MoM
103139

104140
vLLM-SR's routing pipeline leverages MoM models at multiple stages:
105141

106-
1. **Security Check**`mom-enc-guard-*` models filter malicious/sensitive requests
107-
2. **Intent Classification**`mom-enc-class-intent-v1` or `mom-dec-class-intent-v1` determines query type
108-
3. **Domain Routing**`mom-dec-agent-*` models route specialized queries to optimal downstream models
109-
4. **Cost Optimization** → Simple queries → lightweight models; complex queries → premium models
142+
1. **Security Check**`mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
143+
2. **Intent Classification**`mom-brain-*` models (flash/pro/max) determine query type and routing decisions
144+
3. **Similarity Search**`mom-similarity-flash` finds semantically similar routes
145+
4. **Domain Routing**`mom-expert-*` models route specialized queries to optimal downstream models
146+
5. **Cost Optimization** → Simple queries → lightweight models; complex queries → premium models
110147

111148
This achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
112149

113150
## Performance
114151

115152
Early benchmarks:
116153

117-
- **Encoders**: sub-10ms p99 latency, 10,000+ QPS
118-
- **Decoders**: ~50-100ms latency with explainable outputs
119-
- **Domain Agents**: 15-25% accuracy improvement over generalist routing
154+
- **Flash Models** (BERT-based): sub-10ms p99 latency, 10,000+ QPS
155+
- **Pro Models** (Qwen 0.6B): ~30-50ms latency with reasoning capabilities
156+
- **Max Models** (Qwen 1.7B): ~50-100ms latency with maximum accuracy
157+
- **SLM Experts**: 15-25% accuracy improvement over generalist routing
120158

121159
## What's Next: Exploring Frontier Techniques
122160

@@ -156,16 +194,32 @@ Enable routers to:
156194

157195
**The vision**: vLLM-SR routers that not only classify but *reason*, *learn*, and *adapt*.
158196

159-
## Model Naming
197+
## Model Naming Convention
160198

161199
```text
162-
mom-{type}-{function}-{domain}-{version}
200+
mom-{category}-{size}
201+
mom-expert-{domain}-{size}
163202
```
164203

165-
- **type**: `enc` (encoder) / `dec` (decoder)
166-
- **function**: `class` (classification) / `guard` (safety) / `agent` (domain expert)
167-
- **domain**: `intent`, `pii`, `jailbreak`, `sci`, `math`, etc.
168-
- **version**: `v1` (baseline) / `r1` (higher-capacity)
204+
### Four Categories
205+
206+
1. **Intelligent Routing**: `mom-brain-{flash|pro|max}`
207+
2. **Similarity Search**: `mom-similarity-{flash}`
208+
3. **Prompt Guardian**: `mom-{jailbreak|pii}-{flash}`
209+
4. **SLM Experts**: `mom-expert-{domain}-{flash|pro}`
210+
211+
### Three Size Variants
212+
213+
- **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen 0.6B (for experts) — fastest, sub-10ms latency
214+
- **pro**: Qwen 0.6B (for brain) or Qwen 1.7B (for experts) — balanced performance with reasoning
215+
- **max**: Qwen 1.7B (for brain) — maximum accuracy and capabilities
216+
217+
### Architecture Summary
218+
219+
- **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen 0.6B/1.7B)
220+
- **Similarity Search**: Flash (ModernBERT)
221+
- **Prompt Guardian**: Flash (ModernBERT)
222+
- **SLM Experts**: Flash/Pro (Qwen 0.6B/1.7B)
169223

170224
## Get Started
171225

0 commit comments

Comments
 (0)