@@ -15,6 +15,28 @@ tags: [mom, models, routing, announcement]
 
 vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources—"What's the weather?" shouldn't cost as much as "Analyze this legal contract."
 
+## MoM System Card
+
+A quick overview of all MoM models:
+
+| Category | Model | Size | Base Model | Latency | Purpose |
+|----------|-------|------|------------|---------|---------|
+| **🧠 Intelligent Routing** | mom-brain-flash | Flash | ModernBERT | <10ms | Ultra-fast intent classification |
+| | mom-brain-pro | Pro | Qwen 0.6B | ~30-50ms | Balanced routing with reasoning |
+| | mom-brain-max | Max | Qwen 1.7B | ~50-100ms | Maximum accuracy for complex decisions |
+| **🔍 Similarity Search** | mom-similarity-flash | Flash | ModernBERT | <10ms | Semantic similarity matching |
+| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | ModernBERT | <10ms | Jailbreak/attack detection |
+| | mom-pii-flash | Flash | ModernBERT | <10ms | PII detection & privacy protection |
+| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Qwen 0.6B | ~30-50ms | Mathematics routing |
+| | mom-expert-math-pro | Pro | Qwen 1.7B | ~50-100ms | Advanced math with reasoning |
+
+**Key Insights:**
+
+- **4 Categories** × **3 Size Variants** = Flexible routing architecture
+- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput scenarios
+- **Qwen** (decoder-only) → Explainable decisions with reasoning capabilities
+- **Flash** models achieve 10,000+ QPS on commodity hardware
 ## The Evolution: From Encoder-Only to Mixture-of-Models
 
 ### Where We Started: ModernBERT Foundation
@@ -59,64 +81,80 @@ This hybrid architecture lets you choose the right tool for each job: speed when
 
 ## The MoM Model Family
 
-### 🔒 Encoders — Speed & Safety
+We organize MoM models into **four categories** with **three size variants** (Flash, Pro, Max):
+
+### 🧠 Intelligent Routing
+
+Smart routing models with three size variants:
+
+| Model | Size | Base Model | Purpose |
+|-------|------|------------|---------|
+| **mom-brain-flash** | Flash | ModernBERT | Ultra-fast intent classification (sub-10ms latency) |
+| **mom-brain-pro** | Pro | Qwen 0.6B | Balanced performance with reasoning capabilities |
+| **mom-brain-max** | Max | Qwen 1.7B | Maximum accuracy for complex routing decisions |
+
+**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen 0.6B and 1.7B (decoder-only) models.
 
-Fast, high-throughput models for classification and security checks:
+### 🔍 Similarity Search
 
-| Model | Purpose |
-|-------|---------|
-| **mom-enc-class-intent-v1** | Intent/topic classification (sub-10ms latency) |
-| **mom-enc-guard-pii-v1** | PII detection (privacy protection) |
-| **mom-enc-guard-jailbreak-v1** | Jailbreak/attack detection (security) |
+Semantic similarity and vector search:
 
-### 🧠 Decoders — Explainability
+| Model | Size | Base Model | Purpose |
+|-------|------|------------|---------|
+| **mom-similarity-flash** | Flash | ModernBERT | Fast semantic similarity matching for route selection |
 
-When you need to understand *why* a routing decision was made:
+**Architecture**: Based on ModernBERT (encoder-only) for high-speed embedding generation.
 
-| Model | Purpose |
-|-------|---------|
-| **mom-dec-class-intent-v1** | Intent classification with reasoning |
-| **mom-dec-class-intent-r1** | Higher-capacity variant for complex cases |
+### 🔒 Prompt Guardian
 
-### 🎯 Domain Agents — Specialized Expertise
+Security and safety checks before routing:
 
-Expert models for domain-specific routing:
+| Model | Size | Base Model | Purpose |
+|-------|------|------------|---------|
+| **mom-jailbreak-flash** | Flash | ModernBERT | Jailbreak/attack detection (security) |
+| **mom-pii-flash** | Flash | ModernBERT | PII detection (privacy protection) |
 
-| Model | Domain |
-|-------|--------|
-| **mom-dec-agent-sci-v1** | Science (physics, chemistry, biology) |
-| **mom-dec-agent-math-v1** | Mathematics (algebra, calculus, statistics) |
-| **mom-dec-agent-hum-v1** | Humanities (literature, philosophy, history) |
-| **mom-dec-agent-soc-v1** | Social sciences (psychology, economics) |
-| **mom-dec-agent-law-v1** | Legal (contracts, compliance) |
-| **mom-dec-agent-gen-v1** | Generalist fallback |
+**Architecture**: Both based on ModernBERT (encoder-only) for ultra-fast security checks.
+
+### 🎯 SLM Experts
+
+Specialized small language models for domain-specific routing:
+
+| Model | Size | Base Model | Domain |
+|-------|------|------------|--------|
+| **mom-expert-math-flash** | Flash | Qwen 0.6B | Mathematics (algebra, calculus, statistics) |
+| **mom-expert-math-pro** | Pro | Qwen 1.7B | Advanced mathematics with reasoning |
+
+**Architecture**: Based on Qwen models (decoder-only) for domain-specific reasoning capabilities.
 
 ## Design Principles
 
-**Safety-First**: Guardrail models (PII, jailbreak detection) run before routing—security at the edge.
+**Safety-First**: Prompt Guardian models (PII, jailbreak detection) run before routing—security at the edge.
 
-**Speed ↔ Explainability**: Choose encoders for sub-10ms latency or decoders for transparent reasoning. Different endpoints, different SLAs.
+**Speed ↔ Capability**: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.
 
-**Domain Expertise**: Specialized agents achieve 15-25% better accuracy on domain-specific tasks vs. generalist routing. Math queries go to math experts, legal queries to legal experts.
+**Domain Expertise**: SLM Expert models achieve 15-25% better accuracy on domain-specific tasks vs. generalist routing. Math queries go to math experts.
 
 ## How vLLM-SR Uses MoM
 
 vLLM-SR's routing pipeline leverages MoM models at multiple stages:
 
-1. **Security Check** → `mom-enc-guard-*` models filter malicious/sensitive requests
-2. **Intent Classification** → `mom-enc-class-intent-v1` or `mom-dec-class-intent-v1` determines query type
-3. **Domain Routing** → `mom-dec-agent-*` models route specialized queries to optimal downstream models
-4. **Cost Optimization** → Simple queries → lightweight models; complex queries → premium models
+1. **Security Check** → `mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
+2. **Intent Classification** → `mom-brain-*` models (flash/pro/max) determine query type and routing decisions
+3. **Similarity Search** → `mom-similarity-flash` finds semantically similar routes
+4. **Domain Routing** → `mom-expert-*` models route specialized queries to optimal downstream models
+5. **Cost Optimization** → Simple queries → lightweight models; complex queries → premium models
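The five stages above can be sketched as a short pipeline. Everything here is a hypothetical stand-in: the `guard`, `classify_intent`, and backend names are illustrative stubs, not the real MoM model calls that vLLM-SR wires up internally.

```python
# Illustrative sketch of the five-stage pipeline described above.
# The stub functions stand in for real MoM model inference; only
# the control flow mirrors the post's description.

def guard(query: str) -> bool:
    # Stage 1: mom-jailbreak-flash / mom-pii-flash would run here.
    banned = ("ignore previous instructions",)
    return not any(phrase in query.lower() for phrase in banned)

def classify_intent(query: str) -> str:
    # Stage 2: a mom-brain-* model would classify the query; this
    # toy rule just flags anything containing digits as "math".
    return "math" if any(ch.isdigit() for ch in query) else "general"

def route(query: str) -> str:
    if not guard(query):                 # 1. Security check
        return "rejected"
    intent = classify_intent(query)      # 2. Intent classification
    # 3./4. Similarity search + domain routing (stubbed):
    if intent == "math":
        return "math-expert-backend"
    # 5. Cost optimization: short/simple queries go to a cheap model.
    return "lightweight-backend" if len(query) < 40 else "premium-backend"

print(route("What is 17 * 23?"))     # math-expert-backend
print(route("What's the weather?"))  # lightweight-backend
```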
 
 This achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
 
 ## Performance
 
 Early benchmarks:
 
--  **Encoders**: sub-10ms p99 latency, 10,000+ QPS
--  **Decoders**: ~50-100ms latency with explainable outputs
--  **Domain Agents**: 15-25% accuracy improvement over generalist routing
+-  **Flash Models** (BERT-based): sub-10ms p99 latency, 10,000+ QPS
+-  **Pro Models** (Qwen 0.6B): ~30-50ms latency with reasoning capabilities
+-  **Max Models** (Qwen 1.7B): ~50-100ms latency with maximum accuracy
+-  **SLM Experts**: 15-25% accuracy improvement over generalist routing
 
 ## What's Next: Exploring Frontier Techniques
 
@@ -156,16 +194,32 @@ Enable routers to:
 
 **The vision**: vLLM-SR routers that not only classify but *reason*, *learn*, and *adapt*.
 
-## Model Naming
+## Model Naming Convention
 
 ```text
-mom-{type}-{function}-{domain}-{version}
+mom-{category}-{size}
+mom-expert-{domain}-{size}
 ```
 
--  **type**: `enc` (encoder) / `dec` (decoder)
--  **function**: `class` (classification) / `guard` (safety) / `agent` (domain expert)
--  **domain**: `intent`, `pii`, `jailbreak`, `sci`, `math`, etc.
--  **version**: `v1` (baseline) / `r1` (higher-capacity)
+### Four Categories
+
+1. **Intelligent Routing**: `mom-brain-{flash|pro|max}`
+2. **Similarity Search**: `mom-similarity-{flash}`
+3. **Prompt Guardian**: `mom-{jailbreak|pii}-{flash}`
+4. **SLM Experts**: `mom-expert-{domain}-{flash|pro}`
+
+### Three Size Variants
+
+-  **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen 0.6B (for experts) — fastest, sub-10ms latency
+-  **pro**: Qwen 0.6B (for brain) or Qwen 1.7B (for experts) — balanced performance with reasoning
+-  **max**: Qwen 1.7B (for brain) — maximum accuracy and capabilities
+
+### Architecture Summary
+
+-  **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen 0.6B/1.7B)
+-  **Similarity Search**: Flash (ModernBERT)
+-  **Prompt Guardian**: Flash (ModernBERT)
+-  **SLM Experts**: Flash/Pro (Qwen 0.6B/1.7B)
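The naming scheme is regular enough to parse mechanically. A small sketch of such a parser (illustrative only, nothing like this ships with MoM; `parse_model_name` is a hypothetical helper):

```python
import re

# Illustrative parser for the naming convention above:
#   mom-{category}-{size}        e.g. mom-brain-flash
#   mom-expert-{domain}-{size}   e.g. mom-expert-math-pro
NAME_RE = re.compile(
    r"^mom-(?:expert-(?P<domain>[a-z]+)|(?P<category>[a-z]+))"
    r"-(?P<size>flash|pro|max)$"
)

def parse_model_name(name: str) -> dict:
    """Split a MoM model name into category, optional domain, and size."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"not a MoM model name: {name}")
    if m.group("domain"):
        return {"category": "expert",
                "domain": m.group("domain"),
                "size": m.group("size")}
    return {"category": m.group("category"), "size": m.group("size")}

print(parse_model_name("mom-brain-max"))
# {'category': 'brain', 'size': 'max'}
print(parse_model_name("mom-expert-math-pro"))
# {'category': 'expert', 'domain': 'math', 'size': 'pro'}
```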
 
 ## Get Started
 