@@ -15,6 +15,28 @@ tags: [mom, models, routing, announcement]
 
 vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources—"What's the weather?" shouldn't cost as much as "Analyze this legal contract."
 
+## MoM System Card
+
+A quick overview of all MoM models:
+
+| Category | Model | Size | Base Model | Latency | Purpose |
+|----------|-------|------|------------|---------|---------|
+| **🧠 Intelligent Routing** | mom-brain-flash | Flash | ModernBERT | <10ms | Ultra-fast intent classification |
+| | mom-brain-pro | Pro | Qwen 0.6B | ~30-50ms | Balanced routing with reasoning |
+| | mom-brain-max | Max | Qwen 1.7B | ~50-100ms | Maximum accuracy for complex decisions |
+| **🔍 Similarity Search** | mom-similarity-flash | Flash | ModernBERT | <10ms | Semantic similarity matching |
+| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | ModernBERT | <10ms | Jailbreak/attack detection |
+| | mom-pii-flash | Flash | ModernBERT | <10ms | PII detection & privacy protection |
+| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Qwen 0.6B | ~30-50ms | Mathematics routing |
+| | mom-expert-math-pro | Pro | Qwen 1.7B | ~50-100ms | Advanced math with reasoning |
+
+**Key Insights:**
+
+- **4 Categories** × **3 Size Variants** = Flexible routing architecture
+- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput scenarios
+- **Qwen** (decoder-only) → Explainable decisions with reasoning capabilities
+- **Flash** models achieve 10,000+ QPS on commodity hardware
 ## The Evolution: From Encoder-Only to Mixture-of-Models
 
 ### Where We Started: ModernBERT Foundation
@@ -59,64 +81,80 @@ This hybrid architecture lets you choose the right tool for each job: speed when
 
 ## The MoM Model Family
 
-### 🔒 Encoders — Speed & Safety
+We organize MoM models into **four categories** with **three size variants** (Flash, Pro, Max):
+
+### 🧠 Intelligent Routing
+
+Smart routing models with three size variants:
+
+| Model | Size | Base Model | Purpose |
+|-------|------|------------|---------|
+| **mom-brain-flash** | Flash | ModernBERT | Ultra-fast intent classification (sub-10ms latency) |
+| **mom-brain-pro** | Pro | Qwen 0.6B | Balanced performance with reasoning capabilities |
+| **mom-brain-max** | Max | Qwen 1.7B | Maximum accuracy for complex routing decisions |
+
+**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen 0.6B and 1.7B (decoder-only) models.
 
-Fast, high-throughput models for classification and security checks:
+### 🔍 Similarity Search
 
-| Model | Purpose |
-|-------|---------|
-| **mom-enc-class-intent-v1** | Intent/topic classification (sub-10ms latency) |
-| **mom-enc-guard-pii-v1** | PII detection (privacy protection) |
-| **mom-enc-guard-jailbreak-v1** | Jailbreak/attack detection (security) |
+Semantic similarity and vector search:
 
-### 🧠 Decoders — Explainability
+| Model | Size | Base Model | Purpose |
+|-------|------|------------|---------|
+| **mom-similarity-flash** | Flash | ModernBERT | Fast semantic similarity matching for route selection |
 
-When you need to understand *why* a routing decision was made:
+**Architecture**: Based on ModernBERT (encoder-only) for high-speed embedding generation.
 
-| Model | Purpose |
-|-------|---------|
-| **mom-dec-class-intent-v1** | Intent classification with reasoning |
-| **mom-dec-class-intent-r1** | Higher-capacity variant for complex cases |
+### 🔒 Prompt Guardian
 
-### 🎯 Domain Agents — Specialized Expertise
+Security and safety checks before routing:
 
-Expert models for domain-specific routing:
+| Model | Size | Base Model | Purpose |
+|-------|------|------------|---------|
+| **mom-jailbreak-flash** | Flash | ModernBERT | Jailbreak/attack detection (security) |
+| **mom-pii-flash** | Flash | ModernBERT | PII detection (privacy protection) |
 
-| Model | Domain |
-|-------|--------|
-| **mom-dec-agent-sci-v1** | Science (physics, chemistry, biology) |
-| **mom-dec-agent-math-v1** | Mathematics (algebra, calculus, statistics) |
-| **mom-dec-agent-hum-v1** | Humanities (literature, philosophy, history) |
-| **mom-dec-agent-soc-v1** | Social sciences (psychology, economics) |
-| **mom-dec-agent-law-v1** | Legal (contracts, compliance) |
-| **mom-dec-agent-gen-v1** | Generalist fallback |
+**Architecture**: Both based on ModernBERT (encoder-only) for ultra-fast security checks.
+
+### 🎯 SLM Experts
+
+Specialized small language models for domain-specific routing:
+
+| Model | Size | Base Model | Domain |
+|-------|------|------------|--------|
+| **mom-expert-math-flash** | Flash | Qwen 0.6B | Mathematics (algebra, calculus, statistics) |
+| **mom-expert-math-pro** | Pro | Qwen 1.7B | Advanced mathematics with reasoning |
+
+**Architecture**: Based on Qwen models (decoder-only) for domain-specific reasoning capabilities.
 
 ## Design Principles
 
-**Safety-First**: Guardrail models (PII, jailbreak detection) run before routing—security at the edge.
+**Safety-First**: Prompt Guardian models (PII, jailbreak detection) run before routing—security at the edge.
 
-**Speed ↔ Explainability**: Choose encoders for sub-10ms latency or decoders for transparent reasoning. Different endpoints, different SLAs.
+**Speed ↔ Capability**: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.
 
-**Domain Expertise**: Specialized agents achieve 15-25% better accuracy on domain-specific tasks vs. generalist routing. Math queries go to math experts, legal queries to legal experts.
+**Domain Expertise**: SLM Expert models achieve 15-25% better accuracy on domain-specific tasks vs. generalist routing. Math queries go to math experts.
 
 ## How vLLM-SR Uses MoM
 
 vLLM-SR's routing pipeline leverages MoM models at multiple stages:
 
-1. **Security Check** → `mom-enc-guard-*` models filter malicious/sensitive requests
-2. **Intent Classification** → `mom-enc-class-intent-v1` or `mom-dec-class-intent-v1` determines query type
-3. **Domain Routing** → `mom-dec-agent-*` models route specialized queries to optimal downstream models
-4. **Cost Optimization** → Simple queries → lightweight models; complex queries → premium models
+1. **Security Check** → `mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
+2. **Intent Classification** → `mom-brain-*` models (flash/pro/max) determine query type and routing decisions
+3. **Similarity Search** → `mom-similarity-flash` finds semantically similar routes
+4. **Domain Routing** → `mom-expert-*` models route specialized queries to optimal downstream models
+5. **Cost Optimization** → Simple queries → lightweight models; complex queries → premium models
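The five stages above can be sketched as a short pipeline. Everything here is a hypothetical stand-in: the `guard`, `classify_intent`, and backend names are illustrative stubs, not the real MoM model calls that vLLM-SR wires up internally.

```python
# Illustrative sketch of the five-stage pipeline described above.
# The stub functions stand in for real MoM model inference; only
# the control flow mirrors the post's description.

def guard(query: str) -> bool:
    # Stage 1: mom-jailbreak-flash / mom-pii-flash would run here.
    banned = ("ignore previous instructions",)
    return not any(phrase in query.lower() for phrase in banned)

def classify_intent(query: str) -> str:
    # Stage 2: a mom-brain-* model would classify the query; this
    # toy rule just flags anything containing digits as "math".
    return "math" if any(ch.isdigit() for ch in query) else "general"

def route(query: str) -> str:
    if not guard(query):                 # 1. Security check
        return "rejected"
    intent = classify_intent(query)      # 2. Intent classification
    # 3./4. Similarity search + domain routing (stubbed):
    if intent == "math":
        return "math-expert-backend"
    # 5. Cost optimization: short/simple queries go to a cheap model.
    return "lightweight-backend" if len(query) < 40 else "premium-backend"

print(route("What is 17 * 23?"))     # math-expert-backend
print(route("What's the weather?"))  # lightweight-backend
```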
 
 This achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
 
 ## Performance
 
 Early benchmarks:
 
--  **Encoders**: sub-10ms p99 latency, 10,000+ QPS
--  **Decoders**: ~50-100ms latency with explainable outputs
--  **Domain Agents**: 15-25% accuracy improvement over generalist routing
+-  **Flash Models** (BERT-based): sub-10ms p99 latency, 10,000+ QPS
+-  **Pro Models** (Qwen 0.6B): ~30-50ms latency with reasoning capabilities
+-  **Max Models** (Qwen 1.7B): ~50-100ms latency with maximum accuracy
+-  **SLM Experts**: 15-25% accuracy improvement over generalist routing
 
 ## What's Next: Exploring Frontier Techniques
 
@@ -156,16 +194,32 @@ Enable routers to:
 
 **The vision**: vLLM-SR routers that not only classify but *reason*, *learn*, and *adapt*.
 
-## Model Naming
+## Model Naming Convention
 
 ```text
-mom-{type}-{function}-{domain}-{version}
+mom-{category}-{size}
+mom-expert-{domain}-{size}
 ```
 
--  **type**: `enc` (encoder) / `dec` (decoder)
--  **function**: `class` (classification) / `guard` (safety) / `agent` (domain expert)
--  **domain**: `intent`, `pii`, `jailbreak`, `sci`, `math`, etc.
--  **version**: `v1` (baseline) / `r1` (higher-capacity)
+### Four Categories
+
+1. **Intelligent Routing**: `mom-brain-{flash|pro|max}`
+2. **Similarity Search**: `mom-similarity-{flash}`
+3. **Prompt Guardian**: `mom-{jailbreak|pii}-{flash}`
+4. **SLM Experts**: `mom-expert-{domain}-{flash|pro}`
+
+### Three Size Variants
+
+-  **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen 0.6B (for experts) — fastest, sub-10ms latency
+-  **pro**: Qwen 0.6B (for brain) or Qwen 1.7B (for experts) — balanced performance with reasoning
+-  **max**: Qwen 1.7B (for brain) — maximum accuracy and capabilities
+
+### Architecture Summary
+
+-  **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen 0.6B/1.7B)
+-  **Similarity Search**: Flash (ModernBERT)
+-  **Prompt Guardian**: Flash (ModernBERT)
+-  **SLM Experts**: Flash/Pro (Qwen 0.6B/1.7B)
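The naming scheme is regular enough to parse mechanically. A small sketch of such a parser (illustrative only, nothing like this ships with MoM; `parse_model_name` is a hypothetical helper):

```python
import re

# Illustrative parser for the naming convention above:
#   mom-{category}-{size}        e.g. mom-brain-flash
#   mom-expert-{domain}-{size}   e.g. mom-expert-math-pro
NAME_RE = re.compile(
    r"^mom-(?:expert-(?P<domain>[a-z]+)|(?P<category>[a-z]+))"
    r"-(?P<size>flash|pro|max)$"
)

def parse_model_name(name: str) -> dict:
    """Split a MoM model name into category, optional domain, and size."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"not a MoM model name: {name}")
    if m.group("domain"):
        return {"category": "expert",
                "domain": m.group("domain"),
                "size": m.group("size")}
    return {"category": m.group("category"), "size": m.group("size")}

print(parse_model_name("mom-brain-max"))
# {'category': 'brain', 'size': 'max'}
print(parse_model_name("mom-expert-math-pro"))
# {'category': 'expert', 'domain': 'math', 'size': 'pro'}
```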
 
 ## Get Started
 