Commit 0a65199

Remove unsupported model from docs, benchmark results, examples (#43)

1 parent 38a656d · commit 0a65199

23 files changed (+40 −66 lines)
(Four image files removed: −9.25 KB, −46.4 KB, −89.2 KB, −80.3 KB)
docs/evals.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -27,7 +27,7 @@ guardrails-evals \
   --config-path guardrails_config.json \
   --dataset-path data.jsonl \
   --mode benchmark \
-  --models gpt-5 gpt-5-mini gpt-5-nano
+  --models gpt-5 gpt-5-mini
 ```
 
 Test with included demo files in our [github repository](https://github.com/openai/openai-guardrails-python/tree/main/src/guardrails/evals/eval_demo)
````

docs/ref/checks/hallucination_detection.md

Lines changed: 2 additions & 16 deletions

```diff
@@ -175,10 +175,8 @@ The statements cover various types of factual claims including:
 |--------------|---------|-------------|-------------|-------------|
 | gpt-5 | 0.854 | 0.732 | 0.686 | 0.670 |
 | gpt-5-mini | 0.934 | 0.813 | 0.813 | 0.770 |
-| gpt-5-nano | 0.566 | 0.540 | 0.540 | 0.533 |
 | gpt-4.1 | 0.870 | 0.785 | 0.785 | 0.785 |
 | gpt-4.1-mini (default) | 0.876 | 0.806 | 0.789 | 0.789 |
-| gpt-4.1-nano | 0.537 | 0.526 | 0.526 | 0.526 |
 
 **Notes:**
 - ROC AUC: Area under the ROC curve (higher is better)
@@ -192,10 +190,8 @@ The following table shows latency measurements for each model using the hallucin
 |--------------|--------------|--------------|
 | gpt-5 | 34,135 | 525,854 |
 | gpt-5-mini | 23,013 | 59,316 |
-| gpt-5-nano | 17,079 | 26,317 |
 | gpt-4.1 | 7,126 | 33,464 |
 | gpt-4.1-mini (default) | 7,069 | 43,174 |
-| gpt-4.1-nano | 4,809 | 6,869 |
 
 - **TTC P50**: Median time to completion (50% of requests complete within this time)
 - **TTC P95**: 95th percentile time to completion (95% of requests complete within this time)
@@ -217,10 +213,8 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 |--------------|---------------------|----------------------|---------------------|---------------------------|
 | gpt-5 | 28,762 / 396,472 | 34,135 / 525,854 | 37,104 / 75,684 | 40,909 / 645,025 |
 | gpt-5-mini | 19,240 / 39,526 | 23,013 / 59,316 | 24,217 / 65,904 | 37,314 / 118,564 |
-| gpt-5-nano | 13,436 / 22,032 | 17,079 / 26,317 | 17,843 / 35,639 | 21,724 / 37,062 |
 | gpt-4.1 | 7,437 / 15,721 | 7,126 / 33,464 | 6,993 / 30,315 | 6,688 / 127,481 |
 | gpt-4.1-mini (default) | 6,661 / 14,827 | 7,069 / 43,174 | 7,032 / 46,354 | 7,374 / 37,769 |
-| gpt-4.1-nano | 4,296 / 6,378 | 4,809 / 6,869 | 4,171 / 6,609 | 4,650 / 6,201 |
 
 - **Vector store size impact varies by model**: GPT-4.1 series shows minimal latency impact across vector store sizes, while GPT-5 series shows significant increases.
 
@@ -240,10 +234,6 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 | | Medium (3 MB) | 0.934 | 0.813 | 0.813 | 0.770 |
 | | Large (11 MB) | 0.919 | 0.817 | 0.817 | 0.817 |
 | | Extra Large (105 MB) | 0.909 | 0.793 | 0.793 | 0.711 |
-| **gpt-5-nano** | Small (1 MB) | 0.590 | 0.547 | 0.545 | 0.536 |
-| | Medium (3 MB) | 0.566 | 0.540 | 0.540 | 0.533 |
-| | Large (11 MB) | 0.564 | 0.534 | 0.532 | 0.507 |
-| | Extra Large (105 MB) | 0.603 | 0.570 | 0.558 | 0.550 |
 | **gpt-4.1** | Small (1 MB) | 0.907 | 0.839 | 0.839 | 0.839 |
 | | Medium (3 MB) | 0.870 | 0.785 | 0.785 | 0.785 |
 | | Large (11 MB) | 0.846 | 0.753 | 0.753 | 0.753 |
@@ -252,15 +242,11 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 | | Medium (3 MB) | 0.876 | 0.806 | 0.789 | 0.789 |
 | | Large (11 MB) | 0.862 | 0.791 | 0.757 | 0.757 |
 | | Extra Large (105 MB) | 0.802 | 0.722 | 0.722 | 0.722 |
-| **gpt-4.1-nano** | Small (1 MB) | 0.605 | 0.528 | 0.528 | 0.528 |
-| | Medium (3 MB) | 0.537 | 0.526 | 0.526 | 0.526 |
-| | Large (11 MB) | 0.618 | 0.531 | 0.531 | 0.531 |
-| | Extra Large (105 MB) | 0.636 | 0.528 | 0.528 | 0.528 |
 
 **Key Insights:**
 
 - **Best Performance**: gpt-5-mini consistently achieves the highest ROC AUC scores across all vector store sizes (0.909-0.939)
-- **Best Latency**: gpt-4.1-nano shows the most consistent and lowest latency across all scales (4,171-4,809ms P50) but shows poor performance
+- **Best Latency**: gpt-4.1-mini (default) provides the lowest median latencies while maintaining strong accuracy
 - **Most Stable**: gpt-4.1-mini (default) maintains relatively stable performance across vector store sizes with good accuracy-latency balance
 - **Scale Sensitivity**: gpt-5 shows the most variability in performance across vector store sizes, with performance dropping significantly at larger scales
 - **Performance vs Scale**: Most models show decreasing performance as vector store size increases, with gpt-5-mini being the most resilient
@@ -270,4 +256,4 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 - **Signal-to-noise ratio degradation**: Larger vector stores contain more irrelevant documents that may not be relevant to the specific factual claims being validated
 - **Semantic search limitations**: File search retrieves semantically similar documents, but with a large diverse knowledge source, these may not always be factually relevant
 - **Document quality matters more than quantity**: The relevance and accuracy of documents is more important than the total number of documents
-- **Performance plateaus**: Beyond a certain size (11 MB), the performance impact becomes less severe
+- **Performance plateaus**: Beyond a certain size (11 MB), the performance impact becomes less severe
```
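The benchmark tables in this file report ROC AUC and precision-at-recall thresholds. As a minimal sketch of how such metrics are computed from per-example labels and scores (not the benchmark harness itself; the labels and scores below are made up):

```python
import numpy as np

# Illustrative ground-truth labels (1 = hallucinated claim) and guardrail
# confidence scores. These values are made up, not benchmark data.
labels = np.array([1, 1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.75, 0.7, 0.6, 0.4, 0.3])

def roc_auc(labels, scores):
    """ROC AUC via the rank formulation: the probability that a random
    positive example scores higher than a random negative one."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def precision_at_recall(labels, scores, target):
    """Best precision over all score thresholds reaching recall >= target."""
    order = np.argsort(-scores)          # sort examples by descending score
    tp = np.cumsum(labels[order] == 1)   # true positives at each cutoff
    precision = tp / np.arange(1, len(scores) + 1)
    recall = tp / (labels == 1).sum()
    return precision[recall >= target].max()

print(round(roc_auc(labels, scores), 3))                    # 0.833
print(round(precision_at_recall(labels, scores, 0.80), 3))  # 0.8
```

Higher is better for both: an AUC of 0.5 is chance-level ranking, and Prec@R=0.80 is the precision you can get while still catching 80% of positives.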

docs/ref/checks/jailbreak.md

Lines changed: 4 additions & 8 deletions

```diff
@@ -93,23 +93,19 @@ This benchmark evaluates model performance on a diverse set of prompts:
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.979 | 0.973 | 0.970 | 0.970 | 0.733 |
-| gpt-5-mini | 0.954 | 0.990 | 0.900 | 0.900 | 0.768 |
-| gpt-5-nano | 0.962 | 0.973 | 0.967 | 0.965 | 0.048 |
-| gpt-4.1 | 0.990 | 1.000 | 1.000 | 0.984 | 0.946 |
-| gpt-4.1-mini (default) | 0.982 | 0.992 | 0.992 | 0.954 | 0.444 |
-| gpt-4.1-nano | 0.934 | 0.924 | 0.924 | 0.848 | 0.000 |
+| gpt-5 | 0.982 | 0.984 | 0.977 | 0.977 | 0.743 |
+| gpt-5-mini | 0.980 | 0.980 | 0.976 | 0.975 | 0.734 |
+| gpt-4.1 | 0.979 | 0.975 | 0.975 | 0.975 | 0.661 |
+| gpt-4.1-mini (default) | 0.979 | 0.974 | 0.972 | 0.972 | 0.654 |
 
 #### Latency Performance
 
 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |--------------|--------------|--------------|
 | gpt-5 | 4,569 | 7,256 |
 | gpt-5-mini | 5,019 | 9,212 |
-| gpt-5-nano | 4,702 | 6,739 |
 | gpt-4.1 | 841 | 1,861 |
 | gpt-4.1-mini | 749 | 1,291 |
-| gpt-4.1-nano | 683 | 890 |
 
 **Notes:**
 
```
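The Recall@FPR=0.01 column in this table reports the recall achievable while holding the false-positive rate at or below 1%. A minimal sketch of that computation (the labels and scores are made up, not benchmark data):

```python
import numpy as np

def recall_at_fpr(labels, scores, max_fpr=0.01):
    """Best recall over all score thresholds whose false-positive rate
    stays at or below max_fpr; 0.0 if no threshold qualifies."""
    order = np.argsort(-scores)    # descending score order
    y = labels[order]
    tp = np.cumsum(y == 1)         # true positives at each cutoff
    fp = np.cumsum(y == 0)         # false positives at each cutoff
    recall = tp / (labels == 1).sum()
    fpr = fp / (labels == 0).sum()
    ok = fpr <= max_fpr
    return recall[ok].max() if ok.any() else 0.0

# Made-up example: the three top-scored prompts are true jailbreaks,
# so recall 0.75 is reachable with zero false positives.
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])
scores = np.array([0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4])
print(recall_at_fpr(labels, scores))  # 0.75
```

This is the strictest column in the table: a model can have a high ROC AUC yet near-zero Recall@FPR=0.01 if its top-ranked false positives appear early.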

docs/ref/checks/nsfw.md

Lines changed: 4 additions & 6 deletions

```diff
@@ -82,12 +82,10 @@ This benchmark evaluates model performance on a balanced set of social media pos
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.9532 | 0.9195 | 0.9096 | 0.9068 | 0.0339 |
-| gpt-5-mini | 0.9629 | 0.9321 | 0.9168 | 0.9149 | 0.0998 |
-| gpt-5-nano | 0.9600 | 0.9297 | 0.9216 | 0.9175 | 0.1078 |
-| gpt-4.1 | 0.9603 | 0.9312 | 0.9249 | 0.9192 | 0.0439 |
-| gpt-4.1-mini (default) | 0.9520 | 0.9180 | 0.9130 | 0.9049 | 0.0459 |
-| gpt-4.1-nano | 0.9502 | 0.9262 | 0.9094 | 0.9043 | 0.0379 |
+| gpt-5 | 0.953 | 0.919 | 0.910 | 0.907 | 0.034 |
+| gpt-5-mini | 0.963 | 0.932 | 0.917 | 0.915 | 0.100 |
+| gpt-4.1 | 0.960 | 0.931 | 0.925 | 0.919 | 0.044 |
+| gpt-4.1-mini (default) | 0.952 | 0.918 | 0.913 | 0.905 | 0.046 |
 
 **Notes:**
 
```

docs/ref/checks/prompt_injection_detection.md

Lines changed: 4 additions & 8 deletions

```diff
@@ -111,12 +111,10 @@ This benchmark evaluates model performance on agent conversation traces:
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |---------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.9931 | 0.9992 | 0.9992 | 0.9992 | 0.5845 |
-| gpt-5-mini | 0.9536 | 0.9951 | 0.9951 | 0.9951 | 0.0000 |
-| gpt-5-nano | 0.9283 | 0.9913 | 0.9913 | 0.9717 | 0.0350 |
-| gpt-4.1 | 0.9794 | 0.9973 | 0.9973 | 0.9973 | 0.0000 |
-| gpt-4.1-mini (default) | 0.9865 | 0.9986 | 0.9986 | 0.9986 | 0.0000 |
-| gpt-4.1-nano | 0.9142 | 0.9948 | 0.9948 | 0.9387 | 0.0000 |
+| gpt-5 | 0.993 | 0.999 | 0.999 | 0.999 | 0.584 |
+| gpt-5-mini | 0.954 | 0.995 | 0.995 | 0.995 | 0.000 |
+| gpt-4.1 | 0.979 | 0.997 | 0.997 | 0.997 | 0.000 |
+| gpt-4.1-mini (default) | 0.987 | 0.999 | 0.999 | 0.999 | 0.000 |
 
 **Notes:**
 
@@ -128,12 +126,10 @@ This benchmark evaluates model performance on agent conversation traces:
 
 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |---------------|--------------|--------------|
-| gpt-4.1-nano | 1,159 | 2,534 |
 | gpt-4.1-mini (default) | 1,481 | 2,563 |
 | gpt-4.1 | 1,742 | 2,296 |
 | gpt-5 | 3,994 | 6,654 |
 | gpt-5-mini | 5,895 | 9,031 |
-| gpt-5-nano | 5,911 | 10,134 |
 
 - **TTC P50**: Median time to completion (50% of requests complete within this time)
 - **TTC P95**: 95th percentile time to completion (95% of requests complete within this time)
```
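The TTC P50 and P95 figures in these latency tables are order statistics over per-request completion times. A minimal sketch of deriving them with NumPy (the timings below are made up, not benchmark measurements):

```python
import numpy as np

# Made-up per-request completion times in milliseconds.
ttc_ms = np.array([700, 750, 760, 800, 900, 1100, 1300, 2400, 2500, 2600])

p50 = np.percentile(ttc_ms, 50)  # median: half of requests finish by here
p95 = np.percentile(ttc_ms, 95)  # 95% of requests finish by here
print(p50, p95)  # 1000.0 2555.0
```

Reporting P95 alongside P50 surfaces tail latency: as in the tables above, two models with similar medians can differ sharply in worst-case behavior.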

examples/basic/agents_sdk.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -33,7 +33,7 @@
 {
     "name": "Custom Prompt Check",
     "config": {
-        "model": "gpt-4.1-nano-2025-04-14",
+        "model": "gpt-4.1-mini-2025-04-14",
         "confidence_threshold": 0.7,
         "system_prompt_details": "Check if the text contains any math problems.",
     },
```
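For context, the fragment changed above is one check entry inside the example's guardrails configuration. As a sketch, the updated entry can be written as a standalone Python dict and serialized to JSON; the field names come from the diff, while the enclosing pipeline structure in examples/basic/agents_sdk.py is assumed, not shown here:

```python
import json

# The "Custom Prompt Check" entry as updated by this commit.
custom_prompt_check = {
    "name": "Custom Prompt Check",
    "config": {
        "model": "gpt-4.1-mini-2025-04-14",
        "confidence_threshold": 0.7,
        "system_prompt_details": "Check if the text contains any math problems.",
    },
}

# Round-trips cleanly through JSON, so it can live in a config file too.
print(json.dumps(custom_prompt_check, indent=2))
```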
