Commit 0a65199

Remove unsupported model from docs, benchmark results, examples (#43)

1 parent 38a656d · commit 0a65199

23 files changed (+40 −66 lines)
(Four image files removed: −9.25 KB, −46.4 KB, −89.2 KB, −80.3 KB)
docs/evals.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -27,7 +27,7 @@ guardrails-evals \
   --config-path guardrails_config.json \
   --dataset-path data.jsonl \
   --mode benchmark \
-  --models gpt-5 gpt-5-mini gpt-5-nano
+  --models gpt-5 gpt-5-mini
 ```
 
 Test with included demo files in our [github repository](https://github.com/openai/openai-guardrails-python/tree/main/src/guardrails/evals/eval_demo)
````

docs/ref/checks/hallucination_detection.md

Lines changed: 2 additions & 16 deletions

```diff
@@ -175,10 +175,8 @@ The statements cover various types of factual claims including:
 |--------------|---------|-------------|-------------|-------------|
 | gpt-5 | 0.854 | 0.732 | 0.686 | 0.670 |
 | gpt-5-mini | 0.934 | 0.813 | 0.813 | 0.770 |
-| gpt-5-nano | 0.566 | 0.540 | 0.540 | 0.533 |
 | gpt-4.1 | 0.870 | 0.785 | 0.785 | 0.785 |
 | gpt-4.1-mini (default) | 0.876 | 0.806 | 0.789 | 0.789 |
-| gpt-4.1-nano | 0.537 | 0.526 | 0.526 | 0.526 |
 
 **Notes:**
 - ROC AUC: Area under the ROC curve (higher is better)
@@ -192,10 +190,8 @@ The following table shows latency measurements for each model using the hallucin
 |--------------|--------------|--------------|
 | gpt-5 | 34,135 | 525,854 |
 | gpt-5-mini | 23,013 | 59,316 |
-| gpt-5-nano | 17,079 | 26,317 |
 | gpt-4.1 | 7,126 | 33,464 |
 | gpt-4.1-mini (default) | 7,069 | 43,174 |
-| gpt-4.1-nano | 4,809 | 6,869 |
 
 - **TTC P50**: Median time to completion (50% of requests complete within this time)
 - **TTC P95**: 95th percentile time to completion (95% of requests complete within this time)
@@ -217,10 +213,8 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 |--------------|---------------------|----------------------|---------------------|---------------------------|
 | gpt-5 | 28,762 / 396,472 | 34,135 / 525,854 | 37,104 / 75,684 | 40,909 / 645,025 |
 | gpt-5-mini | 19,240 / 39,526 | 23,013 / 59,316 | 24,217 / 65,904 | 37,314 / 118,564 |
-| gpt-5-nano | 13,436 / 22,032 | 17,079 / 26,317 | 17,843 / 35,639 | 21,724 / 37,062 |
 | gpt-4.1 | 7,437 / 15,721 | 7,126 / 33,464 | 6,993 / 30,315 | 6,688 / 127,481 |
 | gpt-4.1-mini (default) | 6,661 / 14,827 | 7,069 / 43,174 | 7,032 / 46,354 | 7,374 / 37,769 |
-| gpt-4.1-nano | 4,296 / 6,378 | 4,809 / 6,869 | 4,171 / 6,609 | 4,650 / 6,201 |
 
 - **Vector store size impact varies by model**: GPT-4.1 series shows minimal latency impact across vector store sizes, while GPT-5 series shows significant increases.
 
@@ -240,10 +234,6 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 | | Medium (3 MB) | 0.934 | 0.813 | 0.813 | 0.770 |
 | | Large (11 MB) | 0.919 | 0.817 | 0.817 | 0.817 |
 | | Extra Large (105 MB) | 0.909 | 0.793 | 0.793 | 0.711 |
-| **gpt-5-nano** | Small (1 MB) | 0.590 | 0.547 | 0.545 | 0.536 |
-| | Medium (3 MB) | 0.566 | 0.540 | 0.540 | 0.533 |
-| | Large (11 MB) | 0.564 | 0.534 | 0.532 | 0.507 |
-| | Extra Large (105 MB) | 0.603 | 0.570 | 0.558 | 0.550 |
 | **gpt-4.1** | Small (1 MB) | 0.907 | 0.839 | 0.839 | 0.839 |
 | | Medium (3 MB) | 0.870 | 0.785 | 0.785 | 0.785 |
 | | Large (11 MB) | 0.846 | 0.753 | 0.753 | 0.753 |
@@ -252,15 +242,11 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 | | Medium (3 MB) | 0.876 | 0.806 | 0.789 | 0.789 |
 | | Large (11 MB) | 0.862 | 0.791 | 0.757 | 0.757 |
 | | Extra Large (105 MB) | 0.802 | 0.722 | 0.722 | 0.722 |
-| **gpt-4.1-nano** | Small (1 MB) | 0.605 | 0.528 | 0.528 | 0.528 |
-| | Medium (3 MB) | 0.537 | 0.526 | 0.526 | 0.526 |
-| | Large (11 MB) | 0.618 | 0.531 | 0.531 | 0.531 |
-| | Extra Large (105 MB) | 0.636 | 0.528 | 0.528 | 0.528 |
 
 **Key Insights:**
 
 - **Best Performance**: gpt-5-mini consistently achieves the highest ROC AUC scores across all vector store sizes (0.909-0.939)
-- **Best Latency**: gpt-4.1-nano shows the most consistent and lowest latency across all scales (4,171-4,809ms P50) but shows poor performance
+- **Best Latency**: gpt-4.1-mini (default) provides the lowest median latencies while maintaining strong accuracy
 - **Most Stable**: gpt-4.1-mini (default) maintains relatively stable performance across vector store sizes with good accuracy-latency balance
 - **Scale Sensitivity**: gpt-5 shows the most variability in performance across vector store sizes, with performance dropping significantly at larger scales
 - **Performance vs Scale**: Most models show decreasing performance as vector store size increases, with gpt-5-mini being the most resilient
@@ -270,4 +256,4 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 - **Signal-to-noise ratio degradation**: Larger vector stores contain more irrelevant documents that may not be relevant to the specific factual claims being validated
 - **Semantic search limitations**: File search retrieves semantically similar documents, but with a large diverse knowledge source, these may not always be factually relevant
 - **Document quality matters more than quantity**: The relevance and accuracy of documents is more important than the total number of documents
-- **Performance plateaus**: Beyond a certain size (11 MB), the performance impact becomes less severe
+- **Performance plateaus**: Beyond a certain size (11 MB), the performance impact becomes less severe
```
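The benchmark tables in this file report ROC AUC and precision-at-recall thresholds. As a minimal sketch of how such metrics are computed from per-example labels and scores (not the benchmark harness itself; the labels and scores below are made up):

```python
import numpy as np

# Illustrative ground-truth labels (1 = hallucinated claim) and guardrail
# confidence scores. These values are made up, not benchmark data.
labels = np.array([1, 1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.75, 0.7, 0.6, 0.4, 0.3])

def roc_auc(labels, scores):
    """ROC AUC via the rank formulation: the probability that a random
    positive example scores higher than a random negative one."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def precision_at_recall(labels, scores, target):
    """Best precision over all score thresholds reaching recall >= target."""
    order = np.argsort(-scores)          # sort examples by descending score
    tp = np.cumsum(labels[order] == 1)   # true positives at each cutoff
    precision = tp / np.arange(1, len(scores) + 1)
    recall = tp / (labels == 1).sum()
    return precision[recall >= target].max()

print(round(roc_auc(labels, scores), 3))                    # 0.833
print(round(precision_at_recall(labels, scores, 0.80), 3))  # 0.8
```

Higher is better for both: an AUC of 0.5 is chance-level ranking, and Prec@R=0.80 is the precision you can get while still catching 80% of positives.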

docs/ref/checks/jailbreak.md

Lines changed: 4 additions & 8 deletions

```diff
@@ -93,23 +93,19 @@ This benchmark evaluates model performance on a diverse set of prompts:
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.979 | 0.973 | 0.970 | 0.970 | 0.733 |
-| gpt-5-mini | 0.954 | 0.990 | 0.900 | 0.900 | 0.768 |
-| gpt-5-nano | 0.962 | 0.973 | 0.967 | 0.965 | 0.048 |
-| gpt-4.1 | 0.990 | 1.000 | 1.000 | 0.984 | 0.946 |
-| gpt-4.1-mini (default) | 0.982 | 0.992 | 0.992 | 0.954 | 0.444 |
-| gpt-4.1-nano | 0.934 | 0.924 | 0.924 | 0.848 | 0.000 |
+| gpt-5 | 0.982 | 0.984 | 0.977 | 0.977 | 0.743 |
+| gpt-5-mini | 0.980 | 0.980 | 0.976 | 0.975 | 0.734 |
+| gpt-4.1 | 0.979 | 0.975 | 0.975 | 0.975 | 0.661 |
+| gpt-4.1-mini (default) | 0.979 | 0.974 | 0.972 | 0.972 | 0.654 |
 
 #### Latency Performance
 
 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |--------------|--------------|--------------|
 | gpt-5 | 4,569 | 7,256 |
 | gpt-5-mini | 5,019 | 9,212 |
-| gpt-5-nano | 4,702 | 6,739 |
 | gpt-4.1 | 841 | 1,861 |
 | gpt-4.1-mini | 749 | 1,291 |
-| gpt-4.1-nano | 683 | 890 |
 
 **Notes:**
 
```
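The Recall@FPR=0.01 column in this table reports the recall achievable while holding the false-positive rate at or below 1%. A minimal sketch of that computation (the labels and scores are made up, not benchmark data):

```python
import numpy as np

def recall_at_fpr(labels, scores, max_fpr=0.01):
    """Best recall over all score thresholds whose false-positive rate
    stays at or below max_fpr; 0.0 if no threshold qualifies."""
    order = np.argsort(-scores)    # descending score order
    y = labels[order]
    tp = np.cumsum(y == 1)         # true positives at each cutoff
    fp = np.cumsum(y == 0)         # false positives at each cutoff
    recall = tp / (labels == 1).sum()
    fpr = fp / (labels == 0).sum()
    ok = fpr <= max_fpr
    return recall[ok].max() if ok.any() else 0.0

# Made-up example: the three top-scored prompts are true jailbreaks,
# so recall 0.75 is reachable with zero false positives.
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])
scores = np.array([0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4])
print(recall_at_fpr(labels, scores))  # 0.75
```

This is the strictest column in the table: a model can have a high ROC AUC yet near-zero Recall@FPR=0.01 if its top-ranked false positives appear early.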

docs/ref/checks/nsfw.md

Lines changed: 4 additions & 6 deletions

```diff
@@ -82,12 +82,10 @@ This benchmark evaluates model performance on a balanced set of social media pos
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.9532 | 0.9195 | 0.9096 | 0.9068 | 0.0339 |
-| gpt-5-mini | 0.9629 | 0.9321 | 0.9168 | 0.9149 | 0.0998 |
-| gpt-5-nano | 0.9600 | 0.9297 | 0.9216 | 0.9175 | 0.1078 |
-| gpt-4.1 | 0.9603 | 0.9312 | 0.9249 | 0.9192 | 0.0439 |
-| gpt-4.1-mini (default) | 0.9520 | 0.9180 | 0.9130 | 0.9049 | 0.0459 |
-| gpt-4.1-nano | 0.9502 | 0.9262 | 0.9094 | 0.9043 | 0.0379 |
+| gpt-5 | 0.953 | 0.919 | 0.910 | 0.907 | 0.034 |
+| gpt-5-mini | 0.963 | 0.932 | 0.917 | 0.915 | 0.100 |
+| gpt-4.1 | 0.960 | 0.931 | 0.925 | 0.919 | 0.044 |
+| gpt-4.1-mini (default) | 0.952 | 0.918 | 0.913 | 0.905 | 0.046 |
 
 **Notes:**
 
```

docs/ref/checks/prompt_injection_detection.md

Lines changed: 4 additions & 8 deletions

```diff
@@ -111,12 +111,10 @@ This benchmark evaluates model performance on agent conversation traces:
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |---------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.9931 | 0.9992 | 0.9992 | 0.9992 | 0.5845 |
-| gpt-5-mini | 0.9536 | 0.9951 | 0.9951 | 0.9951 | 0.0000 |
-| gpt-5-nano | 0.9283 | 0.9913 | 0.9913 | 0.9717 | 0.0350 |
-| gpt-4.1 | 0.9794 | 0.9973 | 0.9973 | 0.9973 | 0.0000 |
-| gpt-4.1-mini (default) | 0.9865 | 0.9986 | 0.9986 | 0.9986 | 0.0000 |
-| gpt-4.1-nano | 0.9142 | 0.9948 | 0.9948 | 0.9387 | 0.0000 |
+| gpt-5 | 0.993 | 0.999 | 0.999 | 0.999 | 0.584 |
+| gpt-5-mini | 0.954 | 0.995 | 0.995 | 0.995 | 0.000 |
+| gpt-4.1 | 0.979 | 0.997 | 0.997 | 0.997 | 0.000 |
+| gpt-4.1-mini (default) | 0.987 | 0.999 | 0.999 | 0.999 | 0.000 |
 
 **Notes:**
 
@@ -128,12 +126,10 @@ This benchmark evaluates model performance on agent conversation traces:
 
 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |---------------|--------------|--------------|
-| gpt-4.1-nano | 1,159 | 2,534 |
 | gpt-4.1-mini (default) | 1,481 | 2,563 |
 | gpt-4.1 | 1,742 | 2,296 |
 | gpt-5 | 3,994 | 6,654 |
 | gpt-5-mini | 5,895 | 9,031 |
-| gpt-5-nano | 5,911 | 10,134 |
 
 - **TTC P50**: Median time to completion (50% of requests complete within this time)
 - **TTC P95**: 95th percentile time to completion (95% of requests complete within this time)
```
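The TTC P50 and P95 figures in these latency tables are order statistics over per-request completion times. A minimal sketch of deriving them with NumPy (the timings below are made up, not benchmark measurements):

```python
import numpy as np

# Made-up per-request completion times in milliseconds.
ttc_ms = np.array([700, 750, 760, 800, 900, 1100, 1300, 2400, 2500, 2600])

p50 = np.percentile(ttc_ms, 50)  # median: half of requests finish by here
p95 = np.percentile(ttc_ms, 95)  # 95% of requests finish by here
print(p50, p95)  # 1000.0 2555.0
```

Reporting P95 alongside P50 surfaces tail latency: as in the tables above, two models with similar medians can differ sharply in worst-case behavior.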

examples/basic/agents_sdk.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -33,7 +33,7 @@
 {
     "name": "Custom Prompt Check",
     "config": {
-        "model": "gpt-4.1-nano-2025-04-14",
+        "model": "gpt-4.1-mini-2025-04-14",
         "confidence_threshold": 0.7,
         "system_prompt_details": "Check if the text contains any math problems.",
     },
```
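For context, the fragment changed above is one check entry inside the example's guardrails configuration. As a sketch, the updated entry can be written as a standalone Python dict and serialized to JSON; the field names come from the diff, while the enclosing pipeline structure in examples/basic/agents_sdk.py is assumed, not shown here:

```python
import json

# The "Custom Prompt Check" entry as updated by this commit.
custom_prompt_check = {
    "name": "Custom Prompt Check",
    "config": {
        "model": "gpt-4.1-mini-2025-04-14",
        "confidence_threshold": 0.7,
        "system_prompt_details": "Check if the text contains any math problems.",
    },
}

# Round-trips cleanly through JSON, so it can live in a config file too.
print(json.dumps(custom_prompt_check, indent=2))
```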
