3 files changed, +8 -5 lines changed
@@ -82,10 +82,12 @@ This benchmark evaluates model performance on a balanced set of social media pos
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-4.1 | 0.989 | 0.976 | 0.962 | 0.962 | 0.717 |
-| gpt-4.1-mini (default) | 0.984 | 0.977 | 0.977 | 0.943 | 0.653 |
-| gpt-4.1-nano | 0.952 | 0.972 | 0.823 | 0.823 | 0.429 |
-| gpt-4o-mini | 0.965 | 0.977 | 0.955 | 0.945 | 0.842 |
+| gpt-5 | 0.9532 | 0.9195 | 0.9096 | 0.9068 | 0.0339 |
+| gpt-5-mini | 0.9629 | 0.9321 | 0.9168 | 0.9149 | 0.0998 |
+| gpt-5-nano | 0.9600 | 0.9297 | 0.9216 | 0.9175 | 0.1078 |
+| gpt-4.1 | 0.9603 | 0.9312 | 0.9249 | 0.9192 | 0.0439 |
+| gpt-4.1-mini (default) | 0.9520 | 0.9180 | 0.9130 | 0.9049 | 0.0459 |
+| gpt-4.1-nano | 0.9502 | 0.9262 | 0.9094 | 0.9043 | 0.0379 |
 
 **Notes:**
 
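Review note: the threshold-based columns in this table (Prec@R and Recall@FPR) can be read as follows. This is a minimal sketch under stated assumptions, not the benchmark's actual evaluation code. It assumes binary labels (1 = flagged class), higher scores mean "more likely positive", and that each metric is the best value over all score thresholds that satisfy the constraint. The function names `precision_at_recall` and `recall_at_fpr` are hypothetical, and the real tool may interpolate or break ties differently.

```python
from typing import Iterator, Sequence, Tuple


def _sweep(labels: Sequence[int], scores: Sequence[float]) -> Iterator[Tuple[int, int]]:
    """Yield cumulative (true_pos, false_pos) at each distinct score threshold,
    moving from the highest score to the lowest."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume tied scores together so every threshold is well defined.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        yield tp, fp


def precision_at_recall(labels: Sequence[int], scores: Sequence[float],
                        target_recall: float) -> float:
    """Best precision over thresholds whose recall is at least target_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    best = 0.0
    for tp, fp in _sweep(labels, scores):
        if tp / total_pos >= target_recall:
            best = max(best, tp / (tp + fp))
    return best


def recall_at_fpr(labels: Sequence[int], scores: Sequence[float],
                  max_fpr: float) -> float:
    """Best recall over thresholds whose false positive rate is at most max_fpr."""
    total_pos = sum(labels)
    total_neg = len(labels) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need both positive and negative examples")
    best = 0.0
    for tp, fp in _sweep(labels, scores):
        if fp / total_neg <= max_fpr:
            best = max(best, tp / total_pos)
    return best


# Toy data for illustration only; these are not benchmark results.
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.95, 0.9, 0.85, 0.8, 0.6, 0.4, 0.3, 0.1]
```

Under this reading, Recall@FPR=0.01 is the strictest column: it only counts positives that outscore roughly the top 1% of negatives, so it can fall sharply even when ROC AUC barely moves.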
   - "Streaming vs Blocking": streaming_output.md
   - Tripwires: tripwires.md
   - Checks:
-      - Prompt Injection Detection: ref/checks/prompt_injection_detection.md
       - Contains PII: ref/checks/pii.md
       - Custom Prompt Check: ref/checks/custom_prompt_check.md
       - Hallucination Detection: ref/checks/hallucination_detection.md
       - Jailbreak Detection: ref/checks/jailbreak.md
       - Moderation: ref/checks/moderation.md
+      - NSFW Text: ref/checks/nsfw.md
       - Off Topic Prompts: ref/checks/off_topic_prompts.md
+      - Prompt Injection Detection: ref/checks/prompt_injection_detection.md
       - URL Filter: ref/checks/urls.md
   - Evaluation Tool: evals.md
   - API Reference: