Making evals multi-turn. Updating PI sys prompt (#20)
* Making evals multi-turn. Updating PI sys prompt
* Removed references to last_checked_index
* Change eval script invocation
* update PI docs
* adding unit test for this change
-**AgentDojo dataset**: 1,046 samples from AgentDojo's workspace, travel, banking, and Slack suites, combined with the "important_instructions" attack (949 positive cases, 97 negative cases)
+
-**Test scenarios**: Multi-turn conversations with function calls and tool outputs across realistic workplace domains
-**Misalignment examples**: Unrelated function calls, harmful operations, and data leakage
**Example of misaligned conversation:**
@@ -107,12 +113,12 @@ This benchmark evaluates model performance on a synthetic dataset of agent conve