Some evaluation tasks can be time-consuming or expensive to run. It would be nice to be able to explore what metrics can be derived from evaluation results (`EvaluatorContext`) without re-running the task.
This would also enable a data-science-like workflow with evaluation results, since a DataFrame can be constructed from them (see the sketch below).
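For illustration, a rough sketch of the workflow this would enable, assuming contexts have already been serialized to JSON; the `eval_contexts` directory and the `name`, `duration`, and `metrics` fields are assumptions:

```python
import json
from pathlib import Path

import pandas as pd

# Load previously serialized EvaluatorContext records and explore them offline.
rows = []
for path in Path('eval_contexts').glob('*.json'):  # hypothetical dump directory
    ctx = json.loads(path.read_text())
    rows.append({'case': ctx.get('name'), 'duration': ctx.get('duration'), **ctx.get('metrics', {})})

df = pd.DataFrame(rows)
print(df.describe())
```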
Proposed implementation:
1. Add a serde schema for `SpanTreeRecordingError` so that a pydantic `TypeAdapter` can be used on `EvaluatorContext` (first sketch below).
2. Split the `pydantic_evals/pydantic_evals/dataset.py::_run_task_and_evaluators()` function into two functions, one for running the task and one for running the evaluators, and make them public (second sketch below).
3. (Optional?) A `ContextCaptureEvaluator` that saves the `EvaluatorContext` to disk (third sketch below).
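A minimal sketch of what item 1 would enable, assuming `EvaluatorContext` stays a generic dataclass; the round-trip helpers below are illustrative, not existing API:

```python
from typing import Any

from pydantic import TypeAdapter
from pydantic_evals.evaluators import EvaluatorContext

# With a serde schema for SpanTreeRecordingError, a TypeAdapter over the
# generic EvaluatorContext dataclass could round-trip evaluation results.
ctx_adapter = TypeAdapter(EvaluatorContext[Any, Any, Any])


def dump_context(ctx: EvaluatorContext[Any, Any, Any]) -> bytes:
    """Serialize a context produced by an evaluation run."""
    return ctx_adapter.dump_json(ctx)


def load_context(payload: bytes) -> EvaluatorContext[Any, Any, Any]:
    """Reload a saved context so metrics can be derived without re-running the task."""
    return ctx_adapter.validate_json(payload)
```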
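For item 2, one possible shape for the split; the function names, parameters, and return types below are hypothetical placeholders, not the current `dataset.py` API:

```python
from __future__ import annotations

from typing import Any

from pydantic_evals.evaluators import EvaluatorContext


async def run_task(task: Any, case: Any) -> EvaluatorContext[Any, Any, Any]:
    """Run the task for a single case and capture its output, duration, metrics,
    attributes, and span tree into an EvaluatorContext."""
    ...


async def run_evaluators(
    evaluators: list[Any], ctx: EvaluatorContext[Any, Any, Any]
) -> dict[str, Any]:
    """Apply evaluators to an already-captured context, so new metrics can be
    explored without re-running the task."""
    ...
```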
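And a minimal sketch of the optional `ContextCaptureEvaluator`, assuming item 1 is in place and following the documented custom-evaluator pattern (a dataclass subclass of `Evaluator`); the output directory, file naming, and boolean result are assumptions:

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Any
from uuid import uuid4

from pydantic import TypeAdapter
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

_ctx_adapter = TypeAdapter(EvaluatorContext[Any, Any, Any])


@dataclass
class ContextCaptureEvaluator(Evaluator):
    """Snapshots the EvaluatorContext to disk so metrics can be explored later
    without re-running the task."""

    output_dir: Path = Path('eval_contexts')

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        self.output_dir.mkdir(parents=True, exist_ok=True)
        # One JSON file per case; the file name is arbitrary here.
        path = self.output_dir / f'{uuid4().hex}.json'
        path.write_bytes(_ctx_adapter.dump_json(ctx))
        return True  # always "passes"; this evaluator only captures data
```

The `eval_contexts` directory here is the same hypothetical location that the DataFrame sketch above reads from.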