Serialization / Deserialization for EvaluatorContext and separate task execution and evaluator execution #3314

@hon-gyu

Description

Motivation:

  • Some evaluation tasks are time-consuming or expensive to run. It would be nice to be able to explore what metrics can be derived from evaluation results (EvaluatorContext) without re-running the task.
  • This would also enable a data-science-style workflow over evaluation results, since a DataFrame can be constructed from them (see the sketch after this list).
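
For example, a minimal sketch of that workflow; the records below are invented placeholders standing in for EvaluatorContext data deserialized from disk:

```python
import pandas as pd

# Placeholder records: the keys mirror EvaluatorContext attributes
# (name, output, duration), but the values are invented for illustration.
contexts = [
    {'name': 'case_1', 'output': 'answer a', 'duration': 1.8},
    {'name': 'case_2', 'output': 'answer b', 'duration': 0.4},
]

df = pd.DataFrame(contexts)
print(df['duration'].describe())  # explore metrics without re-running the task
```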

Proposed implementation:

  • Add a serde schema for SpanTreeRecordingError so that TypeAdapter can be used on EvaluatorContext (round-trip sketch after this list).
  • Split the pydantic_evals/pydantic_evals/dataset.py::_run_task_and_evaluators() function into two public functions, one that runs the task and one that runs the evaluators (illustrative signatures below).
  • (Optional?) A ContextCaptureEvaluator that saves each EvaluatorContext to disk (sketched below).
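
For the first item, a rough sketch of the round trip this would unlock, assuming the proposed serde schema exists; the save_context/load_context helpers are hypothetical:

```python
from pathlib import Path

from pydantic import TypeAdapter
from pydantic_evals.evaluators import EvaluatorContext

# Type parameters on EvaluatorContext are omitted for brevity; constructing
# this adapter only works once SpanTreeRecordingError has a serde schema.
context_adapter = TypeAdapter(EvaluatorContext)


def save_context(ctx: EvaluatorContext, path: Path) -> None:
    """Persist a context after the expensive task run."""
    path.write_bytes(context_adapter.dump_json(ctx))


def load_context(path: Path) -> EvaluatorContext:
    """Re-load a context later to derive new metrics without re-running the task."""
    return context_adapter.validate_json(path.read_bytes())
```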
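
For the second item, illustrative signatures only; run_task and run_evaluators are placeholder names, not a final API:

```python
from pydantic_evals.evaluators import EvaluatorContext


async def run_task(task, case) -> EvaluatorContext:
    """Run the (potentially expensive) task once and capture its context."""
    ...


async def run_evaluators(evaluators, ctx: EvaluatorContext) -> list:
    """Apply evaluators to a captured context; cheap to re-run with new evaluators."""
    ...
```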
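
And for the optional third item, one shape ContextCaptureEvaluator could take; the class, its out_dir field, and the file naming are all hypothetical, and it depends on the serde support above:

```python
from dataclasses import dataclass
from pathlib import Path

from pydantic import TypeAdapter
from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ContextCaptureEvaluator(Evaluator):
    """Hypothetical evaluator that persists each EvaluatorContext to disk."""

    out_dir: Path = Path('captured_contexts')

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        self.out_dir.mkdir(parents=True, exist_ok=True)
        # Relies on the serde schema proposed above for SpanTreeRecordingError.
        payload = TypeAdapter(EvaluatorContext).dump_json(ctx)
        (self.out_dir / f'{ctx.name}.json').write_bytes(payload)
        return True  # always "passes"; it exists only for its side effect
```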

References

No response
