Commit 1d4edec

1 parent 4232345 commit 1d4edec

File tree: 1 file changed (+7 -6 lines)

docs/building-blocks/5-metrics.md

Lines changed: 7 additions & 6 deletions
````diff
@@ -9,9 +9,7 @@ DSPy is a machine learning framework, so you must think about your **automatic m

 ## What is a metric and how do I define a metric for my task?

-What makes outputs from your system good or bad? Invest in defining metrics and improving them over time incrementally. It's really hard to consistently improve what you aren't able to define.
-
-A metric is just a function that will take examples from your data and take the output of your system, and return a score that quantifies how good the output is.
+A metric is just a function that will take examples from your data and take the output of your system, and return a score that quantifies how good the output is. What makes outputs from your system good or bad?

 For simple tasks, this could be just "accuracy" or "exact match" or "F1 score". This may be the case for simple classification or short-form QA tasks.

````

````diff
@@ -26,7 +24,7 @@ A DSPy metric is just a function in Python that takes `example` (e.g., from your

 Your metric should also accept an optional third argument called `trace`. You can ignore this for a moment, but it will enable some powerful tricks if you want to use your metric for optimization.

-Here's a simple example of a metric that's comparing `example.answer` and `pred.answer`.
+Here's a simple example of a metric that's comparing `example.answer` and `pred.answer`. This particular metric will return a `bool`.

 ```python
 def validate_answer(example, pred, trace=None):
````
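
The diff context above cuts the code block off right after the signature. For reference, a minimal sketch of what the full `validate_answer` body plausibly looks like (an assumption on my part, not shown in this commit): a case-insensitive comparison of the gold and predicted answers.

```python
def validate_answer(example, pred, trace=None):
    # Assumed body (the hunk truncates it): a simple case-insensitive
    # exact match between the gold answer and the predicted answer.
    return example.answer.lower() == pred.answer.lower()
```
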
````diff
@@ -38,7 +36,7 @@ Some people find these utilities (built-in) convenient:
 - `dspy.evaluate.metrics.answer_exact_match`
 - `dspy.evaluate.metrics.answer_passage_match`

-Your metrics could be more complex, e.g. check for multiple properties.
+Your metrics could be more complex, e.g. check for multiple properties. The metric below will return a `float` if `trace is None` (i.e., if it's used for evaluation or optimization), and will return a `bool` otherwise (i.e., if it's used to bootstrap demonstrations).

 ```python
 def validate_context_and_answer(example, pred, trace=None):
````
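
As with a hand-written metric, the built-in utilities listed in this hunk take an `example` and a `pred`. A brief, hedged usage sketch (the `example` and `pred` values below are illustrative, not from the commit):

```python
import dspy
from dspy.evaluate.metrics import answer_exact_match

# Hypothetical example/pred objects; any objects with an `answer` field work.
example = dspy.Example(answer="Paris")
pred = dspy.Prediction(answer="paris")

correct = answer_exact_match(example, pred)  # returns a bool
```
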
````diff
@@ -48,7 +46,10 @@ def validate_context_and_answer(example, pred, trace=None):
     # check the predicted answer comes from one of the retrieved contexts
     context_match = any((pred.answer.lower() in c) for c in pred.context)

-    return answer_match and context_match
+    if trace is None: # if we're doing evaluation or optimization
+        return (answer_match + context_match) / 2.0
+    else: # if we're doing bootstrapping, i.e. self-generating good demonstrations of each step
+        return answer_match and context_match
 ```

 Defining a good metric is an iterative process, so doing some initial evaluations and looking at your data and your outputs are key.
````
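
To make the new dual-mode return type concrete, here is a hedged sketch of calling such a metric in evaluation mode over a tiny dev set (the `devset` and `program` below are illustrative stand-ins, not part of this commit):

```python
import dspy

# Illustrative stand-ins: a one-item dev set and a trivial "program" whose
# prediction carries the `answer` and `context` fields the metric inspects.
devset = [dspy.Example(question="What is 2+2?", answer="4").with_inputs("question")]

def program(question):
    return dspy.Prediction(answer="4", context=["2+2 equals 4"])

# trace is None here, so validate_context_and_answer returns a float in [0, 1].
scores = [validate_context_and_answer(ex, program(**ex.inputs())) for ex in devset]
print(f"Average score: {sum(scores) / len(scores):.3f}")
```

During bootstrapping, an optimizer would pass a non-None `trace`, and the same call would instead return the stricter `bool`.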
