
Commit 65df37b (parent: db7ddd3)

Warn about error handling in modelutility, show warnings by default, some doc fixes

4 files changed: +109 -46 lines

docs/value/data-banzhaf.md
Lines changed: 4 additions & 2 deletions

@@ -40,7 +40,8 @@ from pydvl.valuation.stopping import MinUpdates
 
 train, test = Dataset.from_arrays(...)
 model = ...
-utility = ModelUtility(model, SupervisedScorer(model, test, default=0.0))
+scorer = SupervisedScorer(model, test, default=0.0)
+utility = ModelUtility(model, scorer)
 sampler = PermutationSampler()
 valuation = BanzhafValuation(utility, sampler, MinUpdates(1000))
 with parallel_config(n_jobs=16):
@@ -84,7 +85,8 @@ more on this subject see [[semi-values-sampling]].
 
 train, test = Dataset.from_arrays(...)
 model = ...
-utility = ModelUtility(model, SupervisedScorer(model, test, default=0.0))
+scorer = SupervisedScorer(model, test, default=0.0)
+utility = ModelUtility(model, scorer)
 valuation = MSRBanzhafValuation(utility, MaxSamples(1000), batch_size=64)
 with parallel_config(n_jobs=16):
     valuation.fit(train)

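Both hunks make the same change: the scorer is now constructed separately and then passed to `ModelUtility`, instead of being built inline. For illustration, here is a minimal runnable sketch of the first snippet with the elided parts filled in; the synthetic data, the `LogisticRegression` model, and the `train_size`/`random_state` arguments to `from_arrays` are assumptions, not part of the commit:

```python
# Hypothetical end-to-end version of the data-banzhaf.md snippet above.
# Assumed (not in the commit): make_classification data, LogisticRegression,
# and the train_size/random_state arguments to Dataset.from_arrays.
from joblib import parallel_config
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

from pydvl.valuation import (
    BanzhafValuation,
    Dataset,
    MinUpdates,
    ModelUtility,
    PermutationSampler,
    SupervisedScorer,
)

X, y = make_classification(n_samples=200, random_state=42)
train, test = Dataset.from_arrays(X, y, train_size=0.7, random_state=42)
model = LogisticRegression()
scorer = SupervisedScorer(model, test, default=0.0)  # the pattern from this commit
utility = ModelUtility(model, scorer)
sampler = PermutationSampler()
valuation = BanzhafValuation(utility, sampler, MinUpdates(1000))
with parallel_config(n_jobs=16):
    valuation.fit(train)
```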
docs/value/index.md
Lines changed: 26 additions & 8 deletions

@@ -229,6 +229,16 @@ objects for different datasets. You can read more about [setting up the
 cache][getting-started-cache] in the installation guide, and in the
 documentation of the [caching][pydvl.utils.caching] module.
 
+!!! danger "Errors are hidden by default"
+    During semi-value computations, the utility can be evaluated on subsets that
+    break the fitting process. For instance, a classifier might require at least two
+    classes to fit, but the utility is sometimes evaluated on subsets with only one
+    class. This will raise an error with most classifiers. To avoid this, we set
+    `catch_errors=True` by default upon instantiation, which will catch the error and
+    return the scorer's default value instead. While we show a warning to signal that
+    something went wrong, this suppression can lead to unexpected results, so it is
+    important to be aware of this setting and to set it to `False` when testing, or if
+    you are sure that the utility will not be evaluated on problematic subsets.
 
 ### Computing some values
 
@@ -267,25 +277,33 @@ over, sliced, sorted, as well as converted to a [pandas.DataFrame][] using
 
 ### Learning the utility
 
-Since each evaluation of the utility entails a full retrain of the model on a new subset of the training data, it is natural to try to learn this mapping from subsets to scores. This is the idea behind **Data Utility Learning (DUL)**
+Since each evaluation of the utility entails a full retraining of the model on a
+new subset of the training data, it is natural to try to learn this mapping from
+subsets to scores. This is the idea behind **Data Utility Learning (DUL)**
 [@wang_improving_2022] and in pyDVL it's as simple as wrapping the
-`ModelUtility` inside [DataUtilityLearning][pydvl.valuation.utility.DataUtilityLearning]:
+[ModelUtility][pydvl.valuation.utility.ModelUtility] inside a
+[DataUtilityLearning][pydvl.valuation.utility.DataUtilityLearning] object:
 
 ```python
-from pydvl.valuation import ModelUtility, DataUtilityLearning, Dataset
+from pydvl.valuation import *
+from pydvl.valuation.types import Sample
 from sklearn.linear_model import LinearRegression, LogisticRegression
 from sklearn.datasets import load_iris
 
-dataset = Dataset.from_sklearn(load_iris())
-u = ModelUtility(LogisticRegression(), dataset)
+train, test = Dataset.from_sklearn(load_iris())
+scorer = SupervisedScorer("accuracy", test, default=0.0, range=(0, 1))
+u = ModelUtility(LogisticRegression(), scorer)
 training_budget = 3
-wrapped_u = DataUtilityLearning(u, training_budget, LinearRegression())
+utility_model = IndicatorUtilityModel(
+    predictor=LinearRegression(), n_data=len(train)
+)
+wrapped_u = DataUtilityLearning(u, training_budget, utility_model)
 
 # First 3 calls will be computed normally
 for i in range(training_budget):
-    _ = wrapped_u((i,))
+    _ = wrapped_u(Sample(None, train.indices[:i]))
 # Subsequent calls will be computed using the learned model for DUL
-wrapped_u((1, 2, 3))
+wrapped_u(Sample(None, train.indices))
 ```
 
 ## Problems of data values { #problems-of-data-values }

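The "Errors are hidden by default" admonition added above is easy to demonstrate. The following sketch is not part of the commit: it reuses the `Sample` calling convention from the DUL example and a one-element subset, which necessarily contains a single class and so breaks `LogisticRegression.fit()`:

```python
# Sketch of the behavior described in the admonition (illustrative, not from
# the commit). A one-point subset has a single class, so fitting fails.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from pydvl.valuation import Dataset, ModelUtility, SupervisedScorer
from pydvl.valuation.types import Sample

train, test = Dataset.from_sklearn(load_iris(), random_state=16)
scorer = SupervisedScorer("accuracy", test, default=0.0, range=(0, 1))
bad_subset = train.indices[:1]  # a single data point, hence a single class

u = ModelUtility(LogisticRegression(), scorer, catch_errors=True)
u(Sample(None, subset=bad_subset))  # warns and returns the scorer's default, 0.0

u = ModelUtility(LogisticRegression(), scorer, catch_errors=False)
u(Sample(None, subset=bad_subset))  # raises: fitting needs at least two classes
```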
src/pydvl/valuation/utility/__init__.py
Lines changed: 1 addition & 1 deletion

@@ -9,7 +9,7 @@ class for all utilities is [UtilityBase][pydvl.valuation.utility.base.UtilityBase]
 ## Utility for model-based methods
 
 [ModelUtility][pydvl.valuation.utility.modelutility.ModelUtility] holds information
-about model, and scoring function (the latter being what one usually understands under
+about the model and scoring function (the latter being what one usually understands under
 *utility* in the general definition of Shapley value). Model-based evaluation methods
 define the utility as a retraining of the model on a subset of the data, which is then
 [scored][pydvl.valuation.scorers]. Please see the documentation on [Computing Data

src/pydvl/valuation/utility/modelutility.py
Lines changed: 78 additions & 35 deletions

@@ -1,10 +1,85 @@
 """
 This module implements a utility function for supervised models.
 
-It is mostly geared towards sci-kit-learn models, but can be used with any object
-that implements the [BaseModel][pydvl.utils.types.BaseModel] protocol, i.e. that has a
+[ModelUtility][pydvl.valuation.utility.modelutility.ModelUtility] holds a model and a
+scorer. Each call to the utility will fit the model on a subset of the training data and
+evaluate the scorer on the test data. It is used by all the valuation methods in
+[pydvl.valuation][pydvl.valuation].
+
+This class is geared towards scikit-learn models, but can be used with any object that
+implements the [BaseModel][pydvl.utils.types.BaseModel] protocol, i.e. that has a
 `fit()` method.
 
+!!! danger "Errors are hidden by default"
+    During semi-value computations, the utility can be evaluated on subsets that
+    break the fitting process. For instance, a classifier might require at least two
+    classes to fit, but the utility is sometimes evaluated on subsets with only one
+    class. This will raise an error with most classifiers. To avoid this, we set
+    `catch_errors=True` by default upon instantiation, which will catch the error and
+    return the scorer's default value instead. While we show a warning to signal that
+    something went wrong, this suppression can lead to unexpected results, so it is
+    important to be aware of this setting and to set it to `False` when testing, or if
+    you are sure that the utility will not be evaluated on problematic subsets.
+
+
+## Examples
+
+??? Example "Standard usage"
+    The utility takes a model and a scorer and is passed to the valuation method.
+    Here's the basic usage:
+
+    ```python
+    from joblib import parallel_config
+    from pydvl.valuation import (
+        Dataset, MinUpdates, ModelUtility, SupervisedScorer, TMCShapleyValuation
+    )
+
+    train, test = Dataset.from_arrays(X, y, ...)
+    model = SomeModel()  # Implementing the basic scikit-learn interface
+    scorer = SupervisedScorer("r2", test, default=0.0, range=(-np.inf, 1.0))
+    utility = ModelUtility(model, scorer, catch_errors=True, show_warnings=True)
+    valuation = TMCShapleyValuation(utility, is_done=MinUpdates(1000))
+    with parallel_config(n_jobs=-1):
+        valuation.fit(train)
+    ```
+
+??? Example "Directly calling the utility"
+    The following code instantiates a utility object and calls it directly. The
+    underlying logistic regression model will be trained on the indices passed as
+    argument, and evaluated on the test data.
+
+    ```python
+    from pydvl.valuation.utility import ModelUtility
+    from pydvl.valuation.dataset import Dataset
+    from pydvl.valuation.scorers import SupervisedScorer
+    from pydvl.valuation.types import Sample
+    from sklearn.linear_model import LogisticRegression
+    from sklearn.datasets import load_iris
+
+    train, test = Dataset.from_sklearn(load_iris(), random_state=16)
+    scorer = SupervisedScorer("accuracy", test, default=0.0, range=(0.0, 1.0))
+    u = ModelUtility(LogisticRegression(random_state=16), scorer, catch_errors=True)
+    u(Sample(None, subset=train.indices))
+    ```
+
+??? Example "Enabling the cache"
+    In this example an in-memory cache is used. Note that caching is only useful
+    under certain conditions, and does not really speed up typical Monte Carlo
+    approximations. See [the introduction][getting-started-cache] and the [module
+    documentation][pydvl.utils.caching] for more.
+
+    ```python
+    (...)  # Imports as above
+    cache_backend = InMemoryCacheBackend()  # See other backends in the caching module
+    u = ModelUtility(
+        model=LogisticRegression(random_state=16),
+        scorer=SupervisedScorer("accuracy", test, default=0.0, range=(0.0, 1.0)),
+        cache_backend=cache_backend,
+        catch_errors=True
+    )
+    u(Sample(None, subset=train.indices))
+    u(Sample(None, subset=train.indices))  # The second call does not retrain the model
+    ```
+
 ## Data type of the underlying data arrays
 
 In principle, very few to no assumptions are made about the data type. As long as it is
@@ -109,38 +184,6 @@ class ModelUtility(UtilityBase[SampleT], Generic[SampleT, ModelT]):
         cached_func_options: Optional configuration object for cached utility evaluation.
         clone_before_fit: If `True`, the model will be cloned before calling
             `fit()`.
-
-    ??? Example
-        ``` pycon
-        >>> from pydvl.valuation.utility import ModelUtility, DataUtilityLearning
-        >>> from pydvl.valuation.dataset import Dataset
-        >>> from pydvl.valuation.scorers import SupervisedScorer
-        >>> from sklearn.linear_model import LinearRegression, LogisticRegression
-        >>> from sklearn.datasets import load_iris
-        >>> train, test = Dataset.from_sklearn(load_iris(), random_state=16)
-        >>> u = ModelUtility(LogisticRegression(random_state=16), SupervisedScorer("accuracy"))
-        >>> u(Sample(None, subset=train.indices))
-        0.9
-        ```
-
-        With caching enabled:
-
-        ```pycon
-        >>> from pydvl.valuation.utility import ModelUtility, DataUtilityLearning
-        >>> from pydvl.valuation.dataset import Dataset
-        >>> from pydvl.utils.caching.memory import InMemoryCacheBackend
-        >>> from sklearn.linear_model import LinearRegression, LogisticRegression
-        >>> from sklearn.datasets import load_iris
-        >>> train, test = Dataset.from_sklearn(load_iris(), random_state=16)
-        >>> cache_backend = InMemoryCacheBackend()
-        >>> u = ModelUtility(
-        ...     model=LogisticRegression(random_state=16),
-        ...     scorer=SupervisedScorer("accuracy"),
-        ...     cache_backend=cache_backend)
-        >>> u(Sample(None, subset=train.indices))
-        0.9
-        ```
-
     """
 
     model: ModelT
@@ -152,7 +195,7 @@ def __init__(
         scorer: Scorer,
         *,
         catch_errors: bool = False,
-        show_warnings: bool = False,
+        show_warnings: bool = True,
         cache_backend: CacheBackend | None = None,
         cached_func_options: CachedFuncConfig | None = None,
         clone_before_fit: bool = True,

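With the new default `show_warnings=True`, errors caught by `catch_errors=True` are now reported instead of silently swallowed. Two ways to adjust this, sketched under the assumption that the warning is emitted through Python's standard `warnings` machinery (the pyDVL calls are those shown in the diffs above):

```python
# Sketch (not from the commit): tuning the new warning behavior.
import warnings

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from pydvl.valuation import Dataset, ModelUtility, SupervisedScorer
from pydvl.valuation.types import Sample

train, test = Dataset.from_sklearn(load_iris(), random_state=16)
scorer = SupervisedScorer("accuracy", test, default=0.0, range=(0, 1))

# 1. Opt back out of the warnings while keeping errors suppressed:
u = ModelUtility(LogisticRegression(), scorer, catch_errors=True, show_warnings=False)
u(Sample(None, subset=train.indices[:1]))  # silently returns the default, 0.0

# 2. When testing, escalate warnings to errors so hidden failures surface
#    (assumes the warning comes from warnings.warn):
u = ModelUtility(LogisticRegression(), scorer, catch_errors=True)  # warnings shown
with warnings.catch_warnings():
    warnings.simplefilter("error")
    u(Sample(None, subset=train.indices[:1]))  # the warning now raises
```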