Commit 0f406ec
[skip ci] Checklist: docs/value/shapley.md
1 parent d199af8

docs/value/shapley.md: 1 file changed, 36 additions (+), 24 deletions (-)
@@ -85,24 +85,25 @@ and others are preferred, but if desired, usage follows the same pattern:
 )
 
 model = SomeSKLearnModel()
-scorer = SupervisedScorer("accuracy", test_data, default=0)
-utility = ModelUtility(model, scorer, ...)
+scorer = SupervisedScorer("accuracy", test_data, default=0.0)
+utility = ModelUtility(model, scorer)
 sampler = UniformSampler(seed=42)
-stopping = MaxSamples(5000)
+stopping = MaxSamples(sampler, 5000)
 valuation = ShapleyValuation(utility, sampler, stopping)
 with parallel_config(n_jobs=16):
     valuation.fit(training_data)
 result = valuation.values()
 ```
 
-The DataFrames returned by most Monte Carlo methods will contain approximate
-standard errors as an additional column, in this case named `cmc_stderr`.
-
-Note the usage of the object [MaxUpdates][pydvl.value.stopping.MaxUpdates] as the
-stop condition. This is an instance of a
-[StoppingCriterion][pydvl.value.stopping.StoppingCriterion]. Other examples are
-[MaxTime][pydvl.value.stopping.MaxTime] and
-[AbsoluteStandardError][pydvl.value.stopping.AbsoluteStandardError].
+Note the usage of the object [MaxSamples][pydvl.value.stopping.MaxSamples] as
+the stopping condition, which takes the sampler as an argument. This is a
+special instance of a
+[StoppingCriterion][pydvl.value.stopping.StoppingCriterion]. Further examples,
+not tied to the sampler, are [MaxTime][pydvl.value.stopping.MaxTime] (stops
+after a given amount of time), [MinUpdates][pydvl.value.stopping.MinUpdates]
+(looks at the number of updates to the individual values), and
+[AbsoluteStandardError][pydvl.value.stopping.AbsoluteStandardError] (not very
+reliable as a stopping criterion), among others.
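
All of these implement [StoppingCriterion][pydvl.value.stopping.StoppingCriterion] and can be combined with boolean operators, as the stratified example further below does. A minimal sketch, assuming the `pydvl.valuation` imports used in this file (the `seconds` argument of `MaxTime` is an assumption to verify against the stopping module):

```python
# Minimal sketch: composing stopping criteria with boolean operators.
from pydvl.valuation import MaxTime, MinUpdates

# Stop once every value has received 1000 updates, or after one hour,
# whichever happens first. MaxTime's `seconds` keyword is assumed here.
stopping = MinUpdates(1000) | MaxTime(seconds=3600)
```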
 
 
 ## A stratified approach { #stratified-shapley-value }
@@ -151,13 +152,17 @@ a way to reduce the variance of the estimator.
 [ShapleyValuation][pydvl.valuation.methods.shapley.ShapleyValuation] with a
 custom sampler, for instance
 [VRDSSampler][pydvl.valuation.samplers.stratified.VRDSSampler].
+Note the use of the [History][pydvl.value.stopping.History] object, a stopping
+criterion which never stops, but records the trace of value updates in a
+rolling memory. The data can then be used to check for convergence, for
+debugging, plotting, etc.
 
 ```python
 from pydvl.valuation import StratifiedShapleyValuation, MinUpdates, History
 training_data, test_data = Dataset.from_arrays(...)
 model = ...
 scorer = SupervisedScorer(model, test_data, default=..., range=...)
-utility = ModelUtility(model, scorer,
+utility = ModelUtility(model, scorer)
 valuation = StratifiedShapleyValuation(
     utility=utility,
     is_done=MinUpdates(min_updates) | History(n_steps=min_updates),
@@ -232,21 +237,29 @@ You can see this method in action in
 
 
 ??? example "Truncated Monte Carlo Shapley values"
+    Use of this object follows the same pattern as the previous examples,
+    except that a separate instantiation of the sampler is no longer
+    necessary. This has the drawback that we cannot use
+    [MaxSamples][pydvl.value.stopping.MaxSamples] as the stopping criterion
+    anymore, since it requires the sampler. To work around this, use
+    [ShapleyValuation][pydvl.valuation.methods.shapley.ShapleyValuation]
+    directly.
+
     ```python
     from pydvl.valuation import (
-        TMCShapleyValuation,
+        MinUpdates,
         ModelUtility,
-        SupervisedScorer,
         PermutationSampler,
+        SupervisedScorer,
         RelativeTruncation,
-        MaxSamples
+        TMCShapleyValuation,
     )
 
     model = SomeSKLearnModel()
     scorer = SupervisedScorer("accuracy", test_data, default=0)
     utility = ModelUtility(model, scorer, ...)
     truncation = RelativeTruncation(rtol=0.05)
-    stopping = MaxSamples(5000)
+    stopping = MinUpdates(5000)
     valuation = TMCShapleyValuation(utility, truncation, stopping)
     with parallel_config(n_jobs=16):
         valuation.fit(training_data)
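
A minimal sketch of that workaround, under the assumption that `PermutationSampler` accepts the truncation policy as a keyword argument (to be checked against the sampler's documentation):

```python
# Sketch: use ShapleyValuation directly so that MaxSamples is available.
# Passing `truncation` to PermutationSampler is an assumption.
from pydvl.valuation import (
    MaxSamples,
    ModelUtility,
    PermutationSampler,
    RelativeTruncation,
    ShapleyValuation,
    SupervisedScorer,
)

model = SomeSKLearnModel()
scorer = SupervisedScorer("accuracy", test_data, default=0.0)
utility = ModelUtility(model, scorer)
sampler = PermutationSampler(truncation=RelativeTruncation(rtol=0.05))
stopping = MaxSamples(sampler, 5000)
valuation = ShapleyValuation(utility, sampler, stopping)
```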
@@ -259,25 +272,24 @@ As already mentioned, with the architecture of
 [ShapleyValuation][pydvl.valuation.methods.shapley.ShapleyValuation] it is
 possible to try different importance-sampling schemes by swapping the sampler.
 Besides TMCS we also have [Owen sampling][owen-shapley-intro]
-[@okhrati_multilinear_2021], and [Beta
-Shapley][beta-shapley-intro] [@kwon_beta_2022] when $\alpha = \beta = 1.$
+[@okhrati_multilinear_2021], and [Beta Shapley][beta-shapley-intro]
+[@kwon_beta_2022] when $\alpha = \beta = 1.$
 
 A different approach is via a SAT problem, as done in [Group Testing
 Shapley][group-testing-shapley-intro] [@jia_efficient_2019].
 
-Yet another, which is applicable to any utility-based valuation method, is
-[Data Utility Learning][data-utility-learning-intro]
-[@wang_improving_2022]. This method learns a model of the utility function
-during a warmup phase, and then uses it to speed up marginal utility
-computations.
+Yet another, which is applicable to any utility-based valuation method, is [Data
+Utility Learning][data-utility-learning-intro] [@wang_improving_2022]. This
+method learns a model of the utility function during a warmup phase, and then
+uses it to speed up marginal utility computations.
 
 
 ## Model-specific methods
 
 Shapley values can have a closed form expression or a simpler approximation
 scheme when the model class is restricted. The prime example is
 [kNN-Shapley][knn-shapley-intro] [@jia_efficient_2019a], which is exact for the
-kNN model, and is $O(n_test n \log n).$
+kNN model, and is $O(n_\text{test}\ n \log n).$
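
To see where the $O(n_\text{test}\ n \log n)$ bound comes from: for each test point the training set is sorted by distance ($n \log n$), and the exact values then follow from a single backward pass. A sketch of the recursion from [@jia_efficient_2019a] for one test point, written for illustration rather than taken from pyDVL:

```python
import numpy as np

def knn_shapley_one_point(y_sorted: np.ndarray, y_test, k: int) -> np.ndarray:
    """Exact kNN-Shapley values for a single test point, following the
    recursion of Jia et al. (2019a). `y_sorted` holds the training labels
    sorted by increasing distance to the test point."""
    n = len(y_sorted)
    s = np.zeros(n)
    s[-1] = float(y_sorted[-1] == y_test) / n  # farthest training point
    for i in range(n - 2, -1, -1):  # backward pass over neighbor ranks
        diff = float(y_sorted[i] == y_test) - float(y_sorted[i + 1] == y_test)
        s[i] = s[i + 1] + diff / k * min(k, i + 1) / (i + 1)
    return s
```

Averaging the result over all $n_\text{test}$ test points gives the values for the full test set, at the stated complexity.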
 
 [^not1]: The quantity $u(S_{+i}) - u(S)$ is called the
     [marginal utility][glossary-marginal-utility] of the sample $x_i$ (with
