Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to simultaneously handle hyperparameter tuning and algorithm comparison. Examples of such situations include: proof of concept, start-ups, medical studies, time series, etc. Using standard methods such as k-fold cross-validation in these cases may result in substantial increases in optimization bias. Nested cross-validation has been shown to produce less biased out-of-sample error estimates even for datasets with only hundreds of rows, and therefore gives a better judgement of generalization performance.

The primary issue with this technique is that it can be computationally expensive, with potentially tens of thousands of models being trained during the process. While researching this technique, I found two slightly different variations of performing nested cross-validation — one authored by [Sebastian Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb) and the other by [Max Kuhn and Kjell Johnson](https://tidymodels.github.io/rsample/articles/Applications/Nested_Resampling.html).

Various elements of the technique affect the run times and can be altered to improve performance. These include (a minimal sketch of where each element sits in the nested loops follows this list):
1. Hyperparameter value grids
2. Grid search strategy
3. Inner-Loop CV strategy
4. Outer-Loop CV strategy
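
To make the roles of these elements concrete, below is a minimal, self-contained sketch of the nested loop structure. It is not either author's implementation: the fold counts and the tiny elastic-net grid are placeholders chosen only to keep the example quick to run.

```r
# Minimal nested cross-validation sketch (placeholder folds and grid, not the
# settings used in the experiments below)
library(glmnet)
library(mlbench)

set.seed(2020)
sim <- mlbench.friedman1(n = 200, sd = 1)
dat <- data.frame(sim$x, y = sim$y)

# 1. Hyperparameter value grid (tiny placeholder instead of a 100x2 grid)
grid <- expand.grid(alpha = c(0.1, 0.5, 0.9), lambda = c(0.01, 0.1, 1))

k_outer <- 5   # 4. Outer-loop CV strategy
k_inner <- 2   # 3. Inner-loop CV strategy
outer_id <- sample(rep(seq_len(k_outer), length.out = nrow(dat)))

outer_mae <- vapply(seq_len(k_outer), function(o) {
  train <- dat[outer_id != o, ]
  test  <- dat[outer_id == o, ]
  inner_id <- sample(rep(seq_len(k_inner), length.out = nrow(train)))

  # 2. Grid search strategy: exhaustive search over the grid, scored by inner-loop CV
  grid_mae <- apply(grid, 1, function(g) {
    mean(vapply(seq_len(k_inner), function(i) {
      fit  <- glmnet(as.matrix(train[inner_id != i, 1:10]),
                     train$y[inner_id != i], alpha = g[["alpha"]])
      pred <- predict(fit, as.matrix(train[inner_id == i, 1:10]), s = g[["lambda"]])
      mean(abs(pred - train$y[inner_id == i]))
    }, numeric(1)))
  })

  # Refit the winning hyperparameters on the full outer-training fold,
  # then score once on the held-out outer fold
  best <- grid[which.min(grid_mae), ]
  fit  <- glmnet(as.matrix(train[, 1:10]), train$y, alpha = best$alpha)
  pred <- predict(fit, as.matrix(test[, 1:10]), s = best$lambda)
  mean(abs(pred - test$y))
}, numeric(1))

mean(outer_mae)  # out-of-sample error estimate
```
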
I'll be examining two aspects of nested cross-validation:
1. Duration: Find out which packages and combinations of model functions give us the fastest implementation of each method.
2. Performance: First, develop a testing framework. Then, for a given data-generating process, how large a sample size is needed to obtain a reasonably accurate out-of-sample error estimate? And how many repeats of the outer-loop CV strategy should be used to calculate this error estimate?
## Duration
#### Experiment details:
* Random Forest and Elastic Net Regression algorithms
* Both algorithms are tuned with 100x2 hyperparameter grids created using a Latin hypercube design (see the sketch after this list).
* From {mlbench}, I'm using the generated data set, friedman1, from Friedman's Multivariate Adaptive Regression Splines (MARS) paper.
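
As a rough sketch of how this data and these grids could be generated (the sample size and hyperparameter ranges below are my assumptions, not necessarily the values used in the actual scripts):

```r
library(dials)
library(mlbench)

set.seed(2019)
# Friedman #1 simulation: 10 predictors, numeric outcome (n is just an example size)
sim <- mlbench.friedman1(n = 5000, sd = 1)
dat <- data.frame(sim$x, y = sim$y)

# 100-row, 2-hyperparameter Latin hypercube grids
enet_grid <- grid_latin_hypercube(penalty(), mixture(), size = 100)
rf_grid   <- grid_latin_hypercube(mtry(range = c(1, 10)), min_n(), size = 100)
```
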
The sizes of the data sets are the same as those in the original scripts by the authors. Using Kuhn-Johnson, 50,000 models (grid size * number of repeats * number of folds in the outer-loop * number of folds/resamples in the inner-loop) are trained for each algorithm — using Raschka's, 1,001 models for each algorithm. The one extra model in the Raschka variation comes from his method of choosing the hyperparameter values for the final model: he performs an extra k-fold cross-validation, using the inner-loop CV strategy, on the entire training set. Kuhn-Johnson instead uses a majority vote: whichever set of hyperparameter values is chosen most often during the inner-loop tuning procedure is used to fit the final model.
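
As an illustration of that arithmetic, here is one combination of repeats and folds that reproduces the 50,000 figure; the specific repeat and fold numbers are placeholders, not necessarily the exact settings used.

```r
grid_size   <- 100  # 100 candidate rows x 2 hyperparameters
repeats     <- 5    # placeholder
outer_folds <- 10   # placeholder
inner_folds <- 10   # placeholder

grid_size * repeats * outer_folds * inner_folds  # 50,000 models per algorithm
```
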
[MLFlow](https://mlflow.org/docs/latest/index.html) is used to keep track of the duration (seconds) of each run along with the implementation and method used.
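
A rough sketch of that tracking pattern is below; the parameter names and the placeholder timing are illustrative rather than taken from the actual scripts.

```r
library(mlflow)

duration <- system.time({
  Sys.sleep(1)  # stand-in for one nested cross-validation run
})[["elapsed"]]

mlflow_start_run()
mlflow_log_param("implementation", "ranger")   # hypothetical label
mlflow_log_param("method", "kuhn-johnson")     # hypothetical label
mlflow_log_metric("duration", duration)        # seconds
mlflow_end_run()
```
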
## Performance
#### Experiment details:
* The same data, algorithms, and hyperparameter grids are used.
* The fastest implementation of each method is used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy.
* The {mlr3} implementation is the fastest for Raschka's method, but the Ranger-Kuhn-Johnson implementation is close. To simplify, I am using [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R) for both methods.
* The chosen algorithm with its hyperparameters is fit on the entire training set, and the resulting final model predicts on a 100K-row Friedman dataset.
* The percent error between the average mean absolute error (MAE) across the outer-loop folds and the MAE of the predictions on this 100K dataset is calculated for each combination of repeat, data size, and method (a sketch of this calculation appears after this list).
* To make this experiment manageable in terms of runtimes, I am using AWS instances: an r5.2xlarge for the Elastic Net and an r5.24xlarge for Random Forest. Also see the Other Notes section.
* Iterating through different numbers of repeats, sample sizes, and methods makes a functional approach more appropriate than running imperative scripts. Given the long runtimes and the impermanent nature of my internet connection, it would also be nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm using it to orchestrate.
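
As a toy illustration of that {drake} setup, here is a minimal plan; the targets and helper functions are hypothetical stand-ins, not the actual plan used in this repo.

```r
library(drake)

# Hypothetical stand-ins for the real experiment functions
generate_sim <- function(n) data.frame(x = rnorm(n), y = rnorm(n))
run_ncv      <- function(dat, repeats) data.frame(n = nrow(dat), repeats = repeats, mae = NA)

plan <- drake_plan(
  sim_100  = generate_sim(100),
  sim_2000 = generate_sim(2000),
  ncv_100  = run_ncv(sim_100, repeats = 5),
  ncv_2000 = run_ncv(sim_2000, repeats = 5)
)

make(plan)  # finished targets are cached, so an interrupted run resumes where it left off
```
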
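
Before the results, here is a sketch of the final-fit and percent-error calculation described in the bullets above. The hyperparameters and outer-fold MAE values are illustrative, and the exact form of the percent-error formula (including the choice of denominator) is my assumption.

```r
library(ranger)
library(mlbench)

set.seed(2019)
train_sim <- mlbench.friedman1(n = 2000, sd = 1)
test_sim  <- mlbench.friedman1(n = 100000, sd = 1)
train_dat <- data.frame(train_sim$x, y = train_sim$y)
test_dat  <- data.frame(test_sim$x,  y = test_sim$y)

# Fit the chosen algorithm/hyperparameters on the entire training set ...
final_fit <- ranger(y ~ ., data = train_dat, mtry = 4, min.node.size = 5)

# ... and score its predictions on the 100K-row Friedman dataset
test_mae <- mean(abs(predict(final_fit, data = test_dat)$predictions - test_dat$y))

# Average MAE across the outer-loop folds (illustrative values)
avg_outer_mae <- mean(c(2.31, 2.45, 2.38, 2.50, 2.42))

percent_error <- abs(avg_outer_mae - test_mae) / test_mae * 100
percent_error
```
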
#### Results:
* Runtimes for n = 100 and n = 800 are close, and there's a large jump in runtime going from n = 2000 to n = 5000.
* The number of repeats has little effect on the amount of percent error.
* For n = 100, there is substantially more variation in percent error than in the other sample sizes.
* While there is a large runtime cost that comes with increasing the sample size from 2000 to 5000 observations, it doesn't seem to provide any benefit in gaining a more accurate estimate of the out-of-sample error.
* The longest runtime is under 30 minutes, so runtime isn't a large consideration if we are making a choice about sample size.
* There isn't much difference in runtime between n = 100 and n = 2000.
* For n = 100, there's a relatively large change in percent error when going from 1 repeat to 2 repeats. The error estimate then stabilizes for repeats 3 through 5.
* n = 5000 gives poorer out-of-sample error estimates than n = 800 and n = 2000 for all values of repeats.
* n = 800 remains under 2.5% error for all repeat values, but also shows considerable volatility.