I'll be examining two aspects of nested cross-validation:
## Duration Experiment
#### Experiment details:
* Random Forest and Elastic Net Regression algorithms
* Both with 100x2 hyperparameter grids
+ outer loop: 5 folds
+ inner loop: 2 folds
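To make the loop structure concrete, here's a toy sketch of a 5x2 nested cross-validation. This is not the repo's actual code: it assumes the {rsample} package and substitutes a stand-in "grid" of two `lm()` formulas for the real 100x2 hyperparameter grids.

```r
library(rsample)

set.seed(1)
dat <- data.frame(y = rnorm(100), x = rnorm(100))

outer <- vfold_cv(dat, v = 5)          # outer loop: 5 folds

outer_mae <- vapply(outer$splits, function(split) {
  train <- analysis(split)
  inner <- vfold_cv(train, v = 2)      # inner loop: 2 folds

  # score each "hyperparameter" candidate on the inner folds
  forms <- list(y ~ 1, y ~ x)
  inner_mae <- vapply(forms, function(f) {
    mean(vapply(inner$splits, function(s) {
      fit <- lm(f, data = analysis(s))
      mean(abs(assessment(s)$y - predict(fit, newdata = assessment(s))))
    }, numeric(1)))
  }, numeric(1))

  # refit the inner-loop winner on the full outer-training set and
  # score it on the outer assessment set
  best <- lm(forms[[which.min(inner_mae)]], data = train)
  mean(abs(assessment(split)$y - predict(best, newdata = assessment(split))))
}, numeric(1))

mean(outer_mae)   # cross-validated estimate of out-of-sample MAE
```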
The sizes of the data sets are the same as those in the original scripts by the authors. [MLFlow](https://mlflow.org/docs/latest/index.html) is used to keep track of the duration (seconds) of each run along with the implementation and method used.
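The logging itself is just a few calls per run. A minimal sketch, assuming the {mlflow} R package; the experiment name and parameter values here are placeholders, not the repo's actual ones:

```r
library(mlflow)

mlflow_set_experiment(experiment_name = "nested-cv-duration")

mlflow_start_run()
mlflow_log_param("implementation", "ranger")
mlflow_log_param("method", "kuhn-johnson")

start <- Sys.time()
# ... run the nested cross-validation for this implementation/method ...
duration <- as.numeric(difftime(Sys.time(), start, units = "secs"))

mlflow_log_metric("duration", duration)
mlflow_end_run()
```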
Various elements of the technique can be altered to improve performance. These include:
1. Hyperparameter value grids
2. Outer-Loop CV strategy
3. Inner-Loop CV strategy
4. Grid search strategy
These elements also affect the run times. Both methods use the same size grids, but Kuhn-Johnson uses repeats and more folds in the outer and inner loops, while Raschka's trains an extra model over the entire training set at the end. Using Kuhn-Johnson, 50,000 models (grid size * number of repeats * number of folds in the outer loop * number of folds/resamples in the inner loop) are trained for each algorithm; using Raschka's, 1,001 models.
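The counts are straightforward arithmetic. A quick sanity check, assuming Kuhn-Johnson's settings are 2 repeats, 10 outer folds, and 25 inner resamples (only Raschka's 5x2 folds are listed above, so the Kuhn-Johnson numbers are inferred from the 50,000 total):

```r
grid_size <- 100   # both methods use 100-row hyperparameter grids

# Kuhn-Johnson: grid * repeats * outer folds * inner resamples
kj_models <- grid_size * 2 * 10 * 25
kj_models
#> [1] 50000

# Raschka: grid * outer folds * inner folds, plus one final model
# trained on the entire training set
raschka_models <- grid_size * 5 * 2 + 1
raschka_models
#> [1] 1001
```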
## Performance Experiment
#### Experiment details:
* The fastest implementation of each method is used to run a nested cross-validation on data sets ranging from 100 to 5000 observations, with different numbers of repeats of the outer-loop CV strategy.
* The {mlr3} implementation is the fastest for Raschka's method, but the Ranger-Kuhn-Johnson implementation is close. To simplify, I'm using [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R) for both methods.
* The chosen algorithm and hyperparameters are used to predict on a 100K-row simulated dataset.
* The percent error between the average mean absolute error (MAE) across the outer-loop folds and the MAE of the predictions on this 100K dataset is calculated for each combination of repeat, data size, and method.
* To make this experiment manageable in terms of runtimes, I'm using AWS instances: an r5.2xlarge for the Elastic Net and an r5.24xlarge for the Random Forest.
* Iterating through different numbers of repeats, sample sizes, and methods makes a functional approach more appropriate than running imperative scripts. Given the long runtimes and the impermanent nature of my internet connection, it's also nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm using it to orchestrate. A minimal sketch of the plan follows.
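This is a sketch of what such a {drake} plan can look like, not the repo's actual plan; `run_ncv()` is a hypothetical wrapper around the Ranger-Kuhn-Johnson script, and the repeat values are illustrative.

```r
library(drake)

plan <- drake_plan(
  perc_error = target(
    run_ncv(n = n, repeats = r),   # hypothetical wrapper; returns percent error
    transform = cross(n = c(100, 800, 2000, 5000), r = c(1, 2, 3))
  )
)

# every finished target is cached, so if the connection drops,
# make() picks up where the last run left off
make(plan)
```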
* Runtimes for n = 100 and n = 800 are close, and there's a large jump in runtime going from n = 2000 to n = 5000.
* The number of repeats has little effect on the amount of percent error.
* For n = 100, there is substantially more variation in percent error than in the other sample sizes.
* While increasing the sample size from 2000 to 5000 observations comes with a large runtime cost, it doesn't seem to provide any benefit in the form of a more accurate estimate of the out-of-sample error.