---
output: github_document
---

# Nested Cross-Validation: Comparing Methods and Implementations

Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to handle both hyperparameter tuning and algorithm comparison. Using standard methods such as k-fold cross-validation in such situations results in significant increases in optimization bias. Nested cross-validation has been shown to produce low bias in out-of-sample error estimates even using datasets with only a few hundred rows.
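
To make the procedure concrete, below is a minimal sketch of nested cross-validation for a single algorithm, written with base R and {ranger}. The fold counts, the small `mtry` grid, the `y` outcome column, and the choice of MAE are illustrative assumptions rather than the exact settings of any implementation compared here.

```r
# A minimal nested CV sketch -- all settings here are assumed for illustration
library(ranger)

nested_cv_mae <- function(dat, outer_k = 5, inner_k = 5, mtry_grid = c(2, 4, 6)) {
  # Randomly assign every row to an outer fold
  outer_folds <- sample(rep(seq_len(outer_k), length.out = nrow(dat)))
  outer_mae   <- numeric(outer_k)

  for (i in seq_len(outer_k)) {
    train <- dat[outer_folds != i, ]
    test  <- dat[outer_folds == i, ]

    # Inner loop: pick mtry by cross-validated MAE using the outer-training data only
    inner_folds <- sample(rep(seq_len(inner_k), length.out = nrow(train)))
    grid_mae <- vapply(mtry_grid, function(m) {
      mean(vapply(seq_len(inner_k), function(j) {
        fit  <- ranger(y ~ ., data = train[inner_folds != j, ], mtry = m, num.trees = 300)
        pred <- predict(fit, data = train[inner_folds == j, ])$predictions
        mean(abs(pred - train$y[inner_folds == j]))
      }, numeric(1)))
    }, numeric(1))

    # Refit the winning configuration on the full outer-training set and
    # score it once on the untouched outer test fold
    fit  <- ranger(y ~ ., data = train, mtry = mtry_grid[which.min(grid_mae)], num.trees = 300)
    pred <- predict(fit, data = test)$predictions
    outer_mae[i] <- mean(abs(pred - test$y))
  }

  mean(outer_mae)  # the out-of-sample error estimate
}
```

Re-running this whole procedure several times with fresh random fold assignments and averaging the estimates is what the *repeats* in question 2 below refer to.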

The primary issue with this technique is that it is computationally very expensive, with potentially tens of thousands of models being trained during the process (a rough count is sketched after the two questions below). This experiment seeks to answer two questions:

1. Which implementation is fastest?
2. How many *repeats*, given the size of the training set, should we expect to need in order to obtain a reasonably accurate out-of-sample error estimate?
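
For a rough sense of the cost, here is one way the model count can add up; the fold counts, grid size, and number of repeats are assumptions chosen only for illustration.

```r
# Assumed configuration: 10-fold outer CV, 10-fold inner CV,
# 100 hyperparameter candidates, and 3 repeats of the outer loop
outer_folds <- 10
inner_folds <- 10
grid_size   <- 100
repeats     <- 3

# Inner-loop fits plus one final refit per outer fold, per repeat
repeats * outer_folds * (inner_folds * grid_size + 1)  # 30,030 model fits for one algorithm
```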

Duration experiment details:

(The sizes of the data sets are the same as those in the original scripts by the authors.)

Various elements of the technique can be altered to improve performance. These include:

1. Hyperparameter value grids
2. Outer-Loop CV strategy
3. Inner-Loop CV strategy
4. Grid search strategy
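
As an example of where these knobs live in code, the objects below show one way they could be expressed with {rsample} and {dials}; the particular resampling schemes, parameter ranges, and grid sizes are assumptions for illustration.

```r
library(rsample)
library(dials)

# 1. Hyperparameter value grid (here a regular grid over two Random Forest parameters)
rf_grid <- grid_regular(mtry(range = c(2L, 8L)), min_n(), levels = 5)

# 2. Outer-loop CV strategy, e.g. repeated k-fold
outer_rs <- vfold_cv(mtcars, v = 10, repeats = 3)

# 3. Inner-loop CV strategy, e.g. bootstrap resampling instead of k-fold
inner_rs <- bootstraps(mtcars, times = 25)

# 4. Grid search strategy: exhaustive search over rf_grid, or a random grid instead
rf_grid_random <- grid_random(mtry(range = c(2L, 8L)), min_n(), size = 20)
```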

For the performance experiment (question 2), I'll vary the number of repeats of the outer-loop CV strategy for each method. The fastest implementation of each method will be tuned on data sets ranging in size from 100 to 5,000 observations, and the mean absolute error will be calculated for each combination of repeat, data size, and method.
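
A sketch of what that measurement loop could look like is below, reusing the `nested_cv_mae()` helper sketched earlier; `sim_data()` is an assumed stand-in for the real data sets, and the sizes and repeat counts shown are not the exact experimental grid.

```r
# Assumed stand-in for the real data sets: a simple simulated regression problem
sim_data <- function(n, p = 10) {
  x <- matrix(rnorm(n * p), n, p)
  data.frame(y = as.vector(x %*% rnorm(p)) + rnorm(n), x)
}

# Assumed grid of experimental conditions
experiment <- expand.grid(n = c(100, 800, 2000, 5000), repeats = 1:5)

# For each condition: draw a data set, repeat the nested CV that many times,
# and record the averaged out-of-sample MAE estimate.
# nested_cv_mae() is the helper sketched earlier in this README.
experiment$mae <- mapply(function(n, r) {
  dat <- sim_data(n)
  mean(replicate(r, nested_cv_mae(dat)))
}, experiment$n, experiment$repeats)
```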

Notes:

1. I'm using a 4-core, 16 GB RAM machine.
2. "parsnip" refers to scripts where both the Elastic Net and Ranger Random Forest model functions come from {parsnip}.
3. "ranger" means the Random Forest model function that's used comes directly from the {ranger} package (the sketch after these notes contrasts the two).
4. In "sklearn", the Random Forest model function comes from scikit-learn.
5. "ranger-kj" uses all the Kuhn-Johnson loop functions and the {ranger} Random Forest model function to execute Raschka's method.