Commit 68fa371

Author: ercbk
Message: readme update
Parent: ea09360

File tree: 7 files changed, +102 −37 lines

.gitignore

Lines changed: 1 addition & 2 deletions

@@ -2,5 +2,4 @@
 .Rhistory
 .RData
 .Ruserdata
-.env
-mlruns
+.env

README.Rmd

Lines changed: 43 additions & 4 deletions

@@ -2,11 +2,12 @@
 output: github_document
 ---
 
-# Nested Cross-Validation: Comparing Methods and Implementations
+# Nested Cross-Validation: Comparing Methods and Implementations
 
 Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to handle both hyperparameter tuning and algorithm comparison. Using standard methods such as k-fold cross-validation in such situations results in significant increases in optimization bias. Nested cross-validation has been shown to produce low bias in out-of-sample error estimates even when using datasets with only a few hundred rows.
 
-The primary issue with this technique is that it is computationally very expensive, with potentially tens of thousands of models being trained during the process. This experiment seeks to answer two questions:
+The primary issue with this technique is that it is computationally very expensive, with potentially tens of thousands of models being trained during the process. This experiment seeks to answer two questions:
+
 1. Which implementation is fastest?
 2. How many *repeats*, given the size of the training set, should we expect to need to obtain a reasonably accurate out-of-sample error estimate?
 
@@ -29,20 +30,58 @@ Duration experiment details:
 
 (Sizes of the data sets are the same as those in the original scripts by the authors)
 
-Various elements of the technique can be altered to improve performance. These include:
+Various elements of the technique can be altered to improve performance. These include:
+
 1. Hyperparameter value grids
 2. Outer-Loop CV strategy
 3. Inner-Loop CV strategy
 4. Grid search strategy
 
 For the performance experiment (question 2), I'll be varying the repeats of the outer-loop CV strategy for each method. The fastest implementation of each method will be tuned with different sizes of data ranging from 100 to 5000 observations. The mean absolute error will be calculated for each combination of repeat, data size, and method.
 
-I'm using a 4 core, 16 GB RAM machine.
+Notes:
+
+1. I'm using a 4 core, 16 GB RAM machine.
+2. "parsnip" refers to scripts where both the Elastic Net and Ranger Random Forest model functions come from {parsnip}.
+3. "ranger" means the Random Forest model function that's used comes directly from the {ranger} package.
+4. In "sklearn", the Random Forest model function comes from scikit-learn.
+5. "ranger-kj" uses all the Kuhn-Johnson loop functions and the {ranger} Random Forest model function to execute Raschka's method.
+
+
 
 Progress (duration in seconds)
 
 ![](duration-experiment/outputs/0225-results.png)
 
+
+```{r, echo=FALSE, eval=FALSE, message=FALSE}
+library(dplyr, quietly = TRUE)
+library(echarts4r, quietly = TRUE)
+
+runs <- readr::read_rds("data/duration-runs.rds")
+
+e_common(
+  font_family = "Roboto Medium",
+  theme = NULL
+)
+
+runs %>%
+  group_by(method) %>%
+  arrange(duration) %>%
+  mutate(duration = round(duration / 60, 2)) %>%
+  e_charts(implementation) %>%
+  e_bar(duration) %>%
+  e_flip_coords() %>%
+  e_tooltip() %>%
+  e_legend() %>%
+  e_title("Duration", "minutes") %>%
+  e_theme_custom('{"color":["#195198","#BD9865"], "backgroundColor": "ivory"}')
+```
+
+
+
 References
 
 Boulesteix, AL, and C Strobl. 2009. "Optimal Classifier Selection and Negative Bias in Error Rate Estimation: An Empirical Study on High-Dimensional Prediction." BMC Medical Research Methodology 9 (1): 85. [link](https://www.researchgate.net/publication/40756303_Optimal_classifier_selection_and_negative_bias_in_error_rate_estimation_An_empirical_study_on_high-dimensional_prediction)
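
For readers new to the technique the README describes, here is a minimal sketch of the nested resampling structure. It uses {rsample}'s `nested_cv()` purely for illustration (an assumption on my part; the experiment's scripts implement their own loop functions), with `mtcars` standing in for the experiment's datasets:

```r
# Illustration only: nested resampling with {rsample}, not the repo's code.
library(rsample)

set.seed(2020)
folds <- nested_cv(
  mtcars,                                  # placeholder dataset
  outside = vfold_cv(v = 5, repeats = 2),  # outer loop: out-of-sample error estimation
  inside  = vfold_cv(v = 5)                # inner loop: hyperparameter tuning
)

# Each outer split's analysis set carries its own inner splits, so
# hyperparameters tuned on the inner folds are scored on outer assessment
# data that the tuning never touched.
folds
```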

README.md

Lines changed: 23 additions & 11 deletions

@@ -11,11 +11,12 @@ using datasets with only a few hundred rows.
 
 The primary issue with this technique is that it is computationally very
 expensive, with potentially tens of thousands of models being trained during
-the process. This experiment seeks to answer two questions:
-1\. Which implementation is fastest?
-2\. How many *repeats*, given the size of the training set, should we
-expect to need to obtain a reasonably accurate out-of-sample error
-estimate?
+the process. This experiment seeks to answer two questions:
+
+1. Which implementation is fastest?
+2. How many *repeats*, given the size of the training set, should we
+   expect to need to obtain a reasonably accurate out-of-sample error
+   estimate?
 
 While researching this technique, I found two *methods* of performing
 nested cross-validation — one authored by [Sebastian
@@ -44,19 +45,30 @@ Duration experiment details:
 the authors)
 
 Various elements of the technique can be altered to improve performance.
-These include:
-1\. Hyperparameter value grids
-2\. Outer-Loop CV strategy
-3\. Inner-Loop CV strategy
-4\. Grid search strategy
+These include:
+
+1. Hyperparameter value grids
+2. Outer-Loop CV strategy
+3. Inner-Loop CV strategy
+4. Grid search strategy
 
 For the performance experiment (question 2), I’ll be varying the repeats
 of the outer-loop CV strategy for each method. The fastest
 implementation of each method will be tuned with different sizes of data
 ranging from 100 to 5000 observations. The mean absolute error will be
 calculated for each combination of repeat, data size, and method.
 
-I’m using a 4 core, 16 GB RAM machine.
+Notes:
+
+1. I’m using a 4 core, 16 GB RAM machine.
+2. “parsnip” refers to scripts where both the Elastic Net and Ranger
+   Random Forest model functions come from {parsnip}.
+3. “ranger” means the Random Forest model function that’s used comes
+   directly from the {ranger} package.
+4. In “sklearn”, the Random Forest model function comes from
+   scikit-learn.
+5. “ranger-kj” uses all the Kuhn-Johnson loop functions and the
+   {ranger} Random Forest model function to execute Raschka’s method.
 
 Progress (duration in seconds)
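
To make the question-2 summary concrete, a hedged sketch of the per-combination MAE calculation follows. The data frame and every column name (`perf_runs`, `method`, `n_train`, `repeats`, `cv_error`, `holdout_error`) are hypothetical stand-ins, not objects from this repo:

```r
# Sketch only: MAE of the nested-CV error estimate against a holdout
# benchmark, per method x training-set size x repeat count.
# All object and column names here are hypothetical.
library(dplyr)

mae_tbl <- perf_runs %>%
  group_by(method, n_train, repeats) %>%
  summarize(mae = mean(abs(cv_error - holdout_error))) %>%
  ungroup()
```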

data/duration-runs.rds

1.39 KB
Binary file not shown.

duration-experiment/outputs/0224-runs.csv

Lines changed: 0 additions & 10 deletions
This file was deleted.

duration-experiment/outputs/0225-runs.csv

Lines changed: 0 additions & 10 deletions
This file was deleted.

renv.lock

Lines changed: 35 additions & 0 deletions

@@ -245,6 +245,20 @@
       "Repository": "CRAN",
       "Hash": "98ca919385a634e5d558e6938755e0bf"
     },
+    "corrplot": {
+      "Package": "corrplot",
+      "Version": "0.84",
+      "Source": "Repository",
+      "Repository": "CRAN",
+      "Hash": "b55c32ae818a84109a51f172290c95f2"
+    },
+    "countrycode": {
+      "Package": "countrycode",
+      "Version": "1.1.1",
+      "Source": "Repository",
+      "Repository": "CRAN",
+      "Hash": "947b61a2a21b5a50af567b591b845f72"
+    },
     "crayon": {
       "Package": "crayon",
       "Version": "1.3.4",
@@ -266,13 +280,27 @@
       "Repository": "CRAN",
       "Hash": "2b7d10581cc730804e9ed178c8374bd6"
     },
+    "d3r": {
+      "Package": "d3r",
+      "Version": "0.8.7",
+      "Source": "Repository",
+      "Repository": "CRAN",
+      "Hash": "4c1677c45eb1dff74f3863e773a8b26a"
+    },
     "data.table": {
       "Package": "data.table",
       "Version": "1.12.8",
       "Source": "Repository",
       "Repository": "CRAN",
       "Hash": "cd711af60c47207a776213a368626369"
     },
+    "data.tree": {
+      "Package": "data.tree",
+      "Version": "0.7.11",
+      "Source": "Repository",
+      "Repository": "CRAN",
+      "Hash": "9087f2826e50c659ba54ade20d4c8676"
+    },
     "desc": {
       "Package": "desc",
       "Version": "1.2.0",
@@ -329,6 +357,13 @@
       "Repository": "CRAN",
       "Hash": "716869fffc16e282c118f8894e082a7d"
     },
+    "echarts4r": {
+      "Package": "echarts4r",
+      "Version": "0.2.3",
+      "Source": "Repository",
+      "Repository": "CRAN",
+      "Hash": "2604014e6b28deb9dc2be4062c96a58a"
+    },
     "ellipsis": {
       "Package": "ellipsis",
       "Version": "0.3.0",
