
Commit 863b592

Author: ercbk (committed)
fixed runtime chart bar labels; added performance results csv; included results section in readme
1 parent bb2ccad commit 863b592

File tree

6 files changed: +92 -39 lines changed


.gitignore

Lines changed: 2 additions & 1 deletion
@@ -6,4 +6,5 @@
.drake
ec2-ssh-raw.log
README_cache
-check-results.R
+check-results.R
+perf-exp-output-backup.rds

README.Rmd

Lines changed: 30 additions & 15 deletions
@@ -7,7 +7,7 @@ output: github_document

![](images/ncv.png)

-Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to simultaneously handle hyperparameter tuning and algorithm comparison. Examples of such situations include: proof of concept, start-ups, medical studies, time series, etc. Using standard methods such as k-fold cross-validation in these cases may result in significant increases in optimization bias. Nested cross-validation has been shown to produce low bias, out-of-sample error estimates even using datasets with only hundreds of rows and therefore gives a better judgement of generalization performance.
+Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to simultaneously handle hyperparameter tuning and algorithm comparison. Examples of such situations include: proof of concept, start-ups, medical studies, time series, etc. Using standard methods such as k-fold cross-validation in these cases may result in substantial increases in optimization bias. Nested cross-validation has been shown to produce less biased, out-of-sample error estimates even using datasets with only hundreds of rows and therefore gives a better judgement of generalization performance.

The primary issue with this technique is that it is computationally very expensive, with potentially tens of thousands of models being trained during the process. While researching this technique, I found two slightly different methods of performing nested cross-validation — one authored by [Sebastian Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb) and the other by [Max Kuhn and Kjell Johnson](https://tidymodels.github.io/rsample/articles/Applications/Nested_Resampling.html).
I'll be examining two aspects of nested cross-validation:
@@ -17,7 +17,7 @@ I'll be examining two aspects of nested cross-validation:


## Duration Experiment
-Experiment details:
+##### Experiment details:

* Random Forest and Elastic Net Regression algorithms
* Both with 100x2 hyperparameter grids
@@ -40,9 +40,9 @@ Various elements of the technique can be altered to improve performance. These i
3. Inner-Loop CV strategy
4. Grid search strategy

-These elements also affect the run times. Both methods will be using the same size grids, but Kuhn-Johnson uses repeats and more folds in the outer and inner loops while Raschka's trains an extra model over the entire training set at the end at the end. Using Kuhn-Johnson, 50,000 models will be trained for each algorithm — using Raschka's, 1,001 models.
+These elements also affect the run times. Both methods will be using the same size grids, but Kuhn-Johnson uses repeats and more folds in the outer and inner loops while Raschka's trains an extra model over the entire training set at the end. Using Kuhn-Johnson, 50,000 models (grid size * number of repeats * number of folds in the outer-loop * number of folds/resamples in the inner-loop) will be trained for each algorithm — using Raschka's, 1,001 models.

-MLFlow was used to keep track of the duration (seconds) of each run along with the implementation and method used. I've used implementation to describe the various changes in coding structures that accompanies using each package's functions. A couple examples are the python for-loop being replaced with a while-loop and `iter_next` function when using {reticulate} and {mlr3} entirely using R's R6 Object Oriented Programming system.
+[MLFlow](https://mlflow.org/docs/latest/index.html) was used to keep track of the duration (seconds) of each run along with the implementation and method used. I've used "implementation" to encapsulate not only the combinations of various model functions, but also the various changes in coding structure that accompany using each package's functions, i.e. I can't just plug-and-play different packages' model functions into the same script.

![](duration-experiment/outputs/0225-results.png)
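
The model counts quoted above can be sanity-checked with some quick arithmetic. The repeat/fold numbers below are assumptions chosen only because they are consistent with the 50,000 and 1,001 totals (and the 100-row grids) stated in that paragraph; the experiments' exact settings live in the repo's scripts.

```r
grid_size <- 100                          # 100-row hyperparameter grid per algorithm

# Kuhn-Johnson: grid size * repeats * outer-loop folds * inner-loop resamples
kj_models <- grid_size * 2 * 10 * 25      # assumed 2 repeats, 10 outer folds, 25 inner resamples
kj_models
#> [1] 50000

# Raschka: grid size * outer-loop folds * inner-loop folds, plus one final
# model fit on the entire training set
raschka_models <- grid_size * 5 * 2 + 1   # assumed 5 outer folds, 2 inner folds
raschka_models
#> [1] 1001
```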

@@ -53,7 +53,7 @@ pacman::p_load(extrafont, dplyr, ggplot2, patchwork, stringr, tidytext)



-runs_raw <- readr::read_rds("data/duration-runs.rds")
+runs_raw <- readr::read_rds("duration-experiment/outputs/duration-runs.rds")

@@ -108,18 +108,18 @@ durations

## Performance Experiment

-Experiment details:
+##### Experiment details:

-* The fastest implementation of each method will be used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy.
+* The fastest implementation of each method was used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy.
* The {mlr3} implementation was the fastest for Raschka's method, but the Ranger-Kuhn-Johnson implementation was close. To simplify, I'll be using [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R) for both methods.
-* The chosen algorithm and hyperparameters will be used to predict on a 100K row simulated dataset.
-* The percent error between the the average mean absolute error (MAE) across the outer-loop folds and the MAE of the predictions on this 100K dataset will be calculated for each combination of repeat, data size, and method.
-* To make this experiment manageable in terms of runtimes, I'm using AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge for Random Forest.
-* Iterating through different numbers of repeats, sample sizes, and methods makes a functional approach more appropriate than running imperative scripts. Also, given the long runtimes and impermanent nature of my internet connection, it would also be nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm using it to orchestrate.
+* The chosen algorithm and hyperparameters were used to predict on a 100K row simulated dataset.
+* The percent error between the average mean absolute error (MAE) across the outer-loop folds and the MAE of the predictions on this 100K dataset was calculated for each combination of repeat, data size, and method.
+* To make this experiment manageable in terms of runtimes, I used AWS instances: an r5.2xlarge for the Elastic Net and an r5.24xlarge for Random Forest.
+* Iterating through different numbers of repeats, sample sizes, and methods makes a functional approach more appropriate than running imperative scripts. Also, given the long runtimes and impermanent nature of my internet connection, it would also be nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I used it to orchestrate (a minimal sketch follows this list).
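
For readers unfamiliar with {drake}, here is a minimal, hypothetical sketch of the kind of plan that last bullet describes. `run_ncv()` is a stand-in for the repo's nested cross-validation function, and the target layout is illustrative rather than the repo's actual plan; the point is that each n/repeats combination becomes its own cached target, so an interrupted run can resume where it left off.

```r
library(drake)

plan <- drake_plan(
  ncv_results = target(
    run_ncv(n = n, repeats = repeats),   # run_ncv() is a hypothetical stand-in
    transform = cross(n = c(100, 800, 2000, 5000), repeats = c(1, 2, 3, 4, 5))
  )
)

make(plan)   # builds and caches each sub-target as it finishes

# elapsed build times per sub-target, as used in the chunk below
build_times(starts_with("ncv_results"), digits = 4)
```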

-```{r perf_build_times, echo=FALSE, message=FALSE, cache=FALSE}
+```{r perf_build_times, echo=FALSE, message=FALSE}

-pacman::p_load(extrafont,dplyr, purrr, lubridate, ggplot2, drake)
+pacman::p_load(extrafont, dplyr, purrr, lubridate, ggplot2, ggfittext, drake)
bt <- build_times(starts_with("ncv_results"), digits = 4)

subtarget_bts <- bt %>%
@@ -140,6 +140,9 @@ subtargets <- subtargets_raw %>%
elapsed = round(as.numeric(elapsed)/3600, 2),
percent_error = round(delta_error/oos_error, 3))

+readr::write_csv(subtargets, "performance-experiment/output/perf-exp-output.csv")
+# readr::write_rds(subtargets, "performance-experiment/output/perf-exp-output-backup.rds")
+
```

```{r perf_bt_charts, echo=FALSE, message=FALSE}
@@ -150,8 +153,10 @@ ggplot(subtargets, aes(y = elapsed, x = repeats,
fill = n, label = elapsed)) +
geom_col(position = position_dodge(width = 0.85)) +
scale_fill_manual(values = fill_colors[4:7]) +
-geom_text(hjust = 1.3, size = 3.5,
-color = "white", position = position_dodge(width = 0.85)) +
+# geom_text(hjust = 1.3, size = 3.5,
+# color = "white", position = position_dodge(width = 0.85)) +
+geom_bar_text(position = "dodge", min.size = 3.5,
+place = "right", contrast = TRUE) +
coord_flip() +
labs(y = "Runtime (hrs)", x = "Repeats",
title = "Kuhn-Johnson", fill = "Sample Size") +
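
Since the bar-label fix is the headline change in this commit, here is a small self-contained example (toy data with values borrowed from the results csv below, not the actual chart code) showing the {ggfittext} call that replaces `geom_text()`: `geom_bar_text()` shrinks or drops labels that don't fit and flips their colour for contrast on dodged bars.

```r
library(ggplot2)
library(ggfittext)

toy <- data.frame(
  repeats = factor(rep(1:2, each = 2)),
  n       = factor(rep(c(100, 5000), times = 2)),
  elapsed = c(0.15, 2.23, 0.40, 4.46)
)

ggplot(toy, aes(x = repeats, y = elapsed, fill = n, label = elapsed)) +
  geom_col(position = position_dodge(width = 0.85)) +
  geom_bar_text(position = "dodge", min.size = 3.5,
                place = "right", contrast = TRUE) +
  coord_flip() +
  labs(y = "Runtime (hrs)", x = "Repeats", fill = "Sample Size")
```
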
@@ -196,6 +201,16 @@ ggplot(subtargets, aes(x = repeats, y = percent_error, group = n)) +
)
```

+##### Results:
+
+Kuhn-Johnson:
+
+* Runtimes for n = 100 and n = 800 are close, and there's a large jump in runtime going from n = 2000 to n = 5000.
+* The number of repeats had little effect on the amount of percent error.
+* For n = 100, there is substantially more variation in percent error than in the other sample sizes.
+* While there is a large runtime cost that comes with increasing the sample size from 2000 to 5000 observations, it doesn't seem to provide any benefit in gaining a more accurate estimate of the out-of-sample error.
+


References

README.md

Lines changed: 39 additions & 23 deletions
@@ -10,9 +10,9 @@ situations in which the size of our dataset is insufficient to
simultaneously handle hyperparameter tuning and algorithm comparison.
Examples of such situations include: proof of concept, start-ups,
medical studies, time series, etc. Using standard methods such as k-fold
-cross-validation in these cases may result in significant increases in
-optimization bias. Nested cross-validation has been shown to produce low
-bias, out-of-sample error estimates even using datasets with only
+cross-validation in these cases may result in substantial increases in
+optimization bias. Nested cross-validation has been shown to produce
+less biased, out-of-sample error estimates even using datasets with only
hundreds of rows and therefore gives a better judgement of
generalization performance.
@@ -35,7 +35,7 @@ I’ll be examining two aspects of nested cross-validation:

## Duration Experiment

-Experiment details:
+##### Experiment details:

  - Random Forest and Elastic Net Regression algorithms
  - Both with 100x2 hyperparameter grids
@@ -63,16 +63,17 @@ These elements also affect the run times. Both methods will be using the
same size grids, but Kuhn-Johnson uses repeats and more folds in the
outer and inner loops while Raschka’s trains an extra model over the
entire training set at the end. Using Kuhn-Johnson, 50,000
-models will be trained for each algorithm — using Raschka’s, 1,001
-models.
-
-MLFlow was used to keep track of the duration (seconds) of each run
-along with the implementation and method used. I’ve used implementation
-to describe the various changes in coding structures that accompanies
-using each package’s functions. A couple examples are the python
-for-loop being replaced with a while-loop and `iter_next` function when
-using {reticulate} and {mlr3} entirely using R’s R6 Object Oriented
-Programming system.
+models (grid size \* number of repeats \* number of folds in the
+outer-loop \* number of folds/resamples in the inner-loop) will be
+trained for each algorithm — using Raschka’s, 1,001 models.
+
+[MLFlow](https://mlflow.org/docs/latest/index.html) was used to keep
+track of the duration (seconds) of each run along with the
+implementation and method used. I’ve used “implementation” to
+encapsulate not only the combinations of various model functions, but
+also the various changes in coding structure that accompany using each
+package’s functions, i.e. I can’t just plug-and-play different
+packages’ model functions into the same script.

![](duration-experiment/outputs/0225-results.png)
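
A hedged sketch of the kind of MLFlow logging described above, using the {mlflow} R API; the parameter names and values here are illustrative, and the repo's actual logging code may differ.

```r
library(mlflow)

run <- mlflow_start_run()

mlflow_log_param("implementation", "ranger")       # illustrative labels
mlflow_log_param("method", "kuhn-johnson")

duration <- system.time(
  Sys.sleep(1)                                     # stand-in for one nested-cv run
)[["elapsed"]]

mlflow_log_metric("duration", duration)
mlflow_end_run()
```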

@@ -82,9 +83,9 @@ Programming system.

## Performance Experiment

-Experiment details:
+##### Experiment details:

-  - The fastest implementation of each method will be used in running a
+  - The fastest implementation of each method was used in running a
    nested cross-validation with different sizes of data ranging from
    100 to 5000 observations and different numbers of repeats of the
    outer-loop cv strategy.
@@ -93,27 +94,42 @@ Experiment details:
    simplify, I’ll be using
    [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R)
    for both methods.
-  - The chosen algorithm and hyperparameters will be used to predict on
-    a 100K row simulated dataset.
+  - The chosen algorithm and hyperparameters were used to predict on a
+    100K row simulated dataset.
  - The percent error between the average mean absolute error (MAE)
    across the outer-loop folds and the MAE of the predictions on this
-    100K dataset will be calculated for each combination of repeat, data
+    100K dataset was calculated for each combination of repeat, data
    size, and method.
-  - To make this experiment manageable in terms of runtimes, I’m using
-    AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge
-    for Random Forest.
+  - To make this experiment manageable in terms of runtimes, I used AWS
+    instances: an r5.2xlarge for the Elastic Net and an r5.24xlarge for
+    Random Forest.
  - Iterating through different numbers of repeats, sample sizes, and
    methods makes a functional approach more appropriate than running
    imperative scripts. Also, given the long runtimes and impermanent
    nature of my internet connection, it would also be nice to cache
    each iteration as it finishes. The
    [{drake}](https://github.com/ropensci/drake) package is superb on
-    both counts, so I’m using it to orchestrate.
+    both counts, so I used it to orchestrate.

![](README_files/figure-gfm/perf_bt_charts-1.png)<!-- -->

![](README_files/figure-gfm/perf-error-line-1.png)<!-- -->

+##### Results:
+
+Kuhn-Johnson:
+
+  - Runtimes for n = 100 and n = 800 are close, and there’s a large jump
+    in runtime going from n = 2000 to n = 5000.
+  - The number of repeats had little effect on the amount of percent
+    error.
+  - For n = 100, there is substantially more variation in percent error
+    than in the other sample sizes.
+  - While there is a large runtime cost that comes with increasing the
+    sample size from 2000 to 5000 observations, it doesn’t seem to
+    provide any benefit in gaining a more accurate estimate of the
+    out-of-sample error.
+
References

Boulesteix, AL, and C Strobl. 2009. “Optimal Classifier Selection and
(binary image file changed, 5.41 KB)
File renamed without changes.
performance-experiment/output/perf-exp-output.csv (new file; path per the write_csv() call in README.Rmd)

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+subtarget,n,repeats,method,oos_error,ncv_error,delta_error,chosen_algorithm,mixture,penalty,mtry,trees,elapsed,percent_error
+ncv_results_100_0108d912,100,5,kj,2.19359,2.01424,0.1793499999999999,glmnet,0.50424303883221,0.2211151988375703,NA,NA,1.36,0.082
+ncv_results_100_7aaa57d2,100,1,kj,2.19359,2.04781,0.1457799999999998,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.15,0.066
+ncv_results_100_97e7fe04,100,2,kj,2.19359,1.99077,0.20282,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.4,0.092
+ncv_results_100_9d044993,100,4,kj,2.19359,1.99643,0.19716,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.97,0.09
+ncv_results_100_ea11bf8d,100,3,kj,2.19262,2.01702,0.17559999999999976,glmnet,0.5809470646083355,0.16010254880830843,NA,NA,0.65,0.08
+ncv_results_2000_47742c31,2000,4,kj,1.38697,1.37171,0.015260000000000051,rf,NA,NA,5,1779,2.96,0.011
+ncv_results_2000_746435d6,2000,5,kj,1.39092,1.37625,0.01466999999999996,rf,NA,NA,5,1779,3.71,0.011
+ncv_results_2000_7d80d14d,2000,1,kj,1.38466,1.36553,0.01913000000000009,rf,NA,NA,5,1948,0.74,0.014
+ncv_results_2000_80d2e33a,2000,3,kj,1.38955,1.3711,0.018450000000000077,rf,NA,NA,5,1948,2.22,0.013
+ncv_results_2000_c16e9aff,2000,2,kj,1.38739,1.37015,0.017239999999999922,rf,NA,NA,5,1948,1.48,0.012
+ncv_results_5000_20d7ace1,5000,4,kj,1.24192,1.25837,0.016450000000000076,rf,NA,NA,5,1573,8.92,0.013
+ncv_results_5000_2a916af4,5000,5,kj,1.24272,1.25644,0.013719999999999954,rf,NA,NA,5,1664,11.13,0.011
+ncv_results_5000_7b1fdb55,5000,2,kj,1.24336,1.2612,0.017840000000000078,rf,NA,NA,5,1351,4.46,0.014
+ncv_results_5000_7b6f8e72,5000,1,kj,1.24304,1.25709,0.014050000000000118,rf,NA,NA,5,1664,2.23,0.011
+ncv_results_5000_d380966a,5000,3,kj,1.24267,1.25724,0.014569999999999972,rf,NA,NA,5,1365,6.69,0.012
+ncv_results_800_3b54c7f8,800,1,kj,1.63668,1.58422,0.05245999999999995,rf,NA,NA,6,1507,0.26,0.032
+ncv_results_800_3f87e120,800,2,kj,1.6333,1.58689,0.04641000000000006,rf,NA,NA,6,1168,0.51,0.028
+ncv_results_800_50b46544,800,4,kj,1.63707,1.58522,0.05184999999999995,rf,NA,NA,6,1693,1.09,0.032
+ncv_results_800_589454bb,800,3,kj,1.63456,1.5905,0.04405999999999999,rf,NA,NA,6,1168,0.76,0.027
+ncv_results_800_a2c27fe0,800,5,kj,1.63489,1.58745,0.04743999999999993,rf,NA,NA,6,1507,1.52,0.029
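
As a quick sanity check on the table above: `percent_error` is just `delta_error / oos_error` rounded to three digits, the same calculation that appears in the README.Rmd chunk earlier in this commit. Assuming the file is read from the path used by `write_csv()` above:

```r
library(dplyr)

results <- readr::read_csv("performance-experiment/output/perf-exp-output.csv")

results %>%
  mutate(percent_error_check = round(delta_error / oos_error, 3)) %>%
  select(subtarget, n, repeats, oos_error, delta_error, percent_error, percent_error_check)
```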
