
Commit 9fa0b2f

Author: ercbk
Committed: readme edits, finished perf exp n=100,800,2000
1 parent 3e9fabc commit 9fa0b2f

File tree

4 files changed (+141, -53 lines)


README.Rmd

Lines changed: 68 additions & 10 deletions
@@ -3,18 +3,21 @@ output: github_document
 ---
 
 # Nested Cross-Validation: Comparing Methods and Implementations
+### (In-progress)
 
-Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to simultaneously handle hyperparameter tuning and algorithm comparison. Using standard methods such as k-fold cross-validation in such situations results in significant increases in optimization bias. Nested cross-validation has been shown to produce low bias, out-of-sample error estimates even using datasets with only a few hundred rows and therefore gives a better judgemnet of generalization performance.
+Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to simultaneously handle hyperparameter tuning and algorithm comparison. Examples of such situations include proof of concept, start-ups, medical studies, time series, etc. Using standard methods such as k-fold cross-validation in these cases may result in significant increases in optimization bias. Nested cross-validation has been shown to produce low-bias out-of-sample error estimates even using datasets with only hundreds of rows, and therefore gives a better judgement of generalization performance.
 
-The primary issue with this technique is that it is computationally very expensive with potentially tens of 1000s of models being trained during the process. While researching this technique, I found two methods of performing nested cross-validation — one authored by [Sabastian Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb) and the other by [Max Kuhn and Kjell Johnson](https://tidymodels.github.io/rsample/articles/Applications/Nested_Resampling.html).
-This experiment seeks to answer two questions:
+The primary issue with this technique is that it is computationally very expensive, with potentially tens of thousands of models being trained during the process. While researching this technique, I found two slightly different methods of performing nested cross-validation — one authored by [Sebastian Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb) and the other by [Max Kuhn and Kjell Johnson](https://tidymodels.github.io/rsample/articles/Applications/Nested_Resampling.html).
+I'll be examining two aspects of nested cross-validation:
 
-1. What's the fastest implementation of each method?
-2. How many repeats, given the size of this dataset, should we expect to need to obtain a reasonably accurate out-of-sample error estimate?
+1. Duration: Which packages and functions give us the fastest implementation of each method?
+2. Performance: First, develop a testing framework. Then, using a generated dataset, find how many repeats, given the number of samples, we should expect to need in order to obtain a reasonably accurate out-of-sample error estimate.
 
 With regard to the question of speed, I'll be testing implementations of both methods from various packages, which include {tune}, {mlr3}, {h2o}, and {sklearn}.
 
-Duration experiment details:
+
+## Duration Experiment
+Experiment details:
 
 * Random Forest and Elastic Net Regression algorithms
 * Both with 100x2 hyperparameter grids
@@ -37,11 +40,9 @@ Various elements of the technique can be altered to improve performance. These i
 3. Inner-Loop CV strategy
 4. Grid search strategy
 
-For the performance experiment (question 2), the fastest implementation of each method will be used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy. The chosen algorithm and hyperparameters will predict on a 100K row simulated dataset and the mean absolute error will be calculated for each combination of repeat, data size, and method.
-
-
+These elements also affect the run times. Both methods will be using the same size grids, but Kuhn-Johnson uses repeats and more folds in the outer and inner loops, while Raschka's trains an extra model over the entire training set at the end. Using Kuhn-Johnson, 50,000 models will be trained for each algorithm — using Raschka's, 1,001 models.
 
-Progress (duration in seconds)
+MLFlow was used to keep track of the duration (seconds) of each run along with the implementation and method used. I've used "implementation" to describe the various changes in coding structure that accompany using each package's functions. A couple of examples are the Python for-loop being replaced with a while-loop and the `iter_next` function when using {reticulate}, and {mlr3} relying entirely on R's R6 object-oriented programming system.
 
 ![](duration-experiment/outputs/0225-results.png)
 
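A quick way to see where totals like 50,000 and 1,001 can come from is to write out the resampling structure and multiply. The sketch below builds a Kuhn-Johnson-style object with {rsample}'s `nested_cv()`; the repeat, fold, and grid settings are illustrative assumptions chosen only because they reproduce the quoted totals, not necessarily the settings used in this repo.

```r
library(rsample)

# Kuhn-Johnson-style resampling skeleton (illustrative settings only)
set.seed(2020)
dat <- data.frame(y = rnorm(500), x1 = rnorm(500), x2 = rnorm(500))

ncv <- nested_cv(
  dat,
  outside = vfold_cv(v = 10, repeats = 2),  # outer loop: repeated k-fold
  inside  = bootstraps(times = 25)          # inner loop: bootstrap resamples
)

# Model counts under these assumed settings:
grid_size <- 100
2 * 10 * 25 * grid_size   # Kuhn-Johnson-style inner-loop fits: 50,000
5 * 2 * grid_size + 1     # Raschka-style 5x2cv fits plus one final fit: 1,001
```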
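The kind of tracking described above can be sketched with the {mlflow} R package; the experiment name, parameter keys, and timed expression below are placeholders, not the repo's actual tracking code.

```r
library(mlflow)

mlflow_set_experiment(experiment_name = "nested-cv-duration")  # placeholder name

with(mlflow_start_run(), {
  start <- Sys.time()
  Sys.sleep(1)  # stand-in for a full nested cross-validation run
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))

  # Tag the run with how it was built, then log its duration in seconds
  mlflow_log_param("method", "kuhn-johnson")
  mlflow_log_param("implementation", "tune")
  mlflow_log_metric("duration", elapsed)
})
```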
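The `iter_next` change refers to {reticulate}'s iterator interface: a Python `for` loop over a generator becomes an R `while` loop that pulls items until the iterator is exhausted. A minimal sketch, using a simple built-in Python iterator as a stand-in for an sklearn split generator:

```r
library(reticulate)

# Stand-in for a Python generator (e.g., an sklearn cross-validator's splits)
it <- as_iterator(py_eval("iter(range(3))"))

# The Python `for item in gen:` pattern becomes a while-loop in R
while (!is.null(item <- iter_next(it))) {
  print(item)
}
```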
@@ -105,6 +106,63 @@ durations
 ```
 
 
+## Performance Experiment
+
+Experiment details:
+
+* The fastest implementation of each method will be used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop CV strategy.
+* The chosen algorithm and hyperparameters will be used to predict on a 100K row simulated dataset, and the mean absolute error will be calculated for each combination of repeat, data size, and method.
+* AWS
+* Drake
+
+```{r perf_build_times, echo=FALSE, message=FALSE}
+pacman::p_load(extrafont, dplyr, purrr, lubridate, ggplot2, drake)
+
+bt <- build_times(starts_with("ncv_results"), digits = 4)
+
+subtarget_bts <- bt %>%
+  filter(stringr::str_detect(target, pattern = "[0-9]_([0-9]|[a-z])")) %>%
+  select(target, elapsed)
+
+subtargets_raw <- map_dfr(subtarget_bts$target, function(x) {
+  results <- readd(x, character_only = TRUE) %>%
+    mutate(subtarget = x) %>%
+    select(subtarget, everything())
+
+}) %>%
+  inner_join(subtarget_bts, by = c("subtarget" = "target"))
+
+subtargets <- subtargets_raw %>%
+  mutate(repeats = factor(repeats),
+         n = factor(n),
+         elapsed = round(as.numeric(elapsed)/3600, 2))
+
+
+ggplot(subtargets, aes(y = elapsed, x = repeats,
+                       fill = n, label = elapsed)) +
+  geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
+  geom_text(hjust = 1.3, size = 3.5,
+            color = "white", position = position_dodge(width = 0.8)) +
+  coord_flip() +
+  labs(y = "Runtime (hrs)", x = "Repeats",
+       title = "Kuhn-Johnson", fill = "Sample Size") +
+  theme(title = element_text(family = "Roboto"),
+        text = element_text(family = "Roboto"),
+        legend.position = "top",
+        axis.ticks = element_blank(),
+        axis.text.x = element_blank(),
+        panel.background = element_rect(fill = "ivory",
+                                        colour = "ivory"),
+        plot.background = element_rect(fill = "ivory"),
+        panel.border = element_blank(),
+        panel.grid.major = element_blank(),
+        panel.grid.minor = element_blank()
+  )
+
+```
+
+
+
 
 References
 
README.md

Lines changed: 52 additions & 22 deletions
@@ -1,34 +1,43 @@
 
 # Nested Cross-Validation: Comparing Methods and Implementations
 
+### (In-progress)
+
 Nested cross-validation has become a recommended technique for
 situations in which the size of our dataset is insufficient to
 simultaneously handle hyperparameter tuning and algorithm comparison.
-Using standard methods such as k-fold cross-validation in such
-situations results in significant increases in optimization bias. Nested
-cross-validation has been shown to produce low bias, out-of-sample error
-estimates even using datasets with only a few hundred rows and therefore
-gives a better judgemnet of generalization performance.
+Examples of such situations include proof of concept, start-ups,
+medical studies, time series, etc. Using standard methods such as k-fold
+cross-validation in these cases may result in significant increases in
+optimization bias. Nested cross-validation has been shown to produce
+low-bias out-of-sample error estimates even using datasets with only
+hundreds of rows, and therefore gives a better judgement of
+generalization performance.
 
 The primary issue with this technique is that it is computationally very
 expensive, with potentially tens of thousands of models being trained during
-the process. While researching this technique, I found two methods of
-performing nested cross-validation — one authored by [Sabastian
+the process. While researching this technique, I found two slightly
+different methods of performing nested cross-validation — one authored
+by [Sebastian
 Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb)
 and the other by [Max Kuhn and Kjell
 Johnson](https://tidymodels.github.io/rsample/articles/Applications/Nested_Resampling.html).
-This experiment seeks to answer two questions:
+I’ll be examining two aspects of nested cross-validation:
 
-1. What’s the fastest implementation of each method?
-2. How many repeats, given the size of this dataset, should we expect
-   to need to obtain a reasonably accurate out-of-sample error
-   estimate?
+1. Duration: Which packages and functions give us the fastest
+   implementation of each method?
+2. Performance: First, develop a testing framework. Then, using a
+   generated dataset, find how many repeats, given the number of
+   samples, we should expect to need in order to obtain a reasonably
+   accurate out-of-sample error estimate.
 
 With regard to the question of speed, I’ll be testing
 implementations of both methods from various packages, which include
 {tune}, {mlr3}, {h2o}, and {sklearn}.
 
-Duration experiment details:
+## Duration Experiment
+
+Experiment details:
 
   - Random Forest and Elastic Net Regression algorithms
   - Both with 100x2 hyperparameter grids
@@ -52,22 +61,43 @@ These include:
 3. Inner-Loop CV strategy
 4. Grid search strategy
 
-For the performance experiment (question 2), the fastest implementation
-of each method will be used in running a nested cross-validation with
-different sizes of data ranging from 100 to 5000 observations and
-different numbers of repeats of the outer-loop cv strategy. The chosen
-algorithm and hyperparameters will predict on a 100K row simulated
-dataset and the mean absolute error will be calculated for each
-combination of repeat, data size, and method.
-
-Progress (duration in seconds)
+These elements also affect the run times. Both methods will be using the
+same size grids, but Kuhn-Johnson uses repeats and more folds in the
+outer and inner loops, while Raschka’s trains an extra model over the
+entire training set at the end. Using Kuhn-Johnson, 50,000 models will
+be trained for each algorithm — using Raschka’s, 1,001 models.
+
+MLFlow was used to keep track of the duration (seconds) of each run
+along with the implementation and method used. I’ve used
+“implementation” to describe the various changes in coding structure
+that accompany using each package’s functions. A couple of examples are
+the Python for-loop being replaced with a while-loop and the `iter_next`
+function when using {reticulate}, and {mlr3} relying entirely on R’s R6
+object-oriented programming system.
 
 ![](duration-experiment/outputs/0225-results.png)
 
 ![](duration-experiment/outputs/duration-pkg-tbl.png)
 
 ![](README_files/figure-gfm/unnamed-chunk-1-1.png)<!-- -->
 
+## Performance Experiment
+
+Experiment details:
+
+  - The fastest implementation of each method will be used in running a
+    nested cross-validation with different sizes of data ranging from
+    100 to 5000 observations and different numbers of repeats of the
+    outer-loop CV strategy.
+  - The chosen algorithm and hyperparameters will be used to predict on
+    a 100K row simulated dataset, and the mean absolute error will be
+    calculated for each combination of repeat, data size, and method.
+  - AWS
+  - Drake
+
+![](README_files/figure-gfm/perf_build_times-1.png)<!-- -->
+
 References
 
 Boulesteix, AL, and C Strobl. 2009. “Optimal Classifier Selection and
12.4 KB file (diff not shown)

performance-experiment/Kuhn-Johnson/plan-kj.R

Lines changed: 21 additions & 21 deletions
@@ -71,28 +71,28 @@ plan <- drake_plan(
             error_FUN,
             method),
     dynamic = map(ncv_dat_800)
+  ),
+
+  # sample size = 2000
+  sim_dat_2000 = mlbench_data(2000),
+  params_list_2000 = create_grids(sim_dat_2000,
+                                  algorithms,
+                                  size = grid_size),
+  ncv_dat_2000 = create_ncv_objects(sim_dat_2000,
+                                    repeats,
+                                    method),
+  ncv_results_2000 = target(
+    run_ncv(ncv_dat_2000,
+            sim_dat_2000,
+            large_dat,
+            mod_FUN_list,
+            params_list_2000,
+            error_FUN,
+            method),
+    dynamic = map(ncv_dat_2000)
   )#,
-  #
-  #   # sample size = 2000
-  #   sim_dat_2000 = mlbench_data(2000),
-  #   params_list_2000 = create_grids(sim_dat_2000,
-  #                                   algorithms,
-  #                                   size = grid_size),
-  #   ncv_dat_2000 = create_ncv_objects(sim_dat_2000,
-  #                                     repeats,
-  #                                     method),
-  #   ncv_results_2000 = target(
-  #     run_ncv(ncv_dat_2000,
-  #             sim_dat_2000,
-  #             large_dat,
-  #             mod_FUN_list,
-  #             params_list_2000,
-  #             error_FUN,
-  #             method),
-  #     dynamic = map(ncv_dat_2000)
-  #   ),
-  #
-  #   # sample size = 5000
+
+  # sample size = 5000
   #   sim_dat_5000 = mlbench_data(5000),
   #   params_list_5000 = create_grids(sim_dat_5000,
   #                                   algorithms,
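The plan's `mlbench_data()` helper isn't shown in this commit. As a rough idea of what simulating a regression dataset of a given size with {mlbench} looks like, here is a hypothetical stand-in that uses the Friedman 1 benchmark (the actual generator used in the repo may differ):

```r
library(mlbench)

# Hypothetical stand-in for the plan's mlbench_data() helper: simulate a
# regression dataset with n rows using the Friedman 1 benchmark function
make_sim_data <- function(n, sd = 1) {
  sim <- mlbench.friedman1(n = n, sd = sd)
  data.frame(sim$x, y = sim$y)
}

sim_dat_2000 <- make_sim_data(2000)    # a training-size sample
large_dat    <- make_sim_data(100000)  # a large assessment set
dim(large_dat)
```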
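The `dynamic = map(...)` calls in the plan are {drake}'s dynamic branching: one subtarget is built per element of the mapped object, which is why the README chunk gathers `build_times()` for the subtargets and reads them back with `readd()`. A minimal, self-contained sketch of that pattern with stand-in targets, not the repo's actual plan:

```r
library(drake)

plan <- drake_plan(
  repeats = c(2, 5, 10),    # stand-in for the nested-CV objects
  result  = target(
    repeats * 100,          # stand-in for run_ncv()
    dynamic = map(repeats)  # one subtarget per element of `repeats`
  )
)

make(plan)

build_times(starts_with("result"), digits = 4)  # build times for the target and its subtargets
readd(result)                                   # combined subtarget results
```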
