
Commit d391fe6

Author ercbk committed: created patchwork for runtimes, percent error; minor edits
1 parent 863b592 commit d391fe6

File tree: 3 files changed (+54 / -52 lines)


README.Rmd

Lines changed: 31 additions & 21 deletions
@@ -17,7 +17,7 @@ I'll be examining two aspects of nested cross-validation:
 
 
 ## Duration Experiment
-##### Experiment details:
+#### Experiment details:
 
 * Random Forest and Elastic Net Regression algorithms
 * Both with 100x2 hyperparameter grids
@@ -30,7 +30,7 @@ I'll be examining two aspects of nested cross-validation:
 + outer loop: 5 folds
 + inner loop: 2 folds
 
-(Size of the data sets are the same as those in the original scripts by the authors)
+The sizes of the data sets are the same as those in the original scripts by the authors. [MLFlow](https://mlflow.org/docs/latest/index.html) is used to keep track of the duration (seconds) of each run along with the implementation and method used.
 
 
 Various elements of the technique can be altered to improve performance. These include:
@@ -40,9 +40,7 @@ Various elements of the technique can be altered to improve performance. These include:
 3. Inner-Loop CV strategy
 4. Grid search strategy
 
-These elements also affect the run times. Both methods will be using the same size grids, but Kuhn-Johnson uses repeats and more folds in the outer and inner loops while Raschka's trains an extra model over the entire training set at the end at the end. Using Kuhn-Johnson, 50,000 models (grid size * number of repeats * number of folds in the outer-loop * number of folds/resamples in the inner-loop) will be trained for each algorithm — using Raschka's, 1,001 models.
-
-[MLFlow](https://mlflow.org/docs/latest/index.html) was used to keep track of the duration (seconds) of each run along with the implementation and method used. I've used "implementation" to encapsulate not only the combinations of various model functions, but also, to describe the various changes in coding structures that accompanies using each package's functions, i.e. I can't just plug-and-play different packages' model functions into the same script.
+These elements also affect the run times. Both methods are using the same size grids, but Kuhn-Johnson uses repeats and more folds in the outer and inner loops while Raschka's trains an extra model over the entire training set at the end. Using Kuhn-Johnson, 50,000 models (grid size * number of repeats * number of folds in the outer-loop * number of folds/resamples in the inner-loop) are trained for each algorithm — using Raschka's, 1,001 models.
 
 ![](duration-experiment/outputs/0225-results.png)
 
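A quick sanity check on these totals, using the formula in parentheses: the 5-fold outer / 2-fold inner bullets shown earlier reproduce Raschka's 1,001 figure exactly, while the Kuhn-Johnson repeat and fold counts aren't visible in this diff, so the values below are assumptions chosen to reproduce the stated 50,000.

```r
# Model counts implied by the text. The 5 outer / 2 inner folds match the
# bullets earlier in the diff; the Kuhn-Johnson settings (5 repeats,
# 10 outer folds, 10 inner folds) are assumed values consistent with the
# stated 50,000 total.
grid_size <- 100

raschka_models <- grid_size * 5 * 2 + 1    # outer * inner, + 1 final fit = 1,001
kj_models      <- grid_size * 5 * 10 * 10  # repeats * outer * inner     = 50,000
```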
@@ -108,18 +106,18 @@ durations
 
 ## Performance Experiment
 
-##### Experiment details:
+#### Experiment details:
 
-* The fastest implementation of each method was used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy.
-* The {mlr3} implementation was the fastest for Raschka's method, but the Ranger-Kuhn-Johnson implementation was close. To simplify, I'll be using [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R) for both methods.
-* The chosen algorithm and hyperparameters was used to predict on a 100K row simulated dataset.
-* The percent error between the the average mean absolute error (MAE) across the outer-loop folds and the MAE of the predictions on this 100K dataset was calculated for each combination of repeat, data size, and method.
-* To make this experiment manageable in terms of runtimes, I used AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge for Random Forest.
-* Iterating through different numbers of repeats, sample sizes, and methods makes a functional approach more appropriate than running imperative scripts. Also, given the long runtimes and impermanent nature of my internet connection, it would also be nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm used it to orchestrate.
+* The fastest implementation of each method is used to run a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop CV strategy.
+* The {mlr3} implementation is the fastest for Raschka's method, but the Ranger-Kuhn-Johnson implementation is close. To simplify, I am using [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R) for both methods.
+* The chosen algorithm and hyperparameters are used to predict on a 100K-row simulated dataset.
+* The percent error between the average mean absolute error (MAE) across the outer-loop folds and the MAE of the predictions on this 100K dataset is calculated for each combination of repeat, data size, and method (see the sketch after this hunk).
+* To make this experiment manageable in terms of runtimes, I am using AWS instances: an r5.2xlarge for the Elastic Net and an r5.24xlarge for Random Forest.
+* Iterating through different numbers of repeats, sample sizes, and methods makes a functional approach more appropriate than running imperative scripts. Also, given the long runtimes and the impermanent nature of my internet connection, it would be nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm using it to orchestrate.
 
 ```{r perf_build_times, echo=FALSE, message=FALSE}
-pacman::p_load(extrafont,dplyr, purrr, lubridate, ggplot2, ggfittext, drake)
+pacman::p_load(extrafont, dplyr, purrr, lubridate, ggplot2, ggfittext, drake, patchwork)
 bt <- build_times(starts_with("ncv_results"), digits = 4)
 
 subtarget_bts <- bt %>%
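The percent-error bullet above reduces to a one-liner. A minimal sketch, with illustrative object names (`outer_fold_maes`, `sim_100k`, and `preds_100k` are stand-ins, not objects from the repo), assuming a signed rather than absolute difference:

```r
# Percent error between the average outer-fold MAE and the MAE of the
# predictions on the 100K-row simulated set. All names are stand-ins.
avg_fold_mae <- mean(outer_fold_maes)               # mean MAE across outer-loop folds
mae_100k     <- mean(abs(sim_100k$y - preds_100k))  # MAE on the 100K set

percent_error <- (avg_fold_mae - mae_100k) / mae_100k
```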
@@ -149,7 +147,7 @@ readr::write_csv(subtargets, "performance-experiment/output/perf-exp-output.csv")
 
 fill_colors <- unname(swatches::read_ase("palettes/Forest Floor.ase"))
 
-ggplot(subtargets, aes(y = elapsed, x = repeats,
+b <- ggplot(subtargets, aes(y = elapsed, x = repeats,
                        fill = n, label = elapsed)) +
   geom_col(position = position_dodge(width = 0.85)) +
   scale_fill_manual(values = fill_colors[4:7]) +
@@ -159,14 +157,15 @@ ggplot(subtargets, aes(y = elapsed, x = repeats,
                 place = "right", contrast = TRUE) +
   coord_flip() +
   labs(y = "Runtime (hrs)", x = "Repeats",
-       title = "Kuhn-Johnson", fill = "Sample Size") +
+       fill = "Sample Size") +
   theme(title = element_text(family = "Roboto"),
         text = element_text(family = "Roboto"),
         legend.position = "top",
         legend.background = element_rect(fill = "ivory"),
         legend.key = element_rect(fill = "ivory"),
         axis.ticks = element_blank(),
-        axis.text.x = element_blank(),
+        axis.text.x = element_text(size = 11),
+        axis.text.y = element_text(size = 11),
         panel.background = element_rect(fill = "ivory",
                                         colour = "ivory"),
         plot.background = element_rect(fill = "ivory"),
@@ -178,35 +177,46 @@ ggplot(subtargets, aes(y = elapsed, x = repeats,
 ```
 
 ```{r perf-error-line, echo=FALSE, message=FALSE}
-ggplot(subtargets, aes(x = repeats, y = percent_error, group = n)) +
+e <- ggplot(subtargets, aes(x = repeats, y = percent_error, group = n)) +
   geom_point(aes(color = n), size = 3) +
   geom_line(aes(color = n), size = 2) +
   expand_limits(y = c(0, 0.10)) +
   scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
   scale_color_manual(values = fill_colors[4:7]) +
   labs(y = "Percent Error", x = "Repeats",
-       title = "Kuhn-Johnson", color = "Sample Size") +
+       color = "Sample Size") +
   theme(title = element_text(family = "Roboto"),
         text = element_text(family = "Roboto"),
         legend.position = "top",
         legend.background = element_rect(fill = "ivory"),
         legend.key = element_rect(fill = "ivory"),
         axis.ticks = element_blank(),
+        axis.text.x = element_text(size = 11),
+        axis.text.y = element_text(size = 11),
         panel.background = element_rect(fill = "ivory",
-                                        colour = "ivory"),
+                                        color = "ivory"),
         plot.background = element_rect(fill = "ivory"),
         panel.border = element_blank(),
         panel.grid.major = element_blank(),
         panel.grid.minor = element_blank()
   )
 ```
 
-##### Results:
+```{r kj-patch, echo=FALSE, fig.width=10, fig.height=6}
+b + e + plot_layout(guides = "auto") +
+  plot_annotation(title = "Kuhn-Johnson") &
+  theme(legend.position = "top",
+        panel.background = element_rect(fill = "ivory",
+                                        color = "ivory"),
+        plot.background = element_rect(fill = "ivory"))
+```
+
+#### Results:
 
 Kuhn-Johnson:
 
 * Runtimes for n = 100 and n = 800 are close, and there's a large jump in runtime going from n = 2000 to n = 5000.
-* The number of repeats had little effect on the amount of percent error.
+* The number of repeats has little effect on the amount of percent error.
 * For n = 100, there is substantially more variation in percent error than in the other sample sizes.
 * While there is a large runtime cost that comes with increasing the sample size from 2000 to 5000 observations, it doesn't seem to provide any benefit in gaining a more accurate estimate of the out-of-sample error.
 
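For readers unfamiliar with {patchwork}, the new `kj-patch` chunk relies on three of its operators: `+` places the saved plots side by side, `plot_annotation()` supplies the shared title that was dropped from the individual `labs()` calls, and `&` applies theme elements to every panel. A self-contained toy version using the built-in `mtcars` data rather than the experiment's objects:

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(hp, mpg)) + geom_point()

# `+` composes the panels, plot_annotation() titles the whole patchwork,
# and `&` broadcasts the theme to every panel at once.
p1 + p2 +
  plot_annotation(title = "Shared title") &
  theme(plot.background = element_rect(fill = "ivory"))
```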
README.md

Lines changed: 23 additions & 31 deletions
@@ -35,7 +35,7 @@ I’ll be examining two aspects of nested cross-validation:
 
 ## Duration Experiment
 
-##### Experiment details:
+#### Experiment details:
 
 - Random Forest and Elastic Net Regression algorithms
 - Both with 100x2 hyperparameter grids
@@ -48,8 +48,10 @@ I’ll be examining two aspects of nested cross-validation:
 - outer loop: 5 folds
 - inner loop: 2 folds
 
-(Size of the data sets are the same as those in the original scripts by
-the authors)
+The sizes of the data sets are the same as those in the original scripts
+by the authors. [MLFlow](https://mlflow.org/docs/latest/index.html) is
+used to keep track of the duration (seconds) of each run along with the
+implementation and method used.
 
 Various elements of the technique can be altered to improve performance.
 These include:
@@ -59,21 +61,13 @@ These include:
 3. Inner-Loop CV strategy
 4. Grid search strategy
 
-These elements also affect the run times. Both methods will be using the
+These elements also affect the run times. Both methods are using the
 same size grids, but Kuhn-Johnson uses repeats and more folds in the
 outer and inner loops while Raschka’s trains an extra model over the
 entire training set at the end. Using Kuhn-Johnson, 50,000
 models (grid size \* number of repeats \* number of folds in the
-outer-loop \* number of folds/resamples in the inner-loop) will be
-trained for each algorithm — using Raschka’s, 1,001 models.
-
-[MLFlow](https://mlflow.org/docs/latest/index.html) was used to keep
-track of the duration (seconds) of each run along with the
-implementation and method used. I’ve used “implementation” to
-encapsulate not only the combinations of various model functions, but
-also, to describe the various changes in coding structures that
-accompanies using each package’s functions, i.e. I can’t just
-plug-and-play different packages’ model functions into the same script.
+outer-loop \* number of folds/resamples in the inner-loop) are trained
+for each algorithm — using Raschka’s, 1,001 models.
 
 ![](duration-experiment/outputs/0225-results.png)
 
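The MLFlow sentence in the previous hunk is the whole tracking story: one duration metric plus implementation and method tags per run. A minimal sketch of that kind of logging with the {mlflow} R package; the experiment name and tag values here are illustrative, not the repo's:

```r
library(mlflow)

# Log one run: a duration metric (seconds) plus tags identifying the
# implementation and method. All names and values are illustrative.
mlflow_set_experiment("ncv-duration")

mlflow_start_run()
mlflow_set_tag("implementation", "ranger-kj")
mlflow_set_tag("method", "kuhn-johnson")

start <- Sys.time()
# ... run the nested cross-validation here ...
mlflow_log_metric("duration",
                  as.numeric(difftime(Sys.time(), start, units = "secs")))
mlflow_end_run()
```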
@@ -83,45 +77,43 @@ plug-and-play different packages’ model functions into the same script.
 
 ## Performance Experiment
 
-##### Experiment details:
+#### Experiment details:
 
-  - The fastest implementation of each method was used in running a
+  - The fastest implementation of each method is used to run a
     nested cross-validation with different sizes of data ranging from
     100 to 5000 observations and different numbers of repeats of the
     outer-loop CV strategy.
-  - The {mlr3} implementation was the fastest for Raschka’s method,
-    but the Ranger-Kuhn-Johnson implementation was close. To
-    simplify, I’ll be using
+  - The {mlr3} implementation is the fastest for Raschka’s method,
+    but the Ranger-Kuhn-Johnson implementation is close. To
+    simplify, I am using
     [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R)
     for both methods.
-  - The chosen algorithm and hyperparameters was used to predict on a
-    100K row simulated dataset.
+  - The chosen algorithm and hyperparameters are used to predict on a
+    100K-row simulated dataset.
   - The percent error between the average mean absolute error (MAE)
     across the outer-loop folds and the MAE of the predictions on this
-    100K dataset was calculated for each combination of repeat, data
+    100K dataset is calculated for each combination of repeat, data
     size, and method.
-  - To make this experiment manageable in terms of runtimes, I used AWS
-    instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge for
-    Random Forest.
+  - To make this experiment manageable in terms of runtimes, I am using
+    AWS instances: an r5.2xlarge for the Elastic Net and an r5.24xlarge
+    for Random Forest.
   - Iterating through different numbers of repeats, sample sizes, and
     methods makes a functional approach more appropriate than running
     imperative scripts. Also, given the long runtimes and the impermanent
     nature of my internet connection, it would be nice to cache
     each iteration as it finishes. The
     [{drake}](https://github.com/ropensci/drake) package is superb on
-    both counts, so I’m used it to orchestrate.
-
-![](README_files/figure-gfm/perf_bt_charts-1.png)<!-- -->
+    both counts, so I’m using it to orchestrate (see the sketch after
+    this hunk).
 
-![](README_files/figure-gfm/perf-error-line-1.png)<!-- -->
+![](README_files/figure-gfm/kj-patch-1.png)<!-- -->
 
-##### Results:
+#### Results:
 
 Kuhn-Johnson:
 
   - Runtimes for n = 100 and n = 800 are close, and there’s a large jump
     in runtime going from n = 2000 to n = 5000.
-  - The number of repeats had little effect on the amount of percent
+  - The number of repeats has little effect on the amount of percent
     error.
   - For n = 100, there is substantially more variation in percent error
     than in the other sample sizes.
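The {drake} bullet above is the design-level justification; below is a sketch of the kind of plan it implies. `run_ncv()` and the setting grids are hypothetical stand-ins, not the repo's actual plan. The point is that static branching yields one cached target per method/size/repeats combination, so an interrupted session resumes where it stopped:

```r
library(drake)

# One target per (method, n, repeats) combination via static branching.
# run_ncv() is a hypothetical wrapper around the nested-CV routine.
plan <- drake_plan(
  ncv_results = target(
    run_ncv(method, n, repeats),
    transform = cross(
      method  = c("kj", "raschka"),
      n       = c(100, 800, 2000, 5000),
      repeats = c(1, 2, 3)
    )
  )
)

make(plan)  # completed targets are cached; re-running builds only what's missing
```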
(third changed file, 23.6 KB: diff not rendered)
