Commit 3868ecd

Author: ercbk
Commit message: added n = 5000, repeats 1,2,3 runtime output to readme
1 parent: 9fa0b2f

File tree

17 files changed: +67 additions, -46 deletions

.gitignore

Lines changed: 2 additions & 1 deletion

@@ -4,4 +4,5 @@
 .Ruserdata
 .env
 .drake
-ec2-ssh-raw.log
+ec2-ssh-raw.log
+README_cache

README.Rmd

Lines changed: 32 additions & 22 deletions

@@ -111,13 +111,14 @@ durations
 Experiment details:
 
 * The fastest implementation of each method will be used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy.
+* The {mlr3} implementation was the fastest for Raschka's method, but the Ranger-Kuhn-Johnson implementation is close. So I'll be using Ranger-Kuhn-Johnson for both methods.
 * The chosen algorithm and hyperparameters will be used to predict on a 100K row simulated dataset, and the mean absolute error will be calculated for each combination of repeat, data size, and method.
-* AWS
-* Drake
+* Runtimes began to explode after n = 800 for my 8 vcpu, 16 GB RAM desktop, so I ran this experiment using AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge for Random Forest.
+* I'll be iterating through different numbers of repeats and sample sizes, so I'll be transitioning from imperative scripts to a functional approach. Given the long runtimes and the impermanent nature of my internet connection, it would be nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm using it to orchestrate.
 
-```{r perf_build_times, echo=FALSE, message=FALSE}
-pacman::p_load(extrafont, dplyr, purrr, lubridate, ggplot2, drake)
+```{r perf_build_times, echo=FALSE, message=FALSE, cache=TRUE}
+pacman::p_load(extrafont, dplyr, purrr, lubridate, ggplot2, drake)
 bt <- build_times(starts_with("ncv_results"), digits = 4)
 
 subtarget_bts <- bt %>%
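The "functional approach" described in the bullets above can be sketched with {purrr}: build a grid of (repeats, n) combinations and map one run function over it, instead of copy-pasted imperative scripts. `run_ncv()` and the grid values below are hypothetical stand-ins, not code from this repo.

```r
# Sketch of iterating over repeats x sample sizes functionally;
# run_ncv() is a hypothetical stand-in for one nested-CV run.
library(purrr)

run_ncv <- function(repeats, n) {
  # the real function would run nested CV on n rows with `repeats`
  # outer-loop repeats and return an error estimate
  list(repeats = repeats, n = n, mae = NA_real_)
}

grid <- expand.grid(repeats = 1:3, n = c(100, 800, 2000, 5000))
results <- pmap(grid, run_ncv)
length(results)  # 12: one result per (repeats, n) combination
```

`pmap()` matches the grid's column names to the function's argument names, so each row becomes one call.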
@@ -137,33 +138,42 @@ subtargets <- subtargets_raw %>%
          n = factor(n),
          elapsed = round(as.numeric(elapsed)/3600, 2))
+```
+
+```{r perf_bt_charts, echo=FALSE, message=FALSE}
+
+fill_colors <- unname(swatches::read_ase("palettes/Forest Floor.ase"))
 
 ggplot(subtargets, aes(y = elapsed, x = repeats,
                        fill = n, label = elapsed)) +
-  geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
-  geom_text(hjust = 1.3, size = 3.5,
-            color = "white", position = position_dodge(width = 0.8)) +
-  coord_flip() +
-  labs(y = "Runtime (hrs)", x = "Repeats",
-       title = "Kuhn-Johnson", fill = "Sample Size") +
-  theme(title = element_text(family = "Roboto"),
-        text = element_text(family = "Roboto"),
-        legend.position = "top",
-        axis.ticks = element_blank(),
-        axis.text.x = element_blank(),
-        panel.background = element_rect(fill = "ivory",
-                                        colour = "ivory"),
-        plot.background = element_rect(fill = "ivory"),
-        panel.border = element_blank(),
-        panel.grid.major = element_blank(),
-        panel.grid.minor = element_blank()
-  )
+  geom_col(position = position_dodge(width = 0.8)) +
+  scale_fill_manual(values = fill_colors[4:7]) +
+  geom_text(hjust = 1.3, size = 3.5,
+            color = "white", position = position_dodge(width = 0.8)) +
+  coord_flip() +
+  labs(y = "Runtime (hrs)", x = "Repeats",
+       title = "Kuhn-Johnson", fill = "Sample Size") +
+  theme(title = element_text(family = "Roboto"),
+        text = element_text(family = "Roboto"),
+        legend.position = "top",
+        legend.background = element_rect(fill = "ivory"),
+        legend.key = element_rect(fill = "ivory"),
+        axis.ticks = element_blank(),
+        axis.text.x = element_blank(),
+        panel.background = element_rect(fill = "ivory",
+                                        colour = "ivory"),
+        plot.background = element_rect(fill = "ivory"),
+        panel.border = element_blank(),
+        panel.grid.major = element_blank(),
+        panel.grid.minor = element_blank()
+  )
 
 ```
 
+
 References
 
 Boulesteix, AL, and C Strobl. 2009. “Optimal Classifier Selection and Negative Bias in Error Rate Estimation: An Empirical Study on High-Dimensional Prediction.” BMC Medical Research Methodology 9 (1): 85. [link](https://www.researchgate.net/publication/40756303_Optimal_classifier_selection_and_negative_bias_in_error_rate_estimation_An_empirical_study_on_high-dimensional_prediction)
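The per-iteration caching that {drake} provides, and that `build_times(starts_with("ncv_results"))` reads back in the chunk above, can be sketched with {drake}'s static branching. The function and grid values here are illustrative stand-ins, not the repo's actual plan.

```r
# Hypothetical sketch of a {drake} plan in which every (repeats, n)
# combination is its own target, so each finished iteration is cached
# independently and survives an interrupted run.
library(drake)

ncv_run <- function(r, size) {
  list(repeats = r, n = size)  # stand-in for one nested-CV run
}

plan <- drake_plan(
  ncv_results = target(
    ncv_run(r, size),
    transform = cross(r = c(1, 2, 3), size = c(100, 800, 2000, 5000))
  )
)

nrow(plan)  # 12 targets: ncv_results_1_100, ncv_results_1_800, ...
# make(plan)       # builds and caches each target as it finishes
# build_times(starts_with("ncv_results"))  # per-target runtimes
```

`cross()` expands the plan to one target per combination, which is what makes per-combination caching and resumption possible.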

README.md

Lines changed: 15 additions & 5 deletions

@@ -89,14 +89,24 @@ Experiment details:
 - The fastest implementation of each method will be used in running a
   nested cross-validation with different sizes of data ranging from
   100 to 5000 observations and different numbers of repeats of the
-  outer-loop cv strategy.
+  outer-loop cv strategy.
+- The {mlr3} implementation was the fastest for Raschka’s method,
+  but the Ranger-Kuhn-Johnson implementation is close. So I’ll be
+  using Ranger-Kuhn-Johnson for both methods.
 - The chosen algorithm and hyperparameters will be used to predict on a
   100K row simulated dataset and the mean absolute error will be
   calculated for each combination of repeat, data size, and method.
-- AWS
-- Drake
-
-![](README_files/figure-gfm/perf_build_times-1.png)<!-- -->
+- Runtimes began to explode after n = 800 for my 8 vcpu, 16 GB RAM
+  desktop, so I ran this experiment using AWS instances: a r5.2xlarge
+  for the Elastic Net and a r5.24xlarge for Random Forest.
+- I’ll be iterating through different numbers of repeats and sample
+  sizes, so I’ll be transitioning from imperative scripts to a
+  functional approach. Given the long runtimes and impermanent nature
+  of my internet connection, it would be nice to cache each iteration
+  as it finishes. The [{drake}](https://github.com/ropensci/drake)
+  package is superb on both counts, so I’m using it to orchestrate.
+
+![](README_files/figure-gfm/perf_bt_charts-1.png)<!-- -->
 
 References
 
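The chart-prep code in the README.Rmd diff converts drake's elapsed build times to hours via `round(as.numeric(elapsed)/3600, 2)`. A quick base-R check of that conversion, using a made-up duration:

```r
# Verify the seconds-to-hours conversion used for the runtime chart.
elapsed <- as.difftime(9000, units = "secs")  # made-up 9000 s build time
hours <- round(as.numeric(elapsed) / 3600, 2)
hours  # 2.5
```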

Two image files changed (7.4 KB and 813 Bytes); previews not shown.

palettes/Analagous.ase (540 Bytes, binary file not shown)

palettes/Deep Rooted.ase (540 Bytes, binary file not shown)

palettes/Drama Queen.ase (540 Bytes, binary file not shown)

palettes/Ethereal Material.ase (540 Bytes, binary file not shown)

palettes/Focal Points.ase (476 Bytes, binary file not shown)
