Commit 3868ecd

Author: ercbk
Commit message: added n = 5000, repeats 1,2,3 runtime output to readme
1 parent: 9fa0b2f

File tree

17 files changed: +67 additions, -46 deletions

.gitignore

Lines changed: 2 additions & 1 deletion

@@ -4,4 +4,5 @@
 .Ruserdata
 .env
 .drake
-ec2-ssh-raw.log
+ec2-ssh-raw.log
+README_cache

README.Rmd

Lines changed: 32 additions & 22 deletions

@@ -111,13 +111,14 @@ durations
 Experiment details:
 
 * The fastest implementation of each method will be used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy.
+* The {mlr3} implementation was the fastest for Raschka's method, but the Ranger-Kuhn-Johnson implementation is close. So I'll be using Ranger-Kuhn-Johnson for both methods.
 * The chosen algorithm and hyperparameters will be used to predict on a 100K row simulated dataset, and the mean absolute error will be calculated for each combination of repeat, data size, and method.
-* AWS
-* Drake
+* Runtimes began to explode after n = 800 for my 8 vcpu, 16 GB RAM desktop, so I ran this experiment using AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge for Random Forest.
+* I'll be iterating through different numbers of repeats and sample sizes, so I'll be transitioning from imperative scripts to a functional approach. Given the long runtimes and the impermanent nature of my internet connection, it would be nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm using it to orchestrate.
 
-```{r perf_build_times, echo=FALSE, message=FALSE}
-pacman::p_load(extrafont, dplyr, purrr, lubridate, ggplot2, drake)
+```{r perf_build_times, echo=FALSE, message=FALSE, cache=TRUE}
+pacman::p_load(extrafont, dplyr, purrr, lubridate, ggplot2, drake)
 bt <- build_times(starts_with("ncv_results"), digits = 4)
 
 subtarget_bts <- bt %>%
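The "functional approach" described in the bullets above can be sketched with {purrr}: build a grid of (repeats, n) combinations and map one run function over it, instead of copy-pasted imperative scripts. `run_ncv()` and the grid values below are hypothetical stand-ins, not code from this repo.

```r
# Sketch of iterating over repeats x sample sizes functionally;
# run_ncv() is a hypothetical stand-in for one nested-CV run.
library(purrr)

run_ncv <- function(repeats, n) {
  # the real function would run nested CV on n rows with `repeats`
  # outer-loop repeats and return an error estimate
  list(repeats = repeats, n = n, mae = NA_real_)
}

grid <- expand.grid(repeats = 1:3, n = c(100, 800, 2000, 5000))
results <- pmap(grid, run_ncv)
length(results)  # 12: one result per (repeats, n) combination
```

`pmap()` matches the grid's column names to the function's argument names, so each row becomes one call.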
@@ -137,33 +138,42 @@ subtargets <- subtargets_raw %>%
          n = factor(n),
          elapsed = round(as.numeric(elapsed)/3600, 2))
+```
+
+```{r perf_bt_charts, echo=FALSE, message=FALSE}
+
+fill_colors <- unname(swatches::read_ase("palettes/Forest Floor.ase"))
 
 ggplot(subtargets, aes(y = elapsed, x = repeats,
                        fill = n, label = elapsed)) +
-  geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
-  geom_text(hjust = 1.3, size = 3.5,
-            color = "white", position = position_dodge(width = 0.8)) +
-  coord_flip() +
-  labs(y = "Runtime (hrs)", x = "Repeats",
-       title = "Kuhn-Johnson", fill = "Sample Size") +
-  theme(title = element_text(family = "Roboto"),
-        text = element_text(family = "Roboto"),
-        legend.position = "top",
-        axis.ticks = element_blank(),
-        axis.text.x = element_blank(),
-        panel.background = element_rect(fill = "ivory",
-                                        colour = "ivory"),
-        plot.background = element_rect(fill = "ivory"),
-        panel.border = element_blank(),
-        panel.grid.major = element_blank(),
-        panel.grid.minor = element_blank()
-  )
+  geom_col(position = position_dodge(width = 0.8)) +
+  scale_fill_manual(values = fill_colors[4:7]) +
+  geom_text(hjust = 1.3, size = 3.5,
+            color = "white", position = position_dodge(width = 0.8)) +
+  coord_flip() +
+  labs(y = "Runtime (hrs)", x = "Repeats",
+       title = "Kuhn-Johnson", fill = "Sample Size") +
+  theme(title = element_text(family = "Roboto"),
+        text = element_text(family = "Roboto"),
+        legend.position = "top",
+        legend.background = element_rect(fill = "ivory"),
+        legend.key = element_rect(fill = "ivory"),
+        axis.ticks = element_blank(),
+        axis.text.x = element_blank(),
+        panel.background = element_rect(fill = "ivory",
+                                        colour = "ivory"),
+        plot.background = element_rect(fill = "ivory"),
+        panel.border = element_blank(),
+        panel.grid.major = element_blank(),
+        panel.grid.minor = element_blank()
+  )
 
 ```
 
+
 References
 
 Boulesteix, AL, and C Strobl. 2009. “Optimal Classifier Selection and Negative Bias in Error Rate Estimation: An Empirical Study on High-Dimensional Prediction.” BMC Medical Research Methodology 9 (1): 85. [link](https://www.researchgate.net/publication/40756303_Optimal_classifier_selection_and_negative_bias_in_error_rate_estimation_An_empirical_study_on_high-dimensional_prediction)
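The per-iteration caching that {drake} provides, and that `build_times(starts_with("ncv_results"))` reads back in the chunk above, can be sketched with {drake}'s static branching. The function and grid values here are illustrative stand-ins, not the repo's actual plan.

```r
# Hypothetical sketch of a {drake} plan in which every (repeats, n)
# combination is its own target, so each finished iteration is cached
# independently and survives an interrupted run.
library(drake)

ncv_run <- function(r, size) {
  list(repeats = r, n = size)  # stand-in for one nested-CV run
}

plan <- drake_plan(
  ncv_results = target(
    ncv_run(r, size),
    transform = cross(r = c(1, 2, 3), size = c(100, 800, 2000, 5000))
  )
)

nrow(plan)  # 12 targets: ncv_results_1_100, ncv_results_1_800, ...
# make(plan)       # builds and caches each target as it finishes
# build_times(starts_with("ncv_results"))  # per-target runtimes
```

`cross()` expands the plan to one target per combination, which is what makes per-combination caching and resumption possible.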

README.md

Lines changed: 15 additions & 5 deletions

@@ -89,14 +89,24 @@ Experiment details:
 - The fastest implementation of each method will be used in running a
   nested cross-validation with different sizes of data ranging from
   100 to 5000 observations and different numbers of repeats of the
-  outer-loop cv strategy.
+  outer-loop cv strategy.
+- The {mlr3} implementation was the fastest for Raschka’s method,
+  but the Ranger-Kuhn-Johnson implementation is close. So I’ll be
+  using Ranger-Kuhn-Johnson for both methods.
 - The chosen algorithm and hyperparameters will be used to predict on a
   100K row simulated dataset and the mean absolute error will be
   calculated for each combination of repeat, data size, and method.
-- AWS
-- Drake
-
-![](README_files/figure-gfm/perf_build_times-1.png)<!-- -->
+- Runtimes began to explode after n = 800 for my 8 vcpu, 16 GB RAM
+  desktop, so I ran this experiment using AWS instances: a r5.2xlarge
+  for the Elastic Net and a r5.24xlarge for Random Forest.
+- I’ll be iterating through different numbers of repeats and sample
+  sizes, so I’ll be transitioning from imperative scripts to a
+  functional approach. Given the long runtimes and impermanent nature
+  of my internet connection, it would be nice to cache each iteration
+  as it finishes. The [{drake}](https://github.com/ropensci/drake)
+  package is superb on both counts, so I’m using it to orchestrate.
+
+![](README_files/figure-gfm/perf_bt_charts-1.png)<!-- -->
 
 References
 
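The chart-prep code in the README.Rmd diff converts drake's elapsed build times to hours via `round(as.numeric(elapsed)/3600, 2)`. A quick base-R check of that conversion, using a made-up duration:

```r
# Verify the seconds-to-hours conversion used for the runtime chart.
elapsed <- as.difftime(9000, units = "secs")  # made-up 9000 s build time
hours <- round(as.numeric(elapsed) / 3600, 2)
hours  # 2.5
```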

Two image files changed (7.4 KB and 813 Bytes); previews not shown.

palettes/Analagous.ase (540 Bytes, binary file not shown)

palettes/Deep Rooted.ase (540 Bytes, binary file not shown)

palettes/Drama Queen.ase (540 Bytes, binary file not shown)

palettes/Ethereal Material.ase (540 Bytes, binary file not shown)

palettes/Focal Points.ase (476 Bytes, binary file not shown)
