
Commit e9166c6

Author: ercbk
Commit message: added percent error to perf exp; readme edits
1 parent e6a884f commit e9166c6

5 files changed: 47 additions, 39 deletions


.gitignore

Lines changed: 2 additions & 1 deletion

@@ -5,4 +5,5 @@
 .env
 .drake
 ec2-ssh-raw.log
-README_cache
+README_cache
+check-results.R

README.Rmd

Lines changed: 29 additions & 5 deletions

@@ -110,9 +110,10 @@ Experiment details:

 * The fastest implementation of each method will be used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy.
 * The {mlr3} implementation was the fastest for Raschka's method, but the Ranger-Kuhn-Johnson implementation was close. To simplify, I'll be using Ranger-Kuhn-Johnson for both methods.
-* The chosen algorithm and hyperparameters will used to predict on a 100K row simulated dataset and the mean absolute error will be calculated for each combination of repeat, data size, and method.
-* Runtimes began to explode after n = 800 for my 8 vcpu, 16 GB RAM desktop, therefore I ran this experiment using AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge for Random Forest.
-* I'll be transitioning from imperative scripts to a functional approach, because I'm iterating through different numbers of repeats and sample sizes. Given the long runtimes and impermanent nature of my internet connection, it would also be nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm using it to orchestrate.
+* The chosen algorithm and hyperparameters will be used to predict on a 100K row simulated dataset.
+* The percent error between the average mean absolute error (MAE) across the outer-loop folds and the MAE of the predictions on this 100K dataset will be calculated for each combination of repeat, data size, and method.
+* To make this experiment manageable in terms of runtimes, I'm using AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge for Random Forest.
+* Iterating through different numbers of repeats, sample sizes, and methods makes a functional approach more appropriate than imperative scripts. Given the long runtimes and the impermanent nature of my internet connection, it is also nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm using it to orchestrate.

 ```{r perf_build_times, echo=FALSE, message=FALSE, cache=TRUE}

@@ -134,7 +135,8 @@ subtargets_raw <- map_dfr(subtarget_bts$target, function(x) {
 subtargets <- subtargets_raw %>%
   mutate(repeats = factor(repeats),
          n = factor(n),
-         elapsed = round(as.numeric(elapsed)/3600, 2))
+         elapsed = round(as.numeric(elapsed)/3600, 2),
+         percent_error = round(delta_error/oos_error, 3))

 ```

@@ -168,7 +170,29 @@ ggplot(subtargets, aes(y = elapsed, x = repeats,

 ```

+```{r perf-error-line, echo=FALSE, message=FALSE}
+ggplot(subtargets, aes(x = repeats, y = percent_error, group = n)) +
+  geom_point(aes(color = n), size = 3) +
+  geom_line(aes(color = n), size = 2) +
+  expand_limits(y = c(0, 0.10)) +
+  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
+  scale_color_manual(values = fill_colors[4:7]) +
+  labs(y = "Percent Error", x = "Repeats",
+       title = "Kuhn-Johnson", color = "Sample Size") +
+  theme(title = element_text(family = "Roboto"),
+        text = element_text(family = "Roboto"),
+        legend.position = "top",
+        legend.background = element_rect(fill = "ivory"),
+        legend.key = element_rect(fill = "ivory"),
+        axis.ticks = element_blank(),
+        panel.background = element_rect(fill = "ivory",
+                                        colour = "ivory"),
+        plot.background = element_rect(fill = "ivory"),
+        panel.border = element_blank(),
+        panel.grid.major = element_blank(),
+        panel.grid.minor = element_blank()
+  )
+```

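The `percent_error` column added in the diff above is the gap between the average outer-fold MAE and the holdout MAE, expressed as a fraction of the holdout MAE. A minimal sketch in R, assuming `delta_error` is that absolute gap and `oos_error` is the MAE on the 100K holdout set (all numbers here are hypothetical illustrations, not the experiment's results):

```r
# Hypothetical values, for illustration only
oos_error  <- 0.42                 # MAE of predictions on the 100K simulated dataset
outer_maes <- c(0.44, 0.45, 0.43)  # MAE of each outer-loop cv fold

delta_error   <- abs(mean(outer_maes) - oos_error)  # gap between the two estimates
percent_error <- round(delta_error / oos_error, 3)  # gap relative to holdout MAE
percent_error
#> [1] 0.048
```

A small `percent_error` means the nested cross-validation's out-of-sample estimate closely tracked true performance on new data.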
README.md

Lines changed: 16 additions & 12 deletions

@@ -89,23 +89,27 @@ Experiment details:

   - The {mlr3} implementation was the fastest for Raschka’s method,
     but the Ranger-Kuhn-Johnson implementation was close. To
     simplify, I’ll be using Ranger-Kuhn-Johnson for both methods.
-  - The chosen algorithm and hyperparameters will used to predict on a
-    100K row simulated dataset and the mean absolute error will be
-    calculated for each combination of repeat, data size, and method.
-  - Runtimes began to explode after n = 800 for my 8 vcpu, 16 GB RAM
-    desktop, therefore I ran this experiment using AWS instances: a
-    r5.2xlarge for the Elastic Net and a r5.24xlarge for Random
-    Forest.
-  - I’ll be transitioning from imperative scripts to a functional
-    approach, because I’m iterating through different numbers of repeats
-    and sample sizes. Given the long runtimes and impermanent nature of
-    my internet connection, it would also be nice to cache each
-    iteration as it finishes. The
+  - The chosen algorithm and hyperparameters will be used to predict
+    on a 100K row simulated dataset.
+  - The percent error between the average mean absolute error (MAE)
+    across the outer-loop folds and the MAE of the predictions on this
+    100K dataset will be calculated for each combination of repeat,
+    data size, and method.
+  - To make this experiment manageable in terms of runtimes, I’m using
+    AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge
+    for Random Forest.
+  - Iterating through different numbers of repeats, sample sizes, and
+    methods makes a functional approach more appropriate than
+    imperative scripts. Given the long runtimes and the impermanent
+    nature of my internet connection, it is also nice to cache each
+    iteration as it finishes. The
     [{drake}](https://github.com/ropensci/drake) package is superb on
     both counts, so I’m using it to orchestrate.

 ![](README_files/figure-gfm/perf_bt_charts-1.png)<!-- -->

+![](README_files/figure-gfm/perf-error-line-1.png)<!-- -->
+
 References

 Boulesteix, AL, and C Strobl. 2009. “Optimal Classifier Selection and

performance-experiment/Kuhn-Johnson/check-results.R

Lines changed: 0 additions & 21 deletions
This file was deleted.
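
For readers unfamiliar with {drake}, this is roughly the kind of branching-and-caching plan the README edits describe: one target per combination of repeats and sample size, each cached as it finishes so an interrupted run can resume. A hedged sketch only; `run_ncv` and the grid values are hypothetical stand-ins, not the repository's actual code:

```r
library(drake)

# Hypothetical worker: runs one nested cv configuration and returns its metrics
run_ncv <- function(repeats, n) {
  data.frame(repeats = repeats, n = n, mae = runif(1))  # placeholder result
}

plan <- drake_plan(
  ncv_result = target(
    run_ncv(repeats, n),
    # cross() expands to one subtarget per (repeats, n) combination
    transform = cross(repeats = !!c(1, 2, 3),
                      n = !!c(100, 800, 5000))
  )
)

# make(plan) builds each subtarget and caches it on completion,
# so re-running after a dropped connection skips finished work.
```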
