
Commit 9fa0b2f

Author: ercbk
Committed: readme edits, finished perf exp n=100,800,2000
1 parent 3e9fabc commit 9fa0b2f

File tree

4 files changed (+141, -53 lines)


README.Rmd

Lines changed: 68 additions & 10 deletions
@@ -3,18 +3,21 @@ output: github_document
 ---
 
 # Nested Cross-Validation: Comparing Methods and Implementations
+### (In-progress)
 
-Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to simultaneously handle hyperparameter tuning and algorithm comparison. Using standard methods such as k-fold cross-validation in such situations results in significant increases in optimization bias. Nested cross-validation has been shown to produce low bias, out-of-sample error estimates even using datasets with only a few hundred rows and therefore gives a better judgemnet of generalization performance.
+Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to simultaneously handle hyperparameter tuning and algorithm comparison. Examples of such situations include proof of concept, start-ups, medical studies, time series, etc. Using standard methods such as k-fold cross-validation in these cases may result in significant increases in optimization bias. Nested cross-validation has been shown to produce low-bias out-of-sample error estimates even using datasets with only hundreds of rows, and therefore gives a better judgement of generalization performance.
 
-The primary issue with this technique is that it is computationally very expensive with potentially tens of 1000s of models being trained during the process. While researching this technique, I found two methods of performing nested cross-validation — one authored by [Sabastian Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb) and the other by [Max Kuhn and Kjell Johnson](https://tidymodels.github.io/rsample/articles/Applications/Nested_Resampling.html).
-This experiment seeks to answer two questions:
+The primary issue with this technique is that it is computationally very expensive, with potentially tens of thousands of models being trained during the process. While researching this technique, I found two slightly different methods of performing nested cross-validation — one authored by [Sebastian Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb) and the other by [Max Kuhn and Kjell Johnson](https://tidymodels.github.io/rsample/articles/Applications/Nested_Resampling.html).
+I'll be examining two aspects of nested cross-validation:
 
-1. What's the fastest implementation of each method?
-2. How many repeats, given the size of this dataset, should we expect to need to obtain a reasonably accurate out-of-sample error estimate?
+1. Duration: Which packages and functions give us the fastest implementation of each method?
+2. Performance: First, develop a testing framework. Then, using a generated dataset, find how many repeats, given the number of samples, we should expect to need in order to obtain a reasonably accurate out-of-sample error estimate.
 
 With regard to the question of speed, I'll be testing implementations of both methods from various packages, which include {tune}, {mlr3}, {h2o}, and {sklearn}.
 
-Duration experiment details:
+
+## Duration Experiment
+Experiment details:
 
 * Random Forest and Elastic Net Regression algorithms
 * Both with 100x2 hyperparameter grids
@@ -37,11 +40,9 @@ Various elements of the technique can be altered to improve performance. These i
 3. Inner-Loop CV strategy
 4. Grid search strategy
 
-For the performance experiment (question 2), the fastest implementation of each method will be used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy. The chosen algorithm and hyperparameters will predict on a 100K row simulated dataset and the mean absolute error will be calculated for each combination of repeat, data size, and method.
-
-
+These elements also affect the run times. Both methods will be using the same size grids, but Kuhn-Johnson uses repeats and more folds in the outer and inner loops, while Raschka's trains an extra model over the entire training set at the end. Using Kuhn-Johnson, 50,000 models will be trained for each algorithm — using Raschka's, 1,001 models.
 
-Progress (duration in seconds)
+MLFlow was used to keep track of the duration (seconds) of each run along with the implementation and method used. I've used "implementation" to describe the various changes in coding structure that accompany using each package's functions. A couple of examples are the Python for-loop being replaced with a while-loop and the `iter_next` function when using {reticulate}, and {mlr3} relying entirely on R's R6 object-oriented programming system.
 
 ![](duration-experiment/outputs/0225-results.png)
 
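A quick way to see where totals like 50,000 and 1,001 can come from is to write out the resampling structure and multiply. The sketch below builds a Kuhn-Johnson-style object with {rsample}'s `nested_cv()`; the repeat, fold, and grid settings are illustrative assumptions chosen only because they reproduce the quoted totals, not necessarily the settings used in this repo.

```r
library(rsample)

# Kuhn-Johnson-style resampling skeleton (illustrative settings only)
set.seed(2020)
dat <- data.frame(y = rnorm(500), x1 = rnorm(500), x2 = rnorm(500))

ncv <- nested_cv(
  dat,
  outside = vfold_cv(v = 10, repeats = 2),  # outer loop: repeated k-fold
  inside  = bootstraps(times = 25)          # inner loop: bootstrap resamples
)

# Model counts under these assumed settings:
grid_size <- 100
2 * 10 * 25 * grid_size   # Kuhn-Johnson-style inner-loop fits: 50,000
5 * 2 * grid_size + 1     # Raschka-style 5x2cv fits plus one final fit: 1,001
```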
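The kind of tracking described above can be sketched with the {mlflow} R package; the experiment name, parameter keys, and timed expression below are placeholders, not the repo's actual tracking code.

```r
library(mlflow)

mlflow_set_experiment(experiment_name = "nested-cv-duration")  # placeholder name

with(mlflow_start_run(), {
  start <- Sys.time()
  Sys.sleep(1)  # stand-in for a full nested cross-validation run
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))

  # Tag the run with how it was built, then log its duration in seconds
  mlflow_log_param("method", "kuhn-johnson")
  mlflow_log_param("implementation", "tune")
  mlflow_log_metric("duration", elapsed)
})
```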
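The `iter_next` change refers to {reticulate}'s iterator interface: a Python `for` loop over a generator becomes an R `while` loop that pulls items until the iterator is exhausted. A minimal sketch, using a simple built-in Python iterator as a stand-in for an sklearn split generator:

```r
library(reticulate)

# Stand-in for a Python generator (e.g., an sklearn cross-validator's splits)
it <- as_iterator(py_eval("iter(range(3))"))

# The Python `for item in gen:` pattern becomes a while-loop in R
while (!is.null(item <- iter_next(it))) {
  print(item)
}
```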
@@ -105,6 +106,63 @@ durations
 ```
 
 
+## Performance Experiment
+
+Experiment details:
+
+* The fastest implementation of each method will be used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop CV strategy.
+* The chosen algorithm and hyperparameters will be used to predict on a 100K row simulated dataset, and the mean absolute error will be calculated for each combination of repeat, data size, and method.
+* AWS
+* Drake
+
+```{r perf_build_times, echo=FALSE, message=FALSE}
+pacman::p_load(extrafont, dplyr, purrr, lubridate, ggplot2, drake)
+
+bt <- build_times(starts_with("ncv_results"), digits = 4)
+
+subtarget_bts <- bt %>%
+  filter(stringr::str_detect(target, pattern = "[0-9]_([0-9]|[a-z])")) %>%
+  select(target, elapsed)
+
+subtargets_raw <- map_dfr(subtarget_bts$target, function(x) {
+  results <- readd(x, character_only = TRUE) %>%
+    mutate(subtarget = x) %>%
+    select(subtarget, everything())
+
+}) %>%
+  inner_join(subtarget_bts, by = c("subtarget" = "target"))
+
+subtargets <- subtargets_raw %>%
+  mutate(repeats = factor(repeats),
+         n = factor(n),
+         elapsed = round(as.numeric(elapsed)/3600, 2))
+
+
+ggplot(subtargets, aes(y = elapsed, x = repeats,
+                       fill = n, label = elapsed)) +
+  geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
+  geom_text(hjust = 1.3, size = 3.5,
+            color = "white", position = position_dodge(width = 0.8)) +
+  coord_flip() +
+  labs(y = "Runtime (hrs)", x = "Repeats",
+       title = "Kuhn-Johnson", fill = "Sample Size") +
+  theme(title = element_text(family = "Roboto"),
+        text = element_text(family = "Roboto"),
+        legend.position = "top",
+        axis.ticks = element_blank(),
+        axis.text.x = element_blank(),
+        panel.background = element_rect(fill = "ivory",
+                                        colour = "ivory"),
+        plot.background = element_rect(fill = "ivory"),
+        panel.border = element_blank(),
+        panel.grid.major = element_blank(),
+        panel.grid.minor = element_blank()
+  )
+
+```
+
+
+
 
 References
 
README.md

Lines changed: 52 additions & 22 deletions
@@ -1,34 +1,43 @@
 
 # Nested Cross-Validation: Comparing Methods and Implementations
 
+### (In-progress)
+
 Nested cross-validation has become a recommended technique for
 situations in which the size of our dataset is insufficient to
 simultaneously handle hyperparameter tuning and algorithm comparison.
-Using standard methods such as k-fold cross-validation in such
-situations results in significant increases in optimization bias. Nested
-cross-validation has been shown to produce low bias, out-of-sample error
-estimates even using datasets with only a few hundred rows and therefore
-gives a better judgemnet of generalization performance.
+Examples of such situations include proof of concept, start-ups,
+medical studies, time series, etc. Using standard methods such as k-fold
+cross-validation in these cases may result in significant increases in
+optimization bias. Nested cross-validation has been shown to produce
+low-bias out-of-sample error estimates even using datasets with only
+hundreds of rows, and therefore gives a better judgement of
+generalization performance.
 
 The primary issue with this technique is that it is computationally very
 expensive, with potentially tens of thousands of models being trained during
-the process. While researching this technique, I found two methods of
-performing nested cross-validation — one authored by [Sabastian
+the process. While researching this technique, I found two slightly
+different methods of performing nested cross-validation — one authored
+by [Sebastian
 Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb)
 and the other by [Max Kuhn and Kjell
 Johnson](https://tidymodels.github.io/rsample/articles/Applications/Nested_Resampling.html).
-This experiment seeks to answer two questions:
+I’ll be examining two aspects of nested cross-validation:
 
-1. What’s the fastest implementation of each method?
-2. How many repeats, given the size of this dataset, should we expect
-   to need to obtain a reasonably accurate out-of-sample error
-   estimate?
+1. Duration: Which packages and functions give us the fastest
+   implementation of each method?
+2. Performance: First, develop a testing framework. Then, using a
+   generated dataset, find how many repeats, given the number of
+   samples, we should expect to need in order to obtain a reasonably
+   accurate out-of-sample error estimate.
 
 With regard to the question of speed, I’ll be testing
 implementations of both methods from various packages, which include
 {tune}, {mlr3}, {h2o}, and {sklearn}.
 
-Duration experiment details:
+## Duration Experiment
+
+Experiment details:
 
   - Random Forest and Elastic Net Regression algorithms
   - Both with 100x2 hyperparameter grids
@@ -52,22 +61,43 @@ These include:
 3. Inner-Loop CV strategy
 4. Grid search strategy
 
-For the performance experiment (question 2), the fastest implementation
-of each method will be used in running a nested cross-validation with
-different sizes of data ranging from 100 to 5000 observations and
-different numbers of repeats of the outer-loop cv strategy. The chosen
-algorithm and hyperparameters will predict on a 100K row simulated
-dataset and the mean absolute error will be calculated for each
-combination of repeat, data size, and method.
-
-Progress (duration in seconds)
+These elements also affect the run times. Both methods will be using the
+same size grids, but Kuhn-Johnson uses repeats and more folds in the
+outer and inner loops, while Raschka’s trains an extra model over the
+entire training set at the end. Using Kuhn-Johnson, 50,000 models will
+be trained for each algorithm — using Raschka’s, 1,001 models.
+
+MLFlow was used to keep track of the duration (seconds) of each run
+along with the implementation and method used. I’ve used
+“implementation” to describe the various changes in coding structure
+that accompany using each package’s functions. A couple of examples are
+the Python for-loop being replaced with a while-loop and the `iter_next`
+function when using {reticulate}, and {mlr3} relying entirely on R’s R6
+object-oriented programming system.
 
 ![](duration-experiment/outputs/0225-results.png)
 
 ![](duration-experiment/outputs/duration-pkg-tbl.png)
 
 ![](README_files/figure-gfm/unnamed-chunk-1-1.png)<!-- -->
 
+## Performance Experiment
+
+Experiment details:
+
+  - The fastest implementation of each method will be used in running a
+    nested cross-validation with different sizes of data ranging from
+    100 to 5000 observations and different numbers of repeats of the
+    outer-loop CV strategy.
+  - The chosen algorithm and hyperparameters will be used to predict on
+    a 100K row simulated dataset, and the mean absolute error will be
+    calculated for each combination of repeat, data size, and method.
+  - AWS
+  - Drake
+
+![](README_files/figure-gfm/perf_build_times-1.png)<!-- -->
+
 References
 
 Boulesteix, AL, and C Strobl. 2009. “Optimal Classifier Selection and
12.4 KB file (diff not shown)

performance-experiment/Kuhn-Johnson/plan-kj.R

Lines changed: 21 additions & 21 deletions
@@ -71,28 +71,28 @@ plan <- drake_plan(
             error_FUN,
             method),
     dynamic = map(ncv_dat_800)
+  ),
+
+  # sample size = 2000
+  sim_dat_2000 = mlbench_data(2000),
+  params_list_2000 = create_grids(sim_dat_2000,
+                                  algorithms,
+                                  size = grid_size),
+  ncv_dat_2000 = create_ncv_objects(sim_dat_2000,
+                                    repeats,
+                                    method),
+  ncv_results_2000 = target(
+    run_ncv(ncv_dat_2000,
+            sim_dat_2000,
+            large_dat,
+            mod_FUN_list,
+            params_list_2000,
+            error_FUN,
+            method),
+    dynamic = map(ncv_dat_2000)
   )#,
-  #
-  #   # sample size = 2000
-  #   sim_dat_2000 = mlbench_data(2000),
-  #   params_list_2000 = create_grids(sim_dat_2000,
-  #                                   algorithms,
-  #                                   size = grid_size),
-  #   ncv_dat_2000 = create_ncv_objects(sim_dat_2000,
-  #                                     repeats,
-  #                                     method),
-  #   ncv_results_2000 = target(
-  #     run_ncv(ncv_dat_2000,
-  #             sim_dat_2000,
-  #             large_dat,
-  #             mod_FUN_list,
-  #             params_list_2000,
-  #             error_FUN,
-  #             method),
-  #     dynamic = map(ncv_dat_2000)
-  #   ),
-  #
-  #   # sample size = 5000
+
+  # sample size = 5000
   #   sim_dat_5000 = mlbench_data(5000),
   #   params_list_5000 = create_grids(sim_dat_5000,
   #                                   algorithms,
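The plan's `mlbench_data()` helper isn't shown in this commit. As a rough idea of what simulating a regression dataset of a given size with {mlbench} looks like, here is a hypothetical stand-in that uses the Friedman 1 benchmark (the actual generator used in the repo may differ):

```r
library(mlbench)

# Hypothetical stand-in for the plan's mlbench_data() helper: simulate a
# regression dataset with n rows using the Friedman 1 benchmark function
make_sim_data <- function(n, sd = 1) {
  sim <- mlbench.friedman1(n = n, sd = sd)
  data.frame(sim$x, y = sim$y)
}

sim_dat_2000 <- make_sim_data(2000)    # a training-size sample
large_dat    <- make_sim_data(100000)  # a large assessment set
dim(large_dat)
```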
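The `dynamic = map(...)` calls in the plan are {drake}'s dynamic branching: one subtarget is built per element of the mapped object, which is why the README chunk gathers `build_times()` for the subtargets and reads them back with `readd()`. A minimal, self-contained sketch of that pattern with stand-in targets, not the repo's actual plan:

```r
library(drake)

plan <- drake_plan(
  repeats = c(2, 5, 10),    # stand-in for the nested-CV objects
  result  = target(
    repeats * 100,          # stand-in for run_ncv()
    dynamic = map(repeats)  # one subtarget per element of `repeats`
  )
)

make(plan)

build_times(starts_with("result"), digits = 4)  # build times for the target and its subtargets
readd(result)                                   # combined subtarget results
```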
