
Commit 863b592

Author: ercbk (committed)
fixed runtime chart bar labels; added performance results csv; included results section in readme
1 parent bb2ccad commit 863b592

File tree

6 files changed: +92 -39 lines changed


.gitignore

Lines changed: 2 additions & 1 deletion
@@ -6,4 +6,5 @@
.drake
ec2-ssh-raw.log
README_cache
-check-results.R
+check-results.R
+perf-exp-output-backup.rds

README.Rmd

Lines changed: 30 additions & 15 deletions
@@ -7,7 +7,7 @@ output: github_document

![](images/ncv.png)

-Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to simultaneously handle hyperparameter tuning and algorithm comparison. Examples of such situations include: proof of concept, start-ups, medical studies, time series, etc. Using standard methods such as k-fold cross-validation in these cases may result in significant increases in optimization bias. Nested cross-validation has been shown to produce low bias, out-of-sample error estimates even using datasets with only hundreds of rows and therefore gives a better judgement of generalization performance.
+Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to simultaneously handle hyperparameter tuning and algorithm comparison. Examples of such situations include: proof of concept, start-ups, medical studies, time series, etc. Using standard methods such as k-fold cross-validation in these cases may result in substantial increases in optimization bias. Nested cross-validation has been shown to produce less biased, out-of-sample error estimates even using datasets with only hundreds of rows and therefore gives a better judgement of generalization performance.

The primary issue with this technique is that it is computationally very expensive, with potentially tens of thousands of models being trained during the process. While researching this technique, I found two slightly different methods of performing nested cross-validation — one authored by [Sebastian Raschka](https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/11_eval4-algo/code/11-eval4-algo__nested-cv_verbose1.ipynb) and the other by [Max Kuhn and Kjell Johnson](https://tidymodels.github.io/rsample/articles/Applications/Nested_Resampling.html).
I'll be examining two aspects of nested cross-validation:
@@ -17,7 +17,7 @@ I'll be examining two aspects of nested cross-validation:


## Duration Experiment
-Experiment details:
+##### Experiment details:

* Random Forest and Elastic Net Regression algorithms
* Both with 100x2 hyperparameter grids
@@ -40,9 +40,9 @@ Various elements of the technique can be altered to improve performance. These i
3. Inner-Loop CV strategy
4. Grid search strategy

-These elements also affect the run times. Both methods will be using the same size grids, but Kuhn-Johnson uses repeats and more folds in the outer and inner loops while Raschka's trains an extra model over the entire training set at the end at the end. Using Kuhn-Johnson, 50,000 models will be trained for each algorithm — using Raschka's, 1,001 models.
+These elements also affect the run times. Both methods will be using the same size grids, but Kuhn-Johnson uses repeats and more folds in the outer and inner loops while Raschka's trains an extra model over the entire training set at the end. Using Kuhn-Johnson, 50,000 models (grid size * number of repeats * number of folds in the outer-loop * number of folds/resamples in the inner-loop) will be trained for each algorithm — using Raschka's, 1,001 models.

-MLFlow was used to keep track of the duration (seconds) of each run along with the implementation and method used. I've used implementation to describe the various changes in coding structures that accompanies using each package's functions. A couple examples are the python for-loop being replaced with a while-loop and `iter_next` function when using {reticulate} and {mlr3} entirely using R's R6 Object Oriented Programming system.
+[MLFlow](https://mlflow.org/docs/latest/index.html) was used to keep track of the duration (seconds) of each run along with the implementation and method used. I've used "implementation" to encapsulate not only the combinations of various model functions, but also the various changes in coding structure that accompany using each package's functions, i.e. I can't just plug-and-play different packages' model functions into the same script.

![](duration-experiment/outputs/0225-results.png)
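
The model counts quoted above can be sanity-checked with some quick arithmetic. The repeat/fold numbers below are assumptions chosen only because they are consistent with the 50,000 and 1,001 totals (and the 100-row grids) stated in that paragraph; the experiments' exact settings live in the repo's scripts.

```r
grid_size <- 100                          # 100-row hyperparameter grid per algorithm

# Kuhn-Johnson: grid size * repeats * outer-loop folds * inner-loop resamples
kj_models <- grid_size * 2 * 10 * 25      # assumed 2 repeats, 10 outer folds, 25 inner resamples
kj_models
#> [1] 50000

# Raschka: grid size * outer-loop folds * inner-loop folds, plus one final
# model fit on the entire training set
raschka_models <- grid_size * 5 * 2 + 1   # assumed 5 outer folds, 2 inner folds
raschka_models
#> [1] 1001
```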

@@ -53,7 +53,7 @@ pacman::p_load(extrafont, dplyr, ggplot2, patchwork, stringr, tidytext)



-runs_raw <- readr::read_rds("data/duration-runs.rds")
+runs_raw <- readr::read_rds("duration-experiment/outputs/duration-runs.rds")

@@ -108,18 +108,18 @@ durations

## Performance Experiment

-Experiment details:
+##### Experiment details:

-* The fastest implementation of each method will be used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy.
+* The fastest implementation of each method was used in running a nested cross-validation with different sizes of data ranging from 100 to 5000 observations and different numbers of repeats of the outer-loop cv strategy.
* The {mlr3} implementation was the fastest for Raschka's method, but the Ranger-Kuhn-Johnson implementation was close. To simplify, I'll be using [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R) for both methods.
-* The chosen algorithm and hyperparameters will be used to predict on a 100K row simulated dataset.
-* The percent error between the the average mean absolute error (MAE) across the outer-loop folds and the MAE of the predictions on this 100K dataset will be calculated for each combination of repeat, data size, and method.
-* To make this experiment manageable in terms of runtimes, I'm using AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge for Random Forest.
-* Iterating through different numbers of repeats, sample sizes, and methods makes a functional approach more appropriate than running imperative scripts. Also, given the long runtimes and impermanent nature of my internet connection, it would also be nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I'm using it to orchestrate.
+* The chosen algorithm and hyperparameters were used to predict on a 100K row simulated dataset.
+* The percent error between the average mean absolute error (MAE) across the outer-loop folds and the MAE of the predictions on this 100K dataset was calculated for each combination of repeat, data size, and method.
+* To make this experiment manageable in terms of runtimes, I used AWS instances: an r5.2xlarge for the Elastic Net and an r5.24xlarge for Random Forest.
+* Iterating through different numbers of repeats, sample sizes, and methods makes a functional approach more appropriate than running imperative scripts. Also, given the long runtimes and impermanent nature of my internet connection, it would also be nice to cache each iteration as it finishes. The [{drake}](https://github.com/ropensci/drake) package is superb on both counts, so I used it to orchestrate (a minimal sketch follows this list).
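
For readers unfamiliar with {drake}, here is a minimal, hypothetical sketch of the kind of plan that last bullet describes. `run_ncv()` is a stand-in for the repo's nested cross-validation function, and the target layout is illustrative rather than the repo's actual plan; the point is that each n/repeats combination becomes its own cached target, so an interrupted run can resume where it left off.

```r
library(drake)

plan <- drake_plan(
  ncv_results = target(
    run_ncv(n = n, repeats = repeats),   # run_ncv() is a hypothetical stand-in
    transform = cross(n = c(100, 800, 2000, 5000), repeats = c(1, 2, 3, 4, 5))
  )
)

make(plan)   # builds and caches each sub-target as it finishes

# elapsed build times per sub-target, as used in the chunk below
build_times(starts_with("ncv_results"), digits = 4)
```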

-```{r perf_build_times, echo=FALSE, message=FALSE, cache=FALSE}
+```{r perf_build_times, echo=FALSE, message=FALSE}

-pacman::p_load(extrafont,dplyr, purrr, lubridate, ggplot2, drake)
+pacman::p_load(extrafont, dplyr, purrr, lubridate, ggplot2, ggfittext, drake)
bt <- build_times(starts_with("ncv_results"), digits = 4)

subtarget_bts <- bt %>%
@@ -140,6 +140,9 @@ subtargets <- subtargets_raw %>%
elapsed = round(as.numeric(elapsed)/3600, 2),
percent_error = round(delta_error/oos_error, 3))

+readr::write_csv(subtargets, "performance-experiment/output/perf-exp-output.csv")
+# readr::write_rds(subtargets, "performance-experiment/output/perf-exp-output-backup.rds")
+
```

```{r perf_bt_charts, echo=FALSE, message=FALSE}
@@ -150,8 +153,10 @@ ggplot(subtargets, aes(y = elapsed, x = repeats,
fill = n, label = elapsed)) +
geom_col(position = position_dodge(width = 0.85)) +
scale_fill_manual(values = fill_colors[4:7]) +
-geom_text(hjust = 1.3, size = 3.5,
-color = "white", position = position_dodge(width = 0.85)) +
+# geom_text(hjust = 1.3, size = 3.5,
+# color = "white", position = position_dodge(width = 0.85)) +
+geom_bar_text(position = "dodge", min.size = 3.5,
+place = "right", contrast = TRUE) +
coord_flip() +
labs(y = "Runtime (hrs)", x = "Repeats",
title = "Kuhn-Johnson", fill = "Sample Size") +
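
Since the bar-label fix is the headline change in this commit, here is a small self-contained example (toy data with values borrowed from the results csv below, not the actual chart code) showing the {ggfittext} call that replaces `geom_text()`: `geom_bar_text()` shrinks or drops labels that don't fit and flips their colour for contrast on dodged bars.

```r
library(ggplot2)
library(ggfittext)

toy <- data.frame(
  repeats = factor(rep(1:2, each = 2)),
  n       = factor(rep(c(100, 5000), times = 2)),
  elapsed = c(0.15, 2.23, 0.40, 4.46)
)

ggplot(toy, aes(x = repeats, y = elapsed, fill = n, label = elapsed)) +
  geom_col(position = position_dodge(width = 0.85)) +
  geom_bar_text(position = "dodge", min.size = 3.5,
                place = "right", contrast = TRUE) +
  coord_flip() +
  labs(y = "Runtime (hrs)", x = "Repeats", fill = "Sample Size")
```
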
@@ -196,6 +201,16 @@ ggplot(subtargets, aes(x = repeats, y = percent_error, group = n)) +
)
```

+##### Results:
+
+Kuhn-Johnson:
+
+* Runtimes for n = 100 and n = 800 are close, and there's a large jump in runtime going from n = 2000 to n = 5000.
+* The number of repeats had little effect on the amount of percent error.
+* For n = 100, there is substantially more variation in percent error than in the other sample sizes.
+* While there is a large runtime cost that comes with increasing the sample size from 2000 to 5000 observations, it doesn't seem to provide any benefit in gaining a more accurate estimate of the out-of-sample error.
+


References

README.md

Lines changed: 39 additions & 23 deletions
@@ -10,9 +10,9 @@ situations in which the size of our dataset is insufficient to
simultaneously handle hyperparameter tuning and algorithm comparison.
Examples of such situations include: proof of concept, start-ups,
medical studies, time series, etc. Using standard methods such as k-fold
-cross-validation in these cases may result in significant increases in
-optimization bias. Nested cross-validation has been shown to produce low
-bias, out-of-sample error estimates even using datasets with only
+cross-validation in these cases may result in substantial increases in
+optimization bias. Nested cross-validation has been shown to produce
+less biased, out-of-sample error estimates even using datasets with only
hundreds of rows and therefore gives a better judgement of
generalization performance.
@@ -35,7 +35,7 @@ I’ll be examining two aspects of nested cross-validation:

## Duration Experiment

-Experiment details:
+##### Experiment details:

  - Random Forest and Elastic Net Regression algorithms
  - Both with 100x2 hyperparameter grids
@@ -63,16 +63,17 @@ These elements also affect the run times. Both methods will be using the
same size grids, but Kuhn-Johnson uses repeats and more folds in the
outer and inner loops while Raschka’s trains an extra model over the
entire training set at the end. Using Kuhn-Johnson, 50,000
-models will be trained for each algorithm — using Raschka’s, 1,001
-models.
-
-MLFlow was used to keep track of the duration (seconds) of each run
-along with the implementation and method used. I’ve used implementation
-to describe the various changes in coding structures that accompanies
-using each package’s functions. A couple examples are the python
-for-loop being replaced with a while-loop and `iter_next` function when
-using {reticulate} and {mlr3} entirely using R’s R6 Object Oriented
-Programming system.
+models (grid size \* number of repeats \* number of folds in the
+outer-loop \* number of folds/resamples in the inner-loop) will be
+trained for each algorithm — using Raschka’s, 1,001 models.
+
+[MLFlow](https://mlflow.org/docs/latest/index.html) was used to keep
+track of the duration (seconds) of each run along with the
+implementation and method used. I’ve used “implementation” to
+encapsulate not only the combinations of various model functions, but
+also the various changes in coding structure that accompany using each
+package’s functions, i.e. I can’t just plug-and-play different
+packages’ model functions into the same script.

![](duration-experiment/outputs/0225-results.png)
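
A hedged sketch of the kind of MLFlow logging described above, using the {mlflow} R API; the parameter names and values here are illustrative, and the repo's actual logging code may differ.

```r
library(mlflow)

run <- mlflow_start_run()

mlflow_log_param("implementation", "ranger")       # illustrative labels
mlflow_log_param("method", "kuhn-johnson")

duration <- system.time(
  Sys.sleep(1)                                     # stand-in for one nested-cv run
)[["elapsed"]]

mlflow_log_metric("duration", duration)
mlflow_end_run()
```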

@@ -82,9 +83,9 @@ Programming system.

## Performance Experiment

-Experiment details:
+##### Experiment details:

-  - The fastest implementation of each method will be used in running a
+  - The fastest implementation of each method was used in running a
    nested cross-validation with different sizes of data ranging from
    100 to 5000 observations and different numbers of repeats of the
    outer-loop cv strategy.
@@ -93,27 +94,42 @@ Experiment details:
    simplify, I’ll be using
    [Ranger-Kuhn-Johnson](https://github.com/ercbk/nested-cross-validation-comparison/blob/master/duration-experiment/kuhn-johnson/nested-cv-ranger-kj.R)
    for both methods.
-  - The chosen algorithm and hyperparameters will be used to predict on
-    a 100K row simulated dataset.
+  - The chosen algorithm and hyperparameters were used to predict on a
+    100K row simulated dataset.
  - The percent error between the average mean absolute error (MAE)
    across the outer-loop folds and the MAE of the predictions on this
-    100K dataset will be calculated for each combination of repeat, data
+    100K dataset was calculated for each combination of repeat, data
    size, and method.
-  - To make this experiment manageable in terms of runtimes, I’m using
-    AWS instances: a r5.2xlarge for the Elastic Net and a r5.24xlarge
-    for Random Forest.
+  - To make this experiment manageable in terms of runtimes, I used AWS
+    instances: an r5.2xlarge for the Elastic Net and an r5.24xlarge for
+    Random Forest.
  - Iterating through different numbers of repeats, sample sizes, and
    methods makes a functional approach more appropriate than running
    imperative scripts. Also, given the long runtimes and impermanent
    nature of my internet connection, it would also be nice to cache
    each iteration as it finishes. The
    [{drake}](https://github.com/ropensci/drake) package is superb on
-    both counts, so I’m using it to orchestrate.
+    both counts, so I used it to orchestrate.

![](README_files/figure-gfm/perf_bt_charts-1.png)<!-- -->

![](README_files/figure-gfm/perf-error-line-1.png)<!-- -->

+##### Results:
+
+Kuhn-Johnson:
+
+  - Runtimes for n = 100 and n = 800 are close, and there’s a large jump
+    in runtime going from n = 2000 to n = 5000.
+  - The number of repeats had little effect on the amount of percent
+    error.
+  - For n = 100, there is substantially more variation in percent error
+    than in the other sample sizes.
+  - While there is a large runtime cost that comes with increasing the
+    sample size from 2000 to 5000 observations, it doesn’t seem to
+    provide any benefit in gaining a more accurate estimate of the
+    out-of-sample error.
+
References

Boulesteix, AL, and C Strobl. 2009. “Optimal Classifier Selection and
(binary image file changed, 5.41 KB)
File renamed without changes.
performance-experiment/output/perf-exp-output.csv (new file; path per the write_csv() call in README.Rmd)

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+subtarget,n,repeats,method,oos_error,ncv_error,delta_error,chosen_algorithm,mixture,penalty,mtry,trees,elapsed,percent_error
+ncv_results_100_0108d912,100,5,kj,2.19359,2.01424,0.1793499999999999,glmnet,0.50424303883221,0.2211151988375703,NA,NA,1.36,0.082
+ncv_results_100_7aaa57d2,100,1,kj,2.19359,2.04781,0.1457799999999998,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.15,0.066
+ncv_results_100_97e7fe04,100,2,kj,2.19359,1.99077,0.20282,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.4,0.092
+ncv_results_100_9d044993,100,4,kj,2.19359,1.99643,0.19716,glmnet,0.50424303883221,0.2211151988375703,NA,NA,0.97,0.09
+ncv_results_100_ea11bf8d,100,3,kj,2.19262,2.01702,0.17559999999999976,glmnet,0.5809470646083355,0.16010254880830843,NA,NA,0.65,0.08
+ncv_results_2000_47742c31,2000,4,kj,1.38697,1.37171,0.015260000000000051,rf,NA,NA,5,1779,2.96,0.011
+ncv_results_2000_746435d6,2000,5,kj,1.39092,1.37625,0.01466999999999996,rf,NA,NA,5,1779,3.71,0.011
+ncv_results_2000_7d80d14d,2000,1,kj,1.38466,1.36553,0.01913000000000009,rf,NA,NA,5,1948,0.74,0.014
+ncv_results_2000_80d2e33a,2000,3,kj,1.38955,1.3711,0.018450000000000077,rf,NA,NA,5,1948,2.22,0.013
+ncv_results_2000_c16e9aff,2000,2,kj,1.38739,1.37015,0.017239999999999922,rf,NA,NA,5,1948,1.48,0.012
+ncv_results_5000_20d7ace1,5000,4,kj,1.24192,1.25837,0.016450000000000076,rf,NA,NA,5,1573,8.92,0.013
+ncv_results_5000_2a916af4,5000,5,kj,1.24272,1.25644,0.013719999999999954,rf,NA,NA,5,1664,11.13,0.011
+ncv_results_5000_7b1fdb55,5000,2,kj,1.24336,1.2612,0.017840000000000078,rf,NA,NA,5,1351,4.46,0.014
+ncv_results_5000_7b6f8e72,5000,1,kj,1.24304,1.25709,0.014050000000000118,rf,NA,NA,5,1664,2.23,0.011
+ncv_results_5000_d380966a,5000,3,kj,1.24267,1.25724,0.014569999999999972,rf,NA,NA,5,1365,6.69,0.012
+ncv_results_800_3b54c7f8,800,1,kj,1.63668,1.58422,0.05245999999999995,rf,NA,NA,6,1507,0.26,0.032
+ncv_results_800_3f87e120,800,2,kj,1.6333,1.58689,0.04641000000000006,rf,NA,NA,6,1168,0.51,0.028
+ncv_results_800_50b46544,800,4,kj,1.63707,1.58522,0.05184999999999995,rf,NA,NA,6,1693,1.09,0.032
+ncv_results_800_589454bb,800,3,kj,1.63456,1.5905,0.04405999999999999,rf,NA,NA,6,1168,0.76,0.027
+ncv_results_800_a2c27fe0,800,5,kj,1.63489,1.58745,0.04743999999999993,rf,NA,NA,6,1507,1.52,0.029
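
As a quick sanity check on the table above: `percent_error` is just `delta_error / oos_error` rounded to three digits, the same calculation that appears in the README.Rmd chunk earlier in this commit. Assuming the file is read from the path used by `write_csv()` above:

```r
library(dplyr)

results <- readr::read_csv("performance-experiment/output/perf-exp-output.csv")

results %>%
  mutate(percent_error_check = round(delta_error / oos_error, 3)) %>%
  select(subtarget, n, repeats, oos_error, delta_error, percent_error, percent_error_check)
```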
