---
output: github_document
---

# Nested Cross-Validation: Comparing Methods and Implementations

Nested cross-validation has become a recommended technique for situations in which the size of our dataset is insufficient to handle both hyperparameter tuning and algorithm comparison. Using standard methods such as k-fold cross-validation in such situations results in significant increases in optimization bias. Nested cross-validation has been shown to produce low bias in out-of-sample error estimates even using datasets with only a few hundred rows.
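
To make the procedure concrete, below is a minimal sketch of nested cross-validation for a single algorithm, written with base R and {ranger}. The fold counts, the small `mtry` grid, the `y` outcome column, and the choice of MAE are illustrative assumptions rather than the exact settings of any implementation compared here.

```r
# A minimal nested CV sketch -- all settings here are assumed for illustration
library(ranger)

nested_cv_mae <- function(dat, outer_k = 5, inner_k = 5, mtry_grid = c(2, 4, 6)) {
  # Randomly assign every row to an outer fold
  outer_folds <- sample(rep(seq_len(outer_k), length.out = nrow(dat)))
  outer_mae   <- numeric(outer_k)

  for (i in seq_len(outer_k)) {
    train <- dat[outer_folds != i, ]
    test  <- dat[outer_folds == i, ]

    # Inner loop: pick mtry by cross-validated MAE using the outer-training data only
    inner_folds <- sample(rep(seq_len(inner_k), length.out = nrow(train)))
    grid_mae <- vapply(mtry_grid, function(m) {
      mean(vapply(seq_len(inner_k), function(j) {
        fit  <- ranger(y ~ ., data = train[inner_folds != j, ], mtry = m, num.trees = 300)
        pred <- predict(fit, data = train[inner_folds == j, ])$predictions
        mean(abs(pred - train$y[inner_folds == j]))
      }, numeric(1)))
    }, numeric(1))

    # Refit the winning configuration on the full outer-training set and
    # score it once on the untouched outer test fold
    fit  <- ranger(y ~ ., data = train, mtry = mtry_grid[which.min(grid_mae)], num.trees = 300)
    pred <- predict(fit, data = test)$predictions
    outer_mae[i] <- mean(abs(pred - test$y))
  }

  mean(outer_mae)  # the out-of-sample error estimate
}
```

Re-running this whole procedure several times with fresh random fold assignments and averaging the estimates is what the *repeats* in question 2 below refer to.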

The primary issue with this technique is that it is computationally very expensive, with potentially tens of thousands of models being trained during the process (a rough count is sketched after the two questions below). This experiment seeks to answer two questions:

1. Which implementation is fastest?
2. How many *repeats*, given the size of the training set, should we expect to need in order to obtain a reasonably accurate out-of-sample error estimate?
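
For a rough sense of the cost, here is one way the model count can add up; the fold counts, grid size, and number of repeats are assumptions chosen only for illustration.

```r
# Assumed configuration: 10-fold outer CV, 10-fold inner CV,
# 100 hyperparameter candidates, and 3 repeats of the outer loop
outer_folds <- 10
inner_folds <- 10
grid_size   <- 100
repeats     <- 3

# Inner-loop fits plus one final refit per outer fold, per repeat
repeats * outer_folds * (inner_folds * grid_size + 1)  # 30,030 model fits for one algorithm
```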

Duration experiment details:

(The sizes of the data sets are the same as those in the original scripts by the authors.)

Various elements of the technique can be altered to improve performance. These include:

1. Hyperparameter value grids
2. Outer-Loop CV strategy
3. Inner-Loop CV strategy
4. Grid search strategy
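
As an example of where these knobs live in code, the objects below show one way they could be expressed with {rsample} and {dials}; the particular resampling schemes, parameter ranges, and grid sizes are assumptions for illustration.

```r
library(rsample)
library(dials)

# 1. Hyperparameter value grid (here a regular grid over two Random Forest parameters)
rf_grid <- grid_regular(mtry(range = c(2L, 8L)), min_n(), levels = 5)

# 2. Outer-loop CV strategy, e.g. repeated k-fold
outer_rs <- vfold_cv(mtcars, v = 10, repeats = 3)

# 3. Inner-loop CV strategy, e.g. bootstrap resampling instead of k-fold
inner_rs <- bootstraps(mtcars, times = 25)

# 4. Grid search strategy: exhaustive search over rf_grid, or a random grid instead
rf_grid_random <- grid_random(mtry(range = c(2L, 8L)), min_n(), size = 20)
```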

For the performance experiment (question 2), I'll vary the number of repeats of the outer-loop CV strategy for each method. The fastest implementation of each method will be tuned on data sets ranging in size from 100 to 5,000 observations, and the mean absolute error will be calculated for each combination of repeat, data size, and method.
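
A sketch of what that measurement loop could look like is below, reusing the `nested_cv_mae()` helper sketched earlier; `sim_data()` is an assumed stand-in for the real data sets, and the sizes and repeat counts shown are not the exact experimental grid.

```r
# Assumed stand-in for the real data sets: a simple simulated regression problem
sim_data <- function(n, p = 10) {
  x <- matrix(rnorm(n * p), n, p)
  data.frame(y = as.vector(x %*% rnorm(p)) + rnorm(n), x)
}

# Assumed grid of experimental conditions
experiment <- expand.grid(n = c(100, 800, 2000, 5000), repeats = 1:5)

# For each condition: draw a data set, repeat the nested CV that many times,
# and record the averaged out-of-sample MAE estimate.
# nested_cv_mae() is the helper sketched earlier in this README.
experiment$mae <- mapply(function(n, r) {
  dat <- sim_data(n)
  mean(replicate(r, nested_cv_mae(dat)))
}, experiment$n, experiment$repeats)
```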

Notes:

1. I'm using a 4-core, 16 GB RAM machine.
2. "parsnip" refers to scripts where both the Elastic Net and Ranger Random Forest model functions come from {parsnip}.
3. "ranger" means the Random Forest model function that's used comes directly from the {ranger} package (the sketch after these notes contrasts the two).
4. In "sklearn", the Random Forest model function comes from scikit-learn.
5. "ranger-kj" uses all the Kuhn-Johnson loop functions and the {ranger} Random Forest model function to execute Raschka's method.