A Predictive Analytics Project
Author: A K M Intisar Islam
Course: MATH 5383 – Predictive Analytics
Language: R
This project explores linear classification using logistic regression, implemented through the Iteratively Reweighted Least Squares (IRLS) algorithm. It systematically examines how logistic regression behaves under different conditions—such as linear separability, regularization, outliers, dataset size, and class imbalance.
The project demonstrates how L2 regularization (ridge penalty) enhances the stability, convergence, and generalization of logistic regression, especially when data contain noise or outliers.
Key objectives:
- Implement the IRLS algorithm for logistic regression using Newton–Raphson optimization.
- Compare unregularized and L2-regularized logistic regression.
- Study the effects of dataset size, balance, and outliers.
- Visualize decision boundaries and log-likelihood surfaces.
- Evaluate performance using an 80–20 train–test split.
Synthetic datasets were generated with two Gaussian clusters (a small generation sketch in R follows the experiment list below):
| Parameter | Value |
|---|---|
| Observations | n = 50 (later 500) |
| Predictors | m = 2 (later 4) |
| Class ratio | 50/50 (later 40/60) |
| Standard deviation | 1 |
| Cluster means | (3, 3) and (7, 7) |
Later experiments introduced:
- Larger datasets (n = 500)
- Imbalanced class distributions (40%/60%)
- Artificial outliers to test robustness
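As a rough illustration, a two-cluster generator matching the setup above might look as follows in R. The function name `make_clusters`, its defaults, and the seed are assumptions for this sketch, not the project's actual code.

```r
# Sketch: two Gaussian clusters with means (3, 3) and (7, 7), sd = 1.
# prop controls the class balance (0.5 = balanced, 0.4 = the 40/60 split).
make_clusters <- function(n = 50, means = list(c(3, 3), c(7, 7)),
                          sd = 1, prop = 0.5, seed = 1) {
  set.seed(seed)
  n0 <- round(n * prop)                          # class 0 size
  n1 <- n - n0                                   # class 1 size
  data.frame(
    x1 = c(rnorm(n0, means[[1]][1], sd), rnorm(n1, means[[2]][1], sd)),
    x2 = c(rnorm(n0, means[[1]][2], sd), rnorm(n1, means[[2]][2], sd)),
    y  = c(rep(0, n0), rep(1, n1))
  )
}

dat       <- make_clusters()                     # n = 50, balanced
dat_large <- make_clusters(n = 500, prop = 0.4)  # n = 500, 40/60 imbalance
```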
The IRLS algorithm iteratively updates coefficients using:
$$
\beta^{(k+1)} = \beta^{(k)} - H_k^{-1} g_k
$$

where, with IRLS weight matrix $W = \mathrm{diag}\big(\hat{p}_i (1 - \hat{p}_i)\big)$,

$$
H_k = -X^T W X - \lambda I, \qquad g_k = X^T (y - \hat{p}) - \lambda \beta^{(k)}
$$

(the $\lambda$ terms vanish in the unregularized case, $\lambda = 0$). Iteration stops when

$$
\| \beta^{(k+1)} - \beta^{(k)} \|_2 < \epsilon .
$$
Default parameters (also used in the sketch below):
- Initial coefficients: β^(0) = 0
- Convergence tolerance: ε = 1e−6
- Maximum iterations: 100
- Penalty: λ = 0 (unregularized) or λ = 0.5 (regularized)
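A minimal R sketch of this update rule with the defaults above; the function `irls_logistic` and its interface are illustrative assumptions, not the project's actual implementation.

```r
# Sketch: IRLS / Newton-Raphson for logistic regression with an optional
# ridge (L2) penalty; lambda = 0 reproduces the unregularized fit.
# Note: the intercept is penalized too, mirroring H_k = -X'WX - lambda*I above.
irls_logistic <- function(X, y, lambda = 0, eps = 1e-6, max_iter = 100) {
  X <- cbind(1, as.matrix(X))                   # prepend intercept column
  beta <- rep(0, ncol(X))                       # beta^(0) = 0
  for (k in seq_len(max_iter)) {
    p <- as.vector(1 / (1 + exp(-X %*% beta)))  # fitted probabilities p-hat
    W <- diag(p * (1 - p))                      # IRLS weight matrix
    g <- t(X) %*% (y - p) - lambda * beta       # penalized gradient g_k
    H <- -t(X) %*% W %*% X - lambda * diag(ncol(X))  # penalized Hessian H_k
    beta_new <- as.vector(beta - solve(H, g))   # Newton step
    converged <- sqrt(sum((beta_new - beta)^2)) < eps  # ||change||_2 < eps
    beta <- beta_new
    if (converged) break
  }
  list(coefficients = beta, iterations = k)
}
```

On the small balanced dataset from the earlier sketch, `irls_logistic(dat[, c("x1", "x2")], dat$y, lambda = 0.5)` would return the ridge coefficients together with the iteration count.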
Summary of experimental results:

| Scenario | Regularization | Iterations | Accuracy | Observation |
|---|---|---|---|---|
| Small, balanced | None | 35 | 100% | Perfect separation |
| Small, balanced | L2 | 9 | 100% | Faster, stable |
| Large, balanced | None | 100 | 99% | Stable with more data |
| Imbalanced (40/60) | L2 | 87 | 100% | Robust to imbalance |
| With outliers | L2 | 10 | 82% | Regularization improves robustness |
Regularization reduced coefficient magnitudes and prevented divergence under near-separable or noisy data.
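Assuming the hypothetical helpers above, the effect can be seen by comparing coefficient norms on the well-separated clusters. The iteration cap for the unregularized fit is an assumption made to keep the diverging fit numerically safe; it is not part of the project's setup.

```r
dat_sep <- make_clusters(n = 50)   # clusters at (3,3)/(7,7) are (near-)separable

# Unregularized: coefficients keep growing on separable data (no finite MLE),
# so cap the iterations; raising max_iter inflates the norm further.
fit_unreg <- irls_logistic(dat_sep[, c("x1", "x2")], dat_sep$y,
                           lambda = 0, max_iter = 15)
# Ridge-penalized: converges to finite, stable coefficients.
fit_ridge <- irls_logistic(dat_sep[, c("x1", "x2")], dat_sep$y, lambda = 0.5)

sqrt(sum(fit_unreg$coefficients^2))   # large L2 norm, still growing
sqrt(sum(fit_ridge$coefficients^2))   # much smaller, stable norm
```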
Key visualizations:
- Cluster plots showing decision boundaries (black = unregularized, green = regularized); a plotting sketch follows this list.
- Coefficient trajectories visualized on log-likelihood contours.
- Outlier impact plots showing shifts in decision boundaries.
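A ggplot2 sketch of the first plot type, reusing the hypothetical `dat_sep`, `fit_unreg`, and `fit_ridge` objects from the previous snippet; only the black/green colour convention comes from the project, the rest is assumed.

```r
library(ggplot2)

# For a 2-predictor logistic fit, the decision boundary is
# b0 + b1*x1 + b2*x2 = 0, i.e. x2 = -(b0 + b1*x1) / b2.
boundary <- function(fit) {
  b <- unname(fit$coefficients)
  list(intercept = -b[1] / b[3], slope = -b[2] / b[3])
}
b_u <- boundary(fit_unreg)
b_r <- boundary(fit_ridge)

ggplot(dat_sep, aes(x1, x2, colour = factor(y))) +
  geom_point() +
  geom_abline(intercept = b_u$intercept, slope = b_u$slope, colour = "black") +
  geom_abline(intercept = b_r$intercept, slope = b_r$slope, colour = "green") +
  labs(colour = "class",
       title = "Decision boundaries: unregularized (black) vs ridge (green)")
```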
Performance was evaluated using an 80/20 train–test split (a split-and-score sketch follows this list):
- Clean datasets → 99–100% accuracy
- Datasets with outliers → ≈82% accuracy
- Regularization stabilized solutions without sacrificing predictive power
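A sketch of that evaluation, again assuming the hypothetical `make_clusters` and `irls_logistic` helpers from the earlier snippets.

```r
dat_big <- make_clusters(n = 500, prop = 0.4)          # large, 40/60 imbalanced

set.seed(123)                                          # reproducible split
idx   <- sample(nrow(dat_big), size = floor(0.8 * nrow(dat_big)))  # 80% train
train <- dat_big[idx, ]
test  <- dat_big[-idx, ]

fit <- irls_logistic(train[, c("x1", "x2")], train$y, lambda = 0.5)
b   <- fit$coefficients

eta  <- b[1] + as.matrix(test[, c("x1", "x2")]) %*% b[-1]  # linear predictor
pred <- as.integer(1 / (1 + exp(-eta)) > 0.5)              # classify at 0.5
mean(pred == test$y)                                       # test-set accuracy
```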
Key takeaways:
- Unregularized logistic regression can achieve perfect accuracy but becomes numerically unstable when data are separable or noisy.
- L2 regularization provides finite, stable, and interpretable coefficients.
- Regularization is essential in the presence of outliers, small datasets, or high-dimensional features.
Tools:
- R (v4.x)
- Packages: ggplot2, MASS, and base R's glm()
- Environment: RStudio / Google Colab
References:
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
- McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall.