Iteratively Reweighted Least Squares for Logistic Regression

A Predictive Analytics Project

Author: A K M Intisar Islam
Course: MATH 5383 – Predictive Analytics
Language: R


Project Overview

This project explores linear classification using logistic regression, implemented through the Iteratively Reweighted Least Squares (IRLS) algorithm. It systematically examines how logistic regression behaves under different conditions—such as linear separability, regularization, outliers, dataset size, and class imbalance.

The project demonstrates how L2 regularization (ridge penalty) enhances the stability, convergence, and generalization of logistic regression, especially when data contain noise or outliers.


Objectives

  • Implement the IRLS algorithm for logistic regression using Newton–Raphson optimization.
  • Compare unregularized and L2-regularized logistic regression.
  • Study the effects of dataset size, balance, and outliers.
  • Visualize decision boundaries and log-likelihood surfaces.
  • Evaluate performance using an 80–20 train–test split.

Data Generation

Synthetic datasets were generated with two Gaussian clusters:

Parameter            Value
Observations         n = 50 (later 500)
Predictors           m = 2 (later 4)
Class ratio          50/50 (later 40/60)
Standard deviation   1
Cluster means        (3, 3) and (7, 7)

Later experiments (sketched in the code below) introduced:

  • Larger datasets (n = 500)
  • Imbalanced class distributions (40%/60%)
  • Artificial outliers to test robustness
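
A minimal R sketch of this data-generation step, using the parameters from the table above. The helper name make_clusters() and its arguments are illustrative, not taken from the repository.

```r
# Generate two Gaussian clusters for binary classification (illustrative sketch).
set.seed(42)

make_clusters <- function(n = 50, mean1 = c(3, 3), mean2 = c(7, 7),
                          sdev = 1, prop1 = 0.5) {
  n1 <- round(n * prop1)            # class-0 observations
  n2 <- n - n1                      # class-1 observations
  X1 <- cbind(rnorm(n1, mean1[1], sdev), rnorm(n1, mean1[2], sdev))
  X2 <- cbind(rnorm(n2, mean2[1], sdev), rnorm(n2, mean2[2], sdev))
  data.frame(x1 = c(X1[, 1], X2[, 1]),
             x2 = c(X1[, 2], X2[, 2]),
             y  = c(rep(0, n1), rep(1, n2)))
}

dat        <- make_clusters()                        # n = 50, balanced
dat_large  <- make_clusters(n = 500)                 # larger dataset
dat_imbal  <- make_clusters(n = 500, prop1 = 0.4)    # 40/60 class ratio
```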

IRLS Implementation

The IRLS algorithm iteratively updates the coefficients with the Newton–Raphson step

$$\beta^{(k+1)} = \beta^{(k)} - H_k^{-1} g_k,$$

where

$$H_k = -X^\top W_k X - \lambda I, \qquad g_k = X^\top (y - \hat{p}) - \lambda \beta^{(k)},$$

$W_k = \operatorname{diag}\big(\hat{p}_i(1 - \hat{p}_i)\big)$ is the IRLS weight matrix, $\hat{p}$ is the vector of fitted probabilities, and $\lambda = 0$ recovers the unregularized case.

Convergence criterion: $\lVert \beta^{(k+1)} - \beta^{(k)} \rVert_2 < \epsilon$

Default parameters (used in the sketch below):

  • β₀ = 0
  • ε = 1e−6
  • Max iterations = 100
  • λ = 0 (unregularized) or 0.5 (regularized)
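
A minimal R sketch of the update rule and defaults above. It assumes the design matrix X already contains an intercept column of ones; the function name irls_logistic() is illustrative, not the repository's own code.

```r
# IRLS (Newton–Raphson) for optionally L2-penalized logistic regression.
sigmoid <- function(z) 1 / (1 + exp(-z))

irls_logistic <- function(X, y, lambda = 0, eps = 1e-6, max_iter = 100) {
  beta <- rep(0, ncol(X))                              # beta^(0) = 0
  for (k in seq_len(max_iter)) {
    p <- as.vector(sigmoid(X %*% beta))                # fitted probabilities
    W <- diag(p * (1 - p))                             # IRLS weight matrix
    g <- t(X) %*% (y - p) - lambda * beta              # penalized gradient
    H <- -t(X) %*% W %*% X - lambda * diag(ncol(X))    # penalized Hessian
    beta_new <- as.vector(beta - solve(H, g))          # Newton–Raphson step
    if (sqrt(sum((beta_new - beta)^2)) < eps) {        # ||Δbeta||_2 < eps
      return(beta_new)
    }
    beta <- beta_new
  }
  beta
}
```

With the data sketch above, `X <- cbind(1, dat$x1, dat$x2)` followed by `irls_logistic(X, dat$y, lambda = 0.5)` gives a regularized fit, while `lambda = 0` reproduces the unregularized one.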

Key Findings

Scenario             Regularization   Iterations   Accuracy   Observation
Small, balanced      None             35           100%       Perfect separation
Small, balanced      L2               9            100%       Faster, stable
Large, balanced      None             100          99%        Stable with more data
Imbalanced (40/60)   L2               87           100%       Robust to imbalance
With outliers        L2               10           82%        Regularization improves robustness

Regularization reduced coefficient magnitudes and prevented divergence under near-separable or noisy data.


Visualization Highlights

  • Cluster plots showing decision boundaries (black = unregularized, green = regularized); a plotting sketch follows this list.
  • Coefficient trajectories visualized on log-likelihood contours.
  • Outlier impact plots showing shifts in decision boundaries.
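
A sketch of the cluster plot with both decision boundaries, assuming the `dat` and `irls_logistic()` objects from the sketches above; colours follow the description in the first bullet.

```r
library(ggplot2)

X <- cbind(1, dat$x1, dat$x2)
fit_unreg <- irls_logistic(X, dat$y, lambda = 0)
fit_ridge <- irls_logistic(X, dat$y, lambda = 0.5)

# For a 2-D model the boundary X %*% beta = 0 is the line
# x2 = -(b0 + b1 * x1) / b2.
boundary <- function(b) list(slope = -b[2] / b[3], intercept = -b[1] / b[3])
b_un <- boundary(fit_unreg)
b_l2 <- boundary(fit_ridge)

ggplot(dat, aes(x1, x2, colour = factor(y))) +
  geom_point() +
  geom_abline(slope = b_un$slope, intercept = b_un$intercept, colour = "black") +
  geom_abline(slope = b_l2$slope, intercept = b_l2$intercept, colour = "green") +
  labs(colour = "Class",
       title = "Decision boundaries: unregularized (black) vs. L2 (green)")
```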

Performance Evaluation

Performance was evaluated using an 80/20 train–test split, sketched below:

  • Clean datasets → 99–100% accuracy
  • Datasets with outliers → ≈82% accuracy
  • Regularization stabilized solutions without sacrificing predictive power
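
A minimal sketch of the split and accuracy computation, assuming make_clusters() and irls_logistic() from the sketches above.

```r
# 80/20 train–test split and test accuracy (illustrative sketch).
set.seed(1)
dat <- make_clusters(n = 500)
idx   <- sample(nrow(dat), size = 0.8 * nrow(dat))   # 80% training rows
train <- dat[idx, ]
test  <- dat[-idx, ]

X_train <- cbind(1, train$x1, train$x2)
X_test  <- cbind(1, test$x1,  test$x2)

beta_hat <- irls_logistic(X_train, train$y, lambda = 0.5)
p_test   <- 1 / (1 + exp(-(X_test %*% beta_hat)))    # predicted probabilities
pred     <- as.integer(p_test > 0.5)                 # 0/1 class labels
accuracy <- mean(pred == test$y)
accuracy
```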

Conclusion

  • Unregularized logistic regression can achieve perfect accuracy but becomes numerically unstable when data are separable or noisy.
  • L2 regularization provides finite, stable, and interpretable coefficients.
  • Regularization is essential in the presence of outliers, small datasets, or high-dimensional features.

Tools & Libraries

  • R (v4.x)
  • Packages: ggplot2, MASS (plus the base-R glm() function)
  • Environment: RStudio / Google Colab

References

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
  • McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall.
