Bayesian Linear Regression

A comprehensive end-to-end implementation of a Bayesian linear regression model to predict individual salaries using conjugate Gaussian–Inverse-Gamma priors, optimized Gibbs sampling, and full posterior diagnostics.

📖 Overview

This project provides a self-contained Jupyter notebook that:

  • Loads and preprocesses salary data with predictors: cost, LSAT, GPA, age, library volumes, log(cost), and institutional rank
  • Specifies weakly-informative priors for regression coefficients and error variance
  • Implements an optimized Gibbs sampler (Cholesky sampling, pre-inversion of constant matrices); a minimal sketch follows this list
  • Runs multiple chains and assesses convergence via trace plots, ACF, Effective Sample Size (ESS), and Gelman–Rubin $\hat R$
  • Summarizes posterior distributions (means, medians, SDs, 95% HDIs, sign probabilities)
  • Explores joint parameter dependencies (pairplots, correlation heatmaps)
  • Performs Posterior Predictive Checks (histogram, KDE, HDI shading)
  • Demonstrates model refinements (Student-$t$ likelihood, log-transform, mixture models, heteroskedasticity)
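
Below is a minimal sketch of the conjugate Gibbs updates referenced above, assuming a design matrix X, response y, Gaussian prior mean/covariance b0/B0, and Inverse-Gamma hyperparameters a0/d0; the names are illustrative rather than the notebook's exact API.

    import numpy as np

    def gibbs_sampler(X, y, b0, B0, a0, d0, n_iter=10_000, seed=0):
        """Conjugate Gibbs sampler for y = X beta + eps, eps ~ N(0, sigma^2 I).

        Priors: beta ~ N(b0, B0), sigma^2 ~ Inv-Gamma(a0, d0).
        Constant matrices are computed once; beta is drawn via a Cholesky factor.
        """
        rng = np.random.default_rng(seed)
        n, k = X.shape
        XtX, Xty = X.T @ X, X.T @ y        # precomputed once, reused every iteration
        B0_inv = np.linalg.inv(B0)         # prior precision, pre-inverted
        B0_inv_b0 = B0_inv @ b0

        beta, sigma2 = np.zeros(k), 1.0
        draws_beta = np.empty((n_iter, k))
        draws_sigma2 = np.empty(n_iter)

        for t in range(n_iter):
            # beta | sigma^2, y ~ N(m, V) with V = (B0^-1 + XtX / sigma^2)^-1
            prec = B0_inv + XtX / sigma2
            L = np.linalg.cholesky(prec)   # prec = L @ L.T
            m = np.linalg.solve(prec, B0_inv_b0 + Xty / sigma2)
            # solving L.T z' = z gives a draw with covariance prec^-1
            beta = m + np.linalg.solve(L.T, rng.standard_normal(k))

            # sigma^2 | beta, y ~ Inv-Gamma(a0 + n/2, d0 + SSR/2)
            resid = y - X @ beta
            sigma2 = 1.0 / rng.gamma(a0 + n / 2, 1.0 / (d0 + resid @ resid / 2))

            draws_beta[t], draws_sigma2[t] = beta, sigma2

        return draws_beta, draws_sigma2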

✨ Key Features

  • Conjugate Bayesian setup: closed-form updates for $\beta$ and $\sigma^2$ (full conditionals shown below this list)
  • Optimized Gibbs sampler: Cholesky draws and precomputed inverses for speed
  • Robust diagnostics: trace, ACF, ESS, $\hat R$ and ArviZ integration
  • Rich posterior summaries: HDIs, posterior $P(\beta>0)$, forest plots
  • Flexible PPCs: histograms, KDE, rug plots, HDI shading
  • Extension recipes: code snippets for Student-$t$ errors, mixtures, transformations
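
For reference, the closed-form full conditionals behind the conjugate setup take the standard Gaussian–Inverse-Gamma form (the hyperparameter names $b_0, B_0, a_0, d_0$ match the sketch above and are illustrative):

$$\beta \mid \sigma^2, y \sim \mathcal{N}(m, V), \qquad V = \left(B_0^{-1} + \tfrac{1}{\sigma^2} X^\top X\right)^{-1}, \qquad m = V\left(B_0^{-1} b_0 + \tfrac{1}{\sigma^2} X^\top y\right)$$

$$\sigma^2 \mid \beta, y \sim \text{Inv-Gamma}\!\left(a_0 + \tfrac{n}{2},\; d_0 + \tfrac{1}{2}\,(y - X\beta)^\top (y - X\beta)\right)$$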

🚀 Quick Start

  1. Dependencies

    • Python 3.8+
    • NumPy, SciPy, Pandas
    • Matplotlib, Seaborn
    • Statsmodels (for ACF)
    • ArviZ (for ESS, $\hat R$)
  2. Environment Setup

    pip install numpy scipy pandas matplotlib seaborn statsmodels arviz
  3. Open & Execute

    • Navigate to Linear_Regression.ipynb
    • Run all cells in order to reproduce data loading, model specification, Gibbs sampling, convergence diagnostics, posterior summaries, PPCs, and extensions; a short ArviZ diagnostics sketch follows these steps.
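
Once the sampler has run, a minimal diagnostics sketch with ArviZ might look like this, assuming chains is a NumPy array of $\beta$ draws with shape (n_chains, n_draws, n_coefficients); the variable name is illustrative.

    import arviz as az

    # chains: posterior draws with shape (n_chains, n_draws, n_coefficients)
    idata = az.convert_to_inference_data(chains)

    print(az.ess(idata))   # effective sample size per coefficient
    print(az.rhat(idata))  # Gelman-Rubin R-hat; values near 1.00 indicate convergence
    az.plot_trace(idata)   # trace and density plots for visual inspection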

📈 Results & Interpretation

  • Predictor Effects

    • Strong positive: Cost, GPA, Library volumes (95% HDI excludes zero, $P(\beta>0) > 0.99$)
    • Strong negative: Log(cost), Institutional rank (95% HDI excludes zero, $P(\beta>0) < 0.01$)
    • Ambiguous: LSAT, Age (HDIs straddle zero, moderate $P(\beta>0)$)
  • Sampling Diagnostics

    • High Effective Sample Size (ESS > 70,000)
    • Gelman–Rubin $\hat R = 1.00$
    • ACF near zero beyond lag 0 → almost independent draws
  • Predictive Fit

    • Posterior Predictive Checks reveal that the Normal-error model fails to capture the multimodality and heavy tails of the observed salary distribution.
    • Model refinements (Student-$t$ errors, mixture components, transformations) are recommended to capture sharp peaks and extreme values; a minimal Student-$t$ sketch follows this list.
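
As one concrete refinement, a Student-$t$ likelihood can be folded into the same Gibbs scheme through the standard normal scale-mixture representation: each observation gets a latent Gamma weight, and the $\beta$ and $\sigma^2$ updates then run on the reweighted data. The sketch below shows only the extra conditional draw, with the arguments assumed to come from the sampler sketched earlier.

    import numpy as np

    def draw_latent_scales(y, X, beta, sigma2, nu, rng):
        """One augmentation step for Student-t errors with nu degrees of freedom.

        With eps_i | w_i ~ N(0, sigma^2 / w_i) and w_i ~ Gamma(nu/2, nu/2),
        eps_i is marginally t_nu; each w_i has a Gamma full conditional.
        """
        resid = y - X @ beta
        shape = (nu + 1) / 2
        rate = (nu + resid**2 / sigma2) / 2
        return rng.gamma(shape, 1.0 / rate)  # NumPy's gamma takes a scale, not a rate

The drawn weights then rescale each row, $x_i \mapsto \sqrt{w_i}\,x_i$ and $y_i \mapsto \sqrt{w_i}\,y_i$, before the usual conjugate updates for $\beta$ and $\sigma^2$.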

🤝 Contributing

Contributions and suggestions are welcome! Please:

  • Open an issue to propose enhancements or report bugs
  • Submit pull requests with clear descriptions of changes
  • Include unit tests for any new sampler or diagnostic functions

📜 License

Released under the MIT License
