From 910bcf9cbdf3bf6590fdadc467799c1ffedd97e9 Mon Sep 17 00:00:00 2001 From: Dangote Date: Mon, 20 Apr 2020 11:07:12 +0100 Subject: [PATCH 1/4] Added challenge folder by Jato Joseph --- .../notebook-checkpoint.ipynb | 501 +++++++++++++ JatoJoseph_Logistic/Admittance.csv | 169 +++++ JatoJoseph_Logistic/notebook.ipynb | 702 ++++++++++++++++++ 3 files changed, 1372 insertions(+) create mode 100644 JatoJoseph_Logistic/.ipynb_checkpoints/notebook-checkpoint.ipynb create mode 100644 JatoJoseph_Logistic/Admittance.csv create mode 100644 JatoJoseph_Logistic/notebook.ipynb diff --git a/JatoJoseph_Logistic/.ipynb_checkpoints/notebook-checkpoint.ipynb b/JatoJoseph_Logistic/.ipynb_checkpoints/notebook-checkpoint.ipynb new file mode 100644 index 0000000..c480400 --- /dev/null +++ b/JatoJoseph_Logistic/.ipynb_checkpoints/notebook-checkpoint.ipynb @@ -0,0 +1,501 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "logistic_challenge.ipynb", + "provenance": [], + "collapsed_sections": [], + "toc_visible": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "code", + "metadata": { + "id": "x0ovbaMbHuju", + "colab_type": "code", + "colab": {} + }, + "source": [ + "" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MpvDLTrhHyiC", + "colab_type": "text" + }, + "source": [ + "## Outline:\n", + "* Define a logistic regression\n", + "* State and explain the formula and its variables\n", + "* Get a dataset\n", + "* Build a logistic function and explain its effect \n", + "* Train the model\n", + "* Test the model\n", + "* Interpret the Logistic Regression \n", + "Table\n", + "* Summary\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KMnIsHxQW5wU", + "colab_type": "text" + }, + "source": [ + "## Logistic Regression\n", + "Logistic regression is a statistical model that in its basic form uses a logistic function to 
model a binary dependent variable. The possible outcomes of a logistic regression are not numerical but\n", + "rather categorical (1 or 0, Yes or No, etc.).\n", + "\n", + "Logistic regression is a generalized linear model: a linear model whose output, that of a multiple linear regression, is mapped to a posterior probability for each class (0, 1) by the **logistic\n", + "sigmoid function.**\n", + "\n", + "## The Logistic Regression Formula\n", + "\n", + "\n", + "## $\\frac{p(X)}{1 - p(X)} = e^{\\beta_0 + \\beta_1 X_1 + \\dots + \\beta_k X_k}$\n", + "\n", + "Where,\n", + "\n", + "$p(X)$ = Probability that the outcome is 1 given $X$\n", + "\n", + "$e$ = Base of the natural logarithm\n", + "\n", + "$\\beta_0, \\beta_1$ = Coefficients\n", + "\n", + "$X_1$ = Independent variable\n", + "\n", + "## **ODDS** = $\\frac{p(X)}{1 - p(X)}$\n", + "\n", + "In this form the model is awkward to work with: the right-hand side is an exponential, which is hard to interpret and less convenient to compute with.\n", + "\n", + "When we talk about a *logistic regression* what we usually mean is **logit regression**, a variation of the model in which we take the log of both sides. 
See the formula below:\n", + "\n", + "## $\\log\\left(\\frac{p(X)}{1 - p(X)}\\right) = \\log\\left(e^{\\beta_0 + \\beta_1 X_1 + \\dots + \\beta_k X_k}\\right)$\n", + "\n", + "On the right-hand side, the log cancels the exponential, leaving us with our new formula:\n", + "\n", + "## $\\log\\left(\\frac{p(X)}{1 - p(X)}\\right) = \\beta_0 + \\beta_1 X_1 + \\dots + \\beta_k X_k$\n", + "\n", + "With odds:\n", + "\n", + "## $\\log(\\text{odds}) = \\beta_0 + \\beta_1 X_1 + \\dots + \\beta_k X_k$\n", + "\n", + "We'll implement all of this in the code section of the project.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "81EgDyl4xjiK", + "colab_type": "text" + }, + "source": [ + "## The Dataset\n", + "Our dataset will be a collection of JAMB scores of students who sat for the 2020 JAMB UTME in March.\n", + "## Task\n", + "We build a logistic model that tells us the probability of a student gaining admission based on their JAMB score." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t2-v-b3k6vCP", + "colab_type": "text" + }, + "source": [ + "## Code Implementation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pAQNMn8cIf25", + "colab_type": "code", + "outputId": "7ba9593c-afc0-4f62-caf2-2a9fb9da4117", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 72 + } + }, + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import statsmodels.api as sm\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "sns.set()" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.\n", + " import pandas.util.testing as tm\n" + ], + "name": "stderr" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Cd5voKSS7Iqt", + "colab_type": "text" + }, + "source": [ + "## DATA" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hSgCe0RA67oM", + "colab_type": "code", + "colab": {} + }, + "source": [ + "data = pd.read_csv('/content/jamb_scores.csv')" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "a19Zmvij7Xsq", + "colab_type": "code", + "outputId": "48457280-76a6-4120-b4f8-960735b35e4d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + } + }, + "source": [ + "data.head()" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ScoresAdmitted
02551
11050
22211
33591
43001
\n", + "
" + ], + "text/plain": [ + " Scores Admitted\n", + "0 255 1\n", + "1 105 0\n", + "2 221 1\n", + "3 359 1\n", + "4 300 1" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JUTJfJIdGoQj", + "colab_type": "text" + }, + "source": [ + "Observe that students who scored 180 and above are assigned an Admitted value of 1, indicating that they are likely to gain admission.\n", + "\n", + "Let's see what our regression model tells us." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QotmOvecHI36", + "colab_type": "text" + }, + "source": [ + "## Variables" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "H-g8QnRSGA3O", + "colab_type": "code", + "colab": {} + }, + "source": [ + "X = data['Scores'] # Independent variable\n", + "y = data['Admitted'] # Dependent variable" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dI1ejw-7HiDB", + "colab_type": "text" + }, + "source": [ + "## PLOT\n", + "\n", + "Let's plot the data to see its distribution before building our logistic function" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ut5XnF8wHT4q", + "colab_type": "code", + "outputId": "715529c5-69cc-45d5-e27a-76e20154d7b9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 291 + } + }, + "source": [ + "# Create a scatter plot of x (Scores) and y (Admitted)\n", + "plt.scatter(X,y, color='C0')\n", + "plt.xlabel('Scores', fontsize = 20)\n", + "plt.ylabel('Admitted', fontsize = 20)\n", + "plt.show()" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "imqAD27oP0Lc", + "colab_type": "text" + }, + "source": [ + "## Logistic formula\n", + "\n", + "Let's convert the logistic regression formula to a Python function" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hBt-pNzZPzju", + "colab_type": "code", + "colab": {} + }, + "source": [ + "def log_form(x,b0,b1):\n", + " return np.array(np.exp(b0+x*b1) / (1 + np.exp(b0+x*b1)))" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i2Jjeo3aRhW1", + "colab_type": "text" + }, + "source": [ + "## The Logit Function\n", + "\n", + "Here, we are going to call the Logit function from statsmodels and fit it on our data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ibAxdnqQRlZ6", + "colab_type": "code", + "colab": {} + }, + "source": [ + "x_const = sm.add_constant(X)\n", + "reg_log = sm.Logit(y,x_const)\n", + "results = reg_log.fit()" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RxTjA2XAA1b6", + "colab_type": "text" + }, + "source": [ + "## Logistic Summary" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2UlPlAfKA5mc", + "colab_type": "code", + "colab": {} + }, + "source": [ + "results.summary()" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XmToMDMpSY_5", + "colab_type": "text" + }, + "source": [ + "## Sorting" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VFfBEWX_SdVD", + "colab_type": "code", + "outputId": "a67d5992-3259-470c-aeae-807c383473bc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 748 + } + }, + "source": [ + "f_sorted = np.sort(log_form(X,results.params.iloc[0],results.params.iloc[1]))\n", + "x_sorted = np.sort(np.array(X))" + ], + "execution_count": 0, + "outputs": [ 
+ { + "output_type": 
"error", + "ename": "IndexError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_value\u001b[0;34m(self, series, key)\u001b[0m\n\u001b[1;32m 4403\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 4404\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_value\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mk\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtz\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mgetattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mseries\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"tz\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4405\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_value\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_value\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in 
\u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 1", + "\nDuring handling of the above exception, another exception occurred:\n", + "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mf_sorted\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msort\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlog_form\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mresults_log\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mparams\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mresults_log\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mparams\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mx_sorted\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msort\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/series.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 869\u001b[0m \u001b[0mkey\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply_if_callable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 870\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 871\u001b[0;31m 
\u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_value\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 872\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 873\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mis_scalar\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_value\u001b[0;34m(self, series, key)\u001b[0m\n\u001b[1;32m 4408\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4409\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 4410\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mlibindex\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_value_at\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4411\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mIndexError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4412\u001b[0m \u001b[0;32mraise\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.get_value_at\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.get_value_at\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/util.pxd\u001b[0m in \u001b[0;36mpandas._libs.util.get_value_at\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/util.pxd\u001b[0m in 
\u001b[0;36mpandas._libs.util.validate_indexer\u001b[0;34m()\u001b[0m\n", + "\u001b[0;31mIndexError\u001b[0m: index out of bounds" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CKuWoq7YN_a7", + "colab_type": "text" + }, + "source": [ + "## Plot a Logistic Curve\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SUjjHHh_H7QO", + "colab_type": "code", + "colab": {} + }, + "source": [ + "" + ], + "execution_count": 0, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/JatoJoseph_Logistic/Admittance.csv b/JatoJoseph_Logistic/Admittance.csv new file mode 100644 index 0000000..5910743 --- /dev/null +++ b/JatoJoseph_Logistic/Admittance.csv @@ -0,0 +1,169 @@ +Score,Admitted +1363,No +1792,Yes +1954,Yes +1653,No +1593,No +1755,Yes +1775,Yes +1887,Yes +1893,Yes +1580,No +1857,Yes +1880,Yes +1664,Yes +1364,No +1693,No +1850,Yes +1633,No +1634,No +1636,Yes +1855,Yes +1987,Yes +1997,Yes +1422,No +1508,No +1720,Yes +1879,Yes +1634,Yes +1802,Yes +1849,Yes +1764,Yes +1460,No +1675,Yes +1656,No +2020,Yes +1850,Yes +1865,Yes +1664,No +1872,Yes +1654,No +1393,No +1587,No +1631,Yes +1931,Yes +1370,No +1810,Yes +1414,No +1761,Yes +1477,No +1486,No +1561,No +1549,No +2050,Yes +1697,No +1543,No +1934,Yes +1385,No +1670,No +1735,Yes +1634,No +1777,Yes +1550,No +1715,Yes +1925,Yes +1842,Yes +1786,Yes +1435,No +1387,No +1521,No +1975,Yes +1435,No +1714,Yes +1634,Yes +1464,No +1794,Yes +1855,Yes +1953,Yes +1469,No +1663,Yes +1907,Yes +1990,Yes +1542,No +1808,Yes +1966,Yes +1679,No +2021,Yes +2015,Yes +1473,No +1979,Yes +1787,Yes +1687,Yes +1674,No +1478,No +1735,Yes +1720,Yes +1494,No +1964,Yes +1843,Yes +1550,No +1764,Yes +1712,Yes +1775,Yes +1531,No +1781,Yes +1579,No +1526,No +1778,Yes +1769,Yes +1824,Yes +1481,No +1464,No +1591,No +1666,No +1455,No +1934,Yes +1625,No +1334,No +1721,Yes +1475,No +1662,Yes +1861,Yes +1936,Yes +1572,No +1508,No +1430,No +1891,Yes +1550,No +1741,Yes +1690,No +1687,Yes +1730,Yes +1674,Yes +1475,No 
+1962,Yes +1532,No +1492,No +1502,No +1974,Yes +1607,No +1412,No +1557,No +1821,Yes +1760,Yes +1685,Yes +1773,Yes +1826,Yes +1565,No +1510,No +1374,No +1402,No +1702,Yes +1956,Yes +1933,Yes +1832,Yes +1893,Yes +1831,Yes +1487,No +2041,Yes +1850,Yes +1555,No +2020,Yes +1593,No +1934,Yes +1808,Yes +1722,Yes +1750,Yes +1555,No +1524,No +1461,No diff --git a/JatoJoseph_Logistic/notebook.ipynb b/JatoJoseph_Logistic/notebook.ipynb new file mode 100644 index 0000000..9c2dec9 --- /dev/null +++ b/JatoJoseph_Logistic/notebook.ipynb @@ -0,0 +1,702 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "MpvDLTrhHyiC" + }, + "source": [ + "## Outline:\n", + "* Define a logistic regression\n", + "* State and explain the formula and its variables\n", + "* Get a dataset\n", + "* Build a logistic function and explain its effect \n", + "* Train the model\n", + "* Test the model\n", + "* Interpret the Logistic Regression \n", + "Table\n", + "* Summary\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "KMnIsHxQW5wU" + }, + "source": [ + "## Logistic Regression\n", + "Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable. 
The possible outcomes of a logistic regression are not numerical but\n", + "rather categorical (1 or 0, Yes or No, etc.).\n", + "\n", + "Logistic regression is a generalized linear model: a linear model whose output, that of a multiple linear regression, is mapped to a posterior probability for each class (0, 1) by the **logistic\n", + "sigmoid function.**\n", + "\n", + "## The Logistic Regression Formula\n", + "\n", + "\n", + "## $\\frac{p(X)}{1 - p(X)} = e^{\\beta_0 + \\beta_1 X_1 + \\dots + \\beta_k X_k}$\n", + "\n", + "Where,\n", + "\n", + "$p(X)$ = Probability that the outcome is 1 given $X$\n", + "\n", + "$e$ = Base of the natural logarithm\n", + "\n", + "$\\beta_0$ = Bias or intercept\n", + "\n", + "$\\beta_1$ = 
Coefficient\n", + "\n", + "$X_1$ = Independent variable\n", + "\n", + "## **ODDS** = $\\frac{p(X)}{1 - p(X)}$\n", + "\n", + "In this form the model is awkward to work with: the right-hand side is an exponential, which is hard to interpret and less convenient to compute with.\n", + "\n", + "When we talk about a *logistic regression* what we usually mean is **logit regression**, a variation of the model in which we take the log of both sides. See the formula below:\n", + "\n", + "## $\\log\\left(\\frac{p(X)}{1 - p(X)}\\right) = \\log\\left(e^{\\beta_0 + \\beta_1 X_1 + \\dots + \\beta_k X_k}\\right)$\n", + "\n", + "On the right-hand side, the log cancels the exponential, leaving us with our new formula:\n", + "\n", + "## $\\log\\left(\\frac{p(X)}{1 - p(X)}\\right) = \\beta_0 + \\beta_1 X_1 + \\dots + \\beta_k X_k$\n", + "\n", + "With odds:\n", + "\n", + "## $\\log(\\text{odds}) = \\beta_0 + \\beta_1 X_1 + \\dots + \\beta_k X_k$\n", + "\n", + "We'll implement all of this in the code section of the project.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "t2-v-b3k6vCP" + }, + "source": [ + "## Code Implementation" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 72 + }, + "colab_type": "code", + "id": "pAQNMn8cIf25", + "outputId": "7ba9593c-afc0-4f62-caf2-2a9fb9da4117" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import statsmodels.api as sm\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "sns.set()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Cd5voKSS7Iqt" + }, + "source": [ + "## DATA\n", + "\n", + "Our dataset is not real-life data; it was made for the purpose of this challenge. 
\n", + "We aim to predict whether a student will be admitted based on their cumulative points in secondary school courses.\n", + "\n", + "The benchmark for admission is 1700 points; a student scoring short of that will not be admitted." + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "hSgCe0RA67oM" + }, + "outputs": [], + "source": [ + "data = pd.read_csv('Admittance.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "colab_type": "code", + "id": "a19Zmvij7Xsq", + "outputId": "48457280-76a6-4120-b4f8-960735b35e4d" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ScoreAdmitted
01363No
11792Yes
21954Yes
31653No
41593No
\n", + "
" + ], + "text/plain": [ + " Score Admitted\n", + "0 1363 No\n", + "1 1792 Yes\n", + "2 1954 Yes\n", + "3 1653 No\n", + "4 1593 No" + ] + }, + "execution_count": 65, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": {}, + "outputs": [], + "source": [ + "data['Admitted'] = data['Admitted'].map({'Yes':1, 'No':0})" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ScoreAdmitted
013630
117921
219541
316530
415930
\n", + "
" + ], + "text/plain": [ + " Score Admitted\n", + "0 1363 0\n", + "1 1792 1\n", + "2 1954 1\n", + "3 1653 0\n", + "4 1593 0" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "QotmOvecHI36" + }, + "source": [ + "## Variables" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "H-g8QnRSGA3O" + }, + "outputs": [], + "source": [ + "X = data['Score'] # Independent variable\n", + "y = data['Admitted'] # Dependent variable" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "dI1ejw-7HiDB" + }, + "source": [ + "## PLOT\n", + "\n", + "Let's plot the data to see its distribution before building our logistic function" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 291 + }, + "colab_type": "code", + "id": "ut5XnF8wHT4q", + "outputId": "715529c5-69cc-45d5-e27a-76e20154d7b9" + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create a scatter plot of x (Scores) and y (Admitted)\n", + "plt.scatter(X,y, color='C0')\n", + "plt.xlabel('Score', fontsize = 20)\n", + "plt.ylabel('Admitted', fontsize = 20)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Observe from above plot that, values above 1700 fall under the value of 1(admitted). Therea are a few cases of outliers in our data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "i2Jjeo3aRhW1" + }, + "source": [ + "## The Logit Function\n", + "Here, we are going to call the Logit function from statsmodel and fit it on our data" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "ibAxdnqQRlZ6" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Optimization terminated successfully.\n", + " Current function value: 0.137766\n", + " Iterations 10\n" + ] + } + ], + "source": [ + "x_const = sm.add_constant(X)\n", + "reg_log = sm.Logit(y,x_const)\n", + "result = reg_log.fit()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "RxTjA2XAA1b6" + }, + "source": [ + "## Logistic Summary" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "2UlPlAfKA5mc" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "
Logit Regression Results
Dep. Variable: Admitted No. Observations: 168
Model: Logit Df Residuals: 166
Method: MLE Df Model: 1
Date: Mon, 20 Apr 2020 Pseudo R-squ.: 0.7992
Time: 10:45:34 Log-Likelihood: -23.145
converged: True LL-Null: -115.26
Covariance Type: nonrobust LLR p-value: 5.805e-42
\n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "
coef std err z P>|z| [0.025 0.975]
const -69.9128 15.737 -4.443 0.000 -100.756 -39.070
SAT 0.0420 0.009 4.454 0.000 0.024 0.060


Possibly complete quasi-separation: A fraction 0.27 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified." + ], + "text/plain": [ + "\n", + "\"\"\"\n", + " Logit Regression Results \n", + "==============================================================================\n", + "Dep. Variable: Admitted No. Observations: 168\n", + "Model: Logit Df Residuals: 166\n", + "Method: MLE Df Model: 1\n", + "Date: Mon, 20 Apr 2020 Pseudo R-squ.: 0.7992\n", + "Time: 10:45:34 Log-Likelihood: -23.145\n", + "converged: True LL-Null: -115.26\n", + "Covariance Type: nonrobust LLR p-value: 5.805e-42\n", + "==============================================================================\n", + " coef std err z P>|z| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "const -69.9128 15.737 -4.443 0.000 -100.756 -39.070\n", + "SAT 0.0420 0.009 4.454 0.000 0.024 0.060\n", + "==============================================================================\n", + "\n", + "Possibly complete quasi-separation: A fraction 0.27 of observations can be\n", + "perfectly predicted. This might indicate that there is complete\n", + "quasi-separation. In this case some parameters will not be identified.\n", + "\"\"\"" + ] + }, + "execution_count": 80, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Interpreting the Log Table\n", + "\n", + "From the table above, you can see that statsmodels identified 'Admitted' as the dependent variable. The model used is a Logit regression, and the method is Maximum Likelihood Estimation (MLE), which chooses the parameter values that make the observed data most likely. The fit converged on 168 observations.\n", + "\n", + "The Pseudo R-squared is 0.80, which is within the 'acceptable region'. Pseudo R-squared is a measure of how well the variables of the model explain the outcome. 
As a rough rule of thumb, a pseudo R-squared above 0.5 suggests that the model fits the data well.\n", + "\n", + "The predictor (labelled SAT in the table) is highly significant (P>|z| = 0.000) and its coefficient is 0.0420, so each additional point raises the log-odds of admission by about 0.042.\n", + "\n", + "The constant is also significant and equals -69.9.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "imqAD27oP0Lc" + }, + "source": [ + "## Logistic formula\n", + "\n", + "Let's convert the logistic regression formula to a Python function" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "hBt-pNzZPzju" + }, + "outputs": [], + "source": [ + "def log_form(x,b0,b1):\n", + " return np.array(np.exp(b0+x*b1) / (1 + np.exp(b0+x*b1)))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "XmToMDMpSY_5" + }, + "source": [ + "## Sorting" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 748 + }, + "colab_type": "code", + "id": "VFfBEWX_SdVD", + "outputId": "a67d5992-3259-470c-aeae-807c383473bc" + }, + "outputs": [], + "source": [ + "# The params values come from fitting the Logit model on the data above (intercept first, then slope)\n", + "f_sorted = np.sort(log_form(X,result.params.iloc[0],result.params.iloc[1]))\n", + "x_sorted = np.sort(np.array(X))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "CKuWoq7YN_a7" + }, + "source": [ + "## Plot a Logistic Curve\n" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "SUjjHHh_H7QO" + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.scatter(X,y,color='C0')\n", + "plt.xlabel('Score', fontsize = 20)\n", + "plt.ylabel('Admitted', fontsize = 20)\n", + "# Plotting the curve\n", + "plt.plot(x_sorted,f_sorted,color='C8')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can observe that our function fitted well on our data and even ignored the outliered values." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Testing the Model" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.00, 1.00, 1.00, 0.38, 0.05, 0.98, 0.99, 1.00, 1.00, 0.03, 1.00,\n", + " 1.00, 0.50, 0.00, 0.77, 1.00, 0.21, 0.22, 0.23, 1.00, 1.00, 1.00,\n", + " 0.00, 0.00, 0.91, 1.00, 0.22, 1.00, 1.00, 0.98, 0.00, 0.61, 0.41,\n", + " 1.00, 1.00, 1.00, 0.50, 1.00, 0.39, 0.00, 0.04, 0.20, 1.00, 0.00,\n", + " 1.00, 0.00, 0.98, 0.00, 0.00, 0.01, 0.01, 1.00, 0.80, 0.01, 1.00,\n", + " 0.00, 0.56, 0.95, 0.22, 0.99, 0.01, 0.89, 1.00, 1.00, 0.99, 0.00,\n", + " 0.00, 0.00, 1.00, 0.00, 0.89, 0.22, 0.00, 1.00, 1.00, 1.00, 0.00,\n", + " 0.49, 1.00, 1.00, 0.01, 1.00, 1.00, 0.65, 1.00, 1.00, 0.00, 1.00,\n", + " 0.99, 0.72, 0.60, 0.00, 0.95, 0.91, 0.00, 1.00, 1.00, 0.01, 0.98,\n", + " 0.88, 0.99, 0.00, 0.99, 0.03, 0.00, 0.99, 0.99, 1.00, 0.00, 0.00,\n", + " 0.04, 0.52, 0.00, 1.00, 0.16, 0.00, 0.92, 0.00, 0.47, 1.00, 1.00,\n", + " 0.02, 0.00, 0.00, 1.00, 0.01, 0.96, 0.75, 0.72, 0.94, 0.60, 0.00,\n", + " 1.00, 0.00, 0.00, 0.00, 1.00, 0.08, 0.00, 0.01, 1.00, 0.98, 0.70,\n", + " 0.99, 1.00, 0.02, 0.00, 0.00, 0.00, 0.83, 1.00, 1.00, 1.00, 1.00,\n", + " 1.00, 0.00, 1.00, 1.00, 0.01, 1.00, 0.05, 1.00, 1.00, 0.92, 0.97,\n", + " 0.01, 0.00, 0.00])" + ] + }, + "execution_count": 89, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result.predict()" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "The predicted values above are the estimated probabilities that each observation belongs to class 1 (admitted). The first observation (0.00) has an estimated probability of admission of about zero, which implies no admission, while the second observation (1.00) has an estimated probability of about one, which implies admission. Values strictly between 0 and 1 are not outliers; they are observations for which the model is less certain which class they belong to. This can also be seen in the plot above." + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "logistic_challenge.ipynb", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 28ddf0523814646e6634b4cb48a9b93258a4f4d3 Mon Sep 17 00:00:00 2001 From: jatojoseph Date: Mon, 20 Apr 2020 11:29:45 +0100 Subject: [PATCH 2/4] Update README.md --- README.md | 40 ---------------------------------------- 1 file changed, 40 deletions(-) diff --git a/README.md b/README.md index d3696d5..8b13789 100644 --- a/README.md +++ b/README.md @@ -1,41 +1 @@ -# ML-Logistic-regression-algorithm-challenge - -![DSN logo](DSN_logo.png)|DSN Algorithm Challenge| -|---|---| - -A lot of data scientists or machine learning enthusiasts do use various machine learning algorithms as a black box without knowing how they work or the mathematics behind it. The purpose of this challenge is to encourage the mathematical understanding of machine learning algorithms, their break and yield point. 
- -In summary, participants are encouraged to understand the fundamental concepts behind machine learning algorithms/models. 
- - -The rules and guidelines for this challenge are as follows: - -1. Ensure to register at https://bit.ly/dsnmlhack - -2. The algorithm challenge is open to all. - -3. Participants are expected to design and develop the Logistic Regression algorithm from scratch using Python or R programming. - -4. For python developers (numpy is advisable). - -5. To push your solution to us, make a [pull request](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests) to DSN's GitHub page at https://www.github.com/datasciencenigeria/ML-Logistic-regression-algorithm-challenge. Ensure to add your readme file to understand your code. - -6. The top 3 optimized code will be compensated as follows: - -- **1st position**: 20GB data plan. -- **2nd position**: 15GB data plan. -- **3rd position**: 10GB data plan. - -7. Add your scripts and readme.MD file as a folder saved as your full name (surname_first_middle name) by making a pull request to the repository. - ---- -For issues on this challenge kindly reach out to the AI+campus/city managers - -**Twitter**: [@DataScienceNIG](https://twitter.com/DataScienceNIG), [@elishatofunmi](https://twitter.com/Elishatofunmi), [@o_funminiyi](https://twitter.com/o_funminiyi), [@gbganalyst](https://twitter.com/gbganalyst) - -or - -**Call**: +2349062000119,+2349080564419. - -Good luck! 
From ff4f0ce90b056fb02d237c423d9ae8a1db576a38 Mon Sep 17 00:00:00 2001 From: jatojoseph Date: Mon, 20 Apr 2020 11:30:14 +0100 Subject: [PATCH 3/4] Delete README.md --- README.md | 1 - 1 file changed, 1 deletion(-) delete mode 100644 README.md diff --git a/README.md b/README.md deleted file mode 100644 index 8b13789..0000000 --- a/README.md +++ /dev/null @@ -1 +0,0 @@ - From b2f34dc835401e67c4e8828343824e74f4549e2d Mon Sep 17 00:00:00 2001 From: jatojoseph Date: Mon, 20 Apr 2020 11:30:59 +0100 Subject: [PATCH 4/4] Added a readme file --- README.md | 44 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..511b69a --- /dev/null +++ b/README.md @@ -0,0 +1,44 @@ +# ML-Logistic-regression-algorithm-challenge +ML-Logistic-regression-algorithm-challenge +## Logistic Regression +Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable. The possible outcomes of a logistic regression are not numerical but +rather categorical (1 or 0, Yes or No, etc.). + +Logistic regression is a generalized linear model: it is a linear model with a link function that maps the output of a multiple linear regression to a posterior probability of each class (0, 1) using the **logistic +sigmoid function.** + +## The Logistic Regression Formula + + +## $\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k}$ + +Where, + +$p(X)$ = Probability of the positive class (outcome 1) given $X$ + +$e$ = Base of the natural logarithm + +$\beta_0$ = Bias or intercept + +$\beta_1$ = Coefficient + +$X_1$ = Independent variable + +## **ODDS** = $\frac{p(X)}{1 - p(X)}$ + +The logistic regression model is not very useful in itself. The right-hand side of the model is an exponent which is very computationally inefficient and generally hard to grasp. 
+ +When we talk about a *logistic regression*, what we usually mean is **logit regression**, a variation of the model in which we have taken the log of both sides. See the formula below: + +## $\log\left(\frac{p(X)}{1 - p(X)}\right) = \log\left(e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k}\right)$ + +On the right-hand side, the log cancels the exponential function, leaving us with our new formula: + +## $\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$ + +With odds: + +## $\log(\text{odds}) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$ + +We'll implement all of this in the code section of the project. + 
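The relationship between probability, odds and log-odds described above can be checked numerically. The following is a minimal sketch, not the notebook's own code: the helper names `sigmoid`, `odds` and `log_odds` and the coefficient values `b0` and `b1` are assumptions chosen purely for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def odds(p):
    """Odds of the positive class: p / (1 - p)."""
    return p / (1.0 - p)


def log_odds(p):
    """Logit: the natural log of the odds."""
    return np.log(odds(p))


# Hypothetical intercept and coefficient, for illustration only
b0, b1 = -1.5, 0.8
x = np.array([0.0, 1.0, 2.0, 3.0])

z = b0 + b1 * x   # linear predictor: beta_0 + beta_1 * x
p = sigmoid(z)    # predicted probability of class 1

# The odds equal e^(b0 + b1*x), and the log of the odds recovers the linear predictor
print(np.allclose(odds(p), np.exp(z)))
print(np.allclose(log_odds(p), z))
```

Taking the log of the odds gives back the linear predictor exactly, which is why the logit form is usually the easier one to interpret: each coefficient is the change in log-odds for a one-unit change in its variable.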