BiAPoL
diff --git a/‎10_machine_learning/13_06_Machine_Learning_wo_Brat.pdf‎
1.18 MB b/‎10_machine_learning/13_06_Machine_Learning_wo_Brat.pdf‎
diff --git a/‎10_machine_learning/machine_learning.ipynb‎
Lines changed: 276 additions & 0 deletions b/‎10_machine_learning/machine_learning.ipynb‎
@@ -0,0 +1,276 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "id": "12032af8-547f-486b-baab-d9f3f46cf957",
+      "metadata": {
+        "id": "12032af8-547f-486b-baab-d9f3f46cf957"
+      },
+      "source": [
+        "# Unsupervised machine learning"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "825d6705-cf79-4b52-acc6-da93d6e2d96c",
+      "metadata": {
+        "id": "825d6705-cf79-4b52-acc6-da93d6e2d96c"
+      },
+      "outputs": [],
+      "source": [
+        "import matplotlib.pyplot as plt\n",
+        "import pandas as pd\n",
+        "from sklearn.cluster import KMeans\n",
+        "from sklearn.metrics import jaccard_score, accuracy_score, precision_score, recall_score"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "def generate_biomodal_2d_data():\n",
+        "    import numpy as np\n",
+        "\n",
+        "    rs = np.random.RandomState(seed=0)\n",
+        "\n",
+        "    x1 = rs.normal(3, 1, (150,2))\n",
+        "    x2 = rs.normal(8, 1.5, (150,2))\n",
+        "\n",
+        "    x_all = np.concatenate((x1, x2), axis=0)\n",
+        "    rs.shuffle(x_all)\n",
+        "    return x_all"
+      ],
+      "metadata": {
+        "id": "y5kc5gCOfEVu"
+      },
+      "id": "y5kc5gCOfEVu",
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "id": "5292a204-9e34-47dd-8fac-f7db68cd4bc6",
+      "metadata": {
+        "id": "5292a204-9e34-47dd-8fac-f7db68cd4bc6"
+      },
+      "source": [
+        "In the following data set, we simulate patients with myeloid leukemia. We analyze two features, progression and mutational signature. Patients with faster progression and a higher mutational signature are considered to have acute myeloid leukemia (AML)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "8e3707f7-77be-4a26-811d-3e6bdd47018e",
+      "metadata": {
+        "id": "8e3707f7-77be-4a26-811d-3e6bdd47018e"
+      },
+      "outputs": [],
+      "source": [
+        "data = generate_biomodal_2d_data()\n",
+        "\n",
+        "plt.scatter(data[:, 0], data[:, 1], c='#DDDDDD')\n",
+        "plt.xlabel('progression')\n",
+        "plt.ylabel('mutational signature')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "ad993994-7a52-4d58-8873-6c095debe66e",
+      "metadata": {
+        "id": "ad993994-7a52-4d58-8873-6c095debe66e"
+      },
+      "source": [
+        "To get a more detailed insight into the data, we print out the first 20 entries."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "7d78f58d-7c5b-452a-b545-8b3e791f9c5a",
+      "metadata": {
+        "id": "7d78f58d-7c5b-452a-b545-8b3e791f9c5a"
+      },
+      "outputs": [],
+      "source": [
+        "pd.DataFrame(data[:20], columns=[\"progression\", \"mutational signature\"])"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "e7f82812-4bac-4984-b90e-a646718a6adb",
+      "metadata": {
+        "id": "e7f82812-4bac-4984-b90e-a646718a6adb"
+      },
+      "source": [
+        "## Separating training and validation data\n",
+        "Before we fit our k-means model, we split the data into subsets so that we can check the model on points it has not seen during fitting. Note that the data is not annotated: k-means is unsupervised and needs no labels. We fit on the first 200 data points and validate on the next 50; the remaining 50 points are held back for the prediction step."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "2fcb456d-90c9-4cc2-90c6-9c69a43e72bb",
+      "metadata": {
+        "id": "2fcb456d-90c9-4cc2-90c6-9c69a43e72bb"
+      },
+      "outputs": [],
+      "source": [
+        "train_data = data[:200]\n",
+        "validation_data = data[200:250]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "22c17985-c857-4170-84ce-66e96f8e4971",
+      "metadata": {
+        "id": "22c17985-c857-4170-84ce-66e96f8e4971"
+      },
+      "source": [
+        "## Training\n",
+        "With the selected training data, we can fit our k-means model with two clusters."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "ec60e60e-bc9e-4e46-84a3-90366cf45c99",
+      "metadata": {
+        "id": "ec60e60e-bc9e-4e46-84a3-90366cf45c99"
+      },
+      "outputs": [],
+      "source": [
+        "kmeans = KMeans(n_clusters=2, random_state=0,  n_init=\"auto\").fit(train_data)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "result = kmeans.predict(train_data)\n",
+        "\n",
+        "# k-means labels are 0 and 1; colors[i-1] maps label 0 to 'blue' (index -1) and label 1 to 'orange'\n",
+        "colors = ['orange', 'blue']\n",
+        "predicted_colors = [colors[i-1] for i in result]\n",
+        "\n",
+        "plt.scatter(train_data[:, 0], train_data[:, 1], c=predicted_colors)\n",
+        "plt.xlabel('progression')\n",
+        "plt.ylabel('mutational signature')\n",
+        "\n",
+        "centroids = kmeans.cluster_centers_\n",
+        "plt.scatter(\n",
+        "    centroids[:, 0],\n",
+        "    centroids[:, 1],\n",
+        "    marker=\"x\",\n",
+        "    s=169,\n",
+        "    linewidths=3,\n",
+        "    color=\"black\",\n",
+        "    zorder=10,\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "eE0SMd_5hVAu"
+      },
+      "id": "eE0SMd_5hVAu",
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "id": "ec31464b-c5c8-49fa-80a9-d5721b50a136",
+      "metadata": {
+        "id": "ec31464b-c5c8-49fa-80a9-d5721b50a136"
+      },
+      "source": [
+        "## Validation\n",
+        "We can now apply the fitted k-means model to the validation data, which was not used for fitting."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "01002155-5344-446d-a6bb-79fdf6aba05b",
+      "metadata": {
+        "id": "01002155-5344-446d-a6bb-79fdf6aba05b"
+      },
+      "outputs": [],
+      "source": [
+        "result = kmeans.predict(validation_data)\n",
+        "\n",
+        "# k-means labels are 0 and 1; colors[i-1] maps label 0 to 'blue' (index -1) and label 1 to 'orange'\n",
+        "colors = ['orange', 'blue']\n",
+        "predicted_colors = [colors[i-1] for i in result]\n",
+        "\n",
+        "plt.scatter(validation_data[:, 0], validation_data[:, 1], c=predicted_colors)\n",
+        "plt.xlabel('progression')\n",
+        "plt.ylabel('mutational signature')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "a32230e9-6f91-4248-b47a-688fbfe29bf9",
+      "metadata": {
+        "id": "a32230e9-6f91-4248-b47a-688fbfe29bf9"
+      },
+      "source": [
+        "## Prediction\n",
+        "After fitting and validating the model, we can reuse it to process other data sets.\n",
+        "It is uncommon to apply the model again to training and validation data, as those should be used only for building the model. Here we apply it to the remaining data points."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "fa01dc06-1a8a-4f79-a6c9-de7c54087f54",
+      "metadata": {
+        "id": "fa01dc06-1a8a-4f79-a6c9-de7c54087f54"
+      },
+      "outputs": [],
+      "source": [
+        "remaining_data = data[250:]\n",
+        "\n",
+        "prediction = kmeans.predict(remaining_data)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "f35dce87-89e0-4fec-ba16-020188f6bdb3",
+      "metadata": {
+        "id": "f35dce87-89e0-4fec-ba16-020188f6bdb3"
+      },
+      "outputs": [],
+      "source": [
+        "predicted_colors = [colors[i-1] for i in prediction]\n",
+        "\n",
+        "plt.scatter(remaining_data[:, 0], remaining_data[:, 1], c=predicted_colors)\n",
+        "plt.xlabel('progression')\n",
+        "plt.ylabel('mutational signature')"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3 (ipykernel)",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.9.16"
+    },
+    "colab": {
+      "provenance": []
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}