Commit 39b2fd7

Machine Learning by Melissa Sanabria
I had to drop one slide, because it was too big.
1 parent 055cd1e commit 39b2fd7

3 files changed

+552
-0
lines changed

1.18 MB
Binary file not shown.
Lines changed: 276 additions & 0 deletions
@@ -0,0 +1,276 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "12032af8-547f-486b-baab-d9f3f46cf957",
6+
"metadata": {
7+
"id": "12032af8-547f-486b-baab-d9f3f46cf957"
8+
},
9+
"source": [
10+
"# Unsupervised machine learning"
11+
]
12+
},
13+
{
14+
"cell_type": "code",
15+
"execution_count": null,
16+
"id": "825d6705-cf79-4b52-acc6-da93d6e2d96c",
17+
"metadata": {
18+
"id": "825d6705-cf79-4b52-acc6-da93d6e2d96c"
19+
},
20+
"outputs": [],
21+
"source": [
22+
"import matplotlib.pyplot as plt\n",
23+
"import pandas as pd\n",
24+
"from sklearn.cluster import KMeans\n",
25+
"from sklearn.metrics import jaccard_score, accuracy_score, precision_score, recall_score"
26+
]
27+
},
28+
{
29+
"cell_type": "code",
30+
"source": [
31+
"def generate_biomodal_2d_data():\n",
32+
" import numpy as np\n",
33+
" \n",
34+
" rs = np.random.RandomState(seed=0)\n",
35+
"\n",
36+
" x1 = rs.normal(3, 1, (150,2))\n",
37+
" x2 = rs.normal(8, 1.5, (150,2))\n",
38+
"\n",
39+
" x_all = np.concatenate((x1, x2), axis=0)\n",
40+
" rs.shuffle(x_all)\n",
41+
" return x_all"
42+
],
43+
"metadata": {
44+
"id": "y5kc5gCOfEVu"
45+
},
46+
"id": "y5kc5gCOfEVu",
47+
"execution_count": null,
48+
"outputs": []
49+
},
50+
{
51+
"cell_type": "markdown",
52+
"id": "5292a204-9e34-47dd-8fac-f7db68cd4bc6",
53+
"metadata": {
54+
"id": "5292a204-9e34-47dd-8fac-f7db68cd4bc6"
55+
},
56+
"source": [
57+
"In the following data set, we simulate patients with myeloid leukemia. We analyze two features, progression and mutational signature. Patients with faster progression and a higher mutational signature are considered to have Acute Myeloid Leukemia (AML)."
58+
]
59+
},
60+
{
61+
"cell_type": "code",
62+
"execution_count": null,
63+
"id": "8e3707f7-77be-4a26-811d-3e6bdd47018e",
64+
"metadata": {
65+
"id": "8e3707f7-77be-4a26-811d-3e6bdd47018e"
66+
},
67+
"outputs": [],
68+
"source": [
69+
"data = generate_biomodal_2d_data()\n",
70+
"\n",
71+
"plt.scatter(data[:, 0], data[:, 1], c='#DDDDDD')\n",
72+
"plt.xlabel('progression')\n",
73+
"plt.ylabel('mutational signature')"
74+
]
75+
},
76+
{
77+
"cell_type": "markdown",
78+
"id": "ad993994-7a52-4d58-8873-6c095debe66e",
79+
"metadata": {
80+
"id": "ad993994-7a52-4d58-8873-6c095debe66e"
81+
},
82+
"source": [
83+
"To get more detailed insight into the data, we display the first 20 entries."
84+
]
85+
},
86+
{
87+
"cell_type": "code",
88+
"execution_count": null,
89+
"id": "7d78f58d-7c5b-452a-b545-8b3e791f9c5a",
90+
"metadata": {
91+
"id": "7d78f58d-7c5b-452a-b545-8b3e791f9c5a"
92+
},
93+
"outputs": [],
94+
"source": [
95+
"pd.DataFrame(data[:20], columns=[\"progression\", \"mutational signature\"])"
96+
]
97+
},
98+
{
99+
"cell_type": "markdown",
100+
"id": "e7f82812-4bac-4984-b90e-a646718a6adb",
101+
"metadata": {
102+
"id": "e7f82812-4bac-4984-b90e-a646718a6adb"
103+
},
104+
"source": [
105+
"## Separating training and validation data\n",
106+
"Before we fit our k-means model, we split the data into two subsets, so that the clustering can be checked on points the model has not seen. We fit on the first 200 data points and validate on the next 50."
107+
]
108+
},
109+
{
110+
"cell_type": "code",
111+
"execution_count": null,
112+
"id": "2fcb456d-90c9-4cc2-90c6-9c69a43e72bb",
113+
"metadata": {
114+
"id": "2fcb456d-90c9-4cc2-90c6-9c69a43e72bb"
115+
},
116+
"outputs": [],
117+
"source": [
118+
"train_data = data[:200]\n",
119+
"validation_data = data[200:250]"
120+
]
121+
},
122+
{
123+
"cell_type": "markdown",
124+
"id": "22c17985-c857-4170-84ce-66e96f8e4971",
125+
"metadata": {
126+
"id": "22c17985-c857-4170-84ce-66e96f8e4971"
127+
},
128+
"source": [
129+
"## Training\n",
130+
"With the selected data, we can train our k-means model."
131+
]
132+
},
133+
{
134+
"cell_type": "code",
135+
"execution_count": null,
136+
"id": "ec60e60e-bc9e-4e46-84a3-90366cf45c99",
137+
"metadata": {
138+
"id": "ec60e60e-bc9e-4e46-84a3-90366cf45c99"
139+
},
140+
"outputs": [],
141+
"source": [
142+
"kmeans = KMeans(n_clusters=2, random_state=0, n_init=\"auto\").fit(train_data)"
143+
]
144+
},
145+
{
146+
"cell_type": "code",
147+
"source": [
148+
"result = kmeans.predict(train_data)\n",
149+
"\n",
150+
"colors = ['orange', 'blue']\n",
151+
"predicted_colors = []\n",
152+
"for i in result:\n",
153+
" predicted_colors.append(colors[i-1])\n",
154+
"\n",
155+
"plt.scatter(train_data[:, 0], train_data[:, 1], c=predicted_colors)\n",
156+
"plt.xlabel('progression')\n",
157+
"plt.ylabel('mutational signature')\n",
158+
"\n",
159+
"centroids = kmeans.cluster_centers_\n",
160+
"plt.scatter(\n",
161+
" centroids[:, 0],\n",
162+
" centroids[:, 1],\n",
163+
" marker=\"x\",\n",
164+
" s=169,\n",
165+
" linewidths=3,\n",
166+
" color=\"black\",\n",
167+
" zorder=10,\n",
168+
")"
169+
],
170+
"metadata": {
171+
"id": "eE0SMd_5hVAu"
172+
},
173+
"id": "eE0SMd_5hVAu",
174+
"execution_count": null,
175+
"outputs": []
176+
},
177+
{
178+
"cell_type": "markdown",
179+
"id": "ec31464b-c5c8-49fa-80a9-d5721b50a136",
180+
"metadata": {
181+
"id": "ec31464b-c5c8-49fa-80a9-d5721b50a136"
182+
},
183+
"source": [
184+
"## Validation\n",
185+
"We can now apply the trained k-means model to the validation data."
186+
]
187+
},
188+
{
189+
"cell_type": "code",
190+
"execution_count": null,
191+
"id": "01002155-5344-446d-a6bb-79fdf6aba05b",
192+
"metadata": {
193+
"id": "01002155-5344-446d-a6bb-79fdf6aba05b"
194+
},
195+
"outputs": [],
196+
"source": [
197+
"result = kmeans.predict(validation_data)\n",
198+
"\n",
199+
"colors = ['orange', 'blue']\n",
200+
"predicted_colors = []\n",
201+
"for i in result:\n",
202+
" predicted_colors.append(colors[i-1])\n",
203+
"\n",
204+
"plt.scatter(validation_data[:, 0], validation_data[:, 1], c=predicted_colors)\n",
205+
"plt.xlabel('progression')\n",
206+
"plt.ylabel('mutational signature')"
207+
]
208+
},
209+
{
210+
"cell_type": "markdown",
211+
"id": "a32230e9-6f91-4248-b47a-688fbfe29bf9",
212+
"metadata": {
213+
"id": "a32230e9-6f91-4248-b47a-688fbfe29bf9"
214+
},
215+
"source": [
216+
"## Prediction\n",
217+
"After training and validating the model, we can reuse it to process other data sets.\n",
218+
"It is uncommon to report predictions on training and validation data, as those should be used only for building the model. Here we apply the model to the remaining data points."
219+
]
220+
},
221+
{
222+
"cell_type": "code",
223+
"execution_count": null,
224+
"id": "fa01dc06-1a8a-4f79-a6c9-de7c54087f54",
225+
"metadata": {
226+
"id": "fa01dc06-1a8a-4f79-a6c9-de7c54087f54"
227+
},
228+
"outputs": [],
229+
"source": [
230+
"remaining_data = data[250:]\n",
231+
"\n",
232+
"prediction = kmeans.predict(remaining_data)"
233+
]
234+
},
235+
{
236+
"cell_type": "code",
237+
"execution_count": null,
238+
"id": "f35dce87-89e0-4fec-ba16-020188f6bdb3",
239+
"metadata": {
240+
"id": "f35dce87-89e0-4fec-ba16-020188f6bdb3"
241+
},
242+
"outputs": [],
243+
"source": [
244+
"predicted_colors = [colors[i-1] for i in prediction]\n",
245+
"\n",
246+
"plt.scatter(remaining_data[:, 0], remaining_data[:, 1], c=predicted_colors)\n",
247+
"plt.xlabel('progression')\n",
248+
"plt.ylabel('mutational signature')"
249+
]
250+
}
251+
],
252+
"metadata": {
253+
"kernelspec": {
254+
"display_name": "Python 3 (ipykernel)",
255+
"language": "python",
256+
"name": "python3"
257+
},
258+
"language_info": {
259+
"codemirror_mode": {
260+
"name": "ipython",
261+
"version": 3
262+
},
263+
"file_extension": ".py",
264+
"mimetype": "text/x-python",
265+
"name": "python",
266+
"nbconvert_exporter": "python",
267+
"pygments_lexer": "ipython3",
268+
"version": "3.9.16"
269+
},
270+
"colab": {
271+
"provenance": []
272+
}
273+
},
274+
"nbformat": 4,
275+
"nbformat_minor": 5
276+
}
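
The notebook imports `accuracy_score` and `jaccard_score`, but its validation step only plots the predicted clusters. A minimal sketch of how a quantitative check could look is given here. It rests on assumptions that are not part of this commit: a hypothetical generator `generate_labeled_bimodal_2d_data` that also returns the group each point was drawn from, and `n_init=10` in place of `"auto"` so that it also runs on older scikit-learn versions. Because k-means assigns cluster ids arbitrarily, the sketch maps the cluster with the larger centroid to group 1 before scoring.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, jaccard_score


def generate_labeled_bimodal_2d_data(seed=0):
    """Hypothetical labeled variant of the notebook's generator (not in the commit)."""
    rs = np.random.RandomState(seed)
    x1 = rs.normal(3, 1, (150, 2))    # group 0: lower values in both features
    x2 = rs.normal(8, 1.5, (150, 2))  # group 1: higher values in both features
    x_all = np.concatenate((x1, x2), axis=0)
    y_all = np.concatenate((np.zeros(150, dtype=int), np.ones(150, dtype=int)))
    # Shuffle points and labels with the same permutation so they stay aligned.
    order = rs.permutation(len(x_all))
    return x_all[order], y_all[order]


data, labels = generate_labeled_bimodal_2d_data()

# Same split sizes as in the notebook: 200 points for fitting, 50 for validation.
train_data = data[:200]
validation_data = data[200:250]
validation_labels = labels[200:250]

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(train_data)
predicted = kmeans.predict(validation_data)

# Cluster ids are arbitrary: treat the cluster with the larger centroid as group 1.
high_cluster = int(np.argmax(kmeans.cluster_centers_.sum(axis=1)))
predicted_groups = (predicted == high_cluster).astype(int)

accuracy = accuracy_score(validation_labels, predicted_groups)
jaccard = jaccard_score(validation_labels, predicted_groups)
print(f"accuracy={accuracy:.2f}, jaccard={jaccard:.2f}")
```

The labels are used only for scoring; the model itself is still fitted without them, so the workflow stays unsupervised.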
