Commit 95db2ac

shortened scripts to PCA
1 parent 8a94266 commit 95db2ac

4 files changed: +3 -5570 lines changed

.DS_Store

0 Bytes
Binary file not shown.
6 KB
Binary file not shown.

10_correlation_dim_reduction/Correlations-Copy1.ipynb

Lines changed: 0 additions & 5288 deletions
This file was deleted.

10_correlation_dim_reduction/PCA_UMAP.ipynb renamed to 10_correlation_dim_reduction/PCA.ipynb

Lines changed: 3 additions & 282 deletions
@@ -5,9 +5,9 @@
  "id": "8f78b35b",
  "metadata": {},
  "source": [
- "# PCA and UMAP\n",
+ "# PCA \n",
  "## Principle Component analysis\n",
- "## Uniform Manifold Approximation and Projection for Dimension Reduction\n",
+ "\n",
  "\n",
  "Anna Poetsch \n",
  "27.June 2022\n",
@@ -430,286 +430,7 @@
  "id": "02af1798",
  "metadata": {},
  "source": [
- "The separation of species with PCA did work, but not very well."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c966449f",
- "metadata": {},
- "source": [
- "## UMAP"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fe4dedb9",
- "metadata": {},
- "outputs": [],
- "source": [
- "embedding = reducer.fit_transform(scaled_penguin_data)\n",
- "embedding.shape"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b4a083c3",
- "metadata": {},
- "outputs": [],
- "source": [
- "plt.scatter(\n",
- " embedding[:, 0],\n",
- " embedding[:, 1],\n",
- " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})])"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "067cc8f0",
- "metadata": {},
- "source": [
- "The separation with UMAP worked generally well, yet there are some Chinstrap penguins that cluster with Gentoo. \n",
- "Now we can use the two-dimensional embedding to visualise different data:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3435f0cd",
- "metadata": {
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "plt.scatter(\n",
- " embedding[:, 0],\n",
- " embedding[:, 1],\n",
- " c=penguins[\"culmen_length_mm\"])\n",
- "plt.title(\"culmen_length_mm\")\n",
- "plt.colorbar()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "dfbd6694",
- "metadata": {},
- "outputs": [],
- "source": [
- "plt.scatter(\n",
- " embedding[:, 0],\n",
- " embedding[:, 1],\n",
- " c=penguins[\"culmen_depth_mm\"])\n",
- "plt.title(\"culmen_depth_mm\")\n",
- "plt.colorbar()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e24261d2",
- "metadata": {},
- "outputs": [],
- "source": [
- "plt.scatter(\n",
- " embedding[:, 0],\n",
- " embedding[:, 1],\n",
- " c=penguins[\"flipper_length_mm\"])\n",
- "plt.title(\"flipper_length_mm\")\n",
- "plt.colorbar()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "871b5d67",
- "metadata": {},
- "outputs": [],
- "source": [
- "plt.scatter(\n",
- " embedding[:, 0],\n",
- " embedding[:, 1],\n",
- " c=penguins[\"body_mass_g\"])\n",
- "plt.title(\"body_mass_g\")\n",
- "plt.colorbar()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4b03c368",
- "metadata": {},
- "source": [
- "### n_neighbors"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "4bc804db",
- "metadata": {
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "for n in (2, 5, 15, 100, 1000):\n",
- " reducer = umap.UMAP(n_neighbors=n)\n",
- " embedding = reducer.fit_transform(scaled_penguin_data)\n",
- " plt.scatter(\n",
- " embedding[:, 0],\n",
- " embedding[:, 1],\n",
- " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})]\n",
- " )\n",
- " plt.title('n_neighbors = {}'.format(n))\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "55b84bc4",
- "metadata": {},
- "source": [
- "n_neighbors does focus on fine grained structure, when kept small. Then we might miss the bigger picture. When small,it might also generate \"sausages\", which can be a reason to want to modify the parameters. The higher one goes, the more cramped the clusters become. One should not define more neighbors than there are data points ;-) "
- ]
- },
- {
- "cell_type": "markdown",
- "id": "34fac601",
- "metadata": {},
- "source": [
- "### min_dist"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "929ed306",
- "metadata": {},
- "outputs": [],
- "source": [
- "for d in (0.0, 0.1, 0.5, 0.9):\n",
- " reducer = umap.UMAP(min_dist=d)\n",
- " embedding = reducer.fit_transform(scaled_penguin_data)\n",
- " plt.scatter(\n",
- " embedding[:, 0],\n",
- " embedding[:, 1],\n",
- " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})]\n",
- " )\n",
- " plt.title('min_dist = {}'.format(d))\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e6aaadc4",
- "metadata": {},
- "source": [
- "min_dist defines how close we allow points to lie on top of each other. The higher the value, the more loose our clusters will be. "
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b377fb9c",
- "metadata": {},
- "source": [
- "### metric"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c5b64fc2",
- "metadata": {},
- "outputs": [],
- "source": [
- "for m in (\"euclidean\",\"cosine\",\"correlation\"):\n",
- " reducer = umap.UMAP(metric=m)\n",
- " embedding = reducer.fit_transform(scaled_penguin_data)\n",
- " plt.scatter(\n",
- " embedding[:, 0],\n",
- " embedding[:, 1],\n",
- " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})]\n",
- " )\n",
- " plt.title('metric = {}'.format(m))\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a2fafd4b",
- "metadata": {},
- "source": [
- "There are many more possibilities to define distances. Eucledian is the default, but one might look for an alternative dependent on the data. For example sparse data, i.e. data with many 0s are frequently addressed with cosine distance. Such data are for example RNA-Seq data. "
- ]
- },
- {
- "cell_type": "markdown",
- "id": "133755dd",
- "metadata": {},
- "source": [
- "### Randomness \n",
- "Dont forget, there is a random component as well! \n",
- "Now we are introducing different seeds, which will change the random component of UMAP "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "03e7bc87",
- "metadata": {
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "for r in (41,41,42,43):\n",
- " np.random.seed(r)\n",
- " reducer = umap.UMAP()\n",
- " embedding = reducer.fit_transform(scaled_penguin_data)\n",
- " plt.scatter(\n",
- " embedding[:, 0],\n",
- " embedding[:, 1],\n",
- " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})])\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "10c47853",
- "metadata": {},
- "source": [
- "### spread"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fc2e342c",
- "metadata": {},
- "outputs": [],
- "source": [
- "for s in (0.5,1,2):\n",
- " reducer = umap.UMAP(spread=s)\n",
- " embedding = reducer.fit_transform(scaled_penguin_data)\n",
- " plt.scatter(\n",
- " embedding[:, 0],\n",
- " embedding[:, 1],\n",
- " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})]\n",
- " )\n",
- " plt.title('spread = {}'.format(s))\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "be3fcef6",
- "metadata": {},
- "source": [
- "Spread is a parameter that allows increased spread in the two dimensional space. Here it does not make a big difference."
+ "The separation of species with PCA did work, but not very well. Maybe it is worth thinking about different ways of dimensionality reduction? Maybe a non-linear strategy, such as t-SNE or UMAP may help. "
  ]
  }
  ],
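
Note on the retained conclusion cell: the statement that PCA separated the species "but not very well" can be checked by looking at how much variance the first two components capture. The following is a minimal sketch, not the notebook's own code. It assumes NumPy is available, uses synthetic data in place of the penguin measurements, and the helper name `pca_via_svd` is hypothetical.

```python
import numpy as np


def pca_via_svd(X, n_components=2):
    """Project standardized data onto its leading principal components."""
    # Standardize each feature to zero mean and unit variance.
    X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data: the rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)
    # Fraction of the total variance carried by each component.
    explained = S**2 / np.sum(S**2)
    # Coordinates of every sample along the leading axes.
    scores = X_scaled @ Vt[:n_components].T
    return scores, explained[:n_components]


# Synthetic stand-in (assumption): three groups of 50 samples, four features.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
X = np.vstack([c + rng.normal(size=(50, 4)) for c in centers])

scores, explained = pca_via_svd(X)
print(scores.shape)
print(explained.round(3))
```

If the first two components explain only a modest share of the variance, or the groups still overlap in `scores`, that would be one reason to try a non-linear method as the new cell text suggests.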
