|
5 | 5 | "id": "8f78b35b", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | | - "# PCA and UMAP\n", |
| 8 | + "# PCA \n", |
9 | 9 | "## Principal Component Analysis\n", |
10 | | - "## Uniform Manifold Approximation and Projection for Dimension Reduction\n", |
| 10 | + "\n", |
11 | 11 | "\n", |
12 | 12 | "Anna Poetsch \n", |
13 | 13 | "27.June 2022\n", |
|
430 | 430 | "id": "02af1798", |
431 | 431 | "metadata": {}, |
432 | 432 | "source": [ |
433 | | - "The separation of species with PCA did work, but not very well." |
434 | | - ] |
435 | | - }, |
436 | | - { |
437 | | - "cell_type": "markdown", |
438 | | - "id": "c966449f", |
439 | | - "metadata": {}, |
440 | | - "source": [ |
441 | | - "## UMAP" |
442 | | - ] |
443 | | - }, |
444 | | - { |
445 | | - "cell_type": "code", |
446 | | - "execution_count": null, |
447 | | - "id": "fe4dedb9", |
448 | | - "metadata": {}, |
449 | | - "outputs": [], |
450 | | - "source": [ |
451 | | - "embedding = reducer.fit_transform(scaled_penguin_data)\n", |
452 | | - "embedding.shape" |
453 | | - ] |
454 | | - }, |
455 | | - { |
456 | | - "cell_type": "code", |
457 | | - "execution_count": null, |
458 | | - "id": "b4a083c3", |
459 | | - "metadata": {}, |
460 | | - "outputs": [], |
461 | | - "source": [ |
462 | | - "plt.scatter(\n", |
463 | | - " embedding[:, 0],\n", |
464 | | - " embedding[:, 1],\n", |
465 | | - " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})])" |
466 | | - ] |
467 | | - }, |
468 | | - { |
469 | | - "cell_type": "markdown", |
470 | | - "id": "067cc8f0", |
471 | | - "metadata": {}, |
472 | | - "source": [ |
473 | | - "The separation with UMAP worked generally well, yet there are some Chinstrap penguins that cluster with Gentoo. \n", |
474 | | - "Now we can use the two-dimensional embedding to visualise different data:" |
475 | | - ] |
476 | | - }, |
477 | | - { |
478 | | - "cell_type": "code", |
479 | | - "execution_count": null, |
480 | | - "id": "3435f0cd", |
481 | | - "metadata": { |
482 | | - "scrolled": true |
483 | | - }, |
484 | | - "outputs": [], |
485 | | - "source": [ |
486 | | - "plt.scatter(\n", |
487 | | - " embedding[:, 0],\n", |
488 | | - " embedding[:, 1],\n", |
489 | | - " c=penguins[\"culmen_length_mm\"])\n", |
490 | | - "plt.title(\"culmen_length_mm\")\n", |
491 | | - "plt.colorbar()\n", |
492 | | - "plt.show()" |
493 | | - ] |
494 | | - }, |
495 | | - { |
496 | | - "cell_type": "code", |
497 | | - "execution_count": null, |
498 | | - "id": "dfbd6694", |
499 | | - "metadata": {}, |
500 | | - "outputs": [], |
501 | | - "source": [ |
502 | | - "plt.scatter(\n", |
503 | | - " embedding[:, 0],\n", |
504 | | - " embedding[:, 1],\n", |
505 | | - " c=penguins[\"culmen_depth_mm\"])\n", |
506 | | - "plt.title(\"culmen_depth_mm\")\n", |
507 | | - "plt.colorbar()\n", |
508 | | - "plt.show()" |
509 | | - ] |
510 | | - }, |
511 | | - { |
512 | | - "cell_type": "code", |
513 | | - "execution_count": null, |
514 | | - "id": "e24261d2", |
515 | | - "metadata": {}, |
516 | | - "outputs": [], |
517 | | - "source": [ |
518 | | - "plt.scatter(\n", |
519 | | - " embedding[:, 0],\n", |
520 | | - " embedding[:, 1],\n", |
521 | | - " c=penguins[\"flipper_length_mm\"])\n", |
522 | | - "plt.title(\"flipper_length_mm\")\n", |
523 | | - "plt.colorbar()\n", |
524 | | - "plt.show()" |
525 | | - ] |
526 | | - }, |
527 | | - { |
528 | | - "cell_type": "code", |
529 | | - "execution_count": null, |
530 | | - "id": "871b5d67", |
531 | | - "metadata": {}, |
532 | | - "outputs": [], |
533 | | - "source": [ |
534 | | - "plt.scatter(\n", |
535 | | - " embedding[:, 0],\n", |
536 | | - " embedding[:, 1],\n", |
537 | | - " c=penguins[\"body_mass_g\"])\n", |
538 | | - "plt.title(\"body_mass_g\")\n", |
539 | | - "plt.colorbar()\n", |
540 | | - "plt.show()" |
541 | | - ] |
542 | | - }, |
543 | | - { |
544 | | - "cell_type": "markdown", |
545 | | - "id": "4b03c368", |
546 | | - "metadata": {}, |
547 | | - "source": [ |
548 | | - "### n_neighbors" |
549 | | - ] |
550 | | - }, |
551 | | - { |
552 | | - "cell_type": "code", |
553 | | - "execution_count": null, |
554 | | - "id": "4bc804db", |
555 | | - "metadata": { |
556 | | - "scrolled": true |
557 | | - }, |
558 | | - "outputs": [], |
559 | | - "source": [ |
560 | | - "for n in (2, 5, 15, 100, 1000):\n", |
561 | | - " reducer = umap.UMAP(n_neighbors=n)\n", |
562 | | - " embedding = reducer.fit_transform(scaled_penguin_data)\n", |
563 | | - " plt.scatter(\n", |
564 | | - " embedding[:, 0],\n", |
565 | | - " embedding[:, 1],\n", |
566 | | - " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})]\n", |
567 | | - " )\n", |
568 | | - " plt.title('n_neighbors = {}'.format(n))\n", |
569 | | - " plt.show()" |
570 | | - ] |
571 | | - }, |
572 | | - { |
573 | | - "cell_type": "markdown", |
574 | | - "id": "55b84bc4", |
575 | | - "metadata": {}, |
576 | | - "source": [ |
577 | | - "A small n_neighbors focuses on fine-grained local structure, so we might miss the bigger picture. Small values can also generate \"sausages\", which can be a reason to adjust the parameter. The higher the value, the more cramped the clusters become. One should not define more neighbors than there are data points ;-)" |
578 | | - ] |
579 | | - }, |
580 | | - { |
581 | | - "cell_type": "markdown", |
582 | | - "id": "34fac601", |
583 | | - "metadata": {}, |
584 | | - "source": [ |
585 | | - "### min_dist" |
586 | | - ] |
587 | | - }, |
588 | | - { |
589 | | - "cell_type": "code", |
590 | | - "execution_count": null, |
591 | | - "id": "929ed306", |
592 | | - "metadata": {}, |
593 | | - "outputs": [], |
594 | | - "source": [ |
595 | | - "for d in (0.0, 0.1, 0.5, 0.9):\n", |
596 | | - " reducer = umap.UMAP(min_dist=d)\n", |
597 | | - " embedding = reducer.fit_transform(scaled_penguin_data)\n", |
598 | | - " plt.scatter(\n", |
599 | | - " embedding[:, 0],\n", |
600 | | - " embedding[:, 1],\n", |
601 | | - " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})]\n", |
602 | | - " )\n", |
603 | | - " plt.title('min_dist = {}'.format(d))\n", |
604 | | - " plt.show()" |
605 | | - ] |
606 | | - }, |
607 | | - { |
608 | | - "cell_type": "markdown", |
609 | | - "id": "e6aaadc4", |
610 | | - "metadata": {}, |
611 | | - "source": [ |
612 | | - "min_dist defines how closely points may be packed together in the embedding. The higher the value, the looser the clusters will be." |
613 | | - ] |
614 | | - }, |
615 | | - { |
616 | | - "cell_type": "markdown", |
617 | | - "id": "b377fb9c", |
618 | | - "metadata": {}, |
619 | | - "source": [ |
620 | | - "### metric" |
621 | | - ] |
622 | | - }, |
623 | | - { |
624 | | - "cell_type": "code", |
625 | | - "execution_count": null, |
626 | | - "id": "c5b64fc2", |
627 | | - "metadata": {}, |
628 | | - "outputs": [], |
629 | | - "source": [ |
630 | | - "for m in (\"euclidean\",\"cosine\",\"correlation\"):\n", |
631 | | - " reducer = umap.UMAP(metric=m)\n", |
632 | | - " embedding = reducer.fit_transform(scaled_penguin_data)\n", |
633 | | - " plt.scatter(\n", |
634 | | - " embedding[:, 0],\n", |
635 | | - " embedding[:, 1],\n", |
636 | | - " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})]\n", |
637 | | - " )\n", |
638 | | - " plt.title('metric = {}'.format(m))\n", |
639 | | - " plt.show()" |
640 | | - ] |
641 | | - }, |
642 | | - { |
643 | | - "cell_type": "markdown", |
644 | | - "id": "a2fafd4b", |
645 | | - "metadata": {}, |
646 | | - "source": [ |
647 | | - "There are many more ways to define distances. Euclidean is the default, but depending on the data one might look for an alternative. For example, sparse data, i.e. data with many 0s such as RNA-Seq data, are frequently handled with cosine distance." |
648 | | - ] |
649 | | - }, |
650 | | - { |
651 | | - "cell_type": "markdown", |
652 | | - "id": "133755dd", |
653 | | - "metadata": {}, |
654 | | - "source": [ |
655 | | - "### Randomness \n", |
656 | | - "Don't forget, there is a random component as well! \n", |
657 | | - "Now we introduce different seeds, which change the random component of UMAP." |
658 | | - ] |
659 | | - }, |
660 | | - { |
661 | | - "cell_type": "code", |
662 | | - "execution_count": null, |
663 | | - "id": "03e7bc87", |
664 | | - "metadata": { |
665 | | - "scrolled": true |
666 | | - }, |
667 | | - "outputs": [], |
668 | | - "source": [ |
669 | | - "for r in (41,41,42,43):\n", |
670 | | - " np.random.seed(r)\n", |
671 | | - " reducer = umap.UMAP()\n", |
672 | | - " embedding = reducer.fit_transform(scaled_penguin_data)\n", |
673 | | - " plt.scatter(\n", |
674 | | - " embedding[:, 0],\n", |
675 | | - " embedding[:, 1],\n", |
676 | | - " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})])\n", |
677 | | - " plt.show()" |
678 | | - ] |
679 | | - }, |
680 | | - { |
681 | | - "cell_type": "markdown", |
682 | | - "id": "10c47853", |
683 | | - "metadata": {}, |
684 | | - "source": [ |
685 | | - "### spread" |
686 | | - ] |
687 | | - }, |
688 | | - { |
689 | | - "cell_type": "code", |
690 | | - "execution_count": null, |
691 | | - "id": "fc2e342c", |
692 | | - "metadata": {}, |
693 | | - "outputs": [], |
694 | | - "source": [ |
695 | | - "for s in (0.5,1,2):\n", |
696 | | - " reducer = umap.UMAP(spread=s)\n", |
697 | | - " embedding = reducer.fit_transform(scaled_penguin_data)\n", |
698 | | - " plt.scatter(\n", |
699 | | - " embedding[:, 0],\n", |
700 | | - " embedding[:, 1],\n", |
701 | | - " c=[sns.color_palette()[x] for x in penguins.species_short.map({\"Adelie\":0, \"Chinstrap\":1, \"Gentoo\":2})]\n", |
702 | | - " )\n", |
703 | | - " plt.title('spread = {}'.format(s))\n", |
704 | | - " plt.show()" |
705 | | - ] |
706 | | - }, |
707 | | - { |
708 | | - "cell_type": "markdown", |
709 | | - "id": "be3fcef6", |
710 | | - "metadata": {}, |
711 | | - "source": [ |
712 | | - "Spread is a parameter that sets the scale over which points are spread in the two-dimensional space. Here it does not make a big difference." |
| 433 | + "The separation of species with PCA worked, but not very well. It may be worth considering other approaches to dimensionality reduction: a non-linear strategy such as t-SNE or UMAP may help." |
713 | 434 | ] |
714 | 435 | } |
715 | 436 | ], |
|
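The new closing sentence of the notebook suggests that a non-linear strategy may separate groups where PCA struggles. The sketch below illustrates only that general idea, on a toy dataset, and is not taken from the notebook: it does not use the penguin data, and it does not implement t-SNE or UMAP. All names (`ellipse`, `first_pc`, `project`, `radial`) are hypothetical helpers, and a simple distance-from-centroid feature stands in for "some non-linear transformation". It uses only the Python standard library so that it runs without the notebook's dependencies.

```python
import math

def ellipse(radius, n, stretch=2.0):
    """Return n evenly spaced 2-D points on an ellipse stretched along x."""
    return [(stretch * radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def first_pc(points):
    """Return (centroid, unit axis) of the first principal component in 2-D."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Angle of the leading eigenvector of the covariance [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

# Toy data (an assumption for illustration): one group nested inside the other.
inner = ellipse(1.0, 40)
outer = ellipse(3.0, 40)
(cx, cy), (ax, ay) = first_pc(inner + outer)

def project(points):
    """Linear projection of centred points onto the first principal axis."""
    return [(x - cx) * ax + (y - cy) * ay for x, y in points]

def radial(points):
    """Non-linear feature: Euclidean distance from the overall centroid."""
    return [math.hypot(x - cx, y - cy) for x, y in points]

inner_pc1, outer_pc1 = project(inner), project(outer)
inner_r, outer_r = radial(inner), radial(outer)

# The inner group's PC1 range lies inside the outer group's range,
# so no threshold on PC1 can split the two groups.
linear_overlap = (min(outer_pc1) < min(inner_pc1)
                  and max(inner_pc1) < max(outer_pc1))
# On the radial feature the two groups do not overlap at all.
radial_separated = max(inner_r) < min(outer_r)

print("PC1 ranges overlap:", linear_overlap)
print("Radial feature separates groups:", radial_separated)
```

Nested groups are a deliberately extreme case: no single linear axis can pull them apart, while a feature that bends the space can. Real data such as the penguin measurements are far less clear-cut, so whether t-SNE or UMAP improves on PCA there has to be checked empirically.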