### Table of contents

- [Introduction](#introduction)
  * [GraphX](#graphx)
  * [Pregel](#pregel)
- [Setup](#setup)
  * [Mac](#mac)
  * [Linux](#linux)
- [Run](#run)
  * [VS Code](#vscode)
  * [Terminal](#terminal)

---

<a id="introduction" />

#### 1. Introduction

__`Data drives the world.`__ In the big data era, the need to analyse large volumes of data has become ever more challenging and complex. Several different ecosystems have been developed, each trying to solve a particular class of problems. One of the main tools in the big data ecosystem is [_Apache Spark_](https://spark.apache.org/).

__Apache Spark__ makes the analysis of big data considerably easier. Spark ships with implementations of many useful algorithms for data mining, data analysis, machine learning, and graph processing. It takes on the challenge of running sophisticated algorithms, with their tricky optimisations, on a distributed cluster: it effectively solves problems like fault tolerance for you and provides a simple API for parallel computation.

<a id="graphx" />

##### 1.1. GraphX

[GraphX](https://spark.apache.org/docs/latest/graphx-programming-guide.html) is a component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
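
To make the abstraction concrete, here is a minimal plain-Python sketch of such a property graph. GraphX itself is a Scala/Java API, so the structures and property names below are purely illustrative, not part of GraphX:

```python
# A property graph: directed edges, with arbitrary properties attached
# to each vertex and edge. All names here are illustrative.
vertices = {
    1: {"name": "Alice", "role": "student"},
    2: {"name": "Bob",   "role": "professor"},
    3: {"name": "Carol", "role": "student"},
}

# A directed *multigraph* may contain parallel edges
# (note the two distinct edges 1 -> 2 below).
edges = [
    (1, 2, {"relation": "advisee-of"}),
    (1, 2, {"relation": "co-author"}),
    (3, 2, {"relation": "advisee-of"}),
]

# An "edge triplet" joins each edge with the properties of its source and
# destination vertices -- the view that graph-parallel computations operate on.
triplets = [(vertices[src], attr, vertices[dst]) for src, dst, attr in edges]

for src, attr, dst in triplets:
    print(f"{src['name']} -[{attr['relation']}]-> {dst['name']}")
```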

This repository serves as a starting point for working with the _Spark GraphX API_. As part of our SDM lab, we focus on getting a basic idea of how to work with `pregel` and on gaining hands-on experience with the distributed processing of large graphs.

<a id="pregel" />

##### 1.2. Pregel

Pregel, originally developed by Google, is essentially a message-passing interface which facilitates the processing of large-scale graphs. _Apache Spark's GraphX_ module provides the [Pregel API](https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api), which allows us to write distributed graph programs and algorithms. For more details, check out the [original paper](https://github.com/mohammadzainabbas/SDM-Lab-2/blob/main/docs/pregel.pdf).
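
To illustrate the model without a Spark cluster, here is a minimal pure-Python sketch of Pregel-style supersteps computing single-source shortest paths. The function and variable names are our own, not the GraphX Pregel API:

```python
import math

def pregel_sssp(edges, num_vertices, source):
    """Pregel-style single-source shortest paths (illustrative sketch).

    Each superstep: every vertex that received a shorter candidate distance
    updates its state and sends new candidates along its out-edges. The
    computation halts when no messages are in flight ("vote to halt").
    """
    dist = [math.inf] * num_vertices
    messages = {source: 0.0}              # initial message to the source vertex

    while messages:                       # one iteration == one superstep
        next_messages = {}
        for v, candidate in messages.items():
            if candidate < dist[v]:       # vertex program: keep the best distance
                dist[v] = candidate
                for src, dst, w in edges:
                    if src == v:          # send messages along out-edges
                        d = dist[v] + w
                        # merge concurrent messages by taking the minimum
                        if d < next_messages.get(dst, math.inf):
                            next_messages[dst] = d
        messages = next_messages
    return dist

# Tiny directed graph as (src, dst, weight) triples
edges = [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 2.0), (2, 3, 1.0)]
print(pregel_sssp(edges, 4, source=0))  # shortest distances from vertex 0
```

GraphX's real `pregel` operator follows the same three-part shape: a vertex program, a message-sending function over edge triplets, and a commutative merge of messages.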

---

<a id="setup" />

#### 2. Setup

Before starting, you may need to set up your machine first. Please follow the guides below to set up Spark and Maven on your machine.

<a id="mac" />

##### 2.1. Mac

We have created a setup script which will set up brew, apache-spark, maven and a conda environment. If you are on a Mac, you can run the following commands:

```bash
git clone https://github.com/mohammadzainabbas/SDM-Lab-2.git
cd SDM-Lab-2 && sh scripts/setup.sh
```

<a id="linux" />

##### 2.2. Linux

If you are on Linux, you need to install [Apache Spark](https://spark.apache.org) yourself. You can follow this [helpful guide](https://computingforgeeks.com/how-to-install-apache-spark-on-ubuntu-debian/) to install `apache spark`, and install Maven via this [guide](https://maven.apache.org/install.html).

We also recommend installing _conda_ on your machine. You can set up conda from [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html).

After you have conda, create a new environment via:

```bash
conda create -n spark_env python=3.8
```

> Note: We are using Python 3.8 because Spark doesn't support Python 3.9 and above (at the time of writing).

Activate your environment:

```bash
conda activate spark_env
```

Now, you need to install _pyspark_:

```bash
pip install pyspark
```

If you are using bash:

```bash
echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.bashrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.bashrc
. ~/.bashrc
```

And if you are using zsh:

```zsh
echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.zshrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.zshrc
. ~/.zshrc
```

---

<a id="run" />

#### 3. Run

Since this is a typical Maven project, you can run it however you'd normally run a Maven project. To make things easier, we provide two ways to run it.

<a id="vscode" />

##### 3.1. VS Code

If you are using VS Code, change the `args` in the `Launch Main` configuration in the `launch.json` file located in the `.vscode` directory.

See the [main class](https://github.com/mohammadzainabbas/SDM-Lab-2/blob/main/src/main/java/Main.java) for the supported arguments.

<a id="terminal" />

##### 3.2. Terminal

Just run the following with the supported arguments:

```bash
sh scripts/build_n_run.sh exercise1
```

> Note: `exercise1` here is the argument you'd need to run the first exercise.

Again, you can check the [main class](https://github.com/mohammadzainabbas/SDM-Lab-2/blob/main/src/main/java/Main.java) for the supported arguments.