Commit 911cdc7

Merge pull request #20 from mohammadzainabbas/zain
Updated readme.md
2 parents bad60cc + 8df52d5

File tree

2 files changed: +234 -3 lines changed

README.md

Lines changed: 122 additions & 3 deletions
### Table of contents

- [Introduction](#introduction)
  * [GraphX](#graphx)
  * [Pregel](#pregel)
- [Setup](#setup)
  * [Mac](#mac)
  * [Linux](#linux)
- [Run](#run)
  * [VS Code](#vscode)
  * [Terminal](#terminal)

---

<a id="introduction" />

#### 1. Introduction

__`Data drives the world.`__ In this big data era, the need to analyse large volumes of data has become ever more challenging and complex. Several different ecosystems have been developed, each trying to solve some particular problem. One of the main tools in the big data ecosystem is [_Apache Spark_](https://spark.apache.org/).

__Apache Spark__ makes the analysis of big data substantially easier. Spark brings implementations of many useful algorithms for data mining, data analysis, machine learning, and algorithms on graphs. It takes on the challenge of implementing sophisticated algorithms with tricky optimisations, and gives you the ability to run your code on a distributed cluster. Spark effectively solves problems like fault tolerance and provides a simple API for parallel computation.
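
As a small taste of that API, here is a minimal illustrative sketch in Scala (ours, not part of this repo), assuming the Spark shell's built-in `SparkContext`, `sc`:

```scala
// Run inside `spark-shell`, which provides a SparkContext as `sc`.
// parallelize() distributes a local collection across the cluster;
// map() and reduce() then run on the partitions in parallel.
val numbers = sc.parallelize(1 to 1000000)
val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
println(s"Sum of squares: $sumOfSquares")
```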

<a id="graphx" />

##### 1.1. GraphX

[GraphX](https://spark.apache.org/docs/latest/graphx-programming-guide.html) is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.

This repository serves as a starting point for working with the _Spark GraphX API_. As part of our SDM lab, we focus on getting a basic idea of how to work with `pregel` and gaining hands-on experience with distributed processing of large graphs.
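
To make the Graph abstraction concrete, here is a minimal illustrative sketch in Scala (GraphX's native language; this repo's own code is Java) with made-up vertex and edge data, again assuming the Spark shell's `sc`:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertices: (id, property) pairs; here each property is a (name, role) tuple
val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (1L, ("alice", "student")),
  (2L, ("bob", "postdoc")),
  (3L, ("carol", "professor"))
))

// Directed edges, each carrying a String property that labels the relationship
val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "collaborates-with"),
  Edge(2L, 3L, "reports-to"),
  Edge(3L, 1L, "supervises")
))

// The property graph: vertex attribute type (String, String), edge attribute type String
val graph: Graph[(String, String), String] = Graph(users, relationships)

// A simple graph-parallel query: how many edges originate from a postdoc?
println(graph.triplets.filter(t => t.srcAttr._2 == "postdoc").count())
```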

<a id="pregel" />

##### 1.2. Pregel

Pregel, originally developed by Google, is essentially a message-passing interface which facilitates the processing of large-scale graphs. _Apache Spark's GraphX_ module provides the [Pregel API](https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api), which allows us to write distributed graph programs/algorithms. For more details, kindly check out the [original paper](https://github.com/mohammadzainabbas/SDM-Lab-2/blob/main/docs/pregel.pdf).
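
To give a flavour of the API before you dive in, below is the single-source shortest-paths example from the [GraphX programming guide](https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api) (Scala; not code from this repo). The three functions passed to `pregel` are the vertex program, the message sender, and the message combiner:

```scala
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators

// A random graph whose edge attributes hold distances; assumes an existing `sc`
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42 // the source vertex

// Every vertex starts at distance 0 (the source) or infinity (everyone else)
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // vertex program: keep the best distance
  triplet => { // send message: propose a shorter path to the destination
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b) // merge messages: take the minimum proposal
)
println(sssp.vertices.collect.mkString("\n"))
```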

---

<a id="setup" />

#### 2. Setup

Before starting, you may need to set up your machine first. Please follow the guides below to set up Spark and Maven on your machine.

<a id="mac" />

##### 2.1. Mac

We have created a setup script which will set up brew, apache-spark, maven, and a conda environment. If you are on a Mac, you can run the following commands:

```bash
git clone https://github.com/mohammadzainabbas/SDM-Lab-2.git
cd SDM-Lab-2 && sh scripts/setup.sh
```

<a id="linux" />

##### 2.2. Linux

If you are on Linux, you need to install [Apache Spark](https://spark.apache.org) yourself. You can follow this [helpful guide](https://computingforgeeks.com/how-to-install-apache-spark-on-ubuntu-debian/) to install `apache spark`, and you can install maven via this [guide](https://maven.apache.org/install.html).

We also recommend installing _conda_ on your machine. You can set up conda from [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html).

After you have conda, create a new environment via:

```bash
conda create -n spark_env python=3.8
```

> Note: We are using Python 3.8 because Spark doesn't support Python 3.9 and above (at the time of writing).

Activate your environment:

```bash
conda activate spark_env
```

Now, you need to install _pyspark_:

```bash
pip install pyspark
```

If you are using bash:

```bash
echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.bashrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.bashrc
. ~/.bashrc
```

And if you are using zsh:

```zsh
echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.zshrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.zshrc
. ~/.zshrc
```

---

<a id="run" />

#### 3. Run

Since this is a typical maven project, you can run it however you'd like to run a maven project. To make things easier, we provide two ways to run this project.

<a id="vscode" />

##### 3.1. VS Code

If you are using VS Code, change the `args` in the `Launch Main` configuration in the `launch.json` file located in the `.vscode` directory.

See the [main class](https://github.com/mohammadzainabbas/SDM-Lab-2/blob/main/src/main/java/Main.java) for the supported arguments.

<a id="terminal" />

##### 3.2. Terminal

Just run the following with the supported arguments:

```bash
sh scripts/build_n_run.sh exercise1
```

> Note: `exercise1` here is the argument you'd need to run the first exercise.

Again, you can check the [main class](https://github.com/mohammadzainabbas/SDM-Lab-2/blob/main/src/main/java/Main.java) for the supported arguments.

scripts/setup.sh

Lines changed: 112 additions & 0 deletions

```bash
#!/bin/bash
#====================================================================================
# Author: Mohammad Zain Abbas
# Date: 29th April, 2022
#====================================================================================
# This script is used to set up the environment & installations
#====================================================================================

# Exit on errors, unset variables, and pipeline failures
set -e -u -o pipefail

log () {
    echo "[[ log ]] $1"
}

error () {
    echo "[[ error ]] $1"
}

# Function that shows usage for this script
function usage()
{
cat << HEREDOC
Setup for SDM Lab 2 @ UPC

Usage:
    $progname [OPTION] [Value]

Options:
    -h, --help    Show usage

Examples:
    $ $progname
    ⚐ → Installs all dependencies for your SDM Lab 2
HEREDOC
}

progname=$(basename "$0")
env_name='pyspark_env'

# Get all the arguments and update accordingly
while [[ "$#" -gt 0 ]]; do
    case $1 in
    -h|--help)
        usage
        exit 1
        ;;
    *) printf "\n$progname: invalid option → '$1'\n\n⚐ Try '$progname -h' for more information\n\n"; exit 1 ;;
    esac
    shift
done

install_brew() {
    if [ ! "$(type -p brew)" ]; then
        error "'brew' not found. Installing it now ..."
        /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
    else
        log "'brew' found ..."
    fi
}

install_apache_spark() {
    if [ ! "$(type -p spark-submit)" ]; then
        error "'apache-spark' not found. Installing it now ..."
        brew install apache-spark
    else
        log "'apache-spark' found ..."
    fi
}

install_mvn() {
    if [ ! "$(type -p mvn)" ]; then
        error "'mvn' not found. Installing it now ..."
        # The Homebrew formula for Maven is 'maven' (not 'mvn')
        brew install maven
    else
        log "'mvn' found ..."
    fi
}

conda_init() {
    conda init --all || error "Unable to conda init ..."
    if [[ $SHELL == *"zsh"* ]]; then
        . ~/.zshrc
    elif [[ $SHELL == *"bash"* ]]; then
        . ~/.bashrc
    else
        error "Please restart your shell to see effects"
    fi
}

install_conda() {
    if [ ! "$(type -p conda)" ]; then
        error "'anaconda' not found. Installing it now ..."
        brew install --cask anaconda && conda_init
    else
        log "'anaconda' found ..."
    fi
}

create_conda_env() {
    conda create -n $env_name python=3.8 pandas -y || error "Unable to create new env '$env_name' ..."
    conda activate $env_name &> /dev/null || echo "" > /dev/null
    pip install pyspark > /dev/null
}

log "Starting Setup Service"

install_brew
install_apache_spark
install_mvn
install_conda
create_conda_env

log "All done !!"
```
