Spark - MLlib, graphframes

ML vs MLlib

Parts B and D (the MLlib exercises) can be solved using either the DataFrame-based API (pyspark.ml) or the RDD-based API (pyspark.mllib). The corresponding templates have the suffixes _ml and _mllib. Before submitting, make sure the Python files for parts B and D are named part_b.py and part_d.py respectively.
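The renaming step can be scripted. The following is a minimal sketch, not part of the provided templates: the helper name select_template is hypothetical, and it copies the chosen template rather than renaming it, so the original _ml or _mllib file is kept.

```python
import shutil
from pathlib import Path


def select_template(part, api, directory="."):
    """Copy part_<part>_<api>.py to part_<part>.py and return the new path.

    Hypothetical helper: `part` is "b" or "d", `api` is "ml" or "mllib".
    """
    if part not in ("b", "d"):
        raise ValueError("only parts b and d have _ml/_mllib templates")
    if api not in ("ml", "mllib"):
        raise ValueError("api must be 'ml' or 'mllib'")
    base = Path(directory)
    source = base / f"part_{part}_{api}.py"
    target = base / f"part_{part}.py"
    # copyfile overwrites an existing target and raises FileNotFoundError
    # if the template is missing.
    shutil.copyfile(source, target)
    return target
```

For example, select_template("b", "ml") would produce part_b.py from part_b_ml.py in the current directory.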

Each file can be executed by running:

    spark-submit --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 part_xxx.py

To suppress the Spark logs, redirect stderr to /dev/null:

    spark-submit --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 part_xxx.py 2> /dev/null

Make sure the given dataset is in the directory from which you run the code. Arranging your files the same way as this repository is recommended.
The --packages argument for graphframes is not required for parts B and D, but you do not need to remove it for those parts.
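The run instructions above can also be wrapped in a small Python launcher. This is a sketch under stated assumptions: the function names build_submit_command and run_part are hypothetical, the dataset directory is assumed to be named dataset, and spark-submit is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path

# Package coordinate copied from the spark-submit command in this README.
GRAPHFRAMES_PACKAGE = "graphframes:graphframes:0.7.0-spark2.4-s_2.11"


def build_submit_command(script):
    """Return the spark-submit argument list for one part file."""
    return ["spark-submit", "--packages", GRAPHFRAMES_PACKAGE, script]


def run_part(script, quiet=True, dataset_dir="dataset"):
    """Run one part and return its exit code (hypothetical helper).

    With quiet=True, stderr is discarded, which mirrors `2> /dev/null`.
    """
    # Fail early if the dataset is not next to the script being run.
    if not Path(dataset_dir).is_dir():
        raise FileNotFoundError(f"dataset directory not found: {dataset_dir}")
    stderr = subprocess.DEVNULL if quiet else None
    result = subprocess.run(build_submit_command(script), stderr=stderr)
    return result.returncode
```

Calling run_part("part_a.py") from the repository root would then run part A with the Spark logs hidden; pass quiet=False to see them when debugging.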

Repository contents:
dataset
Dockerfile
README.md
part_a.py
part_b.py
part_b_ml.py
part_b_mllib.py
part_c.py
part_d.py
part_d_ml.py
part_d_mllib.py
parta.sh
partb.sh
partc.sh
partd.sh
pyspark_graph_shell.sh