- Parts B and D (MLLib exercises) can be solved using either the
Dataframe-based API (pyspark.ml) or the RDD-based API (pyspark.mllib).
The corresponding templates for each have the suffix
_mland_mllib. Make sure you rename the python files corresopnding to parts B and D topart_b.pyandpart_d.pyrespectively before submitting them.
- Each file can be executed by running
spark-submit --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 part_xxx.py - You can alternatively run the following to get rid of spark logs
spark-submit --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 part_xxx.py 2> /dev/null - Make sure that you have the given dataset in the directory you are running the given code from. The structure this repository is arranged in is recommended.
- While the extra argument for graphframes is not required for part b and part d, it is not necessary to remove it these parts