5 changes: 3 additions & 2 deletions Lab 2 - RDD, DataFrame, ML pipeline, and parallelization.md
@@ -530,12 +530,13 @@ Starting from this lab, you need to use *as many DataFrame functions as possible

Load the Aug95 NASA access log data in Lab 1 and create a DataFrame with FIVE columns by **specifying** the schema according to the description in the downloaded html file. Use this DataFrame for the following questions.

-2. Find out the number of **unique** hosts in total (i.e. in August 1995)?
-3. Find out the most frequent visitor, i.e. the host with the largest number of visits.
+2. Find out the number of **unique** hosts in total (i.e. in August 1995)? [Answer: 75060 unique hosts]
+3. Find out the most frequent visitor, i.e. the host with the largest number of visits. [Answer: edams.ksc.nasa.gov]
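
Questions 2 and 3 reduce to a distinct count and a grouped count over the host column. The following is a minimal sketch in plain Python rather than Spark, so that it runs without a cluster. It assumes the log lines follow the `host - - [timestamp] "request" status bytes` layout; the `parse_line` helper and the sample lines are hypothetical and only illustrate the logic, not the answers reported above.

```python
import re
from collections import Counter

# Assumed layout: host - - [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(.*)" (\d{3}) (\S+)$'
)


def parse_line(line):
    """Split one log line into five fields, or return None if it is malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    host, timestamp, request, status, size = match.groups()
    # The bytes field is "-" when the reply carried no content.
    return host, timestamp, request, int(status), 0 if size == "-" else int(size)


# Made-up sample lines for illustration only (not taken from the data set).
SAMPLE = [
    'a.example.com - - [01/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1839',
    'b.example.com - - [01/Aug/1995:00:00:07 -0400] "GET /images/logo.gif HTTP/1.0" 304 -',
    'a.example.com - - [01/Aug/1995:00:00:09 -0400] "GET /history/ HTTP/1.0" 200 4085',
]

# Keep only the lines that parsed into five fields.
rows = [row for row in map(parse_line, SAMPLE) if row is not None]

# Count visits per host: the number of keys is the unique-host count,
# and the most common key is the most frequent visitor.
visits = Counter(row[0] for row in rows)
unique_hosts = len(visits)
top_host, top_count = visits.most_common(1)[0]
print(unique_hosts, top_host, top_count)
```

With a Spark DataFrame `df` that has a `host` column (an assumed name), the same two steps would typically be `df.select("host").distinct().count()` and `df.groupBy("host").count().orderBy("count", ascending=False).first()`.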

### Linear regression for advertising

4. Add regularization to the [linear regression for advertising example](#example-linear-regression-for-advertising) and evaluate the prediction performance against the performance without any regularization. Study at least three different regularization settings.
+[Answer: Increasing the regularisation parameter (0.1, 0.2, 0.5) increases each of the predictions each time. I'm not sure what else to put here without copy-pasting it all in.]
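
One way to make such a comparison concrete is to tabulate the fitted coefficient and the error for each regularisation setting next to the unregularised fit. The sketch below is plain Python, not the Spark pipeline from the lab, and it assumes an L2 (ridge) penalty on a single feature with an unpenalised intercept. The data points and the `fit_ridge` and `rmse` helpers are made up for illustration and are not the advertising data.

```python
import math

# Made-up (x, y) pairs for illustration only; not the advertising data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]


def fit_ridge(xs, ys, reg_param):
    """Minimise (1/2n) * sum((y - w*x - b)^2) + (reg_param/2) * w^2.

    The intercept b is not penalised, so the closed form works on centred data.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / (sxx + n * reg_param)
    intercept = y_mean - slope * x_mean
    return slope, intercept


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line on the given points."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# reg_param = 0.0 is the unregularised baseline; the rest mirror the
# three settings mentioned in the answer.
results = []
for reg_param in [0.0, 0.1, 0.2, 0.5]:
    slope, intercept = fit_ridge(xs, ys, reg_param)
    error = rmse(xs, ys, slope, intercept)
    results.append((reg_param, slope, error))
    print(f"reg_param={reg_param:.1f}  slope={slope:.4f}  train RMSE={error:.4f}")
```

In this sketch the slope shrinks and the training RMSE grows as the penalty increases. For the lab question itself, the comparison would normally be made on held-out test data, where a small amount of regularisation may or may not help.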

### Logistic regression for document classification
