diff --git a/LICENSE.md b/LICENSE.md
new file mode 100644
index 0000000..2c81482
--- /dev/null
+++ b/LICENSE.md
@@ -0,0 +1,9 @@
+The tutorials in this folder are licensed under the MIT "Expat" License:
+
+Copyright (c) 2018: Huda Nassar
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/intro-to-julia-for-data-science/1. Julia for Data Science - Data.ipynb b/intro-to-julia-for-data-science/1. Julia for Data Science - Data.ipynb
index 4530ec7..6eb366a 100644
--- a/intro-to-julia-for-data-science/1. Julia for Data Science - Data.ipynb
+++ b/intro-to-julia-for-data-science/1. Julia for Data Science - Data.ipynb
@@ -7,6 +7,8 @@
"# Julia for Data Science\n",
"Prepared by [@nassarhuda](https://github.com/nassarhuda)! 😃\n",
"\n",
+ "`Last updated on 06/Sep/2018` \n",
+ "\n",
"In this tutorial, we will discuss why *Julia* is the tool you want to use for your data science applications.\n",
"\n",
"We will cover the following:\n",
@@ -34,9 +36,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"P = download(\"https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv\",\"programminglanguages.csv\")"
@@ -53,7 +53,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
- "collapsed": true
+ "scrolled": true
},
"outputs": [],
"source": [
@@ -64,31 +64,54 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "And there's the *.csv file we downloaded!\n",
- "\n",
- "By default, `readcsv` will fill an array with the data stored in the input .csv file. If we set the keyword argument `header` to `true`, we'll get a second output array."
+ "Add the CSV package to Julia using `add()`. `CSV.read()` will automatically define headers from the .csv file if we set the `header` argument as `true`.\n",
+ "We could also use the `DelimitedFiles` package and its `readdlm()` function as shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "collapsed": true
+ "scrolled": true
},
"outputs": [],
"source": [
- "P,H = readcsv(\"programminglanguages.csv\",header=true)"
+ "# using Pkg\n",
+ "# Pkg.add(\"CSV\") # for CSV.read()\n",
+ "# Pkg.add(\"DelimitedFiles\") # for readdlm"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "collapsed": true
+ "scrolled": true
},
"outputs": [],
"source": [
- "P"
+ "# using CSV\n",
+ "# P = CSV.read(\"programminglanguages.csv\",header=true)\n",
+ "# or\n",
+ "using DelimitedFiles\n",
+ "P,H= readdlm(\"programminglanguages.csv\",',',header=true)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "P # stores the dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "H # stores the header names"
]
},
{
@@ -102,9 +125,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"function language_created_year(P,language::String)\n",
@@ -116,9 +137,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"language_created_year(P,\"Julia\")"
@@ -127,9 +146,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"language_created_year(P,\"julia\")"
@@ -145,9 +162,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"function language_created_year_v2(P,language::String)\n",
@@ -163,7 +178,7 @@
"source": [
"**Reading and writing to files is really easy in Julia.**
\n",
"\n",
- "You can use different delimiters with the function `readdlm` (`readcsv` is just an instance of `readdlm`).
\n",
+ "You can use different delimiters with the function `readdlm` (`readcsv` is just an instance of `readdlm`) available with the `DelimitedFiles` package.
\n",
"\n",
"To write to files, we can use `writecsv` or `writedlm`.
\n",
"\n",
@@ -173,9 +188,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"writedlm(\"programming_languages_data.txt\", P, '-')"
@@ -191,9 +204,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
";head -10 programming_languages_data.txt"
@@ -209,9 +220,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"P_new_delim = readdlm(\"programming_languages_data.txt\", '-');\n",
@@ -231,9 +240,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"dict = Dict{Integer,Vector{String}}()"
@@ -251,9 +258,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"dict2 = Dict()"
@@ -271,9 +276,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"for i = 1:size(P,1)\n",
@@ -297,9 +300,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"dict[2003]"
@@ -319,11 +320,10 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
+ "# Pkg.add(\"DataFrames\")\n",
"using DataFrames\n",
"df = DataFrame(year = P[:,1], language = P[:,2])"
]
@@ -342,9 +342,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"df[:year]"
@@ -354,93 +352,112 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "### RDatasets\n",
+ "**`DataFrames` provides some handy features when dealing with data**\n",
"\n",
- "We can use RDatasets to play around with pre-existing datasets"
+ "First, it uses the \"missing\" type."
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "using RDatasets\n",
- "iris = dataset(\"datasets\", \"iris\")"
+ "a = missing\n",
+ "typeof(a)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Note that data loaded with `dataset` is stored as a DataFrame. 😃"
+ "Let's see what happens when we try to add a \"missing\" type to a number."
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "typeof(iris) "
+ "a + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "**`DataFrames` provides some handy features when dealing with data**\n",
+ "`DataFrames` provides the `describe` can give you quick statistics about each column in your dataframe "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "describe(df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### RDatasets\n",
"\n",
- "First, it introduces the \"missing\" type"
+ "We can use RDatasets to play around with pre-existing datasets"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "a = NA\n",
- "typeof(a)"
+ "# using Pkg\n",
+ "# Pkg.add(\"RData\")\n",
+ "# Pkg.add(\"RDatasets\")\n",
+ "# Pkg.add(\"RCall\") # should have R installed to build RCall"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "using RDatasets\n",
+ "# Pkg.instantiate()\n",
+ "iris = dataset(\"datasets\", \"iris\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Let's see what happens when we try to add a \"missing\" type to a number"
+ "Note that data loaded with `dataset` is stored as a DataFrame. 😃"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "a + 1"
+ "typeof(iris) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "`DataFrames` provides the `describe` can give you quick statistics about each column in your dataframe "
+ "The summary we get from `describe` on `iris` gives us a lot more information than the summary on `df`!"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"describe(iris)"
@@ -450,28 +467,38 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "You can create your own dataframe quickly as follows"
+ "### `DataArrays`\n",
+ "\n",
+ "##### Note: `DataArrays` is [no longer available](https://github.com/JuliaStats/JuliaStats.github.io/pull/6) from `v1.0.0`. Alternate ways to use arrays with missing values are described [here](https://docs.julialang.org/en/v1/manual/missing/#Arrays-With-Missing-Values-1).\n",
+ "\n",
+ "You can create `DataArray`s as follows"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
+ "Pkg.add(\"DataArrays\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "using DataArrays\n",
"foods = @data([\"apple\", \"cucumber\", \"tomato\", \"banana\"])\n",
- "calories = @data([NA,47,22,105])\n",
+ "calories = @data([missing,47,22,105])\n",
"typeof(calories)"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"mean(calories)"
@@ -481,18 +508,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "NA ruins everything! 😑"
+ "Missing values ruin everything! 😑\n",
+ "\n",
+ "Luckily we can ignore them with `skipmissing`!"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "mean(dropna(calories))"
+ "mean(skipmissing(calories))"
]
},
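+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For Julia `v1.0.0` and later, where `DataArrays` is unavailable, a minimal sketch of the same idea uses a plain `Vector` whose element type allows `missing` (the name `calories_v1` below is just illustrative):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# sketch: a standard-library Vector that allows missing entries\n",
+ "calories_v1 = Union{Int,Missing}[missing, 47, 22, 105]\n",
+ "using Statistics # mean lives in the Statistics standard library on v1.0\n",
+ "mean(skipmissing(calories_v1)) # skip the missing entry when averaging"
+ ]
+ },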
{
@@ -509,9 +536,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"😑 = 0 # expressionless\n",
@@ -523,7 +548,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "*Back to NA's*\n",
+ "*Back to missing values*\n",
"\n",
"In fact, `describe' will drop these values too"
]
@@ -531,9 +556,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"describe(calories)"
@@ -551,9 +574,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"newcalories = convert(Vector,calories)"
@@ -569,20 +590,16 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "newcalories = convert(Vector,calories,0) # i.e. replace every NA with the value 0"
+ "newcalories = convert(Vector,calories,0) # i.e. replace every missing with the value 0"
]
},
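+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "With the standard-library `missing` approach (no `DataArrays`), the same replacement can be sketched with `coalesce`, reusing the illustrative `calories_v1` vector from above:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "coalesce.(calories_v1, 0) # broadcast coalesce: replace every missing with the value 0"
+ ]
+ },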
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"prices = @data([0.85,1.6,0.8,0.6,])"
@@ -591,9 +608,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"dataframe_calories = DataFrame(item=foods,calories=calories)"
@@ -602,9 +617,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"dataframe_prices = DataFrame(item=foods,price=prices)"
@@ -620,9 +633,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"DF = join(dataframe_calories,dataframe_prices,on=:item)"
@@ -638,11 +649,11 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
+ "# Pkg.add(\"ImageMagick\")\n",
+ "# Pkg.add(\"FileIO\")\n",
"using FileIO\n",
"julialogo = download(\"https://avatars0.githubusercontent.com/u/743164?s=200&v=4\",\"julialogo.png\")"
]
@@ -657,9 +668,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
";ls"
@@ -675,9 +684,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"X1 = load(\"julialogo.png\")"
@@ -693,9 +700,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"@show typeof(X1);\n",
@@ -719,20 +724,17 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
+ "# Pkg.add(\"MAT\")\n",
"using MAT"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"A = rand(5,5)\n",
@@ -751,9 +753,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"newfile = matopen(\"densematrix.mat\")\n",
@@ -763,9 +763,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"names(newfile)"
@@ -774,35 +772,24 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"close(newfile)"
]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": []
}
],
"metadata": {
"kernelspec": {
- "display_name": "Julia 0.6.0",
+ "display_name": "Julia 1.1.0-DEV",
"language": "julia",
- "name": "julia-0.6"
+ "name": "julia-1.1"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
- "version": "0.6.0"
+ "version": "1.0.0"
}
},
"nbformat": 4,
diff --git a/intro-to-julia-for-data-science/2. Julia for Data Science - Algorithms.ipynb b/intro-to-julia-for-data-science/2. Julia for Data Science - Algorithms.ipynb
index 8d0764d..d9e34ca 100644
--- a/intro-to-julia-for-data-science/2. Julia for Data Science - Algorithms.ipynb
+++ b/intro-to-julia-for-data-science/2. Julia for Data Science - Algorithms.ipynb
@@ -17,12 +17,12 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "using DataFrames"
+ "using DataFrames\n",
+ "using CSV\n",
+ "using Pkg # to add other packages if needed"
]
},
{
@@ -39,13 +39,11 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"download(\"http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv\",\"houses.csv\")\n",
- "houses = readtable(\"houses.csv\")"
+ "houses = CSV.read(\"houses.csv\")"
]
},
{
@@ -58,13 +56,12 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
+ "# Pkg.add(\"Plots\")\n",
"using Plots\n",
- "pyplot(size=(500,500),leg=false)"
+ "plot(size=(500,500),leg=false)"
]
},
{
@@ -77,13 +74,13 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"x = houses[:sq__ft]\n",
+ "# x = houses[7] # equivalent, useful if file has no header\n",
"y = houses[:price]\n",
+ "# y = houses[10] # equivalent\n",
"scatter(x,y,markersize=3)"
]
},
@@ -101,12 +98,10 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "filter_houses = houses[houses[:sq__ft].>0,:]\n",
+ "filter_houses = houses[houses[:sq__ft].>0,:] # dot broadcasting\n",
"x = filter_houses[:sq__ft]\n",
"y = filter_houses[:price]\n",
"scatter(x,y)"
@@ -123,29 +118,26 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We can filter a `DataFrame` by feature value too, using the `by` function."
+ "We can filter a `DataFrame` by feature value too, using the `by` function. `mean()` has been moved into the `Statistics` module in the standard library; you may need to first enter `using Statistics` to start using them."
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "by(filter_houses,:_type,size)"
+ "by(filter_houses,:type,size)"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "by(filter_houses,:_type,filter_houses->mean(filter_houses[:price]))"
+ "using Statistics\n",
+ "by(filter_houses,:type,filter_houses->mean(filter_houses[:price])) "
]
},
{
@@ -160,12 +152,11 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "#Pkg.add(\"Clustering\")\n",
+ "# using Pkg\n",
+ "# Pkg.add(\"Clustering\")\n",
"using Clustering"
]
},
@@ -173,37 +164,75 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Let's store the features `:latitude` and `:longitude` in an array `X` that we will pass to `kmeans`."
+ "Let us see how `Clustering` works with a generic example first."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# make a random dataset with 1000 points\n",
+ "# each point is a 5-dimensional vector\n",
+ "J = rand(5, 1000)\n",
+ "R = kmeans(J, 20; maxiter=200, display=:iter) \n",
+ "# performs K-means over X, trying to group them into 20 clusters\n",
+ "# set maximum number of iterations to 200\n",
+ "# set display to :iter, so it shows progressive info at each iteration"
+ ]
+ },
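+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick sketch of how to inspect the result: `Clustering.jl` exposes the fitted centers and the per-point assignments (accessor names assumed from the installed version of the package)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "R.centers # 5x20 matrix: one column per cluster center\n",
+ "assignments(R) # cluster index assigned to each of the 1000 points\n",
+ "counts(R) # number of points assigned to each cluster"
+ ]
+ },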
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, let's get back to the problem in hand and see how this can be applied over there.\n",
+ "Let's store the features `:latitude` and `:longitude` in an array `X` that we will pass to `kmeans`. First we add data for `:latitude` and `:longitude` to a new `DataFrame` called `X`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X = filter_houses[[:latitude,:longitude]]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "and then we convert `X` to an `Array` via `X = convert(Array, X)`. This will turn `X` into an `Array`."
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "X = filter_houses[[:latitude,:longitude]]\n",
- "X = Array(X)"
+ "#X = Array{Float64}(X)\n",
+ "X = convert(Array,X)\n",
+ "#X = convert(Array{Float64},X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Each feature is stored as a row of `X`, but we can transpose to make these features columns of `X`."
+ "We now take the transpose of `X` using the `transpose()` function. A transpose is required since `kmeans()` function takes each row as a `feature`, and each column a `data point`."
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "X = X'"
+ "X = transpose(X)\n",
+ "#X = X' # also does the same thing\n",
+ "X"
]
},
{
@@ -218,30 +247,29 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "k = length(unique(filter_houses[:zip])) "
+ "k = length(unique(filter_houses[:zip])) \n",
+ "# there should be atleast 2 distinct features (k>=2) to group the data points\n",
+ "println(\"unique zip codes are \",k)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "We can use the `kmeans` function to do kmeans clustering!"
+ "Now, we can use the `kmeans()` function to do kmeans clustering!"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "C = kmeans(X,k) # try changing k"
+ "using Clustering\n",
+ "C = kmeans(X, k)"
]
},
{
@@ -254,13 +282,10 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "df = DataFrame(cluster = C.assignments,city = filter_houses[:city],\n",
- " latitude = filter_houses[:latitude],longitude = filter_houses[:longitude],zip = filter_houses[:zip])"
+ "df = DataFrame(cluster=C.assignments,city=filter_houses[:city],latitude=filter_houses[:latitude],longitude=filter_houses[:longitude],zip=filter_houses[:zip])"
]
},
{
@@ -273,9 +298,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"clusters_figure = plot()\n",
@@ -301,9 +324,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"unique_zips = unique(filter_houses[:zip])\n",
@@ -330,9 +351,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"plot(clusters_figure,zips_figure,layout=(2, 1))"
@@ -357,11 +376,10 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
+ "# Pkg.add(\"NearestNeighbors\")\n",
"using NearestNeighbors"
]
},
@@ -375,9 +393,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"knearest = 10\n",
@@ -395,9 +411,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"kdtree = KDTree(X)\n",
@@ -414,9 +428,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"x = filter_houses[:latitude];\n",
@@ -434,9 +446,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"x = filter_houses[idxs,:latitude];\n",
@@ -456,9 +466,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"cities = filter_houses[idxs,:city]"
@@ -478,9 +486,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"F = filter_houses[[:sq__ft,:price]]\n",
@@ -497,9 +503,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"scatter(F[1,:],F[2,:])\n",
@@ -517,9 +521,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"# Pkg.add(\"MultivariateStats\")\n",
@@ -536,9 +538,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"M = fit(PCA, F)"
@@ -565,9 +565,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"y = transform(M, F)"
@@ -583,9 +581,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"Xr = reconstruct(M, y)"
@@ -603,9 +599,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"scatter(F[1,:],F[2,:])\n",
@@ -626,9 +620,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"using Flux, Flux.Data.MNIST\n",
@@ -646,9 +638,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"imgs = MNIST.images()\n",
@@ -665,9 +655,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"typeof(imgs[3])"
@@ -685,9 +673,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"fpt_imgs = float.(imgs)"
@@ -703,9 +689,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"fpt_imgs[3]"
@@ -723,9 +707,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"unraveled_fpt_imgs = reshape.(fpt_imgs, :);\n",
@@ -742,9 +724,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"Vector"
@@ -760,9 +740,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"unraveled_fpt_imgs[3]"
@@ -780,9 +758,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"X = hcat(unraveled_fpt_imgs...)"
@@ -802,9 +778,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"onefigure = X[:,2]"
@@ -820,9 +794,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"t1 = reshape(onefigure,28,28)"
@@ -838,9 +810,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"using Images"
@@ -849,9 +819,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"colorview(Gray, t1)"
@@ -869,9 +837,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"labels = MNIST.labels() # the true labels"
@@ -887,9 +853,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"Y = onehotbatch(labels, 0:9)"
@@ -912,9 +876,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"m = Chain(\n",
@@ -933,9 +895,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"loss(x, y) = Flux.crossentropy(m(x), y)\n",
@@ -952,12 +912,10 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "dataset = repeated((X, Y), 200)\n",
+ "datasetx = repeated((X, Y), 200)\n",
"evalcb = () -> @show(loss(X, Y))\n",
"opt = ADAM(Flux.params(m))"
]
@@ -974,9 +932,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"?Flux.train!"
@@ -992,12 +948,10 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "Flux.train!(loss, dataset, opt, cb = throttle(evalcb, 10))\n",
+ "Flux.train!(loss, datasetx, opt, cb = throttle(evalcb, 10))\n",
"\n",
"accuracy(X, Y)"
]
@@ -1012,9 +966,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"tX = hcat(float.(reshape.(MNIST.images(:test), :))...)"
@@ -1030,9 +982,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"test_image = m(tX[:,1])"
@@ -1041,12 +991,10 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "indmax(test_image) - 1"
+ "maximum(test_image) - 1 "
]
},
{
@@ -1061,9 +1009,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"using Images\n",
@@ -1090,13 +1036,11 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"xvals = repeat(1:0.5:10,inner=2)\n",
- "yvals = 3+xvals+2*rand(length(xvals))-1\n",
+ "yvals = xvals+(2*rand(length(xvals)).-1)\n",
"scatter(xvals,yvals,color=:black,leg=false)"
]
},
@@ -1112,9 +1056,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"function find_best_fit(xvals,yvals)\n",
@@ -1141,21 +1083,17 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"a,b = find_best_fit(xvals,yvals)\n",
- "ynew = a*xvals + b"
+ "ynew = a.*xvals.+b"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"plot!(xvals,ynew)"
@@ -1171,22 +1109,18 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"xvals = 1:100000;\n",
"xvals = repeat(xvals,inner=3);\n",
- "yvals = 3+xvals+2*rand(length(xvals))-1;"
+ "yvals = xvals+(2*rand(length(xvals)).-1);"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"@show size(xvals)\n",
@@ -1203,9 +1137,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"@time a,b = find_best_fit(xvals,yvals)"
@@ -1221,9 +1153,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"using PyCall\n",
@@ -1233,9 +1163,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"py\"\"\"\n",
@@ -1255,9 +1183,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"find_best_fit_python = py\"find_best_fit_python\""
@@ -1266,9 +1192,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"xpy = PyObject(xvals)\n",
@@ -1286,9 +1210,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"using BenchmarkTools"
@@ -1297,9 +1219,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"@btime a,b = find_best_fit_python(xvals,yvals)"
@@ -1308,35 +1228,24 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"@btime a,b = find_best_fit(xvals,yvals)"
]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": []
}
],
"metadata": {
"kernelspec": {
- "display_name": "Julia 0.6.0",
+ "display_name": "Julia 1.1.0-DEV",
"language": "julia",
- "name": "julia-0.6"
+ "name": "julia-1.1"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
- "version": "0.6.0"
+ "version": "1.0.0"
}
},
"nbformat": 4,
diff --git a/intro-to-julia-for-data-science/3. Julia for Data Science - Plotting.ipynb b/intro-to-julia-for-data-science/3. Julia for Data Science - Plotting.ipynb
index 5242f4d..b37b2b7 100644
--- a/intro-to-julia-for-data-science/3. Julia for Data Science - Plotting.ipynb
+++ b/intro-to-julia-for-data-science/3. Julia for Data Science - Plotting.ipynb
@@ -24,17 +24,37 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
+ "# using Pkg\n",
+ "# Pkg.add(\"LaTeXStrings\")\n",
"using LaTeXStrings\n",
- "using Plots\n",
- "pyplot(leg=false)\n",
+ "# using Plots\n",
+ "# plotly()\n",
+ "using PyPlot # both loads plot()\n",
+ "pyplot()\n",
"x = 1:0.2:4"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's declare some variables that store the functions we want to plot written in LaTex"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "x2 = L\"x^2\"\n",
+ "logx = L\"log(x)\"\n",
+ "sqrtx = L\"\\sqrt{x}\""
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
@@ -45,38 +65,35 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"y1 = sqrt.(x)\n",
"y2 = log.(x)\n",
"y3 = x.^2\n",
"\n",
- "f1 = plot(x,y1)\n",
- "plot!(f1,x,y2) # \"plot!\" means \"plot on the same canvas we just plot on\"\n",
- "plot!(f1,x,y3)"
+ "f1 = plot(x,y1, legend = false)\n",
+ "plot!(f1, x,y2) # \"plot!\" means \"plot on the same canvas we just plotted on\"\n",
+ "plot!(f1, x,y3)\n",
+ "title!(\"Plot $x2 vs. $logx vs. $sqrtx\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Now we can annotate each of these plots! using either text, or latex strings"
+ "Now we can annotate each of these curves using either text, or latex strings"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
- "annotate!(f1,[(x[6],y1[6],text(L\"\\sqrt{x}\",16,:center)),\n",
- " (x[11],y2[11],text(L\"log(x)\",:right,16)),\n",
- " (x[6],y3[6],text(L\"x^2\",16))])"
+ "annotate!(f1,[(x[6],y1[6],text(sqrtx,16,:center)),\n",
+ " (x[11],y2[11],text(logx,:right,16)),\n",
+ " (x[6],y3[6],text(x2,16))])"
]
},
{
@@ -98,9 +115,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"n = 1000\n",
@@ -119,9 +134,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"using DataFrames\n",
@@ -134,9 +147,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"gh = histogram2d(x,y,nbins=20,colorbar=true)\n",
@@ -167,9 +178,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"using StatPlots\n",
@@ -180,9 +189,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"boxplot!([\"Series 1\" \"Series 2\" \"Series 3\" \"Series 4\" \"Series 5\"],y,leg=false,color=:green)"
@@ -200,9 +207,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"some_cities = [\"SACRAMENTO\",\"RANCHO CORDOVA\",\"RIO LINDA\",\"CITRUS HEIGHTS\",\"NORTH HIGHLANDS\",\"ANTELOPE\",\"ELK GROVE\",\"ELVERTA\" ] # try picking pther cities!\n",
@@ -228,9 +233,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"mylayout = @layout([a{0.5h};[b{0.7w} c]])\n",
@@ -253,9 +256,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"using PyPlot\n",
@@ -266,9 +267,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"xkcd()\n",
@@ -288,9 +287,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"xticks([])\n",
@@ -317,9 +314,7 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {
- "collapsed": true
- },
+ "metadata": {},
"outputs": [],
"source": [
"display(fig)"
@@ -333,28 +328,19 @@
"\n",
"https://tinyurl.com/JuliaDataScience"
]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": []
}
],
"metadata": {
"kernelspec": {
- "display_name": "Julia 0.6.0",
+ "display_name": "Julia 1.1.0-DEV",
"language": "julia",
- "name": "julia-0.6"
+ "name": "julia-1.1"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
- "version": "0.6.0"
+ "version": "1.0.0"
}
},
"nbformat": 4,
diff --git a/intro-to-julia-for-data-science/LICENSE.md b/intro-to-julia-for-data-science/LICENSE.md
new file mode 100644
index 0000000..2c81482
--- /dev/null
+++ b/intro-to-julia-for-data-science/LICENSE.md
@@ -0,0 +1,9 @@
+The tutorials in this folder are licensed under the MIT "Expat" License:
+
+Copyright (c) 2018: Huda Nassar
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/intro-to-julia-for-data-science/README.md b/intro-to-julia-for-data-science/README.md
new file mode 100644
index 0000000..5a241c3
--- /dev/null
+++ b/intro-to-julia-for-data-science/README.md
@@ -0,0 +1 @@
+These tutorials were created by Huda Nassar.