Welcome to the first project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this project. You will not need to modify the included code beyond what is requested. Sections that begin with 'Implementation' in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!
In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question X' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.

Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. In addition, Markdown cells can be edited by double-clicking the cell to enter edit mode.
In this project, you will evaluate the performance and predictive power of a model that has been trained and tested on data collected from homes in suburbs of Boston, Massachusetts. A model trained on this data that is seen as a good fit could then be used to make certain predictions about a home — in particular, its monetary value. This model would prove to be invaluable for someone like a real estate agent who could make use of such information on a daily basis.
The dataset for this project originates from the UCI Machine Learning Repository. The Boston housing data was collected in 1978, and each of the 506 entries represents aggregated data about 14 features for homes from various suburbs in Boston, Massachusetts. For the purposes of this project, the following preprocessing steps have been made to the dataset:

- Data points with an 'MEDV' value of 50.0 likely contain missing or censored values and have been removed.
- One data point with an 'RM' value of 8.78 can be considered an outlier and has been removed.
- The features 'RM', 'LSTAT', 'PTRATIO', and 'MEDV' are essential. The remaining non-relevant features have been excluded.
- The feature 'MEDV' has been multiplicatively scaled to account for 35 years of market inflation.

Run the code cell below to load the Boston housing dataset, along with a few of the necessary Python libraries required for this project. You will know the dataset loaded successfully if the size of the dataset is reported.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.cross_validation import ShuffleSplit

# Import supplementary visualizations code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)

# Success
print "Boston housing dataset has {} data points with {} variables each.".format(*data.shape)
In this first section of this project, you will make a cursory investigation about the Boston housing data and provide your observations. Familiarizing yourself with the data through an explorative process is a fundamental practice to help you better understand and justify your results.

Since the main goal of this project is to construct a working model which has the capability of predicting the value of houses, we will need to separate the dataset into features and the target variable. The features, 'RM', 'LSTAT', and 'PTRATIO', give us quantitative information about each data point. The target variable, 'MEDV', will be the variable we seek to predict. These are stored in features and prices, respectively.
For your very first coding implementation, you will calculate descriptive statistics about the Boston housing prices. Since numpy has already been imported for you, use this library to perform the necessary calculations. These statistics will be extremely important later on to analyze various prediction results from the constructed model.
In the code cell below, you will need to implement the following:
- Calculate the minimum, maximum, mean, median, and standard deviation of 'MEDV', which is stored in prices.
- Store each calculation in its respective variable.
# TODO: Minimum price of the data
minimum_price = np.amin(prices)

# TODO: Maximum price of the data
maximum_price = np.amax(prices)

# TODO: Mean price of the data
mean_price = np.mean(prices)

# TODO: Median price of the data
median_price = np.median(prices)

# TODO: Standard deviation of prices of the data
std_price = np.std(prices)

# Show the calculated statistics
print "Statistics for Boston housing dataset:\n"
print "Minimum price: ${:,.2f}".format(minimum_price)
print "Maximum price: ${:,.2f}".format(maximum_price)
print "Mean price: ${:,.2f}".format(mean_price)
print "Median price: ${:,.2f}".format(median_price)
print "Standard deviation of prices: ${:,.2f}".format(std_price)
As a reminder, we are using three features from the Boston housing dataset: 'RM', 'LSTAT', and 'PTRATIO'. For each data point (neighborhood):

- 'RM' is the average number of rooms among homes in the neighborhood.
- 'LSTAT' is the percentage of homeowners in the neighborhood considered "lower class" (working poor).
- 'PTRATIO' is the ratio of students to teachers in primary and secondary schools in the neighborhood.

Using your intuition, for each of the three features above, do you think that an increase in the value of that feature would lead to an increase in the value of 'MEDV' or a decrease in the value of 'MEDV'? Justify your answer for each.
Hint: Would you expect a home that has an 'RM' value of 6 to be worth more or less than a home that has an 'RM' value of 7?
Answer:
RM: A greater number of rooms means the house is larger and can accommodate bigger families, so an increase in RM should lead to an increase in MEDV.

LSTAT: A low percentage of "lower class" homeowners suggests a neighborhood with a higher standard (and cost) of living. Conversely, a high LSTAT would deter wealthier buyers, because like attracts like: affluent buyers tend to live in affluent neighborhoods and form communities around them. So an increase in LSTAT should lead to a decrease in MEDV.

PTRATIO: A higher student-to-teacher ratio suggests fewer teachers per student and weaker schools, which makes a neighborhood less attractive to families. So an increase in PTRATIO should lead to a decrease in MEDV.
In this second section of the project, you will develop the tools and techniques necessary for a model to make a prediction. Being able to make accurate evaluations of each model's performance through the use of these tools and techniques helps to greatly reinforce the confidence in your predictions.

It is difficult to measure the quality of a given model without quantifying its performance over training and testing. This is typically done using some type of performance metric, whether it is through calculating some type of error, the goodness of fit, or some other useful measurement. For this project, you will be calculating the coefficient of determination, R², to quantify your model's performance. The coefficient of determination for a model is a useful statistic in regression analysis, as it often describes how "good" that model is at making predictions.

The values for R² range from 0 to 1, which captures the percentage of squared correlation between the predicted and actual values of the target variable. A model with an R² of 0 is no better than a model that always predicts the mean of the target variable, whereas a model with an R² of 1 perfectly predicts the target variable. Any value between 0 and 1 indicates what percentage of the target variable's variation can, using this model, be explained by the features. A model can be given a negative R² as well, which indicates that the model is arbitrarily worse than one that always predicts the mean of the target variable.

For the performance_metric function in the code cell below, you will need to implement the following:

- Use r2_score from sklearn.metrics to perform a performance calculation between y_true and y_predict.
- Assign the performance score to the score variable.

# TODO: Import 'r2_score'
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between
        true and predicted values based on the metric chosen. """

    # TODO: Calculate the performance score between 'y_true' and 'y_predict'
    score = r2_score(y_true, y_predict)

    # Return the score
    return score
Assume that a dataset contains five data points and a model made the following predictions for the target variable:

| True Value | Prediction |
|---|---|
| 3.0 | 2.5 |
| -0.5 | 0.0 |
| 2.0 | 2.1 |
| 7.0 | 7.8 |
| 4.2 | 5.3 |
Would you consider this model to have successfully captured the variation of the target variable? Why or why not?
Run the code cell below to use the performance_metric function and calculate this model's coefficient of determination.

# Calculate the performance of this model
score = performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
print "Model has a coefficient of determination, R^2, of {:.3f}.".format(score)

Answer: An R² score closer to 1 indicates an accurate prediction; since this model's score of 0.923 is quite close to 1, its performance seems to be quite good.
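As a sanity check, the reported score can be recomputed by hand straight from the R² definition. The sketch below (illustrative only, using the same five data points) derives the score from the residual and total sums of squares:

# Illustrative sketch: recompute R^2 manually as 1 - SS_res / SS_tot
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.2])
y_pred = np.array([2.5, 0.0, 2.1, 7.8, 5.3])

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares (mean predictor)
print("Manual R^2: {:.3f}".format(1 - ss_res / ss_tot))  # 0.923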
Your next implementation requires that you take the Boston housing dataset and split the data into training and testing subsets. Typically, the data is also shuffled into a random order when creating the training and testing subsets to remove any bias in the ordering of the dataset.

For the code cell below, you will need to implement the following:

- Use train_test_split from sklearn.cross_validation to shuffle and split the features and prices data into training and testing sets.
- Set a random_state for train_test_split to a value of your choice. This ensures results are consistent.
- Assign the train and testing splits to X_train, X_test, y_train, and y_test.

# TODO: Import 'train_test_split'
from sklearn.cross_validation import train_test_split

# TODO: Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=0)

# Success
print "Training and testing split was successful."
What is the benefit to splitting a dataset into some ratio of training and testing subsets for a learning algorithm?

Hint: What could go wrong with not having a way to test your model?

Answer: A learning algorithm needs to do two things: learn as much as it can from the given data, and generalize well to unseen data. By splitting a dataset into training and testing subsets, we can analyze the performance of our model. More training points mean more data for our learning algorithm to learn from, but fewer unseen points, so we won't have a good idea of how well the algorithm can generalize. The inverse is also true.

If we don't have a way to test our model, then there's no way to analyze the model for bias or variance. The only way to know is by testing our algorithm against unseen data points and analyzing its performance on them.
In this third section of the project, you'll take a look at several models' learning and testing performances on various subsets of training data. Additionally, you'll investigate one particular algorithm with an increasing 'max_depth' parameter on the full training set to observe how model complexity affects performance. Graphing your model's performance based on varying criteria can be beneficial in the analysis process, such as visualizing behavior that may not have been apparent from the results alone.
The following code cell produces four graphs for a decision tree model with different maximum depths. Each graph visualizes the learning curves of the model for both training and testing as the size of the training set is increased. Note that the shaded region of a learning curve denotes the uncertainty of that curve (measured as the standard deviation). The model is scored on both the training and testing sets using R², the coefficient of determination.

Run the code cell below and use these graphs to answer the following question.

# Produce learning curves for varying training set sizes and maximum depths
vs.ModelLearning(features, prices)

Choose one of the graphs above and state the maximum depth for the model. What happens to the score of the training curve as more training points are added? What about the testing curve? Would having more training points benefit the model?

Hint: Are the learning curves converging to particular scores?

Answer:
Since both curves plateau once the number of training points passes roughly 350, the model doesn't seem to be getting any better. So adding more points won't make the model significantly better at predicting unseen data.
The following code cell produces a graph for a decision tree model that has been trained and validated on the training data using different maximum depths. The graph produces two complexity curves — one for training and one for validation. Similar to the learning curves, the shaded regions of both the complexity curves denote the uncertainty in those curves, and the model is scored on both the training and validation sets using the performance_metric function.

Run the code cell below and use this graph to answer the following two questions.

vs.ModelComplexity(X_train, y_train)
When the model is trained with a maximum depth of 1, does the model suffer from high bias or from high variance? How about when the model is trained with a maximum depth of 10? What visual cues in the graph justify your conclusions?

Hint: How do you know when a model is suffering from high bias or high variance?

Answer: When the model is trained with a maximum depth of 1, it suffers from high bias. When it's trained with a maximum depth of 10, it suffers from high variance.

High bias occurs when the model is too simple to capture the underlying structure of the data (underfitting). Visually, at a maximum depth of 1, both the training and validation curves sit at a low R² score.

As we increase the maximum depth, the training and validation curves move further and further apart, which indicates high variance: the model is overfitting, memorizing details of the training data rather than learning patterns that generalize. More training data might help the model generalize better, so that the two curves converge at a relatively high score.
Which maximum depth do you think results in a model that best generalizes to unseen data? What intuition led you to this answer?

Answer: A maximum depth of 3 seems to generalize best to unseen data. As we can see from the graph, the score of the validation curve rises together with the training curve up to a maximum depth of 3. Beyond that point, the validation score no longer increases in the same proportion as the training score, which means the variance is increasing.
In this final section of the project, you will construct a model and make a prediction on the client's feature set using an optimized model from fit_model.

What is the grid search technique, and how can it be applied to optimize a learning algorithm?

Answer: For an algorithm that accepts a set of parameters, grid search builds different models by setting the parameters to different combinations of values, specified in a grid, and then cross-validates each model to decide which combination of parameters gives the best performance.

It can be used to optimize a learning algorithm because it returns the best of a family of models: the one that gives the most accurate predictions on the held-out validation data.
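As an illustration of the idea (separate from the project code, and assuming the modern sklearn.model_selection API with a toy dataset), a minimal grid search might look like this:

# Minimal, illustrative grid search sketch (not part of the project implementation)
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X_toy, y_toy = make_regression(n_samples=200, n_features=3, random_state=0)

# Every combination in the parameter grid is cross-validated; the best one is kept
grid_sketch = GridSearchCV(DecisionTreeRegressor(random_state=0),
                           param_grid={'max_depth': [2, 4, 6]},
                           cv=5)
grid_sketch.fit(X_toy, y_toy)
print(grid_sketch.best_params_)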
What is the k-fold cross-validation training technique? What benefit does this technique provide for grid search when optimizing a model?

Hint: Much like the reasoning behind having a testing set, what could go wrong with using grid search without a cross-validated set?

Answer: The k-fold technique splits our dataset into k subsets. One of the k subsets is used as the validation data, and the remaining (k-1) subsets are used as the training data. The learning algorithm is trained on the training subsets and then scored on the held-out subset. This process is repeated k times, each time with a different subset held out, and the average performance is calculated from the results of the k experiments.

In this way, every data point is used for training, and every data point is also used for validation exactly once. The process takes longer, but it improves the reliability of the grid search.

This technique helps eliminate bias in the evaluation, because the score takes the entire dataset into account. It aids grid search in picking the genuinely best parameter combination, rather than one that happens to score well on a single lucky split.
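To make the mechanics concrete, here is a small sketch of k-fold splitting (assuming the modern sklearn.model_selection API rather than the older sklearn.cross_validation module used elsewhere in this notebook):

# Illustrative k-fold sketch: each point is held out for validation exactly once
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, val_idx in kf.split(X_toy):
    print("train: {} validate: {}".format(train_idx, val_idx))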
Your final implementation requires that you bring everything together and train a model using the decision tree algorithm. To ensure that you are producing an optimized model, you will train the model using the grid search technique to optimize the 'max_depth' parameter for the decision tree. The 'max_depth' parameter can be thought of as how many questions the decision tree algorithm is allowed to ask about the data before making a prediction. Decision trees are part of a class of algorithms called supervised learning algorithms.

In addition, you will find your implementation is using ShuffleSplit() for an alternative form of cross-validation (see the 'cv_sets' variable). While it is not the K-Fold cross-validation technique you describe in Question 8, this type of cross-validation technique is just as useful! The ShuffleSplit() implementation below will create 10 ('n_iter') shuffled sets, and for each shuffle, 20% ('test_size') of the data will be used as the validation set; a short illustrative sketch follows. While you're working on your implementation, think about the contrasts and similarities it has to the K-fold cross-validation technique.
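For illustration, the following sketch shows what ShuffleSplit produces (again assuming the modern sklearn.model_selection API; the project code below uses the older sklearn.cross_validation signature):

# Illustrative ShuffleSplit sketch: 10 independent shuffles, 20% held out each time
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_toy = np.arange(20).reshape(-1, 1)
ss = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)

# Unlike k-fold, the validation sets are drawn independently and may overlap
for train_idx, val_idx in ss.split(X_toy):
    print("validation indices: {}".format(val_idx))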
For the fit_model function in the code cell below, you will need to implement the following:
- Use DecisionTreeRegressor from sklearn.tree to create a decision tree regressor object, and assign it to the 'regressor' variable.
- Create a dictionary for 'max_depth' with the values from 1 to 10, and assign this to the 'params' variable.
- Use make_scorer from sklearn.metrics to create a scoring function object, passing the performance_metric function as a parameter, and assign it to the 'scoring_fnc' variable.
- Use GridSearchCV from sklearn.grid_search to create a grid search object, passing the variables 'regressor', 'params', 'scoring_fnc', and 'cv_sets' as parameters, and assign the GridSearchCV object to the 'grid' variable.

# TODO: Import 'make_scorer', 'DecisionTreeRegressor', and 'GridSearchCV'
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeRegressor
from sklearn.grid_search import GridSearchCV


def fit_model(X, y):
    """ Performs grid search over the 'max_depth' parameter for a
        decision tree regressor trained on the input data [X, y]. """

    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(X.shape[0], n_iter = 10, test_size = 0.20, random_state = 0)

    # TODO: Create a decision tree regressor object
    regressor = DecisionTreeRegressor()

    # TODO: Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
    params = {
        'max_depth': range(1, 11)
    }

    # TODO: Transform 'performance_metric' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(performance_metric)

    # TODO: Create the grid search object
    grid = GridSearchCV(regressor, params, scoring=scoring_fnc, cv=cv_sets)

    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)

    # Return the optimal model after fitting the data
    return grid.best_estimator_
Once a model has been trained on a given set of data, it can now be used to make predictions on new sets of input data. In the case of a decision tree regressor, the model has learned what the best questions to ask about the input data are, and can respond with a prediction for the target variable. You can use these predictions to gain information about data where the value of the target variable is unknown — such as data the model was not trained on.

What maximum depth does the optimal model have? How does this result compare to your guess in Question 6?
Run the code block below to fit the decision tree regressor to the training data and produce an optimal model.

# Fit the training data to the model using grid search
reg = fit_model(X_train, y_train)

# Produce the value for 'max_depth'
print "Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth'])

Answer: The model finds an optimal max_depth value of 4, as opposed to the guess made earlier in Question 6, where I guessed that the optimal max_depth value could be 3.
Imagine that you were a real estate agent in the Boston area looking to use this model to help price homes owned by your clients that they wish to sell. You have collected the following information from three of your clients:

| Feature | Client 1 | Client 2 | Client 3 |
|---|---|---|---|
| Total number of rooms in home | 5 rooms | 4 rooms | 8 rooms |
| Neighborhood poverty level (as %) | 17% | 32% | 3% |
| Student-teacher ratio of nearby schools | 15-to-1 | 22-to-1 | 12-to-1 |
What price would you recommend each client sell his/her home at? Do these prices seem reasonable given the values for the respective features?
Hint: Use the statistics you calculated in the Data Exploration section to help justify your response.
Run the code block below to have your optimized model make predictions for each client's home.
# Produce a matrix for client data
client_data = [[5, 17, 15], # Client 1
               [4, 32, 22], # Client 2
               [8, 3, 12]]  # Client 3

# Show predictions
for i, price in enumerate(reg.predict(client_data)):
    print "Predicted selling price for Client {}'s home: ${:,.2f}".format(i+1, price)
Answer: The predicted prices for each of the houses seem perfectly reasonable.

Comparing these prices with the minimum and maximum prices from the Boston housing dataset, we can see that the predicted prices certainly aren't outliers.

The average of these three prices is $507,657.84, which is close to the calculated mean of the entire Boston housing dataset, $454,342.94.

Client 3's house has more rooms than the other clients', a lower poverty level, and the lowest student-teacher ratio of the three, so naturally its predicted value is the highest.

Client 2's house, on the other hand, has the fewest rooms, the highest poverty level, and the largest student-teacher ratio among the three clients, hence its predicted value is the lowest of all.

Client 1's house has more rooms than Client 2's but fewer than Client 3's. It also appears to be in a better neighborhood, with a better student-teacher ratio, than Client 2's, though not as good as Client 3's. So its predicted value falls between the two.
An optimal model is not necessarily a robust model. Sometimes, a model is either too complex or too simple to sufficiently generalize to new data. Sometimes, a model could use a learning algorithm that is not appropriate for the structure of the data given. Other times, the data itself could be too noisy or contain too few samples to allow a model to adequately capture the target variable — i.e., the model is underfitted. Run the code cell below to run the fit_model function ten times with different training and testing sets to see how the prediction for a specific client changes with the data it's trained on.
vs.PredictTrials(features, prices, fit_model, client_data)
In a few sentences, discuss whether the constructed model should or should not be used in a real-world setting.

Hint: Some questions to answer:

Answer: The range in predicted prices is high, so the model needs to be more consistent in its predictions.

We might be ignoring some less obvious but important features, such as employment opportunities in the neighborhood, the age of the house, crime rates, and health-care facilities. The features present in the dataset are good, but they might not be enough to describe a home.

The model is not making consistent predictions; it fluctuates quite a bit, judging from the range in price predictions across trials.

Some of the features might not be applicable in a rural setting; for example, the poverty level might not be as important a factor as it is in urban areas (most people have a simpler way of life in rural areas).
Note: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.
| \n", - " | age | \n", - "workclass | \n", - "education_level | \n", - "education-num | \n", - "marital-status | \n", - "occupation | \n", - "relationship | \n", - "race | \n", - "sex | \n", - "capital-gain | \n", - "capital-loss | \n", - "hours-per-week | \n", - "native-country | \n", - "income | \n", - "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", - "39 | \n", - "State-gov | \n", - "Bachelors | \n", - "13.0 | \n", - "Never-married | \n", - "Adm-clerical | \n", - "Not-in-family | \n", - "White | \n", - "Male | \n", - "2174.0 | \n", - "0.0 | \n", - "40.0 | \n", - "United-States | \n", - "<=50K | \n", - "
| \n", - " | age | \n", - "workclass | \n", - "education_level | \n", - "education-num | \n", - "marital-status | \n", - "occupation | \n", - "relationship | \n", - "race | \n", - "sex | \n", - "capital-gain | \n", - "capital-loss | \n", - "hours-per-week | \n", - "native-country | \n", - "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", - "0.301370 | \n", - "State-gov | \n", - "Bachelors | \n", - "0.800000 | \n", - "Never-married | \n", - "Adm-clerical | \n", - "Not-in-family | \n", - "White | \n", - "Male | \n", - "0.667492 | \n", - "0.0 | \n", - "0.397959 | \n", - "United-States | \n", - "
| 1 | \n", - "0.452055 | \n", - "Self-emp-not-inc | \n", - "Bachelors | \n", - "0.800000 | \n", - "Married-civ-spouse | \n", - "Exec-managerial | \n", - "Husband | \n", - "White | \n", - "Male | \n", - "0.000000 | \n", - "0.0 | \n", - "0.122449 | \n", - "United-States | \n", - "
| 2 | \n", - "0.287671 | \n", - "Private | \n", - "HS-grad | \n", - "0.533333 | \n", - "Divorced | \n", - "Handlers-cleaners | \n", - "Not-in-family | \n", - "White | \n", - "Male | \n", - "0.000000 | \n", - "0.0 | \n", - "0.397959 | \n", - "United-States | \n", - "
| 3 | \n", - "0.493151 | \n", - "Private | \n", - "11th | \n", - "0.400000 | \n", - "Married-civ-spouse | \n", - "Handlers-cleaners | \n", - "Husband | \n", - "Black | \n", - "Male | \n", - "0.000000 | \n", - "0.0 | \n", - "0.397959 | \n", - "United-States | \n", - "
| 4 | \n", - "0.150685 | \n", - "Private | \n", - "Bachelors | \n", - "0.800000 | \n", - "Married-civ-spouse | \n", - "Prof-specialty | \n", - "Wife | \n", - "Black | \n", - "Female | \n", - "0.000000 | \n", - "0.0 | \n", - "0.397959 | \n", - "Cuba | \n", - "
| \n", - " | age | \n", - "workclass | \n", - "education_level | \n", - "education-num | \n", - "marital-status | \n", - "occupation | \n", - "relationship | \n", - "race | \n", - "sex | \n", - "capital-gain | \n", - "capital-loss | \n", - "hours-per-week | \n", - "native-country | \n", - "income | \n", - "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", - "39 | \n", - "State-gov | \n", - "Bachelors | \n", - "13.0 | \n", - "Never-married | \n", - "Adm-clerical | \n", - "Not-in-family | \n", - "White | \n", - "Male | \n", - "2174.0 | \n", - "0.0 | \n", - "40.0 | \n", - "United-States | \n", - "<=50K | \n", - "
| \n", - " | age | \n", - "workclass | \n", - "education_level | \n", - "education-num | \n", - "marital-status | \n", - "occupation | \n", - "relationship | \n", - "race | \n", - "sex | \n", - "capital-gain | \n", - "capital-loss | \n", - "hours-per-week | \n", - "native-country | \n", - "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", - "0.301370 | \n", - "State-gov | \n", - "Bachelors | \n", - "0.800000 | \n", - "Never-married | \n", - "Adm-clerical | \n", - "Not-in-family | \n", - "White | \n", - "Male | \n", - "0.667492 | \n", - "0.0 | \n", - "0.397959 | \n", - "United-States | \n", - "
| 1 | \n", - "0.452055 | \n", - "Self-emp-not-inc | \n", - "Bachelors | \n", - "0.800000 | \n", - "Married-civ-spouse | \n", - "Exec-managerial | \n", - "Husband | \n", - "White | \n", - "Male | \n", - "0.000000 | \n", - "0.0 | \n", - "0.122449 | \n", - "United-States | \n", - "
| 2 | \n", - "0.287671 | \n", - "Private | \n", - "HS-grad | \n", - "0.533333 | \n", - "Divorced | \n", - "Handlers-cleaners | \n", - "Not-in-family | \n", - "White | \n", - "Male | \n", - "0.000000 | \n", - "0.0 | \n", - "0.397959 | \n", - "United-States | \n", - "
| 3 | \n", - "0.493151 | \n", - "Private | \n", - "11th | \n", - "0.400000 | \n", - "Married-civ-spouse | \n", - "Handlers-cleaners | \n", - "Husband | \n", - "Black | \n", - "Male | \n", - "0.000000 | \n", - "0.0 | \n", - "0.397959 | \n", - "United-States | \n", - "
| 4 | \n", - "0.150685 | \n", - "Private | \n", - "Bachelors | \n", - "0.800000 | \n", - "Married-civ-spouse | \n", - "Prof-specialty | \n", - "Wife | \n", - "Black | \n", - "Female | \n", - "0.000000 | \n", - "0.0 | \n", - "0.397959 | \n", - "Cuba | \n", - "
Welcome to the second project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and it will be your job to implement the additional functionality necessary to successfully complete this project. Sections that begin with 'Implementation' in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!
In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question X' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.
Note: Please specify WHICH VERSION OF PYTHON you are using when submitting this notebook. Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. In addition, Markdown cells can be edited by double-clicking the cell to enter edit mode.

In this project, you will employ several supervised algorithms of your choice to accurately model individuals' income using data collected from the 1994 U.S. Census. You will then choose the best candidate algorithm from preliminary results and further optimize this algorithm to best model the data. Your goal with this implementation is to construct a model that accurately predicts whether an individual makes more than $50,000. This sort of task can arise in a non-profit setting, where organizations survive on donations. Understanding an individual's income can help a non-profit better understand how large of a donation to request, or whether or not they should reach out to begin with. While it can be difficult to determine an individual's general income bracket directly from public sources, we can (as we will see) infer this value from other publicly available features.

The dataset for this project originates from the UCI Machine Learning Repository. The dataset was donated by Ron Kohavi and Barry Becker, after being published in the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". You can find the article by Ron Kohavi online. The data we investigate here consists of small changes to the original dataset, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.
Run the code cell below to load necessary Python libraries and load the census data. Note that the last column from this dataset, 'income', will be our target label (whether an individual makes more than, or at most, $50,000 annually). All other columns are features about each individual in the census database.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames

# Import supplementary visualization code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the Census dataset
data = pd.read_csv("census.csv")

# Success - Display the first record
display(data.head(n=1))
A cursory investigation of the dataset will determine how many individuals fit into either group, and will tell us about the percentage of these individuals making more than \$50,000. In the code cell below, you will need to compute the following:

- The total number of records, 'n_records'.
- The number of individuals making more than \$50,000 annually, 'n_greater_50k'.
- The number of individuals making at most \$50,000 annually, 'n_at_most_50k'.
- The percentage of individuals making more than \$50,000 annually, 'greater_percent'.

HINT: You may need to look at the table above to understand how the 'income' entries are formatted.
# TODO: Total number of records
n_records = data.shape[0]

# TODO: Number of records where individual's income is more than $50,000
more_than_50k = data.loc[(data['income'] == '>50K')]
n_greater_50k = more_than_50k.shape[0]

# TODO: Number of records where individual's income is at most $50,000
atmost_50k = data.loc[(data['income'] == '<=50K')]
n_at_most_50k = atmost_50k.shape[0]

# TODO: Percentage of individuals whose income is more than $50,000
greater_percent = n_greater_50k * 100 / n_records

# Print the results
print("Total number of records: {}".format(n_records))
print("Individuals making more than $50,000: {}".format(n_greater_50k))
print("Individuals making at most $50,000: {}".format(n_at_most_50k))
print("Percentage of individuals making more than $50,000: {:.2f}%".format(greater_percent))
Featureset Exploration

Before data can be used as input for machine learning algorithms, it often must be cleaned, formatted, and restructured — this is typically known as preprocessing. Fortunately, for this dataset, there are no invalid or missing entries we must deal with; however, there are some qualities about certain features that must be adjusted. This preprocessing can help tremendously with the outcome and predictive power of nearly all learning algorithms.

A dataset may sometimes contain at least one feature whose values tend to lie near a single number, but will also have a non-trivial number of vastly larger or smaller values than that single number. Algorithms can be sensitive to such distributions of values and can underperform if the range is not properly normalized. With the census dataset, two features fit this description: 'capital-gain' and 'capital-loss'.

Run the code cell below to plot a histogram of these two features. Note the range of the values present and how they are distributed.
# Split the data into features and target label
income_raw = data['income']
features_raw = data.drop('income', axis = 1)

# Visualize skewed continuous features of original data
vs.distribution(data)
For highly-skewed feature distributions such as 'capital-gain' and 'capital-loss', it is common practice to apply a logarithmic transformation on the data so that the very large and very small values do not negatively affect the performance of a learning algorithm. Using a logarithmic transformation significantly reduces the range of values caused by outliers. Care must be taken when applying this transformation, however: the logarithm of 0 is undefined, so we must translate the values by a small amount above 0 to apply the logarithm successfully.
Run the code cell below to perform a transformation on the data and visualize the results. Again, note the range of values and how they are distributed.
# Log-transform the skewed features
skewed = ['capital-gain', 'capital-loss']
features_log_transformed = pd.DataFrame(data = features_raw)
features_log_transformed[skewed] = features_raw[skewed].apply(lambda x: np.log(x + 1))

# Visualize the new log distributions
vs.distribution(features_log_transformed, transformed = True)
In addition to performing transformations on features that are highly skewed, it is often good practice to perform some type of scaling on numerical features. Applying a scaling to the data does not change the shape of each feature's distribution (such as 'capital-gain' or 'capital-loss' above); however, normalization ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, observing the data in its raw form will no longer have the same original meaning, as shown in the example below.
Run the code cell below to normalize each numerical feature. We will use sklearn.preprocessing.MinMaxScaler for this.
# Import sklearn.preprocessing.MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

features_log_minmax_transform = pd.DataFrame(data = features_log_transformed)
features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])

# Show an example of a record with scaling applied
display(features_log_minmax_transform.head(n = 5))
From the table in Exploring the Data above, we can see there are several features for each record that are non-numeric. Typically, learning algorithms expect input to be numeric, which requires that non-numeric features (called categorical variables) be converted. One popular way to convert categorical variables is by using the one-hot encoding scheme. One-hot encoding creates a "dummy" variable for each possible category of each non-numeric feature. For example, assume someFeature has three possible entries: A, B, or C. We then encode this feature into someFeature_A, someFeature_B and someFeature_C.

| | someFeature | | someFeature_A | someFeature_B | someFeature_C |
|---|---|---|---|---|---|
| 0 | B | | 0 | 1 | 0 |
| 1 | C | ----> one-hot encode ----> | 0 | 0 | 1 |
| 2 | A | | 1 | 0 | 0 |
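A tiny sketch of this encoding with pandas (illustrative only; the values mirror the toy table above):

# Illustrative one-hot encoding of the toy 'someFeature' column
import pandas as pd

toy = pd.DataFrame({'someFeature': ['B', 'C', 'A']})
print(pd.get_dummies(toy['someFeature'], prefix='someFeature'))
# Produces one indicator column per category: someFeature_A, someFeature_B, someFeature_C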
Additionally, as with the non-numeric features, we need to convert the non-numeric target label, 'income', to numerical values for the learning algorithm to work. Since there are only two possible categories for this label ("<=50K" and ">50K"), we can avoid using one-hot encoding and simply encode these two categories as 0 and 1, respectively. In the code cell below, you will need to implement the following:

- Use pandas.get_dummies() to perform one-hot encoding on the 'features_log_minmax_transform' data.
- Convert the target label 'income_raw' to numerical entries: set records with "<=50K" to 0 and records with ">50K" to 1.

# TODO: One-hot encode the 'features_log_minmax_transform' data using pandas.get_dummies()
features_final = pd.get_dummies(features_log_minmax_transform)

# TODO: Encode the 'income_raw' data to numerical values
income = pd.get_dummies(income_raw)['>50K']

# Print the number of features after one-hot encoding
encoded = list(features_final.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))

# Uncomment the following line to see the encoded feature names
#print(encoded)
Now all categorical variables have been converted into numerical features, and all numerical features have been normalized. As always, we will now split the data (both features and their labels) into training and test sets. 80% of the data will be used for training and 20% for testing.

Run the code cell below to perform this split.

# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the 'features' and 'income' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_final,
                                                    income,
                                                    test_size = 0.2,
                                                    random_state = 0)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
In this section, we will investigate four different algorithms, and determine which is best at modeling the data. Three of these algorithms will be supervised learners of your choice, and the fourth algorithm is known as a naive predictor.

CharityML, equipped with their research, knows individuals that make more than \$50,000 are most likely to donate to their charity. Because of this, *CharityML* is particularly interested in predicting who makes more than \$50,000 accurately. It would seem that using accuracy as a metric for evaluating a particular model's performance would be appropriate. Additionally, identifying someone that does not make more than \$50,000 as someone who does would be detrimental to *CharityML*, since they are looking to find individuals willing to donate. Therefore, a model's ability to precisely predict those that make more than \$50,000 is more important than the model's ability to recall those individuals. We can use the F-beta score as a metric that considers both precision and recall:

$$ F_{\beta} = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\left( \beta^2 \cdot precision \right) + recall} $$

In particular, when $\beta = 0.5$, more emphasis is placed on precision. This is called the F$_{0.5}$ score (or F-score for simplicity).

Looking at the distribution of classes (those who make at most \$50,000, and those who make more), it's clear most individuals do not make more than \$50,000. This can greatly affect accuracy, since we could simply say "this person does not make more than \$50,000" and generally be right, without ever looking at the data! Making such a statement would be called naive, since we have not considered any information to substantiate the claim. It is always important to consider the naive prediction for your data, to help establish a benchmark for whether a model is performing well. That being said, using that prediction would be pointless: if we predicted all people made less than \$50,000, CharityML would identify no one as donors.
Accuracy measures how often the classifier makes the correct prediction. It's the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

Precision tells us what proportion of messages we classified as spam actually were spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all positives (all messages classified as spam, irrespective of whether that was the correct classification). In other words, it is the ratio of

[True Positives/(True Positives + False Positives)]

Recall (sensitivity) tells us what proportion of messages that actually were spam were classified by us as spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all the messages that actually were spam. In other words, it is the ratio of

[True Positives/(True Positives + False Negatives)]

For classification problems that are skewed in their classification distributions, like in our case, accuracy by itself is not a very good metric. For example, if we had 100 text messages and only 2 were spam, we could classify 90 messages as not spam (including the 2 that were spam, which would then be false negatives) and 10 as spam (all 10 false positives) and still get a reasonably good accuracy score. For such cases, precision and recall come in very handy. These two metrics can be combined to give the F1 score, which is the weighted average (harmonic mean) of the precision and recall scores. This score can range from 0 to 1, with 1 being the best possible F1 score (we take the harmonic mean because we are dealing with ratios).
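To make these definitions concrete, here is a small sketch with made-up confusion-matrix counts (the numbers are hypothetical, purely for illustration):

# Illustrative sketch: precision, recall, and F-0.5 from hypothetical counts
tp, fp, fn = 8, 2, 4  # true positives, false positives, false negatives (made up)

precision = tp / float(tp + fp)   # 8 / 10 = 0.800
recall = tp / float(tp + fn)      # 8 / 12 = 0.667

beta = 0.5  # beta < 1 weights precision more heavily than recall
f_beta = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)
print("Precision: {:.3f}, Recall: {:.3f}, F-0.5: {:.3f}".format(precision, recall, f_beta))  # F-0.5: 0.769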
If we chose a model that always predicted an individual made more than \$50,000, what would that model's accuracy and F-score be on this dataset? Use the code cell below and assign your results to 'accuracy' and 'fscore' to be used later.

Please note that the purpose of generating a naive predictor is simply to show what a base model without any intelligence would look like. In the real world, ideally your base model would be either the results of a previous model or could be based on a research paper upon which you are looking to improve. When there is no benchmark model set, getting a result better than random choice is a place you could start from.

HINT:

'''
TP = np.sum(income) # Counting the ones as this is the naive case. Note that 'income' is the 'income_raw' data
encoded to numerical values done in the data preprocessing step.
FP = income.count() - TP # Specific to the naive case

TN = 0 # No predicted negatives in the naive case
FN = 0 # No predicted negatives in the naive case
'''
# TODO: Calculate accuracy, precision and recall
from sklearn.metrics import accuracy_score

true_positives = np.sum(income)    # Sums the 1s (the '>50K' encoding has a value of 1)
total_data_points = income.count()
false_positives = total_data_points - true_positives  # Specific to the naive case

true_negatives = 0   # No predicted negatives in the naive case
false_negatives = 0  # No predicted negatives in the naive case

accuracy = true_positives / total_data_points
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

# TODO: Calculate F-score using the formula above for beta = 0.5 and correct values for precision and recall.
# HINT: The formula above can be written as (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)
beta = 0.5
fscore = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)

# Print the results
print("Naive Predictor: [Accuracy score: {:.4f}, F-score: {:.4f}]".format(accuracy, fscore))
The following are some of the supervised learning models that are currently available in scikit-learn that you may choose from:

List three of the supervised learning models above that are appropriate for this problem that you will test on the census data. For each model chosen:

HINT:

Structure your answer in the same format as above, with 4 parts for each of the three models you pick. Please include references with your answer.
Answer:

- Gaussian Naive Bayes
- Decision Trees
- Random Forest
- Support Vector Machines

Sources:
To properly evaluate the performance of each model you've chosen, it's important that you create a training and predicting pipeline that allows you to quickly and effectively train models using various sizes of training data and perform predictions on the testing data. Your implementation here will be used in the following section. In the code block below, you will need to implement the following:

- Import fbeta_score and accuracy_score from sklearn.metrics.
- Fit the learner to the sampled training data and record the training time.
- Perform predictions on the test data X_test, and also on the first 300 training points X_train[:300]; record the total prediction time.
- Calculate the accuracy score for both the training subset and the testing set.
- Calculate the F-score for both the training subset and the testing set. Make sure that you set the beta parameter!

# TODO: Import two metrics from sklearn - fbeta_score and accuracy_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import accuracy_score

def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: income training set
       - X_test: features testing set
       - y_test: income testing set
    '''

    results = {}

    # TODO: Fit the learner to the training data using slicing with 'sample_size' using .fit(training_features[:], training_labels[:])
    start = time() # Get start time
    learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time() # Get end time

    # TODO: Calculate the training time
    results['train_time'] = end - start

    # TODO: Get the predictions on the test set (X_test),
    # then get predictions on the first 300 training samples (X_train) using .predict()
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time() # Get end time

    # TODO: Calculate the total prediction time
    results['pred_time'] = end - start

    # TODO: Compute accuracy on the first 300 training samples which is y_train[:300]
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)

    # TODO: Compute accuracy on test set using accuracy_score()
    results['acc_test'] = accuracy_score(y_test, predictions_test)

    # TODO: Compute F-score on the first 300 training samples using fbeta_score()
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)

    # TODO: Compute F-score on the test set which is y_test
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)

    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))

    # Return the results
    return results
In the code cell, you will need to implement the following:

- Import the three supervised learning models you've discussed in the previous section and initialize them as 'clf_A', 'clf_B', and 'clf_C'. Use a 'random_state' for each model, if provided.
- Calculate the number of records equal to 1%, 10%, and 100% of the training data, and store each of those values in 'samples_1', 'samples_10', and 'samples_100' respectively.

Note: Depending on which algorithms you chose, the following implementation may take some time to run!
# TODO: Import the three supervised learning models from sklearn
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# TODO: Initialize the three models (a fixed random_state keeps results reproducible)
clf_A = GaussianNB()
clf_B = DecisionTreeClassifier(random_state=0)
clf_C = RandomForestClassifier(random_state=0)

# TODO: Calculate the number of samples for 1%, 10%, and 100% of the training data
# HINT: samples_100 is the entire training set i.e. len(y_train)
# HINT: samples_10 is 10% of samples_100
# HINT: samples_1 is 1% of samples_100

samples_100 = len(y_train)
samples_10 = int(samples_100 * 0.10)
samples_1 = int(samples_100 * 0.01)

# Collect results on the learners
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = \
            train_predict(clf, samples, X_train, y_train, X_test, y_test)

# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results, accuracy, fscore)
In this final section, you will choose from the three supervised learning models the best model to use on the census data. You will then perform a grid search optimization for the model over the entire training set (X_train and y_train) by tuning at least one parameter to improve upon the untuned model's F-score.

HINT:

Look at the graph at the bottom left from the cell above (the visualization created by vs.evaluate(results, accuracy, fscore)) and check the F-score for the testing set when 100% of the training set is used. Which model has the highest score? Your answer should include discussion of the:

- metrics (the F-score on the testing set when 100% of the training data is used),
- prediction/training time,
- the algorithm's suitability for the data.
Answer:

Based on the results, I believe that a random forest model will be most appropriate for this task.

Random forest: training time 0.583 s, prediction time 0.035 s. Its training and prediction times are higher than those of the other models, but both are still at acceptable levels.

Based on these factors, random forest is better suited to make predictions: it performs fairly well, and its training and prediction times are acceptable.
HINT:

When explaining your model, if using external resources please include all citations.
Answer:

To understand how Random Forest works, we first need to understand how the Decision Tree algorithm works.

A decision tree is a classification algorithm that uses a tree-like data structure to model decisions and their possible outcomes. The algorithm works by repeatedly splitting the data on the feature, and threshold, that best separate the classes, until a stopping criterion (such as a maximum depth) is reached; a prediction is then made by following the learned splits from the root down to a leaf.

When used alone, decision trees are prone to overfitting. However, random forests help by correcting the possible overfitting that could occur. Random forests work by using multiple decision trees: a multitude of different decision trees make different predictions, and the random forest combines the results of those individual trees to give the final outcome.

Random forest applies an ensemble technique called bagging to the decision trees, which helps reduce variance and overfitting: each tree is trained on a random bootstrap sample of the training data, and the trees' individual predictions are aggregated, for example by majority vote (a small illustrative sketch follows).
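A hedged sketch of the bagging idea (illustrative only, on a synthetic dataset; sklearn's RandomForestClassifier does this internally, along with random feature selection at each split):

# Illustrative bagging by hand: bootstrap samples + majority vote
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, random_state=0)
rng = np.random.RandomState(0)

trees = []
for _ in range(10):
    idx = rng.randint(0, len(X_toy), len(X_toy))  # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx]))

# Majority vote across the ensemble's predictions
votes = np.mean([t.predict(X_toy) for t in trees], axis=0)
ensemble_pred = (votes >= 0.5).astype(int)
print("Ensemble training accuracy: {:.3f}".format((ensemble_pred == y_toy).mean()))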
Out of all the tested models, random forest also seems like the best candidate for tuning hyper-parameters; using other ensemble methods, like gradient boosting, could result in further improved scores.
References:

Fine-tune the chosen model. Use grid search (GridSearchCV) with at least one important parameter tuned with at least 3 different values. You will need to use the entire training set for this. In the code cell below, you will need to implement the following:

- Import sklearn.grid_search.GridSearchCV and sklearn.metrics.make_scorer.
- Initialize the classifier you've chosen and store it in clf. Set a random_state, if one is available, to the same state you set before.
- Create a dictionary of parameters you wish to tune for the chosen model, e.g. parameters = {'parameter' : [list of values]}. Note: Avoid tuning the max_features parameter of your learner if that parameter is available!
- Use make_scorer to create an fbeta_score scoring object (with $\beta = 0.5$).
- Perform grid search on the classifier clf using the 'scorer', and store it in grid_obj.
- Fit the grid search object to the training data (X_train, y_train), and store it in grid_fit.

Note: Depending on the algorithm chosen and the parameter list, the following implementation may take some time to run!
# TODO: Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# TODO: Initialize the classifier
clf = RandomForestClassifier(random_state=0)

# TODO: Create the parameters list you wish to tune, using a dictionary if needed.
# HINT: parameters = {'parameter_1': [value1, value2], 'parameter_2': [value1, value2]}
parameters = {'n_estimators': [20, 40, 60], 'warm_start': [False, True], 'criterion': ['gini', 'entropy'], 'bootstrap': [True, False]}

# TODO: Make an fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score, beta=0.5)

# TODO: Perform grid search on the classifier using 'scorer' as the scoring method using GridSearchCV()
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# TODO: Fit the grid search object to the training data and find the optimal parameters using fit()
grid_fit = grid_obj.fit(X_train, y_train)

# Get the estimator
best_clf = grid_fit.best_estimator_

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
Note: Fill in the table below with your results, and then provide discussion in the Answer box.

| Metric | Unoptimized Model | Optimized Model |
|---|---|---|
| Accuracy Score | 0.8375 | 0.8417 |
| F-score | 0.6710 | 0.6799 |
Answer:
Both scores improve slightly, but not conclusively. One possible reason could be that all the features in the dataset are given equal importance; some less relevant features might be interfering with the training.
- -An important task when performing supervised learning on a dataset like the census data we study here is determining which features provide the most predictive power. By focusing on the relationship between only a few crucial features and the target label we simplify our understanding of the phenomenon, which is most always a useful thing to do. In the case of this project, that means we wish to identify a small number of features that most strongly predict whether an individual makes at most or more than \$50,000.
-Choose a scikit-learn classifier (e.g., adaboost, random forests) that has a feature_importance_ attribute, which is a function that ranks the importance of features according to the chosen classifier. In the next python cell fit this classifier to training set and use this attribute to determine the top 5 most important features for the census dataset.
When Exploring the Data, it was shown there are thirteen available features for each individual on record in the census data. Of these thirteen features, which five do you believe to be most important for prediction, in what order would you rank them, and why?
Answer:
Choose a scikit-learn supervised learning algorithm that has a `feature_importances_` attribute available for it. This attribute ranks the importance of each feature when making predictions based on the chosen algorithm.
In the code cell below, you will need to implement the following:
- Import a supervised learning model that has a `feature_importances_` attribute.
- Train the supervised model on the entire training set.
- Extract the feature importances using `'.feature_importances_'`.

# TODO: Import a supervised learning model that has 'feature_importances_'
# (classifiers are used here since this is a classification task)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# TODO: Train the supervised model on the training set using .fit(X_train, y_train)
model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=None, random_state=None))
model = model.fit(X_train, y_train)

# TODO: Extract the feature importances using .feature_importances_
importances = model.feature_importances_

# Plot the five most important features
vs.feature_plot(importances, X_train, y_train)
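To complement the plot, a minimal sketch (assuming `X_train` is a pandas DataFrame, as it is used elsewhere in this notebook) that prints the top five features by importance:

# Print the five most important features in descending order
indices = np.argsort(importances)[::-1]
for i in indices[:5]:
    print("{}: {:.4f}".format(X_train.columns.values[i], importances[i]))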
Observe the visualization created above, which displays the five most relevant features for predicting whether an individual makes at most or above \$50,000.
Answer: It looks like my guesses were way off; other features turn out to be more relevant than what I previously assumed.
How does a model perform if we only use a subset of all the available features in the data? With fewer features required to train, the expectation is that training and prediction time will be much lower, at the cost of performance metrics. From the visualization above, we see that the top five most important features contribute more than half of the importance of all features present in the data. This hints that we can attempt to reduce the feature space and simplify the information required for the model to learn. The code cell below will use the same optimized model you found earlier, and train it on the same training set with only the top five important features.
# Import functionality for cloning a model
from sklearn.base import clone

# Reduce the feature space to the five most important features
X_train_reduced = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:5]]]
X_test_reduced = X_test[X_test.columns.values[(np.argsort(importances)[::-1])[:5]]]

# Train on the "best" model found from grid search earlier
clf = (clone(best_clf)).fit(X_train_reduced, y_train)

# Make new predictions
reduced_predictions = clf.predict(X_test_reduced)

# Report scores from the final model using both versions of data
print("Final Model trained on full data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
print("\nFinal Model trained on reduced data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, reduced_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, reduced_predictions, beta = 0.5)))
Answer: The final model's F-score and accuracy both decrease slightly on the reduced data, so some of the dropped features evidently carry useful signal.
The training time here isn't significant, so I'd use the complete data as my training set. But in a scenario with many more samples, I'd consider using the reduced dataset, accepting the trade-off between training time and performance.
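To make that time/performance trade-off concrete, here is a rough sketch (assuming Python's standard `time` module and the variables from the cell above; wall-clock numbers will vary by machine):

from time import time

# Time training on the full feature set
start = time()
clone(best_clf).fit(X_train, y_train)
print("Training on full data: {:.3f} s".format(time() - start))

# Time training on the reduced feature set
start = time()
clone(best_clf).fit(X_train_reduced, y_train)
print("Training on reduced data: {:.3f} s".format(time() - start))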
Note: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.
|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.0 | 1.0 | 0.0 | 3.0 | 0.0 | 16.0 | 46.0 | 1.0 | 3.0 | 1.0 | 10.0 | 6.0 |
| std | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 10.0 | 33.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| min | 5.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 6.0 | 1.0 | 3.0 | 0.0 | 8.0 | 3.0 |
| 25% | 7.0 | 0.0 | 0.0 | 2.0 | 0.0 | 7.0 | 22.0 | 1.0 | 3.0 | 1.0 | 10.0 | 5.0 |
| 50% | 8.0 | 1.0 | 0.0 | 2.0 | 0.0 | 14.0 | 38.0 | 1.0 | 3.0 | 1.0 | 10.0 | 6.0 |
| 75% | 9.0 | 1.0 | 0.0 | 3.0 | 0.0 | 21.0 | 62.0 | 1.0 | 3.0 | 1.0 | 11.0 | 6.0 |
| max | 16.0 | 2.0 | 1.0 | 16.0 | 1.0 | 72.0 | 289.0 | 1.0 | 4.0 | 2.0 | 15.0 | 8.0 |
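The summary above comes from the red-wine quality dataset. A hedged sketch of how such a rounded summary could be produced (the file name 'winequality-red.csv' and the ';' separator are assumptions, not from the original):

import pandas as pd

# Load the red-wine data and show a rounded statistical summary
wine_data = pd.read_csv('winequality-red.csv', sep=';')
print(wine_data.describe().round())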
|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | quality_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | 0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | 0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
[Output omitted for brevity: 49 rows × 12 columns — the red-wine samples with the highest fixed acidity, all roughly 12.4 and above.]
[Output omitted for brevity: 19 rows × 12 columns — the samples with the highest volatile acidity, all roughly 1.02 and above.]
|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 151 | 9.2 | 0.52 | 1.0 | 3.4 | 0.61 | 32.0 | 69.0 | 0.9996 | 2.74 | 2.0 | 9.4 | 4 |
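Subsets like the ones above and below are consistent with Tukey's IQR rule for flagging outliers; whether the notebook used exactly this criterion is an assumption. A minimal sketch for the chlorides column, reusing the hypothetical `wine_data` from the earlier sketch:

# Flag chlorides values outside 1.5 * IQR of the middle 50% (Tukey's rule)
q1 = wine_data['chlorides'].quantile(0.25)
q3 = wine_data['chlorides'].quantile(0.75)
iqr = q3 - q1
outliers = wine_data[(wine_data['chlorides'] < q1 - 1.5 * iqr) |
                     (wine_data['chlorides'] > q3 + 1.5 * iqr)]
print(outliers.shape)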
[Output omitted for brevity: 155 rows × 12 columns — samples with unusually high residual sugar, roughly 3.7 and above.]
\n", + "| \n", + " | fixed acidity | \n", + "volatile acidity | \n", + "citric acid | \n", + "residual sugar | \n", + "chlorides | \n", + "free sulfur dioxide | \n", + "total sulfur dioxide | \n", + "density | \n", + "pH | \n", + "sulphates | \n", + "alcohol | \n", + "quality | \n", + "
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | \n", + "8.9 | \n", + "0.620 | \n", + "0.18 | \n", + "3.80 | \n", + "0.176 | \n", + "52.0 | \n", + "145.0 | \n", + "0.99860 | \n", + "3.16 | \n", + "0.88 | \n", + "9.2 | \n", + "5 | \n", + "
| 15 | \n", + "8.9 | \n", + "0.620 | \n", + "0.19 | \n", + "3.90 | \n", + "0.170 | \n", + "51.0 | \n", + "148.0 | \n", + "0.99860 | \n", + "3.17 | \n", + "0.93 | \n", + "9.2 | \n", + "5 | \n", + "
| 17 | \n", + "8.1 | \n", + "0.560 | \n", + "0.28 | \n", + "1.70 | \n", + "0.368 | \n", + "16.0 | \n", + "56.0 | \n", + "0.99680 | \n", + "3.11 | \n", + "1.28 | \n", + "9.3 | \n", + "5 | \n", + "
| 19 | \n", + "7.9 | \n", + "0.320 | \n", + "0.51 | \n", + "1.80 | \n", + "0.341 | \n", + "17.0 | \n", + "56.0 | \n", + "0.99690 | \n", + "3.04 | \n", + "1.08 | \n", + "9.2 | \n", + "6 | \n", + "
| 38 | \n", + "5.7 | \n", + "1.130 | \n", + "0.09 | \n", + "1.50 | \n", + "0.172 | \n", + "7.0 | \n", + "19.0 | \n", + "0.99400 | \n", + "3.50 | \n", + "0.48 | \n", + "9.8 | \n", + "4 | \n", + "
| 42 | \n", + "7.5 | \n", + "0.490 | \n", + "0.20 | \n", + "2.60 | \n", + "0.332 | \n", + "8.0 | \n", + "14.0 | \n", + "0.99680 | \n", + "3.21 | \n", + "0.90 | \n", + "10.5 | \n", + "6 | \n", + "
| 81 | \n", + "7.8 | \n", + "0.430 | \n", + "0.70 | \n", + "1.90 | \n", + "0.464 | \n", + "22.0 | \n", + "67.0 | \n", + "0.99740 | \n", + "3.13 | \n", + "1.28 | \n", + "9.4 | \n", + "5 | \n", + "
| 83 | \n", + "7.3 | \n", + "0.670 | \n", + "0.26 | \n", + "1.80 | \n", + "0.401 | \n", + "16.0 | \n", + "51.0 | \n", + "0.99690 | \n", + "3.16 | \n", + "1.14 | \n", + "9.4 | \n", + "5 | \n", + "
| 106 | \n", + "7.8 | \n", + "0.410 | \n", + "0.68 | \n", + "1.70 | \n", + "0.467 | \n", + "18.0 | \n", + "69.0 | \n", + "0.99730 | \n", + "3.08 | \n", + "1.31 | \n", + "9.3 | \n", + "5 | \n", + "
| 109 | \n", + "8.1 | \n", + "0.785 | \n", + "0.52 | \n", + "2.00 | \n", + "0.122 | \n", + "37.0 | \n", + "153.0 | \n", + "0.99690 | \n", + "3.21 | \n", + "0.69 | \n", + "9.3 | \n", + "5 | \n", + "
| 120 | \n", + "7.3 | \n", + "1.070 | \n", + "0.09 | \n", + "1.70 | \n", + "0.178 | \n", + "10.0 | \n", + "89.0 | \n", + "0.99620 | \n", + "3.30 | \n", + "0.57 | \n", + "9.0 | \n", + "5 | \n", + "
| 125 | \n", + "9.0 | \n", + "0.620 | \n", + "0.04 | \n", + "1.90 | \n", + "0.146 | \n", + "27.0 | \n", + "90.0 | \n", + "0.99840 | \n", + "3.16 | \n", + "0.70 | \n", + "9.4 | \n", + "5 | \n", + "
| 147 | \n", + "7.6 | \n", + "0.490 | \n", + "0.26 | \n", + "1.60 | \n", + "0.236 | \n", + "10.0 | \n", + "88.0 | \n", + "0.99680 | \n", + "3.11 | \n", + "0.80 | \n", + "9.3 | \n", + "5 | \n", + "
| 151 | \n", + "9.2 | \n", + "0.520 | \n", + "1.00 | \n", + "3.40 | \n", + "0.610 | \n", + "32.0 | \n", + "69.0 | \n", + "0.99960 | \n", + "2.74 | \n", + "2.00 | \n", + "9.4 | \n", + "4 | \n", + "
| 169 | \n", + "7.5 | \n", + "0.705 | \n", + "0.24 | \n", + "1.80 | \n", + "0.360 | \n", + "15.0 | \n", + "63.0 | \n", + "0.99640 | \n", + "3.00 | \n", + "1.59 | \n", + "9.5 | \n", + "5 | \n", + "
| 181 | \n", + "8.9 | \n", + "0.610 | \n", + "0.49 | \n", + "2.00 | \n", + "0.270 | \n", + "23.0 | \n", + "110.0 | \n", + "0.99720 | \n", + "3.12 | \n", + "1.02 | \n", + "9.3 | \n", + "5 | \n", + "
| 210 | \n", + "9.7 | \n", + "0.530 | \n", + "0.60 | \n", + "2.00 | \n", + "0.039 | \n", + "5.0 | \n", + "19.0 | \n", + "0.99585 | \n", + "3.30 | \n", + "0.86 | \n", + "12.4 | \n", + "6 | \n", + "
| 226 | \n", + "8.9 | \n", + "0.590 | \n", + "0.50 | \n", + "2.00 | \n", + "0.337 | \n", + "27.0 | \n", + "81.0 | \n", + "0.99640 | \n", + "3.04 | \n", + "1.61 | \n", + "9.5 | \n", + "6 | \n", + "
| 240 | \n", + "8.9 | \n", + "0.635 | \n", + "0.37 | \n", + "1.70 | \n", + "0.263 | \n", + "5.0 | \n", + "62.0 | \n", + "0.99710 | \n", + "3.00 | \n", + "1.09 | \n", + "9.3 | \n", + "5 | \n", + "
| 258 | \n", + "7.7 | \n", + "0.410 | \n", + "0.76 | \n", + "1.80 | \n", + "0.611 | \n", + "8.0 | \n", + "45.0 | \n", + "0.99680 | \n", + "3.06 | \n", + "1.26 | \n", + "9.4 | \n", + "5 | \n", + "
| 281 | \n", + "7.7 | \n", + "0.270 | \n", + "0.68 | \n", + "3.50 | \n", + "0.358 | \n", + "5.0 | \n", + "10.0 | \n", + "0.99720 | \n", + "3.25 | \n", + "1.08 | \n", + "9.9 | \n", + "7 | \n", + "
| 291 | \n", + "11.0 | \n", + "0.200 | \n", + "0.48 | \n", + "2.00 | \n", + "0.343 | \n", + "6.0 | \n", + "18.0 | \n", + "0.99790 | \n", + "3.30 | \n", + "0.71 | \n", + "10.5 | \n", + "5 | \n", + "
| 303 | \n", + "7.4 | \n", + "0.670 | \n", + "0.12 | \n", + "1.60 | \n", + "0.186 | \n", + "5.0 | \n", + "21.0 | \n", + "0.99600 | \n", + "3.39 | \n", + "0.54 | \n", + "9.5 | \n", + "5 | \n", + "
| 307 | \n", + "10.3 | \n", + "0.410 | \n", + "0.42 | \n", + "2.40 | \n", + "0.213 | \n", + "6.0 | \n", + "14.0 | \n", + "0.99940 | \n", + "3.19 | \n", + "0.62 | \n", + "9.5 | \n", + "6 | \n", + "
| 308 | \n", + "10.3 | \n", + "0.430 | \n", + "0.44 | \n", + "2.40 | \n", + "0.214 | \n", + "5.0 | \n", + "12.0 | \n", + "0.99940 | \n", + "3.19 | \n", + "0.63 | \n", + "9.5 | \n", + "6 | \n", + "
| 326 | \n", + "11.6 | \n", + "0.530 | \n", + "0.66 | \n", + "3.65 | \n", + "0.121 | \n", + "6.0 | \n", + "14.0 | \n", + "0.99780 | \n", + "3.05 | \n", + "0.74 | \n", + "11.5 | \n", + "7 | \n", + "
| 330 | \n", + "10.2 | \n", + "0.360 | \n", + "0.64 | \n", + "2.90 | \n", + "0.122 | \n", + "10.0 | \n", + "41.0 | \n", + "0.99800 | \n", + "3.23 | \n", + "0.66 | \n", + "12.5 | \n", + "6 | \n", + "
| 331 | \n", + "10.2 | \n", + "0.360 | \n", + "0.64 | \n", + "2.90 | \n", + "0.122 | \n", + "10.0 | \n", + "41.0 | \n", + "0.99800 | \n", + "3.23 | \n", + "0.66 | \n", + "12.5 | \n", + "6 | \n", + "
| 335 | \n", + "11.9 | \n", + "0.695 | \n", + "0.53 | \n", + "3.40 | \n", + "0.128 | \n", + "7.0 | \n", + "21.0 | \n", + "0.99920 | \n", + "3.17 | \n", + "0.84 | \n", + "12.2 | \n", + "7 | \n", + "
| 353 | \n", + "13.5 | \n", + "0.530 | \n", + "0.79 | \n", + "4.80 | \n", + "0.120 | \n", + "23.0 | \n", + "77.0 | \n", + "1.00180 | \n", + "3.18 | \n", + "0.77 | \n", + "13.0 | \n", + "5 | \n", + "
| ... | \n", + "... | \n", + "... | \n", + "... | \n", + "... | \n", + "... | \n", + "... | \n", + "... | \n", + "... | \n", + "... | \n", + "... | \n", + "... | \n", + "... | \n", + "
| 1109 | \n", + "10.8 | \n", + "0.470 | \n", + "0.43 | \n", + "2.10 | \n", + "0.171 | \n", + "27.0 | \n", + "66.0 | \n", + "0.99820 | \n", + "3.17 | \n", + "0.76 | \n", + "10.8 | \n", + "6 | \n", + "
| 1146 | \n", + "7.8 | \n", + "0.500 | \n", + "0.12 | \n", + "1.80 | \n", + "0.178 | \n", + "6.0 | \n", + "21.0 | \n", + "0.99600 | \n", + "3.28 | \n", + "0.87 | \n", + "9.8 | \n", + "6 | \n", + "
| 1165 | \n", + "8.5 | \n", + "0.440 | \n", + "0.50 | \n", + "1.90 | \n", + "0.369 | \n", + "15.0 | \n", + "38.0 | \n", + "0.99634 | \n", + "3.01 | \n", + "1.10 | \n", + "9.4 | \n", + "5 | \n", + "
| 1191 | \n", + "6.5 | \n", + "0.885 | \n", + "0.00 | \n", + "2.30 | \n", + "0.166 | \n", + "6.0 | \n", + "12.0 | \n", + "0.99551 | \n", + "3.56 | \n", + "0.51 | \n", + "10.8 | \n", + "5 | \n", + "
| 1193 | \n", + "6.4 | \n", + "0.885 | \n", + "0.00 | \n", + "2.30 | \n", + "0.166 | \n", + "6.0 | \n", + "12.0 | \n", + "0.99551 | \n", + "3.56 | \n", + "0.51 | \n", + "10.8 | \n", + "5 | \n", + "
| 1207 | \n", + "9.9 | \n", + "0.720 | \n", + "0.55 | \n", + "1.70 | \n", + "0.136 | \n", + "24.0 | \n", + "52.0 | \n", + "0.99752 | \n", + "3.35 | \n", + "0.94 | \n", + "10.0 | \n", + "5 | \n", + "
| 1220 | \n", + "10.9 | \n", + "0.320 | \n", + "0.52 | \n", + "1.80 | \n", + "0.132 | \n", + "17.0 | \n", + "44.0 | \n", + "0.99734 | \n", + "3.28 | \n", + "0.77 | \n", + "11.5 | \n", + "6 | \n", + "
| 1221 | \n", + "10.9 | \n", + "0.320 | \n", + "0.52 | \n", + "1.80 | \n", + "0.132 | \n", + "17.0 | \n", + "44.0 | \n", + "0.99734 | \n", + "3.28 | \n", + "0.77 | \n", + "11.5 | \n", + "6 | \n", + "
| 1252 | \n", + "7.1 | \n", + "0.720 | \n", + "0.00 | \n", + "1.80 | \n", + "0.123 | \n", + "6.0 | \n", + "14.0 | \n", + "0.99627 | \n", + "3.45 | \n", + "0.58 | \n", + "9.8 | \n", + "5 | \n", + "
| 1258 | \n", + "6.8 | \n", + "0.640 | \n", + "0.00 | \n", + "2.70 | \n", + "0.123 | \n", + "15.0 | \n", + "33.0 | \n", + "0.99538 | \n", + "3.44 | \n", + "0.63 | \n", + "11.3 | \n", + "6 | \n", + "
| 1259 | \n", + "6.8 | \n", + "0.640 | \n", + "0.00 | \n", + "2.70 | \n", + "0.123 | \n", + "15.0 | \n", + "33.0 | \n", + "0.99538 | \n", + "3.44 | \n", + "0.63 | \n", + "11.3 | \n", + "6 | \n", + "
| 1260 | \n", + "8.6 | \n", + "0.635 | \n", + "0.68 | \n", + "1.80 | \n", + "0.403 | \n", + "19.0 | \n", + "56.0 | \n", + "0.99632 | \n", + "3.02 | \n", + "1.15 | \n", + "9.3 | \n", + "5 | \n", + "
| 1299 | \n", + "7.6 | \n", + "1.580 | \n", + "0.00 | \n", + "2.10 | \n", + "0.137 | \n", + "5.0 | \n", + "9.0 | \n", + "0.99476 | \n", + "3.50 | \n", + "0.40 | \n", + "10.9 | \n", + "3 | \n", + "
| 1319 | \n", + "9.1 | \n", + "0.760 | \n", + "0.68 | \n", + "1.70 | \n", + "0.414 | \n", + "18.0 | \n", + "64.0 | \n", + "0.99652 | \n", + "2.90 | \n", + "1.33 | \n", + "9.1 | \n", + "6 | \n", + "
| 1334 | \n", + "7.2 | \n", + "0.835 | \n", + "0.00 | \n", + "2.00 | \n", + "0.166 | \n", + "4.0 | \n", + "11.0 | \n", + "0.99608 | \n", + "3.39 | \n", + "0.52 | \n", + "10.0 | \n", + "5 | \n", + "
| 1358 | \n", + "7.4 | \n", + "0.640 | \n", + "0.17 | \n", + "5.40 | \n", + "0.168 | \n", + "52.0 | \n", + "98.0 | \n", + "0.99736 | \n", + "3.28 | \n", + "0.50 | \n", + "9.5 | \n", + "5 | \n", + "
| 1370 | \n", + "8.7 | \n", + "0.780 | \n", + "0.51 | \n", + "1.70 | \n", + "0.415 | \n", + "12.0 | \n", + "66.0 | \n", + "0.99623 | \n", + "3.00 | \n", + "1.17 | \n", + "9.2 | \n", + "5 | \n", + "
| 1371 | \n", + "7.5 | \n", + "0.580 | \n", + "0.56 | \n", + "3.10 | \n", + "0.153 | \n", + "5.0 | \n", + "14.0 | \n", + "0.99476 | \n", + "3.21 | \n", + "1.03 | \n", + "11.6 | \n", + "6 | \n", + "
| 1372 | \n", + "8.7 | \n", + "0.780 | \n", + "0.51 | \n", + "1.70 | \n", + "0.415 | \n", + "12.0 | \n", + "66.0 | \n", + "0.99623 | \n", + "3.00 | \n", + "1.17 | \n", + "9.2 | \n", + "5 | \n", + "
| 1374 | \n", + "6.8 | \n", + "0.815 | \n", + "0.00 | \n", + "1.20 | \n", + "0.267 | \n", + "16.0 | \n", + "29.0 | \n", + "0.99471 | \n", + "3.32 | \n", + "0.51 | \n", + "9.8 | \n", + "3 | \n", + "
| 1423 | \n", + "6.4 | \n", + "0.530 | \n", + "0.09 | \n", + "3.90 | \n", + "0.123 | \n", + "14.0 | \n", + "31.0 | \n", + "0.99680 | \n", + "3.50 | \n", + "0.67 | \n", + "11.0 | \n", + "4 | \n", + "
| 1434 | \n", + "10.2 | \n", + "0.540 | \n", + "0.37 | \n", + "15.40 | \n", + "0.214 | \n", + "55.0 | \n", + "95.0 | \n", + "1.00369 | \n", + "3.18 | \n", + "0.77 | \n", + "9.0 | \n", + "6 | \n", + "
| 1435 | \n", + "10.2 | \n", + "0.540 | \n", + "0.37 | \n", + "15.40 | \n", + "0.214 | \n", + "55.0 | \n", + "95.0 | \n", + "1.00369 | \n", + "3.18 | \n", + "0.77 | \n", + "9.0 | \n", + "6 | \n", + "
| 1436 | \n", + "10.0 | \n", + "0.380 | \n", + "0.38 | \n", + "1.60 | \n", + "0.169 | \n", + "27.0 | \n", + "90.0 | \n", + "0.99914 | \n", + "3.15 | \n", + "0.65 | \n", + "8.5 | \n", + "5 | \n", + "
| 1474 | \n", + "9.9 | \n", + "0.500 | \n", + "0.50 | \n", + "13.80 | \n", + "0.205 | \n", + "48.0 | \n", + "82.0 | \n", + "1.00242 | \n", + "3.16 | \n", + "0.75 | \n", + "8.8 | \n", + "5 | \n", + "
| 1476 | \n", + "9.9 | \n", + "0.500 | \n", + "0.50 | \n", + "13.80 | \n", + "0.205 | \n", + "48.0 | \n", + "82.0 | \n", + "1.00242 | \n", + "3.16 | \n", + "0.75 | \n", + "8.8 | \n", + "5 | \n", + "
| 1490 | \n", + "7.1 | \n", + "0.220 | \n", + "0.49 | \n", + "1.80 | \n", + "0.039 | \n", + "8.0 | \n", + "18.0 | \n", + "0.99344 | \n", + "3.39 | \n", + "0.56 | \n", + "12.4 | \n", + "6 | \n", + "
| 1558 | \n", + "6.9 | \n", + "0.630 | \n", + "0.33 | \n", + "6.70 | \n", + "0.235 | \n", + "66.0 | \n", + "115.0 | \n", + "0.99787 | \n", + "3.22 | \n", + "0.56 | \n", + "9.5 | \n", + "5 | \n", + "
| 1570 | \n", + "6.4 | \n", + "0.360 | \n", + "0.53 | \n", + "2.20 | \n", + "0.230 | \n", + "19.0 | \n", + "35.0 | \n", + "0.99340 | \n", + "3.37 | \n", + "0.93 | \n", + "12.4 | \n", + "6 | \n", + "
| 1571 | \n", + "6.4 | \n", + "0.380 | \n", + "0.14 | \n", + "2.20 | \n", + "0.038 | \n", + "15.0 | \n", + "25.0 | \n", + "0.99514 | \n", + "3.44 | \n", + "0.65 | \n", + "11.1 | \n", + "6 | \n", + "
112 rows × 12 columns
\n", + "| \n", + " | fixed acidity | \n", + "volatile acidity | \n", + "citric acid | \n", + "residual sugar | \n", + "chlorides | \n", + "free sulfur dioxide | \n", + "total sulfur dioxide | \n", + "density | \n", + "pH | \n", + "sulphates | \n", + "alcohol | \n", + "quality | \n", + "
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | \n", + "8.9 | \n", + "0.620 | \n", + "0.18 | \n", + "3.80 | \n", + "0.176 | \n", + "52.0 | \n", + "145.0 | \n", + "0.99860 | \n", + "3.16 | \n", + "0.88 | \n", + "9.2 | \n", + "5 | \n", + "
| 15 | \n", + "8.9 | \n", + "0.620 | \n", + "0.19 | \n", + "3.90 | \n", + "0.170 | \n", + "51.0 | \n", + "148.0 | \n", + "0.99860 | \n", + "3.17 | \n", + "0.93 | \n", + "9.2 | \n", + "5 | \n", + "
| 57 | \n", + "7.5 | \n", + "0.630 | \n", + "0.12 | \n", + "5.10 | \n", + "0.111 | \n", + "50.0 | \n", + "110.0 | \n", + "0.99830 | \n", + "3.26 | \n", + "0.77 | \n", + "9.4 | \n", + "5 | \n", + "
| 396 | \n", + "6.6 | \n", + "0.735 | \n", + "0.02 | \n", + "7.90 | \n", + "0.122 | \n", + "68.0 | \n", + "124.0 | \n", + "0.99940 | \n", + "3.47 | \n", + "0.53 | \n", + "9.9 | \n", + "5 | \n", + "
| 400 | \n", + "6.6 | \n", + "0.735 | \n", + "0.02 | \n", + "7.90 | \n", + "0.122 | \n", + "68.0 | \n", + "124.0 | \n", + "0.99940 | \n", + "3.47 | \n", + "0.53 | \n", + "9.9 | \n", + "5 | \n", + "
| 497 | \n", + "7.2 | \n", + "0.340 | \n", + "0.32 | \n", + "2.50 | \n", + "0.090 | \n", + "43.0 | \n", + "113.0 | \n", + "0.99660 | \n", + "3.32 | \n", + "0.79 | \n", + "11.1 | \n", + "5 | \n", + "
| 522 | \n", + "8.2 | \n", + "0.390 | \n", + "0.49 | \n", + "2.30 | \n", + "0.099 | \n", + "47.0 | \n", + "133.0 | \n", + "0.99790 | \n", + "3.38 | \n", + "0.99 | \n", + "9.8 | \n", + "5 | \n", + "
| 584 | \n", + "11.8 | \n", + "0.330 | \n", + "0.49 | \n", + "3.40 | \n", + "0.093 | \n", + "54.0 | \n", + "80.0 | \n", + "1.00020 | \n", + "3.30 | \n", + "0.76 | \n", + "10.7 | \n", + "7 | \n", + "
| 634 | \n", + "7.9 | \n", + "0.350 | \n", + "0.21 | \n", + "1.90 | \n", + "0.073 | \n", + "46.0 | \n", + "102.0 | \n", + "0.99640 | \n", + "3.27 | \n", + "0.58 | \n", + "9.5 | \n", + "5 | \n", + "
| 678 | \n", + "8.3 | \n", + "0.780 | \n", + "0.10 | \n", + "2.60 | \n", + "0.081 | \n", + "45.0 | \n", + "87.0 | \n", + "0.99830 | \n", + "3.48 | \n", + "0.53 | \n", + "10.0 | \n", + "5 | \n", + "
| 925 | \n", + "8.6 | \n", + "0.220 | \n", + "0.36 | \n", + "1.90 | \n", + "0.064 | \n", + "53.0 | \n", + "77.0 | \n", + "0.99604 | \n", + "3.47 | \n", + "0.87 | \n", + "11.0 | \n", + "7 | \n", + "
| 926 | \n", + "9.4 | \n", + "0.240 | \n", + "0.33 | \n", + "2.30 | \n", + "0.061 | \n", + "52.0 | \n", + "73.0 | \n", + "0.99786 | \n", + "3.47 | \n", + "0.90 | \n", + "10.2 | \n", + "6 | \n", + "
| 982 | \n", + "7.3 | \n", + "0.520 | \n", + "0.32 | \n", + "2.10 | \n", + "0.070 | \n", + "51.0 | \n", + "70.0 | \n", + "0.99418 | \n", + "3.34 | \n", + "0.82 | \n", + "12.9 | \n", + "6 | \n", + "
| 1075 | \n", + "9.1 | \n", + "0.250 | \n", + "0.34 | \n", + "2.00 | \n", + "0.071 | \n", + "45.0 | \n", + "67.0 | \n", + "0.99769 | \n", + "3.44 | \n", + "0.86 | \n", + "10.2 | \n", + "7 | \n", + "
| 1131 | \n", + "5.9 | \n", + "0.190 | \n", + "0.21 | \n", + "1.70 | \n", + "0.045 | \n", + "57.0 | \n", + "135.0 | \n", + "0.99341 | \n", + "3.32 | \n", + "0.44 | \n", + "9.5 | \n", + "5 | \n", + "
| 1154 | \n", + "6.6 | \n", + "0.580 | \n", + "0.00 | \n", + "2.20 | \n", + "0.100 | \n", + "50.0 | \n", + "63.0 | \n", + "0.99544 | \n", + "3.59 | \n", + "0.68 | \n", + "11.4 | \n", + "6 | \n", + "
| 1156 | \n", + "8.5 | \n", + "0.180 | \n", + "0.51 | \n", + "1.75 | \n", + "0.071 | \n", + "45.0 | \n", + "88.0 | \n", + "0.99524 | \n", + "3.33 | \n", + "0.76 | \n", + "11.8 | \n", + "7 | \n", + "
| 1175 | \n", + "6.5 | \n", + "0.610 | \n", + "0.00 | \n", + "2.20 | \n", + "0.095 | \n", + "48.0 | \n", + "59.0 | \n", + "0.99541 | \n", + "3.61 | \n", + "0.70 | \n", + "11.5 | \n", + "6 | \n", + "
| 1217 | \n", + "8.2 | \n", + "0.340 | \n", + "0.37 | \n", + "1.90 | \n", + "0.057 | \n", + "43.0 | \n", + "74.0 | \n", + "0.99408 | \n", + "3.23 | \n", + "0.81 | \n", + "12.0 | \n", + "6 | \n", + "
| 1231 | \n", + "7.8 | \n", + "0.815 | \n", + "0.01 | \n", + "2.60 | \n", + "0.074 | \n", + "48.0 | \n", + "90.0 | \n", + "0.99621 | \n", + "3.38 | \n", + "0.62 | \n", + "10.8 | \n", + "5 | \n", + "
| 1244 | \n", + "5.9 | \n", + "0.290 | \n", + "0.25 | \n", + "13.40 | \n", + "0.067 | \n", + "72.0 | \n", + "160.0 | \n", + "0.99721 | \n", + "3.33 | \n", + "0.54 | \n", + "10.3 | \n", + "6 | \n", + "
| 1256 | \n", + "7.5 | \n", + "0.590 | \n", + "0.22 | \n", + "1.80 | \n", + "0.082 | \n", + "43.0 | \n", + "60.0 | \n", + "0.99499 | \n", + "3.10 | \n", + "0.42 | \n", + "9.2 | \n", + "5 | \n", + "
| 1295 | \n", + "6.6 | \n", + "0.630 | \n", + "0.00 | \n", + "4.30 | \n", + "0.093 | \n", + "51.0 | \n", + "77.5 | \n", + "0.99558 | \n", + "3.20 | \n", + "0.45 | \n", + "9.5 | \n", + "5 | \n", + "
| 1296 | \n", + "6.6 | \n", + "0.630 | \n", + "0.00 | \n", + "4.30 | \n", + "0.093 | \n", + "51.0 | \n", + "77.5 | \n", + "0.99558 | \n", + "3.20 | \n", + "0.45 | \n", + "9.5 | \n", + "5 | \n", + "
| 1358 | \n", + "7.4 | \n", + "0.640 | \n", + "0.17 | \n", + "5.40 | \n", + "0.168 | \n", + "52.0 | \n", + "98.0 | \n", + "0.99736 | \n", + "3.28 | \n", + "0.50 | \n", + "9.5 | \n", + "5 | \n", + "
| 1434 | \n", + "10.2 | \n", + "0.540 | \n", + "0.37 | \n", + "15.40 | \n", + "0.214 | \n", + "55.0 | \n", + "95.0 | \n", + "1.00369 | \n", + "3.18 | \n", + "0.77 | \n", + "9.0 | \n", + "6 | \n", + "
| 1435 | \n", + "10.2 | \n", + "0.540 | \n", + "0.37 | \n", + "15.40 | \n", + "0.214 | \n", + "55.0 | \n", + "95.0 | \n", + "1.00369 | \n", + "3.18 | \n", + "0.77 | \n", + "9.0 | \n", + "6 | \n", + "
| 1474 | \n", + "9.9 | \n", + "0.500 | \n", + "0.50 | \n", + "13.80 | \n", + "0.205 | \n", + "48.0 | \n", + "82.0 | \n", + "1.00242 | \n", + "3.16 | \n", + "0.75 | \n", + "8.8 | \n", + "5 | \n", + "
| 1476 | \n", + "9.9 | \n", + "0.500 | \n", + "0.50 | \n", + "13.80 | \n", + "0.205 | \n", + "48.0 | \n", + "82.0 | \n", + "1.00242 | \n", + "3.16 | \n", + "0.75 | \n", + "8.8 | \n", + "5 | \n", + "
| 1558 | \n", + "6.9 | \n", + "0.630 | \n", + "0.33 | \n", + "6.70 | \n", + "0.235 | \n", + "66.0 | \n", + "115.0 | \n", + "0.99787 | \n", + "3.22 | \n", + "0.56 | \n", + "9.5 | \n", + "5 | \n", + "
| \n", + " | fixed acidity | \n", + "volatile acidity | \n", + "citric acid | \n", + "residual sugar | \n", + "chlorides | \n", + "free sulfur dioxide | \n", + "total sulfur dioxide | \n", + "density | \n", + "pH | \n", + "sulphates | \n", + "alcohol | \n", + "quality | \n", + "
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | \n", + "8.9 | \n", + "0.620 | \n", + "0.18 | \n", + "3.8 | \n", + "0.176 | \n", + "52.0 | \n", + "145.0 | \n", + "0.99860 | \n", + "3.16 | \n", + "0.88 | \n", + "9.2 | \n", + "5 | \n", + "
| 15 | \n", + "8.9 | \n", + "0.620 | \n", + "0.19 | \n", + "3.9 | \n", + "0.170 | \n", + "51.0 | \n", + "148.0 | \n", + "0.99860 | \n", + "3.17 | \n", + "0.93 | \n", + "9.2 | \n", + "5 | \n", + "
| 86 | \n", + "8.6 | \n", + "0.490 | \n", + "0.28 | \n", + "1.9 | \n", + "0.110 | \n", + "20.0 | \n", + "136.0 | \n", + "0.99720 | \n", + "2.93 | \n", + "1.95 | \n", + "9.9 | \n", + "6 | \n", + "
| 88 | \n", + "9.3 | \n", + "0.390 | \n", + "0.44 | \n", + "2.1 | \n", + "0.107 | \n", + "34.0 | \n", + "125.0 | \n", + "0.99780 | \n", + "3.14 | \n", + "1.22 | \n", + "9.5 | \n", + "5 | \n", + "
| 90 | \n", + "7.9 | \n", + "0.520 | \n", + "0.26 | \n", + "1.9 | \n", + "0.079 | \n", + "42.0 | \n", + "140.0 | \n", + "0.99640 | \n", + "3.23 | \n", + "0.54 | \n", + "9.5 | \n", + "5 | \n", + "
| 91 | \n", + "8.6 | \n", + "0.490 | \n", + "0.28 | \n", + "1.9 | \n", + "0.110 | \n", + "20.0 | \n", + "136.0 | \n", + "0.99720 | \n", + "2.93 | \n", + "1.95 | \n", + "9.9 | \n", + "6 | \n", + "
| 92 | \n", + "8.6 | \n", + "0.490 | \n", + "0.29 | \n", + "2.0 | \n", + "0.110 | \n", + "19.0 | \n", + "133.0 | \n", + "0.99720 | \n", + "2.93 | \n", + "1.98 | \n", + "9.8 | \n", + "5 | \n", + "
| 109 | \n", + "8.1 | \n", + "0.785 | \n", + "0.52 | \n", + "2.0 | \n", + "0.122 | \n", + "37.0 | \n", + "153.0 | \n", + "0.99690 | \n", + "3.21 | \n", + "0.69 | \n", + "9.3 | \n", + "5 | \n", + "
| 130 | \n", + "8.0 | \n", + "0.745 | \n", + "0.56 | \n", + "2.0 | \n", + "0.118 | \n", + "30.0 | \n", + "134.0 | \n", + "0.99680 | \n", + "3.24 | \n", + "0.66 | \n", + "9.4 | \n", + "5 | \n", + "
| 145 | \n", + "8.1 | \n", + "0.670 | \n", + "0.55 | \n", + "1.8 | \n", + "0.117 | \n", + "32.0 | \n", + "141.0 | \n", + "0.99680 | \n", + "3.17 | \n", + "0.62 | \n", + "9.4 | \n", + "5 | \n", + "
| 154 | \n", + "7.1 | \n", + "0.430 | \n", + "0.42 | \n", + "5.5 | \n", + "0.070 | \n", + "29.0 | \n", + "129.0 | \n", + "0.99730 | \n", + "3.42 | \n", + "0.72 | \n", + "10.5 | \n", + "5 | \n", + "
| 155 | \n", + "7.1 | \n", + "0.430 | \n", + "0.42 | \n", + "5.5 | \n", + "0.071 | \n", + "28.0 | \n", + "128.0 | \n", + "0.99730 | \n", + "3.42 | \n", + "0.71 | \n", + "10.5 | \n", + "5 | \n", + "
| 156 | \n", + "7.1 | \n", + "0.430 | \n", + "0.42 | \n", + "5.5 | \n", + "0.070 | \n", + "29.0 | \n", + "129.0 | \n", + "0.99730 | \n", + "3.42 | \n", + "0.72 | \n", + "10.5 | \n", + "5 | \n", + "
| 157 | \n", + "7.1 | \n", + "0.430 | \n", + "0.42 | \n", + "5.5 | \n", + "0.071 | \n", + "28.0 | \n", + "128.0 | \n", + "0.99730 | \n", + "3.42 | \n", + "0.71 | \n", + "10.5 | \n", + "5 | \n", + "
| 188 | \n", + "7.9 | \n", + "0.500 | \n", + "0.33 | \n", + "2.0 | \n", + "0.084 | \n", + "15.0 | \n", + "143.0 | \n", + "0.99680 | \n", + "3.20 | \n", + "0.55 | \n", + "9.5 | \n", + "5 | \n", + "
| 189 | \n", + "7.9 | \n", + "0.490 | \n", + "0.32 | \n", + "1.9 | \n", + "0.082 | \n", + "17.0 | \n", + "144.0 | \n", + "0.99680 | \n", + "3.20 | \n", + "0.55 | \n", + "9.5 | \n", + "5 | \n", + "
| 190 | \n", + "8.2 | \n", + "0.500 | \n", + "0.35 | \n", + "2.9 | \n", + "0.077 | \n", + "21.0 | \n", + "127.0 | \n", + "0.99760 | \n", + "3.23 | \n", + "0.62 | \n", + "9.4 | \n", + "5 | \n", + "
| 192 | \n", + "6.8 | \n", + "0.630 | \n", + "0.12 | \n", + "3.8 | \n", + "0.099 | \n", + "16.0 | \n", + "126.0 | \n", + "0.99690 | \n", + "3.28 | \n", + "0.61 | \n", + "9.5 | \n", + "5 | \n", + "
| 201 | \n", + "8.8 | \n", + "0.370 | \n", + "0.48 | \n", + "2.1 | \n", + "0.097 | \n", + "39.0 | \n", + "145.0 | \n", + "0.99750 | \n", + "3.04 | \n", + "1.03 | \n", + "9.3 | \n", + "5 | \n", + "
| 219 | \n", + "7.8 | \n", + "0.530 | \n", + "0.33 | \n", + "2.4 | \n", + "0.080 | \n", + "24.0 | \n", + "144.0 | \n", + "0.99655 | \n", + "3.30 | \n", + "0.60 | \n", + "9.5 | \n", + "5 | \n", + "
| 313 | \n", + "8.6 | \n", + "0.470 | \n", + "0.30 | \n", + "3.0 | \n", + "0.076 | \n", + "30.0 | \n", + "135.0 | \n", + "0.99760 | \n", + "3.30 | \n", + "0.53 | \n", + "9.4 | \n", + "5 | \n", + "
| 354 | \n", + "6.1 | \n", + "0.210 | \n", + "0.40 | \n", + "1.4 | \n", + "0.066 | \n", + "40.5 | \n", + "165.0 | \n", + "0.99120 | \n", + "3.25 | \n", + "0.59 | \n", + "11.9 | \n", + "6 | \n", + "
| 396 | \n", + "6.6 | \n", + "0.735 | \n", + "0.02 | \n", + "7.9 | \n", + "0.122 | \n", + "68.0 | \n", + "124.0 | \n", + "0.99940 | \n", + "3.47 | \n", + "0.53 | \n", + "9.9 | \n", + "5 | \n", + "
| 400 | \n", + "6.6 | \n", + "0.735 | \n", + "0.02 | \n", + "7.9 | \n", + "0.122 | \n", + "68.0 | \n", + "124.0 | \n", + "0.99940 | \n", + "3.47 | \n", + "0.53 | \n", + "9.9 | \n", + "5 | \n", + "
| 415 | \n", + "8.6 | \n", + "0.725 | \n", + "0.24 | \n", + "6.6 | \n", + "0.117 | \n", + "31.0 | \n", + "134.0 | \n", + "1.00140 | \n", + "3.32 | \n", + "1.07 | \n", + "9.3 | \n", + "5 | \n", + "
| 417 | \n", + "7.0 | \n", + "0.580 | \n", + "0.12 | \n", + "1.9 | \n", + "0.091 | \n", + "34.0 | \n", + "124.0 | \n", + "0.99560 | \n", + "3.44 | \n", + "0.48 | \n", + "10.5 | \n", + "5 | \n", + "
| 463 | \n", + "8.1 | \n", + "0.660 | \n", + "0.70 | \n", + "2.2 | \n", + "0.098 | \n", + "25.0 | \n", + "129.0 | \n", + "0.99720 | \n", + "3.08 | \n", + "0.53 | \n", + "9.0 | \n", + "5 | \n", + "
| 515 | \n", + "8.5 | \n", + "0.655 | \n", + "0.49 | \n", + "6.1 | \n", + "0.122 | \n", + "34.0 | \n", + "151.0 | \n", + "1.00100 | \n", + "3.31 | \n", + "1.14 | \n", + "9.3 | \n", + "5 | \n", + "
| 522 | \n", + "8.2 | \n", + "0.390 | \n", + "0.49 | \n", + "2.3 | \n", + "0.099 | \n", + "47.0 | \n", + "133.0 | \n", + "0.99790 | \n", + "3.38 | \n", + "0.99 | \n", + "9.8 | \n", + "5 | \n", + "
| 523 | \n", + "9.3 | \n", + "0.400 | \n", + "0.49 | \n", + "2.5 | \n", + "0.085 | \n", + "38.0 | \n", + "142.0 | \n", + "0.99780 | \n", + "3.22 | \n", + "0.55 | \n", + "9.4 | \n", + "5 | \n", + "
| 591 | \n", + "6.6 | \n", + "0.390 | \n", + "0.49 | \n", + "1.7 | \n", + "0.070 | \n", + "23.0 | \n", + "149.0 | \n", + "0.99220 | \n", + "3.12 | \n", + "0.50 | \n", + "11.5 | \n", + "6 | \n", + "
| 636 | \n", + "9.6 | \n", + "0.880 | \n", + "0.28 | \n", + "2.4 | \n", + "0.086 | \n", + "30.0 | \n", + "147.0 | \n", + "0.99790 | \n", + "3.24 | \n", + "0.53 | \n", + "9.4 | \n", + "5 | \n", + "
| 637 | \n", + "9.5 | \n", + "0.885 | \n", + "0.27 | \n", + "2.3 | \n", + "0.084 | \n", + "31.0 | \n", + "145.0 | \n", + "0.99780 | \n", + "3.24 | \n", + "0.53 | \n", + "9.4 | \n", + "5 | \n", + "
| 649 | \n", + "6.7 | \n", + "0.420 | \n", + "0.27 | \n", + "8.6 | \n", + "0.068 | \n", + "24.0 | \n", + "148.0 | \n", + "0.99480 | \n", + "3.16 | \n", + "0.57 | \n", + "11.3 | \n", + "6 | \n", + "
| 651 | \n", + "9.8 | \n", + "0.880 | \n", + "0.25 | \n", + "2.5 | \n", + "0.104 | \n", + "35.0 | \n", + "155.0 | \n", + "1.00100 | \n", + "3.41 | \n", + "0.67 | \n", + "11.2 | \n", + "5 | \n", + "
| 672 | \n", + "9.8 | \n", + "1.240 | \n", + "0.34 | \n", + "2.0 | \n", + "0.079 | \n", + "32.0 | \n", + "151.0 | \n", + "0.99800 | \n", + "3.15 | \n", + "0.53 | \n", + "9.5 | \n", + "5 | \n", + "
| 684 | \n", + "9.8 | \n", + "0.980 | \n", + "0.32 | \n", + "2.3 | \n", + "0.078 | \n", + "35.0 | \n", + "152.0 | \n", + "0.99800 | \n", + "3.25 | \n", + "0.48 | \n", + "9.4 | \n", + "5 | \n", + "
| 694 | \n", + "9.0 | \n", + "0.470 | \n", + "0.31 | \n", + "2.7 | \n", + "0.084 | \n", + "24.0 | \n", + "125.0 | \n", + "0.99840 | \n", + "3.31 | \n", + "0.61 | \n", + "9.4 | \n", + "5 | \n", + "
| 723 | \n", + "7.1 | \n", + "0.310 | \n", + "0.30 | \n", + "2.2 | \n", + "0.053 | \n", + "36.0 | \n", + "127.0 | \n", + "0.99650 | \n", + "2.94 | \n", + "1.62 | \n", + "9.5 | \n", + "5 | \n", + "
| 741 | \n", + "9.2 | \n", + "0.530 | \n", + "0.24 | \n", + "2.6 | \n", + "0.078 | \n", + "28.0 | \n", + "139.0 | \n", + "0.99788 | \n", + "3.21 | \n", + "0.57 | \n", + "9.5 | \n", + "5 | \n", + "
| 771 | \n", + "9.4 | \n", + "0.685 | \n", + "0.26 | \n", + "2.4 | \n", + "0.082 | \n", + "23.0 | \n", + "143.0 | \n", + "0.99780 | \n", + "3.28 | \n", + "0.55 | \n", + "9.4 | \n", + "5 | \n", + "
| 772 | \n", + "9.5 | \n", + "0.570 | \n", + "0.27 | \n", + "2.3 | \n", + "0.082 | \n", + "23.0 | \n", + "144.0 | \n", + "0.99782 | \n", + "3.27 | \n", + "0.55 | \n", + "9.4 | \n", + "5 | \n", + "
| 791 | \n", + "8.8 | \n", + "0.640 | \n", + "0.17 | \n", + "2.9 | \n", + "0.084 | \n", + "25.0 | \n", + "130.0 | \n", + "0.99818 | \n", + "3.23 | \n", + "0.54 | \n", + "9.6 | \n", + "5 | \n", + "
| 1079 | \n", + "7.9 | \n", + "0.300 | \n", + "0.68 | \n", + "8.3 | \n", + "0.050 | \n", + "37.5 | \n", + "278.0 | \n", + "0.99316 | \n", + "3.01 | \n", + "0.51 | \n", + "12.3 | \n", + "7 | \n", + "
| 1081 | \n", + "7.9 | \n", + "0.300 | \n", + "0.68 | \n", + "8.3 | \n", + "0.050 | \n", + "37.5 | \n", + "289.0 | \n", + "0.99316 | \n", + "3.01 | \n", + "0.51 | \n", + "12.3 | \n", + "7 | \n", + "
| 1131 | \n", + "5.9 | \n", + "0.190 | \n", + "0.21 | \n", + "1.7 | \n", + "0.045 | \n", + "57.0 | \n", + "135.0 | \n", + "0.99341 | \n", + "3.32 | \n", + "0.44 | \n", + "9.5 | \n", + "5 | \n", + "
| 1244 | \n", + "5.9 | \n", + "0.290 | \n", + "0.25 | \n", + "13.4 | \n", + "0.067 | \n", + "72.0 | \n", + "160.0 | \n", + "0.99721 | \n", + "3.33 | \n", + "0.54 | \n", + "10.3 | \n", + "6 | \n", + "
| 1400 | \n", + "7.9 | \n", + "0.690 | \n", + "0.21 | \n", + "2.1 | \n", + "0.080 | \n", + "33.0 | \n", + "141.0 | \n", + "0.99620 | \n", + "3.25 | \n", + "0.51 | \n", + "9.9 | \n", + "5 | \n", + "
| 1401 | \n", + "7.9 | \n", + "0.690 | \n", + "0.21 | \n", + "2.1 | \n", + "0.080 | \n", + "33.0 | \n", + "141.0 | \n", + "0.99620 | \n", + "3.25 | \n", + "0.51 | \n", + "9.9 | \n", + "5 | \n", + "
| 1419 | \n", + "7.7 | \n", + "0.640 | \n", + "0.21 | \n", + "2.2 | \n", + "0.077 | \n", + "32.0 | \n", + "133.0 | \n", + "0.99560 | \n", + "3.27 | \n", + "0.45 | \n", + "9.9 | \n", + "5 | \n", + "
| 1493 | \n", + "7.7 | \n", + "0.540 | \n", + "0.26 | \n", + "1.9 | \n", + "0.089 | \n", + "23.0 | \n", + "147.0 | \n", + "0.99636 | \n", + "3.26 | \n", + "0.59 | \n", + "9.7 | \n", + "5 | \n", + "
| 1496 | \n", + "7.7 | \n", + "0.540 | \n", + "0.26 | \n", + "1.9 | \n", + "0.089 | \n", + "23.0 | \n", + "147.0 | \n", + "0.99636 | \n", + "3.26 | \n", + "0.59 | \n", + "9.7 | \n", + "5 | \n", + "
| 1559 | \n", + "7.8 | \n", + "0.600 | \n", + "0.26 | \n", + "2.0 | \n", + "0.080 | \n", + "31.0 | \n", + "131.0 | \n", + "0.99622 | \n", + "3.21 | \n", + "0.52 | \n", + "9.9 | \n", + "5 | \n", + "
| 1560 | \n", + "7.8 | \n", + "0.600 | \n", + "0.26 | \n", + "2.0 | \n", + "0.080 | \n", + "31.0 | \n", + "131.0 | \n", + "0.99622 | \n", + "3.21 | \n", + "0.52 | \n", + "9.9 | \n", + "5 | \n", + "
| 1561 | \n", + "7.8 | \n", + "0.600 | \n", + "0.26 | \n", + "2.0 | \n", + "0.080 | \n", + "31.0 | \n", + "131.0 | \n", + "0.99622 | \n", + "3.21 | \n", + "0.52 | \n", + "9.9 | \n", + "5 | \n", + "
| \n", + " | fixed acidity | \n", + "volatile acidity | \n", + "citric acid | \n", + "residual sugar | \n", + "chlorides | \n", + "free sulfur dioxide | \n", + "total sulfur dioxide | \n", + "density | \n", + "pH | \n", + "sulphates | \n", + "alcohol | \n", + "quality | \n", + "
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | \n", + "5.2 | \n", + "0.340 | \n", + "0.00 | \n", + "1.80 | \n", + "0.050 | \n", + "27.0 | \n", + "63.0 | \n", + "0.99160 | \n", + "3.68 | \n", + "0.79 | \n", + "14.000000 | \n", + "6 | \n", + "
| 144 | \n", + "5.2 | \n", + "0.340 | \n", + "0.00 | \n", + "1.80 | \n", + "0.050 | \n", + "27.0 | \n", + "63.0 | \n", + "0.99160 | \n", + "3.68 | \n", + "0.79 | \n", + "14.000000 | \n", + "6 | \n", + "
| 294 | \n", + "13.3 | \n", + "0.340 | \n", + "0.52 | \n", + "3.20 | \n", + "0.094 | \n", + "17.0 | \n", + "53.0 | \n", + "1.00140 | \n", + "3.05 | \n", + "0.81 | \n", + "9.500000 | \n", + "6 | \n", + "
| 324 | \n", + "10.0 | \n", + "0.490 | \n", + "0.20 | \n", + "11.00 | \n", + "0.071 | \n", + "13.0 | \n", + "50.0 | \n", + "1.00150 | \n", + "3.16 | \n", + "0.69 | \n", + "9.200000 | \n", + "6 | \n", + "
| 325 | \n", + "10.0 | \n", + "0.490 | \n", + "0.20 | \n", + "11.00 | \n", + "0.071 | \n", + "13.0 | \n", + "50.0 | \n", + "1.00150 | \n", + "3.16 | \n", + "0.69 | \n", + "9.200000 | \n", + "6 | \n", + "
| 353 | \n", + "13.5 | \n", + "0.530 | \n", + "0.79 | \n", + "4.80 | \n", + "0.120 | \n", + "23.0 | \n", + "77.0 | \n", + "1.00180 | \n", + "3.18 | \n", + "0.77 | \n", + "13.000000 | \n", + "5 | \n", + "
| 354 | \n", + "6.1 | \n", + "0.210 | \n", + "0.40 | \n", + "1.40 | \n", + "0.066 | \n", + "40.5 | \n", + "165.0 | \n", + "0.99120 | \n", + "3.25 | \n", + "0.59 | \n", + "11.900000 | \n", + "6 | \n", + "
| 364 | \n", + "12.8 | \n", + "0.615 | \n", + "0.66 | \n", + "5.80 | \n", + "0.083 | \n", + "7.0 | \n", + "42.0 | \n", + "1.00220 | \n", + "3.07 | \n", + "0.73 | \n", + "10.000000 | \n", + "7 | \n", + "
| 366 | \n", + "12.8 | \n", + "0.615 | \n", + "0.66 | \n", + "5.80 | \n", + "0.083 | \n", + "7.0 | \n", + "42.0 | \n", + "1.00220 | \n", + "3.07 | \n", + "0.73 | \n", + "10.000000 | \n", + "7 | \n", + "
| 374 | \n", + "14.0 | \n", + "0.410 | \n", + "0.63 | \n", + "3.80 | \n", + "0.089 | \n", + "6.0 | \n", + "47.0 | \n", + "1.00140 | \n", + "3.01 | \n", + "0.81 | \n", + "10.800000 | \n", + "6 | \n", + "
| 381 | \n", + "13.7 | \n", + "0.415 | \n", + "0.68 | \n", + "2.90 | \n", + "0.085 | \n", + "17.0 | \n", + "43.0 | \n", + "1.00140 | \n", + "3.06 | \n", + "0.80 | \n", + "10.000000 | \n", + "6 | \n", + "
| 391 | \n", + "13.7 | \n", + "0.415 | \n", + "0.68 | \n", + "2.90 | \n", + "0.085 | \n", + "17.0 | \n", + "43.0 | \n", + "1.00140 | \n", + "3.06 | \n", + "0.80 | \n", + "10.000000 | \n", + "6 | \n", + "
| 415 | \n", + "8.6 | \n", + "0.725 | \n", + "0.24 | \n", + "6.60 | \n", + "0.117 | \n", + "31.0 | \n", + "134.0 | \n", + "1.00140 | \n", + "3.32 | \n", + "1.07 | \n", + "9.300000 | \n", + "5 | \n", + "
| 442 | \n", + "15.6 | \n", + "0.685 | \n", + "0.76 | \n", + "3.70 | \n", + "0.100 | \n", + "6.0 | \n", + "43.0 | \n", + "1.00320 | \n", + "2.95 | \n", + "0.68 | \n", + "11.200000 | \n", + "7 | \n", + "
| 480 | \n", + "10.6 | \n", + "0.280 | \n", + "0.39 | \n", + "15.50 | \n", + "0.069 | \n", + "6.0 | \n", + "23.0 | \n", + "1.00260 | \n", + "3.12 | \n", + "0.66 | \n", + "9.200000 | \n", + "5 | \n", + "
| 538 | \n", + "12.9 | \n", + "0.350 | \n", + "0.49 | \n", + "5.80 | \n", + "0.066 | \n", + "5.0 | \n", + "35.0 | \n", + "1.00140 | \n", + "3.20 | \n", + "0.66 | \n", + "12.000000 | \n", + "7 | \n", + "
| 554 | \n", + "15.5 | \n", + "0.645 | \n", + "0.49 | \n", + "4.20 | \n", + "0.095 | \n", + "10.0 | \n", + "23.0 | \n", + "1.00315 | \n", + "2.92 | \n", + "0.74 | \n", + "11.100000 | \n", + "5 | \n", + "
| 555 | \n", + "15.5 | \n", + "0.645 | \n", + "0.49 | \n", + "4.20 | \n", + "0.095 | \n", + "10.0 | \n", + "23.0 | \n", + "1.00315 | \n", + "2.92 | \n", + "0.74 | \n", + "11.100000 | \n", + "5 | \n", + "
| 557 | \n", + "15.6 | \n", + "0.645 | \n", + "0.49 | \n", + "4.20 | \n", + "0.095 | \n", + "10.0 | \n", + "23.0 | \n", + "1.00315 | \n", + "2.92 | \n", + "0.74 | \n", + "11.100000 | \n", + "5 | \n", + "
| 559 | \n", + "13.0 | \n", + "0.470 | \n", + "0.49 | \n", + "4.30 | \n", + "0.085 | \n", + "6.0 | \n", + "47.0 | \n", + "1.00210 | \n", + "3.30 | \n", + "0.68 | \n", + "12.700000 | \n", + "6 | \n", + "
| 564 | \n", + "13.0 | \n", + "0.470 | \n", + "0.49 | \n", + "4.30 | \n", + "0.085 | \n", + "6.0 | \n", + "47.0 | \n", + "1.00210 | \n", + "3.30 | \n", + "0.68 | \n", + "12.700000 | \n", + "6 | \n", + "
| 588 | \n", + "5.0 | \n", + "0.420 | \n", + "0.24 | \n", + "2.00 | \n", + "0.060 | \n", + "19.0 | \n", + "50.0 | \n", + "0.99170 | \n", + "3.72 | \n", + "0.74 | \n", + "14.000000 | \n", + "8 | \n", + "
| 591 | \n", + "6.6 | \n", + "0.390 | \n", + "0.49 | \n", + "1.70 | \n", + "0.070 | \n", + "23.0 | \n", + "149.0 | \n", + "0.99220 | \n", + "3.12 | \n", + "0.50 | \n", + "11.500000 | \n", + "6 | \n", + "
| 608 | \n", + "10.1 | \n", + "0.650 | \n", + "0.37 | \n", + "5.10 | \n", + "0.110 | \n", + "11.0 | \n", + "65.0 | \n", + "1.00260 | \n", + "3.32 | \n", + "0.64 | \n", + "10.400000 | \n", + "6 | \n", + "
| 695 | \n", + "5.1 | \n", + "0.470 | \n", + "0.02 | \n", + "1.30 | \n", + "0.034 | \n", + "18.0 | \n", + "44.0 | \n", + "0.99210 | \n", + "3.90 | \n", + "0.62 | \n", + "12.800000 | \n", + "6 | \n", + "
| 821 | \n", + "4.9 | \n", + "0.420 | \n", + "0.00 | \n", + "2.10 | \n", + "0.048 | \n", + "16.0 | \n", + "42.0 | \n", + "0.99154 | \n", + "3.71 | \n", + "0.74 | \n", + "14.000000 | \n", + "7 | \n", + "
| 836 | \n", + "6.7 | \n", + "0.280 | \n", + "0.28 | \n", + "2.40 | \n", + "0.012 | \n", + "36.0 | \n", + "100.0 | \n", + "0.99064 | \n", + "3.26 | \n", + "0.39 | \n", + "11.700000 | \n", + "7 | \n", + "
| 837 | \n", + "6.7 | \n", + "0.280 | \n", + "0.28 | \n", + "2.40 | \n", + "0.012 | \n", + "36.0 | \n", + "100.0 | \n", + "0.99064 | \n", + "3.26 | \n", + "0.39 | \n", + "11.700000 | \n", + "7 | \n", + "
| 889 | \n", + "10.7 | \n", + "0.900 | \n", + "0.34 | \n", + "6.60 | \n", + "0.112 | \n", + "23.0 | \n", + "99.0 | \n", + "1.00289 | \n", + "3.22 | \n", + "0.68 | \n", + "9.300000 | \n", + "5 | \n", + "
| 999 | \n", + "6.4 | \n", + "0.690 | \n", + "0.00 | \n", + "1.65 | \n", + "0.055 | \n", + "7.0 | \n", + "12.0 | \n", + "0.99162 | \n", + "3.47 | \n", + "0.53 | \n", + "12.900000 | \n", + "6 | \n", + "
| 1017 | \n", + "8.0 | \n", + "0.180 | \n", + "0.37 | \n", + "0.90 | \n", + "0.049 | \n", + "36.0 | \n", + "109.0 | \n", + "0.99007 | \n", + "2.89 | \n", + "0.44 | \n", + "12.700000 | \n", + "6 | \n", + "
| 1018 | \n", + "8.0 | \n", + "0.180 | \n", + "0.37 | \n", + "0.90 | \n", + "0.049 | \n", + "36.0 | \n", + "109.0 | \n", + "0.99007 | \n", + "2.89 | \n", + "0.44 | \n", + "12.700000 | \n", + "6 | \n", + "
| 1114 | \n", + "5.0 | \n", + "0.400 | \n", + "0.50 | \n", + "4.30 | \n", + "0.046 | \n", + "29.0 | \n", + "80.0 | \n", + "0.99020 | \n", + "3.49 | \n", + "0.66 | \n", + "13.600000 | \n", + "6 | \n", + "
| 1122 | \n", + "6.3 | \n", + "0.470 | \n", + "0.00 | \n", + "1.40 | \n", + "0.055 | \n", + "27.0 | \n", + "33.0 | \n", + "0.99220 | \n", + "3.45 | \n", + "0.48 | \n", + "12.300000 | \n", + "6 | \n", + "
| 1126 | \n", + "5.8 | \n", + "0.290 | \n", + "0.26 | \n", + "1.70 | \n", + "0.063 | \n", + "3.0 | \n", + "11.0 | \n", + "0.99150 | \n", + "3.39 | \n", + "0.54 | \n", + "13.500000 | \n", + "6 | \n", + "
| 1228 | \n", + "5.1 | \n", + "0.420 | \n", + "0.00 | \n", + "1.80 | \n", + "0.044 | \n", + "18.0 | \n", + "88.0 | \n", + "0.99157 | \n", + "3.68 | \n", + "0.73 | \n", + "13.600000 | \n", + "7 | \n", + "
| 1269 | \n", + "5.5 | \n", + "0.490 | \n", + "0.03 | \n", + "1.80 | \n", + "0.044 | \n", + "28.0 | \n", + "87.0 | \n", + "0.99080 | \n", + "3.50 | \n", + "0.82 | \n", + "14.000000 | \n", + "8 | \n", + "
| 1270 | \n", + "5.0 | \n", + "0.380 | \n", + "0.01 | \n", + "1.60 | \n", + "0.048 | \n", + "26.0 | \n", + "60.0 | \n", + "0.99084 | \n", + "3.70 | \n", + "0.75 | \n", + "14.000000 | \n", + "6 | \n", + "
| 1298 | \n", + "5.7 | \n", + "0.600 | \n", + "0.00 | \n", + "1.40 | \n", + "0.063 | \n", + "11.0 | \n", + "18.0 | \n", + "0.99191 | \n", + "3.45 | \n", + "0.56 | \n", + "12.200000 | \n", + "6 | \n", + "
| 1434 | \n", + "10.2 | \n", + "0.540 | \n", + "0.37 | \n", + "15.40 | \n", + "0.214 | \n", + "55.0 | \n", + "95.0 | \n", + "1.00369 | \n", + "3.18 | \n", + "0.77 | \n", + "9.000000 | \n", + "6 | \n", + "
| 1435 | \n", + "10.2 | \n", + "0.540 | \n", + "0.37 | \n", + "15.40 | \n", + "0.214 | \n", + "55.0 | \n", + "95.0 | \n", + "1.00369 | \n", + "3.18 | \n", + "0.77 | \n", + "9.000000 | \n", + "6 | \n", + "
| 1474 | \n", + "9.9 | \n", + "0.500 | \n", + "0.50 | \n", + "13.80 | \n", + "0.205 | \n", + "48.0 | \n", + "82.0 | \n", + "1.00242 | \n", + "3.16 | \n", + "0.75 | \n", + "8.800000 | \n", + "5 | \n", + "
| 1475 | \n", + "5.3 | \n", + "0.470 | \n", + "0.11 | \n", + "2.20 | \n", + "0.048 | \n", + "16.0 | \n", + "89.0 | \n", + "0.99182 | \n", + "3.54 | \n", + "0.88 | \n", + "13.566667 | \n", + "7 | \n", + "
| 1476 | \n", + "9.9 | \n", + "0.500 | \n", + "0.50 | \n", + "13.80 | \n", + "0.205 | \n", + "48.0 | \n", + "82.0 | \n", + "1.00242 | \n", + "3.16 | \n", + "0.75 | \n", + "8.800000 | \n", + "5 | \n", + "
| 1477 | \n", + "5.3 | \n", + "0.470 | \n", + "0.11 | \n", + "2.20 | \n", + "0.048 | \n", + "16.0 | \n", + "89.0 | \n", + "0.99182 | \n", + "3.54 | \n", + "0.88 | \n", + "13.600000 | \n", + "7 | \n", + "
| \n", + " | fixed acidity | \n", + "volatile acidity | \n", + "citric acid | \n", + "residual sugar | \n", + "chlorides | \n", + "free sulfur dioxide | \n", + "total sulfur dioxide | \n", + "density | \n", + "pH | \n", + "sulphates | \n", + "alcohol | \n", + "quality | \n", + "
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45 | \n", + "4.6 | \n", + "0.520 | \n", + "0.15 | \n", + "2.10 | \n", + "0.054 | \n", + "8.0 | \n", + "65.0 | \n", + "0.99340 | \n", + "3.90 | \n", + "0.56 | \n", + "13.1 | \n", + "4 | \n", + "
| 94 | \n", + "5.0 | \n", + "1.020 | \n", + "0.04 | \n", + "1.40 | \n", + "0.045 | \n", + "41.0 | \n", + "85.0 | \n", + "0.99380 | \n", + "3.75 | \n", + "0.48 | \n", + "10.5 | \n", + "4 | \n", + "
| 95 | \n", + "4.7 | \n", + "0.600 | \n", + "0.17 | \n", + "2.30 | \n", + "0.058 | \n", + "17.0 | \n", + "106.0 | \n", + "0.99320 | \n", + "3.85 | \n", + "0.60 | \n", + "12.9 | \n", + "6 | \n", + "
| 151 | \n", + "9.2 | \n", + "0.520 | \n", + "1.00 | \n", + "3.40 | \n", + "0.610 | \n", + "32.0 | \n", + "69.0 | \n", + "0.99960 | \n", + "2.74 | \n", + "2.00 | \n", + "9.4 | \n", + "4 | \n", + "
| 268 | \n", + "6.9 | \n", + "0.540 | \n", + "0.04 | \n", + "3.00 | \n", + "0.077 | \n", + "7.0 | \n", + "27.0 | \n", + "0.99870 | \n", + "3.69 | \n", + "0.91 | \n", + "9.4 | \n", + "6 | \n", + "
| 276 | \n", + "6.9 | \n", + "0.540 | \n", + "0.04 | \n", + "3.00 | \n", + "0.077 | \n", + "7.0 | \n", + "27.0 | \n", + "0.99870 | \n", + "3.69 | \n", + "0.91 | \n", + "9.4 | \n", + "6 | \n", + "
| 440 | \n", + "12.6 | \n", + "0.310 | \n", + "0.72 | \n", + "2.20 | \n", + "0.072 | \n", + "6.0 | \n", + "29.0 | \n", + "0.99870 | \n", + "2.88 | \n", + "0.82 | \n", + "9.8 | \n", + "8 | \n", + "
| 544 | \n", + "14.3 | \n", + "0.310 | \n", + "0.74 | \n", + "1.80 | \n", + "0.075 | \n", + "6.0 | \n", + "15.0 | \n", + "1.00080 | \n", + "2.86 | \n", + "0.79 | \n", + "8.4 | \n", + "6 | \n", + "
| 553 | \n", + "5.0 | \n", + "1.040 | \n", + "0.24 | \n", + "1.60 | \n", + "0.050 | \n", + "32.0 | \n", + "96.0 | \n", + "0.99340 | \n", + "3.74 | \n", + "0.62 | \n", + "11.5 | \n", + "5 | \n", + "
| 554 | \n", + "15.5 | \n", + "0.645 | \n", + "0.49 | \n", + "4.20 | \n", + "0.095 | \n", + "10.0 | \n", + "23.0 | \n", + "1.00315 | \n", + "2.92 | \n", + "0.74 | \n", + "11.1 | \n", + "5 | \n", + "
| 555 | \n", + "15.5 | \n", + "0.645 | \n", + "0.49 | \n", + "4.20 | \n", + "0.095 | \n", + "10.0 | \n", + "23.0 | \n", + "1.00315 | \n", + "2.92 | \n", + "0.74 | \n", + "11.1 | \n", + "5 | \n", + "
| 557 | \n", + "15.6 | \n", + "0.645 | \n", + "0.49 | \n", + "4.20 | \n", + "0.095 | \n", + "10.0 | \n", + "23.0 | \n", + "1.00315 | \n", + "2.92 | \n", + "0.74 | \n", + "11.1 | \n", + "5 | \n", + "
| 588 | \n", + "5.0 | \n", + "0.420 | \n", + "0.24 | \n", + "2.00 | \n", + "0.060 | \n", + "19.0 | \n", + "50.0 | \n", + "0.99170 | \n", + "3.72 | \n", + "0.74 | \n", + "14.0 | \n", + "8 | \n", + "
| 614 | \n", + "9.2 | \n", + "0.755 | \n", + "0.18 | \n", + "2.20 | \n", + "0.148 | \n", + "10.0 | \n", + "103.0 | \n", + "0.99690 | \n", + "2.87 | \n", + "1.36 | \n", + "10.2 | \n", + "6 | \n", + "
| 650 | \n", + "10.7 | \n", + "0.430 | \n", + "0.39 | \n", + "2.20 | \n", + "0.106 | \n", + "8.0 | \n", + "32.0 | \n", + "0.99860 | \n", + "2.89 | \n", + "0.50 | \n", + "9.6 | \n", + "5 | \n", + "
| 656 | \n", + "10.7 | \n", + "0.430 | \n", + "0.39 | \n", + "2.20 | \n", + "0.106 | \n", + "8.0 | \n", + "32.0 | \n", + "0.99860 | \n", + "2.89 | \n", + "0.50 | \n", + "9.6 | \n", + "5 | \n", + "
| 657 | \n", + "12.0 | \n", + "0.500 | \n", + "0.59 | \n", + "1.40 | \n", + "0.073 | \n", + "23.0 | \n", + "42.0 | \n", + "0.99800 | \n", + "2.92 | \n", + "0.68 | \n", + "10.5 | \n", + "7 | \n", + "
| 695 | \n", + "5.1 | \n", + "0.470 | \n", + "0.02 | \n", + "1.30 | \n", + "0.034 | \n", + "18.0 | \n", + "44.0 | \n", + "0.99210 | \n", + "3.90 | \n", + "0.62 | \n", + "12.8 | \n", + "6 | \n", + "
| 821 | \n", + "4.9 | \n", + "0.420 | \n", + "0.00 | \n", + "2.10 | \n", + "0.048 | \n", + "16.0 | \n", + "42.0 | \n", + "0.99154 | \n", + "3.71 | \n", + "0.74 | \n", + "14.0 | \n", + "7 | \n", + "
| 930 | \n", + "6.6 | \n", + "0.610 | \n", + "0.01 | \n", + "1.90 | \n", + "0.080 | \n", + "8.0 | \n", + "25.0 | \n", + "0.99746 | \n", + "3.69 | \n", + "0.73 | \n", + "10.5 | \n", + "5 | \n", + "
| 934 | \n", + "6.6 | \n", + "0.610 | \n", + "0.01 | \n", + "1.90 | \n", + "0.080 | \n", + "8.0 | \n", + "25.0 | \n", + "0.99746 | \n", + "3.69 | \n", + "0.73 | \n", + "10.5 | \n", + "5 | \n", + "
| 996 | \n", + "5.6 | \n", + "0.660 | \n", + "0.00 | \n", + "2.20 | \n", + "0.087 | \n", + "3.0 | \n", + "11.0 | \n", + "0.99378 | \n", + "3.71 | \n", + "0.63 | \n", + "12.8 | \n", + "7 | \n", + "
| 997 | \n", + "5.6 | \n", + "0.660 | \n", + "0.00 | \n", + "2.20 | \n", + "0.087 | \n", + "3.0 | \n", + "11.0 | \n", + "0.99378 | \n", + "3.71 | \n", + "0.63 | \n", + "12.8 | \n", + "7 | \n", + "
| 1017 | \n", + "8.0 | \n", + "0.180 | \n", + "0.37 | \n", + "0.90 | \n", + "0.049 | \n", + "36.0 | \n", + "109.0 | \n", + "0.99007 | \n", + "2.89 | \n", + "0.44 | \n", + "12.7 | \n", + "6 | \n", + "
| 1018 | \n", + "8.0 | \n", + "0.180 | \n", + "0.37 | \n", + "0.90 | \n", + "0.049 | \n", + "36.0 | \n", + "109.0 | \n", + "0.99007 | \n", + "2.89 | \n", + "0.44 | \n", + "12.7 | \n", + "6 | \n", + "
| 1111 | \n", + "5.4 | \n", + "0.420 | \n", + "0.27 | \n", + "2.00 | \n", + "0.092 | \n", + "23.0 | \n", + "55.0 | \n", + "0.99471 | \n", + "3.78 | \n", + "0.64 | \n", + "12.3 | \n", + "7 | \n", + "
| 1270 | \n", + "5.0 | \n", + "0.380 | \n", + "0.01 | \n", + "1.60 | \n", + "0.048 | \n", + "26.0 | \n", + "60.0 | \n", + "0.99084 | \n", + "3.70 | \n", + "0.75 | \n", + "14.0 | \n", + "6 | \n", + "
| 1300 | \n", + "5.2 | \n", + "0.645 | \n", + "0.00 | \n", + "2.15 | \n", + "0.080 | \n", + "15.0 | \n", + "28.0 | \n", + "0.99444 | \n", + "3.78 | \n", + "0.61 | \n", + "12.5 | \n", + "6 | \n", + "
| 1316 | \n", + "5.4 | \n", + "0.740 | \n", + "0.00 | \n", + "1.20 | \n", + "0.041 | \n", + "16.0 | \n", + "46.0 | \n", + "0.99258 | \n", + "4.01 | \n", + "0.59 | \n", + "12.5 | \n", + "6 | \n", + "
| 1319 | \n", + "9.1 | \n", + "0.760 | \n", + "0.68 | \n", + "1.70 | \n", + "0.414 | \n", + "18.0 | \n", + "64.0 | \n", + "0.99652 | \n", + "2.90 | \n", + "1.33 | \n", + "9.1 | \n", + "6 | \n", + "
| 1321 | \n", + "5.0 | \n", + "0.740 | \n", + "0.00 | \n", + "1.20 | \n", + "0.041 | \n", + "16.0 | \n", + "46.0 | \n", + "0.99258 | \n", + "4.01 | \n", + "0.59 | \n", + "12.5 | \n", + "6 | \n", + "
| 1377 | \n", + "5.2 | \n", + "0.490 | \n", + "0.26 | \n", + "2.30 | \n", + "0.090 | \n", + "23.0 | \n", + "74.0 | \n", + "0.99530 | \n", + "3.71 | \n", + "0.62 | \n", + "12.2 | \n", + "6 | \n", + "
| 1470 | \n", + "10.0 | \n", + "0.690 | \n", + "0.11 | \n", + "1.40 | \n", + "0.084 | \n", + "8.0 | \n", + "24.0 | \n", + "0.99578 | \n", + "2.88 | \n", + "0.47 | \n", + "9.7 | \n", + "5 | \n", + "
| 1488 | \n", + "5.6 | \n", + "0.540 | \n", + "0.04 | \n", + "1.70 | \n", + "0.049 | \n", + "5.0 | \n", + "13.0 | \n", + "0.99420 | \n", + "3.72 | \n", + "0.58 | \n", + "11.4 | \n", + "5 | \n", + "
| 1491 | \n", + "5.6 | \n", + "0.540 | \n", + "0.04 | \n", + "1.70 | \n", + "0.049 | \n", + "5.0 | \n", + "13.0 | \n", + "0.99420 | \n", + "3.72 | \n", + "0.58 | \n", + "11.4 | \n", + "5 | \n", + "
| \n", + " | fixed acidity | \n", + "volatile acidity | \n", + "citric acid | \n", + "residual sugar | \n", + "chlorides | \n", + "free sulfur dioxide | \n", + "total sulfur dioxide | \n", + "density | \n", + "pH | \n", + "sulphates | \n", + "alcohol | \n", + "quality | \n", + "
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | \n", + "7.8 | \n", + "0.610 | \n", + "0.29 | \n", + "1.6 | \n", + "0.114 | \n", + "9.0 | \n", + "29.0 | \n", + "0.99740 | \n", + "3.26 | \n", + "1.56 | \n", + "9.1 | \n", + "5 | \n", + "
| 17 | \n", + "8.1 | \n", + "0.560 | \n", + "0.28 | \n", + "1.7 | \n", + "0.368 | \n", + "16.0 | \n", + "56.0 | \n", + "0.99680 | \n", + "3.11 | \n", + "1.28 | \n", + "9.3 | \n", + "5 | \n", + "
| 19 | \n", + "7.9 | \n", + "0.320 | \n", + "0.51 | \n", + "1.8 | \n", + "0.341 | \n", + "17.0 | \n", + "56.0 | \n", + "0.99690 | \n", + "3.04 | \n", + "1.08 | \n", + "9.2 | \n", + "6 | \n", + "
| 43 | \n", + "8.1 | \n", + "0.660 | \n", + "0.22 | \n", + "2.2 | \n", + "0.069 | \n", + "9.0 | \n", + "23.0 | \n", + "0.99680 | \n", + "3.30 | \n", + "1.20 | \n", + "10.3 | \n", + "5 | \n", + "
| 79 | \n", + "8.3 | \n", + "0.625 | \n", + "0.20 | \n", + "1.5 | \n", + "0.080 | \n", + "27.0 | \n", + "119.0 | \n", + "0.99720 | \n", + "3.16 | \n", + "1.12 | \n", + "9.1 | \n", + "4 | \n", + "
| 81 | \n", + "7.8 | \n", + "0.430 | \n", + "0.70 | \n", + "1.9 | \n", + "0.464 | \n", + "22.0 | \n", + "67.0 | \n", + "0.99740 | \n", + "3.13 | \n", + "1.28 | \n", + "9.4 | \n", + "5 | \n", + "
| 83 | \n", + "7.3 | \n", + "0.670 | \n", + "0.26 | \n", + "1.8 | \n", + "0.401 | \n", + "16.0 | \n", + "51.0 | \n", + "0.99690 | \n", + "3.16 | \n", + "1.14 | \n", + "9.4 | \n", + "5 | \n", + "
| 86 | \n", + "8.6 | \n", + "0.490 | \n", + "0.28 | \n", + "1.9 | \n", + "0.110 | \n", + "20.0 | \n", + "136.0 | \n", + "0.99720 | \n", + "2.93 | \n", + "1.95 | \n", + "9.9 | \n", + "6 | \n", + "
| 88 | \n", + "9.3 | \n", + "0.390 | \n", + "0.44 | \n", + "2.1 | \n", + "0.107 | \n", + "34.0 | \n", + "125.0 | \n", + "0.99780 | \n", + "3.14 | \n", + "1.22 | \n", + "9.5 | \n", + "5 | \n", + "
| 91 | \n", + "8.6 | \n", + "0.490 | \n", + "0.28 | \n", + "1.9 | \n", + "0.110 | \n", + "20.0 | \n", + "136.0 | \n", + "0.99720 | \n", + "2.93 | \n", + "1.95 | \n", + "9.9 | \n", + "6 | \n", + "
| 92 | \n", + "8.6 | \n", + "0.490 | \n", + "0.29 | \n", + "2.0 | \n", + "0.110 | \n", + "19.0 | \n", + "133.0 | \n", + "0.99720 | \n", + "2.93 | \n", + "1.98 | \n", + "9.8 | \n", + "5 | \n", + "
| 106 | \n", + "7.8 | \n", + "0.410 | \n", + "0.68 | \n", + "1.7 | \n", + "0.467 | \n", + "18.0 | \n", + "69.0 | \n", + "0.99730 | \n", + "3.08 | \n", + "1.31 | \n", + "9.3 | \n", + "5 | \n", + "
| 151 | \n", + "9.2 | \n", + "0.520 | \n", + "1.00 | \n", + "3.4 | \n", + "0.610 | \n", + "32.0 | \n", + "69.0 | \n", + "0.99960 | \n", + "2.74 | \n", + "2.00 | \n", + "9.4 | \n", + "4 | \n", + "
| 161 | \n", + "7.6 | \n", + "0.680 | \n", + "0.02 | \n", + "1.3 | \n", + "0.072 | \n", + "9.0 | \n", + "20.0 | \n", + "0.99650 | \n", + "3.17 | \n", + "1.08 | \n", + "9.2 | \n", + "4 | \n", + "
| 169 | \n", + "7.5 | \n", + "0.705 | \n", + "0.24 | \n", + "1.8 | \n", + "0.360 | \n", + "15.0 | \n", + "63.0 | \n", + "0.99640 | \n", + "3.00 | \n", + "1.59 | \n", + "9.5 | \n", + "5 | \n", + "
| 181 | \n", + "8.9 | \n", + "0.610 | \n", + "0.49 | \n", + "2.0 | \n", + "0.270 | \n", + "23.0 | \n", + "110.0 | \n", + "0.99720 | \n", + "3.12 | \n", + "1.02 | \n", + "9.3 | \n", + "5 | \n", + "
| 201 | \n", + "8.8 | \n", + "0.370 | \n", + "0.48 | \n", + "2.1 | \n", + "0.097 | \n", + "39.0 | \n", + "145.0 | \n", + "0.99750 | \n", + "3.04 | \n", + "1.03 | \n", + "9.3 | \n", + "5 | \n", + "
| 226 | \n", + "8.9 | \n", + "0.590 | \n", + "0.50 | \n", + "2.0 | \n", + "0.337 | \n", + "27.0 | \n", + "81.0 | \n", + "0.99640 | \n", + "3.04 | \n", + "1.61 | \n", + "9.5 | \n", + "6 | \n", + "
| 240 | \n", + "8.9 | \n", + "0.635 | \n", + "0.37 | \n", + "1.7 | \n", + "0.263 | \n", + "5.0 | \n", + "62.0 | \n", + "0.99710 | \n", + "3.00 | \n", + "1.09 | \n", + "9.3 | \n", + "5 | \n", + "
| 258 | \n", + "7.7 | \n", + "0.410 | \n", + "0.76 | \n", + "1.8 | \n", + "0.611 | \n", + "8.0 | \n", + "45.0 | \n", + "0.99680 | \n", + "3.06 | \n", + "1.26 | \n", + "9.4 | \n", + "5 | \n", + "
| 281 | \n", + "7.7 | \n", + "0.270 | \n", + "0.68 | \n", + "3.5 | \n", + "0.358 | \n", + "5.0 | \n", + "10.0 | \n", + "0.99720 | \n", + "3.25 | \n", + "1.08 | \n", + "9.9 | \n", + "7 | \n", + "
| 338 | \n", + "12.4 | \n", + "0.490 | \n", + "0.58 | \n", + "3.0 | \n", + "0.103 | \n", + "28.0 | \n", + "99.0 | \n", + "1.00080 | \n", + "3.16 | \n", + "1.00 | \n", + "11.5 | \n", + "6 | \n", + "
| 339 | \n", + "12.5 | \n", + "0.280 | \n", + "0.54 | \n", + "2.3 | \n", + "0.082 | \n", + "12.0 | \n", + "29.0 | \n", + "0.99970 | \n", + "3.11 | \n", + "1.36 | \n", + "9.8 | \n", + "7 | \n", + "
| 340 | \n", + "12.2 | \n", + "0.340 | \n", + "0.50 | \n", + "2.4 | \n", + "0.066 | \n", + "10.0 | \n", + "21.0 | \n", + "1.00000 | \n", + "3.12 | \n", + "1.18 | \n", + "9.2 | \n", + "6 | \n", + "
| 369 | \n", + "9.4 | \n", + "0.270 | \n", + "0.53 | \n", + "2.4 | \n", + "0.074 | \n", + "6.0 | \n", + "18.0 | \n", + "0.99620 | \n", + "3.20 | \n", + "1.13 | \n", + "12.0 | \n", + "7 | \n", + "
| 372 | \n", + "9.1 | \n", + "0.280 | \n", + "0.48 | \n", + "1.8 | \n", + "0.067 | \n", + "26.0 | \n", + "46.0 | \n", + "0.99670 | \n", + "3.32 | \n", + "1.04 | \n", + "10.6 | \n", + "6 | \n", + "
| 376 | \n", + "11.5 | \n", + "0.450 | \n", + "0.50 | \n", + "3.0 | \n", + "0.078 | \n", + "19.0 | \n", + "47.0 | \n", + "1.00030 | \n", + "3.26 | \n", + "1.11 | \n", + "11.0 | \n", + "6 | \n", + "
| 377 | \n", + "9.4 | \n", + "0.270 | \n", + "0.53 | \n", + "2.4 | \n", + "0.074 | \n", + "6.0 | \n", + "18.0 | \n", + "0.99620 | \n", + "3.20 | \n", + "1.13 | \n", + "12.0 | \n", + "7 | \n", + "
| 415 | \n", + "8.6 | \n", + "0.725 | \n", + "0.24 | \n", + "6.6 | \n", + "0.117 | \n", + "31.0 | \n", + "134.0 | \n", + "1.00140 | \n", + "3.32 | \n", + "1.07 | \n", + "9.3 | \n", + "5 | \n", + "
| 451 | \n", + "8.4 | \n", + "0.370 | \n", + "0.53 | \n", + "1.8 | \n", + "0.413 | \n", + "9.0 | \n", + "26.0 | \n", + "0.99790 | \n", + "3.06 | \n", + "1.06 | \n", + "9.1 | \n", + "6 | \n", + "
| 477 | \n", + "10.4 | \n", + "0.240 | \n", + "0.49 | \n", + "1.8 | \n", + "0.075 | \n", + "6.0 | \n", + "20.0 | \n", + "0.99770 | \n", + "3.18 | \n", + "1.06 | \n", + "11.0 | \n", + "6 | \n", + "
| 482 | \n", + "10.6 | \n", + "0.360 | \n", + "0.59 | \n", + "2.2 | \n", + "0.152 | \n", + "6.0 | \n", + "18.0 | \n", + "0.99860 | \n", + "3.04 | \n", + "1.05 | \n", + "9.4 | \n", + "5 | \n", + "
| 483 | \n", + "10.6 | \n", + "0.360 | \n", + "0.60 | \n", + "2.2 | \n", + "0.152 | \n", + "7.0 | \n", + "18.0 | \n", + "0.99860 | \n", + "3.04 | \n", + "1.06 | \n", + "9.4 | \n", + "5 | \n", + "
| 503 | \n", + "10.5 | \n", + "0.260 | \n", + "0.47 | \n", + "1.9 | \n", + "0.078 | \n", + "6.0 | \n", + "24.0 | \n", + "0.99760 | \n", + "3.18 | \n", + "1.04 | \n", + "10.9 | \n", + "7 | \n", + "
| 504 | \n", + "10.5 | \n", + "0.240 | \n", + "0.42 | \n", + "1.8 | \n", + "0.077 | \n", + "6.0 | \n", + "22.0 | \n", + "0.99760 | \n", + "3.21 | \n", + "1.05 | \n", + "10.8 | \n", + "7 | \n", + "
| 506 | \n", + "10.4 | \n", + "0.240 | \n", + "0.46 | \n", + "1.8 | \n", + "0.075 | \n", + "6.0 | \n", + "21.0 | \n", + "0.99760 | \n", + "3.25 | \n", + "1.02 | \n", + "10.8 | \n", + "7 | \n", + "
| 515 | \n", + "8.5 | \n", + "0.655 | \n", + "0.49 | \n", + "6.1 | \n", + "0.122 | \n", + "34.0 | \n", + "151.0 | \n", + "1.00100 | \n", + "3.31 | \n", + "1.14 | \n", + "9.3 | \n", + "5 | \n", + "
| 586 | \n", + "11.1 | \n", + "0.310 | \n", + "0.49 | \n", + "2.7 | \n", + "0.094 | \n", + "16.0 | \n", + "47.0 | \n", + "0.99860 | \n", + "3.12 | \n", + "1.02 | \n", + "10.6 | \n", + "7 | \n", + "
| 614 | \n", + "9.2 | \n", + "0.755 | \n", + "0.18 | \n", + "2.2 | \n", + "0.148 | \n", + "10.0 | \n", + "103.0 | \n", + "0.99690 | \n", + "2.87 | \n", + "1.36 | \n", + "10.2 | \n", + "6 | \n", + "
| 639 | \n", + "8.9 | \n", + "0.290 | \n", + "0.35 | \n", + "1.9 | \n", + "0.067 | \n", + "25.0 | \n", + "57.0 | \n", + "0.99700 | \n", + "3.18 | \n", + "1.36 | \n", + "10.3 | \n", + "6 | \n", + "
| 689 | \n", + "8.1 | \n", + "0.380 | \n", + "0.48 | \n", + "1.8 | \n", + "0.157 | \n", + "5.0 | \n", + "17.0 | \n", + "0.99760 | \n", + "3.30 | \n", + "1.05 | \n", + "9.4 | \n", + "5 | \n", + "
| 692 | \n", + "8.6 | \n", + "0.490 | \n", + "0.51 | \n", + "2.0 | \n", + "0.422 | \n", + "16.0 | \n", + "62.0 | \n", + "0.99790 | \n", + "3.03 | \n", + "1.17 | \n", + "9.0 | \n", + "5 | \n", + "
| 723 | \n", + "7.1 | \n", + "0.310 | \n", + "0.30 | \n", + "2.2 | \n", + "0.053 | \n", + "36.0 | \n", + "127.0 | \n", + "0.99650 | \n", + "2.94 | \n", + "1.62 | \n", + "9.5 | \n", + "5 | \n", + "
| 754 | \n", + "7.8 | \n", + "0.480 | \n", + "0.68 | \n", + "1.7 | \n", + "0.415 | \n", + "14.0 | \n", + "32.0 | \n", + "0.99656 | \n", + "3.09 | \n", + "1.06 | \n", + "9.1 | \n", + "6 | \n", + "
| 795 | \n", + "10.8 | \n", + "0.890 | \n", + "0.30 | \n", + "2.6 | \n", + "0.132 | \n", + "7.0 | \n", + "60.0 | \n", + "0.99786 | \n", + "2.99 | \n", + "1.18 | \n", + "10.2 | \n", + "5 | \n", + "
| 852 | \n", + "8.0 | \n", + "0.420 | \n", + "0.32 | \n", + "2.5 | \n", + "0.080 | \n", + "26.0 | \n", + "122.0 | \n", + "0.99801 | \n", + "3.22 | \n", + "1.07 | \n", + "9.7 | \n", + "5 | \n", + "
| 1051 | \n", + "8.5 | \n", + "0.460 | \n", + "0.59 | \n", + "1.4 | \n", + "0.414 | \n", + "16.0 | \n", + "45.0 | \n", + "0.99702 | \n", + "3.03 | \n", + "1.34 | \n", + "9.2 | \n", + "5 | \n", + "
| 1158 | \n", + "6.7 | \n", + "0.410 | \n", + "0.43 | \n", + "2.8 | \n", + "0.076 | \n", + "22.0 | \n", + "54.0 | \n", + "0.99572 | \n", + "3.42 | \n", + "1.16 | \n", + "10.6 | \n", + "6 | \n", + "
| 1165 | \n", + "8.5 | \n", + "0.440 | \n", + "0.50 | \n", + "1.9 | \n", + "0.369 | \n", + "15.0 | \n", + "38.0 | \n", + "0.99634 | \n", + "3.01 | \n", + "1.10 | \n", + "9.4 | \n", + "5 | \n", + "
| 1260 | \n", + "8.6 | \n", + "0.635 | \n", + "0.68 | \n", + "1.8 | \n", + "0.403 | \n", + "19.0 | \n", + "56.0 | \n", + "0.99632 | \n", + "3.02 | \n", + "1.15 | \n", + "9.3 | \n", + "5 | \n", + "
| 1288 | \n", + "7.0 | \n", + "0.600 | \n", + "0.30 | \n", + "4.5 | \n", + "0.068 | \n", + "20.0 | \n", + "110.0 | \n", + "0.99914 | \n", + "3.30 | \n", + "1.17 | \n", + "10.2 | \n", + "5 | \n", + "
| 1289 | \n", + "7.0 | \n", + "0.600 | \n", + "0.30 | \n", + "4.5 | \n", + "0.068 | \n", + "20.0 | \n", + "110.0 | \n", + "0.99914 | \n", + "3.30 | \n", + "1.17 | \n", + "10.2 | \n", + "5 | \n", + "
| 1319 | \n", + "9.1 | \n", + "0.760 | \n", + "0.68 | \n", + "1.7 | \n", + "0.414 | \n", + "18.0 | \n", + "64.0 | \n", + "0.99652 | \n", + "2.90 | \n", + "1.33 | \n", + "9.1 | \n", + "6 | \n", + "
| 1367 | \n", + "6.9 | \n", + "0.540 | \n", + "0.30 | \n", + "2.2 | \n", + "0.088 | \n", + "9.0 | \n", + "105.0 | \n", + "0.99725 | \n", + "3.25 | \n", + "1.18 | \n", + "10.5 | \n", + "6 | \n", + "
| 1370 | \n", + "8.7 | \n", + "0.780 | \n", + "0.51 | \n", + "1.7 | \n", + "0.415 | \n", + "12.0 | \n", + "66.0 | \n", + "0.99623 | \n", + "3.00 | \n", + "1.17 | \n", + "9.2 | \n", + "5 | \n", + "
| 1371 | \n", + "7.5 | \n", + "0.580 | \n", + "0.56 | \n", + "3.1 | \n", + "0.153 | \n", + "5.0 | \n", + "14.0 | \n", + "0.99476 | \n", + "3.21 | \n", + "1.03 | \n", + "11.6 | \n", + "6 | \n", + "
| 1372 | \n", + "8.7 | \n", + "0.780 | \n", + "0.51 | \n", + "1.7 | \n", + "0.415 | \n", + "12.0 | \n", + "66.0 | \n", + "0.99623 | \n", + "3.00 | \n", + "1.17 | \n", + "9.2 | \n", + "5 | \n", + "
| 1403 | \n", + "7.2 | \n", + "0.330 | \n", + "0.33 | \n", + "1.7 | \n", + "0.061 | \n", + "3.0 | \n", + "13.0 | \n", + "0.99600 | \n", + "3.23 | \n", + "1.10 | \n", + "10.0 | \n", + "8 | \n", + "
| 1408 | \n", + "8.1 | \n", + "0.290 | \n", + "0.36 | \n", + "2.2 | \n", + "0.048 | \n", + "35.0 | \n", + "53.0 | \n", + "0.99500 | \n", + "3.27 | \n", + "1.01 | \n", + "12.4 | \n", + "7 | \n", + "
| \n", + " | fixed acidity | \n", + "volatile acidity | \n", + "citric acid | \n", + "residual sugar | \n", + "chlorides | \n", + "free sulfur dioxide | \n", + "total sulfur dioxide | \n", + "density | \n", + "pH | \n", + "sulphates | \n", + "alcohol | \n", + "quality | \n", + "
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | \n", + "5.2 | \n", + "0.34 | \n", + "0.00 | \n", + "1.8 | \n", + "0.050 | \n", + "27.0 | \n", + "63.0 | \n", + "0.99160 | \n", + "3.68 | \n", + "0.79 | \n", + "14.000000 | \n", + "6 | \n", + "
| 144 | \n", + "5.2 | \n", + "0.34 | \n", + "0.00 | \n", + "1.8 | \n", + "0.050 | \n", + "27.0 | \n", + "63.0 | \n", + "0.99160 | \n", + "3.68 | \n", + "0.79 | \n", + "14.000000 | \n", + "6 | \n", + "
| 467 | \n", + "8.8 | \n", + "0.46 | \n", + "0.45 | \n", + "2.6 | \n", + "0.065 | \n", + "7.0 | \n", + "18.0 | \n", + "0.99470 | \n", + "3.32 | \n", + "0.79 | \n", + "14.000000 | \n", + "6 | \n", + "
| 588 | \n", + "5.0 | \n", + "0.42 | \n", + "0.24 | \n", + "2.0 | \n", + "0.060 | \n", + "19.0 | \n", + "50.0 | \n", + "0.99170 | \n", + "3.72 | \n", + "0.74 | \n", + "14.000000 | \n", + "8 | \n", + "
| 652 | \n", + "15.9 | \n", + "0.36 | \n", + "0.65 | \n", + "7.5 | \n", + "0.096 | \n", + "22.0 | \n", + "71.0 | \n", + "0.99760 | \n", + "2.98 | \n", + "0.84 | \n", + "14.900000 | \n", + "5 | \n", + "
| 821 | \n", + "4.9 | \n", + "0.42 | \n", + "0.00 | \n", + "2.1 | \n", + "0.048 | \n", + "16.0 | \n", + "42.0 | \n", + "0.99154 | \n", + "3.71 | \n", + "0.74 | \n", + "14.000000 | \n", + "7 | \n", + "
| 1114 | \n", + "5.0 | \n", + "0.40 | \n", + "0.50 | \n", + "4.3 | \n", + "0.046 | \n", + "29.0 | \n", + "80.0 | \n", + "0.99020 | \n", + "3.49 | \n", + "0.66 | \n", + "13.600000 | \n", + "6 | \n", + "
| 1132 | \n", + "7.4 | \n", + "0.36 | \n", + "0.34 | \n", + "1.8 | \n", + "0.075 | \n", + "18.0 | \n", + "38.0 | \n", + "0.99330 | \n", + "3.38 | \n", + "0.88 | \n", + "13.600000 | \n", + "7 | \n", + "
| 1228 | \n", + "5.1 | \n", + "0.42 | \n", + "0.00 | \n", + "1.8 | \n", + "0.044 | \n", + "18.0 | \n", + "88.0 | \n", + "0.99157 | \n", + "3.68 | \n", + "0.73 | \n", + "13.600000 | \n", + "7 | \n", + "
| 1269 | \n", + "5.5 | \n", + "0.49 | \n", + "0.03 | \n", + "1.8 | \n", + "0.044 | \n", + "28.0 | \n", + "87.0 | \n", + "0.99080 | \n", + "3.50 | \n", + "0.82 | \n", + "14.000000 | \n", + "8 | \n", + "
| 1270 | \n", + "5.0 | \n", + "0.38 | \n", + "0.01 | \n", + "1.6 | \n", + "0.048 | \n", + "26.0 | \n", + "60.0 | \n", + "0.99084 | \n", + "3.70 | \n", + "0.75 | \n", + "14.000000 | \n", + "6 | \n", + "
| 1475 | \n", + "5.3 | \n", + "0.47 | \n", + "0.11 | \n", + "2.2 | \n", + "0.048 | \n", + "16.0 | \n", + "89.0 | \n", + "0.99182 | \n", + "3.54 | \n", + "0.88 | \n", + "13.566667 | \n", + "7 | \n", + "
| 1477 | \n", + "5.3 | \n", + "0.47 | \n", + "0.11 | \n", + "2.2 | \n", + "0.048 | \n", + "16.0 | \n", + "89.0 | \n", + "0.99182 | \n", + "3.54 | \n", + "0.88 | \n", + "13.600000 | \n", + "7 | \n", + "
| \n", + " | fixed acidity | \n", + "volatile acidity | \n", + "citric acid | \n", + "residual sugar | \n", + "chlorides | \n", + "free sulfur dioxide | \n", + "total sulfur dioxide | \n", + "density | \n", + "pH | \n", + "sulphates | \n", + "alcohol | \n", + "quality | \n", + "
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 267 | \n", + "7.9 | \n", + "0.350 | \n", + "0.46 | \n", + "3.60 | \n", + "0.078 | \n", + "15.0 | \n", + "37.0 | \n", + "0.99730 | \n", + "3.35 | \n", + "0.86 | \n", + "12.80 | \n", + "8 | \n", + "
| 278 | \n", + "10.3 | \n", + "0.320 | \n", + "0.45 | \n", + "6.40 | \n", + "0.073 | \n", + "5.0 | \n", + "13.0 | \n", + "0.99760 | \n", + "3.23 | \n", + "0.82 | \n", + "12.60 | \n", + "8 | \n", + "
| 390 | \n", + "5.6 | \n", + "0.850 | \n", + "0.05 | \n", + "1.40 | \n", + "0.045 | \n", + "12.0 | \n", + "88.0 | \n", + "0.99240 | \n", + "3.56 | \n", + "0.82 | \n", + "12.90 | \n", + "8 | \n", + "
| 440 | \n", + "12.6 | \n", + "0.310 | \n", + "0.72 | \n", + "2.20 | \n", + "0.072 | \n", + "6.0 | \n", + "29.0 | \n", + "0.99870 | \n", + "2.88 | \n", + "0.82 | \n", + "9.80 | \n", + "8 | \n", + "
| 455 | \n", + "11.3 | \n", + "0.620 | \n", + "0.67 | \n", + "5.20 | \n", + "0.086 | \n", + "6.0 | \n", + "19.0 | \n", + "0.99880 | \n", + "3.22 | \n", + "0.69 | \n", + "13.40 | \n", + "8 | \n", + "
| 459 | \n", + "11.6 | \n", + "0.580 | \n", + "0.66 | \n", + "2.20 | \n", + "0.074 | \n", + "10.0 | \n", + "47.0 | \n", + "1.00080 | \n", + "3.25 | \n", + "0.57 | \n", + "9.00 | \n", + "3 | \n", + "
| 481 | \n", + "9.4 | \n", + "0.300 | \n", + "0.56 | \n", + "2.80 | \n", + "0.080 | \n", + "6.0 | \n", + "17.0 | \n", + "0.99640 | \n", + "3.15 | \n", + "0.92 | \n", + "11.70 | \n", + "8 | \n", + "
| 495 | \n", + "10.7 | \n", + "0.350 | \n", + "0.53 | \n", + "2.60 | \n", + "0.070 | \n", + "5.0 | \n", + "16.0 | \n", + "0.99720 | \n", + "3.15 | \n", + "0.65 | \n", + "11.00 | \n", + "8 | \n", + "
| 498 | \n", + "10.7 | \n", + "0.350 | \n", + "0.53 | \n", + "2.60 | \n", + "0.070 | \n", + "5.0 | \n", + "16.0 | \n", + "0.99720 | \n", + "3.15 | \n", + "0.65 | \n", + "11.00 | \n", + "8 | \n", + "
| 517 | \n", + "10.4 | \n", + "0.610 | \n", + "0.49 | \n", + "2.10 | \n", + "0.200 | \n", + "5.0 | \n", + "16.0 | \n", + "0.99940 | \n", + "3.16 | \n", + "0.63 | \n", + "8.40 | \n", + "3 | \n", + "
| 588 | \n", + "5.0 | \n", + "0.420 | \n", + "0.24 | \n", + "2.00 | \n", + "0.060 | \n", + "19.0 | \n", + "50.0 | \n", + "0.99170 | \n", + "3.72 | \n", + "0.74 | \n", + "14.00 | \n", + "8 | \n", + "
| 690 | \n", + "7.4 | \n", + "1.185 | \n", + "0.00 | \n", + "4.25 | \n", + "0.097 | \n", + "5.0 | \n", + "14.0 | \n", + "0.99660 | \n", + "3.63 | \n", + "0.54 | \n", + "10.70 | \n", + "3 | \n", + "
| 828 | \n", + "7.8 | \n", + "0.570 | \n", + "0.09 | \n", + "2.30 | \n", + "0.065 | \n", + "34.0 | \n", + "45.0 | \n", + "0.99417 | \n", + "3.46 | \n", + "0.74 | \n", + "12.70 | \n", + "8 | \n", + "
| 832 | \n", + "10.4 | \n", + "0.440 | \n", + "0.42 | \n", + "1.50 | \n", + "0.145 | \n", + "34.0 | \n", + "48.0 | \n", + "0.99832 | \n", + "3.38 | \n", + "0.86 | \n", + "9.90 | \n", + "3 | \n", + "
| 899 | \n", + "8.3 | \n", + "1.020 | \n", + "0.02 | \n", + "3.40 | \n", + "0.084 | \n", + "6.0 | \n", + "11.0 | \n", + "0.99892 | \n", + "3.48 | \n", + "0.49 | \n", + "11.00 | \n", + "3 | \n", + "
| 1061 | \n", + "9.1 | \n", + "0.400 | \n", + "0.50 | \n", + "1.80 | \n", + "0.071 | \n", + "7.0 | \n", + "16.0 | \n", + "0.99462 | \n", + "3.21 | \n", + "0.69 | \n", + "12.50 | \n", + "8 | \n", + "
| 1090 | \n", + "10.0 | \n", + "0.260 | \n", + "0.54 | \n", + "1.90 | \n", + "0.083 | \n", + "42.0 | \n", + "74.0 | \n", + "0.99451 | \n", + "2.98 | \n", + "0.63 | \n", + "11.80 | \n", + "8 | \n", + "
| 1120 | \n", + "7.9 | \n", + "0.540 | \n", + "0.34 | \n", + "2.50 | \n", + "0.076 | \n", + "8.0 | \n", + "17.0 | \n", + "0.99235 | \n", + "3.20 | \n", + "0.72 | \n", + "13.10 | \n", + "8 | \n", + "
| 1202 | \n", + "8.6 | \n", + "0.420 | \n", + "0.39 | \n", + "1.80 | \n", + "0.068 | \n", + "6.0 | \n", + "12.0 | \n", + "0.99516 | \n", + "3.35 | \n", + "0.69 | \n", + "11.70 | \n", + "8 | \n", + "
| 1269 | \n", + "5.5 | \n", + "0.490 | \n", + "0.03 | \n", + "1.80 | \n", + "0.044 | \n", + "28.0 | \n", + "87.0 | \n", + "0.99080 | \n", + "3.50 | \n", + "0.82 | \n", + "14.00 | \n", + "8 | \n", + "
| 1299 | \n", + "7.6 | \n", + "1.580 | \n", + "0.00 | \n", + "2.10 | \n", + "0.137 | \n", + "5.0 | \n", + "9.0 | \n", + "0.99476 | \n", + "3.50 | \n", + "0.40 | \n", + "10.90 | \n", + "3 | \n", + "
| 1374 | \n", + "6.8 | \n", + "0.815 | \n", + "0.00 | \n", + "1.20 | \n", + "0.267 | \n", + "16.0 | \n", + "29.0 | \n", + "0.99471 | \n", + "3.32 | \n", + "0.51 | \n", + "9.80 | \n", + "3 | \n", + "
| 1403 | \n", + "7.2 | \n", + "0.330 | \n", + "0.33 | \n", + "1.70 | \n", + "0.061 | \n", + "3.0 | \n", + "13.0 | \n", + "0.99600 | \n", + "3.23 | \n", + "1.10 | \n", + "10.00 | \n", + "8 | \n", + "
| 1449 | \n", + "7.2 | \n", + "0.380 | \n", + "0.31 | \n", + "2.00 | \n", + "0.056 | \n", + "15.0 | \n", + "29.0 | \n", + "0.99472 | \n", + "3.23 | \n", + "0.76 | \n", + "11.30 | \n", + "8 | \n", + "
| 1469 | \n", + "7.3 | \n", + "0.980 | \n", + "0.05 | \n", + "2.10 | \n", + "0.061 | \n", + "20.0 | \n", + "49.0 | \n", + "0.99705 | \n", + "3.31 | \n", + "0.55 | \n", + "9.70 | \n", + "3 | \n", + "
| 1478 | \n", + "7.1 | \n", + "0.875 | \n", + "0.05 | \n", + "5.70 | \n", + "0.082 | \n", + "3.0 | \n", + "14.0 | \n", + "0.99808 | \n", + "3.40 | \n", + "0.52 | \n", + "10.20 | \n", + "3 | \n", + "
| 1505 | \n", + "6.7 | \n", + "0.760 | \n", + "0.02 | \n", + "1.80 | \n", + "0.078 | \n", + "6.0 | \n", + "12.0 | \n", + "0.99600 | \n", + "3.55 | \n", + "0.63 | \n", + "9.95 | \n", + "3 | \n", + "
| 1549 | \n", + "7.4 | \n", + "0.360 | \n", + "0.30 | \n", + "1.80 | \n", + "0.074 | \n", + "17.0 | \n", + "24.0 | \n", + "0.99419 | \n", + "3.24 | \n", + "0.70 | \n", + "11.40 | \n", + "8 | \n", + "
[Output: the first two records of the dataset with the new quality_categorical column appended.]
In this section, we'll do some exploratory analysis to understand the nature of our data and the underlying distribution.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for displaying DataFrames

import matplotlib.pyplot as plt
import seaborn as sns

# Import supplementary visualization code visuals.py from project root folder
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the Red Wines dataset
data = pd.read_csv("data/winequality-red.csv", sep=';')

# Display the first five records
display(data.head(n=5))

# Check for missing values and inspect the column types
data.isnull().any()
data.info()

n_wines = data.shape[0]

# Number of wines with quality rating above 6
quality_above_6 = data.loc[(data['quality'] > 6)]
n_above_6 = quality_above_6.shape[0]

# Number of wines with quality rating below 5
quality_below_5 = data.loc[(data['quality'] < 5)]
n_below_5 = quality_below_5.shape[0]
# Number of wines with quality rating between 5 and 6 (inclusive)
quality_between_5 = data.loc[(data['quality'] >= 5) & (data['quality'] <= 6)]
n_between_5 = quality_between_5.shape[0]

# Percentage of wines with quality rating above 6
greater_percent = n_above_6 * 100 / n_wines

# Print the results
print("Total number of wine data: {}".format(n_wines))
print("Wines with rating 7 and above: {}".format(n_above_6))
print("Wines with rating less than 5: {}".format(n_below_5))
print("Wines with rating 5 and 6: {}".format(n_between_5))
print("Percentage of wines with quality 7 and above: {:.2f}%".format(greater_percent))

# Some additional summary statistics
display(np.round(data.describe()))

# Visualize the distribution of the 'quality' ratings
vs.distribution(data, "quality")
As we can see, most wines fall under average quality (between 5 and 6). Highly rated wines number in the low hundreds, while only a handful of wines received low ratings.
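The claim is easy to verify with a quick frequency count (a minimal sketch; the variable name is illustrative):

# Count how many wines received each quality rating
rating_counts = data['quality'].value_counts().sort_index()
print(rating_counts)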
Next, since our aim is to predict the quality of wines, we'll extract the last column and store it separately.
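A minimal sketch of that extraction, assuming the raw quality column is the target for now (variable names are illustrative):

# Store the target column separately from the input features
labels = data['quality']
features_raw = data.drop('quality', axis=1)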
# Scatter matrix of all features to eyeball pairwise relationships
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (40,40), diagonal = 'kde');

# Compute the feature correlation matrix and plot it as a heatmap
correlation = data.corr()
#display(correlation)
plt.figure(figsize=(14, 12))
heatmap = sns.heatmap(correlation, annot=True, linewidths=0, vmin=-1, cmap="RdBu_r")
# Create a new dataframe containing only the pH and fixed acidity columns to visualize their correlation
fixedAcidity_pH = data[['pH', 'fixed acidity']]

# Initialize a joint grid with the dataframe, using the seaborn library
gridA = sns.JointGrid(x="fixed acidity", y="pH", data=fixedAcidity_pH, size=6)

# Draw a regression plot in the grid
gridA = gridA.plot_joint(sns.regplot, scatter_kws={"s": 10})

# Draw distribution plots on the margins of the same grid
gridA = gridA.plot_marginals(sns.distplot)
# Repeat the joint plot for fixed acidity vs. citric acid
fixedAcidity_citricAcid = data[['citric acid', 'fixed acidity']]
g = sns.JointGrid(x="fixed acidity", y="citric acid", data=fixedAcidity_citricAcid, size=6)
g = g.plot_joint(sns.regplot, scatter_kws={"s": 10})
g = g.plot_marginals(sns.distplot)

# ... and for fixed acidity vs. density
fixedAcidity_density = data[['density', 'fixed acidity']]
gridB = sns.JointGrid(x="fixed acidity", y="density", data=fixedAcidity_density, size=6)
gridB = gridB.plot_joint(sns.regplot, scatter_kws={"s": 10})
gridB = gridB.plot_marginals(sns.distplot)

# ... and for volatile acidity vs. quality
volatileAcidity_quality = data[['quality', 'volatile acidity']]
g = sns.JointGrid(x="volatile acidity", y="quality", data=volatileAcidity_quality, size=6)
g = g.plot_joint(sns.regplot, scatter_kws={"s": 10})
g = g.plot_marginals(sns.distplot)
#We can visualize relationships of discrete values better with a bar plot
+
+fig, axs = plt.subplots(ncols=1,figsize=(10,6))
+sns.barplot(x='quality', y='volatile acidity', data=volatileAcidity_quality, ax=axs)
+plt.title('quality VS volatile acidity')
+
+plt.tight_layout()
+plt.show()
+plt.gcf().clear()
+quality_alcohol = data[['alcohol', 'quality']]
+
+g = sns.JointGrid(x="alcohol", y="quality", data=quality_alcohol, size=6)
+g = g.plot_joint(sns.regplot, scatter_kws={"s": 10})
+g = g.plot_marginals(sns.distplot)
+fig, axs = plt.subplots(ncols=1,figsize=(10,6))
+sns.barplot(x='quality', y='alcohol', data=quality_alcohol, ax=axs)
+plt.title('quality VS alcohol')
+
+plt.tight_layout()
+plt.show()
+plt.gcf().clear()
+# TODO: Select any two features of your choice and view their relationship
+# featureA = 'pH'
+# featureB = 'alcohol'
+# featureA_featureB = data[[featureA, featureB]]
+
+# g = sns.JointGrid(x=featureA, y=featureB, data=featureA_featureB, size=6)
+# g = g.plot_joint(sns.regplot, scatter_kws={"s": 10})
+# g = g.plot_marginals(sns.distplot)
+
+# fig, axs = plt.subplots(ncols=1,figsize=(10,6))
+# sns.barplot(x=featureA, y=featureB, data=featureA_featureB, ax=axs)
# plt.title(featureA + ' VS ' + featureB)
+
+# plt.tight_layout()
+# plt.show()
+# plt.gcf().clear()
Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take into consideration these data points. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's Method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.

In the code block below, you will need to implement the following:
- Assign the value of the 25th percentile for the given feature to Q1, using np.percentile.
- Assign the value of the 75th percentile for the given feature to Q3, again using np.percentile.
- Assign the calculation of an outlier step (1.5 times the interquartile range) for the given feature to step.
- Optionally, remove data points from the dataset by adding their indices to the outliers list.

NOTE: If you choose to remove any outliers, ensure that the sample data does not contain any of these points! Once you have performed this implementation, the dataset will be stored in the variable good_data.
+ +# For each feature find the data points with extreme high or low values
+for feature in data.keys():
+
+ # TODO: Calculate Q1 (25th percentile of the data) for the given feature
+ Q1 = np.percentile(data[feature], q=25)
+
+ # TODO: Calculate Q3 (75th percentile of the data) for the given feature
+ Q3 = np.percentile(data[feature], q=75)
+
+ # TODO: Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
+ interquartile_range = Q3 - Q1
+ step = 1.5 * interquartile_range
+
+ # Display the outliers
+ print("Data points considered outliers for the feature '{}':".format(feature))
+ display(data[~((data[feature] >= Q1 - step) & (data[feature] <= Q3 + step))])
+
+# OPTIONAL: Select the indices for data points you wish to remove
+outliers = []
+
+# Remove the outliers, if any were specified
+good_data = data.drop(data.index[outliers]).reset_index(drop = True)
+#Defining the splits for categories. 1-4 will be poor quality, 5-6 will be average, 7-10 will be great
+bins = [1,4,6,10]
+
+#0 for low quality, 1 for average, 2 for great quality
+quality_labels=[0,1,2]
+data['quality_categorical'] = pd.cut(data['quality'], bins=bins, labels=quality_labels, include_lowest=True)
+
+#Displays the first 2 columns
+display(data.head(n=2))
+
+# Split the data into features and target label
+quality_raw = data['quality_categorical']
+features_raw = data.drop(['quality', 'quality_categorical'], axis = 1)
+# Import train_test_split
+from sklearn.model_selection import train_test_split
+
# Split the 'features' and 'quality' data into training and testing sets
+X_train, X_test, y_train, y_test = train_test_split(features_raw,
+ quality_raw,
+ test_size = 0.2,
+ random_state = 0)
+
+# Show the results of the split
+print("Training set has {} samples.".format(X_train.shape[0]))
+print("Testing set has {} samples.".format(X_test.shape[0]))
The following are some of the supervised learning models that are currently available in scikit-learn that you may choose from:
- Gaussian Naive Bayes (GaussianNB)
- Decision Trees
- Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)
- K-Nearest Neighbors (KNeighbors)
- Stochastic Gradient Descent Classifier (SGDC)
- Support Vector Machines (SVM)
- Logistic Regression

To properly evaluate the performance of each model you've chosen, it's important that you create a training and predicting pipeline that allows you to quickly and effectively train models using various sizes of training data and perform predictions on the testing data. Your implementation here will be used in the following section. In the code block below, you will need to implement the following:
- Import fbeta_score and accuracy_score from sklearn.metrics.
- Fit the learner to the sampled training data and record the training time.
- Perform predictions on the test data X_test, and also on the first 300 training points X_train[:300], and record the prediction time.
- Calculate the accuracy score for both the first 300 training points and the testing set.
- Calculate the F-score for both the first 300 training points and the testing set. Make sure that you set the beta parameter!

# Import two classification metrics from sklearn - fbeta_score and accuracy_score
+from sklearn.metrics import fbeta_score
+from sklearn.metrics import accuracy_score
+
+def train_predict_evaluate(learner, sample_size, X_train, y_train, X_test, y_test):
+ '''
+ inputs:
+ - learner: the learning algorithm to be trained and predicted on
+ - sample_size: the size of samples (number) to be drawn from training set
+ - X_train: features training set
+ - y_train: quality training set
+ - X_test: features testing set
+ - y_test: quality testing set
+ '''
+
+ results = {}
+
+ """
+ Fit/train the learner to the training data using slicing with 'sample_size'
+ using .fit(training_features[:], training_labels[:])
+ """
+ start = time() # Get start time of training
+ learner = learner.fit(X_train[:sample_size], y_train[:sample_size]) #Train the model
+ end = time() # Get end time of training
+
+ # Calculate the training time
+ results['train_time'] = end - start
+
+ """
+ Get the predictions on the first 300 training samples(X_train),
+ and also predictions on the test set(X_test) using .predict()
+ """
+ start = time() # Get start time
+ predictions_train = learner.predict(X_train[:300])
+ predictions_test = learner.predict(X_test)
+
+ end = time() # Get end time
+
+ # Calculate the total prediction time
+ results['pred_time'] = end - start
+
+ # Compute accuracy on the first 300 training samples which is y_train[:300]
+ results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
+
+ # Compute accuracy on test set using accuracy_score()
+ results['acc_test'] = accuracy_score(y_test, predictions_test)
+
    # Compute F-score (with beta = 0.5) on the first 300 training samples using fbeta_score()
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5, average='micro')

    # Compute F-score (with beta = 0.5) on the test set, which is y_test
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5, average='micro')
+
+ # Success
+ print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
+
+ # Return the results
+ return results
In the code cell, you will need to implement the following:
- Import the three supervised learning models you've chosen.
- Initialize the three models and store them in clf_A, clf_B, and clf_C. Use a random_state for each model you use, if provided.
- Calculate the number of records equal to 1%, 10%, and 100% of the training data and store those values in samples_1, samples_10, and samples_100 respectively.

Note: Depending on which algorithms you chose, the following implementation may take some time to run!
Further reading: https://stackoverflow.com/questions/31421413/how-to-compute-precision-recall-accuracy-and-f1-score-for-the-multiclass-case
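As a toy illustration of what average='micro' means here (illustrative labels only, not project data): micro-averaging pools all per-sample decisions before computing the score, so for single-label multiclass problems the micro-averaged F-score coincides with plain accuracy.

# Toy example: micro-averaged F-beta equals accuracy for single-label multiclass data
from sklearn.metrics import fbeta_score, accuracy_score
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
print(fbeta_score(y_true, y_pred, beta=0.5, average='micro'))  # 0.666..., 4 of 6 correct
print(accuracy_score(y_true, y_pred))                          # same value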
# Import any three supervised learning classification models from sklearn
+from sklearn.naive_bayes import GaussianNB
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.ensemble import RandomForestClassifier
+#from sklearn.linear_model import LogisticRegression
+
+# Initialize the three models
+clf_A = GaussianNB()
+clf_B = DecisionTreeClassifier(max_depth=None, random_state=None)
+clf_C = RandomForestClassifier(max_depth=None, random_state=None)
+
+
+# Calculate the number of samples for 1%, 10%, and 100% of the training data
+# HINT: samples_100 is the entire training set i.e. len(y_train)
+# HINT: samples_10 is 10% of samples_100
+# HINT: samples_1 is 1% of samples_100
+
+samples_100 = len(y_train)
+samples_10 = int(len(y_train)*10/100)
+samples_1 = int(len(y_train)*1/100)
+
+# Collect results on the learners
+results = {}
+for clf in [clf_A, clf_B, clf_C]:
+ clf_name = clf.__class__.__name__
+ results[clf_name] = {}
+ for i, samples in enumerate([samples_1, samples_10, samples_100]):
+ results[clf_name][i] = \
+ train_predict_evaluate(clf, samples, X_train, y_train, X_test, y_test)
+
+#print(results)
+
+# Run metrics visualization for the three supervised learning models chosen
+vs.visualize_classification_performance(results)
An important task when performing supervised learning on a dataset like the wine data we study here is determining which features provide the most predictive power. By focusing on the relationship between only a few crucial features and the target label, we simplify our understanding of the phenomenon, which is almost always a useful thing to do. In the case of this project, that means we wish to identify a small number of features that most strongly predict the quality of wines.
Choose a scikit-learn classifier (e.g., AdaBoost, random forests) that has a feature_importances_ attribute available for it. This attribute ranks the importance of each feature when the chosen classifier makes predictions. In the next Python cell, fit this classifier to the training set and use the attribute to determine the top 5 most important features of the wines dataset.
In the code cell below, you will need to implement the following:
- Import a supervised learning model from sklearn that has a feature_importances_ attribute.
- Train the model on the entire training set.
- Extract the feature importances using '.feature_importances_'.

# Import a supervised learning model that has 'feature_importances_'
+model = RandomForestClassifier(max_depth=None, random_state=None)
+
+# Train the supervised model on the training set using .fit(X_train, y_train)
+model = model.fit(X_train, y_train)
+
+# Extract the feature importances using .feature_importances_
+importances = model.feature_importances_
+
+print(X_train.columns)
+print(importances)
+
+# Plot
+vs.feature_plot(importances, X_train, y_train)
+# TODO: Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
+from sklearn.model_selection import GridSearchCV
+from sklearn.metrics import make_scorer
+
+# TODO: Initialize the classifier
+clf = RandomForestClassifier(max_depth=None, random_state=None)
+
+# Create the parameters or base_estimators list you wish to tune, using a dictionary if needed.
+# Example: parameters = {'parameter_1': [value1, value2], 'parameter_2': [value1, value2]}
+
+"""
+n_estimators: Number of trees in the forest
+max_features: The number of features to consider when looking for the best split
+max_depth: The maximum depth of the tree
+"""
+parameters = {'n_estimators': [10, 20, 30], 'max_features':[3,4,5, None], 'max_depth': [5,6,7, None]}
+
+# TODO: Make an fbeta_score scoring object using make_scorer()
+scorer = make_scorer(fbeta_score, beta=0.5, average="micro")
+
# TODO: Perform grid search on the classifier, using 'scorer' as the scoring method, with GridSearchCV()
+grid_obj = GridSearchCV(clf, parameters, scoring=scorer)
+
+# TODO: Fit the grid search object to the training data and find the optimal parameters using fit()
+grid_fit = grid_obj.fit(X_train, y_train)
+
+# Get the estimator
+best_clf = grid_fit.best_estimator_
+
# Make predictions using the unoptimized and optimized models
+predictions = (clf.fit(X_train, y_train)).predict(X_test)
+best_predictions = best_clf.predict(X_test)
+
# Report the before-and-after scores
+print("Unoptimized model\n------")
+print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
+print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5, average="micro")))
+print("\nOptimized Model\n------")
+print(best_clf)
+print("\nFinal accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
+print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5, average="micro")))
+"""Give inputs in this order: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide,
+total sulfur dioxide, density, pH, sulphates, alcohol
+
+"""
+wine_data = [[8, 0.2, 0.16, 1.8, 0.065, 3, 16, 0.9962, 3.42, 0.92, 9.5],
+ [8, 0, 0.16, 1.8, 0.065, 3, 16, 0.9962, 3.42, 0.92, 1 ],
+ [7.4, 2, 0.00, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 0.6]]
+
+# Show predictions
+for i, quality in enumerate(best_clf.predict(wine_data)):
+ print("Predicted quality for Wine {} is: {}".format(i+1, quality))
Try solving this exercise again as a regression problem. Some of the common algorithms you can try from sklearn are DecisionTreeRegressor, RandomForestRegressor, and AdaBoostRegressor with a DecisionTreeRegressor base estimator. Performance metrics you might need in place of accuracy and F1 score are mean squared error (MSE) and the R² score.
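A rough sketch of what the regression variant could look like (treat the model choice and parameters as assumptions, not a reference solution; it reuses the DataFrame loaded above and standard sklearn metrics):

# Minimal regression sketch: predict the raw 0-10 quality score directly
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Drop the target, and the binned column too if the earlier cells added it
X = data.drop(['quality', 'quality_categorical'], axis=1, errors='ignore')
y = data['quality']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

reg = RandomForestRegressor(n_estimators=30, random_state=0)  # illustrative parameters
reg.fit(X_tr, y_tr)
preds = reg.predict(X_te)

print("MSE: {:.3f}".format(mean_squared_error(y_te, preds)))
print("R2 score: {:.3f}".format(r2_score(y_te, preds)))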
Try using the White Wines dataset in place of the Red Wines.
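Only the input file should need to change for the white wines variant; the UCI file uses the same semicolon-separated format (the path below assumes it sits next to the red-wine CSV):

# Assumes data/winequality-white.csv has been downloaded from the UCI repository
data = pd.read_csv("data/winequality-white.csv", sep=';')
print("White wines dataset has {} samples.".format(data.shape[0]))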
Welcome to the second project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and it will be your job to implement the additional functionality necessary to successfully complete this project. Sections that begin with 'Implementation' in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!
In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question X' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.
Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.
Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?
Answer:
Given that in this problem we aren't trying to predict continuous values, this is clearly not regression; it is a classification problem. The problem statement gives a big clue as to why: we need to identify whether or not a student needs early intervention. There are exactly two labels, either of which could apply to a student: (a) the student needs early intervention, or (b) the student doesn't need early intervention.
Run the code cell below to load necessary Python libraries and load the student data. Note that the last column from this dataset, 'passed', will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.
# Import libraries
-import numpy as np
-import pandas as pd
-from time import time
-from sklearn.metrics import f1_score
-
-from IPython.display import display # Allows the use of display() for displaying DataFrames
-pd.options.display.max_columns = None #Allows us to view all columns of a DataFrame
-
-# Read student data
-student_data = pd.read_csv("student-data.csv")
-print("Student data read successfully!")
-
-# Display the first five records
-display(student_data.head(n=5))
-
-# Some more additional data analysis
-display(np.round(student_data.describe()))
-Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, n_students.
- The total number of features for each student, n_features.
- The number of those students who passed, n_passed.
- The number of those students who failed, n_failed.
- The graduation rate of the class, grad_rate, in percent (%).

# TODO: Calculate number of students
-n_students = student_data.shape[0]
-
# TODO: Calculate number of features (the last column, 'passed', is the target label, not a feature)
n_features = student_data.shape[1] - 1
-
-# TODO: Calculate passing students
-passing_students = student_data.loc[(student_data['passed'] == "yes")]
-n_passed = passing_students.shape[0]
-
-# TODO: Calculate failing students
-failing_students = student_data.loc[(student_data['passed'] == "no")]
-n_failed = failing_students.shape[0]
-
-# TODO: Calculate graduation rate
-grad_rate = n_passed*100/n_students
-
-# Print the results
-print("Total number of students: {}".format(n_students))
-print("Number of features: {}".format(n_features))
-print("Number of students who passed: {}".format(n_passed))
-print("Number of students who failed: {}".format(n_failed))
-print("Graduation rate of the class: {:.2f}%".format(grad_rate))
-In this section, we will prepare the data for modeling, training and testing.
-It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.
-Run the code cell below to separate the student data into feature and target columns to see if any features are non-numeric.
- -# Extract feature columns
-feature_cols = list(student_data.columns[:-1])
-
-# Extract target column 'passed'
-target_col = student_data.columns[-1]
-
-# Show the list of columns
-print("Feature columns:\n{}".format(feature_cols))
-print("\nTarget column: {}".format(target_col))
-
-# Separate the data into feature data and target data (X_all and y_all, respectively)
-X_all = student_data[feature_cols]
-y_all = student_data[target_col]
-
-# Show the feature information by printing the first five rows
-print("\nFeature values:")
-display(X_all.head())
-As you can see, there are several non-numeric columns that need to be converted! Many of them are simply yes/no, e.g. internet. These can be reasonably converted into 1/0 (binary) values.
Other columns, like Mjob and Fjob, have more than two values, and are known as categorical variables. The recommended way to handle such a column is to create as many columns as possible values (e.g. Fjob_teacher, Fjob_other, Fjob_services, etc.), and assign a 1 to one of them and 0 to all others.
These generated columns are sometimes called dummy variables, and we will use the pandas.get_dummies() function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.
def preprocess_features(X):
- ''' Preprocesses the student data and converts non-numeric binary variables into
- binary (0/1) variables. Converts categorical variables into dummy variables. '''
-
- # Initialize new output DataFrame
- output = pd.DataFrame(index = X.index) #Empty DataFrame with range equal to X
-
- # Investigate each feature column for the data
    for col, col_data in X.items():  # .iteritems() is deprecated in newer pandas
- #print("col is ", col)
- #print("col_data is ", col_data)
-
- # If data type is non-numeric, replace all yes/no values with 1/0
- if col_data.dtype == object:
- col_data = col_data.replace(['yes', 'no'], [1, 0])
-
- # If data type is categorical, convert to dummy variables
- if col_data.dtype == object:
- # Example: 'school' => 'school_GP' and 'school_MS'
- col_data = pd.get_dummies(col_data, prefix = col)
-
- # Collect the revised columns
- output = output.join(col_data)
-
- return output
-
-X_all = preprocess_features(X_all)
-print("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))
So far, we have converted all categorical features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. In the code cell below, you will need to implement the following:
- Randomly shuffle and split the data (X_all, y_all) into training and testing subsets, using 300 training points and 95 testing points. Set a random_state for the function(s) you use, if provided.
- Store the results in X_train, X_test, y_train, and y_test.

# TODO: Import any additional functionality you may need here
-
-# Import train_test_split
-from sklearn.model_selection import train_test_split
-
-# TODO: Set the number of training points
-num_train = 300
-
-# Set the number of testing points
-num_test = X_all.shape[0] - num_train
-
# TODO: Shuffle and split the dataset into the number of training and testing points above
X_train, X_test, y_train, y_test = train_test_split(X_all,
                                                    y_all,
                                                    test_size = num_test,
                                                    random_state = 0)
-
-# Show the results of the split
-print("Training set has {} samples.".format(X_train.shape[0]))
-print("Testing set has {} samples.".format(X_test.shape[0]))
-In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in scikit-learn. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data (100 data points, 200 data points, and 300 data points) and measure the F1 score. You will need to produce three tables (one for each model) that shows the training set size, training time, prediction time, F1 score on the training set, and F1 score on the testing set.
The following supervised learning models are currently available in scikit-learn that you may choose from: Gaussian Naive Bayes (GaussianNB), Decision Trees, Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting), K-Nearest Neighbors, Stochastic Gradient Descent (SGDC), Support Vector Machines (SVM), and Logistic Regression.
List three supervised learning models that are appropriate for this problem. For each model chosen, explain what makes it a good candidate given what you know about the data.
Answer:
- Gaussian Naive Bayes: very fast to train and predict, and tends to work reasonably well on small datasets such as this one.
- Decision Trees: easy to interpret, and handle the mixed binary/categorical features of this dataset naturally.
- Random Forest: an ensemble of decision trees that reduces the overfitting a single tree is prone to, usually improving test performance.

Sources: https://scikit-learn.org/stable/supervised_learning.html
Run the code cell below to initialize three helper functions which you can use for training and testing the three supervised learning models you've chosen above. The functions are as follows:
- train_classifier - takes as input a classifier and training data and fits the classifier to the data.
- predict_labels - takes as input a fit classifier, features, and a target labeling and makes predictions using the F1 score.
- train_predict - takes as input a classifier, and the training and testing data, and performs train_classifier and predict_labels.

def train_classifier(clf, X_train, y_train):
- ''' Fits a classifier to the training data. '''
-
- # Start the clock, train the classifier, then stop the clock
- start = time()
- clf.fit(X_train, y_train)
- end = time()
-
- # Print the results
- print("Trained model in {:.4f} seconds".format(end - start))
-
-
-def predict_labels(clf, features, target):
- ''' Makes predictions using a fit classifier based on F1 score. '''
-
- # Start the clock, make predictions, then stop the clock
- start = time()
- y_pred = clf.predict(features)
- end = time()
-
- # Print and return results
- print("Made predictions in {:.4f} seconds.".format(end - start))
- return f1_score(target.values, y_pred, pos_label='yes')
-
-
-def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
-
- # Indicate the classifier and the training set size
- print("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))
-
- # Train the classifier
- train_classifier(clf, X_train, y_train)
-
- # Print the results of prediction for both training and testing
- print("F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train)))
- print("F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test)))
-With the predefined functions above, you will now import the three supervised learning models of your choice and run the train_predict function for each one. Remember that you will need to train and predict on each classifier for three different training set sizes: 100, 200, and 300. Hence, you should expect to have 9 different outputs below — 3 for each model using the varying training set sizes. In the following code cell, you will need to implement the following:
- Import the three supervised learning models you've discussed in the previous section.
- Initialize the three models and store them in clf_A, clf_B, and clf_C. Use a random_state for each model you use, if provided.
- Create the different training set sizes (100, 200, and 300 points) to be used to train each model. Do not reshuffle and resplit the data! The new training points should be drawn from X_train and y_train.

# TODO: Import the three supervised learning models from sklearn
-# from sklearn import model_A
-# from sklearn import model_B
-# from sklearn import model_C
-from sklearn.naive_bayes import GaussianNB
-from sklearn.tree import DecisionTreeClassifier
-from sklearn.ensemble import RandomForestClassifier
-
-# TODO: Initialize the three models
-clf_A = GaussianNB()
-clf_B = DecisionTreeClassifier(max_depth=None, random_state=None)
-clf_C = RandomForestClassifier(max_depth=None, random_state=None)
-
-# TODO: Set up the training set sizes
-X_train_100 = X_train[:100]
-y_train_100 = y_train[:100]
-
-X_train_200 = X_train[:200]
-y_train_200 = y_train[:200]
-
-X_train_300 = X_train[:300]
-y_train_300 = y_train[:300]
-
-X_samples = [X_train_100, X_train_200, X_train_300]
-y_samples = [y_train_100, y_train_200, y_train_300]
-# TODO: Execute the 'train_predict' function for each classifier and each training set size
-for clf in [clf_A, clf_B, clf_C]:
- clf_name = clf.__class__.__name__
-
- for i, samples in enumerate(X_samples):
- train_predict(clf, samples, y_samples[i], X_test, y_test)
-
-
-#train_predict(clf, X_train, y_train, X_test, y_test)
Classifier 1 - Gaussian Naive Bayes

| Training Set Size | Training Time (secs) | Prediction Time, test (secs) | F1 Score (train) | F1 Score (test) |
|---|---|---|---|---|
| 100 | 0.0012 | 0.003 | 0.8550 | 0.7481 |
| 200 | 0.0008 | 0.0003 | 0.8321 | 0.71 |
| 300 | 0.0011 | 0.0005 | 0.8088 | 0.7500 |
Classifier 2 - Decision Tree

| Training Set Size | Training Time (secs) | Prediction Time, test (secs) | F1 Score (train) | F1 Score (test) |
|---|---|---|---|---|
| 100 | 0.0009 | 0.0006 | 1 | 0.7009 |
| 200 | 0.0012 | 0.0002 | 1 | 0.7031 |
| 300 | 0.0016 | 0.0002 | 1 | 0.7167 |
Classifier 3 - Random Forest

| Training Set Size | Training Time (secs) | Prediction Time, test (secs) | F1 Score (train) | F1 Score (test) |
|---|---|---|---|---|
| 100 | 0.0086 | 0.0008 | 0.9922 | 0.6942 |
| 200 | 0.0088 | 0.0007 | 1 | 0.7368 |
| 300 | 0.0090 | 0.0008 | 0.9952 | 0.7939 |
In this final section, you will choose from the three supervised learning models the best model to use on the student data. You will then perform a grid search optimization for the model over the entire training set (X_train and y_train) by tuning at least one parameter to improve upon the untuned model's F1 score.
Based on the experiments you performed earlier, in one to two paragraphs, explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?
- -Answer:
Based on the results, I believe that a random forest model is the most appropriate for this task. When 100% of the training data is used, the F1 score for Random Forest is higher (0.794) than for the other models (Decision Trees: 0.7167, Gaussian Naive Bayes: 0.750). Random Forest is therefore better suited to make predictions here: it performs well, and its training and prediction times remain at acceptable levels.
- -In one to two paragraphs, explain to the board of directors in layman's terms how the final model chosen is supposed to work. Be sure that you are describing the major qualities of the model, such as how the model is trained and how the model makes a prediction. Avoid using advanced mathematical or technical jargon, such as describing equations or discussing the algorithm implementation.
- -Answer:
-In order to understand how Random Forest works, we need to understand how a Decision Tree Classifier works.
-A Decision Tree Classifier basically asks a series of Yes or No Questions, and based on the responses, it arrives at the final decision/result/outcome. It's basically like playing a game of 20 questions. Assuming that in this game the task is to predict what person/object are you thinking about, I start to ask a series of Yes/No type questions. Now, if the first question that I ask is - "are you thinking of a potato", that would be pretty much useless. Instead, if the question that I ask is "Is it a person", then that reveals much more information. This way, every question that I ask would be about maximizing the "Information Gain", each successfully bringing me closer and closer to the final prediction. The Decision Tree works exactly this way, using the features in the data-set to form its series of Yes/No questions.
-A single decision tree is prone to making mistakes, or overfitting. A random forest works by training multiple decision tree classifiers and combining their results: the majority vote of the collection of trees is taken as the final prediction.
- -Fine tune the chosen model. Use grid search (GridSearchCV) with at least one important parameter tuned with at least 3 different values. You will need to use the entire training set for this. In the code cell below, you will need to implement the following:
Import sklearn.grid_search.GridSearchCV and sklearn.metrics.make_scorer.
Create a dictionary of parameters you wish to tune for the chosen model, e.g. parameters = {'parameter' : [list of values]}.
Initialize the classifier you've chosen and store it in clf.
Create the F1 scoring function using make_scorer and store it in f1_scorer. Don't forget to set the pos_label parameter to the correct value!
Perform grid search on the classifier clf using f1_scorer as the scoring method, and store it in grid_obj.
Fit the grid search object to the training data (X_train, y_train), and store it in grid_obj.
 -# TODO: Import 'GridSearchCV' and 'make_scorer'
-from sklearn.model_selection import GridSearchCV
-from sklearn.metrics import make_scorer
-
-# TODO: Create the parameters list you wish to tune
-parameters = {'n_estimators': [10, 20, 30], 'max_features':[3,4,5, None], 'max_depth': [5,6,7, None]}
-
-# TODO: Initialize the classifier
-clf = RandomForestClassifier(max_depth=None, random_state=None)
-
-# TODO: Make an f1 scoring function using 'make_scorer'
-# Score on the F1 of the positive ('yes') class, per the instructions above
-f1_scorer = make_scorer(f1_score, pos_label='yes')
-
-# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
-grid_obj = GridSearchCV(clf, parameters, scoring=f1_scorer)
-
-# TODO: Fit the grid search object to the training data and find the optimal parameters
-grid_obj = grid_obj.fit(X_train, y_train)
-
-# Get the estimator
-clf = grid_obj.best_estimator_
-
-# Report the final F1 score for training and testing after parameter tuning
-print("Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train)))
-print("Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test)))
-What is the final model's F1 score for training and testing? How does that score compare to the untuned model?
- -Answer:
-As it turns out, the final model's performance has in fact improved slightly, though not by a huge margin: the tuned model's testing F1 score is 0.805, compared to 0.794 for the untuned model.
- -Note: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to
-
-File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.
(Notebook output: a preview of the first five rows of the student dataset, the describe() summary statistics of its numeric columns, and the first five rows of the feature columns with the 'passed' target removed.)
I've made a Python notebook to offer detailed explanations of the process. Many of these explanations are taken directly from Udacity's Self-Driving Car course.
- -import matplotlib.image as mpimg
-import matplotlib.pyplot as plt
-import numpy as np
-import cv2
-import glob
-import time
-import pickle
-import copy
-from sklearn import svm
-from sklearn.preprocessing import StandardScaler
-from sklearn.model_selection import GridSearchCV
-from skimage.feature import hog
-from sklearn.externals import joblib
-from scipy import ndimage as ndi
-import imageio
-imageio.plugins.ffmpeg.download()
-from moviepy.editor import VideoFileClip
-from collections import deque
-from sklearn.model_selection import train_test_split
-
-def convert_color(img, conv='RGB2YCrCb'):
- # Accept either full conversion names ('RGB2YCrCb') or plain color space
- # names ('YCrCb'), since extract_features() below passes the latter.
- if conv in ('RGB2YCrCb', 'YCrCb'):
- return cv2.cvtColor(img, cv2.COLOR_RGB2YCrCb)
- if conv == 'BGR2YCrCb':
- return cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
- if conv in ('RGB2LUV', 'LUV'):
- return cv2.cvtColor(img, cv2.COLOR_RGB2LUV)
- if conv in ('RGB2HLS', 'HLS'):
- return cv2.cvtColor(img, cv2.COLOR_RGB2HLS)
- # Fall back to an unmodified copy for 'RGB' or unknown conversions
- return np.copy(img)
-
-
Template matching is not a particularly robust method for finding vehicles unless you know exactly what your target object looks like. However, raw pixel values are still quite useful to include in your feature vector in searching for cars.
-While it could be cumbersome to include three color channels of a full resolution image, you can perform spatial binning on an image and still retain enough information to help in finding vehicles.
-As you can see in the example above, even going all the way down to 32 x 32 pixel resolution, the car itself is still clearly identifiable by eye, and this means that the relevant features are still preserved at this resolution.
-A convenient function for scaling down the resolution of an image is OpenCV's cv2.resize(). You can use it to scale a color image or a single color channel like this (you can find the original image here):
- -import cv2
-import matplotlib.image as mpimg
-
-image = mpimg.imread('test_img.jpg')
-small_img = cv2.resize(image, (32, 32))
-print(small_img.shape)
-(32, 32, 3)
-If you then wanted to convert this to a one dimensional feature vector, you could simply say something like:
- -feature_vec = small_img.ravel()
-print(feature_vec.shape)
-(3072,)
-A function implementing this spatial binning for each color channel could look something like this:
-def bin_spatial(img, size=(32, 32)):
- # Resize each color channel and unroll it with ravel(),
- # then stack the three channels into one feature vector
- color1 = cv2.resize(img[:,:,0], size).ravel()
- color2 = cv2.resize(img[:,:,1], size).ravel()
- color3 = cv2.resize(img[:,:,2], size).ravel()
- return np.hstack((color1, color2, color3))
-An image template is useful for detecting things that do not vary much in appearance - emoji icons, for example. But for most real-world objects that appear in different forms, orientations, and sizes, this technique does not work very well. In template matching you depend on raw color values laid out in a specific order, which can vary a lot, so you need transformations that are robust to changes in appearance. One such transformation is to compute a histogram of the color values in an image.
-When you compare the histogram of a known object with regions of a test image, locations with a similar color distribution will reveal a close match. We are no longer sensitive to a perfect arrangement of pixels, so objects that appear in slightly different orientations and sizes will still match.
-You can construct histograms of the R, G, and B channels like this:
- -import matplotlib.image as mpimg
-import numpy as np
-
-# Read in the image
-image = mpimg.imread('cutout1.jpg')
-
-# Take histograms in R, G, and B
-rhist = np.histogram(image[:,:,0], bins=32, range=(0, 256))
-ghist = np.histogram(image[:,:,1], bins=32, range=(0, 256))
-bhist = np.histogram(image[:,:,2], bins=32, range=(0, 256))
-With np.histogram(), you don't actually have to specify the number of bins or the range, but here I've arbitrarily chosen 32 bins and specified range=(0, 256) in order to get orderly bin sizes. np.histogram() returns a tuple of two arrays. In this case, for example, rhist[0] contains the counts in each of the bins and rhist[1] contains the bin edges (so it is one element longer than rhist[0]).
-To look at a plot of these results, we can compute the bin centers from the bin edges. Each of the histograms in this case has the same bins, so we can just use the rhist bin edges:
- -# Generating bin centers
-bin_edges = rhist[1]
-bin_centers = (bin_edges[1:] + bin_edges[0:len(bin_edges)-1])/2
-And then plotting the results as bar charts:
- -# Plot a figure with all three bar charts
-fig = plt.figure(figsize=(12,3))
-plt.subplot(131)
-plt.bar(bin_centers, rhist[0])
-plt.xlim(0, 256)
-plt.title('R Histogram')
-plt.subplot(132)
-plt.bar(bin_centers, ghist[0])
-plt.xlim(0, 256)
-plt.title('G Histogram')
-plt.subplot(133)
-plt.bar(bin_centers, bhist[0])
-plt.xlim(0, 256)
-plt.title('B Histogram')
-The output should look like this:
-
These, collectively, are now our feature vector for this particular cutout image. We can concatenate them in the following way:
-hist_features = np.concatenate((rhist[0], ghist[0], bhist[0]))
# Define a function to compute color histogram features
-def color_hist(img, nbins=32, bins_range=(0, 256)):
- # Compute the histogram of the color channels separately
- channel1_hist = np.histogram(img[:,:,0], bins=nbins, range=bins_range)
- channel2_hist = np.histogram(img[:,:,1], bins=nbins, range=bins_range)
- channel3_hist = np.histogram(img[:,:,2], bins=nbins, range=bins_range)
- # Concatenate the histograms into a single feature vector
- hist_features = np.concatenate((channel1_hist[0], channel2_hist[0], channel3_hist[0]))
- # Return the concatenated feature vector
- return hist_features
-Read more about it here - https://www.learnopencv.com/histogram-of-oriented-gradients/
-The scikit-image package has a built in function to extract Histogram of Oriented Gradient features. The documentation for this function can be found here and a brief explanation of the algorithm and tutorial can be found here.
-The scikit-image hog() function takes in a single color channel or grayscale image as input, as well as various parameters. These parameters include orientations, pixels_per_cell and cells_per_block.
-The number of orientations is specified as an integer, and represents the number of orientation bins that the gradient information will be split up into in the histogram. Typical values are between 6 and 12 bins.
-The pixels_per_cell parameter specifies the cell size over which each gradient histogram is computed. This paramater is passed as a 2-tuple so you could have different cell sizes in x and y, but cells are commonly chosen to be square.
-The cells_per_block parameter is also passed as a 2-tuple, and specifies the local area over which the histogram counts in a given cell will be normalized. Block normalization is not necessarily required, but generally leads to a more robust feature set.
-There is another optional power law or "gamma" normalization scheme set by the flag transform_sqrt. This type of normalization may help reduce the effects of shadows or other illumination variation, but will cause an error if your image contains negative values (because it's taking the square root of image values).
-
This is where things get a little confusing though. Let's say you are computing HOG features for an image like the one shown above that is 64×64 pixels. If you set pixels_per_cell=(8, 8), cells_per_block=(2, 2) and orientations=9, how many elements will you have in your HOG feature vector for the entire image?
-You might guess the number of orientations times the number of cells, or 9×8×8=576, but that's not the case if you're using block normalization! In fact, the HOG features for all cells in each block are computed at each block position and the block steps across and down through the image cell by cell.
-So, the actual number of features in your final feature vector will be the total number of block positions multiplied by the number of cells per block, times the number of orientations; in the case shown above: 7×7×2×2×9 = 1764. For the example above, you would call the hog() function on a single color channel img like this:
- -from skimage.feature import hog
-pix_per_cell = 8
-cell_per_block = 2
-orient = 9
-
-features, hog_image = hog(img, orientations=orient,
- pixels_per_cell=(pix_per_cell, pix_per_cell),
- cells_per_block=(cell_per_block, cell_per_block),
- visualise=True, feature_vector=False,
- block_norm="L2-Hys")
-The visualise=True flag tells the function to output a visualization of the HOG feature computation as well, which we're calling hog_image in this case. If we take a look at a single color channel for a random car image and its corresponding HOG visualization, they look like this:
-
The HOG visualization is not actually the feature vector, but rather, a representation that shows the dominant gradient direction within each cell with brightness corresponding to the strength of gradients in that cell, much like the "star" representation in the last video.
-If you look at the features output, you'll find it's an array of shape 7×7×2×2×9. This corresponds to the fact that a grid of 7×7 blocks were sampled, with 2×2 cells in each block and 9 orientations per cell. You can unroll this array into a feature vector using features.ravel(), which yields, in this case, a one dimensional array of length 1764.
-Alternatively, you can set the feature_vector=True flag when calling the hog() function to automatically unroll the features. In the project, it could be useful to have a function defined that you could pass an image to with specifications for orientations, pixels_per_cell, and cells_per_block, as well as flags set for whether or not you want the feature vector unrolled and/or a visualization image.
- -# Define a function to return HOG features and visualization
-def get_hog_features(img, orient, pix_per_cell, cell_per_block,
- vis=False, feature_vec=True):
- # Call with two outputs if vis==True
- if vis == True:
- features, hog_image = hog(img, orientations=orient, pixels_per_cell=(pix_per_cell, pix_per_cell),
- cells_per_block=(cell_per_block, cell_per_block), transform_sqrt=True,
- visualise=vis, feature_vector=feature_vec)
- return features, hog_image
- # Otherwise call with one output
- else:
- features = hog(img, orientations=orient, pixels_per_cell=(pix_per_cell, pix_per_cell),
- cells_per_block=(cell_per_block, cell_per_block), transform_sqrt=True,
- visualise=vis, feature_vector=feature_vec)
- return features
-# Define a function to extract features from a list of images
-# Have this function call bin_spatial() and color_hist()
-def extract_features(imgs, cspace='RGB', orient=9,
- pix_per_cell=8, cell_per_block=2, hog_channel=0,
- spatial_size=(16, 16), hist_bins=32, hist_range=(0, 256)):
- # Create a list to append feature vectors to
- features = []
- # Iterate through the list of images
- for file in imgs:
- file_features = []
- # Read in each one by one
- image = mpimg.imread(file)
- # NOTE: mpimg.imread scales png data to [0, 1]; the commented line below
- # would rescale jpg data to match if both formats were mixed.
- #image = image.astype(np.float32)/255
- # apply color conversion.
- feature_image = convert_color(image, cspace)
-
- spatial_features = bin_spatial(feature_image, size=spatial_size)
- file_features.append(spatial_features)
- # Apply color_hist() also with a color space option now
- hist_features = color_hist(feature_image, nbins=hist_bins, bins_range=hist_range)
- file_features.append(hist_features)
-
- # Call get_hog_features() with vis=False, feature_vec=True
- if hog_channel == 'ALL':
- hog_features = []
- for channel in range(feature_image.shape[2]):
- hog_features.append(get_hog_features(feature_image[:,:,channel],
- orient, pix_per_cell, cell_per_block,
- vis=False, feature_vec=True))
- hog_features = np.ravel(hog_features)
- else:
- hog_features = get_hog_features(feature_image[:,:,hog_channel], orient,
- pix_per_cell, cell_per_block, vis=False, feature_vec=True)
- file_features.append(hog_features)
- # Append the new feature vector to the features list.
- features.append(np.concatenate(file_features))
- # Return list of feature vectors
- return features
-How many windows?
-
To implement a sliding window search, you need to decide what size window you want to search, where in the image you want to start and stop your search, and how much you want windows to overlap. So, let's try an example to see how many windows we would be searching given a particular image size, window size, and overlap.
-Suppose you have an image that is 256 x 256 pixels and you want to search windows of a size 128 x 128 pixels each with an overlap of 50% between adjacent windows in both the vertical and horizontal dimensions. Your sliding window search would then look like this:
-
The goal here is to write a function that takes in an image, start and stop positions in both x and y (imagine a bounding box for the entire search region), window size (x and y dimensions), and overlap fraction (also for both x and y). The function should return a list of bounding boxes for the search windows, which will then be passed to the draw_boxes() function.
- -def slide_window(img, x_start_stop=[None, None], y_start_stop=[None, None],
- xy_window=(64, 64), xy_overlap=(0.5, 0.5)):
- # If x and/or y start/stop positions not defined, set to image size.
- # Work on local copies so the mutable default arguments never get modified.
- x_start_stop = [x_start_stop[0] if x_start_stop[0] is not None else 0,
- x_start_stop[1] if x_start_stop[1] is not None else img.shape[1]]
- y_start_stop = [y_start_stop[0] if y_start_stop[0] is not None else 0,
- y_start_stop[1] if y_start_stop[1] is not None else img.shape[0]]
-
- # Compute the span of the region to be searched
- xspan = x_start_stop[1] - x_start_stop[0]
- yspan = y_start_stop[1] - y_start_stop[0]
-
- # Compute the number of pixels per step in x/y
- nx_pix_per_step = np.int(xy_window[0]*(1 - xy_overlap[0]))
- ny_pix_per_step = np.int(xy_window[1]*(1 - xy_overlap[1]))
-
- # Compute the number of windows in x/y
- nx_buffer = np.int(xy_window[0]*(xy_overlap[0]))
- ny_buffer = np.int(xy_window[1]*(xy_overlap[1]))
-
- nx_windows = np.int((xspan-nx_buffer)/nx_pix_per_step)
- ny_windows = np.int((yspan-ny_buffer)/ny_pix_per_step)
-
- # Initialize a list to append window positions to
- window_list = []
-
- # Loop through finding x and y window positions
- # Note: you could vectorize this step, but in practice
- # you'll be considering windows one by one with your
- # classifier, so looping makes sense
- for ys in range(ny_windows):
- for xs in range(nx_windows):
- # Calculate window position
- startx = xs*nx_pix_per_step + x_start_stop[0]
- endx = startx + xy_window[0]
- starty = ys*ny_pix_per_step + y_start_stop[0]
- endy = starty + xy_window[1]
- # Append window position to list
- window_list.append(((startx, starty), (endx, endy)))
- # Return the list of windows
- return window_list
-def draw_boxes(img, bboxes, color=(0, 0, 255), thick=6):
- # Make a copy of the image
- imcopy = np.copy(img)
- # Iterate through the bounding boxes
- for bbox in bboxes:
- # Draw a rectangle given bbox coordinates
- cv2.rectangle(imcopy, bbox[0], bbox[1], color, thick)
- # Return the image copy with boxes drawn
- return imcopy
-# Define a function to extract features from a single image window
-# This function is very similar to extract_features()
-# just for a single image rather than list of images
-def single_img_features(img, color_space='RGB', spatial_size=(32, 32),
- hist_bins=32, orient=9,
- pix_per_cell=8, cell_per_block=2, hog_channel=0,
- spatial_feat=True, hist_feat=True, hog_feat=True):
- #1) Define an empty list to receive features
- img_features = []
- #2) Apply color conversion if other than 'RGB'
- if color_space != 'RGB':
- if color_space == 'HSV':
- feature_image = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
- elif color_space == 'LUV':
- feature_image = cv2.cvtColor(img, cv2.COLOR_RGB2LUV)
- elif color_space == 'HLS':
- feature_image = cv2.cvtColor(img, cv2.COLOR_RGB2HLS)
- elif color_space == 'YUV':
- feature_image = cv2.cvtColor(img, cv2.COLOR_RGB2YUV)
- elif color_space == 'YCrCb':
- feature_image = cv2.cvtColor(img, cv2.COLOR_RGB2YCrCb)
- else:
- feature_image = np.copy(img)
-
- #3) Compute spatial features if flag is set
- if spatial_feat == True:
- spatial_features = bin_spatial(feature_image, size=spatial_size)
-
- #4) Append features to list
- img_features.append(spatial_features)
-
- #5) Compute histogram features if flag is set
- if hist_feat == True:
- hist_features = color_hist(feature_image, nbins=hist_bins)
-
- #6) Append features to list
- img_features.append(hist_features)
-
- #7) Compute HOG features if flag is set
- if hog_feat == True:
- if hog_channel == 'ALL':
- hog_features = []
- for channel in range(feature_image.shape[2]):
- hog_features.extend(get_hog_features(feature_image[:,:,channel],
- orient, pix_per_cell, cell_per_block,
- vis=False, feature_vec=True))
- else:
- hog_features = get_hog_features(feature_image[:,:,hog_channel], orient,
- pix_per_cell, cell_per_block, vis=False, feature_vec=True)
- #8) Append features to list
- img_features.append(hog_features)
-
- #9) Return concatenated array of features
- return np.concatenate(img_features)
-# Define a function you will pass an image
-# and the list of windows to be searched (output of slide_window())
-
-def search_windows(img, windows, clf, scaler, color_space='RGB',
- spatial_size=(16, 16), hist_bins=32,
- hist_range=(0, 256), orient=9,
- pix_per_cell=8, cell_per_block=2,
- hog_channel=0, spatial_feat=True,
- hist_feat=True, hog_feat=True):
-
- #1) Create an empty list to receive positive detection windows
- on_windows = []
- #2) Iterate over all windows in the list
- for window in windows:
- #3) Extract the test window from original image
- test_img = cv2.resize(img[window[0][1]:window[1][1], window[0][0]:window[1][0]], (64, 64))
- #4) Extract features for that window using single_img_features()
- features = single_img_features(test_img, color_space=color_space,
- spatial_size=spatial_size, hist_bins=hist_bins,
- orient=orient, pix_per_cell=pix_per_cell,
- cell_per_block=cell_per_block,
- hog_channel=hog_channel, spatial_feat=spatial_feat,
- hist_feat=hist_feat, hog_feat=hog_feat)
- #5) Scale extracted features to be fed to classifier
- test_features = scaler.transform(np.array(features).reshape(1, -1))
- #6) Predict using your classifier (on the scaled features)
- prediction = clf.predict(test_features)
- #7) If positive (prediction == 1) then save the window
- if prediction == 1:
- on_windows.append(window)
- #8) Return windows for positive detections
- return on_windows
-# Convert windows to heatmap numpy array.
-def create_heatmap(windows, image_shape):
- background = np.zeros(image_shape[:2])
- for window in windows:
- background[window[0][1]:window[1][1], window[0][0]:window[1][0]] += 1
- return background
-
-# find the nonzero areas from a heatmap and
-# turn them to windows
-def find_windows_from_heatmap(image):
- hot_windows = []
- # Threshold the heatmap
- thres = 0
- image[image <= thres] = 0
- # Set labels
- labels = ndi.label(image)
- # iterate through labels and find windows
- for car_number in range(1, labels[1]+1):
- # Find pixels with each car_number label value
- nonzero = (labels[0] == car_number).nonzero()
- # Identify x and y values of those pixels
- nonzeroy = np.array(nonzero[0])
- nonzerox = np.array(nonzero[1])
- # Define a bounding box based on min/max x and y
- bbox = ((np.min(nonzerox), np.min(nonzeroy)), (np.max(nonzerox), np.max(nonzeroy)))
- hot_windows.append(bbox)
- return hot_windows
-
-def combine_boxes(windows, image_shape):
- hot_windows = []
- image = None
- if len(windows)>0:
- # Create heatmap with windows
- image = create_heatmap(windows, image_shape)
- # find boxes from heatmap
- hot_windows = find_windows_from_heatmap(image)
- # return new windows
- return hot_windows
-# Divide up into cars and notcars
-car_images = glob.glob('./images/vehicles/vehicles/*/*png')
-non_car_images = glob.glob('./images/non-vehicles/non-vehicles/*/*png')
-cars = []
-notcars = []
-for image in car_images:
- cars.append(image)
-
-for image in non_car_images:
- notcars.append(image)
-
-
-colorspace = 'YCrCb' # Can be RGB, HSV, LUV, HLS, YUV, YCrCb
-orient = 8
-pix_per_cell = 8
-cell_per_block = 2
-hog_channel = "ALL" # Can be 0, 1, 2, or "ALL"
-spatial_size = (16, 16)
-hist_bins = 32
-hist_range=(0, 256)
-
-#DO REMEMBER TO CHANGE THIS TO TRUE WHILE TRAINING, AND BACK TO FALSE AFTER TRAINING!!
-train_model = False
-#HLS, 4, 8, 95+
-#YCrCb, 4, 8, 95+
-filename_train = './classifier.joblib.pkl'
-filename_scaler = './scaler.joblib.pkl'
-
-Now that we've got several feature extraction methods in our toolkit, we're almost ready to train a classifier, but first, as in any machine learning application, we need to normalize our data. Python's sklearn package provides the StandardScaler() method to accomplish this task. To read more about how you can choose different normalizations with the StandardScaler() method, check out the documentation.
To apply StandardScaler() we first need to have our data in the right format: a numpy array where each row is a single feature vector. We can create a list of feature vectors and then convert them like this:
- -import numpy as np
-feature_list = [feature_vec1, feature_vec2, ...]
-# Create an array stack, NOTE: StandardScaler() expects np.float64
-X = np.vstack(feature_list).astype(np.float64)
-You can then fit a scaler to X, and scale it like this:
- -from sklearn.preprocessing import StandardScaler
-# Fit a per-column scaler
-X_scaler = StandardScaler().fit(X)
-# Apply the scaler to X
-scaled_X = X_scaler.transform(X)
-Now, scaled_X contains the normalized feature vectors.
-Now we need to write a function that takes in a list of image filenames, reads them one by one, applies a color conversion (if necessary), and uses bin_spatial() and color_hist() to generate feature vectors. The function should then concatenate those two feature vectors and append the result to a list. After cycling through all the images, it should return the list of feature vectors. Something like this:
-# Define a function to extract features from a list of images
-# Have this function call bin_spatial() and color_hist()
-# (named extract_color_features here so it doesn't clobber the HOG-enabled
-# extract_features() defined earlier, which is the one used for training)
-def extract_color_features(imgs, cspace='RGB', spatial_size=(32, 32),
- hist_bins=32, hist_range=(0, 256)):
- # Create a list to append feature vectors to
- features = []
- # Iterate through the list of images
- for file in imgs:
- # Read in each one by one
- image = mpimg.imread(file)
- # apply color conversion if other than 'RGB'
- feature_image = convert_color(image, cspace)
- # Apply bin_spatial() to get spatial color features
- spatial_features = bin_spatial(feature_image, size=spatial_size)
- # Apply color_hist() to get color histogram features
- hist_features = color_hist(feature_image, nbins=hist_bins, bins_range=hist_range)
- # Append the new feature vector to the features list
- features.append(np.concatenate((spatial_features, hist_features)))
- # Return list of feature vectors
- return features
-
-We can optimize the Gamma and C parameters for an SVC classifier.
-Successfully tuning your algorithm involves searching for a kernel, a gamma value and a C value that minimize prediction error. To tune your SVM vehicle detection model, you can use one of scikit-learn's parameter tuning algorithms.
-When tuning SVM, remember that you can only tune the C parameter with a linear kernel. For a non-linear kernel, you can tune C and gamma.
-Scikit-learn includes two algorithms for carrying out an automatic parameter search:
- -GridSearchCV exhaustively works through multiple parameter combinations, cross-validating as it goes. The beauty is that it can work through many combinations in only a couple extra lines of code.
-For example, if I input the values C:[0.1, 1, 10] and gamma:[0.1, 1, 10], GridSearchCV will train and cross-validate every possible combination of (C, gamma): (0.1, 0.1), (0.1, 1), (0.1, 10), (1, 0.1), (1, 1), and so on.
-RandomizedSearchCV works similarly to GridSearchCV except RandomizedSearchCV takes a random sample of parameter combinations. RandomizedSearchCV is faster than GridSearchCV since RandomizedSearchCV uses a subset of the parameter combinations.
-GridSearchCV uses 3-fold cross validation to determine the best performing parameter set. GridSearchCV will take in a training set and divide the training set into three equal partitions. The algorithm will train on two partitions and then validate using the third partition. Then GridSearchCV chooses a different partition for validation and trains with the other two partitions. Finally, GridSearchCV uses the last remaining partition for cross-validation and trains with the other two partitions.
-By default, GridSearchCV uses accuracy as an error metric by averaging the accuracy for each partition. So for every possible parameter combination, GridSearchCV calculates an accuracy score. Then GridSearchCV will choose the parameter combination that performed the best.
-scikit-learn Cross Validation Example. Here's an example from the sklearn documentation for implementing GridSearchCV:
- -parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
-svr = svm.SVC()
-clf = grid_search.GridSearchCV(svr, parameters)
-clf.fit(iris.data, iris.target)
-Let's break this down line by line.
- -parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
-A dictionary of the parameters, and the possible values they may take. In this case, they're playing around with the kernel (possible choices are 'linear' and 'rbf'), and C (possible choices are 1 and 10).
-Then a 'grid' of all the following combinations of values for (kernel, C) are automatically generated:
-('rbf', 1), ('rbf', 10), ('linear', 1), ('linear', 10)
-Each is used to train an SVM, and the performance is then assessed using cross-validation.
- -svr = svm.SVC()
-This looks kind of like creating a classifier, just like we've been doing since the first lesson. But note that clf isn't made until the next line; this is just saying what kind of algorithm to use. Another way to think about this is that the "classifier" isn't just the algorithm in this case, it's the algorithm plus the parameter values. Note that there's no monkeying around with the kernel or C; all that is handled in the next line.
- -clf = grid_search.GridSearchCV(svr, parameters)
-This is where the first bit of magic happens; the classifier is being created. We pass the algorithm (svr) and the dictionary of parameters to try (parameters) and it generates a grid of parameter combinations to try.
- -clf.fit(iris.data, iris.target)
-And the second bit of magic. The fit function now tries all the parameter combinations and returns a fitted classifier that's automatically tuned to the optimal parameter combination. You can now access the parameter values via clf.best_params_.
- -# parameters for GridSearchCV
-#grid_search_parameters = {'kernel':('linear', 'rbf', 'poly'), 'C':[0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}
-if train_model:
- t=time.time()
- car_features = extract_features(cars, cspace=colorspace, orient=orient,
- pix_per_cell=pix_per_cell, cell_per_block=cell_per_block,
- hog_channel=hog_channel)
- notcar_features = extract_features(notcars, cspace=colorspace, orient=orient,
- pix_per_cell=pix_per_cell, cell_per_block=cell_per_block,
- hog_channel=hog_channel)
- t2 = time.time()
- print(round(t2-t, 2), 'Seconds to extract HOG features...')
-
- print("car feature shape: ", len(car_features))
- print("non-car feature shape: ", len(notcar_features))
- # Create an array stack of feature vectors
- X = np.vstack((car_features, notcar_features)).astype(np.float64)
- y = np.hstack((np.ones(len(car_features)), np.zeros(len(notcar_features))))
-
- # Fit a per-column scaler
- X_scaler = StandardScaler().fit(X)
- # Apply the scaler to X
-
- scaled_X = X_scaler.transform(X)
-
-
- # Split up data into randomized training and test sets
- rand_state = np.random.randint(0, 100)
- X_train, X_test, y_train, y_test = train_test_split(
- scaled_X, y, test_size=0.2, random_state=rand_state)
-
- # Use a linear SVC (gamma has no effect with a linear kernel)
- clf = svm.SVC(kernel='linear', C=0.001)
- # Check the training time for the SVC
- t=time.time()
- #clf = GridSearchCV(svc, grid_search_parameters)
- clf.fit(X_train, y_train)
- t2 = time.time()
- print(round(t2-t, 2), 'Seconds to train SVC...')
- # Check the score of the SVC
- print('Test Accuracy of SVC = ', round(clf.score(X_test, y_test), 4))
- # Check the prediction time for a single sample
- t=time.time()
- n_predict = 10
- print('My SVC predicts: ', clf.predict(X_test[0:n_predict]))
- print('For these',n_predict, 'labels: ', y_test[0:n_predict])
- t2 = time.time()
- print(round(t2-t, 5), 'Seconds to predict', n_predict,'labels with SVC')
-
- # save the trained model
- _ = joblib.dump(clf, filename_train, compress=9)
- _ = joblib.dump(X_scaler, filename_scaler, compress=9)
-else:
- # load the trained model
- clf = joblib.load(filename_train)
- X_scaler = joblib.load(filename_scaler)
-def process_image(image):
- """
- Pipeline to detect and track vehicles across images of video frames
- """
- draw_image = np.copy(image)
-
- windows = slide_window(image, x_start_stop=[None, None], y_start_stop=[400, 640],
- xy_window=(96, 96), xy_overlap=(0.75, 0.75))
-
- windows += slide_window(image, x_start_stop=[32, None], y_start_stop=[400, 610],
- xy_window=(144, 144), xy_overlap=(0.75, 0.75))
- windows += slide_window(image, x_start_stop=[410, 1280], y_start_stop=[390, 540],
- xy_window=(192, 192), xy_overlap=(0.75, 0.75))
-
- hot_windows = search_windows(image, windows, clf, X_scaler, color_space=colorspace,
- spatial_size=spatial_size, hist_bins=hist_bins,
- orient=orient, pix_per_cell=pix_per_cell,
- cell_per_block=cell_per_block,
- hog_channel=hog_channel, spatial_feat=True,
- hist_feat=True, hog_feat=True)
-
-
- #draw_image = draw_boxes(draw_image, hot_windows, color=(255, 0, 0), thick=6)
- combined_windows = combine_boxes(hot_windows, image.shape)
- filtered_windows = []
- # no car detection yet, create new detections and add them to the list.
- if len(detections) == 0:
- for window in combined_windows:
- box_points = get_box_points(window)
- new_car = Detection()
- new_car.add(box_points)
- detections.append(new_car)
- window_img = draw_boxes(draw_image, filtered_windows, color=(0, 0, 255), thick=6)
- return window_img
- else:
- boxes_copy = copy.copy(combined_windows)
- # Run through all the existing detections and see if any new detection
- # matches with them.
- # if match is found add to the detection.
- # If not found decrease the confidence of the previous detection.
- non_detected_cars_idxs = []
- for car_idx, car in enumerate(detections):
- match_found = False
- box_detection_idx = 0
- for idx, box in enumerate(boxes_copy):
- box_points = get_box_points(box)
- if car.match_detection(box_points):
- match_found = True
- if car.consecutive_detection >= min_consecutive_detection:
- average_box = car.average_detections()
- filtered_windows.append(((average_box[0],average_box[1]),(average_box[2], average_box[3])))
-
- # remove after the match.
- box_detection_idx = idx
- # Match for the car is found, break the inner loop
- break
-
- # Match not found for the previous detection, decrease its confidence.
- # If delete_Detection comes back true, remove it from the list of previous detections.
- if not match_found:
- delete_Detection = car.failed_detect()
- if delete_Detection:
- non_detected_cars_idxs.append(car_idx)
- else:
- average_box = car.average_detections()
- filtered_windows.append(((average_box[0],average_box[1]),(average_box[2], average_box[3])))
- else:
- # Delete the detected box from the list of boxes to be matched.
- del boxes_copy[box_detection_idx]
-
- # Remove all the undetected cars from the list of detections using their saved index.
- if len(non_detected_cars_idxs) > 0:
- non_detected_cars_idxs = non_detected_cars_idxs[::-1]
- for i in non_detected_cars_idxs:
- del detections[i]
-
- # Add the unmatched boxes to the detections array.
- for box in boxes_copy:
- box_points = get_box_points(box)
- new_car = Detection()
- new_car.add(box_points)
- detections.append(new_car)
-
-
-
-
- window_img = draw_boxes(draw_image, filtered_windows, color=(0, 0, 255), thick=6)
-
- return window_img
-
-def get_box_points(box):
- """
- Takes in box points of form ((x1,y1), (x2, y2)) and converts it to form
- [x1, y1, x2, y2].
- """
- box_points = []
- x1, y1 = box[0]
- x2, y2 = box[1]
-
- box_points.append(x1)
- box_points.append(y1)
- box_points.append(x2)
- box_points.append(y2)
- return box_points
-
-
-margin = 100
-min_consecutive_detection = 8
-max_allowed_miss = 4
-confidence_thresh = 10
-
-def is_within_margin(a, b):
- return abs(a - b) <= margin
-
-class Detection():
- def __init__(self):
- # the box coordinates in the form [x1,y1,x2,y2]
- self.last_box = []
- # number of consecutive frames in which the car has been detected.
- self.consecutive_detection = 0
- # number of consecutive frames in which the car has not been found.
- self.consecutive_miss = 0
- # the box coordinates of last n detections in the form deque([[x1, y1, x2, y2], [x1, y1, x2, y2], [x1, y1, x2, y2]...], maxlen=5)
- self.last_n_detections = deque(maxlen=10)
- # [avg x1 , avg y1, avg x2, avgy2] of last n detections.
- self.average_box = []
-
- def add(self, box):
- """
- box argument should be of format [x1, y1, x2, y2]
- """
- self.last_box = box
- self.consecutive_detection = self.consecutive_detection + 1
- self.last_n_detections.append(box)
- self.average_detections()
- # set the previous count of consecutive misses to 0.
- self.consecutive_miss = 0
-
- def average_detections(self):
- """
- Find the mean of detections in the deque.
- """
-
- self.average_box = np.mean(self.last_n_detections, axis=0)
- return self.average_box
-
- def match_detection(self, box):
- """
- Checks whether the box is very close/similar to the [x1, y1, x2, y2]
- box argument should be of format [x1, y1, x2, y2]
- """
- i = 0
- for point in box:
- # see if all the points in the box lies within the margin of the last detection.
-
- if not is_within_margin(point, self.last_box[i]):
- return False
- i = i + 1
- # If the match found then add it to the detection.
- self.add(box)
- return True
-
- def failed_detect(self):
- delete_detection = True
- self.consecutive_miss = self.consecutive_miss + 1
- # If the car doesn't get detected for more than max_allowed_miss frames
- # consecutively, we discard the object.
- if self.consecutive_miss > max_allowed_miss:
- return delete_detection
- # This helps remove stray false positives which don't get detected in
- # consecutive frames.
- if self.consecutive_detection < min_consecutive_detection:
- return delete_detection
-
-
- # Otherwise wait until the miss count becomes greater than max_allowed_miss.
- return False
-
-
-# array of Detection class.
-detections = []
-# output video directory
-video_output = './video-tracking-output.mp4'
-# input video directory
-clip1 = VideoFileClip("project_video.mp4")
-# video processing pipeline
-#video_clip = clip1.fl_image(process_image)
-# write processed files
-#video_clip.write_videofile(video_output, audio=False)
-