Wine Quality Prediction With Random Forest

Introduction

For winemakers, it is very important to know how to judge the quality of wine by its chemical components. In this report, we analyze the white wine dataset, use random forest algorithm and logistic regression algorithm to build models to distinguish the quality of wine, and determine the importance of each chemical component for wine quality judgment by its weights in both algorithm.

Exploratory Data Analysis

First import the dataset and observe the value and range of each column feature of the data set. The white wine dataset contains a total of 11 metrics of chemical composition and a column indicating the quality of the wine. All indicators are stored in the dataset in numeric form and have different ranges of values.

Analyze Target Value

We then analyzed the distribution of wine quality. The histogram below shows that wines of average quality (scores between 5 and 7) make up the majority of the data set, while wines of very poor quality (scores less than 4) and excellent quality (scores greater than 8) are less common. For quantitative discussions, we define wines with scores of 6 or more as high quality and wines with scores less than 6 as low quality.

Analyze Features

Next, we perform an exploratory analysis of the metrics in the dataset. First, we check whether the metrics contain missing values.

fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
quality                 False
dtype: bool

Since there is no missing value in the dataset, we use box plot to observe the general distribution of each metrics. As can be seen from the box plot below, the distribution of alcohol content in each wine is relatively concentrated, but there are great differences in metrics such as chlorides and volatile acid. Because there are many outliers in these metrics, the accuracy of logistic regression algorithm may be relatively low.

Correlation Analysis

According to the data distribution of each metric in different quality white wine, the metric with obvious differentiation can be roughly identified. As can be seen from the box plot below, the difference of alcohol content is the most obvious among different quality white wines, and high quality wines usually contain more alcohol. Also, the overall pH tends to rise as the quality of the wine improves. However, there are many outliers in the pH metric, so it is difficult to be used as the basis for judging the quality of wine.

We can use Pearson correlation coefficients for a more detailed analysis of the linear relationship between the individual metrics. The heat map below shows the values of the Pearson correlation coefficients between the metrics. It can be seen that density has a strong correlation with residual sugar, alcohol content and total sulfur dioxide. This makes sense, since the density of white wines is originally determined by the type and content of chemicals in the wine. The last column shows the Pearson correlation coefficients between white wine quality and all the metrics. It shows that there may exist linear relationship between white wine quality and alcohol as well as densityï¼Œwhich means these two metrics are important criteria for judging the quality of wine. It also shows that critic acid and free sulfur dioxide do not seem to have a linear correlation with wine quality.

We use the Maximum Information Coefficient(MIC) to detect whether there is a nonlinear or periodic relationship between each metric and the quality of white wine. The results show that alcohol and density have maximum MIC value, which is consistent with the conclusion made from Pearson correlation coefficient. So it is very likely that alcohol and density are the most important factors to measure the quality of wine. Besides, fixed acid and sulfate have the minimum MIC value. Because the conclusion(on the unrelated metrics) of MIC measurement is different from that of Pearson correlation coefficient, no metric is excluded in the model construction.

Maximal information coefficient
fixed acidity 0.027
volatile acidity 0.065
citric acid 0.065
residual sugar 0.077
chlorides 0.082
free sulfur dioxide 0.064
total sulfur dioxide 0.086
density 0.156
pH 0.035
sulphates 0.030
alcohol 0.147

Data Pre-processing

Before building model, we seperate the data into training set with 80% of original data and test set with 20% of the original data. In training set, we have 1312 samples of low quality wine and 2606 samples of high quality wine. In order to solve the imbalanced classification problem, we use SMOTE algorithm to oversample the minority class. After pre-processing, we have the same number of samples in both low quality wine class and high quality wine class.

number of samples in low quality wine(0) and high quality wine(1): 
0    1312
1    2606
Name: 0, dtype: int64


0    2606
1    2606
Name: 0, dtype: int64

Logistic Regression

When building model using logistic regression algorithm, we use sklearn.preprocessing.StandardScaler to normalize the data, and use sklearn.model_selection.GridSearchCV to find the best parameter. After training, logistic regression algorithm achieve 0.761 in accuracy score and 0.831 in f1 score.

Best: 0.725469 using {'logisticregression__C': 10}
0.723603  with:   {'logisticregression__C': 0.1}
0.724318  with:   {'logisticregression__C': 1}
0.725469  with:   {'logisticregression__C': 10}


Accuracy score is:  0.7612244897959184
Precision score is:  0.7824324324324324
Recall score is:  0.8880368098159509
F1 score is: 0.8318965517241379

We can judging the importance of different metrics by its weight in logistic regression model. According to the bar plot, alchol, residual sugar, volatile acidity and density play a more important role than other metrics in judging the quality of white wine.

Random Forest

Random forest algorithm is a better way in dealing with data with many outliers. It also prevent the overfitting problem by aggregating many decision trees to give optimal model. Since it uses decision tree as based algorithm, there is no need to normalize the data. By using sklearn.model_selection.GridSearchCV to search best parameter, we build a random forest model which achieve 0.836 in accuracy score and 0.877 in f1 score.

Best: 0.861909 using {'criterion': 'gini', 'n_estimators': 300}
0.856439  with:   {'criterion': 'gini', 'n_estimators': 100}
0.859427  with:   {'criterion': 'gini', 'n_estimators': 200}
0.861909  with:   {'criterion': 'gini', 'n_estimators': 300}
0.859657  with:   {'criterion': 'gini', 'n_estimators': 400}
0.856118  with:   {'criterion': 'entropy', 'n_estimators': 100}
0.858843  with:   {'criterion': 'entropy', 'n_estimators': 200}
0.857771  with:   {'criterion': 'entropy', 'n_estimators': 300}
0.860638  with:   {'criterion': 'entropy', 'n_estimators': 400}


[[246  82]
 [ 78 574]]
0.8367346938775511
0.875
0.8803680981595092
0.8776758409785932

Sklearn library can estimate feature importance in decision tree-based algorithms by calculating difference of Gini index made by each node. In this model, the top4 most important metrics are alcohol, volatile acidity, density and free sulfur dioxide.

Conclusion

According to the correlation measurement made by MIC and Pearson correlation coefficient and feature importance in LR and Random forest model, alcohol is the most important metric in evaluating the quality of white wine. Generally, the higher the alcohol content, the better the quality of the wine. Some other important metrics include density, volatile acidity and residual sugar.