Introduction
This study aims to identify the factors that most affect house prices in a given area, using the Boston housing price dataset as an example. The dataset has been distributed through the UCI Machine Learning Repository; it is available in R through the MASS library and was bundled with releases of Python's sklearn library before version 1.2. It contains 13 factors, such as per capita crime rate, pupil-teacher ratio, population composition, and average number of rooms per dwelling, that may influence housing prices. This study first conducts an exploratory data analysis of the dataset and then uses multiple linear regression to predict housing prices and determine the importance of each feature.
Exploratory Data Analysis
Basic Analysis
Including house prices, this dataset has 14 features and 506 samples in total. A description of the meaning of the features is as follows:
- CRIM: Per capita crime rate by town
- ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
- INDUS: Proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: Nitric oxide concentration (parts per 10 million)
- RM: Average number of rooms per dwelling
- AGE: Proportion of owner-occupied units built prior to 1940
- DIS: Weighted distances to five Boston employment centers
- RAD: Index of accessibility to radial highways
- TAX: Full-value property tax rate per $10,000
- PTRATIO: Pupil-teacher ratio by town
- B: 1000(Bk - 0.63)², where Bk is the proportion of people of African American descent by town
- LSTAT: Percentage of lower status of the population
- MEDV: Median value of owner-occupied homes in USD 1000's
All features are numeric variables, except CHAS, which is a dummy variable. The median, mean, and other summary statistics of each feature can be inspected with summary() in R.
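The inspection step can be sketched as follows. This is a minimal sketch, assuming the copy of the data that ships in the MASS package; that copy uses lowercase column names (for example `crim`, `medv`, and `black` for B) and may differ from the file used in this study.

```r
library(MASS)  # provides the Boston data frame

data(Boston)

dim(Boston)      # number of samples and number of columns
summary(Boston)  # min, quartiles, median, mean, and max per column

# Gap between mean and median as a rough signal of skewness
# (a large gap relative to the scale of the column hints at a skewed distribution)
mean_median_gap <- sapply(Boston, function(x) mean(x) - median(x))
round(mean_median_gap, 2)
```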
The summary statistics show a large gap between the median and the mean for CRIM, ZN, TAX, and B. This suggests that the distributions of these four features are skewed and contain outliers. Box plots allow a more detailed look at the data distribution.
The box plots show that ZN, CRIM, and B clearly depart from a normal distribution and have many outliers. Introducing these variables into the multiple linear regression model is therefore likely to introduce a large variance.
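One way to reproduce this check is sketched below, assuming the MASS copy of the data with its lowercase column names. Counting the points beyond the box plot whiskers with `boxplot.stats()` is an illustrative addition, not necessarily what this study did.

```r
library(MASS)  # provides the Boston data frame

vars <- c("crim", "zn", "tax", "black")

# Box plots on standardized values so the four features share one axis
boxplot(scale(Boston[vars]), main = "Standardized feature distributions")

# Number of points beyond the whiskers (1.5 * IQR rule) for each feature
outlier_counts <- sapply(Boston[vars], function(x) length(boxplot.stats(x)$out))
outlier_counts
```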
Next, we analyze the distribution of the target variable. The density plot shows that the values of MEDV are approximately normally distributed with few outliers.
Correlation Analysis
The correlation coefficient measures the strength of the linear relationship between the target variable and each independent variable. There are many ways to calculate correlation coefficients in R. This study uses the psych library to calculate the correlation between each independent variable and the target variable and to draw scatter plots of the distributions at the same time.
The correlation matrix shows that all 13 features in the dataset have some correlation with the target variable MEDV. Among them, CRIM, INDUS, NOX, AGE, RAD, TAX, PTRATIO, and LSTAT are negatively correlated with house prices, while the other features are positively correlated. It is worth noting that DIS and LSTAT seem to show nonlinear relationships with house prices, which implies that a nonlinear component may need to be introduced into the multiple linear regression model.
Based on the correlation coefficients, LSTAT and RM are the features most strongly related to Boston house prices. As the average number of rooms per dwelling increases, the house price tends to rise. Conversely, the greater the percentage of lower-status population, the lower the house price tends to be.
In addition, many independent variables in this dataset are strongly linearly correlated with each other. The correlation matrix shows linear correlations among NOX, RM, DIS, and RAD, with correlation coefficients of 0.9 or more. If these variables are introduced into the linear regression model at the same time, they may cause multicollinearity.
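The two readings of the correlation matrix described above can be sketched with base R alone. This sketch assumes the MASS copy of the data (lowercase names) and uses `cor()` instead of the psych library; the 0.7 threshold is an arbitrary illustrative choice, and the coefficients may differ from those reported here for the study's own file.

```r
library(MASS)  # provides the Boston data frame

cors <- cor(Boston)  # Pearson correlation matrix of all 14 columns

# Correlation of each feature with the target, sorted from most negative to most positive
target_cor <- sort(cors[, "medv"][colnames(cors) != "medv"])
round(target_cor, 2)

# Pairs of predictors whose absolute correlation exceeds the threshold
pred <- cors[rownames(cors) != "medv", colnames(cors) != "medv"]
high <- which(abs(pred) > 0.7 & upper.tri(pred), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(pred)[high[, 1]],
  var2 = colnames(pred)[high[, 2]],
  r    = round(pred[high], 2)
)
high_pairs
```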
Multiple Linear Model
Before building the model, the data needs to be pre-processed. First, we should detect the missing values in the dataset.
There are 54 missing values in the MEDV column of the dataset. After deleting the rows with missing values, we normalize the features of the dataset. Normalization puts the features on a common scale so that the weights in the regression model reflect how strongly each feature influences house prices. After normalization, we use lm() in R to build a multiple linear regression model.
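The preprocessing and fitting steps can be sketched as follows. This is a minimal sketch under two assumptions: it uses the MASS copy of the data, which has no missing values (so the row filter is a no-op here and the fit statistics will not match those reported in this study), and it standardizes only the predictors, leaving MEDV in its original units.

```r
library(MASS)  # provides the Boston data frame

df <- Boston

# Count missing values per column, then keep only complete rows
colSums(is.na(df))
df <- df[complete.cases(df), ]

# Standardize each predictor to mean 0 and standard deviation 1
predictors <- setdiff(names(df), "medv")
df[predictors] <- lapply(df[predictors], function(x) as.numeric(scale(x)))

# Multiple linear regression of the target on all predictors
fit <- lm(medv ~ ., data = df)
summary(fit)
```

With standardized predictors, each coefficient is the expected change in MEDV for a one standard deviation change in that feature, which is what makes the weights comparable.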
For the overall model, the adjusted R-squared is 0.7328 and the F-statistic is 96.16, with a p-value far below 0.05. This indicates that the model is effective and can explain the variation in house prices to a certain extent.
Looking at each feature in the model, the Pr(>|t|) values of CRIM, INDUS, and AGE are greater than 0.05. In a linear model, the null hypothesis is that the coefficient of a feature is zero, and the alternative hypothesis is that the coefficient is not zero, which means there is a relationship between the feature and the dependent variable. Since these Pr(>|t|) values are greater than 0.05, we are unable to demonstrate a significant linear relationship between these three features and house prices.
Feature selection for linear models can be performed with step() in R. The linear model after feature selection with step() is shown in the following figure.
According to Pr(>|t|) and the p-value of the F-statistic, the model is effective as a whole, and each remaining feature has a significant linear relationship with house price. The step() method, which by default compares candidate models by AIC, removes CRIM, AGE, and INDUS, the three features whose coefficients were not significant, to simplify the model.
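A minimal sketch of the selection step, assuming the MASS copy of the data and the default settings of `step()`. Because that copy differs from the file used in this study, the set of dropped features may not match the one reported above.

```r
library(MASS)  # provides the Boston data frame

full <- lm(medv ~ ., data = Boston)

# Stepwise selection by AIC; trace = 0 suppresses the per-step log
reduced <- step(full, trace = 0)
summary(reduced)

# Terms present in the full model but absent from the selected model
dropped <- setdiff(names(coef(full)), names(coef(reduced)))
dropped
```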
The weight coefficients of each feature in the model can reflect its importance for the dependent variable. From the estimated weight, it can be seen that RM and LSTAT are the most important features affecting Boston house prices. This conclusion is consistent with the conclusion summarized from the correlation coefficient matrix.
Since NOX, RM, DIS, and RAD are all included in this model, we need to test it for multicollinearity. The variance inflation factor (VIF) measures the amount of multicollinearity in a set of regression variables. In R, we can use vif() from the car package to compute the VIF of each feature in the regression model.
The VIF results show that the values for TAX and RAD are around 5. This suggests multicollinearity among the features of the model, which requires further investigation.
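To make the VIF concrete, the sketch below computes it from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. It assumes the MASS copy of the data and all 13 predictors, so the values will differ from those of the reduced model discussed above; for a model with only numeric predictors, `car::vif()` should return the same quantity.

```r
library(MASS)  # provides the Boston data frame

predictors <- setdiff(names(Boston), "medv")

# Regress each predictor on the remaining predictors and convert R^2 to a VIF
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = Boston))$r.squared
  1 / (1 - r2)
})

round(sort(vif_manual, decreasing = TRUE), 2)
```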
At the end of this section, by analyzing the distribution of the residuals of the model, I will explore the possible reasons for the poor fit of the model.
Multiple linear regression requires several assumptions:
- Linear relationship: There should be a significant linear relationship between the independent and dependent variables.
- Multivariate normality: Data in each independent variable should be roughly normally distributed.
- Multicollinearity: There should be no significant multicollinearity between the independent variables.
- Homoscedasticity: Variance of the residuals should be roughly the same at each level of the variables (independent and dependent variables).
- Autocorrelation: The residuals should be uncorrelated.
- Normal distribution: The residuals should be normally distributed.
We use shapiro.test() to check whether the residuals follow a normal distribution. The resulting p-value is less than 0.05. Since the null hypothesis of the Shapiro-Wilk test is that the data are normally distributed, we conclude that the residuals of this model are not normally distributed, which violates the normal distribution assumption.
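The normality check can be sketched as follows, assuming the MASS copy of the data and the full linear model; the test statistic will therefore differ from the one obtained in this study.

```r
library(MASS)  # provides the Boston data frame

fit <- lm(medv ~ ., data = Boston)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(residuals(fit))
sw

# TRUE means the null hypothesis is rejected at the 5% level
reject_normality <- sw$p.value < 0.05
reject_normality
```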
Using the plot() method, we can observe more properties of the residuals. In the "Residuals vs Fitted" plot, the residuals reach their minimum at a fitted value of about 20 and rise at both ends of the fitted range. Because the residuals vary systematically with the predicted value, the model appears to violate the homoscedasticity assumption; the curved pattern also hints at a nonlinear relationship that the linear terms do not capture.
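A sketch of this diagnostic, assuming the MASS copy of the data and the full linear model. The binned means are an illustrative addition: they summarize numerically how the average residual changes across the range of fitted values.

```r
library(MASS)  # provides the Boston data frame

fit <- lm(medv ~ ., data = Boston)

# "Residuals vs Fitted" is the first of the standard lm diagnostic plots
plot(fit, which = 1)

# Average residual within each quartile of the fitted values
breaks <- quantile(fitted(fit), probs = seq(0, 1, by = 0.25))
bins <- cut(fitted(fit), breaks = breaks, include.lowest = TRUE)
bin_means <- tapply(residuals(fit), bins, mean)
round(bin_means, 2)
```

If the model were well specified, the binned means would all sit close to zero.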
Using dwtest() from the lmtest library, we can test whether the residuals are autocorrelated. The Durbin-Watson test gives a DW value below 2 and a p-value much less than 0.05, which indicates significant positive autocorrelation in the residuals.
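The DW statistic itself is simple to compute by hand, which helps in reading the test output: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below assumes the MASS copy of the data and the full linear model, and computes only the statistic; `lmtest::dwtest(fit)` would add the p-value.

```r
library(MASS)  # provides the Boston data frame

fit <- lm(medv ~ ., data = Boston)
e <- residuals(fit)

# Durbin-Watson statistic: near 2 means no first-order autocorrelation,
# below 2 points to positive autocorrelation, above 2 to negative autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

Note that the statistic depends on the row order of the data, so it is only meaningful when that order carries information (for example, neighboring towns listed together).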
Model Improvement
As mentioned before, the features DIS and LSTAT seem to show nonlinear correlations with house prices, so I add nonlinear terms to the model to improve its performance.
After adding the nonlinear terms, the adjusted R-squared increases to around 0.77 and the F-statistic increases to 131.3. Since the Pr(>|t|) values of I(DIS^2) and I(LSTAT^2) are far less than 0.05, we can say that DIS and LSTAT are nonlinearly related to house prices.
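A minimal sketch of adding the quadratic terms, assuming the MASS copy of the data (lowercase names) and all 13 predictors as the base model, so the statistics will not match those reported above.

```r
library(MASS)  # provides the Boston data frame

fit_lin <- lm(medv ~ ., data = Boston)

# I() protects the arithmetic so that ^2 means squaring, not a formula operator
fit_quad <- lm(medv ~ . + I(dis^2) + I(lstat^2), data = Boston)
summary(fit_quad)

# Compare adjusted R-squared before and after adding the quadratic terms
adj_r2 <- c(linear    = summary(fit_lin)$adj.r.squared,
            quadratic = summary(fit_quad)$adj.r.squared)
round(adj_r2, 4)

# F-test for whether the two extra terms jointly improve the fit
anova(fit_lin, fit_quad)
```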
The VIF results show that the multiple linear regression model suffers from multicollinearity. Principal component analysis can be used to address this problem.
Since the cumulative proportion of variance reaches 0.82094 with the top five principal components, we use these five principal components in the multiple linear regression model for prediction.
According to the summary, the multiple linear regression model with principal components reaches an adjusted R-squared of 0.8745, which is a significant improvement over the basic multiple linear regression model. All principal components are strongly associated with house prices, and the F-statistic reaches 629.6.
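Principal component regression can be sketched with base R as follows. This sketch assumes the MASS copy of the data and, importantly, assumes that the components are computed from the 13 standardized predictors only, with MEDV kept out of the PCA; the text does not state how the components were built in this study.

```r
library(MASS)  # provides the Boston data frame

X <- Boston[setdiff(names(Boston), "medv")]

# PCA on centered and scaled predictors
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the leading components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
round(cum_var[1:5], 4)

# Regress the target on the first k component scores
k <- 5
pcs <- as.data.frame(pca$x[, 1:k])
pcs$medv <- Boston$medv
fit_pcr <- lm(medv ~ ., data = pcs)
summary(fit_pcr)
```

Because the component scores are uncorrelated by construction, the multicollinearity problem disappears. Under these assumptions, however, the components are linear combinations of the original predictors, so the R-squared of this regression cannot exceed that of the full linear model on the same predictors; the sketch will therefore not reproduce the 0.8745 reported above, which presumably comes from a different setup.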
Conclusion
In the Boston house price dataset, all features except CRIM, INDUS, and AGE (10 of the 13) have a significant correlation with house prices. The three most influential features are LSTAT, DIS, and RM, and of these, LSTAT and DIS show a significant nonlinear correlation with house price.
The residual analysis makes clear that the multiple linear regression model does not predict house prices well. The distribution of the independent variables violates the multivariate normality assumption, and the distribution of the residuals violates the homoscedasticity, normal distribution, and autocorrelation assumptions. Moreover, the VIF results indicate multicollinearity among the variables. Together, these factors lead to the poor fit of the multiple linear regression model.
By adding a nonlinear component to the basic linear model and using principal component analysis to address multicollinearity, this study improves the performance of the linear model on the Boston housing dataset.