In this post we are going to talk about Linear Regression, one of the most widely used statistical tools in Machine Learning. The idea is simple: we have some features, and we want to know how our predictions change as the values of those features change. In our example the features are the square footage of the house, the number of bathrooms, the number of bedrooms and so on, and the observation is the price of the house.
So we want to generate a model that takes features as input and outputs the predicted price.
First, let’s explain a naive method to create such an application. We will only consider the square feet of the house as a feature and we will try to predict the price of the house. Let’s take our observations and make a plot of them.
The X axis represents the “feature”, square feet. It is also called a “covariate” or “predictor”. The Y axis represents the “observation” that we collect. It is also called the “response” or “dependent variable”. Each point on the graph represents a previous house sale.
So the question is, how are we going to use these observations to estimate the price of a house? One way is to look at how big the house is and find the price range of similarly sized houses, as shown below.
The problem with that approach is that we base our estimate on only 2 house sales. We throw out all the other house sales, and the question is: is that approach reasonable?
Of course not. That approach treats all the other observations as if they had nothing to do with our prediction. We can instead model the relationship between the square footage of the house and the sales price. To do this we are going to use Linear Regression.
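For contrast, the naive lookup described above can be sketched in a few lines of plain Python. The sales list and the function name estimate_from_nearest are made up for illustration; they are not part of the King County dataset or of GraphLab Create.

```python
# Hypothetical past sales as (square feet, price) pairs; not real data.
past_sales = [(850, 210000), (1200, 305000), (1500, 349000),
              (1900, 452000), (2400, 560000), (3100, 735000)]


def estimate_from_nearest(sales, sqft, k=2):
    """Average the prices of the k sales closest in size to sqft."""
    nearest = sorted(sales, key=lambda sale: abs(sale[0] - sqft))[:k]
    return sum(price for _, price in nearest) / float(k)


# Only the two closest sales influence the estimate; the rest are ignored.
print(estimate_from_nearest(past_sales, 1600))
```

Every sale outside the k nearest ones has no effect on the result, which is exactly the weakness discussed above.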
Our main goal is to understand the relationship between the square footage of the house and the house sales price. The simplest model is a straight line fitted to the data.
This line is defined by:
Y = W0 + W1 * X
W0 is the intercept and W1 is the weight (slope) on the feature X. The intercept and slope are the parameters of our model.
So now the question is, which line is the best line? We need to define a cost for a given line in order to find the best fit. We will use the Root Mean Square Error (RMSE) as that cost, and the best line is the one that minimizes it.
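As a rough sketch of what "finding the best line" means, the snippet below computes the RMSE of a line and picks W0 and W1 with the standard closed-form least-squares solution (minimizing the RMSE is equivalent to minimizing the sum of squared errors). The function names and the toy numbers are assumptions for illustration; GraphLab Create's own solver may work differently.

```python
import math


def predict(w0, w1, x):
    """Price predicted by the line with intercept w0 and slope w1."""
    return w0 + w1 * x


def rmse(w0, w1, xs, ys):
    """Root mean square error of the line on the points (xs, ys)."""
    squared = [(y - predict(w0, w1, x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / float(len(squared)))


def fit_line(xs, ys):
    """Closed-form least-squares estimates of the intercept and slope."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


# Hypothetical toy data: square feet and sale prices.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [200000, 290000, 410000, 500000, 590000]

w0, w1 = fit_line(sqft, price)
print("intercept: {:.1f}, slope: {:.1f}".format(w0, w1))
print("RMSE of the fitted line: {:.1f}".format(rmse(w0, w1, sqft, price)))
```

Any other choice of intercept or slope gives an RMSE at least as large on the same points, which is what makes the fitted line the "best" one under this cost.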
Now we are ready to make predictions with some real data. First of all, click here to download the dataset, which includes house sales in King County, the region where the city of Seattle, WA is located. Then open up your iPython Notebook. If you are not familiar with the iPython Notebook and GraphLab Create, I strongly encourage you to read this post.
We start by importing the GraphLab Create library then we load our data.
Fire up GraphLab Create, and load the data
import graphlab
sales = graphlab.SFrame('home_data.gl/')
You can view the data in the iPython Notebook by typing:
This will show the very first few lines of the data.
Let’s explore the data a little bit more. We know that the house price is correlated with the number of square feet of living space. Let’s show this on a scatter plot.
# Set the target to iPython Notebook, so it won't open in a new tab
graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")
Now it is time to create a simple linear regression model of sqft_living to price.
- We need to split the data into the training set and test set.
- We will use seed = 0 so that everyone running this notebook gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).
# We take 80% of the data as our training set and the remaining 20% as the test set.
train_data, test_data = sales.random_split(.8, seed=0)
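The idea behind a seeded random split can be sketched in plain Python as below. The function name and the use of random.Random are illustrative only and say nothing about how GraphLab Create implements random_split internally. Each row is assigned independently, so the split is only approximately 80/20, but the same seed always reproduces the same split.

```python
import random


def seeded_random_split(rows, fraction, seed=None):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


# Hypothetical stand-in for the rows of the sales table.
rows = list(range(1000))
train_rows, test_rows = seeded_random_split(rows, 0.8, seed=0)
print("{} training rows, {} test rows".format(len(train_rows), len(test_rows)))
```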
Now we can build our model using the linear_regression.create function with sqft_living as the only feature.
sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'])
After that, we can evaluate our model on the test data with sqft_model.evaluate(test_data) to see how well we are doing.
The output will be very close to the following:
[Screenshot: evaluation output of sqft_model]
Wouldn’t it be great to see what our predictions look like? Surely it would. We will use Matplotlib to visualize our predictions. Matplotlib is a Python plotting library. You can install it with: ‘pip install matplotlib’
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(test_data['sqft_living'], test_data['price'], '.',
         test_data['sqft_living'], sqft_model.predict(test_data), '-')
This is what the output will look like:
The blue dots represent the original data, and the green line represents the prediction from the simple regression.
Let’s create another model using more features in order to come up with better predictions.
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
my_features_model = graphlab.linear_regression.create(train_data, target='price', features=my_features, validation_set=None)
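With more than one feature the model becomes price = W0 + W1*x1 + ... + Wd*xd. The following is a minimal sketch of fitting such a model, assuming ordinary least squares solved through the normal equations on small numeric toy data. The function name fit_linear and the numbers are hypothetical, and GraphLab Create's actual solver and preprocessing are not shown here.

```python
def fit_linear(rows, targets):
    """Least-squares weights [w0, w1, ..., wd] via the normal equations.

    Assumes the features are numeric and not perfectly collinear.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # prepend intercept column
    n = len(X[0])
    # Build A = X^T X and b = X^T y.
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)] for i in range(n)]
    b = [sum(x[i] * t for x, t in zip(X, targets)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        known = sum(A[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - known) / A[i][i]
    return w


def predict_row(w, row):
    """Prediction for one row of feature values."""
    return w[0] + sum(wi * float(v) for wi, v in zip(w[1:], row))


# Hypothetical toy data: (square feet in thousands, bedrooms) -> price in $1000s.
toy_rows = [(1.0, 2), (1.5, 3), (2.0, 3), (2.5, 4), (3.0, 5)]
toy_prices = [200.0, 290.0, 330.0, 420.0, 510.0]

weights = fit_linear(toy_rows, toy_prices)
print("weights:", weights)
print("prediction for (2.2, 3):", predict_row(weights, (2.2, 3)))
```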
We can also see a summary of our features with the .show() function.
Now we will try to find out which zip code is the most expensive in our data set. For that, we will visualize the data in the BoxWhisker Plot view.
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')
Here is the output:
Pull the bar at the bottom to view more of the data. 98039 is the most expensive zip code.
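The same question can be answered numerically by grouping prices by zip code. The sketch below assumes that "most expensive" means the highest median price, and it uses made-up records with placeholder zip labels rather than the real dataset; the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (zip label, price) records; not taken from the King County data.
records = [("zip_a", 1900000), ("zip_a", 2300000),
           ("zip_b", 1200000), ("zip_b", 1500000),
           ("zip_c", 560000), ("zip_c", 610000), ("zip_c", 590000)]


def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def median_price_by_zipcode(rows):
    """Map each zip label to the median of its sale prices."""
    grouped = defaultdict(list)
    for zipcode, price in rows:
        grouped[zipcode].append(price)
    return {zipcode: median(prices) for zipcode, prices in grouped.items()}


medians = median_price_by_zipcode(records)
most_expensive = max(medians, key=medians.get)
print(most_expensive, medians[most_expensive])
```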
Next, we will compare our simple square feet model with the model that has a few more features.
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)
[Screenshot: evaluation output of both models]
As you can see from the output, the RMSE goes down from $255,196 to $179,542 with more features.
We can now build a new and even better regression model. Then we will compare all these 3 models.
# Features for advanced linear regression model
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]
# Create the third model
my_advanced_features_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features)
With this model we should get a lower RMSE and better predictions. Let’s evaluate it using the test set.
You will immediately see that the RMSE goes down to $156,813.
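To put the three RMSE values side by side, the small calculation below expresses each model's improvement over the single-feature model as a percentage. It uses the figures quoted in the text, reading the last one as $156,813; the helper name reduction_percent is illustrative.

```python
# RMSE values quoted in the text, in dollars.
rmse_by_model = [
    ("sqft_model", 255196.0),
    ("my_features_model", 179542.0),
    ("my_advanced_features_model", 156813.0),  # assumed reading of "156.813"
]


def reduction_percent(baseline, value):
    """Percentage by which value is lower than baseline."""
    return 100.0 * (baseline - value) / baseline


baseline_rmse = rmse_by_model[0][1]
for name, value in rmse_by_model:
    print("{}: RMSE ${:,.0f} ({:.1f}% below the single-feature model)".format(
        name, value, reduction_percent(baseline_rmse, value)))
```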
Now we come to the most fun part: applying the trained models to predict the price of a house.
We will choose a house from our test set. The first house that we will use is considered an “average” house in Seattle.
# Choose a house
house = sales[sales['id']=='1925069082']
# See its real price
print house['price']
# Outputs
Let’s apply our models.
# Simple Square Feet Model
print sqft_model.predict(house)
# Model with a few more features than the Square Feet Model
print my_features_model.predict(house)
# The output will be
# [1262291.197526371]
# [1446472.4690774973]
The model with more features provides a better prediction than the simpler model with only 1 feature. However, it seems we can still make much better predictions.
Now let’s see how our advanced model is doing.
print my_advanced_features_model.predict(house)
# outputs [2115905.330321929]
Our advanced model did a great job! The original price of the house was $2,200,000 and we predicted $2,115,905 which is pretty reasonable!
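To compare the three predictions for this house more directly, the snippet below computes how far each one is from the actual price as a percentage. The numbers are the ones quoted in this post; the helper name percent_error is an illustrative choice.

```python
actual_price = 2200000.0  # actual sale price quoted in the text

# Predictions quoted in the text for the same house.
predictions = {
    "sqft_model": 1262291.197526371,
    "my_features_model": 1446472.4690774973,
    "my_advanced_features_model": 2115905.330321929,
}


def percent_error(predicted, actual):
    """Absolute prediction error as a percentage of the actual value."""
    return 100.0 * abs(predicted - actual) / actual


errors = {name: percent_error(value, actual_price)
          for name, value in predictions.items()}
best_model = min(errors, key=errors.get)

# Print the models from smallest to largest error.
for name in sorted(errors, key=errors.get):
    print("{}: off by {:.1f}%".format(name, errors[name]))
```

A single house is of course only one data point; the RMSE on the whole test set remains the better summary of model quality.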
Finally, note that in some cases the model with more features may provide a worse prediction than the simpler model with only 1 feature. On average, however, the model with more features is better. Also, your predictions may differ slightly from the ones shown here.