In this project we’ll learn how to build a basic machine learning model for house price prediction, working with the California Housing Prices dataset.

Data Story

We’ll use the California Housing Prices dataset from the StatLib repository. This dataset is based on data from the 1990 California census. It is not exactly recent (a nice house in the Bay Area was still affordable at the time), but it has many qualities for learning, so we will pretend it is recent data. For teaching purposes the dataset has been modified: a categorical attribute was added and a few features were removed.

This is the dataset used in the second chapter of Aurélien Géron's recent book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as an excellent introduction to implementing machine learning algorithms because it requires only rudimentary data cleaning, has an easily understandable list of variables, and sits at an optimal size: neither too toyish nor too cumbersome.

Fig-1: California Housing Prices (Source: Google Images)

Understanding the problem

Fig-2: Problem-solving approach

1. Understand Business Requirements

a) How does the company expect to use and benefit from this model?
- It could be a real-estate firm that wants to invest in California.
- It might be a consulting firm in the real-estate business that wants to provide valuable insights to its clients.
- The current (manual) valuation process might simply be too time-consuming.

b) What are we going to do at root level by observing the data?
- We’ll train our model on existing labelled training data so we are doing supervised learning.
- It is also a typical regression task, since you are asked to predict a value. More specifically, this is a multiple regression problem, since the system will use multiple features to make a prediction (it will use the district’s population, the median income, etc.).
- It is also a univariate regression problem, since we are only trying to predict a single value for each district. If we were trying to predict multiple values per district, it would be a multivariate regression problem.
- Finally, there is no continuous flow of data coming into the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.

2. Acquire the dataset

Get the dataset in CSV format [here](https://github.com/ageron/handson-ml2/blob/master/datasets/housing/housing.csv) and store it in the project folder.
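If you prefer to fetch the file programmatically, here is a minimal sketch; it assumes the raw-file URL that corresponds to the GitHub page linked above:

import urllib.request

# Raw-file URL derived from the GitHub link above (assumption)
CSV_URL = ("https://raw.githubusercontent.com/ageron/handson-ml2/"
           "master/datasets/housing/housing.csv")
urllib.request.urlretrieve(CSV_URL, "housing.csv")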

Set up a virtual environment and install the dependencies.

mkdir ml-housing-project
cd ml-housing-project/
pip3 install virtualenv
virtualenv venv
source venv/bin/activate          # Linux/macOS
# venv\Scripts\activate           # Windows
pip install jupyter matplotlib numpy pandas scipy scikit-learn

# Create a new jupyter notebook with Python3
jupyter notebook

Do the basic imports in the Jupyter notebook:

import os
import sys
assert sys.version_info >= (3, 5)

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
assert sklearn.__version__ >= "0.20"

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")


housing = pd.read_csv('./housing.csv')
housing.head(5)
housing["ocean_proximity"].value_counts()
housing.info()
Fig-3: housing.head(5)
Fig-4: housing["ocean_proximity"].value_counts()
Fig-5: housing.info()

From the above 3 figures we can interpret that -

  • In Fig-3 we see that the data has metrics such as the population, median income, median housing price, and so on for each block group in California.
  • Except for ocean_proximity, every metric is numerical, so we will have to encode this string attribute into numerical values, since it appears to be an important feature. In Fig-4 we can see that ocean_proximity takes 5 distinct values.
  • From Fig-5 we can observe that total_bedrooms has fewer non-null entries than the other attributes, i.e. it contains some missing values that we will have to handle later (see the quick check below). Apart from that, the dataset is fairly clean. However, this will not be the general case for real-world data, where the majority of the work goes into preparing the data.
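As a quick sanity check (a minimal sketch, not part of the original walkthrough), we can count the missing values per column directly:

# Count missing values per attribute; only total_bedrooms should be affected
housing.isnull().sum()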

3. Visualize the Data

housing.hist(bins=50, figsize=(20,15))
plt.show()
Fig-6: Data distribution of the numerical metrics/features

There are a few things you might notice in these histograms:

a) First, the median income attribute does not look like it is expressed in US dollars (USD). After checking with the team that collected the data, you are told that the data has been scaled and capped at 15 (actually, 15.0001) for higher median incomes, and at 0.5 (actually, 0.4999) for lower median incomes. The numbers represent roughly tens of thousands of dollars (e.g., 3 actually means about $30,000). Working with preprocessed attributes is common in Machine Learning.

b) The housing median age and the median house value were also capped. The latter may be a serious problem since it is your target attribute (your labels). Your Machine Learning algorithms may learn that prices never go beyond that limit. You need to check with your client team (the team that will use your system’s output) to see if this is a problem or not. If they tell you that they need precise predictions even beyond $500,000, then you have two options: (a) Collect proper labels for the districts whose labels were capped. (b) Remove those districts from the training set (and also from the test set, since your system should not be evaluated poorly if it predicts values beyond $500,000).
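If option (b) were chosen, a minimal sketch of how to find and drop the capped districts might look like this (we simply take the maximum value in the column as the cap; in this walkthrough we keep the capped districts and continue as-is):

# Count how many districts sit exactly at the cap (the maximum value in the column)
cap_value = housing["median_house_value"].max()
capped_mask = housing["median_house_value"] == cap_value
print(capped_mask.sum(), "districts are capped at", cap_value)

# Option (b): drop those districts (only if the client team confirms this is acceptable)
housing_uncapped = housing[~capped_mask].copy()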

c) These attributes have very different scales. We will discuss this later when we explore feature scaling further.

d) Finally, many histograms are tail-heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will try transforming these attributes later on to have more bell-shaped distributions.
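As a small illustration of such a transformation (a hedged sketch, not part of the original notebook), a log transform often makes a tail-heavy attribute such as population look more bell-shaped:

# Compare the raw and log-transformed distributions of a tail-heavy attribute
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
housing["population"].hist(bins=50, ax=axes[0])
axes[0].set_title("population (raw)")
np.log1p(housing["population"]).hist(bins=50, ax=axes[1])
axes[1].set_title("log(1 + population)")
plt.show()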

Hopefully you now have a better understanding of the kind of data you are dealing with.

4. Preprocess the Data

Say some expert tells us that median income is a very important attribute for predicting housing prices. We observe that the median income data is continuous, so let's make it discrete (categorical):
# to make this notebook's output identical at every run
np.random.seed(42)
    
housing["income_cat"] = np.ceil(housing["median_income"]/ 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

Now based on income categories, we’ll split our entire data into
- 80% for training our model
- 20% for testing our model

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

We are using a stratified sampling technique here, based on the income_cat attribute we just derived, so that the test set is representative of the income distribution of the whole dataset. Stratified sampling is a sampling method in which the total population is divided into smaller groups, or strata, and the right proportion of instances is sampled from each stratum. The strata are formed based on some common characteristic in the population data.
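To see why this matters, a quick check (a hedged sketch) compares the income-category proportions in the full dataset with those in the stratified test set; they should be nearly identical:

# Income-category proportions: full dataset vs. stratified test set
print(housing["income_cat"].value_counts(normalize=True).sort_index())
print(strat_test_set["income_cat"].value_counts(normalize=True).sort_index())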

We no longer need the income category, so let's remove it from both the training and test sets:
strat_train_set.drop(["income_cat"], axis=1, inplace=True)
strat_test_set.drop(["income_cat"], axis=1, inplace=True)
Let’s create a copy of our training data set and visualize it using a scatter plot
housing = strat_train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
        s=housing["population"]/100, label="population",
        c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
        figsize=(15,7))
plt.legend()
Fig-7: Scatter plot of the training set
  • The radius of each circle represents the district’s population (option s).
  • The color represents the price (option c).
  • We use a predefined color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices).

Correlation -

  • also known as the Pearson correlation coefficient
  • it's a value between -1 and +1
  • +1 means two attributes are perfectly positively correlated
  • 0 means two attributes have no linear correlation at all
  • -1 means two attributes are perfectly inversely (negatively) correlated
  • the correlation coefficient only measures linear relationships

Let’s find out if there is correlation between our attributes using pandas' scatter_matrix function.

A scatter plot matrix is a grid (or matrix) of scatter plots used to visualize bivariate relationships between combinations of variables. Each scatter plot in the matrix visualizes the relationship between a pair of variables, allowing many relationships to be explored in one chart.

from pandas.plotting import scatter_matrix

attributes = ['median_house_value', 'median_income',
             'total_rooms', 'housing_median_age']
scatter_matrix(housing[attributes], figsize=(12,8))
Fig-8: Correlation between attributes

We’ll focus on the correlation between median house value and median income:

housing.plot(kind='scatter', x='median_income', y='median_house_value',
            alpha=0.1, figsize=(8,5))
Fig-9: Correlation between median house value and median income
  • The correlation is indeed quite strong; the upward trend is clear.
  • We see a horizontal line at the top, which marks the $500,000 cap on the target value.
  • There are also some fainter horizontal lines in the middle area; these are similar data quirks.
  • We now have a good idea of the correlations in our dataset. Let's move on.

housing.head(3)

Fig-10: housing.head(3)
  • The attributes total_rooms, total_bedrooms and population are totals for an entire district.
  • Since we ultimately care about prices at the household level, let's express them per household.
  • So we derive new attributes from the existing attributes in our dataset:
housing['rooms_per_household'] = housing['total_rooms']/housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms']/housing['total_rooms']
housing['population_per_household'] = housing['population']/housing['households']

So far, so good.
Let’s compute the correlations over the entire dataset and find out which attribute is most strongly correlated with the median house value.

# Note: recent pandas versions may require numeric_only=True to skip ocean_proximity
corr_matrix = housing.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)

median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64
  • The correlation coefficient ranges from –1 to 1.
  • When it is close to 1, it means that there is a strong positive correlation; for example, the median house value tends to go up when the median income goes up.
  • When the coefficient is close to –1, it means that there is a strong negative correlation; you can see a small negative correlation between the latitude and the median house value (i.e., prices have a slight tendency to go down when you go north).
  • Finally, coefficients close to 0 mean that there is no linear correlation.
  • The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up/down”). It may completely miss out on nonlinear relationships (e.g., “if x is close to 0, then y generally goes up”), as the small example below shows.
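As a tiny illustration of that last point (a standalone sketch with synthetic data, not part of the housing notebook), a perfect nonlinear relationship can still have a Pearson correlation of roughly zero:

# y is completely determined by x, yet the linear correlation is ~0
x = np.linspace(-1, 1, 201)
y = x ** 2
print(np.corrcoef(x, y)[0, 1])   # approximately 0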

Data Cleaning

We'll create a clean training set and introduce two new variables for clarity:

  • housing — the training features (from the stratified training set) that we’ll feed to our model to learn from
  • housing_labels — the target prices in our training set, which our model should learn to predict
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set['median_house_value'].copy()

In our dataset the total_bedrooms column has some missing values, which we can fix in a few simple ways. Let us understand the pros and cons of each method:

1. Remove the instances (rows) with missing values

housing.dropna(subset=["total_bedrooms"])   # option 1
  • Pros:
    • Simple to implement, and the model is trained only on complete rows, so no assumptions about the missing values are introduced.
  • Cons:
    • Loss of a lot of information.
    • Works poorly if the percentage of missing values is excessive in comparison to the complete dataset.

2. Remove the entire attribute/feature

housing.drop("total_bedrooms", axis=1)     # option 2
  • Pros:
    • Removing an entire attribute, especially one of low importance (which we can judge from the correlations), saves computation time and may even improve accuracy.
  • Cons:
    • If the deleted attribute carries significant weight for making predictions, we lose important information needed for a better model.

3. Set missing value to some value like zero, median or mean

  • Pros:
    • Prevents the data loss that comes with deleting rows or columns.
    • Works well with small datasets and is easy to implement.
  • Cons:
    • Median/mean imputation only works with numerical continuous variables.
    • Can cause data leakage if the statistic is computed on the full dataset (including the test set) instead of the training set only.
    • Does not account for the covariance between features.

We’ll take option 3 and fill in the median value using Scikit-Learn's SimpleImputer class. We also set the text attribute aside, because the median can only be computed on numerical attributes:

# option 3
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median, inplace=True)

# Scikit-Learn provides a class to take care of missing values: SimpleImputer.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
housing_num.head()

Now we can fit the imputer on the numerical training data; it computes the median of each attribute and stores the results in its statistics_ attribute.

imputer.fit(housing_num)
imputer.statistics_

array([ -118.51 , 34.26 , 29. , 2119.5 , 433. , 1164. , 408. , 3.5409])
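As a quick sanity check (a hedged aside), these learned statistics should simply match the per-column medians computed directly with pandas:

# The imputer's learned statistics are just the medians of each numerical column
housing_num.median().values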

Now you can use this “trained” imputer to transform the training set by replacing
missing values with the learned medians:

X = imputer.transform(housing_num)

# The result is a plain NumPy array containing the transformed features. 
# If you want to put it back into a pandas DataFrame, it’s simple:

housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

Now let's pre-process the categorical input feature, ocean_proximity:

housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)

 	ocean_proximity
17606 	<1H OCEAN
18632 	<1H OCEAN
14650 	NEAR OCEAN
3230 	INLAND
3555 	<1H OCEAN
19480 	INLAND
8879 	<1H OCEAN
13685 	INLAND
4937 	<1H OCEAN
4861 	<1H OCEAN

It’s not arbitrary text: there are a limited number of possible values, each of which
represents a category. So this attribute is a categorical attribute. Most Machine Learning algorithms prefer to work with numbers, so let’s convert these categories from text to numbers.

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

array([[0.],
       [0.],
       [4.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]])
ordinal_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases (e.g., for ordered categories such as “bad,” “average,” “good,” and “excellent”), but it is obviously not the case for the ocean_proximity column (for example, categories 0 and 4 are clearly more similar than categories 0 and 1).

To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is “<1H OCEAN” (and 0 otherwise) and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot vectors.

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

# By default, the OneHotEncoder class returns a SciPy sparse matrix, but we can convert it to a dense NumPy array if needed by calling the toarray() method:

housing_cat_1hot.toarray()

# Alternatively, you can set sparse=False when creating the OneHotEncoder
# (in recent Scikit-Learn versions this parameter is called sparse_output):

cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

80% of the total time spent on data science projects is spent on cleaning and preprocessing of the data.

We’ve all heard that, right? So it only makes sense to find ways to automate the preprocessing and cleaning as much as we can.

Up until now we've learned how to perform exploratory data analysis and how to do data cleaning. In the next part we'll learn how to build pipelines and train models.

Read the second part

Predicting House Prices Using Regression: Part 2
This is a continuation of the project in which we are learning how to build a basic machine learning model for house price prediction using the California Housing Prices dataset.