Tutorial 2: Random Forest Regression

Tutorial 2: Random Forest Regression#

In this tutorial you will learn how to read and explore census data from the IPUMS USA database. IPUMS USA collects, preserves and harmonizes U.S. census microdata and provides easy access to this data with enhanced documentation. Data includes decennial censuses from 1790 to 2010 and American Community Surveys (ACS) from 2000 to the present. In this tutorial, you will learn how to set up a Random Forest model using real-world data.

Important before we start#

Make sure that you save this file before you continue, else you will lose everything. To do so, go to Bestand/File and click on Een kopie opslaan in Drive/Save a Copy on Drive!

Now, rename the file into Week2_Tutorial2.ipynb. You can do so by clicking on the name in the top of this screen.

Tutorial Outline

1. Introducing the packages
2. Importing the data
4. Dealing with missing values
5. Training the random forest model
6. Prediction with the random forest model
7. Evaluate performance
8. Parameter tuning
9. Feature importance

Learning Objectives#

Learn how to estimate a random forest model.
Deal with missing values in the model.
Evaluate performance of the random forest model.
Tuning the hyperparameters.

1. Introducing the packages#

Within this tutorial, we are going to make use of the following packages:

sklearn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

NumPy is a Python library that provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays.

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Now we will import these packages in the cell below:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.tree import plot_tree
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

pd.options.mode.chained_assignment = None
sns.set_style("whitegrid",{'axes.grid' : True})

2. Importing the data#

Let’s start again with reading the data. It is the dataset that we produced in the previous Tutorial.

data_in = pd.read_csv(r"https://github.com/ElcoK/BigData_AED/raw/main/week2/usadataforweek2tut2.csv")

And let’s have a look at the data once more:

data_in.head()

In the following line of code we drop duplicates based on the variable SERIAL. Remember that persons from the same household have the same SERIAL number.

data = data_in.drop_duplicates(subset='SERIAL', keep="first")

Question 1: Why do we only want to have one person per household in our random forest model?

3. Dealing with missing values#

In the following lines of code we try to estimate a random forest model, but we run into an error because we have missing values in the dataset.

We don’t estimate the random forest model on the entire dataset, but we split the dataset in a training and testing sample, because we want to train the model on the training dataset and to test model performance on the test dataset.

training_data, testing_data = train_test_split(data, test_size=0.3, random_state=25)

Next we split the training and testing data in two datasets, where x_train contains the explanatory variables and y_train contains the dependent variable.

x_train = training_data.drop(columns = ['COSTENERGY'])
y_train = training_data[['COSTENERGY']]
y_train = y_train.to_numpy() #here we convert y_train from a pandas dataframe with one column to a numpy array.

x_test = testing_data.drop(columns = ['COSTENERGY'])
y_test = testing_data[['COSTENERGY']]
y_test = y_test.to_numpy()

Training and fitting a random forest model is done in one line of code:

rf = RandomForestRegressor(n_estimators = 50, max_features = 'sqrt', max_depth = 5, random_state = 18).fit(x_train, y_train)

Question 2: Explain in your own words the options in the RandomForestRegressor and how these options will affect the results. Also explain two other options in RandomForestRegressor that are not included in the line of code we have used to run the model.

The random forest model didn’t run because there are missing values in the data. There are several options to deal with missing data. We could for example replace all the missing values by 0 or by the mean of that variable. First, let’s check again which variables have missing values.

data.isnull().sum().sort_values(ascending = False)

There are quite some missing values in the dependent variable COSTENERGY. And the number of NA’s in COSTENERGY is suspiciously similar to the number of missing values in some other variables, such as BUILTYR2, ROOMS, HHINCOME, etc. Let’s check what happens if we remove the rows from the dataset when there is no value for COSTENERGY. Probably we can remove a lot of missing values all at once.

In the following line of code we remove all rows (axis = 0 indicates rows, and axis = 1 indicates columns) for which COSTENERGY is NA.

datanan = data.dropna(axis=0, subset=['COSTENERGY'])

Now let’s check if this has removed a lot of NA’s in other variables as well.

datanan.isnull().sum().sort_values(ascending = False)

Indeed for 15145 households there was a lot of data missing and we can savely remove these households from the dataset, because 1) these instances are not very meaningful and 2) we could perfectly predict them if we keep them in the dataset using the missing values of VEHICLES, ROOMS, HHINCOME, etc.

Now we have a handful variables left with missing values. Replace the missing values in all variables with code. You can check the code of the previous tutorial on how to do this.

Question 3: Explain what you would do with the remaining missing values in our data? There is not necessarily a right or wrong answer, but ofcourse your choices matter for the results.

datanan['RENT'] = datanan['RENT'] # fill na
datanan['DEGFIELD'] = datanan['DEGFIELD'] # fill na
datanan['MORTGAG2'] = datanan['MORTGAG2'] # fill na
datanan['MORTGAGE'] = datanan['MORTGAGE'] # fill na
datanan['PROPINSR'] = datanan['PROPINSR'] # fill na
datanan['IND'] = datanan['IND'] # fill na
datanan['OCC'] = datanan['OCC'] # fill na
datanan['EMPSTAT'] = datanan['EMPSTAT'] # fill na

4. Training the random forest model#

Now we can actually estimate the random forest model. Note that we use the phrases training a random forest model or estimating a the random forest model interchangebly. In machine learning, we usually talk about training a model, while in statistics/econometrics, we would say estimating a model, but both terms mean essentially the same.

Remember to split the data again in a training and testing set. Now we use the datanan dataset which has no missing values.

training_data, testing_data = train_test_split(datanan, test_size=0.3, random_state=25)

And now let’s rerun the model

x_train = training_data.drop(['COSTENERGY'], axis = 1)
y_train = training_data[['COSTENERGY']]
y_train = y_train.to_numpy().ravel()

x_test = testing_data.drop(['COSTENERGY'], axis = 1)
y_test = testing_data[['COSTENERGY']]
y_test = y_test.to_numpy().ravel()

rf = RandomForestRegressor(n_estimators = 100, max_features = 'sqrt', max_depth = 3, random_state = 18).fit(x_train, y_train)

To get some insight in the random forest model, we can visualize the first tree of the model. [0] calls the first tree, but we can also plot any other tree of the model.

firsttree = rf.estimators_[0]
plt.figure(figsize=(22,8))
plot_tree(firsttree, feature_names=x_train.columns, filled=True, fontsize = 11)
plt.title('title')
plt.show()

Question 4: Interpret what you see in the decision tree. Hint: Explain the terms squared_error, samples and value in the tree.

5. Prediction with the random forest model#

In the next line of code, we use the trained random forest model to predict the dependent variable of the testing dataset. We use the function predict of the random forest regressor.

y_pred = rf.predict(x_test)

6. Evaluate performance#

Using the predicted dependent variable, we can evaluate the performance of the model. We first do this visually by plotting the dependent variable of the testing dataset together with the predicted values.

fst = 12
plt.plot(y_test,'bo')
plt.plot(y_pred,'ro')
plt.legend(['Observed','Predicted'], fontsize = fst)
plt.xticks(fontsize=fst)
plt.yticks(fontsize=fst)
plt.ylabel('ENERGYCOST',fontsize=fst)

Question 5: Interpret the plot you have created to evaluate the performance of your model.

Next we check performance in terms of a numeric criterion. We can for example check the mean squared error or the R2.

mse = mean_squared_error(y_test, y_pred)
print(mse)

r2 = r2_score(y_test, y_pred)
print(r2)

Question 6: What are the values you receive for the mean squared error and the r-squared? Also explain the two terms.

7. Parameter tuning#

The R2 score is pretty low, so we want to try other hyperparameters. We will change the depth and the number of trees in the forest. In the code below, we run a for loop that tries 3 different n_estimators and 3 different depths, so we estimate 9 different random forest models. We save the mse and r2 of each model in a list, i.e. list_mse and list_r2. It’ll take a minute or two to run.

list_mse = []
list_r2 = []

for n_est in [75, 150]:
    for dep in [4, 7, 10]:
       rf = RandomForestRegressor(n_estimators = n_est, max_features = 'sqrt', max_depth = dep, random_state = 18).fit(x_train, y_train)
       y_pred = rf.predict(x_test)
       mse = mean_squared_error(y_test, y_pred)
       r2 = r2_score(y_test, y_pred)
       list_mse.append(mse)
       list_r2.append(r2)

print(list_mse)
print(list_r2)

Question 7: Interpret the results. Which model performs best. Think about how the for loop works. You could add print(n_est) and print(dep) in the for loop to make life easier.

Question 8: Try another parameter space but think about the time the model needs to run! The higher the number of trees and the depth, the longer it takes to train the model. You can copy the code from the previous cell and change the numbers over which the for loop iterates, i.e. the numbers in [75,150] and [5,7,9]. Because of time, I would advice not to try more than 4 models. Report the hypermeters you tried and the R2 score.

8. Feature importance#

Next we want to see which variables have the most predictive value for ENERGYCOSTS. We can check the importance of each feature in the random forest model by calling rf.feature_importances_. We make a horizontal barplot to see the importance of each feature. You can fill in the hyperparameters for n_estimators and max_depth.

rf = RandomForestRegressor(n_estimators = ..., max_features = 'sqrt', max_depth = ..., random_state = 18).fit(x_train, y_train)
sorted_idx = rf.feature_importances_.argsort() #we sort the importance of features from high to low.
plt.figure(figsize=(7, 10.5))
plt.barh(x_train.columns[sorted_idx], rf.feature_importances_[sorted_idx])
plt.show()

Question 9: According to the plot, which variables, also called features, are most important in the random forest? Should we leave out variables from the model. Hint: Not all variables in the model are meaningful, are they?