Tutorial 2: Model Validation#

The second tutorial of this week will focus on model validation. We will re-run several of the models we have developed over the past three tutorials and try to understand together how each of these models performs and how its results should be interpreted.



Learning Objectives#


  • Learn how to conduct model validation.

  • Learn about popular validation methods.

  • Learn how to interpret validation results.

1. Introducing the packages#


Within this tutorial, we are going to make use of the following packages:

tensorflow is an open-source machine learning library which makes it easy for beginners and experts to create machine learning models.

sklearn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

NumPy is a Python library that provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays.

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Matplotlib is a comprehensive Python package for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.

Now we will import these packages in the cell below:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.tree import plot_tree
import seaborn as sns
sns.set_style("whitegrid",{'axes.grid' : True})
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import tensorflow as tf
from tensorflow import keras

2. Preparing the data#


Read the data. We use the same data that we used to estimate the OLS model in the first tutorial of week 2: it already includes the variables you created in that tutorial, and missing values have been dealt with.

data1 = pd.read_csv(r"https://github.com/ElcoK/BigData_AED/raw/main/week2/usadataforOLS.csv", sep = ',', encoding = 'unicode_escape')
data1.head()

Like we did in the last tutorial, we will drop duplicates based on the variable SERIAL. (Remember that persons from the same household have the same SERIAL number.)

data1 = data1.drop_duplicates(subset='SERIAL', keep="first")

And we drop some irrelevant variables:

data1 = data1.drop(['Unnamed: 0', 'YEAR', 'SAMPLE', 'HHWT', 'SERIAL','PERWT', 'RELATE', 'RELATED', 'COUNTYFIP','SEX', 'AGE', 'RACE', 'RACED', 'STATEFIP', 'EDUC', 'EDUCD', 'EMPSTATD', 'OCC', 'IND', 'EMPSTAT', 'LABFORCE', 'DEGFIELD', 'INCTOT'], axis = 1)

As in the previous tutorial, we transform the column REGION into dummy variables.

one_hot = pd.get_dummies(data1['REGION'], prefix = 'REGION_')
print(one_hot.head())
data1 = pd.concat([data1, one_hot], axis = 1) 
data1 = data1.drop(['REGION'], axis = 1)

Let’s check if it has worked out by printing the column names of the dataframe:

data1.columns

Before we can perform any model validation, we need to estimate a few models to validate (isn’t that surprising?). We will fit three different models: an Ordinary Least Squares (OLS) regression (a linear model), a Random Forest (RF) model, and an Artificial Neural Network (ANN) model. First we scale all the data using min-max normalization, which we can apply to the entire dataframe at once.
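In formula terms, min-max normalization rescales each column $x$ to the range [0, 1]:

$$ x_{scaled} = \frac{x - \min(x)}{\max(x) - \min(x)} $$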

data1 = (data1 - data1.min()) / (data1.max() - data1.min())

3. Training the models#


Now we separate the data into training and testing datasets and start running the models and making some predictions (just like we did before). We will select four variables as explanatory variables in the models.

training_data, testing_data = train_test_split(data1, test_size=0.3, random_state=25) 

x_train = training_data[['DENSITY', 'PROPINSR', 'ROOMS', 'BUILTYR2']]
y_train = training_data[['COSTENERGY']]

x_test = testing_data[['DENSITY', 'PROPINSR', 'ROOMS', 'BUILTYR2']]
y_test = testing_data[['COSTENERGY']]

First we estimate a simple linear regression model using the statsmodels library:

OLSregression = sm.OLS(y_train, x_train)
OLSresults = OLSregression.fit()
print(OLSresults.summary())

Then we estimate a Random Forest model using the scikit-learn library (this takes a few seconds):

RFmodel = RandomForestRegressor(n_estimators = 100, max_features = 'sqrt', max_depth = 4, random_state = 18).fit(x_train, y_train)

Then we build and compile a neural network model using the tensorflow library:

ANNmodel = keras.Sequential([
      keras.layers.Dense(64, activation='relu', input_shape = (x_train.shape[1],)),
      keras.layers.Dense(64, activation='relu'),
      keras.layers.Dense(1)
  ])


ANNmodel.summary()

ANNmodel.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.0000005),
              metrics=['mean_absolute_error'])

Next we fit the ANN model:

history = ANNmodel.fit(x_train, y_train['COSTENERGY'], epochs=15, validation_split=0.2)
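One way to judge the learning progress is to plot the loss curves stored in the history object returned by fit. A minimal sketch (the 'loss' and 'val_loss' keys correspond to the loss function and validation split specified above):

plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('Epoch')
plt.ylabel('Mean squared error')
plt.legend()
plt.show()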
Question 1: Based on the learning progress printed above, what do you think of the number of epochs?
A. I would decrease the number of epochs to 10
B. The number of epochs is fine
C. I would increase the number of epochs to 20

4. Model validation: “basic” methods#


Using the three models, we are going to predict COSTENERGY in the testing dataset. We print the first 10 observed values of COSTENERGY in the test set alongside the three models’ predictions.

OLS_pred = OLSresults.predict(x_test)
RF_pred = RFmodel.predict(x_test)
ANN_pred = ANNmodel.predict(x_test)

test = np.column_stack((y_test[0:10], OLS_pred[0:10], RF_pred[0:10], ANN_pred[0:10]))
print(test)

There are several metrics that can be used for model validation. For now, we are going to focus on three of the most straightforward ones:

  1. Root Mean Squared Error (RMSE): As the name suggests, it is the square root of the average squared difference between the actual and predicted values of the target variable. It summarizes the typical prediction error made by the model, so the lower the RMSE value, the better the model performed.

  2. Mean Absolute Error (MAE): This metric gives the average absolute difference between the actual values and the values predicted by the model for the target variable. If outliers are not a significant issue, the MAE can be used to evaluate the performance of the model. Again, the lower the MAE value, the better the model performed.

  3. R^2^ (R-squared): You may already be familiar with this metric. It indicates what percentage of the variance in the dependent variable is explained collectively by the independent variables. In other words, it reflects the strength of the relationship between the target variable and the model on a scale of 0 to 1 (or 0% to 100%). In this case, the higher the R^2^ value, the better the model performed.
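To make these formulas concrete, here is a minimal NumPy sketch of the three metrics, using two small hypothetical arrays y_true and y_pred (in this tutorial these would be the observed and predicted COSTENERGY values):

# Hypothetical observed and predicted values, for illustration only
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.25, 0.35, 0.65, 0.70])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))                                    # Root Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))                                             # Mean Absolute Error
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)  # R^2
print(rmse, mae, r2)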

Question 2: Provide an example of another validation method that can be used to evaluate model performance and explain how it works, in your own words.

We use the scikit-learn library to compute the three metrics on the predicted COSTENERGY values.

ols_rmse = np.sqrt(mean_squared_error(OLS_pred, y_test['COSTENERGY']))
rf_rmse = np.sqrt(mean_squared_error(RF_pred, y_test['COSTENERGY']))
ann_rmse = np.sqrt(mean_squared_error(ANN_pred, y_test['COSTENERGY']))
print(ols_rmse, rf_rmse, ann_rmse)
ols_mae = mean_absolute_error(OLS_pred, y_test['COSTENERGY'])
rf_mae = mean_absolute_error(RF_pred, y_test['COSTENERGY'])
ann_mae = mean_absolute_error(ANN_pred, y_test['COSTENERGY'])
print(ols_mae, rf_mae, ann_mae)
ols_r2 = r2_score(y_test['COSTENERGY'], OLS_pred)
rf_r2 = r2_score(y_test['COSTENERGY'], RF_pred)
ann_r2 = r2_score(y_test['COSTENERGY'], ANN_pred)
print(ols_r2, rf_r2, ann_r2)
Question 3: (1) Which model performed best based on the validation metrics discussed above? (2) Search for an additional validation method available in Python and use it to evaluate the models. Add the results you get from the new metric on *Canvas* and identify which model was the most accurate according to that metric.

Now we would like to see if you can create better models than the ones you have used so far in this tutorial. Use a different combination of variables, model settings, etc., and construct at least one new OLS, RF, and ANN model (one of each). (Note that the more complex your model is, and the more data it uses for training, the more time the training will take. Also note that you will be asked to submit a PDF version of your notebook with your new models on Canvas.)


Question 4: (1) Which of your new models performed better based on the validation metrics? (2) Did your new models perform better than the ones in this tutorial? (3) Why do you think that was the case?

Note that there are a few advantages, as well as disadvantages, of using metrics like the RMSE, MAE, and R^2^ to evaluate model performance:

Advantages:

  • They are among the most basic and simple techniques for model validation;

  • They do not require complex programming steps and can be computed quickly.

Disadvantages:

  • Model predictions (and thus accuracy) are highly dependent on the subset of observations used for training and testing.

  • Not necessarily robust/consistent across different training/testing datasets [e.g., the model may achieve a good or bad validation score not because the model is necessarily good or bad in general, but because it simply happens to be “lucky” or “unlucky” given a particular pair of training/testing datasets].

5. Model validation: “complex” methods#

Now we will cover a couple of more complex, but still very popular, validation methods, namely Leave-One-Out Cross-Validation (LOOCV) and K-fold Cross-Validation.

Let’s start with LOOCV. The idea behind this method is not too complicated: as before, the dataset is randomly split into training and testing sets, but there is now also a validation set, consisting of a single data point at a time. Validation with LOOCV is performed in the following way:

  1. The model is trained based on N-1 data points in the training dataset;

  2. The model is then validated against that one data point that was left out;

  3. A prediction error is calculated based on the difference between the predicted point value and its actual value;

  4. Steps 1-3 are repeated until the model has been trained and validated based on all data points (one at a time);

  5. An overall prediction error is computed based on the average of the prediction errors that were computed each time a data point was left out of the training.

  6. In steps 1-5 you can try different model settings and model parameters, and the best performing model is tested on the testing dataset.

Let’s see an example of how to implement the LOOCV method. Note that because this procedure can take some time to complete, we will only look at a subset of our data (i.e., the first 1000 rows), and we will only use the OLS (or “linear model”) method (because OLS models can be computed quickly):

Note that this time we use LinearRegression from the scikit-learn library. We use the same four explanatory variables as before.

from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

# Select data
x_train = training_data[['DENSITY', 'PROPINSR', 'ROOMS', 'BUILTYR2']]
y_train = training_data[['COSTENERGY']]

x_test = testing_data[['DENSITY', 'PROPINSR', 'ROOMS', 'BUILTYR2']]
y_test = testing_data[['COSTENERGY']]


X = x_train.iloc[:1000, :]
Y = y_train.iloc[:1000, :]

# Create the leave-one-out splitter and a linear regression object
loocv = LeaveOneOut()
regressor = LinearRegression()

# Train the model using leave-one-out cross-validation
errors = []
for train_index, val_index in loocv.split(X):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    Y_train, Y_val = Y.iloc[train_index], Y.iloc[val_index]
    regressor.fit(X_train, Y_train)
    Y_pred = regressor.predict(X_val)
    errors.append(mean_squared_error(Y_val, Y_pred))

# Print the mean squared error
print("Mean Squared Error: ", sum(errors) / len(errors))

Here are some advantages and disadvantages of using the LOOCV method:

Advantages:

  • High accuracy: LOOCV provides a high level of accuracy in evaluating model performance since each data point is used once as the validation set, effectively reducing the chances of error.

  • Low bias: LOOCV helps reduce the bias in the model evaluation since each data point is used as the validation set exactly once.

  • Suitable for small sample size: LOOCV is a suitable method for evaluating models when the sample size is small, as it ensures that each data point is used as a validation set.

Disadvantages:

  • LOOCV is computationally expensive, especially when the sample size is large or models are complex.

  • It can lead to high variance in the evaluation of model performance in the presence of outliers. This is because the validation set consists of only one data point, leading to a high variance in the estimated performance metrics.

In practice, K-fold Cross-Validation is more often used than LOOCV, because K-fold Cross-Validation is computationally much faster than LOOCV. K-fold Cross-Validation works by dividing the training dataset into K subsets (or folds) of (almost) equal size. Out of these K folds, one subset at a time is used as a validation set, while the rest are used to train the model. The computational steps are the following:

  1. The training dataset is randomly split into K subsets;

  2. K-1 subsets are used to train the model;

  3. The trained model is validated against the subset that was left out and a prediction error is calculated;

  4. Steps 2 and 3 are repeated K times, i.e., until the model has been validated against all subsets;

  5. An overall prediction error is computed based on the average of the prediction errors that were computed each time a subset was left out of the model training.

  6. In steps 1-5 you can try different model settings and model parameters, and the best performing model is tested on the testing dataset.
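Before we combine K-fold Cross-Validation with a parameter search, here is a minimal sketch of steps 1-5 in isolation, with K=5 and the LinearRegression model from the LOOCV example above (illustrative only; the grid search below does this internally):

from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=18)
fold_errors = []
for train_index, val_index in kfold.split(x_train):
    X_tr, X_val = x_train.iloc[train_index], x_train.iloc[val_index]
    Y_tr, Y_val = y_train.iloc[train_index], y_train.iloc[val_index]
    model = LinearRegression().fit(X_tr, Y_tr)
    fold_errors.append(mean_squared_error(Y_val, model.predict(X_val)))

# Average prediction error over the 5 folds
print("Mean Squared Error: ", sum(fold_errors) / len(fold_errors))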

In the next lines of code we will search for the optimal parameters of a random forest model using K-fold cross validation. Note that this will take 1 or 2 minutes to run. We use GridSearchCV to try different parameter settings.

from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings("ignore") #ignore all warnings in this cell

parametergrid = { 
    'n_estimators': [25, 50, 100],
    'max_features': ['sqrt'],
    'max_depth' : [5,7,9],
    'random_state' : [18]
}

## Grid Search function
CV_rf = GridSearchCV(estimator=RandomForestRegressor(), param_grid=parametergrid, cv= 4, scoring = 'neg_mean_squared_error')
CV_rf.fit(x_train, y_train)
Question 5: How many models are estimated with the parameter grid given above?

Below we print the results of the grid search:

print(np.abs(CV_rf.cv_results_['mean_test_score'])) 
# we take the absolute value of the mean test score, 
# because the scoring function is specified as the negative mean squared error. 


print(CV_rf.best_estimator_)
print(np.abs(CV_rf.best_score_)) 
print(CV_rf.best_params_)
Question 6: Over how many test scores is the mean test score calculated?
Question 7: According to the results of the grid search with cross validation, which model parameters do you want to use in your final model that you test on the testing dataset? Do you think it is a good idea to use another parameter grid in the grid search? Explain.

Now we test the model with the optimal parameters on the testing dataset:

best_model = RandomForestRegressor(n_estimators=CV_rf.best_params_['n_estimators'],  
                                   max_features=CV_rf.best_params_['max_features'],
                                   max_depth=CV_rf.best_params_['max_depth'],
                                   random_state = 18)

# Fit the best model on the training data
best_model.fit(x_train, y_train)

# Predict on the test data
y_pred = best_model.predict(x_test)

print(mean_squared_error(y_pred, y_test))
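As a side note: because GridSearchCV refits the best parameter combination on the full training data by default (refit=True), you can also use the stored best estimator directly instead of re-specifying the parameters; the result should be identical:

# Equivalent shortcut: the refitted best model stored inside the grid search object
y_pred_alt = CV_rf.best_estimator_.predict(x_test)
print(mean_squared_error(y_pred_alt, y_test))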

Here are some advantages and disadvantages of the K-fold Cross-Validation method:

Advantages:

  • By dividing the data into K folds and training the model on K-1 folds and evaluating it on the remaining fold, K-fold cross-validation provides a better estimate of the model’s performance compared to training the model once on the full training dataset.

  • K-fold cross-validation can be useful in handling imbalanced training/testing samples, as it allows the model to be trained and evaluated on different subsets of training data, reducing the impact of imbalanced samples in the training and evaluation process.

Disadvantages:

  • It can be computationally expensive, especially when the sample size is large or models are complex (but not as expensive as the LOOCV method).

  • Dependent on a sensible selection of the number of folds.

6. Model validation: classification models#

You have already learnt that some machine learning models are used for classification. In this final part of the tutorial, we will focus on popular validation metrics that are often used to evaluate the performance of classifiers. First we will create a binary variable that we will predict with a classifier. We will divide the households into two groups: the 50% of households with the highest energy costs are given a 1 and the other 50% of the households are given a 0.

data1['group'] = (data1['COSTENERGY'] >= data1['COSTENERGY'].median()).astype(int)

The simplest evaluation metric is the accuracy of the classifier, i.e. the number of instances the classifier has predicted correctly divided by the total number of instances. Two other well-known metrics are precision and recall, which are based on the numbers of true positives, true negatives, false positives and false negatives. True positives are the 1’s that are correctly predicted, true negatives are the 0’s that are correctly predicted, false positives are instances predicted as 1 that should be 0, and false negatives are instances predicted as 0 that should be 1. Precision is given by the number of true positives / (the number of true positives + the number of false positives): when the model predicts a 1, how often is the instance actually a 1? Recall is given by the number of true positives / (the number of true positives + the number of false negatives): it measures how many of the actual 1’s the model is able to find.
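As a small illustration of these definitions, the sketch below computes precision and recall by hand from a confusion matrix, using a hypothetical set of true labels and predictions (the arrays are made up for this example):

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predicted labels, for illustration only
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
pred_labels = [1, 0, 0, 1, 1, 0, 1, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels
tn, fp, fn, tp = confusion_matrix(true_labels, pred_labels).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Precision:", tp / (tp + fp), "=", precision_score(true_labels, pred_labels))
print("Recall:   ", tp / (tp + fn), "=", recall_score(true_labels, pred_labels))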

Question 8: Let's say you have a naive classifier that predicts 1 for every instance. What are the precision and recall of this model?

Now, let’s estimate some models using a training dataset and use the models to classify the households in a testing dataset. This time, let’s use a logistic regression model and an RF model to perform the classification. Let’s start with the logistic regression.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

training_data, testing_data = train_test_split(data1, test_size=0.3, random_state=25) 

x_train = training_data[['DENSITY', 'PROPINSR', 'ROOMS', 'BUILTYR2']]
y_train = training_data[['group']]

x_test = testing_data[['DENSITY', 'PROPINSR', 'ROOMS', 'BUILTYR2']]
y_test = testing_data[['group']]

regressionLogit = LogisticRegression()
resultsLogit = regressionLogit.fit(x_train, y_train)

y_pred = resultsLogit.predict(x_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Logit Accuracy:", accuracy)
print("Logit Precision:", precision)
print("Logit Recall:", recall)

RFclassifier = RandomForestClassifier(n_estimators=100,  
                                   max_features='sqrt',
                                   max_depth=8,
                                   random_state = 18)

resultsRF = RFclassifier.fit(x_train, y_train)

y_pred = resultsRF.predict(x_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("RF Accuracy:", accuracy)
print("RF Precision:", precision)
print("RF Recall:", recall)
Question 9: What do you think of the performance of the two models? In most cases you want to have a high precision score and a high recall score, but can you think of a classification problem where you specifically want a high recall score and where a high precision score is less important?
Question 10: Can you build one Logistic model and one Random Forest classifier that perform better than the two models above? (Use a different combination of variables, model settings, etc.). You will be asked to submit a PDF version of your *notebook* with your new models on *Canvas*, along with your answers.