Tutorial 1: Artificial neural networks (ANNs)#

In this tutorial, we will continue building machine learning models on top of the census data. We use the packages TensorFlow and Keras; Keras has been integrated into TensorFlow. TensorFlow is a popular software library for deep learning applications, such as neural networks. You can find more information about Keras here.

Learning Objectives#


  • Learn how to create dummy variables

  • Learn how to construct a linear ANN with one variable

  • Learn how to construct a non-linear ANN with one variable

  • Learn how to construct a non-linear ANN model with multiple variables

  • Gain insight into the basic functions and options of neural networks in TensorFlow Keras

  • Learn how to make predictions with ANN models

  • Evaluate the performance of ANN models

1. Introducing the packages#


Within this tutorial, we are going to make use of the following packages:

tensorflow is an open-source machine learning library which makes it easy for beginners and experts to create machine learning models.

sklearn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

NumPy is a Python library that provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays.

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.

Now we will import these packages in the cell below:

!pip install keras-visualizer
!pip install pydot
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
sns.set_style("whitegrid",{'axes.grid' : True})
from sklearn.model_selection import train_test_split

2. Importing the data#


We use the same data that we used to estimate the OLS model. It includes the variables you created in the first tutorial, and missing values have already been dealt with.

data1 = pd.read_csv(r"https://github.com/ElcoK/BigData_AED/raw/main/week2/usadataforOLS.csv", sep = ',', encoding = 'unicode_escape')

And let’s have a look at the data once more:

data1.head()
data1.columns

We drop several columns from the dataframe that are not useful.

data1 = data1.drop(['YEAR', 'SAMPLE', 'HHWT', 'SERIAL','PERWT', 'RELATE', 'RELATED', 'COUNTYFIP'], axis = 1)

We also drop columns that are specific to an individual.

datann = data1.drop(['SEX', 'AGE', 'RACE', 'RACED', 'EDUC', 'EDUCD', 'EMPSTATD', 'OCC', 'IND', 'EMPSTAT', 'LABFORCE', 'DEGFIELD', 'INCTOT'], axis = 1)

3. One-hot encoding#


We have some categorical variables in our data. Before we can include them in the ANN, we first need to transform them into dummy variables. This is called one-hot encoding. A dummy variable takes the value of either 1 or 0. In our dataset, the variable REGION has nine distinct values (32, 42, 41, 33, 11, 31, 21, 22, and 12), each representing a unique region. To transform the REGION variable into dummies, we therefore need to create nine new columns in our dataset, one for each region separately. For example, if a sample unit in our data (i.e., a row) belongs to REGION 32, it will have a value of 1 in the new “REGION_32” column, while all sample units from a different region will have a value of 0 in that column.

We can create these dummy variables (columns) based on REGION using the function get_dummies from the pandas package. We give the columns of the dummy variables the prefix ‘REGION’, followed by the region number. Then we concatenate (pd.concat) the dataframe one_hot horizontally (axis = 1) to the dataframe with all the data.

one_hot = pd.get_dummies(data1['REGION'], prefix = 'REGION').astype(int)
print(one_hot.head())
datann = pd.concat([datann, one_hot], axis = 1)
Question 1: Are there other categorical variables in the dataset you might want to transform into dummy variables? If yes, you can do so below. Don't forget to drop the original variable from the dataframe, see the line below. Go to Canvas and answer the question: why can't we use categorical variables in an ANN model without transforming them into dummy variables first? (You might have to do some searching on your own.)
datann = datann.drop(['REGION'], axis = 1)

4. Normalizing the data#


In general, ANN models make better predictions with scaled or normalized variables. Neural networks are sensitive to the scale of the input data, and large values can slow down or prevent convergence during training. By normalizing the data, we bring all the features to a similar scale, which can help the optimization process converge faster and more reliably. The model could still converge without normalization, but normalization makes training more stable.

We can, for example, normalize by subtracting the mean of the variable and dividing by the standard deviation. Another way to normalize is by subtracting the minimum of the variable and then dividing by the difference between the maximum and minimum of the variable:

x_normalized = (x - mean(x)) / std(x)

x_normalized = (x - min(x)) / (max(x) - min(x))

We could calculate the normalized variables ourselves, but in this tutorial we add a normalization layer to the ANN, which is more convenient (although model training can take longer, depending on how the layer is implemented within the ANN).
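As a minimal sketch of the manual approach (shown here for the OLDEST column, which we will use below; this is not part of the tutorial flow), both normalizations are one-liners in pandas:

age_z = (datann['OLDEST'] - datann['OLDEST'].mean()) / datann['OLDEST'].std()
age_minmax = (datann['OLDEST'] - datann['OLDEST'].min()) / (datann['OLDEST'].max() - datann['OLDEST'].min())
print(age_z.describe())       # mean ~0, standard deviation ~1
print(age_minmax.describe())  # minimum 0, maximum 1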

First, we split the data in training and testing data.

training_data, testing_data = train_test_split(datann, test_size=0.3, random_state=25)

In our first neural network model, we are going to construct a linear model with one explanatory variable: the age of the oldest household member, i.e. OLDEST.

x_train = training_data[['OLDEST']]
y_train = training_data[['COSTENERGY']]

x_test = testing_data[['OLDEST']]
y_test = testing_data[['COSTENERGY']]

In the next lines of code, we build the normalizing layer. The input_shape is [1,], because we only have one explanatory variable, which is a column vector. With the function adapt we fit the normalizer to the column OLDEST in x_train. We then print the mean of OLDEST estimated by the normalizer.

linearmodel_normalizer = tf.keras.layers.Normalization(input_shape=[1,], axis=None)
linearmodel_normalizer.adapt(np.array(x_train['OLDEST']))
print(linearmodel_normalizer.mean.numpy())
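As an optional sanity check (a sketch, not required for the tutorial), we can pass the training column through the fitted normalizer and verify that the output has a mean of roughly 0 and a standard deviation of roughly 1:

# Reshape to (n, 1) so the layer receives one feature per row.
sample = np.array(x_train['OLDEST']).reshape(-1, 1)
normalized = linearmodel_normalizer(sample)
print(float(tf.reduce_mean(normalized)), float(tf.math.reduce_std(normalized)))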

5. Building a neural network model with 1 variable#


In the next step, we are going to construct a very simple neural network. It will simulate a linear regression model with one explanatory variable. Remember that a linear regression has the following form: y = m + bx, where y is the dependent variable (COSTENERGY), x is the explanatory variable (we use OLDEST), m is the intercept, and b is the regression coefficient. In the model below, you see that the output layer keras.layers.Dense(units=1, activation=’linear’) has a linear activation function and 1 unit of output (the output being COSTENERGY). Hence we go from 1 unit of input, the age of the OLDEST member of the household, to 1 unit of output. Before the output layer, we add the normalizing layer.

linearmodel = tf.keras.Sequential([
    linearmodel_normalizer,
    keras.layers.Dense(units=1, activation='linear')
])

With the next line of code, we can check the structure of the model.

linearmodel.summary()

From the summary, we can see that there are three parameters in the normalization layer and two parameters in the output layer.
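To see where these counts come from (a small sketch of the arithmetic): a Dense layer has inputs × units weights plus units biases, so one input and one unit give 1 + 1 = 2 trainable parameters, while the normalization layer stores a mean, a variance, and a sample count (3 non-trainable parameters):

# 2 trainable (weight + bias) + 3 non-trainable (mean, variance, count) = 5.
print(linearmodel.count_params())  # expected: 5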

Question 2: Does this model have hidden layers?

Let’s visualize the neural network model. We quickly build the model again, but without the normalizing layer, because the keras-visualizer library does not allow for it.

from keras_visualizer import visualizer

linearmodelplot = keras.Sequential([
      keras.layers.Dense(1, activation='linear', input_shape = (1,)),
  ])

linearmodelplot.compile(optimizer='adam', loss='mean_squared_error')

visualizer(linearmodelplot, file_name = 'linear_model', file_format = 'png', view=True)
# You find the graph when you click on the folder icon in the top left.
# The file is called linear_model.png.

Back to the linear model with the normalizing layer. Now that we have built the model, we have to specify how it is trained, using the compile function. We use the Adam optimizer with a learning rate of 0.05. Our loss function is the mean absolute error; during training, this loss function is minimized by the optimization algorithm. We also include the mean squared error in the metrics.

linearmodel.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),
    loss='mean_absolute_error',
    metrics = [tf.keras.metrics.MeanSquaredError()])

6. Training a neural network model#


Let’s start training the model using the function fit. In the code below, you can decide on the number of epochs, but we advise choosing a number between 25 and 40 (there is no right or wrong here), because it takes about 4 seconds per epoch to train the model. In each epoch, the neural network goes through all the training data and optimizes the parameters using the Adam optimizer. The more often the model sees the data, the more accurate the parameter estimates become. In each epoch, we use 20% of the training data to validate the results.

nr_epochs = 30  # Pick your own number of epochs; between 25 and 40 is advised.

history_linearmodel = linearmodel.fit(
    x_train['OLDEST'],
    y_train['COSTENERGY'],
    epochs=nr_epochs,
    verbose=1, # Shows the training progress.
    validation_split = 0.2) # Calculate validation results on 20% of the training data.

Let’s see the training progress of the model. We plot the training loss and the validation loss. Recall that the loss function was specified as the mean absolute error of the model. We should see a curve going down; this is the learning process.

plt.figure(figsize=(12, 6)) 
plt.plot(history_linearmodel.history['loss'], label='loss')
plt.plot(history_linearmodel.history['val_loss'], label='val_loss')
plt.xlabel('Epoch')
plt.ylabel('Error')
plt.xticks(np.arange(0,nr_epochs), np.arange(1,nr_epochs+1))
plt.xlim(0,nr_epochs-1)
plt.legend()
plt.grid(True)
plt.show()
Question 3: Interpret what you see in the figure. What is the number of epochs you chose? Do you think the model has been trained enough? Should we be concerned about overfitting or underfitting issues here?

The function get_weights() gives the estimated parameters of the neural network. The first three numbers belong to the normalization layer and the last two numbers are the weights of the output layer.

linearmodel.get_weights()
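To make the ordering explicit, here is a sketch that unpacks the five arrays (the variable names are ours, not part of the Keras API):

# Normalization layer first (mean, variance, count), then the Dense layer
# (kernel and bias). The kernel is the slope on the *normalized* input.
norm_mean, norm_variance, norm_count, kernel, bias = linearmodel.get_weights()
print(kernel, bias)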

Now we compare these weights to an OLS linear regression.

Y = datann[['COSTENERGY']]
X = datann[['OLDEST']].copy()  # Copy to avoid pandas' SettingWithCopyWarning.
X['Constant'] = 1

regressionOLS = sm.OLS(Y, X)
resultsOLS = regressionOLS.fit()

print(resultsOLS.summary())
Question 4: The parameters of the neural network model are very different from the parameters in the OLS regression. Why do you think that is? The number of epochs you chose does not have such a large effect on the parameters, so that is not the answer.

Next we take a look at the predictions of the model. We let the model predict on an array of x values ranging from the minimum to the maximum of OLDEST, divided into 200 equal steps. Using the function predict, we obtain the predicted COSTENERGY, given by y in the code.

x = tf.linspace(round(min(x_train['OLDEST'])), round(max(x_train['OLDEST'])), 200)
y = linearmodel.predict(x)
plt.figure(figsize=(12, 6))
plt.scatter(datann['OLDEST'], datann['COSTENERGY'], label='Data')
plt.plot(x, y, color='k', label='Predictions')
plt.xlabel('OLDEST')
plt.ylabel('COSTENERGY')
plt.legend()
plt.show()
Question 5: Can you derive the parameters of the linear regression given by y = m + bx in this plot? Use code to find out the values of m and b. Are they similar to the coefficients of the OLS model?

7. A non-linear neural network model#


Now we are going to build a non-linear neural network model. We include a hidden layer with 16 units in the model. ReLU is the activation function in this layer, and the output layer has a linear activation function, because we are dealing with a regression problem. Note that linear is the default activation function, so we don’t have to specify it.

nonlinear_model = keras.Sequential([
      linearmodel_normalizer,
      keras.layers.Dense(16, activation='relu', input_shape = (1,)),
      keras.layers.Dense(1)
  ])

nonlinear_model.summary()

From the summary, we can see that the non-linear model has considerably more parameters to estimate than the linear model, even though we still have only one explanatory variable.
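As a sketch of where that count comes from (inputs × units weights plus units biases per Dense layer):

# Dense(16) on 1 input:   1 * 16 weights + 16 biases = 32
# Dense(1) on 16 inputs: 16 *  1 weights +  1 bias   = 17
# Normalization layer: mean, variance, count          =  3 (non-trainable)
print(nonlinear_model.count_params())  # expected: 52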

Question 6: The more nodes and layers an ANN model has, the more flexible it becomes, often resulting in a better fit to the training data. That is generally desirable, as long as the model is not overfitting... Explain in your own words what the problem with overfitting is.

Let’s again visualize the model (without the normalizing layer). Note that not all nodes are depicted in the graph.

nonlinear_modelplot = keras.Sequential([
      keras.layers.Dense(16, activation='relu', input_shape = (1,)),
      keras.layers.Dense(1)
  ])

visualizer(nonlinear_modelplot, file_name = 'nonlinear_model', file_format='png', view=True)
# You find the graph when you click on the folder icon in the top left.
# The file is called nonlinear_model.png.

Next, we compile the model. We use a smaller learning rate this time.

nonlinear_model.compile(loss='mean_absolute_error',
                optimizer=tf.keras.optimizers.Adam(0.0005),
                metrics = [tf.keras.metrics.MeanSquaredError()])
Question 7: What does the learning rate parameter control in the model? What would happen if you chose a smaller or larger learning rate? You can use the figure depicting the loss over the epochs of the linear model in your answer. (You might want to refer to the Python package's documentation.)

And we train the model:

nr_epochs = 25
history_nonlinearmodel = nonlinear_model.fit(
    x_train['OLDEST'],
    y_train['COSTENERGY'],
    epochs=nr_epochs,
    # Show logging.
    verbose=1,
    # Calculate validation results on 20% of the training data.
    validation_split = 0.2)
plt.plot(history_nonlinearmodel.history['loss'], label='loss')
plt.plot(history_nonlinearmodel.history['val_loss'], label='val_loss')
plt.xlabel('Epoch')
plt.ylabel('Error')
plt.xticks(np.arange(0,nr_epochs), np.arange(1,nr_epochs+1))
plt.xlim(0,nr_epochs-1)
plt.legend()
plt.grid(True)
plt.show()

And next we take a look at the predictions by the non-linear neural network.

x = tf.linspace(round(min(x_train['OLDEST'])), round(max(x_train['OLDEST'])), 200)
y = nonlinear_model.predict(x)

plt.figure(figsize=(12, 6))
plt.scatter(datann['OLDEST'], datann['COSTENERGY'], label='Data')
plt.plot(x, y, color='k', label='Predictions')
plt.xlabel('OLDEST')
plt.ylabel('COSTENERGY')
plt.legend()
plt.savefig('nonlinearpred.png', dpi = 200)
plt.show()
Question 8: What do you think of the performance of the non-linear model? Is there a big difference with the linear model?

8. A non-linear neural network model with multiple variables#


In this part, we are going to estimate a neural network on all variables in the training dataset.

x_train = training_data.drop(columns = ['COSTENERGY'], axis = 1)
y_train = training_data[['COSTENERGY']]

x_test = testing_data.drop(columns = ['COSTENERGY'], axis = 1)
y_test = testing_data[['COSTENERGY']]

We now create a normalizing layer for all variables. The input shape is the number of columns in the training dataset.

fullmodel_normalizer = keras.layers.Normalization(input_shape=(x_train.shape[1],), axis=None)
fullmodel_normalizer.adapt(np.array(x_train))
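Note that with axis=None the layer computes a single mean and variance shared across all features. If you wanted each column to get its own mean and variance, a per-feature normalizer (not used in this tutorial, shown only as a sketch) would look like this:

per_feature_normalizer = keras.layers.Normalization(axis=-1)  # one mean/variance per input column
per_feature_normalizer.adapt(np.array(x_train))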

Building the model:

fullmodel = keras.Sequential([
      fullmodel_normalizer,
      keras.layers.Dense(32, activation='relu', input_shape = (x_train.shape[1],)),
      keras.layers.Dense(16, activation='relu'),
      keras.layers.Dense(1)
  ])


fullmodel.summary()

Visualizing the model:

fullmodelplot = keras.Sequential([
      keras.layers.Dense(32, activation='relu', input_shape = (x_train.shape[1],)),
      keras.layers.Dense(16, activation='relu'),
      keras.layers.Dense(1)
  ])

visualizer(fullmodelplot, file_name = 'full_model', file_format='png', view=True)

Setting the options for training:

fullmodel.compile(loss='mean_absolute_error',
                optimizer=tf.keras.optimizers.Adam(0.00005),
                metrics = [tf.keras.metrics.MeanSquaredError()])

Training the model:

nr_epochs = 20
history_fullmodel = fullmodel.fit(
    x_train,
    y_train['COSTENERGY'],
    epochs=nr_epochs,
    # Show logging.
    verbose=1,
    # Calculate validation results on 20% of the training data.
    validation_split = 0.2)

Plot the learning progress:

plt.plot(history_fullmodel.history['loss'], label='loss')
plt.plot(history_fullmodel.history['val_loss'], label='val_loss')
plt.xlabel('Epoch')
plt.ylabel('Error')
plt.xticks(np.arange(0,nr_epochs), np.arange(1,nr_epochs+1))
plt.xlim(0,nr_epochs-1)
plt.legend()
plt.grid(True)
plt.show()

9. Evaluating the performance#


In this last step, we take a look at the performance of the models on the testing dataset. We check the mean absolute error and the mean squared error of the three models using the function evaluate.

test_loss_linear = linearmodel.evaluate(x_test['OLDEST'], y_test['COSTENERGY'], verbose=1)
test_loss_nonlinear = nonlinear_model.evaluate(x_test['OLDEST'], y_test['COSTENERGY'], verbose=1)
test_loss_full = fullmodel.evaluate(x_test, y_test['COSTENERGY'], verbose=1)
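To line up the numbers side by side, here is a small sketch (evaluate returns the loss, here the MAE, followed by the metrics, here the MSE):

results = {
    'linear (OLDEST only)': test_loss_linear,
    'non-linear (OLDEST only)': test_loss_nonlinear,
    'non-linear (all variables)': test_loss_full,
}
for name, (mae, mse) in results.items():
    print(f'{name}: MAE = {mae:.2f}, MSE = {mse:.2f}')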
Question 9: Interpret the results. Which model performed best?