Tutorial 1: Data Exploration and Regression Analysis#

In this tutorial you will learn how to read and explore census data from the IPUMS USA database. IPUMS USA collects, preserves and harmonizes U.S. census microdata and provides easy access to this data with enhanced documentation. Data includes decennial censuses from 1790 to 2010 and American Community Surveys (ACS) from 2000 to the present.

IPUMS provides census and survey data from around the world integrated across time and space. IPUMS integration and documentation makes it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community context.

In this tutorial, you will learn how to read the IPUMS USA data, explore and manipulate it to prepare for analysis, and perform a simple correlation and regression analysis. We will use the IPUMS USA database throughout the next four tutorials.

Important before we start#


Make sure that you save this file before you continue, otherwise you will lose your work. To do so, go to File and click on Save a copy in Drive!

Now, rename the file to Week2_Tutorial1.ipynb. You can do so by clicking on the name at the top of this screen.


Learning Objectives#


  • Work with Pandas DataFrames containing real-world data.

  • Use basic functions to explore and understand the data.

  • Make distribution plots of the data.

  • Create a correlation map.

  • Create new variables.

  • Set up an OLS regression.

1. Introducing the packages#


Within this tutorial, we are going to make use of the following packages:

seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

NumPy is a Python library that provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays.

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

Now we will import these packages in the cell below:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

sns.set_style("whitegrid",{'axes.grid' : True})
pd.options.mode.chained_assignment = None

2. Reading the data and having a first look#


## Read the data using the pandas package
data = pd.read_csv(r"https://github.com/ElcoK/BigData_AED/raw/main/week2/usadataforweek2tut1.csv")

There are hundreds of variables included in the IPUMS USA database. In the next line of code we read in a file with a short description of the variables that we selected for this tutorial. The dataset contains socio-demographic variables such as age, gender, education, but also economic variables, such as income, mortgage payments, energy costs, etc. In these two weeks we are going to work towards two machine learning models that can predict energy costs per household.

datadescription = pd.read_csv(r"https://github.com/ElcoK/BigData_AED/raw/main/week2/shortdescription_ipumsusa_variables.csv", sep = ';', encoding = 'unicode_escape')
print(datadescription[['Variable_name','Short_description']])

Now let’s take a look at the dataset. A useful start is to explore the first rows through the head() function with Pandas.

data.head()

And let’s have a look at the total size of this dataset. To do so, we can use the shape attribute.

data.shape

And some more information through the describe() function.

data.describe()
Question 1: What information do .shape and .describe() give you about the data?

Let’s find out the data types of the data and if there are non-null counts.

data.info()

Another way to check if there are missing values is by using the following line of code.

data.isnull().sum().sort_values(ascending = False)
Question 2:
  • Please describe what the following line of Python code does:
    data.isnull().sum().sort_values(ascending = False)
    Hint: Every dot (.) indicates a new function.

  • Also describe what you see in the data. What does the output mean? Do you notice any surprises?

3. Exploratory data analysis#


Now we are going to take a look at some specific variables. We want to see how variables are distributed and how they are related to each other, and specifically how they are related to energy costs. We can do this in several ways.

Here we see the different values in the variable OWNERSHP and how many times each value occurs. If you want to know what the values mean, you can look up the variable on the IPUMS USA website.

data['OWNERSHP'].value_counts(dropna = False)
data['HHINCOME'].value_counts(dropna = False)

The function value_counts() is especially useful when a variable takes a limited number of values. For a variable like income, which contains many different values, value_counts() is less useful and a plot of the distribution is more insightful. We can use the seaborn package to easily plot the distribution of income.

%matplotlib inline
sns.displot(data['HHINCOME'], bins=50, kde=False)
%matplotlib inline
# Fewer bins
sns.displot(data['HHINCOME'], bins=20, kde=False)

And we can also include the kernel density by setting the kde parameter to True. Kernel Density Estimation (KDE) is a technique that lets you create a smooth curve from a set of data. This is useful if you want to visualize just the “shape” of the data, as a kind of continuous replacement for the discrete histogram.

%matplotlib inline
sns.displot(data['HHINCOME'], bins=50, kde=True)
Question 3: We just showed how to explore the OWNERSHP and HHINCOME variables. Choose a different variable and describe it. What are the units, and does it make sense to use value_counts() to understand this variable or would you rather use something different?

Now we are going to take a look at the variable we want to predict in a later stage, energy costs.

%matplotlib inline
sns.displot(data['COSTENERGY'], bins=50, kde=False)

Next we want to see how different variables are related to each other. We are specifically interested in which variables are correlated to energy costs, as we are going to predict energy costs using a simple regression model in Step 6 of this tutorial. To identify potential predictors of energy costs, we can create a correlation matrix using the code below. You can add a few variables you are interested in yourself. As you can see, there are several options in the seaborn (imported as sns) heatmap: annot = True makes sure the correlation values are printed in the heatmap (try switching it off); fmt = ".2f" specifies that we want two decimals; with cmap we specify the color palette; and with center = 0 we scale the colors such that a correlation of 0 corresponds to white.

%matplotlib inline
plt.figure(figsize=(8,6))
sns.heatmap(data[["COSTENERGY", "ROOMS", "RENT", "INCTOT", "HHINCOME", "BUILTYR2", "VEHICLES", "BEDROOMS", "FARM", "OWNERSHP", "MORTGAGE", "AGE", "SEX"]].corr(),
            annot=True, fmt=".2f", center=0, cmap="RdBu")
plt.show()
Question 4: Which variable is most correlated to COSTENERGY and which variable the least?

4. Creating new variables#


The dataset consists of households and individuals. Energy costs are a household variable, i.e. all members of the household have the same energy costs. Age and sex are individual variables, so the correlation of age or sex with energy costs is likely to be low. Think of a family with children, where a 3-year-old daughter has the same energy costs as the 40-year-old father. But maybe we can use these two variables, AGE and SEX, in another way to make them more useful. We can for example create a variable with the number of children in a household, the number of women, the age of the oldest person in the household, etc. To create these variables, we need to know which individuals belong to the same household, and luckily this is exactly what the column SERIAL indicates: SERIAL is the household ID.

Before we go into the variables SEX and AGE, we are going to create a new variable household size based on the variable SERIAL.

serialcount = data['SERIAL'].value_counts(dropna = False).reset_index()
print(serialcount)

The column names of the dataframe serialcount are a bit confusing: the index column corresponds to the SERIAL number and the column SERIAL corresponds to the household size. So let’s rename the SERIAL column to HHSIZE and the index column to SERIAL.
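The rename can be done in a single line. Here is a minimal sketch on toy data (the SERIAL values are made up). Because the column names that reset_index() produces differ across pandas versions, assigning names by position sidesteps that:

```python
import pandas as pd

# Toy stand-in for the tutorial data: SERIAL identifies the household.
data_toy = pd.DataFrame({'SERIAL': [1, 1, 1, 2, 2, 3]})

# Count occurrences of each SERIAL, i.e. the household size.
serialcount = data_toy['SERIAL'].value_counts(dropna=False).reset_index()

# Rename positionally: the first column is the household ID,
# the second column is the count.
serialcount.columns = ['SERIAL', 'HHSIZE']
print(serialcount)
```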

Then we add the column HHSIZE (household size) to our data using the .merge() function. Within merge(), you define on which column the two dataframes are merged using the on argument. Merge on the SERIAL column.

data = data.merge(serialcount, on = 'SERIAL', how = 'left')
Question 5: Please explain what the .merge() function does. Also explain the option 'on' and 'how'.
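To build intuition for how='left', here is a small illustration on two made-up frames: every row of the left frame is kept, and rows without a match in the right frame get NaN in the merged columns.

```python
import pandas as pd

left = pd.DataFrame({'SERIAL': [1, 2, 3], 'AGE': [40, 25, 60]})
right = pd.DataFrame({'SERIAL': [1, 2], 'HHSIZE': [3, 2]})

# how='left' keeps every row of `left`; SERIAL 3 has no match
# in `right`, so its HHSIZE becomes NaN.
merged = left.merge(right, on='SERIAL', how='left')
print(merged)
```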

Now we have merged a new variable HHSIZE to the dataframe. Next we will use the variable SEX to create a new variable. It is possible that women use more energy than men, or the other way around, so the number of women (or men) could be a useful variable for predicting energy costs. To obtain the household size, we simply counted the number of occurrences of each serial number, but to obtain the number of women per household, we also need the column SEX. The variable SEX has two values: 1 represents a male and 2 represents a female.

We use the function value_counts() to obtain the number of women and men per serial number, i.e., per unique household.

gendercount = data[['SERIAL', 'SEX']].value_counts(dropna = False).reset_index()
print(gendercount)

It is always good to realize that in many cases, multiple ways exist to obtain the same result. The following line of code produces the same results:

gendercount = data[['SERIAL','SEX']].groupby(['SERIAL','SEX']).size().reset_index()
print(gendercount)
Question 6: There are different ways to obtain the same result. Please describe the two lines of code above and explain how they take a different approach, but yet achieve the same outcome.

In the next steps you are going to write code that merges a new variable called ‘NR_OF_WOMEN’ (number of women) to the existing dataframe. The following hints guide you through the steps.

Three lines of code are sufficient. Hint 1 refers to the first line of code, Hint 2 to the second line, and Hints 3a, 3b, and 3c to the third line of code.

  • Hint 1: You want to create a variable that indicates the number of women per household using the dataframe gendercount. The only thing you have to do is keep the rows for women in the dataframe gendercount, i.e., use SEX == 2. Store this dataframe as a new dataframe named womencount, hence your first line of code starts with womencount = .

  • Hint 2: Change the column name 0 to NR_OF_WOMEN in the dataframe womencount. Note that 0 is not a string (so it’s not ‘0’), but a number.

  • Hint 3a: In the third line of code you merge the dataframe womencount to the dataframe data.

  • Hint 3b: Think about the column SEX in the womencount dataframe. It has no meaning anymore, since you only kept the women (all values in SEX are 2), so merging it into data is not meaningful. Moreover, data already has a column SEX, and merging another dataframe with the same column name would rename the two columns to SEX_x and SEX_y. To avoid confusion, only include the columns NR_OF_WOMEN and SERIAL of the womencount dataframe when merging it into data (Hint: you could for example use double square brackets).

  • Hint 3c: See the line of code where we merged serialcount to data.

womencount =
womencount =
data  = data.merge(

When you have successfully merged the number of women into the dataframe, what do you do with the NaN values? Fill in the dots. The fillna() function replaces NaN values with the value between the parentheses.

data['NR_OF_WOMEN'] = data['NR_OF_WOMEN'].fillna(...)
Question 7: Please provide the three lines of code you have written to successfully create and merge the new variable NR_OF_WOMEN. Also describe how you solved/filled the NaN values.

5. Nonlinear relationships#


We are going to create two new variables based on age. First we want to create a variable with the age of the oldest person.

oldest = data[['SERIAL', 'AGE']].groupby('SERIAL').max().reset_index()

We want to change the name of the column AGE to OLDEST, because we already have a column AGE in the dataframe data. Then we merge it to the dataframe.

oldest = oldest.rename(columns = {'AGE':'OLDEST'})
data = data.merge(oldest, on = 'SERIAL', how = 'left')

Now create a variable yourself with the age of the youngest person, named YOUNGEST, and add it to the dataframe data. Follow the same approach as how you added the OLDEST column.

youngest = data[['SERIAL', 'AGE']].groupby(
youngest = 
data = data.merge(
Question 8: Please provide the three lines of code you have written to successfully create and merge the new variable YOUNGEST.

Let’s check if the correlation of the oldest and youngest person with energy costs is higher than the correlation of age with energy costs.

%matplotlib inline
sns.heatmap(data[["COSTENERGY", "YOUNGEST", "OLDEST", "AGE"]].corr(),
            annot=True, fmt=".2f", center=0, cmap="RdBu")

A possible explanation for the low correlation could be that the relationship between age and energy costs is not linear. Let’s look at the average energy costs by the age of the oldest person in the household in a barplot.

%matplotlib inline
plt.figure()
sns.barplot(x = 'OLDEST', y = 'COSTENERGY', data = data, ci = None)
plt.xticks([15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95], #position of xticks.
           [15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]) #values of xticks.
plt.xlim([15,95])
plt.title('Oldest vs Cost of Energy')
plt.show()

In the graph we see that energy costs are relatively low especially for households whose oldest member is young. Energy costs then increase until that age reaches about 55 (maybe these are families with children) and then stabilize.

6. Regression analysis#


In the last part we are going to estimate a first model to predict energy costs of households. We will use a simple Ordinary Least Squares (OLS) regression model. To make life a bit easier, download the dataset for the OLS regression in the following line of code (it includes the new variables you created and it deals with missing values). We use the statsmodels package to estimate the OLS regression. We could also use the sklearn library, but statsmodels has the advantage that it reports standard errors of the coefficient estimates. This means that we can say something about the statistical significance of the explanatory variables, and, with appropriate care, about their relationship with the dependent variable, something which is not directly possible with a machine learning model.

dataOLS = pd.read_csv(r"https://github.com/ElcoK/BigData_AED/raw/main/week2/usadataforOLS.csv", sep = ',', encoding = 'unicode_escape')
Y = dataOLS[['COSTENERGY']]
X = dataOLS[["ROOMS", "HHINCOME", "BUILTYR2", "VEHICLES", "FARM", "OWNERSHP", "YOUNGEST", "OLDEST", "HHSIZE", "NR_OF_WOMEN"]]
X['Constant'] = 1
regressionOLS = sm.OLS(Y, X)
resultsOLS = regressionOLS.fit()
print(resultsOLS.summary())
Question 9: Interpret the results of the OLS regression. You can discuss the following questions in your answer. 1) Do you think that the coefficient estimates are plausible? You don't need to explain all coefficient estimates, but highlight a few. 2) Would you include other variables in the regression? Or do you want to exclude variables? Explain which ones. Make another correlation heatmap to support your explanation.


# code for correlation heatmap.