Tutorial 1: Social Media & Natural Hazards

Tutorial 1: Social Media & Natural Hazards#

Natural processes such as thunderstorms, wildfires, earthquakes, and floods may lead to significant losses in terms of property and human life. Gathering information about the damages in time is crucial and may help in mitigating the loss, and faster recovery (Said et al, 2019).

Social media are one of the most important sources of not only real-time information but records since their existence. They have been crawled over the years to collect and analyze disaster-related multimedia content (Said et al, 2019). There are different applications where we can use social media data to analyze natural disasters.

Through this tutorial, we will learn how we can use Twitter data to analyze natural hazards. We will do so by applying the concept of natural language processing.

This tutorial is heavily based upon the work of others.

Important before we start#

Make sure that you save this file before you continue, else you will lose everything. To do so, go to Bestand/File and click on Een kopie opslaan in Drive/Save a Copy on Drive!

Now, rename the file into Week6_Tutorial1.ipynb. You can do so by clicking on the name in the top of this screen.

By using this notebook and associated files, you agree to the Twitter Developer Agreement and Policy, which can be found here.

Learning Objectives#

Learn about the importance and application of social media data
Access social media (Twitter) through the API
Retrieve Twitter data
Filter and clean the retrieved data
Visualize the data in different plots such as bar, scatter, and spatial.

Tutorial outline

1. Introducing the packages
2. Social Media
3. Natural Language Processing (NLP)
4. Data retrieval and post-processing
5. Applications: detecting natural hazards

1.Introducing the packages#

Within this tutorial, we are going to make use of the following packages:

GeoPandas is a Python packagee that extends the datatypes used by pandas to allow spatial operations on geometric types.

JSON is a lightweight data interchange format inspired by JavaScript object literal syntax.

Matplotlib is a comprehensive Python package for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.

NLTK is a platform for building Python programs to work with human language data.

NumPy is a Python library that provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays.

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

wordcloud is a little word cloud generator in Python.

Now we will import these packages in the cell below:

import numpy as np
from matplotlib import pylab as plt
from matplotlib.lines import Line2D
import pandas as pd
from datetime import datetime, date
import geopandas as gpd
import json
import os
import sys

from mpl_toolkits.axes_grid1 import make_axes_locatable

from wordcloud import WordCloud
from PIL import Image
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer

%matplotlib inline

2. Social Media#

Social media are interactive technologies that facilitate the creation and sharing of information, ideas, interests, and other forms of expression through virtual communities and networks.

Therefore, social media can be used as a source of real-time information for natural disaster detection. Moreover, the database can be used to post-analyze natural disasters for a better estimation of the extent and the damages the hazard had caused.

Some of the most popular social media websites, with more than 100 million registered users, include Facebook (and its associated Facebook Messenger), TikTok, WeChat, ShareChat, Instagram, QZone, Weibo, Twitter, Tumblr, Baidu Tieba, and LinkedIn.

Twitter has been proven to be a useful data source for many research communities (Ekta et al, 2017, Graff et al, 2022), from social science to computer science, it can advance research objectives on topics as diverse as the global conversations happening on Twitter. It is one of the most popular online social networking sites with around 450 million monthly active users as of 2022. An important characteristic of Twitter is its real-time nature.

Twitter offers tools and programs that help people when emergencies and natural disasters strike, allowing channels of communication and humanitarian response, among other areas of focus such as environmental conservation and sustainability.

The Twitter API enables programmatic access to Twitter in unique and advanced ways. Twitter’s Developer Platform enables you to harness the power of Twitter’s open, global, real-time, and historical platform within your own applications. The platform provides tools, resources, data, and API products for you to integrate, and expand Twitter’s impact through research, solutions, and more.

Unfortunately, Twitter has made their API policy very strict since February 12. As such, we cannot use their API within this tutorial. As such, we will use data that was previously retrieved by us (in the preparation of this tutorial).

Download the ‘Week6_Data’ folder provided in Canvas and save it to your previously created BigData folder on your Google Drive.

Connect to google drive#

To be able to read the data from Google Drive, we need to mount our Drive to this notebook.

As you can see in the cell below, make sure that in your My Drive folder, where you created BigData folder and within that folder, you have created a Week6_Data folder in which you can store the files that are required to run this analysis.

Please go the URL when its prompted in the box underneath the following cell, and copy the authorization code in that box.

from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

sys.path.append("/content/gdrive/My Drive/BigData/Week6_Data")

data_path = os.path.join('/content/gdrive/My Drive/BigData','Week6_Data')

User Information#

Trough Twitter API we can make requests such as getting the information of a user, for example, we can show the followers of a user or list the user’s latest post.

The following line will open the previously retrieved information of the IVM - VU account.

user_file = os.path.join(data_path, r'VU_user.jsonl')

f = open(user_file)
user = json.load(f)

Have look at the information provided by Twitter for a specific user using:

user

We can also look at a specific field of the data (e.g. name, location, etc.).

Here are some examples:

print(f"user.name: {user['name']}")
print(f"user.screen_name: {user['screen_name']}")
print(f"user.location: {user['XXXX']}") # change the XXXX for the field we want to print
print(f"user.description: {user['XXXX']}")
print(f"user.followers_count: {user['XXXX']}")
print(f"user.listed_count: {user['listed_count']}")
print(f"user.statuses_count: {user['statuses_count']}")
print(f"user urls: {user['entities']['url']['urls'][0]['expanded_url']}")

Now let’s see the latest 5 tweets the IVM has posted. We will use the data that was previously retrieved:

tweets_file = os.path.join(data_path, r'VU_tweets.jsonl')

with open(tweets_file) as f:
    tweets = [json.loads(line) for line in f]

Latest_tweets = tweets[0]

Have you tried to see what the data looks like?

Latest_tweets[0]

As you can see, data contains many parameters and its format is not that convenient for analyzing it.

Later we will learn how to create a DataFrame and process the data. For now, some important parameters we can extract from the tweet data are the description, location, text, hashtags, among others.

Here are some examples:

print(f"username: {Latest_tweets[0]['screen_name']}")
print(f"description: {Latest_tweets[0]['XXXX']}")# change the XXXX for the field we want to print
print(f"text: {Latest_tweets[0]['XXXX']}")
print(f"date_time: {Latest_tweets[0]['XXXX']}")
print(f"location: {Latest_tweets[0]['XXXX']}")
print(f"coordinates: {Latest_tweets[0]['XXXX']}")
print(f"following: {Latest_tweets[0]['friends_count']}")
print(f"followers: {Latest_tweets[0]['followers_count']}")
print(f"totaltweets: {Latest_tweets[0]['statuses_count']}")
print(f"retweetcount: {Latest_tweets[0]['retweet_count']}")
print(f"hashtags: {Latest_tweets[0]['hashtags']}")

Question 1: How many followers does @VU_IVM have? When was the last time @VU_IVM tweeted/retweeted?

We’ve now learned how to make Twitter API requests (albeit unconnected) to get some information from a specific user.

Please note that we can also make changes to our own account such as updating our profile and interacting with other users. If you’re enthusiastic about it, you can find more information here .

3. Natural Language Processing (NLP)#

As mentioned in Lecture, Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interaction between computers and humans in natural language. The ultimate goal of NLP is to help computers understand language as well as we do. There are two ways of understanding natural language: Syntactic and Semantic analysis. Whereas Syntactic analysis (also referred to as syntax analysis or parsing) is the process of analyzing natural language with the rules of formal grammar, Semantic analysis is the process of understanding the meaning and interpretation of words, signs, and sentence structure.

There are different techniques for understanding text such as Parsing, Stemming, Text Segmentation, Named Entity Recognition, Relationship Extraction, and Sentiment Analysis (see Lecture).

In this section, we’ll utilize the NLTK package to tokenize and normalize our corpus. Tokenization involves breaking down text into individual words or tokens, while normalization ensures uniformity by standardizing text formats.

3.1 Downloading Punctuation and Wordnet Corpus#

Before we dive into tokenization and normalization, let’s ensure we have the necessary resources. We’ll begin by downloading the punctuation corpus and the WordNet lexical database, which will be useful for later stages of text processing.

nltk.download('punkt')
nltk.download("wordnet")

By executing the above code, NLTK will acquire the required resources, essential for accurate tokenization and further linguistic analysis.

3.2 Understanding Corpus#

Corpus (literally Latin for body) refers to a collection of texts. Such collections may be formed of a single language of texts, or can span multiple languages.

example_sent = """This is a sample sentence, showing off the stop words filtration. We will also show a sample word cloud"""
print(example_sent)

3.3 Tokenization#

Tokenization is, generally, an early step in the NLP process, a step which splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc. Further processing is generally performed after a piece of text has been appropriately tokenized.

word_tokens = word_tokenize(example_sent)
print(word_tokens)

3.4 Normalization#

Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, expanding contractions, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing, and allows processing to proceed uniformly.

# Initialize a tokenizer
tokenizer = RegexpTokenizer(r'\w+')
normalized_sentence = tokenizer.tokenize(str(XXXX)) # Change the XXXX for the sentence you want to normalize
print(f"Normalized sentence: {normalized_sentence}")
print(f"Length: {XXXX}") # Print the length of the normalized sentence

3.5 Stop Words#

Stop words are those words which are filtered out before further processing of text, since these words contribute little to overall meaning, given that they are generally the most common words in a language. For instance, “the,” “and,” and “a,” while all required words in a particular passage, don’t generally contribute greatly to one’s understanding of content.

nltk.download('stopwords')

By executing the above code, NLTK will acquire the required resources.

Now, let’s acquire the stop words for English and have a look at them:

stop_words = set(stopwords.words('english'))
print(stop_words)

The following line will convert words in word_tokens to lowercase and filter out stop words

filtered_sentence = [w for w in normalized_sentence if not w.lower() in stop_words]
print(len(filtered_sentence)), print(filtered_sentence)
print(f"Filtered sentence: {filtered_sentence}")
print(f"Length: {XXXX}") # Print the length of the filtered sentence

3.6 Lemmatization and Stemming#

Lemmatization is related to stemming, differing in that lemmatization is able to capture canonical forms based on a word’s lemma. For example, stemming the word “better” would fail to return its citation form (another word for lemma); however, lemmatization would result in changing better into good.

# Initialize wordnet lemmatizer
wnl = WordNetLemmatizer()

# Lemmatize filtered words
lemmatized = [wnl.lemmatize(word, pos="v") for word in XXXX] # Change the XXXX for the sentence you want to lemmatize
print(f"Lemmatized sentence: {lemmatized}")
print(f"Length: {XXXX}")  # Print the length of the lematized sentence

Stemming is the process of eliminating affixes (suffixed, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.

# Initialize Python porter stemmer
ps = PorterStemmer()

# Stem filtered words
stemmed = [ps.stem(word) for word in XXXX] # Change the XXXX for the sentence you want to stem
print(f"Stemmed sentence: {stemmed}")
print(f"Length: {XXXX}") # Print the length of the stemmed sentence

Lets see the differences between lemmatization and stemming together

print("{0:20}{1:20}{2:20}".format("Word", "Lemmatized", "Stemmed"))

for word in filtered_sentence:
   print ("{0:20}{1:20}{2:20}".format(word, wnl.lemmatize(word, pos="v"), ps.stem(word),))

Question 2: explain the differences between the three datasets: the original word data, the lemmatized data, and the stemmed data. Additionally, in what scenarios would you choose to use each type of data? (Feel free to explore the differences by using a different sentence.)

3.7 Visualization#

In the process of text analysis, we delve into the frequency of words to gain insights. This starts by distinguishing unique words within the text – those that occur only once. Once we’ve identified these unique terms, we proceed to quantify how frequently each one appears throughout the text. This meticulous approach allows us to grasp the significance and prominence of different words, aiding in the extraction of meaningful patterns and themes.

To count the unique words in a text data, you can use Python’s built-in data structures such as sets or dictionaries. Here’s an example:

sentence = 'Big Data Analysis is really fun!'
unique, count = np.unique(sentence.split(), return_counts=True)
print(unique, count)

np.unique() returns two arrays, one contains the unique elements of a sentence, and the other contains the corresponding counts of each unique element.

By using dict(zip(array_1, array_2)) we can create a dictionary unique_counts where the unique elements are the keys and their counts are the values.

Now, let’s revisit our previous example and bring our data to life with a visually bar plot.

unique, count = np.unique(filtered_sentence, return_counts=True)
unique_counts = dict(zip(unique, count))

# Make sure to change the XXX in order to have a nice plot
plt.figure(figsize=(20, 5))
plt.bar(unique, count)
plt.xticks(fontsize = XX, rotation = XX)
plt.yticks(fontsize = XX)
plt.title(f'Words count\n', fontsize = 20)
plt.ylabel(f'XXX', fontsize=16)
plt.xlabel(f'XXX', fontsize=XX)
plt.tight_layout()

Now, let’s replace our bar plot with a captivating word cloud to visualize the frequency of unique words in our dataset.

A word cloud is a visual representation of text data, where the size of each word indicates its frequency or importance within the text. The more frequently a word appears in the text, the larger and more prominent it appears in the word cloud.

Let’s create a word cloud to explore the frequency of unique words in our dataset.

# Generate a WordCloud object using the unique word frequencies
wc = WordCloud(background_color='XXXX').generate_from_frequencies(unique_counts)

# Display the WordCloud
plt.imshow(wc)
plt.axis('off')
plt.show()

Question 3: What are the advantages and disadvantages of utilizing a bar plot compared to a word cloud for visualizing textual data?

4. Data retrieval and post-processing#

We’ve learned how to retrieve the tweets of a specific user. Now it is time to retrieve them by using keywords that can be content in the text such as hashtags. That will allow us to analyze what is happening at a specific time and/or location.

The following data contains 100 filtered tweets that contain the word “Earthquake”

tweets_file = os.path.join(data_path, r'Earthquakes_tweets.jsonl')

with open(tweets_file) as f:
    tweets = [json.loads(line) for line in f]

list_tweets = tweets[0]

You can see how the data looks like:

list_tweets[0]

As you may have noticed, it is difficult to read the tweets, and therefore, in order to do an analysis we first need to process the data.

We know how to extract specific fields from the data, it would be handy to create then a DataFrame with the information we need.

The following function uses the list of tweets to create a DataFrame, extract the information of each tweet, and finally add it to the DataFrame:

def DataFrame_tweets(list_tweets):
        # Creating DataFrame using pandas
        db = pd.DataFrame(columns=['username',
                                'description',
                                'date_time',
                                'location',
                                'following',
                                'followers',
                                'totaltweets',
                                'retweetcount',
                                'text',
                                'hashtags'])

        # we will iterate over each tweet in the
        # list for extracting information about each tweet
        for tweet in list_tweets:
                username = tweet['screen_name']
                description = tweet['description']
                date_time = tweet['created_at']
                location = tweet['location']
                following = tweet['friends_count']
                followers = tweet['followers_count']
                totaltweets = tweet['statuses_count']
                retweetcount = tweet['retweet_count']
                text = tweet['text']
                hashtags = tweet['hashtags']


                # Here we are appending all the
                # extracted information in the DataFrame
                ith_tweet = [username, description, date_time,
                        location, following,
                        followers, totaltweets,
                        retweetcount, text, hashtags]
                db.loc[len(db)] = ith_tweet
        return db

Let’s see what our DataFrame looks like:

db = DataFrame_tweets(list_tweets)
db

Please take a minute to look at the information about the location.

As you may notice, not all the users share the location, and some of the users that do share it, do not necessarily use a real location.

We can also obtain the geographical location by coordinates, let’s try and find out if there is more information.

def DataFrame_tweets_coordinates(list_tweets):
        # Creating DataFrame using pandas
        db = pd.DataFrame(columns=['username',
                                'description',
                                'date_time',
                                'location',
                                'Coordinates',
                                'following',
                                'followers',
                                'totaltweets',
                                'retweetcount',
                                'text',
                                'hashtags'])

        # we will iterate over each tweet in the
        # list for extracting information about each tweet
        for tweet in list_tweets:
                username = tweet['screen_name']
                description = tweet['description']
                date_time = tweet['created_at']
                location = tweet['location']
                coordinates = tweet['coordinates']
                following = tweet['friends_count']
                followers = tweet['followers_count']
                totaltweets = tweet['statuses_count']
                retweetcount = tweet['retweet_count']
                text = tweet['text']
                hashtags = tweet['hashtags']


                # Here we are appending all the
                # extracted information in the DataFrame
                ith_tweet = [username, description, date_time,
                        location, coordinates, following,
                        followers, totaltweets,
                        retweetcount, text, hashtags]
                db.loc[len(db)] = ith_tweet
        return db

db = DataFrame_tweets_coordinates(list_tweets)
db

Let’s count the tweets that contain the geographical location:

count = 0
for i in range(len(db)):
    if db.Coordinates[i] != None:
        count += 1 # or count = count + 1

print(f'{XXXX} tweets contain the coordinates out of the 100 retrieved tweets') # XXXX 'count'

Question 6: How many tweets contain the coordinates? Mention at least one advantage and one disadvantage of not sharing the coordinates.

Question 7: Provide an explanation of how the for loop and the count function operate.

Unfortunately not all (or none) of the users share the real location nor allow the geolocation for the coordinates.

Therefore, when we want to analyze a certain region we can’t use all the tweets and we need to further filter the information.

Open the previously retrieved dataset, it contains only the tweets with coordinates::

db_file = os.path.join(data_path, r'Earthquakes_wc_db.csv')
db = pd.read_csv(db_file, delimiter=';')

db

5. Application: Natural Hazards#

Earthquakes#

In our previous section, we started filtering the tweets by keywords and location.

We’ll continue with the earthquake example.

Did you notice there is a user that uses the USGS as a source of its tweets?

That’s right, it is ‘everyEarthquake’.

Let’s use the last 100 posts made by @everyEarthquake:

db_file = os.path.join(data_path, r'everyEarthquake_db.csv')
db = pd.read_csv(db_file, delimiter=';')

We can use GeoPandas to plot the location of the last 100 earthquakes.

First, we will convert our DataFrame to a GeoDataFrame:

dbg = gpd.GeoDataFrame(db, geometry=gpd.points_from_xy(db.X, db.Y))
dbg

Now let’s see what our data looks like:

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
plt.rcParams['font.size'] = '16'

# Make sure to change the XXX in order to have a nice plot
fig, ax = plt.subplots(1, 1, figsize=(20, 10))

world.boundary.plot(ax=ax, color='xxx', alpha= xx)
dbg.plot(ax=ax, marker='o', color='xxxx')

ax.set_title('XXXX', fontsize=XX)
ax.set_ylabel('XXX', fontsize = 16, rotation = XX)
ax.set_xlabel('XXX', fontsize = 16)
plt.xticks(fontsize = XX)
plt.yticks(fontsize = XX)
legend_elements = [Line2D([0], [0], marker='o', color='w', label=f'XXX', markerfacecolor='XX', markersize=10)]
plt.legend(handles=legend_elements, fontsize=16, bbox_to_anchor=(0.85,1), loc="upper left")

world.boundary.plot(ax=ax, color='k', alpha=.3)
dbg.plot(ax=ax, marker='o', color='red')

Have you noticed that the text contains information about the magnitude of earthquakes?

We can also indicate the magnitude in our plot.

We need to process the information to extract the values from the text:

dbg['magnitude'] = 0
for i in range(len(db)):
    text = dbg.text.loc[i]
    text_split= text.split(" ")
    Mag = float(text_split[3].replace('M', ''))
    dbg['magnitude'][i] = Mag

dbg

Now let’s see what the plot looks like:

# Make sure to change the XXX in order to have a nice plot

#Plot size of the circle
z = dbg.magnitude

#Plot color of the circle
fig, ax = plt.subplots(1, 1, figsize=(20, 10))

world.boundary.plot(ax=ax, color='xxxx', alpha=xxx)

divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="2%", pad=0.1)
dbg.plot('magnitude', ax=ax, marker='o', markersize=z*30, edgecolors='k' , cmap='YlOrRd',
         vmin=0, vmax=8, zorder=2, legend=True,
         legend_kwds={'label': f"Magnitude", 'orientation': "vertical"}, cax=cax)

ax.set_title('XXX', fontsize=22) # set title
ax.set_ylabel('XXX', fontsize = XX, rotation = XX)
ax.set_xlabel('XXX', fontsize = XX)
plt.xticks(fontsize = XX)
plt.yticks(fontsize = XX)
legend_elements = [Line2D([0], [0], marker='o', color='w', label=f'Earthquake', markerfacecolor='w', markeredgecolor='k', markersize=10)]
plt.legend(handles=legend_elements, fontsize=16, bbox_to_anchor=(-8,1), loc="upper left")

Question 8: Where and when happen the earthquake with the highest magnitude? Which is its magnitude?

We can have a look at the map from the USGS website.*

You may notice disparities, as the tweets were retrieved in 2023, while the information from the USGS website is current and up-to-date.

As we have mentioned, users do not always share the location or the coordinates.

However, there are other applications where we can still use the tweets without the location.

Question 9: Can you think of examples of other applications? Mention at least one application that needs coordinates and one that does not need coordinates.

Flooding#

Here we have an example of floods. This time we will use a database that has been already downloaded. The database contains tweets about floods located in Texas from 30/07/2014 to 15/11/2022. We can read the data using the following code:

Flood_file = os.path.join(data_path, r'Floods_tweets.jsonl')

with open(Flood_file) as f:
    tweets = [json.loads(line) for line in f]
for tweet in tweets:
    tweet['text'] = tweet['text'].lower()

Let’s see what the data looks like:

for tweet in tweets[:10]:
    print(tweet['date'], '-', tweet['text'])

len(tweets)

We can also plot this data as a bar plot to identify the days when more tweets have been posted.

START_DATE = date(2014, 7, 30)
END_DATE = date(2022, 11, 15)

def plot_tweets(tweets, title):
    dates = [tweet['date'] for tweet in tweets]
    dates = [datetime.fromisoformat(date) for date in dates]
    plt.figure(figsize=(10, 5))
    plt.hist(dates, range=(START_DATE, END_DATE), bins=(END_DATE - START_DATE).days)
    plt.xticks(fontsize = 10)
    plt.yticks(fontsize = 10)
    plt.title(f'{title}', fontsize = 16)
    plt.ylabel(f'Count', fontsize=12)
    plt.xlabel(f'Date', fontsize=12)
    legend_elements = [Line2D([0], [0], color='b', label=f'Tweets')]
    plt.legend(handles=legend_elements, fontsize=10)
    plt.show()

plot_tweets(XXXX, 'XXXX') # tweets, title

Question 10: When does the high peak usually happen each year? What could be a potential explanation for the pattern?

Question 11: When does the bigger peak occur? What was the cause of it?

In Natural Language Process (NLP), semantic analysis is the process of understanding the meaning and interpretation of words.

This time we can use keywords to filter the tweets, identifying negative or positive meanings.

We’re starting with negative words such as ‘cry’ and ‘warning’:

negative_keywords = ['cry', 'warning']
filtered_tweets = []
for tweet in tweets:
    if not any(keyword in tweet['text'] for keyword in negative_keywords):
        filtered_tweets.append(tweet)

print(len(tweets))
print(len(filtered_tweets))

plot_tweets(XXXX, 'XXXX') # tweets, title

Now let’s try positive keywords such as ‘emergency’ and ‘rescue’:

positive_keywords = ['emergency', 'rescue']
filtered_tweets = []
for tweet in tweets:
    if any(keyword in tweet['text'] for keyword in positive_keywords):
        filtered_tweets.append(tweet)

print(len(tweets))
print(len(filtered_tweets))
plot_tweets(filtered_tweets, 'XXXX') # tweets, title

Question 12: Can you think about other negative keywords? Can you think of other positive keywords?