Tutorial 2: Drought Detection#
In this tutorial, we will learn how to detect and monitor drought in different regions around the world using standardized drought indices. We will explore how to visualize drought spatial maps and timeseries. In this practical, you will learn to apply a standardized method to calculate drought hazard for your chosen case study area over the historic period.
We will use the Multi-Source Weighted-Ensemble Precipitation (MSWEP) data to analyse droughts in the different regions. MSWEP is a global precipitation product with a daily 0.1° resolution available from 1979 to near real-time.
Important before we start#
Make sure that you save this file before you continue, else you will lose everything. To do so, go to Bestand/File and click on Een kopie opslaan in Drive/Save a Copy on Drive!
Now, rename the file to Week4_Tutorial2.ipynb. You can do so by clicking on the name at the top of this screen.
Learning Objectives#
Learn to apply a standardized method to represent meteorological drought hazard for a specific location;
Understand and interpret large-scale raster data;
Learn to visualize and interpret the results from drought hazard calculation methods.
Tutorial Outline
1. Introducing the packages#
Within this tutorial, we are going to make use of the following packages:
Regionmask is a python package that can be used to create masks of geographic regions for arbitrary longitude and latitude grids.
SciPy is a python package that provides algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems.
Statistics is a python module that provides functions for calculating mathematical statistics of numeric (Real-valued) data.
OS is a python module that provides a portable way of using operating system dependent functionality, such as manipulating paths.
GeoPandas is a Python package that extends the datatypes used by pandas to allow spatial operations on geometric types. We will use it to read shapefiles.
Matplotlib is a comprehensive Python package for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.
xarray is a Python package that allows for easy and efficient use of multi-dimensional arrays.
We first need to install the packages that are not available by default. Run the pip install commands in the cell below (uncomment them first if they are commented out).
#installing the python packages required
!pip install regionmask
!pip install statistics
Before running any Python script (in an offline or online mode) it is necessary to import a number of Python packages that can help you with performing the calculations.
# Import working modules
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import sys
import geopandas as gpd
import regionmask
from scipy.stats import percentileofscore
from statistics import NormalDist
import scipy.stats as stats
%matplotlib inline
Connect to google drive#
To be able to read the data from Google Drive, we need to mount our Drive to this notebook.
As you can see in the cell below, make sure that in your My Drive folder you have created a BigData folder, and within that folder a Week4_Data folder in which you store the files that are required to run this analysis.
Please go to the URL when it is prompted in the box underneath the following cell, and copy the authorization code into that box.
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)
sys.path.append("/content/gdrive/My Drive/BigData/Week4_Data")
data_path = os.path.join('/content/gdrive/My Drive/BigData','Week4_Data')
2. Exploring the required datasets#
The os module offers several helpers for working with paths: os.chdir() changes the working directory, os.getcwd() returns the current working directory (useful to verify a path change), and os.path.join() joins a directory path with folder or file names. In the next code cell we use os.path.join() to set the specific path to each of the data folders within the working directory, which should contain all the datasets used during the tutorial.
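As a small sketch (assuming your Drive is already mounted and the folders above exist), these helpers can be used as follows:
print(os.getcwd()) #show the current working directory
os.chdir('/content/gdrive/My Drive/BigData/Week4_Data') #change to the data folder
print(os.getcwd()) #verify that the working directory has changed
print(os.path.join(os.getcwd(), 'MSWEP_Monthly_prec')) #join the directory path with a data folder name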
However, before we continue you need to make sure that you have downloaded the MSWEP data and stored it in your Week4_Data
folder. You can download the data from Canvas through the download link within the module. Unzip the data locally, but make sure that you keep the folder structure (as shown below when reading the data).
# Name input file
weather_file = os.path.join(data_path,'MSWEP_Monthly_prec/*.nc')
world_admin_boundaries = os.path.join(data_path,'world-administrative-boundaries/world-administrative-boundaries.shp')
print("Precipitation data:", '\n', weather_file,'\n', "Administrative boundaries shapefile:", '\n', world_admin_boundaries)
Now that we have set the paths to the datasets, let us load the administrative boundaries shapefile using geopandas' gpd.read_file(), and in the process set the coordinate system to EPSG:4326 (the standard global coordinate reference system). We will use the same coordinate system throughout the tutorial.
Let’s start with loading the shapefile of the world boundaries.
data_Adminunits = gpd.read_file(world_admin_boundaries,crs="epsg:4326")
data_Adminunits.head() #represents the countries around the world and their geographical locations
data_Adminunits is a GeoDataFrame containing all polygons illustrating the national boundaries of the 256 countries in the world.
Let’s have a look at the administrative boundaries we just loaded using the GeoDataFrame's plot() method.
ax=data_Adminunits.plot(XXX) #add the alpha, linewidth, figsize, polygon colors and edgecolor
ax.set_title("XXX")
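One possible way to fill in the placeholders is sketched below (the values are only an illustration; any sensible styling works):
ax = data_Adminunits.plot(alpha=0.5, linewidth=0.5, figsize=(12,8), facecolor='lightgrey', edgecolor='black') #illustrative styling values
ax.set_title("World administrative boundaries") #illustrative title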
We have loaded the administrative shapefile. Now we can load the gridded precipitation dataset using the xarray
package through the open_mfdataset()
function. This function allows us to open multiple files as a single dataset.
In a folder we have hundreds of MSWEP reanalysis files downloaded from here. Each file contains a single timestep of the global reanalysis and is about 8.2 MB. The chunks parameter is very important: it defines how big the "pieces" of data are that are moved from disk to memory. With this value, the entire computation on a workstation with 2 GB of memory takes a couple of minutes. We want to load all the precipitation files for the years 1980-2021.
Now let’s load the gridded daily precipitation dataset.
data = xr.open_mfdataset(weather_file,chunks={"time":10})
data
As you can see, this xarray Dataset contains a single variable (precipitation),
which is stored as a dask array. This is the result of loading files with open_mfdataset. More information and examples of the interaction between dask and xarray can be found in their documentation (here and here).
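As an optional check, you can confirm that the precipitation variable is indeed lazily loaded, and force a single timestep into memory with compute():
print(type(data.precipitation.data)) #should report a dask array type
first_step = data.precipitation.isel(time=0).compute() #loads one timestep into memory
print(first_step.shape)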
Now we are ready for a bit of regionmask magic. The module can create a gridded mask with the function regionmask.Regions(), documented here. With this function we create an object that is able to mask gridded data.
sheds_mask_poly = regionmask.Regions(name = 'sheds_mask', numbers = list(range(0,256)), names = list(data_Adminunits.name),
abbrevs = list(data_Adminunits.color_code), outlines = list(data_Adminunits.geometry.values[i] for i in range(0,256)))
sheds_mask_poly
Calculating the mask of the region of interest#
Now we are ready to apply the mask to the gridded dataset data. We select only the first timestep to speed up the process. For the same reason it is better to further "zoom" into the Western European countries. We specify the names of the coordinates (the defaults are lat and lon). You can play around with the coordinates: use the bounding box tool here to obtain the correct coordinates. Once you have created a bounding box, you can select the values in 'CSV' format and copy-paste the coordinates into the cell below.
mask = sheds_mask_poly.mask(data.isel(time = 0).sel(lon = slice(XXX), lat = slice(XXX)),lon_name = 'lon', lat_name = 'lat') # change the lat and lon slice to the bounding box of the selected region
mask
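For reference, a bounding box roughly covering Western Europe could look like the (commented-out) sketch below. The coordinates are illustrative, and the order of the latitude slice has to match the order in which latitudes are stored in the dataset (ascending or descending):
#mask = sheds_mask_poly.mask(data.isel(time = 0).sel(lon = slice(-10, 20), lat = slice(35, 60)), lon_name = 'lon', lat_name = 'lat') #illustrative Western Europe box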
Plotting the created mask#
The computation takes a couple of minutes, but then the mask can be saved (for example as a NetCDF) for later use. Let's plot the mask we just created. You can see that the bounding box we used zooms into the European area.
plt.figure(figsize=(12,8))
ax = plt.axes()
mask.plot(ax = ax)
data_Adminunits.plot(XXX)
ax.set_title('XXX')
3. Extracting gridded and timeseries data of the area of interest#
For the selected country we can now aggregate the grid cells within the national borders to a timeseries, but first we will work with the gridded data. The procedure is rather simple: we will focus on a single region, and it is easy to extend it to multiple regions.
As a first step, we save the latitude and longitude vectors because we will use them later. Then, we select the mask points where the value is equal to the target value (the region code). In the numpy array sel_mask
all the values are nan except for the selected ones.
#Extracting the precipitation data for the selected region
region_name = "...." #Selected country within the bounding region
lat = mask.lat.values # the lat and lon values contained in the mask of the regions
lon = mask.lon.values
ID_REGION = data_Adminunits.index[data_Adminunits.name == region_name].values[0] #getting the index position of the selected country
sel_mask = mask.where(mask == ID_REGION).values
sel_mask
To speed up the process we want to crop the xarray Dataset, selecting the smallest box containing the entire mask. To do this, we store in id_lon and id_lat the coordinate points where the mask has at least one non-nan value.
id_lon = lon[np.where(~np.all(np.isnan(sel_mask), axis=0))]
id_lat = lat[np.where(~np.all(np.isnan(sel_mask), axis=1))]
The data
dataset is reduced by selecting only the coordinates containing the target region. The dataset is then loaded from the dask array using compute() and filtered using the mask. This procedure speeds up the computation and reduces the memory use for large datasets, since the where()
function is not really dask friendly.
monthly_prec = data.sel(lat = slice(id_lat[0], id_lat[-1]), lon = slice(id_lon[0], id_lon[-1])).compute().where(mask == ID_REGION)
monthly_prec
Now let's plot the gridded precipitation data for the target region contained in monthly_prec.
Action: Inspect the code, add the title name, and run the code to make the plot. Fill in the plot parameters in the code box below.
plt.figure(figsize=(12,8))
ax = plt.axes()
monthly_prec.precipitation.isel(time = 1).plot(ax = ax)
data_Adminunits.plot(ax = ax, alpha = 0.8, facecolor = 'none')
plt.title("XXX") #set the right size for the title
plt.show()
monthly_prec_timeseries=monthly_prec.precipitation.mean(dim=('lon','lat'))
df_prec_timeseries=monthly_prec_timeseries.to_dataframe()
Action: Inspect the code, change the label and title names, and run the code to make the plot. Fill in the plot parameters in the code box below.
fig = plt.figure(figsize=(30,10) )
plt.plot(df_prec_timeseries.index,df_prec_timeseries,"-o",color='b')
plt.ylabel('XXX',size=15) #change
plt.xlabel('XXX')#change
plt.title('XXX',size=25)#change
plt.show()
4. Calculating meteorological drought conditions using standardized drought indicators#
The required datasets are now loaded correctly into Python. From this point on we will start with the actual drought analysis. We will calculate meteorological drought, using the Standardized Precipitation Index (SPI), for your chosen accumulation period (i.e. between 1 and 12 months). Drought conditions are calculated for each calendar month (using all Januaries, all Februaries, etc.).
Let's revisit Python functions (introduced in the Numpy and Pandas exercise): they are defined with the def
keyword, followed by the function identifier (name), parentheses and a colon, as shown in the next cells.
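As a quick reminder, a simple (toy) function definition and call look like this:
def add_numbers(a, b):
    '''returns the sum of two numbers'''
    return a + b
add_numbers(2, 3) #returns 5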
Before starting with the SPI calculation we need to prepare the accumulated time-series first. Using the numpy
package's cumsum()
function, we calculate the cumulative precipitation per grid cell in a moving-sum window along the time axis.
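The moving-sum trick used in the function defined below can be illustrated on a small one-dimensional array (a toy example with a 3-month accumulation period):
toy_prec = np.array([10., 0., 5., 20., 15.]) #five months of (made-up) precipitation
w = 3 #accumulation period in months
cumul = np.cumsum(toy_prec, dtype=float) #running total: [10. 10. 15. 35. 50.]
cumul[w:] = cumul[w:] - cumul[:-w] #from index w onwards each value becomes the sum of the previous w months
cumul[:w-1] = np.nan #the first w-1 months have no complete accumulation window
print(cumul) #[nan nan 15. 25. 40.] -> the 3-month accumulations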
Defining the functions for calculating the Standardized Precipitation Index (SPI)#
def cumulative_values(y,w):
    '''calculates the accumulation periods for monthly precipitation per grid cell,
    where y is the gridded precipitation data (time, lat, lon) and w is the
    accumulation period in months'''
    def mov_sum(y,w):
        #moving sum of the 1D time series y over a window of w months
        cumul = np.cumsum(y, dtype=float)
        cumul[w:] = cumul[w:] - cumul[:-w]
        cumul[:w-1] = np.nan #the first w-1 months have no complete accumulation window
        cumul_dataseries = cumul
        return cumul_dataseries
    cumul_dataset = np.zeros((y.shape))
    rows_p3 = y.shape[1]
    cols_p3 = y.shape[2]
    for r in range(0,rows_p3):
        for c in range(0,cols_p3):
            cumul_data = mov_sum(y[:,r,c],w) #the moving sum already covers the whole time axis
            cumul_dataset[:,r,c] = cumul_data
    return cumul_dataset
Let's create a function to group the monthly gridded precipitation data by calendar month, i.e. all Januaries, all Februaries, etc. This will enable us to calculate SPI per month over the historic period.
To group the precipitation data we create an empty dataframe using pandas' DataFrame()
and a list of dates for the historic time period through date_range()
. The grouping of precipitation data is done using pd.DatetimeIndex(...).month,
and each group is stored in a dictionary.
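A small (toy) illustration of the grouping idea:
toy_dates = pd.date_range(start='2002-01-01', end='2002-12-31', freq='M') #twelve month-end dates
print(pd.DatetimeIndex(toy_dates).month) #month numbers 1-12; a boolean mask like (month == 1) selects all Januaries along the time axis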
def grouping_prec_months(y,w):
    '''groups the precipitation data into all the months based on the dates
    where y is the gridded precipitation data and w is the accumulation value'''
    month_dates = pd.date_range(start='2002-01-01', end='2022-06-01', freq='M')
    df = pd.DataFrame(index = month_dates)
    df['month'] = pd.DatetimeIndex(df.index).month
    monthly_data = cumulative_values(y,w) #calls the cumulative value function to calculate the accumulation periods of the data
    Month_data_dict = {} #initializes an empty dictionary
    #grouping the months
    Month_data_dict['Jan'] = monthly_data[df.month==1,:,:]
    Month_data_dict['Feb'] = monthly_data[df.month==2,:,:]
    Month_data_dict['Mar'] = monthly_data[df.month==3,:,:]
    Month_data_dict['Apr'] = monthly_data[df.month==4,:,:]
    Month_data_dict['May'] = monthly_data[df.month==5,:,:]
    Month_data_dict['Jun'] = monthly_data[df.month==6,:,:]
    Month_data_dict['Jul'] = monthly_data[df.month==7,:,:]
    Month_data_dict['Aug'] = monthly_data[df.month==8,:,:]
    Month_data_dict['Sep'] = monthly_data[df.month==9,:,:]
    Month_data_dict['Oct'] = monthly_data[df.month==10,:,:]
    Month_data_dict['Nov'] = monthly_data[df.month==11,:,:]
    Month_data_dict['Dec'] = monthly_data[df.month==12,:,:]
    return Month_data_dict
After grouping into months, we create a function using scipy.stats'
percentileofscore(),
documented here. This allows us to compute the percentile rank of each monthly value relative to a list of values.
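To get a feeling for percentileofscore(), here is a tiny example with made-up values:
toy_sample = [10, 20, 30, 40, 50] #toy record of accumulated precipitation values
print(percentileofscore(toy_sample, 30, kind='mean')/100) #0.5 -> the value 30 sits in the middle of the record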
def percentiles(y,w):
    '''computes the percentile rank of each monthly value relative to the entire 20 years of values. y is the gridded
    precipitation data and w is the accumulation value in months'''
    mnths = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
    perc_dictionary = {} #creates an empty dictionary
    Month_data_dict = grouping_prec_months(y,w) #calls the grouping by months function from the previous cell
    for m in mnths:
        output = np.zeros(Month_data_dict[m].shape)
        rows = Month_data_dict[m].shape[1]
        cols = Month_data_dict[m].shape[2]
        time = Month_data_dict[m].shape[0]
        for i in range(0,rows):
            for j in range(0,cols):
                for z in range(0,time):
                    result = percentileofscore(Month_data_dict[m][:,i,j], Month_data_dict[m][z,i,j], kind='mean')/100
                    output[z,i,j] = result
        perc_dictionary[m] = output
    return perc_dictionary, Month_data_dict
SPI calculation for each month over the historic time-period#
SPI is used to characterize meteorological drought on a range of timescales. On short timescales, the SPI is closely related to soil moisture, while at longer timescales, the SPI can be related to groundwater and reservoir storage. The SPI can be compared across regions with markedly different climates. It quantifies observed precipitation as a standardized departure from a selected probability distribution function that models raw precipitation data. The raw precipitation data are typically fitted to a gamma or Pearson Type III distribution, and then transformed to a normal distribution so the mean of SPI is zero. For this exercise, we will not fit the raw precipitation through a distribution to save time and also because we are working with historic data.
We will now calculate SPI for all months throughout the year (m = 1:12). By normalizing the precipitation percentiles, drought conditions become comparable between different locations and between different months. The procedure results in time series whose values vary from roughly -3 (extremely dry) to roughly +3 (extremely wet compared to the long-term mean conditions), providing information on meteorological drought conditions. The frequency, average duration, and severity of drought events can subsequently be estimated by looking at the 'clustered nature' of individual drought months.
Further information on SPI can be found here.
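The core of the standardization step is the transformation of a percentile rank into a standard-normal z-score with the inverse cumulative distribution function, which the function below does with NormalDist().inv_cdf(). A tiny illustration:
print(NormalDist().inv_cdf(0.5)) #0.0 -> a median month gives an SPI of about 0
print(NormalDist().inv_cdf(0.05)) #about -1.64 -> an unusually dry month
print(NormalDist().inv_cdf(0.95)) #about +1.64 -> an unusually wet month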
Take a careful look at how we do this by inspecting the code boxes below and run them to eventually plot the SPI values for all months. Where needed, change the names of the titles and axes.
def standardized_index(y,w,a):
    '''function to calculate SPI using percentile ranking and normalizing the rank using the normal distribution'''
    month_dates = pd.date_range(start='2002-01-01', end='2022-06-01', freq='M')
    df = pd.DataFrame(index = month_dates)
    df['month'] = pd.DatetimeIndex(df.index).month
    perc_dictionary, Month_data_dict = percentiles(y,w)
    index_grouped = {}
    mnths = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
    for mon in mnths:
        output_sp = np.zeros(perc_dictionary[mon].shape)
        rows_p = perc_dictionary[mon].shape[1]
        cols_p = perc_dictionary[mon].shape[2]
        time_p = perc_dictionary[mon].shape[0]
        for r in range(0,rows_p):
            for c in range(0,cols_p):
                zeros = len([p for p in Month_data_dict[mon][:,r,c] if p==0]) #number of zero-precipitation months in this cell
                count = len([p for p in Month_data_dict[mon][:,r,c]]) #total number of months in the record
                value_0 = NormalDist().inv_cdf((zeros+1)/(2*(count+1))) #probability assigned to zeros using the Stagge et al. formula
                for t in range(1,time_p):
                    #normalizing the data: transform the percentile rank into a standard-normal z-score
                    if Month_data_dict[mon][t,r,c]==0:
                        norm_dist = value_0
                    else:
                        norm_dist = NormalDist().inv_cdf(perc_dictionary[mon][t,r,c])
                    output_sp[t,r,c] = norm_dist
        index_grouped[mon] = output_sp
    #merging and converting the numpy arrays into a single xarray dataset with coordinates by applying the merge function
    #and appending the data arrays to an empty list
    data_arrays = []
    for i, key in enumerate(index_grouped):
        da = xr.DataArray(index_grouped[key], coords=[df[df.month==i+1].index, a.lat.values, a.lon.values],
                          dims=["time","latitude[degrees_north]","longitude[degrees_east]"], name='spi_values')
        data_arrays.append(da)
    index = xr.merge(data_arrays)
    return index
SPI can be calculated for a range of timesteps (1,3,6,12 months and so on). Select the accumulation period/ lag time/timestep to use with your group (i.e. between 1-12 months). The use of different timescales allows the effect of precipitation deficits on various water resources (groundwater, soil moisture, reservoir storage and streamflow) to be assessed.
Action: Insert lag time in the code-box below and run it.
#SPI based on empirical percentiles, with zero-precipitation months handled using the Stagge et al. formula (no distribution fitted)
spi_1 = standardized_index(monthly_prec.precipitation[:].values,1, monthly_prec) #choose the accumulation value to determine which SPI you calculate
spi_1
Now we convert the gridded SPI data to a timeseries and plot the SPI values for the entire historic period, both spatially and as a timeseries.
#Timeseries plot
#converting the gridded data to timeseries
indice_timeseries=spi_1.spi_values.mean(dim=("latitude[degrees_north]", "longitude[degrees_east]"))
df_indice_timeseries =indice_timeseries.to_dataframe()
print(df_indice_timeseries)
Visualization of SPI#
Now let's visualize the SPI. Within xarray's .plot()
function, we can create a multi-plot figure using the col
and col_wrap
arguments.
spi_1['spi_values'].sel(time='XXXX').plot(cmap='RdBu', col='time', col_wrap=4, vmin =-2.5, vmax=2.5) #change the year
It would also be interesting to plot spi_1 as a timeseries. In the cell below, we plot the calculated SPI-1 values over time. Values below zero indicate drought conditions and values above zero indicate wet periods.
fig = plt.figure(figsize=(30,10) )
plt.plot(df_indice_timeseries.index,df_indice_timeseries,"-o",color='b')
plt.axhline(y=0, color='r', linestyle='--') # all the values below zero indicate drought conditions. Can be changed to indicate drought threshold
plt.ylabel('XXX',size=15)
plt.xlabel('XXX')
plt.title('XXX',size=25)
plt.show()
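As a final, optional step, the sketch below shows one way to identify and count drought months from the timeseries, assuming a drought threshold of SPI < -1 (the threshold value is a choice, not prescribed by the tutorial):
drought_threshold = -1 #threshold choice; -1 is a commonly used cut-off for moderate drought
drought_months = df_indice_timeseries[df_indice_timeseries['spi_values'] < drought_threshold] #select all months below the threshold
print("Number of drought months:", len(drought_months))
print(drought_months.head())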