Tutorial 2: Drought Detection#

In this tutorial, we will learn how to detect drought in different regions around the world and monitor drought conditions using standardized drought indices. We will explore how to visualize drought spatial maps and timeseries. In this practical, you will learn to apply a standardized method to calculate drought hazard for your chosen case study area over the historic period.

We will use the Multi-Source Weighted-Ensemble Precipitation (MSWEP) data to analyse droughts in the different regions. MSWEP is a global precipitation product with a daily 0.1° resolution available from 1979 to near real-time.

Important before we start#


Make sure that you save this file before you continue, else you will lose everything. To do so, go to File and click on Save a copy in Drive!

Now, rename the file to Week4_Tutorial2.ipynb. You can do so by clicking on the file name at the top of this screen.

Learning Objectives#


  • Learn to apply a standardized method to represent meteorological drought hazard for a specific location;

  • Understand and interpret large-scale raster data;

  • Learn to visualize and interpret the results from drought hazard calculation methods.

Tutorial Outline


1. Introducing the packages#


Within this tutorial, we are going to make use of the following packages:

Regionmask is a Python package that can be used to create masks of geographic regions for arbitrary longitude and latitude grids.

SciPy is a Python package that provides algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems.

Statistics is a Python module that provides functions for calculating mathematical statistics of numeric (real-valued) data.

OS is a Python module that provides a portable way of using operating system dependent functionality, e.g. manipulating paths.

GeoPandas is a Python package that extends the datatypes used by pandas to allow spatial operations on geometric types, such as the polygons stored in shapefiles.

Matplotlib is a comprehensive Python package for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.

xarray is a Python package that allows for easy and efficient use of multi-dimensional arrays.

We first need to install the packages that are not available by default. Run the cell below to pip install them.

#installing the python packages required
!pip install regionmask
!pip install statistics

Before running any Python script (in an offline or online mode) it is necessary to import a number of Python packages that help you perform the calculations.

# Import working modules
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import sys
import geopandas as gpd
import regionmask
from scipy.stats import percentileofscore
from statistics import NormalDist
import scipy.stats as stats

%matplotlib inline

Connect to Google Drive#


To be able to read the data from Google Drive, we need to mount our Drive to this notebook.

As you can see in the cell below, make sure that in your My Drive folder you have created a BigData folder, and within that folder a Week4_Data folder in which you store the files required to run this analysis.

Please go to the URL when it is prompted in the box underneath the following cell, and copy the authorization code into that box.

from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

sys.path.append("/content/gdrive/My Drive/BigData/Week4_Data")

data_path = os.path.join('/content/gdrive/My Drive/BigData','Week4_Data')

2. Exploring the required datasets#


In the next code cell, we will set the paths to the data folders. This directory should contain all the datasets used during the tutorial.

To check the current working directory, the os.getcwd() method can be used. We will use os.path.join() to set the specific path to each of the data folders within the working directory: it joins the main directory path with the data folder path.
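As a quick, optional illustration of these path helpers (the folder names below are just examples matching the structure used in this tutorial; adjust them to your own Drive structure):

# optional illustration of the path helpers described above
print(os.getcwd())  # prints the current working directory
example_path = os.path.join('/content/gdrive/My Drive/BigData', 'Week4_Data')
print(example_path) # -> /content/gdrive/My Drive/BigData/Week4_Data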

However, before we continue you need to make sure that you have downloaded the MSWEP data and stored it in your Week4_Data folder. You can download the data on Canvas through the download link within the module. Unzip the data locally, but make sure that you keep the folder structure (as you may see below when reading the data).

# Name input file
weather_file  = os.path.join(data_path,'MSWEP_Monthly_prec/*.nc')

world_admin_boundaries = os.path.join(data_path,'world-administrative-boundaries/world-administrative-boundaries.shp')
print("Precipitation data:", '\n', weather_file,'\n', "Administrative boundaries shapefile:", '\n', world_admin_boundaries)

Now that we have set the paths to the datasets, let us load the administrative boundaries shapefile using geopandas' gpd.read_file() and in the process set the coordinate system to EPSG:4326 (the standard global coordinate reference system). We will use the same coordinate system throughout the tutorial.

Let’s start with loading the shapefile of the world boundaries.

data_Adminunits = gpd.read_file(world_admin_boundaries,crs="epsg:4326") 
data_Adminunits.head() #represents the countries around the world and their geographical locations

data_Adminunits is a GeoDataFrame containing the polygons of the national boundaries of the 256 countries in the world.

Let's have a look at the administrative boundaries we just loaded using the geopandas GeoDataFrame.plot() method.

Question 1: Upload a figure of the global administrative boundaries. Include the linewidth, figure size, the y and x labels, color (indicate none), and edgecolor. Don't forget to add the figure title. Additionally, upload a figure outlining the zoomed area of the western European countries.
ax=data_Adminunits.plot(XXX) #add the alpha, linewidth, figsize, polygon colors and edgecolor
ax.set_title("XXX")

We have loaded the administrative shapefile. Now we can load the gridded precipitation dataset using the xarray package through the open_mfdataset() function. This function allows us to open multiple files as a single dataset.

The data folder contains hundreds of monthly MSWEP reanalysis files. Each file contains a single timestep of the global reanalysis and is about 8.2 MB. The chunks parameter is very important: it defines how big the "pieces" of data are that are moved from disk to memory. With this value, the entire computation on a workstation with 2 GB of RAM takes a couple of minutes. We want to load all the precipitation files for the historic period.

Now let’s load the gridded daily precipitation dataset.

data = xr.open_mfdataset(weather_file,chunks={"time":10})    
data

As you can see, this xarray Dataset contains a single variable (precipitation), which is stored as a dask.array. This is the result of loading files with open_mfdataset. More information and examples of the interaction between dask and xarray can be found in their documentation (here and here).
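To get a feel for what the dask-backed variable does, the optional sketch below (not needed for the rest of the tutorial) shows that operations stay "lazy" until you explicitly ask for the result with compute():

# optional sketch: dask-backed xarray operations are lazy until compute() is called
lazy_mean = data.precipitation.isel(time = 0).mean()  # builds a task graph, nothing is read from disk yet
print(lazy_mean)                                      # shows a dask array, not a number
print(lazy_mean.compute().values)                     # compute() actually loads the data and evaluates the mean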

Now we are ready for a bit of regionmask magic. The module can create a gridded mask with the function regionmask.Regions(), documented here. With this function we create an object that is able to mask gridded data.

sheds_mask_poly = regionmask.Regions(name = 'sheds_mask', numbers = list(range(0,256)), names = list(data_Adminunits.name),
                                     abbrevs = list(data_Adminunits.color_code), outlines = list(data_Adminunits.geometry.values[i] for i in range(0,256)))
sheds_mask_poly

Calculating the mask of the region of interest#

Now we are ready to apply the mask to the gridded dataset data. We select only the first timestep to speed up the process. For the same reason it is better to further "zoom" into the Western European countries. We specify the names of the coordinates (the defaults are lat and lon). You can play around with the coordinates. Use the bounding box tool here to obtain the correct coordinates. Once you have created a bounding box, you can select the values to be presented in 'CSV' format, and copy-paste the coordinates into the cell below.

mask = sheds_mask_poly.mask(data.isel(time = 0).sel(lon = slice(XXX), lat = slice(XXX)),lon_name = 'lon', lat_name = 'lat') # change the lat and lon slice to the bounding box of the selected region
mask
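For orientation only, a filled-in version of the call above could look roughly like the sketch below. The numbers are illustrative Western European bounds, not necessarily your region, and the order of the values inside each slice() must match the ordering of the lon and lat coordinates in the MSWEP files:

# illustrative example only: replace the numbers with your own bounding box
# if your latitude coordinate runs from north to south, swap the order of the lat slice
mask_example = sheds_mask_poly.mask(data.isel(time = 0).sel(lon = slice(-10.0, 20.0), lat = slice(35.0, 60.0)),
                                    lon_name = 'lon', lat_name = 'lat')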

Plotting the created mask#

The computation takes a couple of minutes, but then the mask can be saved (for example as a NetCDF file) for later use. Let's plot the mask we just created. You can see that the bounding box we used zooms into the European area.

plt.figure(figsize=(12,8))
ax = plt.axes()
mask.plot(ax = ax)
data_Adminunits.plot(XXX)

ax.set_title('XXX')

3. Extracting gridded and timeseries data of the area of interest#


For the selected country we can now aggregate the grid cells within the national borders into a timeseries, but we will first work with the gridded data. The procedure is rather simple: we focus on a single region, and it is easy to extend it to multiple regions.

As a first step, we save the latitude and longitude vectors because we will use them later. Then, we select the mask points where the value is equal to the target value (the region code). In the numpy array sel_mask all values are NaN except for the selected ones.

#Extracting the precipitation data for the selected region
region_name = "...."   #Selected country within the bounding region

lat = mask.lat.values          # the lat and lon values contained in the mask of the regions
lon = mask.lon.values

ID_REGION = data_Adminunits.index[data_Adminunits.name == region_name].values[0] #getting the index position of the selected country
   
sel_mask = mask.where(mask == ID_REGION).values
sel_mask
Question 2: Which country did you select as a case study region and why?

To speed up the process we crop the xarray Dataset by selecting the smallest box containing the entire mask. To do this, we store in id_lon and id_lat the coordinate points where the mask has at least one non-NaN value.

id_lon = lon[np.where(~np.all(np.isnan(sel_mask), axis=0))]
id_lat = lat[np.where(~np.all(np.isnan(sel_mask), axis=1))]
Question 3: Describe what the function `np.where()` does. Are there other functions that can replace it?

The data Dataset is reduced by selecting only the coordinates containing the target region. The dataset is then loaded from the dask array using compute() and filtered using the mask. This procedure speeds up the computation and reduces memory use for large datasets, since the where() function is not really dask-friendly.

monthly_prec = data.sel(lat = slice(id_lat[0], id_lat[-1]), lon = slice(id_lon[0], id_lon[-1])).compute().where(mask == ID_REGION)
monthly_prec

Now let's plot the gridded precipitation data for the target region contained in monthly_prec.

Action: Inspect the code, add the title name and run the code to make the plot. Fill in the plot parameters in the code box below.

plt.figure(figsize=(12,8))
ax = plt.axes()
monthly_prec.precipitation.isel(time = 1).plot(ax = ax)
data_Adminunits.plot(ax = ax, alpha = 0.8, facecolor = 'none')
plt.title("XXX") #set the right size for the title
plt.show()
# aggregate the gridded precipitation over the region into a single timeseries
monthly_prec_timeseries = monthly_prec.precipitation.mean(dim=('lon','lat'))
df_prec_timeseries = monthly_prec_timeseries.to_dataframe()

Action: Inspect the code, change the label and title names and run the code to make the plot. Fill in the plot parameters in the code box below.

fig = plt.figure(figsize=(30,10) )
plt.plot(df_prec_timeseries.index,df_prec_timeseries,"-o",color='b')
plt.ylabel('XXX',size=15) #change
plt.xlabel('XXX')#change
plt.title('XXX',size=25)#change 
plt.show()
Question 4: Briefly explain what you see in both the precipitation timeseries plot and the precipitation spatial plot above.

4. Calculating meteorological drought conditions using standardized drought indicators#


The required datasets are now loaded correctly into Python. From this point on we will start with the actual drought analysis. We will calculate meteorological drought (the Standardized Precipitation Index, SPI) for your chosen accumulation period (i.e. between 1 and 12 months). Drought conditions are calculated for each calendar month (using all Januarys, all Februarys, etc.).

Let's revisit Python functions (introduced in the Numpy and Pandas exercise): they are defined with the def keyword, followed by the function identifier (name), parentheses and a colon, as shown in the next cells.

Before starting with the SPI calculation we first need to prepare the accumulated time series. Using the numpy cumsum() function we calculate the cumulative precipitation per grid cell in a moving-sum window along the time axis.
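The moving-sum trick is easiest to see on a toy series. The short sketch below (made-up numbers, a 3-month window) mirrors the logic used inside the function defined next:

# toy example of the moving-sum trick with w = 3 (made-up numbers)
y = np.array([10., 0., 5., 20., 0., 15.])
cumul = np.cumsum(y)               # [10, 10, 15, 35, 35, 50]
cumul[3:] = cumul[3:] - cumul[:-3] # from index w-1 onwards each value is a 3-month sum
print(cumul)                       # [10, 10, 15, 25, 25, 35]; the first w-1 values are incomplete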

Defining the functions for calculating the Standardized Precipitation Index (SPI)#

def cumulative_values(y,w):
    '''calculates the accumulation periods for monthly precipitation per grid cell,
    where y is the gridded precipitation data and w is the accumulation value in months'''
    def mov_sum(y,w):
        '''moving sum of a single grid cell's time series over a window of w months'''
        cumul = np.cumsum(y, dtype=float)
        cumul[w:] = cumul[w:] - cumul[:-w]
        cumul[:w-1] = np.nan  # the first w-1 months have no complete accumulation window
        return cumul

    cumul_dataset = np.zeros(y.shape)
    rows_p3 = y.shape[1]
    cols_p3 = y.shape[2]
    for r in range(0,rows_p3):
        for c in range(0,cols_p3):
            # mov_sum returns the full accumulated series for this grid cell at once
            cumul_dataset[:,r,c] = mov_sum(y[:,r,c],w)
    return cumul_dataset

Let's create a function to group the monthly gridded precipitation data by calendar month, i.e. all Januarys, all Februarys, etc. This will enable us to calculate SPI per month over the historic period.

To group the precipitation data we create an empty dataframe using pandas' DataFrame() and a list of dates for the historic time period through date_range(). The grouping of precipitation data is done using the month attribute of DatetimeIndex(), and each group is stored in a dictionary.
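As a small, self-contained illustration of this grouping idea (made-up dates and values, not the MSWEP data):

# toy illustration of grouping a monthly series by calendar month
dates = pd.date_range(start='2002-01-01', end='2004-12-31', freq='M')
df_demo = pd.DataFrame(index=dates)
df_demo['month'] = pd.DatetimeIndex(df_demo.index).month
values = np.arange(len(dates))          # stand-in for one grid cell's monthly time series
print(values[df_demo.month == 1])       # picks out the three January values: [ 0 12 24]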

def grouping_prec_months (y,w):
    '''groups the precipitation data into all the months based on the dates
    where y is the gridded precipitation data and w is the accumulation value'''

    month_dates = pd.date_range(start='2002-01-01', end='2022-06-01', freq='M')
    df = pd.DataFrame(index=month_dates)
    df['month'] = pd.DatetimeIndex(df.index).month
    monthly_data = cumulative_values(y,w) # calls the cumulative_values function to calculate the accumulation periods of the data
    Month_data_dict = {} # initializes an empty dictionary
    # grouping the data by calendar month
    Month_data_dict['Jan'] = monthly_data[df.month==1,:,:]
    Month_data_dict['Feb'] = monthly_data[df.month==2,:,:]
    Month_data_dict['Mar'] = monthly_data[df.month==3,:,:]
    Month_data_dict['Apr'] = monthly_data[df.month==4,:,:]
    Month_data_dict['May'] = monthly_data[df.month==5,:,:]
    Month_data_dict['Jun'] = monthly_data[df.month==6,:,:]
    Month_data_dict['Jul'] = monthly_data[df.month==7,:,:]
    Month_data_dict['Aug'] = monthly_data[df.month==8,:,:]
    Month_data_dict['Sep'] = monthly_data[df.month==9,:,:]
    Month_data_dict['Oct'] = monthly_data[df.month==10,:,:]
    Month_data_dict['Nov'] = monthly_data[df.month==11,:,:]
    Month_data_dict['Dec'] = monthly_data[df.month==12,:,:]    

    return Month_data_dict

After grouping into months, we create a function using scipy.stats' percentileofscore(), documented here. This allows us to compute the percentile rank of each monthly value relative to a list of values.
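A quick toy example of what percentileofscore() returns (illustrative numbers only):

# toy example: the percentile rank of 20 within a small sample
from scipy.stats import percentileofscore
sample = [5., 10., 20., 40., 80.]
print(percentileofscore(sample, 20., kind='mean'))  # 50.0: the value sits in the middle of the sample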

Question 5: What other functions can be used instead of scipy.stats' `percentileofscore()`? Describe these other functions.
def percentiles(y,w):
    '''computes the percentile rank of each monthly value relative to the entire 20 years of values. y is the gridded 
    precipitation data and w is the accumulation value in months'''
    mnths= ['Jan', 'Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
    perc_dictionary = {} #creates an empty dictionary
    Month_data_dict = grouping_prec_months (y,w) # calls the grouping by months function from previous cell
    for m in mnths:
        output = np.zeros(Month_data_dict[m].shape)
        rows = Month_data_dict[m].shape[1]
        cols = Month_data_dict[m].shape[2]
        time = Month_data_dict[m].shape[0]
        for i in range(0,rows):
            for j in range (0,cols):
                for z in range (0,time):                
                    result =percentileofscore(Month_data_dict[m][:,i,j],Month_data_dict[m][z,i,j], kind = 'mean')/100                
                    output[z,i,j] =result
        perc_dictionary[m] = output
    return perc_dictionary, Month_data_dict

SPI calculation for each month over the historic time-period#

SPI is used to characterize meteorological drought on a range of timescales. On short timescales, the SPI is closely related to soil moisture, while at longer timescales, the SPI can be related to groundwater and reservoir storage. The SPI can be compared across regions with markedly different climates. It quantifies observed precipitation as a standardized departure from a selected probability distribution function that models raw precipitation data. The raw precipitation data are typically fitted to a gamma or Pearson Type III distribution, and then transformed to a normal distribution so the mean of SPI is zero. For this exercise, we will not fit the raw precipitation through a distribution to save time and also because we are working with historic data.

We will now calculate the SPI for all months throughout the year (m = 1:12). By normalizing the precipitation percentiles afterwards, drought conditions become comparable between different locations and for different months. The procedure results in time series whose values vary from ~ -3 (extremely dry) to ~ +3 (extremely wet compared to the long-term average conditions), providing information on meteorological drought conditions. The frequency, average duration, and severity of drought events are subsequently estimated by looking at the 'clustered nature' of individual drought months.
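The core of this normalization step is the inverse of the standard normal cumulative distribution function: a percentile rank (between 0 and 1) is mapped onto an SPI value. A minimal sketch with illustrative percentile ranks:

# minimal sketch of the percentile-to-SPI transformation (illustrative ranks)
from statistics import NormalDist
print(NormalDist().inv_cdf(0.05))  # ~ -1.64, an unusually dry month
print(NormalDist().inv_cdf(0.50))  #   0.00, a near-normal month
print(NormalDist().inv_cdf(0.95))  # ~ +1.64, an unusually wet month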

Further information on SPI can be found here.

Take a careful look at how we do this by inspecting the code boxes and run the code in the code boxes below to eventually plot the SPI values for all months. Where needed, change the names of titles and axes.

def standardized_index(y,w,a):
    '''function to calculate SPI using percentile ranking and normalizing the rank using normal distribution'''
    month_dates = pd.date_range(start='2002-01-01', end='2022-06-01', freq='M')
    df = pd.DataFrame(index=month_dates)
    df['month'] = pd.DatetimeIndex(df.index).month
    perc_dictionary, Month_data_dict = percentiles(y,w)
    index_grouped = {} 
    mnths = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
    for mon in mnths:
        output_sp = np.zeros(perc_dictionary[mon].shape)
        rows_p = perc_dictionary[mon].shape[1]
        cols_p = perc_dictionary[mon].shape[2]
        time_p = perc_dictionary[mon].shape[0]
        for r in range(0,rows_p):
            for c in range (0,cols_p):
                for t in range (1,time_p): # the first timestep is skipped: it has no complete accumulation window when w > 1
                    zeros = len([p for p in Month_data_dict[mon][:,r,c] if p==0])
                    count = len([p for p in Month_data_dict[mon][:,r,c]])
                    value_0 = NormalDist().inv_cdf((zeros+1)/(2*(count+1))) # probability assigned to zero-precipitation months, following the Stagge et al. formula
                    #normalizing the data 
                    if Month_data_dict[mon][t,r,c]==0:
                        norm_dist = value_0
                    else:
                        norm_dist= NormalDist().inv_cdf(perc_dictionary[mon][t,r,c])
                    output_sp[t,r,c] =norm_dist
        index_grouped[mon] = output_sp
        
        #merging and converting the numpy arrays into a single xarray dataset with coordinates by applying the merge function
        #and appending the data to an empty list
    data_arrays=[]
    
    for i, key in enumerate(index_grouped):
        da = xr.DataArray(index_grouped[key],coords=[df[df.month==i+1].index, a.lat.values, a.lon.values],dims=["time","latitude[degrees_north]", "longitude[degrees_east]"], name = 'spi_values')
        data_arrays.append(da)    
    index = xr.merge(data_arrays)  
    
    return index
Question 6: What do the xarray `merge()` and Python list `append()` methods do and return? Describe.

SPI can be calculated for a range of timesteps (1, 3, 6, 12 months and so on). Select the accumulation period / lag time / timestep to use with your group (i.e. between 1 and 12 months). The use of different timescales allows the effect of precipitation deficits on various water resources (groundwater, soil moisture, reservoir storage and streamflow) to be assessed.

Action: Insert lag time in the code-box below and run it.

# SPI considering zeros using the Stagge et al. approach (empirical percentile ranks, no distribution fitting)

spi_1 = standardized_index(monthly_prec.precipitation[:].values,1, monthly_prec)   #choose the accumulation value to determine which SPI you calculate
spi_1

Now we convert the gridded SPI data to a timeseries and plot the SPI values for the entire historic period, both spatially and as a timeseries.

#Timeseries plot
#converting the gridded data to timeseries
indice_timeseries=spi_1.spi_values.mean(dim=("latitude[degrees_north]", "longitude[degrees_east]"))
df_indice_timeseries =indice_timeseries.to_dataframe()
print(df_indice_timeseries)

Visualization of SPI#

Now let's visualize the SPI. Within xarray's .plot() function, we can create a multi-panel figure using the col and col_wrap arguments.

Question 7: Select a year with extreme drought occurrence within the selected study region and plot the SPI values. Include the plots in Canvas.
spi_1['spi_values'].sel(time='XXXX').plot(cmap='RdBu', col='time', col_wrap=4, vmin =-2.5, vmax=2.5) #change the year
Question 8: Explain the drought spatial plots above. How severe was the drought in each of the months within the regions of the selected country?

It would also be interesting to plot spi_1 as a timeseries. In the cell below, we plot the calculated SPI-1 values over time. A value below zero indicates drought conditions, and a value above zero indicates wet conditions.

fig = plt.figure(figsize=(30,10) )
plt.plot(df_indice_timeseries.index,df_indice_timeseries,"-o",color='b')
plt.axhline(y=0, color='r', linestyle='--') # all the values below zero indicate drought conditions. Can be changed to indicate drought threshold
plt.ylabel('XXX',size=15)
plt.xlabel('XXX')
plt.title('XXX',size=25)
plt.show()
Question 9: Explain the drought pattern in the SPI values timeseries above for the selected case study region.
Question 10: Now that we have calculated SPI-1 above, change the accumulation number from 1 and calculate SPI-6 and SPI-12. What differences do you see in the drought characteristics and what does each represent?