{"cells":[{"cell_type":"markdown","id":"f5a5d2f4","metadata":{"id":"f5a5d2f4"},"source":["# Tutorial 2: Drought Detection\n","In this tutorial, we will learn how to detect drought in different regions around the world and monitor whether there is a drought or not using standardized drought indices. We will explore how to visual the drought spatial maps and timeseries. In this practical, you will learn to apply standardized method to calculate drought hazard for your chosen case study area, under historic period.\n","\n","We will use the Multi-Source Weighted-Ensemble Precipitation (MSWEP) data to analyse droughts in the different regions. MSWEP is a global precipitation product with a daily 0.1° resolution available from 1979 to near real-time.\n","\n","## Important before we start\n","<hr>\n","Make sure that you save this file before you continue, else you will lose everything. To do so, go to Bestand/File and click on Een kopie opslaan in Drive/Save a Copy on Drive!\n","\n","Now, rename the file into Week4_Tutorial2.ipynb. You can do so by clicking on the name in the top of this screen."]},{"cell_type":"markdown","id":"845ce454","metadata":{"id":"845ce454"},"source":["## Learning Objectives\n","<hr>"]},{"cell_type":"markdown","id":"4bf95120","metadata":{"id":"4bf95120"},"source":["- Learn to apply a standardized method to represent meteorological drought hazard for a specific location;\n","- Understand and interpret large-scale raster data;\n","- Learn to visualize and interpret the results from drought hazard calculation methods."]},{"cell_type":"markdown","id":"f65887c3-7f73-4584-a7cf-04d03c67b3fb","metadata":{"id":"f65887c3-7f73-4584-a7cf-04d03c67b3fb"},"source":["<h2>Tutorial Outline<span class=\"tocSkip\"></span></h2>\n","<hr>\n","<div class=\"toc\"><ul class=\"toc-item\">\n","<li><span><a href=\"#introducing-the-packages\" data-toc-modified-id=\"1.-Introducing-the-packages-2\">1. Introducing the packages</a></span></li>\n","<li><span><a href=\"#exploring-the-required-datasets\" data-toc-modified-id=\"2.-Exploring-datasets-3\">2. Exploring the required datasets</a></span></li>\n","<li><span><a href=\"#extracting-gridded-and-timeseries-data-of-the-area-of-interest\" data-toc-modified-id=\"3.-Extracting-data-area-interest-4\">3. Extracting gridded and timeseries data of the area of interest</a></span></li>\n","<li><span><a href=\"#calculating-meteorological-drought-conditions-using-standardized-drought-indicators\" data-toc-modified-id=\"4.-Calculating-drought-conditions\">4.-Calculating meteorological drought conditions using standardized drought indicators</a></span></li>\n","</ul></div>"]},{"cell_type":"markdown","id":"1fc471e4","metadata":{"id":"1fc471e4"},"source":["## 1. Introducing the packages\n","<hr>"]},{"cell_type":"markdown","id":"4bd5461c","metadata":{"id":"4bd5461c"},"source":["Within this tutorial, we are going to make use of the following packages: \n","\n","[**Regionmask**](https://regionmask.readthedocs.io/en/stable/) is a python package that can be used to create masks of geographic regions for arbitrary longitude and latitude grids.\n","\n","[**SciPy**](https://scipy.org/)is a python package that provides algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems.\n","\n","[**Statistics**](https://docs.python.org/3/library/statistics.html) is a python module that provides functions for calculating mathematical statistics of numeric (Real-valued) data.\n","\n","[**OS**](https://docs.python.org/3/library/os.html) is a python module that provides a portable way of using operating system dependent functionality i.e. manipulating paths\n","\n","[**GeoPandas**](https://geopandas.org/) is a Python packagee that extends the datatypes used by pandas to allow spatial operations on geometric types. It opens shapefiles\n","\n","[**Matplotlib**](https://matplotlib.org/) is a comprehensive Python package for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.\n","\n","[**xarray**](https://docs.xarray.dev/) is a Python package that allows for easy and efficient use of multi-dimensional arrays.\n","\n","*We will first need to install these packages in the cell below. Uncomment them to make sure we can conda or pip install them*"]},{"cell_type":"code","execution_count":null,"id":"9ce65e9f","metadata":{"id":"9ce65e9f"},"outputs":[],"source":["#installing the python packages required\n","!pip install regionmask\n","!pip install statistics"]},{"cell_type":"markdown","id":"4ca94583-f2de-4771-8f76-0f5ebe9a1f2f","metadata":{"id":"4ca94583-f2de-4771-8f76-0f5ebe9a1f2f"},"source":["Before running any Python script (in an offline or online modus) it is necessary to import a number of Python packages that can help you with performing the calculations. "]},{"cell_type":"code","execution_count":null,"id":"2420e97a","metadata":{"id":"2420e97a"},"outputs":[],"source":["# Import working modules\n","import xarray as xr\n","import numpy as np\n","import matplotlib.pyplot as plt\n","import pandas as pd\n","import os\n","import sys\n","import geopandas as gpd\n","import regionmask\n","from scipy.stats import percentileofscore\n","from statistics import NormalDist\n","import scipy.stats as stats\n","\n","%matplotlib inline"]},{"cell_type":"markdown","id":"6119f068","metadata":{"id":"6119f068","tags":[]},"source":["### Connect to google drive\n","<hr>"]},{"cell_type":"markdown","id":"7c731998","metadata":{"id":"7c731998"},"source":["To be able to read the data from Google Drive, we need to *mount* our Drive to this notebook.\n","\n","As you can see in the cell below, make sure that in your **My Drive** folder, where you created **BigData** folder and within that folder, you have created a **Week4_Data** folder in which you can store the files that are required to run this analysis.\n","\n","Please go the URL when its prompted in the box underneath the following cell, and copy the authorization code in that box."]},{"cell_type":"code","execution_count":null,"id":"1af27ec0","metadata":{"id":"1af27ec0"},"outputs":[],"source":["from google.colab import drive\n","drive.mount('/content/gdrive/', force_remount=True)\n","\n","sys.path.append(\"/content/gdrive/My Drive/BigData/Week4_Data\")\n","\n","data_path = os.path.join('/content/gdrive/My Drive/BigData','Week4_Data')"]},{"cell_type":"markdown","id":"0569556d","metadata":{"id":"0569556d"},"source":["## 2. Exploring the required datasets\n","<hr>"]},{"attachments":{},"cell_type":"markdown","id":"08abdc51","metadata":{"id":"08abdc51"},"source":["In the next code cell, we will specify our working directory using `os.chdir()`. This directory should contain all the datasets to be used during the tutorial.\n","\n","To know the current working directory of the file, `os.getcwd()` method can be used. After changing the path, one can verify the path of current working directory using this method. We will use `os.path.join()` to set the specific path to each of the data folders within the working directory. It enables us to join the main directory path with the data folder path.\n","\n","However, before we continue you need to make sure that you have downloaded the MSWEP data and stored it in your `Week5_Data` folder. You can download the data on Canvas through the download link within the module. Unzip the data locally, but make sure that you keep the folder structure (as you may see below when reading the data)."]},{"cell_type":"code","execution_count":null,"id":"d9e26f89","metadata":{"id":"d9e26f89"},"outputs":[],"source":["# Name input file\n","weather_file  = os.path.join(data_path,'MSWEP_Monthly_prec/*.nc')\n","\n","world_admin_boundaries = os.path.join(data_path,'world-administrative-boundaries/world-administrative-boundaries.shp')\n","print(\"Precipitation data:\", '\\n', weather_file,'\\n', \"Administrative boundaries shapefile:\", '\\n', world_admin_boundaries)"]},{"attachments":{},"cell_type":"markdown","id":"645de622","metadata":{"id":"645de622"},"source":["Now that we have set the paths to the datasets, let us load the administrative boundaries shapefile using geopandas `gpd.read_file()` and in the process set the coordinate system to **EPSG:4326** (the standard global coordinate reference system). We will use the same coordinate system throughtout the tutorial.\n","\n","Let's start with loading the shapefile of the world boundaries. "]},{"cell_type":"code","execution_count":null,"id":"aa029265","metadata":{"id":"aa029265"},"outputs":[],"source":["data_Adminunits = gpd.read_file(world_admin_boundaries,crs=\"epsg:4326\") \n","data_Adminunits.head() #represents the countries around the world and their geographical locations"]},{"cell_type":"markdown","id":"5ebfb435","metadata":{"id":"5ebfb435"},"source":["`data_Adminunits` is a `GeoDataFrame` containing all polygons illustrating the national boundaries for the 256 countries in the world\n","\n","Let's have a look at the administrative boundaries we just loaded using `pd.DataFrame.plot()`\n","\n","<div class=\"alert alert-block alert-success\">\n","<b>Question 1:</b> Upload a figure of the global administrative boundaries. Include the the linewidth, figure size, the y and x labels, color (indicate none), edgecolor. Don’t forget to add the figure title. Additionally, upload the figure outlining the zoomed area of the western European countries. </div>"]},{"cell_type":"code","execution_count":null,"id":"3db318de","metadata":{"id":"3db318de"},"outputs":[],"source":["ax=data_Adminunits.plot(XXX) #add the alpha, linewidth, figsize, polygon colors and edgecolor\n","ax.set_title(\"XXX\")"]},{"cell_type":"markdown","id":"39498886","metadata":{"id":"39498886"},"source":["We have loaded the administrative shapefile. Now we can load the gridded precipitation dataset using the `xarray` package through the `open_mfdataset()` function. This function allows us to open multiple files as a single dataset. \n","\n","In a folder I have hundreds of MSWEP reanalysis data downloaded from the [here](http://www.gloh2o.org/mswep/). Each of these files contains a single data of the global reanalysis, about 8.2 MB. The parameter chunks is very important, it defines how big are the “pieces” of data moved from the disk to the memory. With this value the entire computation on a workstation with 2 GB takes a couple of minutes. I want to load all the precipitation files from the year 1980-2021.\n","\n","Now let's load the gridded daily precipitation dataset."]},{"cell_type":"code","execution_count":null,"id":"8822ae47","metadata":{"id":"8822ae47"},"outputs":[],"source":["data = xr.open_mfdataset(weather_file,chunks={\"time\":10})    \n","data"]},{"cell_type":"markdown","id":"4a4b3d9b","metadata":{"id":"4a4b3d9b"},"source":["As you can see this xarray Dataset contains a single variable `(precipitation)` which is stored as a dask.array. This is the result of loading files with open_mfdataset. More information and examples of the interaction between dask and xarray can be found in their documentations([here](https://examples.dask.org/xarray.html) and [here](https://docs.xarray.dev/en/stable/user-guide/dask.html))\n"]},{"cell_type":"markdown","id":"009349a2","metadata":{"id":"009349a2"},"source":["Now we are ready for a bit of **regionmask** magic. The module can create a gridded mask with the function `regionmask.Regions()` documented [here](https://regionmask.readthedocs.io/en/stable/). With this function we have created an object able to mask gridded data."]},{"cell_type":"code","execution_count":null,"id":"f7535c16","metadata":{"id":"f7535c16"},"outputs":[],"source":["sheds_mask_poly = regionmask.Regions(name = 'sheds_mask', numbers = list(range(0,256)), names = list(data_Adminunits.name),\n","                                     abbrevs = list(data_Adminunits.color_code), outlines = list(data_Adminunits.geometry.values[i] for i in range(0,256)))\n","sheds_mask_poly"]},{"cell_type":"markdown","id":"47bc0825","metadata":{"id":"47bc0825"},"source":["### Calculating the mask of the region of interest"]},{"cell_type":"markdown","id":"5c8803a4","metadata":{"id":"5c8803a4"},"source":["Now we are ready to apply the mask on the gridded dataset **data**. We select only the first timestep to speed up the process. For the same reason it is better to further “zoom” into the Western European countries. We specify the name of the coordinates (the defaults are lat and lon). You can play around with the coordinates. Use this bounding box [here](https://boundingbox.klokantech.com/) to obtain the correct coordinates. Once you have created a bounding box, you can select the values to be presented in 'CSV' format, and copy paste the coordinates in the cell below."]},{"cell_type":"code","execution_count":null,"id":"edea2c6f","metadata":{"id":"edea2c6f"},"outputs":[],"source":["mask = sheds_mask_poly.mask(data.isel(time = 0).sel(lon = slice(XXX), lat = slice(XXX)),lon_name = 'lon', lat_name = 'lat') # change the lat and lon slice to the bounding box of the selected region\n","mask"]},{"cell_type":"markdown","id":"bcda3ca3","metadata":{"id":"bcda3ca3"},"source":["### Plotting the created mask"]},{"cell_type":"markdown","id":"b6ba1fc1","metadata":{"id":"b6ba1fc1"},"source":["The computation takes a couple of minutes but then the mask can be saved (for example as a NetCDF) for a later use. Let's plot the mask we just created. You can see that the bounding box we used zooms into the European area."]},{"cell_type":"code","execution_count":null,"id":"b459b1cd","metadata":{"id":"b459b1cd"},"outputs":[],"source":["plt.figure(figsize=(12,8))\n","ax = plt.axes()\n","mask.plot(ax = ax)\n","data_Adminunits.plot(XXX)\n","\n","ax.set_title('XXX')"]},{"cell_type":"markdown","id":"8e203eb0","metadata":{"id":"8e203eb0"},"source":["## 3. Extracting gridded and timeseries data of the area of interest\n","<hr>"]},{"cell_type":"markdown","id":"24426b07","metadata":{"id":"24426b07"},"source":["We can now for the selected country aggregate the grid cells in the national borders to timeseries but before that we will work with gridded data first. The procedure is rather simple, we will focus on a single region and it is easy to extend it for multiple regions.\n","\n","As first step, we will save the latitude and longitude vectors because we will use it later. Then, we select the mask points where the value is equal to target value (the region code). In the numpy array `sel_mask` all the values are nan except for the selected ones."]},{"cell_type":"code","execution_count":null,"id":"4caba60c","metadata":{"id":"4caba60c"},"outputs":[],"source":["#Extracting the precipitation data for the selected region\n","region_name = \"....\"   #Selected country within the bounding region\n","\n","lat = mask.lat.values          # the lat and lon values contained in the mask of the regions\n","lon = mask.lon.values\n","\n","ID_REGION = data_Adminunits.index[data_Adminunits.name == region_name].values[0] #getting the index position of the selected country\n","   \n","sel_mask = mask.where(mask == ID_REGION).values\n","sel_mask"]},{"cell_type":"markdown","id":"d26fc383","metadata":{"id":"d26fc383"},"source":["<div class=\"alert alert-block alert-success\">\n","<b>Question 2:</b> Which country did you select as a case study region and why? \n","</div>"]},{"cell_type":"markdown","id":"1f3f1123","metadata":{"id":"1f3f1123"},"source":["To speed-up the process we want to crop the xarray Dataset selecting the smallest box containing the entire mask. To do this, we store in **id_lon** and **id_lat** the coordinate points where the mask has at least a non-nan value."]},{"cell_type":"code","execution_count":null,"id":"e22ed9f0","metadata":{"id":"e22ed9f0"},"outputs":[],"source":["id_lon = lon[np.where(~np.all(np.isnan(sel_mask), axis=0))]\n","id_lat = lat[np.where(~np.all(np.isnan(sel_mask), axis=1))]"]},{"cell_type":"markdown","id":"a0e14c47","metadata":{"id":"a0e14c47"},"source":["<div class=\"alert alert-block alert-success\">\n","<b>Question 3:</b> Describe what the fuction `np.where()`does? Are there other fuctions that can replace it?\n","</div>"]},{"cell_type":"markdown","id":"4ca235bf","metadata":{"id":"4ca235bf"},"source":["The `data` dataset is reduced selecting only the target year and the coordinates containing the target region. Then the dataset is load from the dask array using compute and then filtered using the mask. This procedure speeds up the computation and reduces the memory use for large dataset, apparently the `where()` function is not really dask friendly."]},{"cell_type":"code","execution_count":null,"id":"35a7f861","metadata":{"id":"35a7f861"},"outputs":[],"source":["monthly_prec = data.sel(lat = slice(id_lat[0], id_lat[-1]), lon = slice(id_lon[0], id_lon[-1])).compute().where(mask == ID_REGION)\n","monthly_prec"]},{"cell_type":"markdown","id":"c348be39","metadata":{"id":"c348be39"},"source":["Now let's plot the gridded precipitation data for the target region contained in `monthly_prec`\n","\n","**Action**  Inspect the code, add the title name and run the code to make the plot. Fill in the plot parameters in the code box below"]},{"cell_type":"code","execution_count":null,"id":"91dbb438","metadata":{"id":"91dbb438"},"outputs":[],"source":["plt.figure(figsize=(12,8))\n","ax = plt.axes()\n","monthly_prec.precipitation.isel(time = 1).plot(ax = ax)\n","data_Adminunits.plot(ax = ax, alpha = 0.8, facecolor = 'none')\n","plt.title(\"XXX\") #set the right size for the title\n","plt.show()"]},{"cell_type":"code","execution_count":null,"id":"23d239f4","metadata":{"id":"23d239f4"},"outputs":[],"source":["monthly_prec_timeseries=monthly_prec.precipitation.mean(dim=('lon','lat'))\n","df_prec_timeseries=monthly_prec_timeseries.to_dataframe()"]},{"cell_type":"markdown","id":"fba2f607","metadata":{"id":"fba2f607"},"source":["**Action**  Inspect the code, change the label and title names and run the code to make the plot. Fill in the plot parameters in the code box below"]},{"cell_type":"code","execution_count":null,"id":"a462f7de","metadata":{"id":"a462f7de"},"outputs":[],"source":["fig = plt.figure(figsize=(30,10) )\n","plt.plot(df_prec_timeseries.index,df_prec_timeseries,\"-o\",color='b')\n","plt.ylabel('XXX',size=15) #change\n","plt.xlabel('XXX')#change\n","plt.title('XXX',size=25)#change \n","plt.show()"]},{"cell_type":"markdown","id":"3e50cce0","metadata":{"id":"3e50cce0"},"source":["<div class=\"alert alert-block alert-success\">\n","<b>Question 4:</b> Briefly explain what you see in both the precipitation timeseries plot  and the precipitation spatial plot above.\n","</div>"]},{"cell_type":"markdown","id":"cfccd978","metadata":{"id":"cfccd978"},"source":["## 4. Calculating meteorological drought conditions using standardized drought indicators\n","<hr>"]},{"cell_type":"markdown","id":"9202de7e","metadata":{"id":"9202de7e"},"source":["The required datasets are now loaded correctly into Python. From this point on we will start with the actual drought analysis. We will calculate meteorological droughts (Standardized Precipitation Index (SPI)) for your chosen accumulation period (i.e. between 1 - 12 months). Drought conditions are being calculated for each month (using all January’s, February’s, etc). "]},{"cell_type":"markdown","id":"542b79de","metadata":{"id":"542b79de"},"source":["Let's revisit python functions (introduced in the Numpy and Pandas exercise),they are defined with the `def` keyword, then the function identifier (name) followed by parentheses and a colon as shown in the next cells. \n","\n","Before starting with the SPI calculation we need to prepare the accumulated time-series first.Using the `numpy` package through the `cumsum()` function we calculate the cummulative precipitation per grid cell in a moving sum window along the time axis."]},{"cell_type":"markdown","id":"8b5f0607","metadata":{"id":"8b5f0607"},"source":["### Defining the functions for calculating the Standardized Precipitation Index (SPI)"]},{"cell_type":"code","execution_count":null,"id":"2a6f1baa","metadata":{"id":"2a6f1baa"},"outputs":[],"source":["def cumulative_values(y,w): \n","    def mov_sum(y,w):\n","        '''defining the time period and a function that calculates \n","        the accumulation periods for monthly precipitation per cell where\n","        y is gridded precipitation data and w is the accumulation value in months'''\n","        cumul = np.cumsum(y, dtype= float)\n","        cumul[w:] = cumul[w:]- cumul[:-w]\n","        cumul[:w-1]=(np.nan==-9999)\n","        cumul_dataseries = cumul\n","    \n","        return cumul_dataseries\n","\n","    cumul_dataset = np.zeros((y.shape))\n","    rows_p3 = y.shape[1]\n","    cols_p3 = y.shape[2]\n","    time_p3 = y.shape[0]\n","    for r in range(0,rows_p3):\n","        for c in range (0,cols_p3):\n","            for t in range (0,time_p3):\n","                cumul_data = mov_sum(y[:,r,c],w)\n","                cumul_dataset[:,r,c]=cumul_data\n","    return cumul_dataset"]},{"cell_type":"markdown","id":"c6b748d2","metadata":{"id":"c6b748d2"},"source":["Let's create a function to group the monthly gridded precipitation data into different months i.e. Januarys. Februarys etc. This will enable us to calculate SPI per month over the historic period.\n","\n","To group the precipitation data we create an empty dataframe using pandas through `DataFrame()` and a list of dates for the historic time period through `date_range()`. The grouping of precipitation data is done using `DatetimeIndex().month()` and each group is stored in a dictionary."]},{"cell_type":"code","execution_count":null,"id":"dd95c530","metadata":{"id":"dd95c530"},"outputs":[],"source":["def grouping_prec_months (y,w):\n","    '''groups the precipitation data into all the months based on the dates\n","    where y is the gridded precipitation data and w is the accumulation value'''\n","\n","    month_dates = pd.date_range (start='2002-01-01', end='2022-06-01', freq = 'M')\n","    df = pd.DataFrame(index = month_dates)\n","    df['month'] = pd. DatetimeIndex(df.index).month \n","    monthly_data = cumulative_values(y,w) #calls the cumulative value function to calculate the accumulation periods of the data\n","    Month_data_dict= {} # initializes an empty dictionary\n","    # #grouping the months\n","    Month_data_dict['Jan'] = monthly_data[df.month==1,:,:]\n","    Month_data_dict['Feb'] = monthly_data[df.month==2,:,:]\n","    Month_data_dict['Mar'] = monthly_data[df.month==3,:,:]\n","    Month_data_dict['Apr'] = monthly_data[df.month==4,:,:]\n","    Month_data_dict['May'] = monthly_data[df.month==5,:,:]\n","    Month_data_dict['Jun'] = monthly_data[df.month==6,:,:]\n","    Month_data_dict['Jul'] = monthly_data[df.month==7,:,:]\n","    Month_data_dict['Aug'] = monthly_data[df.month==8,:,:]\n","    Month_data_dict['Sep'] = monthly_data[df.month==9,:,:]\n","    Month_data_dict['Oct'] = monthly_data[df.month==10,:,:]\n","    Month_data_dict['Nov'] = monthly_data[df.month==11,:,:]\n","    Month_data_dict['Dec'] = monthly_data[df.month==12,:,:]    \n","\n","    return Month_data_dict"]},{"cell_type":"markdown","id":"2e6c833e","metadata":{"id":"2e6c833e"},"source":["After grouping into months, we create a function using `scipy.stats` through `percentilescore()` documented [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.percentileofscore.html). This allows us to compute the percentile rank of each monthly value relative to a list of values."]},{"cell_type":"markdown","id":"7320bbf5","metadata":{"id":"7320bbf5"},"source":["<div class=\"alert alert-block alert-success\">\n","<b>Question 5:</b> What other fuctions in can be used instead of the `scipy.stats` `percentilescore()`?Describe these other fuctions.\n","</div>"]},{"cell_type":"code","execution_count":null,"id":"bbdc3760","metadata":{"id":"bbdc3760"},"outputs":[],"source":["def percentiles(y,w):\n","    '''computes the percentile rank of each monthly value relative to the entire 20 years of values. y is the gridded \n","    precipitation data and w is the accumulation value in months'''\n","    mnths= ['Jan', 'Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']\n","    perc_dictionary = {} #creates an empty dictionary\n","    Month_data_dict = grouping_prec_months (y,w) # calls the grouping by months function from previous cell\n","    for m in mnths:\n","        output = np.zeros(Month_data_dict[m].shape)\n","        rows = Month_data_dict[m].shape[1]\n","        cols = Month_data_dict[m].shape[2]\n","        time = Month_data_dict[m].shape[0]\n","        for i in range(0,rows):\n","            for j in range (0,cols):\n","                for z in range (0,time):                \n","                    result =percentileofscore(Month_data_dict[m][:,i,j],Month_data_dict[m][z,i,j], kind = 'mean')/100                \n","                    output[z,i,j] =result\n","        perc_dictionary[m] = output\n","    return perc_dictionary, Month_data_dict"]},{"cell_type":"markdown","id":"ac4f8444","metadata":{"id":"ac4f8444"},"source":["#### SPI calculation for each month over the historic time-period\n","\n","SPI is used to characterize meteorological drought on a range of timescales. On short timescales, the SPI is closely related to soil moisture, while at longer timescales, the SPI can be related to groundwater and reservoir storage. The SPI can be compared across regions with markedly different climates. It quantifies observed precipitation as a standardized departure from a selected probability distribution function that models raw precipitation data. The raw precipitation data are typically fitted to a gamma or Pearson Type III distribution, and then transformed to a normal distribution so the mean of SPI is zero. For this exercise, we will not fit the raw precipitation through a distribution to save time and also because we are working with historic data.\n","\n","We will now calculate SPI for all months throughout the year (m = 1:12).By normalizing the precipitation percentiles data afterwards drought conditions become comparable between different locations and for different months. The procedure will result in time series which values vary from ~ -3 (extremely dry) to ~ +3 (extremely wet compared to the long-term mean average conditions), providing information on meteorological drought conditions. The frequency, average duration, and severity of drought events are subsequently estimated by looking at the ‘clustered nature’ of individual drought months. \n","\n","Further information on SPI can be found [here](https://climatedataguide.ucar.edu/climate-data/standardized-precipitation-index-spi).\n","\n","Take a careful look how we do this by inspecting the code-boxes and run the code in the code-boxes below to eventually plot the SPI values for all months. Where needed, change the names of titles and axes. "]},{"cell_type":"code","execution_count":null,"id":"c1ad6952","metadata":{"id":"c1ad6952"},"outputs":[],"source":["def standardized_index(y,w,a):\n","    '''function to calculate SPI using percentile ranking and normalizing the rank using normal distribution'''\n","    month_dates = pd.date_range (start='2002-01-01', end='2022-06-01', freq = 'M')\n","    df = pd.DataFrame(index = month_dates)\n","    df['month'] = pd. DatetimeIndex(df.index).month\n","    perc_dictionary, Month_data_dict = percentiles(y,w)\n","    index_grouped = {} \n","    mnths = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']\n","    for mon in mnths:\n","        output_sp = np.zeros(perc_dictionary[mon].shape)\n","        rows_p = perc_dictionary[mon].shape[1]\n","        cols_p = perc_dictionary[mon].shape[2]\n","        time_p = perc_dictionary[mon].shape[0]\n","        for r in range(0,rows_p):\n","            for c in range (0,cols_p):\n","                for t in range (1,time_p):\n","                    zeros = len([p for p in Month_data_dict[mon][:,r,c] if p==0])\n","                    count = len([p for p in Month_data_dict[mon][:,r,c]])\n","                    value_0 = NormalDist().inv_cdf((zeros+1)/(2*(count+1)))#calculating distr for zeros using stagge formula\n","                    #normalizing the data \n","                    if Month_data_dict[mon][t,r,c]==0:\n","                        norm_dist = value_0\n","                    else:\n","                        norm_dist= NormalDist().inv_cdf(perc_dictionary[mon][t,r,c])\n","                    output_sp[t,r,c] =norm_dist\n","        index_grouped[mon] = output_sp\n","        \n","        #merging and converting the numpy arrays into a single xarray dataset with coordinates by applying the merge function\n","        #and appending the data to an empty list\n","    data_arrays=[]\n","    \n","    for i, key in enumerate(index_grouped):\n","        da = xr.DataArray(index_grouped[key],coords=[df[df.month==i+1].index, a.lat.values, a.lon.values],dims=[\"time\",\"latitude[degrees_north]\", \"longitude[degrees_east]\"], name = 'spi_values')\n","        data_arrays.append(da)    \n","    index = xr.merge(data_arrays)  \n","    \n","    return index\n"]},{"cell_type":"markdown","id":"b4c6eac1","metadata":{"id":"b4c6eac1"},"source":["<div class=\"alert alert-block alert-success\">\n","<b>Question 6:</b> What does the xarray `merge()` and python list `append()` methods do and return? Describe.\n","</div>"]},{"cell_type":"markdown","id":"1a1af14f","metadata":{"id":"1a1af14f"},"source":["SPI can be calculated for a range of timesteps (1,3,6,12 months and so on). Select the accumulation period/ lag time/timestep to use with your group (i.e. between 1-12 months). The use of different timescales allows the effect of precipitation deficits on various water resources (groundwater, soil moisture, reservoir storage and streamflow) to be assessed.\n","\n","**Action**: Insert lag time in the code-box below and run it. "]},{"cell_type":"code","execution_count":null,"id":"e80552dc","metadata":{"id":"e80552dc"},"outputs":[],"source":["#SPI considering zeros using Stagge et al. and gamma distribution\n","\n","spi_1 = standardized_index(monthly_prec.precipitation[:].values,1, monthly_prec)   #choose the accumulation value to determine which SPI you calculate\n","spi_1"]},{"cell_type":"markdown","id":"90d8ebb7","metadata":{"id":"90d8ebb7"},"source":["Now we convert the gridded data for the SPI to a timeseries and plot the spi values for the entire historic period both spatially and as a timeseries."]},{"cell_type":"code","execution_count":null,"id":"a89a26fa","metadata":{"id":"a89a26fa"},"outputs":[],"source":["#Timeseries plot\n","#converting the gridded data to timeseries\n","indice_timeseries=spi_1.spi_values.mean(dim=(\"latitude[degrees_north]\", \"longitude[degrees_east]\"))\n","df_indice_timeseries =indice_timeseries.to_dataframe()\n","print(df_indice_timeseries)"]},{"cell_type":"markdown","id":"e0d2de93","metadata":{"id":"e0d2de93"},"source":["### Visualization of SPI\n","\n","Now let's visualize the SPI. Within **xarray**'s `.plot()` function, we can nicely select that we want to create a multi-plot figure, using the `col` and `col_wrap` arguments."]},{"cell_type":"markdown","id":"4947a653","metadata":{"id":"4947a653"},"source":["<div class=\"alert alert-block alert-success\">\n","<b>Question 7:</b> Select a year with extreme drought occurence within the selected study region and plot the spi values. Include the plots in Canvas.\n","</div>"]},{"cell_type":"code","execution_count":null,"id":"c2e72078","metadata":{"id":"c2e72078"},"outputs":[],"source":["spi_1['spi_values'].sel(time='XXXX').plot(cmap='RdBu', col='time', col_wrap=4, vmin =-2.5, vmax=2.5) #change the year"]},{"cell_type":"markdown","id":"f4c5d4d1","metadata":{"id":"f4c5d4d1"},"source":["<div class=\"alert alert-block alert-success\">\n","<b>Question 8:</b> Explain the drought spatial plots plotted above. How was the drought in each of the months within the regions of the selected country?\n","</div>"]},{"cell_type":"markdown","id":"6f29d57b-1343-44d1-811f-56c00c5b1f6c","metadata":{"id":"6f29d57b-1343-44d1-811f-56c00c5b1f6c"},"source":["It would also be interesting to plot the spi_1 in a timeseries. In the cell below, we plot the calculated SPI-1 values over time. A value below zero means there is a drought and above zero indicates wet periods."]},{"cell_type":"code","execution_count":null,"id":"583373d4","metadata":{"id":"583373d4"},"outputs":[],"source":["fig = plt.figure(figsize=(30,10) )\n","plt.plot(df_indice_timeseries.index,df_indice_timeseries,\"-o\",color='b')\n","plt.axhline(y=0, color='r', linestyle='--') # all the values below zero indicate drought conditions. Can be changed to indicate drought threshold\n","plt.ylabel('XXX',size=15)\n","plt.xlabel('XXX')\n","plt.title('XXX',size=25)\n","plt.show()"]},{"attachments":{},"cell_type":"markdown","id":"6d088853","metadata":{"id":"6d088853"},"source":["<div class=\"alert alert-block alert-success\">\n","<b>Question 9:</b> Explain the drought pattern in the SPI values timeseries above for the selected case study region.\n","</div>"]},{"cell_type":"markdown","id":"5f727623","metadata":{"id":"5f727623"},"source":["<div class=\"alert alert-block alert-success\">\n","<b>Question 10:</b> Now that we have calculated SPI-1 above, change the accumulation number from 1 and calculate SPI-6 and SPI-12. What differences do you see in the drought characteristics and what does each represent.\n","</div>"]},{"cell_type":"markdown","id":"efb65c5c","metadata":{},"source":[]}],"metadata":{"colab":{"provenance":[{"file_id":"https://github.com/ElcoK/BigData_AED/blob/main/week5/tutorial2.ipynb","timestamp":1677753528015}]},"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.11"}},"nbformat":4,"nbformat_minor":5}