{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Tutorial 2: Social Media & Valuation of Landscapes\n", "\n", "Within this tutorial, we will work with the Flickr photo database to better understand how people value specific land use types. We will learn how to extract information from Flickr, how you can explore and visualize this, and how to use it for some basic analysis.\n", "\n", "### Important before we start\n", "---\n", "Make sure that you save this file before you continue, else you will lose everything. To do so, go to **Bestand/File** and click on **Een kopie opslaan in Drive/Save a Copy on Drive**!\n", "\n", "Now, rename the file into Week6_Tutorial2.ipynb. You can do so by clicking on the name in the top of this screen.\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Learning Objectives\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- To understand how we can use public data to better understand people's preferences for leisure.\n", "- To know how to extract data from Flickr using an API.\n", "- To know how to clean and prepare raw data for analysis.\n", "- To know how to cluster geospatial information.\n", "- To visualize clusters and point data in a meaningful way.\n", "- To know how to combine different spatial datasets to gain additional insights." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Tutorial Outline

\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.Introducing the packages\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Within this tutorial, we are going to make use of the following packages: \n", "\n", "[**flickrapi**](https://stuvel.eu/flickrapi-doc/index.html) is a Python interface to the Flickr API. It includes support for authorized and non-authorized access, uploading and replacing photos, and all Flickr API functions.\n", "\n", "[**GeoPandas**](https://geopandas.org/) is a Python package that extends the datatypes used by pandas to allow spatial operations on geometric types.\n", "\n", "[**NumPy**](https://numpy.org/doc/stable/) is a Python library that provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays.\n", "\n", "[**Pandas**](https://pandas.pydata.org/docs/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.\n", "\n", "[**OSMnx**](https://osmnx.readthedocs.io/) is a Python package that lets you download geospatial data from OpenStreetMap and model, project, visualize, and analyze real-world street networks and any other geospatial geometries. You can download and model walkable, drivable, or bikeable urban networks with a single line of Python code then easily analyze and visualize them. You can just as easily download and work with other infrastructure types, amenities/points of interest, building footprints, elevation data, street bearings/orientations, and speed/travel time.\n", "\n", "[**Matplotlib**](https://matplotlib.org/) is a comprehensive Python package for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.\n", "\n", "*We will first need to install these packages in the cell below. Uncomment them to make sure we can pip install them*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install flickrapi\n", "!pip install osmnx\n", "!pip install contextily" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will import these packages in the cell below:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import flickrapi\n", "import pandas as pd\n", "import contextily as cx\n", "import geopandas as gpd\n", "import osmnx as ox\n", "import numpy as np\n", "import shapely \n", "\n", "from sklearn.cluster import DBSCAN\n", "from datetime import datetime" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Extract data from Flickr\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract data from Flicker, we will use their **API**. This is an acronym for application programming interface. It is a software intermediary that allows two applications to talk to each other. APIs are an accessible way to extract and share data within and across organizations. APIs are all around us. Every time you use a rideshare app, send a mobile payment, or change the thermostat temperature from your phone, you’re using an API.\n", "\n", "However, before we can access this **API**, we need to take a few steps. Most importantly, we need to register ourselves on the [Flickr](https://www.flickr.com/) portal. To do so, we need to register, as explained in the video clip below:\n", "\n", "\n", "
\n", "\n", "Now, the next step is to access the API. You can now login on the website of [Flickr](https://www.flickr.com/), and go to the **API** part of the [website](https://www.flickr.com/services/apps/create/apply/?).\n", "\n", "Now click on `APPLY FOR A NON-COMMERCIAL KEY` and just fill in some information about the course. The name of the App can be something random and related to our big data course, and you just need to fill in two lines of text to describe what you want to do. As soon as you click `SUBMIT`, you will see a generated **api_key** and an **api_secret**. Fill those in below." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "api_key = ''\n", "api_secret=''\n", "\n", "flickr = flickrapi.FlickrAPI(api_key, api_secret, cache=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare data extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we get started with the extraction of data, we need to specify again an area of interest we want to focus on. In the cell below, you will now read \"Ameland, The Netherlands\". Change this to any random or municipality in the Netherlands that (1) you can think of and (2) will work. \n", "\n", "In some cases, the function does not recognize the location. You could either try a different phrasing or try a different location. Many parts of the Netherlands should work. It is also fine to use an area outside of the Netherlands. I would make sure to not make it too large, as it will take a long time to extract the photos." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "place_name = \"Ameland, The Netherlands\"\n", "area = ox.geocode_to_gdf(place_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let us visualize the bounding box of the area, similar to some of our previous tutorials." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "area_to_check = area.to_crs(epsg=3857)\n", "ax = area_to_check.plot(figsize=(10, 10), color=\"none\", edgecolor=\"k\", linewidth=4)\n", "ax.set_xticks([])\n", "ax.set_yticks([])\n", "ax.set_axis_off()\n", "cx.add_basemap(ax, zoom=11) #depending on the size of your area, you may need to change the zoom level if you run into an error.\n", "size = int(area_to_check.area/1e6)\n", "\n", "ax.set_title(\"{}. Total area: {} km2\".format(place_name,size),fontweight='bold')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract the metadata of the photos from Flickr, we are going to use the `.walk()` function. This function will return a list of photos matching some criteria. It allows you to search photos based on location, date, tag and so on.\n", "\n", "In our case, we are state that we want to find all photos, no matter how they are tagged. This is defined through `tag_mode=all`. Moreover, as you will see, we also specify that we want some extra information. More specifically, we ask for the `geo(location)`, `tags`, `date_taken` and the `url_m` of a photo.\n", "\n", "We use a bounding box `bbox` to make sure we only select photos from our region of interest. The input requires our bounding box to me in a string format. So let's do that first." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "bbox_string = \",\".join([str(x) for x in area.bounds.values[0]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can extract all the photos. We collect all of them in a list, through the `.append()` function" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [], "source": [ "collect_pictures = []\n", "for photo in flickr.walk(tag_mode='any',\n", " #tags='nature',\n", " bbox=bbox_string,extras='geo,tags,date_taken,url_m'):\n", " \n", " get_attributes = photo.attrib\n", " collect_pictures.append([get_attributes['id'],\n", " get_attributes['owner'],\n", " get_attributes['datetaken'],\n", " get_attributes['tags'],\n", " get_attributes['latitude'],\n", " get_attributes['longitude']])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(collect_pictures)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Explore the data\n", "
\n", "Now we have extracted the data and, let's explore (and clean) this data a bit more. The convenient thing of having everything stored in a list, is that we can easily turn this into a pandas DataFrame." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(collect_pictures,columns=['id','owner','datetaken','tags','latitude','longitude'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.to_csv('Ameland_Flickr.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's explore the data a little bit:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head(XXX)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's see if everything is stored in a format in which we can work with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would expect floating values for the **latitude** and **longitude**, and a *datetime* object for the **datetaken**. Let's have a look how these are stored: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.longitude.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aha! Strings. It will be difficult to convert strings into proper geometries, so let's convert these columns to floating values, and convert them to **points** using `pygeos.points()`. As you will see, we use a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) to do so. The input for `pygeos.points()` is a list or numpy array, in which each element contains a *longitude* and a *latitude*. To make sure we have that, we create this `list` using the [zip](https://docs.python.org/3.3/library/functions.html#zip) function in Python.\n", "\n", "
\n", "Question 1: As all the data is stored in strings, its pretty difficult to do something with the information straight away. Please provide the lines of code that you have written to convert the geo-information (lattitude and longitude) into actual coordinates and also provide the lines of code that you have written to convert the dates to a datetime object. Explain the lines of code.\n", "
\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "df['longitude'] = df['longitude'].astype(\n", "df['latitude'] = df['latitude'].\n", "df['geometry'] = [shapely.points(x[0],x[1]) for x in list(zip(df['longitude'],df['latitude']))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's convert the **datetaken** column into a *datetime* type, so we can extract specific years or days. To do so, we will make use of the `lambda` and `apply` functions, and use the `fromisoformat()` function from within the **datetime** package." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "df['datetaken'] = df.datetaken.XXXX(XXXX x : datetime.XXXX(x))\n", "df['year'] = df.datetaken.dt.year\n", "df['month'] = df.datetaken.dt.strftime('%b')\n", "df['month_year'] = df['datetaken'].dt.to_period('M')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have the dates in the correct format, we can plot a figure to identify when most of the photos have been taken/uploaded. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.year.value_counts().plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we are dealing with spatial data, it would be nice to plot this information on a map. To do so, we convert our `pandas.DataFrame` into a `geopandas.GeoDataFrame`. Moreover, we have to specify the coordinate reference system. Given that we have a global dataset, it makes most sense to use **epsg:4326**." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "gdf = gpd.GeoDataFrame(df.copy())\n", "gdf.crs = 'epsg:4326'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf_to_plot = gdf.to_crs(epsg=3857)\n", "\n", "ax = gdf_to_plot.plot(column='year',figsize=(15, 3),legend=True)\n", "ax.set_xticks([])\n", "ax.set_yticks([])\n", "ax.set_axis_off()\n", "cx.add_basemap(ax, zoom=12) #depending on the size of your area, you may need to change the zoom level if you run into an error.\n", "\n", "ax.set_title(\"\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "
\n", "Question 2: Upload the map with locations of all the photos taken in your area.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Question 3: Describe the plot of the amount of photos over per year (or per month) and the map. Do you already notice specific patterns. Is there something already that may influence our results later on?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look how all these photos are tagged. And whether we can actually do something with this information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf['tags']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks like a mess. It seems we have some work to do to be able to use some of this. Lets get an overview of all the tags and get an idea how often certain tags are used." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "find_all_tags = []\n", "for row in gdf.tags:\n", " find_all_tags.append(row.split())\n", " \n", "all_tags = [item for sublist in find_all_tags for item in sublist]\n", "\n", "pd.Series(all_tags).value_counts().head(50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we see that quite a lot of the tags don't really say much about the area or why people might visit the area. Let's give it a go by just trying to see how many pictures are tagged in something linked to nature. Add more words if you believe you can identify more in the previous overview." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def find_nature_tags(row):\n", " matches = [\"seehund\", \"zeehond\",\"vis\",\"wadden\",\n", " \"natuur\",\"nature\",\"natur\",\n", " \"landschaft\",\n", " \"strand\",\"beach\",\"zee\",\"sea\",\"meer\",\n", " \"bos\",\"forest\",\n", " \"animal\",\"bird\",\"vogel\",\"dier\"]\n", "\n", " overlap = set(row.split()).intersection(set(matches))\n", " \n", " if len(overlap) :\n", " return 'yes'\n", " \n", "gdf['nature'] = gdf.tags.apply(lambda x: find_nature_tags(x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Question 4: What are the most important tags in your dataset? Also provide the list of tags that you have used to select photos that show nature on the photo. How many pictures did you end up with?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's count them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(gdf.loc[gdf.nature=='yes'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's only plot those points on a map" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf_to_plot = gdf.to_crs(epsg=3857)\n", "gdf_to_plot = gdf_to_plot.loc[gdf_to_plot.nature == 'yes']\n", "gdf_to_plot.reset_index(drop=True,inplace=True)\n", "\n", "ax = gdf_to_plot.plot(column='year',figsize=(15, 3),legend=True)\n", "#ax.set_xticks([])\n", "#ax.set_yticks([])\n", "#ax.set_axis_off()\n", "cx.add_basemap(ax, zoom=12) #depending on the size of your area, you may need to change the zoom level if you run into an error.\n", "\n", "ax.set_title(\"\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "What we also noticed is that some users seems to upload quite a lot of pictures. Let's have a look at the amount of unique users in this area:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf_to_plot.owner.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And have a look at one of the users with the most pictures" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf_to_plot.loc[gdf_to_plot.owner == 'XXXXX']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, so it seems that we have several users that dominate the amount of uploads. If we want to say something about the preference of locations to visit, we might have to compensate for this. To do so, we can make use of the groupby function. Which columns would you like to use to make sure you still keep enough unique entries? And which groupby functions will you choose to group on? First, last, mean?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "gdf_unique = gdf_to_plot.groupby(['XXX','XXX','XXX']).XXXX().reset_index()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf_unique" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Question 5: Please provide the lines of code that you have written to reduce the potential double-counting of the same user on the same location.\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Clustering of data\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As it is pretty difficult to see any patterns on the maps above, let's try to cluster some of this information and see if this helps us to better understand why people choose to visit certain locations in the area.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a grid of the area\n", "\n", "The most simple way would be to collect all points within a grid. The `create_grid()` function below will help us to create a evenly-distributed grid within any given area. So let's use that!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "def create_grid(bbox,height):\n", " \"\"\"Create a vector-based grid\n", " Args:\n", " bbox ([type]): [description]\n", " height ([type]): [description]\n", " Returns:\n", " [type]: [description]\n", " \"\"\" \n", "\n", " # set xmin,ymin,xmax,and ymax of the grid\n", " xmin, ymin = shapely.total_bounds(bbox)[0],shapely.total_bounds(bbox)[1]\n", " xmax, ymax = shapely.total_bounds(bbox)[2],shapely.total_bounds(bbox)[3]\n", " \n", " #estimate total rows and columns\n", " rows = int(np.ceil((ymax-ymin) / height))\n", " cols = int(np.ceil((xmax-xmin) / height))\n", "\n", " # set corner points\n", " x_left_origin = xmin\n", " x_right_origin = xmin + height\n", " y_top_origin = ymax\n", " y_bottom_origin = ymax - height\n", "\n", " # create actual grid\n", " res_geoms = []\n", " for countcols in range(cols):\n", " y_top = y_top_origin\n", " y_bottom = y_bottom_origin\n", " for countrows in range(rows):\n", " res_geoms.append((\n", " ((x_left_origin, y_top), (x_right_origin, y_top),\n", " (x_right_origin, y_bottom), (x_left_origin, y_bottom)\n", " )))\n", " y_top = y_top - height\n", " y_bottom = y_bottom - height\n", " x_left_origin = x_left_origin + height\n", " x_right_origin = x_right_origin + height\n", "\n", " return shapely.polygons(res_geoms)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And apply this within our area. We need to specify the height of each cell (the cellsize). Choose a size that doesnt make the grid too large, but also provides enough cells within the area to see some spatial variation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid = pd.DataFrame(create_grid(area.geometry,XXXX),columns=['geometry'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look how this grid looks like." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gpd.GeoDataFrame(grid.copy()).plot(edgecolor='black')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's make sure we georeference the data and convert it to **EPSG:3857** (the same as the Flickr data)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid = gpd.GeoDataFrame(grid.copy())\n", "grid.crs = 'epsg:4326'\n", "grid = grid.to_crs(epsg=3857)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To very efficiently find overlap between two geospatial databases, we make use of spatial queries. You can read more about the function we are using [here](https://pygeos.readthedocs.io/en/latest/strtree.html) and about R-tree (the approach for very efficienet spatial queries) [here](https://en.wikipedia.org/wiki/R-tree).\n", "\n", "We start with building the tree from our photos. We want to quickly see how many are within each grid cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree = shapely.STRtree(gdf_unique.loc[gdf_unique.nature=='yes'].geometry)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can use the apply function to look for each of our grid how many photos are taken within each grid cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid['nature'] = grid.geometry.apply(lambda x: len(tree.query(x,predicate='contains_properly')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can plot it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid = grid.loc[grid.nature > 0]\n", "grid.reset_index(drop=True,inplace=True)\n", "\n", "ax = grid.plot(column='nature',figsize=(15, 3),legend=True,alpha=0.5)\n", "#ax.set_xticks([])\n", "#ax.set_yticks([])\n", "#ax.set_axis_off()\n", "cx.add_basemap(ax, zoom=12) #depending on the size of your area, you may need to change the zoom level if you run into an error.\n", "\n", "#ax.set_title(\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Question 6: Describe the results of the \"clustering\" through using the grid-based approach. Are you already able to identify some areas that seem to be most preferred? Can you identify them by using, for example, Google Maps? Does it surprise you?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DBSCAN\n", "The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape. The central component to the DBSCAN is the concept of **core samples**, which are samples that are in areas of *high density*. A cluster is therefore a set of **core samples**, each close to each other (measured by some distance measure) and a set of *non-core samples* that are close to a **core sample** (but are not themselves core samples). There are two parameters to the algorithm, `min_samples` and `eps`, which define formally what we mean when we say dense. Higher `min_samples` or `lower eps` indicate higher density necessary to form a cluster.\n", "\n", "More formally, we define a **core sample** as being a sample in the dataset such that there exist `min_samples` other samples within a distance of `eps`, which are defined as neighbors of the **core sample**. This tells us that the **core sample** is in a dense area of the vector space. A cluster is a set of **core samples** that can be built by recursively taking a **core sample**, finding all of its neighbors that are **core samples**, finding all of their neighbors that are **core samples**, and so on. A cluster also has a set of *non-core samples*, which are samples that are neighbors of a **core sample** in the cluster but are not themselves **core samples**. Intuitively, these samples are on the fringes of a cluster.\n", "\n", "Any **core sample** is part of a cluster, by definition. Any sample that is *not* a **core sample**, and is at least `eps` in distance from any **core sample**, is considered an outlier by the algorithm.\n", "\n", "The first step is to make sure we get an array of the coordinates of each cells." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coords = np.array(list(zip(gdf_to_plot.geometry.y.values,gdf_to_plot.geometry.x.values)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can run the **DBSCAN** algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epsilon = 500 #/ kms_per_radian\n", "db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree').fit((coords)) #, metric='haversine'\n", "cluster_labels = db.labels_\n", "num_clusters = len(set(cluster_labels))\n", "clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters) if len(coords[cluster_labels == n]) > 0])\n", "print('Number of clusters: {}'.format(num_clusters))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can play around with the `minimum samples` and the `epsilon`. Once you are satisfied, we can create polygons of the outline of these clusters. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cluster_gdf = gpd.GeoDataFrame(clusters.apply(lambda x : shapely.convex_hull(shapely.multipoints(np.flip(x,axis=1)))),columns=['geometry'])\n", "cluster_gdf['cluster_size'] = clusters.apply(lambda x : len(x))\n", "cluster_gdf.geometry = cluster_gdf.buffer(100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And plot this on a map! Do you see differences compared to the grid-based approach?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cluster_to_plot = cluster_gdf.copy()\n", "cluster_to_plot.crs = 'epsg:3857'\n", "#cluster_to_plot.reset_index(drop=True,inplace=True)\n", "\n", "ax = cluster_to_plot.plot(column='cluster_size',figsize=(15, 3),legend=True,alpha=0.5)\n", "#ax.set_xticks([])\n", "#ax.set_yticks([])\n", "#ax.set_axis_off()\n", "cx.add_basemap(ax, zoom=12) #depending on the size of your area, you may need to change the zoom level if you run into an error.\n", "\n", "#ax.set_title(\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Question 7: Upload the map with the results of your DBSCAN clustering.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Question 8: Describe the outcome of your DBSCAN clustering approach. Which parameter settings did you use? What did you choose for epsilon and which value did you choose as the minimum number of clusters? How do the results differ/to what extent are they the same as the grid-based approach? \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Do people prefer certain land uses?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We already started to understand our data a little bit better. However, it would be interesting to combine this with some land-use information, to see if we can find some patterns over there.\n", "\n", "Let's use the land-use information from OpenStreetMap to do so (similar to Tutorial 1 in Week 4). As you will see in the cell below, we use the tags *\"landuse\"* and *\"natural\"*. We need to use the *\"natural\"* tag to ensure we also obtain water bodies and other natural elements. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tags = {'landuse': True, 'natural': True} \n", "landuse = ox.features_from_place(place_name, tags)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To ensure we really only get the area that we want, we use geopandas's `clip` function to only keep the area we want. This function does exactly the same as the `clip` function in QGIS." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "landuse = landuse.clip(area)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To more easily work with the data, we want all information in a single column. However, at the moment, all information that was tagged as *\"natural\"*, has no information stored in the *\"landuse\"* tags. It is, however, very convenient if we can just use a single column for further exploration of the data. \n", "\n", "To overcome this issue, we need to add the missing information to the landuse column, as done below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "landuse.natural.unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "landuse.loc[landuse.natural=='water','landuse'] = 'water'\n", "landuse.loc[landuse.natural=='beach','landuse'] = 'beach'\n", "landuse.loc[landuse.natural=='wetland','landuse'] = 'wetlands'\n", "...\n", "...\n", "...\n", "...\n", "landuse.loc[landuse.natural=='grassland','landuse'] = 'grass'\n", "\n", "\n", "landuse = landuse.dropna(subset=['landuse'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Question 9: Provide the list of unique natural area classifications within the OpenStreetMap data and provide the lines of code that you have written to make sure that all of them were included in the landuse column.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "landuse.crs = 'epsg:4326'\n", "landuse = landuse.to_crs('epsg:3857')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's overlap this landuse information with our photos" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "landuse = landuse[(landuse.geom_type == 'MultiPolygon') | (landuse.geom_type == 'Polygon')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And use the spatial index again to quickly (in computational terms) overlay the land-use information with our photos." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree_landuse = shapely.STRtree(landuse.geometry)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a little function to make good use of the `apply()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def find_landuse(row,tree,landuse):\n", " \n", " intersect = (tree.query(row,predicate='intersects'))\n", " \n", " if len(intersect) == 0:\n", " return 'water'\n", " else:\n", " return landuse.landuse.iloc[intersect][0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid['landuse'] = grid.geometry.apply(lambda x: find_landuse(x,tree_landuse,landuse))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's have a look at the results. Can we say something about preferences?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = grid.groupby('landuse').sum().plot.barh()\n", "ax.set_title('Amount of unique photos per land use')\n", "ax.set_xlabel('Amount of photos')\n", "ax.set_ylabel('Land Use')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Question 10: Describe the results of the overlay between OpenStreetMap and Flickr data. Does it surprise you? Is it in line to what you observed from the previous grid-based clustering and the DBSCAN clustering? \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }