Tutorial 2: Social Media & Valuation of Landscapes#
Within this tutorial, we will work with the Flickr photo database to better understand how people value specific land use types. We will learn how to extract information from Flickr, how to explore and visualize it, and how to use it for some basic analysis.
Important before we start#
Make sure that you save a copy of this file before you continue, or you will lose your work. To do so, go to File (Bestand) and click Save a copy in Drive (Een kopie opslaan in Drive)!
Now, rename the file to Week6_Tutorial2.ipynb. You can do so by clicking on the name at the top of this screen.
Learning Objectives#
To understand how we can use public data to better understand people’s preferences for leisure.
To know how to extract data from Flickr using an API.
To know how to clean and prepare raw data for analysis.
To know how to cluster geospatial information.
To visualize clusters and point data in a meaningful way.
To know how to combine different spatial datasets to gain additional insights.
1. Introducing the packages#
Within this tutorial, we are going to make use of the following packages:
flickrapi is a Python interface to the Flickr API. It includes support for authorized and non-authorized access, uploading and replacing photos, and all Flickr API functions.
GeoPandas is a Python package that extends the datatypes used by pandas to allow spatial operations on geometric types.
NumPy is a Python library that provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays.
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
OSMnx is a Python package that lets you download geospatial data from OpenStreetMap and model, project, visualize, and analyze real-world street networks and any other geospatial geometries. You can download and model walkable, drivable, or bikeable urban networks with a single line of Python code then easily analyze and visualize them. You can just as easily download and work with other infrastructure types, amenities/points of interest, building footprints, elevation data, street bearings/orientations, and speed/travel time.
Matplotlib is a comprehensive Python package for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.
We first need to install these packages. Run the cell below to pip install them (uncomment the lines first if they are commented out).
!pip install flickrapi
!pip install osmnx
!pip install contextily
Now we will import these packages in the cell below:
import flickrapi
import pandas as pd
import contextily as cx
import geopandas as gpd
import osmnx as ox
import numpy as np
import shapely
from sklearn.cluster import DBSCAN
from datetime import datetime
2. Extract data from Flickr#
To extract data from Flickr, we will use their API. This is an acronym for application programming interface. It is a software intermediary that allows two applications to talk to each other. APIs are an accessible way to extract and share data within and across organizations. APIs are all around us. Every time you use a rideshare app, send a mobile payment, or change the thermostat temperature from your phone, you’re using an API.
However, before we can access this API, we need to take a few steps. Most importantly, we need to register ourselves on the Flickr portal, as explained in the video clip below:
Now, the next step is to access the API. Log in on the Flickr website and go to the API part of the website. Click on APPLY FOR A NON-COMMERCIAL KEY and fill in some information about the course. The name of the app can be anything related to our big data course, and you just need to fill in two lines of text describing what you want to do. As soon as you click SUBMIT, you will see a generated api_key and an api_secret. Fill those in below.
api_key = ''
api_secret=''
flickr = flickrapi.FlickrAPI(api_key, api_secret, cache=True)
Prepare data extraction#
Before we get started with the extraction of data, we again need to specify an area of interest to focus on. In the cell below, you will see “Ameland, The Netherlands”. Change this to any city or municipality in the Netherlands that (1) you can think of and (2) will work.
In some cases, the function does not recognize the location. You can either try a different phrasing or a different location; most places in the Netherlands should work, and it is also fine to use an area outside of the Netherlands. Do make sure the area is not too large, as extracting the photos will otherwise take a long time.
place_name = "Ameland, The Netherlands"
area = ox.geocode_to_gdf(place_name)
Now let us visualize the bounding box of the area, similar to some of our previous tutorials.
area_to_check = area.to_crs(epsg=3857)
ax = area_to_check.plot(figsize=(10, 10), color="none", edgecolor="k", linewidth=4)
ax.set_xticks([])
ax.set_yticks([])
ax.set_axis_off()
cx.add_basemap(ax, zoom=11) #depending on the size of your area, you may need to change the zoom level if you run into an error.
size = int(area_to_check.area.iloc[0]/1e6) # total area in km2
ax.set_title("{}. Total area: {} km2".format(place_name,size),fontweight='bold')
Data extraction#
To extract the metadata of the photos from Flickr, we are going to use the .walk() function. This function will return a list of photos matching some criteria. It allows you to search photos based on location, date, tags and so on.
In our case, we state that we want to find all photos, no matter how they are tagged: we pass tag_mode='any' and do not specify any tags. Moreover, as you will see, we also specify that we want some extra information. More specifically, we ask for the geo(location), tags, date_taken and the url_m of each photo.
We use a bounding box (bbox) to make sure we only select photos from our region of interest. The input requires our bounding box to be in a string format, so let’s create that first.
bbox_string = ",".join([str(x) for x in area.bounds.values[0]])
And now we can extract all the photos. We collect all of them in a list through the .append() function.
collect_pictures = []
for photo in flickr.walk(tag_mode='any',
#tags='nature',
bbox=bbox_string,extras='geo,tags,date_taken,url_m'):
get_attributes = photo.attrib
collect_pictures.append([get_attributes['id'],
get_attributes['owner'],
get_attributes['datetaken'],
get_attributes['tags'],
get_attributes['latitude'],
get_attributes['longitude']])
len(collect_pictures)
3. Explore the data#
Now that we have extracted the data, let's explore (and clean) it a bit more. The convenient thing about having everything stored in a list is that we can easily turn it into a pandas DataFrame.
df = pd.DataFrame(collect_pictures,columns=['id','owner','datetaken','tags','latitude','longitude'])
df.to_csv('Ameland_Flickr.csv')
Now let’s explore the data a little bit:
df.head(XXX)
And let’s see if everything is stored in a format in which we can work with:
df.dtypes
We would expect floating values for the latitude and longitude, and a datetime object for the datetaken. Let’s have a look how these are stored:
df.longitude.iloc[0]
Aha! Strings. It will be difficult to turn strings into proper geometries, so let’s convert these columns to floating values and then convert them to points using shapely.points(). As you will see, we use a list comprehension to do so. The input for shapely.points() is a list or numpy array, in which each element contains a longitude and a latitude. To make sure we have that, we create this list using the zip function in Python.
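As a quick illustration of what zip does (with made-up coordinates, not our actual data):
# zip pairs the i-th longitude with the i-th latitude into tuples (toy values).
list(zip([5.1, 5.2], [53.4, 53.5]))  # returns [(5.1, 53.4), (5.2, 53.5)]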
df['longitude'] = df['longitude'].astype(float)
df['latitude'] = df['latitude'].astype(float)
df['geometry'] = [shapely.points(x[0],x[1]) for x in list(zip(df['longitude'],df['latitude']))]
Now let’s convert the datetaken column into a datetime type, so we can extract specific years or days. To do so, we will make use of the lambda and apply functions, and use the fromisoformat() function from within the datetime package.
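As a quick, standalone illustration of what fromisoformat() does, using the datetime class we imported above (the timestamp below is a made-up example, but it follows the same format as the datetaken column):
# Parse an ISO-formatted timestamp string into a datetime object (toy example).
example = datetime.fromisoformat("2021-07-14 16:02:31")
print(example.year, example.month, example.day)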
df['datetaken'] = df.datetaken.XXXX(XXXX x : datetime.XXXX(x))
df['year'] = df.datetaken.dt.year
df['month'] = df.datetaken.dt.strftime('%b')
df['month_year'] = df['datetaken'].dt.to_period('M')
Now that we have the dates in the correct format, we can plot a figure to identify when most of the photos were taken/uploaded.
df.year.value_counts().plot(kind='bar')
As we are dealing with spatial data, it would be nice to plot this information on a map. To do so, we convert our pandas.DataFrame into a geopandas.GeoDataFrame. Moreover, we have to specify the coordinate reference system. Given that Flickr provides latitude/longitude coordinates (a global dataset), it makes most sense to use epsg:4326.
gdf = gpd.GeoDataFrame(df.copy())
gdf.crs = 'epsg:4326'
gdf
gdf_to_plot = gdf.to_crs(epsg=3857)
ax = gdf_to_plot.plot(column='year',figsize=(15, 3),legend=True)
ax.set_xticks([])
ax.set_yticks([])
ax.set_axis_off()
cx.add_basemap(ax, zoom=12) #depending on the size of your area, you may need to change the zoom level if you run into an error.
ax.set_title("")
Let’s have a look at how all these photos are tagged, and whether we can actually do something with this information.
gdf['tags']
This looks like a mess. It seems we have some work to do to be able to use some of this. Let’s get an overview of all the tags and an idea of how often certain tags are used.
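Below, we first split each photo’s tag string into separate tags and then flatten the resulting list of lists. A quick illustration of those two steps with toy strings (not real tags):
# Splitting space-separated tag strings into individual tags (toy strings).
tag_lists = [s.split() for s in ['beach sunset', 'nature bird']]
# Flattening the list of lists into a single list of tags.
[tag for sublist in tag_lists for tag in sublist]  # ['beach', 'sunset', 'nature', 'bird']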
find_all_tags = []
for row in gdf.tags:
find_all_tags.append(row.split())
all_tags = [item for sublist in find_all_tags for item in sublist]
pd.Series(all_tags).value_counts().head(50)
So we see that quite a lot of the tags don’t really say much about the area or why people might visit it. Let’s give it a go by trying to see how many pictures are tagged with something linked to nature. Add more words if you believe you can identify more in the previous overview.
def find_nature_tags(row):
matches = ["seehund", "zeehond","vis","wadden",
"natuur","nature","natur",
"landschaft",
"strand","beach","zee","sea","meer",
"bos","forest",
"animal","bird","vogel","dier"]
overlap = set(row.split()).intersection(set(matches))
if len(overlap) :
return 'yes'
gdf['nature'] = gdf.tags.apply(lambda x: find_nature_tags(x))
Let’s count them.
len(gdf.loc[gdf.nature=='yes'])
And let’s only plot those points on a map
gdf_to_plot = gdf.to_crs(epsg=3857)
gdf_to_plot = gdf_to_plot.loc[gdf_to_plot.nature == 'yes']
gdf_to_plot.reset_index(drop=True,inplace=True)
ax = gdf_to_plot.plot(column='year',figsize=(15, 3),legend=True)
#ax.set_xticks([])
#ax.set_yticks([])
#ax.set_axis_off()
cx.add_basemap(ax, zoom=12) #depending on the size of your area, you may need to change the zoom level if you run into an error.
ax.set_title("")
What we also notice is that some users seem to upload quite a lot of pictures. Let’s have a look at the unique users in this area and how many photos each of them has uploaded:
gdf_to_plot.owner.value_counts()
And have a look at one of the users with the most pictures:
gdf_to_plot.loc[gdf_to_plot.owner == 'XXXXX']
OK, so it seems that several users dominate the number of uploads. If we want to say something about people’s preferences for locations to visit, we may have to compensate for this. To do so, we can make use of the groupby function. Which columns would you use to group on while still keeping enough unique entries? And which aggregation function would you choose: first, last or mean?
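To see the mechanics of groupby with a small example (made-up toy data, not the Flickr columns, so it does not give away the answer):
# Toy DataFrame: two owners, each with several rows (hypothetical data).
toy = pd.DataFrame({'owner': ['a', 'a', 'b', 'b'],
                    'day':   [1,   1,   2,   2],
                    'value': [10,  20,  30,  40]})
# Keep one entry per owner/day combination, taking the first value of the
# remaining columns; reset_index() turns the group keys back into columns.
toy.groupby(['owner', 'day']).first().reset_index()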
gdf_unique = gdf_to_plot.groupby(['XXX','XXX','XXX']).XXXX().reset_index()
gdf_unique
4. Clustering of data#
As it is pretty difficult to see any patterns on the maps above, let’s try to cluster some of this information and see if this helps us to better understand why people choose to visit certain locations in the area.
Create a grid of the area#
The simplest way would be to collect all points within a grid. The create_grid() function below will help us to create an evenly-spaced grid within any given area. So let’s use that!
def create_grid(bbox,height):
    """Create a vector-based grid of square cells covering a bounding box.

    Args:
        bbox: geometry (or GeoSeries) whose total bounds define the grid extent.
        height: width/height of each grid cell, in the units of the bbox coordinates.

    Returns:
        A numpy array of shapely polygons, one per grid cell.
    """
# set xmin,ymin,xmax,and ymax of the grid
xmin, ymin = shapely.total_bounds(bbox)[0],shapely.total_bounds(bbox)[1]
xmax, ymax = shapely.total_bounds(bbox)[2],shapely.total_bounds(bbox)[3]
#estimate total rows and columns
rows = int(np.ceil((ymax-ymin) / height))
cols = int(np.ceil((xmax-xmin) / height))
# set corner points
x_left_origin = xmin
x_right_origin = xmin + height
y_top_origin = ymax
y_bottom_origin = ymax - height
# create actual grid
res_geoms = []
for countcols in range(cols):
y_top = y_top_origin
y_bottom = y_bottom_origin
for countrows in range(rows):
res_geoms.append((
((x_left_origin, y_top), (x_right_origin, y_top),
(x_right_origin, y_bottom), (x_left_origin, y_bottom)
)))
y_top = y_top - height
y_bottom = y_bottom - height
x_left_origin = x_left_origin + height
x_right_origin = x_right_origin + height
return shapely.polygons(res_geoms)
And apply this within our area. We need to specify the height of each cell (the cell size). Note that the area is still in EPSG:4326, so the height is given in degrees (0.01 degrees is on the order of a kilometre at Dutch latitudes). Choose a size that doesn't make the grid too large, but still provides enough cells within the area to see some spatial variation.
grid = pd.DataFrame(create_grid(area.geometry,XXXX),columns=['geometry'])
Let’s see what this grid looks like.
gpd.GeoDataFrame(grid.copy()).plot(edgecolor='black')
Now let’s make sure we georeference the data and convert it to EPSG:3857 (the same as the Flickr data).
grid = gpd.GeoDataFrame(grid.copy())
grid.crs = 'epsg:4326'
grid = grid.to_crs(epsg=3857)
To very efficiently find the overlap between two geospatial datasets, we make use of spatial queries. You can read more about the function we are using here and about the R-tree (the data structure behind these very efficient spatial queries) here.
We start with building the tree from our photos. We want to quickly see how many are within each grid cell.
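Before we do, here is a small standalone illustration of how such a tree query works (toy geometries, not our data):
# Build a tree from two toy points and query it with a box (hypothetical data).
toy_tree = shapely.STRtree([shapely.Point(0, 0), shapely.Point(5, 5)])
toy_box = shapely.box(-1, -1, 1, 1)
# Returns the positional indices of the tree geometries for which the predicate
# holds; here only the first point (index 0) lies properly inside the box.
toy_tree.query(toy_box, predicate='contains_properly')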
tree = shapely.STRtree(gdf_unique.loc[gdf_unique.nature=='yes'].geometry)
And now we can use the apply function to count, for each grid cell, how many photos were taken within it.
grid['nature'] = grid.geometry.apply(lambda x: len(tree.query(x,predicate='contains_properly')))
And now we can plot it.
grid = grid.loc[grid.nature > 0]
grid.reset_index(drop=True,inplace=True)
ax = grid.plot(column='nature',figsize=(15, 3),legend=True,alpha=0.5)
#ax.set_xticks([])
#ax.set_yticks([])
#ax.set_axis_off()
cx.add_basemap(ax, zoom=12) #depending on the size of your area, you may need to change the zoom level if you run into an error.
#ax.set_title("")
DBSCAN#
The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure), and a set of non-core samples that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.
More formally, we define a core sample as being a sample in the dataset such that there exist min_samples other samples within a distance of eps, which are defined as neighbors of the core sample. This tells us that the core sample is in a dense area of the vector space. A cluster is a set of core samples that can be built by recursively taking a core sample, finding all of its neighbors that are core samples, finding all of their neighbors that are core samples, and so on. A cluster also has a set of non-core samples, which are samples that are neighbors of a core sample in the cluster but are not themselves core samples. Intuitively, these samples are on the fringes of a cluster.
Any core sample is part of a cluster, by definition. Any sample that is not a core sample, and is at least eps in distance from any core sample, is considered an outlier by the algorithm.
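As a small standalone illustration of how eps and min_samples shape the result (toy coordinates, unrelated to our Flickr data):
# Six toy points: two dense groups and one isolated point (hypothetical data).
toy_pts = np.array([[0, 0], [0, 1], [1, 0],
                    [10, 10], [10, 11],
                    [50, 50]])
# With eps=2 and min_samples=2, the two groups become clusters (labels 0 and 1),
# while the isolated point is labelled -1 (an outlier).
DBSCAN(eps=2, min_samples=2).fit(toy_pts).labels_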
The first step is to make sure we get an array with the coordinates of each photo.
coords = np.array(list(zip(gdf_to_plot.geometry.y.values,gdf_to_plot.geometry.x.values)))
And now we can run the DBSCAN algorithm.
epsilon = 500 # maximum neighbourhood distance, in the units of our coordinates (metres in EPSG:3857)
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree').fit((coords)) #, metric='haversine'
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters) if len(coords[cluster_labels == n]) > 0])
print('Number of clusters: {}'.format(num_clusters))
You can play around with min_samples and epsilon. Once you are satisfied, we can create polygons of the outline of these clusters by taking the convex hull of each cluster’s points.
cluster_gdf = gpd.GeoDataFrame(clusters.apply(lambda x : shapely.convex_hull(shapely.multipoints(np.flip(x,axis=1)))),columns=['geometry'])
cluster_gdf['cluster_size'] = clusters.apply(lambda x : len(x))
cluster_gdf.geometry = cluster_gdf.buffer(100) # add a 100 metre buffer around each cluster outline (coordinates are in EPSG:3857)
And plot this on a map! Do you see differences compared to the grid-based approach?
cluster_to_plot = cluster_gdf.copy()
cluster_to_plot.crs = 'epsg:3857'
#cluster_to_plot.reset_index(drop=True,inplace=True)
ax = cluster_to_plot.plot(column='cluster_size',figsize=(15, 3),legend=True,alpha=0.5)
#ax.set_xticks([])
#ax.set_yticks([])
#ax.set_axis_off()
cx.add_basemap(ax, zoom=12) #depending on the size of your area, you may need to change the zoom level if you run into an error.
#ax.set_title("")
5. Do people prefer certain land uses?#
We have already started to understand our data a little bit better. However, it would be interesting to combine it with some land-use information, to see if we can find some patterns there.
Let’s use the land-use information from OpenStreetMap to do so (similar to Tutorial 1 in Week 4). As you will see in the cell below, we use the tags “landuse” and “natural”. We need to use the “natural” tag to ensure we also obtain water bodies and other natural elements.
tags = {'landuse': True, 'natural': True}
landuse = ox.features_from_place(place_name, tags)
To ensure we really only get the area that we want, we use the GeoPandas clip function. This function does exactly the same as the clip function in QGIS.
landuse = landuse.clip(area)
To more easily work with the data, we want all information in a single column. At the moment, however, everything that was tagged as “natural” has no information stored in the “landuse” column, while it is very convenient to have a single column for further exploration of the data.
To overcome this issue, we add the missing information to the landuse column, as done below.
landuse.natural.unique()
landuse.loc[landuse.natural=='water','landuse'] = 'water'
landuse.loc[landuse.natural=='beach','landuse'] = 'beach'
landuse.loc[landuse.natural=='wetland','landuse'] = 'wetlands'
...
...
...
...
landuse.loc[landuse.natural=='grassland','landuse'] = 'grass'
landuse = landuse.dropna(subset=['landuse'])
landuse.crs = 'epsg:4326'
landuse = landuse.to_crs('epsg:3857')
Now let’s overlay this land-use information with our photos. First, we keep only the polygon geometries:
landuse = landuse[(landuse.geom_type == 'MultiPolygon') | (landuse.geom_type == 'Polygon')]
And use the spatial index again to quickly (in computational terms) overlay the land-use information with our photos.
tree_landuse = shapely.STRtree(landuse.geometry)
We create a little function to make good use of the apply() function.
def find_landuse(row,tree,landuse):
intersect = (tree.query(row,predicate='intersects'))
if len(intersect) == 0:
return 'water'
else:
return landuse.landuse.iloc[intersect][0]
grid['landuse'] = grid.geometry.apply(lambda x: find_landuse(x,tree_landuse,landuse))
And let’s have a look at the results. Can we say something about preferences?
ax = grid.groupby('landuse')['nature'].sum().plot.barh() # sum the photo counts per land-use class
ax.set_title('Amount of unique photos per land use')
ax.set_xlabel('Amount of photos')
ax.set_ylabel('Land Use')