{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "TvpyqeJKuojz" }, "source": [ "# Tutorial 2: Introduction to NumPy and Pandas\n", "\n", "NumPy and Pandas are two of the most widely used packages within our discipline. This tutorial will focus on teaching you the basics of both of them.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Note: This tutorial is heavily based upon the work of others\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Important: This tutorial is not part of your final grade. You simply have to pass it by answering the questions.\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "wrZ795jHwmTT" }, "source": [ "## Important before we start\n", "
\n", "Make sure that you save this file before you continue, else you will lose everything. To do so, go to Bestand/File and click on Een kopie opslaan in Drive/Save a Copy on Drive!\n", "\n", "Now, rename the file into Week1_Tutorial2.ipynb. You can do so by clicking on the name in the top of this screen." ] }, { "cell_type": "markdown", "metadata": { "id": "zCAr_t4huoj2" }, "source": [ "

Tutorial Outline

\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "xTogdBmyuoj3" }, "source": [ "## Learning Objectives\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "FumDwzf2uoj3" }, "source": [ "**NumPy**\n", "- Use NumPy to create arrays with built-in functions inlcuding `np.array()`, `np.arange()`, `np.linspace()` and `np.full()`, `np.zeros()`, `np.ones()`\n", "- Be able to access values from a NumPy array by numeric indexing, slicing, and boolean indexing\n", "- Perform mathematical operations on and with arrays.\n", "- Explain what broadcasting is and how to use it.\n", "- Reshape arrays by adding/removing/reshaping axes with `.reshape()`, `np.newaxis()`, `.ravel()`, `.flatten()`\n", "- Understand how to use built-in NumPy functions like `np.sum()`, `np.mean()`, `np.log()` as stand alone functions or as methods of numpy arrays (when available)\n", "\n", "**Pandas**\n", "- Create Pandas series with `pd.Series()` and Pandas dataframe with `pd.DataFrame()`\n", "- Be able to access values from a Series/DataFrame by numeric indexing, slicing and boolean indexing using notation such as `df[]`, `df.loc[]`, `df.iloc[]`, `df.query[]`\n", "- Perform basic arithmetic operations between two Pandas series and anticipate the result.\n", "- Describe how Pandas assigns dtypes to Series and what the `object` dtype is\n", "- Read a standard .csv file from a local path or url using Pandas `pd.read_csv()`.\n", "- Explain the relationship and differences between `np.ndarray`, `pd.Series` and `pd.DataFrame` objects in Python." ] }, { "cell_type": "markdown", "metadata": { "id": "Evmn3jMQwY2T" }, "source": [ "## Performing the exercise\n", "You will do so by performing the scripts in this Python Jupyter notebook. To run any script in the code-boxes below use *crtl+enter*. In some instances, it is necessary to make changes to particular pieces of code in the code-boxes. When this is the case, this will be asked to you before you arrive at the actual script. This is indicated by means of an **Action**. When a script is running this is indicated by the * on the left side of the window." 
] }, { "cell_type": "markdown", "metadata": { "id": "NyBjB3Aauoj4" }, "source": [ "## 1. Introduction to Python Packages\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "Qbh5kjApuoj5" }, "source": [ "Packages are an essential building block in programming. Without packages, we’d spend lots of time writing code that’s already been written. Imagine having to write code from scratch every time you wanted to parse a file in a particular format. You’d never get anything done! That’s why we always want to use packages.\n", "\n", "To understand Python packages, we’ll briefly need to look at scripts and modules. A *script* is something you execute in the shell to accomplish a defined task. To write a script, you’d type your code into your favorite text editor and save it with the .py extension. You can then use the python command in a terminal to execute your script. \n", "\n", "A module on the other hand is a Python program that you import, either in interactive mode or into your other programs. “Module” is really an umbrella term for reusable code.\n", "\n", "A Python package usually consists of several modules. Physically, a package is a folder containing modules and maybe other folders that themselves may contain more folders and modules. Conceptually, it’s a namespace. This simply means that a package’s modules are bound together by a package name, by which they may be referenced.\n", "\n", "The packages we will use for this tutorial include:\n", "\n", "[**OS**](https://docs.python.org/3/library/os.html) is a python module that provides a portable way of using operating system dependent functionality i.e. manipulating paths\n", "\n", "[**NumPy**](https://www.labri.fr/perso/nrougier/from-python-to-numpy/) stands for \"Numerical Python\" and it is the standard Python library used for working with arrays (i.e., vectors & matrices), linear algerba, and other numerical computations. 
Much of NumPy is written in C, making NumPy arrays faster and more memory efficient than Python lists.\n", "\n", "[**Pandas**](https://pypi.org/project/pandas/) is the most popular Python library for tabular data structures. You can think of Pandas as an extremely powerful version of Excel (but free and with a lot more features!) \n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "28Wd0sbRuoj5" }, "source": [ "### Importing a package\n", "\n", "We’ll import a package using the `import` statement:\n", "\n", "```python\n", "import \n", "```" ] }, { "cell_type": "markdown", "metadata": { "id": "DYQM7fq3uoj6" }, "source": [ "## 2. Introduction to NumPy\n", "
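As a minimal sketch of the `import` statement described in the previous section, the example below imports the standard-library `os` package once by its full name and once under an alias. The folder and file names (`data`, `week1`, `table.csv`) are hypothetical and only serve as an illustration; nothing is read from disk.

```python
import os                # import a package by its full name
import os.path as osp    # import a submodule under a shorter alias

# Build a path in a portable way (the separator depends on your operating system).
# The folder and file names are made up for this illustration.
joined = os.path.join('data', 'week1', 'table.csv')
file_name = osp.basename(joined)

print(joined)
print(file_name)
```

The alias form is the same mechanism used later in this tutorial when NumPy is imported as `np`.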
" ] }, { "cell_type": "markdown", "metadata": { "id": "7GUheqp_uoj8" }, "source": [ "NumPy can be installed using `pip` (in google colab pandas and numpy packages are already installed, hence, we will skip this part):\n", "\n", "```\n", "!pip install numpy\n", "\n", "```\n" ] }, { "cell_type": "markdown", "metadata": { "id": "M9lL2LvAuoj8" }, "source": [ "## 3. NumPy Arrays\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "0GCdFBMZuoj8" }, "source": [ "### What are Arrays?" ] }, { "cell_type": "markdown", "metadata": { "id": "RUoeVaHFuoj9" }, "source": [ "Arrays are \"n-dimensional\" data structures that can contain all the basic Python data types, e.g., floats, integers, strings etc, but work best with numeric data. NumPy arrays (\"ndarrays\") are homogenous, which means that items in the array should be of the same type. ndarrays are also compatible with numpy's vast collection of in-built functions!\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XsdCAtH_uoj9" }, "source": [ "Usually we import numpy with the alias `np` (to avoid having to type out n-u-m-p-y every time we want to use it):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "HN90DfOPuoj9" }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "id": "hPa7T92Kuoj-" }, "source": [ "A numpy array is sort of like a list:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.full((5, 5), 1)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,\n", " 18, 19, 20])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.arange(1,21)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 193, "status": "ok", "timestamp": 1674648808352, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "bZKfprDVuoj-", "outputId": 
"5ee61e1d-4e3a-43e5-fc95-61adcd83134f" }, "outputs": [], "source": [ "my_list = [1, 2, 3, 4, 5]\n", "my_list" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674648809466, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "9LJKmXIhuoj_", "outputId": "2b39c8d0-749f-4661-b17f-c7220f9d3808" }, "outputs": [], "source": [ "my_array = np.array([1, 2, 3, 4, 5])\n", "my_array" ] }, { "cell_type": "markdown", "metadata": { "id": "QcNJldJ0uoj_" }, "source": [ "But it has the type `ndarray`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 7, "status": "ok", "timestamp": 1674648810761, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "GNzFEwOCuoj_", "outputId": "bbd6f05f-1887-4b24-b516-99b7f0e0f5bb" }, "outputs": [], "source": [ "type(my_array)" ] }, { "cell_type": "markdown", "metadata": { "id": "egjs_Y3vuoj_" }, "source": [ "Unlike a list, arrays can only hold a single type (usually numbers):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648812424, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "SUKm_u08uokA", "outputId": "d56a25eb-74b9-483f-f348-882d7b0c06f0" }, "outputs": [], "source": [ "my_list = [1, \"hi\"]\n", "my_list" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 8, "status": "ok", "timestamp": 1674648812667, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "X8WYojGPuokA", "outputId": 
"2f7d6360-7221-4784-8be6-6a398078e793" }, "outputs": [], "source": [ "my_array = np.array((1, \"hi\"))\n", "my_array" ] }, { "cell_type": "markdown", "metadata": { "id": "uZqX4cFUuokA" }, "source": [ "Above: NumPy converted the integer `1` into the string `'1'`!" ] }, { "cell_type": "markdown", "metadata": { "id": "Lu_CN10HuokA" }, "source": [ "### Creating arrays\n", "\n", "ndarrays are typically created using two main methods:\n", "1. From existing data (usually lists or tuples) using `np.array()`, like we saw above; or,\n", "2. Using built-in functions such as `np.arange()`, `np.linspace()`, `np.zeros()`, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 201, "status": "ok", "timestamp": 1674648814810, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "wdlN9l49uokA", "outputId": "93507f60-c0f2-4349-ccb7-a15d819081ea" }, "outputs": [], "source": [ "my_list = [1, 2, 3]\n", "np.array(my_list)" ] }, { "cell_type": "markdown", "metadata": { "id": "h8tnzSHwuokB" }, "source": [ "Just like you can have \"multi-dimensional lists\" (by nesting lists in lists), you can have multi-dimensional arrays (indicated by double square brackets `[[ ]]`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1674648816184, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "OA2fygNiuokB", "outputId": "a7f64065-4c42-47f9-e5e0-525b4c718235" }, "outputs": [], "source": [ "list_2d = [[1, 2], [3, 4], [5, 6]]\n", "list_2d" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648816485, "user": { "displayName": "RA Odongo", "userId": 
"17326618845752559881" }, "user_tz": -60 }, "id": "g5gXVeNauokB", "outputId": "47ac6f0d-13a4-4d00-801b-b94e04763b29" }, "outputs": [], "source": [ "array_2d = np.array(list_2d)\n", "array_2d" ] }, { "cell_type": "markdown", "metadata": { "id": "ttC71TGfuokB" }, "source": [ "You'll probably use the built-in numpy array creators quite often. Here are some common ones (hint - don't forget to check the docstrings for help with these functions, if you're in Jupyter, remeber the `shift + tab` shortcut):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5, "status": "ok", "timestamp": 1674648817727, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "mHlouZykuokB", "outputId": "a0019e3d-ebb6-46d5-a76a-0acc4fa33456" }, "outputs": [], "source": [ "np.arange(1, 5) # from 1 inclusive to 5 exclusive. The counting always starts from 0 and not 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648818272, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "kGFXgWzEuokC", "outputId": "08b17cfc-ecb2-4699-bbab-852dda38f861" }, "outputs": [], "source": [ "np.arange(0, 11, 2) # step by 2 from 1 to 11" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648818460, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "8SMR5VbWuokC", "outputId": "555e034c-a7af-4db0-8f5d-25d7fdf54088" }, "outputs": [], "source": [ "np.linspace(0, 10, 5) # 5 equally spaced points between 0 and 10" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": 
"https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648819007, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "XXyQpm1tuokC", "outputId": "5ced821d-6734-4892-829a-1e035debb823" }, "outputs": [], "source": [ "np.ones((2, 2)) # an array of ones with size 2 x 2. Always starts with the row then column" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648819229, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "WlJBQ9ZTuokC", "outputId": "9734d839-bda6-4cbf-8752-6648045c24bb" }, "outputs": [], "source": [ "np.zeros((2, 3)) # an array of zeros with size 2 x 3" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 8, "status": "ok", "timestamp": 1674648819918, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "XPibakhNuokC", "outputId": "a1bc6055-54b4-4745-88e6-c90f5084361e" }, "outputs": [], "source": [ "np.full((3, 3), 3.14) # an array of the number 3.14 with size 3 x 3" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 9, "status": "ok", "timestamp": 1674648820437, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "t5V4AMqHuokD", "outputId": "c770ddc4-a14b-4418-deda-48427a27539e" }, "outputs": [], "source": [ "np.full((3, 3, 3), 3.14) # an array of the number 3.14 with size 3 x 3 x 3" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 199, "status": "ok", "timestamp": 1674648820936, "user": { "displayName": "RA 
Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "_c_R7CD6uokD", "outputId": "352d76c1-ec94-4a26-c678-265e4c5b68b5" }, "outputs": [], "source": [ "np.random.rand(5, 2) # random numbers uniformly distributed from 0 to 1 with size 5 x 2" ] }, { "cell_type": "markdown", "metadata": { "id": "ahXdXgsGuokD" }, "source": [ "There are many useful attributes/methods that can be called off numpy arrays:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674648821835, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "FJKvjfhNuokD", "outputId": "594ee416-0d10-4813-8d7e-b1b0716d62a2" }, "outputs": [], "source": [ "print(dir(np.ndarray))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674648822350, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "JjG7IbxauokD", "outputId": "b1497507-78c8-46a6-813c-58499a9458a4" }, "outputs": [], "source": [ "x = np.random.rand(5, 2)\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674648822651, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "-f2CDW_3uokD", "outputId": "fee733c2-11a7-4ffb-e15e-753114b0bad4" }, "outputs": [], "source": [ "x.transpose() # converts the rows to columns and vice versa" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674648823033, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, 
"id": "til6SuK8uokD", "outputId": "0db87c30-0ecd-4a83-fa1d-2e2d987eb5fd" }, "outputs": [], "source": [ "x.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648823312, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "98iDr8HYuokE", "outputId": "240223dc-06a3-473e-a23f-a468d2bdee58" }, "outputs": [], "source": [ "x.astype(int) # truncates to the nearest whole number" ] }, { "cell_type": "markdown", "metadata": { "id": "AMnYa6I2uokE" }, "source": [ "### Array Shapes" ] }, { "cell_type": "markdown", "metadata": { "id": "qSki7fLhuokE" }, "source": [ "As you just saw above, arrays can be of any dimension, shape and size you desire. In fact, there are three main array attributes you need to know to work out the characteristics of an array:\n", "- `.ndim`: the number of dimensions of an array\n", "- `.shape`: the number of elements in each dimension (like calling `len()` on each dimension)\n", "- `.size`: the total number of elements in an array (i.e., the product of `.shape`)\n", "*Python f-string is an amazing way to format strings by including a code within the string as shown below*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648824762, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "b4H8j7T2uokE", "outputId": "e70f0e6e-394c-4eca-e457-76af38faf102" }, "outputs": [], "source": [ "array_1d = np.ones(3)\n", "print(f\"Dimensions: {array_1d.ndim}\")\n", "print(f\" Shape: {array_1d.shape}\")\n", "print(f\" Size: {array_1d.size}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "zdp-62HauokE" }, "source": [ "Let's turn that print action into a function and try out some other arrays:" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": { "id": "4gOD7m-vuokE" }, "outputs": [], "source": [ "def print_array(x):\n", " print(f\"Dimensions: {x.ndim}\")\n", " print(f\" Shape: {x.shape}\")\n", " print(f\" Size: {x.size}\")\n", " print(\"\")\n", " print(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648826164, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "fndLqt0PuokE", "outputId": "fd3e7b01-3bba-44a9-aa45-cd127bcb8774" }, "outputs": [], "source": [ "array_2d = np.ones((3, 2))\n", "print_array(array_2d)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5, "status": "ok", "timestamp": 1674648826603, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "Vo0WyCvkuokF", "outputId": "956e6689-5bce-49c0-f87e-ed3263f4525d" }, "outputs": [], "source": [ "array_4d = np.ones((1, 2, 3, 4))\n", "print_array(array_4d)" ] }, { "cell_type": "markdown", "metadata": { "id": "fL3ceOCluokF" }, "source": [ "After 3 dimensions, printing arrays starts getting pretty messy. As you can see above, the number of square brackets (`[ ]`) in the printed output indicate how many dimensions there are: for example, above, the output starts with 4 square brackets `[[[[` indicative of a 4D array." ] }, { "cell_type": "markdown", "metadata": { "id": "__favtn5uokF" }, "source": [ "### 1-d Arrays" ] }, { "cell_type": "markdown", "metadata": { "id": "9Y6T3W1-uokF" }, "source": [ "One of the most confusing things about numpy is 1-d arrays (vectors) can have 3 possible shapes!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674648828346, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "6XgRq0RouokF", "outputId": "bfb117ba-845f-4b89-b1f2-ed289bb9cfa0" }, "outputs": [], "source": [ "x = np.ones(5)\n", "print_array(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1674648828647, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "mGragTPYuokF", "outputId": "089adb41-563f-4286-cfb5-13867562129e" }, "outputs": [], "source": [ "y = np.ones((1, 5))\n", "print_array(y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674648829083, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "4j110T3xuokF", "outputId": "1ae6f2f8-3dda-4f5b-fe26-c5eb00898c99" }, "outputs": [], "source": [ "z = np.ones((5, 1))\n", "print_array(z)" ] }, { "cell_type": "markdown", "metadata": { "id": "VxtdZwwsuokF" }, "source": [ "We can use `np.array_equal()` to determine if two arrays have the same shape and elements:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648829825, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "MzhgTMOpuokG", "outputId": "0766b3eb-faa0-43d0-f127-7f94ea83b2bb" }, "outputs": [], "source": [ "np.array_equal(x, x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" 
}, "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1674648830087, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "xE3scp3puokG", "outputId": "433b14e3-19c6-4da2-b2b7-fc55fa637824" }, "outputs": [], "source": [ "np.array_equal(x, y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 9, "status": "ok", "timestamp": 1674648830378, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "YS1MlikquokG", "outputId": "09517dd9-cbdc-4c15-a9fd-a8515b0a53e9" }, "outputs": [], "source": [ "np.array_equal(x, z)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648830729, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "ZxnPmmKDuokG", "outputId": "a5a79d14-b133-4221-8df4-5ec1f14fd5d4" }, "outputs": [], "source": [ "np.array_equal(y, z)" ] }, { "cell_type": "markdown", "metadata": { "id": "uXPMsfN4uokG" }, "source": [ "The shape of your 1-d arrays can actually have big implications on your mathematical operations!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674648831484, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "vEONsjoOuokG", "outputId": "bf6f59a4-1433-4202-a2aa-a49e39500c73" }, "outputs": [], "source": [ "print(f\"x: {x}\")\n", "print(f\"y: {y}\")\n", "print(f\"z: {z}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1674648832015, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "FCoHAkPFuokG", "outputId": "b94047f0-070a-41e4-edc9-8546e60a9ec9" }, "outputs": [], "source": [ "x + y # makes sense" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648832460, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "oxKeU4EFuokH", "outputId": "b1e71494-0c5a-4396-8298-fdfe1d263b37" }, "outputs": [], "source": [ "y + z # wait, what?" ] }, { "cell_type": "markdown", "metadata": { "id": "OR8h7Xg0uokH" }, "source": [ "What happened in the cell above is \"broadcasting\" and we'll discuss it below." ] }, { "cell_type": "markdown", "metadata": { "id": "4geIXfqruokH" }, "source": [ "## 4. Array Operations and Broadcasting\n", "
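As a minimal sketch of why `y + z` at the end of the previous section gave a 5 x 5 result, the example below recreates the three vectors from that section and only looks at the shapes of their sums. The names `sum_xy`, `sum_yz` and `sum_xz` are introduced here purely for illustration.

```python
import numpy as np

x = np.ones(5)        # shape (5,)
y = np.ones((1, 5))   # shape (1, 5): a single row
z = np.ones((5, 1))   # shape (5, 1): a single column

sum_xy = x + y   # (5,) is treated as (1, 5), so the result has shape (1, 5)
sum_yz = y + z   # the row is stretched down and the column across: shape (5, 5)
sum_xz = x + z   # same stretching as above: shape (5, 5)

print(sum_xy.shape)
print(sum_yz.shape)
print(sum_xz.shape)
```

Every element of each result is 2.0; only the shape differs, which is the effect this section explains.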
" ] }, { "cell_type": "markdown", "metadata": { "id": "V3LSsjPkuokH" }, "source": [ "### Elementwise operations\n", "\n", "Elementwise operations refer to operations applied to each element of an array or between the paired elements of two arrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648834028, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "UpOdfNCmuokH", "outputId": "12eb9ed9-eabf-45b2-91bb-b4f1d7f9c346" }, "outputs": [], "source": [ "x = np.ones(4)\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648834245, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "NET_HKMOuokH", "outputId": "21ea1bb3-c60c-4423-dbdf-1b27869db9dd" }, "outputs": [], "source": [ "y = x + 1\n", "y" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674648834680, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "Ltdw55MjuokH", "outputId": "19cbc94c-46d9-4895-e7d1-062b6cd0c397" }, "outputs": [], "source": [ "x - y # subtraction" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648835250, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "mw-IfCCquokH", "outputId": "6444fadf-e825-433a-a470-428f1d2d201a" }, "outputs": [], "source": [ "x == y # compares the two arrays" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { 
"base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 11, "status": "ok", "timestamp": 1674648835922, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "_Tk23dx6uokH", "outputId": "9eddc7a2-abc3-47df-f2f0-d6ea1f2f4471" }, "outputs": [], "source": [ "x * y # multiplication " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1674648835923, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "ILlSRzj2uokI", "outputId": "c77424dc-9ca0-4167-fae5-9fa0290d737e" }, "outputs": [], "source": [ "x ** y #exponentiation operator or in simpler terms the power operator" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648836467, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "gHq6vQRduokI", "outputId": "cdc96b6b-ad81-4b85-b627-9e2ef2e6af18" }, "outputs": [], "source": [ "x / y #division" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 1, "status": "ok", "timestamp": 1674648836657, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "Y3XdmXMuuokI", "outputId": "a2ebc362-4955-4fe3-8558-a047b7a1db35" }, "outputs": [], "source": [ "np.array_equal(x, y)" ] }, { "cell_type": "markdown", "metadata": { "id": "0lpJNfmIuokI" }, "source": [ "### Broadcasting" ] }, { "cell_type": "markdown", "metadata": { "id": "83MRSeXDuokI" }, "source": [ "ndarrays with different sizes cannot be directly used in arithmetic operations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": 
"https://localhost:8080/", "height": 200 }, "executionInfo": { "elapsed": 3, "status": "error", "timestamp": 1674648837733, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "yygEXYAMuokI", "outputId": "99d0d23c-8981-4230-81f6-647cfd02ea5d", "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "a = np.ones((2, 2))\n", "b = np.ones((3, 3))\n", "a + b # this produces an error cause of the different sizes of the arrays" ] }, { "cell_type": "markdown", "metadata": { "id": "13aYhfIkuokI" }, "source": [ "`Broadcasting` describes how NumPy treats arrays with different shapes during arithmetic operations. The idea is to wrangle data so that operations can occur element-wise.\n", "\n", "Let's see an example. Say I sell pies on my weekends. I sell 3 types of pies at different prices, and I sold the following number of each pie last weekend. I want to know how much money I made per pie type per day.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648841301, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "UWYwGMXsuokJ", "outputId": "5df9da32-72ef-43ae-92d9-ece2e9934b6f" }, "outputs": [], "source": [ "cost = np.array([20, 15, 25])\n", "print(\"Pie cost:\")\n", "print(cost)\n", "sales = np.array([[2, 3, 1], [6, 3, 3], [5, 3, 5]])\n", "print(\"\\nPie sales (#):\")\n", "print(sales)" ] }, { "cell_type": "markdown", "metadata": { "id": "vLZpsZR9uokJ" }, "source": [ "How can we multiply these two arrays together? 
We could use a loop:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674648842594, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "eY84dSJeuokJ", "outputId": "f7c79154-533e-4722-82a4-472f64a015a2" }, "outputs": [], "source": [ "total = np.zeros((3, 3)) # initialize an array of 0's\n", "for col in range(sales.shape[1]):\n", "    total[:, col] = sales[:, col] * cost\n", "total" ] }, { "cell_type": "markdown", "metadata": { "id": "XKDCwwIduokJ" }, "source": [ "Or we could make them the same size, and multiply corresponding elements \"elementwise\":\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648843766, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "07KYEwSwuokJ", "outputId": "e59e8e79-2cb7-4ef1-a422-a3465ab35ee3" }, "outputs": [], "source": [ "cost = np.repeat(cost, 3).reshape((3, 3))\n", "cost" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648844024, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "Gq7I-O_YuokJ", "outputId": "66f3bb34-fe1c-4bef-ca62-499b4305e947" }, "outputs": [], "source": [ "cost * sales" ] }, { "cell_type": "markdown", "metadata": { "id": "uFHN2o81uokJ" }, "source": [ "Broadcasting is essentially NumPy doing the `np.repeat()` for you under the hood (conceptually, that is; no copies of the data are actually made):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 7, "status": "ok", "timestamp": 1674648845277, "user": { 
"displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "t85DQ66juokJ", "outputId": "d94195b7-5e9b-4b00-8337-4c0be60b2bf3" }, "outputs": [], "source": [ "cost = np.array([20, 15, 25]).reshape(3, 1)\n", "print(f\" cost shape: {cost.shape}\")\n", "sales = np.array([[2, 3, 1], [6, 3, 3], [5, 3, 5]])\n", "print(f\"sales shape: {sales.shape}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5, "status": "ok", "timestamp": 1674648845557, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "9-Xuh5QauokK", "outputId": "8900ec3f-effd-4b07-cc3a-22d787191c43" }, "outputs": [], "source": [ "sales * cost" ] }, { "cell_type": "markdown", "metadata": { "id": "3Zdu0OT7uokM" }, "source": [ "In NumPy the smaller array is “broadcast” across the larger array so that they have compatible shapes:\n", "\n", "\n", "\n", "Source: [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas (2016)" ] }, { "cell_type": "markdown", "metadata": { "id": "E6kGFo3CuokM" }, "source": [ "Why should you care about broadcasting? Well, it's cleaner and faster than looping and it also affects the array shapes resulting from arithmetic operations. 
Below, we can time how long it takes to loop vs. broadcast:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 6841, "status": "ok", "timestamp": 1674648878415, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "oXuPoYs2uokM", "outputId": "8c66bab5-44e5-41b4-9c9d-88c8b37e44b4" }, "outputs": [], "source": [ "cost = np.array([20, 15, 25]).reshape(3, 1)\n", "sales = np.array([[2, 3, 1],\n", "                  [6, 3, 3],\n", "                  [5, 3, 5]])\n", "total = np.zeros((3, 3))\n", "\n", "time_loop = %timeit -q -o -r 3 for col in range(sales.shape[1]): total[:, col] = sales[:, col] * np.squeeze(cost)\n", "time_vec = %timeit -q -o -r 3 cost * sales\n", "print(f\"Broadcasting is {time_loop.average / time_vec.average:.2f}x faster than looping here.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "ykcHxwzQuokN" }, "source": [ "Of course, not all arrays are compatible! When operating on two arrays, NumPy compares their shapes dimension by dimension. It starts with the trailing (rightmost) dimensions and works its way left. 
Dimensions are compatible if:\n", "- **they are equal**, or\n", "- **one of them is 1**.\n", "\n", "Use the code below to test out array compatibility:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 188, "status": "ok", "timestamp": 1674648880246, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "lZ-qwDrxuokN", "outputId": "18c36699-faf9-48f1-a102-a867e310b7ba" }, "outputs": [], "source": [ "a = np.ones((3, 2))\n", "b = np.ones((3, 2, 1))\n", "print(f\"The shape of a is: {a.shape}\")\n", "print(f\"The shape of b is: {b.shape}\")\n", "print(\"\")\n", "try:\n", "    print(f\"The shape of a + b is: {(a + b).shape}\")\n", "except ValueError: # NumPy raises a ValueError when shapes cannot be broadcast\n", "    print(\"ERROR: arrays are NOT broadcast compatible!\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Hpwqc2WLuokN" }, "source": [ "### Reshaping Arrays" ] }, { "cell_type": "markdown", "metadata": { "id": "xpSnX5uBuokN" }, "source": [ "There are 3 key methods I want you to know about for reshaping NumPy arrays:\n", "- `.reshape()`\n", "- `np.newaxis`\n", "- `.ravel()`/`.flatten()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 185, "status": "ok", "timestamp": 1674648883478, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "APbZlstFuokN", "outputId": "5f3acac9-43cd-4378-849b-518ebd303f5b" }, "outputs": [], "source": [ "x = np.full((4, 3), 3.14)\n", "x" ] }, { "cell_type": "markdown", "metadata": { "id": "HB2klvliuokN" }, "source": [ "You'll reshape arrays fairly often and the `.reshape()` method is pretty intuitive:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 187, "status": "ok", "timestamp": 1674648885453, "user": { 
"displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "k3_zb5CCuokN", "outputId": "4f7377f7-89e7-438b-f708-516599793fbb" }, "outputs": [], "source": [ "x.reshape(6, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648886050, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "nH__8AVduokN", "outputId": "cf15c9f7-e33b-4a5e-ba3a-3db8201b6286" }, "outputs": [], "source": [ "x.reshape(2, -1) # using -1 will calculate the dimension for you (if possible)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648886903, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "oVk10h_luokO", "outputId": "94dfc707-59bf-4979-edec-1ee8c3ad1eac" }, "outputs": [], "source": [ "a = np.ones(3)\n", "print_array(a)\n", "b = np.ones((3, 2))\n", "print_array(b)" ] }, { "cell_type": "markdown", "metadata": { "id": "AN_Su0sAuokO" }, "source": [ "If I want to add these two arrays I won't be able to because their dimensions are not compatible:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 165 }, "executionInfo": { "elapsed": 4, "status": "error", "timestamp": 1674648888612, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "ZjnnVQ41uokO", "outputId": "123b1af8-3bd4-41d6-da25-ffb7831ddeae", "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "a + b" ] }, { "cell_type": "markdown", "metadata": { "id": "tks6teq6uokO" }, "source": [ "Sometimes you'll want to add dimensions to an array for broadcasting purposes like this. 
We can do that with `np.newaxis` (note that `None` is an alias for `np.newaxis`). We can add a dimension to `a` to make the arrays compatible:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 210, "status": "ok", "timestamp": 1674648898777, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "NtBWdeGCuokO", "outputId": "816abc14-053d-4078-f232-93531d3dc5e2" }, "outputs": [], "source": [ "print_array(a[:, np.newaxis]) # same as a[:, None]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674648899593, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "CFlBmIRWuokO", "outputId": "5c48c061-227a-4275-c8ee-9982a07ae9b6" }, "outputs": [], "source": [ "a[:, np.newaxis] + b" ] }, { "cell_type": "markdown", "metadata": { "id": "lm-7ydxKuokO" }, "source": [ "Finally, sometimes you'll want to \"flatten\" arrays to a single dimension using `.ravel()` or `.flatten()`. 
`.flatten()` always returns a copy, whereas `.ravel()` returns a view of the original array whenever possible (and a copy only when needed), so changes made to the result of `.ravel()` can show up in the original array." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648901593, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "JKDIqsleuokO", "outputId": "da9d266c-3fb8-43d0-b6b0-e7ed3913965f" }, "outputs": [], "source": [ "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648902392, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "uffSkeP4uokP", "outputId": "78e780ac-dc87-4be7-8076-9ec9b945aa3d" }, "outputs": [], "source": [ "print_array(x.flatten())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648903484, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "OrIB2AryuokP", "outputId": "af12a3b2-b921-4210-ed87-d4ad34a0a446" }, "outputs": [], "source": [ "print_array(x.ravel())" ] }, { "cell_type": "markdown", "metadata": { "id": "2MVtR4-MuokP" }, "source": [ "## 5. Indexing and slicing\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "n2Ff-c1_uokP" }, "source": [ "Concepts of indexing should be pretty familiar by now. Indexing arrays is similar to indexing lists but there are just more dimensions." ] }, { "cell_type": "markdown", "metadata": { "id": "Y3nx08SIuokP" }, "source": [ "### Numeric Indexing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648906423, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "2Ls64NXiuokP", "outputId": "c4086f6a-2573-4d96-89dc-9608f7677935" }, "outputs": [], "source": [ "x = np.arange(10)\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674648906835, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "TM90a37luokP", "outputId": "7bc84cc9-4f38-4875-c82e-65be57abc14f" }, "outputs": [], "source": [ "x[3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674648907424, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "SHu8KgVyuokP", "outputId": "beb59d00-5842-4476-e25f-5044f67d9e9f" }, "outputs": [], "source": [ "x[2:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1674648907972, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "DovPPN6CuokQ", "outputId": "ffa019c4-4f74-49b4-b2b6-0ea236781e50" }, "outputs": [], "source": [ "x[:4]" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674648908612, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "xDJi1KVQuokQ", "outputId": "bec97a3c-6b39-4ce0-83b7-b71a34152f1b" }, "outputs": [], "source": [ "x[2:5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5, "status": "ok", "timestamp": 1674648909084, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "0iL7iSQDuokQ", "outputId": "9291cfc9-4f42-4b88-ee6c-95d2172bee6a" }, "outputs": [], "source": [ "x[2:3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5, "status": "ok", "timestamp": 1674648909628, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "iX4cXgbeuokQ", "outputId": "0848393b-26be-4a3d-be54-68b688aab096" }, "outputs": [], "source": [ "x[-1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648910209, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "zFrvFz01uokQ", "outputId": "fd9c4354-94fb-4835-b2fe-0ce0a6da0ac0" }, "outputs": [], "source": [ "x[-2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648910417, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "5JrTuNcWuokQ", "outputId": "16cbdef8-1e4e-4c1b-b685-85982fe41781", "scrolled": true }, "outputs": [], "source": [ "x[5:0:-1] # the -1 means going backwards" ] 
}, { "cell_type": "markdown", "metadata": { "id": "jgMhBVXVuokQ" }, "source": [ "For 2D arrays:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674648911854, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "frhHX3oJuokR", "outputId": "299fc6c2-1a03-42c5-8fc8-67fcf46cdd28" }, "outputs": [], "source": [ "x = np.random.randint(10, size=(4, 6))\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648912361, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "D7ODSdONuokR", "outputId": "a233c464-32f8-4f9d-d142-e17e2e080e9b" }, "outputs": [], "source": [ "x[3, 4] # do this" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674648912912, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "_ZeEQsTbuokR", "outputId": "7f642167-89c8-4a2d-981f-1c1c4514dfa2" }, "outputs": [], "source": [ "x[3][4] # i do not like this as much" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648913454, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "CQgKIINWuokR", "outputId": "e269a605-583c-4f34-f1d9-beb6c9870cd4" }, "outputs": [], "source": [ "x[3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648914161, "user": { "displayName": "RA 
Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "4ATX3rcKuokR", "outputId": "ef0c9e80-b1c3-4599-aa08-ac225ad60922" }, "outputs": [], "source": [ "len(x) # generally, just confusing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648914342, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "7SdPaDFHuokR", "outputId": "90d84e4d-eac0-4b2d-97a2-a2227a08b98d" }, "outputs": [], "source": [ "x.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1674648914856, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "koc-UgFOuokR", "outputId": "a34025e8-d6c3-4aeb-af30-f73c664da64c" }, "outputs": [], "source": [ "x[:, 2] # column number 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648915339, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "H55gUT5quokS", "outputId": "5686c19e-f2c9-45d9-fa61-647a4322fe71" }, "outputs": [], "source": [ "x[2:, :3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 7, "status": "ok", "timestamp": 1674648915852, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "YmQppdCXuokS", "outputId": "0af46743-1fb4-4f3a-ba15-fce831f48046" }, "outputs": [], "source": [ "x.T" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", 
"timestamp": 1674648916500, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "Y8yfEZhuuokS", "outputId": "dfd3a8dc-010a-4de7-dfc0-62c2e75da3a8" }, "outputs": [], "source": [ "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648917132, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "QDp3bg2ruokS", "outputId": "11f2eccf-8f4b-441e-fc10-bd6800d978f3" }, "outputs": [], "source": [ "x[1, 1] = 555555\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648917628, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "T7ELhcjcuokS", "outputId": "dd7692a9-c7e3-4a8c-a3b6-a324f9870c43" }, "outputs": [], "source": [ "z = np.zeros(5)\n", "z" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648917838, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "aipFF-XmuokS", "outputId": "7d403d92-a933-4582-e09b-4d148ed8fb74" }, "outputs": [], "source": [ "z[0] = 5\n", "z" ] }, { "cell_type": "markdown", "metadata": { "id": "z0nxu9a-uokS" }, "source": [ "### Boolean Indexing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": false, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648919048, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "LY0u61qruokT", "jupyter": { "outputs_hidden": false }, "outputId": "76bb8fd9-b511-441c-d98f-4efcf65ac0cd" }, 
"outputs": [], "source": [ "x = np.random.rand(10)\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": false, "executionInfo": { "elapsed": 1, "status": "ok", "timestamp": 1674648919263, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "I8Ot1MkBuokT", "jupyter": { "outputs_hidden": false }, "outputId": "a4ade92d-0766-4d6a-def2-5d2cd6680e60" }, "outputs": [], "source": [ "x + 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": false, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648919764, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "_BVtBkO4uokT", "jupyter": { "outputs_hidden": false }, "outputId": "709bd762-f98c-4a2e-fa7a-3bb3839469be" }, "outputs": [], "source": [ "x_thresh = x > 0.5\n", "x_thresh" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": false, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648920470, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "NkIDvUwCuokT", "jupyter": { "outputs_hidden": false }, "outputId": "b349840f-e4d3-470d-8f58-ba07c825ad41" }, "outputs": [], "source": [ "x[x_thresh] = 0.5 # set all elements > 0.5 to be equal to 0.5\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648920966, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "qOsn9VMtuokT", "outputId": "d440e0e8-27a1-447f-b972-e0894123041d" }, "outputs": [], "source": [ "x = np.random.rand(10)\n", "x" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 7, "status": "ok", "timestamp": 1674648921196, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "fvGuM5lguokT", "outputId": "33da81ee-43cf-407d-cf96-4e282b88e12d" }, "outputs": [], "source": [ "x[x > 0.5] = 0.5\n", "x" ] }, { "cell_type": "markdown", "metadata": { "id": "WiMlYkGsuokT" }, "source": [ "## 6. More Useful NumPy Functions" ] }, { "cell_type": "markdown", "metadata": { "id": "tQfMhplcuokT" }, "source": [ "Numpy has many built-in functions for mathematical operations, really it has almost every numerical operation you might want to do in its library. I'm not going to explore the whole library here, but as an example of some of the available functions, consider working out the hypotenuse of a triangle that with sides 3m and 4m:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tCKV7gZnuokU" }, "outputs": [], "source": [ "sides = np.array([3, 4])" ] }, { "cell_type": "markdown", "metadata": { "id": "f1iSYFALuokU" }, "source": [ "There are several ways we could solve this problem. 
We could directly use Pythagoras's Theorem:\n", "\n", "$$c = \\sqrt{a^2+b^2}$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 320, "status": "ok", "timestamp": 1674648947504, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "VJvt3M6QuokU", "outputId": "d979f834-26ad-468f-805a-25b066d44524" }, "outputs": [], "source": [ "np.sqrt(np.sum([np.power(sides[0], 2), np.power(sides[1], 2)]))" ] }, { "cell_type": "markdown", "metadata": { "id": "NaqVkj02uokU" }, "source": [ "We can leverage the fact that we're dealing with a numpy array and apply a \"vectorized\" operation (more on that in a bit) to the whole vector at one time:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674648948267, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "8TXePpERuokU", "outputId": "46d7a5cf-02a4-458d-9c77-0d333a50e69f" }, "outputs": [], "source": [ "(sides ** 2).sum() ** 0.5" ] }, { "cell_type": "markdown", "metadata": { "id": "FzvrbLOauokU" }, "source": [ "Or we can simply use a numpy built-in function (if it exists):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648949079, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "SNrR28qEuokU", "outputId": "90752f6b-5e7c-4df7-fc93-92c567fda796" }, "outputs": [], "source": [ "np.linalg.norm(sides) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648949339, "user": { "displayName": "RA Odongo", 
"userId": "17326618845752559881" }, "user_tz": -60 }, "id": "IYAwQgVRuokU", "outputId": "d50e956c-e4f4-4b70-ffa0-91ab5dbf9098" }, "outputs": [], "source": [ "np.hypot(*sides)" ] }, { "cell_type": "markdown", "metadata": { "id": "z6peb0HauokV" }, "source": [ "### Vectorization" ] }, { "cell_type": "markdown", "metadata": { "id": "MccPP_icuokV" }, "source": [ "Broadly speaking, \"vectorization\" in NumPy refers to the use of optmized C code to perform an operation. Long-story-short, because numpy arrays are homogenous (contain the same dtype), we don't need to check that we can perform an operation on elements of a sequence before we do the operation which results in a huge speed-up. You can kind of think of this concept as NumPy being able to perform an operation on the whole array at the same time rather than one-by-one (this is not actually the case, a super-efficient C loop is still running under the hood, but that's an irrelevant detail). You can read more about vectorization [here](https://www.pythonlikeyoumeanit.com/Module3_IntroducingNumpy/VectorizedOperations.html) but all you need to know is that most operations in NumPy are vectorized, so just try to do things at an \"array-level\" rather than an \"element-level\", e.g.:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674648950976, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "N-nlT_Y0uokV", "outputId": "cffc5ea8-4c3a-489d-a742-5a0a755d51a8" }, "outputs": [], "source": [ "# DONT DO THIS\n", "array = np.array(range(5))\n", "for i, element in enumerate(array):\n", " array[i] = element ** 2\n", "array" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iTdbqWJ9uokV" }, "outputs": [], "source": [ "# DO THIS\n", "array = np.array(range(5))\n", "array **= 2 " ] }, { "cell_type": "markdown", "metadata": { 
"id": "7-hbc2cRuokV" }, "source": [ "Let's do a quick timing experiment:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 13429, "status": "ok", "timestamp": 1674648966936, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "1VhypeKTuokV", "outputId": "950237e5-2091-43c7-c3b3-541552d00c32" }, "outputs": [], "source": [ "# loop method\n", "array = np.array(range(5))\n", "time_loop = %timeit -q -o -r 3 for i, element in enumerate(array): array[i] = element ** 2\n", "# vectorized method\n", "array = np.array(range(5))\n", "time_vec = %timeit -q -o -r 3 array ** 2\n", "print(f\"Vectorized operation is {time_loop.average / time_vec.average:.2f}x faster than looping here.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Aptaq5lwuokV" }, "source": [ "## 7. Introduction to Pandas\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "TT8Chs-PuokW" }, "source": [ "Pandas can be installed using `pip`:\n", "\n", "```\n", "!pip install pandas\n", "```\n", "\n", "We usually import pandas with the alias `pd`. You'll see these two imports at the top of most data science workflows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zAW_VR7puokW" }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "id": "NwX3NTCTuokW" }, "source": [ "## 8. Pandas Series\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "wB4bAwzVuokW" }, "source": [ "### What are Series?" ] }, { "cell_type": "markdown", "metadata": { "id": "_IElg5eJuokW" }, "source": [ "A Series is like a NumPy array but with labels. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc), including a mix of them. Series can be created from a scalar, a list, ndarray or dictionary using `pd.Series()` (**note the captial \"S\"**). Here are some example series:" ] }, { "cell_type": "markdown", "metadata": { "id": "C1K1VP90uokW" }, "source": [ "### Creating Series" ] }, { "cell_type": "markdown", "metadata": { "id": "W_v955otuokW" }, "source": [ "By default, series are labelled with indices starting from 0. For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 211, "status": "ok", "timestamp": 1674649099921, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "aeA3pVgpuokW", "outputId": "b47cf3d7-c50b-4f1c-a19d-e78483d5c7ea" }, "outputs": [], "source": [ "pd.Series(data = [-5, 1.3, 21, 6, 3])" ] }, { "cell_type": "markdown", "metadata": { "id": "46v8f_mwuokX" }, "source": [ "But you can add a custom index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 183, "status": "ok", "timestamp": 1674649101795, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "_Vw4ES_guokX", "outputId": "693ffd45-289b-4c6c-ba3c-df067803a2b2" }, "outputs": [], "source": [ "pd.Series(data = [-5, 1.3, 21, 6, 3],\n", " index = ['a', 'b', 'c', 'd', 'e'])" ] }, { "cell_type": "markdown", "metadata": { "id": "NpBdPTsEuokX" }, "source": [ "You can create a Series from a dictionary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649102932, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "Vs0yx9UMuokX", "outputId": "e32523e1-ae43-4ec8-c379-9d161cac35d3" }, "outputs": [], "source": [ "pd.Series(data = {'a': 10, 'b': 20, 'c': 30})" ] }, { "cell_type": "markdown", "metadata": { "id": "4DWdK8fNuokX" }, "source": [ "Or from an ndarray:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649103901, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "RyaN4RxkuokX", "outputId": "28b6dab6-826b-4057-93dd-ae646f6410f0" }, "outputs": [], "source": [ "pd.Series(data = np.random.randn(3))" ] }, { "cell_type": "markdown", "metadata": { "id": "3LF-bXcouokX" }, "source": [ "Or even a scalar:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649105159, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "6fIMeq8yuokX", "outputId": "ea8024e8-684b-4fc5-dc80-5ff117821165" }, "outputs": [], "source": [ "pd.Series(3.141)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649105388, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "u2q8BnAluokX", "outputId": "d5ab69b8-a95f-497c-d68e-ac2bdc9f30aa" }, "outputs": [], "source": [ "pd.Series(data=3.141, index=['a', 'b', 'c'])" ] }, { "cell_type": "markdown", "metadata": { "id": "ItBhJIpwuokY" }, "source": [ "### Series Characteristics" ] }, { "cell_type": "markdown", 
"metadata": { "id": "ykRqXTK5uokY" }, "source": [ "Series can be given a `name` attribute. I almost never use this but it might come up sometimes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649107243, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "ICwTOBJ2uokY", "outputId": "03a1879c-10d6-4830-aa2e-31a31f1f6d02" }, "outputs": [], "source": [ "s = pd.Series(data = np.random.randn(5), name='random_series')\n", "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649108015, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "ADSwiLkiuokY", "outputId": "eb3187c9-6ced-41a0-806f-0ef801c87b0f" }, "outputs": [], "source": [ "s.name" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649108384, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "IXPtGm_JuokY", "outputId": "d387cc48-cba6-4d10-a5f6-af43ebf0d93c" }, "outputs": [], "source": [ "s.rename(\"another_name\")" ] }, { "cell_type": "markdown", "metadata": { "id": "3nhEzvOvuokY" }, "source": [ "You can access the index labels of your series using the `.index` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649109542, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "RuiCOL_VuokY", "outputId": "31a876f1-6c10-4cb0-e478-13b9a6238d03" }, "outputs": [], "source": [ 
"s.index" ] }, { "cell_type": "markdown", "metadata": { "id": "644xwWH8uokY" }, "source": [ "You can access the underlying data array using `.to_numpy()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649111434, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "uMMr73oBuokY", "outputId": "da072795-2d57-4241-9ed1-5087721a54da" }, "outputs": [], "source": [ "s.to_numpy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649112000, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "RcK6YW00uokY", "outputId": "c5df3aee-b048-43a1-f8d3-c63a87bf2552" }, "outputs": [], "source": [ "pd.Series([[1, 2, 3], \"b\", 1]).to_numpy()" ] }, { "cell_type": "markdown", "metadata": { "id": "ng7TEuFkuokZ" }, "source": [ "### Indexing and Slicing Series" ] }, { "cell_type": "markdown", "metadata": { "id": "XLz3CHWGuokZ" }, "source": [ "Series are very much like ndarrays (in fact, series can be passed to most NumPy functions!). 
They can be indexed using square brackets `[ ]` and sliced using colon `:` notation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649114383, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "ymtwd-shuokZ", "outputId": "4c99a507-52a7-4dcc-85c6-69c4f6fc85b8" }, "outputs": [], "source": [ "s = pd.Series(data = range(5),\n", " index = ['A', 'B', 'C', 'D', 'E'])\n", "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649114966, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "owRHHH5euokZ", "outputId": "9d61b5f5-eb74-4100-fc05-9781eaa94e39" }, "outputs": [], "source": [ "s.iloc[0] # positional access; plain s[0] is deprecated when the index holds labels" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649115760, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "Q6N5PxuUuokZ", "outputId": "58d7f938-f1a8-4b90-fbe1-94c563fd7dd9" }, "outputs": [], "source": [ "s.iloc[[1, 2, 3]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649116265, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "NyaHuTy0uokZ", "outputId": "8b69245f-1808-4bea-dfac-74d37656994c" }, "outputs": [], "source": [ "s[0:3]" ] }, { "cell_type": "markdown", "metadata": { "id": "kFM094dwuokZ" }, "source": [ "Note above how array-based indexing and slicing also return the series index.\n", "\n", "Series are also like dictionaries, in that we can access values 
using index labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5, "status": "ok", "timestamp": 1674649118025, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "8UHo3EaluokZ", "outputId": "242dea2c-12e9-4bb0-b70b-4eb6d044cabb" }, "outputs": [], "source": [ "s[\"A\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649118439, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "ojnV3aueuokZ", "outputId": "b1745975-e07d-4285-c5e2-cf933c80d219" }, "outputs": [], "source": [ "s[[\"B\", \"D\", \"C\"]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649119059, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "__4miCQwuoka", "outputId": "8cb3437d-4482-4aee-809f-c6840b027832" }, "outputs": [], "source": [ "s[\"A\":\"C\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649119570, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "nBkIK3l4uoka", "outputId": "c1deaf2e-3136-4351-ffb9-45346de8926c" }, "outputs": [], "source": [ "\"A\" in s" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649120150, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "-nz3q3HDuoka", "outputId": 
"dc11d21d-e617-43f5-c2c7-50e5219347bf" }, "outputs": [], "source": [ "\"Z\" in s" ] }, { "cell_type": "markdown", "metadata": { "id": "eKkgNlfUuoka" }, "source": [ "Series allow non-unique index labels, but **be careful**: indexing with a repeated label returns every matching value rather than a single one:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649121506, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "_S743adEuoka", "outputId": "1b32cf66-67d1-49e1-fb44-acfcfe2d2ee8" }, "outputs": [], "source": [ "x = pd.Series(data = range(5),\n", " index = [\"A\", \"A\", \"A\", \"B\", \"C\"])\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649121786, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "eayHKquuuoka", "outputId": "a94303b8-bcb2-405a-ad05-bb075b3484fe" }, "outputs": [], "source": [ "x[\"A\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "DygSrBEWuokb" }, "source": [ "Finally, we can also do boolean indexing with series:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649123225, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "AcBcBzE8uokb", "outputId": "08e9f3e7-f19e-4916-c48d-8459baf1ea6a" }, "outputs": [], "source": [ "s[s >= 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649123785, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, 
"id": "m_njJ8-Xuokb", "outputId": "5e297f97-a4aa-414a-c410-cea6bdcb7096" }, "outputs": [], "source": [ "s[s > s.mean()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5, "status": "ok", "timestamp": 1674649124367, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "hG6l2KwIuokb", "outputId": "9ec1ae0b-ad3c-41fd-cf85-4d53cf16500f" }, "outputs": [], "source": [ "(s != 1)" ] }, { "cell_type": "markdown", "metadata": { "id": "Z9uNc3B8uokb" }, "source": [ "### Series Operations" ] }, { "cell_type": "markdown", "metadata": { "id": "XMn8Fljauokb" }, "source": [ "Unlike ndarrays, operations between Series (+, -, /, \\*) align values based on their **LABELS** (not their position in the structure). The resulting index will be the __*sorted union*__ of the two indexes. This gives you the flexibility to run operations on series regardless of their labels."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649126383, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "_VR3xcW4uokb", "outputId": "eff77137-7194-4cd7-c9de-8d849c01662d" }, "outputs": [], "source": [ "s1 = pd.Series(data = range(4),\n", " index = [\"A\", \"B\", \"C\", \"D\"])\n", "s1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649126881, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "GUELrK5Nuokb", "outputId": "f7985c55-8b57-480d-bd69-1a1e4e595228" }, "outputs": [], "source": [ "s2 = pd.Series(data = range(10, 14),\n", " index = [\"B\", \"C\", \"D\", \"E\"])\n", "s2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649127595, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "-oSR_YL9uokc", "outputId": "30eca895-4b8f-4ef6-b8bf-f068a449e629" }, "outputs": [], "source": [ "s1 + s2" ] }, { "cell_type": "markdown", "metadata": { "id": "czkh9sF1uokc" }, "source": [ "As you can see above, values whose labels match are added together. Labels found in only one of the series still appear in the result, but with `NaN` values." ] }, { "cell_type": "markdown", "metadata": { "id": "ncEHp3zauokc" }, "source": [ "We can also perform standard operations on a series, like multiplying or squaring. 
NumPy also accepts series as an argument to most functions because series are built off numpy arrays (more on that later):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 189, "status": "ok", "timestamp": 1674649178340, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "pv4vqavUuokc", "outputId": "5fcfd465-338d-44b0-eac1-7b03bd86f2f6" }, "outputs": [], "source": [ "s1 ** 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 1, "status": "ok", "timestamp": 1674649178968, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "icerUE-4uokc", "outputId": "675cf912-eb0b-475c-918c-6c0fb98ef1ec" }, "outputs": [], "source": [ "np.exp(s1)" ] }, { "cell_type": "markdown", "metadata": { "id": "Ul5Bkw-6uokc" }, "source": [ "Finally, just like arrays, series have many built-in methods for various operations. 
You can find them all by running `help(pd.Series)`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 384, "status": "ok", "timestamp": 1674649180541, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "Qyqy9Pnyuokc", "outputId": "edf6fbd9-8752-41eb-8a1f-e55da240d625" }, "outputs": [], "source": [ "print([_ for _ in dir(pd.Series) if not _.startswith(\"_\")]) # print all common methods" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649180735, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "TvcCNFUkuokc", "outputId": "1f5a04bd-a13d-422b-b12a-bd6efdee1097" }, "outputs": [], "source": [ "s1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649181696, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "m2Epa5SFuokc", "outputId": "892256e5-0071-4552-b10a-ce7de286b1b5" }, "outputs": [], "source": [ "s1.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649182337, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "Iebv7CzSuokc", "outputId": "8f864fa2-820a-40b1-f4db-9217f2dbb6a2" }, "outputs": [], "source": [ "s1.sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649182885, "user": { "displayName": "RA Odongo", "userId": 
"17326618845752559881" }, "user_tz": -60 }, "id": "FeVLnVrAuokd", "outputId": "4ba2fabc-2b2d-473d-acc0-0e056e1a4aad" }, "outputs": [], "source": [ "s1.astype(float)" ] }, { "cell_type": "markdown", "metadata": { "id": "bmsnP4bvuokd" }, "source": [ "**\"Chaining\"** operations together is also common with pandas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649184054, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "YI4za7Rcuokd", "outputId": "11d4f08b-d08b-477f-93e8-e2d18c36b588" }, "outputs": [], "source": [ "s1.add(3.141).astype(int).pow(2).mean()" ] }, { "cell_type": "markdown", "metadata": { "id": "Kw_qDw5zuokd" }, "source": [ "### Data Types" ] }, { "cell_type": "markdown", "metadata": { "id": "Uyo88Lvquokd" }, "source": [ "Series can hold all the data types (`dtypes`) you're used to, e.g., `int`, `float`, `bool`, etc. There are a few other special data types too (`object`, `DateTime` and `Categorical`) which we'll talk about in this and later chapters. You can always read more about pandas dtypes [in the documentation too](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes). For example, here's a series of `dtype` int64:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 210, "status": "ok", "timestamp": 1674649185952, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "ZgyGejRjuokd", "outputId": "104f7840-0336-4599-97a5-f9b404f85aae" }, "outputs": [], "source": [ "x = pd.Series(range(5))\n", "x.dtype" ] }, { "cell_type": "markdown", "metadata": { "id": "qquyNK_Ruokd" }, "source": [ "The dtype \"`object`\" is used for series of strings or mixed data. 
Pandas is [currently experimenting](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.StringDtype.html#pandas.StringDtype) with a dedicated string dtype `StringDtype`, but it is still in testing." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 10, "status": "ok", "timestamp": 1674649188113, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "3wGUnqXuuokd", "outputId": "d2fc2c41-91a4-44d2-bdb2-ea19ada08d9c" }, "outputs": [], "source": [ "x = pd.Series(['A', 'B'])\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649188346, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "jQJqUvtjuokd", "outputId": "6aa1a12d-eb5c-437a-8fe3-82a4b670a8a5" }, "outputs": [], "source": [ "x = pd.Series(['A', 1, [\"I\", \"AM\", \"A\", \"LIST\"]])\n", "x" ] }, { "cell_type": "markdown", "metadata": { "id": "nrJSJBrFuoke" }, "source": [ "While flexible, it is recommended to avoid the use of `object` dtypes because of higher memory requirements. Essentially, in an `object` dtype series, every single element stores information about its individual dtype. 
We can inspect the dtypes of all the elements in a mixed series in several ways, below I'll use the `map` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649189797, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "1tJvcOvNuoke", "outputId": "98cd0cd7-3583-4a7b-9269-38c67ec10f52" }, "outputs": [], "source": [ "x.map(type)" ] }, { "cell_type": "markdown", "metadata": { "id": "IE8MwUEouoke" }, "source": [ "We can see that each object in our series has a different dtype. This comes at a cost. Compare the [memory usage](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.memory_usage.html) of the series below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649190725, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "SuyzCJo-uoke", "outputId": "5ee6d991-4007-4f15-83a8-7c7609b18eb1" }, "outputs": [], "source": [ "x1 = pd.Series([1, 2, 3])\n", "print(f\"x1 dtype: {x1.dtype}\")\n", "print(f\"x1 memory usage: {x1.memory_usage(deep=True)} bytes\")\n", "print(\"\")\n", "x2 = pd.Series([1, 2, \"3\"])\n", "print(f\"x2 dtype: {x2.dtype}\")\n", "print(f\"x2 memory usage: {x2.memory_usage(deep=True)} bytes\")\n", "print(\"\")\n", "x3 = pd.Series([1, 2, \"3\"]).astype('int8') # coerce the object series to int8\n", "print(f\"x3 dtype: {x3.dtype}\")\n", "print(f\"x3 memory usage: {x3.memory_usage(deep=True)} bytes\")" ] }, { "cell_type": "markdown", "metadata": { "id": "cOHoWeJtuoke" }, "source": [ "In summary, try to use uniform dtypes where possible - they are more memory efficient!\n", "\n", "One more gotcha, `NaN` (frequently used to represent missing values in data) is a float:" ] 
}, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 247, "status": "ok", "timestamp": 1674649191972, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "H3IJY4e3uoke", "outputId": "fac72b18-5e9e-43c3-d303-94da12ea0d23" }, "outputs": [], "source": [ "type(np.nan)" ] }, { "cell_type": "markdown", "metadata": { "id": "jvuQjKlpuoke" }, "source": [ "This can be problematic if you have a series of integers and one missing value, because Pandas will cast the whole series to float:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649192742, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "c_Rs0D1Muoke", "outputId": "25d0964a-e65b-4d84-f44a-1a5b346774bc" }, "outputs": [], "source": [ "pd.Series([1, 2, 3, np.nan])" ] }, { "cell_type": "markdown", "metadata": { "id": "mV5UVQ5juoke" }, "source": [ "Pandas also offers a \"[nullable integer dtype](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html)\", which can represent missing values in an integer series without casting it to float. 
Note the capital \"I\" in the type below, differentiating it from NumPy's `int64` dtype:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 191, "status": "ok", "timestamp": 1674649194141, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "T2P-M5zouokf", "outputId": "744b164a-fb39-4b2f-9092-c262db5b70ac" }, "outputs": [], "source": [ "pd.Series([1, 2, 3, np.nan]).astype('Int64')" ] }, { "cell_type": "markdown", "metadata": { "id": "iFBKz70Muokf" }, "source": [ "This is not the default in Pandas yet and functionality of this new feature is still subject to change." ] }, { "cell_type": "markdown", "metadata": { "id": "9IWmdVCxuokf" }, "source": [ "## 9. Pandas DataFrames\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "GNw2pe6cuokf" }, "source": [ "### What are DataFrames?" ] }, { "cell_type": "markdown", "metadata": { "id": "sxcJoAAXuokf" }, "source": [ "Pandas DataFrames are your new best friend. They are like the Excel spreadsheets you may be used to. DataFrames are really just Series stuck together! Think of a DataFrame as a dictionary of series, with the \"keys\" being the column labels and the \"values\" being the series data.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "dRkKQMWWuokf" }, "source": [ "### Creating DataFrames" ] }, { "cell_type": "markdown", "metadata": { "id": "qOyMAZtEuokf" }, "source": [ "Dataframes can be created using `pd.DataFrame()` (note the capital \"D\" and \"F\"). Like series, index and column labels of dataframes are labelled starting from 0 by default:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "executionInfo": { "elapsed": 242, "status": "ok", "timestamp": 1674649211209, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "eXLvDGyOuokf", "outputId": "9231488f-5aa0-4d8f-f560-dd812a66fda9" }, "outputs": [], "source": [ "pd.DataFrame([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]])" ] }, { "cell_type": "markdown", "metadata": { "id": "J4C1Q8Umuokf" }, "source": [ "We can use the `index` and `columns` arguments to give them labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "executionInfo": { "elapsed": 189, "status": "ok", "timestamp": 1674649212781, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "wFpfMV0Juokf", "outputId": "04998164-d63a-498a-a693-e975f251f6d1" }, "outputs": [], "source": [ "pd.DataFrame([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]],\n", " index = [\"R1\", \"R2\", \"R3\"],\n", " 
columns = [\"C1\", \"C2\", \"C3\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "UIQxyLmFuokg" }, "source": [ "There are so many ways to create dataframes. I most often create them from dictionaries or ndarrays:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "executionInfo": { "elapsed": 194, "status": "ok", "timestamp": 1674649214708, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "4SV7pWpNuokg", "outputId": "ddf2afae-d68e-44dc-d845-f323491db937" }, "outputs": [], "source": [ "pd.DataFrame({\"C1\": [1, 2, 3],\n", " \"C2\": ['A', 'B', 'C']},\n", " index=[\"R1\", \"R2\", \"R3\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "executionInfo": { "elapsed": 185, "status": "ok", "timestamp": 1674649216006, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "jU7MMKrRuokg", "outputId": "aa05e731-fffe-443d-df49-8dcc2e362d02" }, "outputs": [], "source": [ "pd.DataFrame(np.random.randn(5, 5),\n", " index=[f\"row_{_}\" for _ in range(1, 6)],\n", " columns=[f\"col_{_}\" for _ in range(1, 6)])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674649216219, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "y1YmlK-auokg", "outputId": "2bab9549-4793-45ba-b7de-8b721489b026" }, "outputs": [], "source": [ "pd.DataFrame(np.array([['Tom', 7], ['Mike', 15], ['Tiffany', 3]]))" ] }, { "cell_type": "markdown", "metadata": { "id": "MwWUSReQuokg" }, "source": [ "Here's a table of the main ways you can create dataframes (see the [Pandas 
documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) for more):\n", "\n", "|Create DataFrame from|Code|\n", "|---|---|\n", "|List of lists|`pd.DataFrame([['Tom', 7], ['Mike', 15], ['Tiffany', 3]])`|\n", "|ndarray|`pd.DataFrame(np.array([['Tom', 7], ['Mike', 15], ['Tiffany', 3]]))`|\n", "|Dictionary|`pd.DataFrame({\"Name\": ['Tom', 'Mike', 'Tiffany'], \"Number\": [7, 15, 3]})`|\n", "|List of tuples|`pd.DataFrame(zip(['Tom', 'Mike', 'Tiffany'], [7, 15, 3]))`|\n", "|Series|`pd.DataFrame({\"Name\": pd.Series(['Tom', 'Mike', 'Tiffany']), \"Number\": pd.Series([7, 15, 3])})`|\n" ] }, { "cell_type": "markdown", "metadata": { "id": "TshDRwrwuokg" }, "source": [ "### Indexing and Slicing DataFrames" ] }, { "cell_type": "markdown", "metadata": { "id": "fMLVnd8Wuokg" }, "source": [ "There are several main ways to select data from a DataFrame:\n", "1. `[]`\n", "2. `.loc[]`\n", "3. `.iloc[]`\n", "4. Boolean indexing\n", "5. `.query()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "executionInfo": { "elapsed": 202, "status": "ok", "timestamp": 1674649220928, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "dT2gF12uuokg", "outputId": "86a3d0be-66a5-4b3c-dbb6-dced69448123" }, "outputs": [], "source": [ "df = pd.DataFrame({\"Name\": [\"Tom\", \"Mike\", \"Tiffany\"],\n", " \"Language\": [\"Python\", \"Python\", \"R\"],\n", " \"Courses\": [5, 4, 7]})\n", "df" ] }, { "cell_type": "markdown", "metadata": { "id": "6_vO6pcpuokg" }, "source": [ "#### Indexing with `[]`\n", "Select columns by a single label or a list of labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 268, "status": "ok", "timestamp": 1674649223117, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, 
"user_tz": -60 }, "id": "qnmjnep0uokg", "outputId": "30ee7e24-6f1b-486b-881e-3dfba30cef74" }, "outputs": [], "source": [ "df['Name'] # returns a series" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674649223607, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "Lqy9Fq6kuokh", "outputId": "80a535a3-891b-4481-aa89-c3dec0b1f51d" }, "outputs": [], "source": [ "df[['Name']] # returns a dataframe!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674649224350, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "de10X-ZZuokh", "outputId": "91f694b4-2e0e-4681-aca7-0a329a2b98b9" }, "outputs": [], "source": [ "df[['Name', 'Language']]" ] }, { "cell_type": "markdown", "metadata": { "id": "bGuYg5Qluokh" }, "source": [ "With `[]`, rows can only be selected using slices, not single values (this is not recommended; see the preferred methods below)."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 501 }, "executionInfo": { "elapsed": 4, "status": "error", "timestamp": 1674649226012, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "2D0DzJPduokh", "outputId": "fe97bb9a-4bdd-4e45-e061-6b43c5877bdf", "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "df[0] # doesn't work" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 81 }, "executionInfo": { "elapsed": 201, "status": "ok", "timestamp": 1674649235016, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "lZlenbfHuokh", "outputId": "e7594d12-854e-4d20-c3d1-00abd4c1ede3" }, "outputs": [], "source": [ "df[0:1] # does work" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "executionInfo": { "elapsed": 182, "status": "ok", "timestamp": 1674649236818, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "5tn0PBC7uokh", "outputId": "71caf3cd-e26d-4918-ccca-114d37767a47" }, "outputs": [], "source": [ "df[1:] # does work" ] }, { "cell_type": "markdown", "metadata": { "id": "A2jrNW87uokh" }, "source": [ "#### Indexing with `.loc` and `.iloc`\n", "Pandas created the methods `.loc[]` and `.iloc[]` as more flexible alternatives for accessing data from a dataframe. Use `df.iloc[]` for indexing with integers and `df.loc[]` for indexing with labels. These are typically the [recommended methods of indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated) in Pandas." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "executionInfo": { "elapsed": 192, "status": "ok", "timestamp": 1674649239740, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "hv42ds0Zuokh", "outputId": "c81d78d3-4dbd-4db7-8903-240e7e99c069" }, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": { "id": "1M57y9FRuokh" }, "source": [ "First we'll try out `.iloc` which accepts *integers* as references to rows/columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 183, "status": "ok", "timestamp": 1674649241914, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "tNTr15NTuoki", "outputId": "0cddfe22-79a0-46b2-eb35-156b3694aad6" }, "outputs": [], "source": [ "df.iloc[0] # returns a series" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674649242522, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "O9fKC2l5uoki", "outputId": "0ae28ec4-cc67-4673-b52d-3851e6b10186" }, "outputs": [], "source": [ "df.iloc[0:2] # slicing returns a dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674649243842, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "n8DiUATJuoki", "outputId": "272319f4-b2f1-4bd5-feb4-e16c58db78a6" }, "outputs": [], "source": [ "df.iloc[2, 1] # returns the indexed object" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": 
"https://localhost:8080/", "height": 112 }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649244670, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "7eh3REHyuoki", "outputId": "d73292d6-4155-4321-c686-cc345ca5bd1b" }, "outputs": [], "source": [ "df.iloc[[0, 1], [1, 2]] # returns a dataframe" ] }, { "cell_type": "markdown", "metadata": { "id": "rBetVEi8uoki" }, "source": [ "Now let's look at `.loc` which accepts *labels* as references to rows/columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649247329, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "cIGliwBJuoki", "outputId": "b8e42d6a-1935-4e45-ce2e-03915b04946b" }, "outputs": [], "source": [ "df.loc[:, 'Name']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649247970, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "YTMjhq5vuoki", "outputId": "8d414154-84c7-4da2-ff74-225584c03b55" }, "outputs": [], "source": [ "df.loc[:, 'Name':'Language']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "executionInfo": { "elapsed": 201, "status": "ok", "timestamp": 1674649249014, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "RZK0N0COuoki", "outputId": "e09e677e-0eae-47b6-9771-a951847ac19f" }, "outputs": [], "source": [ "df.loc[[0, 2], ['Language']]" ] }, { "cell_type": "markdown", "metadata": { "id": "M1RP6GM-uoki" }, "source": [ "Sometimes we want to use a mix of integers and labels to reference data in a dataframe. 
The easiest way to do this is to use `.loc[]` with labels and to convert any integer position into a label with `.index` or `.columns`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649250924, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "mD87YQO3uoki", "outputId": "90547ef6-3391-47ff-b744-3623e44489b0" }, "outputs": [], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649251541, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "GAGvJA_7uokj", "outputId": "b1d2fe03-2e67-4c01-c50a-71dfda606b9c" }, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649252345, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "y78GDga4uokj", "outputId": "509ac70a-bd89-4910-b1d4-0913b8b7e969" }, "outputs": [], "source": [ "df.loc[df.index[0], 'Courses'] # first row (by position) and the column labelled \"Courses\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649252867, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "YY2fblm6uokj", "outputId": "afb086b3-bfff-40bd-b947-e47b71b6a7c2" }, "outputs": [], "source": [ "df.loc[2, df.columns[1]] # row labelled 2 and the second column (by position)" ] }, { "cell_type": "markdown", "metadata": { "id": "O6Oqpf3Kuokj" }, "source": [ "#### 
Boolean indexing\n", "Just like with series, we can select data based on boolean masks:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649254414, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "-U2mLkO2uokj", "outputId": "ccc95dd0-3dae-4803-c715-4380f3c50ee6" }, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 81 }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674649255450, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "dIApvm8Puokj", "outputId": "14a06e2c-7272-4573-d8ba-871a34adb7fa" }, "outputs": [], "source": [ "df[df['Courses'] > 5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 81 }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649256038, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "jrS6fpsxuokj", "outputId": "c5e0dfb1-9483-442e-fd29-5a23c9dfd99e" }, "outputs": [], "source": [ "df[df['Name'] == \"Tom\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "ETZjzlLiuokj" }, "source": [ "#### Indexing with `.query()`\n", "Boolean masks work fine, but I prefer the `.query()` method for selecting data. `df.query()` is a powerful tool for filtering rows. Its syntax is unusual for Python and reads more like SQL: `df.query()` accepts a string expression to evaluate, and it \"knows\" the names of the columns in your dataframe." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 81 }, "executionInfo": { "elapsed": 276, "status": "ok", "timestamp": 1674649257624, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "3835n7gVuokj", "outputId": "13fc29ba-eb72-4e90-8c28-eca2b356a25a" }, "outputs": [], "source": [ "df.query(\"Courses > 4 & Language == 'Python'\")" ] }, { "cell_type": "markdown", "metadata": { "id": "qy5lKZoxuokj" }, "source": [ "Note the use of single quotes AND double quotes above; luckily we have both in Python! Compare this to the equivalent boolean indexing operation and you can see that `.query()` is much more readable, especially as the query gets bigger!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 81 }, "executionInfo": { "elapsed": 206, "status": "ok", "timestamp": 1674649259042, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "C2LPgHXYuokk", "outputId": "a019b09b-5525-464f-b586-462192bd548d" }, "outputs": [], "source": [ "df[(df['Courses'] > 4) & (df['Language'] == 'Python')]" ] }, { "cell_type": "markdown", "metadata": { "id": "DC_s0KVjuokk" }, "source": [ "Query also allows you to reference variables in the current workspace using the `@` symbol:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "executionInfo": { "elapsed": 287, "status": "ok", "timestamp": 1674649260408, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "V81BMHVjuokk", "outputId": "d8dead80-8418-4a17-e8e1-a0d6898af078" }, "outputs": [], "source": [ "course_threshold = 4\n", "df.query(\"Courses > @course_threshold\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Fsehb6XQuokk" }, "source": [ "#### Indexing 
cheatsheet\n", "\n", "|Method|Syntax|Output|\n", "|---|---|---|\n", "|Select column|`df[col_label]`|Series|\n", "|Select row slice|`df[row_1_int:row_2_int]`|DataFrame|\n", "|Select row/column by label|`df.loc[row_label(s), col_label(s)]`|Object for single selection, Series for one row/column, otherwise DataFrame|\n", "|Select row/column by integer|`df.iloc[row_int(s), col_int(s)]`|Object for single selection, Series for one row/column, otherwise DataFrame|\n", "|Select by row integer & column label|`df.loc[df.index[row_int], col_label]`|Object for single selection, Series for one row/column, otherwise DataFrame|\n", "|Select by row label & column integer|`df.loc[row_label, df.columns[col_int]]`|Object for single selection, Series for one row/column, otherwise DataFrame|\n", "|Select by boolean|`df[bool_vec]`|DataFrame|\n", "|Select by boolean expression|`df.query(\"expression\")`|DataFrame|" ] }, { "cell_type": "markdown", "metadata": { "id": "3pYjvYwcuokk" }, "source": [ "### Reading/Writing Data From External Sources" ] }, { "cell_type": "markdown", "metadata": { "id": "_O-8aHHmuokk" }, "source": [ "#### .csv files" ] }, { "cell_type": "markdown", "metadata": { "id": "-iyA38wbuokk" }, "source": [ "A lot of the time you will be loading .csv files for use in pandas. You can use the `pd.read_csv()` function for this. We'll use a real dataset of weather variables, compiled by the CORGIS Dataset Project from the daily weather reports of the National Oceanic and Atmospheric Administration's National Weather Service. 
`pd.read_csv()` has many arguments that help you read a .csv file efficiently and appropriately; feel free to check them out now (by using `shift + tab` in Jupyter, or typing `help(pd.read_csv)`).\n", "\n", "Pandas facilitates reading directly from a url - `pd.read_csv()` accepts urls as input:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 710 }, "executionInfo": { "elapsed": 487, "status": "ok", "timestamp": 1674649387131, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "2R2fWqFfuokk", "outputId": "2e636f18-1911-4cd5-d5c7-82740b4b4b3b" }, "outputs": [], "source": [ "path = 'https://github.com/ElcoK/BigData_AED/raw/main/week1/weather.csv'\n", "df = pd.read_csv(path, index_col=0, parse_dates=True)\n", "df" ] }, { "cell_type": "markdown", "metadata": { "id": "DOXQTEZduokk" }, "source": [ "You can write a dataframe to a .csv file using `df.to_csv()`. Be sure to check out all of the possible arguments to write your dataframe exactly how you want it." ] }, { "cell_type": "markdown", "metadata": { "id": "IGL3cw8fuokl" }, "source": [ "#### Other\n", "Pandas can read data from all sorts of other file types including HTML, JSON, Excel, Parquet, Feather, etc. There are generally dedicated functions for reading these file types, see the [Pandas documentation here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-tools-text-csv-hdf5)." ] }, { "cell_type": "markdown", "metadata": { "id": "wfmkbDiyuokl" }, "source": [ "### Common DataFrame Operations" ] }, { "cell_type": "markdown", "metadata": { "id": "NGT48Gltuokl" }, "source": [ "DataFrames have built-in methods for performing most common operations, e.g., `.min()`, `.idxmin()`, `.sort_values()`, etc. 
They're all documented in the [Pandas documentation here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) but I'll demonstrate a few below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 374 }, "executionInfo": { "elapsed": 296, "status": "ok", "timestamp": 1674649419756, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "OBZ4BbQ-uokl", "outputId": "7d8da7d2-10a2-4b13-95ab-9cebaafbdd03" }, "outputs": [], "source": [ "df = pd.read_csv(path)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 194, "status": "ok", "timestamp": 1674649421381, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "bt0yLHZUuokl", "outputId": "ebd32214-4687-4d48-9c0e-7e3346f5e7ed" }, "outputs": [], "source": [ "df.columns # to view all the columns within the dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649422749, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "GHV-duIyuokl", "outputId": "36637e1c-174b-4391-8432-5965c37fccc5" }, "outputs": [], "source": [ "df.min()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 1, "status": "ok", "timestamp": 1674649423788, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "LIaBXuHhuokl", "outputId": "0da52743-7d82-442b-b753-c4ff34504778" }, "outputs": [], "source": [ "df['Data.Temperature.Avg Temp'].min()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": 
"https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1674649424335, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "sGYfn-bZuokl", "outputId": "ec849e51-e0c2-43ab-96b1-cf6c48ee67a3" }, "outputs": [], "source": [ "df['Data.Temperature.Avg Temp'].idxmin() # position index of the min temperature" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1674649425179, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "3yxDt0h_uokl", "outputId": "744edb5c-6321-432b-8003-400846a0071f" }, "outputs": [], "source": [ "df.iloc[20]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 277, "status": "ok", "timestamp": 1674649426214, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "N-_aK6GPuokm", "outputId": "fc3fa250-b75a-4bab-ebd7-3b02c8f4d0a7" }, "outputs": [], "source": [ "df.sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "ukuEbmTeuokm" }, "source": [ "Some methods like `.mean()` will only operate on numeric columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 185, "status": "ok", "timestamp": 1674649427778, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "KoBXA8iruokm", "outputId": "2afde7c6-c00a-4504-a074-d326ab948790" }, "outputs": [], "source": [ "df.mean()" ] }, { "cell_type": "markdown", "metadata": { "id": "8LOWzLEcuokm" }, "source": [ "Some methods require arguments to be specified, like `.sort_values()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": 
"https://localhost:8080/", "height": 678 }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1674649429425, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "iaUILGYvuokm", "outputId": "e3fde9db-82d0-4143-8931-6e29c7c7c89f" }, "outputs": [], "source": [ "df.sort_values(by='Data.Temperature.Max Temp')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 374 }, "executionInfo": { "elapsed": 5, "status": "ok", "timestamp": 1674649430548, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "CtaMoiY3uokm", "outputId": "b4ff7955-afcb-4adc-a6cb-0ae70248a502" }, "outputs": [], "source": [ "df.sort_values(by='Data.Temperature.Max Temp', ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": { "id": "AWrS_ZeHuokm" }, "source": [ "Some methods will operate on the index/columns, like `.sort_index()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 678 }, "executionInfo": { "elapsed": 193, "status": "ok", "timestamp": 1674649432477, "user": { "displayName": "RA Odongo", "userId": "17326618845752559881" }, "user_tz": -60 }, "id": "9iQYgYe1uokm", "outputId": "111cfa07-3601-4fa1-b461-de2f4638e0c8" }, "outputs": [], "source": [ "df.sort_index(ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "AnDINdLCuokm" }, "source": [ "## 10. Why ndarrays and Series and DataFrames?" ] }, { "cell_type": "markdown", "metadata": { "id": "_5Z0crgBuokm" }, "source": [ "At this point, you might be asking why we need all these different data structures. Well, they all serve different purposes and are suited to different tasks. 
For example:\n", "- NumPy is typically faster/uses less memory than Pandas;\n", "- not all Python packages are compatible with NumPy & Pandas;\n", "- the ability to add labels to data can be useful (e.g., for time series);\n", "- NumPy and Pandas have different built-in functions available.\n", "\n", "My advice: use the simplest data structure that fulfills your needs!\n", "\n", "Finally, we've seen how to go from: ndarray (`np.array()`) -> series (`pd.Series()`) -> dataframe (`pd.DataFrame()`). Remember that we can also go the other way: dataframe/series -> ndarray using `df.to_numpy()`." ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "vscode": { "interpreter": { "hash": "9aa67eedecbb1544f55b95cced32a93bf08cf46b6b214074d43890cbd05ea300" } } }, "nbformat": 4, "nbformat_minor": 4 }