The 80/20 Pandas Tutorial
Ahh, pandas. In addition to being everyone’s favorite mostly vegetarian bear from south central China, it’s also the Python library for working with tabular data, a.k.a. dataframes. When you dive into pandas, you’ll quickly find out that there is a lot going on; indeed there are hundreds of methods for operating on dataframes. But luckily for us, as with many areas of life, there is a Pareto Principle, or 80/20 rule, that will help us focus on the small set of methods that collectively solve the majority of our data transformation needs.
If you’re like me, then pandas is not your first data-handling tool; maybe you’ve been using SQL or R with data.table or dplyr. If so, that’s great because you already have a sense for the key operations we need when working with tabular data. In their book, R for Data Science, Garrett Grolemund and Hadley Wickham describe five essential operations for manipulating dataframes. I’ve found that these cover the majority of my data transformation tasks to prepare data for analysis, visualization, and modeling.
- filtering rows based on data values
- sorting rows based on data values
- selecting columns by name
- adding new columns based on the existing columns
- creating grouped summaries of the dataset
I would add that we also need a way to build up more complex transformations by chaining these fundamental operations together sequentially.
Before we dive in, here’s the TLDR on the pandas methods that I prefer for accomplishing these tasks, along with their equivalents from SQL and dplyr in R.
description | pandas | SQL | dplyr |
---|---|---|---|
filter rows based on data values | query() | WHERE | filter() |
sort rows based on data values | sort_values() | ORDER BY | arrange() |
select columns by name | filter() | SELECT | select() |
add new columns based on the existing columns | assign() | AS | mutate() |
create grouped summaries of the dataset | groupby() apply() | GROUP BY | group_by() summarise() |
chain operations together | . | | %>% |
Imports and Data
We’ll use the nycflights13 dataset, which contains data on flights that departed from New York City in 2013; the CSV we load below is a sample of about 33k of the roughly 300k flights in the full dataset.
Typically if I’m only going to be using a single dataframe, I’ll use the name “df”. This is a pretty strong convention in pandas; you can see the name “df” used all over the pandas documentation, so it makes your code easier for others to understand. If there will be more than one dataframe, I suggest prepending a meaningful name to the “df”, e.g. flights_df.
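Here are the imports and the data load:

import pandas as pd
import numpy as np

# read the flights CSV straight from the web; for HTTP URLs, storage_options
# forwards the User-Agent header along with the request
url = 'https://www.openintro.org/book/statdata/nycflights.csv'
storage_options = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
df = pd.read_csv(url, storage_options=storage_options)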
Let’s have a look at the dataframe structure using the info() method.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32735 entries, 0 to 32734
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 32735 non-null int64
1 month 32735 non-null int64
2 day 32735 non-null int64
3 dep_time 32735 non-null int64
4 dep_delay 32735 non-null int64
5 arr_time 32735 non-null int64
6 arr_delay 32735 non-null int64
7 carrier 32735 non-null object
8 tailnum 32735 non-null object
9 flight 32735 non-null int64
10 origin 32735 non-null object
11 dest 32735 non-null object
12 air_time 32735 non-null int64
13 distance 32735 non-null int64
14 hour 32735 non-null int64
15 minute 32735 non-null int64
dtypes: int64(12), object(4)
memory usage: 4.0+ MB
Select rows based on their values with query()
query() lets you retain a subset of rows based on the values of the data; it’s like dplyr::filter() in R or WHERE in SQL. Its argument is a string specifying the condition to be met for rows to be included in the result. You specify the condition as an expression involving the column names and comparison operators like <, >, <=, >=, == (equal), and != (not equal). You can specify compound expressions using and and or, and you can even check if the column value matches any items in a list.
# compare one column to a value
df.query('month == 6')

# compare two column values
df.query('arr_delay > dep_delay')

# using arithmetic
df.query('arr_delay > 0.5 * air_time')

# using "and"
df.query('month == 6 and day == 1')

# using "or"
df.query('origin == "JFK" or dest == "JFK"')

# column value matching any item in a list
df.query('carrier in ["AA", "UA"]')
You may have noticed that it seems to be much more popular to filter pandas dataframes using boolean indexing. Indeed, when I ask my favorite search engine how to filter a pandas dataframe on its values, I find tutorials, blog posts, various questions on Stack Overflow, and even the pandas documentation, all espousing boolean indexing. Here’s what it looks like.
# canonical boolean indexing
df[(df['carrier'] == "AA") & (df['origin'] == "JFK")]

# the equivalent use of query()
df.query('carrier == "AA" and origin == "JFK"')
There are a few reasons I prefer query() over boolean indexing.
- query() does not require me to type the dataframe name again, whereas boolean indexing requires me to type it every time I wish to refer to a column.
- query() makes the code easier to read and understand, especially when expressions get complex.
- query() can be more computationally efficient than boolean indexing, especially on large dataframes, since it evaluates the expression with numexpr when it’s available.
- query() can safely be used in dot chains, which we’ll see very soon.
Select columns by name with filter()
filter() lets you pick out a specific set of columns by name; it’s analogous to dplyr::select() in R or SELECT in SQL. You can either provide exactly the column names you want, or you can grab all columns whose names contain a given substring or which match a given regular expression. This isn’t a big deal when your dataframe has only a few columns, but it is particularly useful when you have a dataframe with tens or hundreds of columns.
# select a list of columns
df.filter(['origin', 'dest'])

# select columns containing a particular substring
df.filter(like='time')

# select columns matching a regular expression
df.filter(regex='e$')
Sort rows with sort_values()
sort_values() changes the order of the rows based on the data values; it’s like dplyr::arrange() in R or ORDER BY in SQL. You can specify one or more columns on which to sort, where their order denotes the sorting priority. You can also specify whether to sort in ascending or descending order.
# sort by a single column
df.sort_values('air_time')

# sort by a single column in descending order
df.sort_values('air_time', ascending=False)

# sort by carrier, then within carrier, sort by descending distance
df.sort_values(['carrier', 'distance'], ascending=[True, False])
Add new columns with assign()
assign() adds new columns which can be functions of the existing columns; it’s like dplyr::mutate() from R.
# add a new column based on other columns
df.assign(speed = lambda x: x.distance / x.air_time)

# another new column based on existing columns
df.assign(gain = lambda x: x.dep_delay - x.arr_delay)
If you’re like me, this way of using assign() might seem a little strange at first. Let’s break it down. In the call to assign(), the keyword argument speed tells pandas the name of our new column. The business to the right of the = is an inline lambda function that takes the dataframe we passed to assign() and returns the column we want to add.

I like using x as the lambda argument because it’s easy to type and it evokes tabular data (think design matrix), which reminds me that it refers to the entire dataframe. We can then access the other columns in our dataframe using the dot, like x.other_column.
It’s true that you can skip the whole lambda business and refer directly to the dataframe you’re assigning to inside the call to assign(). That might look like this.

df.assign(speed = df.distance / df.air_time)
I prefer using a lambda for the following reasons.
- Using the lambda will save you from typing the dataframe name every time you want to refer to a column.
- The lambda makes your code more portable. Since you refer to the dataframe as a generic x, you can reuse this same assignment code on a dataframe with a different name (see the sketch after this list).
- Most importantly, the lambda will allow you to harness the power of dot chaining.
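For example (a small sketch, where jfk_df is just a made-up name), the exact same assignment runs unchanged on a differently named dataframe:

# the lambda receives whatever dataframe assign() was called on,
# so this line doesn't care what that dataframe is named
jfk_df = df.query('origin == "JFK"')
jfk_df.assign(speed = lambda x: x.distance / x.air_time)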
Chain transformations together with the dot chain
One of the awesome things about pandas is that the object.method() paradigm lets us easily build up complex dataframe transformations from a sequence of method calls. In R, this is effectively accomplished by the pipe operator %>%. For example, suppose we want to look at high-speed flights from JFK to Honolulu, which would require us to query for JFK to Honolulu flights, assign a speed column, and maybe sort on that new speed column.
We can say:
# neatly chain method calls together
(
    df
    .query('origin == "JFK"')
    .query('dest == "HNL"')
    .assign(speed = lambda x: x.distance / x.air_time)
    .sort_values(by='speed', ascending=False)
    .query('speed > 8.0')
)
We compose the dot chain by wrapping the entire expression in parentheses and indenting each line within. The first line is the name of the dataframe on which we are operating. Each subsequent line has a single method call.
There are a few great things about writing the code this way:
- Readability - It’s easy to scan down the left margin of the code to see what’s happening. The first line gives us our noun (the dataframe) and each subsequent line starts with a verb. You could read this as “take df, then query the rows where origin is JFK, then query for rows where destination is HNL, then assign a new column called speed, then sort the dataframe by speed, then query only for the rows where speed is greater than 8.0.”
- Flexibility - It’s easy to comment out individual lines and re-run the cell. It’s also easy to reorder operations, since only one thing happens on each line.
- Neatness - We have not polluted our workspace with any intermediate variables, nor have we wasted any mental energy thinking of names for any temporary variables.
By default, dot chains do not modify the original dataframe; they just return a temporary result that we can inspect directly in the output. If you want to store the result, or pass it along to another function (e.g. for plotting), you can simply assign the entire dot chain to a variable.
# store the output of the dot chain in a new dataframe
high_speed_flights_df = (
    df
    .assign(speed = lambda x: x.distance / x.air_time)
    .query('speed > 8.0')
)
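And because the whole chain is just an expression, you can also hand its result straight to another function, for instance to plot it. Here’s a rough sketch (it assumes matplotlib is installed, since pandas uses it for plotting):

# pass the chained result along, e.g. for a quick histogram of speeds
(
    df
    .assign(speed = lambda x: x.distance / x.air_time)
    .query('speed > 8.0')
    .speed
    .plot(kind='hist')
)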
Collapse rows into grouped summaries with groupby()
groupby() combined with apply() gives us flexibility and control over our grouped summaries; it’s like dplyr::group_by() and dplyr::summarise() in R. This is the primary pattern I use for SQL-style groupby operations in pandas. Specifically, it unlocks the following essential functionality you’re used to having in SQL.
- specify the names of the aggregation columns we create
- specify which aggregation function to use on which columns
- compose more complex aggregations such as the proportion of rows meeting some condition
- aggregate over arbitrary functions of multiple columns
Let’s check out the departure delay stats for each carrier.
# grouped summary with groupby and apply
(
    df
    .groupby(['carrier'])
    .apply(lambda d: pd.Series({
        'n_flights': len(d),
        'med_delay': d.dep_delay.median(),
        'avg_delay': d.dep_delay.mean(),
    }))
    .head()
)
carrier | n_flights | med_delay | avg_delay |
---|---|---|---|
9E | 1696.0 | -1.0 | 17.285967 |
AA | 3188.0 | -2.0 | 9.142409 |
AS | 66.0 | -4.5 | 5.181818 |
B6 | 5376.0 | -1.0 | 13.137091 |
DL | 4751.0 | -2.0 | 8.529573 |
While you might be used to apply() acting over the rows or columns of a dataframe, here we’re calling apply on a grouped dataframe object, so it’s acting over the groups. According to the pandas documentation:
The function passed to apply must take a dataframe as its first argument and return a dataframe, a series or a scalar. apply will then take care of combining the results back together into a single dataframe or series. apply is therefore a highly flexible grouping method.
We need to supply apply() with a function that takes each chunk of the grouped dataframe and returns (in our case) a series object with one element for each new aggregation column. Notice that I use a lambda to specify the function we pass to apply(), and that I name its argument d, which reminds me that it’s a dataframe. My lambda returns a pandas series whose index entries specify the new aggregation column names, and whose values constitute the results of the aggregations for each group. Pandas will then stitch everything back together into a lovely dataframe.
Notice how nice the code looks when we use this pattern. Each aggregation is specified on its own line, which makes it easy to see what aggregation columns we’re creating and allows us to comment, uncomment, and reorder the aggregations without breaking anything.
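Incidentally, the lambda isn’t required; any function that takes a dataframe and returns a series will do. Here’s the same summary written with a named function (a small sketch, with delay_stats as a made-up name):

# the same grouped summary with a named function instead of a lambda
def delay_stats(d):
    # d is the sub-dataframe for a single carrier
    return pd.Series({
        'n_flights': len(d),
        'med_delay': d.dep_delay.median(),
        'avg_delay': d.dep_delay.mean(),
    })

df.groupby(['carrier']).apply(delay_stats).head()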
Here are some more complex aggregations to illustrate some useful patterns.
# more complex grouped summary
(
    df
    .groupby(['carrier'])
    .apply(lambda d: pd.Series({
        'avg_gain': np.mean(d.dep_delay - d.arr_delay),
        'pct_delay_gt_30': np.mean(d.dep_delay > 30),
        'pct_late_dep_early_arr': np.mean((d.dep_delay > 0) & (d.arr_delay < 0)),
        'avg_arr_given_dep_delay_gt_0': d.query('dep_delay > 0').arr_delay.mean(),
        'cor_arr_delay_dep_delay': np.corrcoef(d.dep_delay, d.arr_delay)[0,1],
    }))
    .head()
)
carrier | avg_gain | pct_delay_gt_30 | pct_late_dep_early_arr | avg_arr_given_dep_delay_gt_0 | cor_arr_delay_dep_delay |
---|---|---|---|---|---|
9E | 9.247642 | 0.196934 | 0.110259 | 39.086111 | 0.932485 |
AA | 7.743726 | 0.113237 | 0.105395 | 30.087165 | 0.891013 |
AS | 16.515152 | 0.106061 | 0.121212 | 28.058824 | 0.864565 |
B6 | 3.411458 | 0.160528 | 0.084449 | 37.306866 | 0.914180 |
DL | 7.622816 | 0.097874 | 0.100821 | 30.078029 | 0.899327 |
Here’s what’s happening.
- np.mean(d.dep_delay - d.arr_delay) aggregates over the difference of two columns.
- np.mean(d.dep_delay > 30) computes the proportion of rows where the delay is greater than 30 minutes. Generating a boolean series based on some condition and then using mean() to find the proportion comes up all the time.
- np.mean((d.dep_delay > 0) & (d.arr_delay < 0)) shows that we can compute proportions where conditions on multiple columns are met.
- d.query('dep_delay > 0').arr_delay.mean() computes the average arrival delay on flights where the departure was delayed. Here we first filter each grouped dataframe down to the subset of rows where departure delay is greater than zero using query(), and then we take the mean of the remaining arrival delays.
- np.corrcoef(d.dep_delay, d.arr_delay)[0,1] computes the correlation coefficient between departure and arrival delays. Remember we can use pretty much any reduction operation to collapse values down to a scalar.
You might have noticed that the canonical pandas approach for grouped summaries is to use agg(). That works well if you need to apply the same aggregation function on each column in the dataframe, e.g. taking the mean of every column. But because of the kind of data I work with these days, it’s much more common for me to use customized aggregations like those above, so the groupby() and apply() idiom works best for me.
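For reference, here’s roughly what the simple delay summary from earlier might look like with agg() and named aggregations (a sketch, assuming pandas 0.25 or later):

# the agg() alternative, using named aggregations
(
    df
    .groupby('carrier')
    .agg(
        n_flights=('dep_delay', 'size'),
        med_delay=('dep_delay', 'median'),
        avg_delay=('dep_delay', 'mean'),
    )
    .head()
)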
Wrapping Up
There you have it: how to pull off the five most essential data transformation tasks using pandas in a style reminiscent of my beloved dplyr. Remember that part of the beauty of pandas is that since there are so many ways to do most tasks, you can develop your own style based on the kind of data you work with, what you like about other tools, how you see others using the tools, and of course your own taste and preferences.
If you found this post helpful or if you have your own preferred style for accomplishing any of these key transformations with pandas, do let me know about it in the comments.