Some pandas functions that you should start using now!

Posted on November 14, 2020November 14, 2020Analysis, Data science, Python

Pandas is a popular library for exploratory analysis and data wrangling in python. It has a lot of functions and methods to use. Here I listed some functions that you may not know or don’t use them a lot, so let's check them out.
List of functions we are going through are:

Query()
isin()
pct_change() & diff()
nunique() for columns
memory_usage()
describe(include = 'all')
cut()
to_datetime()
startswith() & endswith()

Main data source that I am using for analysis is coming from kaggle: Netflix Shows and Movies . This dataset consists of tv shows and movies available on Netflix as of 2019. I have used it to show applicability of above functions. So let's get started:

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 7)

First loading the data:

netflix_dt = pd.read_csv('C:\\Users\\Pari\\Documents\\Kaggle\\data\\netflix_titles.csv')

netflix_dt.head()

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
0	81145628	Movie	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	Alan Marriott, Andrew Toth, Brian Dobson, Cole...	United States, India, South Korea, China	September 9, 2019	2019	TV-PG	90 min	Children & Family Movies, Comedies	Before planning an awesome wedding for his gra...
1	80117401	Movie	Jandino: Whatever it Takes	NaN	Jandino Asporaat	United Kingdom	September 9, 2016	2016	TV-MA	94 min	Stand-Up Comedy	Jandino Asporaat riffs on the challenges of ra...
2	70234439	TV Show	Transformers Prime	NaN	Peter Cullen, Sumalee Montano, Frank Welker, J...	United States	September 8, 2018	2013	TV-Y7-FV	1 Season	Kids' TV	With the help of three human allies, the Autob...
3	80058654	TV Show	Transformers: Robots in Disguise	NaN	Will Friedle, Darren Criss, Constance Zimmer, ...	United States	September 8, 2018	2016	TV-Y7	1 Season	Kids' TV	When a prison ship crash unleashes hundreds of...
4	80125979	Movie	#realityhigh	Fernando Lebrija	Nesta Cooper, Kate Walsh, John Michael Higgins...	United States	September 8, 2017	2017	TV-14	99 min	Comedies	When nerdy high schooler Dani finally attracts...

Main columns are:

id: identifier number of row
type: type of movie
title: title of movie
director: name of director(s)
cast: list of actors
country: country of production
date_added: added date to netflix
release_year: release year of movie
rating: rating
duration: duration
listed_in: categories
description: extra description

Query()

This function can be used to apply condition on columns of dataframe and filter data.
Here I wanna get the movies with release year after 2000:

netflix_dt.query('release_year>2000').head()

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
0	81145628	Movie	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	Alan Marriott, Andrew Toth, Brian Dobson, Cole...	United States, India, South Korea, China	September 9, 2019	2019	TV-PG	90 min	Children & Family Movies, Comedies	Before planning an awesome wedding for his gra...
1	80117401	Movie	Jandino: Whatever it Takes	NaN	Jandino Asporaat	United Kingdom	September 9, 2016	2016	TV-MA	94 min	Stand-Up Comedy	Jandino Asporaat riffs on the challenges of ra...
2	70234439	TV Show	Transformers Prime	NaN	Peter Cullen, Sumalee Montano, Frank Welker, J...	United States	September 8, 2018	2013	TV-Y7-FV	1 Season	Kids' TV	With the help of three human allies, the Autob...
3	80058654	TV Show	Transformers: Robots in Disguise	NaN	Will Friedle, Darren Criss, Constance Zimmer, ...	United States	September 8, 2018	2016	TV-Y7	1 Season	Kids' TV	When a prison ship crash unleashes hundreds of...
4	80125979	Movie	#realityhigh	Fernando Lebrija	Nesta Cooper, Kate Walsh, John Michael Higgins...	United States	September 8, 2017	2017	TV-14	99 min	Comedies	When nerdy high schooler Dani finally attracts...

It is equivalent to:

netflix_dt[netflix_dt.release_year>2000].head()

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
0	81145628	Movie	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	Alan Marriott, Andrew Toth, Brian Dobson, Cole...	United States, India, South Korea, China	September 9, 2019	2019	TV-PG	90 min	Children & Family Movies, Comedies	Before planning an awesome wedding for his gra...
1	80117401	Movie	Jandino: Whatever it Takes	NaN	Jandino Asporaat	United Kingdom	September 9, 2016	2016	TV-MA	94 min	Stand-Up Comedy	Jandino Asporaat riffs on the challenges of ra...
2	70234439	TV Show	Transformers Prime	NaN	Peter Cullen, Sumalee Montano, Frank Welker, J...	United States	September 8, 2018	2013	TV-Y7-FV	1 Season	Kids' TV	With the help of three human allies, the Autob...
3	80058654	TV Show	Transformers: Robots in Disguise	NaN	Will Friedle, Darren Criss, Constance Zimmer, ...	United States	September 8, 2018	2016	TV-Y7	1 Season	Kids' TV	When a prison ship crash unleashes hundreds of...
4	80125979	Movie	#realityhigh	Fernando Lebrija	Nesta Cooper, Kate Walsh, John Michael Higgins...	United States	September 8, 2017	2017	TV-14	99 min	Comedies	When nerdy high schooler Dani finally attracts...

isin()

isin() can be used when we want to select values in dataframe based on a list of values. The result is boolean, whether each element is matching values of list. As an input parameter, isin() is accepting set or list-like values.
Here I want to get a list of movies which are from the US & UK.

ls_Country = ['United States',' United Kingdom']
netflix_dt[netflix_dt.country.isin(ls_Country)].head()

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
2	70234439	TV Show	Transformers Prime	NaN	Peter Cullen, Sumalee Montano, Frank Welker, J...	United States	September 8, 2018	2013	TV-Y7-FV	1 Season	Kids' TV	With the help of three human allies, the Autob...
3	80058654	TV Show	Transformers: Robots in Disguise	NaN	Will Friedle, Darren Criss, Constance Zimmer, ...	United States	September 8, 2018	2016	TV-Y7	1 Season	Kids' TV	When a prison ship crash unleashes hundreds of...
4	80125979	Movie	#realityhigh	Fernando Lebrija	Nesta Cooper, Kate Walsh, John Michael Higgins...	United States	September 8, 2017	2017	TV-14	99 min	Comedies	When nerdy high schooler Dani finally attracts...
8	80117902	TV Show	Fire Chasers	NaN	NaN	United States	September 8, 2017	2017	TV-MA	1 Season	Docuseries, Science & Nature TV	As California's 2016 fire season rages, brave ...
20	80060297	Movie	Manhattan Romance	Tom O'Brien	Tom O'Brien, Katherine Waterston, Caitlin Fitz...	United States	September 8, 2017	2014	TV-14	98 min	Comedies, Independent Movies, Romantic Movies	A filmmaker working on a documentary about lov...

pct_change() & diff()

pct_change() is calculating the percentage of change within current and prior rows in a numeric based column. It is useful when we have a ordered data and want to calculate rate of change.
Just to display its application, I am applying it on the 'release yea'r column. Consider that the first element is having NaN value since it doesn't have a prior row to be compared with:

netflix_dt.release_year.pct_change()

0            NaN
1      -0.001486
2      -0.001488
3       0.001490
4       0.000496
          ...   
6229    0.002488
6230    0.000496
6231    0.000000
6232   -0.001488
6233   -0.004968
Name: release_year, Length: 6234, dtype: float64

If you need just to calculate the difference of current & prior row, you can use diff():

netflix_dt.release_year.diff()

0        NaN
1       -3.0
2       -3.0
        ... 
6231     0.0
6232    -3.0
6233   -10.0
Name: release_year, Length: 6234, dtype: float64

nunique() for all columns

You may have used nunique() before to calculate unique number of values in a column. It is possible to use the function to find the unique values on all of the columns by calling it directly for the dataframe:

netflix_dt.nunique()

show_id         6234
type               2
title           6172
director        3301
cast            5469
country          554
date_added      1524
release_year      72
rating            14
duration         201
listed_in        461
description     6226
dtype: int64

memory_usage()

In the case that you are interested to know the memory usage of your data, you can use this function. If you apply the function on the data frame, as the output, the function returns the memory usage of each column in bytes:

netflix_dt.memory_usage()

Index             128
show_id         49872
type            49872
title           49872
director        49872
cast            49872
country         49872
date_added      49872
release_year    49872
rating          49872
duration        49872
listed_in       49872
description     49872
dtype: int64

If you need to find out the total memory usage of your dataframe, just sum up memory usage in all the columns:

netflix_dt.memory_usage().sum()

describe(include = 'all')

You use this function to generate descriptive statistics for columns in the data frame, however by default, the function is applied just on the numeric columns. There is a way to apply descriptive statistics for other type of columns by specifying as the parameters, include = 'all' :

netflix_dt.describe(include = 'all')

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
count	6.234000e+03	6234	6234	4265	5664	5758	6223	6234.00000	6224	6234	6234	6234
unique	NaN	2	6172	3301	5469	554	1524	NaN	14	201	461	6226
top	NaN	Movie	The Silence	Raúl Campos, Jan Suter	David Attenborough	United States	January 1, 2020	NaN	TV-MA	1 Season	Documentaries	A surly septuagenarian gets another chance at ...
freq	NaN	4265	3	18	18	2032	122	NaN	2027	1321	299	3
mean	7.670368e+07	NaN	NaN	NaN	NaN	NaN	NaN	2013.35932	NaN	NaN	NaN	NaN
std	1.094296e+07	NaN	NaN	NaN	NaN	NaN	NaN	8.81162	NaN	NaN	NaN	NaN
min	2.477470e+05	NaN	NaN	NaN	NaN	NaN	NaN	1925.00000	NaN	NaN	NaN	NaN
25%	8.003580e+07	NaN	NaN	NaN	NaN	NaN	NaN	2013.00000	NaN	NaN	NaN	NaN
50%	8.016337e+07	NaN	NaN	NaN	NaN	NaN	NaN	2016.00000	NaN	NaN	NaN	NaN
75%	8.024489e+07	NaN	NaN	NaN	NaN	NaN	NaN	2018.00000	NaN	NaN	NaN	NaN
max	8.123573e+07	NaN	NaN	NaN	NaN	NaN	NaN	2020.00000	NaN	NaN	NaN	NaN

Just consider general descriptive statistics such as number of unique values, frequency, count & top value is going to be the output for non-numeric columns.

cut()

I used to know this function in R, which is converting values into discrete intervals. It can be useful when you have many values and you want to group them. As input parameters, it is possible to specify how many bin (groups) you need or even define the threshold for bins.
Here, I want to divide the 'release year' into 3 groups, and the threshold is calculated automatically by cut() function. This calculation convert integer data into float type (so I may end with 1988.33 as threshold), to prevent this, I set Iprecision as 0, so it doesn't convert the values into float:

tmp = pd.cut(netflix_dt.release_year, bins = 3, precision=0)
tmp.value_counts()

(1988.0, 2020.0]    6057
(1957.0, 1988.0]     158
(1925.0, 1957.0]      19
Name: release_year, dtype: int64

You can specifiy also labels for your groups:

tmp = pd.cut(netflix_dt.release_year, bins = 3, precision=0, labels=['old','mid','new'])
tmp.value_counts()

new    6057
mid     158
old      19
Name: release_year, dtype: int64

As I said above, it is possible to define the threshold for bins too:

tmp = pd.cut(netflix_dt.release_year, bins = [1925,1950,1980,2000], precision=0)
tmp.value_counts()

(1980, 2000]    304
(1950, 1980]     90
(1925, 1950]     15
Name: release_year, dtype: int64

to_datetime()

This function is handy to convert your column type into date type. Its power is coming after conversion. You can select directly almost any date based value by using dt().
Here I convert ‘added date’ into a date based column:

type(netflix_dt.date_added)

pandas.core.series.Series

netflix_dt.date_added = pd.to_datetime(netflix_dt.date_added)

I can access datetimelike properties such as year, quarter, weeks, etc. by using dt():

netflix_dt.date_added.dt.year

0       2019.0
1       2016.0
2       2018.0
         ...  
6231       NaN
6232       NaN
6233       NaN
Name: date_added, Length: 6234, dtype: float64

netflix_dt.date_added.dt.quarter

0       3.0
1       3.0
2       3.0
       ... 
6231    NaN
6232    NaN
6233    NaN
Name: date_added, Length: 6234, dtype: float64

netflix_dt.date_added.dt.week

0       37.0
1       36.0
2       36.0
        ... 
6231     NaN
6232     NaN
6233     NaN
Name: date_added, Length: 6234, dtype: float64

startswith() & endswith()

startswith() & endswith() are useful when you want to test if the start/end of each string element matches a pattern. I have used in the cases that my data had many columns and they shared a specific pattern, so by these functions, I could filter specific columns.
As an example, I want to filter the columns which starts/ends with letter 'd':

netflix_dt.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

netflix_dt.columns[netflix_dt.columns.str.startswith('d')]

Index(['director', 'date_added', 'duration', 'description'], dtype='object')

netflix_dt.columns[netflix_dt.columns.str.endswith('d')]

Index(['show_id', 'date_added'], dtype='object')

You can apply it to any other string based data.

That's the end of the list for now. I hope you enjoyed it & learned something new, comment below if you know other less common useful functions in pandas 🙂

Author: Pari

All posts