Some pandas functions that you should start using now!

Pandas is a popular library for exploratory analysis and data wrangling in python. It has a lot of functions and methods to use. Here I listed some functions that you may not know or don’t use them a lot, so let's check them out.
List of functions we are going through are:

  • Query()
  • isin()
  • pct_change() & diff()
  • nunique() for columns
  • memory_usage()
  • describe(include = 'all')
  • cut()
  • to_datetime()
  • startswith() & endswith()

Main data source that I am using for analysis is coming from kaggle: Netflix Shows and Movies . This dataset consists of tv shows and movies available on Netflix as of 2019. I have used it to show applicability of above functions. So let's get started:

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 7)

First loading the data:

netflix_dt = pd.read_csv('C:\\Users\\Pari\\Documents\\Kaggle\\data\\netflix_titles.csv')

netflix_dt.head()

show_id type title director cast country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole... United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes NaN Jandino Asporaat United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
2 70234439 TV Show Transformers Prime NaN Peter Cullen, Sumalee Montano, Frank Welker, J... United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise NaN Will Friedle, Darren Criss, Constance Zimmer, ... United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh Fernando Lebrija Nesta Cooper, Kate Walsh, John Michael Higgins... United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...

Main columns are:

  • id: identifier number of row
  • type: type of movie
  • title: title of movie
  • director: name of director(s)
  • cast: list of actors
  • country: country of production
  • date_added: added date to netflix
  • release_year: release year of movie
  • rating: rating
  • duration: duration
  • listed_in: categories
  • description: extra description

Query()

This function can be used to apply condition on columns of dataframe and filter data.
Here I wanna get the movies with release year after 2000:

netflix_dt.query('release_year>2000').head()

show_id type title director cast country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole... United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes NaN Jandino Asporaat United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
2 70234439 TV Show Transformers Prime NaN Peter Cullen, Sumalee Montano, Frank Welker, J... United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise NaN Will Friedle, Darren Criss, Constance Zimmer, ... United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh Fernando Lebrija Nesta Cooper, Kate Walsh, John Michael Higgins... United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...

It is equivalent to:

netflix_dt[netflix_dt.release_year>2000].head()

show_id type title director cast country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole... United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes NaN Jandino Asporaat United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
2 70234439 TV Show Transformers Prime NaN Peter Cullen, Sumalee Montano, Frank Welker, J... United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise NaN Will Friedle, Darren Criss, Constance Zimmer, ... United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh Fernando Lebrija Nesta Cooper, Kate Walsh, John Michael Higgins... United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...

isin()

isin() can be used when we want to select values in dataframe based on a list of values. The result is boolean, whether each element is matching values of list. As an input parameter, isin() is accepting set or list-like values.
Here I want to get a list of movies which are from the US & UK.

ls_Country = ['United States',' United Kingdom']
netflix_dt[netflix_dt.country.isin(ls_Country)].head()

show_id type title director cast country date_added release_year rating duration listed_in description
2 70234439 TV Show Transformers Prime NaN Peter Cullen, Sumalee Montano, Frank Welker, J... United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise NaN Will Friedle, Darren Criss, Constance Zimmer, ... United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh Fernando Lebrija Nesta Cooper, Kate Walsh, John Michael Higgins... United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...
8 80117902 TV Show Fire Chasers NaN NaN United States September 8, 2017 2017 TV-MA 1 Season Docuseries, Science & Nature TV As California's 2016 fire season rages, brave ...
20 80060297 Movie Manhattan Romance Tom O'Brien Tom O'Brien, Katherine Waterston, Caitlin Fitz... United States September 8, 2017 2014 TV-14 98 min Comedies, Independent Movies, Romantic Movies A filmmaker working on a documentary about lov...

pct_change() & diff()

pct_change() is calculating the percentage of change within current and prior rows in a numeric based column. It is useful when we have a ordered data and want to calculate rate of change.
Just to display its application, I am applying it on the 'release yea'r column. Consider that the first element is having NaN value since it doesn't have a prior row to be compared with:

netflix_dt.release_year.pct_change()

0            NaN
1      -0.001486
2      -0.001488
3       0.001490
4       0.000496
          ...   
6229    0.002488
6230    0.000496
6231    0.000000
6232   -0.001488
6233   -0.004968
Name: release_year, Length: 6234, dtype: float64

If you need just to calculate the difference of current & prior row, you can use diff():

netflix_dt.release_year.diff()

0        NaN
1       -3.0
2       -3.0
        ... 
6231     0.0
6232    -3.0
6233   -10.0
Name: release_year, Length: 6234, dtype: float64

nunique() for all columns

You may have used nunique() before to calculate unique number of values in a column. It is possible to use the function to find the unique values on all of the columns by calling it directly for the dataframe:

netflix_dt.nunique()

show_id         6234
type               2
title           6172
director        3301
cast            5469
country          554
date_added      1524
release_year      72
rating            14
duration         201
listed_in        461
description     6226
dtype: int64

memory_usage()

In the case that you are interested to know the memory usage of your data, you can use this function. If you apply the function on the data frame, as the output, the function returns the memory usage of each column in bytes:

netflix_dt.memory_usage()

Index             128
show_id         49872
type            49872
title           49872
director        49872
cast            49872
country         49872
date_added      49872
release_year    49872
rating          49872
duration        49872
listed_in       49872
description     49872
dtype: int64

If you need to find out the total memory usage of your dataframe, just sum up memory usage in all the columns:

netflix_dt.memory_usage().sum()

598592

describe(include = 'all')

You use this function to generate descriptive statistics for columns in the data frame, however by default, the function is applied just on the numeric columns. There is a way to apply descriptive statistics for other type of columns by specifying as the parameters, include = 'all' :

netflix_dt.describe(include = 'all')

show_id type title director cast country date_added release_year rating duration listed_in description
count 6.234000e+03 6234 6234 4265 5664 5758 6223 6234.00000 6224 6234 6234 6234
unique NaN 2 6172 3301 5469 554 1524 NaN 14 201 461 6226
top NaN Movie The Silence Raúl Campos, Jan Suter David Attenborough United States January 1, 2020 NaN TV-MA 1 Season Documentaries A surly septuagenarian gets another chance at ...
freq NaN 4265 3 18 18 2032 122 NaN 2027 1321 299 3
mean 7.670368e+07 NaN NaN NaN NaN NaN NaN 2013.35932 NaN NaN NaN NaN
std 1.094296e+07 NaN NaN NaN NaN NaN NaN 8.81162 NaN NaN NaN NaN
min 2.477470e+05 NaN NaN NaN NaN NaN NaN 1925.00000 NaN NaN NaN NaN
25% 8.003580e+07 NaN NaN NaN NaN NaN NaN 2013.00000 NaN NaN NaN NaN
50% 8.016337e+07 NaN NaN NaN NaN NaN NaN 2016.00000 NaN NaN NaN NaN
75% 8.024489e+07 NaN NaN NaN NaN NaN NaN 2018.00000 NaN NaN NaN NaN
max 8.123573e+07 NaN NaN NaN NaN NaN NaN 2020.00000 NaN NaN NaN NaN

Just consider general descriptive statistics such as number of unique values, frequency, count & top value is going to be the output for non-numeric columns.

cut()

I used to know this function in R, which is converting values into discrete intervals. It can be useful when you have many values and you want to group them. As input parameters, it is possible to specify how many bin (groups) you need or even define the threshold for bins.
Here, I want to divide the 'release year' into 3 groups, and the threshold is calculated automatically by cut() function. This calculation convert integer data into float type (so I may end with 1988.33 as threshold), to prevent this, I set Iprecision as 0, so it doesn't convert the values into float:

tmp = pd.cut(netflix_dt.release_year, bins = 3, precision=0)
tmp.value_counts()

(1988.0, 2020.0]    6057
(1957.0, 1988.0]     158
(1925.0, 1957.0]      19
Name: release_year, dtype: int64

You can specifiy also labels for your groups:

tmp = pd.cut(netflix_dt.release_year, bins = 3, precision=0, labels=['old','mid','new'])
tmp.value_counts()

new    6057
mid     158
old      19
Name: release_year, dtype: int64

As I said above, it is possible to define the threshold for bins too:

tmp = pd.cut(netflix_dt.release_year, bins = [1925,1950,1980,2000], precision=0)
tmp.value_counts()

(1980, 2000]    304
(1950, 1980]     90
(1925, 1950]     15
Name: release_year, dtype: int64

to_datetime()

This function is handy to convert your column type into date type. Its power is coming after conversion. You can select directly almost any date based value by using dt().
Here I convert ‘added date’ into a date based column:

type(netflix_dt.date_added)

pandas.core.series.Series

netflix_dt.date_added = pd.to_datetime(netflix_dt.date_added)

I can access datetimelike properties such as year, quarter, weeks, etc. by using dt():

netflix_dt.date_added.dt.year

0       2019.0
1       2016.0
2       2018.0
         ...  
6231       NaN
6232       NaN
6233       NaN
Name: date_added, Length: 6234, dtype: float64

netflix_dt.date_added.dt.quarter

0       3.0
1       3.0
2       3.0
       ... 
6231    NaN
6232    NaN
6233    NaN
Name: date_added, Length: 6234, dtype: float64

netflix_dt.date_added.dt.week

0       37.0
1       36.0
2       36.0
        ... 
6231     NaN
6232     NaN
6233     NaN
Name: date_added, Length: 6234, dtype: float64

startswith() & endswith()

startswith() & endswith() are useful when you want to test if the start/end of each string element matches a pattern. I have used in the cases that my data had many columns and they shared a specific pattern, so by these functions, I could filter specific columns.
As an example, I want to filter the columns which starts/ends with letter 'd':

netflix_dt.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

netflix_dt.columns[netflix_dt.columns.str.startswith('d')]

Index(['director', 'date_added', 'duration', 'description'], dtype='object')

netflix_dt.columns[netflix_dt.columns.str.endswith('d')]

Index(['show_id', 'date_added'], dtype='object')

You can apply it to any other string based data.

That's the end of the list for now. I hope you enjoyed it & learned something new, comment below if you know other less common useful functions in pandas 🙂

Author: Pari

Leave a Reply

Your email address will not be published. Required fields are marked *