Some pandas functions that you should start using now!

Pandas is a popular library for exploratory analysis and data wrangling in python. It has a lot of functions and methods to use. Here I listed some functions that you may not know or don’t use them a lot, so let's check them out.
List of functions we are going through are:

  • Query()
  • isin()
  • pct_change() & diff()
  • nunique() for columns
  • memory_usage()
  • describe(include = 'all')
  • cut()
  • to_datetime()
  • startswith() & endswith()

Main data source that I am using for analysis is coming from kaggle: Netflix Shows and Movies . This dataset consists of tv shows and movies available on Netflix as of 2019. I have used it to show applicability of above functions. So let's get started:

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 7)

First loading the data:

netflix_dt = pd.read_csv('C:\\Users\\Pari\\Documents\\Kaggle\\data\\netflix_titles.csv')

netflix_dt.head()
show_idtypetitledirectorcastcountrydate_addedrelease_yearratingdurationlisted_indescription
081145628MovieNorm of the North: King Sized AdventureRichard Finn, Tim MaltbyAlan Marriott, Andrew Toth, Brian Dobson, Cole...United States, India, South Korea, ChinaSeptember 9, 20192019TV-PG90 minChildren & Family Movies, ComediesBefore planning an awesome wedding for his gra...
180117401MovieJandino: Whatever it TakesNaNJandino AsporaatUnited KingdomSeptember 9, 20162016TV-MA94 minStand-Up ComedyJandino Asporaat riffs on the challenges of ra...
270234439TV ShowTransformers PrimeNaNPeter Cullen, Sumalee Montano, Frank Welker, J...United StatesSeptember 8, 20182013TV-Y7-FV1 SeasonKids' TVWith the help of three human allies, the Autob...
380058654TV ShowTransformers: Robots in DisguiseNaNWill Friedle, Darren Criss, Constance Zimmer, ...United StatesSeptember 8, 20182016TV-Y71 SeasonKids' TVWhen a prison ship crash unleashes hundreds of...
480125979Movie#realityhighFernando LebrijaNesta Cooper, Kate Walsh, John Michael Higgins...United StatesSeptember 8, 20172017TV-1499 minComediesWhen nerdy high schooler Dani finally attracts...

Main columns are:

  • id: identifier number of row
  • type: type of movie
  • title: title of movie
  • director: name of director(s)
  • cast: list of actors
  • country: country of production
  • date_added: added date to netflix
  • release_year: release year of movie
  • rating: rating
  • duration: duration
  • listed_in: categories
  • description: extra description

Query()

This function can be used to apply condition on columns of dataframe and filter data.
Here I wanna get the movies with release year after 2000:

netflix_dt.query('release_year>2000').head()
show_idtypetitledirectorcastcountrydate_addedrelease_yearratingdurationlisted_indescription
081145628MovieNorm of the North: King Sized AdventureRichard Finn, Tim MaltbyAlan Marriott, Andrew Toth, Brian Dobson, Cole...United States, India, South Korea, ChinaSeptember 9, 20192019TV-PG90 minChildren & Family Movies, ComediesBefore planning an awesome wedding for his gra...
180117401MovieJandino: Whatever it TakesNaNJandino AsporaatUnited KingdomSeptember 9, 20162016TV-MA94 minStand-Up ComedyJandino Asporaat riffs on the challenges of ra...
270234439TV ShowTransformers PrimeNaNPeter Cullen, Sumalee Montano, Frank Welker, J...United StatesSeptember 8, 20182013TV-Y7-FV1 SeasonKids' TVWith the help of three human allies, the Autob...
380058654TV ShowTransformers: Robots in DisguiseNaNWill Friedle, Darren Criss, Constance Zimmer, ...United StatesSeptember 8, 20182016TV-Y71 SeasonKids' TVWhen a prison ship crash unleashes hundreds of...
480125979Movie#realityhighFernando LebrijaNesta Cooper, Kate Walsh, John Michael Higgins...United StatesSeptember 8, 20172017TV-1499 minComediesWhen nerdy high schooler Dani finally attracts...

It is equivalent to:

netflix_dt[netflix_dt.release_year>2000].head()
show_idtypetitledirectorcastcountrydate_addedrelease_yearratingdurationlisted_indescription
081145628MovieNorm of the North: King Sized AdventureRichard Finn, Tim MaltbyAlan Marriott, Andrew Toth, Brian Dobson, Cole...United States, India, South Korea, ChinaSeptember 9, 20192019TV-PG90 minChildren & Family Movies, ComediesBefore planning an awesome wedding for his gra...
180117401MovieJandino: Whatever it TakesNaNJandino AsporaatUnited KingdomSeptember 9, 20162016TV-MA94 minStand-Up ComedyJandino Asporaat riffs on the challenges of ra...
270234439TV ShowTransformers PrimeNaNPeter Cullen, Sumalee Montano, Frank Welker, J...United StatesSeptember 8, 20182013TV-Y7-FV1 SeasonKids' TVWith the help of three human allies, the Autob...
380058654TV ShowTransformers: Robots in DisguiseNaNWill Friedle, Darren Criss, Constance Zimmer, ...United StatesSeptember 8, 20182016TV-Y71 SeasonKids' TVWhen a prison ship crash unleashes hundreds of...
480125979Movie#realityhighFernando LebrijaNesta Cooper, Kate Walsh, John Michael Higgins...United StatesSeptember 8, 20172017TV-1499 minComediesWhen nerdy high schooler Dani finally attracts...

isin()

isin() can be used when we want to select values in dataframe based on a list of values. The result is boolean, whether each element is matching values of list. As an input parameter, isin() is accepting set or list-like values.
Here I want to get a list of movies which are from the US & UK.

ls_Country = ['United States',' United Kingdom']
netflix_dt[netflix_dt.country.isin(ls_Country)].head()
show_idtypetitledirectorcastcountrydate_addedrelease_yearratingdurationlisted_indescription
270234439TV ShowTransformers PrimeNaNPeter Cullen, Sumalee Montano, Frank Welker, J...United StatesSeptember 8, 20182013TV-Y7-FV1 SeasonKids' TVWith the help of three human allies, the Autob...
380058654TV ShowTransformers: Robots in DisguiseNaNWill Friedle, Darren Criss, Constance Zimmer, ...United StatesSeptember 8, 20182016TV-Y71 SeasonKids' TVWhen a prison ship crash unleashes hundreds of...
480125979Movie#realityhighFernando LebrijaNesta Cooper, Kate Walsh, John Michael Higgins...United StatesSeptember 8, 20172017TV-1499 minComediesWhen nerdy high schooler Dani finally attracts...
880117902TV ShowFire ChasersNaNNaNUnited StatesSeptember 8, 20172017TV-MA1 SeasonDocuseries, Science & Nature TVAs California's 2016 fire season rages, brave ...
2080060297MovieManhattan RomanceTom O'BrienTom O'Brien, Katherine Waterston, Caitlin Fitz...United StatesSeptember 8, 20172014TV-1498 minComedies, Independent Movies, Romantic MoviesA filmmaker working on a documentary about lov...

pct_change() & diff()

pct_change() is calculating the percentage of change within current and prior rows in a numeric based column. It is useful when we have a ordered data and want to calculate rate of change.
Just to display its application, I am applying it on the 'release yea'r column. Consider that the first element is having NaN value since it doesn't have a prior row to be compared with:

netflix_dt.release_year.pct_change()
0            NaN
1      -0.001486
2      -0.001488
3       0.001490
4       0.000496
          ...   
6229    0.002488
6230    0.000496
6231    0.000000
6232   -0.001488
6233   -0.004968
Name: release_year, Length: 6234, dtype: float64

If you need just to calculate the difference of current & prior row, you can use diff():

netflix_dt.release_year.diff()
0        NaN
1       -3.0
2       -3.0
        ... 
6231     0.0
6232    -3.0
6233   -10.0
Name: release_year, Length: 6234, dtype: float64

nunique() for all columns

You may have used nunique() before to calculate unique number of values in a column. It is possible to use the function to find the unique values on all of the columns by calling it directly for the dataframe:

netflix_dt.nunique()
show_id         6234
type               2
title           6172
director        3301
cast            5469
country          554
date_added      1524
release_year      72
rating            14
duration         201
listed_in        461
description     6226
dtype: int64

memory_usage()

In the case that you are interested to know the memory usage of your data, you can use this function. If you apply the function on the data frame, as the output, the function returns the memory usage of each column in bytes:

netflix_dt.memory_usage()
Index             128
show_id         49872
type            49872
title           49872
director        49872
cast            49872
country         49872
date_added      49872
release_year    49872
rating          49872
duration        49872
listed_in       49872
description     49872
dtype: int64

If you need to find out the total memory usage of your dataframe, just sum up memory usage in all the columns:

netflix_dt.memory_usage().sum()
598592

describe(include = 'all')

You use this function to generate descriptive statistics for columns in the data frame, however by default, the function is applied just on the numeric columns. There is a way to apply descriptive statistics for other type of columns by specifying as the parameters, include = 'all' :

netflix_dt.describe(include = 'all')
show_idtypetitledirectorcastcountrydate_addedrelease_yearratingdurationlisted_indescription
count6.234000e+036234623442655664575862236234.000006224623462346234
uniqueNaN26172330154695541524NaN142014616226
topNaNMovieThe SilenceRaúl Campos, Jan SuterDavid AttenboroughUnited StatesJanuary 1, 2020NaNTV-MA1 SeasonDocumentariesA surly septuagenarian gets another chance at ...
freqNaN4265318182032122NaN202713212993
mean7.670368e+07NaNNaNNaNNaNNaNNaN2013.35932NaNNaNNaNNaN
std1.094296e+07NaNNaNNaNNaNNaNNaN8.81162NaNNaNNaNNaN
min2.477470e+05NaNNaNNaNNaNNaNNaN1925.00000NaNNaNNaNNaN
25%8.003580e+07NaNNaNNaNNaNNaNNaN2013.00000NaNNaNNaNNaN
50%8.016337e+07NaNNaNNaNNaNNaNNaN2016.00000NaNNaNNaNNaN
75%8.024489e+07NaNNaNNaNNaNNaNNaN2018.00000NaNNaNNaNNaN
max8.123573e+07NaNNaNNaNNaNNaNNaN2020.00000NaNNaNNaNNaN

Just consider general descriptive statistics such as number of unique values, frequency, count & top value is going to be the output for non-numeric columns.

cut()

I used to know this function in R, which is converting values into discrete intervals. It can be useful when you have many values and you want to group them. As input parameters, it is possible to specify how many bin (groups) you need or even define the threshold for bins.
Here, I want to divide the 'release year' into 3 groups, and the threshold is calculated automatically by cut() function. This calculation convert integer data into float type (so I may end with 1988.33 as threshold), to prevent this, I set Iprecision as 0, so it doesn't convert the values into float:

tmp = pd.cut(netflix_dt.release_year, bins = 3, precision=0)
tmp.value_counts()
(1988.0, 2020.0]    6057
(1957.0, 1988.0]     158
(1925.0, 1957.0]      19
Name: release_year, dtype: int64

You can specifiy also labels for your groups:

tmp = pd.cut(netflix_dt.release_year, bins = 3, precision=0, labels=['old','mid','new'])
tmp.value_counts()
new    6057
mid     158
old      19
Name: release_year, dtype: int64

As I said above, it is possible to define the threshold for bins too:

tmp = pd.cut(netflix_dt.release_year, bins = [1925,1950,1980,2000], precision=0)
tmp.value_counts()
(1980, 2000]    304
(1950, 1980]     90
(1925, 1950]     15
Name: release_year, dtype: int64

to_datetime()

This function is handy to convert your column type into date type. Its power is coming after conversion. You can select directly almost any date based value by using dt().
Here I convert ‘added date’ into a date based column:

type(netflix_dt.date_added)
pandas.core.series.Series

netflix_dt.date_added = pd.to_datetime(netflix_dt.date_added)

I can access datetimelike properties such as year, quarter, weeks, etc. by using dt():

netflix_dt.date_added.dt.year
0       2019.0
1       2016.0
2       2018.0
         ...  
6231       NaN
6232       NaN
6233       NaN
Name: date_added, Length: 6234, dtype: float64

netflix_dt.date_added.dt.quarter
0       3.0
1       3.0
2       3.0
       ... 
6231    NaN
6232    NaN
6233    NaN
Name: date_added, Length: 6234, dtype: float64

netflix_dt.date_added.dt.week
0       37.0
1       36.0
2       36.0
        ... 
6231     NaN
6232     NaN
6233     NaN
Name: date_added, Length: 6234, dtype: float64

startswith() & endswith()

startswith() & endswith() are useful when you want to test if the start/end of each string element matches a pattern. I have used in the cases that my data had many columns and they shared a specific pattern, so by these functions, I could filter specific columns.
As an example, I want to filter the columns which starts/ends with letter 'd':

netflix_dt.columns
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

netflix_dt.columns[netflix_dt.columns.str.startswith('d')]
Index(['director', 'date_added', 'duration', 'description'], dtype='object')

netflix_dt.columns[netflix_dt.columns.str.endswith('d')]
Index(['show_id', 'date_added'], dtype='object')

You can apply it to any other string based data.

That's the end of the list for now. I hope you enjoyed it & learned something new, comment below if you know other less common useful functions in pandas 🙂

Author: Pari

Leave a Reply

Your email address will not be published. Required fields are marked *