Some pandas functions that you should start using now!
Pandas is a popular library for exploratory analysis and data wrangling in python. It has a lot of functions and methods to use. Here I listed some functions that you may not know or don’t use them a lot, so let's check them out.
List of functions we are going through are:
- Query()
- isin()
- pct_change() & diff()
- nunique() for columns
- memory_usage()
- describe(include = 'all')
- cut()
- to_datetime()
- startswith() & endswith()
Main data source that I am using for analysis is coming from kaggle: Netflix Shows and Movies . This dataset consists of tv shows and movies available on Netflix as of 2019. I have used it to show applicability of above functions. So let's get started:
import numpy as np import pandas as pd pd.set_option('display.max_rows', 7)
First loading the data:
netflix_dt = pd.read_csv('C:\\Users\\Pari\\Documents\\Kaggle\\data\\netflix_titles.csv') netflix_dt.head()
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... |
1 | 80117401 | Movie | Jandino: Whatever it Takes | NaN | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... |
2 | 70234439 | TV Show | Transformers Prime | NaN | Peter Cullen, Sumalee Montano, Frank Welker, J... | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids' TV | With the help of three human allies, the Autob... |
3 | 80058654 | TV Show | Transformers: Robots in Disguise | NaN | Will Friedle, Darren Criss, Constance Zimmer, ... | United States | September 8, 2018 | 2016 | TV-Y7 | 1 Season | Kids' TV | When a prison ship crash unleashes hundreds of... |
4 | 80125979 | Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins... | United States | September 8, 2017 | 2017 | TV-14 | 99 min | Comedies | When nerdy high schooler Dani finally attracts... |
Main columns are:
- id: identifier number of row
- type: type of movie
- title: title of movie
- director: name of director(s)
- cast: list of actors
- country: country of production
- date_added: added date to netflix
- release_year: release year of movie
- rating: rating
- duration: duration
- listed_in: categories
- description: extra description
Query()
This function can be used to apply condition on columns of dataframe and filter data.
Here I wanna get the movies with release year after 2000:
netflix_dt.query('release_year>2000').head()
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... |
1 | 80117401 | Movie | Jandino: Whatever it Takes | NaN | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... |
2 | 70234439 | TV Show | Transformers Prime | NaN | Peter Cullen, Sumalee Montano, Frank Welker, J... | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids' TV | With the help of three human allies, the Autob... |
3 | 80058654 | TV Show | Transformers: Robots in Disguise | NaN | Will Friedle, Darren Criss, Constance Zimmer, ... | United States | September 8, 2018 | 2016 | TV-Y7 | 1 Season | Kids' TV | When a prison ship crash unleashes hundreds of... |
4 | 80125979 | Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins... | United States | September 8, 2017 | 2017 | TV-14 | 99 min | Comedies | When nerdy high schooler Dani finally attracts... |
It is equivalent to:
netflix_dt[netflix_dt.release_year>2000].head()
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... |
1 | 80117401 | Movie | Jandino: Whatever it Takes | NaN | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... |
2 | 70234439 | TV Show | Transformers Prime | NaN | Peter Cullen, Sumalee Montano, Frank Welker, J... | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids' TV | With the help of three human allies, the Autob... |
3 | 80058654 | TV Show | Transformers: Robots in Disguise | NaN | Will Friedle, Darren Criss, Constance Zimmer, ... | United States | September 8, 2018 | 2016 | TV-Y7 | 1 Season | Kids' TV | When a prison ship crash unleashes hundreds of... |
4 | 80125979 | Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins... | United States | September 8, 2017 | 2017 | TV-14 | 99 min | Comedies | When nerdy high schooler Dani finally attracts... |
isin()
isin() can be used when we want to select values in dataframe based on a list of values. The result is boolean, whether each element is matching values of list. As an input parameter, isin() is accepting set or list-like values.
Here I want to get a list of movies which are from the US & UK.
ls_Country = ['United States',' United Kingdom'] netflix_dt[netflix_dt.country.isin(ls_Country)].head()
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 70234439 | TV Show | Transformers Prime | NaN | Peter Cullen, Sumalee Montano, Frank Welker, J... | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids' TV | With the help of three human allies, the Autob... |
3 | 80058654 | TV Show | Transformers: Robots in Disguise | NaN | Will Friedle, Darren Criss, Constance Zimmer, ... | United States | September 8, 2018 | 2016 | TV-Y7 | 1 Season | Kids' TV | When a prison ship crash unleashes hundreds of... |
4 | 80125979 | Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins... | United States | September 8, 2017 | 2017 | TV-14 | 99 min | Comedies | When nerdy high schooler Dani finally attracts... |
8 | 80117902 | TV Show | Fire Chasers | NaN | NaN | United States | September 8, 2017 | 2017 | TV-MA | 1 Season | Docuseries, Science & Nature TV | As California's 2016 fire season rages, brave ... |
20 | 80060297 | Movie | Manhattan Romance | Tom O'Brien | Tom O'Brien, Katherine Waterston, Caitlin Fitz... | United States | September 8, 2017 | 2014 | TV-14 | 98 min | Comedies, Independent Movies, Romantic Movies | A filmmaker working on a documentary about lov... |
pct_change() & diff()
pct_change() is calculating the percentage of change within current and prior rows in a numeric based column. It is useful when we have a ordered data and want to calculate rate of change.
Just to display its application, I am applying it on the 'release yea'r column. Consider that the first element is having NaN value since it doesn't have a prior row to be compared with:
netflix_dt.release_year.pct_change()
0 NaN
1 -0.001486
2 -0.001488
3 0.001490
4 0.000496
...
6229 0.002488
6230 0.000496
6231 0.000000
6232 -0.001488
6233 -0.004968
Name: release_year, Length: 6234, dtype: float64
If you need just to calculate the difference of current & prior row, you can use diff():
netflix_dt.release_year.diff()
0 NaN
1 -3.0
2 -3.0
...
6231 0.0
6232 -3.0
6233 -10.0
Name: release_year, Length: 6234, dtype: float64
nunique() for all columns
You may have used nunique() before to calculate unique number of values in a column. It is possible to use the function to find the unique values on all of the columns by calling it directly for the dataframe:
netflix_dt.nunique()
show_id 6234
type 2
title 6172
director 3301
cast 5469
country 554
date_added 1524
release_year 72
rating 14
duration 201
listed_in 461
description 6226
dtype: int64
memory_usage()
In the case that you are interested to know the memory usage of your data, you can use this function. If you apply the function on the data frame, as the output, the function returns the memory usage of each column in bytes:
netflix_dt.memory_usage()
Index 128
show_id 49872
type 49872
title 49872
director 49872
cast 49872
country 49872
date_added 49872
release_year 49872
rating 49872
duration 49872
listed_in 49872
description 49872
dtype: int64
If you need to find out the total memory usage of your dataframe, just sum up memory usage in all the columns:
netflix_dt.memory_usage().sum()
598592
describe(include = 'all')
You use this function to generate descriptive statistics for columns in the data frame, however by default, the function is applied just on the numeric columns. There is a way to apply descriptive statistics for other type of columns by specifying as the parameters, include = 'all' :
netflix_dt.describe(include = 'all')
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 6.234000e+03 | 6234 | 6234 | 4265 | 5664 | 5758 | 6223 | 6234.00000 | 6224 | 6234 | 6234 | 6234 |
unique | NaN | 2 | 6172 | 3301 | 5469 | 554 | 1524 | NaN | 14 | 201 | 461 | 6226 |
top | NaN | Movie | The Silence | Raúl Campos, Jan Suter | David Attenborough | United States | January 1, 2020 | NaN | TV-MA | 1 Season | Documentaries | A surly septuagenarian gets another chance at ... |
freq | NaN | 4265 | 3 | 18 | 18 | 2032 | 122 | NaN | 2027 | 1321 | 299 | 3 |
mean | 7.670368e+07 | NaN | NaN | NaN | NaN | NaN | NaN | 2013.35932 | NaN | NaN | NaN | NaN |
std | 1.094296e+07 | NaN | NaN | NaN | NaN | NaN | NaN | 8.81162 | NaN | NaN | NaN | NaN |
min | 2.477470e+05 | NaN | NaN | NaN | NaN | NaN | NaN | 1925.00000 | NaN | NaN | NaN | NaN |
25% | 8.003580e+07 | NaN | NaN | NaN | NaN | NaN | NaN | 2013.00000 | NaN | NaN | NaN | NaN |
50% | 8.016337e+07 | NaN | NaN | NaN | NaN | NaN | NaN | 2016.00000 | NaN | NaN | NaN | NaN |
75% | 8.024489e+07 | NaN | NaN | NaN | NaN | NaN | NaN | 2018.00000 | NaN | NaN | NaN | NaN |
max | 8.123573e+07 | NaN | NaN | NaN | NaN | NaN | NaN | 2020.00000 | NaN | NaN | NaN | NaN |
Just consider general descriptive statistics such as number of unique values, frequency, count & top value is going to be the output for non-numeric columns.
cut()
I used to know this function in R, which is converting values into discrete intervals. It can be useful when you have many values and you want to group them. As input parameters, it is possible to specify how many bin (groups) you need or even define the threshold for bins.
Here, I want to divide the 'release year' into 3 groups, and the threshold is calculated automatically by cut() function. This calculation convert integer data into float type (so I may end with 1988.33 as threshold), to prevent this, I set Iprecision as 0, so it doesn't convert the values into float:
tmp = pd.cut(netflix_dt.release_year, bins = 3, precision=0) tmp.value_counts()
(1988.0, 2020.0] 6057
(1957.0, 1988.0] 158
(1925.0, 1957.0] 19
Name: release_year, dtype: int64
You can specifiy also labels for your groups:
tmp = pd.cut(netflix_dt.release_year, bins = 3, precision=0, labels=['old','mid','new']) tmp.value_counts()
new 6057
mid 158
old 19
Name: release_year, dtype: int64
As I said above, it is possible to define the threshold for bins too:
tmp = pd.cut(netflix_dt.release_year, bins = [1925,1950,1980,2000], precision=0) tmp.value_counts()
(1980, 2000] 304
(1950, 1980] 90
(1925, 1950] 15
Name: release_year, dtype: int64
to_datetime()
This function is handy to convert your column type into date type. Its power is coming after conversion. You can select directly almost any date based value by using dt().
Here I convert ‘added date’ into a date based column:
type(netflix_dt.date_added)
pandas.core.series.Series
netflix_dt.date_added = pd.to_datetime(netflix_dt.date_added)
I can access datetimelike properties such as year, quarter, weeks, etc. by using dt():
netflix_dt.date_added.dt.year
0 2019.0
1 2016.0
2 2018.0
...
6231 NaN
6232 NaN
6233 NaN
Name: date_added, Length: 6234, dtype: float64
netflix_dt.date_added.dt.quarter
0 3.0
1 3.0
2 3.0
...
6231 NaN
6232 NaN
6233 NaN
Name: date_added, Length: 6234, dtype: float64
netflix_dt.date_added.dt.week
0 37.0
1 36.0
2 36.0
...
6231 NaN
6232 NaN
6233 NaN
Name: date_added, Length: 6234, dtype: float64
startswith() & endswith()
startswith() & endswith() are useful when you want to test if the start/end of each string element matches a pattern. I have used in the cases that my data had many columns and they shared a specific pattern, so by these functions, I could filter specific columns.
As an example, I want to filter the columns which starts/ends with letter 'd':
netflix_dt.columns
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
'release_year', 'rating', 'duration', 'listed_in', 'description'],
dtype='object')
netflix_dt.columns[netflix_dt.columns.str.startswith('d')]
Index(['director', 'date_added', 'duration', 'description'], dtype='object')
netflix_dt.columns[netflix_dt.columns.str.endswith('d')]
Index(['show_id', 'date_added'], dtype='object')
You can apply it to any other string based data.
That's the end of the list for now. I hope you enjoyed it & learned something new, comment below if you know other less common useful functions in pandas 🙂