PickupBrain

Boolean Indexing in Pandas

Apart from selecting data from row/column labels or integer location, Pandas also has a very useful feature that allows selecting data based on boolean index, i.e. True or False. This is boolean indexing in Pandas. It is one of the most useful feature that quickly filters out useless data from dataframe.

Believe me, it has lots of use cases and is helpful to select subset of data based on actual value of data instead of its index or location.

Use Case

Let’s say that you have a vast set of an ecommerce data. And you want answers for couple of queries like:

  • List of all userid and the total payment received from them in the year 2019.
  • All userids which have bought for more than 10,000 Rs. of value in 2018 but not bought at all in 2019.
  • List of all userids who have bought on sales day.
  • List of all purchaser from ‘California’.

Each of the above queries have a logical criteria that must be checked one row at a time.

If you manually answer the above queries, you would need to scan each row and determine whether the row meets the criteria or not. If the row meets the criteria, then it is kept and if not, then it is discarded.

But by using Boolean indexing in Pandas it is so easy to answer

Introduction to Boolean Indexing in Pandas

The Boolean values like ‘True’ and ‘False’ can be used as index in Pandas DataFrame. It can also be used to filter out the required records.

In this indexing, instead of column/row labels, we use a Boolean vector to filter the data.

There are 4 ways to filter the data:

  • Accessing a DataFrame with a Boolean index.
  • Applying a Boolean mask to a DataFrame.
  • Masking data based on column value.
  • Masking data based on index value.

Accessing Pandas DataFrame with a Boolean Index

To access a DataFrame with a Boolean index, we need to create a DataFrame in which index contains a Boolean values ‘True’ or ‘False’.

#Creating a dataframe with dictionary
import pandas as pd
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
        'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
        'ID':[90, 40, 80, 98, 50],
        'Salary':[90000, 65000, 50000, 92000, 50000]}

# giving a index to a DataFrame   
df = pd.DataFrame(dict, index = [True, False, True, False, True]) 
   
df

#Output
	Name	Dept	ID	Salary
True	Joe	Manager	90	90000
False	Parth	Dev	40	65000
True	Sunny	Admin	80	50000
False	Lauren	Manager	98	92000
True	Lara	Admin	50	50000

We can access a DataFrame using three functions .loc[].iloc[].ix[].

#Accessing a Dataframe with a boolean index using .loc[] function
df.loc[True]

#Output
	Name	Dept	ID	Salary
True	Joe	Manager	90	90000
True	Sunny	Admin	80	50000
True	Lara	Admin	50	50000

Now let’s try to access the DataFrame with .iloc[] function. .iloc[] is used to select rows and columns by number, in the order that they appear in the DataFrame.

.iloc[] function accepts only integer value as argument, so if we try to pass Boolean value [‘True’, or ‘False’] then it will throw a TypeError.

#Accessing a Dataframe with a boolean index using .iloc[] function
df.iloc[True]

#Output
TypeError

#Accessing a Dataframe with a boolean index using .iloc[] function
df.iloc[1]

#Output
Name      Parth
Dept        Dev
ID           40
Salary    65000
dtype: object

.ix[] function is a hybrid of .loc and .iloc, but this function indexer is deprecated and will be removed someday future, but not on the next release of pandas.

Applying a Boolean mask to Pandas DataFrame

We can apply a Boolean mask by giving list of True and False of the same length as contain in a DataFrame.

On applying a Boolean mask it will print only that DataFrame in which we pass a Boolean value True.

# importing pandas as pd 
import pandas as pd

# dictionary of lists 
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
        'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
        'ID':[90, 40, 80, 98, 50],
        'Salary':[90000, 65000, 50000, 92000, 50000]}

# giving a index to a dataframe    
df = pd.DataFrame(dict, index = [0,1,2,3,4]) 

df[[True,False,True,False,True]]

#Output
	Name	Dept	ID	Salary
0	Joe	Manager	90	90000
2	Sunny	Admin	80	50000
4	Lara	Admin	50	50000

Masking Data Based on Column Value

By this way, we can filter a data based on a column value by applying certain condition on dataframe using different operator like ==><<=>=.

When we apply these operators on the data frame, then it produces a Series of True and False. 

Now, let’s take a condition where we will filter out all the data in which the Department is ‘Manager’.

#Creating a dataframe with dictionary
import pandas as pd
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
        'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
        'ID':[90, 40, 80, 98, 50],
        'Salary':[90000, 65000, 50000, 92000, 50000]}
   
df = pd.DataFrame(dict)

# Filtering the data whose department is 'Manager'
df.Dept == 'Manager'

#Output
0     True
1    False
2    False
3     True
4    False
Name: Dept, dtype: bool

Here, we are getting the result dataset in which it returns True if Department is Manager. Otherwise, it returns False.

Based on Department, we can select all the other columns data or even complete DataFrame.

Let’s see how we can display this.

#Wrapping the data in the DataFrame
df[df.Dept == 'Manager']

#Output
       	Name	Dept	ID	Salary
0	Joe	Manager	90	90000
3	Lauren	Manager	98	92000

As we can see that the complete DataFrame containing only information in which the Department is ‘Manager’.

We can also use Multiple conditions to filter the data.

For example, getting the data of an employee whose department is manager and salary is greater than 90000.

df[(df.Dept == 'Manager') & (df.Salary > 90000)]

#Output
	Name	Dept	ID	Salary
3	Lauren	Manager	98	92000

Here, we have added an operator to add multiple conditions.

Masking Data Based on Index Value

In a DataFrame to filter a data based on a column value, we can create a mask based on the index values using different operator like ==><, etc..

#Creating a DataFrame with dictionary
import pandas as pd
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
        'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
        'ID':[90, 40, 80, 98, 50],
        'Salary':[90000, 65000, 50000, 92000, 50000]}
# giving a index to a dataframe   
df = pd.DataFrame(dict, index = [0,1,2,3,4]) 

mask = df.index == 2

df[mask]

#Output
	Name	Dept	ID	Salary
2	Sunny	Admin	80	50000


#Another Example
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
        'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
        'ID':[90, 40, 80, 98, 50],
        'Salary':[90000, 65000, 50000, 92000, 50000]}

# giving a index to a DataFrame   
df = pd.DataFrame(dict, index = [0,1,2,3,4]) 
# filtering data on index value
mask = df.index > 1

df[mask]

#Output
	Name	Dept	ID	Salary
2	Sunny	Admin	80	50000
3	Lauren	Manager	98	92000
4	Lara	Admin	50	50000

Summary

In this tutorial, we learned 4 ways to filter the data. We have also learned how to index a DataFrame with both the loc and iloc methods and how we can use boolean arrays to index or specify our rows and columns.

You may also like

Leave a Reply