PickupBrain

All About Index in Python Pandas

Index of DataFrame in Pandas is like an address, that’s how any data point across the DataFrame or series can be accessed. This is similar to an index that you would see at the end of a book that helps you find content faster.

The DataFrame in Python is labeled as two-dimensional data structures and comprises the main three components – Index, Columns and Data.

Indexing in Pandas helps in selecting particular rows and column of data from a DataFrame. While working with large DataFrames, indexes are quite helpful.

Setting the Index

Let’s understand the index in Pandas by creating a DataFrame.

#importing pandas package
import pandas as pd

# making data frame from dictionary
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
        'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
        'ID':[90, 40, 80, 98, 50],
        'Salary':[90000, 65000, 50000, 92000, 50000]}

#Converting dataset dict to a DataFrame
df = pd.DataFrame(dict) 
    
df


#Output
	Name	Dept	ID	Salary
0	Joe	Manager	90	90000
1	Parth	Dev	40	65000
2	Sunny	Admin	80	50000
3	Lauren	Manager	98	92000
4	Lara	Admin	50	50000

The left most column in the DataFrame is auto-generated index column by Pandas to optimize the dataset.

Setting index using DataFrame.index

The auto-generated index starts from 0 but it can manipulate by using DataFrame.index property.

#Manipulating index value from 0 to 1
df.index = [i for i in range(1, len(df.values)+1)] 
df

#Output

        Name	Dept	ID	Salary
1	Joe	Manager	90	90000
2	Parth	Dev	40	65000
3	Sunny	Admin	80	50000
4	Lauren	Manager	98	92000
5	Lara	Admin	50	50000

Here, len(df.values) returns the number of rows in current DataFrame.

Instead of auto-generated index, let’s set the ‘ID’ column of DataFrame as an index field.

#Setting ID as an index
df.index = df['ID']
df

#Output
	Name	Dept	ID	Salary
ID				
90	Joe	Manager	90	90000
40	Parth	Dev	40	65000
80	Sunny	Admin	80	50000
98	Lauren	Manager	98	92000
50	Lara	Admin	50	50000

Now, the index column name is shown as ID. But as you see we have a duplication, Index and Column[‘ID’] are identical. To solve this, let’s drop the column ID.

#Dropping ID Column
df = df.drop(['ID'], axis = 1)
df

#Output
	Name	Dept	Salary
ID			
90	Joe	Manager	90000
40	Parth	Dev	65000
80	Sunny	Admin	50000
98	Lauren	Manager	92000
50	Lara	Admin	50000

Setting index using set_index() Function

Now, we will set an index for the Python DataFrame using the set_index() method.

set_index() is a built-in method that is used to set the List in Series or DataFrame as an index of a Data Frame. It sets the DataFrame index using existing columns. It can replace the existing index or expand on it.

https://youtu.be/H87RFHshw7s

Syntax

DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

Parameters

  • keys: Column name or list of a column name.
  • drop: It’s a Boolean value which drops the column used for the index if set True.
  • append: It appends the column to the existing index column if True.
  • inplace: It makes the changes in the DataFrame if True.
  • verify_integrity: It checks the new index column for duplicates if True.
import pandas as pd
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
        'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
        'ID':[90, 40, 80, 98, 50],
        'Salary':[90000, 65000, 50000, 92000, 50000]}

#Converting dataset dict to a DataFrame
df = pd.DataFrame(dict) 

#Setting index as ID
df.set_index('ID', inplace = True, drop = True)
    
df

#Output
	Name	Dept	Salary
ID			
90	Joe	Manager	90000
40	Parth	Dev	65000
80	Sunny	Admin	50000
98	Lauren	Manager	92000
50	Lara	Admin	50000

In the above code, as inplace parameter is True, current DataFrame is modified. When inplace is set to False, current DataFrame remains unaltered and instead modified DataFrame is returned. The drop = False parameter will keep the ID column in the list, by default the column to be indexed will be drop (drop = True)

MultiIndexing

We can set more than one column can be set as Index. Let’s set columns ‘Dept’ and ‘ID’ as Index.

import pandas as pd
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
        'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
        'ID':[90, 40, 80, 98, 50],
        'Salary':[90000, 65000, 50000, 92000, 50000]}

#Converting dataset dict to a DataFrame
df = pd.DataFrame(dict) 

#Setting index as ID and Dept
df.set_index(['Dept','ID'], inplace = True, append = True, drop = False )
    
df

#Output
			Name	Dept	ID	Salary
        Dept	ID				
0	Manager	90	Joe	Manager	90	90000
1	Dev	40	Parth	Dev	40	65000
2	Admin	80	Sunny	Admin	80	50000
3	Manager	98	Lauren	Manager	98	92000
4	Admin	50	Lara	Admin	50	50000

In the above code, drop parameter is used to drop the column and append parameter is used to append passed columns to the already existing index column. 

Resetting the Index

Resetting is useful to reset the index of a DataFrame.

The index of DataFrame and Series can be reassigned to the row number starting from 0 using index_reset() function. You may need to re-index in certain sitiations like order of the rows changes after sorting or missing row number after deleting the row.

It also removes the current index and set it as a new column of the DateFrame.

Syntax of reset_index()

DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill=”)

Parameter

level: int, string or a list to select and remove passed column from index.
drop: Boolean value, Adds the replaced index column to the data if False.
inplace: Boolean value, make changes in the original data frame itself if True.
col_level: Select in which column level to insert the labels.
col_fill: Object, to determine how the other levels are named.

import pandas as pd
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
        'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
        'ID':[90, 40, 80, 98, 50],
        'Salary':[90000, 65000, 50000, 92000, 50000]}

#Converting dataset dict to a DataFrame
df = pd.DataFrame(dict) 

#Setting index as ID
df.set_index(['Dept','ID'], inplace = True, append = True, drop = False )

df
#Before reset_index
			Name	Dept	ID	Salary
        Dept	ID				
0	Manager	90	Joe	Manager	90	90000
1	Dev	40	Parth	Dev	40	65000
2	Admin	80	Sunny	Admin	80	50000
3	Manager	98	Lauren	Manager	98	92000
4	Admin	50	Lara	Admin	50	50000

#Remove index with default index    
df.reset_index(drop = True, inplace = True)

df

#Output after reset index

        Name	Dept	ID	Salary
0	Joe	Manager	90	90000
1	Parth	Dev	40	65000
2	Sunny	Admin	80	50000
3	Lauren	Manager	98	92000
4	Lara	Admin	50	50000

In the above example, Index is set to default and Department & ID has been reset to a column.

Sort Index

The sort_index() function is used to sort DataFrame/Series by index labels.

Syntax

DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’, sort_remaining=True, by=None)

Parameter

axis: index, columns to direct sorting
level: if not None, sort on values in specified index level(s)
ascending: Sort ascending vs. descending
inplace: if True, perform operation in-place
kind: {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’. Choice of sorting algorithm. However, it is only applied when sorting on a single column or label of a DataFrame.
na_position: [{‘first’, ‘last’}, default ‘last’] First puts NaNs at the beginning, last puts NaNs at the end. Not implemented for MultiIndex.
sort_remaining: If true and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level


import pandas as pd
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
        'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
        'ID':[90, 40, 80, 98, 50],
        'Salary':[90000, 65000, 50000, 92000, 50000]}

#Converting dataset dict to a DataFrame
df = pd.DataFrame(dict) 

#Sort the DataFrame by row labels
df.sort_index(axis = 0)

#Output
	Name	Dept	ID	Salary
0	Joe	Manager	90	90000
1	Parth	Dev	40	65000
2	Sunny	Admin	80	50000
3	Lauren	Manager	98	92000
4	Lara	Admin	50	50000

#Sort the DataFrame by descending order
df.sort_index(ascending = False)

#Output

        Name	Dept	ID	Salary
4	Lara	Admin	50	50000
3	Lauren	Manager	98	92000
2	Sunny	Admin	80	50000
1	Parth	Dev	40	65000
0	Joe	Manager	90	90000

#Sort the DataFrame by column labels
df.sort_index(axis = 1)

#Output
	Dept	ID	Name	Salary
0	Manager	90	Joe	90000
1	Dev	40	Parth	65000
2	Admin	80	Sunny	50000
3	Manager	98	Lauren	92000
4	Admin	50	Lara	50000

Index are sorted in ascending order by default. However, you can also sort Index in descending order by setting the argument ascending to False.

Summary

In this post we have covered various features of indexing like setting an index, resetting an index and sorting an index.

Leave a Reply