PickupBrain

Python Pandas DataFrame & Different Ways to Create Them

Pandas is one of the most widely used Python libraries for data analysis. A major reason is its core data structure, the DataFrame, one of the two basic data structures of Pandas (the other is the Series). It is a 2-dimensional labeled data structure with columns of potentially different types.

A DataFrame stores data as a two-dimensional array with labeled axes (rows and columns). It is a standard way to store tabular data and has two different indexes, i.e., a row index and a column index. It has the following properties:

  • The columns can be of heterogeneous types like int, float, string and boolean.
  • It is a 2-dimensional data structure with rows and columns.
  • Row labels are called the ‘index’, while column labels are simply called ‘columns’.

In simple words, a DataFrame is like a spreadsheet or table in which the rows and columns can be named.
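The properties listed above can be inspected on a small DataFrame. This is a minimal sketch; the column names, row labels and values are hypothetical and chosen only to show one column per data type.

```python
import pandas as pd

# Hypothetical sample data: one column per data type.
df = pd.DataFrame(
    {
        "count": [1, 2, 3],               # int
        "price": [9.5, 20.0, 3.25],       # float
        "label": ["a", "b", "c"],         # string
        "in_stock": [True, False, True],  # boolean
    },
    index=["row1", "row2", "row3"],       # row labels (the index)
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.index))    # row labels
print(list(df.columns))  # column labels
print(df.dtypes)         # one data type per column
```

The index and the columns are the two labeled axes, and `dtypes` shows that each column keeps its own type.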

DataFrame

Why do we need it?

Let’s take an example to understand the use of Pandas. We have data for 5 randomly selected patients. The insurance company is trying to predict how many female patients there will be by the end of the year.

# Header -> 'Bill_Amount','Sex','Bed_Number','Age'
data = [[ 10000,       'M',       1,      59],
        [ 11000,       'F',       5,      31],
        [ 25000,       'F',       4,      45],
        [ 75000,       'M',       3,      27],
        [ 12000,       'F',      10,      18]]

print(data)

# Output
[[10000, 'M', 1, 59], [11000, 'F', 5, 31], [25000, 'F', 4, 45], [75000, 'M', 3, 27], [12000, 'F', 10, 18]]

The above data is stored compactly in the list, but the list gives the data analyst no insight into what each variable means.

import pandas as pd
data = [[ 10000,       'M',       1,      59],
        [ 11000,       'F',       5,      31],
        [ 25000,       'F',       4,      45],
        [ 75000,       'M',       3,      27],
        [ 12000,       'F',      10,      18]]

df = pd.DataFrame(data, columns = ['Bill_Amount','Sex','Bed_Number','Age'])
print(df)

#Output
   Bill_Amount  Sex  Bed_Number  Age
0        10000    M           1   59
1        11000    F           5   31
2        25000    F           4   45
3        75000    M           3   27
4        12000    F          10   18

On converting the list into a Pandas DataFrame, the tabular structure organizes the data in the form of rows and columns. The column names show that the insurance company treats Bill Amount, Sex, Bed Number and Age as the key variables.

This is just one example; there are many more reasons to use DataFrames.
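Once the columns are named, questions about the patient data can be asked by column name rather than by list position. The sketch below only summarizes the five sample rows (it does not predict anything); the variable names `sex_counts` and `mean_bill` are illustrative.

```python
import pandas as pd

data = [[10000, 'M', 1, 59],
        [11000, 'F', 5, 31],
        [25000, 'F', 4, 45],
        [75000, 'M', 3, 27],
        [12000, 'F', 10, 18]]
df = pd.DataFrame(data, columns=['Bill_Amount', 'Sex', 'Bed_Number', 'Age'])

# Count the patients of each sex.
sex_counts = df['Sex'].value_counts()
print(sex_counts)

# Average bill per sex, selected by column name.
mean_bill = df.groupby('Sex')['Bill_Amount'].mean()
print(mean_bill)
```

With the plain list, the same questions would need a loop that remembers which position holds which variable.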

DataFrames & Creating DataFrame

Benefits

The benefits of Pandas DataFrame are:

  • It is easy to visualize the data from a DataFrame.
  • Data can be analyzed quickly using a variety of built-in functions.
  • A DataFrame can hold mixed data types, unlike NumPy arrays, which have a single data type.
  • It allows you to clean data and convert it into a tidy structure.
  • It handles missing values.
  • It supports encoding categorical data.
  • It works with datetimes.
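The last three benefits can be sketched in a few lines. The records here are hypothetical, and filling a gap with the column mean is just one possible choice, not a recommendation.

```python
import pandas as pd

# Hypothetical records with one missing value in each of two columns.
df = pd.DataFrame({
    "Sex": ["M", "F", None, "F"],
    "Bill_Amount": [10000.0, None, 25000.0, 12000.0],
    "Admitted": ["2021-01-05", "2021-02-10", "2021-02-28", "2021-03-15"],
})

# Handling missing values: count them, then fill the numeric gap with the mean.
missing_per_column = df.isna().sum()
df["Bill_Amount"] = df["Bill_Amount"].fillna(df["Bill_Amount"].mean())

# Encoding categorical data: map each distinct label to an integer code.
# A missing label gets the code -1.
df["Sex_Code"] = df["Sex"].astype("category").cat.codes

# Working with datetimes: parse the strings, then extract the month.
df["Admitted"] = pd.to_datetime(df["Admitted"])
df["Month"] = df["Admitted"].dt.month

print(missing_per_column)
print(df)
```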

Creating DataFrame

We can create a pandas DataFrame from a CSV file, an Excel file, a SQL query, a list, a dictionary, a list of dictionaries, etc. Here are some of the ways to create a DataFrame:

Creating an Empty DataFrame

This is a simple example to create an empty DataFrame in Python.

#Creating an Empty DataFrame
import pandas as pd

data = pd.DataFrame()
print(data)

#Output
Empty DataFrame
Columns: []
Index: []
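An empty DataFrame is often just a starting point. As a small sketch, the `empty` attribute confirms that it holds no data, and columns can then be assigned one at a time; the column names and values are hypothetical.

```python
import pandas as pd

data = pd.DataFrame()
was_empty = data.empty   # True: no rows and no columns yet
print(was_empty)
print(data.shape)        # (0, 0)

# Assigning lists of equal length adds columns and creates the rows.
data['Name'] = ['Mike', 'Paula']
data['Salary'] = [10000, 30000]
print(data)
```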

Creating DataFrame by Comma Separated Value (CSV) File

To create a DataFrame from a CSV file, use the ‘read_csv’ function. It reads a CSV file into a DataFrame.

#Creating DataFrame by csv file
import pandas as pd

df = pd.read_csv('CompleteCricketData.csv')

print(df)

#     Output
       Unnamed: 0          Player   ...       Date   Unnamed: 15
0               0       RG Sharma   ...  11/13/2014          NaN
1               1      MJ Guptill   ...   3/21/2015          NaN
2               2        V Sehwag   ...   12/8/2011          NaN
3               3        CH Gayle   ...   2/24/2015          NaN
4               4    Fakhar Zaman   ...   7/20/2018          NaN
          ...              ...  ...         ...          ...

The above code requires at least one argument, i.e. the path of the CSV file. In the above case, the path is relative to the current working directory, which here is the folder containing the Python script. However, if the file is located inside a subfolder named ‘subFolder’, then the path would be ‘subFolder/CompleteCricketData.csv’.
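The subfolder case can be tried without the original data file. The sketch below writes a tiny CSV into a temporary ‘subFolder’ and reads it back; the file name `sample.csv` and its contents are hypothetical. It also shows `index_col=0`, an optional `read_csv` parameter that treats the first column as the row index, which avoids an ‘Unnamed: 0’ column like the one in the output above.

```python
import os
import tempfile
import pandas as pd

# Build a throwaway folder layout so the example needs no real file.
base_dir = tempfile.mkdtemp()
sub_dir = os.path.join(base_dir, "subFolder")
os.makedirs(sub_dir)

# Hypothetical contents: the first column is a saved row index with no header.
csv_path = os.path.join(sub_dir, "sample.csv")
with open(csv_path, "w") as f:
    f.write(",Player,Runs\n0,A,100\n1,B,85\n")

# Default read: the unnamed first column is kept as a normal column.
df_default = pd.read_csv(csv_path)
print(list(df_default.columns))

# index_col=0 uses the first column as the row index instead.
df_indexed = pd.read_csv(csv_path, index_col=0)
print(df_indexed)
```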

Creating DataFrame by Excel File

To create a DataFrame from an Excel file, use the ‘read_excel’ function. It works like ‘read_csv’ above and reads an Excel file into a DataFrame.

#Creating DataFrame by excel file
import pandas as pd

df = pd.read_excel('SampleData.xlsx')

print(df)
#     Output
       Unnamed: 0          Player   ...       Date   Unnamed: 15
0               0       RG Sharma   ...  11/13/2014          NaN
1               1      MJ Guptill   ...   3/21/2015          NaN
2               2        V Sehwag   ...   12/8/2011          NaN
3               3        CH Gayle   ...   2/24/2015          NaN
4               4    Fakhar Zaman   ...   7/20/2018          NaN
          ...              ...  ...         ...          ...

Creating DataFrame by Python List

We can create the DataFrame using a single list or a list of lists.

#Creating DataFrame by list
import pandas as pd

sample_list1 = [1,2,3,4,5]
df = pd.DataFrame(sample_list1)
print(df)

#Output
   0
0  1
1  2
2  3
3  4
4  5

#Creating a DataFrame by list of lists.
import pandas as pd
sample_list2 = [['Mike',10000],['Jhon',20000],['Paula',30000]]
df = pd.DataFrame(sample_list2 ,columns=['Name','Salary'])
print(df)

#Output
    Name  Salary
0   Mike   10000
1   Jhon   20000
2  Paula   30000

Creating DataFrame by Python Dictionary

We can pass a dictionary as input data to create a DataFrame. The dictionary keys are taken as column names by default.

To create a DataFrame from a dictionary of arrays/lists, all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. Otherwise the default index will be range(n), where n is the array length.

#Creating DataFrame by dictionary without index
import pandas as pd
sample_data = {'Name':['Jhon', 'Nick', 'Mike', 'Paula'],
        'Salary':[10000, 20000, 30000, 40000]}
df = pd.DataFrame(sample_data)
print(df)

#Output
    Name  Salary
0   Jhon   10000
1   Nick   20000
2   Mike   30000
3  Paula   40000

#Creating DataFrame by list of dictionaries with index
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 50, 'b': 60, 'c': 70}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

#Output
         a   b     c
first    1   2   NaN
second  50  60  70.0
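The length rules stated earlier for a dictionary of lists can be checked directly. In this sketch the index labels ‘e1’ to ‘e4’ are hypothetical, and the second call is deliberately wrong to show that pandas rejects lists of unequal length.

```python
import pandas as pd

sample_data = {'Name': ['Jhon', 'Nick', 'Mike', 'Paula'],
               'Salary': [10000, 20000, 30000, 40000]}

# Four rows, so the custom index needs exactly four labels.
df = pd.DataFrame(sample_data, index=['e1', 'e2', 'e3', 'e4'])
print(df)

# Lists of unequal length are rejected with a ValueError.
try:
    pd.DataFrame({'Name': ['Jhon', 'Nick'], 'Salary': [10000]})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```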

Creating DataFrame by Dictionary of List

We declare four lists of elements and then assign one list to each column.

import pandas as pd

Name = ['John', 'Nike', 'Mike', 'Paula']
Age =  [25, 32, 30, 26]
Office = ['Microsoft', 'Amazon', 'Google', 'Spotify']
Salary = [1000000, 1200000, 900000, 1100000]

table = {'Name': Name,
         'Age': Age,
         'Office Name': Office,
         'Salary': Salary}

data = pd.DataFrame(table)
print(data)

#Output
    Name  Age Office Name   Salary
0   John   25   Microsoft  1000000
1   Nike   32      Amazon  1200000
2   Mike   30      Google   900000
3  Paula   26     Spotify  1100000

Creating DataFrame by Tuple List

In the code below, each element of the list is a tuple, and each tuple becomes a row in the DataFrame. If no column names are given, default names are assigned using range(n), where n is the length of each tuple; here the names are set with the columns argument.

#Creating DataFrame by tuple
import pandas as pd
sample_list = [('Mike',10000),('Jhon',20000),('Paula',30000)]
df = pd.DataFrame(sample_list ,columns=['Name','Salary'])
print(df)

#Output
    Name  Salary
0   Mike   10000
1   Jhon   20000
2  Paula   30000
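For comparison, here is a short sketch of the same tuple list without the columns argument, which shows the default range(n) column names. The names `df_default` and `df_named` are illustrative.

```python
import pandas as pd

sample_list = [('Mike', 10000), ('Jhon', 20000), ('Paula', 30000)]

# No columns argument: pandas labels the columns 0 and 1.
df_default = pd.DataFrame(sample_list)
print(df_default)

# The default labels can be renamed afterwards if readable names are wanted.
df_named = df_default.rename(columns={0: 'Name', 1: 'Salary'})
print(list(df_named.columns))
```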

We have covered 7 different ways of creating a pandas DataFrame. There are more ways to create a DataFrame; you can check the Pandas documentation to learn more.
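SQL was mentioned earlier as one more source. As a rough sketch, assuming an in-memory SQLite database with a hypothetical `staff` table, `read_sql_query` turns a query result into a DataFrame:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a hypothetical table; nothing touches disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (Name TEXT, Salary INTEGER)")
conn.executemany("INSERT INTO staff VALUES (?, ?)",
                 [("Mike", 10000), ("Jhon", 20000), ("Paula", 30000)])
conn.commit()

# The query result becomes the DataFrame; column names come from the SELECT.
df = pd.read_sql_query("SELECT Name, Salary FROM staff ORDER BY Salary", conn)
conn.close()
print(df)
```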
