Python pandas
Informatics Practices(New)
CLASS XII Code No. 065 -2019-20
Unit 1: Data Handling (DH-2)
What is pandas?
Pandas is an open source, BSD-licensed library providing high-
performance, easy-to-use data structures and data analysis tools for
the Python programming language.
Python with pandas is in use in a wide variety of academic and
commercial domains, including Finance, Neuroscience, Economics,
Statistics, Advertising, Web Analytics, and more.
Installing pandas
The simplest way to install not only pandas, but also Python and the most
popular packages, is with Anaconda, a cross-platform (Linux, Mac
OS X, Windows) Python distribution for data analytics and scientific
computing. After running the installer, the user will have access to
pandas and the rest of the stack without needing to install anything else,
and without needing to wait for any software to be compiled.
Installation instructions for Anaconda can be found in the Anaconda documentation.
Another advantage of installing Anaconda is that you don’t need admin
rights to install it. Anaconda can install in the user’s home directory,
which makes it trivial to remove Anaconda later if you decide to (just delete that
folder).
Note: Each time we use pandas in a Python program, we need to write the
following line of code at the top of the program:
import pandas as <identifier_name>
The above statement imports the pandas library into our program under the
given name (by convention, pd).
We will use two pandas data structures in our programs:
1. Series
2. DataFrame
pandas Series
Series is a one-dimensional labeled array capable of holding any data
type (integers, strings, floating point numbers, Python objects, etc.). The
axis labels are collectively referred to as the index. The basic method to
create a Series is to call:
import pandas as <identifier_name>
<Series_name> = <identifier_name>.Series(data, index=index)
Data can be many different things:
a Python dict
a Python list
a Python tuple
The passed index is a list of axis labels.
Step by Step method to create a pandas Series
Step 1
Suppose we have a list of games created with the following Python code:
games_list = ['Cricket', 'Volleyball', 'Judo', 'Hockey']
Step 2
Now we create a pandas Series from the above list:
# Python script to generate a Series object from a list
import pandas as pd
games_list = ['Cricket', 'Volleyball', 'Judo', 'Hockey']
s = pd.Series(games_list)
print(s)
OUTPUT
0 Cricket
1 Volleyball
2 Judo
3 Hockey
dtype: object
In the above output, 0, 1, 2, 3 are the default index values assigned to the
list values. We can also create our own index for each value. Let us create
another Series with the same values and our own index labels:
# Python script to generate a Series object from List using custom Index
import pandas as pd
games_list = ['Cricket', 'Volleyball', 'Judo', 'Hockey']
s= pd.Series(games_list, index =['G1','G2','G3','G4'])
print(s)
OUTPUT
G1 Cricket
G2 Volleyball
G3 Judo
G4 Hockey
dtype: object
In the above output G1, G2, G3, G4 are the index labels that we created
for the list values.
In a similar manner we can create a pandas Series from other kinds of data
(tuple, dictionary, etc.).
Now we will create a Series from a dictionary.
Suppose we have a dictionary of games created with the following Python
code:
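The dictionary code itself is missing from these notes. As a minimal sketch, assuming a hypothetical dictionary named games_dict whose keys G1 to G4 are reused from the earlier example, a Series could be built from a dictionary like this. The keys become the index labels and the values become the data.

```python
import pandas as pd

# Hypothetical dictionary (not from the original notes):
# keys become the index labels, values become the data
games_dict = {'G1': 'Cricket', 'G2': 'Volleyball', 'G3': 'Judo', 'G4': 'Hockey'}

s = pd.Series(games_dict)
print(s)

# A value can be looked up by its index label
print(s['G3'])
```

Because the index comes from the dictionary keys, no separate index argument is needed here.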
As you can see, the output generated for the DataFrame object looks
similar to what we have seen in an Excel sheet. The only difference is that
the default index value for the first row is 0 in a DataFrame, whereas in
an Excel sheet this value is 1. We can also customize this index value as per
our need.
Note: In Python versions before 3.7, a dictionary did not guarantee the
order of its items, so the order in which the information was returned could
differ from the order in which it was written. From Python 3.7 onward,
dictionaries preserve insertion order.
One more example of a DataFrame with customized index values:
# Python script to generate a DataFrame object with a custom index
import pandas as pd
name_dict = {
    'Name': ["Anita", "Sajal", "Ayaan", "Abhey"],
    'Age': [14, 32, 4, 6]}
df = pd.DataFrame(name_dict, index=[1, 2, 3, 4])
print(df)
Output
    Name  Age
1  Anita   14
2  Sajal   32
3  Ayaan    4
4  Abhey    6
In the preceding output the index values start from 1 instead of 0.
Viewing the Data of a DataFrame
To selectively view the rows, we can use the head(…) and tail(…) functions.
By default they return the first or last five rows (if no argument is provided);
otherwise they return the specified number of rows from the top or bottom.
Here is how they display the contents:
print(df.head()) # Displays first five rows
print(df.tail()) # Displays last five rows
print(df.head(2)) # Displays first two rows
print(df.tail(1)) # Displays last one row
print(df.head(-2)) # Displays all rows except last two rows
print(df.tail(-1)) # Displays all rows except first row
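The behaviour of positive and negative arguments can be checked on a small frame. The following is a minimal sketch, assuming a hypothetical six-row DataFrame (the names and ages are illustrative only), so that head() and tail() return visibly different slices.

```python
import pandas as pd

# Hypothetical six-row DataFrame, used only for illustration
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'Age': [14, 32, 4, 6, 21, 9]})

first_two = df.head(2)          # rows with index 0 and 1
last_one = df.tail(1)           # row with index 5
all_but_last_two = df.head(-2)  # rows with index 0 to 3
all_but_first = df.tail(-1)     # rows with index 1 to 5

print(first_two)
print(last_one)
print(all_but_last_two)
print(all_but_first)
```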
Advanced operations on DataFrames:
Pivoting:
Output:
Salesman  District  Sales
Akshit    Hamirpur   1000
          Kangra     2000
          Mandi      1000
Jaswant   Hamirpur   2600
Karan     Kangra      910
          Mandi       300
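The code that produced the table above is not included in these notes. The figures are consistent with a pivot table that sums Sales for each Salesman and District, so the following is a sketch under that assumption, using the monthlysale data from the later examples. The variable name total is hypothetical.

```python
import pandas as pd

monthlysale = {
    'Salesman': ["Akshit", "Jaswant", "Karan"] * 4,
    'Sales': [1000, 300, 800, 1000, 500, 60, 1000, 900, 300, 1000, 900, 50],
    'Quarter': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'District': ['Kangra', 'Hamirpur', 'Kangra', 'Mandi', 'Hamirpur', 'Kangra',
                 'Kangra', 'Hamirpur', 'Mandi', 'Hamirpur', 'Hamirpur', 'Kangra'],
}
df = pd.DataFrame(monthlysale)

# Assumption: the table is the sum of Sales per (Salesman, District) pair
total = pd.pivot_table(df, index=['Salesman', 'District'],
                       values=['Sales'], aggfunc='sum')
print(total)
```

Passing two column names to index produces one row per combination that actually occurs in the data, which is why only six rows appear.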
Example 4:
# Maximum sales District wise
import pandas as pd
monthlysale = {
    'Salesman': ["Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan",
                 "Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan"],
    'Sales': [1000, 300, 800, 1000, 500, 60, 1200, 900, 1300, 1000, 900, 50],
    'Quarter': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'District': ['Kangra', 'Hamirpur', 'Kangra', 'Mandi', 'Hamirpur', 'Kangra',
                 'Kangra', 'Hamirpur', 'Mandi', 'Hamirpur', 'Hamirpur', 'Kangra']
}
df = pd.DataFrame(monthlysale)
# Maximum sale in each district
print(pd.pivot_table(df, index=['District'], values=['Sales'], aggfunc='max'))
Output:
District Sales
Hamirpur 1000
Kangra 1200
Mandi 1300
Example 5:
# Minimum sale District wise
import pandas as pd
monthlysale = {
    'Salesman': ["Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan",
                 "Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan"],
    'Sales': [1000, 300, 800, 1000, 500, 60, 1000, 900, 300, 1000, 900, 50],
    'Quarter': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'District': ['Kangra', 'Hamirpur', 'Kangra', 'Mandi', 'Hamirpur', 'Kangra',
                 'Kangra', 'Hamirpur', 'Mandi', 'Hamirpur', 'Hamirpur', 'Kangra']
}
df = pd.DataFrame(monthlysale)
# Minimum sale in each district
print(pd.pivot_table(df, index=['District'], values=['Sales'], aggfunc='min'))
Output:
District Sales
Hamirpur 300
Kangra 50
Mandi 300
Example 6:
# Median of sales District wise
import pandas as pd
monthlysale = {
    'Salesman': ["Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan",
                 "Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan"],
    'Sales': [1000, 300, 800, 1000, 500, 60, 1000, 900, 300, 1000, 900, 50],
    'Quarter': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'District': ['Kangra', 'Hamirpur', 'Kangra', 'Mandi', 'Hamirpur', 'Kangra',
                 'Kangra', 'Hamirpur', 'Mandi', 'Hamirpur', 'Hamirpur', 'Kangra']
}
df = pd.DataFrame(monthlysale)
# Median of sales in each district
print(pd.pivot_table(df, index=['District'], values=['Sales'], aggfunc='median'))
Output:
District Sales
Hamirpur 900
Kangra 800
Mandi 650
Complete Example:
# Maximum, Minimum, Mean, Mode, Median and Count of sales Salesman wise
import pandas as pd
print("\n")
print("Dataframe of Values\n")
print("\n")
monthlysale = {
    'Salesman': ["Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan",
                 "Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan"],
    'Sales': [1000, 300, 800, 1000, 500, 60, 1000, 900, 300, 1000, 900, 50],
    'Quarter': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'District': ['Kangra', 'Hamirpur', 'Kangra', 'Mandi', 'Hamirpur', 'Kangra',
                 'Kangra', 'Hamirpur', 'Mandi', 'Hamirpur', 'Hamirpur', 'Kangra']
}
df = pd.DataFrame(monthlysale)
# Use of mode() method of DataFrame
print("\n")
print("Use of mode() method of DataFrame")
print("\n")
print(df.mode())
print("\n")
print("Use of max,min,mean,median and count\n")
# A list of function names computes several aggregates in one pivot table
print(pd.pivot_table(df, index=['Salesman'], values=['Sales'],
                     aggfunc=['max', 'min', 'mean', 'median', 'count']))
Output:
Use of mode() method of DataFrame
Salesman Sales Quarter District
0 Akshit 1000.0 1 Hamirpur
1 Jaswant NaN 2 Kangra
2 Karan NaN 3 NaN
3 NaN NaN 4 NaN
Output:
# Use of the plot() method
import pandas as pd
import matplotlib.pyplot as plt
print("\n")
print("Dataframe of Values\n")
monthlysale = {
    'Salesman': ["Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan",
                 "Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan"],
    'Sales': [1000, 300, 800, 1000, 500, 60, 1000, 900, 300, 1000, 900, 50],
    'Quarter': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'District': ['Kangra', 'Hamirpur', 'Kangra', 'Mandi', 'Hamirpur', 'Kangra',
                 'Kangra', 'Hamirpur', 'Mandi', 'Hamirpur', 'Hamirpur', 'Kangra']
}
df = pd.DataFrame(monthlysale)
print(df)
print("\n")
print("Use of plot() method\n")
# plot() draws a line plot by default; pass kind='hist' for a histogram
pd.pivot_table(df, index=['Salesman'], values=['Sales']).plot()
plt.show()
Output: (a line plot of the mean Sales for each Salesman; figure not reproduced here)
Quantile
# marks_series is assumed to be an existing pandas Series of marks
print("\n")
print("Q1, Q2, Q3 and 100th percentile\n")
print("Q1 quantile of marks_series : ", marks_series.quantile(.25))
print("Q2 quantile of marks_series : ", marks_series.quantile(.50))
print("Q3 quantile of marks_series : ", marks_series.quantile(.75))
# quantile(1.0) is the 100th percentile (the maximum); quantile(.1) would be the 10th
print("100th percentile of marks_series : ", marks_series.quantile(1.0))
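The notes do not show how marks_series is defined. As a minimal sketch, assuming a hypothetical Series of five marks, the quartiles can be computed as follows. quantile() takes a fraction between 0 and 1, so 0.25 is Q1 and 1.0 is the maximum.

```python
import pandas as pd

# Hypothetical marks, not taken from the original notes
marks_series = pd.Series([10, 20, 30, 40, 50])

q1 = marks_series.quantile(0.25)     # first quartile
q2 = marks_series.quantile(0.50)     # second quartile (the median)
q3 = marks_series.quantile(0.75)     # third quartile
q_max = marks_series.quantile(1.0)   # 100th percentile (the maximum)

print(q1, q2, q3, q_max)
```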
Sorting of DataFrame:
import pandas as pd
print("\n")
print("Dataframe of Values\n")
monthlysale = {
    'Salesman': ["Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan",
                 "Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan"],
    'Sales': [1000, 300, 800, 1000, 500, 60, 1000, 900, 300, 1000, 900, 50]
}
df = pd.DataFrame(monthlysale)
print(df)
print("\n")
print("Sorting of DataFrame by index in Descending order\n")
# sort_index() orders the rows by their index labels, not by the Sales values
sr = df.sort_index(ascending=False)
print(sr)
Output:
Dataframe of Values
Salesman Sales
0 Akshit 1000
1 Jaswant 300
2 Karan 800
3 Akshit 1000
4 Jaswant 500
5 Karan 60
6 Akshit 1000
7 Jaswant 900
8 Karan 300
9 Akshit 1000
10 Jaswant 900
11 Karan 50
Sorting of DataFrame by index in Descending order
Salesman Sales
11 Karan 50
10 Jaswant 900
9 Akshit 1000
8 Karan 300
7 Jaswant 900
6 Akshit 1000
5 Karan 60
4 Jaswant 500
3 Akshit 1000
2 Karan 800
1 Jaswant 300
0 Akshit 1000
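The output above is ordered by index label, which is what sort_index() does. To order the rows by the values of a column such as Sales, sort_values() can be used instead. The following is a minimal sketch on a shortened, three-row version of the data. The variable name by_sales is hypothetical.

```python
import pandas as pd

# Shortened version of the sales data, for illustration
df = pd.DataFrame({'Salesman': ["Akshit", "Jaswant", "Karan"],
                   'Sales': [1000, 300, 800]})

# Sort rows by the Sales column, largest value first
by_sales = df.sort_values(by='Sales', ascending=False)
print(by_sales)
```

The original index labels travel with their rows, so the sorted frame shows the index in the order 0, 2, 1.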
Function application:
If we want to apply a user-defined function, or a function from some other
library, Python pandas provides three important functions:
pipe(), apply() and applymap(). In the coming sections we will
see the use and working of all three functions one by one.
pipe() :
This function applies a custom operation to the entire DataFrame.
In the example below we use the pipe() function to multiply the Sales
values by 2.
# Use of pipe() function with DataFrame
import pandas as pd
# User-defined function: returns the Sales column multiplied by 2
def new_value(dataframe):
    return dataframe.Sales * 2
print("\n")
print("Creating a Dataframe of Values with Dictionary \n")
monthlysale = {'Salesman': ["Akshit", "Jaswant", "Karan"],
               'Sales': [1000, 300, 800]}
df = pd.DataFrame(monthlysale)
print("After applying the pipe() function to multiply the sales values by 2 \n")
print(df.pipe(new_value))
Output:
0 2000
1 600
2 1600
apply():
This function applies a custom operation along an axis of the DataFrame,
either row-wise or column-wise. By default it works column-wise.
# Use of apply() function with DataFrame
import pandas as pd
import numpy as np
print("\n")
print ( "Creating a Dataframe of Values with Dictionary \n")
monthlysale = { 'Salesman' : ["Akshit", "Jaswant","Karan"],
'Sales' : [1000,300,800]
}
df=pd.DataFrame(monthlysale)
print("The original DataFrame is \n")
print(df)
print("After applying the apply function to find the Maximum value in DataFrame \n")
print(df.apply(np.max))
print("After applying the apply function to find the Minimum value in DataFrame \n")
print(df.apply(np.min))
Output:
Creating a DataFrame of Values with Dictionary
The original DataFrame is
Salesman Sales
0 Akshit 1000
1 Jaswant 300
2 Karan 800
After applying the apply function to find the Maximum value in
DataFrame
Salesman Karan
Sales 1000
After applying the apply function to find the Minimum value in
DataFrame
Salesman Akshit
Sales 300
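The row-wise and column-wise behaviour of apply() is selected with the axis parameter. The following is a minimal sketch, assuming a hypothetical frame with two numeric sales columns and the salesman names used as the index. axis=0 passes each column to the function and axis=1 passes each row.

```python
import pandas as pd

# Hypothetical frame with numeric columns only, indexed by salesman name
df = pd.DataFrame({'Sales_March': [1000, 300, 800],
                   'Sales_April': [1500, 400, 1200]},
                  index=['Akshit', 'Jaswant', 'Karan'])

# axis=0: the function receives each column in turn (one result per column)
column_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row in turn (one result per row)
row_total = df.apply(lambda row: row['Sales_March'] + row['Sales_April'], axis=1)

print(column_range)
print(row_total)
```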
applymap():
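The applymap() function applies the specified operation to every
element of the DataFrame. (In pandas 2.1 and later, applymap() is deprecated
in favour of DataFrame.map(), which is used in the same way.)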
# Use of applymap() function with DataFrame
import pandas as pd
print("\n")
print ( "Creating a Dataframe of Values with Dictionary \n")
monthlysale = { 'Salesman' : ["Akshit", "Jaswant","Karan"],
'Sales_March' : [1000,300,800],'Sales_April' : [1500,400,1200]
}
df=pd.DataFrame(monthlysale)
print("The original Dataframe is \n")
print(df)
print("After applying the applymap() function to multiply both Sales by 2 \n")
print(df.applymap(lambda x: x * 2))
Output:
Creating a Dataframe of Values with Dictionary
The original Dataframe is
Salesman Sales_March Sales_April
0 Akshit 1000 1500
1 Jaswant 300 400
2 Karan 800 1200
After applying the applymap() function to multiply both sales by 2
Salesman Sales_March Sales_April
0 AkshitAkshit 2000 3000
1 JaswantJaswant 600 800
2 KaranKaran 1600 2400
Re-indexing:
The reindex() method in pandas conforms a Series or DataFrame to a new set
of row or column labels: existing labels keep their data, and labels that were
not present before are filled with NaN.
# Creating a Series to use with the reindex() function
import pandas as pd
print("\n")
s = pd.Series([1500, 400, 1200], index=[1, 2, 3])
print("The original Series is \n")
print(s)
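The example above stops before reindex() is actually called. As a minimal sketch continuing from the same Series values, the call below asks for the labels in a new order and adds one label (4) that does not exist in the original. The names s and reindexed are hypothetical.

```python
import pandas as pd

s = pd.Series([1500, 400, 1200], index=[1, 2, 3])

# Request the labels in a new order, plus a label (4) that is not in s
reindexed = s.reindex([3, 2, 1, 4])
print(reindexed)

# The original Series is left unchanged
print(s)
```

The new label has no data, so pandas fills it with NaN, and the values are shown as floating point numbers as a result.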
rename():
The pandas rename() method is used to rename index (row) labels or column labels.
# Use of rename() function with DataFrame
import pandas as pd
print("\n")
print("Creating a Dataframe of Values with Dictionary \n")
monthlysale = {'Salesman': ["Akshit", "Jaswant", "Karan"],
               'Sales_March': [1000, 300, 800]}
df = pd.DataFrame(monthlysale)
print("The original Dataframe is \n")
print(df)
# inplace=True makes the change in the original DataFrame
df.rename(columns={'Salesman': 'New_Salesman'}, inplace=True)
print("After applying the rename() function to change the name of one column\n")
print(df)
# Without inplace=True, rename() returns a new DataFrame and leaves df unchanged
print("No change in the original DataFrame if we omit the inplace parameter\n")
df.rename(columns={'Sales_March': 'March_Sale'})
print(df)
Output:
Creating a Dataframe of Values with Dictionary
The original Dataframe is
Salesman Sales_March
0 Akshit 1000
1 Jaswant 300
2 Karan 800
After applying the rename() function to change the name of one column
New_Salesman Sales_March
0 Akshit 1000
1 Jaswant 300
2 Karan 800
No change in the original DataFrame if we omit the inplace parameter
New_Salesman Sales_March
0 Akshit 1000
1 Jaswant 300
2 Karan 800
Group by Function:
By “group by” we are referring to a process involving one or more of the
following steps:
Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.
print("\n")
print ( "Dataframe of Values\n")
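The group by example is incomplete in these notes. The following is a minimal sketch of the steps, assuming a shortened, six-row version of the sales data. groupby() splits the rows by Salesman, sum() is applied to each group, and the results are combined into one Series. The variable name total_by_salesman is hypothetical.

```python
import pandas as pd

# Shortened version of the sales data, for illustration
df = pd.DataFrame({
    'Salesman': ["Akshit", "Jaswant", "Karan", "Akshit", "Jaswant", "Karan"],
    'Sales': [1000, 300, 800, 1000, 500, 60],
})

# Split by Salesman, apply sum() to each group, combine into one Series
total_by_salesman = df.groupby('Salesman')['Sales'].sum()
print(total_by_salesman)
```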
transform() Function:
This function applies a function to the values of a DataFrame and returns a
result with the same shape as the original.
# Use of transform() function with DataFrame
import pandas as pd
print("\n")
print ( "Dataframe of Values\n")
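The transform() example is also cut short here. As a minimal sketch, assuming a hypothetical frame with two numeric sales columns, transform() can be given a function that is applied to each column and must return values of the same length. The variable name doubled is hypothetical.

```python
import pandas as pd

# Hypothetical numeric frame, for illustration
df = pd.DataFrame({'Sales_March': [1000, 300, 800],
                   'Sales_April': [1500, 400, 1200]})

# Each column is passed to the function; the result keeps the original shape
doubled = df.transform(lambda x: x * 2)
print(doubled)
```

Unlike an aggregation such as sum(), the result has one value for every value in the input.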
To be continued ………………