Chapter -2
Python Pandas
Introduction :
Pandas is one of the most preferred and widely used data
science libraries. It offer efficient data structure which are not
only powerful , but also very convenient and flexible.
In this chapter we will learn :
pivoting, sorting and aggregation.
Descriptive statistics – histogram and quantiles, various
functions, reindexing and altering labels.
Pandas provide two basic data structures – Series and
DataFrame.
Series – It represents a one-dimensional array of indexed
data.
DataFrame – It store data in two-dimensional way.
Some attributes of DataFrame are :
1. index – to display index (row labels) of the DataFrame.
2. columns – to display column labels of the DataFrame.
3. axes – Return a list representing both the axes – row and
columns.
4. size – Return number of elements in DataFame
5. shape – Return dimension of DataFrame
6. values – Return DataFrame in the form of Numpy Array.
Selecting/Accessing a Column :
Selecting/Accessing a Subset from a DataFrame using
Row/Column Names :
Syntax :
<DataFrameObject>.loc[<startrow>:<endrow>]
To access selective columns use :
<DF object>.loc[:,<start column>:<end column>]
To access range of columns from a range of rows , use :
<DataFrameObject>.loc[<startrow>:<endrow>,
<startcolumn>:<endcolumn>]
Obtaining a Subset/Slice from a DataFrame using
Row/Columns Numeric Index/Position
<DF object>.iloc[<start row index>: <end row index>,
<start col index> : <end column index>]
Selecting/Accessing Individual Values :
<DF object>.<column>[<row name or row numeric index>]
Adding and Deleting Columns in DataFrames
Deleting Columns : use del statement to delete a column
Del<DF object>[<column name>]
Descriptive Statistics with Pandas :
Pandas include many useful statistical functions.
Reference dataframe namely sal_df
Functions min() and max() : The min() and max() functions
find out the minimum or maximum values respectively form a
given set of data.
Parameters :
axis : (0 or 1) by default, minimum and maximum is
calculated along axis 0.
Functions mode() , mean(), median() :
mode() – Returns the mode value (i.e., the value that appears
most often) from a set of values.
Parameters :
axis : axis 0 or ‘index’ get mode of each column
axis 1 or ‘columns’ : get mode of each row
mean() – Returns the computed mean(average) form a set of
values.
median() – returns the middle number form a set of numbers.
(2000 10000 12000 13000)
(10000+12000) =22000, 22000/2 = 11000
(2000 6000 7000 7000)
(6000 + 7000) = 13000, 13000/2 = 6500
Functions count( ) and sum( )
count( ) – counts the non-NA entries for each row or column.
sum( ) – returns the sum of values for the requested axis.
Applying Functions on a Subset of Dataframe :
Sometimes , we need to apply a function on a selective column
or row or a subset of the data frame.
Applying Function on a Column of a DataFrame:
To apply a function on a column, write –
<dataframe>[<column name>]
Applying function on Multiple Columns of a DataFrame :
<dataframe>[[<column name>, <column name>,….]]
Applying Function on a row of a DataFrame :
<dateframe>.loc[<row index>, :]
Applying Functions on a range of rows of a DataFrame:
<dataframe>.loc[<start row> : <end row>, :]
Applying functions to a subset of the DataFrame :
<dataframe>.loc[<start row>:<end row>, :
<start column> : <end column>]
Advanced Operations on DataFrame :
Three advanced operations are :
1. pivoting
2. sorting
3. aggregation
1. Pivoting : Pivoting is actually a summary technique that
works on tabular data (i.e., data in rows and columns).
Pivoting technique rearranges the data from rows and columns,
by possibly aggregating data from multiple sources, in a report
form (with rows transferred to columns) so that data can be
viewed in a different perspective.
Real life example :
An online tutoring company maintains its data about tutors and
online classes in the following table.
Using pivot Function :
Cells in the pivoted table which do not have a matching entry
in the original one are set with NaN.
Now change the rows and columns , i.e. the index and columns
arguments
We can skip the values argument:
The above data is for one quarter only. The online tutoring
company has data for the entire year as shown below :
The index i.e. , the rows are specified as ‘Tutor’ and the
columns as ‘Country’. There are multiple entries of tutor which
are very much different for same country.
Consider Tahira’s entries
Tutor Classes Quarter Country
Tahira 28 1 USA
Tahira 36 2 USA
Tahira 24 3 Barzil
Tahira 36 4 Japan
Try to create a row for tutor Tahira from the above data with
columns as Country.
USA Brazil Japan
Tahira 24 36
Using pivot_table( ) function :
Example-1 Considering the tutoring company data, compute
total classes per tutor.
Example-2 considering the tutoring company data, compute
number of countries (count) per tutor.
Example -3 Considering the tutoring company data, compute
total classes by country.
Example-4 Considering the tutoring company data, compute
total classes on two fields, tutor and country wise.
Example 2.5 Considering the tutoring company data, compute
average classes on two fields
Sorting : Sorting refers to arranging values in a particular
order.
sort_values( ) – function arrange the values in ascending or
descending order.
Creating Histogram :
Histogram – A histogram is a plot that lets you discover, and
show the underlying frequency distribution(shape) of a set of
continuous data.
hist( ) function of pandas is used to create histogram.
Consider the following histogram that has been computed using
the following datasets containing age of 20 people.
37 28 38 44 53 69 74 53 35 38 66 46 24 45 92 48 51 62 58 57
Bin Frequency Age included in Bin
20-30 2 28,24
30-40 4 37,38,35,38
40-50 4 44,46,45,48
50-60 5 53,53,51,56,57
60-70 3 69,66,92
70-80 1 74
80-90 0 --
90-100 1 92
Function Application : It means that a function (a library
function or user defined function ) may be applied on a
dataframe in multiple ways:
(a) on the whole dataframe
(b) row-wise or column-wise
(c) on individual elements, i.e. element-wise
for the above mentioned three types of function application,
Pandas offers following three functions :
(a) pipe() – dataframe wise function application.
(b) apply() – row-wise/column-wise function application.
(c) applymap() – individual elements-wise function application.
(a) pipe( ) function : The piping of functions through pipe( )
basically means the chaining of functions in the order they are
executed.
pipe() Example 1 Function add( ) followed by multiply( )
applied on a dataframe.
The apply and applymap() functions :
apply () – apply is a series function, so it applies the given
function to one row or one column of the dataframe (as single
row/columns of a dataframe is equivalent to a series).
Syntax - <dataframe>.apply(<funcname>,axis = 0)
axis 0 or 1 default 0 , axis along with the function is applied.
If axis is 0 or ‘index’ : function is applied on each column
If axis is 1 or ‘columns’ : function is applied on each row.
applymap() – is an element function, so it applies the
given function to each individual elements, separately.
Syntax - <dataframe>.applymap(<funcname>)
To apply, apply() row-wise write :
<dataframe>.apply(<func>, axis = 1)
NOTE – The apply() will apply the function on individual
columns/rows, only if the passed function name is a Series
function. If you pass a single value function, then apply() will
behave like applymap()
Function groupby( ) :
The duplicate values in the same filed are grouped together to
form groups, e.g. for creating Tutor wise groups :
All the rows having Tutor as Tahira will be clubbed to form
Tahira group.
groupby() function - is used to create group for the
duplicate values in the same filed.
The groupby() creates the group internally and does not
display the grouped data by default.
Grouping on Multiple Columns : Create a group for Tutor
and for each tutor group, a Country-wise subgroup :
Example : to get a group having tutor name as ‘Anusha’ and
Country as ‘UK’ write :
Aggregation via groupby () :
agg( ) method – aggregates the data of the dataframe using
one or more operations over the specified axis.
Syntax - <dataframe>.agg(func, axis = 0)
mean :-
36,40,30,32
(36+40+30+32)/4 =
34.5
median :-
30 32 36 40
n/2 = 4/2 = 2
(n/2) + 1 = (4/2) + 1
=3
(32 + 36) / 2 = 34
Sum : (36+40+30+32)
= 138
We may combine the groupby( ) and agg( ) in single command
:
The transform( ) function : This function transforms the
aggregate data by repeating the summary result for each row
of the group and make the result have the same shape as
original data.
Q- What if we want to add this aggregate data to the
dataframe itself?
Reindexing and Altering Labels : The methods provided by
Pandas for reindexing and relabeling are :
1. rename( ) – simply rename the index and/or column labels
in a dataframe.
(ii) reindex( ) – specify the new order of existing indexes
and column labels, and/or also create new indexes/column
labels.
(a) Reordering the existing indexes using reindex( )
Adding indexes :
(iii) reindex_like() – for creating indexes/column-labels
based on other dataframe object.
<dataframe>.reindex_like(other)
Solved Problems :
1. Consider the following code to create two dataframes with
similar values. What will be printed by the code given below ?
Justify your answer.
import pandas as pd
df1 = pd.DataFrame([1,2,3])
df2 = pd.DataFrame([[1,2,3]])
print("df1")
print(df1)
print("df2")
print(df2)
Ans :