Chapter 1 and 2 Series and Data Frame
Chapter 1 and 2 Series and Data Frame
UNIT - I
DATA HANDLING USING PANDAS
Pandas:
• Python libraries contain a collection of built-in modules
• NumPy, Pandas and Matplotlib are three well-established Python libraries for scientific and analytical
use.
• PANDAS (PANel DAta) is a high-level data manipulation tool used for Data Analysing
• Pandas is an Open Source library built for Python Programming language.
• The main author of Pandas is Wes McKinney.
9
DataFrame :
Creation of Series :
There are a number of ways to create a DataFrame
Coding: Output:
import pandas as pd Series([], dtype: float64)
s1=pd.Series()
print(s1)
(B) Creation of a Series from List:Series can be created from a List:[ default indices range from 0 through
N – 1. Here N is the number of data elements]
Coding: Output:
import pandas as pd
s2=pd.Series(['p', 'y', 't', 'h', 'o', 'n'])
print(s2)
10
Coding: Output:
import pandas as pd
s2=pd.Series(['p', 'y', 't', 'h', 'o', 'n'],
index=[111,222,333,444,555,666])
print(s2)
(C) Creation of a Series from Numpy Array: [ one-dimensional (1D) NumPy array]
Coding: Output:
import pandas as pd
a1=np.array([10,20,30,40])
s=pd.Series(a1)
print(s)
(D) Creation of a Series from Dictionary: Keys become Index and Values become Data
Coding: Output:
import pandas as pd
d1={'I': 'one', 'II': 'two', 'III': 'three'}
s=pd.Series(d1)
print(s)
(E) Creation of a Series from Scalar value(Single value for all items)
Coding: Output:
import pandas as pd
s=pd.Series(5)
print(s)
11
Coding: Output:
import pandas as pd
import numpy as np
s=pd.Series([10,20,30,np.NaN,50])
print(s)
Note: numpy should be imported
12
s=pd.Series(['p', 'y', 't', 'h', 'o', 'n'])
1. s[0] gives p
2. s[2] gives t
3. s[[1,3]] gives
Labelled index :
▪ Single element can be accessed using labelled index (Seriesobject [labelled index])
▪ More than one element of a series can be accessed using a list of index labels
If s is the series given below(Data is the city name, Index is the State )
INDEX DATA
city=['Mumbai','kolkata','Chennai','Bangalore','Hydrebad']
state=['Maharashtra','west Bengal','Tamilnadu',
'karnataka','Telangana']
s=pd.Series(city,state)
print(s)
4) Even though labelled index is used, We can also access elements of the series using
the positional index
5) Both s[3], s['karnataka'] gives Bangalore, Bangalore
(B) Slicing:
▪ It is used to extract a part of a series.
▪ Part of the series to be sliced can be defined by specifying the start and end parameters
[start :end] with the series name. eg: s[2:5]
▪ When we use positional indices for slicing, the value at the end index position is excluded,
i.e., In s[2:5], element at 5th index is excluded, (end-start) 5-2=3 elements at index 2,3 and 4
are extracted
13
▪ If labelled indexes are used for slicing, then value at the end index label is also included
i.e s['west Bengal':'Telangana'] includes all elements from index westbengal till
Telangana(included)
14
s[:-2] Displays Elements from
positional index 0 till -3
(-2 will not display)
s[1]=75
s['Amit']=88
s[3:5]=77
#This changes element from index 3 to 4 as 55
s[['Julie','Amar']]=90
#This changes Julie’s and Amar’s data as 90
15
s['Laxman':'Amit']=33
#This changes data from Laxman till Amit(including) as 33
Attributes of Series :
We can access certain properties called attributes of a series by using that property with the series name.
Attributes Description
16
size s1.size s2.size
3 4
It returns true in case of empty series
empty
s1.empty s2.empty
False False
It returns true if the series contains NaN
Hasnans
s1.hasnans s2.hasnans
Series s
s.head()
s.head(3)
Tail():
i. tail(<n>) function fetches last n rows from a pandas object
ii. To access last 3 rows you should write Series_name.tail(3)
iii. If you do not provide any value for n, (Series_name.tail() )will return last 5 rows
Series s
s.tail()
s.tail(3)
17
Note: if number of rows existing less than the required rows ,available rows will get displayed
Mathematical processing can be performed on series using scalar values and functions. All the arithmetic
operators such as +, -, *, /, etc. can be successfully performed on series.
Note:
Arithmetic operation is possible on objects of same index; otherwise, will result as NaN.
Series also supports vector operations. Any operation to be performed on a series gets performed on every
single element of it
import pandas as pd
s1 = pd.Series([1,3,6,4])
print(s1)
18
print(s1+2) # 2 gets added with every element
print(s1*2) # every element gets multiplied by 2
print(s1>2) # It returns true if element >2, otherwise False
S1>2:
S1: S1+2: S1*2:
We can also give conditions to retrieve values from a series that satisfies the given condition
The following examples performing the filter operation and returns filtered result containing only those values
that return True for the given Boolean expression.
print(s1[s1>2]) #This returns only those result for which s1>2 is True (False data will not be displayed)
print(s1[s1%2==0]) #This returns only those result for which s1%2==0 is True
s.drop("Kavita")
19
s.loc['b'] s.iloc[2]
iloc()- does not
include end index
s.iloc[1:4] loc() includes end
label
s.loc['b':'e']
Index 1 till 3 data gets displayed, 4 is excluded in iloc
Creation of DataFrame :
There are a number of ways to create a DataFrame
Coding: Output:
import pandas as pd Empty
df1=pd.DataFrame() DataFrame
print(df1) Columns: []
Index: []
22
Coding: Output:
Keys of dictionaries (Name,Age,Marks)
become column names
import pandas as pd
d1={'Name':'Priya','Age':16,'Marks':70}
d2={'Name':'Harshini','Age':11,'Marks':99}
d3={'Name':'Kanishka','Age':15,'Marks':90}
df1=pd.DataFrame([d1,d2,d3])
print(df1)
23
Coding: Output:
Keys of dictionary (Name, Age, Gender, Marks)
import pandas as pd become column names
name=['ramya','ravi','abhinav','priya','akash']
age=[16,17,18,17,16]
gender=['f','m','m','f','m']
marks=[88,34,67,73,45]
d1={'name':name,'age':age,'gender':gender,'marks'
:marks}
df1=pd.DataFrame(d1)
print(df1)
Dictionary keys become column labels by default in a Data Frame, and the lists become the rows
DataFrame created from One Series: Output: As no index passed default value of row index starts with 0,
Only one column with default 0 index
Coding:
import pandas as pd
s1=pd.Series([100,200,300,400])
df1=pd.DataFrame(s1)
print(df1)
s1=pd.Series([100,200,300,400],index=['a','b','c','d'])
Column index is index of Series
s2=pd.Series([111,222,333,444],index=['a','b','c','d'])
df1=pd.DataFrame([s1,s2])
print(df1)
s1=pd.Series([100,200,300,400],index=['a','b','c','d'])
Column index is union of all index of all Series
s2=pd.Series([111,222,333,444],index=['a','b','c','e'])
24
df1=pd.DataFrame([s1,s2])
print(df1)
DataFrame from Multiple Series:
• The labels(index) in the series object become the column names
• Each series becomes a row
• No of columns =No of distinct labels in all the series
• If a particular series does not have a corresponding value for a label, NaN is inserted in the DataFrame
column
DataFrame created from Dictionary of Series: Output: Keys becomes Column name
Coding: Values (Series) becomes column data
import pandas as pd
name=pd.Series(['ramya','ravi','abhinav','priya','akash'])
age=pd.Series([16,17,18,17,16])
gender=pd.Series(['f','m','m','f','m'])
marks=pd.Series([88,34,67,73,45])
d1={'name':name,'age':age,'gender':gender,'marks':marks}
df1=pd.DataFrame(d1)
print(df1)
DataFrame created from Dictionary of Series(With Output :
different index: Keys becomes Column name
Values (Series) becomes column data
import pandas as pd If no value for particular row index, NaN is inserted
name=pd.Series(['ramya','ravi','abhinav','priya','akash'],[111,
222,333,444,555])
age=pd.Series([16,17,18,17,16],[111,555,666,222,333])
gender=pd.Series(['f','m','m','f','m'],[111,333,444,555,666])
marks=pd.Series([88,34,67,73,45],[222,333,444,555,666])
d1={'name':name,'age':age,'gender':gender,'marks':marks}
df1=pd.DataFrame(d1)
print(df1)
df1['city'] =
['chennai','mumbai','delhi'
,'mumbai','kolkata']
The following command will add new column newcity with same value ‘chennai’ for all rows
df1['newcity']='chennai'
The following command will change the content of existing column city with new value as chennai for all
rows
26
df1['city']='chennai'
The following command will add new row ‘Swetha’ with given list of values
df1.loc['Sita'] = [77,67,76]
The following command will add new row ‘Gita’ with value 80 for all columns
df1.loc['Gita'] =
80
The following command can set all values of a DataFrame to a particular value
df1[:]=0
27
The following command removes the row ‘Ramya’ [default value of axis is 0]
df1=df1.drop('Ramya')
df1=df1.drop('Priya',axis=0)
df1.drop('Kavita',inplace=True)
df1=df1.drop('Eng',axis=1)
28
The following command removes the Columns Eng ,Maths
df1=df1.drop(['Eng','Maths'],
axis=1)
The following command renames the row label Ramya by Ram[By default axis =0 so row label changes]
df1=df1.rename({'Ramya':'Ram'})
The following command renames the row label Kavita by Savita [ index used]
df1=df1.rename
(index={'Kavita':'Savita'})
The following command renames the row label Priya by Riya [ axis=0 used by default axis is 0]
df1=df1.rename
({'Priya':'Riya'},axis=0)
29
The following command renames the Column label Eng by English [
axis=1 ]
df1=df1.rename({'Eng':'English'}
,axis=1))
The following command renames the Column labels Science by EVS and Maths by Mathematics [ columns ]
df1=df1.rename(columns={'Maths
':'Mathematics','Science':'EVS'})
df1['Maths']
df1.Maths
The [ ] is used to select multiple columns passed as a list ,Df [[list of column names]]
In the given dataframe df1,
30
df1[['Eng','Maths']]
B) Slicing:
• We can use slicing to select a subset of rows and/or columns from a DataFrame, like Select all rows
with particular columns, Select particular rows with all columns etc
C) Accessing the data frame through loc()[label indexing] and iloc()[positional indexing] method
• Pandas provide loc() and iloc() methods to access the subset from a data frame using row/column
Loc() method :
• The loc property is used to access a group of rows and columns by label(s) [label index]
Df.loc[StartRow : EndRow, StartColumn : EndColumn]
• when the row label is passed as an integer value, it is interpreted as a label of the index and not as an
integer position along the index
• When labelled indices are used for slicing, value at the end index label is also included in the output.
Df1.loc[a:e,col1:col4] access ‘a’ to ‘e’ [including ‘e’] and columns col1 to col4
iLoc() method :
• It is used to access a group of rows and columns based on numeric index value
Df.iloc[StartRowindex : EndRowindex, StartColumnindex : EndColumnindex]
• When positional indices are used for slicing, the value at end index position is excluded
Df1.iloc[1:5,2:6] access rows 1 to 4 [excluding 5] and columns 2 to 5[excluding 6]
Note -If we pass “:” in row or column part then pandas provide the entire rows or columns respectively
The following commands helps to access Single row [Details of Ramya ] [Symbol “:” indicates all columns]
31
df1.loc['Ramya']
df1.loc['Ramya',:]
The following commands helps to access Multiple rows (Details of Ramya and Kanishka)
[Records not necessary to be continuous, it should be enclosed in list ]
df1.loc[['Ramya','Kanishka']]
df1.loc['Ramya':'Kanishka']
df1.loc['Ramya':']
32
5) Single Column Access:
The following commands helps to access Single Column [Details of Maths ] [Symbol “:” indicates all
rows]
df1.loc[:,'Maths']
df1.loc[:,['Eng','Science']]
df1.loc [: ,'Eng':'Science']
33
df1.iloc[1]
df1.iloc[1,:]
df1.iloc[1:4]
3) The following commands helps to access Multiple rows [Display all rows from Ramya(index 1) till
last row]
df1.iloc[1:]
4) 4) Single Column Access: The following commands helps to access Single Column [Details of Maths
index-1 ] [Symbol “:” indicates all rows]
df1.iloc[:,1]
34
df1.iloc[:,[0,2]]
The following commands helps to access Multiple Columns (Details from Eng till Science index 0 till last)
[Symbol ‘:’ should be used]
df1.iloc[:,0:]
Boolean Indexing :
▪ Boolean means a binary variable that can represent either of the two states - True (indicated by 1) or
False (indicated by 0).
▪ In Boolean indexing, we can select the subsets of data based on the actual values in the DataFrame
rather than their row/column labels.
▪ Thus, we can use conditions on column names to filter data values.
The following commands displays True or False depending on whether the data value satisfies the
given condition or not. (if Maths>=95 it returns True otherwise it returns False]
df1.Maths>=95
35
The following commands displays the details of those students who secured >= 95 in Maths
df1[df1.Maths>=95]
The following commands displays the English and Science marks of those students who secured >= 95
in Maths
When we create an object of a DataFrame then all information related to it like size, datatype etc can
be accessed by attributes. <DataFrame Object>.<attribute name>
ATTRIBUTE DESCRIPTION
Index It shows index of dataframe
Index(['Priya', 'Ramya', 'Kavita', 'Kanishka', 'Harshini'],
dtype='object')
36
Dtypes It returns data type of data contained by dataframe
Eng int64
Maths int64
Science int64
dtype: object
Size It returns number of elements in an object
15
Shape It returns tuple of dimension of dataframe
(5, 3)
Values It returns numpy form of dataframe
[[80 88 73]
[70 98 81]
[75 77 66]
[86 96 94]
[90 95 92]]
Empty It is an indicator to check whether dataframe is empty or not
False
df1.head(2)
If df1.head() command is executed it displays first 5 rows, if number of rows is less than 5, it will display all rows
37
▪ DataFrame.tail(n) to display the last n rows in the DataFrame
▪ If the parameter n is not specified by default, it gives the last 5 rows of the DataFrame.
df1.tail(2)
If df1.tail() command is executed it displays last 5 rows, if number of rows is less than 5, it will display all rows
Iterations in DataFrame:
Iterrows():
▪ DataFrame. iterrows() method is used to iterate over rows
▪ Each iteration produces an index and a row (a Pandas Series object)
df1:
CODING:
for i,j in df1.iterrows():
print("Details of ",i,":\n",j)
38
Iteritems():
▪ DataFrame. iteritems() method is used to iterate over columns
▪ Each iteration produces a column name and a column(a Pandas Series
object)
df1:
Itertuples():
▪ DataFrame. Itertuple() method return a named tuple for each row in the DataFrame
▪ The first element of the tuple will be the row’s corresponding index value, while the remaining values
are the row values
df1:
CODING:
for i in df1.itertuples():
print(i)
39
#Addition
df3=df1+df2 # This performs addition of two dataframe elementwise
print("df3=df1+df2","\n",df3)
print("********************")
40
df1.at['Ramya','Maths']=77 # it changes the row label ‘Ramya’’s Column ‘Maths’ as 77
Output:
Empty DataFrame
Columns: []
Index: []
2. Creating an Empty Dataframe with columnnames:
Output:
Empty DataFrame
Columns: [Name, Articles, Improved]
Index: []
3. Creating an Empty Dataframe with columnnames and indices:
Output:
Name Articles Improved
a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
4. Creating Dataframes using Dictionary(Keys of dictionary- becomes column names)
Output:
41
Output:
Output:
a) print(df)
b) print(df.index)
c) print(df.columns)
d) print(df.axes)
e) print(df.dtypes)
f) print(df.size)
g) print(df.shape)
h) print(df.values)
i) print(df.empty)
j) print(df.ndim)
k) print(df.T)
a) df :
42
b) df.index : It gives the index of the dataframe
43
Answer: ( All the codings given below will display column ‘Age’ )
print(df['Age'])
print(df.Age)
print(df.loc[:,'Age'])
print(df.iloc[:,1])
9. In the given Dataframe give the command to do the following:
44
a) Give the command to add a column named ‘C’ with all values as 11
Answer: ( All the codings given below will add column ‘C’)
df['C']=11
df['C']=[11,11,11,11]
df.insert(2, "C", 11)
df.insert(2, "C", [11,11,11,11])
[The insert function takes 3 parameters which are the index, the name of the column, and the values. The column
indices start from 0 so we set the index parameter as 2 to add the new column next to column B. ]
df.loc[:, "C"]=11
df=df.assign(C=11)[‘Note: C is not enclosed in quotes and it is assigned to df]
b) Adding single column City with a list of values ['Delhi', 'Bangalore', 'Chennai', 'Patna']
Answer: ( All the codings given below will add column ‘City’)
df['City']=['Delhi', 'Bangalore', 'Chennai', 'Patna']
df.insert(2,"City",['Delhi', 'Bangalore', 'Chennai', 'Patna'])
df = df.assign(City = ['Delhi', 'Bangalore', 'Chennai', 'Patna'])
df.loc[:,'City']=['Delhi', 'Bangalore', 'Chennai', 'Patna']
df.at[:,'City']=['Delhi', 'Bangalore', 'Chennai', 'Patna']
45
Worksheet - Basic Level Questions: (L1)
11) Which of the following can be used to specify the data while creating a DataFrame?
i. Series ii. List of Dictionaries iii. Structured ndarray iv. All of these
12) Carefully observe the following code:
import pandas as pd
Year1={'Q1':5000,'Q2':8000,'Q3':12000,'Q4': 18000}
Year2={'A' :13000,'B':14000,'C':12000}
totSales={1:Year1,2:Year2}
df=pd.DataFrame(totSales)
print(df)
Answer the following:
a. List the index of the DataFrame df
b. List the column names of DataFrame df.
1) Write a Python code to create a DataFrame with appropriate column headings from the list given
below: [[101,'Gurman',98],[102,'Rajveer',95],[103,'Samar' ,96],[104,'Yuvraj',88]]
2) Consider the given DataFrame ‘Stock’:
46
Write suitable Python statements for the following:
i. Add a column called Special_Price with the following data: [135,150,200,440].
ii. Add a new book named ‘The Secret' having price 800. iii. Remove the column
Special_Price.
3) Mark the correct choice as
i. Both A and R are true and R is the correct explanation for A
ii. Both A and R are true and R is not the correct explanation for A
iii. A is True but R is False
iv. A is false but R is True
Assertion (A):- DataFrame has both a row and column index.
Reasoning (R): - A DataFrame is a two-dimensional labelled data structure like a table of
MySQL.
4) Mr. Som, a data analyst has designed the DataFrame df that contains data about Computer
Olympiad with ‘CO1’, ‘CO2’, ‘CO3’, ‘CO4’, ‘CO5’ as indexes shown below. Answer the
following questions:
47
Worksheet - Difficult questions(L3):
If df is as given below, find the output of 1 to 14 and write commands for 15 to 20
1. print(df.loc['a':'d':2])
2. print(df.loc['b':'d','Name'])
3. print(df.loc[['b','d'],'Name'])
4. print(df.loc['a':'d':2,['Name','Age']])
5. print(df.loc['a':'d':2,'Name':'University'])
6. print(df.at['b','Name'])
7. df.at['b','Name']='Ravi'
print(df)
8. print(df.iat[2,1])
9. df.iat[2,1]=111
print(df)
10. print(df.iloc[2])
11. print(df.iloc[2:4])
12. print(df.iloc[2,2])
13. print(df.iloc[1:,1:])
14. df.iloc[2,2]='RU'
15. Display the details of Students who are from BHU university
16. Display the details of Students whose age is more than 21
17. Display the names of Students who are from JNU University
18. Display name and age whose university is DU
19. Give all the possible ways of displaying column Age
20. Make all the values as 0
Output:
28
30. Explain any three methods of pandas Series.
50
Importing and Exporting data between CSV files and Dataframes
CSV files
• Comma separated values files
• Data in tabular format
• Can be imported and exported from programs
51
Exporting data from dataframe to csv file
Function used
Dataframe.to_csv()
52