Ln. 1 – Data Handling using Pandas
INTRODUCTION
• Data science is a large field covering everything from data collection, cleaning and
standardization to analysis, visualization and reporting.
• Data processing is an important part of analyzing the data because the data is not always
available in the desired format.
• Several processing steps, such as cleaning, restructuring or merging, are required before the
data can be analyzed.
• NumPy, SciPy, Cython and Pandas are tools available in Python that can be used for fast
processing of data.
Modules and Libraries
A Python library is a collection of ready-made modules that allows us to perform many
actions without writing detailed programs for them.
Each library in Python contains a large number of modules that one can import and use.
E.g. NumPy, Pandas, Matplotlib
A module is a file that contains Python functions and global variables.
Pandas-
• Pandas is a high-performance open-source library for data analysis in Python developed by
Wes McKinney in 2008.
• The term ‘Pandas’ is derived from ‘panel data’, an econometrics term for
multidimensional, structured data sets.
• Pandas is built on top of NumPy (Numerical Python), which it uses for mathematical
operations, and it works with matplotlib for data visualization.
• It is one of the most popular Python packages for data science, offering powerful and flexible data
structures that make data analysis and manipulation easy.
Key Features of Pandas
Quick and efficient data manipulation and analysis.
Merges and joins two datasets easily.
Easy handling of missing data
Represents the data in tabular form.
Support for multiple file formats
Easy sorting of data
Flexible reshaping and organising of data sets.
Time Series functionality.
Summarising data by classification variable
Handles large data efficiently
Note- Installing Pandas
Use command prompt to install pandas
Type pip install pandas
• pip is the standard package manager for Python. It allows you to install and manage
additional packages that are not part of the Python standard library.
NumPy vs Pandas-
Pandas | NumPy
Preferred when we have to work on tabular data. | Preferred when we have to work on numerical data.
Its powerful tools are DataFrame and Series. | Its powerful tool is the array.
Generally consumes more memory. | Memory efficient.
Indexing a Pandas Series is slower than indexing a NumPy array. | Indexing of NumPy arrays is very fast.
Offers a 2D table object called DataFrame. | Provides multi-dimensional arrays.
Pandas Datatypes
Pandas Data structures :
A data structure is a collection of data values and the operations that can be applied to that
data. It enables efficient storage, retrieval and modification of the data.
Pandas has traditionally described three data structures −
✓ Series : It is a one-dimensional structure storing homogeneous data.
✓ DataFrame : It is a two-dimensional structure storing heterogeneous data.
✓ Panel: It is a three-dimensional way of storing items. Panel was deprecated and then removed in
pandas 0.25, so current versions provide only Series and DataFrame.
These data structures are built on top of NumPy arrays, which means they are fast.
Series
• The Series is the primary building block of Pandas.
• It is a one-dimensional labelled array capable of holding data of any type (integer, string,
float etc.), with all values sharing one data type (homogeneous data).
• The data values are mutable (can be changed) but the size of Series data is immutable.
• It contains a sequence of values and an associated array of data labels called its index.
• If the values are of different data types, they are all upcast to one common dtype; for example,
mixing numbers and strings gives dtype=object.
• We can imagine a Pandas Series as a column in a spreadsheet.
Creation of Series
• A Series in Pandas can be created using the ‘Series’ constructor.
• It can be created using various input data like − Array, Dict, Scalar value or constant, List
• Syntax-
import pandas as pd
pd.Series(data, index, dtype, copy)
• The import statement loads the Pandas module into memory so that it can be used.
• pd is an alternate name (alias) given to the Pandas module. Its significance is that we can use ‘pd’
instead of typing pandas every time we need to use it.
Creation of Empty Series
Note –
• pd.Series() displays an empty Series along with its default data type.
• Here ‘s’ is the Series Object.
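A minimal sketch of creating an empty Series. The dtype is passed explicitly here, which is an assumption of this sketch rather than part of the notes, because the default dtype of an empty Series has differed between pandas versions.

```python
import pandas as pd

# 's' is the Series object; dtype is given explicitly (assumption of this
# sketch) so the result does not depend on the pandas version in use.
s = pd.Series(dtype="float64")

print(s)        # shows an empty Series together with its dtype
print(s.empty)  # the 'empty' attribute reports whether the Series has no elements
```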
Create a Series from Scalar
• When a scalar is passed, all the elements of the series are initialized to the same value.
• The value will be repeated to match the length of the index.
• If we do not explicitly specify an index for the data values while creating a series, then by
default indices range from 0 through N – 1. Here N is the number of data elements.
Alternatively, the index can be generated using the range() function.
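The points above can be sketched as follows. The scalar value 5 and the index labels are illustrative choices, not values from the notes.

```python
import pandas as pd

# The scalar 5 is repeated once for every label in the index.
s1 = pd.Series(5, index=["a", "b", "c"])

# range(4) supplies the index labels 0, 1, 2, 3.
s2 = pd.Series(5, index=range(4))

print(s1)
print(s2)
```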
Creating a Series with a list
• Syntax:
<Series Object> = pd.Series([data], index=[index])
Note- To give a name to the index and to the values,
st.index.name = 'Animals' # shown at the top of the index column
st.name = 'Pets' # shown at the bottom of the Series
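A short sketch of a Series built from a list, with the index and the values named as in the note. The animal names and index labels are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data: three values with user-defined index labels.
st = pd.Series(["Dog", "Cat", "Rabbit"], index=["a", "b", "c"])

st.index.name = "Animals"  # printed above the index column
st.name = "Pets"           # printed in the last line of the output

print(st)
```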
Program - Print the output as shown below-
1 Jan
2 Feb
3 Mar
4 Apr
5 June
6 July
dtype: object
Create the series using the range() function.
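One possible way to produce the output shown above is sketched here. The variable names are illustrative, and the month labels are copied from the expected output as given.

```python
import pandas as pd

# Values copied from the expected output in the exercise.
months = ["Jan", "Feb", "Mar", "Apr", "June", "July"]

# range(1, 7) generates the index labels 1 to 6.
s = pd.Series(months, index=range(1, 7))
print(s)
```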
Create a series using 2 different lists
>>> import pandas as pd
>>> m=['jan','feb']
>>> n=[23,34]
>>> s=pd.Series(m,index=n)
>>> s
Note-
• type(s) returns the type of the object (the pandas Series class); s.dtype gives the data type of its values.
• s.tolist() will convert the series back to a list.
Create a Series from dictionary
• A dictionary can be passed as input to a Series.
• Dictionary keys are used to construct index.
Program
Write a program to convert a dictionary to a Pandas series. The dictionary named Students must contain-
Key : Name, RollNo, Class, Marks, Grade
Value : Your name, roll number, class, marks and grade
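A possible solution sketch for the program above. All the values in the dictionary are hypothetical placeholders to be replaced with your own details.

```python
import pandas as pd

# Hypothetical placeholder values; replace them with your own details.
Students = {"Name": "Asha", "RollNo": 12, "Class": "XII", "Marks": 88, "Grade": "A"}

# The dictionary keys become the index labels of the Series.
s = pd.Series(Students)
print(s)
```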
Arrays-
An array is a data structure that contains a group of elements.
Arrays are commonly used in computer programs to organize data so that a related set of
values can be easily sorted or searched.
Each element can be uniquely identified by its index in the array.
Array | Series
Indexing is by default from 0. | Indexing can be given manually to the elements.
Elements are arranged horizontally. | Elements are arranged vertically.
Indexes are not visible in the array. | Indexes are shown along with the elements.
Create series from ndarray
✓ An array of values can be passed to a Series.
✓ If data is an ndarray, index must be the same length as data.
✓ If no index is passed, one will be created having values [0, ..., len(data) - 1].
✓ When index labels are passed with the array, then the length of the index and array must be
of the same size, else it will result in a ValueError.
✓ Example- array1 contains 4 values whereas there are only 3 indices, hence ValueError is
displayed.
>>> series5 = pd.Series(array1, index = ["Jan", "Feb", "Mar"])
ValueError: Length of passed values is 4, index implies 3
import pandas as pd
import numpy as np
a = np.array(['J','F','M','A'])  # ndarray with 4 values
s = pd.Series(a, index = ["Jan", "Feb", "Mar", "Apr"])  # 4 labels, so the lengths match
print(s)
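The rules above can be checked with a small sketch. The contents of array1 are hypothetical, since the notes only state that it holds 4 values.

```python
import numpy as np
import pandas as pd

array1 = np.array([10, 20, 30, 40])  # hypothetical values; only the length matters

# No index passed: pandas creates the default index 0 .. len(data) - 1.
series_default = pd.Series(array1)
print(series_default)

# Three labels for four values: pandas raises a ValueError.
try:
    pd.Series(array1, index=["Jan", "Feb", "Mar"])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```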
NaN
Any item that does not have an entry is marked by
NaN, or “Not a Number”, which is how Pandas marks missing data.
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([1,2,3,4,np.nan,5,np.nan])
>>> s
import pandas as pd
import numpy as np
s = pd.Series([2,3,np.nan,7,"The Hobbit"])
To test for missing values, use s.isnull():
0 False
1 False
2 True
3 False
4 False
dtype: bool
Accessing Elements of a Series
There are two common ways for accessing the elements of a series: Indexing and Slicing.
Indexing is used to access elements in a series.
Indexes are of two types: positional index and labelled index.
Positional index takes an integer value that corresponds to its position in the series starting
from 0, whereas labelled index takes any user-defined label as index.
>>> import pandas as pd
>>> a = pd.Series([2,3,4],index=["Feb","Mar","Apr"])
Note-
The index values associated with the series can be altered by assigning new index values.
Eg:- a.index=['May','June','July']
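A short sketch of positional and labelled indexing on the series a defined above. The variable names for the results are illustrative. Positional access is written with .iloc here, an assumption of this sketch, because plain a[0] on a label-indexed Series is deprecated in recent pandas versions.

```python
import pandas as pd

a = pd.Series([2, 3, 4], index=["Feb", "Mar", "Apr"])

first_by_position = a.iloc[0]  # positional index: the element at position 0
mar_by_label = a["Mar"]        # labelled index: the element with label "Mar"

# Assigning to .index replaces the labels; the values stay the same.
a.index = ["May", "June", "July"]
june_value = a["June"]

print(first_by_position, mar_by_label, june_value)
```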
To extract part of a series, slicing is done.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
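Slicing the series s defined above might look like the sketch below. It uses .iloc for positions and .loc for labels to make the two cases explicit, which is a choice of this sketch.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Positional slice: positions 1 and 2; the end position 3 is excluded.
by_position = s.iloc[1:3]

# Label slice: labels 'b', 'c' and 'd'; the end label is included.
by_label = s.loc["b":"d"]

print(by_position)
print(by_label)
```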
Write a Python program to create a series of odd numbers.
>>> import pandas as pd
>>> odd=pd.Series(range(1, 10, 2))
>>> odd
0 1
1 3
2 5
3 7
4 9
dtype: int64
Modifying Series data with slicing
>>> import pandas as pd
>>> import numpy as np
>>> abc = pd.Series(np.arange(10,16,1), index = ['a', 'b', 'c', 'd', 'e', 'f'])
>>> abc
>>> abc[1:3] = 50
>>> abc
Observe that when values are updated using a positional slice, the value at the end
index position is excluded.
However, when slicing is done using labels, the value at the end index label is changed as well.
>>> abc['c':'e'] = 500
>>> abc
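The difference between the two kinds of slice assignment can be sketched as follows. It uses .iloc and .loc to make position versus label explicit (a choice of this sketch), and the helper names after_position and after_label are illustrative.

```python
import numpy as np
import pandas as pd

abc = pd.Series(np.arange(10, 16, 1), index=["a", "b", "c", "d", "e", "f"])

# Positional slice: only positions 1 and 2 change; position 3 is excluded.
abc.iloc[1:3] = 50
after_position = abc.tolist()

# Label slice: 'c', 'd' and 'e' all change; the end label is included.
abc.loc["c":"e"] = 500
after_label = abc.tolist()

print(after_position)
print(after_label)
```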
Accessing Data from Series with indexing and slicing
• In a series we can access the value at any position using its index number.
• Slicing is used to retrieve subsets of data by position.
• A slice is written as start:end:step, where start is the position of the first
item, end is the position at which the slice stops (not included), and step is the increment between items.
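A brief sketch of the start:end:step form. The sample series and the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# start=0, end=5, step=2: every second element from the beginning.
every_other = s.iloc[0:5:2]

# A negative step walks backwards, so this reverses the series.
reversed_s = s.iloc[::-1]

print(every_other)
print(reversed_s)
```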
Vector operations in Series
• Series support vector operations.
• Any operation to be performed on a series gets performed on every single element.
Eg:-
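A minimal sketch of vector operations, using an illustrative series of four integers. Each operation is applied to every element without writing a loop.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

plus_two = s + 2   # 2 is added to every element
squared = s ** 2   # every element is squared

print(plus_two)
print(squared)
```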
Binary operations in Series
We can perform binary operations on series using mathematical operators.
While performing operations on series, index matching is implemented and all missing values
are filled in with NaN by default.
The output of an operation is NaN if one or both of the elements have no value.
When we do not want to have NaN values in the output, we can use the series methods add(),
sub(), mul(), div() with the parameter fill_value to replace a missing value with a specified value.
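The index matching described above can be sketched with two small series whose labels only partly overlap. The values and labels are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels 'a' and 'd' exist in only one series, so their result is NaN.
total = s1 + s2

# fill_value=0 treats the missing entry as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```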
Binary operations in Series [ Using other functions ] -
Program
Write a Pandas program to compare the elements of the two Pandas Series.
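A possible solution sketch for the comparison program. The two series are illustrative sample data. The comparison operators work element by element and return a Boolean series.

```python
import pandas as pd

s1 = pd.Series([2, 4, 6, 8, 10])
s2 = pd.Series([1, 3, 5, 7, 10])

equal = s1 == s2    # True where the elements are equal
greater = s1 > s2   # True where s1 is larger
less = s1 < s2      # True where s1 is smaller

# The method form of the same comparison.
equal_method = s1.eq(s2)

print(equal)
print(greater)
print(less)
```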
Attributes in Series
Program – To sort values
Note- We can use either np.arange or the range function to generate a set of numbers
automatically; a Series built from either one contains the same values.
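A sketch of sorting a series by value, built once with np.arange and once with range. The descending input sequence is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Both lines generate the numbers 5, 4, 3, 2, 1.
s_np = pd.Series(np.arange(5, 0, -1))
s_range = pd.Series(range(5, 0, -1))

# sort_values() returns a new series in ascending order;
# each value keeps its original index label.
sorted_np = s_np.sort_values()
sorted_range = s_range.sort_values()

print(sorted_np)
print(sorted_range)
```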
Accessing rows using head () and tail() function
✓ Series.head() returns the first 5 rows of the series by default; head(n) returns the first n rows.
✓ Series.tail() returns the last 5 rows of the series by default; tail(n) returns the last n rows.
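A short sketch of head() and tail() on an illustrative series of the numbers 1 to 10.

```python
import pandas as pd

s = pd.Series(range(1, 11))

first_five = s.head()   # no argument: the first 5 rows
last_three = s.tail(3)  # with an argument: the last 3 rows

print(first_five)
print(last_three)
```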
RETRIEVING VALUES USING CONDITIONS
>>> import pandas as pd
>>> S=pd.Series([1.0,1.4,1.7,2.0])
>>> S
Displaying the data using Boolean indexing
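Boolean indexing on the series S above can be sketched as follows. The threshold 1.5 is an illustrative choice.

```python
import pandas as pd

S = pd.Series([1.0, 1.4, 1.7, 2.0])

# The condition produces a Boolean series, one True/False per element.
mask = S > 1.5

# Using the Boolean series as an index keeps only the True positions.
selected = S[mask]

print(mask)
print(selected)
```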
Deleting elements from a Series
The drop() function returns a new series with the specified index labels removed.
The del statement can be used to remove a series fully.
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series(data=np.arange(3), index=['A', 'B', 'C'])
>>> s
A 0
B 1
C 2
dtype: int32
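Continuing with the series s above, a sketch of drop() and del. The result names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(data=np.arange(3), index=["A", "B", "C"])

# drop() returns a new series; the original s is left unchanged.
without_b = s.drop("B")
without_bc = s.drop(["B", "C"])
original_len = len(s)

# del removes the name s entirely; using s after this raises NameError.
del s

print(without_b)
print(without_bc)
```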
Dataframes
• It is a two-dimensional data structure, with rows & columns.
• It consists of three principal components-data, rows, & columns.
• DataFrame can be created with the following- Lists , dict , Series, Numpy arrays, Another
DataFrame
Syntax: pd.DataFrame( data, index, columns, dtype, copy)
Series vs Dataframe
• A Series is essentially a column, and a DataFrame is a two-dimensional table made
up of a collection of Series.
Basic Features of DataFrame
• Columns may be of different types
• Size can be changed (Mutable)
• Labelled axes (rows / columns)
• Can perform arithmetic operations on rows and columns
Create DataFrame
It can be created using- Lists , dict , Series , Numpy arrays , Another DataFrame
Creating an empty Dataframe
Creating a dataframe from single list
Creating a dataframe from list of lists
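The three cases named above (an empty DataFrame, a single list, and a list of lists) might be sketched as follows. All data values and column names are hypothetical.

```python
import pandas as pd

# Empty DataFrame: no rows and no columns.
empty_df = pd.DataFrame()

# Single list: each element becomes one row of a single column.
single = pd.DataFrame([10, 20, 30, 40], columns=["Marks"])

# List of lists: each inner list becomes one row.
nested = pd.DataFrame([["Asha", 17], ["Ravi", 18]], columns=["Name", "Age"])

print(empty_df)
print(single)
print(nested)
```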
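Creating a Dataframe from lists of lists (multidimensional list)
• Using a multi-dimensional list with column names specified and the Age column converted to float.
import pandas as pd
lst = [['tom', 'reacher', 25], ['krish', 'pete', 30],
       ['nick', 'wilson', 26], ['juli', 'williams', 22]]
# dtype=float cannot be applied to the whole frame because the name columns hold strings,
# so only the numeric Age column is converted.
df = pd.DataFrame(lst, columns=['FName', 'LName', 'Age'])
df['Age'] = df['Age'].astype(float)
print(df)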
Displaying index and columns
>>> df = pd.DataFrame([[0, 1, 2], [3, 4, 5]], index=['row1', 'row2'], columns=['col1', 'col2', 'col3'])
>>> df
      col1  col2  col3
row1     0     1     2
row2     3     4     5
>>> print(df.index)
Index(['row1', 'row2'], dtype='object')
>>> print(df.columns)
Index(['col1', 'col2', 'col3'], dtype='object')
Creating DataFrames from Numpy Array
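A minimal sketch of building a DataFrame from a two-dimensional NumPy array. The array values and the row and column labels are hypothetical.

```python
import numpy as np
import pandas as pd

# A 2 x 3 array: each row of the array becomes one row of the DataFrame.
arr = np.array([[1, 2, 3], [4, 5, 6]])

df = pd.DataFrame(arr, index=["row1", "row2"], columns=["A", "B", "C"])
print(df)
```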
Create a DataFrame from List of Dictionaries
import pandas as pd
data1 = [{'x': 1, 'y': 2},{'x': 5, 'y': 4, 'z':5}]
df1 =pd.DataFrame(data1)
✓ Here, the dictionary keys are taken as column labels, and the values corresponding to each
key are taken as rows.
✓ There will be as many rows as the number of dictionaries present in the list.
✓ A key that is missing from a dictionary (such as 'z' in the first one) gives NaN in that row.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b', 'c'])
>>> df1
import pandas as pd
ab=[{'Name': 'Shaun', 'Age': 35, 'Marks': 91},
    {'Name': 'Ritika', 'Age': 31, 'Marks': 87},
    {'Name': 'Smriti', 'Age': 33, 'Marks': 78},
    {'Name': 'Jacob', 'Age': 23, 'Marks': 93}]
ab1=pd.DataFrame(ab,index=['a','b','c','d'])
ab1
Creating DataFrame from Dictionary of Lists
Dictionary keys become column labels by default in a DataFrame, and each list supplies the
values of that column, one value per row.
>>> data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
>>> df = pd.DataFrame(data)
>>> df
>>> dForest = {'State': ['Assam', 'Delhi', 'Kerala'],
...            'GArea': [78438, 1483, 38852],
...            'TArea': [2797, 6.72, 1663]}
>>> dfForest= pd.DataFrame(dForest)
>>> dfForest
Creating DataFrames from Series
>>> p=pd.Series([10,20,30],index=['a','b','c'])
>>> q=pd.Series([40,50,60],index=['a','b','c'])
>>> r=pd.DataFrame([p,q])
>>> r
import pandas as pd
a=["Jitender","Purnima","Arpit","Jyoti"]
b=[210,211,114,178]
s = pd.Series(a)
s1= pd.Series(b)
df=pd.DataFrame({"Author":s,"Article":s1})
df
>>> p={'one':pd.Series([1,2,3], index=['a','b','c']), 'two':pd.Series([11,22,33,44],
index=['a','b','c','d'])}
>>> q=pd.DataFrame(p)
>>> q
Creation of DataFrame from Dictionary of Series
# To create dataframe from 2 series of student data
import pandas as pd
stud_marks=pd.Series([89,94,93,83,89],
                     index=['Anuj','Deepak','Sohail','Tresa','Hima'])
stud_age=pd.Series([18,17,19,16,18],
                   index=['Anuj','Deepak','Sohail','Tresa','Hima'])
>>> stud=pd.DataFrame({'Marks':stud_marks,'Age':stud_age})
>>> stud
>>> ResultSheet = {
...     'Arnab': pd.Series([90, 91, 97], index=['Maths','Science','Hindi']),
...     'Ramit': pd.Series([92, 81, 96], index=['Maths','Science','Hindi']),
...     'Samridhi': pd.Series([89, 91, 88], index=['Maths','Science','Hindi']),
...     'Riya': pd.Series([81, 71, 67], index=['Maths','Science','Hindi']),
...     'Mallika': pd.Series([94, 95, 99], index=['Maths','Science','Hindi'])}
>>> ResultDF = pd.DataFrame(ResultSheet)
>>> ResultDF