Unit -4 Advance Python Programming B.Sc.
CA&IT Sem 6
Working with Pandas and SciPy
Working with Pandas – Introduction to Pandas, Pandas Series, Pandas
Dataframes, Reading CSV and JSON files, Analyzing data, Cleaning data,
Correlations, Plotting , Working with SciPy
------------------------------------------------------------------------------------------------
Introduction to Pandas :
What is Pandas? State its functions.
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
Pandas Series :
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Example:
import pandas as pd
data = [54, 65, 76,98,12]
myvar = pd.Series(data)
print(myvar)
output :
0 54
1 65
2 76
3 98
4 12
dtype: int64
• Label : Label can be used to access a specified value. Uses numeric
index by default. (0,1,….n)
For example : To print value at index 1
print(myvar[1])
Page |1
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
• using index argument to create user defined labels.
import pandas as pd
data = [54, 65, 76,98,12]
myvar = pd.Series(data,index=["a","b","c","d","e"])
print(myvar["b"])
-------------------------------------------------------------------
Creating Key/Value Objects as Series
----------------------------------------------------------------
You can also use a key/value object, like a dictionary, to create a Series.
Example :
import pandas as pd
data = {'Name':['Jay', 'Nick', 'Krishna', 'Jhon'],
'Age':[20, 21, 19, 18]}
myvar = pd.Series(data)
print(myvar)
output:
Name [Jay, Nick, Krishna, Jhon]
Age [20, 21, 19, 18]
dtype: object
Selecting particular items from Series
Example : use the index argument and specify only the items you want to
include in the Series. (data should be a dictionary)
Example :
import pandas as pd
data= {"London": 4.20, "Paris": 3.80, "Rome": 3.90}
myvar = pd.Series(data, index = ["London", "Rome"])
print(myvar)
Page |2
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
Example :
import pandas as pd
data= {1: "Jack", 2 :"Jhon" , 3 : "Ajay" }
myvar = pd.Series(data, index = [2,3])
print(myvar)
-----------------------------------------------------------------------------------------------
Reading CSV Files
What is CSV file ?
CSV files (comma separated files) is a simple way to store big data sets is to
use.
CSV files contains plain text and is a well know format that can be read by
everyone including Pandas.
Load data into from a csv file : use to_string to print DataFrame
import pandas as pd
df = pd. read_csv('data.csv')
print(df.to_string())
Load data into from a csv file : without using to_string to print DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Reading JSON File :
What is a JSON file ?
JSON file is a JavaScript Object Notation file . It means that a script
(executable) file which is made of text in a programming language, is used to
store and transfer the data.
To use this feature, we import the json package in Python script. The text in
JSON is done through quoted-string which contains the value in key-value
mapping within { }.
Deserialization of JSON
The Deserialization of JSON means the conversion of JSON objects into their
respective Python objects. The load()/loads() method is used for it.
Page |3
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
Following is the table which shows JSON object and its equivalent Python
Object :
JSON OBJECT PYTHON OBJECT
object dict
array list
string str
null None
number (int) int
number (real) float
true True
false False
json.load(): json.load() accepts file object, parses the JSON data, populates a
Python dictionary with the data and returns it back to you.
Syntax:
json.load(file object)
Example : Read and print the content of data.json file
import json
f = open('data.json')
data = json.load(f)
for i in data['employee’]:
print(i)
f.close()
Example : Writing the content to data1.json file
import json
d= {
"id": "04",
"name": "Rajesh",
"department": "Production"
}
Page |4
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
json_object = json.dumps(d, indent=4)
with open("data1.json", "w") as f:
f.write(json_object)
Appending data to json object
import json
def write_json(new_data, filename='data.json'):
with open(filename,'r+') as f:
file_data = json.load(f)
file_data["employee"].append(new_data)
f.seek(0)
json.dump(file_data, f, indent = 4)
d= {
"id": "05",
"name": "Deep",
"department": "Marketing"}
write_json(d)
Analyzing data – head(), tail(),info()
head() – prints rows from top of the DataFrame.
By default head() prints 5 rows from top.
to print top 5 rows from DataFrame: (By default)
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
to print top 10 rows from DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
tail() – prints rows from bottom of the DataFrame.
Page |5
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
By default tail() prints 5 rows from bottom.
to print last 5 rows from DataFrame: (By default)
import pandas as pd
df = pd.read_csv('data.csv')
print(df.tail())
to print last 3 rows from DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.tail(3))
to print information about the DataFrame:
print(df.info())
output:
0 Duration 169 non-null int64
1 Pulse 169 non-null int64
2 Maxpulse 169 non-null int64
3 Calories 164 non-null float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
------------------------------------------------------------------------------------------------
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates
1. Cleaning Empty Cells
Empty cells can potentially give you a wrong result when you analyze
data.
For Example in data.csv file : The data set contains some empty cells
("Date" in row 22, and "Calories" in row 18 and 28).
Page |6
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
Different ways to deal with empty rows:
i) Remove Rows containing empty cells.
ii) Replace Rows containing empty cells. fillna()
iii) Replace specific column in Rows containing empty
cells. fillna()
iv) Replace values with mean,median or mode, containing
empty cells.
i) Remove Rows containing empty cells - dropna()
One way to deal with empty cells is to remove rows that contain empty
cells.
If data sets are very big, removing a few rows will not have a big impact on
the result.
if you want to make modifications in original dataframe
df.dropna(inplace = True)
new_df = df.dropna()
ii) Replace Empty Values
Empty cells can be replace with new value in the whole Data Frame.
So you do not have to delete entire rows just because of some empty cells.
The fillna() method allows us to replace empty cells with a value:
df.fillna(130)
Replace Only For Specified Columns
To only replace empty values for one column, specify the column name for
the DataFrame:
df["Calories"].fillna(130)
Replace Using Mean, Median, or Mode
Pandas uses the mean() median() and mode() methods to calculate the
respective values for a specified column:
x = df["Calories"].mean()
x = df["Calories"].median()
x = df["Calories"].mode()[0]
Page |7
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
2) Cleaning Data of Wrong Format
Cells with data of wrong format can make it difficult, or even impossible, to
analyze data.
To solve the problem :
1. remove the rows
2. Convert the cells in same format.
1. Remove the rows :
df.dropna(subset=['Date'], inplace = True)
2. convert all cells in the columns into the same format :
df['Date'] = pd.to_datetime(df['Date'])
3 ) Fixing Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong format", it can just
be wrong, like if someone registered "100" instead of "1.00".
Sometimes you can spot wrong data by looking at the data set, because you
have an expectation of what it should be.
If you take a look at our data set, you can see that in row 7, the duration is 450,
but for all the other rows the duration is between 30 and 60.
Ways to fix wrong data :
1. Replacing Values
2. Remove the rows containing wrong data
1. Replace values.
In our example, it is most likely a typo, and the value should be "45" instead of
"450", and we could just insert "45" in row 7:
df.loc[7,'Duration'] = 45
For small data sets you might be able to replace the wrong data one by one, but
not for big data sets.
Page |8
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
To replace wrong data for larger data sets you can create some rules, e.g. set
some boundaries for legal values, and replace any values that are outside of the
boundaries.
import pandas as pd
df = pd.read_csv('data.csv')
for x in df.index:
if df.loc[x, "Duration"] > 120:
df.loc[x, "Duration"] = 120
print(df.to_string())
2. Removing Rows
Another way of handling wrong data is to remove the rows that contains wrong
data.
for x in df.index:
if df.loc[x, "Duration"] > 120:
df.drop(x, inplace = True)
4 ) Removing Duplicates
• The data set contains duplicates (row 11 and 12).
print(df.duplicated())
Removing duplicates :
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())
Correlations :
The corr() method calculates the relationship between each column in your
data set.
df.corr()
The number varies from -1 to 1.
Page |9
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
1 means that there is a 1 to 1 relationship (a perfect correlation), and for this
data set, each time a value went up in the first column, the other one went up as
well.
0.9 is also a good relationship, and if you increase one value, the other will
probably increase as well.
-0.9 would be just as good relationship as 0.9, but if you increase one value, the
other will probably go down.
0.2 means NOT a good relationship, meaning that if one value goes up does not
mean that the other will.
Plotting :
----------------------------------------------------------------------------------------------
plot()
Pandas uses the plot() method to create diagrams.
We can use Pyplot, a submodule of the Matplotlib library to visualize the
diagram on the screen.
Example: plotting graph using data from data.csv file
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()-
plt.show()
----------------------------------------------------------------------------------------------
Scatter Plot :
To display a scatter plot use argument kind = 'scatter' .
A scatter plot needs an x- and a y-axis.
Example: generating scatter chart from data.csv file
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Pulse')
P a g e | 10
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
plt.show()
----------------------------------------------------------------------------------------------
Examples for practical reference only
Example 1 : Return a new Data Frame with no empty cells:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
Example 2 : Fill empty cells with value 130.
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
Example 3 : Remove the rows :
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.dropna(subset=['Date'], inplace = True)
print(df.to_string())
Example 4 : convert all cells in the columns into the same format :
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
Example 5: Fixing wrong data with correct value
import pandas as pd
df = pd.read_csv('data.csv')
df.loc[7,'Duration'] = 45
print(df.to_string())
Example 6: Fixing wrong data with correct value considering the boundary
ranges
import pandas as pd
df = pd.read_csv('data.csv')
for x in df.index:
P a g e | 11
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
if df.loc[x, "Duration"] > 120:
df.loc[x, "Duration"] = 120
print(df.to_string())
Example 7 : Find duplicates
import pandas as pd
df = pd.read_csv('data.csv')
print(df.duplicated())
Example 8 : Remove duplicates
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())
pie() - draw pie chart
import matplotlib.pyplot as plt
cars = ['AUDI', 'BMW', 'FORD',
'TESLA', 'JAGUAR', 'MERCEDES']
data = [23, 17, 35, 29, 12, 41]
fig = plt.figure(figsize=(10, 7))
plt.pie(data, labels=cars)
plt.show()
------------------------------------------
bar chart
-------------------------------------------
import pandas as pd
import matplotlib.pyplot as plot
data = {"City": ["London", "Paris", "Rome"],
"Tourist": [22.42, 18.95, 10.7]};
df = pd.DataFrame(data=data)
df.plot.bar(x="City", y="Tourist", rot=50, title="Number of tourist visits - Year
2022");
plot.show()
----------------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt
plotdata = pd.DataFrame({
"Maruti":[40, 25, 10, 26, 36],
"Kia":[6, 8, 10, 12, 15],
"Honda":[30, 27, 42, 17, 37]
P a g e | 12
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
},
index=["2018", "2019", "2020", "2021", "2022"]
)
plotdata.plot(kind="bar")
plt.title("Sales of Car")
plt.xlabel("Year")
plt.ylabel("Sales in Lakhs")
plt.show()
Short Questions:
1. What is Pandas? State its functions.
2. What is Pandas Series ? Give one example.
3. Explain Deserialization of JSON.
4. Explain json.load() function
5. Explain head() function with example.
6. Explain tail() function with example.
7. Explain how to remove duplicates from csv file.
8. Explain Data Cleaning in brief.
Long Questions :
1. Explain correlations in detail
2. Explain Creating Key/Value Objects as Series and selecting
particular items from series.
3. Explain plot() method and scatter plot()
4. Explain how to fix wrong data in .csv file.
5. Explain how to clean empty cells in .csv file.
6. Explain how to clean data of wrong format in .csv file.
P a g e | 13