0% found this document useful (0 votes)
5 views13 pages

Advance Python Unit 4

This document provides an overview of advanced Python programming concepts, specifically focusing on the Pandas and SciPy libraries. It covers topics such as creating and manipulating Pandas Series and DataFrames, reading CSV and JSON files, data analysis techniques, data cleaning methods, and plotting data visualizations. Practical examples are included to illustrate the usage of various functions and methods within these libraries.

Uploaded by

devr07j
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views13 pages

Advance Python Unit 4

This document provides an overview of advanced Python programming concepts, specifically focusing on the Pandas and SciPy libraries. It covers topics such as creating and manipulating Pandas Series and DataFrames, reading CSV and JSON files, data analysis techniques, data cleaning methods, and plotting data visualizations. Practical examples are included to illustrate the usage of various functions and methods within these libraries.

Uploaded by

devr07j
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Unit -4 Advance Python Programming B.Sc.

CA&IT Sem 6
Working with Pandas and SciPy
Working with Pandas – Introduction to Pandas, Pandas Series, Pandas
Dataframes, Reading CSV and JSON files, Analyzing data, Cleaning data,
Correlations, Plotting , Working with SciPy
------------------------------------------------------------------------------------------------
Introduction to Pandas :
What is Pandas? State its functions.

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

Pandas Series :

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

Example:
import pandas as pd
data = [54, 65, 76,98,12]
myvar = pd.Series(data)
print(myvar)
output :

0 54
1 65
2 76
3 98
4 12
dtype: int64
• Label : Label can be used to access a specified value. Uses numeric
index by default. (0,1,….n)
For example : To print value at index 1
print(myvar[1])

Page |1
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6

• using index argument to create user defined labels.


import pandas as pd
data = [54, 65, 76,98,12]
myvar = pd.Series(data,index=["a","b","c","d","e"])
print(myvar["b"])
-------------------------------------------------------------------

Creating Key/Value Objects as Series


----------------------------------------------------------------

You can also use a key/value object, like a dictionary, to create a Series.

Example :

import pandas as pd
data = {'Name':['Jay', 'Nick', 'Krishna', 'Jhon'],
'Age':[20, 21, 19, 18]}
myvar = pd.Series(data)
print(myvar)
output:
Name [Jay, Nick, Krishna, Jhon]
Age [20, 21, 19, 18]
dtype: object

Selecting particular items from Series

Example : use the index argument and specify only the items you want to
include in the Series. (data should be a dictionary)

Example :

import pandas as pd
data= {"London": 4.20, "Paris": 3.80, "Rome": 3.90}
myvar = pd.Series(data, index = ["London", "Rome"])
print(myvar)

Page |2
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
Example :
import pandas as pd
data= {1: "Jack", 2 :"Jhon" , 3 : "Ajay" }
myvar = pd.Series(data, index = [2,3])
print(myvar)

-----------------------------------------------------------------------------------------------
Reading CSV Files

What is CSV file ?

CSV files (comma separated files) is a simple way to store big data sets is to
use.

CSV files contains plain text and is a well know format that can be read by
everyone including Pandas.

Load data into from a csv file : use to_string to print DataFrame

import pandas as pd
df = pd. read_csv('data.csv')
print(df.to_string())

Load data into from a csv file : without using to_string to print DataFrame

import pandas as pd
df = pd.read_csv('data.csv')
print(df)

Reading JSON File :


What is a JSON file ?
JSON file is a JavaScript Object Notation file . It means that a script
(executable) file which is made of text in a programming language, is used to
store and transfer the data.

To use this feature, we import the json package in Python script. The text in
JSON is done through quoted-string which contains the value in key-value
mapping within { }.
Deserialization of JSON
The Deserialization of JSON means the conversion of JSON objects into their
respective Python objects. The load()/loads() method is used for it.

Page |3
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
Following is the table which shows JSON object and its equivalent Python
Object :

JSON OBJECT PYTHON OBJECT


object dict
array list
string str
null None
number (int) int
number (real) float
true True
false False

json.load(): json.load() accepts file object, parses the JSON data, populates a
Python dictionary with the data and returns it back to you.
Syntax:
json.load(file object)
Example : Read and print the content of data.json file
import json
f = open('data.json')
data = json.load(f)
for i in data['employee’]:
print(i)
f.close()
Example : Writing the content to data1.json file
import json
d= {
"id": "04",
"name": "Rajesh",
"department": "Production"
}

Page |4
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
json_object = json.dumps(d, indent=4)
with open("data1.json", "w") as f:
f.write(json_object)
Appending data to json object
import json
def write_json(new_data, filename='data.json'):
with open(filename,'r+') as f:
file_data = json.load(f)
file_data["employee"].append(new_data)
f.seek(0)
json.dump(file_data, f, indent = 4)

d= {
"id": "05",
"name": "Deep",
"department": "Marketing"}

write_json(d)
Analyzing data – head(), tail(),info()

head() – prints rows from top of the DataFrame.


By default head() prints 5 rows from top.

to print top 5 rows from DataFrame: (By default)


import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())

to print top 10 rows from DataFrame:


import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))

tail() – prints rows from bottom of the DataFrame.

Page |5
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
By default tail() prints 5 rows from bottom.

to print last 5 rows from DataFrame: (By default)


import pandas as pd
df = pd.read_csv('data.csv')
print(df.tail())

to print last 3 rows from DataFrame:


import pandas as pd
df = pd.read_csv('data.csv')
print(df.tail(3))

to print information about the DataFrame:


print(df.info())

output:
0 Duration 169 non-null int64
1 Pulse 169 non-null int64
2 Maxpulse 169 non-null int64
3 Calories 164 non-null float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
------------------------------------------------------------------------------------------------
Data Cleaning

Data cleaning means fixing bad data in your data set.

Bad data could be:

1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates

1. Cleaning Empty Cells


Empty cells can potentially give you a wrong result when you analyze
data.

For Example in data.csv file : The data set contains some empty cells
("Date" in row 22, and "Calories" in row 18 and 28).

Page |6
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
Different ways to deal with empty rows:
i) Remove Rows containing empty cells.
ii) Replace Rows containing empty cells. fillna()
iii) Replace specific column in Rows containing empty
cells. fillna()
iv) Replace values with mean,median or mode, containing
empty cells.

i) Remove Rows containing empty cells - dropna()

One way to deal with empty cells is to remove rows that contain empty
cells.

If data sets are very big, removing a few rows will not have a big impact on
the result.
if you want to make modifications in original dataframe
df.dropna(inplace = True)

new_df = df.dropna()

ii) Replace Empty Values

Empty cells can be replace with new value in the whole Data Frame.

So you do not have to delete entire rows just because of some empty cells.

The fillna() method allows us to replace empty cells with a value:

df.fillna(130)

Replace Only For Specified Columns

To only replace empty values for one column, specify the column name for
the DataFrame:

df["Calories"].fillna(130)
Replace Using Mean, Median, or Mode
Pandas uses the mean() median() and mode() methods to calculate the
respective values for a specified column:
x = df["Calories"].mean()

x = df["Calories"].median()

x = df["Calories"].mode()[0]
Page |7
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6

2) Cleaning Data of Wrong Format

Cells with data of wrong format can make it difficult, or even impossible, to
analyze data.

To solve the problem :

1. remove the rows

2. Convert the cells in same format.

1. Remove the rows :

df.dropna(subset=['Date'], inplace = True)

2. convert all cells in the columns into the same format :

df['Date'] = pd.to_datetime(df['Date'])

3 ) Fixing Wrong Data

"Wrong data" does not have to be "empty cells" or "wrong format", it can just
be wrong, like if someone registered "100" instead of "1.00".

Sometimes you can spot wrong data by looking at the data set, because you
have an expectation of what it should be.

If you take a look at our data set, you can see that in row 7, the duration is 450,
but for all the other rows the duration is between 30 and 60.

Ways to fix wrong data :


1. Replacing Values
2. Remove the rows containing wrong data

1. Replace values.

In our example, it is most likely a typo, and the value should be "45" instead of
"450", and we could just insert "45" in row 7:

df.loc[7,'Duration'] = 45

For small data sets you might be able to replace the wrong data one by one, but
not for big data sets.

Page |8
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
To replace wrong data for larger data sets you can create some rules, e.g. set
some boundaries for legal values, and replace any values that are outside of the
boundaries.

import pandas as pd
df = pd.read_csv('data.csv')

for x in df.index:
if df.loc[x, "Duration"] > 120:
df.loc[x, "Duration"] = 120
print(df.to_string())

2. Removing Rows

Another way of handling wrong data is to remove the rows that contains wrong
data.

for x in df.index:
if df.loc[x, "Duration"] > 120:
df.drop(x, inplace = True)
4 ) Removing Duplicates

• The data set contains duplicates (row 11 and 12).

print(df.duplicated())

Removing duplicates :

import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())

Correlations :
The corr() method calculates the relationship between each column in your
data set.

df.corr()

The number varies from -1 to 1.

Page |9
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
1 means that there is a 1 to 1 relationship (a perfect correlation), and for this
data set, each time a value went up in the first column, the other one went up as
well.

0.9 is also a good relationship, and if you increase one value, the other will
probably increase as well.

-0.9 would be just as good relationship as 0.9, but if you increase one value, the
other will probably go down.

0.2 means NOT a good relationship, meaning that if one value goes up does not
mean that the other will.

Plotting :
----------------------------------------------------------------------------------------------
plot()

Pandas uses the plot() method to create diagrams.

We can use Pyplot, a submodule of the Matplotlib library to visualize the


diagram on the screen.

Example: plotting graph using data from data.csv file

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()-
plt.show()

----------------------------------------------------------------------------------------------

Scatter Plot :

To display a scatter plot use argument kind = 'scatter' .

A scatter plot needs an x- and a y-axis.

Example: generating scatter chart from data.csv file

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Pulse')
P a g e | 10
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
plt.show()
----------------------------------------------------------------------------------------------
Examples for practical reference only

Example 1 : Return a new Data Frame with no empty cells:


import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())

Example 2 : Fill empty cells with value 130.

import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)

Example 3 : Remove the rows :

import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.dropna(subset=['Date'], inplace = True)
print(df.to_string())

Example 4 : convert all cells in the columns into the same format :

import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())

Example 5: Fixing wrong data with correct value


import pandas as pd
df = pd.read_csv('data.csv')
df.loc[7,'Duration'] = 45
print(df.to_string())

Example 6: Fixing wrong data with correct value considering the boundary
ranges
import pandas as pd
df = pd.read_csv('data.csv')

for x in df.index:

P a g e | 11
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
if df.loc[x, "Duration"] > 120:
df.loc[x, "Duration"] = 120
print(df.to_string())

Example 7 : Find duplicates


import pandas as pd
df = pd.read_csv('data.csv')
print(df.duplicated())
Example 8 : Remove duplicates
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())

pie() - draw pie chart

import matplotlib.pyplot as plt


cars = ['AUDI', 'BMW', 'FORD',
'TESLA', 'JAGUAR', 'MERCEDES']
data = [23, 17, 35, 29, 12, 41]
fig = plt.figure(figsize=(10, 7))
plt.pie(data, labels=cars)
plt.show()
------------------------------------------
bar chart
-------------------------------------------
import pandas as pd
import matplotlib.pyplot as plot
data = {"City": ["London", "Paris", "Rome"],
"Tourist": [22.42, 18.95, 10.7]};
df = pd.DataFrame(data=data)
df.plot.bar(x="City", y="Tourist", rot=50, title="Number of tourist visits - Year
2022");
plot.show()
----------------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt
plotdata = pd.DataFrame({
"Maruti":[40, 25, 10, 26, 36],
"Kia":[6, 8, 10, 12, 15],
"Honda":[30, 27, 42, 17, 37]
P a g e | 12
Unit -4 Advance Python Programming B.Sc. CA&IT Sem 6
},
index=["2018", "2019", "2020", "2021", "2022"]
)
plotdata.plot(kind="bar")
plt.title("Sales of Car")
plt.xlabel("Year")
plt.ylabel("Sales in Lakhs")
plt.show()

Short Questions:

1. What is Pandas? State its functions.


2. What is Pandas Series ? Give one example.
3. Explain Deserialization of JSON.
4. Explain json.load() function
5. Explain head() function with example.
6. Explain tail() function with example.
7. Explain how to remove duplicates from csv file.
8. Explain Data Cleaning in brief.

Long Questions :
1. Explain correlations in detail
2. Explain Creating Key/Value Objects as Series and selecting
particular items from series.
3. Explain plot() method and scatter plot()
4. Explain how to fix wrong data in .csv file.
5. Explain how to clean empty cells in .csv file.
6. Explain how to clean data of wrong format in .csv file.

P a g e | 13

You might also like