
Handling data with pandas and numpy (cont.)
Contents
• Loading and saving data
• Exploratory data analysis
• Processing data
• Wrangling data with filtering, grouping
Dataset
• iTunes dataset
• Note: if the data comes in Excel format, the call to read_excel provides an engine (library) to read the data with.
• By default, pandas uses the xlrd library to read Excel files. Unfortunately, there is a bug with xlrd at the time of writing, so we specify the openpyxl library instead.
• The xlrd or openpyxl library needs to be installed with conda or pip in order to read Excel files with pandas.
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
# For Excel input:
#df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.xlsx', engine='openpyxl')
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv')
df.head()
read_csv
pd.read_csv(filepath_or_buffer, sep=',', header='infer',
            index_col=None, usecols=None, engine=None,
            skiprows=None, nrows=None)

• filepath_or_buffer: location of the CSV file; accepts any string path or URL.
• sep: the separator; the default is ','.
• header: accepts an int or a list of ints giving the row numbers to use as the column names and the start of the data. If no names are passed (header=None), columns are labeled 0, 1, and so on.
• usecols: reads only the selected columns from the CSV file.
• nrows: number of rows to read from the dataset.
• index_col: column to use as the index; if None, a default integer index is generated.
• skiprows: skips the given rows in the new DataFrame.

Example:
df = pd.read_csv(filename, usecols=['f(s)'])
df.head()
DataFrame
• DataFrames have a certain structure: a number of columns storing data, and an index.
• This index can be used as one method to access the data: df.index
• The index is autogenerated while loading data.
• Use read_csv('filename', index_col='index_col_name') to specify an index.

import pandas as pd
filename = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv'
df = pd.read_csv(filename, index_col='f(s)')
df.head()
Combining DataFrames
• Use pd.concat to combine multiple DataFrames.
• axis=1 tells the function to join along the columns, not rows. We also set the join parameter to 'inner' so that only rows with a matching index are joined.
• pd.merge() is another function that can be used to merge data, like a SQL join. With merge(), we can join on any column rather than just the index.
import pandas as pd
filename1 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv'
filename2 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/RS.csv'
#df = pd.read_csv(filename1, usecols=['f(s)'])
#df = pd.read_csv(filename1, index_col='f(s)')
df1 = pd.read_csv(filename1)
df2 = pd.read_csv(filename2)
#df = pd.merge(df1, df2, on='Algorithm')
#df = pd.concat([df1, df2])
df = pd.concat([df1, df2], ignore_index=True)
#df.head()
df
Exploratory Data Analysis (EDA)
• df.head() and df.tail()
• We can transpose the printout with itunes_df.tail().T, which swaps columns and rows.
• To index by row number, we use iloc. This is helpful if we want to look at the first or last row. For example, itunes_df.iloc[0] gives the first row (the 0th index) and itunes_df.iloc[-1] the last row (the -1 index).
• Indexing can also be done by index value: itunes_df.loc[3052]
EDA
• With iloc, we can also choose a single column. For example, itunes_df.iloc[0, 0] prints the value of the first row and first column (an index of [0, 0]), and itunes_df.iloc[-1, -1] the last row and last column (an index of [-1, -1]).
• We can change our index to be unique, as sketched below.
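A minimal runnable sketch of these indexing operations; the small DataFrame here is a hypothetical stand-in for the iTunes data:

import pandas as pd

itunes_df = pd.DataFrame({'Track': ['a', 'b', 'c'],
                          'Milliseconds': [200000, 250000, 4000000]})

print(itunes_df.iloc[0])       # first row
print(itunes_df.iloc[-1])      # last row
print(itunes_df.iloc[0, 0])    # first row, first column
print(itunes_df.iloc[-1, -1])  # last row, last column

# One way to make the index unique: use a unique column as the index
itunes_df = itunes_df.set_index('Track')
print(itunes_df.loc['b'])      # index by label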


EDA
• We could use the skiprows parameter in pd.read_csv() to skip the first 5 rows if these have extra header information:
• df = pd.read_csv('csvfile.csv', skiprows=5)
• If we wanted to get rid of the last 5 rows, we could use iloc indexing:
• df = df.iloc[:-5]
• We can select a column of data:
• itunes_df['Milliseconds']
• If we want to select multiple columns, we can use a list of strings:
• itunes_df[['Milliseconds', 'Bytes']]
Dimensions, datatypes and missing values
• Shape of a DataFrame: itunes_df.shape
• Data types and missing values: itunes_df.info()
• The info method also shows us the non-null counts. As long as these match the number of rows, there are no missing values. However, it's easier to look at the number of missing values directly: itunes_df.isna().sum()
• To examine the statistics of a dataset, use the describe command: itunes_df.describe() (a combined sketch follows)
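A minimal sketch combining these checks; the DataFrame is a hypothetical stand-in with one missing value per column:

import pandas as pd

itunes_df = pd.DataFrame({'Genre': ['Rock', None, 'Jazz'],
                          'Milliseconds': [200000.0, 250000.0, None]})

print(itunes_df.shape)         # (rows, columns)
itunes_df.info()               # dtypes and non-null counts
print(itunes_df.isna().sum())  # missing values per column
print(itunes_df.describe())    # summary statistics for numeric columns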
Plotting with DataFrames
• Need to use matplotlib: import matplotlib.pyplot as plt

import pandas as pd
import matplotlib.pyplot as plt
filename1 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv'
filename2 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/RS.csv'
df1 = pd.read_csv(filename1, usecols=['Budget', 'f(s)'])
df2 = pd.read_csv(filename2, usecols=['Budget', 'f(s)'])
df = pd.concat([df1, df2])
df.plot.scatter(x='Budget', y='f(s)')
Plotting with coloring of data
import pandas as pd
import matplotlib.pyplot as plt
filename1 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv'
filename2 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/RS.csv'
df1 = pd.read_csv(filename1, usecols=['Budget', 'f(s)'])
df2 = pd.read_csv(filename2, usecols=['Budget', 'f(s)'])
DS = df1.get('f(s)')
RS = df2.get('f(s)')
x = df1.get('Budget')
plt.plot(x, DS, color='g', linestyle='dashed', marker='o', label='DS')
plt.plot(x, RS, color='b', linestyle='dashed', marker='*', label='RS')
plt.xticks(rotation=25)
plt.xlabel('Budget')
plt.ylabel('Influence value')
plt.title('Influence comparison', fontsize=20)
plt.grid()
plt.legend()
plt.show()
Practice 1
• With the given data:
• Construct a bar graph to compare f(s) according to Budget
• Construct a line graph to compare the number of queries according to Budget
• Construct histograms to compare the time according to Budget
• A sketch under assumed column names follows.
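A minimal sketch of the three plots, assuming the DS.csv file from the earlier examples; 'Queries' and 'Time' are placeholder column names, since the actual names are not given here:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv')

# Bar graph: f(s) by Budget
df.plot.bar(x='Budget', y='f(s)')
plt.show()

# Line graph: number of queries by Budget ('Queries' is a placeholder name)
df.plot.line(x='Budget', y='Queries')
plt.show()

# Histogram of time ('Time' is a placeholder name)
df['Time'].plot.hist()
plt.show()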
Practice 2
• Suppose this is the Vietnamese population in 2024. Save the following data into a csv file with the format [Age, Proportion]:

Age     Quantity
<18     25,000,000
18-60   60,000,000
>60     15,000,000

• Construct a pie chart to indicate the proportion of the population in Vietnam in 2024 (a sketch follows)
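A minimal sketch, assuming the hypothetical file name vn_population_2024.csv and the data from the table above:

import pandas as pd
import matplotlib.pyplot as plt

pop = pd.DataFrame({'Age': ['<18', '18-60', '>60'],
                    'Quantity': [25_000_000, 60_000_000, 15_000_000]})

# Convert counts to proportions and save in the [Age, Proportion] format
pop['Proportion'] = pop['Quantity'] / pop['Quantity'].sum()
pop[['Age', 'Proportion']].to_csv('vn_population_2024.csv', index=False)

plt.pie(pop['Proportion'], labels=pop['Age'], autopct='%1.1f%%')
plt.title('Vietnamese population by age, 2024')
plt.show()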
More about Plotting
• Stylesheet: https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html
• Matplotlib introduction: https://www.w3schools.com/python/matplotlib_intro.asp
Clean data
• Pandas supports many ways to filter data.
• For example, let's get the longest songs from our data that we saw in our scatter plot. These have lengths over 4,000,000 ms:
• itunes_df[itunes_df['Milliseconds'] > 4e6]
• Look at the value counts of genres from songs over 2,000,000 ms:
• itunes_df[itunes_df['Milliseconds'] > 2e6]['Genre'].value_counts()
• Filtering with multiple conditions:
• itunes_df[(itunes_df['Milliseconds'] > 2e6) & (itunes_df['Bytes'] < 0.4e9)]['Genre'].value_counts()
Clean data
• Drop columns: itunes_df.drop('Composer', axis=1, inplace=True)
• The axis=1 argument specifies to drop a column, not a row, and inplace=True changes the DataFrame itself instead of returning a new, modified DataFrame.
• If we want to remove any genres that are not music, we can do so with filtering:
• only_music = itunes_df[~itunes_df['Genre'].isin(['Drama', 'TV Shows', 'Sci Fi & Fantasy', 'Science Fiction', 'Comedy'])]
• This uses filtering with the isin method, which checks whether each value is in the list or set provided to the function. In this case, we also negate the condition with the tilde (~) so that the non-music genres are excluded, and our only_music DataFrame has only genres that are music, just as the variable name suggests.
Missing values
• For example, we saw that our Composer column has several missing values. We can use filtering to see what some of these rows look like: itunes_df[itunes_df['Composer'].isna()].sample(5, random_state=42).head()
• Another option is to drop the missing values. We can either drop the entire column, as we did earlier, or drop the rows with missing values: itunes_df.dropna(inplace=True)
• Filling with a specific value could be done as:
• itunes_df.loc[itunes_df['Composer'].isna(), 'Composer'] = 'Unknown'
• itunes_df['Composer'].fillna('Unknown', inplace=True)
Missing values
• Using KNN imputation (a minimal sketch follows)
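The slide gives no code for this; here is a minimal sketch using scikit-learn's KNNImputer on numeric columns, with a hypothetical stand-in DataFrame (scikit-learn must be installed):

import pandas as pd
from sklearn.impute import KNNImputer

itunes_df = pd.DataFrame({'Milliseconds': [200000.0, None, 4000000.0],
                          'Bytes': [2e6, 3e6, None]})

# Each missing value is filled in from the 2 nearest rows,
# measured on the other (non-missing) columns
imputer = KNNImputer(n_neighbors=2)
itunes_df[['Milliseconds', 'Bytes']] = imputer.fit_transform(
    itunes_df[['Milliseconds', 'Bytes']])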
Duplicate values
• One way to check for duplicates is the duplicated() function: itunes_df.duplicated().sum()
• To drop duplicates: itunes_df.drop_duplicates(inplace=True)
Outliers
• Outliers are data that are not in the usual range of values. Dealing with categorical outliers can help a little with analysis, but often has a minimal impact.
• For numeric data, it's easy to quantify an outlier. Typically, we use the interquartile range (IQR) or z-score methods.
• We get quartiles (25th, 50th, 75th percentiles) from the describe() function in our EDA. These are sometimes called the first, second, and third quartiles, respectively (Q1, Q2, and Q3).
• Formulas to calculate the IQR bounds (a code sketch follows):
IQR = 75th_percentile - 25th_percentile
upper_bound = 75th_percentile + 1.5 * IQR
lower_bound = 25th_percentile - 1.5 * IQR
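A minimal sketch of IQR-based outlier filtering; the column and values are a hypothetical stand-in for the iTunes data:

import pandas as pd

itunes_df = pd.DataFrame({'Milliseconds': [200000, 250000, 300000, 4000000]})

q1 = itunes_df['Milliseconds'].quantile(0.25)
q3 = itunes_df['Milliseconds'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the rows inside the IQR bounds
no_outliers = itunes_df[itunes_df['Milliseconds'].between(lower_bound, upper_bound)]
print(no_outliers)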
Data transformations
• A basic operation to transform data:
itunes_df['Seconds'] = itunes_df['Milliseconds'] / 1000
• If we wanted to create another column:
itunes_df['len_byte_ratio'] = itunes_df['Milliseconds'] / itunes_df['Bytes']
Data transformation
• One way to replace several values at once is with the map and replace functions:
genre_dict = {'metal': 'Metal', 'met': 'Metal'}
itunes_df['Genre'].replace(genre_dict)
• Using the apply function:
itunes_df['Genre'].apply(lambda x: x.lower())
or
def lowercase(x):
    return x.lower()
itunes_df['Genre'].apply(lowercase)
or
itunes_df['Genre'].str.lower()
Groupby
• Groupby in pandas is just like in SQL – group by unique values in a column (a runnable sketch follows):
itunes_df.groupby('Genre')['Seconds'].mean().sort_values().head()
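A minimal sketch with a hypothetical stand-in DataFrame; selecting the 'Seconds' column before calling mean() avoids errors on non-numeric columns in recent pandas versions:

import pandas as pd

itunes_df = pd.DataFrame({'Genre': ['Rock', 'Rock', 'Jazz'],
                          'Seconds': [200.0, 300.0, 400.0]})

# Mean length per genre, shortest first
print(itunes_df.groupby('Genre')['Seconds'].mean().sort_values().head())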
Writing DataFrame to disk
• Pandas offers many ways to save data: CSV, Excel, HDF5, ...
itunes_df.to_csv('data/saved_itunes_data.csv', index=False)

We first give the filename as an argument to to_csv, and then tell it not to write the index to the file with index=False. This would save the data as saved_itunes_data.csv in the directory named "data" within the same folder/directory where the code is being run.
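For the other formats the bullet mentions, a brief sketch; the paths are placeholders, to_excel needs openpyxl installed, and to_hdf needs the tables package:

# Save the same DataFrame as Excel and HDF5
itunes_df.to_excel('data/saved_itunes_data.xlsx', index=False)
itunes_df.to_hdf('data/saved_itunes_data.h5', key='itunes')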
