
Handling data with pandas and numpy (cont.)
Contents
• Loading and saving data
• Exploratory data analysis
• Processing data
• Wrangling data with filtering, grouping
Dataset
• iTunes dataset
• Note: if the data comes in Excel format, the call to read_excel provides an engine (library) to read the data with.
• By default, pandas uses the xlrd library to read Excel files. Unfortunately, there is a bug with xlrd at the time of writing, so we specify the openpyxl library instead.
• The xlrd or openpyxl library needs to be installed with conda or pip in order to read Excel files with pandas.
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
# For Excel input:
#df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.xlsx', engine='openpyxl')
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv')
df.head()
read_csv
pd.read_csv(filepath_or_buffer, sep=',', header='infer',
            index_col=None, usecols=None, engine=None,
            skiprows=None, nrows=None)

• filepath_or_buffer: location of the CSV file; accepts any string path or URL.
• sep: the separator; the default is ','.
• header: accepts an int or a list of ints giving the row numbers to use as the column names and the start of the data. If no names are passed (header=None), columns are labeled 0, 1, and so on.
• usecols: reads only the selected columns from the CSV file.
• nrows: number of rows to read from the dataset.
• index_col: column to use as the index; if None, a default integer index is generated.
• skiprows: skips the given rows in the new DataFrame.

Example:
df = pd.read_csv(filename, usecols=['f(s)'])
df.head()
DataFrame
• DataFrames have a certain structure: a number of columns storing data, and an index.
• This index can be used as one method to access the data: df.index
• The index is autogenerated while loading data.
• Use read_csv('filename', index_col='index_col_name') to specify an index.

import pandas as pd
filename = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv'
df = pd.read_csv(filename, index_col='f(s)')
df.head()
Combining DataFrames
• Use pd.concat to combine multiple DataFrames.
• axis=1 tells the function to join along the columns, not rows. We also set the join parameter to 'inner' so that only rows with a matching index are joined.
• pd.merge() is another function that can be used to merge data, like a SQL join. With merge(), we can join on any column rather than just the index.
import pandas as pd
filename1 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv'
filename2 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/RS.csv'
#df = pd.read_csv(filename1, usecols=['f(s)'])
#df = pd.read_csv(filename1, index_col='f(s)')
df1 = pd.read_csv(filename1)
df2 = pd.read_csv(filename2)
#df = pd.merge(df1, df2, on='Algorithm')
#df = pd.concat([df1, df2])
df = pd.concat([df1, df2], ignore_index=True)
#df.head()
df
Exploratory Data Analysis (EDA)
• df.head() and df.tail()
• We can transpose the printout with itunes_df.tail().T, which swaps columns and rows.
• To index by row number, we use iloc. This is helpful if we want to look at the first or last row. For example, itunes_df.iloc[0] gives the first row (the 0th index) and itunes_df.iloc[-1] the last row (the -1 index).
• Indexing can also be done by index value: itunes_df.loc[3052]
EDA
• With iloc, we can also choose a single column. For example, itunes_df.iloc[0, 0] prints the value of the first row and first column (an index of [0, 0]), and itunes_df.iloc[-1, -1] the last row and last column (an index of [-1, -1]).
• We can change our index to be unique, as sketched below.
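A minimal runnable sketch of these indexing operations; the small DataFrame here is a hypothetical stand-in for the iTunes data:

import pandas as pd

itunes_df = pd.DataFrame({'Track': ['a', 'b', 'c'],
                          'Milliseconds': [200000, 250000, 4000000]})

print(itunes_df.iloc[0])       # first row
print(itunes_df.iloc[-1])      # last row
print(itunes_df.iloc[0, 0])    # first row, first column
print(itunes_df.iloc[-1, -1])  # last row, last column

# One way to make the index unique: use a unique column as the index
itunes_df = itunes_df.set_index('Track')
print(itunes_df.loc['b'])      # index by label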


EDA
• We could use the skiprows parameter in pd.read_csv() to skip the first 5 rows if these have extra header information:
• df = pd.read_csv('csvfile.csv', skiprows=5)
• If we wanted to get rid of the last 5 rows, we could use iloc indexing:
• df = df.iloc[:-5]
• We can select a column of data:
• itunes_df['Milliseconds']
• If we want to select multiple columns, we can use a list of strings:
• itunes_df[['Milliseconds', 'Bytes']]
Dimensions, datatypes and missing values
• Shape of a DataFrame: itunes_df.shape
• Data types and missing values: itunes_df.info()
• The info method also shows us the non-null counts. As long as these match the number of rows, there are no missing values. However, it's easier to look at the number of missing values directly: itunes_df.isna().sum()
• To examine the statistics of a dataset, use the describe command: itunes_df.describe() (a combined sketch follows)
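A minimal sketch combining these checks; the DataFrame is a hypothetical stand-in with one missing value per column:

import pandas as pd

itunes_df = pd.DataFrame({'Genre': ['Rock', None, 'Jazz'],
                          'Milliseconds': [200000.0, 250000.0, None]})

print(itunes_df.shape)         # (rows, columns)
itunes_df.info()               # dtypes and non-null counts
print(itunes_df.isna().sum())  # missing values per column
print(itunes_df.describe())    # summary statistics for numeric columns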
Plotting with DataFrames
• Need to use matplotlib: import matplotlib.pyplot as plt

import pandas as pd
import matplotlib.pyplot as plt
filename1 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv'
filename2 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/RS.csv'
df1 = pd.read_csv(filename1, usecols=['Budget', 'f(s)'])
df2 = pd.read_csv(filename2, usecols=['Budget', 'f(s)'])
df = pd.concat([df1, df2])
df.plot.scatter(x='Budget', y='f(s)')
Plotting with coloring of data
import pandas as pd
import matplotlib.pyplot as plt
filename1 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv'
filename2 = '/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/RS.csv'
df1 = pd.read_csv(filename1, usecols=['Budget', 'f(s)'])
df2 = pd.read_csv(filename2, usecols=['Budget', 'f(s)'])
DS = df1.get('f(s)')
RS = df2.get('f(s)')
x = df1.get('Budget')
plt.plot(x, DS, color='g', linestyle='dashed', marker='o', label='DS')
plt.plot(x, RS, color='b', linestyle='dashed', marker='*', label='RS')
plt.xticks(rotation=25)
plt.xlabel('Budget')
plt.ylabel('Influence value')
plt.title('Influence comparison', fontsize=20)
plt.grid()
plt.legend()
plt.show()
Practice 1
• With the given data:
• Construct a bar graph to compare f(s) according to Budget
• Construct a line graph to compare the number of queries according to Budget
• Construct histograms to compare the time according to Budget
• A sketch under assumed column names follows.
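A minimal sketch of the three plots, assuming the DS.csv file from the earlier examples; 'Queries' and 'Time' are placeholder column names, since the actual names are not given here:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/datafortest/Ex10/DS.csv')

# Bar graph: f(s) by Budget
df.plot.bar(x='Budget', y='f(s)')
plt.show()

# Line graph: number of queries by Budget ('Queries' is a placeholder name)
df.plot.line(x='Budget', y='Queries')
plt.show()

# Histogram of time ('Time' is a placeholder name)
df['Time'].plot.hist()
plt.show()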
Practice 2
• Suppose this is the Vietnamese population in 2024. Save the following data into a csv file with the format [Age, Proportion]:

Age     Quantity
<18     25,000,000
18-60   60,000,000
>60     15,000,000

• Construct a pie chart to indicate the proportion of the population in Vietnam in 2024 (a sketch follows)
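A minimal sketch, assuming the hypothetical file name vn_population_2024.csv and the data from the table above:

import pandas as pd
import matplotlib.pyplot as plt

pop = pd.DataFrame({'Age': ['<18', '18-60', '>60'],
                    'Quantity': [25_000_000, 60_000_000, 15_000_000]})

# Convert counts to proportions and save in the [Age, Proportion] format
pop['Proportion'] = pop['Quantity'] / pop['Quantity'].sum()
pop[['Age', 'Proportion']].to_csv('vn_population_2024.csv', index=False)

plt.pie(pop['Proportion'], labels=pop['Age'], autopct='%1.1f%%')
plt.title('Vietnamese population by age, 2024')
plt.show()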
More about Plotting
• Stylesheet: https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html
• Matplotlib introduction: https://www.w3schools.com/python/matplotlib_intro.asp
Clean data
• Pandas supports many ways to filter data.
• For example, let's get the longest songs from our data that we saw in our scatter plot. These have lengths over 4,000,000 ms:
• itunes_df[itunes_df['Milliseconds'] > 4e6]
• Look at the value counts of genres from songs over 2,000,000 ms:
• itunes_df[itunes_df['Milliseconds'] > 2e6]['Genre'].value_counts()
• Filtering with multiple conditions:
• itunes_df[(itunes_df['Milliseconds'] > 2e6) & (itunes_df['Bytes'] < 0.4e9)]['Genre'].value_counts()
Clean data
• Drop columns: itunes_df.drop('Composer', axis=1, inplace=True)
• The axis=1 argument specifies to drop a column, not a row, and inplace=True changes the DataFrame itself instead of returning a new, modified DataFrame.
• If we want to remove any genres that are not music, we can do so with filtering:
• only_music = itunes_df[~itunes_df['Genre'].isin(['Drama', 'TV Shows', 'Sci Fi & Fantasy', 'Science Fiction', 'Comedy'])]
• This uses filtering with the isin method, which checks whether each value is in the list or set provided to the function. In this case, we also negate the condition with the tilde (~) so that the non-music genres are excluded, and our only_music DataFrame has only genres that are music, just as the variable name suggests.
Missing values
• For example, we saw that our Composer column has several missing values. We can use filtering to see what some of these rows look like: itunes_df[itunes_df['Composer'].isna()].sample(5, random_state=42).head()
• Another option is to drop the missing values. We can either drop the entire column, as we did earlier, or drop the rows with missing values: itunes_df.dropna(inplace=True)
• Filling with a specific value could be done as:
• itunes_df.loc[itunes_df['Composer'].isna(), 'Composer'] = 'Unknown'
• itunes_df['Composer'].fillna('Unknown', inplace=True)
Missing values
• Using KNN imputation (a minimal sketch follows)
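The slide gives no code for this; here is a minimal sketch using scikit-learn's KNNImputer on numeric columns, with a hypothetical stand-in DataFrame (scikit-learn must be installed):

import pandas as pd
from sklearn.impute import KNNImputer

itunes_df = pd.DataFrame({'Milliseconds': [200000.0, None, 4000000.0],
                          'Bytes': [2e6, 3e6, None]})

# Each missing value is filled in from the 2 nearest rows,
# measured on the other (non-missing) columns
imputer = KNNImputer(n_neighbors=2)
itunes_df[['Milliseconds', 'Bytes']] = imputer.fit_transform(
    itunes_df[['Milliseconds', 'Bytes']])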
Duplicate values
• One way to check for duplicates is the duplicated() function: itunes_df.duplicated().sum()
• To drop duplicates: itunes_df.drop_duplicates(inplace=True)
Outliers
• Outliers are data that are not in the usual range of values. Dealing with categorical outliers can help a little with analysis, but often has a minimal impact.
• For numeric data, it's easy to quantify an outlier. Typically, we use the interquartile range (IQR) or z-score methods.
• We get quartiles (25th, 50th, 75th percentiles) from the describe() function in our EDA. These are sometimes called the first, second, and third quartiles, respectively (Q1, Q2, and Q3).
• Formulas to calculate the IQR bounds (a code sketch follows):
IQR = 75th_percentile - 25th_percentile
upper_bound = 75th_percentile + 1.5 * IQR
lower_bound = 25th_percentile - 1.5 * IQR
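A minimal sketch of IQR-based outlier filtering; the column and values are a hypothetical stand-in for the iTunes data:

import pandas as pd

itunes_df = pd.DataFrame({'Milliseconds': [200000, 250000, 300000, 4000000]})

q1 = itunes_df['Milliseconds'].quantile(0.25)
q3 = itunes_df['Milliseconds'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the rows inside the IQR bounds
no_outliers = itunes_df[itunes_df['Milliseconds'].between(lower_bound, upper_bound)]
print(no_outliers)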
Data transformations
• A basic operation to transform data:
itunes_df['Seconds'] = itunes_df['Milliseconds'] / 1000
• If we wanted to create another column:
itunes_df['len_byte_ratio'] = itunes_df['Milliseconds'] / itunes_df['Bytes']
Data transformation
• One way to replace several values at once is with the map and replace functions:
genre_dict = {'metal': 'Metal', 'met': 'Metal'}
itunes_df['Genre'].replace(genre_dict)
• Using the apply function:
itunes_df['Genre'].apply(lambda x: x.lower())
or
def lowercase(x):
    return x.lower()
itunes_df['Genre'].apply(lowercase)
or
itunes_df['Genre'].str.lower()
Groupby
• Groupby in pandas is just like in SQL – group by unique values in a column (a runnable sketch follows):
itunes_df.groupby('Genre')['Seconds'].mean().sort_values().head()
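A minimal sketch with a hypothetical stand-in DataFrame; selecting the 'Seconds' column before calling mean() avoids errors on non-numeric columns in recent pandas versions:

import pandas as pd

itunes_df = pd.DataFrame({'Genre': ['Rock', 'Rock', 'Jazz'],
                          'Seconds': [200.0, 300.0, 400.0]})

# Mean length per genre, shortest first
print(itunes_df.groupby('Genre')['Seconds'].mean().sort_values().head())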
Writing DataFrame to disk
• Pandas offers many ways to save data: CSV, Excel, HDF5, ...
itunes_df.to_csv('data/saved_itunes_data.csv', index=False)

We first give the filename as an argument to to_csv, and then tell it not to write the index to the file with index=False. This would save the data as saved_itunes_data.csv in the directory named "data" within the same folder/directory where the code is being run.
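For the other formats the bullet mentions, a brief sketch; the paths are placeholders, to_excel needs openpyxl installed, and to_hdf needs the tables package:

# Save the same DataFrame as Excel and HDF5
itunes_df.to_excel('data/saved_itunes_data.xlsx', index=False)
itunes_df.to_hdf('data/saved_itunes_data.h5', key='itunes')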
