Preprocessing + EDA - Jupyter Notebook

The document discusses various methods for importing data into Python from different file types and sources, including flat files, databases, APIs, and the web. It also covers initial exploration and cleaning of data, such as handling missing values, duplicates, and invalid data. Common Python libraries like Pandas, NumPy, and Matplotlib are used for loading, manipulating, and visualizing the data.

Uploaded by

Sagarika Ramesh
Copyright: © All Rights Reserved
2/9/22, 11:22 AM preprocessing + EDA - Jupyter Notebook

Importing data

reading a text file

using with
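A minimal sketch of reading a whole text file with a context manager. The file is created in a temporary directory only so that the example runs on its own; the name sample.txt and its contents are made up.

```python
import os
import tempfile

# Create a small throwaway text file so the example is self-contained
# (the file name and contents are hypothetical).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# 'with' closes the file automatically, even if an error occurs inside the block.
with open(path, "r") as file:
    text = file.read()

print(text)
print(file.closed)  # the file is closed once the block exits
```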

localhost:8889/notebooks/Documents/Python Scripts/preprocessing %2B EDA .ipynb 1/30



To read a single line, use readline():

with open('huck_finn.txt') as file:
    # readline() returns the next line of the file, including its newline
    print(file.readline())

NOTE: in IPython/Jupyter, putting ! before a command runs it in the system shell, so shell commands can be executed directly from the notebook or the IPython command line.

Reading flat files (.csv, .txt)


A flat file is a plain-text table that usually starts with a header row. The delimiter can be a tab or a comma: typically a comma (,) for .csv files and a tab for tab-delimited .txt files.
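To make the header and delimiter idea concrete, here is a small sketch using the standard csv module. The sample text and the helper read_flat are hypothetical, and io.StringIO stands in for an opened file.

```python
import csv
import io

# Hypothetical file contents: the same table, comma- and tab-delimited.
comma_text = "name,age\nAnn,31\nBob,45\n"
tab_text = "name\tage\nAnn\t31\nBob\t45\n"

def read_flat(text, delimiter):
    """Return (header, rows) parsed from delimited text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)   # the first line is the header
    rows = list(reader)     # the remaining lines are the records
    return header, rows

csv_header, csv_rows = read_flat(comma_text, ",")
txt_header, txt_rows = read_flat(tab_text, "\t")
print(csv_header, csv_rows)
print(txt_header, txt_rows)
```

Only the delimiter changes between the two calls; the parsed header and rows are the same.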


Loading numerical data -> NumPy (np.loadtxt)


usecols -> import only the 1st and 2nd columns

dtype=str -> import all entries as strings
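A short sketch of these np.loadtxt options. The tab-delimited sample is made up, and io.StringIO stands in for a file on disk.

```python
import io
import numpy as np

# Hypothetical tab-delimited data with a header row.
text = "x\ty\tz\n1\t2\t3\n4\t5\t6\n"

# skiprows=1 skips the header; usecols=[0, 1] keeps the 1st and 2nd columns.
data = np.loadtxt(io.StringIO(text), delimiter="\t", skiprows=1, usecols=[0, 1])
print(data)

# dtype=str imports every entry, header included, as a string.
as_text = np.loadtxt(io.StringIO(text), delimiter="\t", dtype=str)
print(as_text)
```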

For mixed data types -> don't use np.loadtxt(); load the file into a pandas DataFrame instead
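As an illustration, pandas infers a separate type for each column. The CSV text below is hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV with a string column, an integer column and a float column.
text = "name,age,fare\nAnn,31,7.25\nBob,45,71.28\n"

df = pd.read_csv(io.StringIO(text))
print(df.dtypes)   # each column gets its own dtype
print(df.head())
```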


Consider a corrupted dataset. In pd.read_csv, sep is the delimiter (a tab here); comment takes the character after which the rest of a line is treated as a comment, which in this case is '#'; na_values takes a list of strings to recognize as NA/NaN, in this case the string 'Nothing', which is replaced with NaN.

# Import pandas and matplotlib.pyplot
import pandas as pd
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data. It is tab-separated; drop comments after '#'
# and replace 'Nothing' with NA/NaN
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing', skiprows=3)

# Print the head of the DataFrame
print(data.head())

# Plot the 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
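For readers without titanic_corrupt.txt, the same read_csv options can be tried on a tiny in-memory sample. The text below is made up for illustration and is not the Titanic data.

```python
import io
import pandas as pd

# Hypothetical tab-separated text: one comment line and one 'Nothing' placeholder.
text = "# exported by hand\nName\tAge\nAnn\t31\nBob\tNothing\n"

df = pd.read_csv(io.StringIO(text), sep="\t", comment="#", na_values="Nothing")
print(df)                 # the comment line is skipped
print(df["Age"].isna())   # 'Nothing' was read as NaN
```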


An .xls file is not a flat file, because it is a spreadsheet that can consist of many sheets, not a
single table.

Importing SAS/Stata files
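A minimal sketch for Stata files using pandas alone. A tiny .dta file is written first so that the example runs on its own; the file name and columns are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Write a tiny hypothetical Stata file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.dta")
pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]}).to_stata(path, write_index=False)

# Stata (.dta) files load straight into a DataFrame.
df = pd.read_stata(path)
print(df)

# SAS files (.sas7bdat, .xpt) can be read in a similar way with pd.read_sas.
```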


Importing HDF5 files


Access the contents with .keys(), since an HDF5 file is hierarchical.
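A small sketch of walking an HDF5 hierarchy, assuming the h5py package is installed. The file, group and dataset names are invented for the example.

```python
import os
import tempfile
import h5py
import numpy as np

# Build a tiny hypothetical HDF5 file: one group holding one dataset.
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")
with h5py.File(path, "w") as f:
    f.create_group("measurements").create_dataset("values", data=np.arange(5))

# Walk the hierarchy level by level with .keys(), like nested dictionaries.
with h5py.File(path, "r") as f:
    top_level = list(f.keys())
    inner = list(f["measurements"].keys())
    values = f["measurements"]["values"][:]   # read the dataset into a NumPy array

print(top_level, inner, values)
```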


Matrix Laboratory (MATLAB, .mat)
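A minimal sketch of reading a .mat file, assuming SciPy is installed. The file is written first with scipy.io.savemat so that the example is self-contained; the variable name x is hypothetical.

```python
import os
import tempfile
import numpy as np
import scipy.io

# Write a tiny hypothetical .mat file holding one variable.
path = os.path.join(tempfile.mkdtemp(), "example.mat")
scipy.io.savemat(path, {"x": np.array([1.0, 2.0, 3.0])})

# loadmat returns a dict: keys are MATLAB variable names, values are NumPy arrays.
mat = scipy.io.loadmat(path)
print(type(mat))
print([k for k in mat if not k.startswith("__")])   # skip metadata keys
print(mat["x"])
```

Keys that start with a double underscore hold file metadata rather than MATLAB variables.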



Relational databases -> a bunch of tables which are linked

A table is analogous to a DataFrame
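To illustrate linked tables and the table/DataFrame analogy, here is a sketch using an in-memory SQLite database. The customers and orders tables are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with two hypothetical linked tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 9.5), (2, 1, 20.0), (3, 2, 5.0);
""")

# orders.customer_id links each order to a row in customers.
query = """
    SELECT c.name, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
"""
df = pd.read_sql_query(query, conn)   # the query result comes back as a DataFrame
conn.close()
print(df)
```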


Import data from web
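A sketch of downloading a file with the standard urllib.request module. So that it runs without network access, a local file addressed by a file:// URL stands in for the remote resource; with a real http(s) URL the calls are the same. All file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve, urlopen

# Stand-in for a remote file: a local file addressed by a file:// URL.
tmp_dir = tempfile.mkdtemp()
source = Path(tmp_dir) / "remote.csv"
source.write_text("a,b\n1,2\n")
url = source.as_uri()

# Download the resource and save it to a local file.
local_path = os.path.join(tmp_dir, "local.csv")
urlretrieve(url, local_path)

# Or read the response body directly without saving it.
with urlopen(url) as response:
    body = response.read().decode("utf-8")

print(body)
```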


simpler way


beautiful soup
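A minimal sketch of parsing HTML, assuming the bs4 package is installed. The HTML string is made up; in practice it would be the text of a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="https://example.com/a">First</a>
    <a href="https://example.com/b">Second</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.title.string                              # text of the <title> tag
links = [a.get("href") for a in soup.find_all("a")]    # every hyperlink target
print(title)
print(links)
```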


Importing from API


The standard format for transferring data through an API is JSON.
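A short sketch of turning a JSON response into Python objects. The response text is hypothetical and hard-coded so that no network call is needed.

```python
import json

# Hypothetical response body, as an API might return it.
response_text = '{"Title": "Example", "Year": "2010", "Ratings": [{"Source": "A", "Value": "8/10"}]}'

data = json.loads(response_text)   # JSON object -> Python dict
for key, value in data.items():
    print(key + ":", value)
```

Nested JSON objects and arrays become nested dicts and lists, so data["Ratings"][0]["Value"] reaches the inner value.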


Cleaning


misleading


correct

remove future dates
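One possible way to drop rows dated in the future, sketched with pandas. The DataFrame and its column names are hypothetical.

```python
import datetime as dt
import pandas as pd

today = dt.date.today()
df = pd.DataFrame({
    "user": ["a", "b", "c"],
    "signup_date": pd.to_datetime([
        today - dt.timedelta(days=30),
        today + dt.timedelta(days=30),   # invalid: lies in the future
        today,
    ]),
})

# Keep only the rows whose date is not after today.
cleaned = df[df["signup_date"] <= pd.Timestamp(today)]
print(cleaned)
```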


Treat duplicates
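A minimal sketch of finding and dropping fully duplicated rows with pandas. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "age": [31, 45, 31, 28],
})

# duplicated() marks rows that repeat an earlier row.
dupes = df.duplicated()
print(df[dupes])

# drop_duplicates() keeps the first occurrence of each row.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Passing subset=[...] to either method restricts the comparison to the chosen columns.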
