4.1 Data Retrieval and Preprocessing of Python


Data Processing Using Python

Data Retrieval and Preprocessing with Python

ZHANG Li/Dazhuang
Nanjing University
Department of Computer Science and Technology
Department of University Basic Computer Teaching

Basic Data Processing Procedure

1. Data Collection
2. Data Exploration and Preprocessing
3. Data Analysis and Mining
4. Result Evaluation and Presentation

CONVENIENT AND FAST DATA ACQUISITION

Fetch Data with Python

How to get local data?

Opening, reading/writing, and closing a file

• File open

• File read

• File write

• File close
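
The four steps above can be sketched with Python's built-in open(). This is only a minimal illustration: the file name and the two text lines are made up, and the file is written to the system temporary directory.

```python
import os
import tempfile

# Hypothetical file name, used only for this illustration.
path = os.path.join(tempfile.gettempdir(), "demo_local_data.txt")

# Open for writing, write two lines, then close explicitly.
f = open(path, "w", encoding="utf-8")
f.write("line 1\n")
f.write("line 2\n")
f.close()

# Open for reading; the with statement closes the file automatically.
with open(path, "r", encoding="utf-8") as f:
    lines = f.readlines()

print(lines)
os.remove(path)  # clean up the temporary file
```

The with form is usually preferred because the file is closed even if an error occurs while reading.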


How to get (crawl) data from the web?

Crawl pages and interpret content
• Crawling
  – urllib built-in module (urllib.request)
  – Requests (third-party library)
  – Scrapy framework
• Interpreting
  – BeautifulSoup library
  – re module

Dow Jones Constituent

[Figure: dji, quotes]

Data Format

[Figure: djidf, quotesdf]

Download Data Directly

• How to easily and rapidly fetch historical data of companies from financial websites?

File

# Filename: quotes_fromcsv.py
import pandas as pd
quotesdf = pd.read_csv('axp.csv')
print(quotesdf)

Read and Write of csv Format

• Store the basic stock information of American Express in the past year into stockAXP.csv.

File

# Filename: to_csv.py
import pandas as pd
…
quotes = retrieve_quotes_historical('AXP')
df = pd.DataFrame(quotes)
df.to_csv('stockAXP.csv')
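
A self-contained round trip, as a sketch: retrieve_quotes_historical() is not shown on the slides, so two rows taken from the AXP quotes in this deck stand in for its result, and the file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Two AXP rows standing in for the result of retrieve_quotes_historical('AXP').
quotes = [
    {"date": "2019-10-03", "close": 112.55, "volume": 3549232},
    {"date": "2019-10-04", "close": 114.41, "volume": 2753195},
]
df = pd.DataFrame(quotes)

# Hypothetical file name in the temporary directory.
path = os.path.join(tempfile.gettempdir(), "stockAXP_demo.csv")
df.to_csv(path, index=False)   # index=False: do not write the row numbers
df_back = pd.read_csv(path)    # read the same file back
os.remove(path)

print(df_back)
```

Without index=False, to_csv() also writes the row index, which read_csv() then returns as an extra unnamed column.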

Read and Write of Excel Data

File

# Filename: to_excel.py
…
quotes = retrieve_quotes_historical('AXP')
df = pd.DataFrame(quotes)
df.to_excel('stockAXP.xlsx', sheet_name = 'AXP')

File

# Filename: read_excel.py
…
df = pd.read_excel('stockAXP.xlsx', index_col = 'date')
print(df['close'][:3])

Get Data Using API

Source

>>> import pandas_datareader.data as web

>>> f = web.DataReader('AXP', 'stooq')
>>> f.head(5)
Open High Low Close Volume
Date
2019-10-04 112.62 114.530 112.60 114.41 2753195
2019-10-03 112.52 112.955 111.06 112.55 3549232
2019-10-02 115.76 115.810 112.75 112.86 4931560
2019-10-01 118.70 119.500 116.61 116.70 2857528
2019-09-30 119.05 119.240 118.14 118.28 2353731

Using Datasets Module in Sklearn

Source

>>> from sklearn import datasets

>>> iris = datasets.load_iris()
>>> iris.feature_names
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
>>> iris.data
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
…
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]])

NLTK library

[Figure: corpora in the NLTK library: gutenberg, webtext, brown, reuters, inaugural, user-defined library, other languages]

Easier Approach to Data

Source

>>> from nltk.corpus import gutenberg

>>> import nltk
>>> print(gutenberg.fileids())
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt',
'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> texts = gutenberg.words('shakespeare-hamlet.txt')
>>> print(texts)
['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', ...]

FUNDAMENTALS OF PYTHON PLOTTING

Matplotlib Plotting

• Matplotlib plotting: the most famous Python 2D plotting library
  – High quality
  – Convenient plotting modules
• Plotting API: the pyplot module

Line Chart

Source

>>> import matplotlib.pyplot as plt

>>> plt.plot([3, 4, 7, 6, 2, 8, 9])

# Equivalent call: when only y is given, x defaults to range(len(y)).
>>> plt.plot(range(7), [3, 4, 7, 6, 2, 8, 9])

Line Chart – for groups of data

• A NumPy array can also be passed as an argument to Matplotlib functions
• Several groups of data can be plotted in one call

Source

>>> import numpy as np

>>> import matplotlib.pyplot as plt
>>> t=np.arange(0.,4.,0.1)
>>> plt.plot(t, t, t, t+2, t, t**2)

Different plot forms

Source

>>> import matplotlib.pyplot as plt

>>> plt.scatter(range(7), [3, 4, 7, 6, 2, 8, 9])
>>> plt.bar(range(7), [3, 4, 7, 6, 2, 8, 9])

Matplotlib Attributes

Default attributes Matplotlib can control:
• Graph size
• Dots per inch (dpi)
• Line width
• Color and style
• subplots
• axes
• Grid attributes
• Character attributes
• ……

Color and Style

• Can the color, line type, or marker style of a graph be modified?

plt.plot(x, y, 'g--')
plt.plot(x, y, 'rD')

Color and Style

Character  Color     Type    Description    Mark  Description
b          blue      '-'     solid          "o"   circle
g          green     '--'    dashed         "v"   triangle_down
r          red       '-.'    dash_dot       "s"   square
c          cyan      ':'     dotted         "p"   pentagon
m          magenta   'None'  draw nothing   "*"   star
y          yellow    ' '     draw nothing   "h"   hexagon1
k          black     ''      draw nothing   "+"   plus
w          white                            "D"   diamond
…                                           …
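
A short sketch of how the format strings in the table combine: one color character plus a line type and/or a marker. The Agg backend is selected only so the example runs without a display, and the data are the values used on the earlier line-chart slide.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

x = range(7)
y = [3, 4, 7, 6, 2, 8, 9]

# 'g--' = green color + dashed line; plot() returns a list of Line2D objects.
(dashed,) = plt.plot(x, y, "g--")
# 'rD' = red color + diamond markers, with no connecting line.
(diamonds,) = plt.plot(x, y, "rD")

print(dashed.get_color(), dashed.get_linestyle())
print(diamonds.get_color(), diamonds.get_marker(), diamonds.get_linestyle())
```

The same settings can be passed as keyword arguments (color, linestyle, marker), which is easier to read when more attributes are involved.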

Other Attributes

File

# Filename: multilines.py
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize = (8, 6), dpi = 100)

t = np.arange(0., 4., 0.1)
plt.plot(t, t, color='red', linestyle='-', linewidth=3, label='Line 1')
plt.plot(t, t+2, color='green', linestyle='', marker='*', linewidth=3, label='Line 2')
plt.plot(t, t**2, color='blue', linestyle='', marker='+', linewidth=3, label='Line 3')
plt.legend(loc = 'upper left')

Text

Add titles to the graph, the vertical axis, and the horizontal axis.

File

# Filename: title.py
import matplotlib.pyplot as plt

plt.title('Plot Example')
plt.xlabel('X label')
plt.ylabel('Y label')
plt.plot(range(7), [3, 4, 7, 6, 2, 8, 9])

Subplots

• In Matplotlib, plotting is carried out in the current figure and the current coordinate system (axes). By default, plotting goes to figure No. 1. We can also plot in multiple areas of one figure.
• This is done with the subplot()/subplots() and axes() functions, respectively.

subplots

[Figure: three subplot layouts]
• 2 rows × 1 column: plt.subplot(211), plt.subplot(212)
• 1 row × 2 columns: plt.subplot(121), plt.subplot(122)
• 2 rows × 2 columns: plt.subplot(221), plt.subplot(222), plt.subplot(223), plt.subplot(224)

subplot()

File

# Filename: subplot.py
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 300)

plt.figure(1) # default
plt.subplot(211) # first subplot
plt.plot(x, np.sin(x), color = 'r')
plt.subplot(212) # second subplot
plt.plot(x, np.cos(x), color = 'g')

subplots()

File

# Filename: subplots.py
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 300)

fig, (ax0, ax1) = plt.subplots(2, 1)
ax0.plot(x, np.sin(x), color = 'r')
ax0.set_title('subplot1')
plt.subplots_adjust(hspace = 0.5)
ax1.plot(x, np.cos(x), color = 'g')
ax1.set_title('subplot2')

subplots-axes

axes([left, bottom, width, height]): each parameter is a fraction of the figure size, in the range (0, 1)

File

# Filename: axes.py
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 300)

plt.axes([.1, .1, 0.8, 0.8])
plt.plot(x, np.sin(x), color = 'r')
plt.axes([.3, .15, 0.4, 0.3])
plt.plot(x, np.cos(x), color = 'g')

pandas plotting

Source

>>> quotesdf.loc[:9, 'close'].plot()

Source

>>> quotesdf.loc[:9, ['close', 'open']].plot()

Source

>>> ax = djidf.plot(kind = 'bar', x = 'code', y = 'price', color = 'g')
>>> ax.set(ylabel = 'Price', title = 'Stock Statistics of ^DJI')

DATA CLEANING IN DATA EXPLORATION AND PREPROCESSING

Data Exploration
• check data errors
• understand data distribution characteristics and inherent regularities

Data Preprocessing
• Data cleaning
• Data integration
• Data transformation
• Data reduction

Missing Value Handling

How to deal with missing values?
• drop the value
• fill the value with:
  – a fixed value
  – the mean, median, or mode
  – neighboring data (the value above or below)
  – an interpolation function
  – the most likely value

Missing value handling: DataFrame

quotesdf_nan = pd.read_csv('AXP_NaN.csv', index_col = 'Date')

check for missing values: df.isnull()
drop missing values: df.dropna()
fill missing values: df.fillna()

Forward fill (copy the previous valid value; fillna(method='ffill') is deprecated in recent pandas):

quotesdf_nan.ffill(inplace = True)

How to fill missing values with the mean value?
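
One possible answer to the mean-value question, as a sketch: pass the column means to fillna(). AXP_NaN.csv is not available here, so a small made-up DataFrame with two gaps stands in for it.

```python
import numpy as np
import pandas as pd

# Made-up prices with two gaps, standing in for AXP_NaN.csv.
df = pd.DataFrame({"Close": [112.0, np.nan, 114.0, 116.0],
                   "Open": [111.0, 113.0, np.nan, 115.0]})

print(df.isnull().sum())            # missing values per column

filled_mean = df.fillna(df.mean())  # each NaN gets its own column's mean
filled_ffill = df.ffill()           # each NaN gets the previous valid value

print(filled_mean)
print(filled_ffill)
```

df.mean() skips NaN by default, so the fill value is the mean of the observed entries only.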

Outliers

How to detect them?
• simple statistics
• plotting
• density-based, kNN, or clustering algorithms

How to deal with them?
• treat them the same as missing values
• replace them with the local mean (binning)
• do nothing
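
A sketch of the "simple statistics" route, assuming the common 1.5 × IQR rule (the slides do not name a specific rule). The series is made up, and the flagged value is then handled like a missing value.

```python
import pandas as pd

# Made-up series with one obvious outlier.
s = pd.Series([10.0, 11.0, 12.0, 11.5, 10.5, 50.0])

# Simple statistics: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Treat outliers like missing values: mask them, then fill with the median
# of the remaining data.
cleaned = s.mask(is_outlier).fillna(s[~is_outlier].median())

print(s[is_outlier])
print(cleaned)
```

The 1.5 factor is a convention, not a law; a 3-standard-deviation rule or a box plot would be equally valid simple checks.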

DATA TRANSFORMATION IN DATA PREPROCESSING

Data Transformation

Transform data into a suitable form. Common ways:
• Normalization
• Discretization of continuous features
• Binarization

Normalization

What problems does it solve?
• features with different dimensions (units)
• features with a wide range of values

Common methods:
• Min-Max normalization
• Z-Score normalization
• Normalization by decimal scaling

Boston Housing Datasets

Note: load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so this example needs an older release.

>>> boston = datasets.load_boston()
>>> boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
>>> boston.target
array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, …])
>>> boston_df = pd.DataFrame(boston.data[:, 4:7])
>>> boston_df.columns = boston.feature_names[4:7]
>>> boston_df
       NOX     RM   AGE
0    0.538  6.575  65.2
1    0.469  6.421  78.9
2    0.469  7.185  61.1
…
504  0.573  6.794  89.3
505  0.573  6.030  80.8

Feature descriptions:
4: NOX - nitric oxides concentration (parts per 10 million)
5: RM - average number of rooms per dwelling
6: AGE - proportion of owner-occupied units built prior to 1940
MEDV (target) - Median value of owner-occupied homes in $1000's

Min-Max normalization

$$x' = \frac{x - \min}{\max - \min}$$

(df-df.min())/(df.max()-df.min())

Problems:
• If a future value falls outside the current [min, max] range, min and max have to be redefined.
• If one value is very large, the other normalized values are crowded together close to 0.
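
Both problems can be seen on a tiny example. This is a sketch with made-up numbers: one very large value (1000) and a later value (2000) that lies outside the original range.

```python
import pandas as pd

# Made-up feature values; the last one is deliberately very large.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 1000.0]})

# Min-Max normalization, column by column.
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)  # the first four values are crowded together near 0

# A later value outside the original range maps outside [0, 1]
# unless min and max are recomputed.
new_value = 2000.0
new_scaled = (new_value - df["x"].min()) / (df["x"].max() - df["x"].min())
print(new_scaled)
```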


from sklearn import preprocessing

min_max_scaler = preprocessing.minmax_scale(df) # [0,1]

Z-Score normalization

$$x' = \frac{x - \bar{x}}{\sigma}$$

(df-df.mean())/df.std()

Features:
• Most frequently used.
• The mean of the processed data is 0, and the standard deviation is 1.


scaler = preprocessing.scale(df)
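
One detail worth checking when comparing the two snippets: pandas' std() uses the sample standard deviation (ddof = 1) by default, while scikit-learn's scale() divides by the population standard deviation (ddof = 0), so the results are expected to differ slightly on small datasets. The sketch below reproduces both conventions with pandas and NumPy only, on made-up values.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({"x": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

z_sample = (df - df.mean()) / df.std()            # pandas default: ddof=1
z_population = (df - df.mean()) / df.std(ddof=0)  # divides by N, like np.std

print(z_sample["x"].std())                    # sample std of the result
print(np.std(z_population["x"].to_numpy()))   # population std of the result
```

Either convention gives a mean of 0; which one yields a standard deviation of exactly 1 depends on how that standard deviation is measured.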

Normalization by decimal scaling

$$x' = \frac{x}{10^{j}}$$

df/10**np.ceil(np.log10(df.abs().max()))

Features:
• Move the decimal point. The number of places moved depends on the maximum absolute value of the feature.
• The results commonly fall within [-1, 1].
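
A worked sketch of the expression above on made-up values, showing that j is computed separately for each column.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({"a": [-250.0, 40.0, 999.0], "b": [0.5, -3.0, 12.0]})

# j is the smallest integer with 10**j >= max(|x|), computed per column.
j = np.ceil(np.log10(df.abs().max()))
scaled = df / 10 ** j

print(j)       # a -> 3 (max |x| = 999), b -> 2 (max |x| = 12)
print(scaled)
```

A column whose maximum absolute value is 0 would give log10(0) = -inf, so such columns need to be handled separately.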

Discretization of Continuous Features

Methods
• Binning: equal-width, equal-frequency
• Clustering

pd.cut(df.AGE, 5, labels = range(5))     # equal-width binning

pd.qcut(df.AGE, 5, labels = range(5))    # equal-frequency binning
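
The difference between the two calls shows up on skewed data. A sketch with a made-up series (not the Boston AGE column):

```python
import pandas as pd

# Made-up, skewed ages for illustration.
age = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

equal_width = pd.cut(age, 5, labels=range(5))   # 5 intervals of equal length
equal_freq = pd.qcut(age, 5, labels=range(5))   # 5 intervals with equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With equal-width bins, the single large value stretches the range so that almost all points land in the first bin; equal-frequency bins keep the counts balanced at the cost of uneven interval lengths.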

Feature Binarization

Source

>>> from sklearn.preprocessing import Binarizer

>>> X = boston.target.reshape(-1,1)
>>> Binarizer(threshold = 20.0).fit_transform(X)

DATA REDUCTION IN DATA PREPROCESSING

Data Reduction

Purpose:
• The features and values are reduced to obtain a much smaller representation than the original dataset, while staying close to the integrity of the original data. Mining on the reduced dataset can produce almost the same analysis results.

Ways:
• Feature reduction: forward selection, backward elimination, decision tree, PCA
• Value reduction: parametric methods (regression, log-linear model), nonparametric methods (histogram, clustering, sampling)

Feature Reduction - PCA

Source

>>> from sklearn.decomposition import PCA

>>> X = preprocessing.scale(boston.data)
>>> pca = PCA(n_components=5)
>>> pca.fit(X)
>>> pca.explained_variance_ratio_
array([0.47129606, 0.11025193, 0.0955859 , 0.06596732, 0.06421661])

Value Reduction - histogram

Features:
• Show the data distribution by forming bins.
• Each bin shows the frequency of the data values that fall in it.

>>> data = np.random.randint(1, 10, 50)
>>> data
array([4, 8, 9, 8, 7, 2, 8, 7, 5, 3, 1, 4, 5, 8, 7, 9, 5, 9, 9, 5, 9, 1, 9, 7, 1, 2, 9, 5, 5,
       5, 9, 4, 3, 5, 5, 4, 7, 4, 9, 8, 2, 6, 3, 5, 3, 2, 9, 1, 3, 1])

plt.hist(data, bins=…)
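
The reduction itself can be seen without plotting: np.histogram() replaces the 50 values with a few bin counts. A sketch using the array shown above; the choice of three bins (1-3, 4-6, 7-9) is an assumption, since the slide elides the bins argument.

```python
import numpy as np

# The 50 values shown on the slide.
data = np.array([4, 8, 9, 8, 7, 2, 8, 7, 5, 3, 1, 4, 5, 8, 7, 9, 5, 9, 9, 5,
                 9, 1, 9, 7, 1, 2, 9, 5, 5, 5, 9, 4, 3, 5, 5, 4, 7, 4, 9, 8,
                 2, 6, 3, 5, 3, 2, 9, 1, 3, 1])

# Assumed bin edges: values 1-3, 4-6 and 7-9 fall in the three bins.
counts, edges = np.histogram(data, bins=[1, 4, 7, 10])
print(counts, edges)
```

plt.hist() accepts the same bins argument and draws these counts as bars.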

Value Reduction - sampling

Sampling
• Random sampling: without replacement, with replacement
• Cluster sampling
• Stratified sampling

Some features:
• Sampling without replacement: take n samples from the N samples of the original dataset D; each draw returns a different item.
• Sampling with replacement: take n samples from the N samples of the original dataset D, recording each one and putting it back, so the same item may be drawn more than once.
• Stratified sampling: dataset D is divided into disjoint parts (strata), and each stratum is randomly sampled to get the final result.

Random Sampling

Without Replacement:
iris_df.sample(n = 10)
iris_df.sample(frac = 0.3)

With Replacement:
iris_df.sample(n = 10, replace = True)
iris_df.sample(frac = 0.3, replace = True)
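
A small check of the difference, as a sketch: iris_df is not defined on the slides, so a made-up 20-row DataFrame stands in for it, and random_state is fixed only to make the run repeatable.

```python
import pandas as pd

# Small made-up DataFrame standing in for iris_df.
df = pd.DataFrame({"value": range(20)})

without = df.sample(n=10, random_state=0)                  # no row can repeat
with_repl = df.sample(n=30, replace=True, random_state=0)  # n may exceed len(df)

print(without.index.is_unique)    # every sampled row is distinct
print(with_repl.index.is_unique)  # 30 draws from 20 rows must repeat some rows
```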

Stratified Sampling

Source

>>> A = iris_df[iris_df.target == 0].sample(frac = 0.3)

>>> B = iris_df[iris_df.target == 1].sample(frac = 0.2)
>>> pd.concat([A, B])  # DataFrame.append was removed in pandas 2.0
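
The same idea extended to every stratum, as a sketch: a made-up DataFrame with a target column stands in for iris_df, and one common fraction (0.2) is assumed so that the sample keeps the class proportions.

```python
import pandas as pd

# Made-up frame with a class column, standing in for iris_df.
df = pd.DataFrame({"target": [0] * 50 + [1] * 50 + [2] * 50,
                   "value": range(150)})

# Sample each stratum separately, then combine the parts.
parts = [df[df.target == label].sample(frac=0.2, random_state=0)
         for label in sorted(df.target.unique())]
stratified = pd.concat(parts)

print(stratified.target.value_counts().sort_index())
```

Using different fractions per stratum, as on the slide, over- or under-represents some classes on purpose.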

