Data Processing Using Python
Data Retrieval and Preprocessing with Python
ZHANG Li/Dazhuang
Nanjing University
Department of Computer Science and Technology
Department of University Basic Computer Teaching
Basic Data Processing Procedure
1. Data Collection
2. Data Exploration and Preprocessing
3. Data Analysis and Mining
4. Result Evaluation and Presentation
CONVENIENT AND FAST DATA ACQUISITION
Fetch Data with Python
How to get local data?
Open, read/write, and close a file
• File open
• File read
• File write
• File close
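The four steps above can be sketched as follows. This is a minimal illustration, not the course's own example: the file name and the two lines of text are hypothetical, and the file is created in a temporary directory so nothing existing is overwritten.

```python
import os
import tempfile

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "quotes_demo.txt")

# File open + file write: the with-block closes the file automatically on exit.
with open(path, "w", encoding="utf-8") as f:
    f.write("AXP,112.62\n")
    f.write("AXP,112.52\n")

# File open + file read: read the whole file and split it into lines.
with open(path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines)
print(f.closed)  # file close: the file object is closed once the with-block ends
```

Using `with` is a design choice: it guarantees the close step even if reading or writing raises an exception.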
Fetch Data with Python
How to get (crawl) data from the web?
Crawl pages and interpret content
• Crawling
  – urllib built-in module (urllib.request)
  – Requests (third-party library)
  – Scrapy framework
• Interpreting
  – BeautifulSoup library
  – re module
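The crawl-then-interpret split can be sketched with the standard library only. The HTML fragment, its class names, and the prices below are hypothetical stand-ins for a downloaded page; `fetch` shows the crawling step with `urllib.request` but is not called here because it needs network access.

```python
import re
from urllib.request import urlopen


def fetch(url):
    """Crawling step: download a page and return its text (not called in this sketch)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Hypothetical page fragment standing in for the result of fetch(url).
html = """
<table>
<tr><td class="code">AXP</td><td class="price">114.41</td></tr>
<tr><td class="code">BA</td><td class="price">375.70</td></tr>
</table>
"""

# Interpreting step: capture (code, price) pairs with a regular expression.
pattern = re.compile(r'<td class="code">(.*?)</td><td class="price">(.*?)</td>')
rows = [(code, float(price)) for code, price in pattern.findall(html)]
print(rows)
```

Regular expressions are adequate for a fixed, simple layout like this one; for real pages with nested or irregular markup, a parser such as BeautifulSoup is usually the more robust choice.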
Dow Jones Constituent
[Figure: the dji and quotes data]
Data Format
[Figure: the DataFrames djidf and quotesdf]
Download Data Directly
• How to easily and rapidly fetch historical data of companies from financial websites?
File
# Filename: quotes_fromcsv.py
import pandas as pd
quotesdf = pd.read_csv('axp.csv')
print(quotesdf)
Read and Write of csv Format
• Store the basic stock information of American Express in the past year into stockAXP.csv.
File
# Filename: to_csv.py
import pandas as pd
…
quotes = retrieve_quotes_historical('AXP')
df = pd.DataFrame(quotes)
df.to_csv('stockAXP.csv')
Read and Write of Excel Data
File
# Filename: to_excel.py
…
quotes = retrieve_quotes_historical('AXP')
df = pd.DataFrame(quotes)
df.to_excel('stockAXP.xlsx', sheet_name = 'AXP')
File
# Filename: read_excel.py
…
df = pd.read_excel('stockAXP.xlsx', index_col = 'date')
print(df['close'][:3])
Get Data Using API
Source
>>> import pandas_datareader.data as web
>>> f = web.DataReader('AXP', 'stooq')
>>> f.head(5)
Open High Low Close Volume
Date
2019-10-04 112.62 114.530 112.60 114.41 2753195
2019-10-03 112.52 112.955 111.06 112.55 3549232
2019-10-02 115.76 115.810 112.75 112.86 4931560
2019-10-01 118.70 119.500 116.61 116.70 2857528
2019-09-30 119.05 119.240 118.14 118.28 2353731
Using Datasets Module in Sklearn
Source
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris.feature_names
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
>>> iris.data
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
…
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]])
NLTK library
Corpora: gutenberg, webtext, brown, reuters, inaugural, other languages, user-defined corpora
Easier Approach to Data
Source
>>> from nltk.corpus import gutenberg, brown
>>> import nltk
>>> print(gutenberg.fileids())
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt',
'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt',
'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt',
'whitman-leaves.txt']
>>> texts = gutenberg.words('shakespeare-hamlet.txt')
>>> print(texts)
['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', ...]
FUNDAMENTALS OF PYTHON PLOTTING
Matplotlib Plotting
• Matplotlib: the most famous Python 2D plotting library
  – High quality
  – Convenient plotting modules
• Plotting API: the pyplot module
Line Chart
Source
>>> import matplotlib.pyplot as plt
>>> plt.plot([3, 4, 7, 6, 2, 8, 9])
>>> plt.plot(range(7), [3, 4, 7, 6, 2, 8, 9])  # equivalent: when x is omitted, the indices 0..6 are used
Line Chart – for groups of data
• A NumPy array can also be used as a parameter of Matplotlib plotting functions
• Several groups of data can be plotted in one call
Source
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> t=np.arange(0.,4.,0.1)
>>> plt.plot(t, t, t, t+2, t, t**2)
Different plot forms
Source
>>> import matplotlib.pyplot as plt
>>> plt.scatter(range(7), [3, 4, 7, 6, 2, 8, 9])
>>> plt.bar(range(7), [3, 4, 7, 6, 2, 8, 9])
Matplotlib Attributes
Default attributes Matplotlib can control:
• Graph size
• Dots per inch (dpi)
• Line width
• Color and style
• Subplots
• Axes
• Grid attributes
• Character attributes
• ……
Color and Style
• Can the color, line style or marker of a graph be modified?
plt.plot(x, y, 'g--')  # green dashed line
plt.plot(x, y, 'rD')   # red diamond markers
Color and Style
Character  Color    | Type    Description   | Mark  Description
b          blue     | '-'     solid         | "o"   circle
g          green    | '--'    dashed        | "v"   triangle_down
r          red      | '-.'    dash_dot      | "s"   square
c          cyan     | ':'     dotted        | "p"   pentagon
m          magenta  | 'None'  draw nothing  | "*"   star
y          yellow   | ' '     draw nothing  | "h"   hexagon1
k          black    | ''      draw nothing  | "+"   plus
w          white    |                       | "D"   diamond
…                   |                       | …
Other Attributes
File
# Filename: multilines.py
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize = (8, 6), dpi = 100)
t = np.arange(0., 4., 0.1)
plt.plot(t, t, color='red', linestyle='-', linewidth=3, label='Line 1')
plt.plot(t, t+2, color='green', linestyle='', marker='*', linewidth=3, label='Line 2')
plt.plot(t, t**2, color='blue', linestyle='', marker='+', linewidth=3, label='Line 3')
plt.legend(loc = 'upper left')
Words
Add titles: graph, vertical axis and horizontal axis
File
# Filename: title.py
import matplotlib.pyplot as plt
plt.title('Plot Example')
plt.xlabel('X label')
plt.ylabel('Y label')
plt.plot(range(7), [3, 4, 7, 6, 2, 8, 9])
Subplots
• In Matplotlib, plotting is carried out in the current figure and the current coordinate system (axes). By default, plotting goes into figure No. 1. A figure can be divided into multiple plotting areas.
• This is done with the subplot()/subplots() functions or with the axes() function.
subplots
• 2 rows × 1 column: plt.subplot(211), plt.subplot(212)
• 1 row × 2 columns: plt.subplot(121), plt.subplot(122)
• 2 rows × 2 columns: plt.subplot(221), plt.subplot(222), plt.subplot(223), plt.subplot(224)
subplot()
File
# Filename: subplot.py
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-np.pi, np.pi, 300)
plt.figure(1) # default
plt.subplot(211) # first subplot
plt.plot(x, np.sin(x), color = 'r')
plt.subplot(212) # second subplot
plt.plot(x, np.cos(x), color = 'g')
subplots()
File
# Filename: subplots.py
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-np.pi, np.pi, 300)
fig, (ax0, ax1) = plt.subplots(2, 1)
ax0.plot(x, np.sin(x), color = 'r')
ax0.set_title('subplot1')
plt.subplots_adjust(hspace = 0.5)
ax1.plot(x, np.cos(x), color = 'g')
ax1.set_title('subplot2')
subplots-axes
axes([left, bottom, width, height]): each parameter is a fraction of the figure size in the range (0, 1)
File
# Filename: axes.py
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-np.pi, np.pi, 300)
plt.axes([.1, .1, 0.8, 0.8])
plt.plot(x, np.sin(x), color = 'r')
plt.axes([.3, .15, 0.4, 0.3])
plt.plot(x, np.cos(x), color = 'g')
pandas plotting
Source
>>> quotesdf.loc[:9, 'close'].plot()
Source
>>> quotesdf.loc[:9, ['close', 'open']].plot()
pandas plotting
Source
>>> ax = djidf.plot(kind = 'bar', x = 'code', y = 'price', color = 'g')
>>> ax.set(ylabel = 'Price', title = 'Stock Statistics of ^DJI')
DATA CLEANING IN DATA EXPLORATION AND PREPROCESSING
Data Exploration
• Check data errors
• Understand data distribution characteristics and inherent regularities
Data Preprocessing
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Missing Value Handling
How to deal with?
• Drop the value
• Fill the value
  – fixed value
  – mean, median/mode
  – up and down (adjacent) data
  – interpolation function
  – most likely value
Missing Value Handling: DataFrame
quotesdf_nan = pd.read_csv('AXP_NaN.csv', index_col = 'Date')
detect missing values: df.isnull()
drop missing values: df.dropna()
fill missing values: df.fillna()
How to fill missing values with the mean value?
quotesdf_nan.fillna(quotesdf_nan.mean(numeric_only = True), inplace = True)
How to fill missing values with the previous valid value (forward fill)?
quotesdf_nan.ffill(inplace = True)
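The three options (detect, drop, fill) can be compared on a tiny example. This is a sketch under stated assumptions: the column name and the five prices are hypothetical and are not taken from AXP_NaN.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices with two missing values.
df = pd.DataFrame({"close": [10.0, np.nan, 14.0, np.nan, 12.0]})

print(df.isnull().sum())             # number of missing values per column

dropped = df.dropna()                # keep only the rows without NaN
mean_filled = df.fillna(df.mean())   # replace NaN with the column mean
forward_filled = df.ffill()          # replace NaN with the previous valid value

print(mean_filled["close"].tolist())
print(forward_filled["close"].tolist())
```

Mean filling keeps the column average unchanged, while forward filling preserves the local trend of a time series; which one is appropriate depends on the data.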
Outliers
How to observe?
• simple statistics
• plotting
• density-based, kNN or clustering algorithms
How to deal with?
• same as missing values
• calculate the local mean (binning)
• do nothing
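As one possible reading of "simple statistics", the sketch below flags values outside 1.5 times the interquartile range and then treats them like missing values by substituting the median. The 1.5 × IQR rule and the sample measurements are assumptions chosen for illustration; the slides do not prescribe a specific rule.

```python
import statistics

# Hypothetical measurements; 95 is an obvious outlier.
values = [12, 13, 12, 14, 13, 15, 14, 13, 95, 12]

# Observe: quartiles and the interquartile range (IQR).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]

# Deal with: handle outliers like missing values, here by filling in the
# median of the remaining values.
median = statistics.median(v for v in values if low <= v <= high)
cleaned = [v if low <= v <= high else median for v in values]

print(outliers)
print(cleaned)
```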
DATA TRANSFORMATION IN DATA PREPROCESSING
Data Transformation
Transform data into a suitable form. Common ways:
• Normalization
• Discretization of continuous features
• Binarization
Normalization
What problems does it solve?
• features with different dimensions (units)
• features with a wide range of values
Common methods:
• Min-Max normalization
• Z-Score normalization
• Normalization by decimal scaling
Boston Housing Datasets
>>> boston = datasets.load_boston()  # available only in scikit-learn < 1.2, where load_boston() still exists
>>> boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
>>> boston.target
array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, …])
>>> boston_df = pd.DataFrame(boston.data[:, 4:7])
>>> boston_df.columns= boston.feature_names[4:7]
>>> boston_df
       NOX     RM   AGE
0    0.538  6.575  65.2
1    0.469  6.421  78.9
2    0.469  7.185  61.1
…
504  0.573  6.794  89.3
505  0.573  6.030  80.8
Feature descriptions:
4: NOX - nitric oxides concentration (parts per 10 million)
5: RM - average number of rooms per dwelling
6: AGE - proportion of owner-occupied units built prior to 1940
MEDV - median value of owner-occupied homes in $1000's (the target)
Min-Max normalization
$$x' = \frac{x - \min}{\max - \min}$$
(df-df.min())/(df.max()-df.min())
Problems:
• If a future value falls outside the current [min, max] range, min and max need to be redefined.
• If one value is very large, the other normalized values are squeezed together and all end up close to 0.
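The second problem is easy to see numerically. The sketch below applies the formula above to a hypothetical feature, first as is and then with one very large value appended; the helper name `min_max` and all numbers are illustrative.

```python
def min_max(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 30, 40, 50]          # hypothetical feature values
with_outlier = ages + [5000]     # the same feature with one very large value

print(min_max(ages))             # values spread over the whole [0, 1] interval
print(min_max(with_outlier))     # the first four values are now all close to 0
```

Note that the helper divides by `hi - lo`, so it assumes the feature is not constant.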
Min-Max normalization
from sklearn import preprocessing
min_max_scaler = preprocessing.minmax_scale(df) # [0,1]
Z-Score normalization
$$x' = \frac{x - \bar{x}}{\sigma}$$
(df-df.mean())/df.std()
Features:
• Most frequently used.
• The mean of the processed data is 0, and the standard deviation is 1.
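One detail worth checking is which standard deviation is used. pandas' `std()` defaults to the sample estimate (`ddof=1`), whereas, as far as I know, scikit-learn's `preprocessing.scale` divides by the population standard deviation (`ddof=0`), so the two approaches give slightly different numbers on small datasets. The sketch below uses hypothetical values to show both variants.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical values

z_sample = (s - s.mean()) / s.std()              # pandas default: ddof=1
z_population = (s - s.mean()) / s.std(ddof=0)    # population standard deviation

# Each result has mean 0 and standard deviation 1 under its own definition.
print(z_sample.mean(), z_sample.std())
print(z_population.mean(), z_population.std(ddof=0))
```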
Z-Score normalization
scaler = preprocessing.scale(df)
Normalization by decimal scaling
$$x' = \frac{x}{10^{j}}$$
df/10**np.ceil(np.log10(df.abs().max()))
Features:
• Move the decimal point. The number of places moved, j, depends on the maximum absolute value of the feature.
• The results commonly fall within [-1, 1].
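A worked example makes the choice of j concrete. The sketch below mirrors the pandas expression above with the standard library; the helper name `decimal_scale` and the three values are hypothetical.

```python
import math


def decimal_scale(values):
    """Divide by 10**j, where j = ceil(log10(max |x|)). Assumes max |x| > 0."""
    j = math.ceil(math.log10(max(abs(v) for v in values)))
    return j, [v / 10 ** j for v in values]


# Largest absolute value is 986, so log10 is just under 3 and j becomes 3.
j, scaled = decimal_scale([-986, 120, 917])
print(j)
print(scaled)
```

Because j is obtained by rounding up, every scaled value has an absolute value of at most 1.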
Discretization of Continuous Features
Methods
• Binning: equal-width, equal-frequency
• Clustering
pd.cut(df.AGE, 5, labels = range(5))   # equal-width binning
pd.qcut(df.AGE, 5, labels = range(5))  # equal-frequency binning
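The difference between the two binning calls shows up clearly on skewed data. In this sketch the ten values are hypothetical: equal-width bins split the value range into equal intervals, so one extreme value leaves most bins nearly empty, while equal-frequency bins put about the same number of values in each bin.

```python
import pandas as pd

ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # hypothetical, skewed values

equal_width = pd.cut(ages, 3, labels=range(3))    # 3 intervals of equal length
equal_freq = pd.qcut(ages, 2, labels=range(2))    # 2 bins with equal counts

print(equal_width.astype(int).tolist())
print(equal_freq.astype(int).tolist())
```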
Feature Binarization
Source
>>> from sklearn.preprocessing import Binarizer
>>> X = boston.target.reshape(-1,1)
>>> Binarizer(threshold = 20.0).fit_transform(X)
DATA REDUCTION IN DATA PREPROCESSING
Data Reduction
Purpose:
• The features and values are reduced to obtain a representation that is much smaller than the original dataset but still close to the integrity of the original data. Mining on the reduced dataset can produce almost the same analysis results.
Way:
• Feature reduction: forward selection, backward elimination, decision tree, PCA
• Value reduction: parametric methods (regression, log-linear model), nonparametric methods (histogram, clustering, sampling)
Feature Reduction - PCA
Source
>>> from sklearn.decomposition import PCA
>>> X = preprocessing.scale(boston.data)
>>> pca = PCA(n_components=5)
>>> pca.fit(X)
>>> pca.explained_variance_ratio_
array([0.47129606, 0.11025193, 0.0955859 , 0.06596732, 0.06421661])
Value Reduction - histogram
Features:
• Show data distribution by forming bins.
• Each bin shows the frequency of the data values that fall into it.
data = np.random.randint(1,10,50)
array([4, 8, 9, 8, 7, 2, 8, 7, 5, 3, 1, 4, 5, 8, 7, 9, 5, 9, 9, 5, 9, 1, 9, 7, 1, 2, 9, 5, 5,
5, 9, 4, 3, 5, 5, 4, 7, 4, 9, 8, 2, 6, 3, 5, 3, 2, 9, 1, 3, 1])
plt.hist(data, bins=…)
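To see why a histogram counts as value reduction, the sketch below bins the 50 values shown above into three equal-width bins without plotting. The choice of three bins (1-3, 4-6, 7-9) is an assumption made for illustration; the slide leaves the `bins` argument elided.

```python
from collections import Counter

# The 50 sample values shown above.
data = [4, 8, 9, 8, 7, 2, 8, 7, 5, 3, 1, 4, 5, 8, 7, 9, 5, 9, 9, 5, 9, 1, 9, 7, 1,
        2, 9, 5, 5, 5, 9, 4, 3, 5, 5, 4, 7, 4, 9, 8, 2, 6, 3, 5, 3, 2, 9, 1, 3, 1]

# Map each value to a bin index: 1-3 -> 0, 4-6 -> 1, 7-9 -> 2.
bin_width = 3
counts = Counter((v - 1) // bin_width for v in data)
histogram = [counts[i] for i in range(3)]

# 50 raw values are summarized by 3 bin frequencies.
print(histogram)
```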
Value Reduction - sampling
Sampling methods:
• Random sampling: without replacement, with replacement
• Cluster sampling
• Stratified sampling
Some features:
• Sampling without replacement: take n samples from the N samples of the original dataset D; every draw returns a different record.
• Sampling with replacement: take n samples from the N samples in the original dataset D, recording each one and putting it back. The same record may be drawn more than once.
• Stratified sampling: dataset D is divided into disjoint parts (layers), and each layer is randomly sampled to get the final result.
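The three schemes described above can be sketched with the standard `random` module. The dataset, the layer names "A" and "B", and the 10% sampling fraction are hypothetical; the seed is fixed only so that repeated runs give the same samples.

```python
import random

rng = random.Random(0)  # seeded only to make the sketch repeatable

# Hypothetical dataset D with N = 100 records: 80 in layer "A", 20 in layer "B".
D = [(i, "A" if i < 80 else "B") for i in range(100)]

without_repl = rng.sample(D, k=10)   # without replacement: no record repeats
with_repl = rng.choices(D, k=10)     # with replacement: repeats are possible

# Stratified sampling: split D into disjoint layers, sample 10% of each layer.
layers = {}
for record in D:
    layers.setdefault(record[1], []).append(record)

stratified = []
for members in layers.values():
    stratified += rng.sample(members, k=len(members) // 10)

print(len(set(without_repl)), len(with_repl), len(stratified))
```

Sampling each layer separately keeps the 80:20 proportion of the layers in the result, which plain random sampling only achieves on average.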
Random Sampling
Without replacement:
iris_df.sample(n = 10)
iris_df.sample(frac = 0.3)
With replacement:
iris_df.sample(n = 10, replace = True)
iris_df.sample(frac = 0.3, replace = True)
Stratified Sampling
Source
>>> A = iris_df[iris_df.target == 0].sample(frac = 0.3)
>>> B = iris_df[iris_df.target == 1].sample(frac = 0.2)
>>> pd.concat([A, B])  # DataFrame.append() was removed in pandas 2.0
Summary