003-FIN7790 (Part2)


We paused our lecture after covering the normalization topic last week.

109
Questions?

• Should we do normalization before or after the train/test split?

110
Impute Missing Values with sklearn

• SimpleImputer and IterativeImputer classes in scikit-learn can be used


to impute missing values in numerical columns
• For categorical features, scikit-learn provides SimpleImputer with strategy='most_frequent'
• strategy='most_frequent' replaces missing values with the most frequent value in the column.
• IterativeImputer estimates missing values by modeling each feature
with missing values as a function of other features

111
SimpleImputer

from sklearn.impute import SimpleImputer
import pandas as pd

# Create a sample DataFrame with a categorical column
data = {'Column1': ['A', 'B', None, 'A', 'A', 'C', None, 'B']}
df = pd.DataFrame(data)
# Create a SimpleImputer object with strategy='most_frequent'
imputer = SimpleImputer(missing_values=None, strategy='most_frequent')
# Fit the imputer on the column and transform the data
imputed_column1 = imputer.fit_transform(df[['Column1']])
# Replace the original column with the imputed values
df['Column1'] = imputed_column1
print(df)

112
SimpleImputer

from sklearn.impute import SimpleImputer
import pandas as pd

# Create a sample DataFrame
data = {'Column1': [1, 2, 3, None, 5, 6, None, 8, 9]}
df = pd.DataFrame(data)
# Create a SimpleImputer object with strategy='mean'
imputer = SimpleImputer(strategy='mean')
# Fit the imputer on the column and transform the data
imputed_column1 = imputer.fit_transform(df[['Column1']])
# Replace the original column with the imputed values
df['Column1'] = imputed_column1
print(df)

113
IterativeImputer

import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer


# Create a sample dataframe with missing values
data = {'Column1': [1, 2, 3, None, 5],
'Column2': [4, None, 6, 7, 8],
'Column3': [9, 10, None, 12, 13]}
df = pd.DataFrame(data)
# Initialize the IterativeImputer
imputer = IterativeImputer()
# Fit and transform the dataframe to impute missing values
imputed_data = imputer.fit_transform(df)
# Convert the imputed data back to a dataframe
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
# Print the imputed dataframe
print(imputed_df)

114
Encoding Numeric Data
(Data Discretization/Binning)

115
Data Discretization

• Discretization / Binning
transforms continuous
features into discrete
features by creating a set
of contiguous intervals
(bins) spanning the value
range.

116
What's the benefit of discretization?

117
Data Discretization

• Simplification
• Reduces the complexity of data, making it easier to understand and analyze

• Noise reduction
• can reveal patterns that might not be apparent in continuous data when
relationships aren't linear

• Dealing with outliers


• Outliers are grouped with inlier values in nearby intervals

118
Data Discretization

• Creating bins to convert numerical


values into categories helps represent
the "price" feature, ranging from
5,000 to 45,000, more effectively.

price | 5k, 10k, 12k, 12k | 30k, 31k | 39k, 44k, 44.5k
bins  | 1                 | 2        | 3

119
Data Discretization
Bin with equal width
• Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
• Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]

Bins with equal frequency (equal depth)


• Input:[5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
• Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
120
Equal Width Discretization

The feature values are sorted into intervals of the same width. The number of intervals is decided arbitrarily.
Width = (Max(x) – Min(x) )/ Bins

# linspace returns evenly spaced numbers over a specified interval
# 4 edge values are needed to define 3 bins
bins = np.linspace(min(df["price"]), max(df["price"]), 4)
group_names = ['Low', 'Medium', 'High']
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
121
Encoding Categorical Data

122
Encoding Categorical Data

• Values of categorical features are often encoded as "strings"


• The act of replacing strings with numbers is called categorical
encoding.
• Algorithm Compatibility: Many machine learning algorithms work
better with numerical data.
• Improved Model Performance: Encoding can lead to better
understanding and learning by the model, boosting overall
performance.

123
Turning categorical into quantitative variables (Categorical → Numeric)
• Problem
  • Most statistical models and some ML algorithms cannot take objects/strings as input.
• Solution
  • Add dummy variables for each unique category
  • Assign 0 or 1 in each category

"One-hot encoding"
124
Dummy Variables

• Dummy variables take the value of 0 or 1 to indicate the absence or


presence of a category
• Consider a multiple regression analysis for salary income determination
• Gender – categorical, e.g. ‘M’/’F’
• Years of education – numeric
• In order to see if gender has an effect on wages, we would create two dummy
variables, Female_ and Male_. When the person is female, the value in the Female_
is ‘1’ and the value in the Male_ is ‘0’.

125
Dummy Variables in Python pandas
• A dummy variable is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect
• Convert categorical variable (nominal variable) to dummy variables (0 or 1)
• Use pandas.get_dummies() method to map all values in a column to multiple columns

pd.get_dummies(df['fuel'])

encoded_columns = pd.get_dummies(data['column'])
data = data.join(encoded_columns).drop('column', axis=1)

126
Dummy Variable Traps

• When working with dummy variables, it is important to avoid the dummy variable trap.
• The trap occurs when independent features are multicollinear, i.e. highly correlated.
• To avoid the dummy variable trap, drop one of the dummy variables (see the sketch below).
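In pandas this can be done with the drop_first argument of get_dummies; a minimal sketch reusing the 'fuel' column from the previous slide:

# k categories are encoded with k-1 dummy columns
encoded_columns = pd.get_dummies(df['fuel'], drop_first=True)
df = df.join(encoded_columns).drop('fuel', axis=1)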

127
Variable with k-categories can be captured using k-1 dummy variables

• A categorical feature with k categories (k different values) can be encoded


in k-1 dummy variables
• For Gender, k is 2 (M/F), therefore only one dummy variable (k-1) is needed to
capture all of the information
• For a color variable that has 3 categories (red/yellow/green), 2 (k-1 =2) dummy
variables are needed.
• In some cases, you still want to encode with k dummy variables
• When training with decision trees, each run might not evaluate the entire feature
space at the same time.
• When determining the importance of each category within a feature

128
Encoding Ordinal Data

129
Ordinal Encoding

• Use ordinal encoding for categorical variables that


have a natural rank ordering
• mapping each unique label (with relative order) to an
integer value

• Encode the categories with digits from 1 to k, where k is the number of distinct categories of the variable
• Encode “Small/Medium/Large” in a numeric with
order (1/2/3)
• Machine learning algorithms can harness the ordering
relationship.

130
Python Ordinal Encoding

from sklearn.preprocessing import OrdinalEncoder
import numpy as np
import pandas as pd

cols = ['original_column']
mapping = [['small', 'medium', 'large']]
# define ordinal encoding with the explicit category order
encoder = OrdinalEncoder(categories=mapping, dtype=np.int32)
encoder.fit(data[cols])

encoding = pd.DataFrame(encoder.transform(data[cols]), columns=cols)

131
Features Construction

132
Feature Construction

Feature construction (a.k.a. feature creation or feature generation) can enhance the performance of machine learning models.
1. Aggregating multiple existing features together can create a new
feature.
2. Creating interaction features by combining existing ones through
mathematical operations like multiplication (area = length * width)
3. Creating polynomial features involves generating new features based
on an existing feature.
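A minimal sketch of points 1 and 2 above, using hypothetical column names:

import pandas as pd

feat_df = pd.DataFrame({'length': [2.0, 3.0, 4.0],
                        'width': [1.0, 1.5, 2.0],
                        'q1_sales': [100, 120, 90],
                        'q2_sales': [110, 130, 95]})
# 1. aggregate two existing features into a new one
feat_df['half_year_sales'] = feat_df['q1_sales'] + feat_df['q2_sales']
# 2. interaction feature created by multiplication
feat_df['area'] = feat_df['length'] * feat_df['width']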

133
Polynomial Features

• A combination of one feature with itself (i.e. a polynomial


combination of the same feature) can be quite informative and can
increase the predictive power of the predictive algorithms
• e.g., if a dataset has a feature X, then a polynomial feature would be the
addition of a new feature where values are calculated by squaring the values
in X, e.g. X^2.

• With similar logic, polynomial combinations of the same or different


features (interaction features) can return new variables that convey
additional information and capture feature interaction
134
Polynomial Features

In the plot on the left, due to the quadratic relationship between the target (y) and the variable (x), there is a poor linear fit.

In the plot on the right, the x² variable (a quadratic combination of x) shows a linear relationship with the target (y). It therefore improves the performance of the linear model, which predicts y from x².

135
Polynomial and Interaction Features

• The degree of the polynomial is used to control the number of features


added, e.g. a degree of 3 will add two new variables for each input variable

• Typically a small degree, such as 2 or 3, is used


• 2nd-degree polynomial combinations return the following new features:
• [a, b, c]² → [1, a, b, c, ab, ac, bc, a², b², c²]
• including all possible interactions of degree 1 and degree 2, plus the bias term 1

• There is an "include_bias" argument that defaults to True to include the bias feature (value of 1)

136
Polynomial Features Transform

• Use the PolynomialFeatures class in scikit-learn
• The features created include:
  • The bias (the value of 1.0)
  • Values raised to a power for each degree (e.g. x^1, x^2, x^3, …)
  • Interactions between all pairs of features (e.g. x1 * x2, x1 * x3, …)

# demonstrate the types of features created
from numpy import asarray
from sklearn.preprocessing import PolynomialFeatures
# define the dataset
data = asarray([[2, 3], [2, 3], [2, 3]])
print(data)
# perform a polynomial features transform of the dataset
trans = PolynomialFeatures(degree=2)
data = trans.fit_transform(data)
print(data)

[[2 3]
 [2 3]
 [2 3]]
[[1. 2. 3. 4. 6. 9.]
 [1. 2. 3. 4. 6. 9.]
 [1. 2. 3. 4. 6. 9.]]  <= '1' is the bias feature

• Add a non-linear interaction term SMA * RSI:
SMAXRSI = msft_df['14-day SMA'] * msft_df['14-day RSI']
137
Curse of Dimensionality
• "Dimensionality" refers to the number of features (i.e., input variables) in your dataset.
• The Curse of Dimensionality refers to challenges in analyzing high-dimensional data, where some models may perform poorly.
• More features often require more samples to represent the space adequately.
• Feature selection and extraction are necessary before training machine learning models to prevent overfitting.
• Feature selection keeps a subset of original features, while feature extraction creates new ones.

[Figure: model performance vs. dimensionality (number of features), with an optimal number of features at the peak]

138
Features Selection & Extraction

139
Feature Selection
• Many features in a dataset contain little information
• Only some features are meaningful and have high predictive power
• Meaningful features are independent of each other
• Filter irrelevant or redundant features and keep only the best subset from an existing set of features without loss of information.

§ Simplifying the models
§ Shortening time for model construction
§ Avoiding the curse of dimensionality
§ Enhancing model generalization

[Figure: All Features → Feature Selection → Features Selected]
140
Feature Selection Methods

Three categories of feature selection methods:
• Filter methods select features regardless of any machine learning algorithm.
• Wrapper methods select features based on the performance of a specific model with a greedy search in a forward/backward manner.
• Model-based methods (embedded methods) select features during the model construction procedure.

[Figure: Feature Selection → Filter Methods / Wrapper Methods / Model-based Methods]

141
Feature Selection Methods

• Filter Method – Use Proxy measure


• Correlation
• Chi-square test
• Information gain
• Wrapper Method – Use Predictive model
• Stepwise Selection
• Forward elimination approach
• Backward elimination approach
• Embedded Method- Select features during model building
• Regularization methods such as Decision Tree, LASSO, Ridge Regression, etc.

142
Correlation Approach

• Calculate the correlations between the features

• Remove input features that are highly correlated with others (i.e. their values change similarly to another feature's), as sketched below.
• These features provide redundant information.

• For example, if you had a real-estate dataset with 'Floor Area (Sq. Ft.)'
and 'Floor Area (Sq. Meters)' as separate features, you can safely
remove one of them.
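A common pandas pattern for this; a sketch only, where the features DataFrame x and the 0.9 threshold are assumptions:

import numpy as np

# absolute pairwise correlations between the input features
corr = x.corr().abs()
# keep the upper triangle so each pair is considered only once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# drop one feature from every pair whose correlation exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
x_reduced = x.drop(columns=to_drop)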
143
Correlation Approach

Highly correlated features can be detected by judging the correlation coefficients.


• Pearson correlation coefficient:
• Measures the linear correlation between two variables

• Spearman rank correlation coefficient:


• Measures if the relationship between two variables is monotonic.

• Kendall rank correlation coefficient:


• Measures the ordinal association between two variables.

• Use pandas DataFrame.corr(method='pearson'/'kendall'/'spearman') to compute the three different types of correlation coefficients.
144
Linear Relationship vs Monotonic Relationship

[Figure: a monotonically increasing function, a monotonically decreasing function, and a function that is not monotonic]

145
Comparison of Pearson and Spearman Coefficients

[Figure: five scatter plots comparing the coefficients]
• Pearson = +1, Spearman = +1
• Pearson = +0.851, Spearman = +1
• Pearson = −0.093, Spearman = −0.093
• Pearson = −1, Spearman = −1
• Pearson = −0.799, Spearman = −1

If the relationship between variables is monotonic, the Spearman coefficient provides a more accurate measurement.
146
Kendall Rank Correlation

• Tests similarities in the ordering of data when it is ranked.
• Uses pairs of observations as the basis and determines the strength of association based on the pattern of concordance and discordance between the pairs.
• Concordant: a pair of observations is considered concordant if (x2-x1) and (y2-y1) have the same sign.
• Discordant: a pair of observations is discordant if (x2-x1) and (y2-y1) have opposite signs.
• The p-value is more accurate with smaller sample sizes.
• Example:
• Correlation between a student’s exam grade (A,B,C..) and the time spent studying put in
categories (<2 hours, 2-4 hours, 5-7 hours, etc.)

147
Python: Correlation-based Feature Selection

# import the libraries and packages


import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# load the credit default dataset as an example
df = pd.read_csv('https://raw.githubusercontent.com/samsontai/Data/main/credit_default.csv')
# examine the data
df.set_index('Id', inplace=True)
df.head().T

149
Python: Correlation-based Feature Selection

150
Python: Correlation-based Feature Selection

Partition the dataset into features and target


# assumes no preprocessing is needed
# partition the dataset into features & target
target = 'Credit Default'
# x has all the input features used to predict the Credit Default
# y is the target
x = df.drop(target, axis =1)
y = df[target]
#show the normalized value counts of the target
y.value_counts(normalize=True)

0 0.727728
1 0.272272
Name: Credit Default, dtype: float64

151
Python: Correlation-based Feature Selection
# show the pearson correlation coefficient matrix of all the
# features in the credit default dataset
df.corr(method='pearson')

152
Python: Correlation-based Feature Selection

import matplotlib
import seaborn as sns
# show the correlation
# heatmap
sns.heatmap(df.corr())

153
Python: Correlation-based Feature Selection
# list the correlation coefficient of each feature against the target, Credit Default
df.corr()['Credit Default']

# now flag the features whose absolute coefficient against the target is > 0.2
df.corr()['Credit Default'].abs() > 0.2

154
Detect Irrelevant Features

• Correlation Coefficient is a common way to detect irrelevant features


(numeric data type) with the target (numeric data type)

• There are other mathematical methods that can be used, depending on the type (numeric or categorical) of the features and the target
• Categorical Feature, Categorical Target — Chi-Square Test
• Categorical Feature, Continuous Target — ANOVA

155
Categorical Feature Categorical Target — Chi-Square
Test

• The essence of the Chi-square test is assuming there is no


relationship between the input and output (target), and the test
is to check how valid that assumption is.

• A low p-value rejects that assumption, i.e. the feature and the target are likely related.
• Feature selection is made by selecting features with low p-values.

156
Chi-Squared Test
from scipy.stats import chi2_contingency
import pandas as pd

# Example data
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
        'Smoker': ['Yes', 'No', 'No', 'Yes', 'Yes', 'No']}
df = pd.DataFrame(data)
# Create cross-tabulation table
crosstab_table = pd.crosstab(df['Gender'], df['Smoker'])
# Perform chi-square test
chi2, p, _, _ = chi2_contingency(crosstab_table)
# Print the results
print("Chi-square statistic:", chi2)
print("p-value:", p)

158
Categorical Feature Continuous Target — ANOVA

• Analysis of Variance (ANOVA) compares the means of different groups


(mean of responses for each categorical feature data) and tests
whether the intergroup difference is statistically significant.
• If a feature is relevant, we expect to see significant differences
between the means of different groups

159
ANOVA Test
import numpy as np
import scipy.stats as sst

# no relationship between X and Y
X = np.random.randint(3, size=1000)
Y = np.random.rand(1000)
zero = Y[X == 0]
one = Y[X == 1]
two = Y[X == 2]
result = sst.f_oneway(one, two, zero)
print(result)
# F_onewayResult(statistic=0.7361644252650903, pvalue=0.479207591128618)
# large p-value -> the relationship is insignificant

# X is part of Y
X = np.random.randint(3, size=1000)
Y = np.random.rand(1000) + 0.1 * X
zero = Y[X == 0]
one = Y[X == 1]
two = Y[X == 2]
result = sst.f_oneway(one, two, zero)
print(result)
# F_onewayResult(statistic=32.98176360162701, pvalue=1.349374223612499e-14)
# small p-value -> reject the null hypothesis -> the feature is relevant
160
The Choice of Feature Selection Algorithm depends on the
nature of the input features and the output target

161
Wrapper Method - Stepwise Approach

Stepwise search is a supervised feature selection method based on


sequential search.
• Forward stepwise search
• Start without any features. Then, train a 1-feature model using each of your
candidate features.
• Continue adding features, one at a time, until your performance improvements stall.
• Backward stepwise search
• Start with all features in the model and then remove one at a time until performance
starts to drop substantially.
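scikit-learn implements this idea in SequentialFeatureSelector; a minimal sketch, where the estimator, the number of features to select, and the x/y data are assumptions:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# forward stepwise search: add one feature at a time based on CV performance
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,
                                direction='forward')
sfs.fit(x, y)
print(sfs.get_feature_names_out())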

162
Recursive Feature Elimination Method
Recursive Feature Elimination is a backward stepwise selection algorithm to select predictors
• Search for a subset of features by starting with all features in the training dataset and then removing features
• Fit the ML algorithm, rank features by importance, and discard the least important features
• Refit the model

# Wrapper method, RFE with sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
rfe_selector = RFE(estimator=LogisticRegression(),
                   n_features_to_select=2, step=1, verbose=0)
rfe_selector.fit(x_numeric,y)
rfe_selector.get_feature_names_out()
163
Embedded Method – Tree based Feature Importance

• Tree-based algorithms and models (e.g. random forest) provide feature importance information
• Feature importance tells us which
variables are more important in making
accurate predictions on the target
variable/class.
• Table on the right is the output from
XGBoost, which contains feature names
and their feature importance score.
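A sketch of retrieving feature importance from a tree-based model in scikit-learn (a RandomForestClassifier is used here for illustration; the x/y data are assumptions, and the table on the slide itself comes from XGBoost):

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(x, y)
# rank the features by their importance scores
importances = pd.Series(model.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False))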
164
Feature Extraction

• Feature extraction is for creating a new, smaller set of features that still captures most of the useful information.
• Some ML algorithms already have built-in feature extraction.
• The best example is Deep Learning, which extracts increasingly useful
representations of the raw input data through each hidden neural
layer.
• Feature extraction can be unsupervised (i.e. PCA) or supervised (i.e.
Linear Discriminant Analysis LDA).

166
Principal Components Analysis

• Extract the important information from a multivariate dataset and express this
information as a set of a few new variables called principal components
• Principal components explain most of the patterns and latent structures observed
in the original dataset
• Often possible with only a few principal components
• These principal components are orthogonal, which means that they are uncorrelated
• They are ranked in order of their “explained variance.”
• The first principal component (PC1) explains the most variance in your dataset, PC2
explains the second-most variance, and so on.

167
Principal Components Analysis

• PCA is used to reduce the dimensionality of a multivariate dataset by


limiting the number of principal components to keep based on cumulative
explained variance.
• For example, you might keep only as many principal components as needed
to reach a cumulative explained variance of 90%.
• It is recommended to normalize the dataset before performing PCA, especially when variables are measured in different units or scales.
• Otherwise, the features that are on the largest scale would dominate your new
principal components

168
The majority of the variance in the original dataset can be effectively explained by a few principal components

[Figure: explained variance by principal component]

169
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components)
that can be best used to represent data
• Normalize input data: Each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the k principal component vectors
• The principal components are sorted in order of decreasing “significance” or strength
• Since the components are sorted, the size of the data can be reduced by eliminating the
weak components, i.e., those with low variance (i.e., using the strongest principal
components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only

170
Python – Using PCA

1. Select all the numeric columns (X) except the target variable(Y) “price”
2. Scale the numeric values which is the important step before applying PCA
3. Instantiate PCA
4. Determine transformed features
5. Determine explained variance using explained_variance_ration_ attribute

list(df.select_dtypes(['float']).columns)
x = df.select_dtypes(include=('float64', 'integer'))
x.drop('price',axis=1,inplace=True)
# performing preprocessing part
# standardize the range of values of features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)
171
Python Using PCA
# Applying PCA function, limit the components to 4
from sklearn.decomposition import PCA
pca = PCA(n_components = 4)
x = pca.fit_transform(x)
explained_variance = pca.explained_variance_ratio_
explained_variance
# Cumulative sum of eigenvalues; This will be used to create step plot
# for visualizing the variance explained by each principal component.
cum_sum_eigenvalues = np.cumsum(explained_variance)

array([0.44092527, 0.25246413, 0.1056416 , 0.05037744])

172
Python Using PCA
# Create the visualization plot
plt.bar(range(0, len(explained_variance)), explained_variance,
        alpha=0.5, align='center', label='Individual explained variance')
# plot the cumulative eigenvalues
plt.step(range(0, len(cum_sum_eigenvalues)), cum_sum_eigenvalues,
         where='mid', label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
The 4 components can explain about 85% of the variance in the original dataset.
173
Python Using PCA
# Reduce the dimensionality to four independent variables
x_trimmed = pd.DataFrame(x)
x_trimmed.head()

174
Sample Code

• Understand the different types of correlation methods: Pearson, Kendall, and


Spearman using the S&P500, GDP, Gold, and Oil Price as examples.
• The S&P 500, Gold and Oil price are from Yahoo Finance, while the GDP info is
from the World Bank.

• FIN7790_2023_Correlation-SP100-GDP.ipynb

Reference: https://www.learnpythonwithrune.org/pandas-correlation-methods-explained-pearson-kendall-and-spearman/
175
Reference

• Zheng, A., & Casari, A. (2018). Feature engineering for machine learning:
principles and techniques for data scientists. " O'Reilly Media, Inc.".
• Galli, S. (2020). Python feature engineering cookbook: over 70 recipes for
creating, engineering, and transforming features to build machine learning
models. Packt Publishing Ltd.
• A Short Guide for Feature Engineering and Feature Selection (download from
Moodle)

176
A short Guide for
Feature Engineering
and Feature
Selection

177
END of Part 1

178
Part 2 (Week 4):
Financial Time Series Data Preparation for ML Prediction
Modeling

179
Time Series Forecasting Model

180
Time Series Forecasting (Forecasting Horizon)

181
Time series data set as supervised learning problem

Transform the time series data set into a format suitable for machine learning prediction: input X, target Y. A minimal sketch of this framing follows.
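A sketch of this transformation with pandas, assuming a univariate price series such as df['Close'] and 3 lag features:

import pandas as pd

series = df['Close']
frame = pd.DataFrame({'y': series})
# lagged copies of the series become the input features X
for lag in range(1, 4):
    frame['lag_{}'.format(lag)] = series.shift(lag)
frame = frame.dropna()
X = frame[['lag_1', 'lag_2', 'lag_3']]
y = frame['y']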
182
1-Step Forecast

1-Step Forecast

183
Multi-step Forecast

• A 1-step forecast is not very useful


• Imagine a weather channel that only shows 1 day ahead!
• A brick-and-mortar shop might forecast the sales of a product next
month (from monthly data) to purchase inventory and fulfill the
demand
• Forecast horizon = number of steps to forecast
• E.g. daily sales next week, daily temperature next 7 days

184
2 ways to produce multi-step forecasts

1. Incremental method (can be done with any 1-step predictor)


2. Multi-output forecast (limited to certain algorithms)
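A sketch of the incremental (recursive) method in point 1, assuming a 1-step model already fitted on p = 3 lag features and a training series train_series (both names are assumptions):

import numpy as np

history = list(train_series[-3:])          # last p observed values
forecasts = []
for step in range(3):                      # forecast horizon h = 3
    x_input = np.array(history[-3:]).reshape(1, -1)
    y_hat = model.predict(x_input)[0]      # any fitted 1-step predictor
    forecasts.append(y_hat)
    history.append(y_hat)                  # feed the prediction back in as input
print(forecasts)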

185
Univariate time series as multi-step supervised learning

Multi-output Forecast

186
Incremental Multi-step forecast

Let h = 3

h: prediction horizon
187


Incremental Multi-step forecast - Example

Let p = 3, h = 3

p: lagged time period


188
Multi-output Multi-step Forecast

Some ML models and deep neural networks can do this

189
Multistep/Multi-output Forecasting using Linear Regression

• FIN7790-2023-ML-NoDiff-AirPassenger.ipynb
• FIN7790-2023-ML-Diff-AirPassenger.ipynb

The two sample codes both use the Linear Regression Algorithm to develop the
Forecasting Models

190
Common Time Series Data Problems
• Restructuring the Timestamp format
• Convert into a standard date-time format
• Fixing Missing Values
• Linear interpolation
• Forward filling
• Backward filling
• Imputation using Mean, Median, or Mode within a period
• Removing Outliers
• Ceiling/Flooring (Min/Max)
• Denoising Features

191
Fixing the Date-Time Format

Incomplete and varying timestamps:
yyyy-mm-dd HH:MM:SS (includes time)
yyyy-dd-mm (year, day, month)
yyyy-mm-dd (year, month, day)
yyyy-mm (no day)
mm-dd (no year)
ss:s (seconds)

UTC, local, and time zones:
Is the time in UTC format? e.g. 2020-06-02T13:15:30Z

192
Date-time Column

• Make sure the date-time column has the date-time datatype

import pandas as pd
passenger = pd.read_csv('AirPassengers.csv')
passenger['Date'] = pd.to_datetime(passenger['Date'])
# Below line of code sorts the values according to dates
passenger.sort_values(by=['Date'], inplace=True, ascending=True)

193
Handling Missing Data in Time Series

Forward Fill: last known value == missing value
Backward Fill: next value == missing value
Moving Average: avg(previous values) == missing value
Interpolation: linear, spline, polynomial

194
Interpolation approach for Time Series Data

• Use the interpolation method to deal


with missing data in the time-series
data
• using the known values on either side of a data gap to estimate what's missing.

195
Interpolation approach for Time Series Data

import pandas as pd
import numpy as np

# creating a simple dataframe with some missing values


data = {'A': [1, 2, np.nan, 4, 5, np.nan, 7, 8, 9, 10],
'B': [np.nan, 100, 101, np.nan, 103, 104, np.nan, 106, 107, 108]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# using interpolation to fill missing values


df_interpolated = df.interpolate(method='linear')

print("\nAfter interpolation:")
print(df_interpolated)

196
Forward/Backward Fill

• Forward fill takes the previous row's value and fills the next row.
• Backward fill takes the next row value and fills the previous row.

# forward fill
df.fillna(method='ffill', inplace=True)

# backward fill
df.fillna(method='bfill', inplace=True)

197
Interpolation in Pandas

import pandas as pd
import matplotlib.pyplot as plt

# Read the AIA market data from csv
df = pd.read_csv('1299.csv')
df.set_index('Date', inplace=True)

# Return the number of rows that have missing values


missing = df['Close'].isna().sum()

#Interpolate linearly within missing windows


df['Close_interp'] = df['Close'].interpolate('linear')

#Plot the interpolated data in red


plt.figure(figsize=(15,5))
df['Close_interp'].plot(grid=True,c='r')
plt.ylabel('Price ($)')
plt.title('AIA Price')

198
Visualizing the interpolated data

199
Outlier Detection
• Using the mean and standard deviation of the entire series is not recommended for
outlier detection because the boundaries would remain fixed in that case.
• The Rolling Statistical Bound-based approach creates boundaries on a rolling basis and is
effective and straightforward for outlier detection.
• For example, define the upper and lower bound as:
Upper Bound = Rolling Mean + 3 x (Rolling Standard Deviation)
Lower Bound = Rolling Mean - 3 x (Rolling Standard Deviation)
Rolling mean is the mean for a window of previous observations.

• Outliers in the data can be effectively identified by calculating these bounds using a
rolling window.
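A sketch of the rolling-bound approach, where the 20-observation window and the 'Close' column are assumptions:

# rolling statistics over a window of previous observations
roll_mean = df['Close'].rolling(window=20).mean()
roll_std = df['Close'].rolling(window=20).std()
upper_bound = roll_mean + 3 * roll_std
lower_bound = roll_mean - 3 * roll_std
# flag the points that fall outside the rolling bounds
outliers = (df['Close'] > upper_bound) | (df['Close'] < lower_bound)
print(df.loc[outliers, 'Close'])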

200
Denoising a Time Series

• Rolling Means can


minimize the noise in
time series data
• Rolling mean is mean
for a window of
previous observations.

201
Outlier Handling: Smoothing data with Rolling Mean

[Figure: outliers in the raw series smoothed by a rolling-mean smoothing function]

202
Bollinger Bands
Rolling Mean and
Rolling Standard
Deviation:

Bollinger Bands use 2


parameters, Period
(rolling window) and
Standard Deviations.
The default values are
20 for the period, and
2 for standard
deviations.
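A sketch of computing the bands with the default parameters (20-period window, 2 standard deviations), assuming a 'Close' price column:

mid_band = df['Close'].rolling(window=20).mean()
band_std = df['Close'].rolling(window=20).std()
df['bb_upper'] = mid_band + 2 * band_std
df['bb_lower'] = mid_band - 2 * band_std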
203
Rolling Mean with Percent_change in Python
# percent_change is a user-defined helper (assumed definition)
def percent_change(values):
    values = np.asarray(values)
    previous = values[:-1]
    return (values[-1] - np.mean(previous)) / np.mean(previous)
# Calculate % change over a 20-observation rolling window
df['Close_interp_pct'] = df['Close_interp'].rolling(window=20).aggregate(percent_change)
# Plot the raw data and the % change
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
ax = df['Close'].plot(ax=axs[0])
ax = df['Close_interp_pct'].plot(ax=axs[1])

Stabilize the mean and


variance

204
Feature Construction for Time Series Data

205
Constructing Features Over Time

1. Creating features over time: develop specific features (e.g. timestamp-based features) that are useful in time series analysis.
2. Extracting features with windows: e.g. using the rolling windows technique to extract features
206
Construct New Features

• Extract Date based features


• Create Time-lag features
• Create Rolling features

207
Extracting date-based features

• In addition to statistical features like "mean" and "standard deviation", other features can be derived from time series data
• Timeseries data often has “human” features associated with it, like
days of the week, holiday, etc.
• These features are often useful when dealing with timeseries data
that span multiple years

208
Extracting date-based features
• Date columns usually provide valuable
information
• Extracting the parts of the date into
different columns: Year, Quarter, Month,
Day, etc.
• Extracting some other specific features
from the date: Name of the weekday,
Weekend or not, Holiday or not, etc.
209
Extracting date-based features
import pandas as pd
from datetime import date
data = pd.DataFrame({'date':['01-01-2017','04-12-2008','23-06-1988','25-08-1999','20-02-1993']})
#Transform string to date
data['date'] = pd.to_datetime(data.date, format="%d-%m-%Y")
#Extracting Year
data['year'] = data['date'].dt.year
#Extracting Month
data['month'] = data['date'].dt.month
#Extracting passed years since the date
data['passed_years'] = date.today().year - data['date'].dt.year
#Extracting passed months since the date
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month
#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()

210
Creating a Time-lagging features
• Create time-shifted versions of the data by "rolling" the time series
data either into the future or into the past
• In Pandas, use the shift method of a DataFrame.
• DataFrame.shift(5)
• Positive values roll the data backward, while negative values roll the data
forward.
• The same index of data will have data from different timepoints in it.

211
Time-shifted DataFrame

Positive values roll the data backward (shift the values 3 index positions towards the past):
msft_df.shift(3)

Negative values roll the data forward:
msft_df.shift(-3)

212
Create several time lagged version of the data..

# data is a pandas Series containing time series data


data = pd.Series(...)
# Shifts
shifts = [0, 1, 2, 3, 4, 5, 6, 7]
# Create a dictionary of time-shifted data
many_shifts = {'lag_{}'.format(ii): data.shift(ii) for ii in shifts}
# Convert them into a dataframe
many_shifts = pd.DataFrame(many_shifts)

213
Create several time lagged version of the data..
# data is a pandas Series containing time series data
data = df['Close']
# create 4 time-shift values
shifts = [1,2,3,4]
# Create a dictionary of time-shifted data
many_shifts = {'lag_{}'.format(ii): data.shift(ii) for ii in shifts}
# Convert them into a dataframe
many_shifts = pd.DataFrame(many_shifts)
many_shifts.shape

214
Create several time lagged version of the data..

215
Visualizing the Correlation Coefficients of the Time-lagged Features
# First dropping the missing values
many_shifts.dropna(inplace=True)
# merge with the df
df_result = pd.concat([data, many_shifts], axis=1, join='inner')
# Fitting the Ridge Linear Regression Model
from sklearn.linear_model import Ridge
# Fit the model using these new input features
model = Ridge()
model.fit(df_result.iloc[:, 1:5], df_result['Close'])
# Visualize the fit model coefficients
fig, ax = plt.subplots()
ax.bar(df_result.iloc[:,1:5].columns, model.coef_)
ax.set(xlabel='Coefficient name', ylabel='Coefficient value')
# Set formatting so it looks nice
plt.setp(ax.get_xticklabels(), rotation=45,
horizontalalignment='right')

216
Use the Time-lagging features as the predictor variables
(Autoregression)

• You can create multiple values in the past as input features for a time-series
machine-learning model
• You need to assess how auto-correlated these new signals are with a given time point
• That is, how correlated is a time point with its neighboring time points (called
autocorrelation).

• The amount of auto-correlation in data will impact the machine-learning model


• How to determine the lag features you need?
• ACF and PACF

217
ACF and PACF Plot
ACF and PACF are plots that
summarize the strength of a
relationship with an observation
in a time series with
observations at prior time steps.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(msft_df['Close'])
plot_pacf(msft_df['Close'])
plt.show()

218
Create multiple rolling features using aggregate
method
• It is common to define rolling features for min, max, mean and
standard deviation.
• We can use the rolling window and aggregate functions to calculate
all these features
• In pandas, the aggregate method can be used to calculate many
features of a window at once.

219
Using .aggregate for feature extraction

• By passing a list of functions to the aggregate method, each function


will be called on the window, and collected in the output.
• Aggregation can be applied one or more column. Most frequently
used aggregations are:
• sum: It is used to return the sum of the values for the requested axis.
• min: It is used to return the minimum of the values for the requested
axis.
• max: It is used to return the maximum values for the requested axis
220
Using .aggregate for feature extraction
Here's an example:
1. first use the dot-rolling method to define a rolling window,
2. then pass a list of three functions (for the standard deviation, maximum and mean value).
3. This extracts three features for each column (Close) over time.

# Calculate a rolling window, then extract three features
feats = msft_df['Close'].rolling(20).aggregate([np.std, np.max, np.mean]).dropna()
# print some rows
print(feats.head(4))
# plot lines -> refer to next page
plt.figure(figsize=(15,5))
plt.plot(feats['mean'], label="mean")
plt.plot(feats['std'], label="std")
plt.plot(feats['amax'], label="amax")
plt.legend()
plt.show()

221
Check the properties of the features

222
Supplement: Working with Time Series Data Using
Python

223
Download Microsoft Data from Yahoo

import yfinance as yf
msft_df = yf.download('MSFT',
                      start='2014-01-01',
                      end='2021-12-31',
                      progress=False)
# output the data to a csv file
msft_df.to_csv('msft.csv')

224
Working with Time Series data using Python

1. use the describe() function to get some useful summary statistics


about your data
# Return first rows of aapl
aapl.head()
# Describe aapl
aapl.describe()

225
Inspecting and saving Timeseries data to a CSV file
# Inspect the columns
print(aapl.columns)
# Inspect the index and make sure the index is there
print(aapl.index)
# Inspect the columns
print(aapl.columns)
# Select only the last 10 observations of `Close`
ts = aapl['Close'][-10:]
# Check the type of `ts`
type(ts)

aapl.to_csv('data/aapl_ohlc.csv')
df = pd.read_csv('data/aapl_ohlc.csv', header=0, index_col='Date',
parse_dates=True)

226
Visualizing Time Series Data

# Import Matplotlib's `pyplot` module as `plt`


import matplotlib.pyplot as plt

# Plot the closing prices for `aapl`


aapl['Close'].plot(grid=True)

# Show the plot


plt.show()

227
Adding new column
• Adding a new column which indicates the difference between the
open and the close price of a stock
# Add a column `diff` to `aapl`
aapl['diff'] = aapl.Open - aapl.Close

228
Time Shifts

• Shifting values with periods


• Example: calculating the difference in consecutive rows

229
Time Shifts
data = [100, 101, 99, 105, 102, 103]
# Create pandas DataFrame from list
df = pd.DataFrame(data, columns=['close'])
# To shift by 1 row vertically
df.shift(periods=1)  # same as df.shift(1)
# Fill the vacated position with 0 instead of NaN
df.shift(periods=1, fill_value=0)

[Output: df.shift(1) vs. df.shift(1, fill_value=0)]

230
Difference

• Calculate the difference between two rows

231
Difference
data = [100,101,99,105,102,103]
#Create pandas DataFrame from list
df = pd.DataFrame(data,columns=['close'])
df['diff()'] = df['close'].diff()  # difference from the previous row

diff( )

232
Percentage Change
The pct_change() function is used to get the percentage change between the current and a prior row.

data = [100,101,99,105,102,103]
#Create pandas DataFrame from list
df = pd.DataFrame(data,columns=['close'])
df['pct_change()'] = df['close'].pct_change()

pct_change( )

233
Rolling

• Let’s say you have 20 days of stock data, and you want to know the
mean price of the stock for the last 5 days. What do you do?
• You take the last 5 days, sum them up, and divide by 5.
• But what if you want to know the average of the previous 5 days for
each day in your data set?

234
Rolling
import pandas as pd
import matplotlib.pyplot as plt
#Random stock prices
data = [100,101,99,105,102,103,104,101,105,102,99,98,105,109,105,120,115,109,105,108]
#Create pandas DataFrame from the list
df = pd.DataFrame(data,columns=['close'])
#Calculate a 5-period simple moving average
sma5 = df['close'].rolling(window=5).mean()
#Plot
plt.plot(df['close'],label='Stock Data')
plt.plot(sma5,label='SMA',color='red')
plt.legend()
plt.show()

235
Expanding

• Where rolling windows are a fixed size, expanding windows have a


fixed starting point, and incorporate new data as it becomes available.
• “What’s the mean of the past n values at this point in time?” – Use rolling
windows here.
• “What’s the mean of all the data available up to this point in time?” – Use
expanding windows here.
• Expanding windows have a fixed lower bound. Only the upper bound of the
window is rolled forward (the window gets bigger).

236
Expanding
#Calculate expanding window mean
expanding_mean = df.expanding(min_periods=1).mean()
#Calculate full sample mean for reference
full_sample_mean = df['close'].mean()
#Plot
plt.axhline(full_sample_mean, label='Full Sample Mean', linestyle='--', color='red')
plt.plot(expanding_mean,label='Expanding Mean',color='red')
plt.plot(df['close'],label='Stock Data')
plt.legend()
plt.show()

237
Resample Function
• Frequency conversion of the time series data, e.g. from daily to monthly
• Most commonly used time series frequency are:
• W : weekly frequency,
• M : month end frequency,
• Q : quarter end frequency.
# Resampling the time series data based on months
# we apply it on stock close price
# 'M' indicates month
monthly_aapl = aapl.Close.resample('M').mean()
# the above command will find the mean closing price
# of each month
monthly_aapl

238
Common Financial Analysis

• Returns
• Moving Average
• Volatility Calculation

239
Stock Return
• Percentage Change
• The pandas function pct_change() tells us how much the stock price gained or lost

240
Indexing the Stock Return by Time

pct_change()
241
Returns
• Using pct_change() to compute the Daily Return and Daily Log Returns

# Assign `Adj Close` as `daily_close`


daily_close = aapl[['Adj Close']]
# Daily returns using .pct_change()
daily_pct_c = daily_close.pct_change()
# Replace NA values with 0, as the pct_change value is not
# available in the first record
daily_pct_c.fillna(0, inplace=True)
# Daily log returns: the natural log of 1 plus the
# pct_change; +1 is to avoid the ln(0) which is undefined
daily_log_returns = np.log(daily_close.pct_change()+1)
# Print daily log returns
print(daily_log_returns)

242
Returns (Calculate the Percentage Change)
• Monthly and Quarterly Return with resample()
# Resample `aapl` to business months, take last observation as value
monthly = aapl.resample('BM').apply(lambda x: x[-1])
# Calculate the monthly pct change: compare the value of the last day of the month
# with the last day of the prior month
monthly.pct_change()
# Resample `aapl` to quarters, take the mean as value per quarter
quarter = aapl.resample("4M").mean()
# Calculate the quarterly percentage change
quarter.pct_change()

quarter.pct_change() =>

243
Returns
Plot the distribution of daily_pct_change:
# Import matplotlib
import matplotlib.pyplot as plt
# Plot the distribution of `daily_pct_c`
daily_pct_c.hist(bins=50)
# Show the plot
# From the graph, we can easily tell the daily change distribution; the mean is about 0.001
plt.show()
# Pull up summary statistics
print(daily_pct_c.describe())
The distribution looks
very symmetrical and
normally distributed.

244
Returns – Cumulative Return
The cumulative daily rate of return is useful to determine the value of an investment at regular intervals. You can calculate it by taking the daily percentage change values, adding 1 to them, and calculating the cumulative product of the resulting values.
# The cumprod() method returns a DataFrame with the cumulative product for
each row.
cum_daily_return = (1 + daily_pct_c).cumprod()
# Print `cum_daily_return`
print(cum_daily_return)
# Plot the cumulative daily returns
cum_daily_return.plot(figsize=(12,8))
# Show the plot
plt.show()

245
Returns – Cumulative Return
Cumulative monthly rate of return

# Resample the cumulative daily return to cumulative monthly return


cum_monthly_return = cum_daily_return.resample("M").mean()
# Print the `cum_monthly_return`
print(cum_monthly_return)
# Plot the cumulative daily returns
cum_monthly_return.plot(figsize=(12,8))
# Show the plot
plt.show()

246
Moving Average
• Simple Moving Average
• Using n past days to get the average; common values are n = 14, 50, 200
• Here's a plot showing the MSFT Close price and a 200-day simple moving
average, or SMA. You can see moving averages smooth the data.

247
Moving Windows

• Compute the statistic on a window of data represented by a particular


period of time
• Slide the window across the data by a specified interval
• The statistic is continually calculated as long as the window falls within the dates of the time series
• Pandas function: rolling.mean()
• a rolling mean smoothes out short-term fluctuations and highlight
longer-term trends in data.
248
Moving Windows
# Isolate the adjusted closing prices
adj_close_px = aapl['Adj Close']
# Calculate the moving average MA40
moving_avg =
adj_close_px.rolling(window=40).mean()
# Add short/long moving windows
aapl['42'] =
adj_close_px.rolling(window=40).mean()
aapl['252'] =
adj_close_px.rolling(window=252).mean()
# Plot the adjusted closing price, the
short/long windows of rolling means
aapl[['Adj Close', '42', '252']].plot()
249
Volatility
• Volatility is a statistical measure of the dispersion of returns for a given
security or market index.
• measured as either the standard deviation or variance between returns from that
same security or market index.
• In most cases, the higher the volatility, the riskier the security.
• Volatility is often associated with big swings in either direction.
• For example, when the stock market rises and falls more than one percent over a
sustained period of time, it is called a "volatile" market.
• Volatility represents how large an asset's prices swing around the mean
price—it is a statistical measure of its dispersion of returns.
250
Volatility Calculation

• The volatility of a stock is a measurement of the change in variance in


the returns of a stock over a specific period of time.
• It is common to compare the volatility of a stock with another stock
to get a feel for which may have less risk or to a market index to
examine the stock’s volatility in the overall market.
• Generally, the higher the volatility, the riskier the investment in that
stock, which results in investing in one over another.

251
Volatility Calculation
Moving historical standard deviation of the log returns

# Define the min. of periods to consider
min_periods = 75
# Calculate the volatility
vol = daily_pct_c.rolling(min_periods).std() * np.sqrt(min_periods)
# Plot the volatility
vol.plot(figsize=(10, 8))
# Show the plot
plt.show()

252
https://blog.devgenius.io/how-to-calculate-the-daily-returns-and-volatility-of-a-stock-with-python-d4e1de53e53b
Relative Strength Index

Relative strength index, or RSI oscillates between 0 and 100.


When it's close to 0, this may mean the price is due to rebound from recent lows.
When RSI is close to 100, this may mean the price of the stock is due to decline.
253
Using the functions from talib to calculate SMA and RSI

!pip install talib-binary

import talib
msft_df['sma200'] = talib.SMA(msft_df['Adj Close'].values, timeperiod=200)
msft_df['rsi200'] = talib.RSI(msft_df['Adj Close'].values, timeperiod=200)

Use TA-Lib's functions for the RSI and moving average calculations:
• provide a numpy array of prices and the argument timeperiod (n), here n = 200
• add the new feature sma200 to the DataFrame for the 200-day simple moving average
• add the feature rsi200 for the 200-day RSI

254
About TA-Lib

• TA-Lib is an open-source Python library for technical analysis.


• It offers over 150 indicators and patterns for analyzing financial market
data.
• Traders and analysts use it to develop strategies and make informed
decisions.
• It is compatible with multiple programming languages and platforms.
• TA-Lib has an active community and regular updates for improved
functionality.

255
Reference

• How to Calculate the Daily Returns And Volatility of a Stock with


Python
https://blog.devgenius.io/how-to-calculate-the-daily-returns-and-volatility-of-a-
stock-with-python-d4e1de53e53b

256
Basic Analytics for Time Series Market Data

• This sample code demonstrates how to perform basic financial


analysis, such as moving windows, and volatility calculation of time
series data for a financial instrument

Sample Code:
FIN7790_2023_tutorial_python_for_finance.ipynb

257
Other Reference Code

• DataTime Index
FIN7790-2023-TimeSeries-DateTime.ipynb
• Resampling
FIN7790-2023-TimeSeries-Resampling.ipynb
• Time Shift
FIN7790-2023-TimeSeries-Time Shifting.ipynb
• Rolling and Expanding
FIN7790-2023-TimeSeries-Rolling and Expanding.ipynb

258
Feature Extraction for Transaction Data

Account Number | Month | Transaction_Date_Time | Withdrawal Amount | Transaction Type | ATM Location
12345678 | 03 | 2022-03-01 09:30:00 | 200 | withdrawn | '123, Mongkok'
12345678 | 03 | 2022-03-02 12:15:00 | 100 | deposit | '456, YauMaTei'
12345678 | 03 | 2022-03-05 18:00:00 | 500 | deposit | '444, Central'
12345678 | 03 | 2022-03-10 10:45:00 | 100 | withdrawn | '356, Mongkok'
12345678 | 03 | 2022-03-15 14:20:00 | 500 | deposit | '444, Central'
12345678 | 04 | 2022-04-01 09:30:00 | 100 | withdrawn | '123, Mongkok'
12345678 | 04 | 2022-04-02 12:15:00 | 600 | deposit | '456, YauMaTei'
12345678 | 04 | 2022-04-05 18:00:00 | 350 | deposit | '444, Central'
12345678 | 05 | 2022-05-10 10:45:00 | 900 | withdrawn | '356, Mongkok'
12345678 | 05 | 2022-05-15 14:20:00 | 200 | deposit | '444, Central'
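One way to turn such transaction records into per-account, per-month features is a groupby aggregation; a sketch only, assuming the records are loaded into a DataFrame txn with the column names shown above:

features = (txn.groupby(['Account Number', 'Month'])
               .agg(txn_count=('Withdrawal Amount', 'count'),
                    total_amount=('Withdrawal Amount', 'sum'),
                    mean_amount=('Withdrawal Amount', 'mean'),
                    n_locations=('ATM Location', 'nunique'))
               .reset_index())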

259
Feature Extraction with Gini Coefficient

260
Feature Extraction for Time Series Data

• Given a time series of length N, extract a set of D features that characterizes the time
series.

• Useful for classification, clustering, etc.
• Often based on expertise: feature engineering (construction of ad hoc features from domain knowledge)
• Bag-of-features representation for time series
261
Feature Extraction for Time Series Data (Fourier Transform)

Sinusoidal wave

262
Are they the same?

263
Fourier Transform

• Fourier Transform extracts relevant features from time series data by


analyzing frequency components.
• It converts data into the frequency domain.
• Valuable features that may be difficult to capture in the time domain
can be extracted.
• NumPy provides the fft() function (np.fft.fft) for calculating the fast Fourier transform of a time series.

265
Fourier Transform

266
Fourier transform

[Figure panels: Two Sine Waves | Two Sine Waves + Noise | Frequency]

Convert a single time series into an array that describes the time series as a combination of oscillations

267
Python Sample

import pandas as pd
import numpy as np
# Create a time series
ts = pd.Series([1, 2, 3, 4, 5])
# Calculate the Fourier transform
fft = pd.Series(np.fft.fft(ts).real)
print(fft)

268
269
Questions

• What is the output of STFT (Short-time Fourier Transform)?


• How many STFT windows are there?
• What are some time-frequency representations of a customer’s
transactions used in this research paper?

270
Assignment 1
Date | Week | Class | Remark
13-Dec | 1 | Basic Machine Learning and Approach |
20-Dec | 2 | Time Series Basic |
27-Dec | 3 | Data Preparation I (Time Series Data) |
3-Jan | 4 | Data Preparation II | Assign 1 Announce
10-Jan | 5 | Supervised Learning: Regression |
17-Jan | 6 | Supervised Learning: Classification I |
24-Jan | 7 | Supervised Learning: Classification II | Assign 2 Announce; Assign 1 due
31-Jan | 8 | Unsupervised Learning I |
7-Feb | 9 | Unsupervised Learning II |
21-Feb | 10 | Neural Network |
28-Feb | 11 | Implementation Consideration |
6-Mar | 12 | Group Assignment Presentation |
13-Mar | 13 | Examination | Assign 2 due

271
Assignment 1

Problem Description:
• The bank wants to improve customer service by analyzing client data.
• The bank has clients’ accounts, transactions, and geographic data.
• Data needs to be cleaned and transformed for machine learning modeling later.
• Your initial insights from analysis can help improve services and make informed decisions.

Task:
• Three datasets will be provided.
• Perform data preparation (handling missing data and outliers, data encoding, etc.)
• Explore the data (data distribution, correlation, etc.)
• Perform feature engineering (feature selection, construction, etc.).

Submission Deadline: Jan 28, 2024, 23:59:00.

272
Details of the Assignment 1 can be found in Moodle

273
END of Part 2

274
