003-FIN7790 (Part2)
109
Questions?
110
Impute Missing Values with sklearn
111
SimpleImputer
112
SimpleImputer
113
IterativeImputer
114
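A minimal sketch of both imputers on assumed toy data (the column names are hypothetical; IterativeImputer is experimental in scikit-learn, so the enabling import is required):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
# Assumed toy data with missing values
df = pd.DataFrame({'age': [25, np.nan, 40, 35], 'income': [3000, 4200, np.nan, 5100]})
# SimpleImputer: replace NaN with a column statistic (here the mean)
simple = SimpleImputer(strategy='mean')
print(simple.fit_transform(df))
# IterativeImputer: model each feature with missing values as a function of the other features
iterative = IterativeImputer(max_iter=10, random_state=0)
print(iterative.fit_transform(df))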
Encoding Numeric Data
(Data Discretization/Binning)
115
Data Discretization
• Discretization / binning transforms continuous features into discrete features by creating a set of contiguous intervals (bins) spanning the value range.
116
What's the benefit of discretization?
117
Data Discretization
• Simplification
• Reduces the complexity of data, making it easier to understand and analyze
• Noise reduction
• Can reveal patterns that might not be apparent in the continuous data when relationships aren't linear
118
Data Discretization
Bin 1: 5k, 10k, 12k, 12k
Bin 2: 30k, 31k
Bin 3: 39k, 44k, 44.5k
119
Data Discretization
Bin with equal width
• Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
• Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
The feature values are sorted into intervals of the same width. The number of intervals is decided arbitrarily.
Width = (Max(x) – Min(x)) / Number of bins
122
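A minimal sketch of equal-width binning with pandas, using the values from this slide (pd.cut computes the width as (max − min) / number of bins):
import pandas as pd
values = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
# Equal-width binning into 3 bins: width = (215 - 5) / 3 = 70
binned = pd.cut(values, bins=3, labels=[1, 2, 3])
print(pd.concat([values.rename('value'), binned.rename('bin')], axis=1))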
Encoding Categorical Data
123
Turning categorical into quantitative variables (Categorical → Numeric)
• Problem
• Most statistical models and some ML algorithms cannot take objects/strings as input.
• Solution
• Add a dummy variable for each unique category
• Assign 0 or 1 in each category ("one-hot encoding")
124
Dummy Variables
125
Dummy Variables in Python pandas
• A dummy variable is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect
• Convert a categorical variable (nominal variable) to dummy variables (0 or 1)
• Use the pandas.get_dummies() method to map all values in a column to multiple columns
pd.get_dummies(df['fuel'])
encoded_columns = pd.get_dummies(data['column'])
data = data.join(encoded_columns).drop('column', axis=1)
126
Dummy Variable Traps
• When working with dummy variables, it is important to avoid the dummy variable trap.
• The trap occurs when independent features are multicollinear, or highly correlated.
• To avoid the dummy variable trap, drop one of the dummy variables.
127
Variable with k-categories can be captured using k-1 dummy variables
128
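A minimal sketch using a hypothetical 'fuel' column: setting drop_first=True keeps k-1 dummy columns and avoids the dummy variable trap.
import pandas as pd
# Hypothetical nominal feature
df = pd.DataFrame({'fuel': ['gas', 'diesel', 'electric', 'gas']})
# drop_first=True drops one dummy column, leaving k-1 dummies
dummies = pd.get_dummies(df['fuel'], drop_first=True)
print(dummies)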
Encoding Ordinal Data
129
Ordinal Encoding
130
Python Ordinal Encoding
131
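One way to encode ordinal data in Python is scikit-learn's OrdinalEncoder with an explicit category order; the 'size' column below is a hypothetical example, not the dataset used on the slide.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# Hypothetical ordinal feature with a known order
df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})
# Pass the categories in their meaningful order so that small=0, medium=1, large=2
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_encoded'] = encoder.fit_transform(df[['size']]).ravel()
print(df)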
Features Construction
132
Feature Construction
133
Polynomial Features
135
Polynomial and Interaction Features
136
Polynomial Features Transform
• Use the PolynomialFeatures class in scikit-learn (sklearn.preprocessing)
• The features created include:
• The bias (the value of 1.0)
• Values raised to a power for each degree (e.g. x^1, x^2, x^3, …)
• Interactions between all pairs of features (e.g. x1 * x2, x1 * x3, …)
# demonstrate the types of features created
from numpy import asarray
from sklearn.preprocessing import PolynomialFeatures
# define the dataset
data = asarray([[2,3],[2,3],[2,3]])
print(data)
# perform a polynomial features transform of the dataset
trans = PolynomialFeatures(degree=2)
data = trans.fit_transform(data)
print(data)
• Some models may perform poorly when analyzing high-dimensional data.
• More features often require more samples to represent the space adequately.
• Feature selection and extraction are necessary to avoid overfitting.
• Feature selection keeps a subset of the original features, while feature extraction creates new ones.
[Figure: model performance vs. dimensionality (number of features)]
138
Features Selection & Extraction
139
Feature Selection
• Many features in a dataset contain little information
• Only some features are meaningful and have high predictive power
• Meaningful features are independent of each other
• Filter irrelevant or redundant features and keep only the best subset from an
existing set of features without loss of information.
141
Feature Selection Methods
142
Correlation Approach
• For example, if you had a real-estate dataset with 'Floor Area (Sq. Ft.)'
and 'Floor Area (Sq. Meters)' as separate features, you can safely
remove one of them.
143
Correlation Approach
145
Comparison of Pearson and Spearman Coefficients
147
Python: Correlation-based Feature Selection
149
Python: Correlation-based Feature Selection
150
Python: Correlation-based Feature Selection
0 0.727728
1 0.272272
Name: Credit Default, dtype: float64
151
Python: Correlation-based Feature Selection
# show the pearson correlation coefficient matrix of all the
# features in the credit default dataset
df.corr(method='pearson')
152
Python: Correlation-based Feature Selection
import matplotlib.pyplot as plt
import seaborn as sns
# show the correlation heatmap
sns.heatmap(df.corr())
plt.show()
153
Python: Correlation-based Feature Selection
# List the coefficients against the target, Credit Default
df.corr()['Credit Default']
# Now list coefficients against the target, only where the absolute value > 0.2
df.corr()['Credit Default'].abs() > 0.2
154
Detect Irrelevant Features
155
Categorical Feature vs. Categorical Target — Chi-Square Test
156
Chi-Squared Test
import pandas as pd
from scipy.stats import chi2_contingency
# Example data
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
        'Smoker': ['Yes', 'No', 'No', 'Yes', 'Yes', 'No']}
df = pd.DataFrame(data)
# Create cross-tabulation table
crosstab_table = pd.crosstab(df['Gender'], df['Smoker'])
# Perform chi-square test
chi2, p, _, _ = chi2_contingency(crosstab_table)
# Print the results
print("Chi-square statistic:", chi2)
print("p-value:", p)
158
Categorical Feature vs. Continuous Target — ANOVA
159
ANOVA Test
import numpy as np
import scipy.stats as sst
# Case 1: no relationship between X and Y
X = np.random.randint(3, size=1000)
Y = np.random.rand(1000)          # no relationship
one = Y[X == 1]
two = Y[X == 2]
zero = Y[X == 0]
result = sst.f_oneway(one, two, zero)
print(result)
# F_onewayResult(statistic=0.7361644252650903, pvalue=0.479207591128618)
# Large p-value -> the relationship is insignificant
# Case 2: X is part of Y
X = np.random.randint(3, size=1000)
Y = np.random.rand(1000) + 0.1 * X    # X is part of Y
one = Y[X == 1]
two = Y[X == 2]
zero = Y[X == 0]
result = sst.f_oneway(one, two, zero)
print(result)
# F_onewayResult(statistic=32.98176360162701, pvalue=1.349374223612499e-14)
# Small p-value -> reject the null hypothesis -> the feature is relevant
160
The Choice of Feature Selection Algorithm depends on the
nature of the input features and the output target
161
Wrapper Method - Stepwise Approach
162
Recursive Feature Elimination Method
Recursive Feature Elimination (RFE) is a backward stepwise selection algorithm for selecting predictors
• Search for a subset of features by starting with all features in the training dataset and then removing features
• Fit the ML algorithm, rank the features by importance, and discard the least important features
• Refit the model
166
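A minimal sketch of RFE with scikit-learn; the estimator, the synthetic dataset, and the number of features to keep are assumptions for illustration.
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
# Synthetic data: 10 features, only 4 of them informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)
# Recursively fit the model, rank the features, and drop the least important ones
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)   # mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger values were eliminated earlier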
Principal Components Analysis
• Extract the important information from a multivariate dataset and express this
information as a set of a few new variables called principal components
• Principal components explain most of the patterns and latent structures observed
in the original dataset
• Often possible with only a few principal components
• These principal components are orthogonal, which means that they are uncorrelated
• They are ranked in order of their “explained variance.”
• The first principal component (PC1) explains the most variance in your dataset, PC2
explains the second-most variance, and so on.
167
Principal Components Analysis
168
Majority of the variance in the original dataset can be
effectively explained by a few principal components
Principal Component
169
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components)
that can be best used to represent data
• Normalize input data: Each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the k principal component vectors
• The principal components are sorted in order of decreasing “significance” or strength
• Since the components are sorted, the size of the data can be reduced by eliminating the
weak components, i.e., those with low variance (i.e., using the strongest principal
components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only
170
Python – Using PCA
1. Select all the numeric columns (X) except the target variable (Y) "price"
2. Scale the numeric values, which is an important step before applying PCA
3. Instantiate PCA
4. Determine the transformed features
5. Determine the explained variance using the explained_variance_ratio_ attribute
list(df.select_dtypes(['float']).columns)
x = df.select_dtypes(include=('float64', 'integer'))
x.drop('price',axis=1,inplace=True)
# performing preprocessing part
# standardize the range of values of features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)
171
Python Using PCA
# Applying PCA function, limit the components to 4
from sklearn.decomposition import PCA
pca = PCA(n_components = 4)
x = pca.fit_transform(x)
explained_variance = pca.explained_variance_ratio_
explained_variance
# Cumulative sum of eigenvalues; This will be used to create step plot
# for visualizing the variance explained by each principal component.
cum_sum_eigenvalues = np.cumsum(explained_variance)
172
Python Using PCA
# Create the visualization plot
plt.bar(range(0, len(explained_variance)), explained_variance,
        alpha=0.5, align='center', label='Individual explained variance')
# plot the cumulative eigenvalues
plt.step(range(0, len(cum_sum_eigenvalues)), cum_sum_eigenvalues,
         where='mid', label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
The 4 components can explain 80% of the patterns in the original dataset.
173
Python Using PCA
# Reduce the dimensionality to four independent variables
x_trimmed = pd.DataFrame(x)
x_trimmed.head()
174
Sample Code
• FIN7790_2023_Correlation-SP100-GDP.ipynb
Reference: https://www.learnpythonwithrune.org/pandas-correlation-methods-explained-pearson-kendall-and-spearman/
175
Reference
• Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc.
• Galli, S. (2020). Python Feature Engineering Cookbook: Over 70 Recipes for Creating, Engineering, and Transforming Features to Build Machine Learning Models. Packt Publishing Ltd.
• A Short Guide for Feature Engineering and Feature Selection (download from Moodle)
176
A Short Guide for Feature Engineering and Feature Selection
177
END of Part 1
178
Part 2 (Week 4):
Financial Time Series Data Preparation for ML Prediction Modeling
179
Time Series Forecasting Model
180
Time Series Forecasting (Forecasting Horizon)
181
Time series data set as supervised learning problem
1-Step Forecast
183
Multi-step Forecast
184
2 ways to produce multi-step forecasts
185
Univariate time series as multi-step supervised learning
Multi-output Forecast
186
Incremental Multi-step forecast
Let h = 3
Let p = 3, h = 3
189
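A minimal sketch of turning a univariate series into a multi-output supervised learning table with p lagged inputs and h future targets (p = 3 and h = 3 as on the slides; the series values are assumed for illustration):
import pandas as pd
series = pd.Series([10, 12, 13, 15, 14, 16, 18, 17, 19, 21], name='y')
p, h = 3, 3  # number of lagged inputs and forecast horizon, as on the slides
# Inputs: the p most recent observations (t, t-1, ..., t-p+1)
X = pd.concat({f'lag_{i}': series.shift(i) for i in range(p)}, axis=1)
# Multi-output targets: the next h observations (t+1, ..., t+h)
Y = pd.concat({f'step_{j}': series.shift(-j) for j in range(1, h + 1)}, axis=1)
frame = pd.concat([X, Y], axis=1).dropna()
print(frame)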
Multistep/Multi-output Forecasting using Linear Regression
• FIN7790-2023-ML-NoDiff-AirPassenger.ipynb
• FIN7790-2023-ML-Diff-AirPassenger.ipynb
The two sample codes both use the Linear Regression algorithm to develop the forecasting models.
190
Common Time Series Data Problems
• Restructuring the Timestamp format
• Convert into a standard date-time format
• Fixing Missing Value
• Linear interpolation
• Forward filling
• Backward filling
• Imputation using Mean, Median, or Mode within a period
• Removing Outlier
• Ceiling/Flooring (Min/Max)
• Denoising Features
191
Fixing the Date-Time Format
192
Date-time Column
import pandas as pd
passenger = pd.read_csv('AirPassengers.csv')
passenger['Date'] = pd.to_datetime(passenger['Date'])
# Below line of code sorts the values according to dates
passenger.sort_values(by=['Date'], inplace=True, ascending=True)
193
Handling Missing Data in Time Series
194
Interpolation approach for Time Series Data
195
Interpolation approach for Time Series Data
import pandas as pd
import numpy as np
# A small example series with missing values (the original data for this slide is not shown, so this is assumed)
df = pd.DataFrame({'value': [1.0, np.nan, 3.0, np.nan, 5.0]})
# Linear interpolation fills in the gaps between known values
df_interpolated = df.interpolate(method='linear')
print("\nAfter interpolation:")
print(df_interpolated)
196
Forward/Backward Fill
• Forward fill takes the previous row value and fills the next row.
• Backward fill takes the next row value and fills the previous row.
# forward fill
df.fillna(method='ffill', inplace=True)
# backward fill
df.fillna(method='bfill', inplace=True)
197
Interpolation in Pandas
198
Visualizing the interpolated data
199
Outlier Detection
• Using the mean and standard deviation of the entire series is not recommended for
outlier detection because the boundaries would remain fixed in that case.
• The Rolling Statistical Bound-based approach creates boundaries on a rolling basis and is
effective and straightforward for outlier detection.
• For example, define the upper and lower bound as:
Upper Bound = Rolling Mean + 3 x (Rolling Standard Deviation)
Lower Bound = Rolling Mean - 3 x (Rolling Standard Deviation)
Rolling mean is the mean for a window of previous observations.
• Outliers in the data can be effectively identified by calculating these bounds using a
rolling window.
200
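A minimal sketch of the rolling-bound approach; the price series and the 20-observation window are assumptions for illustration.
import numpy as np
import pandas as pd
# Assumed price series with two injected outliers
np.random.seed(0)
close = pd.Series(100 + np.random.randn(200).cumsum(), name='Close')
close.iloc[50] += 25
close.iloc[120] -= 25
# Rolling mean and standard deviation over a window of previous observations
rolling_mean = close.rolling(window=20).mean()
rolling_std = close.rolling(window=20).std()
# Bounds as defined on the slide
upper_bound = rolling_mean + 3 * rolling_std
lower_bound = rolling_mean - 3 * rolling_std
# Observations falling outside the rolling bounds are flagged as outliers
outliers = close[(close > upper_bound) | (close < lower_bound)]
print(outliers)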
Denoising a Time Series
201
Outlier Handling: Smoothing data with Rolling Mean
[Figure: raw series with outliers and the smoothing function (rolling mean)]
202
Bollinger Bands
Rolling Mean and Rolling Standard Deviation:
204
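A minimal sketch (assuming msft_df is the Microsoft price DataFrame used elsewhere in these slides; the 20-day window and the 2-standard-deviation band width follow the common Bollinger convention):
import matplotlib.pyplot as plt
# Rolling mean and rolling standard deviation of the closing price
rolling_mean = msft_df['Close'].rolling(window=20).mean()
rolling_std = msft_df['Close'].rolling(window=20).std()
upper_band = rolling_mean + 2 * rolling_std
lower_band = rolling_mean - 2 * rolling_std
msft_df['Close'].plot(label='Close')
rolling_mean.plot(label='Rolling mean')
upper_band.plot(label='Upper band')
lower_band.plot(label='Lower band')
plt.legend()
plt.show()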
Feature Construction for Time Series Data
205
Constructing Features Over Time
1. Creating features over time: develop specific features (e.g. from the timestamp) that are useful in time-series analysis.
2. Extracting features with windows: e.g. using the rolling-windows technique to extract features
206
Construct New Features
207
Extracting date-based features
208
Extracting date-based features
• Date columns usually provide valuable information
• Extracting the parts of the date into different columns: Year, Quarter, Month, Day, etc.
• Extracting some other specific features from the date: name of the weekday, weekend or not, holiday or not, etc.
209
Extracting date-based features
import pandas as pd
from datetime import date
data = pd.DataFrame({'date':['01-01-2017','04-12-2008','23-06-1988','25-08-1999','20-02-1993']})
#Transform string to date
data['date'] = pd.to_datetime(data.date, format="%d-%m-%Y")
#Extracting Year
data['year'] = data['date'].dt.year
#Extracting Month
data['month'] = data['date'].dt.month
#Extracting passed years since the date
data['passed_years'] = date.today().year - data['date'].dt.year
#Extracting passed months since the date
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month
#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()
210
Creating a Time-lagging features
• Create time-shifted versions of the data by "rolling" the time series
data either into the future or into the past
• In Pandas, use the shift method of a DataFrame.
• DataFrame.shift(5)
• Positive values roll the data backward, while negative values roll the data
forward.
• The same index of data will have data from different timepoints in it.
211
Time-shifted DataFrame: a positive value rolls the data backward
- Shift the values 3 index positions towards the past
msft_df.shift(3)
212
Create several time-lagged versions of the data
213
Create several time-lagged versions of the data
# data is a pandas Series containing time series data
data = df['Close']
# create 4 time-shift values
shifts = [1,2,3,4]
# Create a dictionary of time-shifted data
many_shifts = {'lag_{}'.format(ii): data.shift(ii) for ii in shifts}
# Convert them into a dataframe
many_shifts = pd.DataFrame(many_shifts)
many_shifts.shape
214
Create several time-lagged versions of the data
215
Visualizing the Correlation Coefficients of the Time-lagged Features
# First drop the missing values
many_shifts.dropna(inplace=True)
# Merge with the df
df_result = pd.concat([data, many_shifts], axis=1, join='inner')
# Fit the Ridge Linear Regression model
from sklearn.linear_model import Ridge
# Fit the model using these new input features
model = Ridge()
model.fit(df_result.iloc[:, 1:5], df_result['Close'])
# Visualize the fitted model coefficients
fig, ax = plt.subplots()
ax.bar(df_result.iloc[:, 1:5].columns, model.coef_)
ax.set(xlabel='Coefficient name', ylabel='Coefficient value')
# Set formatting so it looks nice
plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
216
Use the Time-lagging features as the predictor variables
(Autoregression)
• You can create multiple values in the past as input features for a time-series
machine-learning model
• You need to assess how auto-correlated these new signals are with the current time point
• That is, how correlated a time point is with its neighboring time points (called autocorrelation).
217
ACF and PACF Plot
ACF and PACF are plots that summarize the strength of the relationship between an observation in a time series and observations at prior time steps.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(msft_df['Close'])
plot_pacf(msft_df['Close'])
plt.show()
218
Create multiple rolling features using the aggregate method
• It is common to define rolling features for the min, max, mean, and standard deviation.
• We can use the rolling window and aggregate functions to calculate all these features, as in the sketch below.
• In pandas, the aggregate method can be used to calculate many features of a window at once.
219
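A minimal sketch of the idea (the price series and the 10-observation window are assumptions for illustration):
import numpy as np
import pandas as pd
# Assumed price series
np.random.seed(1)
close = pd.Series(100 + np.random.randn(60).cumsum(), name='close')
# Min, max, mean, and std over a rolling window, computed in a single .aggregate call
rolling_feats = close.rolling(window=10).aggregate(['min', 'max', 'mean', 'std'])
print(rolling_feats.tail())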
Using .aggregate for feature extraction
221
Check the properties of the features
222
Supplement: Working with Time Series Data Using
Python
223
Download Microsoft Data from Yahoo
import yfinance as yf
msft_df = yf.download('MSFT',
                      start='2014-01-01',
                      end='2021-12-31',
                      progress=False)
# output the data to a csv file
msft_df.to_csv('msft.csv')
224
Working with Time Series data using Python
225
Inspecting and saving Timeseries data to a CSV file
# Inspect the columns
print(aapl.columns)
# Inspect the index and make sure the index is there
print(aapl.index)
# Select only the last 10 observations of `Close`
ts = aapl['Close'][-10:]
# Check the type of `ts`
type(ts)
# Save to a CSV file and read it back with a DatetimeIndex
aapl.to_csv('data/aapl_ohlc.csv')
df = pd.read_csv('data/aapl_ohlc.csv', header=0, index_col='Date', parse_dates=True)
226
Visualizing Time Series Data
227
Adding new column
• Adding a new column which indicates the difference between the
open and the close price of a stock
# Add a column `diff` to `aapl`
aapl['diff'] = aapl.Open - aapl.Close
228
Time Shifts
229
Time Shifts
data = [100, 101, 99, 105, 102, 103]
# Create pandas DataFrame from list
df = pd.DataFrame(data, columns=['close'])
# To shift by 1 row vertically
df.shift(periods=1)   # same as df.shift(1)
# Shift and fill the introduced missing value with 0
df.shift(periods=1, fill_value=0)
230
Difference
231
Difference
data = [100, 101, 99, 105, 102, 103]
# Create pandas DataFrame from list
df = pd.DataFrame(data, columns=['close'])
df['diff()'] = df['close'].diff()   # difference from the previous day
232
Percentage Change
The pct_change() function is used to get the percentage change between the current and a prior row.
data = [100, 101, 99, 105, 102, 103]
# Create pandas DataFrame from list
df = pd.DataFrame(data, columns=['close'])
df['pct_change()'] = df['close'].pct_change()
233
Rolling
• Let’s say you have 20 days of stock data, and you want to know the
mean price of the stock for the last 5 days. What do you do?
• You take the last 5 days, sum them up, and divide by 5.
• But what if you want to know the average of the previous 5 days for
each day in your data set?
234
Rolling
import pandas as pd
import matplotlib.pyplot as plt
#Random stock prices
data = [100,101,99,105,102,103,104,101,105,102,99,98,105,109,105,120,115,109,105,108]
#Create pandas DataFrame from the list
df = pd.DataFrame(data,columns=['close'])
#Calculate a 5-period simple moving average
sma5 = df['close'].rolling(window=5).mean()
#Plot
plt.plot(df['close'],label='Stock Data')
plt.plot(sma5,label='SMA',color='red')
plt.legend()
plt.show()
235
Expanding
236
Expanding
#Calculate expanding window mean
expanding_mean = df.expanding(min_periods=1).mean()
#Calculate full sample mean for reference
full_sample_mean = df['close'].mean()
#Plot
plt.axhline(full_sample_mean, label='Full Sample Mean', linestyle='--', color='red')
plt.plot(expanding_mean,label='Expanding Mean',color='red')
plt.plot(df['close'],label='Stock Data')
plt.legend()
plt.show()
237
Resample Function
• Frequency conversion of the time series data, e.g. from daily to month
• The most commonly used time series frequencies are:
• W : weekly frequency,
• M : month end frequency,
• Q : quarter end frequency.
# Resampling the time series data based on months
# we apply it on stock close price
# 'M' indicates month
monthly_aapl = aapl.Close.resample('M').mean()
# the above command will find the mean closing price
# of each month
monthly_aapl
238
Common Financial Analysis
• Returns
• Moving Average
• Volatility Calculation
239
Stock Return
• Percentage Change
• Pandas function: pct_change( ) tells us how much the stock price gained or lost
240
Indexing the Stock Return by Time
pct_change()
241
Returns
• Using pct_change() to compute the Daily Return and Daily Log Returns
242
Returns (Calculate the Percentage Change)
• Monthly and Quarterly Return with resample()
# Resample `aapl` to business months, take last observation as value
monthly = aapl.resample('BM').apply(lambda x: x[-1])
# Calculate the monthly percentage change - compare the value of the last day of the month with the last day of the prior month
monthly.pct_change()
# Resample `aapl` to quarters, take the mean as value per quarter
quarter = aapl.resample("4M").mean()
# Calculate the quarterly percentage change
quarter.pct_change()
243
Returns
Plot the distribution of daily_pct_change:
# Import matplotlib
import matplotlib.pyplot as plt
# Plot the distribution of `daily_pct_c`
daily_pct_c.hist(bins=50)
# Show the plot
# From the graph, we can easily tell the distribution of daily changes; the mean is about 0.001
plt.show()
# Pull up summary statistics
print(daily_pct_c.describe())
The distribution looks very symmetrical and normally distributed.
244
Returns – Cumulative Return
The cumulative daily rate of return is useful to determine the value of an investment at regular intervals. You can calculate it by taking the daily percentage change values, adding 1 to them, and computing the cumulative product of the resulting values.
# The cumprod() method returns a DataFrame with the cumulative product for each row
cum_daily_return = (1 + daily_pct_c).cumprod()
# Print `cum_daily_return`
print(cum_daily_return)
# Plot the cumulative daily returns
cum_daily_return.plot(figsize=(12,8))
# Show the plot
plt.show()
245
Returns – Cumulative Return
Cumulative monthly rate of return
246
Moving Average
• Simple Moving Average
• Using n past days to get the average. Common values: n = 14, 50, 200
• Here's a plot showing the MSFT Close price and a 200-day simple moving
average, or SMA. You can see moving averages smooth the data.
247
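A minimal sketch of the plot described above (assuming msft_df is the Microsoft price DataFrame used elsewhere in these slides):
import matplotlib.pyplot as plt
# 200-day simple moving average of the closing price
sma200 = msft_df['Close'].rolling(window=200).mean()
msft_df['Close'].plot(label='MSFT Close')
sma200.plot(label='200-day SMA', color='red')
plt.legend()
plt.show()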
Moving Windows
251
Volatility Calculation
Moving historical standard deviation of the log returns
252
https://blog.devgenius.io/how-to-calculate-the-daily-returns-and-volatility-of-a-stock-with-python-d4e1de53e53b
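A minimal sketch (assuming msft_df['Close']; the 21-day window is an assumption, and the sqrt(252) annualisation is the usual trading-day convention):
import numpy as np
# Daily log returns
log_returns = np.log(msft_df['Close'] / msft_df['Close'].shift(1))
# Moving historical standard deviation of the log returns
volatility = log_returns.rolling(window=21).std()
# Optionally annualise using the 252-trading-day convention
annualized_vol = volatility * np.sqrt(252)
print(annualized_vol.tail())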
Relative Strength Index
254
About TA-Lib
255
Reference
256
Basic Analytics for Time Series Market Data
Sample Code:
FIN7790_2023_tutorial_python_for_finance.ipynb
257
Other Reference Code
• DataTime Index
FIN7790-2023-TimeSeries-DateTime.ipynb
• Resampling
FIN7790-2023-TimeSeries-Resampling.ipynb
• Time Shift
FIN7790-2023-TimeSeries-Time Shifting.ipynb
• Rolling and Expanding
FIN7790-2023-TimeSeries-Rolling and Expanding.ipynb
258
Feature Extraction for Transaction Data
[Table columns: Account Number, Month, Transaction_Date_Time, Transaction Type, Withdrawal Amount, ATM Location]
259
Feature Extraction with Gini Coefficient
260
Feature Extraction for Time Series Data
• Given a time series of length N, extract a set of D features that characterizes the time series.
[Figure: sinusoidal wave]
262
Are they the same?
263
Fourier Transform
265
Fourier Transform
266
Fourier transform
Convert a single timeseries into an array that describes the timeseries as a combination of oscillations
267
Python Sample
import pandas as pd
import numpy as np
# Create a time series
ts = pd.Series([1, 2, 3, 4, 5])
# Calculate the Fourier transform
fft = pd.Series(np.fft.fft(ts).real)
print(fft)
268
269
Questions
270
Assignment 1
[Schedule table: Date, Week, Class, Remark]
271
Assignment 1
Problem Description:
• The bank wants to improve customer service by analyzing client data.
• The bank has clients’ accounts, transactions, and geographic data.
• Data needs to be cleaned and transformed for machine learning modeling later.
• Your initial insights from analysis can help improve services and make informed decisions.
Task:
• Three datasets will be provided.
• Perform data preparation (handling missing data and outliers, data encoding, etc.)
• Explore the data (data distribution, correlation, etc.)
• Perform feature engineering (feature selection, construction, etc.).
272
Details of the Assignment 1 can be found in Moodle
273
END of Part 2
274