OC_Module 4_Theory and Methods 021312

Module 4 focuses on advanced analytics theory and methods, emphasizing the selection of appropriate techniques based on business objectives and data characteristics. Key topics include K-means clustering, association rules, and their respective algorithms, use cases, and evaluation methods. The module also covers the application of R for model fitting and diagnostics, along with practical lab exercises to reinforce learning.
Module 4 – Advanced Analytics – Theory and Methods
2
EMC PROVEN PROFESSIONAL

Copyright © 2012 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 1
Module 4: Advanced Analytics – Theory and
Methods
Upon completion of this module, you should be able to:
• Examine analytic needs and select an appropriate technique based on
business objectives; initial hypotheses; and the data's structure and volume
• Apply some of the more commonly used methods in Analytics solutions
• Explain the algorithms and the technical foundations for the commonly used
methods
• Explain the environment (use case) in which each technique can provide the
most value
• Use appropriate diagnostic methods to validate the models created
• Use R and in-database analytical functions to fit, score and evaluate models
Where “R” we?
• In Module 3 we reviewed R skills and basic statistics
• You can use R to:
  - Generate summary statistics to investigate a data set
  - Visualize data
  - Perform statistical tests to analyze data and evaluate models
• Now that you have data, and you can see it, you need to plan the
analytic model and determine the analytic method to be used

Applying the Data Analytics Lifecycle

[Figure: Data Analytics Lifecycle: Discovery, Data Prep, Model Planning, Model Building, Communicate Results, Operationalize]

• In a typical data analytics problem, you would have gone through:
  - Phase 1 – Discovery: have the problem framed
  - Phase 2 – Data Preparation: have the data prepared
• Now you need to plan the model and determine the method to be used.
Phase 3 - Model Planning

[Figure: Data Analytics Lifecycle: Discovery, Data Prep, Model Planning, Model Building, Communicate Results, Operationalize]

How do people generally solve this problem with the kind of data and
resources I have?
• Does that work well enough? Or do I have to come up with something new?
• What are related or analogous problems? How are they solved? Can I do that?

Callouts on the figure:
• Do I have a good idea about the type of model to try? Can I refine the analytic plan?
• Is the model robust enough? Have we failed for sure?
What Kind of Problem do I Need to Solve?
How do I Solve it?
The Problem to Solve | The Category of Techniques | Covered in this Course
I want to group items by similarity. I want to find structure (commonalities) in the data | Clustering | K-means clustering
I want to discover relationships between actions or items | Association Rules | Apriori
I want to determine the relationship between the outcome and the input variables | Regression | Linear Regression, Logistic Regression
I want to assign (known) labels to objects | Classification | Naïve Bayes, Decision Trees
I want to find the structure in a temporal process. I want to forecast the behavior of a temporal process | Time Series Analysis | ACF, PACF, ARIMA
I want to analyze my text data | Text Analysis | Regular expressions, Document representation (Bag of Words), TF-IDF
Why These Example Techniques?

• Most popular, frequently used:
  - Provide the foundation for Data Science skills on which to build
• Relatively easy for new Data Scientists to understand
• Applicable to a broad range of problems in several verticals

Module 4: Advanced Analytics – Theory and Methods
Lesson 1: K-means Clustering

During this lesson the following topics are covered:


• Clustering – Unsupervised learning method
• K-means clustering:
• Use cases
• The algorithm
• Determining the optimum value for K
• Diagnostics to evaluate the effectiveness of the method
• Reasons to Choose (+) and Cautions (-) of the method

Clustering
How do I group these documents by topic?
How do I group my customers by purchase patterns?
• Sort items into groups by similarity:
  - Items in a cluster are more similar to each other than they are to
    items in other clusters.
  - Need to detail the properties that characterize "similarity"
    - Or, equivalently, distance: the "inverse" of similarity
• Not a predictive method; finds similarities, relationships
• Our Example: K-means Clustering

K-Means Clustering - What is it?
• Used for clustering numerical data, usually a set of
measurements about objects of interest.
• Input: numerical. There must be a distance metric defined over
the variable space.
  - Euclidean distance
• Output: The centers of each discovered cluster, and the
assignment of each input datum to a cluster.
  - Centroid

Use Cases
• Often an exploratory technique:
  - Discover structure in the data
  - Summarize the properties of each cluster
• Sometimes a prelude to classification:
  - "Discovering the classes"
• Examples
  - The height, weight and average lifespan of animals
  - Household income, yearly purchase amount in dollars, and number of
    household members of customer households
  - Patient records with measures of BMI, HbA1c, HDL

Use-Case Example – On-line Retailer

LTV – Lifetime Customer Value


The Algorithm

1. Choose K; then select K random "centroids"
   In our example, K=3
2. Assign records to the cluster with the closest centroid

The Algorithm (Continued)

3. Recalculate the resulting centroids
   Centroid: the mean value of all the records in the cluster
4. Repeat steps 2 & 3 until record assignments no longer change

Model Output:
• The final cluster centers
• The final cluster assignments of the training data
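
The four steps can be sketched in a few lines of code. This is a minimal illustration in Python rather than the R used in the labs; the function names, the fixed random seed, the iteration cap and the sample points are all assumptions made for the example, not part of the course material.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Centroid: the coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, assignments) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: choose K, then pick K records at random as initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each record to the cluster with the closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recalculate each centroid as the mean of its records.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster emptied out
                centroids[c] = mean_point(members)
    return centroids, assignments


# Hypothetical two-dimensional records forming three loose groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
        (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
        (9.0, 1.0), (9.1, 0.9), (8.9, 1.2)]
centers, labels = kmeans(data, k=3)
print(centers)   # the final cluster centers
print(labels)    # the final cluster assignment of each record
```

Because the starting centroids are random, a different seed can give a different (possibly worse) result; running the sketch with several seeds is one simple way to see that sensitivity.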

Picking K
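Heuristic: find the "elbow" of the within-sum-of-squares (wss) plot
as a function of K.

wss = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \lVert x_{ij} - c_i \rVert^2

K: # of clusters
n_i: # points in ith cluster
c_i: centroid of ith cluster
x_ij: jth point of ith cluster

[Figure: wss plotted against K, with "elbows" at K = 2, 4, 6]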
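
The elbow heuristic can be tried out with a short script. The sketch below is Python rather than the R used in the labs, and it assumes the usual definition of wss (the sum, over all clusters, of squared distances from each point to its cluster centroid). The helper names, the number of restarts and the sample data are illustrative assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means; returns (centroids, assignments)."""
    centroids = random.Random(seed).sample(points, k)
    assignments = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new == assignments:
            break
        assignments = new
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments


def wss(points, centroids, assignments):
    """Within-sum-of-squares: squared distance of each point to its centroid."""
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignments))


def best_wss(points, k, restarts=10):
    """Smallest wss over several random starts, since K-means is init-sensitive."""
    return min(wss(points, *kmeans(points, k, seed=s)) for s in range(restarts))


# Hypothetical records forming three loose groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
        (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
        (9.0, 1.0), (9.1, 0.9), (8.9, 1.2)]

# Print wss for K = 1..5; the "elbow" is where the drop flattens out.
for k in range(1, 6):
    print(k, round(best_wss(data, k), 3))
```

Plotting these values against K gives the wss curve; for data like this the curve should drop steeply up to the true number of groups and level off afterwards.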

Diagnostics – Evaluating the Model
• Do the clusters look separated in at least some of the plots when
you do pair-wise plots of the clusters?
  - Pair-wise plots can be used when there are not many variables
• Do you have any clusters with few data points?
  - Try decreasing the value of K
• Are there splits on variables that you would expect, but don't
see?
  - Try increasing the value of K
• Do any of the centroids seem too close to each other?
  - Try decreasing the value of K

K-Means Clustering - Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+)
• Easy to implement
• Easy to assign new data to existing clusters
  - Which is the nearest cluster center?
• Concise output
  - Coordinates of the K cluster centers

Cautions (-)
• Doesn't handle categorical variables
• Sensitive to initialization (first guess)
• Variables should all be measured on similar or compatible scales
  - Not scale-invariant!
• K (the number of clusters) must be known or decided a priori
  - Wrong guess: possibly poor results
• Tends to produce "round" equi-sized clusters
  - Not always desirable
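
One common way to address the scale caution is to standardize every variable before clustering. The sketch below (Python, not the course's R) converts each column to z-scores; the choice of z-scores, the function name and the household figures are assumptions for illustration, and other rescalings may suit a given data set better.

```python
import statistics


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column (stdev 0) carries no information; map it to 0.
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical households: (income in dollars, number of household members).
households = [(30000.0, 1.0), (45000.0, 4.0), (90000.0, 2.0), (120000.0, 5.0)]
scaled = standardize(households)
for row in scaled:
    print(tuple(round(v, 2) for v in row))
```

Without this step, Euclidean distance on the raw values would be dominated by income, simply because dollars span a far wider numeric range than household size.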
Check Your Knowledge
1. Why do we consider K-means clustering an unsupervised
machine learning algorithm?
2. How do you use "pair-wise" plots to evaluate the effectiveness
of the clustering?
3. Detail the four steps in the K-means clustering algorithm.
4. How do we use WSS to pick the value of K?
5. What is the most common measure of distance used with
K-means clustering algorithms?
6. The attributes of a data set are "purchase decision (Yes/No),
Gender (M/F), income group (<10K, 10-50K, >50K)". Can you
use K-means to cluster this data set?

Module 4: Advanced Analytics – Theory and
Methods
Lesson 1: K-means Clustering - Summary

During this lesson the following topics were covered:


• Clustering – Unsupervised learning method
• What is K-means clustering
• Use cases with K-means clustering
• The K-means clustering algorithm
• Determining the optimum value for K
• Diagnostics to evaluate the effectiveness of K-means clustering
• Reasons to Choose (+) and Cautions (-) of K-means clustering

Lab Exercise 4: K-means Clustering
• This lab is designed to investigate and practice K-means Clustering.

After completing the tasks in this lab you should be able to:
• Use R functions to create K-means Clustering models
• Use an ODBC connection to the database to execute SQL statements
  and read database tables in an R environment
• Visualize the effectiveness of the K-means Clustering algorithm
  using graphic capabilities in R
• Use the MADlib function for K-means Clustering

Lab Exercise 4: K-means Clustering - Workflow
1. Set the Working Directory
2. Establish the ODBC Connection
3. Open Connections to ODBC Database
4. Get Data from the Database
5. Read in the Data for Modeling
6. Execute the Model
7. Review the Output
8. Plot the Results
9. Find the Appropriate Number of Clusters
10. Close Connections to ODBC Database
11. Perform K-means Clustering Using In-database Analytics Method

Module 4: Advanced Analytics – Theory and
Methods
Lesson 2: Association Rules
During this lesson the following topics are covered:
• Association Rules mining
• Apriori Algorithm
• Prominent use cases of Association Rules
• Support and Confidence parameters
• Lift and Leverage
• Diagnostics to evaluate the effectiveness of rules generated
• Reasons to Choose (+) and Cautions (-) of the Apriori algorithm

Association Rules
Which of my products tend to be purchased together?
What do other people like this person tend to like/buy/watch?
• Discover "interesting" relationships among variables in a large
database
  - Rules of the form "When X observed, Y also observed"
  - The definition of "interesting" varies with the algorithm used for
    discovery
• Not a predictive method; finds similarities, relationships

Association Rules - Apriori
• Specifically designed for mining over transactions in databases
• Used over itemsets: sets of discrete variables that are linked:
  - Retail items that are purchased together
  - A set of tasks done in one day
  - A set of links clicked on by one user in a single session
• Our Example: Apriori

Apriori Algorithm - What is it?
Support
• Earliest of the association rule algorithms
• Frequent itemset: a set of items L that appears together "often enough":
  - Formally: meets a minimum support criterion
  - Support: the % of transactions that contain L
• Apriori Property: Any subset of a frequent itemset is also frequent
  - It has at least the support of its superset

Apriori Algorithm (Continued)
Confidence
• Iteratively grow the frequent itemsets from size 1 to size K (or
until we run out of support).
  - Apriori property tells us how to prune the search space
• Frequent itemsets are used to find rules X -> Y with a minimum
confidence:
  - Confidence: The % of transactions that contain X, which also
    contain Y
• Output: The set of all rules X -> Y with minimum support and
confidence

Lift and Leverage
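
The formulas on this slide did not survive extraction. As a stand-in, the standard definitions are sketched below; the lift formula agrees with the computation shown later in this lesson, 0.527 / (0.700*0.713). The Support(·) notation is an assumption chosen to match the earlier slides.

```latex
\mathrm{Lift}(X \rightarrow Y) =
  \frac{\mathrm{Support}(X \wedge Y)}{\mathrm{Support}(X)\,\mathrm{Support}(Y)}
\qquad
\mathrm{Leverage}(X \rightarrow Y) =
  \mathrm{Support}(X \wedge Y) - \mathrm{Support}(X)\,\mathrm{Support}(Y)
```

If X and Y were statistically independent, lift would be 1 and leverage would be 0; values above those marks mean X and Y occur together more often than independence would predict.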

Association Rules Implementations
• Market Basket Analysis
  - People who buy milk also buy cookies 60% of the time.
• Recommender Systems
  - "People who bought what you bought also purchased…"
• Discovering web usage patterns
  - People who land on page X click on link Y 76% of the time.

Use Case Example: Credit Records

ID | Credit Attributes
1 | credit_good, female_married, job_skilled, home_owner, …
2 | credit_bad, male_single, job_unskilled, renter, …

Minimum Support: 50%

Frequent Itemset | Support
credit_good | 70%
male_single | 55%
job_skilled | 63%
home_owner | 71%
home_owner, credit_good | 53%

The itemset {home_owner, credit_good} has minimum support.
The possible rules are
credit_good -> home_owner
and
home_owner -> credit_good

Computing Confidence and Lift
Suppose we have 1000 credit records:

            | free_housing | home_owner | renter | total
credit_bad  | 44           | 186        | 70     | 300
credit_good | 64           | 527        | 109    | 700
total       | 108          | 713        | 179    | 1000

Of the 713 home_owners, 527 have good credit.
home_owner -> credit_good has confidence 527/713 = 74%

Of the 700 with good credit, 527 are home_owners.
credit_good -> home_owner has confidence 527/700 = 75%

The lift of these two rules is
0.527 / (0.700*0.713) = 1.056
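
The same arithmetic can be checked in code. This Python sketch (the course labs use R) stores the counts from the table above and derives confidence and lift from them; the dictionary layout and function names are assumptions made for the example.

```python
# Counts from the 1000-record credit table on this slide.
counts = {
    ("credit_bad", "free_housing"): 44,
    ("credit_bad", "home_owner"): 186,
    ("credit_bad", "renter"): 70,
    ("credit_good", "free_housing"): 64,
    ("credit_good", "home_owner"): 527,
    ("credit_good", "renter"): 109,
}
total = sum(counts.values())


def count_with(item):
    """Number of records whose credit or housing attribute equals `item`."""
    return sum(n for pair, n in counts.items() if item in pair)


def count_both(x, y):
    """Number of records that contain both x and y."""
    return counts[(x, y)] if (x, y) in counts else counts[(y, x)]


def confidence(x, y):
    """Confidence of x -> y: fraction of records with x that also have y."""
    return count_both(x, y) / count_with(x)


def lift(x, y):
    """Support(x and y) divided by Support(x) * Support(y)."""
    return (count_both(x, y) / total) / (
        (count_with(x) / total) * (count_with(y) / total)
    )


print(round(confidence("home_owner", "credit_good"), 3))
print(round(confidence("credit_good", "home_owner"), 3))
print(round(lift("home_owner", "credit_good"), 3))
```

Lift is symmetric in X and Y, which is why the slide reports a single value for both rules, whereas the two confidences differ.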


A Sketch of the Algorithm
• If L_k is the set of frequent k-itemsets:
  - Generate the candidate set C_(k+1) by joining L_k to itself
  - Prune out the (k+1)-itemsets that don't have minimum support
    Now we have L_(k+1)
• We know this catches all the frequent (k+1)-itemsets by the
apriori property
  - A (k+1)-itemset can't be frequent if any of its subsets isn't
    frequent
• Continue until we reach k_max, or run out of support
• From the union of all the L_k, find all the rules with minimum
confidence
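
The sketch above translates fairly directly into code. The following is a small, unoptimized Python illustration (the labs themselves use the R "arules" package); the function name, the return format and the toy baskets are assumptions, and no k_max limit is applied, so the loop simply runs until support runs out.

```python
from itertools import combinations


def apriori(transactions, min_support, min_confidence):
    """Return (frequent, rules) for a list of transactions (sets of items)."""
    n = len(transactions)

    def support(itemset):
        """Fraction of transactions that contain every item in `itemset`."""
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: the frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        # Join L_k to itself to form the candidate (k+1)-itemsets.
        k = len(next(iter(current)))
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Apriori property: drop candidates with an infrequent k-subset,
        # then scan the data and drop those below minimum support.
        current = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k))
            and support(c) >= min_support
        }

    # Rules X -> Y from every frequent itemset with at least two items.
    rules = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                conf = supp / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return frequent, rules


# Hypothetical market baskets.
baskets = [
    {"milk", "cookies", "bread"},
    {"milk", "cookies"},
    {"milk", "bread"},
    {"cookies", "bread"},
    {"milk", "cookies", "eggs"},
]
frequent, rules = apriori(baskets, min_support=0.5, min_confidence=0.7)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

Each call to `support` rescans all transactions, which mirrors the "many database scans" cost noted later in this lesson; production implementations count all candidates of one size in a single pass.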

Step 1: 1-itemsets (L1)

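• Let min_support = 0.5
• 1000 credit records
• Scan the database
• Prune

Frequent Itemset | Count
credit_good | 700
credit_bad | 300
male_single | 550
male_mar_or_wid | 92
female | 310
job_skilled | 631
job_unskilled | 200
home_owner | 710
renter | 179

With min_support = 0.5 and 1000 records, an itemset needs a count of at least 500,
so after pruning L1 = {credit_good, male_single, job_skilled, home_owner}.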

Step 2: 2-itemsets (L2)

• Join L1 to itself
• Scan the database to get the counts
• Prune

Frequent Itemset | Count
credit_good, male_single | 402
credit_good, job_skilled | 544
credit_good, home_owner | 527
male_single, job_skilled | 340
male_single, home_owner | 408
job_skilled, home_owner | 452

Step 3: 3-itemsets

Frequent Itemset | Count
credit_good, job_skilled, home_owner | 428

• We have run out of support.
• Candidate rules come from L2:
  - credit_good -> job_skilled
  - job_skilled -> credit_good
  - credit_good -> home_owner
  - home_owner -> credit_good

Finally: Find Confidence Rules
Rule | Set | Cnt | Set | Cnt | Confidence
IF credit_good THEN job_skilled | credit_good | 700 | credit_good AND job_skilled | 544 | 544/700 = 78%
IF credit_good THEN home_owner | credit_good | 700 | credit_good AND home_owner | 527 | 527/700 = 75%
IF job_skilled THEN credit_good | job_skilled | 631 | job_skilled AND credit_good | 544 | 544/631 = 86%
IF home_owner THEN credit_good | home_owner | 710 | home_owner AND credit_good | 527 | 527/710 = 74%

If we want confidence > 80%:
IF job_skilled THEN credit_good
Diagnostics
• Do the rules make sense?
  - What does the domain expert say?
• Make a "test set" from hold-out data:
  - Enter some market baskets with a few items missing (selected at
    random). Can the rules predict the missing items?
  - Remember, some of the test data may not cause a rule to fire.
• Evaluate the rules by lift or leverage.
  - Some associations may be coincidental (or obvious).

Apriori - Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+)
• Easy to implement
• Uses a clever observation to prune the search space
  - Apriori property
• Easy to parallelize

Cautions (-)
• Requires many database scans
• Exponential time complexity
• Can mistakenly find spurious (or coincidental) relationships
  - Addressed with Lift and Leverage measures

Check Your Knowledge
1. What is the Apriori property and how is it used in the Apriori
algorithm?
2. List three popular use cases of the Association Rules mining
algorithms.
3. What is the difference between Lift and Leverage? How is Lift
used in evaluating the quality of rules discovered?
4. Define Support and Confidence.
5. How do you use a "hold-out" dataset to evaluate the
effectiveness of the rules generated?

Module 4: Advanced Analytics – Theory and
Methods
Lesson 2: Association Rules - Summary
During this lesson the following topics were covered:
• Association Rules mining
• Apriori Algorithm
• Prominent use cases of Association Rules
• Support and Confidence parameters
• Lift and Leverage
• Diagnostics to evaluate the effectiveness of rules generated
• Reasons to Choose (+) and Cautions (-) of the Apriori algorithm

Lab Exercise 5 - Association Rules
• This lab is designed to investigate and practice Association Rules.

After completing the tasks in this lab you should be able to:
• Use R functions for Association Rule based models

Lab Exercise 5 - Association Rules - Workflow

1. Set the Working Directory and install the "arules" package
2. Read in the Data for Modeling
3. Review Transaction data
4. Plot Transactions
5. Mine the Association Rules
6. Read in Groceries dataset
7. Mine the Rules for the Groceries Data
8. Extract the Rules in which the Confidence Value is >0.8 and lift is high

Module 4: Advanced Analytics – Theory and
Methods
Lesson 3: Linear Regression

During this lesson the following topics are covered:


• General description of regression models
• Technical description of a linear regression model
• Common use cases for the linear regression model
• Interpretation and scoring with the linear regression model
• Diagnostics for validating the linear regression model
• The Reasons to Choose (+) and Cautions (-) of the linear
regression model
Regression
• Regression focuses on the relationship between an outcome and
its input variables.
  - In other words, we don't just predict the outcome, we also have a
    sense of how changes in individual drivers affect the outcome.
• The outcome can be continuous or discrete.
  - When it's discrete, we are predicting the probability that the
    outcome will occur.
Example Questions:
  - I want to predict the lifetime value (LTV) of this customer (and
    understand what drives LTV).
  - I want to predict the probability that this loan will default (and
    understand what drives default).
• Our examples: Linear Regression, Logistic Regression
Linear Regression - What is it?
• Used to estimate a continuous value as a linear (additive)
function of other variables
  - Income as a function of years of education, age, gender
  - House price as a function of median home price in neighborhood,
    square footage, number of bedrooms/bathrooms
  - Neighborhood house sales in the past year based on
    unemployment, stock price, etc.
• Input variables can be continuous or discrete.
• Output:
  - A set of coefficients that indicate the relative impact of each
    driver.
  - A linear expression for predicting outcome as a function of drivers.
Linear Regression - Use Cases
• The preferred method for almost any problem where we are
predicting a continuous outcome
  - Try this first; if it fails, then try something more complicated
• Examples:
  - Customer lifetime value
  - Home value
  - Loss given default on loan
  - Income as a function of demographics

Example: Predict Mortgage Foreclosure/Delinquency
Rates
fdq_rate = -0.9 + 0.66 CurrentUnemp + 1.06 ChgInUnem1yr + 0.22 hicost_mort_rate
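
Scoring a record with a fitted linear model is just evaluating this expression. The Python sketch below (the labs use R) hard-codes the coefficients from the equation above; the function name, the argument names and the input values are hypothetical, and the inputs are assumed to be expressed as percentages.

```python
def predict_fdq_rate(current_unemp, chg_in_unemp_1yr, hicost_mort_rate):
    """Score one record with the coefficients shown on the slide."""
    return (-0.9
            + 0.66 * current_unemp
            + 1.06 * chg_in_unemp_1yr
            + 0.22 * hicost_mort_rate)


# Hypothetical region: 8% unemployment, up 1.5 points over the past year,
# with 20% of mortgages classed as high-cost.
print(round(predict_fdq_rate(8.0, 1.5, 20.0), 2))
```

Each coefficient can be read directly from the function: holding the other inputs fixed, one extra point of current unemployment raises the predicted rate by 0.66.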

Technical Description

• Solve for the b_i
  - Ordinary Least Squares
    - storage quadratic in number of variables
    - must invert a matrix
• Categorical variables are expanded to a set of indicator
variables, one for each possible value.
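
To make the Ordinary Least Squares step concrete, here is a small Python sketch (the labs use R's "lm") that builds the normal equations (X^T X) b = X^T y and solves them. Solving the linear system by elimination stands in for the explicit matrix inversion mentioned above; the helper names and the toy data are assumptions, and the code does no checking for singular or ill-conditioned inputs.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(xs, ys):
    """Coefficients [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(x) for x in xs]  # prepend the intercept column
    p = len(rows[0])
    # X^T X is p-by-p: this is the storage that grows quadratically
    # with the number of variables.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data generated exactly by y = 1 + 2*x1 - 3*x2.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in xs]
coefficients = ols(xs, ys)
print([round(b, 6) for b in coefficients])
```

Because the toy data contain no noise, the recovered coefficients should match the generating equation up to floating-point error.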

Representing Categorical Variables

• State is a categorical variable: 50 possible values.
• Expand it to 49 indicator (0/1) variables:
  - The remaining level is the "default level"
  - This is done automatically by standard packages
• Gender is categorical, too, but binary
  - so one variable: genderMale, which is 0 for females
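
The expansion is normally done for you by the modeling package, but it is simple enough to sketch by hand. In the Python illustration below, the function name, the choice of the alphabetically first level as the default and the sample values are all assumptions.

```python
def indicator_columns(values, default_level=None):
    """Expand a categorical column into 0/1 indicator columns.

    One level (the "default level") gets no column of its own; a row at
    that level is encoded as all zeros.
    """
    levels = sorted(set(values))
    if default_level is None:
        default_level = levels[0]
    kept = [lv for lv in levels if lv != default_level]
    encoded = [[1 if v == lv else 0 for lv in kept] for v in values]
    return kept, encoded


# Hypothetical sample of a State column: 3 levels become 2 indicators.
states = ["CA", "MA", "NY", "MA", "CA"]
state_names, state_rows = indicator_columns(states)
print(state_names, state_rows)

# Binary case: one indicator, 0 for the default level "F".
gender_names, gender_rows = indicator_columns(["F", "M", "M"], default_level="F")
print(gender_names, gender_rows)
```

Dropping one level avoids a redundant column: with an intercept in the model, the indicators for all levels would always sum to 1.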

What do the Coefficients b_i Mean?
• Change in y as a function of unit change in x_i
  - all other things being equal
• Example: income in units of $10K, age in years, b_age = 2
  - For the same gender, years of education, and state of residence, a
    person's income increases by 2 units ($20K) for every year older
• Standard packages also report the significance of the b_i as a p-value:
the probability of seeing an estimate at least this extreme if, in reality, b_i = 0
  - b_i is "significant" if this p-value is small

Diagnostics
• Hold-out data
  - Does the model predict well on data it hasn't seen?
• N-fold cross-validation
  - Partition the data into N groups.
  - Fit N models, holding out each group in turn, and calculate the
    residuals on the held-out group.
  - Estimated prediction error is the average over all the residuals.
• R²: The fraction of the variance in the output variable that the
model can explain.
  - It is also the square of the correlation between the true output
    and the predicted output. You want it close to 1.
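
Both diagnostics are easy to compute by hand for a one-variable model. The Python sketch below (the labs use R) fits a line by least squares, reports R², and runs N-fold cross-validation. Averaging the squared held-out residuals is an assumption about what "average over all the residuals" means here, and the fold assignment, function names and data are likewise illustrative.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single input variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def r_squared(ys, preds):
    """Fraction of the variance in ys explained by preds: 1 - SSE/SST."""
    my = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
    sst = sum((y - my) ** 2 for y in ys)
    return 1 - sse / sst


def cross_validate(xs, ys, n_folds):
    """Mean squared residual on held-out data across n_folds folds."""
    squared = []
    for fold in range(n_folds):
        # Record i belongs to fold i mod n_folds.
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        test = [i for i in range(len(xs)) if i % n_folds == fold]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared += [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test]
    return sum(squared) / len(squared)


# Hypothetical data: y = 1 + 2x plus a little fixed "noise".
xs = [float(i) for i in range(10)]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05, 0.1, 0.0, -0.1]
ys = [1 + 2 * x + e for x, e in zip(xs, noise)]

b0, b1 = fit_line(xs, ys)
preds = [b0 + b1 * x for x in xs]
print("R^2:", round(r_squared(ys, preds), 4))
print("5-fold CV error:", round(cross_validate(xs, ys, 5), 4))
```

The cross-validated error is usually somewhat larger than the error on the training data, since each prediction is made by a model that never saw that record.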

Diagnostics (Continued)
• Sanity check the coefficients
  - Do the signs make sense? Are the coefficients excessively large?
  - Wrong sign is an indication of correlated inputs, but doesn't
    necessarily affect predictive power.
  - Excessively large coefficient magnitudes may indicate strongly
    correlated inputs; you may want to consider eliminating some
    variables, or using regularized regression techniques.
    - Ridge, Lasso
  - Infinite magnitude coefficients could indicate a variable that strongly
    predicts a subset of the output (and doesn't predict well on the rest).
    - Plot output vs. this input, and see if you should segment the data
      before regressing.

Diagnostics (Continued)
• Plot it!
  - Prediction vs. true outcome
• Look for:
  - Systematic over/under prediction
  - Non-consistent variance
    - The data cloud should be symmetric about the line of
      true prediction
  - Glaring outliers
• You will see other diagnostic plots in the lab

[Figure, first plot: Overpredicts for low true values, underpredicts at higher values. Improve the model.]
[Figure, second plot: Not quite consistent variance, but much better.]
Linear Regression - Reasons to Choose (+) and
Cautions (-)
Reasons to Choose (+)
• Concise representation (the coefficients)
• Robust to redundant variables, correlated variables
  - Lose some explanatory value
• Explanatory value
  - Relative impact of each variable on the outcome
• Easy to score data

Cautions (-)
• Does not handle missing values well
• Assumes that each variable affects the outcome linearly and additively
  - Variable transformations and modeling variable interactions can
    alleviate this
  - A good idea to take the log of monetary amounts or any variable
    with a wide dynamic range
• Can't handle variables that affect the outcome in a discontinuous way
  - Step functions
• Doesn't work well with discrete drivers that have a lot of distinct values
  - For example, ZIP code
Check Your Knowledge
1. How is the measure of significance used in determining the
explanatory value of a driver with linear regression models?
2. Detail the challenges with categorical values in a linear
regression model.
3. Describe the N-fold cross-validation method used for diagnosing a
fitted model.
4. List two use cases of linear regression models.
5. List and discuss two standard sanity checks that you will
perform on the coefficients derived from a linear regression
model.

Module 4: Advanced Analytics – Theory and
Methods
Lesson 3: Linear Regression - Summary

During this lesson the following topics were covered:


• General description of regression models
• Technical description of a linear regression model
• Common use cases for the linear regression model
• Interpretation and scoring with the linear regression model
• Diagnostics for validating the linear regression model
• The Reasons to Choose (+) and Cautions (-) of the linear
regression model
Lab Exercise 6: Linear Regression
• This lab is designed to investigate and practice Linear Regression.

After completing the tasks in this lab you should be able to:
• Use R functions for Linear Regression (Ordinary Least Squares – OLS)
• Predict the dependent variables based on the model
• Investigate different statistical parameter tests that measure the
  effectiveness of the model

Lab Exercise 6: Linear Regression - Workflow

1. Set Working directory
2. Use random number generators to create data for the OLS Model
3. Generate the OLS model using R function "lm"
4. Print and visualize the results and review the plots generated
5. Generate Summary Outputs
6. Introduce a slight non-linearity and test the model
7. Perform In-database Analysis of Linear Regression

Module 4: Advanced Analytics – Theory and
Methods
Lesson 4: Logistic Regression
During this lesson the following topics are covered:
• Technical description of a logistic regression model
• Common use cases for the logistic regression model
• Interpretation and scoring with the logistic regression model
• Diagnostics for validating the logistic regression model
• Reasons to Choose (+) and Cautions (-) of the logistic
regression model

Logistic Regression
• Used to estimate the probability that an event will occur as a
function of other variables
  - The probability that a borrower will default as a function of his
    credit score, income, the size of the loan, and his existing debts
• Can be considered a classifier, as well
  - Assign the class label with the highest probability
• Input variables can be continuous or discrete
• Output:
  - A set of coefficients that indicate the relative impact of each driver
  - A linear expression for predicting the log-odds ratio of outcome as
    a function of drivers. (Binary classification case)
  - Log-odds ratio easily converted to the probability of the outcome

Logistic Regression Use Cases
• The preferred method for many binary classification problems:
 Especially if you are interested in the probability of an event, not
just predicting the "yes or no"
 Try this first; if it fails, then try something more complicated
• Binary Classification examples:
 The probability that a borrower will default
 The probability that a customer will churn
• Multi-class example
 The probability that a politician will vote yes/vote no/not show up
to vote on a given bill

Logistic Regression Model - Example

• Training data: default is 0/1


 default=1 if loan defaulted
• The model will return the probability that a loan with given
characteristics will default
• If you only want a "yes/no" answer, you need a threshold
 The standard threshold is 0.5
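
As a minimal sketch of the thresholding step above: the function name `classify` and the four scores are hypothetical, chosen only to show how the threshold turns probabilities into "yes/no" labels.

```python
def classify(probabilities, threshold=0.5):
    """Turn predicted default probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical model scores for four loans (not taken from the course data).
scores = [0.05, 0.35, 0.50, 0.90]

labels_default = classify(scores)        # standard 0.5 threshold
labels_cautious = classify(scores, 0.3)  # lower threshold flags more loans
```

Lowering the threshold labels more loans as likely defaults, which trades false negatives for false positives.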

Logistic Regression- Visualizing the Model

Overall fraction of default: ~20%

Logistic regression returns a score that estimates the probability that a borrower will default

The graph compares the distribution of defaulters and non-defaulters as a function of the model's predicted probability, for borrowers scoring higher than 0.1

Blue=defaulters

Technical Description (Binary Case)

• y=1 is the case of interest: 'TRUE'


• LHS is called logit(P(y=1))
 hence, "logistic regression"
• logit(P(y=1)) is inverted by the sigmoid function
 standard packages can return probability for you
• Categorical variables are expanded as with linear regression
• Iterative, not closed form solution
 "Iteratively re-weighted least squares"
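
The "LHS" mentioned above refers to the logit equation. In its standard form it can be written as follows, where the symbols b_0 … b_n (coefficients) and x_1 … x_n (input variables) are notation assumed here:

```latex
\[
\operatorname{logit}\bigl(P(y=1)\bigr)
  = \ln\frac{P(y=1)}{1 - P(y=1)}
  = b_0 + b_1 x_1 + \dots + b_n x_n
\]
\[
P(y=1) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \dots + b_n x_n)}}
\]
```

The second line is the sigmoid function, which inverts the logit and returns a probability.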

What do the Coefficients b_i Mean?
• Invert the logit expression:
 P(y=1) / (1 – P(y=1)) = exp(b_0 + b_1 x_1 + … + b_n x_n)
• exp(b_j) is the factor by which the odds of y=1 are multiplied for
every unit change in x_j
• Example: b_creditScore = -0.69
• exp(b_creditScore) = 0.5 = 1/2
• for the same income, loan, and existing debt, the odds of
default are halved for every point increase in credit score
• Standard packages return the significance of the coefficients in the
same way as in linear regression
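
The credit score example can be checked numerically. This is a small sketch: the starting log-odds of -1.0 is a hypothetical borrower, and `sigmoid` and `odds` are helper names introduced here.

```python
import math

def sigmoid(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)

b_credit_score = -0.69
odds_multiplier = math.exp(b_credit_score)  # close to 0.5

# Hypothetical borrower: log-odds of default before a one-point increase.
log_odds_before = -1.0
log_odds_after = log_odds_before + b_credit_score

p_before = sigmoid(log_odds_before)
p_after = sigmoid(log_odds_after)
ratio = odds(p_after) / odds(p_before)      # equals odds_multiplier
```

Note that the odds are halved, not the probability: the change in probability depends on where the borrower starts.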

An Interesting Fact About Logistic Regression
"The probability mass equals the counts"

• If 13% of our loan risk training set defaults
 The sum of all the training set scores will be 13% of the number of
training examples
• If 40% of applicants with income < $50,000 default
 The sum of all the training set scores of people in this income
category will be 40% of the number of examples in this income
category
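
A short sketch of why this holds, assuming the model is fit by maximum likelihood and includes an intercept. Here p_i is the score for training example i, y_i its 0/1 outcome, and x_ij its value for input j (notation assumed here). At the fitted coefficients the likelihood equations give:

```latex
\[
\sum_{i} (y_i - p_i)\, x_{ij} = 0 \quad \text{for every input } j
\]
\[
x_{i0} = 1 \ \text{(intercept)} \;\Longrightarrow\; \sum_{i} p_i = \sum_{i} y_i
\]
\[
x_{ij} = \mathbf{1}[\text{example } i \text{ is in category } g]
\;\Longrightarrow\; \sum_{i \in g} p_i = \sum_{i \in g} y_i
\]
```

The last line applies to a category only when an indicator variable for that category is one of the model inputs.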

Diagnostics
• Hold-out data:
 Does the model predict well on data it hasn't seen?
• N-fold cross-validation: Formal estimate of generalization error
• "Pseudo-R2" : 1 – (deviance/null deviance)
 Deviance, null deviance both reported by most standard packages
 The fraction of "variance" that is explained by the model
 Used the way R2 is used
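
The pseudo-R2 formula above can be sketched directly. The outcomes and scores below are hypothetical, and the null model is assumed to predict the overall base rate for every example.

```python
import math

def deviance(y_true, p_pred):
    """-2 * log-likelihood of 0/1 outcomes under predicted probabilities."""
    return -2.0 * sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    )

def pseudo_r2(y_true, p_pred):
    """1 - (deviance / null deviance)."""
    base_rate = sum(y_true) / len(y_true)
    null_dev = deviance(y_true, [base_rate] * len(y_true))
    return 1.0 - deviance(y_true, p_pred) / null_dev

# Hypothetical outcomes and model scores (not course data).
y = [0, 0, 0, 1, 1]
p = [0.1, 0.2, 0.3, 0.7, 0.9]
r2 = pseudo_r2(y, p)
r2_null = pseudo_r2(y, [0.4] * 5)  # a model no better than the base rate
```

A model that only predicts the base rate scores 0; scores closer to 1 indicate that the model explains more of the deviance.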

Diagnostics (Cont.)
• Sanity check the coefficients
 Do the signs make sense? Are the coefficients excessively large?
 Wrong sign is an indication of correlated inputs, but doesn't
necessarily affect predictive power.
 Excessively large coefficient magnitudes may indicate strongly
correlated inputs; you may want to consider eliminating some
variables, or using regularized regression techniques.
 Unfortunately, regularized logistic regression is not standard.
 Infinite magnitude coefficients could indicate a variable that strongly
predicts a subset of the output (and doesn't predict well on the rest).
 Try a Decision Tree on that variable, to see if you should segment the
data before regressing.

Diagnostics: ROC Curve

Area under the curve (AUC) tells you how well the model predicts. (Ideal AUC = 1)

For logistic regression, the ROC curve can help set the classifier threshold
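
One way to sketch the ROC curve and AUC by hand is shown below. The labels and scores are hypothetical, and `roc_points` and `auc` are names introduced for this illustration; each distinct score is tried as a threshold.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per candidate threshold, strictest first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical true labels and model scores (not course data).
y = [0, 0, 1, 0, 1, 1]
s = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
curve = roc_points(y, s)
area = auc(curve)
```

Each point on the curve corresponds to one threshold, so scanning the curve is a way to pick the threshold with an acceptable balance of true and false positives.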

Diagnostics: Plot the Histograms of Scores
[Figure: histograms of the model scores for the two classes, annotated "good separation"]

Logistic Regression - Reasons to Choose (+) and
Cautions (-)
Reasons to Choose (+)
• Explanatory value:
 Relative impact of each variable on the outcome, in a more complicated way than linear regression
• Robust with redundant variables, correlated variables
 Lose some explanatory value
• Concise representation with the coefficients
• Easy to score data
• Returns good probability estimates of an event
• Preserves the summary statistics of the training data
 "The probabilities equal the counts"

Cautions (-)
• Does not handle missing values well
• Assumes that each variable affects the log-odds of the outcome linearly and additively
 Variable transformations and modeling variable interactions can alleviate this
 A good idea to take the log of monetary amounts or any variable with a wide dynamic range
• Cannot handle variables that affect the outcome in a discontinuous way
 Step functions
• Doesn't work well with discrete drivers that have a lot of distinct values
 For example, ZIP code

Check Your Knowledge

1. What is a logit and how do we compute class probabilities
from the logit?
2. How is ROC curve used to diagnose the effectiveness of the
logistic regression model?
3. What is Pseudo R2 and what does it measure in a logistic
regression model?
4. How do you describe a binary class problem?
5. Compare and contrast linear and logistic regression methods.

Module 4: Advanced Analytics – Theory and
Methods
Lesson 4: Logistic Regression - Summary
During this lesson the following topics were covered:
• Technical description of a logistic regression model
• Common use cases for the logistic regression model
• Interpretation and scoring with the logistic regression model
• Diagnostics for validating the logistic regression model
• Reasons to Choose (+) and Cautions (-) of the logistic
regression model

Lab Exercise 7: Logistic Regression
This Lab is designed to investigate and practice Logistic
Regression.

After completing the tasks in this lab you should be able to:
• Use R functions for Logistic Regression – (also
known as Logit)
• Predict the dependent variables based on the model
• Investigate different statistical parameter tests that
measure the effectiveness of the model

Lab Exercise 7: Logistic Regression - Workflow
1 • Set the Working Directory

2 • Define the problem and review input data

3 • Read in and Examine the Data

4 • Build and Review logistic regression Model

5 • Review and interpret the coefficients

6 • Visualize the Model Using the Plot Function

7 • Use relevel Function to re-level the Price factor with value 30 as the base reference

8 • Plot the ROC Curve

9 • Predict Outcome given Age and Income

10 • Predict outcome for a sequence of Age values at price 30 and income at its mean

11 • Predict outcome for a sequence of income at price 30 and Age at its mean

12 • Use Logistic regression as a classifier

Module 4: Advanced Analytics – Theory and
Methods
Lesson 5: Naïve Bayesian Classifiers

During this lesson the following topics are covered:


• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier

Classifiers

Where in the catalog should I place this product listing?


Is this email spam?
Is this politician Democrat/Republican/Green?

• Classification: assign labels to objects.


• Usually supervised: training set of pre-classified examples.
• Our examples:
 Naïve Bayes,
 Decision Trees
 (and Logistic Regression)

Naïve Bayesian Classifier : What is it?
• Used for classification
 Actually returns a probability score on class membership:
 In practice, probabilities generally close to either 0 or 1
 Not as well calibrated as Logistic Regression
• Input variables are discrete
 Popular for text classification
• Output:
 Most implementations: log probability for each class
 You could convert it to a probability, but in practice, we stay in the
log space

Naïve Bayesian Classifier - Use Cases
• Preferred method for many text classification problems.
 Try this first; if it doesn't work, try something more complicated
• Use cases
 Spam filtering, other text classification tasks
 Fraud detection

Building a Training Dataset
Example: Predicting Good or Bad Credit
Predict the credit behavior of a credit card applicant from the applicant's attributes:
• personal status
• job type
• housing type
• savings account
These are all categorical variables; better suited to Naïve Bayesian classifier than to logistic regression.

Technical Description - Bayes' Law

• B is the class label:
 B ∈ {b_1, b_2, …, b_n}
• A is the specific assignment of input variables
 A = (a_1, a_2, …, a_m)
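
For reference, Bayes' Law in its standard form, written with the class label B and the input assignment A defined above:

```latex
\[
P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}
\]
\[
P(b_j \mid a_1, \dots, a_m)
  = \frac{P(a_1, \dots, a_m \mid b_j)\, P(b_j)}{P(a_1, \dots, a_m)}
\]
```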

The "Naïve" Assumption: Conditional Independence

so:

Independent of class – so it
cancels out
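
The conditional independence assumption and its consequence can be written out as follows. This is the standard formulation, using b_j for a class label and a_1 … a_m for the input values:

```latex
\[
P(a_1, \dots, a_m \mid b_j) = \prod_{i=1}^{m} P(a_i \mid b_j)
\]
\[
P(b_j \mid a_1, \dots, a_m)
  = \frac{P(b_j) \prod_{i=1}^{m} P(a_i \mid b_j)}{P(a_1, \dots, a_m)}
  \;\propto\; P(b_j) \prod_{i=1}^{m} P(a_i \mid b_j)
\]
```

The denominator P(a_1, …, a_m) is the term that is independent of the class, so it can be dropped when comparing classes.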

Building a Naïve Bayesian Classifier

• To build a Naïve Bayesian classifier, collect the following statistics from the training data:
 P(b_j) for all the class labels.
 P(a_i | b_j) for all possible assignments of the input variables and class labels.

Credit example:
• class labels: {good, bad}
 P(good) = 0.7
 P(bad) = 0.3
• aggregates for housing
 P(own|bad) = 0.62
 P(own|good) = 0.75
 P(rent|bad) = 0.23
 P(rent|good) = 0.14
 … and so on

Building a Naïve Bayesian Classifier (Continued)

• Assign the label that maximizes the value
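
In the standard formulation, the value being maximized is the class prior times the product of the per-attribute conditional probabilities (the symbol for the chosen label is notation assumed here):

```latex
\[
\hat{b} = \arg\max_{b_j} \; P(b_j) \prod_{i=1}^{m} P(a_i \mid b_j)
\]
```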

Back to Credit Example

Credit Example: X
• female
• owns home
• self-employed
• savings > $1000

a_i          b_j    P(a_i | b_j)
female       good   0.28
female       bad    0.36
own          good   0.75
own          bad    0.62
self emp     good   0.14
self emp     bad    0.17
savings>1K   good   0.06
savings>1K   bad    0.02

P(good|X) ~ (0.28*0.75*0.14*0.06)*0.7 = 0.0012
P(bad|X) ~ (0.36*0.62*0.17*0.02)*0.3 = 0.0002

P(good|X) > P(bad|X): Assign X the label "good"
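
The arithmetic in this example can be reproduced with a few lines. The probabilities are the ones given on the slide; the dictionary layout and the name `nb_score` are choices made for this sketch.

```python
priors = {"good": 0.7, "bad": 0.3}
cond = {
    "good": {"female": 0.28, "own": 0.75, "self emp": 0.14, "savings>1K": 0.06},
    "bad":  {"female": 0.36, "own": 0.62, "self emp": 0.17, "savings>1K": 0.02},
}
x = ["female", "own", "self emp", "savings>1K"]

def nb_score(label, attributes):
    """Unnormalized score: P(label) times the product of P(attribute | label)."""
    score = priors[label]
    for a in attributes:
        score *= cond[label][a]
    return score

scores = {label: nb_score(label, x) for label in priors}
predicted = max(scores, key=scores.get)
```

The scores are not probabilities (they do not sum to 1), but only their relative size matters for choosing the label.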

Implementation Guideline

• High-dimensional problems are prone to numerical underflow and unobserved events; it's better to calculate the log probability (with smoothing).

(Smoothing technique varies with implementation)
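
A minimal sketch of both issues, assuming add-one (Laplace) smoothing, which is only one of several smoothing techniques. The counts are hypothetical and `smoothed_log_prob` is a name introduced here.

```python
import math
from collections import Counter

def smoothed_log_prob(count, total, n_values, alpha=1.0):
    """log((count + alpha) / (total + alpha * n_values)): add-alpha smoothing."""
    return math.log((count + alpha) / (total + alpha * n_values))

# Hypothetical counts of one attribute within one class (not course data).
housing_counts = Counter({"own": 186, "rent": 70})
total = sum(housing_counts.values())
values = ["own", "rent", "free"]  # "free" was never observed in this class

log_probs = {v: smoothed_log_prob(housing_counts[v], total, len(values))
             for v in values}

# Summing logs instead of multiplying probabilities avoids underflow.
tiny = [1e-5] * 100
product = math.prod(tiny)                 # underflows to 0.0
log_sum = sum(math.log(p) for p in tiny)  # stays finite
```

Without smoothing, the unobserved value "free" would get probability 0 and its log would be undefined; with smoothing it gets a small but non-zero probability.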

Diagnostics
• Hold-out data
 How well does the model classify new instances?
• Cross-validation
• ROC curve/AUC

Diagnostics: Confusion Matrix

                   Prediction
True Class     bad    good    total
bad            262      38      300
good            29     671      700
total          291     709     1000

false positives: 38 (true class bad, predicted good)
false negatives: 29 (true class good, predicted bad)

accuracy: sum of diagonals / sum of table = (262+671)/1000 = 0.93
FPR: false positives / sum of first row = 38/300 = 0.13
FNR: false negatives / sum of second row = 29/700 = 0.04
Precision: true positives / sum of second column = 671/709 = 0.95
Recall: true positives / sum of second row = 671/700 = 0.96
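
The same metrics can be computed from the four cells of the matrix. This sketch uses the counts from the slide and treats "good" as the positive class; the nested-dictionary layout is a choice made here.

```python
# Rows are the true class, columns the predicted class.
matrix = {
    "bad":  {"bad": 262, "good": 38},
    "good": {"bad": 29,  "good": 671},
}
tp = matrix["good"]["good"]  # true good, predicted good
fn = matrix["good"]["bad"]   # true good, predicted bad
fp = matrix["bad"]["good"]   # true bad, predicted good
tn = matrix["bad"]["bad"]    # true bad, predicted bad

accuracy = (tp + tn) / (tp + tn + fp + fn)
fpr = fp / (fp + tn)
fnr = fn / (fn + tp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```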

Naïve Bayesian Classifier - Reasons to Choose (+)
and Cautions (-)
Reasons to Choose (+)
• Handles missing values quite well
• Robust to irrelevant variables
• Easy to implement
• Easy to score data
• Resistant to over-fitting
• Computationally efficient
• Handles very high dimensional problems
• Handles categorical variables with a lot of levels

Cautions (-)
• Numeric variables have to be discrete (categorized)
 Intervals
• Sensitive to correlated variables
 "Double-counting"
• Not good for estimating probabilities
 Stick to class label or yes/no

Check Your Knowledge
1. Consider the following Training Data Set:

X1  X2  X3  Y
1   1   1   0
1   1   0   0
0   0   0   0
0   1   0   1
1   0   1   1
0   1   1   1

• Apply the Naïve Bayesian Classifier to this data set and compute P(y = 1|X) for X = (1,0,0)
• Show your work

2. List some prominent Use Cases of the Naïve Bayesian Classifier.
3. What gives the Naïve Bayesian Classifier the advantage of being
computationally inexpensive?
4. Why should we use log-likelihoods rather than pure probability
values in the Naïve Bayesian Classifier?

Check Your Knowledge (Continued)
5. What is a confusion matrix and how is it used to evaluate the
effectiveness of the model?
6. Consider the following data set with two input features,
temperature and season
• What is the Naïve Bayesian assumption?
• Is the Naïve Bayesian assumption satisfied for this problem?

Temperature      Season    Electricity Usage (Class)
Below Average    Winter    High
Above Average    Winter    Low
Below Average    Summer    Low
Above Average    Summer    High

Module 4: Advanced Analytics – Theory and
Methods
Lesson 5: Naïve Bayesian Classifiers - Summary
During this lesson the following topics were covered:
• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier

Lab Exercise 8: Naïve Bayesian Classifier
This Lab is designed to investigate and practice the
Naïve Bayesian Classifier analytic technique.

After completing the tasks in this lab you should be able to:
• Use R functions for Naïve Bayesian Classification
• Apply the requirements for generating
appropriate training data
• Validate the effectiveness of the Naïve Bayesian
Classifier with the big data

Lab Exercise 8: Naïve Bayesian Classifier Part1 -
Workflow
1 • Set working directory and review training and test data

2 • Install and load library “e1071”

3 • Read in and review data

4 • Build the Naïve Bayesian classifier Model from First Principles

5 • Predict the Results

6 • Execute the Naïve Bayesian Classifier with e1071 package

7 • Predict the Outcome of “Enrolls” with the Testdata

8 • Review results

Lab Exercise 8: Naïve Bayesian Classifier Part2 -
Workflow
1 • Define the Problem (Translating to an Analytics Question)

2 • Establish the ODBC Connection

3 • Open Connections to ODBC Database

4 • Build the Training Dataset and the Test Dataset from the Database

5 • Extract the first 10000 records for the training data set and the remaining 10 for the test

6 • Execute the NB Classifier

7 • Validate the Effectiveness of the NB Classifier with a Confusion Matrix

8 • Execute NB Classifier with MADlib Function Calls Within the Database

Module 4: Advanced Analytics – Theory and
Methods
Lesson 6: Decision Trees

During this lesson the following topics are covered:


• Overview of Decision Tree classifier
• General algorithm for Decision Trees
• Decision Tree use cases
• Entropy, Information gain
• Reasons to Choose (+) and Cautions (-) of Decision Tree
classifier
• Classifier methods and conditions in which they are best
suited

Decision Tree Classifier - What is it?
• Used for classification:
 Returns probability scores of class membership
 Well-calibrated, like logistic regression
 Assigns label based on highest scoring class
 Some Decision Tree algorithms return simply the most likely class
 Regression Trees: a variation for regression
 Returns average value at every node
 Predictions can be discontinuous at the decision boundaries
• Input variables can be continuous or discrete
• Output:
 A tree that describes the decision flow.
 Leaf nodes return either a probability score, or simply a classification.
 Trees can be converted to a set of "decision rules"
 "IF income < $50,000 AND mortgage_amt > $100K THEN default=T with
75% probability"

Decision Tree – Example of Visual Structure

Gender
  Female → Income
    <=45,000 → Yes
    >45,000 → No
  Male → Age
    <=40 → Yes
    >40 → No

Branch – outcome of test
Internal Node – decision on variable
Leaf Node – class label

Decision Tree Classifier - Use Cases
• When a series of questions (yes/no) are answered to arrive at a
classification
 Biological species classification
 Checklist of symptoms during a doctor’s evaluation of a patient
• When “if-then” conditions are preferred to linear models.
 Customer segmentation to predict response rates
 Financial decisions such as loan approval
 Fraud detection
• Short Decision Trees are the most popular "weak learner" in
ensemble learning techniques

Example: The Credit Prediction Problem
Root: good, 700/1000, p(good)=0.7
  savings=(500:1000), >=1000, no known savings → good, 245/294, p(good)=0.83
  savings= <100, (100:500)
    housing=own → good, 349/501, p(good)=0.7
    housing=free, rent
      personal=female, male div/sep → bad, 36/88, p(good)=0.42
      personal=male mar/wid, male single → good, 70/119, p(good)=0.6

General Algorithm
• To construct tree T from training set S
 If all examples in S belong to a single class C, or S is sufficiently
"pure", then make a leaf labeled C.
 Otherwise:
 select the “most informative” attribute A
 partition S according to A’s values
 recursively construct sub-trees T1, T2, ..., for the subsets of S

• The details vary according to the specific algorithm – CART, ID3, C4.5 – but the general idea is the same
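
The recursion above can be sketched in a few lines. This is an ID3-style illustration under simplifying assumptions: all attributes are categorical, "most informative" means highest information gain, and a node becomes a leaf only when it is fully pure or no attributes remain. The names `build_tree` and `predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    # Leaf: the node is pure, or there is nothing left to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {"attr": best, "children": {}}
    for value in set(r[best] for r in rows):
        keep = [i for i, r in enumerate(rows) if r[best] == value]
        tree["children"][value] = build_tree(
            [rows[i] for i in keep], [labels[i] for i in keep],
            [a for a in attrs if a != best])
    return tree

def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree

# Toy data in which savings alone determines the label.
rows = [{"housing": "own", "savings": "high"},
        {"housing": "rent", "savings": "high"},
        {"housing": "own", "savings": "low"},
        {"housing": "rent", "savings": "low"}]
labels = ["good", "good", "bad", "bad"]
tree = build_tree(rows, labels, ["housing", "savings"])
```

On the toy data the tree splits on savings at the root and stops, because both resulting nodes are pure.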

Step 1: Pick the Most “Informative" Attribute

• Entropy-based methods are one common way

• H = 0 if p(c) = 0 or 1 for every class
 So for binary classification, H=0 is a "pure" node
• H is maximum when all classes are equally probable
 For binary classification, H=1 when classes are 50/50
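
The entropy H used above is, in its standard (Shannon) form with base-2 logarithms, the following, where p(c) is the fraction of examples in class c:

```latex
\[
H = -\sum_{c} p(c) \log_2 p(c)
\]
\[
\text{Binary case:}\quad H = -p \log_2 p - (1 - p) \log_2 (1 - p)
\]
```

Terms with p(c) = 0 are taken to contribute 0, which is what makes a pure node have H = 0.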

Step 1: Pick the most "informative" attribute
(Continued)

• First, we need to get the base entropy of the data
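
As a sketch of this step, assuming the base entropy refers to the credit data with 700 good and 300 bad examples out of 1000 (the class split shown in the credit example):

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

h_credit = entropy([0.7, 0.3])  # base entropy of the credit data
h_even = entropy([0.5, 0.5])    # maximum for two classes
h_pure = entropy([1.0, 0.0])    # a pure node
```

The base entropy comes out at roughly 0.88 bits, between the pure case (0) and the 50/50 case (1).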

Step 1: Pick the Most “Informative" Attribute (Continued)
Conditional Entropy

• The weighted sum of the class entropies for each value of the
attribute
• In English: attribute values (home owner vs. renter) give more
information about class membership
 "Home owners are more likely to have good credit than renters"
• Conditional entropy should be lower than unconditioned
entropy
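
Written out in standard notation, with C the class and A the attribute (symbols assumed here), the weighted sum described above is:

```latex
\[
H(C \mid A) = \sum_{a} P(A = a)\, H(C \mid A = a)
            = -\sum_{a} P(a) \sum_{c} P(c \mid a) \log_2 P(c \mid a)
\]
```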

Conditional Entropy Example

housing             free    own     rent
P(housing)          0.108   0.713   0.179
P(bad | housing)    0.407   0.261   0.391
P(good | housing)   0.592   0.739   0.609
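
The conditional entropy for housing can be computed from the table. In this sketch P(good | housing) is taken as 1 - P(bad | housing), and the variable names are chosen for illustration.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

p_housing = {"free": 0.108, "own": 0.713, "rent": 0.179}
p_bad_given = {"free": 0.407, "own": 0.261, "rent": 0.391}

# Weighted sum of the class entropy within each housing value.
h_credit_given_housing = sum(
    p_housing[h] * entropy([p_bad_given[h], 1.0 - p_bad_given[h]])
    for h in p_housing
)
```

The result is about 0.869 bits, slightly below the entropy of a 70/30 class split (about 0.881 bits).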

Step 1: Pick the Most “Informative" Attribute
(Continued) Information Gain

• The information that you gain, by knowing the value of an attribute
• So the "most informative" attribute is the attribute with the
highest InfoGain
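
In standard notation (C for the class, A for the attribute, both symbols assumed here), information gain is the drop from the base entropy to the conditional entropy. The worked line uses the credit data's 70/30 class split and the housing probabilities given earlier:

```latex
\[
\mathrm{InfoGain}(A) = H(C) - H(C \mid A)
\]
\[
\mathrm{InfoGain}(\text{housing}) \approx 0.881 - 0.869 \approx 0.013
\]
```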

Back to the Credit Prediction Example

Attribute InfoGain
job 0.001
housing 0.013
personal_status 0.006
savings_status 0.028

Step 2 & 3: Partition on the Selected Variable
• Step 2: Find the partition with the highest InfoGain
 In our example the selected partition has InfoGain = 0.028
• Step 3: At each resulting node, repeat Steps 1 and 2
 until node is "pure enough"
• Pure nodes => no information gain by splitting on other attributes

[Figure: the root node (good, 700/1000, p(good)=0.7) split on savings into savings= <100, (100:500) and savings=(500:1000), >=1000, no known savings; one resulting node is good, 245/294, p(good)=0.83]

Diagnostics
• Hold-out data
• ROC/AUC
• Confusion Matrix
• FPR/FNR, Precision/Recall
• Do the splits (or the "rules") make sense?
 What does the domain expert say?
• How deep is the tree?
 Too many layers are prone to over-fit
• Do you get nodes with very few members?
 Over-fit

Decision Tree Classifier - Reasons to Choose (+)
& Cautions (-)
Reasons to Choose (+)
• Takes any input type (numeric, categorical)
 In principle, can handle categorical variables with many distinct values (ZIP code)
• Robust with redundant variables, correlated variables
• Naturally handles variable interaction
• Handles variables that have non-linear effect on outcome
• Computationally efficient to build
• Easy to score data
• Many algorithms can return a measure of variable importance
• In principle, decision rules are easy to understand

Cautions (-)
• Decision surfaces can only be axis-aligned
• Tree structure is sensitive to small changes in the training data
• A "deep" tree is probably over-fit
 Because each split reduces the training data for subsequent splits
• Not good for outcomes that are dependent on many variables
 Related to over-fit problem, above
• Doesn't naturally handle missing values
 However most implementations include a method for dealing with this
• In practice, decision rules can be fairly complex

Which Classifier Should I Try?
Typical Questions, each followed by the Recommended Method

• Do I want class probabilities, rather than just class labels?
 Logistic regression, Decision Tree
• Do I want insight into how the variables affect the model?
 Logistic regression, Decision Tree
• Is the problem high-dimensional?
 Naïve Bayes
• Do I suspect some of the inputs are correlated?
 Decision Tree, Logistic Regression
• Do I suspect some of the inputs are irrelevant?
 Decision Tree, Naïve Bayes
• Are there categorical variables with a large number of levels?
 Naïve Bayes, Decision Tree
• Are there mixed variable types?
 Decision Tree, Logistic Regression
• Is there non-linear data or discontinuities in the inputs that will affect the outputs?
 Decision Tree

Check Your Knowledge

1. How do you define information gain?
2. Under what conditions is the value of entropy at a maximum, and when is it at
a minimum?
3. List three use cases of Decision Trees.
4. What are weak learners and how are they used in ensemble methods?
5. Why do deep trees, and data sets whose outcomes depend on many variables,
tend to produce an over-fitted model?
6. What classification method would you recommend for the following cases:
 High dimensional data
 Data in which outputs are affected by non-linearity and discontinuity in
the inputs

Module 4: Advanced Analytics – Theory and
Methods
Lesson 6: Decision Trees - Summary

During this lesson the following topics were covered:


• Overview of Decision Tree classifier
• General algorithm for Decision Trees
• Decision Tree use cases
• Entropy, Information gain
• Reasons to Choose (+) and Cautions (-) of Decision Tree
classifier
• Classifier methods and conditions in which they are best
suited

Lab Exercise 9: Decision Trees
This Lab is designed to investigate and practice Decision
Tree (DT) models covered in the course work.

After completing the tasks in this lab you should be able to:
• Use R functions for Decision Tree models
• Predict the outcome of an attribute based on the
model

Lab Exercise 9: Decision Trees - Workflow

1 • Set the Working Directory

2 • Read in the Data

3 • Build the Decision Tree

4 • Plot the Decision Tree

5 • Prepare Data to Test the Fitted Model

6 • Predict a Decision from the Fitted Model

Module 4: Advanced Analytics – Theory and
Methods
Lesson 7: Time Series Analysis
During this lesson the following topics are covered:
• Time Series Analysis and its applications in forecasting
• ARIMA Model
• Implementing the Box-Jenkins Methodology using R
• Reasons to Choose (+) and Cautions (-) with Time Series
Analysis

Time Series Analysis
What will our December sales be
(based on the sales of the last few months)?
• Time Series Analysis accounts for the internal structure of
measurements taken over time
 Trend
 Seasonality
 Cycles
 Irregular
• Time series: Ordered sequence of numerical values, measured
over equally spaced time intervals
• The goal can be to identify the internal structure, or to forecast
near-future events based on recent history
• Our Example: Box-Jenkins Methods (ARMA, ARIMA)

Box-Jenkins: What is it?
Used for predicting the next few observations in a time series,
based on the last few observations.
• Input: Trend and Seasonally-adjusted time series
• Output: Expected future value of the time series
• Applies ARMA (Autoregressive Moving Averages) and ARIMA
(Autoregressive Integrated Moving Averages) model

Use Cases
• Forecast next month's sales
 Based on last few months
• Forecast tomorrow's stock price
 Based on last few days
• Forecast power demand in the near term
 Based on last few days

Modeling a Time Series
• Let's model the time series as
Y_t = T_t + S_t + R_t, t = 1, ..., n
• T_t: Trend term
 Sales of iPads steadily increased over the last few years: trending
upward.
• S_t: The seasonal term (short term periodicity)
 Retail sales fluctuate in a regular pattern over the course of a year.
 Typically, sales increase from September through December and
decline in January and February.
• R_t: Random fluctuation
 Noise, or regular high frequency patterns in fluctuation

Stationary Sequences
Many time series analyses (Basic Box-Jenkins in particular) assume
stationary sequences:
 Mean, variance and autocorrelation structure do not change over
time
 In practice, this often means you must de-trend and seasonally
adjust the data
 ARIMA in principle can make the data (more) stationary with
differencing

De-trending

• In this example, we see a linear trend, so we fit a linear model in time
 T*_t = m·t + b
• The de-trended series is then
 Y1_t = Y_t – T*_t
• In some cases, may have to fit a non-linear model
 Quadratic
 Exponential
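
A minimal sketch of linear de-trending, assuming the trend is fit by ordinary least squares against the time index t = 1..n. The series is synthetic and the names `fit_line` and `detrend` are introduced here.

```python
def fit_line(values):
    """Least-squares slope m and intercept b of values against t = 1..n."""
    n = len(values)
    t = range(1, n + 1)
    t_mean = sum(t) / n
    y_mean = sum(values) / n
    m = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, values))
         / sum((ti - t_mean) ** 2 for ti in t))
    return m, y_mean - m * t_mean

def detrend(values):
    """Subtract the fitted linear trend from each observation."""
    m, b = fit_line(values)
    return [y - (m * t + b) for t, y in enumerate(values, start=1)]

# Synthetic series: an upward line plus an alternating wiggle.
series = [2.0 * t + 5.0 + wiggle
          for t, wiggle in zip(range(1, 9), [1, -1] * 4)]
m, b = fit_line(series)
residual = detrend(series)
```

The residual series keeps the wiggle but no longer drifts upward; a quadratic or exponential trend would need a different fitted model.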

Seasonal Adjustment

• Often, we know the "season"
 For both retail sales and CO2 concentration, we can model the period as being a year, with variation at the month level
• Simple ad-hoc adjustment: take several years of data, calculate the average value for each month, and subtract that from Y1_t
 Y2_t = Y1_t – S*_t
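
The ad-hoc adjustment can be sketched as follows. The series is a synthetic, already de-trended example with a four-step season rather than twelve months, and the function names are hypothetical.

```python
def seasonal_means(values, period=12):
    """Average value at each position in the season (e.g. each calendar month)."""
    return [sum(values[k::period]) / len(values[k::period])
            for k in range(period)]

def seasonally_adjust(values, period=12):
    """Subtract the per-position seasonal average from each observation."""
    means = seasonal_means(values, period)
    return [v - means[i % period] for i, v in enumerate(values)]

# Synthetic de-trended series: three "years" of a four-step season.
pattern = [3.0, -1.0, -4.0, 2.0]
y1 = pattern * 3
y2 = seasonally_adjust(y1, period=4)
```

Because this synthetic series is purely seasonal, nothing is left after the adjustment; with real data the remainder is the random fluctuation term.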

ACF & PACF
• Auto Correlation Function (ACF)
 Correlation of the values of the time series with itself
 Similarity of the observations as a function of time
 Autocorrelation "carries over"
 if Xt is correlated with Xt-1, it is also correlated with Xt-2 (though to a
lesser degree)
• Partial Auto Correlation Function (PACF)
 The autocorrelation at lag k that is not explained by "carry over" from the shorter lags
 Helps determine the order of autoregressive models
 Where does PACF go to zero?
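As an illustration of "carry over", the sketch below simulates an AR(2) process (the coefficients 0.5 and 0.3 are arbitrary choices) and compares its ACF and PACF. The ACF should decay slowly, while the PACF should be close to zero beyond lag 2.

```r
# Simulate an AR(2) process; coefficients are arbitrary illustration values.
set.seed(123)
x <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 5000)

# acf() returns lag 0 first, so drop it to line up with pacf(), which starts at lag 1.
acf_vals  <- as.numeric(acf(x, lag.max = 10, plot = FALSE)$acf)[-1]
pacf_vals <- as.numeric(pacf(x, lag.max = 10, plot = FALSE)$acf)

# Side-by-side view: the ACF carries over many lags, the PACF drops off after lag 2.
comparison <- round(rbind(acf = acf_vals, pacf = pacf_vals), 2)
print(comparison)
```

Reading the order of an AR model off the PACF in this way is a heuristic; real data rarely cut off this cleanly.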

ARMA Model

• The simplest Box-Jenkins model
• Combination of two process models
 Autoregressive: Yt is a linear combination of its last p values
 Moving average: Yt is a constant value plus the effects of a dampened white noise process over the last q time steps
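For reference, the two process models combine into the usual textbook form of an ARMA(p, q) model. The symbols c (constant), \phi_i (autoregressive coefficients), \theta_j (moving average coefficients) and \varepsilon_t (white noise) are standard notation and do not appear on the slide.

```latex
Y_t = c + \sum_{i=1}^{p} \phi_i \, Y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```

The first sum is the autoregressive part (the last p values of the series); the second sum is the moving average part (the white noise terms of the last q time steps).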

ARIMA Model
A combination of AR and MA models
The general non-seasonal model is known as ARIMA (p, d, q):
p is the number of autoregressive terms
d is the number of differences
q is the number of moving average terms

• ARIMA adds a differencing term, d, to make the series more stationary
 Rule of thumb:
 A linear trend can be removed by d=1
 A quadratic trend by d=2, and so on…
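The rule of thumb can be checked on noise-free toy sequences (the coefficients are arbitrary): one difference flattens a linear trend, and two differences flatten a quadratic one.

```r
t_idx <- 1:20
linear    <- 3 + 2 * t_idx                    # toy linear trend
quadratic <- 1 + 2 * t_idx + 0.5 * t_idx^2    # toy quadratic trend

# d = 1 turns the linear trend into a constant (its slope).
d1_linear <- diff(linear, differences = 1)

# d = 1 is not enough for the quadratic trend: a linear trend remains.
d1_quadratic <- diff(quadratic, differences = 1)

# d = 2 turns the quadratic trend into a constant.
d2_quadratic <- diff(quadratic, differences = 2)
```

Each difference shortens the series by one observation, so d2_quadratic has 18 values.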

Model Selection
• The Data Scientist must pick p, d and q
 An "art form" that requires domain knowledge, modeling
experience, and a few iterations
 A simple AR model (q = 0) or MA model (p = 0) might be easier for the novice
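One way to organize the "few iterations" is to fit a small grid of candidate orders and compare them. The sketch below uses AIC as the comparison criterion, which is an assumption of this example and not something the slides prescribe; the simulated AR(1) series is also hypothetical.

```r
# Hypothetical stationary series (so d = 0 for every candidate).
set.seed(99)
y <- arima.sim(model = list(ar = 0.7), n = 300)

# Candidate (p, q) pairs; a fit that fails is simply skipped.
candidates <- expand.grid(p = 0:2, q = 0:2)
candidates$aic <- NA_real_
for (i in seq_len(nrow(candidates))) {
  fit <- tryCatch(
    arima(y, order = c(candidates$p[i], 0, candidates$q[i]), method = "ML"),
    error = function(e) NULL
  )
  if (!is.null(fit)) candidates$aic[i] <- fit$aic
}

# Lowest AIC among the candidates; treat this as a starting point, not a verdict.
best <- candidates[which.min(candidates$aic), ]
```

Domain knowledge and the ACF/PACF plots should still be used to judge whether the selected order is sensible.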

Time Series Analysis - Reasons to Choose (+) & Cautions (-)
Reasons to Choose (+)
• Minimal data collection
 Only have to collect the series itself
 Do not need to input drivers
• Designed to handle the inherent autocorrelation of lagged time series
 Compared to simple linear regression
 Once you've seasonally/trend adjusted
Cautions (-)
• No meaningful drivers: prediction based only on past performance
 No explanatory value
 Can't do "what-if" scenarios
 Can't stress test
• It's an "art form" to select appropriate parameters
• Suitable for short term predictions only

Time Series Analysis with R
• Getting the data and plotting
 The function "ts" is used to create time series objects. Data is made into an R time series via
mydata.data <- ts(mydata, start = c(1999, 1), frequency = 12)
 Model building: use plot and boxplot
• Differencing
diff(mydata.data, lag = 1, differences = 1)
• acf: Computes (and by default plots) estimates of the autocovariance or autocorrelation function
• pacf: Computes (and by default plots) the partial autocorrelations
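A minimal end-to-end sketch of these calls, assuming a simulated vector stands in for mydata (the real data set is not shown in the slides). Plotting is guarded with interactive() so the script also runs in batch mode.

```r
# Hypothetical stand-in for mydata: 48 monthly observations.
set.seed(2012)
mydata <- cumsum(rnorm(48, mean = 1, sd = 2))

# Monthly time series starting in January 1999.
mydata.data <- ts(mydata, start = c(1999, 1), frequency = 12)

if (interactive()) {
  plot(mydata.data)                            # time plot
  boxplot(mydata.data ~ cycle(mydata.data))    # spread by calendar month
}

# First difference at lag 1.
d1 <- diff(mydata.data, lag = 1, differences = 1)

# Autocorrelation and partial autocorrelation of the differenced series.
acf_d1  <- acf(d1, plot = FALSE)
pacf_d1 <- pacf(d1, plot = FALSE)
```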

Time Series Analysis with R (Continued)
• ar: Fit an autoregressive time series model to the data
• arima: Fit an ARIMA model to a univariate time series
• predict: Do model predictions
 "predict" is a generic function for predictions from the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument
• arima.sim: Simulate from an ARIMA model
• ARMAtoMA: Convert an ARMA process to an infinite MA process
• decompose: Decompose a time series into seasonal, trend and irregular components using moving averages
 Deals with additive or multiplicative seasonal component
• stl: Decompose a time series into seasonal, trend and irregular components using loess
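A short tour of several of these functions on a simulated series. The ARMA(1,1) coefficients, the sample size and the fitted order are arbitrary choices for illustration.

```r
# arima.sim: simulate from an ARMA(1,1) model (hypothetical coefficients).
set.seed(10)
y <- arima.sim(model = list(ar = 0.6, ma = 0.4), n = 400)

# ar: fit an autoregressive model; the order is selected up to order.max.
ar_fit <- ar(y, order.max = 5)

# arima: fit an ARIMA(p, d, q) model, here with p = 1, d = 0, q = 1.
arma_fit <- arima(y, order = c(1, 0, 1))

# predict: dispatches on the class of arma_fit and returns forecasts ($pred)
# together with their standard errors ($se).
fc <- predict(arma_fit, n.ahead = 12)

# ARMAtoMA: MA(infinity) weights of an AR(1) process with coefficient 0.5,
# which are the powers 0.5, 0.25, 0.125, 0.0625.
psi <- ARMAtoMA(ar = 0.5, lag.max = 4)
```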

Check Your Knowledge

Your Thoughts?

1. What is a time series and what are the key components of a time series?
2. How do we "de-trend" time series data?
3. What makes data stationary?
4. How is seasonality removed from the data?
5. What are the modeling parameters in ARIMA?
6. How do you use ACF and PACF to determine the "stationarity" of time series data?

Module 4: Advanced Analytics – Theory and Methods

Lesson 7: Time Series Analysis - Summary

During this lesson the following topics were covered:


• Time Series Analysis and its applications in forecasting
• ARIMA Model
• Implementing the Box-Jenkins Methodology using R
• Reasons to Choose (+) and Cautions (-) with Time Series Analysis

Lab Exercise 10: Time Series Analysis
This lab is designed to investigate and practice Time Series Analysis with ARIMA models (Box-Jenkins methodology).

After completing the tasks in this lab you should be able to:
• Use R functions for ARIMA models
• Apply the requirements for generating appropriate training data
• Validate the effectiveness of the ARIMA models

Lab Exercise 10: Time Series Analysis - Workflow
1. Set the Working Directory
2. Establish the ODBC Connection
3. Open Connections to ODBC Database
4. Get Data from the Database
5. Review, Update, and Prepare DataFrame "msales" File for ARIMA Modeling
6. Convert "sales" into Time Series Type Data
7. Plot the Time Series
8. Analyze the ACF and PACF
9. Difference the Data to Make it Stationary
10. Plot ACF and PACF for the Differenced Data
11. Fit the ARIMA Model
12. Generate Predictions
13. Compare Predicted Values with Actual Values
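The modeling steps at the end of the workflow (fit, predict, compare) can be sketched without the database. Here a simulated series stands in for the lab's sales data, the ODBC steps are skipped, and the order c(1, 1, 0) is an arbitrary choice, not the lab's answer.

```r
# Hypothetical monthly "sales": a random walk with drift over eight years.
set.seed(5)
n <- 96
sales <- ts(200 + cumsum(rnorm(n, mean = 1.5, sd = 4)),
            start = c(2004, 1), frequency = 12)

# Hold out the last 12 months as actual values to compare against.
train <- window(sales, end = c(2010, 12))
test  <- window(sales, start = c(2011, 1))

# Fit on the training period only, with one difference (d = 1).
fit <- arima(train, order = c(1, 1, 0))

# Predict the held-out months and measure the mean absolute error.
fc  <- predict(fit, n.ahead = length(test))
mae <- mean(abs(fc$pred - test))
```

Comparing mae against the typical month-to-month change of the series gives a rough sense of whether the model adds value.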


Module 4: Advanced Analytics – Theory and
Methods
Lesson 8: Text Analysis

During this lesson the following topics are covered:


• Challenges with text analysis
• Key tasks in text analysis
• Definition of terms used in text analysis
• Term frequency, inverse document frequency
• Representation and features of documents and corpus
• Use of regular expressions in parsing text
• Metrics used to measure the quality of search results
• Relevance with tf-idf, precision and recall

Text Analysis
Encompasses the processing and representation of text for
analysis and learning tasks

• High-dimensionality
 Every distinct term is a dimension
 Green Eggs and Ham: A 50-D problem!
• Data is unstructured

Text Analysis – Problem-solving Tasks
• Parsing
 Impose a structure on the unstructured/semi-structured text for downstream analysis
• Search/Retrieval
 Which documents have this word or phrase?
 Which documents are about this topic or this entity?
• Text mining
 "Understand" the content
 Clustering, classification
• Tasks are not an ordered list
 Does not represent a process
 Set of tasks used appropriately depending on the problem addressed
[Figure: the three tasks: Parsing, Search & Retrieval, Text Mining]

Example: Brand Management

• Acme currently makes two products


 bPhone
 bEbook
• They have lots of competition. They want to maintain their
reputation for excellent products and keep their sales high.
• What is the buzz on Acme?
 Search for mentions of Acme products
 Twitter, Facebook, Review Sites, etc.
 What do people say?
 Positive or negative?
 What do people think is good or bad about the products?

Buzz Tracking: The Process

Step | Text analysis task
1. Monitor social networks, review sites for mentions of our products. | Parse the data feeds to get actual content. Find and filter the raw text for product names (use regular expressions).
2. Collect the reviews. | Extract the relevant raw text. Convert the raw text into a suitable document representation. Index into our review corpus.
3. Sort the reviews by product. | Classification (or "Topic Tagging")
4. Are they good reviews or bad reviews? We can keep a simple count here, for trend analysis. | Classification (sentiment analysis)
5. Marketing calls up and reads selected reviews in full, for greater insight. | Search/Information Retrieval.

Parsing the Feeds [Task: Parsing]

1. Monitor social networks, review sites for mentions of our products

• Impose structure on semi-structured data.
• We need to know where to look for what we are looking for.

Regular Expressions [Task: Parsing]

1. Monitor social networks, review sites for mentions of our products

• Regular expressions (regexp) are a means for finding words, strings or particular patterns in text.
• A match is a Boolean response. The basic use is to ask "does this regexp match this string?"

regexp        matches                          Note
b(P|p)hone    bPhone, bphone                   Pipe "|" means "or" inside parentheses; the character class b[Pp]hone is equivalent
bEb.*k        bEbook, bEbk, bEback …           ".*" is a wildcard: "." matches any character and "*" repeats it zero or more times
^I love       A line starting with "I love"    "^" means start of a string
Acme$         A line ending with "Acme"        "$" means the end of a string
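The patterns in the table can be tried in R with grepl(), which returns the Boolean match result for each input string. The test strings are invented for illustration.

```r
# Alternation: "|" means "or" inside parentheses.
m_phone <- grepl("b(P|p)hone", c("bPhone", "bphone", "bXhone"))

# Wildcard: "." is any character, "*" repeats it zero or more times.
m_ebook <- grepl("bEb.*k", c("bEbook", "bEbk", "bEback", "bEb"))

# Anchors: "^" is the start of the string, "$" is the end.
m_start <- grepl("^I love", c("I love my bPhone", "Do I love it?"))
m_end   <- grepl("Acme$", c("Made by Acme", "Acme makes it"))

# Pitfall: inside square brackets "|" is a literal character,
# so b[P|p]hone also matches the string "b|hone".
m_pitfall <- grepl("b[P|p]hone", "b|hone")
```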

Extract and Represent Text [Task: Parsing]

2. Collect the reviews

Document Representation: A structure for analysis
• "Bag of words"
 Common representation
 A vector with one dimension for every unique term in the term space
 Term frequency (tf): number of times a term occurs
 Good for basic search, classification
• Reduce dimensionality
 Term space: not ALL terms
 No stop words: "the", "a"
 Often no pronouns
 Stemming
 "phone" = "phones"

Example: "I love LOVE my bPhone!"
Convert this to a vector in the term space:
acme 0
bebook 0
bPhone 1
fantastic 0
love 2
slow 0
terrible 0
terrific 0
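A minimal bag-of-words sketch in base R that reproduces the example vector. The helper name bag_of_words is hypothetical, and the tokenizer (lowercase, split on anything that is not a letter or digit) is a simplifying assumption; terms are lowercased, so the bPhone dimension appears as "bphone".

```r
# The eight-term space from the example (lowercased).
term_space <- c("acme", "bebook", "bphone", "fantastic",
                "love", "slow", "terrible", "terrific")

# Count how often each term of the term space occurs in the text.
# Tokens outside the term space ("i", "my") are ignored.
bag_of_words <- function(text, terms) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  tokens <- tokens[tokens != ""]
  counts <- table(factor(tokens, levels = terms))
  setNames(as.integer(counts), terms)
}

tf_vector <- bag_of_words("I love LOVE my bPhone!", term_space)
print(tf_vector)
```

Lowercasing is what lets "love" and "LOVE" fall into the same dimension; stemming and stop-word removal would be further steps along the same lines.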

Document Representation - Other Features [Task: Parsing]

2. Collect the reviews


• Feature:
 Anything about the document that is used for search or
analysis.
• Title
• Keywords or tags
• Date information
• Source information
• Named entities

Representing a Corpus (Collection of Documents) [Task: Parsing]

2. Collect the reviews

• Reverse index
 For every possible feature, a list of all the documents that contain that feature
• Corpus metrics
 Volume
 Corpus-wide term frequencies
 Inverse Document Frequency (IDF)
 more on this later
• Challenge: a Corpus is dynamic
 Index, metrics must be updated continuously
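A reverse index can be sketched as a named list that maps each term to the documents containing it. The three reviews, the document ids and the helper names tokenize and add_to_index are all hypothetical.

```r
# Hypothetical mini corpus of reviews.
docs <- c(d1 = "I love my bPhone",
          d2 = "The bEbook is slow",
          d3 = "bPhone battery is terrific")

# Simplified tokenizer: lowercase, split on non-alphanumeric characters.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  tokens[tokens != ""]
}

# Add one document to the index: each distinct term records the document id.
add_to_index <- function(index, doc_id, text) {
  for (term in unique(tokenize(text))) {
    index[[term]] <- c(index[[term]], doc_id)
  }
  index
}

index <- list()
for (id in names(docs)) index <- add_to_index(index, id, docs[[id]])

bphone_docs <- index[["bphone"]]   # documents that mention the bPhone
doc_freq    <- lengths(index)      # number of documents per term

# The corpus is dynamic: a new review changes both the index and the metrics.
index <- add_to_index(index, "d4", "My bEbook is terrible")
```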

Text Classification (I) - "Topic Tagging" [Task: Text Mining]

3. Sort the Reviews by Product

Not as straightforward as it seems

"The bPhone-5X has coverage everywhere. It's much less flaky than my old bPhone-4G."

"While I love Acme's bPhone series, I've been quite disappointed by the bEbook. The text is illegible, and it makes even the Kindle look blazingly fast."

"Topic Tagging" [Task: Text Mining]
3. Sort the Reviews by Product
Judicious choice of features
 Product mentioned in title?
 Tweet, or review?
 Term frequency
 Canonicalize abbreviations
 "5X" = "bPhone-5X"
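Canonicalizing abbreviations can be sketched with a single substitution. The helper name canonicalize is hypothetical, and the rule below (expand a standalone "5X" but leave an existing "bPhone-5X" alone) is just one possible policy.

```r
# Expand the abbreviation "5X" to the full product name.
# The negative lookbehind skips a "5X" that already follows a letter,
# digit or hyphen, so "bPhone-5X" is not expanded twice.
canonicalize <- function(text) {
  gsub("(?<![A-Za-z0-9-])5X\\b", "bPhone-5X", text, perl = TRUE)
}

short_form <- canonicalize("My 5X has coverage everywhere")
long_form  <- canonicalize("My bPhone-5X has coverage everywhere")
```

Both calls produce the same canonical sentence, so the term frequency of "bPhone-5X" counts every mention regardless of how the reviewer wrote it.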

Text Classification (II) - Sentiment Analysis [Task: Text Mining]

4. Are they good reviews or bad reviews?
• Naïve Bayes is a good first attempt
• But you need tagged training data!
 THE major bottleneck in text classification
• What to do?
 Hand-tagging
 Clues from review sites
 thumbs-up or down, # of stars
 Cluster documents, then label the clusters
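A minimal Naïve Bayes sentiment sketch in base R. The four hand-tagged reviews are invented, the helper names are hypothetical, and add-one (Laplace) smoothing is an assumption of this example; a real classifier needs far more tagged data.

```r
# Tiny hand-tagged training set (hypothetical).
train_text  <- c("love my bphone fantastic battery",
                 "terrific screen love it",
                 "terrible battery slow phone",
                 "slow and terrible bebook")
train_label <- c("pos", "pos", "neg", "neg")

tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  tokens[tokens != ""]
}

train_nb <- function(texts, labels) {
  vocab   <- unique(unlist(lapply(texts, tokenize)))
  classes <- unique(labels)
  # Log prior: share of training documents in each class.
  log_prior <- log(as.vector(table(factor(labels, levels = classes))) / length(labels))
  names(log_prior) <- classes
  # Log likelihood of each term per class, with add-one smoothing.
  log_lik <- sapply(classes, function(cl) {
    tokens <- unlist(lapply(texts[labels == cl], tokenize))
    counts <- as.vector(table(factor(tokens, levels = vocab)))
    log((counts + 1) / (sum(counts) + length(vocab)))
  })
  rownames(log_lik) <- vocab
  list(vocab = vocab, log_prior = log_prior, log_lik = log_lik)
}

predict_nb <- function(model, text) {
  tokens <- tokenize(text)
  tokens <- tokens[tokens %in% model$vocab]   # ignore unseen terms
  scores <- model$log_prior + colSums(model$log_lik[tokens, , drop = FALSE])
  names(which.max(scores))
}

model <- train_nb(train_text, train_label)
good  <- predict_nb(model, "I love the terrific bPhone")
bad   <- predict_nb(model, "slow terrible battery")
```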

Search and Information Retrieval [Task: Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.
• Marketing calls up documents with queries:
 Collection of search terms
 "bPhone battery life"
 Can also be represented as "bag of words"
 Possibly restricted by other attributes
 within the last month
 from This Review Site

Quality of Search Results [Task: Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.

• Relevance
 Is this document what I wanted?
 Used to rank search results
• Precision
 What % of documents in the result are relevant?
• Recall
 Of all the relevant documents in the corpus, what % were returned to me?
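Precision and recall for a single query can be computed from two sets of document ids. The ids below are hypothetical.

```r
returned <- c("d1", "d2", "d3", "d4")   # documents the search returned
relevant <- c("d2", "d4", "d5")         # documents that are actually relevant

# Relevant documents that were also returned.
hits <- intersect(returned, relevant)

# Precision: share of the returned documents that are relevant (2 of 4).
precision <- length(hits) / length(returned)

# Recall: share of the relevant documents that were returned (2 of 3).
recall <- length(hits) / length(relevant)
```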

Computing Relevance [Task: Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.

 Call up all the documents that have any of the terms from the query, and count how many times each term occurs:

Inverse Document Frequency (idf) [Task: Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.

idfi = log(N / dfi)
 N: Number of documents in the corpus
 dfi: Number of documents in the corpus in which term i occurs (its document frequency)
• Measures term uniqueness in corpus
 "phone" vs. "brick"
• Indicates the importance of the term
 Search (relevance)
 Classification (discriminatory power)
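The "phone" vs. "brick" contrast can be reproduced on a toy corpus. The four documents and the helper names doc_freq and idf are hypothetical; the natural logarithm is used here, and other bases only rescale the values.

```r
# Hypothetical corpus: "phone" occurs in every document, "brick" in only one.
docs <- c("my phone is great",
          "phone battery died",
          "this phone is a brick",
          "love the phone camera")

# Distinct terms per document.
doc_terms <- lapply(docs, function(d) unique(strsplit(tolower(d), "[^a-z0-9]+")[[1]]))
N <- length(docs)

# Number of documents in which a term occurs.
doc_freq <- function(term) {
  sum(vapply(doc_terms, function(terms) term %in% terms, logical(1)))
}

# Inverse document frequency: log(N / document frequency).
idf <- function(term) log(N / doc_freq(term))

idf_phone <- idf("phone")   # log(4 / 4): the term does not discriminate
idf_brick <- idf("brick")   # log(4 / 1): the term is rare in this corpus
```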

TF-IDF and Modified Retrieval Algorithm [Task: Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.

• Term frequency – inverse document frequency (tf-idf)
 tfdocument(term) * idf(term)
 Query: "unbrick phone"
 A document with "unbrick" a few times is more relevant than a document with "phone" many times
• Measure of relevance with tf-idf
 Call up all the documents that have any of the terms from the query, and sum up the tf-idf of each term:
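The modified retrieval step can be sketched for the query "unbrick phone". The four documents are invented so that "phone" appears everywhere and "unbrick" in only one document; helper names are hypothetical.

```r
# Hypothetical corpus.
docs <- c(d1 = "phone phone phone phone great phone",
          d2 = "how to unbrick a phone unbrick guide",
          d3 = "nice phone case",
          d4 = "phone battery review")

tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  tokens[tokens != ""]
}
doc_tokens <- lapply(docs, tokenize)
N <- length(docs)

# idf(term) = log(N / number of documents containing the term); 0 if unseen.
idf <- function(term) {
  n_docs <- sum(vapply(doc_tokens, function(tk) term %in% tk, logical(1)))
  if (n_docs == 0) 0 else log(N / n_docs)
}

# Relevance of one document: sum over query terms of tf(term) * idf(term).
relevance <- function(tokens, query_terms) {
  sum(vapply(query_terms,
             function(term) sum(tokens == term) * idf(term),
             numeric(1)))
}

query  <- tokenize("unbrick phone")
scores <- vapply(doc_tokens, relevance, numeric(1), query_terms = query)
ranked <- sort(scores, decreasing = TRUE)
```

With raw counts d1 would win on its five mentions of "phone"; with tf-idf the two mentions of "unbrick" put d2 first, because "phone" occurs in every document and so carries no weight.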

Other Relevance Metrics [Task: Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.

• "Authoritativeness" of source
 PageRank is an example of this
• Recency of document
• How often the document has been retrieved by other users

Effectiveness of Search and Retrieval [Task: Search & Retrieval]

• Relevance metric
 important for precision, user experience
• Effective crawl, extraction, indexing
 important for recall (and precision)
 more important, often, than retrieval algorithm
• MapReduce
 Reverse index, corpus term frequencies, idf

Challenges - Text Analysis

• Challenge: finding the right structure for your unstructured data


• Challenge: very high dimensionality
• Challenge: thinking about your problem the right way

Check Your Knowledge
Your Thoughts?

1. What are the two major challenges in the problem of text analysis?
2. What is a reverse index?
3. Why are the corpus metrics dynamic? Provide an example and a scenario that explains the dynamism of the corpus metrics.
4. How does tf-idf enhance the relevance of a search result?
5. List and discuss a few methods that are deployed in text analysis to reduce the dimensions.

Module 4: Advanced Analytics – Theory and
Methods
Lesson 8: Text Analysis - Summary

During this lesson the following topics were covered:


• Challenges with text analysis
• Key tasks in text analysis
• Definition of terms used in text analysis
• Term frequency, inverse document frequency
• Representation and features of documents and corpus
• Use of regular expressions in parsing text
• Metrics used to measure the quality of search results
• Relevance with tf-idf, precision and recall

Module 4: Summary
Key Topics Covered in this module
• Algorithms and technical foundations
• Key Use cases
• Diagnostics and validation of the model
• Reasons to Choose (+) and Cautions (-) of the model
• Fitting, scoring and validating model in R and in-db functions

Methods Covered in this module
• Categorization (unsupervised):
 K-means clustering
 Association Rules
• Regression
 Linear
 Logistic
• Classification (supervised)
 Naïve Bayesian classifier
 Decision Trees
• Time Series Analysis
• Text Analysis

