OC_Module 4_Theory and Methods 021312
EMC PROVEN PROFESSIONAL
Copyright © 2012 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 1
Module 4: Advanced Analytics – Theory and
Methods
Upon completion of this module, you should be able to:
• Examine analytic needs and select an appropriate technique based on
business objectives; initial hypotheses; and the data's structure and volume
• Apply some of the more commonly used methods in Analytics solutions
• Explain the algorithms and the technical foundations for the commonly used
methods
• Explain the environment (use case) in which each technique can provide the
most value
• Use appropriate diagnostic methods to validate the models created
• Use R and in-database analytical functions to fit, score and evaluate models
Where “R” we?
• In Module 3 we reviewed R skills and basic statistics
• You can use R to:
Generate summary statistics to investigate a data set
Visualize Data
Perform statistical tests to analyze data and evaluate models
• Now that you have data, and you can see it, you need to plan the
analytic model and determine the analytic method to be used
Applying the Data Analytics Lifecycle
Phase 3 - Model Planning
• How do people generally solve this problem with the kind of data and
resources I have?
• Does that work well enough? Or do I have to come up with something new?
[Figure: Data Analytics Lifecycle diagram; phases shown include Discovery, Data Prep, Model Planning, Communicate Results, Operationalize]
What Kind of Problem do I Need to Solve?
How do I Solve it?
The Problem to Solve | The Category of Techniques | Covered in this Course
I want to group items by similarity. I want to find structure (commonalities) in the data | Clustering | K-means clustering
I want to discover relationships between actions or items | Association Rules | Apriori
I want to determine the relationship between the outcome and the input variables | Regression | Linear Regression, Logistic Regression
Why These Example Techniques?
Module 4: Advanced Analytics – Theory and Methods
Lesson 1: K-means Clustering
Clustering
How do I group these documents by topic?
How do I group my customers by purchase patterns?
• Sort items into groups by similarity:
Items in a cluster are more similar to each other than they are to
items in other clusters.
Need to detail the properties that characterize "similarity",
or distance, the "inverse" of similarity
• Not a predictive method; finds similarities, relationships
• Our Example: K-means Clustering
K-Means Clustering - What is it?
• Used for clustering numerical data, usually a set of
measurements about objects of interest.
• Input: numerical. There must be a distance metric defined over
the variable space.
Most commonly Euclidean distance
• Output: The center (centroid) of each discovered cluster, and the
assignment of each input datum to a cluster.
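The input and output described above can be illustrated with a minimal sketch of the usual K-means iteration: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is an illustration under stated assumptions, not the course's lab code: the function name `kmeans`, the sample points, and the initial centroids are hypothetical, and Euclidean distance is assumed.

```python
import math

def kmeans(points, initial_centroids, max_iters=100):
    """Hypothetical helper: K-means on a list of equal-length numeric tuples."""
    centroids = list(initial_centroids)
    k = len(centroids)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # no point changed cluster: stop
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Made-up 2-D measurements forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
# Initial guess (an assumption of this sketch): one data point from each group.
centroids, assignment = kmeans(points, [points[0], points[3]])
```

The function returns the two outputs named on the slide: the cluster centers and the cluster index of each input point. A different initial guess can lead to a different result.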
Use Cases
• Often an exploratory technique:
Discover structure in the data
Summarize the properties of each cluster
• Sometimes a prelude to classification:
"Discovering the classes"
• Examples
The height, weight and average lifespan of animals
Household income, yearly purchase amount in dollars, number of
household members of customer households
Patient records with measures of BMI, HbA1c, HDL
Use-Case Example – On-line Retailer
The Algorithm
The Algorithm (Continued)
Picking K
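Heuristic: find the "elbow" of the within-sum-of-squares (WSS) plot
as a function of K.
WSS = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \lVert x_{ij} - c_i \rVert^2
K: # of clusters
n_i: # of points in the ith cluster
c_i: centroid of the ith cluster
x_ij: jth point of the ith cluster
[Figure: WSS vs. K plot, with "elbows" at K = 2, 4, 6]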
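The within-sum-of-squares (WSS) used for picking K is the sum, over all clusters, of the squared distances from each point to its cluster centroid. The sketch below computes it for two hand-made groupings of the same four points. The function name `wss` and the data are hypothetical, Euclidean distance is assumed, and in practice the groupings would come from running K-means once per candidate K.

```python
import math

def wss(clusters):
    """Within-sum-of-squares for a list of clusters (each a list of points)."""
    total = 0.0
    for members in clusters:
        # Centroid: the per-coordinate mean of the cluster's points.
        centroid = tuple(sum(col) / len(members) for col in zip(*members))
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total

# The same four made-up points, grouped as K=1 and as K=2.
one_cluster = [[(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]]
two_clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]

wss_k1 = wss(one_cluster)    # 104.0
wss_k2 = wss(two_clusters)   # 4.0
```

The sharp drop from K=1 to K=2 is the kind of "elbow" the heuristic looks for. WSS keeps shrinking as K grows, so the useful signal is where the curve flattens rather than where it is smallest.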
Diagnostics – Evaluating the Model
• Do the clusters look separated in at least some of the plots when
you do pair-wise plots of the clusters?
Pair-wise plots can be used when there are not many variables
• Do you have any clusters with few data points?
Try decreasing the value of K
• Are there splits on variables that you would expect, but don't
see?
Try increasing the value of K
• Do any of the centroids seem too close to each other?
Try decreasing the value of K
K-Means Clustering - Reasons to Choose (+) and
Cautions (-)
Reasons to Choose (+)
• Easy to implement
• Easy to assign new data to existing clusters
Which is the nearest cluster center?
• Concise output
Coordinates of the K cluster centers
Cautions (-)
• Doesn't handle categorical variables
• Sensitive to initialization (first guess)
• Variables should all be measured on similar or compatible scales
Not scale-invariant!
• K (the number of clusters) must be known or decided a priori
Wrong guess: possibly poor results
• Tends to produce "round" equi-sized clusters
Not always desirable
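Two of the points above, assigning new data to the nearest cluster center and the lack of scale invariance, can be seen together in a small sketch. Everything here is hypothetical (the household data, the two centers, the helper names), and z-score standardization is only one possible way to put variables on compatible scales.

```python
import math
import statistics

def nearest_center(point, centers):
    """Index of the cluster center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

# Made-up households: (income in dollars, number of household members).
rows = [(30000.0, 1.0), (32000.0, 5.0), (90000.0, 1.0), (92000.0, 5.0)]
columns = list(zip(*rows))
means = [statistics.mean(c) for c in columns]
sds = [statistics.pstdev(c) for c in columns]

def scale(point):
    """Z-score each coordinate using the column means and standard deviations."""
    return tuple((v - m) / s for v, m, s in zip(point, means, sds))

centers_raw = [(30000.0, 1.0), (32000.0, 5.0)]   # hypothetical cluster centers
new_household = (31500.0, 1.0)

# On raw units, income differences in dollars swamp household size.
raw_choice = nearest_center(new_household, centers_raw)
# After standardizing, household size counts as much as income.
scaled_choice = nearest_center(scale(new_household), [scale(c) for c in centers_raw])
```

On the raw values the new household lands in the second cluster because it is $500 closer in income. After standardizing it lands in the first, whose household size it matches.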
Check Your Knowledge
1. Why do we consider K-means clustering an unsupervised
machine learning algorithm?
2. How do you use "pair-wise" plots to evaluate the effectiveness
of the clustering?
3. Detail the four steps in the K-means clustering algorithm.
4. How do we use WSS to pick the value of K?
5. What is the most common measure of distance used with
K-means clustering algorithms?
6. The attributes of a data set are purchase decision (Yes/No),
gender (M/F), and income group (<10K, 10-50K, >50K). Can you
use K-means to cluster this data set?
Module 4: Advanced Analytics – Theory and
Methods
Lesson 1: K-means Clustering - Summary
Lab Exercise 4: K-means Clustering
• This Lab is designed to investigate and practice
K-means Clustering.
Lab Exercise 4: K-means Clustering - Workflow
1. Set the Working Directory
Module 4: Advanced Analytics – Theory and
Methods
Lesson 2: Association Rules
During this lesson the following topics are covered:
Association Rules mining
Apriori Algorithm
Prominent use cases of Association Rules
Support and Confidence parameters
Lift and Leverage
Diagnostics to evaluate the effectiveness of rules generated
Reasons to Choose (+) and Cautions (-) of the Apriori algorithm
Association Rules
Which of my products tend to be purchased together?
What do other people like this person tend to like/buy/watch?
• Discover "interesting" relationships among variables in a large
database
Rules of the form "When X is observed, Y is also observed"
The definition of "interesting" varies with the algorithm used for
discovery
• Not a predictive method; finds similarities, relationships
Association Rules - Apriori
• Specifically designed for mining over transactions in databases
• Used over itemsets: sets of discrete variables that are linked:
Retail items that are purchased together
A set of tasks done in one day
A set of links clicked on by one user in a single session
• Our Example: Apriori
Apriori Algorithm - What is it?
• Earliest of the association rule algorithms
• Frequent itemset: a set of items L that appears together "often
enough":
Formally: meets a minimum support criterion
Support: the % of transactions that contain L
• Apriori Property: Any subset of a frequent itemset is also
frequent
It has at least the support of its superset
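Support and the Apriori property can be checked on a handful of transactions. This is a small sketch with made-up market baskets; the function name `support` is hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Made-up market baskets.
transactions = [
    frozenset({"milk", "cookies", "bread"}),
    frozenset({"milk", "cookies"}),
    frozenset({"milk", "bread"}),
    frozenset({"cookies"}),
    frozenset({"milk", "cookies", "eggs"}),
]

s_pair = support({"milk", "cookies"}, transactions)   # 3 of 5 baskets
s_milk = support({"milk"}, transactions)              # 4 of 5 baskets
s_cookies = support({"cookies"}, transactions)        # 4 of 5 baskets
```

Each single item has at least the support of the pair that contains it, which is the Apriori property in miniature.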
Apriori Algorithm (Continued)
• Iteratively grow the frequent itemsets from size 1 to size K (or
until we run out of support).
Apriori property tells us how to prune the search space
• Frequent itemsets are used to find rules X->Y with a minimum
confidence:
Confidence: The % of transactions that contain X, which also
contain Y
• Output: The set of all rules X -> Y with minimum support and
confidence
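Confidence as defined above is a ratio of two counts, so it depends on the direction of the rule. The sketch below uses made-up baskets and hypothetical helper names (`count`, `confidence`).

```python
def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)

def confidence(x, y, transactions):
    """Confidence of X -> Y: of the transactions containing X, the share that also contain Y."""
    return count(set(x) | set(y), transactions) / count(x, transactions)

# Made-up baskets: 3 with milk and cookies, 2 with milk only, 1 with cookies and bread.
transactions = (
    [frozenset({"milk", "cookies"})] * 3
    + [frozenset({"milk"})] * 2
    + [frozenset({"cookies", "bread"})]
)

conf_milk_to_cookies = confidence({"milk"}, {"cookies"}, transactions)   # 3 / 5
conf_cookies_to_milk = confidence({"cookies"}, {"milk"}, transactions)   # 3 / 4
```

The two rules share the same itemset, and therefore the same support, but their confidences differ.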
Lift and Leverage
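Assuming the standard definitions, lift compares how often X and Y occur together with how often they would if they were independent, as a ratio, and leverage makes the same comparison as a difference. The sketch below uses hypothetical counts, and the function names are illustrative only.

```python
def lift(support_xy, support_x, support_y):
    """Ratio of observed joint support to the support expected under independence."""
    return support_xy / (support_x * support_y)

def leverage(support_xy, support_x, support_y):
    """Difference between observed joint support and the support expected under independence."""
    return support_xy - support_x * support_y

# Hypothetical counts out of 1000 transactions.
n = 1000
count_x, count_y, count_xy = 400, 500, 300

rule_lift = lift(count_xy / n, count_x / n, count_y / n)           # about 1.5
rule_leverage = leverage(count_xy / n, count_x / n, count_y / n)   # about 0.1
```

A lift near 1 (or a leverage near 0) suggests X and Y co-occur about as often as chance would predict. Larger values point to a stronger association.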
Association Rules Implementations
• Market Basket Analysis
People who buy milk also buy cookies 60% of the time.
• Recommender Systems
"People who bought what you bought also purchased…"
• Discovering web usage patterns
People who land on page X click on link Y 76% of the time.
Use Case Example: Credit Records
ID | Credit Attributes
1 | credit_good, female_married, job_skilled, home_owner, …
2 | credit_bad, male_single, job_unskilled, renter, …
Computing Confidence and Lift
Suppose we have 1000 credit records:
A Sketch of the Algorithm
• If L_k is the set of frequent k-itemsets:
Generate the candidate set C_{k+1} by joining L_k to itself
Prune out the (k+1)-itemsets that don't have minimum support
Now we have L_{k+1}
• We know this catches all the frequent (k+1)-itemsets by the
Apriori property
A (k+1)-itemset can't be frequent if any of its subsets isn't
frequent
• Continue until we reach k_max, or run out of support
• From the union of all the L_k, find all the rules with minimum
confidence
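The level-wise search sketched above might look like the following in code. It is a simplified illustration rather than a tuned implementation: the function name, parameters, and toy transactions are hypothetical, and the final rule-generation step (filtering by minimum confidence) is left out.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, k_max=None):
    """Grow frequent itemsets level by level; returns {itemset: support}."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    # Frequent 1-itemsets.
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level and (k_max is None or k <= k_max):
        for itemset in level:
            frequent[itemset] = support(itemset)
        # Join the frequent k-itemsets with themselves to get (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Apriori property: drop any candidate that has an infrequent k-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k))
        }
        # Keep only the candidates that meet minimum support.
        level = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent

transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
    frozenset({"a", "b", "c"}),
]
frequent = frequent_itemsets(transactions, min_support=0.6)
```

With a minimum support of 0.6, all three single items and all three pairs are frequent, while the triple {a, b, c} (support 0.4) is not, so the search stops after size 2.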
Step 1: 1-itemsets (L1)
Step 2: 2-itemsets (L2)
• Prune
Set | Cnt
credit_good, home_owner | 527
male_single, job_skilled | 340
male_single, home_owner | 408
job_skilled, home_owner | 452
Step 3: 3-itemsets
Finally: Find Confidence Rules
Rule Set Cnt Set Cnt Confidence
Diagnostics
• Do the rules make sense?
What does the domain expert say?
• Make a "test set" from hold-out data:
Enter some market baskets with a few items missing (selected at
random). Can the rules predict the missing items?
Remember, some of the test data may not cause a rule to fire.
• Evaluate the rules by lift or leverage.
Some associations may be coincidental (or obvious).
Apriori - Reasons to Choose (+) and Cautions (-)
Check Your Knowledge
1. What is the Apriori property and how is it used in the Apriori
algorithm?
2. List three popular use cases of the Association Rules mining
algorithms.
3. What is the difference between Lift and Leverage? How is Lift
used in evaluating the quality of rules discovered?
4. Define Support and Confidence.
5. How do you use a "hold-out" dataset to evaluate the
effectiveness of the rules generated?
Module 4: Advanced Analytics – Theory and
Methods
Lesson 2: Association Rules - Summary
During this lesson the following topics were covered:
Association Rules mining
Apriori Algorithm
Prominent use cases of Association Rules
Support and Confidence parameters
Lift and Leverage
Diagnostics to evaluate the effectiveness of rules generated
Reasons to Choose (+) and Cautions (-) of the Apriori algorithm
Lab Exercise 5 - Association Rules
• This Lab is designed to investigate and practice
Association Rules.
Lab Exercise 5 - Association Rules - Workflow
4. Plot Transactions
Module 4: Advanced Analytics – Theory and
Methods
Lesson 3: Linear Regression
Regression
• Regression focuses on the relationship between an outcome and
its input variables.
In other words, we don't just predict the outcome, we also have a
sense of how changes in individual drivers affect the outcome.
• The outcome can be continuous or discrete.
When it's discrete, we are predicting the probability that the
outcome will occur.
Example Questions:
I want to predict the lifetime value (LTV) of this customer (and
understand what drives LTV).
I want to predict the probability that this loan will default (and
understand what drives default).
• Our examples: Linear Regression, Logistic Regression
Linear Regression - What is it?
• Used to estimate a continuous value as a linear (additive)
function of other variables
Income as a function of years of education, age, gender
House price as function of median home price in neighborhood,
square footage, number of bedrooms/bathrooms
Neighborhood house sales in the past year based on
unemployment, stock price etc.
Linear Regression - Use Cases
• The preferred method for almost any problem where we are
predicting a continuous outcome
Try this first; if it fails, then try something more complicated
• Examples:
Customer lifetime value
Home value
Loss given default on loan
Income as a function of demographics
Example: Predict Mortgage Foreclosure/Delinquency
Rates
fdq_rate = -0.9 + 0.66 CurrentUnemp + 1.06 ChgInUnem1yr + 0.22 hicost_mort_rate
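Scoring with a fitted linear model is just evaluating the equation. The sketch below takes the coefficients and variable names from the equation above; the function name and the input values are made up for illustration.

```python
# Intercept and coefficients copied from the fitted equation above.
INTERCEPT = -0.9
COEFFS = {
    "CurrentUnemp": 0.66,
    "ChgInUnem1yr": 1.06,
    "hicost_mort_rate": 0.22,
}

def predict_fdq_rate(row):
    """Evaluate the linear equation for one row of inputs (a dict keyed by variable name)."""
    return INTERCEPT + sum(coef * row[name] for name, coef in COEFFS.items())

# Hypothetical inputs, not taken from the course data.
example = {"CurrentUnemp": 8.0, "ChgInUnem1yr": 1.0, "hicost_mort_rate": 10.0}
rate = predict_fdq_rate(example)   # -0.9 + 5.28 + 1.06 + 2.2

# Raising CurrentUnemp by one unit, all else equal, adds its coefficient (0.66).
bumped = dict(example, CurrentUnemp=9.0)
delta = predict_fdq_rate(bumped) - rate
```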
Technical Description
Representing Categorical Variable
What do the Coefficients b_i Mean?
• Change in y as a function of unit change in x_i,
all other things being equal
• Example: income in units of $10K, age in years, b_age = 2
For the same gender, years of education, and state of residence, a
person's income increases by 2 units ($20K) for every year older
Diagnostics
• Hold-out data
Does the model predict well on data it hasn't seen?
• N-fold cross-validation
Partition the data into N groups.
Fit N models, each holding out one group, and calculate the residuals
on the held-out group.
Estimated prediction error is the average of the squared residuals over all groups.
• R^2: The fraction of the variance in the output variable that the
model can explain.
It is also the square of the correlation between the true output
and the predicted output. You want it close to 1.
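N-fold cross-validation and R^2 can both be sketched for the simplest case, a one-input least-squares line. The data, the function names, and the way rows are dealt into folds (by index modulo N) are assumptions of this sketch, and squared residuals are used as the error measure.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

def r_squared(ys, preds):
    """1 - (residual sum of squares / total sum of squares)."""
    my = statistics.mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def cross_val_mse(xs, ys, n_folds):
    """Average squared residual on held-out rows across n_folds fits."""
    squared_errors = []
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        held_out = [i for i in range(len(xs)) if i % n_folds == fold]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_errors += [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out]
    return statistics.mean(squared_errors)

# Made-up data lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

b0, b1 = fit_line(xs, ys)
r2 = r_squared(ys, [b0 + b1 * x for x in xs])
cv_mse = cross_val_mse(xs, ys, n_folds=3)
```

R^2 here is computed on the training data, while the cross-validated error comes only from rows the model did not see when it was fit.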
Diagnostics (Continued)
• Sanity check the coefficients
Do the signs make sense? Are the coefficients excessively large?
Wrong sign is an indication of correlated inputs, but doesn't
necessarily affect predictive power.
Excessively large coefficient magnitudes may indicate strongly
correlated inputs; you may want to consider eliminating some
variables, or using regularized regression techniques
(Ridge, Lasso).
Infinite magnitude coefficients could indicate a variable that strongly
predicts a subset of the output (and doesn't predict well on the rest).
Plot output vs. this input, and see if you should segment the data before
regressing.
Diagnostics (Continued)
• Plot it!
Prediction vs. true outcome
• Look for:
Systematic over/under prediction
Non-constant variance
The data cloud should be symmetric about the line of true prediction
Glaring outliers
• You will see other diagnostic plots in the lab
[Figure: two plots of prediction vs. true outcome. First plot: "Overpredicts for low true values, underpredicts at higher values. Improve the model." Second plot: "Not quite consistent variance, but much better."]
Linear Regression - Reasons to Choose (+) and
Cautions (-)
Reasons to Choose (+)
• Concise representation (the coefficients)
• Robust to redundant variables, correlated variables
Lose some explanatory value
• Explanatory value
Relative impact of each variable on the outcome
• Easy to score data
Cautions (-)
• Does not handle missing values well
• Assumes that each variable affects the outcome linearly and additively
Variable transformations and modeling variable interactions can alleviate this
A good idea to take the log of monetary amounts or any variable with a wide dynamic range
• Can't handle variables that affect the outcome in a discontinuous way
Step functions
• Doesn't work well with discrete drivers that have a lot of distinct values
For example, ZIP code
Check Your Knowledge
1. How is the measure of significance used in determining the
explanatory value of a driver with linear regression models?
2. Detail the challenges with categorical values in a linear
regression model.
3. Describe the N-fold cross-validation method used for diagnosing a
fitted model.
4. List two use cases of linear regression models.
5. List and discuss two standard sanity checks that you will
perform on the coefficients derived from a linear regression
model.
Module 4: Advanced Analytics – Theory and
Methods
Lesson 3: Linear Regression - Summary
Lab Exercise 6: Linear Regression
This Lab is designed to investigate and practice Linear
Regression.
Lab Exercise 6: Linear Regression - Workflow
4. Print and visualize the results and review the plots generated
Module 4: Advanced Analytics – Theory and
Methods
Lesson 4: Logistic Regression
During this lesson the following topics are covered:
• Technical description of a logistic regression model
• Common use cases for the logistic regression model
• Interpretation and scoring with the logistic regression model
• Diagnostics for validating the logistic regression model
• Reasons to Choose (+) and Cautions (-) of the logistic
regression model
Logistic Regression
• Used to estimate the probability that an event will occur as a
function of other variables
The probability that a borrower will default as a function of his
credit score, income, the size of the loan, and his existing debts
• Can be considered a classifier, as well
Assign the class label with the highest probability
Logistic Regression Use Cases
• The preferred method for many binary classification problems:
Especially if you are interested in the probability of an event, not
just predicting the "yes or no"
Try this first; if it fails, then try something more complicated
• Binary Classification examples:
The probability that a borrower will default
The probability that a customer will churn
• Multi-class example
The probability that a politician will vote yes/vote no/not show up
to vote on a given bill
Logistic Regression Model - Example
Logistic Regression - Visualizing the Model
[Figure: visualization of the model; blue = defaulters]
Technical Description (Binary Case)
What do the Coefficients b_j Mean?
• Invert the logit expression:
P(y=1) / (1 - P(y=1)) = \exp(b_0 + \sum_j b_j x_j)
• exp(b_j) tells us how the odds of y=1 change for every unit
change in x_j
• Example: b_creditScore = -0.69
• exp(b_creditScore) ≈ 0.5 = 1/2
• for the same income, loan, and existing debt, the odds of
default are halved for every point increase in credit score
• Standard packages return the significance of the coefficients in the
same way as in linear regression
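The credit score example can be checked numerically. In this sketch the coefficient -0.69 comes from the example above, while the starting linear score `z` and the helper names are hypothetical.

```python
import math

def logistic(z):
    """Probability of y=1 for a linear score z = b0 + sum(b_j * x_j)."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)

b_credit_score = -0.69
odds_multiplier = math.exp(b_credit_score)   # roughly 0.5

z = 0.3                                      # hypothetical linear score for some borrower
p_before = logistic(z)
p_after = logistic(z + b_credit_score)       # credit score raised by one unit, all else equal
ratio = odds(p_after) / odds(p_before)       # equals exp(b_credit_score)
```

The odds shrink by the same factor whatever the starting score `z` is. The change in probability does depend on where the borrower starts.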
An Interesting Fact About Logistic Regression
"The probability mass equals the counts"
Diagnostics
• Hold-out data:
Does the model predict well on data it hasn't seen?
• N-fold cross-validation: Formal estimate of generalization error
• "Pseudo-R^2": 1 – (deviance/null deviance)
Deviance, null deviance both reported by most standard packages
The fraction of "variance" that is explained by the model
Used the way R^2 is used
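The pseudo-R^2 calculation can be sketched as below, taking deviance to be -2 times the log-likelihood and the null model to be one that predicts the overall event rate for every row. The outcomes and scores are made up, and the function name is hypothetical.

```python
import math

def deviance(ys, probs):
    """-2 * log-likelihood of binary outcomes ys under predicted probabilities probs."""
    return -2.0 * sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(ys, probs)
    )

# Made-up outcomes and model scores.
ys = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1, 0.1, 0.4]

base_rate = sum(ys) / len(ys)                        # null model: same probability for everyone
null_deviance = deviance(ys, [base_rate] * len(ys))
model_deviance = deviance(ys, probs)
pseudo_r2 = 1.0 - model_deviance / null_deviance     # about 0.6 for these scores
```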
Diagnostics (Cont.)
• Sanity check the coefficients
Do the signs make sense? Are the coefficients excessively large?
Wrong sign is an indication of correlated inputs, but doesn't
necessarily affect predictive power.
Excessively large coefficient magnitudes may indicate strongly
correlated inputs; you may want to consider eliminating some
variables, or using regularized regression techniques.
Unfortunately, regularized logistic regression is not standard.
Infinite magnitude coefficients could indicate a variable that strongly
predicts a subset of the output (and doesn't predict well on the rest).
Try a Decision Tree on that variable, to see if you should segment the
data before regressing.
Diagnostics: ROC Curve
Diagnostics: Plot the Histograms of Scores
[Figure: histograms of scores for each class, showing good separation]
Logistic Regression - Reasons to Choose (+) and
Cautions (-)
Reasons to Choose (+)
• Explanatory value:
Relative impact of each variable on the outcome,
in a more complicated way than linear regression
• Robust with redundant variables, correlated variables
Lose some explanatory value
• Concise representation with the coefficients
• Easy to score data
• Returns good probability estimates of an event
Cautions (-)
• Does not handle missing values well
• Assumes that each variable affects the log-odds of the outcome linearly and additively
Variable transformations and modeling variable interactions can alleviate this
A good idea to take the log of monetary amounts or any variable with a wide dynamic range
• Cannot handle variables that affect the outcome in a discontinuous way.
Step functions
• Doesn't work well with discrete drivers that have a lot of distinct values
For example, ZIP code
Check Your Knowledge
Module 4: Advanced Analytics – Theory and
Methods
Lesson 4: Logistic Regression - Summary
During this lesson the following topics were covered:
• Technical description of a logistic regression model
• Common use cases for the logistic regression model
• Interpretation and scoring with the logistic regression model
• Diagnostics for validating the logistic regression model
• Reasons to Choose (+) and Cautions (-) of the logistic
regression model
Lab Exercise 7: Logistic Regression
This Lab is designed to investigate and practice Logistic
Regression.
Lab Exercise 7: Logistic Regression - Workflow
1. Set the Working Directory
7. Use the relevel function to re-level the Price factor with value 30 as the base reference
10. Predict outcome for a sequence of Age values at price 30 and income at its mean
11. Predict outcome for a sequence of income at price 30 and Age at its mean
Module 4: Advanced Analytics – Theory and
Methods
Lesson 5: Naïve Bayesian Classifiers
Classifiers
Naïve Bayesian Classifier : What is it?
• Used for classification
Actually returns a probability score on class membership:
In practice, probabilities generally close to either 0 or 1
Not as well calibrated as Logistic Regression
• Input variables are discrete
Popular for text classification
• Output:
Most implementations: log probability for each class
You could convert it to a probability, but in practice, we stay in the
log space
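A Naïve Bayesian classifier over discrete inputs that returns a log-probability score per class might be sketched as follows. The training rows, labels, and function names are made up, and the add-one (Laplace) smoothing is an assumption of this sketch, added so that unseen value/class pairs do not produce log(0).

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count classes, and feature values per (feature index, class)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> Counter of values
    levels = defaultdict(set)             # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            levels[j].add(value)
    return class_counts, value_counts, levels

def log_scores(model, row):
    """Unnormalized log score per class: log prior + sum of log conditionals."""
    class_counts, value_counts, levels = model
    n = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / n)                        # log prior
        for j, value in enumerate(row):
            # Add-one smoothing (an assumption of this sketch).
            numerator = value_counts[(j, label)][value] + 1
            denominator = count + len(levels[j])
            score += math.log(numerator / denominator)
        scores[label] = score
    return scores

# Made-up applicants: (housing type, job type) with a credit label.
rows = [("own", "skilled"), ("own", "skilled"), ("rent", "unskilled"),
        ("rent", "skilled"), ("own", "unskilled"), ("rent", "unskilled")]
labels = ["good", "good", "bad", "good", "bad", "bad"]

model = train_nb(rows, labels)
scores = log_scores(model, ("own", "skilled"))
predicted = max(scores, key=scores.get)
```

Summing logs instead of multiplying probabilities avoids numeric underflow when there are many input variables, which is one reason to stay in log space.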
Naïve Bayesian Classifier - Use Cases
• Preferred method for many text classification problems.
Try this first; if it doesn't work, try something more complicated
• Use cases
Spam filtering, other text classification tasks
Fraud detection
Building a Training Dataset
Example: Predicting Good or Bad credit
Predict the credit behavior of a
credit card applicant from
applicant's attributes:
• personal status
• job type
• housing type
• savings account
These are all categorical variables;
better suited to Naïve Bayesian
classifier than to logistic
regression.
Technical Description - Bayes' Law
The "Naïve" Assumption: Conditional Independence
so:
Independent of class – so it
cancels out
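In the usual notation, with C a class label and X = (x_1, …, x_n) the input attributes (symbols assumed here for illustration), Bayes' Law combined with the conditional independence assumption can be written as:

```latex
P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)},
\qquad
P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
\quad\Longrightarrow\quad
P(C \mid X) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The denominator P(X) is the same for every class, so it can be dropped when comparing classes.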
Building a Naïve Bayesian Classifier
Building a Naïve Bayesian Classifier (Continued)
Back to Credit Example
• Self-employed female
• savings > $1000
P(good|X) > P(bad|X):
Assign X the label "good"
Attribute | Class | Conditional probability
… | bad | 0.36
own | good | 0.75
own | bad | 0.62
self emp | good | 0.14
self emp | bad | 0.17
savings>1K | good | 0.06
savings>1K | bad | 0.02
Implementation Guideline
Diagnostics
• Hold-out data
How well does the model classify new instances?
• Cross-validation
• ROC curve/AUC
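An ROC curve can be traced by sweeping a threshold over the classifier's scores and recording the false positive rate and true positive rate at each step. AUC is then the area under that curve. The labels, scores, and function names below are made up for illustration.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs from sweeping the decision threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2.0
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

# Made-up labels (1 = positive class) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)   # 8/9: 8 of the 9 positive/negative pairs are ranked correctly
```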
Diagnostics: Confusion Matrix
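Rows are the true class; columns are the predicted class.
True Class | Predicted bad | Predicted good | Total
bad | 262 | 38 (false positives) | 300
good | 29 (false negatives) | 671 | 700
Total | 291 | 709 | 1000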
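Common summary rates follow directly from the four cell counts. The sketch below uses the counts from the matrix above and treats "good" as the positive class, an assumption that matches where the false positive and false negative labels sit.

```python
# Cell counts from the confusion matrix; "good" is treated as the positive class.
tn, fp = 262, 38     # true class bad:  predicted bad, predicted good
fn, tp = 29, 671     # true class good: predicted bad, predicted good

total = tn + fp + fn + tp
accuracy = (tp + tn) / total             # share of all cases classified correctly
true_positive_rate = tp / (tp + fn)      # share of true "good" cases predicted "good"
false_positive_rate = fp / (fp + tn)     # share of true "bad" cases predicted "good"
precision = tp / (tp + fp)               # share of predicted "good" cases that are truly "good"
```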
Naïve Bayesian Classifier - Reasons to Choose (+)
and Cautions (-)
Reasons to Choose (+)
• Handles missing values quite well
• Robust to irrelevant variables
• Easy to implement
• Easy to score data
• Resistant to over-fitting
• Computationally efficient
• Handles very high dimensional problems
• Handles categorical variables with a lot of levels
Cautions (-)
• Numeric variables have to be discrete (categorized)
Intervals
• Sensitive to correlated variables
"Double-counting"
• Not good for estimating probabilities
Stick to class label or yes/no
Check Your Knowledge
1. Consider the following Training Data Set:
• Apply the Naïve Bayesian Classifier to this data set and compute P(y = 1|X) for X = (1,0,0)
• Show your work

Training Data Set
X1  X2  X3  Y
1   1   1   0
1   1   0   0
0   0   0   0
0   1   0   1
1   0   1   1
0   1   1   1
Check Your Knowledge (Continued)
5. What is a confusion matrix and how is it used to evaluate the effectiveness of the model?
6. Consider the following data set with two input features, temperature and season
• What is the Naïve Bayesian assumption?
• Is the Naïve Bayesian assumption satisfied for this problem?

Temperature      Season   Electricity Usage (Class)
Below Average    Winter   High
Above Average    Winter   Low
Below Average    Summer   Low
Above Average    Summer   High
Module 4: Advanced Analytics – Theory and
Methods
Lesson 5: Naïve Bayesian Classifiers - Summary
During this lesson the following topics were covered:
• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier
Lab Exercise 8: Naïve Bayesian Classifier
This Lab is designed to investigate and practice the
Naïve Bayesian Classifier analytic technique.
Lab Exercise 8: Naïve Bayesian Classifier Part 1 - Workflow
1 • Set working directory and review training and test data
8 • Review results
Lab Exercise 8: Naïve Bayesian Classifier Part 2 - Workflow
1 • Define the Problem (Translating to an Analytics Question)
4 • Build the Training Dataset and the Test Dataset from the Database
5 • Extract the first 10000 records for the training data set and the remaining 10 for the test
Module 4: Advanced Analytics – Theory and
Methods
Lesson 6: Decision Trees
Decision Tree Classifier - What is it?
• Used for classification:
Returns probability scores of class membership
Well-calibrated, like logistic regression
Assigns label based on highest scoring class
Some Decision Tree algorithms return simply the most likely class
Regression Trees: a variation for regression
Returns average value at every node
Predictions can be discontinuous at the decision boundaries
• Input variables can be continuous or discrete
• Output:
A tree that describes the decision flow.
Leaf nodes return either a probability score, or simply a classification.
Trees can be converted to a set of "decision rules"
"IF income < $50,000 AND mortgage_amt > $100K THEN default=T with 75% probability"
Decision Tree – Example of Visual Structure
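[Figure: example decision tree. The root node tests Gender. Each branch is an outcome of the test (Female, Male), and the branches lead to further tests on Income and Age.]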
Decision Tree Classifier - Use Cases
• When a series of questions (yes/no) are answered to arrive at a
classification
Biological species classification
Checklist of symptoms during a doctor’s evaluation of a patient
• When “if-then” conditions are preferred to linear models.
Customer segmentation to predict response rates
Financial decisions such as loan approval
Fraud detection
• Short Decision Trees are the most popular "weak learner" in
ensemble learning techniques
Example: The Credit Prediction Problem
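[Figure: decision tree for the credit prediction problem]
• Root: good, 700/1000, p(good)=0.7
• Leaf: good, 245/294, p(good)=0.83
• Split on housing:
  housing=own: leaf good, 349/501, p(good)=0.7
  housing=free, rent: split on personal status
    personal=female, male div/sep: leaf bad, 36/88, p(good)=0.42
    otherwise: leaf good, 70/119, p(good)=0.6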
General Algorithm
• To construct tree T from training set S
If all examples in S belong to a single class c in C, or S is sufficiently
"pure", then make a leaf labeled c.
Otherwise:
select the “most informative” attribute A
partition S according to A’s values
recursively construct sub-trees T1, T2, ..., for the subsets of S
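The recursion above can be sketched in a few lines of base R for categorical attributes. This is an illustrative ID3-style sketch, not the algorithm used by any particular R package: the function names, the toy data frame `toy`, and the choice of information gain as the "most informative" criterion are all assumptions made here, and the sketch stops only when a node is pure or no attributes remain.

```r
# Entropy (in bits) of a vector of class labels.
entropy <- function(y) {
  p <- as.numeric(table(as.character(y))) / length(y)
  -sum(p * log2(p))
}

# Information gain of attribute values x with respect to labels y.
info_gain <- function(x, y) {
  x <- as.character(x)
  h_cond <- sum(sapply(unique(x), function(v) mean(x == v) * entropy(y[x == v])))
  entropy(y) - h_cond
}

# Recursively build a tree from data frame S; a leaf is a class label (string).
build_tree <- function(S, target, attrs) {
  y <- as.character(S[[target]])
  if (length(unique(y)) == 1 || length(attrs) == 0) {
    return(names(which.max(table(y))))        # pure node, or nothing left to split on
  }
  gains <- sapply(attrs, function(a) info_gain(S[[a]], y))
  best  <- attrs[which.max(gains)]            # "most informative" attribute
  parts <- split(S, as.character(S[[best]]))  # partition S by the attribute's values
  list(attribute = best,
       children  = lapply(parts, build_tree, target = target,
                          attrs = setdiff(attrs, best)))
}

# Hypothetical toy data, for illustration only.
toy <- data.frame(
  housing = c("own", "own", "rent", "rent", "own", "rent", "own"),
  savings = c("high", "low", "low", "low", "high", "high", "low"),
  credit  = c("good", "good", "bad", "bad", "good", "good", "good"),
  stringsAsFactors = FALSE
)
tree <- build_tree(toy, "credit", c("housing", "savings"))
str(tree)
```

A practical implementation would also stop early on "sufficiently pure" or very small nodes, which this sketch omits.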
Step 1: Pick the Most "Informative" Attribute
Step 1: Pick the Most "Informative" Attribute (Continued)
Step 1: Pick the Most "Informative" Attribute (Continued)
Conditional Entropy
• The weighted sum of the class entropies for each value of the
attribute
• In English: attribute values (home owner vs. renter) give more
information about class membership
"Home owners are more likely to have good credit than renters"
• Conditional entropy should be lower than unconditioned
entropy
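In symbols, the standard definitions read as follows. The notation is assumed here for illustration: $Y$ is the class variable with values $c$, and $A$ is an attribute with values $a$.

```latex
H(Y) = -\sum_{c} P(c)\,\log_2 P(c)

H(Y \mid A) = \sum_{a} P(a)\, H(Y \mid A = a)
            = -\sum_{a} P(a) \sum_{c} P(c \mid a)\,\log_2 P(c \mid a)
```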
Conditional Entropy Example
Step 1: Pick the Most "Informative" Attribute (Continued)
Information Gain
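Information gain is commonly defined as the entropy of the class minus the conditional entropy given the attribute. The base R sketch below computes it from class counts. The root counts (700 good out of 1000) come from the credit example; the two-way split is an assumption made here for illustration, and the function names are hypothetical.

```r
# Entropy (in bits) of a node, given its class counts, e.g. c(good = 700, bad = 300).
entropy_counts <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Information gain of a split: parent entropy minus the size-weighted
# entropy of the child nodes. `children` is a list of class-count vectors.
info_gain_counts <- function(parent, children) {
  sizes  <- sapply(children, sum)
  h_cond <- sum(sizes / sum(sizes) * sapply(children, entropy_counts))
  entropy_counts(parent) - h_cond
}

root <- c(good = 700, bad = 300)            # root node of the credit example

# Assumed two-way split, for illustration only.
children <- list(c(good = 245, bad = 49),   # 294 records
                 c(good = 455, bad = 251))  # remaining 706 records

h_root <- entropy_counts(root)
gain   <- info_gain_counts(root, children)
print(round(c(entropy_root = h_root, info_gain = gain), 3))
```

The gain is small compared with the root entropy of about 0.881 bits, which is typical when no single attribute separates the classes well.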
Back to the Credit Prediction Example
Attribute InfoGain
job 0.001
housing 0.013
personal_status 0.006
savings_status 0.028
Step 2 & 3: Partition on the Selected Variable
• Step 2: Find the partition with the highest InfoGain
In our example the selected partition has InfoGain = 0.028
• Step 3: At each resulting node, repeat Steps 1 and 2
[Figure: root node good, 700/1000, p(good)=0.7, partitioned on savings into two branches: savings=(500:1000), >=1000, no known savings and savings= <100, (100:500)]
Diagnostics
• Hold-out data
• ROC/AUC
• Confusion Matrix
• FPR/FNR, Precision/Recall
• Do the splits (or the "rules") make sense?
What does the domain expert say?
• How deep is the tree?
Too many layers are prone to over-fit
• Do you get nodes with very few members?
Over-fit
Decision Tree Classifier - Reasons to Choose (+) & Cautions (-)
Reasons to Choose (+)
• Takes any input type (numeric, categorical)
  In principle, can handle categorical variables with many distinct values (ZIP code)
• Robust with redundant variables, correlated variables
• Naturally handles variable interaction
• Handles variables that have non-linear effect on outcome
• Computationally efficient to build
• Easy to score data
Cautions (-)
• Decision surfaces can only be axis-aligned
• Tree structure is sensitive to small changes in the training data
• A "deep" tree is probably over-fit
  Because each split reduces the training data for subsequent splits
• Not good for outcomes that are dependent on many variables
  Related to over-fit problem, above
• Doesn't naturally handle missing values
  However most implementations include a method for dealing with this
• In practice, decision rules can be fairly complex
Which Classifier Should I Try?
Typical Questions Recommended Method
Check Your Knowledge
1. How do you define information gain?
2. For what conditions is the value of entropy at a maximum and when is it at
a minimum?
3. List three use cases of Decision Trees.
4. What are weak learners and how are they used in ensemble methods?
5. Why do deep trees, and data sets in which outcomes depend on many
variables, lead to over-fitted models?
6. What classification method would you recommend for the following cases:
High dimensional data
Data in which outputs are affected by non-linearity and discontinuity in
the inputs
Module 4: Advanced Analytics – Theory and
Methods
Lesson 6: Decision Trees - Summary
Lab Exercise 9: Decision Trees
This Lab is designed to investigate and practice Decision
Tree (DT) models covered in the course work.
Lab Exercise 9: Decision Trees - Workflow
Module 4: Advanced Analytics – Theory and
Methods
Lesson 7: Time Series Analysis
During this lesson the following topics are covered:
• Time Series Analysis and its applications in forecasting
• ARIMA Model
• Implementing the Box-Jenkins Methodology using R
• Reasons to Choose (+) and Cautions (-) with Time Series
Analysis
Time Series Analysis
What will our December sales be
(based on the sales of the last few months)?
• Time Series Analysis accounts for the internal structure of
measurements taken over time
Trend
Seasonality
Cycles
Irregular
• Time series: Ordered sequence of numerical values, measured
over equally spaced time intervals
• The goal can be to identify the internal structure, or to forecast
near-future events based on recent history
• Our Example: Box-Jenkins Methods (ARMA, ARIMA)
Box-Jenkins: What is it?
Used for predicting the next few observations in a time series,
based on the last few observations.
• Input: Trend and Seasonally-adjusted time series
• Output: Expected future value of the time series
• Applies ARMA (Autoregressive Moving Average) and ARIMA
(Autoregressive Integrated Moving Average) models
Use Cases
• Forecast next month's sales
Based on last few months
• Forecast tomorrow's stock price
Based on last few days
• Forecast power demand in the near term
Based on last few days
Modeling a Time Series
• Let's model the time series as
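$Y_t = T_t + S_t + R_t, \quad t = 1, \dots, n$
where $T_t$ is the trend, $S_t$ the seasonal component and $R_t$ the random (irregular) component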
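As a rough illustration of this additive model, the sketch below simulates a monthly series as trend plus seasonal pattern plus noise and then recovers the parts with base R's `decompose()`. The series, its parameters and the variable names are hypothetical.

```r
set.seed(1)
n <- 48                                         # four years of monthly data
trend    <- 100 + 2 * (1:n)                     # linear trend
seasonal <- 10 * sin(2 * pi * (1:n) / 12)       # yearly pattern
noise    <- rnorm(n, sd = 1)                    # irregular part
y <- ts(trend + seasonal + noise, start = c(1999, 1), frequency = 12)

# Moving-average decomposition into trend, seasonal and random components.
parts <- decompose(y, type = "additive")

print(round(parts$figure, 1))   # estimated seasonal effect for each of the 12 months
# plot(parts) would show the observed series and the three components.
```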
Stationary Sequences
Many time series analyses (Basic Box-Jenkins in particular) assume
stationary sequences:
Mean, variance and autocorrelation structure do not change over
time
In practice, this often means you must de-trend and seasonally
adjust the data
ARIMA in principle can make the data (more) stationary with
differencing
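A small base R sketch of what differencing does, using made-up series: a first difference turns a linear trend into a constant, and a seasonal difference (lag 12 for monthly data) removes a fixed yearly pattern.

```r
time_idx <- 1:24
y  <- ts(5 + 3 * time_idx, frequency = 12)     # linear trend: the mean changes over time
d1 <- diff(y, lag = 1, differences = 1)        # first difference
print(unique(as.numeric(d1)))                  # a single constant value: trend removed

# A fixed monthly pattern repeated for three years (hypothetical values).
s   <- ts(rep(c(1, 4, 2, 8, 5, 7, 3, 6, 9, 0, 2, 4), times = 3), frequency = 12)
d12 <- diff(s, lag = 12)                       # seasonal difference
print(unique(as.numeric(d12)))                 # all zeros: seasonal pattern removed
```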
De-trending
Seasonal Adjustment
ACF & PACF
• Auto Correlation Function (ACF)
Correlation of the values of the time series with itself
Similarity of the observations as a function of time
Autocorrelation "carries over"
if X_t is correlated with X_{t-1}, it is also correlated with X_{t-2} (though to a
lesser degree)
• Partial Auto Correlation Function (PACF)
The autocorrelation at lag k that is not explained by "carry
over"
Helps determine the order of autoregressive models
Where does PACF go to zero?
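To see the difference between the two functions, the sketch below simulates a first-order autoregressive series with `arima.sim()` and compares its ACF and PACF. The coefficient 0.8, the series length and the seed are arbitrary choices made for illustration.

```r
set.seed(42)
x <- arima.sim(model = list(ar = 0.8), n = 1000)   # simulated AR(1) series

a <- acf(x, lag.max = 10, plot = FALSE)    # a$acf starts at lag 0
p <- pacf(x, lag.max = 10, plot = FALSE)   # p$acf starts at lag 1

print(round(a$acf[2:4], 2))   # lags 1 to 3: gradual decay ("carry over")
print(round(p$acf[1:3], 2))   # lag 1 large, later lags near zero
```

For an AR(1) process the PACF drops to roughly zero after lag 1, which is the pattern used to read off the autoregressive order.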
ARMA Model
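One common way to write an ARMA(p, q) model is given below. The symbols are assumed here for illustration ($\delta$ a constant, $\phi_i$ the autoregressive coefficients, $\theta_j$ the moving average coefficients, $\varepsilon_t$ white noise), and sign conventions for the moving average terms vary between texts.

```latex
Y_t = \delta + \sum_{i=1}^{p} \phi_i\, Y_{t-i}
      + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j}
```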
ARIMA Model
A combination of AR and MA models
The general non-seasonal model is known as ARIMA (p, d, q):
p is the number of autoregressive terms
d is the number of differences
q is the number of moving average terms
Model Selection
• The Data Scientist must pick p, d and q
An "art form" that requires domain knowledge, modeling
experience, and a few iterations
A simple AR model (q = 0), or MA model (p=0) might be simpler for
the novice
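One way to support these iterations, sketched below, is to fit a few candidate orders and compare them with an information criterion such as AIC. This is only an aid and not a substitute for the judgment described above; the simulated series and the list of candidate orders are hypothetical.

```r
set.seed(7)
y <- arima.sim(model = list(ar = 0.7), n = 1000)   # hypothetical stationary series

# Candidate (p, d, q) orders to compare; the list is an illustrative choice.
orders <- list(c(1, 0, 0), c(0, 0, 1), c(1, 0, 1), c(2, 0, 0))
fits   <- lapply(orders, function(o) arima(y, order = o))

aic <- sapply(fits, AIC)
names(aic) <- sapply(orders, paste, collapse = ",")
print(round(aic, 1))            # lower AIC indicates a better trade-off

best <- fits[[which.min(aic)]]
print(best)
```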
Time Series Analysis - Reasons to Choose (+) & Cautions (-)
Reasons to Choose (+)
• Minimal data collection
  Only have to collect the series itself
  Do not need to input drivers
• Designed to handle the inherent autocorrelation of lagged time series
  Compared to simple linear regression
  Once you've seasonally/trend adjusted
Cautions (-)
• No meaningful drivers: prediction based only on past performance
  No explanatory value
  Can't do "what-if" scenarios
  Can't stress test
• It's an "art form" to select appropriate parameters
• Suitable for short term predictions only
Time Series Analysis with R
• Getting the data and plotting
• The function "ts" is used to create time series objects
Made into an R time series via
mydata.data <- ts(mydata, start=c(1999,1), frequency=12)
• Model building – use plot and box plot
• Differencing
diff(hstart.data, 1, 1)   (lag = 1, differences = 1)
• acf: computes (and by default plots) estimates of the autocovariance or autocorrelation function
• pacf: computes the partial autocorrelations
Time Series Analysis with R (Continued)
• ar: Fit an autoregressive time series model to the data
• arima: Fit an ARIMA model to a univariate time series
• predict: Do model predictions
"predict" is a generic function for predictions from the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument
• arima.sim: Simulate from an ARIMA model
• ARMAtoMA: Convert ARMA process to infinite MA process
• decompose: Decompose a time series into seasonal, trend and irregular components using moving averages
Deals with additive or multiplicative seasonal component
• stl: Decompose a time series into seasonal, trend and irregular components using loess
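Several of these functions fit together as in the sketch below: build a `ts` object, fit an ARIMA model with `arima()`, and forecast with `predict()`. The series `sales` is simulated and the order (0, 1, 1) is an arbitrary illustrative choice, not a recommendation from the course.

```r
set.seed(123)
# Hypothetical monthly series: a random walk with drift, starting January 1999.
sales <- ts(100 + cumsum(rnorm(60, mean = 1, sd = 2)),
            start = c(1999, 1), frequency = 12)

fit <- arima(sales, order = c(0, 1, 1))     # ARIMA(p = 0, d = 1, q = 1)
fc  <- predict(fit, n.ahead = 6)            # forecasts for the next six months

print(fc$pred)                              # point forecasts (a ts object)
# Approximate 95% interval from the standard errors returned by predict().
print(cbind(lower = fc$pred - 1.96 * fc$se,
            upper = fc$pred + 1.96 * fc$se))
```

The standard errors grow with the forecast horizon, which is one reason such forecasts are best kept short term.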
Check Your Knowledge
Module 4: Advanced Analytics – Theory and Methods
Lab Exercise 10: Time Series Analysis
This Lab is designed to investigate and practice Time
Series Analysis with ARIMA models (Box-Jenkins methodology).
Lab Exercise 10: Time Series Analysis - Workflow
1 • Set the Working Directory
5 • Review, Update, and Prepare DataFrame "msales" File for ARIMA Modeling
12 • Generate Predictions
Module 4: Advanced Analytics – Theory and
Methods
Lesson 8: Text Analysis
Text Analysis
Encompasses the processing and representation of text for
analysis and learning tasks
• High-dimensionality
Every distinct term is a dimension
Green Eggs and Ham: A 50-D problem!
• Data is unstructured
Text Analysis – Problem-solving Tasks
• Parsing
Impose a structure on the unstructured/semi-structured text for
downstream analysis
• Search/Retrieval
Which documents have this word or phrase?
Which documents are about this topic or this entity?
• Text-mining
Example: Brand Management
Buzz Tracking: The Process
1. Monitor social networks, review sites for mentions of our products.
   Task: Parse the data feeds to get actual content. Find and filter the raw text for product names (use regular expressions).
2. Collect the reviews.
   Task: Extract the relevant raw text. Convert the raw text into a suitable document representation. Index into our review corpus.
3. Sort the reviews by product.
   Task: Classification (or "Topic Tagging")
4. Are they good reviews or bad reviews? We can keep a simple count here, for trend analysis.
   Task: Classification (sentiment analysis)
5. Marketing calls up and reads selected reviews in full, for greater insight.
   Task: Search/Information Retrieval.
Parsing: Parsing the Feeds
• Impose structure on semi-structured data.
• We need to know where to look for what we are looking for.
Parsing: Regular Expressions
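As a small illustration of filtering raw text for product names, the base R sketch below uses `grepl()`, `regmatches()` and `gsub()`. The review snippets and the pattern for model codes are hypothetical; only the product name "bPhone-5X" and its abbreviation "5X" are taken from the slides.

```r
# Hypothetical review snippets.
reviews <- c("Love my new bPhone-5X, battery lasts all day",
             "The 5X keeps rebooting",
             "Switched from bPhone-4 to a competitor",
             "Nothing about phones here")

# Pattern: the literal "bPhone-" followed by digits and an optional capital letter.
pattern <- "bPhone-[0-9]+[A-Z]?"

has_product <- grepl(pattern, reviews)                          # which snippets mention a product
mentions    <- regmatches(reviews, regexpr(pattern, reviews))   # the matched product names

# Canonicalize the abbreviation "5X" unless it is already part of "bPhone-5X".
canon <- gsub("(?<!bPhone-)\\b5X\\b", "bPhone-5X", reviews, perl = TRUE)

print(reviews[has_product])
print(mentions)
print(canon[2])
```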
Parsing: Extract and Represent Text
Parsing: Document Representation - Other Features
Parsing: Representing a Corpus (Collection of Documents)
• Reverse index
For every possible feature, a list of all the documents that contain
that feature
• Corpus metrics
Volume
Corpus-wide term frequencies
Inverse Document Frequency (IDF)
more on this later
• Challenge: a Corpus is dynamic
Index, metrics must be updated continuously
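A reverse index of this kind can be sketched in base R as a named list that maps each term to the documents containing it. The three-document corpus and the simple tokenizer below are hypothetical, chosen only to keep the example small.

```r
# Hypothetical mini-corpus of three documents.
docs <- c(d1 = "how to unbrick a phone",
          d2 = "new phone review phone camera",
          d3 = "camera review")

# Tokenize: lower-case, split on anything that is not a letter or digit,
# and keep each term once per document.
tokens <- lapply(docs, function(d) {
  w <- strsplit(tolower(d), "[^a-z0-9]+")[[1]]
  unique(w[w != ""])
})

# Reverse index: term -> names of the documents that contain it.
pairs <- data.frame(term = unlist(tokens, use.names = FALSE),
                    doc  = rep(names(tokens), lengths(tokens)),
                    stringsAsFactors = FALSE)
index <- split(pairs$doc, pairs$term)

print(index[["phone"]])    # documents that contain "phone"
print(index[["camera"]])
```

Adding a document means appending its terms to the index, which is why the index and the corpus metrics have to be kept up to date as the corpus changes.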
Text Mining: Text Classification (I) - "Topic Tagging"
Text Mining: "Topic Tagging"
3. Sort the Reviews by Product
Judicious choice of features
Product mentioned in title?
Tweet, or review?
Term frequency
Canonicalize abbreviations
"5X" = "bPhone-5X"
Text Mining: Text Classification (II) Sentiment Analysis
Search & Retrieval: Search and Information Retrieval
Search & Retrieval: Quality of Search Results
Search & Retrieval: Computing Relevance
Search & Retrieval: Inverse Document Frequency (idf)
Search & Retrieval: TF-IDF and Modified Retrieval Algorithm
• tf-idf of a term in a document: tf_document(term) * idf(term)
query: "unbrick phone"
• Document with "unbrick" a few times more relevant than
document with "phone" many times
• Measure of Relevance with tf-idf
• Call up all the documents that have any of the terms from the
query, and sum up the tf-idf of each term:
relevance(document, query) = sum over the terms in the query of tf_document(term) * idf(term)
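The scoring rule can be sketched in base R as below. The three-document corpus is hypothetical, and the idf variant used here, log(N / number of documents containing the term), is one common choice assumed for illustration; the course may define idf slightly differently.

```r
# Hypothetical mini-corpus.
docs <- c(d1 = "unbrick phone guide unbrick",
          d2 = "phone phone phone camera phone",
          d3 = "camera lens review")
tokens <- lapply(docs, function(d) strsplit(d, " ", fixed = TRUE)[[1]])

# tf: raw count of a term in a document.
tf <- function(term, doc) sum(tokens[[doc]] == term)

# idf: log(N / number of documents containing the term), an assumed variant.
idf <- function(term) {
  df <- sum(sapply(tokens, function(w) term %in% w))
  log(length(tokens) / df)
}

# Relevance of a document to a query: sum of tf * idf over the query terms.
relevance <- function(query, doc) {
  terms <- strsplit(query, " ", fixed = TRUE)[[1]]
  sum(sapply(terms, function(tm) tf(tm, doc) * idf(tm)))
}

scores <- sapply(names(docs), function(d) relevance("unbrick phone", d))
print(round(sort(scores, decreasing = TRUE), 3))
```

Here d1, with two mentions of the rare term "unbrick", outranks d2, which repeats the common term "phone" four times. The sketch does not handle query terms that appear in no document.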
Search & Retrieval: Other Relevance Metrics
• "Authoritativeness" of source
PageRank is an example of this
• Recency of document
• How often the document has been retrieved by other users
Search & Retrieval: Effectiveness of Search and Retrieval
• Relevance metric
important for precision, user experience
• Effective crawl, extraction, indexing
important for recall (and precision)
more important, often, than retrieval algorithm
• MapReduce
Reverse index, corpus term frequencies, idf
Challenges - Text Analysis
Check Your Knowledge
1. What are the two major challenges in the problem of text
analysis?
2. What is a reverse index?
3. Why are the corpus metrics dynamic? Provide an example and a
scenario that explains the dynamism of the corpus metrics.
4. How does tf-idf enhance the relevance of a search result?
5. List and discuss a few methods that are deployed in text
analysis to reduce the dimensions.
Module 4: Advanced Analytics – Theory and
Methods
Lesson 8: Text Analysis - Summary
Module 4: Summary
Key Topics Covered in this module
• Algorithms and technical foundations
• Key use cases
• Diagnostics and validation of the model
• Reasons to Choose (+) and Cautions (-) of the model
• Fitting, scoring and validating model in R and in-db functions
Methods Covered in this module
• Categorization (unsupervised): K-means clustering, Association Rules
• Regression: Linear, Logistic
• Classification (supervised): Naïve Bayesian classifier, Decision Trees
• Time Series Analysis
• Text Analysis