Sharda Dss11e Ch03
Chapter 3
Nature of Data, Statistical Modeling
and Visualization
Copyright © 2020, 2015, 2011 Pearson Education, Inc. All Rights Reserved
Introduction
• In the age of Big Data and business analytics in which we are
living, the importance of data is undeniable. Newly coined
phrases such as “data are the oil,” “data are the new bacon,”
“data are the new currency,” and “data are the king” are
further stressing the renewed importance of data. But the type
of data we are talking about is obviously not just any data.
• The “garbage in garbage out—GIGO” concept/principle
applies to today’s Big Data phenomenon more so than any
data definition that we have had in the past. To live up to their
promise, value proposition, and ability to turn into insight,
data have to be carefully created/identified, collected,
integrated, cleaned, transformed, and properly contextualized
for use in accurate and timely decision making.
Introduction
• Data are the main theme of this chapter. Accordingly, the
chapter starts with a description of the nature of data: what
they are, what different types and forms they can come in,
and how they can be preprocessed and made ready for
analytics.
• Following the statistics sections are sections on reporting
and visualization. A report is a communication artifact
prepared with the specific intention of converting data into
information and knowledge and relaying that information in
an easily understandable/digestible format.
• Today, these reports are visually oriented, often using colors
and graphical icons that collectively look like a dashboard to
enhance the information content.
Learning Objectives (1 of 2)
3.1 Understand the nature of data as it relates to business
intelligence (BI) and analytics
3.2 Learn the methods used to make real-world data
analytics ready
3.3 Describe statistical modeling and its relationship to
business analytics
3.4 Learn about descriptive and inferential statistics
3.5 Define business reporting, and understand its historical
evolution
Learning Objectives (2 of 2)
3.6 Understand the importance of data/information
visualization
3.7 Learn different types of visualization techniques
3.8 Appreciate the value that visual analytics brings to
business analytics
3.9 Know the capabilities and limitations of dashboards
Opening Vignette
SiriusXM Attracts and Engages a New Generation of
Radio Consumers with Data-Driven Marketing
1. What does SiriusXM do? In what type of market does it
conduct its business?
2. What were the challenges? Comment on both
technology and data-related challenges.
3. What were the proposed solutions?
4. How did they implement the proposed solutions? Did
they face any implementation challenges?
5. What were the results and benefits? Were they worth the
effort/investment?
The Nature of Data (1 of 2)
• Data: a collection of facts
– usually obtained as the result of experiences,
observations, or experiments
• Data may consist of numbers, words, images, …
• Data is the lowest level of abstraction (from which
information and knowledge are derived)
• Data is the source for information and knowledge
• Data quality and data integrity are critical to analytics
The Nature of Data (2 of 2)
Metrics for Analytics-Ready Data
• Data source reliability
• Data content accuracy
• Data accessibility
• Data security and data privacy
• Data richness
• Data consistency
• Data currency/data timeliness
• Data granularity
• Data validity and data relevancy
A Simple Taxonomy of Data (1 of 2)
• Data (datum—singular form of data): facts
• Structured data
– Targeted for computers to process
– Numeric versus nominal
• Unstructured/textual data
– Targeted for humans to process/digest
• Semi-structured data?
– XML, HTML, log files, etc.
• Data taxonomy…
A Simple Taxonomy of Data (2 of 2)
Application Case 3.1
Verizon Answers the Call for Innovation: The
Nation’s Largest Network Provider Uses
Advanced Analytics to Bring the Future to Its
Customers
Questions for Discussion:
1. What was the challenge Verizon was facing?
2. What was the data-driven solution proposed for Verizon’s
business units?
3. What were the results?
The Art and Science of Data
Preprocessing (1 of 2)
• Real-world data is dirty, misaligned, overly complex,
and inaccurate
– Not ready for analytics!
• The data must be readied for analytics
– Data preprocessing steps: data consolidation, data cleaning,
data transformation, data reduction
• Art – it develops and improves with experience
The Art and Science of Data
Preprocessing (2 of 2)
• Data reduction
1. Variables
– Dimensional reduction
– Variable selection
2. Cases/samples
– Sampling
– Balancing / stratification
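The two case-reduction ideas above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's procedure: the function names, the `status` field, and the student records are all hypothetical, and balancing is done here by undersampling the larger class.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction of records from every stratum (group)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

def undersample(records, key, seed=0):
    """Balance classes by shrinking every class to the size of the smallest one."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    smallest = min(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, smallest))
    return balanced

# Hypothetical data: 90 retained students and 10 who dropped out.
students = [{"id": i, "status": "retained"} for i in range(90)]
students += [{"id": 90 + i, "status": "dropped"} for i in range(10)]

sample = stratified_sample(students, "status", fraction=0.2)  # keeps the 9:1 ratio
balanced = undersample(students, "status")                    # forces a 1:1 ratio
```

Stratified sampling preserves the class proportions of the original data, while balancing deliberately changes them so that the rare class is not overwhelmed.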
Data Preprocessing Tasks and Methods
Table 3.1 A Summary of Data Preprocessing Tasks and Potential Methods.
Main Task | Subtask | Popular Methods
Data consolidation | Access and collect the data | SQL queries, software agents, Web services.
Data consolidation | Select and filter the data | Domain expertise, SQL queries, statistical tests.
Data consolidation | Integrate and unify the data | SQL queries, domain expertise, ontology-driven data mapping.
Data cleaning | Handle missing values in the data | Fill in missing values (imputations) with most appropriate values (mean, median, min/max, mode, etc.); recode the missing values with a constant such as “ML”; remove the record of the missing value; do nothing.
Data cleaning | Identify and reduce noise in the data | Identify the outliers in data with simple statistical techniques (such as averages and standard deviations) or with cluster analysis; once identified, either remove the outliers or smooth them by using binning, regression, or simple averages.
Data cleaning | Find and eliminate erroneous data | Identify the erroneous values in data (other than outliers), such as odd values, inconsistent class labels, odd distributions; once identified, use domain expertise to correct the values or remove the records holding the erroneous values.
Data transformation | Normalize the data | Reduce the range of values in each numerically valued variable to a standard range (e.g., 0 to 1 or −1 to +1) by using a variety of normalization or scaling techniques.
Data transformation | Discretize or aggregate the data | If needed, convert the numeric variables into discrete representations using range- or frequency-based binning techniques; for categorical variables, reduce the number of values by applying proper concept hierarchies.
Data transformation | Construct new attributes | Derive new and more informative variables from the existing ones using a wide range of mathematical functions (as simple as addition and multiplication or as complex as a hybrid combination of log transformations).
Data reduction | Reduce number of attributes | Use principal component analysis, independent component analysis, chi-square testing, correlation analysis, and decision tree induction.
Data reduction | Reduce number of records | Perform random sampling, stratified sampling, expert-knowledge-driven purposeful sampling.
Data reduction | Balance skewed data | Oversample the less represented or undersample the more represented classes.
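Three of the methods named in Table 3.1 (imputation, outlier detection with averages and standard deviations, and normalization to the 0 to 1 range) might look as follows in Python. This is a sketch under stated assumptions: the helper names, the income figures, and the z-score threshold of 2.0 are illustrative choices, not values from the text.

```python
import statistics

def impute_missing(values, strategy=statistics.median):
    """Replace None entries with a summary (median by default) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, z_threshold=2.0):
    """Mark values lying more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) / sd > z_threshold for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical monthly incomes with one missing entry and one extreme value.
incomes = [3200, 2900, None, 3100, 3000, 3300, 2800, 25000]

filled = impute_missing(incomes)    # None becomes the median of the observed values
outliers = flag_outliers(filled)    # only the 25000 entry is flagged
scaled = min_max_normalize(filled)  # smallest value maps to 0.0, largest to 1.0
```

In practice the outlier would usually be handled before normalizing, because a single extreme value compresses every other scaled value toward zero.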
Application Case 3.2 (1 of 4)
Improving Student Retention with Data-Driven
Analytics
Questions for Discussion:
1. What is student attrition, and why is it an important
problem in higher education?
2. What were the traditional methods to deal with the
attrition problem?
3. List and discuss the data-related challenges within the
context of this case study.
4. What was the proposed solution? And, what were the
results?
Application Case 3.2 (2 of 4)
Improving Student
Retention with Data-Driven
Analytics
• Student retention
– Freshmen class
• Why is it important?
• What are the common techniques
to deal with student attrition?
• Analytics versus theoretical
approaches to student retention
problem
Application Case 3.2 (3 of 4)
Improving Student Retention with Data-Driven
Analytics
• Data imbalance problem
Application Case 3.2 (4 of 4)
Improving Student Retention with Data-Driven Analytics
• Results…
Statistical Modeling for Business
Analytics (1 of 2)
Statistical Modeling for Business
Analytics (2 of 2)
• Statistics
– A collection of mathematical techniques to
characterize and interpret data
• Descriptive Statistics
– Describing the data (as it is)
• Inferential statistics
– Drawing inferences about the population based on
sample data
• Descriptive statistics for descriptive analytics
Descriptive Statistics Measures of
Centrality Tendency (1 of 2)
• Arithmetic mean
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Median
– The number in the middle
• Mode
– The most frequent observation
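The three measures can be computed with Python's standard `statistics` module. The sales figures below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import statistics

# Hypothetical daily sales counts.
sales = [12, 15, 11, 15, 20, 15, 10]

mean_sales = statistics.mean(sales)      # sum of the values divided by n
median_sales = statistics.median(sales)  # middle value after sorting
mode_sales = statistics.mode(sales)      # most frequent value
```

Here the mean (14) sits below the median and the mode (both 15), a small reminder that the three measures need not agree.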
Descriptive Statistics Measures of
Dispersion (1 of 2)
• Dispersion
– Degree of variation in a given variable
• Range
– Max - Min
• Variance
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• Standard deviation
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
• Mean Absolute Deviation (MAD)
– Average absolute deviation from the mean
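As a worked example, the dispersion measures on this slide can be computed directly from their definitions. The data values are hypothetical; the variance uses the n − 1 (sample) denominator, and the mean absolute deviation averages over all n values.

```python
import math
import statistics

# Hypothetical observations.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n

value_range = max(values) - min(values)                    # Max - Min
variance = sum((x - mean) ** 2 for x in values) / (n - 1)  # sample variance
std_dev = math.sqrt(variance)                              # sample standard deviation
mad = sum(abs(x - mean) for x in values) / n               # mean absolute deviation
```

The hand-computed variance and standard deviation agree with `statistics.variance` and `statistics.stdev`, which use the same n − 1 denominator.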
Descriptive Statistics Measures of
Dispersion (2 of 2)
• Quartiles
• Box-and-Whiskers Plot
– a.k.a. box-plot
– Versatile / informative
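The numbers behind a box-and-whiskers plot can be sketched as follows. The processing times are hypothetical, and the 1.5 × IQR whisker limit is an assumed (though common) convention rather than something stated on the slide. The quartile method is also a choice: `method="inclusive"` is used here, and other methods give slightly different values.

```python
import statistics

# Hypothetical order-processing times in hours.
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles split the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(times, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed convention: whiskers reach at most 1.5 * IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [t for t in times if t < lower_fence or t > upper_fence]
```

Points outside the fences, here the single value 30, are the ones a box-plot draws individually beyond the whiskers.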
Descriptive Statistics Measures of
Centrality Tendency (2 of 2)
• Histogram - frequency chart
• Skewness
– Measure of asymmetry
$\text{skewness} = S = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\, s^3}$
• Kurtosis
– Peak/tall/skinny nature of the distribution
$\text{kurtosis} = K = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n\, s^4} - 3$
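A direct translation of the two shape formulas into Python is sketched below, with s taken as the sample standard deviation (n − 1 denominator). The function names and data sets are hypothetical. Statistical packages often use slightly different bias corrections, so their output may not match these values exactly.

```python
import math

def sample_std(values):
    """Sample standard deviation s with the n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def skewness(values):
    """S = sum((x_i - mean)^3) / ((n - 1) * s^3)."""
    n = len(values)
    mean = sum(values) / n
    s = sample_std(values)
    return sum((x - mean) ** 3 for x in values) / ((n - 1) * s ** 3)

def kurtosis(values):
    """K = sum((x_i - mean)^4) / (n * s^4) - 3."""
    n = len(values)
    mean = sum(values) / n
    s = sample_std(values)
    return sum((x - mean) ** 4 for x in values) / (n * s ** 4) - 3

symmetric = [1, 2, 3, 4, 5]      # mirror image around the mean
right_tailed = [1, 1, 1, 2, 10]  # one large value stretches the right tail

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_tailed)
kurt_symmetric = kurtosis(symmetric)
```

The symmetric sample has zero skewness, the right-tailed sample has positive skewness, and the flat five-point sample has negative kurtosis (flatter than a normal distribution).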
Relationship Between Dispersion
and Shape Properties
Technology Insights 3.1 –
Descriptive Statistics in Excel
Technology Insights 3.1 – Descriptive
Statistics in Excel: Creating a Box-Plot
in Microsoft Excel
Application Case 3.3
Town of Cary Uses Analytics to Analyze Data
from Sensors, Assess Demand, and Detect
Problems
Questions for Discussion:
1. What were the challenges the Town of Cary was facing?
2. What was the proposed solution?
3. What were the results?
4. What other problems and data analytics solutions do you
foresee for towns like Cary?
Regression Modeling for Inferential
Statistics
• Regression
– A part of inferential statistics
– The most widely known and used analytics technique
in statistics
– Used to characterize relationship between explanatory
(input) and response (output) variable
• It can be used for
– Hypothesis testing (explanation)
– Forecasting (prediction)
Regression Modeling (1 of 3)
• Correlation versus Regression
– What is the difference (or relationship)?
• Simple Regression versus Multiple Regression
– Based on the number of input variables
• How do we develop linear regression models?
– Scatter plots (visualization—for simple regression)
– Ordinary least squares method
– A line that minimizes the sum of the squared errors
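The link between correlation and simple regression, and the ordinary least squares fit itself, can be illustrated with the closed-form formulas for a single input variable. This is a sketch with hypothetical data and function names: the slope is S_xy / S_xx and the intercept is the mean of y minus the slope times the mean of x.

```python
import math

def pearson_r(xs, ys):
    """Correlation: strength and direction of the linear association (unitless)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ols_fit(xs, ys):
    """Ordinary least squares: the line minimizing the sum of squared errors."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical data: advertising spend (x) and sales (y).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(xs, ys)
intercept, slope = ols_fit(xs, ys)
```

Correlation only says how tightly the points follow a line; regression also gives the line, so it can be used to predict y for a new x.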
Regression Modeling (2 of 3)
Regression Modeling (3 of 3)
• x: input, y: output
• Simple Linear Regression
$y = \beta_0 + \beta_1 x$
• Multiple Linear Regression
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n$
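Once the coefficients are known, evaluating a multiple linear regression equation is a weighted sum of the inputs plus the intercept. The sketch below assumes a hypothetical, already-fitted model of house price; the coefficient values and variable meanings are invented for illustration only.

```python
def predict(intercept, coefficients, inputs):
    """Evaluate y = b0 + b1*x1 + b2*x2 + ... + bn*xn."""
    return intercept + sum(b * x for b, x in zip(coefficients, inputs))

# Hypothetical fitted model: price (in thousands) from size (sq m) and age (years).
beta_0 = 50.0
betas = [2.0, -0.5]  # +2.0 per square meter, -0.5 per year of age

price = predict(beta_0, betas, [100, 20])  # 50 + 2.0*100 - 0.5*20
```

With one coefficient the same function reduces to the simple linear regression equation.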
Process of Developing a
Regression Model
Regression Modeling Assumptions
• Linearity
• Independence
• Normality (Normal Distribution)
• Constant Variance
• No multicollinearity (explanatory variables are not strongly correlated with one another)
• What happens if the assumptions do NOT hold?
– What do we do then?
Logistic Regression Modeling (1 of 2)
• A very popular statistics-based classification algorithm
• Employs supervised learning
• Developed in the 1940s
• The difference between Linear Regression and Logistic
Regression
– In Logistic Regression, the output/target variable is a
binomial (binary classification) variable (as opposed
to a numeric variable)
Logistic Regression Modeling (2 of 2)
$f(y) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$
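The logistic function maps any value of β0 + β1x to a number between 0 and 1, which is then read as the probability of the positive class. The sketch below uses hypothetical coefficients and an assumed decision threshold of 0.5; neither comes from the slide.

```python
import math

def logistic(x, beta_0, beta_1):
    """f = 1 / (1 + e^-(b0 + b1*x)); the result always lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(beta_0 + beta_1 * x)))

def classify(x, beta_0, beta_1, threshold=0.5):
    """Turn the probability into a binary class label (threshold is an assumption)."""
    return 1 if logistic(x, beta_0, beta_1) >= threshold else 0

# Hypothetical coefficients: the probability crosses 0.5 where b0 + b1*x = 0, at x = 2.
b0, b1 = -4.0, 2.0

p_low = logistic(0.0, b0, b1)   # well below 0.5
p_mid = logistic(2.0, b0, b1)   # exactly 0.5
p_high = logistic(4.0, b0, b1)  # well above 0.5
label = classify(4.0, b0, b1)
```

Unlike linear regression, the output never leaves the 0 to 1 range, which is what makes it usable for binary classification.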
Application Case 3.4 (1 of 4)
Predicting NCAA Bowl Game Outcomes
Application Case 3.4 (2 of 4)
Predicting NCAA Bowl Game Outcomes
• The analytics process to develop prediction models (both
regression and classification type) for NCAA Bowl Game
outcomes
Application Case 3.4 (3 of 4)
Predicting NCAA Bowl Game Outcomes
Prediction Results
1. Classification
2. Regression
Application Case 3.4 (4 of 4)
Predicting NCAA Bowl Game Outcomes
Time Series Forecasting
• Is it different than Simple Linear Regression? How?
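One way to see the contrast is that a time series forecast typically uses only past values of the same variable, rather than separate explanatory variables. The sketch below uses a simple moving average as one illustrative technique; the slide does not name a method, and the demand figures and window size are hypothetical.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the averaging window")
    return sum(series[-window:]) / window

# Hypothetical monthly demand; no input variable other than the series itself.
demand = [100, 104, 98, 107, 111, 109]

next_period = moving_average_forecast(demand, window=3)  # mean of 107, 111, 109
```

A regression model would instead need values of one or more input variables for the period being predicted.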
Business Reporting Definitions and
Concepts
• Report = Information → Decision
• Report?
– Any communication artifact prepared to convey
specific information
• A report can fulfill many functions
– To ensure proper departmental functioning
– To provide information
– To provide the results of an analysis
– To persuade others to act
– To create an organizational memory…
What is a Business Report?
• A written document that contains information regarding
business matters.
• Purpose: to improve managerial decisions
• Source: data from inside and outside the organization (via
the use of ETL)
• Format: text + tables + graphs/charts
• Distribution: in-print, email, portal/intranet
Data acquisition → Information generation → Decision
making → Process management
Business Reporting
Types of Business Reports
• Metric Management Reports
– Help manage business performance through metrics
(SLAs for externals; KPIs for internals)
– Can be used as part of Six Sigma and/or TQM
• Dashboard-Type Reports
– Graphical presentation of several performance
indicators in a single page using dials/gauges
• Balanced Scorecard–Type Reports
– Include financial, customer, business process, and
learning & growth indicators
Application Case 3.5
Flood of Paper Ends at FEMA
Data Visualization
“The use of visual representations to explore, make sense
of, and communicate data.”
• Data visualization vs. Information visualization
• Information = aggregation, summarization, and
contextualization of data
• Related to information graphics, scientific visualization,
and statistical graphics
• Often includes charts, graphs, illustrations, …
A Brief History of Data Visualization
• Data visualization can date back to the second century
AD
• Most developments have occurred in the last two and a
half centuries
• Until recently it was not recognized as a discipline
• Today’s most popular visual forms date back a few
centuries
The First Pie Chart Created by William Playfair in 1801
Different Types of Charts and Graphs
Source: gapminder.org.
The Emergence of Data Visualization
And Visual Analytics (1 of 2)
Figure 3.23 Magic Quadrant for Business Intelligence and Analytics Platforms.
• Magic Quadrant for Business
Intelligence and Analytics
Platforms (Source: Gartner.com)
• Many data visualization companies
are in the 4th quadrant
• There is a move towards
visualization
Visual Analytics
• A recently coined term
– Information visualization + predictive analytics
• Information visualization
– Descriptive, backward focused
– “what happened” “what is happening”
• Predictive analytics
– Predictive, future focused
– “what will happen” “why will it happen”
• There is a strong move toward visual analytics
Visual Analytics by SAS Institute
(1 of 2)
Figure 3.25 An Overview of SAS Visual Analytics Architecture.
Performance Dashboards (2 of 4)
Figure 3.27 A Sample Executive Dashboard.
Performance Dashboards (3 of 4)
• Dashboard design
– The fundamental challenge of dashboard design is to
display all the required information on a single screen,
clearly and without distraction, in a manner that can be
assimilated quickly
• Three layers of information
– Monitoring
– Analysis
– Management
Performance Dashboards (4 of 4)
• What to look for in a dashboard
– Use of visual components to highlight data and
exceptions that require action.
– Transparent to the user, meaning that they require
minimal training and are extremely easy to use
– Combine data from a variety of systems into a single,
summarized, unified view of the business
– Enable drill-down or drill-through to underlying data
sources or reports
– Present a dynamic, real-world view with timely data
– Require little coding to implement, deploy, and
maintain
Best Practices in Dashboard Design
• Benchmark KPIs with Industry Standards
• Wrap the Metrics with Contextual Metadata
• Validate the Design by a Usability Specialist
• Prioritize and Rank Alerts and Exceptions
• Enrich Dashboard with Business-User Comments
• Present Information in Three Different Levels
• Pick the Right Visual Constructs
• Provide for Guided Analytics
Application Case 3.8
Visual Analytics Helps Energy Supplier Make
Better Connections
Questions for Discussion:
1. Why do you think energy supply companies are among
the prime users of information visualization tools?
2. How did Electrabel use information visualization for the
single version of the truth?
3. What were their challenges, the proposed solution, and
the obtained results?
End of Chapter 3
• Questions / Comments