Introduction To Data Mining For Business Analytics

This document provides an introduction to data mining in business analytics. It discusses key concepts like business intelligence, the data mining process, common data mining techniques, and how data mining informs business analytics. The six steps of the data mining process are outlined as business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Classification, prediction, association rules, and predictive analytics are presented as core data mining techniques.

Uploaded by

Sherwin Lopez

MODULE 1

INTRODUCTION TO DATA MINING

IN BUSINESS ANALYTICS
At the end of the topic, the learner should be able to:
• Learn and understand the importance of data mining in
business analytics
• Learn the different terminologies in data mining for business
analytics
• Understand the reasons why there are so many different
methods in data mining
• Identify and understand the different steps in data mining
• Understand the modeling process in dealing with data mining
• Business Analytics is the practice and art of bringing
quantitative data to bear on decision-making. The term
means different things to different organizations.
• Business Analytics, or more generically, analytics, includes
a range of data analysis methods. Many powerful
applications involve little more than counting, rule-
checking, and basic arithmetic.
• Business Intelligence (BI) refers to data visualization and
reporting for understanding “what happened and what is
happening.”
• BI, which earlier consisted mainly of generating static
reports, has evolved into more user-friendly and effective
tools and practices, such as creating interactive
dashboards that allow the user not only to access
real-time data, but also to interact with it directly.
Beware the organizational setting where
analytics is a solution in search of a problem
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
The first step to successful data mining is to understand the
overall objectives of the business, then be able to convert this
into a data mining problem and a plan. Without an
understanding of the ultimate goal of the business, you won’t
be able to design a good data mining algorithm.
After you know what the business is looking for, it’s time to
collect data. There are many complex ways that data can be
obtained from an organization, organized, stored, and
managed. Data understanding involves getting familiar with
the data, identifying any quality issues, gaining first insights,
and spotting subsets worth a closer look.
Data preparation involves getting the information
production-ready. This is usually the biggest part of data
mining. It means taking the computer-language data and
converting it into a form that people can understand and
quantify.
In the modeling phase, mathematical models are used to
search for patterns in the data. There are usually several
techniques that can be used for the same set of data. There
is a lot of trial and error involved in modeling.
When the model is complete, it needs to be carefully
evaluated and the steps to make the model need to be
reviewed, to ensure it meets the business objectives. At the
end of this phase, a decision about the data mining results
will be made.
Deployment can be a simple or complex part of data mining,
depending on the output of the process. It can be as simple
as generating a report, or as complex as setting up a
repeatable data mining process that runs regularly.
How does data mining inform business
analytics?
• Classification. This data mining technique is more complex, using attributes of
data to sort records into discernible categories, helping you draw further
conclusions.
• Clustering. This technique is very similar to classification, grouping data
together based on their similarities. Cluster groups are less structured than
classification groups, making clustering a simpler option for data mining.
• Association Rules. Association in data mining is all about tracking patterns,
specifically patterns based on linked variables.
• Regression Analysis. Regression is used to plan and model, estimating the
value or likelihood of a specific variable given the other variables.
• Anomaly/outlier detection. For many data mining cases, just seeing the
overarching pattern might not be all you need. You also need to be able to identify
and understand the outliers in your data.
• DataMelt. Performs mathematics, statistics, calculations, data analysis, and
visualization. Many scripting languages and Java packages are available in this
system.
• ELKI Data Mining Framework. Focuses on algorithms, with a specific emphasis
on unsupervised methods for cluster analysis and outlier detection. ELKI is
designed to be easy for researchers, students, and business organizations to use.
• Orange Data Mining. Helps organizations do simple data analysis and use top
visualization and graphics. Heatmaps, hierarchical clustering, decision trees, and
more are used in this process.
• The R Project for Statistical Computing. Used for statistical modeling and
graphics, and available on many operating systems.
• Rattle GUI. Presents statistical and visual summaries of data, helps prepare it to
be modeled, and utilizes supervised and unsupervised machine learning to
present the information.
Big Data is a relative term—data today are big by reference
to the past, and to the methods and devices available to deal
with them.
• Volume
• Velocity
• Variety
• Veracity
Data science is a mix of skills in the areas of statistics,
machine learning, math, programming, business, and IT.
Why are there so many different methods and
techniques of data mining for business
analytics?
Predictive analytics comprises the tasks of classification
and prediction, which are becoming key elements of a
“business intelligence” function in most large firms.
CORE IDEAS OF DATA MINING
• Classification
• Prediction
• Association Rules
• Predictive Analytics
• Data Reduction
• Data Exploration
• Data Visualization
A common task in data mining is to examine data where the
classification is unknown or will occur in the future, with the goal
of predicting what that classification is or will be. Similar data
where the classification is known are used to develop rules,
which are then applied to the data with the unknown
classification.
• The recipient of an offer can respond or not respond.
• An applicant for a loan can repay on time, repay late, or
declare bankruptcy.
• A credit card transaction can be normal or fraudulent.
• A packet of data traveling on a network can be benign or
threatening.
• A bus in a fleet can be available for service or unavailable.
• The victim of an illness can be recovered, still be ill, or be
deceased.
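
The idea of learning from records with a known class and applying the result to a record with an unknown class can be sketched as follows. This is only an illustration: it assumes a nearest-neighbor rule, which is one of many possible classifiers and is not prescribed by the slides. The loan records, the two predictor variables, and the `classify` helper are all hypothetical.

```python
import math

# Hypothetical labeled records: (income in $000s, existing debt in $000s) -> outcome.
known = [
    ((85.0, 5.0), "repay on time"),
    ((60.0, 20.0), "repay late"),
    ((30.0, 40.0), "bankruptcy"),
    ((90.0, 10.0), "repay on time"),
]

def classify(record, labeled):
    """Return the class of the labeled record closest to `record` (1-nearest neighbor)."""
    nearest = min(labeled, key=lambda pair: math.dist(record, pair[0]))
    return nearest[1]

new_applicant = (80.0, 8.0)  # classification unknown
predicted = classify(new_applicant, known)
print(predicted)
```

In practice the predictors would usually be normalized first, so that no single variable dominates the distance.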
Prediction refers to predicting the value of a continuous
variable, such as the amount of a purchase, rather than a class.
(Sometimes in the data mining literature, the term estimation is
used to refer to the prediction of the value of a continuous
variable, and prediction may be used for both continuous and
categorical data.)
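
As a minimal sketch of predicting a continuous variable, the example below fits a straight line by least squares and uses it to predict a new value. The choice of simple linear regression, the variable names, and the toy numbers are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: number of store visits vs. purchase amount.
visits = [1.0, 2.0, 3.0, 4.0]
purchase = [3.0, 5.0, 7.0, 9.0]  # constructed to lie exactly on y = 1 + 2x

intercept, slope = fit_line(visits, purchase)
predicted_amount = intercept + slope * 5.0  # prediction for a new record
print(intercept, slope, predicted_amount)
```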
Association rules, or affinity analysis, describe which items tend to occur together. Once found, they can be used in a variety of ways.
For example:
• Grocery stores can use such information after a customer’s purchases
have all been scanned to print discount coupons, where the items being
discounted are determined by mapping the customer’s purchases onto the
association rules.
• Online merchants such as Amazon.com and Netflix.com use these
methods as the heart of a “recommender” system that suggests new
purchases to customers.
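
A rough sketch of how such rules might be derived from scanned purchases is shown below. It counts how often pairs of items appear together and keeps rules of the form "if A then B" that pass a support and a confidence threshold. The baskets, thresholds, and function names are hypothetical, and only single-item rules are considered.

```python
from itertools import permutations

# Hypothetical market baskets, one set of items per customer visit.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets)

def find_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for single-item rules."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, c in permutations(items, 2):
        both = count({a, c}, baskets)
        support = both / len(baskets)        # share of baskets with both items
        confidence = both / count({a}, baskets)  # share of A-baskets that also hold C
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, c, support, confidence))
    return rules

rules = find_rules(transactions)
for a, c, s, conf in rules:
    print(f"if {a} then {c}: support={s:.2f}, confidence={conf:.2f}")
```

A coupon printer or recommender could then look up the customer's items as antecedents and offer the matching consequents.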
Classification, prediction, and to some extent, affinity analysis
constitute the analytical methods employed in predictive
analytics.
• Sensible data analysis often requires distillation of complex
data into simpler data. Rather than dealing with thousands of
product types, an analyst might wish to group them into a
smaller number of groups.
• This process of consolidating a large number of variables (or
cases) into a smaller set is termed data reduction.
A full understanding of the data may require a reduction in its scale or
dimension to allow us to see the forest without getting lost in the trees.
Similar variables (i.e., variables that supply similar information) might be
aggregated into a single variable incorporating all the similar variables.
Analogously, records might be aggregated into groups of similar records.
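
The grouping of many product types into a few groups can be sketched as a simple aggregation. The product names, the group mapping, and the sales figures below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from detailed product types to broader groups.
product_group = {
    "whole milk": "dairy",
    "skim milk": "dairy",
    "cheddar": "dairy",
    "rye bread": "bakery",
    "bagel": "bakery",
}

# Hypothetical sales records: (product type, sales amount).
sales = [
    ("whole milk", 120.0),
    ("skim milk", 80.0),
    ("cheddar", 50.0),
    ("rye bread", 40.0),
    ("bagel", 60.0),
]

def reduce_by_group(records, mapping):
    """Aggregate sales over product groups instead of individual product types."""
    totals = defaultdict(float)
    for product, amount in records:
        totals[mapping[product]] += amount
    return dict(totals)

group_sales = reduce_by_group(sales, product_group)
print(group_sales)
```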

Example:
• An essential part of the job is to review and examine the data to see what
messages they hold, much as a detective might survey a crime scene.
Another technique for exploring data to see what information they hold is
through graphical analysis. This includes looking at each variable separately
as well as looking at relationships between variables.
For numerical variables, we use histograms and boxplots to learn about
the distribution of their values, to detect outliers (extreme observations), and
to find other information that is relevant to the analysis task.
Similarly, for categorical variables we use bar charts. We can also look at
scatterplots of pairs of numerical variables to learn about possible
relationships, the type of relationship, and again, to detect outliers.
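
The outlier detection that a boxplot does visually can also be sketched numerically. The example below assumes the common convention of flagging values more than 1.5 times the interquartile range beyond the quartiles; the purchase amounts and the function name are hypothetical.

```python
import statistics

def boxplot_outliers(values):
    """Flag values beyond 1.5 * IQR from the quartiles (a common boxplot whisker rule)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical purchase amounts with one extreme observation.
amounts = [12, 15, 14, 13, 16, 15, 14, 95]
outliers = boxplot_outliers(amounts)
print(outliers)
```

Whether a flagged value is an error or a genuinely unusual record is a judgment the analyst still has to make.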
SUPERVISED AND
UNSUPERVISED LEARNING
Supervised learning algorithms are those used when
the value of the outcome of interest (e.g., purchase or no
purchase) is known.

Training data | Validation data | Test data
Training data are the data from which the classification or
prediction algorithm "learns," or is "trained," about the
relationship between predictor variables and the outcome
variable.
Once the algorithm has learned from the training data, it is
then applied to another sample of data (the validation data)
where the outcome is known, to see how well it does in
comparison to other models.
If many different models are being tried out, it is prudent to
save a third sample of known outcomes (the test data) to
use with the model finally selected to predict how well it will
do.
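
A minimal sketch of creating the three partitions by random assignment is given below. The 50/30/20 proportions, the fixed seed, and the function name are assumptions chosen for the example, not values taken from the slides.

```python
import random

def partition(records, train=0.5, valid=0.3, seed=1):
    """Randomly split records into training, validation, and test partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # whatever remains becomes the test partition
    )

records = list(range(100))  # stand-in for 100 records with known outcomes
train_set, valid_set, test_set = partition(records)
print(len(train_set), len(valid_set), len(test_set))
```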

Simple Linear Regression Analysis

Unsupervised learning algorithms are those used where
there is no outcome variable to predict or classify. Hence,
there is no "learning" from cases where such an outcome
variable is known.

Association Rules
Dimension Reduction Methods
Clustering Techniques
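
As an illustration of learning without an outcome variable, the sketch below groups records with a plain k-means procedure. The choice of k-means, the naive initialization from the first k points, and the customer data are assumptions made for this example.

```python
import math

def k_means(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    centers = list(points[:k])  # naive initialization: the first k points (an assumption)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = coordinate-wise mean; keep the old center if a cluster is empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customers: (visits per month, average basket in $).
customers = [(1, 10), (9, 82), (2, 12), (1, 14), (10, 80), (8, 78)]
centers, clusters = k_means(customers, k=2)
print(centers)
print(clusters)
```

No record carries a label here; the two groups emerge only from the similarity of the records.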
Some of the most serious errors in data analysis
result from a poor understanding of the problem
STEPS IN DATA MINING
• Develop an understanding of the purpose of the data mining project
• Obtain the dataset to be used in the analysis.
• Explore, clean, and preprocess the data.
• Reduce the data, if necessary, and (where supervised training is
involved) separate them into training, validation, and test datasets.
• Determine the data mining task (classification, prediction, clustering,
etc.).
• Choose the data mining techniques to be used (regression, neural
nets, hierarchical clustering, etc.).
1. Develop an understanding of the purpose of the data
mining project (if it is a one-shot effort to answer a
question or questions) or application (if it is an ongoing
procedure)
2. Obtain the dataset to be used in the analysis. This
often involves random sampling from a large database to
capture records to be used in analysis.
3. Explore, clean, and preprocess the data. This involves
verifying that the data are in reasonable condition.
• How should missing data be handled?
• Are the values in a reasonable range, given what you would expect for each variable?
• Are there obvious outliers?
4. Reduce the data, if necessary, and (where supervised
training is involved) separate them into training, validation,
and test datasets. This can involve operations such as
eliminating unneeded variables, transforming variables, and
creating new variables.
5. Determine the data mining task. This involves
translating the general question or problem of step 1
into a more specific statistical question.
6. Choose the data mining techniques to be used
(regression, neural nets, hierarchical clustering, etc.).
7. Use algorithms to perform the task. This is typically
an iterative process: trying multiple variants, and often
using multiple variants of the same algorithm.
8. Interpret the results of the algorithms. This involves
making a choice as to the best algorithm to deploy,
and where possible, testing the final choice on the test
data to get an idea as to how well it will perform.
9. Deploy the model. This involves integrating the model
into operational systems and running it on real records
to produce decisions or actions.
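
Steps 7 and 8 can be sketched as a small loop: fit several candidate models on the training data, pick the one with the lowest validation error, and only then measure it on the test data. The two candidates (a mean-only baseline and a least-squares line), the error measure (RMSE), and the data are hypothetical choices for illustration.

```python
import math

def fit_mean(train):
    """Baseline model: always predict the mean outcome of the training data."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Simple linear regression fitted by least squares."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    return lambda x: my + slope * (x - mx)

def rmse(model, data):
    """Root mean squared prediction error of `model` on (x, y) records."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in data) / len(data))

# Hypothetical partitions of (predictor, outcome) records.
train_data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
valid_data = [(5, 10.1), (6, 11.8)]
test_data = [(7, 14.2), (8, 15.9)]

candidates = {"mean only": fit_mean, "linear regression": fit_line}
fitted = {name: fit(train_data) for name, fit in candidates.items()}

# Choose on the validation data, then assess the chosen model once on the test data.
best_name = min(fitted, key=lambda name: rmse(fitted[name], valid_data))
test_error = rmse(fitted[best_name], test_data)
print(best_name, round(test_error, 3))
```

Keeping the test data out of the selection step is what makes the final error estimate trustworthy.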
PRELIMINARY STEPS
• Organization of database
• Sampling from a database
• Oversampling rare events
• Preprocessing and cleaning the data
• Types of variables
• Handling categorical variables
• Variable selection
• Overfitting
• How many variables and how much data?
• Outliers
• Missing Values
• Normalizing the data
• Use and creation of partitions
• Training partition
• Validation partition
• Test partition
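
A few of the preliminary steps listed above can be sketched in code: checking a variable for missing and out-of-range values, normalizing a numerical variable, and turning a categorical variable into dummy (0/1) variables. The helper names, the z-score form of normalization, and the sample values are assumptions for illustration.

```python
import statistics

def check_variable(values, low, high):
    """Count missing entries and list values outside the expected [low, high] range."""
    missing = sum(v is None for v in values)
    out_of_range = [v for v in values if v is not None and not (low <= v <= high)]
    return {"missing": missing, "out_of_range": out_of_range}

def z_scores(values):
    """Normalize by subtracting the mean and dividing by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def dummy_variables(values):
    """Turn one categorical variable into 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

# Hypothetical raw variables.
ages = [34, None, 27, 430, 51, -2, None]
incomes = [30.0, 50.0, 70.0]
regions = ["north", "south", "north"]

report = check_variable(ages, low=0, high=120)
normalized = z_scores(incomes)
dummies = dummy_variables(regions)
print(report)
print(normalized)
print(dummies)
```

For models such as linear regression, one of the dummy columns is usually dropped to avoid redundancy.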
• Shmueli G., et al. Data Mining for Business Intelligence Concepts,
Techniques, and Applications in Microsoft Office Excel with XLMiner 2nd
Ed. A John Wiley & Sons, Inc. Publication
• Bruce P., et al. Data Mining for Business Analytics Concepts, Techniques
and Applications. John Wiley & Sons, Inc. 2020