Introduction To Data Mining For Business Analytics
Example:
• An essential part of the job is to review and examine the data to see what
messages they hold, much as a detective might survey a crime scene.
Another technique for exploring data to see what information they hold is
through graphical analysis. This includes looking at each variable separately
as well as looking at relationships between variables.
For numerical variables, we use histograms and boxplots to learn about
the distribution of their values, to detect outliers (extreme observations), and
to find other information that is relevant to the analysis task.
Similarly, for categorical variables we use bar charts. We can also look at
scatterplots of pairs of numerical variables to learn about possible
relationships, the type of relationship, and again, to detect outliers.
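As a rough illustration of what a boxplot and a histogram summarize, the sketch below computes the quartiles, the 1.5 × IQR fences commonly used to flag outliers, and simple bin counts. The function names and the purchase amounts are hypothetical, and the 1.5 × IQR rule is one common convention, not the only way to define an outlier.

```python
import statistics

def boxplot_summary(values):
    """Quartiles plus the 1.5*IQR fences that a typical boxplot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Values outside the fences are the points a boxplot marks individually.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "lower_fence": lower_fence, "upper_fence": upper_fence,
            "outliers": outliers}

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical purchase amounts; one value is far from the rest.
amounts = [12, 15, 14, 18, 21, 19, 16, 13, 17, 95]

summary = boxplot_summary(amounts)
bins = histogram_counts(amounts, bin_width=10)

print("Flagged as outliers:", summary["outliers"])
for start, count in bins.items():
    print(f"{start:>3}-{start + 9:<3} {'*' * count}")
```

A plotting library would normally draw these charts directly; the point here is only to show the numbers behind them.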
SUPERVISED AND UNSUPERVISED LEARNING
Supervised learning algorithms are those used when
the value of the outcome of interest (e.g., purchase or no
purchase) is known.
Data partitions: training data, validation data, test data
Training data are the data from which the classification or
prediction algorithm "learns," or is "trained," about the
relationship between predictor variables and the outcome
variable.
Once the algorithm has learned from the training data, it is
then applied to another sample of data (the validation data)
where the outcome is known, to see how well it does in
comparison to other models.
If many different models are being tried out, it is prudent to
save a third sample of known outcomes (the test data) to
use with the model finally selected to predict how well it will
do.
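The three-way split described above can be sketched as a random partition of the records. This is a minimal illustration, not a prescribed procedure: the function name `partition`, the 50/30/20 proportions, and the seed are assumptions chosen for the example.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split records into training, validation, and test lists.

    Whatever is left after the training and validation shares
    becomes the test partition.
    """
    if not 0 < train_frac + valid_frac <= 1:
        raise ValueError("train_frac + valid_frac must be in (0, 1]")
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Hypothetical dataset: 100 record identifiers.
records = list(range(100))
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

Every record lands in exactly one partition, so the test data stay unseen until the final model has been chosen.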
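Unsupervised learning methods, used when there is no outcome variable to predict or classify, include:
• Association rules
• Dimension reduction methods
• Clustering techniques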
Some of the most serious errors in data analysis
result from a poor understanding of the problem.
STEPS IN DATA MINING
• Develop an understanding of the purpose of the data mining project.
• Obtain the dataset to be used in the analysis.
• Explore, clean, and preprocess the data.
• Reduce the data, if necessary, and (where supervised training is
involved) separate them into training, validation, and test datasets.
• Determine the data mining task (classification, prediction, clustering,
etc.).
• Choose the data mining techniques to be used (regression, neural
nets, hierarchical clustering, etc.).
1. Develop an understanding of the purpose of the data
mining project (if it is a one-shot effort to answer a
question or questions) or application (if it is an ongoing
procedure).
2. Obtain the dataset to be used in the analysis. This
often involves random sampling from a large database to
capture records to be used in analysis.
3. Explore, clean, and preprocess the data. This involves
verifying that the data are in reasonable condition.
• How should missing data be handled?
• Are the values in a reasonable range, given what you would expect for each variable?
• Are there obvious outliers?
4. Reduce the data, if necessary, and (where supervised
training is involved) separate them into training, validation,
and test datasets. This can involve operations such as
eliminating unneeded variables, transforming variables, and
creating new variables.
5. Determine the data mining task. This involves
translating the general question or problem of step 1
into a more specific statistical question.
6. Choose the data mining techniques to be used
(regression, neural nets, hierarchical clustering, etc.).
7. Use algorithms to perform the task. This is typically
an iterative process: trying multiple variants, and often
using multiple variants of the same algorithm.
8. Interpret the results of the algorithms. This involves
making a choice as to the best algorithm to deploy,
and where possible, testing the final choice on the test
data to get an idea as to how well it will perform.
9. Deploy the model. This involves integrating the model
into operational systems and running it on real records
to produce decisions or actions.
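Steps 7 and 8 above can be sketched with a toy prediction task: fit several candidate models on the training data, compare them on the validation data, and then score only the chosen model on the test data. Everything here is an assumption made for illustration: the synthetic data, the two candidate models (a mean-only baseline and a least-squares straight line), and the use of RMSE as the performance measure.

```python
import math
import random

def fit_mean(train):
    """Baseline model: always predict the mean outcome of the training data."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Simple least-squares straight line fitted to (x, y) pairs."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def rmse(model, data):
    """Root mean squared error of a model's predictions on (x, y) pairs."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in data) / len(data))

# Synthetic data (hypothetical): y depends linearly on x, plus random noise.
rng = random.Random(0)
data = [(x, 3.0 + 2.0 * x + rng.gauss(0, 1)) for x in range(60)]
rng.shuffle(data)
train, valid, test = data[:30], data[30:48], data[48:]

# Step 7: fit each candidate on the training partition only.
candidates = {"mean only": fit_mean, "straight line": fit_line}
fitted = {name: fit(train) for name, fit in candidates.items()}

# Step 8: compare on validation data, then score the winner once on test data.
valid_rmse = {name: rmse(model, valid) for name, model in fitted.items()}
best_name = min(valid_rmse, key=valid_rmse.get)
test_rmse = rmse(fitted[best_name], test)

print("Validation RMSE:", valid_rmse)
print("Selected model:", best_name, "| test RMSE:", round(test_rmse, 3))
```

Because the validation data were used to pick the winner, the test RMSE is the less biased estimate of how the selected model will perform on new records.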
PRELIMINARY STEPS
• Organization of database
• Sampling from a database
• Oversampling rare events
• Preprocessing and cleaning the data
• Types of variables
• Handling categorical variables
• Variable selection
• Overfitting
• How many variables and how much data?
• Outliers
• Missing Values
• Normalizing the data
• Use and creation of partitions
• Training partition
• Validation partition
• Test partition
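Two of the preliminary steps listed above, handling missing values and normalizing the data, can be sketched as follows. Replacing missing values with the median and rescaling to z-scores are common choices but only one option each; the function names and the income figures are hypothetical.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def z_scores(values):
    """Normalize values: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

# Hypothetical incomes (in thousands) with two missing entries.
incomes = [40, None, 60, 50, None, 50]

filled = impute_median(incomes)
standardized = z_scores(filled)

print("After imputation:", filled)
print("After normalizing:", [round(z, 2) for z in standardized])
```

After normalizing, the variable has mean 0 and standard deviation 1, so variables measured on different scales can be compared on equal terms.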
• Shmueli G., et al. Data Mining for Business Intelligence Concepts,
Techniques, and Applications in Microsoft Office Excel with XLMiner 2nd
Ed. A John Wiley & Sons, Inc. Publication
• Bruce P., et al. Data Mining for Business Analytics Concepts, Techniques
and Applications. John Wiley & Sons, Inc. 2020