IBA - MODULE 4.3


MODULE 4.3 Data Mining Process

Data Mining Process: Models, Process Steps & Challenges Involved

Data Mining, which is also known as Knowledge Discovery in Databases (KDD), is the
process of discovering useful information in large volumes of data stored in databases
and data warehouses. Companies use this analysis to support decision-making.

Data mining is carried out using techniques such as clustering, association analysis,
sequential pattern analysis, and decision trees.

What Is Data Mining?

Data Mining is a process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, the web,
and other information repositories or data that are streamed into the system dynamically.

Why Do Businesses Need Data Extraction?

With the advent of Big Data, data mining has become more prevalent. Big data refers to
extremely large data sets that can be analyzed by computers to reveal patterns,
associations, and trends that humans can then interpret. Big data holds extensive
information of varied types and varied content.

With this amount of data, simple statistics with manual intervention no longer work.
The data mining process fills this need, which has driven a shift from simple data
statistics to complex data mining algorithms.

The data mining process extracts relevant information from raw data such as
transactions, photos, videos, and flat files, and automatically processes that
information to generate reports that businesses can act on.

Thus, the data mining process is crucial for businesses to make better decisions by
discovering patterns and trends in data, summarizing the data, and extracting relevant
information.

Data Extraction As A Process

To solve a business problem, the raw data is examined to build a model that describes
the information and produces the reports the business will use. Building a model from
data sources and data formats is an iterative process because the raw data comes from
many different sources and in many forms.

Data grows day by day, so a newly found data source can change the results.

Below is the outline of the process.

Data Mining Models

Many industries, such as manufacturing, marketing, chemical, and aerospace, are taking
advantage of data mining. As a result, the demand for standard and reliable data mining
processes has increased drastically.

The important data mining models include:

#1) Cross-Industry Standard Process for Data Mining (CRISP-DM)

CRISP-DM is a reliable data mining model consisting of six phases. It is a cyclical
process that provides a structured approach to the data mining process. The six phases
do not have to follow a rigid order; the process often requires backtracking to
previous steps and repeating actions.

The six phases of CRISP-DM include:

#1) Business Understanding: In this step, the goals of the businesses are set and the
important factors that will help in achieving the goal are discovered.

#2) Data Understanding: In this step, all of the data is collected and loaded into
the tool (if a tool is used). The data is listed with its source, location, how it was
acquired, and any issues encountered. Data is visualized and queried to check its
completeness.

#3) Data Preparation: This step involves selecting the appropriate data, cleaning it,
constructing attributes from the data, and integrating data from multiple databases.

#4) Modeling: This step covers selecting a data mining technique such as a decision
tree, generating a test design for evaluating the selected model, building models from
the dataset, and assessing the built model with experts to discuss the results.

#5) Evaluation: This step will determine the degree to which the resulting model meets
the business requirements. Evaluation can be done by testing the model on real
applications. The model is reviewed for any mistakes or steps that should be repeated.

#6) Deployment: In this step, a deployment plan is made, a strategy for monitoring and
maintaining the data mining results and checking their usefulness is formed, final
reports are produced, and the whole process is reviewed to catch any mistakes and see
whether any step should be repeated.

#2) SEMMA (Sample, Explore, Modify, Model, Assess)

SEMMA is another data mining methodology, developed by SAS Institute. The acronym
stands for Sample, Explore, Modify, Model, Assess.

SEMMA makes it easy to apply exploratory statistical and visualization techniques, select
and transform the significant predictor variables, create a model from those variables,
and check the model's accuracy. SEMMA is also driven by a highly iterative cycle.

Steps in SEMMA

1. Sample: In this step, a large dataset is extracted and a sample that represents
the full data is taken from it. Sampling reduces computational cost and
processing time.
2. Explore: The data is explored for outliers and anomalies to gain a better
understanding of it. The data is checked visually to find trends
and groupings.
3. Modify: In this step, the data is manipulated, for example by grouping and
subgrouping, with the model to be built kept in focus.
4. Model: Based on the explorations and modifications, models that explain
the patterns in the data are constructed.
5. Assess: The usefulness and reliability of the constructed model are assessed
in this step. The model is tested against real data here.
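
As a rough illustration of the Sample and Assess steps, the sketch below draws a reproducible random sample and scores predictions against known outcomes. The function names `draw_sample` and `accuracy`, the 10% fraction, and the use of plain accuracy as the assessment measure are assumptions made for this example, not part of SEMMA itself.

```python
import random

def draw_sample(records, fraction, seed=0):
    """Sample step: take a simple random sample of the records.

    A fixed seed keeps the sample reproducible between runs.
    """
    rng = random.Random(seed)
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)

def accuracy(predictions, actuals):
    """Assess step: fraction of predictions that match the known outcomes."""
    matches = sum(p == a for p, a in zip(predictions, actuals))
    return matches / len(actuals)

records = list(range(1000))          # stand-in for a large dataset
sample = draw_sample(records, 0.1)   # 100 records instead of 1000
print(len(sample))
print(accuracy([1, 0, 1, 1], [1, 1, 1, 0]))  # 2 of 4 correct
```

Simple random sampling is only one option; a stratified sample may represent the full data better when some groups are rare.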

Both the SEMMA and CRISP-DM approaches work for the Knowledge Discovery Process.
Once models are built, they are deployed for business and research work.

Steps In The Data Mining Process

The data mining process is divided into two parts: Data Preprocessing and Data
Mining. Data Preprocessing involves data cleaning, data integration, data reduction, and
data transformation. The Data Mining part performs data mining, pattern evaluation, and
knowledge representation.

Why do we preprocess the data?

Many factors determine the usefulness of data, such as accuracy, completeness,
consistency, and timeliness. Data has quality if it satisfies its intended
purpose. Thus preprocessing is crucial in the data mining process. The major steps
involved in data preprocessing are explained below.

#1) Data Cleaning

Data cleaning is the first step in data mining. It matters because dirty data, if used
directly in mining, can confuse the procedures and produce inaccurate results.

Basically, this step involves the removal of noisy or incomplete data from the collection.
Many methods that clean data automatically are available, but they are not robust.

This step carries out the routine cleaning work by:

(i) Fill The Missing Data:

Missing data can be handled by methods such as:

● Ignoring the tuple.
● Filling in the missing value manually.
● Using a measure of central tendency, such as the mean or median.
● Filling in the most probable value.
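
Two of the options above can be sketched in a few lines of Python. This is a minimal illustration that assumes missing values are stored as `None`; the helper names `fill_missing` and `drop_incomplete` are made up for the example.

```python
from statistics import mean, median

def fill_missing(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

def drop_incomplete(rows):
    """'Ignoring the tuple': keep only rows that have no missing field."""
    return [row for row in rows if all(v is not None for v in row)]

ages = [25, None, 31, 40, None, 28]
print(fill_missing(ages, "median"))   # gaps filled with 29.5
print(fill_missing(ages, "mean"))     # gaps filled with 31
print(drop_incomplete([(1, "a"), (None, "b"), (3, "c")]))
```

The median is usually the safer choice when the attribute is skewed, since a few extreme values can pull the mean far from typical values.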

(ii) Remove The Noisy Data: Noisy data is data that contains random error.

Methods to remove noise are:

Binning: Binning methods sort the values and distribute them into buckets or bins.
Smoothing is then performed by consulting the neighboring values within each bin.

Binning supports three kinds of smoothing. In smoothing by bin means, each value in a
bin is replaced by the mean of the bin. In smoothing by bin medians, each value is
replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum
values in the bin are taken as the bin boundaries, and each value is replaced by the
closest boundary value.

● Identifying the Outliers
● Resolving Inconsistencies
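
The three binning-based smoothing methods described above can be sketched as follows. The example assumes equal-depth bins (the same number of values in each bin), and the function names and price values are illustrative only.

```python
from statistics import mean, median

def equal_depth_bins(values, bin_size):
    """Sort the values and split them into bins of bin_size items each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        # Ties go to the lower boundary (an arbitrary choice for this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Larger bins smooth more aggressively, so the bin size controls how much local detail is traded for noise reduction.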

#2) Data Integration

When multiple heterogeneous data sources such as databases, data cubes or files are
combined for analysis, this process is called data integration. This can help in improving
the accuracy and speed of the data mining process.

Different databases have different naming conventions for variables, which causes
redundancies in the integrated data. Additional data cleaning can be performed to remove
the redundancies and inconsistencies introduced by data integration without affecting
the reliability of the data.

Data integration can be performed using data migration tools such as Oracle Data
Service Integrator and Microsoft SQL.
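
To make the naming-convention problem concrete, here is a small sketch that maps two sources onto one shared schema and merges records that describe the same customer. The source names, field names, and records are all hypothetical, and matching on an exact shared ID is an assumption; real sources often need fuzzier matching.

```python
# Hypothetical records from two sources that name the same fields differently.
crm_rows = [
    {"cust_id": 1, "full_name": "Ann Lee"},
    {"cust_id": 2, "full_name": "Bo Chen"},
]
billing_rows = [
    {"CustomerID": 2, "Name": "Bo Chen", "Balance": 120.0},
    {"CustomerID": 3, "Name": "Cy Diaz", "Balance": 75.5},
]

# Per-source mapping from local field names to the shared schema.
CRM_MAP = {"cust_id": "customer_id", "full_name": "name"}
BILLING_MAP = {"CustomerID": "customer_id", "Name": "name", "Balance": "balance"}

def rename(row, mapping):
    """Translate one row into the shared schema, dropping unmapped fields."""
    return {mapping[k]: v for k, v in row.items() if k in mapping}

def integrate(sources):
    """Merge (rows, mapping) pairs into one record per customer_id."""
    merged = {}
    for rows, mapping in sources:
        for row in rows:
            record = rename(row, mapping)
            # Records with the same key are combined, which removes the
            # redundancy of one customer appearing in several sources.
            merged.setdefault(record["customer_id"], {}).update(record)
    return [merged[key] for key in sorted(merged)]

customers = integrate([(crm_rows, CRM_MAP), (billing_rows, BILLING_MAP)])
for customer in customers:
    print(customer)
```

When two sources disagree on a value, this sketch lets the later source win; a real pipeline would need an explicit rule for resolving such inconsistencies.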

#3) Data Reduction

This technique is applied to obtain the data relevant for analysis from the full
collection. The reduced representation is much smaller in volume while maintaining the
integrity of the original data. Methods such as Naive Bayes, decision trees, and neural
networks are used to support data reduction.

Some strategies of data reduction are:

● Dimensionality Reduction: Reducing the number of attributes in the dataset.
● Numerosity Reduction: Replacing the original data volume by smaller forms
of data representation.
● Data Compression: Compressed representation of the original data.
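
The first two strategies can be illustrated with a short sketch. It assumes a low-variance filter as the dimensionality reduction technique and an equal-width histogram as the numerosity reduction technique; both are simple stand-ins chosen for the example, and the function names are hypothetical.

```python
from collections import Counter
from statistics import pvariance

def drop_low_variance(rows, threshold=0.0):
    """Dimensionality reduction: keep columns whose variance exceeds threshold.

    Returns the reduced rows and the indexes of the columns that were kept.
    """
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows], keep

def histogram(values, width):
    """Numerosity reduction: store a count per equal-width bucket
    instead of the raw values. Keys are the bucket lower bounds."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

rows = [[1, 5, 10], [2, 5, 20], [3, 5, 30]]
reduced, kept = drop_low_variance(rows)
print(kept)      # [0, 2]  (the constant middle column is dropped)
print(reduced)   # [[1, 10], [2, 20], [3, 30]]

print(histogram([1, 3, 12, 15, 18, 27], 10))   # {0: 2, 10: 3, 20: 1}
```

A constant column carries no information for distinguishing records, which is why it can be removed without hurting the analysis.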

#4) Data Transformation

In this process, data is transformed into a form suitable for the data mining process. Data
is consolidated so that the mining process is more efficient and the patterns are easier to
understand. Data transformation involves data mapping and a code generation process.

Strategies for data transformation are:

● Smoothing: Removing noise from data using clustering, regression
techniques, etc.
● Aggregation: Summary operations are applied to data.
● Normalization: Scaling of data to fall within a smaller range.
● Discretization: Raw values of numeric data are replaced by intervals. For
example, age values can be replaced by age ranges.
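
Normalization and discretization from the list above can be sketched as follows. Min-max scaling is assumed as the normalization method, and the age ranges are arbitrary cut points picked for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to scale.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

def discretize_age(age):
    """Replace a raw age with an interval label (cut points are assumptions)."""
    if age < 18:
        return "0-17"
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65+"

incomes = [20000, 35000, 50000]
print(min_max_normalize(incomes))   # [0.0, 0.5, 1.0]

ages = [12, 25, 47, 70]
print([discretize_age(a) for a in ages])   # ['0-17', '18-39', '40-64', '65+']
```

Normalization keeps attributes with large numeric ranges, such as income, from dominating attributes with small ranges in distance-based methods.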

#5) Data Mining

Data Mining is a process to identify interesting patterns and knowledge from a large
amount of data. In this step, intelligent methods are applied to extract the data
patterns. The data is represented in the form of patterns, and models are structured using
classification and clustering techniques.
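
As one concrete example of a clustering technique, the sketch below runs k-means on one-dimensional data. K-means is chosen here only for illustration; it assumes numeric values and starting centroids supplied by the caller, and the function name `kmeans_1d` is hypothetical.

```python
from statistics import mean

def kmeans_1d(values, centroids, max_iter=100):
    """Group numeric values around the nearest centroid.

    Repeats two steps until the centroids stop moving:
    assign each value to its nearest centroid, then move each
    centroid to the mean of the values assigned to it.
    """
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

values = [1, 2, 3, 20, 21, 22]
centroids, clusters = kmeans_1d(values, centroids=[1, 20])
print(centroids)   # [2, 21]
print(clusters)    # [[1, 2, 3], [20, 21, 22]]
```

The result depends on the starting centroids, so practical implementations usually try several starting points and keep the best grouping.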

#6) Pattern Evaluation

This step involves identifying interesting patterns representing the knowledge based on
interestingness measures. Data summarization and visualization methods are used to
make the data understandable by the user.
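
The text does not name specific interestingness measures. As an assumption for illustration, the sketch below computes three that are commonly used for association rules (support, confidence, and lift) over a handful of made-up shopping transactions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions with the antecedent, the share that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to how common the consequent is overall.
    Values above 1 suggest a positive association, below 1 a negative one."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))         # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))    # roughly 0.67
print(lift(transactions, {"bread"}, {"milk"}))          # roughly 0.89
```

In this toy data the rule "bread implies milk" has a lift below 1, so it would not be reported as interesting even though its confidence looks high.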

#7) Knowledge Representation

Knowledge representation is a step where data visualization and knowledge
representation tools are used to represent the mined data. Data is visualized in the form
of reports, tables, etc.

Data Mining Process In Oracle DBMS

An RDBMS represents data in the form of tables with rows and columns. Data can be
accessed by writing database queries.

Relational database management systems such as Oracle support data mining using
CRISP-DM. The facilities of the Oracle database are useful in data preparation and
understanding. Oracle supports data mining through a Java interface, a PL/SQL interface,
automated data mining, SQL functions, and graphical user interfaces.

Data Mining Process In Data Warehouse

A data warehouse is modeled on a multidimensional data structure called a data cube.
Each cell in a data cube stores the value of some aggregate measure.

Data mining in multidimensional space is carried out in OLAP (Online Analytical
Processing) style, which allows exploration of multiple combinations of dimensions at
varying levels of granularity.
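
The idea of cube cells holding aggregates at different granularities can be sketched with plain dictionaries. The fact rows, the dimension names, and the choice of sum as the aggregate measure are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales)
facts = [
    (2023, "East", "Pen", 100),
    (2023, "East", "Ink", 50),
    (2023, "West", "Pen", 70),
    (2024, "East", "Pen", 120),
]
DIMENSIONS = ("year", "region", "product")

def aggregate(facts, group_by):
    """Sum sales for each combination of the chosen dimensions.

    Each key in the result plays the role of one cube cell.
    """
    idx = [DIMENSIONS.index(d) for d in group_by]
    cells = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        cells[key] += row[-1]
    return dict(cells)

# Finer granularity: one cell per (year, region).
print(aggregate(facts, ("year", "region")))
# {(2023, 'East'): 150, (2023, 'West'): 70, (2024, 'East'): 120}

# Roll up to year only.
print(aggregate(facts, ("year",)))   # {(2023,): 220, (2024,): 120}

# Roll up all dimensions: the grand total.
print(aggregate(facts, ()))          # {(): 340}
```

Dropping a dimension from `group_by` corresponds to a roll-up, and adding one back corresponds to a drill-down.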

What Are The Applications of Data Extraction?

Areas where data mining is widely used include:

#1) Financial Data Analysis: Data mining is widely used in banking, investment, credit
services, mortgage, automobile loans, and insurance and stock investment services. The
data collected from these sources is complete, reliable, and of high quality. This
facilitates systematic data analysis and data mining.

#2) Retail and Telecommunication Industries: The retail sector collects huge amounts of
data on sales, customer shopping history, goods transportation, consumption, and
service. Retail data mining helps to identify customer buying behaviors, shopping
patterns, and trends, and to improve the quality of customer service, customer
retention, and satisfaction.

#3) Science and Engineering: Data mining in computer science and engineering can help
to monitor system status, improve system performance, isolate software bugs, detect
software plagiarism, and recognize system malfunctions.

#4) Intrusion Detection and Prevention: Intrusion is defined as any set of actions that
threaten the integrity, confidentiality, or availability of network resources. Data mining
methods can help intrusion detection and prevention systems enhance their
performance.

#5) Recommender Systems: Recommender systems help consumers by making
product recommendations that are of interest to users.

Data Mining Challenges

Listed below are the various challenges involved in Data Mining.

1. Data mining needs large databases and data collections, which are difficult to
manage.
2. The data mining process requires domain experts, who are also difficult to
find.
3. Integrating data from heterogeneous databases is a complex process.
4. Organizational-level practices need to be modified to use the data mining
results. Restructuring the process requires effort and cost.

Conclusion

Data Mining is an iterative process where the mining process can be refined, and new
data can be integrated to get more efficient results. Data Mining meets the requirement of
effective, scalable and flexible data analysis.

It can be considered a natural evolution of information technology. As a knowledge
discovery process, data preparation and data mining tasks together complete the data
mining process.

Data mining can be performed on any kind of data, from database data to advanced
data types such as time series. The data mining process comes with its
own challenges as well.

SOURCE: https://www.softwaretestinghelp.com/data-mining-process/
