Unit 2 Data Mining Notes


UNIT 2: DATA MINING


What is Data Mining?

Data Mining refers to analyzing large amounts of data to identify relationships and patterns so
that different business problems can be solved. Its tasks are often semi- or fully automated so
that they can operate on large and complex datasets.

The data mining process encompasses several tools and techniques to allow enterprises to
describe data and predict future trends, helping to increase situational awareness and informed
decision-making. It is considered an integral part of data analytics and a core discipline of data
science.

Classification of Data Mining

Data mining is widely applied to big data to predict and characterize data; its core function is to
find patterns and trends. Generally, data mining is categorized as:

1. Descriptive data mining: Similarities and patterns in data may be discovered using
descriptive data mining. Descriptive data mining may also be used to isolate interesting
groupings within the supplied data.

This kind of mining focuses on transforming raw data into information that can be used in
reports and analyses. It provides basic knowledge about the data, for instance, counts and
averages. It gives information about what is happening inside the data without any prior
assumptions and exhibits the common features of the data. In simple words, you get to know the
general properties of the data present in the database.

Srikanth T N, Dept of BCA, SRS FIRST GRADE COLLEGE

2. Predictive data mining: Rather than describing present behaviour, it makes predictions about
the future. It takes advantage of target-prediction capabilities gained via supervised learning.
Classification, time-series analysis, and regression are among the data mining techniques that
fall under this domain.

Functionalities of Data Mining

1. Class/Concept Description: Characterization and Discrimination

2. Classification

3. Prediction

4. Association Analysis

5. Cluster Analysis

6. Outlier Analysis

7. Evolution & Deviation Analysis

Below are all the data mining functionalities with examples, so that you have an in-depth
understanding of how these functionalities are used in the real world to work with data.

1. Class/Concept Description: Characterization and Discrimination

Data is associated with classes or concepts so that it can be correlated with results. For
example, a new iPhone model may be released in three variants, Pro, Pro Max, and Plus, to
target customers according to their requirements.

Data characterization

When you summarize the general features of the data, it is called data characterization. It
produces the characteristic rules for the target class, like our iPhone buyers. We can collect the
data using simple SQL queries and perform OLAP functions to generalize the data.

Attribute-oriented induction is also used to generalize or characterize the data with
minimal user interaction. The generalized data is presented in various forms such as tables, pie
charts, line charts, bar charts, and graphs. The multi-dimensional relationship within the data is
expressed in a rule called the characteristic rule of the target class.
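The characterization step can be sketched with simple SQL aggregates; the `buyers` table, its columns, and all values below are hypothetical, mirroring the iPhone-buyer example:

```python
import sqlite3

# Hypothetical buyers table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buyers (model TEXT, age INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO buyers VALUES (?, ?, ?)",
    [("Pro", 34, 999.0), ("Pro Max", 41, 1199.0),
     ("Plus", 28, 899.0), ("Pro", 37, 999.0)],
)

# Characterization: summarize general features of the target class
# (buyers grouped by model) with simple SQL aggregate queries.
for model, n, avg_age, avg_price in conn.execute(
    "SELECT model, COUNT(*), AVG(age), AVG(price) "
    "FROM buyers GROUP BY model ORDER BY model"
):
    print(model, n, avg_age, avg_price)
```

The GROUP BY query plays the role of a simple OLAP roll-up: it generalizes individual records into per-class summaries.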

Data discrimination

Data discrimination is another functionality of data mining. It compares the data between two
classes. Generally, it maps the target class against a predefined contrasting class, comparing
and contrasting the characteristics of the two using a set of rules called discriminant rules. The
methods used in data discrimination are similar to those used in data characterization.

2. Classification

Classification is probably one of the most important data mining functionalities. It uses data
models to predict trends in data.

It uses methods such as IF-THEN rules, decision trees, mathematical formulae, or neural
networks to build a predictive model. The model is learned from training data and then used to
classify new, unseen instances.

IF-THEN: The IF clause of an IF-THEN rule is referred to as the rule antecedent or precondition.
The THEN portion of the IF-THEN rule is known as the rule consequent. The antecedent portion
of the condition includes one or more attribute tests, which are logically ANDed together. The
antecedent and the consequent are used together to make a binary true or false decision.
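The IF-THEN mechanics above can be sketched as a tiny rule classifier; the attributes, thresholds, and class labels are hypothetical:

```python
# Hypothetical rule: IF age <= 30 AND student = "yes" THEN buys_computer = "yes".
# The antecedent is a conjunction (logical AND) of attribute tests; the rule
# fires only when every test passes.

def rule_antecedent(record: dict) -> bool:
    # Attribute tests, logically ANDed together.
    return record["age"] <= 30 and record["student"] == "yes"

def classify(record: dict) -> str:
    # The consequent is assigned when the antecedent holds; otherwise unknown.
    return "yes" if rule_antecedent(record) else "unknown"

print(classify({"age": 24, "student": "yes"}))  # antecedent holds
print(classify({"age": 45, "student": "no"}))   # antecedent fails
```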

Decision Tree: A decision tree is a tree-like classification model built by decision tree mining.
Internal nodes test attributes, branches represent test outcomes, and leaves assign class labels
or numerical values, so the tree supports inferences about classes of objects or numeric values.

Neural Networks: By efficiently transforming unstructured data into usable insights, neural
networks are a common tool for successful data mining.

3. Prediction

The prediction functionality estimates missing or unavailable numeric values in the data,
typically using regression analysis. If the missing value is a class label, the task is handled by
classification instead. Prediction is popular because of its importance in business intelligence.
There are two ways one can predict data:

1. Predicting the unavailable or missing data using prediction analysis

2. Predicting the class label using the previously built class model.

It is a forecasting technique that allows us to estimate values far into the future. We need a
large dataset of past values to predict future trends reliably.
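The regression-style prediction described above can be sketched with a one-variable least-squares fit; the yearly sales figures are invented for illustration:

```python
# Hypothetical past values (e.g. yearly sales).
years = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n

# Ordinary least squares for the line y = a + b*x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales)) / \
    sum((x - mean_x) ** 2 for x in years)
a = mean_y - b * mean_x

# Predict the unavailable future value for year 6.
predicted = a + b * 6
print(predicted)
```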

4. Association Analysis

Association Analysis is a functionality of data mining that relates two or more attributes of the
data. It discovers relationships between data items and the rules that bind them, and it is widely
applied in retail sales. The suggestions Amazon shows at the bottom of a product page,
“Customers who bought this also bought...”, are a real-world example of association analysis.

It associates attributes that are frequently transacted together. The resulting association rules
are widely used in market basket analysis. Two measures qualify an association: confidence,
the conditional probability that the associated items occur together given the antecedent, and
support, the frequency with which the association has occurred in past transactions.
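Support and confidence for a candidate association can be computed directly from a list of transactions; the basket data below is hypothetical:

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset: set) -> float:
    # Fraction of transactions containing the whole itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent: set, consequent: set) -> float:
    # P(consequent | antecedent): joint support over antecedent support.
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # how often bread and milk co-occur
print(confidence({"bread"}, {"milk"}))  # how often milk follows bread
```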

5. Cluster Analysis

Cluster analysis is unsupervised classification. It is similar to the classification functionality of
data mining in that data are grouped; unlike classification, however, the class label is unknown,
and the data are grouped by clustering algorithms.

Similar objects are grouped under one cluster, and different clusters are kept as distinct as
possible. Grouping is done to maximize the intra-class (within-cluster) similarity and minimize
the inter-class (between-cluster) similarity. Clustering is applied in many fields like machine
learning, image processing, pattern recognition, and bioinformatics.
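One common clustering algorithm, k-means, can be sketched on one-dimensional data; the points and the choice of k = 2 are illustrative:

```python
# Tiny 1-D k-means sketch; points and k = 2 are invented for illustration.
points = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9]
centroids = [points[0], points[3]]  # naive initialization

for _ in range(10):  # a few refinement passes
    clusters = [[], []]
    for p in points:
        # Assign each point to its nearest centroid
        # (this maximizes intra-cluster similarity).
        i = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[i].append(p)
    # Recompute each centroid as the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(sorted(centroids))
```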

6. Outlier Analysis

When data appear that cannot be grouped into any of the classes, we use outlier analysis. Such
data have attributes that differ markedly from those of the other classes or the general model.
These exceptional data points are called outliers. They are usually treated as noise or
exceptions, and the analysis of these outliers is called outlier mining.
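A simple way to flag such outliers is a z-score test, sketched here with the stdlib `statistics` module; the measurements and the threshold of 2 standard deviations are invented:

```python
import statistics

# Hypothetical measurements with one obvious outlier.
data = [10, 11, 9, 10, 12, 10, 11, 50]

mean = statistics.mean(data)
stdev = statistics.stdev(data)

# Flag values more than 2 standard deviations from the mean as outliers.
outliers = [x for x in data if abs(x - mean) / stdev > 2]
print(outliers)
```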

7. Evolution & Deviation Analysis

Evolution analysis, another data mining functionality, yields time-related clustering of data. It
reveals trends and changes in behavior over a period, allowing us to study features such as
time-series data, periodicity, and similarity in trends.
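Trend detection over a period, as in evolution analysis, can be sketched with a simple moving average; the monthly values below are invented:

```python
# Hypothetical monthly sales, noisy but trending upward.
series = [10, 12, 11, 14, 13, 16, 15, 18, 17, 20]
window = 3

# A moving average smooths short-term fluctuation to expose the trend.
smoothed = [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]
print(smoothed)
```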

Data Preprocessing
Data preprocessing is an important step in the data mining process. It refers to the
cleaning, transforming, and integrating of data in order to make it ready for analysis. The goal of
data preprocessing is to improve the quality of the data and to make it more suitable for the
specific data mining task.

Some common steps in data preprocessing include:

Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.

The data can have many irrelevant and missing parts. To handle this part, data cleaning is done.
It involves handling of missing data, noisy data etc.

(a) Missing Data:

This situation arises when some values are missing in the data. It can be handled in various
ways. Some of them are:

1. Ignore the tuples:

This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.

2. Fill the Missing values:

There are various ways to do this task. You can choose to fill the missing values
manually, by attribute mean or the most probable value.
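Filling missing values with the attribute mean (option 2 above) can be sketched as follows; the `None` placeholders mark missing entries in invented data:

```python
# A column with missing entries marked as None (hypothetical data).
ages = [25, None, 30, 35, None, 40]

# Attribute mean computed over the known values only.
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)

# Replace each missing value with the attribute mean.
filled = [a if a is not None else mean_age for a in ages]
print(filled)
```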

(b) Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by
faulty data collection, data entry errors, etc. It can be handled in the following ways:

1. Binning Method:

This method works on sorted data in order to smooth it. The whole data is divided
into segments (bins) of equal size, and each segment is handled separately. All data in a
segment can be replaced by the segment mean, or the bin boundary values can be used to
complete the task.

2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers fall outside the clusters or may
remain undetected.
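The binning method (way 1 above) can be sketched as: sort the data, divide it into equal-size bins, then smooth by bin means or bin boundaries. The sample values are illustrative:

```python
# Hypothetical data, sorted and split into equal-size bins of 3.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value in a bin becomes the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]
print(by_means)

# Smoothing by bin boundaries: each value snaps to the nearer boundary.
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]
print(by_boundaries)
```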

Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for data
integration.

Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while standardization
is used to transform the data to have zero mean and unit variance. Discretization is used to
convert continuous data into discrete categories.
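The three transformations just described can be sketched in a few lines; the attribute values and category cut-points are invented:

```python
import statistics

values = [200, 300, 400, 600, 1000]  # hypothetical attribute values

# Min-max normalization to the common range [0.0, 1.0].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-score): zero mean, unit variance.
mean, stdev = statistics.mean(values), statistics.pstdev(values)
standardized = [(v - mean) / stdev for v in values]

# Discretization: map continuous values to discrete categories.
discretized = ["low" if v < 400 else "medium" if v < 800 else "high"
               for v in values]

print(normalized)
print(discretized)
```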

This step is taken in order to transform the data in appropriate forms suitable for mining process.
This involves following ways:

1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)

2. Attribute Selection:

In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.

3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.

4. Concept Hierarchy Generation:

Here attributes are converted from a lower level to a higher level in a hierarchy. For example,
the attribute “city” can be generalized to “country”.
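Concept hierarchy generation can be sketched as a simple lower-to-higher-level mapping; the city-to-country table below is hypothetical:

```python
# Hypothetical concept hierarchy: city -> country.
city_to_country = {
    "Bengaluru": "India",
    "Mysuru": "India",
    "Paris": "France",
    "Lyon": "France",
}

cities = ["Bengaluru", "Paris", "Mysuru"]

# Generalize the lower-level "city" attribute to the higher-level "country".
countries = [city_to_country[c] for c in cities]
print(countries)
```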

Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and
feature extraction. Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a lower-dimensional space
while preserving the important information.

This is done to improve the efficiency of data analysis and to avoid over-fitting of the model.
Some common steps involved in data reduction are:

1. Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
2. Feature Extraction: This involves transforming the data into a lower-dimensional space
while preserving the important information. Feature extraction is often used when the
original features are high-dimensional and complex. It can be done using techniques such
as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization
(NMF).
3. Sampling: This involves selecting a subset of data points from the dataset. Sampling is
often used to reduce the size of the dataset while preserving the important information. It
can be done using techniques such as random sampling, stratified sampling, and
systematic sampling.
4. Clustering: This involves grouping similar data points together into clusters. Clustering
is often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
5. Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression,
JPEG compression, and gzip compression.
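Simple random and systematic sampling (step 3 above) can be sketched with the stdlib `random` module; the dataset and sample sizes are illustrative:

```python
import random

data = list(range(100))  # hypothetical dataset of 100 records

# Simple random sampling without replacement.
random.seed(42)          # fixed seed so the sketch is repeatable
simple = random.sample(data, 10)

# Systematic sampling: take every k-th record.
k = 10
systematic = data[::k]

print(len(simple), systematic)
```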

Data Mining System Architecture

The data mining system architecture consists of the following modules as shown in the Figure
below.

1. Data sources: The data mining system assembles data from different sources to perform the
analysis task. The sources of data include data warehouses, flat files, databases, the World
Wide Web (WWW), spreadsheets, and other kinds of information repositories. Data selection
and data preprocessing techniques are applied to this data.

2. Database or Data Warehouse Server: The database or data warehouse server is the central
storage accountable for extracting the related data, according to the data mining request or query
issued by the user.

3. Knowledge Base: Data mining procedure might refer to a knowledge base, which is a
repository of knowledge related to a particular domain that would help the searching procedure
for finding the interesting patterns. This kind of knowledge may include “concept hierarchies”
which organizes features or feature values into several levels of abstraction. It may also include
“user beliefs”, which can evaluate the interestingness measure of a data pattern according to its
unexpectedness. Other instances of domain knowledge are any additional
thresholds or interestingness constraints and metadata (i.e., data about data).

4. Data Mining Engine: This is an important part of the data mining system. It contains a set of
functional modules for performing several tasks such as summarization, association analysis,
classification, regression, cluster analysis, and outlier detection.

5. Pattern Evaluation: This module usually applies interestingness thresholds or constraints to
identify the interesting knowledge. It also communicates with the data mining engine so as to
help focus the search on interesting patterns.

6. Graphical User Interface: The module interacts with users and the data mining system. It
allows the user to communicate with the system by providing a data mining request or query and
offers the necessary information to guide the search. Based on the users’ data mining application,
the mined knowledge is presented to the user using some visualization techniques.

Classification of Data Mining Systems

Data mining systems can be categorized according to various criteria, as follows:

Classification according to the kinds of databases mined: A data mining system can be
classified according to the kinds of databases mined. Database systems can be classified
according to different criteria (such as data models, or the types of data or applications involved),
each of which may require its own data mining technique. Data mining systems can therefore be
classified accordingly.

Classification according to the kinds of knowledge mined: Data mining systems can
be categorized according to the kinds of knowledge they mine, that is, based on data mining
functionalities, such as characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive
data mining system usually provides multiple and/or integrated data mining functionalities.

Classification according to the kinds of techniques utilized: Data mining systems can
be categorized according to the underlying data mining techniques employed. These techniques
can be described according to the degree of user interaction involved (e.g., autonomous systems,
interactive exploratory systems, query-driven systems) or the methods of data analysis employed
(e.g., database-oriented or data warehouse– oriented techniques, machine learning, statistics,
visualization, pattern recognition, neural networks, and so on). A sophisticated data mining
system will often adopt multiple data mining techniques or work out an effective, integrated
technique that combines the merits of a few individual approaches.

Classification according to the applications adapted: Data mining systems can also
be categorized according to the applications they are adapted to. For example, data mining
systems may be tailored specifically for finance, telecommunications, DNA, stock markets,
e-mail, and so on.
Different applications often require the integration of application-specific methods. Therefore, a
generic, all-purpose data mining system may not fit domain-specific mining tasks.

2.4 Efficient And Scalable Frequent Itemset Mining Methods:

2.4.1 Finding Frequent Itemsets Using Candidate Generation: The Apriori Algorithm

 Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for
mining frequent itemsets for Boolean association rules.
 The name of the algorithm is based on the fact that the algorithm uses prior knowledge
of frequent itemset properties.
 Apriori employs an iterative approach known as a level-wise search, where k-itemsets
are used to explore (k+1)-itemsets.
 First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item and collecting those items that satisfy minimum support. The
resulting set is denoted L1.
 Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and
so on, until no more frequent k-itemsets can be found.
 The finding of each Lk requires one full scan of the database.
 A two-step process is followed in Apriori, consisting of a join action and a prune action.
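The level-wise join-and-prune search described above can be sketched as follows; the transaction database and the minimum support count are illustrative:

```python
from itertools import combinations

# Hypothetical transaction database and minimum support count.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
min_support = 2

def frequent_itemsets(transactions, min_support):
    def count(itemset):
        # Support count: number of transactions containing the itemset.
        return sum(itemset <= t for t in transactions)

    # L1: scan the database once, keeping items that meet minimum support.
    items = {i for t in transactions for i in t}
    Lk = [frozenset([i]) for i in sorted(items)
          if count(frozenset([i])) >= min_support]
    result = list(Lk)
    k = 2
    while Lk:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: a candidate survives only if all its (k-1)-subsets
        # are frequent (the Apriori property).
        prev = set(Lk)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        # One full database scan per level to count candidate support.
        Lk = [c for c in candidates if count(c) >= min_support]
        result.extend(Lk)
        k += 1
    return result

for itemset in frequent_itemsets(transactions, min_support):
    print(sorted(itemset))
```

With this data, all three single items and all three pairs are frequent, while the triple {bread, butter, milk} appears in only one transaction and is pruned out by the support test.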
