Unit - 2 Data Mining Notes
Data Mining refers to analyzing large amounts of data to identify relationships and patterns so
that different business problems can be solved. Its tasks are often designed to run
semi-automatically or fully automatically on large and complex datasets.
The data mining process encompasses several tools and techniques to allow enterprises to
describe data and predict future trends, helping to increase situational awareness and informed
decision-making. It is considered an integral part of data analytics and a core discipline of data
science.
Data mining is widely applied to big data to characterize existing data and predict future
values; its core function is to find trends in the data. Generally, data mining is categorized as:
1. Descriptive data mining: Similarities and patterns in data may be discovered using
descriptive data mining. Descriptive data mining may also be used to isolate interesting
groupings within the supplied data.
This kind of mining focuses on transforming raw data into information that can be used in
reports and analyses. It provides summary knowledge about the data, such as counts and averages.
It gives information about what is happening inside the data without any previous idea. It
exhibits the common features in the data. In simple words, you get to know the general
properties of the data present in the database.
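As a minimal sketch of descriptive mining, the general properties of a data set (count, average, range) can be computed directly; the sales figures below are made up for illustration:

```python
# Descriptive mining sketch: summarize general properties of a data set
# (count, average, min/max) without any prior hypothesis.
from statistics import mean

daily_sales = [120, 80, 150, 95, 110, 130]  # hypothetical values

summary = {
    "count": len(daily_sales),
    "average": mean(daily_sales),
    "min": min(daily_sales),
    "max": max(daily_sales),
}
print(summary)
```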
2. Predictive data mining: Rather than present behaviour, it is predictions about the future
that are being mined for. It takes advantage of target-prediction capabilities gained via
supervised learning. Classification, time-series analysis, and regression are the data mining
techniques that fall under this domain.
1. Class/Concept Description
2. Classification
3. Prediction
4. Association Analysis
5. Cluster Analysis
6. Outlier Analysis
Below are all the data mining functionalities with examples, so that you have an in-depth
understanding of how these functionalities are used in the real world to work with data.
1. Class/Concept Description
Data is associated with classes or concepts so that it can be correlated with results. A data
class/concept description summarizes the data of a class in general terms. For example, a new
iPhone model released in three variants, Pro, Pro Max, and Plus, targets classes of customers
with different requirements.
Data characterization
When you summarize the general features of the data, it is called data characterization. It
produces the characteristic rules for the target class, like our iPhone buyers. We can collect the
data using simple SQL queries and perform OLAP functions to generalize the data.
The attribute-oriented induction technique is also used to generalize or characterize the data with
minimal user interaction. The generalized data is presented in various forms such as tables, pie
charts, line charts, bar charts, and graphs. The multi-dimensional relationship among the data is
presented in a rule called the characteristic rule of the target class.
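Generalizing per-class features, as an OLAP roll-up or a SQL GROUP BY would, can be sketched in plain Python; the variants and prices below are illustrative, not real figures:

```python
# Data characterization sketch: average purchase amount per product variant,
# akin to SQL's GROUP BY followed by an aggregate (illustrative data).
from collections import defaultdict

rows = [
    ("Pro", 999), ("Pro Max", 1199), ("Plus", 899),
    ("Pro", 1049), ("Plus", 929),
]

totals = defaultdict(float)
counts = defaultdict(int)
for variant, amount in rows:
    totals[variant] += amount
    counts[variant] += 1

avg_by_variant = {v: totals[v] / counts[v] for v in totals}
print(avg_by_variant)
```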
Data discrimination
Data discrimination is another functionality of data mining. It compares the data between two
classes. Generally, it maps the target class against a predefined contrasting group or class,
comparing and contrasting the characteristics of the two using a set of rules called
discriminant rules. The methods used in data discrimination are similar to those used in data
characterization.
2. Classification
Classification is probably one of the most important data mining functionalities. It uses models
built from data to predict the class of new instances.
It uses methods like IF-THEN rules, decision trees, mathematical formulae, or neural networks to
build the model. The model is learned from training data and then used to classify new instances
by comparing them with the existing ones.
IF-THEN: The IF clause of an IF-THEN rule is referred to as the rule antecedent or precondition.
The THEN portion of the IF-THEN rule is known as the rule consequent. The antecedent portion
of the condition includes one or more attribute tests, which are logically ANDed together. The
antecedent and the consequent are used together to make a binary true or false decision.
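A single IF-THEN rule with two ANDed attribute tests can be sketched as follows; the attribute names and thresholds are hypothetical:

```python
# IF-THEN rule sketch: the antecedent is a conjunction (AND) of attribute
# tests; when it evaluates to true, the consequent assigns the class label.
def classify(customer):
    # IF age >= 18 AND income > 50000 THEN class = "eligible"
    if customer["age"] >= 18 and customer["income"] > 50000:
        return "eligible"
    return "not eligible"

print(classify({"age": 30, "income": 60000}))  # antecedent holds
print(classify({"age": 30, "income": 40000}))  # antecedent fails
```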
Decision Tree: Decision tree mining is a data mining approach for creating classification
models. It constructs tree-like models for classifying data and is used to build models that
draw inferences about classes of objects or numerical values.
Neural Networks: By efficiently transforming unstructured data into usable insights, neural
networks are a common tool for successful data mining.
3. Prediction
Prediction data mining functionality finds the missing numeric values in the data. It uses
regression analysis to find the unavailable data. If the class label is missing, then the prediction is
done using classification. Prediction is popular because of its importance in business intelligence.
There are two ways one can predict data:
1. Predicting unavailable or missing numeric values using regression analysis.
2. Predicting the class label using a previously built class model.
Prediction is a forecasting technique that allows us to estimate values far into the future; it
needs a large data set of past values to predict future trends.
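The regression-based forecasting described above can be sketched with an ordinary least-squares line fitted to past values (made-up observations with a clear trend):

```python
# Prediction sketch: fit y = a + b*x by least squares to past values and
# extrapolate the next period (illustrative data).
xs = [1, 2, 3, 4, 5]                   # e.g. period index
ys = [10.0, 12.0, 14.0, 16.0, 18.0]    # past observations

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

forecast = a + b * 6                   # value predicted for period 6
print(forecast)
```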
4. Association Analysis
Association analysis is a functionality of data mining that relates two or more attributes of the
data. It discovers the relationships between the data and the rules that bind them, and it finds
wide application in retail sales. The suggestions Amazon shows at the bottom of a product page,
“Customers who bought this also bought...”, are a real-world example of association analysis.
It associates attributes that are frequently transacted together, discovering what are called
association rules, which are widely used in market basket analysis. Two measures characterize an
association rule: confidence, the conditional probability that the consequent items appear in a
transaction given the antecedent items, and support, the fraction of past transactions in which
the associated items occur together.
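Support and confidence for a single rule can be computed directly over a toy set of market baskets (items invented for illustration):

```python
# Association analysis sketch: support and confidence for the rule
# {bread} -> {butter} over four toy transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
# support: fraction of transactions containing both items together
support = sum(1 for t in transactions if {"bread", "butter"} <= t) / n
# confidence: P(butter | bread)
support_bread = sum(1 for t in transactions if "bread" in t) / n
confidence = support / support_bread

print(support, confidence)
```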
5. Cluster Analysis
Similar objects are grouped under one cluster, and there will be a large difference between one
cluster and another. Grouping is done by maximizing the intra-class (within-cluster) similarity
and minimizing the inter-class (between-cluster) similarity. Clustering is applied in many fields
like machine learning, image processing, pattern recognition, and bioinformatics.
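A minimal one-dimensional k-means (k = 2) illustrates the grouping idea; the points and the fixed iteration count are arbitrary choices for the sketch:

```python
# Cluster analysis sketch: 1-D k-means with k=2. Points are assigned to the
# nearest centroid, then centroids are recomputed as the cluster means.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centroids = [points[0], points[3]]            # naive initialization

for _ in range(10):                           # fixed iteration budget
    clusters = [[], []]
    for p in points:
        nearest = min((0, 1), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    centroids = [sum(c) / len(c) for c in clusters]

result = sorted(round(c, 2) for c in centroids)
print(result)
```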
6. Outlier Analysis
When data appears that cannot be grouped into any of the classes, we use outlier analysis. There
will be occurrences of data whose attributes differ from those of all the other classes or the
general model. These outstanding data points are called outliers. They are usually considered
noise or exceptions, and the analysis of these outliers is called outlier mining.
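A common simple outlier rule flags values more than two standard deviations from the mean; the data below is invented so that one point clearly stands out:

```python
# Outlier analysis sketch: z-score style rule using the population standard
# deviation; values beyond 2 sigma from the mean are flagged as outliers.
from statistics import mean, pstdev

values = [10, 11, 9, 10, 12, 50]
mu = mean(values)
sigma = pstdev(values)
outliers = [v for v in values if abs(v - mu) > 2 * sigma]
print(outliers)
```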
7. Evolution Analysis
Evolution analysis is another data mining functionality that provides time-related clustering of
data. We can find trends and changes in behaviour over a period, and identify features such as
time-series patterns, periodicity, and similarity in trends with this analysis.
Data Preprocessing
Data preprocessing is an important step in the data mining process. It refers to the
cleaning, transforming, and integrating of data in order to make it ready for analysis. The goal of
data preprocessing is to improve the quality of the data and to make it more suitable for the
specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It
involves the handling of missing data, noisy data, etc.
Missing Data: This situation arises when some values are missing in the data. It can be handled
in various ways. Some of them are:
1. Ignore the tuples: This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
2. Fill the missing values: There are various ways to do this task. You can choose to fill the
missing values manually, by the attribute mean, or by the most probable value.
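Filling missing values by the attribute mean can be sketched like this (a hypothetical "age" column, with None marking missing entries):

```python
# Missing-data sketch: mean imputation. None marks a missing value; it is
# replaced by the mean of the known values of the attribute.
ages = [25, None, 30, 35, None, 40]

known = [a for a in ages if a is not None]
attr_mean = sum(known) / len(known)
filled = [a if a is not None else attr_mean for a in ages]
print(filled)
```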
Noisy Data: Noisy data contains errors or random variance that cannot be interpreted
meaningfully. It can be smoothed in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided
into segments (bins) of equal size, and each segment is handled separately. One can
replace all the data in a segment by its mean, or the bin boundary values can be used
to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. The outliers may go undetected,
or they will fall outside the clusters.
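The smoothing-by-bin-means variant of the binning method above can be sketched as follows (a toy sequence split into bins of three):

```python
# Binning sketch: sort the data, partition it into equal-size bins, and
# replace every value in a bin by the bin mean (smoothing by bin means).
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([bin_mean] * len(bin_values))
print(smoothed)
```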
Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for data
integration.
Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while standardization
is used to transform the data to have zero mean and unit variance. Discretization is used to
convert continuous data into discrete categories.
This step is taken in order to transform the data into forms appropriate for the mining process.
It involves the following ways:
1. Normalization:
It is done in order to scale the data values to a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or
conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example,
the attribute “city” can be converted to “country”.
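Min-max normalization into the range 0.0 to 1.0 can be sketched directly (values chosen only for illustration):

```python
# Normalization sketch: min-max scaling maps each value v to
# (v - min) / (max - min), producing values in [0.0, 1.0].
values = [200, 300, 400, 600, 1000]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)
```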
Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and
feature extraction. Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a lower-dimensional space
while preserving the important information.
This is done to improve the efficiency of data analysis and to avoid over-fitting of the model.
Some common steps involved in data reduction are:
1. Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
2. Feature Extraction: This involves transforming the data into a lower-dimensional space
while preserving the important information. Feature extraction is often used when the
original features are high-dimensional and complex. It can be done using techniques such
as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization
(NMF).
3. Sampling: This involves selecting a subset of data points from the dataset. Sampling is
often used to reduce the size of the dataset while preserving the important information. It
can be done using techniques such as random sampling, stratified sampling, and
systematic sampling.
4. Clustering: This involves grouping similar data points together into clusters. Clustering
is often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
5. Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression,
JPEG compression, and gzip compression.
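Of the reduction steps above, simple random sampling is the easiest to sketch; the dataset size and the 30% sampling rate are arbitrary choices for the example:

```python
# Data reduction sketch: simple random sampling without replacement,
# keeping 30% of the records (seeded so the run is reproducible).
import random

dataset = list(range(100))              # 100 hypothetical records
random.seed(42)
sample = random.sample(dataset, k=int(0.3 * len(dataset)))
print(len(sample))
```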
The data mining system architecture consists of the following modules as shown in the Figure
below.
1. Data sources: The data mining system assembles data from different data sources for
performing the analysis task. The sources of data are data warehouses, flat files, databases,
the World Wide Web (WWW), spreadsheets, or other kinds of information repositories. Data
selection and data preprocessing techniques are applied to this data.
2. Database or Data Warehouse Server: The database or data warehouse server is the central
storage accountable for extracting the related data, according to the data mining request or query
issued by the user.
3. Knowledge Base: Data mining procedure might refer to a knowledge base, which is a
repository of knowledge related to a particular domain that would help the searching procedure
for finding the interesting patterns. This kind of knowledge may include “concept hierarchies”
which organizes features or feature values into several levels of abstraction. It may also include
“user beliefs”, which can evaluate the interestingness measure of a data pattern according to its
suddenness or unexpectedness. The other instances of domain knowledge are any added
thresholds or interestingness constraints and metadata (i.e., data about data).
4. Data Mining Engine: This is an important part of the data mining system. It contains a set of
functional modules for performing several tasks such as summarization, association analysis,
classification, regression, cluster analysis, and outlier detection.
5. Pattern Evaluation Module: This module typically employs interestingness measures and
interacts with the data mining engine so as to focus the search toward interesting patterns. It
may use interestingness thresholds to filter out discovered patterns.
6. Graphical User Interface: This module provides communication between the user and the data
mining system. It allows the user to interact with the system by specifying a data mining query
or task and offers the information needed to guide the search. Based on the user's data mining
application, the mined knowledge is presented to the user using suitable visualization
techniques.
Classification according to the kinds of databases mined: A data mining system can be
classified according to the kinds of databases mined. Database systems can be classified
according to different criteria (such as data models, or the types of data or applications involved),
each of which may require its own data mining technique. Data mining systems can therefore be
classified accordingly.
Classification according to the kinds of knowledge mined: Data mining systems can
be categorized according to the kinds of knowledge they mine, that is, based on data mining
functionalities, such as characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive
data mining system usually provides multiple and/or integrated data mining functionalities.
Classification according to the kinds of techniques utilized: Data mining systems can
be categorized according to the underlying data mining techniques employed. These techniques
can be described according to the degree of user interaction involved (e.g., autonomous systems,
interactive exploratory systems, query-driven systems) or the methods of data analysis employed
(e.g., database-oriented or data warehouse– oriented techniques, machine learning, statistics,
visualization, pattern recognition, neural networks, and so on). A sophisticated data mining
system will often adopt multiple data mining techniques or work out an effective, integrated
technique that combines the merits of a few individual approaches.
Classification according to the applications adapted: Data mining systems can also
be categorized according to the applications they adapt. For example, data mining systems may
be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on.
Different applications often require the integration of application-specific methods. Therefore, a
generic, all-purpose data mining system may not fit domain-specific mining tasks.