VO_MCA_S4_Data Mining Unit 3
Names of Sub-Units
Task relevant data, Background knowledge, Representing input data and output
knowledge, EDA on data
Overview
In this unit, “Data Mining Representation”, you will learn about task-relevant data and
background knowledge. The unit then explains how to represent input data and output
knowledge, and introduces Exploratory Data Analysis (EDA) on various types of data.
Learning Objectives
Learning Outcomes
Understand the patterns and relationships within the data that can be used to
make predictions, identify trends, and support decision making.
Understand the machine learning techniques to improve the accuracy and
efficiency of data mining and knowledge representation processes.
https://www.ibm.com/in-en/topics/exploratory-data-analysis
https://www.oreilly.com/library/view/data-mining-2nd/9780120884070/xhtml/CHP003.html
3.1 Data Mining Knowledge Representation – Introduction
Data mining is the process of discovering patterns and knowledge from large data sets.
Knowledge representation in data mining refers to the process of encoding the discovered
patterns and knowledge in a format that can be easily understood and used by humans or
other systems. There are several different ways to represent knowledge in data mining,
including:
Decision trees: These are graphical representations that show the decisions and
outcomes of a process. They are commonly used in decision-making tasks and
classification problems.
Rules: These are statements that describe a relationship between input and output
variables. They are commonly used in association rule mining and classification
problems.
Clusters: These are groups of similar data points. They are commonly used in clustering
and segmentation tasks.
Graphs: These are graphical representations of relationships between data points. They
are commonly used in social network analysis and web mining.
Ultimately, the choice of knowledge representation will depend on the specific task and
the audience for the discovered knowledge.
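As a minimal sketch of the rule representation, the Python snippet below encodes a rule as an antecedent and a consequent and measures how well it holds on a small list of transactions. The transactions, the item names, and the helper name rule_stats are made up for illustration. Support and confidence are the measures commonly used in association rule mining; they are an assumption here, not something this unit defines.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def rule_stats(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions that contain every item of the antecedent.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Transactions that contain both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

support, confidence = rule_stats({"bread"}, {"butter"}, transactions)
print(f"IF bread THEN butter: support={support:.2f}, confidence={confidence:.2f}")
```

Keeping the rule as two plain sets makes it readable by a person ("IF bread THEN butter") and checkable by a program, which is the point of choosing a representation to suit the audience.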
3.2 KDD – The Process
Source: Javatpoint
Data selection: This step involves selecting the data that will be used for the analysis.
The data can come from a variety of sources, such as databases, text files, or sensor data.
Data pre-processing: This step involves cleaning, transforming, and integrating the
data to make it suitable for analysis. This may involve tasks such as removing missing or
irrelevant data, handling outliers, and normalizing the data.
Data mining: This step involves applying various data mining techniques, such as
classification, clustering, or association rule mining, to discover patterns and
relationships in the data.
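Pattern evaluation: This step involves assessing the discovered patterns to decide which
are valid and useful.
Knowledge representation: This step involves presenting the discovered knowledge in an
understandable form. This may involve tasks such as creating decision trees, rules, or
visualizations.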
Deployment: This step involves putting the discovered knowledge into practice. This
can include tasks such as building predictive models, creating new products, or
improving decision-making processes.
It is important to note that the KDD process is an iterative and interactive process, and the
steps may be refined and repeated as needed. The choice of techniques and tools used in
each step will depend on the specific task and the characteristics of the data.
Data mining and KDD (Knowledge Discovery in Databases) are closely related fields that
involve extracting useful information and knowledge from large data sets.
Data mining is a process of identifying patterns and relationships in data. It involves applying
various techniques, such as classification, clustering, association rule mining, and anomaly
detection, to discover patterns and relationships in the data. These techniques are applied to
a wide range of data types, such as text, images, or sensor data, and are used in a variety of
applications, such as fraud detection, customer segmentation, and predictive modeling.
KDD is a broader and more comprehensive process that encompasses not only the techniques
of data mining but also the entire process of discovering useful information and knowledge
from data. The KDD process includes several steps, such as data selection, data preprocessing,
data mining, pattern evaluation, knowledge representation, and deployment, and it is an
iterative and interactive process.
In summary, data mining is the specific task of identifying patterns and relationships in data,
whereas KDD is the more comprehensive process that runs from data selection and preprocessing
through mining and pattern evaluation to knowledge representation and deployment.
Formal Languages: Representing knowledge using formal languages, such as first-order logic,
allows for precise and unambiguous representation of knowledge.
These languages are used in artificial intelligence and knowledge representation systems.
Concept Maps: Representing knowledge using concept maps, which are graphical
representations of concepts and their relationships. Concept maps can be used to organize
and visualize knowledge in a way that is easy for humans to understand.
Conceptual Graphs: Representing knowledge using conceptual graphs, which are a formal
representation of concepts and relationships, similar to ontologies but with a graph-based
structure.
Decision Trees: Representing knowledge using decision trees, which are tree-like
structures that represent decision-making processes. Each internal node represents a test
on an attribute, each branch represents the outcome of a test and each leaf node
represents a class label.
Bayesian Networks: Representing knowledge using Bayesian networks, which are directed
acyclic graphs that represent probabilistic relationships between variables.
Neural Networks: Representing knowledge using neural networks, which are a type of
machine learning model that can learn complex relationships between inputs and outputs
by adjusting the weights of the connections between neurons.
The choice of method for knowledge representation depends on the complexity of the
information and the intended use of the knowledge. Some methods, like natural language, are
easy for humans to understand but difficult for computers to interpret. Other methods, like
formal languages, are precise and unambiguous but can be difficult for humans to understand.
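The decision tree description above (internal nodes test an attribute, branches are test outcomes, leaves are class labels) can be sketched as a nested Python dictionary. The tree, the attribute names, and the classify helper are hypothetical and hand-written for illustration; a real tree would be learned from data.

```python
# A hand-written tree for a hypothetical "play outside?" decision.
# Internal nodes test an attribute; leaves hold a class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": {
            "attribute": "windy",
            "branches": {True: "no", False: "yes"},
        },
    },
}

def classify(node, record):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node

record = {"outlook": "sunny", "humidity": "normal", "windy": False}
print(classify(tree, record))
```

The same structure is easy for a person to read as nested if-then tests and easy for a program to traverse, which illustrates the trade-off between human and machine readability discussed above.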
Representing input and output data is an important aspect of data mining and KDD. The
representation of input data refers to the format and structure of the data that is used as input
for data mining and KDD algorithms. The representation of output data refers to the format
and structure of the data that is generated as a result of the data mining and KDD process.
Input data representation:
Tabular representation: The data is organized as a table in which each row
represents an instance or a record and each column represents an attribute or a
feature.
Spatial representation: This representation is used when the data has a spatial
component, such as geographic data.
Text representation: This representation is used when the data is in text format, such
as emails or documents.
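To make two of these input representations concrete, the sketch below stores a small table as a list of rows (instances) with named attributes, and turns a short text into word counts. The records and the sentence are invented for illustration, and the bag-of-words encoding is one common choice for text that this unit does not prescribe.

```python
from collections import Counter

# Tabular representation: each row is one instance, each key is an attribute.
customers = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": 51, "city": "Delhi"},
]
ages = [row["age"] for row in customers]   # one column across all rows

# Text representation: a document turned into word counts (bag of words).
def bag_of_words(text):
    """Lowercase the text, split on whitespace, strip punctuation, count words."""
    words = (w.strip(".,;:!?") for w in text.lower().split())
    return Counter(w for w in words if w)

doc = "Data mining finds patterns. Patterns guide decisions."
counts = bag_of_words(doc)
print(ages)
print(counts["patterns"])
```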
Output data representation:
Rules: These are statements that describe a relationship between input and output
variables. They are commonly used in association rule mining and classification
problems.
Decision trees: These are graphical representations that show the decisions and
outcomes of a process. They are commonly used in decision-making tasks and
classification problems.
Clusters: These are groups of similar data points. They are commonly used in
clustering and segmentation tasks.
In data mining, task-relevant data refers to the subset of data that is relevant to the specific
task or problem that is being addressed. Selecting task-relevant data is an important step in
the data mining process, as it helps to focus the analysis on the most relevant information and
reduce noise and irrelevant information in the data. There are several ways to select task-
relevant data, including:
Data Sampling: This involves selecting a subset of data from the entire dataset. This
can be done randomly or by using a specific sampling technique such as stratified
sampling.
Feature Selection: This involves selecting a subset of features or attributes from the
entire set of features that are relevant to the task. This can be done by using techniques
such as mutual information, correlation-based feature selection, or wrapper-based
feature selection.
Instance selection: This involves selecting a subset of instances or records from the
entire dataset. This can be done by using techniques such as active learning, or by
removing outliers or instances with missing values.
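Two of the selection methods above, instance selection by removing records with missing values and stratified sampling, can be sketched with the Python standard library. The record layout, the field names, and the helper names drop_incomplete and stratified_sample are assumptions made for this example.

```python
import random
from collections import defaultdict

def drop_incomplete(records):
    """Instance selection: keep only records with no missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]

def stratified_sample(records, key, fraction, seed=0):
    """Data sampling: draw the same fraction from every group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical records: 8 "low" risk, 4 "high" risk, one with a missing value.
records = [{"id": i, "risk": "low", "amount": 10 * i} for i in range(8)]
records += [{"id": 8 + i, "risk": "high", "amount": 100 * i} for i in range(4)]
records.append({"id": 12, "risk": "high", "amount": None})

clean = drop_incomplete(records)
sample = stratified_sample(clean, key="risk", fraction=0.5)
print(len(clean), len(sample))
```

Sampling within each group keeps the low-to-high ratio of the full data set, which a purely random draw from a small data set may not.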
Data mining is the process of discovering patterns and knowledge from large datasets. It
involves the use of various techniques and algorithms to extract useful information from data
and transform it into an understandable structure for further use. The main steps in data
mining include:
Data integration: This step involves combining data from different sources to create
a unified dataset.
Data selection: This step involves selecting a subset of the data that is relevant to
the analysis.
Data transformation: This step involves transforming the data into a format that
can be used by the data mining algorithms.
Data mining: This step involves applying various algorithms and techniques to
extract patterns and knowledge from the data.
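The four steps above can be traced on a toy example. This is only a sketch: the two data sources, the 500.0 threshold, and the band helper are hypothetical, and the "mining" step is a simple frequency count standing in for a real algorithm.

```python
from collections import Counter

# Two hypothetical sources that share a customer id.
profiles = {1: {"city": "Pune"}, 2: {"city": "Delhi"}, 3: {"city": "Pune"}}
orders = [
    {"customer": 1, "amount": 250.0},
    {"customer": 2, "amount": 900.0},
    {"customer": 3, "amount": 400.0},
    {"customer": 1, "amount": 1200.0},
]

# Data integration: attach profile attributes to each order.
integrated = [{**o, **profiles[o["customer"]]} for o in orders]

# Data selection: keep only the attributes relevant to the question.
selected = [{"city": r["city"], "amount": r["amount"]} for r in integrated]

# Data transformation: discretize the amount into a categorical band.
def band(amount, threshold=500.0):
    return "high" if amount >= threshold else "low"

transformed = [(r["city"], band(r["amount"])) for r in selected]

# Data mining (a deliberately trivial stand-in): count co-occurring values.
pattern_counts = Counter(transformed)
print(pattern_counts.most_common(1))
```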
Data mining algorithms can be implemented in programming languages such as R and Python,
or applied through tools such as Weka, KNIME, and RapidMiner.
Data mining is widely used in a variety of fields, including business, science, and engineering.
It is used to discover patterns in sales data, customer behavior, and medical records, as well as
to identify potential fraud, predict equipment failures, and analyze scientific data.
Exploratory Data Analysis (EDA) is an approach to analyzing and understanding data sets
through visual and statistical methods. The goal of EDA is to identify patterns, anomalies, and
relationships within a data set that can be used to guide further analysis and modeling. The
main steps in EDA include:
Data visualization: This step involves creating visual representations of the data, such as
histograms, scatter plots, and box plots, to understand the distribution of the data and
identify any patterns or outliers.
Data summary statistics: This step involves calculating summary statistics such as mean,
median, standard deviation, and quartiles to understand the central tendency and spread of
the data.
Data normalization: This step involves transforming the data to a common scale to make
it easier to compare and analyze.
Data transformation: This step involves transforming the data into a format that can be
used by the data mining algorithms.
Data cleaning: This step involves removing or correcting any errors or inconsistencies in
the data.
Data correlation: This step involves identifying the relationship between variables by
calculating correlation coefficients.
Data dimensionality reduction: This step involves reducing the number of variables in a
data set by combining or removing variables that are highly correlated or have low variance.
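The summary statistics step can be sketched with Python's standard statistics module. The values are invented, and the 1.5 × IQR rule used to flag outliers is a common convention assumed here rather than one stated in this unit.

```python
import statistics

values = [12, 15, 14, 10, 13, 15, 16, 11, 14, 95]  # hypothetical; 95 looks suspect

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)                 # sample standard deviation
q1, q2, q3 = statistics.quantiles(values, n=4)   # quartile cut points

# Assumed outlier rule (Tukey's fences): beyond 1.5 * IQR from the quartiles.
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]

print(f"mean={mean:.1f} median={median} stdev={stdev:.1f}")
print(f"quartiles={q1}, {q2}, {q3} outliers={outliers}")
```

The mean is pulled well above the median by the single large value, which is the kind of discrepancy EDA is meant to surface before modeling.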
EDA is a crucial step in the data mining process as it helps to identify any issues with the data
and provides a deeper understanding of the data set. EDA can be done using a variety of tools
and programming languages, such as R, Python, SQL, SAS, SPSS, and Stata. EDA is widely used
in a variety of fields, including business, science, and engineering. It is used to discover patterns
in sales data, customer behavior, and medical records, as well as to identify potential fraud,
predict equipment failures, and analyze scientific data.
Why EDA?
Exploratory Data Analysis (EDA) is an important step in the data mining process because it
helps to identify patterns, anomalies, and relationships within a data set that can guide further
analysis and modeling. EDA is also important for knowledge representation because it allows
for a deeper understanding of the data set, which can inform the choice of representation
method.
By visualizing and summarizing the data, EDA helps to identify the overall shape and structure
of the data, and to identify any outliers or unusual patterns. This information can be used to
guide the selection of data mining techniques and algorithms, and to identify any issues that
need to be addressed before further analysis.
EDA also allows for the identification of relationships between variables, which can inform the
choice of representation method. For example, if the data has a high degree of correlation
between variables, it may be appropriate to use a method for dimensionality reduction, such
as principal component analysis. In addition, EDA allows for the data to be normalized, which
is essential for comparing different variables or subsets of the data. It also allows for data
cleaning, which is crucial for removing errors or inconsistencies from the data.
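As a rough illustration of normalization and of spotting correlated variables, the sketch below rescales each column to the range 0 to 1 and flags pairs whose Pearson correlation is very high. It is a simpler stand-in for principal component analysis, not PCA itself. The column names, the data, and the 0.95 threshold are assumptions chosen for the example.

```python
import math

def min_max(column):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]   # constant column carries no spread
    return [(v - lo) / (hi - lo) for v in column]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical columns: height in cm and in inches are nearly redundant.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41.0, 23.0, 35.0, 52.0, 29.0],
}
scaled = {name: min_max(col) for name, col in data.items()}

names = list(scaled)
redundant = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(scaled[a], scaled[b])) > 0.95   # assumed threshold
]
print(redundant)
```

One variable from each flagged pair could then be dropped, or the pair could be combined, before further analysis.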
Overall, EDA provides a deeper understanding of the data and helps to identify patterns and
relationships that can guide the choice of data mining techniques and knowledge
representation methods. This can lead to more accurate and informative analysis and
modeling.
Conclusion
Data mining is the process of extracting useful information from unit-level data by applying
techniques such as classification, clustering, association rule mining, and anomaly detection.
Its goal is to identify patterns and trends that can be used to make better decisions, improve
operations, and increase the efficiency of an organization.
Knowledge representation is an important part of this process. It presents the knowledge
obtained from data mining in a form that is meaningful and understandable to the user, using
techniques such as visualizations, decision trees, rules, and ontologies.
One key challenge is the high dimensionality and complexity of the data. High dimensionality
refers to a large number of attributes or variables, which makes patterns and trends harder to
identify. Complexity refers to intricate relationships between variables, which make the data
harder to understand. Dimensionality reduction techniques, such as feature selection and
feature extraction, can be used to simplify the data.
Another challenge is missing or incomplete data, which can arise from causes such as data entry
errors or system failures. Imputation techniques, such as mean imputation, median imputation,
and multiple imputation, can be used to fill in the missing values.
In conclusion, data mining is an important process for organizations that need to manage and
analyze large volumes of unit-level data, and knowledge representation makes its results
usable. The challenges of high dimensionality, complexity, and missing data can be addressed
with dimensionality reduction and imputation techniques.
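Mean and median imputation can be sketched in a few lines. The impute helper and the sample values are hypothetical, and missing values are assumed to be stored as None. Multiple imputation is more involved and is not shown.

```python
import statistics

def impute(column, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]

incomes = [30, 32, None, 31, 200, None]   # hypothetical values with gaps
print(impute(incomes, "mean"))
print(impute(incomes, "median"))
```

With one extreme value in the column, the median gives a fill value closer to the typical observation than the mean does, which is a reason to look at the distribution before choosing a strategy.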
Summary
Data Mining is the process of extracting useful information from unit-level data by
applying various data mining techniques such as classification, clustering, association
rule mining, and anomaly detection.
Knowledge representation is an important aspect of data mining. It refers to the
process of representing the knowledge obtained from the data mining process in a
way that is meaningful and understandable to the user.
One of the key challenges of data mining is dealing with the high dimensionality
and complexity of the data. Dimensionality reduction techniques, such as feature
selection and feature extraction, can be used to simplify the data.
Another challenge of data mining is dealing with missing or incomplete data, which
can be handled by using imputation techniques such as mean imputation, median
imputation, and multiple imputation.
Data mining can help organizations make better decisions, improve operations and
increase efficiency by identifying patterns and trends in the data, and knowledge
representation allows the user to understand the knowledge obtained from the data
mining process in a meaningful and understandable way.
A. Descriptive Questions
1. What are the key steps in the data mining process and how do they relate to knowledge
representation?
2. How does data visualization play a role in exploratory data analysis and how does it inform
knowledge representation?
3. What are some common methods of knowledge representation and when is each method
appropriate?
4. How does dimensionality reduction inform knowledge representation and what are some
common techniques used for dimensionality reduction?
5. How do natural language processing techniques aid in understanding unstructured data
and how do they inform knowledge representation?
Post Unit Reading Material
Book chapters
1. "Knowledge Representation" Chapter 8, in "Data Mining: Concepts and Techniques" by
Jiawei Han, Micheline Kamber, and Jian Pei. This chapter provides a comprehensive
introduction to knowledge representation in data mining, including the different
techniques used to represent knowledge, such as decision trees, rules, and patterns.
2. "Knowledge Representation" Chapter 8, in "Introduction to Data Mining" by Pang-Ning
Tan, Michael Steinbach, and Vipin Kumar. This chapter provides an in-depth introduction
to knowledge representation in data mining, including the different techniques used to
represent knowledge, such as decision trees, rules, and patterns.
3. "Knowledge Representation" Chapter 7, in "Data Mining Techniques" by Michael Berry and
Gordon Linoff. This chapter provides a detailed overview of knowledge representation in
data mining, including the different techniques used to represent knowledge, such as
decision trees, rules, and patterns.
4. "Knowledge Representation" Chapter 6, in "Big Data: Techniques and Technologies in
Geoinformatics" by Jun Li. This chapter provides a comprehensive introduction to
knowledge representation in data mining, including the different techniques used to
represent knowledge, such as decision trees, rules, and patterns.
5. "Knowledge Representation" Chapter 5, in "Data Science from Scratch" by Joel Grus (O'Reilly Media).
This chapter provides an introduction to knowledge representation in data mining,
including the different techniques used to represent knowledge, such as decision trees,
rules, and patterns.