Unit 1 Data Mining
Descriptive:
Descriptive data mining tasks characterize the general properties of the data in the database.
Predictive:
Predictive data mining tasks perform inference on the current data in order to predict how new data will behave.
2.Clustering:
Clustering is used to identify data objects that are similar to one another.
3.Summarization:
Summarization is the generalization of data. A set of relevant data is
summarized, which results in a smaller set that gives aggregated
information about the data.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction
of useful, previously unknown, and potentially valuable information from large
datasets. The KDD process is iterative: extracting accurate knowledge usually
requires repeating the steps below several times. The following steps
are included in the KDD process:
1.Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the
collection. It involves:
1. Handling missing values.
2. Cleaning noisy data, where noise is a random error or variance in a measured attribute.
3. Cleaning with data discrepancy detection and data transformation tools.
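As a rough illustration of these cleaning steps, the following Python sketch (using pandas, with made-up column names and values) fills missing values with the column median and clips an implausible, noisy value; it shows only one of many possible cleaning strategies.

import pandas as pd
import numpy as np

# Hypothetical raw data with missing values and a noisy numeric column.
raw = pd.DataFrame({
    "age":    [25, np.nan, 31, 200, 28],        # 200 is an implausible (noisy) value
    "income": [30000, 42000, np.nan, 39000, 41000],
})

# 1. Handle missing values: fill numeric gaps with the column median.
cleaned = raw.fillna(raw.median(numeric_only=True))

# 2. Handle noisy values: clip "age" to a plausible range (a simple smoothing rule).
cleaned["age"] = cleaned["age"].clip(lower=0, upper=100)

print(cleaned)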
2.Data Integration
Data integration is defined as combining heterogeneous data from multiple sources
into a common store (a data warehouse). It is carried out using data migration tools,
data synchronization tools, and the ETL (Extract-Transform-Load) process.
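A minimal sketch of integration, assuming two hypothetical sources (a crm table and a sales table) that share a customer_id key; a pandas outer join stands in for loading the combined data into a common warehouse table.

import pandas as pd

# Two hypothetical sources describing the same customers.
crm   = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Bo", "Cy"]})
sales = pd.DataFrame({"customer_id": [1, 2, 4], "total_spent": [120.0, 75.5, 60.0]})

# The join below plays the role of combining both sources into one common store.
warehouse = crm.merge(sales, on="customer_id", how="outer")
print(warehouse)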
3.Data Selection
Data selection is defined as the process where data relevant to the analysis
is decided and retrieved from the data collection.
4.Data Transformation
Data Transformation is defined as the process of transforming data into the
form required by the mining procedure. Data transformation is a two-step
process:
1. Data Mapping
2. Code generation
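As an example of the kind of transformation a mining procedure may require, the sketch below applies min-max normalization to a hypothetical income attribute; the mapping step decides the target representation, and the code is the corresponding transformation.

import pandas as pd

df = pd.DataFrame({"income": [30000, 42000, 39000, 41000]})  # hypothetical attribute

# Min-max normalization maps values into [0, 1], a form often expected by mining procedures.
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
print(df)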
5.Data Mining
Data Mining refers to a process of extracting useful and valuable information or
patterns from large data sets.
6.Pattern Evaluation
Pattern Evaluation is defined as identifying the truly interesting patterns that
represent knowledge, based on given interestingness measures.
7.Knowledge Representation
This involves presenting the results in a way that is meaningful and can be
used to make decisions.
2. Performance Issues
o Performance-related data mining issues revolve around scalability,
efficiency, and handling large datasets.
o As data volumes continue to grow exponentially, it becomes essential
to develop algorithms and infrastructure capable of processing and
analyzing data promptly.
o Performance bottlenecks can hinder the practical application of data
mining techniques.
4. Data Mining Metrics
Data mining metrics are measures used to evaluate the
performance and effectiveness of data mining algorithms and models.
They help assess the quality of the discovered patterns and
predictions.
1.Usefulness
Usefulness covers several metrics that tell us whether the model provides
useful information. For instance, a data mining model that correlates store
location with sales can be both accurate and reliable, yet still not be useful,
because that result cannot be generalized by adding more stores at the same
location.
Positive Social Implications:
1. Personalized Services:
o Data mining enables businesses and service providers to offer
personalized recommendations, advertisements, and services based on
individual preferences and behavior.
2. Improved Healthcare:
Data mining in healthcare can lead to better patient outcomes by
identifying trends, predicting disease occurrences, and personalizing
treatment plans.
3. Enhanced Education:
Educational institutions use data mining to analyze student
performance, identify learning patterns, and tailor educational
programs to individual needs, improving the overall quality of
education.
4. Customer Satisfaction:
Businesses can enhance customer satisfaction by analyzing
customer feedback and preferences, leading to improved product and
service offerings.
5.Fraud Detection:
Data mining helps identify unusual patterns and anomalies in financial
transactions, contributing to fraud detection and prevention in areas
like banking and credit card transactions.
Negative Social Implications:
1. Privacy Concerns:
The widespread use of data mining raises privacy concerns as
individuals may feel that their personal information is being used without their
knowledge or consent.
2.Security Risks:
The collection and storage of vast amounts of data for data mining
purposes increase the risk of security breaches, potentially exposing
sensitive information to unauthorized parties.
3.Social Sorting:
Data mining can contribute to social sorting, where individuals are
categorized and treated differently based on their profiles, leading to potential
social stratification.
4. Loss of Autonomy:
Individuals may feel a loss of autonomy when decisions affecting them are
made by algorithms based on their data, particularly in scenarios like
automated hiring or credit scoring.
Scalability
To effectively extract information from a huge amount of data
in databases.
The knowledge discovery algorithms must be efficient and
scalable to large databases.
The running time of a data mining algorithm must be
predictable and acceptable in large databases.
Real world data
Noisy and missing attribute values are common.
Algorithms should be able to work even in the presence of
these problems.
Updates
o Data mining algorithms typically assume that they work with static data sets.
o This is not a realistic assumption, because real databases are frequently updated.
Ease of use
Data mining algorithms may work well, but they are of little value if they are
difficult to use or their results are hard to understand.
7.Data Mining Techniques
1. Classification:
Data are categorized to separate them into predefined groups or classes. Based on the
values of a number of attributes, this method of data mining identifies the class to which
a record belongs. The aim is to sort data into predetermined classes.
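A minimal classification sketch, assuming scikit-learn is available: a decision tree sorts iris flowers into their three predefined classes based on attribute values.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labeled data: each record belongs to one of three predefined classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))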
2.Clustering:
The next data mining technique is clustering. Similar entries inside a database are
grouped together to form clusters. In contrast to classification, which places records
into predefined categories, clustering first discovers the groups within the dataset and
then assigns records to them based on their properties.
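A minimal clustering sketch, assuming scikit-learn is available: k-means discovers two groups in unlabeled points rather than assigning them to predefined classes.

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points; clustering discovers the groups itself.
points = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2],
                   [5.1, 4.9], [5.2, 5.0], [1.1, 0.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster id assigned to each point
print(kmeans.cluster_centers_)  # center of each discovered group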
3. Regression:
The next data mining technique is regression. Regression establishes a relationship
between variables; its objective is to identify the function that best captures
that relationship.
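A minimal regression sketch, assuming scikit-learn is available and using made-up numbers: a linear model captures the relationship between a hypothetical advertising-spend variable and sales.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical relationship: advertising spend (x) vs. sales (y).
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted sales at spend = 6:", model.predict([[6.0]])[0])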
4. Association Rules:
This data mining technique helps to discover links between two or more items and finds
hidden patterns in the data set. Association rules are if-then statements that show the
probability of interactions between data items within large data sets in different types
of databases.
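The sketch below computes the support and confidence of one if-then rule over a tiny, made-up set of market-basket transactions; it illustrates the idea rather than a full rule-mining algorithm such as Apriori.

# Toy market-basket transactions (hypothetical data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent): support of both sides over support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: IF a basket contains bread THEN it also contains milk.
print("support:", support({"bread", "milk"}))          # 0.5
print("confidence:", confidence({"bread"}, {"milk"}))  # ~0.67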
5. Outlier detection:
This data mining technique looks for data items in the data set that do not match an
expected pattern or expected behavior. It may be used in various domains such as
intrusion detection and fraud detection, and is also known as outlier analysis or
outlier mining.
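A simple outlier-detection sketch on made-up transaction amounts: values whose z-score exceeds a chosen threshold are flagged as not matching the expected pattern. Real systems use more robust methods, but the idea is the same.

import numpy as np

# Daily transaction amounts; the last value does not match the expected pattern.
amounts = np.array([52.0, 48.5, 50.2, 49.9, 51.3, 950.0])

z_scores = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z_scores) > 2]   # a common (if crude) threshold
print("flagged as outliers:", outliers)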
6. Sequential Patterns:
The sequential pattern technique is specialized for evaluating sequential data to
discover sequential patterns. It involves finding interesting subsequences in a set of
sequences, where the interestingness of a subsequence can be measured by criteria
such as length and occurrence frequency.
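A small sketch of counting how often one subsequence occurs, in order but not necessarily contiguously, in a set of made-up click-stream sequences; occurrence frequency is one of the criteria mentioned above.

# Toy click-stream sequences (hypothetical data).
sequences = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "product", "cart"],
    ["search", "product", "checkout"],
]

def contains_subsequence(seq, pattern):
    """True if `pattern` occurs in `seq` in order (not necessarily contiguously)."""
    it = iter(seq)
    return all(item in it for item in pattern)

pattern = ["product", "cart"]
freq = sum(contains_subsequence(s, pattern) for s in sequences)
print(f"{pattern} occurs in {freq} of {len(sequences)} sequences")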
7. Prediction:
Prediction uses a combination of other data mining techniques, such as trend analysis,
clustering and classification. It analyzes past events or instances in the right sequence
to predict a future event.
8.Similarity measure
A similarity measure in data mining is a distance over the dimensions that describe
the objects' features: when the distance between two data points is small, there is a
high degree of similarity between the objects, and vice versa.
3.Cosine similarity
Cosine similarity is quite different from distance-based measures such as the Euclidean
and Manhattan distances. It is more concerned with the orientation of the two points in
space than with their exact distance from one another.
4.Jaccard similarity
The Jaccard similarity measures the similarity of two data items (viewed as sets) as the
size of their intersection divided by the size of their union.
5.Minkowski distance
The Minkowski distance is the generalized form of the Euclidean and Manhattan
distance measures.
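The sketch below computes the three measures just described on small made-up inputs: cosine similarity between two vectors, Jaccard similarity between two sets, and the Minkowski distance of order p (p = 1 gives Manhattan, p = 2 gives Euclidean).

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Cosine similarity: orientation of the vectors, not their distance.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Jaccard similarity on sets: |intersection| / |union|.
s, t = {1, 2, 3}, {2, 3, 4}
jaccard = len(s & t) / len(s | t)

# Minkowski distance of order p (p=1 -> Manhattan, p=2 -> Euclidean).
def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(cosine)                                # 1.0: same direction
print(jaccard)                               # 0.5
print(minkowski(a, b, 1), minkowski(a, b, 2))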
8.Decision Tree
What is a Decision Tree?
A decision tree is a flowchart-like tree structure in which each internal node denotes a
test on a feature, the branches denote the decision rules, and the leaf nodes denote the
outcome. It is a versatile supervised machine-learning algorithm used for both
classification and regression problems, and one of the most powerful and widely used algorithms.
Decision Tree Terminologies
Root Node: It is the topmost node in the tree, which represents the complete
dataset. It is the starting point of the decision-making process.
Decision/Internal Node: A node that symbolizes a choice regarding an input
feature. Branching off of internal nodes connects them to leaf nodes or other
internal nodes.
Leaf/Terminal Node: A node without any child nodes that indicates a class label
or a numerical value.
Splitting: The process of splitting a node into two or more sub-nodes using a split
criterion and a selected feature.
Branch/Sub-Tree: A subsection of the decision tree starts at an internal node and
ends at the leaf nodes.
Parent Node: The node that divides into one or more child nodes.
Child Node: The nodes that emerge when a parent node is split.
Impurity: A measurement of the target variable's homogeneity in a subset of
data. The Gini index and entropy are two commonly used impurity measurements
in decision trees for classification tasks.
Variance: Variance measures how much the predicted and the target variables
vary in different samples of a dataset. It is used for regression problems in decision
trees.
Information Gain: Information gain is a measure of the reduction in impurity
achieved by splitting a dataset on a particular feature in a decision tree (see the
sketch after this list).
Pruning: The process of removing branches from the tree that do not provide any
additional information or lead to overfitting.
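As a small worked example of impurity and information gain, the sketch below computes entropy and the gain of a (perfect) split on made-up class labels.

import numpy as np

def entropy(labels):
    """Impurity of a set of class labels (0 when all labels agree)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]   # a perfect split
print(information_gain(parent, left, right))              # 1.0 bit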
Types of Decision Trees in Data Mining
Decision tree in data mining is mainly divided into two types –
Categorical Variable Decision Tree
A categorical variable decision tree has a categorical target variable that is divided into
categories, such as Yes or No; each stage of the decision process falls into one of these
categories.
Continuous Variable Decision Tree
A continuous variable decision tree has a continuous target variable.
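A short sketch of the two types, assuming scikit-learn is available: a classification tree for a Yes/No target and a regression tree for a continuous target, trained on made-up data.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Categorical (Yes/No) target -> categorical variable decision tree.
y_cat = np.array(["No", "No", "No", "Yes", "Yes", "Yes"])
clf = DecisionTreeClassifier(max_depth=2).fit(X, y_cat)
print(clf.predict([[2.5], [5.5]]))

# Continuous target -> continuous variable decision tree (regression tree).
y_num = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.9])
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_num)
print(reg.predict([[2.5], [5.5]]))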
9.Neural Networks
What is a neural network?
Neural Networks are computational models that mimic the complex functions of the
human brain. The neural networks consist of interconnected nodes or neurons that
process and learn from data, enabling tasks such as pattern recognition and decision
making in machine learning.
Input Layer: Each feature in the input layer is represented by a node on the network,
which receives input data.
Hidden Layers: Each hidden layer neuron processes inputs by multiplying them by
weights, adding them up, and then passing them through an activation function
Output Layer: The final result is produced by repeating the same process until the
output layer is reached.
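A minimal forward-pass sketch in NumPy with randomly initialized weights: each hidden neuron multiplies its inputs by weights, sums them, and applies an activation, and repeating the step at the output layer produces the final result.

import numpy as np

rng = np.random.default_rng(0)

x  = np.array([0.5, -1.2, 3.0])   # input layer: one node per feature
W1 = rng.normal(size=(3, 4))      # weights into a hidden layer of 4 neurons
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))      # weights into a single output neuron
b2 = np.zeros(1)

def relu(z):
    return np.maximum(z, 0.0)

# Each hidden neuron: weighted sum of inputs followed by an activation function.
hidden = relu(x @ W1 + b1)
# Repeating the same step at the output layer produces the final result.
output = hidden @ W2 + b2
print(output)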
Recurrent Neural Network (RNN): An RNN is a type of artificial neural network intended
for processing sequential data.
Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to overcome
the vanishing gradient problem in training RNNs. It uses memory cells and gates to
selectively read, write, and erase information.
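A minimal sketch of the recurrence at the heart of an RNN, in NumPy with made-up weights: the hidden state at each time step is computed from the current input and the previous hidden state. (An LSTM adds gated memory cells on top of this idea.)

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sequence of 3-dimensional inputs, processed one step at a time.
sequence = rng.normal(size=(5, 3))
hidden_size = 4

W_x = rng.normal(size=(3, hidden_size))            # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
b   = np.zeros(hidden_size)

h = np.zeros(hidden_size)                  # hidden state carried across time steps
for x_t in sequence:
    h = np.tanh(x_t @ W_x + h @ W_h + b)   # new state depends on input and previous state

print("final hidden state:", h)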