Unit 1: Data Mining



1. Basic Data Mining Tasks


Data mining tasks can generally be classified into two types: descriptive
tasks and predictive tasks.

Descriptive:

Descriptive data mining tasks characterize the general properties of data.

Predictive:

Predictive data mining tasks perform inference on the available data set to
predict how a new data set will behave.

Different Data Mining Tasks

Predictive data mining tasks:


1. Classification:
 Classification derives a model to determine the class of an object
based on its attributes.
 Classification can be used in direct marketing, that is, to reduce
marketing costs by targeting a set of customers who are likely to buy a
new product; a small sketch follows this list.
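
As an illustration, here is a minimal classification sketch using scikit-learn's
DecisionTreeClassifier; the customer features (age, income) and the buy/no-buy
labels are invented for illustration:

    from sklearn.tree import DecisionTreeClassifier

    # Invented training data: [age, income]; labels: 1 = buys, 0 = does not buy
    X_train = [[25, 30000], [40, 80000], [35, 60000], [22, 20000]]
    y_train = [0, 1, 1, 0]

    model = DecisionTreeClassifier()        # derive a model from the attributes
    model.fit(X_train, y_train)             # learn the class boundaries

    # Determine the class of a new customer from its attributes
    print(model.predict([[30, 55000]]))     # e.g. [1] -> likely buyer
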
2. Prediction:
The prediction task predicts the possible values of missing or future data.
Prediction involves developing a model based on the available data, and this
model is used in predicting future values of a new data set of interest.
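
For instance, a minimal prediction sketch with NumPy, fitting a straight-line
model to invented past sales figures and using it to predict the next value:

    import numpy as np

    months = np.array([1, 2, 3, 4, 5])
    sales = np.array([100, 120, 138, 160, 181])   # invented historical values

    # Develop a model from the available data (a degree-1 fit)
    slope, intercept = np.polyfit(months, sales, 1)

    # Use the model to predict a future value (month 6)
    print(slope * 6 + intercept)
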

3. Time-Series Analysis:

A time series is a sequence of events where the next event is determined by
one or more of the preceding events.
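
As a small illustration, a moving-average sketch in NumPy, where each smoothed
value is determined by the current and preceding events (the series is invented):

    import numpy as np

    series = np.array([10, 12, 11, 15, 14, 18, 20])   # invented daily readings
    window = 3

    # Each output value averages an event with its two predecessors
    smoothed = np.convolve(series, np.ones(window) / window, mode="valid")
    print(smoothed)
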

Descriptive data mining tasks:


1. Association:
 Association discovers the association or connection among a set of
items; it identifies the relationships between objects.
 Association analysis is used for commodity management, advertising,
catalog design, direct marketing, etc.; a small worked example follows this list.
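
To make the idea concrete, a minimal sketch computing support and confidence
for one candidate rule over invented market-basket transactions:

    # Invented transactions (market baskets)
    transactions = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]

    n = len(transactions)
    both = sum(1 for t in transactions if {"bread", "milk"} <= t)
    bread = sum(1 for t in transactions if "bread" in t)

    support = both / n          # fraction of baskets with bread and milk
    confidence = both / bread   # of baskets with bread, fraction that also have milk
    print(support, confidence)  # 0.5 and about 0.67 for the rule bread -> milk
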

2. Clustering:
Clustering is used to identify data objects that are similar to one another.
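
A minimal clustering sketch using scikit-learn's KMeans on invented
two-dimensional points:

    from sklearn.cluster import KMeans

    points = [[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]]   # invented data

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)   # e.g. [0 0 1 1]: similar points share a cluster
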

3. Summarization:
Summarization is the generalization of data. A set of relevant data is
summarized, resulting in a smaller set that gives aggregated information
about the data.
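
For example, a brief summarization sketch with pandas, aggregating invented
sales rows into a smaller summary table:

    import pandas as pd

    sales = pd.DataFrame({
        "region": ["north", "north", "south", "south"],
        "amount": [100, 150, 80, 120],               # invented figures
    })

    # The smaller set carries aggregated information about the data
    summary = sales.groupby("region")["amount"].agg(["count", "sum", "mean"])
    print(summary)
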

2. Data Mining versus Knowledge Discovery in Databases

What is Data Mining?


Data mining, also known as Knowledge Discovery in Databases, refers to the nontrivial
extraction of implicit, previously unknown, and potentially useful information from data
stored in databases.
There are four major data mining tasks: clustering, classification, regression, and
association (summarization).

KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction
of useful, previously unknown, and potentially valuable information from large
datasets. KDD is iterative: extracting accurate knowledge from the data usually
requires multiple passes through the steps below. The following steps are
included in the KDD process:
1. Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the
collection.
1. Cleaning in case of missing values.
2. Cleaning noisy data, where noise is a random or variance error.
3. Cleaning with data discrepancy detection and data transformation
tools.
2. Data Integration
Data integration is defined as combining heterogeneous data from multiple
sources into a common source (a data warehouse). Data integration is
performed using data migration tools, data synchronization tools, and the
ETL (Extract, Transform, Load) process.
3. Data Selection
Data selection is defined as the process where data relevant to the analysis
is decided and retrieved from the data collection.
4. Data Transformation
Data transformation is defined as the process of transforming data into the
appropriate form required by the mining procedure. It is a two-step process:
1. Data mapping
2. Code generation
5. Data Mining
Data mining refers to the process of extracting useful and valuable information
or patterns from large data sets.
6. Pattern Evaluation
Pattern evaluation is defined as identifying the truly interesting patterns
representing knowledge, based on given interestingness measures.
7. Knowledge Representation
This involves presenting the results in a way that is meaningful and can be
used to make decisions.
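
To tie the steps together, here is a compressed sketch of the pipeline
(cleaning, selection, transformation, mining) on an invented table; the column
names and cluster count are placeholders:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Invented raw data containing a missing value
    raw = pd.DataFrame({"age": [25, 40, None, 35],
                        "income": [30, 80, 60, 55]})

    clean = raw.dropna()                                     # 1. data cleaning
    selected = clean[["age", "income"]]                      # 3. data selection
    transformed = StandardScaler().fit_transform(selected)   # 4. data transformation
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(transformed)  # 5. data mining
    print(labels)   # patterns to evaluate (step 6) and present (step 7)
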

Difference between KDD and Data Mining

Definition:
KDD refers to a process of identifying valid, novel, potentially useful, and
ultimately understandable patterns and relationships in data. Data mining
refers to a process of extracting useful and valuable information or patterns
from large data sets.

Objective:
KDD aims to find useful knowledge from data; data mining aims to extract
useful information from data.

Techniques used:
KDD uses data cleaning, data integration, data selection, data transformation,
data mining, pattern evaluation, and knowledge representation and
visualization. Data mining uses association rules, classification, clustering,
regression, decision trees, neural networks, and dimensionality reduction.

Output:
KDD produces structured information, such as rules and models, that can be
used to make decisions or predictions. Data mining produces patterns,
associations, or insights that can be used to improve decision-making or
understanding.

Focus:
KDD focuses on the discovery of useful knowledge, rather than simply finding
patterns in data. Data mining focuses on the discovery of patterns or
relationships in data.

Role of domain expertise:
Domain expertise is important in KDD, as it helps in defining the goals of the
process, choosing appropriate data, and interpreting the results. It is less
critical in data mining, as the algorithms are designed to identify patterns
without relying on prior knowledge.
3. Data Mining Issues

1. Mining Methodology Issues


 Methodology-related data mining issues encompass challenges related
to the choice and application of mining algorithms and techniques.
 Selecting the right method for a specific dataset and problem can be
daunting.

2. Performance Issues
o Performance-related data mining issues revolve around scalability,
efficiency, and handling large datasets.
o As data volumes continue to grow exponentially, it becomes essential
to develop algorithms and infrastructure capable of processing and
analyzing data promptly.
o Performance bottlenecks can hinder the practical application of data
mining techniques.

3. Diverse Data Types Issue


o The diverse data types data mining issues highlight the complexity of
dealing with heterogeneous data sources.
o Data mining often involves integrating data from various formats, such
as text, images, and structured databases.
o Each data type presents unique challenges in terms of preprocessing,
feature extraction, and modelling, requiring specialized approaches and
tools to tackle these complexities effectively.

4. Data Mining Metrics
Data mining metrics are measures used to evaluate the
performance and effectiveness of data mining algorithms and models.
They help assess the quality of the discovered patterns and
predictions.

1. Usefulness
Usefulness covers several metrics that tell us whether the model provides
useful information. For instance, a data mining model that correlates store
location with sales can be both accurate and reliable, yet still not be useful,
because you cannot generalize that result by adding more stores at the same
location.

2. Return on Investment (ROI)


 Data mining tools will find interesting patterns buried inside the data
and develop predictive models.
 These models will have several measures for denoting how well they fit
the records.
3. Access Financial Information during Data Mining
The simplest way to frame decisions in financial terms is to augment the raw
information that is mined so that it also contains financial data. Some
organizations are investing in and developing data warehouses and data
marts for this purpose.
4. Converting Data Mining Metrics into Financial Terms
A general data mining metric is the measure of "lift". Lift measures what is
achieved by using a specific model or pattern relative to a base rate in
which the model is not used. High values mean much is achieved.
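As a worked example with invented numbers, lift compares the model's hit
rate to the base rate:

    # Invented figures for illustration
    base_rate = 0.05     # 5% of all customers respond to a mailing
    model_rate = 0.20    # 20% of model-selected customers respond

    lift = model_rate / base_rate
    print(lift)          # 4.0 -> the model does 4x better than the base rate
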
5. Accuracy
Accuracy is a measure of how well the model correlates an outcome with the
attributes in the data that has been provided. There are several measures of
accuracy, but all of them depend on the information that is used.
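A minimal sketch, assuming invented true and predicted labels, computing
accuracy as the fraction of correct predictions:

    y_true = [1, 0, 1, 1, 0, 1]   # invented actual classes
    y_pred = [1, 0, 0, 1, 0, 1]   # invented model output

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    print(accuracy)               # 5 of 6 correct -> about 0.83
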
5. Social Implications of Data Mining
Data mining has numerous social implications, both positive and potentially
negative.
Positive Social Implications:

1. Personalized Services:
o Data mining enables businesses and service providers to offer
personalized recommendations, advertisements, and services based on
individual preferences and behavior.
2. Improved Healthcare:
Data mining in healthcare can lead to better patient outcomes by
identifying trends, predicting disease occurrences, and personalizing
treatment plans.
3. Enhanced Education:
Educational institutions use data mining to analyze student
performance, identify learning patterns, and tailor educational
programs to individual needs, improving the overall quality of
education.
4. Customer Satisfaction:
Businesses can enhance customer satisfaction by analyzing
customer feedback and preferences, leading to improved product and
service offerings.
5. Fraud Detection:
Data mining helps identify unusual patterns and anomalies in financial
transactions, contributing to fraud detection and prevention in areas
like banking and credit card transactions.
Negative Social Implications:
1. Privacy Concerns:
The widespread use of data mining raises privacy concerns as
individuals may feel that their personal information is being used without their
knowledge or consent.
2. Security Risks:
The collection and storage of vast amounts of data for data mining
purposes increase the risk of security breaches, potentially exposing
sensitive information to unauthorized parties.
3. Social Sorting:
Data mining can contribute to social sorting, where individuals are
categorized and treated differently based on their profiles, leading to potential
social stratification.
4. Loss of Autonomy:
Individuals may feel a loss of autonomy when decisions affecting them are
made by algorithms based on their data, particularly in scenarios like
automated hiring or credit scoring.

6. Data Mining from a Database Perspective

1. Scalability
2. Real-world data
3. Updates
4. Ease of use

Scalability
 To effectively extract information from a huge amount of data
in databases.
 The knowledge discovery algorithms must be efficient and
scalable to large databases.
 The running time of a data mining algorithm must be
predictable and acceptable in large databases.
Real-world data
 Real data contains noisy and missing attribute values.
 Algorithms should be able to work even in the presence of
these problems.
Updates
o Data mining algorithms typically assume that they work with
static data sets.
o This is not a realistic assumption, because databases are
frequently updated.
Ease of use
Even if a data mining algorithm works well, it may not be used if it is
difficult to use or its results are hard to understand.
7. Data Mining Techniques

1. Classification:
Data are categorized to separate them into predefined groups or classes. Based
on the values of a number of attributes, this method identifies the class to
which a data item belongs. The aim is to sort data into predetermined classes.

2. Clustering:
The clustering approach groups similar entries inside a database together to
form clusters. In contrast to classification, which places items into
established categories, clustering first identifies these groups within the
dataset and afterward classifies items based on their properties.

3. Regression:
Regression establishes a link between variables. Its objective is to identify
the function that best captures the relationship between them.
4. Association Rules:
This technique helps to discover links between two or more items and finds
hidden patterns in the data set. Association rules are if-then statements that
help show the probability of interactions between data items within large data
sets in different types of databases.

5. Outlier detection:
This technique observes data items in the data set that do not match an
expected pattern or expected behavior. It can be used in various domains such
as intrusion detection and fraud detection. It is also known as outlier
analysis or outlier mining.

6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating
sequential data to discover sequential patterns. It involves finding
interesting subsequences in a set of sequences, where the value of a sequence
can be measured in terms of criteria such as length and occurrence frequency.

7. Prediction:
Prediction uses a combination of other data mining techniques, such as trend
analysis, clustering, and classification. It analyzes past events or instances
in the right sequence to predict a future event.

8. Similarity Measures
A similarity measure in data mining is a distance over the dimensions that
describe object features. If the distance between two data points is small,
there is a high degree of similarity between the objects, and vice versa.

What are the types of similarity?


1. Euclidean Distance
Euclidean distance is widely used as the traditional metric for problems with
geometry. You can think of it as the ordinary straight-line distance between
two points.

2. Manhattan Distance (City Block Distance)

The Manhattan distance, or city block distance, sums the absolute differences
between the pairs of coordinates. It might be surprising, but a simple way of
measuring the distance between two points is to go horizontally and then
vertically until you get from one point to the other, instead of going in a
straight line.

3. Cosine similarity
Cosine similarity is quite different from the previous two. This similarity measure is more
concerned with the orientation of the two points in space than it is with their exact
distance from one another.

4. Jaccard similarity
The Jaccard similarity measures the similarity of two data sets as the size of
their intersection divided by the size of their union.

5. Minkowski distance
The Minkowski distance is the generalized form of the Euclidean and Manhattan
distance measures.
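
A minimal NumPy sketch computing the five measures above on invented points
and sets:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])   # invented points
    b = np.array([4.0, 0.0, 3.0])

    euclidean = np.sqrt(np.sum((a - b) ** 2))    # straight-line distance
    manhattan = np.sum(np.abs(a - b))            # horizontal + vertical moves
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # orientation
    p = 3                           # Minkowski order; p=1 Manhattan, p=2 Euclidean
    minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)

    s, t = {"a", "b", "c"}, {"b", "c", "d"}      # invented sets
    jaccard = len(s & t) / len(s | t)            # intersection over union

    print(euclidean, manhattan, cosine, minkowski, jaccard)
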

9. Decision Tree
What is a Decision Tree?
A decision tree is a flowchart-like tree structure where each internal node
denotes a feature, branches denote decision rules, and leaf nodes denote the
result of the algorithm. It is a versatile supervised machine-learning
algorithm, used for both classification and regression problems, and it is a
very powerful algorithm.
Decision Tree Terminologies
 Root Node: It is the topmost node in the tree, which represents the complete
dataset. It is the starting point of the decision-making process.
 Decision/Internal Node: A node that symbolizes a choice regarding an input
feature. Branching off of internal nodes connects them to leaf nodes or other
internal nodes.
 Leaf/Terminal Node: A node without any child nodes that indicates a class label
or a numerical value.
 Splitting: The process of splitting a node into two or more sub-nodes using a split
criterion and a selected feature.
 Branch/Sub-Tree: A subsection of the decision tree starts at an internal node and
ends at the leaf nodes.
 Parent Node: The node that divides into one or more child nodes.
 Child Node: The nodes that emerge when a parent node is split.
 Impurity: A measurement of the target variable's homogeneity in a subset of
data. The Gini index and entropy are two commonly used impurity measurements
in decision trees for classification tasks.
 Variance: Variance measures how much the predicted and the target variables
vary in different samples of a dataset. It is used for regression problems in decision
trees.
 Information Gain: Information gain is a measure of the reduction in impurity
achieved by splitting a dataset on a particular feature in a decision tree.
 Pruning: The process of removing branches from the tree that do not provide any
additional information or lead to overfitting.
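
To make impurity and information gain concrete, a small sketch with invented
labels, computing the Gini index and the gain from a candidate split:

    from collections import Counter

    def gini(labels):
        # Gini impurity: 1 minus the sum of squared class proportions
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    parent = ["yes", "yes", "yes", "no", "no", "no"]          # invented labels
    left, right = ["yes", "yes", "yes"], ["no", "no", "no"]   # a candidate split

    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
    gain = gini(parent) - weighted
    print(gain)   # 0.5 - 0.0 = 0.5: this split removes all impurity
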
Types of Decision Trees in Data Mining
Decision tree in data mining is mainly divided into two types –
Categorical Variable Decision Tree
A categorical variable decision tree has a categorical target variable that is
divided into categories, such as Yes or No; each stage of the decision process
falls into one of these categories.
Continuous Variable Decision Tree
A continuous variable decision tree has a continuous target variable.

Advantages of the Decision Tree:


1. It is simple to understand, as it follows the same process that a human
follows while making a decision in real life.
2. It can be very useful for solving decision-related problems.
3. It helps to think about all the possible outcomes for a problem.
4. There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree:
1. The decision tree contains lots of layers, which makes it complex.
2. It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
3. For more class labels, the computational complexity of the decision tree may
increase.

10. Neural Networks
What is a neural network?
Neural Networks are computational models that mimic the complex functions of the
human brain. The neural networks consist of interconnected nodes or neurons that
process and learn from data, enabling tasks such as pattern recognition and decision
making in machine learning.
 Input Layer: Each feature in the input layer is represented by a node on the network,
which receives input data.

 Hidden Layers: Each hidden layer neuron processes inputs by multiplying them by
weights, adding them up, and then passing them through an activation function

 Output: The final result is produced by repeating the process until the output layer is
reached.
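A minimal NumPy sketch of one forward pass through a single hidden layer,
following the description above; the weights and inputs are invented, and
sigmoid is assumed as the activation function:

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    x = np.array([0.5, -1.0, 2.0])            # input layer: one node per feature

    W_hidden = np.array([[0.1, 0.4, -0.2],
                         [0.7, -0.3, 0.5]])   # invented weights: 2 hidden neurons
    b_hidden = np.array([0.1, -0.1])

    # Multiply inputs by weights, add them up, pass through the activation
    hidden = sigmoid(W_hidden @ x + b_hidden)

    W_out = np.array([0.6, -0.9])             # invented output weights
    output = sigmoid(W_out @ hidden)          # repeat until the output layer
    print(output)
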

Types of Neural Networks

Commonly used types of neural networks include the following.

 Feedforward Networks: A feedforward neural network is a simple artificial
neural network architecture in which data moves from input to output in a
single direction.
 Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with three or
more layers, including an input layer, one or more hidden layers, and an output layer. It uses
nonlinear activation functions.

 Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a
specialized artificial neural network designed for image processing. CNNs have
revolutionized computer vision and are pivotal in tasks like object detection
and image analysis.

 Recurrent Neural Network (RNN): An artificial neural network type intended for sequential
data processing is called a Recurrent Neural Network (RNN).
 Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to overcome
the vanishing gradient problem in training RNNs. It uses memory cells and gates to
selectively read, write, and erase information.

ADVANTAGES OF NEURAL NETWORK

 The ability to learn by themselves
 The ability to work with insufficient data
 The ability of parallel processing

DISADVANTAGES OF NEURAL NETWORK

 The black box nature and uncertain prediction rates
 Long training processes and limited data efficiency
 Economically and computationally expensive
11. Genetic Algorithms
A Genetic Algorithm (GA) is a search-based optimization technique based on
the principles of genetics and natural selection. It is frequently used to
find optimal or near-optimal solutions to difficult optimization problems
that would otherwise take a lifetime to solve.
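
A minimal sketch of the GA loop (selection, crossover, mutation), maximizing
an invented fitness function over bit strings:

    import random

    def fitness(bits):                 # invented objective: number of 1-bits
        return sum(bits)

    def crossover(a, b):               # single-point crossover of two parents
        point = random.randrange(1, len(a))
        return a[:point] + b[point:]

    def mutate(bits, rate=0.05):       # flip each bit with a small probability
        return [1 - b if random.random() < rate else b for b in bits]

    random.seed(0)
    population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]

    for _ in range(50):                            # generations
        population.sort(key=fitness, reverse=True)
        parents = population[:10]                  # selection: keep the fittest
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(20)]
        population = parents + children            # next generation

    print(max(fitness(ind) for ind in population))  # best (near-optimal) score
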

Advantages of Genetic Algorithm


o Genetic algorithms have strong parallel capabilities.
o They help in optimizing various problems, such as discrete functions,
multi-objective problems, and continuous functions.
o They provide solutions that improve over time.
o A genetic algorithm does not need derivative information.
Disadvantages of Genetic Algorithm
1. Genetic algorithms can be computationally expensive, especially for complex
problems or large solution spaces.
2. There is no guarantee of finding the optimal solution, as genetic algorithms
provide approximate solutions.
3. The performance of genetic algorithms can be sensitive to the choice of
parameters, such as population size and crossover and mutation rates.
4. Genetic algorithms may struggle with high-dimensional problems or problems
with a large number of constraints.

Limitations of Genetic Algorithms


o Genetic algorithms are not efficient for solving simple problems.
o They do not guarantee the quality of the final solution to a problem.
o Repetitive calculation of fitness values may create computational
challenges.
