DWM Assignment
Assignment-1
Q. 1 What is data transformation? Why is it essential in the KDD process? Give an example. (CO1,BT 2)
Data transformation is a crucial step in the Knowledge Discovery in Databases (KDD) process. It
involves converting raw data into a clean and usable format suitable for further analysis and modeling.
Here's a breakdown of its importance and how it contributes to KDD:
• Improved Data Quality: Raw data often contains errors, inconsistencies, and missing values. Data
transformation techniques like cleaning, normalization, and imputation address these issues to
ensure the quality and reliability of the data used for analysis.
• Enhanced Feature Engineering: Many data mining algorithms work better with specific data
formats. Transformation allows you to create new features, combine existing ones, or scale features
for optimal performance during the mining process.
• Facilitates Data Integration: KDD often involves data from multiple sources. Transformation helps
standardize formats and structures, enabling seamless integration of diverse data sets for
comprehensive analysis.
• Improved Efficiency: Clean and well-structured data allows data mining algorithms to run faster
and more efficiently. This saves time and computational resources during the KDD process.
Imagine you're analyzing customer purchase data for a retail store. Here's how data transformation might
be applied:
1. Data Cleaning: You identify missing values in customer addresses or inconsistent date formats.
Data cleaning techniques like filling in missing values or standardizing date formats ensure data
accuracy.
2. Normalization: You find that some customer names have variations like "John" and "Johnny."
Normalization techniques like converting all names to a single format (e.g., uppercase) improve data
consistency.
3. Feature Engineering: You create a new feature "Total Spent" by summing up the purchase
amounts for each customer. This provides a more insightful metric for customer analysis.
4. Data Integration: You combine purchase data with customer demographic data like age and
location. This integration allows you to analyze buying patterns based on demographics.
By transforming the raw customer data, you prepare it for effective analysis, enabling you to uncover
valuable insights about customer behavior, purchasing trends, and potential targeted marketing
campaigns.
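As an illustration, here is a minimal sketch of these four steps using Python and pandas; the column names and values are hypothetical and not part of the original example:

import pandas as pd

# Hypothetical raw purchase records with missing and inconsistent fields
purchases = pd.DataFrame({
    "customer": ["John", "johnny", "Mary", None],
    "purchase_date": ["2023-01-05", "05 Jan 2023", "2023-02-10", "2023-02-11"],
    "amount": [120.0, 80.0, None, 45.0],
})

# 1. Data cleaning: fill missing amounts and standardize date formats
purchases["amount"] = purchases["amount"].fillna(purchases["amount"].median())
purchases["purchase_date"] = purchases["purchase_date"].apply(pd.to_datetime)  # parse each date individually

# 2. Normalization: bring customer names into a single canonical format
purchases["customer"] = purchases["customer"].fillna("UNKNOWN").str.upper()

# 3. Feature engineering: derive a "Total Spent" value per customer
total_spent = (purchases.groupby("customer", as_index=False)["amount"].sum()
                        .rename(columns={"amount": "total_spent"}))

# 4. Data integration: merge with demographic data from another source
demographics = pd.DataFrame({"customer": ["JOHN", "MARY"], "age": [34, 28]})
customer_view = demographics.merge(total_spent, on="customer", how="left")
print(customer_view)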
Q. 2 Explain with reference to Data Warehouse: “Data inconsistencies are removed; data from diverse operational
applications is integrated”. (CO1,BT 2)
Assignment-2
Q.1 . Explain the following in OLAP
a) Roll up operation
b) Drill Down operation
c) Slice operation
d) Dice operation
e) Pivot operation (CO2,BT 2)
Q.2 How is a database design represented in OLTP systems and OLAP systems? (CO2,BT 2)
Assignment-3
Q.1 Describe challenges to data mining regarding data mining methodology and user
interaction issues.(CO3,BT 2)
Q.2 If A and B are two fuzzy sets with membership functions μA(x) = {0.2, 0.5, 0.6, 0.1, 0.9} and μB(x) = {0.1, 0.5, 0.2, 0.7, 0.8}, what will be the value of μA∩B(x)? (CO3,BT 3)
Assignment-4
Q.1 . (CO4,BT 3)
Assignment-5
Q.1 What is the goal of clustering? How does partitioning around medoids algorithm achieve this? (CO5,BT 2)
The goal of clustering is to group a set of data points into clusters based on their similarity, so that points within a cluster are similar to each other while being different from the points in other clusters.
The Partitioning Around Medoids (PAM) algorithm is a clustering algorithm that aims to minimize the sum of
dissimilarities between each point and its medoid, where the medoid is the point within the cluster that has the lowest
average distance to all other points in the cluster.
The PAM algorithm works as follows:
1. Choose k initial medoids at random from the dataset.
2. Assign each data point to the nearest medoid to form k clusters.
3. For each cluster, try to find a new medoid that minimizes the sum of dissimilarities between each point and the
medoid.
4. Repeat steps 2 and 3 until the medoids no longer change.
The PAM algorithm can be seen as a variation of the k-means algorithm, but with the key difference that it uses actual
data points as medoids, rather than the means of the points in each cluster.
The PAM algorithm is effective when the data is not well-suited to a spherical shape, as it can handle non-spherical clusters, and it is less sensitive to outliers and noise than K-Means because it uses actual data points as cluster centers. However, PAM can be computationally expensive, especially for large datasets and large values of k, as it requires distances to be computed for all pairs of points.
In summary, the goal of clustering is to group similar data points together, and the PAM algorithm achieves this by
iteratively assigning data points to clusters and optimizing the clustering solution through the selection of
representative medoids.
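As an illustration, here is a minimal sketch of the PAM loop described above in plain Python with numpy; the toy points and parameter values are made up and this is not a production implementation:

import numpy as np

def pam(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    # Pairwise distance matrix; computing this for all pairs is what makes PAM costly on large data
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    medoids = rng.choice(n, size=k, replace=False)          # Step 1: random initial medoids
    for _ in range(max_iter):
        labels = np.argmin(dist[:, medoids], axis=1)        # Step 2: assign each point to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):                                  # Step 3: best medoid inside each cluster
            members = np.where(labels == c)[0]
            costs = dist[np.ix_(members, members)].sum(axis=0)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break                                           # Step 4: stop when medoids no longer change
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
medoids, labels = pam(data, k=2)
print("medoid indices:", medoids, "cluster labels:", labels)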
Q.2 What are the factors that affect the performance of the Apriori candidate generation technique? (CO5,BT 2)
The performance of the Apriori candidate generation technique can be affected by several factors, including:
Number of Transactions: As the number of transactions in the dataset increases, every pass that counts candidate supports has to scan more data, and Apriori performs one such scan for each candidate-itemset size. Generating and evaluating candidates therefore becomes increasingly expensive on large transaction databases.
Number of Unique Items: The number of unique items or attributes in the dataset affects the size of the itemset search
space. A larger number of unique items leads to a larger number of potential candidate itemsets, which can slow down
the candidate generation process.
Minimum Support Threshold: Setting a lower minimum support threshold results in more frequent itemsets being
generated and considered during the candidate generation phase. This increases the computational overhead as more
candidate itemsets need to be evaluated.
Size of Frequent Itemsets: The size of frequent itemsets being considered also impacts performance. Generating larger
itemsets requires more computational resources and may result in a larger number of candidate itemsets to be
generated and evaluated.
Limited Pruning by the Apriori Property: The Apriori property states that if an itemset is infrequent, all its supersets must also be infrequent, and this is used to prune candidates. When the data contains many long or highly correlated frequent itemsets, however, pruning eliminates relatively few candidates, so a very large number of candidates still has to be generated and counted, which degrades performance.
Memory Constraints: The Apriori algorithm typically requires storing large data structures, such as candidate itemsets
and their corresponding support counts, in memory. Memory constraints can limit the size of datasets that can be
processed efficiently.
Implementation Efficiency: The efficiency of the implementation of the Apriori algorithm can significantly impact
performance. Optimizations such as efficient data structures, pruning techniques, and parallelization can help improve
performance.
Data Distribution: In a distributed setting, the distribution of data across multiple nodes can impact the performance of
candidate generation. Uneven data distribution or skewed datasets may result in some nodes processing significantly
more data than others, leading to load imbalance and reduced performance.
Addressing these factors through optimization techniques such as pruning strategies, parallelization, efficient data
structures, and algorithmic improvements can help mitigate the performance issues associated with the Apriori
candidate generation technique.
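As a rough illustration of where this cost comes from, here is a minimal sketch of the Apriori join-and-prune candidate generation step in plain Python; the item names are hypothetical:

from itertools import combinations

def generate_candidates(frequent_k, k):
    """Join frequent (k-1)-itemsets that share their first k-2 items, then prune
    any candidate that has an infrequent (k-1)-subset (the Apriori property)."""
    frequent_set = set(frequent_k)
    candidates = set()
    items = sorted(frequent_k)
    for a, b in combinations(items, 2):
        if a[:k - 2] == b[:k - 2]:                            # join step
            candidate = tuple(sorted(set(a) | set(b)))
            if len(candidate) == k and all(
                sub in frequent_set for sub in combinations(candidate, k - 1)
            ):                                                # prune step
                candidates.add(candidate)
    return candidates

# Frequent 2-itemsets (toy example); generate candidate 3-itemsets from them
frequent_2 = [("bread", "butter"), ("bread", "milk"), ("butter", "milk"), ("milk", "tea")]
print(generate_candidates(frequent_2, k=3))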
Clustering Algorithms:
Here's a breakdown of hierarchical and partitioned algorithms, two fundamental approaches used in data
clustering:
Hierarchical Clustering:
• Concept: Hierarchical clustering algorithms organize data points into a hierarchy of clusters,
represented as a tree-like structure called a dendrogram. This hierarchy can be built in two ways:
o Agglomerative (Bottom-Up): This approach starts by considering each data point as a
separate cluster. In each step, the two most similar clusters are merged based on a chosen
distance metric (e.g., Euclidean distance) until a single cluster remains.
o Divisive (Top-Down): This approach starts with all data points in a single cluster. In each
step, a cluster is recursively split into two sub-clusters based on a chosen criterion, until a
desired number of clusters is reached.
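As an illustration, the following minimal sketch runs agglomerative (bottom-up) clustering with SciPy, assuming scipy and numpy are available; the toy points are made up:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 5.1], [9.0, 0.5]])

# Agglomerative merging using Euclidean distance and average linkage
Z = linkage(points, method="average", metric="euclidean")

# Cut the dendrogram to obtain a chosen number of clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # cluster label for each point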
Applications of Hierarchical Clustering:
• Exploratory Data Analysis: Hierarchical clustering can be used to visually explore the data and
identify potential cluster structures that might not be readily apparent with other methods.
• Document Clustering: Grouping documents based on content similarity, useful for information
retrieval or topic modeling.
• Customer Segmentation: Identifying distinct customer groups based on their characteristics for
targeted marketing campaigns.
Advantages:
• No Need to Predefine the Number of Clusters: The dendrogram can be cut at different levels to obtain any desired number of clusters.
• Informative Visualization: The dendrogram provides an interpretable picture of how clusters merge or split.
Disadvantages:
• Computationally Expensive: Building the full hierarchy does not scale well to very large datasets.
• Decisions Cannot Be Undone: Once clusters are merged or split, the algorithm cannot revisit that decision, so it is sensitive to noise and outliers.
Partitioned Clustering:
• Concept: Partitioned clustering algorithms divide the data points into a fixed, predefined number of
clusters. They typically employ an iterative approach to optimize a specific objective function, such
as minimizing the within-cluster distance (variance) or maximizing the between-cluster distance.
• K-Means Algorithm: A popular example of a partitioned clustering algorithm. It starts from randomly chosen initial cluster centers (centroids) and iteratively refines the cluster assignments:
o Calculate cluster centers (centroids): The mean (average) of the data points within each
cluster is calculated, representing the new centroid.
o Reassign data points: Each data point is reassigned to the cluster with the closest centroid
based on a distance metric.
o Repeat: Steps 1 and 2 are repeated until a stopping criterion (e.g., no significant changes in
cluster assignments) is met.
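A compact sketch of these K-Means steps in plain Python with numpy follows; the data is synthetic and the code is illustrative rather than tuned (for example, it does not handle empty clusters):

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]   # initial centroids
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Reassign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centroids, centroids):    # stop when centroids no longer move
            break
        centroids = new_centroids
    return centroids, labels

data = np.random.default_rng(1).normal(size=(60, 2))
centroids, labels = kmeans(data, k=3)
print(centroids)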
• Image Segmentation: Grouping pixels in an image based on color or other features to identify
objects or regions.
• Customer Segmentation (Similar to Hierarchical): Similar to hierarchical clustering, segmenting
customers into distinct groups based on their characteristics.
• Gene Expression Analysis: Grouping genes with similar expression patterns to understand
biological processes.
Advantages:
• Efficient for Large Datasets: Partitioned clustering algorithms, especially K-Means, are generally
faster and more efficient than hierarchical clustering for large datasets.
• Simple Interpretation: The resulting clusters are easy to understand as they represent distinct
partitions of the data.
Disadvantages:
• Predefined Number of Clusters: You need to specify the desired number of clusters beforehand,
which might not be readily apparent in the data.
• Sensitive to Initial Centroids: The initial placement of cluster centroids can significantly impact the
final clustering results.
• Assumes Spherical Clusters: Partitioned clustering algorithms like K-Means typically work well for
spherical or circular clusters but might struggle with data containing clusters of irregular shapes.
In data mining, similarity measures quantify the degree of similarity or dissimilarity between objects, data
points, or patterns in a dataset. The choice of similarity measures depends on several factors, including the
nature of the data, the specific task or application, and the characteristics of the dataset. Here are some
common measures of similarity in data mining and considerations for choosing them:
1. Euclidean Distance:
• Definition: Euclidean distance measures the straight-line distance between two points in a
multidimensional space.
• Formula: $\text{Euclidean Distance} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Usage: Suitable for numeric data or continuous variables. It is commonly used in clustering
algorithms such as k-means and hierarchical clustering.
2. Cosine Similarity:
• Definition: Cosine similarity measures the cosine of the angle between two vectors, representing
their orientation in a multidimensional space.
• Formula: $\text{Cosine Similarity} = \dfrac{\sum_{i=1}^{n} x_i\, y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\;\sqrt{\sum_{i=1}^{n} y_i^2}}$
• Usage: Suitable for text data or high-dimensional sparse data, commonly used in information
retrieval, document similarity, and recommendation systems.
3. Jaccard Similarity:
• Definition: Jaccard similarity measures the intersection over the union of two sets, representing the
proportion of common elements between them.
• Formula: $\text{Jaccard Similarity} = \dfrac{|A \cap B|}{|A \cup B|}$
• Usage: Suitable for binary or categorical data, commonly used in collaborative filtering,
recommendation systems, and text mining.
4. Hamming Distance:
• Definition: Hamming distance measures the number of positions at which corresponding symbols
differ between two strings of equal length.
• Formula: $\text{Hamming Distance} = \sum_{i=1}^{n} \delta(x_i, y_i)$, where $\delta(x_i, y_i) = 1$ if $x_i \neq y_i$ and $0$ otherwise
• Usage: Suitable for binary data or categorical variables with fixed-length representations, commonly
used in DNA sequence analysis, error detection, and text classification.
5. Pearson Correlation Coefficient:
• Definition: Pearson correlation coefficient measures the linear correlation between two variables,
indicating the strength and direction of their relationship.
• Formula: $r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
• Usage: Suitable for numeric data, particularly when assessing the linear relationship between
variables, commonly used in regression analysis and feature selection.
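The following short sketch computes the measures above on toy inputs using numpy; the vectors, sets, and strings are made up for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.0, 1.0, 5.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))                       # straight-line distance
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)) # orientation similarity

A, B = {"milk", "bread", "tea"}, {"milk", "tea", "butter"}
jaccard = len(A & B) / len(A | B)                               # overlap of two sets

s1, s2 = "10110", "10011"
hamming = sum(c1 != c2 for c1, c2 in zip(s1, s2))               # differing positions

pearson = np.corrcoef(x, y)[0, 1]                               # linear correlation

print(euclidean, cosine, jaccard, hamming, pearson)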
Considerations for choosing a similarity measure:
1. Data Type: Choose a similarity measure that is appropriate for the data type of your variables (e.g.,
numeric, categorical, text).
2. Domain Knowledge: Consider the characteristics of your dataset and the underlying domain to
select a measure that aligns with the semantics of the data.
3. Task Requirements: Choose a similarity measure that best suits the specific task or application, such
as clustering, classification, or recommendation.
4. Computational Complexity: Consider the computational efficiency and scalability of the similarity
measure, particularly for large datasets or high-dimensional data.
5. Normalization: Normalize the data if necessary to ensure that the chosen similarity measure
provides meaningful results across different scales or units.
Data distributions
Data distributions refer to the patterns or shapes formed by the values of a dataset when plotted on a graph or analyzed
statistically. Understanding data distributions is crucial in data analysis and data mining as it provides insights into the
central tendency, variability, and characteristics of the data. Here are some common data distributions:
1. Normal Distribution:
• Definition: The normal distribution is symmetric and bell-shaped, with the mean, median, and mode all coinciding
at the center. It is characterized by a constant standard deviation and is often described by its mean and standard
deviation parameters.
• Examples: Many natural phenomena follow a normal distribution, such as heights, weights, and test scores in a
large population.
2. Uniform Distribution:
• Definition: The uniform distribution is characterized by a constant probability for all values within a specified
range. It is flat and rectangular-shaped, with equal likelihood of any value occurring.
• Examples: Rolling a fair six-sided die, where each face has an equal chance of landing face-up.
3. Exponential Distribution:
• Definition: The exponential distribution describes the time between events in a Poisson process, where events
occur continuously and independently at a constant average rate. It is characterized by a rapidly decreasing
probability density function.
• Examples: Inter-arrival times between customers at a service counter, time until the next radioactive decay event.
4. Poisson Distribution:
• Definition: The Poisson distribution models the number of events occurring within a fixed interval of time or
space, assuming events occur independently at a constant average rate. It is characterized by discrete values and a
right-skewed shape.
• Examples: Number of phone calls received at a call center within a one-hour period, number of accidents at an
intersection in a day.
5. Binomial Distribution:
• Definition: The binomial distribution models the number of successes in a fixed number of independent Bernoulli
trials, where each trial has two possible outcomes (success or failure) with a constant probability of success.
• Examples: Number of heads obtained when flipping a coin multiple times, number of defective items in a
production batch.
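As a quick illustration, the following sketch draws samples from each of these distributions with numpy and prints their empirical mean and standard deviation; the parameter values are arbitrary:

import numpy as np

rng = np.random.default_rng(42)

normal      = rng.normal(loc=170, scale=10, size=10_000)    # e.g., heights
uniform     = rng.integers(1, 7, size=10_000)               # fair six-sided die
exponential = rng.exponential(scale=2.0, size=10_000)       # inter-arrival times
poisson     = rng.poisson(lam=4, size=10_000)               # calls per hour
binomial    = rng.binomial(n=10, p=0.5, size=10_000)        # heads in 10 coin flips

for name, sample in [("normal", normal), ("uniform", uniform),
                     ("exponential", exponential), ("poisson", poisson),
                     ("binomial", binomial)]:
    print(f"{name:12s} mean={sample.mean():6.2f} std={sample.std():6.2f}")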
1. Descriptive Data Mining: This focuses on summarizing and understanding the characteristics of
your data. It helps you get a foundational understanding of what's present within the data. Here are
some common descriptive data mining tasks:
o Data summarization: Using summary statistics (mean, median, mode) and data
visualizations (histograms, boxplots) to describe central tendency, spread, and shape of the
data.
o Frequency analysis: Identifying how often certain values or combinations of values appear
in the data. This can reveal patterns and trends.
o Correlation analysis: Measuring the strength and direction of the relationship between two
variables. Helps understand how changes in one variable might influence the other.
2. Predictive Data Mining: This goes beyond description and aims to predict future trends or
outcomes based on historical data. These predictions can be used to make informed business
decisions. Here are some common predictive data mining tasks:
o Classification: Developing models that can predict the category or class to which a new data
point belongs. Examples include predicting customer churn (active vs. inactive) or spam
email detection.
o Regression: Creating models to predict a continuous numerical value based on one or more
independent variables. Examples include forecasting sales figures or predicting stock prices.
o Clustering: Grouping data points together based on their similarity. Helps identify hidden
patterns and segment customers or products into distinct categories.
o Association rule learning: Discovering relationships or associations between different items
or events within the data. Examples include identifying product recommendations based on
past purchases or finding frequent itemsets in market basket analysis.
The choice between descriptive and predictive data mining depends on your specific goals.
• If you're exploring a new dataset and want to understand its basic characteristics, descriptive
techniques are a good starting point.
• If you want to make predictions about future events or trends, then predictive data mining becomes
crucial.
• Data Summarization: Describe the central tendency, spread, and shape of the data, using measures such as the mean, median, and mode and plots such as histograms and boxplots.
• Correlation Analysis: Measure the relationship between variables, for example understanding how advertising spending affects sales.
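For instance, a minimal pandas sketch of data summarization and correlation analysis might look like the following; the advertising and sales figures are made up:

import pandas as pd

df = pd.DataFrame({
    "advertising_spend": [10, 15, 20, 25, 30, 35],
    "sales":             [100, 130, 150, 180, 210, 260],
})

# Data summarization: central tendency, spread, and shape
print(df.describe())                                   # mean, std, quartiles, min/max
print(df["sales"].median(), df["sales"].mode().iloc[0])

# Correlation analysis: how advertising spend relates to sales
print(df["advertising_spend"].corr(df["sales"]))       # Pearson correlation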
By understanding these basic data mining tasks and their purposes, you can begin to leverage the power
of data mining to extract valuable knowledge and insights from your datasets.
Statistical-Based Algorithms
Statistical-based algorithms are another essential category of algorithms used in data mining. They
leverage statistical techniques to extract knowledge and insights from data. These algorithms play a crucial
role in various tasks, including:
• Classification: Predicting the category or class to which a new data point belongs.
• Regression: Predicting a continuous numerical value based on one or more independent variables.
• Clustering: Grouping data points together based on their similarity.
• Anomaly Detection: Identifying data points that deviate significantly from the norm.
1. Classification Algorithms:
• Logistic Regression: A widely used algorithm for binary classification (two possible classes) that
models the relationship between features and the probability of belonging to a specific class.
• Naive Bayes: A probabilistic classifier that relies on Bayes' theorem to calculate the probability of a
data point belonging to a particular class based on its features.
• Decision Trees: Classify data points by following a tree-like structure where decisions are made
based on the values of features at each node.
2. Regression Algorithms:
• Linear Regression: Models the relationship between a dependent variable (to be predicted) and
one or more independent variables using a linear equation.
• Decision Trees (can also be used for regression): Can be adapted to predict continuous values
by using different splitting criteria at each node in the tree.
• Support Vector Regression (SVR): Adapts the support vector machine framework to regression by fitting a function that stays within a specified error margin (epsilon) of as many training points as possible.
3. Clustering Algorithms:
• K-Means Clustering: A popular algorithm that groups data points into a predefined number of
clusters (k) based on minimizing the within-cluster distance.
• Hierarchical Clustering: Builds a hierarchy of clusters by iteratively merging or splitting clusters
based on a distance measure.
• Expectation-Maximization (EM) algorithm: Useful for clustering data with missing values or
belonging to multiple clusters.
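A brief sketch of applying two of these statistical algorithms with scikit-learn follows, assuming scikit-learn is installed; the data is synthetic:

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Classification: logistic regression on a simple synthetic rule
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
print("classification accuracy:", clf.score(X, y_class))

# Regression: linear regression on a noisy linear target
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_reg)
print("learned coefficients:", reg.coef_)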
Decision Tree Algorithm
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies
• Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which further gets
divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a leaf
node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given
conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: A node that splits into sub-nodes is called the parent node, and its sub-nodes are called the child nodes.
In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record and, based on the comparison, follows the branch and jumps to the next node.
At the next node, the algorithm again compares the record's attribute value with those of the sub-nodes and moves further down the tree. It continues this process until it reaches a leaf node. The complete process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets containing the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively build new decision trees using the subsets created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are the leaf nodes.
Example: Suppose a candidate has a job offer and wants to decide whether to accept it or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
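To make the example concrete, here is a tiny hand-written sketch in Python of how such a tree would classify a new offer; the attribute names and thresholds are illustrative, not taken from a real model:

def classify_offer(offer):
    """Walk the example tree: Salary -> Distance from office -> Cab facility."""
    if offer["salary"] < 50000:          # root node test (threshold is illustrative)
        return "Declined offer"          # leaf node
    if offer["distance_km"] > 30:        # decision node: distance from the office
        if offer["cab_facility"]:        # decision node: cab facility
            return "Accepted offer"      # leaf node
        return "Declined offer"          # leaf node
    return "Accepted offer"              # leaf node

print(classify_offer({"salary": 60000, "distance_km": 35, "cab_facility": True}))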
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve this problem, a technique called the Attribute Selection Measure (ASM) is used. With this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset based
on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula below:
$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. For a binary class it can be calculated as:
$\text{Entropy}(S) = -P(\text{yes}) \log_2 P(\text{yes}) - P(\text{no}) \log_2 P(\text{no})$
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create these binary splits.
o The Gini index can be calculated using the formula below:
$\text{Gini Index} = 1 - \sum_{j} p_j^{2}$
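A small sketch of computing these measures for a candidate split in plain Python follows; the example labels are made up:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, child_splits):
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in child_splits)
    return entropy(parent_labels) - weighted

# Toy split: 10 samples, and an attribute divides them into two subsets
parent = ["yes"] * 6 + ["no"] * 4
left, right = ["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3
print("entropy:", round(entropy(parent), 3))
print("gini:", round(gini(parent), 3))
print("information gain:", round(information_gain(parent, [left, right]), 3))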
Here's a deeper look into neural network-based algorithms for data mining:
Core Concepts:
Artificial Neurons: The building blocks of neural networks. These simulated neurons receive input from other
neurons, process it using a weighted sum and an activation function, and generate an output.
Layers: Neurons are organized into layers: an input layer, multiple hidden layers (where the magic happens!), and an
output layer. Information flows from the input layer through the hidden layers to the output layer, where the final
prediction or classification is produced.
Learning: Neural networks learn by adjusting the weights between neurons based on the difference between the
predicted output and the actual target value (supervised learning). This process, called backpropagation, iteratively
improves the network's performance.
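As a rough illustration of these concepts, here is a tiny numpy sketch of a single forward pass through one hidden layer; the weights are random, whereas a real network would learn them via backpropagation:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One input example with 3 features
x = np.array([0.5, -1.2, 0.3])

# Input -> hidden layer (4 artificial neurons): weighted sum plus activation
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
hidden = sigmoid(W1 @ x + b1)

# Hidden -> output layer (1 neuron): produces the prediction
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
output = sigmoid(W2 @ hidden + b2)

print("predicted probability:", output[0])
# Backpropagation would adjust W1, b1, W2, b2 to reduce the error between this
# output and the true target, repeated over many training examples.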
Types of Neural Networks:
There are various types of neural networks, each suited for specific tasks:
Multilayer Perceptrons (MLPs): The basic type of neural network with multiple hidden layers. They can learn
complex non-linear relationships but can struggle with high-dimensional data like images.
Convolutional Neural Networks (CNNs): Specialized for image and video recognition. They use filters to extract
spatial features from data, making them highly effective in tasks like object detection and image classification.
Recurrent Neural Networks (RNNs): Designed to handle sequential data like text or time series, where information
from previous steps is crucial. They struggle with long-term dependencies though.
Long Short-Term Memory (LSTM) Networks: A type of RNN that overcomes limitations of traditional RNNs.
LSTMs can learn long-term dependencies in sequential data, making them powerful for tasks like machine translation
or sentiment analysis of long texts.
Applications in Data Mining:
Image Recognition: CNNs excel at recognizing objects, faces, and scenes in images and videos, enabling applications
like facial recognition or medical image analysis.
Natural Language Processing (NLP): RNNs and LSTMs are used for tasks like sentiment analysis, machine
translation, and text generation. They can analyze large amounts of text data and extract insights.
Speech Recognition: Neural networks can convert spoken language into text, enabling voice assistants and speech-to-
text applications.
Recommendation Systems: These systems leverage neural networks to personalize recommendations for users based
on their past behavior and preferences.
Advantages:
High Performance: Neural networks can achieve state-of-the-art performance on various tasks, especially when
dealing with complex, high-dimensional data.
Feature Learning: They can automatically learn features from data, reducing the need for manual feature engineering,
which can be a time-consuming process.
Flexibility: Different network architectures can be designed to suit specific tasks and data types.
Disadvantages:
Complexity: Neural networks can be complex to design, train, and optimize, requiring significant computational
resources and expertise. Tuning these models effectively can be challenging.
Black Box Problem: While effective, neural networks can be difficult to interpret, making it challenging to understand
how they arrive at their predictions. This lack of interpretability can be a concern in some applications.
Data Hunger: Neural networks typically require large amounts of data for effective training, which might not always
be available.