Data Mining Notes Unit 4
Classification Algorithms
• 1R Algorithm
The 1R (One Rule) algorithm is a simplified form of a decision tree. Instead of building a complex
tree structure, it builds a rule set based on a single attribute and uses those rules to predict
the target variable.
1. Attribute Selection: The algorithm identifies the attribute that best discriminates
between the classes. This is typically measured by the error rate of the resulting rules on the
training data, though measures such as information gain or entropy can also be used.
2. Rule Creation: For each value of the selected attribute, a rule is created. The rule
predicts the class that is most common among instances with that attribute value.
Example (weather data):
1. Attribute Selection:
• Calculating the selection measure (e.g., information gain) for each attribute, we find that Outlook scores best, so it is selected.
2. Rule Creation:
• Rule 1: If Outlook is Sunny, then Play is No (6 out of 9 instances with Sunny Outlook
didn't play).
• Rule 2: If Outlook is Cloudy, then Play is Yes (4 out of 4 instances with Cloudy Outlook
played).
• Rule 3: If Outlook is Rainy, then Play is Yes (6 out of 7 instances with Rainy Outlook
played).
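As a concrete illustration, here is a minimal 1R-style sketch in Python. The rows below are illustrative, not the full table behind the counts above: for each value of an attribute it predicts the majority class and counts the training errors, which is how 1R compares candidate attributes.

from collections import Counter, defaultdict

def one_r(rows, attribute, target):
    """Build 1R rules for one attribute: for each attribute value,
    predict the majority class among rows having that value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attribute]][row[target]] += 1
    rules = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
    return rules, errors

# Illustrative rows only (not the dataset used in the notes).
data = [
    {"Outlook": "Sunny", "Play": "No"},
    {"Outlook": "Sunny", "Play": "Yes"},
    {"Outlook": "Cloudy", "Play": "Yes"},
    {"Outlook": "Rainy", "Play": "Yes"},
    {"Outlook": "Rainy", "Play": "No"},
]

# 1R would try every attribute and keep the one with the fewest errors;
# here there is only one attribute, so we build its rules directly.
rules, errors = one_r(data, "Outlook", "Play")
print(rules)   # e.g. {'Sunny': 'No', 'Cloudy': 'Yes', 'Rainy': 'Yes'}
print(errors)  # number of training instances the rules misclassify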
Visual Representation
[Figure: a one-level decision tree that splits only on the Outlook attribute, with one leaf per Outlook value.]
Applications
• Data Exploration: A quick way to understand the relationship between a single attribute
and the target variable.
• Baseline Model: Can serve as a baseline for comparison with more complex models.
• Real-World Use Cases: In scenarios where simplicity and efficiency are prioritized, 1R
can be a suitable choice.
Decision Trees
Decision trees are a popular machine learning algorithm used for both classification and
regression tasks. They are essentially flowcharts where each internal node represents a test on an
attribute (e.g., Is the temperature greater than 80 degrees?), each branch represents the possible
outcomes of the test, and each leaf node represents a class label or a predicted value.
Structure:
A decision tree is a flowchart-like tree structure in which each internal node represents a test on a
feature (or attribute), each branch represents a decision rule (an outcome of the test), and each leaf
node represents the outcome (a class label or predicted value).
The topmost node is the root node. The tree is built by repeatedly partitioning the data on attribute
values, a process called recursive partitioning. Because the resulting structure can be visualized as a
flowchart that mirrors human decision-making, decision trees are easy to understand and interpret.
A decision tree is a white-box model: its internal decision-making logic is visible, unlike black-box
models such as neural networks. Its training time is also typically faster than that of a neural
network.
The time complexity of decision trees is a function of the number of records and attributes in the given
data. The decision tree is a distribution-free or non-parametric method which does not depend upon
probability distribution assumptions. Decision trees can handle high-dimensional data with good
accuracy.
How a decision tree is built:
1. Select the best attribute using an Attribute Selection Measure (ASM) to split the records.
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Repeat this process recursively for each child subset until one of the stopping conditions is met:
• All the tuples in a subset belong to the same class.
• There are no remaining attributes to split on.
• There are no instances left.
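The same recursive-partitioning procedure can be run with an off-the-shelf library. A minimal sketch, assuming scikit-learn is installed, using its built-in iris data (criterion="entropy" corresponds to the information-gain measure discussed next):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small benchmark dataset and hold out a test set.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

# criterion="entropy" uses information gain as the ASM; "gini" is the default.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree, feature_names=iris.feature_names))  # the learned partitions
print("test accuracy:", tree.score(X_test, y_test))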
Attribute Selection Measures
An attribute selection measure is a heuristic for choosing the splitting criterion that partitions the
data in the best possible way. Such measures are also known as splitting rules because they determine
how the tuples at a given node are split. An ASM ranks each feature (or attribute) by how well it
explains the given dataset, and the attribute with the best score is chosen as the splitting attribute.
For a continuous-valued attribute, split points for the branches also need to be defined. The most
popular selection measures are Information Gain, Gain Ratio, and Gini Index.
Information Gain
Claude Shannon introduced the concept of entropy into information theory, where it measures the
impurity (randomness) in a group of examples; the term originally comes from physics, where it refers
to the disorder of a system. Information gain is the decrease in entropy: it is the difference between
the entropy of the dataset before the split and the weighted average entropy after the split based on
the given attribute values. The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses
information gain.
• Info(D) = -Σi pi log2(pi), where pi is the proportion of tuples in D belonging to class Ci. Info(D) is the average amount of information needed to identify the class label of a tuple in D.
• InfoA(D) = Σj (|Dj| / |D|) × Info(Dj) is the expected information required to classify a tuple from D after partitioning on attribute A; |Dj|/|D| acts as the weight of the jth partition.
• Gain(A) = Info(D) - InfoA(D).
The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at
node N.
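These quantities are easy to compute directly. A small sketch in plain Python (illustrative rows only) that evaluates Gain(A) for a categorical attribute:

import math
from collections import Counter

def entropy(labels):
    """Info(D): expected number of bits needed to identify the class of a tuple."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(A) = Info(D) - InfoA(D), where InfoA(D) is the weighted
    average entropy of the partitions induced by attribute A."""
    labels = [r[target] for r in rows]
    info_d = entropy(labels)
    info_a = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        info_a += (len(subset) / len(rows)) * entropy(subset)
    return info_d - info_a

rows = [
    {"Outlook": "Sunny", "Play": "No"},
    {"Outlook": "Sunny", "Play": "No"},
    {"Outlook": "Cloudy", "Play": "Yes"},
    {"Outlook": "Rainy", "Play": "Yes"},
]
print(information_gain(rows, "Outlook", "Play"))  # 1.0 here: Outlook separates the classes perfectly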
Gain Ratio
Information gain is biased toward attributes with many outcomes: it prefers attributes with a large
number of distinct values. For instance, an attribute that is a unique identifier, such as
customer_ID, produces a pure partition for every value, so InfoA(D) = 0 and the information gain is
maximal, even though the resulting partitioning is useless for prediction.
C4.5, an improvement of ID3, uses an extension to information gain known as the gain ratio. Gain ratio
handles the issue of bias by normalizing the information gain using Split Info. Java implementation of the
C4.5 algorithm is known as J48, which is available in WEKA data mining tool.
GainRatio(A) = Gain(A) / SplitInfoA(D), where SplitInfoA(D) = -Σj (|Dj| / |D|) × log2(|Dj| / |D|)
measures the potential information generated by splitting D into partitions. The attribute with the
maximum gain ratio is selected as the splitting attribute.
Gini index
Another decision tree algorithm CART (Classification and Regression Tree) uses the Gini method to
create split points.
The Gini index of a dataset D is Gini(D) = 1 - Σi pi², where pi is the proportion of tuples in D
belonging to class Ci. The Gini index considers a binary split for each attribute and computes a
weighted sum of the impurity of the resulting partitions: if a binary split on attribute A partitions
data D into D1 and D2, the Gini index of the split is
GiniA(D) = (|D1| / |D|) × Gini(D1) + (|D2| / |D|) × Gini(D2)
and the reduction in impurity is ΔGini(A) = Gini(D) - GiniA(D).
For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is
selected as its splitting subset. For a continuous-valued attribute, each pair of adjacent (sorted)
values is considered as a possible split point, and the point with the smallest Gini index is chosen
as the splitting point.
The attribute with the minimum Gini index is chosen as the splitting attribute.
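A quick sketch of the Gini computation in plain Python (toy labels), showing why a pure binary split scores 0:

from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini index of a binary split of D into D1 and D2."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

print(gini(["yes", "yes", "no", "no"]))          # 0.5  (maximally impure for two classes)
print(gini_split(["yes", "yes"], ["no", "no"]))  # 0.0  (a pure split; the best possible)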
Pruning Techniques
Pruning is a technique used to reduce the size of a decision tree to prevent overfitting.
1. Cost-Complexity Pruning:
o Adds a complexity penalty proportional to the number of leaf nodes.
o Prunes subtrees whose contribution to accuracy does not justify the added complexity.
2. Error-Based Pruning:
o Prunes a subtree whenever doing so does not increase the error rate on a validation set.
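In scikit-learn, cost-complexity pruning is exposed through the ccp_alpha parameter. The sketch below (assuming scikit-learn and its built-in breast-cancer data) computes the pruning path and keeps the alpha that performs best on a held-out validation set, which is close in spirit to error-based pruning:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# The pruning path lists the effective alphas at which subtrees get pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:          # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

print("chosen alpha:", best_alpha, "validation accuracy:", best_score)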
Advantages:
• Easy to understand, interpret, and visualize (a white-box model).
• Fast to train and able to handle numerical, categorical, and high-dimensional data.
Disadvantages:
• Prone to overfitting if the tree is grown too deep, which is why pruning is needed.
• Small changes in the training data can produce a very different tree.
Decision trees are a versatile and widely used machine learning algorithm. By understanding
their structure, construction methods, and pruning techniques, you can effectively apply them to
various classification and regression problems.
Covering Rules
Definition and Purpose
Covering rules are a type of machine learning algorithm that aims to learn a set of rules to
classify instances. These rules are typically in the form of "if-then" statements. The goal is to
cover as many instances as possible with these rules while minimizing overlap.
Key characteristics:
• Rules are learned one at a time (separate-and-conquer): each rule is built to cover as many instances of one class as possible.
• Instances covered by a rule are removed before the next rule is learned.
• The final model is a set of if-then rules rather than a single global structure.
Purpose:
• Interpretability: Covering rules are often more interpretable than other models, making them
easier to understand and explain.
• Efficiency: They can be computationally efficient, especially for small datasets.
• Flexibility: Can handle both numerical and categorical data.
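To make the separate-and-conquer idea concrete, here is a toy covering learner in plain Python (illustrative rows, single attribute=value tests only; real algorithms such as PRISM refine rules with multiple conditions):

def covering_rules(rows, target, target_class):
    """Toy separate-and-conquer: repeatedly pick the attribute=value test
    with the highest precision for target_class, keep it as a rule, and
    remove the instances it covers."""
    rules, remaining = [], list(rows)
    attributes = [a for a in rows[0] if a != target]
    while any(r[target] == target_class for r in remaining):
        best, best_precision = None, 0.0
        for attr in attributes:
            for value in {r[attr] for r in remaining}:
                covered = [r for r in remaining if r[attr] == value]
                hits = sum(r[target] == target_class for r in covered)
                if hits / len(covered) > best_precision:
                    best, best_precision = (attr, value), hits / len(covered)
        if best is None:
            break
        rules.append(best)
        remaining = [r for r in remaining if r[best[0]] != best[1]]
    return rules

rows = [
    {"Outlook": "Sunny", "Humidity": "High", "Play": "No"},
    {"Outlook": "Sunny", "Humidity": "Normal", "Play": "Yes"},
    {"Outlook": "Rainy", "Humidity": "High", "Play": "Yes"},
    {"Outlook": "Cloudy", "Humidity": "Normal", "Play": "Yes"},
]
print(covering_rules(rows, "Play", "Yes"))  # (attribute, value) tests, each read as an "if ... then Play = Yes" rule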
Comparison with other classifiers:
• Decision Trees: built top-down by divide-and-conquer and can be converted into a rule set, whereas covering algorithms learn rules one at a time directly.
• Naive Bayes: a probabilistic model rather than an explicit rule set, so it is less directly readable as if-then statements.
• Neural Networks: often more accurate on complex data, but black-box models that give little insight into the decision logic.
Classification
Purpose: assign each instance to one of a set of predefined classes.
Example:
• Medical diagnosis: Given a patient's symptoms, predict whether they have a particular disease.
• Image recognition: Classify images into categories such as "cat," "dog," or "car."
Prediction
Purpose: estimate a numeric or otherwise unknown value for a new instance, rather than assigning a
class label (covered in more detail in the next section).
Examples of learned rules:
• Association rule mining: "People who buy milk also tend to buy bread."
• Decision tree: "If temperature is high and humidity is low, then play tennis."
• Rule-based classifier: "If income is high and credit score is good, then approve loan."
3. Prediction Algorithms
The Prediction Task
Objectives: learn a model that estimates a numeric target value for new instances. Prediction differs from classification in several ways:
• Output type: Prediction tasks typically deal with continuous variables (e.g., numbers), while
classification tasks involve categorical variables (e.g., labels).
• Evaluation metrics: Different metrics are used to evaluate prediction tasks compared to
classification tasks. Common metrics include:
o Mean squared error (MSE): Measures the average squared difference between
predicted and actual values.
o Mean absolute error (MAE): Measures the average absolute difference between
predicted and actual values.
o Root mean squared error (RMSE): The square root of the MSE; it is expressed in the same
units as the target variable, which makes it easier to interpret.
• Algorithms: While some algorithms can be used for both classification and prediction, others
are more specialized for one task or the other. For example:
o Linear regression: Commonly used for prediction tasks.
o Logistic regression: Commonly used for classification tasks.
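The metrics above are straightforward to compute by hand. A small sketch with made-up predicted and actual values:

import math

actual    = [3.0, 5.0, 2.5, 7.0]   # hypothetical true values
predicted = [2.8, 5.4, 2.0, 8.0]   # hypothetical model outputs

n = len(actual)
mse  = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
mae  = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
rmse = math.sqrt(mse)

# RMSE is in the same units as the target, which makes it easier to interpret.
print(f"MSE={mse:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")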
Example:
[Figure: predicted sales (fitted line) versus actual sales (data points) as a function of advertising spending.]
The goal of the prediction task is to minimize the distance between the predicted and actual values.
By understanding the objectives, differences, and evaluation metrics for prediction tasks, you can
effectively apply appropriate algorithms and techniques to solve your specific problems.
• Statistical (Bayesian) Classification
Bayesian Theorem
Formula: P(A|B) = P(B|A) × P(A) / P(B)
Explanation:
• P(A|B): The probability of event A happening, given that event B has already happened.
• P(B|A): The probability of event B happening, given that event A has already happened.
• P(A): The prior probability of event A happening.
• P(B): The prior probability of event B happening.
Example:
Using Bayes' theorem, we can calculate the probability of having the disease given a positive test
result.
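A worked version of this calculation, with assumed (purely illustrative) numbers for the prevalence, sensitivity, and false-positive rate, since the notes do not specify them:

# Assumed, illustrative numbers:
p_disease = 0.01                  # P(A): prior probability of having the disease
p_pos_given_disease = 0.95        # P(B|A): test sensitivity
p_pos_given_healthy = 0.10        # false-positive rate

# Total probability of a positive test, P(B).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # about 0.088: still unlikely despite the positive test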
Naive Bayes Classification
Steps:
1. Calculate prior probabilities: Determine the probability of each class in the training data.
2. Calculate conditional probabilities: Calculate the probability of each feature given each class.
3. Apply Bayes' theorem: For a new instance, calculate the probability of each class using Bayes'
theorem.
4. Classify: Assign the instance to the class with the highest probability.
Example:
Consider a dataset with two features (outlook and temperature) and two classes (play or not
play).
Outlook   Temperature   Play
Sunny     Hot           No
Rainy     Cool          No
To classify a new instance with outlook "Sunny" and temperature "Hot," we would calculate:
• P(Play=Yes|Outlook=Sunny, Temperature=Hot)
• P(Play=No|Outlook=Sunny, Temperature=Hot)
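These probabilities come directly from counts in the training data. A minimal sketch (the rows below are illustrative, not the full table from the notes; in practice Laplace smoothing is added so that unseen feature values do not produce zero probabilities):

from collections import Counter, defaultdict

rows = [  # (Outlook, Temperature, Play) - illustrative training data
    ("Sunny", "Hot", "No"), ("Sunny", "Hot", "No"), ("Cloudy", "Hot", "Yes"),
    ("Rainy", "Mild", "Yes"), ("Rainy", "Cool", "No"), ("Cloudy", "Cool", "Yes"),
]

priors = Counter(play for _, _, play in rows)        # class counts for P(class)
cond = defaultdict(Counter)                          # counts for P(feature value | class)
for outlook, temp, play in rows:
    cond[("Outlook", outlook)][play] += 1
    cond[("Temperature", temp)][play] += 1

def score(play, outlook, temp):
    """Unnormalised P(class) * P(outlook|class) * P(temperature|class)."""
    p = priors[play] / len(rows)
    p *= cond[("Outlook", outlook)][play] / priors[play]
    p *= cond[("Temperature", temp)][play] / priors[play]
    return p

# Classify a new (Sunny, Hot) instance: pick the class with the higher score.
print({play: score(play, "Sunny", "Hot") for play in priors})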
Limitations:
• Assumes independence of features, which may not always hold in real-world data.
• Can be sensitive to the distribution of the data.
Naive Bayes is a popular and effective classification algorithm, especially for large datasets with
many features. However, it is important to be aware of its limitations and consider other
algorithms if the independence assumption is violated.
Bayesian Networks
Structure
A Bayesian network is a directed acyclic graph in which each node represents a random variable. The
edges are directed, indicating a direct (often causal) dependency between variables. The absence of an
edge implies conditional independence.
Example: the classic sprinkler network, with variables such as Weather, Sprinkler, Wet Grass, and Slippery Sidewalk (used in the inference example below).
Inference
Inference in Bayesian networks involves calculating the probability of one or more variables
given evidence. There are two main types of inference:
• Predictive (causal) inference: reasoning from causes to effects, e.g., from the weather to whether the grass is wet.
• Diagnostic inference: reasoning from observed effects back to their likely causes, e.g., from wet grass to whether the sprinkler was on.
Inference Algorithms:
• Exact inference:
o Variable elimination: Iteratively eliminates variables to calculate the desired probability.
o Join tree: Creates a junction tree representation of the network and uses message
passing to calculate probabilities.
• Approximate inference:
o Sampling methods: Generate samples from the joint probability distribution to estimate
probabilities.
o Variational methods: Approximate the posterior distribution with a simpler distribution.
Example:
Consider the Bayesian network in the example above. To infer the probability of the sprinkler
being on given that the grass is wet, we would use diagnostic inference. We could employ
variable elimination to eliminate the variables "weather" and "slippery sidewalk" and calculate
the desired probability.
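For a network this small, exact inference can also be done by brute-force enumeration of the joint distribution, which is what variable elimination organizes more efficiently. A self-contained sketch of a simplified version of the network (Rain in place of Weather, Slippery Sidewalk omitted), with assumed, illustrative probability tables:

from itertools import product

# Assumed, illustrative conditional probability tables.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler_given_rain = {True: 0.01, False: 0.4}           # P(Sprinkler=on | Rain)
P_wet_given = {(True, True): 0.99, (True, False): 0.80,     # P(WetGrass | Sprinkler, Rain)
               (False, True): 0.90, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    """Joint probability factorised along the network structure."""
    p = P_rain[rain]
    p *= P_sprinkler_given_rain[rain] if sprinkler else 1 - P_sprinkler_given_rain[rain]
    p_wet = P_wet_given[(sprinkler, rain)]
    return p * (p_wet if wet else 1 - p_wet)

# Diagnostic inference: P(Sprinkler = on | WetGrass = true).
numerator = sum(joint(r, True, True) for r in (True, False))
denominator = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(round(numerator / denominator, 3))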
Challenges:
• Structure learning: Determining the network structure can be challenging, especially for large
networks.
• Computational complexity: Inference can be computationally expensive for large networks.
Instance-Based Methods:
k-Nearest Neighbors (k-NN) is a simple yet effective machine learning algorithm that belongs
to the family of instance-based learning methods. It makes predictions based on the similarity
between the new instance and the k nearest neighbors in the training dataset.
1. Choose a value for k: Determine the number of neighbors to consider when making
predictions.
2. Calculate distances: For each instance in the training dataset, calculate its distance to the
new instance. Common distance metrics include Euclidean distance, Manhattan distance,
and Minkowski distance.
3. Select k nearest neighbors: Identify the k instances in the training dataset that are
closest to the new instance.
4. Make prediction: If the task is classification, assign the class label to the new instance
that is most common among its k nearest neighbors. If the task is regression, calculate the
average value of the target variable among the k nearest neighbors.
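A short sketch of these steps using scikit-learn (assumed installed) on its built-in iris data; the scaler matters because k-NN is distance-based, and cross-validation is used to compare candidate values of k:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

for k in (1, 3, 5, 15):
    # Scale the features, then vote among the k nearest neighbours.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"k={k:2d}  mean accuracy={scores.mean():.3f}")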
Advantages of k-NN:
• Simple to understand and implement; there is no explicit training phase (lazy learning).
• Naturally handles multi-class problems and adapts easily as new data arrives.
Disadvantages of k-NN:
• Prediction can be slow for large datasets, since distances to all training instances must be computed.
• Sensitive to irrelevant features and to feature scaling, so normalization is usually needed.
• Performance depends heavily on the choice of k and the distance metric.
Example:
Consider a dataset of iris flowers with features sepal length, sepal width, petal length, and petal
width, and a target variable indicating the species (setosa, versicolor, or virginica). To classify a
new iris flower, k-NN would find the k nearest neighbors in the training dataset based on their
feature values and assign the class label that is most common among those neighbors.
The choice of k can significantly affect the performance of k-NN. A small value of k can make
the model more sensitive to noise, while a large value of k can make the model less flexible.
Cross-validation is often used to select the optimal value of k.
Variations of k-NN:
• Weighted k-NN: Assigns weights to the nearest neighbors based on their distance,
giving more weight to closer neighbors.
• Approximate nearest neighbors: Uses approximate algorithms to efficiently find the
nearest neighbors for large datasets.
k-NN is a versatile and easy-to-use algorithm that can be effective for many classification and
regression tasks. However, it is important to consider its limitations and choose the appropriate
value of k to achieve optimal performance.
Linear Models
Linear models are a class of statistical models that assume a linear relationship between the
dependent variable and one or more independent variables. They are widely used in various
fields, including statistics, machine learning, and data science.
Linear Regression
Purpose: model and predict a continuous dependent variable as a linear function of one or more independent variables.
Equation: y = β0 + β1x1 + β2x2 + ... + βpxp + ε
where:
• y: Dependent variable
• β0: Intercept
• β1, β2, ..., βp: Coefficients
• x1, x2, ..., xp: Independent variables
• ε: Error term
Example: Predicting house prices based on features like square footage, number of bedrooms,
and location.
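A minimal sketch of that example with a few hypothetical (made-up) house records, using scikit-learn's LinearRegression to estimate the beta coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [square footage, number of bedrooms] vs. price.
X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]])
y = np.array([245000, 312000, 279000, 308000, 405000])

model = LinearRegression().fit(X, y)
print("intercept (beta0):", model.intercept_)
print("coefficients (beta1, beta2):", model.coef_)
print("predicted price, 2000 sq ft / 4 bedrooms:", model.predict([[2000, 4]])[0])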
Logistic Regression
Purpose: model the probability of a binary (categorical) outcome; despite its name, it is used for classification.
Equation: P(y = 1 | x) = 1 / (1 + e^-(β0 + β1x1 + ... + βpxp))
where P(y = 1 | x) is the predicted probability of the positive class and the β coefficients are estimated from the data.
Key Differences:
• Output: Linear regression predicts a continuous value, while logistic regression predicts a
probability.
• Equation: Linear regression uses a linear equation, while logistic regression uses a logistic
function.
• Loss function: Linear regression typically uses mean squared error, while logistic regression uses
cross-entropy loss.
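A comparable sketch for logistic regression, with hypothetical (made-up) hours-studied vs. pass/fail data, showing that the model outputs a probability rather than a raw value:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])  # hours studied
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                                  # 0 = fail, 1 = pass

model = LogisticRegression(max_iter=1000).fit(X, y)
# predict_proba returns [P(fail), P(pass)]; predict thresholds at 0.5.
print(model.predict_proba([[2.25]]))
print(model.predict([[2.25]]))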
Limitations:
• They assume a linear relationship between the variables (or between the variables and the log-odds), which may not always hold.
• They can be sensitive to outliers.
Linear models are a fundamental tool in machine learning and statistics, providing a solid
foundation for understanding and building predictive models.
Additional Notes:
• Experiment with different algorithms: Weka offers a wide range of classifiers, so try different
ones to find the best fit for your dataset.
• Tune parameters: Experiment with different parameter values to optimize performance.
• Evaluate performance: Use appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-
score, MSE) to assess the model's effectiveness.
• Consider preprocessing: Preprocess your data if necessary (e.g., handle missing values,
normalize features).
By following these steps and experimenting with different algorithms, you can effectively
implement Bayesian classifiers, instance-based methods, and linear models using Weka to solve
your machine learning problems.
4. Evaluating What’s Been Learned
• Basic Issues
Overfitting occurs when a model is too complex and learns the training data too well, including
its noise. This can lead to poor performance on new, unseen data.
Underfitting occurs when a model is too simple and cannot capture the underlying patterns in
the data. This can also lead to poor performance, as the model is unable to generalize to new
data.
Example:
• Overfitting: A high-degree polynomial model might fit the training data perfectly but perform
poorly on new data due to its complexity.
• Underfitting: A linear model might not be able to capture a non-linear relationship in the data,
leading to poor performance.
Bias-Variance Tradeoff
The bias-variance tradeoff is a fundamental concept in machine learning that explains the
relationship between model complexity and its performance.
• Bias: The error due to the model's inability to capture the underlying relationship in the data. A
complex model typically has lower bias.
• Variance: The error due to the model's sensitivity to small changes in the training data. A
complex model typically has higher variance.
The goal is to find a model that balances bias and variance to achieve optimal performance.
Example:
• High bias, low variance: A simple model like linear regression might have high bias if the
relationship is non-linear, but it will have low variance as it is less sensitive to noise in the data.
• Low bias, high variance: A complex model like a deep neural network might have low bias but
high variance, as it can easily overfit to the training data.
By understanding the concepts of overfitting, underfitting, and the bias-variance tradeoff, you
can make informed decisions when building and evaluating machine learning models.
Data Splitting Strategies
1. Random Split:
o Randomly divides the dataset into training and testing sets.
o Simple and easy to implement.
o May introduce bias if the dataset is not well-shuffled.
2. Stratified Split:
o Ensures that the distribution of classes in the training and testing sets is similar to the
overall dataset.
o Helps to avoid bias when dealing with imbalanced datasets.
3. K-Fold Cross-Validation:
o Divides the dataset into k folds.
o Trains the model k times, each time using k-1 folds for training and 1 fold for testing.
o Provides a more robust estimate of the model's performance.
o Common values for k are 5 or 10.
4. Leave-One-Out Cross-Validation (LOOCV):
o A special case of k-fold cross-validation where k equals the number of instances in the
dataset.
o Provides a very accurate estimate of performance but can be computationally expensive
for large datasets.
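A short sketch of stratified k-fold cross-validation with scikit-learn (assumed installed), using a decision tree on the built-in breast-cancer data:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Stratified 10-fold CV: each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=cv)
print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))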
Factors to Consider
• Dataset size: For small datasets, k-fold cross-validation or LOOCV can be more suitable.
• Class imbalance: Stratified splitting is recommended for imbalanced datasets.
• Computational resources: LOOCV can be computationally expensive, especially for large
datasets.
• Model complexity: More complex models may benefit from more rigorous evaluation using
techniques like k-fold cross-validation.
Example: for a small, imbalanced medical dataset, stratified 10-fold cross-validation gives a far
more reliable performance estimate than a single random train/test split.
By carefully selecting a data splitting strategy, you can obtain a reliable evaluation of your
machine learning model's performance.
Holdout Method
• Simple approach: Randomly divides the dataset into training and testing sets.
• Pros: Easy to implement.
• Cons: Can be sensitive to the specific split, leading to biased estimates.
Cross-Validation
• More robust: Divides the dataset into k folds, trains the model k times, and evaluates
performance on each fold.
• Types:
o k-fold cross-validation: Commonly used, with k typically set to 5 or 10.
o Stratified k-fold cross-validation: Ensures that the distribution of classes is similar in
each fold, especially for imbalanced datasets.
o Leave-one-out cross-validation (LOOCV): Sets k equal to the number of instances,
providing a more accurate estimate but can be computationally expensive for large
datasets.
Leave-One-Out Cross-Validation (LOOCV)
• Extreme case of k-fold: Trains the model on all but one instance and evaluates on the remaining
instance.
• Pros: Provides a very accurate estimate.
• Cons: Can be computationally expensive for large datasets.
Factors to consider:
• Dataset size: For small datasets, LOOCV can provide a more accurate estimate.
• Computational resources: LOOCV can be computationally expensive for large datasets.
• Class imbalance: Stratified k-fold cross-validation is recommended for imbalanced datasets.
Ensemble Methods
Bagging (Bootstrap Aggregating)
• Idea: Create multiple models by training them on different bootstrap samples of the original
dataset.
• Process:
1. Sample the dataset with replacement to create multiple bootstrap samples.
2. Train a model on each bootstrap sample.
3. Combine the predictions of all models, typically by majority voting for classification or
averaging for regression.
Example:
• Random Forest: An ensemble of decision trees where each tree is trained on a bootstrap sample
and feature selection is randomized.
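A brief sketch comparing plain bagging of decision trees with a random forest in scikit-learn (assumed installed; BaggingClassifier uses a decision tree as its default base learner):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many trees, each fit to a bootstrap sample, combined by voting.
bagging = BaggingClassifier(n_estimators=50, random_state=0)
# Random forest: bagging plus random feature selection at each split.
forest = RandomForestClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))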
Boosting
• Idea: Create multiple models sequentially, with each model focusing on correcting the errors of
the previous models.
• Process:
1. Train an initial model.
2. Assign weights to instances based on their classification accuracy.
3. Train a new model that focuses on instances with higher weights.
4. Repeat steps 2 and 3 until a specified number of models is reached.
5. Combine the predictions of all models, typically using weighted voting.
Example:
• AdaBoost: A boosting algorithm that assigns weights to instances based on their classification
accuracy.
• Gradient Boosting: A boosting algorithm that fits a new model to the residuals of the previous
models.
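A comparable sketch for the two boosting algorithms, again with scikit-learn on its built-in breast-cancer data:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# AdaBoost reweights misclassified instances; gradient boosting fits each
# new tree to the residual errors of the ensemble built so far.
models = [("AdaBoost", AdaBoostClassifier(n_estimators=100, random_state=0)),
          ("Gradient boosting", GradientBoostingClassifier(random_state=0))]
for name, model in models:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))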
Stacking (Stacked Generalization)
• Idea: Train multiple models and use another model to combine their predictions.
• Process:
1. Train multiple base models on the original dataset.
2. Use the predictions of the base models as features for a meta-learner model.
3. Train the meta-learner model to combine the predictions of the base models.
Example:
• A meta-learner model (e.g., logistic regression) can be used to combine the predictions of
multiple base models (e.g., decision trees, support vector machines).
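A minimal stacking sketch in scikit-learn (assumed installed): a decision tree and an SVM as base models, with logistic regression as the meta-learner that combines their predictions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Base-model predictions become the features seen by the meta-learner.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(cross_val_score(stack, X, y, cv=5).mean().round(3))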
The best ensemble method depends on the specific problem and dataset. Consider the following
factors:
• Bias-variance tradeoff: Bagging can help reduce variance, while boosting can help reduce bias.
• Computational cost: Some ensemble methods, like boosting, can be computationally expensive.
• Interpretability: Some ensemble methods, like random forests, can be more interpretable than
others.
The Minimum Description Length (MDL) Principle
The Minimum Description Length (MDL) principle is a general-purpose framework for model
selection and inference. It states that the best model is the one that can encode both the model
itself and the data using the fewest bits.
In other words, the MDL principle seeks to find a model that is both simple and fits the data
well. A simpler model requires fewer bits to encode, while a model that fits the data well
requires fewer bits to encode the residuals.
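In symbols, one common way to state the principle (a general formulation, independent of any particular encoding scheme) is: choose the model M* = argmin over candidate models M of [ L(M) + L(D | M) ], where L(M) is the number of bits needed to describe the model itself and L(D | M) is the number of bits needed to describe the data (its residuals) once the model is known.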
Application
The MDL principle can be applied to various machine learning tasks, including:
• Model selection: Choosing the best model from a set of candidate models.
• Feature selection: Selecting the most relevant features for a model.
• Data compression: Finding the most efficient way to encode data.
Advantages of MDL:
• Provides a principled tradeoff between model fit and model complexity, which helps prevent overfitting.
• Does not require a separate validation set for model selection.
Disadvantages of MDL:
• Exact description lengths are often hard or intractable to compute, so approximations are used in practice.
• The result depends on the encoding scheme chosen for models and data.
The MDL principle is a powerful tool for model selection and inference. By seeking to find the
simplest model that fits the data well, it can help prevent overfitting and improve the
generalization performance of machine learning models.
Using Weka
Performing Cross-Validation
1. Open the "Classify" tab: This is where you apply machine learning algorithms.
2. Select your desired algorithm: Use the "Choose" button to pick a classifier (e.g., trees > J48 for
decision trees).
3. Set parameters: Adjust any algorithm-specific parameters.
4. Choose the test option: Under "Test options", select "Cross-validation" and set the number of folds
(10 by default).
5. Start the experiment: Click the "Start" button.
6. Evaluate results: The "Classifier output" pane shows the cross-validation results, including
accuracy, precision, recall, F-measure, and other metrics.
Evaluating a Model
1. Examine the "Classifier output" pane: it reports accuracy, precision, recall, F-measure, ROC area,
and the confusion matrix.
2. Select the metrics that matter for your task: accuracy for balanced data; precision, recall,
F-measure, or the ROC curve for imbalanced data.
3. Analyze results: Interpret these metrics to assess the model's performance.
Using Ensemble Methods
1. Choose a meta-classifier: In the "Classify" tab, select an ensemble learner such as Bagging or
AdaBoostM1 from the "meta" group in the classifier chooser.
2. Select a base classifier: Set the ensemble's base learner (e.g., J48 or REPTree) in its options.
3. Set parameters: Adjust parameters such as the number of iterations (models) to build.
4. Start the experiment: Click the "Start" button.
5. Evaluate results: Analyze the ensemble's performance in the "Classifier output" pane.
Additional Tips:
• Preprocess data: Handle missing values, normalize features, and consider feature engineering.
• Experiment with different algorithms: Try various classifiers to find the best fit for your dataset.
• Tune parameters: Optimize algorithm parameters using techniques like grid search or random
search.
• Visualize results: Use Weka's visualization tools to understand your models and results better.
By following these steps and experimenting with different techniques, you can effectively use
Weka to build and evaluate machine learning models for your specific tasks.