Supervised Learning Notes 1-4
CSIT 330
NOTES
MODULE 01
INTRODUCTION
Introduction to Supervised Learning
Supervised learning is a foundational concept in machine learning where
algorithms learn from labeled training data to make predictions or decisions. It
involves a process where the model is trained on input-output pairs, with the aim
of learning the mapping or relationship between the inputs and corresponding
outputs.
1. Labeled Data: In supervised learning, the dataset used for training the
model consists of labeled examples. Each example in the dataset includes
input features and their corresponding output labels. For instance, in a
dataset aiming to predict whether an email is spam or not, the features
could be various attributes of the email (like words used, sender, etc.),
and the labels would be "spam" or "not spam".
2. Objective: The primary goal is to train the model to generalize patterns
and relationships present in the training data so that it can accurately
predict or classify the output for new, unseen input data.
3. Types of Supervised Learning:
Classification: The algorithm predicts a discrete class label as the
output. For instance, determining whether an email is spam or not,
predicting whether a patient has a particular disease, etc.
Regression: The algorithm predicts a continuous numerical value
as the output. For example, predicting house prices based on
features like square footage, number of bedrooms, etc.
4. Model Training: During the training phase, the algorithm learns from the
labeled data by adjusting its internal parameters or weights iteratively. It
tries to minimize the difference between its predictions and the actual
labels using various optimization techniques (e.g., gradient descent).
5. Evaluation: After training, the model's performance is assessed on a
separate dataset called the validation or test set. This evaluation helps in
understanding how well the model generalizes to new, unseen data.
Common evaluation metrics differ based on the type of problem -
accuracy, precision, recall, F1-score for classification; mean squared error,
R-squared for regression, etc.
6. Prediction: Once the model is trained and evaluated, it can be used to
make predictions or classify new unseen data based on the patterns it has
learned from the training data.
7. Iterative Process: The process of supervised learning is often iterative. It
involves tweaking various aspects of the model (like changing its
architecture, hyperparameters, or employing different algorithms) to
improve its performance until satisfactory results are achieved.
Simple Linear Regression
y = mx + c
Where y is the predicted output, x is the input feature, m is the slope (weight), and c is the intercept (bias).
The goal of simple linear regression is to estimate the values of m and c that
best fit the given data. This estimation is commonly done using the method of
least squares, which minimizes the sum of the squared differences between the
observed and predicted values.
Multiple Linear Regression
y = b0 + b1x1 + b2x2 + ⋯ + bnxn
Where b0 is the intercept and b1, b2, …, bn are the coefficients (weights) of the input features x1, x2, …, xn.
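As an illustration (not taken from the original notes), the following Python sketch fits a simple linear regression by least squares using NumPy; the square-footage and price values are made up for demonstration.
import numpy as np

# Hypothetical data: square footage (x) and house price (y)
x = np.array([750, 900, 1100, 1400, 1800], dtype=float)
y = np.array([150000, 180000, 210000, 260000, 330000], dtype=float)

# Least-squares estimates of slope m and intercept c for y = m*x + c
m, c = np.polyfit(x, y, deg=1)

# Predicted values and the sum of squared residuals being minimized
y_pred = m * x + c
sse = np.sum((y - y_pred) ** 2)

print(f"slope m = {m:.2f}, intercept c = {c:.2f}, SSE = {sse:.2f}")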
Gradient Descent
Gradient descent is an optimization algorithm used to minimize the cost function
or error function in machine learning models, especially in training algorithms for
supervised learning. It's employed to iteratively update the parameters of a
model to reach the optimal values that minimize the cost.
1. Initialization: Start with initial values for the parameters of the model
(weights and biases) typically chosen randomly.
2. Calculate Gradient: Compute the gradient of the cost function with
respect to each parameter. This gradient represents the direction and rate
of the steepest increase in the cost function.
3. Update Parameters: Adjust the parameters by moving in the opposite
direction of the gradient. The update formula for a parameter θ can be
represented as:
θ = θ − α × gradient
Where α (the learning rate) determines the size of the steps taken in the
direction of the negative gradient. This rate is crucial; a large learning rate
might cause overshooting, while a small learning rate might lead to slow
convergence.
4. Iterate: Repeat steps 2 and 3 until convergence or a stopping criterion is
reached. Convergence occurs when the algorithm reaches a point where
the gradient is close to zero, indicating that the minimum of the cost
function has been found or the change in the cost function is negligible.
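A minimal sketch of these four steps for simple linear regression, written in plain NumPy; the learning rate, iteration count, and data values are illustrative choices, not values from the notes.
import numpy as np

# Illustrative data roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

m, c = 0.0, 0.0          # Step 1: initialize parameters
alpha = 0.01             # learning rate
n = len(x)

for _ in range(5000):    # Step 4: iterate until approximate convergence
    y_pred = m * x + c
    error = y_pred - y
    # Step 2: gradients of the mean squared error with respect to m and c
    grad_m = (2.0 / n) * np.sum(error * x)
    grad_c = (2.0 / n) * np.sum(error)
    # Step 3: move each parameter opposite to its gradient
    m -= alpha * grad_m
    c -= alpha * grad_c

print(f"m ≈ {m:.3f}, c ≈ {c:.3f}")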
In classification:
1. Input Data: The dataset used for training consists of labeled examples,
where each example comprises input features (also called independent
variables or predictors) and corresponding output labels (also known as
classes or categories).
2. Objective: The primary aim is to build a model that can accurately assign
unseen or new input data to the correct predefined class based on the
learned patterns from the training data.
3. Types of Classification:
Binary Classification: In binary classification, the model classifies
data into one of two classes, such as "spam" or "not spam,"
"positive" or "negative," etc.
Multiclass Classification: Multiclass classification involves
categorizing data into more than two classes. For instance,
classifying emails into categories like "spam," "promotions," or
"primary inbox."
4. Algorithms: Various algorithms are used for classification tasks,
including:
Logistic Regression: Despite its name, logistic regression is
primarily used for binary classification.
Decision Trees: They split the data into branches based on
different features to classify instances.
Support Vector Machines (SVM): SVM finds the best hyperplane
that separates different classes in the feature space.
Random Forests: An ensemble learning method that constructs
multiple decision trees to improve accuracy and prevent overfitting.
Neural Networks: Deep learning models capable of learning
complex patterns from data, often used for classification tasks.
5. Model Training: During the training phase, the model learns the
relationship between the input features and their corresponding output
labels. The algorithm adjusts its parameters or weights to minimize the
classification error by comparing predicted labels with true labels.
6. Evaluation: The performance of the classification model is assessed
using various metrics such as accuracy, precision, recall, F1-score,
confusion matrix, ROC curves, and others, depending on the problem and
data characteristics.
7. Prediction: Once trained and evaluated, the model can predict the class
labels for new, unseen data based on the learned patterns.
Classification finds applications in numerous fields, including spam filtering, medical diagnosis, image recognition, and sentiment analysis.
Introduction to KNN
K-Nearest Neighbors (KNN) is a simple yet powerful supervised learning
algorithm used for both classification and regression tasks. It's a non-parametric,
instance-based learning method that makes predictions based on the similarity
of data points.
1. Data Preparation:
Collect and preprocess the dataset, which includes cleaning,
normalization, feature scaling, and splitting into training and testing
sets.
2. Choosing 'k':
Determine an appropriate value of 'k' for the algorithm. This can be
done through cross-validation or other validation techniques.
3. Training:
KNN doesn't require explicit training as it memorizes the entire
training dataset. The algorithm essentially stores all the data points
and their labels.
4. Prediction:
For a new data point, KNN finds its k nearest neighbors based on
the chosen distance metric and determines the class (for
classification) or value (for regression) based on the majority or
average of these neighbors.
5. Evaluation:
Assess the model's performance using evaluation metrics such as
accuracy, precision, recall, F1-score (for classification), or Mean
Squared Error, R-squared (for regression).
Applications of KNN:
Recommendation systems
Pattern recognition
Image recognition
Medical diagnosis
Anomaly detection
Predictive maintenance
KNN's simplicity and effectiveness make it a valuable tool for various tasks,
especially in scenarios where the data has a discernible local structure and can
be separated by proximity in the feature space.
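As a hedged illustration of the workflow above, the sketch below uses scikit-learn's KNeighborsClassifier on the bundled Iris dataset; the choice of k = 5 and the 80/20 split are arbitrary examples, not values from the notes.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Data preparation: load, split, and scale the features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 2-3. Choose k and "train" (KNN simply stores the training points)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# 4-5. Predict on unseen data and evaluate
y_pred = knn.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))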
Naïve Bayes
Naive Bayes is a popular and straightforward probabilistic classification algorithm
based on Bayes' theorem with an assumption of independence among features.
Despite its simplicity, it often performs well in various real-world applications and
is particularly useful in text classification and spam filtering.
1. Bayes' Theorem:
Bayes' theorem describes the probability of an event based on prior
knowledge of conditions that might be related to the event.
P(A|B) = P(B|A) × P(A) / P(B)
For classification, Naive Bayes assumes the features x1, …, xn are conditionally independent given the class y, so:
P(y|X) ∝ P(y) × P(x1|y) × P(x2|y) × ⋯ × P(xn|y)
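A brief sketch, assuming scikit-learn's MultinomialNB and a tiny made-up spam example, of how the Naive Bayes rule above is applied in practice; the messages and labels are fabricated purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny, made-up training corpus with spam / not-spam labels
messages = ["win money now", "limited offer win prize",
            "meeting at noon", "lunch with the team"]
labels = ["spam", "spam", "not spam", "not spam"]

# Word counts become the features x1..xn in P(y|X) ∝ P(y) × Π P(xi|y)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["win a prize now"])))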
Logistic Regression
1. Binary Classification:
Logistic Regression predicts the probability that a given input
belongs to a particular class (usually denoted as 0 or 1).
2. Sigmoid Function (Logistic Function):
The logistic function is used to model the probability that a given
input x belongs to the positive class (class 1) in binary
classification.
1. Data Preprocessing:
Clean and preprocess the data, handle missing values, encode
categorical variables, and perform feature scaling if required.
2. Model Training:
Initialize the model with random weights and biases.
Use optimization techniques like gradient descent or more
advanced optimizers to update the model parameters to minimize
the cost function.
3. Cost Function (Log Loss):
The cost function for logistic regression is typically the log loss or
cross-entropy loss, which penalizes the model based on the
difference between predicted probabilities and actual classes.
4. Prediction:
For new data points, the trained model calculates the probability of
belonging to the positive class using the learned weights and
biases.
5. Evaluation:
Assess the model's performance using metrics like accuracy,
precision, recall, F1-score, ROC curve, and AUC (Area Under the
Curve).
Logistic Regression, despite its simplicity, remains a widely used and powerful
algorithm due to its interpretability, ease of implementation, and effectiveness in
handling binary classification tasks.
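To make the sigmoid and log-loss ideas above concrete, here is a small NumPy sketch with made-up scores and labels, computing predicted probabilities and the cross-entropy cost; it is illustrative only, not part of the original notes.
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Made-up linear scores z = w·x + b and true class labels
z = np.array([2.0, -1.0, 0.5, -3.0])
y_true = np.array([1, 0, 1, 0])

p = sigmoid(z)  # predicted probability of the positive class

# Log loss (cross-entropy): penalizes confident wrong predictions heavily
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print("probabilities:", np.round(p, 3), "log loss:", round(log_loss, 3))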
MODULE 03
SUPERVISED LEARNING DECISION FRAMING
Logistic regression
1. Binary Classification:
Logistic regression predicts the probability that a given input
belongs to one of two classes, typically represented as 0 or 1.
2. Sigmoid Function (Logistic Function):
The logistic regression model uses the sigmoid function to map
input values to a probability between 0 and 1:
P(Y=1|X) = 1 / (1 + e^(−z))
Here, P(Y=1|X) represents the probability that the dependent
variable Y equals 1 given the independent variables X, and z is
the linear combination of the input features and their corresponding
weights plus a bias term.
3. Decision Boundary:
Logistic regression uses a threshold (often 0.5) to make decisions. If
the predicted probability is greater than the threshold, the input is
classified into class 1; otherwise, it's classified into class 0.
4. Maximum Likelihood Estimation:
Logistic regression employs maximum likelihood estimation to find
the optimal weights that maximize the likelihood of observing the
given data under the model.
5. Regularization:
Regularization techniques like L1 (Lasso) or L2 (Ridge)
regularization can be applied to prevent overfitting by penalizing
large coefficients.
1. Data Preprocessing:
Clean and preprocess the data, handle missing values, encode
categorical variables, and perform feature scaling if required.
2. Model Training:
Initialize the model with random weights and biases.
Use optimization techniques like gradient descent or more
advanced optimizers to update the model parameters to minimize
the cost function.
3. Cost Function (Log Loss):
The cost function for logistic regression is typically the log loss or
cross-entropy loss, which penalizes the model based on the
difference between predicted probabilities and actual classes.
4. Prediction:
For new data points, the trained model calculates the probability of
belonging to the positive class using the learned weights and
biases.
5. Evaluation:
Assess the model's performance using metrics like accuracy,
precision, recall, F1-score, ROC curve, and AUC (Area Under the
Curve).
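One possible end-to-end version of this workflow uses scikit-learn's LogisticRegression with L2 regularization; the breast-cancer dataset, the 75/25 split, and C = 1.0 are illustrative choices, not values prescribed by the notes.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# 1. Data preprocessing: feature scaling handled inside the pipeline
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 2-3. Train with the log-loss objective; C is the inverse of the L2 regularization strength
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
model.fit(X_train, y_train)

# 4-5. Predict probabilities, apply the default 0.5 threshold, and evaluate
proba = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, proba))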
SVM
SUPPORT VECTOR MACHINE
Support Vector Machine (SVM) is a powerful supervised learning algorithm used
for classification and regression tasks. SVMs are particularly effective in high-
dimensional spaces and in cases where the data is not easily separable by a
linear boundary.
1. Linear Separability:
SVM works on the principle of finding the optimal hyperplane that
best separates the data into different classes. In two dimensions,
this hyperplane is a line, and in higher dimensions, it becomes a
hyperplane.
2. Maximizing Margin:
The optimal hyperplane in SVM is the one that maximizes the
margin, which is the distance between the hyperplane and the
nearest data points from both classes. These nearest data points
are known as support vectors.
3. Kernel Trick:
SVMs can handle non-linearly separable data by transforming the
input space into a higher-dimensional space using kernel functions
such as polynomial kernel, radial basis function (RBF) kernel, or
sigmoid kernel. This transformation allows SVM to find a hyperplane
in the higher-dimensional space where the data becomes linearly
separable.
4. C and Gamma Parameters:
In SVM, the regularization parameter 'C' controls the trade-off
between maximizing the margin and minimizing the classification
error. A smaller 'C' value allows for a wider margin but may lead to
misclassifications. A larger 'C' value imposes a smaller margin but
reduces misclassifications.
The 'gamma' parameter influences the reach of the kernel and
determines the flexibility of the decision boundary. Higher values of
'gamma' tend to fit the training data more precisely, potentially
leading to overfitting.
5. Support Vectors:
Support vectors are the data points closest to the hyperplane and
are crucial for defining the decision boundary. They determine the
margin and the orientation of the separating hyperplane.
1. Data Preprocessing:
Clean and preprocess the data, handle missing values, encode
categorical variables, and perform feature scaling if required.
2. Model Training:
Choose an appropriate kernel function (linear, polynomial, RBF,
etc.).
Train the SVM model by finding the hyperplane that maximizes the
margin or minimizes classification errors.
3. Tuning Hyperparameters:
Fine-tune hyperparameters like 'C', 'gamma', and the choice of
kernel through cross-validation to optimize model performance.
4. Prediction:
Use the trained SVM model to predict the classes for new, unseen
data.
5. Evaluation:
Assess the model's performance using metrics like accuracy,
precision, recall, F1-score, ROC curve, and AUC.
Support Vector Machines are powerful tools for both classification and regression
tasks, known for their ability to handle complex data and find optimal separation
boundaries in various domains. However, they might require careful parameter
tuning and might be computationally expensive for large datasets.
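A hedged sketch of the SVM workflow using scikit-learn's SVC with an RBF kernel; the grid of C and gamma values and the Iris dataset are arbitrary examples chosen for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# RBF-kernel SVM; feature scaling matters because SVM relies on distances
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Tune C (margin vs. misclassification trade-off) and gamma (kernel reach) by cross-validation
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))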
Decision Tree
Decision Trees are versatile supervised learning algorithms used for both
classification and regression tasks. They represent a tree-like structure where
each internal node represents a feature, each branch represents a decision rule,
and each leaf node represents the outcome or class label. Decision Trees aim to
create a model that predicts the value of a target variable by learning simple
decision rules inferred from the data features.
1. Tree Structure:
Decision Trees organize data into a tree structure where each
internal node represents a test on an attribute, each branch
corresponds to an outcome of the test, and each leaf node
represents a class label or a value.
2. Splitting Criteria:
The decision-making process involves selecting the best attribute at
each node to split the data. Various splitting criteria such as Gini
impurity, entropy, or information gain measure the homogeneity of
the target variable after the split.
3. Tree Pruning:
Decision Trees tend to overfit the training data. Pruning methods
(pre-pruning and post-pruning) are used to prevent overfitting by
stopping the tree's growth early or by removing nodes that do not
add significant value to the model.
4. Handling Categorical and Numerical Data:
Decision Trees can handle both categorical and numerical data. For
numerical attributes, they determine the best split point based on a
threshold, while for categorical attributes, they create separate
branches for each category.
5. Ensemble Methods (Random Forests, Gradient Boosting):
Ensemble methods like Random Forests and Gradient Boosting are
extensions of Decision Trees that combine multiple trees to improve
predictive accuracy and generalize better to new data.
1. Data Preprocessing:
Clean and preprocess the data, handle missing values, encode
categorical variables, and perform feature scaling if required.
2. Tree Construction:
Choose a splitting criterion (Gini impurity, entropy, etc.) and grow
the tree by recursively selecting the best feature to split the data
until a stopping criterion is met (e.g., maximum depth, minimum
samples per leaf).
3. Tree Pruning:
Apply pruning techniques to reduce overfitting by removing
unnecessary branches or nodes.
4. Prediction:
Use the trained decision tree to predict the target variable for new
instances by traversing the tree according to the learned decision
rules.
5. Evaluation:
Assess the model's performance using metrics like accuracy,
precision, recall, F1-score, confusion matrix, etc., on a validation or
test dataset.
Decision Trees are intuitive, interpretable, and suitable for both categorical and
numerical data. They provide insights into the decision-making process and are
widely used due to their simplicity and effectiveness in handling complex
classification and regression problems.
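A small scikit-learn sketch of the construction-and-pruning workflow described above; max_depth and min_samples_leaf are illustrative pre-pruning settings, and the Iris dataset is only an example.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

# Grow the tree with Gini impurity; depth and leaf limits act as pre-pruning
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=7)
tree.fit(X_train, y_train)

# Inspect the learned decision rules and evaluate on held-out data
print(export_text(tree, feature_names=list(load_iris().feature_names)))
print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))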
Fundamentals of Ensembling
Ensembling is a machine learning technique that combines multiple individual
models (called base learners or weak learners) to improve predictive
performance compared to using just a single model. Ensembling methods aim to
reduce bias, variance, and overfitting while enhancing the overall predictive
power of the model.
Fundamentals of Ensembling:
1. Base Learners:
Base learners are individual models that can be of the same or
different types. These models can be decision trees, neural
networks, support vector machines, or any other machine learning
algorithm.
2. Ensemble Methods:
There are several ensemble methods, with the two primary types
being:
Bagging (Bootstrap Aggregating): Bagging involves
training multiple instances of the same base learner on
different subsets of the training data, using bootstrapping
(sampling with replacement), and combining their predictions
through averaging (for regression) or voting (for
classification). Random Forest is an example of a bagging-
based ensemble using decision trees.
Boosting: Boosting aims to sequentially train models by
focusing on instances that were previously misclassified or
had higher errors. It iteratively builds a strong model by
combining several weak models. AdaBoost (Adaptive
Boosting) and Gradient Boosting Machines (GBM) are popular
boosting algorithms.
3. Voting/Aggregation:
Ensembling techniques typically combine predictions from multiple
base learners. For classification, they use voting (majority or
weighted) among the predictions from individual models. For
regression, they often take the average or weighted average of
predictions.
4. Diversity in Ensembling:
Diversity among base learners is essential in ensembling. Models
that perform differently on subsets of the data or capture different
aspects of the problem provide more robust predictions when
combined. Ensemble methods encourage diversity to avoid
overfitting and improve generalization.
5. Ensemble Performance:
Ensembling generally leads to better performance compared to
individual models, provided the base learners are sufficiently
diverse and sufficiently accurate. However, there is a trade-off
between model complexity and interpretability.
6. Stacking (Stacked Generalization):
Stacking involves training multiple base learners and using a meta-
learner (another model, often a simple linear model) that learns to
combine the predictions of the base learners. It leverages the
strengths of different models and improves predictive performance
further.
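As a hedged illustration of stacking, the following sketch combines two different base learners with a logistic-regression meta-learner via scikit-learn's StackingClassifier; the choice of estimators and dataset is arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Diverse base learners: a bagged tree ensemble and a kernel SVM
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# A simple meta-learner combines the base learners' predictions
stack = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression(max_iter=1000))

print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())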
Advantages of Ensembling: reduced bias and variance, less overfitting, and stronger predictive performance than any single base learner.
Applications of Ensembling: the same classification and regression problems covered in these notes, wherever a single model's accuracy or stability is insufficient.
BAGGING
BAGGING (Bootstrap Aggregating) is an ensemble learning technique used to
improve the accuracy and robustness of machine learning models by reducing
variance and overfitting. It involves training multiple instances of the same base
learning algorithm on different subsets of the training data, employing
bootstrapping (sampling with replacement), and then aggregating their
predictions to make final decisions.
Advantages of BAGGING: by averaging many models trained on different bootstrap samples, bagging reduces variance, limits overfitting, and makes predictions more robust.
Random Forest:
Random Forest is a popular ensemble learning method that uses
BAGGING with decision trees as base learners. It creates an
ensemble of decision trees, each trained on a different bootstrap
sample of the training data and uses the average (for regression) or
majority voting (for classification) of predictions from individual
trees.
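A short sketch contrasting a single decision tree with a bagged Random Forest in scikit-learn; the dataset, number of trees, and cross-validation setup are illustrative choices.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# One high-variance tree vs. an averaged ensemble of bootstrapped trees
single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("single tree CV accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())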
Applications of BAGGING: the most prominent application is the Random Forest algorithm described above; bagging can also be applied to other high-variance base learners.
Advantages of BOOSTING: boosting reduces bias by sequentially combining weak learners, with each new learner concentrating on the examples its predecessors misclassified.
LIBRARIES FOR SUPERVISED LEARNING
1. scikit-learn:
Overview: Scikit-learn is one of the most widely used machine
learning libraries in Python, providing a comprehensive set of tools
for various supervised learning tasks.
Features: It includes a wide range of supervised learning
algorithms for classification, regression, clustering, dimensionality
reduction, model selection, and preprocessing techniques.
Algorithms: Support Vector Machines (SVM), Decision Trees,
Random Forests, Gradient Boosting, Naive Bayes, k-Nearest
Neighbors (k-NN), Logistic Regression, etc.
2. TensorFlow:
Overview: TensorFlow is an open-source deep learning framework
developed by Google, widely used for building neural networks and
deep learning models.
Features: Supports building and training various neural network
architectures, including convolutional neural networks (CNNs),
recurrent neural networks (RNNs), and more.
Applications: Image and speech recognition, natural language
processing, and more.
3. Keras:
Overview: Keras is a high-level neural network API that runs on top
of TensorFlow, designed to be user-friendly and easy to use for
rapid prototyping of deep learning models.
Features: Provides a simple and intuitive interface for building and
training neural networks, allowing for fast experimentation with
architectures.
4. PyTorch:
Overview: PyTorch is another popular deep learning framework
known for its flexibility and ease of use in building and training
neural networks.
Features: Offers dynamic computation graphs, making it easier to
build and debug models. Widely used in research and industry for
deep learning tasks.
5. XGBoost:
Overview: XGBoost (eXtreme Gradient Boosting) is an optimized
gradient boosting library that is highly efficient and provides high-
performance implementations of gradient boosting algorithms.
Features: Known for its speed and performance improvements
over traditional gradient boosting algorithms. Widely used in data
competitions and various domains.
6. LightGBM and CatBoost:
LightGBM: Developed by Microsoft, LightGBM is a gradient
boosting framework known for its high efficiency, low memory
usage, and speed.
CatBoost: CatBoost is a gradient boosting library developed by
Yandex, designed to handle categorical features efficiently without
the need for preprocessing.
These libraries offer a wide range of functionalities and support for supervised
learning tasks, providing implementations of various algorithms and tools to
preprocess data, build models, and evaluate performance. The choice of library
often depends on the specific requirements of the task, ease of use, performance
considerations, and personal preferences.
NUMPY
NumPy, short for Numerical Python, is a fundamental library in Python used for
numerical computing. It provides support for large, multi-dimensional arrays and
matrices, along with a collection of mathematical functions to operate on these
arrays efficiently. NumPy is a cornerstone library for scientific computing in
Python and is extensively used in data science, machine learning, and various
domains that involve numerical computations.
Example Code:
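The original example code was not included in these notes; the following short NumPy sketch illustrates the array creation and vectorized operations described above, using made-up values.
import numpy as np

# Create a 1-D array and a 2-D matrix
a = np.array([1, 2, 3, 4])
m = np.arange(12).reshape(3, 4)

# Vectorized arithmetic and aggregation
print(a * 2)            # element-wise multiplication
print(m.sum(axis=0))    # column sums
print(a.mean(), a.std())

# Basic linear algebra: matrix-vector product
print(m @ a)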
Applications of NumPy:
Data manipulation and analysis in scientific computing, statistics, and
machine learning.
Linear algebra operations, matrix computations, and solving mathematical
equations.
Signal processing, image processing, and numerical simulations.
PANDAS
Pandas is a powerful open-source library in Python used for data
manipulation and analysis. It provides easy-to-use data structures and
functionalities for working with structured data, making it a fundamental
tool for data scientists, analysts, and developers dealing with tabular and
time series data.
Example Code:
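The example code is missing from these notes; the sketch below, built around a small made-up table, shows typical DataFrame creation, filtering, and grouping in Pandas. The file paths in the commented lines are placeholders.
import pandas as pd

# Small, made-up tabular dataset
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "sales": [250, 180, 300, 90],
    "year": [2022, 2022, 2023, 2023],
})

# Inspect, filter, and aggregate
print(df.head())
print(df[df["sales"] > 150])               # rows with sales above 150
print(df.groupby("region")["sales"].sum()) # total sales per region

# Reading/writing CSV files (paths are placeholders)
# df.to_csv("sales.csv", index=False)
# df2 = pd.read_csv("sales.csv")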
Applications of Pandas:
Data cleaning, preparation, and preprocessing in data science and
analytics workflows.
Exploratory data analysis (EDA) and visualization of datasets.
Handling and manipulation of structured data from various sources
such as databases, CSV files, Excel files, etc.
Time series analysis and manipulation in finance, economics, and
other fields.
SCIKIT LEARN
Scikit-learn, often referred to as sklearn, is a powerful open-source
machine learning library in Python that provides simple and efficient tools
for data mining and data analysis. It is built on top of other scientific
computing libraries such as NumPy, SciPy, and matplotlib and offers a
wide array of machine learning algorithms and utilities for various tasks
related to supervised and unsupervised learning, clustering,
dimensionality reduction, model selection, and more.
Example Code:
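Since the example code was not preserved, here is a hedged sketch of a typical scikit-learn workflow (load data, split, fit, evaluate) using a Random Forest on the built-in Iris dataset; the dataset and model are arbitrary illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load a built-in dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a model and report standard classification metrics
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))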
Applications of Scikit-learn:
Scikit-learn is widely used in various fields for machine learning
tasks, including:
Classification and regression in finance, healthcare, and
marketing.
Natural language processing (NLP) for text analysis and
sentiment analysis.
Image processing and computer vision.
Recommender systems.
Clustering and anomaly detection.
NLTK
NLTK (Natural Language Toolkit) is a leading platform in Python for working with
human language data and performing various natural language processing (NLP)
tasks. It provides easy-to-use interfaces to over 50 corpora and lexical resources,
along with a suite of text processing libraries for tasks such as tokenization,
stemming, tagging, parsing, and semantic reasoning.
Example Code:
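The example code is missing here; the following sketch shows the tokenization, stop-word removal, stemming, and part-of-speech tagging steps mentioned above. The nltk.download calls fetch resources on first run, and the exact resource names can vary slightly between NLTK versions.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads of tokenizer models, stop words, and the POS tagger
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")

text = "NLTK makes it easy to tokenize, tag, and analyze natural language text."

tokens = word_tokenize(text)                       # tokenization
stop_set = set(stopwords.words("english"))
content = [t for t in tokens if t.lower() not in stop_set and t.isalpha()]

stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content]         # stemming
tags = nltk.pos_tag(content)                       # part-of-speech tagging

print(content)
print(stems)
print(tags)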
Applications of NLTK:
NLTK finds applications in various domains, including:
Sentiment analysis and opinion mining from social media or
product reviews.
Information retrieval and extraction from text data.
Machine translation and language understanding.
Building chatbots and conversational agents.
Educational purposes and academic research in linguistics and
NLP.
CARET
caret (Classification And REgression Training) is a comprehensive and
versatile R package that serves as a unified interface for training and
tuning various machine learning models. It provides a consistent
framework to streamline the process of model building, evaluation, and
selection. caret is especially useful for beginners and experts alike,
offering a wide range of algorithms and tools for predictive modeling.
RANDOM FOREST
RandomForest is an R package and a powerful machine learning algorithm
used for both classification and regression tasks. It constructs multiple
decision trees during training and merges their predictions to improve
accuracy and reduce overfitting, making it robust for various datasets and
predictive tasks.
MICE PACKAGE
The mice package in R is a powerful tool used for imputing missing values
in datasets through multiple imputation by chained equations (MICE)
methodology. Handling missing data is crucial in statistical analysis and
machine learning tasks, and the mice package offers a comprehensive
framework for imputing missing values using a sophisticated imputation
algorithm.
The mice package is widely used for imputing missing values and is
valuable in ensuring that the statistical analyses and machine learning
models built on incomplete datasets maintain accuracy and robustness.