ML NOTES
Gunjal
1. Data Collection and Preprocessing: Gathering and preparing data for analysis.
2. Model Building: Training algorithms to understand patterns in the data.
3. Evaluation and Deployment: Testing the model's accuracy and deploying it for real-world use.
Supervised Learning
• Definition: The model is trained on labeled data, where both input and the corresponding
output are provided. The goal is to learn the mapping between inputs and outputs.
• Examples:
o Predicting house prices (input: features like area, location; output: price).
o Classifying emails as spam or not spam.
• Common Algorithms:
o Regression: Linear Regression, Logistic Regression.
o Classification: Decision Trees, Support Vector Machines (SVM), Random Forests,
Neural Networks.
Unsupervised Learning
• Definition: The model is trained on unlabeled data, where only the input is provided. The goal
is to find hidden patterns or groupings in the data.
• Examples:
o Customer segmentation in marketing.
o Dimensionality reduction using PCA (Principal Component Analysis).
• Common Algorithms:
o Clustering: K-Means, Hierarchical Clustering, DBSCAN.
o Dimensionality Reduction: PCA, t-SNE.
Reinforcement Learning
• Definition: The model learns by interacting with an environment and receiving feedback in
the form of rewards or penalties. The goal is to take actions that maximize cumulative
rewards.
• Examples:
o Game-playing AI (e.g., AlphaGo, chess engines).
o Self-driving cars optimizing routes.
• Key Components:
o Agent: The decision-maker (e.g., the model).
o Environment: Where the agent operates (e.g., the game or driving conditions).
o Actions, Rewards, and States.
Each type of ML plays a crucial role in advancing AI applications in fields like healthcare, finance, robotics,
and natural language processing.
Overfitting
• Definition: The model learns not only the underlying pattern in the training data but also
noise and outliers. This leads to excellent performance on training data but poor performance
on unseen data.
• Characteristics:
o High accuracy on training data.
o Low accuracy on validation/test data.
o Model is too complex (e.g., too many parameters).
Example:
• Dataset: Predict house prices based on features like size, location, and number of bedrooms.
• Scenario:
o A decision tree model with depth=20 fits every minor fluctuation in the training data.
o Result:
▪ Training accuracy: 98%
▪ Validation accuracy: 65%
o Reason: The model memorized the data rather than learning the general pattern.
Solution:
• Simplify the model (e.g., reduce depth of decision trees, add regularization).
• Use cross-validation.
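A minimal scikit-learn sketch of this effect, comparing a very deep tree against a shallower one (the synthetic dataset and depth values are illustrative assumptions, not from these notes):

# Compare a deep (overfit-prone) tree against a shallower, regularized one.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in (20, 4):  # very deep tree vs. a simpler one
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(f"depth={depth}: train R^2={tree.score(X_train, y_train):.2f}, "
          f"test R^2={tree.score(X_test, y_test):.2f}")

The deep tree typically scores much higher on the training split than on the test split, which is the overfitting gap described above.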
Underfitting
• Definition: The model is too simple to capture the underlying pattern in the data. It performs
poorly on both training and validation/test data.
• Characteristics:
o Low accuracy on training data.
o Low accuracy on validation/test data.
o Model lacks complexity (e.g., too few parameters).
Example:
• Dataset: Predict house prices based on features like size, location, and number of bedrooms.
• Scenario:
o A linear regression model tries to fit a non-linear relationship between features and
target.
o Result:
▪ Training accuracy: 50%
▪ Validation accuracy: 45%
o Reason: The model fails to capture the complex relationship between features and
target.
Solution:
• Increase model complexity (e.g., use a non-linear model or add polynomial features).
• Add more informative features or reduce regularization.
1. Bias
• Definition: Bias refers to the error due to overly simplistic assumptions in the model. High
bias can cause the model to miss important relationships in the data, leading to underfitting.
• Characteristics:
o The model is too simple.
o High training error and high validation/test error.
o Cannot capture the underlying patterns in the data.
Example:
• Dataset: Predict house prices based on features like size, location, and number of bedrooms.
• Model: Linear regression for data with a clear non-linear relationship.
o Prediction fails to capture the curved trend in the data.
o Both training and validation errors are high because the model is biased towards
linearity.
2. Variance
• Definition: Variance refers to the error due to the model's sensitivity to small fluctuations in
the training data. High variance means the model learns noise and performs poorly on unseen
data, leading to overfitting.
• Characteristics:
o The model is overly complex.
o Low training error but high validation/test error.
o Fails to generalize to new data.
Example:
• Dataset: Predict house prices based on features like size, location, and number of bedrooms.
• Model: A decision tree with very high depth.
o The model memorizes the training data but cannot generalize to new data points.
1. Linear Regression
Overview:
Linear Regression is used for predicting continuous numerical values. It models the relationship
between the dependent variable y and one or more independent variables x by fitting a straight
line (or a hyperplane for multiple variables) to the data.
Advantages:
Limitations:
Use Cases:
• Predicting house prices based on features like size, location, and age.
• Forecasting sales based on historical data.
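A minimal sketch with scikit-learn (the feature values and prices are made up purely for illustration):

# Fit a linear regression on toy housing-style data and predict for a new house.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1200, 3], [1500, 3], [1800, 4], [2400, 4]])  # size (sq ft), bedrooms
y = np.array([200000, 250000, 300000, 400000])              # price

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # learned slope(s) and intercept
print(model.predict([[2000, 3]]))      # predicted price for an unseen house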
2. Logistic Regression
Overview:
Logistic Regression is used for binary classification tasks (e.g., yes/no, spam/not spam). It predicts
the probability of the dependent variable belonging to a particular class.
Advantages:
Limitations:
• Assumes a linear relationship between features and the log-odds of the target variable.
• Sensitive to multicollinearity and outliers.
Use Cases:
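A minimal binary-classification sketch with scikit-learn (synthetic data stands in for a real spam/not-spam dataset):

# Logistic regression: predict classes and class probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict(X_test[:5]))        # predicted classes (0/1)
print(clf.predict_proba(X_test[:5]))  # probability of each class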
3. Decision Trees
Overview:
Decision Trees are non-linear models used for both classification and regression tasks. They split
data into subsets based on feature conditions, forming a tree-like structure.
How it Works:
• At each node, the algorithm chooses the feature and split point that best separates the
data using criteria like Gini Impurity or Entropy.
• Stops splitting when leaf nodes meet criteria (e.g., pure class or max depth reached).
Advantages:
Limitations:
Use Cases:
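A minimal decision-tree sketch with scikit-learn (the Iris dataset and max_depth=3 are illustrative choices):

# Train a small decision tree classifier and inspect its learned split rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=0).fit(X, y)
print(export_text(tree))   # human-readable feature conditions at each node
print(tree.predict(X[:3]))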
4. Random Forest
Overview:
Random Forest is an ensemble method that builds multiple decision trees and combines their
outputs (via averaging for regression or voting for classification).
How it Works:
Advantages:
Limitations:
Use Cases:
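A minimal random-forest sketch with scikit-learn (dataset and number of trees chosen only for illustration):

# Random forest: many trees on bootstrapped samples, predictions aggregated by voting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(rf.score(X_test, y_test))      # accuracy on held-out data
print(rf.feature_importances_[:5])   # relative importance of the first few features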
5. Support Vector Machines (SVM)
How it Works:
• Constructs a decision boundary (hyperplane) such that the margin between classes is
maximized.
• Uses kernel functions (e.g., linear, polynomial, RBF) to handle non-linear separations.
Advantages:
Limitations:
Use Cases:
• Image classification.
• Text categorization and sentiment analysis.
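A minimal SVM sketch with scikit-learn using an RBF kernel on a non-linearly separable toy dataset (the dataset and kernel settings are illustrative assumptions):

# SVM with an RBF kernel for a non-linearly separable problem.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(svm.score(X_test, y_test))   # accuracy with the maximum-margin boundary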
6. K-Nearest Neighbors (KNN)
Overview:
KNN is an instance-based algorithm that assigns a data point to the class of its k nearest
neighbors.
How it Works:
• Compute the distance (e.g., Euclidean) between the query point and all data points.
• Find the 𝜿-closest points and assign the majority class (classification) or average value
(regression).
Advantages:
• Simple to implement.
• No training phase; only stores data points.
Limitations:
Use Cases:
• Recommender systems.
• Handwritten digit recognition.
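A minimal KNN sketch with scikit-learn on the digits dataset (k=5 and the Euclidean metric are illustrative defaults):

# KNN: classify a query point by the majority class of its k nearest neighbors.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on handwritten digits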
Summary Table:
1. K-Means Clustering
Overview:
K-Means is one of the simplest and most popular clustering algorithms. It partitions a dataset into
𝑘 clusters, where each cluster is represented by its centroid.
How it Works:
Advantages:
Limitations:
Use Cases:
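A minimal K-Means sketch with scikit-learn (synthetic blob data and k=3 are illustrative assumptions):

# K-Means: partition points into k clusters around learned centroids.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.labels_[:10])       # cluster assignment for the first points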
2. Hierarchical Clustering
Overview:
Hierarchical clustering creates a tree-like structure (dendrogram) that groups data points into clusters
based on their similarity.
How it Works:
Advantages:
Limitations:
Use Cases:
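A minimal sketch of bottom-up (agglomerative) hierarchical clustering with scikit-learn (dataset, cluster count, and Ward linkage are illustrative choices):

# Agglomerative hierarchical clustering: merge the closest clusters step by step.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])   # cluster label per point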
3. DBSCAN
Overview:
DBSCAN is a density-based clustering algorithm that identifies clusters based on regions of high
density and can detect outliers.
How it Works:
Advantages:
Limitations:
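A minimal DBSCAN sketch with scikit-learn (eps and min_samples are illustrative values that suit this toy dataset):

# DBSCAN: clusters dense regions; points in low-density areas are labelled -1 (noise).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; -1 marks outliers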
Principal Component Analysis (PCA)
Overview:
PCA is a dimensionality reduction technique that projects data onto a lower-dimensional space
while retaining as much variance as possible.
How it Works:
Advantages:
Limitations:
Use Cases:
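A minimal PCA sketch with scikit-learn (reducing the 4 Iris features to 2 components is an illustrative choice):

# PCA: project data onto the directions of maximum variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # 4 features -> 2 components
print(X_reduced.shape)
print(pca.explained_variance_ratio_)    # variance retained by each component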
Ensemble Techniques
Ensemble techniques combine predictions from multiple models to improve performance and
generalization compared to individual models. The idea is to leverage the strengths of multiple models
while mitigating their weaknesses.
1. Bagging
• Concept: Trains multiple models independently on bootstrapped (randomly resampled) subsets of the data and aggregates their predictions.
• Goal: Reduces variance and improves stability.
Example Algorithms:
• Random Forest:
o Builds multiple decision trees on bootstrapped datasets.
o Aggregates their predictions (majority vote or average).
Advantages:
2. Boosting
• Concept: Builds models sequentially, where each new model tries to correct the errors of the
previous ones. Models are weighted based on their performance.
• Goal: Reduces bias and creates a strong model from weak learners.
Example Algorithms:
• AdaBoost:
o Assigns weights to data points; misclassified points get higher weights in the next
iteration.
• Gradient Boosting:
o Minimizes the loss function by training models sequentially to correct errors of the
previous model.
• XGBoost:
o An optimized version of Gradient Boosting, faster and more efficient.
Advantages:
Disadvantages:
Example:
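A minimal boosting sketch using scikit-learn's GradientBoostingClassifier (the dataset and hyperparameter values are illustrative assumptions):

# Gradient boosting: trees are added sequentially, each correcting the previous errors.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))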
3. Stacking
• Concept: Trains several base models and uses a meta-model to combine their predictions.
• Goal: Leverages the strengths of different model types.
Advantages:
• Very flexible.
• Can achieve high accuracy.
Disadvantages:
• Computationally expensive.
• Complex to implement and tune.
4. Voting
• Concept: Combines predictions from multiple models and uses majority voting (classification)
or averaging (regression) to make the final prediction.
• Goal: Simple way to aggregate model outputs for better accuracy.
Types:
Advantages:
• Easy to implement.
• Works well with diverse models.
Disadvantages:
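A minimal hard-voting sketch with scikit-learn (the three base models are an illustrative combination):

# Hard voting: each base model casts a vote; the majority class wins.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier()),
                ("knn", KNeighborsClassifier())],
    voting="hard",
)
voter.fit(X, y)
print(voter.predict(X[:5]))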
Performance Metrics
Performance metrics evaluate how well a machine learning model performs on a dataset. The choice of a
metric depends on the type of problem being solved (classification, regression, etc.).
True Positive (TP): The model correctly predicts the positive class.
True Negative (TN): The model correctly predicts the negative class.
False Positive (FP): The model incorrectly predicts the positive class for a negative instance
(Type I Error).
False Negative (FN): The model incorrectly predicts the negative class for a positive instance
(Type II Error).
1. Accuracy
• Definition: The proportion of correct predictions out of all predictions.
• When to Use: Works well when the classes are balanced.
2. Precision
• Definition: The proportion of true positive predictions out of all positive predictions.
• When to Use: Useful when false positives are costly (e.g., spam detection).
3. Recall
• Definition: The proportion of true positive predictions out of all actual positive instances.
• When to Use: Useful when false negatives are costly (e.g., disease detection).
4. F1-Score
• Definition: The harmonic mean of precision and recall.
• When to Use: For imbalanced datasets where both precision and recall are important.
5. ROC-AUC
• Definition: Measures the tradeoff between true positive rate (TPR) and false positive rate
(FPR) at different thresholds.
• When to Use: Evaluates model performance across all classification thresholds.
6. Confusion Matrix
A tabular representation of predictions:
                         Predicted Positive    Predicted Negative
Actual Positive          TP                    FN
Actual Negative          FP                    TN
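A minimal sketch computing these classification metrics with scikit-learn (the label vectors are made-up examples):

# Compute common classification metrics from predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # scikit-learn orders this as [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))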
1. Mean Absolute Error (MAE)
• Definition: The average of absolute differences between actual and predicted values.
2. Mean Squared Error (MSE)
• Definition: The average of squared differences between actual and predicted values.
3. Root Mean Squared Error (RMSE)
• Definition: The square root of MSE.
• When to Use: More interpretable than MSE (in same units as target variable).
4. R-Squared (R²)
• Definition: The proportion of variance in the target variable that is explained by the model.
• When to Use: Measures how well the model fits the data.
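A minimal sketch computing these regression metrics with scikit-learn (the actual and predicted prices are made-up values):

# Compute common regression metrics.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [250000, 300000, 180000, 420000]
y_pred = [240000, 310000, 200000, 400000]

print(mean_absolute_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred) ** 0.5)   # RMSE = square root of MSE
print(r2_score(y_true, y_pred))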
Cross-Validation
Definition:
Cross-validation is a technique to evaluate the performance of a machine learning model by
splitting the data into multiple subsets for training and testing.
Purpose:
• Reduces overfitting.
• Provides reliable performance metrics.
• Ensures efficient use of the dataset.
Types:
• K-Fold Cross-Validation: Splits the data into k folds; each fold is used as a test set once.
• Stratified K-Fold: Maintains class distribution across folds (useful for imbalanced data).
• Leave-One-Out (LOOCV): Each data point is used as a test set once.
• Time Series CV: Ensures training data precedes test data (for temporal data).
• Nested Cross-Validation: Combines inner and outer loops for hyperparameter tuning and
evaluation.
How It Works:
• Split the data into k folds; for each fold, train on the remaining k−1 folds and evaluate on the held-out fold.
• Average the scores across all folds to estimate generalization performance.
Advantages:
• Improves model generalization.
• Reduces bias from specific train-test splits.
• Evaluates model stability.
Disadvantages:
• Computationally expensive.
• May not always be necessary for very large datasets.
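A minimal 5-fold cross-validation sketch with scikit-learn (the model and dataset are illustrative choices):

# 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold, repeat.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged estimate of generalization performance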
Hyperparameter Tuning
Definition:
Hyperparameter tuning is the process of optimizing the hyperparameters of a machine learning
model to improve its performance. Hyperparameters are settings that cannot be learned directly
from the data (e.g., learning rate, number of trees in a random forest).
How It Works:
• Define a search space of candidate hyperparameter values, train and evaluate the model for each candidate (usually with cross-validation), and keep the best-performing combination.
Examples of Hyperparameters:
• Learning rate, number of trees/estimators, maximum tree depth, regularization strength, number of neighbors (k) in KNN.
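A minimal grid-search sketch with scikit-learn (the parameter grid values are illustrative assumptions):

# Grid search: try hyperparameter combinations with cross-validation and keep the best.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # best hyperparameter combination found
print(search.best_score_)    # its cross-validated score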
Model Transform
• Definition: Applies the learned transformation or prediction logic to the data.
• How: Uses the transform() method for transformation tasks (e.g., scaling data) or predict() for
predictions.
• Why Use It:
o To process data using the trained model.
o For prediction or feature transformations.
Model Fit-Transform
• Definition: Combines fit and transform in one step.
• How: Commonly used with preprocessing tools like StandardScaler or PCA.
• Why Use It:
Model Predict
• Definition: Predicts outcomes based on the trained model.
• How: Uses the predict() method.
• Why Use It:
o To generate predictions on unseen (test) data.
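A minimal sketch showing fit_transform, transform, and predict together (the scaler-plus-classifier setup is an illustrative example):

# fit / transform / fit_transform / predict on a simple preprocessing + model workflow.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit (learn mean/std) + transform in one step
X_test_scaled = scaler.transform(X_test)        # reuse the learned scaling on unseen data

model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
print(model.predict(X_test_scaled[:5]))         # predictions for unseen samples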
One Hot Encoder
One Hot Encoder is a technique used in machine learning to convert categorical data into a
numerical format suitable for algorithms. It creates a binary column for each category in a
categorical variable and assigns a 1 or 0 to indicate the presence of a category in a given record.
1. Machine Learning Algorithms Need Numerical Input: Algorithms work better with
numerical data rather than categorical strings.
2. Avoid Ordinal Misinterpretation: Unlike label encoding, one-hot encoding does not imply
any order or hierarchy among the categories.
3. Ensures Better Performance: It allows algorithms to understand categorical distinctions
effectively without introducing bias.
Example:
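A minimal one-hot encoding sketch (the city names are made-up categories; the sparse_output argument assumes scikit-learn 1.2 or newer):

# One-hot encode a categorical column: one binary column per category.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["Pune", "Mumbai", "Pune", "Delhi"]})

encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(df[["city"]])
print(encoder.get_feature_names_out())   # e.g. ['city_Delhi' 'city_Mumbai' 'city_Pune']
print(encoded)                           # rows of 0s and 1s

# pandas offers the same idea via get_dummies:
print(pd.get_dummies(df, columns=["city"]))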