Terms to Review
Convolutional Neural Networks (CNNs)
o Definition: Deep learning models designed for tasks involving structured data like images, videos, and spatial or temporal signals. CNNs are particularly effective for these tasks because they can automatically detect and learn hierarchical patterns in the data, such as edges, textures, and more complex features. In summary, CNNs are versatile and powerful tools in AI/ML, particularly suited for image-related tasks, because they learn spatial hierarchies and patterns directly from data (a minimal code sketch follows the applications list below).
o Convolutional Layers:
Purpose: Extract features from the input data (e.g., an image).
How it works: A small matrix called a filter or kernel slides over the input
data (a process called convolution) to produce a feature map. This
operation highlights specific patterns, such as edges or textures.
Example: A filter might detect horizontal lines in an image.
o Pooling Layers:
Purpose: Reduce the spatial dimensions of the feature maps while
preserving the most important information.
How it works: Aggregates values within a small region of the feature map
(e.g., max-pooling keeps the maximum value in a region).
Benefit: Makes the network more computationally efficient and robust to
small spatial changes in the input.
o Fully Connected Layers:
Purpose: Combine the features learned by previous layers to make
predictions.
How it works: Flattens the feature maps into a 1D vector and passes it
through one or more dense layers to output predictions (e.g., class
probabilities).
o Activation Functions: Non-linear functions (e.g., ReLU, sigmoid, or softmax) are
applied after each layer to introduce non-linearity, allowing the model to learn
complex patterns.
o Feature Hierarchy: Early layers learn low-level features (edges, gradients), while
deeper layers learn high-level features (object shapes, textures).
o Why CNNs are powerful:
Local Feature Learning: Convolution layers focus on local regions, making
CNNs efficient for spatially coherent data.
Parameter Sharing: Filters are reused across the input, significantly
reducing the number of parameters.
Automatic Feature Extraction: CNNs eliminate the need for manual
feature engineering by learning features directly from raw data.
o Applications of CNNs:
Image Classification: Identifying objects in an image (e.g., cats vs. dogs).
Object Detection: Locating and classifying objects in an image (e.g.,
bounding boxes for cars or pedestrians).
Segmentation: Dividing an image into regions based on features (e.g.,
medical imaging).
Speech and Audio Recognition: Extracting spatial features in
spectrograms.
Generative Models: Creating new data (e.g., deepfake images).
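To make the layer types above concrete, here is a minimal sketch of a small image classifier, assuming PyTorch as the framework (the notes do not name one); the filter counts, 32x32 input size, and 10-class output are placeholder choices rather than values from the notes.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutional layers: slide learned filters over the image to produce feature maps.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 16 filters learning low-level patterns
            nn.ReLU(),                                    # non-linearity
            nn.MaxPool2d(2),                              # pooling: halve spatial size, keep max values
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper filters learn higher-level features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fully connected layer: flatten the feature maps and map them to class scores.
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 input images

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.classifier(x)

# Example: a batch of four 32x32 RGB images -> four vectors of 10 class scores.
scores = SmallCNN()(torch.randn(4, 3, 32, 32))
print(scores.shape)  # torch.Size([4, 10])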
Learning rate selection
o Definition: The learning rate controls the size of the steps the optimization
algorithm takes when updating model parameters during training.
o Importance: A learning rate that is too high can cause the training to overshoot
minima, leading to instability. A learning rate that is too low can result in slow
convergence or getting stuck in local minima.
o Strategies: Use a fixed learning rate or a dynamic schedule (e.g., reducing it as
training progresses). Techniques like learning rate decay, warm restarts, or
adaptive optimizers (e.g., Adam) adjust the learning rate during training.
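A minimal sketch of these strategies, assuming PyTorch (not specified in the notes); the placeholder model, the 0.01 and 0.001 rates, and the step-decay schedule are illustrative choices only.

from torch import nn, optim

model = nn.Linear(10, 1)  # placeholder model

# Fixed learning rate: every parameter update takes a step of the same scale.
sgd = optim.SGD(model.parameters(), lr=0.01)

# Learning rate decay: multiply the rate by 0.1 every 30 scheduler steps (e.g., epochs).
scheduler = optim.lr_scheduler.StepLR(sgd, step_size=30, gamma=0.1)

# Adaptive optimizer: Adam scales each parameter's step using running gradient statistics.
adam = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(90):
    # ... forward pass, loss.backward(), and sgd.step() would run here ...
    scheduler.step()  # advance the decay schedule once per epoch

print(sgd.param_groups[0]["lr"])  # decayed from 0.01 to 1e-05 after three reductions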
Hyperparameter tuning
o Definition: Hyperparameter tuning involves selecting the best set of
hyperparameters (configurations) that are not learned from the data during
training but affect model performance.
o Examples of Hyperparameters: Learning rate, batch size, number of layers,
dropout rate, regularization coefficients.
o Techniques:
Grid search: Exhaustive search over a predefined set of hyperparameters.
Random search: Random sampling of hyperparameters from
distributions.
Bayesian optimization or automated methods like Hyperband.
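As one possible illustration of grid search, here is a short scikit-learn sketch (library, toy dataset, and parameter values are assumptions, not from the notes); RandomizedSearchCV follows the same pattern for random search.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)  # toy data

# Grid search: exhaustively try every combination of these hyperparameter values,
# scoring each candidate with 5-fold cross-validation.
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)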
Dropout and batch normalization
o Dropout:
Definition: A regularization technique where a fraction of nodes in a layer
are randomly ignored ("dropped out") during training to prevent
overfitting.
Impact: Encourages the network to rely on multiple pathways rather than
overfitting to specific features.
Key Parameter: Dropout rate (e.g., 0.5 means 50% of neurons are
randomly dropped).
o Batch Normalization:
Definition: A technique to normalize the inputs to a layer across a mini-
batch, ensuring consistent distribution and reducing internal covariate
shift.
Benefits:
Accelerates convergence.
Allows for higher learning rates.
Reduces sensitivity to initialization.
Mechanism: Normalizes inputs to have a mean of 0 and a standard
deviation of 1, followed by learnable scaling and shifting parameters.
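A minimal sketch of both techniques in one layer stack, assuming PyTorch; the layer widths and the 0.5 dropout rate are illustrative values.

import torch
from torch import nn

# Dropout randomly zeroes 50% of activations during training; BatchNorm normalizes each
# feature across the mini-batch to mean 0 / std 1, then applies a learnable scale and shift.
layer = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # dropout rate = 0.5
    nn.Linear(64, 2),
)

x = torch.randn(32, 20)   # mini-batch of 32 examples
layer.train()             # dropout active, batch statistics used
train_out = layer(x)
layer.eval()              # dropout disabled, running statistics used
eval_out = layer(x)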
Regularization strategies
o Definition: Techniques used to prevent overfitting by penalizing complex models
and reducing their ability to memorize the training data.
o Common Strategies:
L1 Regularization: Adds a penalty proportional to the absolute value of
weights (encourages sparsity in weights).
L2 Regularization (Ridge): Adds a penalty proportional to the squared
value of weights (prevents large weight magnitudes).
Dropout: Randomly ignoring neurons during training (see Dropout
above).
Early Stopping: Halting training when validation performance stops
improving.
Data Augmentation: Increasing training data variety through
transformations (e.g., flipping, cropping, etc.).
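A small NumPy sketch of how L1 and L2 penalties are added to a base loss; the weights, loss value, and penalty coefficients below are made-up illustrative numbers.

import numpy as np

def regularized_loss(weights, data_loss, l1=0.0, l2=0.0):
    """Add an L1 (sparsity-encouraging) and/or L2 (magnitude-shrinking) penalty to a base loss."""
    return data_loss + l1 * np.sum(np.abs(weights)) + l2 * np.sum(weights ** 2)

w = np.array([0.5, -2.0, 0.0, 3.0])
print(regularized_loss(w, data_loss=1.25, l1=0.01))  # L1 penalty only
print(regularized_loss(w, data_loss=1.25, l2=0.01))  # L2 penalty only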
Loss function selection
o Definition: The loss function measures the difference between the model's
predictions and the true target values, guiding the optimization process.
o Types of Loss Functions:
Regression Tasks: Mean Squared Error (MSE), Mean Absolute Error
(MAE).
Classification Tasks: Cross-entropy loss, Hinge loss.
Custom Loss Functions: Designed for specific tasks, like IoU loss for object
detection or attention-based losses.
o Importance: The choice of loss function depends on the task, as it directly affects
model performance and optimization.
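To make the regression/classification distinction concrete, here is a small NumPy sketch of MSE and binary cross-entropy; the example targets and predictions are arbitrary.

import numpy as np

def mse(y_true, y_pred):
    """Mean squared error for regression: average squared difference."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross-entropy for binary classification: penalizes confident wrong probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse(np.array([2.0, 3.0]), np.array([2.5, 2.0])))               # 0.625
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))  # small loss for good predictions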
Network optimization
o Definition: The process of fine-tuning the model’s weights and biases to
minimize the loss function during training.
o Core Techniques:
Gradient Descent: A method to iteratively adjust parameters to minimize
the loss function.
Variants:
Stochastic Gradient Descent (SGD): Updates weights using a single
data point.
Mini-batch SGD: Updates weights using small batches of data.
Adaptive Methods: Adam, RMSProp, Adagrad—optimizers that
adapt learning rates based on gradients or past updates.
Key Components:
Learning rate (see Learning Rate Selection above).
Momentum: Helps smooth updates and prevent oscillations.
Weight initialization: Proper initialization can prevent
vanishing/exploding gradients.
Backpropagation:
o Definition: The algorithm that computes the gradient of the loss with respect to every weight in the network by applying the chain rule backward through the layers (see Chain Rule below); these gradients drive the parameter updates made by gradient descent.
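A minimal NumPy sketch of full-batch gradient descent on a one-parameter linear model (the toy data and the 0.1 learning rate are assumptions); SGD and mini-batch variants would compute the same gradient on single points or small batches instead of the whole dataset.

import numpy as np

# Toy data generated from y = 3x + noise; gradient descent should recover w close to 3.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w, lr = 0.0, 0.1  # initial weight and learning rate
for step in range(100):
    grad = np.mean(2 * (w * x - y) * x)  # d/dw of the MSE loss
    w -= lr * grad                       # gradient descent update
print(round(w, 2))  # close to 3.0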
Algorithms:
1. SVM (Support Vector Machine)
Definition: A supervised learning algorithm used for classification and regression tasks.
Key Idea: Finds the hyperplane that best separates data into classes in a high-
dimensional space.
Key Components:
o Support Vectors: Data points closest to the hyperplane that influence its
position.
o Kernel Trick: Allows SVM to operate in a transformed feature space for handling
non-linear relationships.
Applications: Text classification, image recognition, bioinformatics.
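A short scikit-learn sketch of an SVM with an RBF kernel on toy, non-linearly separable data (library and dataset are assumptions, not from the notes).

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data; the RBF kernel trick handles it without an explicit mapping.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.support_vectors_.shape)  # the support vectors that define the decision boundary
print(clf.score(X, y))             # training accuracy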
2. RF (Random Forest)
Definition: An ensemble learning method combining multiple decision trees to improve
classification or regression results.
Key Idea: Aggregates predictions from many decision trees (trained on random subsets
of data and features) to reduce overfitting and improve accuracy.
Key Components:
o Bagging: Random subsets of data are used to train each tree.
o Majority Voting or Averaging: Used to combine outputs from individual trees.
Applications: Fraud detection, recommendation systems, feature selection.
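A short scikit-learn sketch of a random forest on synthetic data (an assumed setup); feature_importances_ illustrates the feature-selection use mentioned above.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
# 100 trees, each trained on a bootstrap sample (bagging) with random feature subsets;
# class predictions are combined by majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict(X[:3]))
print(rf.feature_importances_)  # per-feature contribution, useful for feature selection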
3. KNN (K-Nearest Neighbors)
Definition: A simple, non-parametric supervised learning algorithm for classification or
regression.
Key Idea: Classifies a data point based on the majority class of its k nearest neighbors in
feature space.
Key Components:
o Distance Metric: Determines "closeness" (e.g., Euclidean or Manhattan distance).
o Value of k: The number of neighbors considered; too small or too large can affect
performance.
Applications: Handwriting recognition, anomaly detection, recommendation systems.
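A short scikit-learn sketch of KNN with k = 5 and Euclidean distance, using the built-in iris dataset as an assumed example.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# k = 5 neighbors, Euclidean distance; each query point takes the majority class of its neighbors.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
print(knn.predict(X[:3]))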
4. RNN (Recurrent Neural Network)
Definition: A type of neural network designed to handle sequential data, such as time
series or text.
Key Idea: Uses loops in its architecture to maintain "memory" of previous inputs, making
it suitable for sequential and temporal patterns.
Variants:
o LSTM (Long Short-Term Memory): Addresses the vanishing gradient problem by
introducing gating mechanisms.
o GRU (Gated Recurrent Unit): A simpler alternative to LSTMs with comparable
performance.
Applications: Language modeling, speech recognition, time series forecasting.
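A minimal PyTorch sketch of an LSTM processing a batch of sequences (framework, sizes, and random inputs are assumptions); the hidden states it returns are the network's "memory" of each time step.

import torch
from torch import nn

# Sequence input: a batch of 4 sequences, 15 time steps, 8 features per step.
x = torch.randn(4, 15, 8)

# The LSTM keeps a hidden state that is updated at every time step.
lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)  # torch.Size([4, 15, 32]): one hidden vector per time step
print(h_n.shape)      # torch.Size([1, 4, 32]): final hidden state per sequence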
5. AE (Autoencoder)
Definition: An unsupervised learning model used for data compression and feature
learning.
Key Idea: Consists of an encoder and decoder that reconstruct the input data while
reducing its dimensionality.
Key Components:
o Latent Space Representation: Compressed version of input data.
o Loss Function: Measures reconstruction accuracy (e.g., mean squared error).
Applications: Dimensionality reduction, denoising, anomaly detection.
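A minimal PyTorch sketch of an autoencoder with a 32-dimensional latent space; the 784-dimensional input (a flattened 28x28 image) and the layer sizes are assumed for illustration.

import torch
from torch import nn

# Encoder compresses 784-dimensional inputs to a 32-dimensional latent representation;
# the decoder reconstructs the input from that compressed code.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.randn(16, 784)                          # toy batch
reconstruction = decoder(encoder(x))
loss = nn.functional.mse_loss(reconstruction, x)  # reconstruction error to minimize
print(encoder(x).shape, loss.item())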
6. GAN (Generative Adversarial Network)
Definition: A framework consisting of two neural networks (a generator and a
discriminator) that compete with each other to generate realistic data.
Key Idea:
o Generator: Produces fake data from random noise.
o Discriminator: Tries to distinguish between real and fake data.
o Training ends when the generator produces data indistinguishable from real
data.
Applications: Image generation (deepfakes), style transfer, drug discovery.
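A minimal PyTorch sketch of one adversarial loss computation (not a full training loop); the toy 2-D "real" data, network sizes, and latent dimension are assumptions.

import torch
from torch import nn

latent_dim = 16
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))            # noise -> fake sample
discriminator = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # sample -> P(real)

bce = nn.BCELoss()
real = torch.randn(64, 2) + 3.0       # toy "real" data cluster
fake = generator(torch.randn(64, latent_dim))

# The discriminator tries to output 1 for real samples and 0 for fakes.
d_loss = bce(discriminator(real), torch.ones(64, 1)) + bce(discriminator(fake.detach()), torch.zeros(64, 1))
# The generator tries to make the discriminator output 1 for its fakes.
g_loss = bce(discriminator(fake), torch.ones(64, 1))
print(d_loss.item(), g_loss.item())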
Acronym | Full Name                      | Type          | Primary Use
SVM     | Support Vector Machine         | Supervised    | Classification, regression
RF      | Random Forest                  | Supervised    | Classification, regression, feature ranking
KNN     | K-Nearest Neighbors            | Supervised    | Classification, regression
RNN     | Recurrent Neural Network       | Deep Learning | Sequential data (time series, text)
AE      | Autoencoder                    | Unsupervised  | Dimensionality reduction, anomaly detection
GAN     | Generative Adversarial Network | Deep Learning | Data generation, image synthesis
Classifying complex data. (A) Transforming data to enable linear separation of non-linearly
separable raw data. Raw non-linear data are transformed by mapping functions that may include
time, frequency, or other operations, projecting them into a higher-dimensional parameter space
in which they become linearly separable. One example is classifying patients with heart failure
with preserved ejection fraction, whose response to beta-blockers may vary due to obesity, atrial
fibrillation, left ventricular hypertrophy, diabetes, or other factors; transformation to a
higher-dimensional space enables a simple partitioning process. (B) Bias–variance tradeoff. A
model with high bias (straight line) fails to classify appropriately (here, between atrial
fibrillation and normal sinus rhythm) in both the training dataset (5.B.a) and the testing
dataset (5.B.b), leading to frequent prediction errors on other datasets despite low variance. In
contrast, a model with low bias (e.g. due to overtraining) fits the training set well (5.B.c) but
not the testing set (5.B.d), leading to reduced generalization (high variance arising from the
difference between training and validation sets).
*C-statistic:
The C-statistic (also called the concordance statistic) is a measure used to evaluate the
predictive accuracy of a model, particularly in binary classification problems and survival
analysis. It assesses how well a model distinguishes between positive and negative outcomes.
Definition:
The C-statistic is equivalent to the area under the receiver operating characteristic curve
(AUC-ROC).
It represents the probability that a randomly selected positive case (e.g., disease
present) is assigned a higher predicted risk score by the model than a randomly selected
negative case (e.g., disease absent).
Interpretation:
Values range from 0.5 to 1.0:
o C-statistic = 0.5: Model performs no better than random guessing.
o C-statistic = 1.0: Perfect model with complete separation of outcomes.
o Values closer to 1.0 indicate better discrimination.
Formula:
For binary classification:
o C-statistic = (concordant pairs + 0.5 × tied pairs) / (total number of positive-negative pairs)
A concordant pair is one where the model assigns a higher risk score to the positive case
than to the negative case.
A discordant pair occurs when the negative case gets a higher score.
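A small sketch, assuming scikit-learn and NumPy, showing that the AUC-ROC and the concordant-pair formula above agree on a made-up example.

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 0, 0, 0])             # 1 = outcome present, 0 = absent
y_score = np.array([0.9, 0.6, 0.7, 0.3, 0.2])  # model-predicted risk scores

# C-statistic = AUC-ROC: the fraction of positive/negative pairs ranked correctly.
print(roc_auc_score(y_true, y_score))

# Equivalent manual pair count:
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
pairs = [(p, n) for p in pos for n in neg]
concordant = sum(p > n for p, n in pairs)
ties = sum(p == n for p, n in pairs)
print((concordant + 0.5 * ties) / len(pairs))  # same value as roc_auc_score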
*Chain Rule
The chain rule is a fundamental principle in calculus used to compute the derivative of a
function that is composed of other functions. It is widely used in machine learning, especially in
backpropagation, where it allows for the computation of gradients through complex, multi-
layered neural networks.
Conceptual Understanding:
The chain rule states that to find the total rate of change of a function, you:
1. Find the rate of change of the outer function with respect to the inner function.
2. Multiply it by the rate of change of the inner function with respect to its input.
Why It Matters:
The chain rule enables efficient computation of gradients, even for very deep networks.
Without it, calculating gradients for complex, multi-layer functions would be infeasible.
In essence, the chain rule is the backbone of gradient-based optimization in machine
learning.
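A small Python check of the chain rule on the made-up composite f(x) = (3x + 1)^2: the analytic derivative from the two-step rule matches a finite-difference estimate.

def inner(x):
    return 3 * x + 1   # inner function; d(inner)/dx = 3

def outer(u):
    return u ** 2      # outer function; d(outer)/du = 2u

def chain_rule_derivative(x):
    u = inner(x)
    return (2 * u) * 3  # chain rule: d(outer)/du at u, times d(inner)/dx

x = 2.0
analytic = chain_rule_derivative(x)                                 # 6 * (3*2 + 1) = 42
numeric = (outer(inner(x + 1e-6)) - outer(inner(x - 1e-6))) / 2e-6  # central finite difference
print(analytic, round(numeric, 3))                                  # both approximately 42.0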