1940s–1950s: Foundations
1. McCulloch-Pitts Model (1943): Simplified artificial neuron model.
2. Hebbian Learning (1949): Proposed a learning mechanism based on neuron co-activation.
1960s–1970s: Perceptrons and AI Winter
1. Perceptron (1958): Early neural network for linear classification.
2. Limitations: Minsky and Papert (1969) showed that single-layer perceptrons cannot solve the XOR problem.
3. AI Winter: Funding and interest declined due to perceived limitations.
Machine Learning
Machine Learning is a branch of artificial intelligence that allows systems to learn and
improve from data without explicit programming. It focuses on creating algorithms to
identify patterns and make predictions or decisions.
2a) Explain in detail the supervised learning approach by taking a suitable example
Supervised Learning
Trains a model using labeled data where each input corresponds to a known output.
Aims to learn the relationship between inputs and outputs to make
predictions for new data.
Consists of two phases: training (model learns patterns) and testing (model evaluates its
performance).
Predicts outcomes by minimizing the difference between predicted and actual
outputs using a loss function.
Divided into regression (predicts continuous outputs) and classification (predicts
categorical outputs).
Example: Predicting House Prices
Input features include size, location, and number of bedrooms, and the target is the
house price.
A dataset is split into training and testing sets to train and evaluate the model.
The model learns the relationship between features and prices and predicts prices for
unseen data.
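A minimal sketch of this house-price example, assuming scikit-learn is available; all feature values and prices below are made-up placeholders for illustration:

```python
# Minimal sketch: predicting house prices with linear regression.
# Feature values and prices are made-up placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Columns: size (sq. ft.), number of bedrooms, location score
X = np.array([[1200, 2, 7], [1500, 3, 8], [900, 2, 5],
              [2000, 4, 9], [1100, 2, 6], [1700, 3, 7]])
y = np.array([200_000, 260_000, 150_000, 340_000, 180_000, 290_000])  # prices

# Training phase: the model learns the feature-to-price relationship
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Testing phase: evaluate on data the model has not seen
print("Predicted prices:", model.predict(X_test))
print("R^2 on test set:", model.score(X_test, y_test))
```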
Common Algorithms
Linear Regression for continuous predictions.
Logistic Regression for binary classification.
Decision Trees for data splitting based on features.
Support Vector Machines for separating classes with a hyperplane.
Neural Networks for handling complex, non-linear relationships.
Advantages
Produces accurate results when trained on quality data.
Easy to understand and implement for straightforward tasks.
Widely applicable in areas like fraud detection, spam filtering, and predictive
maintenance.
Limitations
Requires a large and accurately labeled dataset.
May overfit, leading to poor performance on unseen data.
Labeling data can be time-consuming and resource-intensive.
Applications
Advantages of Regularization
Reduces overfitting while retaining model accuracy.
Encourages sparsity in features (L1) and prevents large weights (L2).
Enhances model robustness and generalization capabilities.
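A minimal sketch of how the L1 and L2 penalties mentioned above can be added to a loss; the function name and data are illustrative, not a specific library API:

```python
# Minimal sketch: adding L1 (sparsity) and L2 (weight-decay) penalties
# to a mean-squared-error loss. All names and values are illustrative.
import numpy as np

def regularized_mse(w, X, y, l1=0.0, l2=0.0):
    predictions = X @ w
    mse = np.mean((predictions - y) ** 2)
    l1_penalty = l1 * np.sum(np.abs(w))   # encourages sparse weights
    l2_penalty = l2 * np.sum(w ** 2)      # discourages large weights
    return mse + l1_penalty + l2_penalty

w = np.array([0.5, -1.2, 0.0])
X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
y = np.array([1.0, 2.0])
print(regularized_mse(w, X, y, l1=0.01, l2=0.1))
```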
Backpropagation
1. Purpose
Backpropagation is an algorithm used to train neural networks by updating weights and
biases to minimize the loss function.
2. Process Overview
o Involves two main steps: forward propagation and backward
propagation.
o Forward propagation computes the output of the network and calculates
the loss.
o Backward propagation adjusts weights and biases using the gradient of the
loss function.
3. Steps in Backpropagation
o Forward Propagation:
Input data passes through the network layer by layer.
Weighted sums and activation functions are applied to compute
the output.
Loss is calculated using a predefined loss function.
o Backward Propagation:
Gradients of the loss with respect to the output layer
parameters are computed.
Gradients are propagated backward through the network using the
chain rule to compute gradients for each layer.
These gradients indicate how weights and biases in each layer should
be updated.
4. Weight and Bias Updates
o Parameters are updated using gradient descent or its variants:
New Weight = Old Weight - Learning Rate × Gradient.
o The learning rate determines the step size during updates.
5. Key Components
o Loss Function: Measures the difference between predicted and actual
outputs (e.g., Mean Squared Error or Cross-Entropy Loss).
o Activation Function: Introduces non-linearity, enabling the network to learn
complex patterns. Examples include ReLU, Sigmoid, and Tanh.
6. Training Iterations
o The algorithm repeats forward and backward propagation for
multiple epochs until the loss converges or reaches a predefined threshold.
7. Advantages
o Efficiently trains deep networks by distributing the error signal to all layers.
o Can handle large-scale data with the help of optimization techniques.
8. Limitations
o Computationally expensive for large networks.
o Sensitive to vanishing or exploding gradients, especially in very deep networks.
9. Applications
Widely used in training neural networks for tasks like image recognition, natural
language processing, and predictive modeling.
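A minimal NumPy sketch of the forward and backward passes described above, for a tiny one-hidden-layer network with sigmoid activations and mean squared error; all sizes, names, and hyperparameters are illustrative:

```python
# Minimal sketch: forward and backward propagation for a tiny
# one-hidden-layer network with sigmoid activations and MSE loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 samples, 3 input features
y = rng.normal(size=(8, 1))          # target outputs
W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lr = 0.1

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(100):
    # Forward propagation: weighted sums + activations, then the loss
    h = sigmoid(X @ W1 + b1)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # Backward propagation: chain rule, from the output layer back to the input
    d_yhat = 2 * (y_hat - y) / len(X)
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(axis=0, keepdims=True)
    d_h = d_yhat @ W2.T * h * (1 - h)        # sigmoid derivative
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0, keepdims=True)

    # Parameter update: new weight = old weight - learning rate * gradient
    W1, b1 = W1 - lr * dW1, b1 - lr * db1
    W2, b2 = W2 - lr * dW2, b2 - lr * db2
```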
5A] Explain empirical risk minimization. – 10 Marks
Empirical Risk Minimization (ERM) is a principle in machine learning where the
objective is to minimize the average loss over the training data.
It aims to find a model that performs best on the given dataset by minimizing the
empirical risk.
The empirical risk is the average of the loss function applied to all training examples.
The goal of ERM is to find the hypothesis (model) that minimizes the empirical risk,
which is an approximation of the true risk (expected loss over the entire distribution).
A loss function measures how well the model's predictions match the true values.
Common examples include Mean Squared Error (MSE) for regression and Cross-
Entropy Loss for classification problems.
While ERM minimizes the error on the training data, it does not directly
ensure good performance on unseen data, which could lead to overfitting.
ERM focuses only on the training data, which can lead to overfitting if the model is
too complex or underfitting if the model is too simple.
Regularization methods like L1 or L2 regularization can be used alongside ERM to
prevent overfitting by adding penalties for overly complex models.
The true risk (expected risk) is the average loss over the entire distribution of data,
while ERM approximates it with training data, but may not always align with it.
ERM is widely used in supervised learning models, including linear regression,
decision trees, and neural networks.
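A minimal sketch of computing the empirical risk as the average loss over a training set; the hypothesis and data below are illustrative:

```python
# Minimal sketch: empirical risk = average loss over the training examples.
import numpy as np

def empirical_risk(hypothesis, X, y, loss):
    """Average of the loss function over all training examples."""
    return np.mean([loss(hypothesis(x), t) for x, t in zip(X, y)])

squared_error = lambda pred, target: (pred - target) ** 2

# Illustrative data and a simple linear hypothesis h(x) = 2x
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
print(empirical_risk(lambda x: 2 * x, X, y, squared_error))
```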
5B] Explain the challenges that occur in neural network optimization in detail. – 10
Marks
Vanishing and Exploding Gradients: Gradients can become too small or too large
during backpropagation, leading to slow updates or unstable training, respectively.
Local Minima and Saddle Points: The non-convex loss surface of neural networks
can cause the optimization process to get stuck in local minima or saddle points,
preventing the model from reaching the global minimum.
Overfitting and Underfitting:
o Overfitting occurs when the model memorizes the training data and fails to
generalize to new data.
o Underfitting occurs when the model is too simple to capture the underlying
patterns in the data.
High Computational Cost: Training deep networks with many parameters requires
significant computational resources, which can be expensive and time-consuming.
Learning Rate Selection: Choosing the right learning rate is critical. A learning rate
that's too high can cause the model to overshoot the optimal solution, while a rate
that's too low leads to slow convergence.
Overfitting on Small Datasets: Deep networks trained on small datasets are prone to
overfitting, where the model memorizes the data instead of learning general patterns.
Optimization Algorithm Selection: Choosing the right optimizer (e.g.,
SGD, Adam) and tuning its parameters is important for effective training.
Gradient Clipping: In certain networks (e.g., RNNs), exploding gradients are controlled by clipping gradients to a specified threshold (see the sketch after this list).
Noise in the Data: Noisy data can mislead the training process, affecting model
performance. Data cleaning and noise reduction are essential.
Choice of Activation Functions: The choice of activation function influences
training. For example, ReLU can mitigate vanishing gradients but may cause dead
neurons (dying ReLU problem).
Model Initialization: Poor weight initialization can slow convergence or cause
gradient-related issues, requiring careful initialization strategies like Xavier or He
initialization.
Hyperparameter Tuning: The effectiveness of a neural network depends heavily on
selecting the right hyperparameters, which can be computationally expensive to
optimize.
Difficulty in Interpretability: Neural networks are often "black boxes," making it
hard to interpret their decision-making process, which is crucial in sensitive fields.
Batch Normalization Issues: While it speeds up convergence, batch normalization
can introduce challenges during inference, especially with small batch sizes, and
requires tuning.
Generalization vs. Memorization: Striking a balance between model generalization
and memorization of training data is crucial for effective performance on unseen data.
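As a rough illustration of the gradient clipping point above, a norm-based clip in NumPy; the threshold value is arbitrary:

```python
# Minimal sketch: clip a gradient vector to a maximum norm to
# control exploding gradients. The threshold here is arbitrary.
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)   # rescale to the threshold norm
    return grad

g = np.array([30.0, -40.0])               # norm 50, far above the threshold
print(clip_by_norm(g, max_norm=5.0))      # rescaled to norm 5
```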
6A] Explain AdaGrad and write an algorithm for AdaGrad. – 10 Marks
1. AdaGrad is an optimization algorithm designed to adapt the learning rate for each
parameter based on the historical gradients, making it effective in training models
with sparse data.
2. The key feature of AdaGrad is that it adjusts the learning rate for each parameter individually: parameters that receive frequent, large gradients get smaller updates, while parameters that are updated infrequently get relatively larger updates.
3. AdaGrad computes the squared gradients for each parameter at every step and
accumulates them over time. The learning rate is then scaled by the inverse square
root of the accumulated gradient sum, helping the model converge faster on sparse
features.
4. The update rule accumulates the squared gradients for each parameter, G = G + Gradient², and then updates each parameter as: New Parameter = Old Parameter - (Learning Rate / (√G + ε)) × Gradient, where ε is a small constant added for numerical stability.
5. AdaGrad helps adjust the learning rates for different features, making it especially
useful for high-dimensional or sparse datasets, such as in natural language processing
or image recognition tasks.
6. One of the advantages of AdaGrad is that it eliminates the need for manual learning
rate decay since the algorithm adapts the learning rate based on the parameters'
updates over time.
7. However, a disadvantage of AdaGrad is that the learning rate tends to decrease
rapidly as the algorithm progresses, which can slow down convergence in the later
stages of training.
8. AdaGrad is particularly effective in domains with sparse data where certain features
appear less frequently than others, allowing the model to adjust the learning rates
accordingly for better optimization.
9. Compared to Stochastic Gradient Descent (SGD), AdaGrad adjusts the learning rate
for each parameter, allowing it to perform better when dealing with datasets where
feature frequencies vary widely.
10. While AdaGrad is a useful algorithm for sparse data, its rapid learning-rate decay can limit its efficiency in more complex, dense data scenarios. Alternative algorithms like RMSprop and Adam are often preferred to address this limitation.
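A minimal sketch of the AdaGrad algorithm described above, applied to a simple quadratic objective; the objective and hyperparameter values are illustrative:

```python
# Minimal sketch: AdaGrad on a simple quadratic objective.
# The objective and hyperparameters are illustrative.
import numpy as np

def adagrad(grad_fn, theta, lr=0.1, eps=1e-8, steps=100):
    G = np.zeros_like(theta)                   # accumulated squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G += g ** 2                            # accumulate per-parameter history
        theta -= lr * g / (np.sqrt(G) + eps)   # per-parameter scaled update
    return theta

# Example: minimize f(theta) = sum(theta**2); its gradient is 2*theta
theta = adagrad(lambda t: 2 * t, np.array([5.0, -3.0]))
print(theta)   # parameters move toward the minimum at [0, 0]
```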
6B] Explain the Adam algorithm in detail. – 10 Marks
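As a hedged sketch, the standard Adam update combines a momentum-style first-moment estimate with an RMSprop-style second-moment estimate plus bias correction; the objective and chosen hyperparameters below are illustrative:

```python
# Minimal sketch: the Adam update rule with bias-corrected first and
# second moment estimates. The objective and hyperparameters are illustrative.
import numpy as np

def adam(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(theta)   # first moment (running mean of gradients)
    v = np.zeros_like(theta)   # second moment (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Drives theta toward the minimum of f(theta) = sum(theta**2)
print(adam(lambda t: 2 * t, np.array([5.0, -3.0]), lr=0.1))
```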
1. Input Layer:
o The input layer receives the image or data in the form of a multi-dimensional array (e.g., height, width, and depth for color images). The input data passes through the CNN for further processing.
2. Convolutional Layer:
This layer performs the core operation of a CNN. It applies a set of filters (kernels) to the
input image, performing a convolution operation. The filters slide over the image,
computing dot products between the filter and the region of the image it covers,
extracting features such as edges, textures, or patterns.
3. Activation Function:
o After the convolution operation, an activation function is applied, typically
the Rectified Linear Unit (ReLU). This function introduces non-linearity,
enabling the network to learn more complex patterns and representations.
4. Pooling Layer:
o The pooling layer reduces the spatial dimensions (height and width) of the
feature maps while retaining important information. Common types of
pooling include Max Pooling (selects the maximum value in the region) and
Average Pooling (computes the average value). Pooling helps reduce the
computational complexity and prevent overfitting.
5. Fully Connected Layer (Dense Layer):
o This layer connects every neuron in the previous layer to every neuron in the
current layer. It’s used for classification or regression tasks. The fully
connected layer outputs a final prediction or classification, such as
determining the class of the object in an image.
6. Normalization Layer:
o Normalization layers, like Batch Normalization, help to stabilize the learning
process by reducing internal covariate shift. They normalize the input to each
layer to have zero mean and unit variance, speeding up training and improving
performance.
7. Dropout Layer:
o Dropout is a regularization technique where random neurons are "dropped"
(set to zero) during training. This prevents overfitting by ensuring that the
network doesn’t rely too heavily on any single neuron and helps it generalize
better to new data.
8. Flatten Layer:
The flatten layer converts the multi-dimensional output from the convolutional and
pooling layers into a 1D vector. This step is necessary before passing the data into the
fully connected layers, as they require a 1D input.
9. Output Layer:
The output layer generates the final prediction or classification result. In classification
tasks, it often uses a softmax activation function for multi-class problems or a sigmoid
function for binary classification. The output layer size corresponds to the number of
classes or categories in the problem.
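A minimal sketch that stacks the layer types listed above into a small image classifier, assuming the Keras API; the input shape, filter counts, and class count are illustrative:

```python
# Minimal sketch: a small CNN stacking the layer types described above.
# Input shape, filter counts, and class count are illustrative.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),               # input layer: height x width x depth
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution + ReLU activation
    layers.BatchNormalization(),                   # normalization layer
    layers.MaxPooling2D((2, 2)),                   # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # flatten to a 1D vector
    layers.Dense(128, activation="relu"),          # fully connected layer
    layers.Dropout(0.5),                           # dropout for regularization
    layers.Dense(10, activation="softmax"),        # output layer (10 classes)
])
model.summary()
```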
7B] Explain pooling with network representation. – 10 Marks
1. Pooling is a technique used in Convolutional Neural Networks (CNNs) to reduce the
spatial size (height and width) of feature maps, making the network more efficient.
2. There are two main types of pooling:
o Max Pooling: Takes the maximum value from each region of the feature map.
o Average Pooling: Takes the average value from each region of the feature
map.
3. A pooling layer typically uses a small window (e.g., 2x2, 3x3) that slides over the
input feature map.
4. Stride refers to how much the pooling window moves at each step. For example,
with a stride of 2, the window moves two steps at a time.
5. Max pooling helps retain the most significant features, such as edges or textures,
from the input data.
6. Average pooling provides a more generalized representation of the features by
averaging the values in the pooling window.
7. Pooling reduces the spatial dimensions of the input, resulting in fewer parameters
and computations, which speeds up the learning process.
8. Pooling also provides translation invariance, meaning the network becomes less
sensitive to small changes in the position of features within the input.
9. After pooling, the feature map is smaller, retaining only the most important
features for further processing.
10. Pooling helps prevent overfitting by reducing the complexity of the network, making
the model less prone to memorizing specific patterns in the training data.
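A minimal NumPy sketch of 2x2 max pooling with stride 2 on a small feature map; the values are illustrative:

```python
# Minimal sketch: 2x2 max pooling with stride 2 on a 4x4 feature map.
import numpy as np

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 1],
                        [3, 4, 5, 8]])

def max_pool_2x2(x):
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):            # stride of 2 along rows
        for j in range(0, w, 2):        # stride of 2 along columns
            out[i // 2, j // 2] = x[i:i + 2, j:j + 2].max()
    return out

print(max_pool_2x2(feature_map))
# [[6. 4.]
#  [7. 9.]]
```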
1. LeNet (LeNet-5):
o One of the first CNN architectures, designed for digit recognition (e.g.,
MNIST).
o Composed of two convolutional layers, followed by pooling layers and fully
connected layers.
o Simple architecture suitable for small image datasets.
2. AlexNet:
o Introduced in 2012 and won the ImageNet challenge.
o Consists of five convolutional layers and three fully connected layers.
o Uses ReLU activation, dropout, and data augmentation to improve training
efficiency.
3. VGGNet (VGG16/VGG19):
Known for its deep architecture with 16 or 19 layers.
o Uses 3x3 convolution filters stacked on top of each other.
o Simple but deep, it is easy to understand and apply.
4. GoogLeNet (Inception):
o Uses inception modules, where multiple convolution filters of different
sizes are applied at each layer.
o Combines different levels of feature extraction, making the network efficient.
o Introduced the concept of "network in network."
5. ResNet (Residual Networks):
o Introduced residual learning to avoid vanishing gradient problems.
o Uses skip connections to pass the output of one layer to a deeper layer (a minimal residual-block sketch follows this list).
o ResNet allows for much deeper networks (e.g., ResNet-50, ResNet-101).
6. DenseNet (Densely Connected Convolutional Networks):
o Each layer connects to every previous layer, enhancing feature reuse.
o Improves gradient flow and reduces the vanishing gradient problem.
o Requires fewer parameters compared to traditional CNNs.
7. MobileNet:
o Designed for mobile and embedded systems with limited
computational power.
o Uses depthwise separable convolutions to reduce computational cost.
o Efficient for real-time mobile vision applications.
8. SqueezeNet:
o A compact CNN model designed for efficiency with fewer
parameters.
o Utilizes fire modules that combine 1x1 convolutions and 3x3
convolutions.
o Achieves competitive accuracy with a significantly smaller model size.
9. U-Net:
o Primarily used for image segmentation tasks, especially in medical image
analysis.
o Features an encoder-decoder architecture that reduces and restores spatial
dimensions.
o Performs pixel-wise predictions to segment images.
10. EfficientNet:
A family of models that balances depth, width, and resolution to improve accuracy
while reducing parameters.
Uses a compound scaling method to scale the network efficiently.
Achieves high performance with fewer parameters compared to other models.
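A minimal sketch of the skip-connection idea behind ResNet (item 5 above), assuming the Keras API; layer sizes are illustrative, and the key point is that the block's output is its input plus a learned residual:

```python
# Minimal sketch: a residual (skip-connection) block as used in ResNet.
# Layer sizes are illustrative; the key idea is output = F(x) + x.
from tensorflow.keras import layers, models

def residual_block(x, filters=64):
    shortcut = x                                   # the skip connection
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.Add()([y, shortcut])                # add the input back to the output
    return layers.Activation("relu")(y)

inputs = layers.Input(shape=(32, 32, 64))          # channel count matches `filters`
outputs = residual_block(inputs)
model = models.Model(inputs, outputs)
model.summary()
```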
9a] Explain how a recurrent neural network (RNN) processes data sequences
4. Advantages:
o LSTMs are capable of learning long-range dependencies in sequential data by
maintaining and updating cell states over time.
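A minimal NumPy sketch of how a simple (vanilla) RNN processes a sequence step by step by carrying a hidden state forward; all sizes and values are illustrative:

```python
# Minimal sketch: a vanilla RNN processing a sequence one step at a time,
# carrying the hidden state forward. All sizes and values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5

W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

sequence = rng.normal(size=(seq_len, input_size))   # one input vector per time step
h = np.zeros(hidden_size)                           # initial hidden state

for t, x_t in enumerate(sequence):
    # The new hidden state depends on the current input and the previous state
    h = np.tanh(W_x @ x_t + W_h @ h + b)
    print(f"step {t}: hidden state {np.round(h, 3)}")
```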
1. Speech Recognition:
o Speech recognition is the process of converting spoken language into text
using computational algorithms.
o It involves analyzing sound waves, recognizing speech patterns, and
converting them into text or commands.
o The process begins by recording the audio input and preprocessing it to
remove noise and enhance clarity.
o Acoustic features, such as phonemes (smallest speech units), are then extracted
from the audio signal.
o These features are matched against a model trained on a large dataset of speech
samples.
o Modern speech recognition systems often use deep learning techniques, such
as recurrent neural networks (RNNs) or deep neural networks (DNNs), to
improve accuracy.
o Applications of speech recognition include virtual assistants (like Siri, Alexa,
and Google Assistant), transcription services, and voice-controlled devices.
2. Natural Language Processing (NLP):
o NLP is a branch of artificial intelligence (AI) that focuses on the interaction
between computers and human language.
o It involves developing algorithms and models that enable machines to
understand, interpret, and generate human language in a meaningful way.
o NLP tasks include text classification, machine translation, named entity
recognition (NER), sentiment analysis, and summarization.
o The primary challenge in NLP is dealing with the ambiguity and complexity
of natural language, such as homophones (words with the same pronunciation
but different meanings) and context-based meanings.
o NLP uses various techniques, including:
Tokenization: Breaking text into words, sentences, or subword units.
Part-of-speech tagging: Identifying the grammatical structure
of a sentence.
Named Entity Recognition (NER): Recognizing entities like names,
locations, and dates in text.
Word Embeddings: Representing words in a dense vector format
that captures their semantic meaning.
Transformers and Attention Mechanisms: Advanced models that
capture relationships between words in context, used in models like
BERT, GPT, and T5.
o Applications of NLP include search engines, chatbots, sentiment analysis
tools, and translation services.
3. Relation Between Speech Recognition and NLP:
o Speech recognition converts spoken language into text, while NLP works on
processing and understanding that text.
o A combined system, such as a voice assistant, uses speech recognition to
convert voice input into text, and NLP to comprehend the text and provide
meaningful responses.
o The integration of these two technologies enables the development of
applications like automated transcription, real-time translation, and intelligent
virtual assistants.
4. Challenges:
o Speech Recognition: Handling accents, background noise, and variations in
pronunciation can lead to errors in recognition.