Unit - 2 Deep Learning
Machine Learning: Basics and Underfitting, Hyperparameters and Validation Sets, Estimators, Bias and Variance, Maximum Likelihood, Bayesian Statistics, and Unsupervised Learning, Stochastic Gradient Descent, Challenges Motivating Deep Learning. Deep Feedforward Networks: Learning XOR, Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation and Other Differentiation Algorithms.
Underfitting
Underfitting is a situation in which a model is too simple to capture the underlying pattern of the data; its occurrence simply means that our model or the algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data. In such cases, the rules of the machine learning model are too simple to describe the data, and the model will probably make a lot of wrong predictions. Underfitting can be reduced by using more training data and by increasing the model's capacity, for example by adding more relevant features. In a nutshell, underfitting refers to a model that neither performs well on the training data nor generalizes to new data.
Hyperparameters:
A hyperparameter is a machine learning parameter whose value is chosen before the learning algorithm is trained. The number of training epochs is one such hyperparameter: in each epoch the same training data is fed to the neural network again, and the model continues to learn the features of the data.
The training set should have a diversified set of inputs so that the model is trained in all
scenarios and can predict any unseen data sample that may appear in the future.
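As a small illustration (the model, data, and the value of k below are illustrative assumptions, not from the text), a hyperparameter such as the number of neighbors in k-NN is fixed before training starts, while what the model learns comes from the data:

from sklearn.neighbors import KNeighborsClassifier

# n_neighbors is a hyperparameter: it is chosen before learning begins,
# whereas what the model "learns" comes from the training data itself.
X_train = [[1], [2], [3], [10], [11], [12]]
y_train = [0, 0, 0, 1, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)   # hyperparameter chosen up front
model.fit(X_train, y_train)
print(model.predict([[2.5]]))                 # predicts class 0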
Validation Sets:
The validation set is a set of data, separate from the training set, that is used to validate our
model performance during training.
This validation process gives information that helps us tune the model’s hyperparameters and
configurations accordingly. It is like a critic telling us whether the training is moving in the
right direction or not.
The model is trained on the training set, and, simultaneously, the model evaluation is
performed on the validation set after every epoch.
The main idea of splitting off a validation set is to prevent our model from overfitting, i.e., the situation in which the model becomes very good at classifying the samples in the training set but cannot generalize and make accurate classifications on data it has not seen before.
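A minimal sketch of holding out a validation set, assuming scikit-learn's train_test_split; the dataset and the 80/20 split are illustrative:

import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 samples with 5 features and a binary label.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Hold out 20% of the data as a validation set (the fraction is an arbitrary choice).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# The model is fit on (X_train, y_train) and, after every epoch, evaluated on
# (X_val, y_val) to watch for overfitting and to tune hyperparameters.
print(X_train.shape, X_val.shape)   # (80, 5) (20, 5)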
Estimators
Estimators are functions of random variables that can help us find approximate values for the unknown parameters of a distribution. Think of an estimator like any other function: it takes an input, processes it, and renders an output. The process of estimation goes as follows:
1) From the distribution, we take a series of random samples.
2) We input these random samples into the estimator function.
3) The estimator function processes it and gives a set of outputs.
4) The expected value of that set is the approximate value of the parameter.
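A small numerical sketch of this four-step process, using the sample mean as the estimator of an unknown population mean (the distribution and sample sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
true_mean = 5.0                      # the unknown parameter we want to estimate

# 1) Take a series of random samples from the distribution.
samples = [rng.normal(loc=true_mean, scale=2.0, size=30) for _ in range(1000)]

# 2) and 3) Feed each sample into the estimator function (here, the sample mean).
estimates = [s.mean() for s in samples]

# 4) The expected value of that set of outputs approximates the parameter.
print(np.mean(estimates))            # close to 5.0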
In machine learning, an estimator is an equation for picking the “best,” or most likely accurate, data model based upon observations in reality. Not to be confused
with estimation in general, the estimator is the formula that evaluates a given quantity
(the estimand) and generates an estimate.
SVC, Naive Bayes, k-NN are some of the popular classification estimators but have
comparatively high time complexity. For data with <100k samples, one can try the
Linear SVC model.
An estimator is a statistic used for the purpose of estimating an unknown
parameter. An estimator is a function of the data in a sample. Common estimators
are the sample mean and sample variance which are used to estimate the unknown
population mean and variance.
The two main types of estimators in statistics are point estimators and interval estimators. Point estimation is the opposite of interval estimation: the former produces a single value, while the latter produces a range of values.
A good estimator should be unbiased, consistent, and relatively efficient.
Bias and Variance:
Bias is considered a systematic error that occurs in the machine learning model itself due to incorrect assumptions in the ML process.
Technically, we can define bias as the error between average model prediction and the
ground truth. Moreover, it describes how well the model matches the training data set:
A model with a higher bias would not match the data set closely.
A low bias model will closely match the training data set.
A high-bias model typically fails to capture the proper trends in the data and tends to underfit.
Variance, in contrast, measures how much the model's predictions change when it is trained on different samples of the data. As a rule of thumb, linear algorithms such as linear and logistic regression tend to have high bias and low variance, while more flexible algorithms such as decision trees, k-NN, and SVMs tend to have low bias and higher variance.
1. Low-Bias, Low-Variance: The combination of low bias and low variance is the ideal machine learning model, but it is rarely achievable in practice.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are accurate on average but inconsistent. This case occurs when the model learns a large number of parameters and therefore overfits.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses too few parameters; it leads to underfitting.
4. High-Bias, High-Variance: With high bias and high variance, predictions are inconsistent and also inaccurate on average.
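As an illustrative sketch (the target function, polynomial degrees, and noise level are assumptions, not from the text), bias and variance can be measured empirically by fitting the same kind of model to many resampled training sets and comparing the average prediction with the ground truth:

import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)          # ground-truth function
x_test = 0.3                                      # a single test point

def fit_and_predict(degree):
    """Fit a polynomial of the given degree to one noisy training set."""
    x = rng.uniform(0, 1, 20)
    y = true_f(x) + rng.normal(0, 0.3, 20)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_test)

# Degree 1 behaves like a high-bias model, degree 9 like a high-variance model.
for degree in (1, 9):
    preds = np.array([fit_and_predict(degree) for _ in range(500)])
    bias = preds.mean() - true_f(x_test)          # error of the average prediction
    variance = preds.var()                        # spread of predictions across datasets
    print(f"degree={degree}: bias={bias:+.3f}, variance={variance:.3f}")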
Bayesian Statistics:
Bayesian inference is a specific way to learn from data that is heavily used in statistics for data analysis. Bayesian inference is used less often in the field of machine learning, but it offers an elegant framework for understanding what "learning" actually is, so it is generally useful to know about. That said, this statistical approach is not directly applicable to every deep learning technique.
However, it affects three key areas of machine learning:
Statistical inference - Bayesian probability is used to summarize the evidence for the likelihood of a hypothesis.
Statistical modeling - some models are supported by specifying the prior distribution of any unknown parameters.
Experiment design - by including the idea of "prior belief influence", this approach uses sequential experiments to factor in the results of earlier experiments when designing new ones; these "beliefs" are updated through the prior and posterior distributions.
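A tiny worked sketch of updating a prior belief with observed evidence, using the standard coin-flip (Beta-Binomial) example; the prior and the observed counts are illustrative:

# Prior belief about a coin's probability of heads: Beta(a, b) with a = b = 1,
# i.e. a uniform prior expressing no strong prior belief.
a, b = 1.0, 1.0

# Evidence: we observe 7 heads and 3 tails.
heads, tails = 7, 3

# For a Beta prior with Binomial data, the posterior is again a Beta distribution.
a_post, b_post = a + heads, b + tails

# Posterior mean estimate of the probability of heads.
print(a_post / (a_post + b_post))   # 8/12 ≈ 0.667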
Unsupervised Learning
Unsupervised learning aims to find the underlying structure of a dataset, group the data according to similarities, and represent the dataset in a compressed format.
Why use Unsupervised Learning?
Below are some main reasons which describe the importance of Unsupervised Learning:
Unsupervised learning is helpful for finding useful insights from the data.
Unsupervised learning is similar to how a human learns to think from their own experiences, which makes it closer to real AI.
Unsupervised learning works on unlabeled and uncategorized data, which makes it applicable in many more situations.
In the real world, we do not always have input data with corresponding outputs; to solve such cases, we need unsupervised learning.
Here, we have taken unlabeled input data, which means it is not categorized and no corresponding outputs are given. This unlabeled input data is fed to the machine learning model in order to train it. First, the model interprets the raw data to find the hidden patterns in it, and then a suitable algorithm such as k-means clustering or hierarchical clustering is applied. Once the algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
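A minimal numpy-only sketch of the k-means idea described above, run on unlabeled two-dimensional data (the data, k, and the initialization are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Unlabeled data: two blobs of 2-D points, with no output labels given.
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([5, 5], 0.5, (50, 2))])

k = 2
centers = X[[0, 50]].copy()          # simple deterministic initialization

for _ in range(10):
    # Assign each point to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centroid to the mean of the points assigned to it.
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centers.round(1))              # close to [0, 0] and [5, 5]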
Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them according to the presence or absence of those commonalities.
Stochastic Gradient Descent:
Stochastic gradient descent is a popular algorithm for training a wide range of models in machine learning, including (linear) support vector machines, logistic regression (see, e.g., Vowpal Wabbit) and graphical models.
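A minimal numpy sketch of stochastic gradient descent fitting a linear model, updating the parameters after every individual training example (the data and learning rate are illustrative):

import numpy as np

rng = np.random.default_rng(0)

# Toy regression data generated from y = 3x + 1 plus noise.
X = rng.uniform(-1, 1, 200)
y = 3 * X + 1 + rng.normal(0, 0.1, 200)

w, b = 0.0, 0.0
lr = 0.05                                   # learning rate (a hyperparameter)

for epoch in range(20):
    for i in rng.permutation(len(X)):       # visit the examples in random order
        err = (w * X[i] + b) - y[i]         # prediction error on ONE example
        w -= lr * err * X[i]                # gradient of 0.5 * err**2 w.r.t. w
        b -= lr * err                       # gradient of 0.5 * err**2 w.r.t. b

print(round(w, 2), round(b, 2))             # close to 3 and 1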
Challenges Motivating Deep Learning:
Machine learning involves analyzing data to build or train models, and it is used everywhere, from Amazon product recommendations to self-driving cars. As per recent research, the global machine learning market is expected to grow by 43% by 2024, and AI and machine learning jobs have grown by about 75% over the past four years, so the demand for machine learning professionals keeps rising. A career in machine learning offers job satisfaction, excellent growth and high salaries, but it is a complex and challenging process. Machine learning professionals face many challenges when developing ML skills and building an application from scratch; some of the major challenges are discussed below.
Data plays a significant role in the machine learning process. One of the significant issues
that machine learning professionals face is the absence of good quality data. Unclean and
noisy data can make the whole process extremely exhausting. We don’t want our algorithm to
make inaccurate or faulty predictions. Hence the quality of data is essential to enhance the
output. Therefore, we need to ensure that the process of data preprocessing which includes
removing outliers, filtering missing values, and removing unwanted features, is done with the
utmost level of perfection.
6. Slow Implementation
This is one of the common issues faced by machine learning professionals. Machine learning models can be highly efficient in providing accurate results, but they often take a tremendous
amount of time. Slow programs, data overload, and excessive requirements usually take a lot
of time to provide accurate results. Further, it requires constant monitoring and maintenance
to deliver the best output.
7. Imperfections in the Algorithm When Data Grows
Suppose you have found quality data, trained the model well, and the predictions are precise and accurate. But there is a twist: the model may become useless in the future as the data grows. The best model of the present may become inaccurate in the coming future and require further retraining. So regular monitoring and maintenance are needed to keep the algorithm working. This is one of the most exhausting issues faced by machine learning professionals.
Deep Feedforward Networks:
A feed-forward neural network is an artificial neural network in which the connections between nodes do not form a cycle. It is the polar opposite of a recurrent neural network, in which some routes are cycled. The feed-forward model is the most basic type of neural network: the input is processed in only one direction, and the data always flows forward, never backwards.
Neural networks are a type of function that connects inputs with outputs. In theory, neural networks should be able to approximate any sort of function, no matter how complex it is. Nonetheless, supervised learning entails learning a function that maps a given X to a specified Y and then using that function to determine the proper Y for a fresh X. If that is the case, how do neural networks differ from typical machine learning methods? The answer is inductive bias. The phrase may appear to be new, but it is nothing more than the assumptions we make about the relationship between X and Y before applying a machine learning model.
The linear relationship between X and Y is the Inductive Bias of linear regression. As a
result, it fits the data to a line or a hyperplane.
The neural network advanced from the perceptron, a prominent machine learning algorithm. Frank Rosenblatt, a psychologist, developed perceptrons in the 1950s and 1960s, based on earlier work by Warren McCulloch and Walter Pitts.
Before we look at why neural networks work, it is important to understand what neural networks do, and before we can grasp the design of a neural network, we must first understand what a single neuron does.
A weight is assigned to each input to an artificial neuron. First, the inputs are multiplied by their weights and a bias is added to the result; this is called the weighted sum. The weighted sum is then passed through an activation function, which is a non-linear function.
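A small sketch of the computation a single artificial neuron performs, with illustrative weights and a sigmoid activation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])        # inputs to the neuron
w = np.array([0.8, 0.2, -0.5])        # one weight per input (illustrative values)
b = 0.1                               # bias term

weighted_sum = np.dot(w, x) + b       # multiply inputs by weights, then add the bias
output = sigmoid(weighted_sum)        # pass through a non-linear activation function
print(round(weighted_sum, 2), round(output, 3))   # -0.7 and about 0.332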
The first layer is the input layer; its units simply hold the data that is fed into the neural network. The final layer is the output layer. The dataset and the type of problem determine the number of neurons in the input and output layers. The number of hidden layers, and the number of neurons in each hidden layer, is usually determined by trial and error.
The first neuron of the first hidden layer is connected to all of the inputs from the previous layer. The second neuron in the first hidden layer is likewise connected to all of the preceding layer's inputs, and so forth for all of the first hidden layer's neurons. The outputs of the previous hidden layer are regarded as inputs for the neurons in the second hidden layer, and each of these neurons is connected to all of the preceding neurons.
These models are called feed forward because information flows through the function being
evaluated from x, through the intermediate computations used to define f, and finally to the
output y. There are no feedback connections in which outputs of the model are fed back into
itself. When feedforward neural networks are extended to include feedback connections, they
are called recurrent neural networks.
Feedforward neural networks are called networks because they are typically represented by
composing together many different functions. The model is associated with a directed acyclic
graph describing how the functions are composed together. For example, we might have three
functions f(1), f(2), and f(3) connected in a chain, to form f(x) = f(3)(f(2)(f(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer, and so on.
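A short sketch of this chain structure in code: three layer functions composed so that f(x) = f3(f2(f1(x))). The weights are random placeholders and ReLU is assumed as the hidden nonlinearity:

import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0, z)

# Layer parameters (placeholders): 3 inputs -> 4 hidden -> 4 hidden -> 1 output.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)

f1 = lambda x: relu(W1 @ x + b1)      # first layer
f2 = lambda h: relu(W2 @ h + b2)      # second layer
f3 = lambda h: W3 @ h + b3            # output layer (no nonlinearity here)

x = np.array([1.0, 2.0, 3.0])
y = f3(f2(f1(x)))                     # f(x) = f3(f2(f1(x)))
print(y.shape)                        # (1,)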
The training data provides us with noisy, approximate examples of f∗(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f∗(x).
The training examples specify directly what the output layer must do at each point x; it must
produce a value that is close to y. The behavior of the other layers is not directly specified by
the training data. The learning algorithm must decide how to use those layers to produce the
desired output, but the training data does not say what each individual layer should do.
Finally, these networks are called neural because they are loosely inspired by neuroscience.
Each hidden layer of the network is typically vector-valued. The dimensionality of these
hidden layers determines the width of the model. Each element of the vector may be
interpreted as playing a role analogous to a neuron. Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function.
Learning XOR:
To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function. The XOR function (“exclusive or”) is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0.
The XOR function provides the target function y = f∗(x) that we want to learn. Our model provides a function y = f(x; θ), and our learning algorithm will adapt the parameters θ to make f as similar as possible to f∗. In this simple example, we will not be concerned with statistical
generalization. We want our network to perform correctly on the four points X = {[0, 0],
[0,1], [1,0], and [1,1]}. We will train the network on all four of these points. The only
challenge is to fit the training set. We can treat this problem as a regression problem and use
a mean squared error loss function. We choose this loss function to simplify the math for this
example as much as possible. In practical applications, MSE is usually not an appropriate
cost function for modeling binary data.
More appropriate approaches exist for binary data, but MSE keeps this example simple. Evaluated on our whole training set, the MSE loss function is
J(θ) = (1/4) Σ_{x∈X} (f∗(x) − f(x; θ))².
Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be
f(x; w, b) = xᵀw + b.
We can minimize J(θ) in closed form with respect to w and b using the normal equations.
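A quick check of this in code: solving the least-squares problem for a linear model on the four XOR points (numpy's lstsq plays the role of the normal equations here) gives w = (0, 0) and b = 0.5, so the model outputs 0.5 everywhere and cannot represent XOR:

import numpy as np

# The four XOR inputs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias b is learned along with w.
A = np.hstack([X, np.ones((4, 1))])

# Closed-form least-squares solution (equivalent to solving the normal equations).
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = theta[:2], theta[2]
print(w.round(3), round(b, 3))    # w ≈ [0, 0], b = 0.5
print((A @ theta).round(3))       # 0.5 at every point: the linear model fails on XOR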
Figure: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model’s output must increase as x2 increases. When x1 = 1, the model’s output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0] and x = [0, 1] to a single point in feature space, h = [1, 0]. The linear model can now describe the function as increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature space is only to
make the model capacity greater so that it can fit the training set. In more realistic
applications, learned representations can also help the model to generalize.
The following table shows the truth table for the XOR function:
x1  x2  XOR(x1, x2)
0   0   0
0   1   1
1   0   1
1   1   0
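One well-known hand-constructed solution uses a hidden layer of two ReLU units followed by a linear output unit; the short sketch below (with weights chosen by hand, as in the standard treatment of this example) verifies that it reproduces the truth table:

import numpy as np

relu = lambda z: np.maximum(0, z)

# Hidden layer: h = relu(x W + c); output: y = h . w + b
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
h = relu(X @ W + c)          # the nonlinear features map [0,1] and [1,0] to h = [1, 0]
y = h @ w + b
print(y)                     # [0. 1. 1. 0.], exactly the XOR function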
Gradient-Based Learning:
Gradient Descent is known as one of the most commonly used optimization algorithms to
train machine learning models by means of minimizing errors between actual and expected
results. Further, gradient descent is also used to train Neural Networks.
The best way to define the local minimum or local maximum of a function using gradient
descent is as follows:
If we move towards a negative gradient or away from the gradient of the function at the
current point, it will give the local minimum of that function.
Whenever we move towards a positive gradient or towards the gradient of the function at the
current point, we will get the local maximum of that function.
Moving against the gradient in this way is known as gradient descent (also called steepest descent); moving along the positive gradient is known as gradient ascent.
The main objective of using a gradient descent algorithm is to minimize the cost function
using iteration. To achieve this goal, it performs two steps iteratively:
Calculates the first-order derivative of the function to compute the gradient or slope of that
function.
Moves away from the direction of the gradient, i.e., steps from the current point by alpha times the gradient, where alpha is the learning rate. The learning rate is a tuning parameter in the optimization process that decides the length of the steps.
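A minimal sketch of these two steps on a simple one-dimensional cost function J(w) = (w - 3)^2; the function, starting point, and learning rate are illustrative:

# Gradient descent on J(w) = (w - 3)^2, whose minimum is at w = 3.
def grad(w):
    return 2 * (w - 3)        # step 1: the first-order derivative (the slope)

w = 0.0                       # arbitrary starting point
alpha = 0.1                   # learning rate (step size)

for _ in range(50):
    w = w - alpha * grad(w)   # step 2: move against the gradient direction

print(round(w, 4))            # approximately 3.0, the point of convergence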
What is Cost-function?
The cost function is defined as the measure of the difference, or error, between the actual and expected values at the current position, expressed as a single real number. It helps to improve machine learning efficiency by providing feedback to the model so that it can minimize the error and find the local or global minimum.
Further, gradient descent continuously iterates along the direction of the negative gradient until the cost function approaches its minimum. At this point of convergence, the model stops learning further.
Although the terms cost function and loss function are often used as synonyms, there is a minor difference between them: the loss function refers to the error of one training example, while the cost function calculates the average error across the entire training set.
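A one-line numerical illustration of this distinction using squared error (the numbers are made up): the loss is computed per example, while the cost averages the losses over the whole training set:

import numpy as np

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6, 0.8])

losses = (y_pred - y_true) ** 2      # loss: the error of each single training example
cost = losses.mean()                 # cost: the average error over the training set
print(losses, round(cost, 4))        # [0.01 0.04 0.16 0.04] and 0.0625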
The cost function is calculated after making a hypothesis with initial parameters and
modifying these parameters using gradient descent algorithms over known data to reduce the
cost function.
The slope is steep at the starting (arbitrary) point, but as new parameters are generated the steepness gradually reduces, until the cost approaches its lowest value, which is called the point of convergence. The main objective of gradient descent is to minimize the cost function, i.e., the error between the expected and actual values. To minimize the cost function, two things are required: the direction of movement (the gradient) and the learning rate.
Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. This is typically a
small value that is evaluated and updated based on the behavior of the cost function. If the
learning rate is high, it results in larger steps, but there is a risk of overshooting the minimum. A low learning rate, on the other hand, means small step sizes, which compromises overall efficiency but gives the advantage of more precision.
Based on the error in various training models, the Gradient Descent learning algorithm can be
divided into Batch gradient descent, stochastic gradient descent, and mini-batch gradient
descent. Let's understand these different types of gradient descent:
1. Batch Gradient Descent:
Batch gradient descent (BGD) is used to find the error for each point in the training set and
update the model after evaluating all training examples. This procedure is known as the
training epoch. In simple words, it is a greedy approach where we have to sum over all
examples for each update.
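A brief vectorized sketch of this idea (the data and learning rate are illustrative): in batch gradient descent each parameter update averages the error over all training examples, so one update corresponds to one pass (epoch) over the training set, in contrast to the per-example updates of stochastic gradient descent shown earlier:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 200)
y = 3 * X + 1 + rng.normal(0, 0.1, 200)

w, b = 0.0, 0.0
lr = 0.5

for epoch in range(200):                 # one parameter update per epoch
    err = (w * X + b) - y                # errors on ALL training examples at once
    w -= lr * np.mean(err * X)           # gradient averaged over the whole training set
    b -= lr * np.mean(err)

print(round(w, 2), round(b, 2))          # close to 3 and 1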
Hidden Units:
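A hidden unit computes a weighted sum of its inputs plus a bias and then applies a nonlinear activation function; ReLU, sigmoid, and tanh are common choices. The short sketch below (with illustrative inputs and weights) compares the three on the same weighted sum:

import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

x = np.array([0.5, -1.0])          # inputs to the hidden unit
w = np.array([2.0, 1.0])           # illustrative weights
b = 0.5                            # bias

z = np.dot(w, x) + b               # the affine part (weighted sum): 0.5
for act in (relu, sigmoid, tanh):  # the activation defines the type of hidden unit
    print(act.__name__, round(float(act(z)), 3))
# relu 0.5, sigmoid 0.622, tanh 0.462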
Architecture Design:
A feed-forward network passes its inputs through one or more layers of weighted connections and computes the result using activation functions. It is one of the types of neural networks in which the flow of the network is from the input units to the output units: it has no loops, no feedback, and no signal moves in the backward direction, that is, from the output layer to the hidden and input layers.
An ANN is a self-learning network that learns from sample data sets and signals; it is based on the functioning of the biological nervous system. The type of activation function depends on the desired output. ANNs are a part of machine learning and AI, which are among the fastest-growing fields, and a lot of research is going on to make them more effective.
Input Layer: It is the starting layer of the network; each input signal has an associated weight.
Hidden Layer: This layer lies after the input layer and contains multiple neurons that perform all the computations and pass the result to the output unit.
Output Layer: This layer contains the output units or neurons and receives the processed data from the hidden layer. If further hidden layers are connected to it, it passes the weighted output to the connected hidden layer for further processing to get the desired result.
In this design, the input and hidden layers use sigmoid and linear activation functions, whereas the output layer uses a Heaviside step activation function at its nodes, because this two-step activation function helps in predicting results as per the requirements. All units, also known as neurons, have weights; the calculation at the hidden layer is the summation of the dot products of all the weights and their signals, followed by the sigmoid function of the calculated sum. Using multiple hidden layers and output units can increase the accuracy of the output.
Applications:
Medical field
Speech regeneration
Data processing and compression
Image processing
Limitations:
This ANN is a basic form of neural network that has no cycles and computes only in the forward direction. It has some limitations: sometimes information about the neighborhood is lost, and in that case it becomes difficult to proceed further and all the steps need to be performed again. The forward pass on its own does not support back-propagation, so by itself the network cannot learn from or correct the faults of a previous stage.
Back-Propagation:
Back-propagation is the algorithm used to compute the gradients needed to update a network's weights during training. The algorithm gets its name because the weights are updated backward, from output to input.
The advantages of using a backpropagation algorithm are as follows:
It does not have any parameters to tune except for the number of inputs.
It is highly adaptable and efficient and does not require any prior knowledge about the
network.
It is a standard process that usually works well.
It is user-friendly, fast and easy to program.
Users do not need to learn any special functions.
Backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient (how a prediction differs from the actual result), which makes it a form of supervised machine learning. Along with classifiers such as Naive Bayes filters and decision trees, the backpropagation training algorithm has emerged as an important part of machine learning applications that involve predictive analytics.
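As a concrete sketch of this supervised setup, the code below trains a tiny one-hidden-layer network on the XOR data with backpropagation; the architecture, activation, loss (cross-entropy), and learning rate are illustrative choices rather than anything prescribed by the text, and convergence depends on the random initialization:

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Known, desired outputs for each input (supervised learning): the XOR table.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 8 sigmoid units, one sigmoid output unit.
W1, b1 = rng.normal(size=(2, 8)), np.zeros((1, 8))
W2, b2 = rng.normal(size=(8, 1)), np.zeros((1, 1))
lr = 0.5

for step in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the error from the output layer back toward the input.
    d_out = out - y                           # cross-entropy gradient at the output
    d_h = (d_out @ W2.T) * h * (1 - h)        # gradient at the hidden layer

    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())    # ideally close to [0, 1, 1, 0]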
************