3.1 - 3.20 Foundational Math - Calculus For ML and AI
1 Introduction
● In this chapter we study calculus for machine learning and AI applications in a real world context.
● We cover all the above concepts.
● We collect customer data manually. Given data of n customers, we train our model and use it to make predictions for new customers.
● Let’s assume we have four features and we have to predict income based on these features. Each row corresponds to the data of one customer.
At timestamp 6.24
● We have four features; we call them input variables, and based on these features we predict the output variable.
● The features can be any data that we have collected, as shown above.
At timestamp 14.22
● If the output variable is real valued then the problem at hand is a regression problem.
At timestamp 15.40
● If the response variable is categorical, the problem at hand is a classification problem.
● As we are going to solve the regression problem, let's try to visualize our data in 2D. So we have only 1 feature, shown above along the x axis.
● We can clearly see that the feature f and the response variable y have an almost linear relationship.
At timestamp 4.02 in video
● So we can represent the relation as y = a·f + b (a line).
● Let’s say we trained and fit the model, and we got the above line as our model.
● y = 110·f1 + 10000 is our mathematical model.
At timestamp 7.21 in video
● If we consider the above line as our model, then when we have data about a new customer we can just substitute it in and get the income of the new customer.
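As a small illustration (a sketch, assuming the fitted line y = 110·f1 + 10000 from above and a hypothetical new customer), prediction is just substitution:

```python
# Predicting income with the fitted line y = 110 * f1 + 10000 from above.
def predict_income(f1):
    return 110 * f1 + 10000

# A hypothetical new customer with feature value f1 = 500:
print(predict_income(500))   # 65000
```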
At timestamp 7.31 in video
● Let’s visualize the data in 3D: we have two features f1, f2 and our response variable is y. Assume that y is linearly related to f1 and f2; then we can fit a plane to our data. We try to find a plane such that most points lie close to the plane or on the plane itself.
● If we have 1 feature we use a line as our model; if we have two features we use a plane.
● Similarly, if we have d features which are linearly related to y, we can have a hyperplane in d+1 dimensional space.
● We cannot use lines and planes if the features are not linearly related to y.
● Given n pairs (x, y) in dataset D, we have to find the best plane. We will try to formulate the regression problem from a mathematical point of view.
● In order to find the best plane we want to minimize the distance from each point to the plane.
● The predicted value for y is the value which the line gives us; the actual y value is given in the dataset. Our predicted and actual y values may differ based on the line we have as our model.
● When we say minimize the distance from the point to the line, we are actually trying to minimize the value d as shown (we are minimizing the difference between the actual and predicted y).
At timestamp 12.36 in video
● Our objective here is to find w, b such that the distance d from each point x to the plane is close to zero.
● We want the sum of distances to be as close to zero as possible. If the distances are closer to zero, the points are closer to the hyperplane.
At timestamp 13.57
● For points on either side of the plane the distances will be both positive and negative; if we sum up the raw distances it will not be appropriate, so we consider the sum of squares of the distances.
● We want this sum of squared distances to be as close to zero as possible.
At timestamp 15.42
● Our whole problem boils down to finding w, b such that we minimise the squared distances as much as possible.
● By minimising the sum of the di^2 we arrive at the best hyperplane.
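Written out explicitly (a standard formulation of what these bullets describe; the video's exact notation may differ), the optimization objective is:

```latex
\min_{w,\,b} \sum_{i=1}^{n} d_i^2
\;=\;
\min_{w,\,b} \sum_{i=1}^{n} \big( y_i - (w \cdot x_i + b) \big)^2
```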
At timestamp 18.30 in video
● We have to find the optimal w, b such that the sum of squared distances is minimised.
● Here we minimise the loss so that we can obtain w, b.
At timestamp 22.35 in video
● To solve this regression problem we use maxima and minima from calculus.
3.5 Minimisation of multiple values
At timestamp 2.08 in video
● Imagine we have 3 points, and corresponding to each point we have squared distances as shown above. Let's assume a plane with some values of w, b as our model (remember we can determine a plane if we know w and b).
At timestamp 2.55 in video
● For some other plane with w′, b′ the squared distances are as shown above.
● As the plane changes, the distances change.
At timestamp 3.40 in video
● We could use operations other than addition, such as log, but we don't want to increase the complexity, so we use addition.
3.6 Limits, Range and Domain of a function
In this chapter we study the limits, range, and domain of the most popularly used functions in AI/ML.
● Consider the function f(x) = 1/x and let's understand the concept of one sided limits.
● As you come closer and closer towards 0 from the positive side of the x axis, the curve moves towards infinity. This fact can be represented mathematically using limits as shown above, and it is often referred to as the right/positive sided limit.
● As you move closer towards 0 from the negative side of the x axis, the curve tends towards minus infinity. This fact can be represented mathematically using limits as shown above, and it is often referred to as the left/negative sided limit.
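A quick numeric sketch of these one-sided limits (illustrative only, not from the video):

```python
# Approach 0 from the right and from the left and watch f(x) = 1/x diverge.
f = lambda x: 1 / x

for h in [0.1, 0.01, 0.001, 0.0001]:
    print(f"f({h}) = {f(h):>10.1f}    f({-h}) = {f(-h):>10.1f}")
# As h -> 0+, f(h) -> +infinity; as h -> 0-, f(-h) -> -infinity.
```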
At timestamp 5.49 in video
op
C
ft
ra
(D
ts
● The above curve is a parabola. There is a similar concept called the two sided limit.
● As you move closer and closer towards 0 from the positive side of the x axis, the curve moves towards 0, so the right hand limit is 0; similarly, the left hand limit is also 0.
● We represent this behaviour mathematically as shown below.
At timestamp 9.05 in video
● The above is log(x); we use this function a lot in machine learning. We know that the log of 0 and of negative numbers is not defined; log is defined only for positive numbers. The set of possible input values of a function is called its domain, and the domain of log(x) is the positive real numbers.
● From the plot we can see the right handed limit: as we move closer to 0 from the positive x axis, the function tends towards -infinity.
● The set of values a function takes on the y axis is called its range; log(x) can be positive, 0, or negative, so the range of y = log(x) is all real numbers.
At timestamp 12.40 in video
● Let's consider the function e^x; as we move closer and closer to -∞ the function tends towards 0. This can be represented mathematically using limits as shown above.
● Similarly, as we move towards +∞ the function also moves towards +∞.
● The domain of the function is the set of all real numbers and the range is all positive real numbers (R+ means all positive real numbers, excluding 0).
At timestamp 16.01 in video
● The above function is the absolute value function; both one sided limits of the function at 0 are 0, as shown above.
● The domain of the function is all real numbers, and the range of the function is all non-negative real numbers (it includes 0 because at x = 0 the function gives 0).
At timestamp 17.59 in video
● We use the functions shown above extensively in ML.
3.7 Continuity of functions
● Consider the function 1/x; from the plot you can see that at x = 6 the function is continuous, but at the point x = 0 the curve is discontinuous.
At timestamp 1.20 in video
● For the function tan(x) you can clearly see from the above plot that there are many discontinuities.
At timestamp 3.32
● Now that we have seen enough examples, we define continuity mathematically as shown above.
● A function f(x) is said to be continuous at a point x = a in its domain if the following three conditions are satisfied:
1. f(a) exists (i.e. the value of f(a) is finite)
2. lim x→a f(x) exists (i.e. the right-hand limit = left-hand limit, and both are finite)
3. lim x→a f(x) = f(a)
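A small sketch (not from the video) checking the three conditions with sympy, for f(x) = x^2 at a = 2, and contrasting with 1/x at 0:

```python
from sympy import symbols, limit

x = symbols('x')
f = x**2
a = 2

value = f.subs(x, a)           # condition 1: f(a) exists and is finite
right = limit(f, x, a, '+')    # right-hand limit
left = limit(f, x, a, '-')     # left-hand limit
print(value, right, left)      # 4 4 4 -> conditions 2 and 3 hold: continuous

# Contrast with 1/x at 0: the one-sided limits disagree (oo vs -oo).
print(limit(1/x, x, 0, '+'), limit(1/x, x, 0, '-'))
```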
3.8 Derivatives: geometric intuition
At timestamp 0.24 in video
● Let’s understand the concept of derivatives. Consider the function f(x) = x^2, which is a parabola.
● If we consider the point x = 4, the corresponding y will be 16. If we draw a tangent to the curve at the point x = 4, we call it T4. This tangent makes some angle with the x axis, which we call θ4.
At timestamp 3.35 in video
● The derivative of f(x) with respect to x at x = 4 is defined as the slope of the tangent T4, and it is denoted mathematically as shown above.
● The slope of T4 is nothing but tan θ4.
● The slopes of tangents at different points on the curve can be different.
At timestamp 6.19 in video
● Consider three tangents and the angles they make with the x axis as shown above. We can clearly see that T4 has a positive slope, T6 has a negative slope, and T0 has zero slope.
● If the slope at a point is positive, the underlying curve is increasing there; on the other hand, if the slope at a point is negative, the curve is decreasing.
● If the slope of the tangent at a point is zero, the situation is slightly more complicated and we will discuss it later.
3.9 Derivatives: rate of change + Math
At timestamp 0.55 in video
● Let’s understand derivatives from a rate of change perspective. Consider two points x1, x2 on the curve and their corresponding coordinates as shown above.
● The slope of the secant line can be calculated as shown above.
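For reference, the standard secant-slope formula and its limit (reconstructed here, since the screenshot is not reproduced) are:

```latex
\text{slope of secant} = \frac{f(x_2) - f(x_1)}{x_2 - x_1},
\qquad
f'(x_1) = \lim_{x_2 \to x_1} \frac{f(x_2) - f(x_1)}{x_2 - x_1}
```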
● So the derivative is the rate of change of y around the point x = x1. Instead of computing the derivative at one particular point, we can compute it at all points.
At timestamp 9.56
● For computing the derivative at a specific point we can substitute the corresponding x value.
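A numeric sketch (not from the video): the difference quotient approaches the derivative as h shrinks, here for f(x) = x^2 at x = 4:

```python
f = lambda x: x**2
x0 = 4.0

for h in [1.0, 0.1, 0.01, 0.001]:
    approx = (f(x0 + h) - f(x0)) / h   # slope of the secant through x0, x0+h
    print(f"h = {h:<6} slope ~ {approx:.4f}")   # approaches 2 * x0 = 8
```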
3.10 Derivative of Common functions
At timestamp 0.10 in video
● Let’s see the derivatives of the most commonly used functions in machine learning. The function shown above is a polynomial function, and its derivative can be obtained as shown.
● The derivatives of log x and e^x are as shown above; using limits you can derive these easily.
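These can be double-checked with sympy (an illustrative sketch, not from the video):

```python
from sympy import symbols, diff, log, exp, Abs

x = symbols('x', real=True)

print(diff(x**3, x))     # 3*x**2  (power rule: d/dx x^n = n*x^(n-1))
print(diff(log(x), x))   # 1/x
print(diff(exp(x), x))   # exp(x)
print(diff(Abs(x), x))   # sign(x) -- undefined at x = 0 (see below)
```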
● The derivative of abs(x) can be calculated as shown using limits.
● In the case of the absolute value function, if x > 0 the derivative is +1 (and if x < 0 it is -1).
● When x = 0 the derivative is not defined, because the left handed limit is not equal to the right handed limit.
● You can see the above functions are differentiable on the other parts of the curve, but at x = 0 they are not. Wherever the graph or curve is non-smooth, we tend to face the problem of non-differentiability.
At timestamp 3.02
● We also face the problem of non-differentiability when the function is discontinuous, because the left and right sided limits will not be the same.
3.12 Rules of differentiation
● The derivative of y = f(x) with respect to x can be represented in any of the above ways.
● We have rules of differentiation which help us when we differentiate large expressions. These rules are listed below.
● When we have to find the derivative of a product of two functions we use the product rule.
At timestamp 13.07
● The quotient rule is used when we have to find the derivative of a function f(x) divided by a function g(x); here g(x) cannot be 0.
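A sketch (not from the video) verifying both rules with sympy, for hypothetical f(x) = x^2 and g(x) = sin(x):

```python
from sympy import symbols, diff, sin, simplify

x = symbols('x')
f = x**2
g = sin(x)

# Product rule: (f*g)' = f'*g + f*g'
print(simplify(diff(f * g, x) - (diff(f, x) * g + f * diff(g, x))))   # 0

# Quotient rule: (f/g)' = (f'*g - f*g') / g**2, valid where g(x) != 0
print(simplify(diff(f / g, x) - (diff(f, x) * g - f * diff(g, x)) / g**2))  # 0
```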
3.13 Maxima and Minima
At timestamp 1.19
● All the calculus we have studied till now is primarily for studying maxima and minima. In ML, most of the time we try to minimize or maximize certain functions or mathematical expressions.
● Of the functions shown above, the first doesn’t have a maxima but has a minima at x = x1, and the other doesn’t have a minima but has a maxima at x = x2.
● A function can have both maxima and minima, some functions can have multiple maxima and minima, and some functions may not have any minima or maxima at all.
At timestamp 2.19
● Whenever we have a local maxima or minima we use the general term optima; it covers both cases.
● Consider the curve y = x^2 shown above and two tangents T1, T2 at points x1, x2 on the curve, making angles θ1 and θ2 with the x axis respectively. Let’s see how to find the minima mathematically.
● We can clearly see that the slope of tangent T2 is less than the slope of T1. The slope is nothing but the derivative of the function at x = x1, x2.
● The derivatives at x1 and x2 are both positive, but the derivative at x1 > the derivative at x2.
● At x = 0 the tangent is nothing but the x-axis, so the slope of the tangent at x = 0 is 0.
At timestamp 13.10
● You can observe a similar trend on the left side of the curve as well, as shown above.
At timestamp 13.58
● We can observe that on one side of the minima the derivatives are all positive, on the other side the derivatives are all negative, and at the minima the derivative is zero.
● We find the minima at a point where the derivative is zero, with the function increasing on the right side of that point and decreasing on the left side.
At timestamp 16.40
● Just like the derivative at a minima is zero, the derivative at a maxima is also 0. You can see that the angle made by the tangent with the x-axis is 0, so the derivative or slope is 0.
● We find the maxima at a point where the derivative is zero, with the function decreasing on the right side of that point and increasing on the left side.
At timestamp 18.54
● When the derivative is zero we can analyse whether the point is a maxima or a minima as shown above.
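A sketch of this analysis (the second derivative test, not taken verbatim from the video) using sympy:

```python
from sympy import symbols, diff, solve

x = symbols('x')

for f in (x**2, -x**2):
    for p in solve(diff(f, x), x):        # points where f'(x) = 0
        curvature = diff(f, x, 2).subs(x, p)
        kind = "minima" if curvature > 0 else "maxima"
        print(f"f = {f}: stationary at x = {p}, f'' = {curvature} -> {kind}")
```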
3.15 Partial derivatives & Del
At timestamp 2.4
● We have seen how to find derivatives when we have one variable; here we have more than one variable, so we use the concept of a partial derivative, as shown above.
● We can extend the same concept to n dimensions.
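A short sketch (not from the video) of partial derivatives with sympy, for a hypothetical f(x, y); each partial treats the other variable as a constant:

```python
from sympy import symbols, diff

x, y = symbols('x y')
f = x**2 * y + y**3    # a hypothetical function of two variables

print(diff(f, x))   # 2*x*y          (partial derivative with respect to x)
print(diff(f, y))   # x**2 + 3*y**2  (partial derivative with respect to y)
```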
3.16 Optima using Partial derivatives
At timestamp 0.53
● Above is an example of calculating partial derivatives; by equating them to 0 we get an optimum at x = 1 and y = -1. The optimum can be either a minima or a maxima.
● If we plot the function we can clearly see that the optimum we arrived at using partial derivatives, at (x, y) = (1, -1), is a minima.
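A sketch of the procedure with a hypothetical function that has the same optimum (the exact function from the video is not reproduced here):

```python
from sympy import symbols, diff, solve

x, y = symbols('x y')
f = x**2 + y**2 - 2*x + 2*y   # hypothetical; its optimum is at (1, -1)

# Equate both partial derivatives to zero and solve the system.
print(solve([diff(f, x), diff(f, y)], [x, y]))   # {x: 1, y: -1}

# Both second order partials are positive, so this optimum is a minima.
print(diff(f, x, 2), diff(f, y, 2))   # 2 2
```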
● There is a special case where the derivative of the function (the regular derivative, or the vector of partial derivatives) is 0 but the point is neither a minima nor a maxima.
3.17 Saddle point
At timestamp 1.15
● For the above function you can clearly see that the derivative of the function at x = 0 is 0, yet x = 0 is neither a minima nor a maxima.
● When we have more than one variable in our function, as shown above, it's slightly tricky: you can find the partial derivatives, equate them to 0, and say that at (x, y) = (0, 0) we have an optimum. But there is no optimum there.
● When we calculate the second order partial derivatives, you can observe that along x the point behaves like a minima while along y it behaves like a maxima, so the point is a saddle point rather than an optimum.
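A sketch using the classic saddle example (hypothetical here; the video's function may differ), f(x, y) = x^2 - y^2:

```python
from sympy import symbols, diff, solve

x, y = symbols('x y')
f = x**2 - y**2

# Both partials vanish at (0, 0), which looks like an optimum...
print(solve([diff(f, x), diff(f, y)], [x, y]))   # {x: 0, y: 0}

# ...but the second order partials disagree in sign: along x the point
# behaves like a minima (f_xx = 2 > 0), along y like a maxima (f_yy = -2 < 0).
print(diff(f, x, 2), diff(f, y, 2))   # 2 -2
```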
3.18 Gradient Descent
At timestamp 3.48
● Let's consider the above curve; we want to find the minima. We pick a random point x0 and find the derivative at that point. Since the derivative is positive, we know that the function is increasing there, so the minima < x0.
● We know that we have to move towards the minima, and we do this by using the above update equation.
● We keep updating x using the equation until we reach the minima, as shown.
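A minimal sketch (not from the video), assuming the standard update x_new = x_old - η·f'(x_old), for f(x) = (x - 3)^2:

```python
f_prime = lambda x: 2 * (x - 3)   # derivative of f(x) = (x - 3)**2

x = 10.0     # x0: a randomly chosen starting point
eta = 0.1    # learning rate / step size

for step in range(50):
    x = x - eta * f_prime(x)     # move against the slope

print(x)     # ~3.0, the minima of f
```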
At timestamp 13.45
● The algorithm is called gradient descent because we use the gradient (the slope or derivative) and slowly descend towards the minima, starting from a random point on the curve.
3.19 Gradient Descent with multiple variables
At timestamp 0.23
● We are trying to understand gradient descent with multiple variables, as shown above.
● We start with x0, y0 (chosen randomly) and keep updating the values using the above equations until the partial derivatives get as close to zero as possible.
● Let's solve our regression problem using gradient descent. We arrived at the above optimization problem, where we have to find the optimal w, b such that the sum of squared distances is as small as possible.
● Given the dataset D = {(xi, yi)}, the problem at hand is minimizing the above function.
● We can use gradient descent here, since we have two variables w, b and we have to find the minima.
At timestamp 4.16
● We calculate the partial derivatives with respect to w and b as shown above.
At timestamp 12.42
● Initially we randomly pick the w vector and b; then we use gradient descent, keep updating the values, and move towards the minima as shown above.
● We use gradient descent even when we have more than two variables. So, given a regression problem, we can find the best line/plane using gradient descent to minimise the loss.
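Putting it all together, a compact sketch (not from the video) of linear regression trained with gradient descent on the squared loss, using hypothetical synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)                    # one feature
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=100)    # hypothetical true line

w, b = 0.0, 0.0   # initialisation
eta = 0.01        # learning rate
n = len(x)

for step in range(2000):
    residual = y - (w * x + b)
    grad_w = -2.0 / n * np.sum(residual * x)   # partial derivative dL/dw
    grad_b = -2.0 / n * np.sum(residual)       # partial derivative dL/db
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)   # close to the true w = 3, b = 2
```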