3.1 - 3.20 Foundational Math - Calculus For ML and AI
1 Introduction
● In this chapter we study calculus for machine learning and AI applications in a real world context.
● We cover all the above concepts.
● We collect customer data manually. Given data of n customers, we train our model and use it to make predictions for new customers.
● Let’s assume we have four features and we have to predict income based on these features. Each row corresponds to the data of one customer.
At timestamp 6.24
● We have four features; we call them input variables, and based on these features we predict the output variable.
● The features can be any data that we have collected, as shown above.
At timestamp 14.22
● If the output variable is real valued then the problem at hand is a regression problem.
At timestamp 15.40
● If the response variable is categorical, the problem at hand is a classification problem.
● As we are going to solve the regression problem, let's try to visualize our data in 2D. So we have only 1 feature, shown above along the x axis.
● We can clearly see that the feature f and the response variable y have an almost linear relationship.
At timestamp 4.02 in video
● So we can represent the relation as y = a·f + b (a line).
● Let’s say we trained and fit the model, and we got the above line as our model.
● y = 110·f1 + 10000 is our mathematical model.
At timestamp 7.21 in video
● If we consider the above line as our model, then when we have data about a new customer we can just substitute it in and get the income of the new customer.
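As a small illustration (a sketch, assuming the fitted line y = 110·f1 + 10000 from above and a hypothetical new customer), prediction is just substitution:

```python
# Predicting income with the fitted line y = 110 * f1 + 10000 from above.
def predict_income(f1):
    return 110 * f1 + 10000

# A hypothetical new customer with feature value f1 = 500:
print(predict_income(500))   # 65000
```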
At timestamp 7.31 in video
● Let’s visualize the data in 3D: we have two features f1, f2 and our response variable is y. Assume that y is linearly related to f1 and f2; then we can fit a plane to our data. We try to find a plane such that most points lie close to the plane or on the plane itself.
● If we have 1 feature we use a line as our model; if we have two features we use a plane.
● Similarly, if we have d features which are linearly related to y, we can have a hyperplane in d+1 dimensional space.
● We cannot use lines and planes if the features are not linearly related to y.
● Given n pairs (x, y) in dataset D, we have to find the best plane. We will try to formulate the regression problem from a mathematical point of view.
● In order to find the best plane we want to minimize the distance from each point to the plane.
● The predicted value for y is the value which the line gives us; the actual y value is given in the dataset. Our predicted and actual y values may differ based on the line we have as our model.
● When we say minimize the distance from the point to the line, we are actually trying to minimize the value d as shown (we are minimizing the difference between the actual and predicted y).
At timestamp 12.36 in video
● Our objective here is to find w, b such that the distance d from each point x to the plane is close to zero.
● We want the sum of distances to be as close to zero as possible. If the distances are closer to zero, the points are closer to the hyperplane.
At timestamp 13.57
● For points on either side of the plane the distances will be both positive and negative; if we sum up the raw distances it will not be appropriate, so we consider the sum of squares of the distances.
● We want this sum of squared distances to be as close to zero as possible.
At timestamp 15.42
● Our whole problem boils down to finding w, b such that we minimise the squared distances as much as possible.
● By minimising the sum of the di^2 we arrive at the best hyperplane.
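Written out explicitly (a standard formulation of what these bullets describe; the video's exact notation may differ), the optimization objective is:

```latex
\min_{w,\,b} \sum_{i=1}^{n} d_i^2
\;=\;
\min_{w,\,b} \sum_{i=1}^{n} \big( y_i - (w \cdot x_i + b) \big)^2
```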
At timestamp 18.30 in video
● We have to find the optimal w, b such that the sum of squared distances is minimised.
● Here we minimise the loss so that we can obtain w, b.
At timestamp 22.35 in video
● To solve this regression problem we use maxima and minima from calculus.
3.5 Minimisation of multiple values
At timestamp 2.08 in video
● Imagine we have 3 points, and corresponding to each point we have squared distances as shown above. Let's assume a plane with some values of w, b as our model (remember we can determine a plane if we know w and b).
At timestamp 2.55 in video
● For some other plane with w′, b′ the squared distances are as shown above.
● As the plane changes, the distances change.
At timestamp 3.40 in video
● We could use operations other than addition, such as log, but we don't want to increase the complexity, so we use addition.
3.6 Limits, Range and Domain of a function
In this chapter we study the limits, range, and domain of the most popularly used functions in AI/ML.
● Consider the function f(x) = 1/x and let's understand the concept of one sided limits.
● As you come closer and closer towards 0 from the positive side of the x axis, the curve moves towards infinity. This fact can be represented mathematically using limits as shown above, and it is often referred to as the right/positive sided limit.
● As you move closer towards 0 from the negative side of the x axis, the curve tends towards minus infinity. This fact can be represented mathematically using limits as shown above, and it is often referred to as the left/negative sided limit.
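A quick numeric sketch of these one-sided limits (illustrative only, not from the video):

```python
# Approach 0 from the right and from the left and watch f(x) = 1/x diverge.
f = lambda x: 1 / x

for h in [0.1, 0.01, 0.001, 0.0001]:
    print(f"f({h}) = {f(h):>10.1f}    f({-h}) = {f(-h):>10.1f}")
# As h -> 0+, f(h) -> +infinity; as h -> 0-, f(-h) -> -infinity.
```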
At timestamp 5.49 in video
op
C
ft
ra
(D
ts
● The above curve is a parabola. There is a similar concept called the two sided limit.
● As you move closer and closer towards 0 from the positive side of the x axis, the curve moves towards 0, so the right hand limit is 0; similarly, the left hand limit is also 0.
● We represent this behaviour mathematically as shown below.
At timestamp 9.05 in video
● The above is log(x); we use this function a lot in machine learning. We know that the log of 0 and of negative numbers is not defined; log is defined only for positive numbers. The set of possible input values of a function is called its domain, and the domain of log(x) is the positive real numbers.
● From the plot we can see the right handed limit: as we move closer to 0 from the positive x axis, the function tends towards -infinity.
● The set of values a function takes on the y axis is called its range; log(x) can be positive, 0, or negative, so the range of y = log(x) is all real numbers.
At timestamp 12.40 in video
● Let's consider the function e^x; as we move closer and closer to -∞ the function tends towards 0. This can be represented mathematically using limits as shown above.
● Similarly, as we move towards +∞ the function also moves towards +∞.
● The domain of the function is the set of all real numbers and the range is all positive real numbers (R+ means all positive real numbers, excluding 0).
At timestamp 16.01 in video
● The above function is the absolute value function; both one sided limits of the function at 0 are 0, as shown above.
● The domain of the function is all real numbers, and the range of the function is all non-negative real numbers (it includes 0 because at x = 0 the function gives 0).
At timestamp 17.59 in video
● We use the functions shown above extensively in ML.
3.7 Continuity of functions
● Consider the function 1/x; from the plot you can see that at x = 6 the function is continuous, but at the point x = 0 the curve is discontinuous.
At timestamp 1.20 in video
● For the function tan(x) you can clearly see from the above plot that there are many discontinuities.
At timestamp 3.32
● Now that we have seen enough examples, we define continuity mathematically as shown above.
● A function f(x) is said to be continuous at a point x = a in its domain if the following three conditions are satisfied:
1. f(a) exists (i.e. the value of f(a) is finite)
2. lim x→a f(x) exists (i.e. the right-hand limit = left-hand limit, and both are finite)
3. lim x→a f(x) = f(a)
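A small sketch (not from the video) checking the three conditions with sympy, for f(x) = x^2 at a = 2, and contrasting with 1/x at 0:

```python
from sympy import symbols, limit

x = symbols('x')
f = x**2
a = 2

value = f.subs(x, a)           # condition 1: f(a) exists and is finite
right = limit(f, x, a, '+')    # right-hand limit
left = limit(f, x, a, '-')     # left-hand limit
print(value, right, left)      # 4 4 4 -> conditions 2 and 3 hold: continuous

# Contrast with 1/x at 0: the one-sided limits disagree (oo vs -oo).
print(limit(1/x, x, 0, '+'), limit(1/x, x, 0, '-'))
```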
3.8 Derivatives: geometric intuition
At timestamp 0.24 in video
● Let’s understand the concept of derivatives. Consider the function f(x) = x^2, which is a parabola.
● If we consider the point x = 4, the corresponding y will be 16. If we draw a tangent to the curve at the point x = 4, we call it T4. This tangent makes some angle with the x axis, which we call θ4.
At timestamp 3.35 in video
● The derivative of f(x) with respect to x at x = 4 is defined as the slope of the tangent T4, and it is denoted mathematically as shown above.
● The slope of T4 is nothing but tan θ4.
● The slopes of tangents at different points on the curve can be different.
At timestamp 6.19 in video
● Consider three tangents and the angles they make with the x axis as shown above. We can clearly see that T4 has a positive slope, T6 has a negative slope, and T0 has zero slope.
● If the slope at a point is positive, the underlying curve is increasing there; on the other hand, if the slope at a point is negative, the curve is decreasing.
● If the slope of the tangent at a point is zero, the situation is slightly more complicated and we will discuss it later.
3.9 Derivatives: rate of change + Math
At timestamp 0.55 in video
● Let’s understand derivatives from a rate of change perspective. Consider two points x1, x2 on the curve and their corresponding coordinates as shown above.
● The slope of the secant line can be calculated as shown above.
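For reference, the standard secant-slope formula and its limit (reconstructed here, since the screenshot is not reproduced) are:

```latex
\text{slope of secant} = \frac{f(x_2) - f(x_1)}{x_2 - x_1},
\qquad
f'(x_1) = \lim_{x_2 \to x_1} \frac{f(x_2) - f(x_1)}{x_2 - x_1}
```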
● So the derivative is the rate of change of y around the point x = x1. Instead of computing the derivative at one particular point, we can compute it at all points.
At timestamp 9.56
● For computing the derivative at a specific point we can substitute the corresponding x value.
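A numeric sketch (not from the video): the difference quotient approaches the derivative as h shrinks, here for f(x) = x^2 at x = 4:

```python
f = lambda x: x**2
x0 = 4.0

for h in [1.0, 0.1, 0.01, 0.001]:
    approx = (f(x0 + h) - f(x0)) / h   # slope of the secant through x0, x0+h
    print(f"h = {h:<6} slope ~ {approx:.4f}")   # approaches 2 * x0 = 8
```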
3.10 Derivative of Common functions
At timestamp 0.10 in video
● Let’s see the derivatives of the most commonly used functions in machine learning. The function shown above is a polynomial function, and its derivative can be obtained as shown.
● The derivatives of log x and e^x are as shown above; using limits you can derive these easily.
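These can be double-checked with sympy (an illustrative sketch, not from the video):

```python
from sympy import symbols, diff, log, exp, Abs

x = symbols('x', real=True)

print(diff(x**3, x))     # 3*x**2  (power rule: d/dx x^n = n*x^(n-1))
print(diff(log(x), x))   # 1/x
print(diff(exp(x), x))   # exp(x)
print(diff(Abs(x), x))   # sign(x) -- undefined at x = 0 (see below)
```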
● The derivative of abs(x) can be calculated as shown using limits.
● In the case of the absolute value function, if x > 0 the derivative is +1 (and if x < 0 it is -1).
● When x = 0 the derivative is not defined, because the left handed limit is not equal to the right handed limit.
● You can see the above functions are differentiable on the other parts of the curve, but at x = 0 they are not. Wherever the graph or curve is non-smooth, we tend to face the problem of non-differentiability.
At timestamp 3.02
● We also face the problem of non-differentiability when the function is discontinuous, because the left and right sided limits will not be the same.
3.12 Rules of differentiation
● The derivative of y = f(x) with respect to x can be represented in any of the above ways.
● We have rules of differentiation which help us when we differentiate large expressions. These rules are listed below.
● When we have to find the derivative of a product of two functions we use the product rule.
At timestamp 13.07
● The quotient rule is used when we have to find the derivative of a function f(x) divided by a function g(x); here g(x) cannot be 0.
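A sketch (not from the video) verifying both rules with sympy, for hypothetical f(x) = x^2 and g(x) = sin(x):

```python
from sympy import symbols, diff, sin, simplify

x = symbols('x')
f = x**2
g = sin(x)

# Product rule: (f*g)' = f'*g + f*g'
print(simplify(diff(f * g, x) - (diff(f, x) * g + f * diff(g, x))))   # 0

# Quotient rule: (f/g)' = (f'*g - f*g') / g**2, valid where g(x) != 0
print(simplify(diff(f / g, x) - (diff(f, x) * g - f * diff(g, x)) / g**2))  # 0
```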
3.13 Maxima and Minima
At timestamp 1.19
● All the calculus we have studied till now is primarily for studying maxima and minima. In ML, most of the time we try to minimize or maximize certain functions or mathematical expressions.
● Of the functions shown above, the first doesn’t have a maxima but has a minima at x = x1, and the other doesn’t have a minima but has a maxima at x = x2.
● A function can have both maxima and minima, some functions can have multiple maxima and minima, and some functions may not have any minima or maxima at all.
At timestamp 2.19
● Whenever we have a local maxima or minima we use the general term optima; it covers both cases.
● Consider the curve y = x^2 shown above and two tangents T1, T2 at points x1, x2 on the curve, making angles θ1 and θ2 with the x axis respectively. Let’s see how to find the minima mathematically.
● We can clearly see that the slope of tangent T2 is less than the slope of T1. The slope is nothing but the derivative of the function at x = x1, x2.
● The derivatives at x1 and x2 are both positive, but the derivative at x1 > the derivative at x2.
● At x = 0 the tangent is nothing but the x-axis, so the slope of the tangent at x = 0 is 0.
At timestamp 13.10
● You can observe a similar trend on the left side of the curve as well, as shown above.
At timestamp 13.58
● We can observe that on one side of the minima the derivatives are all positive, on the other side the derivatives are all negative, and at the minima the derivative is zero.
● We find the minima at a point where the derivative is zero, with the function increasing on the right side of that point and decreasing on the left side.
At timestamp 16.40
● Just like the derivative at a minima is zero, the derivative at a maxima is also 0. You can see that the angle made by the tangent with the x-axis is 0, so the derivative or slope is 0.
● We find the maxima at a point where the derivative is zero, with the function decreasing on the right side of that point and increasing on the left side.
At timestamp 18.54
● When the derivative is zero we can analyse whether the point is a maxima or a minima as shown above.
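A sketch of this analysis (the second derivative test, not taken verbatim from the video) using sympy:

```python
from sympy import symbols, diff, solve

x = symbols('x')

for f in (x**2, -x**2):
    for p in solve(diff(f, x), x):        # points where f'(x) = 0
        curvature = diff(f, x, 2).subs(x, p)
        kind = "minima" if curvature > 0 else "maxima"
        print(f"f = {f}: stationary at x = {p}, f'' = {curvature} -> {kind}")
```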
3.15 Partial derivatives & Del
At timestamp 2.4
● We have seen how to find derivatives when we have one variable; here we have more than one variable, so we use the concept of a partial derivative, as shown above.
● We can extend the same concept to n dimensions.
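A short sketch (not from the video) of partial derivatives with sympy, for a hypothetical f(x, y); each partial treats the other variable as a constant:

```python
from sympy import symbols, diff

x, y = symbols('x y')
f = x**2 * y + y**3    # a hypothetical function of two variables

print(diff(f, x))   # 2*x*y          (partial derivative with respect to x)
print(diff(f, y))   # x**2 + 3*y**2  (partial derivative with respect to y)
```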
3.16 Optima using Partial derivatives
At timestamp 0.53
● Above is an example of calculating partial derivatives; by equating them to 0 we get an optimum at x = 1 and y = -1. The optimum can be either a minima or a maxima.
● If we plot the function we can clearly see that the optimum we arrived at using partial derivatives, at (x, y) = (1, -1), is a minima.
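A sketch of the procedure with a hypothetical function that has the same optimum (the exact function from the video is not reproduced here):

```python
from sympy import symbols, diff, solve

x, y = symbols('x y')
f = x**2 + y**2 - 2*x + 2*y   # hypothetical; its optimum is at (1, -1)

# Equate both partial derivatives to zero and solve the system.
print(solve([diff(f, x), diff(f, y)], [x, y]))   # {x: 1, y: -1}

# Both second order partials are positive, so this optimum is a minima.
print(diff(f, x, 2), diff(f, y, 2))   # 2 2
```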
● There is a special case where the derivative of the function (the regular derivative, or the vector of partial derivatives) is 0 but the point is neither a minima nor a maxima.
3.17 Saddle point
At timestamp 1.15
● For the above function you can clearly see that the derivative of the function at x = 0 is 0, yet x = 0 is neither a minima nor a maxima.
● When we have more than one variable in our function, as shown above, it's slightly tricky: you can find the partial derivatives, equate them to 0, and say that at (x, y) = (0, 0) we have an optimum. But there is no optimum there.
● When we calculate the second order partial derivatives, you can observe that along x the point behaves like a minima while along y it behaves like a maxima, so the point is a saddle point rather than an optimum.
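A sketch using the classic saddle example (hypothetical here; the video's function may differ), f(x, y) = x^2 - y^2:

```python
from sympy import symbols, diff, solve

x, y = symbols('x y')
f = x**2 - y**2

# Both partials vanish at (0, 0), which looks like an optimum...
print(solve([diff(f, x), diff(f, y)], [x, y]))   # {x: 0, y: 0}

# ...but the second order partials disagree in sign: along x the point
# behaves like a minima (f_xx = 2 > 0), along y like a maxima (f_yy = -2 < 0).
print(diff(f, x, 2), diff(f, y, 2))   # 2 -2
```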
3.18 Gradient Descent
At timestamp 3.48
● Let's consider the above curve; we want to find the minima. We pick a random point x0 and find the derivative at that point. Since the derivative is positive, we know that the function is increasing there, so the minima < x0.
● We know that we have to move towards the minima, and we do this by using the above update equation.
● We keep updating x using the equation until we reach the minima, as shown.
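A minimal sketch (not from the video), assuming the standard update x_new = x_old - η·f'(x_old), for f(x) = (x - 3)^2:

```python
f_prime = lambda x: 2 * (x - 3)   # derivative of f(x) = (x - 3)**2

x = 10.0     # x0: a randomly chosen starting point
eta = 0.1    # learning rate / step size

for step in range(50):
    x = x - eta * f_prime(x)     # move against the slope

print(x)     # ~3.0, the minima of f
```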
At timestamp 13.45
● The algorithm is called gradient descent because we use the gradient (the slope or derivative) and slowly descend towards the minima, starting from a random point on the curve.
3.19 Gradient Descent with multiple variables
At timestamp 0.23
● We are trying to understand gradient descent with multiple variables, as shown above.
● We start with x0, y0 (chosen randomly) and keep updating the values using the above equations until the partial derivatives get as close to zero as possible.
● Let's solve our regression problem using gradient descent. We arrived at the above optimization problem, where we have to find the optimal w, b such that the sum of squared distances is as small as possible.
● Given the dataset D = {(xi, yi)}, the problem at hand is minimizing the above function.
● We can use gradient descent here, since we have two variables w, b and we have to find the minima.
At timestamp 4.16
● We calculate the partial derivatives with respect to w and b as shown above.
At timestamp 12.42
● Initially we randomly pick the w vector and b; then we use gradient descent, keep updating the values, and move towards the minima as shown above.
● We use gradient descent even when we have more than two variables. So, given a regression problem, we can find the best line/plane using gradient descent to minimise the loss.
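Putting it all together, a compact sketch (not from the video) of linear regression trained with gradient descent on the squared loss, using hypothetical synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)                    # one feature
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=100)    # hypothetical true line

w, b = 0.0, 0.0   # initialisation
eta = 0.01        # learning rate
n = len(x)

for step in range(2000):
    residual = y - (w * x + b)
    grad_w = -2.0 / n * np.sum(residual * x)   # partial derivative dL/dw
    grad_b = -2.0 / n * np.sum(residual)       # partial derivative dL/db
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)   # close to the true w = 3, b = 2
```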