6 Gradient Method


Gradient
• If we look at the level sets of a function (its projections at different
heights), the gradient indicates the direction along which the function
increases at the maximum rate.

The level set of a function f: ℝⁿ → ℝ at a level c is the set of points S = {x : f(x) = c}.
Example:

f(x, y) = x e^(−x² − y²),  x ∈ [−0.1, 2],  y ∈ [−0.1, 3]

• The level sets appear as the contour lines of the function.

• As we approach the extremum, the level sets close in around it and the gradient tends to zero; a plotting sketch follows below.
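To visualize this, a minimal plotting sketch (assuming NumPy and Matplotlib; the grid resolution is arbitrary) draws the contour lines of the example function:

```python
import numpy as np
import matplotlib.pyplot as plt

# The example function f(x, y) = x * exp(-x^2 - y^2) on the given ranges
x = np.linspace(-0.1, 2, 200)
y = np.linspace(-0.1, 3, 200)
X, Y = np.meshgrid(x, y)
Z = X * np.exp(-X**2 - Y**2)

# Each contour line is a level set {(x, y) : f(x, y) = c}
plt.contour(X, Y, Z, levels=15)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Level sets of f(x, y) = x exp(-x^2 - y^2)")
plt.show()
```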

• If we consider a direction of increase d with ‖d‖ = 1, then the rate of
increase of f(x) along d at x is

∇f(x)ᵀ d ≤ ‖∇f(x)‖

{Cauchy–Schwarz inequality}.

• If d = ∇f(x) / ‖∇f(x)‖, i.e., the unit vector along the direction of
∇f(x), then

∇f(x)ᵀ (∇f(x) / ‖∇f(x)‖) = ‖∇f(x)‖

Hence this is the direction where we will have the maximum rate of increase.
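As a quick numerical check (a minimal sketch with NumPy; the test point is an illustrative choice), the directional derivative ∇f(x)ᵀd over random unit directions never exceeds ‖∇f(x)‖:

```python
import numpy as np

def grad_f(x, y):
    # Gradient of the earlier example f(x, y) = x * exp(-x^2 - y^2)
    e = np.exp(-x**2 - y**2)
    return np.array([(1 - 2 * x**2) * e, -2 * x * y * e])

g = grad_f(0.3, 0.5)
rng = np.random.default_rng(0)
d = rng.normal(size=(1000, 2))
d /= np.linalg.norm(d, axis=1, keepdims=True)   # random unit directions

rates = d @ g                                   # directional derivatives
print(rates.max(), np.linalg.norm(g))           # max rate <= ||grad f(x)||
```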


• This indicates that −∇f(x) is the direction of maximum rate of decrease, and we will find the minimum if we move along this direction.

• Thus we will search along the direction of the negative gradient: Gradient Descent.

Gradient Descent
• Let us start at an initial approximation x₀; then, to find the minimum, the next point according to gradient
descent will be x₁ = x₀ − ∇f(x₀).

• Here we can add a step size to avoid oscillation:

x₁ = x₀ − α∇f(x₀)

• Thus, the update equation will be

xₖ₊₁ = xₖ − αₖ∇f(xₖ)

• Here, αₖ > 0 is called the Step Size or Learning Rate: we start at xₖ and move along the direction
−αₖ∇f(xₖ).

• αₖ is a hyper-parameter of the algorithm.
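A minimal sketch of this update loop (the objective, starting point, α, and iteration count below are illustrative):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, iters=100):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# Example: f(x) = (x1 - 1)^2 + (x2 + 2)^2 has gradient (2(x1 - 1), 2(x2 + 2))
grad = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])
print(gradient_descent(grad, [0.0, 0.0]))   # approaches (1, -2)
```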

• The gradient varies as the search proceeds, tending to zero as we approach the minimizer.
• We have the option of either taking very small steps and re-evaluating the gradient at every step, or
taking large steps each time.
• The first approach results in a laborious method of reaching the minimizer, whereas the second
approach may result in a more zigzag path to the minimizer.
• The advantage of the second approach is possibly fewer gradient evaluations.

Steepest Descent
• Here, the step size 𝛼𝑘 is chosen to achieve the maximum reduction in the objective function at each
step.
• Some methods of selecting αₖ are
• Fixed learning rate
• Line search
• Bold driver
• Momentum (a sketch of this variant follows below)
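Of these, the fixed rate and line search are developed in the following slides. For reference, here is a hedged sketch of one common momentum formulation, xₖ₊₁ = xₖ − α∇f(xₖ) + β(xₖ − xₖ₋₁); the coefficients α and β and the test function are illustrative assumptions, not values from the slides:

```python
import numpy as np

def momentum_descent(grad, x0, alpha=0.1, beta=0.9, iters=500):
    """Gradient descent with a momentum term that reuses the previous step."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(iters):
        x_next = x - alpha * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Ill-conditioned quadratic where plain fixed-rate descent zigzags
grad = lambda x: np.array([2 * (x[0] - 1), 20 * x[1]])
print(momentum_descent(grad, [5.0, 5.0]))   # approaches (1, 0)
```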

Steepest Descent: Fixed Rate

Example: Find the minimum of

f(x₁, x₂, x₃) = (x₁ − 4)⁴ + (x₂ − 3)² + 4(x₃ + 5)⁴

Solution: Assume x₀ = (4, 2, −1)ᵀ and choose α = 0.001.

Step 1: Calculate the gradient at the point:

∇f(x) = (4(x₁ − 4)³, 2(x₂ − 3), 16(x₃ + 5)³)ᵀ

Evaluated at x₀ = (4, 2, −1)ᵀ, this gives ∇f(x₀) = (0, −2, 1024)ᵀ.
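In code, this step and the corresponding fixed-rate update look as follows (a sketch, with α = 0.001 as chosen above):

```python
import numpy as np

def grad_f(x):
    # Gradient of f = (x1 - 4)^4 + (x2 - 3)^2 + 4*(x3 + 5)^4
    return np.array([4 * (x[0] - 4)**3,
                     2 * (x[1] - 3),
                     16 * (x[2] + 5)**3])

x0 = np.array([4.0, 2.0, -1.0])
g0 = grad_f(x0)
print(g0)                  # [0, -2, 1024]
print(x0 - 0.001 * g0)     # fixed-rate step: [4.0, 2.002, -2.024]
```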

Steepest Descent: Line Search

• Step 2: Form φ(α) = f(x₀ − α∇f(x₀)). With ∇f(x₀) = (0, −2, 1024)ᵀ,

φ(α) = (2α − 1)² + 4(4 − 1024α)⁴

• Step 3: Use line search to determine αₖ and then calculate the next point xₖ₊₁:

• Minimize φ(α), such that α ≥ 0.

• The solution will give you α₀.

• We can find α₀ using the Secant Method:

α⁽ᵏ⁺¹⁾ = α⁽ᵏ⁾ − φ′(α⁽ᵏ⁾) (α⁽ᵏ⁾ − α⁽ᵏ⁻¹⁾) / (φ′(α⁽ᵏ⁾) − φ′(α⁽ᵏ⁻¹⁾))

⇒ α₀ ≈ 3.967 × 10⁻³
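A sketch of this secant iteration (the initial guesses and tolerance are illustrative; φ′ is differentiated from the φ above):

```python
def phi_prime(a):
    # Derivative of phi(a) = (2a - 1)^2 + 4*(4 - 1024a)^4
    return 4 * (2 * a - 1) - 16384 * (4 - 1024 * a)**3

def secant(dphi, a0, a1, tol=1e-12, max_iter=100):
    """Secant iteration for dphi(alpha) = 0."""
    for _ in range(max_iter):
        d0, d1 = dphi(a0), dphi(a1)
        a2 = a1 - d1 * (a1 - a0) / (d1 - d0)
        if abs(a2 - a1) < tol:
            return a2
        a0, a1 = a1, a2
    return a1

print(secant(phi_prime, 0.0, 0.001))   # approx 3.967e-3
```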


• Place this α₀ in x₁ = x₀ − α₀∇f(x₀).

• Go to Step 1 and use this x₁ to find x₂, and keep repeating Steps 1, 2 and 3 to reach the minimum
of f(x₁, x₂, x₃). Each iteration will have a new α value; the full loop is sketched below.
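Putting the steps together (a sketch; SciPy's bounded scalar minimizer stands in for the secant line search, and the bracket (0, 1) is an assumption):

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: (x[0] - 4)**4 + (x[1] - 3)**2 + 4 * (x[2] + 5)**4
grad_f = lambda x: np.array([4 * (x[0] - 4)**3,
                             2 * (x[1] - 3),
                             16 * (x[2] + 5)**3])

x = np.array([4.0, 2.0, -1.0])
for _ in range(10):
    g = grad_f(x)                          # Step 1: gradient at x_k
    phi = lambda a: f(x - a * g)           # Step 2: phi(alpha)
    alpha = minimize_scalar(phi, bounds=(0, 1), method="bounded").x
    x = x - alpha * g                      # Step 3: update x_{k+1}
print(x)                                   # approaches the minimizer (4, 3, -5)
```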

Selection of the Stopping Criteria

• If we are working with a convex function, then stop when ∇f(xₖ) = 0.

• But if we are working with local minima or constraint sets, then we will have to modify the
stopping criterion, for example (all three are combined in the sketch below):

• Stop when the step size αₖ approaches zero.

• Stop when the error ‖xₖ − xₖ₋₁‖ ≤ ε or |f(xₖ) − f(xₖ₋₁)| ≤ ε for 0 < ε ≪ 1.

• Stop if the gradient ‖∇f(xₖ)‖ is small enough.
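A sketch combining these criteria in one fixed-rate loop (the tolerance ε, step size, and test function are illustrative):

```python
import numpy as np

def descend(f, grad, x0, alpha=0.1, eps=1e-8, max_iter=10_000):
    """Fixed-rate descent that stops on a small gradient, a small step,
    or small progress in f, whichever occurs first."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:                 # small gradient
            break
        x_new = x - alpha * g
        small_step = np.linalg.norm(x_new - x) <= eps
        small_progress = abs(f(x_new) - f(x)) <= eps
        x = x_new
        if small_step or small_progress:
            break
    return x

f = lambda x: (x[0] - 1)**2 + 4 * (x[1] + 2)**2
grad = lambda x: np.array([2 * (x[0] - 1), 8 * (x[1] + 2)])
print(descend(f, grad, [0.0, 0.0]))                  # approaches (1, -2)
```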

Examples
1. Find the solution of h(x) = 0 using fixed-rate steepest descent.

h(x) = (4 + 3x₁ + 2x₂, 1 + 2x₁ + 3x₂)ᵀ = 0

Set h(x) = ∇f(x), so that xₖ₊₁ = xₖ − αh(xₖ). The iteration converges to

x₁ = −1.999
x₂ = 0.999

(the exact solution is x₁ = −2, x₂ = 1).
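A sketch reproducing this example (α = 0.1 and the starting point are illustrative choices):

```python
import numpy as np

# Treat h(x) = 0 as grad f(x) = 0 and run the fixed-rate update
h = lambda x: np.array([4 + 3 * x[0] + 2 * x[1],
                        1 + 2 * x[0] + 3 * x[1]])

x = np.zeros(2)
for _ in range(100):
    x = x - 0.1 * h(x)
print(x)   # approaches (-2, 1): x1 = -1.999..., x2 = 0.999...
```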

Steepest Descent for Quadratic Function
• For quadratic functions of the form

f(x) = ½ xᵀQx − bᵀx,

where Q ∈ ℝⁿˣⁿ is a positive definite symmetric matrix, b ∈ ℝⁿ and x ∈ ℝⁿ, the gradient is

∇f(x) = Qx − b = g

• In this case, for the update equation xₖ₊₁ = xₖ − αₖgₖ, we can determine αₖ in closed form.

• Since

αₖ = argmin_{α≥0} f(xₖ − αgₖ)

⇒ minimize φ(α) = ½ (xₖ − αgₖ)ᵀQ(xₖ − αgₖ) − bᵀ(xₖ − αgₖ)

⇒ φ′(α) = (xₖ − αgₖ)ᵀQ(−gₖ) − bᵀ(−gₖ)


Setting φ′(α) = (xₖ − αgₖ)ᵀQ(−gₖ) − bᵀ(−gₖ) = 0:

⇒ −xₖᵀQgₖ + αₖ gₖᵀQgₖ + bᵀgₖ = 0
⇒ −(xₖᵀQ − bᵀ)gₖ + αₖ gₖᵀQgₖ = 0
⇒ −gₖᵀgₖ + αₖ gₖᵀQgₖ = 0

⇒ αₖ = gₖᵀgₖ / (gₖᵀQgₖ)

Hence, the update equation would be

xₖ₊₁ = xₖ − (gₖᵀgₖ / gₖᵀQgₖ) gₖ,  where gₖ = Qxₖ − b
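A sketch of steepest descent with this closed-form step size (Q, b, and the starting point are illustrative; Q must be symmetric positive definite):

```python
import numpy as np

def quadratic_steepest_descent(Q, b, x0, iters=50):
    """Minimize f(x) = 0.5 x^T Q x - b^T x with the exact step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = Q @ x - b                        # g_k = Q x_k - b
        if np.allclose(g, 0):
            break
        alpha = (g @ g) / (g @ (Q @ g))      # alpha_k = g^T g / g^T Q g
        x = x - alpha * g
    return x

Q = np.array([[3.0, 2.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])
print(quadratic_steepest_descent(Q, b, np.zeros(2)))
print(np.linalg.solve(Q, b))   # exact minimizer, Q x* = b, for comparison
```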


For a fixed step size α, the gradient descent iterates satisfy xₖ → x* for any x₀ if and only if

0 < α < 2 / λ_max(Q),

where λ_max(Q) is the maximum eigenvalue of Q.
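The bound can be checked numerically (a sketch reusing the Q and b above; the two α values straddle the threshold 2/λ_max = 0.4):

```python
import numpy as np

Q = np.array([[3.0, 2.0], [2.0, 3.0]])   # eigenvalues 1 and 5
b = np.array([1.0, 2.0])
lam_max = np.linalg.eigvalsh(Q).max()
print("threshold 2/lam_max =", 2 / lam_max)

for alpha in (0.3, 0.5):                 # 0.3 < 0.4 < 0.5
    x = np.zeros(2)
    for _ in range(200):
        x = x - alpha * (Q @ x - b)
    print(alpha, x)                      # converges for 0.3, blows up for 0.5
```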

Home Work 1
Convert the function into the f(x) = ½ xᵀQx + bᵀx format and show steepest descent with varying α
for 2 iterations.

f(x) = (3/2)(x₁² + x₂²) + 2x₁x₂ + 3x₁ − x₂ − 22

Hint: Q must be symmetric and positive definite.

• 0 < α < 2 / λ_max(Q)

• Define f(x) and ∇f(x) = Qxₖ + b (note the sign: the linear term here is +bᵀx) as functions in your script.

• Use the update equation xₖ₊₁ = xₖ − α∇f(xₖ).

• Repeat the same problem with αₖ = gₖᵀgₖ / (gₖᵀQgₖ) instead of a fixed value.

• Note down the differences, and the pros and cons you find.

Home Work 2
Show the first two iterations of steepest descent with a fixed α for the following function:

f(x) = xᵀ [4 2√2; 0 5] x + xᵀ(3, 6)ᵀ + 24

• Hint: Rearrange the equation to make Q symmetric and then check for positive definiteness (a sketch of this check follows the list below).
• Q = [8 2√2; 2√2 10]; find the maximum eigenvalue of Q.
• 0 < α < 2 / λ_max(Q)
• Define f(x) and ∇f(x) = Qxₖ + b as functions in your script.
• Use the update equation xₖ₊₁ = xₖ − α∇f(xₖ).
• Repeat the same problem with αₖ = gₖᵀgₖ / (gₖᵀQgₖ) instead of a fixed value.
• Note down the differences, and the pros and cons you find.
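A sketch of the hint only (symmetrize, check positive definiteness, and get the step-size bound); the two iterations themselves are left as the exercise:

```python
import numpy as np

A = np.array([[4.0, 2.0 * np.sqrt(2.0)],
              [0.0, 5.0]])
Q = A + A.T                    # x^T A x = 0.5 x^T (A + A^T) x
print(Q)                       # [[8, 2*sqrt(2)], [2*sqrt(2), 10]]

eigs = np.linalg.eigvalsh(Q)
print("positive definite:", bool(np.all(eigs > 0)))
print("step-size bound 2/lam_max:", 2 / eigs.max())
```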
