6 Gradient Method


Gradient
• If we look at the level sets of a function (its projections at different
heights), the gradient indicates the direction along which the function
increases at the maximum rate.

The level set of a function f: ℝⁿ → ℝ at a level c is the set of points S = {x : f(x) = c}.
Example:

f(x, y) = x e^(−x² − y²),  x ∈ [−0.1, 2],  y ∈ [−0.1, 3]

• The level sets appear as the contour lines of the function.

• As we approach the extremum, the level sets close in around it and the gradient tends to zero; a plotting sketch follows below.
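To visualize this, a minimal plotting sketch (assuming NumPy and Matplotlib; the grid resolution is arbitrary) draws the contour lines of the example function:

```python
import numpy as np
import matplotlib.pyplot as plt

# The example function f(x, y) = x * exp(-x^2 - y^2) on the given ranges
x = np.linspace(-0.1, 2, 200)
y = np.linspace(-0.1, 3, 200)
X, Y = np.meshgrid(x, y)
Z = X * np.exp(-X**2 - Y**2)

# Each contour line is a level set {(x, y) : f(x, y) = c}
plt.contour(X, Y, Z, levels=15)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Level sets of f(x, y) = x exp(-x^2 - y^2)")
plt.show()
```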

• If we consider a direction of increase d with ‖d‖ = 1, then the rate of
increase of f(x) along d at x is

∇f(x)ᵀ d ≤ ‖∇f(x)‖

{Cauchy–Schwarz inequality}.

• If d = ∇f(x) / ‖∇f(x)‖, i.e., the unit vector along the direction of
∇f(x), then

∇f(x)ᵀ (∇f(x) / ‖∇f(x)‖) = ‖∇f(x)‖

Hence this is the direction where we will have the maximum rate of increase.
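As a quick numerical check (a minimal sketch with NumPy; the test point is an illustrative choice), the directional derivative ∇f(x)ᵀd over random unit directions never exceeds ‖∇f(x)‖:

```python
import numpy as np

def grad_f(x, y):
    # Gradient of the earlier example f(x, y) = x * exp(-x^2 - y^2)
    e = np.exp(-x**2 - y**2)
    return np.array([(1 - 2 * x**2) * e, -2 * x * y * e])

g = grad_f(0.3, 0.5)
rng = np.random.default_rng(0)
d = rng.normal(size=(1000, 2))
d /= np.linalg.norm(d, axis=1, keepdims=True)   # random unit directions

rates = d @ g                                   # directional derivatives
print(rates.max(), np.linalg.norm(g))           # max rate <= ||grad f(x)||
```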


• This indicates that −∇f(x) is the direction of maximum rate of decrease, and we will find the minimum if we move along this direction.

• Thus we will search along the direction of the negative gradient: Gradient Descent.

Gradient Descent
• Let us start at an initial approximation x₀; then, to find the minimum, the next point according to gradient
descent will be x₁ = x₀ − ∇f(x₀).

• Here we can add a step size to avoid oscillation:

x₁ = x₀ − α∇f(x₀)

• Thus, the update equation will be

xₖ₊₁ = xₖ − αₖ∇f(xₖ)

• Here, αₖ > 0 is called the Step Size or Learning Rate: we start at xₖ and move along the direction
−αₖ∇f(xₖ).

• αₖ is a hyper-parameter of the algorithm.
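A minimal sketch of this update loop (the objective, starting point, α, and iteration count below are illustrative):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, iters=100):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# Example: f(x) = (x1 - 1)^2 + (x2 + 2)^2 has gradient (2(x1 - 1), 2(x2 + 2))
grad = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])
print(gradient_descent(grad, [0.0, 0.0]))   # approaches (1, -2)
```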

• The gradient varies as the search proceeds, tending to zero as we approach the minimizer.
• We have the option of either taking very small steps and re-evaluating the gradient at every step, or
taking large steps each time.
• The first approach results in a laborious method of reaching the minimizer, whereas the second
approach may result in a more zigzag path to the minimizer.
• The advantage of the second approach is possibly fewer gradient evaluations.

Steepest Descent
• Here, the step size 𝛼𝑘 is chosen to achieve the maximum reduction in the objective function at each
step.
• Some methods of selecting αₖ are
• Fixed learning rate
• Line search
• Bold driver
• Momentum (a sketch of this variant follows below)
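Of these, the fixed rate and line search are developed in the following slides. For reference, here is a hedged sketch of one common momentum formulation, xₖ₊₁ = xₖ − α∇f(xₖ) + β(xₖ − xₖ₋₁); the coefficients α and β and the test function are illustrative assumptions, not values from the slides:

```python
import numpy as np

def momentum_descent(grad, x0, alpha=0.1, beta=0.9, iters=500):
    """Gradient descent with a momentum term that reuses the previous step."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(iters):
        x_next = x - alpha * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Ill-conditioned quadratic where plain fixed-rate descent zigzags
grad = lambda x: np.array([2 * (x[0] - 1), 20 * x[1]])
print(momentum_descent(grad, [5.0, 5.0]))   # approaches (1, 0)
```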

Steepest Descent: Fixed Rate

Example: Find the minimum of

f(x₁, x₂, x₃) = (x₁ − 4)⁴ + (x₂ − 3)² + 4(x₃ + 5)⁴

Solution: Assume x₀ = (4, 2, −1)ᵀ and choose α = 0.001.

Step 1: Calculate the gradient at the point:

∇f(x) = (4(x₁ − 4)³, 2(x₂ − 3), 16(x₃ + 5)³)ᵀ

Evaluated at x₀ = (4, 2, −1)ᵀ, this gives ∇f(x₀) = (0, −2, 1024)ᵀ.
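In code, this step and the corresponding fixed-rate update look as follows (a sketch, with α = 0.001 as chosen above):

```python
import numpy as np

def grad_f(x):
    # Gradient of f = (x1 - 4)^4 + (x2 - 3)^2 + 4*(x3 + 5)^4
    return np.array([4 * (x[0] - 4)**3,
                     2 * (x[1] - 3),
                     16 * (x[2] + 5)**3])

x0 = np.array([4.0, 2.0, -1.0])
g0 = grad_f(x0)
print(g0)                  # [0, -2, 1024]
print(x0 - 0.001 * g0)     # fixed-rate step: [4.0, 2.002, -2.024]
```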

Steepest Descent: Line Search

• Step 2: Form φ(α) = f(x₀ − α∇f(x₀)). With ∇f(x₀) = (0, −2, 1024)ᵀ,

φ(α) = (2α − 1)² + 4(4 − 1024α)⁴

• Step 3: Use line search to determine αₖ and then calculate the next point xₖ₊₁:

• Minimize φ(α), such that α ≥ 0.

• The solution will give you α₀.

• We can find α₀ using the Secant Method:

α⁽ᵏ⁺¹⁾ = α⁽ᵏ⁾ − φ′(α⁽ᵏ⁾) (α⁽ᵏ⁾ − α⁽ᵏ⁻¹⁾) / (φ′(α⁽ᵏ⁾) − φ′(α⁽ᵏ⁻¹⁾))

⇒ α₀ ≈ 3.967 × 10⁻³
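A sketch of this secant iteration (the initial guesses and tolerance are illustrative; φ′ is differentiated from the φ above):

```python
def phi_prime(a):
    # Derivative of phi(a) = (2a - 1)^2 + 4*(4 - 1024a)^4
    return 4 * (2 * a - 1) - 16384 * (4 - 1024 * a)**3

def secant(dphi, a0, a1, tol=1e-12, max_iter=100):
    """Secant iteration for dphi(alpha) = 0."""
    for _ in range(max_iter):
        d0, d1 = dphi(a0), dphi(a1)
        a2 = a1 - d1 * (a1 - a0) / (d1 - d0)
        if abs(a2 - a1) < tol:
            return a2
        a0, a1 = a1, a2
    return a1

print(secant(phi_prime, 0.0, 0.001))   # approx 3.967e-3
```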


• Place this α₀ in x₁ = x₀ − α₀∇f(x₀).

• Go to Step 1 and use this x₁ to find x₂, and keep repeating Steps 1, 2 and 3 to reach the minimum
of f(x₁, x₂, x₃). Each iteration will have a new α value; the full loop is sketched below.
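Putting the steps together (a sketch; SciPy's bounded scalar minimizer stands in for the secant line search, and the bracket (0, 1) is an assumption):

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: (x[0] - 4)**4 + (x[1] - 3)**2 + 4 * (x[2] + 5)**4
grad_f = lambda x: np.array([4 * (x[0] - 4)**3,
                             2 * (x[1] - 3),
                             16 * (x[2] + 5)**3])

x = np.array([4.0, 2.0, -1.0])
for _ in range(10):
    g = grad_f(x)                          # Step 1: gradient at x_k
    phi = lambda a: f(x - a * g)           # Step 2: phi(alpha)
    alpha = minimize_scalar(phi, bounds=(0, 1), method="bounded").x
    x = x - alpha * g                      # Step 3: update x_{k+1}
print(x)                                   # approaches the minimizer (4, 3, -5)
```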

Selection of the Stopping Criteria

• If we are working with a convex function, then stop when ∇f(xₖ) = 0.

• But if we are working with local minima or constraint sets, then we will have to modify the
stopping criterion, for example (all three are combined in the sketch below):

• Stop when the step size αₖ approaches zero.

• Stop when the error ‖xₖ − xₖ₋₁‖ ≤ ε or |f(xₖ) − f(xₖ₋₁)| ≤ ε for 0 < ε ≪ 1.

• Stop if the gradient ‖∇f(xₖ)‖ is small enough.
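A sketch combining these criteria in one fixed-rate loop (the tolerance ε, step size, and test function are illustrative):

```python
import numpy as np

def descend(f, grad, x0, alpha=0.1, eps=1e-8, max_iter=10_000):
    """Fixed-rate descent that stops on a small gradient, a small step,
    or small progress in f, whichever occurs first."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:                 # small gradient
            break
        x_new = x - alpha * g
        small_step = np.linalg.norm(x_new - x) <= eps
        small_progress = abs(f(x_new) - f(x)) <= eps
        x = x_new
        if small_step or small_progress:
            break
    return x

f = lambda x: (x[0] - 1)**2 + 4 * (x[1] + 2)**2
grad = lambda x: np.array([2 * (x[0] - 1), 8 * (x[1] + 2)])
print(descend(f, grad, [0.0, 0.0]))                  # approaches (1, -2)
```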

Examples
1. Find the solution of h(x) = 0 using fixed-rate steepest descent.

h(x) = (4 + 3x₁ + 2x₂, 1 + 2x₁ + 3x₂)ᵀ = 0

Set h(x) = ∇f(x), so that xₖ₊₁ = xₖ − αh(xₖ). The iteration converges to

x₁ = −1.999
x₂ = 0.999

(the exact solution is x₁ = −2, x₂ = 1).
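A sketch reproducing this example (α = 0.1 and the starting point are illustrative choices):

```python
import numpy as np

# Treat h(x) = 0 as grad f(x) = 0 and run the fixed-rate update
h = lambda x: np.array([4 + 3 * x[0] + 2 * x[1],
                        1 + 2 * x[0] + 3 * x[1]])

x = np.zeros(2)
for _ in range(100):
    x = x - 0.1 * h(x)
print(x)   # approaches (-2, 1): x1 = -1.999..., x2 = 0.999...
```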

Steepest Descent for Quadratic Function
• For quadratic functions of the form

f(x) = ½ xᵀQx − bᵀx,

where Q ∈ ℝⁿˣⁿ is a positive definite symmetric matrix, b ∈ ℝⁿ and x ∈ ℝⁿ, the gradient is

∇f(x) = Qx − b = g

• In this case, for the update equation xₖ₊₁ = xₖ − αₖgₖ, we can determine αₖ in closed form.

• Since

αₖ = argmin_{α≥0} f(xₖ − αgₖ)

⇒ minimize φ(α) = ½ (xₖ − αgₖ)ᵀQ(xₖ − αgₖ) − bᵀ(xₖ − αgₖ)

⇒ φ′(α) = (xₖ − αgₖ)ᵀQ(−gₖ) − bᵀ(−gₖ)


Setting φ′(α) = (xₖ − αgₖ)ᵀQ(−gₖ) − bᵀ(−gₖ) = 0:

⇒ −xₖᵀQgₖ + αₖ gₖᵀQgₖ + bᵀgₖ = 0
⇒ −(xₖᵀQ − bᵀ)gₖ + αₖ gₖᵀQgₖ = 0
⇒ −gₖᵀgₖ + αₖ gₖᵀQgₖ = 0

⇒ αₖ = gₖᵀgₖ / (gₖᵀQgₖ)

Hence, the update equation would be

xₖ₊₁ = xₖ − (gₖᵀgₖ / gₖᵀQgₖ) gₖ,  where gₖ = Qxₖ − b
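A sketch of steepest descent with this closed-form step size (Q, b, and the starting point are illustrative; Q must be symmetric positive definite):

```python
import numpy as np

def quadratic_steepest_descent(Q, b, x0, iters=50):
    """Minimize f(x) = 0.5 x^T Q x - b^T x with the exact step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = Q @ x - b                        # g_k = Q x_k - b
        if np.allclose(g, 0):
            break
        alpha = (g @ g) / (g @ (Q @ g))      # alpha_k = g^T g / g^T Q g
        x = x - alpha * g
    return x

Q = np.array([[3.0, 2.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])
print(quadratic_steepest_descent(Q, b, np.zeros(2)))
print(np.linalg.solve(Q, b))   # exact minimizer, Q x* = b, for comparison
```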


For a fixed step size α, the gradient descent iterates satisfy xₖ → x* for any x₀ if and only if

0 < α < 2 / λ_max(Q),

where λ_max(Q) is the maximum eigenvalue of Q.
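The bound can be checked numerically (a sketch reusing the Q and b above; the two α values straddle the threshold 2/λ_max = 0.4):

```python
import numpy as np

Q = np.array([[3.0, 2.0], [2.0, 3.0]])   # eigenvalues 1 and 5
b = np.array([1.0, 2.0])
lam_max = np.linalg.eigvalsh(Q).max()
print("threshold 2/lam_max =", 2 / lam_max)

for alpha in (0.3, 0.5):                 # 0.3 < 0.4 < 0.5
    x = np.zeros(2)
    for _ in range(200):
        x = x - alpha * (Q @ x - b)
    print(alpha, x)                      # converges for 0.3, blows up for 0.5
```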

Home Work 1
Convert the function into the f(x) = ½ xᵀQx + bᵀx format and show steepest descent with varying α
for 2 iterations.

f(x) = (3/2)(x₁² + x₂²) + 2x₁x₂ + 3x₁ − x₂ − 22

Hint: Q must be symmetric and positive definite.

• 0 < α < 2 / λ_max(Q)

• Define f(x) and ∇f(x) = Qxₖ + b (note the sign: the linear term here is +bᵀx) as functions in your script.

• Use the update equation xₖ₊₁ = xₖ − α∇f(xₖ).

• Repeat the same problem with αₖ = gₖᵀgₖ / (gₖᵀQgₖ) instead of a fixed value.

• Note down the differences, and the pros and cons you find.

Home Work 2
Show the first two iterations of steepest descent with a fixed α for the following function:

f(x) = xᵀ [4 2√2; 0 5] x + xᵀ(3, 6)ᵀ + 24

• Hint: Rearrange the equation to make Q symmetric and then check for positive definiteness (a sketch of this check follows the list below).
• Q = [8 2√2; 2√2 10]; find the maximum eigenvalue of Q.
• 0 < α < 2 / λ_max(Q)
• Define f(x) and ∇f(x) = Qxₖ + b as functions in your script.
• Use the update equation xₖ₊₁ = xₖ − α∇f(xₖ).
• Repeat the same problem with αₖ = gₖᵀgₖ / (gₖᵀQgₖ) instead of a fixed value.
• Note down the differences, and the pros and cons you find.
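A sketch of the hint only (symmetrize, check positive definiteness, and get the step-size bound); the two iterations themselves are left as the exercise:

```python
import numpy as np

A = np.array([[4.0, 2.0 * np.sqrt(2.0)],
              [0.0, 5.0]])
Q = A + A.T                    # x^T A x = 0.5 x^T (A + A^T) x
print(Q)                       # [[8, 2*sqrt(2)], [2*sqrt(2), 10]]

eigs = np.linalg.eigvalsh(Q)
print("positive definite:", bool(np.all(eigs > 0)))
print("step-size bound 2/lam_max:", 2 / eigs.max())
```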
