Logistic Regression
Also, the data is normalized to be centered around 0 on the horizontal axis (renamed z), resulting in both positive and negative input values. The sigmoid then maps every such input to an output that always lies between 0 and 1.
Along with being an s-shaped curve, the sigmoid function also represents a probability P, which is a core trait of why it is so useful. When normalized to the origin, this equation gives the probability that the output for a given input will be 1 rather than 0. The sigmoid function is defined as follows,

$$g(z) = \frac{1}{1 + e^{-z}} = 1 - g(-z)$$

$$z = \log_e\left(\frac{P}{1 - P}\right), \qquad P = g(z)$$
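To make the curve and its probability interpretation concrete, here is a minimal NumPy sketch (the `sigmoid` helper and the sample z values are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Inputs centered around 0 map to outputs on either side of 0.5
z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
p = sigmoid(z)
print(p)                      # roughly [0.047, 0.269, 0.5, 0.731, 0.953]
print(sigmoid(0.0))           # exactly 0.5 at the origin

# The inverse relationship: z = log(P / (1 - P)), the log-odds (logit)
print(np.log(p / (1.0 - p)))  # recovers the original z values
```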
Even though the linear function and the sigmoid function have different shapes, they represent the same model of the line. This is best explained with the graph to the right. Therefore, the linear function can be set equal to the sigmoid function's input z.
$$z = \vec{w} \cdot \vec{x} + b$$

$$g(z) = \frac{1}{1 + e^{-z}}$$
Since z and $\vec{w} \cdot \vec{x} + b$ represent the same quantity, they can be substituted for each other, giving the final logistic regression model,

$$f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}$$
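A minimal sketch of this model in NumPy, using arbitrary example values for w, b, and x (the names `f_wb`, `w`, `b`, and `x` are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_wb(x, w, b):
    """Logistic regression model: sigmoid applied to the linear function w.x + b."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Example values (illustrative only)
w = np.array([0.5, -1.2])
b = 0.3
x = np.array([2.0, 1.0])

print(f_wb(x, w, b))  # a probability between 0 and 1 that y = 1
```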
Explaining why the function needs to be normalized to the origin of the horizontal axis, and how these functions are actually derived, can be confusing. This video does a great job of showing both.
$$P(y = 0) + P(y = 1) = 1$$
Research papers and other publications will often refer to this function in a more formal
notation. The following equation reads as the probability that y = 1, given input x, and
parameters w , b.
$$f_{\vec{w},b}(\vec{x}) = P(y = 1 \mid \vec{x};\ \vec{w}, b)$$
For example, a tumor classifier might use a lower threshold to allow more false positives and make sure no tumor is missed. For this application, erring toward marking tumors as malignant (positive) is worth the tradeoff.
Is $f_{\vec{w},b}(\vec{x}) \geq$ threshold?
Yes: $\hat{y} = 1$
No: $\hat{y} = 0$
Now, the next question to ask is: under what circumstances do these two cases happen? Since the function is normalized to the origin of the horizontal axis, the sigmoid crosses the 0.5 threshold exactly at the origin, so $\hat{y} = 1$ whenever $z$ is greater than or equal to 0. Moreover, since $z = \vec{w} \cdot \vec{x} + b$,
$$\vec{w} \cdot \vec{x} + b \geq 0 \implies \hat{y} = 1 \qquad\qquad \vec{w} \cdot \vec{x} + b < 0 \implies \hat{y} = 0$$
Notice how the case $z = 0$ is included in the positive side, even though it is actually the decision boundary itself. Since this is a binary classification problem, the boundary must be assigned to either the positive class or the negative class.
This logic applies in exactly the same way to non-linear (polynomial) functions; finding the decision boundary is again just a matter of setting the function equal to 0.
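As a sketch of how this decision rule might look in code (the 0.5 threshold, data values, and names such as `predict` are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Return y_hat = 1 where f_wb(x) >= threshold, else 0.

    With threshold = 0.5, this is equivalent to checking w.x + b >= 0,
    since the sigmoid crosses 0.5 exactly when its input is 0.
    """
    probs = sigmoid(X @ w + b)
    return (probs >= threshold).astype(int)

# Illustrative example
X = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [-1.0, -2.0]])
w = np.array([1.0, -1.0])
b = 0.0

print(predict(X, w, b))  # [0, 1, 1]
```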
$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2$$
In the case of logistic regression, the per-example squared-error term can be generalized into a loss function $L$. However, the squared error combined with the sigmoid model makes the cost non-convex, so logistic regression uses a different loss,

$$L\left(f_{\vec{w},b}(\vec{x}^{(i)}),\ y^{(i)}\right)$$
$$L\left(f_{\vec{w},b}(\vec{x}^{(i)}),\ y^{(i)}\right) = \begin{cases} -\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) & \text{if } y^{(i)} = 1 \\[4pt] -\log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right) & \text{if } y^{(i)} = 0 \end{cases}$$
The further the prediction is from the target, the higher the loss. The algorithm is strongly incentivized to avoid a high loss, because as the prediction approaches the wrong value (0 when the target is 1, or 1 when the target is 0), the loss approaches infinity.
Even though the following equation looks much more complicated, keep in mind $y$ can only be either 0 or 1. Substituting each of those cases dramatically reduces the function back to its piecewise form above. This final notation expresses the loss in one line, without separating out the two cases,

$$L\left(f_{\vec{w},b}(\vec{x}^{(i)}),\ y^{(i)}\right) = -y^{(i)} \log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right)$$
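A quick sketch verifying numerically that the one-line form matches the piecewise form (function names and the sample values are illustrative):

```python
import numpy as np

def loss_piecewise(f, y):
    """Piecewise logistic loss for a single example."""
    return -np.log(f) if y == 1 else -np.log(1.0 - f)

def loss_one_line(f, y):
    """Same loss written in one line; y being 0 or 1 zeroes out one term."""
    return -y * np.log(f) - (1.0 - y) * np.log(1.0 - f)

for f in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(loss_piecewise(f, y), loss_one_line(f, y))
print("both forms agree")
```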
Using the loss function in conjunction with the original cost function, the overall cost function can be defined as,
$$J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} L\left(f_{\vec{w},b}(\vec{x}^{(i)}),\ y^{(i)}\right)$$
Substituting in the simplified loss function and doing a bit of rearranging, the final cost function is defined.
This particular cost function is derived from statistics, using the maximum likelihood principle. It has the convenient property of being convex (a single global minimum), which makes running gradient descent straightforward.
$$J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right) \right]$$
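A minimal vectorized sketch of this cost in NumPy (the names `compute_cost`, `X`, `y`, and the toy data are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b):
    """Logistic (cross-entropy) cost J(w, b), averaged over all m examples."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                            # predictions, shape (m,)
    losses = -y * np.log(f) - (1 - y) * np.log(1 - f) # per-example loss
    return losses.sum() / m

# Illustrative toy data
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([0, 0, 1, 1])
print(compute_cost(X, y, w=np.array([1.0]), b=-2.0))
```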
$$w = w - \alpha \frac{\partial}{\partial w} J(\vec{w}, b) \qquad\qquad b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$$
$$\frac{\partial}{\partial w_j} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

$$\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
$$w_j = w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} \right]$$

$$b = b - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) \right]$$
Notice how these equations look exactly like the linear regression equations. However, the key difference lies within $f_{\vec{w},b}(\vec{x})$, where the model changes from the line itself to the sigmoid of the line. Even though the updates look the same, they describe very different models.
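A minimal gradient-descent sketch for these updates (the learning rate, iteration count, toy data, and names such as `compute_gradients` are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradients(X, y, w, b):
    """Partial derivatives of J with respect to w and b."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y        # (f_wb(x_i) - y_i), shape (m,)
    dj_dw = (X.T @ err) / m             # shape (n,)
    dj_db = err.sum() / m               # scalar
    return dj_dw, dj_db

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Simultaneously update w and b for a fixed number of iterations."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        dj_dw, dj_db = compute_gradients(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Illustrative toy data: a single feature that separates the classes
X = np.array([[0.5], [1.0], [2.5], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = gradient_descent(X, y)
print(w, b)
```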
Applying regularization to logistic regression closely follows linear regression. Considering the following cost function equation for logistic regression, the only modification needed is to add the regularization term to the end of the equation,
$$J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$
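A sketch of the same cost with the regularization term added (the default λ value and the function name are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost_regularized(X, y, w, b, lambda_=1.0):
    """Cross-entropy cost plus the L2 term (lambda / 2m) * sum(w_j^2)."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)
    cross_entropy = (-y * np.log(f) - (1 - y) * np.log(1 - f)).sum() / m
    reg = (lambda_ / (2 * m)) * np.sum(w ** 2)   # note: b is not regularized
    return cross_entropy + reg
```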
Moving on to the gradient descent algorithm, consider the following equations,
$$w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b)$$

$$b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$$
When solving for the derivatives, they result in nearly the exact same equations as those found for linear regression. The only difference is the model function $f_{\vec{w},b}(\vec{x}^{(i)})$.
$$\frac{\partial}{\partial w_j} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j$$

$$\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
$$b = b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
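A sketch of the regularized gradients matching the derivatives above (names are illustrative; only the w gradient gains the extra (λ/m)·w_j term):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradients_regularized(X, y, w, b, lambda_=1.0):
    """Regularized partial derivatives: only dj_dw gets the (lambda/m) * w term."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w
    dj_db = err.sum() / m
    return dj_dw, dj_db
```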