What is cross-entropy?
Asked 3 years, 3 months ago Active 9 months ago Viewed 46k times

I know that there are a lot of explanations of what cross-entropy is, but I'm still confused.

Is it only a method to describe the loss function? Can we use the gradient descent algorithm to
find the minimum using the loss function?

machine-learning cross-entropy

asked Feb 1 '17 at 21:38 by theateist, edited Mar 29 '19 at 18:53 by nbro

10 Not a good fit for SO. Here's a similar question on the datascience sister site:
datascience.stackexchange.com/questions/9302/… – Metropolis Feb 1 '17 at 21:59

1 Answer

Cross-entropy is commonly used to quantify the difference between two probability
distributions. Usually the "true" distribution (the one that your machine learning algorithm is
trying to match) is expressed in terms of a one-hot distribution.

For example, suppose for a specific training instance, the label is B (out of the possible labels
A, B, and C). The one-hot distribution for this training instance is therefore:

Pr(Class A) Pr(Class B) Pr(Class C)


0.0 1.0 0.0

You can interpret the above "true" distribution to mean that the training instance has 0%
probability of being class A, 100% probability of being class B, and 0% probability of being
class C.
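
As a small aside (a sketch, not part of the original answer; the labels list and the way the label is stored are assumed here purely for illustration), this is one way to build that one-hot vector in NumPy:

    import numpy as np

    # Build the one-hot "true" distribution for label B out of the possible labels A, B, C
    labels = ["A", "B", "C"]
    true_label = "B"

    p = np.zeros(len(labels))
    p[labels.index(true_label)] = 1.0
    print(p)  # [0. 1. 0.]  -> Pr(A)=0.0, Pr(B)=1.0, Pr(C)=0.0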

Now, suppose your machine learning algorithm predicts the following probability distribution:

Pr(Class A) Pr(Class B) Pr(Class C)


0.228 0.619 0.153

How close is the predicted distribution to the true distribution? That is what the cross-entropy
loss determines. Use this formula:

H(p, q) = -Σₓ p(x) · log(q(x))

where p(x) is the true (wanted) probability and q(x) is the predicted probability. The sum runs over the
three classes A, B, and C. In this case the loss is 0.479:

H = -(0.0*ln(0.228) + 1.0*ln(0.619) + 0.0*ln(0.153)) = 0.479


So that is how "wrong" or "far away" your prediction is from the true distribution.
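
The same calculation can be reproduced in a few lines of NumPy (a minimal sketch, not part of the original answer; np.log is the natural log, matching the ln(...) used above):

    import numpy as np

    # True (one-hot) distribution p and predicted distribution q from the example above
    p = np.array([0.0, 1.0, 0.0])
    q = np.array([0.228, 0.619, 0.153])

    # Cross-entropy: H(p, q) = -sum_x p(x) * log(q(x))
    H = -np.sum(p * np.log(q))
    print(H)  # ~0.4797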

Cross-entropy is one out of many possible loss functions (another popular one is SVM hinge
loss). These loss functions are typically written as J(theta) and can be used within gradient
descent, which is an iterative algorithm for moving the parameters (or coefficients) towards the
optimum values. In the standard gradient descent update below, you would replace J(theta) with H(p, q). But note
that you need to compute the derivative of H(p, q) with respect to the parameters first.

theta_j := theta_j - alpha * ∂J(theta)/∂theta_j
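
As a rough sketch of how those pieces fit together (not from the original answer; the linear softmax model, feature size, and learning rate are illustrative assumptions), one gradient-descent step with a cross-entropy loss might look like this in NumPy:

    import numpy as np

    # Hypothetical setup: 3 classes, 4 features, one training example
    rng = np.random.default_rng(0)
    x = rng.normal(size=4)            # input features
    y = np.array([0.0, 1.0, 0.0])     # one-hot true label (class B)

    W = np.zeros((3, 4))              # weights: one row per class
    b = np.zeros(3)                   # biases

    def softmax(z):
        z = z - z.max()               # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # Forward pass: predicted distribution q(x)
    q = softmax(W @ x + b)

    # Cross-entropy loss H(p, q) = -sum_x p(x) * log(q(x))
    loss = -np.sum(y * np.log(q))

    # For softmax + cross-entropy, the gradient w.r.t. the logits is (q - y)
    grad_logits = q - y
    grad_W = np.outer(grad_logits, x)
    grad_b = grad_logits

    # One gradient-descent update: theta := theta - alpha * dJ/dtheta
    alpha = 0.1
    W -= alpha * grad_W
    b -= alpha * grad_b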

So to answer your original questions directly:

Is it only a method to describe the loss function?

Correct, cross-entropy describes the loss between two probability distributions. It is one of
many possible loss functions.

Then we can use, for example, the gradient descent algorithm to find the minimum.

Yes, the cross-entropy loss function can be used as part of gradient descent.

Further reading: one of my other answers related to TensorFlow.

answered Feb 1 '17 at 22:21 by stackoverflowuser2010, edited Mar 29 '19 at 18:52 by nbro

so, cross-entropy describes the loss by sum of probabilities for each example X. – theateist Feb 1 '17
at 22:34

so, instead of describing the error as cross-entropy, can we describe the error as an angle between two
vectors (cosine similarity / angular distance) and try to minimize the angle? – theateist Feb 1 '17 at
22:55

1 apparently it's not the best solution, but I just wanted to know, in theory, if we could use cosine
(dis)similarity to describe the error through the angle and then try to minimize the angle. –
theateist Feb 2 '17 at 17:22

2 @Stephen: If you look at the example I gave, p(x) would be the list of ground-truth probabilities for
each of the classes, which would be [0.0, 1.0, 0.0] . Likewise, q(x) is the list of predicted
probabilities for each of the classes, [0.228, 0.619, 0.153] . H(p, q) is then -(0 * log(0.228) +
1.0 * log(0.619) + 0 * log(0.153)) , which comes out to be 0.479. Note that it's common to use
Python's np.log() function, which is actually the natural log; that matches the ln(...) used in the example above. –
stackoverflowuser2010 Oct 20 '17 at 23:02

1 @HAr: For one-hot encoding of the true label, there is only one non-zero class that we care about.
However, cross-entropy can compare any two probability distributions; it is not necessary that one of
them has one-hot probabilities. – stackoverflowuser2010 Feb 13 '18 at 20:30
