ENH Speedup confusion_matrix #9843
Conversation
this looks worthwhile. Efficient mapping to ints is something pandas does well... Sometimes I wish we could exploit that.
In terms of check_targets, I've wondered about ways to speed up type_of_target to make it linear at worst, and maybe even approximate from a sample when trying to identify the number of labels (risky).
A function that behaved like np.unique but special-cased binary data might already be a big improvement.
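A minimal sketch of what such a helper could look like (not part of this PR; the function name and the binary fast path are hypothetical):

import numpy as np

def _unique_labels_fast(y):
    # Hypothetical helper: special-case binary integer data before falling
    # back to np.unique, avoiding the full sort that np.unique performs.
    y = np.asarray(y)
    if y.dtype.kind in 'iub' and y.size:
        y_min, y_max = y.min(), y.max()
        if y_min >= 0 and y_max <= 1:
            # Binary fast path: the labels present are exactly min..max.
            return np.arange(y_min, y_max + 1)
    return np.unique(y)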
I'm not sure exactly how pandas does its mapping. Here is a script I used to get timings: https://gist.github.com/Erotemic/f4d6005f0c1a472997426fae3969840f

I tested over the following grid:

basis1 = {
    'n_input': [10, 1000, 10000, 200000],
    'n_classes': [2, 10, 100, 1000],
    'dtype': ['uint8', 'int64'],
}
basis2 = {
    'labelkw': [None, 'int-consec', 'int-nonconsec', 'str'],
    'weightkw': (None, 'float'),
}

And here are the results: We can see > 10x speedups in certain cases when int-consec is satisfied. The greatest speedup is achieved when the number of classes is relatively small and the number of inputs is large. When the number of classes increases, the speedup tends to decrease back towards a 1:1 ratio. While integer inputs with specified labels achieve the greatest speedup, the common case where y_pred and y_true are given in normal integer format without specifying labels also achieves a 2-4x speedup. When the check conditions are not satisfied, or the arrays are small, the slowdown is never worse than a factor of ~0.9, a large portion of which can be attributed to time-measurement inaccuracies.
I think a key idea in Pandas is to abstract the mapping behind a polymorphic OO interface, such that if the data is numeric and canonical, it can handle that in one way, and if strings, it can handle that another way. And then to use low-level implementations wherever it helps.
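As a point of comparison (not part of this PR, and not how pandas internals actually work), pandas exposes this kind of mapping through pd.factorize, which converts arbitrary label values to consecutive integer codes; a rough sketch of how it could be used for the label-to-index step:

import numpy as np
import pandas as pd

y_true = np.array(['cat', 'dog', 'dog', 'bird'])
y_pred = np.array(['cat', 'cat', 'dog', 'bird'])

# Factorize both arrays together so they share one label -> code mapping;
# sort=True makes the codes follow sorted label order, like np.unique.
codes, uniques = pd.factorize(np.concatenate([y_true, y_pred]), sort=True)
true_idx, pred_idx = codes[:len(y_true)], codes[len(y_true):]
# true_idx and pred_idx are now consecutive integers indexing into `uniques`,
# ready for a coo_matrix-style accumulation.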
It would be interesting to dig into their implementation to see how they do it, but unfortunately I'm not going to be able to allocate any time for that. However, I think this PR is pretty much complete. The speedup seems to apply to many common cases and the overhead for the other cases is minimal. There may be a small additional speedup to be gained by optionally disabling checks, but this is good enough for now. I've added a what's new blurb, and I don't think there is any new documentation needed. I noticed that other what's new entries referenced an issue. Should I make an issue for this PR to address (/should I reference the PR number)?
I realise it's unclear because of the |
Other than wondering if we could/should share this kind of optimisation with other code, this LGTM.
sklearn/metrics/classification.py

sample_weight = sample_weight[ind]
# If labels are not consecutive integers starting from zero, then
# yt, yp must be converted into index form
need_index_conversion = not ( |
Surely np.all(labels == np.arange(len(labels)))
is just as fast for a reasonable number of classes and much more readable?
It is both faster and more readable. When I originally wrote this I didn't consider that the ordering of the labels was significant, so the np.diff was a quick fix after realizing this. Your solution is simpler and better.
In [1]: labels = np.arange(100)
In [2]: %timeit labels.min() == 0 and np.all(np.diff(labels) == 1)
8.3 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit np.all(np.arange(len(labels)) == labels)
3.6 µs ± 47.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Ah, that is confusing. I added the PR number.
Quickly browsing through your extensive benchmarks (thanks a lot for these!) it seems to me that when this PR improves things, v1 (which I am guessing is master)

Very naively I just think: why do we care? Maybe you can elaborate on when you would want to compute hundreds of confusion_matrix as you mention in your first post? Is confusion_matrix really the bottleneck in your use case (I would think fitting and predicting are more likely to be)?
So, the speed increase does seem to be linear in the number of items being classified. It is true the speedup is overshadowed when the number of classes is large, but in modern deep learning tasks the number of items greatly outnumbers the number of classes. I noticed the slowness of this function when running evaluation scripts on a large number of pre-classified images (for a semantic segmentation task). The learning and prediction were already precomputed. In the larger scheme of things the learning and prediction are the bottleneck, but when simply running the evaluation scripts, the confusion matrix was the bottleneck. Because the amount of data is extremely large, making this change reduced the running time of my scripts from minutes to seconds. I think this instance justifies this sort of optimization.
Your use case seems completely reasonable, thanks for the details!
Hi @Erotemic many thanks for your patience! If you are still interested in working on this, do you mind synchronizing with upstream? This will help in giving more visibility again to your PR. Thanks!
Wow, gotta brush the dust off this one. But yeah, I can do that.
Force-pushed from dfc9332 to 2372c75
@cmarmo I rebased this on master.
Force-pushed from 2372c75 to 217a6ce
@Erotemic thanks! I believe that an entry in the what's new file was also planned in the PR description? scikit-learn is heading for 1.0 now! :)
@cmarmo I re-added it. Let me know if I put it in the wrong place / got the formatting wrong.
LGTM. Only a couple of nitpicks.
# convert yt, yp into index
y_pred = np.array([label_to_ind.get(x, n_labels + 1) for x in y_pred])
y_true = np.array([label_to_ind.get(x, n_labels + 1) for x in y_true])
# If labels are not consecutive integers starting from zero, then |
Since we are improving confusion_matrix, could you use sklearn.utils.validation.check_sample_weight in the line above (312-315)?
Hmm, it looks like check_sample_weight might hurt efficiency in the case where sample_weight is None. Using that code will force us to use the same dtype in both (1) the case where sample_weight is None and (2) the case where sample_weight is specified. And if we are forced to specify a dtype, then it would have to be float64. At least IMO I'd prefer the ability to have an integral confusion matrix, especially in the case where the samples are not weighted.
I haven't benchmarked yet, but adding this might work against the purpose of this PR. (In general my opinion is that sklearn does far too many of these checks in a way that drastically reduces performance and does not offer a nice way to disable them.)
It's possible that just forcing float64 isn't a terrible thing to do, but I'll need to benchmark it. As modifying this line is not a clear improvement, I would prefer if we could merge this (it's been >3 years) and look at this issue in a separate PR.
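For illustration (my own sketch, not code from the PR), the concern is that the dtype of the default weights determines whether the confusion matrix comes out integral or floating point, assuming the validator falls back to float64 weights:

import numpy as np
from scipy.sparse import coo_matrix

y_true = np.array([0, 1, 1, 2])
y_pred = np.array([0, 1, 2, 2])

# Current approach: integer default weights keep the counts integral.
w_int = np.ones(y_true.shape[0], dtype=np.int64)
cm_int = coo_matrix((w_int, (y_true, y_pred)), shape=(3, 3), dtype=np.int64).toarray()

# float64 default weights (what a float-only validator would hand back)
# produce a floating point confusion matrix for the same data.
w_float = np.ones(y_true.shape[0], dtype=np.float64)
cm_float = coo_matrix((w_float, (y_true, y_pred)), shape=(3, 3), dtype=np.float64).toarray()

print(cm_int.dtype, cm_float.dtype)  # int64 float64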
@glemaitre So this change might have merit. I was worried that using float64 would impact how long it takes to build the coo_matrix, but it doesn't seem to:
import ubelt as ub
import timerit
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.utils.validation import _check_sample_weight

results = []
ns = np.logspace(1, 6, 100).astype(np.int)
for n in ub.ProgIter(ns, desc='time-tradeoff', verbose=3):
    print('n = {!r}'.format(n))
    n_labels = 100
    y_true = np.random.randint(0, n_labels, n).astype(np.int64)
    y_pred = np.random.randint(0, n_labels, n).astype(np.int64)
    ti = timerit.Timerit(9, bestof=3, verbose=2)

    # Old default: integer ones built by hand
    sample_weight = np.ones(y_true.shape[0], dtype=np.int64)
    for timer in ti.reset('use-old-uint8-sample-weight-default'):
        with timer:
            if sample_weight.dtype.kind in {'i', 'u', 'b'}:
                dtype = np.int64
            else:
                dtype = np.float64
            cm = coo_matrix((sample_weight, (y_true, y_pred)),
                            shape=(n_labels, n_labels), dtype=dtype,
                            ).toarray()
    results.append({
        'n': n,
        'label': ti.label,
        'time': ti.mean(),
    })

    # New default: weights produced by _check_sample_weight
    sample_weight = _check_sample_weight(None, y_true, dtype=np.int64)
    for timer in ti.reset('use-new-float64-sample-weight-default'):
        with timer:
            if sample_weight.dtype.kind in {'i', 'u', 'b'}:
                dtype = np.int64
            else:
                dtype = np.float64
            cm = coo_matrix((sample_weight, (y_true, y_pred)),
                            shape=(n_labels, n_labels), dtype=dtype,
                            ).toarray()
    results.append({
        'n': n,
        'label': ti.label,
        'time': ti.mean(),
    })
import pandas as pd
df = pd.DataFrame(results)
import kwplot
import seaborn as sns
kwplot.autoplt()
sns.set()
ax = sns.lineplot(data=df, x='n', y='time', hue='label')
ax.set_yscale('log')
ax.set_xscale('log')
However, the amount of time it takes to actually call _check_sample_weight is much greater than with the code as it currently is:
import ubelt as ub
from sklearn.utils.validation import _check_sample_weight
import numpy as np

results = []
ns = np.logspace(1, 6, 100).astype(np.int)
for n in ub.ProgIter(ns, desc='time-tradeoff', verbose=3):
    print('n = {!r}'.format(n))
    y_true = np.random.randint(0, 100, n).astype(np.int64)
    y_pred = np.random.randint(0, 100, n).astype(np.int64)
    sample_weight = np.random.rand(n)

    import timerit
    ti = timerit.Timerit(9, bestof=3, verbose=2)

    for timer in ti.reset('old-sample-weight-given'):
        with timer:
            np.asarray(sample_weight)
    results.append({
        'n': n,
        'label': ti.label,
        'time': ti.mean(),
    })

    for timer in ti.reset('new-sample-weight-given'):
        with timer:
            _check_sample_weight(sample_weight, y_true, dtype=np.int64)
    results.append({
        'n': n,
        'label': ti.label,
        'time': ti.mean(),
    })

    for timer in ti.reset('old-sample-weight-default'):
        with timer:
            np.ones(y_true.shape[0], dtype=np.int64)
    results.append({
        'n': n,
        'label': ti.label,
        'time': ti.mean(),
    })

    for timer in ti.reset('new-sample-weight-default'):
        with timer:
            _check_sample_weight(None, y_true, dtype=np.int64)
    results.append({
        'n': n,
        'label': ti.label,
        'time': ti.mean(),
    })
import pandas as pd
df = pd.DataFrame(results)
import kwplot
import seaborn as sns
kwplot.autoplt()
sns.set()
ax = sns.lineplot(data=df, x='n', y='time', hue='label')
ax.set_yscale('log')
ax.set_xscale('log')
But they do seem to converge towards each other as the number of items grows large. Perhaps it's reasonable to make this replacement.
I just ran all my benchmarks using _check_sample_weight and the speed isn't much different, so it makes sense to make this change for readability and API consistency. I'll patch that in.
I made a mistake; it is noticeably different, but it's not that significant. The old way averages out at about 0.1% of the function's computation and the new way is 3x slower, taking about 0.3%. However, this part of the computation doesn't grow as fast as other parts of the code, so it will never be a bottleneck, and I'm still for replacing the 4 lines with 1 function call.
It would be nice in the future if _check_sample_weight could return a weight array with dtype uint8. While it may not have much speed benefit, it would be more memory efficient.
Ok, I was wrong again. It is causing a noticeable slowdown, to the point where I'm not ok with it anymore. In my previous test I ran one variant right after the other, so the comparison never exercised all of its cases.
With the original way we have:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
82 432 1031.0 2.4 0.1 if sample_weight is None:
83 216 3294.0 15.2 0.2 sample_weight = np.ones(y_true.shape[0], dtype=np.int64)
84 else:
85 216 843.0 3.9 0.0 sample_weight = np.asarray(sample_weight)
But with the new way we have:
201 432 40067.0 92.7 3.9 sample_weight = _check_sample_weight(sample_weight, y_true, dtype=np.uint8)
That's going from 0.3% to 3.9%. That's too big of a jump for my taste, I'm going to revert to the previous method.
Are you sure this has any meaningful impact in practice?
Looking at the code of _check_sample_weight, it should be efficient enough. Try without the profiler enabled, on a call to confusion_matrix on a dataset that is large enough for performance to matter (e.g. at least one second).
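The ~/tmp/time_confusion_matrix.py script referenced below is not shown in this thread; a minimal sketch of what such a timing script could look like (the dataset size and repeat count here are made up for illustration):

import numpy as np
from time import perf_counter
from sklearn.metrics import confusion_matrix

# Large random multiclass problem so confusion_matrix itself dominates.
rng = np.random.RandomState(0)
n, n_labels = 10_000_000, 100
y_true = rng.randint(0, n_labels, n)
y_pred = rng.randint(0, n_labels, n)

times = []
for _ in range(5):
    t0 = perf_counter()
    confusion_matrix(y_true, y_pred, labels=np.arange(n_labels))
    times.append(perf_counter() - t0)
print('%.3f +/- %.3f s' % (np.mean(times), np.std(times)))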
Here are some results on my machine:
(dev) ogrisel@mba scikit-learn % git diff
diff --git a/sklearn/metrics/_classification.py b/sklearn/metrics/_classification.py
index b4ab145d8..35c5aab5f 100644
--- a/sklearn/metrics/_classification.py
+++ b/sklearn/metrics/_classification.py
@@ -34,6 +34,7 @@ from ..utils import assert_all_finite
from ..utils import check_array
from ..utils import check_consistent_length
from ..utils import column_or_1d
+from ..utils.validation import _check_sample_weight
from ..utils.multiclass import unique_labels
from ..utils.multiclass import type_of_target
from ..utils.validation import _num_samples
@@ -312,12 +313,9 @@ def confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None,
elif len(np.intersect1d(y_true, labels)) == 0:
raise ValueError("At least one label specified must be in y_true")
- if sample_weight is None:
- sample_weight = np.ones(y_true.shape[0], dtype=np.int64)
- else:
- sample_weight = np.asarray(sample_weight)
-
- check_consistent_length(y_true, y_pred, sample_weight)
+ check_consistent_length(y_true, y_pred)
+ sample_weight = _check_sample_weight(sample_weight, y_true,
+ dtype=np.float64)
if normalize not in ['true', 'pred', 'all', None]:
raise ValueError("normalize must be one of {'true', 'pred', "
(dev) ogrisel@mba scikit-learn % python ~/tmp/time_confusion_matrix.py
1.399 +/- 0.005 s
(dev) ogrisel@mba scikit-learn % git stash
Saved working directory and index state WIP on speedup_confusion_matrix: 4b8e1e864 Use faster intersect1d instead of list comprehension
(dev) ogrisel@mba scikit-learn % python ~/tmp/time_confusion_matrix.py
1.408 +/- 0.005 s
So no significant difference. It's in the noise.
Actually, thinking about it, it's fine to keep the code the way it is.
If no sample_weight is passed we would like to have int64 to allow for precise integer outputs.
If the user is passing float sample_weight we do not want to cast them to integer, because that would be wrong (and the output will be floating point, but that's fine).
Using _check_sample_weight would break one or the other of the above requirements. I therefore think the current code is fine. We could probably add a test to check this behavior but this is out of the scope of this PR.
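A sketch of what such a test might look like (not part of the PR; the test name is hypothetical and it assumes the dtype behavior described above):

import numpy as np
from sklearn.metrics import confusion_matrix

def test_confusion_matrix_dtype():
    y = [0, 1, 1]
    # No sample_weight: counts should stay integral.
    cm = confusion_matrix(y, y)
    assert cm.dtype == np.int64
    # Float weights: the output should be floating point, not truncated.
    cm = confusion_matrix(y, y, sample_weight=[0.5, 0.5, 0.5])
    assert cm.dtype == np.float64
    assert cm[1, 1] == 1.0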
Given how long this PR has been open, let's see if we can work toward merging it.
y_true = y_true[ind]
# also eliminate weights of eliminated items
sample_weight = sample_weight[ind]
if not np.all(ind): |
Does including this greatly improve performance?
I think we can revert this change and only have needs_index_conversion so we can get this merged quicker.
I remember it being significant when I originally wrote this, which was a while ago. It will depend on the use case, but I believe it does give a noticeable speedup to the fastest case with a minimal impact on the slow cases.
I would have to benchmark again to be sure.
So I was able to come up with a small benchmark that at least demonstrates the efficacy of the idea. In most cases the true and pred labels are all going to be "valid" -- i.e. between 0 and n_labels. In this case the question is: do we gain a benefit by checking whether we need to do the __getitem__ operation before we do it? Or is the __getitem__ fast enough that the check isn't saving us time?
Furthermore, we have the less common case where there are invalid labels. In that case, does doing this extra check (which by definition will fail) cost us much in overhead?
To test this I came up with this code, which simply times the all check and the __getitem__ operation.
import ubelt as ub
import numpy as np
import pandas as pd

results = []
ns = np.logspace(1, 7, 100).astype(np.int)
for n in ub.ProgIter(ns, desc='time-tradeoff', verbose=3):
    print('n = {!r}'.format(n))
    y_true = np.random.randint(0, 100, n).astype(np.int64)
    y_pred = np.random.randint(0, 100, n).astype(np.int64)
    sample_weight = np.random.rand(n)
    isvalid = np.random.rand(n) > 0

    import timerit
    ti = timerit.Timerit(9, bestof=3, verbose=2)

    for timer in ti.reset('all-check'):
        with timer:
            np.all(isvalid)
    results.append({
        'n': n,
        'label': ti.label,
        'time': ti.mean(),
    })

    for timer in ti.reset('all-index'):
        with timer:
            y_true[isvalid]
            y_pred[isvalid]
            sample_weight[isvalid]
    results.append({
        'n': n,
        'label': ti.label,
        'time': ti.mean(),
    })
df = pd.DataFrame(results)
import kwplot
import seaborn as sns
kwplot.autoplt()
sns.set()
ax = sns.lineplot(data=df, x='n', y='time', hue='label')
ax.set_yscale('log')
ax.set_xscale('log')
which produces
So in the case where there are more than 1000 elements (the case where we really start to care about speed!), we get an order of magnitude difference in how long each of those lines takes.
In the case where we must incur both costs, the check is again an order of magnitude faster than the actual __getitem__ call, so it doesn't add much overhead.
Now the question is, how much does it matter in the grand scheme of this function?
Well, running the benchmark suite we see on average this line takes about 0.8% of the total time (which in this case was 0.735286 s)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
270 # intersect y_pred, y_true with labels, eliminate items not in labels
271 324 4191.0 12.9 0.6 ind = np.logical_and(y_pred < n_labels, y_true < n_labels)
272 324 5657.0 17.5 0.8 if not np.all(ind):
273 y_pred = y_pred[ind]
274 y_true = y_true[ind]
275 # also eliminate weights of eliminated items
276 sample_weight = sample_weight[ind]
Now if we comment that portion of the code and reprofile we see:
270 # intersect y_pred, y_true with labels, eliminate items not in labels
271 324 4272.0 13.2 0.6 ind = np.logical_and(y_pred < n_labels, y_true < n_labels)
272 # if not np.all(ind):
273 324 2541.0 7.8 0.4 y_pred = y_pred[ind]
274 324 2061.0 6.4 0.3 y_true = y_true[ind]
275 # also eliminate weights of eliminated items
276 324 2182.0 6.7 0.3 sample_weight = sample_weight[ind]
which takes about 1.0% of the time on average. So there is benefit to having this check! But not much. However, that's only part of the story. The benchmark script is running over inputs of size 10, 1000, and 10000, and we know from above that this starts to make a difference when the input is larger. So how much?
Let's restrict input sizes to 10000 and 100000.
Now, with the check we see:
270 # intersect y_pred, y_true with labels, eliminate items not in labels
271 216 17120.0 79.3 0.4 ind = np.logical_and(y_pred < n_labels, y_true < n_labels)
272 216 5704.0 26.4 0.1 if not np.all(ind):
273 y_pred = y_pred[ind]
274 y_true = y_true[ind]
275 # also eliminate weights of eliminated items
276 sample_weight = sample_weight[ind]
and without we get
270 # intersect y_pred, y_true with labels, eliminate items not in labels
271 216 17222.0 79.7 0.4 ind = np.logical_and(y_pred < n_labels, y_true < n_labels)
272 # if not np.all(ind):
273 216 19397.0 89.8 0.4 y_pred = y_pred[ind]
274 216 18590.0 86.1 0.4 y_true = y_true[ind]
275 # also eliminate weights of eliminated items
276 216 22554.0 104.4 0.5 sample_weight = sample_weight[ind]
so that's 0.1% of the time versus 1.3% of the time. That is a very significant difference, and I believe it is enough to justify the existence of the check. Let me know what you think. @thomasjpfan
Honestly we don't really care about a 1% performance change of a function call (if that percentage stays constant with the dataset size). But as this specific code snippet is simple, I am fine with keeping it.
Hi @Erotemic will you be able to address the last review? Thanks!
@cmarmo I can do a deep dive and really address concerns about which components cause significant speedups and which don't. It might take me a week or two to get to that though. Let me know if that would be valuable.
@Erotemic, I believe benchmarks are always very appreciated and surely improve the quality of the contribution, but your PR has already waited for so long; I guess this was also @thomasjpfan's concern... If you are motivated to dive in and you think this is doable in one or two weeks ...
While doing the benchmarks I noticed that the check

elif np.all([l not in y_true for l in labels]):

took about 430 microseconds on average, versus:

elif len(np.intersect1d(y_true, labels)) == 0:

which took 153 microseconds. That is actually reasonably significant, as it reduces that line from 10% of the average computation to about 4.7%. It also makes sense from a theoretical perspective: the latter is O(N + D) whereas the former is O(N * D). It seems like a clear win to upgrade that.

Benchmark with the list comprehension:

Pystone time: 0.777921 s
File: /home/joncrall/misc/tests/python/bench_confusion_matrix.py
Function: confusion_matrix_2021_new at line 179
Line # Hits Time Per Hit % Time Line Contents
==============================================================
179 @xdev.profile
180 def confusion_matrix_2021_new(y_true, y_pred, *, labels=None,
181 sample_weight=None, normalize=None):
182 324 143350.0 442.4 18.4 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
183 324 740.0 2.3 0.1 if y_type not in ("binary", "multiclass"):
184 raise ValueError("%s is not supported" % y_type)
185
186 324 573.0 1.8 0.1 if labels is None:
187 108 59921.0 554.8 7.7 labels = unique_labels(y_true, y_pred)
188 else:
189 216 766.0 3.5 0.1 labels = np.asarray(labels)
190 216 427.0 2.0 0.1 n_labels = labels.size
191 216 403.0 1.9 0.1 if n_labels == 0:
192 raise ValueError("'labels' should contains at least one label.")
193 216 398.0 1.8 0.1 elif y_true.size == 0:
194 return np.zeros((n_labels, n_labels), dtype=int)
195 else:
196 # flag1 = len(np.intersect1d(y_true, labels)) == 0
197 216 88756.0 410.9 11.4 flag2 = np.all([l not in y_true for l in labels])
198 216 484.0 2.2 0.1 if flag2:
199 raise ValueError("At least one label specified must be in y_true")
200
201 324 634.0 2.0 0.1 if sample_weight is None:
202 162 2349.0 14.5 0.3 sample_weight = np.ones(y_true.shape[0], dtype=np.int64)
203 else:
204 162 598.0 3.7 0.1 sample_weight = np.asarray(sample_weight)
205
206 324 19000.0 58.6 2.4 check_consistent_length(y_true, y_pred, sample_weight)
207
208 324 772.0 2.4 0.1 if normalize not in ['true', 'pred', 'all', None]:
209 raise ValueError("normalize must be one of {'true', 'pred', "
210 "'all', None}")
211
212 324 641.0 2.0 0.1 n_labels = labels.size
213 # If labels are not consecutive integers starting from zero, then
214 # y_true and y_pred must be converted into index form
215 324 609.0 1.9 0.1 need_index_conversion = not (
216 840 1849.0 2.2 0.2 labels.dtype.kind in {'i', 'u', 'b'} and
217 324 7583.0 23.4 1.0 np.all(labels == np.arange(n_labels)) and
218 384 5423.0 14.1 0.7 y_true.min() >= 0 and y_pred.min() >= 0
219 )
220 324 546.0 1.7 0.1 if need_index_conversion:
221 132 1919.0 14.5 0.2 label_to_ind = {y: x for x, y in enumerate(labels)}
222 132 171946.0 1302.6 22.1 y_pred = np.array([label_to_ind.get(x, n_labels + 1) for x in y_pred])
223 132 169948.0 1287.5 21.8 y_true = np.array([label_to_ind.get(x, n_labels + 1) for x in y_true])
224
225 # intersect y_pred, y_true with labels, eliminate items not in labels
226 324 4107.0 12.7 0.5 ind = np.logical_and(y_pred < n_labels, y_true < n_labels)
227 324 5427.0 16.8 0.7 if not np.all(ind):
228 y_pred = y_pred[ind]
229 y_true = y_true[ind]
230 # also eliminate weights of eliminated items
231 sample_weight = sample_weight[ind]
232
233 # Choose the accumulator dtype to always have high precision
234 324 864.0 2.7 0.1 if sample_weight.dtype.kind in {'i', 'u', 'b'}:
235 162 320.0 2.0 0.0 dtype = np.int64
236 else:
237 162 334.0 2.1 0.0 dtype = np.float64
238
239 648 58290.0 90.0 7.5 cm = coo_matrix((sample_weight, (y_true, y_pred)),
240 324 574.0 1.8 0.1 shape=(n_labels, n_labels), dtype=dtype,
241 ).toarray()
242
243 324 6118.0 18.9 0.8 with np.errstate(all='ignore'):
244 324 678.0 2.1 0.1 if normalize == 'true':
245 cm = cm / cm.sum(axis=1, keepdims=True)
246 324 585.0 1.8 0.1 elif normalize == 'pred':
247 cm = cm / cm.sum(axis=0, keepdims=True)
248 324 557.0 1.7 0.1 elif normalize == 'all':
249 cm = cm / cm.sum()
250 324 19772.0 61.0 2.5 cm = np.nan_to_num(cm)
251
252 324 660.0 2.0 0.1 return cm

Versus the one with intersect:

Pystone time: 0.735806 s
File: /home/joncrall/misc/tests/python/bench_confusion_matrix.py
Function: confusion_matrix_2021_new at line 179
Line # Hits Time Per Hit % Time Line Contents
==============================================================
179 @xdev.profile
180 def confusion_matrix_2021_new(y_true, y_pred, *, labels=None,
181 sample_weight=None, normalize=None):
182 324 150859.0 465.6 20.5 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
183 324 773.0 2.4 0.1 if y_type not in ("binary", "multiclass"):
184 raise ValueError("%s is not supported" % y_type)
185
186 324 549.0 1.7 0.1 if labels is None:
187 108 62184.0 575.8 8.5 labels = unique_labels(y_true, y_pred)
188 else:
189 216 802.0 3.7 0.1 labels = np.asarray(labels)
190 216 437.0 2.0 0.1 n_labels = labels.size
191 216 363.0 1.7 0.0 if n_labels == 0:
192 raise ValueError("'labels' should contains at least one label.")
193 216 394.0 1.8 0.1 elif y_true.size == 0:
194 return np.zeros((n_labels, n_labels), dtype=int)
195 else:
196 216 34387.0 159.2 4.7 flag1 = len(np.intersect1d(y_true, labels)) == 0
197 # flag2 = np.all([l not in y_true for l in labels])
198 216 465.0 2.2 0.1 if flag1:
199 raise ValueError("At least one label specified must be in y_true")
200
201 324 621.0 1.9 0.1 if sample_weight is None:
202 162 2462.0 15.2 0.3 sample_weight = np.ones(y_true.shape[0], dtype=np.int64)
203 else:
204 162 641.0 4.0 0.1 sample_weight = np.asarray(sample_weight)
205
206 324 19952.0 61.6 2.7 check_consistent_length(y_true, y_pred, sample_weight)
207
208 324 792.0 2.4 0.1 if normalize not in ['true', 'pred', 'all', None]:
209 raise ValueError("normalize must be one of {'true', 'pred', "
210 "'all', None}")
211
212 324 633.0 2.0 0.1 n_labels = labels.size
213 # If labels are not consecutive integers starting from zero, then
214 # y_true and y_pred must be converted into index form
215 324 633.0 2.0 0.1 need_index_conversion = not (
216 846 1806.0 2.1 0.2 labels.dtype.kind in {'i', 'u', 'b'} and
217 324 8494.0 26.2 1.2 np.all(labels == np.arange(n_labels)) and
218 396 6068.0 15.3 0.8 y_true.min() >= 0 and y_pred.min() >= 0
219 )
220 324 514.0 1.6 0.1 if need_index_conversion:
221 126 2059.0 16.3 0.3 label_to_ind = {y: x for x, y in enumerate(labels)}
222 126 168675.0 1338.7 22.9 y_pred = np.array([label_to_ind.get(x, n_labels + 1) for x in y_pred])
223 126 165491.0 1313.4 22.5 y_true = np.array([label_to_ind.get(x, n_labels + 1) for x in y_true])
224
225 # intersect y_pred, y_true with labels, eliminate items not in labels
226 324 4633.0 14.3 0.6 ind = np.logical_and(y_pred < n_labels, y_true < n_labels)
227 324 6149.0 19.0 0.8 if not np.all(ind):
228 y_pred = y_pred[ind]
229 y_true = y_true[ind]
230 # also eliminate weights of eliminated items
231 sample_weight = sample_weight[ind]
232
233 # Choose the accumulator dtype to always have high precision
234 324 862.0 2.7 0.1 if sample_weight.dtype.kind in {'i', 'u', 'b'}:
235 162 323.0 2.0 0.0 dtype = np.int64
236 else:
237 162 338.0 2.1 0.0 dtype = np.float64
238
239 648 62685.0 96.7 8.5 cm = coo_matrix((sample_weight, (y_true, y_pred)),
240 324 555.0 1.7 0.1 shape=(n_labels, n_labels), dtype=dtype,
241 ).toarray()
242
243 324 6615.0 20.4 0.9 with np.errstate(all='ignore'):
244 324 667.0 2.1 0.1 if normalize == 'true':
245 cm = cm / cm.sum(axis=1, keepdims=True)
246 324 559.0 1.7 0.1 elif normalize == 'pred':
247 cm = cm / cm.sum(axis=0, keepdims=True)
248 324 533.0 1.6 0.1 elif normalize == 'all':
249 cm = cm / cm.sum()
250 324 21223.0 65.5 2.9 cm = np.nan_to_num(cm)
251
252 324 610.0 1.9 0.1 return cm

Hopefully it's not a big deal to make minor changes to this PR. Given that I'm putting in the time to do the analysis, I figure I might as well make simple improvements when I see them.
I'm keeping my benchmark script here: https://github.com/Erotemic/misc/blob/master/tests/python/bench_confusion_matrix.py
Averaging over all results I'm currently seeing these numbers. Let me know if you want detailed benchmarks; it's not much different from the original ones I posted. I tested the versions of this function from the original 2017 variant I started with, my 2017 patch, the 2021 master branch, and the 2021 variant of this patch.
With the new I also rebased on
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Force-pushed from d6dc056 to 4b8e1e8
LGTM.
@thomasjpfan I think all your comments have been addressed.
Thank you for working on this PR and your patience through the review process @Erotemic! The benchmarks you provided were very useful :)
LGTM
Thanks everyone!
This PR is a speedup to the confusion_matrix function in the case where labels is specified and already in index form. I added a check to bypass the expensive label-to-index conversion, which results in a nice 16x speedup.
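The essence of the check (a simplified sketch of the idea; the helper function name is mine and this is not the exact code in the PR):

import numpy as np

def _needs_index_conversion(labels, y_true, y_pred):
    # If labels are already the consecutive integers 0..n_labels-1 and both
    # y arrays are non-negative integers, they can be used directly as
    # indices into the confusion matrix, skipping the dict-based mapping.
    labels = np.asarray(labels)
    return not (
        labels.dtype.kind in {'i', 'u', 'b'}
        and np.all(labels == np.arange(labels.size))
        and y_true.min() >= 0 and y_pred.min() >= 0
    )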
I've labeled this as WIP because I've only tested the case where len(labels) = 12 and len(y_true) = len(y_pred) = 172800. I still need to do more comprehensive testing to ensure that I didn't slow down other cases for the sake of one particular case. I believe this will be an overall improvement, but I want to make sure.

The one benchmark I've done so far (on the aforementioned data) resulted in a reduction of compute time from 90.0ms to 6.0ms. This is a 15x speedup. When computing several hundred confusion matrices, this becomes quite significant.

The running time can be further improved to a 36x speedup if we added an extra flag (called enable_checks=True) to allow the user to disable the _check_targets and check_consistent_length calls when appropriate. I didn't add this to the PR by default because I thought there might be some pushback on adding an argument to a function signature. However, if a reviewer thinks this is ok, let me know and I'll add it to get some extra speed.

TODO