Skip to content

[MRG+1] Use sparse cluster contingency matrix by default #7419

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
7 changes: 7 additions & 0 deletions doc/whats_new.rst
Original file line number Diff line number Diff line change
Expand Up @@ -330,6 +330,11 @@ Enhancements
(`#7248 <https://github.com/scikit-learn/scikit-learn/pull/7248>_`)
By `Andreas Müller`_.

- Support sparse contingency matrices in cluster evaluation
(:mod:`metrics.cluster.supervised`) and use these by default.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"uses these by default" is not entirely true, right? Only in the internals, but no return value changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

(`#7419 <https://github.com/scikit-learn/scikit-learn/pull/7419>_`)
By `Gregory Stupp`_ and `Joel Nothman`_.

Bug fixes
.........

Expand Down Expand Up @@ -4543,3 +4548,5 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Sebastián Vanrell: https://github.com/srvanrell

.. _Robert McGibbon: https://github.com/rmcgibbo

.. _Gregory Stupp: https://github.com/stuppie
4 changes: 2 additions & 2 deletions sklearn/metrics/cluster/expected_mutual_info_fast.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,8 @@ def expected_mutual_information(contingency, int n_samples):
#cdef np.ndarray[int, ndim=2] start, end
R, C = contingency.shape
N = <DOUBLE>n_samples
a = np.sum(contingency, axis=1).astype(np.int32)
b = np.sum(contingency, axis=0).astype(np.int32)
a = np.ravel(contingency.sum(axis=1).astype(np.int32))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the ravel needed?

Copy link
Member

@ogrisel ogrisel Sep 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it can now be a sparse matrix and the sum over the rows of a sparse matrix is a sparse matrix with a single column instead of a 1D array.

b = np.ravel(contingency.sum(axis=0).astype(np.int32))
# There are three major terms to the EMI equation, which are multiplied to
# and then summed over varying nij values.
# While nijs[0] will never be used, having it simplifies the indexing.
Expand Down
Loading