[MRG+1] Affinity propagation edge cases (#9612) #9635

Merged
jnothman merged 26 commits into scikit-learn:master from jsamoocha:affinity_propagation_9612 on Sep 5, 2017

Conversation

@jsamoocha (Contributor) commented Aug 27, 2017

Reference Issue

Fixes #9612

What does this implement/fix? Explain your changes.

  1. AffinityPropagation.predict(X) would fail if the algorithm had not converged while fitting the model. It now returns the same noise label (-1) for every sample in X in that case.
  2. AffinityPropagation.fit() behavior was undefined and sometimes arbitrary (depending on the preference and/or damping values) when all training samples had equal mutual similarities. It now behaves consistently for these edge cases: one cluster is returned when preference < mutual similarities, and n_samples clusters when preference >= mutual similarities. A sketch of both behaviors follows.
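A minimal sketch of both behaviors (the data and preference values are illustrative; with the default euclidean affinity, similarities are negative squared distances, so the only pairwise similarity in the two-sample set below is -2):

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    # 1. Non-convergence: max_iter=1 is smaller than convergence_iter,
    #    so the fit can never be flagged as converged; predict() now
    #    returns -1 for every sample instead of failing.
    X = np.array([[0, 0], [0, 1], [1, 0]])
    af = AffinityPropagation(preference=-10, max_iter=1).fit(X)
    print(af.predict(X))  # [-1 -1 -1]

    # 2. Equal mutual similarities: the result depends only on preference.
    X_eq = np.array([[0, 0], [1, 1]])
    one = AffinityPropagation(preference=-10).fit(X_eq)  # preference < -2
    print(one.labels_)  # [0 0]: a single cluster
    two = AffinityPropagation(preference=0).fit(X_eq)    # preference >= -2
    print(two.labels_)  # [0 1]: one cluster per sample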

Jonatan Samoocha added 10 commits August 26, 2017 11:49
* Added test exposing non-convergence issues: as discussed in issue scikit-learn#9612, expecting cluster centers to be an empty array and labels to be unique for every sample.
* Addresses non-convergence issues: returns an empty list as cluster center indices to prevent adding a dimension in the fit() method, and returns unique labels for samples, making this consistent with (TBD) predict() behavior for non-convergence.
* Made predict() handle case of non-convergence while fitting: in this case, it will log a warning and return unique labels for every new sample.
@jnothman (Member) left a comment:

Great work, thanks!

Could you please test that/when a warning is raised, using assert_warns? (In this PR or elsewhere we should also use a ConvergenceWarning in the existing non-convergence case...) A sketch of such a test follows.
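A sketch of the requested test, assuming the ConvergenceWarning suggested in this review is adopted (sklearn.utils.testing.assert_warns returns the wrapped call's result, so the fitted model can be reused):

    import numpy as np
    from sklearn.cluster import AffinityPropagation
    from sklearn.exceptions import ConvergenceWarning
    from sklearn.utils.testing import assert_warns

    X = np.array([[0, 0], [0, 1], [1, 0]])
    # max_iter=1 cannot converge, so both fit() and predict() should
    # emit a warning of the expected category.
    af = assert_warns(ConvergenceWarning,
                      AffinityPropagation(preference=-10, max_iter=1).fit, X)
    assert_warns(ConvergenceWarning, af.predict, X)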

    if self.cluster_centers_.size > 0:
        return pairwise_distances_argmin(X, self.cluster_centers_)
    else:
        logger.warning("This model does not have any cluster centers "
jnothman (Member):

For good or bad, we use warnings.warn, not logging...

        # cluster equal to the single sample
        return (np.array([0]), np.array([0]), 0) if return_n_iter \
            else (np.array([0]), np.array([0]))
    elif equal_similarities_and_preferences(S, preference):
jnothman (Member):

Can't you just do if n_samples == 1 or equal... here?

    labels = np.empty((n_samples, 1))
    cluster_centers_indices = None
    labels.fill(np.nan)
    logger.warning("Affinity propagation did not converge, this model "
jnothman (Member):

Use a sklearn.exceptions.ConvergenceWarning
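A sketch of the suggested change, replacing the logger call with warnings.warn and the suggested category (the tail of the message, cut off in the quote above, is assumed):

    import warnings
    from sklearn.exceptions import ConvergenceWarning

    warnings.warn("Affinity propagation did not converge, this model "
                  "will not have any cluster centers.", ConvergenceWarning)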

    # It makes no sense to run the algorithm in this case, so return 1 or
    # n_samples clusters, depending on preferences
    if np.array(preference).flat[0] >= S.flat[1]:
        return (np.arange(n_samples), np.arange(n_samples), 0) \
jnothman (Member):

Does this case deserve a warning?

Jonatan Samoocha added 2 commits August 28, 2017 08:05

* Simplified code: the n_samples == 1 case does not need a separate return statement.
* Replaced logging warnings by warnings.warn(); added assertions for warnings in tests.

    def all_equal_similarities():
        # Fill "diagonal" of S with first similarity value in S
        S.flat[::S.shape[0] + 1] = S.flat[1]
jnothman (Member):

I don't think we should be modifying S...?
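A sketch of a mask-based alternative that leaves S untouched, in the direction the later commit "Using mask instead of modifying S" takes (details assumed; preference is expected to be a NumPy array here):

    import numpy as np

    def _equal_similarities_and_preferences(S, preference):
        def all_equal_preferences():
            return np.all(preference == preference.flat[0])

        def all_equal_similarities():
            # Boolean mask selecting only the off-diagonal entries of S,
            # so S itself is never written to.
            mask = np.ones(S.shape, dtype=bool)
            np.fill_diagonal(mask, False)
            return np.all(S[mask].flat == S[mask].flat[0])

        return all_equal_preferences() and all_equal_similarities()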

from ..base import BaseEstimator, ClusterMixin
from ..utils import as_float_array, check_array
from ..utils.validation import check_is_fitted
from ..metrics import euclidean_distances
from ..metrics import pairwise_distances_argmin


def equal_similarities_and_preferences(S, preference):
jnothman (Member):

prefix with _ to make clear this is not public API

warnings.warn("All samples have mutually equal similarities, "
"returning arbitrary cluster center(s).")
if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
return (np.arange(n_samples), np.arange(n_samples), 0) \
jnothman (Member):

Aesthetics:

        if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
            return ((np.arange(n_samples), np.arange(n_samples), 0)
                    if return_n_iter
                    else (np.arange(n_samples), np.arange(n_samples)))

    # n_samples clusters, depending on preferences
    warnings.warn("All samples have mutually equal similarities, "
                  "returning arbitrary cluster center(s).")
    if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
jnothman (Member):

I think we can convert preference to an array outside of this condition and avoid repeatedly casting it...
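A sketch of the suggested refactor, casting preference to an array once up front (the helper name and the exact return values for the single-cluster branch are hypothetical):

    import numpy as np

    def _trivial_clusters_and_labels(S, preference, return_n_iter=False):
        # Cast `preference` once, then reuse it, instead of calling
        # np.array(preference) inside every branch.
        n_samples = S.shape[0]
        preference = np.array(preference)
        if preference.flat[0] >= S.flat[n_samples - 1]:
            # every sample becomes its own cluster center
            centers, labels = np.arange(n_samples), np.arange(n_samples)
        else:
            # a single cluster containing all samples
            centers, labels = np.array([0]), np.zeros(n_samples, dtype=int)
        return (centers, labels, 0) if return_n_iter else (centers, labels)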

    if n_samples == 1 or equal_similarities_and_preferences(S, preference):
        # It makes no sense to run the algorithm in this case, so return 1 or
        # n_samples clusters, depending on preferences
        warnings.warn("All samples have mutually equal similarities, "
jnothman (Member):

, -> .

    # It makes no sense to run the algorithm in this case, so return 1 or
    # n_samples clusters, depending on preferences
    warnings.warn("All samples have mutually equal similarities, "
                  "returning arbitrary cluster center(s).")
jnothman (Member):

Should this be a ConvergenceWarning? I suppose UserWarning is fine.

@jsamoocha (Contributor, Author):

It is a UserWarning right now.

    # Force non-convergence by allowing only a single iteration
    af = AffinityPropagation(preference=-10, max_iter=1).fit(X)

    assert_array_equal(np.array([0, 1, 2]), af.predict(X))
jnothman (Member):

I'm afraid this doesn't make sense. We can't predict every sample as being in its own cluster in the inductive case, if that means there are new clusters at predict time that did not exist at fit time. The implication here is also that some of the predicted items are clustered with the training points in the same position, which does not make sense.

Options:

  • instead mark all points as not clustered, i.e. label -1 as in dbscan (sketched after this list)
  • raise an error in predict if there are no exemplars
  • make each training point an exemplar
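The first option is the one the PR ultimately adopts (see the commits below); a minimal sketch of such a predict(), with names taken from the snippets quoted above and the final warning wording assumed:

    import warnings
    import numpy as np
    from sklearn.metrics import pairwise_distances_argmin

    def predict(self, X):
        # No exemplars were found during fit: treat every sample as
        # noise (-1), as DBSCAN does, rather than inventing new
        # clusters at predict time.
        if self.cluster_centers_.shape[0] > 0:
            return pairwise_distances_argmin(X, self.cluster_centers_)
        warnings.warn("This model does not have any cluster centers "
                      "because affinity propagation did not converge. "
                      "Labeling every sample as '-1'.")
        return np.array([-1] * X.shape[0])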

@jsamoocha (Contributor, Author):

Good catch! That will prevent some confusion...

Jonatan Samoocha added 7 commits August 28, 2017 11:58

* Improved returned labels in case of no cluster centers: returning a unique label for every sample in X suggests that these were based on actual clusters. Since there are no clusters, it makes more sense to return a negative label for all samples, indicating there were no clusters.
@jnothman added the Bug label Aug 28, 2017
@@ -90,6 +106,23 @@ def affinity_propagation(S, preference=None, convergence_iter=15, max_iter=200,
    if damping < 0.5 or damping >= 1:
        raise ValueError('damping must be >= 0.5 and < 1')

    preference_array = np.array(preference)
jnothman (Member):

Any reason not to reuse the name preference? It's quite clear from how it's used (e.g. .flat) that it must be an array.

    else:
        warnings.warn("This model does not have any cluster centers "
                      "because affinity propagation did not converge. "
                      "Returning unique labels for the provided samples.")
jnothman (Member):

No longer the case.

Are you sure we don't just want to raise an error when a user tries to predict with such a model? Perhaps an array of -1 makes sense in fit_predict if we're going to do it here...?

@jsamoocha (Contributor, Author):

predict() unexpectedly raising an error was the initial reason this PR got started. An error during predict() would be appropriate if the caller provided incorrect data, IMHO. Since these models can potentially live for a long time (in my use case, they're trained infrequently and deserialized later with joblib.load()), I wouldn't want to make the caller of predict() responsible for dealing with potential crashes caused by issues during training. Returning -1 labels (and emitting the warning) seems a bit friendlier to the caller.

I'll address the incorrect warning message.
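For context, a sketch of the long-lived-model workflow described above (the file name and data are illustrative):

    import numpy as np
    import joblib
    from sklearn.cluster import AffinityPropagation

    X = np.array([[0, 0], [0, 1], [1, 0]])
    af = AffinityPropagation(preference=-10, max_iter=1).fit(X)  # not converged
    joblib.dump(af, 'af_model.joblib')

    # ...possibly much later, in a different process:
    model = joblib.load('af_model.joblib')
    print(model.predict(X))  # warns and returns [-1 -1 -1] instead of raising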

@codecov bot commented Aug 28, 2017

Codecov Report

Merging #9635 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #9635      +/-   ##
==========================================
+ Coverage   96.16%   96.16%   +<.01%     
==========================================
  Files         336      336              
  Lines       62102    62154      +52     
==========================================
+ Hits        59720    59772      +52     
  Misses       2382     2382
Impacted Files                                        Coverage Δ
sklearn/cluster/tests/test_affinity_propagation.py   100% <100%> (ø) ⬆️
sklearn/cluster/affinity_propagation_.py             98.34% <100%> (+0.3%) ⬆️

Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d13009...5fe3367.

@jnothman (Member):
But then don't you think that labels_ should also be -1?

@jsamoocha (Contributor, Author):
Definitely. I overlooked that part; busy with too many things at the same time :-/

@jnothman (Member) left a comment:

This is good, except that it deserves some documentation. Perhaps a Notes section in both class and function docstrings?
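One possible Notes section, paraphrasing the edge-case behavior from the PR description (the wording is a suggestion, and the trailing parameters of the signature are assumed):

    def affinity_propagation(S, preference=None, convergence_iter=15,
                             max_iter=200, damping=0.5, copy=True,
                             verbose=False, return_n_iter=False):
        """Perform Affinity Propagation Clustering of data.

        ...

        Notes
        -----
        When the algorithm does not converge, it returns an empty array
        as cluster_center_indices and -1 as the label for each training
        sample.

        When all training samples have equal similarities and equal
        preferences, the assignment of cluster centers and labels depends
        on the preference. If the preference is smaller than the
        similarities, a single cluster center and label 0 for every
        sample will be returned. Otherwise, every training sample becomes
        its own cluster center and is assigned a unique label.
        """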

@jnothman (Member):
LGTM, thanks! Another review?

@jnothman jnothman changed the title [MRG]Affinity propagation 9612 [MRG+1] Affinity propagation 9612 Aug 31, 2017
@jnothman jnothman changed the title [MRG+1] Affinity propagation 9612 [MRG+1] Affinity propagation edge cases (#9612) Aug 31, 2017
@agramfort (Member):
+1 for MRG

Please just update the what's new bug-fix section and let's merge.

@jsamoocha (Contributor, Author):
Not too sure what it means to "update ... bug fix section"; was any action expected from my side?

@jnothman (Member) commented Sep 5, 2017 via email

@jnothman jnothman merged commit fb64216 into scikit-learn:master Sep 5, 2017
@jnothman (Member) commented Sep 5, 2017:

Thanks!

@jsamoocha jsamoocha deleted the affinity_propagation_9612 branch September 5, 2017 10:41
massich pushed a commit to massich/scikit-learn that referenced this pull request Sep 15, 2017
…earn#9635)

* Added test exposing non-convergence issues

As discussed in issue scikit-learn#9612, expecting cluster centers to be an empty array and labels to be
unique for every sample.

* Addresses non-convergence issues

Returns empty list as cluster center indices to prevent adding a dimension in fit() method,
returns unique labels for samples making this consistent with (TBD) predict() behavior for
non-convergence.

* Made predict() handle case of non-convergence while fitting

In this case, it will log a warning and return unique labels for every new sample.

* Added helper function for detecting mutually equal similarities and preferences

* Tidied imports

* Immediately returning trivial clusters and labels in case of equal similarities and preferences

* Simplified code for preference(s) equality test

* Corrected for failing unit tests covering case of n_samples=1

* Corrected for PEP8 line too long

* Rewriting imports to comply with max 80-column lines

* Simplified code

n_samples == 1 case does not need a separate return statement.

* Replaced logging warnings by warnings.warn()

Added assertions for warnings in tests.

* Marking function as non-public

* Using mask instead of modifying S

* Improvement suggested by review comment

* Avoided casting preference to array twice

* Readability improvements

* Improved returned labels in case of no cluster centers

Returning a unique label for every sample in X suggests that these were based on actual clusters.
Since there are no clusters, it makes more sense to return a negative label for all samples,
indicating there were no clusters.

* PEP8 line too long

* Avoided creating separate variable for preference as array

* Corrected warning message

* Making labels consistent with predict() behavior in case of non-convergence

* Minor readability improvement

* Added detail to test comment about expected result

* Added documentation about edge cases

* Added documentation to 'what's new'
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…earn#9635)
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
…earn#9635)
Successfully merging this pull request may close these issues.

AffinityPropagation creates 3d array of cluster centers on rare occasions