[MRG+1] Affinity propagation edge cases (#9612) #9635
Conversation
As discussed in issue scikit-learn#9612, the expectation is that cluster centers are an empty array and that every sample gets a unique label when the algorithm does not converge.
Returns an empty list as cluster center indices to avoid adding a dimension in the fit() method, and returns unique labels for the samples, making this consistent with the (TBD) predict() behavior for non-convergence.
In this case, it will log a warning and return unique labels for every new sample.
Added helper function for detecting mutually equal similarities and preferences
Great work, thanks!
Could you please test that/when a warning is raised with assert_warns? (In this PR or elsewhere we should also use a ConvergenceWarning in the existing non-convergence case...)
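The request above is to assert that the warning is actually raised in the tests. A minimal, self-contained sketch of that pattern using only the standard library (the `ConvergenceWarning` class and `fit_that_does_not_converge` function below are stand-ins, not the real scikit-learn objects):

```python
import warnings

class ConvergenceWarning(UserWarning):
    # Stand-in for sklearn.exceptions.ConvergenceWarning, so the
    # sketch stays self-contained.
    pass

def fit_that_does_not_converge():
    # Hypothetical routine that fails to converge and warns about it.
    warnings.warn("Affinity propagation did not converge.",
                  ConvergenceWarning)

# The essence of assert_warns: record warnings, then check the category.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fit_that_does_not_converge()

assert any(issubclass(w.category, ConvergenceWarning) for w in caught)
```

In the scikit-learn test suite itself this would be written with the project's `assert_warns` helper, which wraps the same record-and-check logic.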
if self.cluster_centers_.size > 0:
    return pairwise_distances_argmin(X, self.cluster_centers_)
else:
    logger.warning("This model does not have any cluster centers "
For good or bad, we use warnings.warn, not logging...
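One practical reason for preferring `warnings.warn` over logging, as suggested above, is that callers can silence or escalate warnings with the standard warnings machinery. A small sketch (the `check_converged` function is illustrative, not the actual PR code):

```python
import warnings

def check_converged(converged):
    # warnings.warn (rather than logging) lets callers control the
    # message with the standard warnings filters.
    if not converged:
        warnings.warn("This model does not have any cluster centers "
                      "because affinity propagation did not converge.")

# A caller can turn the warning into a hard error if they prefer:
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        check_converged(False)
        raised = False
    except UserWarning:
        raised = True

assert raised
```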
# cluster equal to the single sample
return (np.array([0]), np.array([0]), 0) if return_n_iter \
    else (np.array([0]), np.array([0]))
elif equal_similarities_and_preferences(S, preference):
Can't you just do if n_samples == 1 or equal... here?
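The suggestion is to fold the single-sample case into the same branch as the all-equal-similarities case, since both degenerate to a trivial clustering. A hedged sketch of that control flow with plain Python lists (the function name and list-based return values are illustrative; the real code returns NumPy arrays):

```python
# Sketch: one branch covers n_samples == 1 and the degenerate
# equal-similarities case alike.
def trivial_clustering(n_samples, return_n_iter=False):
    # One cluster containing every sample, centered on sample 0.
    cluster_centers_indices = [0]
    labels = [0] * n_samples
    if return_n_iter:
        # 0 iterations: the algorithm was never actually run.
        return cluster_centers_indices, labels, 0
    return cluster_centers_indices, labels

print(trivial_clustering(1, return_n_iter=True))  # ([0], [0], 0)
print(trivial_clustering(3))                      # ([0], [0, 0, 0])
```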
labels = np.empty((n_samples, 1))
cluster_centers_indices = None
labels.fill(np.nan)
logger.warning("Affinity propagation did not converge, this model "
Use a sklearn.exceptions.ConvergenceWarning
# It makes no sense to run the algorithm in this case, so return 1 or
# n_samples clusters, depending on preferences
if np.array(preference).flat[0] >= S.flat[1]:
    return (np.arange(n_samples), np.arange(n_samples), 0) \
Does this case deserve a warning?
n_samples == 1 case does not need a separate return statement.
Added assertions for warnings in tests.
def all_equal_similarities():
    # Fill "diagonal" of S with first similarity value in S
    S.flat[::S.shape[0] + 1] = S.flat[1]
I don't think we should be modifying S
...?
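The objection is that the quoted helper writes into the caller's similarity matrix `S` as a side effect. The fix adopted later in the PR ("Using mask instead of modifying S") checks the off-diagonal entries without mutating anything. A pure-Python sketch of that idea on nested lists (the real code uses a NumPy boolean mask):

```python
# Sketch: decide whether all off-diagonal similarities are equal
# without writing into S.
def all_equal_similarities(S):
    n = len(S)
    # Collect every entry except the diagonal (self-similarities).
    off_diagonal = [S[i][j] for i in range(n) for j in range(n) if i != j]
    # Every off-diagonal entry must match the first one.
    return all(v == off_diagonal[0] for v in off_diagonal)

S = [[0.0, -1.0, -1.0],
     [-1.0, 0.0, -1.0],
     [-1.0, -1.0, 0.0]]
print(all_equal_similarities(S))  # True
S[0][1] = -2.0
print(all_equal_similarities(S))  # False
```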
from ..base import BaseEstimator, ClusterMixin
from ..utils import as_float_array, check_array
from ..utils.validation import check_is_fitted
from ..metrics import euclidean_distances
from ..metrics import pairwise_distances_argmin


def equal_similarities_and_preferences(S, preference):
Prefix with _ to make clear this is not public API.
warnings.warn("All samples have mutually equal similarities, "
              "returning arbitrary cluster center(s).")
if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
    return (np.arange(n_samples), np.arange(n_samples), 0) \
Aesthetics:

if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
    return ((np.arange(n_samples), np.arange(n_samples), 0)
            if return_n_iter
            else (np.arange(n_samples), np.arange(n_samples)))
# n_samples clusters, depending on preferences
warnings.warn("All samples have mutually equal similarities, "
              "returning arbitrary cluster center(s).")
if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
I think we can convert preference to an array outside of this condition and avoid repeatedly casting it...
if n_samples == 1 or equal_similarities_and_preferences(S, preference):
    # It makes no sense to run the algorithm in this case, so return 1 or
    # n_samples clusters, depending on preferences
    warnings.warn("All samples have mutually equal similarities, "
, -> .
# It makes no sense to run the algorithm in this case, so return 1 or
# n_samples clusters, depending on preferences
warnings.warn("All samples have mutually equal similarities, "
              "returning arbitrary cluster center(s).")
Should this be a ConvergenceWarning? I suppose UserWarning is fine.
It is a UserWarning right now.
# Force non-convergence by allowing only a single iteration
af = AffinityPropagation(preference=-10, max_iter=1).fit(X)

assert_array_equal(np.array([0, 1, 2]), af.predict(X))
I'm afraid this doesn't make sense. We can't predict every sample as being in its own cluster in the inductive case, if that means there are more clusters at predict time than at fit time. The implication here is also that some of the predicted items are clustered with the training points in the same position, which does not make sense.
Options:
- instead mark all points as not clustered, i.e. label -1 as in dbscan
- raise an error in predict if there are no exemplars
- make each training point an exemplar
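The first option is the one the PR eventually adopts: when the fit did not converge there are no exemplars, so `predict` warns and labels every sample as noise (-1), mirroring DBSCAN's convention. A hedged sketch of that guard in plain Python (names and the list-based distance computation are illustrative; the real code uses `pairwise_distances_argmin`):

```python
import warnings

def predict_labels(cluster_centers, X):
    # No exemplars: the fit did not converge, so nothing can be
    # meaningfully assigned.  Warn and mark all samples as noise.
    if len(cluster_centers) == 0:
        warnings.warn("This model does not have any cluster centers "
                      "because affinity propagation did not converge. "
                      "Labeling every sample as '-1'.")
        return [-1] * len(X)

    # Normal case: nearest-center assignment, sketched with a plain
    # squared-distance argmin.
    def nearest(x):
        return min(range(len(cluster_centers)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(x, cluster_centers[i])))
    return [nearest(x) for x in X]

X = [[0.0, 0.0], [5.0, 5.0]]
print(predict_labels([], X))                        # [-1, -1]
print(predict_labels([[0.0, 0.0], [5.0, 5.0]], X))  # [0, 1]
```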
Good catch! That will prevent some confusion...
Returning a unique label for every sample in X suggests that these were based on actual clusters. Since there are no clusters, it makes more sense to return a negative label for all samples, indicating there were no clusters.
@@ -90,6 +106,23 @@ def affinity_propagation(S, preference=None, convergence_iter=15, max_iter=200,
    if damping < 0.5 or damping >= 1:
        raise ValueError('damping must be >= 0.5 and < 1')

    preference_array = np.array(preference)
Any reason not to reuse the name preference? It's quite clear from how it's used (e.g. .flat) that it must be an array.
else:
    warnings.warn("This model does not have any cluster centers "
                  "because affinity propagation did not converge. "
                  "Returning unique labels for the provided samples.")
No longer the case.
Are you sure we don't just want to raise an error when a user tries to predict with such a model? Perhaps an array of -1 makes sense in fit_predict if we're going to do it here..?
predict() unexpectedly raising an error was the initial reason this PR got started. An error during predict() would be appropriate if the caller provided incorrect data, IMHO. Since these models can potentially live for a long time (in my use case, they're trained infrequently and are deserialized later with joblib.load()), I wouldn't want to make the caller of predict() responsible for dealing with potential crashes caused by issues during training of the models. Returning -1 as labels (and issuing the warning) seems a bit more friendly to the caller.
I'll address the incorrect warning message.
Codecov Report
@@ Coverage Diff @@
## master #9635 +/- ##
==========================================
+ Coverage 96.16% 96.16% +<.01%
==========================================
Files 336 336
Lines 62102 62154 +52
==========================================
+ Hits 59720 59772 +52
Misses 2382 2382
Continue to review full report at Codecov.
But then don't you think that labels_ should also be -1?
Definitely. I overlooked that part - busy with too many things at the same time :-/
This is good, except that it deserves some documentation. Perhaps a Notes section in both class and function docstrings?
LGTM, thanks! Another review?
+1 for MRG, please just update what's new bug fix section and let's merge
Not too sure what it means to "update ... bug fix section", was any action expected from my side?
see doc/whats_new.rst
Thanks!
(scikit-learn#9635)
* Added test exposing non-convergence issues: as discussed in issue scikit-learn#9612, expecting cluster centers to be an empty array and labels to be unique for every sample.
* Addresses non-convergence issues: returns empty list as cluster center indices to prevent adding a dimension in fit() method, returns unique labels for samples making this consistent with (TBD) predict() behavior for non-convergence.
* Made predict() handle case of non-convergence while fitting: in this case, it will log a warning and return unique labels for every new sample.
* Added helper function for detecting mutually equal similarities and preferences
* Tidied imports
* Immediately returning trivial clusters and labels in case of equal similarities and preferences
* Simplified code for preference(s) equality test
* Corrected for failing unit tests covering case of n_samples=1
* Corrected for PEP8 line too long
* Rewriting imports to comply with max 80-column lines
* Simplified code: n_samples == 1 case does not need a separate return statement.
* Replaced logging warnings by warnings.warn(); added assertions for warnings in tests.
* Marking function as non-public
* Using mask instead of modifying S
* Improvement suggested by review comment
* Avoided casting preference to array twice
* Readability improvements
* Improved returned labels in case of no cluster centers: returning a unique label for every sample in X suggests that these were based on actual clusters. Since there are no clusters, it makes more sense to return a negative label for all samples, indicating there were no clusters.
* PEP8 line too long
* Avoided creating separate variable for preference as array
* Corrected warning message
* Making labels consistent with predict() behavior in case of non-convergence
* Minor readability improvement
* Added detail to test comment about expected result
* Added documentation about edge cases
* Added documentation to 'what's new'
Reference Issue
Fixes #9612
What does this implement/fix? Explain your changes.
AffinityPropagation.predict(X) would fail in case of non-convergence of the algorithm when fitting the model. It now returns the same label ('-1' for noise) for every sample in X in that case.
AffinityPropagation.fit() behavior was undefined and sometimes arbitrary (depending on preference and/or damping values) for cases when training samples had equal mutual similarities. It now behaves in a consistent way for these edge cases, i.e. returning one cluster when preference < mutual similarities, and returning n_samples clusters when preference >= mutual similarities.
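The edge-case rule described above can be sketched as a tiny decision function. This is a hedged illustration of the behavior the PR specifies for the equal-similarities case, not the actual scikit-learn implementation; names and list-based returns are illustrative:

```python
# Sketch: with mutually equal similarities, preference >= similarity
# yields n_samples singleton clusters; otherwise a single cluster.
def edge_case_clusters(n_samples, preference, similarity):
    if preference >= similarity:
        # Every sample is its own exemplar.
        return list(range(n_samples)), list(range(n_samples))
    # One cluster, arbitrarily centered on sample 0.
    return [0], [0] * n_samples

print(edge_case_clusters(3, preference=-10, similarity=-5))  # ([0], [0, 0, 0])
print(edge_case_clusters(3, preference=-5, similarity=-10))  # ([0, 1, 2], [0, 1, 2])
```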