Skip to content

[MRG+1] Improves class design for AgglomerativeClustering and FeatureAgglomeration #9875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Oct 24, 2017

Conversation

thechargedneutron
Copy link
Contributor

Reference Issue

Fixes #9846

What does this implement/fix? Explain your changes.

Improves the class design of AgglomerativeClustering and FeatureAgglomeration

Any other comments?

Changes will be made if not green.

@thechargedneutron thechargedneutron changed the title [WIP] Improve class design for AgglomerativeClustering and FeatureAgglomeration [WIP] Improves class design for AgglomerativeClustering and FeatureAgglomeration Oct 5, 2017
@thechargedneutron
Copy link
Contributor Author

Need help in what needs to be done to pass the tests in test_non_meta_estimators.

@agramfort
Copy link
Member

"Improves the class design" is what sense? what is the design issue you identified?

@thechargedneutron
Copy link
Contributor Author

I am not sure if I am on right track but I tried using a _BaseClass, from which the two classes Agglomerative Clustering and Feature Extraction are derived and having pooling_func only in FeatureAgglomeration. I am not sure.

@thechargedneutron
Copy link
Contributor Author

@lesteve I guess this is what you were trying to convey. I have made a base class which is used by both AgglomerativeClustering and FeatureAgglomeration. The later one uses pooling_func, not the former.

from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import AgglomerativeClustering


centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)

model = AgglomerativeClustering(linkage='complete',
                                connectivity=None,
                                affinity = 'cosine',
                                pooling_func = "test_error",
                                n_clusters=3)
model.fit(X)

The above patch now raises an error:

TypeError: __init__() got an unexpected keyword argument 'pooling_func'

@jnothman
Copy link
Member

I think this is not quite right. You don't need a new base class. FeatureAgglomeration strictly extends from AgglomerativeClustering already, and that is fine. What is a problem is that pooling_func should not be available in AgglomerativeClustering only in descendants of AgglomerationTransform.

We want to remove pooling_func from AgglomerativeClustering and retain it in FeatureAgglomeration. But we cannot merely remove it from AgglomerativeClustering as this may, in an awkward case, break code. So we should deprecate it.

I see the current PR's implementation, however, as more wrong than right. FeatureAgglomeration should define a new init which stores pooling_func and calls AgglomerativeClustering.__init__ with the remaining params. It should also have its own docstring

@thechargedneutron
Copy link
Contributor Author

So, what I can understand from your concern is that, at present, I only need to add a deprecation warning which will raise a warning if a user tries to explicitly call pooling_func. The warning may be raised during fit (or init ??). The rest if fine for atleast two releases. Correct me if I am wrong.

@jnothman
Copy link
Member

jnothman commented Oct 17, 2017 via email

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mark the parameter as deprecated in the docstring.

@@ -678,6 +678,12 @@ def __init__(self, n_clusters=2, affinity="euclidean",
self.linkage = linkage
self.affinity = affinity
self.pooling_func = pooling_func
if self.pooling_func != np.mean:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should really be in fit, not in __init__ according to our conventions. And ideally we'd change the default to something like 'deprecated' so that we can warn when the user explicitly passed pooling_func=np.mean

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the recommended changes, check once

@@ -642,6 +642,7 @@ class AgglomerativeClustering(BaseEstimator, ClusterMixin):
all observations of the two sets.

pooling_func : callable, default=np.mean
deprecated, since AgglomerativeClustering do not use pooling_func
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just say "ignored. Deprecate to be removed in ..." And remove the rest as irrelevant

@thechargedneutron thechargedneutron changed the title [WIP] Improves class design for AgglomerativeClustering and FeatureAgglomeration [MRG] Improves class design for AgglomerativeClustering and FeatureAgglomeration Oct 18, 2017
@thechargedneutron
Copy link
Contributor Author

@jnothman Review please.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

This combines the values of agglomerated features into a single
value, and should accept an array of shape [M, N] and the keyword
argument ``axis=1``, and reduce it to an array of size [M].
ignored, Deprecate to be removed in 0.22.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*deprecated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at other places we use the "deprecated" sphinx directive. git grep '\.\. deprecated' will show you plenty of examples.

@jnothman jnothman changed the title [MRG] Improves class design for AgglomerativeClustering and FeatureAgglomeration [MRG+1] Improves class design for AgglomerativeClustering and FeatureAgglomeration Oct 19, 2017
@@ -694,6 +692,12 @@ def fit(self, X, y=None):
-------
self
"""
if self.pooling_func != 'deprecated':
warnings.warn('"pooling_func" as a parameter argument is '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it surprises me that this has test coverage. Please find out if there is a use of AgglomerativeClustering(pooling_func=...) in the codebase and remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a use of pooling_func. there's no test covering pooling_func.

memory=None,
connectivity=None, compute_full_tree='auto',
linkage='ward', pooling_func=np.mean):
self.n_clusters = n_clusters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And could you please use super(FeatureAgglomeration, self).__init__(n_clusters=..., ...) instead of setting things individually here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

@jnothman
Copy link
Member

jnothman commented Oct 19, 2017 via email

Copy link
Member

@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main problem is that with your changes as it is FeatureAgglomeration with default parameter has a warning:

from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import AgglomerativeClustering, FeatureAgglomeration


centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)

model = FeatureAgglomeration()
model.fit(X)

Some other comments too.

This combines the values of agglomerated features into a single
value, and should accept an array of shape [M, N] and the keyword
argument ``axis=1``, and reduce it to an array of size [M].
ignored, Deprecate to be removed in 0.22.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at other places we use the "deprecated" sphinx directive. git grep '\.\. deprecated' will show you plenty of examples.

connectivity=connectivity,
compute_full_tree=compute_full_tree,
linkage=linkage, affinity=affinity,
pooling_func=pooling_func)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not pass pooling_function to the base class otherwise you'll get a warning in AgglomerativeClustering.__init__. This should be fine:

super(FeatureAgglomeration, self).__init__(n_clusters=n_clusters,
    memory=memory,
    connectivity=connectivity,
    compute_full_tree=compute_full_tree,
    linkage=linkage, affinity=affinity)
self.pooling_func=pooling_func

warnings.warn('"pooling_func" as a parameter argument is '
'deprecated in version 0.20 and will be removed '
'in version 0.22. It is being removed since '
'AgglomerativeClustering do not directly use '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rewording:

Agglomerative 'pooling_func' parameter is not used. It has been deprecated in version 0.20
and will be removed in 0.22.

@thechargedneutron
Copy link
Contributor Author

@lesteve Added the changes. I am still not able to figure out the warnings raised by codecov as pointed out by @jnothman . Also, there's a PEP8 failure "line too long". Here:

super(FeatureAgglomeration, self).__init__(n_clusters=n_clusters,
                                                   memory=memory,
                                                   connectivity=connectivity,
                                                   compute_full_tree=compute_full_tree,
                                                   linkage=linkage, affinity=affinity)

Can I modify it in any way?

memory=None,
connectivity=None, compute_full_tree='auto',
linkage='ward', pooling_func=np.mean):
super(FeatureAgglomeration, self).__init__(n_clusters=n_clusters,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just put a newline after ( and indent 4 spaces

@@ -644,7 +644,11 @@ class AgglomerativeClustering(BaseEstimator, ClusterMixin):
pooling_func : callable, default=np.mean
This combines the values of agglomerated features into a single
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it does not. This should just say ignored. When @lesteve pointed to other uses of .. deprecated they may be cases where the param is deprecated, but still functional. This one is not.

This combines the values of agglomerated features into a single
value, and should accept an array of shape [M, N] and the keyword
argument ``axis=1``, and reduce it to an array of size [M].
ignored, Deprecated to be removed in 0.22.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not saying that you could or should not use .. deprecated but that leaving in an untrue description of behaviour helped no one.

@thechargedneutron
Copy link
Contributor Author

Is this what you were suggesting? @jnothman

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's alright.

@lesteve
Copy link
Member

lesteve commented Oct 24, 2017

I added a whats_new entry. I'll look at the generated doc and then merge if it looks fine.

@lesteve
Copy link
Member

lesteve commented Oct 24, 2017

Merging, thanks a lot!

@lesteve lesteve merged commit b4561f0 into scikit-learn:master Oct 24, 2017
@thechargedneutron thechargedneutron deleted the aggcluster branch October 25, 2017 10:47
donigian added a commit to donigian/scikit-learn that referenced this pull request Oct 28, 2017
…cs/donigian-update-contribution-guidelines

* 'master' of github.com:scikit-learn/scikit-learn: (23 commits)
  fixes scikit-learn#10031: fix attribute name and shape in documentation (scikit-learn#10033)
  [MRG+1] add changelog entry for fixed and merged PR scikit-learn#10005 issue scikit-learn#9633 (scikit-learn#10025)
  [MRG] Fix LogisticRegression see also should include LogisticRegressionCV(scikit-learn#9995) (scikit-learn#10022)
  [MRG + 1] Labels of clustering should start at 0 or -1 if noise (scikit-learn#10015)
  MAINT Remove redundancy in scikit-learn#9552 (scikit-learn#9573)
  [MRG+1] correct comparison in GaussianNB for 'priors' (scikit-learn#10005)
  [MRG + 1] ENH add check_inverse in FunctionTransformer (scikit-learn#9399)
  [MRG] FIX bug in nested set_params usage (scikit-learn#9999)
  [MRG+1] Fix LOF and Isolation benchmarks (scikit-learn#9798)
  [MRG + 1] Fix negative inputs checking in mean_squared_log_error (scikit-learn#9968)
  DOC Fix typo (scikit-learn#9996)
  DOC Fix typo: x axis -> y axis (scikit-learn#9985)
  improve example plot_forest_iris.py (scikit-learn#9989)
  [MRG+1] Deprecate pooling_func unused parameter in AgglomerativeClustering (scikit-learn#9875)
  DOC update news
  DOC Fix three typos in manifold documentation (scikit-learn#9990)
  DOC add missing dot in docstring
  DOC Add what's new for 0.19.1 (scikit-learn#9983)
  Improve readability of outlier detection example. (scikit-learn#9973)
  DOC: Fixed typo (scikit-learn#9977)
  ...
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants