
[MRG+1] Clustering algorithm - BIRCH #3802


Merged
merged 33 commits into scikit-learn:master on Dec 11, 2014

Conversation

MechCoder
Member

Fixes #2690

The design is similar to the Java implementation here: https://code.google.com/p/jbirch/
I am fairly sure it works (assuming the Java implementation is correct, of course), since I get the same clusters in both cases. I opened this as a proof of concept.

This example has been modified: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html

When the threshold is set to 3.0: [figure_1]

When the threshold is set to 1.0: [figure_2]

TODO: A LOT

  • Make it as fast as possible.
  • Support for sparse matrices.
  • Extensive testing.
  • Make the common tests pass; for now I have added it to dont_test ;)
  • Narrative documentation.

Awaiting some initial feedback!

@agramfort
Member

Support for new metrics is low priority. The idea of sums of squares as sufficient statistics is designed for the Euclidean norm.

Support for sparse matrices should be straightforward. I don't see why it would not work.
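
For context, a minimal sketch of why (N, LS, SS) are sufficient statistics under the Euclidean norm: the centroid and radius of a subcluster follow directly from them. The names here are illustrative, not the PR's code.

import numpy as np

def centroid_and_radius(n, linear_sum, squared_sum):
    # Centroid of the subcluster: LS / N.
    centroid = linear_sum / n
    # Mean squared distance to the centroid: SS / N - ||centroid||^2.
    sq_radius = squared_sum / n - np.dot(centroid, centroid)
    return centroid, np.sqrt(max(sq_radius, 0.0))

X = np.random.RandomState(0).rand(10, 3)
centroid, radius = centroid_and_radius(len(X), X.sum(axis=0), (X ** 2).sum())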


dist_idx = dist[[farthest_idx1, farthest_idx2]]

newsubcluster1 = CFSubcluster()
Member


Minor style remark: the name should be new_subcluster1 :).

@GaelVaroquaux
Member

Overall, I am impressed that you managed to put this together so fast.
This is great, and this is absolutely going in the right direction.
Congratulations!

Making it fast will probably require porting part of the code to Cython.
That probably calls for profiling before starting this endeavour.

@MechCoder
Member Author

Thanks. It is always nice to be appreciated.

@GaelVaroquaux
Member

As a general comment, before going low level to speed things up, I think that it would help to have a high-level reflection on the structure of the code:

There are two classes. Can these two classes be collapsed into one? Fewer levels of indirection make it easier to profile and optimize. It also makes for simpler Cython code.

What is the best way to store the data structures? We have here lists of objects that have attributes storing the interesting information. Can we have lists / arrays of this interesting information instead? (Unlikely, but worth thinking about.)

To think about that, I would make a diagram with all the data structures laid out, and what they contribute to the algorithm.
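
For reference, a rough sketch of the two data structures under discussion; the attribute names are assumptions based on this thread, not the PR's exact code.

class CFSubcluster:
    # Sufficient statistics for one subcluster; child_ points to a
    # lower CFNode if this subcluster has been split.
    def __init__(self):
        self.n_samples_ = 0
        self.linear_sum_ = None   # LS, a vector
        self.squared_sum_ = 0.0   # SS, a scalar
        self.child_ = None        # CFNode or None

class CFNode:
    # A node of the CF tree: a flat list of subclusters, split when
    # it grows past the branching factor.
    def __init__(self, branching_factor):
        self.branching_factor = branching_factor
        self.subclusters_ = []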

@jnothman
Member

  1. Yes, partial_fit (or partial_fit_predict?) support would be wonderful, but I'm not sure that the current solution of storing labels for every training instance makes sense in a large-scale clustering approach. A fit-predict paradigm could make more sense, or could be provided as an option (and this means you don't need to track sample_indices_, but add an O(n log n) predict time).
  2. The paper describes a global clustering step in which an arbitrary clustering algorithm is applied to the identified subclusters. It seems this is not presently included. Is it possible to provide a meta-estimator or some other kind of data reduction transformation that allows subcluster centroids to then be clustered, with those global-cluster labels returned from predict for the subcluster nearest each sample? (Note also that BIRCH is coming very close to some of the instance reduction techniques that have been mooted on the mailing list.)
  3. Does adding __slots__ to every CFNode/CFSubcluster improve its profile? (See the sketch after this list.)
  4. I have thought about how to represent this, and it is not really very numpy-friendly, in that I don't see a clean way for arrays to be used to represent the node space. Without tracking sample_indices_ (as above), it may be possible to have CFNode.subclusters_ as an array and hence vectorize the update step.
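
Regarding point 3, a micro-sketch of what __slots__ would look like on the CFSubcluster sketch above: it replaces the per-instance __dict__ with fixed storage, which can matter when very many subcluster objects are created (whether it actually helps here is exactly the open question).

class CFSubcluster:
    # __slots__ removes the per-instance __dict__, shrinking each object.
    __slots__ = ("n_samples_", "linear_sum_", "squared_sum_", "child_")

    def __init__(self):
        self.n_samples_ = 0
        self.linear_sum_ = None
        self.squared_sum_ = 0.0
        self.child_ = None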

@MechCoder
Member Author

@jnothman @agramfort Thanks for all your comments. I shall give a detailed response tomorrow.

@MechCoder
Member Author

I am responding to @jnothman's major comment (and in a way to @GaelVaroquaux's too).

Yes, I did not include the global clustering step, in which the leaf subclusters are grouped back together (according to an arbitrary clustering algorithm). Adding it would let us:

  1. Build the tree (and the global clusters) in the fit method, and map the samples to their closest global clusters in the predict method.
  2. Do away with the tracking of sample_indices_ and collapse CFSubcluster into 3 separate arrays.
  3. Give more meaning to a partial_fit method.

However, I am not sure how we are to accommodate this in the public API. Do we just give the user an option to set n_clusters and hardcode the final global clustering ourselves?

@MechCoder
Member Author

And I think, if we need just Euclidean metrics for now (per @agramfort's comment), we can also do away with computing ss.

EDIT:
It should not be too much work in any case; I can work that out once we come to a conclusion on how to handle the global clustering step.

@MechCoder
Member Author

@jnothman I have added the global clustering step. @agramfort and I discussed it in person; for now we have hardcoded it, allowing the user to provide n_clusters. If n_clusters is None, we skip the global clustering step.
I have also added a partial_fit method to avoid rebuilding the CF tree, and inherited from ClusterMixin for the fit_predict method.
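
A hedged sketch of what that global step amounts to (not the merged code; the helper name is made up): cluster the leaf subcluster centroids, then give each sample the global label of its nearest subcluster.

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances_argmin

def global_cluster_labels(X, subcluster_centers, n_clusters):
    # Cluster the (few) subcluster centroids with an arbitrary algorithm...
    subcluster_labels = AgglomerativeClustering(
        n_clusters=n_clusters).fit_predict(subcluster_centers)
    # ...then label each sample by its nearest subcluster's global label.
    nearest = pairwise_distances_argmin(X, subcluster_centers)
    return subcluster_labels[nearest]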

@MechCoder
Member Author

it may be possible to have CFNode.subclusters_ as an array

I would like it to stay as it is because of the child_ param. IMO, it is more readable now.

@jnothman
Member

Sounds good. I'll look through this later. Is it worth adding the option that n_clusters=arbitrary_clusterer?

@MechCoder
Member Author

We had thought about it, but then how do you account for the parameters of the arbitrary_clusterer?

@MechCoder MechCoder changed the title from "[WIP] Clustering algorithm - BIRCH (Proof of Concept)" to "[WIP] Clustering algorithm - BIRCH" on Oct 27, 2014
@GaelVaroquaux
Member

Sounds good. I'll look through this later. Is it worth adding the option that n_clusters=arbitrary_clusterer?

👍 good idea.

@GaelVaroquaux
Member

We had thought about it, but then how do you account for the parameters of the arbitrary_clusterer?

You give an instance, not an object.

@MechCoder
Member Author

You give an instance, not an object.

I see, so I guess we are in the make it fast stage? ;)

@GaelVaroquaux
Member

I see, so I guess we are in the make it fast stage? ;)

No. This is good design. (I meant "not a class", by the way, in my previous post.)
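
The design converged on here, then, is that n_clusters can be an int (hardcode the final global clustering), None (skip it), or a clusterer instance. A usage sketch; the parameter values are illustrative, but this matches the Birch API as eventually merged.

from sklearn.cluster import AgglomerativeClustering, Birch

Birch(threshold=0.5, n_clusters=3)     # hardcoded global clustering step
Birch(threshold=0.5, n_clusters=None)  # skip the global clustering step
Birch(threshold=0.5, n_clusters=AgglomerativeClustering(n_clusters=3))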

@MechCoder
Member Author

Yes, of course. I was asking if the next step would be to profile the code, after I make that change.

@MechCoder
Member Author

We need some work to speed it up. This example takes 1.24 s with n_clusters=3, compared to 0.02 s with MiniBatchKMeans.

@MechCoder
Member Author

Though it is much faster than AgglomerativeClustering with the default parameters (which takes 6 s). @GaelVaroquaux, is this expected?

@agramfort
Member

Yes. AgglomerativeClustering is quadratic in the sample size (unless you use connectivity).
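
A sketch of how such a comparison can be timed; the dataset and sizes are illustrative, not the exact example from the thread.

from time import time
from sklearn.cluster import AgglomerativeClustering, Birch, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=3000, centers=3, random_state=0)
for model in (Birch(n_clusters=3),
              MiniBatchKMeans(n_clusters=3),
              AgglomerativeClustering(n_clusters=3)):
    t0 = time()
    model.fit(X)
    print("%s: %.3fs" % (type(model).__name__, time() - t0))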

if is_fitted and X.shape[1] != self.subcluster_centers_.shape[1]:
    raise ValueError(
        "Training data and predicted data do "
        "not have same no. of features.")
Member


"no." -> "number".

@jnothman
Member

jnothman commented Dec 9, 2014

Once those minor things are addressed, I expect this will have my +1.

@MechCoder
Member Author

@jnothman I have addressed all your comments in the last commit. I have also added your name to the authors file, for your help in reviewing this.

1. Made radius a property.
2. Added test for compute_label.
3. Minor changes to doc and split_subcluster.
4. n_cluster -> clusterer.
@MechCoder MechCoder force-pushed the birch branch 3 times, most recently from d2ad2c7 to 36151f6 on December 9, 2014 15:21
@MechCoder
Member Author

It was a problem with random_state; I moved the test to common_tests.

@MechCoder
Member Author

@jnothman I can haz merge?

@jnothman
Member

I think so! Well done!

agramfort added a commit that referenced this pull request Dec 11, 2014
[MRG+1] Clustering algorithm - BIRCH
@agramfort agramfort merged commit 6dab7c5 into scikit-learn:master Dec 11, 2014
@MechCoder MechCoder deleted the birch branch December 11, 2014 22:11
@agramfort
Member

Congrats!

@MechCoder
Member Author

@jnothman @agramfort Thanks for bearing with my impatience. Now it is just the simple matter of rewriting it in Cython!

@jnothman
Member

Now it is just the simple matter of rewriting it in Cython!

Lol. No, the tricky part is writing as little of it as possible in Python
while keeping it clean and fast.


@GaelVaroquaux
Member

Congrats!

Hurray!

Successfully merging this pull request may close these issues.

Feature request: Additional Clustering Algorithms: BIRCH