[MRG+1] Clustering algorithm - BIRCH #3802
Conversation
Support for new metrics is low priority; the idea of sum of squares as sufficient statistics is designed for the Euclidean norm. Support for sparse matrices should be straightforward, and I don't see why it would not work.
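For context on why the sufficient-statistics trick is tied to the Euclidean norm, here is a minimal sketch (illustrative only, not the PR's code): BIRCH summarizes a subcluster as the triple (count, linear sum, sum of squares), and the Euclidean radius falls out of those three numbers alone.

```python
import numpy as np

# Illustrative sketch (not the PR's code): BIRCH's clustering-feature
# triple is (n, linear sum, sum of squares). The Euclidean radius of a
# subcluster can be recovered from these alone, without the raw points.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

n = X.shape[0]
ls = X.sum(axis=0)       # linear sum of the points
ss = (X ** 2).sum()      # sum of squared norms

centroid = ls / n
# root-mean-square distance to the centroid, from (n, ls, ss) only:
# E||x - mu||^2 = E||x||^2 - ||mu||^2
radius = np.sqrt(ss / n - np.dot(centroid, centroid))

# the same value computed directly from the points, for comparison
direct = np.sqrt(((X - X.mean(axis=0)) ** 2).sum(axis=1).mean())
```

This identity has no analogue for arbitrary metrics, which is why swapping in other distances is not a simple parameter change.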
    dist_idx = dist[[farthest_idx1, farthest_idx2]]

    newsubcluster1 = CFSubcluster()
Minor style remark: the name should be new_subcluster1 :).
Overall, I am impressed that you managed to put this together that fast. Making it fast will probably require porting part of the code to Cython.
Thanks. It is always nice to be appreciated.
As a general comment, before going low level to speed things up, I think it would help to have a high-level reflection on the structure of the code:

- There are two classes. Can these two classes be collapsed into one? Fewer levels of indirection make it easier to profile and optimize, and also make for simpler Cython code.
- What is the best way to store the data structures? We have here lists of objects, with attributes that store the interesting information. Can we have lists / arrays of this interesting info instead? (Unlikely, but worth thinking about.)

To think about that, I would make a diagram with all the data structures laid out and what they contribute to the algorithm.
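To make the two-level layout under discussion concrete, here is a rough sketch of the kind of structure involved (class and attribute names are illustrative, not the PR's actual code): a tree node holds a list of subclusters, and each subcluster carries only the sufficient statistics, so merging subclusters is just adding numbers.

```python
import numpy as np

# Rough sketch of the two-class layout being discussed (names are
# illustrative, not the PR's actual classes).
class CFSubcluster:
    """Carries the (n, linear sum, sum of squares) statistics."""
    def __init__(self, sample=None):
        if sample is None:
            self.n = 0
            self.ls = 0.0
            self.ss = 0.0
        else:
            sample = np.asarray(sample, dtype=float)
            self.n = 1
            self.ls = sample
            self.ss = float(np.dot(sample, sample))

    def update(self, other):
        # merging two subclusters is just adding their statistics
        self.n += other.n
        self.ls = self.ls + other.ls
        self.ss += other.ss

class CFNode:
    """A tree node: a bounded list of CFSubcluster objects."""
    def __init__(self, branching_factor=50):
        self.branching_factor = branching_factor
        self.subclusters = []
```

The question raised above is whether this object-per-subcluster layout could instead be flat arrays (one row of statistics per subcluster), which profiles and ports to Cython more easily.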
@jnothman @agramfort Thanks for all your comments. I shall give a detailed response tomorrow.
I am responding to @jnothman's major comment (and in a way to @GaelVaroquaux's too). Yes, I did not include the global clustering step, in which the leaf subclusters are grouped back (according to an arbitrary clustering algorithm). In that way
However, I am not sure how we are to accommodate this in the public API. Or do we just allow the user to give an option to set
And I think, if we need to just EDIT:
@jnothman I have added the global clustering step. Offline, @agramfort and I discussed the global clustering step; right now we have hardcoded it and are allowing the user to provide
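The global clustering step being discussed amounts to running a second clusterer on the leaf subcluster centroids, then propagating those labels back to the samples. A hedged sketch of the idea (not the PR's actual code; the agglomerative choice and array names are assumptions):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Sketch of the "global clustering" step under discussion (not the
# PR's code): the CF tree reduces the data to a small set of leaf
# subcluster centroids, and a second clusterer groups those centroids.
rng = np.random.RandomState(0)
leaf_centroids = rng.rand(40, 2)   # stand-in for leaf subcluster centroids

global_clusterer = AgglomerativeClustering(n_clusters=3)
centroid_labels = global_clusterer.fit_predict(leaf_centroids)

# each original sample inherits the label of its nearest leaf centroid
X = rng.rand(200, 2)
nearest = np.argmin(
    ((X[:, None, :] - leaf_centroids[None, :, :]) ** 2).sum(-1), axis=1)
labels = centroid_labels[nearest]
```

Because the second clusterer only ever sees the (few) centroids rather than all samples, even a quadratic algorithm stays cheap here.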
I would like it to stay as it is because of the
Sounds good. I'll look through this later. Is it worth adding the option that
We had thought about it, but then how do you account for the parameters of the arbitrary_clusterer?
👍 good idea.
You give an instance, not an object.
I see, so I guess we are in the
No. This is good design (I meant "not a class", by the way, in my
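The "instance, not a class" point answers the parameter question above: a pre-configured estimator instance carries its own parameters, so the BIRCH code never needs to know how to construct it. A sketch of that design (function name and the int fallback are illustrative assumptions, not the PR's API):

```python
from sklearn.base import clone
from sklearn.cluster import AgglomerativeClustering

# Sketch of the design being discussed (illustrative, not the PR's
# API): the user passes either an integer or a *configured instance*,
# whose parameters travel with it; we clone rather than re-instantiate.
def run_global_step(n_clusters, centroids):
    if hasattr(n_clusters, "fit_predict"):
        # user-supplied, pre-configured estimator instance
        clusterer = clone(n_clusters)
    else:
        # hypothetical default when only a cluster count is given
        clusterer = AgglomerativeClustering(n_clusters=n_clusters)
    return clusterer.fit_predict(centroids)
```

Accepting an instance sidesteps "how do you account for the parameters": there is nothing to account for, since the instance already holds them.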
Yes of course, I was asking if the next step would be to profile the code, after I make that change.
We need some work to speed it up. This example takes 1.24s with
Though it is much faster than AgglomerativeClustering with the default parameters (which takes 6s). @GaelVaroquaux Is this expected?
Yes. AgglomerativeClustering is quadratic in sample size (unless you use
    if is_fitted and X.shape[1] != self.subcluster_centers_.shape[1]:
        raise ValueError(
            "Training data and predicted data do "
            "not have same no. of features.")
"no." -> "number".
Once those minor things are addressed, I expect this will have my +1.
@jnothman I have addressed all your comments in the last commit. I have also added your name to the authors file, for your help in reviewing this
1. Made radius a property. 2. Added test for compute_label. 3. Minor changes to doc and split_subcluster. 4. n_cluster -> clusterer.
(force-pushed from d2ad2c7 to 36151f6)
It was a problem with
@jnothman I can haz merge?
I think so! Well done!
Congrats!
@jnothman @agramfort Thanks for bearing with my impatience. Now is just the simple matter of rewriting it in Cython!
Lol. No, the tricky part is writing as little of it as possible in Python.
Hurray!
Fixes #2690
The design is similar to the Java code written here: https://code.google.com/p/jbirch/
I am pretty sure it works (if the Java implementation is correct, ofc), since I get the same clusters in both cases. I opened this as a proof of concept.
This example has been modified: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html
When `threshold` is set to 3.0
When `threshold` is set to 1.0
TODO: A LOT
`dont_test` ;)
Awaiting some initial feedback!