Skip to content

Commit e1adf6d

Browse files
yu-iskwjkbradley
authored andcommitted
[SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means
This PR includes only an example code in order to finish it quickly. I'll send another PR for the docs soon. Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes apache#9952 from yu-iskw/SPARK-6518. (cherry picked from commit 7b6dc29) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
1 parent e5b8571 commit e1adf6d

File tree

4 files changed

+165
-0
lines changed

4 files changed

+165
-0
lines changed

docs/mllib-clustering.md

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -718,6 +718,41 @@ sameModel = LDAModel.load(sc, "myModelPath")
718718

719719
</div>
720720

721+
## Bisecting k-means
722+
723+
Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering.
724+
725+
Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering).
726+
Hierarchical clustering is one of the most commonly used method of cluster analysis which seeks to build a hierarchy of clusters.
727+
Strategies for hierarchical clustering generally fall into two types:
728+
729+
- Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
730+
- Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
731+
732+
Bisecting k-means algorithm is a kind of divisive algorithms.
733+
The implementation in MLlib has the following parameters:
734+
735+
* *k*: the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
736+
* *maxIterations*: the max number of k-means iterations to split clusters (default: 20)
737+
* *minDivisibleClusterSize*: the minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) of a divisible cluster (default: 1)
738+
* *seed*: a random seed (default: hash value of the class name)
739+
740+
**Examples**
741+
742+
<div class="codetabs">
743+
<div data-lang="scala" markdown="1">
744+
Refer to the [`BisectingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) and [`BisectingKMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel) for details on the API.
745+
746+
{% include_example scala/org/apache/spark/examples/mllib/BisectingKMeansExample.scala %}
747+
</div>
748+
749+
<div data-lang="java" markdown="1">
750+
Refer to the [`BisectingKMeans` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html) and [`BisectingKMeansModel` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeansModel.html) for details on the API.
751+
752+
{% include_example java/org/apache/spark/examples/mllib/JavaBisectingKMeansExample.java %}
753+
</div>
754+
</div>
755+
721756
## Streaming k-means
722757

723758
When data arrive in a stream, we may want to estimate clusters dynamically,

docs/mllib-guide.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -49,6 +49,7 @@ We list major functionality from both below, with links to detailed guides.
4949
* [Gaussian mixture](mllib-clustering.html#gaussian-mixture)
5050
* [power iteration clustering (PIC)](mllib-clustering.html#power-iteration-clustering-pic)
5151
* [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda)
52+
* [bisecting k-means](mllib-clustering.html#bisecting-kmeans)
5253
* [streaming k-means](mllib-clustering.html#streaming-k-means)
5354
* [Dimensionality reduction](mllib-dimensionality-reduction.html)
5455
* [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd)
Lines changed: 69 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,69 @@
1+
/*
2+
* Licensed to the Apache Software Foundation (ASF) under one or more
3+
* contributor license agreements. See the NOTICE file distributed with
4+
* this work for additional information regarding copyright ownership.
5+
* The ASF licenses this file to You under the Apache License, Version 2.0
6+
* (the "License"); you may not use this file except in compliance with
7+
* the License. You may obtain a copy of the License at
8+
*
9+
* http://www.apache.org/licenses/LICENSE-2.0
10+
*
11+
* Unless required by applicable law or agreed to in writing, software
12+
* distributed under the License is distributed on an "AS IS" BASIS,
13+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
* See the License for the specific language governing permissions and
15+
* limitations under the License.
16+
*/
17+
18+
package org.apache.spark.examples.mllib;
19+
20+
import java.util.ArrayList;
21+
22+
// $example on$
23+
import com.google.common.collect.Lists;
24+
// $example off$
25+
import org.apache.spark.SparkConf;
26+
import org.apache.spark.api.java.JavaSparkContext;
27+
// $example on$
28+
import org.apache.spark.api.java.JavaRDD;
29+
import org.apache.spark.mllib.clustering.BisectingKMeans;
30+
import org.apache.spark.mllib.clustering.BisectingKMeansModel;
31+
import org.apache.spark.mllib.linalg.Vector;
32+
import org.apache.spark.mllib.linalg.Vectors;
33+
// $example off$
34+
35+
/**
36+
* Java example for graph clustering using power iteration clustering (PIC).
37+
*/
38+
public class JavaBisectingKMeansExample {
39+
public static void main(String[] args) {
40+
SparkConf sparkConf = new SparkConf().setAppName("JavaBisectingKMeansExample");
41+
JavaSparkContext sc = new JavaSparkContext(sparkConf);
42+
43+
// $example on$
44+
ArrayList<Vector> localData = Lists.newArrayList(
45+
Vectors.dense(0.1, 0.1), Vectors.dense(0.3, 0.3),
46+
Vectors.dense(10.1, 10.1), Vectors.dense(10.3, 10.3),
47+
Vectors.dense(20.1, 20.1), Vectors.dense(20.3, 20.3),
48+
Vectors.dense(30.1, 30.1), Vectors.dense(30.3, 30.3)
49+
);
50+
JavaRDD<Vector> data = sc.parallelize(localData, 2);
51+
52+
BisectingKMeans bkm = new BisectingKMeans()
53+
.setK(4);
54+
BisectingKMeansModel model = bkm.run(data);
55+
56+
System.out.println("Compute Cost: " + model.computeCost(data));
57+
for (Vector center: model.clusterCenters()) {
58+
System.out.println("");
59+
}
60+
Vector[] clusterCenters = model.clusterCenters();
61+
for (int i = 0; i < clusterCenters.length; i++) {
62+
Vector clusterCenter = clusterCenters[i];
63+
System.out.println("Cluster Center " + i + ": " + clusterCenter);
64+
}
65+
// $example off$
66+
67+
sc.stop();
68+
}
69+
}
Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
/*
2+
* Licensed to the Apache Software Foundation (ASF) under one or more
3+
* contributor license agreements. See the NOTICE file distributed with
4+
* this work for additional information regarding copyright ownership.
5+
* The ASF licenses this file to You under the Apache License, Version 2.0
6+
* (the "License"); you may not use this file except in compliance with
7+
* the License. You may obtain a copy of the License at
8+
*
9+
* http://www.apache.org/licenses/LICENSE-2.0
10+
*
11+
* Unless required by applicable law or agreed to in writing, software
12+
* distributed under the License is distributed on an "AS IS" BASIS,
13+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
* See the License for the specific language governing permissions and
15+
* limitations under the License.
16+
*/
17+
18+
package org.apache.spark.examples.mllib
19+
20+
// scalastyle:off println
21+
// $example on$
22+
import org.apache.spark.mllib.clustering.BisectingKMeans
23+
import org.apache.spark.mllib.linalg.{Vector, Vectors}
24+
// $example off$
25+
import org.apache.spark.{SparkConf, SparkContext}
26+
27+
/**
28+
* An example demonstrating a bisecting k-means clustering in spark.mllib.
29+
*
30+
* Run with
31+
* {{{
32+
* bin/run-example mllib.BisectingKMeansExample
33+
* }}}
34+
*/
35+
object BisectingKMeansExample {
36+
37+
def main(args: Array[String]) {
38+
val sparkConf = new SparkConf().setAppName("mllib.BisectingKMeansExample")
39+
val sc = new SparkContext(sparkConf)
40+
41+
// $example on$
42+
// Loads and parses data
43+
def parse(line: String): Vector = Vectors.dense(line.split(" ").map(_.toDouble))
44+
val data = sc.textFile("data/mllib/kmeans_data.txt").map(parse).cache()
45+
46+
// Clustering the data into 6 clusters by BisectingKMeans.
47+
val bkm = new BisectingKMeans().setK(6)
48+
val model = bkm.run(data)
49+
50+
// Show the compute cost and the cluster centers
51+
println(s"Compute Cost: ${model.computeCost(data)}")
52+
model.clusterCenters.zipWithIndex.foreach { case (center, idx) =>
53+
println(s"Cluster Center ${idx}: ${center}")
54+
}
55+
// $example off$
56+
57+
sc.stop()
58+
}
59+
}
60+
// scalastyle:on println

0 commit comments

Comments
 (0)