Currently no clusterers (or clustering metrics) support weighted datasets (although support for DBSCAN is proposed in #3994).
Weighting can be a compact way of representing repeated samples, and may affect cluster means and
variance, average link between clusters, etc.
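
To illustrate the "compact representation" point, here is a minimal NumPy sketch (the data and the integer weights are made up for illustration) showing that a weighted mean and variance match the mean and variance of the explicitly repeated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))          # toy data, purely illustrative
w = np.array([1, 3, 2, 1, 4, 2])     # integer weights = repetition counts

# Explicitly repeat each row of X according to its weight.
X_rep = np.repeat(X, w, axis=0)

# Weighted statistics computed on the compact representation.
weighted_mean = np.average(X, axis=0, weights=w)
weighted_var = np.average((X - weighted_mean) ** 2, axis=0, weights=w)

# Plain (population) statistics computed on the repeated data.
repeated_mean = X_rep.mean(axis=0)
repeated_var = X_rep.var(axis=0)

assert np.allclose(weighted_mean, repeated_mean)
assert np.allclose(weighted_var, repeated_var)
```

The same equivalence is what a weighted clusterer would be expected to preserve for quantities such as centroids or average linkage.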
Ideally BIRCH's global clustering stage should be provided a weighted dataset, and its current use of unweighted representatives may make its parametrisation more brittle.
This could be subject to an invariance test along the lines of:
import numpy as np
from sklearn.metrics import adjusted_rand_score

sample_weight = np.random.randint(0, 10, size=X.shape[0])
weighted_y = clusterer.fit_predict(X, sample_weight=sample_weight)
# axis=0 repeats whole rows; without it np.repeat would flatten X
repeated_y = clusterer.fit_predict(np.repeat(X, sample_weight, axis=0))
assert adjusted_rand_score(np.repeat(weighted_y, sample_weight), repeated_y) == 1
# NB: this is only a useful sufficient test if weighted_y differs from clusterer.fit_predict(X)
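
For a concrete, runnable version of this kind of check, here is a hedged sketch using `KMeans`, assuming a scikit-learn release in which `KMeans.fit_predict` accepts `sample_weight` (that support postdates this issue). The toy blobs, the fixed `init` centers, and the use of weights starting at 1 rather than 0 are choices made for this illustration only, so that both fits are deterministic and comparable:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Well-separated toy blobs (illustrative data, not from the issue).
centers = np.array([[-10.0, -10.0], [0.0, 0.0], [10.0, 10.0]])
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=1.0,
                  random_state=0)

rng = np.random.default_rng(0)
sample_weight = rng.integers(1, 10, size=X.shape[0])  # weights in 1..9

def make_clusterer():
    # Fixed initial centers remove the randomness of initialisation,
    # so the weighted and repeated fits start from the same state.
    return KMeans(n_clusters=3, init=centers, n_init=1)

km_weighted = make_clusterer()
weighted_y = km_weighted.fit_predict(X, sample_weight=sample_weight)

km_repeated = make_clusterer()
repeated_y = km_repeated.fit_predict(np.repeat(X, sample_weight, axis=0))

km_plain = make_clusterer().fit(X)

# Invariance: weighting should be equivalent to repeating samples.
ari = adjusted_rand_score(np.repeat(weighted_y, sample_weight), repeated_y)
assert np.isclose(ari, 1.0)
assert np.allclose(km_weighted.cluster_centers_, km_repeated.cluster_centers_)

# The weights should actually matter: centroids differ from the unweighted fit.
assert not np.allclose(km_weighted.cluster_centers_, km_plain.cluster_centers_)
```

Because the blobs are well separated, the labels alone would not distinguish a weighted fit from an unweighted one here, which is why the sketch also compares centroids. A stricter test would use data where the weights change the partition itself.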
(There is also a minor question of whether `sample_weight` should be universally accepted by `ClusterMixin` or whether `WeightedClusterMixin` should be created, etc.)
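
As a rough sketch of the second option, a separate mixin could look something like the following. `WeightedClusterMixin` is hypothetical (it does not exist in scikit-learn), and `ToyThresholdClusterer` is a made-up estimator used only to exercise it:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin


class WeightedClusterMixin(ClusterMixin):
    """Hypothetical mixin whose fit_predict forwards sample_weight to fit."""

    def fit_predict(self, X, y=None, sample_weight=None):
        self.fit(X, sample_weight=sample_weight)
        return self.labels_


class ToyThresholdClusterer(WeightedClusterMixin, BaseEstimator):
    """Made-up clusterer: splits samples at the (weighted) mean of feature 0."""

    def fit(self, X, y=None, sample_weight=None):
        X = np.asarray(X, dtype=float)
        # np.average falls back to a plain mean when weights is None.
        threshold = np.average(X[:, 0], weights=sample_weight)
        self.labels_ = (X[:, 0] > threshold).astype(int)
        return self


X_toy = np.array([[0.0], [4.0], [10.0]])
w_toy = np.array([8, 1, 1])

plain_labels = ToyThresholdClusterer().fit_predict(X_toy)
weighted_labels = ToyThresholdClusterer().fit_predict(X_toy, sample_weight=w_toy)
```

Here the unweighted threshold is about 4.67 and the weighted one is 1.4, so the middle sample changes cluster once weights are supplied.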
Sample weight support for clusterers:
- Affinity propagation (I don't know this well enough to know the applicability)
- BIRCH
- DBSCAN
- Hierarchical -> Ward link
- Hierarchical -> Complete link (N/A, as far as I can tell)
- Hierarchical -> Average link
- K Means
- Minibatch K Means
- Mean shift
- Spectral