Support sample weight in clusterers #3998

Open
@jnothman

Currently no clusterers (or clustering metrics) support weighted datasets (although support for DBSCAN is proposed in #3994).

Weighting can be a compact way of representing repeated samples, and may affect cluster means and variances, the average link between clusters, etc.
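
To make the "repeated samples" reading concrete, here is a small NumPy-only sketch (the data and weights are made up for illustration) showing that integer weights give the same mean and variance as physically repeating the rows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))            # toy data, illustration only
w = np.array([3, 1, 0, 2, 5, 1])       # integer weights; 0 drops the sample

# Weighted statistics computed directly from X and w ...
weighted_mean = np.average(X, axis=0, weights=w)
weighted_var = np.average((X - weighted_mean) ** 2, axis=0, weights=w)

# ... and the same statistics on the explicitly repeated dataset.
X_rep = np.repeat(X, w, axis=0)        # repeat rows, not flattened elements
repeated_mean = X_rep.mean(axis=0)
repeated_var = X_rep.var(axis=0)

assert np.allclose(weighted_mean, repeated_mean)
assert np.allclose(weighted_var, repeated_var)
```

Any clusterer whose updates are built from such means or variances would be expected to respond to weights in the same way.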

Ideally BIRCH's global clustering stage should be provided a weighted dataset, and its current use of unweighted representatives may make its parametrisation more brittle.

This could be subject to an invariance test along the lines of:

import numpy as np
from sklearn.metrics import adjusted_rand_score

sample_weight = np.random.randint(0, 10, size=X.shape[0])
weighted_y = clusterer.fit_predict(X, sample_weight=sample_weight)
# repeat whole rows (axis=0); without axis, np.repeat flattens X
repeated_y = clusterer.fit_predict(np.repeat(X, sample_weight, axis=0))
assert adjusted_rand_score(np.repeat(weighted_y, sample_weight), repeated_y) == 1
# NB: this test is only informative if weighted_y differs from clusterer.fit_predict(X)
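
As a runnable illustration of that test, here is a sketch against a toy weighted Lloyd's k-means written in plain NumPy. The function `weighted_kmeans_labels`, the data and the weights are all hypothetical and are not scikit-learn API. The initial centers are fixed, so labels can be compared directly; with random initialisation a permutation-invariant score such as `adjusted_rand_score` would be the safer comparison.

```python
import numpy as np

def weighted_kmeans_labels(X, init_centers, sample_weight=None, n_iter=20):
    """Hypothetical weighted Lloyd's k-means; returns one label per row of X."""
    X = np.asarray(X, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(X.shape[0])
    sample_weight = np.asarray(sample_weight, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()

    def assign(centers):
        # Squared Euclidean distance of every sample to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return dist.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for k in range(centers.shape[0]):
            mask = labels == k
            # Update with the weighted mean; skip clusters with no weight.
            if sample_weight[mask].sum() > 0:
                centers[k] = np.average(X[mask], axis=0,
                                        weights=sample_weight[mask])
    return assign(centers)

X = np.array([[0.0], [1.0], [5.2], [9.0], [10.0]])
init = np.array([[0.0], [10.0]])
sample_weight = np.array([1, 20, 1, 1, 20])

weighted_y = weighted_kmeans_labels(X, init, sample_weight=sample_weight)
repeated_y = weighted_kmeans_labels(np.repeat(X, sample_weight, axis=0), init)
unweighted_y = weighted_kmeans_labels(X, init)

# Invariance: weighting must match clustering the repeated dataset.
assert np.array_equal(np.repeat(weighted_y, sample_weight), repeated_y)
# The test is informative here because the weights change the result.
assert not np.array_equal(weighted_y, unweighted_y)
```

The weights were chosen so that the middle point switches cluster once the heavy points pull the centers apart, which is the condition the NB above asks for.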

(There is also a minor question of whether sample_weight should be universally accepted by ClusterMixin or whether WeightedClusterMixin should be created, etc.)

Sample weight support for clusterers:

  • Affinity propagation (I don't know this well enough to know the applicability)
  • BIRCH
  • DBSCAN
  • Hierarchical -> Ward link
  • Hierarchical -> Complete link (N/A, as far as I can tell)
  • Hierarchical -> Average link
  • K Means
  • Minibatch K Means
  • Mean shift
  • Spectral
