Skip to content

NearestCentroid not handling properly distance metrics other than Manhattan or Euclidean. #23890

Closed
@Valentin-Laurent

Description

@Valentin-Laurent

Describe the bug

Three issues, quite related.

  • Using a metric that is not Manhattan or Euclidean triggers the following warning: UserWarning: Averaging for metrics other than euclidean and manhattan not supported. The average is set to be the mean., but I believe an error message should be thrown instead. Indeed, computing the centroids as the average of the class samples is wrong for (I think) all distances except Euclidean, so it is misleading to actually do it even with a warning. Throwing an error here would fix the 2 following issues.

  • Calling .predict on a NearestCentroid with metric wminkowski, seuclidean or mahalanobis throws an error, because those metrics need params that we can't pass to NearestCentroid. DOC Improve doc of Nearest Neighbors metric param #23806 would fix the documentation, but maybe it should be nice to handle this properly in the code.

  • Using any string as a metric (even a random string like foobar) does not throws an error when fitting, but does when calling .predict. Doing this with another classifier like KNeighborsClassifier throws an error at the fitting step.

If relevant I'd like to work on fixing those issues (new contributor here :)) I noticed a stalled PR (#19262) related to implementing partial_fit to NearestCentroid. Maybe I should start working on it first, as it seems to improve the code quality of _nearest_centroid.py, so it would be an interesting base to start with.

Steps/Code to Reproduce

#Reproduces the second issue mentionned

from sklearn.neighbors import NearestCentroid
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
nearest_centroid = NearestCentroid(metric="seuclidean")
nearest_centroid.fit(X,y)
nearest_centroid.predict(X)

Expected Results

Fitting a NearestCentroid with a metric that is not Manhattan or Euclidean should trigger a proper error message.

Actual Results

Error related to the second issue:
ValueError: The 'V' parameter is required for the seuclidean metric when Y is passed.

Versions

System:
    python: 3.8.6 (default, Mar 29 2021, 18:42:37)  [Clang 12.0.0 (clang-1200.0.32.29)]
executable: /Users/valentinlaurent/.pyenv/versions/3.8.6/envs/lewagon/bin/python3.8
   machine: macOS-10.15.7-x86_64-i386-64bit

Python dependencies:
          pip: 22.1.1
   setuptools: 49.2.1
      sklearn: 1.0.2
        numpy: 1.19.5
        scipy: 1.6.2
       Cython: 0.29.26
       pandas: 1.1.3
   matplotlib: 3.4.1
       joblib: 1.1.0
threadpoolctl: 2.1.0

Built with OpenMP: True

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions