Skip to content

AgglomerativeClustering with metric='cosine' broken for all-zero rows #7689

Closed
@weixuanfu

Description

@weixuanfu

Original title: "Cosine" affinity type in FeatureAgglomeration somehow casue memory overflow in a particular dataset

Description

Please carefully test the codes below! Using "cosine" affinity type in FeatureAgglomeration, the codes will cause memory overflow with a particular dataset (Download here). It is all right with other affinity types. And this issue cannot be reproduced using simulation data (like using make_classification in sci-kit learn). Not sure why it happened.

Steps/Code to Reproduce

from sklearn.cluster import FeatureAgglomeration
import numpy as np
import time
train_data = np.genfromtxt('fold_2_trainFeatVec.csv', delimiter=',')
train_labels= np.genfromtxt('fold_2_trainLabels.csv', delimiter=',')


fa = FeatureAgglomeration(affinity="cosine", linkage="average") # no matter linage ="average" or "complete"
time_start= time.time()
fa.fit(train_data, train_labels) # memory keeps increasing here
time_end = time.time()
print('Time usage:',time_end-time_start)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions