Add Sparse Matrix Support For HistGradientBoostingClassifier #15336

@jmwoloso

Description

Hi!

I'm receiving the error below when attempting to pass a sparse matrix to HistGradientBoostingClassifier. The matrix is the result of using CountVectorizer and TfidfTransformer on input text.

In my case, the size of the text prohibits converting the sparse matrix to a dense one (I run out of memory).
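To make the memory constraint concrete, here is a rough back-of-the-envelope sketch. The corpus dimensions are hypothetical (the real ones are not given in this report), and the estimate assumes float64 values and 32-bit indices in CSR format, which is typical for TF-IDF output but not guaranteed.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense array of float64 values."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, nnz, itemsize=8, index_itemsize=4):
    """Approximate bytes for a CSR matrix: values + column indices + row pointers."""
    return nnz * (itemsize + index_itemsize) + (n_rows + 1) * index_itemsize


# Hypothetical corpus: 1M documents, 100k vocabulary terms,
# about 100 distinct terms per document.
n_rows, n_cols, nnz_per_row = 1_000_000, 100_000, 100
nnz = n_rows * nnz_per_row

print("dense: %.1f GB" % (dense_bytes(n_rows, n_cols) / 1e9))
print("CSR:   %.1f GB" % (csr_bytes(n_rows, nnz) / 1e9))
```

Under these assumed numbers the dense form is several hundred times larger than the sparse one, which is why calling `toarray()` is not an option here.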

Steps/Code to Reproduce

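import pandas as pd

# On scikit-learn 0.21.x this estimator is experimental and must be enabled first.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import HistGradientBoostingClassifier

df = pd.read_csv(...)

vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
clf = HistGradientBoostingClassifier()

# Both transformers return scipy sparse matrices.
vecs = vectorizer.fit_transform(df.loc[:, "very_large_text"])
vecs = tfidf.fit_transform(vecs)

# Raises the TypeError shown under "Actual Results".
clf.fit(vecs, df.loc[:, "label"])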
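Since the CSV path above is elided, a self-contained variant may be easier to run. This is only a sketch: the texts and labels are made-up stand-ins for the real data, and the import fallback assumes that newer scikit-learn releases expose the estimator without the experimental flag. It catches the `TypeError` instead of letting it propagate, so the outcome can be inspected on whichever version is installed.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

try:
    from sklearn.ensemble import HistGradientBoostingClassifier
except ImportError:
    # Older releases (e.g. 0.21.x) need the experimental flag enabled first.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingClassifier

# Hypothetical stand-in for df["very_large_text"] and df["label"].
texts = ["good movie", "bad movie", "great plot", "terrible plot"] * 10
labels = [1, 0, 1, 0] * 10

counts = CountVectorizer().fit_transform(texts)
vecs = TfidfTransformer().fit_transform(counts)

error = None
try:
    HistGradientBoostingClassifier().fit(vecs, labels)
except TypeError as exc:
    # Expected on versions that reject sparse input.
    error = exc

print("input is sparse:", sparse.issparse(vecs))
print("fit raised:", repr(error))
```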

Expected Results

No error is thrown.

Actual Results

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Versions

System:
python: 3.7.3 (default, Oct 1 2019, 18:28:53) [GCC 5.4.0 20160609]
executable: /local_disk0/pythonVirtualEnvDirs/virtualEnv-3631eab5-084b-4139-952e-5aff594ac1bb/bin/python
machine: Linux-4.15.0-1050-azure-x86_64-with-debian-stretch-sid

Python deps:
pip: 19.0.3
setuptools: 40.8.0
sklearn: 0.21.3
numpy: 1.16.2
scipy: 1.2.1
Cython: 0.29.6
pandas: 0.24.2
