Description
Hi!
I'm receiving the error below when attempting to pass a sparse matrix to HistGradientBoostingClassifier. The matrix is the result of using CountVectorizer and TfidfTransformer on input text.
In my case, the size of the text prohibits converting the sparse matrix to a dense one (I run out of memory).
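To illustrate why densifying is not an option, here is a rough back-of-the-envelope sketch comparing the memory footprint of a dense float64 array with that of a CSR sparse matrix. The helper names (`dense_bytes`, `csr_bytes`) and the example shape are hypothetical and chosen only for illustration; they are not the actual dimensions of my data. The sketch assumes 8-byte values and 4-byte indices.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed to store every cell of an n_rows x n_cols array."""
    return n_rows * n_cols * itemsize


def csr_bytes(nnz, n_rows, itemsize=8, index_size=4):
    """Bytes needed by the three CSR arrays: data, indices and indptr."""
    data = nnz * itemsize               # one value per stored entry
    indices = nnz * index_size          # one column index per stored entry
    indptr = (n_rows + 1) * index_size  # one row pointer per row, plus one
    return data + indices + indptr


# Hypothetical corpus: 1M documents, 500k vocabulary terms,
# about 100 distinct terms per document.
n_rows, n_cols, nnz = 1_000_000, 500_000, 100_000_000

print(f"dense: {dense_bytes(n_rows, n_cols) / 1e9:.1f} GB")
print(f"CSR:   {csr_bytes(nnz, n_rows) / 1e9:.1f} GB")
```

With these illustrative numbers the dense array would need about 4 TB, while the CSR representation needs about 1.2 GB, which is why calling X.toarray() fails on a large vocabulary.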
Steps/Code to Reproduce
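import pandas as pd
# Needed in scikit-learn 0.21 to 0.23, where this estimator is still experimental
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import HistGradientBoostingClassifier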
df = pd.read_csv(...)
vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
clf = HistGradientBoostingClassifier()
vecs = vectorizer.fit_transform(df.loc[:, "very_large_text"])
vecs = tfidf.fit_transform(vecs)
clf.fit(vecs, df.loc[:, "label"])
Expected Results
No error is thrown.
Actual Results
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Versions
System:
python: 3.7.3 (default, Oct 1 2019, 18:28:53) [GCC 5.4.0 20160609]
executable: /local_disk0/pythonVirtualEnvDirs/virtualEnv-3631eab5-084b-4139-952e-5aff594ac1bb/bin/python
machine: Linux-4.15.0-1050-azure-x86_64-with-debian-stretch-sid
Python deps:
pip: 19.0.3
setuptools: 40.8.0
sklearn: 0.21.3
numpy: 1.16.2
scipy: 1.2.1
Cython: 0.29.6
pandas: 0.24.2