LogisticRegression memory consumption goes crazy on 0.22+ #17125

@AndriiNeverov

Describe the bug

Starting with scikit-learn 0.22, LogisticRegression consumes far more RAM during fit than it did on 0.21.3.

Steps/Code to Reproduce

import io

import pandas as pd
import requests
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle


url = "https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv" # .csv file location
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))

df = df[pd.notnull(df['tags'])]
df = df.sample(frac=0.5, random_state=99).reset_index(drop=True)
df = shuffle(df, random_state=22)
df = df.reset_index(drop=True)
df['class_label'] = df['tags'].factorize()[0]


df_train, df_test = train_test_split(df, test_size=0.2, random_state=40)

X_train = df_train["post"].tolist()
X_test = df_test["post"].tolist()
y_train = df_train["class_label"].tolist()
y_test = df_test["class_label"].tolist()

# Binary bag of 1- to 3-grams; this yields a very wide sparse matrix.
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                             ngram_range=(1, 3), stop_words='english', binary=True)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

# Default solver and penalty; C=1e5 means very weak regularization.
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(train_vectors, y_train)

Expected Results

Memory consumption on 0.22+ should stay roughly the same as on 0.21.3.

Actual Results

0.22+ behavior (tried 0.22.0, 0.22.1, 0.22.2.post1):

If run inside a container with limited memory (1-2 GB), the process is killed by the OOM killer.

Locally, top -o mem shows memory consumption growing to 9 GB and still increasing.

0.21.3 behavior:

Everything works fine within a 1GB container.

Locally, top -o mem never shows more than 1 GB of memory consumption.
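
The figures above come from watching top. To get a comparable number from inside the process, something like the following could be used. This is only a sketch using the standard-library resource module (Unix only); the helper names peak_rss_mb and measure_peak are made up for illustration and were not part of the original measurements.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure_peak(fn, *args, **kwargs):
    """Call fn and return (result, peak RSS before, peak RSS after), in MiB."""
    before = peak_rss_mb()
    result = fn(*args, **kwargs)
    after = peak_rss_mb()
    return result, before, after


if __name__ == "__main__":
    # Stand-in workload. In the reproduction script this would be
    # measure_peak(logreg.fit, train_vectors, y_train).
    _, before, after = measure_peak(lambda: b"x" * (100 * 1024 * 1024))
    print(f"peak RSS before: {before:.0f} MiB, after: {after:.0f} MiB")
```

Note that ru_maxrss is a high-water mark for the whole process, so the "before" value already includes loading the data and vectorizing; only the difference is attributable to fit.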

Versions

System:
python: 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
executable: /usr/bin/python3
machine: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
pip: 19.3.1
setuptools: 44.0.0
sklearn: 0.22.1
numpy: 1.18.1
scipy: 1.4.1
Cython: None
pandas: 0.25.3
matplotlib: 3.1.2
joblib: 0.14.1

Built with OpenMP: True
