
LogisticRegression convert to float64 (for SAG solver) #13243


Merged — 31 commits merged into scikit-learn:master from massich:henley_is_8769_minus_merged on Feb 27, 2019

Conversation

@massich (Contributor) commented Feb 25, 2019

Reference Issues/PRs

Works on #8769 and #11000 (for the SAG solver). Fixes #9040.

Closes #11155 (takes over that PR).

Joan Massich and others added 13 commits July 8, 2018 15:46:

- initial PR commit
- seq_dataset.pyx generated from template
- seq_dataset.pyx generated from template #2
- rename variables
- fused types consistency test for seq_dataset
- a
- sklearn/utils/tests/test_seq_dataset.py
- new if statement
- add doc
- sklearn/utils/seq_dataset.pyx.tp
- minor changes
- minor changes
- typo fix
- check numeric accuracy only up 5th decimal
- Address oliver's request for changing test name
- add test for make_dataset and rename a variable in test_seq_dataset
@massich changed the title from "Henley is 8769 minus merged" to "LogisticRegression convert to float64 (for SAG solver)" on Feb 25, 2019
```diff
@@ -245,8 +245,9 @@ def sag_solver(X, y, sample_weight=None, loss='log', alpha=1., beta=0.,
         max_iter = 1000

     if check_input:
-        X = check_array(X, dtype=np.float64, accept_sparse='csr', order='C')
-        y = check_array(y, dtype=np.float64, ensure_2d=False, order='C')
+        _dtype = [np.float64, np.float32]
+        X = check_array(X, dtype=_dtype, accept_sparse='csr', order='C')
+        y = check_array(y, dtype=_dtype, ensure_2d=False, order='C')
```
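For context on why this change preserves float32: when `check_array` receives a list of dtypes, it keeps the input's dtype if it is in the list and otherwise converts to the first entry. A minimal sketch using the public `sklearn.utils.check_array` (the arrays here are illustrative):

```python
import numpy as np
from sklearn.utils import check_array

_dtype = [np.float64, np.float32]

X32 = np.ones((3, 2), dtype=np.float32)
X_int = np.ones((3, 2), dtype=np.int64)

# float32 is in the allowed list, so the input dtype is preserved
assert check_array(X32, dtype=_dtype).dtype == np.float32

# int64 is not in the list, so the array is cast to the first entry, float64
assert check_array(X_int, dtype=_dtype).dtype == np.float64
```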
@massich (Contributor, Author) commented Feb 25, 2019:

@lesteve said (https://github.com/scikit-learn/scikit-learn/pull/11155/files#r192674850):

> This does not seem to be covered by any of the tests.

@glemaitre responded:

> I see a pattern that we have applied sometimes: the solvers always check the inputs while the class level does not (e.g. KMeans).
>
> Is this something we want to extend here, or do we leave these lines uncovered, or remove them?

@massich (Contributor, Author) replied:

I'm not really sure what you were pointing at. `_dtype=[bla.]` looks fine to me.

@massich force-pushed the henley_is_8769_minus_merged branch from 151fb1c to 5eb99b8 on February 25, 2019 15:35
```python
dataset2 = CSRDataset(X_csr.data, X_csr.indptr, X_csr.indices,
                      y, sample_weight, seed=42)

X64 = iris.data.astype(np.float64)
```
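As a side note for readers unfamiliar with the `CSRDataset` constructor used above: the three index arrays come straight from SciPy's CSR representation. A small illustrative sketch (assuming the same iris data as in the test):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.datasets import load_iris

X64 = load_iris().data.astype(np.float64)
X_csr = sp.csr_matrix(X64)

print(X_csr.data.dtype)     # non-zero values, float64 here
print(X_csr.indices.dtype)  # column index of each stored value
print(X_csr.indptr.dtype)   # offsets where each row starts in data/indices
```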
A Member commented:
Could create a pytest fixture which returns either 64-bit or 32-bit datasets. It can return X, X_sparse, y, sample_weight.

We can then use this fixture in parametrize (see the sketch below).
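A minimal sketch of what such a fixture could look like (the fixture and test names are hypothetical; assumes the iris data used elsewhere in this test module):

```python
import numpy as np
import pytest
import scipy.sparse as sp
from sklearn.datasets import load_iris

iris = load_iris()


@pytest.fixture(params=[np.float32, np.float64])
def iris_data(request):
    # Returns one (X, X_sparse, y, sample_weight) tuple per dtype;
    # every test using this fixture runs once for 32 and once for 64 bit.
    dtype = request.param
    X = iris.data.astype(dtype)
    y = iris.target.astype(dtype)
    sample_weight = np.ones(X.shape[0], dtype=dtype)
    return X, sp.csr_matrix(X), y, sample_weight


def test_dtypes_are_consistent(iris_data):
    X, X_sparse, y, sample_weight = iris_data
    assert X.dtype == X_sparse.dtype == y.dtype == sample_weight.dtype
```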

@glemaitre (Member) commented:

The linter is not happy.

@massich (Contributor, Author) commented Feb 26, 2019:

The last commit should fix the linter.

I tried to call the constructors with named parameters to reduce verbosity, but when I do so I get `__cinit__` errors that I can't seem to get around.

If green, we can merge it as it is, or I could give a pass to sklearn/linear_model/tests/test_base.py.

@massich (Contributor, Author) commented Feb 26, 2019:

On second thought: since the test compares the 32-bit and 64-bit results against each other for consistency, it needs both dtypes at once, so you can't simply parametrize over dtype. We could instead define an EXPECTED_VALUES constant and compare each dtype against it (see the sketch below).
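To illustrate the trade-off (the `compute` function is a hypothetical stand-in for the code under test):

```python
import numpy as np
import pytest

X64 = np.arange(12, dtype=np.float64).reshape(4, 3)


def compute(X):
    # Hypothetical stand-in for the routine whose 32/64-bit outputs we compare.
    return X.sum(axis=0)


# Approach 1: the consistency check needs both dtypes in one test,
# so dtype cannot be a pytest parameter here.
def test_32_64_consistency():
    assert np.allclose(compute(X64.astype(np.float32)), compute(X64), rtol=1e-5)


# Approach 2: compare each dtype against precomputed expected values,
# which makes parametrizing over dtype possible again.
EXPECTED_VALUES = np.array([18.0, 22.0, 26.0])


@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_against_expected(dtype):
    assert np.allclose(compute(X64.astype(dtype)), EXPECTED_VALUES, rtol=1e-5)
```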

```diff
@@ -137,6 +137,11 @@ Support for Python 3.4 and below has been officially dropped.
 :mod:`sklearn.linear_model`
 ...........................

+- |Enhancement| :class:`linear_model.make_dataset` now preserves
+  ``float32`` and ``float64`` dtypes. :issues:`8769` and `11000` by
```
A Contributor commented:

`11000` is missing `:issues:` in front.

@massich (Contributor, Author) replied with the corrected entry:
- |Enhancement| :class:`linear_model.make_dataset` now preserves ``float32``
  and ``float64`` dtypes. :issues:`8769` and :issues:`11000` by :user:`Nelle
  Varoquaux`_, :user:`Arthur Imbert`_, :user:`Guillaume Lemaitre <glemaitre>`,
  and :user:`Joan Massich <massich>`

@GaelVaroquaux (Member) commented:

LGTM. Merging. This is a net improvement; more improvements can follow later.

@GaelVaroquaux merged commit f02ef9f into scikit-learn:master Feb 27, 2019
@massich deleted the henley_is_8769_minus_merged branch February 27, 2019 10:45
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019:

LogisticRegression convert to float64 (for SAG solver) (#13243)

* Remove unused code

* Squash all the PR 9040 commits

initial PR commit

seq_dataset.pyx generated from template

seq_dataset.pyx generated from template #2

rename variables

fused types consistency test for seq_dataset

a

sklearn/utils/tests/test_seq_dataset.py

new if statement

add doc

sklearn/utils/seq_dataset.pyx.tp

minor changes

minor changes

typo fix

check numeric accuracy only up 5th decimal

Address oliver's request for changing test name

add test for make_dataset and rename a variable in test_seq_dataset

* FIX tests

* TST more numerically stable test_sgd.test_tol_parameter

* Added benchmarks to compare SAGA 32b and 64b

* Fixing gael's comments

* fix

* solve some issues

* PEP8

* Address lesteve comments

* fix merging

* avoid using assert_equal

* use all_close

* use explicit ArrayDataset64 and CSRDataset64

* fix: remove unused import

* Use parametrized to cover ArrayDaset-CSRDataset-32-64 matrix

* for consistency use 32 first then 64 + add 64 suffix to variables

* it would be cool if this worked !!!

* more verbose version

* revert SGD changes as much as possible.

* Add solvers back to bench_saga

* make 64 explicit in the naming

* remove checking native python type + add comparison between 32 64

* Add whatsnew with everyone with commits

* simplify a bit the testing

* simplify the parametrize

* update whatsnew

* fix pep8
xhluca pushed two further commits to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019, with the same squashed commit message as above.
6 participants