Skip to content

[MRG+1] Astype fix #4645

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 19, 2015
Merged

[MRG+1] Astype fix #4645

merged 1 commit into from
May 19, 2015

Conversation

TomDLT
Copy link
Member

@TomDLT TomDLT commented Apr 29, 2015

I extended check_array to add support for accept_sparse=True
and solve the 3 errors by changing astype into check_array when sparse matrix are possible

I will check the other uses of astype tomorrow.

TODO:

  • fix the test error
  • finish changing astype to check_array, when we may have a sparse matrix
  • clean some remaining usage of warn_if_not_float to use check_array instead

@agramfort
Copy link
Member

thanks @TomDLT

Travis is not happy. You have failing tests.

@TomDLT
Copy link
Member Author

TomDLT commented Apr 30, 2015

OK maybe I will start testing before pushing

@ogrisel
Copy link
Member

ogrisel commented Apr 30, 2015

Also it would be great to rebase this branch on top of current master while removing the upstream/master merge commit by @jrings.

@agramfort
Copy link
Member

OK maybe I will start testing before pushing

Sounds like a plan ;)

@amueller
Copy link
Member

meh, you can also just check travis after pushing ;) but that utually takes longer.

@TomDLT TomDLT force-pushed the astype_fix branch 2 times, most recently from 0222c8b to 411b868 Compare April 30, 2015 17:43
@TomDLT
Copy link
Member Author

TomDLT commented Apr 30, 2015

check_array is now enhanced to accept :
-accept_sparse=True to accept all sparse matrix formats
-dtype with an array of accepted types
-warn_on_dtype=True to warn if the dtype as been changed

Big thanks to @ogrisel that worked on this with me

TODO:
-fix the error
-finish changing astypeto check_array, when we may have a sparse matrix
-clean some warn_if_not_float using check_array

I will look at this monday

@ogrisel
Copy link
Member

ogrisel commented May 4, 2015

To fix the test that fail on travis, you might want to setup a conda env with the same versions for numpy and scipy.

conda create -n oldnumpy python=2.6 numpy=1.6.2 scipy=0.11.0
source activate oldnumpy
make clean test

@ogrisel ogrisel changed the title Astype fix [WIP] Astype fix May 4, 2015
@ogrisel
Copy link
Member

ogrisel commented May 4, 2015

Good catch, travis is now green. I re-wrapped and moved your todo list to put it in the description of the PR. Once all the boxes are ticked please edit the title of the PR to replace [WIP] (work in progress) to [MRG] (ready for merge reviews).

@TomDLT
Copy link
Member Author

TomDLT commented May 4, 2015

In k_means.py, do we want to change _check_test_data to use the enhanced version of check_array?
It will change a RuntimeWarning into a DataConversionWarning

note : warn_if_not_float is not used anymore

@TomDLT TomDLT changed the title [WIP] Astype fix [MRG] Astype fix May 4, 2015
@amueller
Copy link
Member

amueller commented May 4, 2015

I'm not sure how great an idea of warn_if_no_float was (I think it was mine). We should probably just convert to float.

@ogrisel
Copy link
Member

ogrisel commented May 4, 2015

I'm not sure how great an idea of warn_if_no_float was (I think it was mine). We should probably just convert to float.

This is what we are doing in this PR when you pass check_array(X, dtype=[np.float64, np.float32, np.float16]) which can also be written check_array(X, dtype=FLOAT_DTYPES) for convenience.

@ogrisel
Copy link
Member

ogrisel commented May 4, 2015

In k_means.py, do we want to change ˋ_check_test_dataˋ to use the enhanced version of check_array? It will change a ˋRuntimeWarningˋ into a ˋDataConversionWarningˋ

+1 for using check_array(X, dtype=FLOAT_DTYPES) in that case as well. I don't thing that changing the type of warning is such a big deal. Increasing the consistency of the code base is good.

X_train = X_train.astype(np.float64)
X_test = X_test.astype(np.float64)
X_train = astype(X_train, np.float64, copy=False)
X_test = astype(X_test, np.float64, copy=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to set copy=False for this use case as the original data type is int16 for sure as explained in the previous commit. There is no way to avoid the malloc / dtype conversion in that case. We can keep:

    X_train = X_train.astype(np.float64)
    X_test = X_test.astype(np.float64)

@TomDLT
Copy link
Member Author

TomDLT commented May 4, 2015

Is there a clean way to revert some old commits without a total rebase?

@TomDLT TomDLT closed this May 4, 2015
@amueller
Copy link
Member

amueller commented May 6, 2015

Run a git grep. If it is not used (and I think it shouldn't be), please remove it.

# not a data type (e.g. a column named dtype in a pandas DataFrame)
dtype_orig = None

if dtype_numeric:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure if we want to keep this behavior. In what cases do we want to pass through integer data? Or bytes? But maybe that is for another PR? I'm not sure....
We could also just set dtype = NUMERIC_DTYPES (which would be a long list) to remove this branch at least.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also thought about that option but I think it's fine to leave it this way for now.

@amueller
Copy link
Member

amueller commented May 6, 2015

This is awesome, thanks!
I am a bit uncertain about the conversion in spectral_embedding, otherwise this LGTM.

I'm not sure how good the "numeric" option was, and maybe we should remove it now, and use explicit dtypes instead? But I think we can merge this before that refactoring.

@TomDLT
Copy link
Member Author

TomDLT commented May 7, 2015

Do you want me to squash all commits in only one?

@ogrisel
Copy link
Member

ogrisel commented May 7, 2015

Maybe we should deprecate warn_if_not_float instead of removing it. I know those utilities are not supposed to be part of the public API but... Any opinion @amueller ?

@ogrisel
Copy link
Member

ogrisel commented May 7, 2015

I'm not sure how good the "numeric" option was, and maybe we should remove it now, and use explicit dtypes instead? But I think we can merge this before that refactoring.

+1 for thinking about this point again once this PR has been merged.

@ogrisel
Copy link
Member

ogrisel commented May 7, 2015

Do you want me to squash all commits in only one?

I think so yes.

@amueller
Copy link
Member

amueller commented May 7, 2015

I think removing warn_if_not_float is fine. And we should really split public / private utils :-/

@amueller
Copy link
Member

amueller commented May 7, 2015

+1 on squashing everything and +1 on merge.

@amueller
Copy link
Member

amueller commented May 7, 2015

See #4685 for tracking the possible removal of "numeric".

@TomDLT
Copy link
Member Author

TomDLT commented May 7, 2015

Squashed

@TomDLT
Copy link
Member Author

TomDLT commented May 7, 2015

Just curiosity, to deprecate warn_if_not_float, how can I do a relative import of deprecated ?
(inside utils/validation.py, import from utils/__init__.py)

@amueller
Copy link
Member

amueller commented May 7, 2015

from . import deprecated maybe?

@TomDLT
Copy link
Member Author

TomDLT commented May 7, 2015

It did not work for me

@amueller
Copy link
Member

amueller commented May 9, 2015

@ogrisel merge?

@ogrisel
Copy link
Member

ogrisel commented May 19, 2015

Fine with the removal of warn_if_not_float then. +1 for merge.

Can you please rebase on top of the current master once again? There is a small conflict in sklearn/preprocessing/data.py (easy to resolve, just keep the min_max_axis import). All tests pass on my box but I would like travis to run once again on this PR before merging.

ENH improve check_array to warn on dtype conversions

ENH make check_array accept several dtypes

ENH change validation with improved check_array

ENH change astype to avoid copy if possible

ENH remove warn_if_not_float
@TomDLT
Copy link
Member Author

TomDLT commented May 19, 2015

Rebased on top of master

amueller added a commit that referenced this pull request May 19, 2015
@amueller amueller merged commit 9385c45 into scikit-learn:master May 19, 2015
@amueller
Copy link
Member

Thanks :)

@amueller
Copy link
Member

This is really great btw 🍻

@TomDLT
Copy link
Member Author

TomDLT commented May 19, 2015

Thanks :)

@agramfort
Copy link
Member

agramfort commented May 19, 2015 via email

@ogrisel
Copy link
Member

ogrisel commented May 21, 2015

I forgot to drink a beer on that PR: 🍻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants