StandardScaler fit overflows on float16 #13007

Closed
baluyotraf opened this issue Jan 18, 2019 · 1 comment · Fixed by #13010

baluyotraf commented Jan 18, 2019

Description

When StandardScaler is fit on a large float16 NumPy array, the mean and std calculations overflow. I could convert the array to a higher-precision dtype, but on a large dataset the memory saved by storing small values as float16 matters. The overflow originates mostly in NumPy. Passing a dtype to the mean/std calculation fixes it, but I'm not sure that is how people here would like it done.
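To illustrate the NumPy-level behaviour described above, here is a minimal sketch using plain NumPy reductions. It is not the scikit-learn code path or the change made in the linked fix; the smaller array size and the variable names are assumptions chosen so the example runs quickly.

```python
import numpy as np

# Smaller than the 10_000_000-row array in the report so it runs quickly;
# the true column sum (100_000 * 10 = 1_000_000) still exceeds the largest
# finite float16 value.
sample = np.full([100_000, 1], 10.0, dtype=np.float16)
float16_max = float(np.finfo(np.float16).max)  # largest finite float16

# Default behaviour: the sum is returned as float16 and overflows to inf.
with np.errstate(over="ignore"):
    naive_sum = sample.sum(axis=0)

# Requesting a float64 accumulator keeps the sum finite.
wide_sum = sample.sum(axis=0, dtype=np.float64)
wide_mean = wide_sum / sample.shape[0]

# The same dtype argument is accepted by the std reduction.
wide_std = sample.std(axis=0, dtype=np.float64)

print(float16_max, naive_sum, wide_mean, wide_std)
```

The input array itself stays float16 here; only the reduction results are held in float64.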

Steps/Code to Reproduce

import numpy as np
from sklearn.preprocessing import StandardScaler

# 10 million float16 values; their sum is far beyond the float16 range
sample = np.full([10_000_000, 1], 10.0, dtype=np.float16)
StandardScaler().fit_transform(sample)

Expected Results

The normalized array

Actual Results

/opt/conda/lib/python3.6/site-packages/numpy/core/_methods.py:36: RuntimeWarning: overflow encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims, initial)
/opt/conda/lib/python3.6/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/opt/conda/lib/python3.6/site-packages/numpy/core/_methods.py:36: RuntimeWarning: overflow encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims, initial)
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/data.py:765: RuntimeWarning: invalid value encountered in true_divide
  X /= self.scale_

array([[nan],
       [nan],
       [nan],
       ...,
       [nan],
       [nan],
       [nan]], dtype=float16)

Versions

System:
    python: 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16)  [GCC 7.3.0]
executable: /opt/conda/bin/python
   machine: Linux-4.9.0-5-amd64-x86_64-with-debian-9.4

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: /opt/conda/lib
cblas_libs: mkl_rt, pthread

Python deps:
       pip: 18.1
setuptools: 39.1.0
   sklearn: 0.20.2
     numpy: 1.16.0
     scipy: 1.1.0
    Cython: 0.29.2
    pandas: 0.23.4

jnothman commented Jan 18, 2019 via email

baluyotraf added a commit to baluyotraf/scikit-learn that referenced this issue Jan 18, 2019
baluyotraf added a commit to baluyotraf/scikit-learn that referenced this issue Jan 18, 2019
baluyotraf added a commit to baluyotraf/scikit-learn that referenced this issue Jan 19, 2019
baluyotraf added a commit to baluyotraf/scikit-learn that referenced this issue Jan 19, 2019
…tils.extmath. Also fixed some line lengths to fit the 80 limit (scikit-learn#13007)
baluyotraf added a commit to baluyotraf/scikit-learn that referenced this issue Jan 20, 2019
baluyotraf added a commit to baluyotraf/scikit-learn that referenced this issue Jan 21, 2019