VarianceThreshold doesn't remove feature with zero variance #13691

@rlms

Description

When calling VarianceThreshold().fit_transform() on certain inputs, it fails to remove a column that has only one unique value.

Steps/Code to Reproduce

import numpy as np
from sklearn.feature_selection import VarianceThreshold

works_correctly = np.array([[-0.13725701,  7.        ],
                            [-0.13725701, -0.09853293],
                            [-0.13725701, -0.09853293],
                            [-0.13725701, -0.09853293],
                            [-0.13725701, -0.09853293],
                            [-0.13725701, -0.09853293],
                            [-0.13725701, -0.09853293],
                            [-0.13725701, -0.09853293],
                            [-0.13725701, -0.09853293]])

broken = np.array([[-0.13725701,  7.        ],
                   [-0.13725701, -0.09853293],
                   [-0.13725701, -0.09853293],
                   [-0.13725701, -0.09853293],
                   [-0.13725701, -0.09853293],
                   [-0.13725701, -0.09853293],
                   [-0.13725701, -0.09853293],
                   [-0.13725701, -0.09853293],
                   [-0.13725701, -0.09853293],
                   [-0.13725701, -0.09853293]])

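# 9 rows: the constant first column is removed as expected.
selector = VarianceThreshold()
print(selector.fit_transform(works_correctly))

# 10 rows: the constant first column is unexpectedly kept.
selector = VarianceThreshold()
print(selector.fit_transform(broken))
# The first column holds a single unique value.
print(set(broken[:, 0]))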

Expected Results

In both cases, VarianceThreshold should remove the constant first column and return only the second column, e.g. for works_correctly:

[[ 7.        ]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]]

Actual Results

[[ 7.        ]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]
 [-0.09853293]]
[[-0.13725701  7.        ]
 [-0.13725701 -0.09853293]
 [-0.13725701 -0.09853293]
 [-0.13725701 -0.09853293]
 [-0.13725701 -0.09853293]
 [-0.13725701 -0.09853293]
 [-0.13725701 -0.09853293]
 [-0.13725701 -0.09853293]
 [-0.13725701 -0.09853293]
 [-0.13725701 -0.09853293]]
{-0.13725701}

This issue arose when I was using VarianceThreshold on a real dataset (of which this is a subset). It appears to work correctly in other situations (for instance, I can't reproduce this behaviour if I replace the first column with ones).
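One possible explanation, not confirmed in this report, is floating-point rounding: if the variance of the constant column comes out as a tiny positive number rather than exactly 0.0, it passes the default threshold of 0. The sketch below prints the per-column variance next to the peak-to-peak range (max minus min), which is exactly zero for a column of identical values. `drop_constant_columns` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np

# Same data as `broken` in the report: 10 rows, first column constant.
broken = np.array([[-0.13725701, 7.0]] + [[-0.13725701, -0.09853293]] * 9)

# Variance computed in floating point; for a constant column this may
# come out as a tiny positive number instead of exactly 0.0.
print(np.var(broken, axis=0))

# Peak-to-peak (max - min) is exactly 0.0 when every entry in a column
# is identical, so it is a rounding-free check for constancy.
print(np.ptp(broken, axis=0))


def drop_constant_columns(X):
    """Hypothetical helper (not part of scikit-learn): keep only the
    columns whose maximum differs from their minimum."""
    X = np.asarray(X)
    keep = np.ptp(X, axis=0) > 0
    return X[:, keep]


reduced = drop_constant_columns(broken)
print(reduced.shape)  # (10, 1)
```

This only covers the zero-threshold case; with a non-zero threshold the variance itself still has to be compared.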

Versions

System

python: 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:30:03)  [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]

machine: Darwin-18.2.0-x86_64-i386-64bit
executable: /anaconda3/envs/tensorflow/bin/python3

BLAS

macros: NO_ATLAS_INFO=3, HAVE_CBLAS=None

lib_dirs:
cblas_libs: cblas

Python deps

setuptools: 40.2.0
numpy: 1.15.4
sklearn: 0.20.0
Cython: None
scipy: 1.1.0
pandas: 0.24.0
pip: 19.0.1
