mutual_info_regression misbehaves when X is integer-typed #26696

@billiejoe-bw

Describe the bug

Mathematically, the mutual information between two continuous variables P and Q is symmetric, i.e. swapping the two variables doesn't change the result. But I've found that when one of the variables is integer-typed, the result from mutual_info_regression changes drastically when you swap them. It seems that when the X argument to mutual_info_regression is integer-typed, something goes badly wrong. See the demo code below.
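For reference, the symmetry mentioned above follows directly from the textbook definition of mutual information for continuous variables. The notation here (joint density f_{P,Q}, marginals f_P and f_Q) is introduced for illustration only and is not taken from the scikit-learn code:

```latex
I(P; Q) = \iint f_{P,Q}(p, q)\,
          \log \frac{f_{P,Q}(p, q)}{f_P(p)\, f_Q(q)} \, dp \, dq
        = I(Q; P)
```

The integrand is unchanged when the roles of P and Q are exchanged, so an estimator is expected to give approximately the same value in both directions, up to estimation noise.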

I'm definitely no expert, but here's my hunch about the cause. mutual_info_regression calls _estimate_mi, which contains this code:

X[:, continuous_mask] = scale(
    X[:, continuous_mask], with_mean=False, copy=False
)

As far as I can tell, when X is integer-typed this assignment casts the results of the scale operation back to integers, which truncates the fractional part. In the problematic case in the demo code below, this turns nearly all the values into zero, destroying any information shared between the two variables. The _estimate_mi code next does

X = X.astype(np.float64, copy=False)

but by then it is too late.
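The suspected effect can be seen with NumPy alone. This is a minimal sketch, not the scikit-learn code: dividing by the column standard deviation stands in for scale(..., with_mean=False), and the variable names (scaled, X_int, late, X_float) are made up for the illustration.

```python
import numpy as np

# Same values as P in the reproduction below, as a single int64 column.
X = np.array(list(range(100)) + [1_000], dtype=np.int64).reshape(-1, 1)

# Stand-in for scale(..., with_mean=False): divide the column by its
# standard deviation (about 98.4 for this data). The result is float64.
scaled = X / X.std(axis=0)

# Order suspected in _estimate_mi: assign into the integer array first,
# convert to float64 afterwards.
X_int = X.copy()
X_int[:, 0] = scaled[:, 0]          # floats are cast to int64 on assignment
late = X_int.astype(np.float64)     # too late, the fractions are already gone

# Alternative order: convert to float64 first, then assign.
X_float = X.astype(np.float64)
X_float[:, 0] = scaled[:, 0]

print(np.unique(late, return_counts=True))
# (array([ 0.,  1., 10.]), array([99,  1,  1]))
print(np.unique(X_float).size)
# 101
```

With the first order, 99 of the 101 samples collapse to 0, which would be consistent with the near-zero mutual information reported below. Converting before the assignment keeps all 101 distinct values.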

Steps/Code to Reproduce

import numpy as np
from sklearn.feature_selection import mutual_info_regression

ns = list(range(100))
P_int   = np.array(ns + [1_000], dtype=np.int64)
P_float = np.array(ns + [1_000], dtype=np.float64)
Q       = np.array(ns + [100],   dtype=np.float64)

# When both arguments are float64, the result looks plausible and swapping the places
# of P and Q doesn't materially change the result.

print(
    mutual_info_regression(
        np.expand_dims(P_float, axis=1),
        Q,
        discrete_features=False,
        random_state=1
    ).item()
)
# 2.0886490359382472

print(
    mutual_info_regression(
        np.expand_dims(Q, axis=1),
        P_float,
        discrete_features=False,
        random_state=1
    ).item()
)
# 2.0919493659712503

# When P is int64, swapping the places of P and Q changes the result a lot.

print(
    mutual_info_regression(
        np.expand_dims(P_int, axis=1),
        Q,
        discrete_features=False,
        random_state=1
    ).item()
)
# 0.0026793627906570583  - implausible result, surely something went wrong?

print(
    mutual_info_regression(
        np.expand_dims(Q, axis=1),
        P_int,
        discrete_features=False,
        random_state=1
    ).item()
)
# 2.0919493659712503  - agrees with what happens when both inputs are float64

Expected Results

The result of mutual_info_regression should be close to equal in all four cases. Since the input values are all whole numbers, it should not matter whether integer or float typed arrays are used. Also, because mutual information is symmetric, swapping the places of the P and Q variables should not produce a very different result.

Actual Results

The code produces

2.0886490359382472
2.0919493659712503
0.0026793627906570583
2.0919493659712503

i.e. the result when the X argument to mutual_info_regression is integer-typed is very different from the others, and is an implausible value.
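Until this is fixed, one possible user-side workaround is to convert X to float64 before calling mutual_info_regression, since the float64 inputs above behave as expected. The helper below is hypothetical (as_float_features is not part of scikit-learn) and only a sketch:

```python
import numpy as np

def as_float_features(X):
    """Hypothetical helper: return X as a 2-D float64 array.

    A new array is returned for integer input, so the caller's data is
    left untouched.
    """
    X = np.asarray(X, dtype=np.float64)
    if X.ndim == 1:
        # Treat a 1-D input as a single feature column.
        X = X.reshape(-1, 1)
    return X

P_int = np.array(list(range(100)) + [1_000], dtype=np.int64)
X = as_float_features(P_int)
print(X.dtype, X.shape)
# float64 (101, 1)
```

Passing as_float_features(P_int) as the X argument is equivalent to the first call in the reproduction, which used np.expand_dims(P_float, axis=1) and returned 2.0886490359382472.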

Versions

System:
    python: 3.11.3 (main, Jun  7 2023, 09:49:56) [Clang 14.0.3 (clang-1403.0.22.14.1)]
executable: /Users/billiejoe/Library/Caches/pypoetry/virtualenvs/mi-bug-report-XmcsNNRX-py3.11/bin/python
   machine: macOS-13.4.1-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.2
          pip: 23.1.2
   setuptools: 67.7.2
        numpy: 1.25.0
        scipy: 1.9.3
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/billiejoe/Library/Caches/pypoetry/virtualenvs/mi-bug-report-XmcsNNRX-py3.11/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23
threading_layer: pthreads
   architecture: Haswell
    num_threads: 6

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/billiejoe/Library/Caches/pypoetry/virtualenvs/mi-bug-report-XmcsNNRX-py3.11/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
    num_threads: 12

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/billiejoe/Library/Caches/pypoetry/virtualenvs/mi-bug-report-XmcsNNRX-py3.11/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.18
threading_layer: pthreads
   architecture: Haswell
    num_threads: 6
