Description
Describe the bug
Mathematically, the mutual information between two continuous variables P and Q is symmetric, i.e. swapping the places of the two variables doesn't change the result. But I've found that when one of the variables is integer-typed, the result from mutual_info_regression changes substantially when you swap them over. It seems that when the X argument to mutual_info_regression is integer-typed, something goes badly wrong. See demo code below.
I'm definitely no expert, but here's my hunch about the cause. mutual_info_regression calls _estimate_mi, which contains this code:
X[:, continuous_mask] = scale(
    X[:, continuous_mask], with_mean=False, copy=False
)
As far as I can tell, when X is integer-typed this assignment casts the results of the scale operation back to integers, discarding the fractional part. In the problematic case in the demo code below, this causes nearly all the values to become zero, destroying any information between the two variables. The _estimate_mi code next does
X = X.astype(np.float64, copy=False)
but by then it is too late.
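To illustrate the hunch without touching scikit-learn internals, here is a minimal NumPy-only sketch. It assumes that dividing by the standard deviation is a fair stand-in for scale(..., with_mean=False); the names X_int and X_float are just for illustration and are not taken from _estimate_mi.

```python
import numpy as np

# Same values as P_int in the reproduction code, as a single int64 column.
X = np.array(list(range(100)) + [1_000], dtype=np.int64).reshape(-1, 1)

# Stand-in (assumption) for scale(..., with_mean=False): divide by the std.
scaled = X / X.std()  # true division, so this is a float64 array

# Assigning the float result into an int64 array silently casts it back
# to integers, dropping the fractional part.
X_int = X.copy()
X_int[:, 0] = scaled[:, 0]

# The same assignment into a float64 array keeps the scaled values intact.
X_float = X.astype(np.float64)
X_float[:, 0] = scaled[:, 0]

n_distinct_int = len(np.unique(X_int))
n_distinct_float = len(np.unique(X_float))
print("distinct values after assignment (int64):  ", n_distinct_int)
print("distinct values after assignment (float64):", n_distinct_float)
```

If the hunch is right, the int64 column collapses to only a handful of distinct values while the float64 column keeps all of them, which would explain the near-zero mutual information.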
Steps/Code to Reproduce
import numpy as np
from sklearn.feature_selection import mutual_info_regression

ns = list(range(100))
P_int = np.array(ns + [1_000], dtype=np.int64)
P_float = np.array(ns + [1_000], dtype=np.float64)
Q = np.array(ns + [100], dtype=np.float64)

# When both arguments are float64, the result looks plausible and swapping the places
# of P and Q doesn't materially change the result.
print(
    mutual_info_regression(
        np.expand_dims(P_float, axis=1),
        Q,
        discrete_features=False,
        random_state=1
    ).item()
)
# 2.0886490359382472
print(
    mutual_info_regression(
        np.expand_dims(Q, axis=1),
        P_float,
        discrete_features=False,
        random_state=1
    ).item()
)
# 2.0919493659712503

# When P is int64, swapping the places of P and Q changes the result a lot
print(
    mutual_info_regression(
        np.expand_dims(P_int, axis=1),
        Q,
        discrete_features=False,
        random_state=1
    ).item()
)
# 0.0026793627906570583 - implausible result, surely something went wrong?
print(
    mutual_info_regression(
        np.expand_dims(Q, axis=1),
        P_int,
        discrete_features=False,
        random_state=1
    ).item()
)
# 2.0919493659712503 - agrees with what happens when both inputs are float64
Expected Results
The result of mutual_info_regression should be close to equal in all four cases. Since the input values are all whole numbers, it should not matter whether integer or float typed arrays are used. Also, because mutual information is symmetric, swapping the places of the P and Q variables should not produce a very different result.
Actual Results
The code produces
2.0886490359382472
2.0919493659712503
0.0026793627906570583
2.0919493659712503
i.e. the result when the X argument to mutual_info_regression is integer-typed is very different from the others, and is an implausible value.
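A possible workaround, going only by the float64 results above rather than by any confirmed fix, is to cast X to float64 before the call. The sketch below reuses the arrays from the reproduction code; X_as_float is a name made up for this example.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

ns = list(range(100))
P_int = np.array(ns + [1_000], dtype=np.int64)
Q = np.array(ns + [100], dtype=np.float64)

# Cast the integer feature column to float64 up front, so that any in-place
# scaling inside the estimator operates on a float array.
X_as_float = np.expand_dims(P_int, axis=1).astype(np.float64)

mi = mutual_info_regression(
    X_as_float,
    Q,
    discrete_features=False,
    random_state=1,
).item()
print(mi)
```

With the cast in place, the input is the same as the P_float case in the reproduction code, so the result should land near the other plausible values rather than near zero.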
Versions
System:
    python: 3.11.3 (main, Jun 7 2023, 09:49:56) [Clang 14.0.3 (clang-1403.0.22.14.1)]
    executable: /Users/billiejoe/Library/Caches/pypoetry/virtualenvs/mi-bug-report-XmcsNNRX-py3.11/bin/python
    machine: macOS-13.4.1-x86_64-i386-64bit

Python dependencies:
    sklearn: 1.2.2
    pip: 23.1.2
    setuptools: 67.7.2
    numpy: 1.25.0
    scipy: 1.9.3
    Cython: None
    pandas: None
    matplotlib: None
    joblib: 1.2.0
    threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
    user_api: blas
    internal_api: openblas
    prefix: libopenblas
    filepath: /Users/billiejoe/Library/Caches/pypoetry/virtualenvs/mi-bug-report-XmcsNNRX-py3.11/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
    version: 0.3.23
    threading_layer: pthreads
    architecture: Haswell
    num_threads: 6

    user_api: openmp
    internal_api: openmp
    prefix: libomp
    filepath: /Users/billiejoe/Library/Caches/pypoetry/virtualenvs/mi-bug-report-XmcsNNRX-py3.11/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
    version: None
    num_threads: 12

    user_api: blas
    internal_api: openblas
    prefix: libopenblas
    filepath: /Users/billiejoe/Library/Caches/pypoetry/virtualenvs/mi-bug-report-XmcsNNRX-py3.11/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
    version: 0.3.18
    threading_layer: pthreads
    architecture: Haswell
    num_threads: 6