[MRG] Missing values imputation #2148

Closed · wants to merge 16 commits into from

Conversation

NicolasTr (Contributor)

Here's some code for the data imputation.

It's still a work in progress so Travis will probably be mad at me.

I'll work on the documentation tomorrow. Meanwhile, you can look at the tests if you want to see how it works.

_What's new_
I added four classes:

  • MeanImputer: replace the missing values with the mean of the row/column
  • MedianImputer: replace the missing values with the median of the row/column
  • MostFrequentImputer: replace the missing values with the most frequent value in the row/column
  • RandomImputer: replace the missing values with one of the values of the row/column

_Behaviour_

  • Any value can represent a missing value (not only np.nan for dense matrices and 0 for sparse matrices)
  • Different missing values can be imputed separately in a pipeline
  • Support dense and sparse matrices
  • To determine: what should the imputer do if it has nothing to work with? For example, what should be done if there's no median value because the whole row/column is missing?
    • 0 for the mean and np.nan for the median/most frequent/random?
    • always np.nan?
    • Let the user choose with a parameter?

_What's missing_

  • Documentation
  • Delete the rows/columns where np.isnan(imputation_data_)
  • Add tests with copy
  • Add tests with grid search
  • Test pickling

@abstractmethod
def __init__(self,
             missing_values_fit=[np.nan],
             missing_values_transform=[np.nan],
Contributor

Could it be possible to make missing_values_transform=None by default and then assume missing_values_transform is the same as missing_values_fit? I think it is the most common use, isn't it?

Member

Is there a use case for encoding missing values differently at fit and transform time? Also, is there a use case for encoding missing values with more than one value?

I'd rather just use missing_values=np.nan. This should simplify both the interface and the code.

Contributor Author

Could it be possible to make missing_values_transform=None by default and then assume missing_values_transform is the same as missing_values_fit? I think it is the most common use, isn't it?

Ok

Is there a use case for encoding missing values differently at fit and transform time? Also, is there a use case for encoding missing values with more than one value?

I see two cases:

  • with a Pipeline
  • for matrix completion (I'll probably use 0 for the missing values and np.nan for the missing values that should be imputed)

Member

with a Pipeline

Can you elaborate?

I'll probably use 0 for the missing value and np.nan for the missing values that should be imputed

So there are two kinds of missing values?

Contributor Author

For the Pipeline: https://github.com/scikit-learn/scikit-learn/pull/2148/files#L1R985

So there are two kinds of missing values?

Yes. For example: if you have a matrix where (i, j) is the rating that user i gave to movie j, you may want to impute the ratings that the active users would give to the movies that are currently in theaters. In this case:

  • np.nan (or something else) for these ratings
  • 0 for the rest of the missing ratings

Member

I have a use case for this.

I work on a similarity matrix, representing interaction frequencies between regions of DNA. I have two types of missing data. The first one is that we do not have enough cells to have a measure. It means that the two regions are far away from each other. The second one is due to repetitions in the DNA sequence.
In both cases, I can infer the missing data, but the inference is very different.

Member

On Wed, Jul 10, 2013 at 01:00:30PM -0700, Gilles Louppe wrote:

Could it be possible to make missing_values_transform=None by default and then
assume missing_values_transform is the same as missing_values_fit? I think it
is the most common use, isn't it?

I don't even think that we should make a difference.

Member

I work on a similarity matrix, representing interaction frequencies between
regions of DNA. I have two types of missing data. The first one is that we do
not have enough cells to have a measure. It means that the two regions are far
away from each other. The second one is due to repetitions in the DNA sequence.
In both cases, I can infer the missing data, but the inference is very
different.

It sounds to me like you need two imputations, not a different fit and
transform. Am I wrong?

G

Member

I don't even think that we should make a difference.

Let's keep the interface simple for now (with a single parameter missing_values=np.nan) and let's consider this use case later, possibly using a mask as I suggested before.

Member

@GaelVaroquaux indeed, the only requirement in my case is to be able to deal with two missing data types.

@glouppe (Contributor) commented Jul 10, 2013

It looks like a good start! However, I have some concerns regarding usability. Because of the various Imputer objects you created, it is actually not possible to grid-search for the imputation strategy (e.g., you pipeline an imputer with a classifier, and you want to grid-search the imputation strategy (mean, median, most frequent, etc.) in addition to the classifier hyper-parameters). Don't you think it would be best to have a single Imputer transformer, with something like an algorithm or strategy hyper-parameter? I think this is legitimate for such simple imputation strategies. CC: @mblondel
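
To make the suggestion concrete, here is a minimal sketch (not code from this PR) of how a single Imputer with a strategy parameter could be grid-searched together with a classifier; the class and parameter names assume the interface discussed in this thread, and the import paths follow the scikit-learn layout of the time.

    from sklearn.preprocessing import Imputer          # single transformer with a strategy parameter
    from sklearn.pipeline import Pipeline
    from sklearn.grid_search import GridSearchCV       # sklearn.model_selection in later versions
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline([("imputer", Imputer(missing_values="NaN")),
                     ("clf", LogisticRegression())])

    param_grid = {
        "imputer__strategy": ["mean", "median", "most_frequent"],  # imputation strategy is searchable
        "clf__C": [0.1, 1.0, 10.0],                                # together with the classifier's own grid
    }
    search = GridSearchCV(pipe, param_grid, cv=3)
    # search.fit(X, y) would then select the best strategy/C combination by cross-validation.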

super(RandomImputer, self).__init__(
    missing_values_fit=missing_values_fit,
    missing_values_transform=missing_values_transform)
self.random_state = check_random_state(random_state)
Contributor

check_random_state should be called in fit. The convention is that the constructor shouldn't change the values of the given parameters; its only role is to store them.
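
A minimal sketch of the convention being described, assuming the usual scikit-learn estimator layout: __init__ only stores the parameters as given, and check_random_state is called in fit.

    from sklearn.utils import check_random_state

    class RandomImputerSketch(object):
        """Illustration only, not the PR's class."""

        def __init__(self, random_state=None):
            self.random_state = random_state                       # stored unchanged

        def fit(self, X, y=None):
            random_state = check_random_state(self.random_state)   # validated/converted here
            # ... use random_state to draw the imputation values ...
            return self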

@mblondel (Member)

Any value can represent a missing value (not only np.nan for dense matrices and 0 for sparse matrices, any value)

Do you mean that sparse matrices can encode missing values with other values than 0? This doesn't make sense to me, as the point of using 0 is to take advantage of the sparse data structure. In recommendation systems, there are more missing values than observed values so you cannot explicitly store the missing values.

@mblondel (Member)

I'll work on the documentation tomorrow

I think it's too early to work on the documentation. Let's get the code and unit tests right first.

@mblondel (Member)

A reminder of different missing values scenarios I posted on the ML:

  1. fewer missing values than observed values
  2. many more missing values than observed values (recommendation-system setting)
  3. fewer missing values than observed values AND many zero features
  4. many more missing values than observed values AND many zero features

How each can be handled:

  1. by masked arrays or arrays with missing values encoded by NaN
  2. by scipy sparse matrices containing only the observed values
  3. by scipy sparse matrices where missing values are encoded by NaN
  4. this one seems hard because the data is sparse both w.r.t. missing values and w.r.t. zero features

What scenarios do you target with this PR? 1) and 2) are the most important IMO. I wonder if 3) is worth the pain. 4) is very difficult to handle.

@NicolasTr (Contributor Author)

In what I did, you can choose in the constructor how the missing values are encoded.

  1. Simply use np.nan
  2. Use 0 for the missing values and a sparse matrix
  3. Use np.nan for the missing values, 0 for the zeros and a sparse matrix
  4. Same as 3. However, in this scenario your sparse matrix won't be very sparse (i.e. you could do it but I think it will be slow)

@mblondel (Member)

@NicolasTr Thanks for the clarification. Do you handle scenarios 2 and 3 using the same code? I haven't looked into your code yet but my intuition is that these two scenarios require quite different implementations in order to be efficient.

@mblondel (Member)

@glouppe What's your opinion on allowing to encode missing values differently in fit and transform? @GaelVaroquaux and I think it's unnecessary.

placeholder_transform=[np.nan],
strategy='mean',
default=np.nan,
axis=0):
Member

What's the point of the axis parameter? Can't the user just use fit(X.T) and transform(X.T)?

Member

What's the point of the axis parameter? Can't the user just use fit(X.T) and transform(X.T)?

Unless you are in a pipeline. The question is: does this pattern occur
very often or not?

Member

Does imputing values across samples / rows (instead of features / columns) make sense?

Contributor Author

I added it because it's really easy to handle:

  • For dense arrays, I pass the parameter to the right function
  • For sparse matrices, I change the representation of the matrix (CSC -> CSR or CSR -> CSC)

Hypothetical example: if you have students and grades, it could be relevant to impute the missing grades by student rather than by course.

If you think it's useless, I can remove it.

@glouppe (Contributor) commented Jul 16, 2013

@mblondel Indeed, I also don't think it is something necessary. +1 on removing it.

class Imputer(BaseEstimator, TransformerMixin):

    def __init__(self,
                 placeholder_fit=[np.nan],
Contributor

Could you set instead a scalar value by default (i.e., not a list)? I think it would make the interface clearer. List values will be documented later in the docstrings.

Member

Do we need to handle lists at all?

Member

In Python, a list is not safe as a default argument. Use a tuple instead if you need one.

http://pythonconquerstheuniverse.wordpress.com/2012/02/15/mutable-default-arguments/
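
A quick illustration of the pitfall behind that advice: the default list is created once, when the function is defined, so it is shared across calls.

    def collect(value, seen=[]):           # one list shared by every call
        seen.append(value)
        return seen

    collect(1)   # [1]
    collect(2)   # [1, 2]  <- the first call's value is still there

    def collect_safe(value, seen=None):    # use None (or a tuple) as the default instead
        seen = [] if seen is None else seen
        seen.append(value)
        return seen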

Member

Do we need to handle lists at all?

I was indeed thinking exactly the same.

Contributor

What is your opinion @NicolasTr ?

Contributor Author

Fixed in 25f4abb

@mblondel (Member)

Another strategy for imputing missing values is to use the "prior" distribution of observed values (regardless of their frequency). Say a column X[:, j] of X can take the values 1, 2, 3, then imputing the missing values with np.mean([1, 2, 3]) or np.median([1, 2, 3]) can make sense in some applications.
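
A one-function sketch of this "prior" strategy (not part of the PR): the fill value is computed over the distinct observed values, ignoring their frequencies.

    import numpy as np

    def prior_fill_value(column, statistic=np.mean):
        """Fill value from the set of distinct observed (non-NaN) values."""
        observed = column[~np.isnan(column)]
        return statistic(np.unique(observed))

    prior_fill_value(np.array([1., 1., 1., 2., 3., np.nan]))   # mean([1, 2, 3]) == 2.0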

@mblondel (Member)

I wonder if a Cython implementation wouldn't be both more efficient and more readable. The current code is hard to follow.

@mblondel (Member)

I wonder if we wouldn't be better off temporarily dropping support for the median strategy. This PR already adds 600+ lines of tests, so it's hard for us to review all the tests, make sure they are good, and ensure the correctness of the code. I'm particularly concerned about the median because it is hard to implement in the sparse case. In fact, we could even focus on mean-based imputation for this PR.

In the future, I'd suggest creating small, well-scoped, incremental PRs which we can review and merge easily. If merging into master is not possible, we can create a branch for merging reviewed code.

@glouppe (Contributor) commented Jul 16, 2013

I wonder if we wouldn't be better off temporarily dropping support for the median strategy. This PR already adds 600+ lines of tests, so it's hard for us to review all the tests, make sure they are good, and ensure the correctness of the code. I'm particularly concerned about the median because it is hard to implement in the sparse case. In fact, we could even focus on mean-based imputation for this PR.
In the future, I'd suggest creating small, well-scoped, incremental PRs which we can review and merge easily. If merging into master is not possible, we can create a branch for merging reviewed code.

Good idea, +1. What is your opinion about that @NicolasTr ?

@GaelVaroquaux (Member)

I wonder if we wouldn't be better off temporarily dropping support for the median strategy.

But that's the strategy that is recommended by all my statistician
friends... It's the one I'd really like to have in.

@mblondel (Member)

But that's the strategy that is recommended by all my statistician friends... It's the one I'd really like to have in.

Temporarily. Nicolas' progress is very hectic and I think it would be better to have incremental well-focused PRs.

For the median strategy, I'd recommend this simple approach:

  1. convert the numpy array with missing values encoded by np.nan (or any other value) to a masked array (skip this step if the input is already a masked array)
  2. use the median() method provided by masked arrays
  3. impute the values and return a standard numpy array, since they are what scikit-learn estimators expect.
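
A minimal sketch of the three steps above for the dense case, assuming any scalar placeholder for the missing values (the function name is illustrative, not the PR's):

    import numpy as np

    def impute_median_dense(X, missing_values=np.nan):
        """Column-wise median imputation via numpy masked arrays (dense case only)."""
        X = np.asarray(X, dtype=float)
        mask = np.isnan(X) if np.isnan(missing_values) else (X == missing_values)
        masked = np.ma.masked_array(X, mask=mask)                # 1. build the masked array
        medians = np.ma.median(masked, axis=0).filled(np.nan)    # 2. per-column medians
        X_out = X.copy()
        rows, cols = np.where(mask)
        X_out[rows, cols] = medians[cols]                        # 3. fill and return a plain ndarray
        return X_out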

@NicolasTr (Contributor Author)

I added some explanations here

If the array is dense, I use the MaskedArray class. It works great for mean and median. I had some problems with the imputation of the most frequent value because scipy.stats.mode() and scipy.stats.mstats.mode() don't work properly (scipy.stats.mode doesn't return the right element and scipy.stats.mstats.mode doesn't work if the first value is masked; I reported the bug here).

If the array is sparse:

  • For the mean, I simply divide the sum by the number of elements in the column (excluding the zeros if they are in the missing values).
  • For the median, I do the following steps (a sketch of these steps appears right after this list):
    1. I separate the positives (array), the negatives (array) and the zeros (count)
    2. I compute the position of the median (or the positions of the elements surrounding it)
    3. If it falls in the zeros, I return zero. Otherwise, I compute the real position of the median in the positives or in the negatives and return it (the arrays need to be sorted).
  • For the most frequent value:
    1. I compute the most frequent value ignoring the zeros
    2. I compare its number of occurrences with the number of zeros to return the right value
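
As referenced above, here is a hedged sketch of the sparse median steps for a single column, assuming the column is given as its explicit nonzero values plus a count of implicit zeros (the function name and signature are illustrative, not the PR's actual helpers).

    import numpy as np

    def sparse_column_median(nonzero_values, n_zeros):
        """Median of a sparsely stored column: sorted negatives, then n_zeros zeros, then sorted positives."""
        negatives = np.sort(nonzero_values[nonzero_values < 0])
        positives = np.sort(nonzero_values[nonzero_values > 0])
        n = negatives.size + n_zeros + positives.size
        if n == 0:
            return np.nan

        def kth_smallest(k):
            # index into the virtual sorted column [negatives | zeros | positives]
            if k < negatives.size:
                return negatives[k]
            if k < negatives.size + n_zeros:
                return 0.0
            return positives[k - negatives.size - n_zeros]

        if n % 2:
            return kth_smallest(n // 2)
        return 0.5 * (kth_smallest(n // 2 - 1) + kth_smallest(n // 2))

    sparse_column_median(np.array([4.0, -2.0, 7.0]), n_zeros=2)   # column [-2, 0, 0, 4, 7] -> 0.0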

Missing values are handled by the mask in the dense case. In the sparse case, I mask the data vector of the matrix and I have special conditions for the zeros.

I refactored the tests. However, I'm not sure that they are really more readable. For the mean and the median, I create a test matrix containing the kinds of columns that could be problematic (all missing, all zeros, combinations, 0 among the missing values or not, ...).

I also implemented the sparse mean and the sparse median in Cython. It's 2-3 times faster. I read somewhere that it's important to avoid calling Python functions in Cython; I'm not sure if I'm doing that or not. Maybe it's possible to make it much faster.

There's some code duplication (in Python and Cython). I think I could move the sparse mean and the sparse median into the tests to validate the Python code. Currently, I'm using a dictionary of functions to check whether I did a Cython implementation of the "strategy". It's ugly, but it allows me to easily switch between the implementations for the speed tests.

At this moment, all the tests are passing. I plan to add some more tests with the copy parameter.

All the cases described by Mathieu are handled. By allowing the user to choose the representation of the missing values, I'm trading some speed for a lot of flexibility. Anyway, the real cost of this sacrifice depends on the "number of representations" of the missing values (placeholders), so it should not be huge.

There's also one more thing that should be decided (I'll use the mean as an example): if the mean cannot be computed (because all the values of the column are missing), what should be done? I'm currently using np.nan. Wouldn't it be better to use a "fill_value"?

@mblondel (Member)

There's also one more thing that should be decided (I'll use the mean as an example): if the mean cannot be computed (because all the values of the column are missing), what should be done? I'm currently using np.nan. Wouldn't it be better to use a "fill_value"?

I would remove the entire column in transform (and raise a warning).

@mblondel (Member)

The rationale is that we want the transformed data to be used in any scikit-learn estimator. So, we don't want to leave missing values in the transformed data.
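
A short sketch of that behaviour, assuming the per-column statistics are NaN exactly when a column had no observed values at fit time (the helper name is hypothetical):

    import warnings
    import numpy as np

    def drop_unimputable_columns(X, statistics):
        """Keep only the columns whose statistic could be computed; warn about the rest."""
        valid_mask = ~np.isnan(statistics)
        if not valid_mask.all():
            warnings.warn("Dropping columns with no observed values: %s"
                          % np.flatnonzero(~valid_mask).tolist())
        return X[:, valid_mask]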

nb_values = shape[0] - nb_zeros - nb_placeholders

z = zeros[:nb_zeros]
p = np.random.choice(placeholders, nb_placeholders)
Contributor

Initialize the random generator globally with a fixed seed instead (as a module-level variable) and use it in all your tests. Otherwise, the tests are not deterministic and may not be reproducible.
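
A minimal sketch of that test convention, reusing the variable names from the snippet above with illustrative values: a single module-level RandomState with a fixed seed keeps the generated test data reproducible.

    import numpy as np

    rng = np.random.RandomState(0)        # shared, fixed-seed generator for the whole test module

    nb_zeros, nb_placeholders = 3, 2      # illustrative values
    zeros = np.zeros(10)
    placeholders = np.array([np.nan, -1.0])

    z = zeros[:nb_zeros]
    p = rng.choice(placeholders, nb_placeholders)   # reproducible across runs, unlike np.random.choice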

Contributor Author

Fixed in 1f3a285

X_true = X_true[:, cols_to_keep]

_test_statistics(X, X_true, strategy,
true_statistics, test_missing_values)
Member

People who are very picky about PEP8 would note that this is not properly indented :)

@GaelVaroquaux (Member)

Finished my review!

@@ -307,7 +307,7 @@ def test_estimators_nan_inf():
         if name in dont_test:
             continue
         if name in ('PLSCanonical', 'PLSRegression', 'CCA',
-                    'PLSSVD'):
+                    'PLSSVD', 'Imputer'):
Member

Maybe add a comment that the Imputer is not supposed to raise a warning here.

[np.nan, 6, 0, 5, 13],
])

mean = np.array([
Member

I find this variable name quite confusing. I would rename it X_imputed_mean, and the same for median below.

[6, 5, 13],
])

_test_statistics(X, mean, "mean", [np.nan, 3, np.nan, np.nan, 7], 0)
Member

Maybe I would define variables mean and median and assign these lists to them, so it is clear what they represent.

valid_statistics = self.statistics_[valid_mask]
valid_statistics_indexes = np.where(valid_mask)[0]
missing = ",".join(
map(str, np.arange(X.shape[not self.axis])[invalid_mask]))
Member

Do you need this?

@amueller (Member)

👍 for merge!

import scipy.sparse as sp
import scipy.stats as st
Member

We usually use

from scipy import stats
from scipy import sparse

and avoid shortcuts.

Personally, when I see st or sp I don't know which import it corresponds to.

@agramfort (Member)

Besides that, +1 for merge too.

@amueller (Member)

merge by rebase :)

Thanks a lot!

@amueller closed this Jul 25, 2013
@glouppe (Contributor) commented Jul 25, 2013

Yay! Thanks a lot Nicolas!

@arjoly (Member) commented Jul 25, 2013

Congratulations !!!

@agramfort (Member)

nice !

@mblondel (Member)

Thanks for your patience and hard work Nicolas. Looking forward to the matrix factorization part :)

@mblondel (Member)

Sorry, I didn't realize this before, but transform should NOT set statistics_ when axis=1. transform is not for learning and should be side-effect free. Fortunately, this seems like an easy fix: just store the statistics in a local variable when axis=1.
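
A small self-contained sketch of that fix (mean strategy, dense input only, not the PR's actual class): when axis=1 the statistics are computed from the X passed to transform and kept in a local variable, so transform has no side effects.

    import numpy as np

    class ImputerSketch(object):
        def __init__(self, axis=0):
            self.axis = axis

        def fit(self, X, y=None):
            if self.axis == 0:
                self.statistics_ = np.nanmean(X, axis=0)    # column means learned at fit time
            return self

        def transform(self, X):
            X = np.array(X, dtype=float, copy=True)
            mask = np.isnan(X)
            if self.axis == 0:
                statistics = self.statistics_
                X[mask] = statistics[np.where(mask)[1]]
            else:
                statistics = np.nanmean(X, axis=1)          # local variable; self is left untouched
                X[mask] = statistics[np.where(mask)[0]]
            return X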

@NicolasTr (Contributor Author)

Ok

@NicolasTr (Contributor Author)

To avoid any conflicts, I'll wait until pull request #2252 is merged.

@amueller (Member)

you can just do another PR on top of that.

@glouppe (Contributor) commented Jul 26, 2013

Fixed and merged in #2262
