[MRG] Missing values imputation #2148

Closed · wants to merge 16 commits into from

Conversation

NicolasTr (Contributor)

Here's some code for the data imputation.

It's still a work in progress so Travis will probably be mad at me.

I'll work on the documentation tomorrow. Meanwhile, you can look at the tests if you want to see how it works.

_What's new_
I added four classes:

  • MeanImputer: replace the missing values with the mean of the row/column
  • MedianImputer: replace the missing values with the median of the row/column
  • MostFrequentImputer: replace the missing values with the most frequent value in the row/column
  • RandomImputer: replace the missing values with one of the values of the row/column

_Behaviour_

  • Any value can represent a missing value (not only np.nan for dense matrices and 0 for sparse matrices)
  • Different missing values can be imputed separately in a pipeline
  • Support dense and sparse matrices
  • To determine: what should the imputer do if it has nothing to work with? For example, what should be done if there's no median value because the whole row/column is missing?
    • 0 for the mean and np.nan for the median/most frequent/random?
    • always np.nan?
    • Let the user choose with a parameter?

_What's missing_

  • Documentation
  • Delete the rows/columns where np.isnan(imputation_data_)
  • Add tests with copy
  • Add tests with grid search
  • Test pickling

@abstractmethod
def __init__(self,
             missing_values_fit=[np.nan],
             missing_values_transform=[np.nan],
Contributor

Could it be possible to make missing_values_transform=None by default and then assume missing_values_transform is the same as missing_values_fit? I think it is the most common use, isn't it?

Member

Is there a use case for encoding missing values differently at fit and transform time? Also, is there a use case for encoding missing values with more than one value?

I'd rather just use missing_values=np.nan. This should simplify both the interface and the code.

Contributor Author

Could it be possible to make missing_values_transform=None by default and then assume missing_values_transform is the same as missing_values_fit? I think it is the most common use, isn't it?

Ok

Is there a use case for encoding missing values differently at fit and transform time? Also, is there a use case for encoding missing values with more than one value?

I see two cases:

  • with a Pipeline
  • for matrix completion (I'll probably use 0 for the missing values and np.nan for the missing values that should be imputed)

Member

with a Pipeline

Can you elaborate?

I'll probably use 0 for the missing value and np.nan for the missing values that should be imputed

So there are two kinds of missing values?

Contributor Author

For the Pipeline: https://github.com/scikit-learn/scikit-learn/pull/2148/files#L1R985

So there are two kinds of missing values?

Yes. For example: if you have a matrix where (i, j) is the rating that user i gave to movie j, you may want to impute the ratings that the active users would give to the movies that are currently in theaters. In this case:

  • np.nan (or something else) for these ratings
  • 0 for the rest of the missing ratings

Member

I have a use case for this.

I work on a similarity matrix, representing interaction frequencies between regions of DNA. I have two types of missing data. The first one is that we do not have enough cells to have a measure. It means that the two regions are far away from each other. The second one is due to repetitions in the DNA sequence.
In both cases, I can infer the missing data, but the inference is very different.

Member

On Wed, Jul 10, 2013 at 01:00:30PM -0700, Gilles Louppe wrote:

Could it be possible to make missing_values_transform=None by default and then
assume missing_values_transform is the same as missing_values_fit? I think it
is the most common use, isn't it?

I don't even think that we should make a difference.

Member

I work on a similarity matrix, representing interaction frequencies between
regions of DNA. I have two types of missing data. The first one is that we do
not have enough cells to have a measure. It means that the two regions are far
away from each other. The second one is due to repetitions in the DNA sequence.
In both cases, I can infer the missing data, but the inference is very
different.

It sounds to me like you need two imputations, not a different fit and
transform. Am I wrong?

G

Member

I don't even think that we should make a difference.

Let's keep the interface simple for now (with a single parameter missing_values=np.nan) and let's consider this use case later, possibly using a mask as I suggested before.

Member

@GaelVaroquaux indeed, the only requirement in my case is to be able to deal with two missing data types.

@glouppe (Contributor) commented Jul 10, 2013

It looks like a good start! However, I have some concerns regarding usability. Because of the various Imputer objects you created, it is actually not possible to grid-search for the imputation strategy (e.g., you pipeline an imputer with a classifier, and you want to grid-search the imputation strategy (mean, median, most frequent, etc.) in addition to the classifier hyper-parameters). Don't you think it would be best to have a single Imputer transformer, with something like an algorithm or strategy hyper-parameter? I think this is legitimate for such simple imputation strategies. CC: @mblondel
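
To make the suggestion concrete, here is a minimal sketch (not code from this PR) of how a single Imputer with a strategy parameter could be grid-searched together with a classifier; the class and parameter names assume the interface discussed in this thread, and the import paths follow the scikit-learn layout of the time.

    from sklearn.preprocessing import Imputer          # single transformer with a strategy parameter
    from sklearn.pipeline import Pipeline
    from sklearn.grid_search import GridSearchCV       # sklearn.model_selection in later versions
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline([("imputer", Imputer(missing_values="NaN")),
                     ("clf", LogisticRegression())])

    param_grid = {
        "imputer__strategy": ["mean", "median", "most_frequent"],  # imputation strategy is searchable
        "clf__C": [0.1, 1.0, 10.0],                                # together with the classifier's own grid
    }
    search = GridSearchCV(pipe, param_grid, cv=3)
    # search.fit(X, y) would then select the best strategy/C combination by cross-validation.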

super(RandomImputer, self).__init__(
    missing_values_fit=missing_values_fit,
    missing_values_transform=missing_values_transform)
self.random_state = check_random_state(random_state)
Contributor

check_random_state should be called in fit. The convention is that the constructor shouldn't change the values of the given parameters; its only role is to store them.
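
A minimal sketch of the convention being described, assuming the usual scikit-learn estimator layout: __init__ only stores the parameters as given, and check_random_state is called in fit.

    from sklearn.utils import check_random_state

    class RandomImputerSketch(object):
        """Illustration only, not the PR's class."""

        def __init__(self, random_state=None):
            self.random_state = random_state                       # stored unchanged

        def fit(self, X, y=None):
            random_state = check_random_state(self.random_state)   # validated/converted here
            # ... use random_state to draw the imputation values ...
            return self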

@mblondel (Member)

Any value can represent a missing value (not only np.nan for dense matrices and 0 for sparse matrices, any value)

Do you mean that sparse matrices can encode missing values with other values than 0? This doesn't make sense to me, as the point of using 0 is to take advantage of the sparse data structure. In recommendation systems, there are more missing values than observed values so you cannot explicitly store the missing values.

@mblondel (Member)

I'll work on the documentation tomorrow

I think it's too early to work on the documentation. Let's get the code and unit tests right first.

@mblondel (Member)

A reminder of different missing values scenarios I posted on the ML:

  1. fewer missing values than observed values
  2. many more missing values than observed values (recommendation-system setting)
  3. fewer missing values than observed values AND many zero features
  4. many more missing values than observed values AND many zero features

How each can be handled:

  1. by masked arrays or arrays with missing values encoded by NaN
  2. by scipy sparse matrices containing only the observed values
  3. by scipy sparse matrices where missing values are encoded by NaN
  4. this one seems hard because the data is sparse both w.r.t. missing values and w.r.t. zero features

What scenarios do you target with this PR? 1) and 2) are the most important IMO. I wonder if 3) is worth the pain. 4) is very difficult to handle.

@NicolasTr (Contributor Author)

In what I did, you can choose in the constructor how the missing values are encoded.

  1. Simply use np.nan
  2. Use 0 for the missing values and a sparse matrix
  3. Use np.nan for the missing values, 0 for the zeros and a sparse matrix
  4. Same as 3. However, in this scenario your sparse matrix won't be very sparse (i.e. you could do it but I think it will be slow)

@mblondel (Member)

@NicolasTr Thanks for the clarification. Do you handle scenarios 2 and 3 using the same code? I haven't looked into your code yet but my intuition is that these two scenarios require quite different implementations in order to be efficient.

@mblondel (Member)

@glouppe What's your opinion on allowing to encode missing values differently in fit and transform? @GaelVaroquaux and I think it's unnecessary.

placeholder_transform=[np.nan],
strategy='mean',
default=np.nan,
axis=0):
Member

What's the point of the axis parameter? Can't the user just use fit(X.T) and transform(X.T)?

Member

What's the point of the axis parameter? Can't the user just use fit(X.T) and transform(X.T)?

Unless you are in a pipeline. The question is: does this pattern occur
very often or not?

Member

Does imputing values across samples / rows (instead of features / columns) make sense?

Contributor Author

I added it because it's really easy to handle:

  • For dense arrays, I pass the parameter to the right function
  • For sparse matrices, I change the representation of the matrix (CSC -> CSR or CSR -> CSC)

Hypothetical example: if you have students and grades, it could be relevant to impute the missing grades by student rather than by course.

If you think it's useless, I can remove it.

@glouppe (Contributor) commented Jul 16, 2013

@mblondel Indeed, I also don't think it is something necessary. +1 on removing it.

class Imputer(BaseEstimator, TransformerMixin):

    def __init__(self,
                 placeholder_fit=[np.nan],
Contributor

Could you set instead a scalar value by default (i.e., not a list)? I think it would make the interface clearer. List values will be documented later in the docstrings.

Member

Do we need to handle lists at all?

Member

In Python, a list is not safe as a default argument. Use a tuple instead if you need one.

http://pythonconquerstheuniverse.wordpress.com/2012/02/15/mutable-default-arguments/
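
A quick illustration of the pitfall behind that advice: the default list is created once, when the function is defined, so it is shared across calls.

    def collect(value, seen=[]):           # one list shared by every call
        seen.append(value)
        return seen

    collect(1)   # [1]
    collect(2)   # [1, 2]  <- the first call's value is still there

    def collect_safe(value, seen=None):    # use None (or a tuple) as the default instead
        seen = [] if seen is None else seen
        seen.append(value)
        return seen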

Member

Do we need to handle lists at all?

I was indeed thinking exactly the same.

Contributor

What is your opinion @NicolasTr ?

Contributor Author

Fixed in 25f4abb

@mblondel (Member)

Another strategy for imputing missing values is to use the "prior" distribution of observed values (regardless of their frequency). Say a column X[:, j] of X can take the values 1, 2, 3, then imputing the missing values with np.mean([1, 2, 3]) or np.median([1, 2, 3]) can make sense in some applications.
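
A one-function sketch of this "prior" strategy (not part of the PR): the fill value is computed over the distinct observed values, ignoring their frequencies.

    import numpy as np

    def prior_fill_value(column, statistic=np.mean):
        """Fill value from the set of distinct observed (non-NaN) values."""
        observed = column[~np.isnan(column)]
        return statistic(np.unique(observed))

    prior_fill_value(np.array([1., 1., 1., 2., 3., np.nan]))   # mean([1, 2, 3]) == 2.0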

@mblondel (Member)

I wonder if a Cython implementation wouldn't be both more efficient and more readable. The current code is hard to follow.

@mblondel (Member)

I wonder if we wouldn't be better off temporarily dropping support for the median strategy. This PR already adds 600+ lines of tests, so it's hard for us to review all the tests, make sure they are good, and ensure the correctness of the code. I'm particularly concerned about the median because it is hard to implement in the sparse case. In fact, we could even focus on mean-based imputation for this PR.

In the future, I'd suggest creating small, well-scoped, incremental PRs which we can review and merge easily. If merging into master is not possible, we can create a branch for merging reviewed code.

@glouppe (Contributor) commented Jul 16, 2013

I wonder if we wouldn't be better off temporarily dropping support for the median strategy. This PR already adds 600+ lines of tests, so it's hard for us to review all the tests, make sure they are good, and ensure the correctness of the code. I'm particularly concerned about the median because it is hard to implement in the sparse case. In fact, we could even focus on mean-based imputation for this PR.
In the future, I'd suggest creating small, well-scoped, incremental PRs which we can review and merge easily. If merging into master is not possible, we can create a branch for merging reviewed code.

Good idea, +1. What is your opinion about that @NicolasTr ?

@GaelVaroquaux (Member)

I wonder if we wouldn't be better off temporarily dropping support for the median strategy.

But that's the strategy that is recommended by all my statistician
friends... It's the one I'd really like to have in.

@mblondel (Member)

But that's the strategy that is recommended by all my statistician friends... It's the one I'd really like to have in.

Temporarily. Nicolas' progress is very hectic and I think it would be better to have incremental well-focused PRs.

For the median strategy, I'd recommend this simple approach:

  1. convert the numpy array with missing values encoded by np.nan (or any other value) to a masked array (skip this step if the input is already a masked array)
  2. use the median() method provided by masked arrays
  3. impute the values and return a standard numpy array, since they are what scikit-learn estimators expect.
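
A minimal sketch of the three steps above for the dense case, assuming any scalar placeholder for the missing values (the function name is illustrative, not the PR's):

    import numpy as np

    def impute_median_dense(X, missing_values=np.nan):
        """Column-wise median imputation via numpy masked arrays (dense case only)."""
        X = np.asarray(X, dtype=float)
        mask = np.isnan(X) if np.isnan(missing_values) else (X == missing_values)
        masked = np.ma.masked_array(X, mask=mask)                # 1. build the masked array
        medians = np.ma.median(masked, axis=0).filled(np.nan)    # 2. per-column medians
        X_out = X.copy()
        rows, cols = np.where(mask)
        X_out[rows, cols] = medians[cols]                        # 3. fill and return a plain ndarray
        return X_out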

@NicolasTr (Contributor Author)

I added some explanations here

If the array is dense, I use the MaskedArray class. It works great for mean and median. I had some problems with the imputation of the most frequent value because scipy.stats.mode() and scipy.stats.mstats.mode() don't work properly (scipy.stats.mode doesn't return the right element and scipy.stats.mstats.mode doesn't work if the first value is masked; I reported the bug here).

If the array is sparse:

  • For the mean, I simply divide the sum by the number of elements in the column (excluding the zeros if they are in the missing values).
  • For the median, I do the following steps (a sketch of these steps appears right after this list):
    1. I separate the positives (array), the negatives (array) and the zeros (count)
    2. I compute the position of the median (or the positions of the elements surrounding it)
    3. If it falls in the zeros, I return zero. Otherwise, I compute the real position of the median in the positives or in the negatives and return it (the arrays need to be sorted).
  • For the most frequent value:
    1. I compute the most frequent value ignoring the zeros
    2. I compare its number of occurrences with the number of zeros to return the right value
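
As referenced above, here is a hedged sketch of the sparse median steps for a single column, assuming the column is given as its explicit nonzero values plus a count of implicit zeros (the function name and signature are illustrative, not the PR's actual helpers).

    import numpy as np

    def sparse_column_median(nonzero_values, n_zeros):
        """Median of a sparsely stored column: sorted negatives, then n_zeros zeros, then sorted positives."""
        negatives = np.sort(nonzero_values[nonzero_values < 0])
        positives = np.sort(nonzero_values[nonzero_values > 0])
        n = negatives.size + n_zeros + positives.size
        if n == 0:
            return np.nan

        def kth_smallest(k):
            # index into the virtual sorted column [negatives | zeros | positives]
            if k < negatives.size:
                return negatives[k]
            if k < negatives.size + n_zeros:
                return 0.0
            return positives[k - negatives.size - n_zeros]

        if n % 2:
            return kth_smallest(n // 2)
        return 0.5 * (kth_smallest(n // 2 - 1) + kth_smallest(n // 2))

    sparse_column_median(np.array([4.0, -2.0, 7.0]), n_zeros=2)   # column [-2, 0, 0, 4, 7] -> 0.0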

Missing values are handled by the mask in the dense case. In the sparse case, I mask the data vector of the matrix and I have special conditions for the zeros.

I refactored the tests. However, I'm not sure that they are really more readable. For the mean and the median, I create a test matrix containing the kinds of columns that could be problematic (all missing, all zeros, combinations, 0 among the missing values or not, ...).

I also implemented the sparse mean and the sparse median in Cython. It's 2-3 times faster. I read somewhere that it's important to avoid calling Python functions in Cython; I'm not sure if I'm doing that or not. Maybe it's possible to make it much faster.

There's some code duplication (in Python and Cython). I think I could move the sparse mean and the sparse median into the tests to validate the Python code. Currently, I'm using a dictionary of functions to check whether I did a Cython implementation of the "strategy". It's ugly, but it allows me to easily switch between the implementations for the speed tests.

At this moment, all the tests are passing. I plan to add some more tests with the copy parameter.

All the cases described by Mathieu are handled. By allowing the user to choose the representation of the missing values, I'm trading some speed for a lot of flexibility. Anyway, the real cost of this sacrifice depends on the "number of representations" of the missing values (placeholders), so it should not be huge.

There's also one more thing that should be decided (I'll use the mean as an example): if the mean cannot be computed (because all the values of the column are missing), what should be done? I'm currently using np.nan. Wouldn't it be better to use a "fill_value"?

@mblondel (Member)

There's also one more thing that should be decided (I'll use the mean as an example): if the mean cannot be computed (because all the values of the column are missing), what should be done? I'm currently using np.nan. Wouldn't it be better to use a "fill_value"?

I would remove the entire column in transform (and raise a warning).

@mblondel (Member)

The rationale is that we want the transformed data to be used in any scikit-learn estimator. So, we don't want to leave missing values in the transformed data.
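
A short sketch of that behaviour, assuming the per-column statistics are NaN exactly when a column had no observed values at fit time (the helper name is hypothetical):

    import warnings
    import numpy as np

    def drop_unimputable_columns(X, statistics):
        """Keep only the columns whose statistic could be computed; warn about the rest."""
        valid_mask = ~np.isnan(statistics)
        if not valid_mask.all():
            warnings.warn("Dropping columns with no observed values: %s"
                          % np.flatnonzero(~valid_mask).tolist())
        return X[:, valid_mask]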

nb_values = shape[0] - nb_zeros - nb_placeholders

z = zeros[:nb_zeros]
p = np.random.choice(placeholders, nb_placeholders)
Contributor

Initialize the random generator globally with a fixed seed instead (as a module-level variable) and use it in all your tests. Otherwise, the tests are not deterministic and may not be reproducible.
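
A minimal sketch of that test convention, reusing the variable names from the snippet above with illustrative values: a single module-level RandomState with a fixed seed keeps the generated test data reproducible.

    import numpy as np

    rng = np.random.RandomState(0)        # shared, fixed-seed generator for the whole test module

    nb_zeros, nb_placeholders = 3, 2      # illustrative values
    zeros = np.zeros(10)
    placeholders = np.array([np.nan, -1.0])

    z = zeros[:nb_zeros]
    p = rng.choice(placeholders, nb_placeholders)   # reproducible across runs, unlike np.random.choice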

Contributor Author

Fixed in 1f3a285

X_true = X_true[:, cols_to_keep]

_test_statistics(X, X_true, strategy,
true_statistics, test_missing_values)
Member

People who are very picky about PEP8 would note that this is not properly indented :)

@GaelVaroquaux (Member)

Finished my review!

@@ -307,7 +307,7 @@ def test_estimators_nan_inf():
         if name in dont_test:
             continue
         if name in ('PLSCanonical', 'PLSRegression', 'CCA',
-                    'PLSSVD'):
+                    'PLSSVD', 'Imputer'):
Member

Maybe add a comment that the Imputer is not supposed to raise a warning here.

[np.nan, 6, 0, 5, 13],
])

mean = np.array([
Member

I find this variable name quite confusing. I would rename it X_imputed_mean, and the same for median below.

[6, 5, 13],
])

_test_statistics(X, mean, "mean", [np.nan, 3, np.nan, np.nan, 7], 0)
Member

Maybe I would define variables mean and median and assign these lists to them, so it is clear what they represent.

valid_statistics = self.statistics_[valid_mask]
valid_statistics_indexes = np.where(valid_mask)[0]
missing = ",".join(
map(str, np.arange(X.shape[not self.axis])[invalid_mask]))
Member

Do you need this?

@amueller (Member)

👍 for merge!

import scipy.sparse as sp
import scipy.stats as st
Member

We usually use

from scipy import stats
from scipy import sparse

and avoid shortcuts.

Personally, when I see st or sp I don't know which import it corresponds to.

@agramfort (Member)

Besides that, +1 for merge too.

@amueller (Member)

merge by rebase :)

Thanks a lot!

@amueller closed this Jul 25, 2013
@glouppe (Contributor) commented Jul 25, 2013

Yay! Thanks a lot Nicolas!

@arjoly (Member) commented Jul 25, 2013

Congratulations !!!

@agramfort (Member)

nice !

@mblondel (Member)

Thanks for your patience and hard work Nicolas. Looking forward to the matrix factorization part :)

@mblondel (Member)

Sorry, I didn't realize this before, but transform should NOT set statistics_ when axis=1. transform is not for learning and should be side-effect free. Fortunately, this seems like an easy fix: just store the statistics in a local variable when axis=1.
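
A small self-contained sketch of that fix (mean strategy, dense input only, not the PR's actual class): when axis=1 the statistics are computed from the X passed to transform and kept in a local variable, so transform has no side effects.

    import numpy as np

    class ImputerSketch(object):
        def __init__(self, axis=0):
            self.axis = axis

        def fit(self, X, y=None):
            if self.axis == 0:
                self.statistics_ = np.nanmean(X, axis=0)    # column means learned at fit time
            return self

        def transform(self, X):
            X = np.array(X, dtype=float, copy=True)
            mask = np.isnan(X)
            if self.axis == 0:
                statistics = self.statistics_
                X[mask] = statistics[np.where(mask)[1]]
            else:
                statistics = np.nanmean(X, axis=1)          # local variable; self is left untouched
                X[mask] = statistics[np.where(mask)[0]]
            return X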

@NicolasTr (Contributor Author)

Ok

@NicolasTr (Contributor Author)

To avoid any conflicts, I'll wait until pull request #2252 is merged.

@amueller (Member)

you can just do another PR on top of that.

@glouppe (Contributor) commented Jul 26, 2013

Fixed and merged in #2262
