[MRG] Missing values imputation #2148
Conversation
@abstractmethod
def __init__(self,
             missing_values_fit=[np.nan],
             missing_values_transform=[np.nan],
Could it be possible to make missing_values_transform=None by default and then assume missing_values_transform is the same as missing_values_fit? I think it is the most common use, isn't it?
Is there a use case for encoding missing values differently at fit and transform time? Also, is there a use case for encoding missing values with more than one value?
I'd rather just use missing_values=np.nan. This should simplify both the interface and the code.
Could it be possible to make missing_values_transform=None by default and then assume missing_values_transform is the same as missing_values_fit? I think it is the most common use, isn't it?
Ok
Is there a use case for encoding missing values differently at fit and transform time? Also, is there a use case for encoding missing values with more than one value?
I see two cases:
- with a Pipeline
- for matrix completion (I'll probably use 0 for the missing values and np.nan for the missing values that should be imputed)
with a Pipeline
Can you elaborate?
I'll probably use 0 for the missing value and np.nan for the missing values that should be imputed
So there are two kinds of missing values?
For the Pipeline: https://github.com/scikit-learn/scikit-learn/pull/2148/files#L1R985
So there are two kinds of missing values?
Yes. For example: if you have a matrix where (i, j) is the rating that user i gave to movie j, you may want to impute the ratings that the active users would give to the movies currently in theaters. In this case:
- np.nan (or something else) for these ratings
- 0 for the rest of the missing ratings
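A minimal NumPy sketch of this ratings setup, with made-up values; `to_impute` and `unobserved` are illustrative names, not part of the PR:

```python
import numpy as np

# rows are users, columns are movies; ratings range over 1..5
R = np.array([
    [5.0, 3.0, 0.0, np.nan],   # active user; last movie is in theaters
    [0.0, 4.0, 0.0, np.nan],   # active user
    [1.0, 0.0, 2.0, 0.0],      # inactive user: nothing to impute here
])

to_impute = np.isnan(R)    # missing ratings an imputer should fill
unobserved = (R == 0.0)    # missing too, but deliberately left as 0
```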
I have a use case for this.
I work on a similarity matrix, representing interaction frequencies between regions of DNA. I have two types of missing data. The first one is that we do not have enough cells to have a measure; it means that the two regions are far away from each other. The second one is due to repetitions in the DNA sequence.
In both cases, I can infer the missing data, but the inference is very different.
On Wed, Jul 10, 2013 at 01:00:30PM -0700, Gilles Louppe wrote:
Could it be possible to make missing_values_transform=None by default and then assume missing_values_transform is the same as missing_values_fit? I think it is the most common use, isn't it?
I don't even think that we should make a difference.
I work on a similarity matrix, representing interaction frequencies between regions of DNA. I have two types of missing data. The first one is that we do not have enough cells to have a measure. It means that the two regions are far away from each other. The second one is due to repetitions in the DNA sequence. In both cases, I can infer the missing data, but the inference is very different.

It sounds to me like you need two imputations, not a different fit and transform. Am I wrong?
G
I don't even think that we should make a difference.
Let's keep the interface simple for now (with a single parameter missing_values=np.nan) and let's consider this use case later, possibly using a mask as I suggested before.
@GaelVaroquaux indeed, the only requirement in my case is to be able to deal with two missing data types.
It looks like a good start! However, I have some concerns regarding usability. Because of the various Imputer objects you created, it is actually not possible to grid-search over the imputation strategy (e.g., you pipeline an imputer with a classifier, and you want to grid-search the imputation strategy (mean, median, most frequent, etc.) in addition to the classifier hyper-parameters). Don't you think it would be best to have a single Imputer transformer, with something like a
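The comment is cut off above; presumably it refers to a single strategy parameter (a strategy='mean' default does appear later in this diff). Under that assumption, a hedged sketch of the grid-search use case, using the module layout of that era (sklearn.preprocessing.Imputer, sklearn.grid_search) and made-up data; in modern scikit-learn the equivalents are SimpleImputer and sklearn.model_selection:

```python
import numpy as np
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1., 2.], [np.nan, 3.], [7., 6.],
              [4., np.nan], [8., 8.], [np.nan, 1.]])
y = np.array([0, 0, 1, 1, 0, 1])

pipeline = Pipeline([('imputer', Imputer(missing_values='NaN')),
                     ('clf', DecisionTreeClassifier())])

# the imputation strategy is searched together with the classifier settings
param_grid = {'imputer__strategy': ['mean', 'median', 'most_frequent'],
              'clf__max_depth': [2, 4]}
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X, y)
```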
super(RandomImputer, self).__init__(
    missing_values_fit=missing_values_fit,
    missing_values_transform=missing_values_transform)
self.random_state = check_random_state(random_state)
check_random_state should be called at fit. Conventions are that the constructor shouldn't change the values of the given parameters. Its only role is to store them.
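A minimal sketch of that convention, reusing the RandomImputer name from the snippet above but with an illustrative body: parameters are stored untouched in __init__, and check_random_state runs in fit.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_random_state

class RandomImputer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: fill NaNs with a random observed value per column."""

    def __init__(self, random_state=None):
        self.random_state = random_state  # store only; no validation here

    def fit(self, X, y=None):
        rng = check_random_state(self.random_state)  # validate at fit time
        self.fill_ = [rng.choice(col[~np.isnan(col)]) for col in X.T]
        return self

    def transform(self, X):
        X = X.copy()
        for j, value in enumerate(self.fill_):
            X[np.isnan(X[:, j]), j] = value
        return X
```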
Do you mean that sparse matrices can encode missing values with values other than 0? This doesn't make sense to me, as the point of using 0 is to take advantage of the sparse data structure. In recommendation systems, there are more missing values than observed values, so you cannot explicitly store the missing values.
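A small sketch of the point: in a sparse ratings matrix only the observed entries are stored, so the implicit zeros are the only affordable encoding for "not rated" (values made up):

```python
import numpy as np
from scipy import sparse

# 3 users x 4 movies; only the 4 observed ratings are stored explicitly,
# and the implicit zeros stand for "not rated"
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 3])
data = np.array([5.0, 3.0, 4.0, 2.0])
R = sparse.csr_matrix((data, (rows, cols)), shape=(3, 4))

print(R.nnz)  # 4 stored values out of 12 cells
```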
I think it's too early to work on the documentation. Let's get the code and unit tests right first.
A reminder of different missing values scenarios I posted on the ML:
What scenarios do you target with this PR? 1) and 2) are the most important IMO. I wonder if 3) is worth the pain. 4) is very difficult to handle.
In what I did, you can choose how the missing values are encoded in the constructor.
@NicolasTr Thanks for the clarification. Do you handle scenarios 2 and 3 using the same code? I haven't looked into your code yet but my intuition is that these two scenarios require quite different implementations in order to be efficient.
@glouppe What's your opinion on allowing missing values to be encoded differently at fit and transform time? @GaelVaroquaux and I think it's unnecessary.
             placeholder_transform=[np.nan],
             strategy='mean',
             default=np.nan,
             axis=0):
What's the point of the axis parameter? Can't the user just use fit(X.T) and transform(X.T)?
What's the point of the axis parameter? Can't the user just use fit(X.T) and transform(X.T)?
Unless you are in a pipeline. The question is: does this pattern occur very often or not?
Does imputing values across samples / rows (instead of features / columns) make sense?
I added it because it's really easy to handle:
- For dense arrays, I pass the parameter to the right function
- For sparse arrays, I change the representation of the matrix (CSC -> CSR or CSR -> CSC)

Hypothetical example: if you have students and grades, it could be relevant to impute the missing grades by student rather than by course.
If you think it's useless, I can remove it.
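For reference, a sketch of the transpose equivalence being discussed, assuming the era's Imputer API with its axis parameter; illustrative only:

```python
import numpy as np
from sklearn.preprocessing import Imputer

X = np.array([[1., np.nan, 3.],
              [4., 5., np.nan]])

# imputing per row (axis=1) should match transposing around axis=0,
# which is what a user could do by hand outside a Pipeline
by_hand = Imputer(strategy='mean', axis=0).fit_transform(X.T).T
built_in = Imputer(strategy='mean', axis=1).fit_transform(X)

assert np.allclose(by_hand, built_in)
```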
@mblondel Indeed, I also don't think it is necessary. +1 on removing it.
class Imputer(BaseEstimator, TransformerMixin):

    def __init__(self,
                 placeholder_fit=[np.nan],
Could you set instead a scalar value by default (i.e., not a list)? I think it would make the interface clearer. List values will be documented later in the docstrings.
Do we need to handle lists at all?
In Python, a list is not safe as a default argument. Use a tuple instead if you need one.
http://pythonconquerstheuniverse.wordpress.com/2012/02/15/mutable-default-arguments/
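The pitfall in a nutshell (a standalone illustration, not code from this PR):

```python
def append_to(item, items=[]):     # BUG: one shared list for every call
    items.append(item)
    return items

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2] -- state leaked from the first call

def append_to_safe(item, items=None):
    if items is None:              # a fresh list per call
        items = []
    items.append(item)
    return items
```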
Do we need to handle lists at all?
I was indeed thinking exactly the same.
What is your opinion, @NicolasTr?
Fixed in 25f4abb
Another strategy for imputing missing values is to use the "prior" distribution of observed values (regardless of their frequency). Say a column X[:, j] of X can take the values 1, 2, 3, then imputing the missing values with np.mean([1, 2, 3]) or np.median([1, 2, 3]) can make sense in some applications.
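A tiny NumPy sketch of the difference between the frequency-weighted mean and this "prior" mean over the distinct observed values (data made up):

```python
import numpy as np

col = np.array([1., 1., np.nan, 1., 3.])
observed = col[~np.isnan(col)]                 # [1, 1, 1, 3]

freq_mean = observed.mean()                    # 1.5, weighted by frequency
prior_mean = np.unique(observed).mean()        # mean over {1, 3} -> 2.0
prior_median = np.median(np.unique(observed))  # 2.0
```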
I wonder if a Cython implementation wouldn't be both more efficient and more readable. The current code is hard to follow.
I wonder if we wouldn't be better off temporarily dropping support for the median strategy. This PR already adds 600+ lines of tests, so it's hard for us to review them all and make sure they are correct. I'm particularly concerned about the median because it is hard to implement in the sparse case. In fact, we could even focus on mean-based imputation for this PR. In the future, I'd suggest creating small, well-scoped, incremental PRs which we can review and merge easily. If merging into master is not possible, we can create a branch for merging reviewed code.
Good idea, +1. What is your opinion about that, @NicolasTr?
But that's the strategy that is recommended by all my statistician
Temporarily. Nicolas' progress is very hectic and I think it would be better to have incremental well-focused PRs. For the median strategy, I'd recommend this simple approach:
I added some explanations here.

Missing values are handled by the mask in the dense case. In the sparse case, I mask the data vector of the matrix and I have special conditions for the zero.

I refactored the tests. However, I'm not sure that they are really more readable. For the mean and the median, I create a test matrix containing the kinds of columns that could be problematic (all missing, all zero, combinations, 0 among the missing values or not, ...).

I also implemented the sparse mean and the sparse median in Cython. It's 2-3 times faster. I read somewhere that it's important to avoid calling Python functions in Cython; I'm not sure whether I'm doing that or not. Maybe it's possible to make it much faster. There's some code duplication (in Python and Cython). I think I could move the sparse mean and the sparse median into the tests to validate the Python code. Currently, I'm using a dictionary of functions to check whether I wrote a Cython implementation of the "strategy". It's ugly, but it allows me to easily switch between the implementations for the speed tests.

At this moment, all the tests are passing. I plan to add some more tests with the copy parameter. All the cases described by Mathieu are handled.

By allowing the user to choose the representation of the missing values, I'm trading some speed for a lot of flexibility. Anyway, the real cost of this sacrifice depends on the number of representations of the missing values (placeholders), so it should not be huge.

There's also one more thing that should be decided (I'll use the mean as an example): if the mean cannot be computed (if all the values of the column are missing), what should be done? I'm currently using np.nan. Wouldn't it be better to use a "fill_value"?
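For the dense case, the masked-array approach described above might look roughly like this (an illustrative sketch, not the PR's actual code):

```python
import numpy as np
import numpy.ma as ma

X = np.array([[1., np.nan, 3.],
              [4., 5., np.nan],
              [7., np.nan, 9.]])

# mask the missing entries, then let the masked routines skip them
masked = ma.masked_array(X, mask=np.isnan(X))
col_mean = masked.mean(axis=0)          # [4.0, 5.0, 6.0]
col_median = ma.median(masked, axis=0)  # per-column median, NaNs ignored
```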
I would remove the entire column in transform (and raise a warning).
The rationale is that we want the transformed data to be used in any scikit-learn estimator. So, we don't want to leave missing values in the transformed data.
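An illustrative sketch of that behaviour, with hypothetical names (statistics_ stands for the per-column statistic learned at fit, with np.nan marking columns that had no observed values):

```python
import warnings
import numpy as np

# hypothetical fitted statistics; np.nan marks a column that was
# entirely missing at fit time, so its mean could not be computed
statistics_ = np.array([2.5, np.nan, 4.0])
valid_mask = ~np.isnan(statistics_)

X = np.array([[1.0, np.nan, np.nan],
              [np.nan, np.nan, 4.0]])
if not valid_mask.all():
    warnings.warn("Deleting features with no observed values: %s"
                  % np.where(~valid_mask)[0])
X = X[:, valid_mask]  # drop the columns whose statistic is undefined
```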
nb_values = shape[0] - nb_zeros - nb_placeholders

z = zeros[:nb_zeros]
p = np.random.choice(placeholders, nb_placeholders)
Initialize the random generator globally with a fixed seed instead (as a global variable) and use it in all your tests. Otherwise, tests are not deterministic and may not be reproducible.
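A sketch of the suggested pattern; draw_placeholders is a hypothetical stand-in for the test helper above:

```python
import numpy as np

# one module-level generator with a fixed seed, shared by all tests,
# so every run draws the same "random" data
rng = np.random.RandomState(0)

def draw_placeholders(placeholders, nb_placeholders):
    # rng.choice replaces the bare np.random.choice of the snippet above
    return rng.choice(placeholders, nb_placeholders)
```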
Fixed in 1f3a285
X_true = X_true[:, cols_to_keep]

_test_statistics(X, X_true, strategy,
                 true_statistics, test_missing_values)
People who are very picky about PEP8 would note that this is not properly indented :)
Finished my review!
@@ -307,7 +307,7 @@ def test_estimators_nan_inf():
     if name in dont_test:
         continue
     if name in ('PLSCanonical', 'PLSRegression', 'CCA',
-                'PLSSVD'):
+                'PLSSVD', 'Imputer'):
Maybe add a comment that the Imputer is not supposed to raise a warning here.
    [np.nan, 6, 0, 5, 13],
])

mean = np.array([
I find this variable name quite confusing. I would rename it X_imputed_mean, same for median below.
    [6, 5, 13],
])

_test_statistics(X, mean, "mean", [np.nan, 3, np.nan, np.nan, 7], 0)
Maybe I would define variables mean and median and assign these lists to them, so it is clear what they represent.
valid_statistics = self.statistics_[valid_mask]
valid_statistics_indexes = np.where(valid_mask)[0]
missing = ",".join(
    map(str, np.arange(X.shape[not self.axis])[invalid_mask]))
Do you need this?
👍 for merge!
import scipy.sparse as sp
import scipy.stats as st
We usually use

from scipy import stats
from scipy import sparse

and avoid shortcuts. Personally, when I see st or sp, I don't know what import it corresponds to.
Besides, +1 for merge too.
Merged by rebase :) Thanks a lot!
Yay! Thanks a lot Nicolas!
Congratulations!!!
Nice!
Thanks for your patience and hard work, Nicolas. Looking forward to the matrix factorization part :)
Sorry, I didn't realize this before, but transform should NOT set
Ok |
To avoid any conflict, I'll wait until pull request #2252 is merged.
You can just do another PR on top of that.
Fixed and merged in #2262
Here's some code for the data imputation.
It's still a work in progress so Travis will probably be mad at me.
I'll work on the documentation tomorrow. Meanwhile, you can look at the tests if you want to see how it works.
_What's new_
I added four classes:
_Behaviour_
_What's missing_
np.isnan(imputation_data_)