[MRG+2-1] ENH add a ValueDropper to artificially insert missing values (NMAR or MCAR) to the dataset #7084

Closed
wants to merge 18 commits

Conversation

raghavrv
Member

@raghavrv raghavrv commented Jul 26, 2016

Fixes #6284

A ValueDropper transformer class with a missing_distribution parameter.

This can be:

  • a float specifying the rate of missing values (MCAR missingness),
  • a 1D array of probabilities, one per feature (MCAR again), or
  • for NMAR, a dict (of floats / 1D arrays of per-feature probabilities) mapping each label to the required distribution of missing values for the samples belonging to that label.

Refer to this example for a better idea.
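The dict form described above (label-dependent missing rates, i.e. NMAR via the label) can be sketched in a few lines. `drop_values_nmar` is an illustrative name, not the PR's actual code:

```python
import numpy as np

def drop_values_nmar(X, y, rates_per_label, random_state=0):
    """Sketch of the dict form: for each class label, drop values at the
    given per-feature rates, but only in samples of that label."""
    rng = np.random.RandomState(random_state)
    X = X.astype(float, copy=True)
    for label, rates in rates_per_label.items():
        row_mask = (y == label)
        # Bernoulli draw per entry of this label's block, one rate per feature
        drop = rng.uniform(size=(row_mask.sum(), X.shape[1])) < np.asarray(rates)
        block = X[row_mask]
        block[drop] = np.nan
        X[row_mask] = block
    return X

X = np.ones((6, 2))
y = np.array([0, 0, 0, 1, 1, 1])
# Drop all of feature 0 in class 0; leave class 1 and feature 1 untouched
X_miss = drop_values_nmar(X, y, {0: [1.0, 0.0]})
```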

  • ValueDropper to introduce missing values.
  • Simplify and cleanup.
  • Add example.
  • Tests/doc

I'd also add an example for #5974, which will use this function to compare missing-value handling strategies under MCAR/MNAR missingness.

@agramfort @GaelVaroquaux @amueller @jnothman @glouppe @MechCoder

@amueller
Member

I don't think this should live in the preprocessing module. This is not for preprocessing but for dataset generation. Would a user ever use this (I can see it being used for benchmarking), or is it just for our own testing?
Depending on that, I'd either put it in utils.testing or datasets.

for label in labels:
    label_mask[y == label] = True

# The logic of MCAR/MNAR is implemented here
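As a side note, the loop in this hunk has a one-line vectorized equivalent (a sketch, not part of the PR):

```python
import numpy as np

y = np.array([0, 1, 2, 1, 0, 2])
labels = [0, 2]

# Loop version, as in the diff hunk above
label_mask = np.zeros(y.shape, dtype=bool)
for label in labels:
    label_mask[y == label] = True

# Vectorized equivalent
assert np.array_equal(label_mask, np.isin(y, labels))
```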

write out acronyms at least once.

@raghavrv
Member Author

raghavrv commented Jul 26, 2016

sklearn.datasets seems like a better choice. Thanks.

BTW, what do you think about the function in general? Will it be useful outside that example? (Say, would there be a use case where people would want to drop values randomly and then impute them to reduce noise?)

@raghavrv
Member Author

raghavrv commented Jul 26, 2016

Wow that was some super quick reviewing ;) Thx

@raghavrv
Member Author

raghavrv commented Jul 26, 2016

@amueller The main motive of this function is to help generate MNAR (incrementally) and to help benchmark/illustrate the missing-value strategies we will be adding in the future (the immediate use case is an example illustrating RandomForest's handling).

@amueller
Member

yeah so why not restrict it to MNAR and simplify the logic? Maybe the logic needs to be this complicated, but the function just seems very long. I totally agree that this is something very helpful for benchmarking. I just don't like code ;)

n_unique_labels = len(unique_labels)

if labels is None:
    # Labels is an int specifying the no of labels to correlate

no = number

@raghavrv
Member Author

yeah so why not restrict it to MNAR and simplify the logic?

Sure, I'll attempt to simplify the logic.

By restricting it to MNAR, do you mean we shouldn't handle MCAR? If so, MCAR is just the special case label_correlation=0.

If instead you meant generating pure MNAR (label_correlation=1)... that doesn't seem to happen in real cases. For instance, in the Titanic dataset the cabin and age values go missing in a way that appears correlated with the survival of the passenger. In that case the distribution is as follows:

[image: distribution of missing cabin/age values by passenger survival]

I observed something similar in the census income dataset.

@amueller
Member

I meant removing the special case for MCAR, but depends on whether that actually simplifies the logic.

----------

X : ndarray like of shape (n_features, n_samples)


remove blank line

@raghavrv
Member Author

I still haven't cleaned up / addressed some of the comments. But most of the comments got hidden when I moved the file from preprocessing to datasets.

Can this be used for regression where label_correlation=0

For regression... hmmm, I think we can change label to target and add a deviation term, so that values go missing when the target is within that deviation of the specified targets_to_correlate.

I don't get this description. The lack of examples does not help. Do you mean to say that whether or not any missing elements already exist, this will be modified in-place?

I've added an example. Please let me know if the motive is justified.

Comment on why I am resetting the random state after choosing label(s) at random

The reason I reset the random state after randomly choosing labels is to make sure that the missing values generated are the same whether the labels are chosen at random or those same randomly chosen labels are passed explicitly:

X1, y, labels = drop_values(X, y, ..., labels=None, n_labels=1,
                            return_labels=True, copy=True)  # 1
X2, y = drop_values(X, y, ..., labels=labels, copy=True)    # 2

# X1 == X2 holds only if the random state at the time the missing values are
# generated in #1 is the same as in #2. Generating labels at random advances
# the random state, which therefore needs to be reset before generating the
# missing values.
X1 == X2

@amueller
Member

The reason why I reset the random state after randomly choosing labels is to make sure that missing values generated - when labels are randomly chosen and when those randomly chosen labels are explicitly passed - are same.

This won't work if the user passes a RandomState object, as @jnothman observed.

@raghavrv
Member Author

Should I make a copy for that RandomState object then?
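Copying the RandomState as discussed here could look like the following (a minimal sketch of the idea, not code from the PR):

```python
import copy
import numpy as np

rng = np.random.RandomState(42)
rng_copy = copy.deepcopy(rng)  # snapshot of the generator's current state

a = rng.uniform(size=5)
b = rng_copy.uniform(size=5)
# The copy replays the same stream, so the draws match without reseeding.
```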

@amueller
Member

I guess you could do that but I'm not sure if the contract you're trying to establish is that useful. Can you give an example when one would need it?

@raghavrv
Member Author

raghavrv commented Jul 28, 2016

Okay. I've removed this... The use case is a very specific one that I ran into myself. It may not be so useful for all.

@amueller
Member

please ping once you want another round.

@raghavrv
Member Author

raghavrv commented Aug 1, 2016

Ok, I'll remove the missing-mask-in / missing-mask-out functionality; the same thing can be done by simply passing missing_mask as X and missing_values=True. That would pare the function down to the essentials.

@amueller
Member

amueller commented Dec 7, 2016

@raghavrv your sorting approach is deterministic, so that's not good. I think @jnothman and I have similar ideas. I think @jnothman's might be using RankScaler (#2176) followed by my logistic thing?

@jnothman
Member

jnothman commented Dec 8, 2016

I think that @amueller and I are quite enthusiastic about having a helper, somewhere, that actually simulates MNAR as relevant to social science settings, and perhaps others. The main challenge is making it believable or flexible enough to be "standard" in some way.

@raghavrv
Member Author

raghavrv commented Dec 9, 2016

Thanks for the comments. @tguillemot and I discussed this a bit IRL.

@raghavrv your sorting approach is deterministic

What if we have ranking_func return values between 0 and 1 depending on the value of X for each feature?

X, y = ...
X_dropped = drop_values(X,
                        ranking_func={0: lambda X_feat0: np.exp(-(X_feat0 ** 2) / 2),
                                      1: lambda X_feat1: X_feat1,
                                      2: lambda X_feat2: -X_feat2,
                                      3: lambda X_feat3: (X_feat3 > 55) * 1 + (X_feat3 < 55) * 0.3},
                        missing_proba=[0.1, 0.3, 0.2, 0.2],
                        random_state=0)

The above code snippet would ~~give~~ drop 10% of the values in feature 0 (giving higher precedence to values close to 0), 30%/20% of the values in features 1/2 (giving higher precedence to high/low values respectively), and 20% of the values in feature 3 (with a precedence that looks something like the RankScaler PR)...

(Optionally, we thought about allowing a single function to be passed instead of a dict of (at most) n_features functions... That single function would return one continuous rank matrix for the whole X...)

This functionality could be illustrated with a nice example covering various possible ranking functions (simulating distributions like the top 10%, the bottom 10%, values far from the mean, values far from some particular measure, etc.)...

If this is okay, we can extend our use case to the third type of missingness, MAR, too. MAR (Missing At Random) is when a value is dropped based on other, non-missing feature values... (the "if you answer one question you're likely not to answer the others" kind...)
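A minimal sketch of how such a ranking_func-driven dropper could work (`drop_values_ranked` is a hypothetical name; entries are sampled with probability proportional to their ranking score, which keeps the choice non-deterministic while honouring the per-feature drop fractions exactly):

```python
import numpy as np

def drop_values_ranked(X, ranking_func, missing_proba, random_state=0):
    """For each feature j, drop a fraction missing_proba[j] of entries,
    choosing entries with probability proportional to ranking_func[j]'s
    (non-negative) score for each value."""
    rng = np.random.RandomState(random_state)
    X = X.astype(float, copy=True)
    n_samples = X.shape[0]
    for j, func in ranking_func.items():
        scores = np.asarray(func(X[:, j]), dtype=float)
        probs = scores / scores.sum()  # precedence -> sampling weights
        n_drop = int(round(missing_proba[j] * n_samples))
        idx = rng.choice(n_samples, size=n_drop, replace=False, p=probs)
        X[idx, j] = np.nan
    return X

X = np.random.RandomState(0).randn(100, 2)
X_miss = drop_values_ranked(
    X,
    ranking_func={0: lambda x: np.exp(-x ** 2 / 2),    # prefer values near 0
                  1: lambda x: x - x.min() + 1e-9},    # prefer high values
    missing_proba=[0.1, 0.3],
)
```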

@amueller
Member

amueller commented Dec 9, 2016

The above code snippet would give 10% of values in feature 0, giving higher precedence to values close to 0, 30%/20% of values in feature 1/2 (giving higher precedence to high/low values) and 20% of values in feature 3 (giving precedence that looks something like the rankscaler PR)...

I don't understand what that means. "give 10% of values in feature 0"?

While I think that having something might be valuable, I look at the code and your API proposal and they seem very complicated. Consider these functions:

import numpy as np
from scipy.special import softmax  # used by mnar/mar below

def mcar(X, probs=.1):
    X = X.copy()
    X[np.random.uniform(size=X.shape) < probs] = np.nan
    return X

def mnar(X, probs=.1):
    X = X.copy()
    # softmax over the samples of each feature
    sm = softmax(X.T, axis=1).T
    mask = np.random.uniform(size=X.shape) < sm * probs * X.shape[0]
    X[mask] = np.nan
    return X

def mar(X, probs=.1):
    X = X.copy()
    # a random projection of the other features decides what goes missing
    weights = np.random.normal(size=(X.shape[1], X.shape[1]))
    sm = softmax(np.dot(X, weights).T, axis=1).T
    mask = np.random.uniform(size=X.shape) < sm * probs * X.shape[0]
    X[mask] = np.nan
    return X

I think they are reasonable (though pretty linear) and took me 1 minute to write down. Note that probs can be an array of length n_features. They are easy to understand and should be sufficient to get relatively realistic data, I think.

If your whole PR only implements the first one, which is literally one line, that doesn't seem worth the effort. If it also implements mnar it might be more worth it, but then I feel it should have some clear benefit over the (written pretty verbosely above) ca. two-line softmax solution I provided.
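The "in expectation" behaviour of the softmax-thresholding trick above can be checked numerically. This is a self-contained sketch (re-deriving the softmax inline); note the realized rate falls slightly below p because the per-entry drop probability is capped at 1:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5000, 4)
p = 0.1

# Per-feature softmax over the samples: each column of sm sums to 1
e = np.exp(X - X.max(axis=0))
sm = e / e.sum(axis=0)

# Each entry is dropped with probability min(1, sm * p * n_samples),
# so the expected drop fraction per feature is approximately p.
mask = rng.uniform(size=X.shape) < sm * p * X.shape[0]
rate = mask.mean()
```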

@raghavrv
Member Author

raghavrv commented Dec 9, 2016

"give 10% of values in feature 0"?

Sorry! That was a typo, I meant drop...

@raghavrv
Member Author

raghavrv commented Dec 9, 2016

@amueller So what is your suggestion on how to proceed?

@raghavrv
Member Author

(With much thanks to @tguillemot) I have a gist with an alternative API that is a slight extension of your one-minute script and handles label-based or feature-based NMAR: https://gist.github.com/raghavrv/3237a1753077b0e3bbc3aef870315771

@amueller @jnothman @agramfort could you look into that and see whether that API is simpler, yet not too trivial? If so I can proceed along those lines...

@raghavrv
Member Author

(Currently the probabilities do not directly translate to missing percentages, but I think with some scaling tricks we can achieve that...)

@jnothman
Member

jnothman commented Dec 12, 2016 via email

@raghavrv
Member Author

Thanks for the comment!

@amueller
Member

amueller commented Dec 13, 2016

I did a softmax over the samples so the desired number of missing values per feature is reached in expectation.

@jnothman
Member

jnothman commented Dec 13, 2016 via email

@amueller
Member

Softmax over samples still doesn't mean that you can use a constant threshold to yield a constant proportion of samples?

Hm true, not sure what I was thinking....
Can you elaborate on the logistic + random value?

@jnothman
Member

Let me think about it a little more :)

@jnothman
Member

I think I might have written a bit of nonsense above.

@jnothman
Member

I suspect my critiques apply to any solution that does not assume data of a particular scale. :|

@jnothman
Member

So let's say we're applying this technique to a single feature.

  1. We want the user to be able to specify either (a) the overall probability p of making a value disappear; or (b) the number of values to disappear.
  2. We also want this to be biased towards extreme, high values, either (a) in proportion to their value, or (b) only in proportion to their rank. We want a single parameter to specify the extent of this preference for the extreme.
  3. We also want this to be non-deterministic.

Assume the feature x is approximately normally distributed:

  • x > stats.norm(0, 1).ppf(1 - p) will deterministically select the top 100p% of x in expectation
  • x > stats.norm(0, 1 + f(s)).ppf(1 - p) * np.exp(s * randn(len(x))) will randomly select 100p% of x in expectation with a preference for the top parametrised by s. I'm sure I should be able to work out what f is, and what the relationship is between s and the expected proportion of the top 50% of samples to be selected, but my statistics is rusty.

For similar reasons, don't ask me to do it where we want to hide both extrema, or have to handle non-normal features. Hmm. Maybe this isn't so simple.
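The first bullet's claim (a fixed normal quantile selects the top 100p% in expectation) is easy to verify numerically, assuming a standard-normal feature:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.randn(100_000)   # standard-normal feature
p = 0.1

thresh = stats.norm(0, 1).ppf(1 - p)   # ~1.2816 for p = 0.1
frac = (x > thresh).mean()
# frac is close to p, since x > ppf(1 - p) has probability p under N(0, 1)
```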

@jnothman
Member

jnothman commented Dec 14, 2016 via email

@raghavrv
Member Author

raghavrv commented Mar 4, 2017

What is the way forward? Is there any consensus on a simplified API? I'd really like to have this, as it would help in benchmarking and profiling a lot of the new missing-value methods that are "soon" to come...

For now I'd say let's stick to MCAR and add a ValueDropper transformer... I know it's pretty simple and doesn't warrant a whole class, but I feel it would be pretty useful even in its minimal state + we do plan on adding NMAR/MAR once we have an idea of a clean API... So are you okay with a super simple class doing MCAR for now + an open issue asking for that class to be extended to the NMAR and MAR cases? @amueller @jnothman
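The MCAR-only class proposed here might look like the following sketch. It is hypothetical: this transformer was never merged, and the name and parameters follow this PR rather than any released scikit-learn API (in scikit-learn proper it would subclass BaseEstimator/TransformerMixin):

```python
import numpy as np

class ValueDropper:
    """MCAR-only sketch of the proposed transformer: each entry of X is
    replaced by `missing_values` independently with probability
    `missing_proba` (a float, or an array of per-feature rates)."""

    def __init__(self, missing_proba=0.1, missing_values=np.nan,
                 random_state=None):
        self.missing_proba = missing_proba
        self.missing_values = missing_values
        self.random_state = random_state

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        rng = np.random.RandomState(self.random_state)
        X = np.asarray(X, dtype=float).copy()
        mask = rng.uniform(size=X.shape) < self.missing_proba
        X[mask] = self.missing_values
        return X

X = np.ones((1000, 2))
X_miss = ValueDropper(missing_proba=0.25, random_state=0).transform(X)
```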

@jnothman
Member

jnothman commented Mar 4, 2017 via email

@jnothman
Member

jnothman commented Mar 4, 2017 via email

@jnothman
Member

jnothman commented Mar 6, 2017 via email

@@ -0,0 +1,232 @@
import numpy as np

add author + license

@jnothman
Member

jnothman commented Sep 27, 2018

Closing in preference for working towards something like R's ampute.

6 participants