[MRG+2-1] ENH add a ValueDropper to artificially insert missing values (NMAR or MCAR) to the dataset #7084

Closed
wants to merge 18 commits

Conversation

raghavrv
Member

@raghavrv raghavrv commented Jul 26, 2016

Fixes #6284

A ValueDropper transformer class with a missing_distribution parameter.

This can be:

  • a float specifying the rate of missing values (MCAR missingness),
  • a 1D array of probabilities, one per feature (MCAR again), or
  • for NMAR, a dict (of floats / 1D arrays of per-feature probabilities) mapping each label to the required distribution of missing values for the samples belonging to that label.

Refer to this example for a better idea.
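The dict form described above (label-dependent missing rates, i.e. NMAR via the label) can be sketched in a few lines. `drop_values_nmar` is an illustrative name, not the PR's actual code:

```python
import numpy as np

def drop_values_nmar(X, y, rates_per_label, random_state=0):
    """Sketch of the dict form: for each class label, drop values at the
    given per-feature rates, but only in samples of that label."""
    rng = np.random.RandomState(random_state)
    X = X.astype(float, copy=True)
    for label, rates in rates_per_label.items():
        row_mask = (y == label)
        # Bernoulli draw per entry of this label's block, one rate per feature
        drop = rng.uniform(size=(row_mask.sum(), X.shape[1])) < np.asarray(rates)
        block = X[row_mask]
        block[drop] = np.nan
        X[row_mask] = block
    return X

X = np.ones((6, 2))
y = np.array([0, 0, 0, 1, 1, 1])
# Drop all of feature 0 in class 0; leave class 1 and feature 1 untouched
X_miss = drop_values_nmar(X, y, {0: [1.0, 0.0]})
```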

  • ValueDropper to introduce missing values.
  • Simplify and cleanup.
  • Add example.
  • Tests/doc

I'd also add an example for #5974, which will use this function to compare missing-value handling strategies under MCAR/MNAR missingness.

@agramfort @GaelVaroquaux @amueller @jnothman @glouppe @MechCoder

@amueller
Member

I don't think this should live in the preprocessing module. This is not for preprocessing but for dataset generation. Would a user ever use this (I can see it being used for benchmarking), or is it just for our own testing?
Depending on that, I'd either put it in utils.testing or datasets.

for label in labels:
    label_mask[y == label] = True

# The logic of MCAR/MNAR is implemented here
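As a side note, the loop in this hunk has a one-line vectorized equivalent (a sketch, not part of the PR):

```python
import numpy as np

y = np.array([0, 1, 2, 1, 0, 2])
labels = [0, 2]

# Loop version, as in the diff hunk above
label_mask = np.zeros(y.shape, dtype=bool)
for label in labels:
    label_mask[y == label] = True

# Vectorized equivalent
assert np.array_equal(label_mask, np.isin(y, labels))
```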

write out acronyms at least once.

@raghavrv
Member Author

raghavrv commented Jul 26, 2016

sklearn.datasets seems like a better choice. Thanks.

BTW, what do you think about the function in general? Will it be useful outside that example? (Say, would there be a use case where people would want to drop values randomly and then impute them to reduce noise?)

@raghavrv
Member Author

raghavrv commented Jul 26, 2016

Wow that was some super quick reviewing ;) Thx

@raghavrv
Member Author

raghavrv commented Jul 26, 2016

@amueller The main motive of this function is to help generate MNAR (incrementally) and to help benchmark/illustrate the missing-value strategies we will be adding in the future (the immediate use case is an example illustrating RandomForest's handling).

@amueller
Member

yeah so why not restrict it to MNAR and simplify the logic? Maybe the logic needs to be this complicated, but the function just seems very long. I totally agree that this is something very helpful for benchmarking. I just don't like code ;)

n_unique_labels = len(unique_labels)

if labels is None:
    # Labels is an int specifying the no of labels to correlate

no = number

@raghavrv
Member Author

yeah so why not restrict it to MNAR and simplify the logic?

Sure, I'll attempt to simplify the logic.

By restricting it to MNAR, do you mean we shouldn't handle MCAR? If so, MCAR is just the special case label_correlation=0.

If instead you meant generating pure MNAR (label_correlation=1)... that doesn't seem to happen in real cases. For instance, in the Titanic dataset the cabin and age values go missing in a way that appears correlated with the survival of the passenger. In that case the distribution is as follows:

[image: distribution of missing cabin/age values by passenger survival]

I observed something similar in the census income dataset.

@amueller
Member

I meant removing the special case for MCAR, but depends on whether that actually simplifies the logic.

----------

X : ndarray like of shape (n_features, n_samples)


remove blank line

@raghavrv
Member Author

I still haven't cleaned up / addressed some of the comments. But most of the comments got hidden when I moved the file from preprocessing to datasets.

Can this be used for regression where label_correlation=0

For regression... hmmm, I think we can change label to target and add a deviation term, so that values go missing when the target is within that deviation of the specified targets_to_correlate.

I don't get this description. The lack of examples does not help. Do you mean to say that whether or not any missing elements already exist, this will be modified in-place?

I've added an example. Please let me know if the motive is justified.

Comment on why I am resetting the random state after choosing label(s) at random

The reason I reset the random state after randomly choosing labels is to make sure that the missing values generated are the same whether the labels are chosen at random or those same randomly chosen labels are passed explicitly:

X1, y, labels = drop_values(X, y, ..., labels=None, n_labels=1,
                            return_labels=True, copy=True)  # 1
X2, y = drop_values(X, y, ..., labels=labels, copy=True)    # 2

# X1 == X2 holds only if the random state at the time the missing values are
# generated in #1 is the same as in #2. Generating labels at random advances
# the random state, which therefore needs to be reset before generating the
# missing values.
X1 == X2

@amueller
Member

The reason why I reset the random state after randomly choosing labels is to make sure that missing values generated - when labels are randomly chosen and when those randomly chosen labels are explicitly passed - are same.

This won't work if the user passes a RandomState object, as @jnothman observed.

@raghavrv
Member Author

Should I make a copy for that RandomState object then?
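Copying the RandomState as discussed here could look like the following (a minimal sketch of the idea, not code from the PR):

```python
import copy
import numpy as np

rng = np.random.RandomState(42)
rng_copy = copy.deepcopy(rng)  # snapshot of the generator's current state

a = rng.uniform(size=5)
b = rng_copy.uniform(size=5)
# The copy replays the same stream, so the draws match without reseeding.
```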

@amueller
Member

I guess you could do that but I'm not sure if the contract you're trying to establish is that useful. Can you give an example when one would need it?

@raghavrv
Member Author

raghavrv commented Jul 28, 2016

Okay. I've removed this... The use case is a very specific one that I ran into myself. It may not be so useful for all.

@amueller
Member

please ping once you want another round.

@raghavrv
Member Author

raghavrv commented Aug 1, 2016

Ok, I'll remove the missing-mask-in / missing-mask-out functionality; the same thing can be done by simply passing missing_mask as X and missing_values=True. That would pare the function down to the essentials.

@amueller
Member

amueller commented Dec 7, 2016

@raghavrv your sorting approach is deterministic, so that's not good. I think @jnothman and I have similar ideas. I think @jnothman's might be using RankScaler (#2176) followed by my logistic thing?

@jnothman
Member

jnothman commented Dec 8, 2016

I think that @amueller and I are quite enthusiastic about having a helper, somewhere, that actually simulates MNAR as relevant to social science settings, and perhaps others. The main challenge is making it believable or flexible enough to be "standard" in some way.

@raghavrv
Member Author

raghavrv commented Dec 9, 2016

Thanks for the comments. @tguillemot and I discussed this a bit IRL.

@raghavrv your sorting approach is deterministic

What if we have ranking_func return values between 0 and 1 depending on the value of X for each feature?

X, y = ...
X_dropped = drop_values(X,
                        ranking_func={0: lambda X_feat0: np.exp(-(X_feat0 ** 2) / 2),
                                      1: lambda X_feat1: X_feat1,
                                      2: lambda X_feat2: -X_feat2,
                                      3: lambda X_feat3: (X_feat3 > 55) * 1 + (X_feat3 < 55) * 0.3},
                        missing_proba=[0.1, 0.3, 0.2, 0.2],
                        random_state=0)

The above code snippet would ~~give~~ drop 10% of the values in feature 0 (giving higher precedence to values close to 0), 30%/20% of the values in features 1/2 (giving higher precedence to high/low values respectively), and 20% of the values in feature 3 (with a precedence that looks something like the RankScaler PR)...

(Optionally, we thought about allowing a single function to be passed instead of a dict of (at most) n_features functions... That single function would return one continuous rank matrix for the whole X...)

This functionality could be illustrated with a nice example covering various possible ranking functions (simulating distributions like the top 10%, the bottom 10%, values far from the mean, values far from some particular measure, etc.)...

If this is okay, we can extend our use case to the third type of missingness, MAR, too. MAR (Missing At Random) is when a value is dropped based on other, non-missing feature values... (the "if you answer one question you're likely not to answer the others" kind...)
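A minimal sketch of how such a ranking_func-driven dropper could work (`drop_values_ranked` is a hypothetical name; entries are sampled with probability proportional to their ranking score, which keeps the choice non-deterministic while honouring the per-feature drop fractions exactly):

```python
import numpy as np

def drop_values_ranked(X, ranking_func, missing_proba, random_state=0):
    """For each feature j, drop a fraction missing_proba[j] of entries,
    choosing entries with probability proportional to ranking_func[j]'s
    (non-negative) score for each value."""
    rng = np.random.RandomState(random_state)
    X = X.astype(float, copy=True)
    n_samples = X.shape[0]
    for j, func in ranking_func.items():
        scores = np.asarray(func(X[:, j]), dtype=float)
        probs = scores / scores.sum()  # precedence -> sampling weights
        n_drop = int(round(missing_proba[j] * n_samples))
        idx = rng.choice(n_samples, size=n_drop, replace=False, p=probs)
        X[idx, j] = np.nan
    return X

X = np.random.RandomState(0).randn(100, 2)
X_miss = drop_values_ranked(
    X,
    ranking_func={0: lambda x: np.exp(-x ** 2 / 2),    # prefer values near 0
                  1: lambda x: x - x.min() + 1e-9},    # prefer high values
    missing_proba=[0.1, 0.3],
)
```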

@amueller
Member

amueller commented Dec 9, 2016

The above code snippet would give 10% of values in feature 0, giving higher precedence to values close to 0, 30%/20% of values in feature 1/2 (giving higher precedence to high/low values) and 20% of values in feature 3 (giving precedence that looks something like the rankscaler PR)...

I don't understand what that means. "give 10% of values in feature 0"?

While I think that having something might be valuable, I look at the code and your API proposal and they seem very complicated. Consider these functions:

import numpy as np
from scipy.special import softmax  # used by mnar/mar below

def mcar(X, probs=.1):
    X = X.copy()
    X[np.random.uniform(size=X.shape) < probs] = np.nan
    return X

def mnar(X, probs=.1):
    X = X.copy()
    # softmax over the samples of each feature
    sm = softmax(X.T, axis=1).T
    mask = np.random.uniform(size=X.shape) < sm * probs * X.shape[0]
    X[mask] = np.nan
    return X

def mar(X, probs=.1):
    X = X.copy()
    # a random projection of the other features decides what goes missing
    weights = np.random.normal(size=(X.shape[1], X.shape[1]))
    sm = softmax(np.dot(X, weights).T, axis=1).T
    mask = np.random.uniform(size=X.shape) < sm * probs * X.shape[0]
    X[mask] = np.nan
    return X

I think they are reasonable (though pretty linear) and took me 1 minute to write down. Note that probs can be an array of length n_features. They are easy to understand and should be sufficient to get relatively realistic data, I think.

If your whole PR only implements the first one, which is literally one line, that doesn't seem worth the effort. If it also implements mnar it might be more worth it, but then I feel it should have some clear benefit over the (written pretty verbosely above) ca. two-line softmax solution I provided.
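The "in expectation" behaviour of the softmax-thresholding trick above can be checked numerically. This is a self-contained sketch (re-deriving the softmax inline); note the realized rate falls slightly below p because the per-entry drop probability is capped at 1:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5000, 4)
p = 0.1

# Per-feature softmax over the samples: each column of sm sums to 1
e = np.exp(X - X.max(axis=0))
sm = e / e.sum(axis=0)

# Each entry is dropped with probability min(1, sm * p * n_samples),
# so the expected drop fraction per feature is approximately p.
mask = rng.uniform(size=X.shape) < sm * p * X.shape[0]
rate = mask.mean()
```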

@raghavrv
Member Author

raghavrv commented Dec 9, 2016

"give 10% of values in feature 0"?

Sorry! That was a typo, I meant drop...

@raghavrv
Member Author

raghavrv commented Dec 9, 2016

@amueller So what is your suggestion on how to proceed?

@raghavrv
Member Author

(With much thanks to @tguillemot) I have a gist with an alternative API that is a slight extension of your one-minute script and handles label-based or feature-based NMAR: https://gist.github.com/raghavrv/3237a1753077b0e3bbc3aef870315771

@amueller @jnothman @agramfort could you look into that and see whether that API is simpler, yet not too trivial? If so I can proceed along those lines...

@raghavrv
Member Author

(Currently the probabilities do not directly translate to missing percentages, but I think with some scaling tricks we can achieve that...)

@jnothman
Member

jnothman commented Dec 12, 2016 via email

@raghavrv
Member Author

Thanks for the comment!

@amueller
Member

amueller commented Dec 13, 2016

I did a softmax over the samples so the desired number of missing values per feature is reached in expectation.

@jnothman
Member

jnothman commented Dec 13, 2016 via email

@amueller
Member

Softmax over samples still doesn't mean that you can use a constant threshold to yield a constant proportion of samples?

Hm true, not sure what I was thinking....
Can you elaborate on the logistic + random value?

@jnothman
Member

Let me think about it a little more :)

@jnothman
Member

I think I might have written a bit of nonsense above.

@jnothman
Member

I suspect my critiques apply to any solution that does not assume data of a particular scale. :|

@jnothman
Member

So let's say we're applying this technique to a single feature.

  1. We want the user to be able to specify either (a) the overall probability p of making a value disappear; or (b) the number of values to disappear.
  2. We also want this to be biased towards extreme, high values, either (a) in proportion to their value, or (b) only in proportion to their rank. We want a single parameter to specify the extent of this preference for the extreme.
  3. We also want this to be non-deterministic.

Assume the feature x is approximately normally distributed:

  • x > stats.norm(0, 1).ppf(1 - p) will deterministically select the top 100p% of x in expectation
  • x > stats.norm(0, 1 + f(s)).ppf(1 - p) * np.exp(s * randn(len(x))) will randomly select 100p% of x in expectation with a preference for the top parametrised by s. I'm sure I should be able to work out what f is, and what the relationship is between s and the expected proportion of the top 50% of samples to be selected, but my statistics is rusty.

For similar reasons, don't ask me to do it where we want to hide both extrema, or have to handle non-normal features. Hmm. Maybe this isn't so simple.
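The first bullet's claim (a fixed normal quantile selects the top 100p% in expectation) is easy to verify numerically, assuming a standard-normal feature:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.randn(100_000)   # standard-normal feature
p = 0.1

thresh = stats.norm(0, 1).ppf(1 - p)   # ~1.2816 for p = 0.1
frac = (x > thresh).mean()
# frac is close to p, since x > ppf(1 - p) has probability p under N(0, 1)
```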

@jnothman
Member

jnothman commented Dec 14, 2016 via email

@raghavrv
Member Author

raghavrv commented Mar 4, 2017

What is the way forward? Is there any consensus on a simplified API? I'd really like to have this, as it would help in benchmarking and profiling a lot of the new missing-value methods that are "soon" to come...

For now I'd say let's stick to MCAR and add a ValueDropper transformer... I know it's pretty simple and doesn't warrant a whole class, but I feel it would be pretty useful even in its minimal state + we do plan on adding NMAR/MAR once we have an idea of a clean API... So are you okay with a super simple class doing MCAR for now + an open issue asking for that class to be extended to the NMAR and MAR cases? @amueller @jnothman
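The MCAR-only class proposed here might look like the following sketch. It is hypothetical: this transformer was never merged, and the name and parameters follow this PR rather than any released scikit-learn API (in scikit-learn proper it would subclass BaseEstimator/TransformerMixin):

```python
import numpy as np

class ValueDropper:
    """MCAR-only sketch of the proposed transformer: each entry of X is
    replaced by `missing_values` independently with probability
    `missing_proba` (a float, or an array of per-feature rates)."""

    def __init__(self, missing_proba=0.1, missing_values=np.nan,
                 random_state=None):
        self.missing_proba = missing_proba
        self.missing_values = missing_values
        self.random_state = random_state

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        rng = np.random.RandomState(self.random_state)
        X = np.asarray(X, dtype=float).copy()
        mask = rng.uniform(size=X.shape) < self.missing_proba
        X[mask] = self.missing_values
        return X

X = np.ones((1000, 2))
X_miss = ValueDropper(missing_proba=0.25, random_state=0).transform(X)
```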

@jnothman
Member

jnothman commented Mar 4, 2017 via email

@jnothman
Member

jnothman commented Mar 4, 2017 via email

@jnothman
Member

jnothman commented Mar 6, 2017 via email

@@ -0,0 +1,232 @@
import numpy as np

add author + license

@jnothman
Member

jnothman commented Sep 27, 2018

Closing in preference for working towards something like R's ampute.

6 participants