[MRG+2-1] ENH add a ValueDropper to artificially insert missing values (NMAR or MCAR) to the dataset #7084
Conversation
I don't think this should live in the ...
```python
for label in labels:
    label_mask[y == label] = True

# The logic of MCAR/MNAR is implemented here
```
write out acronyms at least once.
BTW what do you think about the function in general? Will it be useful outside that example? (Say, would there be a use case where people would want to drop values randomly and impute to reduce noise?)
Wow, that was some super quick reviewing ;) Thx

@amueller The main motive of this function is to help generate MNAR (incrementally) and to help benchmark / illustrate the missing-value strategies that we will be adding in the future (the immediate use case will be in using it for an example to illustrate ...).

Yeah, so why not restrict it to MNAR and simplify the logic? Maybe the logic needs to be this complicated; the function just seems very long. I totally agree that this is something very helpful for benchmarking. I just don't like code ;)
```python
n_unique_labels = len(unique_labels)

if labels is None:
    # Labels is an int specifying the no of labels to correlate
```
no = number
Sure, I'll attempt simplifying the logic. By restricting it to MNAR, do you mean let's not handle MCAR? If so, MCAR is just a special case when ... If instead you meant let's just generate pure MNAR (...), I observed something similar in the census income dataset.

I meant removing the special case for MCAR, but it depends on whether that actually simplifies the logic.
```
----------

X : ndarray like of shape (n_features, n_samples)

```
remove blank line
I've still not cleaned up / addressed some of the comments. But most of the comments got hidden as I moved the file from ...

For regression... hmm, I think we can change ...

I've added an example. Please let me know if the motive is justified.
The reason why I reset the random state after randomly choosing labels is to make sure that the missing values generated - when labels are randomly chosen and when those randomly chosen labels are explicitly passed - are the same.

```python
X1, y, labels = drop_values(X, y, ..., labels=None, n_labels=1, return_labels=True, copy=True)  # 1
X2, y = drop_values(X, y, ..., labels=labels, copy=True)  # 2

# This will happen only if the random seed at the time of generation of the missing values
# at #1 is the same as that of #2. Generating labels at random will modify the random state
# and hence it needs to be reset before generating the missing values.
X1 == X2
```
This is not working if the user passed a ...

Should I make a copy for that ...?

I guess you could do that, but I'm not sure the contract you're trying to establish is that useful. Can you give an example of when one would need it?

Okay. I've removed this... The use case is a very specific one that I ran into myself. It may not be so useful for all.

Please ping once you want another round.

OK, I'll remove the ...

I think that @amueller and I are quite enthusiastic about having a helper, somewhere, that actually simulates MNAR as relevant to social science settings, and perhaps others. The main challenge is making it believable or flexible enough to be "standard" in some way.
Thanks for the comments. @tguillemot and I discussed this a bit IRL.

What if we have:

```python
X, y = ...
X_dropped = drop_values(X,
                        ranking_func={0: lambda X_feat0: np.exp(-(X_feat0 ** 2) / 2),
                                      1: lambda X_feat1: X_feat1,
                                      2: lambda X_feat2: -X_feat2,
                                      3: lambda X_feat3: (X_feat3 > 55) * 1 + (X_feat3 < 55) * 0.3},
                        missing_proba=[0.1, 0.3, 0.2, 0.2],
                        random_state=0)
```

The above code snippet would ... (Optionally we thought about allowing a single function to be passed instead of a dict of (max) ... functions.)

This functionality could be illustrated with a nice example covering various possible ranking functions (simulating various distributions: top 10%, bottom 10%, values that are too far from the mean, values too far from a particular measure, etc.)... If this is okay, then we can extend our use case to the 3rd type of missing values (MAR) too. MAR (Missing At Random) is when a value is dropped based on non-missing feature values... (Say, if you answer one question you are likely to not answer the others...)
I don't understand what that means. "give 10% of values in feature 0"? While I think that having something might be valuable, I look at the code and your API proposal and they seem very complicated. Consider these functions:

```python
import numpy as np
# `softmax` is assumed to be available, e.g. sklearn.utils.extmath.softmax (row-wise).
from sklearn.utils.extmath import softmax

def mcar(X, probs=.1):
    X = X.copy()
    X[np.random.uniform(size=X.shape) < probs] = np.NaN
    return X

def mnar(X, probs=.1):
    X = X.copy()
    sm = softmax(X.T).T
    mask = np.random.uniform(size=X.shape) < sm * probs * X.shape[0]
    X[mask] = np.NaN
    return X

def mar(X, probs=.1):
    X = X.copy()
    # Random weight matrix mixing the features.
    weights = np.random.normal(size=(X.shape[1], X.shape[1]))
    sm = softmax(np.dot(X, weights).T).T
    mask = np.random.uniform(size=X.shape) < sm * probs * X.shape[0]
    X[mask] = np.NaN
    return X
```

I think they are reasonable (though pretty linear) and took me 1 minute to write down. Note that ... If your whole PR only implements the first one, which is literally one line, that seems not worth the effort. If it also implements mnar it might be more worth it, but then I feel it should clearly have some benefit over the (written pretty verbosely above) ca. two-line softmax solution I provided.
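As a quick usage illustration (not from the thread), assuming the helpers above and a row-wise `softmax`:

```python
import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
np.random.seed(0)

# Compare how much each mechanism actually drops at probs=0.1.
for drop in (mcar, mnar, mar):
    X_missing = drop(X, probs=0.1)
    print(drop.__name__, "->", round(float(np.isnan(X_missing).mean()), 3), "fraction missing")
```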
Sorry! That was a typo, I meant drop...

@amueller So what is your suggestion on how to proceed?

(With much thanks to @tguillemot) I have a gist with an alternative API that is a slight extension of your 1-min script and will handle label-based or feature-based NMAR: https://gist.github.com/raghavrv/3237a1753077b0e3bbc3aef870315771 @amueller @jnothman @agramfort could you look at that and see if that API is simpler and also not too trivial? If so, I can proceed along those lines...

(Currently the probabilities do not directly translate to missing percentages, but I think with some scaling tricks we can achieve that...)
I think we've concluded label-based masks are a bad idea: they mean that your prediction task is too influenced by the missingness (which sometimes is the case, but it's too easy for users to draw bad conclusions).

Softmax is not appropriate. Here it looks like you are normalising across features in each sample. Features should be treated independently, or, if anything, missingness should be correlated across features in a sample, but softmax does the opposite: if there is an outlyingly high value, it will be removed and, as a result of its outlyingness, the others will all not be removed. If all feature values are small, then small differences become magnified under the normalisation. You're better off just applying a logistic to each feature independently, as @amueller suggested.

I don't mind the simplified interface.
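A rough sketch of the "logistic per feature" idea (my own interpretation, not code from the thread): standardise each feature, pass it through a logistic to get per-value drop probabilities, and threshold against uniform noise, treating each feature independently.

```python
import numpy as np
from scipy.special import expit  # the logistic function

def drop_nmar_logistic(X, missing_rate=0.1, seed=0):
    """Sketch: per-feature NMAR where larger values are more likely to be dropped."""
    rng = np.random.RandomState(seed)
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        z = (col - col.mean()) / (col.std() + 1e-12)  # standardise so the logistic is meaningful
        p = expit(z)                                  # larger values -> larger drop probability
        p *= missing_rate * len(col) / p.sum()        # rescale so ~missing_rate is dropped in expectation
        X[rng.uniform(size=len(col)) < np.clip(p, 0, 1), j] = np.nan
    return X
```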
Thanks for the comment!

I did a softmax over the samples so the desired number of missing values per feature is reached in expectation.
The current code is doing a softmax over features, though, isn't it?

Softmax over samples still doesn't mean that you can use a constant threshold to yield a constant proportion of samples? It also leaks data from test to training, or, if it is applied to test independently, it is nonsense.

If a specific probability of missingness is sought, subtracting a random value from independent logistics over rank-scaled features should be fine (or not rank-scaled, depending on how we define the sought probability). If a specific quantity of missingness is sought, then an argpartition over the same statistics will work.
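One way to read that last suggestion, sketched here as an assumption rather than as anyone's actual code: rank-scale each feature, apply a logistic, subtract uniform noise, and use argpartition to drop an exact per-feature quantity.

```python
import numpy as np
from scipy.special import expit
from scipy.stats import rankdata

def nmar_fixed_quantity(X, missing_rate=0.1, seed=0):
    """Sketch: drop an exact fraction per feature, preferring high-ranked values."""
    rng = np.random.RandomState(seed)
    X = np.asarray(X, dtype=float).copy()
    n_samples = X.shape[0]
    n_drop = int(missing_rate * n_samples)
    if n_drop == 0:
        return X
    for j in range(X.shape[1]):
        ranks = rankdata(X[:, j]) / n_samples                     # rank-scale to (0, 1]
        score = expit(4 * (ranks - 0.5)) - rng.uniform(size=n_samples)
        drop_idx = np.argpartition(-score, n_drop - 1)[:n_drop]   # the n_drop largest scores
        X[drop_idx, j] = np.nan
    return X
```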
Hm true, not sure what I was thinking.... Can you elaborate on the logistic + random value?

Let me think about it a little more :)

I think I might have written a bit of nonsense above.

I suspect my critiques apply to any solution that does not assume data of a particular scale. :|
So let's say we're applying this technique to a single feature. Assume the feature ...

For similar reasons, don't ask me to do it where we want to hide both extrema, or have to handle non-normal features. Hmm. Maybe this isn't so simple.
and I think my use of an exponential there was not quite right...
What is the way forward? Any consensus on a simplified API? I'd really like to have this, as it would help in benchmarking and profiling a lot of new missing-value methods that are "soon" to come...

For now I'd say let's stick to MCAR and add a ValueDropper transformer... I know it's pretty simple and doesn't warrant a whole class, but I feel it would be pretty useful even in its minimal state + we do plan on adding NMAR / MAR after we have an idea of a clean API... So are you okay with a super simple class doing MCAR for now + an open issue asking for that class to be extended to the NMAR and MAR cases? @amueller @jnothman
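For reference, a minimal sketch of what such an MCAR-only transformer could look like; the class name and parameters here are illustrative, not the PR's actual API:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array, check_random_state

class MCARDropper(BaseEstimator, TransformerMixin):
    """Illustrative MCAR-only dropper: each value is set to NaN independently."""

    def __init__(self, missing_rate=0.1, random_state=None):
        self.missing_rate = missing_rate
        self.random_state = random_state

    def fit(self, X, y=None):
        return self  # stateless; nothing to learn

    def transform(self, X):
        X = check_array(X, copy=True, dtype=float)
        rng = check_random_state(self.random_state)
        X[rng.uniform(size=X.shape) < self.missing_rate] = np.nan
        return X
```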
Can you be specific in describing what kinds of transformations are possible with the simplified ValueDropper?
I might ask some stats colleagues...
More philosophising. Perhaps we could consider this NMAR generation task as: sample a fixed proportion of elements from a 1d array x, whose quantiles in x approximate some specified (i.e. parameters given) beta distribution. The beta is a good distribution because it is bounded and can be parametrised to create different skews (i.e. preferring to drop extreme values more, less, at both or at one extreme). This could be performed by some heuristic or greedy search to minimize KL(beta(a, b) || sample quantiles). Is there a better way to perform this sampling? Is there a more user-friendly parametrisation of beta?
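One simple way to approximate this without an explicit KL search, sketched here as an assumption rather than a worked-out proposal: sample the target quantiles directly from Beta(a, b) and drop the elements at (roughly) those ranks.

```python
import numpy as np

def drop_by_beta_quantiles(x, missing_rate=0.1, a=5.0, b=1.0, seed=0):
    """Sketch: drop elements of a 1d array x whose quantiles roughly follow Beta(a, b).

    a > b skews the dropped values towards the high end, a < b towards the low end,
    and a = b < 1 towards both extremes.
    """
    rng = np.random.RandomState(seed)
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    n_drop = int(missing_rate * n)
    # Sample target quantiles from Beta(a, b) and map them to ranks in x.
    target_ranks = np.minimum((rng.beta(a, b, size=n_drop) * n).astype(int), n - 1)
    order = np.argsort(x)                      # order[r] = index of the element with rank r
    drop_idx = np.unique(order[target_ranks])  # duplicate ranks collapse, so slightly fewer may drop
    x[drop_idx] = np.nan
    return x
```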
```
@@ -0,0 +1,232 @@
import numpy as np
```
add author + license
Closing in preference for working towards something like R's ampute.
Fixes #6284

A `ValueDropper` transformer class that has a `missing_distribution` attribute. This can be: ...

Refer to this example for a better idea.

... `ValueDropper` to introduce missing values. I'd add an example for #5974 which will use this function to compare the missing value handling strategies for MCAR/MNAR missingness.
@agramfort @GaelVaroquaux @amueller @jnothman @glouppe @MechCoder
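To make the intended benchmarking use concrete, here is an illustrative sketch (not part of the PR) that drops values MCAR with the `mcar` helper from earlier in the thread and then compares imputation strategies; `SimpleImputer` from current scikit-learn is used purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
np.random.seed(0)
X_missing = mcar(X, probs=0.2)  # MCAR helper defined earlier in this thread

# Compare simple imputation strategies on the artificially degraded data.
for strategy in ("mean", "median", "most_frequent"):
    pipe = make_pipeline(SimpleImputer(strategy=strategy), LogisticRegression(max_iter=1000))
    score = cross_val_score(pipe, X_missing, y, cv=5).mean()
    print(strategy, "->", round(float(score), 3))
```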