[MRG] Add KNN strategy for imputation #4844
Conversation
Thanks.
```diff
@@ -116,12 +124,13 @@ class Imputer(BaseEstimator, TransformerMixin):
         contain missing values).
     """
     def __init__(self, missing_values="NaN", strategy="mean",
-                 axis=0, verbose=0, copy=True):
+                 axis=0, verbose=0, copy=True, kneighbor=1):
```
I think it should be called n_neighbors, as in KNeighborsClassifier.
…lidean calculation method and add some tests
I just did a speed test using the make_classification function to create a dataset: 40% of samples have 1 missing feature, 40% of samples have 2 missing features, and the rest are complete. KNN imputation used 6 neighbors.
10000 samples, 20 features:
10000 samples, 200 features:
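(For reference, a rough sketch of how such a benchmark dataset could be constructed; the exact benchmark code isn't shown in the thread, and the sizes and fractions below just follow the description above.)

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.RandomState(0)
X, _ = make_classification(n_samples=10000, n_features=20, random_state=0)

n_samples, n_features = X.shape
rows = rng.permutation(n_samples)
# 40% of rows get one missing feature, 40% get two, the rest stay complete.
one_missing = rows[:int(0.4 * n_samples)]
two_missing = rows[int(0.4 * n_samples):int(0.8 * n_samples)]
for i in one_missing:
    X[i, rng.choice(n_features, 1, replace=False)] = np.nan
for i in two_missing:
    X[i, rng.choice(n_features, 2, replace=False)] = np.nan
```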
Btw, it would be cool to have a graph that shows the runtime for block queries with varying block size or varying number of features and samples.
Have a look at the cool plots here: #4852
There is also a test error on older numpy.
Well, I reckon my solution was fun but not clever. The better solution groups samples by their pattern of missing values. I won't write it out, but to pull out groupings you can either use a…
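(One possible way to pull out such groupings, sketched under the assumption that the intent is to group rows by their missing-value pattern; `np.unique(..., axis=0)` needs a reasonably recent NumPy.)

```python
import numpy as np

# Group rows by their pattern of missing values, so that every row in a group
# can be imputed with one vectorised distance computation over the same
# observed columns.
X = np.array([[1.0, np.nan, 3.0],
              [2.0, np.nan, 1.0],
              [np.nan, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
mask = np.isnan(X)
patterns, inverse = np.unique(mask, axis=0, return_inverse=True)
groups = [np.flatnonzero(inverse == g) for g in range(len(patterns))]
print(groups)  # rows 0 and 1 share a pattern; rows 2 and 3 each form their own group
```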
@jnothman do you think grouping by missing values would work in practice? I think it depends a lot on the number of samples and features whether the same pattern of missingness even happens twice... The stupidly batched version seems to work ok.
Well, we could bench it for different values of…
Making it feasible with non-Euclidean metrics isn't enough reason?
I'm not sure how common / important that is... and we can use…
Indeed, my suggestion as currently implemented could be simplified using…
I don't know whether grouping by set of missing values will be efficient in practice. There are obviously…
Yeah, I'm not sure what the right assumption on missingness would be. Theory usually assumes missing completely at random, but I would expect that not to be the case in practice. I have no source of actual data, so I'm not sure how to test this.
```diff
@@ -8,13 +8,14 @@
 from scipy import sparse
 from scipy import stats

+from ..neighbors import KDTree, NearestNeighbors
```
Could it be that this creates circular imports? I feel something odd is happening here. Do tests run on your box without any problems?
Yes, the tests run perfectly on my Mac. I can remove those lines and make another commit to try.
Can you also please fix the doctests (see the Travis output on Python 3)?
```python
X_sl = X[index_start: index_stop].copy()
mask_sl = _get_mask(X_sl, self.missing_values)
missing_index_sl = np.where(mask_sl.any(1))[0]
D2 = (X_sl[missing_index_sl, np.newaxis] - statistics) ** 2
```
The Travis error is because X_sl is already 2d, so you have to pass two indices in the older numpy. You want it to be 3d, right? Then it should be X_sl[missing_index_sl, :, np.newaxis].
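(A small sketch of the 3-d broadcast the quoted D2 line appears to be aiming for; this is my reading of the intent, not the PR author's exact fix. The two-step indexing makes the shapes explicit and also sidesteps the older-NumPy fancy-indexing issue mentioned above.)

```python
import numpy as np

# Compare every incomplete row of the slice against every row of `statistics`
# in one operation.
X_sl = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, np.nan],
                 [7.0, 8.0, 9.0]])
statistics = np.array([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])

missing_index_sl = np.where(np.isnan(X_sl).any(axis=1))[0]
# (n_incomplete, 1, n_features) - (n_candidates, n_features)
# broadcasts to (n_incomplete, n_candidates, n_features).
D2 = (X_sl[missing_index_sl][:, np.newaxis, :] - statistics) ** 2
print(D2.shape)  # (2, 2, 3)
```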
@jnothman That's a fair point. On a separate note, how would you recommend continuing the work on this pull request? It seems to have been inactive for a while, so presumably a lot of things outside this PR will have changed. What would be the ideal way forward? Thanks.
I don't think this PR has the right design or the right implementation, so you should basically start from scratch. The right design, IMO, is an extra parameter to…
I don't know if I understood you correctly, but wouldn't such an approach restrict our ability to work with "mixed" datasets that might, for instance, contain both continuous and nominal variables? Or is the idea here to see how things turn out with this approach first and consider a separate knn "strategy" in the future?
Standard KNN imputation, as far as I know, tends to be defined for Euclidean (or Minkowski) distances. If you're looking for something else, it's not technically KNN imputation, even if it involves KNN!
Ahh, I thought you literally meant Euclidean only. In any case, the reason I brought that up is the VIM R package (noted above), which uses the Gower distance for KNN imputation. Their implementation appears to allow handling of mixed datasets composed of "binary, categorical, ordered, continuous and semi-continuous" features. It would be interesting to know what you think about their methodology (if you have the time to take a look at the linked paper, that is). If that is not of interest, I'll proceed with what you have suggested above.
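(For illustration, a rough sketch of the Gower dissimilarity being discussed, for a single pair of mixed-type rows. The helper and its arguments are hypothetical, not VIM's code, and the handling of missing values is exactly the open question raised below, so it is omitted here.)

```python
import numpy as np

def gower_distance(x, y, is_categorical, ranges):
    # Categorical features contribute a simple mismatch indicator;
    # numeric features contribute |x - y| scaled by the feature's range.
    d = np.where(is_categorical,
                 (x != y).astype(float),
                 np.abs(x - y) / ranges)
    return d.mean()

# Example: two numeric features followed by one categorical code.
is_categorical = np.array([False, False, True])
ranges = np.array([10.0, 5.0, 1.0])  # ranges of the numeric features (last entry unused)
print(gower_distance(np.array([2.0, 1.0, 0.0]),
                     np.array([4.0, 3.0, 1.0]),
                     is_categorical, ranges))  # (0.2 + 0.4 + 1.0) / 3
```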
Okay. In theory, I'd appreciate any good NaN-handling metrics. Currently we have none implemented in scikit-learn, and implementing them may provide new clustering etc. options as well. I'll try to look at the paper, but tbh we're focused on a release atm and this won't make it in.
That Gower description looks fairly well established, though difficult to provide an API for. Still, that paper does not clearly talk about distance in variables that have a missing value.
Right. What I appreciated about their implementation has to do with how they handle mixed datasets, not so much their handling of missing covariates during imputation. In fact, when I looked at their R code, I ran into this comment: "# TODO: find better way to handle missing in distance variables", which presumably means their current approach is probably not worth pursuing. I am currently aware of two other ways of handling such situations. The first, the obvious one, is used in the "impute" R package from Bioconductor (which is also the original inspiration for this pull request, I believe) and measures distances over the existing variables: "Each candidate neighbor might be missing some of the coordinates used to calculate the distance. In this case we average the distance from the non-missing coordinates." A second method is an EM-based sequential KNN imputation in the vein of MICE (i.e. in the sense of using an iterative procedure), and I think you also allude to this algorithm above. Additionally, I believe the paper also examines a non-iterative, one-time sequential KNN imputation.
I am interested in this problem and have some time/energy to contribute. While using the euclidean metric on the non-null features scaled by the (square root of the) inverse of the number of features used seems better than not scaling, I'm not sure this is ideal. For example suppose we have the data u = (1, 1, nan, 5, 7), v = (1, 1, 2, nan, nan), w = (1, 1, 8, 4.9, 7). Should v or w be deemed closer to u? In my opinion w should. Perhaps there should be some parameter specifying the minimum number or proportion of non-null features that must be used in the distance calculation?
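(A minimal sketch of the scaled metric and the example above, assuming a hypothetical masked_euclidean helper that averages squared differences over the features observed in both rows.)

```python
import numpy as np

def masked_euclidean(x, y):
    # Euclidean distance over the features present in both vectors,
    # scaled by the square root of the inverse of the count used.
    present = ~np.isnan(x) & ~np.isnan(y)
    n_used = present.sum()
    if n_used == 0:
        return np.nan
    return np.sqrt(np.sum((x[present] - y[present]) ** 2) / n_used)

u = np.array([1, 1, np.nan, 5, 7])
v = np.array([1, 1, 2, np.nan, np.nan])
w = np.array([1, 1, 8, 4.9, 7])

print(masked_euclidean(u, v))  # 0.0   -- only the two matching features overlap
print(masked_euclidean(u, w))  # ~0.05 -- so v is ranked closer to u than w
```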
I tried normalising by the count of non-missing features at some point, but I don't recall it being a better estimate. It should be possible to evaluate different variant metrics by hiding features. But ideally we aim for something that is also consistent with the literature or existing implementations.
The paper's approach, as reflected in their latest R package (https://github.com/Bioconductor-mirror/impute/tree/release-3.5), is that if a given row is missing more than 50% of the columns (by default; the user can pass their own value), those rows are NOT used in KNN imputation and their own missing columns are imputed with the column means. Further, if any column has more than 80% of its values missing (again, an adjustable default), the program throws an error and simply terminates execution.
It sounds reasonable to have settings for such thresholds.
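(A rough sketch of what such threshold settings might look like; the rowmax/colmax names follow the R package's arguments as far as I know, and none of this is scikit-learn API.)

```python
import numpy as np

def check_missingness(X, rowmax=0.5, colmax=0.8):
    """Apply the thresholds described above before KNN imputation."""
    mask = np.isnan(X)
    # Abort if any column is missing more than `colmax` of its values.
    if np.any(mask.mean(axis=0) > colmax):
        raise ValueError("a column exceeds %.0f%% missing values" % (100 * colmax))
    # Rows missing more than `rowmax` of their columns fall back to
    # column-mean imputation and are excluded from the KNN search.
    fallback_rows = mask.mean(axis=1) > rowmax
    return fallback_rows
```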
Hi all, you might have noted that I recently added a pull request that implements a kNN imputation algorithm. I would very much welcome and appreciate any feedback/suggestions/criticism you might have regarding the code. Also, it would be super if anybody wanted to join in, as the development is not yet complete.
Nice work. I also started working on this (https://github.com/ekorman/scikit-learn/blob/knn-impute/sklearn/metrics/pairwise.py), but my strategy was to change euclidean_distances in pairwise.py to allow for null entries and then use sklearn.neighbors to do the imputation. I think it may be useful to have this generality, since presumably there are cases beyond imputation where it would be useful to use KNN on data with some null values.
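(A rough sketch of that idea under stated assumptions: the nan_pairwise_euclidean helper below is hypothetical, not the code in the linked branch; it simply plugs a NaN-aware distance matrix into the existing NearestNeighbors machinery via metric="precomputed".)

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nan_pairwise_euclidean(X):
    # For each pair of rows, use only the features observed in both rows
    # and rescale by the number of features actually used.
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            present = ~np.isnan(X[i]) & ~np.isnan(X[j])
            d = np.sqrt(np.mean((X[i, present] - X[j, present]) ** 2))
            D[i, j] = D[j, i] = d
    return D

X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.0],
              [0.9, np.nan, 3.2],
              [5.0, 6.0, 7.0]])

# Reuse the existing neighbors machinery through a precomputed matrix.
nn = NearestNeighbors(n_neighbors=2, metric="precomputed").fit(nan_pairwise_euclidean(X))
dist, ind = nn.kneighbors()
print(ind)  # neighbor indices per row, which a KNN imputer could then average over
```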
Oh, cool! I don't know the approach you are taking, but the matrix formula used in my PR for calculating the Euclidean distance between all vector pairs in a matrix with NaN values (or between vectors of one matrix and those of another matrix with NaNs) might be of interest to your work (it is contained in the function _get_knn()). In any case, I would be happy to change my code to incorporate the "NaN-enabled" sklearn.neighbors after you finish working on it.
Without looking at the code, I would prefer a metric implementation in metrics.pairwise, as it makes it easy to reuse the existing nearest neighbours machinery.
Agreed. As @ekorman noted, the incorporation of a NaN-handling pairwise distance calculator is useful beyond this particular use case.
@ekorman By the way, please feel free to let me know if I can help wrap things up faster, unless of course the remaining workload isn't much and/or you prefer working by yourself.
Hey @ekorman, sorry to bother you again, but I was just wondering whether you are still working on the NaN-enabled pairwise distance calculator. If it is not feasible for you to work on it at the moment, please let me know and I can give it a try. Thanks.
@ashimb9 Sorry for not responding sooner; some stuff came up, but I should be able to start thinking about this again in the next few days. I definitely welcome any help. However, I think the code I wrote to compute the distances when there are NaNs is more or less the same code you wrote.
No worries man. Let me know how I can help.
@jnothman Hey Joel, as per our discussion here, I just created a PR which implements a NaN-enabled Euclidean distance calculator within sklearn.metrics and leverages it to get the nearest neighbors. I look forward to your feedback. @ekorman Hi Eric, since I have not heard back from you in a while, I went ahead with the modification. I really had to get the ball rolling on this one, as it has been stalled for a while. I hope you don't mind.
Yes, okay. Current PR is at #9212.
Trying to solve #2989.
The imputation strategy is the following:
Find all rows with all features present and use them as the complete set; impute the missing features of a row by taking the mean of those features among its K nearest neighbors in the complete set. sklearn.neighbors.NearestNeighbors is used to find the neighbors. Currently it can only handle dense matrices.
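(For context, a minimal sketch of this strategy; this is not the PR's actual code, which batches the distance computations, and the knn_impute helper is just illustrative.)

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_impute(X, n_neighbors=3):
    # Treat the fully observed rows as the complete set and fill each missing
    # entry with the mean of that feature among the row's nearest complete rows.
    X = np.array(X, dtype=float, copy=True)
    mask = np.isnan(X)
    complete = X[~mask.any(axis=1)]
    for i in np.flatnonzero(mask.any(axis=1)):
        observed = ~mask[i]
        # Distances are computed on the observed features only
        # (one query per row here, for clarity).
        nn = NearestNeighbors(n_neighbors=n_neighbors).fit(complete[:, observed])
        _, ind = nn.kneighbors(X[i, observed].reshape(1, -1))
        X[i, mask[i]] = complete[ind[0]][:, mask[i]].mean(axis=0)
    return X
```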
Based on this paper: http://web.stanford.edu/~hastie/Papers/missing.pdf
Similar package for R: impute (http://www.bioconductor.org/packages/release/bioc/manuals/impute/man/impute.pdf)
Similar function for MATLAB: knnimpute (http://www.mathworks.com/help/bioinfo/ref/knnimpute.html)
Still working on documentation and examples.