[MRG] Add KNN strategy for imputation #4844
Conversation
Thanks.
```diff
@@ -116,12 +124,13 @@ class Imputer(BaseEstimator, TransformerMixin):
         contain missing values).
     """
     def __init__(self, missing_values="NaN", strategy="mean",
-                 axis=0, verbose=0, copy=True):
+                 axis=0, verbose=0, copy=True, kneighbor=1):
```
I think it should be called n_neighbors, as in KNeighborsClassifier.
…lidean calculation method and add some tests
I just did a speed test using the make_classification function to create a dataset: 40% of samples have 1 missing feature, 40% of samples have 2 missing features, and the rest are complete. KNN imputation used 6 neighbors.
10000 samples, 20 features:
10000 samples, 200 features:
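(For reference, a rough sketch of how such a benchmark dataset could be constructed; the exact benchmark code isn't shown in the thread, and the sizes and fractions below just follow the description above.)

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.RandomState(0)
X, _ = make_classification(n_samples=10000, n_features=20, random_state=0)

n_samples, n_features = X.shape
rows = rng.permutation(n_samples)
# 40% of rows get one missing feature, 40% get two, the rest stay complete.
one_missing = rows[:int(0.4 * n_samples)]
two_missing = rows[int(0.4 * n_samples):int(0.8 * n_samples)]
for i in one_missing:
    X[i, rng.choice(n_features, 1, replace=False)] = np.nan
for i in two_missing:
    X[i, rng.choice(n_features, 2, replace=False)] = np.nan
```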
Btw, it would be cool to have a graph that shows the runtime for block queries with varying block size or varying number of features and samples.
Have a look at the cool plots here: #4852
There is also a test error on older numpy.
Well, I reckon my solution was fun but not clever. The better solution groups samples by their pattern of missing values. I won't write it out, but to pull out groupings you can either use a…
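(One possible way to pull out such groupings, sketched under the assumption that the intent is to group rows by their missing-value pattern; `np.unique(..., axis=0)` needs a reasonably recent NumPy.)

```python
import numpy as np

# Group rows by their pattern of missing values, so that every row in a group
# can be imputed with one vectorised distance computation over the same
# observed columns.
X = np.array([[1.0, np.nan, 3.0],
              [2.0, np.nan, 1.0],
              [np.nan, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
mask = np.isnan(X)
patterns, inverse = np.unique(mask, axis=0, return_inverse=True)
groups = [np.flatnonzero(inverse == g) for g in range(len(patterns))]
print(groups)  # rows 0 and 1 share a pattern; rows 2 and 3 each form their own group
```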
@jnothman do you think grouping by missing values would work in practice? I think it depends a lot on the number of samples and features whether the same pattern of missingness even happens twice... The stupidly batched version seems to work ok.
Well, we could bench it for different values of…
Making it feasible with non-Euclidean metrics isn't enough reason?
I'm not sure how common / important that is... and we can use…
Indeed, my suggestion as currently implemented could be simplified using…
I don't know whether grouping by set of missing values will be efficient in practice. There are obviously…
Yeah, I'm not sure what the right assumption on missingness would be. Theory usually assumes missing completely at random, but I would expect that not to be the case in practice. I have no source of actual data, so I'm not sure how to test this.
```diff
@@ -8,13 +8,14 @@
 from scipy import sparse
 from scipy import stats

+from ..neighbors import KDTree, NearestNeighbors
```
Could it be that this creates circular imports? I feel something odd is happening here. Do tests run on your box without any problems?
Yes, the tests run perfectly on my Mac. I can remove those lines and make another commit to try.
Can you also please fix the doctests (see the Travis output on Python 3)?
```python
X_sl = X[index_start: index_stop].copy()
mask_sl = _get_mask(X_sl, self.missing_values)
missing_index_sl = np.where(mask_sl.any(1))[0]
D2 = (X_sl[missing_index_sl, np.newaxis] - statistics) ** 2
```
The Travis error is because X_sl is already 2d, so you have to pass two indices in the older numpy. You want it to be 3d, right? Then it should be X_sl[missing_index_sl, :, np.newaxis].
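(A small sketch of the 3-d broadcast the quoted D2 line appears to be aiming for; this is my reading of the intent, not the PR author's exact fix. The two-step indexing makes the shapes explicit and also sidesteps the older-NumPy fancy-indexing issue mentioned above.)

```python
import numpy as np

# Compare every incomplete row of the slice against every row of `statistics`
# in one operation.
X_sl = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, np.nan],
                 [7.0, 8.0, 9.0]])
statistics = np.array([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])

missing_index_sl = np.where(np.isnan(X_sl).any(axis=1))[0]
# (n_incomplete, 1, n_features) - (n_candidates, n_features)
# broadcasts to (n_incomplete, n_candidates, n_features).
D2 = (X_sl[missing_index_sl][:, np.newaxis, :] - statistics) ** 2
print(D2.shape)  # (2, 2, 3)
```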
@jnothman That's a fair point. On a separate note, how would you recommend continuing the work on this pull request? It seems to have been inactive for a while, so presumably a lot of things outside this PR will have changed. What would be the ideal way forward? Thanks.
I don't think this PR has the right design or the right implementation, so you should basically start from scratch. The right design, IMO, is an extra parameter to…
I don't know if I understood you correctly, but wouldn't such an approach restrict our ability to work with "mixed" datasets that might, for instance, contain both continuous and nominal variables? Or is the idea here to see how things turn out with this approach first and consider a separate knn "strategy" in the future?
Standard KNN imputation, as far as I know, tends to be defined for Euclidean (or Minkowski) distances. If you're looking for something else, it's not technically KNN imputation, even if it involves KNN!
Ahh, I thought you literally meant Euclidean only. In any case, the reason I brought that up is the VIM R package (noted above), which uses the Gower distance for KNN imputation. Their implementation appears to allow handling of mixed datasets composed of "binary, categorical, ordered, continuous and semi-continuous" features. It would be interesting to know what you think about their methodology (if you have the time to take a look at the linked paper, that is). If that is not of interest, I'll proceed with what you have suggested above.
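(For illustration, a rough sketch of the Gower dissimilarity being discussed, for a single pair of mixed-type rows. The helper and its arguments are hypothetical, not VIM's code, and the handling of missing values is exactly the open question raised below, so it is omitted here.)

```python
import numpy as np

def gower_distance(x, y, is_categorical, ranges):
    # Categorical features contribute a simple mismatch indicator;
    # numeric features contribute |x - y| scaled by the feature's range.
    d = np.where(is_categorical,
                 (x != y).astype(float),
                 np.abs(x - y) / ranges)
    return d.mean()

# Example: two numeric features followed by one categorical code.
is_categorical = np.array([False, False, True])
ranges = np.array([10.0, 5.0, 1.0])  # ranges of the numeric features (last entry unused)
print(gower_distance(np.array([2.0, 1.0, 0.0]),
                     np.array([4.0, 3.0, 1.0]),
                     is_categorical, ranges))  # (0.2 + 0.4 + 1.0) / 3
```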
Okay. In theory, I'd appreciate any good NaN-handling metrics. Currently we have none implemented in scikit-learn, and implementing them may provide new clustering etc. options as well. I'll try to look at the paper, but tbh we're focused on a release atm and this won't make it in.
That Gower description looks fairly well established, though difficult to provide an API for. Still, that paper does not clearly talk about distance in variables that have a missing value.
Right. What I appreciated about their implementation has to do with how they handle mixed datasets, not so much their handling of missing covariates during imputation. In fact, when I looked at their R code, I ran into this comment: "# TODO: find better way to handle missing in distance variables", which presumably means their current approach is probably not worth pursuing. I am currently aware of two other ways of handling such situations. The first, the obvious one, is used in the "impute" R package from Bioconductor (which is also the original inspiration for this pull request, I believe) and measures distances over the existing variables: "Each candidate neighbor might be missing some of the coordinates used to calculate the distance. In this case we average the distance from the non-missing coordinates." A second method is an EM-based sequential KNN imputation in the vein of MICE (i.e. in the sense of using an iterative procedure), and I think you also allude to this algorithm above. Additionally, I believe the paper also examines a non-iterative, one-time sequential KNN imputation.
I am interested in this problem and have some time/energy to contribute. While using the euclidean metric on the non-null features scaled by the (square root of the) inverse of the number of features used seems better than not scaling, I'm not sure this is ideal. For example suppose we have the data u = (1, 1, nan, 5, 7), v = (1, 1, 2, nan, nan), w = (1, 1, 8, 4.9, 7). Should v or w be deemed closer to u? In my opinion w should. Perhaps there should be some parameter specifying the minimum number or proportion of non-null features that must be used in the distance calculation?
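(A minimal sketch of the scaled metric and the example above, assuming a hypothetical masked_euclidean helper that averages squared differences over the features observed in both rows.)

```python
import numpy as np

def masked_euclidean(x, y):
    # Euclidean distance over the features present in both vectors,
    # scaled by the square root of the inverse of the count used.
    present = ~np.isnan(x) & ~np.isnan(y)
    n_used = present.sum()
    if n_used == 0:
        return np.nan
    return np.sqrt(np.sum((x[present] - y[present]) ** 2) / n_used)

u = np.array([1, 1, np.nan, 5, 7])
v = np.array([1, 1, 2, np.nan, np.nan])
w = np.array([1, 1, 8, 4.9, 7])

print(masked_euclidean(u, v))  # 0.0   -- only the two matching features overlap
print(masked_euclidean(u, w))  # ~0.05 -- so v is ranked closer to u than w
```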
I tried normalising by the count of non-missing features at some point, but I don't recall it being a better estimate. It should be possible to evaluate different variant metrics by hiding features. But ideally we aim for something that is also consistent with the literature or existing implementations.
The paper's approach, as reflected in their latest R package (https://github.com/Bioconductor-mirror/impute/tree/release-3.5), is that if a given row is missing more than 50% of the columns (by default; the user can pass their own value), those rows are NOT used in KNN imputation and their own missing columns are imputed with the column means. Further, if any column has more than 80% of its values missing (again, an adjustable default), the program throws an error and simply terminates execution.
It sounds reasonable to have settings for such thresholds.
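(A rough sketch of what such threshold settings might look like; the rowmax/colmax names follow the R package's arguments as far as I know, and none of this is scikit-learn API.)

```python
import numpy as np

def check_missingness(X, rowmax=0.5, colmax=0.8):
    """Apply the thresholds described above before KNN imputation."""
    mask = np.isnan(X)
    # Abort if any column is missing more than `colmax` of its values.
    if np.any(mask.mean(axis=0) > colmax):
        raise ValueError("a column exceeds %.0f%% missing values" % (100 * colmax))
    # Rows missing more than `rowmax` of their columns fall back to
    # column-mean imputation and are excluded from the KNN search.
    fallback_rows = mask.mean(axis=1) > rowmax
    return fallback_rows
```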
Hi all, you might have noted that I recently added a pull request that implements a kNN imputation algorithm. I would very much welcome and appreciate any feedback/suggestions/criticism you might have regarding the code. Also, it would be super if anybody wanted to join in, as the development is not yet complete.
Nice work. I also started working on this (https://github.com/ekorman/scikit-learn/blob/knn-impute/sklearn/metrics/pairwise.py), but my strategy was to change euclidean_distances in pairwise.py to allow for null entries and then use sklearn.neighbors to do the imputation. I think it may be useful to have this generality, since presumably there are cases beyond imputation where it would be useful to use KNN on data with some null values.
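(A rough sketch of that idea under stated assumptions: the nan_pairwise_euclidean helper below is hypothetical, not the code in the linked branch; it simply plugs a NaN-aware distance matrix into the existing NearestNeighbors machinery via metric="precomputed".)

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nan_pairwise_euclidean(X):
    # For each pair of rows, use only the features observed in both rows
    # and rescale by the number of features actually used.
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            present = ~np.isnan(X[i]) & ~np.isnan(X[j])
            d = np.sqrt(np.mean((X[i, present] - X[j, present]) ** 2))
            D[i, j] = D[j, i] = d
    return D

X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.0],
              [0.9, np.nan, 3.2],
              [5.0, 6.0, 7.0]])

# Reuse the existing neighbors machinery through a precomputed matrix.
nn = NearestNeighbors(n_neighbors=2, metric="precomputed").fit(nan_pairwise_euclidean(X))
dist, ind = nn.kneighbors()
print(ind)  # neighbor indices per row, which a KNN imputer could then average over
```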
Oh, cool! I don't know the approach you are taking, but the matrix formula used in my PR for calculating the Euclidean distance between all vector pairs in a matrix with NaN values (or between vectors of one matrix and those of another matrix with NaNs) might be of interest to your work (it is contained in the function _get_knn()). In any case, I would be happy to change my code to incorporate the "NaN-enabled" sklearn.neighbors after you finish working on it.
Without looking at the code, I would prefer a metric implementation in metrics.pairwise, as it makes it easy to reuse the existing nearest neighbours machinery.
Agreed. As @ekorman noted, the incorporation of a NaN-handling pairwise distance calculator is useful beyond this particular use case.
@ekorman By the way, please feel free to let me know if I can help wrap things up faster, unless of course the remaining workload isn't much and/or you prefer working by yourself.
Hey @ekorman, sorry to bother you again, but I was just wondering whether you are still working on the NaN-enabled pairwise distance calculator. If it is not feasible for you to work on it at the moment, please let me know and I can give it a try. Thanks.
@ashimb9 Sorry for not responding sooner; some stuff came up, but I should be able to start thinking about this again in the next few days. I definitely welcome any help. However, I think the code I wrote to compute the distances when there are NaNs is more or less the same code you wrote.
No worries man. Let me know how I can help.
@jnothman Hey Joel, as per our discussion here, I just created a PR which implements a NaN-enabled Euclidean distance calculator within sklearn.metrics and leverages it to get the nearest neighbors. I look forward to your feedback. @ekorman Hi Eric, since I have not heard back from you in a while, I went ahead with the modification. I really had to get the ball rolling on this one, as it has been stalled for a while. I hope you don't mind.
Yes, okay. Current PR is at #9212.
Trying to solve #2989.
The imputation strategy is the following:
Find all rows with all features present and use them as the complete set; impute the missing features of a row by taking the mean of those features among its K nearest neighbors in the complete set. sklearn.neighbors.NearestNeighbors is used to find the neighbors. Currently it can only handle dense matrices.
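(For context, a minimal sketch of this strategy; this is not the PR's actual code, which batches the distance computations, and the knn_impute helper is just illustrative.)

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_impute(X, n_neighbors=3):
    # Treat the fully observed rows as the complete set and fill each missing
    # entry with the mean of that feature among the row's nearest complete rows.
    X = np.array(X, dtype=float, copy=True)
    mask = np.isnan(X)
    complete = X[~mask.any(axis=1)]
    for i in np.flatnonzero(mask.any(axis=1)):
        observed = ~mask[i]
        # Distances are computed on the observed features only
        # (one query per row here, for clarity).
        nn = NearestNeighbors(n_neighbors=n_neighbors).fit(complete[:, observed])
        _, ind = nn.kneighbors(X[i, observed].reshape(1, -1))
        X[i, mask[i]] = complete[ind[0]][:, mask[i]].mean(axis=0)
    return X
```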
Based on this paper: http://web.stanford.edu/~hastie/Papers/missing.pdf
Similar package for R: impute (http://www.bioconductor.org/packages/release/bioc/manuals/impute/man/impute.pdf)
Similar function for MATLAB: knnimpute (http://www.mathworks.com/help/bioinfo/ref/knnimpute.html)
Still working on documentation and examples.