
[MRG] Add KNN strategy for imputation #4844


Closed
wants to merge 17 commits

Conversation

twangnyc
Contributor

@twangnyc commented Jun 10, 2015

Trying to solve #2989.

The imputation strategy is as follows:

Find all rows with no missing features and treat them as the complete set. For each row with missing values, impute each missing feature by taking the mean of that feature among the row's K nearest neighbors in the complete set. sklearn.neighbors.NearestNeighbors is used to find the neighbors. For now it can only handle dense matrices.
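
A minimal sketch of this strategy, not the actual code in this PR (the helper name and the default number of neighbors are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_impute_from_complete_rows(X, n_neighbors=3):
    """Impute NaNs with the mean over the K nearest complete rows."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    complete = X[~missing.any(axis=1)]        # the complete set: rows with no NaN
    X_imputed = X.copy()
    for i in np.where(missing.any(axis=1))[0]:
        obs = ~missing[i]                     # features observed in this row
        # Find the nearest complete rows using only the observed features.
        nn = NearestNeighbors(n_neighbors=n_neighbors).fit(complete[:, obs])
        _, idx = nn.kneighbors(X[i, obs].reshape(1, -1))
        # Each missing feature gets the mean of that feature over the neighbors.
        X_imputed[i, ~obs] = complete[idx[0]][:, ~obs].mean(axis=0)
    return X_imputed
```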

Based on this paper: http://web.stanford.edu/~hastie/Papers/missing.pdf

Similar package for R: impute (http://www.bioconductor.org/packages/release/bioc/manuals/impute/man/impute.pdf)
Similar function for MATLAB: knnimpute (http://www.mathworks.com/help/bioinfo/ref/knnimpute.html)

Still working on documentation and examples.



@amueller
Member

Thanks.
Does anyone have experience with using non-complete samples to do knn imputation? There is an R package, but it is unclear to me what it is doing.

@@ -116,12 +124,13 @@ class Imputer(BaseEstimator, TransformerMixin):
         contain missing values).
     """
     def __init__(self, missing_values="NaN", strategy="mean",
-                 axis=0, verbose=0, copy=True):
+                 axis=0, verbose=0, copy=True, kneighbor=1):
Member


I think it should be called n_neighbors as in the KNeighborsClassifier.

@twangnyc
Contributor Author

Thanks @amueller and @jnothman for the comments! I will modify the code.

@twangnyc
Contributor Author

I just did a speed test using the make_classification function to create the dataset. 40% of the samples have 1 missing feature, 40% have 2 missing features, and the rest are complete. KNN imputation used 6 neighbors.

10000 samples, 20 features:
- mean: 0.0036s
- @jnothman's method (block query, block = 100 samples): 3.3282s
- default KNN: 14.3168s

10000 samples, 200 features:
- mean: 0.0296s
- @jnothman's method (block query, block = 100 samples): 36.8256s
- default KNN: 68.7889s
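
For reference, a rough sketch of how such a benchmark might be set up; the way missing values are injected here and the commented-out imputer call are assumptions, not the actual benchmark script:

```python
import time
import numpy as np
from sklearn.datasets import make_classification

X, _ = make_classification(n_samples=10000, n_features=20, random_state=0)
rng = np.random.RandomState(0)

# 40% of rows lose 1 feature, another 40% lose 2 features, the rest stay complete.
n_samples, n_features = X.shape
rows = rng.permutation(n_samples)
for i in rows[: int(0.4 * n_samples)]:
    X[i, rng.choice(n_features, 1, replace=False)] = np.nan
for i in rows[int(0.4 * n_samples): int(0.8 * n_samples)]:
    X[i, rng.choice(n_features, 2, replace=False)] = np.nan

start = time.time()
# imputer.fit_transform(X)  # mean / block-query / KNN variant being timed
print("elapsed: %.4fs" % (time.time() - start))
```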

@amueller
Member

Btw, it would be cool to have a graph that shows the runtime for block queries with varying block size or varying number of features and samples.

@amueller
Member

Have a look at the cool plots here: #4852

@amueller
Member

There is also a test error on older numpy.

@jnothman
Member

Well I reckon my solution was fun but not clever. The better solution groups X by which features are missing, then for each group uses KNeighborsRegressor over the other features. This batches the work, does not necessarily require O(n^2 d) memory, reuses existing infrastructure/optimisations (and can use nearest neighbor trees if the number of queries for a particular field subset is sufficient to justify their construction), does not constrain to Euclidean...

I won't write it out, but to pull out groupings you can either use a lexsort over np.isnan(X).T and itertools.groupby, or the hackier np.unique(np.isnan(X).astype('u1').view((np.void, X.shape[1])), return_inverse=True), or perhaps there's a less obfuscated solution.
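
A small sketch of the grouping step on hypothetical data; for readability it uses np.unique(..., axis=0) (numpy >= 1.13) in place of the lexsort or void-view tricks mentioned above:

```python
import numpy as np

X = np.array([[1.0, np.nan, 3.0],
              [2.0, np.nan, 1.0],
              [np.nan, 5.0, 6.0],
              [4.0, 5.0, 6.0]])

mask = np.isnan(X)
# Each distinct row of `mask` is one missingness pattern; `inverse` maps
# every sample to the index of its pattern.
patterns, inverse = np.unique(mask, axis=0, return_inverse=True)

for g, pattern in enumerate(patterns):
    rows = np.where(inverse.ravel() == g)[0]
    if not pattern.any():
        continue  # complete rows need no imputation
    # For this group, a KNeighborsRegressor could be fit on the complete rows,
    # using the observed columns (~pattern) as features and the missing
    # columns (pattern) as the regression targets.
    print(pattern, rows)
```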

@amueller
Member

@jnothman do you think grouping by missing values would work in practice? I think it depends a lot on the number of samples and features whether the same pattern of missingness happens even twice... The stupidly batched version seems to work ok.

@amueller
Member

Well, we could bench it for different values of n_samples or n_features with MAR data. Maybe it makes sense for points where exactly one value is missing, and then stupidly batch the rest? I'm not sure that the speed will warrant the complexity, though.

@jnothman
Member

Making it feasible with non-Euclidean isn't enough reason?


@amueller
Member

I'm not sure how common / important that is.... and we can use pairwise_distances without using trees, right?

@jnothman
Member

Indeed, my suggestion as currently implemented could be simplified using pairwise_distances, where a batch of queries with NaN replaced by 0 is X and the entire index is Y, except that:

  • pairwise_distances won't currently accept inputs containing NaN, necessary to this approach (wherein we don't select columns in order to process in batch)
  • replacing NaN with 0 only works for some metrics

I don't know whether grouping by set of missing values will be efficient in practice. There are obviously 2**n_features possible groups, which is bad, but as long as data is not missing at random, and only a fraction of data is missing, you'd expect very few of these combinations to actually exist. We'd need some real datasets to try it. Or just go with the batched approach.
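
To illustrate the second bullet above, a tiny hypothetical example of how zero-filling a missing coordinate distorts the plain euclidean distance:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

x = np.array([[np.nan, 3.0, 4.0]])   # query with one missing coordinate
y = np.array([[10.0, 3.0, 4.0]])     # index row

# Naive zero-filling: the missing coordinate contributes y_0**2 = 100
# to the squared distance, although it should simply be ignored.
print(pairwise_distances(np.nan_to_num(x), y))  # [[10.]]
```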

@amueller
Member

Yeah, I'm not sure what the right assumption on missingness would be. Theory usually assumes missing completely at random, but I would expect that not to be the case in practice. I have no source of actual data, so I'm not sure how to test this.

@@ -8,13 +8,14 @@
 from scipy import sparse
 from scipy import stats

+from ..neighbors import KDTree, NearestNeighbors
Member


Could it be that this creates circular imports? I feel something odd is happening here. Do tests run on your box without any problems?

Contributor Author


Yes, the tests run perfectly on my Mac. I can remove those lines and make another commit to try.

@amueller
Member

Can you also please fix the doctests (see the Travis output on Python 3)?

X_sl = X[index_start: index_stop].copy()
mask_sl = _get_mask(X_sl, self.missing_values)
missing_index_sl = np.where(mask_sl.any(1))[0]
D2 = (X_sl[missing_index_sl, np.newaxis] - statistics) ** 2
Member


The Travis error is because X_sl is already 2d, so you have to pass two indices in the older numpy. You want it to be 3d, right? Then it should be X_sl[missing_index_sl, :, np.newaxis].

@ashimb9
Contributor

ashimb9 commented Jun 14, 2017

@jnothman That's a fair point. On a separate note, how would you recommend continuing the work on this pull request? It seems to have been inactive for a while, so presumably a lot of things outside this PR would have changed. What would be the ideal way forward? Thanks.

@jnothman
Member

jnothman commented Jun 14, 2017

I don't think this PR has the right design or the right implementation, so you should basically start from scratch. The right design, IMO, is an extra parameter to Imputer, n_neighbors, which defaults to None, but otherwise only applies the strategy (mean, median or mode) over nearest neighbors (not that I'm sure median is useful when the nearest neighbors are euclidean). Ideally the implementation would make use of the neighbors subpackage (if only to ensure edge cases are correctly handled), which is why I suggest implementing a new metric for euclidean distance with features missing in sklearn.metrics.pairwise. It may be possible to take some of the example code from this PR. Does that sound doable?
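
A minimal sketch of what such a NaN-aware euclidean metric might look like, following the impute package's idea (quoted further below) of averaging the squared differences over the mutually observed coordinates and rescaling to the full number of features. The function name and exact weighting are illustrative assumptions, not an existing sklearn API at this point:

```python
import numpy as np

def masked_euclidean(x, y):
    """Euclidean distance between two 1-D vectors that may contain NaN.

    Squared differences are averaged over the coordinates observed in both
    vectors and rescaled by the total number of features, so distances stay
    comparable across pairs with different amounts of missingness.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    both = ~np.isnan(x) & ~np.isnan(y)
    if not both.any():
        return np.nan  # no coordinate observed in both vectors
    sq = (x[both] - y[both]) ** 2
    return np.sqrt(len(x) * sq.sum() / both.sum())

# Example: the missing coordinate is ignored rather than zero-filled.
print(masked_euclidean([np.nan, 3.0, 4.0], [10.0, 3.0, 5.0]))  # sqrt(3 * 1 / 2) ~= 1.22
```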

@ashimb9
Contributor

ashimb9 commented Jun 14, 2017

I don't know if I understood you correctly, but wouldn't such an approach restrict our ability to work with "mixed" datasets that might, for instance, contain both continuous as well as nominal variables? Or is the idea here to see how things turn out with this approach first and consider a separate knn "strategy" in the future?

@jnothman
Member

jnothman commented Jun 14, 2017 via email

@ashimb9
Contributor

ashimb9 commented Jun 14, 2017

Ahh, I thought you literally meant euclidean only. In any case, the reason I brought that up is the VIM R package (noted above), which uses the Gower distance for KNN imputation. Their implementation appears to allow handling of mixed datasets composed of "binary, categorical, ordered, continuous and semi-continuous" features. It would be interesting to know what you think about their methodology (if you have the time to take a look at the linked paper, that is). If that is not of interest, I'll proceed with what you have suggested above.

@jnothman
Member

jnothman commented Jun 14, 2017 via email

@jnothman
Member

jnothman commented Jun 14, 2017 via email

@ashimb9
Contributor

ashimb9 commented Jun 14, 2017

Right. What I appreciated about their implementation is how they handle mixed datasets, not so much their handling of missing covariates during imputation. In fact, when I looked at their R code, I ran into this comment: "# TODO: find better way to handle missing in distance variables", which presumably means their current approach is probably not worth pursuing. I am currently aware of two other ways of handling such situations. The first, and the obvious one, is used in the "impute" R package from Bioconductor (which I believe is also the original inspiration for this pull request): distances are measured over the existing variables: "Each candidate neighbor might be missing some of the coordinates used to calculate the distance. In this case we average the distance from the non-missing coordinates." A second method is an EM-based sequential KNN imputation in the vein of MICE (i.e. using an iterative procedure); I think you also allude to this algorithm above. Additionally, I believe the paper also examines a non-iterative, one-time sequential KNN imputation.

@jnothman added this to the 0.20 milestone Jun 14, 2017
@ekorman

ekorman commented Jun 18, 2017

I am interested in this problem and have some time/energy to contribute. While using the euclidean metric on the non-null features, scaled by the (square root of the) inverse of the number of features used, seems better than not scaling, I'm not sure this is ideal. For example, suppose we have the data u = (1, 1, nan, 5, 7), v = (1, 1, 2, nan, nan), w = (1, 1, 8, 4.9, 7). Should v or w be deemed closer to u? In my opinion w should. Perhaps there should be some parameter specifying the minimum number or proportion of non-null features that must be used in the distance calculation?
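
For concreteness, a quick calculation of this example with that scaling (a hypothetical helper: distances are taken over the mutually observed coordinates and divided by their count before the square root), showing why v comes out closer to u than w does, even though w agrees with u on four of five coordinates:

```python
import numpy as np

def scaled_dist(a, b):
    """Euclidean distance over the mutually observed coordinates,
    scaled by sqrt(1 / n_used) as described above."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    both = ~np.isnan(a) & ~np.isnan(b)
    return np.sqrt(((a[both] - b[both]) ** 2).sum() / both.sum())

u = [1, 1, np.nan, 5, 7]
v = [1, 1, 2, np.nan, np.nan]
w = [1, 1, 8, 4.9, 7]

print(scaled_dist(u, v))  # 0.0  -- only the two matching coordinates are compared
print(scaled_dist(u, w))  # 0.05 -- four coordinates compared, tiny mismatch on one
```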

@jnothman
Member

jnothman commented Jun 18, 2017 via email

@ashimb9
Contributor

ashimb9 commented Jun 19, 2017

The paper's approach, as reflected in their latest R package, is that if a given row is missing more than 50% of the columns (by default; the user can pass their own value), that row is NOT used in KNN imputation and its missing columns are imputed with the column means. Further, if any column has more than 80% of its values missing (again, an adjustable default), the program throws an error and simply terminates execution.
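
For concreteness, a rough sketch of those two checks. The parameter names are chosen to echo the R package's rowmax/colmax arguments; the exact defaults and error message here are illustrative:

```python
import numpy as np

def check_missingness(X, rowmax=0.5, colmax=0.8):
    """Apply the row/column missingness rules described above."""
    mask = np.isnan(X)
    if (mask.mean(axis=0) > colmax).any():
        raise ValueError("a column exceeds the %.0f%% missing-value limit" % (100 * colmax))
    # Rows above the rowmax threshold are excluded from KNN imputation and
    # instead fall back to simple column-mean imputation.
    return mask.mean(axis=1) > rowmax

X = np.array([[1.0, np.nan, np.nan],
              [2.0, 3.0, 4.0],
              [np.nan, 1.0, 2.0]])
print(check_missingness(X))  # [ True False False]
```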

@jnothman
Member

jnothman commented Jun 19, 2017 via email

@ashimb9
Contributor

ashimb9 commented Jun 24, 2017

Hi all, you might have noted that I recently added a pull request that implements a kNN imputation algorithm. I would very much welcome and appreciate any feedback/suggestions/criticism you might have regarding the code. Also, it would be super if anybody wanted to join in, as the development is not yet complete.
With regard to progress, I am happy to report that the implementation works fine in cases where the same data matrix is passed to fit() and transform(). And by "works fine", I mean in comparison to the R package impute that this implementation is inspired by. Some other things remain, and I hope to complete them in the coming days. Anyway, I look forward to hearing back from the community. Thanks! :)

@ekorman

ekorman commented Jun 24, 2017

Nice work. I also started working on this (https://github.com/ekorman/scikit-learn/blob/knn-impute/sklearn/metrics/pairwise.py), but my strategy was to change euclidean_distances in pairwise.py to allow for null entries and then use sklearn.neighbors to do the imputation. I think it may be useful to have this generality, since presumably there are cases beyond imputation where it would be useful to use KNN on data with some null values.

@ashimb9
Contributor

ashimb9 commented Jun 24, 2017

Oh, cool! I don't know the approach you are taking, but the matrix formula used in my PR for calculating the euclidean distance between all vector pairs in a matrix with NaN values (or between the vectors of one matrix and those of another matrix with NaNs) might be of interest to your work (it is contained in the function _get_knn()). In any case, I would be happy to change my code to incorporate the "NaN-enabled" sklearn.neighbors after you finish working on it.

@jnothman
Member

jnothman commented Jun 25, 2017 via email

@ashimb9
Contributor

ashimb9 commented Jun 25, 2017

Agreed. As @ekorman noted, the incorporation of a NaN handling pairwise distance calculator is useful beyond this particular use-case.

@ashimb9
Contributor

ashimb9 commented Jun 25, 2017

@ekorman By the way, please feel free to let me know if I can be of any use in helping wrap things up faster, unless, of course, the remaining workload isn't that much and/or you prefer working by yourself.

@ashimb9
Contributor

ashimb9 commented Jun 30, 2017

Hey @ekorman, sorry to bother you again, but I was just wondering whether you are still working on the NaN-enabled pairwise distance calculator. In case it is not feasible for you to work on it at the moment, please let me know and I can give it a try. Thanks.

@ekorman

ekorman commented Jul 2, 2017

@ashimb9 Sorry for not responding sooner; some stuff came up, but I should be able to start thinking about this again in the next few days. I definitely welcome any help. However, I think the code I wrote to compute the distances when there are NaNs is more or less the same code you wrote.

@ashimb9
Contributor

ashimb9 commented Jul 3, 2017

No worries man. Let me know how I can help.

@ashimb9
Contributor

ashimb9 commented Jul 13, 2017

@jnothman Hey Joel, as per our discussion here, I just created a PR which implements a NaN-enabled euclidean distance calculator within sklearn.metrics and leverages that to get the nearest neighbors. I look forward to your feedback.

@ekorman Hi Eric, since I have not heard back from you in a while, I went ahead with the modification. I really had to get the ball rolling on this one as it has been stalled for a while. I hope you don't mind.

@ashimb9
Contributor

ashimb9 commented Nov 3, 2017

@lesteve @jnothman Hey, I just saw that the status of this was recently changed to "help wanted". Shouldn't it rather be closed in light of PR #9212?

@jnothman
Member

jnothman commented Nov 4, 2017

Yes, okay. Current PR at #9212

@jnothman closed this Nov 4, 2017