Added Boruta and mutual information based feature selection modules. #6313
Conversation
Apparently I failed the auto-checks because I use bottleneck and statsmodels. Both are pip-installable, mature packages; is there a way to use them in scikit-learn development, or should I do everything without them? Thanks!
no chance we add these dependencies
Thanks for the quick and honest answer. If I rework the code so it doesn't require these packages, will you consider adding these methods to the feature selection module? Thanks
It will not be lost if it ends up in a clean project referenced in the list.
Support for mutual information based scoring functions was added in #5372 by @nmayorov. However, note that those are just scoring functions; the actual mRMR as proposed in the paper, i.e. picking the feature that maximises (relevance with the target minus redundancy with the already selected subset of features), has not been implemented yet. This is partly because we were not able to show the usefulness of mRMR over plain correlation on some standard datasets, see #5372 (comment) and #5372 (comment). It goes with the nature of the algorithm: consider features A, B and C, where A and B are highly correlated and C is noise. mRMR will pick A and C, even though C is useless, which is why we considered having a parameter alpha to tune the balance between relevance and redundancy. Of course, it might be useful on higher dimensional data; if you are able to convince others with a benchmark that mRMR is indeed useful, that would be great!
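As an illustration of the greedy step described above (a sketch only, not code from this PR), one way to write it using the mutual information estimators from #5372 for both relevance and redundancy:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, n_features):
    """Greedy mRMR: at each step pick argmax(relevance - mean redundancy)."""
    relevance = mutual_info_classif(X, y)      # I(f; y) for every feature
    selected = [int(np.argmax(relevance))]     # start from the most relevant feature
    remaining = [i for i in range(X.shape[1]) if i not in selected]

    while len(selected) < n_features and remaining:
        scores = []
        for i in remaining:
            # redundancy: mean MI between candidate i and the already selected features
            redundancy = np.mean([mutual_info_regression(X[:, [j]], X[:, i])[0]
                                  for j in selected])
            scores.append(relevance[i] - redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```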
I wouldn't necessarily recommend integrating mRMR, but JMI works pretty well and provides extra functionality over correlation based methods. mRMR is linear, so the previous experiments couldn't show a benefit over scoring using RFE with a linear classifier. JMI allows for non-linear interactions, which adds value to the selection process (i.e. there are datasets that JMI can find features for that mRMR can't). One example is the MADELON dataset from the NIPS feature selection challenge in 2003. mRMR can be useful if people want a linear feature selection algorithm as an alternative to RFE-type methods. I wrote a review paper on mutual information based feature selection for JMLR a few years ago (http://www.jmlr.org/papers/v13/brown12a.html), and I'm happy to go through the theoretical or experimental analysis to provide support for information theoretic feature selection. Unfortunately my Python isn't too hot, so I can't help out with the coding (my experiments were all done in C or Matlab).
Hi Adam, thanks so much for joining the discussion. Based on my benchmarks I totally agree with you: mRMR is quite famous but not very useful, whereas JMI is an absolute beast. I'm quite busy this week, but I'm hoping to write up some benchmarking results which show the power of JMI and could persuade the scikit-learn team to include it in the next release.
@Craigacp Could you please explain in more detail how mutual information fails to capture non-linear interactions? Mutual information should capture non-linear dependence between two variables, shouldn't it?
Also note that this is univariate, not multivariate.
Mutual information itself captures non-linear interactions (between two variables). The mRMR feature selection algorithm can't find non-linear interactions between more than two variables, as it only subtracts from the feature score. While the scoring function (mutual information) is non-linear, the mRMR algorithm is missing the term that allows it to capture three-way non-linearities. Basically, if you have three binary variables X, Y and Z, where Z = X xor Y, and Z is your target variable, mRMR can't find the pair X & Y which gives perfect accuracy, whereas something like JMI can. This is because the mutual information for the pair X & Y, I(XY;Z), is bigger than I(X;Z) + I(Y;Z), and mRMR assumes that each variable's score is at most its mutual information with the class (i.e. I(X;Z) or I(Y;Z)). mRMR removes features which are redundant (i.e. those which provide the same information as a selected feature), so it fits well with a linear model (as a linear model prefers the features to be orthogonal, and can't cope with a non-linear interaction between the features and a class). JMI also removes features which are redundant, but keeps features which combine well with the currently selected set (in the sense that they provide extra information when combined, rather than looked at alone).
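For concreteness, a small numeric check of the XOR point above (an illustration added here, not part of the original comment): individually X and Y carry no information about Z = X xor Y, but the pair determines Z exactly.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=10000)
Y = rng.randint(0, 2, size=10000)
Z = X ^ Y  # Z = X xor Y

print(mutual_info_score(X, Z))          # ~0: X alone says nothing about Z
print(mutual_info_score(Y, Z))          # ~0: Y alone says nothing about Z
print(mutual_info_score(2 * X + Y, Z))  # ~log(2) nats: the pair (X, Y) identifies Z
```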
Removed the mifs feature selection module for the moment.
It's great to have expert input on the utility of these different methods. And thanks @danielhomola for getting rid of the extra dependencies. However, as @agramfort suggested, these algorithms, with the exception of mRMR and JMI, appear not to be sufficiently well-established for scikit-learn to commit to their maintenance. And as @MechCoder points out, corroborated by the further comments, mRMR appears to give little advantage over existing methods. If you would like to make these implementations available in a scikit-learn compatible interface for people to play with and compare, providing them external to scikit-learn, perhaps in the http://scikit-learn-contrib.github.io project, may be the best way to go. Of the algorithms implemented, it sounds like JMI might be a valuable contribution to scikit-learn. I suggest that you close this PR as very unlikely to get merged, and consider offering a PR with JMI (and perhaps with the other MI approaches, given they are only slight variants). Please also concoct an example (as in our examples gallery) showing where JMI offers something over what is already in scikit-learn.
Thanks for your comments @jnothman. May I ask what the criteria for inclusion are? The Boruta paper came out 6 years ago, has been cited 120 times, and has been used successfully in numerous Kaggle competitions. It's a very powerful FS method which is practically parameter-free. I'm not the author of it, so this isn't self-promotion, but I found it to be a very valuable tool. I'm happy to work on the JMI (unit tests, etc.), and I too think it's a valuable feature selector.
The criteria are a bit rough, but usually we'd expect a couple of hundred citations at a minimum. I think there's room to argue for different criteria for different types of estimators, though, and maybe we should be looser here. It seems I did not look deeply enough and may have been mistaken: JMI's citation count seems very low given its age, while Boruta's seems respectable for its age. In any case, what we really need is an example that shows these methods doing something different/better than what we already have in scikit-learn. This PR is unlikely to go anywhere without that type of evidence.
The JMI paper has a low citation count, but DISR and mIMR (which are basically the same technique with a different name and slightly different initialisation) have another 150 citations between them. My review paper tested 9 different filter feature selection algorithms and found JMI to perform the best across the 17 datasets we used. If it's relevant, the review paper has 267 citations since publication in 2012 (according to Google Scholar), and the Matlab & C package I wrote which implements the algorithms has about 6k downloads from MLOSS. By an example, do you mean a single dataset, a bunch of datasets, or a toy problem showing the principle?
Sometimes showing performance on a bunch of datasets is useful in pushing a PR forward. Yes, your paper may help there. But we generally accept algorithms alongside an example that demonstrates how competing estimators compare on some real or generated dataset.
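As a rough illustration of the kind of comparison example being asked for (not taken from this PR; the dataset and classifier choices are arbitrary), here is how two scoring functions already in scikit-learn could be compared on a generated dataset:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# synthetic data with a handful of informative and redundant features
X, y = make_classification(n_samples=1000, n_features=50, n_informative=5,
                           n_redundant=15, random_state=0)

for name, score_func in [("f_classif", f_classif),
                         ("mutual_info_classif", mutual_info_classif)]:
    pipe = make_pipeline(SelectKBest(score_func, k=5), LogisticRegression())
    print(name, cross_val_score(pipe, X, y, cv=5).mean())
```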
@Craigacp Thanks a lot for the explanation. It makes sense and goes well with my previous understanding.
@Craigacp you didn't consider Boruta in your paper, right? Do you have any opinion on it?
@amueller We didn't look at Boruta; we were focusing on filters so we could get to the theory behind them, and wrappers are harder to analyse theoretically. It looks like a sensible extension of scoring features using the random forest's importance score, but it won't scale up the way the filters will as you increase the number of features. The importance metric it uses is going to score features more jointly than (e.g.) JMI would, so it will be better on datasets where large numbers of features have complicated interactions (though I've found filters to work pretty well in that case). It does automatically detect the number of relevant features (according to its metric), which is one definite advantage over filters. I haven't seen a comparison of it against a wide range of feature selection techniques, so I'm not sure how well it does empirically. The papers that present it don't compare against other metrics, a quick Google search doesn't find many papers on it, and this paper (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-8) shows it's basically equivalent to other random forest based wrappers.
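For readers unfamiliar with Boruta, a minimal sketch of the shadow-feature test it is built around, simplified to a single round (the full algorithm repeats this over many iterations and applies a statistical test to the accumulated hits). This is an illustrative simplification, not the implementation submitted in this PR.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def one_boruta_round(X, y, random_state=0):
    rng = np.random.RandomState(random_state)
    # shadow features: an independently shuffled copy of every real feature
    X_shadow = np.apply_along_axis(rng.permutation, 0, X)
    forest = RandomForestClassifier(n_estimators=500, random_state=random_state)
    forest.fit(np.hstack([X, X_shadow]), y)

    n = X.shape[1]
    real_imp = forest.feature_importances_[:n]
    shadow_imp = forest.feature_importances_[n:]
    # a real feature scores a "hit" if its importance beats the best shadow feature
    return real_imp > shadow_imp.max()
```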
I think this PR can be closed, as it became a contrib package. @amueller, @jnothman
Thanks @chkoar. And thanks @danielhomola for the contribution! |
Dear FS module maintainers,
I have implemented the Boruta algorithm (https://m2.icm.edu.pl/boruta/) with a scikit-learn-like interface. Here's my original repo: https://github.com/danielhomola/boruta_py
and here's a blog post I wrote about this FS method: http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/
I also implemented three well-known mutual information based FS methods (JMI, mRMR, plus the fairly recent JMIM) and wrapped them up in a single module with parallelized execution.
Here's my repo: https://github.com/danielhomola/mifs
and here's a blog post I wrote about this method: http://danielhomola.com/2016/01/31/mifs-parallelized-mutual-information-based-feature-selection-module/
I wrote docstrings, with examples at the end of the docstrings, and also ran autopep8 on these files. Could you help me with unit testing a bit? What is expected from feature selectors in terms of unit testing? Should I simulate some data, run these FS methods, and assert that the discovered features are the same as in a previous run? Should I also write a test for every single internal function? (A sketch of one possible test is appended after this message.)
Best wishes,
Dan
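A minimal sketch of the kind of test asked about above: generate data where the informative columns are known and assert the selector recovers them. SelectKBest is used here only as a stand-in for the new selector under test; the dataset parameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def test_selector_recovers_informative_features():
    # with shuffle=False the informative features are the first three columns
    X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                               n_redundant=0, n_clusters_per_class=1,
                               class_sep=2.0, shuffle=False, random_state=0)
    # SelectKBest stands in for the feature selector being tested
    selector = SelectKBest(mutual_info_classif, k=3).fit(X, y)
    assert set(np.flatnonzero(selector.get_support())) == {0, 1, 2}
```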