Feature Request: Pipelining Outlier Removal #9630
This is the right place. Sorry I've not fished out issue numbers, but certainly things like this have been raised before, particularly as regards resampling in pipeline. A few points:
Hey, I skimmed through some issues but didn't find a related topic; sorry for that. I fully understand that an API change is suboptimal, and resample is of course the right method name for imblearn. Concerning the meta-estimator: how would such an estimator remove rows? I just don't see it at the moment, and a small(!) hint is appreciated.
I think I found one of the related issues, and I think I understood how to use a meta-estimator. Compared to bagging, the random part in bagging is replaced by the outlier-free set, and the many estimators are replaced by the single estimator we are interested in. However, this means that we can only remove outliers in the final step (or build further meta-estimators out of transformers, which seems awkward). I think in that case, a new method might be a better choice.
Yes, #3855, #4143, #4552 and scikit-learn/enhancement_proposals#2 all relate. I don't see why resample is entirely inappropriate here. Meta-estimator (almost; untested):

from sklearn.base import BaseEstimator, ClassifierMixin, clone

class WithoutOutliersClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, outlier_detector, classifier):
        self.outlier_detector = outlier_detector
        self.classifier = classifier

    def fit(self, X, y):
        # fit the detector, then train the classifier on the inliers only
        self.outlier_detector_ = clone(self.outlier_detector)
        mask = self.outlier_detector_.fit_predict(X, y) == 1
        self.classifier_ = clone(self.classifier).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        # no rows are dropped at prediction time
        return self.classifier_.predict(X)
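A minimal usage sketch of that snippet (untested; IsolationForest is just one possible detector): the detector is consulted only in fit, so every test sample still gets a prediction.

from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = WithoutOutliersClassifier(IsolationForest(contamination=0.05),
                                LogisticRegression())
clf.fit(X, y)            # outliers are dropped before the classifier is fit
y_pred = clf.predict(X)  # but every row still receives a prediction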
For me, (re)sampling always implies a strong random component, which is not really the case when outliers are to be removed. Thanks for the meta-estimator hint; I was thinking in a similar direction. Additionally, the classifier can be a pipeline, so my previous comment about removing outliers only in the last step is wrong in some sense.
Yes, this is more-or-less sufficient. The questions are whether it is elegant, and whether we should have some such thing in the library.
I think we should close this as duplicate, given all the other open issues and the enhancement proposal. Resampling doesn't necessarily mean random. In signal processing at least, it is not ;) the meta-estimator is a good solution imho. Interesting question: it looks like you don't want to remove outliers at test time. Why is that? That seems counter-intuitive to me. I would refuse to make a prediction on points that I would have removed from the training set.
Refusing to classify is interesting. The estimator could have an option to add an additional unlabelled class for outliers in prediction, but the key idea is to reduce noise in training.
Also, resampling pipelines alone would not fix this, so unless we have other issues about outlier removal I think we should keep this open. We either need a wrapper for outlier detectors, or need to add a fit_resample method on each of them. Or we can just add a meta-estimator like mine above to make outlier removal in training practical.
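That wrapper option could look roughly like this (a rough, untested sketch: fit_resample here follows the imblearn-style contract of returning the filtered X and y, and OutlierRemover is an invented name, not an existing scikit-learn class):

from sklearn.base import BaseEstimator, clone

class OutlierRemover(BaseEstimator):
    # hypothetical wrapper giving any outlier detector a fit_resample method
    def __init__(self, detector):
        self.detector = detector

    def fit_resample(self, X, y):
        # fit the detector and keep only the samples it labels as inliers (+1)
        self.detector_ = clone(self.detector)
        mask = self.detector_.fit_predict(X) == 1
        return X[mask], y[mask]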
Just passing by this issue. When I started reading, I thought that imblearn could implement a FunctionSampler for this. Then again, I share the feeling of @amueller regarding "not removing the outliers at test time": using an imblearn pipeline with such a FunctionSampler you will get exactly this behaviour, but I am really not sure it is the right thing to do. @jnothman @amueller, if it would somehow make sense to have such a feature, you could delegate it to imblearn. NB:
@amueller, thanks for the remark on signal processing and "sampling". I support the view of @jnothman concerning outlier removal only in training. In particular, I see the test set as "real" data that can of course be noisy. In that scenario, I typically want a prediction for every data point (contrary to @amueller). However, during training I want to reduce overfitting a bit by removing outliers, which could improve model quality. Maybe marking points as outliers while still providing predictions could be a compromise.
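One illustrative sketch of that compromise, building on the WithoutOutliersClassifier above (the method name is invented, and it assumes the detector supports predict on new data, as IsolationForest does): predict for every point, but also report which points the fitted detector flags.

class OutlierAwareClassifier(WithoutOutliersClassifier):
    def predict_with_outlier_flag(self, X):
        # predictions for every sample, plus a boolean mask of suspected outliers
        y_pred = self.classifier_.predict(X)
        is_outlier = self.outlier_detector_.predict(X) == -1
        return y_pred, is_outlier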
Just ran into this exact scenario and came looking to see if anything was in progress, so I'll echo my support for the feature (for whatever that's worth =)). Regarding removing outliers from the test set, one could argue both ways and it's likely too dependent on the specifics of the problem to have a right answer, but the scenario @datajanko described seems to be more common (i.e. remove from training to improve generalization).
Perhaps it's time we just accept a PR to support an imblearn-style pipeline with resampling and outlier removal examples, and we tweak from there... Ping @glemaitre
It would be great ... I think that @chkoar would agree with that. By the way, we started to make a PR for a FunctionSampler as previously stated. We still have to document it, but it should allow wrapping an outlier detection algorithm inside it to later do some sampling. Regarding the Pipeline itself, I wanted to go back to #8350, and it could be the occasion to introduce SamplerMixin support at the same time. @jnothman WDYT?
I'd be tempted to just add a selection method to outlier detectors directly rather than create FunctionSampler, but maybe I've not understood its benefits.
I think this is quite unrelated to #8350. I also don't think it would be arduous to reimplement an imblearn-like interface without getting explicit permission from its creators :P
In the meantime, we made a quick (stupid) example: We are still working on it, so any feedback is welcome.
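The linked example is not reproduced here, but roughly, wrapping an outlier detector in such a FunctionSampler with an imblearn pipeline could look like this (a sketch, assuming the FunctionSampler API imblearn was introducing at the time; samplers are applied during fit only):

from imblearn import FunctionSampler
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

def drop_outliers(X, y):
    # keep only the samples IsolationForest labels as inliers (+1)
    mask = IsolationForest(contamination=0.1).fit_predict(X) == 1
    return X[mask], y[mask]

X, y = make_classification(n_samples=200, random_state=0)
model = make_pipeline(FunctionSampler(func=drop_outliers), LogisticRegression())
model.fit(X, y)       # outliers removed from the training data only
model.predict(X)      # no rows are removed at prediction time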
Hi, I have been working on a similar pipeline that bypasses specific data. In my case I don't want it to resample the data but rather to refuse to classify (i.e. either set NaNs as the classification outcome or use a masked array as output) those subjects that meet a specific condition. I'm not sure if this is helpful, similar to the FunctionSampler in imblearn, or appropriate for #3855. In any case, if anyone is interested, I have made a branch available with the pipeline included.
Link to your branch here? But I don't really think this should be a branch modifying Pipeline so much as a separate estimator: it sounds like hierarchical (i.e. coarse then fine) classification, which I think is hard to develop a generic tool for.
Sorry, I'm a bit new to this. The branch is at:
If the act of detecting and removing outliers is part of the ML steps, why has it been left out of the pipeline processing? I came here because I want to do the exact same thing as the author of this issue. Scikit-learn is a fantastic framework. GridSearchCV is very powerful and useful, and I would like to integrate different strategies for detecting and removing outliers in the pipeline and let GridSearchCV figure out which one is best. It's clear that changing rows in X during the pipeline breaks the relationship between X and y, because y does not see the same changes. I don't know if this is a design flaw or something that has been considered by the designers since the beginning, but outlier detection/removal seems to be a rational use case for the pipeline, for the reason I explained in the paragraph above.
There is a plan towards a solution in #3855, and you are welcome to help implement it.
@averri it's very easy to implement this as a meta-estimator btw, which would allow using all the grid-search and cross-validation tools:

pre_outlier_pipeline = make_pipeline(...)
post_outlier_pipeline = make_pipeline(..., SomeClassifier())
outlier_detector = ...
full_model = make_pipeline(pre_outlier_pipeline,
                           OutlierMeta(post_outlier_pipeline, outlier_detector))

where OutlierMeta is a meta-estimator of the kind discussed above. That does break the nice linear flow of the pipeline, though. Also, it would be interesting if you could share actual real-world examples of data where using an outlier detection algorithm is helpful.
I have a meta-estimator coded above: #9630 (comment)
Great so our two snippets together nearly make one complete untested solution ;) I guess I should have re-read the rest of the thread.
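Put together (still untested), the combination is already usable with the grid-search and cross-validation tools, since the meta-estimator exposes its constructor arguments as nested parameters; the preprocessing step and parameter values below are placeholders:

from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

full_model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", WithoutOutliersClassifier(IsolationForest(), LogisticRegression())),
])
param_grid = {
    "clf__outlier_detector__contamination": [0.01, 0.05, 0.1],
    "clf__classifier__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(full_model, param_grid, cv=5)
# search.fit(X, y) would drop outliers on each training fold only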
Would it make sense to put this somewhere in the documentation?
An example illustrating how to do this would be welcome in the gallery.
Can anyone provide an example where this was done in practice? Or any paper evaluating the use of automatic outlier removal for supervised learning?
This change must be implemented because the current version encourages data hygiene violations. When using IsolationForest outside of a pipeline, users will typically define outliers based on data in both the training and test set, which causes leakage. |
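To make the leakage concern concrete, a small illustrative sketch of the pattern being warned against, next to a split-first alternative:

from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)

# Leaky: the detector is fit on all data, so the held-out rows influence
# which samples are kept before the split is even made.
mask = IsolationForest(random_state=0).fit_predict(X) == 1
X_tr, X_te, y_tr, y_te = train_test_split(X[mask], y[mask], random_state=0)

# Safer: split first, then fit the detector on the training portion only
# (which is what a pipeline or meta-estimator would do per CV fold).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
keep = IsolationForest(random_state=0).fit_predict(X_tr) == 1
X_tr, y_tr = X_tr[keep], y_tr[keep]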
Any updates? |
This is entangled with the issue of "sampling" where the number of samples is changed within a pipeline, and so far we don't have a good solution for it. imbalanced-learn does some hacks to implement this, but they're not included in scikit-learn itself, and the SLEPs regarding this haven't moved forward. |
I wonder if we could make outlier removal available in pipelines. I tried implementing it, for example using IsolationForest, but so far I couldn't solve it, and I know why.
The problem boils down to fit_transform only returning a transformed X. This suffices in the vast majority of cases, since we typically only throw away columns (think of a PCA). However, using outlier removal in a pipeline, we need to throw away rows of X and y during training and do nothing during testing. This is not supported so far. Essentially, we would need to turn the predict function into some kind of transform function during training.
Investigating the pipeline implementation shows that fit_transform is called, if present, during the fitting part of the pipeline, rather than fit(X, y).transform(X). In particular, in a cross-validation fit_transform is only called during training. This would be perfect for outlier removal. However, it remains to do nothing in the test step, and to this end we can simply implement a "do-nothing" transform function. The most direct way to implement this would be an API change of the TransformerMixin class, unfortunately.
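To make the limitation concrete, here is a sketch of what such a "do-nothing" transform would look like and why the current Pipeline cannot use it safely (the class is purely illustrative, not a proposed API):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import IsolationForest

class NaiveOutlierDropper(BaseEstimator, TransformerMixin):
    # illustrative only: shows where the current contract falls short
    def fit(self, X, y=None):
        self.detector_ = IsolationForest().fit(X)
        return self

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        mask = self.detector_.predict(X) == 1
        # only the filtered X can be returned; Pipeline keeps passing the
        # unfiltered y to later steps, so X and y become misaligned
        return X[mask]

    def transform(self, X):
        # "do nothing" at test/predict time: every row is kept
        return X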
So my questions are:
Would it be interesting to support outlier removal in pipelines?
Are there other, more suitable ideas for implementing this feature in a pipeline?
If the content of this question is somehow inappropriate (e.g. since I'm only an active user, not an active developer of the project) or in the wrong place, feel free to remove the thread.