[MRG-0] Make LabelEncoder more friendly to new labels #3483

Closed

Conversation

mjbommar
Contributor

This is a final, cleanly rebased version of #3243, incorporating the discussion there.

Summary:

This PR aims to make preprocessing.LabelEncoder friendlier for production/pipeline usage by adding a new_labels constructor argument.

Instead of always raising ValueError for unseen/new labels in transform, LabelEncoder may be initialized with new_labels set to one of:

  • "raise": raise ValueError, as currently; this remains the default behavior
  • "update": extend classes_ with the m new labels and encode them with new IDs [N, ..., N+m-1]
  • an integer value: encode all newly seen labels with this fixed integer value

N.B.: .classes_ is now exposed as a property in order to support the new_labels="update" behavior.

Tests and documentation updates included.
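
For illustration, a minimal usage sketch of the proposed behavior (the new_labels argument exists only on this branch, not in released scikit-learn, and the label strings below are made-up examples):

from sklearn.preprocessing import LabelEncoder

# Default behavior (new_labels="raise"): unseen labels still raise ValueError.
le = LabelEncoder()
le.fit(["paris", "tokyo"])
# le.transform(["rome"])  # raises ValueError

# new_labels="update": the m new labels are appended to classes_ and encoded
# as [N, ..., N+m-1], where N is the number of classes seen during fit.
le = LabelEncoder(new_labels="update")
le.fit(["paris", "tokyo"])        # paris -> 0, tokyo -> 1
le.transform(["tokyo", "rome"])   # "rome" is added and encoded as 2

# new_labels=<int>: every unseen label is encoded with this fixed value.
le = LabelEncoder(new_labels=-1)
le.fit(["paris", "tokyo"])
le.transform(["rome", "oslo"])    # both encoded as -1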

@arjoly
Member

arjoly commented Jul 24, 2014

You have a rebase issue. I ran into this problem once; if I remember correctly, I fixed it
with the following strategy:

git checkout master
git fetch upstream
git rebase upstream/master   # bring your local master up to date with upstream
git checkout new-branch
git rebase master            # replay the feature branch onto the updated master
git push -f                  # force-push the rewritten branch

Hope it helps.

@mjbommar
Contributor Author

@arjoly, that appears to have worked. Thanks for the recommendation, and sorry for the mistake.

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling 866e939 on mjbommar:label-encoder-unseen-final into 376ac51 on scikit-learn:master.

@arjoly
Member

arjoly commented Jul 24, 2014

No problem, I ran into that issue once and it was frustrating. I'm happy that it works for you!

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling da4cafb on mjbommar:label-encoder-unseen-final into 376ac51 on scikit-learn:master.

@mblondel
Member

N.B.: Direct access to .classes_ should be replaced with .get_classes() in order to properly handle the new_labels="update".

But it's important to have a consistent API across the scikit. Could you use a property instead?
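
For illustration only, here is a toy sketch (not the PR's actual code; the class and attribute names are assumptions) of how classes_ can be exposed as a property over an internal list, so it stays in sync when "update" appends new labels:

import numpy as np

class LabelEncoderSketch:
    def __init__(self, new_labels="raise"):
        self.new_labels = new_labels
        self._classes = []

    def fit(self, y):
        self._classes = sorted(set(y))
        return self

    def transform(self, y):
        index = {label: i for i, label in enumerate(self._classes)}
        encoded = []
        for label in y:
            if label in index:
                encoded.append(index[label])
            elif self.new_labels == "update":
                # Append the unseen label; it receives the next free ID.
                index[label] = len(self._classes)
                self._classes.append(label)
                encoded.append(index[label])
            elif isinstance(self.new_labels, int):
                # Fixed integer fallback for unseen labels.
                encoded.append(self.new_labels)
            else:
                raise ValueError("y contains new labels: %r" % (label,))
        return np.asarray(encoded)

    @property
    def classes_(self):
        # Read-only view that also reflects labels added by "update".
        return np.asarray(self._classes)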

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling 0f3e3d3 on mjbommar:label-encoder-unseen-final into 376ac51 on scikit-learn:master.

@mjbommar
Contributor Author

@mblondel , thanks for the recommendation. Implemented as suggested.

@mjbommar changed the title from "Make LabelEncoder more friendly to new labels" to "[MRG-0] Make LabelEncoder more friendly to new labels" on Jul 25, 2014
@mjbommar
Contributor Author

mjbommar commented Aug 3, 2014

@arjoly, @jnothman, is there anything I can do to make this easier for you to review?

- If ``"raise"``, then raise ValueError.
- If ``"update"``, then re-map the new labels to
classes ``[N, ..., N+m-1]``, where ``m`` is the number of new labels.
- If an integer value is passed, then re-label with this value.
Member

Could this also work with a string label?

@jnothman
Member

@mjbommar, should we expect you won't be completing this any time soon and label it "needs contributor" for someone to adopt? I will do so, but you should say if you'd rather complete it.

@mjbommar
Contributor Author

mjbommar commented Apr 28, 2016

@jnothman, my recollection is fuzzy, but I think this was primarily blocked by design disagreements. If we can come to an agreement about the desired behavior, I could see how easily the work could be completed and merged into master.

@mjbommar closed this on Jul 6, 2016