[MRG+1] UnaryEncoder to encode ordinal features into unary levels #8652
Conversation
@jnothman @jmschrei Questions for reviewers
Need help with these 3 errors which I get while running nosetests after the build.
Thanks
Thanks for the PR... I think
Also what does
Just return
I've updated the PR description based on your subsequent comment. Hope you don't mind...
A few starting comments...
sklearn/preprocessing/data.py
@@ -1868,7 +1868,7 @@ def _transform(self, X):
        if np.any(~mask):
            if self.handle_unknown not in ['error', 'ignore']:
                raise ValueError("handle_unknown should be either error or "
Can you also make it either 'error' or 'ignore'?
(surround with single quotes)
Done. Also updated the similar error message for OneHotEncoder.
sklearn/preprocessing/data.py
        return self

    def _fit_transform(self, X):
        """Assumes X contains only oridinal features."""
ordinal
Fixed
sklearn/preprocessing/data.py
        except (ValueError, TypeError):
            raise TypeError("Wrong type for parameter `n_values`. Expected"
                            " 'auto', int or array of ints, got %r"
                            % type(X))
I think you can just print it, rather than printing the type...
Done
Thanks for reviewing, Venkat.
On naming, I feel
Thanks for bringing this up; in the current implementation it doesn't denote anything of use, because all the features in the output will always be active. I will remove it. In my first implementation, it had some relevance.
But then for this input
Actually thanks, looks cleaner.
Also, this encoder can give the same output for two different inputs. For example,
But I think that's okay, since the first feature is not holding anything meaningful anyway. Let me know if you have any concerns about it.
I think I'd call it more of a "concatenation of unary number sequences" rather than binary... I am strongly inclined towards
I think it should actually be -
Such that n_new_features >= n_features...
I think unary is less ambiguous than ordinal
is this a requirement for all encoders to satisfy? why? As I don't see a problem with n_new_features < n_features, since it's a lossless dimensionality reduction in this case.
I'm okay with discarding input features that are constant. It's fine to generally output k features for a feature with k values.
As of now I output k-1 features for an input feature with k values (so it ends up discarding input features with just 1 value).
I can also do no output features for input features that are constant (as @jnothman suggested). But all this makes n_out_features < n_features(in) a real possibility. The question is if that's acceptable or not?
having fewer output features is fine imo
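(For illustration, a minimal NumPy sketch of the "k-1 output features" rule being discussed; this is an editor's example, not code from this PR, and the toy arrays are made up.)

import numpy as np

# A feature with k possible values yields k - 1 unary columns, so a constant
# feature (k = 1) yields none and n_output_features can drop below n_features.
X = np.array([[0, 7],
              [1, 7]])                          # second feature is constant
k = [2, 1]                                      # assumed values per feature
parts = [(X[:, j][:, None] > np.arange(k[j] - 1)).astype(int)
         for j in range(X.shape[1])]
Xt = np.hstack(parts)                           # shape (2, 1): fewer outputs than inputs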
@arjunjauhari, are you still working on this? More than anything else, this needs tests. Here's something I had lying around which may apply, to get started:

import numpy as np
from numpy.testing import assert_array_equal, assert_equal

def test_ordinal_encoder():
    X = np.arange(-1, 5).reshape(-1, 1)
    est = OrdinalEncoder(4)
    Xt = est.fit_transform(X)
    assert_array_equal(Xt, [[0, 0, 0],   # -1
                            [0, 0, 0],   # 0
                            [1, 0, 0],   # 1
                            [1, 1, 0],   # 2
                            [1, 1, 1],   # 3
                            [1, 1, 1]])  # 4
    Xt2 = est.transform(X)
    assert_array_equal(Xt2, Xt)
    # smaller n_values leads to horizontally truncated output
    Xt3 = OrdinalEncoder(3).fit_transform(X)
    assert_array_equal(Xt3, np.array(Xt2)[:, :-1])
    # multiple input features stack output
    rng = np.random.RandomState(0)
    X_multi = rng.randint(3, size=(10, 3))
    X_multi_t = OrdinalEncoder(3).fit_transform(X_multi)
    assert_equal(X_multi_t.shape, (10, 3 * 2))
    expected = np.hstack([OrdinalEncoder(3).fit_transform(X_multi[:, i:i + 1])
                          for i in range(X_multi.shape[1])])
    assert_array_equal(expected, X_multi_t)
After 20 days of silence from @arjunjauhari, I'm marking this as Need Contributor. @arjunjauhari, let me know if you would rather keep working on it.
@jnothman I will continue working on this, if that's ok.
You're welcome to open a PR continuing the work (or start again) as far as I'm concerned. Note that we're currently trying to release version 0.19, so if you don't get quick responses, you should ping after a few weeks.
@ruxandraburtica I am planning to continue this work and take the PR to completion. I hope you have not started yet. Sorry for any inconvenience. @jnothman Sorry for my absence in the last month. I will complete this PR on priority. I hope that's okay.
That's fine. No great hurry, but it is hard to know what the status is when there's no response.
@jnothman Thanks, I will continue working.
@jnothman @arjunjauhari Actually I started working on this as well; I didn't open a pull request because there were a few more tests I wanted to add. I'll be at a computer in about 2 hours and create the PR.
Force-pushed from ecdae4f to f5ec141.
@jnothman @arjunjauhari I've opened a PR here: #9216. It still needs some tests, but I wanted to know if I should continue working on it.
@ruxandraburtica I am happy to merge your work with mine, if that's okay with you. That way none of our work goes unused.
@ruxandraburtica, you should be able to open a pull request to @arjunjauhari's branch, if that helps merge your work. Sorry for the miscommunication.
@arjunjauhari @jnothman Yes, I will do that today.
The warning needs tests
sklearn/preprocessing/data.py

    handle_greater : str, 'warn' or 'error' or 'clip', default='warn'
        Whether to raise an error or clip or warn if an
        ordinal feature >= n_values is passed in.
clip could do with a description, and warn can be described as clipping with a warning
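(A minimal sketch of the semantics being asked about, assuming the behaviour discussed in this thread; the helper name _clip_greater is invented for illustration and this is not the PR code.)

import warnings
import numpy as np

def _clip_greater(X, n_values, handle_greater='warn'):
    # 'clip' maps any value >= n_values onto the top level n_values - 1;
    # 'warn' does the same but emits a warning first; 'error' raises.
    mask = X >= n_values
    if mask.any():
        if handle_greater == 'error':
            raise ValueError("%d feature values exceed n_values"
                             % np.count_nonzero(mask))
        if handle_greater == 'warn':
            warnings.warn("%d feature values exceed n_values; clipping them"
                          % np.count_nonzero(mask))
    return np.minimum(X, n_values - 1)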
sklearn/preprocessing/data.py
        elif self.handle_greater == 'error':
            raise ValueError("Found feature values %s which exceeds "
                             "n_values during transform."
                             % X.ravel()[mask])
Not reporting how many?
sklearn/preprocessing/data.py
                self.n_values == 'auto'):
            n_values = np.max(X, axis=0) + 1
        elif isinstance(self.n_values, numbers.Integral):
            if (np.max(X, axis=0) >= self.n_values).any():
Isn't it redundant with the check at the end of _fit, which is better since it uses handle_greater?
def _generate_random_features_matrix(n_values=3, size=10):
    rng = np.random.RandomState(0)
    X = rng.randint(n_values, size=(size, n_values))
Is there a reason why n_features == n_values? Using n_samples and n_features variable names would be clearer.
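(One possible sketch of the suggested renaming, as an editor's illustration rather than code from the PR:)

import numpy as np

def _generate_random_features_matrix(n_samples=10, n_features=3, n_values=3):
    # deterministic random matrix of ordinal values in [0, n_values)
    rng = np.random.RandomState(0)
    return rng.randint(n_values, size=(n_samples, n_features))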
def test_unary_encoder_stack():
    # multiple input features stack to same output
    n_values = np.random.randint(2, 10)
We tend to prefer deterministic tests, using rng = np.random.RandomState(0) and rng.randint instead of np.random.randint.
    encoder.fit(X)

    # test that an error is raised when different shape
    larger_n_values = n_values + delta
This is not clear. It should be larger n_features and not larger n_values. This is linked to the use of n_values as both n_values and n_features in _generate_random_features_matrix.
    X = _generate_random_features_matrix(n_values, size)
    X_trans = enc.fit_transform(X)
    assert_equal(X_trans.shape, (size, unary_n_values * len(X[0])))
n_features is clearer than len(X[0]).
- Explaining handle_greater options
- Adding test cases to test warn option of handle_greater parameter
- Updated warning message
- updated feature_matrix generation function and made test deterministic
- making test cases clearer by using n_features
- n_values and n_features cleanup
conflicts and test failures...
Haven't really looked at this but not sure if I'm sold on
Here we're using it in the first sense, yes, which I think is not uncommon, but is less familiar in computer science and algebra. To clarify (using the framework of semantic roles), "one hot" in OneHotEncoder describes the manner of encoding, while "categorical" in CategoricalEncoder describes the kind of operand. "Unary" would describe the manner of encoding; "ordinal" the operand. I would expect an OrdinalEncoder to give me an option to encode strings or numbers as one-hot or unary or as cardinal integers. We could do that, but I also believe that the UnaryEncoder could/should support non-ordinal, real-valued input.
What would
On non-negative real values it would do the same: x > k for column k of output.
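(A minimal sketch of that idea, assuming the same thresholding rule; illustration only, not code from this PR.)

import numpy as np

x = np.array([0.2, 1.7, 3.4])            # non-negative real-valued feature
n_values = 4
Xt = (x[:, None] > np.arange(n_values - 1)).astype(int)
# column k is 1 iff x > k:
# array([[1, 0, 0],
#        [1, 1, 0],
#        [1, 1, 1]])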
I think @amueller has disappeared into teaching land, but others are welcome to state their opinion, including you...
I added some comments (mainly from the viewpoint of just having implemented CategoricalEncoder).
I think you still need to add a whatsnew notice?
    with scikit-learn estimators is to use a unary encoding, which is
    implemented in :class:`UnaryEncoder`. This estimator transforms each
    ordinal feature with ``m`` possible values into ``m - 1`` binary features,
    where the ith feature is active if x > i (for i = 0, ... k - 1).
Where is k coming from? Is this the same as m? (Or can k - 1 be m - 2?)
    since those already work on the basis of a particular feature value being
    < or > than a threshold, unlike linear and kernel-based models.

    Continuing the example above::
It is not directly clear to me that this refers to the list of example features (I was first searching for the last code example). Also, this example starts right away with integers, while the example features above are strings, and this step is not explained.
    array([[ 0., 1., 0., 1., 0., 0.]])

    By default, how many values each feature can take is inferred automatically
    from the dataset. It is possible to specify this explicitly using the parameter
Can you specify that this 'automatically' is done by looking at the maximum?
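(For illustration, a sketch of what inferring n_values automatically from the maximum would mean; assumed behaviour, not the exact PR code.)

import numpy as np

X = np.array([[0, 1],
              [2, 0],
              [1, 3]])
n_values = np.max(X, axis=0) + 1   # array([3, 4]): per-feature number of levels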
    By default, how many values each feature can take is inferred automatically
    from the dataset. It is possible to specify this explicitly using the parameter
    ``n_values``.
    * There are two genders, three possible continents and four web browsers in our
Need a blank line above this line (to get the list to render well)
    * ``["short", "tall"]``
    * ``["low income", "medium income", "high income"]``
    * ``["elementary school graduate", "high school graduate", "some college",
      "college graduate"]``
There is one space too many here at the beginning of the line (to align it with the previous line; you can always check the built docs in https://15748-843222-gh.circle-artifacts.com/0/home/ubuntu/scikit-learn/doc/_build/html/stable/_changed.html, see the explanation in http://scikit-learn.org/stable/developers/contributing.html#documentation).
    See also
    --------
    sklearn.preprocessing.OneHotEncoder: encodes categorical integer features
categorical -> ordinal ?
    See also
    --------
    sklearn.preprocessing.OneHotEncoder: encodes categorical integer features
        using a one-hot aka one-of-K scheme.
Can you add here a see also to CategoricalEncoder as well?
        Whether to raise an error or clip or warn if an
        ordinal feature >= n_values is passed in.

        - 'warn' (default): same as clip but with warning.
For CategoricalEncoder the default is error (and for OneHotEncoder as well). Is there a good reason to deviate in this case?
I see the discussion about this, @jnothman saying:
    I now wonder if we should have a default handle_greater='warn'. I think in practice that raising an error when a count-valued feature exceeds its training set range is too intrusive. Better off clipping but warning unless the user has specified otherwise.
I understand that reasoning for count-like values. But the examples in the documentation are not count-like, and for those this makes less sense as a default, I think.
(But I don't know with which type of data this will be used more often.)
        if self.handle_greater == 'error':
            raise ValueError("handle_greater='error' but found %d feature"
                             " values which exceeds n_values."
                             % np.count_nonzero(mask))
In principle this check is not needed in the case of 'auto' (but not sure whether a small performance improvement for this case is worth moving it elsewhere).
        data = np.ones(X_ceil.ravel().sum())
        out = sparse.coo_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
Maybe not too important, but it is also possible to directly create a csr matrix instead of first constructing coo and then converting (for CategoricalEncoder I edited the construction of the indices to do this: 85cf315)
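(A sketch of the direct CSR construction being suggested, with made-up toy indices rather than the PR's variables: scipy's csr_matrix accepts (data, indices, indptr) directly, avoiding the COO intermediate and the .tocsr() conversion.)

import numpy as np
from scipy import sparse

n_samples, n_output_features = 4, 6
column_indices = np.array([0, 1, 3, 4, 5])      # active columns, row by row
active_per_row = np.array([1, 2, 0, 2])         # entries contributed by each sample
indptr = np.concatenate([[0], np.cumsum(active_per_row)])
data = np.ones(len(column_indices))
out = sparse.csr_matrix((data, column_indices, indptr),
                        shape=(n_samples, n_output_features))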
@arjunjauhari, will you be following up on @jorisvandenbossche's comments?
@jnothman, will address the comments soon. @jorisvandenbossche thanks for the review.
@arjunjauhari I would love to have this available for use in the KBinsDiscretizer. Let us know if you would like someone else to finish it.
@jnothman I apologize for the delay. I think it's pretty close to the finish line. I will finish it up in a week or two.
Thanks @arjunjauhari. If it can be done this weekend, it has a good chance of being merged next week at the development sprints, and included in the upcoming release. It would be great if, in or subsequent to this PR, encoding='unary' could be added to the new
FYI (sorry if this duplicates existing comments):
@arjunjauhari do you want to finish this soonish?
@arjunjauhari FYI I have just opened #12893
@NicolasHug, @jnothman, this PR was superseded by #12893, now closed, as no consensus was reached about the implementation. Is the 'UnaryEncoder' issue still relevant? Is this implementation still useful as a starting point? Or should they be closed? Thanks for your help.
I think we can close this one too. If the idea of the
Reference Issue
Resolves #8628
Resolves #3336
What does this implement/fix? Explain your changes.
This implements OrdinalEncoder, which is a more informative encoding than one-hot for ordinal features. Implemented a new class OrdinalEncoder whose interface is the same as the OneHotEncoder class.
Any other comments?
Logic: For k values 0, ..., k - 1 of the ordinal feature x, this creates k - 1 binary features such that the ith is active if x > i (for i = 0, ..., k - 2).
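(As a rough, self-contained illustration of this logic; an editor's sketch with a made-up helper name, not the actual estimator code.)

import numpy as np

def unary_encode(x, k):
    # feature i (for i = 0, ..., k - 2) is active iff x > i
    return (np.asarray(x)[:, None] > np.arange(k - 1)).astype(int)

print(unary_encode([0, 1, 2, 3], k=4))
# [[0 0 0]
#  [1 0 0]
#  [1 1 0]
#  [1 1 1]]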
Working Example
Another One
A to-do list
UnaryEncoder [+2], OrdinalEncoder [0]