[MRG + 1] ENH: new CategoricalEncoder class #9151
Conversation
doc/modules/classes.rst
Outdated
@@ -1197,7 +1197,7 @@ See the :ref:`metrics` section of the user guide for further details.
preprocessing.MaxAbsScaler
preprocessing.MinMaxScaler
preprocessing.Normalizer
preprocessing.OneHotEncoder
This indicates deprecation of the old class? I'm not sure we want to do that yet.
Yeah, for now I just rebased the old PR. See my to-do list in the top post. I would for now just leave the OneHotEncoder as is; we can always decide to deprecate it later if we want.
Looked at the code before I looked at your description. I think your description is a good summary.
I'm generally good with the to-dos, though the …
And obviously I would prefer not to move this to a different module. I'd be fine with adding a note to the docs that this is experimental and might change, but I'll not press that point so we can move forward.
At least for the record, could you please remind me why this is superior to DictVectorizer().fit(frame.to_dict(orient='records')) and to ColumnTransformer([(f, LabelEncoder(), f) for f in fields])?
I appreciate the difference of this from OHE, and that it provides a more definitive interface for this kind of operation. We should at the same time clarify what OHE is for (ordinals; #8628 should get a similar interface) and what LabelEncoder is not for.
sklearn/preprocessing/data.py
Outdated
Parameters
----------
classes : 'auto', 2D array of ints or strings or both.
not 2d. A list of lists of values.
sklearn/preprocessing/data.py
Outdated
- 'auto' : Determine classes automatically from the training data.
- array: ``classes[i]`` holds the classes expected in the ith column.

categorical_features : 'all' or array of indices or mask
I'd support getting rid of this as unnecessary complexity.
sklearn/preprocessing/data.py
Outdated
Values per feature.

- 'auto' : Determine classes automatically from the training data.
- array: ``classes[i]`` holds the classes expected in the ith column.
Does this specify the order of these values in the output, or just the set? Must be clear.
    dictionary items (also handles string-valued features).
sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
    encoding of dictionary items or strings.
"""
Surely we need See Also to also describe the relationship to / distinction from OHE
> ColumnTransformer([(f, LabelEncoder(), f) for f in fields])

Followed by some version of the one-hot-encoder, right? Or I guess with LabelBinarizer() it would be fine? If that's a correct implementation that allows for an inverse_transform and get_feature_names, I'm all for it.

> DictVectorizer().fit(frame.to_dict(orient='records'))

Tbh I'm not familiar enough with how DictVectorizer treats integers and floats for that. Maybe a good argument would be the possibility of an exact inverse_transform, which we decided is out-of-scope?

There is also somewhere a hack that uses CountVectorizer(analyzer=lambda x: x) or something like that, and that also works for a single column.

If we actually decide (which was the consensus at the sprint between @GaelVaroquaux, @jorisvandenbossche and me) that we always want to transform all columns, then maybe one of these implementations could actually work. I would like something discoverable and with good feature names and the possibility to have some feature provenance in the future.

Maybe someone can write a blogpost about all the subtle differences between these lol.
Yes, I meant LabelBinarizer.
Problem with the LabelBinarizer is that it currently only works on strings, not numerical values (which could maybe be fixed), and that it doesn't play nicely with the ColumnTransformer due to both X and y being passed to the transformers (but there is a PR to fix this?). It also doesn't give us a get_feature_names out of the box.

DictVectorizer seems to work and gives us get_feature_names, but it treats string values and numerical values differently: strings are dummy-encoded, but integers are just passed through. So not fully the behaviour we want. I also think the conversion to a dict (instead of working on the columns as arrays) can become quite costly for larger datasets.

It is indeed using CountVectorizer(analyzer=lambda x: [x]) that gives us more or less exactly what we want. It also gives us get_feature_names (we only have to fix it to be able to deal with mixed strings and numerical values). So this could be used under the hood instead of LabelEncoder/OneHotEncoder. But given the quite different original use case, I am not sure this is a good way to go.

Full experimentation of the different possibilities: http://nbviewer.ipython.org/d6a79e96b490872905e74202d0818ab2
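To make the comparison concrete, here is a small sketch (not from the PR; the column names and data are made up, and it uses the get_feature_names API of that era, since renamed to get_feature_names_out) of how the two alternatives behave on a toy frame:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

frame = pd.DataFrame({'city': ['London', 'Paris', 'London'],
                      'salary': [50, 60, 70]})

# DictVectorizer: string values are dummy-encoded, numeric columns pass through as-is.
dv = DictVectorizer(sparse=False)
print(dv.fit_transform(frame.to_dict(orient='records')))
print(dv.get_feature_names())   # ['city=London', 'city=Paris', 'salary']

# CountVectorizer hack on a single string column: each cell becomes its own "token".
cv = CountVectorizer(analyzer=lambda x: [x])
print(cv.fit_transform(frame['city']).toarray())
print(cv.get_feature_names())   # ['London', 'Paris']
```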
I wouldn't use CountVectorizer under the hood here. But we might need to investigate performance of LabelEncoder's binary search (it uses that, doesn't it?) with a large vocabulary of categories.
searchsorted is a binary search and is likely to be much slower for a large number of strings than the dict lookup used by *Vectorizer.
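For reference, a rough sketch of the two lookup strategies being compared (not the actual scikit-learn code; LabelEncoder's transform is essentially the searchsorted variant):

```python
import numpy as np

classes = np.array(['blue', 'green', 'red'])   # fitted, sorted categories
values = np.array(['red', 'blue', 'red'])

# Binary search per value, O(log n_categories) each — what LabelEncoder does.
print(np.searchsorted(classes, values))        # [2 0 2]

# Dict lookup per value, O(1) each — what the *Vectorizer classes rely on.
lookup = {c: i for i, c in enumerate(classes)}
print([lookup[v] for v in values])             # [2, 0, 2]
```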
- remove compat code for numpy < 1.8
- remove categorical_features keyword
- make label_encoders_ private
- rename classes to categories
Hi Sci-kittens! :D I recently suggested on the mailing list adding a drop_one parameter to the OneHotEncoder, so that one of the columns at the beginning or end of the encoded array is dropped; that column is redundant (like including the same column twice), and dropping it will benefit some models like LinearRegression. I was pointed to this PR; could this be added to the new CategoricalEncoder?
- check that it works on pandas frames
- fix doctests
- un-deprecate OneHotEncoder
- undo changes in _transform_selected (as we no longer need those changes for CategoricalEncoder)
- add see also to OneHotEncoder and vice versa
- for now remove the self.feature_indices_ attribute
Force-pushed from 6c764e0 to 5f2b403.
OK, I cleaned the code a bit further and added some more tests; I think this is ready for more detailed review. @Trion129 That's indeed a possible extension of the current PR. If this is desired, we could add a keyword that determines this behaviour, with the default being to not drop any column (the current behaviour). As a reference, the pandas get_dummies function has a drop_first keyword for this.
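For illustration, this is the pandas behaviour referred to above (the eventual keyword name for CategoricalEncoder is not decided in this thread; the data here is made up):

```python
import pandas as pd

s = pd.Series(['red', 'green', 'blue', 'green'])

# Full dummy encoding: one column per category.
print(pd.get_dummies(s))

# drop_first=True removes the first category's column; it is fully determined
# by the remaining columns, so dropping it helps unregularized linear models
# avoid perfect collinearity.
print(pd.get_dummies(s, drop_first=True))
```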
(The docs should probably be further updated; for now this just changes the example using OneHotEncoder to CategoricalEncoder. But we might want to keep an example with OneHotEncoder as well, if there is a good one.)
Btw, in our tutorial @agramfort mentioned using integer encodings for categorical features, as that works reasonably well for trees. Should we implement that? If so, in this estimator? But probably not for now.
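A rough sketch of that integer-encoding idea, just for illustration (this is not part of this PR; it simply applies LabelEncoder column by column):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

X = np.array([['red', 'S'], ['green', 'M'], ['blue', 'S']], dtype=object)

# One integer code per category, per column; categories are sorted alphabetically.
X_int = np.empty(X.shape, dtype=int)
for i in range(X.shape[1]):
    X_int[:, i] = LabelEncoder().fit_transform(X[:, i])

print(X_int)
# [[2 1]
#  [1 0]
#  [0 1]]
```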
Mostly looks good. Needs attribute documentation, and then we can talk about what a good way to expose the per-feature categories is. Maybe also get_feature_names?
sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features
    encoded as columns of integers.
sklearn.preprocessing.CategoricalEncoder : handles nominal/categorical
    features encoded as columns of arbitraty data types.
is it? or strings or integers? What happens with pandas categorical?
arbitraty -> arbitrary
sklearn/preprocessing/data.py
Outdated
@@ -1730,13 +1728,14 @@ def _transform_selected(X, transform, selected="all", copy=True):

class OneHotEncoder(BaseEstimator, TransformerMixin):
-    """Encode categorical integer features using a one-hot aka one-of-K scheme.
+    """Encode ordinal integer features using a one-hot aka one-of-K scheme.
I don't think ordinal is right as they are not ordered.
I used "ordinal" to indicate that the actual set of values do mean something (eg the values [1,4,5] means that there are also categories 2 and 3 which are not present in the data), but I agree that this is not the same as being ordered.
That said, I also find 'categorical' misleading given the above behaviour.
I would say "categorical", and in the next line clarify that they need to be 0 to (n_categories - 1).
sklearn/preprocessing/data.py
Outdated
handle_unknown : str, 'error' or 'ignore'
    Whether to raise an error or ignore if a unknown categorical feature is
    present during transform.
we always raise an error when categories is not auto and an unknown category is encountered during training, right? Maybe we should make that explicit?
We should probably say what happens with the unknown value: all columns will be zero.
> we always raise an error when categories is not auto and an unknown category is encountered during training, right?

Shouldn't we also let that depend on this keyword? So also when you specify the categories yourself, you can ignore such errors about unknown values with this keyword (the default is to raise anyhow).
Ok, I'm fine with that.
sklearn/preprocessing/data.py
Outdated
Examples
--------
Given a dataset with three features and two samples, we let the encoder
find the maximum value per feature and transform the data to a binary
maximum value? Unique values, right? Maybe add in a 999? or just a 5?
Indeed, remaining from OneHotEncoder.
Will add another number. Ideally a string as well, but that's not possible with just using simple lists for the example...
still says maximum ;)
sklearn/preprocessing/data.py
Outdated
See also
--------
sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
    integer ordinal features. This transformer assumes that input features
I don't think ordinal is right. "This" -> OneHotEncoder (otherwise I feel it's ambiguous)
enc = CategoricalEncoder(sparse=False)
Xtr3 = enc.fit_transform(X)

assert_allclose(Xtr1.toarray(), Xtr2.toarray())
I don't understand this test. Determinism?
Good catch. Before, they did something different (a different selection of the columns, but I removed that feature), so this is not needed anymore.
assert_allclose(Xtr, [[1, 0, 1, 0], [0, 1, 0, 1]])

Xtr = CategoricalEncoder().fit_transform(X)
assert_allclose(Xtr.toarray(), [[1, 0, 1, 0, 1], [0, 1, 0, 1, 1]])
We should probably explain the handling of constant features. Does this make the most sense? Not sure what else....
So you mean the case when there is just one category, and now this gets one column with ones?
Also not sure what else we could do for that.
Yes. We could also remove it, or make it zeros. But ones is maybe fine? We should be explicit about what we do.
enc.fit(X)

X[0][0] = -1
msg = re.escape('Unknown feature(s) [-1] in column 0')
Shouldn't it be feature value? Unknown category?
indeed. Took 'category'
assert_array_equal(enc.fit_transform(X).toarray(), exp)

# don't follow order of passed categories, but sort them
enc = CategoricalEncoder(categories=[['c', 'b', 'a']])
That's not documented, is it?
See last bullet point in #9151 (but indeed, this is not yet documented)
For now, documented it as it currently behaves.
enc = CategoricalEncoder(categories=[['a', 'b']])
assert_raises(ValueError, enc.fit, X)
enc = CategoricalEncoder(categories=[['a', 'b']], handle_unknown='ignore')
enc.fit(X)
We should check the outcome for that. We only get two columns, right?
Currently this gives a row of all zeros (with indeed only two columns). Is this the desired outcome? (not sure what to do otherwise)
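For readers following along, a small sketch of the behaviour being discussed, assuming the CategoricalEncoder as implemented in this PR (it never shipped under this name in a released scikit-learn, so the exact output may differ):

```python
import numpy as np
from sklearn.preprocessing import CategoricalEncoder  # the class added by this PR

enc = CategoricalEncoder(categories=[['a', 'b']], handle_unknown='ignore')
enc.fit(np.array([['a'], ['b']], dtype=object))

# 'c' was never seen: with handle_unknown='ignore' its row comes out all zeros.
print(enc.transform(np.array([['a'], ['c']], dtype=object)).toarray())
# [[1. 0.]
#  [0. 0.]]
```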
Yes, I believe this should be able to produce ordinal encodings too. Yes, better for trees; and the presence of both LabelEncoder and LabelBinarizer (which has nothing to do with Binarizer) is only confusing, so best avoid similar here.
…On 13 July 2017 at 05:59, Andreas Mueller wrote:
Andreas Mueller commented on this pull request.
Mostly looks good. Needs attribute documentation and then we can talk about what a good way to expose the attributes is.
------------------------------
In sklearn/feature_extraction/dict_vectorizer.py
<#9151 (comment)>
:
> @@ -88,8 +89,8 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
See also
--------
FeatureHasher : performs vectorization using only a hash function.
- sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features
- encoded as columns of integers.
+ sklearn.preprocessing.CategoricalEncoder : handles nominal/categorical
+ features encoded as columns of arbitraty data types.
is it? or strings or integers? What happens with pandas categorical?
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> @@ -1730,13 +1728,14 @@ def _transform_selected(X, transform, selected="all", copy=True):
class OneHotEncoder(BaseEstimator, TransformerMixin):
- """Encode categorical integer features using a one-hot aka one-of-K scheme.
+ """Encode ordinal integer features using a one-hot aka one-of-K scheme.
I don't think ordinal is right as they are not ordered.
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> + categories : 'auto' or a list of lists/arrays of values.
+ Values per feature.
+
+ - 'auto' : Determine categories automatically from the training data.
+ - list : ``categories[i]`` holds the categories expected in the ith
+ column.
+
+ dtype : number type, default=np.float
+ Desired dtype of output.
+
+ sparse : boolean, default=True
+ Will return sparse matrix if set True else will return an array.
+
+ handle_unknown : str, 'error' or 'ignore'
+ Whether to raise an error or ignore if a unknown categorical feature is
+ present during transform.
we always raise an error when categories is not auto and an unknown
category is encountered during training, right? Maybe we should make that
explicit?
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> + column.
+
+ dtype : number type, default=np.float
+ Desired dtype of output.
+
+ sparse : boolean, default=True
+ Will return sparse matrix if set True else will return an array.
+
+ handle_unknown : str, 'error' or 'ignore'
+ Whether to raise an error or ignore if a unknown categorical feature is
+ present during transform.
+
+ Examples
+ --------
+ Given a dataset with three features and two samples, we let the encoder
+ find the maximum value per feature and transform the data to a binary
maximum value? Unique values, right? Maybe add in a 999? or just a 5?
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> + find the maximum value per feature and transform the data to a binary
+ one-hot encoding.
+
+ >>> from sklearn.preprocessing import CategoricalEncoder
+ >>> enc = CategoricalEncoder()
+ >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
+ ... # doctest: +ELLIPSIS
+ CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
+ handle_unknown='error', sparse=True)
+ >>> enc.transform([[0, 1, 1]]).toarray()
+ array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
+
+ See also
+ --------
+ sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
+ integer ordinal features. This transformer assumes that input features
I don't think ordinal is right. "This" -> OneHotEncoder (otherwise I feel
it's ambiguous)
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> + one-hot encoding.
+
+ >>> from sklearn.preprocessing import CategoricalEncoder
+ >>> enc = CategoricalEncoder()
+ >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
+ ... # doctest: +ELLIPSIS
+ CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
+ handle_unknown='error', sparse=True)
+ >>> enc.transform([[0, 1, 1]]).toarray()
+ array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
+
+ See also
+ --------
+ sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
+ integer ordinal features. This transformer assumes that input features
+ take on values in the range [0, max(feature)].
backticks?
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> + le = self._label_encoders_[i]
+ Xi = X[:, i]
+ if self.categories == 'auto':
+ le.fit(Xi)
+ else:
+ if not np.all(np.in1d(Xi, self.categories[i])):
+ if self.handle_unknown == 'error':
+ diff = np.setdiff1d(Xi, self.categories[i])
+ msg = 'Unknown feature(s) %s in column %d' % (diff, i)
+ raise ValueError(msg)
+ le.classes_ = np.array(np.sort(self.categories[i]))
+
+ return self
+
+ def transform(self, X, y=None):
+ """Encode the selected categorical features using the one-hot scheme.
Parameters? returns?
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> + "'ignore', got %s")
+ raise ValueError(template % self.handle_unknown)
+
+ X = check_array(X, dtype=np.object, accept_sparse='csc', copy=True)
+ n_samples, n_features = X.shape
+
+ self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]
+
+ for i in range(n_features):
+ le = self._label_encoders_[i]
+ Xi = X[:, i]
+ if self.categories == 'auto':
+ le.fit(Xi)
+ else:
+ if not np.all(np.in1d(Xi, self.categories[i])):
+ if self.handle_unknown == 'error':
This is different from what the docstring says, but also reasonable
behavior.
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> +
+ X = check_array(X, dtype=np.object, accept_sparse='csc', copy=True)
+ n_samples, n_features = X.shape
+
+ self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]
+
+ for i in range(n_features):
+ le = self._label_encoders_[i]
+ Xi = X[:, i]
+ if self.categories == 'auto':
+ le.fit(Xi)
+ else:
+ if not np.all(np.in1d(Xi, self.categories[i])):
+ if self.handle_unknown == 'error':
+ diff = np.setdiff1d(Xi, self.categories[i])
+ msg = 'Unknown feature(s) %s in column %d' % (diff, i)
during fit?
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> + return self
+
+ def transform(self, X, y=None):
+ """Encode the selected categorical features using the one-hot scheme.
+ """
+ X = check_array(X, accept_sparse='csc', dtype=np.object, copy=True)
+ n_samples, n_features = X.shape
+ X_int = np.zeros_like(X, dtype=np.int)
+ X_mask = np.ones_like(X, dtype=np.bool)
+
+ for i in range(n_features):
+ valid_mask = np.in1d(X[:, i], self._label_encoders_[i].classes_)
+
+ if not np.all(valid_mask):
+ if self.handle_unknown == 'error':
+ diff = np.setdiff1d(X[:, i],
I guess negating valid_mask here might not necessarily make it faster?
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> + if self.handle_unknown == 'error':
+ diff = np.setdiff1d(X[:, i],
+ self._label_encoders_[i].classes_)
+ msg = 'Unknown feature(s) %s in column %d' % (diff, i)
+ raise ValueError(msg)
+ else:
+ # Set the problematic rows to an acceptable value and
+ # continue `The rows are marked `X_mask` and will be
+ # removed later.
+ X_mask[:, i] = valid_mask
+ X[:, i][~valid_mask] = self._label_encoders_[i].classes_[0]
+ X_int[:, i] = self._label_encoders_[i].transform(X[:, i])
+
+ mask = X_mask.ravel()
+ n_values = [le.classes_.shape[0] for le in self._label_encoders_]
+ n_values = np.hstack([[0], n_values])
same as np.array([0] + n_values), right?
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> + categories : 'auto' or a list of lists/arrays of values.
+ Values per feature.
+
+ - 'auto' : Determine categories automatically from the training data.
+ - list : ``categories[i]`` holds the categories expected in the ith
+ column.
+
+ dtype : number type, default=np.float
+ Desired dtype of output.
+
+ sparse : boolean, default=True
+ Will return sparse matrix if set True else will return an array.
+
+ handle_unknown : str, 'error' or 'ignore'
+ Whether to raise an error or ignore if a unknown categorical feature is
+ present during transform.
We should probably say what happens with the unknown value: all columns
will be zero.
------------------------------
In sklearn/preprocessing/data.py
<#9151 (comment)>
:
> +
+ Examples
+ --------
+ Given a dataset with three features and two samples, we let the encoder
+ find the maximum value per feature and transform the data to a binary
+ one-hot encoding.
+
+ >>> from sklearn.preprocessing import CategoricalEncoder
+ >>> enc = CategoricalEncoder()
+ >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
+ ... # doctest: +ELLIPSIS
+ CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
+ handle_unknown='error', sparse=True)
+ >>> enc.transform([[0, 1, 1]]).toarray()
+ array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
+
Attributes? seems pretty important ;)
------------------------------
In sklearn/preprocessing/tests/test_data.py
<#9151 (comment)>
:
> @@ -1952,6 +1955,98 @@ def test_one_hot_encoder_unknown_transform():
assert_raises(ValueError, oh.transform, y)
+def check_categorical(X):
+ enc = CategoricalEncoder()
+ Xtr1 = enc.fit_transform(X)
+
+ enc = CategoricalEncoder()
+ Xtr2 = enc.fit_transform(X)
+
+ enc = CategoricalEncoder(sparse=False)
+ Xtr3 = enc.fit_transform(X)
+
+ assert_allclose(Xtr1.toarray(), Xtr2.toarray())
I don't understand this test. Determinism?
------------------------------
In sklearn/preprocessing/tests/test_data.py
<#9151 (comment)>
:
> + assert sparse.issparse(Xtr1)
+ assert sparse.issparse(Xtr2)
+ return Xtr1.toarray()
+
+
+def test_categorical_encoder():
+ X = [['abc', 1, 55], ['def', 2, 55]]
+
+ Xtr = check_categorical(np.array(X)[:, [0]])
+ assert_allclose(Xtr, [[1, 0], [0, 1]])
+
+ Xtr = check_categorical(np.array(X)[:, [0, 1]])
+ assert_allclose(Xtr, [[1, 0, 1, 0], [0, 1, 0, 1]])
+
+ Xtr = CategoricalEncoder().fit_transform(X)
+ assert_allclose(Xtr.toarray(), [[1, 0, 1, 0, 1], [0, 1, 0, 1, 1]])
We should probably explain the handling of constant features. Does this
make the most sense? Not sure what else....
------------------------------
In sklearn/preprocessing/tests/test_data.py
<#9151 (comment)>
:
> +
+ Xtr = check_categorical(np.array(X)[:, [0, 1]])
+ assert_allclose(Xtr, [[1, 0, 1, 0], [0, 1, 0, 1]])
+
+ Xtr = CategoricalEncoder().fit_transform(X)
+ assert_allclose(Xtr.toarray(), [[1, 0, 1, 0, 1], [0, 1, 0, 1, 1]])
+
+
+def test_categorical_encoder_errors():
+
+ enc = CategoricalEncoder()
+ X = [[1, 2, 3], [4, 5, 6]]
+ enc.fit(X)
+
+ X[0][0] = -1
+ msg = re.escape('Unknown feature(s) [-1] in column 0')
Shouldn't it be feature value? Unknown category?
------------------------------
In sklearn/preprocessing/tests/test_data.py
<#9151 (comment)>
:
> + enc.fit(X)
+ X[0][0] = -1
+ Xtr = enc.transform(X)
+ assert_allclose(Xtr.toarray(), [[0, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1]])
+
+
+def test_categorical_encoder_specified_categories():
+ X = np.array([['a', 'b']], dtype=object).T
+
+ enc = CategoricalEncoder(categories=[['a', 'b', 'c']])
+ exp = np.array([[1., 0., 0.],
+ [0., 1., 0.]])
+ assert_array_equal(enc.fit_transform(X).toarray(), exp)
+
+ # don't follow order of passed categories, but sort them
+ enc = CategoricalEncoder(categories=[['c', 'b', 'a']])
That's not documented, is it?
------------------------------
In sklearn/preprocessing/tests/test_data.py
<#9151 (comment)>
:
> + assert_array_equal(enc.fit_transform(X).toarray(), exp)
+
+ # multiple columns
+ X = np.array([['a', 'b'], ['A', 'C']], dtype=object).T
+ enc = CategoricalEncoder(categories=[['a', 'b', 'c'], ['A', 'B', 'C']])
+ exp = np.array([[1., 0., 0., 1., 0., 0.],
+ [0., 1., 0., 0., 0., 1.]])
+ assert_array_equal(enc.fit_transform(X).toarray(), exp)
+
+ # when specifying categories manually, unknown categories should already
+ # raise when fitting
+ X = np.array([['a', 'b', 'c']]).T
+ enc = CategoricalEncoder(categories=[['a', 'b']])
+ assert_raises(ValueError, enc.fit, X)
+ enc = CategoricalEncoder(categories=[['a', 'b']], handle_unknown='ignore')
+ enc.fit(X)
We should check the outcome for that. We only get two columns, right?
Any news on this? ;)
Sorry for the delay. The question is whether we want dataframe in -> dataframe out? That might be nice, but I'd rather merge without that and possibly add that later.
We're gonna do …
Yes, that is on my list of follow-ups (#9151 (comment)), although there are some questions about what it should exactly do (#9151 (comment)).
I didn't consider the 'dataframe out' (although that can also be a consideration, but a much bigger one I think). Here it was more about having some special code inside the transformer to prevent converting a dataframe with different dtypes to an 'object'-dtyped array (…)
@jorisvandenbossche ah, that makes more sense. I would leave it as-is. Merge?
I think the proposal with something like …
Yes, I think somebody can merge it!
Sure. Let's see what this thing does in the wild!
Congratulations! And thanks. Please feel free to make follow-up issues.
And thanks a lot for all the review!
My excitement about this is pretty much through the roof lol ;)
I would appreciate it if you could focus on #9012; maybe we can leave the get_feature_names for someone else, it shouldn't be too tricky.
Congratulations and thanks @jorisvandenbossche
Hi, any chance you could add an option to drop the first column?
Indeed, it's pretty hard to drop first in a pipeline if multiple variables are involved. Although we have generally avoided this kind of thing and relied on regularisation. I know we have had similar requests for OHE, but please create a new issue rather than a comment.
For people who are subscribed here: I opened a new issue with some questions on the API and naming: #10521
This is currently simply a rebase of PR #6559 by @vighneshbirodkar.
Some context: PR #6559 was the first of a series of PRs related to this; it added a CategoricalEncoder. Then it was decided, instead of adding a new class, to add this functionality to the existing OneHotEncoder (#8793 and #7327), recently taken up by @stephen-hoover in #8793.
At the sprint we discussed this, and @amueller put a summary of that in #8793 (comment).
The main reason not to add this to OneHotEncoder is that it is fundamentally different behaviour (OneHotEncoder determines the categories based on the range of the positive integer values passed in, while the new CategoricalEncoder would determine them based on the unique values), and that almost all keywords, attributes and behaviour of the current OneHotEncoder would be deprecated, which makes implementing this in one class (deprecated + new behaviour) overly complex.
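A minimal sketch (not the PR's actual implementation) of the unique-value-based behaviour described above: the categories are whatever values occur in a column, not the integer range 0..max, so strings work just as well as integers.

```python
import numpy as np

X = np.array([['cat', 5], ['dog', 5], ['cat', 17]], dtype=object)

encoded_columns = []
for i in range(X.shape[1]):
    categories = np.unique(X[:, i])               # only the observed values
    codes = np.searchsorted(categories, X[:, i])  # integer code per sample
    encoded_columns.append(np.eye(len(categories))[codes])

print(np.hstack(encoded_columns))
# [[1. 0. 1. 0.]
#  [0. 1. 1. 0.]
#  [1. 0. 0. 1.]]
```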
The basics already work nicely with the rebased PR:
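For example (a sketch with made-up data; the original code example from the PR description is not reproduced here, and exact output formatting may vary):

```python
import numpy as np
from sklearn.preprocessing import CategoricalEncoder  # the class added in this PR

X = np.array([['male', 'from US', 'uses Safari'],
              ['female', 'from Europe', 'uses Firefox']], dtype=object)

enc = CategoricalEncoder(sparse=False)
print(enc.fit_transform(X))
# [[0. 1. 0. 1. 0. 1.]
#  [1. 0. 1. 0. 1. 0.]]
```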
Some changes I would like to make to the current PR:
- Remove the `categorical_features` keyword (the ability to select certain columns) for now to keep it more simple, as this can always be achieved in combination with the ColumnTransformer.
- Add a `categories_` attribute that is a dict of {column number/name : [categories]}. And maybe the underlying LabelEncoders can be hidden from the users (currently stored in the `label_encoders_` attribute).
- Add a `get_feature_names()` method?

But before doing that, I wanted to check if we agree on this way forward (separate CategoricalEncoder class). @jnothman are you OK with this or still in favor of doing the changes in the OneHotEncoder?
If we want to keep the name OneHotEncoder, another option would be to implement the 'new' OneHotEncoder in e.g. a 'sklearn.future' module, so people can still move to it gradually and the current one can be deprecated, while keeping the implementations separate.
Closes #7375, closes #7327, closes #4920, closes #3956
Related issues that can possibly be closed as well: #3599, #8136