[MRG+1] UnaryEncoder to encode ordinal features into unary levels #8652
Conversation
@jnothman @jmschrei Questions for reviewers
Need help with these 3 errors which I get while running nosetests after the build.
Thanks
Thanks for the PR... I think
Also what does
Just return
I've updated the PR description based on your subsequent comment. Hope you don't mind...
A few starting comments...
sklearn/preprocessing/data.py
@@ -1868,7 +1868,7 @@ def _transform(self, X):
        if np.any(~mask):
            if self.handle_unknown not in ['error', 'ignore']:
                raise ValueError("handle_unknown should be either error or "
Can you also make it either 'error' or 'ignore'?
(surround with single quotes)
Done. Also updated the similar error message for OneHotEncoder.
sklearn/preprocessing/data.py
        return self

    def _fit_transform(self, X):
        """Assumes X contains only oridinal features."""
ordinal
Fixed
sklearn/preprocessing/data.py
        except (ValueError, TypeError):
            raise TypeError("Wrong type for parameter `n_values`. Expected"
                            " 'auto', int or array of ints, got %r"
                            % type(X))
I think you can just print it, rather than printing the type...
Done
Thanks for reviewing, Venkat.
On naming, I feel
Thanks for bringing this up; in the current implementation it doesn't denote anything of use, because all the features in the output will always be active. I will remove it. In my first implementation, it had some relevance.
But then for this input
Actually thanks, looks cleaner.
Also, this encoder can give the same output for two different inputs. For example,
But I think that's okay, since the first feature is not holding anything meaningful anyway. Let me know if you have any concerns about it.
I think I'd call it more of a "concatenation of unary number sequences" rather than binary... I am strongly inclined towards
I think it should actually be -
Such that n_new_features >= n_features...
I think unary is less ambiguous than ordinal
is this a requirement for all encoders to satisfy? why? As I don't see a problem with n_new_features < n_features, since it's a lossless dimensionality reduction in this case.
I'm okay with discarding input features that are constant. It's fine to generally output k features for a feature with k values.
As of now I output k-1 features for an input feature with k values (so it ends up discarding input features with just 1 value).
I can also do no output features for input features that are constant (as @jnothman suggested). But all this makes n_out_features < n_features(in) a real possibility. The question is if that's acceptable or not?
having fewer output features is fine imo
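(For illustration, a minimal NumPy sketch of the "k-1 output features" rule being discussed; this is an editor's example, not code from this PR, and the toy arrays are made up.)

import numpy as np

# A feature with k possible values yields k - 1 unary columns, so a constant
# feature (k = 1) yields none and n_output_features can drop below n_features.
X = np.array([[0, 7],
              [1, 7]])                          # second feature is constant
k = [2, 1]                                      # assumed values per feature
parts = [(X[:, j][:, None] > np.arange(k[j] - 1)).astype(int)
         for j in range(X.shape[1])]
Xt = np.hstack(parts)                           # shape (2, 1): fewer outputs than inputs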
@arjunjauhari, are you still working on this? More than anything else, this needs tests. Here's something I had lying around which may apply, to get started:

import numpy as np
from numpy.testing import assert_array_equal, assert_equal

def test_ordinal_encoder():
    X = np.arange(-1, 5).reshape(-1, 1)
    est = OrdinalEncoder(4)
    Xt = est.fit_transform(X)
    assert_array_equal(Xt, [[0, 0, 0],   # -1
                            [0, 0, 0],   # 0
                            [1, 0, 0],   # 1
                            [1, 1, 0],   # 2
                            [1, 1, 1],   # 3
                            [1, 1, 1]])  # 4
    Xt2 = est.transform(X)
    assert_array_equal(Xt2, Xt)
    # smaller n_values leads to horizontally truncated output
    Xt3 = OrdinalEncoder(3).fit_transform(X)
    assert_array_equal(Xt3, np.array(Xt2)[:, :-1])
    # multiple input features stack output
    rng = np.random.RandomState(0)
    X_multi = rng.randint(3, size=(10, 3))
    X_multi_t = OrdinalEncoder(3).fit_transform(X_multi)
    assert_equal(X_multi_t.shape, (10, 3 * 2))
    expected = np.hstack([OrdinalEncoder(3).fit_transform(X_multi[:, i:i + 1])
                          for i in range(X_multi.shape[1])])
    assert_array_equal(expected, X_multi_t)
After 20 days of silence from @arjunjauhari, I'm marking this as Need Contributor. @arjunjauhari, let me know if you would rather keep working on it.
@jnothman I will continue working on this, if that's ok.
You're welcome to open a PR continuing the work (or start again) as far as I'm concerned. Note that we're currently trying to release version 0.19, so if you don't get quick responses, you should ping after a few weeks.
@ruxandraburtica I am planning to continue this work and take the PR to completion. I hope you have not started yet. Sorry for any inconvenience. @jnothman Sorry for my absence in the last month. I will complete this PR on priority. I hope that's okay.
That's fine. No great hurry, but it is hard to know what the status is when there's no response.
@jnothman Thanks, I will continue working.
@jnothman @arjunjauhari Actually I started working on this as well; I didn't open a pull request because there were a few more tests I wanted to add. I'll be at a computer in about 2 hours and create the PR.
Force-pushed from ecdae4f to f5ec141.
@jnothman @arjunjauhari I've opened a PR here: #9216. It still needs some tests, but I wanted to know if I should continue working on it.
@ruxandraburtica I am happy to merge your work with mine, if that's okay with you. That way none of our work goes unused.
@ruxandraburtica, you should be able to open a pull request to @arjunjauhari's branch, if that helps merge your work. Sorry for the miscommunication.
@arjunjauhari @jnothman Yes, I will do that today.
The warning needs tests
sklearn/preprocessing/data.py

    handle_greater : str, 'warn' or 'error' or 'clip', default='warn'
        Whether to raise an error or clip or warn if an
        ordinal feature >= n_values is passed in.
clip could do with a description, and warn can be described as clipping with a warning
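(A minimal sketch of the semantics being asked about, assuming the behaviour discussed in this thread; the helper name _clip_greater is invented for illustration and this is not the PR code.)

import warnings
import numpy as np

def _clip_greater(X, n_values, handle_greater='warn'):
    # 'clip' maps any value >= n_values onto the top level n_values - 1;
    # 'warn' does the same but emits a warning first; 'error' raises.
    mask = X >= n_values
    if mask.any():
        if handle_greater == 'error':
            raise ValueError("%d feature values exceed n_values"
                             % np.count_nonzero(mask))
        if handle_greater == 'warn':
            warnings.warn("%d feature values exceed n_values; clipping them"
                          % np.count_nonzero(mask))
    return np.minimum(X, n_values - 1)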
sklearn/preprocessing/data.py
        elif self.handle_greater == 'error':
            raise ValueError("Found feature values %s which exceeds "
                             "n_values during transform."
                             % X.ravel()[mask])
Not reporting how many?
sklearn/preprocessing/data.py
                self.n_values == 'auto'):
            n_values = np.max(X, axis=0) + 1
        elif isinstance(self.n_values, numbers.Integral):
            if (np.max(X, axis=0) >= self.n_values).any():
Isn't it redundant with the check at the end of _fit, which is better since it uses handle_greater?
def _generate_random_features_matrix(n_values=3, size=10):
    rng = np.random.RandomState(0)
    X = rng.randint(n_values, size=(size, n_values))
Is there a reason why n_features == n_values? Using n_samples and n_features variable names would be clearer.
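(One possible sketch of the suggested renaming, as an editor's illustration rather than code from the PR:)

import numpy as np

def _generate_random_features_matrix(n_samples=10, n_features=3, n_values=3):
    # deterministic random matrix of ordinal values in [0, n_values)
    rng = np.random.RandomState(0)
    return rng.randint(n_values, size=(n_samples, n_features))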
def test_unary_encoder_stack():
    # multiple input features stack to same output
    n_values = np.random.randint(2, 10)
We tend to prefer deterministic tests, using rng = np.random.RandomState(0) and rng.randint instead of np.random.randint.
    encoder.fit(X)

    # test that an error is raised when different shape
    larger_n_values = n_values + delta
This is not clear. It should be larger n_features and not larger n_values. This is linked to the use of n_values as both n_values and n_features in _generate_random_features_matrix.
    X = _generate_random_features_matrix(n_values, size)
    X_trans = enc.fit_transform(X)
    assert_equal(X_trans.shape, (size, unary_n_values * len(X[0])))
n_features is clearer than len(X[0]).
- Explaining handle_greater options
- Adding test cases to test warn option of handle_greater parameter
- Updated warning message
- updated feature_matrix generation function and made test deterministic
- making test cases clearer by using n_features
- n_values and n_features cleanup
conflicts and test failures...
Haven't really looked at this but not sure if I'm sold on
Here we're using it in the first sense, yes, which I think is not uncommon, but is less familiar in computer science and algebra. To clarify (using the framework of semantic roles), "one hot" in OneHotEncoder describes the manner of encoding, while "categorical" in CategoricalEncoder describes the kind of operand. "Unary" would describe the manner of encoding; "ordinal" the operand. I would expect an OrdinalEncoder to give me an option to encode strings or numbers as one-hot or unary or as cardinal integers. We could do that, but I also believe that the UnaryEncoder could/should support non-ordinal, real-valued input.
What would
On non-negative real values it would do the same: x > k for column k of output.
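(A minimal sketch of that idea, assuming the same thresholding rule; illustration only, not code from this PR.)

import numpy as np

x = np.array([0.2, 1.7, 3.4])            # non-negative real-valued feature
n_values = 4
Xt = (x[:, None] > np.arange(n_values - 1)).astype(int)
# column k is 1 iff x > k:
# array([[1, 0, 0],
#        [1, 1, 0],
#        [1, 1, 1]])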
I think @amueller has disappeared into teaching land, but others are welcome to state their opinion, including you...
I added some comments (mainly from the viewpoint of just having implemented CategoricalEncoder).
I think you still need to add a whatsnew notice?
    with scikit-learn estimators is to use a unary encoding, which is
    implemented in :class:`UnaryEncoder`. This estimator transforms each
    ordinal feature with ``m`` possible values into ``m - 1`` binary features,
    where the ith feature is active if x > i (for i = 0, ... k - 1).
Where is k coming from? Is this the same as m? (Or can k - 1 be m - 2?)
    since those already work on the basis of a particular feature value being
    < or > than a threshold, unlike linear and kernel-based models.

    Continuing the example above::
It is not directly clear to me that this refers to the list of example features (I was first searching for the last code example). Also, this example starts right away with integers, while the example features above are strings, and this step is not explained.
    array([[ 0., 1., 0., 1., 0., 0.]])

    By default, how many values each feature can take is inferred automatically
    from the dataset. It is possible to specify this explicitly using the parameter
Can you specify that this 'automatically' is done by looking at the maximum?
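(For illustration, a sketch of what inferring n_values automatically from the maximum would mean; assumed behaviour, not the exact PR code.)

import numpy as np

X = np.array([[0, 1],
              [2, 0],
              [1, 3]])
n_values = np.max(X, axis=0) + 1   # array([3, 4]): per-feature number of levels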
    By default, how many values each feature can take is inferred automatically
    from the dataset. It is possible to specify this explicitly using the parameter
    ``n_values``.
    * There are two genders, three possible continents and four web browsers in our
Need a blank line above this line (to get the list to render well)
    * ``["short", "tall"]``
    * ``["low income", "medium income", "high income"]``
    * ``["elementary school graduate", "high school graduate", "some college",
      "college graduate"]``
There is one space too many here at the beginning of the line (to align it with the previous line; you can always check the built docs in https://15748-843222-gh.circle-artifacts.com/0/home/ubuntu/scikit-learn/doc/_build/html/stable/_changed.html, see the explanation in http://scikit-learn.org/stable/developers/contributing.html#documentation).
    See also
    --------
    sklearn.preprocessing.OneHotEncoder: encodes categorical integer features
categorical -> ordinal ?
    See also
    --------
    sklearn.preprocessing.OneHotEncoder: encodes categorical integer features
        using a one-hot aka one-of-K scheme.
Can you add here a see also to CategoricalEncoder as well?
        Whether to raise an error or clip or warn if an
        ordinal feature >= n_values is passed in.

        - 'warn' (default): same as clip but with warning.
For CategoricalEncoder the default is error (and for OneHotEncoder as well). Is there a good reason to deviate in this case?
I see the discussion about this, @jnothman saying:
    I now wonder if we should have a default handle_greater='warn'. I think in practice that raising an error when a count-valued feature exceeds its training set range is too intrusive. Better off clipping but warning unless the user has specified otherwise.
I understand that reasoning for count-like values. But the examples in the documentation are not count-like, and for those this makes less sense as a default, I think.
(But I don't know with which type of data this will be used more often.)
        if self.handle_greater == 'error':
            raise ValueError("handle_greater='error' but found %d feature"
                             " values which exceeds n_values."
                             % np.count_nonzero(mask))
In principle this check is not needed in the case of 'auto' (but not sure whether a small performance improvement for this case is worth moving it elsewhere).
        data = np.ones(X_ceil.ravel().sum())
        out = sparse.coo_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
Maybe not too important, but it is also possible to directly create a csr matrix instead of first constructing coo and then converting (for CategoricalEncoder I edited the construction of the indices to do this: 85cf315)
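(A sketch of the direct CSR construction being suggested, with made-up toy indices rather than the PR's variables: scipy's csr_matrix accepts (data, indices, indptr) directly, avoiding the COO intermediate and the .tocsr() conversion.)

import numpy as np
from scipy import sparse

n_samples, n_output_features = 4, 6
column_indices = np.array([0, 1, 3, 4, 5])      # active columns, row by row
active_per_row = np.array([1, 2, 0, 2])         # entries contributed by each sample
indptr = np.concatenate([[0], np.cumsum(active_per_row)])
data = np.ones(len(column_indices))
out = sparse.csr_matrix((data, column_indices, indptr),
                        shape=(n_samples, n_output_features))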
@arjunjauhari, will you be following up on @jorisvandenbossche's comments?
@jnothman, will address the comments soon. @jorisvandenbossche thanks for the review.
@arjunjauhari I would love to have this available for use in the KBinsDiscretizer. Let us know if you would like someone else to finish it.
@jnothman I apologize for the delay. I think it's pretty close to the finish line. I will finish it up in a week or two.
Thanks @arjunjauhari. If it can be done this weekend, it has a good chance of being merged next week at the development sprints, and included in the upcoming release. It would be great if, in or subsequent to this PR, encoding='unary' could be added to the new
FYI (sorry if this duplicates existing comments):
@arjunjauhari do you want to finish this soonish?
@arjunjauhari FYI I have just opened #12893
@NicolasHug, @jnothman, this PR was superseded by #12893, now closed, as no consensus was reached about the implementation. Is the 'UnaryEncoder' issue still relevant? Is this implementation still useful as a starting point? Or should they be closed? Thanks for your help.
I think we can close this one too. If the idea of the
Reference Issue
Resolves #8628
Resolves #3336
What does this implement/fix? Explain your changes.
This implements OrdinalEncoder, which is a more informative encoding than one-hot for ordinal features. Implemented a new class OrdinalEncoder whose interface is the same as the OneHotEncoder class.
Any other comments?
Logic: For k values 0, ..., k - 1 of the ordinal feature x, this creates k - 1 binary features such that the ith is active if x > i (for i = 0, ..., k - 2).
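(As a rough, self-contained illustration of this logic; an editor's sketch with a made-up helper name, not the actual estimator code.)

import numpy as np

def unary_encode(x, k):
    # feature i (for i = 0, ..., k - 2) is active iff x > i
    return (np.asarray(x)[:, None] > np.arange(k - 1)).astype(int)

print(unary_encode([0, 1, 2, 3], k=4))
# [[0 0 0]
#  [1 0 0]
#  [1 1 0]
#  [1 1 1]]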
Working Example
Another One
A to-do list
UnaryEncoder [+2], OrdinalEncoder [0]