[WIP] FrequencyEncoder #11805

sshleifer · 2018-08-13T22:27:40Z

What does this implement/fix? Explain your changes.

This is an alternative to LabelEncoder and OneHotEncoder that encodes categoricals based on the number of times they occur in the training data. It usually provides more information about the encoded value than LabelEncoder, at the cost of potential collisions (two categories with the same count receive the same code). I would love to add test coverage and examples if people think this is a good idea.

jnothman · 2018-08-13T22:52:05Z

This is not suitable for encoding classification labels, but for features.

jnothman · 2018-08-13T22:52:53Z

See also #9614

sshleifer · 2018-08-13T23:28:33Z

agreed that this is for features and should be in preprocessing/.
#9614 is very related, but it creates three arrays for every array you send it. This is completely different than LabelEncoder.

sshleifer · 2018-08-21T01:45:44Z

would it make sense to try to steer that PR to a simpler API?

amueller · 2018-08-21T14:29:48Z

> #9614 is very related, but it creates three arrays for every array you send it.

Can you elaborate on that? I don't think I understand.

sshleifer · 2018-08-22T03:47:14Z

Sure! From line 2935 of this file it appears that when CountFeaturizer.fit_transform is sent an array of shape (N, 1) it returns an array of shape (N, 3).

Another difference is that CountFeaturizer takes an inclusion argument: 'all', 'each', list, or numpy.ndarray, whereas in my experience most preprocessors preprocess all the data they are sent.

amueller · 2018-08-22T04:01:42Z

hm I think the interface of the CountVectorizer is a bit odd right now and I'd change some of that.
My expectation would be that in a multi-class setting you have n_classes many columns. How would you reduce it to a single column?

sshleifer · 2018-08-22T13:06:33Z

Something like

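from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """If a value appears n times in a column sent to fit, it is encoded as n."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        # One Counter per column: value -> number of occurrences in the training data.
        self._counts = [Counter(X[:, i]) for i in range(X.shape[1])]
        return self

    def transform(self, X):
        X = np.asarray(X)
        # try to broadcast more
        # Values not seen in fit are encoded as 0; the transpose restores X's shape.
        return np.array([[self._counts[i][elem] for elem in X[:, i]]
                         for i in range(X.shape[1])]).T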

CountVectorizer changes could also be interesting. I am trying to transform features, not labels, though, so I don't think it matters whether the setting is multi-class?
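The "try to broadcast more" note in the snippet above could be addressed roughly as follows. This is only a sketch under assumptions: `frequency_encode_column` is a hypothetical helper, not part of this PR or of scikit-learn, and it handles a single 1-D column using `np.unique` and `np.searchsorted` instead of a Python loop over elements.

```python
import numpy as np


def frequency_encode_column(train_col, col):
    """Hypothetical helper: map each value in `col` to its count in `train_col`.

    Values that never occur in `train_col` are encoded as 0.
    """
    train_col = np.asarray(train_col)
    col = np.asarray(col)
    # Sorted unique training values and how often each one occurs.
    values, counts = np.unique(train_col, return_counts=True)
    # Position each value of `col` would take in the sorted unique values.
    idx = np.searchsorted(values, col)
    # Clip so values larger than every training value still index safely.
    idx = np.clip(idx, 0, len(values) - 1)
    # A value was seen in training only if the entry at that position matches it.
    seen = values[idx] == col
    return np.where(seen, counts[idx], 0)


train = np.array(["red", "red", "green", "yellow", "red"])
new = np.array(["green", "red", "blue"])
encoded = frequency_encode_column(train, new)
print(encoded)
```

Here "blue" never appears in the training column, so it falls back to 0, matching the behaviour of looking up a missing key in a `Counter`.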

jnothman · 2018-08-23T00:09:35Z

What is that trying to capture? In classification, we would learn coefficients indicating the homogeneity of each class's categorical variables? Is that what we want? Rather, if you get the frequency with which the category occurs jointly with each class, you get a lot more information to learn from.
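One way to read the joint-frequency idea is sketched below, assuming a single categorical column and a label vector. `joint_class_counts` is a hypothetical helper for illustration only; it is not the CountFeaturizer API from #9614. It yields one output column per class, which is the n_classes-wide output mentioned earlier in the thread.

```python
from collections import Counter

import numpy as np


def joint_class_counts(column, y):
    """Hypothetical helper: for each row, count how often its category
    co-occurs with each class in the same data. Returns shape (n, n_classes)."""
    column = np.asarray(column).tolist()
    y = np.asarray(y).tolist()
    classes = sorted(set(y))
    # (category, class) -> number of rows with that combination.
    pair_counts = Counter(zip(column, y))
    return np.array([[pair_counts[(value, cls)] for cls in classes]
                     for value in column])


column = ["red", "red", "green", "green", "green", "yellow"]
y = [0, 1, 1, 1, 0, 0]
encoded = joint_class_counts(column, y)
print(encoded)
```

Each row now carries the class distribution of its category (for example, "green" maps to one co-occurrence with class 0 and two with class 1), whereas a single frequency column would only report the total of 3.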

sshleifer · 2018-08-23T13:03:08Z

My theory is that for a tree model, this would be like using OrdinalEncoder to transform categoricals, but would provide more information than just the order of appearance.

If our data is [red, red, red, red, yellow, red, green, green, green], replacing red with 1, yellow with 2, and green with 3 tells the model less than replacing them with their counts: 5, 1, and 3. CountFeaturizer provides even more info at the cost of more features.
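The comparison can be checked on the example list itself. The sketch below assumes "order of appearance" means numbering categories from 1 in the order they are first seen; it uses plain `collections.Counter` rather than any encoder from this PR.

```python
from collections import Counter

import numpy as np

colors = ["red", "red", "red", "red", "yellow",
          "red", "green", "green", "green"]

# Ordinal codes by order of first appearance: red -> 1, yellow -> 2, green -> 3.
first_seen = {}
for color in colors:
    first_seen.setdefault(color, len(first_seen) + 1)
ordinal = np.array([first_seen[color] for color in colors])

# Frequency codes: each value is replaced by how often it occurs in the list.
counts = Counter(colors)
frequency = np.array([counts[color] for color in colors])

print(ordinal)
print(frequency)
```

The ordinal codes depend only on which category happened to show up first, while the frequency codes separate the common category from the rare one.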

I can work on providing empirical proof, if that would help, but also happy to give up and look for other stuff to do!

sshleifer added 2 commits August 13, 2018 18:15

add Frequency Encoder

a713dc4

Fix docstring

b2c9508

sshleifer closed this Mar 27, 2019
