[WIP] FrequencyEncoder #11805

Closed
wants to merge 2 commits into from

Conversation

sshleifer
Contributor

@sshleifer sshleifer commented Aug 13, 2018

What does this implement/fix? Explain your changes.

This is an alternative to LabelEncoder and OneHotEncoder that encodes categoricals based on the number of times they occur in the training data. It usually provides more information about the encoded value than LabelEncoder does, at the cost of potential collisions. I would love to add test coverage and examples if people think this is a good idea.
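A quick standalone illustration of the idea and its main trade-off, using only collections.Counter (not the proposed class itself): categories with equal training frequencies collide under this scheme.

```python
from collections import Counter

X = ["a", "a", "b", "c", "c"]
counts = Counter(X)

# LabelEncoder-style would assign arbitrary ids, e.g. a -> 0, b -> 1, c -> 2.
# Frequency encoding replaces each value with its training count instead:
freq_encoded = [counts[v] for v in X]
# "a" and "c" both occur twice, so they collide on the code 2
```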

@jnothman
Member

This is not suitable for encoding classification labels, but for features.

@jnothman
Member

See also #9614

@sshleifer
Contributor Author

Agreed that this is for features and should live in preprocessing/.
#9614 is closely related, but it creates three arrays for every array you send it, which is completely different from LabelEncoder.

@sshleifer
Contributor Author

Would it make sense to try to steer that PR toward a simpler API?

@amueller
Member

#9614 is very related, but it creates three arrays for every array you send it.

Can you elaborate on that? I don't think I understand.

@sshleifer
Contributor Author

sshleifer commented Aug 22, 2018

Sure! From line 2935 of this file, it appears that when CountFeaturizer.fit_transform is sent an array of shape (N, 1), it returns an array of shape (N, 3).

Another difference is that CountFeaturizer takes an inclusion argument: 'all', 'each', list, or numpy.ndarray, whereas in my experience most preprocessors preprocess all the data they are sent.

@amueller
Member

hm I think the interface of the CountFeaturizer is a bit odd right now and I'd change some of that.
My expectation would be that in a multi-class setting you have n_classes many columns. How would you reduce it to a single column?

@sshleifer
Contributor Author

sshleifer commented Aug 22, 2018

Something like

    from collections import Counter

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin


    class FrequencyEncoder(BaseEstimator, TransformerMixin):
        """If a value appears n times in a column sent to fit, it is encoded as n."""

        def fit(self, X, y=None):
            # one Counter per column of X
            self._counts = [Counter(X[:, i]) for i in range(X.shape[1])]
            return self

        def transform(self, X):
            # replace each value by its per-column training count;
            # unseen values encode as 0. Transpose to keep the
            # (n_samples, n_features) orientation of the input.
            return np.array([[self._counts[i][elem]
                              for elem in X[:, i]]
                             for i in range(X.shape[1])]).T
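To make the intended shape behavior concrete, here is a standalone sketch of the same fit/transform logic on a two-column array, inlined with plain Counter calls (the column values here are illustrative, not from the PR):

```python
from collections import Counter

import numpy as np

X = np.array([["red", "s"],
              ["red", "m"],
              ["green", "s"]])

# "fit": one Counter per column
counts = [Counter(X[:, i]) for i in range(X.shape[1])]

# "transform": per-column counts, transposed back to (n_samples, n_features)
encoded = np.array([[counts[i][v] for v in X[:, i]]
                    for i in range(X.shape[1])]).T
# column 0: red -> 2, green -> 1; column 1: s -> 2, m -> 1
```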

Changes to CountFeaturizer could also be interesting. I am trying to transform features, not labels, though, so I don't think the multi-class setting matters?

@jnothman
Member

jnothman commented Aug 23, 2018 via email

@sshleifer
Contributor Author

My theory is that for a tree model, this would be like using OrdinalEncoder to transform categoricals, but it would provide more information than just the order of appearance.

If our data is [red, red, red, red, yellow, red, green, green, green], replacing red with 1, yellow with 2, and green with 3 tells the model less than replacing them with 5, 1, 3. CountFeaturizer provides even more info, at the cost of more features.

I can work on providing empirical proof, if that would help, but also happy to give up and look for other stuff to do!
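The encoding in the example above can be checked directly with collections.Counter:

```python
from collections import Counter

data = ["red", "red", "red", "red", "yellow", "red",
        "green", "green", "green"]
counts = Counter(data)
encoded = [counts[v] for v in data]
# red occurs 5 times, yellow once, green 3 times
```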

@sshleifer sshleifer closed this Mar 27, 2019
3 participants