[WIP] FrequencyEncoder #11805
This is not suitable for encoding classification labels, but for features.
See also #9614
agreed that this is for features and should be in
would it make sense to try to steer that PR to a simpler API?
Can you elaborate on that? I don't think I understand.
Sure! From line 2935 of this file it appears that when Another difference is that
hm I think the interface of the CountVectorizer is a bit odd right now and I'd change some of that.
Something like
CountVectorizer changes could also be interesting. I am trying to transform features, not labels, though, so I don't think it matters whether the setting is multi-class?
What is that trying to capture? In classification, would we learn coefficients indicating the homogeneity of each class's categorical variables? Is that what we want? If you instead take the frequency with which each category occurs jointly with each class, you get a lot more information to learn from.
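To make the joint-frequency idea concrete, here is a minimal sketch in plain Python. The data and variable names are invented for illustration and are not part of this PR; each category is encoded as a vector of its co-occurrence counts with every class.

```python
from collections import Counter, defaultdict

# Illustrative toy data (hypothetical, not from the PR).
categories = ["red", "red", "green", "green", "green", "yellow"]
labels = [0, 1, 1, 1, 0, 0]

# joint[category][class] = number of times they occur together.
joint = defaultdict(Counter)
for cat, y in zip(categories, labels):
    joint[cat][y] += 1

# Encode each sample as one count per class, in sorted class order.
classes = sorted(set(labels))
encoded = [[joint[cat][k] for k in classes] for cat in categories]
```

Unlike a single overall count, this needs the labels at fit time, so it would be a supervised encoding rather than a drop-in replacement for the encoder discussed here.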
My theory is that for a tree model, this would be like using if our data is [red, red, red, red, yellow, red, green, green, green], replacing red with 1, yellow with 2, and green with 3 tells the model less than replacing them with their counts 5, 1, and 3. I can work on providing empirical proof, if that would help, but I'm also happy to give up and look for other stuff to do!
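As a quick check of the example, the following sketch counts the categories in that list and compares arbitrary integer codes with count-based codes. It uses only the standard library; the variable names are illustrative.

```python
from collections import Counter

data = ["red", "red", "red", "red", "yellow", "red", "green", "green", "green"]

# Arbitrary integer codes, assigned in order of first appearance.
ordinal = {cat: i + 1 for i, cat in enumerate(dict.fromkeys(data))}

# Frequency codes: each category is replaced by its count in the data.
counts = Counter(data)

ordinal_encoded = [ordinal[c] for c in data]
frequency_encoded = [counts[c] for c in data]
```

Here red occurs 5 times, yellow once, and green 3 times, so the frequency codes carry the relative prevalence of each category, while the ordinal codes carry only an arbitrary ordering.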
What does this implement/fix? Explain your changes.
This is an alternative to LabelEncoder and OneHotEncoder that encodes categoricals based on the number of times they occur in the training data. It usually provides more information about the encoded value than LabelEncoder, at the cost of potential collisions (categories with equal counts receive the same code). I would love to add test coverage and examples if people think this is a good idea.
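For readers who want to see the behavior described above, here is a minimal pure-Python sketch. It is not the implementation in this PR: the class name FrequencyEncoderSketch is hypothetical, it only mimics the fit/transform convention, and mapping unseen categories to 0 is an assumption made for the example.

```python
from collections import Counter


class FrequencyEncoderSketch:
    """Hypothetical illustration; not the code proposed in this PR."""

    def fit(self, values):
        # Learn how often each category occurs in the training data.
        self.counts_ = Counter(values)
        return self

    def transform(self, values):
        # Assumption: categories unseen during fit are encoded as 0.
        return [self.counts_.get(v, 0) for v in values]


train = ["a", "a", "b", "b", "c"]
enc = FrequencyEncoderSketch().fit(train)

# "a" and "b" both occur twice in train, so they collide on the same code.
encoded = enc.transform(["a", "b", "c", "d"])
```

The collision between "a" and "b" is the trade-off mentioned above: the encoding stays one column wide, but two distinct categories with equal counts become indistinguishable.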