[MRG] CountFeaturizer for Categorical Data #9614
Conversation
tests are failing ;)
Have not had internet access the past few days due to moving and driving all my stuff back to university. Will take a look and see what's going on.
Hmm...
Those are undocumented arguments of methods.
(Accidentally closed this, reopening.)
That seems to be the error, and it only occurs in one specific Python 3.6.1 configuration. Edit: Ah, I see, thanks!
See comment above.
Yeah, I just created a PR for a more useful message (#9651).
The test cases are passing now. Somehow, during the copy-pasting to fix the merge conflict, the newline chars got deleted from the docstrings.
@chenhe95 what's the status on this? Is it good to go from your end? Then please rename to MRG instead of WIP.
It is indeed good to go; I have just changed it to MRG.
Can you resolve the conflicts please?
Force-pushed from 975d9bc to f1c8bc0.
Merge conflicts should be fixed!
I'll try to see what this is about.
I need to give this another good look. I'd like to be sure that what we've got is reasonably close to standard and useful...

@staticmethod
def _check_inclusion(inclusion, n_input_features=1):
    if inclusion is None:
        raise ValueError("Inclusion cannot be none")
not tested
    return np.array([[i] for i in range(n_input_features)])
elif CountFeaturizer._valid_data_type(inclusion):
    if len(inclusion) == 0:
        raise ValueError("Inclusion size must not be 0")
not tested
sklearn/preprocessing/data.py (outdated)
    else:
        return [inclusion]
else:
    raise ValueError("Illegal data type in inclusion")
not tested
The User Guide is missing, right?
Can we maybe make the example run a bit quicker?
max_estimators = (175 * n_datapoints // 500)
time_start = time.time()

for i in range(min_estimators, max_estimators + 1):
does going in steps of 10s make this faster?
@amueller requested me to do a preliminary review. I will post it here.
The Feature Count module considers two scenarios.
a) There is no y value. In this case the Feature Count counts how often each combination of x values appears in the training set.
b) There is a y list or matrix. In this case the counts are computed separately for each class value.
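Those two counting modes can be sketched as follows. This is purely illustrative; the function name and return layout are my own, not the PR's API.

```python
from collections import Counter
import numpy as np

def count_featurize(X, y=None):
    """Append count features to X (illustrative sketch, not the PR's API)."""
    X = np.asarray(X)
    rows = [tuple(r) for r in X]
    if y is None:
        # (a) no y: count how often each combination of x values appears
        counts = Counter(rows)
        new_cols = np.array([[counts[r]] for r in rows])
    else:
        # (b) with y: one count column per distinct class value
        classes = sorted(set(y))
        counts = Counter(zip(rows, y))
        new_cols = np.array([[counts[(r, c)] for c in classes] for r in rows])
    return np.hstack([X, new_cols])

X = [[0, 1], [0, 1], [1, 0]]
print(count_featurize(X))               # appends the row-combination counts
print(count_featurize(X, y=[0, 1, 0]))  # appends one count column per class
```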
time_start = time.time()
pipeline_cf = make_pipeline(
- It is unclear why the pipeline does not contain the classifier. Now the preprocessing steps and the classifier are separated.
Major consequence: Code is hard to understand.
Minor consequence: The time measurements only involve the classification time.
One of the premises CountFeaturizer was built on (based on this Microsoft ML article on count featurization: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/data-transformation-learning-with-counts) is that it may reduce training and classification time compared to something like one-hot encoding, because fewer features are generated. The pipeline was split here because it is necessary to separate the time spent preprocessing the data from the time spent training and using the classifier.
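The split can be sketched like this, with `StandardScaler` as a stand-in preprocessor (CountFeaturizer is not in a released scikit-learn):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

# Preprocessing runs outside the timed region...
prep = make_pipeline(StandardScaler())
X_trans = prep.fit_transform(X, y)

# ...so the timer measures only the classifier's training cost.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
t0 = time.time()
clf.fit(X_trans, y)
train_time = time.time() - t0
```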
clf = get_classifier()
labels = ["CountFeaturizer + RandomForestClassifier",
- Can this experiment (running the three pipelines on an increasing number of trees) be put into a for loop? That would remove the duplicated code.
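One way such a loop could look (a sketch: `StandardScaler` and the bare classifier stand in for the example's three preprocessing variants, since CountFeaturizer is not in a released scikit-learn):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# Map each plot label to a pipeline builder, so all variants share one loop.
builders = {
    "StandardScaler + RandomForestClassifier": lambda n: make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=n, random_state=0)),
    "RandomForestClassifier only": lambda n: RandomForestClassifier(
        n_estimators=n, random_state=0),
}

results = {}
for label, build in builders.items():
    for n_estimators in range(10, 31, 10):  # stepping by 10 keeps it fast
        t0 = time.time()
        build(n_estimators).fit(X, y)
        results.setdefault(label, []).append((n_estimators, time.time() - t0))
```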
plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
(* Personally, I'm a big fan of declaring the plotting labels and other settings (l. 135, 136, 137) high up in the code, so the script is reusable. For this it would be even better to put everything in a function, but maybe that is me taking it too far.)
@@ -2866,6 +2869,229 @@ def power_transform(X, method='box-cox', standardize=True, copy=True):
    return pt.fit_transform(X)


def _get_nested_counter(remaining, y_dim, inclusion_size):
- _get_nested_counter: I know it's not a requirement to document private functions, but it would make the review much easier to know what the components are supposed to do. E.g., what is a 'nested dictionary', what do you mean by layers, and a 2d array of what, in the end?
There is something in the CI tests (something to do with the pickle module) that does not allow you to do something like

a = {}
a[1] = {}
a[1][1] = 4

so this is a workaround. Should I add "This is a workaround due to some pickle issues in CI testing regarding creating nested dicts dynamically" to this as a comment?
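For context, the usual picklability trap with dynamically nested dicts is a lambda used as a `defaultdict` factory; a pattern that does survive pickling uses a named, module-level factory function. This is a general sketch, not the PR's exact workaround:

```python
import pickle
from collections import defaultdict

# defaultdict(lambda: defaultdict(int)) cannot be pickled, because the
# lambda default_factory has no importable name.
def _int_dict():
    return defaultdict(int)

counts = defaultdict(_int_dict)  # factory is a named, top-level function
counts["a"]["b"] += 1

# Round-trips through pickle without error.
restored = pickle.loads(pickle.dumps(counts))
```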
        _get_nested_counter, remaining - 1, y_dim, inclusion_size))


class CountFeaturizer(BaseEstimator, TransformerMixin):
- The code seems small and elegant for something with many functionalities.
          [1, 2, 1, 1], [1, 2, 1, 1]]))


def test_count_featurizer_inclusion():
- test_count_featurizer_inclusion() why not run all examples on both arrays?
          [1, 0, 2, 1, 1], [1, 0, 2, 1, 1]]))


def test_count_featurizer_multi_y():
- test_count_featurizer_multi_y: same question as (1); the three additional values don't make sense to me. Please explain in comments.
Does my comment on line 2527 clarify this? The first column in y contributes 1 of the columns being added, since it only takes 1 unique value. The second column in y contributes 2 of the columns being added, due to having two values, 0 and 1.
np.array([[0, 0, 2, 0, 3, 1], [0, 0, 2, 0, 3, 1],
          [1, 0, 1, 1, 3, 1], [1, 0, 1, 1, 3, 1]]))

cf = CountFeaturizer(inclusion=[[0]])
- line 2521, cf = CountFeaturizer(inclusion=[[0]]): this one can be checked independently (by cross-checking against an instance with a normally invoked featurizer). Same for line 2514.
By this, do you mean comparing the output against cf = CountFeaturizer() and checking if they are equal? If so, I don't think it will work, because CountFeaturizer() is equivalent to CountFeaturizer(inclusion=[0, 1]) in this specific case.
          [1, 2, 1, 1], [1, 2, 1, 1]]))


def test_count_featurizer_multi_inclusion():
- test_count_featurizer_multi_inclusion(): the y value is differently encoded (2d array), which influences how the inclusion param is invoked. What happens if I invoke the inclusion in the vanilla way?
If by the comment you mean y = np.array([[0], [0], [0], [1]]) vs. y = np.array([0, 0, 0, 1]), then it should be the same because of y = np.reshape(y, (-1, 1)). I can make a test case for that.
TODO: Test case for this scenario
np.array([[0, 0, 2, 0], [0, 0, 2, 0],
          [1, 0, 1, 1], [1, 0, 1, 1]]))

cf = CountFeaturizer(inclusion="each")
- test_count_featurizer_multi_inclusion line 2527: Please explain how this results in 12 new outputs
Each column of y gets its own output columns, one for each unique value it can take. For example, here, the first column of y has 2 unique values (0 and 1), so 2 columns are output for each inclusion list (since 'each' is used, there are 2 inclusion lists, [0] and [1]). The second column of y has 4 unique values (0, 1, 2, 3), so 4 columns are output for each inclusion list. As a result, (4 + 2, the total number of distinct y values) x (2, the number of inclusion lists) = 12 columns are output.
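The arithmetic can be checked directly. This sketch uses a hypothetical y matching the description (2 unique values in the first column, 4 in the second) rather than the PR's actual test data:

```python
import numpy as np

y = np.array([[0, 0], [0, 1], [1, 2], [1, 3]])  # hypothetical 2-column y
inclusion_lists = [[0], [1]]  # what inclusion="each" yields for 2 input features

# Number of unique values per y column: [2, 4]
n_unique = [len(np.unique(y[:, j])) for j in range(y.shape[1])]

# (2 + 4) distinct y values, times 2 inclusion lists = 12 output columns
n_new_columns = sum(n_unique) * len(inclusion_lists)
print(n_new_columns)  # 12
```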
Thank you for the feedback. I will look into your comments and try to address them.

Shows how to use CountFeaturizer to transform some categorical variables
into a frequency feature. CountFeaturizer can often be used to reduce
training time, classification time, and classification error.
I would replace the last sentence with:
CountFeaturizer can be used as an alternative to one-hot encoding for non-linear estimators that do not work efficiently on high-cardinality categorical variables, such as ensembles of trees (random forests and gradient-boosted trees).
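The width argument behind that suggestion is easy to demonstrate (a sketch; the data and cardinalities are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
# 3 high-cardinality categorical features, ~1000 distinct values each
X = rng.randint(0, 1000, size=(5000, 3))

# One-hot encoding yields roughly one column per distinct value (~3000 here);
# a count-based encoding would add only a few count columns per feature.
n_onehot_cols = OneHotEncoder().fit_transform(X).shape[1]
print(n_onehot_cols)
```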
I'm confused. The docs say there's one column per value of X; I would expect one column per value of y...
@amueller Could you elaborate on which line?
Two comments here, which echo concerns raised on the original issue (#5853):
There is an interesting benchmark here. @GaelVaroquaux cc @jorisvandenbossche, who was also interested in this.
Found an interesting reference for the GLM model here: the GLM seems to be the preferred choice given the benchmark I posted above.
A GLM encoder is a variant of a TargetEncoder. Indeed, using an OLS on orthonormal vectors leads to computing means. Adding regularization in the GLM amounts to shrinkage in the TargetEncoder.
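That equivalence can be sketched in code. This is my own illustration of shrinkage toward the global mean; the function name and the `smoothing` parameter (playing the role of the GLM's regularization strength) are assumptions, not an existing API:

```python
import numpy as np

def target_encode(categories, y, smoothing=10.0):
    """Encode each category by its target mean, shrunk toward the global mean."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    global_mean = y.mean()
    encoded = np.empty_like(y)
    for c in np.unique(categories):
        mask = categories == c
        n = mask.sum()
        # With smoothing=0 this is the plain per-category mean (the OLS case);
        # larger smoothing shrinks rare categories toward the global mean,
        # mirroring regularization in the GLM.
        encoded[mask] = (n * y[mask].mean() + smoothing * global_mean) / (n + smoothing)
    return encoded

enc = target_encode(["a", "a", "b"], [1.0, 1.0, 0.0], smoothing=0.0)
```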
@GaelVaroquaux indeed, but the GLM is a very particular form of determining the shrinkage.
I am closing this PR since it was superseded by #25334. Thank you for working on this issue.
Reference Issue
#5853
What does this implement/fix? Explain your changes.
It adds the CountFeaturizer transformation class, which can help improve accuracy by using how often a particular data row occurs as a feature.
Any other comments?
Currently a work in progress; please let me know if there is something I should add, or anything I can do in a better or faster way!
Also a continuation of #7803 and #8144.