WIP CountFeaturizer for Categorical Data #7803
Conversation
Next I will make transform() and fit() work just as the rest of the API expects of those two functions, and add more methods if necessary.
Definitely check out this: http://scikit-learn.org/dev/developers/contributing.html#rolling-your-own-estimator
    WIP, I will make the documentation afterwards
    """

    def fit(self, X, y=None, inclusion='all'):
inclusion should be a parameter to __init__ and stored in the estimator.
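A minimal sketch of that shape, assuming the class name and parameter from this diff (not the final implementation):

```python
from sklearn.base import BaseEstimator, TransformerMixin


class CountFeaturizer(BaseEstimator, TransformerMixin):
    """Adds features representing each feature value's count in training."""

    def __init__(self, inclusion='all'):
        # store constructor arguments unmodified (scikit-learn convention);
        # validation and any conversion happen in fit()
        self.inclusion = inclusion

    def fit(self, X, y=None):
        # counting would happen here, driven by self.inclusion
        return self
```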
        self.inclusion = inclusion
        return self

    def transform(self):
transform should also get "X", which will be different data than during training.
I see.
        if not _valid_data_type(X):
            raise ValueError("Only supports lists / numpy arrays (transform)")
        len_data = len(X)
        if len_data > 0:
Did a test fail without that? You should probably call X = check_array(X) instead.
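For illustration, a tiny standalone use of check_array (a sketch; the exact keyword arguments would depend on which inputs the transformer should accept):

```python
import numpy as np
from sklearn.utils import check_array

X = [[0, 1], [2, 3]]
# converts lists to a 2-D numpy array and raises ValueError for ragged
# rows, non-numeric values, NaN or inf by default
X = check_array(X)
print(X.shape)  # (2, 2)
```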
I agree. It seems like a good idea to call check_array(X) instead of repeating the same validation in multiple places.
        Sets the value of count_cache which holds the counts of each data point
        """

        if not _valid_data_type(X):
I think doing X = check_array(X) would be better here.
I believe I should also add a check that all of the rows X_i of the given X have the same len(X_i), because I suspect that is what is causing the test failure.
Because in line 628 of utils.estimator_checks.py, we have ...
@amueller Do you think I should add an extra parameter ensure_well_formed=False to utils.validation.check_array()? Or would it be better to handle it locally and just make a small helper function? Edit: The error came from the fact that no error was thrown when the fitted array has a different number of features than the transform array.
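If it is handled locally, a small helper along these lines could work (a sketch only; _check_n_features is a hypothetical name, and num_features is the attribute set in fit() in this diff):

```python
from sklearn.utils import check_array


def _check_n_features(transformer, X):
    """Hypothetical helper: validate X at transform time against fit time."""
    X = check_array(X)
    if X.shape[1] != transformer.num_features:
        raise ValueError(
            "X has %d features, but the transformer was fitted with %d"
            % (X.shape[1], transformer.num_features))
    return X
```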
This really deserves something in ...
I am currently looking into https://blogs.technet.microsoft.com/machinelearning/2015/11/03/using-azure-ml-to-build-clickthrough-prediction-models/
I'll try to see if I can come up with a toy example that reproduces those results in the benchmark and ROC curve.
    def _valid_data_type(type_check):
        """
        Defines the data types that are compatible with CountFeaturizer
        Currently, only Python lists and numpy arrays are accepted
        """
        return type(type_check) == np.ndarray or type(type_check) == list

    @staticmethod
    def _check_well_formed(X):
I could later integrate this into utils.validation.check_array() if necessary.
It is failing the test cases where the input X contains non-numeric values and where it contains float('inf'), but I want to make it fine to have non-numeric values or infinity in the input, since all CountFeaturizer does is add a new column to X. I may add a new parameter that lets the user replace some features with the count feature instead of only appending the count feature to the end of X, since the article mentions that the count feature was used as a substitute for OneHotEncoder when the number of possible values a particular categorical feature can take on is too high (such as strings).
Yeah, the result of the example looks great now :)
I intend to review shortly. Example at http://scikit-learn.org/circle?7307/auto_examples/preprocessing/count_featurizer_transformer.html
Please add to classes.rst
Thanks. You need to conform more to our idiom as described in the contributors' guide and seen in our other implementations.
More broadly, I'm not entirely convinced of the general utility of this method and specification of interface. I see how making the counts conditional on y would make it much more powerful.
# X_count is X with an additional feature column 'count'
cf = CountFeaturizer(inclusion=discretized_features)
X_count = cf.fit_transform(X)
I would much rather you execute this preprocessing in a pipeline. You are leaking information from the test set into the training set.
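For instance, a pipeline with cross-validation keeps the counts computed only on the training folds (a sketch assuming the CountFeaturizer from this branch; dataset and estimator parameters are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, random_state=123)

# CountFeaturizer comes from this PR's branch; it is fitted inside each
# training fold, so counts from held-out data never leak into training
pipe = make_pipeline(CountFeaturizer(inclusion='all'),
                     RandomForestClassifier(random_state=123))
scores = cross_val_score(pipe, X, y, cv=5)
```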
class CountFeaturizer(BaseEstimator, TransformerMixin):

    """
PEP257 says to put a description here that fits in one line. Perhaps "Adds features representing each feature value's count in training".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about these small PEP things. I'll fix them along with those accidentally added blank lines.
class CountFeaturizer(BaseEstimator, TransformerMixin):

    """
    Adds in a new feature column containing the number of occurrences of
I probably need to read any references (which should be listed in the docstring below), but I don't really get why this technique works. I have seen similar approaches where a feature is added for each class (i.e. each value of y), counting the frequency of features given each class.
Without being conditional on y, your transformer seems to be bluntly measuring density, and so is suited only to problems where the classes can be differentiated by their density in particular features.
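To make the distinction concrete, a toy sketch (not the PR's code) of unconditional counts versus counts conditional on y:

```python
from collections import Counter

import numpy as np

X = np.array([[0], [0], [1], [1], [1]])
y = np.array([0, 1, 0, 0, 0])

# unconditional: one count per feature-value combination (density only)
counts = Counter(tuple(row) for row in X.tolist())

# conditional on y: a count per (feature value, class) pair, i.e. one
# extra column per class at transform time
cond_counts = {c: Counter(tuple(row) for row in X[y == c].tolist())
               for c in np.unique(y)}

print(counts[(1,)])          # 3
print(cond_counts[0][(1,)])  # 3
print(cond_counts[1][(1,)])  # 0
```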
    Parameters
    ----------
    inclusion: set, list, numpy.ndarray, or string (only 'all'), default='all'
Need a space before the colon.
I'd be interested in supporting a list of lists of features, and an easy way for each feature to be counted independently.
    Parameters
    ----------
    inclusion: set, list, numpy.ndarray, or string (only 'all'), default='all'
        The inclusion criteria for counting
Please check RST formatting
            or type(type_check) == set

    @staticmethod
    def _check_well_formed(X):
Is there a reason not to use our usual check_array? We usually expect input to be an array (or sparse matrix). Please read the contributors' guide.
    def fit(self, X, y=None):
        """
        Sets the value of count_cache which holds the counts of each data point
Needs Parameters and Return sections
        return cols_X + 1

    def fit(self, X, y=None):
        """
PEP257: summary goes here
            # number of columns
            raise ValueError("inclusion or removal_policy incompatible")
        self.num_features = cols_X
        self.count_cache = {}
This is a model attribute and should end with _. Please read the contributors' guide.
Perhaps use a collections.Counter(). Indeed, this could look something like count_cache = Counter(tuple(x) for x in X.take(inclusion, axis=1).tolist())
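Expanded into a runnable form (a sketch; inclusion here is a plain list of column indices):

```python
from collections import Counter

import numpy as np

X = np.array([[1, 0, 3],
              [1, 0, 4],
              [2, 5, 6]])
inclusion = [0, 1]

# one pass over the included columns; value tuples become dict keys and
# missing keys count as 0 at lookup time
count_cache = Counter(tuple(x) for x in X.take(inclusion, axis=1).tolist())
print(count_cache[(1, 0)])  # 2
```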
        self.num_features = cols_X
        self.count_cache = {}
        for data_i in X:
            data_i_tuple = self._extract_tuple(data_i)
For array (not list of lists) input.
# Generate a binary classification dataset.
X, y = make_classification(n_samples=n_datapoints, n_features=n_features,
                           n_clusters_per_class=1, n_informative=n_informative,
                           n_redundant=n_redundant, random_state=RANDOM_STATE)
Are the results consistent across different random states?
It's not uncommon in examples to show results for an arbitrary random state.
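One informal way to check that, as a sketch (the dataset parameters here are placeholders, not the example's actual values):

```python
from sklearn.datasets import make_classification

for seed in (0, 1, 42, 123):
    X, y = make_classification(n_samples=1000, n_features=4,
                               n_informative=2, n_redundant=0,
                               n_clusters_per_class=1, random_state=seed)
    # fit the same pipeline for each seed and compare the resulting
    # scores / ROC curves before settling on one random_state
```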
RANDOM_STATE = 123

n_datapoints = 1000  # 500
Please delete the alternatives in comments.
ohe = OneHotEncoder()
X_one_hot_part = ohe.fit_transform(X[:, discretized_features])

# build the original matrix with back
I don't know what "with back" means
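If the intent is "build the original matrix back" by putting the untouched columns next to the encoded ones, a sketch could look like this (the column indices and values here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0, 1, 2.5],
              [1, 0, 3.5],
              [0, 1, 2.5]])
discretized_features = [0, 1]
continuous_features = [2]

ohe = OneHotEncoder()
X_one_hot_part = ohe.fit_transform(X[:, discretized_features])

# put the continuous columns back alongside the one-hot encoded block
X_rebuilt = np.hstack([X_one_hot_part.toarray(), X[:, continuous_features]])
print(X_rebuilt.shape)
```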
@@ -0,0 +1,122 @@
"""
This file needs to start with plot_ to be rendered appropriately.
I didn't review this yet, but this was supposed to be conditional on y.
@jnothman Thanks for the feedback. I will address the reviews and push fixes this Friday.
The removal_policy question can be sorted out later. Get the primary feature right, and show it off, first.
TODO: I still have to make some test cases where the counts are dependent on 'y' and also look into how to make my example do the preprocessing steps in a pipeline.
Let us know when this is ready for another review. Thanks!
I rebased my branch to resolve the merge conflict. I think I am going to close this and start a new PR.
Continued in #8144
Reference Issue
#5853
What does this implement/fix? Explain your changes.
It adds the CountFeaturizer transformer class, which can help achieve better accuracy by using how often a particular data row occurs as an additional feature.
Any other comments?
Currently a work in progress; please let me know if there is something I should add or anything I can do in a better or faster way!
There are currently no test cases and no formal changes to the documentation .rst yet, but I am planning on adding them later.