WIP CountFeaturizer for Categorical Data #7803

Closed
wants to merge 42 commits

Conversation

chenhe95
Contributor

@chenhe95 chenhe95 commented Nov 1, 2016

Reference Issue

#5853

What does this implement/fix? Explain your changes.

It adds the CountFeaturizer transformer class, which can help improve accuracy because it uses how often a particular data row occurs as a feature

Any other comments?

Currently work in progress, please let me know if there is something that I should add or if there is anything I can do in a better or faster way!

Currently there are no test cases and no formal changes to the .rst documentation either, but I am planning on adding them later.

@chenhe95 chenhe95 mentioned this pull request Nov 1, 2016
@chenhe95
Contributor Author

chenhe95 commented Nov 1, 2016

Next I will make transform() and fit() work just as the rest of the API expects of those two functions, and add more methods if necessary.
I also realize that things might be a bit slow, so after doing that, I will try to make it work with something like Cython and run some benchmarks to see how I can make it faster.

@amueller
Member

amueller commented Nov 2, 2016

WIP, I will make the documentation afterwards
"""

def fit(self, X, y=None, inclusion='all'):
Member

inclusion should be a parameter to __init__ and stored in the estimator.

self.inclusion = inclusion
return self

def transform(self):
Member

transform should also get "X" - which will be different data than during training.

@chenhe95
Copy link
Contributor Author

chenhe95 commented Nov 2, 2016

I see.
Note to self: fit(X_training_set) is supposed to store the count dict only, getting all the counts from X_training_set, while transform(X_test_set) is supposed to apply the counts obtained from X_training_set to the new X_test_set
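That fit/transform split can be sketched as follows (a toy illustration with hypothetical names, not the PR's actual code): fit() only memorizes training-row counts, and transform() appends those memorized counts to whatever data it is given.

```python
from collections import Counter

import numpy as np


class CountSketch:
    """Toy illustration: fit() counts rows seen in training,
    transform() appends those training counts as a new column."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Store the count of each distinct training row.
        self.count_cache_ = Counter(tuple(row) for row in X.tolist())
        return self

    def transform(self, X):
        X = np.asarray(X)
        # Look up training counts; rows never seen in training get 0.
        counts = [self.count_cache_.get(tuple(row), 0) for row in X.tolist()]
        return np.hstack([X, np.array(counts).reshape(-1, 1)])


X_train = [[0, 1], [0, 1], [2, 3]]
X_test = [[0, 1], [9, 9]]
X_out = CountSketch().fit(X_train).transform(X_test)
```

Note that the test row `[9, 9]` gets a count of 0 because the counts come from the training set only.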

if not _valid_data_type(X):
raise ValueError("Only supports lists / numpy arrays (transform)")
len_data = len(X)
if len_data > 0:
Member

did a test fail without that? You should probably call X = check_array(X) instead.

Contributor Author

I agree. It seems like a good idea to call check_array(X) instead of repeating the same checks multiple times.

Sets the value of count_cache which holds the counts of each data point
"""

if not _valid_data_type(X):
Member

I think doing X = check_array(X) would be better here.

@chenhe95
Contributor Author

chenhe95 commented Nov 3, 2016

I believe I should also add in a check to see if all of the rows X_i of the given X have the same len(X_i), because I suspect that is what is causing

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Python27\lib\site-packages\sklearn\utils\testing.py", line 830, in __call__
    return self.check(*args, **kwargs)
  File "C:\Python27\lib\site-packages\sklearn\utils\testing.py", line 355, in wrapper
    return fn(*args, **kwargs)
  File "C:\Python27\lib\site-packages\sklearn\utils\estimator_checks.py", line 535, in check_transformer_general
    _check_transformer(name, Transformer, X, y)
  File "C:\Python27\lib\site-packages\sklearn\utils\estimator_checks.py", line 628, in _check_transformer
    assert_raises(ValueError, transformer.transform, X.T)
AssertionError: ValueError not raised

Because in line 628 of utils.estimator_checks.py, we have

    # raises error on malformed input for transform
    if hasattr(X, 'T'):
        # If it's not an array, it does not have a 'T' property
        assert_raises(ValueError, transformer.transform, X.T)

@amueller Do you think I should add an extra parameter ensure_well_formed=False to utils.validation.check_array()? Or would it be better to handle it locally and just make a small helper function.
ensure_well_formed=True would throw an error if it detected that two rows X_i, X_j in X has len(X_i) != len(X_j)

Edit: The error came from the fact that when the fitted array has a different number of features than the transform array, no error was thrown.
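One lightweight way to raise that ValueError (a sketch with my own naming, not the PR's actual helper) is to record the training column count in fit and compare it in transform:

```python
import numpy as np


def check_same_width(X, n_features_expected):
    """Raise ValueError when X's column count differs from training.

    Hypothetical helper; in scikit-learn this role is typically played
    by check_array plus an explicit n_features comparison in transform().
    """
    X = np.asarray(X)
    if X.ndim != 2 or X.shape[1] != n_features_expected:
        raise ValueError("X has the wrong number of features, expected %d"
                         % n_features_expected)
    return X


ok = check_same_width([[1, 2], [3, 4]], 2)  # same width as in training: passes
raised = False
try:
    check_same_width([[1, 2, 3]], 2)        # transposed/malformed input
except ValueError:
    raised = True
```

This makes `transformer.transform(X.T)` fail with a ValueError, which is what the common estimator check expects.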

@jnothman
Member

jnothman commented Nov 3, 2016

This really deserves something in examples/ to motivate it.

@chenhe95
Contributor Author

chenhe95 commented Nov 3, 2016

@jnothman

I am currently looking into https://blogs.technet.microsoft.com/machinelearning/2015/11/03/using-azure-ml-to-build-clickthrough-prediction-models/
Where the conclusion was

The use of conditional counts (DRACuLa) features results in a compact representation of the high-cardinality categorical features present in the Criteo dataset. Moreover, we find that training times are about twice as fast after incorporating count features than without any count features. In addition, we find that the model performance is better on using the count-based DRACuLa features than without. This is because the count-based features provide a compact representation of the otherwise sparse high-dimensional categorical features.

I'll try and see if I can come up with some toy example that can reproduce those results in the benchmark and ROC curve.

def _valid_data_type(type_check):
"""
Defines the data types that are compatible with CountFeaturizer
Currently, only Python lists and numpy arrays are accepted
"""
return type(type_check) == np.ndarray or type(type_check) == list

@staticmethod
def _check_well_formed(X):
Contributor Author

I could later integrate this into utils.validation.check_array() if necessary.

@chenhe95
Contributor Author

chenhe95 commented Nov 3, 2016

It is failing the test case where the input to X contains non-numeric values and the test case where the input to X contains float('inf'), but I want to make it so that it's fine to have non-numeric or infinity in the input, since all CountFeaturizer does is add a new column to X.

I may add a new parameter that lets the user replace some features with the count feature instead of simply only appending the count feature to the end of X, since in the article, it was mentioned that the count feature was used as a substitute to OneHotEncoder where the number of possible values a particular categorical feature could take on is too high (such as strings).
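A hedged sketch of that replace-instead-of-append idea (the `drop_included` flag and function name are hypothetical, not the PR's API): the count column substitutes for the high-cardinality columns instead of sitting alongside them.

```python
from collections import Counter

import numpy as np


def count_featurize(X, inclusion, drop_included=True):
    """Append a count column over the included columns; optionally drop
    the included (high-cardinality) columns, as a OneHotEncoder substitute.

    Hypothetical sketch, not the PR's actual interface.
    """
    X = np.asarray(X)
    included = X[:, inclusion].tolist()
    counts = Counter(tuple(row) for row in included)
    count_col = np.array([counts[tuple(row)] for row in included])
    keep = [i for i in range(X.shape[1])
            if not (drop_included and i in inclusion)]
    return np.column_stack([X[:, keep], count_col])


X_demo = np.array([[1, 7], [1, 8], [2, 9]])
out = count_featurize(X_demo, inclusion=[0])  # column 0 replaced by its count
```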

@chenhe95
Contributor Author

chenhe95 commented Nov 22, 2016

@jnothman, @amueller Okay, I fixed all the flake8 and pep8 things. I am ready for the first review of code, tests, and example.

@amueller
Member

yeah the result of the example looks great now :)

@jnothman
Member

jnothman commented Dec 6, 2016

Please add to classes.rst

Member

@jnothman jnothman left a comment

Thanks. You need to conform more to our idiom as described in the contributors' guide and seen in our other implementations.

More broadly, I'm not entirely convinced of the general utility of this method and specification of interface. I see how making the counts conditional on y would make it much more powerful.


# X_count is X with an additional feature column 'count'
cf = CountFeaturizer(inclusion=discretized_features)
X_count = cf.fit_transform(X)
Member

I would much rather you execute this preprocessing in a pipeline. You are leaking information from the test set into the training set.
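The leakage-free pattern that a Pipeline enforces can be sketched without scikit-learn at all (an illustrative stand-in, not the PR's code): the counts are learned from the training rows only and merely looked up for held-out rows.

```python
from collections import Counter

import numpy as np

X = np.array([[0, 1], [0, 1], [0, 1], [2, 3], [2, 3], [4, 5]])
X_train, X_test = X[:4], X[4:]

# "fit" on the training split only; the test rows contribute nothing.
counts = Counter(tuple(row) for row in X_train.tolist())

# "transform" the test split: look up training counts, default to 0.
test_counts = [counts.get(tuple(row), 0) for row in X_test.tolist()]
```

Calling fit_transform on the full dataset before splitting would instead bake test-set frequencies into the feature, which is exactly the leakage being flagged here.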


class CountFeaturizer(BaseEstimator, TransformerMixin):

"""
Member

PEP257 says put a description here that fits in one line

Perhaps "Adds features representing each feature value's count in training"

Contributor Author

Sorry about these small PEP things. I'll fix them along with those accidentally added blank lines.

class CountFeaturizer(BaseEstimator, TransformerMixin):

"""
Adds in a new feature column containing the number of occurrences of
Member

I probably need to read any references (which should be listed in the docstring below), but I don't really get why this technique works. I have seen similar where a feature is added for each class (i.e. each value of y) counting the frequency of features given each class.

Without being conditional on y, your transformer seems to be bluntly measuring density, and so is suited only to problems where the classes can be differentiated by their density in particular features.


Parameters
----------
inclusion: set, list, numpy.ndarray, or string (only 'all'), default='all'
Member

need space before colon

I'd be interested in supporting a list of lists of features, and an easy way for each feature to be counted independently.

Parameters
----------
inclusion: set, list, numpy.ndarray, or string (only 'all'), default='all'
The inclusion criteria for counting
Member

Please check RST formatting

or type(type_check) == set

@staticmethod
def _check_well_formed(X):
Member

Is there a reason not to use our usual check_array? We usually expect input to be an array (or sparse matrix). Please read the contributors' guide.


def fit(self, X, y=None):
"""
Sets the value of count_cache which holds the counts of each data point
Member

Needs Parameters and Return sections

return cols_X + 1

def fit(self, X, y=None):
"""
Member

PEP257: summary goes here

# number of columns
raise ValueError("inclusion or removal_policy incompatible")
self.num_features = cols_X
self.count_cache = {}
Member

This is a model attribute and should end with _. Please read the contributors' guide.

perhaps use a collections.Counter()

Indeed, this could look something like count_cache = Counter(tuple(x) for x in X.take(inclusion, axis=1).tolist())
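Assuming X is a NumPy array and inclusion a list of column indices, the suggested one-liner behaves like this:

```python
from collections import Counter

import numpy as np

X = np.array([[1, 10, 0],
              [1, 20, 0],
              [1, 10, 1]])
inclusion = [0, 2]  # count combinations of columns 0 and 2 only

# The suggested one-liner: count each distinct tuple of included columns.
count_cache = Counter(tuple(x) for x in X.take(inclusion, axis=1).tolist())
```

`X.take(inclusion, axis=1)` selects the included columns, and the Counter replaces the manual loop over rows plus the hand-rolled dict.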

self.num_features = cols_X
self.count_cache = {}
for data_i in X:
data_i_tuple = self._extract_tuple(data_i)
Member

For array (not list of list) input

# Generate a binary classification dataset.
X, y = make_classification(n_samples=n_datapoints, n_features=n_features,
n_clusters_per_class=1, n_informative=n_informative,
n_redundant=n_redundant, random_state=RANDOM_STATE)
Member

Are the results consistent across different random states?

It's not uncommon in examples to show results for arbitrary random state.


RANDOM_STATE = 123

n_datapoints = 1000 # 500
Member

please delete the alternatives in comments.

ohe = OneHotEncoder()
X_one_hot_part = ohe.fit_transform(X[:, discretized_features])

# build the original matrix with back
Member

I don't know what "with back" means

@@ -0,0 +1,122 @@
"""
Member

this file needs to start with plot_ to be rendered appropriately.

@amueller
Member

amueller commented Dec 7, 2016

I didn't review this yet, but this was supposed to be conditional on y.

@chenhe95
Contributor Author

chenhe95 commented Dec 7, 2016

@jnothman Thanks for the feedback. I will address it and push fixes for the reviews this Friday.
I also agree that it should have been conditional on the value of y; that was an oversight on my end. I re-read the Microsoft report on count featurizing (DRACuLa), and that is what they did.
Changing it to be conditional on y should be an easy fix, I believe: I only have to change the dictionary key to include y.
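A minimal sketch of that dictionary-key change (illustrative names, not the PR's eventual API): keying the counter on (row, label) makes the counts conditional on y, yielding one count column per class.

```python
from collections import Counter

import numpy as np

X = np.array([[0, 1], [0, 1], [0, 1], [2, 3]])
y = np.array([1, 1, 0, 0])

# Key the counts on (row, label) so the feature is conditional on y.
cond_counts = Counter((tuple(row), label)
                      for row, label in zip(X.tolist(), y.tolist()))

# At transform time each row gets one count column per class.
classes = sorted(set(y.tolist()))
count_features = np.array(
    [[cond_counts.get((tuple(row), c), 0) for c in classes]
     for row in X.tolist()])
```

Here the row `[0, 1]` occurs twice with y=1 and once with y=0, so its count columns differ per class, which is what makes the feature discriminative rather than a blunt density measure.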

@jnothman
Member

jnothman commented Dec 8, 2016

The removal_policy question can be sorted out later. Get the primary feature right, and show it off, first.

@chenhe95
Contributor Author

TODO: I still have to make some test cases where the counts are dependent on 'y' and also look into how to make my example do the preprocessing steps in a pipeline.

@jnothman
Member

Let us know when this is ready for another review. Thanks!

@chenhe95
Contributor Author

chenhe95 commented Jan 2, 2017

I rebased my branch to resolve the merge conflict. I think I am going to close this and start a new PR.

@chenhe95
Contributor Author

chenhe95 commented Jan 2, 2017

Continued in #8144

@amueller amueller added Superseded (PR has been replaced by a newer PR) and removed Waiting for Reviewer labels Aug 6, 2019