WIP CountFeaturizer for Categorical Data #7803

Closed
wants to merge 42 commits

Conversation

chenhe95
Contributor

@chenhe95 chenhe95 commented Nov 1, 2016

Reference Issue

#5853

What does this implement/fix? Explain your changes.

It adds the CountFeaturizer transformer class, which can help improve accuracy because it uses how often a particular data row occurs as a feature

Any other comments?

Currently work in progress, please let me know if there is something that I should add or if there is anything I can do in a better or faster way!

Currently there are no test cases and no formal changes to the .rst documentation either, but I am planning on adding them later.

@chenhe95 chenhe95 mentioned this pull request Nov 1, 2016
@chenhe95
Contributor Author

chenhe95 commented Nov 1, 2016

Next I will make transform() and fit() work just as the rest of the API expects of those two functions, and add more methods if necessary.
I also realize that things might be a bit slow, so after doing that, I will try to make it work with something like Cython and run some benchmarks to see how I can make it faster.

@amueller
Member

amueller commented Nov 2, 2016

WIP, I will make the documentation afterwards
"""

def fit(self, X, y=None, inclusion='all'):
Member

inclusion should be a parameter to __init__ and stored in the estimator.

self.inclusion = inclusion
return self

def transform(self):
Member

transform should also get "X" - which will be different data than during training.

@chenhe95
Copy link
Contributor Author

chenhe95 commented Nov 2, 2016

I see.
Note to self: fit(X_training_set) is supposed to store the count dict only, getting all the counts from X_training_set, while transform(X_test_set) is supposed to apply the counts obtained from X_training_set to the new X_test_set
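That fit/transform split can be sketched as follows (a toy illustration with hypothetical names, not the PR's actual code): fit() only memorizes training-row counts, and transform() appends those memorized counts to whatever data it is given.

```python
from collections import Counter

import numpy as np


class CountSketch:
    """Toy illustration: fit() counts rows seen in training,
    transform() appends those training counts as a new column."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Store the count of each distinct training row.
        self.count_cache_ = Counter(tuple(row) for row in X.tolist())
        return self

    def transform(self, X):
        X = np.asarray(X)
        # Look up training counts; rows never seen in training get 0.
        counts = [self.count_cache_.get(tuple(row), 0) for row in X.tolist()]
        return np.hstack([X, np.array(counts).reshape(-1, 1)])


X_train = [[0, 1], [0, 1], [2, 3]]
X_test = [[0, 1], [9, 9]]
X_out = CountSketch().fit(X_train).transform(X_test)
```

Note that the test row `[9, 9]` gets a count of 0 because the counts come from the training set only.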

if not _valid_data_type(X):
raise ValueError("Only supports lists / numpy arrays (transform)")
len_data = len(X)
if len_data > 0:
Member

did a test fail without that? You should probably call X = check_array(X) instead.

Contributor Author

I agree. It seems like a good idea to call check_array(X) instead of repeating the same checks multiple times.

Sets the value of count_cache which holds the counts of each data point
"""

if not _valid_data_type(X):
Member

I think doing X = check_array(X) would be better here.

@chenhe95
Contributor Author

chenhe95 commented Nov 3, 2016

I believe I should also add in a check to see if all of the rows X_i of the given X have the same len(X_i), because I suspect that is what is causing

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Python27\lib\site-packages\sklearn\utils\testing.py", line 830, in __call__
    return self.check(*args, **kwargs)
  File "C:\Python27\lib\site-packages\sklearn\utils\testing.py", line 355, in wrapper
    return fn(*args, **kwargs)
  File "C:\Python27\lib\site-packages\sklearn\utils\estimator_checks.py", line 535, in check_transformer_general
    _check_transformer(name, Transformer, X, y)
  File "C:\Python27\lib\site-packages\sklearn\utils\estimator_checks.py", line 628, in _check_transformer
    assert_raises(ValueError, transformer.transform, X.T)
AssertionError: ValueError not raised

Because in line 628 of utils.estimator_checks.py, we have

    # raises error on malformed input for transform
    if hasattr(X, 'T'):
        # If it's not an array, it does not have a 'T' property
        assert_raises(ValueError, transformer.transform, X.T)

@amueller Do you think I should add an extra parameter ensure_well_formed=False to utils.validation.check_array()? Or would it be better to handle it locally and just make a small helper function.
ensure_well_formed=True would throw an error if it detected that two rows X_i, X_j in X has len(X_i) != len(X_j)

Edit: The error came from the fact that when the fitted array has a different number of features than the transform array, no error was thrown.
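One lightweight way to raise that ValueError (a sketch with my own naming, not the PR's actual helper) is to record the training column count in fit and compare it in transform:

```python
import numpy as np


def check_same_width(X, n_features_expected):
    """Raise ValueError when X's column count differs from training.

    Hypothetical helper; in scikit-learn this role is typically played
    by check_array plus an explicit n_features comparison in transform().
    """
    X = np.asarray(X)
    if X.ndim != 2 or X.shape[1] != n_features_expected:
        raise ValueError("X has the wrong number of features, expected %d"
                         % n_features_expected)
    return X


ok = check_same_width([[1, 2], [3, 4]], 2)  # same width as in training: passes
raised = False
try:
    check_same_width([[1, 2, 3]], 2)        # transposed/malformed input
except ValueError:
    raised = True
```

This makes `transformer.transform(X.T)` fail with a ValueError, which is what the common estimator check expects.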

@jnothman
Member

jnothman commented Nov 3, 2016

This really deserves something in examples/ to motivate it.

@chenhe95
Contributor Author

chenhe95 commented Nov 3, 2016

@jnothman

I am currently looking into https://blogs.technet.microsoft.com/machinelearning/2015/11/03/using-azure-ml-to-build-clickthrough-prediction-models/
Where the conclusion was

The use of conditional counts (DRACuLa) features results in a compact representation of the high-cardinality categorical features present in the Criteo dataset. Moreover, we find that training times are about twice as fast after incorporating count features than without any count features. In addition, we find that the model performance is better on using the count-based DRACuLa features than without. This is because the count-based features provide a compact representation of the otherwise sparse high-dimensional categorical features.

I'll try and see if I can come up with some toy example that can reproduce those results in the benchmark and ROC curve.

def _valid_data_type(type_check):
"""
Defines the data types that are compatible with CountFeaturizer
Currently, only Python lists and numpy arrays are accepted
"""
return type(type_check) == np.ndarray or type(type_check) == list

@staticmethod
def _check_well_formed(X):
Contributor Author

I could later integrate this into utils.validation.check_array() if necessary.

@chenhe95
Contributor Author

chenhe95 commented Nov 3, 2016

It is failing the test case where the input to X contains non-numeric values and the test case where the input to X contains float('inf'), but I want to make it so that it's fine to have non-numeric or infinity in the input, since all CountFeaturizer does is add a new column to X.

I may add a new parameter that lets the user replace some features with the count feature instead of simply only appending the count feature to the end of X, since in the article, it was mentioned that the count feature was used as a substitute to OneHotEncoder where the number of possible values a particular categorical feature could take on is too high (such as strings).
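A hedged sketch of that replace-instead-of-append idea (the `drop_included` flag and function name are hypothetical, not the PR's API): the count column substitutes for the high-cardinality columns instead of sitting alongside them.

```python
from collections import Counter

import numpy as np


def count_featurize(X, inclusion, drop_included=True):
    """Append a count column over the included columns; optionally drop
    the included (high-cardinality) columns, as a OneHotEncoder substitute.

    Hypothetical sketch, not the PR's actual interface.
    """
    X = np.asarray(X)
    included = X[:, inclusion].tolist()
    counts = Counter(tuple(row) for row in included)
    count_col = np.array([counts[tuple(row)] for row in included])
    keep = [i for i in range(X.shape[1])
            if not (drop_included and i in inclusion)]
    return np.column_stack([X[:, keep], count_col])


X_demo = np.array([[1, 7], [1, 8], [2, 9]])
out = count_featurize(X_demo, inclusion=[0])  # column 0 replaced by its count
```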

@chenhe95
Contributor Author

chenhe95 commented Nov 22, 2016

@jnothman, @amueller Okay, I fixed all the flake8 and pep8 things. I am ready for the first review of code, tests, and example.

@amueller
Member

yeah the result of the example looks great now :)

@jnothman
Member

jnothman commented Dec 6, 2016

Please add to classes.rst

Member

@jnothman jnothman left a comment

Thanks. You need to conform more to our idiom as described in the contributors' guide and seen in our other implementations.

More broadly, I'm not entirely convinced of the general utility of this method and specification of interface. I see how making the counts conditional on y would make it much more powerful.


# X_count is X with an additional feature column 'count'
cf = CountFeaturizer(inclusion=discretized_features)
X_count = cf.fit_transform(X)
Member

I would much rather you execute this preprocessing in a pipeline. You are leaking information from the test set into the training set.
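The leakage-free pattern that a Pipeline enforces can be sketched without scikit-learn at all (an illustrative stand-in, not the PR's code): the counts are learned from the training rows only and merely looked up for held-out rows.

```python
from collections import Counter

import numpy as np

X = np.array([[0, 1], [0, 1], [0, 1], [2, 3], [2, 3], [4, 5]])
X_train, X_test = X[:4], X[4:]

# "fit" on the training split only; the test rows contribute nothing.
counts = Counter(tuple(row) for row in X_train.tolist())

# "transform" the test split: look up training counts, default to 0.
test_counts = [counts.get(tuple(row), 0) for row in X_test.tolist()]
```

Calling fit_transform on the full dataset before splitting would instead bake test-set frequencies into the feature, which is exactly the leakage being flagged here.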


class CountFeaturizer(BaseEstimator, TransformerMixin):

"""
Member

PEP257 says put a description here that fits in one line

Perhaps "Adds features representing each feature value's count in training"

Contributor Author

Sorry about these small PEP things. I'll fix them along with those accidentally added blank lines.

class CountFeaturizer(BaseEstimator, TransformerMixin):

"""
Adds in a new feature column containing the number of occurrences of
Member

I probably need to read any references (which should be listed in the docstring below), but I don't really get why this technique works. I have seen similar where a feature is added for each class (i.e. each value of y) counting the frequency of features given each class.

Without being conditional on y, your transformer seems to be bluntly measuring density, and so is suited only to problems where the classes can be differentiated by their density in particular features.


Parameters
----------
inclusion: set, list, numpy.ndarray, or string (only 'all'), default='all'
Member

need space before colon

I'd be interested in supporting a list of lists of features, and an easy way for each feature to be counted independently.

Parameters
----------
inclusion: set, list, numpy.ndarray, or string (only 'all'), default='all'
The inclusion criteria for counting
Member

Please check RST formatting

or type(type_check) == set

@staticmethod
def _check_well_formed(X):
Member

Is there a reason not to use our usual check_array? We usually expect input to be an array (or sparse matrix). Please read the contributors' guide.


def fit(self, X, y=None):
"""
Sets the value of count_cache which holds the counts of each data point
Member

Needs Parameters and Return sections

return cols_X + 1

def fit(self, X, y=None):
"""
Member

PEP257: summary goes here

# number of columns
raise ValueError("inclusion or removal_policy incompatible")
self.num_features = cols_X
self.count_cache = {}
Member

This is a model attribute and should end with _. Please read the contributors' guide.

perhaps use a collections.Counter()

Indeed, this could look something like count_cache = Counter(tuple(x) for x in X.take(inclusion, axis=1).tolist())
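Assuming X is a NumPy array and inclusion a list of column indices, the suggested one-liner behaves like this:

```python
from collections import Counter

import numpy as np

X = np.array([[1, 10, 0],
              [1, 20, 0],
              [1, 10, 1]])
inclusion = [0, 2]  # count combinations of columns 0 and 2 only

# The suggested one-liner: count each distinct tuple of included columns.
count_cache = Counter(tuple(x) for x in X.take(inclusion, axis=1).tolist())
```

`X.take(inclusion, axis=1)` selects the included columns, and the Counter replaces the manual loop over rows plus the hand-rolled dict.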

self.num_features = cols_X
self.count_cache = {}
for data_i in X:
data_i_tuple = self._extract_tuple(data_i)
Member

For array (not list of list) input

# Generate a binary classification dataset.
X, y = make_classification(n_samples=n_datapoints, n_features=n_features,
n_clusters_per_class=1, n_informative=n_informative,
n_redundant=n_redundant, random_state=RANDOM_STATE)
Member

Are the results consistent across different random states?

It's not uncommon in examples to show results for arbitrary random state.


RANDOM_STATE = 123

n_datapoints = 1000 # 500
Member

please delete the alternatives in comments.

ohe = OneHotEncoder()
X_one_hot_part = ohe.fit_transform(X[:, discretized_features])

# build the original matrix with back
Member

I don't know what "with back" means

@@ -0,0 +1,122 @@
"""
Member

this file needs to start with plot_ to be rendered appropriately.

@amueller
Member

amueller commented Dec 7, 2016

I didn't review this yet, but this was supposed to be conditional on y.

@chenhe95
Contributor Author

chenhe95 commented Dec 7, 2016

@jnothman Thanks for the feedback. I will address it and push fixes for the reviews this Friday.
I also agree that it should have been conditional on the value of y; that was an oversight on my end. I re-read the Microsoft report on count featurizing (DRACuLa), and that is what they did.
Changing it to be conditional on y should be an easy fix, I believe: I only have to change the dictionary key to include y.
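A minimal sketch of that dictionary-key change (illustrative names, not the PR's eventual API): keying the counter on (row, label) makes the counts conditional on y, yielding one count column per class.

```python
from collections import Counter

import numpy as np

X = np.array([[0, 1], [0, 1], [0, 1], [2, 3]])
y = np.array([1, 1, 0, 0])

# Key the counts on (row, label) so the feature is conditional on y.
cond_counts = Counter((tuple(row), label)
                      for row, label in zip(X.tolist(), y.tolist()))

# At transform time each row gets one count column per class.
classes = sorted(set(y.tolist()))
count_features = np.array(
    [[cond_counts.get((tuple(row), c), 0) for c in classes]
     for row in X.tolist()])
```

Here the row `[0, 1]` occurs twice with y=1 and once with y=0, so its count columns differ per class, which is what makes the feature discriminative rather than a blunt density measure.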

@jnothman
Member

jnothman commented Dec 8, 2016

The removal_policy question can be sorted out later. Get the primary feature right, and show it off, first.

@chenhe95
Contributor Author

TODO: I still have to make some test cases where the counts are dependent on 'y' and also look into how to make my example do the preprocessing steps in a pipeline.

@jnothman
Member

Let us know when this is ready for another review. Thanks!

@chenhe95
Contributor Author

chenhe95 commented Jan 2, 2017

I rebased my branch to resolve the merge conflict. I think I am going to close this and start a new PR.

@chenhe95
Contributor Author

chenhe95 commented Jan 2, 2017

Continued in #8144

@amueller amueller added Superseded (PR has been replaced by a newer PR) and removed Waiting for Reviewer labels Aug 6, 2019