
WIP CountFeaturizer for Categorical Data #8144


Closed
wants to merge 65 commits

Conversation

Contributor

@chenhe95 chenhe95 commented Jan 2, 2017

Reference Issue

#5853

What does this implement/fix? Explain your changes.

It adds the CountFeaturizer transformation class, which can help improve accuracy by using how often a particular data row occurs as a feature

Any other comments?

This is currently a work in progress; please let me know if there is anything I should add, or anything I can do in a better or faster way!

Also a continuation of #7803


@jnothman jnothman left a comment


Regarding inclusion, does this make it clear? Sorry if there are mistakes as I've done this by hand:

>>> import numpy as np
>>> D = np.array([[0, 0, 0],
...               [0, 1, 1],
...               [0, 0, 1],
...               [0, 1, 1],
...               [1, 0, 0],
...               [1, 1, 0],
...               [1, 0, 0],
...               [1, 2, 0]])
>>> X, y = D[:, :2], D[:, 2]

>>> CountFeaturizer(inclusion=[[0]]).fit_transform(X, y)
array([[1, 3],
       [1, 3],
       [1, 3],
       [1, 3],
       [4, 0],
       [4, 0],
       [4, 0],
       [4, 0]])

>>> Xt = CountFeaturizer(inclusion=[[0], [1]]).fit_transform(X, y)
>>> Xt
array([[1, 3, 3, 1],
       [1, 3, 1, 2],
       [1, 3, 3, 1],
       [1, 3, 1, 2],
       [4, 0, 3, 1],
       [4, 0, 1, 2],
       [4, 0, 3, 1],
       [4, 0, 1, 0]])
>>> np.all(Xt == CountFeaturizer(inclusion='each').fit_transform(X, y))
True

>>> Xt = CountFeaturizer(inclusion='altogether').fit_transform(X, y)
>>> Xt
array([[1, 1],
       [0, 2],
       [1, 1],
       [0, 2],
       [2, 0],
       [1, 0],
       [2, 0],
       [1, 0]])
>>> np.all(Xt == CountFeaturizer(inclusion=[[0, 1]]).fit_transform(X, y))
True

>>> CountFeaturizer(inclusion=[[0], [1], [0, 1]]).fit_transform(X, y)
array([[1, 3, 3, 1, 1, 1],
       [1, 3, 1, 2, 0, 2],
       [1, 3, 3, 1, 1, 1],
       [1, 3, 1, 2, 0, 2],
       [4, 0, 3, 1, 2, 0],
       [4, 0, 1, 2, 1, 0],
       [4, 0, 3, 1, 2, 0],
       [4, 0, 1, 0, 1, 0]])

The counts of each example learned during 'fit'

y_set_ : list of (index, y) tuples
An enumerated set of all unique values y can have
Member

I think this is what is called classes_ in most classifiers

The number of columns of 'X' learned during 'fit'

col_num_Y_ : int
The number of clumns of 'y' learned during 'fit'
Member

*columns

Contributor Author

(Oops, sorry about that)

An enumerated set of all unique values y can have

col_num_X_ : int
The number of columns of 'X' learned during 'fit'
Member

Not sure what this means

Contributor Author

Essentially it's the number of columns of X that is put into the fit() function.
Then, when transform is called on another X2, we verify that X2 has the same number of columns as X; if the numbers of columns differ, an error is thrown.
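
A minimal sketch of that check (the helper name is hypothetical; col_num_X_ follows the docstring quoted above):

def _check_n_columns(self, X2):
    # reject transform-time input whose column count differs from fit-time X
    if len(X2[0]) != self.col_num_X_:
        raise ValueError("X has %d columns, expected %d"
                         % (len(X2[0]), self.col_num_X_))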

self.y_set_ = [list(enumerate(sorted(ys))) for ys in self.y_set_]
# freeze the dicts for pickling
self.count_cache_.default_factory = None
for cc_inner_dict in self.count_cache_.values():
Member

Since you're always accessing all layers of the defaultdict's nesting, why not just use a tuple of (X, output_idx, y[output_idx]) as a key to a single Counter?
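
For illustration, a flat-key version might look like this (a sketch only, assuming X is a 2D numpy array, y a 1D array of labels, and inclusion_used / inclusion_i as in the hunk above):

from collections import Counter

counts = Counter()
for i in range(len(X)):
    # one flat key instead of several layers of nested dicts
    X_key = tuple(X[i].take(inclusion_used[inclusion_i]).tolist())
    counts[(X_key, inclusion_i, y[i])] += 1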

@jnothman

jnothman commented Jan 25, 2017 via email


def _get_count_dict_0():
    """Gets the innermost count dictionary."""
    return defaultdict(int)
Member

You might want to use a Counter here. A difference which may be valuable is that looking up a key in a defaultdict sets that key. Looking it up in a Counter will return 0 but not expand the dict.

For similar reasons you should not use a defaultdict with an active default_factory at transform time.
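
A quick demonstration of that difference:

from collections import Counter, defaultdict

d = defaultdict(int)
_ = d["unseen"]   # the lookup inserts the key with value 0
print(len(d))     # 1

c = Counter()
_ = c["unseen"]   # returns 0 without storing anything
print(len(c))     # 0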

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you should not use a defaultdict with an active default_factory when the keys are likely to be sparsely found.

If I were you, I would benchmark having a prefabricated dict, or an array of counts, for all possible classes. Then just copy as needed, and at test time don't copy at all. That way for X values that are seen, you will get correctly-sized dicts from the outset. A bit more memory for X values that are exclusive to some values of y, but perhaps a valuable saving in time. I'm not certain, but I think there's some potential for experimenting with the correct data structure here.

Maybe we should consider producing a working estimator but keeping these count dicts private until we're happy with the data structure.
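
A rough sketch of the prefabricated-counts idea (hypothetical names and toy data, not code from this PR):

import numpy as np

n_classes = 2
X_keys = [(0,), (0,), (1,)]   # toy X row keys
y_indices = [0, 1, 0]         # toy class indices

template = np.zeros(n_classes, dtype=int)   # prefabricated array of counts
counts = {}                                 # X key -> per-class count array

for X_key, y_idx in zip(X_keys, y_indices):
    if X_key not in counts:
        counts[X_key] = template.copy()     # correctly sized from the outset
    counts[X_key][y_idx] += 1

# at transform time, read without copying; unseen keys fall back to zeros
row = counts.get((2,), template)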


@chenhe95 chenhe95 Jan 26, 2017


Yes, I agree; we should experiment with data structures to find the right one.
Currently I am leaning towards just making my own "default" dict that checks if key in my_dict: every time.
I did a benchmark on this, testing the time to build counts when no key exists, when the key already exists, and the time to look up keys in the dict:

import time
from collections import defaultdict
from collections import Counter

loop_count = 10000000

# plain dict with explicit membership checks
t = time.time()
a = {}
for i in range(loop_count):
    if i in a:
        a[i] += i
    else:
        a[i] = i

for i in range(loop_count // 2):
    if i in a:
        a[i] += i
    else:
        a[i] = i

for i in range(loop_count * 2):
    if i in a:
        x = a[i] + 5
    else:
        x = 5

print("time taken for {}:", time.time() - t)

# defaultdict(int)
t = time.time()
b = defaultdict(int)
for i in range(loop_count):
    b[i] += i

for i in range(loop_count // 2):
    b[i] += i

for i in range(loop_count * 2):
    x = b[i] + 5

print("time taken for defaultdict:", time.time() - t)

# Counter built by addition
t = time.time()
c = Counter(range(loop_count)) + Counter(range(loop_count // 2))

for i in range(loop_count * 2):
    x = c[i] + 5

print("time taken for counter:", time.time() - t)

# Counter built incrementally
t = time.time()
c2 = Counter()
for i in range(loop_count):
    c2[i] += i

for i in range(loop_count // 2):
    c2[i] += i

for i in range(loop_count * 2):
    x = c2[i] + 5

print("time taken for counter version 2:", time.time() - t)

And the results were kind of surprising:

time taken for {}: 7.626983881
time taken for defaultdict: 9.94141697884
time taken for counter: 58.303508997
time taken for counter version 2: 15.7543189526


@jnothman jnothman left a comment


a partial review

- 'each' : Each feature will have its own set of counts
- 1D list of indices : Only the given list of features is
counted
- 2D list of indices : The given list of lists of features is counted,
Member

You mean list of lists, not 2d. Needn't be rectangular array.

Contributor Author

Changed to lists of lists:
- 'all' (default) : Every feature is concatenated and counted
- 'each' : Each feature will have its own set of counts
- list of indices : Only the given list of features is
                    concatenated and counted
- list of lists of indices : The given list of lists of features is
                             concatenated and counted, but each list in the
                             list of lists has its own set of counts

- 'all' (default) : Every feature given is counted
- 'each' : Each feature will have its own set of counts
- 1D list of indices : Only the given list of features is
counted
Member

hanging indent, please

Contributor Author

Handled

- 1D list of indices : Only the given list of features is
counted
- 2D list of indices : The given list of lists of features is counted,
but each list in the list of lists have its own set of counts
Member

hanging indent, please

Contributor Author

Handled


- 'all' (default) : Every feature given is counted
- 'each' : Each feature will have its own set of counts
- 1D list of indices : Only the given list of features is
Member

not clear that these are counted together, i.e. these features are concatenated and counted.


@chenhe95 chenhe95 Mar 5, 2017


Changed to

    - 1D list of indices : Only the given list of features is
                           concatenated and counted
    - 2D list of indices : The given list of lists of features is
                           concatenated and counted, but each list in the
                           list of lists has its own set of counts

inclusion : 'all', 'each', list, or numpy.ndarray
The inclusion criteria for counting

- 'all' (default) : Every feature given is counted
Member

not clear that these are counted together, i.e. these features are concatenated and counted.

Contributor Author

Changed to "- 'all' (default) : Every feature is concatenated and counted"

for i in range(len_data):
    for j in range(self.col_num_Y_):
        X_key = tuple(X[i].take(inclusion_used[inclusion_i]))
        if len(y.shape) == 1:
Member

why don't you just reshape y at the top?

Contributor Author

Now reshapes y at the top if it is 1D

if y is not None:
    for i in range(len_data):
        for j in range(self.col_num_Y_):
            X_key = tuple(X[i].take(inclusion_used[inclusion_i]))
Member

do this out of the j loop

Member

Also, tuple(my_array) is slower than tuple(my_array.tolist())
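
A quick check of that claim (timings vary by machine and array size):

import numpy as np
from timeit import timeit

row = np.arange(10)
print(timeit(lambda: tuple(row), number=100000))           # iterates numpy scalars
print(timeit(lambda: tuple(row.tolist()), number=100000))  # one C-level conversion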

len_data = len(X)
self.col_num_X_ = len(X[0])
self.count_cache_ = _get_count_dict_3()
self.classes_ = [set() for i in range(self.col_num_Y_)]
Member

make this a local variable if you're going to overwrite it below. but you could just consider using something like LabelEncoder.


@chenhe95 chenhe95 Mar 6, 2017


Changed to classes_unsorted.
I am unsure about using LabelEncoder because I wouldn't make good use of everything in it, and there would be additional overhead from computing things I won't be using.

inclusion_used = self._get_inclusion_used()

for inclusion_i in range(len(inclusion_used)):
    if y is not None:
Member

why don't you just set y to 0s when it's None

Contributor Author

Changed to np.zeros(len(X)).
I was unsure about it before because it creates len(X) additional elements in memory that don't really have much use, but it certainly did make the code a lot cleaner.
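
As a sketch, the substitution amounts to something like this (hypothetical helper name):

import numpy as np

def _ensure_y(X, y):
    # a single dummy "class" of zeros lets the counting code treat the
    # unsupervised case exactly like the supervised one
    return np.zeros(len(X)) if y is None else y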

for inclusion_i in range(len(inclusion_used)):
    col_offset_y = 0
    col_offset_inclusion = inclusion_i * len_classes
    for j in range(self.col_num_Y_):
Member

If you have arrays instead of single counts in your cache you don't need this loop.

@jnothman

jnothman commented Jan 26, 2017 via email

@chenhe95

chenhe95 commented Feb 1, 2017

Small update: I have been busy with school work lately, but I am still working on this.

@jnothman

jnothman commented Feb 1, 2017 via email

@chenhe95

chenhe95 commented Mar 4, 2017

Currently working on making the _count_cache a 2D array at the very last layer.
It was kind of pointless to use a dict when every element in the [y_col, inclusion_index] matrix is dense, with every element populated.
So accesses will look like this:

_count_cache[X][y_key][y_col, inclusion_index]

Also planning to implement the nested counter along these lines:

import functools
from collections import defaultdict
import numpy as np

def _get_nested_counter(remaining, y_dim, inclusion_size):
    """A nested dictionary with 'remaining' layers and a 2D array at the end."""
    if remaining == 1:
        return np.zeros((y_dim, inclusion_size))
    return defaultdict(functools.partial(
        _get_nested_counter, remaining - 1, y_dim, inclusion_size))
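
For example, with two dict layers ending in a count matrix (matching the access pattern above):

# two dict layers (X key, then y key) ending in a y_dim x inclusion_size array
cache = _get_nested_counter(3, y_dim=2, inclusion_size=2)
cache[(0, 1)][(1,)][0, 1] += 1   # i.e. _count_cache[X][y_key][y_col, inclusion_index]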

@codecov

codecov bot commented Mar 4, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@288827b).
The diff coverage is 98.63%.

@@            Coverage Diff            @@
##             master    #8144   +/-   ##
=========================================
  Coverage          ?   95.48%           
=========================================
  Files             ?      342           
  Lines             ?    61059           
  Branches          ?        0           
=========================================
  Hits              ?    58304           
  Misses            ?     2755           
  Partials          ?        0
Impacted Files                              Coverage Δ
sklearn/preprocessing/tests/test_data.py   99.9% <100%> (ø)
sklearn/preprocessing/data.py               99.03% <97.61%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 288827b...34bd821.

@jnothman

jnothman commented Mar 5, 2017

Are you seeking another review, then? Are other issues addressed?

@chenhe95

chenhe95 commented Mar 5, 2017

I have only addressed 2-3 of the issues mentioned so far.
I'll work on getting the other things handled too.
One thing I am curious about: is there a way to see the code coverage of only a single file? The contributor guide only mentions how to get the code coverage of the entire project.

@chenhe95

chenhe95 commented Mar 6, 2017

I have addressed all of the issues except for the one in transform:

@chenhe95

chenhe95 commented Mar 6, 2017

@jnothman
I feel like the loop is necessary. I've considered doing it in the format of
transformed[section of transformed] = count_cache[section of count cache]
instead of a for loop where I assign everything one by one.
But I can't really come up with a way of doing that without making y a dense array, like count_cache[X][y, j, inclusion], or in general without increasing the memory requirement significantly.
Maybe a list comprehension might work, but I feel like it would be less readable, and the loop would still technically be there.
What are your thoughts on this?

@chenhe95

chenhe95 commented Jun 7, 2017

Any updates on this? 😃

@amueller

amueller commented Jun 7, 2017

Hopefully someone at the sprint can find some time ;)

@chenhe95

chenhe95 commented Jun 7, 2017

Sounds good!

@amueller

Sorry for the long stall. I'll review this week (hopefully). Can you please fix the merge conflicts? Thanks!

@amueller

amueller commented Aug 3, 2017

https://youtu.be/-n7qZAdWFL0?t=1074 argues (I think) that this should be done with cross-validation to avoid overfitting.
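
For concreteness, a hedged sketch of what out-of-fold counting could look like (not code from this PR; the helper and the per-category target sums are illustrative, and x_col is assumed to be a 1D numpy array):

import numpy as np
from sklearn.model_selection import KFold

def out_of_fold_counts(x_col, y, n_splits=5):
    # each row's count feature is computed from the other folds only,
    # so it never sees that row's own target value
    out = np.zeros(len(x_col))
    for train_idx, test_idx in KFold(n_splits=n_splits).split(x_col):
        counts = {}
        for x, t in zip(x_col[train_idx], y[train_idx]):
            counts[x] = counts.get(x, 0) + t
        out[test_idx] = [counts.get(x, 0) for x in x_col[test_idx]]
    return out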

@amueller

amueller commented Aug 3, 2017

I think this should be refactored assuming we have ColumnTransformer so that it'll transform all input columns.
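
For illustration only, the composition might eventually look something like this (ColumnTransformer later landed in sklearn.compose; CountFeaturizer is this PR's proposed class, and the column indices are hypothetical):

from sklearn.compose import ColumnTransformer

# count-featurize only the categorical columns, pass the rest through
ct = ColumnTransformer(
    [("counts", CountFeaturizer(), [0, 1])],
    remainder="passthrough")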

@amueller

amueller commented Aug 4, 2017

I just realized that using any form of cross-validation would raise the same issues as in stacking (#8960), which mean that fit(X).transform(X) is quite different from fit_transform(X)

@jnothman

jnothman commented Aug 6, 2017 via email

@chenhe95

Okay, closing pull request now to fix merge conflicts and open up new pull request later. Will add link.

@chenhe95

Added a new pull request with the code built on top of the current code.

@amueller amueller added the Superseded (PR has been replaced by a newer PR) label Aug 6, 2019