
WIP CountFeaturizer for Categorical Data #8144


Closed
wants to merge 65 commits

Conversation

Contributor

@chenhe95 chenhe95 commented Jan 2, 2017

Reference Issue

#5853

What does this implement/fix? Explain your changes.

It adds the CountFeaturizer transformation class, which can help improve accuracy by using how often a particular data row occurs as a feature

Any other comments?

This is currently a work in progress; please let me know if there is anything I should add, or anything I can do in a better or faster way!

Also a continuation of #7803


@jnothman jnothman left a comment


Regarding inclusion, does this make it clear? Sorry if there are mistakes as I've done this by hand:

>>> import numpy as np
>>> D = np.array([[0, 0, 0],
...               [0, 1, 1],
...               [0, 0, 1],
...               [0, 1, 1],
...               [1, 0, 0],
...               [1, 1, 0],
...               [1, 0, 0],
...               [1, 2, 0]])
>>> X, y = D[:, :2], D[:, 2]

>>> CountFeaturizer(inclusion=[[0]]).fit_transform(X, y)
array([[1, 3],
       [1, 3],
       [1, 3],
       [1, 3],
       [4, 0],
       [4, 0],
       [4, 0],
       [4, 0]])

>>> Xt = CountFeaturizer(inclusion=[[0], [1]]).fit_transform(X, y)
>>> Xt
array([[1, 3, 3, 1],
       [1, 3, 1, 2],
       [1, 3, 3, 1],
       [1, 3, 1, 2],
       [4, 0, 3, 1],
       [4, 0, 1, 2],
       [4, 0, 3, 1],
       [4, 0, 1, 0]])
>>> np.all(Xt == CountFeaturizer(inclusion='each').fit_transform(X, y))
True

>>> Xt = CountFeaturizer(inclusion='altogether').fit_transform(X, y)
>>> Xt
array([[1, 1],
       [0, 2],
       [1, 1],
       [0, 2],
       [2, 0],
       [1, 0],
       [2, 0],
       [1, 0]])
>>> np.all(Xt == CountFeaturizer(inclusion=[[0, 1]]).fit_transform(X, y))
True

>>> CountFeaturizer(inclusion=[[0], [1], [0, 1]]).fit_transform(X, y)
array([[1, 3, 3, 1, 1, 1],
       [1, 3, 1, 2, 0, 2],
       [1, 3, 3, 1, 1, 1],
       [1, 3, 1, 2, 0, 2],
       [4, 0, 3, 1, 2, 0],
       [4, 0, 1, 2, 1, 0],
       [4, 0, 3, 1, 2, 0],
       [4, 0, 1, 0, 1, 0]])

The counts of each example learned during 'fit'

y_set_ : list of (index, y) tuples
An enumerated set of all unique values y can have
Member

I think this is what is called classes_ in most classifiers

The number of columns of 'X' learned during 'fit'

col_num_Y_ : int
The number of clumns of 'y' learned during 'fit'
Member

*columns

Contributor Author

(Oops, sorry about that)

An enumerated set of all unique values y can have

col_num_X_ : int
The number of columns of 'X' learned during 'fit'
Member

Not sure what this means

Contributor Author

Essentially it's the number of columns of X that is put into the fit() function.
Then, when transform is called on another X2, we verify that X2 has the same number of columns as X; if the numbers of columns differ, an error is thrown.
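
A minimal sketch of that check (the helper name is hypothetical; col_num_X_ follows the docstring quoted above):

def _check_n_columns(self, X2):
    # reject transform-time input whose column count differs from fit-time X
    if len(X2[0]) != self.col_num_X_:
        raise ValueError("X has %d columns, expected %d"
                         % (len(X2[0]), self.col_num_X_))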

self.y_set_ = [list(enumerate(sorted(ys))) for ys in self.y_set_]
# freeze the dicts for pickling
self.count_cache_.default_factory = None
for cc_inner_dict in self.count_cache_.values():
Member

Since you're always accessing all layers of the defaultdict's nesting, why not just use a tuple of (X, output_idx, y[output_idx]) as a key to a single Counter?
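
For illustration, a flat-key version might look like this (a sketch only, assuming X is a 2D numpy array, y a 1D array of labels, and inclusion_used / inclusion_i as in the hunk above):

from collections import Counter

counts = Counter()
for i in range(len(X)):
    # one flat key instead of several layers of nested dicts
    X_key = tuple(X[i].take(inclusion_used[inclusion_i]).tolist())
    counts[(X_key, inclusion_i, y[i])] += 1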

@jnothman

jnothman commented Jan 25, 2017 via email


def _get_count_dict_0():
    """Gets the innermost count dictionary."""
    return defaultdict(int)
Member

You might want to use a Counter here. A difference which may be valuable is that looking up a key in a defaultdict sets that key. Looking it up in a Counter will return 0 but not expand the dict.

For similar reasons you should not use a defaultdict with an active default_factory at transform time.
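
A quick demonstration of that difference:

from collections import Counter, defaultdict

d = defaultdict(int)
_ = d["unseen"]   # the lookup inserts the key with value 0
print(len(d))     # 1

c = Counter()
_ = c["unseen"]   # returns 0 without storing anything
print(len(c))     # 0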

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you should not use a defaultdict with an active default_factory when the keys are likely to be sparsely found.

If I were you, I would benchmark having a prefabricated dict, or an array of counts, for all possible classes. Then just copy as needed, and at test time don't copy at all. That way for X values that are seen, you will get correctly-sized dicts from the outset. A bit more memory for X values that are exclusive to some values of y, but perhaps a valuable saving in time. I'm not certain, but I think there's some potential for experimenting with the correct data structure here.

Maybe we should consider producing a working estimator but keeping these count dicts private until we're happy with the data structure.
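
A rough sketch of the prefabricated-counts idea (hypothetical names and toy data, not code from this PR):

import numpy as np

n_classes = 2
X_keys = [(0,), (0,), (1,)]   # toy X row keys
y_indices = [0, 1, 0]         # toy class indices

template = np.zeros(n_classes, dtype=int)   # prefabricated array of counts
counts = {}                                 # X key -> per-class count array

for X_key, y_idx in zip(X_keys, y_indices):
    if X_key not in counts:
        counts[X_key] = template.copy()     # correctly sized from the outset
    counts[X_key][y_idx] += 1

# at transform time, read without copying; unseen keys fall back to zeros
row = counts.get((2,), template)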


@chenhe95 chenhe95 Jan 26, 2017


Yes, I agree; we should experiment with data structures to find the right one.
Currently I am leaning towards just making my own "default" dict that checks if key in my_dict: every time.
I did a benchmark on this, testing the time to build counts when no key exists, when the key already exists, and the time to look up keys in the dict:

import time
from collections import defaultdict
from collections import Counter

loop_count = 10000000

# plain dict with explicit membership checks
t = time.time()
a = {}
for i in range(loop_count):
    if i in a:
        a[i] += i
    else:
        a[i] = i

for i in range(loop_count // 2):
    if i in a:
        a[i] += i
    else:
        a[i] = i

for i in range(loop_count * 2):
    if i in a:
        x = a[i] + 5
    else:
        x = 5

print("time taken for {}:", time.time() - t)

# defaultdict(int)
t = time.time()
b = defaultdict(int)
for i in range(loop_count):
    b[i] += i

for i in range(loop_count // 2):
    b[i] += i

for i in range(loop_count * 2):
    x = b[i] + 5

print("time taken for defaultdict:", time.time() - t)

# Counter built by addition
t = time.time()
c = Counter(range(loop_count)) + Counter(range(loop_count // 2))

for i in range(loop_count * 2):
    x = c[i] + 5

print("time taken for counter:", time.time() - t)

# Counter built incrementally
t = time.time()
c2 = Counter()
for i in range(loop_count):
    c2[i] += i

for i in range(loop_count // 2):
    c2[i] += i

for i in range(loop_count * 2):
    x = c2[i] + 5

print("time taken for counter version 2:", time.time() - t)

And the results were kind of surprising:

time taken for {}: 7.626983881
time taken for defaultdict: 9.94141697884
time taken for counter: 58.303508997
time taken for counter version 2: 15.7543189526


@jnothman jnothman left a comment


a partial review

- 'each' : Each feature will have its own set of counts
- 1D list of indices : Only the given list of features is
counted
- 2D list of indices : The given list of lists of features is counted,
Member

You mean list of lists, not 2d. Needn't be rectangular array.

Contributor Author

Changed to lists of lists:
- 'all' (default) : Every feature is concatenated and counted
- 'each' : Each feature will have its own set of counts
- list of indices : Only the given list of features is
                    concatenated and counted
- list of lists of indices : The given list of lists of features is
                             concatenated and counted, but each list in the
                             list of lists has its own set of counts

- 'all' (default) : Every feature given is counted
- 'each' : Each feature will have its own set of counts
- 1D list of indices : Only the given list of features is
counted
Member

hanging indent, please

Contributor Author

Handled

- 1D list of indices : Only the given list of features is
counted
- 2D list of indices : The given list of lists of features is counted,
but each list in the list of lists have its own set of counts
Member

hanging indent, please

Contributor Author

Handled


- 'all' (default) : Every feature given is counted
- 'each' : Each feature will have its own set of counts
- 1D list of indices : Only the given list of features is
Member

not clear that these are counted together, i.e. these features are concatenated and counted.


@chenhe95 chenhe95 Mar 5, 2017


Changed to

    - 1D list of indices : Only the given list of features is
                           concatenated and counted
    - 2D list of indices : The given list of lists of features is
                           concatenated and counted, but each list in the
                           list of lists has its own set of counts

inclusion : 'all', 'each', list, or numpy.ndarray
The inclusion criteria for counting

- 'all' (default) : Every feature given is counted
Member

not clear that these are counted together, i.e. these features are concatenated and counted.

Contributor Author

Changed to "- 'all' (default) : Every feature is concatenated and counted"

for i in range(len_data):
    for j in range(self.col_num_Y_):
        X_key = tuple(X[i].take(inclusion_used[inclusion_i]))
        if len(y.shape) == 1:
Member

why don't you just reshape y at the top?

Contributor Author

Now reshapes y at the top if it is 1D

if y is not None:
    for i in range(len_data):
        for j in range(self.col_num_Y_):
            X_key = tuple(X[i].take(inclusion_used[inclusion_i]))
Member

do this out of the j loop

Member

Also, tuple(my_array) is slower than tuple(my_array.tolist())
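
A quick check of that claim (timings vary by machine and array size):

import numpy as np
from timeit import timeit

row = np.arange(10)
print(timeit(lambda: tuple(row), number=100000))           # iterates numpy scalars
print(timeit(lambda: tuple(row.tolist()), number=100000))  # one C-level conversion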

len_data = len(X)
self.col_num_X_ = len(X[0])
self.count_cache_ = _get_count_dict_3()
self.classes_ = [set() for i in range(self.col_num_Y_)]
Member

make this a local variable if you're going to overwrite it below. but you could just consider using something like LabelEncoder.


@chenhe95 chenhe95 Mar 6, 2017


Changed to classes_unsorted.
I am unsure about using LabelEncoder because I wouldn't make good use of everything in it, and there would be additional overhead from computing things I won't be using.

inclusion_used = self._get_inclusion_used()

for inclusion_i in range(len(inclusion_used)):
    if y is not None:
Member

why don't you just set y to 0s when it's None

Contributor Author

Changed to np.zeros(len(X)).
I was unsure about it before because it creates len(X) additional elements in memory that don't really have much use, but it certainly did make the code a lot cleaner.
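
As a sketch, the substitution amounts to something like this (hypothetical helper name):

import numpy as np

def _ensure_y(X, y):
    # a single dummy "class" of zeros lets the counting code treat the
    # unsupervised case exactly like the supervised one
    return np.zeros(len(X)) if y is None else y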

for inclusion_i in range(len(inclusion_used)):
    col_offset_y = 0
    col_offset_inclusion = inclusion_i * len_classes
    for j in range(self.col_num_Y_):
Member

If you have arrays instead of single counts in your cache you don't need this loop.

@jnothman

jnothman commented Jan 26, 2017 via email

@chenhe95

chenhe95 commented Feb 1, 2017

Small update: I have been busy with school work lately, but I am still working on this.

@jnothman

jnothman commented Feb 1, 2017 via email

@chenhe95

chenhe95 commented Mar 4, 2017

Currently working on making the _count_cache a 2D array at the very last layer.
It was kind of pointless to use a dict when every element in the [y_col, inclusion_index] matrix is dense, with every element populated.
So accesses will look like this:

_count_cache[X][y_key][y_col, inclusion_index]

Also planning to implement the nested counter along these lines:

import functools
from collections import defaultdict
import numpy as np

def _get_nested_counter(remaining, y_dim, inclusion_size):
    """A nested dictionary with 'remaining' layers and a 2D array at the end."""
    if remaining == 1:
        return np.zeros((y_dim, inclusion_size))
    return defaultdict(functools.partial(
        _get_nested_counter, remaining - 1, y_dim, inclusion_size))
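
For example, with two dict layers ending in a count matrix (matching the access pattern above):

# two dict layers (X key, then y key) ending in a y_dim x inclusion_size array
cache = _get_nested_counter(3, y_dim=2, inclusion_size=2)
cache[(0, 1)][(1,)][0, 1] += 1   # i.e. _count_cache[X][y_key][y_col, inclusion_index]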

@codecov

codecov bot commented Mar 4, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@288827b).
The diff coverage is 98.63%.

@@            Coverage Diff            @@
##             master    #8144   +/-   ##
=========================================
  Coverage          ?   95.48%           
=========================================
  Files             ?      342           
  Lines             ?    61059           
  Branches          ?        0           
=========================================
  Hits              ?    58304           
  Misses            ?     2755           
  Partials          ?        0
Impacted Files                              Coverage Δ
sklearn/preprocessing/tests/test_data.py   99.9% <100%> (ø)
sklearn/preprocessing/data.py               99.03% <97.61%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 288827b...34bd821.

@jnothman

jnothman commented Mar 5, 2017

Are you seeking another review, then? Are other issues addressed?

@chenhe95

chenhe95 commented Mar 5, 2017

I have only addressed 2-3 of the issues mentioned so far.
I'll work on getting the other things handled too.
One thing I am curious about: is there a way to see the code coverage of only a single file? The contributor guide only mentions how to get the code coverage of the entire project.

@chenhe95

chenhe95 commented Mar 6, 2017

I have addressed all of the issues except for the one in transform:

@chenhe95

chenhe95 commented Mar 6, 2017

@jnothman
I feel like the loop is necessary. I've considered doing it in the format of
transformed[section of transformed] = count_cache[section of count cache]
instead of a for loop where I assign everything one by one.
But I can't really come up with a way of doing that without making y a dense array, like count_cache[X][y, j, inclusion], or in general without increasing the memory requirement significantly.
Maybe a list comprehension might work, but I feel like it would be less readable, and the loop would still technically be there.
What are your thoughts on this?

@chenhe95

chenhe95 commented Jun 7, 2017

Any updates on this? 😃

@amueller

amueller commented Jun 7, 2017

Hopefully someone at the sprint can find some time ;)

@chenhe95

chenhe95 commented Jun 7, 2017

Sounds good!

@amueller

Sorry for the long stall. I'll review this week (hopefully). Can you please fix the merge conflicts? Thanks!

@amueller

amueller commented Aug 3, 2017

https://youtu.be/-n7qZAdWFL0?t=1074 argues (I think) that this should be done with cross-validation to avoid overfitting.
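
For concreteness, a hedged sketch of what out-of-fold counting could look like (not code from this PR; the helper and the per-category target sums are illustrative, and x_col is assumed to be a 1D numpy array):

import numpy as np
from sklearn.model_selection import KFold

def out_of_fold_counts(x_col, y, n_splits=5):
    # each row's count feature is computed from the other folds only,
    # so it never sees that row's own target value
    out = np.zeros(len(x_col))
    for train_idx, test_idx in KFold(n_splits=n_splits).split(x_col):
        counts = {}
        for x, t in zip(x_col[train_idx], y[train_idx]):
            counts[x] = counts.get(x, 0) + t
        out[test_idx] = [counts.get(x, 0) for x in x_col[test_idx]]
    return out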

@amueller

amueller commented Aug 3, 2017

I think this should be refactored assuming we have ColumnTransformer so that it'll transform all input columns.
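
For illustration only, the composition might eventually look something like this (ColumnTransformer later landed in sklearn.compose; CountFeaturizer is this PR's proposed class, and the column indices are hypothetical):

from sklearn.compose import ColumnTransformer

# count-featurize only the categorical columns, pass the rest through
ct = ColumnTransformer(
    [("counts", CountFeaturizer(), [0, 1])],
    remainder="passthrough")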

@amueller

amueller commented Aug 4, 2017

I just realized that using any form of cross-validation would raise the same issues as in stacking (#8960), which mean that fit(X).transform(X) is quite different from fit_transform(X)

@jnothman

jnothman commented Aug 6, 2017 via email

@chenhe95

Okay, closing pull request now to fix merge conflicts and open up new pull request later. Will add link.

@chenhe95

Added a new pull request with the code built on top of the current code.

@amueller amueller added the Superseded (PR has been replaced by a newer PR) label Aug 6, 2019