WIP CountFeaturizer for Categorical Data #8144
Conversation
Regarding inclusion, does this make it clear? Sorry if there are mistakes as I've done this by hand:
>>> D = [[0, 0, 0],
         [0, 1, 1],
         [0, 0, 1],
         [0, 1, 1],
         [1, 0, 0],
         [1, 1, 0],
         [1, 0, 0],
         [1, 2, 0]]
>>> X = [row[:2] for row in D]
>>> y = [row[2] for row in D]
>>> CountFeaturizer(inclusion=[[0]]).fit_transform(X, y)
array([[1, 3],
       [1, 3],
       [1, 3],
       [1, 3],
       [4, 0],
       [4, 0],
       [4, 0],
       [4, 0]])
>>> Xt = CountFeaturizer(inclusion=[[0], [1]]).fit_transform(X, y)
>>> Xt
array([[1, 3, 3, 1],
       [1, 3, 1, 2],
       [1, 3, 3, 1],
       [1, 3, 1, 2],
       [4, 0, 3, 1],
       [4, 0, 1, 2],
       [4, 0, 3, 1],
       [4, 0, 1, 0]])
>>> np.all(Xt == CountFeaturizer(inclusion='each').fit_transform(X, y))
True
>>> Xt = CountFeaturizer(inclusion='altogether').fit_transform(X, y)
>>> Xt
array([[1, 1],
       [0, 2],
       [1, 1],
       [0, 2],
       [2, 0],
       [1, 0],
       [2, 0],
       [1, 0]])
>>> np.all(Xt == CountFeaturizer(inclusion=[[0, 1]]).fit_transform(X, y))
True
>>> CountFeaturizer(inclusion=[[0], [1], [0, 1]]).fit_transform(X, y)
array([[1, 3, 3, 1, 1, 1],
       [1, 3, 1, 2, 0, 2],
       [1, 3, 3, 1, 1, 1],
       [1, 3, 1, 2, 0, 2],
       [4, 0, 3, 1, 2, 0],
       [4, 0, 1, 2, 1, 0],
       [4, 0, 3, 1, 2, 0],
       [4, 0, 1, 0, 1, 0]])
sklearn/preprocessing/data.py (outdated excerpt):

        The counts of each example learned during 'fit'

    y_set_ : list of (index, y) tuples
        An enumerated set of all unique values y can have
I think this is what is called classes_ in most classifiers
sklearn/preprocessing/data.py (outdated excerpt):

        The number of columns of 'X' learned during 'fit'

    col_num_Y_ : int
        The number of clumns of 'y' learned during 'fit'
*columns
(Oops, sorry about that)
sklearn/preprocessing/data.py (outdated excerpt):

        An enumerated set of all unique values y can have

    col_num_X_ : int
        The number of columns of 'X' learned during 'fit'
Not sure what this means
Essentially, it's the number of columns of X that is passed into the fit() function. Then, when transform is called on another X2, we verify that X2 has the same number of columns as X; if the numbers of columns differ, an error is thrown.
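For illustration, a minimal sketch of that check (the attribute name col_num_X_ follows the PR; the class and everything else here is assumed):

import numpy as np

class ColumnCountCheckSketch:
    def fit(self, X, y=None):
        # remember how many columns X had during fit
        self.col_num_X_ = np.asarray(X).shape[1]
        return self

    def transform(self, X):
        X = np.asarray(X)
        if X.shape[1] != self.col_num_X_:
            raise ValueError("X has %d columns, but %d were seen during fit"
                             % (X.shape[1], self.col_num_X_))
        # ... counting logic would follow here ...
        return X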
sklearn/preprocessing/data.py (outdated excerpt):

    self.y_set_ = [list(enumerate(sorted(ys))) for ys in self.y_set_]
    # freeze the dicts for pickling
    self.count_cache_.default_factory = None
    for cc_inner_dict in self.count_cache_.values():
Since you're always accessing all layers of the defaultdict's nesting, why not just use a tuple of (X, output_idx, y[output_idx]) as a key to a single Counter?
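A sketch of the flat-key variant being suggested here (toy data; the name inclusion_used and the loop structure are assumptions based on the PR):

import numpy as np
from collections import Counter

X = np.array([[0, 0], [0, 1], [1, 0]])   # toy data
y = np.array([[0], [1], [0]])            # 2D y: one output column
inclusion_used = [0]                     # features counted together

counts = Counter()
for i in range(len(X)):
    X_key = tuple(X[i].take(inclusion_used).tolist())
    for j in range(y.shape[1]):
        # one flat key instead of three levels of nested defaultdicts
        counts[(X_key, j, y[i, j])] += 1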
Shrug
…On 25 January 2017 at 12:29, He Chen wrote, in sklearn/preprocessing/data.py:
> + if len(y.shape) == 1:
+ # if y is 1D, y[i] is a scalar, which is not iterable
+ y_key = tuple([y[i]])
+ else:
+ y_key = tuple(y[i].take([j]))
+ self.count_cache_[X_key][j][y_key] += 1
+ self.y_set_[j].add(y_key)
+ else:
+ self.y_set_[0].add(0)
+ for i in range(len_data):
+ X_key = tuple(X[i].take(inclusion_used))
+ self.count_cache_[X_key][0][0] += 1
+ self.y_set_ = [list(enumerate(sorted(ys))) for ys in self.y_set_]
+ # freeze the dicts for pickling
+ self.count_cache_.default_factory = None
+ for cc_inner_dict in self.count_cache_.values():
Yeah. If I do it like that, then continuous integration won't throw a
pickling error and it will avoid an extra O(n) freezing operation on fit()
and transform()
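For context, the pickling constraint being referred to: a defaultdict can only be pickled if its default_factory is picklable (hence the module-level _get_count_dict_* functions in the PR), and setting default_factory to None "freezes" it into plain-dict behaviour. A quick illustration (not the PR's code):

import pickle
from collections import defaultdict

d = defaultdict(lambda: 0)
d["a"] += 1
try:
    pickle.dumps(d)           # fails: a lambda factory cannot be pickled
except Exception as exc:
    print(type(exc).__name__)
d.default_factory = None      # freeze: missing keys now raise KeyError
pickle.dumps(d)               # succeeds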
sklearn/preprocessing/data.py (outdated excerpt):

    def _get_count_dict_0():
        """Gets the innermost count dictionary."""
        return defaultdict(int)
You might want to use a Counter here. A difference which may be valuable is that looking up a key in a defaultdict sets that key. Looking it up in a Counter will return 0 but not expand the dict.

For similar reasons you should not use a defaultdict with an active default_factory at transform time.
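A quick demonstration of that difference (standard library behaviour):

from collections import Counter, defaultdict

d = defaultdict(int)
c = Counter()
d["missing"]   # returns 0, but also inserts the key into d
c["missing"]   # returns 0 and leaves c unchanged
assert len(d) == 1 and len(c) == 0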
Sorry, you should not use a defaultdict with an active default_factory when the keys are likely to be sparsely found.

If I were you, I would benchmark having a prefabricated dict, or an array of counts, for all possible classes. Then just copy as needed, and at test time don't copy at all. That way, for X values that are seen, you will get correctly-sized dicts from the outset. A bit more memory for X values that are exclusive to some values of y, but perhaps a valuable saving in time. I'm not certain, but I think there's some potential for experimenting with the correct data structure here.

Maybe we should consider producing a working estimator but keeping these count dicts private until we're happy with the data structure.
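A sketch of the prefabricated-counts idea as I read it (toy data; all names here are assumptions):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 0]])   # toy data
y = np.array([0, 1, 0, 0])
inclusion_used = [0]
class_index = {c: k for k, c in enumerate(np.unique(y))}  # y value -> slot

counts = {}   # X_key -> right-sized array of per-class counts
zero_row = np.zeros(len(class_index), dtype=np.intp)
for i in range(len(X)):
    X_key = tuple(X[i].take(inclusion_used).tolist())
    if X_key not in counts:
        counts[X_key] = zero_row.copy()   # prefabricated for all classes
    counts[X_key][class_index[y[i]]] += 1
# at transform time: counts.get(X_key, zero_row) -- no copy, no insertion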
Yes. I agree, we should experiment with data structures to find the right one. Currently I am leaning towards just making my own "default" dict that checks if key in my_dict: every time.

I did a benchmark on this, testing the time to build counts when no key exists, when the key already exists, and lookups into the dict:
import time
from collections import defaultdict
from collections import Counter

loop_count = 10000000

# plain dict with explicit membership tests
t = time.time()
a = {}
for i in xrange(loop_count):        # build: every key is new
    if i in a:
        a[i] += i
    else:
        a[i] = i
for i in xrange(loop_count / 2):    # update: every key already exists
    if i in a:
        a[i] += i
    else:
        a[i] = i
for i in xrange(loop_count * 2):    # lookup: half of the keys are missing
    if i in a:
        x = a[i] + 5
    else:
        x = 5
print "time take for {}:", (time.time() - t)

# defaultdict(int)
t = time.time()
b = defaultdict(int)
for i in xrange(loop_count):
    b[i] += i
for i in xrange(loop_count / 2):
    b[i] += i
for i in xrange(loop_count * 2):
    x = b[i] + 5
print "time take for defaultdict:", (time.time() - t)

# Counter built from iterables, then merged with +
t = time.time()
c = Counter(xrange(loop_count)) + Counter(xrange(loop_count / 2))
for i in xrange(loop_count * 2):
    x = c[i] + 5
print "time take for counter:", (time.time() - t)

# Counter updated incrementally
t = time.time()
c2 = Counter()
for i in xrange(loop_count):
    c2[i] += i
for i in xrange(loop_count / 2):
    c2[i] += i
for i in xrange(loop_count * 2):
    x = c2[i] + 5
print "time take for counter version 2:", (time.time() - t)
And the results were kind of surprising:
time take for {}: 7.626983881
time take for defaultdict: 9.94141697884
time take for counter: 58.303508997
time take for counter version 2: 15.7543189526
a partial review
sklearn/preprocessing/data.py (outdated excerpt):

    - 'each' : Each feature will have its own set of counts
    - 1D list of indices : Only the given list of features is
    counted
    - 2D list of indices : The given list of lists of features is counted,
You mean list of lists, not 2d. Needn't be rectangular array.
Changed to lists of lists:
- 'all' (default) : Every feature is concatenated and counted
- 'each' : Each feature will have its own set of counts
- list of indices : Only the given list of features is
  concatenated and counted
- list of lists of indices : The given list of lists of features is
  concatenated and counted, but each list in the
  list of lists has its own set of counts
sklearn/preprocessing/data.py (outdated excerpt):

    - 'all' (default) : Every feature given is counted
    - 'each' : Each feature will have its own set of counts
    - 1D list of indices : Only the given list of features is
    counted
hanging indent, please
Handled
sklearn/preprocessing/data.py (outdated excerpt):

    - 1D list of indices : Only the given list of features is
    counted
    - 2D list of indices : The given list of lists of features is counted,
    but each list in the list of lists have its own set of counts
hanging indent, please
Handled
sklearn/preprocessing/data.py (outdated excerpt):

    - 'all' (default) : Every feature given is counted
    - 'each' : Each feature will have its own set of counts
    - 1D list of indices : Only the given list of features is
not clear that these are counted together, i.e. these features are concatenated and counted.
Changed to:

- 1D list of indices : Only the given list of features is
  concatenated and counted
- 2D list of indices : The given list of lists of features is
  concatenated and counted, but each list in the
  list of lists has its own set of counts
sklearn/preprocessing/data.py (outdated excerpt):

    inclusion : 'all', 'each', list, or numpy.ndarray
        The inclusion criteria for counting

        - 'all' (default) : Every feature given is counted
not clear that these are counted together, i.e. these features are concatenated and counted.
Changed to "- 'all' (default) : Every feature is concatenated and counted"
sklearn/preprocessing/data.py (outdated excerpt):

    for i in range(len_data):
        for j in range(self.col_num_Y_):
            X_key = tuple(X[i].take(inclusion_used[inclusion_i]))
            if len(y.shape) == 1:
why don't you just reshape y at the top?
Now reshapes y at the top if it is 1D
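Presumably something along these lines (a sketch, not the PR's exact code):

import numpy as np

y = np.asarray([0, 1, 1, 0])   # toy 1D targets
if y.ndim == 1:
    y = y.reshape(-1, 1)       # treat 1D y as a single output column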
sklearn/preprocessing/data.py (outdated excerpt):

    if y is not None:
        for i in range(len_data):
            for j in range(self.col_num_Y_):
                X_key = tuple(X[i].take(inclusion_used[inclusion_i]))
do this out of the j loop
Also, tuple(my_array) is slower than tuple(my_array.tolist())
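This is easy to check with timeit (exact numbers vary by machine; iterating a NumPy array yields boxed scalars one at a time, while tolist() converts in bulk in C):

import timeit

setup = "import numpy as np; a = np.arange(10)"
print(timeit.timeit("tuple(a)", setup=setup, number=100000))
print(timeit.timeit("tuple(a.tolist())", setup=setup, number=100000))
# tuple(a.tolist()) is typically several times faster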
sklearn/preprocessing/data.py (outdated excerpt):

    len_data = len(X)
    self.col_num_X_ = len(X[0])
    self.count_cache_ = _get_count_dict_3()
    self.classes_ = [set() for i in range(self.col_num_Y_)]
Make this a local variable if you're going to overwrite it below. But you could just consider using something like LabelEncoder.
Changed to classes_unsorted.

I am unsure about using LabelEncoder because I won't make good use of everything in there, and there would be additional overhead in calculating the things that I won't be using.
sklearn/preprocessing/data.py (outdated excerpt):

    inclusion_used = self._get_inclusion_used()

    for inclusion_i in range(len(inclusion_used)):
        if y is not None:
why don't you just set y to 0s when it's None
Changed to np.zeros(len(X)). I was unsure about it before because it would create len(X) additional elements in memory that wouldn't really have much use, but it certainly did make the code a lot cleaner.
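i.e., presumably something like this at the top of fit (a sketch):

import numpy as np

X = [[0, 0], [1, 0]]       # toy data
y = None
if y is None:
    y = np.zeros(len(X))   # one dummy "class", so counting stays uniform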
sklearn/preprocessing/data.py (outdated excerpt):

    for inclusion_i in range(len(inclusion_used)):
        col_offset_y = 0
        col_offset_inclusion = inclusion_i * len_classes
        for j in range(self.col_num_Y_):
If you have arrays instead of single counts in your cache you don't need this loop.

No, I'm certain you should be using an array giving counts across all y and outputs rather than a counter.
…On 27 January 2017 at 03:13, He Chen wrote, in sklearn/preprocessing/data.py:
> + return defaultdict(_get_count_dict_2)
+
+
+def _get_count_dict_2():
+ """Gets the inner count dictionary."""
+ return defaultdict(_get_count_dict_1)
+
+
+def _get_count_dict_1():
+ """Gets the inner count dictionary."""
+ return defaultdict(_get_count_dict_0)
+
+
+def _get_count_dict_0():
+ """Gets the innermost count dictionary."""
+ return defaultdict(int)
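A sketch of the array-valued cache being suggested above, with one array of counts per X key covering all outputs and classes (toy data; shapes and names are assumptions):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0]])   # toy data
Y = np.array([[0, 1], [1, 1], [0, 0]])   # two output columns
inclusion_used = [0]
classes = [np.unique(Y[:, j]) for j in range(Y.shape[1])]
index = [{c: k for k, c in enumerate(cs)} for cs in classes]
n_classes = max(len(cs) for cs in classes)

counts = {}   # X_key -> (n_outputs, n_classes) count array
for i in range(len(X)):
    X_key = tuple(X[i].take(inclusion_used).tolist())
    row = counts.setdefault(
        X_key, np.zeros((Y.shape[1], n_classes), dtype=np.intp))
    for j in range(Y.shape[1]):
        row[j, index[j][Y[i, j]]] += 1
# transform can then emit row.ravel() per sample directly, instead of
# looping over outputs and classes to pull single counts out of nested dicts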
Small update: I have been busy with school work lately, but I am still working on this.
That's fine. A number of core scikit-learn devs are busy elsewhere too.
Working on making the
Also planning to implement the nested counter something along these lines:
Codecov Report
@@           Coverage Diff            @@
##           master    #8144   +/-   ##
=========================================
  Coverage        ?   95.48%
=========================================
  Files           ?      342
  Lines           ?    61059
  Branches        ?        0
=========================================
  Hits            ?    58304
  Misses          ?     2755
  Partials        ?        0

Continue to review full report at Codecov.
Are you seeking another review, then? Are other issues addressed?
I have only addressed 2-3 of the issues mentioned so far.
@jnothman Any updates on this? 😃
Hopefully someone at the sprint can find some time ;)
Sounds good!
Sorry for the long stall. I'll review this week (hopefully). Can you please fix the merge conflicts? Thanks!
https://youtu.be/-n7qZAdWFL0?t=1074 argues (I think) that this should be done with cross-validation to avoid overfitting.
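The concern is target leakage: each row contributes to the counts that become its own features. A hedged sketch of out-of-fold counting for a single categorical column (not part of this PR; names are assumptions):

import numpy as np
from sklearn.model_selection import KFold

def out_of_fold_counts(x, y, n_splits=5):
    """Count-encode column x against y using only out-of-fold rows."""
    x, y = np.asarray(x), np.asarray(y)
    classes = np.unique(y)
    out = np.zeros((len(x), len(classes)))
    for train_idx, test_idx in KFold(n_splits=n_splits).split(x):
        for k, c in enumerate(classes):
            for value in np.unique(x[test_idx]):
                # count (value, class) pairs using training rows only
                n = np.sum((x[train_idx] == value) & (y[train_idx] == c))
                out[test_idx[x[test_idx] == value], k] = n
    return out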
I think this should be refactored assuming we have
I just realized that using any form of cross-validation would raise the same issues as in stacking #8960 that mean that fit(X).transform(X) is quite different from fit_transform(X).
Yes, and we're encountering the same in KNNImputer. I think we'll just need to make that (sometimes) okay (tag it).
Okay, closing pull request now to fix merge conflicts and open up new pull request later. Will add link.
Added new pull request with code added on top of current code.
Reference Issue
#5853
What does this implement/fix? Explain your changes.
It adds the CountFeaturizer transformation class, which can help improve accuracy by using how often a particular data row occurs as a feature.
Any other comments?
Currently a work in progress; please let me know if there is anything I should add, or anything I can do in a better or faster way!
Also a continuation of #7803