[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided #7301

nelson-liu · 2016-08-31T04:17:02Z

Reference Issue

Fixes #6945, previous PR at #6947

What does this implement/fix? Explain your changes.

min_weight_fraction_leaf should work even when sample_weight is not given (in which case samples are assumed to have equal weight). I also tweaked the description of min_weight_fraction_leaf to make its purpose a bit more clear.

Any other comments?

nelson-liu · 2016-08-31T04:23:06Z

sklearn/tree/tree.py

+        if sample_weight is None:
+            min_weight_leaf = int(ceil(self.min_weight_fraction_leaf *
+                                       n_samples))
+            min_samples_leaf = max(min_samples_leaf, min_weight_leaf)


I'm not sure whether it's better to have this block as what it is, or

min_samples_leaf = max(min_samples_leaf, int(ceil(self.min_weight_fraction_leaf * n_samples))) min_weight_leaf = 0 # (seeing as min_weight_leaf is unnecessary when sample_weight is None)

or simply omit the min_samples_leaf = max(...) and just set min_weight_leaf since both min_weight_leaf and min_samples_leaf are provided to the Criterion object.

I don't understand what you mean by:

or simply omit the min_samples_leaf = max(...) and just set min_weight_leaf since both min_weight_leaf and min_samples_leaf are provided to the Criterion object.

But I think I am fine with the first two suggestions.

ok, i'll just leave it as it is.

The max seems odd to me. Why not just remove it?

Yeah, that's what I was trying to propose in my last sentence on my original comment (was probably unclear, sorry). I haven't tested it but it should work... And it seems a bit more logical too

I think I'm for that.

yeah, it seems like that removing the call to max works as well. The max here is implicit because it's implemented in Criterion anyway.

ogrisel · 2016-08-31T09:00:18Z

LGTM.

jnothman · 2016-08-31T14:16:43Z

You don't currently test interaction between min_samples_leaf and this

nelson-liu · 2016-08-31T16:21:04Z

You don't currently test interaction between min_samples_leaf and this

Good idea, their interaction is simply that the max is the bound, right?

amueller · 2016-08-31T17:34:15Z

maybe test that with no sample weights, the two parameters have the same effect?

nelson-liu · 2016-08-31T23:52:41Z

@amueller so I'm actually of the opinion now that they shouldn't have the same effect... I pushed a test that checks if they are the same; it crashes and burns, but in a justified manner I think.

First, I changed the min_weight_leaf calculation formula to actually provide the minimum weight as opposed to using the same formula as min_samples_leaf. The parameter deals with weights, so we shouldn't be rounding up and turning it into an int (thus representing the min samples in the leaf). To further reinforce this, the Cython builder classes actually take min_weight_leaf as a double. This also mirrors the implementation in the else clause.

So onto why they shouldn't have the same effect:
When calculating whether a node is a leaf in _tree.pyx, the following conditional is used:

is_leaf = ((depth >= max_depth) or
                (n_node_samples < min_samples_split) or
                (n_node_samples < 2 * min_samples_leaf) or
                (weighted_n_node_samples < min_weight_leaf))

Say that we fit on a dataset with 5 samples, and provide no sample_weight at fit (thus uniform weighting). If we set min_weight_leaf = .1 and min_samples_leaf = .1, the comparisons will be (assuming this is the first split, so n_node_samples / weighted_n_node_samples = 5):

5 < 2 * int(ceil(.1*5)) = 2* 1 < 2
5 < (.1*5) = (.5)

Which is very different. This approach does have the downside that (in this case), setting min_weight_leaf = .1 essentially does nothing because the value of weighted_n_node_samples will always be greater than .1 (it's always greater than 1). However, I think that's staying true to the parameter definition.

I think we should take this approach, but I feel like a warning might be good if min_weight_leaf * sum(sample_weights) < min(sample_weights). What do you think?

edit: there are cases where the two parameters have the same effect, but they do not always.

nelson-liu · 2016-09-01T03:24:42Z

There are probably some hand-crafted values that could yield equal values...I suppose that those could serve as a useful test? Not sure if it is necessary, though. what do you guys think?

nelson-liu · 2016-09-04T11:00:04Z

since the two parameters don't give the same effect on uniform weighted data when frac*total_weight == int(ceil(frac*total_weight)) (and this seems incorrect), I've opened an issue for it @ #7338. If we decide to adopt the solutions in that issue to give the two parameters an equivalent interpretation, I'll verify that the test (such that the two trees are equal) works. If not, i don't think it's a suitable test...

It does seem reasonable for the two grown trees to be equal under this scenario, though.

nelson-liu · 2016-09-05T03:55:06Z

this is ready to be looked at again; removing the +1 because there have been significant changes to the tests.

raghavrv · 2016-09-06T08:13:40Z

Say that we fit on a dataset with 5 samples, and provide no sample_weight at fit (thus uniform weighting). If we set min_weight_leaf = .1 and min_samples_leaf = .1

I am sorry but I don't get what you are trying to say. min_samples_leaf is an int and min_weight_leaf is a float. Why are you giving a float value to min_samples_leaf?

raghavrv · 2016-09-06T08:23:51Z

sklearn/ensemble/forest.py

@@ -796,7 +796,8 @@ class RandomForestClassifier(ForestClassifier):

    min_weight_fraction_leaf : float, optional (default=0.)
        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        leaf node where weights are determined by ``sample_weight`` provided


I'd write it as -

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.

nelson-liu · 2016-09-06T14:28:04Z

I am sorry but I don't get what you are trying to say. min_samples_leaf is an int and min_weight_leaf is a float. Why are you giving a float value to min_samples_leaf?

min_samples_leaf can actually be a float, from the documentation:

If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

amueller · 2016-09-06T17:15:31Z

On 08/31/2016 07:52 PM, Nelson Liu wrote:

Say that we fit on a dataset with 5 samples, and provide no
|sample_weight| at fit (thus uniform weighting). If we set
|min_weight_leaf = .1| and |min_samples_leaf = .1|, the comparisons
will be (assuming this is the first split, so |n_node_samples /
weighted_n_node_samples = 5|):

I was talking about the case where min_weight_leaf * n_samples is an
integer. It seems to me that in this case they should have the same
behavior.
I'm surprised about the 2* with min_samples_leaf but not with
min_weight_leaf. Why is that?

nelson-liu · 2016-09-06T17:17:53Z

I was talking about the case where min_weight_leaf * n_samples is an
integer. It seems to me that in this case they should have the same
behavior.

@jnothman and I discussed this a bit in #7338 (specifically #7338 (comment)) , could you check it out?

I'm surprised about the 2* with min_samples_leaf but not with
min_weight_leaf. Why is that?

I believe it's because weights can be split, but samples cannot

jnothman · 2016-09-06T20:41:16Z

I'm surprised about the 2* with min_samples_leaf but not with min_weight_leaf. Why is that?

The min_weight_leaf condition shouldn't be there. It's plain wrong. The min_samples_leaf condition is only an optimisation, and the same kind of optimisation can't be easily achieved for weight. The real work happens in the splitter.

jnothman · 2016-09-07T06:59:29Z

I'm tempted to follow the second option at #6945 and either raise an error or a warning if sample_weight is None and min_weight_fraction_leaf is not None.

jnothman · 2016-09-07T07:00:01Z

But I'm okay with the "assume sample_weight=1" solution too..?

nelson-liu · 2016-09-07T22:09:42Z

I'm tempted to follow the second option at #6945 and either raise an error or a warning if sample_weight is None and min_weight_fraction_leaf is not None.

That solution would definitely be far easier to implement, but I think that assuming sample_weight=1 is more intuitive. However, it's equally counterintuitive that the results for min_weight_fraction_leaf do not match up with those for min_samples_leaf in this case. Ideally, the best thing to do would be to fix what's going in the splitter / tree to make the two grown trees the same, but doing that seems a bit out of scope of this PR. I'm fine with leaving it as it is or raising a warning / error, how about others?

jnothman · 2016-09-08T00:59:00Z

No, the difference is not a bug and should not be fixed.

On 8 September 2016 at 08:09, Nelson Liu notifications@github.com wrote:

I'm tempted to follow the second option at #6945
#6945 and either
raise an error or a warning if sample_weight is None and
min_weight_fraction_leaf is not None.

That solution would definitely be far easier to implement, but I think
that assuming sample_weight=1 is more intuitive. However, it's equally
counterintuitive that the results for min_weight_fraction_leaf do not
match up with those for min_samples_leaf in this case. Ideally, the best
thing to do would be to fix what's going in the splitter / tree to make the
two grown trees the same, but doing that seems a bit out of scope of this
PR. I'm fine with leaving it as it is or raising a warning / error, how
about others?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7301 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz69Pm4wZqxPqPKal1akEPBQnEvDrjks5qnzYogaJpZM4JxNyX
.

jnothman · 2016-09-08T00:59:20Z

I'm happy to accept what you implemented too.

On 8 September 2016 at 10:58, Joel Nothman joel.nothman@gmail.com wrote:

No, the difference is not a bug and should not be fixed.

On 8 September 2016 at 08:09, Nelson Liu notifications@github.com wrote:

I'm tempted to follow the second option at #6945
#6945 and either
raise an error or a warning if sample_weight is None and
min_weight_fraction_leaf is not None.

That solution would definitely be far easier to implement, but I think
that assuming sample_weight=1 is more intuitive. However, it's equally
counterintuitive that the results for min_weight_fraction_leaf do not
match up with those for min_samples_leaf in this case. Ideally, the best
thing to do would be to fix what's going in the splitter / tree to make the
two grown trees the same, but doing that seems a bit out of scope of this
PR. I'm fine with leaving it as it is or raising a warning / error, how
about others?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7301 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz69Pm4wZqxPqPKal1akEPBQnEvDrjks5qnzYogaJpZM4JxNyX
.

nelson-liu · 2016-09-08T01:14:48Z

No, the difference is not a bug and should not be fixed.

ok

I'm happy to accept what you implemented too.

In that vein, I've addressed @raghavrv 's comments. Perhaps this would be good for 0.18?

jnothman · 2016-09-08T03:29:53Z

sklearn/ensemble/forest.py

-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node.


It it worth noting, "Samples have equal weight when sample_weight is not provided, but min_samples_leaf is more efficient."

jnothman · 2016-09-26T23:49:56Z

sklearn/ensemble/forest.py

-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided, but


I think we should now drop "but min_samples_leaf is more efficient" if we've realised we can use the 2 * min_weight_leaf change.

jnothman · 2016-09-26T23:50:34Z

This actually has @ogrisel's +1 above as well. So it's got +2 assuming nothing substantial has changed since then. Also LGTM

jnothman · 2016-09-26T23:50:55Z

Pending that minor change, I should say.

nelson-liu · 2016-09-27T17:29:19Z

sorry for getting back to this so late, been a bit busy recently. I pushed the changes to the docstrings that were requested, is there anything else needed?

jnothman · 2016-09-27T23:40:21Z

A what's new entry is needed. Please put under 0.19, as I think this has
missed the boat for 0.18.

On 28 September 2016 at 03:29, Nelson Liu notifications@github.com wrote:

sorry for getting back to this so late, been a bit busy recently. I pushed
the changes to the docstrings that were requested, is there anything else
needed?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7301 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz670gWKcRX5AD0RrbyXjjgn6aKfKdks5quVJxgaJpZM4JxNyX
.

nelson-liu · 2016-09-28T00:06:32Z

@jnothman thanks for the reminder. I put this under "Enhancements"; do you think "Bug Fixes" would be better?

jnothman · 2016-09-28T01:37:03Z

If you state "previously it was silently ignored", then it belongs in bug fixes :)

nelson-liu · 2016-09-28T03:05:45Z

@jnothman good point, that info is important. added and moved to bugfixes.

jnothman · 2016-09-28T04:48:54Z

thanks @nelson-liu

@amueller, I've currently assumed this is not for 0.18. Feel free to backport and move the what's new if you disagree.

…not provided (scikit-learn#7301) * fix min_weight_fraction_leaf when sample_weights is None * fix flake8 error * remove added newline and unnecessary assignment * remove max bc it's implemented in cython and add interaction test * edit weight calculation formula and add test to check equality * remove test that sees if two parameter build the same tree * reword min_weight_fraction_leaf docstring * clarify uniform weight in forest docstrings * update docstrings for all classes * add what's new entry * move whatsnew entry to bug fixes and explain previous behavior

…not provided (scikit-learn#7301) * fix min_weight_fraction_leaf when sample_weights is None * fix flake8 error * remove added newline and unnecessary assignment * remove max bc it's implemented in cython and add interaction test * edit weight calculation formula and add test to check equality * remove test that sees if two parameter build the same tree * reword min_weight_fraction_leaf docstring * clarify uniform weight in forest docstrings * update docstrings for all classes * add what's new entry * move whatsnew entry to bug fixes and explain previous behavior # Conflicts: # doc/whats_new.rst

…not provided (scikit-learn#7301) * fix min_weight_fraction_leaf when sample_weights is None * fix flake8 error * remove added newline and unnecessary assignment * remove max bc it's implemented in cython and add interaction test * edit weight calculation formula and add test to check equality * remove test that sees if two parameter build the same tree * reword min_weight_fraction_leaf docstring * clarify uniform weight in forest docstrings * update docstrings for all classes * add what's new entry * move whatsnew entry to bug fixes and explain previous behavior

nelson-liu reviewed Aug 31, 2016
View reviewed changes

ogrisel changed the title ~~[MRG] Fix min_weight_fraction_leaf to work when sample_weights are not provided~~ [MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided Aug 31, 2016

ogrisel added this to the 0.18 milestone Aug 31, 2016

ogrisel mentioned this pull request Aug 31, 2016

Modified BaseDecisionTree so that min_weight_fraction_leaf works when… #6947

Closed

jnothman modified the milestone: 0.18 Aug 31, 2016

nelson-liu mentioned this pull request Sep 4, 2016

Duplicate logic in Decision Tree Building #7338

Closed

nelson-liu changed the title ~~[MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided~~ [MRG] Fix min_weight_fraction_leaf to work when sample_weights are not provided Sep 5, 2016

raghavrv reviewed Sep 6, 2016
View reviewed changes

jnothman reviewed Sep 8, 2016
View reviewed changes

glouppe approved these changes Sep 26, 2016

View reviewed changes

jnothman changed the title ~~[MRG] Fix min_weight_fraction_leaf to work when sample_weights are not provided~~ [MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided Sep 26, 2016

jnothman requested changes Sep 26, 2016

View reviewed changes

jnothman changed the title ~~[MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided~~ [MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided Sep 26, 2016

nelson-liu added 9 commits September 27, 2016 17:00

fix min_weight_fraction_leaf when sample_weights is None

167ffd5

fix flake8 error

f2b584f

remove added newline and unnecessary assignment

ac1e4da

remove max bc it's implemented in cython and add interaction test

51424dc

edit weight calculation formula and add test to check equality

1687550

remove test that sees if two parameter build the same tree

1b9a9cc

reword min_weight_fraction_leaf docstring

6c624d7

clarify uniform weight in forest docstrings

a591eaa

update docstrings for all classes

cf72fd4

nelson-liu force-pushed the min_weight_fraction_fix branch from 209f967 to cf72fd4 Compare September 28, 2016 00:00

add what's new entry

de27df0

move whatsnew entry to bug fixes and explain previous behavior

01f3091

jnothman approved these changes Sep 28, 2016

View reviewed changes

jnothman merged commit da118d0 into scikit-learn:master Sep 28, 2016

Uh oh!

[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided #7301

[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided #7301

Uh oh!

Conversation

nelson-liu commented Aug 31, 2016

Reference Issue

What does this implement/fix? Explain your changes.

Any other comments?

Uh oh!

nelson-liu Aug 31, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ogrisel Aug 31, 2016

Choose a reason for hiding this comment

Uh oh!

nelson-liu Aug 31, 2016

Choose a reason for hiding this comment

Uh oh!

amueller Aug 31, 2016

Choose a reason for hiding this comment

Uh oh!

nelson-liu Aug 31, 2016

Choose a reason for hiding this comment

Uh oh!

amueller Aug 31, 2016

Choose a reason for hiding this comment

Uh oh!

nelson-liu Aug 31, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ogrisel commented Aug 31, 2016

Uh oh!

jnothman commented Aug 31, 2016

Uh oh!

nelson-liu commented Aug 31, 2016

Uh oh!

amueller commented Aug 31, 2016

Uh oh!

nelson-liu commented Aug 31, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nelson-liu commented Sep 1, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nelson-liu commented Sep 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nelson-liu commented Sep 5, 2016

Uh oh!

raghavrv commented Sep 6, 2016

Uh oh!

raghavrv Sep 6, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nelson-liu commented Sep 6, 2016

Uh oh!

amueller commented Sep 6, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nelson-liu commented Sep 6, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jnothman commented Sep 6, 2016

Uh oh!

jnothman commented Sep 7, 2016

Uh oh!

jnothman commented Sep 7, 2016

Uh oh!

nelson-liu commented Sep 7, 2016

Uh oh!

jnothman commented Sep 8, 2016

Uh oh!

jnothman commented Sep 8, 2016

Uh oh!

nelson-liu commented Sep 8, 2016

Uh oh!

nelson-liu Aug 31, 2016 •

edited

Loading

nelson-liu Aug 31, 2016 •

edited

Loading

nelson-liu commented Aug 31, 2016 •

edited

Loading

nelson-liu commented Sep 1, 2016 •

edited

Loading

nelson-liu commented Sep 4, 2016 •

edited

Loading

raghavrv Sep 6, 2016 •

edited

Loading

amueller commented Sep 6, 2016 •

edited

Loading

nelson-liu commented Sep 6, 2016 •

edited

Loading