[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided #7301
Conversation
if sample_weight is None:
    min_weight_leaf = int(ceil(self.min_weight_fraction_leaf *
                               n_samples))
    min_samples_leaf = max(min_samples_leaf, min_weight_leaf)
I'm not sure whether it's better to have this block as it is, or

min_samples_leaf = max(min_samples_leaf, int(ceil(self.min_weight_fraction_leaf * n_samples)))
min_weight_leaf = 0  # (seeing as min_weight_leaf is unnecessary when sample_weight is None)

or to simply omit the min_samples_leaf = max(...) and just set min_weight_leaf, since both min_weight_leaf and min_samples_leaf are provided to the Criterion object.
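For concreteness, a minimal sketch of that third option (assuming the surrounding fit() context: self, n_samples and sample_weight as in the snippet above; np is NumPy):

import numpy as np

if sample_weight is None:
    # every sample implicitly has weight 1, so the fraction of the
    # total weight reduces to a fraction of the sample count
    min_weight_leaf = self.min_weight_fraction_leaf * n_samples
else:
    min_weight_leaf = (self.min_weight_fraction_leaf *
                       np.sum(sample_weight))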
I don't understand what you mean by:
or simply omit the min_samples_leaf = max(...) and just set min_weight_leaf since both min_weight_leaf and min_samples_leaf are provided to the Criterion object.
But I think I am fine with the first two suggestions.
ok, i'll just leave it as it is.
The max seems odd to me. Why not just remove it?
Yeah, that's what I was trying to propose in the last sentence of my original comment (was probably unclear, sorry). I haven't tested it but it should work... And it seems a bit more logical too.
I think I'm for that.
yeah, it seems like removing the call to max works as well. The max here is implicit because it's implemented in Criterion anyway.
LGTM.
You don't currently test the interaction between min_samples_leaf and min_weight_fraction_leaf.
Good idea, their interaction is simply that the max is the bound, right?
Maybe test that, with no sample weights, the two parameters have the same effect?
@amueller so I'm actually of the opinion now that they shouldn't have the same effect... I pushed a test that checks if they are the same; it crashes and burns, but in a justified manner, I think. First, I changed the weight calculation formula (see the commits). So onto why they shouldn't have the same effect:

Say that we fit on a dataset with 5 samples, and provide no sample_weight: min_weight_leaf becomes a (possibly fractional) fraction of the total weight, whereas min_samples_leaf is an integer count of samples, which is very different. This approach does have the downside that (in this case) setting the two parameters to nominally corresponding values does not necessarily build the same tree. I think we should take this approach, but I feel like a warning might be good if min_weight_fraction_leaf is set without sample_weight.

edit: there are cases where the two parameters have the same effect, but they do not always.
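As a hypothetical illustration of the arithmetic involved (the 0.3 fraction is an assumed example value, not from the original comment):

from math import ceil

n_samples = 5                   # the unweighted dataset from the example
min_weight_fraction_leaf = 0.3  # assumed example value

# weight-based threshold: with implicit unit weights this is a
# fraction of the sample count, and may be fractional
min_weight_leaf = min_weight_fraction_leaf * n_samples              # 1.5

# count-based threshold: necessarily a whole number of samples
min_samples_leaf = int(ceil(min_weight_fraction_leaf * n_samples))  # 2

With unit weights the two bounds happen to admit the same leaves here; the differences show up once real-valued weights are involved.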
There are probably some hand-crafted values that could yield equal values... I suppose that those could serve as a useful test? Not sure if it is necessary, though. What do you guys think?
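A sketch of what such a test could look like, with values hand-crafted so the two bounds coincide (0.2 of 10 unweighted samples gives a 2-sample leaf bound; comparing tree_.node_count is just one illustrative choice of structural check):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(10, 3)
y = rng.randint(0, 2, size=10)

# 0.2 * 10 samples = 2, so the weight bound matches min_samples_leaf=2
tree_count = DecisionTreeClassifier(min_samples_leaf=2,
                                    random_state=0).fit(X, y)
tree_weight = DecisionTreeClassifier(min_weight_fraction_leaf=0.2,
                                     random_state=0).fit(X, y)

assert tree_count.tree_.node_count == tree_weight.tree_.node_count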
Since the two parameters don't always give the same effect on uniformly weighted data, I'm not sure a general equality test makes sense. It does seem reasonable for the two grown trees to be equal under this scenario, though.
this is ready to be looked at again; removing the +1 because there have been significant changes to the tests.
I am sorry but I don't get what you are trying to say.
@@ -796,7 +796,8 @@ class RandomForestClassifier(ForestClassifier):
     min_weight_fraction_leaf : float, optional (default=0.)
         The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        leaf node where weights are determined by ``sample_weight`` provided
I'd write it as:

    The minimum weighted fraction of the sum total of weights (of all
    the input samples) required to be at a leaf node.
@jnothman and I discussed this a bit in #7338 (specifically #7338 (comment)), could you check it out?
I believe it's because weights can be split, but samples cannot.
The min_weight_leaf condition shouldn't be there. It's plain wrong. The min_samples_leaf condition is only an optimisation, and the same kind of optimisation can't be easily achieved for weight. The real work happens in the splitter.
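To illustrate that division of labour, a rough sketch of the per-split check the splitter applies (simplified Python pseudocode; the actual implementation is in Cython and these names are approximations):

def split_is_acceptable(n_left, n_right,
                        weighted_n_left, weighted_n_right,
                        min_samples_leaf, min_weight_leaf):
    # cheap integer comparison: the optimisation mentioned above
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return False
    # the weight condition, which does the real work for weighted data
    if (weighted_n_left < min_weight_leaf or
            weighted_n_right < min_weight_leaf):
        return False
    return True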
I'm tempted to follow the second option at #6945 and either raise an error or a warning if min_weight_fraction_leaf is set but sample_weight is not provided.
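A minimal sketch of that option (hypothetical placement inside fit(); the message text is invented):

import warnings

if sample_weight is None and self.min_weight_fraction_leaf > 0.:
    warnings.warn("min_weight_fraction_leaf is set but sample_weight was "
                  "not provided; all samples are assumed to have equal "
                  "weight.")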
But I'm okay with the "assume sample_weight=1" solution too?
That solution would definitely be far easier to implement, but I think that assuming sample_weight=1 is more intuitive. However, it's equally counterintuitive that the results for min_weight_fraction_leaf and a nominally corresponding min_samples_leaf can differ on unweighted data.
No, the difference is not a bug and should not be fixed.
I'm happy to accept what you implemented too.
ok
In that vein, I've addressed @raghavrv's comments. Perhaps this would be good for 0.18?
-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node.
It is worth noting: "Samples have equal weight when sample_weight is not provided, but min_samples_leaf is more efficient."
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided, but
I think we should now drop "but min_samples_leaf is more efficient" if we've realised we can use the 2 * min_weight_leaf change.
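A hedged sketch of the pruning that change allows (assumed names): since each child must carry at least min_weight_leaf, a node lighter than twice that bound can be declared a leaf without attempting any split.

def node_can_be_split(weighted_n_node_samples, min_weight_leaf):
    # both children need >= min_weight_leaf, so a splittable node
    # needs at least 2 * min_weight_leaf of total weight
    return weighted_n_node_samples >= 2.0 * min_weight_leaf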
This actually has @ogrisel's +1 above as well. So it's got +2, assuming nothing substantial has changed since then. Also LGTM.
Pending that minor change, I should say.
sorry for getting back to this so late, been a bit busy recently. I pushed the changes to the docstrings that were requested; is there anything else needed?
A what's new entry is needed. Please put under 0.19, as I think this has missed 0.18.
(force-pushed from 209f967 to cf72fd4)
@jnothman thanks for the reminder. I put this under "Enhancements"; do you think "Bug Fixes" would be better?
If you state "previously it was silently ignored", then it belongs in bug fixes :)
@jnothman good point, that info is important. added and moved to bugfixes.
thanks @nelson-liu @amueller, I've currently assumed this is not for 0.18. Feel free to backport and move the what's new if you disagree.
Fix min_weight_fraction_leaf to work when sample_weights are not provided (scikit-learn#7301)
* fix min_weight_fraction_leaf when sample_weights is None
* fix flake8 error
* remove added newline and unnecessary assignment
* remove max bc it's implemented in cython and add interaction test
* edit weight calculation formula and add test to check equality
* remove test that sees if two parameter build the same tree
* reword min_weight_fraction_leaf docstring
* clarify uniform weight in forest docstrings
* update docstrings for all classes
* add what's new entry
* move whatsnew entry to bug fixes and explain previous behavior
Reference Issue
Fixes #6945, previous PR at #6947