Modified BaseDecisionTree so that min_weight_fraction_leaf works when sample_weight is None and improved parameter description #6947

Conversation
-        if self.min_weight_fraction_leaf != 0. and sample_weight is not None:
+        if self.min_weight_fraction_leaf != 0.:
+            if sample_weight is None:
+                sample_weight = np.repeat(1., n_samples)
Firstly, we don't need to explicitly do this, since all we want is the sum of weights, i.e. the "total weight". So you could just do min_weight_leaf = self.min_weight_fraction_leaf * n_samples.

Secondly, I see why you might want this, but it now replicates the function of min_samples_leaf. So if this is the way we go, i.e. rather than a warning, you can actually just use min_samples_leaf = max(min_samples_leaf, int(ceil(self.min_weight_fraction_leaf * n_samples))).

In terms of testing, you should look at the existing tests for min_samples_leaf and check that min_weight_fraction_leaf offers the same behaviour.
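For illustration, here is a runnable sketch of the two options suggested above, with toy values standing in for what fit() would already have at this point (the concrete numbers are made up purely for exposition):

```python
from math import ceil

# Stand-ins for values that would already exist inside fit()
# (hypothetical numbers, purely for illustration).
n_samples = 200
min_samples_leaf = 1
min_weight_fraction_leaf = 0.1

# Option 1: no need to materialise a uniform sample_weight array; the
# total weight of n equally weighted samples is simply n.
min_weight_leaf = min_weight_fraction_leaf * n_samples  # -> 20.0

# Option 2: fold the constraint into min_samples_leaf instead, since a
# fraction of equally weighted samples is just a sample count.
min_samples_leaf = max(min_samples_leaf,
                       int(ceil(min_weight_fraction_leaf * n_samples)))  # -> 20
```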
When sample_weight is None, samples are equally weighted in the Cython code. Does this addition really change anything? I believe the behaviours should be identical. Do you have counter-examples where the trees that are built are actually different?
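One hypothetical way to look for such counter-examples (not part of this PR): fit the same estimator with and without an explicit uniform sample_weight and compare the resulting tree structures. On a scikit-learn version affected by the issue described here, the two trees below would be expected to differ, because the unweighted call silently ignores min_weight_fraction_leaf.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Same constraint, with and without an explicit uniform sample_weight.
tree_weighted = DecisionTreeClassifier(
    min_weight_fraction_leaf=0.1, random_state=0
).fit(X, y, sample_weight=np.ones(X.shape[0]))
tree_unweighted = DecisionTreeClassifier(
    min_weight_fraction_leaf=0.1, random_state=0
).fit(X, y)

# If min_weight_fraction_leaf is honoured in both cases, the two trees
# should have the same structure.
print(tree_weighted.tree_.node_count, tree_unweighted.tree_.node_count)
```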
If sample_weight is None then min_weight_fraction_leaf has no effect. I think that's quite clear from the else case here.
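For reference, the block being discussed reads roughly like this inside BaseDecisionTree.fit (paraphrased; treat it as a sketch rather than an exact quote of the code at the time):

```python
# Set min_weight_leaf from min_weight_fraction_leaf
if self.min_weight_fraction_leaf != 0. and sample_weight is not None:
    min_weight_leaf = (self.min_weight_fraction_leaf *
                       np.sum(sample_weight))
else:
    # With sample_weight=None we fall into this branch, so the
    # min_weight_fraction_leaf constraint is effectively disabled.
    min_weight_leaf = 0.
```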
Oh, my bad, I was not aware of that else clause. Indeed. Ouch.
Could we fix this in the Cython code?
As for advice over Skype and the like, you might find the scikit-learn channel on Gitter helpful.
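For what it's worth, one Python-level alternative (a sketch only, under the assumption that the Cython splitter needs no change and that the threshold computation is the only problem) would be to derive min_weight_leaf directly from n_samples when no weights are given:

```python
# Set min_weight_leaf from min_weight_fraction_leaf without materialising
# a uniform sample_weight array: with equal weights the total weight is
# just n_samples.
if sample_weight is None:
    min_weight_leaf = self.min_weight_fraction_leaf * n_samples
else:
    min_weight_leaf = self.min_weight_fraction_leaf * np.sum(sample_weight)
```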
@@ -297,9 +297,13 @@ def fit(self, X, y, sample_weight=None, check_input=True,
             sample_weight = expanded_class_weight

         # Set min_weight_leaf from min_weight_fraction_leaf
-        if self.min_weight_fraction_leaf != 0. and sample_weight is not None:
+        if self.min_weight_fraction_leaf != 0.:
You don't actually need this if.
@ben519 do you intend to finish off your work on this?
@jnothman I definitely can't this week. I can give it a shot this weekend, but I'd be happier if someone more knowledgeable than me took this over.
If no one else is available, I can do it. I'll be traveling next week, but starting Sept 3 I'll have about a week free that I can use to fix this up. It might be a good test to see whether all that GSoC work with the tree code paid off.
@nelson-liu please go ahead and submit a PR!
Closing this in favor of #7301.
Reference Issue
Addresses #6945
Changes
min_weight_fraction_leaf should work even when sample_weight is not given (in which case all samples are assumed to have equal weight). I also tweaked the description of min_weight_fraction_leaf to make its purpose a bit clearer.
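To illustrate the intended equivalence (a usage sketch assuming the fix is in place): with unweighted, i.e. equally weighted, samples, requiring each leaf to hold at least 10% of the total weight should behave the same as requiring at least 20 of 200 samples per leaf.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# 10% of the total (uniform) weight of 200 samples corresponds to 20 samples.
by_fraction = DecisionTreeClassifier(min_weight_fraction_leaf=0.1,
                                     random_state=0).fit(X, y)
by_count = DecisionTreeClassifier(min_samples_leaf=20,
                                  random_state=0).fit(X, y)

# Both constraints should produce the same tree structure.
print(by_fraction.tree_.node_count, by_count.tree_.node_count)
```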
Other comments
I'm fairly new to open-source collaboration as well as to scikit-learn, and I think my changes have room for improvement. For example, if sample_weight is given as None by the user, I reset it to an array of all 1s; I'm guessing this is a bad solution, but I could use help figuring out where and how to implement the correct one. I'm also unsure of the best way to test my changes. I would really appreciate someone offering 5 or 10 minutes of their time via Skype or Google Hangouts so I can get started as a contributor and hopefully add more contributions in the future.