Skip to content

[MRG] _tree.pyx documentation additions #5010

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 31, 2015

Conversation

jmschrei
Copy link
Member

This is the documentation aspect of #5007 alone. No functional changes to the code are made, and it passes all nosetests. This documentation covers all criterion, and the dense splitters, and includes both docstrings and inline comments.

@amueller
Copy link
Member

This looks great! ping @glouppe @arjoly

@@ -157,7 +230,18 @@ cdef class ClassificationCriterion(Criterion):

def __cinit__(self, SIZE_t n_outputs,
np.ndarray[SIZE_t, ndim=1] n_classes):
# Default values
"""
Initialize attributes for this classifier, automatically calling the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

classifier -> criterion

@glouppe
Copy link
Contributor

glouppe commented Jul 21, 2015

General comment: we should agree on a common terminology regarding outputs. So far the code was mentioning "targets" or "outputs". Now you have added/replaced with "responses". I dont know what is best but we should try to limit ourselves -- it is otherwise more confusing than helpful.

total += entropy
label_count_total += label_count_stride

# Return the total entropy over all responses divided by the number of
# responses.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, not sure if the added verbosity is so helpful

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this code is quite straight forward. You might want to rename total to total_entropy if you want to make it fully explicit. No need for inline comments in this function.

@glouppe
Copy link
Contributor

glouppe commented Jul 21, 2015

Thanks for this PR! In general, I agree with your proposal, in particular at places where docstrings were missing. However, I am not totally happy with the inline comments you added. It is sometimes adding too much verbosity for nothing. In my opinion, inline comments should be used with parsimony, at places where the code may be difficult to understand by itself -- adding too many of them may have an overall negative impact. I dont know what is others opinion on this?

@jmschrei
Copy link
Member Author

I can tone down the inline comments, especially in places where they were not helpful. I think 'targets' is the right way to talk about y, so I can change that as well.

@ogrisel
Copy link
Member

ogrisel commented Jul 21, 2015

In my opinion, inline comments should be used with parsimony, at places where the code may be difficult to understand by itself -- adding too many of them may have an overall negative impact. I dont know what is others opinion on this?

I agree in general. When the "what" is not clear from the Python code, then generally renaming the local variable to have more explicit names is a better solution than adding an inline comment.

When the "why" of a particular line is not clear, then an inline comment is the good solution.

@@ -391,29 +548,49 @@ cdef class Entropy(ClassificationCriterion):

cdef double entropy = 0.0
cdef double total = 0.0
cdef double tmp
cdef double pmk
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think probability_mass_k is more explicit. The statements are short so we can afford a long variable name here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively density_k or frequency_k are fine as well.

@ogrisel
Copy link
Member

ogrisel commented Jul 21, 2015

General remark for docstrings: it is recommend to follow PEP 257 to be nice with developer tools like IDE tooltips and sphinx API doc. In particular that means following the structure:

def my_function(argument_1, argument_2):
    """One-line short summary of the function in less than 80 cols

    Extended paragraph that give a detailed description of what the function
    does. Note that the extended paragraph is separated from the one line
    short summary by a blank line.

    ... (parameters doc and so on)
    """
    # body of the function here

The input data which is being split
start: int64
The first sample to be used on this leaf
end: int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use SIZE_t as type definition.

if tmp > 0.0:
tmp /= weighted_n_right
entropy_right -= tmp * log(tmp)
# Calculate this classes entropy on the left node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this

@glouppe
Copy link
Contributor

glouppe commented Jul 24, 2015

Did my round of review, should be good to go once they are fixed along with Olivier's.

pass

cdef void reset(self) nogil:
"""Reset the criterion at pos=start."""
"""Placeholder for a method which will reset the criterion at
pos=start.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to add the fact that this is a place holder, I would say


"""Reset the criterion at pos=start

This methods must be implemented by the sub-class.
"""

@arjoly
Copy link
Member

arjoly commented Jul 24, 2015

Instead of saying that the method is a placeholder, I would find more informative to say that the method
must be implemented by the sub-class.

safe_realloc(&self.n_classes, n_outputs)

cdef SIZE_t k = 0
cdef SIZE_t label_count_stride = 0

# For each target, set the number of unique classes in that target,
# and also set the stride for that target
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also set the stride for that target-> and also compute the maximal stride of all targets

@glouppe glouppe mentioned this pull request Jul 28, 2015
30 tasks
@jmschrei
Copy link
Member Author

All suggestions should be incorporated. Thank you!

@jmschrei
Copy link
Member Author

Apparently this commit broke Gaussian processes! But only for Python 3.4 on Travis. I always knew that Gaussian processes were secretly opposed to documentation.

@glouppe
Copy link
Contributor

glouppe commented Jul 29, 2015

Yes, this is due to #5045. Dont worry about it :)

@jmschrei
Copy link
Member Author

_gradient_boosting.c decided to tag along for the ride, so I accidentally pushed it too, but I went back and fixed the issue (I believe). All changes should be incorporated now.

@glouppe
Copy link
Contributor

glouppe commented Jul 31, 2015

Merging this and making a few minor changes. Thanks for your patience!

glouppe added a commit that referenced this pull request Jul 31, 2015
[MRG] _tree.pyx documentation additions
@glouppe glouppe merged commit d857349 into scikit-learn:master Jul 31, 2015
@arjoly
Copy link
Member

arjoly commented Jul 31, 2015

Thanks @jmschrei !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants