Skip to content

[MRG] Added tips for reading the code base #12874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Dec 31, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ Quick links to:
* [Submitting a bug report or feature request](http://scikit-learn.org/dev/developers/contributing.html#submitting-a-bug-report-or-a-feature-request)
* [Contributing code](http://scikit-learn.org/dev/developers/contributing.html#contributing-code)
* [Coding guidelines](http://scikit-learn.org/dev/developers/contributing.html#coding-guidelines)
* [Tips to read current code](http://scikit-learn.org/dev/developers/contributing.html#reading-code)

Code of Conduct
---------------
Expand Down
61 changes: 61 additions & 0 deletions doc/developers/contributing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1395,3 +1395,64 @@ that implement common linear model patterns.

The :mod:`sklearn.utils.multiclass` module contains useful functions
for working with multiclass and multilabel problems.

.. _reading-code:

Reading the existing code base
==============================

Reading and digesting an existing code base is always a difficult exercise
that takes time and experience to master. Even though we try to write simple
code in general, understanding the code can seem overwhelming at first,
given the sheer size of the project. Here is a list of tips that may help
make this task easier and faster (in no particular order).

- Get acquainted with the :ref:`api_overview`: understand what :term:`fit`,
:term:`predict`, :term:`transform`, etc. are used for.
- Before diving into reading the code of a function / class, go through the
docstrings first and try to get an idea of what each parameter / attribute
is doing. It may also help to stop a minute and think *how would I do this
myself if I had to?*
- The trickiest thing is often to identify which portions of the code are
relevant, and which are not. In scikit-learn **a lot** of input checking
is performed, especially at the beginning of the :term:`fit` methods.
Sometimes, only a very small portion of the code is doing the actual job.
For example looking at the ``fit()`` method of
:class:`sklearn.linear_model.LinearRegression`, what you're looking for
might just be the call the ``scipy.linalg.lstsq``, but it is buried into
multiple lines of input checking and the handling of different kinds of
parameters.
- Due to the use of `Inheritance
<https://en.wikipedia.org/wiki/Inheritance_(object-oriented_programming)>`_,
some methods may be implemented in parent classes. All estimators inherit
at least from :class:`BaseEstimator <sklearn.base.BaseEstimator>`, and
from a ``Mixin`` class (e.g. :class:`ClassifierMixin
<sklearn.base.ClassifierMixin>`) that enables default behaviour depending
on the nature of the estimator (classifier, regressor, transformer, etc.).
- Sometimes, reading the tests for a given function will give you an idea of
what its intended purpose is. You can use ``git grep`` (see below) to find
all the tests written for a function. Most tests for a specific
function/class are placed under the ``tests/`` folder of the module
- You'll often see code looking like this:
``out = Parallel(...)(delayed(some_function)(param) for param in
some_iterable)``. This runs ``some_function`` in parallel using `Joblib
<https://joblib.readthedocs.io/>`_. ``out`` is then an iterable containing
the values returned by ``some_function`` for each call.
- We use `Cython <https://cython.org/>`_ to write fast code. Cython code is
located in ``.pyx`` and ``.pxd`` files. Cython code has a more C-like
flavor: we use pointers, perform manual memory allocation, etc. Having
some minimal experience in C / C++ is pretty much mandatory here.
- Master your tools.

- With such a big project, being efficient with your favorite editor or
IDE goes a long way towards digesting the code base. Being able to quickly
jump (or *peek*) to a function/class/attribute definition helps a lot.
So does being able to quickly see where a given name is used in a file.
- `git <https://git-scm.com/book/en>`_ also has some built-in killer
features. It is often useful to understand how a file changed over time,
using e.g. ``git blame`` (`manual
<https://git-scm.com/docs/git-blame>`_). This can also be done directly
on GitHub. ``git grep`` (`examples
<https://git-scm.com/docs/git-grep#_examples>`_) is also extremely
useful to see every occurrence of a pattern (e.g. a function call or a
variable) in the code base.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add that git grep only searches files tracked by git by default compared to grep. Unless scikit-learn doc is not considered to be the place to say this. It's just that I have been using grep for a while before realizing that git grep might be better for this purpose :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You sure? I feel it's not really necessary since it's quiet unlikely that the intended audience of this section will ever need to look for patterns in untracked files, and if they do they can still refer to the doc in the link.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay leave it as it is.