-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG] Added tips for reading the code base #12874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
43d2351
d1c6e7b
c0e934d
d4a04fd
6e554a9
3a97f56
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1395,3 +1395,64 @@ that implement common linear model patterns. | |
|
||
The :mod:`sklearn.utils.multiclass` module contains useful functions | ||
for working with multiclass and multilabel problems. | ||
|
||
.. _reading-code: | ||
|
||
Reading the existing code base | ||
============================== | ||
|
||
Reading and digesting an existing code base is always a difficult exercise | ||
that takes time and experience to master. Even though we try to write simple | ||
code in general, understanding the code can seem overwhelming at first, | ||
given the sheer size of the project. Here is a list of tips that may help | ||
make this task easier and faster (in no particular order). | ||
|
||
- Get acquainted with the :ref:`api_overview`: understand what :term:`fit`, | ||
:term:`predict`, :term:`transform`, etc. are used for. | ||
- Before diving into reading the code of a function / class, go through the | ||
docstrings first and try to get an idea of what each parameter / attribute | ||
is doing. It may also help to stop a minute and think *how would I do this | ||
myself if I had to?* | ||
- The trickiest thing is often to identify which portions of the code are | ||
relevant, and which are not. In scikit-learn **a lot** of input checking | ||
is performed, especially at the beginning of the :term:`fit` methods. | ||
Sometimes, only a very small portion of the code is doing the actual job. | ||
For example looking at the ``fit()`` method of | ||
:class:`sklearn.linear_model.LinearRegression`, what you're looking for | ||
might just be the call the ``scipy.linalg.lstsq``, but it is buried into | ||
multiple lines of input checking and the handling of different kinds of | ||
parameters. | ||
- Due to the use of `Inheritance | ||
<https://en.wikipedia.org/wiki/Inheritance_(object-oriented_programming)>`_, | ||
some methods may be implemented in parent classes. All estimators inherit | ||
at least from :class:`BaseEstimator <sklearn.base.BaseEstimator>`, and | ||
from a ``Mixin`` class (e.g. :class:`ClassifierMixin | ||
<sklearn.base.ClassifierMixin>`) that enables default behaviour depending | ||
on the nature of the estimator (classifier, regressor, transformer, etc.). | ||
- Sometimes, reading the tests for a given function will give you an idea of | ||
what its intended purpose is. You can use ``git grep`` (see below) to find | ||
all the tests written for a function. Most tests for a specific | ||
function/class are placed under the ``tests/`` folder of the module | ||
- You'll often see code looking like this: | ||
``out = Parallel(...)(delayed(some_function)(param) for param in | ||
some_iterable)``. This runs ``some_function`` in parallel using `Joblib | ||
<https://joblib.readthedocs.io/>`_. ``out`` is then an iterable containing | ||
the values returned by ``some_function`` for each call. | ||
- We use `Cython <https://cython.org/>`_ to write fast code. Cython code is | ||
located in ``.pyx`` and ``.pxd`` files. Cython code has a more C-like | ||
flavor: we use pointers, perform manual memory allocation, etc. Having | ||
some minimal experience in C / C++ is pretty much mandatory here. | ||
- Master your tools. | ||
|
||
- With such a big project, being efficient with your favorite editor or | ||
IDE goes a long way towards digesting the code base. Being able to quickly | ||
jump (or *peek*) to a function/class/attribute definition helps a lot. | ||
So does being able to quickly see where a given name is used in a file. | ||
- `git <https://git-scm.com/book/en>`_ also has some built-in killer | ||
features. It is often useful to understand how a file changed over time, | ||
using e.g. ``git blame`` (`manual | ||
<https://git-scm.com/docs/git-blame>`_). This can also be done directly | ||
on GitHub. ``git grep`` (`examples | ||
<https://git-scm.com/docs/git-grep#_examples>`_) is also extremely | ||
useful to see every occurrence of a pattern (e.g. a function call or a | ||
variable) in the code base. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe add that git grep only searches files tracked by git by default compared to grep. Unless scikit-learn doc is not considered to be the place to say this. It's just that I have been using grep for a while before realizing that git grep might be better for this purpose :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You sure? I feel it's not really necessary since it's quiet unlikely that the intended audience of this section will ever need to look for patterns in untracked files, and if they do they can still refer to the doc in the link. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay leave it as it is. |
Uh oh!
There was an error while loading. Please reload this page.