Skip to content

[WIP] Earth (MARS) #2285

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 23 commits into from
Closed

[WIP] Earth (MARS) #2285

wants to merge 23 commits into from

Conversation

jcrudy
Copy link

@jcrudy jcrudy commented Jul 28, 2013

This pull request adds multivariate adaptive regression splines to scikit-learn.

  1. Adds the sklearn.earth module, which provides the Earth class
  2. Adds an earth page to the documentation
  3. Adds some earth examples

It probably could still use more work, but I'm looking for feedback on how best to proceed. Currently my plan is to focus on documentation going forward.

@glouppe
Copy link
Contributor

glouppe commented Jul 28, 2013

Thanks! I believe this is a great addition, but it will indeed need some work to be to our standards.

Also, don't expect any review right now - we are in the middle of releasing the new 0.14 version. Everyone is busy at it :)

(Note: don't hesitate to ping us again when 0.14 is released, to have a deeper look at this.)

@GaelVaroquaux
Copy link
Member

I haven't had time to look at this, but I am very very excited about
this. Thanks heaps

@amueller
Copy link
Member

same here ;)

@jcrudy
Copy link
Author

jcrudy commented Jul 29, 2013

Thanks, guys:). I'm going to be very busy for the next two weeks anyway because I'm moving. After I'm settled I'll ping this thread again and probably put some more work into documentation.

@jcrudy
Copy link
Author

jcrudy commented Aug 24, 2013

I'm going to have some time to work on this soon. Does anyone have time to give it a look and point me in the right direction? If not I will put time into documentation, but it would be great to get some feedback to move the code forward as well.

@glouppe
Copy link
Contributor

glouppe commented Aug 26, 2013

This PR is really huge! Can you point the most important parts for which you need a initial review? Also, can you explain us in a few words what is each Cython file used for? (to help us have a good overview of your code)

@jcrudy
Copy link
Author

jcrudy commented Aug 26, 2013

The file that's most likely needs attention is earth.py. It contains the Earth class, which is the only class that users are expected to interact with directly. I have tried to make it conform to the sklearn interface standards, but I am not sure how successful I was. My main question is whether I need to change anything to make it more compatible with the rest of sklearn. The pyx files contain the following stuff:

_basis.pyx : classes that represent MARS basis functions and sets of basis functions
_forward.pyx : implements the MARS forward pass
_pruning.pyx : implements the MARS pruning pass
_record.pyx : classes for recording and reporting on the execution of the MARS algorithm (how many iterations were performed, what was the error after each iteration, what were the stopping conditions, etc.)
_util.pyx : functions that don't fit well anywhere else but are used repeatedly throughout the code

Perhaps one place I should focus is on commenting the parts of the code that are not user-facing. The following other things might be issues:

  1. When I build the documentation, my examples don't display as nicely as the ones already in sklearn. I think I am missing something about how examples are supposed to be added.
  2. My version is not as fast as the R package in most cases. It can be made faster with some work (and at some point I want to make a multicore version for my own education), but I'm not sure how much of a priority speed should be. See the pyearth_vs_earth.py file (which I now realize needs to be renamed) for a comparison.
  3. I have some public functionality on the Earth class not found in other types of sklearn models. I find it useful, but is its usefulness worth the heterogeneity? For example, the Earth class uses column names from certain types of data structures (such as pandas DataFrames) in a way that I don't think other sklearn models do, and reports information on the fitting process in a non-standard way. It also allows you to run different parts of the fitting process (forward pass, pruning pass, linear fit) independently.

@jcrudy
Copy link
Author

jcrudy commented Aug 26, 2013

If you need a reference for MARS, wikipedia has a great general description of the algorithm:

http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines

@ogrisel
Copy link
Member

ogrisel commented Aug 27, 2013

I don't have time to have look tonight but you can start here:

http://scikit-learn.org/stable/developers/index.html#contributing-code

Also as the official name of the algorithm is MARS I think the name of the classes in scikit-learn should be one of:

  • MARS (but this is an acronym...)
  • MARSRegressor
  • MultiVariateAdaptiveRegressionSplines (maybe too long)
  • MultiVariateAdaptiveSplineRegressor (descriptive name, almost as long but more in line with the *Regressor convention already often used in scikit-learn)

I would also remove all references to "earth" and replace them by "mars" in module names, function names, variable names and so on.

We should also not commit images in the doc but replace them with plot automatically generated from the examples when possible. For instance, the plot in this rendered paragraph:

http://scikit-learn.org/stable/modules/ensemble.html#extremely-randomized-trees

is rendered by this example:

https://github.com/scikit-learn/scikit-learn/blob/master/examples/ensemble/plot_forest_iris.py

which is included in the reST source:

https://raw.github.com/scikit-learn/scikit-learn/master/doc/modules/ensemble.rst





Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please never insert more than 2 consecutive blank lines in reST or python code.

@jcrudy
Copy link
Author

jcrudy commented Aug 30, 2013

Great feedback, guys. Keep it coming. Regarding the name, the problem with MARS is that, as I understand the situation, it is under copyright (only based on a comment on CRAN by Steve Milborrow: http://cran.r-project.org/web/packages/earth/index.html). I don't know if that matters for the name of a class within a larger software project like scikit-learn. Perhaps someone has more knowledge of copyright law than I do. Given that information, if all of the core developers are okay with MARSRegressor then so am I. If anyone is worried about copyrights, then I think MultiVariateAdaptiveSplineRegressor might be a good choice.

@agramfort
Copy link
Member

agramfort commented Mar 4, 2015 via email

@jcrudy
Copy link
Author

jcrudy commented Mar 9, 2015

Great, thank you @agramfort. A clarification: what do you mean by checking import orders? @mehdidc, is there any part of that you're interested in taking on?

As one of these involves removing some functionality (avoiding public methods other than fit, predict, and transform), I would suggest that at least that change should be done as part of integration into sklearn. The rest can be done directly in py-earth.

@mehdidc
Copy link
Contributor

mehdidc commented Mar 9, 2015

@jcrudy : In addition to what He suggested above, @agramfort gave a good todo list here : #4341, I am working on it.

@devashishd12
Copy link
Contributor

Hey everyone, I am planning to take this project of integrating pyearth into sklearn forward. I'll create a new branch, make a new sklearn/additive, pull all the py-earth code in it and work on that. But on the other hand, to integrate into sklearn later I'll have to merge into the files such as classes.rst so I was thinking of manually adding into these files and making incremental commits so that the review process can be faster. Should I go ahead with the second method?

@mblondel
Copy link
Member

We recently created a new organization called scikit-learn-contrib whose goal is to host scikit-learn compatible projects:
https://github.com/scikit-learn-contrib

This is very much a work in progress but our goal is to gather nice projects in the scikit-learn ecosystem. We just created it so only lightning is there but more projects will be added in the future.

We are thinking py-earth could possibly be moved there. The advantage is that py-earth would become more collaborative. When lightning was hosted in my account I was a bottleneck since I was the only one who could merge PRs. But now others have joined and bugs can be fixed more quickly.

Earth / Mars is sufficiently mature to warrant inclusion in scikit-learn but moving py-earth to scikit-learn-contrib would probably involve much less work.

@agramfort
Copy link
Member

agramfort commented Feb 26, 2016 via email

@devashishd12
Copy link
Contributor

So probably one of the first things we should do is add a description of this in the scikit learn readme so that the users get to know about it.

@agramfort
Copy link
Member

agramfort commented Feb 26, 2016 via email

@mblondel
Copy link
Member

Yes it's too early to advertise it. I just informed you so you don't loose time.

@devashishd12
Copy link
Contributor

@agramfort just wanted to ask one more thing, are the remaining projects of adding GAM (and for LSS), SpAM and LISO to sklearn still of any interest?

@agramfort
Copy link
Member

agramfort commented Feb 26, 2016 via email

@mblondel
Copy link
Member

mblondel commented Mar 7, 2016

@jcrudy This is very much a work in progress but I have uploaded the basic workflow for inclusion in scikit-learn-contrib. If you're interested in moving py-earth there, we can provide some assistance.

@jcrudy
Copy link
Author

jcrudy commented Mar 8, 2016

@mblondel Yes, I would like to do that. I think it's a great idea. I'm probably too busy to look at it much for the next couple weeks, but I will as soon as I can, and I'll start with the workflow you uploaded. Is there anything else I should keep in mind?

@agramfort
Copy link
Member

agramfort commented Mar 8, 2016 via email

@mehdidc
Copy link
Contributor

mehdidc commented Mar 8, 2016

@agramfort Yes ! @jcrudy @mblondel Ok so I first go through the requirements then add a request in scikit-learn-contrib

@jcrudy
Copy link
Author

jcrudy commented Mar 9, 2016

@mehdidc There is a branch of py-earth called rewrite_forward_pass that I've been working on. I'm planning to merge it into master in the not too distant future, but it isn't quite ready yet. It includes some bug fixes to missing data functionality and allows arbitrary (non-separable) sample weights. It also refactors/rewrites the forward pass in a way that is less messy, and uses householder reflections instead of gram-schmidt for orthogonalization of the B matrix (the main reason I did it, as I was having stability problems on a project). Just want you to be aware so you don't repeat work. If you find major changes are needed to pass the contrib requirements, let's talk before you do them. (Because it might be better to merge first.) Also, let me know if I can answer any questions or be of assistance in some other way.

@mehdidc
Copy link
Contributor

mehdidc commented Mar 14, 2016

@jcrudy Great ! it will be nice to have arbitrary sample weights. Ok I see, thanks I will let you know

@springcoil
Copy link

I see that @jcrudy made some progress with Py-Earth. Which I think make it even more appetising. I know there are lots of open PR's for Scikitlearn, but I think MARS should get in there. Has anyone made a todo list for what needs to be done for this port?

@mblondel
Copy link
Member

If it's ok with @jcrudy and @mehdidc, I think we should transfer [*] py-earth to the contrib organization and collaboratively work on meeting the requirements there. We need a project base so as to make the contrib more attractive. Currently, with lighting only, this feels a bit sad.

[*] Transferring rather than forking is important so as to not loose branches and stars.

@jcrudy
Copy link
Author

jcrudy commented Mar 23, 2016

@mblondel That's fine with me. @mehdidc, will that mess up anything you're doing?

@mehdidc
Copy link
Contributor

mehdidc commented Mar 23, 2016

@mblondel @jcrudy No, it's fine with me too

@jcrudy
Copy link
Author

jcrudy commented Mar 23, 2016

@mblondel @mehdidc Okay, I'll transfer it over. Exciting times!

@jcrudy
Copy link
Author

jcrudy commented Mar 23, 2016

@mblondel Do I need to be added to the contrib organization in order to make the transfer?

https://help.github.com/articles/transferring-a-repository/#transferring-from-a-user-to-an-organization

@agramfort
Copy link
Member

agramfort commented Mar 23, 2016 via email

@jcrudy
Copy link
Author

jcrudy commented Mar 23, 2016

@agramfort @mehdidc @mblondel Transfer complete. The py-earth project is now a part of scikit-learn-contrib!

How is progress on making it actually compliant with the requirements? It seems it might be actually there, since @mehdidc has addressed the problems it was having with check_estimator. If that is complete, I think the next step is to work toward a release.

@mblondel
Copy link
Member

@jcrudy Thanks a lot! You might also want to fork the project in your github. This way, the old project URL is still valid.

@mblondel mblondel closed this Mar 24, 2016
@mblondel
Copy link
Member

Closed this pull-request. Let's create an issue directly in the project to follow the progress of meeting the requirements.

@mblondel
Copy link
Member

I added a py-earth team. @jcrudy you should be able to add collaborators to the team.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.