Skip to content

Kriging model class #14

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
29 commits merged into from
Nov 24, 2010
Merged

Kriging model class #14

29 commits merged into from
Nov 24, 2010

Conversation

dubourg
Copy link
Contributor

@dubourg dubourg commented Oct 31, 2010

Hello list,

I'd like to commit a new feature to the scikits-learn project.
I implemented a kriging model class which is able to perform both regression and probabilistic classification.
I will send an e-mail with a script that performs a demo.
Since I have no clue where to put my contribution I simply created a single
file named kriging.py with the class and other functions such as correlation
and regression models, and I added an import instruction to the main
init.py module of scikit.learn.

I hope you'll enjoy this contribution.

Vincent

dubourg added 3 commits November 1, 2010 00:44
I'd like to commit a new feature to the scikits-learn project.
I implemented a kriging model class which is able to perform both regression
and probabilistic classification.
I will send an e-mail with a script that performs a demo.
Since I have no clue where to put my contribution I simply created a single
file named kriging.py with the class and other functions such as correlation
and regression models, and I added an import instruction to the main
__init__.py module of scikit.learn.

I hope you'll enjoy this contribution.

Vincent
@dubourg
Copy link
Contributor Author

dubourg commented Nov 1, 2010

Hi all,

I am glad to read your reactions and answer some of them in the following:

  • I'll put my contribution in a kriging directory alike svm, but I'd need help to understand the structure a test subdirectory should have.
  • Regarding the PEP8 code format, I replaced all the tabs kate inserted for me so that my next push should be exempt of tabulations.
  • @agramfort: I am eager to see what you can do to improve the performance of this first implementation, because kriging is known to be quite expensive to evaluate, so that it would be great to make it faster!
  • @GaelVaroquaux: I switched to scipy's linalg. However I can't see how np.atleast_2d function would help me at the beginning of fit. But you'll eventually be able to change this as soon as I'll have pushed an RST page, tests and examples.
  • @mblondel: Then, how should I entitle the directory 'kriging', 'gaussian_processes', 'gp'?

Thanks for welcoming me and my contribution. For the sake of being honnest, this implementation is essentially inspired from the DACE toolbox (implemented in Matlab and Scilab, http://www2.imm.dtu.dk/~hbn/dace/) that I am using for my PhD. I just changed the score function into its opposite so that it matches my understanding of the scikits-learn coding guidelines : the higher the score, the better. By the way, should I put the DACE website as a reference in the RST page?

Finally, I'd like to add another precision about this implementation of kriging regarding the classification problem. Classification is only a postprocessing there, meaning you first need to solve a regression (interpolation) problem by providing the complete scalar ouptut of the experiment you attempt to surrogate. You can't specify binary labels as input -- or at least you can bu it won't be as efficient. Indeed, you use the Gaussian property of the prediction to perform probabilistic classification. I hope it will become clearer with the examples I implemented...

@ogrisel
Copy link
Member

ogrisel commented Nov 1, 2010

As for coding guidelines, also please follow http://www.python.org/dev/peps/pep-0257 (this is not just for Vincent but for everybody :)

@ogrisel
Copy link
Member

ogrisel commented Nov 1, 2010

Also when you say "inspired by DACE", I hope this is not a language matlab to numpy direct transliteration. Since SAVE license is not opensource hence would not allow us to accept your contribution (or at least we should ask the permission of the authors).

@dubourg
Copy link
Contributor Author

dubourg commented Nov 1, 2010

@ogrisel: Well I am afraid that this is nothing more than a "direct transliteration" for the moment, out of a few modifications in an attempt to match scikits-learn's philosophy (which is not that far from DACE's). I don't want you guys (or even me) to have any trouble with copyrights so please let me know if that's a problem.

I apologize if this has been nothing but a loss of time for you.

@ogrisel
Copy link
Member

ogrisel commented Nov 1, 2010

Ideally it would have been best to read the algorithm from the paper and come up with your own scipy / numpy implementation yourself without literally translating the matlab code. Since this is done I think to integrate your code in scikit-learn (which is BSD licensed) it would be better to explicitly ask (by email) the DACE authors whether they don't see any objection that we distribute your python implementation under the BSD license. Of course, if they accept they would get due credits with a reference to their paper and website in the documentation of the module (both in the source code and on the public scikit-learn HTML website).

@agramfort
Copy link
Member

I agree with Olivier

@ogrisel
Copy link
Member

ogrisel commented Nov 1, 2010

@dubourg I let you contact them and let us know on the mailing list what is their answer?

@dubourg
Copy link
Contributor Author

dubourg commented Nov 1, 2010

Sure, I have just sent an e-mail to them. I'll let you know as soon as I get a reply. I precised that scikit-learn is distributed under BSD license and gave them the link to the scikit-learn website.

@mblondel
Copy link
Member

mblondel commented Nov 2, 2010

@dubourg I think you can call the module "gaussian_processes" and your class "GaussianProcess".

from scikits.learn.gaussian_processes import GaussianProcess

Not sure if the module name should be singular or plural.

@agramfort
Copy link
Member

@dubourg: In case the DACE people do not agree with BSD licencing I think it would good for you to keep this code open in a separate github project while keeping the scikit-learn API. The code would then be limited to personal use.

@dubourg
Copy link
Contributor Author

dubourg commented Nov 8, 2010

Hi all,

I got response from Pr Nielsen, the corresponding author of the Matlab's kriging toolbox. It seems to me that he accepts this implementation provided I refer to the original Matlab toolbox. Here are his reply and my initial request :

Dear Vincent Dubourg,

You are very wellcome to make a Python version with due references of the DACE toolbox.
I can tell you that Jan Frydendall, jf@imm.dtu.dk, already has made his private Python version
of some of the toolbox, and before Christmas Oscar Borries and I will finish an extended and improved version of the Matlab implementation.

Sincerely,

Hans Bruun Nielsen

Vincent Dubourg wrote:

Dear Pr Nielsen,

I am planning to make a Python implementation of your Matlab kriging toolbox that would be a part of the scikit-learn project (http://scikit-learn.sourceforge.net/index.html).

In case you've never heard about Python before, it is an interpreted, general purpose high-level programming language. It is extensively used in many fields. It recently came up in the scientific community thanks to some large projects such as scipy/numpy (optimization, statistics, linear algebra) and matplotlib (visualization) which makes it a competitive alternative to Matlab.

More specifically, the scikit-learn project aims at gathering various methods for both supervised and unsupervised machine learning. It already implements support vector machine classifiers and regressors, generalized linear models (ordinary least squares, least angle regression, etc. ...) and artificial neural networks. The scikit-learn module is distributed under the BSD license. Of course, if you accept you would get due credits with a reference to your papers and website in the documentation of the module (both in the source code and on the public scikit-learn HTML website). In case you don't agree, just tell me and I'll simply forget this idea.

In any case, thank you (and your colleagues) for the Matlab toolbox which is both clear and computationally efficient,
Best regards,
Vincent Dubourg

Hans Bruun Nielsen

Informatics and Mathematical Modelling (IMM) Richard Petersens Plads
DTU, Building 321, room 009 DK 2800 Kongens Lyngby
E-mail: hbn@imm.dtu.dk Tel: +45 4525 3077 Fax: +45 4588 2673
http://www.imm.dtu.dk/~hbn


@agramfort
Copy link
Member

this is great news !

i think you're ready to add tests, documentation, examples and ask for a pull request :)

@ogrisel
Copy link
Member

ogrisel commented Nov 8, 2010

That sounds like a go to me :) It was better to ask explicitly. It's always a pain to spend days working on a piece of code only to discover later that there is a legal issue that would prevent you to distribute it.

@dubourg
Copy link
Contributor Author

dubourg commented Nov 8, 2010

I completely agree with the fact that it was better to ask.
I have started to write a doc page, based on the svm.rst file (is there any better editor than kate to edit such files???). Then I'll need to add some examples to the example root directory of course...
For the code however, could you give me any recommandation on the way I should organize my files? At least, does the following sound good to you?

scikits/learn/gpml/init.py # The main file where I would implement my present 'KrigingModel' class turned into a 'GaussianProcessModel' class so that mblondel's use case holds: "from scikits.learn.gpml import GaussianProcessModel as gpm"
scikits/learn/gpml/correlation_models.py
scikits/learn/gpml/regression_models.py

GPML stands for 'Gaussian Processes for Machine Learning' (named after the book by Williams & Rasmussen that you can download for free on the website of the authors).

@agramfort
Copy link
Member

I would go for:

-- gaussian_process |-- __init__.py |-- classification.py |-- regresssion.py -- tests
|-- test_classification.py
`-- test_regresssion.py

where classification.py contains GaussianProcessClassification
and regression.py contains GaussianProcessRegression
then in the init.py you import both models.
Try to keep explicit naming for classes and variables.

Then you can add a 1D example for regression model
in examples/gaussian_process/plot_gaussian_process_regression.py

ok with that folks?

@mblondel
Copy link
Member

mblondel commented Nov 9, 2010

+1!

dubourg added 4 commits November 13, 2010 23:17
…l directory. The module implement a class named GaussianProcessModel. I also add doc, examples and tests (involving a coupling with the cross_val module).
…. I just don't know how to involve this test within the whole scikit testing procedure (nosetests). Also add a modification of the TOC in doc.
@dubourg
Copy link
Contributor Author

dubourg commented Nov 13, 2010

Hi list,

I added a few commits among which:

  • a reimplementation of the main class following some of the scikits.learn coding rules and some recommandations discussed above ;
  • a small documentation page with references ;
  • two examples ;
  • two tests.

One of the test involves a coupling with your REALLY GREAT cross_val and metrics modules !!! I like them. I just don't know why I get a Pickling Error when I ask for n_jobs > 1... Which is quite a shame... But I'm sure you'll be able to help me with this latter point !

@ogrisel
Copy link
Member

ogrisel commented Nov 13, 2010

joblib is using the multiprocessing module to compute in // . To pass parameters and models results across processes the models need to be picklable.

BTW, when you are commenting on the issue, you are not talking to the project mailing list as a whole, just the people registered on the issue.

dubourg added 2 commits November 14, 2010 10:51
…the deviation between the predicted targets and the true ones. This is for convenience only because it allows then to use the distributing capacity of the cross_val module. The old score function is renamed with the more explicit name: `reduced_likelihood_function` (see eg the DACE documentation).
…e in order to load the gpml module and tests.
@agramfort
Copy link
Member

A few quick "mess of" remarks:

  • I'm not found of using "ml" in the module name as the scikit is a machine learning package. It's redundant. Also I would explicitely name the module gaussian_proccess (not gp). I know that we have svm and glm but gaussian_proccess is not that long and "gp" does not mean much to common users.
  • run the pep8 command : http://pypi.python.org/pypi/pep8 on your files. You will see that there are some cosmetic modifications that need to be done to your code (eg. long lines > 80char, missing spaces etc.)
  • an example should plot only 1 image for the documentation.
  • do not make examples interactive as it prevents them from being automatically added to the doc. Your examples are great by the way :)
  • you should remove : from . import kriging in learns/init.py
  • I would rename : regression_models.py -> regression.py, GaussianProcessModel -> GaussianProcess
  • I'm not a fan of a predict that can output more that an y. See.
    y, MSE = aGaussianProcessModel.predict(x, eval_MSE=True)
    In the GLM module the models store the explained variance on the train
  • You should try to make examples as short as possible. I prefer many examples with a clear docstring
    than one big example
  • Use print doc to print the example docstring when run.
  • Is there a reason for not providing a GaussianProcessClassification model directly?

A comment on this sentence (in the doc):

    • It loses efficiency in high dimensional spaces -- namely when the number
  •  of features exceeds a few dozens. It might indeed give poor performance
    
  •  and it becomes computationally inefficient.
    

I've seen Gaussian Processes applied with thousands of features
(eg. http://www.ncbi.nlm.nih.gov/pubmed/19879364). Can you enlighten me on this?

that's it for this sunday morning :)

@dubourg
Copy link
Contributor Author

dubourg commented Nov 14, 2010

Thank you Olivier I understood my mistake about the PicklingError: I declared a score function on the go right before I run the cross validation. Thus, the score function was not pickable.
So, I realized that my old score function of the GaussianProcessModel was not a really good "score function" from the cross_val perspective. I thus renamed my old score function as reduced_likelihood_function which is more explicit. And the new score function simply computes the deviations between the predicted values and the true test targets.
The parallel evaluation of a leave-one-out estimate of the explained variance error is now possible (I ran up to 10 jobs in parallell).

@GaelVaroquaux
Copy link
Member

I agree with Alex: I'd like the package to be named 'gaussian_process' (I hate acronyms, I am not too happy with the various gmm, glm, sgd that we already have). For the name of the example files (not the example directory), I think using 'gp' instead of the full 'gaussian_process' should be fine, as it is explicit because of the context.
I am also +1 on all his suggested renames (look like we spend too much time together, if we agree on all this).

+1 also with the fact that predict shouldn't output more than y: it can easily break all our model selection machinery.

Thanks a lot, this is starting to look very good!

@dubourg
Copy link
Contributor Author

dubourg commented Nov 23, 2010

Is it still planned to merge my branch despite the more recent discussions on the forthcoming release?
(I tried to "polish" the RST doc page in this perspective.)

@dubourg
Copy link
Contributor Author

dubourg commented Nov 23, 2010

Do you have any idea on how I could prevent the QR decomposition function from scipy from sending this iritating DeprecationWarning?

/usr/lib/python2.6/dist-packages/scipy/linalg/decomp.py:1177: DeprecationWarning: qr econ argument will be removed after scipy 0.7. The economy transform will then be available through the mode='economic' argument.
"the mode='economic' argument.", DeprecationWarning)

And maybe there is a better work around than a try/except to ensure this kwarg backward compatibility (Lines 561--569 of gaussian_process.py)?...

@agramfort
Copy link
Member

Is it still planned to merge my branch despite the more recent discussions on the forthcoming release?

it should be. Give a bit of time to review all the code. I'll get back to you ASAP

@fabianp
Copy link
Member

fabianp commented Nov 24, 2010

@dubourg: you now have write access on the git repo, you should be able commit it now by yourself :-)

@ogrisel
Copy link
Member

ogrisel commented Nov 24, 2010

Also please remove the comments blocks such as:

###############

Dependencies

###############

They don't bring any additional information and break the beautiful flow a you python program :)

@agramfort
Copy link
Member

@dubourg: if you have pbs with git to commit to the main repo please do not use -f to force the commit :)
if you like you can ask olivier or fabian to do the job for you.

Removed useless comment blocks in the code.
@dubourg
Copy link
Contributor Author

dubourg commented Nov 24, 2010

I will merge my pull request tonight following the guidelines that appeared at the bottom of my pull request.
I won't force anything but rather ask for help. In any case, I'd prefer to learn how to do it by my self instead of asking for somebody else to do the job for me.
@olivier and fabian: you'll probably receive some questions tonight ;)

@fabianp
Copy link
Member

fabianp commented Nov 24, 2010

Don't hesitate to ask questions if you run into trouble (but I'll probably reply tomorrow :-)

@dubourg
Copy link
Contributor Author

dubourg commented Nov 24, 2010

As you can see on the GitHub pull request I was able to merge the new features (since my latest pull in... September!), but I can't find the command I should run to pull my merged branch into the trunk...
Does:

git push origin origin/master

sounds good?

@ogrisel
Copy link
Member

ogrisel commented Nov 24, 2010

git push master origin/master

@dubourg
Copy link
Contributor Author

dubourg commented Nov 24, 2010

Well I don't have any remote named master, I only have "origin"...

@agramfort
Copy link
Member

well it depends on what is origin for you.

for me origin is scikit-learn/scikit-learn
and alex is agramfort/scikit-learn

so I would do

git push alex master # to push to my fork
git push origin master # to push to the main repo

hope this helps

@dubourg
Copy link
Contributor Author

dubourg commented Nov 24, 2010

Not really...
Here are the two remotes I have:

dubourg PTlami14:~/Tools/scikit-learn> git remote show origin

  • remote origin

    Fetch URL: git@github.com:dubourg/scikit-learn.git

    Push URL: git@github.com:dubourg/scikit-learn.git

    HEAD branch: master

    Remote branches:

    debian tracked

    dubourg-master tracked

    master tracked

    Local branch configured for 'git pull':

    master merges with remote master

    Local refs configured for 'git push':

    dubourg-master pushes to dubourg-master (up to date)

    master pushes to master (up to date)

dubourg PTlami14:~/Tools/scikit-learn> git remote show main

  • remote main

    Fetch URL: git@github.com:scikit-learn/scikit-learn.git

    Push URL: git@github.com:scikit-learn/scikit-learn.git

    HEAD branch: master

    Remote branches:

    debian new (next fetch will store in remotes/main)

    master new (next fetch will store in remotes/main)

    Local ref configured for 'git push':

    master pushes to master (fast-forwardable)

And following your examples, I'd be tempted to try:

git push origin main

But it fails with error:

error: src refspec main does not match any.

error: failed to push some refs to 'git@github.com:dubourg/scikit-learn.git'

@GaelVaroquaux
Copy link
Member

git push main master, I believe

pprett pushed a commit to pprett/scikit-learn that referenced this pull request Sep 21, 2011
dengemann pushed a commit to dengemann/scikit-learn that referenced this pull request Jul 22, 2013
raghavrv pushed a commit to raghavrv/scikit-learn that referenced this pull request Jun 20, 2017
glemaitre referenced this pull request in glemaitre/scikit-learn Jun 20, 2019
prismdata pushed a commit to prismdata/scikit-learn that referenced this pull request Oct 28, 2020
ogrisel added a commit that referenced this pull request Feb 3, 2023
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants