
DOC More docs needed on model and data persistence #2801


Closed
jnothman opened this issue Jan 29, 2014 · 12 comments · Fixed by #18046
Labels
Documentation, Easy (Well-defined and straightforward way to resolve)

Comments

@jnothman
Member

It seems the only documentation of model persistence is in the Quick Start tutorial. I think it needs to be covered in the User Guide, mentioning the security and forward-compatibility caveats pertaining to pickle. It might note the benefits of joblib (for large models at least) over pickle. Most other ML toolkits (particularly command-line tools) treat persistence as a very basic operation of the package. (This was fixed in #3317)
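
For instance, the guide could show something along these lines (a minimal sketch, assuming a standalone joblib install; the estimator and filename are only illustrative):

```python
# Minimal sketch of joblib-based model persistence (illustrative only).
import joblib
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC().fit(X, y)

# Persist the fitted estimator; the filename is arbitrary.
joblib.dump(clf, "model.joblib")

# Later (possibly in another process), restore it.
# Caveats: only load files you trust, and only with a compatible
# scikit-learn version -- pickle-based formats are neither secure
# nor guaranteed to be forward-compatible.
clf_restored = joblib.load("model.joblib")
print(clf_restored.predict(X[:5]))
```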

Similarly, there should be some comment on saving and loading custom data (input and output). Indeed, I can't find a direct description of the supported data types; users are unlikely to have played with scipy.sparse before (although the feature_extraction module means they may not need to). Noting the benefits of joblib (without which the user may dump a sparse matrix's data, indices and indptr using .tofile) and memmapping is worthwhile. So might a reference to pandas, which can help manipulate datasets before/after they enter scikit-learn and provides import/export for a variety of formats (http://pandas.pydata.org/pandas-docs/dev/io.html).
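
Something like the following could anchor that discussion (a sketch only; the random sparse matrix and filenames are placeholders):

```python
# Illustrative sketch: persisting a sparse matrix without manually
# dumping its .data/.indices/.indptr arrays to separate files.
import joblib
import scipy.sparse as sp

X = sp.random(1000, 50, density=0.01, format="csr")

# joblib serialises sparse matrices directly (and can memory-map
# large dense arrays on load via mmap_mode="r").
joblib.dump(X, "features.joblib")
X_back = joblib.load("features.joblib")

# scipy also offers a dedicated on-disk format for sparse matrices.
sp.save_npz("features.npz", X)
X_npz = sp.load_npz("features.npz")
```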

I have recently discovered new users who think the way to import/export large sparse arrays is load_svmlight_file, but then find that the loading/saving takes much more time than the processing they're trying to do... Let's give them a hand.

@jnothman
Member Author

scipy.io also deserves a mention when discussing dataset loading.
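
For example (just a sketch; the filenames and arrays are made up):

```python
# Illustrative sketch: scipy.io reads and writes several common
# on-disk formats that users may already have data in.
import numpy as np
from scipy import io

data = {"X": np.random.rand(5, 3), "y": np.arange(5.0)}

# MATLAB .mat files
io.savemat("dataset.mat", data)
loaded = io.loadmat("dataset.mat")

# Matrix Market files, which also handle sparse matrices
io.mmwrite("X.mtx", data["X"])
X = io.mmread("X.mtx")
```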

@MattpSoftware
Contributor

A good issue for a novice to work on?

@jnothman
Member Author

Some of this was covered in #3317 (and it turns out this issue was a duplicate of #1332). But if you can make an effort to understand what's covered now and what's missing, yes, I think this question of import/export could still have better coverage in the documentation.

@MattpSoftware
Contributor

Is all the current persistence documentation in model_persistence.rst?


@jnothman
Member Author

jnothman commented Oct 1, 2014

Basically, yes


@MattpSoftware
Contributor

Starting review of model_persistence.rst …

@arjoly
Member

arjoly commented Oct 7, 2014

Thanks @MattpSoftware!

@evanmiller29

Hey @amueller/@jnothman - does this still need some work? I'm happy to open a PR and work on it.

@jnothman
Member Author

jnothman commented Aug 7, 2019

Well, this is certainly an old issue, but it would be great if someone who knows both common and efficient ways to store data could review what our user guides and tutorials currently say on the topic.

@evanmiller29

evanmiller29 commented Aug 8, 2019

Ha, yeah, the new bug issues are a little too competitive, so I thought I'd start working on some of the older issues...

I'm not 100% sure about the efficiency side, but I'm wondering if there's scope to do a model persistence code tutorial?

Here's the overview of what I'd do (rough sketch at the end of this comment):

  • Fit a model on data (probably using a house price dataset)
  • Dump the model to .pkl using joblib
  • Programmatically output requirements.txt
  • Save the train/test splits to CSV
  • Reload the .pkl file and create dummy predictions

Keen to hear your thoughts @jnothman
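
Something like this, perhaps (very rough sketch; the California housing dataset, Ridge, and the filenames are placeholders, not a settled design):

```python
# Rough sketch of the proposed tutorial steps (illustrative only).
import subprocess
import sys

import joblib
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# 1. Fit a model on a house price dataset.
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge().fit(X_train, y_train)

# 2. Dump the fitted model with joblib.
joblib.dump(model, "model.pkl")

# 3. Programmatically record the environment for later reloading.
with open("requirements.txt", "wb") as f:
    f.write(subprocess.check_output([sys.executable, "-m", "pip", "freeze"]))

# 4. Save the train/test splits to CSV.
X_train.assign(target=y_train).to_csv("train.csv", index=False)
X_test.assign(target=y_test).to_csv("test.csv", index=False)

# 5. Reload the model and create (dummy) predictions.
reloaded = joblib.load("model.pkl")
test = pd.read_csv("test.csv")
predictions = reloaded.predict(test.drop(columns="target"))
```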

@amueller
Member

amueller commented Aug 8, 2019

@zioalex
Contributor

zioalex commented Oct 12, 2019

Hi guys, I was searching for a good easy issue to start with. This seems already done to me. Can I help here somehow?

8 participants