DOC More docs needed on model and data persistence #2801

Closed
@jnothman

Description

It seems the only documentation of model persistence is in the Quick Start tutorial. I think it needs to be covered in the User Guide, mentioning the security and forward-compatibility caveats pertaining to pickle. It might also note the benefits of joblib over pickle (for large models at least). Most other ML toolkits (particularly command-line tools) treat persistence as a very basic operation of the package. (This was fixed in #3317.)
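For illustration, here is a minimal sketch of the two persistence routes mentioned above. It assumes a recent scikit-learn and the standalone `joblib` package; the estimator, dataset and file name are arbitrary choices for the example, not something this issue prescribes.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Fit a small model; the estimator and dataset are arbitrary choices.
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Route 1: the standard library's pickle, here round-tripped in memory.
# Caveat: only unpickle data you trust, since loading can execute code.
restored_pickle = pickle.loads(pickle.dumps(clf))

# Route 2: joblib, which writes to disk and is geared towards objects
# that carry large NumPy arrays. It is pickle-based, so the same
# security and version-compatibility caveats apply.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")  # hypothetical file name
    joblib.dump(clf, path)
    restored_joblib = joblib.load(path)

# Both restored models should predict exactly like the original.
same_pickle = bool((restored_pickle.predict(X) == clf.predict(X)).all())
same_joblib = bool((restored_joblib.predict(X) == clf.predict(X)).all())
```

Neither route promises that a model saved with one scikit-learn version loads correctly in another, which is the forward-compatibility caveat the docs should spell out.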

Similarly, there should be some comment on saving and loading custom data (input and output). Indeed, I can't find a direct description of the supported data types; users are unlikely to have played with scipy.sparse before (although the feature_extraction module means they may not need to). Noting the benefits of joblib (without which the user may dump a sparse matrix's data, indices and indptr arrays using .tofile) and of memmapping is worthwhile. So may be a reference to Pandas, which could help manipulate datasets before and after they enter scikit-learn, and which provides import/export to a variety of formats (http://pandas.pydata.org/pandas-docs/dev/io.html).
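To make the sparse-matrix and memmapping points concrete, here is a rough sketch. It assumes a SciPy recent enough to provide `scipy.sparse.save_npz` / `load_npz` (these postdate this issue) and the standalone `joblib` package; the matrix sizes and file names are made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
import scipy.sparse as sp

# A random CSR matrix standing in for user data (sizes are arbitrary).
X = sp.random(1000, 500, density=0.01, format="csr", random_state=0)
dense = np.arange(12, dtype=np.float64).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    # Built-in sparse persistence in newer SciPy releases.
    npz_path = os.path.join(tmp, "X.npz")
    sp.save_npz(npz_path, X)
    X_npz = sp.load_npz(npz_path)

    # The manual route: store the three CSR arrays plus the shape.
    manual_path = os.path.join(tmp, "X_manual.npz")
    np.savez(manual_path, data=X.data, indices=X.indices,
             indptr=X.indptr, shape=X.shape)
    with np.load(manual_path) as parts:
        X_manual = sp.csr_matrix(
            (parts["data"], parts["indices"], parts["indptr"]),
            shape=tuple(parts["shape"]),
        )

    # Memmapping a dense array: joblib can load an uncompressed dump
    # lazily from disk instead of reading it all into memory.
    dense_path = os.path.join(tmp, "dense.joblib")
    joblib.dump(dense, dense_path)
    dense_mm = joblib.load(dense_path, mmap_mode="r")
    memmap_matches = bool(np.array_equal(dense_mm, dense))
    del dense_mm  # release the mapping before the directory is removed

npz_matches = (X != X_npz).nnz == 0
manual_matches = (X != X_manual).nnz == 0
```

The manual route shows what a user has to get right by hand (three arrays plus the shape), which is the argument for pointing them at a higher-level helper.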

I have recently discovered new users who think the way to import/export large sparse arrays is load_svmlight_file, but who then note that loading and saving take much more time than the processing they're trying to do... Let's give them a hand.
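As a sketch of what such a hand might look like, the snippet below round-trips the same sparse dataset through the svmlight text format and through a binary joblib dump. It assumes current scikit-learn and standalone `joblib`; the data and file names are invented, and no timings are claimed here, so wrap the calls in `time.perf_counter()` on real data to compare.

```python
import os
import tempfile

import joblib
import numpy as np
import scipy.sparse as sp
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Synthetic sparse features and binary labels (sizes are arbitrary).
X = sp.random(2000, 300, density=0.02, format="csr", random_state=0)
y = np.arange(X.shape[0]) % 2

with tempfile.TemporaryDirectory() as tmp:
    # Text route: svmlight format, which must be formatted and parsed
    # value by value.
    text_path = os.path.join(tmp, "data.svmlight")
    dump_svmlight_file(X, y, text_path)
    X_text, y_text = load_svmlight_file(text_path, n_features=X.shape[1])

    # Binary route: joblib stores the underlying arrays directly.
    bin_path = os.path.join(tmp, "data.joblib")
    joblib.dump((X, y), bin_path)
    X_bin, y_bin = joblib.load(bin_path)

# The text route goes through decimal formatting, so compare with a
# tolerance; the binary route should reproduce the data exactly.
text_close = bool(np.allclose(X_text.toarray(), X.toarray()))
bin_exact = (X_bin != X).nnz == 0
```

The svmlight format remains useful for interchange with other tools; the point is only that it need not be the default for saving intermediate data.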

Labels: Documentation, Easy (well-defined and straightforward way to resolve)
