Description
It seems the only documentation of model persistence is in the Quick Start tutorial. I think it needs to be covered in the User Guide, including the security and forward-compatibility caveats that come with pickle. It might also note the benefits of joblib over pickle, at least for large models. Most other ML toolkits (particularly command-line tools) treat persistence as a very basic operation of the package. (This was fixed in #3317)
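To make the pickle caveats concrete, here is a minimal sketch using only the standard library. The `fitted_state` dict is a hypothetical stand-in for a fitted estimator, and the `library_version` field with its `"0.0.0"` placeholder is an assumption for illustration, not something the library writes for you.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted estimator's learned state.
fitted_state = {"coef": [0.5, -1.25], "intercept": 2.0}

# Store a version string next to the payload so a later load can detect
# a mismatch. "0.0.0" is a placeholder, not a real release number.
bundle = {"library_version": "0.0.0", "model": fitted_state}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f, protocol=pickle.HIGHEST_PROTOCOL)

# Security caveat: only unpickle files from a trusted source,
# because loading a pickle can execute arbitrary code.
with open(path, "rb") as f:
    restored = pickle.load(f)

# Forward-compatibility caveat: refuse to use a pickle written
# under a different version than the one we expect.
if restored["library_version"] != "0.0.0":
    raise RuntimeError("pickle was written by a different library version")

restored_model = restored["model"]
```

joblib's `dump` and `load` follow the same write-then-read pattern and are generally more efficient for objects that hold large NumPy arrays; the same trust and version caveats apply.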
Similarly, there should be some comment on saving and loading custom data (input and output). Indeed, I can't find a direct description of the supported data types; users are unlikely to have played with scipy.sparse before (although the feature_extraction module means they may not need to). Noting the benefits of joblib (without which the user may dump a sparse matrix's data, indices, indptr using .tofile) and memmapping is worthwhile. So might a reference to Pandas, which could help manipulate datasets before/after entering Scikit-learn and provides import/export to a variety of formats (http://pandas.pydata.org/pandas-docs/dev/io.html).
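As a rough illustration of the manual route mentioned above, the sketch below stores a CSR matrix's data, indices and indptr arrays in one binary archive and rebuilds the matrix from them. It uses `numpy.savez` rather than `.tofile` so that dtypes and shapes travel with the arrays; the file name and the toy matrix are assumptions made for the example.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small toy matrix in CSR format (three stored non-zero values).
X = sparse.csr_matrix(
    np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
)

# Save the three CSR component arrays plus the shape in one .npz archive.
path = os.path.join(tempfile.mkdtemp(), "X_csr.npz")
np.savez(
    path,
    data=X.data,
    indices=X.indices,
    indptr=X.indptr,
    shape=np.array(X.shape),
)

# Rebuild the matrix from its components.
with np.load(path) as archive:
    X_restored = sparse.csr_matrix(
        (archive["data"], archive["indices"], archive["indptr"]),
        shape=tuple(archive["shape"]),
    )

round_trip_ok = np.array_equal(X.toarray(), X_restored.toarray())
```

Recent SciPy versions provide `scipy.sparse.save_npz` and `scipy.sparse.load_npz`, which wrap this pattern, so the manual version is mainly useful for seeing what has to be stored.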
I have recently discovered new users who think the way to import/export large sparse arrays is load_svmlight_format, but then note that the loading/saving takes much more time than the processing they're trying to do... Let's give them a hand.