MRG: Major SGD extension module refactoring #612
Conversation
… instead of explicit ndarray and wscale.
…ctor. some cosmits in dense sgd extension module.
- new dataset abstraction allows us to dump the sparse impl.
btw: removes 4K lines of code
Did you get inspired by sofia-ml ;-)? The idea of the abstractions is nice, and I think that in the long term we could even move them to utils/ so that they can be used in other places (e.g. MiniBatchKmeans).

It's a bit of a pity that the dot product needs to iterate over the feature indices even in the dense case (this prevents us, for example, from calling cblas). It would be nice if we could have a different implementation of "dot" in the dense and sparse cases, while keeping the same logic in plain_sgd. In C++, with overloading, it would be easy to keep the logic of the algorithm the same and just call "dot". I wonder if we can do something similar by using the fact that ArrayDataset and CSRDataset inherit from the same parent class...

A feature that would be useful is to precompute the squared norms and store them as an attribute inside ArrayDataset and CSRDataset. This would be useful for computing the euclidean distances more efficiently.

Speaking of sofia-ml, a long time ago I contributed a branch implementing multinomial logistic regression and multi-class hinge loss, but unfortunately the code was never merged because sofia-ml wasn't really designed with multiclass classification in mind. The code is in the branch/ folder on google-code if you want to have a look.
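A minimal pure-Python sketch of that suggestion (not the Cython code in this PR; the names `BaseDataset`, `DenseDataset`, `SparseDataset` and the `dot` signature are made up for illustration): the training loop only talks to the parent class, each subclass supplies its own dot product, and the dense case could then go through BLAS via `np.dot`:

```python
import numpy as np
from scipy import sparse


class BaseDataset(object):
    """Hypothetical shared parent; the SGD loop only sees this interface."""

    def dot(self, w, i):
        """Dot product of weight vector w with sample i."""
        raise NotImplementedError


class DenseDataset(BaseDataset):
    """Dense rows: the dot product can be delegated to BLAS via np.dot."""

    def __init__(self, X):
        self.X = np.ascontiguousarray(X, dtype=np.float64)

    def dot(self, w, i):
        return float(np.dot(self.X[i], w))


class SparseDataset(BaseDataset):
    """CSR rows: iterate only over the stored (non-zero) feature indices."""

    def __init__(self, X_csr):
        self.data, self.indices, self.indptr = X_csr.data, X_csr.indices, X_csr.indptr

    def dot(self, w, i):
        start, end = self.indptr[i], self.indptr[i + 1]
        return float(np.dot(self.data[start:end], w[self.indices[start:end]]))


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = rng.rand(5, 3)
    w = rng.rand(3)
    dense = DenseDataset(X)
    sp_ds = SparseDataset(sparse.csr_matrix(X))
    # Both go through the same "dot" call; only the implementation differs.
    assert np.allclose(dense.dot(w, 2), sp_ds.dot(w, 2))
```

The same kind of dispatch should in principle be possible in Cython, since `cdef` methods defined on a shared parent `cdef class` can be overridden in subclasses, which is what the common base of `ArrayDataset` and `CSRDataset` would allow.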
assert y_i.shape[0] == y.shape[0] == sample_weight.shape[0]
if sp.issparse(X):
    dataset = CSRDataset(X.data, X.indptr, X.indices, y_i, sample_weight)
    intercept_decay = 0.01
Shouldn't this be float(X.data.shape[0]) / (X.shape[0] * X.shape[1])
instead of a magic constant?
Have you used that before? Does it give better results? If so, we should definitely choose the density instead of the magic constant.
A quick thought: by using the density we assume that, more or less, the sparsity pattern is the same for all features. This might not be the case for bag-of-words features (Zipf's law).
On the other hand, I've been using the magic constant for about 3 years now; it gave reasonable results (compared to no damping of the intercept update). We could make intercept_decay a parameter and use 1.0 as the dense default and 0.01 or the density as the sparse default. Sometimes it makes sense to change the intercept decay - e.g. when you have both (a large number of) sparse and (some) dense features.
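For comparison, here is what the two options from this thread look like side by side (a small, hypothetical example matrix; the variable names are only for illustration):

```python
import scipy.sparse as sp

# Hypothetical example: 1000 samples, 500 features, ~2% non-zero entries.
X = sp.random(1000, 500, density=0.02, format="csr", random_state=0)

# The magic constant used for sparse input in this PR:
intercept_decay = 0.01

# The density-based alternative suggested above:
intercept_decay_density = float(X.data.shape[0]) / (X.shape[0] * X.shape[1])

print(intercept_decay, intercept_decay_density)  # 0.01 vs. ~0.02
```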
Actually I don't know. I haven't tried to play with this in the past.
I would have initially agreed with @mblondel about the lack of a BLAS option in the dense case with the new API, but that would only make a speed difference if we did some kind of minibatch handling (which we don't) or if n_features was very large yet dense. Since SGD is especially interesting when n_samples is large too, such a dataset probably wouldn't fit in memory and the cost would likely be dominated by disk IO rather than BLAS-based dot products anyway. So I am ok with the current state, especially if it allows us to implement hashed feature cross-products in a lazy way.

A quick cosmit note: PEP 257 states that docstrings should be one line + blank line + a paragraph with details:

```python
def some_func(arg1, arg2):
    """Short oneline description without any newline char.

    More details in a paragraph. More details in a paragraph. More
    details in a paragraph. More details in a paragraph.
    """
```

Or for one-liners:

```python
def some_func(arg1, arg2):
    """Short oneline description without any newline char."""
```

Apart from the above inline remarks, 👍 for merging.
You got me - I recently studied sofia-ml's, vw's, and mahout's hashing approaches.

Agreed. Btw: do we have cross-package cython imports somewhere in the scikit so that I can peek at how this works?

Ok, this becomes an issue when we use Dataset in places other than sgd?

Great - I do have some ideas about how to consolidate multi-class and the binary special case, but I'm always looking for better/easier ways to get things done - thanks.
As long as we use …
# -----------------------------------------

cdef class Dataset:
    cdef int n_samples
Should be a Py_ssize_t to be 64-bit clean.
@larsmans, @ogrisel, @mblondel thanks for the feedback! @larsmans: I changed n_features and n_samples to Py_ssize_t. Regarding the sample set issue: Dataset encapsulates both the samples and the labels; the major reasons for this design decision are a) shuffling and b) streaming dataset implementations (e.g. reading from a socket).
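To illustrate the point about encapsulating samples and labels together, here is a simplified Python sketch (not the actual Cython `ArrayDataset`; the `next`/`shuffle` signatures and the class name are assumptions) showing how owning X, y and sample_weight in one object keeps them aligned during shuffling and hands out one example at a time:

```python
import numpy as np


class ArrayDatasetSketch(object):
    """Toy stand-in for ArrayDataset: owns X, y and sample_weight together."""

    def __init__(self, X, y, sample_weight, seed=0):
        self.X = np.ascontiguousarray(X, dtype=np.float64)
        self.y = np.asarray(y, dtype=np.float64)
        self.sample_weight = np.asarray(sample_weight, dtype=np.float64)
        self.index = np.arange(self.X.shape[0])
        self.pos = 0
        self.rng = np.random.RandomState(seed)

    def shuffle(self):
        # Permuting one index array keeps samples, labels and weights aligned.
        self.rng.shuffle(self.index)
        self.pos = 0

    def next(self):
        # Hand out the next (x_i, y_i, sample_weight_i) triple, wrapping around.
        i = self.index[self.pos % self.index.shape[0]]
        self.pos += 1
        return self.X[i], self.y[i], self.sample_weight[i]
```

A streaming implementation (e.g. one reading from a socket) could keep the same `next`/`shuffle` contract while pulling rows from an external source.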
Hey Peter. I just saw the PR and just wanted to say thank you for the refactoring and cleanup. This looks awesome! Multinomial logistic regression and multi-class hinge loss have been on my todo list for too long - maybe I'll get around to it now that you've made it much easier :)
👍 for merging (I can live with the remaining PEP257 violations)
@ogrisel: I fixed the PEP257 issues in fast_sgd.pyx - thx for reviewing!
Alright, shall we merge?
I think so - it's a good investment in terms of maintainability. I'll merge.
A major refactoring of the SGD extension module which allows us to dump `sgd_fast_sparse.pyx`.

It introduces a new dataset abstraction, `Dataset`, which basically provides a `next` and a `shuffle` method. There are currently two concrete implementations: `ArrayDataset`, which is backed by a c-style numpy array, and `CSRDataset`, which is backed by a scipy CSR matrix.

A `WeightVector` abstraction is introduced which basically wraps `wscale` and the np array `w` and provides methods such as `scale`, `add`, and `reset_wscale`.

I timed the refactoring on 20news and covertype; there are no noticeable differences in terms of runtime.

This refactoring reduces code complexity and should allow us to add features more easily. I'm thinking about various hashing tricks (e.g. feature cross-products via a new `WeightVector`, etc.) and different multi-class approaches such as multinomial LR and multi-class hinge loss.
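A rough Python sketch of the `WeightVector` idea (simplified; the exact Cython method signatures may differ): representing the weight vector as `wscale * w` makes scaling every weight, e.g. for the L2 penalty, an O(1) update; `add` compensates for the factor when applying a gradient step on the non-zero features, and `reset_wscale` folds the factor back into the array before it underflows:

```python
import numpy as np


class WeightVectorSketch(object):
    """Toy model: the effective weight vector is represented as wscale * w."""

    def __init__(self, n_features):
        self.w = np.zeros(n_features)
        self.wscale = 1.0

    def scale(self, c):
        # Scale *all* weights in O(1) by updating the factor only
        # (e.g. the L2 shrinkage applied at every SGD step).
        self.wscale *= c
        if self.wscale < 1e-9:
            self.reset_wscale()  # guard against numerical underflow

    def add(self, indices, values, c):
        # Perform w += c * x while compensating for the implicit factor,
        # touching only the non-zero entries of x.
        self.w[indices] += (c / self.wscale) * values

    def dot(self, indices, values):
        # Inner product <w, x> over the non-zero entries of x.
        return self.wscale * float(np.dot(self.w[indices], values))

    def reset_wscale(self):
        # Fold the accumulated factor back into the array.
        self.w *= self.wscale
        self.wscale = 1.0
```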