MRG: Major SGD extension module refactoring #612


Merged (11 commits, Feb 13, 2012)

Conversation

@pprett (Member) commented Feb 7, 2012

A major refactoring of the SGD extension module which allows us to dump sgd_fast_sparse.pyx:

  1. A new dataset abstraction is introduced, Dataset, which basically provides a next and a shuffle method.
    There are currently two concrete implementations: ArrayDataset, which is backed by a C-style numpy array, and CSRDataset, which is backed by a scipy CSR matrix.
  2. A WeightVector abstraction is introduced which basically wraps wscale and the np array w and provides methods such as scale, add, and reset_wscale.

I timed the refactoring on 20news and covertype; there are no noticeable differences in terms of runtime.

This refactoring reduces code complexity and should allow us to add features more easily. I'm thinking about various hashing tricks (e.g. feature cross-products via a new WeightVector, etc.) and different multi-class approaches such as multinomial LR and multi-class hinge loss.
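
To give a rough feel for the two interfaces, here is a toy pure-Python sketch (the real classes are Cython cdef classes; method names follow the description above, but the exact signatures may differ):

    import numpy as np


    class ArrayDataset(object):
        """Toy sketch: sequential access to a dense (n_samples, n_features) array."""

        def __init__(self, X, y, sample_weight):
            self.X, self.y, self.sample_weight = X, y, sample_weight
            self.order = np.arange(X.shape[0])
            self.pos = 0

        def next(self):
            """Return (x_data, x_indices, y_i, w_i) for the next sample."""
            i = self.order[self.pos % len(self.order)]
            self.pos += 1
            # in the dense case every feature index is present
            return self.X[i], np.arange(self.X.shape[1]), self.y[i], self.sample_weight[i]

        def shuffle(self, seed):
            np.random.RandomState(seed).shuffle(self.order)


    class WeightVector(object):
        """Toy sketch: wraps the weight array w and its scale factor wscale."""

        def __init__(self, w):
            self.w = w
            self.wscale = 1.0

        def dot(self, x_data, x_indices):
            return self.wscale * sum(v * self.w[i] for v, i in zip(x_data, x_indices))

        def add(self, x_data, x_indices, c):
            # w += c * x, taking the implicit scaling into account
            for v, i in zip(x_data, x_indices):
                self.w[i] += (c / self.wscale) * v

        def scale(self, c):
            self.wscale *= c

        def reset_wscale(self):
            self.w *= self.wscale
            self.wscale = 1.0

CSRDataset would hand back only the nonzero values and their column indices from the CSR arrays, while the WeightVector side stays the same.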

@pprett (Member, Author) commented Feb 7, 2012

btw: removes 4K lines of code

@mblondel (Member) commented Feb 7, 2012

Did you get inspired by sofia-ml ;-) ?

The idea of the abstractions is nice and I think that in the long term we can even move them to utils/ so that they can be of benefit in other places (e.g. MiniBatchKMeans). It's a bit of a pity that the dot product needs to iterate over the feature indices even in the dense case (this prevents us, for example, from calling cblas). It would be nice if we could have a different implementation of "dot" in the dense and sparse cases, while keeping the same logic in plain_sgd. In C++, with overloading, it would be easy to keep the logic of the algorithm the same and just call "dot". I wonder if we can do something similar by using the fact that ArrayDataset and CSRDataset inherit from the same parent class...

A feature that would be useful is to precompute the squared norms and store them as attributes inside ArrayDataset and CSRDataset. This would be useful for computing Euclidean distances more efficiently (||a-b||^2 = ||a||^2 + ||b||^2 - 2 <a,b>) in MiniBatchKMeans or in an RBF kernel.
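
Roughly, something like this (plain numpy illustration, nothing from this PR; variable names made up):

    import numpy as np

    X = np.random.rand(5, 3)
    centers = np.random.rand(2, 3)

    # these could be precomputed once and stored on the dataset
    X_sq_norms = (X ** 2).sum(axis=1)
    c_sq_norms = (centers ** 2).sum(axis=1)

    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>
    sq_dists = X_sq_norms[:, np.newaxis] + c_sq_norms[np.newaxis, :] \
               - 2 * np.dot(X, centers.T)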

Speaking of sofia-ml, a long time ago I contributed a branch implementing multinomial logistic regression and multi-class hinge loss but unfortunately the code has never been merged because sofia-ml wasn't really designed with multiclass classification in mind. The code is in the branch/ folder in google-code if you want to have a look.

    assert y_i.shape[0] == y.shape[0] == sample_weight.shape[0]
    if sp.issparse(X):
        dataset = CSRDataset(X.data, X.indptr, X.indices, y_i, sample_weight)
        intercept_decay = 0.01

Member commented on this diff hunk:

Shouldn't this be float(X.data.shape[0]) / (X.shape[0] * X.shape[1]) instead of a magic constant?
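
i.e. something like this (toy example, assuming X is a scipy.sparse CSR matrix):

    import numpy as np
    import scipy.sparse as sp

    X = sp.csr_matrix(np.eye(4))  # toy matrix with density 0.25
    density = float(X.data.shape[0]) / (X.shape[0] * X.shape[1])
    intercept_decay = density     # instead of the hard-coded 0.01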

pprett (Member, Author) replied:

Have you used that before? Does it give better results? If so, we should definitely choose the density instead of the magic constant.

A quick thought: by using the density we assume that - more or less - the sparsity patterns of all features are equal. This might not be the case for bag-of-words features (Zipf's law).

On the other hand, I've been using the magic constant for about 3 years now; it gave reasonable results (compared to no damping of the intercept update). We could make intercept_decay a parameter and use 1.0 as the dense default and 0.01 or the density as the sparse default. Sometimes it makes sense to change the intercept decay - e.g. when you have both (a large number of) sparse and (some) dense features.
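
Something along these lines (hypothetical sketch of the default handling, not implemented in this PR):

    import scipy.sparse as sp

    def _default_intercept_decay(X, intercept_decay=None):
        # hypothetical helper: None means "pick a sensible default"
        if intercept_decay is not None:
            return intercept_decay          # explicit user override
        if sp.issparse(X):
            return 0.01                     # or the density of X
        return 1.0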

Member replied:

Actually I don't know. I haven't tried to play with this in the past.

@ogrisel (Member) commented Feb 8, 2012

I would initially have agreed with @mblondel about the lack of a BLAS option in the dense case with the new API, but that would only make a speed difference if we were to do some kind of minibatch handling (which we don't) or if n_features was very large yet dense. As SGD is especially interesting when n_samples is large too, such a dataset probably wouldn't fit in memory and the cost might be dominated by disk IO rather than BLAS-based dot products anyway.

So I am ok with the current state, especially if it allows us to implement hashed feature cross-products in a lazy way.
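
For the record, the kind of lazy hashed cross-products I mean (a toy sketch, not code from this PR):

    # Toy sketch of the hashing trick for feature cross-products: each pair of
    # active features is mapped into a fixed-size weight vector via a hash, so
    # the cross-product features never have to be materialized.
    n_buckets = 2 ** 18

    def hashed_cross_products(active_indices):
        """Yield hashed bucket indices for all pairwise cross-products."""
        for i in active_indices:
            for j in active_indices:
                # a real implementation would use a stable hash (e.g. murmurhash)
                yield hash((i, j)) % n_buckets

    # e.g. a sample whose active features are 3 and 17 touches these buckets:
    buckets = sorted(set(hashed_cross_products([3, 17])))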

A quick cosmit note: PEP257 states that docstrings should be one line + blank line + paragraph with details:

    def some_func(arg1, arg2):
        """Short oneline description without any newline char.

        More details in a paragraph. More details in a paragraph. More
        details in a paragraph. More details in a paragraph. 

        """

Or for oneliners:

    def some_func(arg1, arg2):
        """Short oneline description without any newline char."""

Apart from the above inline remarks, 👍 for merging.

@pprett (Member, Author) commented Feb 8, 2012

Did you get inspired by sofia-ml ;-) ?

You got me - I recently studied sofia-ml's, vw's, and mahout's hashing approaches.

The idea of the abstractions is nice and I think that in the long term we can even move them to utils/ so that they can be of benefit in other places (e.g. MiniBatchKMeans).

Agreed, btw: do we have cross-package cython imports somewhere in the scikit so that I can peek how this works?

It's a bit of a pity that the dot product needs to iterate over the feature indices even in the dense case (this prevents us, for example, from calling cblas). It would be nice if we could have a different implementation of "dot" in the dense and sparse cases, while keeping the same logic in plain_sgd. In C++, with overloading, it would be easy to keep the logic of the algorithm the same and just call "dot". I wonder if we can do something similar by using the fact that ArrayDataset and CSRDataset inherit from the same parent class...

WeightVector used to have both add and add_sparse but I consolidated the differences. In the case of ArrayDataset, x_data_ptr points to a row in the numpy array and xnnz equals the number of features, so we could create a new class CBlasWeightVector which inherits from WeightVector but implements dot via a cblas function call. BTW: we should definitely move the WeightVector instantiation from plain_sgd to the stochastic_gradient module.
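
In pure-Python terms the idea would be roughly this (toy sketch, with numpy.dot standing in for the cblas call):

    import numpy as np

    class WeightVector(object):
        # toy base class: sparse-style dot over (value, index) pairs
        def __init__(self, w):
            self.w, self.wscale = w, 1.0

        def dot(self, x_data, x_indices):
            return self.wscale * sum(v * self.w[i] for v, i in zip(x_data, x_indices))

    class CBlasWeightVector(WeightVector):
        # dense case: x_data is a full row and x_indices is just 0..n_features-1,
        # so dot can ignore the indices and use a dense dot (cblas_ddot in Cython)
        def dot(self, x_data, x_indices):
            return self.wscale * np.dot(self.w, x_data)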

A feature that would be useful is to precompute the squared norms and store them as attributes inside ArrayDataset and CSRDataset. This would be useful for computing Euclidean distances more efficiently (||a-b||^2 = ||a||^2 + ||b||^2 - 2 <a,b>) in MiniBatchKMeans or in an RBF kernel.

Ok, this becomes an issue when we use Dataset in places other than sgd?

Speaking of sofia-ml, a long time ago I contributed a branch implementing multinomial logistic regression and multi-class hinge loss but unfortunately the code has never been merged because sofia-ml wasn't really designed with multiclass classification in mind. The code is in the branch/ folder in google-code if you want to have a look.

Great - I do have some ideas about how to consolidate multi-class and the binary special case but I'm always looking for better/easier ways to get things done - thanks

@ogrisel (Member) commented Feb 8, 2012

Agreed, btw: do we have cross-package cython imports somewhere in the scikit so that I can peek how this works?

As long as we use cimport sklearn.absolute.path.header.pxd it seems to work (at least I think I tried once to check that the murmurhash functions could be cimported from another cython package of the scikit). We could have a new sklearn.utils.largescale package to share cython utilities for online-ish algorithms that would benefit from the Dataset abstraction at the cython level.
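
E.g. something along these lines (purely hypothetical module layout, just to illustrate the cross-package cimport):

    # in some other extension module, e.g. sklearn/cluster/_mini_batch.pyx
    # (hypothetical paths: assumes a sklearn/utils/largescale/dataset.pxd
    # exposing the Dataset declaration)
    from sklearn.utils.largescale.dataset cimport Dataset

    cdef void one_epoch(Dataset dataset):
        # consume samples via dataset.next() exactly as plain_sgd does
        pass

The .pxd path above is made up; the point is just that the existing cimport machinery should already support this.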

    # -----------------------------------------

    cdef class Dataset:
        cdef int n_samples

Member commented on this diff hunk:

Should be a Py_ssize_t to be 64-bit clean.

@pprett (Member, Author) commented Feb 8, 2012

@larsmans, @ogrisel, @mblondel thanks for the feedback!

@larsmans: I changed n_features and n_samples to Py_ssize_t. Regarding the sample set issue: Dataset encapsulates both the samples and the labels; the major reason for this design decision is a) shuffling and b) streaming dataset implementations (e.g. a socket).
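
A streaming implementation could look roughly like this (hypothetical sketch, not part of this PR):

    import json

    class StreamingDataset(object):
        """Toy sketch: samples and labels arrive together from a stream,
        which is why Dataset bundles both."""

        def __init__(self, stream):
            self.stream = stream            # e.g. a socket or file-like object

        def next(self):
            # made-up wire format: one JSON record per line
            rec = json.loads(self.stream.readline())
            return rec["x_data"], rec["x_indices"], rec["y"], rec["sample_weight"]

        def shuffle(self, seed):
            pass                            # a stream cannot be shuffled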

@amueller (Member) commented Feb 8, 2012

Hey Peter. I just saw the PR and just wanted to say thank you for the refactoring and cleanup. This looks awesome!

Multinomial logistic regression and multi-class hinge loss have been on my todo list for too long - maybe I'll get around to them now that you've made it much easier :)

@ogrisel (Member) commented Feb 11, 2012

👍 for merging (I can live with the remaining PEP257 violations)

@pprett (Member, Author) commented Feb 12, 2012

@ogrisel: I fixed the PEP257 issues in sgd_fast.pyx - thx for reviewing!

@ogrisel (Member) commented Feb 13, 2012

Alright, shall we merge?

@pprett (Member, Author) commented Feb 13, 2012

I think so - it's a good investment in terms of maintainability. I'll merge as soon as I am in the office.

@pprett pprett merged commit 6f914f3 into scikit-learn:master Feb 13, 2012