MRG: Major SGD extension module refactoring #612
Conversation
… instead of explicit ndarray and wscale.
…ctor. some cosmits in dense sgd extension module.
- new dataset abstraction allows us to dump the sparse impl.
btw: removes 4K lines of code
Did you get inspired by sofia-ml ;-)? The idea of the abstractions is nice, and I think that in the long term we could even move them to utils/ so that they can be used in other places (e.g. MiniBatchKmeans).

It's a bit of a pity that the dot product needs to iterate over the feature indices even in the dense case (this prevents us, for example, from calling cblas). It would be nice if we could have a different implementation of "dot" in the dense and sparse cases, while keeping the same logic in plain_sgd. In C++, with overloading, it would be easy to keep the logic of the algorithm the same and just call "dot". I wonder if we can do something similar by using the fact that ArrayDataset and CSRDataset inherit from the same parent class...

A feature that would be useful is to precompute the squared norms and store them as an attribute inside ArrayDataset and CSRDataset. This would be useful for computing the euclidean distances more efficiently.

Speaking of sofia-ml, a long time ago I contributed a branch implementing multinomial logistic regression and multi-class hinge loss, but unfortunately the code was never merged because sofia-ml wasn't really designed with multiclass classification in mind. The code is in the branch/ folder on google-code if you want to have a look.
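A minimal pure-Python sketch of that suggestion (not the Cython code in this PR; the names `BaseDataset`, `DenseDataset`, `SparseDataset` and the `dot` signature are made up for illustration): the training loop only talks to the parent class, each subclass supplies its own dot product, and the dense case could then go through BLAS via `np.dot`:

```python
import numpy as np
from scipy import sparse


class BaseDataset(object):
    """Hypothetical shared parent; the SGD loop only sees this interface."""

    def dot(self, w, i):
        """Dot product of weight vector w with sample i."""
        raise NotImplementedError


class DenseDataset(BaseDataset):
    """Dense rows: the dot product can be delegated to BLAS via np.dot."""

    def __init__(self, X):
        self.X = np.ascontiguousarray(X, dtype=np.float64)

    def dot(self, w, i):
        return float(np.dot(self.X[i], w))


class SparseDataset(BaseDataset):
    """CSR rows: iterate only over the stored (non-zero) feature indices."""

    def __init__(self, X_csr):
        self.data, self.indices, self.indptr = X_csr.data, X_csr.indices, X_csr.indptr

    def dot(self, w, i):
        start, end = self.indptr[i], self.indptr[i + 1]
        return float(np.dot(self.data[start:end], w[self.indices[start:end]]))


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = rng.rand(5, 3)
    w = rng.rand(3)
    dense = DenseDataset(X)
    sp_ds = SparseDataset(sparse.csr_matrix(X))
    # Both go through the same "dot" call; only the implementation differs.
    assert np.allclose(dense.dot(w, 2), sp_ds.dot(w, 2))
```

The same kind of dispatch should in principle be possible in Cython, since `cdef` methods defined on a shared parent `cdef class` can be overridden in subclasses, which is what the common base of `ArrayDataset` and `CSRDataset` would allow.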
assert y_i.shape[0] == y.shape[0] == sample_weight.shape[0]
if sp.issparse(X):
    dataset = CSRDataset(X.data, X.indptr, X.indices, y_i, sample_weight)
    intercept_decay = 0.01
Shouldn't this be float(X.data.shape[0]) / (X.shape[0] * X.shape[1])
instead of a magic constant?
Have you used that before? Does it give better results? If so, we should definitely choose the density instead of the magic constant.
A quick thought: by using the density we assume that, more or less, the sparsity pattern is the same for all features. This might not be the case for bag-of-words features (Zipf's law).
On the other hand, I've been using the magic constant for about 3 years now; it gave reasonable results (compared to no damping of the intercept update). We could make intercept_decay a parameter and use 1.0 as the dense default and 0.01 or the density as the sparse default. Sometimes it makes sense to change the intercept decay - e.g. when you have both (a large number of) sparse and (some) dense features.
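For comparison, here is what the two options from this thread look like side by side (a small, hypothetical example matrix; the variable names are only for illustration):

```python
import scipy.sparse as sp

# Hypothetical example: 1000 samples, 500 features, ~2% non-zero entries.
X = sp.random(1000, 500, density=0.02, format="csr", random_state=0)

# The magic constant used for sparse input in this PR:
intercept_decay = 0.01

# The density-based alternative suggested above:
intercept_decay_density = float(X.data.shape[0]) / (X.shape[0] * X.shape[1])

print(intercept_decay, intercept_decay_density)  # 0.01 vs. ~0.02
```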
Actually I don't know. I haven't tried to play with this in the past.
I would have initially agreed with @mblondel about the lack of a BLAS option in the dense case with the new API, but that would only make a speed difference if we did some kind of minibatch handling (which we don't) or if n_features was very large yet dense. Since SGD is especially interesting when n_samples is large too, such a dataset probably wouldn't fit in memory and the cost would likely be dominated by disk IO rather than BLAS-based dot products anyway. So I am ok with the current state, especially if it allows us to implement hashed feature cross-products in a lazy way.

A quick cosmit note: PEP 257 states that docstrings should be one line + blank line + a paragraph with details:

```python
def some_func(arg1, arg2):
    """Short oneline description without any newline char.

    More details in a paragraph. More details in a paragraph. More
    details in a paragraph. More details in a paragraph.
    """
```

Or for one-liners:

```python
def some_func(arg1, arg2):
    """Short oneline description without any newline char."""
```

Apart from the above inline remarks, 👍 for merging.
You got me - I recently studied sofia-ml's, vw's, and mahout's hashing approaches.

Agreed. Btw: do we have cross-package cython imports somewhere in the scikit so that I can peek at how this works?

Ok, this becomes an issue when we use Dataset in places other than sgd?

Great - I do have some ideas about how to consolidate multi-class and the binary special case, but I'm always looking for better/easier ways to get things done - thanks.
As long as we use …
# -----------------------------------------

cdef class Dataset:
    cdef int n_samples
Should be a Py_ssize_t to be 64-bit clean.
@larsmans, @ogrisel, @mblondel thanks for the feedback! @larsmans: I changed n_features and n_samples to Py_ssize_t. Regarding the sample set issue: Dataset encapsulates both the samples and the labels; the major reasons for this design decision are a) shuffling and b) streaming dataset implementations (e.g. reading from a socket).
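To illustrate the point about encapsulating samples and labels together, here is a simplified Python sketch (not the actual Cython `ArrayDataset`; the `next`/`shuffle` signatures and the class name are assumptions) showing how owning X, y and sample_weight in one object keeps them aligned during shuffling and hands out one example at a time:

```python
import numpy as np


class ArrayDatasetSketch(object):
    """Toy stand-in for ArrayDataset: owns X, y and sample_weight together."""

    def __init__(self, X, y, sample_weight, seed=0):
        self.X = np.ascontiguousarray(X, dtype=np.float64)
        self.y = np.asarray(y, dtype=np.float64)
        self.sample_weight = np.asarray(sample_weight, dtype=np.float64)
        self.index = np.arange(self.X.shape[0])
        self.pos = 0
        self.rng = np.random.RandomState(seed)

    def shuffle(self):
        # Permuting one index array keeps samples, labels and weights aligned.
        self.rng.shuffle(self.index)
        self.pos = 0

    def next(self):
        # Hand out the next (x_i, y_i, sample_weight_i) triple, wrapping around.
        i = self.index[self.pos % self.index.shape[0]]
        self.pos += 1
        return self.X[i], self.y[i], self.sample_weight[i]
```

A streaming implementation (e.g. one reading from a socket) could keep the same `next`/`shuffle` contract while pulling rows from an external source.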
Hey Peter. I just saw the PR and just wanted to say thank you for the refactoring and cleanup. This looks awesome! Multinomial logistic regression and multi-class hinge loss have been on my todo list for too long - maybe I'll get around to it now that you've made it much easier :)
👍 for merging (I can live with the remaining PEP257 violations)
@ogrisel: I fixed the PEP257 issues in fast_sgd.pyx - thx for reviewing!
Alright, shall we merge?
I think so - it's a good investment in terms of maintainability. I'll merge.
A major refactoring of the SGD extension module which allows us to dump `sgd_fast_sparse.pyx`.

It introduces a new dataset abstraction, `Dataset`, which basically provides a `next` and a `shuffle` method. There are currently two concrete implementations: `ArrayDataset`, which is backed by a c-style numpy array, and `CSRDataset`, which is backed by a scipy CSR matrix.

A `WeightVector` abstraction is introduced which basically wraps `wscale` and the np array `w` and provides methods such as `scale`, `add`, and `reset_wscale`.

I timed the refactoring on 20news and covertype; there are no noticeable differences in terms of runtime.

This refactoring reduces code complexity and should allow us to add features more easily. I'm thinking about various hashing tricks (e.g. feature cross-products via a new `WeightVector`, etc.) and different multi-class approaches such as multinomial LR and multi-class hinge loss.
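A rough Python sketch of the `WeightVector` idea (simplified; the exact Cython method signatures may differ): representing the weight vector as `wscale * w` makes scaling every weight, e.g. for the L2 penalty, an O(1) update; `add` compensates for the factor when applying a gradient step on the non-zero features, and `reset_wscale` folds the factor back into the array before it underflows:

```python
import numpy as np


class WeightVectorSketch(object):
    """Toy model: the effective weight vector is represented as wscale * w."""

    def __init__(self, n_features):
        self.w = np.zeros(n_features)
        self.wscale = 1.0

    def scale(self, c):
        # Scale *all* weights in O(1) by updating the factor only
        # (e.g. the L2 shrinkage applied at every SGD step).
        self.wscale *= c
        if self.wscale < 1e-9:
            self.reset_wscale()  # guard against numerical underflow

    def add(self, indices, values, c):
        # Perform w += c * x while compensating for the implicit factor,
        # touching only the non-zero entries of x.
        self.w[indices] += (c / self.wscale) * values

    def dot(self, indices, values):
        # Inner product <w, x> over the non-zero entries of x.
        return self.wscale * float(np.dot(self.w[indices], values))

    def reset_wscale(self):
        # Fold the accumulated factor back into the array.
        self.w *= self.wscale
        self.wscale = 1.0
```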