
[MRG+2] Added 'average' option to passive aggressive classifier/regressor. #4939


Merged
merged 11 commits into from
Sep 30, 2016

Conversation

aesuli
Contributor

@aesuli aesuli commented Jul 8, 2015

Unlike SGDClassifier (and SGDRegressor), PassiveAggressiveClassifier (and PassiveAggressiveRegressor) does not expose the 'average' option of the BaseSGD class.
The 'average' option helps smooth out the impact of rarely observed features. For example, when used in combination with the HashingVectorizer on text, it helps compensate for the lack of idf information by lowering the impact of low-idf (i.e., rare) features.
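A minimal sketch of the proposed usage (toy synthetic data, and assuming the option follows BaseSGD's averaging semantics):

```python
# Hedged sketch: synthetic data, assuming `average` behaves as in BaseSGD.
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = (X[:, 1] > 0).astype(int)  # label is the sign of the second feature

# average=True averages the weight vectors seen during training
clf = PassiveAggressiveClassifier(average=True, random_state=0)
clf.fit(X, y)
assert clf.score(X, y) > 0.9  # linearly separable data: near-perfect training score
```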

The following is an example of using averaging both on SGDClassifier and PassiveAggressiveClassifier.
The data is the Bo Pang movie review dataset ( https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz ).
The results show that the averaging implemented in BaseSGD works for both classifiers, smoothing out the "noise" features.

[plot: averagedemo — histograms of the evaluation score for the six vectorizer/classifier configurations]

The following is the code to replicate the experiment that generated the above plot:

import argparse
import codecs
import sys
import matplotlib.pyplot as plt
import numpy as np

from sklearn.cross_validation import ShuffleSplit
from sklearn import cross_validation
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier
from sklearn.metrics import SCORERS
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def randomized_train_test(estimator, data, target, test_size, n_iter, scorer, random_state, n_jobs):
    cv = ShuffleSplit(len(data), n_iter=n_iter, test_size=test_size, random_state=random_state)

    scores = cross_validation.cross_val_score(estimator, data, target,
                                              scoring=scorer, cv=cv, n_jobs=n_jobs)
    score_mean = scores.mean()
    score_std = scores.std()
    return score_mean, score_std, scores


def plot(scorer_name, results, grid_w=None, grid_h=None):
    all_scores = list()
    for _, (_, _, scores) in results:
        all_scores.extend(scores)

    count = len(results)

    # np.histogram needs an integer bin count; Python 3 division yields a float
    n_bins = int(10 + len(all_scores) /
                 (count * max(1, np.log2(len(all_scores) / count))))
    bins = np.histogram(all_scores, bins=n_bins)[1]

    if grid_w is None:
        grid_w = int(np.ceil(np.sqrt(count)))
    if grid_h is None:
        grid_h = int(np.ceil(count / grid_w))

    subplots = list()

    for i, (name, (score_mean, score_std, scores)) in enumerate(results):
        if len(subplots) > 0:
            subplot = plt.subplot2grid((grid_h, grid_w),
                                       (int(i / grid_w), int(i % grid_w)),
                                       sharex=subplots[0],
                                       sharey=subplots[0])
        else:
            subplot = plt.subplot2grid((grid_h, grid_w),
                                       (int(i / grid_w), int(i % grid_w)))

        subplot.set_title('%s mean %s=%0.3f std=%0.3f' % (name, scorer_name, score_mean, score_std))
        subplot.hist(scores, alpha=0.5, bins=bins)
        subplots.append(subplot)

    ylim = subplots[0].get_ylim()

    for (name, (score_mean, score_std, _)), subplot in zip(results, subplots):
        subplot.plot([score_mean for _ in ylim], ylim, '--k', linewidth=2)
        subplot.plot([score_mean - score_std for _ in ylim], ylim, ':k', linewidth=2)
        subplot.plot([score_mean + score_std for _ in ylim], ylim, ':k', linewidth=2)

# download demo data from:
# https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

def main():
    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
    parser = argparse.ArgumentParser(description='')
    parser.add_argument('-p', '--path', help='Input path', required=True)
    parser.add_argument('-r', '--random', help='Randomization seed (default: 0)', type=int, default=0)
    parser.add_argument('-s', '--scorer',
                        help='Evaluation function (default: f1, values: %s)' % ' '.join(SCORERS.keys()), type=str,
                        default='f1')
    parser.add_argument('-i', '--iter', help='Number of randomized runs (default: 100)', type=int, default=100)
    parser.add_argument('-t', '--test', help='Portions of examples to be used as test set (default: 0.2)', type=float,
                        default=0.2)
    args = parser.parse_args()

    test_size = args.test
    random_state = args.random
    n_iter = args.iter
    n_jobs = -1
    scorer = SCORERS[args.scorer]

    examples = load_files(args.path, shuffle=False)

    data = examples.data
    target = examples.target

    print("Total %i Train/test split %0.2f/%0.2f %i/%i" % (len(data),
                                                           (1 - args.test), args.test,
                                                           int(len(data) * (1 - args.test)),
                                                           int(len(data) * args.test)))

    experiments = list()

    experiments.append(('pa-tfidf', Pipeline([
        ('vec', TfidfVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state)),
    ])))

    experiments.append(('pa-hash-no-average', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state)),
    ])))

    experiments.append(('pa-hash-average-100', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state, average=100)),
    ])))


    experiments.append(('sgdc-tfidf', Pipeline([
        ('vec', TfidfVectorizer()),
        ('clf', SGDClassifier(random_state=random_state)),
    ])))

    experiments.append(('sgdc-hash-no-average', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', SGDClassifier(random_state=random_state)),
    ])))

    experiments.append(('sgdc-hash-average-100', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', SGDClassifier(random_state=random_state, average=100)),
    ])))

    results = list()

    for name, estimator in experiments:
        print(name)
        results.append((name, list(randomized_train_test(estimator, data, target, test_size, n_iter,
                                                         scorer,
                                                         random_state, n_jobs))))

    print(results)

    plt.figure()
    plot(args.scorer, results, grid_w=3, grid_h=2)
    plt.show()


if __name__ == "__main__":
    main()

@agramfort
Member

can you add a tiny test?

@aesuli
Contributor Author

aesuli commented Jul 8, 2015

Yes. But it is not clear to me what the test should check. Have you any
example of similar tests to point me at?
On 8 Jul 2015, at 15:07, "Alexandre Gramfort" notifications@github.com wrote:

can you add a tiny test?



@aesuli
Contributor Author

aesuli commented Jul 9, 2015

I have extended the original tests for the PassiveAggressive classifier/regressor so that they are also run and checked with the average option turned on.
Note that the code that actually performs the averaging is implemented in the BaseSGDClassifier, BaseSGDRegressor, and BaseSGD classes, and it is extensively tested in linear_model/tests/test_sgd.py.

@agramfort
Member

looks pretty good but can you avoid duplicating so much code for testing?

also I feel that the average param is not really tested. I am sure that the tests would still pass if the average param was ignored internally.

@amueller
Member

@ogrisel are there logs we can look into for the failure? Or restart AppVeyor?

@aesuli
Contributor Author

aesuli commented Jul 13, 2015

@agramfort I have added the same checks from test_sgd.py to verify that averaging is actually performed (checking for presence of attributes added exclusively by the averaging code).
With respect to code duplication, it is the minimal code to get the data, fit the classifier, and check its accuracy.

@TomDLT
Member

TomDLT commented Jul 27, 2015

Looks good to me.

To reduce test duplication, you can do something like:

def test_classifier_partial_fit():
    classes = np.unique(y)
    for data in (X, X_csr):
        for average in (False, True):
            clf = PassiveAggressiveClassifier(C=1.0, fit_intercept=True,
                                              random_state=0, average=average)
            for t in range(30):
                clf.partial_fit(data, y, classes)
            score = clf.score(data, y)
            assert_greater(score, 0.79)
            if average:
                assert_true(hasattr(clf, 'average_coef_'))

and so on...
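A regressor-side analogue of the suggested test might look like this (a sketch; the data and threshold are illustrative, not taken from the PR):

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
true_w = np.array([1., -2., 0.5, 0., 3.])
y = X @ true_w  # noiseless linear target

# fit once without and once with averaging; both should fit well
for average in (False, True):
    reg = PassiveAggressiveRegressor(C=1.0, random_state=0, average=average)
    reg.fit(X, y)
    score = reg.score(X, y)  # R^2 on the training data
    assert score > 0.8, (average, score)
```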

@aesuli
Contributor Author

aesuli commented Jul 29, 2015

@TomDLT thank you for the suggestion, I followed your advice.

@TomDLT
Member

TomDLT commented Jul 30, 2015

Good job, it looks better!
Travis is not happy, but it is unrelated; the master branch also fails.

@amueller
Member

Travis errors are caused by #5045 and unrelated.

@amueller
Member

LGTM (though testing for the presence of the attributes in a single test would probably have sufficed).

@amueller amueller changed the title Added 'average' option to passive aggressive classifier/regressor. [MRG] Added 'average' option to passive aggressive classifier/regressor. Jul 30, 2015
@aesuli aesuli closed this Sep 1, 2015
@aesuli aesuli reopened this Sep 1, 2015
Conflicts:
	sklearn/linear_model/passive_aggressive.py
…into aesuli-pa-average

Member

@TomDLT TomDLT left a comment


We could have merged this a long time ago; sorry @aesuli for the slow response.
I added a few comments, and I think we can merge it.

result in the ``coef_`` attribute. If set to an int greater than 1,
averaging will begin once the total number of samples seen reaches
average. So average=10 will begin averaging after seeing 10 samples.

Member


You should add a tag .. versionadded:: 0.19 as above.

@@ -1,3 +1,5 @@
import unittest
from nose.tools import assert_true
Member


Please instead use from sklearn.utils.testing import assert_true, which centralizes the nose dependency.

@@ -1,3 +1,5 @@
import unittest
Member


I guess you can remove this import.

score = clf.score(data, y)
assert_greater(score, 0.79)
if average:
    assert_true(hasattr(clf, 'average_coef_'))
Member


Is there a straightforward way to test the sanity of these values?

Contributor Author


I don't know what kind of relevant characteristics these weights have that could be checked.
The check on the score at line 81 is an indirect check that they produce sound classifications.
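One such indirect sanity check, as a hedged sketch on synthetic data (not the PR's actual test code), could compare the averaged solution against the plain one:

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # linearly separable labels

clf_plain = PassiveAggressiveClassifier(average=False, random_state=0).fit(X, y)
clf_avg = PassiveAggressiveClassifier(average=True, random_state=0).fit(X, y)

# both variants should classify the separable data well...
assert clf_plain.score(X, y) > 0.9
assert clf_avg.score(X, y) > 0.9
# ...while the averaged weights differ from the last-iterate weights,
# showing that averaging actually took place
assert not np.allclose(clf_plain.coef_, clf_avg.coef_)
```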

@@ -59,9 +59,15 @@ class PassiveAggressiveClassifier(BaseSGDClassifier):
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``

.. versionadded:: 0.17
.. versionadded:: 0.17
Member


Why this change in indentation?

warm_start=warm_start,
class_weight=class_weight,
n_jobs=n_jobs)
penalty=None,
Member


Why this change in indentation?

result in the ``coef_`` attribute. If set to an int greater than 1,
averaging will begin once the total number of samples seen reaches
average. So average=10 will begin averaging after seeing 10 samples.

Member


Typo: veersionadded (should be versionadded).

verbose=verbose,
random_state=random_state,
warm_start=warm_start)
penalty=None,
Member


indentation

@TomDLT TomDLT changed the title [MRG] Added 'average' option to passive aggressive classifier/regressor. [MRG+1] Added 'average' option to passive aggressive classifier/regressor. Sep 23, 2016
@jnothman
Member

LGTM

@jnothman
Member

Please add an entry to what's new and we can merge.

@aesuli
Contributor Author

aesuli commented Sep 29, 2016

@jnothman Is it a "new feature" or an "enhancement"?

@jnothman
Member

Enhancement is fine. Could be either, but features are usually bigger (e.g. new estimators) and highlighted in a release.

@jnothman jnothman changed the title [MRG+1] Added 'average' option to passive aggressive classifier/regressor. [MRG+2] Added 'average' option to passive aggressive classifier/regressor. Sep 30, 2016
@jnothman jnothman merged commit 69c8382 into scikit-learn:master Sep 30, 2016
@jnothman
Member

Thanks, @aesuli

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

5 participants