
[MRG+2] Added 'average' option to passive aggressive classifier/regressor. #4939


Merged
merged 11 commits into from
Sep 30, 2016

Conversation

aesuli
Contributor

@aesuli aesuli commented Jul 8, 2015

Unlike SGDClassifier (and SGDRegressor), PassiveAggressiveClassifier (and PassiveAggressiveRegressor) does not expose the 'average' option of the BaseSGD class.
The 'average' option helps smooth out the impact of rarely observed features. For example, when used in combination with the HashingVectorizer on text, it helps compensate for the lack of idf information by lowering the impact of low-idf (i.e., rare) features.
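A minimal sketch of the proposed usage (toy synthetic data, and assuming the option follows BaseSGD's averaging semantics):

```python
# Hedged sketch: synthetic data, assuming `average` behaves as in BaseSGD.
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = (X[:, 1] > 0).astype(int)  # label is the sign of the second feature

# average=True averages the weight vectors seen during training
clf = PassiveAggressiveClassifier(average=True, random_state=0)
clf.fit(X, y)
assert clf.score(X, y) > 0.9  # linearly separable data: near-perfect training score
```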

The following is an example of using averaging both on SGDClassifier and PassiveAggressiveClassifier.
The data is the Bo Pang movie review dataset ( https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz ).
The results show that the averaging implemented in BaseSGD works for both classifiers, smoothing out the "noise" features.

[plot: averagedemo — histograms of the evaluation score for the six vectorizer/classifier configurations]

The following is the code to replicate the experiment that generated the above plot:

import argparse
import codecs
import sys
import matplotlib.pyplot as plt
import numpy as np

from sklearn.cross_validation import ShuffleSplit
from sklearn import cross_validation
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier
from sklearn.metrics import SCORERS
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def randomized_train_test(estimator, data, target, test_size, n_iter, scorer, random_state, n_jobs):
    cv = ShuffleSplit(len(data), n_iter=n_iter, test_size=test_size, random_state=random_state)

    scores = cross_validation.cross_val_score(estimator, data, target,
                                              scoring=scorer, cv=cv, n_jobs=n_jobs)
    score_mean = scores.mean()
    score_std = scores.std()
    return score_mean, score_std, scores


def plot(scorer_name, results, grid_w=None, grid_h=None):
    all_scores = list()
    for _, (_, _, scores) in results:
        all_scores.extend(scores)

    count = len(results)

    # np.histogram needs an integer bin count; Python 3 division yields a float
    n_bins = int(10 + len(all_scores) /
                 (count * max(1, np.log2(len(all_scores) / count))))
    bins = np.histogram(all_scores, bins=n_bins)[1]

    if grid_w is None:
        grid_w = int(np.ceil(np.sqrt(count)))
    if grid_h is None:
        grid_h = int(np.ceil(count / grid_w))

    subplots = list()

    for i, (name, (score_mean, score_std, scores)) in enumerate(results):
        if len(subplots) > 0:
            subplot = plt.subplot2grid((grid_h, grid_w),
                                       (int(i / grid_w), int(i % grid_w)),
                                       sharex=subplots[0],
                                       sharey=subplots[0])
        else:
            subplot = plt.subplot2grid((grid_h, grid_w),
                                       (int(i / grid_w), int(i % grid_w)))

        subplot.set_title('%s mean %s=%0.3f std=%0.3f' % (name, scorer_name, score_mean, score_std))
        subplot.hist(scores, alpha=0.5, bins=bins)
        subplots.append(subplot)

    ylim = subplots[0].get_ylim()

    for (name, (score_mean, score_std, _)), subplot in zip(results, subplots):
        subplot.plot([score_mean for _ in ylim], ylim, '--k', linewidth=2)
        subplot.plot([score_mean - score_std for _ in ylim], ylim, ':k', linewidth=2)
        subplot.plot([score_mean + score_std for _ in ylim], ylim, ':k', linewidth=2)

# download demo data from:
# https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

def main():
    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
    parser = argparse.ArgumentParser(description='')
    parser.add_argument('-p', '--path', help='Input path', required=True)
    parser.add_argument('-r', '--random', help='Randomization seed (default: 0)', type=int, default=0)
    parser.add_argument('-s', '--scorer',
                        help='Evaluation function (default: f1, values: %s)' % ' '.join(SCORERS.keys()), type=str,
                        default='f1')
    parser.add_argument('-i', '--iter', help='Number of randomized runs (default: 100)', type=int, default=100)
    parser.add_argument('-t', '--test', help='Portions of examples to be used as test set (default: 0.2)', type=float,
                        default=0.2)
    args = parser.parse_args()

    test_size = args.test
    random_state = args.random
    n_iter = args.iter
    n_jobs = -1
    scorer = SCORERS[args.scorer]

    examples = load_files(args.path, shuffle=False)

    data = examples.data
    target = examples.target

    print("Total %i Train/test split %0.2f/%0.2f %i/%i" % (len(data),
                                                           (1 - args.test), args.test,
                                                           int(len(data) * (1 - args.test)),
                                                           int(len(data) * args.test)))

    experiments = list()

    experiments.append(('pa-tfidf', Pipeline([
        ('vec', TfidfVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state)),
    ])))

    experiments.append(('pa-hash-no-average', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state)),
    ])))

    experiments.append(('pa-hash-average-100', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state, average=100)),
    ])))


    experiments.append(('sgdc-tfidf', Pipeline([
        ('vec', TfidfVectorizer()),
        ('clf', SGDClassifier(random_state=random_state)),
    ])))

    experiments.append(('sgdc-hash-no-average', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', SGDClassifier(random_state=random_state)),
    ])))

    experiments.append(('sgdc-hash-average-100', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', SGDClassifier(random_state=random_state, average=100)),
    ])))

    results = list()

    for name, estimator in experiments:
        print(name)
        results.append((name, list(randomized_train_test(estimator, data, target, test_size, n_iter,
                                                         scorer,
                                                         random_state, n_jobs))))

    print(results)

    plt.figure()
    plot(args.scorer, results, grid_w=3, grid_h=2)
    plt.show()


if __name__ == "__main__":
    main()

@agramfort
Member

can you add a tiny test?

@aesuli
Contributor Author

aesuli commented Jul 8, 2015

Yes. But it is not clear to me what the test should check. Have you any
example of similar tests to point me at?
On 8 Jul 2015, at 15:07, "Alexandre Gramfort" notifications@github.com wrote:

can you add a tiny test?



@aesuli
Contributor Author

aesuli commented Jul 9, 2015

I have extended the original tests for the PassiveAggressive classifier/regressor so that they are also run and checked with the average option turned on.
Note that the code that actually performs the averaging is implemented in the BaseSGDClassifier, BaseSGDRegressor, and BaseSGD classes, and it is extensively tested in linear_model/tests/test_sgd.py.

@agramfort
Member

looks pretty good but can you avoid duplicating so much code for testing?

also I feel that the average param is not really tested. I am sure that the tests would still pass if the average param was ignored internally.

@amueller
Member

@ogrisel are there logs we can look into for the failure? Or restart AppVeyor?

@aesuli
Contributor Author

aesuli commented Jul 13, 2015

@agramfort I have added the same checks from test_sgd.py to verify that averaging is actually performed (checking for presence of attributes added exclusively by the averaging code).
With respect to code duplication, it is the minimal code to get the data, fit the classifier, and check its accuracy.

@TomDLT
Member

TomDLT commented Jul 27, 2015

Looks good to me.

To reduce test duplication, you can do something like:

def test_classifier_partial_fit():
    classes = np.unique(y)
    for data in (X, X_csr):
        for average in (False, True):
            clf = PassiveAggressiveClassifier(C=1.0, fit_intercept=True,
                                              random_state=0, average=average)
            for t in range(30):
                clf.partial_fit(data, y, classes)
            score = clf.score(data, y)
            assert_greater(score, 0.79)
            if average:
                assert_true(hasattr(clf, 'average_coef_'))

and so on...
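A regressor-side analogue of the suggested test might look like this (a sketch; the data and threshold are illustrative, not taken from the PR):

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
true_w = np.array([1., -2., 0.5, 0., 3.])
y = X @ true_w  # noiseless linear target

# fit once without and once with averaging; both should fit well
for average in (False, True):
    reg = PassiveAggressiveRegressor(C=1.0, random_state=0, average=average)
    reg.fit(X, y)
    score = reg.score(X, y)  # R^2 on the training data
    assert score > 0.8, (average, score)
```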

@aesuli
Contributor Author

aesuli commented Jul 29, 2015

@TomDLT thank you for the suggestion, I followed your advice.

@TomDLT
Member

TomDLT commented Jul 30, 2015

Good job, it looks better!
Travis is not happy, but it is unrelated; the master branch also fails.

@amueller
Member

Travis errors are caused by #5045 and unrelated.

@amueller
Member

LGTM (though testing for the presence of the attributes in a single test would probably have sufficed).

@amueller amueller changed the title Added 'average' option to passive aggressive classifier/regressor. [MRG] Added 'average' option to passive aggressive classifier/regressor. Jul 30, 2015
@aesuli aesuli closed this Sep 1, 2015
@aesuli aesuli reopened this Sep 1, 2015
Conflicts:
	sklearn/linear_model/passive_aggressive.py
…into aesuli-pa-average

Member

@TomDLT TomDLT left a comment


We could have merged this a long time ago; sorry @aesuli for the slow response.
I added a few comments, and I think we can merge it.

result in the ``coef_`` attribute. If set to an int greater than 1,
averaging will begin once the total number of samples seen reaches
average. So average=10 will begin averaging after seeing 10 samples.

Member


You should add a tag .. versionadded:: 0.19 as above.

@@ -1,3 +1,5 @@
import unittest
from nose.tools import assert_true
Member


Please instead use from sklearn.utils.testing import assert_true, which centralizes the nose dependency.

@@ -1,3 +1,5 @@
import unittest
Member


I guess you can remove this import.

score = clf.score(data, y)
assert_greater(score, 0.79)
if average:
    assert_true(hasattr(clf, 'average_coef_'))
Member


Is there a straightforward way to test the sanity of these values?

Contributor Author


I don't know what kind of relevant characteristics these weights have that could be checked.
The check on the score at line 81 is an indirect check that they produce sound classifications.
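One such indirect sanity check, as a hedged sketch on synthetic data (not the PR's actual test code), could compare the averaged solution against the plain one:

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # linearly separable labels

clf_plain = PassiveAggressiveClassifier(average=False, random_state=0).fit(X, y)
clf_avg = PassiveAggressiveClassifier(average=True, random_state=0).fit(X, y)

# both variants should classify the separable data well...
assert clf_plain.score(X, y) > 0.9
assert clf_avg.score(X, y) > 0.9
# ...while the averaged weights differ from the last-iterate weights,
# showing that averaging actually took place
assert not np.allclose(clf_plain.coef_, clf_avg.coef_)
```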

@@ -59,9 +59,15 @@ class PassiveAggressiveClassifier(BaseSGDClassifier):
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``

.. versionadded:: 0.17
.. versionadded:: 0.17
Member


Why this change in indentation?

warm_start=warm_start,
class_weight=class_weight,
n_jobs=n_jobs)
penalty=None,
Member


Why this change in indentation?

result in the ``coef_`` attribute. If set to an int greater than 1,
averaging will begin once the total number of samples seen reaches
average. So average=10 will begin averaging after seeing 10 samples.

Member


Typo: veersionadded (should be versionadded).

verbose=verbose,
random_state=random_state,
warm_start=warm_start)
penalty=None,
Member


indentation

@TomDLT TomDLT changed the title [MRG] Added 'average' option to passive aggressive classifier/regressor. [MRG+1] Added 'average' option to passive aggressive classifier/regressor. Sep 23, 2016
@jnothman
Member

LGTM

@jnothman
Member

Please add an entry to what's new and we can merge.

@aesuli
Contributor Author

aesuli commented Sep 29, 2016

@jnothman Is it a "new feature" or an "enhancement"?

@jnothman
Member

Enhancement is fine. Could be either, but features are usually bigger (e.g. new estimators) and highlighted in a release.

@jnothman jnothman changed the title [MRG+1] Added 'average' option to passive aggressive classifier/regressor. [MRG+2] Added 'average' option to passive aggressive classifier/regressor. Sep 30, 2016
@jnothman jnothman merged commit 69c8382 into scikit-learn:master Sep 30, 2016
@jnothman
Member

Thanks, @aesuli

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

5 participants