
[MRG+1] Stronger tests for variable importances #5261


Merged: glouppe merged 5 commits into scikit-learn:master from check-importances on Sep 14, 2015

Conversation

@glouppe (Contributor) commented Sep 12, 2015:

This PR adds stronger tests for variable importances in forests, including:

  • checks for all forests and all criteria (as proposed earlier by @arjoly);
  • checks for invariance with respect to sample weight scaling (see the sketch after this list);
  • checks for the convergence of the variable importances of totally randomized trees towards their true values.
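
A minimal sketch of the sample-weight scaling check, written for this summary (the dataset, estimator, and tolerance below are illustrative, not the exact test code added by the PR):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

est = RandomForestClassifier(n_estimators=10, random_state=0)
est.fit(X, y, sample_weight=np.ones(y.shape[0]))
importances = est.feature_importances_

for scale in [10, 100, 1000]:
    est.fit(X, y, sample_weight=scale * np.ones(y.shape[0]))
    # Multiplying all sample weights by a common factor should leave the
    # importances (nearly) unchanged, up to small numerical error.
    np.testing.assert_array_almost_equal(importances,
                                         est.feature_importances_,
                                         decimal=3)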

@@ -663,7 +772,7 @@ def check_sparse_input(name, X, X_sparse, y):

 def test_sparse_input():
     X, y = datasets.make_multilabel_classification(random_state=0,
-                                                   n_samples=40)
+                                                   n_samples=50)
Member:

Not that I disagree, but what is the reason for this change?

Contributor Author:

I was trying to make the tree construction more stable by automatically rescaling the sample weights. Nothing worked with much success (it just moved the issue somewhere else...), but this test failed a few times because of it. Using more samples made the test more stable.

@jmschrei (Member):

I've made some comments, but I support this.

@glouppe (Contributor Author) commented Sep 12, 2015:

Thanks for your review @jmschrei! I believe I addressed all your suggestions.

importances = est.feature_importances_
assert_true(np.all(importances >= 0.0))

for scale in [10, 100, 1000]:
Member:

Thanks for including my suggestion. I apologize for being pedantic, but I meant a scale of more like [1e-8, 1e-4, 1e-1, 1e1, 1e4, 1e8]. Do you think this test is sufficient?

Contributor Author:

Unfortunately, with such larger/smaller scale factors, the tests don't pass anymore. Numerical discrepancies creep in, leading to slightly different impurity values... I am not sure we can do anything about that. (My idea was to internally scale down sample_weight by 1. / sample_weight.max(), but this just moves the numerical issues somewhere else...)

So for me this test is more a safeguard that nothing obviously wrong is happening, rather than a bulletproof test.
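
A minimal illustration of the internal rescaling idea mentioned above (a hypothetical helper sketched for this discussion, not the actual scikit-learn tree code):

import numpy as np

def rescaled_sample_weight(sample_weight):
    # Hypothetical: scale the weights so the largest is 1.0, keeping the
    # impurity computations in a comparable numerical range regardless of
    # the scale the user passed in.
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    return sample_weight / sample_weight.max()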

Member:

Ah, okay. Thanks for checking!

Member:

but 1e-4 would still work, right?

@jmschrei (Member):

Other than my one comment, LGTM.

@jmschrei (Member):

With the last comment resolved, this LGTM, +1.

@glouppe changed the title from [MRG] Stronger tests for variable importances to [MRG+1] Stronger tests for variable importances on Sep 12, 2015
assert_equal(n_important, 3)

X_new = est.transform(X, threshold="mean")
assert_less(0 < X_new.shape[1], X.shape[1])
Member:

This line confuses me.
X_new.shape[1] == 3, so the first expression evaluates to True, which is cast to 1, which is less than X.shape[1] == 10.
So the assert would hold even if X_new.shape[1] == 11; in fact it succeeds for all values of X_new.shape[1] as long as X.shape[1] > 1.

I guess remove the 0 <? (See the corrected line below.)
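
For clarity, the corrected assertion (dropping the spurious 0 <) would read:

assert_less(X_new.shape[1], X.shape[1])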

Contributor Author:

Whoops :) This one has been there for a long time without anybody noticing.

@amueller (Member):

Thanks for more tests :) Looks good, though I didn't check the math in the theoretical importances.

@glouppe (Contributor Author) commented Sep 13, 2015:

@amueller Thanks for the comments! I fixed those.


# Check correctness
assert_almost_equal(entropy(y), sum(importances))
assert_less(np.abs(true_importances - importances).mean(), 0.01)
Member:

Why do you use assert_less over assert_array_almost_equal?

Contributor Author:

I find it easier to control the quality of the approximation this way: rather than requiring every single importance value to match up to some digit, I only require their mean absolute error to be small.
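
For illustration, the two assertion styles being compared (a sketch with made-up values; true_importances and importances stand for the arrays in the test above):

import numpy as np

# Illustrative arrays standing in for the true and estimated importances.
true_importances = np.array([0.3, 0.3, 0.3, 0.05, 0.05])
importances = true_importances + np.array([0.004, -0.003, 0.002, -0.001, 0.001])

# Element-wise check: every single value must match to the given precision
# (the default of 6 decimals would fail here; 2 decimals passes).
np.testing.assert_array_almost_equal(true_importances, importances, decimal=2)

# Aggregate check (the style used in the PR): only the mean absolute
# deviation must stay below the threshold, tolerating larger per-feature error.
assert np.abs(true_importances - importances).mean() < 0.01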

@arjoly (Member) commented Sep 13, 2015:

I would add a test to check the parallel computation of importances.
Something like:

importances = est.feature_importances_
est.set_params(n_jobs=2)
importances_parallel = est.feature_importances_
assert_array_equal(importances, importances_parallel)

@arjoly (Member) commented Sep 13, 2015:

Otherwise looks good.

@glouppe (Contributor Author) commented Sep 14, 2015:

Thanks for the review. I addressed the last comments. Merging.

glouppe added a commit that referenced this pull request on Sep 14, 2015:
[MRG+1] Stronger tests for variable importances
@glouppe merged commit b099a59 into scikit-learn:master on Sep 14, 2015
@jmschrei mentioned this pull request on Sep 14, 2015
@glouppe deleted the check-importances branch on October 20, 2015 07:38