
[MRG+1] Stronger tests for variable importances #5261


Merged: glouppe merged 5 commits into scikit-learn:master from check-importances on Sep 14, 2015

Conversation

@glouppe (Contributor) commented Sep 12, 2015:

This PR adds stronger tests for variable importances in forests, including:

  • checks for all forests and all criteria (as proposed earlier by @arjoly);
  • checks for invariance with respect to sample weight scaling (see the sketch after this list);
  • checks for the convergence of the variable importances of totally randomized trees towards their true values.
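
A minimal sketch of the sample-weight scaling check, written for this summary (the dataset, estimator, and tolerance below are illustrative, not the exact test code added by the PR):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

est = RandomForestClassifier(n_estimators=10, random_state=0)
est.fit(X, y, sample_weight=np.ones(y.shape[0]))
importances = est.feature_importances_

for scale in [10, 100, 1000]:
    est.fit(X, y, sample_weight=scale * np.ones(y.shape[0]))
    # Multiplying all sample weights by a common factor should leave the
    # importances (nearly) unchanged, up to small numerical error.
    np.testing.assert_array_almost_equal(importances,
                                         est.feature_importances_,
                                         decimal=3)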

@@ -663,7 +772,7 @@ def check_sparse_input(name, X, X_sparse, y):

 def test_sparse_input():
     X, y = datasets.make_multilabel_classification(random_state=0,
-                                                   n_samples=40)
+                                                   n_samples=50)
Member:

Not that I disagree, but what is the reason for this change?

Contributor Author:

I was trying to make the tree construction more stable by automatically rescaling the sample weights. Nothing worked with much success (it just moved the issue somewhere else...), but this test failed a few times because of it. Using more samples made the test more stable.

@jmschrei (Member):

I've made some comments, but I support this.

@glouppe (Contributor Author) commented Sep 12, 2015:

Thanks for your review @jmschrei! I believe I addressed all your suggestions.

importances = est.feature_importances_
assert_true(np.all(importances >= 0.0))

for scale in [10, 100, 1000]:
Member:

Thanks for including my suggestion. I apologize for being pedantic, but I meant a scale of more like [1e-8, 1e-4, 1e-1, 1e1, 1e4, 1e8]. Do you think this test is sufficient?

Contributor Author:

Unfortunately, with such larger/smaller scale factors, the tests don't pass anymore. Numerical discrepancies creep in, leading to slightly different impurity values... I am not sure we can do anything about that. (My idea was to internally scale down sample_weight by 1. / sample_weight.max(), but this just moves the numerical issues somewhere else...)

So for me this test is more a safeguard that nothing obviously wrong is happening, rather than a bulletproof test.
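
A minimal illustration of the internal rescaling idea mentioned above (a hypothetical helper sketched for this discussion, not the actual scikit-learn tree code):

import numpy as np

def rescaled_sample_weight(sample_weight):
    # Hypothetical: scale the weights so the largest is 1.0, keeping the
    # impurity computations in a comparable numerical range regardless of
    # the scale the user passed in.
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    return sample_weight / sample_weight.max()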

Member:

Ah, okay. Thanks for checking!

Member:

but 1e-4 would still work, right?

@jmschrei (Member):

Other than my one comment, LGTM.

@jmschrei (Member):

With the last comment resolved, this LGTM, +1.

@glouppe changed the title from [MRG] Stronger tests for variable importances to [MRG+1] Stronger tests for variable importances on Sep 12, 2015
assert_equal(n_important, 3)

X_new = est.transform(X, threshold="mean")
assert_less(0 < X_new.shape[1], X.shape[1])
Member:

This line confuses me.
X_new.shape[1] == 3, so the first expression evaluates to True, which is cast to 1, which is less than X.shape[1] == 10.
So the assert would hold even if X_new.shape[1] == 11; in fact it succeeds for all values of X_new.shape[1] as long as X.shape[1] > 1.

I guess remove the 0 <? (See the corrected line below.)
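
For clarity, the corrected assertion (dropping the spurious 0 <) would read:

assert_less(X_new.shape[1], X.shape[1])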

Contributor Author:

Whoops :) This one has been there for a long time without anybody noticing.

@amueller (Member):

Thanks for more tests :) Looks good, though I didn't check the math in the theoretical importances.

@glouppe (Contributor Author) commented Sep 13, 2015:

@amueller Thanks for the comments! I fixed those.


# Check correctness
assert_almost_equal(entropy(y), sum(importances))
assert_less(np.abs(true_importances - importances).mean(), 0.01)
Member:

Why do you use assert_less over assert_array_almost_equal?

Contributor Author:

I find it easier to control the quality of the approximation this way: rather than requiring every single importance value to match up to some digit, I only require their mean absolute error to be small.
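
For illustration, the two assertion styles being compared (a sketch with made-up values; true_importances and importances stand for the arrays in the test above):

import numpy as np

# Illustrative arrays standing in for the true and estimated importances.
true_importances = np.array([0.3, 0.3, 0.3, 0.05, 0.05])
importances = true_importances + np.array([0.004, -0.003, 0.002, -0.001, 0.001])

# Element-wise check: every single value must match to the given precision
# (the default of 6 decimals would fail here; 2 decimals passes).
np.testing.assert_array_almost_equal(true_importances, importances, decimal=2)

# Aggregate check (the style used in the PR): only the mean absolute
# deviation must stay below the threshold, tolerating larger per-feature error.
assert np.abs(true_importances - importances).mean() < 0.01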

@arjoly (Member) commented Sep 13, 2015:

I would add a test to check the parallel computation of importances.
Something like:

importances = est.feature_importances_
est.set_params(n_jobs=2)
importances_parallel = est.feature_importances_
assert_array_equal(importances, importances_parallel)

@arjoly (Member) commented Sep 13, 2015:

Otherwise looks good.

@glouppe (Contributor Author) commented Sep 14, 2015:

Thanks for the review. I addressed the last comments. Merging.

glouppe added a commit that referenced this pull request on Sep 14, 2015:
[MRG+1] Stronger tests for variable importances
@glouppe merged commit b099a59 into scikit-learn:master on Sep 14, 2015
@jmschrei mentioned this pull request on Sep 14, 2015
@glouppe deleted the check-importances branch on October 20, 2015 07:38