ENH Remove unnecessary OOB computation when n_more_estimators == 0 #26318


Merged: 11 commits merged into scikit-learn:main on May 30, 2023

Conversation


@choo8 choo8 commented May 2, 2023

Reference Issues/PRs

Fixes #20435, based on the comments by @NicolasHug (#20435 (comment))

What does this implement/fix? Explain your changes.

Removes unnecessary OOB computation when n_more_estimators == 0

Any other comments?

I understand from our previous conversation (#24579 (comment)) that @glemaitre would like some unit tests for this change.

I noticed that the unit test below already checks the case where oob_score is toggled from False to True:
```python
def check_warm_start_oob(name):
    # Test that the warm start computes oob score when asked.
    X, y = hastie_X, hastie_y
    ForestEstimator = FOREST_ESTIMATORS[name]
    # Use 15 estimators to avoid 'some inputs do not have OOB scores' warning.
    est = ForestEstimator(
        n_estimators=15,
        max_depth=3,
        warm_start=False,
        random_state=1,
        bootstrap=True,
        oob_score=True,
    )
    est.fit(X, y)

    est_2 = ForestEstimator(
        n_estimators=5,
        max_depth=3,
        warm_start=False,
        random_state=1,
        bootstrap=True,
        oob_score=False,
    )
    est_2.fit(X, y)

    est_2.set_params(warm_start=True, oob_score=True, n_estimators=15)
    est_2.fit(X, y)
    assert hasattr(est_2, "oob_score_")
    assert est.oob_score_ == est_2.oob_score_

    # Test that oob_score is computed even if we don't need to train
    # additional trees.
    est_3 = ForestEstimator(
        n_estimators=15,
        max_depth=3,
        warm_start=True,
        random_state=1,
        bootstrap=True,
        oob_score=False,
    )
    est_3.fit(X, y)
    assert not hasattr(est_3, "oob_score_")

    est_3.set_params(oob_score=True)
    ignore_warnings(est_3.fit)(X, y)
    assert est.oob_score_ == est_3.oob_score_
```

However, I am not sure how to check that the OOB computation, i.e. the call to `self._set_oob_score_and_attributes(X, y, scoring_function=self.oob_score)` or `self._set_oob_score_and_attributes(X, y)`, is not made when `n_more_estimators == 0`.

I thought of checking an object attribute that might be changed by `self._set_oob_score_and_attributes()`, but I couldn't find any candidates. From some research online, it seems others create a mock function when they want to check that a function is called at runtime.

Do you have any suggestions on how I can go about writing the test for this?

@thomasjpfan (Member) left a comment

Thank you for the PR @choo8 !

> Do you have any suggestions on how I can go about writing the test for this?

As for testing, I think we can use Python's `Mock` object with a callable as the `side_effect`, pass it to `self.oob_score`, and assert the number of times it is called. With this PR, `self.oob_score` should be called fewer times than on `main`.
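For illustration, here is a minimal, self-contained sketch of the pattern being suggested (the `accuracy` function below is a hypothetical stand-in, not a scikit-learn internal): wrapping a real callable in a `Mock` via `side_effect` preserves its behavior while recording every call.

```python
from unittest.mock import Mock

# Hypothetical callable whose invocations we want to count.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# side_effect delegates to the wrapped callable, so return values are
# preserved while the Mock records each call and its arguments.
mock_scorer = Mock(side_effect=accuracy)

result = mock_scorer([1, 0, 1], [1, 1, 1])

assert result == 2 / 3              # behavior of the wrapped callable is intact
assert mock_scorer.call_count == 1  # and the call was recorded
mock_scorer.assert_called_once_with([1, 0, 1], [1, 1, 1])
```

The same wrapper can then be passed wherever the real callable would go, and the test asserts on `call_count` afterwards.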

@choo8 (Author) commented May 23, 2023

Thanks for the suggestion and comments @thomasjpfan! I've committed my test; please let me know if you have any comments. I decided to add on to the `check_warm_start_oob` test function, as I thought the test logic belongs there.

@thomasjpfan (Member) left a comment

Thank you for the update!

Please add an entry to the change log at `doc/whats_new/v1.3.rst` with tag `|Efficiency|`. Like the other entries there, please reference this pull request with `:pr:` and credit yourself (and other contributors if applicable) with `:user:`.


Review comment on the test context:

```python
    est_3.set_params(oob_score=True)
    ignore_warnings(est_3.fit)(X, y)
    # Patch _set_oob_score_and_attributes() to track OOB computation
```

@thomasjpfan (Member): I think extending this test complicates the original test a little too much. I prefer adding another test:

```python
def test_oob_not_computed_twice(monkeypatch):
    """Check that oob_score is not computed twice when warm_start=True."""
    est = RandomForestClassifier(n_estimators=10, oob_score=True, warm_start=True)

    mock = Mock(side_effect=est._set_oob_score_and_attributes)
    monkeypatch.setattr(est, "_set_oob_score_and_attributes", mock)

    est.fit(X, y)

    with pytest.warns(UserWarning, match="Warm-start fitting without increasing"):
        est.fit(X, y)

    mock.assert_called_once()
```

where `from unittest.mock import Mock`.

" expected. Please change the shape of y to "
"(n_samples,), for example using ravel()."
),
"A column-vector y was passed when a 1d array was"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few months ago we updated to black==23.3.0. May you update to black==23.3.0 and run the linter again?

@choo8 (Author) commented May 27, 2023

Thank you for your comments @thomasjpfan. I've created a separate test case as suggested, updated the changelog and also updated the version of the linter.

@thomasjpfan (Member)

There seems to be an issue with the merge because the diff is now at +3,998 −1,450. Can you try running the following:

```shell
git fetch upstream main
git rebase upstream/main
```

and do a force push?
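For completeness, a force push after a rebase is typically done as below; the remote and branch names here are placeholders, and `--force-with-lease` is a commonly recommended, safer variant of `--force` that refuses to overwrite work you haven't seen.

```shell
# Push the rebased branch to your fork, rewriting its remote history.
# "origin" and "my-branch" are placeholders for your remote and branch names.
git push --force-with-lease origin my-branch
```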

@choo8 (Author) commented May 28, 2023

Hi @thomasjpfan, I've fixed the diff with the commands you suggested.

@thomasjpfan thomasjpfan changed the title [WIP] Remove unnecessary OOB computation when n_more_estimators == 0 ENH Remove unnecessary OOB computation when n_more_estimators == 0 May 28, 2023
@thomasjpfan (Member) left a comment

Thank you for the update!

Comment on lines 290 to 291 of `doc/whats_new/v1.3.rst`:

```rst
- |Efficiency| :class:`ensemble.BaseForest` now only recomputes out-of-bag scores
  if `n_more_estimators > 0` in subsequent `fit` calls.
```

@thomasjpfan (Member): `BaseForest` is not a public class and `n_more_estimators` is not a public parameter. The user guide is public facing, so it's best to state the change in terms of the public API.

Suggested change:

```rst
- |Efficiency| :class:`ensemble.RandomForestClassifier` and
  :class:`ensemble.RandomForestRegressor` with `warm_start=True` now only
  recompute out-of-bag scores when there are actually more `n_estimators`
  in subsequent `fit` calls.
```

Comment on this hunk:

```python
@@ -1470,6 +1471,31 @@ def test_warm_start_oob(name):
    check_warm_start_oob(name)


def check_oob_not_computed_twice(name):
```
@thomasjpfan (Member) commented:

I can see there is a convention in this file of having a separate function that the test calls, but I think that adds more indirection. Can you place the body of the function directly in the `test_oob_not_computed_twice` test?

@choo8 (Author) commented May 29, 2023

Hi @thomasjpfan, I've updated the PR with regards to your latest comments.

Comment on this hunk:

```python
@@ -1466,8 +1467,29 @@ def check_warm_start_oob(name):


@pytest.mark.parametrize("name", FOREST_CLASSIFIERS_REGRESSORS)
def test_warm_start_oob(name):
```
@thomasjpfan (Member) commented:

Looks like something went wrong with the last patch.

Can you add test_warm_start_oob back and move the body of the new function into test_oob_not_computed_twice?

@choo8 (Author) commented:

Sorry, I deleted the wrong function by mistake. I've run pytest to verify and pushed the corrected edit.

@thomasjpfan (Member) left a comment

Thank you for the update! LGTM

@thomasjpfan thomasjpfan added the Waiting for Second Reviewer First reviewer is done, need a second one! label May 29, 2023
@glemaitre glemaitre self-requested a review May 30, 2023 07:17
@glemaitre (Member) left a comment

I will merge these two nitpicks.

@glemaitre glemaitre merged commit 1415a28 into scikit-learn:main May 30, 2023
@glemaitre (Member)

Thanks @choo8

@choo8 (Author) commented May 31, 2023

@thomasjpfan thanks for patiently guiding me through this PR!

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
Labels: module:ensemble, Waiting for Second Reviewer

Successfully merging this pull request may close these issues:

- Incorrect documentation for warm_start behavior on BaseForest-derived classes