FIX Make decision tree pickles deterministic #27580
Conversation
@lesteve Could you comment out the fix and push, to demonstrate that the new test fails? And then revert and commit the fix again.
@@ -2601,3 +2601,16 @@ def test_sample_weight_non_uniform(make_data, Tree):
     tree_samples_removed.fit(X[1::2, :], y[1::2])

     assert_allclose(tree_samples_removed.predict(X), tree_with_sw.predict(X))


 def test_deterministic_pickle():
Could you add a comment about why we have to use two separate estimators, with the same seed but different datasets?
Naively I'd have thought we could do something like using different seeds to provoke the bug/demonstrate that it no longer occurs. So a note for people from the future might be useful :D
Oops, good catch: they were in the end fitted on the same dataset, just in a convoluted way. I fixed this.
To sum up: even if we fix random_state and fit on the same dataset, the pickles differ because of uninitialised memory.
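To make the failure mode concrete, here is a minimal sketch of how padding bytes in an aligned record can make two logically equal arrays serialise differently. The dtype below is hypothetical (it is not scikit-learn's actual node layout); it just reproduces the "1-byte last field padded out by alignment" situation described in this thread:

```python
import numpy as np

# Hypothetical dtype, not scikit-learn's NODE_DTYPE: with align=True the
# record is padded past its 1-byte last field, so itemsize exceeds the sum
# of the field sizes and every record carries "dead" padding bytes.
dt = np.dtype([("id", np.int64), ("flag", np.uint8)], align=True)

a = np.empty(1, dtype=dt)  # like malloc: padding holds whatever was in memory
b = np.zeros(1, dtype=dt)  # like malloc + memset: padding bytes are zeroed
a["id"], a["flag"] = 7, 1
b["id"], b["flag"] = 7, 1

print(dt.itemsize)     # 16: 9 bytes of fields, padded to 16
print((a == b).all())  # True: field-wise comparison ignores padding
# a.tobytes() may still differ from b.tobytes() because of the padding bytes,
# which is exactly why byte-for-byte pickle comparison was flaky.
```

Zeroing the buffer before use (the memset fix) makes the padding bytes deterministic without changing any field value.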
I'd still 👍 a comment saying something about "uninitialised memory would lead to the two pickles being different" or some such
Looks good. I wish there were a different way to do the memory initialisation, but I guess sometimes we have to leave the nice, cushy world of Python behind and get our hands dirty.
Approval assumes that the test fails before the fix and stops failing after it.
LGTM
Failing tests with commit da596d7 in https://github.com/scikit-learn/scikit-learn/runs/17679631176
__________________________ test_deterministic_pickle ___________________________
[gw0] linux -- Python 3.11.5 /usr/share/miniconda/envs/testvenv/bin/python
def test_deterministic_pickle():
tree1 = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
tree1.fit(X, y)
tree2 = DecisionTreeClassifier(random_state=0).fit(X, y)
tree2.fit(X, y)
pickle1 = pickle.dumps(tree1)
pickle2 = pickle.dumps(tree2)
> assert pickle1 == pickle2
E assert b"\x80\x04\x9...4.dev0\x94ub." == b"\x80\x04\x9...4.dev0\x94ub."
E At index 1181 diff: b'\xb0' != b'\x00'
E Use -v to get more diff
pickle1 = b"\x80\x04\x95Y\x05\x00\x00\x00\x00\x00\x00\x8c\x15sklearn.tree._classes\x94\x8c\x16DecisionTreeClassifier\x94\x93\x94...00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08@\x94t\x94bub\x8c\x10_sklearn_version\x94\x8c\x081.4.dev0\x94ub."
pickle2 = b"\x80\x04\x95Y\x05\x00\x00\x00\x00\x00\x00\x8c\x15sklearn.tree._classes\x94\x8c\x16DecisionTreeClassifier\x94\x93\x94...00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08@\x94t\x94bub\x8c\x10_sklearn_version\x94\x8c\x081.4.dev0\x94ub."
tree1 = DecisionTreeClassifier(random_state=0)
tree2 = DecisionTreeClassifier(random_state=0)
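The "At index 1181 diff" line above comes from pytest's byte-string diffing. A tiny helper (hypothetical, not part of the PR or of pytest) reproduces that check by locating the first byte at which two pickles diverge:

```python
import pickle

def first_diff(b1: bytes, b2: bytes) -> int:
    """Return the index of the first differing byte, or -1 if equal."""
    for i, (x, y) in enumerate(zip(b1, b2)):
        if x != y:
            return i
    return -1 if len(b1) == len(b2) else min(len(b1), len(b2))

p1 = pickle.dumps({"value": 1})
p2 = pickle.dumps({"value": 2})
print(first_diff(p1, p1))  # -1: identical byte strings
print(first_diff(p1, p2))  # index of the first differing byte
```

Before the fix, the differing bytes were uninitialised padding inside the serialised node array, not any actual field of the fitted tree.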
doc/whats_new/v1.4.rst
Outdated
@@ -424,6 +424,9 @@ Changelog
 :pr:`13649` by :user:`Samuel Ronsin <samronsin>`, initiated by
 :user:`Patrick O'Reilly <pat-oreilly>`.

 - |Fix| Make decision tree pickles deterministic. :pr:`27580` by :user:`Loïc
Suggested change:
- - |Fix| Make decision tree pickles deterministic. :pr:`27580` by :user:`Loïc
+ - |Fix| Make decision tree pickle files deterministic. :pr:`27580` by :user:`Loïc

or pickle dumps.
I think we should consider this a security fix and rephrase as "Do not leak data via non-initialized memory in decision tree pickle files and make the generation of those files deterministic."
And we should backport and issue a quick scikit-learn 1.3.2 even if it's just for this fix.
Which security vulnerability does it fix? Pickle is still pickle.
A potential sensitive data leak from memory (e.g. a fraction of the training/testing set, the contents of the clipboard, or secret credentials or whatever). I agree that not much data will typically leak this way, but still, it's ugly.
doc/whats_new/v1.4.rst
Outdated
@@ -424,6 +424,9 @@ Changelog
 :pr:`13649` by :user:`Samuel Ronsin <samronsin>`, initiated by
 :user:`Patrick O'Reilly <pat-oreilly>`.

 - |Fix| Do not leak data via non-initialized memory in decision tree pickle files and make
   the generation of those files deterministic. :pr:`27580` by :user:`Loïc Estève <lesteve>`.
Let's put this entry directly in 1.3.2 to conform with our security policy:
https://github.com/scikit-learn/scikit-learn/security/policy
One more suggestion below.
Do we agree that we should release 1.3.2 for this fix? If so, let's move the changelog entry there right away before merging.
https://github.com/scikit-learn/scikit-learn/pull/27580/files#r1358390249
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
I pushed a commit moving the whats_new entry to 1.3.2.
Following #27554 I had a look at why the decision tree pickles were non-deterministic. As I guessed, this is due to uninitialised memory in the C-allocated arrays. This started happening with missing-value support in trees (scikit-learn 1.3).

I used a memset to make sure the memory is initialised when allocating the node array. Uninitialised memory comes from two places:
- padding in NODE_DTYPE (the last field is 1 byte but the dtype is padded to 64 bytes, as you can see from itemsize)
- missing_go_to_left for leaf nodes (see the 55 and 54 values below, but of course the numbers can be arbitrary)
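The padding point can be checked from Python with a stand-in dtype. The field names below follow this discussion, but treat the exact layout as an assumption about the real NODE_DTYPE; what matters is that seven 8-byte fields plus one 1-byte flag give 57 bytes of data, padded to the 64-byte itemsize mentioned above:

```python
import numpy as np

# Approximate, assumed reconstruction of the tree node dtype (not copied
# from sklearn/tree sources).  Alignment pads the record past its 1-byte
# last field, so itemsize reports more than the sum of the field sizes.
node_dtype = np.dtype(
    [
        ("left_child", np.intp),
        ("right_child", np.intp),
        ("feature", np.intp),
        ("threshold", np.float64),
        ("impurity", np.float64),
        ("n_node_samples", np.intp),
        ("weighted_n_node_samples", np.float64),
        ("missing_go_to_left", np.uint8),  # 1 byte, then trailing padding
    ],
    align=True,
)
print(node_dtype.itemsize)  # 64 on 64-bit platforms: 57 bytes padded to 64
```

Those 7 trailing padding bytes per node, plus the never-written missing_go_to_left of leaf nodes, are exactly the bytes the memset now zeroes.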