FIX Make decision tree pickles deterministic #27580
Conversation
@lesteve Could you comment out the fix and push, to demonstrate that the new test fails? And then revert and commit the fix again.
@@ -2601,3 +2601,16 @@ def test_sample_weight_non_uniform(make_data, Tree):
     tree_samples_removed.fit(X[1::2, :], y[1::2])

     assert_allclose(tree_samples_removed.predict(X), tree_with_sw.predict(X))


 def test_deterministic_pickle():
Could you add a comment about why we have to use two separate estimators, with the same seed but different datasets?
Naively I'd have thought we could do something like using different seeds to provoke the bug/demonstrate that it no longer occurs. So a note for people from the future might be useful :D
Oops, good catch: they were in the end fitted on the same dataset, just in a convoluted way. I fixed this.
To sum up: even if we fix random_state and fit on the same dataset, the pickles differ because of uninitialised memory.
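To make the failure mode concrete, here is a minimal sketch of how padding bytes in an aligned record can make two logically equal arrays serialise differently. The dtype below is hypothetical (it is not scikit-learn's actual node layout); it just reproduces the "1-byte last field padded out by alignment" situation described in this thread:

```python
import numpy as np

# Hypothetical dtype, not scikit-learn's NODE_DTYPE: with align=True the
# record is padded past its 1-byte last field, so itemsize exceeds the sum
# of the field sizes and every record carries "dead" padding bytes.
dt = np.dtype([("id", np.int64), ("flag", np.uint8)], align=True)

a = np.empty(1, dtype=dt)  # like malloc: padding holds whatever was in memory
b = np.zeros(1, dtype=dt)  # like malloc + memset: padding bytes are zeroed
a["id"], a["flag"] = 7, 1
b["id"], b["flag"] = 7, 1

print(dt.itemsize)     # 16: 9 bytes of fields, padded to 16
print((a == b).all())  # True: field-wise comparison ignores padding
# a.tobytes() may still differ from b.tobytes() because of the padding bytes,
# which is exactly why byte-for-byte pickle comparison was flaky.
```

Zeroing the buffer before use (the memset fix) makes the padding bytes deterministic without changing any field value.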
I'd still 👍 a comment saying something about "uninitialised memory would lead to the two pickles being different" or some such
Looks good. I wish there were a different way to do the memory initialisation, but I guess sometimes we have to leave the nice, cushy world of Python behind and get our hands dirty.
Approval assumes that the test fails before the fix and stops failing after it.
LGTM
Failing tests with commit da596d7 in https://github.com/scikit-learn/scikit-learn/runs/17679631176
__________________________ test_deterministic_pickle ___________________________
[gw0] linux -- Python 3.11.5 /usr/share/miniconda/envs/testvenv/bin/python
def test_deterministic_pickle():
tree1 = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
tree1.fit(X, y)
tree2 = DecisionTreeClassifier(random_state=0).fit(X, y)
tree2.fit(X, y)
pickle1 = pickle.dumps(tree1)
pickle2 = pickle.dumps(tree2)
> assert pickle1 == pickle2
E assert b"\x80\x04\x9...4.dev0\x94ub." == b"\x80\x04\x9...4.dev0\x94ub."
E At index 1181 diff: b'\xb0' != b'\x00'
E Use -v to get more diff
pickle1 = b"\x80\x04\x95Y\x05\x00\x00\x00\x00\x00\x00\x8c\x15sklearn.tree._classes\x94\x8c\x16DecisionTreeClassifier\x94\x93\x94...00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08@\x94t\x94bub\x8c\x10_sklearn_version\x94\x8c\x081.4.dev0\x94ub."
pickle2 = b"\x80\x04\x95Y\x05\x00\x00\x00\x00\x00\x00\x8c\x15sklearn.tree._classes\x94\x8c\x16DecisionTreeClassifier\x94\x93\x94...00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08@\x94t\x94bub\x8c\x10_sklearn_version\x94\x8c\x081.4.dev0\x94ub."
tree1 = DecisionTreeClassifier(random_state=0)
tree2 = DecisionTreeClassifier(random_state=0)
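The "At index 1181 diff" line above comes from pytest's byte-string diffing. A tiny helper (hypothetical, not part of the PR or of pytest) reproduces that check by locating the first byte at which two pickles diverge:

```python
import pickle

def first_diff(b1: bytes, b2: bytes) -> int:
    """Return the index of the first differing byte, or -1 if equal."""
    for i, (x, y) in enumerate(zip(b1, b2)):
        if x != y:
            return i
    return -1 if len(b1) == len(b2) else min(len(b1), len(b2))

p1 = pickle.dumps({"value": 1})
p2 = pickle.dumps({"value": 2})
print(first_diff(p1, p1))  # -1: identical byte strings
print(first_diff(p1, p2))  # index of the first differing byte
```

Before the fix, the differing bytes were uninitialised padding inside the serialised node array, not any actual field of the fitted tree.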
doc/whats_new/v1.4.rst
Outdated
@@ -424,6 +424,9 @@ Changelog
 :pr:`13649` by :user:`Samuel Ronsin <samronsin>`, initiated by
 :user:`Patrick O'Reilly <pat-oreilly>`.

 - |Fix| Make decision tree pickles deterministic. :pr:`27580` by :user:`Loïc
Suggested change:
- - |Fix| Make decision tree pickles deterministic. :pr:`27580` by :user:`Loïc
+ - |Fix| Make decision tree pickle files deterministic. :pr:`27580` by :user:`Loïc

or pickle dumps.
I think we should consider this a security fix and rephrase as "Do not leak data via non-initialized memory in decision tree pickle files and make the generation of those files deterministic."
And we should backport and issue a quick scikit-learn 1.3.2 even if it's just for this fix.
Which security vulnerability does it fix? Pickle is still pickle.
A potential sensitive data leak from memory (e.g. a fraction of the training/testing set, the contents of the clipboard, or secret credentials or whatever). I agree that not much data will typically leak this way, but still, it's ugly.
doc/whats_new/v1.4.rst
Outdated
@@ -424,6 +424,9 @@ Changelog
 :pr:`13649` by :user:`Samuel Ronsin <samronsin>`, initiated by
 :user:`Patrick O'Reilly <pat-oreilly>`.

 - |Fix| Do not leak data via non-initialized memory in decision tree pickle files and make
   the generation of those files deterministic. :pr:`27580` by :user:`Loïc Estève <lesteve>`.
Let's put this entry directly in 1.3.2 to conform with our security policy:
https://github.com/scikit-learn/scikit-learn/security/policy
One more suggestion below.
Do we agree that we should release 1.3.2 for this fix? If so, let's move the changelog entry there right away before merging.
https://github.com/scikit-learn/scikit-learn/pull/27580/files#r1358390249
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
I pushed a commit moving the whats_new entry to 1.3.2.
Following #27554 I had a look at why the decision tree pickles were non-deterministic. As I guessed, this is due to uninitialised memory in the C-allocated arrays. This started happening with missing-value support in trees (scikit-learn 1.3).

I used a memset to make sure the memory is initialised when allocating the node array. Uninitialised memory comes from two places:
- padding in NODE_DTYPE (the last field is 1 byte but the dtype is padded to 64 bytes, as you can see from itemsize)
- missing_go_to_left for leaf nodes (see the 55 and 54 values below, but of course the numbers can be arbitrary)
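The padding point can be checked from Python with a stand-in dtype. The field names below follow this discussion, but treat the exact layout as an assumption about the real NODE_DTYPE; what matters is that seven 8-byte fields plus one 1-byte flag give 57 bytes of data, padded to the 64-byte itemsize mentioned above:

```python
import numpy as np

# Approximate, assumed reconstruction of the tree node dtype (not copied
# from sklearn/tree sources).  Alignment pads the record past its 1-byte
# last field, so itemsize reports more than the sum of the field sizes.
node_dtype = np.dtype(
    [
        ("left_child", np.intp),
        ("right_child", np.intp),
        ("feature", np.intp),
        ("threshold", np.float64),
        ("impurity", np.float64),
        ("n_node_samples", np.intp),
        ("weighted_n_node_samples", np.float64),
        ("missing_go_to_left", np.uint8),  # 1 byte, then trailing padding
    ],
    align=True,
)
print(node_dtype.itemsize)  # 64 on 64-bit platforms: 57 bytes padded to 64
```

Those 7 trailing padding bytes per node, plus the never-written missing_go_to_left of leaf nodes, are exactly the bytes the memset now zeroes.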