gh-101438: Avoid reference cycle in ElementTree.iterparse. #114269

colesbury · 2024-01-18T22:54:19Z

Refactor IterParseIterator to avoid a reference cycle between the iterator() function and the IterParseIterator() instance. This leads to more prompt clean-up of the "source" file if the returned iterator is not exhausted and not otherwise part of a reference cycle.

This also avoids a test failure in the GC implementation for the free-threaded build (#114262): if the "source" file is finalized before the iterator() generator, a ResourceWarning is issued leading to a failure in test_iterparse() in test_xml_etree.py. In theory, this warning can occur in the default build as well, but that is much less likely to occur because it would require an unlucky scheduling of the GC between creation of the generator and the file object in order to change the order of finalization.

Issue: ElementTree.iterparse "leaks" file descriptor when not exhausted #101438
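The cycle described above can be reproduced with a small toy model. This is only a sketch, not the actual ElementTree code: `FakeSource`, `cyclic_iter`, `weakref_iter` and `closed_right_after_del` are hypothetical names invented for the illustration. It assumes CPython's reference counting and disables the cyclic GC so that only reference counting can free objects.

```python
import gc
import weakref


class FakeSource:
    """Hypothetical stand-in for an open file; it only records close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def cyclic_iter(source):
    """Toy iterator whose generator holds a strong reference to the wrapper."""
    def events():
        try:
            yield "start"
            it.root = "root"  # closure cell -> `it`: completes the cycle
        finally:
            source.close()

    class Wrapper:
        __next__ = events().__next__

        def __del__(self):
            source.close()

    it = Wrapper()
    it.root = None
    return it


def weakref_iter(source):
    """Same toy iterator, but the generator only sees a weak reference."""
    def events():
        try:
            yield "start"
            target = wr()  # None if the wrapper has already been freed
            if target is not None:
                target.root = "root"
        finally:
            source.close()

    class Wrapper:
        __next__ = events().__next__

        def __del__(self):
            source.close()

    it = Wrapper()
    it.root = None
    wr = weakref.ref(it)
    return it


def closed_right_after_del(factory):
    """Read one event, drop the iterator, and report whether close() ran."""
    source = FakeSource()
    it = factory(source)
    next(it)
    del it
    return source.closed


gc.disable()  # leave only reference counting, to make the difference visible
try:
    cyclic = closed_right_after_del(cyclic_iter)
    weak = closed_right_after_del(weakref_iter)
finally:
    gc.enable()
    gc.collect()

print("closed promptly with cycle:  ", cyclic)
print("closed promptly with weakref:", weak)
```

In the cyclic variant the source is only closed once the garbage collector runs; in the weak-reference variant the wrapper's `__del__` runs as soon as the last reference is dropped.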


serhiy-storchaka · 2024-01-19T12:29:06Z

A class with a __next__() method adds overhead. See #69824: it made benchmarks 2 times slower.

Please test how this change affects the results of the xml_etree_iterparse benchmark in https://github.com/python/pyperformance.
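The kind of overhead in question can be observed with a rough micro-benchmark. This is a sketch only, not the pyperformance benchmark: `numbers` and `NextWrapper` are hypothetical names, and absolute timings will vary by machine and Python version.

```python
import timeit


def numbers(n):
    """Plain generator: next() is served by the generator's C-level slot."""
    yield from range(n)


class NextWrapper:
    """Iterator whose __next__ is a Python function delegating to a generator."""

    def __init__(self, n):
        self._gen = numbers(n)

    def __iter__(self):
        return self

    def __next__(self):
        # Every item costs one extra Python-level call.
        return next(self._gen)


N = 50_000
t_generator = timeit.timeit(lambda: sum(numbers(N)), number=3)
t_wrapper = timeit.timeit(lambda: sum(NextWrapper(N)), number=3)
print(f"generator: {t_generator:.4f}s  wrapper: {t_wrapper:.4f}s")
```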


colesbury · 2024-01-19T19:46:19Z

Thanks for the pointer @serhiy-storchaka. It had about a 15% regression. I rewrote it to avoid the regression. On my machine, I'm seeing 1.37-1.38 seconds for ten iterations of the iterparse benchmark (both before and after this PR).
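One way to avoid a Python-level `__next__` wrapper is to bind the generator's own `__next__` as a class attribute of a class created per call. The following is a minimal sketch of that pattern under the assumption that it resembles what the rewrite does; `make_iterator` and `DirectIterator` are hypothetical names.

```python
import collections.abc


def make_iterator(items):
    """Return an iterator object backed directly by a generator."""
    def events():
        yield from items

    gen = events()

    class DirectIterator(collections.abc.Iterator):
        # The generator's bound __next__ is stored on the class, so
        # next(it) calls it without running a Python-level wrapper.
        __next__ = gen.__next__

    return DirectIterator()


it = make_iterator(["a", "b", "c"])
result = list(it)
print(result)
```

Because the class is created inside the function, each returned iterator gets its own class bound to its own generator.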

colesbury · 2024-01-23T15:14:08Z

@serhiy-storchaka, would you be able to review this?

serhiy-storchaka

Could you please add a test?

There is an existing test:

        # Not exhausting the iterator still closes the resource (bpo-43292)
        with warnings_helper.check_no_resource_warning(self):
            it = iterparse(TESTFN)
            del it

It was intended to test that source.close() is called, but it seems that it was not called at all.

Perhaps running it in a loop could test this issue. This code could also be moved into a separate method.

Lib/xml/etree/ElementTree.py


colesbury · 2024-01-23T18:36:56Z

@serhiy-storchaka, I don't think we can do better than the current test. That test does catch the behavior. See https://github.com/python/cpython/actions/runs/7617715570/job/20747338297 for the test failure.

colesbury · 2024-01-23T18:40:21Z

I don't think a loop will do much either. The order of finalization is not specified, but in practice it is pretty deterministic. A loop will make the test slower, but not much more reliable.

serhiy-storchaka

LGTM.

serhiy-storchaka · 2024-01-23T19:54:56Z

Thank you for your contribution @colesbury. If you look at the history of this code, you will see that it is a continuous fight against reference loops. So I'm especially grateful for this fix.

miss-islington-app · 2024-01-23T20:14:50Z

Thanks @colesbury for the PR, and @serhiy-storchaka for merging it 🌮🎉. I'm working now to backport this PR to: 3.11, 3.12.
🐍🍒⛏🤖

…honGH-114269) The iterator returned by ElementTree.iterparse() may hold on to a file descriptor. The reference cycle prevented prompt clean-up of the file descriptor if the returned iterator was not exhausted. (cherry picked from commit ce01ab5) Co-authored-by: Sam Gross <colesbury@gmail.com>

bedevere-app · 2024-01-23T20:15:04Z

GH-114499 is a backport of this pull request to the 3.12 branch.

bedevere-app · 2024-01-23T20:15:09Z

GH-114500 is a backport of this pull request to the 3.11 branch.

…-114269) (GH-114499) The iterator returned by ElementTree.iterparse() may hold on to a file descriptor. The reference cycle prevented prompt clean-up of the file descriptor if the returned iterator was not exhausted. (cherry picked from commit ce01ab5) Co-authored-by: Sam Gross <colesbury@gmail.com>

…-114269) (GH-114500) The iterator returned by ElementTree.iterparse() may hold on to a file descriptor. The reference cycle prevented prompt clean-up of the file descriptor if the returned iterator was not exhausted. (cherry picked from commit ce01ab5) Co-authored-by: Sam Gross <colesbury@gmail.com>

Prometheus3375 · 2024-01-28T03:47:48Z

Lib/xml/etree/ElementTree.py

+    it = IterParseIterator()
+    wr = weakref.ref(it)
+    del IterParseIterator


I am curious: why were both the iterator and IterParseIterator names deleted previously, but now only IterParseIterator? And what is the purpose of this statement in the first place? I thought that iterator.__closure__ stored references to these objects, so unnecessary references should be deleted. However, according to my checks, the closure stores only the variables that are actually referenced inside the function: pullparser, close_source and wr in this case.
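The observation about closures can be checked with a short, generic example. This is an illustrative sketch unrelated to the ElementTree source; `outer`, `inner`, `used` and `unused` are hypothetical names.

```python
def outer():
    used = "kept alive by the closure"
    unused = "not referenced by inner()"

    def inner():
        return used

    return inner


fn = outer()
# Only names that inner() actually references become free variables.
freevars = fn.__code__.co_freevars
cells = [cell.cell_contents for cell in fn.__closure__]
print(freevars)
print(cells)
```

Local names of the enclosing function that the inner function never mentions, such as `unused` here, are not captured, so deleting them does not change what the closure keeps alive.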

I also noticed that it.root = None was deleted. This initialization is not documented, but removing it may still cause unexpected errors for users who access root.

colesbury · 2024-01-28

I think you are right about the it.root = None. I did not intend a behavioral change here, so it seems like a good idea to add it back.

I don't think the del statements matter one way or the other. They look like they break a cycle but don't really; they are also harmless.

Prior to pythongh-114269, the iterator returned by ElementTree.iterparse was initialized with the root attribute as None. This restores the previous behavior.
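The restored behavior can be observed from user code. This is a small sketch with an in-memory document (the XML content and variable names are illustrative), assuming a Python release that includes the follow-up fix.

```python
import io
import xml.etree.ElementTree as ET

it = ET.iterparse(io.BytesIO(b"<root><child/></root>"))

# Before any input has been consumed, the attribute exists and is None.
root_before = it.root

for event, elem in it:
    pass  # exhaust the iterator

# Once the source is fully read, root references the root element.
root_after = it.root
print(root_before, root_after.tag)
```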

…H-114755) Prior to gh-114269, the iterator returned by ElementTree.iterparse was initialized with the root attribute as None. This restores the previous behavior.

…ute (pythonGH-114755) Prior to pythongh-114269, the iterator returned by ElementTree.iterparse was initialized with the root attribute as None. This restores the previous behavior. (cherry picked from commit 66f95ea) Co-authored-by: Sam Gross <colesbury@gmail.com>

…bute (GH-114755) (GH-114798) Prior to gh-114269, the iterator returned by ElementTree.iterparse was initialized with the root attribute as None. This restores the previous behavior. (cherry picked from commit 66f95ea) Co-authored-by: Sam Gross <colesbury@gmail.com>

…bute (GH-114755) (GH-114799) Prior to gh-114269, the iterator returned by ElementTree.iterparse was initialized with the root attribute as None. This restores the previous behavior. (cherry picked from commit 66f95ea) Co-authored-by: Sam Gross <colesbury@gmail.com>

…honGH-114269) The iterator returned by ElementTree.iterparse() may hold on to a file descriptor. The reference cycle prevented prompt clean-up of the file descriptor if the returned iterator was not exhausted.

…ute (pythonGH-114755) Prior to pythongh-114269, the iterator returned by ElementTree.iterparse was initialized with the root attribute as None. This restores the previous behavior.


colesbury added topic-XML topic-free-threading labels Jan 18, 2024

bedevere-app bot added the awaiting review label Jan 18, 2024

bedevere-app bot mentioned this pull request Jan 18, 2024

ElementTree.iterparse "leaks" file descriptor when not exhausted #101438

Closed

Avoid regression in bm_xml_etree performance.

8fabc7c

This avoids the `__next__` wrapper and the `root` property, both of which had a performance impact on the iterparse benchmark in bm_xml_etree.

colesbury requested a review from serhiy-storchaka January 19, 2024 19:46

Minor simplification

1ae917e

serhiy-storchaka reviewed Jan 23, 2024

Lib/xml/etree/ElementTree.py (outdated, resolved)

Update Lib/xml/etree/ElementTree.py

05baaad

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

serhiy-storchaka approved these changes Jan 23, 2024


bedevere-app bot added awaiting merge and removed awaiting review labels Jan 23, 2024

serhiy-storchaka added needs backport to 3.11, needs backport to 3.12 labels Jan 23, 2024

Update 2024-01-18-22-29-28.gh-issue-101438.1-uUi_.rst

307b375

serhiy-storchaka enabled auto-merge (squash) January 23, 2024 19:52

serhiy-storchaka merged commit ce01ab5 into python:main Jan 23, 2024

bedevere-app bot removed the awaiting merge label Jan 23, 2024

bedevere-app bot removed the needs backport to 3.12 label Jan 23, 2024

bedevere-app bot removed the needs backport to 3.11 label Jan 23, 2024

colesbury deleted the gh-101438-iterparse branch January 24, 2024 17:28

Prometheus3375 reviewed Jan 28, 2024


Prometheus3375 mentioned this pull request Jan 29, 2024

Fix unintended behavior change in elementtree introduced in #114269 #114737

Closed

colesbury mentioned this pull request Jan 30, 2024

gh-114737: Revert change to ElementTree.iterparse "root" attribute #114755

Merged

miss-islington mentioned this pull request Jan 31, 2024

[3.12] gh-114737: Revert change to ElementTree.iterparse "root" attribute (GH-114755) #114798

Merged

miss-islington mentioned this pull request Jan 31, 2024

[3.11] gh-114737: Revert change to ElementTree.iterparse "root" attribute (GH-114755) #114799

Merged
