-
-
Notifications
You must be signed in to change notification settings - Fork 31.8k
gh-101438: Avoid reference cycle in ElementTree.iterparse. #114269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Refactor IterParseIterator to avoid a reference cycle between the iterator() function and the IterParseIterator() instance. This leads to more prompt clean-up of the "source" file if the returned iterator is not exhausted and not otherwise part of a reference cycle. This also avoids a test failure in the GC implementation for the free-threaded build: if the "source" file is finalized before the "iterator()" generator, a ResourceWarning is issued leading to a failure in test_iterparse(). In theory, this warning can occur in the default build as well, but is much less likely because it would require an unlucky scheduling of the GC between creation of the generator and the file object in order to change the order of finalization.
A class with Please test how this change affects the results of the |
This avoids the `__next__` wrapper and the `root` property, both of which had a performance impact on the iterparse benchmark in bm_xml_etree.
Thanks for the pointer @serhiy-storchaka. It had about a 15% regression. I rewrote it to avoid the regression. On my machine, I'm seeing 1.37-1.38 seconds for ten iterations of the |
@serhiy-storchaka, would you be able to review this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a test?
There is an existing test
# Not exhausting the iterator still closes the resource (bpo-43292)
with warnings_helper.check_no_resource_warning(self):
it = iterparse(TESTFN)
del it
It was purposed to test that source.close()
is called, but it seems that it was not called at all.
Perhaps running it in a loop can test this issue. And this code can be moved into a separate method.
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@serhiy-storchaka, I don't think we can do better than the current test. That test does catch the behavior. See https://github.com/python/cpython/actions/runs/7617715570/job/20747338297 for the test failure. |
I don't think a loop will do much either. The order of finalization is not specified, but in practice it is pretty deterministic. A loop will make the test slower, but not much more reliable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thank you for your contribution @colesbury. If you look at the history of this code, you will see that it is a continuous fight against a reference loops. So I'm especially grateful for this fix. |
Thanks @colesbury for the PR, and @serhiy-storchaka for merging it 🌮🎉.. I'm working now to backport this PR to: 3.11, 3.12. |
…honGH-114269) The iterator returned by ElementTree.iterparse() may hold on to a file descriptor. The reference cycle prevented prompt clean-up of the file descriptor if the returned iterator was not exhausted. (cherry picked from commit ce01ab5) Co-authored-by: Sam Gross <colesbury@gmail.com>
…honGH-114269) The iterator returned by ElementTree.iterparse() may hold on to a file descriptor. The reference cycle prevented prompt clean-up of the file descriptor if the returned iterator was not exhausted. (cherry picked from commit ce01ab5) Co-authored-by: Sam Gross <colesbury@gmail.com>
GH-114499 is a backport of this pull request to the 3.12 branch. |
GH-114500 is a backport of this pull request to the 3.11 branch. |
…-114269) (GH-114499) The iterator returned by ElementTree.iterparse() may hold on to a file descriptor. The reference cycle prevented prompt clean-up of the file descriptor if the returned iterator was not exhausted. (cherry picked from commit ce01ab5) Co-authored-by: Sam Gross <colesbury@gmail.com>
…-114269) (GH-114500) The iterator returned by ElementTree.iterparse() may hold on to a file descriptor. The reference cycle prevented prompt clean-up of the file descriptor if the returned iterator was not exhausted. (cherry picked from commit ce01ab5) Co-authored-by: Sam Gross <colesbury@gmail.com>
it = IterParseIterator() | ||
wr = weakref.ref(it) | ||
del IterParseIterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious, why previously both iterator
and IterParseIterator
names were deleted, but now only IterParseIterator
? And what is the purpose of this statement in the first place? I was thinking that iterator.__closure__
stores references to these objects; therefore, unnecessary references should be deleted. However, as per my checks, closure stores only referenced variables inside; pullparser
, close_source
and wr
in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also noticed that it.root = None
was deleted. This fact is not documented, but this may still cause unintended errors on the user side if they use root
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right about the it.root = None
. I did not intend a behavioral change here, so it seems like a good idea to add it back.
I don't think the del
statements matter one way or the other. They look like they break a cycle, but not really, but they also are harmless.
Prior to pythongh-114269, the iterator returned by ElementTree.iterparse was initialized with the root attribute as None. This restores the previous behavior.
…ute (pythonGH-114755) Prior to pythongh-114269, the iterator returned by ElementTree.iterparse was initialized with the root attribute as None. This restores the previous behavior. (cherry picked from commit 66f95ea) Co-authored-by: Sam Gross <colesbury@gmail.com>
…ute (pythonGH-114755) Prior to pythongh-114269, the iterator returned by ElementTree.iterparse was initialized with the root attribute as None. This restores the previous behavior. (cherry picked from commit 66f95ea) Co-authored-by: Sam Gross <colesbury@gmail.com>
…honGH-114269) The iterator returned by ElementTree.iterparse() may hold on to a file descriptor. The reference cycle prevented prompt clean-up of the file descriptor if the returned iterator was not exhausted.
…ute (pythonGH-114755) Prior to pythongh-114269, the iterator returned by ElementTree.iterparse was initialized with the root attribute as None. This restores the previous behavior.
…honGH-114269) The iterator returned by ElementTree.iterparse() may hold on to a file descriptor. The reference cycle prevented prompt clean-up of the file descriptor if the returned iterator was not exhausted.
Refactor
IterParseIterator
to avoid a reference cycle between the iterator() function and the IterParseIterator() instance. This leads to more prompt clean-up of the "source" file if the returned iterator is not exhausted and not otherwise part of a reference cycle.This also avoids a test failure in the GC implementation for the free-threaded build (#114262): if the "source" file is finalized before the
iterator()
generator, aResourceWarning
is issued leading to a failure in test_iterparse() in test_xml_etree.py. In theory, this warning can occur in the default build as well, but that is much less likely to occur because it would require an unlucky scheduling of the GC between creation of the generator and the file object in order to change the order of finalization.