Skip to content

ElementTree.iterparse "leaks" file descriptor when not exhausted #101438

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
colesbury opened this issue Jan 30, 2023 · 4 comments
Closed

ElementTree.iterparse "leaks" file descriptor when not exhausted #101438

colesbury opened this issue Jan 30, 2023 · 4 comments
Labels
3.11 only security fixes 3.12 only security fixes 3.13 bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@colesbury
Copy link
Contributor

colesbury commented Jan 30, 2023

The PR #31696 attempts to fix the "leak" of file descriptors when the iterator is not exhausted. That PR fixes the warning, but not the underlying issue that the files aren't closed until the next tracing garbage collection cycle.

Note that there isn't truly a leak of file descriptors. The file descriptors are eventually closed when the file object is finalized (at cyclic garbage collection). The point of the ResourceWarning (in my understanding) is that waiting until the next garbage collection cycle means that you may temporarily have a lot of unwanted open file descriptors, which could exhaust the global limit or prevent successful writes to those files on Windows.

# run with ulimit -Sn 1000
import xml.etree.ElementTree as ET
import tempfile

import gc
gc.disable()

def run():
    with tempfile.NamedTemporaryFile("w") as f:
        f.write("<document />junk")

        for i in range(10000):
            it = ET.iterparse(f.name)
            del it

run()

On my system, after lowering the file descriptor limit to 1000 (via ulimit -Sn 1000) I get:

OSError: [Errno 24] Too many open files: '/tmp/tmpwwmd9gp6'

Linked PRs

@colesbury colesbury added the type-bug An unexpected behavior, bug, or error label Jan 30, 2023
@ericvsmith
Copy link
Member

I think this would require something like PEP 533 “Deterministic cleanup for iterators“ to solve.

@colesbury
Copy link
Contributor Author

colesbury commented Jan 30, 2023

In this case, I think you can fix it just by avoiding reference cycles that involve the generator. Below is an example refactoring of iterparse.

def iterparse(source, events=None, parser=None):
    # Use the internal, undocumented _parser argument for now; When the
    # parser argument of iterparse is removed, this can be killed.
    pullparser = XMLPullParser(events=events, _parser=parser)
    _root = None

    def iterator(source):
        nonlocal _root
        close_source = False
        try:
            if not hasattr(source, "read"):
                source = open(source, "rb")
                close_source = True
            yield None
            while True:
                yield from pullparser.read_events()
                # load event buffer
                data = source.read(16 * 1024)
                if not data:
                    break
                pullparser.feed(data)
            root = pullparser._close_and_return_root()
            yield from pullparser.read_events()
            _root = root
        finally:
            if close_source:
                source.close()

    class IterParseIterator(collections.abc.Iterator):
        def __init__(self, source):
            self.it = iterator(source)
        def __next__(self):
            return next(self.it)
        @property
        def root(self):
            nonlocal _root
            return _root
    it = IterParseIterator(source)
    del iterator, IterParseIterator

    next(it)
    return it

This avoids two reference cycles that captured the generator created by iterator(source):

  1. Classes naturally form reference cycles, so move iterator(source) to the instance instead of a class attribute
  2. Move _root up so that iterator(source) doesn't refer to the instance it.

I'm not sure about a few things including:

  • How do you test this?
  • Is it.root important? It's not documented. Is it only for testing?

EDIT: the problem is not capturing of the file descriptor, but the generator iterator(source).

@rhettinger
Copy link
Contributor

@colesbury Doe the issue persist if you add gc.collect() after the del it?

@colesbury
Copy link
Contributor Author

@rhettinger everything is cleaned up properly once the garbage collector runs. So adding a gc.collect() after the del it would collect the generator and close the file descriptor.

markshannon pushed a commit to faster-cpython/cpython that referenced this issue Apr 19, 2023
@iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 26, 2023
colesbury added a commit to colesbury/cpython that referenced this issue Jan 18, 2024
Refactor IterParseIterator to avoid a reference cycle between the
iterator() function and the IterParseIterator() instance. This leads to
more prompt clean-up of the "source" file if the returned iterator is
not exhausted and not otherwise part of a reference cycle.

This also avoids a test failure in the GC implementation for the
free-threaded build: if the "source" file is finalized before the
"iterator()" generator, a ResourceWarning is issued leading to a
failure in test_iterparse(). In theory, this warning can occur in
the default build as well, but is much less likely because it would
require an unlucky scheduling of the GC between creation of the generator
and the file object in order to change the order of finalization.
serhiy-storchaka pushed a commit that referenced this issue Jan 23, 2024
The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jan 23, 2024
…honGH-114269)

The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
(cherry picked from commit ce01ab5)

Co-authored-by: Sam Gross <colesbury@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jan 23, 2024
…honGH-114269)

The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
(cherry picked from commit ce01ab5)

Co-authored-by: Sam Gross <colesbury@gmail.com>
@serhiy-storchaka serhiy-storchaka added 3.11 only security fixes 3.12 only security fixes 3.13 bugs and security fixes labels Jan 23, 2024
serhiy-storchaka pushed a commit that referenced this issue Jan 23, 2024
…-114269) (GH-114499)

The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
(cherry picked from commit ce01ab5)

Co-authored-by: Sam Gross <colesbury@gmail.com>
serhiy-storchaka pushed a commit that referenced this issue Jan 23, 2024
…-114269) (GH-114500)

The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
(cherry picked from commit ce01ab5)

Co-authored-by: Sam Gross <colesbury@gmail.com>
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
…honGH-114269)

The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
…honGH-114269)

The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 only security fixes 3.12 only security fixes 3.13 bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants