Add the close method for ElementTree.iterparse() object #69893

serhiy-storchaka · 2015-11-23T13:57:25Z

BPO	25707
Nosy	@scoder, @serhiy-storchaka, @Vgr255, @furkanonder, @jacobtylerwalls
PRs	bpo-43292: Fix file leak in `ET.iterparse()` when not exhausted #31696
Dependencies	bpo-25638: Verify the etree_parse and etree_iterparse benchmarks are working appropriately

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = None
created_at = <Date 2015-11-23.13:57:25.201>
labels = ['3.8', 'library', 'performance']
title = 'Add the close method for ElementTree.iterparse() object'
updated_at = <Date 2022-03-05.15:26:50.497>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2022-03-05.15:26:50.497>
actor = 'jacobtylerwalls'
assignee = 'serhiy.storchaka'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2015-11-23.13:57:25.201>
creator = 'serhiy.storchaka'
dependencies = ['25638']
files = []
hgrepos = []
issue_num = 25707
keywords = ['patch']
message_count = 8.0
messages = ['255159', '255164', '255171', '255173', '341003', '341253', '341258', '368308']
nosy_count = 5.0
nosy_names = ['scoder', 'serhiy.storchaka', 'abarry', 'furkanonder', 'jacobtylerwalls']
pr_nums = ['31696']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'resource usage'
url = 'https://bugs.python.org/issue25707'
versions = ['Python 3.8']

Linked PRs

gh-69893: Add the close() method for xml.etree.ElementTree.iterparse() iterator #114534

serhiy-storchaka · 2015-11-23T13:57:25Z

If ElementTree.iterparse() is called with a file name, it opens the file itself. When the resulting iterator is not exhausted, the file is left unclosed.

>>> import xml.etree.ElementTree as ET
>>> import gc
>>> ET.iterparse('/dev/null')
<xml.etree.ElementTree._IterParseIterator object at 0xb6f9e38c>
>>> gc.collect()
__main__:1: ResourceWarning: unclosed file <_io.BufferedReader name='/dev/null'>
34

Martin Panter proposed in bpo-25688 to add an explicit way to clean it up, like a generator.close() method.
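
Until such a method exists, one possible workaround is for the caller to open the file and pass the file object to iterparse(), so that closing the file no longer depends on exhausting the iterator. A minimal sketch; the helper name first_start_tag and the sample document are made up for illustration:

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def first_start_tag(path):
    """Return the tag of the first element, abandoning the iterator early."""
    # The caller owns the file, so the with block closes it even though
    # the iterparse() iterator is never exhausted.
    with open(path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("start",)):
            return elem.tag
    return None


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.xml")
    with open(sample, "wb") as out:
        out.write(b"<root><item/><item/></root>")
    tag = first_start_tag(sample)
```

This keeps ownership of the file with the caller, which is the same ownership rule discussed later in the thread: iterparse() should only close what it opened.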

Vgr255 · 2015-11-23T14:17:13Z

I am unable to reproduce the issue on Windows 7 with 3.5.0; I have tried opening a small (non-empty) text file. Here's the result:

>>> import xml.etree.ElementTree as ET
>>> import gc
>>> ET.iterparse("E:/New.txt")
<xml.etree.ElementTree._IterParseIterator object at 0x0023ABB0>
>>> gc.collect()
59

serhiy-storchaka · 2015-11-23T14:56:23Z

You have to enable ResourceWarning, which is ignored by default. Run the interpreter with the -Wa option.
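
The same effect can be had from inside a script with the warnings module. The sketch below is an illustration only: it leaks a plain file object rather than an iterparse() iterator, because whether iterparse() leaks depends on the Python version, and it assumes CPython, where dropping the last reference finalizes the file immediately.

```python
import gc
import os
import tempfile
import warnings

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "empty.txt")
    open(path, "wb").close()

    with warnings.catch_warnings(record=True) as caught:
        # Equivalent in spirit to -Wa, but limited to ResourceWarning.
        warnings.simplefilter("always", ResourceWarning)
        leaked = open(path, "rb")  # deliberately never closed
        del leaked
        gc.collect()

# Keep only the warnings about unclosed resources.
resource_warnings = [w for w in caught if issubclass(w.category, ResourceWarning)]
```

Without the simplefilter() call the list would normally stay empty, which matches the behaviour reported above.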

Vgr255 · 2015-11-23T15:08:13Z

Oh, my bad. Ignore my last message, behaviour is identical then. Thanks for clearing that up.

scoder · 2019-04-27T17:32:50Z

I don't think there is a need for a close() method. Instead, the iterator should close the file first thing when it's done with it, but only if it owns it. Therefore, the fix in bpo-25688 seems correct.

Closing can also be done explicitly in a finaliser of the iterator, if implicit closing via decref is too lax.

serhiy-storchaka · 2019-05-02T08:01:28Z

Implicitly closing an exhausted iterator helps only if the iterator is iterated to the end. If the iteration is stopped before the end, the file descriptor leaks. Closing the file descriptor in the finalizer can be deferred indefinitely, especially in implementations without reference counting. Since file descriptors are a limited resource, this can cause trouble in real programs.

The reasons for close() on iterparse objects are the same as for close() on files and generators.
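
The generator analogy can be shown with a small, self-contained sketch. The generator read_chunks and the log list are invented for illustration; the finally block stands in for closing a file.

```python
def read_chunks(log):
    """Toy generator whose finally block stands in for closing a file."""
    log.append("open")
    try:
        for i in range(3):
            yield i
    finally:
        # Runs when the generator is exhausted OR explicitly closed.
        log.append("closed")


log = []
gen = read_chunks(log)
first = next(gen)   # iteration starts, then stops early
gen.close()         # triggers the finally block right away
gen.close()         # closing again is a no-op
```

Without the explicit close(), the cleanup would run only when the generator object is finalized, and that moment is not guaranteed on implementations without reference counting.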

Maybe we will need to implement the full generator protocol (send() and throw()) in the iterparse objects, but I currently do not know of any use cases for this.

scoder · 2019-05-02T08:32:20Z

Ok, I think it's reasonable to make the resource management explicit for the specific case of letting iterparse() open the file. That suggests that there should also be context manager support, given that safe usages would often involve a try-finally.

Since it might not always be obvious to users whether they need to close the iterator, I would also suggest not letting it raise an error on a double-close, i.e. if .close() was already called or the iterator was already exhausted (and the file closed automatically), calling .close() should just do nothing.
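
One way these two suggestions could look together is sketched below. ClosingIterparse is a hypothetical wrapper written for this illustration, not the stdlib API; it always opens the file itself and relies on file.close() being safe to call repeatedly.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


class ClosingIterparse:
    """Hypothetical wrapper: context manager support plus idempotent close()."""

    def __init__(self, path, events=None):
        self._file = open(path, "rb")
        self._events = ET.iterparse(self._file, events=events)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._events)

    def close(self):
        # Closing an already closed file object does nothing,
        # so a double close() never raises.
        self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.xml")
    with open(sample, "wb") as out:
        out.write(b"<root><item/></root>")

    with ClosingIterparse(sample, events=("start",)) as it:
        _event, elem = next(it)  # stop after the first event
    first_tag = elem.tag
    closed_after_with = it._file.closed
    it.close()  # second close, silently ignored
```

The with statement replaces the try-finally that safe callers would otherwise have to write by hand.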

furkanonder · 2020-05-06T22:39:43Z

Python 3.8.2 (default, Apr  8 2020, 14:31:25) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> import gc
>>> ET.iterparse('/dev/null')
<xml.etree.ElementTree.iterparse.<locals>.IterParseIterator object at 0x7fb96f679d00>
>>> gc.collect()
34

The warning (__main__:1: ResourceWarning: unclosed file <_io.BufferedReader name='/dev/null'>) is no longer emitted in Python 3.8.2.

Prometheus3375 · 2022-06-30T13:00:54Z

I recently wrote a fully compatible implementation that solves the issue discussed here:

from xml.etree.ElementTree import XMLPullParser


class iterparse:
    __slots__ = '_source', '_opened', 'root', '_next'

    def __init__(self, /, source, events = None, parser = None):
        # If the source cannot be opened, open() raises and the
        # half-initialized object is garbage collected; its __del__
        # then calls close(). The _opened flag must therefore be set
        # to False before opening, so close() won't raise AttributeError.
        self._opened = False
        if hasattr(source, 'read'):
            self._source = source
        else:
            self._source = open(source, 'rb')
            self._opened = True

        self.root = None
        self._next = self._iterator(XMLPullParser(events=events, _parser=parser)).__next__

    def __iter__(self, /):
        return self

    def _iterator(self, parser: XMLPullParser, /):
        source = self._source
        try:
            data = source.read(16 * 1024)
            while data:
                parser.feed(data)
                yield from parser.read_events()
                data = source.read(16 * 1024)

            root = parser._close_and_return_root()
            yield from parser.read_events()  # is it necessary?
            self.root = root
        finally:
            self.close()

    def __next__(self, /):
        return self._next()

    def close(self, /):
        if self._opened:
            self._source.close()
            self._opened = False

    def __del__(self, /):
        self.close()

Features:

Fully compatible with the current implementation.
According to my tests, the implementation above has a faster creation time (no new function and class are created on each call), but a slightly slower traversal time.
Has a close method that does not raise an exception on double-close, as @scoder asked (file objects also do not raise an error in such cases).
If there are no more references to the iterparse object, it is destroyed, closing the opened file.

Benchmarks:
I used this XML file for testing.

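from timeit import repeat
# Assumption based on the output below: iterparse_old is the stdlib function,
# and iterparse_new is the class shown above under a different name.
from xml.etree.ElementTree import iterparse as iterparse_old


def test_creation(file: str, impls: list[type], /):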
    code = f'iterparse({file!r})'

    return tuple(
        repeat(
            code,
            repeat=50,
            number=1000,
            globals=dict(iterparse=t),
            )
        for t in impls
        )


def test_traverse(file: str, impls: list[type], /):
    setup = f'it = iterparse({file!r})'
    code = f'for _ in it: pass'

    return tuple(
        repeat(
            code,
            setup=setup,
            repeat=50,
            number=10,
            globals=dict(iterparse=t),
            )
        for t in impls
        )


def main():
    file_path = '20220628-FULL-1_1(xsd).xml'
    impls = [iterparse_new, iterparse_old]

    creation = test_creation(file_path, impls)
    for i, time in enumerate(creation):
        print(f'Creation time of {impls[i].__name__}:', min(time))

    print()

    traverse = test_traverse(file_path, impls)
    for i, time in enumerate(traverse):
        print(f'Traverse time of {impls[i].__name__}:', min(time))


if __name__ == '__main__':
    main()

Creation time of iterparse_new: 0.08623739999984537
Creation time of iterparse: 0.10833569999977044

Traverse time of iterparse_new: 0.3242015000005267
Traverse time of iterparse: 0.3198846000004778

serhiy-storchaka · 2024-01-24T15:36:14Z

Recent improvements (#101438) made explicit close() not needed in CPython. But it can still be useful in alternative implementations like PyPy, and maybe in future Python versions, so it is worth adding.
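
Code that has to run on both older and newer interpreters could call close() only when the iterator provides it. This is a sketch under the assumption that the method is available from Python 3.13 on, as the linked typeshed update suggests; the helper name peek_root_tag is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def peek_root_tag(source):
    """Return the root tag without parsing the rest of the document."""
    it = ET.iterparse(source, events=("start",))
    try:
        for _event, elem in it:
            return elem.tag
        return None
    finally:
        # Older versions have no close() on the iterator; skip it there.
        close = getattr(it, "close", None)
        if close is not None:
            close()


tag = peek_root_tag(io.BytesIO(b"<root><item/></root>"))
```

On versions without close(), cleanup still falls back to whatever the interpreter does when the iterator is finalized.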

serhiy-storchaka self-assigned this Nov 23, 2015

serhiy-storchaka added stdlib Python modules in the Lib dir performance Performance or resource usage labels Nov 23, 2015

scoder added the 3.8 (EOL) end of life label Apr 27, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

serhiy-storchaka added the topic-XML label May 10, 2022

iritkatriel added type-bug An unexpected behavior, bug, or error and removed performance Performance or resource usage labels Aug 17, 2022

namdre added a commit to eclipse-sumo/sumo that referenced this issue Mar 22, 2023

fixing ResourceWarning due to unclosed file when aborting iterparse i…

51b3663

…teration. refs #21 (python/cpython#69893)

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Jan 24, 2024

pythongh-69893: Add the close() method for xml.etree.ElementTree.iter…

4a2d1d8

…parse() iterator

bedevere-app bot mentioned this issue Jan 24, 2024

gh-69893: Add the close() method for xml.etree.ElementTree.iterparse() iterator #114534

Merged

serhiy-storchaka added a commit that referenced this issue Feb 4, 2024

gh-69893: Add the close() method for xml.etree.ElementTree.iterparse(…

ca715e5

…) iterator (GH-114534)

serhiy-storchaka closed this as completed Feb 4, 2024

aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024

pythongh-69893: Add the close() method for xml.etree.ElementTree.iter…

233e5fd

…parse() iterator (pythonGH-114534)

fsc-eriker pushed a commit to fsc-eriker/cpython that referenced this issue Feb 14, 2024

pythongh-69893: Add the close() method for xml.etree.ElementTree.iter…

ba4a301

…parse() iterator (pythonGH-114534)

max-muoto mentioned this issue Jul 4, 2024

Update xml.etree.ElementTree for 3.13 python/typeshed#12277

Merged
