TarFile filters fail in non-UTF-8 locales #133890

Closed
serhiy-storchaka opened this issue May 11, 2025 · 0 comments
Assignees
Labels
3.12 only security fixes 3.13 bugs and security fixes 3.14 bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

serhiy-storchaka commented May 11, 2025

Bug report

test_tarfile fails in non-UTF-8 locales. For example:

$ LC_ALL=uk_UA ./python -m test -vuall test_tarfile -m 'NoneInfoExtractTests_*' -m test_data_filter -m test_tar_filter
======================================================================
ERROR: setUpClass (test.test_tarfile.NoneInfoExtractTests_Data)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/serhiy/py/cpython/Lib/test/test_tarfile.py", line 3264, in setUpClass
    tar.extractall(cls.control_dir, filter=cls.extraction_filter)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2389, in extractall
    tarinfo = self._get_extract_tarinfo(member, filter_function, path)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2441, in _get_extract_tarinfo
    tarinfo = filter_function(tarinfo, path)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 842, in data_filter
    new_attrs = _get_filtered_attrs(member, dest_path, True)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 782, in _get_filtered_attrs
    target_path = os.path.realpath(os.path.join(dest_path, name))
  File "/home/serhiy/py/cpython/Lib/posixpath.py", line 405, in realpath
    return _realpath(filename, strict, sep, curdir, pardir, getcwd)
  File "/home/serhiy/py/cpython/Lib/posixpath.py", line 452, in _realpath
    st_mode = lstat(newpath).st_mode
              ~~~~~^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/encodings/koi8_u.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 112-118: character maps to <undefined>
encoding with 'koi8-u' codec failed

======================================================================
ERROR: setUpClass (test.test_tarfile.NoneInfoExtractTests_Default)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/serhiy/py/cpython/Lib/test/test_tarfile.py", line 3264, in setUpClass
    tar.extractall(cls.control_dir, filter=cls.extraction_filter)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2389, in extractall
    tarinfo = self._get_extract_tarinfo(member, filter_function, path)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2441, in _get_extract_tarinfo
    tarinfo = filter_function(tarinfo, path)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 842, in data_filter
    new_attrs = _get_filtered_attrs(member, dest_path, True)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 782, in _get_filtered_attrs
    target_path = os.path.realpath(os.path.join(dest_path, name))
  File "/home/serhiy/py/cpython/Lib/posixpath.py", line 405, in realpath
    return _realpath(filename, strict, sep, curdir, pardir, getcwd)
  File "/home/serhiy/py/cpython/Lib/posixpath.py", line 452, in _realpath
    st_mode = lstat(newpath).st_mode
              ~~~~~^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/encodings/koi8_u.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 112-118: character maps to <undefined>
encoding with 'koi8-u' codec failed

======================================================================
ERROR: setUpClass (test.test_tarfile.NoneInfoExtractTests_FullyTrusted)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/serhiy/py/cpython/Lib/test/test_tarfile.py", line 3264, in setUpClass
    tar.extractall(cls.control_dir, filter=cls.extraction_filter)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2397, in extractall
    self._extract_one(tarinfo, path, set_attrs=not tarinfo.isdir(),
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                      numeric_owner=numeric_owner)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2460, in _extract_one
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                         set_attrs=set_attrs,
                         ^^^^^^^^^^^^^^^^^^^^
                         numeric_owner=numeric_owner)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2543, in _extract_member
    self.makefile(tarinfo, targetpath)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2589, in makefile
    with bltn_open(targetpath, "wb") as target:
         ~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/encodings/koi8_u.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 112-118: character maps to <undefined>
encoding with 'koi8-u' codec failed

======================================================================
ERROR: setUpClass (test.test_tarfile.NoneInfoExtractTests_Tar)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/serhiy/py/cpython/Lib/test/test_tarfile.py", line 3264, in setUpClass
    tar.extractall(cls.control_dir, filter=cls.extraction_filter)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2389, in extractall
    tarinfo = self._get_extract_tarinfo(member, filter_function, path)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2441, in _get_extract_tarinfo
    tarinfo = filter_function(tarinfo, path)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 836, in tar_filter
    new_attrs = _get_filtered_attrs(member, dest_path, False)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 782, in _get_filtered_attrs
    target_path = os.path.realpath(os.path.join(dest_path, name))
  File "/home/serhiy/py/cpython/Lib/posixpath.py", line 405, in realpath
    return _realpath(filename, strict, sep, curdir, pardir, getcwd)
  File "/home/serhiy/py/cpython/Lib/posixpath.py", line 452, in _realpath
    st_mode = lstat(newpath).st_mode
              ~~~~~^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/encodings/koi8_u.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 112-118: character maps to <undefined>
encoding with 'koi8-u' codec failed

======================================================================
ERROR: test_data_filter (test.test_tarfile.TestExtractionFilters.test_data_filter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/serhiy/py/cpython/Lib/test/test_tarfile.py", line 4086, in test_data_filter
    filtered = tarfile.data_filter(tarinfo, '')
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 842, in data_filter
    new_attrs = _get_filtered_attrs(member, dest_path, True)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 782, in _get_filtered_attrs
    target_path = os.path.realpath(os.path.join(dest_path, name))
  File "/home/serhiy/py/cpython/Lib/posixpath.py", line 405, in realpath
    return _realpath(filename, strict, sep, curdir, pardir, getcwd)
  File "/home/serhiy/py/cpython/Lib/posixpath.py", line 452, in _realpath
    st_mode = lstat(newpath).st_mode
              ~~~~~^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/encodings/koi8_u.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 69-75: character maps to <undefined>
encoding with 'koi8-u' codec failed

======================================================================
ERROR: test_tar_filter (test.test_tarfile.TestExtractionFilters.test_tar_filter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/serhiy/py/cpython/Lib/test/test_tarfile.py", line 4076, in test_tar_filter
    filtered = tarfile.tar_filter(tarinfo, '')
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 836, in tar_filter
    new_attrs = _get_filtered_attrs(member, dest_path, False)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 782, in _get_filtered_attrs
    target_path = os.path.realpath(os.path.join(dest_path, name))
  File "/home/serhiy/py/cpython/Lib/posixpath.py", line 405, in realpath
    return _realpath(filename, strict, sep, curdir, pardir, getcwd)
  File "/home/serhiy/py/cpython/Lib/posixpath.py", line 452, in _realpath
    st_mode = lstat(newpath).st_mode
              ~~~~~^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/encodings/koi8_u.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 69-75: character maps to <undefined>
encoding with 'koi8-u' codec failed

----------------------------------------------------------------------

This happens because the filters call os.path.realpath() on paths from the tar archive. realpath() calls os.lstat(), which fails with an unexpected UnicodeEncodeError if a path in the archive can't be encoded in the current filesystem encoding. This error should be handled at some level, either in os.path.realpath() or in tarfile. os.lstat() can also raise ValueError if the path contains null bytes. I don't know whether this is relevant here; we should test.
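
A minimal sketch of the failure mode and of one way it could be handled, not the actual patch. The helper names `can_encode_for_fs` and `resolve_member_path` are hypothetical, the member name is only an illustrative example of a non-Cyrillic name, and the 'koi8-u' encoding is passed explicitly so that the check does not depend on the locale it runs in:

```python
import os


def can_encode_for_fs(name, encoding="koi8-u"):
    """Return True if `name` can be encoded with `encoding`.

    Hypothetical helper: it mimics what happens when a str path is
    handed to os.lstat() under a non-UTF-8 filesystem encoding.
    """
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def resolve_member_path(dest_path, name):
    """Resolve a member path, returning None if it cannot be resolved.

    Hypothetical helper: UnicodeEncodeError (unencodable name) and
    ValueError (for example, embedded null bytes) are treated the same
    way as OSError instead of propagating to the caller.
    """
    try:
        return os.path.realpath(os.path.join(dest_path, name))
    except (OSError, UnicodeEncodeError, ValueError):
        return None


# Latin letters with umlauts have no mapping in KOI8-U.
print(can_encode_for_fs("ustar/umlauts-ÄÖÜäöüß"))
# Plain ASCII names are always encodable.
print(can_encode_for_fs("ustar/regtype"))
```

Whether such a failure should be swallowed, reported as a filter error, or handled inside os.path.realpath() itself is exactly the open question raised above; the sketch only shows that the three exception types can be caught in one place.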

@serhiy-storchaka serhiy-storchaka self-assigned this May 11, 2025
@serhiy-storchaka serhiy-storchaka added type-bug An unexpected behavior, bug, or error 3.12 only security fixes 3.13 bugs and security fixes 3.14 bugs and security fixes labels May 11, 2025
@picnixz picnixz added the stdlib Python modules in the Lib dir label May 11, 2025
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue May 17, 2025
UnicodeEncodeError is now handled the same way as OSError during
TarFile member extraction.
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue May 17, 2025
UnicodeEncodeError is now handled the same way as OSError during
TarFile member extraction.
@serhiy-storchaka serhiy-storchaka moved this to In Progress in Tarfile issues May 17, 2025
serhiy-storchaka added a commit that referenced this issue May 18, 2025
UnicodeEncodeError is now handled the same way as OSError during
TarFile member extraction.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 18, 2025
UnicodeEncodeError is now handled the same way as OSError during
TarFile member extraction.
(cherry picked from commit 9983c7d)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue May 18, 2025
…H-134147)

UnicodeEncodeError is now handled the same way as OSError during
TarFile member extraction.
(cherry picked from commit 9983c7d)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@github-project-automation github-project-automation bot moved this from In Progress to Done in Tarfile issues May 18, 2025
serhiy-storchaka added a commit that referenced this issue May 19, 2025
…H-134196)

UnicodeEncodeError is now handled the same way as OSError during
TarFile member extraction.
(cherry picked from commit 9983c7d)
serhiy-storchaka added a commit that referenced this issue May 20, 2025
…H-134195)

UnicodeEncodeError is now handled the same way as OSError during
TarFile member extraction.
(cherry picked from commit 9983c7d)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>