Skip to content

bpo-45863: tarfile: don't zero out header fields unnecessarily #29693

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 9, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 15 additions & 6 deletions Lib/tarfile.py
Original file line number Diff line number Diff line change
Expand Up @@ -888,15 +888,24 @@ def create_pax_header(self, info, encoding):
# Test number fields for values that exceed the field limit or values
# that like to be stored as float.
for name, digits in (("uid", 8), ("gid", 8), ("size", 12), ("mtime", 12)):
if name in pax_headers:
# The pax header has priority. Avoid overflow.
info[name] = 0
continue
needs_pax = False

val = info[name]
if not 0 <= val < 8 ** (digits - 1) or isinstance(val, float):
pax_headers[name] = str(val)
val_is_float = isinstance(val, float)
val_int = round(val) if val_is_float else val
if not 0 <= val_int < 8 ** (digits - 1):
# Avoid overflow.
info[name] = 0
needs_pax = True
elif val_is_float:
# Put rounded value in ustar header, and full
# precision value in pax header.
info[name] = val_int
needs_pax = True

# The existing pax header has priority.
if needs_pax and name not in pax_headers:
pax_headers[name] = str(val)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does PAX have a limit for the number of digits after the dot for a floating point number?

For example, mtime=1/7 is formatted as 0.14285714285714285. The number 0.142,857,142,857,142,85 has 17 digits after the dot. A resolution of 10^-17 seconds, whereas the most accurate clocks have a resolution of 1 ns (10^-9) on Linux.

$ echo Hello > document.txt
$ tar -H posix -cf archive.tar document.txt
$ hexdump -C archive.tar 
(...)00000200  33 30 20 6d 74 69 6d 65  3d 31 36 34 34 33 32 35  |30 mtime=1644325|
00000210  30 33 33 2e 31 38 32 37  34 34 32 32 38 0a 33 30  |033.182744228.30|
00000220  20 61 74 69 6d 65 3d 31  36 34 34 33 32 35 30 33  | atime=164432503|
00000230  33 2e 31 38 32 37 34 34  32 32 38 0a 33 30 20 63  |3.182744228.30 c|
00000240  74 69 6d 65 3d 31 36 34  34 33 32 35 30 33 33 2e  |time=1644325033.|
00000250  31 38 32 37 34 34 32 32  38 0a 00 00 00 00 00 00  |182744228.......|
00000260  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
(...)

$ python3
>>> import os
>>> st=os.stat("document.txt")
>>> st.st_atime_ns
1644325037796763404
>>> st.st_mtime_ns
1644325033182744228
>>> st.st_ctime_ns
1644325033182744228

I get:

  • mtime=1644325033.182744228
  • atime=1644325033.182744228
  • ctime=1644325033.182744228

9 digits after the dot for mtime, atime and ctime.

Maybe for these 9 header names, round to 9 digits if there are more than 9 digits?

Maybe it's ok to have more digits. I don't know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard doesn't specify a maximum number of digits. It suggests to use enough digits to allow the timestamp to be reconstructed exactly (if extracting on a system with the same timestamp precision as the originating system). Most implementations use nanosecond precision.

Doing additional rounding of the value in the pax extended header would be a behavioural change beyond what I'm setting out to do here. I suspect any potential issues with using more digits than an extractor can handle would be rare anyway, since you would have to deliberately use a made-up number rather than just dividing st_mtime_ns by 10^9.


# Create a pax extended header if necessary.
if pax_headers:
Expand Down
55 changes: 55 additions & 0 deletions Lib/test/test_tarfile.py
Original file line number Diff line number Diff line change
Expand Up @@ -1899,6 +1899,61 @@ def test_pax_extended_header(self):
finally:
tar.close()

def test_create_pax_header(self):
# The ustar header should contain values that can be
# represented reasonably, even if a better (e.g. higher
# precision) version is set in the pax header.
# Issue #45863

# values that should be kept
t = tarfile.TarInfo()
t.name = "foo"
t.mtime = 1000.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1000.1 cannot be represented exactly in binary IEEE 754:

>>> (1001.1).hex()
'0x1.f48cccccccccdp+9'

Maybe to avoid any rounding issue in the test, we could use 1000.5 number which can be represented exaclty?

>>> (1000.5).hex()
'0x1.f440000000000p+9'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make that change if you want, but there are going to be quite a few real-world timestamps that don't have an exact representation in floating point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And come to think of it, 1000.5 has the disadvantage of being exactly halfway between two integers, so the result of rounding it changes depending on the prevailing rounding strategy. I think round() is supposed to guarantee round-to-even, but it still adds a little ambiguity for humans.

All in all, switching to fixed-point would probably be a good idea. But that's again rather out of scope for what I'm trying to accomplish here. :)

t.size = 100
t.uid = 123
t.gid = 124
info = t.get_info()
header = t.create_pax_header(info, encoding="iso8859-1")
self.assertEqual(info['name'], "foo")
# mtime should be rounded to nearest second
self.assertIsInstance(info['mtime'], int)
self.assertEqual(info['mtime'], 1000)
self.assertEqual(info['size'], 100)
self.assertEqual(info['uid'], 123)
self.assertEqual(info['gid'], 124)
self.assertEqual(header,
b'././@PaxHeader' + bytes(86) \
+ b'0000000\x000000000\x000000000\x0000000000020\x0000000000000\x00010205\x00 x' \
+ bytes(100) + b'ustar\x0000'+ bytes(247) \
+ b'16 mtime=1000.1\n' + bytes(496) + b'foo' + bytes(97) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

info contains mtime rounded to an int, but in the binary header, mtime can be a floatting point number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. The mtime is rounded in the info after calling create_pax_header, and appears in the overall header twice: unrounded in the pax extended header (formatted as a decimal number in a UTF-8 string) and rounded in the ustar header (octal 1750, seen in the next line).

+ b'0000644\x000000173\x000000174\x0000000000144\x0000000001750\x00006516\x00 0' \
+ bytes(100) + b'ustar\x0000' + bytes(247))

# values that should be changed
t = tarfile.TarInfo()
t.name = "foo\u3374" # can't be represented in ascii
t.mtime = 10**10 # too big
t.size = 10**10 # too big
t.uid = 8**8 # too big
t.gid = 8**8+1 # too big
info = t.get_info()
header = t.create_pax_header(info, encoding="iso8859-1")
# name is kept as-is in info but should be added to pax header
self.assertEqual(info['name'], "foo\u3374")
self.assertEqual(info['mtime'], 0)
self.assertEqual(info['size'], 0)
self.assertEqual(info['uid'], 0)
self.assertEqual(info['gid'], 0)
self.assertEqual(header,
b'././@PaxHeader' + bytes(86) \
+ b'0000000\x000000000\x000000000\x0000000000130\x0000000000000\x00010207\x00 x' \
+ bytes(100) + b'ustar\x0000' + bytes(247) \
+ b'15 path=foo\xe3\x8d\xb4\n16 uid=16777216\n' \
+ b'16 gid=16777217\n20 size=10000000000\n' \
+ b'21 mtime=10000000000\n'+ bytes(424) + b'foo?' + bytes(96) \
+ b'0000644\x000000000\x000000000\x0000000000000\x0000000000000\x00006540\x00 0' \
+ bytes(100) + b'ustar\x0000' + bytes(247))


class UnicodeTest:

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
When the :mod:`tarfile` module creates a pax format archive, it will put an integer representation of timestamps in the ustar header (if possible) for the benefit of older unarchivers, in addition to the existing full-precision timestamps in the pax extended header.