gh-121577: fix compileall always recompiling pyc files with hash based invalidation #121960

kulikjak · 2024-07-18T12:15:25Z

Issue: compileall shouldn't recompile up-to-date files with hash-based invalidation mode #121577

…validation modes

edmorley

Thank you for working on this!

There is also another related bug with compileall that I think might be worth trying to fix at the same time, since it's directly related: #117204

That is, I think the logic should be:

- If the header of the file doesn't match the requested invalidation mode, then skip checking the timestamps or checksum, and instead always recompile.
- If the header does match the requested invalidation mode, then check the file is up to date (via either the timestamp or hash, as appropriate), and recompile if needed.

In addition, I think the current version of this PR worsens performance in the "compiling multiple optimisation levels" scenario, since the original source timestamp and/or expected hash is calculated multiple times (since those steps were moved inside the loop) even though those values are constant regardless of optimisation level of the output bytecode. As such, it might be worth moving those back outside the loop.
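A minimal sketch of the logic proposed above, not the code in this PR. The names `pyc_mode`, `stale_levels`, and the `TIMESTAMP`/`HASH` labels are hypothetical, and the sketch assumes the 16-byte pyc header layout from PEP 552 (magic, flags word, then either mtime and size or an 8-byte source hash). Magic-number validation is left out for brevity.

```python
import importlib.util
import struct

# Hypothetical labels for this sketch; compileall itself works with
# py_compile.PycInvalidationMode members.
TIMESTAMP, HASH = "timestamp", "hash"


def pyc_mode(header):
    """Read the invalidation mode from the flags word of a 16-byte pyc header."""
    flags = struct.unpack("<L", header[4:8])[0]
    return HASH if flags & 0b1 else TIMESTAMP


def stale_levels(requested_mode, source_bytes, source_mtime, headers_by_level):
    """Return the optimization levels whose pyc would need to be rewritten."""
    # Source-derived values do not depend on the optimization level,
    # so they are computed once, outside the loop.
    if requested_mode == HASH:
        expected = importlib.util.source_hash(source_bytes)
    else:
        expected = struct.pack("<LL", source_mtime & 0xFFFF_FFFF,
                               len(source_bytes) & 0xFFFF_FFFF)

    stale = []
    for level, header in headers_by_level.items():
        if header is None or len(header) < 16:
            stale.append(level)      # missing or truncated pyc
        elif pyc_mode(header) != requested_mode:
            stale.append(level)      # wrong mode: always recompile
        elif header[8:16] != expected:
            stale.append(level)      # right mode, but out of date
    return stale
```

Here `headers_by_level` maps an optimization level to the first 16 bytes of the existing pyc, or `None` if no pyc exists for that level.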

kulikjak · 2024-08-30T12:26:42Z

> There is also another related bug with compileall that I think might be worth trying to fix at the same time, since it's directly related: #117204
>
> That is, I think the logic should be:
>
> - If the header of the file doesn't match the requested invalidation mode, then skip checking the timestamps or checksum, and instead always recompile.
> - If the header does match the requested invalidation mode, then check the file is up to date (via either the timestamp or hash, as appropriate), and recompile if needed.

I am aware of #117204, but I don't think it's as simple as that.

I do agree that if you specify hash-based invalidation and the target uses the timestamp and is up-to-date, compileall shouldn't silently do nothing (which is the current behavior). However, when you don't specify any mode, it's not that simple anymore - you can either use the default one (timestamp), or you can use the one the existing .pyc file uses.

I can imagine a use case for both of these, which means that there should probably also be a command line option that switches between those two, or a new default invalidation mode that would mean "use what the existing .pyc file uses, or default to timestamp if no pyc exists" and that means changes to the command line tool and API...

I don't think I am the one who should decide how exactly compileall will behave. That's why I didn't look into that one more.

That said, it's something that should probably be looked into (and I am happy to help with that once we know how it should behave ;)).

> In addition, I think the current version of this PR worsens performance in the "compiling multiple optimisation levels" scenario, since the original source timestamp and/or expected hash is calculated multiple times (since those steps were moved inside the loop) even though those values are constant regardless of optimisation level of the output bytecode. As such, it might be worth moving those back outside the loop.

Good catch - I will do so.

kulikjak · 2024-08-30T12:28:27Z

Oh, and there's also a question about what to do when different optimization levels use different invalidation modes....

edmorley · 2024-08-30T12:37:47Z

> However, when you don't specify any mode, it's not that simple anymore - you can either use the default one (timestamp), or you can use the one the existing .pyc file uses.
>
> I can imagine a use case for both of these, which means that there should probably also be a command line option that switches between those two, or a new default invalidation mode that would mean "use what the existing .pyc file uses, or default to timestamp if no pyc exists" and that means changes to the command line tool and API...

So if this were about normal Python operation (where it creates pyc files only as needed), I would agree that perhaps a pyc file existing but not being in the expected mode is good enough, and recompilation should be skipped.

However, IMO the whole point of the compileall command is to explicitly compile all specified files. To me, the fact that some files aren't recompiled every time (ie: are skipped if the timestamp is up to date and force hasn't been used) is more of an optimisation implementation detail.

As such, my inclination would be that if compileall has been requested to compile using a particular invalidation mode, then it should recompile any files that aren't using that mode. My gut feeling is that it would be very rare for anyone using compileall to actually want a mixture of invalidation modes to be used based on whether files had previously been compiled, and therefore that we don't need to add an extra option for it - but curious what others think :-)

kulikjak · 2024-08-30T12:54:09Z

That makes sense; maybe I am overthinking it.

In the end, people either care about pyc not changing when the timestamp changes (our case) and then use hashes for everything, or they don't and stick to timestamps. A combination of different invalidation modes for different files (or even optimization levels) seems unlikely and unnecessary...

Maybe @benjaminp would have some thoughts (as an author of invalidation modes :))?

serhiy-storchaka · 2025-07-13T14:51:00Z

Lib/compileall.py

+                        else:
+                            # timestamp-based invalidation
+                            mtime = int(os.stat(fullname).st_mtime)
+                            actual = header[:12]


Why not also check the file size?
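For context, a timestamp-based pyc header stores the source size in bytes 12-16, right after the mtime, so comparing all 16 bytes instead of the first 12 covers both. The following is a rough sketch rather than the PR's code: `timestamp_fields_match` is a hypothetical helper, and the demo edits the source while restoring its mtime to show a case that only the size field catches.

```python
import importlib.util
import os
import py_compile
import struct
import tempfile


def timestamp_fields_match(source_path, pyc_path):
    """Compare magic, flags, mtime and size of a timestamp-based pyc."""
    st = os.stat(source_path)
    expected = struct.pack("<4sLLL", importlib.util.MAGIC_NUMBER, 0,
                           int(st.st_mtime) & 0xFFFF_FFFF,
                           st.st_size & 0xFFFF_FFFF)
    with open(pyc_path, "rb") as f:
        return f.read(16) == expected


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mod.py")
    with open(src, "w") as f:
        f.write("x = 1\n")
    pyc = py_compile.compile(
        src, cfile=os.path.join(tmp, "mod.pyc"),
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)
    fresh_before = timestamp_fields_match(src, pyc)

    # Change the source but put the old mtime back: an mtime-only
    # comparison would miss this, the size field does not.
    st = os.stat(src)
    with open(src, "a") as f:
        f.write("y = 2\n")
    os.utime(src, ns=(st.st_atime_ns, st.st_mtime_ns))
    fresh_after = timestamp_fields_match(src, pyc)
```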

serhiy-storchaka · 2025-07-13T14:55:45Z

Lib/compileall.py

-                            actual = chandle.read(12)
+                            header = chandle.read(16)
+
+                        if header[4]:


You can check the magic number without reading the source file.

Also, header[4] can raise IndexError. And you need to check other bytes and bits.

Suggested change

-                        if header[4]:
+                        if len(header) < 16 or header[:4] != importlib.util.MAGIC_NUMBER or struct.unpack('<L', header[4:8])[0] & ~0b11:
+                            break
+                        if header[4] & 0b1:
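The checks in this suggestion can be pulled out into a standalone function, sketched here under the PEP 552 header layout. `classify_pyc_header` and its return labels are hypothetical names for illustration, not part of compileall. Note that indexing a short `bytes` object (for example an empty or truncated pyc) is what raises `IndexError` in the original `header[4]`, hence the length check first.

```python
import importlib.util
import struct


def classify_pyc_header(header):
    """Classify the first 16 bytes of a pyc file.

    Returns "invalid", "timestamp", "unchecked-hash" or "checked-hash".
    """
    if len(header) < 16:
        return "invalid"             # truncated or empty file
    if header[:4] != importlib.util.MAGIC_NUMBER:
        return "invalid"             # written by a different bytecode version
    flags = struct.unpack("<L", header[4:8])[0]
    if flags & ~0b11:
        return "invalid"             # only the two lowest bits are defined
    if not flags & 0b01:
        return "timestamp"           # bit 0 clear: mtime + size follow
    # Bit 0 set: an 8-byte source hash follows; bit 1 is check_source.
    return "checked-hash" if flags & 0b10 else "unchecked-hash"
```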

serhiy-storchaka · 2025-07-13T14:58:11Z

Lib/compileall.py

+                            actual = header
+                            expect = struct.pack('<4sL8s', importlib.util.MAGIC_NUMBER,
+                                                 header[4], source_hash)


You can simply check the source hash:

Suggested change

-                            actual = header
-                            expect = struct.pack('<4sL8s', importlib.util.MAGIC_NUMBER,
-                                                 header[4], source_hash)
+                            if header[8:] != source_hash:
+                                break
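Once the magic number and flags have been validated separately, comparing bytes 8-16 against `importlib.util.source_hash()` is enough for the hash-based case. A small round trip illustrating this, as a sketch only: `hash_pyc_is_current` is a hypothetical helper, and the file names and the mtime value are made up for the demo.

```python
import importlib.util
import os
import py_compile
import tempfile


def hash_pyc_is_current(source_path, pyc_path):
    """Compare the hash stored in a hash-based pyc with the source's hash."""
    with open(source_path, "rb") as f:
        source_hash = importlib.util.source_hash(f.read())
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    return header[8:16] == source_hash


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mod.py")
    with open(src, "w") as f:
        f.write("x = 1\n")
    pyc = py_compile.compile(
        src, cfile=os.path.join(tmp, "mod.pyc"),
        invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH)
    current_after_compile = hash_pyc_is_current(src, pyc)

    # Changing only the mtime must not invalidate a hash-based pyc.
    os.utime(src, (1_000_000_000, 1_000_000_000))
    current_after_touch = hash_pyc_is_current(src, pyc)

    # Changing the contents must.
    with open(src, "a") as f:
        f.write("y = 2\n")
    current_after_edit = hash_pyc_is_current(src, pyc)
```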

kulikjak added 2 commits July 18, 2024 13:43

compileall now verifies whether recompilation is necessary for all in…

35b9311

…validation modes

add NEWS entry

9d47dbc

bedevere-app bot mentioned this pull request Jul 18, 2024

compileall shouldn't recompile up-to-date files with hash-based invalidation mode #121577

Open

bedevere-app bot added the awaiting review label Jul 18, 2024

kulikjak added 3 commits July 18, 2024 14:21

fix NEWS entry syntax

f14eae5

fix the verification logic

9e3b22d

fix tests

64d7f35

edmorley reviewed Aug 24, 2024

edmorley mentioned this pull request Aug 24, 2024

compileall doesn't recompile when invalidation mode changes even if timestamps match #117204

Open

serhiy-storchaka reviewed Jul 13, 2025
