Skip to content

gh-121577: fix compileall always recompiling pyc files with hash based invalidation #121960

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

kulikjak
Copy link
Contributor

@kulikjak kulikjak commented Jul 18, 2024

Copy link

@edmorley edmorley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this!

There is also another related bug with compileall that I think might be worth trying to fix at the same time, since it's directly related: #117204

That is, I think the logic should be:

  • If header of file doesn't match requested invalidation mode, then skip checking the timestamps or checksum, and instead always recompile
  • If the header does match the requested invalidation mode, then check the file is up to date (via either the timestamp or hash, as appropriate), and recompile if needed

In addition, I think the current version of this PR worsens performance in the "compiling multiple optimisation levels" scenario, since the original source timestamp and/or expected hash is calculated multiple times (since those steps were moved inside the loop) even though those values are constant regardless of optimisation level of the output bytecode. As such, it might be worth moving those back outside the loop.

@kulikjak
Copy link
Contributor Author

There is also another related bug with compileall that I think might be worth trying to fix at the same time, since it's directly related: #117204

That is, I think the logic should be:

  • If header of file doesn't match requested invalidation mode, then skip checking the timestamps or checksum, and instead always recompile
  • If the header does match the requested invalidation mode, then check the file is up to date (via either the timestamp or hash, as appropriate), and recompile if needed

I am aware of #117204, but I don't think it's as simple as that.

I do agree that if you specify the hash-based invalidation and the target uses the timestamp and is up-to-date, compileall shouldn't silently do nothing (which is the current behavior). However, when you don't specify any mode, it's not that simple anymore - you can either use the default one (timestamp), or you can use the one existing .pyc file uses.

I can imagine a use case for both of these, which means that there should probably also be a command line option that switches between those two, or a new default invalidation mode that would mean "use what the existing .pyc file uses, or default to timestamp if no pyc exists" and that means changes to the command line tool and API...

I don't think I am the one who should decide how exactly compileall will behave. That's why I didn't look into that one more.

That said, it's something that should probably be looked into (and I am happy to help with that once we know how it should behave ;)).

In addition, I think the current version of this PR worsens performance in the "compiling multiple optimisation levels" scenario, since the original source timestamp and/or expected hash is calculated multiple times (since those steps were moved inside the loop) even though those values are constant regardless of optimisation level of the output bytecode. As such, it might be worth moving those back outside the loop.

Good catch - I will do so.

@kulikjak
Copy link
Contributor Author

Oh, and there's also a question about what to do when different optimization levels use different invalidation modes....

@edmorley
Copy link

However, when you don't specify any mode, it's not that simple anymore - you can either use the default one (timestamp), or you can use the one existing .pyc file uses.

I can imagine a use case for both of these, which means that there should probably also be a command line option that switches between those two, or a new default invalidation mode that would mean "use what the existing .pyc file uses, or default to timestamp if no pyc exists" and that means changes to the command line tool and API...

So if this were about normal Python operation (where it creates pyc files only as needed), I would agree that perhaps a pyc file existing but not being in the expected mode is good enough, and recompilation should be skipped.

However, IMO the whole point of the compileall command is to explicitly compile all specified files. To me, the fact that some files aren't recompiled every time (ie: are skipped if the timestamp is up to date and force hasn't been used) is more of an optimisation implementation detail.

As such, my inclination would be that if compileall has been requested to compile using a particular invalidation mode, then it should recompile any files that aren't using that mode. My gut feeling is that it would be very rare for anyone using compileall to actually want a mixture of invalidation modes to be used based on whether files had previously been compiled, and therefore that we don't need to add an extra option for it - but curious what others think :-)

@kulikjak
Copy link
Contributor Author

kulikjak commented Aug 30, 2024

That makes sense; maybe I am overthinking it.

In the end, people either care about pyc not changing when the timestamp changes (our case) and then use hashes for everything, or they don't and stick to timestamps. A combination of different invalidation modes for different files (or even optimization levels) seems unlikely and unnecessary...

Maybe @benjaminp would have some thoughts (as an author of invalidation modes :))?

else:
# timestamp-based invalidation
mtime = int(os.stat(fullname).st_mtime)
actual = header[:12]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not check also the file size?

actual = chandle.read(12)
header = chandle.read(16)

if header[4]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can check the magic number without reading the source file.

Also, header[4] can raise IndexError. And you need to check other bytes and bits.

Suggested change
if header[4]:
if len(header) < 16 or header[:4] != importlib.util.MAGIC_NUMBER or struct.unpack('<L', header[4:8])[0] & ~0b11:
break
if header[4] & 0b1:

Comment on lines +239 to +241
actual = header
expect = struct.pack('<4sL8s', importlib.util.MAGIC_NUMBER,
header[4], source_hash)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simply check the source hash:

Suggested change
actual = header
expect = struct.pack('<4sL8s', importlib.util.MAGIC_NUMBER,
header[4], source_hash)
if header[8:] != source_hash:
break

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants