Skip to content

gh-84013: normalize directory contents during import #113447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

ronaldoussoren
Copy link
Contributor

@ronaldoussoren ronaldoussoren commented Dec 24, 2023

Identifiers in Python code are normalized to NFKC and there is no guarantee that names in the file system are normalized.

Teach the FileFinder class about normalization, but only do this when the to-be-imported name contains non-ASCII characters.

Identifiers in Python code are normalized to NFKC and
there is no guarantee that names in the file system are
normalized.

Teach the FileFinder class about normalization, but only
do this when the to-be-imported name contains non-ASCII
characters.
@ronaldoussoren
Copy link
Contributor Author

The PR is a draft primarily because the tests are incomplete:

  1. No tests using PYTHONCASEOK
  2. No new unit tests for the FileFinder class

@ronaldoussoren
Copy link
Contributor Author

ronaldoussoren commented Dec 27, 2023

For some reason this PR doesn't work on any system but macOS, at least as far as the tests are concerned. I'll see if I can find the problem for Linux, I don't have a system where I can develop on Windows.

This has been resolved by keeping the unnormalised name around because this is needed to match filesystem behaviour. Worked on macOS because the kernel will normalise names there.

Copy link
Member

@ericsnowcurrently ericsnowcurrently left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns.

Also, should we document that finders (both metapath-based and path-based) must deal with un-normalized names, or can we do that for all finders internally? Instead, should we state in the language spec that all module names passing through the import system are guaranteed to be NFKC-normalized?

Comment on lines +1622 to +1623
if not tail_module.isascii() and not self._cache_is_normalized:
self._normalize_cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of __import__() and importlib.import_module() we will probably need to either normalize cache_module or preserve the original name in the cache. The former sounds expensive so the latter might be the way to go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless, we should probably have some test cases that use one of those functions rather than syntax.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the code to keep the unnormalised name around, we need the unnormalised name to access the filesystem and need the normalised name to compare to tail_module (which will generally be normalised by the compiler).

I agree that there should be more test cases, I started out with just the minimal test from the issue to see if I can get this to work without bootstrapping issues and get some feedback. I'll definitely add more tests when this approach is seen as acceptable by @brettcannon (as the importlib expert).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cache always keeps the raw name then can we drop _cache_is_normalized?

@ericsnowcurrently
Copy link
Member

Also, I meant to say that I think you're on the right track.

@ronaldoussoren
Copy link
Contributor Author

I have some concerns.

Also, should we document that finders (both metapath-based and path-based) must deal with un-normalized names, or can we do that for all finders internally? Instead, should we state in the language spec that all module names passing through the import system are guaranteed to be NFKC-normalized?

Good question, whatever we choose should be documented.

The names passed to __import__ may not be normalised if called by a user, so either the implementation of that function needs to normalise its arguments, or finders would have to do this. From a usability perspective it's probably better to normalise names __import__ and document this in the importlib documentation for finders.

@ericsnowcurrently
Copy link
Member

whatever we choose should be documented.

FWIW, the eventual outcome is a direct consequence of (unintentional?) side effect of PEP 3131. I would not expect such a change to the language reference, if that's what's needed, to require a PEP. However, discussion on discuss.python.org may be warranted.

Regardless, it would be good to have feedback from other import and encoding experts before we go much farther down any particular path, even if the actual patch is relatively small. CC @brettcannon @ncoghlan @warsaw @methane @vstinner

From a usability perspective it's probably better to normalise names __import__ and document this in the importlib documentation for finders.

Note that the bytecode for an import statement calls __import__() only if it is user-defined. Otherwise, when __import__() is the default, the bytecode calls PyImport_ImportModuleLevelObject() directly. 1 This does present an opportunity to normalize in __import__() without double-normalizing for import statements. Either way, though, whatever we document should clearly indicate which builtin/stdlib functions normalize and which don't.

Ideally the import machinery would take care of it in every case and no one would have to think about it, but I'm not sure how feasible that would be. For all its other benefits, the effect of PEP 3131 really is a pain here, isn't it? 😝 Then again, I suppose that the import machinery still not having been fixed after 16 years implies that the problematic corner cases aren't all that common.

Footnotes

  1. This is an optimization. IIRC, there was a period of time where we always just called __import__(). Also, for a little while the default implementation of __import__() was importlib.__import__(). Now importlib.__import__() is never involved (in CPython at least) unless called directly by a user.

@ericsnowcurrently
Copy link
Member

FWIW, one solution that crossed my mind is to modify the parser, since that is where the PEP 3131 normalization takes place, AKA the root of the problem. The parser could preserve the original unnormalized identifier and we'd use it where appropriate. For instance, the import statement bytecode could use the original name, while the normalized identifier would be used as the bound name. The discrepancy between the bound name and the sys.modules key (and module __name__) might be a problem though.

Relatedly, it may be worth adding an optional arg to __import__() for the unnormalized module name. If not provided, it would default to calling unicodedata.normalize('NFKC, name). If the parser preserved the original name then that could be passed as the extra arg. Hopefully we don't need to add a new arg to import()` though.

@ronaldoussoren
Copy link
Contributor Author

whatever we choose should be documented.

FWIW, the eventual outcome is a direct consequence of (unintentional?) side effect of PEP 3131. I would not expect such a change to the language reference, if that's what's needed, to require a PEP. However, discussion on discuss.python.org may be warranted.

The language reference already says all identifiers are normalised to NFKC, see https://docs.python.org/3/reference/lexical_analysis.html#identifiers. The PEP, and more importantly, the import system specification (PEP 451 and its predecessors) failed to fully think through the consequences of this when loading modules from the filesystem. That's not really anyones fault, Unicode and in particular Unicode normalisation are hard.

Regardless, it would be good to have feedback from other import and encoding experts before we go much farther down any particular path, even if the actual patch is relatively small. CC @brettcannon @ncoghlan @warsaw @methane @vstinner

Yes please ;-). I intend to work on fixes the issues with this PR on Linux (and that will likely fix the issues on Windows as well), but beyond that I'd like some advise on the likelihood that the change will be accepted.

Ideally the import machinery would take care of it in every case and no one would have to think about it, but I'm not sure how feasible that would be. For all its other benefits, the effect of PEP 3131 really is a pain here, isn't it? 😝 Then again, I suppose that the import machinery still not having been fixed after 16 years implies that the problematic corner cases aren't all that common.

I guess that not many people write modules with non-ascii names, and those that do quickly learn that this doesn't always work. The forks that want to use non-ascii names are also more likely to be less familiar with English.

P.S. I do have to warn that I'll likely be less responsive after new year. I had a lot time of from work in December, but expect that I'll be distracted by $WORK in January. I'll continue working on this, but slower.

The finder cache is now a mapping from (normalized) names
to actual filesystem names. With some luck this fixes test
failures on Linux and Windows.
@ronaldoussoren
Copy link
Contributor Author

ronaldoussoren commented Dec 29, 2023

There are four ways that names get into import lib:

  1. import statements without overriding __import__: The parser will normalise identifiers, the finder will see a normalised name
  2. import statement with overriding __import__: same as 1 or 3 depending on whether or note __import__ changes the name to be imported
  3. Direct calling of __import__ or import lib.import_module: All bets are off, but name likely is not normalised when using Latin script with accents because the default way to enter these on common platform is not NFKD normalised. The two functions mentioned can normalise the name with little to no impact on user code.
  4. Direct invocation of import lib finders: This either requires documenting that the name must be normalised, or changes to all finders (including those outside of the stdlib)

It's probably best to change the implementation of __import__ and import_module to normalise the name and document that the name passed to finders must be normalised.

The impact of the former should be limited because as @ericsnowcurrently notes these functions aren't used in normal import code paths.

Requiring that names passed to finders must be normalized is technically a backward incompatible change, although in practice finders already mostly see normalised names. Normalising names in finders can have a performance impact due to unnecessary normalising of names, although it is unclear to me how significant that impact would be given that the number of non-ASCII module names is likely very low.

UPDATE: In the current version of the PR __import__ and import_module normalise and the documentation for loaders was updated to mention that the names must be normalized by the caller.

ronaldoussoren and others added 4 commits December 29, 2023 12:57
Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
1. Add a NEWS entry

2. __import__ and find_module will now normalize their
   arguments as needed

3. Update the importlib documentation to note that the
   name passed to finders must be normalized.
@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
is_namespace = _path_isdir(base_path)
# Check for a file w/ a proper suffix exists.
for suffix, loader_class in self._loaders:
# XXX: Why is ValueError caught here?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change worries me a little because I don't understand why the original code catches ValueError and then bails out.

Removing this like I did here does not result in test failures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If git blame isn't helpful then it can probably be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnicodeEncodeError if the name is not compatible with the filesystem encoding, ValueError for embedded null characters?

Comment on lines +1 to +3
It is now possible to import modules from the filesystem regardless of how
the name is normalized in the filesystem. This in particular affects
importing modules with a name that contains accented latin characters.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably also needs a What's New entry, lets worry about that when the PR is closer to acceptance to avoid creating merge conflicts.

@ronaldoussoren
Copy link
Contributor Author

ronaldoussoren commented Dec 29, 2023

Did I mention Unicode is hard?

In NFKC normalization there are some non-ASCII code points that normalise to code points that are ASCII, such as "ff" (the ligature for "ff"). This leads to some "interesting" behaviour with my PR.

Given a file named "fluffy.py" (with ligature) you currently get an ImportError when trying to import "fluffy". With my patch the import will not raise an ImportError when there has first been an import for some non-ASCII name that was presented to the finder for the directory containing "fluffy.py".

import random
if random.choice([True, False]):
    try:
        import café
    except ImportError:
        pass

import fluffy

This will work 50% of the time and raise ImportError the other 50%. This will AFAIK only affect names in latin script, other languages won't normalise to ASCII strings from my limited understanding of normalization rules.

Another way this PR can lead to unexpected behaviour changes is to have two source files with different normalisations, one that happens to be NFKC normalized and one not (for example the "Mädchen.py" files I used in test_import.py). Currently the NFKC normalized file is picked up, with this PR the other might get picked up depending on the order of in which listdir iterates over the directory.

In practice I don't expect these to be a problem:

  • The easiest way to enter accented characters in NFKC normalized values (that is len("ü") == 1) at least on macOS, which means you'll have to go the extra mile to get a file name that isn't normalized NFKC on Linux
  • MacOS normalises names to NFD in the filesystem, and won't allow creating two files whose names only differs by their normalization (tested on an APFS filesystem, the older HFS+ should be the same).

What I don't know if there are realistic failure modes for the second scenario with non-latin scripts.

@ericsnowcurrently
Copy link
Member

There are four ways that names get into import lib:

Yeah, I think you've captured that well.

Ideally, the mechanism for auto-normalization would baked in to the core import machinery itself, such that we didn't have to modify specific user-facing functions or require any extra attention from authors/maintainers of import hooks. However, I'm not sure that is something we could do and without impacting the performance of import statements. (I could be wrong though.)

  1. import statements without overriding __import__: The parser will normalise identifiers, the finder will see a normalised name
  2. import statement with overriding __import__: same as 1 or 3 depending on whether or note __import__ changes the name to be imported

The function is still called with the parser-normalized name though, right? So in this case the non-default __import__() still wouldn't need to NKFC-normalize the name. However, it would need to accommodate any adjustments we made if the original, unnormalized name were preserved by the parser and passed through. Then again, a non-default __import__() would still need to normalize the name if called directly (which is definitely messy).

  1. Direct calling of __import__ or import lib.import_module: All bets are off, but name likely is not normalised when using Latin script with accents because the default way to enter these on common platform is not NFKD normalised. The two functions mentioned can normalise the name with little to no impact on user code.

FWIW, importlib.__import__() should probably be lumped in with importlib.import_module(). Almost any change to one should match the other. The only caveat is that importlib.__import__() may be set as builtins.__import__(). It might even be the default one for some Python implementations. That was certainly the design goal (and was what we did in CPython for a release or two, IIRC).

importlib.util.find_spec() should be treated the same as importlib.import_module() as well, but without the caveat.

  1. Direct invocation of import lib finders: This either requires documenting that the name must be normalised, or changes to all finders (including those outside of the stdlib)

This applies to user-defined finders, both metapath finders (sys.meta_path) and path finders (via sys.path_hooks & sys.path_importer_cache()).

This also brings up perhaps a fifth group: user-defined finders. Or is that what you meant with (4)?

It's probably best to change the implementation of __import__ and import_module to normalise the name and document that the name passed to finders must be normalised.

That seems reasonable. I don't expect that they are used so widely or intensely that it would have a performance impact.

When modifying these functions, we do have to be careful about how/where we do it. We don't want to impact PyImport_ImportModuleLevelObject(), which is what the import statement uses directly by default. PyImport_ImportModuleLevelObject(), in turn, calls importlib._bootstrap._find_and_load(), so we need to be careful with that.

For __import__() we'd want to make any changes strictly in builtin___import___impl() (Python/bltinmodule.c). For importlib.import_module() (and thus importlib.util.find_spec() and maybe importlib.__import__()) we could modify each function directly (or perhaps in importlib._bootstrap._gcd_import() but probably not). Again, we just have to remember to avoid importlib._bootstrap._find_and_load(). (For importlib.util.find_spec() we should also avoid touching importlib._bootstrap._find_spec().)

@ericsnowcurrently
Copy link
Member

Let's step back for a moment to make sure we're thinking about the right things. IIUC, at the end of the day, the key objective here boils down to FileFinder correctly handling the following situations:

  • module name is non-ascii, NFKC-normal (via PEP 3131)
    • filesystem NFKC-normalizes filenames (i.e. the _path_cache situation you've addressed in this PR)
    • filesystem does not NFKC-normalize filenames and the corresponding filename is not normalized
  • module name is non-ascii but not NFKC-normal (via importlib.import_module(), etc.)
    • filesystem NFKC-normalizes filenames
    • filesystem does not NFKC-normalize filenames but the corresponding filename is normalized

Is that right?

This looks, at least partially, like the way we have to deal with case-insensitive filesystems. We should make sure we take some cues from how that is handled by FileFinder.

(It's also notable that we accommodate the matter of case-sensitivity in FileFinder rather than solving it in the parser/compiler or in the broader import machinery. Thus there's precedent for handling this sort of problem generally, i.e. user-defined finders each have to make the same accommodations or rely on FileFinder.)

""" Normalize 'name' in NKFC form """
global _unicodedata_normalize
if _unicodedata_normalize is None:
from unicodedata import _normalize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from unicodedata import _normalize
from unicodedata import _normalize as _unicodedata_normalize

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get an import error for _normalize but not normalize in the REPL. Is that expected?

Comment on lines +1622 to +1623
if not tail_module.isascii() and not self._cache_is_normalized:
self._normalize_cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cache always keeps the raw name then can we drop _cache_is_normalized?

@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
is_namespace = _path_isdir(base_path)
# Check for a file w/ a proper suffix exists.
for suffix, loader_class in self._loaders:
# XXX: Why is ValueError caught here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If git blame isn't helpful then it can probably be removed.

@brettcannon
Copy link
Member

This looks, at least partially, like the way we have to deal with case-insensitive filesystems. We should make sure we take some cues from how that is handled by FileFinder.

(It's also notable that we accommodate the matter of case-sensitivity in FileFinder rather than solving it in the parser/compiler or in the broader import machinery. Thus there's precedent for handling this sort of problem generally, i.e. user-defined finders each have to make the same accommodations or rely on FileFinder.)

I think I agree w/ Eric that this feels like a problem similar to case-sensitivity. As long as the comparison needs are just at the file system boundary and not e.g. sys.modules then I think we have prior art w/ case-sensitivity.

@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
is_namespace = _path_isdir(base_path)
# Check for a file w/ a proper suffix exists.
for suffix, loader_class in self._loaders:
# XXX: Why is ValueError caught here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnicodeEncodeError if the name is not compatible with the filesystem encoding, ValueError for embedded null characters?

serhiy-storchaka and others added 2 commits May 21, 2025 14:32
@brettcannon
Copy link
Member

@ronaldoussoren did you still want to work on this or drop it? It still has a DO-NOT-MERGE label.

@ronaldoussoren
Copy link
Contributor Author

@ronaldoussoren did you still want to work on this or drop it? It still has a DO-NOT-MERGE label.

I completely forgot I worked on this :-(. I might have some time to finish this in the next month or so, but cannot promise anything.

@brettcannon
Copy link
Member

brettcannon commented Aug 11, 2025

@ronaldoussoren no pressure from me! I just wasn't sure if you wanted to keep the PR open or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants