gh-84013: normalize directory contents during import #113447

ronaldoussoren · 2023-12-24T09:29:14Z

Identifiers in Python code are normalized to NFKC and there is no guarantee that names in the file system are normalized.

Teach the FileFinder class about normalization, but only do this when the to-be-imported name contains non-ASCII characters.

Issue: Modules with decomposable characters in module name not found on macOS #84013

Identifiers in Python code are normalized to NFKC and there is no guarantee that names in the file system are normalized. Teach the FileFinder class about normalization, but only do this when the to-be-imported name contains non-ASCII characters.

ronaldoussoren · 2023-12-24T09:30:15Z

The PR is a draft primarily because the tests are incomplete:

No tests using PYTHONCASEOK
No new unit tests for the FileFinder class

ronaldoussoren · 2023-12-27T16:47:52Z

For some reason this PR doesn't work on any system but macOS, at least as far as the tests are concerned. I'll see if I can find the problem for Linux, I don't have a system where I can develop on Windows.

This has been resolved by keeping the unnormalised name around because this is needed to match filesystem behaviour. Worked on macOS because the kernel will normalise names there.

ericsnowcurrently

I have some concerns.

Also, should we document that finders (both metapath-based and path-based) must deal with un-normalized names, or can we do that for all finders internally? Instead, should we state in the language spec that all module names passing through the import system are guaranteed to be NFKC-normalized?

Lib/importlib/_bootstrap_external.py

Lib/test/test_import/__init__.py

Lib/importlib/_bootstrap_external.py

ericsnowcurrently · 2023-12-28T00:43:11Z

Lib/importlib/_bootstrap_external.py

+        if not tail_module.isascii() and not self._cache_is_normalized:
+            self._normalize_cache()


For the sake of __import__() and importlib.import_module() we will probably need to either normalize cache_module or preserve the original name in the cache. The former sounds expensive so the latter might be the way to go.

Regardless, we should probably have some test cases that use one of those functions rather than syntax.

I changed the code to keep the unnormalised name around, we need the unnormalised name to access the filesystem and need the normalised name to compare to tail_module (which will generally be normalised by the compiler).

I agree that there should be more test cases, I started out with just the minimal test from the issue to see if I can get this to work without bootstrapping issues and get some feedback. I'll definitely add more tests when this approach is seen as acceptable by @brettcannon (as the importlib expert).

If the cache always keeps the raw name then can we drop _cache_is_normalized?

Lib/importlib/_bootstrap_external.py

ericsnowcurrently · 2023-12-28T01:22:39Z

Also, I meant to say that I think you're on the right track.

ronaldoussoren · 2023-12-28T08:36:43Z

I have some concerns.

Also, should we document that finders (both metapath-based and path-based) must deal with un-normalized names, or can we do that for all finders internally? Instead, should we state in the language spec that all module names passing through the import system are guaranteed to be NFKC-normalized?

Good question, whatever we choose should be documented.

The names passed to __import__ may not be normalised if called by a user, so either the implementation of that function needs to normalise its arguments, or finders would have to do this. From a usability perspective it's probably better to normalise names __import__ and document this in the importlib documentation for finders.

ericsnowcurrently · 2023-12-28T17:27:29Z

whatever we choose should be documented.

FWIW, the eventual outcome is a direct consequence of (unintentional?) side effect of PEP 3131. I would not expect such a change to the language reference, if that's what's needed, to require a PEP. However, discussion on discuss.python.org may be warranted.

Regardless, it would be good to have feedback from other import and encoding experts before we go much farther down any particular path, even if the actual patch is relatively small. CC @brettcannon @ncoghlan @warsaw @methane @vstinner

From a usability perspective it's probably better to normalise names __import__ and document this in the importlib documentation for finders.

Note that the bytecode for an import statement calls __import__() only if it is user-defined. Otherwise, when __import__() is the default, the bytecode calls PyImport_ImportModuleLevelObject() directly. ¹ This does present an opportunity to normalize in __import__() without double-normalizing for import statements. Either way, though, whatever we document should clearly indicate which builtin/stdlib functions normalize and which don't.

Ideally the import machinery would take care of it in every case and no one would have to think about it, but I'm not sure how feasible that would be. For all its other benefits, the effect of PEP 3131 really is a pain here, isn't it? 😝 Then again, I suppose that the import machinery still not having been fixed after 16 years implies that the problematic corner cases aren't all that common.

This is an optimization. IIRC, there was a period of time where we always just called __import__(). Also, for a little while the default implementation of __import__() was importlib.__import__(). Now importlib.__import__() is never involved (in CPython at least) unless called directly by a user. ↩

ericsnowcurrently · 2023-12-28T17:38:37Z

FWIW, one solution that crossed my mind is to modify the parser, since that is where the PEP 3131 normalization takes place, AKA the root of the problem. The parser could preserve the original unnormalized identifier and we'd use it where appropriate. For instance, the import statement bytecode could use the original name, while the normalized identifier would be used as the bound name. The discrepancy between the bound name and the sys.modules key (and module __name__) might be a problem though.

Relatedly, it may be worth adding an optional arg to __import__() for the unnormalized module name. If not provided, it would default to calling unicodedata.normalize('NFKC, name). If the parser preserved the original name then that could be passed as the extra arg. Hopefully we don't need to add a new arg to import()` though.

ronaldoussoren · 2023-12-28T19:35:00Z

whatever we choose should be documented.

FWIW, the eventual outcome is a direct consequence of (unintentional?) side effect of PEP 3131. I would not expect such a change to the language reference, if that's what's needed, to require a PEP. However, discussion on discuss.python.org may be warranted.

The language reference already says all identifiers are normalised to NFKC, see https://docs.python.org/3/reference/lexical_analysis.html#identifiers. The PEP, and more importantly, the import system specification (PEP 451 and its predecessors) failed to fully think through the consequences of this when loading modules from the filesystem. That's not really anyones fault, Unicode and in particular Unicode normalisation are hard.

Regardless, it would be good to have feedback from other import and encoding experts before we go much farther down any particular path, even if the actual patch is relatively small. CC @brettcannon @ncoghlan @warsaw @methane @vstinner

Yes please ;-). I intend to work on fixes the issues with this PR on Linux (and that will likely fix the issues on Windows as well), but beyond that I'd like some advise on the likelihood that the change will be accepted.

Ideally the import machinery would take care of it in every case and no one would have to think about it, but I'm not sure how feasible that would be. For all its other benefits, the effect of PEP 3131 really is a pain here, isn't it? 😝 Then again, I suppose that the import machinery still not having been fixed after 16 years implies that the problematic corner cases aren't all that common.

I guess that not many people write modules with non-ascii names, and those that do quickly learn that this doesn't always work. The forks that want to use non-ascii names are also more likely to be less familiar with English.

P.S. I do have to warn that I'll likely be less responsive after new year. I had a lot time of from work in December, but expect that I'll be distracted by $WORK in January. I'll continue working on this, but slower.

The finder cache is now a mapping from (normalized) names to actual filesystem names. With some luck this fixes test failures on Linux and Windows.

ronaldoussoren · 2023-12-29T09:36:41Z

There are four ways that names get into import lib:

import statements without overriding __import__: The parser will normalise identifiers, the finder will see a normalised name
import statement with overriding __import__: same as 1 or 3 depending on whether or note __import__ changes the name to be imported
Direct calling of __import__ or import lib.import_module: All bets are off, but name likely is not normalised when using Latin script with accents because the default way to enter these on common platform is not NFKD normalised. The two functions mentioned can normalise the name with little to no impact on user code.
Direct invocation of import lib finders: This either requires documenting that the name must be normalised, or changes to all finders (including those outside of the stdlib)

It's probably best to change the implementation of __import__ and import_module to normalise the name and document that the name passed to finders must be normalised.

The impact of the former should be limited because as @ericsnowcurrently notes these functions aren't used in normal import code paths.

Requiring that names passed to finders must be normalized is technically a backward incompatible change, although in practice finders already mostly see normalised names. Normalising names in finders can have a performance impact due to unnecessary normalising of names, although it is unclear to me how significant that impact would be given that the number of non-ASCII module names is likely very low.

UPDATE: In the current version of the PR __import__ and import_module normalise and the documentation for loaders was updated to mention that the names must be normalized by the caller.

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>

1. Add a NEWS entry 2. __import__ and find_module will now normalize their arguments as needed 3. Update the importlib documentation to note that the name passed to finders must be normalized.

ronaldoussoren · 2023-12-29T12:15:44Z

Lib/importlib/_bootstrap_external.py

@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
                is_namespace = _path_isdir(base_path)
        # Check for a file w/ a proper suffix exists.
        for suffix, loader_class in self._loaders:
+            # XXX: Why is ValueError caught here?


This change worries me a little because I don't understand why the original code catches ValueError and then bails out.

Removing this like I did here does not result in test failures.

If git blame isn't helpful then it can probably be removed.

UnicodeEncodeError if the name is not compatible with the filesystem encoding, ValueError for embedded null characters?

ronaldoussoren · 2023-12-29T12:17:05Z

Misc/NEWS.d/next/Library/2023-12-29-13-09-44.gh-issue-84013.rUaxkr.rst

+It is now possible to import modules from the filesystem regardless of how
+the name is normalized in the filesystem. This in particular affects
+importing modules with a name that contains accented latin characters.


This probably also needs a What's New entry, lets worry about that when the PR is closer to acceptance to avoid creating merge conflicts.

ronaldoussoren · 2023-12-29T12:52:21Z

Did I mention Unicode is hard?

In NFKC normalization there are some non-ASCII code points that normalise to code points that are ASCII, such as "ﬀ" (the ligature for "ff"). This leads to some "interesting" behaviour with my PR.

Given a file named "fluﬀy.py" (with ligature) you currently get an ImportError when trying to import "fluffy". With my patch the import will not raise an ImportError when there has first been an import for some non-ASCII name that was presented to the finder for the directory containing "fluﬀy.py".

import random
if random.choice([True, False]):
    try:
        import café
    except ImportError:
        pass

import fluffy

This will work 50% of the time and raise ImportError the other 50%. This will AFAIK only affect names in latin script, other languages won't normalise to ASCII strings from my limited understanding of normalization rules.

Another way this PR can lead to unexpected behaviour changes is to have two source files with different normalisations, one that happens to be NFKC normalized and one not (for example the "Mädchen.py" files I used in test_import.py). Currently the NFKC normalized file is picked up, with this PR the other might get picked up depending on the order of in which listdir iterates over the directory.

In practice I don't expect these to be a problem:

The easiest way to enter accented characters in NFKC normalized values (that is len("ü") == 1) at least on macOS, which means you'll have to go the extra mile to get a file name that isn't normalized NFKC on Linux
MacOS normalises names to NFD in the filesystem, and won't allow creating two files whose names only differs by their normalization (tested on an APFS filesystem, the older HFS+ should be the same).

What I don't know if there are realistic failure modes for the second scenario with non-latin scripts.

ericsnowcurrently · 2023-12-29T17:12:09Z

There are four ways that names get into import lib:

Yeah, I think you've captured that well.

Ideally, the mechanism for auto-normalization would baked in to the core import machinery itself, such that we didn't have to modify specific user-facing functions or require any extra attention from authors/maintainers of import hooks. However, I'm not sure that is something we could do and without impacting the performance of import statements. (I could be wrong though.)

import statements without overriding __import__: The parser will normalise identifiers, the finder will see a normalised name

import statement with overriding __import__: same as 1 or 3 depending on whether or note __import__ changes the name to be imported

The function is still called with the parser-normalized name though, right? So in this case the non-default __import__() still wouldn't need to NKFC-normalize the name. However, it would need to accommodate any adjustments we made if the original, unnormalized name were preserved by the parser and passed through. Then again, a non-default __import__() would still need to normalize the name if called directly (which is definitely messy).

Direct calling of __import__ or import lib.import_module: All bets are off, but name likely is not normalised when using Latin script with accents because the default way to enter these on common platform is not NFKD normalised. The two functions mentioned can normalise the name with little to no impact on user code.

FWIW, importlib.__import__() should probably be lumped in with importlib.import_module(). Almost any change to one should match the other. The only caveat is that importlib.__import__() may be set as builtins.__import__(). It might even be the default one for some Python implementations. That was certainly the design goal (and was what we did in CPython for a release or two, IIRC).

importlib.util.find_spec() should be treated the same as importlib.import_module() as well, but without the caveat.

Direct invocation of import lib finders: This either requires documenting that the name must be normalised, or changes to all finders (including those outside of the stdlib)

This applies to user-defined finders, both metapath finders (sys.meta_path) and path finders (via sys.path_hooks & sys.path_importer_cache()).

This also brings up perhaps a fifth group: user-defined finders. Or is that what you meant with (4)?

It's probably best to change the implementation of __import__ and import_module to normalise the name and document that the name passed to finders must be normalised.

That seems reasonable. I don't expect that they are used so widely or intensely that it would have a performance impact.

When modifying these functions, we do have to be careful about how/where we do it. We don't want to impact PyImport_ImportModuleLevelObject(), which is what the import statement uses directly by default. PyImport_ImportModuleLevelObject(), in turn, calls importlib._bootstrap._find_and_load(), so we need to be careful with that.

For __import__() we'd want to make any changes strictly in builtin___import___impl() (Python/bltinmodule.c). For importlib.import_module() (and thus importlib.util.find_spec() and maybe importlib.__import__()) we could modify each function directly (or perhaps in importlib._bootstrap._gcd_import() but probably not). Again, we just have to remember to avoid importlib._bootstrap._find_and_load(). (For importlib.util.find_spec() we should also avoid touching importlib._bootstrap._find_spec().)

ericsnowcurrently · 2023-12-29T17:29:54Z

Let's step back for a moment to make sure we're thinking about the right things. IIUC, at the end of the day, the key objective here boils down to FileFinder correctly handling the following situations:

module name is non-ascii, NFKC-normal (via PEP 3131)
- filesystem NFKC-normalizes filenames (i.e. the _path_cache situation you've addressed in this PR)
- filesystem does not NFKC-normalize filenames and the corresponding filename is not normalized
module name is non-ascii but not NFKC-normal (via importlib.import_module(), etc.)
- filesystem NFKC-normalizes filenames
- filesystem does not NFKC-normalize filenames but the corresponding filename is normalized

Is that right?

This looks, at least partially, like the way we have to deal with case-insensitive filesystems. We should make sure we take some cues from how that is handled by FileFinder.

(It's also notable that we accommodate the matter of case-sensitivity in FileFinder rather than solving it in the parser/compiler or in the broader import machinery. Thus there's precedent for handling this sort of problem generally, i.e. user-defined finders each have to make the same accommodations or rely on FileFinder.)

Lib/importlib/_bootstrap.py

brettcannon · 2024-02-23T23:00:16Z

Lib/importlib/_bootstrap.py

+    """ Normalize 'name' in NKFC form """
+    global _unicodedata_normalize
+    if _unicodedata_normalize is None:
+        from unicodedata import _normalize


Suggested change

from unicodedata import _normalize

from unicodedata import _normalize as _unicodedata_normalize

I get an import error for _normalize but not normalize in the REPL. Is that expected?

Lib/importlib/_bootstrap.py

brettcannon · 2024-02-23T23:11:31Z

Lib/importlib/_bootstrap_external.py

+        if not tail_module.isascii() and not self._cache_is_normalized:
+            self._normalize_cache()


If the cache always keeps the raw name then can we drop _cache_is_normalized?

brettcannon · 2024-02-23T23:14:16Z

Lib/importlib/_bootstrap_external.py

@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
                is_namespace = _path_isdir(base_path)
        # Check for a file w/ a proper suffix exists.
        for suffix, loader_class in self._loaders:
+            # XXX: Why is ValueError caught here?


If git blame isn't helpful then it can probably be removed.

Lib/importlib/_bootstrap_external.py

brettcannon · 2024-02-23T23:20:59Z

This looks, at least partially, like the way we have to deal with case-insensitive filesystems. We should make sure we take some cues from how that is handled by FileFinder.

(It's also notable that we accommodate the matter of case-sensitivity in FileFinder rather than solving it in the parser/compiler or in the broader import machinery. Thus there's precedent for handling this sort of problem generally, i.e. user-defined finders each have to make the same accommodations or rely on FileFinder.)

I think I agree w/ Eric that this feels like a problem similar to case-sensitivity. As long as the comparison needs are just at the file system boundary and not e.g. sys.modules then I think we have prior art w/ case-sensitivity.

serhiy-storchaka · 2025-05-21T11:27:31Z

Lib/importlib/_bootstrap_external.py

@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
                is_namespace = _path_isdir(base_path)
        # Check for a file w/ a proper suffix exists.
        for suffix, loader_class in self._loaders:
+            # XXX: Why is ValueError caught here?


UnicodeEncodeError if the name is not compatible with the filesystem encoding, ValueError for embedded null characters?

Lib/importlib/_bootstrap.py

Co-authored-by: Brett Cannon <brett@python.org>

brettcannon · 2025-08-08T22:21:36Z

@ronaldoussoren did you still want to work on this or drop it? It still has a DO-NOT-MERGE label.

ronaldoussoren · 2025-08-09T18:22:31Z

@ronaldoussoren did you still want to work on this or drop it? It still has a DO-NOT-MERGE label.

I completely forgot I worked on this :-(. I might have some time to finish this in the next month or so, but cannot promise anything.

brettcannon · 2025-08-11T22:54:30Z

@ronaldoussoren no pressure from me! I just wasn't sure if you wanted to keep the PR open or not.

ronaldoussoren added DO-NOT-MERGE topic-importlib labels Dec 24, 2023

bedevere-app bot mentioned this pull request Dec 24, 2023

Modules with decomposable characters in module name not found on macOS #84013

Open

Slightly cleaner code

7f4b01b

ronaldoussoren marked this pull request as ready for review December 24, 2023 09:43

ronaldoussoren requested review from brettcannon, ericsnowcurrently, ncoghlan and warsaw as code owners December 24, 2023 09:43

bedevere-app bot added the awaiting core review label Dec 24, 2023

ronaldoussoren added 2 commits December 25, 2023 10:57

Argh... Previous cleanup was a bit too quick...

6ea9497

Actually fix the "cleaned up" version

c216e08

ericsnowcurrently reviewed Dec 28, 2023

View reviewed changes

Change of implementation

3f16ae1

The finder cache is now a mapping from (normalized) names to actual filesystem names. With some luck this fixes test failures on Linux and Windows.

ronaldoussoren and others added 4 commits December 29, 2023 12:57

Revert accidental reordering

80b6eec

Update Lib/test/test_import/__init__.py

b9e8927

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>

Address review comment

7d2d2bf

Some changes

18d99ae

1. Add a NEWS entry 2. __import__ and find_module will now normalize their arguments as needed 3. Update the importlib documentation to note that the name passed to finders must be normalized.

ronaldoussoren commented Dec 29, 2023

View reviewed changes

brettcannon reviewed Feb 23, 2024

View reviewed changes

serhiy-storchaka reviewed May 21, 2025

View reviewed changes

serhiy-storchaka and others added 2 commits May 21, 2025 14:32

Apply suggestions from code review

985e50a

Co-authored-by: Brett Cannon <brett@python.org>

Merge branch 'main' into pythongh-84013

c13edc8

		if not tail_module.isascii() and not self._cache_is_normalized:
		self._normalize_cache()

	from unicodedata import _normalize
	from unicodedata import _normalize as _unicodedata_normalize

Uh oh!

gh-84013: normalize directory contents during import #113447

Are you sure you want to change the base?

gh-84013: normalize directory contents during import #113447

Conversation

ronaldoussoren commented Dec 24, 2023 • edited by bedevere-app bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ronaldoussoren commented Dec 24, 2023

Uh oh!

ronaldoussoren commented Dec 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ericsnowcurrently left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ericsnowcurrently commented Dec 28, 2023

Uh oh!

ronaldoussoren commented Dec 28, 2023

Uh oh!

ericsnowcurrently commented Dec 28, 2023

Footnotes

Uh oh!

ericsnowcurrently commented Dec 28, 2023

Uh oh!

ronaldoussoren commented Dec 28, 2023

Uh oh!

ronaldoussoren commented Dec 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ronaldoussoren commented Dec 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ericsnowcurrently commented Dec 29, 2023

Uh oh!

ericsnowcurrently commented Dec 29, 2023

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

brettcannon commented Feb 23, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

brettcannon commented Aug 8, 2025

Uh oh!

ronaldoussoren commented Aug 9, 2025

Uh oh!

ronaldoussoren commented Dec 24, 2023 •

edited by bedevere-app bot

Loading

ronaldoussoren commented Dec 27, 2023 •

edited

Loading

ronaldoussoren commented Dec 29, 2023 •

edited

Loading

ronaldoussoren commented Dec 29, 2023 •

edited

Loading

brettcannon commented Aug 11, 2025 •

edited

Loading