Skip to content

Regression on initialization and new non-deterministic behavior issue in mimetypes #93417

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
pombredanne opened this issue Jun 1, 2022 · 2 comments
Labels
3.10 only security fixes 3.11 only security fixes 3.12 only security fixes topic-email type-bug An unexpected behavior, bug, or error

Comments

@pombredanne
Copy link
Contributor

pombredanne commented Jun 1, 2022

Bug report

Mimetypes should NOT include defaults when initialized with file(s)

This is a follow up from https://bugs.python.org/issue4963 that led to this merged PR by @davidkhess #3062

I reported the issue in https://bugs.python.org/issue4963#msg384730 :

The changes introduced by this ticket in 9fc720e#r45794801 are problematic.

I discovered this from having tests failing when testing on Python 3.7 and up

The bug is that calling mimetypes.init(files) will NOT use my files, but instead use both my files and knownfiles.
This was not the case before as knownfiles would be ignored as expected when I provide my own files list.

This is a breaking API change IMHO and introduces a buggy instability : even if I want to ignore knownfiles by providing my list of of files, knownfiles will always be added and this results in erratic and buggy behaviour as the content of "knownfiles" is completely random based on the OS version and else.

The code I am using is here https://github.com/nexB/typecode/blob/ba07c04d23441d3469dc5de911376d408514ebd8/src/typecode/contenttype.py#L308

I think we should reopen to fix (or create a new ticket)

Actually this is problematic on multiples counts:

  1. the behaviour changes and this is a regression
  2. even if that new buggy behaviour was the one to use, it should not give preference to knownfiles over init-provided files, but at least take the provided files first and knownfiles second.

See also this thread 9fc720e#r45794801

@pombredanne
pombredanne on Jan 9, 2021 Contributor

This introduces a buggy unstability starting with Python 3.7: even if I want to ignore knownfiles by providing my list of of files, knownfiles will always be added and this results in erratic and buggy behaviour as the content of "knownfiles" is completely random based on the OS version and else.

@adeadman
adeadman 2 hours ago

@davidkhess this has resulted in an undocumented change to the behaviour of calling mimetypes.init(files=["mimetype_list"]) as now the defaults are included where they weren't before Python 3.7.

I'm not sure how this can be fixed without breaking compatibility or principle of least surprise somewhere - perhaps by adding a new kwarg to init() that ignores knownfiles?

@pombredanne
pombredanne 1 hour ago Contributor

@adeadman re:

I'm not sure how this can be fixed without breaking compatibility or principle of least surprise somewhere - perhaps by adding a new kwarg to init() that ignores knownfiles?

IMHO the current un-documented behaviour is a bug so fixing this would not be an API breakage.
Also the changelog is lying:

Fixed non-deterministic behavior related to mimetypes extension mapping and module reinitialization.

The behaviour used to be mostly deterministic and is now non-deterministic.

And @adeadman also posted in #3062 (comment)

This change has resulted in a difference in behaviour from previous versions of Python in that even when specifying files as a parameter to init(), the known files are loaded, which is not the case in previous versions. This causes problems when an application wishes to use the mimetypes database with an internal-only mapping of mimetypes, to work around inconsistencies with mimetypes in the known files list (which can vary between systems).

Unfortunately this change wasn't documented in the changelog either, nor in the docstring comments, so it can only be found by examining the commit history for the code in cpython.

Would it be appropriate to add another kwarg to init() indicating that knownfiles are to be ignored/excluded from the DB initialization?

@davidkhess
Copy link
Contributor

Thanks for pinging me. I'll share what I know and can remember.

The original bug regarded the non-deterministic behavior of the guess_extension method/function. It would return different answers upon different executions of the python interpreter and import of the mimetypes module. I suggest reading through all of https://bugs.python.org/issue4963 as it is very informative as to the discussion and approaches to a solution that were kicked around.

The change you've fingered is related to this comment by me in that issue:

  1. I modified the init() function to check the files argument as r.david.murray suggested. If it is supplied, then the existing database is used and the files are added to it. If it is not supplied, then the module reinitializes from scratch. I'll update the documentation to reflect this if the commit passes muster.

That comment is referring to R. David Murray's comment in that issue: https://bugs.python.org/issue4963#msg108934

It's worth reading that comment carefully and in its entirety to understand how the change to init() tried to walk a very fine line.

I think this current issue is a result of that analysis and my fix related to it missing the use case where files is definitely not meant to be additive - it should be the only thing loaded.

In terms of a workaround, I think the easiest thing that should work is:

    import mimetypes
    mimetypes.knownfiles = []
    mimetypes.init(files=...)

Perhaps a longer-term fix would be to add a boolean argument to init() that controls whether or not knownfiles are used additively.

Just my $0.02.

@adeadman
Copy link

adeadman commented Jun 1, 2022

Thanks for your detailed comment @davidkhess and thanks for opening the report @pombredanne.

From R. David Murray's analysis in the original issue:

It's not so clear what the behavior should be when you pass init one or more files. It is possible, even highly probable, that there is code out there that depends on the fact that doing so is additive.

I'm not clear exactly what was meant by "additive", in this context. However, an examination of the init() code before the change would suggest that this was not always the case, as previously a call to init() passing in a list of files used only the files supplied, and used the knownfiles as a default if no files were specified. A caller would have to explicitly pass the known files in like this: mimetypes.init(files=['my_mimetype_file'] + mimetypes.knownfiles).

I believe his analysis was perhaps referring to calling MimeTypes() a second time would not reinitialise the global db, as the inited flag prevents that from happening, but not a call to mimetypes.init() explicitly.

Regardless of the validity of this point of analysis, the fact remains that the current behaviour of mimetypes.init (since Python 3.7) is to load from the knownfiles as well as the files specified, and so Hyrum's Law would suggest that to put the behavior back as it was will inconvenience someone.

Perhaps as a first step we could update the docstring to clearly mark that the behaviour changed between Python 3.6 and 3.7, so people that must maintain supporting for multiple versions can be aware of the difference.

@AA-Turner AA-Turner added 3.11 only security fixes 3.10 only security fixes topic-email 3.12 only security fixes labels Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.10 only security fixes 3.11 only security fixes 3.12 only security fixes topic-email type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

4 participants