Skip to content

Deprecate support for string named-parameter in hash functions constructors #134978

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
picnixz opened this issue May 31, 2025 · 4 comments
Closed
Assignees
Labels
extension-modules C modules in the Modules dir type-feature A feature request or enhancement

Comments

@picnixz
Copy link
Member

picnixz commented May 31, 2025

Feature or enhancement

This is a follow-up to #134696. After c6e63d9, it's now possible to use the following forms:

  • hashlib.new(name, data)
  • hashlib.new(name, data=...)
  • hashlib.new(name, string=...)

and, taking MD5 as an example, hashlib.md5(data), hashlib.md5(data=...) and hashlib.md5(string=...).

In the docs we only expose the following signatures:

hashlib.new(name, [data, ]*, usedforsecurity=True)
hashlib.md5([data, ]*, usedforsecurity=True)

So, it would make sense to remove support for string=.... Ideally, I want to make it positional-only, but I don't think it's worth the shot and it's too much for a breaking change.

Note that we could also deprecate data itself because originally, PEP-247 and PEP-452 were using string and not data, but this is probably confusing as the PEP says:

Although the parameter is called ‘string’, hashing objects operate on 8-bit data only. Both ‘key’ and ‘string’ must be a bytes-like object (bytes, bytearray…).

I assume that the docs decided to use 'data' to remove this ambiguity as 'data' is more common for bytes-like objects.

cc @gpshead

Linked PRs

@picnixz picnixz self-assigned this May 31, 2025
@picnixz picnixz added type-feature A feature request or enhancement extension-modules C modules in the Modules dir labels May 31, 2025
@gvanrossum
Copy link
Member

Is it worth the bother of using deprecation here? Is there a maintenance cost for us? Deprecations definitely incur maintenance cost for us (twice! once to write/review the deprecation code (6 files modified!), and again in a few years when removing the deprecated part of the API), and for old 3rd party code that still happens to use string=.

I'd honestly leave this one alone unless you can prove that it's a bug magnet.

@gpshead
Copy link
Member

gpshead commented May 31, 2025

the maintenance cost for us of not doing deprecation is the additional complexity in a pile of hashlib code to maintain the dual-named exclusive argument acceptance hacks, including carrying those forward to any new additions.

Deprecation is a way to reduce our burden in the long run by removing the unusual odd-duck argument handling complexity that our long past API inconsistency "oopsies" led to. We already had to jump through hoops to add the complexity to ameliorate those bugs. It'd be nice to not have to retain that in the long run.

I don't think we'll see significant new change friction maintenance burden foisted upon users from this deprecation - they already cannot reliably depend on the name of some hashlib parameters like this given the varying build modes and platforms with multiple different API implementations where a couple of parameter have sometimes unintentionally changed name or positional-only status.

The bug magnet is the complexity behind hashlib. We've got multiple different implementations of various APIs (for good reason) and have routinely unintentionally under-tested the exact pedantic all possible Hyrum's Law nature of those APIs so differences. Leading warts like this to creep in and cause issues. I expect more of that to happen as time goes on the more complex the permutations our API surface remains.

@picnixz
Copy link
Member Author

picnixz commented May 31, 2025

and for old 3rd party code that still happens to use string

the string= is not documented actually, and it's only exposed by directly inspecting the private interface. In 3.11, for instance:

_hashlib.new as EVP_new

    name as name_obj: object
    string as data_obj: object(c_default="NULL") = b''
    *
    usedforsecurity: bool = True

So using _hashlib.new(string=...) worked. But usually people would use hashlib.new:

def __py_new(name, data=b'', **kwargs):

def __hash_new(name, data=b'', **kwargs):

which exposes data instead... I didn't consider this to be a security issue so the fix will only be for 3.13+, but the fix is actually quite "ugly" and I really want to remove this kind of hack.

I'll actually need to amend my NEWS entry because it's a bit wrong. It wasn't possible to do hashlib.new(name, string=b"data") because there would be two values for the data to hash, one given by data=b"" in hashlib.new and one given through string=....

However, depending on the hash functions, some could be called with string=... (MD5, SHA-1 & SHA-2), some had positional without positional-only (SHA-3 functions) and some were positional-only (BLAKE-2).

I previously said that there was some issue hashlib.new but that's not entirely correct. The main issue is as follows:

import hashlib
import _hashlib
import sys
import _md5
import _sha3

hashlib.new("md5", data=b"")  # ok
_hashlib.new("md5", data=b"")  # not ok
_hashlib.new("md5", string=b"")  # but this ok
_md5.md5(string=b"") # and this is ok as well
hashlib.md5(string=b"") # and this one is also ok!
hashlib.md5(data=b"") # but this one is not ok!

hashlib.new("sha3_224", data=b"") # ok
_hashlib.new("sha3_224", data=b"") # not ok
_hashlib.new("sha3_224", string=b"") # ok
hashlib.sha3_224(string=b"")  # ok (if openssl-based)
hashlib.sha3_224(data=b"")  # not ok..

_sha3.sha3_224(string=b"")  # this is not ok!?
# this is STILL not ok even if help(_sha3.sha3_224) shows:
#    sha3_224([data], *, usedforsecurity=True) -> SHA3 object
_sha3.sha3_224(data=b"")  
_sha3.sha3_224(b"") # positional-only but was incorrectly documented as not
_hashlib.openssl_sha3_224(data=b"") # and this one is also NOT ok!???
_hashlib.openssl_sha3_224(string=b"") # but this one is OK??

# if we disable '_hashlib'
sys.modules["_hashlib"] = None
sys.modules.pop("hashlib")
import hashlib
hashlib.sha3_224(string=b"")  # hey it's no more working!!
hashlib.sha3_224(data=b"")  # still not working

So, it's really difficult to iterate over the different constructors and using the same 'args/kwargs'. Namely, a dispatching mechanism is hard to maintain. Now, on 3.13+, it's possible to use any of the above combinations without any issues, although when one uses both data and string, there's still an error being raised. If someone just uses string or data, it's fine.

@gvanrossum
Copy link
Member

Okay, now I understand. Thanks for explaining!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extension-modules C modules in the Modules dir type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants