-
-
Notifications
You must be signed in to change notification settings - Fork 31.8k
gh-69753: Add Bytes Object Support to Shlex #22657
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
gh-69753: Add Bytes Object Support to Shlex #22657
Conversation
Adds checking and conversions for bytes objects, to allow them to be used with shlex.
Adds tests that cover bytestrings in the shlex module.
Allows split to return a list of bytes, given a byte input string.
This PR is stale because it has been open for 30 days with no activity. |
safeunquoted = string.ascii_letters + string.digits + '@%_-+=:,./' | ||
unsafe = '"`$\\!' | ||
|
||
self.assertEqual(shlex.quote(b''), b"''") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to test some bytestrings with non-ASCII characters, to make sure that they're handled correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the missing unicode characters from the testQuote
test, and changes the encoding to utf-8 to account for the changes. Added in bfb76be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add all characters that are supported on POSIX (not just three of them):
unicode_chars = ('ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')
(I'll let you CC the chars since I am not sure about my own CC)
Adds the unicode sample characters from the `testQuote` test to the bytes version.
8f847e8
to
473c81d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All shlex test run ok.
Also example in bugs look ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see several problems here:
- If
bytes
is passed toshlex()
, should not iteration producebytes
rather ofstr
? - If
shlex()
acceptsstr
,bytes
and text files, should not it support also binary files? join()
returnsstr
for a sequence ofstr
andbytes
for a sequence ofbytes
. Unless the sequence is empty, in which case it always returnstr
. The problem is that an empty sequence ofstr
is the same as the empty sequence ofbytes
.
@serhiy-storchaka Thanks for your review.
The focus of this PR was to get byte objects to work with
Probably out of scope for this bug/pull-request.
>>> isinstance("", bytes)
False
>>> isinstance(b"", bytes)
True
>>> "" == b""
False |
On your last point, I misunderstood, and see your point now. Don't see a possible fix here though, or the use case. |
@@ -22,6 +22,10 @@ def __init__(self, instream=None, infile=None, posix=False, | |||
punctuation_chars=False): | |||
if isinstance(instream, str): | |||
instream = StringIO(instream) | |||
elif isinstance(instream, bytes): | |||
# convert byte instreams to string | |||
instream = StringIO(instream.decode("ascii", "surrogateescape")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would you only support ASCII strings? for POSIX platforms, you should support also additional characters (see wordchars being extended).
return list(lex) | ||
|
||
if isinstance(s, bytes): | ||
return [i.encode("ascii") for i in lex] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this should not be restricted to ASCII characters only.
|
||
|
||
def join(split_command): | ||
"""Return a shell-escaped string from *split_command*.""" | ||
return ' '.join(quote(arg) for arg in split_command) | ||
if len(split_command) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behaviour should be documented, saying that a string will be returned instead of bytes for an empty sequence of bytes. Or maybe add an additional a separate function which only accepts bytes inputs (less code and no warnings and more efficient).
cleaned = [] | ||
warned = False | ||
|
||
for command in split_command: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is too complicated. The str-version assumes that the objects are all strings, namely that split_command
is an iterable of strings.
# use single quotes, and put single quotes into double quotes | ||
# the string $'b is then quoted as '$'"'"'b' | ||
return b"'" + s.replace(b"'", b"'\"'\"'") + b"'" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -171,6 +171,10 @@ def testSplitPosix(self): | |||
"""Test data splitting with posix parser""" | |||
self.splitTest(self.posix_data, comments=True) | |||
|
|||
def testSplitBytes(self): | |||
"""Test byte objects splitting""" | |||
self.assertEqual(shlex.split(b"split words"), [b"split", b"words"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The coverage is insufficient. Use the self.splitTest
as for the string case but do it for bytes inputs instead.
In addition, use more than just ASCII characters but also those that are supported by the POSIX platforms.
safeunquoted = string.ascii_letters + string.digits + '@%_-+=:,./' | ||
unsafe = '"`$\\!' | ||
|
||
self.assertEqual(shlex.quote(b''), b"''") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add all characters that are supported on POSIX (not just three of them):
unicode_chars = ('ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')
(I'll let you CC the chars since I am not sure about my own CC)
@@ -358,6 +382,46 @@ def testJoinRoundtrip(self): | |||
resplit = shlex.split(joined) | |||
self.assertEqual(split_command, resplit) | |||
|
|||
def testJoinBytes(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the same dataset as for those used in strings, especially with quotation symbols or unsafe characters.
@@ -0,0 +1 @@ | |||
Add support for bytes objects in the shlex module. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add support for bytes objects in the shlex module. | |
Add support for :class:`bytes` objects to the :mod:`shlex` module. |
You should probably explain that join()
returns an empty string as well in the docs. This is important. Or introduce a separate function for bytes objects (which I think would be preferrable because you wouldn't have all those warnings to handle).
The following commit authors need to sign the Contributor License Agreement: |
Adds support for bytes objects in the shlex module.
https://bugs.python.org/issue25567