Skip to content

gh-69753: Add Bytes Object Support to Shlex #22657

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

HassanAbouelela
Copy link

@HassanAbouelela HassanAbouelela commented Oct 12, 2020

Adds support for bytes objects in the shlex module.

https://bugs.python.org/issue25567

Adds checking and conversions for bytes objects, to allow them to be used with shlex.
Adds tests that cover bytestrings in the shlex module.
Allows split to return a list of bytes, given a byte input string.
@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Dec 17, 2020
safeunquoted = string.ascii_letters + string.digits + '@%_-+=:,./'
unsafe = '"`$\\!'

self.assertEqual(shlex.quote(b''), b"''")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to test some bytestrings with non-ASCII characters, to make sure that they're handled correctly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the missing unicode characters from the testQuote test, and changes the encoding to utf-8 to account for the changes. Added in bfb76be.

Copy link
Member

@picnixz picnixz Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add all characters that are supported on POSIX (not just three of them):

unicode_chars = ('ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
 				 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')

(I'll let you CC the chars since I am not sure about my own CC)

@HassanAbouelela HassanAbouelela force-pushed the bpo-25567-shlex-bytestrings branch from 8f847e8 to 473c81d Compare September 1, 2021 19:18
Copy link
Contributor

@MaxwellDupre MaxwellDupre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All shlex test run ok.
Also example in bugs look ok.

@github-actions github-actions bot removed the stale Stale PR or inactive for long period of time. label Jul 30, 2022
@serhiy-storchaka serhiy-storchaka self-requested a review July 15, 2024 17:08
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see several problems here:

  • If bytes is passed to shlex(), should not iteration produce bytes rather of str?
  • If shlex() accepts str, bytes and text files, should not it support also binary files?
  • join() returns str for a sequence of str and bytes for a sequence of bytes. Unless the sequence is empty, in which case it always return str. The problem is that an empty sequence of str is the same as the empty sequence of bytes.

@HassanAbouelela
Copy link
Author

HassanAbouelela commented Jul 18, 2024

@serhiy-storchaka Thanks for your review.
Some of this information may be inaccurate (it has been four years haha), but I'll try my best:

If bytes is passed to shlex(), should not iteration produce bytes rather of str?

The focus of this PR was to get byte objects to work with shlex.quote, and during the process, I added support for the other standalone methods as well. These support bytes completely, and return bytes when required (conversions aside). Adding support to the shlex class was done in a limited capacity, to at the very least provide functionality without requiring a larger rewrite. No reason it can't be added afterwards. However, the use-cases/need of such features should probably be discussed.

If shlex() accepts str, bytes and text files, should not it support also binary files?

Probably out of scope for this bug/pull-request.

join() returns str for a sequence of str and bytes for a sequence of bytes. Unless the sequence is empty, in which case it always return str. The problem is that an empty sequence of stris the same as the empty sequence ofbytes.

Can you provide a reproduction, or clarification on sequence?

>>> isinstance("", bytes)
False
>>> isinstance(b"", bytes)
True
>>> "" == b""
False

@HassanAbouelela
Copy link
Author

On your last point, I misunderstood, and see your point now. Don't see a possible fix here though, or the use case.

@@ -22,6 +22,10 @@ def __init__(self, instream=None, infile=None, posix=False,
punctuation_chars=False):
if isinstance(instream, str):
instream = StringIO(instream)
elif isinstance(instream, bytes):
# convert byte instreams to string
instream = StringIO(instream.decode("ascii", "surrogateescape"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would you only support ASCII strings? for POSIX platforms, you should support also additional characters (see wordchars being extended).

return list(lex)

if isinstance(s, bytes):
return [i.encode("ascii") for i in lex]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this should not be restricted to ASCII characters only.



def join(split_command):
"""Return a shell-escaped string from *split_command*."""
return ' '.join(quote(arg) for arg in split_command)
if len(split_command) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behaviour should be documented, saying that a string will be returned instead of bytes for an empty sequence of bytes. Or maybe add an additional a separate function which only accepts bytes inputs (less code and no warnings and more efficient).

cleaned = []
warned = False

for command in split_command:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is too complicated. The str-version assumes that the objects are all strings, namely that split_command is an iterable of strings.

# use single quotes, and put single quotes into double quotes
# the string $'b is then quoted as '$'"'"'b'
return b"'" + s.replace(b"'", b"'\"'\"'") + b"'"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@@ -171,6 +171,10 @@ def testSplitPosix(self):
"""Test data splitting with posix parser"""
self.splitTest(self.posix_data, comments=True)

def testSplitBytes(self):
"""Test byte objects splitting"""
self.assertEqual(shlex.split(b"split words"), [b"split", b"words"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The coverage is insufficient. Use the self.splitTest as for the string case but do it for bytes inputs instead.

In addition, use more than just ASCII characters but also those that are supported by the POSIX platforms.

safeunquoted = string.ascii_letters + string.digits + '@%_-+=:,./'
unsafe = '"`$\\!'

self.assertEqual(shlex.quote(b''), b"''")
Copy link
Member

@picnixz picnixz Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add all characters that are supported on POSIX (not just three of them):

unicode_chars = ('ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
 				 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')

(I'll let you CC the chars since I am not sure about my own CC)

@@ -358,6 +382,46 @@ def testJoinRoundtrip(self):
resplit = shlex.split(joined)
self.assertEqual(split_command, resplit)

def testJoinBytes(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the same dataset as for those used in strings, especially with quotation symbols or unsafe characters.

@@ -0,0 +1 @@
Add support for bytes objects in the shlex module.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Add support for bytes objects in the shlex module.
Add support for :class:`bytes` objects to the :mod:`shlex` module.

You should probably explain that join() returns an empty string as well in the docs. This is important. Or introduce a separate function for bytes objects (which I think would be preferrable because you wouldn't have all those warnings to handle).

@picnixz picnixz changed the title bpo-25567: Add Bytes Object Support to Shlex gh-69753: Add Bytes Object Support to Shlex Jul 21, 2024
@python-cla-bot
Copy link

The following commit authors need to sign the Contributor License Agreement:

CLA signed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants