Skip to content

bpo-12499: support custom len function in textwrap.wrap #28136

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

xi
Copy link

@xi xi commented Sep 3, 2021

I realize that this issue is 10 years old. However, this also provides a generic solution for bpo-24665 (2018), so I believe there is more interest in this than it might appear.

https://bugs.python.org/issue12499

…ter widths

This also provides a generic solution for bpo-24665
@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@xi

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@xi
Copy link
Author

xi commented Sep 3, 2021

I have signed the CLA just before submitting this PR (although with the bpo accout name, which is xi2). Do I have to take additional steps to make that work?

Copy link
Member

@merwok merwok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sending a patch with tests!

I will read the discussion on the ticket to see if there were concerns about adding this.

@merwok
Copy link
Member

merwok commented Sep 3, 2021

The bot can’t find your contributor agreement, please follow the steps here: https://devguide.python.org/pullrequest/#licensing


text = "123 🔧"
expected = ["123", "🔧"]
self.check_wrap(text, 5, expected, text_len=text_len)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that more tests are needed; text_len is used at multiple places in the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding tests! Could some of them use different texts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to avoid adding example for languages that I do not speak myself (e.g. Chinese). Are there ways to reach out to the wider community to provide meaningful examples?

@merwok merwok changed the title bpo-12499: textwrap.wrap: add control for fonts with different character widths bpo-12499: support custom len function in textwrap.wrap Sep 4, 2021
@tiptenbrink
Copy link

Hi, this looks really good, I have some great use cases for this (think wrappers in terminal app packages like blessed). However, it seems to not work well with text_len functions that return zero width for some characters. Take for example the following sample function:

def text_len(text):
    return sum(
        0 if c == 'Q' else 1
        for c in text
    )

If we then run:

wrappr = textwrap.TextWrapper(width=6, text_len=text_len)
print(wrappr.wrap("01QQQQ2Q345678")) (other and total longer than 6)
print(wrappr.wrap("QQQQQ01234")) (other characters less than 6, total longer than 6)

We would expect that in the first case, it breaks after six 1 width characters, i.e. after 5. For the second, we expect it won't break at all. However the results are, respectively:

['01QQQQ', '2Q3456', '78']
['QQQQQ01234']

As you can see, it works fine when calculating if it has to break at all. But when it actually does break, it still counts the zero width characters as having width, which would not be what you expect.

I'll try to dive into this and report if I have found a solution.

@tiptenbrink
Copy link

tiptenbrink commented Nov 9, 2021

The issue is with the following original code (comments are my own):

def _handle_long_word(self, reversed_chunks, cur_line, cur_len, width):
    ...
        if width < 1:
            space_left = 1
        else:
            space_left = width - cur_len
    ...
        if self.break_long_words:
            end = space_left # <-- HERE
            chunk = reversed_chunks[-1]
            if self.break_on_hyphens and self.text_len(chunk) > space_left:
    ...               
                hyphen = chunk.rfind('-', 0, space_left) # <-- HERE
                if hyphen > 0 and any(c != '-' for c in chunk[:hyphen]):
                    end = hyphen + 1
            cur_line.append(chunk[:end]) # <-- HERE
            reversed_chunks[-1] = chunk[end:] # <-- HERE
    ...

As you can see, it's tacitly assumed the text width is equal to the amount of characters, as it breaks on the index equal to the space left. You need a function to find on which index the width exceeds the line. I suggest the following:

    def _find_width_index(self, text, width):
        """_find_width_index(text : string, width: int)

        Find at which index the text has the required width.
        """
        # In most cases text_len will just use the number of characters, so this heuristic prevents calculating width
        # for each character
        if self.text_len(text[:width]) == width:
            # For character widths greater than one, width can be more than the number of characters
            return min(width, len(text))
        cur_text = ''
        for i, c in enumerate(text):
            cur_text += c
            cur_width = self.text_len(cur_text)
            if cur_width >= width:
                return i+1

And then the following amended _handle_long_word:

    def _handle_long_word(self, reversed_chunks, cur_line, cur_len, width):
        """_handle_long_word(chunks : [string],
                             cur_line : [string],
                             cur_len : int, width : int)

        Handle a chunk of text (most likely a word, not whitespace) that
        is too long to fit in any line.
        """
        # Figure out when indent is larger than the specified width, and make
        # sure at least one character is stripped off on every pass
        if width < 1:
            space_left = 1
        else:
            space_left = width - cur_len

        # If we're allowed to break long words, then do so: put as much
        # of the next chunk onto the current line as will fit.
        if self.break_long_words:
            chunk = reversed_chunks[-1]
            end = self._find_width_index(chunk, space_left)
            if self.break_on_hyphens and self.text_len(chunk) > space_left:
                # break after last hyphen, but only if there are
                # non-hyphens before it
                hyphen = chunk.rfind('-', 0, end)
                if hyphen > 0 and any(c != '-' for c in chunk[:hyphen]):
                    end = hyphen + 1
            cur_line.append(chunk[:end])
            reversed_chunks[-1] = chunk[end:]
    ...

I've tested this in tiptenbrink/textwrap, see tiptenbrink/textwrap@f93d7a2 for a proper diff.

A question for @merwok: how does a contribution to an existing PR work in this case? Do I need to also sign a CLA?

@merwok
Copy link
Member

merwok commented Nov 9, 2021

Yes, if you make a PR in xi’s repository and it’s accepted, then this PR here will have code by the both of you, and all contributors need to sign the CLA.

@tiptenbrink
Copy link

Thanks for the info. I've opened a PR @xi

@xi
Copy link
Author

xi commented Nov 11, 2021

I merged the fix by @tiptenbrink but I am not completely convinced. We make some assumptions that I am not sure we can make:

  • text_len(a) >= 0
  • text_len(a) + text_len(b) = text_len(a + b)

Whether we can drop these assumptions depends on the guarantees we want to make: My guess is that "every line will have a maximum text_len of n" can be achieved without these assumptions, but "every line will be the longest possible string that does not exceed a text_len of n" cannot.

This got much more complicated than I had hoped. Should we continue or just drop this?

@python-cla-bot
Copy link

The following commit authors need to sign the Contributor License Agreement:

CLA signed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants