gh-130703: Implement wrapping to width for msgids #130705

Open: wants to merge 24 commits into base: main

Contributor

@StanFromIreland StanFromIreland commented Feb 28, 2025

@StanFromIreland
Contributor Author

Requesting @tomasr8 @serhiy-storchaka :-)

Member

@serhiy-storchaka serhiy-storchaka left a comment

This does not work.

  • It can break escape sequences.
  • The normalized message can already be multiline. Splitting it again will produce too short lines and even empty lines.
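To make the first point concrete, here is a minimal sketch. The `naive_split` helper is hypothetical and is not the code from this PR; it only assumes that wrapping is done by slicing the already-escaped string at a fixed width.

```python
def naive_split(escaped, width):
    """Slice an already-escaped string into fixed-width pieces (hypothetical helper)."""
    return [escaped[i:i + width] for i in range(0, len(escaped), width)]


# Escaped form of the two-line message "ab\ncd": the six characters a b \ n c d.
escaped = 'ab\\ncd'
parts = naive_split(escaped, 3)

# The two-character escape sequence \n is cut in half: the first piece ends
# with a lone backslash, which would escape the closing quote in the .pot file.
for part in parts:
    print(f'"{part}"')
```

Splitting on chunk boundaries of the unescaped text, and measuring each chunk by its escaped length, avoids cutting inside an escape sequence.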

@StanFromIreland StanFromIreland marked this pull request as draft February 28, 2025 20:10
@StanFromIreland
Contributor Author

I need to update normalize to wrap while respecting word boundaries.

Member

@picnixz picnixz left a comment

Can't you use textwrap.wrap for wrapping? It's not perfect, but it ought to do most of the job.

@tomasr8
Member

tomasr8 commented Feb 28, 2025

I'm afraid textwrap won't always work. I suggest adding the wrapping logic to the normalize function. pybabel does it in a similar way, you can have a look at their implementation: https://github.com/python-babel/babel/blob/master/babel/messages/pofile.py#L464
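One possible reason, offered here as an assumption since the comment does not spell it out: with its default settings, textwrap.wrap discards whitespace at line boundaries, so the wrapped pieces cannot be concatenated back into the original msgid. A small sketch with an arbitrary sample string:

```python
import textwrap

text = 'foo   bar baz'  # three spaces between "foo" and "bar"
wrapped = textwrap.wrap(text, width=7)

# With the default drop_whitespace=True, whitespace at the start and end of
# each wrapped line is discarded, so the pieces no longer add up to the input.
print(wrapped)
print(''.join(wrapped) == text)
print(' '.join(wrapped) == text)
```

A PO-style wrapper has to keep every character, because adjacent quoted lines are concatenated verbatim when the catalog is read.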

@StanFromIreland
Contributor Author

Implemented pybabel's method.

Member

@picnixz picnixz left a comment

I'm not really the best person to review this, but I can review the implementation. Please don't just apply my suggestions as is; decide which ones are best.

@StanFromIreland StanFromIreland requested a review from picnixz March 1, 2025 10:19
@StanFromIreland StanFromIreland marked this pull request as ready for review March 1, 2025 10:20
StanFromIreland and others added 2 commits March 1, 2025 11:03
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Member

@picnixz picnixz left a comment

I'm not sure that the regex approach is correct. It would gobble up consecutive spaces right?
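The regex used in the PR is not shown in this conversation, so the patterns below are illustrative assumptions only. The sketch shows the general concern: splitting on whitespace without a capturing group throws the separators away, while a capturing group keeps them as chunks.

```python
import re

text = 'a  b   c'  # runs of two and three spaces

# Splitting on whitespace without a capturing group discards the separators.
lossy = re.split(r'\s+', text)

# A capturing group keeps each whitespace run as its own chunk.
lossless = re.split(r'(\s+)', text)

print(lossy)
print(lossless)
print(''.join(lossless) == text)
```

Only the second form lets the wrapped lines be joined back into the exact original string.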

@StanFromIreland StanFromIreland requested a review from picnixz March 1, 2025 11:19
@StanFromIreland StanFromIreland requested a review from picnixz March 2, 2025 09:59
@tomasr8
Member

tomasr8 commented Mar 2, 2025

I really recommend creating a dummy file with some gettext calls and comparing the differences between pygettext, xgettext and babel. There are some differences that should be considered. Here are two I noticed:

  • The header is not wrapped but both xgettext and babel do wrap it.
  • This file:
_('foos')

run with --width=3 produces this output:

msgid ""
""
"foos"
msgstr ""

while xgettext and babel give me this (i.e. they don't insert two extra "" when the line does not get wrapped):

msgid "foos"
msgstr ""
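The behaviour described above for xgettext and babel can be sketched as follows. This is a simplified illustration, not the PR's code: `wrap_chunks` and `format_msgid` are hypothetical names, escaping and line prefixes are ignored, and the `+ 2` for the surrounding quotes is an assumption about how width is counted.

```python
import re


def wrap_chunks(text, width):
    """Greedily pack word and whitespace chunks into lines without splitting a word."""
    chunks = [c for c in re.split(r'(\s+)', text) if c]
    lines, buf, size = [], [], 0
    for chunk in chunks:
        # + 2 accounts for the surrounding double quotes (assumed convention).
        if buf and size + len(chunk) + 2 > width:
            lines.append(''.join(buf))
            buf, size = [], 0
        # A chunk longer than the width still gets a line of its own.
        buf.append(chunk)
        size += len(chunk)
    if buf:
        lines.append(''.join(buf))
    return lines


def format_msgid(text, width):
    """Emit a single-line msgid unless the text really wrapped onto several lines."""
    lines = wrap_chunks(text, width)
    if len(lines) <= 1:
        return f'msgid "{text}"'
    return 'msgid ""\n' + '\n'.join(f'"{line}"' for line in lines)


print(format_msgid('foos', 3))
print(format_msgid('foo bar baz', 8))
```

The leading `msgid ""` line is emitted only when there is more than one line, so a single word that exceeds the width stays on the `msgid` line.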

@StanFromIreland
Contributor Author

StanFromIreland commented Mar 2, 2025

As for the header, wrapping it will conflict with my implementation of --omit-header. Could that get merged first (or vice versa)?

@StanFromIreland
Contributor Author

StanFromIreland commented Mar 2, 2025

The test failure is unrelated.

Wrapping the header will require a separate function, like so:

Subject: [PATCH] Wrap header
---
Index: Tools/i18n/pygettext.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/Tools/i18n/pygettext.py b/Tools/i18n/pygettext.py
--- a/Tools/i18n/pygettext.py	(revision 8d03cbf141068c4ac9812a967a4c9f5942e22d75)
+++ b/Tools/i18n/pygettext.py	(date 1740913002600)
@@ -589,12 +589,36 @@
     def _is_string_const(self, node):
         return isinstance(node, ast.Constant) and isinstance(node.value, str)
 
+
+def _wrap_header(s, options):
+    lines = []
+    for line in s.splitlines():
+        if len(line) > options.width and ' ' in line:
+            words = _space_splitter(line)
+            words.reverse()
+            buf = []
+            size = 0
+            while words:
+                word = words.pop()
+                if size + len(word) <= options.width:
+                    buf.append(word)
+                    size += len(word)
+                else:
+                    lines.append(''.join(buf))
+                    buf = [word]
+                    size = len(word)
+            lines.append(''.join(buf))
+        else:
+            lines.append(line)
+    return "\n".join(lines) + "\n"
+
+
 def write_pot_file(messages, options, fp):
     timestamp = time.strftime('%Y-%m-%d %H:%M%z')
     encoding = fp.encoding if fp.encoding else 'UTF-8'
-    print(pot_header % {'time': timestamp, 'version': __version__,
+    print(_wrap_header(pot_header % {'time': timestamp, 'version': __version__,
                         'charset': encoding,
-                        'encoding': '8bit'}, file=fp)
+                        'encoding': '8bit'}, options), file=fp)
 
     # Sort locations within each message by filename and lineno
     sorted_keys = [

@picnixz picnixz removed their request for review March 2, 2025 17:18
Member

@serhiy-storchaka serhiy-storchaka left a comment

It looks almost ready now. Please add more tests for edge cases. It may be convenient to use the same string with different widths.

Add tests for the cases when len(escaped_line) + len(prefix) + 3 equals width and when it equals width.

Add tests for the cases when new_size + 2 equals width and when it equals width + 1.

Add tests for a too-long first word (new_size + 2 > width and buf is empty) and for a too-long last word.

Add tests for whitespace characters other than ' ' and '\n' (e.g. '\t' and '\r'), and for non-ASCII line separators and whitespace characters. Test the different escaping modes.

Do not add a separate method for every case. Group assertions for similar cases in one method.
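One way to follow this advice is to run the same string through several widths inside a single test method with subTest. The sketch below is only an illustration: `wrap_words` is a hypothetical stand-in, not pygettext's normalize function, and the widths are chosen around its own boundaries.

```python
import unittest


def wrap_words(text, width):
    """Hypothetical stand-in for the function under test: greedy wrap on single spaces."""
    lines, buf = [], ''
    for word in text.split(' '):
        candidate = word if not buf else buf + ' ' + word
        if buf and len(candidate) > width:
            lines.append(buf)
            buf = word
        else:
            buf = candidate
    lines.append(buf)
    return lines


class WrapWidthTests(unittest.TestCase):
    def test_width_boundaries(self):
        text = 'aaa bbb ccc'
        cases = [
            (11, ['aaa bbb ccc']),        # whole string fits exactly
            (10, ['aaa bbb', 'ccc']),     # one column short of fitting
            (7, ['aaa bbb', 'ccc']),      # first two words fit exactly
            (6, ['aaa', 'bbb', 'ccc']),   # one column short for two words
            (2, ['aaa', 'bbb', 'ccc']),   # every word is longer than the width
        ]
        for width, expected in cases:
            with self.subTest(width=width):
                self.assertEqual(wrap_words(text, width), expected)
```

Saved in a file, this can be run with `python -m unittest`; each failing width is reported separately while the cases stay grouped in one method.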

@StanFromIreland
Contributor Author

Friendly ping @serhiy-storchaka :-)

@serhiy-storchaka
Member

Sorry, the tests still do not satisfy me. I am going to play with them myself, and then propose my variant.

@picnixz picnixz removed their request for review March 23, 2025 12:26
@serhiy-storchaka serhiy-storchaka self-assigned this Apr 9, 2025