Skip to content

gh-79516: allow msgfmt.py to compile multiple input po files #10875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 65 commits into
base: main
Choose a base branch
from

Conversation

s-ball
Copy link

@s-ball s-ball commented Dec 3, 2018

msgfmt.py (from Tools/i18n) can now reliably be passed more than one input po file.

In addition, its make central function can reliably be called repeatedly, which fixes bpo-9741

https://bugs.python.org/issue35335

Test option processing, and conversion of one single file with or
without the -o option.
Also test the little documented behaviour of merging two input files
with -o option.
Also show that it is now possible to build multiple po files in one
single script call.
@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

s-ball and others added 3 commits January 23, 2025 19:32
When merging main into multi_inputs, the reference to os_helper was
erroneously removed.
In 2018, all imports came from the test.support package. They are now
splitted among various subpackages.
@s-ball
Copy link
Author

s-ball commented Jan 23, 2025

My patches now successfully pass all tests. Is there anything else I should do?

@zware zware changed the title bpo-35335: explicitely allows msgfmt.py to compile more than one input po files gh-79516: explicitely allows msgfmt.py to compile more than one input po files Feb 25, 2025
@s-ball
Copy link
Author

s-ball commented Mar 18, 2025

Please forgive my previous comment. I was tired and should not have posted it.

I wanted to keep a compatibility with the previous make function that accepted a string while I wanted to be able to pass an iterable of strings. This is what I meant with

Both ways are for compatibility reasons with previous behaviour.

in the docstring. As I now understand that any internal interface of a module from the Tools folder is an internal detail, it would indeed be much simpler and clearer to always pass an iterable of strings to make. I shall soon push a new commit doing that.

@serhiy-storchaka
Copy link
Member

AFAIK, the differences between the GNU msgfmt and msgfmt.py:

  • The GNU msgfmt always combine multiple PO source files. msgfmt.py can produce separate MO files for every PO source file.
  • If the output file is not specified, the GNU msgfmt writes to message.mo. msgfmt.py infers the MO file name from the PO source file name.

So, combining multiple PO source files is needed for consistency with the GNU msgfmt, and producing multiple MO files is needed for backward compatibility with msgfmt.py. The question is how to combine these two modes, and your PR provides a reasonable compromise.

msgfmt.py does not belong to the stdlib and does not officially provide other API besides the CLI. If you want to use it programmatically, it is better to copy the code.

Removing the global MESSAGES is not necessary for fixing the msgfmt.py for multiple input files and no output file. You can just reset it between processing separate files if no output file is specified. But it will not harm.

The simplest fix of the original issue is:

diff --git a/Tools/i18n/msgfmt.py b/Tools/i18n/msgfmt.py
index 878d3e29077..d9c46fcd166 100755
--- a/Tools/i18n/msgfmt.py
+++ b/Tools/i18n/msgfmt.py
@@ -250,6 +250,9 @@ def main():
         return
 
     for filename in args:
+        if outfile is None:
+            global MESSAGES
+            MESSAGES = {}
         make(filename, outfile)
 
 

The code is inefficient if the output file is specified, because it will be rewritten again and again for every input file, but it will work. The code needs rewriting for efficiency and clarity, and for better API if we decide to provide it in future.

All the logic is now in make at the cost of a break in compatibility.
@s-ball
Copy link
Author

s-ball commented Mar 18, 2025

@serhiy-storchaka I am afraid I did not fully understand your last comment. Do you really want me to bring back the global MESSAGES? Aside from that I have followed your advice.

@serhiy-storchaka
Copy link
Member

No, it is only demonstration that all other changes is a beautification of the code. We usually does not allow pure cosmetic changes, but we can allow them if they are related to the behavior changes and make the final code clearer.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GNU msgfmt treats duplicated keys as conflicts. Silently ignoring can lead to bugs.

Comment on lines 114 to 115
BEWARE: in previous version of make the first parameter was a string
containing a single filename.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This note will soon became confusing, as it is not clear what version is previous. The right place for it is in the commit message, not in docstring.

# each PO file generates its corresponding MO file
for filename in filenames:
messages = {}
infile, outfile = get_names(filename, None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we can refactor get_names(). Split it on two parts -- for infile and for outfile. The first part can be inlined and moved in process(), and the second part can be left here, in make().

ID = 1
STR = 2
CTXT = 3
def make(filenames, outfile=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No default value is needed.

"""Tests for multiple input files
"""

def test_no_outputfile(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks redundant. test_both_without_outputfile supersedes it, and this is MultiInputTest.

@tomasr8
Copy link
Member

tomasr8 commented Mar 18, 2025

The GNU msgfmt treats duplicated keys as conflicts. Silently ignoring can lead to bugs.

We touched upon this a bit before: #10875 (comment) We could either fix it now or leave it for a followup PR. Which one would you prefer?

@serhiy-storchaka
Copy link
Member

If the solution is non-trivial, it can wait. I'm just worried that it may require rewriting the tests added here.

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
@s-ball
Copy link
Author

s-ball commented Mar 18, 2025

If the solution is non-trivial, it can wait. I'm just worried that it may require rewriting the tests added here.

It would be rather simple to test if a new msgid already exists and raise an exception. But GNU msgfmt gives more: the exact file and line of the first occurrence and of the second one. That part will require a non trivial work... For the tests part, the only change would be that the current merge should abort (the "" msgid is present in both files), and that we should provide a copy of the second PO file without its header to obtain the current merge.

s-ball added 2 commits March 18, 2025 23:25
Remove one redundant test from test_msgfmt.py
Remove unneeded changes from msgfmt.py

def get_names(filename, outfile):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is no longer used.

# each PO file generates its corresponding MO file
for filename in filenames:
messages = {}
outfile = os.path.splitext(filename)[0] + '.mo'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a change in the behavior if filename does not end with .po.

For testing, you can use other suffix for one of test PO files.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the reason why outfile was computed after infile. As infile is always computed the same way, be outfile present or not, I externalized it in the get_names method when I write my initial patch in 2018. I am sorry but it was a long time ago and I had completely forgotten about that.

The only way I can imagine without code duplication is that process returns its input filename. But I would not find that very nice on a separation of concern point of view. Do you prefer that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a second thought, that latter way is more efficient because the output file name is not computed when it was provided. I'll try to implement it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is up to you. But computing outfile when it was not used was not elegant.

You can factor out computation of infile (without outfile) into a function and call it in both loops. Or you can pre-process the list of filenames.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of normcase() complicates the code (it was the right decision). In the end, your code with get_names() may be not the worst option. It only looked not elegant. In any case, please add more tests or modify existing tests to cover such special cases.

I am hesitant about adding a special test for input file with the .PO extension (the result will be different on Windows and non-Windows.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could try to build a dedicated test decorated with @unittest.skipUnless(sys.platform.startswith("win"), "only for Windows")...

s-ball added 2 commits March 19, 2025 13:52
Fixes a small bug introduced in previous commit (changed behaviour when an input file has an extension other than .po)
@s-ball
Copy link
Author

s-ball commented Mar 20, 2025

Ok the tests for the infile/outfile computations are passing.

Back to the duplicate ids question, it should not be that hard:

  • add a lid variable in process to record the lno value when a msgid is found
  • pass that lid along with infile to the add method and store tuples (msgstr, infile, lid) in the messages dict
  • in add, test whether a key is present before adding a new record and abort with a message - we now have the file and line number of the previous and current id
  • in generate replace the 2 occurrences of messages[id] with messages[id][0]

For the tests, the current test_both_with_outputfile should fail because the empty msgid is duplicated in the second file. We should just add a new PO file without header to have a passing test.

This would allow a better GNU msgfmt compatibility.

Do you think that this should go into this PR or into a new one?

@s-ball
Copy link
Author

s-ball commented Mar 21, 2025

Started working on that point, and I fell not on technical but behavior questions. msgfmt chokes if more than one file contains a header, but the header is the only place where the encoding of the PO file can be declared. As GNU gettext contains a number of auxiliary programs that can help to manage a set of PO files this behavior is nice, but msgfmt.py has not them and it would be a serious limitation. My opinion is that msgfmt.py should accept a header in every file but only store the first (or the last?) in the resulting MO file. That way users could process PO files with other encodings than UTF-8. But I would love to have other opinions on that...

If you think that this points requires a longer discussion, or that it should be discussed on a different place, then it means that we should handle it in a followup PR. If you think that just making a special case for the header is reasonable, then I could implement it in a couple of days.

s-ball and others added 3 commits April 4, 2025 09:39
compile_messages parameters order has changed in a previous commit to allow compiling multiple PO files.
@StanFromIreland
Copy link
Contributor

There will be a lot of conflicts with my hashing pr… this pr changes quite a lot. What do we want to do?

@s-ball
Copy link
Author

s-ball commented Apr 4, 2025

I was not very happy to reverse to parameters order of compile_messages because I thought it could disturb users if they were used to the precedent order. If there are concurrent PR (which I was absolutely not aware), this is clearly a blocking point. IMHO the only possible ways are to sequence the PR if there are only two of them, or to revert the modification of compile_messages, and use a different function to compile many PO files. I am sorry if I have made a mistake here. It is my first PR and I really do not know the usages. Anyway are there other sources of conflict?

@StanFromIreland
Copy link
Contributor

I guess it is up to @serhiy-storchaka who is the expert for msgfmt if I remember correctly as to what order we want these. (Who will have to deal with the conflicts)

@serhiy-storchaka
Copy link
Member

Before proceeding with this issue, we must solve other issues: --exclude-file, --omit-header, use the source file encoding. I think they are important for this issue.

@s-ball
Copy link
Author

s-ball commented Apr 4, 2025

If I correctly understand, I shall just leave this PR as it is now, without even worrying for conflicts, and only come back when the other issues will be fixed. Not really a problem, but do not forget to wake me when it will be the time to fix the conflicts... Anyway it is far from a very important one even for me as I have understood that I had to use a private copy for my own project. I have already been working that way for more that 6 years after all... Stupid me again! If there are conflicts with other changes, it means that those changes have been merged and that the conflicts are to be resolved... I just shall keep on coming here from time to time and fix them when I see them. But unless being advised to act differently I shall not try to handle the duplicate ids question before the PR is merged in its current state to avoid adding other sources of possible conflicts.

@python python deleted a comment Apr 7, 2025
@python-cla-bot
Copy link

python-cla-bot bot commented Apr 18, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants