Skip to content

gh-130197: pygettext: Fix and test the --exclude-file option #131381

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

tomasr8
Copy link
Member

@tomasr8 tomasr8 commented Mar 17, 2025

Fixes and adds tests for the --exclude-file option. This is what the option does:

--exclude-file=filename
Specify a file that contains a list of strings that are not be
extracted from the input files. Each string to be excluded must
appear on a line by itself in the file.

Previously, the file was read as f.readlines() which keeps the \n character for each excluded string. This made the option not work properly. I changed this to f.read.splitlines() which strips the newlines so the option should now work as expected.

@serhiy-storchaka
Copy link
Member

So, --exclude-file does not work and never worked as intended.

Note also that if it worked as intended, it would work differently than in xgettext. So why not implement the xgettext behavior?

@tomasr8
Copy link
Member Author

tomasr8 commented Mar 18, 2025

So, --exclude-file does not work and never worked as intended.

Pretty much, yes. I randomly discovered this when writing the tests for it..

Note also that if it worked as intended, it would work differently than in xgettext. So why not implement the xgettext behavior?

If you're ok with that, I can implement it. The xgettext --exclude-file option reads the msgids from a PO file. Pygettext currently only knows how to write a PO file but not how to read it. msgfmt can read PO files so we could reuse the code from there. Perhaps create some helper file for PO reading/writing?

@serhiy-storchaka
Copy link
Member

For now, I think that we can just copy the code from msgfmt.py. In future we may add the code for reading the PO files in the gettext module.

@tomasr8
Copy link
Member Author

tomasr8 commented Mar 18, 2025

Ok, I'll just copy it over then for now :)

@tomasr8
Copy link
Member Author

tomasr8 commented Mar 20, 2025

In future we may add the code for reading the PO files in the gettext module.

That would actually be really helpful for my work as well. I'd be interested in contributing this. Have you already thought about how you'd like it to work? Should it be something compatible with GNUTranslations or do we want a richer interface that allows you to access e.g. the comments and flags as well?

@tomasr8
Copy link
Member Author

tomasr8 commented Mar 26, 2025

For now, I think that we can just copy the code from msgfmt.py

Just to give an update, it probably won't be as simple. I started off by copying the code with minimal changes (just to get it to work with pygettext) and I discovered several bugs in the parsing logic when I started writing tests. I'll take some inspiration from the original code but I will need to make quite a few changes in order to make PO parsing work correctly.

tomasr8 added 2 commits March 29, 2025 18:06
Allows passing a regular PO file as an exclude file.
All msgids in this file will be excluded when extracting.

This adds PO parsing capability to pygettext.
Comment on lines +503 to +505
# - \n, \r, \t, \\, \", \a, \b, \f, \v
# - Octal escapes: \o, \oo, \ooo
# - Hex escapes: \xh, \xhh, ...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not standardized but I checked this with GNU msgfmt and these are the ones that are allowed.

Comment on lines +304 to +306
f"The file {filename} starts with a UTF-8 BOM which is not "
"allowed in .po files.\nPlease save the file without a BOM "
"and try again.")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same error we added to msgfmt.py recently.

@tomasr8
Copy link
Member Author

tomasr8 commented Mar 29, 2025

I implement PO parsing in order to make --exclude-file work as it does in xgettext. I started with the code in msgfmt.py but as I wrote before, there are some issues with it (e.g. bugs, differences from GNU msgfmt) which I fixed and added lots of tests (actually most of this PR are tests now, so hopefully it won't be that daunting to review 😄 )

So to summarize:

  • pygettext is now able to parse PO files. I also verified that the behaviour matches GNU msgfmt (there's lots of edge cases, so I may have missed some though)
  • --exclude-file now accepts a PO file. All msgids from this file are ignored when extracting.

@serhiy-storchaka Do you think you could have a look at this when you have time? 🙂

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your work.

I am sorry, but it seems I underestimated the size and complexity of the parsing code. Worst of all, this is code that can likely be changed in future (to fix minor errors or adding features). This turned out to be not such a good idea.

The code needs to be shared in some way. Maybe just import a function from msgfmt?

Wait, how is it happened that the whole msgfmt.py is smaller than the parsing code in this PR?

@@ -742,14 +1006,18 @@ class Options:
# initialize list of strings to exclude
if options.excludefilename:
try:
with open(options.excludefilename) as fp:
options.toexclude = fp.readlines()
options.toexclude = get_msgids_from_exclude_file(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that in xgettext you can specify --exclude-file multiple times, and all msgids are added.

@serhiy-storchaka
Copy link
Member

I think that it is better to fix the parser in msgfmt first.

@tomasr8
Copy link
Member Author

tomasr8 commented Apr 10, 2025

I think that it is better to fix the parser in msgfmt first.

Originally, I was meaning to port this change to msgfmt after this is merged, but I can do the opposite. I'll send a PR to fix the msgfmt parser hopefully in a few days!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants