English plural rules using Pynini #30
Looking good. The most common cases are covered, and the rules can be extended.
Example: x-in-law -> xs-in-law.
This presumes that the noun is not genitive: dog's -> dog's.
On Fri, Apr 19, 2024, 15:14 Nebojša Ćirić wrote:
@nciric requested your review on: #30 English plural rules using Pynini.
Trick question, what's the plural of "brush"? It's both "brush" and "brushes". It depends on what you're talking about. A paint brush can become brushes, but brush in the forest is uncountable and thus it stays as brush.
This logic seems interesting. Before using it, I really want to see better vowel handling. That will typically need Unicode normalization and a UnicodeSet. The UnicodeSet needs to handle different sets at the beginning and the end of a word for English. Typically the letter "y" is a vowel at the end and a consonant at the beginning.
I sometimes wonder how ICU transliteration would work. I'm unsure about the performance, but it's more data driven.
# If a singular noun ends in -y and the letter before the -y is a consonant, change the ending to -ies.
# city - cities
# puppy - puppies
_ies = _sigma + _c + p.cross('y', 'ies')
For my logic, I have length > 2 && endsWith("y") && !isVowel(string[length - 2]). Vowel handling only looks at the base character and not the diacritic, except for the Hebrew and Arabic scripts. Unicode normalization helps with that.
This handles the word "day" correctly. The plural of Monday is Mondays and not Mondaies. It seems that _c tries to handle consonants, but you really need to get the base character, just like the vowel handling. For example, "e" and "é" are both vowels, and "n" and "ñ" are both consonants.
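To make the check above concrete, here is a minimal Python sketch of the same condition using only the standard library. The helper names (`base_letter`, `is_vowel`, `pluralize_y`) are made up for illustration and are not part of the PR; the sketch assumes Latin-script input and does not cover the Hebrew and Arabic special cases mentioned above.

```python
import unicodedata
from typing import Optional

# Assumption: plain English vowel set, compared against base letters only.
VOWELS = frozenset("aeiou")


def base_letter(ch: str) -> str:
    """Return the lowercase base character of ch, dropping any diacritics."""
    # NFD splits "é" into "e" + U+0301; the first code point is the base letter.
    return unicodedata.normalize("NFD", ch)[0].lower()


def is_vowel(ch: str) -> bool:
    return base_letter(ch) in VOWELS


def pluralize_y(word: str) -> Optional[str]:
    """Apply consonant + y -> ies, or return None when the rule does not apply."""
    # Normalize to NFC first so that word[-2] is a whole letter, not a combining mark.
    word = unicodedata.normalize("NFC", word)
    if len(word) > 2 and word.endswith("y") and not is_vowel(word[-2]):
        return word[:-1] + "ies"
    return None
```

With this sketch, "city" and "puppy" take the -ies ending, while "day" and "Monday" fall through to whatever the default rule is.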
fst/en/inflection.py
Outdated
_sigma = p.union(_v, _c).closure().optimize()

_exceptions = p.string_map([
    # Zero plurals.
I count about 444 words in English that follow this rule so far. This is a good start.
fst/en/inflection.py
Outdated
('wife', 'wives'),
('wolf', 'wolves'),
I count about 17 words that follow this logic. Most of them are variations of compound words that end with "knife", "wife", and "life".
fst/en/inflection.py
Outdated
# Stem changes.
('child', 'children'),
('goose', 'geese'),
('man', 'men'),
('woman', 'women'),
('tooth', 'teeth'),
('foot', 'feet'),
('mouse', 'mice'),
('person', 'people'),
('penny', 'pence'),
# Irregular suffixes.
('child', 'children'),
('ox', 'oxen'),
This is a good start. The list can get much larger. You may want to make this list data driven instead of embedding it in code.
All of the exceptions above and the rules would be dumped into an FST file that could be used from C++ (OpenFST). Exceptions wouldn't be in the code: we use Python to build the rules and serialize the "model" to disk.
Of course, we can move this list to a file and load it instead of specifying it in Python code. A refactor for later.
Refactored the exceptions into a file in the data/ folder, but they'll still be part of the final FST once dumped.
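As a rough illustration of what loading such a data file might look like, the sketch below parses whitespace-separated singular/plural pairs into a list of tuples, the same shape that the p.string_map call in the PR receives. The function name `read_exceptions`, the `#` comment syntax, and the sample content are assumptions made for this example; the actual file format in data/ may differ.

```python
import io
from typing import Iterable, List, Tuple


def read_exceptions(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'singular plural' lines, skipping blanks and '#' comments."""
    pairs = []
    for line in lines:
        # Assumption: '#' starts a comment, as in the Python source.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        singular, plural = line.split()
        pairs.append((singular, plural))
    return pairs


# Hypothetical sample standing in for a file in the data/ folder.
SAMPLE = """\
# Stem changes.
tooth teeth
foot feet
mouse mice
"""

exceptions = read_exceptions(io.StringIO(SAMPLE))
```

The resulting list of pairs could then be passed to the string map that builds the exception FST, so the word list stays out of the Python source while still ending up in the serialized model.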
fst/en/inflection.py
Outdated
('photo', 'photos'),
('piano', 'pianos'),
('halo', 'halos'),
This list of exceptions for the "o" ending is much larger than the regular rule below. I recommend inverting this part of the logic.
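One way to read the suggested inversion, sketched in plain Python rather than Pynini: treat -s as the default for nouns ending in "o" and keep the shorter list of -oes nouns as the exceptions. The set below is a deliberately tiny, hypothetical sample, and `pluralize_o` is an illustrative name, not code from the PR.

```python
from typing import Optional

# Hypothetical, deliberately small sample; a real list would come from data.
OES_NOUNS = frozenset({"hero", "potato", "tomato", "echo"})


def pluralize_o(word: str) -> Optional[str]:
    """Pluralize a noun ending in -o; return None for any other ending."""
    if not word.endswith("o"):
        return None
    if word.lower() in OES_NOUNS:
        # The listed exceptions take -es.
        return word + "es"
    # Everything else ending in -o takes the plain -s by default.
    return word + "s"
```

With the default flipped this way, words such as "photo", "piano", and "halo" no longer need to be listed at all.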
fst/en/inflection.py
Outdated
('analysis', 'analyses'),
('ellipsis', 'ellipses'),
I count about 187 words that follow this pattern.
_v = p.union('a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U')
_c = p.union('b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
             'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z',
             'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N',
             'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z')
Can we replace this with a UnicodeSet and Unicode normalization to handle all of the vowels versus consonants?
Personally, I split the English sets so that "y" counts as a vowel when it ends a word and as a consonant when it begins one.
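A small sketch of that position-dependent split, using two plain Python sets in place of a UnicodeSet. The names `PREFIX_VOWELS`, `SUFFIX_VOWELS`, `starts_with_vowel`, and `ends_with_vowel` are invented for this example, and how "y" should be classified in the middle of a word is intentionally left open.

```python
import unicodedata

# Assumption: "y" is a consonant word-initially and a vowel word-finally.
PREFIX_VOWELS = frozenset("aeiou")
SUFFIX_VOWELS = frozenset("aeiouy")


def _base(ch: str) -> str:
    """Lowercase base letter of ch, ignoring diacritics."""
    return unicodedata.normalize("NFD", ch)[0].lower()


def starts_with_vowel(word: str) -> bool:
    word = unicodedata.normalize("NFC", word)
    return bool(word) and _base(word[0]) in PREFIX_VOWELS


def ends_with_vowel(word: str) -> bool:
    # NFC keeps a trailing accented letter as one code point.
    word = unicodedata.normalize("NFC", word)
    return bool(word) and _base(word[-1]) in SUFFIX_VOWELS
```

Keeping the two sets separate means the same letter can be classified differently at each edge of the word without special-casing it in the rules themselves.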
@macchiati I think your example is where the "x" can be replaced by a family relationship, like "brothers-in-law", right? I've also seen "gluteus maximus" turning into "glutei maximi" for the plural form. That's Latin in origin, and it doesn't follow English rules. It's still used by doctors and gyms. There can also be some discussion around whether the plural of "attorney general" is "attorneys general" or "attorney generals". I suspect that such a pluralization will be an American English versus British English discussion.
I think this is a good start. It can probably be refined further as needed, if this is the direction that we're going.
This is a proposal that has some promise. I would like to see other approaches before we proceed. The Pynini library can do many more things than I exposed in this example. I also still need to see the size of the resulting grammar transition file (once you build the rules); it could be prohibitively large for some languages.
Sorry I'm late to the party once again, although I don't think I have a lot to add to George's comments.
This looks cool. I'm not familiar with Pynini, but the gist is fairly clear and the approach makes sense. (My couple of nits aren't really important, of course, as this is a proof of concept.)
I'm guessing the idea here is that you'd use this tool to produce an FST that could be embedded into C code for performance, and I think that makes sense. I do have a couple of thoughts, though. I apologize if they're obvious to everyone else...
- Ideally, what you'd want to do, I think, is start with a large collection of words with all of their inflected forms and glean the transformation rules by examination. That is, rather than hand-crafting transformation rules as you've done here, you'd have some kind of tool that boils them down from looking at the patterns that actually occur in the lexicon and their frequency. Maybe you'd have to hand-craft some rules that the tool couldn't get by examination, but it seems like most of this could be derived by examination.
- If we think about it that way, what all of this basically amounts to is a compression algorithm-- conceptually, you're just looking up inflected forms in a lexicon, and using an FST is just a way of making that lexicon smaller and faster.
As I was writing the first item, I realized I'm sort of describing a deep-learning model, but I think I'm hoping for something simpler and more debuggable than that would get us, something more akin to actually compressing the lexicon data (if that makes any sense).
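To give a feel for what "gleaning the rules by examination" could mean at its simplest, the sketch below strips the longest common prefix from each (singular, plural) pair and counts the resulting suffix rewrites. It is only an illustration under stated assumptions: the function names and the word pairs are made up for the example, and a real tool would also need to learn the context in which each rewrite applies (for instance, the consonant before the "y").

```python
import os
from collections import Counter
from typing import Iterable, Tuple


def suffix_rule(singular: str, plural: str) -> Tuple[str, str]:
    """Return (singular suffix, plural suffix) after the longest common prefix."""
    # os.path.commonprefix compares character by character, so it works on any strings.
    prefix_len = len(os.path.commonprefix([singular, plural]))
    return singular[prefix_len:], plural[prefix_len:]


def count_rules(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each suffix rewrite occurs in the lexicon."""
    return Counter(suffix_rule(s, p) for s, p in pairs)


# Hypothetical miniature lexicon, for illustration only.
PAIRS = [
    ("city", "cities"), ("puppy", "puppies"),
    ("cat", "cats"), ("dog", "dogs"),
    ("wolf", "wolves"), ("wife", "wives"),
]

rules = count_rules(PAIRS)
```

Sorting the counter by frequency would show which rewrites deserve a general rule and which are rare enough to stay in an exception list.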
The other thing we'd have to think a lot about, but maybe later, is what to do when results are ambiguous-- is there a way to tell when the plural of "person" should be "people" and when it should be "persons," for example? Or do we return both, maybe tagged with probabilities?
As I said, this is probably all obvious to everybody reading this, and if so, I apologize...
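On the ambiguity question, one simple shape for the API is to return every known candidate and let the caller choose. The sketch below is an assumption-laden illustration: the `PLURALS` table, the function name, and the ordering of candidates are all hypothetical, and no probabilities are attached because none are available here.

```python
from typing import Dict, List

# Hypothetical table; the order of candidates is an assumption, not measured data.
PLURALS: Dict[str, List[str]] = {
    "person": ["people", "persons"],
    "brush": ["brushes", "brush"],
    "penny": ["pennies", "pence"],
}


def plural_candidates(word: str) -> List[str]:
    """Return all known plural forms for word, or an empty list if unknown."""
    # Copy the list so callers cannot mutate the shared table.
    return list(PLURALS.get(word, []))
```

A later refinement could attach a weight or a usage tag (for example, regional or formal) to each candidate instead of relying on list order.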
tooth teeth
foot feet
mouse mice
person people
The plural of "person" can also be "persons" in some contexts...
foot feet
mouse mice
person people
penny pence
You only see "pence" in British English.
On Mon, Apr 22, 2024, 15:04 George Rhoten commented on this pull request:
In fst/en/inflection.py:
> +])
+
+# If a singular noun ends in -y and the letter before the -y is a consonant, change the ending to -ies.
+# city - cities
+# puppy - puppies
+_ies = _sigma + _c + p.cross('y', 'ies')
+
+# If the singular noun ends in -on, the plural ending is usually -a.
+# phenomenon - phenomena
+# criterion - criteria
+_a = _sigma + p.cross('on', 'a')
+
+# If the singular noun ends in -us, the plural ending is frequently -i.
+# cactus - cacti
+# focus - foci
+_i = _sigma + p.cross('us', 'i')
On a related note, the plural of octopus is octopuses in English. Though
it's not uncommon to see people use octopi, which is a hypercorrection.
Also see https://twitter.com/MontereyAq/status/1535752435863588866 for
reference.
There are unanswered questions with my approach, multiple results for example. Also, I am not 100% sure we'll be able to avoid Python logic, so in addition to the FST table you'll need to replicate some of the code too (in this example we didn't have that problem). I think rule generation can be automated to some degree using data patterns and light ML, for example clustering for y->ies, Greek/Latin words, etc. As I mentioned before, I would like to learn of other approaches before we commit to any single one. Another part I didn't do was to generate the actual transition table, use it from the OpenFST C++ library, and see how that works (both in size and ease of use).
Implemented plural rules for English using the Pynini package to showcase the logic and the how-to.