
Conversation

Contributor

@nciric nciric commented Apr 19, 2024

Implemented plural rules for English using Pynini package to showcase the logic and how-to.

@nciric nciric self-assigned this Apr 19, 2024
@macchiati
Member

macchiati commented Apr 19, 2024 via email

Member

@grhoten grhoten left a comment


Trick question: what's the plural of "brush"? It's both "brush" and "brushes". It depends on what you're talking about. A paint brush can become brushes, but brush in the forest is uncountable and thus it stays as brush.

This logic seems interesting. Before using it, I really want to see better vowel handling. That typically needs Unicode normalization and a UnicodeSet. The UnicodeSet needs to handle different sets at the beginning and the end of a word for English. Typically the letter "y" is a vowel at the end and a consonant at the beginning.

I sometimes wonder how ICU transliteration would work. I'm unsure about the performance, but it's more data-driven.
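A hypothetical sketch of what position-sensitive vowel sets could look like in Pynini (the names and details here are illustrative, not from this PR): "y" joins the vowel set only in word-final position.

import pynini as p

_v_core = p.union(*'aeiouAEIOU')
# "y" counts as a vowel at the end of a word ("puppy", "day")...
_v_word_final = p.union(_v_core, 'y', 'Y')
# ...but as a consonant at the beginning ("yellow").
_v_word_initial = _v_core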

# If a singular noun ends in -y and the letter before the -y is a consonant, change the ending to -ies.
# city - cities
# puppy - puppies
_ies = _sigma + _c + p.cross('y', 'ies')
Member

For my logic, I have length > 2 && endsWith("y") && !isVowel(string[length - 2]). Vowel handling only looks at the base character and not the diacritic, except in the Hebrew and Arabic scripts. Unicode normalization helps with that.

This handles the word day correctly. The plural of Monday is Mondays and not Mondaies. It seems that _c tries to handle consonants, but you really need to get the base character, just like in the vowel handling. "e" and "é" are both vowels, for example. The letters "n" and "ñ" are also consonants.
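One illustrative way to do the base-character test (plain Python, not part of this PR): decompose with NFD and keep the first non-combining character, so "é" classifies as a vowel and "ñ" as a consonant.

import unicodedata

def base_char(ch):
    # NFD splits "é" into "e" plus a combining acute; keep the base letter.
    decomposed = unicodedata.normalize('NFD', ch)
    return next(c for c in decomposed if not unicodedata.combining(c))

def is_vowel(ch):
    return base_char(ch).lower() in 'aeiou'

assert is_vowel('é') and not is_vowel('ñ')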

_sigma = p.union(_v, _c).closure().optimize()

_exceptions = p.string_map([
# Zero plurals.
Member

I count about 444 words in English that follow this rule so far. This is a good start.

Comment on lines 42 to 43
('wife', 'wives'),
('wolf', 'wolves'),
Member

I count about 17 words that follow this logic. Most of them are variations of compound words that end with "knife", "wife", and "life".

Comment on lines 28 to 40
# Stem changes.
('child', 'children'),
('goose', 'geese'),
('man', 'men'),
('woman', 'women'),
('tooth', 'teeth'),
('foot', 'feet'),
('mouse', 'mice'),
('person', 'people'),
('penny', 'pence'),
# Irregular suffixes.
('child', 'children'),
('ox', 'oxen'),
Member

This is a good start. The list can get much larger. You may want to make this list data-driven instead of embedding it in code.

Contributor Author

All of the exceptions above and the rules would be dumped into an FST file that could be used from C++ (OpenFST). Exceptions wouldn't be in the code - we use Python to build the rules and serialize the "model" to disk.
Of course, we can move this list to a file and load it instead of specifying it in Python code. A refactor for later.
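A sketch of what that serialization step could look like (rule and file names here are hypothetical): Pynini FSTs are OpenFST FSTs, so the written file can be loaded from C++ with fst::Fst::Read.

import pynini as p

# Combine exceptions and regular rules; a real build would give the
# exceptions priority so they win over the generic suffix rules.
plural_rules = p.union(_exceptions, _regular_rules).optimize()
plural_rules.write('en_plurals.fst')  # OpenFST-format binary on disk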

Contributor Author

Refactored the exceptions into a file in the data/ folder, but they'll still be part of the final FST once it's dumped.
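For reference, a minimal sketch of that load step (the path is hypothetical): pynini.string_file compiles a tab-separated file of (singular, plural) pairs directly into an FST.

import pynini as p

_exceptions = p.string_file('data/en_plural_exceptions.tsv').optimize()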

Comment on lines 50 to 52
('photo', 'photos'),
('piano', 'pianos'),
('halo', 'halos'),
Member

This list of exceptions for the "o" ending is much larger than the regular rule below. I recommend inverting this part of the logic.
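A sketch of that inversion (illustrative; the names and the weighting idiom are my assumptions): make "-o" -> "-os" the default and keep only the smaller "-oes" set as exceptions, weighting the default rule so exceptions win under shortest-path.

import pynini as p
from pynini.lib import pynutil

_oes_exceptions = p.string_map([('hero', 'heroes'), ('potato', 'potatoes')])
_os_default = _sigma + p.cross('o', 'os')  # default: just append "s"
# Unweighted exceptions beat the weighted default under shortest-path.
_o_rule = p.union(_oes_exceptions,
                  pynutil.add_weight(_os_default, 1.0)).optimize()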

Comment on lines 45 to 46
('analysis', 'analyses'),
('ellipsis', 'ellipses'),
Member

I count about 187 words that follow this pattern.

@nciric nciric requested a review from grhoten April 22, 2024 20:37
Comment on lines +19 to +23
_v = p.union('a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U')
_c = p.union('b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z',
'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N',
'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z')
Member

Can we replace this with a UnicodeSet and Unicode normalization to handle all of the vowels versus consonants?

Personally, I split the English vowel set so that it includes "y" at the end of a word but treats "y" as a consonant at the beginning.
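A sketch of the UnicodeSet idea (assuming PyICU is available; illustrative only), combined with NFD normalization so accented vowels like "é" match too:

import unicodedata
from icu import UnicodeSet

VOWELS = UnicodeSet('[aeiouAEIOU]')

def is_vowel(ch):
    base = unicodedata.normalize('NFD', ch)[0]  # base letter comes first in NFD
    return VOWELS.contains(base)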

@grhoten
Member

grhoten commented Apr 22, 2024

> Looking good. Most common cases are covered, and can be extended. Examples x-in-law -> xs-in-law.

@macchiati I think your example is where the "x" can be replaced by a family relationship, like "brothers-in-law", right?

I've also seen "gluteus maximus" turning into "glutei maximi" for the plural form. That's Latin in origin, and it doesn't follow English rules. It's still used by doctors and gyms.

There can also be some discussion around whether the plural of "attorney general" is "attorneys general" or "attorney generals". I suspect that such a pluralization will be an American English versus British English discussion.

Member

@grhoten grhoten left a comment


I think this is a good start. It can probably be refined further as needed, if this is the direction that we're going.

@nciric nciric merged commit 77403cf into main Apr 22, 2024
@nciric nciric deleted the cira-code branch April 22, 2024 23:25
@nciric
Contributor Author

nciric commented Apr 22, 2024

> I think this is a good start. It can probably be refined further as needed, if this is the direction that we're going.

This is a proposal that has some promise. I would like to see other approaches before we proceed.

The Pynini library can do many more things than I exposed in this example. Also, I still need to see the size of the resulting grammar transition file (once the rules are built) - it could be prohibitively large for some languages.
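A sketch of how that size check could look in Python before involving C++ (names here are hypothetical):

import os
import pynini as p

grammar = plural_rules.optimize()
grammar.write('en_plurals.fst')
print(grammar.num_states(), 'states,',
      os.path.getsize('en_plurals.fst'), 'bytes on disk')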


@richgillam richgillam left a comment


Sorry I'm late to the party once again, although I don't think I have a lot to add to George's comments.

This looks cool. I'm not familiar with Pynini, but the gist is fairly clear and the approach makes sense. (My couple of nits aren't really important, of course, as this is a proof of concept.)

I'm guessing the idea here is that you'd use this tool to produce an FST that could be embedded into C code for performance, and I think that makes sense. I do have a couple of thoughts, though. I apologize if they're obvious to everyone else...

  1. Ideally, what you'd want to do, I think, is start with a large collection of words with all of their inflected forms and glean the transformation rules by examination. That is, rather than hand-crafting transformation rules as you've done here, you'd have some kind of tool that boils them down from looking at the patterns that actually occur in the lexicon and their frequency. Maybe you'd have to hand-craft some rules that the tool couldn't get by examination, but it seems like most of this could be derived by examination.
  2. If we think about it that way, what all of this basically amounts to is a compression algorithm-- conceptually, you're just looking up inflected forms in a lexicon, and using an FST is just a way of making that lexicon smaller and faster.

As I was writing item 1, I realized I'm sort of describing a deep-learning model, but I think I'm hoping for something simpler and more debuggable than that would get us, something more akin to actually compressing the lexicon data (if that makes any sense).
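A purely illustrative sketch of the idea in item 1 (not part of this PR): strip the longest common prefix from each (singular, plural) pair and count the resulting suffix rewrites, so frequent patterns surface as candidate rules.

from collections import Counter
from os.path import commonprefix

def suffix_rule(singular, plural):
    # Reduce a word pair to the suffix rewrite it exemplifies.
    k = len(commonprefix([singular, plural]))
    return singular[k:], plural[k:]

lexicon = [('city', 'cities'), ('puppy', 'puppies'),
           ('wolf', 'wolves'), ('day', 'days')]
rules = Counter(suffix_rule(s, pl) for s, pl in lexicon)
# Counter({('y', 'ies'): 2, ('f', 'ves'): 1, ('', 's'): 1})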

The other thing we'd have to think a lot about, but maybe later, is what to do when results are ambiguous-- is there a way to tell when the plural of "person" should be "people" and when it should be "persons," for example? Or do we return both, maybe tagged with probabilities?

As I said, this is probably all obvious to everybody reading this, and if so, I apologize...

tooth teeth
foot feet
mouse mice
person people


The plural of "person" can also be "persons" in some contexts...

foot feet
mouse mice
person people
penny pence


You only see "pence" in British English.

@macchiati
Member

macchiati commented Apr 23, 2024 via email

@nciric
Contributor Author

nciric commented Apr 24, 2024

> Sorry I'm late to the party once again, although I don't think I have a lot to add to George's comments.
>
> This looks cool. I'm not familiar with Pynini, but the gist is fairly clear and the approach makes sense. (My couple of nits aren't really important, of course, as this is a proof of concept.)
>
> I'm guessing the idea here is that you'd use this tool to produce an FST that could be embedded into C code for performance, and I think that makes sense. I do have a couple of thoughts, though. I apologize if they're obvious to everyone else...
>
> 1. Ideally, what you'd want to do, I think, is start with a large collection of words with all of their inflected forms and glean the transformation rules by examination. That is, rather than hand-crafting transformation rules as you've done here, you'd have some kind of tool that boils them down from looking at the patterns that actually occur in the lexicon and their frequency. Maybe you'd have to hand-craft some rules that the tool couldn't get by examination, but it seems like most of this could be derived by examination.
> 2. If we think about it that way, what all of this basically amounts to is a compression algorithm-- conceptually, you're just looking up inflected forms in a lexicon, and using an FST is just a way of making that lexicon smaller and faster.
>
> As I was writing item 1, I realized I'm sort of describing a deep-learning model, but I think I'm hoping for something simpler and more debuggable than that would get us, something more akin to actually compressing the lexicon data (if that makes any sense).
>
> The other thing we'd have to think a lot about, but maybe later, is what to do when results are ambiguous-- is there a way to tell when the plural of "person" should be "people" and when it should be "persons," for example? Or do we return both, maybe tagged with probabilities?
>
> As I said, this is probably all obvious to everybody reading this, and if so, I apologize...

There are unanswered questions with my approach - multiple results, for example. Also, I am not 100% sure we'll be able to avoid Python logic, so in addition to the FST table you'd need to replicate some of the code too (in this example we didn't have that problem).

I think rule generation can be automated to some degree using data patterns and light ML - for example, clustering for y -> ies, Greek/Latin words, etc.

As I mentioned before, I would like to learn of other approaches before we commit to any single one.

Another part I didn't do was generating the actual transition table and using it from the OpenFST C++ library to see how that works (both in size and ease of use).
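Staying in Python, a sketch of that round trip (the file and rule names are hypothetical; a C++ consumer would call fst::Fst::Read on the same file):

import pynini as p

plural_rules.write('en_plurals.fst')
reloaded = p.Fst.read('en_plurals.fst')
# Compose an input word with the rules and take the best path.
print(p.shortestpath(p.compose('puppy', reloaded)).string())  # puppies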
