
Conversation

Contributor

@nciric nciric commented Apr 19, 2024

Implemented plural rules for English using Pynini package to showcase the logic and how-to.

@nciric nciric self-assigned this Apr 19, 2024
@macchiati
Member

macchiati commented Apr 19, 2024 via email

Member

@grhoten grhoten left a comment


Trick question: what's the plural of "brush"? It's both "brush" and "brushes". It depends on what you're talking about. A paint brush can become brushes, but brush in the forest is uncountable and thus it stays as brush.

This logic seems interesting. Before using it, I really want to see better vowel handling. That typically needs Unicode normalization and a UnicodeSet. The UnicodeSet needs to handle different sets at the beginning and the end of a word for English. Typically the letter "y" is a vowel at the end and a consonant at the beginning.

I sometimes wonder how ICU transliteration would work. I'm unsure about the performance, but it's more data-driven.
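A hypothetical sketch of what position-sensitive vowel sets could look like in Pynini (the names and details here are illustrative, not from this PR): "y" joins the vowel set only in word-final position.

import pynini as p

_v_core = p.union(*'aeiouAEIOU')
# "y" counts as a vowel at the end of a word ("puppy", "day")...
_v_word_final = p.union(_v_core, 'y', 'Y')
# ...but as a consonant at the beginning ("yellow").
_v_word_initial = _v_core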

# If a singular noun ends in -y and the letter before the -y is a consonant, change the ending to -ies.
# city - cities
# puppy - puppies
_ies = _sigma + _c + p.cross('y', 'ies')
Member

For my logic, I have length > 2 && endsWith("y") && !isVowel(string[length - 2]). Vowel handling only looks at the base character and not the diacritic, except in the Hebrew and Arabic scripts. Unicode normalization helps with that.

This handles the word day correctly. The plural of Monday is Mondays and not Mondaies. It seems that _c tries to handle consonants, but you really need to get the base character, just like in the vowel handling. "e" and "é" are both vowels, for example. The letters "n" and "ñ" are also consonants.
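One illustrative way to do the base-character test (plain Python, not part of this PR): decompose with NFD and keep the first non-combining character, so "é" classifies as a vowel and "ñ" as a consonant.

import unicodedata

def base_char(ch):
    # NFD splits "é" into "e" plus a combining acute; keep the base letter.
    decomposed = unicodedata.normalize('NFD', ch)
    return next(c for c in decomposed if not unicodedata.combining(c))

def is_vowel(ch):
    return base_char(ch).lower() in 'aeiou'

assert is_vowel('é') and not is_vowel('ñ')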

_sigma = p.union(_v, _c).closure().optimize()

_exceptions = p.string_map([
# Zero plurals.
Member

I count about 444 words in English that follow this rule so far. This is a good start.

Comment on lines 42 to 43
('wife', 'wives'),
('wolf', 'wolves'),
Member

I count about 17 words that follow this logic. Most of them are variations of compound words that end with "knife", "wife", and "life".

Comment on lines 28 to 40
# Stem changes.
('child', 'children'),
('goose', 'geese'),
('man', 'men'),
('woman', 'women'),
('tooth', 'teeth'),
('foot', 'feet'),
('mouse', 'mice'),
('person', 'people'),
('penny', 'pence'),
# Irregular suffixes.
('child', 'children'),
('ox', 'oxen'),
Member

This is a good start. The list can get much larger. You may want to make this list data-driven instead of embedding it in code.

Contributor Author

All of the exceptions above and the rules would be dumped into an FST file that could be used from C++ (OpenFST). Exceptions wouldn't be in the code - we use Python to build the rules and serialize the "model" to disk.
Of course, we can move this list to a file and load it instead of specifying it in Python code. A refactor for later.
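A sketch of what that serialization step could look like (rule and file names here are hypothetical): Pynini FSTs are OpenFST FSTs, so the written file can be loaded from C++ with fst::Fst::Read.

import pynini as p

# Combine exceptions and regular rules; a real build would give the
# exceptions priority so they win over the generic suffix rules.
plural_rules = p.union(_exceptions, _regular_rules).optimize()
plural_rules.write('en_plurals.fst')  # OpenFST-format binary on disk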

Contributor Author

Refactored the exceptions into a file in the data/ folder, but they'll still be part of the final FST once it's dumped.
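For reference, a minimal sketch of that load step (the path is hypothetical): pynini.string_file compiles a tab-separated file of (singular, plural) pairs directly into an FST.

import pynini as p

_exceptions = p.string_file('data/en_plural_exceptions.tsv').optimize()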

Comment on lines 50 to 52
('photo', 'photos'),
('piano', 'pianos'),
('halo', 'halos'),
Member

This list of exceptions for the "o" ending is much larger than the regular rule below. I recommend inverting this part of the logic.
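A sketch of that inversion (illustrative; the names and the weighting idiom are my assumptions): make "-o" -> "-os" the default and keep only the smaller "-oes" set as exceptions, weighting the default rule so exceptions win under shortest-path.

import pynini as p
from pynini.lib import pynutil

_oes_exceptions = p.string_map([('hero', 'heroes'), ('potato', 'potatoes')])
_os_default = _sigma + p.cross('o', 'os')  # default: just append "s"
# Unweighted exceptions beat the weighted default under shortest-path.
_o_rule = p.union(_oes_exceptions,
                  pynutil.add_weight(_os_default, 1.0)).optimize()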

Comment on lines 45 to 46
('analysis', 'analyses'),
('ellipsis', 'ellipses'),
Member

I count about 187 words that follow this pattern.

@nciric nciric requested a review from grhoten April 22, 2024 20:37
Comment on lines +19 to +23
_v = p.union('a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U')
_c = p.union('b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z',
'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N',
'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z')
Member

Can we replace this with a UnicodeSet and Unicode normalization to handle all of the vowels versus consonants?

Personally, I split the English vowel set so that it includes "y" at the end of a word but treats "y" as a consonant at the beginning.
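A sketch of the UnicodeSet idea (assuming PyICU is available; illustrative only), combined with NFD normalization so accented vowels like "é" match too:

import unicodedata
from icu import UnicodeSet

VOWELS = UnicodeSet('[aeiouAEIOU]')

def is_vowel(ch):
    base = unicodedata.normalize('NFD', ch)[0]  # base letter comes first in NFD
    return VOWELS.contains(base)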

@grhoten
Member

grhoten commented Apr 22, 2024

> Looking good. Most common cases are covered, and can be extended. Examples x-in-law -> xs-in-law.

@macchiati I think your example is where the "x" can be replaced by a family relationship, like "brothers-in-law", right?

I've also seen "gluteus maximus" turning into "glutei maximi" for the plural form. That's Latin in origin, and it doesn't follow English rules. It's still used by doctors and gyms.

There can also be some discussion around whether the plural of "attorney general" is "attorneys general" or "attorney generals". I suspect that such a pluralization will be an American English versus British English discussion.

Member

@grhoten grhoten left a comment


I think this is a good start. It can probably be refined further as needed, if this is the direction that we're going.

@nciric nciric merged commit 77403cf into main Apr 22, 2024
@nciric nciric deleted the cira-code branch April 22, 2024 23:25
@nciric
Contributor Author

nciric commented Apr 22, 2024

> I think this is a good start. It can probably be refined further as needed, if this is the direction that we're going.

This is a proposal that has some promise. I would like to see other approaches before we proceed.

The Pynini library can do many more things than I exposed in this example. Also, I still need to see the size of the resulting grammar transition file (once the rules are built) - it could be prohibitively large for some languages.
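A sketch of how that size check could look in Python before involving C++ (names here are hypothetical):

import os
import pynini as p

grammar = plural_rules.optimize()
grammar.write('en_plurals.fst')
print(grammar.num_states(), 'states,',
      os.path.getsize('en_plurals.fst'), 'bytes on disk')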


@richgillam richgillam left a comment


Sorry I'm late to the party once again, although I don't think I have a lot to add to George's comments.

This looks cool. I'm not familiar with Pynini, but the gist is fairly clear and the approach makes sense. (My couple of nits aren't really important, of course, as this is a proof of concept.)

I'm guessing the idea here is that you'd use this tool to produce an FST that could be embedded into C code for performance, and I think that makes sense. I do have a couple of thoughts, though. I apologize if they're obvious to everyone else...

  1. Ideally, what you'd want to do, I think, is start with a large collection of words with all of their inflected forms and glean the transformation rules by examination. That is, rather than hand-crafting transformation rules as you've done here, you'd have some kind of tool that boils them down from looking at the patterns that actually occur in the lexicon and their frequency. Maybe you'd have to hand-craft some rules that the tool couldn't get by examination, but it seems like most of this could be derived by examination.
  2. If we think about it that way, what all of this basically amounts to is a compression algorithm-- conceptually, you're just looking up inflected forms in a lexicon, and using an FST is just a way of making that lexicon smaller and faster.

As I was writing item 1, I realized I'm sort of describing a deep-learning model, but I think I'm hoping for something simpler and more debuggable than that would get us, something more akin to actually compressing the lexicon data (if that makes any sense).
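A purely illustrative sketch of the idea in item 1 (not part of this PR): strip the longest common prefix from each (singular, plural) pair and count the resulting suffix rewrites, so frequent patterns surface as candidate rules.

from collections import Counter
from os.path import commonprefix

def suffix_rule(singular, plural):
    # Reduce a word pair to the suffix rewrite it exemplifies.
    k = len(commonprefix([singular, plural]))
    return singular[k:], plural[k:]

lexicon = [('city', 'cities'), ('puppy', 'puppies'),
           ('wolf', 'wolves'), ('day', 'days')]
rules = Counter(suffix_rule(s, pl) for s, pl in lexicon)
# Counter({('y', 'ies'): 2, ('f', 'ves'): 1, ('', 's'): 1})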

The other thing we'd have to think a lot about, but maybe later, is what to do when results are ambiguous-- is there a way to tell when the plural of "person" should be "people" and when it should be "persons," for example? Or do we return both, maybe tagged with probabilities?

As I said, this is probably all obvious to everybody reading this, and if so, I apologize...

tooth teeth
foot feet
mouse mice
person people


The plural of "person" can also be "persons" in some contexts...

foot feet
mouse mice
person people
penny pence


You only see "pence" in British English.

@macchiati
Member

macchiati commented Apr 23, 2024 via email

@nciric
Contributor Author

nciric commented Apr 24, 2024

> Sorry I'm late to the party once again, although I don't think I have a lot to add to George's comments.
>
> This looks cool. I'm not familiar with Pynini, but the gist is fairly clear and the approach makes sense. (My couple of nits aren't really important, of course, as this is a proof of concept.)
>
> I'm guessing the idea here is that you'd use this tool to produce an FST that could be embedded into C code for performance, and I think that makes sense. I do have a couple of thoughts, though. I apologize if they're obvious to everyone else...
>
> 1. Ideally, what you'd want to do, I think, is start with a large collection of words with all of their inflected forms and glean the transformation rules by examination. That is, rather than hand-crafting transformation rules as you've done here, you'd have some kind of tool that boils them down from looking at the patterns that actually occur in the lexicon and their frequency. Maybe you'd have to hand-craft some rules that the tool couldn't get by examination, but it seems like most of this could be derived by examination.
> 2. If we think about it that way, what all of this basically amounts to is a compression algorithm-- conceptually, you're just looking up inflected forms in a lexicon, and using an FST is just a way of making that lexicon smaller and faster.
>
> As I was writing item 1, I realized I'm sort of describing a deep-learning model, but I think I'm hoping for something simpler and more debuggable than that would get us, something more akin to actually compressing the lexicon data (if that makes any sense).
>
> The other thing we'd have to think a lot about, but maybe later, is what to do when results are ambiguous-- is there a way to tell when the plural of "person" should be "people" and when it should be "persons," for example? Or do we return both, maybe tagged with probabilities?
>
> As I said, this is probably all obvious to everybody reading this, and if so, I apologize...

There are unanswered questions with my approach - multiple results, for example. Also, I am not 100% sure we'll be able to avoid Python logic, so in addition to the FST table you'd need to replicate some of the code too (in this example we didn't have that problem).

I think rule generation can be automated to some degree using data patterns and light ML - for example, clustering for y -> ies, Greek/Latin words, etc.

As I mentioned before, I would like to learn of other approaches before we commit to any single one.

Another part I didn't do was generating the actual transition table and using it from the OpenFST C++ library to see how that works (both in size and ease of use).
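Staying in Python, a sketch of that round trip (the file and rule names are hypothetical; a C++ consumer would call fst::Fst::Read on the same file):

import pynini as p

plural_rules.write('en_plurals.fst')
reloaded = p.Fst.read('en_plurals.fst')
# Compose an input word with the rules and take the best path.
print(p.shortestpath(p.compose('puppy', reloaded)).string())  # puppies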
