-
-
Notifications
You must be signed in to change notification settings - Fork 16
Implementing group 3 noun rules for Serbian. #173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
inflection/src/inflection/grammar/synthesis/SrGrammarSynthesizer_SrDisplayFunction.cpp
Outdated
Show resolved
Hide resolved
static constexpr auto suffix_sg = ::std::to_array<::std::u16string_view>({u"а", u"е", u"и", u"у", u"а", u"ом", u"и"}); | ||
static constexpr auto suffix_pl = ::std::to_array<::std::u16string_view>({u"е", u"а", u"ама", u"е", u"е", u"ама", u"ама"}); | ||
|
||
::std::u16string base = lemma; | ||
// Remove trailing a and apply suffix. | ||
base.pop_back(); | ||
base = applySuffix(base, suffix_sg, suffix_pl, number, targetCase); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this kind of mapping, you may be inspired by Arabic, German or Italian. They convert a string to a numeric key (makeLookupKey) containing multiple grammemes, and they map the key to a string. This mapping is initialized in the constructor instead of at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the concern the runtime size increase (static constexpr)? If yes, I can remove the static (creating these arrays is cheap).
Otherwise the current approach looks simpler. I will look into refactoring this code as I add more cases, potentially implementing Arabic like approach.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was to make it more scalable, but this is fine too.
inflection/src/inflection/grammar/synthesis/SrGrammarSynthesizer_SrDisplayFunction.cpp
Outdated
Show resolved
Hide resolved
inflection/src/inflection/grammar/synthesis/SrGrammarSynthesizer_SrDisplayFunction.cpp
Outdated
Show resolved
Hide resolved
enum class Syllables { | ||
ONE_SYLLABLE, | ||
TWO_SYLLABLES, | ||
MULTI_SYLLABLES, | ||
}; | ||
Syllables countSyllables(const ::std::u16string& lemma) { | ||
static constexpr ::std::u16string_view vowels = u"аеиоуАЕИОУ"; | ||
static constexpr ::std::u16string_view consonants = u"бвгдђжзјклљмнњпстћфхцчџшБВГДЂЖЗЈКЛЉМНЊПСТЋФХЦЧЏШ"; | ||
|
||
uint16_t total = 0; | ||
size_t index = 0; | ||
const size_t length = lemma.length(); | ||
for (const char16_t ch: lemma) { | ||
if (vowels.find(ch) != ::std::string::npos) { | ||
++total; | ||
} | ||
// Check case where R is at the begining followed by a consonant. | ||
if ((ch == u'р' || ch == u'Р') && (index == 0 && index + 1 < length)) { | ||
if (consonants.find(lemma[index + 1]) != ::std::string::npos) { | ||
++total; | ||
} | ||
} else if ((ch == u'р' || ch == u'Р') && (index != 0 && index + 1 < length)) { | ||
if (consonants.find(lemma[index - 1]) != ::std::string::npos && consonants.find(lemma[index + 1]) != ::std::string::npos) { | ||
++total; | ||
} | ||
} | ||
++index; | ||
} | ||
|
||
if (total == 1) { | ||
return Syllables::ONE_SYLLABLE; | ||
} else if (total == 2) { | ||
return Syllables::TWO_SYLLABLES; | ||
} else { | ||
return Syllables::MULTI_SYLLABLES; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of this?
static bool isConsonant(char16_t ch) {
return morphun::lang::StringFilterUtil::CYRILLIC_SCRIPT().contains(ch) && !morphun::dictionary::PhraseProperties::DEFAULT_VOWELS_START().contains(ch);
}
enum class Syllables {
ONE_SYLLABLE,
TWO_SYLLABLES,
MULTI_SYLLABLES,
};
Syllables countSyllables(const ::std::u16string& word) {
int32_t total = 0;
if (word.length() > 2) {
std::u16string lowercase;
morphun::util::StringUtils::lowercase(&lowercase, word, ::morphun::util::LocaleUtils::SERBIAN());
size_t startIndex = 0;
size_t nextR;
const auto stringLengthWithVowelSuffix = lowercase.length() - 1; // Always enough room for a suffix letter
while ((nextR = lowercase.find(u'р', startIndex)) != std::string::npos && nextR < stringLengthWithVowelSuffix) {
++total; // +1 for r
if (isConsonant(lowercase[nextR + 1]) && (nextR == 0 || isConsonant(lowercase[nextR - 1]))) {
++total;
}
startIndex = nextR + 1;
}
}
if (total == 1) {
return Syllables::ONE_SYLLABLE;
} else if (total == 2) {
return Syllables::TWO_SYLLABLES;
} else {
return Syllables::MULTI_SYLLABLES;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code skips vowels (only counts Rs). I like the isConsonant, and I added isVowel function and used it in the code.
It looks simpler now.
PTAL
inflection/src/inflection/grammar/synthesis/SrGrammarSynthesizer_SrDisplayFunction.cpp
Outdated
Show resolved
Hide resolved
inflection/src/inflection/grammar/synthesis/SrGrammarSynthesizer_SrDisplayFunction.cpp
Outdated
Show resolved
Hide resolved
inflection/src/inflection/grammar/synthesis/SrGrammarSynthesizer_SrDisplayFunction.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look fine. Optional comments to consider where also provided.
@@ -87,9 +149,154 @@ ::inflection::dialog::DisplayValue * SrGrammarSynthesizer_SrDisplayFunction::get | |||
return nullptr; | |||
} | |||
if (dictionary.isKnownWord(displayString)) { | |||
displayString = inflectString(constraints, displayString); | |||
displayString = inflectFromDictionary(constraints, displayString); | |||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this scenario, you should check enableInflectionGuess
before using the rules.
<test><source case="genitive" number="plural" gender="feminine" pos="noun">конзерва</source><result>конзерви</result></test> | ||
<test><source case="genitive" number="plural" gender="feminine" pos="noun">гошћа</source><result>гошћа</result></test> | ||
<test><source case="genitive" number="plural" gender="feminine" pos="noun">двојка</source><result>двојака</result></test> | ||
<test><source case="genitive" number="plural" gender="feminine" pos="noun">битка</source><result>битака</result></test> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have a lot of fully fleshed out constraints. Most of the other languages only change specific grammemes. Sometimes you only specify the case, number or gender. The other tests usually specify less. The other languages usually default to noun. These tests are currently fine, but common usage starts from any surface form (ideally a unique surface form), and then you modify just the relevant grammemes.
static constexpr auto suffix_sg = ::std::to_array<::std::u16string_view>({u"а", u"е", u"и", u"у", u"а", u"ом", u"и"}); | ||
static constexpr auto suffix_pl = ::std::to_array<::std::u16string_view>({u"е", u"а", u"ама", u"е", u"е", u"ама", u"ама"}); | ||
|
||
::std::u16string base = lemma; | ||
// Remove trailing a and apply suffix. | ||
base.pop_back(); | ||
base = applySuffix(base, suffix_sg, suffix_pl, number, targetCase); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was to make it more scalable, but this is fine too.
Fully implements group 3 (all nouns ending with -a).
This removes a need for a large number of nouns to be added to Wikidata.
Resolves part of #172 .