Implementing group 3 noun rules for Serbian. #173

nciric · 2025-08-12T03:35:05Z

Fully implements group 3 (all nouns ending with -a).

This removes the need for a large number of nouns to be added to Wikidata.

Resolves part of #172.

grhoten · 2025-08-15T16:14:58Z

inflection/src/inflection/grammar/synthesis/SrGrammarSynthesizer_SrDisplayFunction.cpp

+    static constexpr auto suffix_sg = ::std::to_array<::std::u16string_view>({u"а", u"е", u"и", u"у", u"а", u"ом", u"и"});
+    static constexpr auto suffix_pl = ::std::to_array<::std::u16string_view>({u"е", u"а", u"ама", u"е", u"е", u"ама", u"ама"});
+
+    ::std::u16string base = lemma;
+    // Remove trailing a and apply suffix.
+    base.pop_back();
+    base = applySuffix(base, suffix_sg, suffix_pl, number, targetCase);


For this kind of mapping, you may be inspired by Arabic, German or Italian. They convert a string to a numeric key (makeLookupKey) containing multiple grammemes, and they map the key to a string. This mapping is initialized in the constructor instead of at runtime.

Is the concern the runtime size increase (static constexpr)? If yes, I can remove the static (creating these arrays is cheap).
Otherwise, the current approach looks simpler. I will look into refactoring this code as I add more cases, potentially implementing an Arabic-like approach.

WDYT?

It was to make it more scalable, but this is fine too.
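For readers unfamiliar with the lookup-key idea mentioned above, here is a minimal sketch of how it could look for the group 3 suffixes. It is not the code used by the Arabic, German or Italian synthesizers: the `Number` and `Case` enums, the `Group3Suffixes` class, and this particular `makeLookupKey` signature are hypothetical, and the case order (nominative through locative) is an assumption about how the suffix arrays in the diff are indexed.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <string_view>

// Hypothetical grammeme enums; the real project has its own constants.
enum class Number : uint8_t { Singular, Plural };
// Assumed order of the suffix arrays in the diff.
enum class Case : uint8_t { Nominative, Genitive, Dative, Accusative, Vocative, Instrumental, Locative };

// Pack several grammemes into a single numeric key.
constexpr uint32_t makeLookupKey(Number number, Case targetCase) {
    return (static_cast<uint32_t>(number) << 8) | static_cast<uint32_t>(targetCase);
}

class Group3Suffixes {
public:
    // The key-to-suffix table is filled once, when the object is constructed.
    Group3Suffixes() {
        static constexpr uint8_t kCaseCount = 7;
        static constexpr std::u16string_view singular[kCaseCount] = {u"а", u"е", u"и", u"у", u"а", u"ом", u"и"};
        static constexpr std::u16string_view plural[kCaseCount] = {u"е", u"а", u"ама", u"е", u"е", u"ама", u"ама"};
        for (uint8_t i = 0; i < kCaseCount; ++i) {
            table.emplace(makeLookupKey(Number::Singular, static_cast<Case>(i)), singular[i]);
            table.emplace(makeLookupKey(Number::Plural, static_cast<Case>(i)), plural[i]);
        }
    }

    // Drop the trailing -а from the lemma and append the suffix found for the key.
    std::u16string inflect(std::u16string_view lemma, Number number, Case targetCase) const {
        std::u16string base(lemma);
        if (base.empty()) {
            return base;
        }
        base.pop_back();
        const auto entry = table.find(makeLookupKey(number, targetCase));
        if (entry != table.end()) {
            base += entry->second;
        }
        return base;
    }

private:
    std::map<uint32_t, std::u16string_view> table;
};
```

Under those assumptions, `Group3Suffixes().inflect(u"жена", Number::Plural, Case::Dative)` yields `женама`. Adding another grammeme (for example gender) would mean widening the key rather than adding another array, which is the scalability argument made in the thread.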

grhoten · 2025-08-15T16:56:20Z

inflection/src/inflection/grammar/synthesis/SrGrammarSynthesizer_SrDisplayFunction.cpp

+enum class Syllables {
+    ONE_SYLLABLE,
+    TWO_SYLLABLES,
+    MULTI_SYLLABLES,
+};
+Syllables countSyllables(const ::std::u16string& lemma) {
+    static constexpr ::std::u16string_view vowels = u"аеиоуАЕИОУ";
+    static constexpr ::std::u16string_view consonants = u"бвгдђжзјклљмнњпстћфхцчџшБВГДЂЖЗЈКЛЉМНЊПСТЋФХЦЧЏШ";
+
+    uint16_t total = 0;
+    size_t index = 0;
+    const size_t length = lemma.length();
+    for (const char16_t ch: lemma) {
+        if (vowels.find(ch) != ::std::string::npos) {
+            ++total;
+        }
+        // Check case where R is at the beginning followed by a consonant.
+        if ((ch == u'р' || ch == u'Р') && (index == 0 && index + 1 < length)) {
+            if (consonants.find(lemma[index + 1]) != ::std::string::npos) {
+                ++total;
+            }
+        } else if ((ch == u'р' || ch == u'Р') && (index != 0 && index + 1 < length)) {
+            if (consonants.find(lemma[index - 1]) != ::std::string::npos && consonants.find(lemma[index + 1]) != ::std::string::npos) {
+                ++total;
+            }
+        }
+        ++index;
+    }
+
+    if (total == 1) {
+        return Syllables::ONE_SYLLABLE;
+    } else if (total == 2) {
+        return Syllables::TWO_SYLLABLES;
+    } else {
+        return Syllables::MULTI_SYLLABLES;
+    }
+}


What do you think of this?

static bool isConsonant(char16_t ch) {
    return morphun::lang::StringFilterUtil::CYRILLIC_SCRIPT().contains(ch)
        && !morphun::dictionary::PhraseProperties::DEFAULT_VOWELS_START().contains(ch);
}

enum class Syllables {
    ONE_SYLLABLE,
    TWO_SYLLABLES,
    MULTI_SYLLABLES,
};

Syllables countSyllables(const ::std::u16string& word) {
    int32_t total = 0;
    if (word.length() > 2) {
        std::u16string lowercase;
        morphun::util::StringUtils::lowercase(&lowercase, word, ::morphun::util::LocaleUtils::SERBIAN());
        size_t startIndex = 0;
        size_t nextR;
        const auto stringLengthWithVowelSuffix = lowercase.length() - 1; // Always enough room for a suffix letter
        while ((nextR = lowercase.find(u'р', startIndex)) != std::string::npos && nextR < stringLengthWithVowelSuffix) {
            ++total; // +1 for r
            if (isConsonant(lowercase[nextR + 1]) && (nextR == 0 || isConsonant(lowercase[nextR - 1]))) {
                ++total;
            }
            startIndex = nextR + 1;
        }
    }

    if (total == 1) {
        return Syllables::ONE_SYLLABLE;
    } else if (total == 2) {
        return Syllables::TWO_SYLLABLES;
    } else {
        return Syllables::MULTI_SYLLABLES;
    }
}

Your code skips vowels (it only counts Rs). I like isConsonant, so I added an isVowel function as well and used both in the code.
It looks simpler now.

PTAL
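The revised code is not shown in this thread, so the following is only a sketch of what a syllable counter built on `isVowel` and `isConsonant` helpers might look like. It uses plain standard-library character lists taken from the original diff instead of the library's script and vowel sets, so the helper bodies are assumptions, not the committed implementation.

```cpp
#include <cstddef>
#include <string_view>

enum class Syllables {
    ONE_SYLLABLE,
    TWO_SYLLABLES,
    MULTI_SYLLABLES,
};

// Character lists copied from the original diff; 'р' is deliberately in neither list.
static bool isVowel(char16_t ch) {
    static constexpr std::u16string_view vowels = u"аеиоуАЕИОУ";
    return vowels.find(ch) != std::u16string_view::npos;
}

static bool isConsonant(char16_t ch) {
    static constexpr std::u16string_view consonants =
        u"бвгдђжзјклљмнњпстћфхцчџшБВГДЂЖЗЈКЛЉМНЊПСТЋФХЦЧЏШ";
    return consonants.find(ch) != std::u16string_view::npos;
}

Syllables countSyllables(std::u16string_view word) {
    std::size_t total = 0;
    for (std::size_t index = 0; index < word.length(); ++index) {
        const char16_t ch = word[index];
        if (isVowel(ch)) {
            ++total;
            continue;
        }
        // 'р' is syllabic when followed by a consonant and either
        // word-initial or preceded by a consonant.
        if ((ch == u'р' || ch == u'Р') && index + 1 < word.length()) {
            const bool consonantBefore = (index == 0) || isConsonant(word[index - 1]);
            if (consonantBefore && isConsonant(word[index + 1])) {
                ++total;
            }
        }
    }

    if (total == 1) {
        return Syllables::ONE_SYLLABLE;
    }
    if (total == 2) {
        return Syllables::TWO_SYLLABLES;
    }
    // As in the original diff, a count of zero also falls through to this branch.
    return Syllables::MULTI_SYLLABLES;
}
```

With these lists, `прст` counts as one syllable (the `р` sits between two consonants), while the `р` in `конзерва` follows a vowel and is not counted, leaving three vowel syllables.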

grhoten

Changes look fine. Optional comments to consider were also provided.

grhoten · 2025-08-19T05:37:59Z

inflection/src/inflection/grammar/synthesis/SrGrammarSynthesizer_SrDisplayFunction.cpp

@@ -87,9 +149,154 @@ ::inflection::dialog::DisplayValue * SrGrammarSynthesizer_SrDisplayFunction::get
        return nullptr;
    }
    if (dictionary.isKnownWord(displayString)) {
-        displayString = inflectString(constraints, displayString);
+        displayString = inflectFromDictionary(constraints, displayString);
+    } else {


In this scenario, you should check enableInflectionGuess before using the rules.
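As an illustration of the suggested guard, the sketch below only falls back to rule-based inflection when guessing is enabled. Everything here is a hypothetical stand-in: `GuessingInflector`, its `dictionary` map, and the single hard-coded genitive-singular rule are not part of the project, and returning an empty optional when guessing is disabled is an assumption about the desired fallback.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the display function; not the project's API.
struct GuessingInflector {
    // Stand-in dictionary: lemma -> genitive singular form.
    std::map<std::u16string, std::u16string> dictionary;
    bool enableInflectionGuess = false;

    std::optional<std::u16string> genitiveSingular(const std::u16string& lemma) const {
        // Known words are always inflected from the dictionary.
        const auto entry = dictionary.find(lemma);
        if (entry != dictionary.end()) {
            return entry->second;
        }
        // Unknown words are only run through the rules when guessing is allowed.
        if (!enableInflectionGuess) {
            return std::nullopt;
        }
        // Toy rule standing in for the group 3 rules: -а becomes -е.
        if (lemma.empty() || lemma.back() != u'а') {
            return std::nullopt;
        }
        std::u16string guess = lemma;
        guess.back() = u'е';
        return guess;
    }
};
```

The point of the ordering is that a caller who disabled guessing never receives a rule-derived form for a word the dictionary does not know.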

grhoten · 2025-08-19T05:46:06Z

inflection/test/resources/inflection/dialog/inflection/sr.xml

+    <test><source case="genitive" number="plural" gender="feminine" pos="noun">конзерва</source><result>конзерви</result></test>
+    <test><source case="genitive" number="plural" gender="feminine" pos="noun">гошћа</source><result>гошћа</result></test>
+    <test><source case="genitive" number="plural" gender="feminine" pos="noun">двојка</source><result>двојака</result></test>
+    <test><source case="genitive" number="plural" gender="feminine" pos="noun">битка</source><result>битака</result></test>


You have a lot of fully fleshed out constraints. Most of the other languages only change specific grammemes. Sometimes you only specify the case, number or gender. The other tests usually specify less. The other languages usually default to noun. These tests are currently fine, but common usage starts from any surface form (ideally a unique surface form), and then you modify just the relevant grammemes.

grhoten · 2025-08-19T05:47:40Z

It was to make it more scalable, but this is fine too.

Implementing group 3 noun rules for Serbian.

64e1fd0

nciric requested a review from grhoten August 12, 2025 03:35

nciric added 4 commits August 12, 2025 04:09

Convert to_array to manuall initialization bcs MacOS.

097fda4

Using u16string_view to avoid allocation in constexpr

9be78bb

Fix spelling

1f37045

Replace regex with simple loop for perfomance reasons.

49295e3

grhoten reviewed Aug 15, 2025

nciric added 3 commits August 15, 2025 19:02

Use [[maybe_unused]] on unused parameters.

0a365cc

Remove uneccessary includes, optimize suffix handling code.

67ccfcc

Add isConsontant/Vowel functions and simplify code with them.

b857505

grhoten approved these changes Aug 19, 2025
