Implementing group 3 noun rules for Serbian. #173

nciric · 2025-08-12T03:35:05Z

Fully implements group 3 (all nouns ending with -a).

This removes the need for a large number of nouns to be added to Wikidata.

Resolves part of #172.

inflection/src/inflection/grammar/synthesis/SrGrammarSynthesizer_SrDisplayFunction.cpp

grhoten · 2025-08-15T16:14:58Z

+    static constexpr auto suffix_sg = ::std::to_array<::std::u16string_view>({u"а", u"е", u"и", u"у", u"а", u"ом", u"и"});
+    static constexpr auto suffix_pl = ::std::to_array<::std::u16string_view>({u"е", u"а", u"ама", u"е", u"е", u"ама", u"ама"});
+
+    ::std::u16string base = lemma;
+    // Remove trailing a and apply suffix.
+    base.pop_back();
+    base = applySuffix(base, suffix_sg, suffix_pl, number, targetCase);


For this kind of mapping, you may be inspired by Arabic, German or Italian. They convert a string to a numeric key (makeLookupKey) containing multiple grammemes, and they map the key to a string. This mapping is initialized in the constructor instead of at runtime.

Is the concern the runtime size increase (static constexpr)? If yes, I can remove the static (creating these arrays is cheap).
Otherwise, the current approach looks simpler. I will look into refactoring this code as I add more cases, potentially implementing an Arabic-like approach.

WDYT?
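For reference, the lookup-key idea suggested in this thread could look roughly like the sketch below. It is not the Arabic, German or Italian implementation from the repository. The `Number` and `Case` enums, the `Group3Suffixes` class, and this `makeLookupKey` signature are hypothetical names made up for illustration, and the case order is assumed to match the order of the suffix arrays in the diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <string_view>

// Hypothetical grammeme enums; the project's real types are not shown in this thread.
enum class Number : uint8_t { SINGULAR, PLURAL };
enum class Case : uint8_t { NOMINATIVE, GENITIVE, DATIVE, ACCUSATIVE, VOCATIVE, INSTRUMENTAL, LOCATIVE };

// Hypothetical helper: pack both grammemes into a single integer key.
constexpr uint32_t makeLookupKey(Number number, Case targetCase) {
    return (static_cast<uint32_t>(number) << 8) | static_cast<uint32_t>(targetCase);
}

class Group3Suffixes {
public:
    Group3Suffixes() {
        // Assumed case order, matching suffix_sg / suffix_pl in the diff.
        static constexpr Case cases[] = {Case::NOMINATIVE, Case::GENITIVE, Case::DATIVE, Case::ACCUSATIVE,
                                         Case::VOCATIVE, Case::INSTRUMENTAL, Case::LOCATIVE};
        static constexpr std::u16string_view sg[] = {u"а", u"е", u"и", u"у", u"а", u"ом", u"и"};
        static constexpr std::u16string_view pl[] = {u"е", u"а", u"ама", u"е", u"е", u"ама", u"ама"};
        // The key-to-suffix table is filled once, when the object is constructed.
        for (std::size_t i = 0; i < 7; ++i) {
            table[makeLookupKey(Number::SINGULAR, cases[i])] = sg[i];
            table[makeLookupKey(Number::PLURAL, cases[i])] = pl[i];
        }
    }

    // Drops the trailing -а and appends the suffix stored under the key.
    std::u16string inflect(std::u16string_view lemma, Number number, Case targetCase) const {
        const auto it = table.find(makeLookupKey(number, targetCase));
        if (lemma.empty() || it == table.end()) {
            return std::u16string(lemma);  // Nothing to do; return the lemma unchanged.
        }
        std::u16string base(lemma.substr(0, lemma.length() - 1));
        base += it->second;
        return base;
    }

private:
    std::map<uint32_t, std::u16string_view> table;
};
```

With this sketch, `Group3Suffixes().inflect(u"жена", Number::PLURAL, Case::DATIVE)` yields `женама`. Adding another grammeme (for example gender) would only mean widening the key, which is the main appeal over parallel arrays.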

inflection/src/inflection/grammar/synthesis/SrGrammarSynthesizer_SrDisplayFunction.cpp

grhoten · 2025-08-15T16:56:20Z

+enum class Syllables {
+    ONE_SYLLABLE,
+    TWO_SYLLABLES,
+    MULTI_SYLLABLES,
+};
+Syllables countSyllables(const ::std::u16string& lemma) {
+    static constexpr ::std::u16string_view vowels = u"аеиоуАЕИОУ";
+    static constexpr ::std::u16string_view consonants = u"бвгдђжзјклљмнњпстћфхцчџшБВГДЂЖЗЈКЛЉМНЊПСТЋФХЦЧЏШ";
+
+    uint16_t total = 0;
+    size_t index = 0;
+    const size_t length = lemma.length();
+    for (const char16_t ch: lemma) {
+        if (vowels.find(ch) != ::std::string::npos) {
+            ++total;
+        }
+        // Check the case where R is at the beginning, followed by a consonant.
+        if ((ch == u'р' || ch == u'Р') && (index == 0 && index + 1 < length)) {
+            if (consonants.find(lemma[index + 1]) != ::std::string::npos) {
+                ++total;
+            }
+        } else if ((ch == u'р' || ch == u'Р') && (index != 0 && index + 1 < length)) {
+            if (consonants.find(lemma[index - 1]) != ::std::string::npos && consonants.find(lemma[index + 1]) != ::std::string::npos) {
+                ++total;
+            }
+        }
+        ++index;
+    }
+
+    if (total == 1) {
+        return Syllables::ONE_SYLLABLE;
+    } else if (total == 2) {
+        return Syllables::TWO_SYLLABLES;
+    } else {
+        return Syllables::MULTI_SYLLABLES;
+    }
+}


What do you think of this?

static bool isConsonant(char16_t ch) {
    return morphun::lang::StringFilterUtil::CYRILLIC_SCRIPT().contains(ch)
        && !morphun::dictionary::PhraseProperties::DEFAULT_VOWELS_START().contains(ch);
}

enum class Syllables {
    ONE_SYLLABLE,
    TWO_SYLLABLES,
    MULTI_SYLLABLES,
};

Syllables countSyllables(const ::std::u16string& word) {
    int32_t total = 0;
    if (word.length() > 2) {
        std::u16string lowercase;
        morphun::util::StringUtils::lowercase(&lowercase, word, ::morphun::util::LocaleUtils::SERBIAN());
        size_t startIndex = 0;
        size_t nextR;
        const auto stringLengthWithVowelSuffix = lowercase.length() - 1; // Always enough room for a suffix letter
        while ((nextR = lowercase.find(u'р', startIndex)) != std::string::npos && nextR < stringLengthWithVowelSuffix) {
            ++total; // +1 for r
            if (isConsonant(lowercase[nextR + 1]) && (nextR == 0 || isConsonant(lowercase[nextR - 1]))) {
                ++total;
            }
            startIndex = nextR + 1;
        }
    }
    if (total == 1) {
        return Syllables::ONE_SYLLABLE;
    } else if (total == 2) {
        return Syllables::TWO_SYLLABLES;
    } else {
        return Syllables::MULTI_SYLLABLES;
    }
}

Your code skips vowels (it only counts Rs). I like isConsonant; I also added an isVowel function and used both in the code.
It looks simpler now.

PTAL
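For readers following along, the simplified version described above might look something like the sketch below. The updated code itself is not shown in this thread, so this is a guess at its shape: the bodies of `isVowel` and `isConsonant` are stand-ins built from the character lists in the original diff rather than the library sets from the suggestion, and the function takes a `std::u16string_view` for convenience.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

enum class Syllables { ONE_SYLLABLE, TWO_SYLLABLES, MULTI_SYLLABLES };

// Stand-in helper (assumption): vowel list copied from the original diff.
static bool isVowel(char16_t ch) {
    static constexpr std::u16string_view vowels = u"аеиоуАЕИОУ";
    return vowels.find(ch) != std::u16string_view::npos;
}

// Stand-in helper (assumption): consonant list copied from the original diff; it excludes р.
static bool isConsonant(char16_t ch) {
    static constexpr std::u16string_view consonants =
        u"бвгдђжзјклљмнњпстћфхцчџшБВГДЂЖЗЈКЛЉМНЊПСТЋФХЦЧЏШ";
    return consonants.find(ch) != std::u16string_view::npos;
}

Syllables countSyllables(std::u16string_view word) {
    std::size_t total = 0;
    const std::size_t length = word.length();
    for (std::size_t i = 0; i < length; ++i) {
        const char16_t ch = word[i];
        if (isVowel(ch)) {
            ++total;
        } else if ((ch == u'р' || ch == u'Р') && i + 1 < length && isConsonant(word[i + 1])
                   && (i == 0 || isConsonant(word[i - 1]))) {
            // Syllabic R: word-initial before a consonant, or between two consonants.
            ++total;
        }
    }
    // Same mapping as the original diff: anything other than 1 or 2 is treated as multi-syllable.
    if (total == 1) {
        return Syllables::ONE_SYLLABLE;
    } else if (total == 2) {
        return Syllables::TWO_SYLLABLES;
    }
    return Syllables::MULTI_SYLLABLES;
}
```

Under these assumptions, `прст` counts as one syllable (syllabic R between consonants), `рђа` and `жена` as two, and `планина` as multi-syllable.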

Implementing group 3 noun rules for Serbian.

64e1fd0

nciric requested a review from grhoten August 12, 2025 03:35

nciric added 4 commits August 12, 2025 04:09

Convert to_array to manual initialization because of macOS.

097fda4

Using u16string_view to avoid allocation in constexpr

9be78bb

Fix spelling

1f37045

Replace regex with simple loop for performance reasons.

49295e3

grhoten reviewed Aug 15, 2025

nciric added 3 commits August 15, 2025 19:02

Use [[maybe_unused]] on unused parameters.

0a365cc

Remove unnecessary includes, optimize suffix handling code.

67ccfcc

Add isConsonant/isVowel functions and simplify code with them.

b857505
