@grhoten grhoten commented Feb 23, 2025

Resolves #59

This switches the lexical data source to Wikidata. All of the tests pass. This uses git LFS to manage the lexical dictionary.

The generated lexical dictionary is about 2.4 MB. There are ways to trim it, such as reducing the size of the inflection tables. Some of the inflection tables are redundant because the Wikidata data is itself redundant. For example, some of the surface forms are just lowercase variants of the normal title-case forms. Why do they exist? I don't know. Marking them as Q65048529 (lowercase text) in Wikidata will allow them to be ignored, which in turn will allow their inflection tables to be merged with other inflection tables.

The lexical dictionary already handles case-insensitive matching with some case-sensitive interpretation: if the precise case doesn't exist, the lowercase text is matched instead. So the lowercase variants from Wikidata just get in the way without providing any useful meaning. So far, this oddity seems to be mostly unique to Danish.
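As a rough illustration of the two ideas above, here is a minimal Python sketch. It is not the code in this pull request. It assumes the standard Wikidata lexeme JSON layout (`forms`, `representations`, `grammaticalFeatures`), and the helper names `usable_forms` and `lookup`, the sample words, and the `FEATURE_A` placeholder are all hypothetical.

```python
# Wikidata item "lowercase text", as named in the description above.
LOWERCASE_TEXT = "Q65048529"


def usable_forms(lexeme, language="da"):
    """Yield (surface form, features) pairs, skipping forms marked as lowercase text.

    Hypothetical helper: assumes each form carries a "grammaticalFeatures" list
    and a "representations" mapping keyed by language code.
    """
    for form in lexeme.get("forms", []):
        features = form.get("grammaticalFeatures", [])
        if LOWERCASE_TEXT in features:
            continue  # redundant lowercase variant, ignore it
        representation = form.get("representations", {}).get(language)
        if representation is not None:
            yield representation["value"], tuple(sorted(features))


def lookup(dictionary, word):
    """Return the entry for the exact case if present, else for the lowercase text."""
    if word in dictionary:
        return dictionary[word]
    return dictionary.get(word.lower())


# Hypothetical sample data; "FEATURE_A" stands in for real grammatical feature IDs.
sample_lexeme = {
    "forms": [
        {
            "representations": {"da": {"language": "da", "value": "Jorden"}},
            "grammaticalFeatures": ["FEATURE_A"],
        },
        {
            "representations": {"da": {"language": "da", "value": "jorden"}},
            "grammaticalFeatures": ["FEATURE_A", LOWERCASE_TEXT],
        },
    ]
}

kept_forms = list(usable_forms(sample_lexeme))
```

With the sample above, only the title-case form is kept. A dictionary that stores just `hus` would still answer a query for `Hus` through the lowercase fallback in `lookup`.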

Here is the summary of the lexical data at the end of dictionary_da.lst. Some of this data can be filtered out to save space too.

==============================================
                       Source: wikidata-20250219-lexemes.json 
                  Lemma terms:   93434
         Unusable lemma terms:     750
       Incoming surface forms:  624556
                Surface forms:  575212
      Collapsed surface forms:   51905  (8.3%)
       Unusable surface forms:     556
                 Usable terms:  575212  (100%)
           Unclassified terms:       0    (0%)
==============================================
Aspect:
    perfective:            15202  (2.6%)

Case:
    nominative:           242827 (42.2%)
    genitive:             232096 (40.3%)

ComparisonDegree:
    positive:              23363  (4.1%)
    superlative:            4795  (0.8%)
    comparative:            2340  (0.4%)

Count:
    uncountable:            2915  (0.5%)

Definiteness:
    definite:             262745 (45.7%)
    indefinite:           253465 (44.1%)
    demonstrative:             6    (0%)

Gender:
    common:               378616 (65.8%)
    neuter:               128952 (22.4%)
    masculine:                 3    (0%)
    feminine:                  3    (0%)

Mood:
    imperative:             8075  (1.4%)

Number:
    singular:             292314 (50.8%)
    plural:               243228 (42.3%)

PartOfSpeech:
    noun:                 486748 (84.6%)
    verb:                  61310 (10.7%)
    adjective:             30496  (5.3%)
    proper-noun:            4509  (0.8%)
    adverb:                  570  (0.1%)
    numeral:                 282    (0%)
    adposition:               98    (0%)
    pronoun:                  96    (0%)
    conjunction:              69    (0%)
    interjection:             54    (0%)
    interrogative:            10    (0%)
    particle:                  6    (0%)
    article:                   4    (0%)

Tense:
    present:               23144    (4%)
    past:                  22932    (4%)

Transitivity:
    transitive:              710  (0.1%)

VerbType:
    participle:            15490  (2.7%)
    infinitive:            12855  (2.2%)

Voice:
    active:                23124    (4%)
    passive:               15128  (2.6%)

processed in 18.818 seconds
License: Creative Commons CC0 License (https://creativecommons.org/publicdomain/zero/1.0/)
generated with options: --language da --inflection-types noun,adjective --ignore-entries-with-grammemes abbreviation --ignore-property countable --ignore-property spelling --ignore-property oblique --ignore-property dative --ignore-property accusative

@nciric nciric self-requested a review February 24, 2025 18:12
@grhoten grhoten merged commit b994daf into unicode-org:main Feb 24, 2025
2 checks passed

Successfully merging this pull request may close these issues.

Integrate da Wikidata into Unicode Inflection