Inflection-59 Integrate da Wikidata into Unicode Inflection #81
Resolves #59
This switches the lexical data source to Wikidata. All of the tests pass. This uses Git LFS to manage the lexical dictionary.
The generated lexical dictionary is about 2.4 MB. There are ways to trim that a bit, like reducing the size of the inflection tables. Some of the inflection tables are redundant because the Wikidata data is redundant. For example, some of the surface forms are just lowercase variants of the normal title case forms. Why do they exist? I don't know. Marking them as Q65048529 (lowercase text) in Wikidata will allow them to be ignored, which in turn will allow the affected inflection tables to be merged with other inflection tables.

The lexical dictionary already handles case-insensitive matching with some case-sensitive interpretation: if the precise case doesn't exist, the lowercase text is matched instead. So the lowercase variants from Wikidata are just getting in the way without providing any useful meaning. So far, this oddity seems to be mostly unique to Danish.
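To make the redundancy concrete, here is a minimal Python sketch of the two ideas above: dropping surface forms that are only lowercase copies of another form, and a lookup that tries the exact case before falling back to lowercase. This is an illustration only, not the project's actual code; the function names `drop_redundant_lowercase` and `lookup`, the plain `dict` layout, and the sample entries are all assumptions made for the example.

```python
def drop_redundant_lowercase(forms):
    """Drop surface forms that are only a lowercase copy of another form.

    Hypothetical helper: `forms` is any iterable of surface-form strings.
    """
    forms = set(forms)
    return {
        f
        for f in forms
        # A form is redundant if it is all lowercase and some *other*
        # form in the same set lowercases to exactly this string.
        if not (f == f.lower() and any(g != f and g.lower() == f for g in forms))
    }


def lookup(dictionary, word):
    """Return the entry for the exact case if present, else the lowercase entry.

    Hypothetical helper: `dictionary` maps surface form -> entry.
    Returns None when neither spelling is present.
    """
    if word in dictionary:
        return dictionary[word]
    return dictionary.get(word.lower())


# Made-up sample data, for illustration only.
forms = drop_redundant_lowercase({"Danmark", "danmark", "hus"})
entries = {"hus": "noun"}
sentence_initial = lookup(entries, "Hus")  # falls back to the "hus" entry
```

With a fallback like this, a sentence-initial "Hus" still resolves to the "hus" entry, so a separate lowercase duplicate of a title case form adds size without adding information.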
Here is the summary of the lexical data at the end of dictionary_da.lst. Some of this data can be filtered out to save space too.