@grhoten grhoten commented Jan 24, 2025

Resolves #66

These changes reduce the failures for English, but they don't fully fix the issues yet. They are an improvement and a step in the right direction.

Comment on lines +508 to +522
int qVariantIdx = currentLemmaLanguage.indexOf(VARIANT_SEPARATOR);
if (qVariantIdx >= 0) {
    // The languages can have a weird Q entry after the desired language.
    // A spelling variant is informative. Most of the rest are irrelevant.
    var additionalCategory = currentLemmaLanguage.substring(qVariantIdx + VARIANT_SEPARATOR.length());
    currentLemmaLanguage = currentLemmaLanguage.substring(0, qVariantIdx);
    var variant = Grammar.getMappedGrammemes(additionalCategory);
    if (variant == null) {
        if (parserOptions.debug) {
            System.err.println("Line " + lineNumber + ": " + additionalCategory + " is not a known grammeme for the language variant " + lexeme.id + "(" + lemma.value + ")");
        }
        continue;
    }
    lemma.grammemes.addAll(variant);
}

This chunk helps with ignoring some of the variants. This might need to be extended further for the inflections.
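
A minimal, self-contained sketch of the splitting step described above. The separator value `-x-`, the `Q0001` category, and the `KNOWN_CATEGORIES` table are assumptions made for illustration; the real `VARIANT_SEPARATOR` value and the contents behind `Grammar.getMappedGrammemes` are not shown in this diff.

```java
import java.util.List;
import java.util.Map;

class VariantSplitSketch {
    // Hypothetical separator; the real VARIANT_SEPARATOR value is not shown in this diff.
    static final String VARIANT_SEPARATOR = "-x-";

    // Hypothetical stand-in for Grammar.getMappedGrammemes: category -> grammemes.
    static final Map<String, List<String>> KNOWN_CATEGORIES =
            Map.of("Q0001", List.of("spelling-variant"));

    // Result of splitting a language tag into its base language and variant grammemes.
    static final class Split {
        final String language;
        final List<String> grammemes;

        Split(String language, List<String> grammemes) {
            this.language = language;
            this.grammemes = grammemes;
        }
    }

    // Returns null when the suffix is not a known category, which corresponds
    // to the `continue` (skip this lemma) in the reviewed code.
    static Split split(String languageTag) {
        int idx = languageTag.indexOf(VARIANT_SEPARATOR);
        if (idx < 0) {
            // No variant suffix: the tag is already the plain language.
            return new Split(languageTag, List.of());
        }
        String category = languageTag.substring(idx + VARIANT_SEPARATOR.length());
        List<String> mapped = KNOWN_CATEGORIES.get(category);
        if (mapped == null) {
            return null;
        }
        return new Split(languageTag.substring(0, idx), mapped);
    }
}
```

With these assumed values, `split("en-x-Q0001")` yields the language `en` plus the mapped grammemes, while an unmapped suffix causes the entry to be skipped.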

Comment on lines +291 to +317
void setIgnoreProperty(String[] grammemes, Ignorable ignorable) {
    var ignorableSet = EnumSet.of(ignorable);
    for (String grammeme : grammemes) {
        if (grammeme.matches("Q\\d*")) {
            TYPEMAP.put(grammeme, ignorableSet);
        }
        else {
            for (Map.Entry<String, Set<? extends Enum<?>>> entry : TYPEMAP.entrySet()) {
                for (var grammemeEnum : entry.getValue()) {
                    String name = grammemeEnum.name();
                    if (name.equalsIgnoreCase(grammeme)) {
                        if (entry.getValue().size() == 1) {
                            entry.setValue(ignorableSet);
                        }
                        else {
                            entry.getValue().remove(grammemeEnum);
                            ArrayList<Enum<?>> clone = new ArrayList<>(entry.getValue());
                            clone.add(ignorable);
                            entry.setValue(new HashSet<>(clone));
                        }
                        break;
                    }
                }
            }
        }
    }
}

This fixes the ignorable properties and ignorable inflections, which tend to include a lot of irrelevant information.
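
The effect of this method on the type map can be illustrated with a reduced sketch. The enums `Count`, `Register`, and `Ignorable` and their constants are hypothetical stand-ins, and the map is passed as a parameter rather than held in a static field; the sketch only mirrors the two branches visible in the diff (a raw `Q` identifier versus a grammeme name).

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class IgnoreGrammemeSketch {
    // Hypothetical enums; the project's real grammeme enums are not shown in this diff.
    enum Count { SINGULAR, PLURAL }
    enum Register { INFORMAL }
    enum Ignorable { IGNORABLE_SKETCH }

    static void ignore(Map<String, Set<Enum<?>>> typeMap, String grammeme, Ignorable marker) {
        if (grammeme.matches("Q\\d*")) {
            // A raw Q identifier is mapped directly to the ignorable marker.
            Set<Enum<?>> only = new HashSet<>();
            only.add(marker);
            typeMap.put(grammeme, only);
            return;
        }
        // Otherwise, find every entry that carries the named grammeme and
        // swap that grammeme for the ignorable marker.
        for (Map.Entry<String, Set<Enum<?>>> entry : typeMap.entrySet()) {
            Set<Enum<?>> updated = new HashSet<>();
            boolean matched = false;
            for (Enum<?> value : entry.getValue()) {
                if (value.name().equalsIgnoreCase(grammeme)) {
                    matched = true;
                } else {
                    updated.add(value);
                }
            }
            if (matched) {
                updated.add(marker);
                // Replacing a value through the entry is not a structural
                // change, so it is safe while iterating over entrySet().
                entry.setValue(updated);
            }
        }
    }
}
```

Building a fresh set for each matching entry avoids removing elements from a set that is still being iterated, and it also works when the stored sets are immutable.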

}

- static final Map<SortedSet<String>, Set<? extends Enum<?>>> TYPEMAP = new HashMap<>(1021);
+ static final Map<String, Set<? extends Enum<?>>> TYPEMAP = new HashMap<>(1021);

I changed the key type from SortedSet<String> to String, since Wikidata doesn't include a key set. This is a simplification.
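
As a rough illustration of what the key change means for lookups, the sketch below contrasts the two map shapes. It assumes the old map was queried with a whole set of identifiers at once; the values are plain strings and the identifiers are placeholders, not entries from the real `TYPEMAP`.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class TypeMapKeySketch {
    // Old shape (assumed usage): the caller assembles a set of identifiers
    // and the whole set must match a stored key.
    static String lookupBySet(Map<SortedSet<String>, String> map, String... ids) {
        SortedSet<String> key = new TreeSet<>(List.of(ids));
        return map.get(key);
    }

    // New shape: each identifier is looked up on its own.
    static String lookupById(Map<String, String> map, String id) {
        return map.get(id);
    }
}
```

With a single-string key, an identifier resolves regardless of which other identifiers appear alongside it.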

@grhoten grhoten merged commit 18e060b into unicode-org:main Jan 24, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

Fix --ignore-entries-with-grammemes in dictionary-parser and improve language variant handling