Wikidata:Requests for permissions/Bot/DwdsBot
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 21:38, 2 November 2022 (UTC)[reply]
DwdsBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Gremid (talk • contribs • logs)
Task/s: The bot imports lexeme data (lemmas, forms, morphological information) from the Digitales Wörterbuch der deutschen Sprache (DWDS).
Code:
- MediaWiki Action API client (GitHub)
- DWDS/WikiData Import Routines (GitHub)
- Wikibase test instance with some sample imports (Hetzner Cloud)
Function details:
- The bot extracts information to be imported from the internal DWDS dataset.
- It pulls a dump of current lexemes in WikiData in order to import only lexemes that are not already in WikiData.
- It converts the DWDS information into WikiData's lexicographic data model and imports lexemes via the MediaWiki Action API.
In a first step, lexemes with their lemma, lexical category, grammatical gender (if applicable) and reference to the DWDS source are imported. Further information (inflected forms, IPA transliterations, morphological classifications etc.) could be imported later on. Permission for bot-based import of further data would be requested separately. --Gremid (talk) 08:52, 17 October 2022 (UTC)[reply]
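A minimal sketch of what the first import step might look like on the wire, assuming the standard `wbeditentity` module of the MediaWiki Action API with `new=lexeme`. The function names and the placeholder token are hypothetical and are not taken from the bot's repositories; Q188 (German) and Q1084 (noun) are the Wikidata items for the language and the lexical category. No request is actually sent.

```python
import json

GERMAN = "Q188"  # Wikidata item for the German language
NOUN = "Q1084"   # Wikidata item for the lexical category "noun"


def build_lexeme_payload(lemma, lexical_category,
                         language_item=GERMAN, language_code="de"):
    """Return entity data for a new lexeme that has a single lemma."""
    return {
        "type": "lexeme",
        "lemmas": {language_code: {"language": language_code, "value": lemma}},
        "language": language_item,
        "lexicalCategory": lexical_category,
    }


def build_request_params(payload, csrf_token):
    """Form parameters for a wbeditentity call (assumed shape, nothing is sent)."""
    return {
        "action": "wbeditentity",
        "new": "lexeme",
        "data": json.dumps(payload, ensure_ascii=False),
        "token": csrf_token,
        "bot": "1",
        "format": "json",
    }


payload = build_lexeme_payload("Astrallicht", NOUN)
params = build_request_params(payload, csrf_token="HYPOTHETICAL-TOKEN")
```

In a real run the parameters would be POSTed to the wiki's `api.php` endpoint with a CSRF token obtained after logging in; statements such as grammatical gender would go into an additional `claims` key of the payload.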
- Support The sample edits look good. What about also importing hyphenation information? Besides, I like your use of a test Wikibase instance and would like to document that for similar such cases in the future. Can you please put it under an open license, so that we can take and share screenshots of it? --Daniel Mietchen (talk) 23:25, 17 October 2022 (UTC)[reply]
- Thanks for your quick feedback and support! Hyphenation data is something (like IPA transliterations) that we could certainly add to the set of donated information. Regarding the test setup and licensing of it: You refer to the license of the Wiki content in the test instance? Gremid (talk) 07:25, 18 October 2022 (UTC)[reply]
- Support looks good, and I support the import, but I would like to see an example of a verb as well. Asaf Bartov (talk) 06:45, 18 October 2022 (UTC)[reply]
- Thanks! I imported 10 verbs into the test instance for your reference. Gremid (talk) 07:26, 18 October 2022 (UTC)[reply]
- Thank you. And where are the forms? Asaf Bartov (talk) 10:26, 18 October 2022 (UTC)[reply]
- Hi Asaf, as mentioned earlier, I propose to import lexemes with part-of-speech and gender classification as well as reference to the DWDS source database in a first run. In a second run, we would like to provide additional data, including inflected forms. Right now, as you can see on the DWDS site, we do not spell out inflection paradigms either. A colleague in our team with specific expertise in German morphology is working on a solution, where all inflected forms are generated automatically for a given lemma. The solution is based on a finite state transducer and covers about 98% of our lexical database at the moment. We are still working on full coverage and edge cases; a release is planned for the beginning of 2023. I take it from WikiData's statistics on lexicographic coverage that an inventory of all written representations is of particular value to the project? Gremid (talk) 10:12, 19 October 2022 (UTC)[reply]
- Yes, we do want all forms explicitly, even when completely regular and derivable from lemma.
- I could have sworn your request mentioned forms as well, but anyway, I understand the division into several phases.
- So it looks good, and as already stated above, I Support the import. Asaf Bartov (talk) 13:05, 20 October 2022 (UTC)[reply]
- (P.S. consider clicking your red user name and putting some information in your user page about your affiliation and main interest, with any relevant links etc. It would help people who don't know you get a sense about your context.) Asaf Bartov (talk) 13:06, 20 October 2022 (UTC)[reply]
- Looking at https://wb-sandbox.middell.net/wiki/Lexeme:L7:
- described by source (P1343) "Digitales Wörterbuch der deutschen Sprache" with the reference reference URL (P854) https://www.dwds.de/wb/Astrallicht should just be DWDS lemma ID (P9940) "Astrallicht" (it will link it automatically)
- For the reference on grammatical gender (P5185), you should also include retrieved (P813) with the date and DWDS lemma ID (P9940) "Astrallicht" (explained more verbosely at Help:Sources#Databases)
- When you say only lexemes that are not already in Wikidata, do you mean that if there's a lexeme with the same lemma in Wikidata, it will skip it regardless of whether the lexeme has a link to DWDS?
- What does it do with words which don't have a gender because they're only used as plurals (e.g. Eltern)? (They should have a statement instance of (P31) plurale tantum (Q138246))
- What about words where DWDS has multiple entries (e.g. Boot#1, Boot#2)? Will it create all of them, and will it include the numbers in the link?
- For https://wb-sandbox.middell.net/wiki/Item:Q19, which Wikidata item will you use? (We have both Digitales Wörterbuch der deutschen Sprache (Q1225026) for the project as a whole and DWDS-Wörterbuch (Q108696977) for the main dictionary - the latter seems like it would be more appropriate to me)
- I'm not sure about all of the parts of speech. Adjektiv, Adverb, Substantiv and Verb are fine (and that should cover most of them) but I'd like to look more closely at the others before doing a mass import of them.
- - Nikki (talk) 09:38, 18 October 2022 (UTC)[reply]
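The referencing suggested above can be sketched as follows, assuming the usual Wikibase JSON serialization of statements. The helper names are hypothetical; the property and item IDs (P31, P1343, P813, P9940, Q138246, Q108696977) are the ones named in this discussion, and the retrieval date is only an example value.

```python
from datetime import date

DWDS_DICTIONARY = "Q108696977"  # DWDS-Wörterbuch
PLURALE_TANTUM = "Q138246"


def snak(prop, datavalue):
    """Wrap a datavalue in a value snak for the given property."""
    return {"snaktype": "value", "property": prop, "datavalue": datavalue}


def item_value(item_id):
    """Datavalue pointing at a Wikidata item."""
    return {"type": "wikibase-entityid",
            "value": {"entity-type": "item", "id": item_id}}


def dwds_reference(lemma_id, retrieved):
    """Reference made of DWDS lemma ID (P9940) and retrieved (P813)."""
    time_value = {
        "time": f"+{retrieved.isoformat()}T00:00:00Z",
        "timezone": 0, "before": 0, "after": 0,
        "precision": 11,  # day precision
        "calendarmodel": "http://www.wikidata.org/entity/Q1985727",
    }
    return {"snaks": {
        "P9940": [snak("P9940", {"type": "string", "value": lemma_id})],
        "P813": [snak("P813", {"type": "time", "value": time_value})],
    }}


def referenced_statement(prop, item_id, lemma_id, retrieved):
    """Statement with an item value and a single DWDS reference."""
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": snak(prop, item_value(item_id)),
        "references": [dwds_reference(lemma_id, retrieved)],
    }


# "Eltern" has no gender, so it gets instance of (P31) plurale tantum instead.
eltern_claim = referenced_statement("P31", PLURALE_TANTUM, "Eltern",
                                    date(2022, 10, 19))
# described by source (P1343) pointing at the dictionary item.
source_claim = referenced_statement("P1343", DWDS_DICTIONARY, "Astrallicht",
                                    date(2022, 10, 19))
```

A grammatical gender (P5185) statement would be built the same way, passing the item for the respective gender as `item_id`.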
- Thank you for your feedback, Nikki! I adjusted the import according to your suggestions, cleared the test instance and imported some samples again (including verbs and plurale tantum nouns).
- The DWDS lemma ID (P9940) is used now. In the test instance, the external identifier is not resolved on the UI level, but I assume this has to do with the rudimentary setup of my Wikibase installation, not with the imported data.
- References to the source database have been added for statements about grammatical gender and plurale tantum classification.
- Currently I skip all DWDS dictionary entries which refer to homographs (i.e. the homographs for Boot) in the import routine. This amounts to 2,998 entries at the moment, which we could discuss later as I assume the mapping of homographs to WikiData's lexicographic data model might not be trivial.
- Currently I also skip any DWDS dictionary entry whose lemma equals a lemma in the current lexeme dump. This amounts to 9,336 entries being excluded from the import.
- What would be left to be imported then, broken down by part-of-speech: nouns (153,743 lexemes), adjectives (21,898 lexemes), verbs (18,405 lexemes), adverbs (1,119 lexemes), interjections (197 lexemes), numerals (82 lexemes), conjunctions (52 lexemes), demonstrative pronouns (45 lexemes), prepositions (44 lexemes), pronouns (24 lexemes), possessive pronouns (22 lexemes), and relative pronouns (11 lexemes). If we are unsure about all parts of speech, we could start with the most frequent ones, working our way to the less frequent classes. What do you think? Gremid (talk) 09:57, 19 October 2022 (UTC)[reply]
- Forgot to mention: I decided to use DWDS-Wörterbuch (Q108696977) as the source database. Gremid (talk) 09:59, 19 October 2022 (UTC)[reply]
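The two skip rules described above (homographs, and lemmas already present in the lexeme dump) could look roughly like this. This is an illustrative sketch, not the bot's actual code: it assumes that homographs show up as repeated lemmas in the list of DWDS entries, and the function name and sample data are made up.

```python
from collections import Counter


def select_importable(dwds_lemmas, wikidata_lemmas):
    """Split DWDS lemmas into importable ones and counts of skipped ones."""
    counts = Counter(dwds_lemmas)
    existing = set(wikidata_lemmas)
    selected = []
    skipped = {"homograph": 0, "existing": 0}
    for lemma in dwds_lemmas:
        if counts[lemma] > 1:
            # More than one DWDS entry shares this lemma (e.g. Boot#1, Boot#2).
            skipped["homograph"] += 1
        elif lemma in existing:
            # A lexeme with the same lemma is already in the dump.
            skipped["existing"] += 1
        else:
            selected.append(lemma)
    return selected, skipped


selected, skipped = select_importable(
    ["Boot", "Boot", "Astrallicht", "Eltern", "dass"],
    ["dass"],
)
```

Comparing by lemma alone is deliberately conservative: it may skip a DWDS entry whose Wikidata namesake is a different word, but it avoids creating duplicates.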
- Thanks! That sounds good and looks much better now.
- A few more things that I noticed while looking at the DWDS lemmas:
- The DWDS lemmas have ’ where Wikidata has ' - are you matching those when checking for existing lexemes?
- How does it handle the old spellings (like daß vs dass)? (It's fine to ignore them, but we can add a spelling variant with the language code de-1901 for those (I've only been doing it for forms, like on dass (L248793)), so we wouldn't want separate lexemes for them)
- I think it would be a good idea to skip lemmas without any letters in them, since we haven't really worked out how to model symbols yet.
- Looking at the list here, the ones that I'm happy with are: Adjektiv, Adverb, Eigenname, Interjektion, Konjunktion, partizipiales Adjektiv, partizipiales Adverb, Präposition, Präposition + Artikel, Substantiv, Verb
- I'm not sure about:
- Affix: the current ones are using prefix (Q134830) or suffix (Q102047)
- Partikel: the current ones are using grammatical particle (Q184943)
- Pronominaladverb: the current ones are using adverb (Q380057)
- Things I would skip for now:
- Imperativ, Komparativ, Superlativ: because these are normally forms of a verb or adjective
- Bruchzahl, Kardinalzahl, Ordinalzahl: because they overlap with nouns and adjectives
- Mehrwortausdruck: we've been trying to use lexical categories which describe how a lexeme behaves, so this would correspond to multiple lexical categories
- bestimmter Artikel and all of the Pronomen categories: these are categories where there can be multiple lemmas for something we have as a single lexeme (e.g. we have das (L59500) containing all the gender/case/number forms already)
- (The IDs not being links in your Wikibase does seem to be a configuration thing: https://www.mediawiki.org/wiki/Wikibase/Installation/Advanced_configuration#Define_links_for_external_identifiers describes how to change it, if you want to. You don't have to though - we know it works here :))
- - Nikki (talk) 04:25, 26 October 2022 (UTC)[reply]
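One way the checks raised in this comment could be expressed, as a sketch only: apostrophes are normalized before comparing with existing lemmas, lemmas without any letters are skipped, and only the lexical categories listed above as unproblematic are admitted. The function names and sample values are hypothetical.

```python
ALLOWED_CATEGORIES = frozenset({
    "Adjektiv", "Adverb", "Eigenname", "Interjektion", "Konjunktion",
    "partizipiales Adjektiv", "partizipiales Adverb", "Präposition",
    "Präposition + Artikel", "Substantiv", "Verb",
})


def normalize_apostrophes(lemma):
    """Map the typographic apostrophe (U+2019) to the ASCII one."""
    return lemma.replace("\u2019", "'")


def has_letter(lemma):
    """True if at least one character of the lemma is a letter."""
    return any(ch.isalpha() for ch in lemma)


def is_candidate(lemma, category, existing_lemmas):
    """Decide whether a DWDS entry should be considered for import."""
    return (
        has_letter(lemma)
        and category in ALLOWED_CATEGORIES
        and normalize_apostrophes(lemma) not in existing_lemmas
    )


existing = {"geht's"}
results = [
    is_candidate("Haus", "Substantiv", existing),    # accepted
    is_candidate("geht\u2019s", "Verb", existing),   # matches existing lemma
    is_candidate("%", "Substantiv", existing),       # no letters
    is_candidate("Hälfte", "Bruchzahl", existing),   # category not allowed
]
```

Skipping every lemma that contains a typographic apostrophe is a simpler alternative to normalizing, at the cost of leaving those entries for a later run.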
- This makes sense to me, too.
- I do think we can start with the missing lexemes and only later figure out a second ingestion/enrichment of existing lexemes. Asaf Bartov (talk) 15:19, 26 October 2022 (UTC)[reply]
- Thanks for the extensive and thorough review, Nikki!
- For now, I leave out any lexemes that do not contain letters or that contain apostrophes in a different encoding.
- I exclude any variant spelling of lexemes from the import for now. This includes older variants (de-1901) as well as current ones (e.g. Foto vs Photo).
- I limited the set of lexemes by lexical category according to your suggestions, see the source code for reference.
- I will continue with a test run targeting wikidata.org tomorrow and would be glad if you could review that one as well. Gremid (talk) 20:57, 26 October 2022 (UTC)[reply]
- I did a test run against the main site, importing 10 lexemes. @Nikki Could you give your formal consent/support to this request now, so I can get the bot approved and start the bulk import? Thanks in advance, Gremid (talk) 11:43, 29 October 2022 (UTC)[reply]
- Support Excellent :) - Nikki (talk) 12:53, 1 November 2022 (UTC)[reply]
- I am going to approve the bot in a couple of days provided no objections have been raised.--Ymblanter (talk) 19:46, 31 October 2022 (UTC)[reply]
- @Ymblanter: Thanks! Could you please be more specific about the remaining time until the bot is approved? The request has been pending for 2 weeks now; during that time I politely inquired multiple times on the relevant Telegram channel, whether there are objections to this request and its associated data import. While I do not want to appear impatient and am grateful for the feedback thus far, any general objection to this request would surprise me by now. Am I missing a communication channel via which I should advertise this request more broadly? Gremid (talk) 20:59, 31 October 2022 (UTC)[reply]
- You do not need to advertise the request, but here, on this very page, there were critical comments less than a week ago which you only answered two days ago. Ymblanter (talk) 21:06, 31 October 2022 (UTC)[reply]
- Thanks for the clarification; then I will wait for further critical comments, on this very page. Gremid (talk) 10:24, 1 November 2022 (UTC)[reply]
- I'm happy with the request. So far they've been quite careful about avoiding duplicates, have been responsive to feedback and have been more than willing to adjust their code accordingly, so I think that if any issues do appear later, they will fix them. - Nikki (talk) 13:07, 1 November 2022 (UTC)[reply]