User:John Cummings/Archive/RFC
Note: please discuss on the Discussion page
What is a data schema?
A data schema is a set of rules that define the structure of the data stored in a database. In the context of Wikidata editing, data schemas can provide a standardised structure for data on a subject area. For example, all the items on museums would use the same structure to describe basic facts about them, such as location, collection type and date opened. There are many existing schemas available online which apply to different kinds of data about the world, e.g. schema.org.
Wikidata Schemas have recently been released on Wikidata. They provide the technical means for recording the schemas for different subject areas, but they require advanced technical knowledge to create and currently have limited documentation. They also do not include any kind of process for discussion or consensus for creating schemas.
Problem statement
There is currently no place for the community to propose, or "agree" on, the correct model for specific types of items (e.g. a train, plant, human author, human astronaut etc.). Now that Wikidata Schemas have been implemented, we have a place that can represent the model at a technical level (e.g. EntitySchema:E10 for a human). But without a community-driven discussion and creation process, the models will not be used as the "consensus". This inevitably leads to all of the issues associated with data inconsistency, which Wikidata editors and third-party re-users need to find a way to work around. A few of the main issues are listed below:
Difficult to query
Inconsistent data modelling means that data is much harder to find, add and query. For example, the location for museums is modelled in at least three ways on Wikidata:
These were the ones found, but there may be others. This variation in the way different items record the same data means it is very difficult to work with the data, to trust that a query is returning all the information Wikidata has on that subject, and to reuse data from Wikidata on other Wikimedia projects.
Repeated mistakes
There is also no way to collate subject-specific knowledge on how data of a given kind should be modelled accurately, so editors cannot see how to model it correctly. For example, many World Heritage sites include multiple buildings, and some include hundreds of listed buildings. Different editors have repeatedly added 'Heritage status = World Heritage site' to all the buildings inside a World Heritage site, which is incorrect. This leads to queries on World Heritage sites being completely wrong, showing hundreds of World Heritage sites in a single city.
Impedes third party re-use
There is a general lack of confidence that data will remain intact after a donation from a third party, or when simply reusing the data in another application. As there is no place to put a stamp of approval on a particular way to model something, it is much more difficult to have faith that it will not change (or that you even know where to look if you find it is changing).
Benefits of having community agreed data schemas on Wikidata
Some of the major benefits of a more unified approach with community agreed schemas are:
Data quality
Usability
Community growth and health
Existing work on data schemas on Wikidata
A central place to discuss and create data schemas collaboratively
A central discussion area, similar to Wikidata:Property proposal, but for proposing and collaborating to develop Wikidata Schemas. 'Wikidata:Schema Requests' could use FormWizard and VisualEditor to lower the barriers to participation. This central proposal area should allow subject experts to develop schemas without requiring a deep level of technical knowledge. The schemas, once agreed, would then be recorded in Shape Expressions by people who have the technical knowledge.
After a new Wikidata Schema has been created, the "proposal discussion" would be linked on its talk page (or transcluded into the page). All future discussion about the model would continue on the talk page, with the original proposal being archived.
Note: There are many ways in which the community could present the model that they have all agreed to. For example, we could use wiki tables or templates to show lists of expected Wikidata statements, or simply a list of bullet points initially. If the proposal here is agreed, there can be plenty of subsequent discussion about the best way to communicate the human-generated plan.
Recording agreed data schemas
Models decided by the community would be recorded as new Wikidata Schemas (e.g. E10 for human) by editors who know how to write Shape Expressions. The schema would then be linked to the corresponding Wikidata item for that class by a statement on the Wikidata item (e.g. human (Q5) --> Wikidata Schema --> E10).
Note: The required property has already been proposed, but is on hold waiting for this Phabricator ticket, which will allow Wikidata Schemas to be used in statements.
We ultimately need some kind of link between individual instances and the Wikidata Schema that should be used to describe them. However, this would have to take the form of user interface enhancements implemented in the future. For example, when someone is editing Douglas Adams (Q42), it would make perfect sense to prompt the user that they should be using the schema for a "human", or to offer a button to check what is missing or complete. This is just an example solution, but it is included here to emphasise the end goal of Wikidata Schemas informing editors about community agreed models.
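The "check what is missing or complete" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the schema is reduced to a plain mapping of expected property IDs (real Wikidata Schemas are written in Shape Expressions and can express much more), the property list for "human" is hypothetical rather than taken from E10, and the item data is hand-written instead of being fetched from Wikidata. The names HUMAN_SCHEMA and check_item are invented for this example.

```python
# Hypothetical, simplified stand-in for a community agreed schema.
# Real Wikidata Schemas are written in Shape Expressions; this mapping of
# expected properties is an assumption for illustration, not the content of E10.
HUMAN_SCHEMA = {
    "P31": "instance of",
    "P569": "date of birth",
    "P19": "place of birth",
}


def check_item(statements, schema):
    """Split the schema's expected properties into present and missing.

    `statements` maps property IDs to the list of values found on the item.
    """
    present = sorted(p for p in schema if statements.get(p))
    missing = sorted(p for p in schema if not statements.get(p))
    return present, missing


# Hand-written example data, not fetched from Wikidata.
example_item = {
    "P31": ["Q5"],
    "P569": ["1952-03-11"],
}

present, missing = check_item(example_item, HUMAN_SCHEMA)
for prop in missing:
    print(f"Missing: {prop} ({HUMAN_SCHEMA[prop]})")
```

A real implementation would more likely validate the item against the Shape Expressions schema itself; the sketch only shows the kind of feedback an editor could be given.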
Ways of finding and exploring Wikidata data schemas
The following are some suggested methods to encourage greater use and discoverability of Wikidata Schemas:
Outstanding questions