Linking to Linguistic Data Categories in ISOcat
Abstract ISO Technical Committee 37, Terminology and other language and
content resources, established an ISO 12620:2009 based Data Category Registry
(DCR), called ISOcat (see http://www.isocat.org), to foster semantic interoperability of linguistic resources. However, this goal can only be met if the data
categories are reused by a wide variety of linguistic resource types. A resource indicates its usage of data categories by linking to them. The small DC Reference XML
vocabulary is used to embed links to data categories in XML documents. The link is
established by a URI, which serves as the Persistent IDentifier (PID) of a data category. This paper discusses the efforts to mimic the same approach for RDF-based
resources. It also introduces the RDF quad store based Relation Registry RELcat,
which enables ontological relationships between data categories not supported by
ISOcat and thus adds an extra level of linguistic knowledge.
1 Introduction
ISO Technical Committee 37, Terminology and other language and content resources,
established a Data Category Registry (DCR), called ISOcat, to foster semantic interoperability of linguistic resources. ISOcat is based on ISO 12620:2009, which
describes the data model and the management procedures for a DCR (ISO 12620,
2009). These procedures follow a grassroots approach, which means that any linguist can add the data categories (s)he needs to the registry. Standardized subsets of
these data categories are created by a standardization procedure involving groups of
international experts who are members of various Thematic Domain Groups (TDGs)
and the DCR Board. There are currently over a dozen domains supported by a TDG,
e.g., metadata, morphosyntax and terminology. But the aim of improving semantic interoperability can only be met if the data categories are reused by
a multitude of linguistic resource types (Kemps-Snijders et al., 2008). A resource
indicates its usage of data categories by linking to them (Windhouwer et al., 2010).
This paper focuses on how this can be done, and gives special attention to linked
open data, i.e., RDF-based, resources.
Menzo Windhouwer
Max Planck Institute for Psycholinguistics, Wundtlaan 1, 6525 XD Nijmegen, The Netherlands,
e-mail: Menzo.Windhouwer@mpi.nl
Sue Ellen Wright
Kent State University, 109 Satterfield Hall, Kent, OH 44242, USA, e-mail: sellenwright@gmail.com
C. Chiarcos et al. (eds.), Linked Data in Linguistics,
DOI 10.1007/978-3-642-28249-2_10, Springer-Verlag Berlin Heidelberg 2012
The DC Reference XML vocabulary defines the descriptors both as XML attributes and as XML
elements. The specific structure of the annotated XML-based resource determines whether
the attribute or the element should be used.
PIDs. ISO TC 37 has recently published a new standard, PISA (Persistent Identification and Sustainable Access, ISO 24619:2011), which describes the requirements
to be met by these PID systems.
Due to space limitations the common ISOcat cool URI prefix http://www.isocat.org/datcat
has been replaced by ellipses.
<LexicalResource xmlns:dcr="http://www.isocat.org/ns/dcr">
  <GlobalInformation>
    <feat att="languageCoding" dcr:datcat=".../DC-2008"
          val="ISO 639-3"/>
  </GlobalInformation>
  <Lexicon>
    <feat att="language" dcr:datcat=".../DC-1969" val="eng"/>
    <LexicalEntry>
      <feat att="partOfSpeech" dcr:datcat=".../DC-1345"
            val="commonNoun" dcr:valueDatcat=".../DC-1256"/>
      <Lemma>
        <feat att="writtenForm" dcr:datcat=".../DC-1836"
              val="clergyman"/>
      </Lemma>
      ...
      <WordForm>
        <feat att="writtenForm" dcr:datcat=".../DC-1836"
              val="clergymen"/>
        <feat att="grammaticalNumber" dcr:datcat=".../DC-1298"
              val="plural" dcr:valueDatcat=".../DC-1354"/>
      </WordForm>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>
The example doesn't show the use of container data categories, as this is a recent addition to the DCR data model not even covered by ISO 12620:2009. For
the LMF core model and its extensions these container data categories have not
been specified yet. However, it does show an open data category, i.e., /writtenForm/
(http://www.isocat.org/datcat/DC-1836), and a simple data category,
i.e., /commonNoun/ (http://www.isocat.org/datcat/DC-1256), which
is an instance of the value domain of a closed data category, i.e., /partOfSpeech/
(http://www.isocat.org/datcat/DC-1345).
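To illustrate how a consumer of such a resource might follow these links, the following minimal Python sketch collects the data category references from a fragment of the LMF example. It relies only on the standard library. The function names expand and datcat_links are hypothetical helpers introduced here, and the expansion of the ellipsis assumes the common prefix http://www.isocat.org/datcat mentioned in the footnote.

```python
import xml.etree.ElementTree as ET

# Namespace of the DC Reference XML vocabulary, as declared in the example.
DCR_NS = "http://www.isocat.org/ns/dcr"
# Common ISOcat prefix that the paper abbreviates to "..." in its listings.
ISOCAT_PREFIX = "http://www.isocat.org/datcat"

# A fragment of the LMF example (elided parts are simply left out).
SAMPLE = """
<LexicalResource xmlns:dcr="http://www.isocat.org/ns/dcr">
  <Lexicon>
    <LexicalEntry>
      <feat att="partOfSpeech" dcr:datcat=".../DC-1345"
            val="commonNoun" dcr:valueDatcat=".../DC-1256"/>
      <Lemma>
        <feat att="writtenForm" dcr:datcat=".../DC-1836" val="clergyman"/>
      </Lemma>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>
"""


def expand(link):
    """Expand an abbreviated '.../DC-n' link to a full PID (hypothetical helper)."""
    if link is None:
        return None
    if link.startswith(".../"):
        return ISOCAT_PREFIX + link[3:]
    return link


def datcat_links(xml_text):
    """Return (att, datcat PID, val, value datcat PID) for every feat element."""
    root = ET.fromstring(xml_text)
    rows = []
    for feat in root.iter("feat"):
        rows.append((
            feat.get("att"),
            expand(feat.get(f"{{{DCR_NS}}}datcat")),
            feat.get("val"),
            expand(feat.get(f"{{{DCR_NS}}}valueDatcat")),
        ))
    return rows
```

Calling datcat_links(SAMPLE) yields one tuple per feat element. The open data category /writtenForm/ has no value data category, so the last slot of its tuple is None, whereas /partOfSpeech/ carries a link for both the attribute and its value.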
This is an unwanted side effect and is prevented by specifying a dedicated annotation property. Once more the RDF model builder can fine-tune this. Depending on
the actual RDF type of the annotated RDF resource the dcr:datcat predicate
can be replaced by the following OWL (2) predicates: owl:equivalentClass
for classes, owl:equivalentProperty for properties and owl:sameAs for
individuals. The use of these specific predicates limits the impact of ISOcat data
categories on OWL semantics.
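The choice of predicate described above can be sketched as a small lookup. This is only an illustration under stated assumptions: the kind labels "class", "property" and "individual", the helper names link_predicate and link_statement, and the ex: prefix are hypothetical, and the statement is emitted with prefixed names rather than full predicate URIs.

```python
# Mapping from the RDF type of the annotated resource to the OWL predicate
# named in the text. The keys are hypothetical labels chosen for this sketch.
PREDICATE_BY_KIND = {
    "class": "owl:equivalentClass",
    "property": "owl:equivalentProperty",
    "individual": "owl:sameAs",
}


def link_predicate(kind=None):
    """Pick the linking predicate; fall back to the generic dcr:datcat."""
    return PREDICATE_BY_KIND.get(kind, "dcr:datcat")


def link_statement(resource, datcat_pid, kind=None):
    """Build a Turtle-style statement linking a resource to a data category PID."""
    return f"{resource} {link_predicate(kind)} <{datcat_pid}> ."
```

For example, link_statement("ex:CommonNoun", "http://www.isocat.org/datcat/DC-1256", "class") links a hypothetical class ex:CommonNoun to /commonNoun/ with owl:equivalentClass, while omitting the kind keeps the generic dcr:datcat annotation.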
4 Ontological Relationships
ISOcat basically contains a flat list of data categories, i.e., it doesn't store (ontological) relationships between container and/or complex data categories. In addition to
value domain relationships between simple and closed data categories, only a subsumption hierarchy between simple data categories is stored, and only one such
subsumption hierarchy is allowed, i.e., a simple data category can only be a child of
one other data category. The storage of these ontological relationships in ISOcat is
due to legacy issues and their usage is actually discouraged.
The reason that ontological relationships aren't stored in ISOcat is that they are
highly domain or even application dependent and thus would hamper standardization of data category specifications. However, they are important to make the semantics of linguistic resources explicit. To support this, a companion registry to
ISOcat named RELcat is under construction (Schuurman and Windhouwer, 2011).
In RELcat anyone or any group can store (ontological) relationships between data
categories and/or concepts from other registries.
@prefix relcat: <http://www.isocat.org/relcat/set/> .
@prefix rel:    <http://www.isocat.org/relcat/relations#> .
@prefix dc:     <http://purl.org/dc/elements/1.1/> .
@prefix isocat: <http://www.isocat.org/datcat/> .
relcat:cmdi {
  isocat:DC-2573 rel:sameAs     dc:identifier .
  isocat:DC-2482 rel:sameAs     dc:language .
  ...
  isocat:DC-2556 rel:subClassOf dc:contributor .
  isocat:DC-2502 rel:subClassOf dc:coverage .
}
i. sub class of (a transitive relationship and the inverse of the super class
of relationship)
ii. part of (a transitive relationship and the inverse of the has part relationship)
A. direct part of (the inverse of the has direct part relationship)
Although inspired by OWL and SKOS these relationship types may seem to be an
impoverished set. But they are already an extension to the original purpose of RELcat, which mainly dealt with (almost) same-as relationships. However, this shallow
taxonomy is just a first start. Other relationship types from other richer vocabularies,
e.g., complete OWL or SKOS, can be inserted at the proper place in this subsumption hierarchy:
1. related
a. same as (a symmetric and transitive relationship)
i. owl:equivalentClass
ii. owl:equivalentProperty
iii. owl:sameAs
iv. skos:exactMatch
b. almost same as (a symmetric relationship)
i. skos:closeMatch
c. ...
Now sets of relations using these vocabularies can be loaded into RELcat,
and be combined and exploited in their usual fashion, e.g., by an inferencing engine. For example, this is done for the GOLD ontology of linguistic concepts
(Farrar and Langendoen, 2010). However, the upper part of the taxonomy can be
used by generic algorithms to traverse the large graph created by the combined relationships.
PREFIX rel:<http://www.isocat.org/relcat/relations#>
PREFIX isocat:<http://www.isocat.org/datcat/>
SELECT ?rel WHERE { isocat:DC-2482 rel:related ?rel . }
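The effect of querying on the top-level relationship type can be sketched in plain Python. This is not a description of how RELcat answers the query. It only shows, under stated assumptions, how a generic algorithm could use the taxonomy to accept any more specific relationship type. The spelling rel:almostSameAs, the ex:language target and the skos:closeMatch statement are hypothetical. The two rel:sameAs statements are taken from the relation set listed earlier.

```python
# Parent of each relationship type, following the taxonomy listed above.
# "rel:almostSameAs" is an assumed spelling; the text only gives the label.
PARENT = {
    "rel:sameAs": "rel:related",
    "rel:almostSameAs": "rel:related",
    "owl:equivalentClass": "rel:sameAs",
    "owl:equivalentProperty": "rel:sameAs",
    "owl:sameAs": "rel:sameAs",
    "skos:exactMatch": "rel:sameAs",
    "skos:closeMatch": "rel:almostSameAs",
}

# (subject, predicate, object) statements with prefixed names.
TRIPLES = [
    ("isocat:DC-2573", "rel:sameAs", "dc:identifier"),
    ("isocat:DC-2482", "rel:sameAs", "dc:language"),
    # Hypothetical statement, added only to exercise the taxonomy.
    ("isocat:DC-2482", "skos:closeMatch", "ex:language"),
]


def is_a(relation, ancestor):
    """True if relation equals ancestor or is subsumed by it in PARENT."""
    while relation is not None:
        if relation == ancestor:
            return True
        relation = PARENT.get(relation)
    return False


def related(subject, triples, relation="rel:related"):
    """Objects linked to subject by relation or any of its subtypes."""
    return [o for s, p, o in triples if s == subject and is_a(p, relation)]
```

With these statements, related("isocat:DC-2482", TRIPLES) returns both dc:language and ex:language, whereas restricting the relation to rel:sameAs returns only dc:language.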
References
Berners-Lee T (1998) Cool URIs don't change. Tech. rep., World Wide Web Consortium, http://www.w3.org/Provider/Style/URI.html
Broeder D, Declerck T, Hinrichs E, Piperidis S, Romary L, Calzolari N, Wittenburg P (2008) Foundation of a component-based flexible registry for language
resources and technology. In: Proceedings of the 6th International Conference on
Language Resources and Evaluation (LREC 2008), Marrakech, Morocco
Farrar S, Langendoen DT (2010) An OWL-DL implementation of GOLD: An ontology for the semantic web. In: Witt AW, Metzing D (eds) Linguistic Modeling
of Information and Markup Languages: Contributions to Language Technology,
Springer
ISO 12620 (2009) Terminology and other language and content resources - Specification of data categories and management of a Data Category Registry for language resources
ISO 24613 (2008) Language resource management - Lexical markup framework
(LMF)
Kemps-Snijders M, Windhouwer M, Wittenburg P, Wright SE (2008) ISOcat: Corralling data categories in the wild. In: Proceedings of the Sixth International
Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, http://www.lrec-conf.org/proceedings/lrec2008/
Schuurman I, Windhouwer M (2011) Explicit semantics for enriched documents.
What do ISOcat, RELcat and SCHEMAcat have to offer? In: Proceedings of the
2nd Supporting Digital Humanities Conference, Copenhagen, Denmark
Simons G, Bird S (2003) The open language archives community: An infrastructure
for distributed archiving of language resources. Literary and Linguistic Computing 18(2):117-128
Windhouwer M, Wright SE, Kemps-Snijders M (2010) Referencing ISOcat data categories. In: Budin G, Declerck T, Romary L, Wittenburg P (eds) Proceedings of
the LREC 2010 LRT standards workshop, Malta, http://www.lrec-conf.org/proceedings/lrec2010/workshops/W4.pdf