Posted: July 13, 2020

New KBpedia ID Property and Mappings Added to Wikidata

Wikidata editors last week approved adding a new KBpedia ID property (P8408) to their system, and we have followed up by adding nearly 40,000 mappings to the Wikidata knowledge base. Another 5,000 to 6,000 mappings are forthcoming, which we will add in the coming weeks. Thereafter, we will continue to increase the cross-links, as we partially document below.

This milestone is one I have had in my sights for at least the past few years. We want to both: 1) provide a computable overlay to Wikidata; and 2) increase our reliance on, and use of, Wikidata’s unparalleled structured data resources as we move KBpedia forward. Below I give a brief overview of the status of Wikidata, share some high-level views of our experience in proposing and then mapping a new Wikidata property, and conclude with some thoughts on where we might go next.

The Status of Wikidata

Wikidata is the structured data initiative of the Wikimedia Foundation, the non-profit organization that oversees Wikipedia and many other notable Web-wide information resources. Since its founding in 2012, Wikidata’s use and prominence have exceeded expectations. Today, Wikidata is a multi-lingual resource with structured data for more than 95 million items, characterized by nearly 10,000 properties. Items are being added to Wikidata at a rate of nearly 5 million per month. A rich ecosystem of human editors and bots patrols the knowledge base and its entries to enforce data quality and consistency. The ecosystem includes tools for bulk loading of data with error checks, search including structured SPARQL queries, and navigation and visualization. Errors and mistakes in the data occur, but the system ensures such problems are removed or corrected as discovered. Thus, as the data have grown, quality and usefulness have improved as well.

From KBpedia’s standpoint, Wikidata represents the most complete complementary instance data and characterization resource available. As such, it is the driving wheel and stalking horse (to mix eras and technologies) to guide where and how we need to incorporate data and its types. These have been the overall motivators for us to embrace a closer relationship with Wikidata.

As an open governance system, Wikidata has established its own data models, policies, and approval and ingest procedures for adopting new data or characterizations (properties). You might find it interesting to review the process and ongoing dialog that accompanied our proposal for a KBpedia ID as a property in Wikidata. As of one week ago, KBpedia ID was assigned Wikidata property P8408. To date, more than 60% of Wikidata properties have been such external identifiers, and IDs are the fastest-growing category of properties. Since most properties that relate to internal entity characteristics have already been identified and adopted, we anticipate mappings to external systems will continue to be a dominant feature of the growth in Wikidata properties to come.

Our Mapping Experience

There are many individuals who spend considerable time monitoring and overseeing Wikidata. I am not one of them. I had never before proposed a new property to Wikidata, and had proposed only one actual Q item (Q is the standard prefix for an entity or concept in Wikidata) for KBpedia prior to proposing our new property.

Like much else in the Wikimedia ecosystem, there are specific templates for proposing a new Q item or a new property (see the examples of external identifier proposals here). Since there are about 10,000 times more Q items than properties, the path for getting a new property approved is more stringent.

Then, once a new property is granted, there are specific paths, such as QuickStatements, that need to be followed to submit new data items (Q ids) or characteristics (property assignments to Q ids). I made some newbie mistakes in my first bulk submissions, and fortunately had a knowledgeable administrator (@Epidosis) guide me through making the fixes. For example, we had to back off about 10,000 updates because I had used the wrong form for referencing a claim. Once the claims were corrected, we were able to upload the mappings again.
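For readers unfamiliar with QuickStatements, here is a minimal, hypothetical sketch of how such a batch might be generated in Python. The property number (P8408) is the one discussed above, but the item-to-ID pairs, file name, and source column are illustrative assumptions only; consult the QuickStatements documentation for the exact, current syntax.

```python
# Minimal sketch: generate a QuickStatements V1-style batch of KBpedia ID (P8408) claims.
# The mappings dict and the reference column (S854 = reference URL) are illustrative
# assumptions; check the QuickStatements help pages for the exact, current syntax.

mappings = {
    "Q5": "Person",        # hypothetical Wikidata item -> KBpedia reference-concept ID pairs
    "Q11424": "Film",
}

REFERENCE_URL = "https://kbpedia.org/"   # placeholder source for each claim

def to_quickstatements(pairs):
    """Yield tab-separated V1 command lines: item, property, value, optional source."""
    for qid, kbpedia_id in pairs.items():
        yield "\t".join([qid, "P8408", f'"{kbpedia_id}"', "S854", f'"{REFERENCE_URL}"'])

if __name__ == "__main__":
    with open("kbpedia_batch.tsv", "w", encoding="utf-8") as f:
        for line in to_quickstatements(mappings):
            f.write(line + "\n")
```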

As one might imagine, updates and changes are being submitted to the system by multiple human agents and (some) bots at all times. Facilities like QuickStatements are designed to enable batch uploads and to allow re-submissions when errors occur. You might want to see what is currently active on the system by checking out this current status.

With multiple inputs and submitters, it takes time for large sets of claims to be uploaded. In the case of our 40,000 mappings, we also accompanied each with source and update information, leading to a total upload of more than 120,000 claims. We split our submissions over multiple parts or batches, and then re-submitted if initial claims errored out (for example, if the base claim had not yet been fully registered, the subsequent subsidiary claims might error for lack of a registered subject; upon a second pass, the subject would be there and no error would occur). We ran our batches at off-peak times for both Europe and North America, but the runs still took a total of about 12 hours.

Once loaded, the internal quality controls of Wikidata kick in. Both bots and human editors monitor concepts, and both can flag (and revert) the mapping assignments made. After three days of being active on Wikidata, we had a dozen reverts of initially uploaded mappings, representing about 0.03% of our suggested mappings, which is gratifyingly low. Still, we expect to hear of more such errors, and we are committed to fixing all that are identified. But, at this juncture, it appears our initial mappings were of pretty high quality.

We had a rewarding learning experience in uploading mappings to Wikidata and found much good will and assistance from knowledgeable members. Undoubtedly, everything should be checked in advance to ensure quality assertions when preparing uploads to Wikidata. But, if that is done, the system and its editors also appear quite capable of identifying and enforcing quality controls and constraints as they are encountered. Overall, I found the entire data upload process to be impressive and rewarding. I am quite optimistic that this ecosystem will continue to improve.

The result of our external ID uploads and mappings can be seen in these SPARQL queries regarding the KBpedia ID property on Wikidata:
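As a simple illustration (not one of the linked queries), a count of items carrying the new property can be retrieved from the public Wikidata Query Service; the short Python sketch below assumes the standard WDQS endpoint and the requests library.

```python
# Illustrative sketch: count Wikidata items that carry the KBpedia ID property (P8408).
# Uses the public Wikidata Query Service endpoint; not one of the queries linked above.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT (COUNT(DISTINCT ?item) AS ?items) WHERE {
  ?item wdt:P8408 ?kbpediaId .
}
"""

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "kbpedia-mapping-check/0.1"},
    timeout=60,
)
resp.raise_for_status()
count = resp.json()["results"]["bindings"][0]["items"]["value"]
print(f"Items with a KBpedia ID (P8408): {count}")
```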

As of this writing, the KBpedia ID is now about the 500th most prevalent property on Wikidata.

What is Next?

Wikidata is clearly a dynamic data environment. Not only are new items being added by the millions, but existing items are being better characterized and related to external sources. Dealing with the immense scales involved requires automated quality-checking bots along with human editors committed to the data integrity of their domains and items. Engaging in a large-scale mapping such as KBpedia’s also requires a commitment to the Wikidata ecosystem and model.

Initiatives that appear immediately relevant to what we have put in place with Wikidata include efforts to:

  • Extend the current direct KBpedia mappings to fix initial mis-assignments and to extend coverage to remaining sections of KBpedia
  • Add additional cross-mappings that exist in KBpedia but have not yet been asserted in Wikidata (for example, there are nearly 6,000 such UNSPSC IDs)
  • Add equivalent class (P1709) and possible superproperties (P2235) and subproperties (P2236) already defined in KBpedia
  • Where useful mappings are desirable, add missing Q items used in KBpedia to Wikidata
  • And, most generally, also extend mappings to the 5,000 or so shared properties between Wikidata and KBpedia.

I have been impressed as a user of Wikidata for some years now. This most recent experience also makes me enthused about contributing data and data characterizations directly.

To Learn More

The KBpedia Web site provides a working KBpedia explorer and a demo of how the system may be applied to local content for tagging or analysis. KBpedia distinguishes entities from concepts, on the one hand, and splits its predicates into attributes, external relations, and pointers or indexes, on the other, all informed by Charles Peirce’s writings related to knowledge representation. KBpedia was first released in October 2016 with some open source aspects, and was made fully open in 2018. KBpedia is partially sponsored by Cognonto Corporation. All resources are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Posted: June 15, 2020

New Version Finally Meets the Hurdle of Initial Vision

I am pleased to announce that we released a powerful new version of KBpedia today with e-commerce and logistics capabilities, as well as significant other refinements. The enhancement comes from adding the United Nations Standard Products and Services Code (UNSPSC) as KBpedia’s seventh core knowledge base. UNSPSC is a comprehensive and logically organized taxonomy for products and services, organized into four levels, with third-party crosswalks to economic and demographic data sources. It is a leading standard for many industrial, e-commerce, and logistics applications.

This was a heavy lift for us. Given the time and effort involved, Fred Giasson, KBpedia’s co-editor, and I decided to also tackle a host of other refinements we had on our plate. All told, we devoted many thousands of person-hours and more than 200 complete builds from scratch to bring this new version to fruition. I can proudly say that this version finally meets the starting vision we had when we first began KBpedia’s development. It is a solid baseline from which to build all sorts of applications and to make broad outreach for adoption in 2020. Because of the extent of changes in this new version, we have leapfrogged KBpedia’s version numbering from 2.21 to 2.50.

KBpedia is a knowledge graph that provides a computable overlay for interoperating and conducting machine learning across its constituent public knowledge bases of Wikipedia, Wikidata, GeoNames, DBpedia, schema.org, OpenCyc, and, now, UNSPSC. KBpedia now contains more than 58,000 reference concepts and their mappings to these knowledge bases, structured into a logically consistent knowledge graph that may be reasoned over and manipulated. KBpedia acts as a computable scaffolding over these broad knowledge bases with the twin goals of data interoperability and knowledge-based artificial intelligence (KBAI).

KBpedia is built from an expandable set of simple text ‘triples’ files, specified as subject-predicate-object tuples (EAVs to some, such as Kingsley Idehen), that enable the entire knowledge graph to be constructed from scratch. This process enables many syntax and logical tests, especially for consistency, coherency, and satisfiability, to be invoked at build time. A build may take from one to a few hours on a commodity workstation, depending on the tests. The build process outputs validated ontology (knowledge graph) files in the standard W3C OWL 2 semantic language, along with mappings to individual instances in the contributing knowledge bases.
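As a rough sketch of what a build-time syntax check over such simple triples files might look like, consider the snippet below. The tab-separated layout, the file name, and the single dangling-parent check are assumptions for illustration; the actual KBpedia build files and tests are more involved.

```python
# Rough sketch of a build-time sanity check over simple subject-predicate-object tuples.
# The tab-separated layout and file name are assumptions; the real KBpedia build files
# and their full consistency/satisfiability checks are more involved.
import csv

def load_triples(path):
    """Read tab-separated s-p-o rows, skipping blank lines and '#' comments."""
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue
            if len(row) != 3:
                raise ValueError(f"Malformed tuple (expected 3 fields): {row}")
            yield tuple(field.strip() for field in row)

def check_subclass_subjects(triples):
    """Flag subclass assertions whose parent never appears as a subject itself."""
    triples = list(triples)
    subjects = {s for s, _, _ in triples}
    return [(s, o) for s, p, o in triples
            if p == "rdfs:subClassOf" and o not in subjects]

if __name__ == "__main__":
    dangling = check_subclass_subjects(load_triples("kbpedia_structure.tsv"))  # hypothetical file
    for child, parent in dangling:
        print(f"Warning: {child} points to undeclared parent {parent}")
```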

As Fred notes, we continue to streamline and improve our build procedures. Major changes like the ones we have just gone through, whether adding a main source like UNSPSC or swapping out or adding a new SuperType (or typology), often require multiple build iterations to pass the system’s consistency and satisfiability checks. We need these build processes to be as easy and efficient as possible, which was also a focus of our latest efforts. One of our next major objectives is to release KBpedia’s build and maintenance code, perhaps including a Python option.

Incorporation of UNSPSC

Though UNSPSC is consistent with KBpedia’s existing three-sector economic model (raw products, manufactured products, services), adding it did require structural changes throughout the system. With more than 150,000 listed products and services in UNSPSC, its incorporation needed to be balanced against KBpedia’s existing generality and scope. The approach was to include 100% of the top three levels of UNSPSC — segments, families, and classes — plus the more common and expected product and service ‘commodities’ in its fourth level. This design maintains balance while providing a framework to tie in any remaining UNSPSC commodities of interest to specific domains or industries. This approach led to integrating 56 segments, 412 families, 3700+ classes, and 2400+ commodities into KBpedia. Since some 1300 of these additions overlapped with existing KBpedia reference concepts, we checked, consolidated, and reconciled all duplicates.
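To make the four levels concrete, UNSPSC codes are conventionally expressed as eight digits, two per level (segment, family, class, commodity). Here is a small sketch of decomposing a code into those levels; the example code is for illustration only.

```python
# Sketch: decompose an 8-digit UNSPSC code into its four levels
# (segment, family, class, commodity), two digits per level.

def unspsc_levels(code: str) -> dict:
    code = code.strip()
    if len(code) != 8 or not code.isdigit():
        raise ValueError(f"Expected an 8-digit UNSPSC code, got {code!r}")
    return {
        "segment":   code[:2] + "000000",
        "family":    code[:4] + "0000",
        "class":     code[:6] + "00",
        "commodity": code,
    }

# Example (code chosen for illustration only):
print(unspsc_levels("43211508"))
# {'segment': '43000000', 'family': '43210000', 'class': '43211500', 'commodity': '43211508'}
```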

We fully specified and integrated all added reference concepts (RCs) into the existing KBpedia structure, and then mapped these new RCs to all seven of KBpedia’s core knowledge bases. Through this process, for example, we are able to greatly expand the coverage of UNSPSC items on Wikidata from 1000 or so Q (entity) identifiers to more than 6500. Contributing such mappings back to the community is another effort our KBpedia project will undertake next.

Lastly with respect to UNSPSC, I will be providing a separate article on why we selected it as KBpedia’s products and services template, how we did the integration, and what we found along the way. For now, the quick point is that UNSPSC is well-structured and organized according to the three-sector model of the economy, which matches well with Peirce’s three universal categories underlying our design of KBpedia.

Other Major Refinements

These changes were broad in scope. Effecting them took time and broke open core structures. Opportunities to rebuild the structure in cleaner ways arise when the Tinkertoys get scattered and then re-assembled. Some of the other major refinements the project undertook during the builds necessary to create this version were to:

  • Further analyze and refine the disjointedness between KBpedia’s 70 or so typologies. Disjoint assertions are a key mechanism for sub-set selections, various machine learning tasks, querying, and reasoning
  • Increase the number of disjointedness assertions by 62% over the prior version, resulting in better modularity. (However, note that the number of actual RCs affected by these improvements is lower than this percentage suggests, since many were already specified in prior disjoint pools)
  • Add 37% more external mappings to the system (DBpedia and UNSPSC, principally)
  • Complete 100% of the definitions for RCs across KBpedia
  • Greatly expand the altLabel entries for thousands of RCs
  • Improve the naming consistency across RC identifiers
  • Further clean the structure to ensure that a given RC is specified only once to its proper parent in an inheritance (subsumption) chain, which removes redundant assertions and improves maintainability, readability, and inference efficiency (see the sketch following this list)
  • Expand and update the explanations within the demo of the upper KBpedia Knowledge Ontology (KKO) (see kko-demo.n3). This non-working ontology included in the distro makes it easier to relate the KKO upper structure to the universal categories of Charles Sanders Peirce, which provides the basic organizational framework for KKO and KBpedia, and
  • Integrate the mapping properties for core knowledge bases within KBpedia’s formal ontology (as opposed to only offering as separate mapping files); see kbpedia-reference-concepts-mappings.n3 in the distro.
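On the redundant subsumption assertions noted in the list above, the basic idea is a transitive reduction of the subclass graph: if an RC already reaches a parent through an intermediate class, a direct assertion to that parent is redundant. Below is a minimal sketch of such a check, with hypothetical class names; it is not the project’s actual build code.

```python
# Sketch: find redundant subClassOf assertions, i.e., direct parent links that are
# already implied through an intermediate class (a simple transitive-reduction check).
from collections import defaultdict

def redundant_assertions(edges):
    """edges: iterable of (child, parent) subClassOf pairs; returns the redundant pairs."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    def reachable_indirectly(start, target):
        # Can we reach `target` from `start` without using the direct start -> target edge?
        stack = [p for p in parents[start] if p != target]
        seen = set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return False

    return [(c, p) for c in list(parents) for p in parents[c] if reachable_indirectly(c, p)]

# Hypothetical example: Camera -> Device is redundant because
# Camera -> OpticalInstrument -> Device already holds.
edges = [("Camera", "OpticalInstrument"), ("OpticalInstrument", "Device"), ("Camera", "Device")]
print(redundant_assertions(edges))   # [('Camera', 'Device')]
```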

Current Status of the Knowledge Graph

These structural and scope changes added about 6,000 new reference concepts to KBpedia and, after removing duplicates, brought the total to more than 58,200 RCs in the system. This is an increase of about 9% over the prior release. KBpedia is now structured into about 73 mostly disjoint typologies under the scaffolding of the KKO upper ontology. KBpedia has fully vetted, unique mappings (nearly all one-to-one) to these key sources:

  • Wikipedia – 53,323 (including some categories)
  • DBpedia – 44,476
  • Wikidata – 43,766
  • OpenCyc – 31,154
  • UNSPSC – 6,553
  • schema.org – 842
  • DBpedia ontology – 764
  • GeoNames – 680
  • Extended vocabularies – 249.

The mappings to Wikidata alone link to more than 40 million unique Q instance identifiers. These mappings may be found in the KBpedia distro. Most of the class mappings are owl:equivalentClass, but a minority may be subClass, superClass, or isAbout predicates as well.
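As a quick illustration of working with these mapping files, the sketch below loads the mappings ontology named above with rdflib and counts the owl:equivalentClass links that point at Wikidata entities. Treat it as a sketch under the assumption that the Wikidata targets use the standard entity URIs; adjust the path to your copy of the distro.

```python
# Sketch: count owl:equivalentClass mappings to Wikidata in the KBpedia mappings file.
# Requires: pip install rdflib. The file name matches the distro file mentioned above;
# adjust the path to wherever you unpacked the KBpedia distribution.
from rdflib import Graph
from rdflib.namespace import OWL

g = Graph()
g.parse("kbpedia-reference-concepts-mappings.n3", format="n3")

wikidata_links = [
    (s, o) for s, o in g.subject_objects(OWL.equivalentClass)
    if str(o).startswith("http://www.wikidata.org/entity/")
]
print(f"owl:equivalentClass links to Wikidata: {len(wikidata_links)}")
```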

KBpedia also includes about 5,000 properties, organized into a multi-level hierarchy of attributes, external relations, and representations, most derived from Wikidata and schema.org. Exploiting these properties and sub-properties is also one of the next priorities for KBpedia.

To Learn More

The KBpedia Web site provides a working KBpedia explorer and a demo of how the system may be applied to local content for tagging or analysis. KBpedia distinguishes entities from concepts, on the one hand, and splits its predicates into attributes, external relations, and pointers or indexes, on the other, all informed by Charles Peirce’s prescient theories of knowledge representation. Mappings to all external sources are provided in the linkages to the external resources file in the KBpedia downloads. (A larger inferred version is also available.) The external sources keep their own record files; KBpedia distributions provide the links. However, you can access these entities through the KBpedia explorer on the project’s Web site (see these entity examples for cameras, cakes, and canyons; clicking on any of the individual entity links will bring up the full instance record; such reach-throughs are straightforward to construct). See the GitHub site for further downloads.

KBpedia was first released in October 2016 with some open source aspects, and was made fully open in 2018. KBpedia is partially sponsored by Cognonto Corporation. All resources are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Posted: August 9, 2016

Continued Visibility for the Award-winning Web Portal

Laszlo Pinter, the individual who hired us as the technical contractor for the Peg community portal (www.mypeg.ca), recently gave a talk on the project at a TEDx conference in Winnipeg. Peg is the well-being indicator system for the community of Winnipeg. Laszlo’s talk is a 15-minute, high-level overview of the project and its rationale and role.

Peg helps identify and track indicators that relate to the economic, environmental, cultural and social well-being of the people of Winnipeg. There are scores of connected datasets underneath Peg that use semantic technologies to relate all of its information, from stories to videos to indicator data, to one another. I first wrote about Peg when it was released at the end of 2013.

In 2014, Peg won the international Community Indicators Consortium Impact Award. The Peg Web site is a joint project of the United Way of Winnipeg (UWW)  and the International Institute for Sustainable Development (IISD). Our company, Structured Dynamics, was the lead developer for the project, which is also based on SD’s Open Semantic Framework (OSF) platform.

Congratulations to the Peg team for the well-deserved visibility!

Posted: May 11, 2016

UMBEL Version 1.50 Fully Embraces a Typology Design, Gets Other Computability Improvements

The year since the last major release of UMBEL (Upper Mapping and Binding Exchange Layer) has been spent in a significant re-think of how the system is organized. Four years ago, in version 1.05, we began to split UMBEL into a core and a series of swappable modules. The first module adopted was in geographical information; the second was in attributes. This design served us well, but it was becoming apparent that we were on a path of multiple modules. Each of UMBEL’s major so-called ‘SuperTypes’ — that is, major cleavages of the overall UMBEL structure that are largely disjoint from one another, such as between Animals and Facilities — was amenable to the module design. This across-the-board potential cleavage of the UMBEL system caused us to stand back and question whether a module design alone was the best approach. Ultimately, after much thought and testing, we adopted instead a typology design that brought additional benefits beyond simple modularity.

Today, we are pleased to announce the release of these efforts in UMBEL version 1.50. Besides standard release notes, this article discusses this new typology design, and explains its uses and benefits.

Basic UMBEL Background

The Web and enterprises in general are characterized by growing, diverse and distributed information sources and data. Some of this information resides in structured databases; some resides in schema, standards, metadata, specifications and semi-structured sources; and some resides in general text or media where the content meaning is buried in unstructured form. Given these huge amounts of information, how can one bring together what subsets are relevant? And, then for candidate material that does appear relevant, how can it be usefully combined or related given its diversity? In short, how does one go about actually combining diverse information to make it interoperable and coherent?

UMBEL thus has two broad purposes. UMBEL’s first purpose is to provide a general vocabulary of classes and predicates for describing and mapping domain ontologies, with the specific aim of promoting interoperability with external datasets and domains. UMBEL’s second purpose is to provide a coherent framework of reference subjects and topics for grounding relevant Web-accessible content. UMBEL presently has about 34,000 of these reference concepts drawn from the Cyc knowledge base, organized into 31 mostly disjoint SuperTypes.

The grounding of information mapped by UMBEL occurs by common reference to the permanent URIs (identifiers) for UMBEL’s concepts. The connections within the UMBEL upper ontology enable concepts from sources at different levels of abstraction or specificity to be logically related. Since UMBEL is an open source extract of the OpenCyc knowledge base, it can also take advantage of the reasoning capabilities within Cyc.

UMBEL in Linked Open Data

Diagram showing linked data datasets. UMBEL is near the hub, below and to the right of the central DBpedia.

UMBEL’s vocabulary is designed to recognize that different sources of information have different contexts and different structures, and meaningful connections between sources are not always exact. UMBEL’s 34,000 reference concepts form a knowledge graph of subject nodes that may be related to external classes and individuals (instances and entities). Via this coherent structure, we gain some important benefits:

  • Mapping to other ontologies — disparate and heterogeneous datasets and ontologies may be related to one another by mapping to the UMBEL structure
  • A scaffolding for domain ontologies — more specific domain ontologies can be made interoperable by using the UMBEL vocabulary and tying their more general concepts into the UMBEL structure
  • Inferencing — the UMBEL reference concept structure is coherent and designed for inferencing, which supports better semantic search and look-ups
  • Semantic tagging — UMBEL, and ontologies mapped to it, can be used as input bases to ontology-based information extraction (OBIE) for tagging text or documents; UMBEL’s “semsets” broaden these matches and can be used across languages
  • Linked data mining — via the reference ontology, direct and related concepts may be retrieved and mined and then related to one another
  • Creating computable knowledge bases — with complete mappings to key portions of a knowledge base, say, for Wikipedia articles, it is possible to use the UMBEL graph structure to create a computable knowledge source, with follow-on benefits in artificial intelligence and KB testing and improvements, and
  • Categorizing instances and named entities — UMBEL can bring a consistent framework for typing entities and relating their descriptive attributes to one another.

UMBEL is written in the semantic Web languages of SKOS and OWL 2. It is a class structure used in linked data, along with other reference ontologies. Besides data integration, UMBEL has been used to aid concept search, concept definitions, query ranking, ontology integration, and ontology consistency checking. It has also been used to build large ontologies and for online question answering systems [1].

Including OpenCyc, UMBEL has about 65,000 formal mappings to DBpedia, PROTON, GeoNames, and schema.org, and provides linkages to more than 2 million Wikipedia pages (English version). All of its reference concepts and mappings are organized under a hierarchy of 31 different SuperTypes, which are mostly disjoint from one another. Development of UMBEL began in 2007. UMBEL was first released in July 2008. Version 1.00 was released in February 2011.

Summary of Version 1.50 Changes

These are the principal changes between the last public release, version 1.20, and this version 1.50. In summary, these changes include:

  • Removed all instance or individual listings from UMBEL; this change does NOT affect the punning used in UMBEL’s design (see Metamodeling in Domain Ontologies)
  • Re-aligned the SuperTypes to better support computability of the UMBEL graph and its resulting disjointedness
  • These SuperTypes were eliminated with concepts re-assigned: Earthscape, Extraterrestrial, Notations and Numbers
  • These new SuperTypes were introduced: AreaRegion, AtomsElements, BiologicalProcesses, Forms, LocationPlaces, and OrganicChemistry, with logically reasoned assignments of RefConcepts
  • The Shapes SuperType is a new ST that is inherently non-disjoint because it is shared with about half of the RefConcepts
  • Situations is an important ST, overlooked in prior efforts, that helps better establish context for Activities and Events
  • Made re-alignments in UMBEL’s upper structure and introduced additional upper-level categories to better accommodate these refinements in SuperTypes
  • A typology was created for each of the resulting 31 disjoint STs, which enabled missing concepts to be identified and added and to better organize the concepts within each given ST
  • The broad adoption of the typology design for all of the (disjoint) SuperTypes also meant that prior module efforts, specifically Geo and Attributes, could now be made general to all of UMBEL. This re-integration also enabled us to retire these older modules without affecting functionality
  • The tests and refinements necessary to derive this design caused us to create flexible build and testing scripts, documented via literate programming (using Clojure)
  • Updated all mappings to DBpedia, Wikipedia, and schema.org
  • Incorporated donated mappings to five additional LOV vocabularies [2]
  • Tested the UMBEL structure for consistency and coherence
  • Updated all prior UMBEL documentation
  • Expanded and updated the UMBEL.org Web site, with access and demos of UMBEL.

UMBEL’s SuperTypes

The re-organizations noted above have resulted in some minor changes to the SuperTypes and how they are organized. These changes have made UMBEL more computable with a higher degree of disjointedness between SuperTypes. (Note, there are also organizational SuperTypes that work largely to aid the top levels of the knowledge graph, but are explicitly designed to NOT be disjoint. Important SuperTypes in this category include Abstractions, Attributes, Topics, Concepts, etc. These SuperTypes are not listed below.)

UMBEL thus now has 31 largely disjoint SuperTypes, organized into 10 or so clusters or “dimensions”:

  • Constituents
      • Natural Phenomena
      • Area or Region
      • Location or Place
      • Shapes
      • Forms
      • Situations
  • Time-related
      • Activities
      • Events
      • Times
  • Natural Matter
      • Atoms and Elements
      • Natural Substances
      • Chemistry
  • Organic Matter
      • Organic Chemistry
      • Biochemical Processes
  • Living Things
      • Prokaryotes
      • Protists & Fungus
      • Plants
      • Animals
      • Diseases
  • Agents
      • Persons
      • Organizations
      • Geopolitical
  • Artifacts
      • Products
      • Food or Drink
      • Drugs
      • Facilities
  • Information
      • Audio Info
      • Visual Info
      • Written Info
      • Structured Info
  • Social
      • Finance & Economy
      • Society

These disjoint SuperTypes provide the basis for the typology design described next.

The Typology Design

After a few years of working with SuperTypes, it became apparent that each SuperType could become its own “module”, with its own boundaries and hierarchical structure. Since nearly 90% of the reference concepts across the UMBEL structure are themselves entity classes, properly organizing them lets us achieve a maximum of disjointness, modularity, and reasoning efficiency. Our early experience with modules pointed the way to a design for each SuperType that was as distinct and disjoint from other STs as possible. And, through a logical design of natural classes [3] for the entities in that ST, we could achieve a flexible, ‘accordion-like’ design that provides entity tie-in points from the general to the specific for each given SuperType. The design is effective for interoperating across both fine-grained and coarse-grained datasets. For specific domains, the same design approach allows even finer-grained domain concepts to be effectively integrated.

All entity classes within a given SuperType are thus organized under the SuperType itself as the root. The classes within that ST are then organized hierarchically, with child classes having a subClassOf relation to their parent. Each class within the typology can become a tie-in point for external information, providing a collapsible or expandable scaffolding (the ‘accordion’ design). Via inferencing, multiple external sources may be related to the same typology, even though at different levels of specificity. Further, very detailed class structures can also be accommodated in this design for domain-specific purposes. Moreover, because of the single tie-in point for each typology at its root, it is also possible to swap out entire typology structures at once, should design needs require this flexibility.
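To make the tie-in idea concrete, here is a small, hypothetical sketch (not UMBEL’s own tooling) using Python and rdflib: an external domain class is attached at some level of a typology via rdfs:subClassOf, and a transitive traversal then relates it to the typology root regardless of the level at which it was tied in. The class names are illustrative, not actual UMBEL reference concepts.

```python
# Hypothetical sketch of the 'accordion' tie-in: an external class is attached at one
# level of a SuperType typology via rdfs:subClassOf, and transitive traversal relates
# it back to the typology root. Class names are illustrative, not actual UMBEL RCs.
from rdflib import Graph, Namespace, RDFS

UMBEL = Namespace("http://umbel.org/umbel/rc/")     # UMBEL reference-concept namespace
EX = Namespace("http://example.org/my-domain/")     # hypothetical domain ontology

g = Graph()
# A tiny slice of a Products-style typology (illustrative hierarchy).
g.add((UMBEL.Camera, RDFS.subClassOf, UMBEL.OpticalDevice))
g.add((UMBEL.OpticalDevice, RDFS.subClassOf, UMBEL.Products))   # typology root
# External tie-in at the most specific level that fits.
g.add((EX.MirrorlessCamera, RDFS.subClassOf, UMBEL.Camera))

# All superclasses of the external class, up to and including the typology root.
ancestors = set(g.transitive_objects(EX.MirrorlessCamera, RDFS.subClassOf))
print(UMBEL.Products in ancestors)   # True: the external class resolves to the ST root
```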

We have thus generalized the earlier module design so that every (mostly) disjoint SuperType now has its own separate typology structure. The typologies provide the flexible lattice for tying external content together at various levels of specificity. Further, the STs and their typologies may be removed or swapped out at will to deal with specific domain needs. The design also dovetails nicely with UMBEL’s build and testing scripts. Indeed, the evolution of these scripts via literate programming has also been a reinforcing driver for being able to test and refine the complete ST and typology structure.

Still a Work in Progress

Though UMBEL retains its same mission as when the system was first formulated nearly a decade ago, we also see its role expanding. The two key areas of expansion are in UMBEL’s use to model and map instance data attributes and in acting as a computable overlay for Wikipedia (and other knowledge bases). These two areas of expansion are still a work in progress.

The mapping to Wikipedia is now about 85% complete. While we are testing automated mapping mechanisms, because of its central role we also need to vet all UMBEL-Wikipedia mapping assignments. This effort is pointing out areas of UMBEL that are over-specified, under-specified, and sometimes duplicative or in error. Our goal is to get to a 100% coverage point with Wikipedia, and then to exercise the structure for machine learning and other tests against the KB. These efforts will enable us to enhance the semsets in UMBEL as well as to move toward multilingual versions. This effort, too, is still a work in progress.

Despite these desired enhancements, we are using all aspects of UMBEL and its mappings to both aid these expansions and to test the existing mappings and structure. These efforts are proving the virtuous circle of improvements that is at the heart of UMBEL’s purposes.

Where to Get UMBEL and Learn More

The UMBEL Web site provides various online tools and Web services for exploring and using UMBEL. The UMBEL GitHub site is where you can download the UMBEL Vocabulary or the UMBEL Reference Concept ontology, both under a Creative Commons Attribution 3.0 license. Other documents and backup are also available from that location.

Technical specifications for UMBEL and its various annexes are available from the UMBEL wiki site. You can also download a PDF version of the specifications from there. You are also welcome to participate on the UMBEL mailing list or LinkedIn group.


[2] Courtesy of Jana Vataščinová (University of Economics, Prague) and Ondřej Zamazal (University of Economics, Prague, COSOL project).
[3] See, for example, M.K. Bergman, 2015. “‘Natural Classes’ in the Knowledge Web,” AI3:::Adaptive Information blog, July 13, 2015.

Posted: May 18, 2015

Trying to Cut Through the Terminology Confusion and Offer Simple Guidelines

Semantics is a funny thing. All professionals come to know that communication with their peers and outside audiences requires accuracy in how to express things. Yet, even with such attentiveness, communications sometimes go awry. It turns out that background, perspective and context can all act to switch circuits at the point of communication. Despite, and probably because of, our predilection as a species to classify and describe things, all from different viewpoints, we can often express in earnest a thought that is communicated to others as something different from what we intended. Alas!

This reality is why, I suspect, we have embraced as a species things like dictionaries, thesauri, encyclopedias, specifications, standards, sacred tracts, and such, in order to help codify what our expressions mean in a given context. So, yes, while sometimes there is sloppiness in language and elocution, many misunderstandings between parties are also a result of difference in context and perspective.

It is important, when we process information to identify relations or extract entities, to type or classify them, or to fill out their attributes, that we have measures to gauge how well our algorithms and tests work, all attentive to providing adequate context and perspective. These very same measures can also tell us whether our attempts to improve them are working or not. These measures, in turn, are also the keys for establishing effective gold standards and creating positive and negative training sets for machine learning. Still, despite their importance, these measures are not always easy to explain or understand. And, truth is, sometimes these measures may also be mis-explained or mis-calculated. Aiding the understanding of the important measures for improving the precision, completeness, and accuracy of communications is my purpose in this article.

Some Basic Statistics as Typically Described

The most common scoring methods for gauging the “accuracy” of natural language communications involve statistical tests based on the nomenclature of negatives and positives, true or false. It can sometimes be a bit confusing how to interpret these terms, a confusion made all the more difficult by the kind of statistical environment at play. Let me try to first confuse, and then more simply explain, these possible nuances.

Standard science is based on a branch of statistics known as statistical hypothesis testing. This is likely the statistics that you were taught in school. In hypothesis testing, we begin with a hypothesis about what might be going on with respect to a problem or issue, but for which we do not know the cause or truth. After reviewing some observations, we formulate a hypothesis that some factor A is affecting or influencing factor B. We then formulate a mirror-image null hypothesis that specifies that factor A does not affect factor B; this is what we will actually test. The null hypothesis is what we assume the world in our problem context looks like, absent our test. If our test results do not differ from that assumed distribution, then we reject our alternative (meaning our initial hypothesis fails, and we keep the null explanation).

We make assumptions from our sample about how the entire population is distributed, which enables us to choose a statistical model that captures the shape of assumed probable results for our measurement sample. These shapes or distributions may be normal (bell-shaped or Gaussian), binomial, power law, or many others. These assumptions about populations and distribution shapes then tell us what kind of statistical test(s) to perform. (Misunderstanding the true shape of the distribution of a population is one of the major sources of error in statistical analysis.) Different tests may also give us more or less statistical power to test the null hypothesis, which is that chance results will match the assumed distribution. Different tests may also give us more than one test statistic to measure variance from the null hypothesis.

We then apply our test and measure and collect our sample from the population, with random or other statistical sampling important so as not to skew results, and compare the distribution of these results to our assumed model and test statistic(s). The null hypothesis is confirmed or not by whether the shape of our sampled results matches the assumed distribution or not. The significance of the variance from the assumed shape, along with a confidence interval based on our sample size and the test at hand, provides the information necessary to either accept or reject the null hypothesis.

Rejection of the null hypothesis generally requires both significant difference from the expected shape in our sample and a high level of confidence. Absent those results, we likely need to accept the null hypothesis, thus rejecting the alternative hypothesis that some factor A is affecting or influencing factor B. Alternatively, with significant differences and a high level of confidence, we can reject the null hypothesis, thereby accepting the alternative hypothesis (our actual starting hypothesis, which prompted the null) that factor A is affecting or influencing factor B.

This is all well and good except for the fact that either the sampling method or our test may be in error. Two types of errors are possible: Type I errors (false positives), where we incorrectly reject a null hypothesis that is actually true; and Type II errors (false negatives), where we fail to reject a null hypothesis that is actually false.

We can combine all of these thoughts into what is the standard presentation for capturing these true and false, positive and negative, results [1]:

In judging the null hypothesis (H0), there are four possible outcomes:

  • Reject H0 when H0 is valid/true: false positive (Type I error)
  • Reject H0 when H0 is invalid/false: true positive (correct inference)
  • Fail to reject (accept) H0 when H0 is valid/true: true negative (correct inference)
  • Fail to reject (accept) H0 when H0 is invalid/false: false negative (Type II error)

Clear as mud, huh?

Let’s Apply Some Simplifications

Fortunately, there are a couple of ways to sharpen this standard story in the context of information retrieval (IR), natural language processing (NLP) and machine learning (ML) — the domains of direct interest to us at Structured Dynamics — to make understanding all of this much simpler. Statistical tests will always involve a trade off between the level of false positives (in which a non-match is declared to be a match) and the level of false negatives (in which an actual match is not detected) [1]. Let’s see if we can simplify our recognition and understanding of these conditions.

First, let’s start with a recent explanation from the KDNuggets Web site [2]:

“Imagine there are 100 positive cases among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance of catching many of the 100 positive cases. You record the IDs of your predictions, and when you get the actual results you sum up how many times you were right or wrong. There are four ways of being right or wrong:

  1. TN / True Negative: case was negative and predicted negative
  2. TP / True Positive: case was positive and predicted positive
  3. FN / False Negative: case was positive but predicted negative
  4. FP / False Positive: case was negative but predicted positive.”

The use of ‘case’ and ‘predictions’ helps, but is still a bit confusing. Let’s hear another explanation from Benjamin Roth, from his recently completed thesis [3]:

“There are two error cases when extracting training data: false positive and false negative errors. A false positive match is produced if a sentence contains an entity pair for which a relation holds according to the knowledge base, but for which the sentence does not express the relation. The sentence is marked as a positive training example for the relation, however it does not contain a valid signal for it. False positives introduce errors in the training data from which the relational model is to be generalized. For most models false positive errors are the most critical error type, for qualitative and quantitative reasons, as will be explained in the following.

“A false negative error can occur if a sentence and argument pair is marked as a negative training example for a relation (the knowledge base does not contain the argument pair for that relation), but the sentence actually expresses the relation, and the knowledge base was incomplete. This type of error may negatively influence model learning by omitting potentially useful positive examples or by negatively weighting valid signals for a relation.”

In our context, we can see a couple of differences from traditional scientific hypothesis testing. First, the problems we are dealing with in IR, NLP and ML are all statistical classification problems, specifically in binary classification. For example, is a given text token an entity or not? What type amongst a discrete set is it? Does the token belong to a given classification or not? This makes it considerably easier to posit an alternative hypothesis and the shape of its distribution. What makes it binary is the decision as to whether a given result is correct or not. We now have a different set of distributions and tests from more common normal distributions.

Second, we can measure our correct ‘hits’ by applying our given tests to a “gold standard” of known results. This gold standard provides a representative sample of what our actual population looks like, one we have characterized in advance whether all results in the sample are true or not for the question at hand. Further, we can use this same gold standard over and over again to gauge improvements in our test procedures.

Combining these thoughts leads to a much simpler matrix, sometimes called a confusion matrix in this context, for laying out the true and false, positive and negative characterizations:

                 Test Assertion
  Correctness    Positive               Negative
  True           TP (true positive)     TN (true negative)
  False          FP (false positive)    FN (false negative)

As we can see, ‘positive’ and ‘negative’ are simply the assertions (predictions) arising from our test algorithm of whether or not there is a match or a ‘hit’. ‘True’ and ‘false’ merely indicate whether these assertions proved to be correct or not as determined by gold standards or training sets. A false positive is a false alarm, a “crying wolf”; a false negative is a missed result. Thus, all true results are correct; all false are incorrect.

Key Information Retrieval Statistics

Armed with these four characterizations — true positive, false positive, true negative, false negative — we now have the ability to calculate some important statistical measures. Most of these IR measures also have exact analogs in standard statistics, which I also note.

The first metric captures the concept of coverage. In standard statistics, this measure is called sensitivity; in IR and NLP contexts it is called recall. Basically it measures the ‘hit’ rate for identifying true positives out of all potential positives, and is also called the true positive rate, or TPR:

\mathit{TPR} = \mathit{TP} / P = \mathit{TP} / (\mathit{TP}+\mathit{FN})

Expressed as a fraction of 1.00 or a percentage, a high recall value means the test has a high “yield” for identifying positive results.

Precision is the complementary measure to recall, in that it measures how many of the positive identifications made are actually true:

\text{precision}=\frac{\text{number of true positives}}{\text{number of true positives}+\text{false positives}}

Precision is something, then, of a “quality” measure, also expressed as a fraction of 1.00 or a percentage. It provides a positive predictive value, defined as the proportion of true positives among all positive results (both true positives and false positives).

So, we can see that recall gives us a measure as to the breadth of the hits captured, while precision is a statement of whether our hits are correct or not. We also see, as in the Roth quote above, why false positives need to be a focus of attention in test development, because they directly lower precision and efficiency of the test.

This recognition that precision and recall are complementary and linked is reflected in one of the preferred overall measures in IR and NLP statistics, the F-score, which is the weighted (beta) harmonic mean of precision and recall. The general formula for positive real β is:

F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}.

which can be expressed in terms of TP, FN and FP as:

F_\beta = \frac {(1 + \beta^2) \cdot \mathrm{true\ positive} }{(1 + \beta^2) \cdot \mathrm{true\ positive} + \beta^2 \cdot \mathrm{false\ negative} + \mathrm{false\ positive}}\,

In many cases the simple harmonic mean is used, corresponding to a beta of 1, which is called the F1 statistic:

F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

But F1 displays a tension. Either precision or recall may be improved to achieve an improvement in F1, but with divergent benefits or effects. What is more highly valued? Yield? Quality? These choices dictate what kinds of tests and areas of improvement need to receive focus. As a result, the weight of beta can be adjusted to favor either precision or recall. Two other commonly used F measures are the F2 measure, which weights recall higher than precision, and the F0.5 measure, which puts more emphasis on precision than recall [4].

Another metric can factor into this equation, though accuracy is a less referenced measure in the IR and NLP realm. Accuracy is the statistical measure of how well a binary classification test correctly identifies or excludes a condition:

\text{accuracy}=\frac{\text{number of true positives}+\text{number of true negatives}}{\text{number of true positives}+\text{false positives} + \text{false negatives} + \text{true negatives}}

An accuracy of 100% means that the measured values are exactly the same as the given values.
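Pulling these formulas together, here is a small sketch that computes recall, precision, the F-measures, and accuracy directly from the four counts; the counts themselves are invented purely to illustrate the arithmetic, echoing the 100-positives-in-10,000-cases scenario quoted earlier.

```python
# Sketch: compute the basic IR measures directly from TP, FP, FN, TN counts.
# The counts below are invented purely to illustrate the arithmetic.

def ir_measures(tp, fp, fn, tn, beta=1.0):
    recall = tp / (tp + fn)                      # TPR, sensitivity
    precision = tp / (tp + fp)                   # positive predictive value
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"recall": recall, "precision": precision,
            f"F{beta:g}": f_beta, "accuracy": accuracy}

# Example: 100 actual positives in 10,000 cases; the test flags 200, of which 80 are correct.
print(ir_measures(tp=80, fp=120, fn=20, tn=9780))
# {'recall': 0.8, 'precision': 0.4, 'F1': 0.533..., 'accuracy': 0.986}
```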

All of the measures above simply require the measurement of false and true, positive and negative, as do a variety of predictive values and likelihood ratios. Relevance, prevalence and specificity are some of the other notable measures that depend solely on these metrics in combination with total population.

By bringing in some other rather simple metrics, it is also possible to expand beyond this statistical base to cover such measures as information entropy, statistical inference, pointwise mutual information, variation of information, uncertainty coefficients, information gain, AUCs and ROCs. But we’ll leave discussion of some of those options until another day.

Bringing It All Together

Courtesy of one of the major templates in Wikipedia in the statistics domain [5], for which I have taken liberties, expansions and deletions, we can envision the universe of statistical measures in IR and NLP, based solely on population and positives and negatives, true and false, as being:

The core layout uses the condition, as determined by the “gold standard”, as its columns, with the test assertions as its rows:

                             Condition positive                   Condition negative
  Test assertion positive    TP (true positive)                   FP (false positive, Type I error)
  Test assertion negative    FN (false negative, Type II error)   TN (true negative)

From these cells and the total population, the derived measures are:

  • Prevalence = Σ condition positive / Σ total population
  • Positive predictive value (PPV), Precision = Σ true positive / Σ test assertion positive
  • False discovery rate (FDR) = Σ false positive / Σ test assertion positive
  • False omission rate (FOR) = Σ false negative / Σ test assertion negative
  • Negative predictive value (NPV) = Σ true negative / Σ test assertion negative
  • Accuracy (ACC) = (Σ true positive + Σ true negative) / Σ total population
  • True positive rate (TPR), Sensitivity, Recall = Σ true positive / Σ condition positive
  • False positive rate (FPR), Fall-out = Σ false positive / Σ condition negative
  • False negative rate (FNR) = Σ false negative / Σ condition positive
  • True negative rate (TNR), Specificity (SPC) = Σ true negative / Σ condition negative
  • Positive likelihood ratio (LR+) = TPR / FPR
  • Negative likelihood ratio (LR−) = FNR / TNR
  • F-score (F1 case) = 2 × (Precision × Recall) / (Precision + Recall)

Please note that the order and location of TP, FP, FN and TN differ from my simple layout presented in the confusion matrix above. In the confusion matrix, we are gauging whether the assertion of the test is correct or not, as established by the gold standard. In this second layout, we instead use the positive or negative status of the gold-standard condition as the organizing dimension (the columns). Use the shorthand identifiers of TP, etc., to make the cross-reference between “correctness” and “condition”.

Relationships to Gold Standards and Training Sets

These basic measures and understandings have two further important roles beyond informing how to improve the accuracy and performance of IR and NLP algorithms and tests. The first is gold standards. The second is training sets.

Gold standards that themselves contain false positives and false negatives, by definition, immediately introduce errors. These errors make it difficult to test and refine existing IR and NLP algorithms, because the baseline is skewed. And, because gold standards also often inform training sets, errors there propagate into errors in machine learning. It is also important to include true negatives in a gold standard, in the likely ratio expected by the overall population, so that this complement of the accuracy measurement is not overlooked.

Once a gold standard is created, you then run your current test regime against it, just as you run the same tests against unknowns. Preferably, of course, the gold standard only includes true positives and true negatives (that is, the gold standard is the basis for judging “correctness”; see the confusion matrix above). In the case of running an entity recognizer, your results against the gold standard can take one of three forms: you either have open slots (no entity asserted); slots with correct entities; or slots with incorrect entities. Thus, here is how you would create the basis for your statistical scores:

  • TP = test identifies the same entity as in the gold standard
  • FP = test identifies a different entity than what is in the gold standard (including no entity)
  • TN = test identifies no entity; gold standard has no entity, and
  • FN = test identifies no entity, but gold standard has one.
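
Here is a compact sketch of that scoring logic, with both the gold standard and the test output represented as per-slot entity assignments (None meaning no entity asserted); the data are invented for illustration.

```python
# Sketch: tally TP, FP, TN, FN for an entity recognizer against a gold standard,
# following the four rules above. Each slot maps to an entity label or None.
from collections import Counter

def score_slots(gold, predicted):
    counts = Counter()
    for slot, gold_entity in gold.items():
        test_entity = predicted.get(slot)
        if gold_entity is None and test_entity is None:
            counts["TN"] += 1
        elif gold_entity is None:                # asserted an entity where gold has none
            counts["FP"] += 1
        elif test_entity is None:                # missed an entity the gold standard has
            counts["FN"] += 1
        elif test_entity == gold_entity:         # same entity as the gold standard
            counts["TP"] += 1
        else:                                    # different entity than the gold standard
            counts["FP"] += 1
    return counts

# Invented example: four slots in a gold standard versus a test run.
gold = {1: "Winnipeg", 2: None, 3: "IISD", 4: "UNSPSC"}
predicted = {1: "Winnipeg", 2: None, 3: "ISD", 4: None}
print(score_slots(gold, predicted))   # Counter({'TP': 1, 'TN': 1, 'FP': 1, 'FN': 1})
```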

As noted before, these measures are sufficient to calculate the precision, recall, F-score and accuracy statistics. Also note that T v F corresponds to “correctness” as judged by the gold standard, and P v N to what is asserted by the test(s), per the confusion matrix.

We can apply this same mindset to the second additional, important role in creating and evaluating training sets. Both positive and negative training sets are recommended for machine learning. Negative training sets are often overlooked. Again, if the learning is not based on true positives and negatives, then significant error may be introduced into the learning.

Clean, vetted gold standards and training sets are thus a critical component to improving our knowledge bases going forward [6]. The very practice of creating gold standards and training sets needs to receive as much attention as algorithm development because, without it, we are optimizing algorithms to fuzzy objectives.

The virtuous circle that occurs between more accurate standards and training sets and improved IR and ML algorithms is a central argument for knowledge-based artificial intelligence (KBAI). Continuing to iterate better knowledge bases and validation datasets is a driving factor in improving both the yield and quality from our rapidly expanding knowledge bases.


[2] Tilmann Bruckhaus, 2015. “How Are Precision and Recall Calculated?” from the KDNuggets Web site, retrieved May 10, 2015.
[3] Benjamin Roth, 2014. “Effective Distant Supervision for End-To-End Knowledge Base Population Systems,” D Engineering Thesis, Saarland University; quote is on p 33.
[6] Some would also argue for adequate gold standards in the ontology realm. See Dellschaft, Klaas, and Steffen Staab. “On how to perform a gold standard based evaluation of ontology learning.” In The Semantic Web-ISWC 2006, pp. 228-241. Springer Berlin Heidelberg, 2006. For ontologies, they state it “. . . is apparent that there does not exist a canonical way of performing gold-standard based evaluations of ontology learning. Moreover, we argue in this paper that existing gold-standard based evaluations are faulty and that a well-founded evaluation model is largely missing.”