CODI: Combinatorial Optimization For Data Integration - Results For OAEI 2010
Here, w1 and w2 are negative real-valued weights, rendering alignments that satisfy the
formulae possible but less likely.
The presented list of cardinality, coherence, and stability constraints could be extended by additional soft and hard formulae. Other constraints could, for example, model known correct correspondences or generalize the one-to-one alignment to m-to-n alignments.
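For instance, a correspondence between the concepts C1 ∈ O1 and C2 ∈ O2 that is known to be correct could be enforced by adding the corresponding ground atom of the hidden concept predicate as a hard formula (the name mc is used here purely for illustration, analogous to the instance predicate mi introduced below):

    mc(C1, C2)

Every alignment returned by the optimization then contains this correspondence, while correspondences contradicting it are excluded by the coherence constraints.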
A-Box Matching The current instance matching configuration of CODI leverages terminological structure and combines it with lexical similarity measures. The approach is presented in more detail in [10]. It uses one T-Box T but two different A-Boxes A1 ∈ O1 and A2 ∈ O2. In cases with two different T-Boxes, the T-Box matching approach is applied as a preprocessing step: the two aligned T-Boxes are merged, and our instance matching algorithm is then applied to the result. CODI offers complete conflict elimination, meaning that the resulting alignment is always coherent for OWL DL ontologies. This component is based on the work of Meilicke et al. [6]. CODI enforces the instance alignment to be consistent. To this end, we introduce observable predicates O to model conflicts, that is, a positive assertion about an instance in one ontology and a corresponding negative assertion about an instance in the other ontology. This is done for both property and concept assertions.
Analogously to the concept and property alignments above, we introduce the hidden predicate mi representing instance correspondences. Let C be a concept and P be a property of the T-Box T. Further, let A ∈ A1 and B ∈ A2 be individuals in the respective A-Boxes. Then, using a reasoner, ground atoms are added to the set of hard constraints Fh according to the following rules:
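A rough sketch of what such a rule looks like, using hypothetical predicate names (the precise formulation is given in [10]): if the reasoner derives T ∪ A1 |= C(A) while T ∪ A2 |= ¬C(B), the observable ground atoms O+C(A) and O−C(B) are added, together with a hard formula of roughly the form

    O+C(A) ∧ O−C(B) → ¬mi(A, B)

so that two instances with contradicting concept assertions can never be aligned; contradicting property assertions are handled analogously.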
The strength of the system is its modularity, which allows the incorporation of different similarity measures. The system can be optimized in two major ways: (a) the inclusion of novel formulae enforcing logical consistency and (b) the inclusion of additional similarity measures. There is room for improvement, since we used a very simple lexical similarity measure based on the Levenshtein distance [4] for our experiments. It is possible to apply different aggregation functions like average or maximum and to include specific properties of an ontology like URIs, labels, and comments.
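As an illustration, the following minimal Python sketch shows one way to obtain such a normalized string similarity from the Levenshtein distance; the normalization to [0, 1] and the lowercasing are illustrative choices rather than a definitive specification of the measure.

def levenshtein(a, b):
    # classic dynamic-programming edit distance between strings a and b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def sim(s1, s2):
    # normalized lexical similarity in [0, 1]; 1.0 means identical strings
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein(s1.lower(), s2.lower()) / max(len(s1), len(s2))

# e.g., sim("conference", "conferences") is roughly 0.91 and would pass a 0.85 threshold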
In all OAEI test cases, Algorithm 1 was used for computing the a-priori similarity σ(entity1, entity2). In the case of concept and property alignments, the a-priori similarity is computed by taking the maximal similarity between the URIs, labels, and OBO-to-OWL constructs. In the case of instance matching, the algorithm iterates over all data properties and takes the average of the similarity scores.
1.4 Link to the System and Parameters File
The alignments for the Benchmark and Conference tracks have been generated with the SEALS platform. For Anatomy, IIMB, and Restaurant, the alignments can be found at http://code.google.com/p/codi-matcher/downloads/list
2 Results
In the following, we present the results of the CODI system for the individual OAEI tracks. Due to space considerations, we do not explain the different benchmarks in more detail.
Algorithm 1 σ(entity1, entity2)
  if entity1 and entity2 are either concepts or properties then
    value ← 0
    for all values s1 of URI, labels, and OBOtoOWL constructs in entity1 do
      for all values s2 of URI, labels, and OBOtoOWL constructs in entity2 do
        value ← Max(value, sim(s1, s2))
      end for
    end for
    return value
  end if
  if entity1 and entity2 are individuals then
    Map<URI, double> similarities ← empty map
    for all data properties dp1 of entity1 do
      uri1 ← URI of dp1
      for all data properties dp2 of entity2 do
        if uri1 equals URI of dp2 then
          value ← sim(value of dp1, value of dp2)
          if uri1 is contained in similarities then
            update entry <uri1, old value> to <uri1, Min(old value + value, 1)> in similarities
          else
            add new entry <uri1, value> to similarities
          end if
        end if
      end for
    end for
    return (sum of all values in similarities) / (size of similarities)
  end if
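For concreteness, Algorithm 1 can be sketched in Python as follows; the entity accessors (is_concept_or_property, lexical_values, data_properties, uri, value) are hypothetical placeholders, and sim denotes a normalized string similarity such as the Levenshtein-based one sketched above.

def sigma(entity1, entity2, sim):
    # concept/property case: best string similarity over URIs, labels,
    # and OBOtoOWL constructs of the two entities
    if entity1.is_concept_or_property and entity2.is_concept_or_property:
        return max((sim(s1, s2)
                    for s1 in entity1.lexical_values
                    for s2 in entity2.lexical_values), default=0.0)
    # individual case: accumulate similarities of data property values that
    # share the same property URI (capped at 1), then average over the URIs
    if entity1.is_individual and entity2.is_individual:
        similarities = {}  # property URI -> accumulated similarity
        for dp1 in entity1.data_properties:
            for dp2 in entity2.data_properties:
                if dp1.uri == dp2.uri:
                    value = sim(dp1.value, dp2.value)
                    similarities[dp1.uri] = min(similarities.get(dp1.uri, 0.0) + value, 1.0)
        if not similarities:
            return 0.0  # guard against division by zero (not part of Algorithm 1)
        return sum(similarities.values()) / len(similarities)
    return 0.0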
Benchmark Track While our system's strength is its modularity and adaptability to different ontologies, we used the exact same settings for all ontology matching tracks. Hence, the performance on the benchmark track is rather poor. This is primarily due to the high threshold of 0.85 for the Levenshtein similarity measure that we applied in each of the ontology matching tracks. The results are shown in Table 1.
Table 1. Benchmark results
Conference Track On the real-world conference dataset, CODI achieves very good results since it employs logical reasoning to avoid incoherences. The execution time is between 2 and 4 minutes per test case. Table 2 summarizes the overall results.
Table 2. Conference results
           Average
Precision  0.87
Recall     0.51
F1 score   0.64
Anatomy Track The results on the anatomy track are also convincing. The results shown in Table 3 are on par with the 2009 results of state-of-the-art matching applications. The F1 scores are between 0.73 and 0.79 for all subtasks, even for the two tasks Focus on Precision and Focus on Recall. Thus, our algorithm achieves satisfactory precision and recall values without sacrificing the F1 score. For the last task, where a partial reference alignment was given, we could gain almost 5% in F1 score. This is because incorporating a partial reference alignment into our system is straightforward: the reference alignment becomes a direct part of the optimization problem, enforcing good correspondences while ruling out contradicting ones. However, since our algorithm uses logical reasoning and has to solve an NP-hard optimization problem, the execution times are quite high.
Table 3. Anatomy results
PR Track For this track, which consists of small files about persons and restaurants, we used a simple one-to-one alignment based only on lexical similarity scores, since no significant structural information is available. Thus, the runtime was very short, at less than 5 seconds per test case. The results of the CODI system are depicted in Table 5.
Table 5. PR results
3 General Comments
CODI is a very young system and does not yet provide a user interface. Hence, improving usability by designing a suitable user interface will be one of the next steps. Regarding the quality of the alignments, more sophisticated lexical similarity measures will be tested and integrated. We are also working on novel algorithms for solving the optimization problems more efficiently.
The SEALS evaluation campaign is very beneficial since it is the first time that the matchers must provide a standardized interface, which can potentially be used by everyone. We encourage the organizers to use semantic precision and recall measures as described in [3].
4 Conclusion
CODI performs concept, property, and instance alignments. It combines logical and structural information with a-priori similarity measures in a well-defined way by using the syntax and semantics of Markov logic. The system therefore not only aligns the entities with the highest lexical similarity but also enforces the coherence and consistency of the resulting alignment.
The overall results of the young system are very promising, especially considering that there are many optimization possibilities with respect to the lexical similarity measures that have not yet been investigated. The strength of the CODI system is the combination of lexical and structural information and the declarative nature that allows easy experimentation. We will continue the development of the CODI system and hope that our approach inspires other researchers to leverage terminological structure for ontology matching.
References
1. I. Cruz, F. Palandri Antonelli, and C. Stroe. Efficient selection of mappings and automatic quality-driven combination of matching methods. In Proceedings of the ISWC 2009 Workshop on Ontology Matching, 2009.
2. P. Domingos, D. Lowd, S. Kok, H. Poon, M. Richardson, and P. Singla. Just add weights: Markov logic for the semantic web. In Proceedings of the Workshop on Uncertain Reasoning for the Semantic Web, pages 1–25, 2008.
3. D. Fleischhacker and H. Stuckenschmidt. A practical implementation of semantic precision and recall. In 2010 International Conference on Complex, Intelligent and Software Intensive Systems, pages 986–991. IEEE, 2010.
4. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, pages 845–848, 1965.
5. C. Meilicke and H. Stuckenschmidt. Analyzing mapping extraction approaches. In Proceedings of the Workshop on Ontology Matching, Busan, Korea, 2007.
6. C. Meilicke, A. Tamilin, and H. Stuckenschmidt. Repairing ontology mappings. In Proceedings of the Conference on Artificial Intelligence, pages 1408–1413, Vancouver, Canada, 2007.
7. S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In Proceedings of ICDE, pages 117–128, 2002.
8. M. Niepert. A delayed column generation strategy for exact k-bounded MAP inference in Markov logic networks. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2010.
9. M. Niepert, C. Meilicke, and H. Stuckenschmidt. A probabilistic-logical framework for ontology matching. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, 2010.
10. J. Noessner, M. Niepert, C. Meilicke, and H. Stuckenschmidt. Leveraging terminological structure for object reconciliation. In The Semantic Web: Research and Applications, pages 334–348, 2010.
11. M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107–136, 2006.
12. S. Riedel. Improving the accuracy and efficiency of MAP inference for Markov logic. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 468–475, 2008.