Semantic Desktop
1 Introduction
Nowadays, information system users can access more content than ever before,
faster than ever before. However, unlike the technology, the users themselves
have not scaled up well. The challenge has shifted from finding information in
the first place to actually locating useful knowledge within the retrieved content.
Consequently, research increasingly addresses questions of knowledge man-
agement and automated semantic analysis through a multitude of technologies
[12], including ontologies and the semantic web, text mining and natural lan-
guage analysis. Language technologies in particular promise to support users
by automatically scanning, extracting, and transforming information from vast
amounts of documents written in natural languages.
Even so, the question of exactly how text mining tools can be incorporated into
today's desktop environments, and how the many individual analysis algorithms can
contribute to a semantically richer understanding within a complex user scenario,
has so far not been sufficiently addressed.
In this paper, we present a case study from a project delivering semantic
analysis tools to end users—building historians and architects—for the analysis of
a historic encyclopedia of architecture. A system architecture is developed based on
a detailed analysis of the users' requirements. We discuss the current implemen-
tation and report first results from an ongoing evaluation.
2 The Encyclopedia, its Users, and their Requirements
The Encyclopedia. In the 19th century the “Handbuch der Architektur” (Hand-
book on Architecture) was probably not the only, but certainly the most compre-
hensive, attempt to represent the entire body of building knowledge, past and
present [6]. It is divided into four parts: Part I: Allgemeine Hochbaukunde
(general building knowledge), Part II: Baustile (architectural styles), Part III:
Hochbau-Konstruktionen (building construction), and Part IV: Entwerfen, An-
lage und Einrichtung der Gebäude (design, conception, and interior of buildings).
Overall, it gives a detailed and comprehensive view of the fields of archi-
tectural history, architectural styles, construction, statics, building equipment,
physics, design, building conception, and town planning.
But it is neither easy to get a general idea of the encyclopedia nor to find
information on a certain topic. The encyclopedia has a complex and confusing
structure: for each of its parts a different number of volumes—sometimes even
split into several books—was published, all of them written by different au-
thors. Some contain more than four hundred pages, others are much smaller,
and very few have an index. Furthermore, many volumes were reworked and
reprinted after a time, and an extensive supplement part was added. Referring to the
complete work, we are thus dealing with more than 140 individual publications and
at least 25 000 pages.
It is out of this complexity that the idea was born to support users—building
historians and architects—in their work through state-of-the-art semantic analy-
sis tools on top of classical database and information retrieval systems. However,
in order to be able to offer the right tools we first needed to obtain an
understanding of precisely what questions concern our users and how they carry out
their related research.
User Groups: Building Historians and Architects. Two user groups are
involved in the analysis within our project: building historians and architects.
These two parties have totally different perceptions of the “Handbuch der Ar-
chitektur” and different expectations of its analysis. The handbook has a
kind of hybrid significance between its function as a research object and as a
resource for practical use, between research and user knowledge.
An architect plans, designs, and oversees a building’s construction.
Although he is first of all associated with the construction of new buildings, more
than 60% of building projects are related to the existing building stock, which
4 Edited by Joseph Durm (b. 14.2.1837 Karlsruhe, Germany, d. 3.4.1919 ibidem) and
three other architects since 1881.
User Requirements. For the building historian the handbook itself is object
and basis of his research. He puts a high value on a comprehensible documen-
tation of information development, since the analysis and interpretation of the
documentation process itself is also an important part of his scientific work. The
original text, the original object is the most significant source of cognition for
him. All amendments and notes added by different users have to be managed on
separate annotation or discussion levels—this would be the forum for scientific
controversy, which may result in new interpretations and cognition.
For the architect the computer-aided analysis and accessibility of the ency-
clopedia is a means to an end. It becomes a guideline offering basic knowledge of
former building techniques and construction. The architect is interested in tech-
nical information, not in the process of cognition. He requires a clearly structured
presentation of all available information on one concept. Besides refined queries
(“semantic queries”) he requires further linked information, for example web
sites, thesauruses, DIN and EU standards, or planning tools.
Both user groups are primarily interested in the content of the encyclopedia,
but also in the possibility of finding “unexpected information,”5 as this would
afford a new quality of reception. So far it has not been possible to take in this complex
and multi-volume opus with its thousands of pages as a whole: the partition of the
handbook into topics, volumes, and books makes the retrieval of a particular
concept quite complicated. Only the table of contents is available to give a rough
orientation; it is impossible to get any information about single concepts or
terms. There is neither an overall index nor—apart from a few exceptions—
an index for the single volumes. Because each of them comprises a huge amount of text,
charts, and illustrations, it is unlikely that the sought-for term will be found
by simply leafing through the pages. Thus, this project’s aim is to enable new possibilities
of access through the integration of “semantic search engines” and automated analyses.
Automated index generation alone would mean substantial progress for
further research work.
5 Information delivered through a user’s desktop is termed “unexpected” when it is
relevant to the task at hand yet not explicitly requested.
gathered from the restoration of a specific building. Wiki systems typically offer
built-in discussion and versioning facilities matching these requirements.
3 System Architecture
[Fig. 1. Four-tier system architecture (Tier 1: Clients; Tier 2: Presentation and
Interaction; Tier 3: Analysis and Retrieval; Tier 4: Resources), with components
including web browser, OpenOffice Writer, web server, Wiki (WikiWikiWeb), Wiki bot,
OO.org NLP adapter, GATE framework with natural language analysis components
(e.g., automatic summarization, coreference resolution), databases, content, ontology,
and annotations.]
4 Implementation
In this section we highlight some of the challenges we encountered when imple-
menting the architecture discussed above, as well as their solutions.
[Fig. 2. Workflow between the Wiki and the NLP subsystem; recovered labels: original
content, database, Wiki, (4) display content, (5) add/edit content.]
MediaWiki stores the textual content in a MySQL database; the image files are stored
as plain files on the server. It provides a PHP-based dynamic web interface for
browsing, searching, and manually editing the content.
The workflow between the Wiki and the NLP subsystems is shown in Fig. 2.
The individual sub-components are loosely coupled through XML-based data
exchange. Basically, three steps are necessary to populate the Wiki with both
the encyclopedia text and the additional data generated by the NLP subsystem.
These steps are performed by a custom software system written in Python.
Firstly (step (1) in Fig. 2), the original Tustep9 markup of the digitized ver-
sion of the encyclopedia is converted to XML. The resulting XML is intended to
stay as semantically close to the original markup as possible; as such, it contains
mostly layout information. XSLT transformations can then be used to cre-
ate XML that is suitable for processing by the natural language processing
(NLP) subsystem described below.
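As an illustration of this step, the following is a minimal sketch of applying such an XSLT transformation in Python with lxml; the file names (band_layout.xml, layout2nlp.xsl) are purely hypothetical and are not artifacts of the project.

from lxml import etree

def transform(xml_path, xslt_path, out_path):
    # Parse the layout-oriented XML derived from the Tustep markup
    source = etree.parse(xml_path)
    # Load the stylesheet mapping layout elements to NLP-friendly XML
    stylesheet = etree.XSLT(etree.parse(xslt_path))
    result = stylesheet(source)
    result.write(out_path, xml_declaration=True, encoding="utf-8")

if __name__ == "__main__":
    transform("band_layout.xml", "layout2nlp.xsl", "band_nlp.xml")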
Secondly (2), the XML data is converted to the text markup used by Media-
Wiki. The data is parsed using the Python xml.dom library, creating a document
tree according to the W3C DOM specification.10 This allows for easy and flexible
data transformation, e.g., changing an element node of the document tree such
as <page no="12"> to a text node containing the appropriate Wiki markup.
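To make the DOM-based rewriting concrete, here is a minimal sketch using xml.dom.minidom; the target Wiki markup ("== Seite 12 ==") is an assumed placeholder, not necessarily the markup produced by the actual converter.

from xml.dom import minidom

def pages_to_wiki_markup(xml_string):
    # Replace every <page no="..."> element by a text node carrying
    # (assumed) Wiki markup for a page break.
    doc = minidom.parseString(xml_string)
    for page in list(doc.getElementsByTagName("page")):
        number = page.getAttribute("no")
        marker = doc.createTextNode("\n== Seite %s ==\n" % number)
        page.parentNode.replaceChild(marker, page)
    return doc

if __name__ == "__main__":
    doc = pages_to_wiki_markup('<text>foo<page no="12"/>bar</text>')
    print(doc.toxml())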
And thirdly (3), the created Wiki markup is added to the MediaWiki system
using parts of the Python Wikipedia Robot Framework,11 a library offering
routines for tasks such as adding, deleting, and modifying pages of a Wiki or
changing the time stamps of pages. Fig. 3 shows an example of the converted
end result, as it can be accessed by a user.
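Purely as an illustration, a page upload with the classic pywikipedia interface might look roughly as follows; the site/family configuration, page title, and edit comment are assumptions, and the exact calls vary between framework versions.

# Rough sketch of pushing converted Wiki markup into MediaWiki via the
# Python Wikipedia Robot Framework ("pywikipedia"). The locally
# configured wiki family, the page title, and the comment are assumptions.
import wikipedia  # module provided by the pywikipedia framework

def upload_page(title, wikitext):
    site = wikipedia.getSite()   # uses the locally configured default site
    page = wikipedia.Page(site, title)
    page.put(wikitext, comment=u"Automated import of encyclopedia content")

if __name__ == "__main__":
    upload_page(u"Handbuch/Seite 12", u"== Seite 12 ==\n...")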
While users can (4) view, (5) add, or modify content directly through the
Wiki system, an interesting question was how to integrate the NLP subsystem,
so that it can read information (like the encyclopedia, user notes, or other pages)
from the Wiki as well and deliver newly discovered information back to the users.
9 http://www.zdv.uni-tuebingen.de/tustep/tustep_eng.html
10 http://www.w3.org/DOM/
11 http://pywikipediabot.sf.net
[Fig. 4. NLP pipeline for the generation of a full-text index (left side) and its
integration into the Wiki system (right side). Recovered example: POS Tagger —
für/APPR eine/ART äußere/ADJA Abfasung/NN der/ART Kanten/NN; NP Chunker —
NP:[DET:eine MOD:äußere HEAD:Abfasung], NP:[DET:der HEAD:Kanten]; Lemmatizer.]
We now discuss some of the NLP pipelines currently in use; however, it is im-
portant to note that new applications can easily be assembled from components
and deployed within our architecture.
[Fig. 5. Named entity recognition and export pipeline; recovered labels: text,
gazetteer lists, Ontogazetteer, OWL, GrOWL.]
The addition of ontologies (in DAML format) makes it possible to locate entities
within an ontology (currently, GATE only supports taxonomic relationships) through
the ontology extensions of the Gazetteer and JAPE components. The detected enti-
ties are then exported in an XML format for insertion into the Wiki and as an
OWL RDF file (Fig. 5).
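As a sketch only—assuming a simple XML format for the detected entities and an illustrative namespace, neither of which is specified in the paper—such an OWL/RDF export could look roughly like this using rdflib:

# Minimal sketch: turn a (hypothetical) XML list of detected entities
# into RDF statements. Input format, namespace, and property names
# are illustrative assumptions, not the project's actual vocabulary.
from xml.etree import ElementTree as ET
from rdflib import Graph, Namespace, Literal, RDF, RDFS, URIRef

HDA = Namespace("http://example.org/handbuch#")  # assumed namespace

def entities_to_owl(entity_xml, out_path):
    graph = Graph()
    graph.bind("hda", HDA)
    root = ET.fromstring(entity_xml)
    for i, ent in enumerate(root.findall("entity")):
        subject = URIRef(HDA["entity%d" % i])
        graph.add((subject, RDF.type, HDA[ent.get("class")]))
        graph.add((subject, RDFS.label, Literal(ent.text)))
        graph.add((subject, HDA.page, Literal(ent.get("page"))))
    graph.serialize(destination=out_path, format="xml")

if __name__ == "__main__":
    xml = '<entities><entity class="Material" page="12">Sandstein</entity></entities>'
    entities_to_owl(xml, "entities.owl")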
NE results are integrated into the Wiki similarly to the index system de-
scribed above, linking entities to content pages. The additional OWL export al-
lows for graphical navigation of the content through an ontology browser like
GrOWL.13 The ontologies exported by the NLP subsystem contain sentences as
another top-level concept, which makes it possible to navigate from domain-specific
terms directly to the positions in the document mentioning a concept, as shown in Fig. 6.
5 Evaluation
We illustrate a complex example scenario where a building historian or architect
would ask for support from the integrated system.
5.1 Scenario
The iterative analysis process, oriented towards the different requirements of the two
user groups, is currently being tested on the volume “Wände und Wandöffnungen”14
(walls and wall openings). It describes the construction of walls, windows, and
doors according to the type of building material. The volume has 506 pages with
956 figures; it contains a total of 341 021 tokens and 81 741 noun phrases.
Both user groups are involved in a common scenario: The building historian
is analysing a 19th century building with regard to its worth of preservation in
order to be able to identify and classify its historical, cultural, and technical
13 http://seek.ecoinformatics.org/Wiki.jsp?page=Growl
14 E. Marx: Wände und Wandöffnungen. Aus der Reihe: Handbuch der Architektur.
Dritter Teil, 2. Band, Heft I, 2. Auflage, Stuttgart 1900.
value. The quoins, the window lintels, jambs, and sills, as well as the door lintels and
reveals, are made of finely wrought, parallelepipedal cut sandstone. The walls are
laid in inferior and partly defective brickwork. Vestiges of clay can be found on
the joint and corner zones of the brickwork. Therefore, a building historian could
make the educated guess that the bricks had been rendered with at least one
layer of external plaster. Following an inspection of the building together with
a restorer, the historian is searching in building documents and other historical
sources for references to the different construction phases. In order to analyse
the findings it is necessary to become acquainted with plaster techniques and
building materials. Appropriate definitions and linked information can be found
in the encyclopedia and other sources. For example, he would like to determine
the date of origin of each constructional element and whether it is original or has
been replaced by other components. Was it built according to the state of the art,
and does it feature particular details?
In addition, he would like to learn about the different techniques of plastering
and the resulting surfaces as well as the necessary tools. To discuss his findings
and exchange experiences he may need to communicate with other colleagues.
Even though he is dealing with the same building, the architect’s aim is a differ-
ent one. His job is to restore the building as carefully as possible. Consequently, he
needs to become acquainted with suitable building techniques and materials, for
example, information about the restoration of the brick bond. A comprehensive
literature search may offer some valuable references to complement the conclu-
sion resulting from the first building inspection and the documentation of the
construction phases.
5.2 Semantic Support within the Scenario
So far, we have been testing the desktop with the Wiki system and three inte-
grated NLP tools within the project. We illustrate how our users ask for semantic
support from the system within the stated scenario.
NLP Index. As the tested volume offers only a table of contents but no index
of its own, an automatically generated index is a very helpful and time-saving tool
for further research: it is now possible to get a detailed record of which pages
contain relevant information about a certain term. And because the adjectives
modifying the terms are indicated as well, information can be found and retrieved
very quickly, e.g., the architect analysing the plain brickwork will search for all pages
referring to the term “Wand” (wall) and in particular to “unverputzte Wand”
(unplastered wall).
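As a simplified illustration of the underlying index data structure (not the actual pipeline of Fig. 4), lemmatized noun-phrase heads and their adjectival modifiers could be mapped to page numbers roughly as follows; the chunk tuples are hand-made examples:

# Simplified illustration of a noun-phrase index: lemmatized head nouns
# map to page numbers, optionally refined by adjectival modifiers
# ("unverputzte Wand" vs. "Wand"). In the real system the chunks come
# from the POS tagger / NP chunker / lemmatizer pipeline.
from collections import defaultdict

def build_index(chunks):
    """chunks: iterable of (page, head_lemma, modifier_lemmas) tuples."""
    index = defaultdict(set)
    for page, head, modifiers in chunks:
        index[head].add(page)                       # plain term, e.g. "Wand"
        for mod in modifiers:
            index["%s %s" % (mod, head)].add(page)  # e.g. "unverputzt Wand"
    return index

if __name__ == "__main__":
    chunks = [(17, "Wand", ["unverputzt"]), (23, "Wand", []), (17, "Abfasung", ["äußere"])]
    idx = build_index(chunks)
    print(sorted(idx["Wand"]))             # [17, 23]
    print(sorted(idx["unverputzt Wand"]))  # [17]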
Summaries. Interesting information about a certain topic is often distributed
across the different chapters of a volume. In this case the possibility of generating
an automatic summary based on a context is another time-saving advantage. The
summary provides a series of relevant sentences, e.g., for the question (Fig. 7):
“Welche Art von Putz bietet Schutz vor Witterung?” (Which kind of plaster
would be suitable to protect brickwork against weather influences?). An inter-
esting property of these context-based summaries is that they often provide
“unexpected information,” relevant content that a user most likely would not
have found directly.
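The summarizer actually used is based on ERSS and fuzzy coreference resolution [2,11]; purely to illustrate the idea of context-based sentence selection, a much simpler term-overlap ranking could be sketched as follows (a stand-in, not the project’s summarizer):

# Stand-in illustration of context-based sentence selection: rank
# sentences by term overlap with the user's question. The project's
# actual summarizer (ERSS) relies on fuzzy coreference chains instead.
import re

def rank_sentences(text, question, top_n=3):
    terms = set(re.findall(r"\w+", question.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", text)
    scored = [(len(terms & set(re.findall(r"\w+", s.lower()))), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored[:top_n] if score > 0]

if __name__ == "__main__":
    frage = "Welche Art von Putz bietet Schutz vor Witterung?"
    text = "Der Putz bietet Schutz vor Witterung. Ziegel sind ein Baustoff."
    print(rank_sentences(text, frage, top_n=1))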
The first sentence of the automatic summary states that the joint filling is
important for the resistance of the brickwork, especially for those parts exposed
to the weather, as is the quality of the bricks. This is interesting for our
example because the architect can find in the handbook—following the link—
some information about the quality of bricks. Now he may be able to realize that
the bricks used for the walls of our 19th century building are not intended for
fair-faced masonry. After that he can examine the brickwork and will find the
mentioned vestiges of clay.
The architect can now communicate his findings via the Wiki discussion page.
After studying the same text passage the building historian identifies the kind
of brickwork, possibly finding a parallel to another building in the neighborhood,
researched one year ago. So far, he was not able to date the former building
precisely because all building records have been lost during the war. But our
example building has a building date above the entrance door and therefore he
is now able to date both of them.
Named Entity Recognition and Ontology-based Navigation. Browsing the content,
either graphically or textually, through ontological concepts is another helpful
tool for the users, especially if they are not familiar in detail with the subject field
of the search, as it now becomes possible to approach it by switching to super-
or subordinate concepts or instances in order to get an overview. For example,
restoration of the windows requires information about their iron construction. Thus,
a user can start his search with the concept “Eisen” (iron) in the ontology (see
Fig. 6). He can now navigate to instances in the handbook that have been linked
to “iron” through the NLP subsystem, finding content that mentions window and
wall constructions using iron. Then he can switch directly to the indicated parts
of the original text, or start a more precise query with the gained information.
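For illustration, navigating from a concept such as “Eisen” to the sentences that mention it could be sketched over the exported OWL file as follows; the namespace and the “mentions” property are assumptions carried over from the export sketch above, not the system’s actual vocabulary:

# Minimal sketch of ontology-based navigation: load the exported OWL
# file and list the sentences linked to a given concept. Namespace and
# property names are illustrative assumptions.
from rdflib import Graph, Namespace, RDFS

HDA = Namespace("http://example.org/handbuch#")  # assumed namespace

def sentences_mentioning(owl_path, concept_name):
    graph = Graph()
    graph.parse(owl_path, format="xml")
    concept = HDA[concept_name]
    for sentence in graph.subjects(HDA.mentions, concept):
        label = graph.value(sentence, RDFS.label)
        yield str(label) if label is not None else str(sentence)

if __name__ == "__main__":
    for text in sentences_mentioning("entities.owl", "Eisen"):
        print(text)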
5.3 Summary
The offered semantic desktop tools, tested so far on a single complete volume of
the encyclopedia, turned out to be a real support for both our building historians
and architects: Automatic indices, summaries, and ontology-based navigation
can help them to find relevant, precisely structured and cross-linked information
to certain, even complex topics in a quick and convenient fashion. The system’s
ability to cross-link, network, and combine content across the whole collection
has the potential to guide the user to unexpected information, which he might
not have noticed even when reading the sources in their entirety.
The tools’ time-saving effect seems to be their biggest advantage:
Both user groups can now concentrate on their research or building tasks—they
do not need to deal with the time-consuming and difficult process of finding
interesting and relevant information.
References
1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval.
Addison-Wesley, 1999.
2. Sabine Bergler, René Witte, Zhuoyan Li, Michelle Khalifé, Yunyu Chen, Monia
Doandes, and Alina Andreevskaia. Multi-ERSS and ERSS 2004. In Workshop on
Text Summarization, Document Understanding Conference (DUC), Boston Park
Plaza Hotel and Towers, Boston, USA, May 6–7 2004. NIST.
3. Document Understanding Conference. http://duc.nist.gov/.
4. H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework
and graphical development environment for robust NLP tools and applications. In
Proc. of the 40th Anniversary Meeting of the ACL, 2002. http://gate.ac.uk.
5. Reginald Ferber. Information Retrieval. dpunkt.verlag, 2003.
6. Ulrike Grammbitter. Josef Durm (1837–1919). Eine Einführung in das ar-
chitektonische Werk, volume 9 of tuduv-Studien: Reihe Kunstgeschichte. tuduv-
Verlagsgesellschaft, München, 1984. ISBN 3-88073-148-9.
7. Bo Leuf and Ward Cunningham. The Wiki Way, Quick Collaboration on the Web.
Addison-Wesley, 2001.
8. Praharshana Perera and René Witte. A Self-Learning Context-Aware Lemmatizer
for German. In Human Language Technology Conference/Conference on Empirical
Methods in Natural Language Processing (HLT/EMNLP 2005), Vancouver, B.C.,
Canada, October 6–8 2005.
9. Wikipedia, the free encyclopedia. MediaWiki. http://en.wikipedia.org/wiki/MediaWiki;
accessed July 26, 2005.
10. René Witte. An Integration Architecture for User-Centric Document Creation,
Retrieval, and Analysis. In Proceedings of the VLDB Workshop on Information
Integration on the Web (IIWeb), pages 141–144, Toronto, Canada, August 30 2004.
11. René Witte and Sabine Bergler. Fuzzy Coreference Resolution for Summarization.
In Proc. of 2003 Int. Symposium on Reference Resolution and Its Applications
to Question Answering and Summarization (ARQAS), pages 43–50, Venice, Italy,
June 23–24 2003. Università Ca’ Foscari. http://rene-witte.net.
12. Ning Zhong, Jiming Liu, and Yiyu Yao, editors. Web Intelligence. Springer, 2003.