TOWARDS GRID ENABLED INFORMATION RETRIEVAL
Babak Akhgar, Nahum Korda, Jawed Siddiqi and Mehrdad Naderi
Sheffield Hallam University
School of Computing and Management Sciences
Sheffield, UK
ABSTRACT
Our research aims to further our understanding of Information Retrieval for management of knowledge within the Grid
environment. We do so by developing a search and categorisation toolkit which utilises Grid services for IR. We
propose and detail a novel workflow model, based on Grid-enabled IR services, that describes the interaction and
orchestration between core IR functionalities and Grid services.
KEYWORDS
Grid Technology, Information Retrieval, Workflow Model
1. INTRODUCTION
The emergence of ‘grid computing’ technologies often carries much hype and high expectations; Shread [1]
describes it as “…the fifth wave of the IT revolution”, the fourth wave being that of the Internet.
Certainly, the combined capability of the Internet and Grid technologies promises to change how complex
problems are handled through enabling large-scale aggregation and sharing of computational, data,
information, knowledge and other resources across institutional and geographical boundaries.
Data, information and knowledge are key strategic resources within enterprises and Virtual Organisations (VO)
(Siddiqi and Akhgar [2]). For the purposes of this paper we consider a VO to be a logical entity for the dynamic
coordination and maintenance of Data, Information and Knowledge (DIK) between providers and consumers alike
(Akhgar et al. [3]). Research by Kesselman et al. [4] and Allan and Hanlon [5] suggests that optimal execution of
this dynamic coordination and maintenance of DIK within a VO requires a self-contained, self-describing and
discoverable set of services with a pre-defined ontological structure, which enact Information Retrieval (IR)
processes (e.g. Querying and Personalisation), as stated in Akhgar and Siddiqi [6].
Partial realisations of such services, encapsulating advanced KM services, can be seen in the Globus Toolkit and
the GEODISE portal [7].
Our current research aims to describe the key requirements for the provision of knowledge management (KM)
services within a Grid environment. In this paper we present a generic set of IR workflows necessary for
realising a knowledge services architectural design. The workflows are based on a canonical set of requirements
for the development of KM services within a Grid environment, obtained during the requirements engineering and
architectural design of the GRACE project [3]. GRACE is an EU-funded project under the "Information Society
Technology Programme" of FP5. The project aims to deliver a Grid-enabled Search and Categorisation Engine that
makes highly accessible the terabytes of information that already exist and are distributed across a vast number
of geographically distant locations. Our focus and contribution in the GRACE toolkit development is the
identification, elaboration, validation and evaluation of the grid-enabled application technology needed to
enable next-generation KM services based on IR principles. These services build upon core Grid services and focus
on the design of KM technologies for knowledge workers, communities and organisations alike.
IADIS International Conference WWW/Internet 2004
2. GRID BASED INFORMATION RETRIEVAL
Based on research by Jansen et al. [8] and Akhgar and Siddiqi [6], Information Retrieval (IR) and its enabling
processes, such as search and categorization, is one of the key enablers of knowledge management.
The notion of IR over the Grid for the purpose of KM has recently been drawing attention in the community of Grid
developers (e.g. the GIR Project). This is because the Open Grid Services Architecture (OGSA) and the subsequent
Web Services Resource Framework (WSRF) have made the Grid quite similar to the World Wide Web (WWW), and
consequently the feasibility of a World Wide Grid (WWG) seems today more real than ever. IR is an essential
aspect of the WWW, and it must also be an essential aspect of the future WWG.
Given the operational difficulty posed by the vast amount of DIK on Grid platforms and the diversity of the
knowledge on offer in terms of content, distributed IR is clearly preferable in the Grid context to the
centralized approach of the WWW. Instead of allowing crawlers to index the published content into a
centralized catalogue, the new method of publishing could generate local, independent catalogues. Like the
specialized databases or the repositories belonging to vertical communities today, these catalogues could be
maintained around a particular domain of interest or by a particular group of stakeholders, or both.
The publishing procedure would hence include the decision as to which catalogue(s) the published content should
be registered in so as to best serve the publishing objectives. Accordingly, instead of querying a single,
centralized catalogue as on the WWW today, query requests would be distributed across a variety of
specialized catalogues. This is in line with the query routing of meta-search, except that here it could be
guided by the underlying semantics of the query request: a query related to a particular subject could be
automatically routed only to the content sources relevant to that subject. Distributed IR is thus closely related
to the vision of the Semantic Web (SW). The SW aims at making WWW content not only machine-readable,
but also machine-understandable [9]. This is crucial if query routing is to be guided by the underlying
semantics. In order to allow VO members to share knowledge, Grid publishing and knowledge
discovery must therefore build on SW principles.
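The subject-guided routing described above can be illustrated with a minimal sketch. It is not part of GRACE: the
Catalogue class, the subject labels and the simple set-overlap rule that stands in for "underlying semantics" are
all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Catalogue:
    """A local, independent catalogue maintained around some subjects (hypothetical type)."""
    name: str
    subjects: frozenset


def route_query(query_subjects, catalogues):
    """Return the names of the catalogues whose subjects overlap the query's subjects.

    A real system would derive the subjects from an ontology; here they are
    passed in directly, which is a simplifying assumption.
    """
    wanted = frozenset(query_subjects)
    return [c.name for c in catalogues if c.subjects & wanted]


# Illustrative catalogues; the names and subjects are invented for the example.
catalogues = [
    Catalogue("physics-archive", frozenset({"physics", "astronomy"})),
    Catalogue("bio-repository", frozenset({"biology", "medicine"})),
    Catalogue("general-science", frozenset({"physics", "biology"})),
]

routed = route_query({"biology"}, catalogues)
print(routed)
```

A query whose subjects match no catalogue is routed nowhere, which is the intended difference from broadcasting
every query to a single centralized catalogue.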
3. PROPOSED WORKFLOW MODEL
GRACE designed a system architecture that, on the one hand, maximizes the utilization of the Grid
infrastructure but, on the other, also develops specific solutions wherever the Grid cannot offer them. This
architecture carefully balances operations outside and “above” the Grid with “sinking” into the Grid when
appropriate. The Grid is thus considered in this architecture as an enormous storage space, but one with rather
inefficient querying mechanisms. This is why querying and knowledge discovery are currently kept mostly
outside the Grid, sinking into it only to fetch the content stored therein.
During the requirements elicitation process of the GRACE project [10] we identified a number of key
requirements for satisfying core IR necessities over the Grid. They are as follows:
Publishing – one or more documents are submitted to GRACE with the intention to retrieve them in response
to queries;
Querying – a query request is submitted to GRACE expecting a list of links to documents to be retrieved as
search results;
Document access – a document is selected from the search results expecting to have it presented for viewing;
Personalization – personal information used by the application is either submitted to or retrieved from
GRACE.
The system architecture was developed in order to offer the optimal solution to each of the above IR
requirements. The key design components of the GRACE architecture [10] and the services they employ are
listed in Table 1, which presents the GRACE services executed by each of the components. It is important to note
that the GRACE architecture is strictly service oriented, anticipating future OGSA compliance of the underlying
EDG infrastructure.
Table 1: Core Architecture and Services of the GRACE project
GRACE Component
Front-end Application
Services
Grid Application Layer Services
Backend
Workflow Manager
Grid Publisher
Document Processing Service
Personalization Service
Knowledge Domain Registration Service
Content Source Registration Service
Document Storage Service
Query Routing Service
Document Retrieving Service
Utilized by Document Processing Service in order to execute Grid job
Grid Storage element
Federated Search Manager
Normalization and Categorization Engine
Storage Element
The workflow for each of the services is presented separately below, introducing the major components of the
system and the way they interact with the Grid infrastructure in order to provide novel execution of IR
processes in the Grid context. The process and functional representations of these workflows are kept at an
abstract level so that other researchers can generalise them when developing use cases for their own
application domains.
Since GRACE is to be integrated into a test-bed with the European DataGrid (EDG), which is in turn based on the
Globus Toolkit, the following presentation of the Grid infrastructure relates specifically to EDG
(in particular EDG 2.0). Figure 1 illustrates the logical layers of the EDG architecture.
Figure 1. European DataGrid Layers
Nonetheless, instead of detailing the integration interfaces, this presentation will use the following
abstraction of the Grid infrastructure:
The EDG Application Layer will be represented by only the two elements that will be utilized: Data Management and
Job Management. On top of the Application Layer various GRACE services will be introduced.
Grid middleware (i.e. Collective Services and Underlying Grid Services) will be treated as a “black box”,
without detailing what is going on inside it and how its various components are utilized in order to
accomplish the final results. The same is true of the Fabric Services that are Grid’s interface to the underlying
hardware systems (the so called “Grid fabric”). At the fabric layer various GRACE data storage elements will
be introduced. Figure 2 illustrates the primary services of the GRACE toolkit.
Figure 2. GRACE Layers
Each workflow will thus be presented as being initiated by a GRACE service. The operations “above” the Grid will
be explained in detail, the Grid elements interfaced at the Application Layer will be identified and, finally,
the corresponding underlying GRACE data storage elements will be pointed out, without explaining the internal
workflow within the Grid middleware. Each workflow description opens with a summary of the GRACE services
involved, the EDG elements interfaced (at the Application Layer) and the GRACE data storage elements utilized.
This opening summary is followed by a detailed explanation of the workflow.
3.1 Publishing
GRACE Services: Knowledge Domain Registration Service, Document Processing Service, Document Storage Service,
Content Source Registration Service
EDG Application Layer Elements: Data Management
GRACE Data Storage Elements: Knowledge Domain Registry, Document Repository, Normalized Document Format Repository
Grid publishing is based on the concepts of Content Sources (i.e. searchable repositories of documents and
corresponding metadata) and Knowledge Domains (KDs, i.e. abstractions of multiple content sources that share a
common denominator, such as belonging to the same domain of human knowledge).
The publishing workflow starts with the registration of the Knowledge Domains. KDs are registered by the KD
Registration Service in the KD Registry, which resides on the Grid. The KD Registration Service interfaces the
Data Management element of the EDG Application Layer in order to access the KD Registry. Figure 3 illustrates
the GRACE services involved in the publishing workflow and their interaction with those of EDG.
Figure 3. Publishing Services and Workflow
From here the workflow proceeds slightly differently for internal and external content sources. An internal
search engine receives a document, normalizes it and indexes it. The document is stored on the Grid in a
Document Repository. The Document Storage Service interfaces the Data Management element of the EDG
Application Layer in order to access the Document Repository. The GRACE Search Engine utilizes the Document
Processing Service in order to normalize the document and generate its Normalized Document Format (NDF). The NDF
is also stored on the Grid, in the NDF Repository. The GRACE Search Engine interfaces the Data Management
element of the EDG Application Layer in order to access the NDF Repository. The index, on the other hand, is
stored locally for rapid querying. This local index is distributed between the GRACE nodes, and its distribution
follows the organization of document sources into Knowledge Domains.
External content sources, by contrast, undergo registration by the Content Source Registration Service. The
registration includes a technical definition of how the content source is to be queried and how the
required documents should be extracted from it. It also includes the assignment of the content source to one or
more KDs. In addition, one or more knowledge representation systems (i.e. ontologies, taxonomies) can be
assigned to a content source. This information is stored locally in the central configuration repository.
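The two publishing paths can be sketched as follows. This is a simplified illustration under stated assumptions,
not the GRACE implementation: the GridStore and LocalState classes, the toy normalize function and the example.org
query template are hypothetical, and plain dictionaries stand in for the repositories that GRACE reaches through
EDG Data Management.

```python
from dataclasses import dataclass, field


@dataclass
class GridStore:
    """Stand-in for the repositories held on the Grid (hypothetical)."""
    documents: dict = field(default_factory=dict)  # Document Repository
    ndfs: dict = field(default_factory=dict)       # NDF Repository


@dataclass
class LocalState:
    """Data kept outside the Grid: the local index and the configuration repository."""
    index: dict = field(default_factory=dict)            # term -> set of document ids
    content_sources: dict = field(default_factory=dict)  # source name -> registration


def normalize(text):
    """Toy normalization into a list of lower-cased words (an assumption, not the real NDF)."""
    return text.lower().split()


def publish_internal(doc_id, text, grid, local):
    """Internal path: document and NDF go to the Grid, the index stays local."""
    grid.documents[doc_id] = text
    ndf = normalize(text)
    grid.ndfs[doc_id] = ndf
    for term in ndf:
        local.index.setdefault(term, set()).add(doc_id)


def register_external(name, query_template, domains, local):
    """External path: record how the source is queried and which KDs it belongs to."""
    local.content_sources[name] = {
        "query_template": query_template,
        "domains": list(domains),
    }


grid, local = GridStore(), LocalState()
publish_internal("d1", "Grid Enabled Retrieval", grid, local)
register_external("ext-lib", "http://example.org/search?q={q}", ["computing"], local)
```

Keeping the index in LocalState rather than GridStore mirrors the design choice described above: the Grid holds
the content, while the structure that must answer queries quickly stays local.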
3.2 Querying
GRACE Services: Query Routing Service, Document Retrieving Service, Document Processing Service
EDG Application Layer Elements: Data Management, Job Management
GRACE Data Storage Elements: Knowledge Domain Registry, Normalized Document Format Repository
The query is transferred from the front-end application to the Query Routing Service. This service interfaces
the Data Management element of the EDG Application Layer in order to retrieve information regarding the
relevant Knowledge Domains from the KD Registry. This information, together with the information
regarding the relevant content sources from the central configuration repository, is necessary in order to
transform the original query into the specific formats required by the various content sources. Figure 4
illustrates the GRACE services involved in the querying workflow and their interaction with those of EDG.
Figure 4. Querying Services and Workflow
These formatted queries are next routed to the content sources and the search results are received. The
search results are parsed by the Query Routing Service in order to extract document URLs, which are passed
to the Document Retrieving Service. From here the workflow proceeds differently for internal and
external content sources. In the case of an external content source, the Document Retrieving Service simply
downloads the document and transfers it to the Document Processing Service. In the case of an internal
content source, the Normalized Document Format, which is required for categorization, already exists.
Accordingly, instead of retrieving the actual document, the Data Management element of the
EDG Application Layer is interfaced in order to retrieve the corresponding NDF from the NDF Repository.
Documents from external sources that were previously cached on the Grid are a special case. As with
internal documents, their NDF is already available and can be obtained without re-processing the
actual document.
NDFs from both external and internal documents are next processed by the Document Processing Service
in order to generate categories. The Document Processing Service interfaces the Job Management element of the
EDG Application Layer in order to initiate a Grid job. The Grid job returns the categories, which are transferred
to the front-end application. The front-end application uses knowledge representation systems (in particular
taxonomies) to present the categories immediately for browsing, and then gradually populates them with the actual
documents.
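The branching in this workflow, reusing a stored NDF where one exists and processing a download otherwise, can be
sketched as below. The function names, the keyword-based taxonomy and the grid:// and example.org identifiers are
illustrative assumptions; in GRACE the categorization step runs as a Grid job, whereas here it is an ordinary local
function.

```python
def collect_ndfs(urls, ndf_repository, download, normalize):
    """Return {url: ndf}, reusing stored NDFs and processing only documents without one."""
    ndfs = {}
    for url in urls:
        if url in ndf_repository:
            # Internal or previously cached document: the NDF already exists.
            ndfs[url] = ndf_repository[url]
        else:
            # External document: download it and normalize it now.
            ndfs[url] = normalize(download(url))
    return ndfs


def categorize(ndfs, taxonomy):
    """Toy categorization: a document joins every category whose keyword its NDF contains."""
    categories = {name: [] for name in taxonomy}
    for url, ndf in ndfs.items():
        for name, keyword in taxonomy.items():
            if keyword in ndf:
                categories[name].append(url)
    return categories


# Hypothetical data: one internal document with a stored NDF, one external document.
ndf_repository = {"grid://internal/doc1": ["grid", "services"]}
downloaded = []


def download(url):
    """Fake downloader that records which URLs were actually fetched."""
    downloaded.append(url)
    return "Semantic Web services"


ndfs = collect_ndfs(
    ["grid://internal/doc1", "http://example.org/doc2"],
    ndf_repository,
    download,
    lambda text: text.lower().split(),
)
categories = categorize(ndfs, {"Grid": "grid", "Web": "web"})
```

The categories dictionary is created with every taxonomy entry present before any document is assigned, which
corresponds to presenting the taxonomy first and populating it with documents afterwards.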
3.3 Document Retrieving
GRACE Services: Document Retrieving Service, Document Storage Service
EDG Application Layer Elements: Data Management
GRACE Data Storage Elements: Document Repository, Normalized Document Format Repository
Once the search results are presented, the end-user can select a document for inspection. This request is
transferred from the front-end application to the Document Retrieving Service. If the requested document is
stored in an internal content source or was cached at some point in the past, it is retrieved from the
corresponding Document Repository by interfacing the Data Management element of the EDG Application
Layer. Figure 5 illustrates the GRACE services involved in the document retrieving workflow and their
interaction with those of EDG.
Figure 5. Document Retrieving Services and Workflow
If, however, the requested document resides in an external document repository, it was already downloaded
during querying in order to allow categorization of its content, so it only needs to be retrieved from the
local temporary cache. In addition, the NDF of this document can be transferred to the Document Storage
Service in order to store it on the Grid in the NDF Repository for future use. In the same way, the downloaded
document can be stored on the Grid in a Document Repository for cached documents.
In this manner GRACE actually becomes a search engine similar to WWW-based search engines, in the
sense that it indexes not only internal but also external documents. WWW-based search engines
systematically utilize crawlers in order to reach external documents and index them. GRACE, on the other
hand, does not do so systematically, but indexes only those external documents retrieved in response to a query.
Accordingly, GRACE users in effect become GRACE crawling agents.
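The retrieval decision can be sketched as a single function. Everything here is a hypothetical stand-in:
dictionaries play the role of the Grid Document Repository and of the local temporary cache, and the persist flag
is an invented name for the optional step of storing a downloaded document on the Grid.

```python
def retrieve_document(url, document_repository, temp_cache, persist=True):
    """Return the document for url, preferring the Grid repository over the temporary cache.

    Raises KeyError if the document is in neither place.
    """
    if url in document_repository:
        # Internal document, or an external one cached on the Grid earlier.
        return document_repository[url]
    document = temp_cache[url]  # downloaded earlier, during querying
    if persist:
        # Optional step: keep the external document on the Grid for future use.
        document_repository[url] = document
    return document


# Hypothetical contents of the two stores.
document_repository = {"grid://internal/doc1": "internal text"}
temp_cache = {"http://example.org/doc2": "external text"}

first = retrieve_document("http://example.org/doc2", document_repository, temp_cache)
temp_cache.clear()  # the temporary cache may be emptied at any time
second = retrieve_document("http://example.org/doc2", document_repository, temp_cache)
```

The second call succeeds even though the temporary cache has been emptied, because the first call stored the
document; this is the sense in which user queries do the work that crawlers do for WWW-based search engines.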
3.4 Personalisation
GRACE Services: Personalization Service
EDG Application Layer Elements: Data Management
GRACE Data Storage Elements: User Profile Database
Personalization is merely an extension of the basic Grid authentication mechanisms with a GRACE user
profile. The user profile holds personal information strictly related to IR, for example: previous queries and
the corresponding search results, kept for future inspection; scheduled queries performed automatically by the
system; selected and hierarchically organized links to documents (so-called “Favourites” or “Bookmarks”); and
personal visualization preferences.
The user profile is envisioned as the basis of the collaborative features that are to be developed in the future
using GRACE as the infrastructure for knowledge management. User profiles are stored on the Grid in the
User Profile Database. Upon authentication by a VO, the matching user profile is retrieved from the Grid by
interfacing the Data Management element of the EDG Application Layer. Authentication with the Grid and
the retrieval of the user profile are both performed by the Personalization Service. Figure 6 illustrates the
GRACE services involved in the personalisation workflow and their interaction with those of EDG.
The Personalization Service is also responsible for updating the user profile with modifications and additional
search results.
Figure 6. Personalization Workflow
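A user profile of the kind described in this section could be modelled as in the sketch below. The field names,
the PersonalizationService methods and the dictionary standing in for the User Profile Database are assumptions
made for illustration, and authentication is assumed to have happened before these methods are called.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """IR-related personal information (field names are hypothetical)."""
    user_id: str
    past_queries: dict = field(default_factory=dict)       # query -> list of result links
    scheduled_queries: list = field(default_factory=list)  # queries run automatically
    favourites: dict = field(default_factory=dict)         # folder -> list of document links
    visualization: dict = field(default_factory=dict)      # personal display preferences


class PersonalizationService:
    """Loads and updates profiles; profile_db stands in for the User Profile Database."""

    def __init__(self, profile_db):
        self.profile_db = profile_db

    def load(self, user_id):
        """Return the stored profile of an authenticated user, creating an empty one if needed."""
        if user_id not in self.profile_db:
            self.profile_db[user_id] = UserProfile(user_id)
        return self.profile_db[user_id]

    def record_search(self, user_id, query, results):
        """Update the profile with a query and its search results."""
        profile = self.load(user_id)
        profile.past_queries[query] = list(results)
        return profile


service = PersonalizationService({})
profile = service.record_search("alice", "grid ir", ["grid://internal/doc1"])
```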
4. SUMMARY
The paper makes three primary contributions. First, we explored the key functional representation of IR over the
Grid as an enabling mechanism for Knowledge Management, based on the requirements elicitation and architectural
design of the GRACE project. Second, a workflow model and its decomposition based on Grid and GRACE
services is proposed and discussed. The workflow model builds on earlier research by Akhgar et al. [3] and
Siddiqi et al. [11] that identifies and describes a taxonomical classification of KM elements and proposes a
number of orchestrated KM services necessary for realising the KM paradigm over the Grid. Third, the
suggested workflows are presented at an abstract level in order to facilitate dissemination of the project
results to the research community, so that researchers can produce generalised use cases for their specific
problem situations.
The motivation for providing an early report of the results obtained so far is to give the IR/KM and Grid
communities an opportunity to explore different commercial and academic applications of IR/KM that harness
Grid computing. The knowledge management and information retrieval community could
participate by providing scenarios reflecting their information and knowledge needs, so that these could be
elaborated in terms of the KM model and subsequently executed on the GRACE toolkit. The Grid community
can assess the feasibility of launching IR/KM services on Grid platforms. Successful execution of the toolkit
and its deployment on a Grid platform would be a positive and significant step towards a semantically rich
IR/KM portal. We invite interested readers to visit the project website [10] and contact us to collaborate in
our mission to realise grid-enabled IR/KM as a first step towards a semantic grid.
REFERENCES
[1] Shread, P (2003); IBM Launches Commercial Grid Offerings [online]. Last accessed on 12 March
2003 at URL: http://www.gridcomputingplanet.com/features/article.php/11170_1573391
[2] Siddiqi, J and Akhgar, B (2002); Source Code as Strategic Information based Resource, The British Computer
Society Quality Special Interest Group's 10th Annual International Conference 'SQM
[3] Akhgar, B, Siddiqi, J and Naderi, M (2003); Grid enabled KM services in complex problem solving environment.
Complex systems in e-business CseB 2003, Slovenia.
[4] Kesselman, C, Nick, J and Tuecke, T (2002); On-line view of Grid. www.shu.ac.uk
[5] Allan, R and Hanlon, D; An Introduction to Web Services and related Technology for building an e-Science Grid,
document available at http://esc.dl.ac.uk/TechReports/WebServices/webServices_doc/
[6] Akhgar, B and Siddiqi, J (2001); A framework for the delivery of web-centric knowledge management applications,
Internet Computing IC'2001. Vol 1 page 47. CSREA Press.
[7] GEODISE portal. www.geodise.org
[8] Jansen, B.J., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: a study and analysis of user
queries on the Web. Information Processing and Management, 36, 207-227.
[9] Berners-Lee, T (1999); Weaving The Web. Orion Business Books.
[10] GRACE Project Web site, www.grace-ist.org, Documentation.
[11] Siddiqi, J, Akhgar, B and Naderi, M (2003); Grid Based Knowledge Management Systems. To appear in Jan 2004 in
the International Symposium on Collaborative Technologies and Systems 2004, USA.