
Towards Grid Enabled Information Retrieval

2004, International Association for Development of the Information Society

Babak Akhgar, Nahum Korda, Jawed Siddiqi and Mehrdad Naderi
Sheffield Hallam University, School of Computing and Management Sciences, Sheffield, UK

ABSTRACT

Our research aims to further our understanding of Information Retrieval (IR) for the management of knowledge within the Grid environment. We do so by developing a search and categorisation toolkit which utilises Grid services for IR. A novel Grid-enabled, IR-services-based workflow model is proposed and detailed; it describes the interaction and orchestration between core IR functionalities and Grid services.

KEYWORDS

Grid Technology, Information Retrieval, Workflow Model

1. INTRODUCTION

The emergence of ‘grid computing’ technologies often carries much hype and high expectations; Shread [1] describes it as “…the fifth wave of the IT revolution”, the fourth wave being that of the Internet. Certainly, the combined capability of the Internet and Grid technologies promises to change how complex problems are handled, by enabling large-scale aggregation and sharing of computational, data, information, knowledge and other resources across institutional and geographical boundaries. Given the critical importance of data, information and knowledge as the key strategic resources within enterprises and Virtual Organisations (VOs) (Siddiqi and Akhgar [2]), for the purposes of this paper we consider a VO to be a logical entity for the dynamic coordination and maintenance of Data, Information and Knowledge (DIK) between providers and consumers alike (Akhgar et al. [3]). Research by Kesselman et al. [4] and Allan and Hanlon [5] suggests that optimum execution of this dynamic coordination and maintenance of DIK within a VO requires a self-contained, self-describing and discoverable set of services with a pre-defined ontological structure, which enact Information Retrieval (IR) processes (e.g. Querying and Personalisation) as stated in Akhgar and Siddiqi [6].
Examples of the partial realisation of such services, encapsulating advanced KM services, can be seen in the Globus Toolkit and the GEODISE portal [7]. Our current research aims to describe the key requirements for the provision of knowledge management (KM) services within a Grid environment. In this paper we present a generic set of IR workflows necessary for realising the architectural design of knowledge services. The workflows are based on a canonical set of requirements for the development of KM services within a Grid environment, obtained during the requirements engineering and architectural design of the GRACE project [3].

GRACE is an EU-funded project under the "Information Society Technology Programme" FP5. The project aims to deliver a Grid-enabled Search and Categorisation Engine that makes highly accessible the terabytes of information that already exist, distributed across a vast number of geographically distant locations. Our focus and contribution in GRACE toolkit development is the identification, elaboration, validation and evaluation of the Grid-enabled application technology needed for the next generation of KM services based on IR principles. These services build upon core Grid services and focus on the design of KM technologies for knowledge workers, communities and organisations alike.

IADIS International Conference WWW/Internet 2004

2. GRID BASED INFORMATION RETRIEVAL

Based on research by Jansen et al. [8] and Akhgar and Siddiqi [6], Information Retrieval (IR), together with its enabling processes such as search and categorization, is one of the key enablers of knowledge management. The notion of IR over the Grid for the purpose of KM has recently been drawing attention in the community of Grid developers (e.g. the GIR Project). This is because the Open Grid Services Architecture (OGSA) and the subsequent Web Services Resource Framework (WSRF) have made the Grid quite similar to the World Wide Web (WWW), and consequently the World Wide Grid (WWG) seems more feasible today than ever.
IR is an essential aspect of the WWW and must also be an essential aspect of the future WWG. Given the operational difficulty posed by the vast amount of DIK on Grid platforms and the diversity of the knowledge on offer, distributed IR is clearly preferable in the Grid context to the centralized approach of the WWW. Instead of allowing crawlers to index the published content into a centralized catalogue, the new method of publishing could generate local, independent catalogues. Like the specialized databases or the repositories belonging to vertical communities today, these catalogues could be maintained around a particular domain of interest, by a particular group of stakeholders, or both. The publishing procedure would hence include the decision as to which catalogue(s) the published content should be registered in, in order to best serve the publishing objectives. Accordingly, instead of querying a single, centralized catalogue as on the WWW today, query requests would be distributed across a variety of specialized catalogues. This is in line with the query routing of meta-search, except that here it could be guided by the underlying semantics of the query request: a query related to a particular subject could be automatically routed only to the content sources relevant to that subject.

Distributed IR is thus closely related to the vision of the Semantic Web (SW). The SW aims at making WWW content not only machine-readable, but also machine-understandable [9]. This is crucial if query routing is to be guided by the underlying semantics. In order to allow VO members to share knowledge, Grid publishing and knowledge discovery must therefore build on SW principles.

3. PROPOSED WORKFLOW MODEL

GRACE designed a system architecture that, on the one hand, maximizes the utilization of the Grid infrastructure but, on the other, also develops specific solutions wherever the Grid cannot offer them.
This architecture carefully balances operations outside and “above” the Grid with “sinking” into the Grid when appropriate. The Grid is thus considered in this architecture as an enormous storage space, but one with rather inefficient querying mechanisms. This is why querying and knowledge discovery are currently kept mostly outside the Grid, sinking into it only to fetch the content stored therein.

During the requirements elicitation process of the GRACE project [10] we identified a number of key requirements for satisfying core IR necessities over the Grid. They are as follows:
Publishing: one or more documents are submitted to GRACE with the intention of retrieving them in response to queries;
Querying: a query request is submitted to GRACE, expecting a list of links to documents to be returned as search results;
Document access: a document is selected from the search results, expecting to have it presented for viewing;
Personalization: personal information used by the application is either submitted to or retrieved from GRACE.

The system architecture was developed in order to offer the optimal solution to each of the above IR requirements. The key design components of the GRACE architecture [10] and the services they employ are listed in Table 1. It is important to note that the GRACE architecture is strictly service oriented, anticipating future OGSA compliance of the underlying EDG infrastructure.
The following table presents which GRACE services are executed by each of the components:

Table 1: Core Architectural Components and Services of the GRACE project
Column headings: GRACE Component; Front-end Application Services; Grid Application Layer Services; Backend
Entries: Workflow Manager; Grid Publisher; Document Processing Service; Personalization Service; Knowledge Domain Registration Service; Content Source Registration Service; Document Storage Service; Query Routing Service; Document Retrieving Service; Utilized by Document Processing Service in order to execute Grid job; Grid Storage element; Federated Search Manager; Normalization and Categorization Engine; Storage Element

The workflow for each service is presented separately below, introducing the major components of the system and the way they interact with the Grid infrastructure to provide novel execution of IR processes in the Grid context. The process and functional representations of these workflows are kept at an abstract level so that other researchers can generalize them into use cases for their own application domains.

Since GRACE is to be integrated into a test-bed with the European DataGrid (EDG), which is in turn based on the Globus Toolkit, the following presentation of the Grid infrastructure relates specifically to EDG (in particular EDG 2.0). Figure 1 illustrates the logical layers of the EDG architecture.

Figure 1. European DataGrid Layers

Instead of detailing the integration interfaces, however, this presentation uses the following abstraction of the Grid infrastructure. The EDG Application Layer is presented with only the two elements that are utilized: Data Management and Job Management. On top of the Application Layer, various GRACE services are introduced. Grid middleware (i.e. Collective Services and Underlying Grid Services) is treated as a “black box”, without detailing what goes on inside it or how its various components are utilized to accomplish the final results. The same is true of the Fabric Services, which are the Grid’s interface to the underlying hardware systems (the so-called “Grid fabric”). At the fabric layer, various GRACE data storage elements are introduced. Figure 2 illustrates the primary services of the GRACE toolkit.

Figure 2. GRACE Layers

Each workflow is thus presented as being initiated by a GRACE service. The operations “above” the Grid are explained in detail, the Grid elements interfaced at the Application Layer are identified, and finally the corresponding underlying GRACE data storage elements are pointed out, without explaining the internal workflow within the Grid middleware. Each scenario opens with a summary of the GRACE services involved, the EDG elements interfaced (at the Application Layer), and the GRACE data storage elements utilized. This summary is followed by a detailed explanation of the workflow.

3.1 Publishing
GRACE Services: Knowledge Domain Registration Service, Document Processing Service, Document Storage Service, Content Source Registration Service
EDG Application Layer Elements: Data Management
GRACE Data Storage Elements: Knowledge Domain Registry, Document Repository, Normalized Document Format Repository

Grid publishing is based around the concepts of Content Sources (i.e. searchable repositories of documents and corresponding metadata) and Knowledge Domains (KDs, i.e. abstractions of multiple content sources that share a common denominator, such as belonging to the same domain of human knowledge). The publishing workflow starts with the registration of the Knowledge Domains.
KDs are registered by the KD Registration Service in the KD Registry, which resides on the Grid. The KD Registration Service interfaces with the Data Management element of the EDG Application Layer in order to access the KD Registry. Figure 1 illustrates the GRACE services involved and their interaction with those of EDG for the publishing workflow.

Figure 1. Publishing Services and Workflow

From here the workflow proceeds slightly differently for internal and external content sources. An internal search engine receives a document, normalizes it and indexes it. The document is stored on the Grid in a Document Repository. The Document Storage Service interfaces with the Data Management element of the EDG Application Layer in order to access the Document Repository. The GRACE Search Engine utilizes the Document Processing Service in order to normalize the document and generate its Normalized Document Format (NDF). The NDF is also stored on the Grid, in the NDF Repository. The GRACE Search Engine interfaces with the Data Management element of the EDG Application Layer in order to access the NDF Repository. The index, on the other hand, is stored locally for rapid querying. This local index is distributed between the GRACE nodes, and the distribution follows the organization of document sources into Knowledge Domains.

The external content sources, on the other hand, undergo registration by the Content Source Registration Service. The registration includes a technical definition of how the content source is to be queried and how the required documents should be extracted from it. It also includes the assignment of the content source to one or more KDs. In addition, one or more knowledge representation systems (i.e. ontologies, taxonomies) can be assigned to a content source. This information is stored locally in the central configuration repository.
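The publishing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GRACE implementation: the names `GridStore`, `LocalNode`, `register_knowledge_domain`, `publish_internal`, `register_external_source` and `normalize` are hypothetical, Grid storage reached through EDG Data Management is replaced by in-memory dictionaries, normalization is reduced to lower-casing and punctuation stripping, and the check that a KD is registered before it is used is an assumption suggested by the ordering of the workflow.

```python
from dataclasses import dataclass, field


@dataclass
class GridStore:
    """In-memory stand-in for storage reached via EDG Data Management (assumption)."""
    kd_registry: dict = field(default_factory=dict)          # KD name -> description
    document_repository: dict = field(default_factory=dict)  # document id -> raw text
    ndf_repository: dict = field(default_factory=dict)       # document id -> normalized tokens


@dataclass
class LocalNode:
    """State kept outside the Grid on a GRACE node (assumption)."""
    index: dict = field(default_factory=dict)              # KD -> term -> set of document ids
    config_repository: dict = field(default_factory=dict)  # content source name -> registration


def register_knowledge_domain(grid, name, description):
    """Step 1: register a Knowledge Domain in the KD Registry on the Grid."""
    grid.kd_registry[name] = description


def normalize(text):
    """Placeholder normalization; the real NDF is not specified in the paper."""
    tokens = (tok.strip(".,;:!?").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def publish_internal(grid, node, doc_id, text, kd):
    """Internal source: store the document and its NDF on the Grid, index locally."""
    if kd not in grid.kd_registry:
        raise KeyError(f"unknown Knowledge Domain: {kd}")
    grid.document_repository[doc_id] = text
    ndf = normalize(text)
    grid.ndf_repository[doc_id] = ndf
    kd_index = node.index.setdefault(kd, {})  # local index organized by KD
    for term in ndf:
        kd_index.setdefault(term, set()).add(doc_id)


def register_external_source(grid, node, name, query_template, kds, knowledge_systems=()):
    """External source: record how to query it and which KDs it belongs to."""
    unknown = [kd for kd in kds if kd not in grid.kd_registry]
    if unknown:
        raise KeyError(f"unknown Knowledge Domain(s): {unknown}")
    node.config_repository[name] = {
        "query_template": query_template,
        "knowledge_domains": list(kds),
        "knowledge_systems": list(knowledge_systems),
    }


if __name__ == "__main__":
    grid, node = GridStore(), LocalNode()
    register_knowledge_domain(grid, "medicine", "Medical literature")
    publish_internal(grid, node, "doc-1", "Grid Retrieval, explained.", "medicine")
    register_external_source(grid, node, "ext-library",
                             "http://example.org/search?q={query}", ["medicine"])
    print(sorted(node.index["medicine"]))
```

The split between `GridStore` and `LocalNode` mirrors the text: documents and NDFs go to the Grid, while the index and the content source registrations stay local for rapid querying.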
3.2 Querying
GRACE Services: Query Routing Service, Document Retrieving Service, Document Processing Service
EDG Application Layer Elements: Data Management, Job Management
GRACE Data Storage Elements: Knowledge Domain Registry, Normalized Document Format Repository

The query is transferred from the front-end application to the Query Routing Service. This service interfaces with the Data Management element of the EDG Application Layer in order to retrieve information regarding the relevant Knowledge Domains from the KD Registry. This information, together with the information regarding the relevant content sources from the central configuration repository, is needed to transform the original query into the specific formats required by the various content sources. Figure 1 illustrates the GRACE services involved and their interaction with those of EDG for the querying workflow.

Figure 1. Querying services and workflow

These formatted queries are then routed to the content sources and the search results are received. The search results are parsed by the Query Routing Service in order to extract document URLs, which are passed to the Document Retrieving Service. From here the workflow proceeds differently for internal and external content sources. In the case of an external content source, the Document Retrieving Service simply downloads the document and transfers it to the Document Processing Service. In the case of an internal content source, the Normalized Document Format, which is required for the categorization, already exists. Accordingly, instead of retrieving the actual document, the Data Management element of the EDG Application Layer is interfaced in order to retrieve the corresponding NDF from the NDF Repository. Documents from external sources that were previously cached on the Grid are a special case.
Similar to the internal documents, their NDF is also already available and can be obtained without re-processing the actual document. NDFs from both external and internal documents are next processed by the Document Processing Service in order to generate categories. The Document Processing Service interfaces with the Job Management element of the EDG Application Layer in order to initiate a Grid job. The Grid job returns the categories, which are transferred to the front-end application. The application uses knowledge representation systems (in particular taxonomies) to present the categories immediately for browsing, and then gradually populates them with the actual documents.

3.3 Document Retrieving
GRACE Services: Document Retrieving Service, Document Storage Service
EDG Application Layer Elements: Data Management
GRACE Data Storage Elements: Document Repository, Normalized Document Format Repository

Once the search results are presented, the end-user can select a document for inspection. This request is transferred from the front-end application to the Document Retrieving Service. If the requested document is stored in an internal content source or was cached at some point in the past, it is retrieved from the corresponding Document Repository by interfacing with the Data Management element of the EDG Application Layer. Figure 1 illustrates the GRACE services involved and their interaction with those of EDG for the document retrieving workflow.

Figure 1. Document Retrieving Services and Workflow

If, on the other hand, the requested document resides in an external document repository, it was already downloaded during querying in order to allow categorization of its content, so it only needs to be retrieved from the local temporary cache. In addition, the NDF of this document can be transferred to the Document Storage Service in order to store it on the Grid in the NDF Repository for future use.
In the same way, the downloaded document can be stored on the Grid in a Document Repository for cached documents. In this manner GRACE actually becomes a search engine similar to WWW-based search engines, in the sense that it indexes not only internal but also external documents. WWW-based search engines systematically utilize crawlers in order to reach external documents and index them. GRACE, on the other hand, does not do so systematically, but indexes only those external documents retrieved in response to a query. Accordingly, GRACE users in effect become GRACE crawling agents.

3.4 Personalisation
GRACE Services: Personalization Service
EDG Application Layer Elements: Data Management
GRACE Data Storage Elements: User Profile Database

Personalization is merely an extension of the basic Grid authentication mechanisms with a GRACE user profile. The user profile holds personal information strictly related to IR, for example:
previous queries and the corresponding search results, kept for future inspection;
scheduled queries performed automatically by the system;
selected and hierarchically organized links to documents (so-called “Favourites” or “Bookmarks”);
personal visualization preferences.

The user profile is envisioned as the basis of the collaborative features that are to be developed in the future, using GRACE as the infrastructure for knowledge management. User profiles are stored on the Grid in the User Profile Database. Upon authentication by a VO, the matching user profile is retrieved from the Grid by interfacing with the Data Management element of the EDG Application Layer. Authentication with the Grid and the retrieval of the user profile are both performed by the Personalization Service. Figure 1 illustrates the GRACE services involved and their interaction with those of EDG for the personalisation workflow. The Personalization Service is also responsible for updating the user profile with modifications and additional search results.
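As a rough illustration of the personalisation workflow, the following Python sketch models a user profile and a Personalization Service that retrieves it after authentication and writes updates back. Everything here is hypothetical: the class and method names (`UserProfile`, `PersonalizationService`, `login`, `record_search`, `add_bookmark`), the `authenticate` callable standing in for VO authentication, the dictionary standing in for the User Profile Database on the Grid, and the choice to create an empty profile on first login are assumptions made for the example only.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """IR-related personal information, following the four items listed in the text."""
    user_id: str
    query_history: list = field(default_factory=list)      # (query, [result links]) pairs
    scheduled_queries: list = field(default_factory=list)  # queries run automatically
    bookmarks: dict = field(default_factory=dict)          # nested folders of document links
    visualization: dict = field(default_factory=dict)      # personal display preferences


class PersonalizationService:
    def __init__(self, authenticate, profile_db):
        # authenticate: callable(credentials) -> user id or None (stand-in for VO authentication)
        # profile_db: dict standing in for the User Profile Database on the Grid
        self._authenticate = authenticate
        self._profile_db = profile_db

    def login(self, credentials):
        """Authenticate, then retrieve the matching profile."""
        user_id = self._authenticate(credentials)
        if user_id is None:
            raise PermissionError("VO authentication failed")
        # Assumption: a first-time user receives an empty profile.
        return self._profile_db.setdefault(user_id, UserProfile(user_id))

    def record_search(self, profile, query, result_links):
        """Update the profile with a query and its search results."""
        profile.query_history.append((query, list(result_links)))
        self._profile_db[profile.user_id] = profile

    def add_bookmark(self, profile, folder_path, link):
        """Store a link under a hierarchical folder path such as ["Grid", "IR"]."""
        node = profile.bookmarks
        for folder in folder_path:
            node = node.setdefault(folder, {})
        node.setdefault("_links", []).append(link)
        self._profile_db[profile.user_id] = profile


if __name__ == "__main__":
    db = {}
    service = PersonalizationService(
        lambda cred: "alice" if cred == "valid-cert" else None, db)
    profile = service.login("valid-cert")
    service.record_search(profile, "grid ir", ["doc-1"])
    service.add_bookmark(profile, ["Grid", "IR"], "doc-1")
    print(db["alice"].bookmarks)
```

Keeping authentication and profile retrieval in one `login` call follows the statement that both are performed by the Personalization Service.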
Figure 1. Personalization workflow

4. SUMMARY

The paper has three primary contributions. First, we explored the key functional representation of IR over the Grid as an enabling mechanism for Knowledge Management, based on the requirements elicitation and architectural design of the GRACE project. Second, a workflow model and its decomposition based on Grid and GRACE services is proposed and discussed. The workflow models build on earlier research by Akhgar et al. [3] and Siddiqi et al. [11] that identifies and describes a taxonomical classification of KM elements and proposes a number of orchestrated KM services necessary for realising the KM paradigm over the Grid. Third, the suggested workflows are presented at an abstract level in order to facilitate dissemination of the project results to the research community, so that researchers can produce generalised use cases for their specific problem situations.

The motivation for providing an early report of the results obtained so far is to give the IR/KM and Grid communities an opportunity to exploit different commercial and academic applications of IR/KM by harnessing Grid computing. The knowledge management and information retrieval community could participate by providing scenarios reflecting their information and knowledge needs, so that these could be elaborated in terms of the KM model and subsequently executed on the GRACE toolkit. The Grid community can assess the feasibility of launching IR/KM services on Grid platforms. Successful execution of the toolkit and its deployment on a Grid platform would be a positive and significant step towards a semantically rich IR/KM portal. We invite interested readers to visit the website and contact us to collaborate in our mission to realise Grid-enabled IR/KM as a first step towards a semantic grid.

REFERENCES

[1] Shread, P. (2003); IBM Launches Commercial Grid Offerings [online]. Last accessed on 12 March 2003 at URL: http://www.gridcomputingplanet.com/features/article.php/11170_1573391
[2] Siddiqi and Akhgar (2002); Source Code as Strategic Information based Resource, The British Computer Society Quality Special Interest Group's 10th Annual International Conference 'SQM
[3] Akhgar, B, Siddiqi, J and Naderi, M (2003); Grid enabled KM services in complex problem solving environment. Complex systems in e-business CseB 2003, Slovenia.
[4] Kesselman, C, Nick, J and Tuecke, T (2002); On-line view of Grid. www.shu.ac.uk
[5] Allan, R and Hanlon, D; An Introduction to Web Services and related Technology for building an e-Science Grid, document available at http://esc.dl.ac.uk/TechReports/WebServices/webServices_doc/
[6] Akhgar, B and Siddiqi, J (2001); A framework for the delivery of web-centric knowledge management applications, Internet Computing IC'2001. Vol 1 page 47. CSREA Press.
[7] GEODISE portal. www.geodise.org
[8] Jansen, B.J., Spink, A. and Saracevic, T. (2000); Real life, real users, and real needs: a study and analysis of user queries on the Web. Information Processing and Management, 36, 207-227.
[9] Berners-Lee, T. (1999); Weaving The Web (Orion Business Books)
[10] GRACE Project Web site www.grace-ist.org Documentation.
[11] Siddiqi, J, Akhgar B and Naderi N (2003); Grid Based Knowledge Management Systems, To appear in Jan 2004 in the International Symposium on Collaborative Technologies and Systems 2004, USA.