Abstract
There has been increasing research interest in developing full-text retrieval based on peer-to-peer (P2P) technology. So far, these efforts have largely concentrated on distributing an index efficiently. However, ranking the results retrieved from the index is a crucial part of information retrieval. To determine the relevance of a document to a query, ranking algorithms use collection-wide statistics. Term frequency-inverse document frequency (TF-IDF), for example, relies on the number of documents in the whole collection that contain a given term. Such global frequencies are not readily available in a distributed system. In this paper, we study the feasibility of aggregating global frequencies for a large term vocabulary in a P2P setting. We use a distributed hash table (DHT) for our analysis. Traditional applications of DHTs, such as file sharing, index on the order of tens of thousands of keys. Aggregating a vocabulary of millions of terms places extreme demands on a DHT implementation. We study different aggregation strategies and propose optimizations to DHTs for efficiently processing large numbers of keys.
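To make the aggregation problem concrete, the following is a minimal sketch of how per-peer document frequencies could be summed at the peer responsible for each term, and how an IDF value could then be looked up. It is an illustration under stated assumptions, not the aggregation strategy studied in the paper: the function names (`responsible_peer`, `aggregate_global_df`, `lookup_idf`), the SHA-1-modulo key placement, the whitespace tokenizer, and the toy collections are all hypothetical, and the DHT is simulated in a single process with no routing or network cost.

```python
import hashlib
import math
from collections import Counter, defaultdict


def responsible_peer(term: str, num_peers: int) -> int:
    """Map a term to a peer index (assumed stand-in for real DHT routing)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_peers


def local_document_frequencies(documents):
    """Count, for each term, how many local documents contain it."""
    df = Counter()
    for doc in documents:
        # A set ensures a term is counted once per document.
        df.update(set(doc.lower().split()))
    return df


def aggregate_global_df(peer_collections):
    """Each peer sends its local counts to the peer responsible for the term."""
    num_peers = len(peer_collections)
    stores = [defaultdict(int) for _ in range(num_peers)]
    for documents in peer_collections:
        for term, count in local_document_frequencies(documents).items():
            # In a real DHT this would be one routed message per term.
            stores[responsible_peer(term, num_peers)][term] += count
    return stores


def lookup_idf(stores, term, total_docs):
    """Fetch the global document frequency and derive log(N / df)."""
    df = stores[responsible_peer(term, len(stores))].get(term, 0)
    if df == 0:
        return 0.0  # assumed convention for unseen terms
    return math.log(total_docs / df)


# Hypothetical toy collections, one list of documents per peer.
peers = [
    ["peer to peer retrieval", "distributed hash table"],
    ["hash table lookup", "retrieval of documents"],
    ["peer ranking"],
]
stores = aggregate_global_df(peers)
total_docs = sum(len(docs) for docs in peers)
idf_hash = lookup_idf(stores, "hash", total_docs)
```

In this sketch every distinct term on every peer produces one update, which hints at why a vocabulary of millions of terms stresses a DHT far more than a workload with tens of thousands of keys.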
The work presented in this paper was carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European FP 6 STREP project ALVIS (002068).
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Klemm, F., Aberer, K. (2007). Aggregation of a Term Vocabulary for P2P-IR: A DHT Stress Test. In: Moro, G., Bergamaschi, S., Joseph, S., Morin, JH., Ouksel, A.M. (eds) Databases, Information Systems, and Peer-to-Peer Computing. DBISP2P 2005, DBISP2P 2006. Lecture Notes in Computer Science, vol 4125. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71661-7_18
DOI: https://doi.org/10.1007/978-3-540-71661-7_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71660-0
Online ISBN: 978-3-540-71661-7
eBook Packages: Computer Science, Computer Science (R0)