Jasper Haasdijk 4449754 Searching IPFS
Jasper Haasdijk 4449754 Searching IPFS
Jasper Haasdijk 4449754 Searching IPFS
Computing Science
Searching IPFS
Second assessor:
dr. ir. E. Poll
e.poll@cs.ru.nl
1 Introduction 3
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Preliminaries 5
2.1 IPFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Content . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Versioning . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 PubSub . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Search Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Crawling . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Research 9
3.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.1 Listener . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2 Publisher . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Increasing efficiency . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . 16
3.4.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4.2 Cluster Setup . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Related Work 29
4.1 YaCy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Intelligent Search Mechanism (ISM) . . . . . . . . . . . . . . 29
4.3 Presearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 PeARSearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5 ipfs-search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5 Conclusions 32
1
6.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1.2 Prototype . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
A Appendix 37
A.1 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . 37
A.1.1 WikiMedia Format . . . . . . . . . . . . . . . . . . . . 37
A.1.2 Redirect pages . . . . . . . . . . . . . . . . . . . . . . 38
A.1.3 Disambiguation pages . . . . . . . . . . . . . . . . . . 39
A.2 Namespaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2
Chapter 1
Introduction
1.1 Motivation
The Web today is moving more and more towards a centralized structure.
Large server farms are under the control of a select few companies, which
makes it possible to control the information flow, at least in principle. This
enables digital monopolies, and governments can even use this to block spe-
cific content which they deem not suitable for their residents. One example
of this can be seen in Turkey in 2017, when the Turkish government decided
that Wikipedia is a ”threat to national security” and subsequently restricted
all access to the site [2].
Incidents like this fuel the existing concerns regarding the Web’s evolu-
tion towards this centralized structure [3, 4, 5]. These incidents motivate
a more decentralized setup [3]. Examples of such projects are Diaspora1 ,
Mastodon2 , ZeroNet3 and IPFS.
1
https://diasporafoundation.org/
2
https://joinmastodon.org/
3
https://zeronet.io/
3
IPFS allows for a Peer-to-Peer, decentralized network on which content can
be served by anyone who possesses it. This makes the content incredibly
hard to attack and allows users to view and serve this content, even in more
restrictive areas. However, within IPFS there are no searching or indexing
tools. This makes it difficult to determine which information can be found,
and where. One solution is to share the hash of the content on publicly
available forums. Since data is content-addressable, sharing the hash enables
others to directly access the data. This is a rather cumbersome method and a
search engine would make it much easier to search what content is accessible
on the network.
The motivation to tackle this problem stems from the observation that
searching is a vital part of any network. The ability to share network and
computing resources is a key element of any network. If you are unable to
share such resources, there is no point in being connected. IPFS aims to
replace HTTP but without the means to search through its contents it will
never live up to this goal. Therefore the addition of searching functionality
to IPFS is an essential aspect in the adoption and future of the protocol.
The following chapters of this paper will walk you through our implementa-
tion of a prototype search engine based on IPFS. We will walk you through
the design and creation using ample examples.
4
Chapter 2
Preliminaries
To set the scene we begin with a brief overview of the most important
concepts discussed in this thesis.
2.1 IPFS
“IPFS is a peer-to-peer distributed file system that seeks to connect all
computing devices with the same system of files” [1]. It is a protocol designed
to create a network of peers which are all connected to the same (shared)
filesystem.
While there is much to discover about IPFS, this Section focuses on the
specific techniques and properties used in this thesis.
2.1.1 Content
IPFS employs content addressed storage. What this means is that when ac-
cessing data through IPFS, instead of using location based addressing such
as an object’s link, we use the contents of the data, specifically the hash.
Adding a file x to the filesystem is done using the ipfs add x command.
This will also print the resulting IPFS hash of the file that was just added.
Getting a file from the filesystem is done using the ipfs get <hash> com-
mand, where <hash> is the IPFS hash of the file you want to fetch.
Data is stored using IPFS objects. Listing 2.1 shows the IPFS Object format
[1]:
5
Listing 2.1: The IPFS Object format
1 type IPFSObject struct {
2 links [] IPFSLink
3 // array of links
4
5 data [] byte
6 // opaque content data
7 }
As we can see, an IPFS object contains a links and a data field. A single
IPFS object can store up to 256 KB of data. When storing simple text files
this is enough. When storing files that are > 256 KB, the file is split up in
parts of ≤ 256 KB. An empty parent object will then link all the pieces of
the file together. This is shown in Figure 2.1.
IPFS Object
links: [ ]
data: chunk#1
IPFS Object
links: [ ]
data: chunk#3
2.1.2 Versioning
An updated file is different from the original file, which will result in a
different IPFS hash. To support versioning, IPFS uses commit objects. A
commit object contains a parent and an object field. When updating
6
an object, IPFS links the parent field to the IPFS object that has been
updated. This allows us to track updates over files without losing older
versions. Figure 2.2 shows how updating a file in IPFS using commit objects
looks like. We have a text file with "Hello World!" and update this to
"Hello IPFS!".
Commit Commit
links: [ ] links: [ ]
data: "Hello World!" data: "Hello IPFS!"
2.1.3 PubSub
PubSub is short for the messaging pattern called Publish-Subscribe. IPFS
employs a topic-based system. Publishers can post messages to a particular
topic. Subscribers can subscribe to such a topic and using this approach
only receive messages they are interested in. Subscribers do not necessarily
know who is sending the messages. They simply express their interest in a
particular topic. This offers great network scalability and flexibility.
The current implementation of PubSub in IPFS is called floodsub. This is
due to the fact that messages are sent to every peer in the network. This
is a very simple, easy to implement protocol. The problem however is that
this creates a rather large communication overhead as there is no limit on
the outbound degree when forwarding messages. This can sometimes lead to
the same message being received multiple times. Therefore the IPFS team is
working hard to improve this protocol [6]. This has lead to the specification
of the gossipsub protocol, and a proposal for the episub protocol.
7
2.2 Search Engine
A search engine is a system which allows a user to search for information on
a network. Before such a system can serve search queries, it needs to crawl
and index the available data. When transitioning into a P2P, decentralized
setup, these processes need to be adjusted to accommodate for the change
in setup. For example, in a truly decentralized solution, we cannot employ
a single, centralized index.
2.2.1 Indexing
Indexing is the term (related to search engines) used to describe the process
in which raw data is parsed into a searchable format, the index. This pro-
vides a more efficient way of searching through the complete data collection.
2.2.2 Crawling
The most common method used by search engines to gather the data for the
search index, is by employing a crawler. A Web crawler is a program that
systematically gathers and processes the content of a webpage. This content
is parsed and added into the search index. In a centralized environment this
is a viable strategy. The company behind the search engine has a vested
interest in gathering this data so it can serve its users relevant queryhits.
There is no need to create an incentivised structure as the company is willing
to do the work.
In a decentralized solution we cannot employ this method. There is no
single entity we can rely on to provide us this function. Employing a Web
crawler inevitably reduces the freshness of the search results [7]. Instead, we
have to move to a completely new approach. An approach where content
creators do the work for the data they want, so that there is no need to
create an incentivised structure since anyone who wants something has to
do it themselves. This is called the no-crawling approach.
Networks which do not employ the no-crawling approach are bound to create
an incentivised structure. This is why the company behind IPFS, Protocol
Labs, is working hard to create Filecoin [8]. Filecoin is a token that can be
used to hire miners to store or distribute data for you. This is an alternative
approach by paying nodes to do work they do not need themselves.
8
Chapter 3
Research
To define the scope of this thesis, we take a more concrete look at our prob-
lem definition. Our earlier stated research problem consists of the following
question: “How to devise a search engine in a decentralized file system to
locate assets in an IPFS cluster?”. This is a rather broad question and can
be interpreted in a variety of ways. Therefore we instead introduce a new,
much more concrete problem definition.
If we were to have a document dump consisting of Wikipedia
pages, how do we arrange this collection of pages in a distributed
IPFS cluster with 1, 2, . . . , n clients so that we can support
searching on the contents?
In Section 3.1 we present the main idea of this thesis. This Section contains
the general approach behind our prototype search engine. Sections 3.2 and
3.3 explain the specific implementation of our prototype. This consists of two
parts: a Listener and a Publisher. In Section 3.4 we will collect and process
the document dump of Wikipedia pages, set up our cluster environment and
show that using our search engine we are able to support searching on the
contents of the cluster.
3.1 Design
The main idea presented in this thesis is a prototype search engine built on
top of IPFS. This prototype supports searching by listening to the queries
and answer messages that are published on the network. Whenever a node
in our network wants to search for something, it publishes a query. Other
nodes can then respond to this query by publishing an answer message.
When talking about nodes we refer to the individual peers in the network.
Every node has their own content. A collection of documents that they are
9
willing to share with the rest of the network. Nodes can come and go as
they please. Whenever a node decides to leave the network, so will their
content. This choice stems from the observation that this is simply how the
IPFS network works. If nobody wants to share specific content then that
content will disappear.
The current implementation is not the most efficient one. It relies on a
flooding approach as we simply publish every query we have to everyone
on the network. This provides us with a high recall rate, at the expense
of a lot of communication overhead. Many alternatives to this approach
have been proposed, including overlay networks which try to organize “peers
sharing common interest into clusters” with the goal to try and exploit this
information when submitting queries [9].
In this thesis we aim to increase the speed and efficiency of our implemen-
tation by introducing a summary file. A summary file is a local list of
{document, score} value pairs matched to a given query. This allows us
to know beforehand which data is available in the network. For example our
summary list could contain the following entry:
Now if we search for “World Wide Web”, we can look into our local summary
file and resolve this search without publishing the query.
Data entries that are in the summary file can originate from other nodes
on the network that have responded to a query. You can compare this with
a cache of previously seen answer messages. Using this file we create a
summary of the data that is available to us in the network.
In the remaining part of this Chapter we will elaborate on this idea, and mo-
tivate how our solution solves the problem by providing a technical overview.
10
3.2 Implementation
This Section gives a technical overview of the implementation of our proto-
type. Our prototype uses the IPFS Javascript library1 and consists of two
parts: a Listener and a Publisher.
To be able to send and receive messages on the network we create a shared
PubSub channel. This is done by letting every node in the network subscribe
to the same topic. Whenever a node decides to publish something, PubSub
will distribute this message to every other peer currently connected.
3.2.1 Listener
Our Listener is responsible for returning queryhits. The program flow can
be divided in three parts; starting the Listener, handling network events,
and shutting down the Listener.
The Event field differentiates between a query and an answer event. The
Query field is used to track which query we are dealing with. In the case of
a query event this field indicates what we are querying for, in the case of an
answer event this field indicates for which query the message is returning
queryhits. The Payload field is used to transfer queryhits. In the case of
a query event this field simply returns the empty string. In the case of an
1
https://github.com/ipfs/js-ipfs-http-client
11
answer event this field returns a list of {document, score} value pairs. A
document’s score is calculated based on a similarity measurement with the
query.
Figure 3.2 shows an example query message:
This message indicates that we are querying for the string "Tomato". An
example answer message for this query is shown in Figure 3.3:
On receiving a query message, our Listener searches its local index and
publishes the queryhits in a matching answer message.
12
Listener starts
Load index from file Create and populate a new search index
exitHandler()
3.2.2 Publisher
Our Publisher is responsible for submitting queries. Although much simpler
than our Listener, our Publisher’s program flow can also be divided in three
parts. Starting our Publisher, submitting queries, and shutting down the
Publisher.
Starting our Publisher process is as simple as subscribing to the shared Pub-
Sub channel. This enables us to submit queries to the network. Figures 3.2
and 3.3 illustrate how this can be done. On receiving an answer event, our
Publisher updates its local response list with the newly received queryhits.
We can shut down our Publisher by unsubscribing from the shared PubSub
channel.
13
Figure 3.5 shows the flow of the Publisher process schematically:
Publisher starts
Where <payload> refers to the returned list of value pairs, just as we have
seen in Listing 3.2.
14
Notice that our summary messages are nearly identical to our query and
answer messages. A sumQuery message indicates that we are looking to add
an entry to our summary file. sumAnswer messages subsequently return the
payload for our summary entry.
To be able to use this new feature, we need to adjust our Listener and
Publisher process.
Listener needs to be able to handle sumQuery events. This is a rather small
addition as it is similar to handling a query event. Only now we return a
sumAnswer message instead of an answer message.
Figure 3.8 shows the updated Listener process. The changes have been
highlighted.
Listener starts
Load index from file Create and populate a new search index
exitHandler()
15
Our Publisher on the other hand undergoes quite the change. Instead of
simply subscribing to the shared channel like we used to, we first need to
check if there exists a local serialized summary file we can use. If there exists
such a file, we don’t have to generate a new one and instead load it from
disk. If this file does not exist, we have to generate a new one.
We also need to update our search order. We now have 3 different sources
to query when we want to search for something. Our local search index,
our local summary file, and the cluster. If we are content with the scores
returned from our local search index and summary file, we don’t have to
submit the query to the network. This saves us both in time and communi-
cation overhead.
We also add an exitHandler to the Publisher process, which we invoke
when shutting down the process. This function checks if there already exists
a serialized local summary file. If there is no local summary file yet, before
shutting down our Publisher, we generate a new one from our constructed
summary. If there already exists a summary file, we simply exit the process.
Figure 3.9 shows the updated Publisher process.
3.4.1 Data
Searching for something only really makes sense once we have something to
search for. So before we are able to do anything with data, we first need
to obtain data. We will use a Wikipedia document dump, process this data
into individual .txt files, and add them to our search index.
Data Collection
Wikipedia explains how an interested user can obtain a free copy of their
available content on their “Wikipedia:Database download” page [10]. Fol-
lowing their instructions on where to get the data, we obtain and verify the
data collection shown in Figure 3.10.
This is a collection of (a subset of) pages in their current version, that does
not include metadata, category information, statistics, log events or edit
histories. We are only interested in the page titles with their corresponding
page text. Our desired end result is a list of Wikipedia articles where the
filename is the article’s title (using its <title> tag), and the file’s content
is the article’s text (using its <text> tag).
16
Publisher starts
exitHandler()
Data Cleansing
Looking into this data, we see there is pre-processing required before we can
actually do something with it. We received the article data in the form of an
.xml dump. This file includes unwanted data entries such as redirects and
disambiguation pages. We also notice that all the <text> fields are using
the MediaWiki [11] format and that we are going to have to parse this to its
plaintext equivalent. Some examples of unwanted data entries can be found
in Appendix A.1.
17
File:
from: https://dumps.wikimedia.org/enwiki/20181020/
file: enwiki-20181020-pages-meta-current1.xml-p10p30303.bz2
Hash:
from: https://dumps.wikimedia.org/enwiki/20181020/enwiki-
20181020-md5sums.txt
hash: md5sum: cb5b3336b0bfbb3414aab61a54ecd40a
18
Listing 3.3: Easily convert wikicode into plaintext
1 plaintext = mwparserfromhell.parse(source)
2 return plaintext.strip_code()
File list
To improve the configurability and flexibility of our data set, let us split it
up a bit further and create a set of .txt files where we use the record’s title
as filename and the record’s text as the file’s contents.
Since our records hold all the necessary information, this is something we
can accomplish using Python’s open(), write() and close() methods.
Now that we have completed the preparation of our .txt files, it is time
to assess our finished data collection. As the figure below illustrates, we
started with 22299 entries and filtered this down to 14978 files, a reduction
of 32.83 %.
0 5 10 15 20 25
Number of entries in our collection (×1000)
Data Indexing
The final step in the preprocessing phase of the data is creating the search
index and adding the files to it.
19
Creating our search index
Since we are using the Javascript client library for the IPFS HTTP API, we
have chosen to use the Javascript module elasticlunr2 to provide us with
offline lightweight full-text search. This choice stems from the observation
that this module supports a scoring mechanism and the option to save a
serialized version of the index to disk. We use this module to create our
search index.
20
After every new addition to the search index, we call
elasticlunr.clearStopWords() to remove default English stop words.
Now that we have created our search index and added the files from our file
directory to the index, our data is ready to be searched on.
Alice Carol
Bob Dan
Prerequisites
Before we are able to launch a cluster, we need to be able to run and manage
cluster peers. For this, we install 2 additional tools: ipfs-cluster-
service and ipfs-cluster-ctl. Both binaries can be downloaded from
https://dist.ipfs.io/. ipfs-cluster-service is used to run a cluster
peer and ipfs-cluster-ctl is the command-line interface used to manage
a cluster peer.
In order to be able to store the configurations and states of our nodes per-
manently, our cluster expects a ‘compose’ subfolder. Its directory tree looks
like this:
compose/
|-- cluster0
|-- cluster1
|-- cluster2
|-- cluster3
|-- ipfs0
|-- ipfs1
|-- ipfs2
|-- ipfs3
21
During the first start, default configurations are created for all peers.
The Cluster
We create our cluster environment using Docker3 . The Docker documenta-
tion page on the IPFS cluster website [12] provides us with templates for
our Dockerfile and docker-compose.yml configuration files.
Dockerfile
A Dockerfile is a configuration file which can be used to tell Docker how to
build a specific image. It includes all the commands a user would otherwise
have to manually execute. We have modified the template to include the
required Javascript libraries, e.g., the client library for the IPFS HTTP API
and Elasticlunr, and execute the daemon subcommand to enable PubSub.
We use the Dockerfile to build a custom Docker image and we include the
resulting image ID in our docker-compose.yml to specify what image to
base our containers on.
Docker Compose
“Compose is a tool for defining and running multi-container Docker applica-
tions.”. We use this to define our cluster setup by specifying how our nodes
look like. The configuration for a single node looks like:
22
18 - "127.0.0.1:9094:9094" # api
19 volumes:
20 - ./compose/cluster0:/data/ipfs-cluster
23
32 - /ip4/172.22.0.3/tcp/4001/ipfs/Qmbs6GDGkK
Summary
When generating a new summary file, the main question we have to ask
ourselves is:
How do we decide which entries are included in the summary
file?
How do we share what we have, without sharing all that we have? This is
called Resource selection: “the goal of resource selection is to select a small
set of resources that contain a lot of relevant documents.” [13]. We are
looking to extract a small set of documents to share our most relevant data.
Since we generate this summary file when starting up the Publisher, we
have no idea of knowing what the nodes in the cluster will be querying for.
For this reason we have decided to use the statistics provided by stats.
wikimedia.org on ‘Top viewed articles’. This gives us a list of the top
1000 most viewed articles of the last month. We filter this list (e.g. by
removing Help: pages) and submit sumQuery messages for the remaining
entries in the list. We use the returned sumAnswer messages to update our
local summary file.
Please take a look at what happens when we publish the following two search
queries. The first search query searches for "Ant-Man and the Wasp". The
second search query searches for "Potatoes". Both queries are looking to
retrieve documents with a score of at least 2.0.
24
18 score: 0.2249596581756685 } ]
19
20 I received answer for: ’Ant-Man and the Wasp’
21 Updating my response list ...
22 Our response list for ’Ant-Man and the Wasp’ now looks like:
23 [ { ref: ’QmPZrjeZrSBcofPWUDL8sPkJod4CCTJ7NYe92a4U7JLszf’,
24 score: 1.6673388780404585 },
25 { ref: ’QmeFJRGoK6YeYg4T84jzFAM8kJmAouKU7VdxnXC1YSfggS’,
26 score: 0.2597188628938538 },
27 { ref: ’QmUiRfbuAw8ayxyghGnixYoqZsY4D1GDDyHciwHb7EwLtB’,
28 score: 0.2490572926268008 },
29 { ref: ’QmRT1Jyb3CGHmQUgepq3vQDt6VMf6LfAe8ZCRbkMiHyU3m’,
30 score: 0.22425568575833124 },
31 { ref: ’QmUPzRgcqFQLi4U5y2Md8PmqCvYtmwEFNPQtg9mDqfrfWP’,
32 score: 0.2218710340306346 } ]
We see that our local results for "Ant-Man and the Wasp" scored too low.
Only after we observe this, do we publish the query on the network. On
the other hand we are able to retrieve a document with a very high score
for "Potatoes" locally. This saves us the overhead of publishing the query
on the network. Also notice that after a while we receive an answer for
"Ant-Man and the Wasp" with which we immediately update our response
list. We log the output of this to stdout so the user can see what is hap-
pening.
Getting a file
Now that we are able to generate queryhits for our queries, we might also
be interested in actually getting a file, instead of only seeing its rank. Since
all the files that have been added to the search index have also been added
to IPFS, and queryhits are referenced by the hash of the document, getting
a file is a rather easy process. All we need to do is issue a get operation on
the hash of the document we want to retrieve. This looks like this:
This saves the data contained in the IPFS object referenced by the hash to
our local disk.
Or if we want to implement something like this in our Javascript process:
25
Listing 3.11: Retrieve a file over IPFS, using Javascript
1 ipfs.get("Qm...", function(err, files) {
2 files.forEach(file => {
3 console.log(file.path);
4 console.log(file.content.toString("utf8"));
5 });
6 });
3.4.3 Evaluation
To be able to evaluate how our prototype performs, we constructed a list
of requirements on which we have focused our implementation. We have
sorted this list in descending order of importance:
Recall rate
If a document is available we want to be able to retrieve it. A recall rate
of 100 % means that we are able to retrieve every available document.
Scalability
Since one of the goals of the IPFS project is to replace HTTP [14], the
network needs to be able to scale. For this reason, our implementation
also needs to be able to scale. We don’t want to be waiting 5 minutes
for a query-hit.
Communication overhead
We want to minimize the load on the network.
Disk size
Since it is a P2P network we want everyone to be able to join. We
don’t want our implementation to be too big for the available disk
space on some devices.
Ideology
We want to keep close to the idealogy of P2P systems. We want to
avoid centralization, master-slave relations and single point of failures.
We also want to keep close to the idealogy of IPFS. This means that
popular searches are cached and non-searched items will disappear.
26
Recall rate
By default, our implementation adds every file we add to our search index,
to IPFS as well. We do this so we can use its hash as a reference in our
queryhits. This in turn makes it easier to retrieve a file as we have discussed
in Chapter 3.4.2. Since every file is made available through IPFS, as long
as a node stays connected to the network, its documents are available.
Scalability
Unfortunately our cluster setup does not include scalability testing. This
is something that should be included in future research. We do however
employ a similar technique to ISM [15] which shows very promising results.
More on this in Section 6.2.
Communication overhead
Using our summary file we try to resolve queries locally as much as we can.
This saves the overhead of having to publish to the network for every query.
Disk size
Looking at our current cluster nodes, the average file size for our index file
is 206.25 MB. The average file size for our summary file is 244.825 KB. This
is small enough such that this should not be a limiting factor.
Note however that the size of the index file heavily correlates with the size
of the data you are sharing with the network. On devices with a smaller
data source the index will naturally be smaller as well.
ipfs0 200.9
ipfs1 202.9
ipfs2 214.0
ipfs3 207.2
A future improvement to be able to limit the size of the summary file could
be made by implementing a check in our exitHandler to only write as many
entries to disk as allowed.
27
ipfs0 211.1
ipfs1 323.3
ipfs2 244.5
ipfs3 200.4
Ideology
In the current version of our implementation every node has equal rights.
We have employed a true decentralized solution without introducing single
point of failures, centralization or master-slave relations.
A future addition that would move even closer to the idealogy of IPFS would
be to incorporate a resource selection algorithm that uses the published
query and answer messages as input, as proposed in Section 6.1.2. As
a result of this, popular searches will be cached in the summary file and
non-searched items will disappear from the network.
28
Chapter 4
Related Work
4.1 YaCy
YaCy1 is a free, open and decentralized search engine. The software is
developed by the YaCy community on Github. They employ a crawling
based method where peers parse, index and store a local search index. This
local search index is merged with other peers’ indices on the same network
to provide search results. Index fragments are exchanged using a distributed
hash table. This project is probably the most advanced decentralized search
project at the moment.
The main differences compared to our approach, is that instead of swapping
index fragments, we swap query results. We also allow for a higher peer
autonomy. Since we do not employ a crawling method, peers are free to join
and leave the network whenever they want.
29
forward a query to. It shows to be a very promising technique addressing
scaling issues which flooding techniques have to deal with. The biggest
disadvantage of this technique is that it has quite a long startup time before
it is able to attain a high recall rate [15].
This is something our implementation tries to address by supplying an initial
summary file before the first query has even been published on the network.
After incorporating a resource selection algorithm as proposed in Section
6.2 it would be very interesting to see how our implementation compares to
ISM as both techniques are quite similar. Unfortunately no public imple-
mentation is available.
4.3 Presearch
Presearch2 is a decentralized search engine currently in early access, pro-
viding curious users the option to join the beta program. It is a promising
project with ambitious goals. Apart from their whitepaper, technical details
are unfortunately scarce. They seem to be using a browser based crawling
method based on the user’s history. To incentivize and reward participation
they have created the PRE (Presearch Token). These tokens are issued to
members as they search, and can be used by sponsors to place advertise-
ments. For further reading, take a look at the presearch whitepaper3 .
4.4 PeARSearch
PeARSearch4 is the decentralized version of PeARS5 . It is a prototype de-
centralized Web search system under construction. It is a search engine
that you can query without ever publishing your query on the network. The
main idea is that users collect their favorite web pages and create a local
‘pod’, a collection of indexed web pages that share a topic. Pods can then be
exported and shared through any medium. Users can import pods and add
them to their local search index collection. When searching for something,
results from a user’s local pods are merged and returned. This way your
actual search query never leaves your computer.
This is a fantastic idea. It however requires an active community collecting
webpages and sharing and maintaining pods. Unfortunately this project
has not yet gained any traction. Currently there are only a select few pods
available which makes it not something someone would use.
2
https://www.presearch.io/
3
https://www.presearch.io/uploads/WhitePaper.pdf
4
https://github.com/PeARSearch/PeARS-orchard
5
http://pearsearch.org/
30
4.5 ipfs-search
ipfs-search67 is a project which aims to provide a search engine for IPFS by
sniffing the DHT gossip and indexing the file and directory hashes. When
listening to the DHT gossip, discovered hashes are added to a queue. A
crawler uses this queue to list the type of the IPFS object. If it is a file,
its metadata is extracted and added to the index. If it is a directory, its
referred items will be added to the queue. Searching is implemented using
Elasticsearch8 .
Since it is listening to the DHT gossip between its own node, and is using
the crawling approach, this project will have to face a significant challenge
as IPFS grows. It is not a decentralized solution nor is it scalable.
6
http://ipfs-search.com/
7
https://github.com/ipfs-search
8
https://www.elastic.co/products/elasticsearch
31
Chapter 5
Conclusions
32
Chapter 6
This Chapter discusses our assumptions and the strengths and weaknesses
of the current implementation.
6.1 Discussion
In this Section we will talk about some of our remarks with regards to our
data, our cluster and our prototype.
6.1.1 Data
When adding documents to our search index, the plaintext files are processed
using Elasticlunr’s tokenizer. This tokenizer splits strings into tokens. By
default this is configured to split strings on whitespace and hyphens. When
observing our cleaned text files we see that there is still some WikiMedia
syntax appearing here and there. This could skew the inverted index po-
sitions of certain words a little. Fortunately most (almost all) of our files
are processed properly and therefore the effect this will have on the overall
search quality will be minimal.
In the current version of our implementation we assume that our corpus is
static. We do not check for updates in our text nor do we check to see if
there are new files. We could however enable having dynamically changing
data by checking if files have been updated, and adding the new versions
to our search index. Since IPFS supports versioning using commit objects,
whenever we have a new version of a file, we can simply add it to IPFS and
the network will take care of the rest.
33
6.1.2 Prototype
The current version of our summary file only uses sumAnswer and sumQuery
events to kickstart the summary file. This has to do with how we generate
‘interesting’ queries to populate this file. Remember our resource selection
question:
How do we share what we have, without sharing all that we
have?
How do we know which of our documents are relevant without knowing what
the nodes in the cluster are searching for? Resource selection algorithms
need to have this knowledge to determine what the most relevant documents
are. As a future improvement we could use the published query and answer
messages as input for a resource selection algorithm to determine a node’s
most relevant local set of documents. Using this knowledge we could use the
sumAnswer and sumQuery events to push summary updates to the network.
This could possibly increase the efficiency of using a summary file.
34
Bibliography
35
Retrieval. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008. p. 65–
76.
[10] Wikipedia, Wikimedia Foundation. Database Download;. Avail-
able from: https://en.wikipedia.org/wiki/Wikipedia:Database_
download.
[11] Wikipedia, Wikimedia Foundation. Help:Wikitext;. Available from:
https://en.wikipedia.org/wiki/Help:Wikitext.
[12] IPFS Cluster. Docker;. Available from: https://cluster.ipfs.io/
documentation/deployment/docker/.
[13] Si L, Callan J. The Effect of Database Size Distribution on Resource
Selection Algorithms. In: Callan J, Crestani F, Sanderson M, edi-
tors. Distributed Multimedia Information Retrieval. Berlin, Heidelberg:
Springer Berlin Heidelberg; 2004. p. 31–42.
[14] Protocol Labs. IPFS is the Distributed Web;. Available from: https:
//ipfs.io/.
[15] Zeinalipour-Yazti D, Kalogeraki V, Gunopulos D. Information retrieval
techniques for peer-to-peer networks. Computing in Science & Engi-
neering. 2004;6(4):20–26.
36
Appendix A
Appendix
=See
1 also==
2 <!-- Please keep entries in alphabetical order &
8 * [[Global dimming]]
9 * [[Irradiance]]
10 * [[Kirchhoff’s law of thermal radiation]]
11 * [[Opposition surge]]
12 * [[Polar see-saw]]
13 * [[Solar radiation management]]
37
=References==
1
2 {{Reflist|refs=
3 <ref name="Goode">{{Cite journal
2001GeoRL..28.1671G |display-authors=etal}}</ref>
38
A.1.3 Disambiguation pages
These pages include a definitions list. They do not contain actual text
regarding the page title and therefore have been excluded in the filtered
collection.
Listing 2: Disambiguation page for "Austin"
1 <page>
2 <title>Austin (disambiguation)</title>
3 <ns>0</ns>
4 <id>590</id>
5 <revision>
6 <id>842727923</id>
7 <parentid>841549643</parentid>
8 <timestamp>2018-05-24T08:36:07Z</timestamp>
9 <contributor>
10 <username>Crouch, Swale</username>
11 <id>11009441</id>
12 </contributor>
13 <minor />
14 <comment>per [[MOS:DABPRIMARY]]</comment>
15 <model>wikitext</model>
16 <format>text/x-wiki</format>
17 <text xml:space="preserve">{{wiktionary|Austin}}
18 ’’’[[Austin]]’’’ is the capital of Texas in the United
States.
19
30 ==Geographical locations==
31
32 ===Australia===
33 * [[Austin, Western Australia]]
34
35 ...
39
A.2 Namespaces
The following list dictates which namespace key corresponds to which cate-
gory. It consists of the <siteinfo> data from the enwiki-20181020-pages-
meta-current1.xml-p10p30303.bz2 file.
Listing 1: Namespace key list
1 <siteinfo>
2 <sitename>Wikipedia</sitename>
3 <dbname>enwiki</dbname>
4 <base>https://en.wikipedia.org/wiki/Main_Page</base>
5 <generator>MediaWiki 1.32.0-wmf.26</generator>
6 <case>first-letter</case>
7 <namespaces>
8 <namespace key="-2" case="first-letter">Media</namespace>
9 <namespace key="-1" case="first-letter">Special</namespace>
10 <namespace key="0" case="first-letter" />
11 <namespace key="1" case="first-letter">Talk</namespace>
12 <namespace key="2" case="first-letter">User</namespace>
13 <namespace key="3" case="first-letter">User talk</namespace>
14 <namespace key="4" case="first-letter">Wikipedia</namespace>
15 <namespace key="5" case="first-letter">Wikipedia talk</namespace>
16 <namespace key="6" case="first-letter">File</namespace>
17 <namespace key="7" case="first-letter">File talk</namespace>
18 <namespace key="8" case="first-letter">MediaWiki</namespace>
19 <namespace key="9" case="first-letter">MediaWiki talk</namespace>
20 <namespace key="10" case="first-letter">Template</namespace>
21 <namespace key="11" case="first-letter">Template talk</namespace>
22 <namespace key="12" case="first-letter">Help</namespace>
23 <namespace key="13" case="first-letter">Help talk</namespace>
24 <namespace key="14" case="first-letter">Category</namespace>
25 <namespace key="15" case="first-letter">Category talk</namespace>
26 <namespace key="100" case="first-letter">Portal</namespace>
27 <namespace key="101" case="first-letter">Portal talk</namespace>
28 <namespace key="108" case="first-letter">Book</namespace>
29 <namespace key="109" case="first-letter">Book talk</namespace>
30 <namespace key="118" case="first-letter">Draft</namespace>
31 <namespace key="119" case="first-letter">Draft talk</namespace>
32 <namespace key="710" case="first-letter">TimedText</namespace>
33 <namespace key="711" case="first-letter">TimedText talk</namespace>
34 <namespace key="828" case="first-letter">Module</namespace>
35 <namespace key="829" case="first-letter">Module talk</namespace>
36 <namespace key="2300" case="first-letter">Gadget</namespace>
37 <namespace key="2301" case="first-letter">Gadget talk</namespace>
38 <namespace key="2302" case="case-sensitive">Gadget definition</namespace>
39 <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
40 </namespaces>
41 </siteinfo>
40