Jonathan N. Chimah and Friday Ibiam Ude: Current trends in information retrieval systems:
review of fuzzy set theory and fuzzy Boolean retrieval models
catalogue, index and bibliography, abstract as well as the computer are known as
information retrieval systems.

Automated information retrieval systems are used to reduce what has been called
information overload. An IR system is a software system that provides access to books,
journals and other documents; it also stores and manages those documents. Web search
engines are the most visible IR applications.

An information retrieval process begins when a user enters a query into the system.
Queries are formal statements of information needs, for example search strings in web
search engines. In information retrieval a query does not uniquely identify a single
object in the collection. Instead, several objects may match the query, perhaps with
different degrees of relevancy.

An object is an entity that is represented by information in a content collection or
database. User queries are matched against the database information. However, as
opposed to classical SQL queries of a database, in information retrieval the results
returned may or may not match the query, so results are typically ranked. This ranking
of results is a key difference between information retrieval searching and database
searching (Jansen and Rieh 2010).

Depending on the application, the data objects may be, for example, text documents,
images, audio, mind maps or videos. Often the documents themselves are not kept or
stored directly in the IR system, but are instead represented in the system by document
surrogates or metadata. As Frakes and Baeza-Yates (1992) noted, most IR systems compute
a numeric score on how well each object in the database matches the query, and rank the
objects according to this value. The top-ranking objects are then shown to the user. The
process may then be iterated if the user wishes to refine the query.

Against this backdrop, this paper examines the concept and goal of information retrieval
systems and current trends in such systems, explains concepts that are often treated as
synonymous (such as imprecision, vagueness, uncertainty, and inconsistency), and reviews
fuzzy set theory and its concomitant Boolean retrieval models.

Concept and goal of information retrieval systems

Information retrieval (IR) is concerned with the storage, organization, and searching of
collections of information. It has been a significant part of human technological
development since the development of writing. The earliest IR systems were the
organization schemes of ancient archives and libraries, such as early Sumerian archives,
or the “Pinakes” developed by Callimachus for the library of Alexandria. In the
twentieth century the largest impetus to the development of automated IR systems was the
need to manage increasingly larger quantities of information in business and scientific
development. Early attempts at automating search capabilities for document collections
involved techniques based on punched cards, as well as machines using optical sensing of
codes on microfilmed documents (Buckland 2006).

According to Larson (2018), the goal of any IR system is to select the information items
(texts, images, videos, etc., which we will refer to as “documents”) that are expected
to be relevant for a given searcher (or user) from a large collection of such items.
Today these collections range from small sets of items on an individual’s personal
computer to the vast resources of the World Wide Web. In all cases the task is the same:
to extract some set of items that searchers want to have from all of those
they do not want. This is not a simple task, and involves not only the technical aspects
of constructing a system to perform such selection, but also aspects of psychology and
user behaviour to understand what differentiates the desired items from the non-desired
from the particular user’s point of view.

The synonymous concepts in information retrieval

The terms imprecision, vagueness, uncertainty, and inconsistency are very often used as
synonymous concepts. Nevertheless, when they are used to qualify a characteristic of
information, they have distinct meanings (Motro 1995). Since IR has to do with
information, understanding the different meanings of imprecision, vagueness,
uncertainty, and inconsistency allows a better understanding of the perspectives of the
distinct IR models defined in the literature.

Kraft, Bordogna and Pasi (2018) noted that vagueness and imprecision are related to the
representation of the information content of a proposition. For example, in the
information request, “find recent scientific chapters dealing with the early stage of
infectious diseases by HIV,” the terms recent and early specify vague values of the
publication date and of the temporal evolution of the disease, respectively. The
publication date and the phase of an infectious disease are usually expressed as numeric
values; their linguistic characterization has a coarser granularity with respect to
their numeric characterization. Linguistic values are defined by terms with semantics
compatible with several numeric values on the scale upon which the numeric information
is defined. Imprecision is just a limiting case of vagueness, since imprecise values
have a full compatibility with a subset of values of the numeric reference scale.

There are several ways to represent imprecise and vague concepts. One of the ways is
indirect, by defining similarity or proximity relationships between each pair of
imprecise and vague concepts. If we regard a document as an imprecise or vague concept,
i.e., as bearing a vague content, a numeric value computed by a similarity measure can
be used to express the closeness of any pair of documents. This is the way of dealing
with imprecise and vague document and query contents in the vector space model of
information retrieval. In this context the documents and the query are represented as
points in a vector space of terms, and the distances between the query and the document
points are used to quantify their similarity.

Uncertainty is related to the truth of a proposition, intended as the conformity of the
information carried by the proposition with the considered reality. Linguistic
expressions such as “probably” and “it is possible that” can be used to declare a
partial lack of knowledge about the truth of the stated information.

Furthermore, Kraft, Bordogna and Pasi (2018) noted that there are cases in which
information is affected by both uncertainty and imprecision or vagueness. For example,
consider the proposition “probably document d is relevant to query q.” However, the same
information content can be expressed by choosing a trade-off between the vagueness and
the uncertainty embedded in a proposition. For example, one can express the content of
the previous proposition by a new one: “document d is more or less relevant to query q.”
In this latter proposition, the uncertain term probably has been eliminated, but the
specificity of the vague term relevant has been reduced. In point of fact, the term more
or less relevant is less specific than the term relevant. A dual representation
eliminates imprecision and augments the uncertainty,
[Figure 1 plot: the term “recent” on a vertical scale up to 1, against a horizontal scale measured in years with marks at CD-3y, CD-1y and CD]
Fig. 1 Semantics of the term “recent” referring to the publication date of a scientific chapter.
CD = current date; y = years. Source: Kraft, Bordogna & Pasi (2018)
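
The semantics in Fig. 1 can be sketched as a membership function over the age of a chapter. This is only an illustration under stated assumptions: the function name `recent_membership` is hypothetical, full membership within one year of the current date is inferred from the CD-1y mark on the axis, and the linear decline between one and three years is an assumption, since the text only says that chapters older than 3 years have no sense of being recent.

```python
def recent_membership(age_years: float) -> float:
    """Degree in [0, 1] to which a chapter published `age_years` ago is "recent".

    Assumed shape (not confirmed by the source): 1 up to one year old,
    then a linear decrease over the next two years, 0 from three years on.
    """
    if age_years < 0:
        raise ValueError("age_years must be non-negative")
    if age_years <= 1.0:
        return 1.0          # between CD-1y and CD: fully recent
    if age_years >= 3.0:
        return 0.0          # older than CD-3y: not recent at all
    return (3.0 - age_years) / 2.0  # assumed linear decline between 1 and 3 years
```

Under these assumptions a chapter published two years before the current date would be recent to degree 0.5, which is the kind of intermediate value a crisp (yes/no) definition of "recent" cannot express.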
On the basis of what has been said about the trade-off between uncertainty and vagueness
to express the same information content, there are two alternative ways to model the IR
activity. One possibility is to model the query evaluation mechanism as an uncertain
decision process. Here the concept of relevance is considered binary (crisp) and the
query evaluation mechanism computes the probability of relevance of a document d to a
query q. Such an approach, which does model the uncertainty of the retrieval process,
has been introduced and developed by probabilistic IR models (Crestani et al. 1998).
Another possibility is to interpret the query as the specification of soft “elastic”
constraints that the representation of a document can satisfy to an extent, and to
consider the term relevant as a gradual (vague) concept. This is the approach adopted in
fuzzy IR models (Bordogna & Pasi 2000). In this latter case, the decision process
performed by the query evaluation mechanism computes the degree of satisfaction of the
query by the representation of each document.

This satisfaction degree, called the retrieval status value (RSV), is considered as an
estimate of the degree of relevance (or is at least proportional to the relevance) of a
given document with respect to a given user query. An RSV of 1 implies maximum
relevance; an RSV of 0 implies absolutely no relevance; and an RSV strictly between 0
and 1 implies an intermediate level or degree of relevance. For example, an RSV of 0.5
could imply an average degree of relevance (Kraft, Bordogna and Pasi, 2018).

Inconsistency comes from the simultaneous presence of contradictory information about
the same reality. An example of inconsistency can be observed when submitting the same
query to several IR systems that adopt different representations of documents and
produce different results. This is actually very common and often occurs when searching
for information over the Internet using different search engines. To solve this kind of
inconsistency, some fusion strategies can be applied to the ranked lists each search
engine produces. In fact, this is what metasearch engines do (Bordogna, Pasi & Yager,
2003).

Current trends in information retrieval systems

Some of the current trends in Information Retrieval (IR) research run the gamut in terms
of expanding the discipline both to incorporate the latest technologies and to cope with
novel necessities. In terms of novel necessities, with the diffusion of the Internet and
the heterogeneous characteristics of users of search engines, which can be regarded as
the new frontier of IR, a new central issue has arisen, generally
over the next 2 years until chapters older than 3 years have no sense of being recent.

Fuzzy retrieval models

Fuzzy retrieval models have been defined in order to reduce the imprecision that
characterizes the Boolean indexing process, to represent the user’s vagueness in
queries, and to deal with discriminated answers estimating the partial relevance of the
documents with respect to queries. Extended Boolean models based on fuzzy set theory
have been defined to deal with one or more of these aspects (Bordogna & Pasi, 1995).

It has been speculated that Boolean logic is passé, out of vogue. Yet, researchers have
employed p-norms in the vector space model or Bayesian inference nets in the
probabilistic model to incorporate Boolean logic into those models. In addition, the use
of Boolean logic to separate a collection of records into two disjoint classes has been
considered, e.g., using the one-clause-at-a-time (OCAT) methodology (Sanchez,
Triantaphyllou & Kraft 2003). Moreover, even now retrieval systems such as Dialog and
Web search engines such as Google allow for Boolean connectives. It should come as no
surprise, therefore, to see extensions of Boolean logic based upon fuzzy set theory for
IR.
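
A fuzzy extension of Boolean retrieval can be illustrated with a minimal sketch. It assumes the standard fuzzy set operators (minimum for AND, maximum for OR, complement to 1 for NOT); the text does not state which operators a given model uses, and other choices exist. The index, the term weights and all function names below are hypothetical. Each document gets a retrieval status value (RSV) in [0, 1], and documents are ranked by it.

```python
from typing import Callable, Dict, List, Tuple

TermWeights = Dict[str, float]           # term -> index weight in [0, 1]
Query = Callable[[TermWeights], float]   # maps a document's weights to an RSV

def w(doc: TermWeights, term: str) -> float:
    """Weight of `term` in `doc`; an absent term has membership degree 0."""
    return doc.get(term, 0.0)

def fuzzy_and(*degrees: float) -> float:
    return min(degrees)                  # assumed AND operator: minimum

def fuzzy_or(*degrees: float) -> float:
    return max(degrees)                  # assumed OR operator: maximum

def fuzzy_not(degree: float) -> float:
    return 1.0 - degree                  # assumed NOT operator: complement

def rank(index: Dict[str, TermWeights], query: Query) -> List[Tuple[str, float]]:
    """Return (document id, RSV) pairs sorted by decreasing RSV."""
    scored = [(doc_id, query(weights)) for doc_id, weights in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical fuzzy index: weights express how strongly a term describes a document.
INDEX: Dict[str, TermWeights] = {
    "d1": {"fuzzy": 0.75, "retrieval": 0.5, "database": 0.25},
    "d2": {"fuzzy": 0.25, "search": 1.0},
    "d3": {"fuzzy": 1.0, "retrieval": 1.0, "database": 0.875},
}

def example_query(doc: TermWeights) -> float:
    """fuzzy AND (retrieval OR search) AND NOT database"""
    return fuzzy_and(
        w(doc, "fuzzy"),
        fuzzy_or(w(doc, "retrieval"), w(doc, "search")),
        fuzzy_not(w(doc, "database")),
    )

RANKED = rank(INDEX, example_query)
# RANKED == [("d1", 0.5), ("d2", 0.25), ("d3", 0.125)]
```

With crisp weights of only 0 and 1 these operators reduce to ordinary Boolean logic, so a classical Boolean query is a special case of the sketch. With graded weights, d3 is not rejected outright for mentioning "database" but receives a low RSV instead.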