
Journal of Library Services and Technologies, 2(2), 48-56, June 2020

ISSN: 2616-1354 (Print) 2636-7424 (Online) Available online at credencepressltd.com


DOI: 10.47524/jlst.v2i2.5

Current trends in information retrieval systems: review of fuzzy set theory and fuzzy Boolean retrieval models
Jonathan N. Chimah, PhD
Ebonyi State University Library,
Abakaliki, Nigeria
E-mail: jonachim2000@yahoo.com

Friday Ibiam Ude
Ebonyi State University Library
Abakaliki, Nigeria

Abstract
This paper reviews the concept and goal of Information Retrieval Systems (IRSs). It also explains concepts in Information Retrieval (IR) that are often treated as synonymous: imprecision, vagueness, uncertainty, and inconsistency. Current trends in IRSs are discussed. Fuzzy set theory and fuzzy retrieval models are reviewed. The paper also discusses extensions of fuzzy Boolean retrieval models, including fuzzy techniques for document indexing and flexible query languages. Three fuzzy associative mechanisms are identified: (1) fuzzy pseudothesauri and fuzzy ontologies can be used to contextualize the search by expanding the set of index terms of documents; (2) alternatively, fuzzy pseudothesauri and fuzzy ontologies can be used to expand the query with related terms, taking into account the varying importance of each additional term; and (3) fuzzy clustering techniques, where each document can be placed within several clusters with a given strength of belonging to each cluster, can be used to expand the set of documents retrieved in response to a query. The paper concludes by recommending that, in an electronic library environment, librarians and information scientists acquaint themselves with these terms in order to be better equipped to help library users retrieve online documents relevant to their information needs.
Keywords: Information retrieval systems, Document delivery, Fuzzy Set theory, Fuzzy
Boolean retrieval models

Introduction
Information Retrieval Systems (IRSs) came into being as a means of ensuring that information generated and recorded is not lost over time. Before knowledge was recorded, individuals formed the repository of knowledge. With libraries, the repository of knowledge began to change into recorded form. The quantity of new information being generated is such that no individual can hope to cope with this information explosion and at the same time make the information available to users. This led to the use of information retrieval with minimum cost in time, labour and money. Information retrieval, according to Unagha (2010), is the process of searching some collection of documents in order to identify those documents which deal with a particular subject. Reitz (2004) defined information retrieval as the process, methods and procedures used to selectively recall recorded information from a file of data.
In libraries, searches are typically made for a known item or for information on a specific subject, and the file is usually a human-readable catalogue or index, or a computer-based information storage and retrieval system, such as an online catalogue or bibliographic database. In the design of such systems, a balance must be attained. Anything that facilitates this literature searching activity may legitimately be called an information retrieval system. The


catalogue, index, bibliography and abstract, as well as the computer, are known as information retrieval systems.
Automated information retrieval systems are used to reduce what has been called information overload. An IR system is a software system that provides access to books, journals and other documents, and that stores and manages those documents. Web search engines are the most visible IR applications. An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.
An object is an entity that is represented by information in a content collection or database. User queries are matched against the database information. However, as opposed to classical SQL queries of a database, in information retrieval the results returned may or may not match the query, so results are typically ranked. This ranking of results is a key difference between information retrieval searching and database searching (Jansen and Rieh 2010).
Depending on the application, the data objects may be, for example, text documents, images, audio, mind maps or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata. As Frakes and Baeza-Yates (1992) noted, most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top-ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.
Against this backdrop, this paper examines the concept and goal of information retrieval systems and current trends in information retrieval systems, explains synonymous concepts (such as imprecision, vagueness, uncertainty, and inconsistency), and reviews fuzzy set theory and its concomitant Boolean retrieval models.

Concept and goal of information retrieval systems
Information retrieval (IR) is concerned with the storage, organization, and searching of collections of information. It has been a significant part of human technological development since the development of writing. The earliest IR systems were the organization schemes of ancient archives and libraries, such as early Sumerian archives, or the “Pinakes” developed by Callimachus for the library of Alexandria. In the twentieth century the largest impetus to the development of automated IR systems was the need to manage increasingly larger quantities of information in business and scientific development. Early attempts at automating search capabilities for document collections involved techniques based on punched cards, as well as machines using optical sensing of codes on microfilmed documents (Buckland 2006).
According to Larson (2018), the goal of any IR system is to select the information items (texts, images, videos, etc., which we will refer to as “documents”) that are expected to be relevant for a given searcher (or user) from a large collection of such items. Today these collections range from small sets of items on an individual’s personal computer to the vast resources of the World Wide Web. In all cases the task is the same: to extract the set of items that searchers want from all of those they do not want. This is not a simple task,


and involves not only the technical aspects of constructing a system to perform such selection, but also aspects of psychology and user behaviour to understand what differentiates the desired items from the non-desired ones from the particular user’s point of view.

The synonymous concepts in information retrieval
The terms imprecision, vagueness, uncertainty, and inconsistency are very often used as synonymous concepts. Nevertheless, when they are used to qualify a characteristic of information they have distinct meanings (Motro 1995). Since IR has to do with information, understanding the different meanings of imprecision, vagueness, uncertainty, and inconsistency allows a better understanding of the perspectives of the distinct IR models defined in the literature.
Kraft, Bordogna and Pasi (2018) noted that vagueness and imprecision are related to the representation of the information content of a proposition. For example, in the information request, “find recent scientific chapters dealing with the early stage of infectious diseases by HIV,” the terms recent and early specify vague values of the publication date and of the temporal evolution of the disease, respectively. The publication date and the phase of an infectious disease are usually expressed as numeric values; their linguistic characterization has a coarser granularity than their numeric characterization. Linguistic values are defined by terms with semantics compatible with several numeric values on the scale upon which the numeric information is defined. Imprecision is just a limit case of vagueness, since imprecise values have full compatibility with a subset of values of the numeric reference scale.
There are several ways to represent imprecise and vague concepts. One of the ways is indirectly, by defining similarity or proximity relationships between each pair of imprecise and vague concepts. If we regard a document as an imprecise or vague concept, i.e., as bearing a vague content, a numeric value computed by a similarity measure can be used to express the closeness of any pair of documents. This is the way of dealing with imprecise and vague document and query contents in the vector space model of information retrieval. In this context the documents and the query are represented as points in a vector space of terms, and the distances between the query and the document points are used to quantify their similarity.
Uncertainty is related to the truth of a proposition, intended as the conformity of the information carried by the proposition with the considered reality. Linguistic expressions such as “probably” and “it is possible that” can be used to declare a partial lack of knowledge about the truth of the stated information.
Furthermore, Kraft, Bordogna and Pasi (2018) noted that there are cases in which information is affected by both uncertainty and imprecision or vagueness. For example, consider the proposition “probably document d is relevant to query q.” However, the same information content can be expressed by choosing a trade-off between the vagueness and the uncertainty embedded in a proposition. For example, one can express the content of the previous proposition by a new one, “document d is more or less relevant to query q.” In this latter proposition, the uncertain term probably has been eliminated, but the specificity of the vague term relevant has been reduced. In point of fact, the term more or less relevant is less specific than the term relevant. A dual representation eliminates imprecision and augments the uncertainty,


as in the expression “it is not completely probable that document d fully satisfies the query q”.
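
Returning to the vector space model mentioned earlier in this section, the idea of quantifying the closeness of a query and a document can be illustrated with a short example. This is a minimal sketch, assuming cosine similarity as the similarity measure (the text does not name a specific one); the term weights, document names, and the function name `cosine_similarity` are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical query and documents, each a point in the space of terms.
query = {"fuzzy": 1.0, "retrieval": 1.0}
docs = {
    "d1": {"fuzzy": 2.0, "retrieval": 1.0, "boolean": 1.0},
    "d2": {"library": 3.0, "retrieval": 1.0},
    "d3": {"catalogue": 2.0},
}

# Rank documents by decreasing similarity to the query.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
for name in ranked:
    print(name, round(cosine_similarity(query, docs[name]), 3))
```

Documents that share no terms with the query receive a similarity of zero, while documents sharing more heavily weighted terms are ranked higher, so closeness is expressed as a graded value and not as a yes/no match.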

[Figure 1: membership function of the term “recent”. Vertical axis: 0 to 1. Horizontal axis: scale measured in years, marked at CD-3y, CD-1y and CD.]

Fig. 1 Semantics of the term “recent” referring to the publication date of a scientific chapter.
CD = current date; y = years. Source: Kraft, Bordogna & Pasi (2018)
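
The compatibility function of Fig. 1 can be written out explicitly. The following is a minimal sketch, assuming the shape suggested by the axis marks of the figure: full membership for publication dates between CD-1y and CD, a decline between CD-1y and CD-3y, and zero membership for anything older than CD-3y. The function name `recent_membership` and the choice of a linear decline are illustrative assumptions, not definitions taken from Kraft, Bordogna and Pasi (2018).

```python
def recent_membership(age_years: float) -> float:
    """Degree (0..1) to which a chapter published `age_years` ago is "recent"."""
    if age_years < 0:
        raise ValueError("age_years must be non-negative")
    if age_years <= 1.0:
        return 1.0  # between CD-1y and CD: perfectly recent
    if age_years >= 3.0:
        return 0.0  # CD-3y or older: not recent at all
    # Assumed linear decline between 1 and 3 years.
    return (3.0 - age_years) / 2.0


for age in (0.5, 1.0, 2.0, 2.5, 3.0, 4.0):
    print(f"age {age} y -> membership {recent_membership(age):.2f}")
```

Under these assumptions a two-year-old chapter is recent to degree 0.5, which is the kind of graded judgement that a crisp cut-off date cannot express.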

On the basis of what has been said about the trade-off between uncertainty and vagueness to express the same information content, there are two alternative ways to model the IR activity. One possibility is to model the query evaluation mechanism as an uncertain decision process. Here the concept of relevance is considered binary (crisp) and the query evaluation mechanism computes the probability of relevance of a document d to a query q. Such an approach, which does model the uncertainty of the retrieval process, has been introduced and developed by probabilistic IR models (Crestani et al. 1998). Another possibility is to interpret the query as the specification of soft “elastic” constraints that the representation of a document can satisfy to an extent, and to consider the term relevant as a gradual (vague) concept. This is the approach adopted in fuzzy IR models (Bordogna & Pasi 2000). In this latter case, the decision process performed by the query evaluation mechanism computes the degree of satisfaction of the query by the representation of each document.
This satisfaction degree, called the retrieval status value (RSV), is considered an estimate of the degree of relevance (or is at least proportional to the relevance) of a given document with respect to a given user query. An RSV of 1 implies maximum relevance; an RSV of 0 implies absolutely no relevance. An RSV strictly between 0 and 1 implies an intermediate level or degree of relevance. For example, an RSV of 0.5 could imply an average degree of relevance (Kraft, Bordogna and Pasi, 2018).
Inconsistency comes from the simultaneous presence of contradictory information about the same reality. An example of inconsistency can be observed when submitting the same query to several IRSs that adopt different representations of documents and produce different results. This is actually very common and often occurs when searching for information over the Internet using different search engines. To solve this kind of inconsistency, some fusion strategies can be applied to the ranked lists each search engine produces. In fact, this is what metasearch engines do (Bordogna, Pasi & Yager, 2003).

Current trends in information retrieval systems
Some of the current trends in Information Retrieval (IR) research run the gamut in terms of expanding the discipline both to incorporate the latest technologies and to cope with novel necessities. In terms of novel necessities, with the diffusion of the Internet and the heterogeneous characteristics of users of search engines, which can be regarded as the new frontier of IR, a new central issue has arisen, generally


known as the semantic web (Kraft, Bordogna & Pasi, 2018). It mainly consists in expanding Information Retrieval Systems (IRSs) with the capability to represent and manage the semantics of both user requests and documents so as to be able to account for user and document contexts. This need becomes urgent with cross-language retrieval, which consists in expressing queries in one language to retrieve documents written in another. Cross-language retrieval not only implies new work on text processing (e.g., stemming conducted on a variety of languages) and new models of IR (such as the development of language models), but also the ability to match terms in distinct languages at a conceptual level, by modeling their meaning.
Another research trend, according to Kraft, Bordogna and Pasi (2018), is motivated by the need to manage multimedia collections with non-print audio elements such as sound, music, and voice, and video elements such as images, pictures, movies, and animation. Retrieval of such elements can include consideration of both metadata and content-based retrieval techniques. The definition of new IRSs capable of efficiently extracting content indexes from multimedia documents, and of effectively retrieving documents by similarity or proximity to a query by example, so as to fill the semantic gap existing between low-level syntactic index matching and the semantics of multimedia document and query, is still to come.
In addition, modern computing technology, including storage media, distributed and parallel processing architectures, and improved algorithms for text processing and for retrieval, has an effect on IRSs. For example, improved string searching algorithms have improved the efficiency of search engines. Improved computer networks have made the Internet and the World Wide Web a possibility. Intelligent agents can improve retrieval in terms of attempting to customize and personalize it for individual users. Moreover, great improvements have been made in retrieval system interfaces based on human-computer interface research.
These novel research trends in IR are addressed by turning to technologies such as natural language processing, image processing, language models, artificial intelligence, and automatic learning. Fuzzy set theory can also play a crucial role in defining novel solutions to these research issues, since it provides suitable means to cope with the needs of the semantic web (Sanchez, 2006), i.e., to model the semantics of linguistic terms so as to reflect their vagueness and subjectivity and to compute degrees of similarity, generalization, and specialization between their meanings.

Fuzzy set theory
The notion of a fuzzy set is an extension of normal set theory (Zadeh 1965). According to Zadeh, a set is simply a collection of objects. A fuzzy set (more properly called a fuzzy subset) is a subset of a given universe of objects, where the membership in the fuzzy set is not definite. For example, consider the idea of a person being middle-aged. If a person’s age is 39, one can consider the imprecision of that person being in the set of middle-aged people. The membership function, µ, assigns a number in the interval [0,1] that represents the degree to which that person belongs to that set. Thus, the terms recent and early can be defined as fuzzy subsets, with the membership functions interpreted as compatibility functions of the meaning of the terms with respect to the numeric values of the reference (base) variable. In Fig. 1, the compatibility function of the term recent is presented with the numeric values of the time-scale measured in years. Note that here a chapter that has a publication date of the current year or 1 year previous is perfectly recent; however, the extent to which a chapter remains recent declines steadily


over the next 2 years until chapters older than 3 years have no sense of being recent.

Fuzzy retrieval models
Fuzzy retrieval models have been defined in order to reduce the imprecision that characterizes the Boolean indexing process, to represent the user’s vagueness in queries, and to deal with discriminated answers estimating the partial relevance of the documents with respect to queries. Extended Boolean models based on fuzzy set theory have been defined to deal with one or more of these aspects (Bordogna & Pasi, 1995).
It has been speculated that Boolean logic is passé, out of vogue. Yet, researchers have employed p-norms in the vector space model or Bayesian inference nets in the probabilistic model to incorporate Boolean logic into those models. In addition, the use of Boolean logic to separate a collection of records into two disjoint classes has been considered, e.g., using the one-clause-at-a-time (OCAT) methodology (Sanchez, Triantaphyllou & Kraft 2003). Moreover, even now retrieval systems such as Dialog and Web search engines such as Google allow for Boolean connectives. It should come as no surprise, therefore, to see extensions of Boolean logic based upon fuzzy set theory for IR.
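
To make the idea of a fuzzy extension of Boolean retrieval concrete, the sketch below evaluates a Boolean query over weighted index terms and returns a retrieval status value (RSV) for each document. It assumes the standard fuzzy set operators of Zadeh (1965): minimum for AND, maximum for OR, and 1 - µ for NOT; the extended models cited above may use other aggregation operators. The index weights, document names, and the nested-tuple query format are hypothetical and serve only as an illustration.

```python
from typing import Dict, Tuple, Union

# A query is a term, ("not", q), or ("and" | "or", q1, q2).
Query = Union[str, Tuple]


def rsv(query: Query, weights: Dict[str, float]) -> float:
    """Degree in [0, 1] to which a document's weighted index satisfies the query."""
    if isinstance(query, str):
        return weights.get(query, 0.0)  # a missing term has membership 0
    op = query[0]
    if op == "not":
        return 1.0 - rsv(query[1], weights)
    left, right = rsv(query[1], weights), rsv(query[2], weights)
    if op == "and":
        return min(left, right)
    if op == "or":
        return max(left, right)
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical fuzzy index: each term carries a weight instead of a 0/1 flag.
index = {
    "d1": {"fuzzy": 0.9, "retrieval": 0.7, "boolean": 0.2},
    "d2": {"fuzzy": 0.4, "retrieval": 0.8},
    "d3": {"boolean": 1.0, "retrieval": 0.3},
}

# retrieval AND (fuzzy OR NOT boolean)
query = ("and", "retrieval", ("or", "fuzzy", ("not", "boolean")))

ranking = sorted(((rsv(query, w), doc) for doc, w in index.items()), reverse=True)
for score, doc in ranking:
    print(doc, round(score, 2))
```

With weights restricted to 0 and 1 the same function reproduces classical Boolean matching, which is why such models can be seen as generalizations of the Boolean model and not as replacements for it.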

Fig. 2: Categorization of IR-models. Source: Dominik Kuropka (2004).

Extensions of fuzzy Boolean retrieval models
The fuzzy retrieval models have been defined as generalizations of the classical Boolean model. These allow one to extend existing Boolean IRSs without having to redesign them. This was first motivated by the need to be able to produce proper answers in response to queries. In essence, the classical Boolean IRSs apply an exact match between a Boolean query and the representation of each document. This document representation is defined as a set of index terms. These systems partition the collection of documents into two sets, the retrieved documents and the rejected (non-retrieved) ones. As a consequence of this crisp behaviour, these systems are liable to reject useful items as a result of too restrictive queries, as well as to retrieve useless material in reply to queries (Salton & McGill, 1983).

Fuzzy techniques for documents’ indexing
The aim is to provide more specific and exhaustive representations of each document’s information content. This means


improving these representations beyond those generated by existing indexing mechanisms.

Flexible query languages
There are query languages that are more expressive and natural than classical Boolean logic. These are defined in order to capture the vagueness of user needs as well as to simplify user-system interaction. This has been pursued with two different approaches. There has been work on the definition of soft selection criteria (soft constraints), which allow the specification of the different importance of the search terms. Query languages based on numeric query term weights with different semantics were first proposed as an aid to define more expressive selection criteria (Cater & Kraft, 1987).

Fuzzy associative mechanisms
In their work on fuzzy set theory, Kraft, Bordogna and Pasi (2018) explained that these associative mechanisms make it possible to automatically generate fuzzy pseudothesauri, fuzzy ontologies, and fuzzy clusterings, which serve three distinct but compatible purposes. First, fuzzy pseudothesauri and fuzzy ontologies can be used to contextualize the search by expanding the set of index terms of documents to include additional terms, taking into account their varying significance in representing the topics dealt with in the documents; the degree of significance of these associated terms depends on the strength of the associations with a document’s original descriptors. Second, an alternative use of fuzzy pseudothesauri and fuzzy ontologies is to expand the query with related terms, taking into account their varying importance; the importance of an additional term depends upon its strength of association with the search terms in the original query. Third, fuzzy clustering techniques, where each document can be placed within several clusters with a given strength of belonging to each cluster, can be used to expand the set of documents retrieved in response to a query. Documents associated with retrieved documents, i.e., in the same cluster, can be retrieved. The degree of association of a document with the retrieved documents does influence its RSV. Another application of fuzzy clustering in IR is that of providing an alternative way, with respect to the usual ranked list, of presenting the results of a search.

Conclusion
In the library today, instead of the individual memory, we have the corporate memory: the library catalogues, bibliographies, indexes and computers. These information retrieval systems (tools) contain the bibliographical details of the documents, such as the author, edition, call number, publisher, place of publication, date, etc. The concept and the goal of information retrieval systems have been reviewed. Current trends in information retrieval systems have been highlighted. Attempts have been made to demystify the seemingly confusing synonymous concepts in information retrieval, which include imprecision, vagueness, uncertainty, and inconsistency. Related literature has been reviewed on fuzzy set theory, fuzzy retrieval models, extensions of fuzzy Boolean retrieval models, and fuzzy associative mechanisms. It is recommended that, in an electronic library environment, librarians and information scientists acquaint themselves with these terms in order to be better equipped to help library users retrieve online documents relevant to their information needs. This would further enhance the utilization of our institutions’ electronic libraries.


References
Bordogna, G. & Pasi, G. (1995). Linguistic aggregation operators in fuzzy information retrieval. International Journal of Intelligent Systems, 10(2), 234-248.
Bordogna, G. & Pasi, G. (2000). The application of fuzzy set theory to model information retrieval. In Crestani, F. & Pasi, G. (Eds.), Soft Computing in Information Retrieval: Techniques and Applications. Physica-Verlag: Heidelberg, Germany.
Bordogna, G., Pasi, G. & Yager, R. (2003). Soft approaches to information retrieval on the WEB. International Journal of Approximate Reasoning, 34, 105-120.
Buckland, M. K. (2006). Emanuel Goldberg and His Knowledge Machine. Libraries Unlimited: Westport, CT.
Cater, S. C. & Kraft, D. H. (1987). TIRS: A topological information retrieval system satisfying the requirements of the Waller-Kraft wish list. In Proceedings of the Tenth Annual ACM/SIGIR International Conference on Research and Development in Information Retrieval. New Orleans, LA, June, 171-180.
Crestani, F., Lalmas, M., van Rijsbergen, C. J. & Campbell, I. (1998). Is this document relevant? Probably. ACM Computing Surveys, 30(4), 528-552.
Frakes, W. B. & Baeza-Yates, R. (1992). Information retrieval: data structures & algorithms. Prentice-Hall, Inc. Archived from the original on 2013-09-28.
Jansen, B. J. & Rieh, S. (2010). The Seventeen Theoretical Constructs of Searching and Information Retrieval. Journal of the American Society for Information Science and Technology, 61(8), 1517-1534.
Kraft, D., Bordogna, G. & Pasi, G. (2018). Fuzzy set theory. In John D. McDonald & Michael Levine-Clark (Eds.), Encyclopedia of Library and Information Sciences, 4th edition. Taylor & Francis, pp. 1618-1622.
Kuropka, D. (2004). Modelle zur Repräsentation natürlichsprachlicher Dokumente. Ontologie-basiertes Information-Filtering und -Retrieval mit relationalen Datenbanken. Advances in Information Systems and Management Science, Bd. 10. Retrieved from: https://www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0514&lng=eng&id=
Larson, R. R. (2018). Information Retrieval System. In John D. McDonald & Michael Levine-Clark (Eds.), Encyclopedia of Library and Information Sciences, 4th edition. Taylor & Francis, p. 2199.
Motro, A. (1995). Imprecision and uncertainty in database systems. In Bosc, P. & Kacprzyk, J. (Eds.), Fuzziness in Database Management Systems. Physica-Verlag: Heidelberg, Germany, 3-22.
Reitz, J. M. (2004). Dictionary of library and information science. Westport, Connecticut: Libraries Unlimited.
Salton, G. & McGill, M. J. (1983). Introduction to Modern Information Retrieval. New York: McGraw-Hill.
Sanchez, E. (2006). Fuzzy Logic and the Semantic Web. Elsevier: Amsterdam, the Netherlands.
Sanchez, S. N., Triantaphyllou, E. & Kraft, D. H. (2003). A feature mining based approach for the classification of text documents into disjoint classes. Information Processing & Management, 38(4), 583-604.


Unagha, A. O. (2010). Knowledge organization and information retrieval. Okigwe: Whytem Publishers Nig.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.

Dr. Jonathan N. Chimah is the University Librarian of Ebonyi State University Abakaliki and also a Senior lecturer at the Department of Library & Information Science. He is a member of the Nigerian Library Association (NLA), Nigeria Library Information Science Educators (NALISE) and a certified librarian with Librarians’ Registration Council of Nigeria (LRCN). E-mail: jonachim2000@yahoo.com; Cell: +2348037976028.

Friday Ibiam Ude is the immediate past University Librarian of Ebonyi State University. He is currently Faculty of Agriculture & Natural Resources Librarian of EBSU and also a lecturer at the Department of Library and Information Science, Ebonyi State University Abakaliki. He is a member of Nigerian Library Association (NLA) and a certified librarian with Librarians’ Registration Council of Nigeria (LRCN). E-mail: ibiamude7@gmail.com; Cell: +2347038852971.
