IR First Chapter


Chapter One – Overview of Information Storage and Retrieval

Admas University
Hargeisa Somaliland
Department of ICT
Concepts of Information

Data
• Think of data as a "raw material": it needs to be processed before it can be turned into something useful, hence the need for "data processing".
• Data comes in many forms: numbers, words, symbols.
• Data relates to transactions, events and facts.
• On its own, data is not very useful.


Think of the data that is created when you buy a product from a retailer. This includes:
• Time and date of transaction (e.g. 10:05 Tuesday 23 December 2003)
• Transaction value (e.g. £55.00)
• Facts about what was bought (e.g. hairdryer, cosmetics pack, shaving foam) and how much was bought (quantities)
• How payment was made (e.g. credit card, credit card number and code)
• Which employee recorded the sale
• Whether any promotional discount applied


• At its simplest, this data needs processing at the point of sale in order for the customer to receive a valid receipt. So, the data about the transaction is processed to create "information", in this case a receipt.
• The same data would also be useful to the manager of the retail store, for example as a report showing total sales for the day or the best-selling products. So, the data concerning all shop transactions in the day needs to be captured and then processed into a management report.
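
The step from raw transaction data to a management report can be sketched in a few lines of Python. This is only an illustration: the records, field names and the `daily_report` function are hypothetical and are not taken from any real point-of-sale system.

```python
from collections import Counter

# Hypothetical raw data: one record per transaction (field names are illustrative).
transactions = [
    {"time": "10:05", "items": {"hairdryer": 1, "shaving foam": 2}, "value": 55.00},
    {"time": "11:30", "items": {"cosmetics pack": 1}, "value": 20.00},
    {"time": "14:45", "items": {"shaving foam": 3}, "value": 9.00},
]

def daily_report(records):
    """Process raw transactions into a small management report."""
    total_sales = sum(r["value"] for r in records)
    units_sold = Counter()
    for r in records:
        # Add the quantities bought in this transaction to the running totals.
        units_sold.update(r["items"])
    best_seller, units = units_sold.most_common(1)[0]
    return {"total_sales": total_sales, "best_seller": best_seller, "units": units}

report = daily_report(transactions)
print(report)
```

The individual records are data; the totals and the best-selling product are information, because they are meaningful to the manager who receives them.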

Information
• The above example demonstrates what information is. Information is data that has been processed in such a way as to be meaningful to the person who receives it.
• Note the two key words: "processed" and "meaningful".
• It is not enough for data simply to be processed; it has to be of use to someone. Otherwise, why bother?


Attributes of Information
Characteristics of good information are as follows:
• reliable,
• timely,
• accessible,
• cost-effective,
• accurate,
• fit for purpose,
• relevant, and
• understandable by the user.
Information life cycle

[Figure: the information life cycle, a loop of six stages: Information Creation, Information Acquisition, Information Organization, Information Storage, Information Distribution, Information Use]

• Information creation: the process by which individuals and organizations generate and produce new information artifacts and items.
• Information acquisition: the process by which information items are obtained from external sources.
• Information organization: the process of indexing or classifying information in ways that support easy retrieval at later points in time.


• Information storage: the process of physically housing information content in structures such as a database or file system.
• Information distribution: the process of disseminating, transporting or sharing information.
• Information use: the process by which individuals and organizations utilize and apply the information made available to them.

Motivation Behind IR System

• Computers are able to scan whole documents and decide whether they are relevant or not.
• IR systems, since their inception, have existed to reduce a user’s workload in searching through the store of documents to find relevant ones.


• Information explosion
– The growth in information is not matched by the growth in retrieval mechanisms
– The overload has made storage and retrieval of information very difficult
– Because of the overload, our search space becomes large
– The search space contains information items, which could be in the form of books, journals, etc.
Information Retrieval (IR)

• A discipline of information science concerned with developing theories and methods for information access
• Is about finding relevant information in a large collection of documents
• Involves helping users find information that matches their information need (user-centered view)
• Concerned with the representation, storage and organization of, and access to, information items (system-centered view)
– Information items are usually text, but may also be images, audio, video, etc.
– Text items are often referred to as documents and may be of different scope (books, articles, paragraphs, etc.)

Information Retrieval Systems

– Systems that retrieve documents highly likely to be relevant to the user
– Systems built to reduce users’ workload in searching through the store of documents to find relevant ones
– Systems that give information about the presence or absence of documents in accordance with the query
– Consist of
• A set of items (documents)
• A set of requests (information needs)
• A mechanism for determining which items meet the requirements of the request (matching function)
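
The three parts listed above (items, requests, and a matching function) can be illustrated with a minimal sketch. The documents, the request and the `matches` function below are hypothetical; the matching rule used here (a document matches if it contains every word of the request) is just one simple choice, not the only possible matching function.

```python
def tokenize(text):
    """Lowercase a text and split it into a set of words."""
    return set(text.lower().split())

def matches(document, request):
    """Matching function: True if the document contains every request word."""
    return tokenize(request) <= tokenize(document)

# Set of items (documents); the contents are made up for illustration.
documents = {
    "d1": "storage and retrieval of information",
    "d2": "agricultural machinery in Somaliland",
    "d3": "information retrieval systems",
}

# A request (information need) expressed as a query.
request = "information retrieval"

hits = sorted(doc_id for doc_id, text in documents.items() if matches(text, request))
print(hits)  # ['d1', 'd3']
```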
Components of an IR System

a) Document selection subsystem - How we select the documents in the database that are relevant (matched with user requests)
b) Indexing subsystem - The means of organizing the documents selected by the subsystem above
c) Vocabulary subsystem - The list of selected subject terms used to represent a document. Indexing is updated on the basis of this vocabulary.
d) Searching subsystem - Where search strategies for using the system are formulated based on users’ needs
e) The user-system interface - Software that enables you to give commands to the system and receive its responses
f) The matching subsystem - Matches users’ queries with the available documents that are relevant
Components of an IR (cont…)

• Human components of IRSs
– Users: who create the need for the system (the user)
– Organization: who makes it possible to have the system (the funder)
– Information professionals: who operate the system and provide the services (the server)
• System components of IRSs
– Data: the contents of the system
– Devices and media: the hardware (HW) of the system
– Algorithms and procedures: the software (SW) of the system
Information Retrieval Systems

Primary goals of an IRS
• Retrieve all documents that are relevant to a user query while retrieving as few non-relevant documents as possible
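
The two halves of this goal are commonly measured with recall (the share of all relevant documents that were retrieved) and precision (the share of retrieved documents that are relevant). These two names do not appear in the slides; the sketch below assumes them as the standard measures, and the document identifiers are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result: the system returned four documents, three are relevant overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).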

Activities of IR

• Content analysis
– Concerned with describing the contents of documents
– Deals with the representation of the document
– Involves the analysis and assignment of terms or identifiers that are capable of representing document content and can be used as access points to that document
– Indexing and cataloguing are some of the processes used to represent the thought content of the document
• Information structure
– Concerned with exploiting relationships between documents to improve the efficiency and effectiveness of retrieval strategies
• Evaluation
– Deals with measurement of the effectiveness of retrieval
Data retrieval Vs Information retrieval systems

Aspect               Data retrieval                                 Information retrieval
Data organization    Structured (clear semantics: name, age, ...)   Unstructured (no fields other than text)
Context              Data                                           Information
Data object          Table                                          Document
Matching             Exact match                                    Partial match, best match
Items wanted         Matching                                       Relevant
Query language       SQL (artificial)                               Free text (natural language, Boolean)
Query specification  Complete                                       Incomplete
Accuracy             100% (results are always correct)              <50%
Error response       Sensitive                                      Insensitive
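
The "Matching" row can be made concrete with a small sketch. The records, documents and the `best_match` function are hypothetical; scoring by the number of shared words is one simple way to rank partial matches and is an assumption made for this example.

```python
# Data retrieval: structured records, exact match on a field.
records = [
    {"name": "Amina", "age": 30},
    {"name": "Yusuf", "age": 25},
]
exact = [rec for rec in records if rec["age"] == 25]

# Information retrieval: unstructured text, partial (best) match.
documents = {
    "d1": "development of agricultural machinery",
    "d2": "agricultural policy in Somaliland",
    "d3": "history of printing",
}

def best_match(query, docs):
    """Rank documents by how many query words they share; drop those sharing none."""
    q = set(query.lower().split())
    scored = [(len(q & set(text.lower().split())), doc_id)
              for doc_id, text in docs.items()]
    # Highest score first; ties are broken by document id.
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]

ranked = best_match("agricultural machinery development", documents)
print(exact)   # [{'name': 'Yusuf', 'age': 25}]
print(ranked)  # ['d1', 'd2']
```

The exact-match query either finds a record or does not, while the best-match query still returns d2 even though it shares only one word with the query.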
Basic Concepts of IR
The effective retrieval of relevant information is directly affected by two things:
1) The user task
– The user is anyone who needs to find some information
– User groups can be distinguished
• By their knowledge of the system
– Novice Vs experienced users
– End users Vs information specialists
• By their domain knowledge
– Domain experts Vs the general public
• By their information needs
– The need to locate a particular item, the need for some information, or the need for all information on a subject

2) Logical view of the documents
• Documents in a collection are frequently represented through a set of index terms or keywords

Index terms
• An index term is a keyword or group of related words that has some meaning of its own
• It is simply a word whose semantics help in remembering the document’s main theme
• It might be extracted directly from the text of the documents or specified by a human expert
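
Representing documents by index terms makes it possible to look up documents by term. One common structure for this is an inverted index (a name not used in the slides), which maps each term to the documents that contain it. The sketch below assumes the index terms have already been chosen; the documents and terms are made up.

```python
from collections import defaultdict

# Logical view: each document is represented by a set of index terms (hypothetical).
documents = {
    "d1": ["information", "retrieval", "indexing"],
    "d2": ["information", "storage"],
    "d3": ["thesaurus", "indexing"],
}

def build_inverted_index(doc_terms):
    """Map every index term to the set of document ids it represents."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return index

index = build_inverted_index(documents)
print(sorted(index["indexing"]))     # ['d1', 'd3']
print(sorted(index["information"]))  # ['d1', 'd2']
```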
Structure of an IRS
• An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
• That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.

• The black box is the processing part of the information retrieval system. It mainly includes indexing and searching.
• Searching is the way the file is examined and the terms in it are compared with a search query.

Typical IR Task

• Given a corpus of documents (text, image, video, audio) published by various authors and a user information need in the form of a query, an IR system searches for a ranked set of documents that are relevant to the user’s information need.

Main ingredients of IR Process
There are three main ingredients:
• Texts or documents
• Queries
• The process of evaluation
Indexing

• Indexing is the way documents (items) are represented
• One of the operations required in IR
• The most crucial and probably the most difficult task in an IRS
• The task of assigning appropriate terms and identifiers capable of representing the content of the collection items is called indexing
• In modern environments the indexing task is done automatically


Documents: items in the file
Requests (queries): expressions of the users’ information needs

Selecting the keywords means indexing the query.
Example: I need information on development of agricultural machinery in Somaliland
Keywords: development, agricultural, machinery, Somaliland
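
One simple way to select keywords automatically is to drop common words that carry little content. The sketch below reproduces the example above under that assumption; the `STOPWORDS` set is hand-picked for this one query (including the word "information") and is not a standard list.

```python
# Hypothetical stop list, chosen only to fit the example query.
STOPWORDS = {"i", "need", "information", "on", "of", "in"}

def index_query(query):
    """Return the query words that are not in the stop list, in original order."""
    words = [w.strip(".,?!") for w in query.split()]
    return [w for w in words if w and w.lower() not in STOPWORDS]

keywords = index_query(
    "I need information on development of agricultural machinery in Somaliland"
)
print(keywords)  # ['development', 'agricultural', 'machinery', 'Somaliland']
```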

IR Functions

The functions of an IR system can be elaborated as follows:
1) To identify the sources of information relevant to the areas of interest of the target users’ community;
2) To analyze the contents of the sources (documents);
3) To represent the contents of the analyzed sources in a way that will be suitable for matching with the users’ queries;
4) To analyze users’ queries and to represent them in a form that will be suitable for matching with the dataset;
5) To match the search statement with the stored dataset;
6) To retrieve the information that is relevant; and
7) To make necessary adjustments in the system based on the feedback from the user.

IR Challenges

• Full-text searching often cannot be accurate, since different authors may select different words to represent the same concept.
• The same meaning can be expressed using different terms, and the same term can carry different meanings. The sources of mismatch are
– synonyms (a word or phrase which has the same or nearly the same meaning as another word or phrase in the same language),
– homonyms (a word that sounds the same or is spelled the same as another word but has a different meaning),
– and related terms.
• How can we ensure that, for the same meaning, identical terms are used in the index and the query?
Thesaurus

• A thesaurus is the vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts (for example "broader" and "related") are made explicit.
• A thesaurus contains terms and relationships between terms.
• IR thesauri typically rely upon symbols such as USE/UF (UF = used for), BT (broader term), and RT (related term) to show inter-term relationships.
e.g., car = automobile, truck, bus, taxi, motor vehicle
color = paint, dye

Example: a thesaurus built to assist IR when searching for cars and vehicles:
Term: Motor vehicles
UF: Automobiles
    Cars
    Trucks
BT: Vehicles
RT: Road Engineering
    Road Transport
Class Activity: Build a thesaurus for ‘Electronic Device’


Question & Answer

11/2/2022
Thank You !!!
