Digital Libraries: Language Technologies

Digital Libraries: Language Technologies
R AFFAELLA B ERNARDI
` DEGLI S TUDI DI T RENTO U NIVERSIT A P. ZZA V ENEZIA , ROOM : 2.05, E - MAIL : BERNARDI @ DISI . UNITN . IT
Contents
First
Last
Prev
Next
Contents
Contents
First
Last
Prev
Next
1.
Recall: Inverted Index
We gave per granted we know: what a document is. how to machine-read each document.
Contents First Last Prev Next
2.
Text Processing: tools
Contents
First
Last
Prev
Next
3.
Text Processing: Pre-processing
Contents
First
Last
Prev
Next
4.
Denitions: Word, Term, Token, Type
Word A delimited string of characters as it appears in the text. Term A normalized word (case, morphology, spelling etc); an equivalence class of words. Token An instance of a word or term occurring in a document. Type The same as a term in most cases: an equivalence class of tokens.
Contents
First
Last
Prev
Next
Example: Tokens vs. Words Example: andarci veramente piano
3 token: andarci veramente piano 4 words: andare ci veramente piano 2 lexical unit: andarci piano veramente Tokens: token in different position are different. Types: token with no repetitions (position is not relevant)
[graphical words] [inected form] [unitary meaning]
Lemmas: the normalized form of a word (e.g. innitive for verbs,)
Contents
First
Last
Prev
Next
5.
Document transformation for indexing
Contents
First
Last
Prev
Next
5.1.
Normalization
Need to normalize terms in indexed text as well as query terms into the same form. E.g. U.S.A vs. USA Accent: naive vs. na ve, resume vs. r esum e. Case folding: Fed vs. fed We most commonly implicitly dene equivalence classes of terms (windows, window, Windows, etc.)
Most important criterion: How are users likely to write their queries for these words?
Even in languages that standardly have accents, users often do not type them. Most often users use lowercase, hence its better to lowercase everything.
Contents
First
Last
Prev
Next
5.2.
Tokenization
Each token is a candidate for a postings entry.

5.3.
Tokenkization difculties
One word or two (or several)?
Hewlett-Packard State-of-the-art co-education data base San Francisco

Numbers
3/20/91 vs. 20/3/91 Mar 20, 1991 100.2.86.144 800.234.2333

5.4.
Lemmatization
Denition Reduce inectional/variant forms to base form. E.g.:
am, are, is be car, cars, cars, cars car the boys cars are different colors the boy car be different color Lemmatization implies doing proper reduction to dictionary headword form (the lemma). Inectional morphology (cutting cut) vs. derivational morphology (destruction destroy)
Contents
First
Last
Prev
Next
5.5.
Stemming
Denition Crude heuristic process that chops off the ends of words in the hope of achiev-
ing what principled lemmatization attempts to do with a lot of linguistic knowledge. Its language dependent. E.g.: automate(s), automatic, automation all reduce to automat. for example compressed and compression are both accepted as equivalent to compress for exampl compress and compress ar both accept as equival to compress Downloadable, eg. www.tartarus.org/martin/PorterStemmer Sample rules: sses ss ies i ational ate
Contents
First
Last
Prev
Next
5.6.
Sentence splitting
Statistics based methods: Lexical category of preceding word Lexical category of following word Case of preceding word Case of following word ...
Contents
First
Last
Prev
Next
5.7.
Stop word list
Denition stop words are extremely common words which would appear to be of little
value in helping select documents matching a user need. E.g. a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with 250-300 most common words in English account for 50% or more of a given text E.g. the and of represent 10% of tokens in a corpus. and to, a and in another 10%. etc. Moby Dick Ch1: 859 unique words (types), 2256 word occurrences (tokens) However, if good compression techniques are used stop words can be considered too and in same cases may turn out to be useful. E.g. let it be. Hence, nowadays, most web search engines index stop words.
Contents
First
Last
Prev
Next
5.8.
LT and Inverted Index construction
Contents
First
Last
Prev
Next
6.
Problem with Boolean search

Boolean queries often result in either too few (=0) or too many (1000s) results. In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits. We wish to be able to rank documents, and rank higher the relevant ones. Hence, we need to assign a score to each query-document pair, say in [0, 1]. This score measures how well document and query match.
Contents
First
Last
Prev
Next
7.
Query-document matching scores I
How do we compute the score of a query-document pair? Lets start with a one-term query. If the query term does not occur in the document: score should be 0. The more frequent the query term in the document, the higher the score We will look at a number of alternatives for doing this.
Contents
First
Last
Prev
Next
7.1.
Take 1: Jaccard coefcient
A commonly used measure of overlap of two sets. Let A and B be two sets, such that / or B = 0 / , Jaccard coefcient: A=0
JACCARD (A, B) =
|A B| |A B|
JACCARD (A, A) = 1, JACCARD (A, B) = 0 if A B = 0 A and B dont have to be the same size. It always assigns a number between 0 and 1.
Jaccard coefcient: Example What is the query-document match score that the Jaccard
coefcient computes for: Query: ides of March Document Caesar died in March
JACCARD (q, d ) = 1/6
Contents
First
Last
Prev
Next
7.2.
Whats wrong with Jaccard?
It doesnt consider term frequency (how many occurrences a term has). Rare terms are more informative than frequent terms. Jaccard does not consider this information. We need a more sophisticated way of normalizing for the length of a document.
Contents
First
Last
Prev
Next
8.
Term weights
We need a way to compute the weight of a term in the term-document matrix.
Contents
First
Last
Prev
Next
8.1.
Term Frequency (tf)
Denition The term frequency tft ,d of term t in document d is dened as the number of
times that t occurs in d . We want to use tf when computing query-document match scores. But raw term frequency is not what we want because:
Normalized tf tf count is usually normalized to prevent a bias towards longer documents
(which may have a higher term count regardless of the actual importance of that term in the document) to give a measure of the importance of the term ti within the particular document d j . tfi,j = ni, j k nk, j
where ni, j is the number of occurrences of the considered term (ti ) in document d j , and the denominator is the sum of number of occurrences of all terms in document d j , that is, the size of the document |d j |. A document with tf = 10 occurrences of the term is more relevant than a document
with tf = 1 occurrence of the term. But not 10 times more relevant. Relevance does not increase proportionally with term frequency.
Contents
First
Last
Prev
Next
8.2.
Log frequency weighting
The log frequency weight of term t in d is dened as follows wt ,d = 1 + log10 tft ,d if tft ,d > 0 0 otherwise
tft ,d wt ,d : 0 0, 1 1, 2 1.3, 10 2, 1000 4, etc. Score for a document-query pair: sum over terms t in both q and d : tf-matching-score(q, d ) = t qd (1 + log tft ,d ) The score is 0 if none of the query terms is present in the document. Recall: The logarithm of a number y with respect to base b (logb y) is the exponent to which b has to be raised in order to yield y. In other words, the logarithm of y to base b is the solution x of the equation bx = y
8.3.
Exercise: Jaccard and tf measure
Compute the Jaccard matching score and the tf matching score for the following querydocument pairs. q: [information on cars] d: all youve ever wanted to know about cars q: [information on cars] d: information on trucks, information on planes, information on trains q: [red cars and red trucks] d: cops stop red cars more often
Contents
First
Last
Prev
Next
8.4.
Problem: Rare vs. Frequent terms
In addition, to term frequency (the frequency of the term in the document), we also want to use the frequency of the term in the collection for weighting and ranking.
Desired weight for rare terms Rare terms are more informative than frequent terms. Consider a term in the query that is rare in the collection (e.g., arachnocentric). A document containing this term is very likely to be relevant. We want high weights for rare terms like arachnocentric. Desired weight for frequent terms Frequent terms are less informative than rare terms. Consider a term in the query that is frequent in the collection (e.g., increase). A document containing these terms is more likely to be relevant than a document that doesnt but frequent terms are not sure indicators of relevance. For frequent terms we want positive weights but lower weights than for rare terms.
8.5.
Document Frequency
We want high weights for rare terms like arachnocentric. We want low (positive) weights for frequent words like good, increase, and line. We will use document frequency to factor this into computing the matching score.
Document frequency is the number of documents in the collection that the term occurs
in.
Contents
First
Last
Prev
Next
8.6.
Inverse Document Frequency (idf)
It estimates the rarity of a term in the whole document collection. (If a term occurs in all the documents of the collection, its IDF is zero.) idf affects the ranking of documents for queries with at least two terms. For example, in the query arachnocentric line, idf weighting increases the relative weight of arachnocentric and decreases the relative weight of line. idf has little effect on ranking for one-term queries.
Contents
First
Last
Prev
Next
idfi = log
|D| |{ j : ti d j }|
with |D| : cardinality of D, or the total number of documents in the corpus |{ j : ti d j }|: number of documents where the term ti appears (viz. the document frequency) (that is ni, j = 0). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to use 1 + |{ j : ti d j }|
,000 Example |D| = 1, 000, 000 idft = log10 1,000 df
t
term calpurnia animal sunday y under the
dft 1 100 1000 10,000 100,000 1,000,000
idft 6 4 3 2 1 0
Contents
First
Last
Prev
Next
8.7.
Collection Frequency vs. Document Frequency

collection frequency 10440 10422 document frequency 3997 8760
word insurance try
Collection frequency of t : number of tokens of t in the collection Document frequency of t : number of documents t occurs in Which word is a better search term? Which frequency is more relevant? This example suggests that df (and idf) is better for weighting than cf: we want the few documents containing insurance to get higher boost for a query on insurance, than the many documents containing try should get from a query on try.
Contents
First
Last
Prev
Next
8.8.
Tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight. Its the best weighting scheme in IR. It increases with the number of occurrences within a document. It increases with the rarity of the term in the collection. frequency] [term frequency] [inverse document
Contents
First
Last
Prev
Next
8.9.
Summing up: Weights
Term Frequency As rst approximation we can consider the frequency of the term in the
document.
Inverse Document Frequency Estimate the rarity of a term in the whole document col-
lection. (If a term occurs in all the documents of the collection, its IDF is zero.)
TF-IDF A combination of the term frequency (TF) and the inverse document frequency
(IDF). Others: http://www.lans.ece.utexas.edu/strehl/diss/note52.html
Contents
First
Last
Prev
Next
8.10.
Summing up: Binary, Count, Weight Matrix

d1 1 1 0 d1 4 232 0 d1 1.21 8.59 0.0 d2 1 1 1 d3 0 0 0 d4 1 1 0 d5 0 1 0 d4 2 2 0 d6 0 1 0 d5 0 1 0 ...
Brutus Caesar Calpurnia
d2 d3 157 0 227 0 10 0
d6 0 0 0 d5 0.0 0.25 0.0
...
d2 d3 d4 6.10 0.0 1.0 2.54 0.0 1.51 1.54 0.0 0.0
d6 . . . 0.0 0.0 0.0
Contents
First
Last
Prev
Next
9.
Background: Vector Space
A vector space is a mathematical structure formed by a collection of vectors: objects that may be added together and multiplied (scaled) by numbers, called scalars in this context. Vector an n-dimensional vector is represented by a column: v1 ... vn or for short as v = (v1 , . . . vn ).
Contents
First
Last
Prev
Next
9.1.
Operations on vectors
Vector addition: v + w = (v1 + w1 , . . . vn + wn ) similarly for the . Vectors are visualized by arrows. They correspond to points (the point where the arrow ends.)
Contents
First
Last
Prev
Next
9.2.
Dot product or inner product

n
v w = (v1 w1 + . . . + vn wn ) = vi wi
i=1
Example We have three goods to buy and sell, their prices are ( p1 , p2 , p3 ) (price vector p). The quantities we are buy or sell are (q1 , q2 , q3 ) (quantity vector q, their values are positive when we sell and negative when we buy.) Selling the quantity q1 at price p1 brings in q1 p1 . The total income is the dot product q p = (q1 , q2 , q3 ) ( p1 , p2 , p3 ) = q1 p1 + q2 p2 + q3 p3
Contents
First
Last
Prev
Next
9.3.
Length
Dimension of a vector the number of the components of a vector. We take all vectors to
have the same dimension N (N is the terms of the vocabulary of the collection) 2 Length ||v|| = v v = n i=1 vi For instance, given the vector: d = (4, 4, 4, 4) its length is computed as: 2 |d | = n i=1 di = 16 + 16 + 16 + 16 = 64 = 8
Contents
First
Last
Prev
Next
9.4.
Unit vector
is a vector whose length equals one. u= v ||v||
is a unit vector in the same direction as v. (normalized vector)
Contents
First
Last
Prev
Next
9.5.
Cosine formula
Contents
First
Last
Prev
Next
10.
Vector Space Model: Document as vectors
Each document is now represented as a real-valued vector of tf-idf weights R|V | . So we have a |V |-dimensional real-valued vector space. Terms are axes of the space. Documents are points or vectors in this space. Very high-dimensional: tens of millions of dimensions when you apply this to web search engines Each vector is very sparse - most entries are zero.
dJ = (t 1J , . . . , tn j ) d j stands for a certain document j, and tiJ stands for a term that occurs in j at position i of the vector.
10.1.
Assumption
Documents that are close together in the vector space talk about the same thing.
Contents
First
Last
Prev
Next
11.
Query-document matching scores II
Key idea 1: Do the same for queries: represent them as vectors in the high-dimensional space, qk = (t 1k , . . . , tnk ) Key idea 2: Rank documents according to their proximity to the query
Contents
First
Last
Prev
Next
11.1.
Proximity: Angle and Cosine
Rank documents according to angle with query Thought experiment: take a document d and append it to itself. Call this document d . d is twice as long as d . Semantically d and d have the same content. The angle between the two documents is 0, corresponding to maximal similarity . . . . . . even though the Euclidean distance (which is based on their length) between the two documents can be quite large.
From angles to cosines The following two notions are equivalent.
Rank documents according to the angle between query and document in decreasing order Rank documents according to cosine(query,document) in increasing order
Contents
First
Last
Prev
Next
11.2.
Vectors inner product
Recall: The inner product of two vectors is: x y = xi yi

i
D1 D2
anna, = (3, = (1,
mario, 3, 1,
apple, 3, 1,
home, 3, 1,
fabio, 0, 0,
paola, 0, 0,
school, 0, 0,
lecture 0) 0)
D1 D2 = i D1i D2i = (3 1) + (3 1) + (3 1) + (3 1) = 12
Contents
First
Last
Prev
Next
11.3.
Cosine Similarity
cos(x, y) = xy = |x||y| n i=1 xi yi
2 n i=1 xi 2 n i=1 yi
xi is the tf-idf weight of term i in x. yi is the tf-idf weight of term i in y ||x|| and ||y|| are the lengths of x and y. This is the cosine similarity of x and y or, equivalently, the cosine of the angle between x and y. In short, the similarity between two vectors is computed by the cosine of the angle between them.
Contents
First
Last
Prev
Next
11.4.
Example: Length
D1: Anna and Mario eat the apple at home. Anna and Mario have not an apple at home. Anna and Mario go out of home to buy an apple. D2: Anna and Mario eat the apple at home. D3: Fabio and Paola went to school and took a lecture. D1 D2 D3 anna, = (3, = (1, = (0, mario, 3, 1, 0, apple, 3, 1, 0, home, 3, 1, 0, fabio, 0, 0, 1, paola, 0, 0, 1, school, 0, 0, 1, lecture 0) 0) 1)
Recall denition: ||x|| = ||D1|| = 9+9+9+9 ||D2|| = 1+1+1+1 ||D3|| = 1+1+1+1
2 n i=1 xi
= 36 = 4 = 4
=6 =2 =2
Contents
First
Last
Prev
Next
11.5.
Example: Cosine Similarity

cos(x, y) = xy = ||x||||y|| home, 3, 1, 0, n i=1 xi yi
2 n i=1 xi 2 n i=1 yi
D1 D2 D3
anna, = (3, = (1, = (0,
mario, 3, 1, 0,
apple, 3, 1, 0,
fabio, 0, 0, 1,
paola, 0, 0, 1,
school, 0, 0, 1,
lecture 0) 0) 1)
cos(D1, D2) =
D1 D2 ||D1|| ||D2|| cos(D1, D3) =
(3 1) + (3 1) + (3 1) + (3 1) 12 = =1 62 12 D1 D3 0 0 = =0 6 2 12
||D1|| ||D3||
Contents
First
Last
Prev
Next
11.6.
Example: Query-Document Matching
Query: best car insurance. Document: car insurance auto insurance.

word auto best car insurance tf-raw 0 1 1 1 tf-wght 0 1 1 1 query df 5000 50000 10000 1000 idf 2.3 1.3 2.0 3.0 weight 0 1.3 2.0 3.0 tf-raw 1 0 1 2 document tf-wght weight 1 1 0 0 1 1 1.3 1.3 product nlized 0.52 0 0.52 0.68 0 0 1.04 2.04
Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the nal weight of the term in the query or document, nlized: document weights after cosine normalization, product: the product of nal query weight and nal document weight Final similarity score between query and document: i wqi wdi = 0 + 0 + 1.04 + 2.04 = 3.08
Contents
First
Last
Prev
Next
12.
Summary: Ranked retrieval in the vector space model
Represent the query as a weighted tf-idf vector Represent each document as a weighted tf-idf vector Compute the cosine similarity between the query vector and each document vector Rank documents with respect to the query Return the top K (e.g., K = 10) to the user
Contents
First
Last
Prev
Next
13.
LT: much more
The processing of the text can be much more elaborate: Pos Tagging Parsing Named Entity Recognition From unstructured to structured data Ontology Learning ... Some of these tools are useful to build Language Resources encoding the knowledge need to better capture the user information need.
Contents
First
Last
Prev
Next
14.
Your presentation
Possible topics: http://disi.unitn.it/bernardi/Courses/DL/clir.pdf Cross-Language Information Retrieval http://disi.unitn.it/bernardi/Courses/DL/faceted_search.pdf http://disi.unitn.it/bernardi/Courses/DL/automated_metadata.pdf
Contents
First
Last
Prev
Next

Digital Libraries: Language Technologies

Uploaded by

Copyright:

Available Formats

Digital Libraries: Language Technologies

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Digital Libraries: Language Technologies

Uploaded by

Copyright:

Available Formats

Digital Libraries: Language Technologies

Recall: Inverted Index

Text Processing: tools

Text Processing: Pre-processing

Denitions: Word, Term, Token, Type

Example: Tokens vs. Words Example: andarci veramente piano

[graphical words] [inected form] [unitary meaning]

Lemmas: the normalized form of a word (e.g. innitive for verbs,)

Document transformation for indexing

Each token is a candidate for a postings entry.

One word or two (or several)?

Hewlett-Packard State-of-the-art co-education data base San Francisco

3/20/91 vs. 20/3/91 Mar 20, 1991 100.2.86.144 800.234.2333

Denition Reduce inectional/variant forms to base form. E.g.:

Stop word list

LT and Inverted Index construction

Problem with Boolean search

Query-document matching scores I

Take 1: Jaccard coefcient

Whats wrong with Jaccard?

We need a way to compute the weight of a term in the term-document matrix.

Term Frequency (tf)

Log frequency weighting

Exercise: Jaccard and tf measure

Problem: Rare vs. Frequent terms

Inverse Document Frequency (idf)

term calpurnia animal sunday y under the

dft 1 100 1000 10,000 100,000 1,000,000

Collection Frequency vs. Document Frequency

word insurance try

Summing up: Weights

(IDF). Others: http://www.lans.ece.utexas.edu/strehl/diss/note52.html

Summing up: Binary, Count, Weight Matrix

Brutus Caesar Calpurnia

Brutus Caesar Calpurnia

d6 0 0 0 d5 0.0 0.25 0.0

Brutus Caesar Calpurnia

d2 d3 d4 6.10 0.0 1.0 2.54 0.0 1.51 1.54 0.0 0.0

d6 . . . 0.0 0.0 0.0

Background: Vector Space

Dot product or inner product

is a vector whose length equals one. u= v ||v||

is a unit vector in the same direction as v. (normalized vector)

Vector Space Model: Document as vectors

Query-document matching scores II

Proximity: Angle and Cosine

Vectors inner product

Recall: The inner product of two vectors is: x y = xi yi

anna, = (3, = (1,

Recall denition: ||x|| = ||D1|| = 9+9+9+9 ||D2|| = 1+1+1+1 ||D3|| = 1+1+1+1

Example: Cosine Similarity

anna, = (3, = (1, = (0,

D1 D2 ||D1|| ||D2|| cos(D1, D3) =

Example: Query-Document Matching

Query: best car insurance. Document: car insurance auto insurance.

Summary: Ranked retrieval in the vector space model

LT: much more

Possible topics: http://disi.unitn.it/bernardi/Courses/DL/clir.pdf Cross-Language Information Retrieval http://disi.unitn.it/bernardi/Courses/DL/faceted_search.pdf http://disi.unitn.it/bernardi/Courses/DL/automated_metadata.pdf

You might also like