Digital Libraries: Language Technologies
Digital Libraries: Language Technologies
Digital Libraries: Language Technologies
R AFFAELLA B ERNARDI
` DEGLI S TUDI DI T RENTO U NIVERSIT A P. ZZA V ENEZIA , ROOM : 2.05, E - MAIL : BERNARDI @ DISI . UNITN . IT
Contents
First
Last
Prev
Next
Contents
Contents
First
Last
Prev
Next
1.
We gave per granted we know: what a document is. how to machine-read each document.
Contents First Last Prev Next
2.
Contents
First
Last
Prev
Next
3.
Contents
First
Last
Prev
Next
4.
Word A delimited string of characters as it appears in the text. Term A normalized word (case, morphology, spelling etc); an equivalence class of words. Token An instance of a word or term occurring in a document. Type The same as a term in most cases: an equivalence class of tokens.
Contents
First
Last
Prev
Next
3 token: andarci veramente piano 4 words: andare ci veramente piano 2 lexical unit: andarci piano veramente Tokens: token in different position are different. Types: token with no repetitions (position is not relevant)
Contents
First
Last
Prev
Next
5.
Contents
First
Last
Prev
Next
5.1.
Normalization
Need to normalize terms in indexed text as well as query terms into the same form. E.g. U.S.A vs. USA Accent: naive vs. na ve, resume vs. r esum e. Case folding: Fed vs. fed We most commonly implicitly dene equivalence classes of terms (windows, window, Windows, etc.)
Most important criterion: How are users likely to write their queries for these words?
Even in languages that standardly have accents, users often do not type them. Most often users use lowercase, hence its better to lowercase everything.
Contents
First
Last
Prev
Next
5.2.
Tokenization
5.3.
Tokenkization difculties
5.4.
Lemmatization
am, are, is be car, cars, cars, cars car the boys cars are different colors the boy car be different color Lemmatization implies doing proper reduction to dictionary headword form (the lemma). Inectional morphology (cutting cut) vs. derivational morphology (destruction destroy)
Contents
First
Last
Prev
Next
5.5.
Stemming
Denition Crude heuristic process that chops off the ends of words in the hope of achiev-
ing what principled lemmatization attempts to do with a lot of linguistic knowledge. Its language dependent. E.g.: automate(s), automatic, automation all reduce to automat. for example compressed and compression are both accepted as equivalent to compress for exampl compress and compress ar both accept as equival to compress Downloadable, eg. www.tartarus.org/martin/PorterStemmer Sample rules: sses ss ies i ational ate
Contents
First
Last
Prev
Next
5.6.
Sentence splitting
Statistics based methods: Lexical category of preceding word Lexical category of following word Case of preceding word Case of following word ...
Contents
First
Last
Prev
Next
5.7.
Denition stop words are extremely common words which would appear to be of little
value in helping select documents matching a user need. E.g. a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with 250-300 most common words in English account for 50% or more of a given text E.g. the and of represent 10% of tokens in a corpus. and to, a and in another 10%. etc. Moby Dick Ch1: 859 unique words (types), 2256 word occurrences (tokens) However, if good compression techniques are used stop words can be considered too and in same cases may turn out to be useful. E.g. let it be. Hence, nowadays, most web search engines index stop words.
Contents
First
Last
Prev
Next
5.8.
Contents
First
Last
Prev
Next
6.
Contents
First
Last
Prev
Next
7.
How do we compute the score of a query-document pair? Lets start with a one-term query. If the query term does not occur in the document: score should be 0. The more frequent the query term in the document, the higher the score We will look at a number of alternatives for doing this.
Contents
First
Last
Prev
Next
7.1.
A commonly used measure of overlap of two sets. Let A and B be two sets, such that / or B = 0 / , Jaccard coefcient: A=0
JACCARD (A, B) =
|A B| |A B|
JACCARD (A, A) = 1, JACCARD (A, B) = 0 if A B = 0 A and B dont have to be the same size. It always assigns a number between 0 and 1.
Jaccard coefcient: Example What is the query-document match score that the Jaccard
coefcient computes for: Query: ides of March Document Caesar died in March
JACCARD (q, d ) = 1/6
Contents
First
Last
Prev
Next
7.2.
It doesnt consider term frequency (how many occurrences a term has). Rare terms are more informative than frequent terms. Jaccard does not consider this information. We need a more sophisticated way of normalizing for the length of a document.
Contents
First
Last
Prev
Next
8.
Term weights
Contents
First
Last
Prev
Next
8.1.
Denition The term frequency tft ,d of term t in document d is dened as the number of
times that t occurs in d . We want to use tf when computing query-document match scores. But raw term frequency is not what we want because:
Normalized tf tf count is usually normalized to prevent a bias towards longer documents
(which may have a higher term count regardless of the actual importance of that term in the document) to give a measure of the importance of the term ti within the particular document d j . tfi,j = ni, j k nk, j
where ni, j is the number of occurrences of the considered term (ti ) in document d j , and the denominator is the sum of number of occurrences of all terms in document d j , that is, the size of the document |d j |. A document with tf = 10 occurrences of the term is more relevant than a document
Contents First Last Prev Next
with tf = 1 occurrence of the term. But not 10 times more relevant. Relevance does not increase proportionally with term frequency.
Contents
First
Last
Prev
Next
8.2.
The log frequency weight of term t in d is dened as follows wt ,d = 1 + log10 tft ,d if tft ,d > 0 0 otherwise
tft ,d wt ,d : 0 0, 1 1, 2 1.3, 10 2, 1000 4, etc. Score for a document-query pair: sum over terms t in both q and d : tf-matching-score(q, d ) = t qd (1 + log tft ,d ) The score is 0 if none of the query terms is present in the document. Recall: The logarithm of a number y with respect to base b (logb y) is the exponent to which b has to be raised in order to yield y. In other words, the logarithm of y to base b is the solution x of the equation bx = y
Contents First Last Prev Next
8.3.
Compute the Jaccard matching score and the tf matching score for the following querydocument pairs. q: [information on cars] d: all youve ever wanted to know about cars q: [information on cars] d: information on trucks, information on planes, information on trains q: [red cars and red trucks] d: cops stop red cars more often
Contents
First
Last
Prev
Next
8.4.
In addition, to term frequency (the frequency of the term in the document), we also want to use the frequency of the term in the collection for weighting and ranking.
Desired weight for rare terms Rare terms are more informative than frequent terms. Consider a term in the query that is rare in the collection (e.g., arachnocentric). A document containing this term is very likely to be relevant. We want high weights for rare terms like arachnocentric. Desired weight for frequent terms Frequent terms are less informative than rare terms. Consider a term in the query that is frequent in the collection (e.g., increase). A document containing these terms is more likely to be relevant than a document that doesnt but frequent terms are not sure indicators of relevance. For frequent terms we want positive weights but lower weights than for rare terms.
Contents First Last Prev Next
8.5.
Document Frequency
We want high weights for rare terms like arachnocentric. We want low (positive) weights for frequent words like good, increase, and line. We will use document frequency to factor this into computing the matching score.
Document frequency is the number of documents in the collection that the term occurs
in.
Contents
First
Last
Prev
Next
8.6.
It estimates the rarity of a term in the whole document collection. (If a term occurs in all the documents of the collection, its IDF is zero.) idf affects the ranking of documents for queries with at least two terms. For example, in the query arachnocentric line, idf weighting increases the relative weight of arachnocentric and decreases the relative weight of line. idf has little effect on ranking for one-term queries.
Contents
First
Last
Prev
Next
idfi = log
|D| |{ j : ti d j }|
with |D| : cardinality of D, or the total number of documents in the corpus |{ j : ti d j }|: number of documents where the term ti appears (viz. the document frequency) (that is ni, j = 0). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to use 1 + |{ j : ti d j }|
,000 Example |D| = 1, 000, 000 idft = log10 1,000 df
t
idft 6 4 3 2 1 0
Contents
First
Last
Prev
Next
8.7.
Collection frequency of t : number of tokens of t in the collection Document frequency of t : number of documents t occurs in Which word is a better search term? Which frequency is more relevant? This example suggests that df (and idf) is better for weighting than cf: we want the few documents containing insurance to get higher boost for a query on insurance, than the many documents containing try should get from a query on try.
Contents
First
Last
Prev
Next
8.8.
Tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight. Its the best weighting scheme in IR. It increases with the number of occurrences within a document. It increases with the rarity of the term in the collection. frequency] [term frequency] [inverse document
Contents
First
Last
Prev
Next
8.9.
Term Frequency As rst approximation we can consider the frequency of the term in the
document.
Inverse Document Frequency Estimate the rarity of a term in the whole document col-
lection. (If a term occurs in all the documents of the collection, its IDF is zero.)
TF-IDF A combination of the term frequency (TF) and the inverse document frequency
Contents
First
Last
Prev
Next
8.10.
d2 d3 157 0 227 0 10 0
...
Contents
First
Last
Prev
Next
9.
A vector space is a mathematical structure formed by a collection of vectors: objects that may be added together and multiplied (scaled) by numbers, called scalars in this context. Vector an n-dimensional vector is represented by a column: v1 ... vn or for short as v = (v1 , . . . vn ).
Contents
First
Last
Prev
Next
9.1.
Operations on vectors
Vector addition: v + w = (v1 + w1 , . . . vn + wn ) similarly for the . Vectors are visualized by arrows. They correspond to points (the point where the arrow ends.)
Contents
First
Last
Prev
Next
9.2.
v w = (v1 w1 + . . . + vn wn ) = vi wi
i=1
Example We have three goods to buy and sell, their prices are ( p1 , p2 , p3 ) (price vector p). The quantities we are buy or sell are (q1 , q2 , q3 ) (quantity vector q, their values are positive when we sell and negative when we buy.) Selling the quantity q1 at price p1 brings in q1 p1 . The total income is the dot product q p = (q1 , q2 , q3 ) ( p1 , p2 , p3 ) = q1 p1 + q2 p2 + q3 p3
Contents
First
Last
Prev
Next
9.3.
Length
Dimension of a vector the number of the components of a vector. We take all vectors to
have the same dimension N (N is the terms of the vocabulary of the collection) 2 Length ||v|| = v v = n i=1 vi For instance, given the vector: d = (4, 4, 4, 4) its length is computed as: 2 |d | = n i=1 di = 16 + 16 + 16 + 16 = 64 = 8
Contents
First
Last
Prev
Next
9.4.
Unit vector
Contents
First
Last
Prev
Next
9.5.
Cosine formula
Contents
First
Last
Prev
Next
10.
Each document is now represented as a real-valued vector of tf-idf weights R|V | . So we have a |V |-dimensional real-valued vector space. Terms are axes of the space. Documents are points or vectors in this space. Very high-dimensional: tens of millions of dimensions when you apply this to web search engines Each vector is very sparse - most entries are zero.
dJ = (t 1J , . . . , tn j ) d j stands for a certain document j, and tiJ stands for a term that occurs in j at position i of the vector.
Contents First Last Prev Next
10.1.
Assumption
Documents that are close together in the vector space talk about the same thing.
Contents
First
Last
Prev
Next
11.
Key idea 1: Do the same for queries: represent them as vectors in the high-dimensional space, qk = (t 1k , . . . , tnk ) Key idea 2: Rank documents according to their proximity to the query
Contents
First
Last
Prev
Next
11.1.
Rank documents according to angle with query Thought experiment: take a document d and append it to itself. Call this document d . d is twice as long as d . Semantically d and d have the same content. The angle between the two documents is 0, corresponding to maximal similarity . . . . . . even though the Euclidean distance (which is based on their length) between the two documents can be quite large.
From angles to cosines The following two notions are equivalent.
Rank documents according to the angle between query and document in decreasing order Rank documents according to cosine(query,document) in increasing order
Contents
First
Last
Prev
Next
11.2.
D1 D2
mario, 3, 1,
apple, 3, 1,
home, 3, 1,
fabio, 0, 0,
paola, 0, 0,
school, 0, 0,
lecture 0) 0)
D1 D2 = i D1i D2i = (3 1) + (3 1) + (3 1) + (3 1) = 12
Contents
First
Last
Prev
Next
11.3.
Cosine Similarity
cos(x, y) = xy = |x||y| n i=1 xi yi
2 n i=1 xi 2 n i=1 yi
xi is the tf-idf weight of term i in x. yi is the tf-idf weight of term i in y ||x|| and ||y|| are the lengths of x and y. This is the cosine similarity of x and y or, equivalently, the cosine of the angle between x and y. In short, the similarity between two vectors is computed by the cosine of the angle between them.
Contents
First
Last
Prev
Next
11.4.
Example: Length
D1: Anna and Mario eat the apple at home. Anna and Mario have not an apple at home. Anna and Mario go out of home to buy an apple. D2: Anna and Mario eat the apple at home. D3: Fabio and Paola went to school and took a lecture. D1 D2 D3 anna, = (3, = (1, = (0, mario, 3, 1, 0, apple, 3, 1, 0, home, 3, 1, 0, fabio, 0, 0, 1, paola, 0, 0, 1, school, 0, 0, 1, lecture 0) 0) 1)
2 n i=1 xi
= 36 = 4 = 4
=6 =2 =2
Contents
First
Last
Prev
Next
11.5.
D1 D2 D3
mario, 3, 1, 0,
apple, 3, 1, 0,
fabio, 0, 0, 1,
paola, 0, 0, 1,
school, 0, 0, 1,
lecture 0) 0) 1)
cos(D1, D2) =
(3 1) + (3 1) + (3 1) + (3 1) 12 = =1 62 12 D1 D3 0 0 = =0 6 2 12
||D1|| ||D3||
Contents
First
Last
Prev
Next
11.6.
Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the nal weight of the term in the query or document, nlized: document weights after cosine normalization, product: the product of nal query weight and nal document weight Final similarity score between query and document: i wqi wdi = 0 + 0 + 1.04 + 2.04 = 3.08
Contents
First
Last
Prev
Next
12.
Represent the query as a weighted tf-idf vector Represent each document as a weighted tf-idf vector Compute the cosine similarity between the query vector and each document vector Rank documents with respect to the query Return the top K (e.g., K = 10) to the user
Contents
First
Last
Prev
Next
13.
The processing of the text can be much more elaborate: Pos Tagging Parsing Named Entity Recognition From unstructured to structured data Ontology Learning ... Some of these tools are useful to build Language Resources encoding the knowledge need to better capture the user information need.
Contents
First
Last
Prev
Next
14.
Your presentation
Contents
First
Last
Prev
Next