3 Term Weighting
Terms
• Terms are usually stems. Terms can also be phrases, such
as “Computer Science”, “World Wide Web”, etc.
• Documents and queries are represented as vectors or
“bags of words” (BOW).
– Each vector holds a place for every term in the collection.
– Position 1 corresponds to term 1, position 2 to term 2, ...,
position n to term n.

$D_i = (w_{d_i 1}, w_{d_i 2}, \ldots, w_{d_i n})$
$Q = (w_{q 1}, w_{q 2}, \ldots, w_{q n})$, where $w = 0$ if a term is absent

• Documents are represented by binary-weighted or
non-binary-weighted vectors of terms.
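To make the representation concrete, here is a minimal Python sketch that builds BOW vectors over a fixed term list; the vocabulary and token lists are illustrative, not from the slides:

```python
# Minimal bag-of-words sketch: each vector holds one slot per term
# in the collection vocabulary (illustrative data, not from the slides).
vocabulary = ["computer", "science", "web", "retrieval"]

def bow_vector(tokens, vocab):
    """Return a term-count vector aligned with vocab; 0 if a term is absent."""
    return [tokens.count(term) for term in vocab]

doc = ["computer", "science", "computer", "web"]
query = ["web", "retrieval"]

print(bow_vector(doc, vocabulary))    # [2, 1, 1, 0]
print(bow_vector(query, vocabulary))  # [0, 0, 1, 1]
```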
Document Collection
• A collection of n documents can be represented in the
vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term has no signif-
icance in the document or it simply doesn’t exist in the
document.
     T1   T2   …   Tt
D1   w11  w21  …  wt1
D2   w12  w22  …  wt2
:    :    :       :
Dn   w1n  w2n  …  wtn
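A sketch of building such a term-document matrix, continuing the illustrative setup above; entries here are raw term counts, though any weighting scheme could fill them:

```python
# Term-document matrix sketch: one row per document, one column per term.
# Entries are raw term counts here; any weighting scheme could be used.
vocabulary = ["computer", "science", "web", "retrieval"]
docs = [
    ["computer", "science", "computer"],
    ["web", "retrieval", "web"],
]

matrix = [[d.count(t) for t in vocabulary] for d in docs]
for row in matrix:
    print(row)
# [2, 1, 0, 0]
# [0, 0, 2, 1]
```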
Binary Weights
• Only the presence (1) or absence (0) of a term is
included in the vector.
• The binary formula gives every word that appears in a
document equal relevance.
• It can be useful when frequency is not important.

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1

• Binary Weights Formula (a sketch follows below):

$w_{ij} = \begin{cases} 1 & \text{if } freq_{ij} > 0 \\ 0 & \text{if } freq_{ij} = 0 \end{cases}$
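A minimal sketch of the binary weights formula, mapping raw frequencies to presence/absence (the input counts 2, 0, 3 are D1's row from the TF table later in this section):

```python
# Binary weights: 1 if the term occurs in the document, 0 otherwise.
def binary_weights(freqs):
    """Map a vector of raw term frequencies to 0/1 presence weights."""
    return [1 if f > 0 else 0 for f in freqs]

print(binary_weights([2, 0, 3]))  # [1, 0, 1]  (cf. D1 in the table)
```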
Why use term weighting?
• Binary weights are too limiting.
– Terms are either present or absent.
– They do not allow ordering documents by their level of
relevance for a given query.
• Non-binary weights allow modeling of partial matching.
– Partial matching allows retrieval of docs that approximate
the query.
• Term weighting improves the quality of the answer set.
– Term weighting enables ranking of retrieved documents,
such that the best-matching documents are ordered at the
top as they are more relevant than others.
Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term
occurs in a document.
fij = frequency of term i in document j
• The more times a term t occurs in document d, the more
likely it is that t is relevant to the document, i.e. more
indicative of the topic.
– If used alone, it favors common words and long documents.
– It gives too much credit to words that appear more frequently.
• May want to normalize term frequency (tf) across the
entire corpus (a sketch follows below):

$tf_{ij} = f_{ij} / \max\{f_{ij}\}$

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1
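A minimal sketch of max-normalized tf. The formula leaves the scope of the max implicit, so this sketch takes it per document (a common reading), using D1's counts from the table as illustration:

```python
# Max-normalized TF sketch: divide each raw count by the largest
# count in the same document (one common reading of the formula).
def max_norm_tf(freqs):
    m = max(freqs)
    return [f / m for f in freqs] if m > 0 else freqs

print(max_norm_tf([2, 0, 3]))  # ≈ [0.667, 0.0, 1.0]  (D1's row)
```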
Document Normalization
• Long documents have an unfair advantage:
– They use a lot of terms
• So they get more matches than short documents
– And they use the same words repeatedly
• So they have much higher term frequencies
• Normalization seeks to remove these effects:
– Related somehow to maximum term frequency.
– But also sensitive to the number of terms.
• If we don’t normalize, short documents may not be
recognized as relevant.
Problems with term frequency
• Need a mechanism for attenuating the effect of terms
that occur too often in the collection to be meaningful for
relevance/meaning determination
• Scale down the term weight of terms with high collection
frequency
– Reduce the tf weight of a term by a factor that grows
with the collection frequency
• More commonly used for this purpose is document frequency:
how many documents in the collection contain the term
(see the sketch below).
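A sketch of document frequency, together with the inverse document frequency (idf) weight usually built from it. The log formulation idf = log(N/df) is the textbook convention, assumed here rather than stated on this slide; the document list is illustrative:

```python
import math

# Document frequency: number of documents containing the term.
# idf = log(N / df) is the standard damping factor (assumed here,
# not given in the text) that scales down very common terms.
docs = [["web", "retrieval"], ["web", "web"], ["science"]]  # illustrative

def df(term, docs):
    return sum(1 for d in docs if term in d)

def idf(term, docs):
    return math.log(len(docs) / df(term, docs))

print(df("web", docs), round(idf("web", docs), 3))          # 2 0.405
print(df("science", docs), round(idf("science", docs), 3))  # 1 1.099
```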
Intuition

[Figure: documents d1–d5 shown as vectors in a three-term space
(t1, t2, t3); the angles θ and φ between vectors illustrate how
closeness in direction captures similarity.]
Properties of Inner Product
• The inner product is unbounded; it favors long documents
with many unique terms.
• With binary weights, it simply counts the number of terms
shared by the document and the query.
Inner Product -- Examples
• Binary weights (a sketch of both examples follows below):
– Size of vector = size of vocabulary = 7

     Retrieval  Database  Term  Computer  Text  Manage  Data
D        1         1       1       0       1      1      0
Q        1         0       1       0       0      1      1

sim(D, Q) = 3 (three terms in common)

• Term weighted:

     Retrieval  Database  Architecture
D1       2         3          5
D2       3         7          1
Q        1         0          2

sim(D1, Q) = 2·1 + 3·0 + 5·2 = 12
sim(D2, Q) = 3·1 + 7·0 + 1·2 = 5
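A minimal sketch that reproduces both examples above (the weight vectors are the slide's own numbers):

```python
# Inner-product similarity: sum of elementwise products of the
# document and query weight vectors.
def inner_product(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example (7-term vocabulary)
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))  # 3

# Term-weighted example (Retrieval, Database, Architecture)
D1, D2, Qw = [2, 3, 5], [3, 7, 1], [1, 0, 2]
print(inner_product(D1, Qw))  # 12
print(inner_product(D2, Qw))  # 5
```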
Inner Product: Example 1

[Figure: documents d1–d7 arranged by the index terms k1, k2, k3
they contain.]

     k1  k2  k3   q · dj
d1    1   0   1     2
d2    1   0   0     1
d3    0   1   1     2
d4    1   0   0     1
d5    1   1   1     3
d6    1   1   0     2
d7    0   1   0     1
q     1   1   1
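A sketch that scores all seven documents of Example 1 against q at once and ranks them by inner product:

```python
# Score every document against the query and rank by inner product.
def inner_product(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

docs = {
    "d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1], "d4": [1, 0, 0],
    "d5": [1, 1, 1], "d6": [1, 1, 0], "d7": [0, 1, 0],
}
q = [1, 1, 1]
scores = {name: inner_product(vec, q) for name, vec in docs.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, s)  # d5: 3, then d1/d3/d6 with 2, then d2/d4/d7 with 1
```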
Inner Product: Exercise

[Figure: documents d1–d7 arranged by the index terms k1, k2, k3
they contain.]

     k1  k2  k3   q · dj
d1    1   0   1     ?
d2    1   0   0     ?
d3    0   1   1     ?
d4    1   0   0     ?
d5    1   1   1     ?
d6    1   1   0     ?
d7    0   1   0     ?
q     1   2   3
Cosine similarity
• Measures the similarity between two vectors (e.g., a document
dj and a query q) as the cosine of the angle θ between them.
$$\text{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\; \sqrt{\sum_{i=1}^{n} w_{i,q}^2}}$$

• Or:

$$\text{sim}(d_j, d_k) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\; \sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$$
[Figure: query Q and documents D1, D2 plotted in term space;
cos θ1 = 0.74 and cos θ2 = 0.98 for the angles between the
query and the two documents.]

Terms      D1     D2     D3
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254
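A minimal sketch of cosine similarity. Since the query vector behind the plotted example is not recoverable from the slide, the sketch instead compares D1, D2, and D3 from the table above pairwise:

```python
import math

# Cosine similarity: inner product divided by the product of the
# vector lengths (Euclidean norms).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Term weights from the table (terms: affection, jealous, gossip)
D1 = [0.996, 0.087, 0.017]
D2 = [0.993, 0.120, 0.000]
D3 = [0.847, 0.466, 0.254]

print(round(cosine(D1, D2), 3))  # ≈ 0.999: D1 and D2 are nearly parallel
print(round(cosine(D1, D3), 3))  # ≈ 0.889: D3 points in a different direction
```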
Cosine Similarity vs. Inner Product
• Cosine similarity measures the cosine of the angle between
two vectors.
• It is the inner product normalized by the vector lengths.
$$\text{CosSim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} (w_{ij}\, w_{iq})}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\; \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$$

$$\text{InnerProduct}(d_j, q) = \vec{d_j} \cdot \vec{q}$$
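A short sketch contrasting the two measures: doubling a document's weights doubles its inner-product score but leaves its cosine score unchanged (the vectors are illustrative):

```python
import math

def inner_product(d, q):
    return sum(x * y for x, y in zip(d, q))

def cosine(d, q):
    return inner_product(d, q) / (
        math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q)))

q = [1, 0, 2]
short_doc = [2, 3, 5]
long_doc = [4, 6, 10]  # same direction as short_doc, twice the length

print(inner_product(short_doc, q), inner_product(long_doc, q))  # 12 24
print(round(cosine(short_doc, q), 3), round(cosine(long_doc, q), 3))  # equal
```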