
CHAPTER THREE

Term weighting and similarity measures
Terms
• Terms are usually stems. Terms can also be phrases, such as
  “Computer Science”, “World Wide Web”, etc.
• Documents and queries are represented as vectors or
  “bags of words” (BOW).
  – Each vector holds a place for every term in the collection.
  – Position 1 corresponds to term 1, position 2 to term 2, ...,
    position n to term n.

  Di = (wdi1, wdi2, ..., wdin)
  Q  = (wq1, wq2, ..., wqn)        where w = 0 if a term is absent

• Documents are represented by binary or non-binary weighted
  vectors of terms.
Document Collection
• A collection of n documents can be represented in the
  vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a
  term in the document; zero means the term has no
  significance in the document or it simply doesn’t exist in
  the document.

        T1    T2    ...   Tt
  D1    w11   w21   ...   wt1
  D2    w12   w22   ...   wt2
  :     :     :           :
  Dn    w1n   w2n   ...   wtn
Binary Weights
• Only the presence (1) or absence (0) of a term is included
  in the vector.
• The binary formula gives every word that appears in a
  document equal relevance.
• It can be useful when frequency is not important.
• Binary Weights Formula:

  wij = 1 if freqij > 0
  wij = 0 if freqij = 0

  docs   t1   t2   t3
  D1     1    0    1
  D2     1    0    0
  D3     0    1    1
  D4     1    0    0
  D5     1    1    1
  D6     1    1    0
  D7     0    1    0
  D8     0    1    0
  D9     0    0    1
  D10    0    1    1
  D11    1    0    1
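• A minimal Python sketch of binary weighting (illustrative only; the vocabulary, the function name and the sample document are our own, not from the slides):

    vocabulary = ["t1", "t2", "t3"]

    def binary_weights(doc_terms, vocabulary):
        """Return a 0/1 vector: 1 if the term appears in the document, else 0."""
        present = set(doc_terms)
        return [1 if term in present else 0 for term in vocabulary]

    # D1 contains t1 and t3 but not t2, matching the first row of the table above.
    print(binary_weights(["t1", "t3", "t1"], vocabulary))   # [1, 0, 1]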

Why use term weight-

ing?
Binary weights are too limiting.
– terms are either present or absent.
– Not allow to order documents according to their level of
relevance for a given query
• Non-binary weights allow to model partial matching .
– Partial matching allows retrieval of docs that approxi-
mate the query.
• Term-weighting improves quality of answer set.
– Term weighting enables ranking of retrieved documents;
such that best matching documents are ordered at the
top as they are more relevant than others.
5
Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term
  occurs in a document.
  fij = frequency of term i in document j
• The more times a term t occurs in document d, the more
  likely it is that t is relevant to the document, i.e. more
  indicative of the topic.
  – If used alone, it favors common words and long documents.
  – It gives too much credit to words that appear more frequently.
• We may want to normalize term frequency (tf) by the
  maximum term frequency observed in the document:
  tfij = fij / max{fij}

  docs   t1   t2   t3
  D1     2    0    3
  D2     1    0    0
  D3     0    4    7
  D4     3    0    0
  D5     1    6    3
  D6     3    5    0
  D7     0    8    0
  D8     0    10   0
  D9     0    0    1
  D10    0    3    5
  D11    4    0    1
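• A minimal sketch of max-normalized term frequency in Python (the tokenised sample document and function name are hypothetical):

    from collections import Counter

    def normalized_tf(doc_terms):
        """Return {term: f / max_f}, where f is the raw count within this document."""
        counts = Counter(doc_terms)
        max_f = max(counts.values())
        return {term: f / max_f for term, f in counts.items()}

    # Hypothetical document: t3 occurs 3 times (the maximum), t1 occurs twice.
    print(normalized_tf(["t1", "t3", "t3", "t3", "t1"]))   # {'t1': 0.666..., 't3': 1.0}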
Document Normalization
• Long documents have an unfair advantage:
  – They use a lot of terms
    • So they get more matches than short documents
  – And they use the same words repeatedly
    • So they have much higher term frequencies
• Normalization seeks to remove these effects:
  – Related somehow to maximum term frequency.
  – But also sensitive to the number of terms.
• If we don’t normalize, short documents may not be
  recognized as relevant.
Problems with term frequency
• We need a mechanism for attenuating the effect of terms that
  occur too often in the collection to be meaningful for
  relevance/meaning determination.
• Scale down the term weight of terms with high collection
  frequency.
  – Reduce the tf weight of a term by a factor that grows with the
    collection frequency.
• More common for this purpose is document frequency:
  – how many documents in the collection contain the term.
• Collection frequency and document frequency can behave
  differently: two terms may occur the same number of times in
  the collection yet be spread over very different numbers of
  documents.
Document Frequency
• Document frequency (DF) is defined as the number of documents
  in the collection that contain a particular term.
  – Count the frequency considering the whole collection of
    documents.
  – The less frequently a term appears in the whole collection, the
    more discriminating it is.
Inverse Document Frequency (IDF)
• IDF measures the rarity of a term in the collection; it is a
  measure of the general importance of the term.
  – It inverts the document frequency.
• It diminishes the weight of terms that occur very frequently in
  the collection and increases the weight of terms that occur
  rarely.
  – Gives full weight to terms that occur in one document only.
  – Gives lowest weight to terms that occur in all documents.
  – Terms that appear in many different documents are less
    indicative of the overall topic.

  idfi = inverse document frequency of term i
       = log2(N / dfi)        (N: total number of documents)
Inverse Document Frequency
• E.g.: given a collection of 1000 documents and the document
  frequency of each word, compute the IDF for each word.

  Word    N      DF     IDF
  the     1000   1000   0
  some    1000   100    3.322
  car     1000   10     6.644
  merge   1000   1      9.966

• IDF provides high values for rare words and low values for
  common words.
• IDF is an indication of a term’s discrimination power.
  – The log is used to dampen the effect relative to tf.
  – Note the difference between document frequency and
    collection (corpus) frequency.
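• The table above can be reproduced with a short Python sketch (the function name is ours):

    import math

    def idf(N, df):
        """Inverse document frequency: log2(N / df)."""
        return math.log2(N / df)

    for word, df in [("the", 1000), ("some", 100), ("car", 10), ("merge", 1)]:
        print(word, round(idf(1000, df), 3))
    # the 0.0  |  some 3.322  |  car 6.644  |  merge 9.966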
TF*IDF Weighting
• The most widely used term-weighting scheme is tf*idf:

  wij = tfij × idfi = tfij × log2(N / dfi)

• A term occurring frequently in the document but rarely in the
  rest of the collection is given high weight.
  – The tf-idf value for a term will always be greater than or equal
    to zero.
• Experimentally, tf*idf has been found to work well.
  – It is often used in the vector space model together with cosine
    similarity to determine the similarity between two documents.
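• A minimal sketch of the weighting formula above, using the max-normalized tf from the earlier slide (the function name and example numbers are illustrative assumptions):

    import math

    def tf_idf(f, max_f, N, df):
        """w = (f / max_f) * log2(N / df) -- max-normalized tf times idf."""
        return (f / max_f) * math.log2(N / df)

    # A term with raw frequency 2 (document maximum 4) appearing in 1,000 of
    # 1,000,000 documents: tf = 0.5, idf = log2(1000) ≈ 9.966, so w ≈ 4.983.
    print(round(tf_idf(2, 4, 1_000_000, 1000), 3))   # 4.983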
TF*IDF weighting
• When does TF*IDF register a high weight? When a term t occurs
  many times within a small number of documents.
  – The highest tf*idf for a term shows it has a high term frequency
    (in the given document) and a low document frequency (in the
    whole collection of documents);
  – the weights hence tend to filter out common terms,
  – thus lending high discriminating power to those documents.
• A lower TF*IDF is registered when the term occurs fewer times
  in a document, or occurs in many documents,
  – thus offering a less pronounced relevance signal.
• The lowest TF*IDF is registered when the term occurs in
  virtually all documents.
Computing TF-IDF: An Example
• Assume the collection contains 10,000 documents and statistical
  analysis shows that the document frequencies (DF) of three terms
  are: A(50), B(1300), C(250). The term frequencies (TF) of these
  terms in a given document are: A(3), B(2), C(1). Compute TF*IDF
  for each term.

  A: tf = 3/3 = 1.00;  idf = log2(10000/50)   = 7.644;  tf*idf = 7.644
  B: tf = 2/3 = 0.67;  idf = log2(10000/1300) = 2.943;  tf*idf = 1.962
  C: tf = 1/3 = 0.33;  idf = log2(10000/250)  = 5.322;  tf*idf = 1.774

• The query vector is typically treated as a document and is also
  tf-idf weighted.
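• A quick check of the three weights above (max-normalized tf with base-2 idf, as used on this slide):

    import math

    N = 10_000
    terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}   # term: (raw tf, df)
    max_f = max(f for f, _ in terms.values())

    for term, (f, df) in terms.items():
        print(term, round((f / max_f) * math.log2(N / df), 3))
    # A 7.644  |  B 1.962  |  C 1.774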
More Example
• Consider a document containing 100 words wherein the word
  cow appears 3 times. Now assume we have 10 million documents
  and cow appears in one thousand of these.
  – The term frequency (TF) for cow:
    3/100 = 0.03
  – The inverse document frequency is:
    log2(10,000,000 / 1,000) = log2(10,000) ≈ 13.288
  – The TF*IDF score is the product of these frequencies:
    0.03 × 13.288 ≈ 0.399
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in a corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term.

  Word       C   TW   TD   DF   TF   IDF   TFIDF
  airplane   5   46   3    1
  blue       1   46   3    1
  chair      7   46   3    3
  computer   3   46   3    1
  forest     2   46   3    1
  justice    7   46   3    3
  love       2   46   3    1
  might      2   46   3    1
  perl       5   46   3    2
  rose       6   46   3    3
  shoe       4   46   3    1
  thesis     2   46   3    2
Concluding remarks
• Suppose that from a set of English documents we wish to determine
  which ones are the most relevant to the query "the brown cow."
• A simple way to start is by eliminating documents that do not
  contain all three words "the," "brown," and "cow," but this still
  leaves many documents.
• To further distinguish them, we might count the number of times
  each term occurs in each document and sum them all together;
  – the number of times a term occurs in a document is called its TF.
    However, because the term "the" is so common, this will tend to
    incorrectly emphasize documents which happen to use the word "the"
    more, without giving enough weight to the more meaningful terms
    "brown" and "cow".
  – Also, the term "the" is not a good keyword for distinguishing relevant
    from non-relevant documents, while terms like "brown" and "cow" that
    occur rarely are good keywords for distinguishing relevant documents
    from the non-relevant ones.
Concluding remarks
• Hence IDF is incorporated, which diminishes the weight of terms
  that occur very frequently in the collection and increases the
  weight of terms that occur rarely.
  – This leads to the use of TF*IDF as a better weighting technique.
• On top of that, we apply similarity measures to calculate the
  distance between document i and query j.
• There are a number of similarity measures; the most common
  ones are:
  • Euclidean distance, inner (dot) product, cosine similarity, Dice
    similarity, Jaccard similarity, etc.
Similarity Measure
• We now have vectors for all documents in the collection and a
  vector for the query; how do we compute similarity?
• A similarity measure is a function that computes the degree of
  similarity or distance between a document vector and a query
  vector.
• Using a similarity measure between the query and each document:
  – It is possible to rank the retrieved documents in the order of
    presumed relevance.
  – It is possible to enforce a certain threshold so that the size of
    the retrieved set can be controlled.

[Figure: query vector Q and document vectors D1, D2 in the term space spanned by t1, t2, t3]
Intuition

[Figure: document vectors d1–d5 in the term space t1, t2, t3; θ and φ are angles between vectors]

• Postulate: Documents that are “close together” in the vector
  space talk about the same things.
Similarity Measure
• Desiderata for proximity:
  1. If d1 is near d2, then d2 is near d1.
  2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
  3. No document is closer to d than d itself.
  – Sometimes it is a good idea to determine the maximum possible
    similarity as the “distance” between a document d and itself.
• A similarity measure attempts to compute the distance between
  a document vector wj and a query vector wq.
  – The assumption here is that documents whose vectors are close
    to the query vector are more relevant to the query than
    documents whose vectors are far away from the query vector.
Similarity Measure: Techniques
• Euclidean distance
  – It is the most common distance measure. Euclidean distance
    examines the root of the squared differences between the
    coordinates of a pair of document and query vectors.
• Dot product
  – The dot product is also known as the scalar product or inner
    product.
  – The dot product of the query and document vectors is the sum
    of the products of their corresponding term weights.
• Cosine similarity (or normalized inner product)
  – It projects document and query vectors into a term space and
    calculates the cosine of the angle between them.
Euclidean distance
• The distance between the vectors for document dj and query q
  can be computed as:

  sim(dj, q) = |dj − q| = sqrt( Σ i=1..n (wij − wiq)^2 )

  where wij is the weight of term i in document j and wiq is the
  weight of term i in the query.
• Example: Determine the Euclidean distance between the
  document vector (0, 3, 2, 1, 10) and the query vector
  (2, 7, 1, 0, 0). 0 means the corresponding term is not found in
  the document or query.

  sqrt( (0−2)^2 + (3−7)^2 + (2−1)^2 + (1−0)^2 + (10−0)^2 ) = 11.05
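• The example above can be verified with a short Python sketch (the function name is ours):

    import math

    def euclidean_distance(d, q):
        """Square root of the sum of squared differences between corresponding weights."""
        return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

    print(round(euclidean_distance([0, 3, 2, 1, 10], [2, 7, 1, 0, 0]), 2))   # 11.05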
Inner Product
• Similarity between the vectors for document dj and query q can
  be computed as the vector inner product:

  sim(dj, q) = dj • q = Σ i=1..n (wij · wiq)

  where wij is the weight of term i in document j and wiq is the
  weight of term i in query q.
• For binary vectors, the inner product is the number of matched
  query terms in the document (the size of the intersection).
• For weighted term vectors, it is the sum of the products of the
  weights of the matched terms.
Properties of Inner Product
• Favors long documents with a large number of unique terms.
  – Again, the issue of normalization.
• Measures how many terms matched, but not how many terms
  are not matched.
Inner Product -- Examples
• Binary weights:
  – Size of vector = size of vocabulary = 7

       Retrieval  Database  Term  Computer  Text  Manage  Data
  D    1          1         1     0         1     1       0
  Q    1          0         1     0         0     1       1

  sim(D, Q) = 3

• Term weighted:

       Retrieval  Database  Architecture
  D1   2          3         5
  D2   3          7         1
  Q    1          0         2
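• A small sketch of the inner product for both examples above (vector order follows the tables; the results for the weighted example, 12 and 5, are our own computed values, not given on the slide):

    def inner_product(d, q):
        """Sum of products of corresponding term weights (dot product)."""
        return sum(wd * wq for wd, wq in zip(d, q))

    # Binary example: the query terms Retrieval, Term and Manage appear in D.
    D = [1, 1, 1, 0, 1, 1, 0]
    Q = [1, 0, 1, 0, 0, 1, 1]
    print(inner_product(D, Q))                     # 3

    # Term-weighted example over (Retrieval, Database, Architecture).
    print(inner_product([2, 3, 5], [1, 0, 2]))     # D1 · Q = 12
    print(inner_product([3, 7, 1], [1, 0, 2]))     # D2 · Q = 5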
Inner Product: Example 1

[Figure: documents d1–d7 positioned over the index terms k1, k2, k3]

       k1   k2   k3   q • dj
  d1   1    0    1    2
  d2   1    0    0    1
  d3   0    1    1    2
  d4   1    0    0    1
  d5   1    1    1    3
  d6   1    1    0    2
  d7   0    1    0    1

  q    1    1    1
Inner Product: Exercise

[Figure: documents d1–d7 positioned over the index terms k1, k2, k3]

       k1   k2   k3   q • dj
  d1   1    0    1    ?
  d2   1    0    0    ?
  d3   0    1    1    ?
  d4   1    0    0    ?
  d5   1    1    1    ?
  d6   1    1    0    ?
  d7   0    1    0    ?

  q    1    2    3
Cosine similarity
• Measures the similarity between dj and q captured by the cosine
  of the angle between them:

  sim(dj, q) = (dj • q) / (|dj| · |q|)
             = Σ i=1..n (wi,j · wi,q) / ( sqrt(Σ i=1..n wi,j^2) · sqrt(Σ i=1..n wi,q^2) )

• Or, between two documents dj and dk:

  sim(dj, dk) = (dj • dk) / (|dj| · |dk|)
              = Σ i=1..n (wi,j · wi,k) / ( sqrt(Σ i=1..n wi,j^2) · sqrt(Σ i=1..n wi,k^2) )

• The denominator involves the lengths of the vectors, so the
  cosine measure is also known as the normalized inner product.

  Length |dj| = sqrt( Σ i=1..n wi,j^2 )
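• A minimal Python sketch of the normalized inner product (the function name is ours; the example numbers are taken from the next slide):

    import math

    def cosine_similarity(d, q):
        """(d · q) / (|d| |q|)."""
        dot = sum(wd * wq for wd, wq in zip(d, q))
        norm_d = math.sqrt(sum(w * w for w in d))
        norm_q = math.sqrt(sum(w * w for w in q))
        return dot / (norm_d * norm_q)

    print(round(cosine_similarity([0.2, 0.7], [0.4, 0.8]), 2))   # 0.98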
Example: Computing Cosine Similarity
• Say we have a query vector Q = (0.4, 0.8) and a document
  D1 = (0.2, 0.7). Compute their similarity using cosine.

  sim(Q, D1) = ((0.4 × 0.2) + (0.8 × 0.7))
               / sqrt( [(0.4)^2 + (0.8)^2] × [(0.2)^2 + (0.7)^2] )
             = 0.64 / sqrt(0.424) ≈ 0.98
Example: Computing Cosine Similarity
• Say we have two documents in our corpus, D1 = (0.8, 0.3) and
  D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8), determine
  which document is the most relevant to the query.

  cos θ1 = 0.74   (angle between Q and D1)
  cos θ2 = 0.98   (angle between Q and D2)

• Since cos θ2 > cos θ1, D2 is more relevant to the query than D1.

[Figure: Q, D1 and D2 plotted on axes from 0.2 to 1.0, with angles θ1 and θ2 between Q and each document]
Example
• Given three documents D1, D2 and D3 with the corresponding
  TF-IDF weights below, which documents are most similar under
  the three measures?

  Terms      D1     D2     D3
  affection  0.996  0.993  0.847
  jealous    0.087  0.120  0.466
  gossip     0.017  0.000  0.254
Cosine Similarity vs. Inner Product
• Cosine similarity measures the cosine of the angle between two
  vectors: the inner product normalized by the vector lengths.

  CosSim(dj, q) = (dj • q) / (|dj| · |q|)
                = Σ i=1..t (wij · wiq) / ( sqrt(Σ i=1..t wij^2) · sqrt(Σ i=1..t wiq^2) )

  InnerProduct(dj, q) = dj • q = Σ i=1..t (wij · wiq)

  D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
  D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2  / sqrt((9+49+1)(0+0+4)) = 0.13
  Q  = 0T1 + 0T2 + 2T3

• D1 is about 6 times better than D2 using cosine similarity, but only
  5 times better using the inner product.
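• The comparison above can be reproduced with a short sketch (function names are ours):

    import math

    def inner_product(d, q):
        return sum(wd * wq for wd, wq in zip(d, q))

    def cosine_similarity(d, q):
        return inner_product(d, q) / math.sqrt(inner_product(d, d) * inner_product(q, q))

    D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
    print(inner_product(D1, Q), inner_product(D2, Q))                               # 10 2
    print(round(cosine_similarity(D1, Q), 2), round(cosine_similarity(D2, Q), 2))   # 0.81 0.13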
Exercises
• A database collection consists of 1 million documents, of which
  200,000 contain the term holiday while 250,000 contain the term
  season. A document repeats holiday 7 times and season 5 times.
  It is known that holiday is repeated more than any other term in
  the document. Calculate the weight of both terms in this
  document using the following term weighting methods:
  (i) normalized and unnormalized TF;
  (ii) TF*IDF based on normalized and unnormalized TF.
