Midterm Sol
Midterm Sol
Midterm Sol
17 December 2007 SOLUTIONS Non-electronic documents and calculators are authorized. Name : Semester :
Exercise 1 : Denitions
Dene the following terms : tokenization segmentation of a document in order to produce a list of items (deals with punctuation, acronyms, dates, etc.) permuterm index index mapping all permutations of characters (including delimiters) of a given word to this word (used for wildcard queries) champion list pre-computed list of the r most relevant documents with respect to a given term
What are the size (mega or giga bytes) of the resulting dictionary and posting lists ? Dictionary : 350000 47 = 16450000 bytes = 16.45 MB Postings : 70000000 3 = 210000000 bytes = 210 MB If you compress your dictionary using the dictionary-as-a-string method, what is the new size of the dictionary ? 350000 (4 + 3 + 3 + 2 8) = 9100000 bytes = 9.1 MB (4 bytes for the term frequency, 3 bytes for the pointer to the posting list, 3 bytes for the pointer into the string, and 8 characters per word on average, each encoded with 2 bytes)
realised realised realis What is the Porter measure of the following words (give your computation) ? crepuscular cr C ep VC usc VC ul VC ar VC V m=4
10001001 9
11111111 127
What would be the encoding of the same posting list using a -code ? 9 1110,001 130 11111110,0000010 127 11111110,111111
Compute the vector representations of d1 , d2 and d3 using the tf idft,d weighting and the euclidian normalisation. Estimation of the collection size : either you dene your own (symbolic or not) collection size, or you use a heuristic such as the 3 keywords appear only together in d1 , d2 and d3 . With the latter, the collection size is N = (123 + 240 + 85 3) = 445.
v (d1 ) =
445 12log10 ( 123 ) D 445 15log10 ( 240 ) D 445 52log10 ( 85 ) D 445 35log10 ( 123 ) D 445 24log10 ( 240 ) D ) 13log10 ( 445 85 D
445 2 445 2 2 (12 log( 445 123 )) + (15 log ( 240 )) + (52 log ( 85 ))
v (d2 ) =
445 2 445 2 2 (35 log( 123 )) + (24 log( 445 240 )) + (13 log ( 85 ))
v (d3 ) =
445 2 445 2 2 (53 log( 123 )) + (48 log( 445 240 )) + (12 log ( 85 ))
Give the ranking retrieved by the system for the query movie trailer.
We need the vector representation for the query q movie trailer. We can use the following : 0 v (q ) = 1 1 Then we can compute the score of each document and rank them by decreasing order of score : score(q, d1 ) = v (q ).v (d1 ) score(q, d2 ) = v (q ).v (d2 ) score(q, d3 ) = v (q ).v (d3 )
Hence, we obtain : (1 + log10 (12)) log10 ( 445 123 ) 445 ) v (d1 ) = (1 + log10 (15)) log10 ( 240 445 (1 + log10 (52)) log10 ( 85 )
A workaround for the insertion of key-value pairs whose hash-value is identical consists of using a primary mapping and a secondary mapping. The latter contains the redundant pairs (i.e. the pairs with identical hash-values), that are themselves linked to the main pair in the primary index.
In this context, the insertion algorithm checks the slot for the pair to be inserted in the primary hashtable. If it is unset, the pair is stored, otherwise the pair is stored at the end of the linked list of pairs in the secondary hashtable. proc insert(key k, value v, hashtable H, hashfunction h) int i = h(k) if (H[i].isUnset()) then H[i].key = k H[i].value = v H[i].next = -1 else int j = H[i].next int m = H.nextFree() int n = i while(j != -1) // we traverse the linked list n = j j = H[j].next endwhile H[m].key = k // we store the duplicate hash-value H[m].value = v // in the first free slot H[m].next = -1 H[n].next = m // we link the previous end of the linked list endif endif