0% found this document useful (0 votes)
32 views

Szymon Grabowski Jakub Swacha: Sgrabow@kis.p.lodz - PL

This document proposes a new method for compactly representing large collections of URLs to allow for fast access. The method combines front coding, phrase replacement on residuals, and Deflate compression. It achieves a compressed representation of about 5-9 bytes per URL with average extraction times of 150-600 microseconds. The technique divides the URLs into blocks that are compressed individually and stores common phrases separately for improved compression. Evaluation on real-world URL datasets shows this approach effectively balances compact representation with fast access to URLs.

Uploaded by

Jakub Swacha
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
32 views

Szymon Grabowski Jakub Swacha: Sgrabow@kis.p.lodz - PL

This document proposes a new method for compactly representing large collections of URLs to allow for fast access. The method combines front coding, phrase replacement on residuals, and Deflate compression. It achieves a compressed representation of about 5-9 bytes per URL with average extraction times of 150-600 microseconds. The technique divides the URLs into blocks that are compressed individually and stores common phrases separately for improved compression. Evaluation on real-world URL datasets shows this approach effectively balances compact representation with fast access to URLs.

Uploaded by

Jakub Swacha
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 9

http://compact.representation.of.URL.collections.with.fast.

access/ Szymon Grabowski1, Jakub Swacha2


1 Technical University of d, Computer Engineering Dept., al. Politechniki 11, 90-924 d. E-mail: sgrabow@kis.p.lodz.pl 2 University of Szczecin, Institute of Information Technology in Management, Mickiewicza 64, 71-101 Szczecin. E-mail: jakubs@uoo.univ.szczecin.pl

Sok k/Bechatowa, June 2011

trie

Classical string dictionaries


burst trie (Heinz et al. 2002)

minimal acyclic DFA (Ciura i Deorowicz 2001)

Old assumptions and when they fail


Dictionary size applying to Heaps law: (n), n text size (in words), usually in 0.40.6. Texts on the web: dictionaries may have 108+ terms. According to (Ahmad & Kondrak 2005) about 20% of all query words in web searchers are non-dictionary words, including typos, but typos are only a small fraction of them. Those terms are: numerous names (people, brands, product numbers, geographical names etc.), neologisms and e-speak.

Why URLs
Web graph analyses: graph structure PLUS node info, ie. their URLs, needed. Specific characteristics (Heaps law for NL doesnt apply). May be huge.

Modern ideas for URL representation


Belazzougui et al. 2009: minimal monotone perfect hashing. E.g. it is enough to spend about 6.5 bits / key (avg) for a 106M-key URL dataset (uk-2007-05), with fast access (about 30 s per key) apart from the keys themselves. If the keys (URLs) themselves are not needed, the average 6.5 bits per key is enough to map each key to its lexicographical position.

Brisaboa et al. 2011: experimental study, many algs tested. Two techniques most successful for URLs: grammar-based RePair algororithm and plain front coding accompanied with HuTucker coding of the remaining suffixes. HuTucker: optimal among those codes that preserve lexicographical order of the keys.
5

What we do (1 / 2)
Front coding (standard technique) + phrase replacement on the residuals + Deflate (zip). Phrases: popular URL segments separated with [.&=/_-], min. length 2. http://www.skwigly.co.uk/banner/abmc.asp?b=62&z=45 potential phrases: http: | www | skwigly | co | uk | banner | abmc | 62 | 45 (b and z are eliminated as being too short). Note that front coding is also likely to remove http:// or http://www. first. 127 most freq phrases in a superblock replaced with 1-byte symbols.

What we do (2 / 2)
General philosophy: different steams, block based. Indidivual blocks compressed, access entries to blocks given. Deflate (zip) compression used. Front coding: up to length 255. The prefix bytes sent to a separate stream, with blocks of bp size. Residuals: in blocks of b lines. Common phrases: represented on a superblock level, of sb lines (sb being a multiple of b). extract(i) queries: find the prefix block, decode it, find the phrase block, extract its phrases, find the residuals block, decode it, insert back phrases, attach prefixes, 7 refer to the required line.

Datasets and results

Results in brief: about 59 bytes / URL in compressed form with avg extract time about 150600 s (@ Intel Core 2 Duo 6420 2.13 GHz).

Datasets available from the WebGraph project: http://webgraph.dsi.unimi.it/

Future work
Speed-optimize. Experiments also on fully lex-sorted URL collections (better compression). Add support for locate queries (given a key, return its index in the structure, or -1 if it doesnt exist). Smarter phrase replacement?

You might also like