Szymon Grabowski Jakub Swacha: Sgrabow@kis.p.lodz - PL
Szymon Grabowski Jakub Swacha: Sgrabow@kis.p.lodz - PL
trie
Why URLs
Web graph analyses: graph structure PLUS node info, ie. their URLs, needed. Specific characteristics (Heaps law for NL doesnt apply). May be huge.
Brisaboa et al. 2011: experimental study, many algs tested. Two techniques most successful for URLs: grammar-based RePair algororithm and plain front coding accompanied with HuTucker coding of the remaining suffixes. HuTucker: optimal among those codes that preserve lexicographical order of the keys.
5
What we do (1 / 2)
Front coding (standard technique) + phrase replacement on the residuals + Deflate (zip). Phrases: popular URL segments separated with [.&=/_-], min. length 2. http://www.skwigly.co.uk/banner/abmc.asp?b=62&z=45 potential phrases: http: | www | skwigly | co | uk | banner | abmc | 62 | 45 (b and z are eliminated as being too short). Note that front coding is also likely to remove http:// or http://www. first. 127 most freq phrases in a superblock replaced with 1-byte symbols.
What we do (2 / 2)
General philosophy: different steams, block based. Indidivual blocks compressed, access entries to blocks given. Deflate (zip) compression used. Front coding: up to length 255. The prefix bytes sent to a separate stream, with blocks of bp size. Residuals: in blocks of b lines. Common phrases: represented on a superblock level, of sb lines (sb being a multiple of b). extract(i) queries: find the prefix block, decode it, find the phrase block, extract its phrases, find the residuals block, decode it, insert back phrases, attach prefixes, 7 refer to the required line.
Results in brief: about 59 bytes / URL in compressed form with avg extract time about 150600 s (@ Intel Core 2 Duo 6420 2.13 GHz).
Future work
Speed-optimize. Experiments also on fully lex-sorted URL collections (better compression). Add support for locate queries (given a key, return its index in the structure, or -1 if it doesnt exist). Smarter phrase replacement?