Szymon Grabowski Jakub Swacha: Sgrabow@kis.p.lodz - PL

This document proposes a new method for compactly representing large collections of URLs to allow for fast access. The method combines front coding, phrase replacement on residuals, and Deflate compression. It achieves a compressed representation of about 5-9 bytes per URL with average extraction times of 150-600 microseconds. The technique divides the URLs into blocks that are compressed individually and stores common phrases separately for improved compression. Evaluation on real-world URL datasets shows this approach effectively balances compact representation with fast access to URLs.


http://compact.representation.of.URL.collections.with.fast.access/

Szymon Grabowski¹, Jakub Swacha²

¹ Technical University of Łódź, Computer Engineering Dept., al. Politechniki 11, 90-924 Łódź. E-mail: sgrabow@kis.p.lodz.pl

² University of Szczecin, Institute of Information Technology in Management, Mickiewicza 64, 71-101 Szczecin. E-mail: jakubs@uoo.univ.szczecin.pl

Słok near Bełchatów, June 2011

Classical string dictionaries

trie

burst trie (Heinz et al. 2002)

minimal acyclic DFA (Ciura & Deorowicz 2001)

Old assumptions and when they fail

Dictionary size follows Heaps' law: O(n^β), where n is the text size (in words) and β is usually in the range 0.4-0.6. For texts on the web, dictionaries may have 10^8+ terms. According to (Ahmad & Kondrak 2005), about 20% of all query words in web searches are non-dictionary words, including typos, but typos are only a small fraction of them. Those terms are numerous names (people, brands, product numbers, geographical names etc.), neologisms and e-speak.
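To see how such dictionary sizes arise, Heaps' law can be evaluated numerically in its common form V(n) = k * n^β. The sketch below is only an illustration: the constant k = 30 and the exponent β = 0.5 are assumed values picked from the typical range, not figures from the slides, and the function name is hypothetical.

```python
def heaps_vocabulary(n_words: int, k: float = 30.0, beta: float = 0.5) -> float:
    """Estimate the number of distinct terms as V = k * n^beta (Heaps' law).

    k and beta are illustrative assumptions; real values depend on the corpus.
    """
    return k * n_words ** beta


# Vocabulary grows sublinearly: a text 1000x larger gives far fewer than 1000x more terms.
for n in (10**6, 10**9, 10**12):
    print(f"n = {n:.0e} words -> about {heaps_vocabulary(n):.1e} distinct terms")
```

Under these assumed constants the vocabulary grows with the square root of the text size, which is why the law gives a poor fit for collections dominated by names, identifiers and other open-ended terms.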

Why URLs
Web graph analyses need the graph structure PLUS node info, i.e. the URLs. URLs have specific characteristics (Heaps' law for natural language doesn't apply). URL collections may be huge.

Modern ideas for URL representation

Belazzougui et al. 2009: minimal monotone perfect hashing. E.g. for a 106M-key URL dataset (uk-2007-05), it is enough to spend about 6.5 bits / key (avg), apart from the keys themselves, with fast access (about 30 µs per key). If the keys (URLs) themselves are not needed, the average 6.5 bits per key is enough to map each key to its lexicographical position.

Brisaboa et al. 2011: experimental study, many algorithms tested. Two techniques were most successful for URLs: the grammar-based Re-Pair algorithm and plain front coding accompanied with Hu-Tucker coding of the remaining suffixes. Hu-Tucker is optimal among those codes that preserve the lexicographical order of the keys.

What we do (1 / 2)
Front coding (standard technique) + phrase replacement on the residuals + Deflate (zip).

Phrases: popular URL segments separated with [.&=/_-], min. length 2.

Example: http://www.skwigly.co.uk/banner/abmc.asp?b=62&z=45

Potential phrases: http: | www | skwigly | co | uk | banner | abmc | 62 | 45 (b and z are eliminated as being too short). Note that front coding is also likely to remove http:// or http://www. first.

The 127 most frequent phrases in a superblock are replaced with 1-byte symbols.
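The first two steps can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the URLs are assumed to be sorted (or at least to share prefixes with their predecessors), and the tokenizer uses exactly the separator set [.&=/_-] quoted above.

```python
import re
from collections import Counter

SEPARATORS = re.compile(r"[.&=/_-]")  # separator set as listed on the slide
MIN_PHRASE_LEN = 2                    # shorter segments are not phrase candidates
MAX_PREFIX_LEN = 255                  # so a prefix length fits in one byte


def front_code(urls):
    """Replace each URL by (length of prefix shared with the previous URL, residual)."""
    coded, previous = [], ""
    for url in urls:
        limit = min(len(previous), len(url), MAX_PREFIX_LEN)
        lcp = 0
        while lcp < limit and previous[lcp] == url[lcp]:
            lcp += 1
        coded.append((lcp, url[lcp:]))
        previous = url
    return coded


def front_decode(coded):
    """Invert front_code by re-attaching each prefix from the previous URL."""
    urls, previous = [], ""
    for lcp, residual in coded:
        previous = previous[:lcp] + residual
        urls.append(previous)
    return urls


def candidate_phrases(text):
    """Split on the separators and keep segments of at least MIN_PHRASE_LEN characters."""
    return [p for p in SEPARATORS.split(text) if len(p) >= MIN_PHRASE_LEN]


def top_phrases(residuals, limit=127):
    """Pick the most frequent candidate phrases, e.g. for one superblock."""
    counts = Counter()
    for residual in residuals:
        counts.update(candidate_phrases(residual))
    return [phrase for phrase, _ in counts.most_common(limit)]
```

With the literal separator set, `candidate_phrases` applied to the example URL also yields the segment `asp?b`, because `?` is not among the listed separators. The slide's own phrase list omits it, so the authors' tokenizer presumably differs in some detail that the slides do not state.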

What we do (2 / 2)
General philosophy: different streams, block based. Individual blocks are compressed, and access entries to blocks are given. Deflate (zip) compression is used.

Front coding: up to length 255. The prefix bytes are sent to a separate stream, with blocks of bp size.

Residuals: in blocks of b lines.

Common phrases: represented on a superblock level, of sb lines (sb being a multiple of b).

extract(i) queries: find the prefix block, decode it, find the phrase block, extract its phrases, find the residuals block, decode it, insert back phrases, attach prefixes, refer to the required line.
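The block layout and the extract(i) steps listed above can be sketched with Python's zlib module, which implements Deflate. Several details here are assumptions, not statements from the slides: one byte per prefix length, the first line of every residual block stored whole so that blocks decode independently, bp and sb both multiples of b, ASCII-only URLs, phrase symbols taken from byte values 128-254, and tiny block sizes chosen only for the demo. The names `build` and `extract` and the dictionary layout are hypothetical.

```python
import re
import zlib
from collections import Counter

B, BP, SB = 4, 8, 8  # lines per residual block, prefix block, superblock (assumed demo values)
TOKEN = re.compile(rb"[^.&=/_\n-]+")  # maximal runs between the separators [.&=/_-]


def build(urls):
    """Front-code the URLs, replace frequent phrases, and Deflate each block separately."""
    lens, residuals, prev = [], [], b""
    for i, url in enumerate(urls):
        data = url.encode("ascii")
        # Assumption: a block's first line keeps its full text (prefix length 0).
        limit = 0 if i % B == 0 else min(len(prev), len(data), 255)
        lcp = 0
        while lcp < limit and prev[lcp] == data[lcp]:
            lcp += 1
        lens.append(lcp)
        residuals.append(data[lcp:])
        prev = data
    index = {"prefix": [], "phrases": [], "residuals": []}
    for s in range(0, len(urls), BP):
        index["prefix"].append(zlib.compress(bytes(lens[s:s + BP])))
    for s in range(0, len(urls), SB):
        counts = Counter(t for r in residuals[s:s + SB]
                         for t in TOKEN.findall(r) if len(t) >= 2)
        table = [p for p, _ in counts.most_common(127)]
        codes = {p: bytes([128 + k]) for k, p in enumerate(table)}
        index["phrases"].append(zlib.compress(b"\n".join(table)))
        for t in range(s, min(s + SB, len(urls)), B):
            lines = [TOKEN.sub(lambda m: codes.get(m.group(), m.group()), r)
                     for r in residuals[t:t + B]]
            index["residuals"].append(zlib.compress(b"\n".join(lines)))
    return index


def extract(i, index):
    """Return the i-th URL by decoding only the blocks that cover line i."""
    lens = zlib.decompress(index["prefix"][i // BP])
    table = zlib.decompress(index["phrases"][i // SB]).split(b"\n")
    lines = zlib.decompress(index["residuals"][i // B]).split(b"\n")
    url = b""
    for j in range(i - i % B, i + 1):
        # Insert phrases back: bytes >= 128 are phrase symbols, the rest are literals.
        residual = b"".join(table[c - 128] if c >= 128 else bytes([c])
                            for c in lines[j % B])
        # Attach the prefix taken from the previously decoded line.
        url = url[:lens[j % BP]] + residual
    return url.decode("ascii")
```

A query touches one prefix block, one phrase table and one residual block, so its cost depends on the block sizes and not on the collection size. Larger blocks give Deflate more context and compress better, at the price of decoding more lines per query.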

Datasets and results

Results in brief: about 5-9 bytes / URL in compressed form, with an average extract time of about 150-600 µs (on an Intel Core 2 Duo 6420, 2.13 GHz).

Datasets available from the WebGraph project: http://webgraph.dsi.unimi.it/

Future work
Speed-optimize.

Experiments also on fully lex-sorted URL collections (better compression).

Add support for locate queries (given a key, return its index in the structure, or -1 if it doesn't exist).

Smarter phrase replacement?
