2015-09-10: CDXJ: An Object Resource Stream Serialization Format

I have been working on an IIPC-funded project to profile various web archives and summarize their holdings. The idea is to generate statistical measures of an archive's holdings under various lookup keys, where a key can be a partial URI such as a Top Level Domain (TLD), a registered domain name, an entire domain name along with any number of sub-domain segments, a domain name plus a few segments from the path, a given time, a language, or a combination of two or more of these. Such a document (an archive profile) can be used to answer queries like "how many *.edu Mementos are there in a given archive?", "how many copies of the pages that fall under netpreserve.org/projects/* are there in an archive?", or "how many copies of *.cnn.com/* pages from 2010 in the Arabic language are there?". The archive profile can also be used to determine the overlap between two archives or to visualize their holdings in various ways. Early work of this research was presented at the Internet...
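To make the idea of lookup keys concrete, here is a minimal Python sketch of how partial-URI keys could be derived from captured URIs and counted. It is only an illustration under stated assumptions: the function names `profile_keys` and `build_profile`, the comma-separated reversed-host key layout with a `)/` separator before the path, and the use of a plain year as the time dimension are all hypothetical choices made for this example, not the format the project actually uses. It also skips registered-domain detection (which would need a public suffix list) and the language dimension.

```python
from collections import Counter
from urllib.parse import urlsplit


def profile_keys(uri, max_host_segments=3, max_path_segments=1):
    """Return partial-URI lookup keys for one URI, least specific first.

    The key layout is an assumption made for this sketch.
    """
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    # Reverse the host so the TLD comes first: www.cnn.com -> com, cnn, www
    labels = host.split(".")[::-1] if host else []

    keys = []
    # One key per host depth: "com", "com,cnn", "com,cnn,www", ...
    for n in range(1, min(len(labels), max_host_segments) + 1):
        keys.append(",".join(labels[:n]))

    # Add path-level keys only when the whole host fits within the limit,
    # so a path key never sits under a truncated host name.
    segments = [s for s in parts.path.split("/") if s]
    if labels and len(labels) <= max_host_segments:
        host_key = ",".join(labels)
        for n in range(1, min(len(segments), max_path_segments) + 1):
            keys.append(host_key + ")/" + "/".join(segments[:n]))
    return keys


def build_profile(captures):
    """Count captures under every key, both overall and per year.

    `captures` is an iterable of (uri, year) pairs. Totals are stored
    under (key, None) and per-year counts under (key, year).
    """
    counts = Counter()
    for uri, year in captures:
        for key in profile_keys(uri):
            counts[(key, None)] += 1
            counts[(key, year)] += 1
    return counts
```

With a profile built this way, a question such as "how many *.edu Mementos are there?" becomes a single lookup like `profile[("edu", None)]`, and restricting it to one year only changes the second element of the key.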