Abstract
State-of-the-art techniques for data fingerprinting have been based on randomized feature selection pioneered by Rabin in 1981. This paper proposes a new, statistical approach for selecting fingerprinting features. The approach relies on entropy estimates and a sizeable empirical study to pick out the features that are most likely to be unique to a data object and, therefore, least likely to trigger false positives. The paper also describes the implementation of a tool (sdhash) and the results of an evaluation study. The results demonstrate that the approach works consistently across different types of data, and its compact footprint allows for the digests of targets in excess of 1 TB to be queried in memory.
Chapter PDF
Similar content being viewed by others
References
B. Bloom, Space/time trade-offs in hash coding with allowable errors, Communications of the ACM, vol. 13(7), pp. 422–426, 1970.
S. Brin, J. Davis and H. Garcia-Molina, Copy detection mechanisms for digital documents, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 398–409, 1995.
A. Broder, S. Glassman, M. Manasse and G. Zweig, Syntactic clustering of the web, Computer Networks and ISDN Systems, vol. 29(8-13), pp. 1157–1166, 1997.
A. Broder and M. Mitzenmacher, Network applications of Bloom filters: A survey, Internet Mathematics, vol. 1(4), pp. 485–509, 2005.
C. Cho, S. Lee, C. Tan and Y. Tan, Network forensics on packet fingerprints, Proceedings of the Twenty-First IFIP Information Security Conference, pp. 401–412, 2006.
Digital Corpora, NPS Corpus (digitalcorpora.org/corpora/disk-images).
J. Kornblum, Identifying almost identical files using context triggered piecewise hashing, Digital Investigation, vol. 3(S1), pp. S91–S97, 2006.
U. Manber, Finding similar files in a large file system, Proceedings of the USENIX Winter Technical Conference, pp. 1–10, 1994.
M. Mitzenmacher, Compressed Bloom filters, IEEE/ACM Transactions on Networks, vol. 10(5), pp. 604–612, 2002.
National Institute of Standards and Technology, National Software Reference Library, Gaithersburg, Maryland (www.nsrl.nist.gov).
M. Ponec, P. Giura, H. Bronnimann and J. Wein, Highly efficient techniques for network forensics, Proceedings of the Fourteenth ACM Conference on Computer and Communications Security, pp. 150–160, 2007.
H. Pucha, D. Andersen and M. Kaminsky, Exploiting similarity for multi-source downloads using file handprints, Proceedings of the Fourth USENIX Symposium on Networked Systems Design and Implementation, pp. 15–28, 2007.
M. Rabin, Fingerprinting by Random Polynomials, Technical Report TR1581, Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts, 1981.
S. Rhea, K. Liang and E. Brewer, Value-based web caching, Proceedings of the Twelfth International World Wide Web Conference, pp. 619–628, 2003.
V. Roussev, Building a better similarity trap with statistically improbable features, Proceedings of the Forty-Second Hawaii International Conference on System Sciences, pp. 1–10, 2009.
V. Roussev, Hashing and data fingerprinting in digital forensics, IEEE Security and Privacy, vol. 7(2), pp. 49–55, 2009.
V. Roussev, sdhash, New Orleans, Louisiana (roussev.net/sdhash).
V. Roussev, Y. Chen, T. Bourg and G. Richard, md5bloom: Forensic filesystem hashing revisited, Digital Investigation, vol. 3(S), pp. S82–S90, 2006.
V. Roussev, G. Richard and L. Marziale, Multi-resolution similarity hashing, Digital Investigation, vol. 4(S), pp. S105–S113, 2007.
V. Roussev, G. Richard and L. Marziale, Class-aware similarity hashing for data classification, in Research Advances in Digital Forensics IV, I. Ray and S. Shenoi (Eds.), Springer, Boston, Massachusetts, pp. 101–113, 2008.
S. Schleimer, D. Wilkerson and A. Aiken, Winnowing: Local algorithms for document fingerprinting, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76–85, 2003.
K. Shanmugasundaram, H. Bronnimann and N. Memon, Payload attribution via hierarchical Bloom filters, Proceedings of the Eleventh ACM Conference on Computer and Communications Security, pp. 31–41, 2004.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 International Federation for Information Processing
About this paper
Cite this paper
Roussev, V. (2010). Data Fingerprinting with Similarity Digests. In: Chow, KP., Shenoi, S. (eds) Advances in Digital Forensics VI. DigitalForensics 2010. IFIP Advances in Information and Communication Technology, vol 337. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15506-2_15
Download citation
DOI: https://doi.org/10.1007/978-3-642-15506-2_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15505-5
Online ISBN: 978-3-642-15506-2
eBook Packages: Computer ScienceComputer Science (R0)