…
Similarity digesting is a class of algorithms and technologies that generate hashes from files while preserving file similarity. They find applications in various areas across the security industry: malware variant detection, spam filtering, computer forensic analysis, data loss prevention, and more. Several schemes and tools are available, including ssdeep, sdhash and TLSH. While all are useful for detecting file similarity, they define similarity from different perspectives; in other words, they take different approaches to describing what file similarity means. To compare these tools on an equal footing, we introduce a simple mathematical model of similarity that covers all three schemes and beyond. This model lets us establish a theoretical framework for analyzing the essential differences between various similarity digesting algorithms and tools. As a result, several tools are found to be complementary, so that in practice they can be combined in a hybrid approach. Data experiment results are provided to support the theoretical analysis. In addition, we introduce a novel similarity digesting scheme designed on the basis of this mathematical model.
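A minimal sketch of the general idea behind similarity digesting, viewed as a feature mapping plus a comparison function; the byte 2-gram features and the cosine score below are illustrative assumptions, not the paper's mathematical model or any of the ssdeep/sdhash/TLSH algorithms.

```python
# Toy model of a similarity digest: map a file to a compact feature
# representation and compare representations with a similarity score.
from collections import Counter
import math

def digest(data: bytes) -> Counter:
    """Toy 'digest': frequencies of overlapping byte 2-grams."""
    return Counter(data[i:i + 2] for i in range(len(data) - 1))

def similarity(d1: Counter, d2: Counter) -> float:
    """Cosine similarity between two toy digests, in [0, 1]."""
    common = set(d1) & set(d2)
    dot = sum(d1[g] * d2[g] for g in common)
    norm = math.sqrt(sum(v * v for v in d1.values())) * \
           math.sqrt(sum(v * v for v in d2.values()))
    return dot / norm if norm else 0.0

a = b"the quick brown fox jumps over the lazy dog"
b = b"the quick brown fox leaps over the lazy dog"
print(similarity(digest(a), digest(b)))   # close to 1.0 for similar inputs
```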
2012
In this paper, we describe the use of Bloom filters as a sliding window hash storage mechanism for similarity comparisons. The focus of the paper is overcoming accuracy issues in current file similarity methods.
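A minimal sketch of the idea described above: hash every sliding window of a file into a Bloom filter and score similarity by window membership. The window size, filter size and md5-derived bit positions are illustrative assumptions rather than the paper's design.

```python
import hashlib

FILTER_BITS = 2048
NUM_HASHES = 3
WINDOW = 8

def window_bits(window: bytes):
    """Derive NUM_HASHES bit positions for one window from its md5 digest."""
    h = hashlib.md5(window).digest()
    for i in range(NUM_HASHES):
        yield int.from_bytes(h[4 * i:4 * i + 4], "big") % FILTER_BITS

def bloom_of(data: bytes) -> set:
    """Represent the Bloom filter as the set of bit positions switched on."""
    bits = set()
    for i in range(len(data) - WINDOW + 1):
        bits.update(window_bits(data[i:i + WINDOW]))
    return bits

def score(data1: bytes, data2: bytes) -> float:
    """Fraction of data1's windows whose bits are all set in data2's filter."""
    f2 = bloom_of(data2)
    windows = range(len(data1) - WINDOW + 1)
    hits = sum(all(b in f2 for b in window_bits(data1[i:i + WINDOW])) for i in windows)
    return hits / max(len(windows), 1)

print(score(b"similarity digests compare files", b"similarity digests compare data"))
```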
There has been considerable research into, and use of, similarity digests and Locality Sensitive Hashing (LSH) schemes: hashing schemes in which small changes in a file result in small changes in the digest. These schemes are useful in security and forensic applications. We examine how well three similarity digest schemes (ssdeep, sdhash and TLSH) work when exposed to random change. Various file types are tested by randomly manipulating source code, HTML, text and executable files. In addition, we test for similarities in modified image files that were generated by cybercriminals to defeat fuzzy hashing schemes (spam images). The experiments expose shortcomings in the sdhash and ssdeep schemes that can be exploited in straightforward ways. The results suggest that the TLSH scheme is more robust to the attacks and random changes considered.
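A sketch of the kind of random-change experiment described above; the mutation counts and the stand-in digest/compare callables are hypothetical placeholders for whichever scheme (ssdeep, sdhash, TLSH) is under test, not the paper's exact methodology.

```python
import os
import random

def mutate(data: bytes, n_changes: int, rng: random.Random) -> bytes:
    """Flip n_changes randomly chosen bytes to random values."""
    buf = bytearray(data)
    for _ in range(n_changes):
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)

def robustness_curve(data, digest, compare, steps=(1, 10, 100, 1000), seed=0):
    """Score the original against increasingly mutated copies."""
    rng = random.Random(seed)
    base = digest(data)
    return {n: compare(base, digest(mutate(data, n, rng))) for n in steps}

# Example usage with a trivial stand-in digest (identity) and a byte-match score:
data = os.urandom(4096)
curve = robustness_curve(
    data,
    digest=lambda d: d,
    compare=lambda a, b: sum(x == y for x, y in zip(a, b)) / len(a),
)
print(curve)
```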
Proceedings of the 20th ACM international …, 2011
This paper offers a novel look at using a dimensionality-reduction technique called simhash [8] to detect similar document pairs in large-scale collections. We show that this algorithm produces interesting intermediate data, which is normally discarded, that can be used to predict which of the bits in the final hash are more susceptible to being flipped in similar documents. This paves the way for a probabilistic search technique in the Hamming space of simhashes that can be significantly faster and more space-efficient than the existing simhash approaches. We show that with 95% recall compared to deterministic search of prior work [16], our method exhibits 4-14 times faster lookup and requires 2-10 times less RAM on our collection of 70M web pages.
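A minimal 64-bit simhash sketch in the spirit of the scheme discussed above; the whitespace tokenization, md5-based feature hashing and unit weighting are simplifying assumptions, not the authors' implementation or their probabilistic search technique.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Sum signed bit votes from hashed tokens; the sign of each vote sets a bit."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

h1 = simhash("similarity digesting for large scale document collections")
h2 = simhash("similarity digests for large scale document collections")
print(hamming(h1, h2))   # small Hamming distance for near-duplicate text
```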
Digital Investigation, 2013
Fuzzy hashing provides the possibility to identify similar files based on their hash signatures, which is useful for forensic investigations. Current tools for fuzzy hashing, e.g., ssdeep, perform similarity search on fuzzy hashes by brute force. This is often too time-consuming for real cases. We solve this issue for ssdeep, and for an even larger class of fuzzy hashes, namely piecewise hash signatures, by introducing a suitable indexing strategy. The strategy is based on n-grams contained in the piecewise hash signatures, and it allows similarity queries to be answered very efficiently. The implementation of our solution is called F2S2. This tool reduces the time needed for typical investigations from many days to minutes.
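A sketch of n-gram indexing over piecewise hash signatures in the spirit of the strategy described above; the example signature strings, the n-gram length and the candidate-filtering rule are illustrative assumptions rather than F2S2's actual design.

```python
from collections import defaultdict

N = 3  # n-gram length over the signature string

class NGramIndex:
    def __init__(self):
        self.index = defaultdict(set)   # n-gram -> set of signature ids
        self.sigs = {}

    def add(self, sig_id: str, signature: str):
        self.sigs[sig_id] = signature
        for i in range(len(signature) - N + 1):
            self.index[signature[i:i + N]].add(sig_id)

    def candidates(self, query: str, min_shared: int = 2) -> set:
        """Return ids sharing at least min_shared n-grams with the query,
        so only these need a full (expensive) similarity comparison."""
        counts = defaultdict(int)
        for i in range(len(query) - N + 1):
            for sig_id in self.index.get(query[i:i + N], ()):
                counts[sig_id] += 1
        return {s for s, c in counts.items() if c >= min_shared}

# Example usage with hypothetical piecewise-hash-style signatures:
idx = NGramIndex()
idx.add("file1", "3:AXGBicFlgVNhBGcL6wCrFQEv:AXGHsNhxLsr2C")
idx.add("file2", "3:AXGBicFlgVNhBGcL6wCrFQEk:AXGHsNhxLsr2K")
print(idx.candidates("3:AXGBicFlgVNhBGcL6wCrFQEv:AXGHsNhxLsr2C"))
```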
Journal of King Saud University – Computer and Information Sciences, 2021
Data reduction has gained growing emphasis due to the rapid, unsystematic increase in digital data and has become a sensible approach for big data systems. Data deduplication is a technique to optimize storage requirements and plays a vital role in eliminating redundancy in large-scale storage. Although chunk-based deduplication is robust in finding suitable chunk-level breakpoints for redundancy elimination, it faces three key problems: (1) low chunking performance, which makes the chunking stage a bottleneck; (2) large variation in chunk size, which reduces deduplication efficiency; and (3) hash computation overhead. To handle these challenges, this paper proposes a technique for finding proper cut-points between chunks using a set of commonly repeated patterns (CRP): it picks the most frequent sequences of adjacent bytes (i.e., contiguous segments of bytes) as breakpoints. In addition, a scalable lightweight triple-leveled hashing function (LT-LH) is proposed to mitigate the cost of hash computation and storage overhead; the number of hash levels used in the tests was three, and this number depends on the size of the data to be deduplicated. To evaluate the performance of the proposed technique, a set of tests was conducted to analyze the dataset characteristics in order to choose the near-optimal length of the byte sequences used as divisors to produce chunks. The performance assessment also includes determining the system parameter values that lead to an improved deduplication ratio and reduce the system resources needed for deduplication. The results demonstrate that the CRP algorithm is 15 times faster than the basic sliding window (BSW) approach and about 10 times faster than two thresholds two divisors (TTTD), and that the proposed LT-LH is five times faster than Secure Hash Algorithm 1 (SHA-1) and Message-Digest Algorithm 5 (MD5) with better storage savings.
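A minimal sketch of pattern-based cut-point selection along the lines described above: a small set of frequently occurring byte sequences serves as breakpoints, with minimum and maximum chunk sizes bounding the chunk-size variation. The pattern length, the example patterns and the size bounds are illustrative assumptions, not the CRP parameters or the LT-LH hashing from the paper.

```python
def chunk_with_patterns(data: bytes, patterns, min_size=64, max_size=1024):
    """Cut data into chunks wherever a known pattern starts, within size bounds."""
    chunks, start, i = [], 0, 0
    plen = len(next(iter(patterns)))
    while i < len(data):
        size = i - start
        at_pattern = size >= min_size and data[i:i + plen] in patterns
        if at_pattern or size >= max_size:
            chunks.append(data[start:i])
            start = i
        i += 1
    chunks.append(data[start:])
    return chunks

# Example usage: cut points at two hypothetical "commonly repeated" 2-byte patterns.
data = b"header\r\nbody text \r\nmore body \r\n" * 40
chunks = chunk_with_patterns(data, patterns={b"\r\n", b"\t\t"}, min_size=16, max_size=256)
print(len(chunks), [len(c) for c in chunks[:5]])
```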
2014
Identifying the same document is the task of near-duplicate detection. Among near-duplicate detection algorithms, fingerprinting algorithms are considered for use in analysis, plagiarism detection, and the repair and maintenance of social software. The idea of using fingerprints to identify duplicated material relies on cryptographic hash functions, which are secure against destructive attacks and therefore serve as high-quality fingerprinting functions. Cryptographic hash algorithms include MD5 and SHA-1, which have been widely applied in file systems. In this paper, using available heuristic algorithms for near-duplicate detection, pairs of similar documents are grouped under a certain threshold, and each set is identified according to whether it is a near-duplicate. Furthermore, document comparison is performed by a fingerprinting algorithm, and finally the total value is calculated using the standard method.
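A minimal sketch of fingerprint-based near-duplicate detection as outlined above, using MD5 fingerprints of character k-grams and a Jaccard threshold; the k-gram length, the threshold and the pairing rule are illustrative assumptions rather than the paper's specific heuristic.

```python
import hashlib

def fingerprints(text: str, k: int = 8) -> set:
    """MD5 fingerprints of overlapping character k-grams, after normalisation."""
    text = " ".join(text.lower().split())
    return {hashlib.md5(text[i:i + k].encode()).hexdigest()
            for i in range(len(text) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs: dict, threshold: float = 0.8):
    """Return document pairs whose fingerprint similarity meets the threshold."""
    fps = {name: fingerprints(t) for name, t in docs.items()}
    names = sorted(fps)
    return [(x, y) for i, x in enumerate(names) for y in names[i + 1:]
            if jaccard(fps[x], fps[y]) >= threshold]

docs = {
    "a.txt": "Fuzzy hashing identifies similar files from their signatures.",
    "b.txt": "Fuzzy hashing identifies similar files using their signatures.",
    "c.txt": "An entirely different sentence about storage deduplication.",
}
print(near_duplicates(docs, threshold=0.5))
```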
2007
Large-scale digital forensic investigations present at least two fundamental challenges. The first is accommodating the computational needs of processing a large amount of data. The second is extracting useful information from the raw data in an automated fashion. Both of these problems can result in long processing times that seriously hamper an investigation. In this paper, we discuss a new approach to one of the basic operations that is invariably applied to raw data: hashing.
International Journal on Digital Libraries, 2004
The ever-growing volumes of textual information from various sources have fostered the development of digital libraries, making digital content readily accessible but also easy for malicious users to plagiarize, thus giving rise to security problems. In this paper, we introduce a duplicate detection scheme that is able to determine, with a particularly high accuracy, the degree to which one document is similar to another. Our pairwise document comparison scheme detects the resemblance between the content of documents by considering document chunks, representing contexts of words selected from the text. The resulting duplicate detection technique presents a good level of security in the protection of intellectual property while improving the availability of the data stored in the digital library and the correctness of the search results. Finally, the paper addresses efficiency and scalability issues by introducing new data reduction techniques.
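A minimal sketch of pairwise comparison over word-context chunks as described above; the fixed word-window chunking and the containment score are illustrative assumptions standing in for the paper's chunk selection and resemblance measure.

```python
def word_chunks(text: str, window: int = 4) -> set:
    """Overlapping word windows representing contexts of words in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + window]) for i in range(len(words) - window + 1)}

def containment(suspect: str, source: str, window: int = 4) -> float:
    """Fraction of the suspect document's chunks that also occur in the source."""
    s, src = word_chunks(suspect, window), word_chunks(source, window)
    return len(s & src) / len(s) if s else 0.0

source = "duplicate detection schemes compare documents using chunks of words selected from the text"
suspect = "our scheme can compare documents using chunks of words selected from the text very quickly"
print(containment(suspect, source))   # higher when more of 'suspect' is reused
```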
Aries, 2018
…research. I expect the book will influence modern Asatru, a faith whose members are often open to academic research. Even seasoned Asatru followers report discovering, in Norse Revival, groups they have never heard of. The 48-page bibliography alone will be valuable to any future student of Asatru. The one useful tool I missed is a tabular overview with core information on all the groups mentioned in the book, many named in unfamiliar languages and some later referred to by acronym only. There are a number of small errors of detail of the sort usually mentioned only to prove the reviewer's thorough reading; for example, the NPD is not "Germany's right-wing party" but the neo-Nazi party, and the Wiking-Jugend is not "under police observation" but was banned decades ago (50). But these are trifles in an overall excellent and recommendable book.