Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string sorting algorithm that, on large sets of strings, is almost twice as fast as previous algorithms, primarily because it is more cache-efficient. The approach in burstsort is to dynamically build a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SR-burstsort, DR-burstsort, and DRL-burstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce cache misses further than did the original burstsort, by up to 37%, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained.
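The core mechanism the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the burst threshold, node layout, and final bucket sort are simplified for clarity, and real burstsort uses much larger buckets and sorts only string tails.

```python
# Minimal sketch of the burstsort idea: strings are distributed into buckets
# hanging off a trie of leading characters; a bucket that grows past a
# threshold is "burst" into a new trie node, and buckets are finally sorted
# and concatenated in trie order.

BURST_LIMIT = 4  # illustrative only; practical thresholds are far larger

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.buckets = {}    # char -> list of strings (unsorted bucket)
        self.exhausted = []  # strings fully consumed at this depth

def insert(node, s, depth=0):
    if depth == len(s):
        node.exhausted.append(s)
        return
    c = s[depth]
    if c in node.children:
        insert(node.children[c], s, depth + 1)
        return
    bucket = node.buckets.setdefault(c, [])
    bucket.append(s)
    if len(bucket) > BURST_LIMIT:            # burst: promote bucket to a node
        child = TrieNode()
        node.children[c] = child
        for t in node.buckets.pop(c):
            insert(child, t, depth + 1)

def traverse(node):
    # Exhausted strings are prefixes of everything below, so they come first.
    out = list(node.exhausted)
    for c in sorted(set(node.children) | set(node.buckets)):
        if c in node.children:
            out.extend(traverse(node.children[c]))
        else:
            # Small bucket: sort it directly (real burstsort sorts tails only).
            out.extend(sorted(node.buckets[c]))
    return out

def burstsort(strings):
    root = TrieNode()
    for s in strings:
        insert(root, s)
    return traverse(root)

words = ["banana", "apple", "band", "bandit", "app", "ban", "cat", "apex", "bane"]
print(burstsort(words) == sorted(words))  # True
```

The cache benefit comes from the access pattern: each string touches only a short trie path plus one bucket, and each small bucket is then sorted while it fits in cache.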
Image collections may contain multiple copies, versions, and fragments of the same image. Storage or retrieval of such duplicates and near-duplicates may be unnecessary and, in the context of collections derived from the web, their presence may represent infringements of copyright. However, identifying image versions is a challenging problem, as they can be subject to a wide range of digital alterations, and is potentially costly as the number of image pairs to be considered is quadratic in collection size. In this paper, we propose a method for finding the pairs of near-duplicates based on manipulation of an image index. Our approach is an adaptation of a robust object recognition technique and a near-duplicate document detection algorithm to this application domain. We show that this method requires only moderate computing resources, and is highly effective at identifying pairs of near-duplicates.
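The index-manipulation strategy can be sketched in general terms. This is an illustration of the pattern, not the paper's algorithm: the feature quantization scheme, the `min_shared` threshold, and the scoring are hypothetical placeholders standing in for whatever descriptors and match criteria a real system would use.

```python
# Illustrative sketch: find candidate near-duplicate pairs by posting each
# image's quantized features into an inverted index, then counting how many
# index terms each pair of images shares. Work is driven by posting-list
# lengths, avoiding an explicit scan of all quadratically many image pairs.

from collections import defaultdict
from itertools import combinations

def near_duplicate_pairs(image_features, min_shared=3):
    """image_features: dict mapping image id -> set of quantized feature ids."""
    index = defaultdict(set)        # feature id -> ids of images containing it
    for img, feats in image_features.items():
        for f in feats:
            index[f].add(img)

    shared = defaultdict(int)       # (img_a, img_b) -> count of shared features
    for imgs in index.values():
        for a, b in combinations(sorted(imgs), 2):
            shared[(a, b)] += 1

    return {pair for pair, n in shared.items() if n >= min_shared}

images = {
    "original":  {1, 2, 3, 4, 5},
    "rescaled":  {1, 2, 3, 4, 9},   # shares four features with the original
    "unrelated": {6, 7, 8},
}
print(near_duplicate_pairs(images))  # {('original', 'rescaled')}
```

In practice, very long posting lists (features common to many images) would be pruned, since they contribute little evidence of duplication while dominating the pair-counting cost.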
Papers by Ranjan Sinha