The current `SimHashIterator` works by finding the most significant active bit and then calculating the number of `SimHash` buckets that exist from that bit to the end of the filter, so we iterate over `(most_significant_bit / SIM_BUCKET_SIZE) + SIM_BUCKETS` buckets. The problem is that the higher-order bits are very sparse, so there will potentially be many buckets where the values we mix into our `SimHash`es are just zeros, and since XORing any value with `0` leaves it unchanged, this is wasted computation.

We could instead use the `BitChunks` iterators, which batch our chunks into a `u64` plus the index of that block of bits, skipping ranges that are all zeros. This introduces an issue when a `SimHash` bucket straddles two `BitChunks`, which can be solved by peeking at the next chunk and checking that its index is contiguous with the current one. Right now `SIM_BUCKET_SIZE` is always `6` bits, but if we allowed fully arbitrary `SimHash` bucket sizes this could get slightly gnarly, because we would need to peek an arbitrary number of `BitChunk`s ahead of the current position.

The following line chart plots the proportion of empty `bit_range`s mixed into the SimHashes, using pseudorandomly generated geofilters with `0..100_000` items added to them.

<img width="640" height="480" alt="Image" src="https://github.com/user-attachments/assets/f63c62f3-17eb-4727-9807-77999f688d12" />

The ratio drops very sharply after a couple of inserts but remains relatively high, since each newly added item has some chance of landing in a very significant bucket, which introduces a run of zero gaps before the next one bit.
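To make the straddling case concrete, here is a rough sketch of the chunk-skipping idea. It assumes a simplified representation of the `BitChunks` output as sorted `(chunk_index, u64)` pairs with all-zero chunks already omitted; the function name and the exact handling are hypothetical and not the crate's actual API. A bucket that straddles the boundary pulls its high bits from the peeked chunk only when the indices are contiguous; otherwise those bits are known to be zero.

```rust
const SIM_BUCKET_SIZE: u64 = 6;
const BUCKET_MASK: u64 = (1 << SIM_BUCKET_SIZE) - 1;

/// Yield (bucket_index, bucket_value) for every non-empty SimHash bucket,
/// given the non-zero 64-bit chunks of the filter as sorted
/// (chunk_index, bits) pairs. All-zero chunks are assumed absent, so the
/// buckets they would contribute are skipped without any work.
fn nonzero_buckets(chunks: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    let mut iter = chunks.iter().copied().peekable();
    // Highest bucket index already handled, to avoid re-emitting a bucket
    // that straddled the previous (contiguous) chunk.
    let mut last_seen: Option<u64> = None;
    while let Some((idx, bits)) = iter.next() {
        let base = idx * 64; // absolute bit position of this chunk's LSB
        let first = base / SIM_BUCKET_SIZE;
        let last = (base + 63) / SIM_BUCKET_SIZE;
        for b in first..=last {
            if last_seen.map_or(false, |s| b <= s) {
                continue; // emitted while processing the previous chunk
            }
            let start = b * SIM_BUCKET_SIZE;
            let mut val = if start >= base {
                (bits >> (start - base)) & BUCKET_MASK
            } else {
                // Bucket began inside a skipped all-zero chunk: its low
                // bits are zero, only its tail overlaps this chunk.
                (bits << (base - start)) & BUCKET_MASK
            };
            if start + SIM_BUCKET_SIZE > base + 64 {
                // Bucket straddles into the next chunk: peek it and make
                // sure its index is contiguous before taking its low bits.
                if let Some(&(next_idx, next_bits)) = iter.peek() {
                    if next_idx == idx + 1 {
                        let taken = base + 64 - start; // bits already taken
                        val |= (next_bits << taken) & BUCKET_MASK;
                    }
                }
            }
            last_seen = Some(b);
            if val != 0 {
                out.push((b, val));
            }
        }
    }
    out
}
```

Because `6` does not divide `64`, every chunk boundary lands mid-bucket, which is why the peek is on the hot path rather than a rare case.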