Why FM algorithm?
Every day on the internet, more than 2.5 quintillion bytes of data are created. This data
is increasing in terms of variety, velocity and volume, hence called big data. To analyze this
data, one has to collect this data, store it in a safe place, clean it and then perform analysis.
One of the major problems faced by big data engineers is dealing with unuseful or redundant
data. A lot of time and memory is used to store and analyze this extra data which turns out
to be fruitless in the end. Thus, the removal of duplicate data becomes extremely essential
to cut the analysis cost and reduce redundancy.
Data cleaning can be done using various techniques but before cleaning the data, it is
necessary to know the amount of useful data present in the dataset. Therefore, before the
removal of duplicate data from a data stream or database, it is necessary to have knowledge
of distinct or unique data present. A way to do so is by hashing the elements of the universal
set using the Flajolet Martin Algorithm. The FM algorithm is used in a database query, big
data analytics, spectrum sensing in cognitive radio sensor networks, and many more areas.
It shows superior performance as compared with many other methods to find distinct
elements in a stream of data.
Flajolet Martin Algorithm:
Flajolet Martin Algorithm, also known as FM algorithm, is used to approximate the number
of unique elements in a data stream or database in one pass. The highlight of this algorithm
is that it uses less memory space while executing.
Pseudo Code-Stepwise Solution:
1. Selecting a hash function h so each element in the set is mapped to a string to at least
log2n bits.
2. For each element x, r(x)= length of trailing zeroes in h(x)
3. R= max(r(x))
=> Distinct elements= 2R
For small values of m(where m is the number of unique elements), the brute force approach
can work, but for large data sets or data streams, where m is very large, a lot of space is
required. The compiler may not let us run the algorithm in some cases.This is where the
Flajolet Martin Algorithm can be used. Not only does it occupy less memory, but it also
shows better results in terms of time in seconds.
This approach is used to maintain a count of distinct values seen so far, given a large number
of values. For example, getting an approximation of the number of distinct URLs surfed by
a person on the web. Many companies want to check how many unique users logged in to
their website the previous day to check if their advertisement was successful or not. Here,
the FM algorithm is an excellent solution for these companies.