Histograms were first used in a lecture in 1892 by Karl Pearson — the godfather of mathematical statistics. With how many data presentation tools we have today, it’s hard to think that representing data as a graphic was classified as “innovation”, but it was. They are a graphic presentation of the distribution and frequency of data. If you haven’t seen one recently, or don’t know the word histogram off the top of your head - it is a bar chart, each bar represents the count of data with a defined range of values. When Pearson built the first histogram, he calculated it by hand. Today we can use SQL (or even Excel) to extract this data continuously across large data sets.
While true statistical histograms have a bit more complexity for choosing bin ranges, for many business intelligence purposes, Postgres width_bucket
is good-enough to counting data inside bins with minimal effort.
Given the number of buckets and max/min value, width_bucket
returns the index for the bucket that a value will fall. For instance, given a minimum value of 0, a maximum value of 100, and 10 buckets, a value of 43 would fall in bucket #5: select width_bucket(43, 0, 100, 10) AS bucket;
But 5 is not correct for 43, or is it?
You can see how the values would fall using generate_series
:
SELECT value, width_bucket(value, 0, 100, 10) AS bucket FROM generate_series(0, 100) AS value;
When running the query, the values 0 through 9 go into bucket 1. As you can see in the image above, width_bucket
behaves as a step function that starts indexing with 1. In this scenario, when passed a value of 100, width_bucket
returns 11, because the maximum value given the width_bucket is an exclusive range (i.e. the logic is minimum <= value < maximum).
We can use the bucket value to generate more readable labels.
Let’s build out a larger query that creates ranges, range labels, and formats the histogram. We will start by using a synthetic table within
[...]