H2Histogram: Base-2 Redesign of HdrHistogram for Unbeatable Performance
The H2Histogram provides a histogram that is conceptually similar to HdrHistogram, with modifications to the bucket construction algorithm and configurable options.
Background and Assumptions
Background
Recording and reporting quantile metrics is a very common task, and highly valuable in characterizing workload and performance. Usually, system input/output and behavior have a wide but bounded range of acceptable values, and tracking the distribution within this range at a certain precision provides rich insights into the workload. In high-throughput scenarios, HdrHistogram is highly popular due to its low recording overhead. However, HdrHistogram is a base-10 design. We believe a base-2 variant works even better in terms of speed, simplicity, and configurability.
Assumptions
Below are our core assumptions:
- Bucket widths are powers of 2, and monotonically increasing for larger values
- We want to accept any natural number up to the maximum value allowed
- Using relative error as a constraint, the goal is to primarily maximize recording throughput and secondarily minimize memory space
Goals
H2Histogram aims to provide a simpler implementation, finer-grain configuration, and more efficient runtime compared to the reference implementation provided by HdrHistogram.
Basic Parameters
Maximum value in histogram:
This is the largest value the histogram can track, parameterized by the exponent .
For , = “”.
Relative error:
This is the upper bound on the relative error created by value quantization, parameterized by the exponent (of its inverse), .
For , .
Note: the concept of relative error does not apply when the exact integer value is recorded.
Bucket Construction
We use a bucket width of just 1 for small integers. As the absolute value goes up, we can double the bucket width without violating the relative error constraint, .
The smallest value that allows bucket width to go from () to () is (or ). For , that value is . This means there are precise buckets.
Bucket width can be doubled from () to () for values greater than or equal to (or ). For , that value is . This also means there are buckets of width 2.
So on and so forth, until the threshold value reaches the maximum value . At that point, there will be groups of buckets of doubling widths. For and , the number of groups is .
Note: When , the histogram will have the exact same number of buckets as there are values. Notions of absolute and relative errors do not apply in such a histogram.
Bucket Summary
- Total number of buckets: . For and , that is **** buckets.
- Histogram size: number of buckets multiplied by counter size. For and , that is ** KiB** (64-bit counters) or ** KiB** (32-bit counters).
The table below gives a sense of the total number of buckets and histogram size for various values, assuming we want to record values up to the maximum unsigned 64-bit integer.
| p | relative error | #buckets | size (8-bit) | size (16-bit) | size (32-bit) | size (64-bit) |
|---|---|---|---|---|---|---|
| 2 | 25.0% | 252 | 0.2 KiB | 0.5 KiB | 1.0 KiB | 2.0 KiB |
| 3 | 12.5% | 496 | 0.5 KiB | 1.0 KiB | 1.9 KiB | 3.9 KiB |
| 4 | 6.25% | 976 | 1.0 KiB | 1.9 KiB | 3.8 KiB | 7.6 KiB |
| 5 | 3.13% | 1920 | 1.9 KiB | 3.8 KiB | 7.5 KiB | 15.0 KiB |
| 6 | 1.56% | 3776 | 3.7 KiB | 7.4 KiB | 14.8 KiB | 29.5 KiB |
| 7 | 0.781% | 7424 | 7.3 KiB | 14.5 KiB | 29.0 KiB | 58.0 KiB |
| 8 | 0.391% | 14592 | 14.3 KiB | 28.5 KiB | 57.0 KiB | 114.0 KiB |
| 9 | 0.195% | 28672 | 28.0 KiB | 56.0 KiB | 112.0 KiB | 224.0 KiB |
| 10 | 0.0977% | 56320 | 55.0 KiB | 110.0 KiB | 220.0 KiB | 440.0 KiB |
| 11 | 0.0488% | 110592 | 108.0 KiB | 216.0 KiB | 432.0 KiB | 864.0 KiB |
| 12 | 0.0244% | 217088 | 212.0 KiB | 424.0 KiB | 848.0 KiB | 1.7 MiB |
| 13 | 0.0122% | 425984 | 416.0 KiB | 832.0 KiB | 1.6 MiB | 3.3 MiB |
| 14 | 0.00610% | 835584 | 816.0 KiB | 1.6 MiB | 3.2 MiB | 6.4 MiB |
If input values are 32-bit integers, which is also common, the histograms are much smaller.
| p | relative error | #buckets | size (8-bit) | size (16-bit) | size (32-bit) | size (64-bit) |
|---|---|---|---|---|---|---|
| 2 | 25.0% | 124 | 0.1 KiB | 0.2 KiB | 0.5 KiB | 1.0 KiB |
| 3 | 12.5% | 240 | 0.2 KiB | 0.5 KiB | 0.9 KiB | 1.9 KiB |
| 4 | 6.25% | 464 | 0.5 KiB | 0.9 KiB | 1.8 KiB | 3.6 KiB |
| 5 | 3.13% | 896 | 0.9 KiB | 1.8 KiB | 3.5 KiB | 7.0 KiB |
| 6 | 1.56% | 1728 | 1.7 KiB | 3.4 KiB | 6.8 KiB | 13.5 KiB |
| 7 | 0.781% | 3328 | 3.3 KiB | 6.5 KiB | 13.0 KiB | 26.0 KiB |
| 8 | 0.391% | 6400 | 6.3 KiB | 12.5 KiB | 25.0 KiB | 50.0 KiB |
| 9 | 0.195% | 12288 | 12.0 KiB | 24.0 KiB | 48.0 KiB | 96.0 KiB |
| 10 | 0.0977% | 23552 | 23.0 KiB | 46.0 KiB | 92.0 KiB | 184.0 KiB |
| 11 | 0.0488% | 45056 | 44.0 KiB | 88.0 KiB | 176.0 KiB | 352.0 KiB |
| 12 | 0.0244% | 86016 | 84.0 KiB | 168.0 KiB | 336.0 KiB | 672.0 KiB |
| 13 | 0.0122% | 163840 | 160.0 KiB | 320.0 KiB | 640.0 KiB | 1.3 MiB |
| 14 | 0.00610% | 311296 | 304.0 KiB | 608.0 KiB | 1.2 MiB | 2.4 MiB |
Bucket Layout
Bucket Lookup
Bucket lookup is the primary operation for recording a value. The following algorithm only applies to positive integers (i.e. zero values are not allowed). Support for the zero value is trivially added as a first step of the lookup implementation.
We represent the input value as an -digit binary:
Looking up the right bucket offset for works as follows:
- Find the highest non-zero digit . () is such that but for all that satisfy .
- Calculate the bucket offset. If , maps to a precise bucket, so we simply have ; otherwise, first compute , and we can compute as .
Explanation
In the above algorithm, the highest non-zero digit tells us the width of the bucket that should record the input value . For the bucket width is ; for the bucket width is ; etc. For any positive , the total number of buckets with a width less than is .
There are buckets of width , with a lower bound as the value. To find out which bucket in this group is the right one for , we compute the distance between and the lower bound, and divide that by the bucket width of the current group. Since all bounds are powers of 2, the division is converted to a bit shift.
Optimization
On CPUs supporting SSE4, can be most efficiently computed by
using the LZCNT instruction.
Below are a few lookup examples assuming :
-
: ,
-
: ,
-
: , ,
-
: , ,
-
: , ,
Quantile Lookup
In a naïve implementation, reporting a particular quantile requires traversing all the buckets once. Even if the total number of values in the histogram is known, the number of buckets to traverse is still linear to the total number of buckets.
There are a couple of things to consider during implementation regarding bias in reporting. Because each bucket covers a range except for the precise ones, a decision needs to be made about what value in that range to report when we do a quantile lookup. We also need to consider how to interpret results that don’t fall neatly in a single bucket — for example, when there are 3 values in the histogram and the quantile in question is . For the latter, we can use the nearest-rank method to find the nearest bucket with records. Alternatively, it is also possible to extrapolate an in-between value based on existing data. Each method has its drawbacks. For example, linear interpolation could lead to a confusing situation where the interpreted value is deemed impossible given how values are measured.
Optimization
To reduce the number of buckets traversed during lookup, one can store the total number of counts, , across all buckets, and return when the buckets traversed so far yield a cumulative count greater than (if traversing from the lowest bucket) or (if traversing from the highest bucket). Further reduction can be achieved by using some type of “sketch” that stores cumulative values across multiple buckets, which allows the cursor to jump over many buckets at a time. The tradeoff is that multiple values will need to be updated for each recording, and more space will be used.
There are two typical scenarios where H2Histogram is deployed. The first one is to check if there is any SLO violation, such as p99 latency. In this case, the percentile of interest is very close to the highest end, so a simple global count and backward traversal can greatly reduce the number of buckets visited. The other one is to create a snapshot of value distribution by reporting several pre-defined percentiles at once, such as p25, p50, p75, p90, p95, p99… In this case, it is probably the most efficient to create APIs that allow multiple quantiles to be reported in a single sweeping trip through all the buckets.
Appendix
Counter Type
Recording counts as integers provides precise readings within the counter range, which means it is important to use enough bits to cover the accumulated counts during the recording period. This generally means 8-bit, 16-bit, 32-bit, or 64-bit unsigned integers.
Recording counts as floating point numbers increases the dynamic range of individual buckets at the cost of precision. Since H2Histogram has the concept of precision as a configurable option, it is possible to consider the use of floating point types in conjunction with overall precision.
Below is a table of the range and precision limits for different data types when used to record counts:
| Type | Range Limit | Relative Error |
|---|---|---|
| 8-bit integer | — (precise) | |
| 16-bit integer | — (precise) | |
| 32-bit integer | — (precise) | |
| 64-bit integer | — (precise) | |
| 16-bit (half) float | ||
| 32-bit (single) float | ||
| 64-bit (double) float |
From the above table, it should be obvious that if considering using 16-bit numbers or less, integer types are strictly superior; for 32- and 64-bit numbers, the choice should be based on the potential range of the highest single-bucket count.
Truncated Histogram
Often, the range of plausible values has both an upper and a lower bound. In the current H2Histogram design, expressing an upper bound is straightforward — just provide the smallest that fits the value. But we haven’t discussed lower bound.
In fact, having a lower bound could potentially meaningfully reduce the amount of memory needed to store a histogram, because smaller values are stored in finer buckets. For example, for and , if we know the lower bound of values is 1024, we can eliminate 768 out of 6400 buckets, a 12% reduction.
More precisely, if we know the lower bound is at least , we can eliminate buckets (if ) or buckets (if ).
How common do values have lower bounds? In high-level time measurements in particular, this condition is extremely common. For example, if we use a clock with nanosecond precision to track common software operations, the lower bounds are significantly higher than the resolution of the measurement. Many syscalls are measured in microseconds, request/response time is often measured in milliseconds. This means we can expect to be in the range of 10 to 20 in a lot of real-life use cases.
Use the following calculator to see how much space a truncated histogram can save you:
- Number of buckets (original histogram):
- Number of buckets (truncated histogram):
- Buckets saved: (%)