iopsystems/histogram

H2Histogram: Base-2 Redesign of HdrHistogram for Unbeatable Performance

The H2Histogram provides a histogram that is conceptually similar to HdrHistogram, with modifications to the bucket construction algorithm and configurable options.

Background and Assumptions

Background

Recording and reporting quantile metrics is a very common task, and highly valuable in characterizing workload and performance. Usually, system input/output and behavior have a wide but bounded range of acceptable values, and tracking the distribution within this range at a certain precision provides rich insights into the workload. In high-throughput scenarios, HdrHistogram is highly popular due to its low recording overhead. However, HdrHistogram is a base-10 design. We believe a base-2 variant works even better in terms of speed, simplicity, and configurability.

Assumptions

Below are our core assumptions:

  1. Bucket widths are powers of 2, and monotonically increasing for larger values
  2. We want to accept any natural number up to the maximum value allowed
  3. Using relative error as a constraint, the goal is to primarily maximize recording throughput and secondarily minimize memory space

Goals

H2Histogram aims to provide a simpler implementation, finer-grain configuration, and more efficient runtime compared to the reference implementation provided by HdrHistogram.

Basic Parameters

Maximum value in histogram:

This is the largest value the histogram can track, parameterized by the exponent .

For , = “”.

Relative error:

This is the upper bound on the relative error created by value quantization, parameterized by the exponent (of its inverse), .

For , .

Note: the concept of relative error does not apply when the exact integer value is recorded.

Bucket Construction

We use a bucket width of just 1 for small integers. As the absolute value goes up, we can double the bucket width without violating the relative error constraint, .

The smallest value that allows bucket width to go from () to () is (or ). For , that value is . This means there are precise buckets.

Bucket width can be doubled from () to () for values greater than or equal to (or ). For , that value is . This also means there are buckets of width 2.

So on and so forth, until the threshold value reaches the maximum value . At that point, there will be groups of buckets of doubling widths. For and , the number of groups is .

Note: When , the histogram will have the exact same number of buckets as there are values. Notions of absolute and relative errors do not apply in such a histogram.

Bucket Summary

  • Total number of buckets: . For and , that is **** buckets.
  • Histogram size: number of buckets multiplied by counter size. For and , that is ** KiB** (64-bit counters) or ** KiB** (32-bit counters).

The table below gives a sense of the total number of buckets and histogram size for various values, assuming we want to record values up to the maximum unsigned 64-bit integer.

p relative error #buckets size (8-bit) size (16-bit) size (32-bit) size (64-bit)
2 25.0% 252 0.2 KiB 0.5 KiB 1.0 KiB 2.0 KiB
3 12.5% 496 0.5 KiB 1.0 KiB 1.9 KiB 3.9 KiB
4 6.25% 976 1.0 KiB 1.9 KiB 3.8 KiB 7.6 KiB
5 3.13% 1920 1.9 KiB 3.8 KiB 7.5 KiB 15.0 KiB
6 1.56% 3776 3.7 KiB 7.4 KiB 14.8 KiB 29.5 KiB
7 0.781% 7424 7.3 KiB 14.5 KiB 29.0 KiB 58.0 KiB
8 0.391% 14592 14.3 KiB 28.5 KiB 57.0 KiB 114.0 KiB
9 0.195% 28672 28.0 KiB 56.0 KiB 112.0 KiB 224.0 KiB
10 0.0977% 56320 55.0 KiB 110.0 KiB 220.0 KiB 440.0 KiB
11 0.0488% 110592 108.0 KiB 216.0 KiB 432.0 KiB 864.0 KiB
12 0.0244% 217088 212.0 KiB 424.0 KiB 848.0 KiB 1.7 MiB
13 0.0122% 425984 416.0 KiB 832.0 KiB 1.6 MiB 3.3 MiB
14 0.00610% 835584 816.0 KiB 1.6 MiB 3.2 MiB 6.4 MiB

If input values are 32-bit integers, which is also common, the histograms are much smaller.

p relative error #buckets size (8-bit) size (16-bit) size (32-bit) size (64-bit)
2 25.0% 124 0.1 KiB 0.2 KiB 0.5 KiB 1.0 KiB
3 12.5% 240 0.2 KiB 0.5 KiB 0.9 KiB 1.9 KiB
4 6.25% 464 0.5 KiB 0.9 KiB 1.8 KiB 3.6 KiB
5 3.13% 896 0.9 KiB 1.8 KiB 3.5 KiB 7.0 KiB
6 1.56% 1728 1.7 KiB 3.4 KiB 6.8 KiB 13.5 KiB
7 0.781% 3328 3.3 KiB 6.5 KiB 13.0 KiB 26.0 KiB
8 0.391% 6400 6.3 KiB 12.5 KiB 25.0 KiB 50.0 KiB
9 0.195% 12288 12.0 KiB 24.0 KiB 48.0 KiB 96.0 KiB
10 0.0977% 23552 23.0 KiB 46.0 KiB 92.0 KiB 184.0 KiB
11 0.0488% 45056 44.0 KiB 88.0 KiB 176.0 KiB 352.0 KiB
12 0.0244% 86016 84.0 KiB 168.0 KiB 336.0 KiB 672.0 KiB
13 0.0122% 163840 160.0 KiB 320.0 KiB 640.0 KiB 1.3 MiB
14 0.00610% 311296 304.0 KiB 608.0 KiB 1.2 MiB 2.4 MiB

Bucket Layout

Bucket Lookup

Bucket lookup is the primary operation for recording a value. The following algorithm only applies to positive integers (i.e. zero values are not allowed). Support for the zero value is trivially added as a first step of the lookup implementation.

We represent the input value as an -digit binary:

Looking up the right bucket offset for works as follows:

  1. Find the highest non-zero digit . () is such that but for all that satisfy .
  2. Calculate the bucket offset. If , maps to a precise bucket, so we simply have ; otherwise, first compute , and we can compute as .

Explanation

In the above algorithm, the highest non-zero digit tells us the width of the bucket that should record the input value . For the bucket width is ; for the bucket width is ; etc. For any positive , the total number of buckets with a width less than is .

There are buckets of width , with a lower bound as the value. To find out which bucket in this group is the right one for , we compute the distance between and the lower bound, and divide that by the bucket width of the current group. Since all bounds are powers of 2, the division is converted to a bit shift.

Optimization

On CPUs supporting SSE4, can be most efficiently computed by using the LZCNT instruction.

Below are a few lookup examples assuming :

  • : ,

  • : ,

  • : , ,

  • : , ,

  • : , ,

Quantile Lookup

In a naïve implementation, reporting a particular quantile requires traversing all the buckets once. Even if the total number of values in the histogram is known, the number of buckets to traverse is still linear to the total number of buckets.

There are a couple of things to consider during implementation regarding bias in reporting. Because each bucket covers a range except for the precise ones, a decision needs to be made about what value in that range to report when we do a quantile lookup. We also need to consider how to interpret results that don’t fall neatly in a single bucket — for example, when there are 3 values in the histogram and the quantile in question is . For the latter, we can use the nearest-rank method to find the nearest bucket with records. Alternatively, it is also possible to extrapolate an in-between value based on existing data. Each method has its drawbacks. For example, linear interpolation could lead to a confusing situation where the interpreted value is deemed impossible given how values are measured.

Optimization

To reduce the number of buckets traversed during lookup, one can store the total number of counts, , across all buckets, and return when the buckets traversed so far yield a cumulative count greater than (if traversing from the lowest bucket) or (if traversing from the highest bucket). Further reduction can be achieved by using some type of “sketch” that stores cumulative values across multiple buckets, which allows the cursor to jump over many buckets at a time. The tradeoff is that multiple values will need to be updated for each recording, and more space will be used.

There are two typical scenarios where H2Histogram is deployed. The first one is to check if there is any SLO violation, such as p99 latency. In this case, the percentile of interest is very close to the highest end, so a simple global count and backward traversal can greatly reduce the number of buckets visited. The other one is to create a snapshot of value distribution by reporting several pre-defined percentiles at once, such as p25, p50, p75, p90, p95, p99… In this case, it is probably the most efficient to create APIs that allow multiple quantiles to be reported in a single sweeping trip through all the buckets.

Appendix

Counter Type

Recording counts as integers provides precise readings within the counter range, which means it is important to use enough bits to cover the accumulated counts during the recording period. This generally means 8-bit, 16-bit, 32-bit, or 64-bit unsigned integers.

Recording counts as floating point numbers increases the dynamic range of individual buckets at the cost of precision. Since H2Histogram has the concept of precision as a configurable option, it is possible to consider the use of floating point types in conjunction with overall precision.

Below is a table of the range and precision limits for different data types when used to record counts:

Type Range Limit Relative Error
8-bit integer — (precise)
16-bit integer — (precise)
32-bit integer — (precise)
64-bit integer — (precise)
16-bit (half) float
32-bit (single) float
64-bit (double) float

From the above table, it should be obvious that if considering using 16-bit numbers or less, integer types are strictly superior; for 32- and 64-bit numbers, the choice should be based on the potential range of the highest single-bucket count.

Truncated Histogram

Often, the range of plausible values has both an upper and a lower bound. In the current H2Histogram design, expressing an upper bound is straightforward — just provide the smallest that fits the value. But we haven’t discussed lower bound.

In fact, having a lower bound could potentially meaningfully reduce the amount of memory needed to store a histogram, because smaller values are stored in finer buckets. For example, for and , if we know the lower bound of values is 1024, we can eliminate 768 out of 6400 buckets, a 12% reduction.

More precisely, if we know the lower bound is at least , we can eliminate buckets (if ) or buckets (if ).

How common do values have lower bounds? In high-level time measurements in particular, this condition is extremely common. For example, if we use a clock with nanosecond precision to track common software operations, the lower bounds are significantly higher than the resolution of the measurement. Many syscalls are measured in microseconds, request/response time is often measured in milliseconds. This means we can expect to be in the range of 10 to 20 in a lot of real-life use cases.

Use the following calculator to see how much space a truncated histogram can save you:

  • Number of buckets (original histogram):
  • Number of buckets (truncated histogram):
  • Buckets saved: (%)