Aggregating Measures of Uncertainty

There are many situations where you want to aggregate values. However, if those values are on different scales or are related to measures of uncertainty, aggregating them is typically more complicated than taking a simple mean or sum. You can’t take the average of p-values, standard deviations, or statistical tests. You also can’t sum confidence (or prediction) intervals to get an interval for the aggregation of the parts. As with combining items that are on different scales, you can’t simply aggregate items that in some way relate to measures of uncertainty[1].

As a practical example, say your firm sells a variety of products. Each product has a different price, and contracts are negotiated deal by deal, so the final price has some level of variability that may differ across products. Sales teams are responsible for securing the final price for any sale, but each team has a different product mix. You want a measure for evaluating sales teams that identifies which are commanding higher or lower prices relative to their peers. Reliably characterizing a team’s performance as ‘good’ or ‘bad’ requires an understanding of both the value and the variability of their specific product mix. You can’t evaluate their sales numbers in isolation: if their product mix is composed of items with high variability in price, what at first glance looks like good or bad performance may have more to do with luck, that is, with the greater variability in outcomes specific to their products. What you want is to condense all of that information into a measure of how the sales team did that takes into account both the expected sales and the amount of variability in their particular portfolio.

Another common example is forecasting. Perhaps you are producing forecasts for products at a county, state, and national level. In addition to point estimates, though, you are producing ranges (i.e. prediction intervals) for these forecasts. Ideally, you want the prediction intervals at each level to be consistent with one another in some sensible way. However, aggregating lower and upper bounds from the lowest-level forecasts up to higher levels requires more than just taking a sum[2].

In this post I’ll describe in broad terms a few general types of approaches an analyst may take when faced with the problem of aggregating measures that rely on or reflect uncertainty, or that are on different scales (in future posts I may delve into the details of each approach). These “type” distinctions overlap; they reflect distinctions I found convenient for articulation rather than concrete separations.

Analytic approach

If each of the parts you want to aggregate follows a well-defined parametric distribution (or close to one), you may be able to aggregate the measures of uncertainty analytically. Say you have average sale prices across four separate products. If the sale price of each product follows a normal distribution, there are well-established methods for working out the variance of the average sale price across products, and you can use that variance to determine an appropriate range for forecasts of aggregated sales. In the context of prediction intervals, if you use a statistical forecasting approach within the fable package (e.g. ARIMA), a distribution object is saved. In hierarchical forecasting tasks, these distribution objects can be combined analytically when producing prediction intervals on aggregated forecasts at higher levels of the hierarchy.
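
As a minimal sketch of the normal case described above, the Python snippet below compares summing each product’s own 95% bounds with combining the variances. The four means and standard deviations are made-up numbers, and the calculation assumes the products’ prices are independent; with correlated prices the covariance terms would also have to be included.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical mean sale price and standard deviation for four products
# (illustrative numbers only, not from any real data set).
means = [100.0, 250.0, 40.0, 75.0]
sds = [10.0, 30.0, 5.0, 12.0]

z = NormalDist().inv_cdf(0.975)  # two-sided 95% normal quantile

# Naive approach: add up each product's own 95% bounds.
naive_lower = sum(m - z * s for m, s in zip(means, sds))
naive_upper = sum(m + z * s for m, s in zip(means, sds))

# Analytic approach, ASSUMING independence: variances add,
# standard deviations do not.
total_mean = sum(means)
total_sd = sqrt(sum(s ** 2 for s in sds))
lower = total_mean - z * total_sd
upper = total_mean + z * total_sd

# Same logic for the average across the k products:
# Var(average) = sum of variances / k^2.
k = len(means)
avg_mean = total_mean / k
avg_sd = total_sd / k

print(f"summed bounds:     ({naive_lower:.1f}, {naive_upper:.1f})")
print(f"analytic interval: ({lower:.1f}, {upper:.1f})")
print(f"average price:     {avg_mean:.1f} with sd {avg_sd:.2f}")
```

The analytic interval for the total is narrower than the summed bounds because independent deviations partly cancel each other out.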

Transformation to a common scale

Similar to what I called the “analytic approach” is the method of transforming each item onto a common scale, at which point an aggregation can be done appropriately. This is commonly done to compare significance across different variables. For example, when applying simple filtering techniques on a prediction problem where you are investigating the relatedness of different variables with some target, you may be reviewing variables of different types (e.g. some categorical, others continuous). Evaluating each type of variable requires a different kind of statistical test. However, each of these tests results in a z-score or p-value, which gives a common scale for a general notion of relatedness with the target of interest. As mentioned previously, you can’t simply take the average of p-values; under some conditions, however, there are other ways of combining p-values to get a common metric, for example Fisher’s combined probability test. A few years ago I wrote a toy package, piececor, for investigating piecewise correlations in a tidyverse-friendly way; it used Fisher’s method (or the related Stouffer’s Z-score method) to get an overall p-value from a collection of tests for correlation[3].
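
To make the two combination methods named above concrete, here is a small Python sketch using only the standard library. It is not the piececor implementation; the function names and the example p-values are hypothetical, and both methods assume the individual tests are independent and that each p-value lies strictly between 0 and 1.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def fisher_combined(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k degrees of freedom."""
    k = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    # With an even number of degrees of freedom (2k) the chi-square upper
    # tail has a closed form: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    half = statistic / 2.0
    term, tail = 1.0, 0.0
    for i in range(k):
        tail += term
        term *= half / (i + 1)
    return statistic, exp(-half) * tail


def stouffer_combined(p_values):
    """Stouffer's Z-score method (unweighted, one-sided p-values)."""
    std_normal = NormalDist()
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    z = sum(z_scores) / sqrt(len(z_scores))
    return z, 1.0 - std_normal.cdf(z)


# Hypothetical p-values from three separate tests against the same target.
p_values = [0.04, 0.06, 0.10]
_, fisher_p = fisher_combined(p_values)
_, stouffer_p = stouffer_combined(p_values)
print(f"mean of p-values: {sum(p_values) / len(p_values):.4f}")
print(f"Fisher combined:  {fisher_p:.4f}")
print(f"Stouffer combined: {stouffer_p:.4f}")
```

Three individually borderline results combine into stronger overall evidence under either method, which a plain average of the p-values would not show.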

Simulate it

Often your data does not follow a parametric distribution or, even if it does, the math required to derive the joint distribution is very complicated. In these cases you might take repeated samples of the underlying data and generate measures of uncertainty using simulation. The approach you take is defined by your problem, and you need to ensure that the procedure of your simulation mirrors the type of uncertainty measure you are trying to estimate. Once your procedure is well defined, these methods often do a very good job of providing reliable estimates of uncertainty.
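
One possible version of such a procedure, sketched in Python: resample each component’s observed history, add the draws together, and read the interval off the simulated totals. The region names and the skewed “historical” data are invented for illustration, and drawing each region independently is itself an assumption; if the components move together, the resampling scheme would need to preserve that (for example by resampling whole time periods).

```python
import random
from statistics import quantiles

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical, skewed "historical" sales for three regions (made-up data).
history = {
    "north": [rng.lognormvariate(3.0, 0.6) for _ in range(200)],
    "south": [rng.lognormvariate(3.5, 0.4) for _ in range(200)],
    "west": [rng.lognormvariate(2.5, 0.8) for _ in range(200)],
}


def interval_90(values):
    """Empirical 5th and 95th percentiles of a sample."""
    cuts = quantiles(values, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    return cuts[0], cuts[-1]


# Naive approach: add up each region's own 90% bounds.
region_bounds = [interval_90(obs) for obs in history.values()]
naive_lower = sum(lo for lo, _ in region_bounds)
naive_upper = sum(hi for _, hi in region_bounds)

# Simulation: draw one observation per region (independently, by assumption),
# sum them, and repeat many times to build up the distribution of the total.
simulated_totals = [
    sum(rng.choice(obs) for obs in history.values()) for _ in range(5000)
]
sim_lower, sim_upper = interval_90(simulated_totals)

print(f"summed regional bounds: ({naive_lower:.1f}, {naive_upper:.1f})")
print(f"simulated total bounds: ({sim_lower:.1f}, {sim_upper:.1f})")
```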

Re-do the measure at each level

You may not care whether the uncertainty measures of the components are consistent with the uncertainty measures of the aggregated whole. In these cases, you can simply estimate the values separately. In 2020, the M5 forecasting competition required participants to provide forecasts for each level of Walmart’s sales (across geographic levels and also across products). In addition to hosting a competition on the accuracy of forecasts, Kaggle also featured a competition on the quality of prediction intervals, with participants evaluated on the quality of their intervals across levels. If you look into the notebooks of some of the top-performing participants, many did not worry about ensuring that the prediction intervals provided at each level of Walmart’s hierarchy were consistent with one another. They simply used various methods for identifying the typical quantiles at each level and used this investigation to create ranges independently. This type of approach may be combined with simulation-based approaches, where you simulate forecasts at each level and then use the distribution of the simulated errors at each level to provide a measure of uncertainty at that level.
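
A rough Python sketch of the idea of building ranges independently at each level: take the past forecast errors observed at a level and use their empirical quantiles as the interval around that level’s point forecast. This is not a reconstruction of any M5 participant’s method; the level names, error values, and point forecasts are all hypothetical.

```python
from statistics import quantiles


def interval_from_errors(point_forecast, past_errors):
    """80% interval from the 10th and 90th percentiles of past errors.

    Errors are defined here as actual minus forecast.
    """
    cuts = quantiles(past_errors, n=10)  # 9 cut points: 10%, 20%, ..., 90%
    return point_forecast + cuts[0], point_forecast + cuts[-1]


# Hypothetical past forecast errors, collected separately at each level.
past_errors = {
    "store": [-4, -2, -1, 0, 1, 1, 2, 3, 5, 6],
    "state": [-30, -22, -10, -5, 0, 4, 9, 15, 21, 35],
    "national": [-150, -90, -60, -20, 5, 10, 40, 80, 120, 170],
}
# Hypothetical point forecasts for the next period at each level.
point_forecasts = {"store": 50.0, "state": 900.0, "national": 12000.0}

# Each level gets its own interval; nothing forces the store-level ranges
# to add up to the state or national ranges.
intervals = {
    level: interval_from_errors(point_forecasts[level], errors)
    for level, errors in past_errors.items()
}
for level, (lower, upper) in intervals.items():
    print(f"{level}: ({lower:.1f}, {upper:.1f})")
```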


Aggregating measures that encompass or reflect uncertainty requires consideration of the underlying distributions and the context of the data. In future posts I may provide additional detail on common examples and approaches for each “type” of approach outlined here.

  1. Doing so typically results in an overestimation of the level of uncertainty. E.g. summing the bounds of prediction intervals will result in a band that is too wide.

  2. Likewise, going from higher-level bounds to lower levels takes more than just dividing the minimum bound based on the proportion at the lower levels.

  3. Again, these approaches are similar to the “analytic” based approaches in that they typically come with various distributional assumptions.