getting the average, p95 and p99 of a stream of data

jamesatha · May 9, 2013 · Viewed 9.2k times

I have incoming data and I want to compute the average and the 95th and 99th percentiles of that data. I am most interested in the last 1000 values. I'd like to be able to query this object for any of the three values at any time, not just when the number of values seen is a multiple of 1000. Is there a way to get these three values without keeping the last 1000 samples?

This doesn't have to be perfect, so we can use some tricks to get a good estimate. Speed is also a concern. Thanks

(I will be doing this in C++ but I don't think that matters all that much)

Answer

Zim-Zam O'Pootertoot · May 9, 2013

At a minimum, you'll need to maintain a queue of the most recent 1000 elements.

To keep a running average, maintain a running total of the most recent 1000 elements: when you add a new element to the queue, add its value to the total and subtract the value of the oldest element that you've just removed from the queue. Return the total divided by 1000 (or by the number of elements in the queue, if fewer than 1000 have arrived so far) and there you go.
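The running total described above can be sketched in C++ roughly as follows. The class name `RunningAverage` and its interface are made up for illustration, and a `std::deque` stands in for the queue; treat it as a minimal sketch rather than a tuned implementation.

```cpp
#include <cstddef>
#include <deque>

// Sliding-window mean over the most recent `capacity` samples.
// (Hypothetical class for illustration; not part of any library.)
class RunningAverage {
public:
    explicit RunningAverage(std::size_t capacity) : capacity_(capacity) {}

    // Add a sample; once the window is full, evict the oldest one
    // and subtract it from the running total.
    void add(double x) {
        window_.push_back(x);
        total_ += x;
        if (window_.size() > capacity_) {
            total_ -= window_.front();
            window_.pop_front();
        }
    }

    // Mean of the samples currently in the window (0 if none yet).
    double mean() const {
        if (window_.empty()) return 0.0;
        return total_ / static_cast<double>(window_.size());
    }

private:
    std::size_t capacity_;
    std::deque<double> window_;
    double total_ = 0.0;
};
```

Both `add` and `mean` run in constant time. With floating-point samples, the repeated add/subtract can accumulate rounding error over a very long stream, so recomputing the total from the window every so often may be worthwhile.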

To keep a running Nth percentile, maintain two heaps and keep a count of the elements in each. The "lower" heap (a max-heap) holds the lowest N% of the values, and the "upper" heap (a min-heap) holds the highest (100-N)%. For example, for the 95th percentile over 1000 values, the lower heap will have 950 elements and the upper heap will have 50. At any point you can return the lowest element of the upper heap, and that's your percentile. When you remove an element from the queue of recent values, remove that value from the heaps as well (this needs a heap that supports removing an arbitrary element, or an ordered multiset used in its place). If this leaves the heaps unbalanced (e.g. the lower heap has 951 elements and the upper heap has 49), shift elements to balance them out (e.g. remove the top element from the lower heap and add it to the upper heap).

Since you want two percentiles, use three heaps - the lower heap has the lower 950 elements, the middle has the next 40, and the upper has the highest 10. Return the lowest element of the middle heap for the 95th percentile, and the lowest element of the upper heap for the 99th percentile.

Adding and removing heap elements is O(lg(n)), so that is the cost of adding a new element to the queue and three heaps: remove the oldest queue element from the heaps (O(lg(n))), add the new queue element to the appropriate heap (O(lg(n))), and balance the heaps if need be (again, O(lg(n))). Add the new element to the lowest heap whose highest element is greater than the new element, i.e.

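// Place the new element in the lowest heap whose maximum exceeds it,
// then rebalance if the heap sizes have drifted from 950/40/10.
if (newElement < lowestHeap.maxElement) {
    lowestHeap.add(newElement)
} else if (newElement < middleHeap.maxElement) {
    middleHeap.add(newElement)
} else {
    highestHeap.add(newElement)
}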
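As a rough C++ sketch of the three-heap scheme, the class below substitutes `std::multiset` for the heaps, because `std::priority_queue` cannot remove an arbitrary element. A multiset still gives O(lg(n)) insertion and removal, exposes both its smallest and largest element, and allows duplicates. The class name `WindowPercentiles`, the helper names, and the integer formulas for the target sizes (which give 950/40/10 at 1000 samples) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <deque>
#include <iterator>
#include <set>

// Sliding-window p95/p99 using three ordered multisets in place of heaps.
// Invariant: every value in low_ <= every value in mid_ <= every value in high_.
// (Hypothetical class for illustration; not part of any library.)
class WindowPercentiles {
public:
    explicit WindowPercentiles(std::size_t capacity) : capacity_(capacity) {}

    void add(double x) {
        window_.push_back(x);
        insert(x);
        if (window_.size() > capacity_) {   // evict the oldest sample
            eraseValue(window_.front());
            window_.pop_front();
        }
        rebalance();
    }

    // Precondition for both queries: at least one sample is in the window.
    double p95() const { return mid_.empty() ? *high_.begin() : *mid_.begin(); }
    double p99() const { return *high_.begin(); }

private:
    using Set = std::multiset<double>;

    // Lowest set whose largest element is greater than x, as in the pseudocode.
    void insert(double x) {
        if (!low_.empty() && x < *low_.rbegin()) low_.insert(x);
        else if (!mid_.empty() && x < *mid_.rbegin()) mid_.insert(x);
        else high_.insert(x);
    }

    // Remove exactly one copy of x from s, if present.
    static bool eraseOne(Set& s, double x) {
        auto it = s.find(x);
        if (it == s.end()) return false;
        s.erase(it);
        return true;
    }
    void eraseValue(double x) {
        if (!eraseOne(low_, x) && !eraseOne(mid_, x)) eraseOne(high_, x);
    }

    static void moveLargest(Set& from, Set& to) {
        auto it = std::prev(from.end());
        to.insert(*it);
        from.erase(it);
    }
    static void moveSmallest(Set& from, Set& to) {
        auto it = from.begin();
        to.insert(*it);
        from.erase(it);
    }

    // Restore the target sizes by moving boundary elements between neighbours.
    void rebalance() {
        const std::size_t n = window_.size();
        const std::size_t lowTarget = n * 95 / 100;              // 950 when n == 1000
        const std::size_t midTarget = n * 99 / 100 - lowTarget;  // 40 when n == 1000
        while (low_.size() > lowTarget) moveLargest(low_, mid_);
        while (low_.size() < lowTarget) {
            if (mid_.empty()) moveSmallest(high_, mid_);
            moveSmallest(mid_, low_);
        }
        while (mid_.size() > midTarget) moveLargest(mid_, high_);
        while (mid_.size() < midTarget) moveSmallest(high_, mid_);
    }

    std::size_t capacity_;
    std::deque<double> window_;
    Set low_, mid_, high_;
};
```

With a full window of 1000 samples, `p95()` returns the 951st smallest value and `p99()` the 991st smallest, matching the "lowest element of the middle/upper heap" rule. While the window is still filling, the middle set can be empty, in which case this sketch falls back to the upper set for `p95()`.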

Be sure that your heaps allow duplicate elements.
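One related pitfall in C++, assuming a `std::multiset` is used as the duplicate-friendly container: `erase(value)` removes every element equal to `value`, whereas erasing through an iterator removes just one. When a single sample leaves the window, only one copy should go. The helper name `eraseOneCopy` below is hypothetical.

```cpp
#include <set>

// Removes a single copy of x from s; returns false if x is not present.
bool eraseOneCopy(std::multiset<double>& s, double x) {
    auto it = s.find(x);
    if (it == s.end()) return false;
    s.erase(it);  // iterator overload: removes exactly this one element
    return true;
}
```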