Python heapify() time complexity

typing... picture typing... · Aug 7, 2018 · Viewed 12.4k times · Source
def heapify(A):
    for root in xrange(len(A)//2-1, -1, -1):
        rootVal = A[root]
        child = 2*root+1
        while child < len(A):
            if child+1 < len(A) and A[child] > A[child+1]:
                child += 1
            if rootVal <= A[child]:
                break
            A[child], A[(child-1)//2] = A[(child-1)//2], A[child]
            child = child *2 + 1

This is a similar implementation of python heapq.heapify(). It is said in the doc this function runs in O(n). But it looks like for n/2 elements, it does log(n) operations. Why is it O(n)?

Answer

Tim Peters picture Tim Peters · Aug 7, 2018

It requires more careful analysis, such as you'll find here. The basic insight is that only the root of the heap actually has depth log2(len(a)). Down at the nodes one above a leaf - where half the nodes live - a leaf is hit on the first inner-loop iteration.

"Exact" derivation

Waving hands some, when the algorithm is looking at a node at the root of a subtree with N elements, there are about N/2 elements in each subtree, and then it takes work proportional to log(N) to merge the root and those sub-heaps into a single heap. So the total time T(N) required is about

T(N) = 2*T(N/2) + O(log(N))

That's an uncommon recurrence. The Akra–Bazzi method can be used to deduce that it's O(N), though.

I think more informative, and certainly more satifsying, is to derive an exact solution from scratch. Toward that end, I'll only talk about complete binary trees: as full as possible on every level. Then there 2**N - 1 elements in total, and all subtrees are also complete binary trees. This sidesteps mounds of pointless details about how to proceed when things aren't exactly balanced.

When we're looking at a subtree with 2**k - 1 elements, its two subtrees have exactly 2**(k-1) - 1 elements each, and there are k levels. For example, for a tree with 7 elements, there's 1 element at the root, 2 elements on the second level, and 4 on the third. After the subtrees are heapified, the root has to moved into place, moving it down 0, 1, or 2 levels. This requires doing comparisons between levels 0 and 1, and possibly also between levels 1 and 2 (if the root needs to move down), but no more that that: the work required is proportional to k-1. In all, then,

T(2**k - 1) = 2 * T(2**(k-1) - 1) + (k - 1)*C

for some constant C bounding the worst case for comparing elements at a pair of adjacent levels.

What about T(1)? That's free! A tree with only 1 element is a already a heap - there's nothing to do.

T(1) = 0

One level above those leaves, trees have 3 elements. It costs (no more than) C to move the smallest (for a min-heap; largest for a max-heap) to the top.

T(3) = C

One level above that trees have 7 elements. It costs T(3) to heapify each of the subtrees, and then no more than 2*C to move the root into place:

T(7) = 2*C + 2*C = 4*C

Continuing in the same way:

T(15) = 2* 4*C + 3*C = 11*C
T(31) = 2*11*C + 4*C = 26*C
T(63) = 2*26*C + 5*C = 57*C
...
T(2**k - 1) = (2**k - k - 1)*C

where the last line is a guess at the general form. You can verify that "it works" for all the specific lines before it, and then it's straightforward to prove it by induction.

So, where N = 2**k - 1,

T(N) = (N - log2(N+1)) * C

which shows that T(N) is bounded above by C*N, so is certainly O(N).