The obvious application (and the original motivation) of heaps is heapsort: heapify the array in \(O(n)\), then pop the smallest element \(n\) times, each pop taking \(O(\log n)\). Total time: \(O(n \log n)\). Note that even if you don’t know heapify and replace it with \(n\) pushes, it doesn’t change the overall complexity (\(n\) pushes also take \(O(n\log n)\)).
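Here is a minimal sketch using Python’s heapq module (the function name heapsort is ours):

import heapq

def heapsort(a):
    h = list(a)             # work on a copy
    heapq.heapify(h)        # bottom-up heapify: O(n)
    # n pops, each O(log n): O(n log n) total
    return [heapq.heappop(h) for _ in range(len(h))]

>>> heapsort([4, 1, 5, 3])
[1, 3, 4, 5]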
However, heaps can also be used in many other much more interesting scenarios in algorithm design. Here we showcase some classical examples.
We all know that mergesort uses binary division. But what if we divide the array \(k\) ways, recursively mergesort the \(k\) subarrays, and finally combine the \(k\) sorted sublists? This is known as \(k\)-way mergesort, a very interesting generalization of classical mergesort, which is the special case of \(k=2\).
Again, as in classical mergesort, combine is where most of the work lies. OK, let’s generalize the “two-pointers” idea to “\(k\)-pointers”. But while comparing two numbers is trivial, how do you compare \(k\) numbers and take the smallest one? If you take \(O(k)\) time at every step, the combine would cost \(O(nk)\), which is too slow. Notice that between steps most numbers remain unchanged (only the best number from the previous round is replaced by its successor in that sublist), so you would waste much time on repeated comparisons.
So we use a heap instead! Build a heap of the \(k\) frontier numbers (the current head of each sublist) in \(O(k)\) time; then in each step, pop the smallest and push its successor from the same sublist (or use heapreplace), which costs \(O(\log k)\). So the total time for the combine step is \(O(k + n\log k)=O(n\log k)\) because \(k \le n\).
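Here is a minimal sketch of the combine step using a heap (merge_k is our name; Python’s standard library provides the same functionality as heapq.merge):

import heapq

def merge_k(lists):
    # heap entries: (current head value, which list, index within that list)
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)                              # O(k)
    result = []
    while heap:
        val, i, j = heapq.heappop(heap)              # smallest head: O(log k)
        result.append(val)
        if j + 1 < len(lists[i]):                    # push its successor, if any
            heapq.heappush(heap, (lists[i][j+1], i, j + 1))
    return result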
Caveat: Notice that if \(k \geq n\), our \(k\)-way mergesort becomes heapsort! So the two extreme special cases of \(k\)-way mergesort are classical mergesort (\(k=2\)) and heapsort (\(k=n\)).
Now divide+combine (the non-recursive parts, i.e., the work at each level) takes \(O(k+n\log k)=O(n\log k)\). We can then derive the overall time from the recurrence:
\[ T(n) = k T(n/k) + O(n\log k)\]
We’ll still use the “recursion tree” method to expand the recurrence: the tree has \(\log_k n\) levels, and each level does \(O(n\log k)\) work in total, so

\[ T(n) = O\left(n\log k \cdot \log_k n\right) = O\left(n\log k \cdot \frac{\log n}{\log k}\right) = O(n\log n). \]

This is a remarkable result: the runtime of \(k\)-way mergesort does not depend on \(k\)!
Alternative method: Using the Master Method (treating \(k\) as a constant, so that \(f(n)=O(n\log k)=O(n)\) matches \(n^{\log_k k}=n\)), the above recurrence falls into case 2, so \(T(n)=O(n\log n)\).
Another problem similar to \(k\)-way mergesort is the team selection problem: the United States has \(n=50\) states, and each state has selected its (sorted) top \(k\) tennis players. Now we need to select the top \(k\) players overall to form team USA (for the Olympics). How would you do that as fast as possible?
Just as in \(k\)-way mergesort, you build an initial heap of size \(n\) from the best players of each state. Then you repeatedly pop the best remaining player and push that player’s successor from the same state (or use heapreplace), until you have popped \(k\) players.
Time: \(O(n + k\log n)\), because the heap size is bounded by \(n\).
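A sketch of this method, assuming each state is given as a list of its players sorted best-first and that smaller numbers mean better players (team_usa is our name):

import heapq

def team_usa(states, k):
    # initial heap: the best player of each state, O(n) via heapify
    heap = [(players[0], i, 0) for i, players in enumerate(states) if players]
    heapq.heapify(heap)
    team = []
    while heap and len(team) < k:
        best, i, j = heap[0]
        team.append(best)
        if j + 1 < len(states[i]):
            # replace the popped player by that state’s next best: O(log n)
            heapq.heapreplace(heap, (states[i][j+1], i, j + 1))
        else:
            heapq.heappop(heap)                  # this state is exhausted
    return team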
Can you make it even faster? Well, if \(k \ll n\), a key observation is that the vast majority of states will have no representatives on team USA. If a state’s best player can’t make the top \(k\) in the initial heap of size \(n\), then no player from that state has a chance at team USA. For example, if \(k=5\), team USA will likely have (even multiple) players from big states like California and New York, and nobody from most other states. This observation suggests narrowing the initial heap down to just the top \(k\) (best among the best) of the \(n\) state leaders (each state’s best player). So we use quickselect to find the \(k\)th best leader among those \(n\) leaders, and then scan all \(n\) leaders again to keep only the top \(k\). Now you build an initial heap of just \(k\) players, and because the heap size is bounded by \(k\), you improve the total time to:
\[ O(n + k + k\log k)=O(n+k\log k)\]
which is slightly faster than \(O(n+k\log n)\).
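A sketch of this speedup, reusing team_usa from above together with a minimal randomized quickselect (both helper names are ours):

import random

def qselect(a, k):
    # kth smallest (1-indexed) element of a; expected O(len(a)) time
    pivot = random.choice(a)
    left = [x for x in a if x < pivot]
    right = [x for x in a if x > pivot]
    if k <= len(left):
        return qselect(left, k)
    if k > len(a) - len(right):
        return qselect(right, k - (len(a) - len(right)))
    return pivot

def team_usa_fast(states, k):
    leaders = [players[0] for players in states]           # best of each state: O(n)
    threshold = qselect(leaders, k)                        # kth best leader: O(n) expected
    # a state whose leader is worse than the kth best leader can’t contribute
    survivors = [p for p in states if p[0] <= threshold]   # about k states
    return team_usa(survivors, k)                          # heap of size ~k: O(k log k)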
A slightly more involved problem is the \(n\)-best pairs problem. Given two unsorted lists \(A\) and \(B\), each with \(n\) integers, their cross-product (or Cartesian product) consists of \(n^2\) pairs:
\[ A\times B = \{ (x, y) \mid x \in A, y \in B \} \]
How to select the \(n\) smallest pairs from \(A\times B\)? Let’s say we compare pairs by their sums:
\[ (x,y) < (x',y') \iff x+y < x'+y' \ \text{ or } \ (x+y = x'+y' \text{ and } y<y') \]
i.e., between two pairs, the one with the smaller sum is considered smaller, and in the case of a tie, the pair with the smaller second dimension wins (actually you can define this relation arbitrarily, as long as it’s monotonic). For example:
>>> a, b = [4, 1, 5, 3], [2, 6, 3, 4]
>>> nbest(a, b)
[(1, 2), (1, 3), (3, 2), (1, 4)]
Let’s start with the most obvious idea, and gradually improve it.

The naive method is to enumerate all \(n^2\) pairs and sort them, taking the top \(n\): \(O(n^2\log n^2)=O(n^2\log n)\). Can quickselect do better? (Let’s say AB is the array of \(n^2\) pairs.) First qselect(AB, 1) for the smallest pair, then qselect(AB, 2) for the 2nd smallest pair, all the way to qselect(AB, n) for the \(n\)th smallest pair. That would be too slow (in fact, even worse than naive: \(O(n^2 + n \cdot n^2)=O(n^3)\)). Actually, you only need to call quickselect once: just use qselect(AB, n) to establish the threshold, and then scan the whole array again to output every pair at or below that threshold. Total time: \(O(n^2 + n^2 + n^2)=O(n^2)\), which is slightly better than naive.
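A sketch of this one-quickselect method, reusing the qselect helper from the team-selection example above (nbest_threshold is our name):

def nbest_threshold(a, b):
    n = len(a)
    key = lambda x, y: (x + y, y)                   # the pair order defined above
    keys = [key(x, y) for x in a for y in b]        # all n^2 keys: O(n^2)
    cutoff = qselect(keys, n)                       # the nth smallest pair’s key
    winners = [(x, y) for x in a for y in b if key(x, y) <= cutoff]
    # sort just the ~n winners (and trim ties): O(n log n), dominated by the scans
    return sorted(winners, key=lambda p: key(*p))[:n]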
Can we do better? Yes, and the key idea is to sort \(a\) and \(b\) first. Once they’re sorted, obviously \((a_0, b_0)\) is the smallest pair. But who’s the second best? Well, it must be either \((a_0, b_1)\) or \((a_1, b_0)\). Let’s say \((a_0, b_1)\) is the second best, which gets popped. Then we would push its own successors, \((a_0, b_2)\) and \((a_1, b_1)\). So in general, once we pop \((a_i, b_j)\), we need to push two successors, \((a_i, b_{j+1})\) (if \(j+1<n\)) and \((a_{i+1}, b_j)\) (if \(i+1<n\)). We use a heap to store the candidates for the next best (the frontier of expansion), which starts with only one pair. The size of the heap is bounded by \(n\): each step pops one and pushes at most two, so the size increases by at most 1 per step, and there are only \(n\) steps. Therefore, the total time is \(O(n\log n + n\log n + n\log n) = O(n\log n)\), where the first two terms are for sorting and the last one is for the \(n\) heappops/heappushes.

Caveat: if a successor is already in the heap, don’t push it twice. This means you need a hash-based data structure, such as a Python set, to check in \(O(1)\) time whether a pair has already been pushed.
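Here is a sketch of this best-first method; the heap is keyed on (sum, second dimension) to match the pair order defined above:

import heapq

def nbest(a, b):
    a, b = sorted(a), sorted(b)                  # O(n log n) each
    n = len(a)
    heap = [(a[0] + b[0], b[0], 0, 0)]           # frontier starts at the corner
    seen = {(0, 0)}                              # never push the same cell twice
    result = []
    while len(result) < n:
        _, _, i, j = heapq.heappop(heap)         # next best pair: O(log n)
        result.append((a[i], b[j]))
        for i2, j2 in ((i, j + 1), (i + 1, j)):  # its two successors
            if i2 < n and j2 < n and (i2, j2) not in seen:
                seen.add((i2, j2))
                heapq.heappush(heap, (a[i2] + b[j2], b[j2], i2, j2))
    return result

This reproduces the expected output on the example above:

>>> nbest([4, 1, 5, 3], [2, 6, 3, 4])
[(1, 2), (1, 3), (3, 2), (1, 4)]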
Here is a picture: imagine the \(n\times n\) grid of pairs as a flooding zone where the water level keeps rising. Initially, water covers only the top-left corner (the lowest area), and it gradually covers more and more cells. The cells covered in water have already been popped from the heap, and the “waterfront”, i.e., the frontier of expansion, is the current heap, which marks the boundary between those already popped and those never pushed (the dry area). In the end, you can see that among the \(n^2\) cells, most are never explored (not even computed), i.e., they remain in the dry area; only \(n\) are popped, i.e., submerged in water, and at most \(n\) are on the frontier. That’s why this algorithm is so efficient.
Alternative method: Instead of starting with just the top-left corner \((a_0, b_0)\), you can also start with the whole first column \(\{(a_0, b_0), (a_1, b_0), \ldots, (a_{n-1}, b_0)\}\), and then you just need to pop/push (or heapreplace) instead of popping one and pushing two. Note that this method is much more similar to team selection (each \(a_i\) is a “state”, with its sorted best players being \((a_i, b_0), (a_i, b_1), \ldots\)). In this case, \(a\) does not need to be sorted (but \(b\) must be sorted; or vice versa if you start with the first row). The other small advantage is that you don’t need to maintain a set to check whether a pair has already been pushed. Total time: \(O(n\log n + n + n\log n) = O(n\log n)\); the first term is for sorting \(b\), the second is for heapify, and the third is for the \(n\) heappops. Same runtime, just not as pretty (or symmetric) as the above method, but it may be a bit easier to implement.
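A sketch of this alternative (nbest_column is our name); note that only b is sorted, and no seen set is needed:

import heapq

def nbest_column(a, b):
    b = sorted(b)                                # only b needs to be sorted
    n = len(a)
    # initial frontier: the whole first column, one entry per a[i]
    heap = [(x + b[0], b[0], i, 0) for i, x in enumerate(a)]
    heapq.heapify(heap)                          # O(n)
    result = []
    for _ in range(n):
        _, _, i, j = heap[0]
        result.append((a[i], b[j]))
        if j + 1 < n:
            heapq.heapreplace(heap, (a[i] + b[j+1], b[j+1], i, j + 1))
        else:
            heapq.heappop(heap)                  # this row is exhausted
    return result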
The \(n\)-best pairs problem is taken from my \(k\)-best parsing paper (Huang and Chiang, 2005).