Besides sorting, another important task in data structures is selection, e.g., to select the \(k\)th smallest element from an unsorted array of size \(n\). Clearly, you can always sort the array first, but that takes \(O(n\log n)\) time. Can we do it faster without sorting? Intuitively, we should be able to, because selection does not necessarily require the full sorted order (I just want the \(k\)th smallest number, and nothing else!), so it should in principle be much easier than sorting. In fact, if \(k=1\) (smallest) or \(k=n\) (largest), a simple \(O(n)\) scan suffices. But what about an arbitrary \(k\)?
Hint: think about quicksort. Can you simplify quicksort a little bit to do selection?
Indeed, we can! And the resulting algorithm is conveniently called “quickselect”. The idea is very simple (to simplify our reasoning, let’s first assume that the array contains distinct numbers):
given the input array `a` (size \(n\)) and index \(k\), first partition `a` into `left` and `right` using `a[0]` as the pivot (the numbers smaller than the pivot go to `left`, and the larger ones to `right`). After this partition, we already have a crucial observation: the pivot ranks |left|+1 in `a`! (here |...| means size). We can use this fact to do a case analysis, by comparing \(k\) with |left|+1:
1. If \(k\) = |left|+1, then we're done: return the pivot, because there are exactly \(k-1\) numbers in `left` that are less than the pivot, so the pivot ranks \(k\)th smallest in this array.
2. If \(k\) < |left|+1, then the \(k\)th smallest element of this array must be in `left`; in fact it must be the \(k\)th smallest element in `left`, so do quickselect on `left` with the same \(k\).
3. If \(k\) > |left|+1, then the \(k\)th smallest element must be in `right`. But is it still the \(k\)th smallest element in `right`? No: there are already |left|+1 numbers (pivot included) that are smaller than our target number, so its rank within `right` must be k-|left|-1. So we should do quickselect on `right` with k-|left|-1.

Notice that `right` is only needed in case 3, so we can make our code a bit faster as follows (this does not matter in the standard in-place implementation):
```python
def qselect(a, k):
    pivot = a[0]  # you can add two lines to enable randomized pivot
    left = [x for x in a if x < pivot]
    remaining = k - len(left) - 1  # 1 is for pivot
    if remaining <= 0:  # cases 1-2: no need to do right!
        return pivot if remaining == 0 else qselect(left, k)
    right = [x for x in a[1:] if x >= pivot]
    return qselect(right, remaining)  # case 3
```
Example:

```
qselect [4, 1, 5, 3, 2]  k=3
  [1, 3, 2] 4 [5]      # pivot rank: |left|+1=4 > k; in left
  qselect [1, 3, 2]  k=3
    [] 1 [3, 2]        # pivot rank: |left|+1=1 < k; in right
    qselect [3, 2]  k=3-1=2: find 2nd smallest
      [2] 3 []         # pivot rank: |left|+1=2 == k: voila!
      return 3
```
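The comment in the code above says you can add two lines to enable a randomized pivot; here is a minimal sketch of what those two lines might look like (the name `qselect_random` and the in-place swap are my own choices for illustration, not prescribed by the original):

```python
import random

def qselect_random(a, k):
    # the two extra lines: swap a random element to the front,
    # so the pivot a[0] is uniformly random (note: this mutates a)
    i = random.randrange(len(a))
    a[0], a[i] = a[i], a[0]
    pivot = a[0]
    left = [x for x in a if x < pivot]
    remaining = k - len(left) - 1  # 1 is for pivot
    if remaining <= 0:  # cases 1-2: no need to do right!
        return pivot if remaining == 0 else qselect_random(left, k)
    right = [x for x in a[1:] if x >= pivot]
    return qselect_random(right, remaining)  # case 3
```

For example, `qselect_random([4, 1, 5, 3, 2], 3)` returns `3` no matter which pivots get picked; only the running time is randomized, never the answer.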
Remarks:
The analysis of quickselect is very similar to that of quicksort, but a little simpler since we only have one-sided recursion. In the most balanced case (with a randomized pivot), each time we throw away about half of the array (analogous to binary search in a sorted array), so:
\[ T(n) = T(n/2) + O(n) \]
This is just a converging geometric series:
\[ T(n) = O(n) + O(n/2) + O(n/4) + \cdots + O(1) = O(n) \]
(Note: even the infinite sum of \(1 + 1/2 + 1/4 + ...\) converges to \(2\), let alone a finite sum; in this analysis we don’t even need to know the height of the recursion tree, but in case you wonder, it is still \(\log n\), like quicksort best case).
In the most unbalanced case where each time we can only reduce the size of the array by one (the pivot), this becomes identical to quicksort worst case:
\[ T(n) = T(n-1) + O(n) = O(n^2) \]
So the best case of quickselect is \(O(n)\), which is faster than quicksort best case, and this makes sense since selection is easier than sorting. However, the worst case of quickselect is as slow as quicksort worst case!
Caveat: what if you're really lucky and the very first partition lands in case 1 (i.e., `k == len(left) + 1`), so that you don't need any further recursion? Well, that's still \(O(n)\) because of the partition itself. So quickselect's best case is always \(O(n)\), unlike binary search in a sorted array, whose best case is actually \(O(1)\) (if you find the query on the first try; its worst case is \(O(\log n)\)).
What about the average case? Similar to quicksort, its average case is the same as its best case: \(O(n)\). If you understand our derivation of quicksort average case, you can derive quickselect average case yourself very easily.
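For the record, here is one standard way to write that derivation down (a sketch; the derivation from lecture may differ in details). With a uniformly random pivot, each rank \(1..n\) is equally likely, and pessimistically assuming we always recurse into the larger side:

\[ T(n) \le n + \frac{1}{n}\sum_{i=1}^{n} T(\max(i-1,\ n-i)) \le n + \frac{2}{n}\sum_{i=\lceil n/2\rceil}^{n-1} T(i) \]

Guessing \(T(i) \le ci\) and substituting, the last sum is at most \(\frac{2c}{n}\cdot\frac{3n^2}{8} = \frac{3cn}{4}\), so \(T(n) \le n + \frac{3cn}{4} \le cn\) holds for any \(c \ge 4\), confirming \(T(n) = O(n)\).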
Quickselect is not worst-case linear-time but rather expected linear-time (meaning the average complexity is linear in the size of the array). In practice, with randomized pivot, this is good enough, because there is no a priori worst case input. However, if you really want a deterministic (i.e., not randomized) worst-case linear-time selection algorithm, there is indeed one, called “median of medians” algorithm, which is rather complicated (though very clever). More importantly, it is actually quite slow (much much slower than quickselect!) due to a high constant factor. This algorithm is beyond the scope of our course; see the Wikipedia article (linked above) or other textbooks such as CLRS. In practice, just use quickselect.
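For the curious, here is a rough sketch of the median-of-medians idea (my own illustrative code, using out-of-place partitioning like `qselect` above; real implementations are in-place and more careful): split the array into groups of 5, take each group's median, recursively find the median of those medians, and use it as the pivot. This pivot is guaranteed to be neither too small nor too large, which is what yields the worst-case \(O(n)\) bound.

```python
def mom_select(a, k):
    # deterministic selection: returns the kth smallest (1-indexed)
    if len(a) <= 5:
        return sorted(a)[k - 1]
    # median of each group of 5 (the last group may be smaller)
    groups = [a[i:i + 5] for i in range(0, len(a), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    # recursively pick the median of the medians as the pivot
    pivot = mom_select(medians, len(medians) // 2 + 1)
    left = [x for x in a if x < pivot]
    right = [x for x in a if x > pivot]
    if k <= len(left):
        return mom_select(left, k)  # target is in left
    if k > len(a) - len(right):     # target is in right, with adjusted rank
        return mom_select(right, k - (len(a) - len(right)))
    return pivot  # otherwise the pivot itself has rank k
```

The grouping guarantees that roughly \(3n/10\) elements fall on each side of the pivot, so the two recursion sizes (\(n/5\) for the medians and at most \(\approx 7n/10\) for the surviving side) sum to less than \(n\), giving a converging series and hence \(O(n)\) worst case; but notice how much more work it does per level than quickselect.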
Similarly, quicksort is quite a bit faster than mergesort in practice, although the former is expected \(O(n\log n)\) while the latter is worst-case \(O(n\log n)\). (Well, to be fair to mergesort, it is still a very simple and useful algorithm while deterministic selection is too complicated and not practical).
Quicksort is faster mainly because its real work happens in the divide step (partitioning, which can be done in-place), with a trivial combine step, whereas mergesort's work is in the \(O(n)\) out-of-place combine step (merging). The following table compares quicksort with two related divide-and-conquer algorithms:
| algorithm | divide | conquer | combine | complexity |
|---|---|---|---|---|
| quicksort | partitioning: \(O(n)\) | \(2\times\): best: \(n/2+n/2\); worst: \((n-1)+0\) | trivial: \(O(1)\) (in-place) or \(O(n)\) (out-of-place) | best/avg: \(O(n\log n)\); worst: \(O(n^2)\) |
| quickselect | partitioning: \(O(n)\) | \(1\times\): best: \(n/2\); worst: \(n-1\) | n/a | best/avg: \(O(n)\); worst: \(O(n^2)\) |
| binary search | split: \(O(1)\) | \(1\times\): always \(n/2\) | n/a | always \(O(\log n)\) |
Quickselect, like quicksort, was also invented by the Turing Award winner Tony Hoare, and is known as Hoare’s selection algorithm.
The deterministic linear-time selection algorithm, "median of medians", was invented by Blum, Floyd, Pratt, Rivest, and Tarjan in 1973 when they were all at Stanford. Among them, Blum, Floyd, Rivest, and Tarjan later received Turing Awards (for other contributions), and Pratt is also a legendary figure in computer science.