Let us start the review of data structures with the most commonly used sorting algorithm, quicksort. We will then discover a hidden but deep connection between quicksort and a common data structure, binary search tree (BST). At the end of this section you will hopefully understand these two concepts in a much deeper way.
You might recall from your data structures course rather complicated implementations of quicksort in C/C++ or Java, like this (which I could never understand):
// quicksort of the array a (a field of the class), in place over the span [low, high]
public void sort(int low, int high) {
    if (low >= high) return;
    int p = partition(low, high);
    sort(low, p);
    sort(p + 1, high);
}

void swap(int i, int j) {
    int temp = a[i]; a[i] = a[j]; a[j] = temp;
}

int partition(int low, int high) {
    int pivot = a[low];
    int i = low - 1, j = high + 1;
    while (i < j) {
        i++; while (a[i] < pivot) i++;
        j--; while (a[j] > pivot) j--;
        if (i < j) swap(i, j);
    }
    return j;
}
But actually we can write quicksort in Python in just a few lines:
def qsort(a):
    if a == []:
        return []
    pivot = a[0]
    left = [x for x in a if x < pivot]
    right = [x for x in a[1:] if x >= pivot]
    return qsort(left) + [pivot] + qsort(right)
Here we just do a simple partition of array a into two parts using the pivot (here a[0]): left, which contains the elements in a that are smaller than the pivot, and right, which contains those bigger than or equal to the pivot. Then we just recursively quicksort both left and right and combine them with the pivot in the middle. Voilà!
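For a quick sanity check, here is what a call looks like (using the same example array that appears later with qsort2):

>>> qsort([4,2,6,3,5,7,1,9])
[1, 2, 3, 4, 5, 6, 7, 9]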
Remarks:

- The standard implementation (like the Java/C one above) is in-place, i.e., it sorts within the array itself (without allocating new arrays left and right). Besides using slightly less memory (no need to allocate new arrays left and right), that implementation is also in principle slightly faster in terms of a constant factor, but this difference does not change the time complexity. In algorithm analysis, we care about the complexity, not the constant factor.
- Note how close the Python code is to the math notation (you can translate left = [x for x in a if x < pivot] into \(\mathit{left} = \{ x \mid x \in a, x < \mathit{pivot}\}\), token by token).
- If you want a random pivot, replace the pivot = a[0] line with (remember to import random):

  i = random.randrange(len(a))
  a[0], a[i] = a[i], a[0]  # the new a[0] is the pivot

- Pay attention to the [1:] and the >= in the right = ... line. If you forget the first but not the second, you will have infinite recursions (the pivot itself would stay in right forever).

Let us now analyze the time complexity for quicksort (assume input a has size \(n\)). This is a typical divide-n-conquer (therefore recursive) algorithm, which has three parts instead of two:

- divide (the two partition lines, left = ... and right = ...),
- conquer (the two recursive calls, qsort(left) and qsort(right)), and
- combine (the concatenation ... + [pivot] + ...).

Many students think of divide-n-conquer as just “divide and conquer” (as the name suggests), but that is a big misconception: there is always a combine step! Don’t forget the combine step in the analysis!
In analyzing a divide-n-conquer algorithm, let us always start with the non-recursive parts (divide and combine), since they are easier. For divide, the two partition lines each cost \(O(n)\) time, because they each visit the whole array once. For combine, the first operation (... + [pivot]) seems to take \(O(1)\) time because the second list is a singleton, but it actually still takes \(O(n)\) time because it returns a new list (not in-place!), and the second operation (... + qsort(right)) also takes \(O(n)\) time (if both concatenations were done in-place, then the first one would take \(O(1)\) time and the second \(O(n)\) time). So we conclude that divide+combine is \(O(n)\).
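To make the “not in-place” point concrete, here is a tiny illustration using standard Python list semantics (the variable names are just for this example):

a = [1, 2]
b = a + [3]        # + builds a brand-new list by copying a, so it costs O(len(a))
print(a, b)        # [1, 2] [1, 2, 3] -- a is unchanged, b is a new object
a.extend([3])      # by contrast, extend() appends in place
print(a)           # [1, 2, 3]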
Caveat: in the standard in-place implementation above, the combine step has no work (\(O(1)\) time), because you’re always operating on the same array (the input to the recursion is an \([i,j]\) span instead of the array) so there is no need to “concatenate”. This is another advantage of the standard implementation, but again, this difference does not change the fact that divide+combine is \(O(n)\) time. The standard implementation (Hoare scheme) is so complicated that it is not worth our effort in an introductory course.
The rest of the analysis depends on how balanced the recursion tree is. In the best case, the division is always balanced, i.e., the pivot is always (roughly) the median of the array, which divides the array (roughly) equally. Here is a picture:
(4) 6 2 5 3 7 1
--------------------->          O(n)  -+
[2 3 1] 4 [6 5 7]                       |
(2) 3 1    (6) 5 7                      |
------>    ------>              O(n)  -+-> O(log n) levels
[1] 2 [3]  [5] 6 [7]                    |
(1) 2 (3)  (5) 6 (7)                    |
-->   -->  -->   -->            O(n)  -+
[]1[] []3[] []5[] []7[]
Since each level takes \(O(n)\) time for partitioning, and there are \(O(\log n)\) levels (because each partition halves the array), the total time is \(O(n\log n)\). This is the “recursion tree method”.
Or we can write the recurrence:
\[T(n) = 2T(n/2) + O(n)\]
which solves (e.g., by the Master Theorem) to \(T(n)=O(n\log n)\).
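If you prefer not to invoke the Master Theorem, you can also unroll this recurrence by hand (writing the \(O(n)\) term as \(cn\) for some constant \(c\)):
\[T(n) = 2T(n/2) + cn = 4T(n/4) + 2cn = \cdots = 2^k T(n/2^k) + kcn\]
so after \(k = \log_2 n\) levels the subproblems reach size 1, giving \(T(n) = nT(1) + cn\log_2 n = O(n\log n)\), in agreement with the recursion tree above.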
However, in the worst case, the pivot is always the smallest or largest element in the array (e.g., already sorted or inversely sorted), in which case one side is empty and the other side is smaller than a by just one element, i.e., it has \(n-1\) elements:
\[T(n) = T(n-1) + O(n)\]
which solves to \(T(n)=O(n)+O(n-1)+ ... +O(1) = O(n^2)\).
Here is a picture:
(5) 4 3 2 1
---------->         O(n)
[4 3 2 1] 5 []
(4) 3 2 1
-------->           O(n-1)
[3 2 1] 4 []
(3) 2 1
------>             ...
[2 1] 3 []
(2) 1
---->
[1] 2 []
(1)
-->                 O(1)
[] 1 []
Clearly \(n + (n-1) + ... + 1 = O(n^2)\).
So we have established the basic time complexities of quicksort: \(O(n\log n)\) in the best case and \(O(n^2)\) in the worst case.
Analyzing the average-case complexity is much more involved, and we will save it for later. But as a preview, think about the following questions:
Many years ago when I was teaching at the University of Pennsylvania, one student (after numerous failed debugging attempts) asked me why her quicksort was not working despite looking so “correct”:
def qsort2(a):
    if a == []:
        return []
    pivot = a[0]
    left = [x for x in a if x < pivot]
    right = [x for x in a[1:] if x >= pivot]
    return [qsort2(left)] + [pivot] + [qsort2(right)]
Initially I was puzzled, because it was basically verbatim from my code, but then I realized that although the output was weird, it contained an intriguing pattern, e.g.:
>>> qsort2([4,2,6,3,5,7,1,9])
[[[[], 1, []], 2, [[], 3, []]], 4, [[[], 5, []], 6, [[], 7, [[], 9, []]]]]
What is this weird list actually representing?
It actually encodes a binary search tree (BST), with the first pivot (4) being the root!

       4
      / \
     2   6
    / \ / \
   1  3 5  7
            \
             9
First, the pivot 4 partitions the array into left=[2,3,1] and right=[6,5,7,9]. Then for the left part (all numbers less than 4), the new pivot 2 divides it into left=[1] and right=[3], and so on and so forth. So each quicksort is implicitly building a BST!
This “buggy” version, with the extra pairs of brackets around the two recursive calls, effectively extracted the hidden BST in this format:
[left_tree, root, right_tree]
where root is a number, left_tree is a similarly encoded BST where all numbers are less than root, and right_tree is also a similarly encoded BST where all numbers are greater than or equal to root. If you find the nested list format hard to parse, we can write a simple “pretty-print” function to visualize the tree using indentation:
def pp(tree, dep=0):
    if tree == []:
        return
    left, root, right = tree
    pp(left, dep+1)
    print(" |" * dep, root)
    pp(right, dep+1)
For example, calling pp(qsort2([4,2,6,3,5,7,1,9])) would print this representation of the BST above:
| | 1
| 2
| | 3
4
| | 5
| 6
| | 7
| | | 9
This pp function is a standard in-order traversal, which visits the left subtree first, then the node, then the right subtree. But if we switch the order of pp(left, ...) and pp(right, ...) (which is called reverse in-order, i.e., right-node-left, traversal), it will print:
| | | 9
| | 7
| 6
| | 5
4
| | 3
| 2
| | 1
which is a 90-degree counterclockwise rotation of our usual tree above (you just need to turn your head to see it).
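Here is a minimal sketch of that reverse in-order variant; the name pp2 is chosen only to match the usage in the example further below:

def pp2(tree, dep=0):
    if tree == []:
        return
    left, root, right = tree
    pp2(right, dep+1)           # visit the right subtree first
    print(" |" * dep, root)     # then the node itself
    pp2(left, dep+1)            # and finally the left subtree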
This particular BST is balanced, meaning for each node, the heights of the left and right subtrees differ by at most 1. Note that we can also write a recursive definition: A BST is balanced if both subtrees are balanced, and their heights differ by at most 1. Balanced BSTs are great, because their height is \(O(\log n)\) and therefore searching can be done in \(O(\log n)\) time:
\[ T(n) = T(n/2) + 1 = O(\log n)\]
just like binary search (in a sorted array).
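The search procedure itself is not shown in this section, so here is a minimal sketch of how a search would look on our nested-list BST encoding (the name search and the True/False return convention are my own choices):

def search(tree, x):
    if tree == []:
        return False               # reached an empty subtree: not found
    left, root, right = tree
    if x == root:
        return True                # found at this node
    elif x < root:
        return search(left, x)     # smaller elements live in the left subtree
    else:
        return search(right, x)    # elements >= root live in the right subtree

Each call descends one level, so the number of calls is bounded by the tree’s height: \(O(\log n)\) for a balanced tree, and \(O(n)\) for the chain-shaped trees below.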
However, not all BSTs are balanced, and they can be extremely unbalanced when the pivot happens to be the smallest or largest element. The most extreme cases of unbalanced BSTs become linear chains, e.g., when performing quicksort on already-sorted or inversely-sorted arrays (pp2 is our reverse in-order traversal above):
>>> pp2(qsort2([7,6,5,4,3,2,1]))
7
| 6
| | 5
| | | 4
| | | | 3
| | | | | 2
| | | | | | 1
Searching in this kind of extremely unbalanced BST takes worst-case \(O(n)\) time, because each iteration can only discard one element (as opposed to half of the elements in the balanced case):
\[ T(n) = T(n-1) + 1 = O(n) \]
Now that we have seen a deep but hidden connection between quicksort and BSTs, I hope you have a much deeper understanding of both topics. Here is a summary:
|           |       | balanced             | extreme unbalanced |
|-----------|-------|----------------------|--------------------|
| quicksort | pivot | \(O(n \log n)\) time  | \(O(n^2)\) time    |
| BST       | root  | \(O(\log n)\) height  | \(O(n)\) height    |
Small caveat: searching in BST has a best-case complexity of \(O(1)\) since you can be lucky (the root is a match). So we often need to be more specific: “searching for an element not in the BST” has best-case \(O(\log n)\) and worst-case \(O(n)\) complexities.
Quicksort was invented by the legendary British computer scientist and Turing Award winner Tony Hoare (who also invented many other things such as Hoare Logic, and amazingly is still alive as of this writing!). But interestingly, he did it while studying machine translation as a visiting student in Moscow in 1959 under the legendary Soviet mathematician Andrey Kolmogorov. Hoare published this algorithm in 1961 after returning to the UK.
The “buggy qsort” was accidentally discovered by my former student Netta Doron in 2006 when she took my CSE 399 Python Programming course at the University of Pennsylvania. This was such a great discovery. I don’t think anyone could’ve discovered it intentionally.