Well, we could simply sort the collection, then go to the nth spot in the array. That would take O(n log n).
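For instance (a hypothetical Java sketch over a plain int array; the code later in these notes works with a generic KeyType instead):

// Return the nth smallest element (n counted from 1) by sorting a copy first: O(n log n).
static int nthSmallest(int[] a, int n) {
    int[] copy = java.util.Arrays.copyOf(a, a.length);   // leave the caller's array alone
    java.util.Arrays.sort(copy);
    return copy[n - 1];
}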
Yes, we can usually do better. The key insight is that one can very quickly partition a vector into those elements that are larger than some value and those that are smaller. For example, suppose we want to find the median of the following vector:
2 97 17 37 12 46 10 55 80 42 39

Make a guess, say element 37. Now, divide the vector into those elements less than 37 and those larger than 37, and see if the guess was correct:
12 10 17  2   37   97 46 55 80 42 39
  (< 37)       ^        (> 37)

Only four elements land to the left of 37, so 37 is the fifth smallest of the eleven, not the median; the guess was wrong, and the true median must lie in the right-hand group, so we would repeat the process there.
int partition(KeyType[] A, int i, int j) {
    KeyType pivot, temp;
    int k, middle, p;
    middle = (i + j) / 2;
    pivot = A[middle];          // pick the middle element as the pivot
    A[middle] = A[i];           // and park it at the front of the range
    A[i] = pivot;
    p = i;
    for (k = i + 1; k <= j; k++) {
        if (A[k].compareTo(pivot) < 0) {   // A[k] belongs with the "small" group
            temp = A[++p];
            A[p] = A[k];
            A[k] = temp;
        }
    }
    temp = A[i];                // finally, swap the pivot into the gap between the groups
    A[i] = A[p];
    A[p] = temp;
    return p;
}

What does this do? Invariant: at the end of each iteration of the k loop, all entries in positions i+1 through p are less than the pivot value, and all entries in positions p+1 through k are greater than or equal to it (the pivot itself sits at position i until the final swap).
Trace this through for our sample vector (i=0, j = 10):
pivot = 46, p = 0;
 2 97 17 37 12 46 10 55 80 42 39    starting array
46 97 17 37 12  2 10 55 80 42 39    initialization: swap 46 and 2
k = 1    no change
k = 2    swap 17 and 97, p = 1
k = 3    swap 37 and 97, p = 2
k = 4    swap 12 and 97, p = 3
k = 5    swap  2 and 97, p = 4
k = 6    swap 10 and 97, p = 5
k = 7    no change
k = 8    no change
k = 9    swap 42 and 97, p = 6
k = 10   swap 39 and 55, p = 7
final swap of 39 and 46; return 7

final result: 39 17 37 12 2 10 42 46 80 97 55

Ok, so everything to the left of the 46 is less than 46, and everything to the right is greater. We now know that the eighth smallest entry is 46. But how can we use this for sorting? Well, notice that if we had only two entries to start with, they would now be sorted, no matter which we had picked as the pivot.
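In fact, if all we wanted was the nth smallest element (the original question), we would not need to sort at all: we could recurse into just the one side of the partition that contains position n. Here is a sketch of that idea (not part of the original notes), written over a plain int array rather than KeyType:

// the partition routine from above, adapted to a plain int array
static int partition(int[] A, int i, int j) {
    int middle = (i + j) / 2;
    int pivot = A[middle];
    A[middle] = A[i];
    A[i] = pivot;
    int p = i;
    for (int k = i + 1; k <= j; k++) {
        if (A[k] < pivot) {
            int temp = A[++p];
            A[p] = A[k];
            A[k] = temp;
        }
    }
    int temp = A[i];
    A[i] = A[p];
    A[p] = temp;
    return p;
}

// Return the element that would land at index n (0-based) if A[i..j] were sorted,
// recursing into only the side of the partition that contains index n.
static int quickSelect(int[] A, int i, int j, int n) {
    int p = partition(A, i, j);
    if (p == n) return A[p];
    if (n < p)  return quickSelect(A, i, p - 1, n);
    else        return quickSelect(A, p + 1, j, n);
}

On our sample vector, quickSelect(A, 0, 10, 5) would return 39, the median. For typical inputs each recursive call works on a piece of roughly half the size, so the expected time is linear.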
So, how about if we recursively apply the same algorithm to both sides, leaving the 46 right where it is?
void QuickSort(KeyType[] A, int m, int n) {
    if (m < n) {
        int p = partition(A, m, n);   // A[p] is now in its final position
        QuickSort(A, m, p - 1);       // sort everything to its left
        QuickSort(A, p + 1, n);       // and everything to its right
    }
}

Ok, let's continue the sorting. The partition above returned p = 7, so we call QuickSort(A, 0, 6) and QuickSort(A, 8, 10). Let's do each in turn.
QuickSort(A, 0, 6):

step 1: call partition(A, 0, 6)

    39 17 37 12  2 10 42 46 80 97 55    starting point (i = 0, j = 6)
    12 17 37 39  2 10 42 46 80 97 55    initialization: pivot = 12, swap 12 and 39, p = 0
    k = 1   no change
    k = 2   no change
    k = 3   no change
    k = 4   swap 2 and 17, p = 1
    k = 5   swap 10 and 37, p = 2
    k = 6   no change
    final swap of 10 and 12

    10  2 12 39 17 37 42 46 80 97 55

    and QuickSort gets a return from partition of p = 2.

step 2: call QuickSort(A, 0, 1)

    This will just swap the 10 and the 2:

     2 10 12 39 17 37 42 46 80 97 55

step 3: call QuickSort(A, 3, 6)

    step 1: call partition(A, 3, 6)

         2 10 12 39 17 37 42 46 80 97 55    starting point (i = 3, j = 6)
         2 10 12 17 39 37 42 46 80 97 55    initialization: pivot = 17, swap 17 and 39, p = 3
        k = 4   no change
        k = 5   no change
        k = 6   no change
        no swap at the end (p == i); partition returns 3

         2 10 12 17 39 37 42 46 80 97 55

You should have the idea by now.
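To see the whole thing in action, here is a small hypothetical driver (not from the notes), assuming partition and QuickSort above are declared over Integer[] (i.e., KeyType is Integer, so compareTo is ordinary numeric comparison):

// inside the same class as partition and QuickSort
public static void main(String[] args) {
    Integer[] A = {2, 97, 17, 37, 12, 46, 10, 55, 80, 42, 39};
    QuickSort(A, 0, A.length - 1);                     // sort the whole array
    System.out.println(java.util.Arrays.toString(A));
    // prints: [2, 10, 12, 17, 37, 39, 42, 46, 55, 80, 97]
}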
Often, you know more about the values being sorted than just how to compare two of them. For example, if you saw the name "Adams", you might reasonably guess that it will fall near the front of an alphabetical sort. It turns out that by exploiting such information you can sort in O(n)! However, such information will often be only "usually right", so the O(n) is an expected-case bound rather than a worst-case one. (It might turn out that Adams is the last name in an unusual collection of names beginning with Aaron.)
Well, how can we use such information? The basic idea is to put each element "approximately" where it belongs, based on its value.
Example 1: 1000 last names to sort.
Method: take an array of size 1000 and divide it up into 26 subarrays of size 1000/26 each. Then, for each name, go to the appropriate subarray and use insertion sort to put the element in the proper place.
What is insertion sort?
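Briefly: insertion sort keeps a growing sorted prefix and slides each new element leftward until it is in place. A minimal sketch over an int array (an illustration, not part of the original notes):

// Insertion sort: after processing index k, a[0..k] is in sorted order.
static void insertionSort(int[] a) {
    for (int k = 1; k < a.length; k++) {
        int x = a[k];
        int m = k - 1;
        while (m >= 0 && a[m] > x) {   // shift larger entries one slot to the right
            a[m + 1] = a[m];
            m--;
        }
        a[m + 1] = x;
    }
}

Inserting into a bucket that already holds k entries costs O(k) work, which is why the bucket size shows up in the analysis below.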
What if one of the letters has more than 1000/26 names associated with it?
- not really a problem for correctness, but performance degrades as that bucket fills up.
What is the complexity? Constant time to find the right sub-array, plus about (1000/26)^2 time to insertion-sort the roughly 1000/26 names that land in one sub-array. Summed over the 26 sub-arrays, that is 26 * (1000/26)^2 = 1000 * (1000/26) steps; in general, for n elements the sort time is about n * (bucket size). If we can keep the bucket size constant, then the sort time is linear in n!
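Here is a hedged sketch of the scheme (the helper names are made up; it uses 26 ArrayList buckets rather than 26 fixed-size slices of one big array, which is the simpler way to write the same idea in Java, and it assumes uppercase names beginning with A-Z):

import java.util.ArrayList;
import java.util.List;

class NameBucketSort {
    // Distribute names into 26 buckets by first letter, keeping each bucket sorted
    // by inserting every name into place, then concatenate the buckets in order.
    static List<String> sort(List<String> names) {
        List<List<String>> buckets = new ArrayList<>();
        for (int b = 0; b < 26; b++) buckets.add(new ArrayList<>());

        for (String name : names) {
            List<String> bucket = buckets.get(name.charAt(0) - 'A');
            int pos = 0;                      // insertion step within the (sorted) bucket
            while (pos < bucket.size() && bucket.get(pos).compareTo(name) < 0) pos++;
            bucket.add(pos, name);
        }

        List<String> result = new ArrayList<>();
        for (List<String> bucket : buckets) result.addAll(bucket);
        return result;
    }
}

With 1000 names spread roughly evenly, each insertion scans a bucket of about 1000/26 entries, matching the analysis above.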
But notice that this means we must be increasingly accurate in our initial division as n increases, which translates into an increasing requirement on prior knowledge about how the keys are distributed.
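For example, with 26,000 names a single first letter no longer keeps the buckets small; a hypothetical finer index might use the first two letters, which only works well if we know the two-letter prefixes are spread reasonably evenly:

// Hypothetical finer bucket index: the first two letters give 26 * 26 = 676 buckets.
// Assumes uppercase A-Z names of length >= 2 and a roughly even prefix distribution.
static int bucketIndex(String name) {
    int first  = name.charAt(0) - 'A';
    int second = name.charAt(1) - 'A';
    return first * 26 + second;
}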