CS261 Sorting

Sorting:

  • By comparison of pairs of keys
  • By distributing into buckets
  • Sorting by comparison

    The basic idea is that the only operation allowed on key values is to compare two keys and get back their relative order. These methods generally assume a total order; if the keys are only partially ordered, the results are unpredictable.
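
    In Java, for instance, this single operation is the Comparable/Comparator contract; a comparison-based sort sees keys only through calls like these:

       int r1 = "Adams".compareTo("Baker");   // negative: "Adams" orders first
       int r2 = Integer.compare(42, 7);       // positive: 42 orders after 7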

    Data-structure-based methods

    We have already seen heapsort, based on heaps. More simply, one could build a sorting method based on binary search trees or AVL trees. Simply insert the keys in any order, then read them out with an in-order traversal. What is the time complexity? Well, there are two phases:
    1. Insert: O(Log(n)) per insert, n inserts, so O(nLog(n)) total.
    2. Read out: O(Log(n)) per read. (This is not obvious - why?) n reads, so O(nLog(n)) total.
    So the overall cost is O(nLog(n)). But notice there is a constant factor of at least 2 in front, one for inserting and one for reading back out again. Also, all those memory allocations for all those tree nodes won't be cheap.
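
    A minimal sketch of this "tree sort" using Java's TreeMap (a red-black tree rather than an AVL tree, but with the same O(Log(n)) insert bound); duplicates are counted so they survive the round trip:

       import java.util.Map;
       import java.util.TreeMap;

       static int[] treeSort(int[] keys) {
          Map<Integer, Integer> counts = new TreeMap<>();         // balanced search tree
          for (int key : keys)
             counts.merge(key, 1, Integer::sum);                  // O(Log(n)) insert, counting duplicates
          int[] out = new int[keys.length];
          int i = 0;
          for (Map.Entry<Integer, Integer> e : counts.entrySet()) // in-order read-out
             for (int c = 0; c < e.getValue(); c++)
                out[i++] = e.getKey();
          return out;
       }
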
    QuickSort
    Perhaps the most famous sorting method in computer science is quicksort. We'll start our discussion of quicksort with a study of a related question: a heap is designed to support findLargest (or findSmallest). Suppose we want to find the nthLargest. Is there any quick way?

    Well, we could simply sort the collection, then go to the nth spot in the array. That would take O(nLog(n)).
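
    As a sketch (using java.util.Arrays.sort; the helper name is just for illustration):

       static int nthSmallest(int[] a, int n) {   // n is 1-based
          java.util.Arrays.sort(a);               // O(nLog(n)) dominates
          return a[n - 1];                        // O(1) lookup
       }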

    Yes, we can usually do better. The key insight is that one can very quickly partition a vector into those elements that are larger than some value and those that are smaller. For example, suppose we want to find the median of the following vector:

                     2 97 17 37 12 46 10 55 80 42 39
    Make a guess, say element 37. Now, divide the vector into those elements less than 37 and those larger than 37, and see if the guess was correct:
                     12 10 17 2 37 97 46 55 80 42 39
    
                       <         ^       >
    Here is a partition routine that does this division in one pass:
    int partition(KeyType[] A, int i, int j) {
       KeyType pivot, temp;
       int k, middle, p;
       middle = (i+j)/2;
       // Choose the middle element as the pivot and park it at A[i].
       pivot = A[middle]; A[middle] = A[i]; A[i] = pivot;
       p = i;
       for (k = i+1; k <= j; k++) {
          if (A[k].compareTo(pivot) < 0) {
             // A[k] belongs with the small group: swap it to just past the smalls.
             temp = A[++p]; A[p] = A[k]; A[k] = temp;
          }
       }
       // Swap the pivot into its final position, between the two groups.
       temp = A[i]; A[i] = A[p]; A[p] = temp;
       return p;
    }
    What does this do? Invariant: at the end of each iteration of the k loop, A[i] holds the pivot, all entries from i+1 through p are less than the pivot, and all entries from p+1 through k are greater than or equal to it.

    Trace this through for our sample vector (i=0, j = 10):
    pivot = 46, p = 0;

                     2 97 17 37 12 46 10 55 80 42 39
     initialization 46             2
     k = 1           no change
     k = 2             17 97, p = 1
     k = 3                37 97, p = 2
     k = 4                   12 97, p = 3
     k = 5                       2 97, p = 4
     k = 6                         10 97, p = 5
     k = 7           no change
     k = 8           no change
     k = 9                            42       97, p = 6
     k = 10                              39       55, p = 7

                     39                   46
     return 7
     final result    39 17 37 12  2 10 42 46 80 97 55
    Ok, so everything to the left of the 46 is less than 46, and everything to the right is greater. We now know that the eighth smallest entry is 46. But how can we use this for sorting? Well, notice that if we had only two entries to start with, they would now be sorted, no matter which we had picked as the pivot.

    So, how about if we recursively apply the same algorithm to both sides, leaving the 46 right where it is?

    void QuickSort(KeyType[] A, int m, int n) {
       if (m < n) {
          int p = partition(A, m, n);   // A[p] is now in its final position
          QuickSort(A, m, p-1);         // sort the smaller-than-pivot side
          QuickSort(A, p+1, n);         // sort the larger-than-pivot side
       }
    }
    Ok, let's continue the sorting. The partition above returned p = 7, so we call QuickSort(A, 0, 6) and QuickSort(A, 8, 10). Let's do each in turn.
    QuickSort(A, 0, 6):
    
    QuickSort(A, 0, 6) step 1: call partition(A, 0, 6);
    p=0                  0                 6  
                        39 17 37 12  2 10 42 46 80 97 55
    initialization      12       39
    k=1  no change                                      
    k=2  no change                                      
    k=3  no change                                       
    k=4                     2       17, p=1
    k=5                       10       37, p=2
    k=6 no change                                         
                        10  2 12 39 17 37 42 46 80 97 55
    and QuickSort gets a return from partition of p=2
    
    QuickSort(A, 0, 6) step 2: call QuickSort(A, 0, 1)
    This will just swap the 10 and the 2:
                         2 10 12 39 17 37 42 46 80 97 55
    
    QuickSort(A, 0, 6) step 3: call QuickSort(A, 3, 6);
    step 1: call partition(A, 3, 6);
    p = 3;                        3        6  
                         2 10 12 39 17 37 42 46 80 97 55
    initialization               17 39                   
    k=4   no change 
    k=5   no change
    k=6   no change
          no swap at end (p==i)
                         2 10 12 17 39 37 42 46 80 97 55
    
    Partition returns p = 3, so the recursive calls are QuickSort(A, 3, 2) (vacuous) and QuickSort(A, 4, 6). You should have the idea by now.
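
    Back to the question that started this section: to find the nth smallest without fully sorting, recurse into only the side of the partition that contains position n. This is usually called quickselect; a sketch for int keys, reusing the partition logic above (usually O(n), since each level touches a shrinking slice):

       // partition, as above, specialized to int keys
       static int partition(int[] A, int i, int j) {
          int middle = (i + j) / 2;
          int pivot = A[middle]; A[middle] = A[i]; A[i] = pivot;  // pivot to front
          int p = i;
          for (int k = i + 1; k <= j; k++)
             if (A[k] < pivot) { int t = A[++p]; A[p] = A[k]; A[k] = t; }
          int t = A[i]; A[i] = A[p]; A[p] = t;                    // pivot to final spot
          return p;
       }

       // Value that would sit at index n (0-based) if A[m..j] were sorted.
       static int quickSelect(int[] A, int m, int j, int n) {
          int p = partition(A, m, j);
          if (p == n) return A[p];
          if (n < p)  return quickSelect(A, m, p - 1, n);   // answer is on the small side
          return quickSelect(A, p + 1, j, n);               // answer is on the large side
       }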
    

    Non-comparison-based sorting

    As we saw, quicksort is O(nLog(n)) expected case. We have also argued (but not proved) that any comparison-based sort must take Ω(nLog(n)) time in the worst case. But is sorting in general an nLog(n) task? No!

    Often, you know more about the values being sorted than just <, >. For example, if you saw the name "Adams", you might reasonably guess that it will fall near the front of an alphabetical sort. It turns out that by exploiting such information you can sort in O(n)! However, such information will often be only "usually right", so the O(n) is an expected-case bound rather than worst-case. (It might turn out that Adams is the last name in an unusual collection of names beginning with Aaron.)

    Well, how can we use such information? The basic idea is to put each element "approximately" where it belongs, based on its value.

    Example 1:  1000 last names to sort.
    Method: take an array of size 1000, divided into 26 subarrays of size 1000/26 each, one per first letter. Then, for each name, go to the appropriate subarray and use insertion sort to put the name in the proper place. (A code sketch follows the questions below.)

    What is insertion sort? (Keep a growing prefix sorted; place each new element by scanning for its spot and shifting larger elements over by one.)
    What if one of the letters has more than 1000/26 names associated with it? Not really a problem, but performance breaks down as that bucket fills.
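
    A sketch of Example 1's scheme (growable buckets are an illustrative choice, not from the notes, and names are assumed to start with an ASCII letter); the scan-and-insert inside the loop is exactly insertion sort's basic step:

       import java.util.ArrayList;
       import java.util.List;

       static List<String> bucketSort(List<String> names) {
          List<List<String>> buckets = new ArrayList<>();
          for (int b = 0; b < 26; b++)
             buckets.add(new ArrayList<>());              // one bucket per first letter
          for (String name : names) {
             List<String> bucket = buckets.get(Character.toUpperCase(name.charAt(0)) - 'A');
             int pos = 0;                                 // insertion sort's scan...
             while (pos < bucket.size() && bucket.get(pos).compareTo(name) < 0)
                pos++;
             bucket.add(pos, name);                       // ...and shift-insert, O(bucket size)
          }
          List<String> out = new ArrayList<>();           // concatenating the buckets in
          for (List<String> bucket : buckets)             // letter order yields sorted output
             out.addAll(bucket);
          return out;
       }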

    What is the complexity? Well, constant time to find the right subarray, plus O(b) per insertion into a sorted bucket of size b, i.e. about (1000/26)^2 work per bucket over all its insertions. In general, n elements spread over n/b buckets cost (n/b)*b^2 = n*b total. If we can keep the bucket size constant, sort time is linear in n!

    But notice this means we must be increasingly accurate in our initial division as n increases, which translates into an increasing requirement for prior knowledge about how the keys are distributed.

    Hash or bucket sort

    Similar idea, but a bit more sophisticated. Instead of an array of size 1000, just use an array of size 26, where each "bucket" is an AVL tree. Now we pay bLog(b) per bucket instead of b^2. This is "hashing with buckets", but with the additional requirement that the hash function be monotone: X > Y implies h(X) >= h(Y). For example, we could use a hash table of size 5 and the hash function key/20 for our earlier numeric example. That would yield one small AVL tree per bucket (ps - can you recreate the AVL trees, assuming left-to-right insertion of our original vector of numbers? Good exam question!).
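
    A sketch of that numeric example (Java's TreeSet is a red-black tree rather than an AVL tree, but the O(Log(b)) insert bound is the same; the sample keys all land in 0-99, so key/20 maps into 5 buckets):

       import java.util.ArrayList;
       import java.util.List;
       import java.util.TreeSet;

       static int[] hashBucketSort(int[] keys) {
          List<TreeSet<Integer>> table = new ArrayList<>();
          for (int b = 0; b < 5; b++)
             table.add(new TreeSet<>());             // one ordered tree per bucket
          for (int key : keys)
             table.get(key / 20).add(key);           // monotone hash: X > Y implies h(X) >= h(Y)
          int[] out = new int[keys.length];
          int i = 0;
          for (TreeSet<Integer> bucket : table)      // buckets in increasing hash order,
             for (int key : bucket)                  // each traversed in sorted order
                out[i++] = key;
          return out;
       }

    On our vector 2 97 17 37 12 46 10 55 80 42 39 this puts {2, 10, 12, 17} in bucket 0, {37, 39} in bucket 1, {42, 46, 55} in bucket 2, nothing in bucket 3, and {80, 97} in bucket 4; the AVL shapes those insertions produce are the exam question.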