Machine Learning, HW2
--------------------------------------
Due Fri April 5, 11:59pm on Blackboard.
--------------------------------------
(* = optional)

Part I - Theory of Kernels
======================================================

1. Explain why the kernel matrix must be positive semidefinite.
   Hint: what counterintuitive, even absurd, consequences would arise if the
   kernel matrix were not positive semidefinite?

2. Choose a kernel to classify the following XOR dataset with the perceptron
   (an optional numpy sketch of the dual perceptron is given at the end of this part):

        x       y
      -------------
      (0,0)    +1
      (0,1)    -1
      (1,1)    +1
      (1,0)    -1

   a) Show each step in both the primal (explicit mapping) and the dual (alphas).
   b) Verify that the final results are equivalent.
   c) What is the decision boundary in the original R^2 space?

3. Choose a kernel to classify the following "circle" dataset with the perceptron:

        x       y
      -------------
      (0,0)    +1
      (0,1)    -1
      (1,0)    -1
      (-1,0)   -1
      (0,-1)   -1

   a) Show each step in both the primal (explicit mapping) and the dual (alphas).
   b) Verify that the final results are equivalent.
   c) What is the decision boundary in the original R^2 space?

4. Explain in your own words the geometric intuitions of Gaussian kernels:
   a) In the original (unmapped) space, what is the interpretation? What is the
      connection to instance-based learning such as nearest neighbor? What are
      the similarities and differences?
   b) In the mapped (feature) space, what special properties do you observe?
      How many dimensions are there?
   c) Is the space in b) primal or dual? Explain.
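For reference only (not part of the required submission), here is a minimal
numpy sketch of the dual (kernelized) perceptron on the XOR data from
question 2, together with an eigenvalue check of the Gram matrix for
question 1. The inhomogeneous quadratic kernel K(x,z) = (x.z + 1)^2 used here
is just one possible choice, not necessarily the intended answer.

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 1], [1, 0]], dtype=float)
    y = np.array([+1, -1, +1, -1])

    def quad_kernel(a, b, c=1.0):
        # one possible kernel choice: (a.b + c)^2
        return (np.dot(a, b) + c) ** 2

    # Gram matrix; all eigenvalues should be >= 0 (up to round-off) -- question 1.
    K = np.array([[quad_kernel(xi, xj) for xj in X] for xi in X])
    print("eigenvalues of K:", np.linalg.eigvalsh(K))

    # Dual perceptron: alpha[i] counts the mistakes made on example i.
    alpha = np.zeros(len(X))
    for epoch in range(20):
        mistakes = 0
        for i in range(len(X)):
            score = np.sum(alpha * y * K[:, i])
            if y[i] * score <= 0:          # mistake (or tie at 0): update
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:                  # converged: data is separated
            break

    print("alphas:", alpha)
    print("predictions:",
          [int(np.sign(np.sum(alpha * y * K[:, i]))) for i in range(len(X))])

The same loop applies verbatim to the circle dataset in question 3; for the
primal check, write out the explicit feature mapping of your chosen kernel and
run the ordinary perceptron on the mapped points.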
Part II - Theory of SVM and Convex Optimization
=========================================

1. If we use the Hildreth Algorithm to solve the SVM optimization problem,
   a) what is the Q matrix and b vector in the primal? (work out the math; same below)
   b) what is the Q matrix and b vector in the dual?
   c) which version is easier to optimize?
   d) what will change (in Q and b) if we add slacks?
   e) what will change (in Q and b) if we also use kernels (in addition to slacks)?

2. Explain in your own words and with illustrations (you may google relevant resources):
   a) Why must the Lagrange multipliers be non-negative?
   b) Why do the KKT conditions hold?
   c) Would KKT hold for non-convex optimization? Why or why not?

3. SVM Support Vector Set.
   a) True or False (and why):
      a1) If an example has a functional margin of +/-1, it must be in the support vector set.
      a2) If an example is in the support vector set, it must have a functional margin of +/-1.
      a3) In R^2 (primal), the size of the support vector set must be 3.
      a4) In R^d (primal), what are the smallest and largest possible sizes of the support vector set?
      a5) SVM can *not* result in a larger support vector set than the perceptron on the same problem.
   b) Compare SVM and nearest neighbor in terms of the support vector set.
  *c) It is obvious why SVM generalizes better than the perceptron (maximal vs. arbitrary margin),
      but why does SVM generalize better than nearest neighbor?
      (hint: google "SVM leave-one-out bound"; compression and learning)
  *d) Can we compute the convex hulls of the positive and negative examples, respectively, and then
      read off the support vectors from them? If so, would this method be faster or easier than
      quadratic programming?

4. Soft Margin, Non-Linearity, and Overfitting:
   a) Does a bigger C allow more outliers or fewer? Is C=0 equivalent to hard-margin SVM?
   b) Is a bigger C more likely to overfit?
   c) Is a polynomial kernel with a higher degree more likely to overfit?

*5. SVM with or without bias.
   The classical treatment of SVM in almost all textbooks and papers includes the bias term
   (just google). Note that the margin in the original, unaugmented space has nothing to do
   with the bias. The treatment in this course, however, removes the bias term by working in
   the augmented space, which is a lot easier.
   a) Is it guaranteed that the best margin achieved in the augmented space leads to the best
      margin in the original space? Prove or disprove it.
   b) In the original SVM with bias, what extra constraints would you get due to the bias term?
      How would this affect the Hildreth Algorithm? Does it still work?

Part III - Experiments (in Python + numpy)
=====================================================

1. Kernelized perceptron for Spambase.

   dataset: hw2/spambase.tgz
   format: the last bit of each line is the label (0: not spam, 1: spam).
   preprocessing: shift/rescale each feature so that its range is [0, 1]
                  (see the optional appendix sketch at the end of this handout).
   perceptron details: always use the averaged perceptron; run 10 epochs
                  (i.e., iterations over the data).

   a) Compare primal and dual for the linear kernel (identity); make sure the results are identical.
   b) Compare primal and dual for the quadratic kernel (c=1); make sure the results are identical.
   c) Run the polynomial kernel of degree 4 (c=1); report the best accuracy on dev and the
      number of support vectors.
   d) Run the Gaussian kernel (sigma=0.5); report the best accuracy on dev and the number of
      support vectors.

   Make one plot summarizing a, b, c, d: the x-axis is the number of epochs and the y-axis is
   the accuracy on dev. Which kernel is the best?

2. Use an SVM toolkit on the same dataset, for all four kernels above.
   Possible candidates are libsvm and svm-light, or the Python versions scikit-learn and mlpy.
   Remember that you have to tune C on the dev set (choose from C=1,2,5,10,20,50,100).
   Are the SVM results better than the corresponding perceptron results? How many support
   vectors are there? Are there more or fewer than for the corresponding perceptrons?

*3. Try to run the following extremely simple online subgradient descent solver for SVM
   (at least 10x shorter than the toolkits above):

   https://github.com/gracaninja/lxmls-toolkit/blob/master/code/classifiers/svm.py

   You only need to run it on the linear and quadratic kernels in primal mode (using the best
   C's found above). Are the results comparable? Is this new method with or without bias? Why?

PLEASE INCLUDE A FILE "report.pdf" FOR BOTH THE THEORY PARTS AND THE EXPERIMENTAL RESULTS
(including plots).

PART IV - DEBRIEF SECTION
==============================================================

Please include a separate "debrief.txt" answering these questions:

1. Did you work alone, or did you discuss with other students? If the latter, please write
   down their names.
   Note: in general, only high-level discussions are allowed; each student still has to write
   the code, perform the experiments, and write the report entirely by him/herself.
2. How many hours did you spend on this assignment?
3. Would you rate it as easy, moderate, or difficult?
4. Are the lectures too fast, too slow, or at just the right pace?
5. Any other comments?

Note: You'll get 5 pts off for not including the debrief. If you have submitted but forgot to
include it before the deadline, email debrief.txt by itself to TA Kai (kzhao@gc.cuny.edu) and
the professor.

--------------- THE END OF HW2 ------------------------------------------
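Appendix (optional, for reference only)
---------------------------------------
A minimal numpy / scikit-learn sketch of the [0, 1] rescaling from Part III question 1 and the
dev-set tuning of C from Part III question 2. The file names, the comma delimiter, the handling
of dev-set values outside the training range, and the sigma-to-gamma conversion are all
assumptions, not part of the assignment -- adapt them to the actual data and to the course's
definition of the Gaussian kernel.

    import numpy as np
    from sklearn.svm import SVC

    def load(path):
        data = np.loadtxt(path, delimiter=",")   # assumed comma-separated
        return data[:, :-1], data[:, -1]         # last column is the label

    # hypothetical split file names -- adapt to the contents of spambase.tgz
    X_train, y_train = load("spambase.train")
    X_dev, y_dev = load("spambase.dev")

    # Shift/rescale each feature to [0, 1] using the training set's min and max;
    # dev values outside that range are clipped (one reasonable choice).
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)       # avoid division by zero
    X_train = (X_train - lo) / span
    X_dev = np.clip((X_dev - lo) / span, 0.0, 1.0)

    # Gaussian kernel with sigma = 0.5, assuming K(x,z) = exp(-||x-z||^2 / (2 sigma^2)),
    # which corresponds to scikit-learn's gamma = 1 / (2 sigma^2) = 2.0.
    best = None
    for C in [1, 2, 5, 10, 20, 50, 100]:
        clf = SVC(C=C, kernel="rbf", gamma=2.0).fit(X_train, y_train)
        acc = clf.score(X_dev, y_dev)
        if best is None or acc > best[0]:
            best = (acc, C, len(clf.support_))
    print("best dev accuracy %.4f with C=%d (%d support vectors)" % best)

The same loop works for the linear, quadratic, and degree-4 polynomial kernels by swapping in
kernel="linear" or kernel="poly" with the appropriate degree and coef0.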