E0 270 Machine Learning, January Term 2015
Assignment 2
Due: Feb 10, 2015
1. (a) Let S = ((x1, y1), . . . , (xm, ym)) ∈ (R^d × {±1})^m, and suppose you want to assign different costs
C1 and C2 to margin violations for positive and negative examples, respectively. Write down a
soft-margin linear SVM formulation for this setting. Derive its dual and explain how the final
solution (ŵ, b̂) is obtained from the solution to the dual.
(b) Suppose you wanted to minimize the cost-sensitive loss ℓ_c : {±1} × {±1} → {0, c, 1 − c} (where
c ∈ (0, 1)) defined as

        ℓ_c(y, ŷ) = c        if y = −1 and ŷ = 1,
                    1 − c    if y = 1 and ŷ = −1,
                    0        if ŷ = y.
How would you select C1 and C2 in part (a) above?
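For reference (this is background, not a solution to part (a)), the standard cost-symmetric soft-margin SVM primal that part (a) asks you to generalize is:

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\|w\|^2 \;+\; C \sum_{i=1}^{m} \xi_i
\qquad \text{s.t.} \qquad
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m.
```

In the cost-sensitive setting, the single parameter C is replaced by separate weights on the slack variables of positive and negative examples; the same Lagrangian machinery used for the standard dual then applies.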
2. Programming Exercise: Support Vector Machine. Write a piece of MATLAB code to learn
an SVM classifier from the spam classification data set as described in Problem 6 in Assignment 1:
your program should take as input the training set (X, y), parameter C, kernel type (which can be
‘linear’, ‘poly’ or ‘rbf’; use the provided code compute_kernel.m for computing these kernels), and
kernel parameter (which you can take here to be a single real number r; this is used as the degree
parameter q in the polynomial kernel K(x, x′) = (x⊤x′ + 1)^q, and as the width parameter σ in the
RBF kernel K(x, x′) = exp(−‖x − x′‖²₂ / (2σ²))); the program should return as output an SVM model (consisting
of the kernel type and parameters, the support vectors, the coefficients corresponding to the support
vectors and the bias term). You can use the provided template SVM_learner.m as a starting point.
Use MATLAB’s quadratic program solver quadprog to solve the SVM dual problem.
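For illustration only (the assignment requires MATLAB and the provided compute_kernel.m), the three kernel formulas above can be sketched in Python; the function name and signature below are assumptions of this sketch, not the provided interface:

```python
import math

def compute_kernel(x, z, kind, r=1.0):
    """Evaluate K(x, z) for kind in {'linear', 'poly', 'rbf'}.

    x, z are equal-length lists of floats; r plays the role of the degree q
    for 'poly' and the width sigma for 'rbf', mirroring the problem statement.
    """
    dot = sum(a * b for a, b in zip(x, z))
    if kind == "linear":
        return dot
    if kind == "poly":
        return (dot + 1.0) ** r                     # K(x, z) = (x'z + 1)^q
    if kind == "rbf":
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
        return math.exp(-sq_dist / (2.0 * r ** 2))  # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    raise ValueError("kind must be 'linear', 'poly' or 'rbf'")
```

Note that for the RBF kernel, K(x, x) = 1 for every x, which is a handy sanity check on any implementation.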
(a) Run your implementation of the SVM learning algorithm with a linear kernel on the spam classification data set (see folder Problem-2\Spambase), selecting C from the range {1, 10, 10^2, 10^3, 10^4}
through 5-fold cross-validation on the training set. As in Problem 6(b) in Assignment 1, report
the average cross-validation error for each value of C in this range and the training and test errors
achieved by the best value of C.
(b) You are provided a 2-dimensional synthetic data set for this problem. Run your implementation
of SVM on the training set provided using a linear, degree-2 polynomial, degree-3 polynomial
and RBF kernel, selecting C from the range {1, 10, 10^2, 10^3, 10^4} via 5-fold cross-validation on the
training set; in the case of the RBF kernel, in addition to C, you will also have to choose the
width parameter σ from the range {1/32, 1/4, 1, 4, 32} using cross-validation. Report the average
cross-validation error for each value of C (and σ) and the training and test errors achieved by
the best parameter value(s). Also, draw a 2-d scatter plot of the test instances and visualize how
well the decision boundaries learnt using the different kernels separate the positive and negative
test instances; you can use the provided code decision_boundary_SVM.m to visualize the decision
boundaries. (See folder Problem-2\Synthetic for the relevant files.)
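The 5-fold cross-validation used in both parts can be organized around a generic fold-splitting helper. The sketch below is in Python for illustration (the assignment expects MATLAB), and the contiguous-fold construction is an assumption of this sketch:

```python
def k_fold_indices(m, k=5):
    """Partition indices 0..m-1 into k contiguous folds (sizes differ by at most 1)."""
    base, extra = divmod(m, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validation_error(m, train_and_eval, k=5):
    """Average validation error over k folds.

    train_and_eval(train_idx, val_idx) -> validation error; in the assignment this
    would wrap SVM_learner.m with a fixed C (and kernel parameter), and the C
    with the lowest average error would be selected.
    """
    folds = k_fold_indices(m, k)
    errs = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        errs.append(train_and_eval(train_idx, val_idx))
    return sum(errs) / k
```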
3. Let K1 : X × X →R and K2 : X × X →R be two symmetric, positive definite kernel functions, and for
simplicity, assume that each implements dot products in some finite-dimensional space, so that there
are vector mappings φ1 : X →Rd1 and φ2 : X →Rd2 for some d1 , d2 ∈ Z+ such that
K1(x, x′) = φ1(x)⊤φ1(x′),    K2(x, x′) = φ2(x)⊤φ2(x′)    ∀x, x′ ∈ X.
For each of the following functions K : X × X →R, either find a vector mapping φ : X →Rd for some
suitable d ∈ Z+ such that K(x, x0 ) = φ(x)> φ(x0 ) ∀x, x0 , or explain why such a mapping cannot exist.
(a) K(x, x0 ) = c · K1 (x, x0 ), where c > 0
(b) K(x, x0 ) = K1 (x, x0 ) + K2 (x, x0 )
(c) K(x, x0 ) = K1 (x, x0 ) − K2 (x, x0 )
(d) K(x, x0 ) = K1 (x, x0 ) · K2 (x, x0 )
(e) K(x, x0 ) = K1 (f (x), f (x0 )), where f : X →X is any function.
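A quick numerical illustration of why part (c) behaves differently from the others: for any feature map φ, K(x, x) = φ(x)⊤φ(x) = ‖φ(x)‖² must be non-negative. The tiny feature maps below are made up purely for this demonstration:

```python
# Hypothetical feature maps on X = R, chosen only to illustrate the point:
# phi1(x) = [x]  and  phi2(x) = [x, x], so K1(x, x) = x^2 and K2(x, x) = 2 x^2.

def K1(x, z):
    return x * z        # phi1(x)' phi1(z)

def K2(x, z):
    return 2 * x * z    # phi2(x)' phi2(z)

def K_diff(x, z):
    return K1(x, z) - K2(x, z)

# If K_diff had a feature map phi, then K_diff(x, x) = ||phi(x)||^2 >= 0 for all x;
# but K_diff(1, 1) = 1 - 2 = -1 < 0, so no such map can exist for this pair.
```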
4. Consider a binary classification task on the 4-dimensional Boolean hypercube: X = {0, 1}4 . Draw
decision trees implementing classifiers given by the following Boolean functions:
(a) x1 ∨ x2
(b) x1 ∧ (x3 ∨ x4 )
(c) (x2 ∧ x3) ∨ (¬x2 ∧ ¬x3)
(d) (x1 ∧ x2 ) ∨ x4
5. Consider a binary classification task on X = R2 with the following set of 8 training examples:
     i    xi          yi
     1    (3, 9)      +1
     2    (7, 11.5)   −1
     3    (4.5, 6)    +1
     4    (7, 2)      +1
     5    (9, 9)      −1
     6    (10.5, 2)   +1
     7    (6, 12)     −1
     8    (6, 1)      +1
(a) What is the entropy of this set of training examples with respect to the class labels?
(b) Consider the following two splits when constructing a decision tree using the above set of examples:
x1 < 5? and x2 > 8? Which of these splits would you choose based on the entropy criterion?
(c) Which of the two splits in part (b) above would you choose based on the Gini index criterion?
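The two impurity measures used in parts (a)-(c) can be computed with small helpers (Python for illustration; the function names are this sketch's own):

```python
import math

def entropy(labels):
    """Binary entropy (base 2) of a list of +/-1 labels."""
    p = labels.count(1) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(labels):
    """Gini index 2 p (1 - p) of a list of +/-1 labels."""
    p = labels.count(1) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(labels, mask, impurity):
    """Size-weighted impurity of the two children induced by a boolean mask,
    e.g. mask[i] = (x1 of example i < 5)."""
    left = [y for y, b in zip(labels, mask) if b]
    right = [y for y, b in zip(labels, mask) if not b]
    m = len(labels)
    return len(left) / m * impurity(left) + len(right) / m * impurity(right)
```

The split with the lower weighted child impurity (equivalently, the higher impurity reduction from the parent) is the one the criterion prefers.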
6. Programming Exercise: Nearest Neighbor. Write a piece of MATLAB code to implement the
k-NN algorithm on a given binary classification data set (X, y), where X is an m × d matrix (m
instances, each of dimension d) and y is an m-dimensional vector (with yi ∈ {±1} being a binary label
associated with the i-th instance in X). Your program should take as input the training data (X, y),
the parameter k and a new instance x (d-dimensional); it should return as output a predicted label for
x, i.e. +1 or −1. You can use the provided code template kNN.m as a starting point. For this problem,
you are provided with a 2-dimensional synthetic data set of 1000 examples (see the folder Problem-6\Data
for the relevant files).
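The prediction step of k-NN can be sketched as follows (Python for illustration; the assignment asks for MATLAB via kNN.m, and the tie-breaking rule here is an assumption of this sketch):

```python
def knn_predict(X, y, x, k):
    """Predict a +/-1 label for instance x by majority vote among the k nearest
    training instances under Euclidean distance (ties in the vote go to +1)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X, y)
    )
    vote = sum(label for _, label in dists[:k])
    return 1 if vote >= 0 else -1
```

With odd k and ±1 labels the vote can never be zero, which is one reason odd values of k are conventional for binary problems.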
(a) Learning curve: For k = 5, run your implementation of the k-NN algorithm on different fractions (i.e.
10%, 20%, · · · , 100%) of the training data set (see the folder Problem-6\Data\TrainSubsets). In
each case, measure the classification error on the training subset used, as well as on the given test
set (you can use the provided code classification_error.m to help you compute the errors).
Plot the learning curve, i.e. a curve showing both the training error and the test error (on the y-axis)
as a function of the number of training examples used (on the x-axis).
(b) 1-NN : Repeat part (a) with k = 1. What do you observe? How do the training and test errors
differ from those obtained in part (a) above?
(c) Changing k: In this experiment, you will investigate the effect of increasing k. For k in the range
{1, 9, 49, 99, 499}, perform 5-fold cross-validation (the train and test data for
each fold are provided in a separate folder in Problem-6\Data\CrossValidation). Report the
average cross-validation error (across the five folds) for each k and identify the k that achieves the
lowest average cross-validation error (resolving ties in favor of the smallest value of k).
(d) Visualizing decision boundaries: Use the provided code decision_boundary_NN.m to visualize the
decision boundaries obtained with k ∈ {1, 49, 499} (include the resulting figures in your assignment
solutions). Which of these models do you think has the most ‘complex’ decision boundary, and
which has the ‘simplest’? Which model do you think would have the best test performance and
why? Is your answer consistent with your observations from part (c)?
7. Recall that for any binary classification algorithm which, given a training sample S ∈ (X × {±1})^m,
returns a binary classifier hS : X → {±1} from a function class H of finite VC-dimension, and for any
distribution D on X × {±1} and confidence parameter δ ∈ (0, 1), with probability at least 1 − δ over
the draw of S ∼ D^m, the generalization error of the learned function hS can be upper bounded in
terms of its training error as follows:
    er^{0-1}_D[hS] ≤ êr^{0-1}_S[hS] + √( 8 (VCdim(H) · (ln(2m) + 1) + ln(4/δ)) / m ).
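This bound can be turned into a small calculator (Python for illustration; the VC dimensions to plug in for the three hypothesis classes are part of what the problem asks you to work out, so none are hard-coded here):

```python
import math

def vc_generalization_bound(train_err, m, vcdim, delta):
    """Upper bound on the 0-1 generalization error:
    train_err + sqrt( 8 * (vcdim * (ln(2m) + 1) + ln(4/delta)) / m ).
    """
    slack = math.sqrt(8.0 * (vcdim * (math.log(2 * m) + 1) + math.log(4.0 / delta)) / m)
    return train_err + slack
```

Since the slack term shrinks like 1/√m, the same training errors yield tighter bounds for the larger sample in part (b) than in part (a).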
Let X = R2 , and suppose that given a training sample S, you use the SVM algorithm to learn a linear
classifier h1S and kernel-based classifiers h2S , h3S with polynomial kernels of degrees 2 and 3, respectively.
(a) Suppose you are given a training sample S containing 2000 examples; assume all examples are
drawn iid from some unknown distribution D. Say you learn three classifiers as above and observe
they have the following training errors:
    êr^{0-1}_S[h^1_S] = 0.19 ;    êr^{0-1}_S[h^2_S] = 0.11 ;    êr^{0-1}_S[h^3_S] = 0.08 .
Compute high confidence upper bounds on the generalization error of the three classifiers, using
confidence parameter δ = 0.01. If you were to select one of these classifiers based on the training
errors, which classifier would you select? If you were to select a classifier using the 99% confidence
generalization error bounds you have derived, which classifier would you select?
(b) Now suppose you are given a training sample S containing 10, 000 examples; again, assume all
examples are drawn iid from some unknown distribution D. Say you learn three classifiers as
above and observe they have the same training errors as in part (a):
    êr^{0-1}_S[h^1_S] = 0.19 ;    êr^{0-1}_S[h^2_S] = 0.11 ;    êr^{0-1}_S[h^3_S] = 0.08 .
Repeat the calculations of part (a). If you were to select one of these classifiers based on the
training errors, which classifier would you select? If you were to select a classifier using the 99%
confidence generalization error bounds you have derived, which classifier would you select?
8. Reading Exercise. In an ordinal regression problem, there are r > 2 classes (labels) that have an
‘ordered’ structure (for example, 1 to 5 stars in a customer-product rating prediction task, or Gleason
grade 1 to 5 in a prostate cancer diagnosis task). Here Y = Ŷ = {1, . . . , r} as in multiclass classification,
but the loss used for measuring the performance of a prediction model h : X → [r] takes into account the
ordered structure of the classes: e.g., one often uses the loss function ℓ_ord : [r] × [r] → {0, 1, . . . , r − 1}
defined as ℓ_ord(y, ŷ) = |ŷ − y|, which measures the absolute difference between the predicted class
label and the true label. Read the following paper, which develops an elegant support vector machine
algorithm for this problem:
W. Chu and S. S. Keerthi, Support vector ordinal regression. Neural Computation, 19(3):792–815,
2007.