E0 270 Machine Learning, January Term 2015
Assignment 2
Due: Feb 10, 2015
1. (a) Let S = ((x1, y1), . . . , (xm, ym)) ∈ (R^d × {±1})^m, and suppose you want to assign different costs
C1 and C2 to margin violations for positive and negative examples, respectively. Write down a
soft-margin linear SVM formulation for this setting. Derive its dual and explain how the final
solution (ŵ, b̂) is obtained from the solution to the dual.
(b) Suppose you wanted to minimize the cost-sensitive loss ℓ_c : {±1} × {±1} → {0, c, 1 − c} (where
c ∈ (0, 1)) defined as

        ℓ_c(y, ŷ) = c        if y = −1 and ŷ = 1,
                    1 − c    if y = 1 and ŷ = −1,
                    0        if ŷ = y.
How would you select C1 and C2 in part (a) above?
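For reference (this is background, not a solution to part (a)), the standard cost-symmetric soft-margin SVM primal that part (a) asks you to generalize is:

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\|w\|^2 \;+\; C \sum_{i=1}^{m} \xi_i
\qquad \text{s.t.} \qquad
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m.
```

In the cost-sensitive setting, the single parameter C is replaced by separate weights on the slack variables of positive and negative examples; the same Lagrangian machinery used for the standard dual then applies.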
2. Programming Exercise: Support Vector Machine. Write a piece of MATLAB code to learn
an SVM classifier from the spam classification data set as described in Problem 6 in Assignment 1:
your program should take as input the training set (X, y), parameter C, kernel type (which can be
‘linear’, ‘poly’ or ‘rbf’; use the provided code compute_kernel.m for computing these kernels), and
kernel parameter (which you can take here to be a single real number r; this is used as the degree
parameter q in the polynomial kernel K(x, x′) = (x⊤x′ + 1)^q, and as the width parameter σ in the
RBF kernel K(x, x′) = exp(−‖x − x′‖²₂ / (2σ²))); the program should return as output an SVM model (consisting
of the kernel type and parameters, the support vectors, the coefficients corresponding to the support
vectors and the bias term). You can use the provided template SVM_learner.m as a starting point.
Use MATLAB’s quadratic program solver quadprog to solve the SVM dual problem.
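For illustration only (the assignment requires MATLAB and the provided compute_kernel.m), the three kernel formulas above can be sketched in Python; the function name and signature below are assumptions of this sketch, not the provided interface:

```python
import math

def compute_kernel(x, z, kind, r=1.0):
    """Evaluate K(x, z) for kind in {'linear', 'poly', 'rbf'}.

    x, z are equal-length lists of floats; r plays the role of the degree q
    for 'poly' and the width sigma for 'rbf', mirroring the problem statement.
    """
    dot = sum(a * b for a, b in zip(x, z))
    if kind == "linear":
        return dot
    if kind == "poly":
        return (dot + 1.0) ** r                     # K(x, z) = (x'z + 1)^q
    if kind == "rbf":
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
        return math.exp(-sq_dist / (2.0 * r ** 2))  # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    raise ValueError("kind must be 'linear', 'poly' or 'rbf'")
```

Note that for the RBF kernel, K(x, x) = 1 for every x, which is a handy sanity check on any implementation.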
(a) Run your implementation of the SVM learning algorithm with a linear kernel on the spam classification data set (see folder Problem-2\Spambase), selecting C from the range {1, 10, 10^2, 10^3, 10^4}
through 5-fold cross-validation on the training set. As in Problem 6(b) in Assignment 1, report
the average cross-validation error for each value of C in this range and the training and test errors
achieved by the best value of C.
(b) You are provided a 2-dimensional synthetic data set for this problem. Run your implementation
of SVM on the training set provided using a linear, degree-2 polynomial, degree-3 polynomial
and RBF kernel, selecting C from the range {1, 10, 10^2, 10^3, 10^4} via 5-fold cross-validation on the
training set; in the case of the RBF kernel, in addition to C, you will also have to choose the
width parameter σ from the range {1/32, 1/4, 1, 4, 32} using cross-validation. Report the average
cross-validation error for each value of C (and σ) and the training and test errors achieved by
the best parameter value(s). Also, draw a 2-d scatter plot of the test instances and visualize how
well the decision boundaries learnt using the different kernels separate the positive and negative
test instances; you can use the provided code decision_boundary_SVM.m to visualize the decision
boundaries. (See folder Problem-2\Synthetic for the relevant files.)
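The 5-fold cross-validation used in both parts can be organized around a generic fold-splitting helper. The sketch below is in Python for illustration (the assignment expects MATLAB), and the contiguous-fold construction is an assumption of this sketch:

```python
def k_fold_indices(m, k=5):
    """Partition indices 0..m-1 into k contiguous folds (sizes differ by at most 1)."""
    base, extra = divmod(m, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validation_error(m, train_and_eval, k=5):
    """Average validation error over k folds.

    train_and_eval(train_idx, val_idx) -> validation error; in the assignment this
    would wrap SVM_learner.m with a fixed C (and kernel parameter), and the C
    with the lowest average error would be selected.
    """
    folds = k_fold_indices(m, k)
    errs = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        errs.append(train_and_eval(train_idx, val_idx))
    return sum(errs) / k
```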
3. Let K1 : X × X →R and K2 : X × X →R be two symmetric, positive definite kernel functions, and for
simplicity, assume that each implements dot products in some finite-dimensional space, so that there
are vector mappings φ1 : X →Rd1 and φ2 : X →Rd2 for some d1 , d2 ∈ Z+ such that
K1(x, x′) = φ1(x)⊤φ1(x′),    K2(x, x′) = φ2(x)⊤φ2(x′)    ∀x, x′ ∈ X.
For each of the following functions K : X × X →R, either find a vector mapping φ : X →Rd for some
suitable d ∈ Z+ such that K(x, x0 ) = φ(x)> φ(x0 ) ∀x, x0 , or explain why such a mapping cannot exist.
(a) K(x, x0 ) = c · K1 (x, x0 ), where c > 0
(b) K(x, x0 ) = K1 (x, x0 ) + K2 (x, x0 )
(c) K(x, x0 ) = K1 (x, x0 ) − K2 (x, x0 )
(d) K(x, x0 ) = K1 (x, x0 ) · K2 (x, x0 )
(e) K(x, x0 ) = K1 (f (x), f (x0 )), where f : X →X is any function.
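A quick numerical illustration of why part (c) behaves differently from the others: for any feature map φ, K(x, x) = φ(x)⊤φ(x) = ‖φ(x)‖² must be non-negative. The tiny feature maps below are made up purely for this demonstration:

```python
# Hypothetical feature maps on X = R, chosen only to illustrate the point:
# phi1(x) = [x]  and  phi2(x) = [x, x], so K1(x, x) = x^2 and K2(x, x) = 2 x^2.

def K1(x, z):
    return x * z        # phi1(x)' phi1(z)

def K2(x, z):
    return 2 * x * z    # phi2(x)' phi2(z)

def K_diff(x, z):
    return K1(x, z) - K2(x, z)

# If K_diff had a feature map phi, then K_diff(x, x) = ||phi(x)||^2 >= 0 for all x;
# but K_diff(1, 1) = 1 - 2 = -1 < 0, so no such map can exist for this pair.
```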
4. Consider a binary classification task on the 4-dimensional Boolean hypercube: X = {0, 1}4 . Draw
decision trees implementing classifiers given by the following Boolean functions:
(a) x1 ∨ x2
(b) x1 ∧ (x3 ∨ x4 )
(c) (x2 ∧ x3) ∨ (¬x2 ∧ ¬x3)
(d) (x1 ∧ x2 ) ∨ x4
5. Consider a binary classification task on X = R2 with the following set of 8 training examples:
     i    xi          yi
     1    (3, 9)      +1
     2    (7, 11.5)   −1
     3    (4.5, 6)    +1
     4    (7, 2)      +1
     5    (9, 9)      −1
     6    (10.5, 2)   +1
     7    (6, 12)     −1
     8    (6, 1)      +1
(a) What is the entropy of this set of training examples with respect to the class labels?
(b) Consider the following two splits when constructing a decision tree using the above set of examples:
x1 < 5? and x2 > 8? Which of these splits would you choose based on the entropy criterion?
(c) Which of the two splits in part (b) above would you choose based on the Gini index criterion?
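The two impurity measures used in parts (a)-(c) can be computed with small helpers (Python for illustration; the function names are this sketch's own):

```python
import math

def entropy(labels):
    """Binary entropy (base 2) of a list of +/-1 labels."""
    p = labels.count(1) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(labels):
    """Gini index 2 p (1 - p) of a list of +/-1 labels."""
    p = labels.count(1) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(labels, mask, impurity):
    """Size-weighted impurity of the two children induced by a boolean mask,
    e.g. mask[i] = (x1 of example i < 5)."""
    left = [y for y, b in zip(labels, mask) if b]
    right = [y for y, b in zip(labels, mask) if not b]
    m = len(labels)
    return len(left) / m * impurity(left) + len(right) / m * impurity(right)
```

The split with the lower weighted child impurity (equivalently, the higher impurity reduction from the parent) is the one the criterion prefers.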
6. Programming Exercise: Nearest Neighbor. Write a piece of MATLAB code to implement the
k-NN algorithm on a given binary classification data set (X, y), where X is an m × d matrix (m
instances, each of dimension d) and y is an m-dimensional vector (with yi ∈ {±1} being a binary label
associated with the i-th instance in X). Your program should take as input the training data (X, y),
the parameter k and a new instance x (d-dimensional); it should return as output a predicted label for
x, i.e. +1 or −1. You can use the provided code template kNN.m as a starting point. For this problem,
you are provided with a 2-dimensional synthetic data set of 1000 examples (see the folder Problem-6\Data
for the relevant files).
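The prediction step of k-NN can be sketched as follows (Python for illustration; the assignment asks for MATLAB via kNN.m, and the tie-breaking rule here is an assumption of this sketch):

```python
def knn_predict(X, y, x, k):
    """Predict a +/-1 label for instance x by majority vote among the k nearest
    training instances under Euclidean distance (ties in the vote go to +1)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X, y)
    )
    vote = sum(label for _, label in dists[:k])
    return 1 if vote >= 0 else -1
```

With odd k and ±1 labels the vote can never be zero, which is one reason odd values of k are conventional for binary problems.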
(a) Learning curve: For k = 5, run your implementation of the k-NN algorithm on different fractions (i.e.
10%, 20%, · · · , 100%) of the training data set (see the folder Problem-6\Data\TrainSubsets). In
each case, measure the classification error on the training subset used, as well as on the given test
set (you can use the provided code classification_error.m to help you compute the errors).
Plot the learning curve, i.e. a curve showing both the training error and the test error (on the y-axis)
as a function of the number of training examples used (on the x-axis).
(b) 1-NN : Repeat part (a) with k = 1. What do you observe? How do the training and test errors
differ from those obtained in part (a) above?
(c) Changing k: In this experiment, you will investigate the effect of increasing k. For k in the range
{1, 9, 49, 99, 499}, perform 5-fold cross-validation (the train and test data for
each fold are provided in a separate folder in Problem-6\Data\CrossValidation). Report the
average cross-validation error (across the five folds) for each k and identify the k that achieves the
lowest average cross-validation error (resolving ties in favor of the smallest value of k).
(d) Visualizing decision boundaries: Use the provided code decision_boundary_NN.m to visualize the
decision boundaries obtained with k ∈ {1, 49, 499} (include the resulting figures in your assignment
solutions). Which of these models do you think has the most ‘complex’ decision boundary, and
which has the ‘simplest’? Which model do you think would have the best test performance and
why? Is your answer consistent with your observations from part (c)?
7. Recall that for any binary classification algorithm which, given a training sample S ∈ (X × {±1})^m,
returns a binary classifier hS : X → {±1} from a function class H of finite VC-dimension, and for any
distribution D on X × {±1} and confidence parameter δ ∈ (0, 1), with probability at least 1 − δ over
the draw of S ∼ D^m, the generalization error of the learned function hS can be upper bounded in
terms of its training error as follows:
    er^{0-1}_D[hS] ≤ êr^{0-1}_S[hS] + √( 8 (VCdim(H) · (ln(2m) + 1) + ln(4/δ)) / m ).
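This bound can be turned into a small calculator (Python for illustration; the VC dimensions to plug in for the three hypothesis classes are part of what the problem asks you to work out, so none are hard-coded here):

```python
import math

def vc_generalization_bound(train_err, m, vcdim, delta):
    """Upper bound on the 0-1 generalization error:
    train_err + sqrt( 8 * (vcdim * (ln(2m) + 1) + ln(4/delta)) / m ).
    """
    slack = math.sqrt(8.0 * (vcdim * (math.log(2 * m) + 1) + math.log(4.0 / delta)) / m)
    return train_err + slack
```

Since the slack term shrinks like 1/√m, the same training errors yield tighter bounds for the larger sample in part (b) than in part (a).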
Let X = R2 , and suppose that given a training sample S, you use the SVM algorithm to learn a linear
classifier h1S and kernel-based classifiers h2S , h3S with polynomial kernels of degrees 2 and 3, respectively.
(a) Suppose you are given a training sample S containing 2000 examples; assume all examples are
drawn iid from some unknown distribution D. Say you learn three classifiers as above and observe
they have the following training errors:
    êr^{0-1}_S[h^1_S] = 0.19 ;    êr^{0-1}_S[h^2_S] = 0.11 ;    êr^{0-1}_S[h^3_S] = 0.08 .
Compute high confidence upper bounds on the generalization error of the three classifiers, using
confidence parameter δ = 0.01. If you were to select one of these classifiers based on the training
errors, which classifier would you select? If you were to select a classifier using the 99% confidence
generalization error bounds you have derived, which classifier would you select?
(b) Now suppose you are given a training sample S containing 10, 000 examples; again, assume all
examples are drawn iid from some unknown distribution D. Say you learn three classifiers as
above and observe they have the same training errors as in part (a):
    êr^{0-1}_S[h^1_S] = 0.19 ;    êr^{0-1}_S[h^2_S] = 0.11 ;    êr^{0-1}_S[h^3_S] = 0.08 .
Repeat the calculations of part (a). If you were to select one of these classifiers based on the
training errors, which classifier would you select? If you were to select a classifier using the 99%
confidence generalization error bounds you have derived, which classifier would you select?
8. Reading Exercise. In an ordinal regression problem, there are r > 2 classes (labels) that have an
‘ordered’ structure (for example, 1 to 5 stars in a customer-product rating prediction task, or Gleason
grade 1 to 5 in a prostate cancer diagnosis task). Here Y = Ŷ = {1, . . . , r} as in multiclass classification,
but the loss used for measuring the performance of a prediction model h : X → [r] takes into account the
ordered structure of the classes: e.g., one often uses the loss function ℓ_ord : [r] × [r] → {0, 1, . . . , r − 1}
defined as ℓ_ord(y, ŷ) = |ŷ − y|, which measures the absolute difference between the predicted class
label and the true label. Read the following paper, which develops an elegant support vector machine
algorithm for this problem:
W. Chu and S. S. Keerthi, Support vector ordinal regression. Neural Computation, 19(3):792–815,
2007.