# CIS520:MachineLearning

## Fall 09


# Support Vector Machines

Historically, Support Vector Machines (SVMs) were motivated by the notion of the maximum margin separating hyperplane. Before we explore this motivation, which is a bit of a MacGuffin, let's relate it to the other linear classification algorithms: we have seen that Logistic Regression and Boosting minimize a convex upper-bound on the 0/1 loss for binary classification: logistic in one case and exponential in the other. Support Vector Machines minimize the hinge loss -- also an upper-bound on the 0/1 loss.

A few things to note about the hinge loss.

• It's convex like the exponential and logistic loss: optimization is fairly easy
• It grows slower below zero: the penalty for misclassification is more mild
• It is zero after 1: once correct and confident enough, there is no bonus for driving y f(x) higher
• It's non-differentiable at the hinge point: simple gradient doesn't work

Why all the excitement about this loss? First, it has a nice geometric interpretation (at least in the linearly separable case) as leading to a maximum margin hyperplane separating positive from negative examples. Second, more importantly, it fits particularly well with kernels because many examples will not be a part of the solution, leading to a much more efficient dual representation and algorithms.

## Hyperplanes and Margins

Consider the case of linearly separable data. In the toy 2D example below, there are many hyperplanes (lines) that separate the negative examples (blue) from positive (green). If we define the margin of a hyperplane as the distance of the closest point to the decision boundary, the maximum margin hyperplane is shown below.

We will denote our linear classifier as (note the constant term b is separated out)

h(\bx) = sign(\bw^\top\bx + b)

Since rescaling the classifier by multiplying it by a positive constant does not change the decision boundary, sign(\bw^\top\bx + b) = sign(\gamma (\bw^\top\bx + b)), we will set the scale so that at the points closest to the boundary, \bw^\top\bx + b = \pm 1, as shown in the figure below.

Now we will compute the margin, which is the distance of the points closest to the boundary, using one point from each class, \bx_1 and \bx_2.

\bw^\top\bx_1 + b = -1 \; \; {\rm and} \;\; \bw^\top\bx_2 + b = 1

Hence,

\bw^\top(\bx_2-\bx_1) = 2 \;\; \rightarrow \;\; \frac{\bw}{2||\bw||_2}(\bx_2-\bx_1) = \frac{1}{||\bw||_2}

Note that \frac{\bw}{2||\bw||_2}(\bx_2-\bx_1) is precisely the distance to the hyperplane: the vector \frac{1}{2}(\bx_2-\bx_1) projected on the unit vector \frac{\bw}{||\bw||_2}. Let's restate this more precisely. A rescaled hyperplane classifier satisfies:

y_i(\bw^\top\bx_i + b) \ge 1,\;\;\; i=1,\ldots,n

(with equality for the closest points) and has margin \frac{1}{||\bw||_2}.

To find the maximal margin hyperplane, we can simply maximize \frac{1}{||\bw||_2}, which is equivalent to minimizing \frac{1}{2}\bw^\top\bw subject to the (scaled) correct classification constraints. The SVM optimization problem (in the separable case) is:

\textbf{Separable SVM primal:}\;\;\;\ \min_{\bw,b} \frac {1}{2}\bw^\top\bw \;\; {\rm s.t.} \;\; y_i(\bw^\top\bx_i + b) \ge 1, \;\;\ i=1,\ldots,n

Note on notation: I will use \min/\max instead of \inf/\sup since the continuous nature of the optimization will be obvious from context. This optimization problem has a quadratic objective and linear inequality constraints. Note that only the points closest to the boundary participate in defining the SVM hyperplane. These points are called the support vectors, and we will see that the solution can be expressed only using these points.

## The Dual View

In order to get more intuition about the problem, we can look at it from the dual perspective. Langrangian duality, reviewed in recitation, provides the right tool here. We formulate the Langrangian by introducing a non-negative multiplier \alpha_i for each inequality constraint:

L(\bw,b,\alpha) = \frac{1}{2}\bw^\top\bw + \sum_i \alpha_i (1-y_i(\bw^\top\bx_i + b))

Note that if \bw,b is feasible (i.e. satisfy all the constraints), then \max_{\alpha\ge 0} L(\bw,b,\alpha) = \frac{1}{2}\bw^\top\bw since all the coefficients (1-y_i(\bw^\top\bx_i + b)) are negative or zero. When \bw,b are not feasible, \max_{\alpha\ge 0} L(\bw,b,\alpha)= \infty. Assuming that feasible \bw,b exist, \min_{\bw,b} (\max_{\alpha\ge 0} L(\bw,b,\alpha)) solves the original problem, and since the original objective is convex, we can change the order of \min/\max to obtain a dual problem that has the same optimal value: \max_{\alpha\ge 0} (\min_{\bw,b} L(\bw,b,\alpha)). Taking derivatives with respect to \bw and b and setting them to zero, we get:

\frac{\partial L}{\partial b} = 0 \;\; \rightarrow \;\; \sum_i \alpha_i y_i = 0

and

\frac{\partial L}{\partial \bw} = 0 \;\; \rightarrow \;\; \bw = \sum_i \alpha_i y_i \bx_i

Note the dual representation of \bw: it is a linear combination of the examples, weighted by dual variables \alpha_i. Plugging in the expression for \bw, we get

L(b,\alpha) = \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j\bx_i^\top\bx_j+ \sum_i \alpha_i (1-y_i(\sum_j \alpha_j y_j\bx_j^\top\bx_i + b)) = - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j\bx_j^\top\bx_i+ \sum_i \alpha_i - \sum_i \alpha_i y_i b

Since \sum_i \alpha_i y_i = 0, the dual problem is:

\textbf{Separable SVM dual:}\;\;\;\ \max_{\alpha\ge 0} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j\bx_i^\top\bx_j, \;\; {\rm s.t. }\;\; \sum_i \alpha_i y_i = 0

This dual optimization problem, as the primal, has a quadratic objective and (a single) linear constraint. Because of the (KKT) optimality conditions, when \bx_i is not the closest to the boundary, \alpha_i is zero. So only the points that support the hyperplane, called support vectors, participate in the solution. Given a solution \alpha to the problem, we can recover the primal via:

\bw = \sum_i \alpha_i y_i \bx_i
To recover b, note that for any support vector i, where \alpha_i>0, we have y_i(\bw^\top\bx_i+b)=1, so

b = y_i-\bw^\top\bx_i, \;\; \forall i\;\; \alpha_i > 0.

## Kernels

We can use the kernel trick again, by assuming a feature map \phi(\bx) such that k(\bx_i,\bx_j) = \phi(\bx_i)^\top\phi(\bx_j). The kernelized dual optimization problem is

\textbf{Kernelized separable dual:}\;\;\;\ \max_{\alpha\ge 0} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_jk(\bx_i,\bx_j), \;\; {\rm s.t. }\;\; \sum_i \alpha_i y_i = 0

Given a solution \alpha to the problem, we can recover

b = y_i-\sum_j\alpha_j y_j k(\bx_j,\bx_i)\;\; \forall i\;\; \alpha_i > 0

and the prediction:

\bw^\top \bx +b = \sum_i \alpha_i y_i k(\bx_i,\bx) + b

## Non-separable case

When the data is not separable, the above optimization will fail. To handle the case of points on the wrong side of the fence, SVMs introduce a penalty. Although we might perhaps like to minimize the number of misclassified points, this is an NP-hard problem. Instead, SVMs penalize points where y_i(\bw^\top \bx +b) < 1 using a linear cost, \max(0,1-y_i(\bw^\top \bx +b)), the hinge function. In the primal optimization problem above, this can be accomplished by adding a slack variable, \xi_i for each example and penalizing their sum, weighted by the positive slack penalty constant C:

\textbf{Hinge primal:}\;\;\; \min_{\bw,b,\xi\ge0} \frac {1}{2}\bw^\top\bw + C\sum_i\xi_i\;\; {\rm s.t.} \;\; y_i(\bw^\top\bx_i + b) \ge 1-\xi_i, \;\;\ i=1,\ldots,n

For any given \bw,b, since we're minimizing over \xi, we will have \xi_i = \max(0,1-y_i(\bw^\top \bx +b)) for each i.

Taking the dual, we get:

\textbf{Hinge dual:}\;\;\; \max_{\alpha\ge 0} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j\bx_i^\top\bx_j, \;\; {\rm s.t. }\;\; \sum_i \alpha_i y_i = 0, \; \alpha_i\le C

Here's a short derivation. In addition to the \alphas, we introduce a non-negative Lagrange multiplier \lambda_i for each \xi_i \ge 0 constraint.

L(\bw,b,\xi,\alpha,\lambda) = \frac{1}{2}\bw^\top\bw + C\sum_i \xi_i + \sum_i \alpha_i (1-\xi_i-y_i(\bw^\top\bx_i + b)) - \sum_i \lambda_i \xi_i
\frac{\partial L}{\partial b} = 0 \;\; \rightarrow \;\; \sum_i \alpha_i y_i = 0
\frac{\partial L}{\partial \bw} = 0 \;\; \rightarrow \;\; \bw = \sum_i \alpha_i y_i \bx_i
\frac{\partial L}{\partial \xi_i} = 0 \;\; \rightarrow \;\; C -\alpha_i - \lambda_i= 0

Plugging in the expression for \bw and using the condition \sum_i \alpha_i y_i = 0 as before, we have

L(\xi,\alpha,\lambda) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j\bx_i^\top\bx_j + \sum_i \xi_i (C-\alpha_i - \lambda_i)

Now using the condition C -\alpha_i - \lambda_i= 0, the last sum vanishes, so we are left with the dual:

\max_{\alpha\ge 0,\lambda\ge 0} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j\bx_i^\top\bx_j, \;\; {\rm s.t. }\;\; \sum_i \alpha_i y_i = 0, \;\;\ C-\alpha_i - \lambda_i=0

We can get rid of \lambda_is by noting that they only appear in the constraint C-\alpha_i - \lambda_i=0, and since they must be non-negative, we can just replace the constraint with C \ge \alpha_i, as in the dual stated above.

Note that the same dual representation properties hold for this version:

\bw = \sum_i \alpha_i y_i \bx_i
and the kernelized version is just:
\textbf{Hinge kernelized dual:}\;\;\;\max_{\alpha\ge 0} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_jk(\bx_i,\bx_j), \;\; {\rm s.t. }\;\; \sum_i \alpha_i y_i = 0, \; \alpha_i\le C

Given a solution \alpha to the problem, we can recover

b = y_i-\sum_j\alpha_j y_j k(\bx_j,\bx_i)\;\; \forall i\;\; C > \alpha_i > 0

and the prediction:

\bw^\top \bx +b = \sum_i \alpha_i y_i k(\bx_i,\bx) + b

## Solving SVMs

We will not spend too much time on how to optimize quadratic programs with linear constraints. There is a very large literature on the topic of general convex optimization and SVM optimization in particular, and many off-the-shelf efficient tools have been developed for both. In recent years, simple algorithms (e.g., SMO, subgradient) that find approximate answers to the SVM problem have been shown to work quite well. Here is a an applet for kernel SVMs in 2D.