2024 IAIO Question 5.3

Novikoff’s theorem guarantees convergence provided there is a weight vector that correctly classifies the training data. Assuming the training examples are distinct, consider the modification of a normalised kernel \kappa to

\tilde{\kappa}(\mathbf{x}, \mathbf{z}) = \kappa(\mathbf{x}, \mathbf{z}) + a \delta(\mathbf{x}, \mathbf{z}),

where a > 0 and

\delta(\mathbf{x}, \mathbf{z}) = \left\{\begin{array}{ll} 1& \mbox{if } \mathbf{x} = \mathbf{z}, \\ 0& \mbox{otherwise}. \end{array}\right.

Show we can find a vector \mathbf{\alpha} such that \tilde{K}\mathbf{\alpha} = \mathbf{y} for the label vector \mathbf{y} = (y_1, \ldots, y_m)^\top, where \tilde{K} is the kernel matrix of the modified kernel \tilde{\kappa}. Hence, argue that the training set is separable in the feature space defined by the mapping \tilde{\phi} of the kernel \tilde{\kappa}.
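As an aside before the proof, here is a minimal numerical sketch of the modification, assuming a Gaussian (RBF) base kernel (which is normalised, so \kappa(\mathbf{x}, \mathbf{x}) = 1) and hypothetical parameter values; it only illustrates that, for distinct inputs, the modification adds a to the diagonal of the kernel matrix.

```python
import numpy as np

def kappa(x, z, gamma=1.0):
    # Assumed base kernel: Gaussian RBF, which is normalised (kappa(x, x) = 1).
    return np.exp(-gamma * np.sum((x - z) ** 2))

def delta(x, z):
    # Indicator that the two inputs are identical.
    return 1.0 if np.array_equal(x, z) else 0.0

def kappa_tilde(x, z, a=0.5, gamma=1.0):
    # Modified kernel: kappa plus a bump of size a on identical inputs.
    return kappa(x, z, gamma) + a * delta(x, z)

# Because the training examples below are distinct, the modification only
# adds a to the diagonal of the kernel matrix: K_tilde = K + a * I.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = np.array([[kappa(xi, xj) for xj in X] for xi in X])
K_tilde = np.array([[kappa_tilde(xi, xj) for xj in X] for xi in X])
print(np.allclose(K_tilde, K + 0.5 * np.eye(len(X))))  # True
```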

First, we prove that there exists \mathbf{\alpha} satisfying the equation \tilde{K}\mathbf{\alpha} = \mathbf{y}.

For this, it suffices to show that \tilde{K} is positive definite and therefore invertible.

Since the training examples are distinct, \delta(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{1}(i = j), so \tilde{K} = K + a I_m, where K is the kernel matrix of \kappa and I_m is the m \times m identity matrix. For any \mathbf{z} \in \mathbb{R}^m with \mathbf{z} \neq \mathbf{0}, we have

\begin{align*}
\mathbf{z}^\top \tilde{K} \mathbf{z}
& = \sum_{i = 1}^m \sum_{j = 1}^m z_i \, \tilde{\kappa}\left( \mathbf{x}_i, \mathbf{x}_j \right) z_j \\
& = \sum_{i = 1}^m \sum_{j = 1}^m z_i \left( \kappa\left( \mathbf{x}_i, \mathbf{x}_j \right) + a \, \delta\left( \mathbf{x}_i, \mathbf{x}_j \right) \right) z_j \\
& = \sum_{i = 1}^m \sum_{j = 1}^m z_i \left( \phi\left( \mathbf{x}_i \right)^\top \phi\left( \mathbf{x}_j \right) + a \, \mathbf{1}\left( i = j \right) \right) z_j \\
& = \sum_{i = 1}^m \sum_{j = 1}^m z_i \, \phi\left( \mathbf{x}_i \right)^\top \phi\left( \mathbf{x}_j \right) z_j + a \sum_{i = 1}^m \sum_{j = 1}^m z_i \, \mathbf{1}\left( i = j \right) z_j \\
& = \left\| \sum_{i = 1}^m z_i \, \phi\left( \mathbf{x}_i \right) \right\|^2 + a \sum_{i = 1}^m z_i^2 \\
& > 0 ,
\end{align*}

where \phi is the feature map of \kappa. The first term is a squared norm and hence non-negative, while the second term is strictly positive because a > 0 and \mathbf{z} \neq \mathbf{0}.

Therefore, \tilde{K} is positive definite and hence invertible, so the equation \tilde{K}\mathbf{\alpha} = \mathbf{y} has the (unique) solution \mathbf{\alpha} = \tilde{K}^{-1}\mathbf{y}.
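As a sanity check of this step, here is a small numerical sketch, again assuming a Gaussian (RBF) base kernel and arbitrary labels; it confirms that \tilde{K} = K + a I_m is positive definite and solves \tilde{K}\mathbf{\alpha} = \mathbf{y} numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
m, a = 20, 0.5
X = rng.normal(size=(m, 3))           # distinct training points (almost surely)
y = rng.choice([-1.0, 1.0], size=m)   # arbitrary +/-1 labels

# Normalised base kernel (Gaussian RBF) and its m x m kernel matrix K.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)

# For distinct examples the modified kernel matrix is K + a * I.
K_tilde = K + a * np.eye(m)
print(np.linalg.eigvalsh(K_tilde).min() > 0)   # True: smallest eigenvalue is at least a
alpha = np.linalg.solve(K_tilde, y)            # alpha with K_tilde @ alpha = y
print(np.allclose(K_tilde @ alpha, y))         # True
```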

Second, we prove that the training set is separable in the feature space defined by the mapping \tilde{\phi} of the kernel \tilde{\kappa}.

Let \mathbf{\tilde \alpha} be the solution to the equation \tilde{K} \mathbf{\tilde \alpha} = \mathbf{y}.

Since y_i \in \{-1, 1\}, we can define \mathbf{\alpha} componentwise by \alpha_i = \tilde \alpha_i y_i, so that

\mathbf{\tilde \alpha} = \mathbf{\alpha} \odot \mathbf{y} ,

where \odot denotes the elementwise product. We then define the weight vector

\mathbf{w} = \sum_{i=1}^m \tilde \alpha_i \tilde \phi \left( \mathbf{x}_i \right) = \sum_{i=1}^m \alpha_i y_i \tilde \phi \left( \mathbf{x}_i \right) ,

which has the standard dual form of a perceptron weight vector in the feature space of \tilde{\kappa}.

For any training example with index j, we have

\begin{align*}
y_j \mathbf{w}^\top \tilde \phi \left( \mathbf{x}_j \right)
& = y_j \sum_{i=1}^m \tilde \alpha_i \tilde \kappa \left( \mathbf{x}_i, \mathbf{x}_j \right) \\
& = y_j \sum_{i=1}^m \tilde \alpha_i \tilde \kappa \left( \mathbf{x}_j, \mathbf{x}_i \right) \\
& = y_j \left( \tilde{K} \mathbf{\tilde \alpha} \right)_j \\
& = y_j \cdot y_j \\
& = 1 \\
& > 0 ,
\end{align*}

where the second equality follows from the symmetry \tilde \kappa \left( \mathbf{x}_i, \mathbf{x}_j \right) = \tilde \kappa \left( \mathbf{x}_j, \mathbf{x}_i \right) (both \kappa and \delta are symmetric), the fourth equality follows from \tilde{K} \mathbf{\tilde \alpha} = \mathbf{y}, and the fifth from y_j \in \left\{ -1, 1 \right\}.

Since y_j \mathbf{w}^\top \tilde \phi \left( \mathbf{x}_j \right) > 0 for every j, the vector \mathbf{w} correctly classifies every training example, so the training set is separable in the feature space defined by \tilde{\phi}. This completes the proof.
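For completeness, a self-contained sketch of the separability conclusion under the same assumptions as above (Gaussian RBF base kernel, arbitrary labels): every functional margin y_j \mathbf{w}^\top \tilde \phi \left( \mathbf{x}_j \right) = y_j \left( \tilde{K} \mathbf{\tilde \alpha} \right)_j equals 1.

```python
import numpy as np

rng = np.random.default_rng(1)
m, a = 20, 0.5
X = rng.normal(size=(m, 3))           # distinct training points (almost surely)
y = rng.choice([-1.0, 1.0], size=m)   # arbitrary +/-1 labels

# Modified kernel matrix for an assumed Gaussian RBF base kernel: K + a * I.
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
K_tilde = K + a * np.eye(m)
alpha_tilde = np.linalg.solve(K_tilde, y)

# Functional margins y_j * w^T phi_tilde(x_j) = y_j * (K_tilde @ alpha_tilde)_j.
margins = y * (K_tilde @ alpha_tilde)
print(np.allclose(margins, 1.0))      # True: every training example is classified correctly
```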