2024 IAIO Question 5.4

Give an upper bound on the number of updates the Perceptron algorithm will make using the kernel \tilde{\kappa} in terms of the vector \alpha from Part 3.

1 Like

In this part, \mathbf{w} refers to the one defined in Part 3.

Let \mathbf{w}_t be the normal vector of the separating hyperplane after the $t$th update.
Let j_t be the index of the data point used in the $t$th update.

First, we compute < \mathbf{w}_t , \mathbf{w} >.

We have

\begin{align*} < \mathbf{w}_t , \mathbf{w} > & = < \mathbf{w}_{t-1} + y_{j_t} \tilde \phi \left( \mathbf{x}_{j_t} \right) , \mathbf{w} > \\ & = < \mathbf{w}_{t-1} , \mathbf{w} > + < y_{j_t} \tilde \phi \left( \mathbf{x}_{j_t} \right) , \mathbf{w} > \\ & = < \mathbf{w}_{t-1} , \mathbf{w} > + y_{j_t} \tilde \phi \left( \mathbf{x}_{j_t} \right)^\top \mathbf{w} \\ & = < \mathbf{w}_{t-1} , \mathbf{w} > + 1 , \end{align*}

where the last equality follows from the analysis in Part 3.

Because \mathbf{w}_0 = 0, we have

< \mathbf{w}_t , \mathbf{w} \> = t .

Second, we establish an upper bound of \left|\left| \mathbf{w}_t \right|\right|_2^2.

We have

\begin{align*} \left|\left| \mathbf{w}_t \right|\right|_2^2 & = < \mathbf{w}_t , \mathbf{w}_t > \\ & = < \mathbf{w}_{t-1} + y_{j_t} \tilde \phi \left( \mathbf{x}_{j_t} \right) , \mathbf{w}_{t-1} + y_{j_t} \tilde \phi \left( \mathbf{x}_{j_t} \right) > \\ & = < \mathbf{w}_{t-1} , \mathbf{w}_{t-1} > + < y_{j_t} \tilde \phi \left( \mathbf{x}_{j_t} \right) , y_{j_t} \tilde \phi \left( \mathbf{x}_{j_t} \right) > + 2 < \mathbf{w}_{t-1} , y_{j_t} \tilde \phi \left( \mathbf{x}_{j_t} \right) > \\ & = \left|\left| \mathbf{w}_{t-1} \right|\right|_2^2 + y_{j_t}^2 \tilde \kappa \left( \mathbf{x}_{j_t} , \mathbf{x}_{j_t} \right) + 2 < \mathbf{w}_{t-1} , y_{j_t} \tilde \phi \left( \mathbf{x}_{j_t} \right) > \\ & \leq \left|\left| \mathbf{w}_{t-1} \right|\right|_2^2 + y_{j_t}^2 \tilde \kappa \left( \mathbf{x}_{j_t} , \mathbf{x}_{j_t} \right) \\ & = \left|\left| \mathbf{w}_{t-1} \right|\right|_2^2 + \tilde \kappa \left( \mathbf{x}_{j_t} , \mathbf{x}_{j_t} \right) \\ & = \left|\left| \mathbf{w}_{t-1} \right|\right|_2^2 + 1 + a , \end{align*}

where the inequality follows from the Perceptron algorithm update criterion,
the second from the last equality follows from the property that y_{j_t} \in \left\{ -1 , 1 \right\},
the last equality follows from the property that \tilde \kappa \left( \mathbf{x}_{j_t} , \mathbf{x}_{j_t} \right) = \kappa \left( \mathbf{x}_{j_t} , \mathbf{x}_{j_t} \right) + a and \kappa is a normalized kernel that \kappa \left( \mathbf{x}_{j_t} , \mathbf{x}_{j_t} \right) = 1.

Because \mathbf{w}_0 = 0, we have

\mathbf{w}_t \leq t \left( 1 + a \right) .

Third, we establish an upper bound of \left|\left| \mathbf{w} \right|\right|_2^2.

We have

\begin{align*} \left|\left| \mathbf{w} \right|\right|_2^2 & = \mathbf{w}^\top \mathbf{w} \\ & = \left( \sum_{i=1}^m \tilde \alpha_i \tilde \phi \left( \mathbf{x}_i \right)^\top \right) \left( \sum_{j=1}^m \tilde \alpha_j \tilde \phi \left( \mathbf{x}_j \right) \right) \\ & = \sum_{i=1}^m \sum_{j=1}^m \tilde \alpha_i \tilde \kappa \left( \mathbf{x}_i , \mathbf{x}_j \right) \tilde \alpha_j \\ & = \tilde \alpha^T \tilde K \tilde \alpha \\ & = \left( \tilde K^{-1} \mathbf{y} \right)^T \mathbf{y} \\ & = \mathbf{y}^\top \tilde K^{-1, \top} \mathbf{y} \\ & = \mathbf{y}^\top \tilde K^{-1} \mathbf{y} \\ & = \left|\left| \mathbf{y}^\top \tilde K^{-1} \mathbf{y} \right|\right|_2 \\ & \leq \left|\left| \mathbf{y}^\top \right|\right|_2 \left|\left| \tilde K^{-1} \right|\right|_2 \left|\left| \mathbf{y} \right|\right|_2 \\ & \leq m^{1/2} \left|\left| \tilde K^{-1} \right|\right|_2 m^{1/2} \\ & = m \left|\left| \tilde K^{-1} \right|\right|_2 \\ & = \frac{m}{\lambda_{\min} \left( \tilde K \right)} \\ & \leq \frac{m}{a} , \end{align*}

where the seventh equality follows from the property that \tilde K is a symmetric matrix, the second inequality follows from the property that \mathbf{y} \in \left\{ -1 , 1 \right\}^m,
the last inequality follows from the property that the smallest eigenvalue of \tilde K satisfies \lambda_{\min} \left( \tilde K \right) = \lambda_{\min} \left( K \right) + a \geq a.

Following from the Cauchy-Schwarz inequality, we have

\left( < \mathbf{w}_t , \mathbf{w} > \right)^2 \leq \left|\left| \mathbf{w}_t \right|\right|_2^2 \left|\left| \mathbf{w} \right|\right|_2^2 .

Plugging all above three results, we get

t^2 \leq t \left( 1 + a \right) \cdot \frac{m}{a} .

Therefore, the number of updates on \mathbf{w}_t is upper bounded by

t \leq \frac{m \left( 1 + a \right)}{a} .