Implement and Apply a Multiclass Support Vector Machine (SVM) Classifier -- CS231n Exercise

I am currently working my way through the lectures for CS231n: Convolutional Neural Networks for Visual Recognition. I will post my solutions here.

In this exercise we are asked to train a linear classifier with the multiclass SVM loss function on the CIFAR-10 dataset.

Linear Classifier for Images

Following the lecture notes, we define the score function as

\[\begin{align*} f(X_i,W) = WX_i \end{align*}\]

In the CIFAR-10 example, \(X_i\) is a flattened vector of length 3x32x32 + 1 = 3073, where the additional dimension holds the constant 1 so that the bias is folded into the weight matrix.
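This is often called the bias trick. As a minimal sketch of what it looks like in numpy (the variable names here are my own, not from the assignment):

import numpy as np

# X_flat holds the flattened pixel data, one image per row (hypothetical name).
X_flat = np.random.randn(5, 3072)

# Append a column of ones so the bias is absorbed into the weight matrix.
X_flat = np.hstack([X_flat, np.ones((X_flat.shape[0], 1))])
print(X_flat.shape)  # (5, 3073)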

In an expanded format this is what the matrix multiplication will look like:

\[\begin{align*} \begin{pmatrix} W_{0,0} & W_{0,1} & \dots & W_{0,3072} \\ W_{1,0} & W_{1,1} & \dots & W_{1,3072} \\ \vdots & \vdots & \ddots & \vdots \\ W_{9,0} & W_{9,1} & \dots & W_{9,3072} \\ \end{pmatrix} \begin{pmatrix} X_{0,i} \\ X_{1,i} \\ \vdots \\ X_{3071,i} \\ 1 \end{pmatrix} = \begin{pmatrix} f(i,0) \\ f(i,1) \\ \vdots \\ f(i,9) \end{pmatrix} \end{align*}\]

where \(f(i,0)\) is the score for image \(i\) belonging to class 0, \(f(i,1)\) is the score for image \(i\) belonging to class 1, and so on.

Let's take a closer look at an element of the weight matrix, \(W_{k,l}\). Here \(k\) is the class and \(l\) is the pixel index. \(W_{k,l}\) is the weight with which pixel \(l\) contributes to the score for class \(k\). A large \(W_{k,l}\) means that a high value at pixel \(l\) pushes the score for class \(k\) up, i.e. the classifier has learned that this pixel is correlated with class \(k\) in the training set.

Multiclass SVM Loss Function

The SVM loss function is set up so that the correct class score \(f(i,y_i)\) should be higher than all the other class scores by at least a margin \(\Delta\). More precisely, the multiclass SVM loss for the \(i\)-th example is

\[\begin{align*} L_i = \sum_{j \neq y_i} \text{max}(0, f(i,j) - f(i,y_i) + \Delta) \end{align*}\]
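To make the formula concrete, suppose (with made-up numbers) that image \(i\) gets scores \(f(i,0) = 13\), \(f(i,1) = -7\), \(f(i,2) = 11\), the true class is \(y_i = 0\), and \(\Delta = 10\). Then

\[\begin{align*} L_i = \text{max}(0, -7 - 13 + 10) + \text{max}(0, 11 - 13 + 10) = 0 + 8 = 8 \end{align*}\]

Class 1 contributes nothing because its score is more than \(\Delta\) below the correct class score, while class 2 comes within \(\Delta\) of it and therefore contributes to the loss.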

Here is a naive way to calculate the loss for all images in the training set.

# compute the loss
num_classes = W.shape[1]
num_train = X.shape[0]
loss = 0.0

for i in range(num_train):
    # i is the image under consideration
    scores = X[i].dot(W)
    correct_class_score = scores[y[i]]
    for j in range(num_classes):
        # j is the class
        if j == y[i]:
            continue
        margin = scores[j] - correct_class_score + 1  # note delta = 1
        if margin > 0:
            loss += margin

How to Calculate the Gradient of the Loss

The full Multiclass SVM loss is given by

\[\begin{align*} L = \frac{1}{N}\sum_i L_i + \lambda\|W\|^2 \end{align*}\]

Where \(N\) is the number of images in the training set, \(\lambda\) is the regularization strength hyperparameter, and \(\|W\|^2\) is the squared L2 norm of the weight matrix \(W\) (the sum of the squares of all its elements).

To start with, let's take the derivative of \(L\) with respect to an arbitrary element of the weight matrix, \(W_{k,l}\).

\[\begin{align} \frac{\partial L}{\partial W_{k,l}} &= \frac{\partial}{\partial W_{k,l}}\left(\frac{1}{N}\sum_i L_i + \lambda\|W\|^2\right) \\ &= \frac{1}{N}\frac{\partial \sum_i L_i}{\partial W_{k,l}} + 2\lambda W_{k,l} \end{align}\]

Focusing on \(\sum_i L_i\):

\[\begin{align} \sum_i L_i &= \sum_{i} \sum_{j \neq y_i} \text{max}(0, f(i,j) - f(i,y_i) + \Delta) \\ &= \sum_{i} \sum_{j \neq y_i} \text{max}\left(0, \sum_h W_{j,h}X_{h,i} - \sum_h W_{y_i,h}X_{h,i} + \Delta\right) \end{align}\]

Taking the derivative with respect to \(W_{k,l}\), only the terms that contain row \(k\) of \(W\) survive:

\[\begin{align} \frac{\partial \sum_i L_i}{\partial W_{k,l}} &= \sum_{i, \text{where } y_i \neq k}\begin{cases} X_{l,i}, & \text{if } \sum_h W_{k,h}X_{h,i} - \sum_h W_{y_i,h}X_{h,i} + \Delta > 0\\ 0, & \text{otherwise} \end{cases} \\ &+ \sum_{i, \text{where } y_i = k} \sum_{j \neq y_i} \begin{cases} -X_{l,i}, & \text{if } \sum_h W_{j,h}X_{h,i} - \sum_h W_{y_i,h}X_{h,i} + \Delta > 0\\ 0, & \text{otherwise} \end{cases} \end{align}\]

In words: whenever an example \(i\) has a positive margin for a wrong class \(k\), \(X_{l,i}\) is added to the gradient of \(W_{k,l}\); for the true class of that example, \(X_{l,i}\) is subtracted once for every class with a positive margin.

Use the equations above to calculate the gradient and the loss within the same loops:

import numpy as np

dW = np.zeros(W.shape)  # initialize the gradient as zero

# compute the loss
num_classes = W.shape[1]
num_train = X.shape[0]
loss = 0.0

for i in range(num_train):
    # i is the image under consideration
    scores = X[i].dot(W)
    correct_class_score = scores[y[i]]
    for j in range(num_classes):
        # j is the class
        if j == y[i]:
            continue
        margin = scores[j] - correct_class_score + 1  # note delta = 1
        if margin > 0:
            loss += margin
            dW[:, j] += X[i, :]
            dW[:, y[i]] -= X[i, :]

Taking the regularization term into account:

# Right now the loss is a sum over all training examples, but we want it
# to be an average instead so we divide by num_train.
loss /= num_train
dW /= num_train

# Add regularization to the loss.
loss += reg * np.sum(W * W)
dW += 2*reg*W
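Before moving on, it is worth checking the analytic gradient against a numerical estimate. Below is a minimal sketch of a centered finite-difference check; it assumes the loops above are wrapped in a function svm_loss_naive(W, X, y, reg) that returns (loss, dW), and the helper name numerical_gradient_check is my own:

def numerical_gradient_check(f, W, analytic_grad, num_checks=5, h=1e-5):
    # Compare the analytic gradient to a centered finite-difference estimate
    # at a few randomly chosen entries of W.
    for _ in range(num_checks):
        ix = tuple(np.random.randint(d) for d in W.shape)
        old_val = W[ix]
        W[ix] = old_val + h
        loss_plus = f(W)
        W[ix] = old_val - h
        loss_minus = f(W)
        W[ix] = old_val  # restore the original value
        grad_numeric = (loss_plus - loss_minus) / (2 * h)
        grad_analytic = analytic_grad[ix]
        rel_error = abs(grad_numeric - grad_analytic) / (abs(grad_numeric) + abs(grad_analytic) + 1e-12)
        print(ix, grad_numeric, grad_analytic, rel_error)

# Example usage (assuming svm_loss_naive exists as described above):
# loss, dW = svm_loss_naive(W, X, y, reg=0.0)
# numerical_gradient_check(lambda w: svm_loss_naive(w, X, y, 0.0)[0], W, dW)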

Vectorizing SVM Loss and Gradient

Vectorizing the loss function is actually straightforward.

\[\begin{align} WX &= \begin{pmatrix} W_{0,0} & W_{0,1} & \dots & W_{0,3072} \\ W_{1,0} & W_{1,1} & \dots & W_{1,3072} \\ \vdots & \vdots & \ddots & \vdots \\ W_{9,0} & W_{9,1} & \dots & W_{9,3072} \\ \end{pmatrix}\begin{pmatrix} X_{0,0} & X_{0,1} & \dots & X_{0,49999}\\ X_{1,0} & X_{1,1} & \dots & X_{1,49999} \\ \vdots & \vdots & \ddots & \vdots \\ X_{3071,0} & X_{3071,1} & \dots & X_{3071,49999} \\ 1 & 1 & \dots & 1 \end{pmatrix} \\ &= \begin{pmatrix} f(0,0) & f(1,0) & \dots & f(49999,0) \\ f(0,1) & f(1,1) & \dots & f(49999,1) \\ \vdots & \vdots & \ddots & \vdots \\ f(0,9) & f(1,9) & \dots & f(49999,9) \end{pmatrix} \end{align}\]

All we need to do is subtract \(f(i,y_i)\) from each column, add \(\Delta\), apply \(\text{max}(0, \cdot)\) element-wise, and set the entry for the correct class of each image to 0. Here is the Python version (note that the code computes \(P = XW\), so rows correspond to images and columns to classes):

# pairs of (row index, true class) for each training example
ytrue_class_prob = np.array([[i, c] for i, c in enumerate(y)])
P = X.dot(W)
P = P - P[ytrue_class_prob[:, 0], ytrue_class_prob[:, 1]].reshape(P.shape[0], 1) + 1  # note delta = 1

def myfunc(m):
    if m > 0:
        return m
    return 0.

vfunc = np.vectorize(myfunc)

P_ = vfunc(P)
P_[ytrue_class_prob[:, 0], ytrue_class_prob[:, 1]] = 0  # the correct class does not contribute to the loss

loss = np.sum(P_)
loss /= X.shape[0]
loss += reg * np.sum(W * W)

Now to vectorize the gradient:

def myfuncder(m_):
    if m_ > 0. :
        return 1.
    return 0.

vfunc_ = np.vectorize(myfuncder)

P_der = vfunc_(P)
# for each example, count how many classes have margin > 0
P_der[ytrue_class_prob[:, 0], ytrue_class_prob[:, 1]] = np.nan
count = np.nansum(P_der, axis=1)
P_der[ytrue_class_prob[:, 0], ytrue_class_prob[:, 1]] = -count

dW = np.matmul(X.T, P_der)/X.shape[0] + 2*reg*W
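As a quick sanity check, the vectorized result can be compared against the loop version. The sketch below assumes both implementations are wrapped up as svm_loss_naive(W, X, y, reg) and svm_loss_vectorized(W, X, y, reg), each returning (loss, dW); treat the function names as placeholders here:

loss_naive, dW_naive = svm_loss_naive(W, X, y, reg=1e2)
loss_vectorized, dW_vectorized = svm_loss_vectorized(W, X, y, reg=1e2)

print('loss difference:', abs(loss_naive - loss_vectorized))
print('gradient difference:', np.linalg.norm(dW_naive - dW_vectorized, ord='fro'))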

Check out the full assignment here
