__author__ = "Sida Wang"
__version__ = "COS 495 NLP Spring 2018"
Recall that we considered the least squares problem,
\begin{align} L(w) = \sum_{i=1}^n (\phi(x_i) \cdot w - y_i)^2. \end{align}
What does this have to do with ML/NLP? A common approach in ML is to define a loss function that reflects what we desire, and then optimize it. Least squares is the simplest such loss function. Suppose we want to solve a sentiment classification problem under this loss; then $x_i$ refers to a document, $y_i \in \mathbb R$ refers to its label, and $\phi(x)$ is a suitable feature map.
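For a single example $(x_i, y_i)$, the per-example loss and its gradient (which the code further below will implement) are
\begin{align} L_i(w) = (\phi(x_i) \cdot w - y_i)^2, \qquad \nabla L_i(w) = 2\,\phi(x_i)\,(\phi(x_i) \cdot w - y_i). \end{align}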
import numpy as np
import random
def sentence_polarity():
    # the file names and the labels line up: *.pos.utf8 -> 'pos', *.neg.utf8 -> 'neg'
    with open('./data/rt-polarity/rt-polarity.pos.utf8') as f:
        pos = [{'y': 'pos', 'x': sent} for sent in f.readlines()]
    with open('./data/rt-polarity/rt-polarity.neg.utf8') as f:
        neg = [{'y': 'neg', 'x': sent} for sent in f.readlines()]
    data = pos + neg
    random.seed(1)
    random.shuffle(data)
    print(len(data), len(pos))
    return data
data = sentence_polarity()
train_data, test_data = data[:9000], data[9000:]
test_data[:3]
If we can convert y = pos|neg to a scalar, and the review text to a vector, then we can apply least squares. For $y$ it is easy: say 1 for pos and -1 for neg. At test time, we predict whichever label is closer to $\phi(x) \cdot w$, i.e. pos if $\phi(x) \cdot w > 0$ and neg otherwise.
How about text itself? One intuitive and effective representation is the bag-of-words representation, where bag hints at ignoring order. Here is an implementation of bag-of-words using hashing:
d = 100000
def bow_hash(x, d=d):
    # hash each whitespace-separated token into one of d buckets and set that coordinate to 1
    one_ind = [hash(x_i) % d for x_i in x.split(' ')]
    phi_x = np.zeros(d)
    phi_x[one_ind] = 1
    return phi_x
phi = bow_hash
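As a quick sanity check, each distinct word should turn on one coordinate (barring hash collisions among the $d$ buckets), so a three-word sentence yields three nonzero entries:
# illustrative check: three distinct words -> three nonzero coordinates, unless two hash to the same bucket
print(int(bow_hash('a great movie').sum()))  # expected: 3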
Once we have defined the loss and converted each $x$ to a vector $\phi(x) \in \mathbb R^{d}$, we can use the stochastic gradient descent (SGD) algorithm, which, for one example at a time, performs the update \begin{align} w_i \gets w_{i-1} - \eta \nabla L(w_{i-1}), \end{align} where $L$ here is the loss on the single example picked at step $i$ and $\eta$ is the step size.
def sgd(w, gradloss, data, T=6, eta=1e-2, printIterval=3):
    for t in range(T):
        loss_t = 0
        for data_i in data:
            grad, loss = gradloss(w, data_i)
            w -= eta * grad
            loss_t += loss
        if t % printIterval == 0:
            # average loss over this pass through the data
            print(t, '\t', loss_t / len(data))
    return w
# dataset-specific conversion for both x and y
def convert_y(lossf):
    # map the raw example {'x': ..., 'y': 'pos'|'neg'} to (phi(x), +1/-1) and apply the loss
    return lambda w, data_i: \
        lossf(w, phi(data_i['x']), y=1 if data_i['y'] == 'pos' else -1)

def predict(x, w):
    return 'pos' if np.dot(w, phi(x)) > 0 else 'neg'

def accuracy(predict, data):
    return np.mean([predict(data_i['x']) == data_i['y'] for data_i in data])

def print_results(w):
    print('train', accuracy(lambda x: predict(x, w), train_data))
    print('test', accuracy(lambda x: predict(x, w), test_data))
def gradloss_ls(w, phi_x, y):
    w_phi = np.dot(phi_x, w)
    loss = (w_phi - y)**2
    grad = 2 * phi_x * (w_phi - y)
    return grad, loss
w_ls = sgd(np.zeros(d), convert_y(gradloss_ls), train_data, T=5)
print_results(w_ls)
I got a test accuracy of over 70%, not bad for such a simple method that does not seem very suitable for the task! Some results are in table 2 of baselines. Let us look at two other common loss functions.
The hinge loss function and its (sub)gradient are
$$ \begin{align} L(w) &= \max(1 - y \phi(x) \cdot w,\ 0), \\ \nabla L(w) &= \begin{cases} -y \phi(x) & \text{if } 1 - y \phi(x)\cdot w > 0, \\ 0 & \text{otherwise.} \end{cases} \end{align} $$
This is also known as the SVM loss or the margin loss.
def gradloss_svm(w, phi_x, y):
    w_phi = np.dot(phi_x, w)
    loss = max(1 - w_phi*y, 0)
    grad = -phi_x*y if 1 - w_phi*y > 0 else 0
    return grad, loss
w_svm = sgd(np.zeros(d), convert_y(gradloss_svm), train_data, T=5)
print_results(w_svm)
The logistic loss and its derivative are
\begin{align} L(w) &= \log(1 + \exp(-y \phi(x) \cdot w)),\\ \nabla L(w) &= -y \phi(x) \frac{\exp(-y \phi(x) \cdot w)}{1 + \exp(-y \phi(x) \cdot w)}. \end{align}
from numpy import exp, log
def gradloss_lr(w, phi_x, y):
    w_phi = np.dot(phi_x, w)
    loss = log(1 + exp(-w_phi*y))  # what could go wrong?
    grad = -phi_x*y * exp(-w_phi*y)/(1 + exp(-w_phi*y))
    return grad, loss
w_lr = sgd(np.zeros(d), convert_y(gradloss_lr), train_data, T=5)
print_results(w_lr)
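The "what could go wrong?" comment above points at numerical overflow: when $-y \phi(x) \cdot w$ is large, exp overflows to inf. A possible fix, shown here as a sketch rather than the version used in the runs above, computes the loss with np.logaddexp and evaluates the sigmoid so that exp never sees a large positive argument:
def gradloss_lr_stable(w, phi_x, y):
    # margin m = y * (w . phi(x)); loss = log(1 + exp(-m)) = logaddexp(0, -m)
    m = np.dot(phi_x, w) * y
    loss = np.logaddexp(0, -m)
    # sigma(-m) = exp(-m) / (1 + exp(-m)): pick the form whose exponent is non-positive
    if m >= 0:
        sig = exp(-m) / (1.0 + exp(-m))
    else:
        sig = 1.0 / (1.0 + exp(m))
    grad = -phi_x * y * sig
    return grad, loss
This can be dropped into the same sgd call in place of gradloss_lr.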
from vega import VegaLite
scores = np.linspace(-2, 4, num=500)
data = [(score, gradloss_lr(1, score, 1)[1], 'LR') for score in scores]
data += [(score, gradloss_svm(1, score, 1)[1], 'SVM') for score in scores]
data += [(score, 0.5*gradloss_ls(1, score, 1)[1], 'LS') for score in scores]
plotdata = list(zip(*data))
spec = {
    "width": 250, "height": 250,
    "mark": "line",
    "encoding": {
        "x": {"field": "score", "type": "quantitative"},
        "y": {"field": "loss", "type": "quantitative"},
        "color": {"field": "Loss type", "type": "nominal"}
    }
}
display(VegaLite(spec, {'score': plotdata[0], 'loss': plotdata[1], 'Loss type': plotdata[2]}))
def test(x):
    print(predict(x, w_svm), '\t', x)
test('a great movie')
test('a not so good movie')
test('it is hard to imagine something more sleep inducing')
test('there were many memorable moments')
test('it made me laugh so many times, its not even funny') # error
test('the beginning was great, but as a whole it sucked')
test('worth my money')
test('it was not bad')
test('no one on earth will say it is bad')
words = ['no', 'not', 'bad', 'awesome', 'good']
print([(x, w_svm[hash(x) % d]) for x in words])
We would like to generalize the loss function so that it can handle a variety of tasks. With no assumptions on what $y$ might be, we can define the featurizer $\phi(x,y)$ to be a function of $y$ as well. Let the score of $y$ be $s_y = w \cdot \phi(x,y)$; then the prediction is the candidate with the maximum score
$$ \hat{y} = \arg\max_y w \cdot \phi(x,y). $$
The structured hinge loss is
$$ L(x,y,w) = \max(0, 1 - (w \cdot \phi(x,y) - \max_{y' \neq y} w \cdot \phi(x,y'))). $$
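As a concrete sketch (assuming the candidate set is small enough to enumerate, and assuming a hypothetical joint feature map phi_xy(x, y) that returns a vector; neither is defined in these notes), prediction and a subgradient of the structured hinge loss look like this:
def predict_structured(w, x, labels, phi_xy):
    # \hat{y} = argmax_y w . phi(x, y) over the enumerable candidate list
    return max(labels, key=lambda y: np.dot(w, phi_xy(x, y)))

def gradloss_structured_hinge(w, x, y, labels, phi_xy):
    # highest-scoring candidate other than the true y
    y_wrong = max((yp for yp in labels if yp != y), key=lambda yp: np.dot(w, phi_xy(x, yp)))
    margin = np.dot(w, phi_xy(x, y)) - np.dot(w, phi_xy(x, y_wrong))
    if 1 - margin > 0:
        return phi_xy(x, y_wrong) - phi_xy(x, y), 1 - margin
    return np.zeros_like(w), 0.0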
While the hinge loss has the advantage of being friendly to discrete search methods that find $\max_{y' \neq y} w \cdot \phi(x,y')$, it only pays attention to the first and second highest scores. Having probability values for all the predictions is useful for sampling and for soft labels. The softmax loss (i.e. the negative log likelihood) is:
\begin{align} \log p_w(y|x) = -L(x,y,w) &= \log\frac{\exp(w \cdot \phi(x,y))}{\sum_{y'} \exp(w \cdot \phi(x,y'))}\\ & = w \cdot \phi(x,y) - \log {\sum_{y'} \exp(w \cdot \phi(x,y'))}. \end{align}
If we are given a soft label $p^*(y)$, a slight generalization is the cross-entropy loss $$ L(x,y,w) = \operatorname{KL}(p^*(y) \,\|\, p_w(y | x)) = \sum_y p^*(y) \log \frac{p^*(y)}{p_w(y | x)}, $$ which differs from the cross entropy $-\sum_y p^*(y) \log p_w(y|x)$ only by the entropy of $p^*$, a constant with respect to $w$.
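Under the same assumptions as the sketch above (an enumerable candidate list and the hypothetical phi_xy), here is a sketch of the softmax loss and its gradient; the gradient is the expected feature vector under $p_w(y'|x)$ minus the feature vector of the true label:
def gradloss_softmax(w, x, y, labels, phi_xy):
    feats = [phi_xy(x, yp) for yp in labels]
    scores = np.array([np.dot(w, f) for f in feats])
    scores -= scores.max()                      # stabilize before exponentiating
    probs = np.exp(scores) / np.exp(scores).sum()
    loss = -np.log(probs[labels.index(y)])
    # E_{p_w}[phi(x, y')] - phi(x, y)
    grad = sum(p * f for p, f in zip(probs, feats)) - phi_xy(x, y)
    return grad, loss
For a soft label $p^*$, the cross-entropy version weights the last two lines by $p^*(y')$ instead of picking out the single true label.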
The hinge loss is a (convex) upper bound of the 0-1 loss $1[y \neq y']$. More generally, we might have a preference $\operatorname{Cost}(y,y')$ when the true answer is $y$ and $y'$ was predicted. For example, if we are predicting the numerical ratings of product reviews, it is worse when a review with true rating 5 is classified as rating 1 than when it is classified as rating 4. We want to modify the hinge loss to use $\operatorname{Cost}(y,y')=|y - y'|$ in place of $1[y \neq y']$. To still be considered a hinge loss, the modified loss should remain convex in $w$, upper bound $\operatorname{Cost}(y,\hat y)$, and reduce to the usual hinge loss when $\operatorname{Cost}$ is the 0-1 cost.
Exercise: find an example of such a loss.
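One standard construction with these properties (offered here as a possible answer, not necessarily the intended one) is the cost-augmented hinge loss, which rescales the margin by the cost:
$$ L(x,y,w) = \max_{y'} \big( \operatorname{Cost}(y, y') - (w \cdot \phi(x,y) - w \cdot \phi(x,y')) \big). $$
Taking $y' = \hat y$ shows it upper bounds $\operatorname{Cost}(y, \hat y)$, and with the 0-1 cost it reduces to the structured hinge loss above.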
For multi-class classification, the feature function usually just simulates a matrix-vector product: $w \cdot \phi(x,y) = w_y \cdot \phi(x)$. On the other hand, if $y$ has structure, the feature function might break down $y$. For example, if $y = [\text{Noun, Verb, Noun}]$ is a sequence of part-of-speech tags, a possible feature function is $$ \phi(x,y) = \left[I_\text{Noun}(y_1)\phi_1(x),\ I_\text{Verb before Noun}(y_1, y_2)\phi_2(x),\ I_\text{Noun before Verb}(y_1, y_2)\phi_3(x),\ \ldots\right], $$ where $I_{c}(y)$ are indicator functions of the condition $c$ on $y$.
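A sketch of such a feature function for a three-tag sequence, with hypothetical per-position featurizers phi1, phi2, phi3 standing in for $\phi_1, \phi_2, \phi_3$:
def phi_sequence(x, y, phi1, phi2, phi3):
    # indicator-gated blocks: a block is zeroed out unless its condition on y holds
    block1 = phi1(x) if y[0] == 'Noun' else np.zeros_like(phi1(x))
    block2 = phi2(x) if (y[0], y[1]) == ('Verb', 'Noun') else np.zeros_like(phi2(x))
    block3 = phi3(x) if (y[0], y[1]) == ('Noun', 'Verb') else np.zeros_like(phi3(x))
    return np.concatenate([block1, block2, block3])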
How well this works strongly depends on how good these features are. Instead of thinking too hard about good features, it is preferable to use a more flexible model that can learn the features. The starting point of feature learning is to put everything about the data into a distributed representation, that is, dense vectors.
The bag-of-words representation does not capture any similarities between words: every word is treated as being completely different from every other word. An alternative that could capture similarity between words is the bag-of-vectors representation. To start, we just use random vectors, which does not perform too well.
vec_dim = 200
word_vecs = np.random.rand(d, vec_dim)
def bov_hash(x):
    # average the (random) word vectors of the hashed tokens
    one_ind = [hash(x_i) % d for x_i in x.split(' ')]
    return np.mean(word_vecs[one_ind, :], 0)
phi = bov_hash
w_svm_bov = sgd(np.zeros(vec_dim), convert_y(gradloss_svm), train_data, T=30, eta=1e-3, printIterval=5)
print_results(w_svm_bov)