Word2vec Derivation

Unlike the traditional one-hot representation, word2vec uses a distributed representation: the information about a word is spread across all dimensions of its vector. A one-hot representation, by contrast, encodes a word's meaning in only a single dimension.
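A minimal numpy sketch of the contrast (the toy vocabulary and the dense vectors below are made up purely for illustration, not trained embeddings):

```python
import numpy as np

vocab = ["king", "queen", "apple"]      # toy vocabulary, purely illustrative

# One-hot: each word occupies exactly one dimension, so all words are equidistant.
one_hot = np.eye(len(vocab))

# Distributed: meaning is spread over every dimension, so similar words can be close.
dense = np.array([[0.8, 0.1, 0.3],     # "king"  (made-up values)
                  [0.7, 0.2, 0.4],     # "queen"
                  [0.0, 0.9, 0.1]])    # "apple"

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(one_hot[0], one_hot[1]))   # 0.0 -- one-hot vectors encode no similarity
print(cos(dense[0], dense[1]))       # close to 1 -- "king" and "queen" are similar
```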

The Harris distributional hypothesis states that words occurring in similar contexts should have similar meanings, and a whole family of methods has been derived from this idea.

A language model is a statistical model that computes the probability of a given sentence (or word). Typically, we compute the probability of the next word given the words seen so far. If a sentence $W$ is a sequence of $n$ words $(w_1,w_2,\cdots,w_n)$, then the probability of $W$ is
$$
P(W)= P(w_1,w_2,\cdots, w_n)
$$
When computing the probability of a sentence $W$, it is usually factored by the chain rule of probability into a product of the conditional probabilities of its words.
$$
\begin{aligned}
P(W) &= P(w_1,w_2,\cdots, w_n) \\
&= P(w_1)\,P(w_2|w_1) \cdots P(w_n|w_1,w_2,\cdots,w_{n-1})
\end{aligned}
$$
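As a concrete (made-up) example, for a three-word sentence the factorization reads
$$
P(\text{the cat sat}) = P(\text{the})\,P(\text{cat}\mid\text{the})\,P(\text{sat}\mid\text{the},\text{cat}).
$$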
In practice, conditioning each word on its entire preceding history is intractable (the contexts become too long and too sparse to estimate), so the chain is simplified with an $(N-1)$-order Markov assumption, i.e. the n-gram model, in which each word is conditioned only on the previous $N-1$ words:
$$
\begin{aligned}
P(W) &= P(w_1,w_2,\cdots, w_n) \\
&\approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1},\cdots, w_{i-1})
\end{aligned}
$$
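As a minimal sketch of a count-based n-gram model (here a bigram model; the toy corpus and the add-$\alpha$ smoothing are assumptions made for illustration):

```python
from collections import Counter

# Toy corpus; in practice this would be a large tokenized text collection.
corpus = [["the", "cat", "sat"], ["the", "cat", "ate"], ["the", "dog", "sat"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def p_bigram(w, prev, alpha=1.0):
    # P(w | prev) with add-alpha smoothing; alpha is an assumed hyperparameter.
    return (bigrams[(prev, w)] + alpha) / (unigrams[prev] + alpha * len(unigrams))

def p_sentence(sent):
    # P(W) ~= P(w_1) * prod_i P(w_i | w_{i-1}), with a unigram estimate for the first word.
    p = unigrams[sent[0]] / sum(unigrams.values())
    for prev, w in zip(sent, sent[1:]):
        p *= p_bigram(w, prev)
    return p

print(p_sentence(["the", "cat", "sat"]))
```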
Perplexity (PPL) is commonly used as the metric for how well a language model fits the data.
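For reference, perplexity is the inverse probability of the held-out text normalized by its length:
$$
\mathrm{PPL}(W) = P(w_1,w_2,\cdots,w_n)^{-\frac{1}{n}} = \exp\Big(-\frac{1}{n}\sum_{i=1}^{n}\log P(w_i\mid w_1,\cdots,w_{i-1})\Big)
$$
Lower perplexity means the model assigns higher probability to the text.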

Assume that, given the current word $w$ and its context $h$, the score before the softmax is $s_{\theta}(w,h)$. Writing $u_{\theta}(w,h) = \exp(s_{\theta}(w,h))$ and the normalizer $Z_\theta^h = \sum_{w'} u_{\theta}(w',h)$, the log-probability of $w$ under the model is
$$
\begin{aligned}
\log P_{\theta}^{h}(w) &= \log \frac{u_{\theta}(w,h)}{Z_\theta^h} \\
&= \log u_{\theta}(w,h) - \log Z_\theta^h \\
&= s_{\theta}(w,h) - \log Z_\theta^h
\end{aligned}
$$
The denominator is expensive to compute: every evaluation requires a full pass over the vocabulary. Noise contrastive estimation (NCE) is therefore introduced, which turns the multi-class problem into a binary classification problem (distinguishing data samples from noise samples).
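A small numpy sketch to make the cost of $Z_\theta^h$ explicit, assuming (for illustration only) the usual bilinear score $s_\theta(w,h) = v_w^\top h$:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50_000, 128                  # assumed vocabulary size and embedding dimension
W_out = rng.normal(size=(V, d))     # output word vectors (part of theta)
h = rng.normal(size=d)              # context representation

scores = W_out @ h                           # s_theta(w, h) for every word: O(V * d)
log_Z = np.logaddexp.reduce(scores)          # log Z_theta^h -- a pass over all V words
log_p = scores - log_Z                       # log P_theta^h(w) for the whole vocabulary

w = 123                                      # some target word index
print(log_p[w])   # even a single log-probability required scoring all V words
```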

Taking the gradient of the language model's expected log-likelihood,
$$
\begin{aligned}
\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] &= \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[s_{\theta}(w,h) - \log Z_\theta^h\big] \\
&= \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[s_{\theta}(w,h)\big] - \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log Z_\theta^h\big] \\
&= \mathbb{E}_{w \sim P_d^h(w)}\big[ \tfrac{\partial}{\partial \theta} s_{\theta}(w,h) \big] - \frac{\partial}{\partial \theta} \log Z_\theta^h
\end{aligned}
$$
where $P_d^h(w)$ is the true data distribution of $w$ given the context $h$, and $Z_\theta^h$ depends only on $h$ and not on $w$, so its gradient can be pulled out of the expectation.

When the true data distribution and the model distribution under $\theta$ coincide, this gradient vanishes, as the following derivation shows. First expand $\frac{\partial}{\partial \theta} \log Z_\theta^h$:
$$
\begin{aligned}
\frac{\partial}{\partial \theta} \log Z_\theta^h &= \frac{1}{Z_\theta^h} \frac{\partial}{\partial \theta} Z_\theta^h \\
&= \frac{1}{Z_\theta^h} \frac{\partial}{\partial \theta} \int\limits_{w} \exp(s_{\theta}(w,h))\, dw \\
&= \frac{1}{Z_\theta^h} \int\limits_{w} \frac{\partial}{\partial \theta} \exp(s_{\theta}(w,h))\, dw \\
&= \frac{1}{Z_\theta^h} \int\limits_{w} \exp(s_{\theta}(w,h)) \frac{\partial}{\partial \theta} s_{\theta}(w,h)\, dw \\
&= \int\limits_{w} \frac{\exp(s_{\theta}(w,h))}{Z_\theta^h} \frac{\partial}{\partial \theta} s_{\theta}(w,h)\, dw \\
&= \int\limits_{w} P_{\theta}^h (w) \frac{\partial}{\partial \theta} s_{\theta}(w,h)\, dw \\
&= \mathbb{E}_{w \sim P_{\theta}^h(w)}\big[ \tfrac{\partial}{\partial \theta} s_{\theta}(w,h) \big]
\end{aligned}
$$
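This identity can be checked numerically on a toy softmax model; the shapes and the linear score $s_\theta(w,h) = W_w^\top h$ below are assumptions made only for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                      # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))      # output embeddings (the parameters theta)
h = rng.normal(size=d)           # a fixed context vector

def log_Z(W):
    s = W @ h                    # scores s_theta(w, h) for every word
    m = s.max()
    return m + np.log(np.sum(np.exp(s - m)))

# Left-hand side: numerical gradient of log Z with respect to one parameter W[i, j].
i, j, eps = 3, 1, 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
lhs = (log_Z(Wp) - log_Z(Wm)) / (2 * eps)

# Right-hand side: E_{w ~ P_theta^h}[ d s_theta(w,h) / d W[i, j] ].
s = W @ h
p = np.exp(s - s.max())
p /= p.sum()                     # softmax P_theta^h(w)
rhs = p[i] * h[j]                # d s(w,h)/d W[i,j] = h[j] if w == i, else 0

print(lhs, rhs)                  # the two values agree up to numerical error
```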

Substituting this back into the gradient of the log-likelihood gives
$$
\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] = \mathbb{E}_{w \sim P_d^h(w)}\big[ \tfrac{\partial}{\partial \theta} s_{\theta}(w,h) \big] - \mathbb{E}_{w \sim P_{\theta}^h(w)}\big[ \tfrac{\partial}{\partial \theta} s_{\theta}(w,h) \big]
$$

Writing the expectations as explicit sums over the vocabulary:
$$
\begin{aligned}
\mathbb{E}_{w \sim P_d^h(w)}\big[ \tfrac{\partial}{\partial \theta} s_{\theta}(w,h) \big] - \mathbb{E}_{w \sim P_{\theta}^h(w)}\big[ \tfrac{\partial}{\partial \theta} s_{\theta}(w,h) \big] &= \sum\limits_{w} P_{d}^{h}(w) \frac{\partial}{\partial \theta} s_{\theta}(w,h) - \sum\limits_{w} P_{\theta}^{h}(w) \frac{\partial}{\partial \theta} s_{\theta}(w,h) \\
&= \sum\limits_{w} \big( P_{d}^{h}(w) - P_{\theta}^{h}(w) \big) \frac{\partial}{\partial \theta} s_{\theta}(w,h) \\
&= \sum\limits_{w} \big( P_{d}^{h}(w) - P_{\theta}^{h}(w) \big) \frac{\partial}{\partial \theta} \log u_{\theta}(w,h)
\end{aligned}
$$
so the gradient is zero exactly when $P_{\theta}^{h}(w) = P_{d}^{h}(w)$ for every $w$.

In NCE, for each context $h$ the training data is modeled as a mixture: one sample drawn from the data distribution $P_d^h(w)$ for every $k$ samples drawn from a known noise distribution $P_n(w)$. The joint probabilities of the label $D$ and the word $w$ are then
$$
\begin{aligned}
P^h(D=1,w) &= \frac{1}{k+1} P_d^h(w) \\
P^h(D=0,w) &= \frac{k}{k+1} P_n(w)
\end{aligned}
$$

By Bayes' rule, the posterior probability that a word $w$ came from the data rather than from the noise is
$$
\begin{aligned}
P^h(D=1|w) &= \frac{P^h(D=1,w)}{P^h(w)} \\
&= \frac{P_d^h(w)}{P_d^h(w) + kP_n(w)}
\end{aligned}
$$

$$
\begin{aligned}
P^h(D=0|w) &= \frac{P^h(D=0,w)}{P^h(w)} \\
&= \frac{kP_n(w)}{P_d^h(w) + kP_n(w)}
\end{aligned}
$$

Replacing the unknown data distribution $P_d^h$ with the model distribution $P_{\theta}^h$ gives the classifier used by NCE:
$$
\begin{aligned}
P^h(D=1|w, \theta) &= \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \\
P^h(D=0|w, \theta) &= \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)}
\end{aligned}
$$
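In the standard NCE formulation, $\theta$ is then fitted by maximizing the expected binary-classification log-likelihood
$$
J^h(\theta) = \mathbb{E}_{w \sim P_d^h}\big[\log P^h(D=1|w,\theta)\big] + k\,\mathbb{E}_{w \sim P_n}\big[\log P^h(D=0|w,\theta)\big]
$$
A common simplification treats the normalizer $Z_\theta^h$ as a constant (often 1), so that $P_{\theta}^h(w) \approx \exp(s_{\theta}(w,h))$ and the expensive sum over the vocabulary never has to be computed.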

Skip-gram

In the Skip-gram model with negative sampling, given a current word $w$ and a context word $c$, the objective is
$$
\begin{aligned}
& \arg\max_{\theta} \prod_{(w,c)\in D} P(D=1|c,w;\theta) \prod_{(w,c) \in D'} P(D=0|c,w;\theta) \\
=\ & \arg\max_{\theta} \prod_{(w,c)\in D} P(D=1|c,w;\theta) \prod_{(w,c) \in D'} \big(1-P(D=1|c,w;\theta)\big) \\
=\ & \arg\max_{\theta} \sum_{(w,c) \in D} \log\frac{1}{1+e^{-v_c \cdot v_w}} + \sum_{(w,c)\in D'} \log\Big(1-\frac{1}{1+e^{-v_c\cdot v_w}}\Big) \\
=\ & \arg\max_{\theta} \sum_{(w,c)\in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c)\in D'} \log \sigma(-v_c \cdot v_w)
\end{aligned}
$$
where $D$ is the set of observed (word, context) pairs, $D'$ is the set of sampled negative (noise) pairs, and the model takes $P(D=1|c,w;\theta) = \sigma(v_c \cdot v_w)$ with $v_w$, $v_c$ the word and context vectors.
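A minimal numpy sketch of this negative-sampling objective for one positive pair with $k$ sampled negatives (vocabulary size, dimensions, and the uniform noise sampler are assumptions made for illustration; word2vec actually samples negatives from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10_000, 100, 5                     # vocab size, embedding dim, negatives per pair
W_in = rng.normal(scale=0.1, size=(V, d))    # word vectors v_w
W_out = rng.normal(scale=0.1, size=(V, d))   # context vectors v_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c):
    """Negative of the objective above, for a single observed pair (w, c)."""
    neg = rng.integers(0, V, size=k)                            # sampled noise contexts
    pos_term = np.log(sigmoid(W_out[c] @ W_in[w]))              # log sigma(v_c . v_w)
    neg_term = np.sum(np.log(sigmoid(-W_out[neg] @ W_in[w])))   # sum log sigma(-v_c' . v_w)
    return -(pos_term + neg_term)             # minimize this to maximize the objective

print(sgns_loss(w=42, c=7))
```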

For comparison, the original Skip-gram maximum-likelihood objective (without negative sampling) is

$\displaystyle \arg\max_{\theta} \prod_{w \in \text{text}} \left[ \prod_{c\in C(w)} p(c|w;\theta) \right]$

where $C(w)$ denotes the set of context words of $w$.

Equivalently, written over all (word, context) pairs:

$\displaystyle \arg\max_\theta \prod_{(w,c)\in D} p(c|w;\theta)$

Here $D$ denotes the set of all (word, context) pairs observed in the corpus.

NCE works because maximizing its objective function drives the model distribution toward the true data distribution being fitted.