

2.1.1.3 Weighted sum: Q is a given tensor, K and V are the input

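The code-wise attention example comes from the NeurJudge code. query is an auxiliary matrix used to extract relevant information from the input (the charge representations, broadcast over the input mini-batch, with shape [batch_size, query_num, hidden_dim]); context is the input (shape [batch_size, sent_len, hidden_dim]).
α is the attention, the query is G, and the input is D.
In NeurJudge this output ([batch_size, 1, hidden_dim]) is used directly for prediction (the original code uses two separate queries, obtains two attention outputs, and concatenates them for prediction).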

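import torch
import torch.nn as nn

class Code_Wise_Attention(nn.Module):
    def __init__(self):
        super(Code_Wise_Attention, self).__init__()

    def forward(self, query, context):
        # S[b, i, j]: dot product of context token i and query j -> [batch_size, sent_len, query_num]
        S = torch.bmm(context, query.transpose(1, 2))
        # Max over the queries for each token, then softmax over tokens -> [batch_size, sent_len]
        attention = torch.nn.functional.softmax(torch.max(S, 2)[0], dim=-1)
        # Attention-weighted sum of the context tokens -> [batch_size, 1, hidden_dim]
        context_vec = torch.bmm(attention.unsqueeze(1), context)
        return context_vec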
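
To make the shapes concrete, here is a minimal, self-contained sketch that restates the forward pass above as a function and runs it on random tensors. The helper name code_wise_attention and the toy sizes are made up for illustration and do not come from the NeurJudge code.

```python
import torch
import torch.nn.functional as F

def code_wise_attention(query, context):
    """Functional restatement of Code_Wise_Attention.forward (illustrative helper)."""
    S = torch.bmm(context, query.transpose(1, 2))    # [batch_size, sent_len, query_num]
    attention = F.softmax(S.max(dim=2)[0], dim=-1)   # [batch_size, sent_len]
    context_vec = torch.bmm(attention.unsqueeze(1), context)
    return context_vec, attention

torch.manual_seed(0)  # arbitrary seed, only for repeatability
batch_size, query_num, sent_len, hidden_dim = 2, 3, 5, 4  # assumed toy sizes
query = torch.randn(batch_size, query_num, hidden_dim)
context = torch.randn(batch_size, sent_len, hidden_dim)

context_vec, attention = code_wise_attention(query, context)
print(context_vec.shape)       # torch.Size([2, 1, 4])
print(attention.sum(dim=-1))   # one weight distribution per sample, each summing to 1
```

Note that the max over the query dimension means each token is scored by its best-matching query only, so the number of queries does not change the output shape.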

The NeurJudge paper cites two references for this attention mechanism, neither of which I have read yet: Bidirectional Attention Flow for Machine Comprehension and Multi-Interactive Attention Network for Fine-grained Feature Learning in CTR Prediction.

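This one is also from NeurJudge. The reference is Sentence Similarity Learning by Lexical Decomposition and Composition, which is also where the logic comes from: the sample representation (d) is split by the charge representation (c) into two parts, one parallel to c and one orthogonal to c. The attention computed in this step exists to project c onto d: Equation 5 computes the token-level dot-product similarity between c and d, and Equation 6 uses softmax to select from c the tokens most similar to d (softmax acts as a soft version of max).

The code also handles masking and uses two masks (one before the softmax, one before multiplying by V). Both input matrices and the return value have shape [batch_size, sent_len, hidden_dim].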
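
As a reading aid, the attention and the split can be written out as below. These formulas are read off from the Mask_Attention code that follows, not copied from either paper; the symbols e, α, and ĉ are notation assumed here, and the parallel/orthogonal split is the standard vector projection, which I am assuming is what the referenced paper uses.

```latex
% Assumed notation: d_i is the i-th token of the sample, c_j the j-th token of the charge
e_{ij} = d_i^{\top} c_j
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}
\qquad
\hat{c}_i = \sum_{j} \alpha_{ij}\, c_j

% Parallel / orthogonal split of d_i with respect to \hat{c}_i
d_i^{\parallel} = \frac{d_i^{\top} \hat{c}_i}{\hat{c}_i^{\top} \hat{c}_i}\, \hat{c}_i
\qquad
d_i^{\perp} = d_i - d_i^{\parallel}
```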

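import numpy as np
import torch
import torch.nn as nn

class Mask_Attention(nn.Module):
    def __init__(self):
        super(Mask_Attention, self).__init__()

    def forward(self, query, context):
        # Token-level dot products -> [batch_size, context_len, query_len]
        attention = torch.bmm(context, query.transpose(1, 2))
        # First mask: scores that are exactly 0 (presumably from zero padding) become -inf
        mask = attention.new(attention.size()).zero_()
        mask[:,:,:] = -np.inf
        attention_mask = torch.where(attention==0, mask, attention)
        attention_mask = torch.nn.functional.softmax(attention_mask, dim=-1)
        # Second mask: rows that were all -inf come out of softmax as NaN (NaN != NaN); set them to 0
        mask_zero = attention.new(attention.size()).zero_()
        final_attention = torch.where(attention_mask!=attention_mask, mask_zero, attention_mask)
        # Weighted sum of query tokens for every context token -> [batch_size, context_len, hidden_dim]
        context_vec = torch.bmm(final_attention, query)
        return context_vec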
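
The two masks are easier to see on a toy input. The sketch below restates the forward pass as a function (mask_attention is an illustrative name), zeroes out one token on each side to imitate padding, and then performs the parallel/orthogonal split described earlier. The tensor sizes, the eps guard, and the exact form of the split are assumptions made here, not taken from the NeurJudge code.

```python
import torch
import torch.nn.functional as F

def mask_attention(query, context):
    """Functional restatement of Mask_Attention.forward (illustrative helper)."""
    scores = torch.bmm(context, query.transpose(1, 2))         # [B, Lc, Lq]
    masked = scores.masked_fill(scores == 0, float("-inf"))    # mask 1: before softmax
    weights = F.softmax(masked, dim=-1)                        # all -inf rows become NaN
    weights = torch.where(torch.isnan(weights),                # mask 2: before multiplying by V
                          torch.zeros_like(weights), weights)
    return torch.bmm(weights, query)                           # [B, Lc, H]

torch.manual_seed(0)        # arbitrary seed
d = torch.randn(1, 4, 8)    # sample tokens (context), assumed toy sizes
c = torch.randn(1, 3, 8)    # charge tokens (query)
d[:, -1, :] = 0.0           # pretend the last sample token is zero padding
c[:, -1, :] = 0.0           # pretend the last charge token is zero padding

c_hat = mask_attention(c, d)    # c projected onto every token of d

# Split d into parts parallel and orthogonal to c_hat (assumed standard projection)
eps = 1e-8                      # assumed guard for padded rows where c_hat is all zeros
coef = (d * c_hat).sum(-1, keepdim=True) / ((c_hat * c_hat).sum(-1, keepdim=True) + eps)
d_parallel = coef * c_hat
d_orth = d - d_parallel

print(c_hat[0, -1])                 # the padded sample token gets an all-zero vector
print((d_orth * c_hat).sum(-1))     # approximately zero for every token
```

One side effect of masking on `scores == 0` is that a genuine dot product of exactly zero between two real tokens would also be masked; with floating-point representations this is unlikely but not impossible.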

2.1.2 Scaled Dot-Product Attention

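It addresses the problem that dot-product attention scores grow sharply as the dimension increases, which pushes the softmax output to concentrate on a few values and makes the gradients small (honestly, I did not understand why this happens).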
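
One way to see why: if the components of q and k are independent with mean 0 and variance 1, the dot product q·k is a sum of d_k such terms and so has variance d_k. Its typical magnitude therefore grows like sqrt(d_k), and a softmax over logits that large puts almost all of its mass on the largest one, where its gradient is close to zero. Dividing by sqrt(d_k) brings the variance back to about 1. The small experiment below checks this numerically using only the standard library; the unit-variance Gaussian inputs, d_k = 64, and all helper names are assumptions made for illustration.

```python
import math
import random

def softmax(xs):
    m = max(xs)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot_scores(d_k, n_keys, rng):
    """Dot products of one random query with n_keys random keys (unit-variance entries)."""
    q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
    return [sum(qi * rng.gauss(0.0, 1.0) for qi in q) for _ in range(n_keys)]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

rng = random.Random(0)   # arbitrary seed
d_k = 64                 # assumed dimension

# Empirical variance of q.k over many independent (q, k) pairs
raw = [dot_scores(d_k, 1, rng)[0] for _ in range(2000)]
scaled = [s / math.sqrt(d_k) for s in raw]
var_raw = variance(raw)          # close to d_k
var_scaled = variance(scaled)    # close to 1

# Largest softmax weight over 8 keys, without and with scaling
scores = dot_scores(d_k, 8, rng)
peak_raw = max(softmax(scores))
peak_scaled = max(softmax([s / math.sqrt(d_k) for s in scores]))

print(var_raw, var_scaled)
print(peak_raw, peak_scaled)     # the unscaled softmax is at least as peaked
```

Scaling acts like a softmax temperature of sqrt(d_k): it never makes the distribution more peaked, and for large d_k it usually makes it much flatter.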

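Transformer¹ self-attention: K, Q, and V are all obtained from the input through linear transformations. This can be used to compute the relationships among a set of objects. LEMM² may show this more fully: in the decoder, K and V come from the encoder, while Q comes from the output of the previous decoder layer.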
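
The difference between the two cases is only where Q, K, and V are computed from. The following is a rough single-head sketch, assuming no masking, no multi-head split, and no output projection; the class name SingleHeadAttention and all tensor sizes are hypothetical and chosen only to show the shapes.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """Illustrative single-head scaled dot-product attention (hypothetical name)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.w_q = nn.Linear(hidden_dim, hidden_dim)
        self.w_k = nn.Linear(hidden_dim, hidden_dim)
        self.w_v = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, q_source, kv_source):
        q = self.w_q(q_source)      # [B, Lq, H]
        k = self.w_k(kv_source)     # [B, Lk, H]
        v = self.w_v(kv_source)     # [B, Lk, H]
        scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(q.size(-1))  # [B, Lq, Lk]
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights, v), weights

torch.manual_seed(0)                 # arbitrary seed
self_attn = SingleHeadAttention(hidden_dim=16)
cross_attn = SingleHeadAttention(hidden_dim=16)

x = torch.randn(2, 5, 16)            # input sequence
memory = torch.randn(2, 7, 16)       # stand-in for the encoder output
dec = torch.randn(2, 3, 16)          # stand-in for the previous decoder layer's output

self_out, self_w = self_attn(x, x)             # self-attention: Q, K, V all from x
cross_out, cross_w = cross_attn(dec, memory)   # decoder side: Q from dec, K and V from memory

print(self_out.shape, self_w.shape)    # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])
print(cross_out.shape, cross_w.shape)  # torch.Size([2, 3, 16]) torch.Size([2, 3, 7])
```

In the self-attention call the weight matrix is square (every position attends to every position of the same sequence), which is what lets it capture relationships within a set of objects.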

2.2 Computing attention between sample pairs

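DVQA³: the paper gives neither formulas nor code, so the figure is all there is to go on, although the figure seems fairly clear. In the figure, the attention distribution map produces the attention between sample pairs (each block contains a model that computes attention).
DAN: the triplet loss part directly uses the distance between attentions, while the classification loss part uses the attention in a way similar to a weighted sum.
DCN: converts the attention into context that is merged into the original sample representation, and performs the classification task directly.
CTM⁴: the approach is similar to DVQA and the way the attention is used is similar to DAN; the paper gives the formula directly.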

4. Other references used while writing this post

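Transformer 模型详解 (A Detailed Explanation of the Transformer Model)
深度学习attention机制中的Q,K,V分别是从哪来的？ - 知乎 (Where do Q, K, and V in the deep-learning attention mechanism come from? - Zhihu): I have only read a few of the answers; many of them seem well explained, and I plan to keep reading.
Not yet read:
1. 目前主流的attention方法都有哪些？ - 知乎 (What are the current mainstream attention methods? - Zhihu)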


Footnotes