1. LSTM theory

LSTM physical architecture

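Key points:

Each layer actually runs only one LSTM cell, and its weights are shared across all time steps. A "time step" follows the input sequence: at each step, the vector of one token is fed in and processed.
The network definition is independent of batch_size.
The network's weights are also independent of seq_len: the same cell is applied once per time step, so a sequence of any length can be processed.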
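The three points above can be checked with a small sketch. It assumes a single-layer, unidirectional nn.LSTM with the same sizes as the example in this section (input_size=10, hidden_size=20); the variable names are illustrative only. It counts the parameters, feeds inputs with different seq_len and batch_size through the same module, and then unrolls one nn.LSTMCell by hand with the layer's weights copied in, to show that one cell with one set of weights is reused at every time step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1)

# The parameter count depends only on input_size and hidden_size,
# not on batch_size or seq_len: 4 gates * (20*10 + 20*20 + 20 + 20).
num_params = sum(p.numel() for p in lstm.parameters())

# The same module accepts any (seq_len, batch_size) combination.
shapes = {}
for seq_len, batch_size in [(15, 16), (7, 3), (1, 1)]:
    out, (h_n, c_n) = lstm(torch.randn(seq_len, batch_size, 10))
    shapes[(seq_len, batch_size)] = tuple(out.shape)

# Unroll one nn.LSTMCell by hand with the layer's weights copied in:
# the same cell (same weights) is applied at every time step.
cell = nn.LSTMCell(input_size=10, hidden_size=20)
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(5, 4, 10)            # (seq_len, batch_size, input_size)
h = torch.zeros(4, 20)               # initial hidden state
c = torch.zeros(4, 20)               # initial cell state
steps = []
with torch.no_grad():
    for x_t in x:                    # one time step at a time
        h, c = cell(x_t, (h, c))
        steps.append(h)
    manual_out = torch.stack(steps)  # (seq_len, batch_size, hidden_size)
    ref_out, _ = lstm(x)             # initial states default to zeros

matches = torch.allclose(manual_out, ref_out, atol=1e-5)
print(num_params, shapes, matches)
```

The manual loop and nn.LSTM should agree up to floating-point rounding, which is why the comparison uses a small tolerance instead of exact equality.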

An example. LSTM:

import torch
import torch.nn as nn

rnn = nn.LSTM(10, 20, 2)  # (input_size = word-vector dim 10, hidden_size = 20, num_layers = 2)
inputs = torch.randn(15, 16, 10)  # (seq_len, batch_size, emb_dim)
h0 = torch.randn(2*1, 16, 20)  # (num_layers * num_directions = 2, batch_size = 16, hidden_size = 20)
c0 = torch.randn(2*1, 16, 20)  # same shape as h0
output, (hn, cn) = rnn(inputs, (h0, c0))
print(output.shape, hn.shape, cn.shape, sep="\n")
---
torch.Size([15, 16, 20]) # (seq_len, batch_size, num_directions * hidden_size)
torch.Size([2, 16, 20]) # (num_layers * num_directions, batch_size, hidden_size)
torch.Size([2, 16, 20]) # (num_layers * num_directions, batch_size, hidden_size)

GRU:

import torch
import torch.nn as nn

# input_size (word-vector dim), hidden_size (hidden-state dim), num_layers (number of layers)
gru = nn.GRU(input_size=10, hidden_size=20, num_layers=2, bidirectional=True)
inputs = torch.randn(15, 16, 10)   # seq_len=15, batch_size=16, input_size=10
h0 = torch.randn(2*2, 16, 20)  # (num_layers * num_directions, batch_size, hidden_size)
output, hn = gru(inputs, h0)
print(output.shape)
print(hn.shape)
---
torch.Size([15, 16, 40]) # (seq_len, batch_size, num_directions * hidden_size)
torch.Size([4, 16, 20]) # (num_layers * num_directions, batch_size, hidden_size)

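Note: batch_size is the second dimension (batch_first defaults to False). A GRU has no cell state, so it returns only output and h_n.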
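If a batch-first layout is preferred, nn.GRU and nn.LSTM accept batch_first=True. The sketch below reuses the sizes from the GRU example and omits h0 (it then defaults to zeros). It shows that only the input and output tensors change layout, while h_n keeps the (num_layers * num_directions, batch_size, hidden_size) layout.

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=10, hidden_size=20, num_layers=2,
             bidirectional=True, batch_first=True)

x = torch.randn(16, 15, 10)   # (batch_size, seq_len, input_size)
output, h_n = gru(x)          # h0 omitted, so it defaults to zeros

print(output.shape)  # (batch_size, seq_len, num_directions * hidden_size)
print(h_n.shape)     # (num_layers * num_directions, batch_size, hidden_size)
```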

Detailed analysis

The figure above makes this clearer.

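output: the first and second dimensions match the input (seq_len, batch_size), and the third is num_directions * hidden_size, so the full shape is (seq_len, batch_size, num_directions * hidden_size). output is the complete output: it holds the last layer's hidden state at every time step.
h_n: (num_layers * num_directions, batch_size, hidden_size). It holds only the final time step's hidden state, for each layer and direction (as shown in the figure above). For the backward direction of a bidirectional network, the final step is the one that processes the first token of the sequence.
c_n: (num_layers * num_directions, batch_size, hidden_size). It holds only the final time step's cell state (a GRU has none).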
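The relationship between output and h_n can be verified with a short sketch, assuming a 2-layer bidirectional nn.LSTM with the sizes used earlier (the variable names are illustrative). The forward direction's final state of the last layer sits at the last time step of output, while the backward direction's final state sits at time step 0, because the backward pass finishes on the first token.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 20
lstm = nn.LSTM(10, hidden, num_layers=2, bidirectional=True)

x = torch.randn(15, 16, 10)  # (seq_len, batch_size, input_size)
with torch.no_grad():
    output, (h_n, c_n) = lstm(x)

# Split h_n into (num_layers, num_directions, batch_size, hidden_size)
# and keep the last layer, which is the layer that output comes from.
h_last_layer = h_n.reshape(2, 2, 16, hidden)[-1]
fwd_final, bwd_final = h_last_layer[0], h_last_layer[1]

# Forward half of output at the last step == forward final hidden state.
fwd_ok = torch.allclose(output[-1, :, :hidden], fwd_final)
# Backward half of output at step 0 == backward final hidden state.
bwd_ok = torch.allclose(output[0, :, hidden:], bwd_final)
print(fwd_ok, bwd_ok)
```

This is why a sentence-level summary of a bidirectional encoder is usually built from h_n, or from both ends of output, and not from output[-1] alone.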

2. Code implementation

References:
pytorch-seq2seq
NLP FROM SCRATCH-Pytorch

An in-depth look at implementing seq2seq with LSTM + Attention
