This is my reading note for [NeurIPS 2014] Sequence to Sequence Learning with Neural Networks.


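  • DNNs can only handle problems with fixed-dimensional inputs and outputs; they cannot map a sequence to a sequence.
  • This paper proposes a general end-to-end approach to sequence learning that makes minimal assumptions about the sequence structure.
  • The method uses a multilayer LSTM to map the input sequence to a fixed-dimensional vector, then another deep LSTM to decode the target sequence from that vector.
  • The reported results are strong across the board (details skipped here).
  • The LSTM also learns sensible phrase and sentence representations that are sensitive to word order and relatively invariant to active versus passive voice. Finally, we found that reversing the order of the words in the source sentences markedly improves the LSTM's performance, because doing so introduces many short-term dependencies between the source and target sentences, which makes the optimization problem easier.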

To brush up on: what is an LSTM?

QA can be done by seq2seq.

As the paper's title suggests, any task that maps a sequence to a sequence can use this approach.


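  • DNNs can perform arbitrary parallel computation in a modest number of steps. Powerful!
  • DNNs can only handle problems whose inputs and outputs can be encoded as fixed-dimensional vectors. They cannot handle inputs and outputs of unknown length.
  • We use LSTMs to solve the general seq2seq problem, as follows:
    • One LSTM reads the input sequence to obtain a large fixed-dimensional vector representation
    • Another LSTM decodes the output sequence from that vector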
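The two-LSTM scheme above can be sketched in plain Python. This is a minimal, untrained illustration of the data flow only, not the paper's implementation. The class `TinyLSTMCell`, the helpers `encode` and `decode`, the sizes, the initialization range, greedy decoding, and reusing the end-of-sentence id as the decoder's start symbol are all assumptions made for this sketch.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v


class TinyLSTMCell:
    """Single-layer LSTM cell written with plain lists (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix and bias per gate:
        # i = input, f = forget, o = output, g = candidate cell value.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifog"}
        self.b = {g: [0.0] * hidden_size for g in "ifog"}

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state

        def gate(name, act):
            return [act(sum(w * v for w, v in zip(row, z)) + b)
                    for row, b in zip(self.W[name], self.b[name])]

        i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
        g = gate("g", math.tanh)
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def encode(cell, token_ids, vocab_size):
    """Read the whole input; the final (h, c) is the fixed-size summary."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for t in token_ids:
        h, c = cell.step(one_hot(t, vocab_size), h, c)
    return h, c


def decode(cell, W_out, state, eos_id, vocab_size, max_len=10):
    """Greedy decoding (a simplification chosen for brevity)."""
    h, c = state
    prev = eos_id  # assumption: EOS doubles as the start symbol
    output = []
    for _ in range(max_len):
        h, c = cell.step(one_hot(prev, vocab_size), h, c)
        logits = [sum(w * v for w, v in zip(row, h)) for row in W_out]
        prev = max(range(vocab_size), key=lambda k: logits[k])
        if prev == eos_id:
            break
        output.append(prev)
    return output


rng = random.Random(0)
VOCAB, HIDDEN, EOS = 6, 8, 0  # toy sizes, chosen arbitrarily
encoder = TinyLSTMCell(VOCAB, HIDDEN, rng)
decoder = TinyLSTMCell(VOCAB, HIDDEN, rng)  # separate parameters
W_out = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(VOCAB)]

state = encode(encoder, [3, 1, 4], VOCAB)
tokens = decode(decoder, W_out, state, EOS, VOCAB)
```

Because the weights are random, the decoded ids are meaningless. The point is that an input of any length is squeezed into a state of the same size, and the decoder starts from that state.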


  • In this work, we showed that a large deep LSTM with a limited vocabulary can outperform a standard SMT-based system whose vocabulary is unlimited on a large-scale MT task.
  • We were surprised by the extent of the improvement obtained by reversing the words in the source sentences.
  • We were also surprised by the ability of the LSTM to correctly translate very long sentences.
  • Most importantly, we demonstrated that a simple, straightforward and a relatively unoptimized approach can outperform a mature SMT system, so further work will likely lead to even greater translation accuracies. These results suggest that our approach will likely do well on other challenging sequence to sequence problems.



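  • Mapping the input sequence to a fixed-size vector with one RNN and decoding it with another could work in principle, since the RNN is provided with all the relevant information
  • but it would be difficult to train the RNNs due to the resulting long-term dependencies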

LSTM (chosen in this paper to implement the seq2seq structure)


  • known to learn problems with long range temporal dependencies.

  • The goal is to estimate the conditional probability \(p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)\), where \((x_1, \ldots, x_T)\) is an input sequence and \(y_1, \ldots, y_{T'}\) is its corresponding output sequence, whose length \(T'\) may differ from \(T\) (variable length)
  • The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation \(v\) of the input sequence \((x_1, \ldots, x_T)\), given by the last hidden state of the LSTM, and then computing the probability of \(y_1, \ldots, y_{T'}\) with a standard LSTM-LM formulation whose initial hidden state is set to the representation \(v\) of \(x_1, \ldots, x_T\):
    • \(p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})\)
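To make the factorization concrete: the probability of the whole output sequence is a product of per-step conditionals, so its log is a sum of per-step log-probabilities. The sketch below assumes each per-step distribution is a softmax over vocabulary scores. The function names and the toy scores are made up for illustration and stand in for what a trained decoder would produce.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_log_prob(step_logits, target_ids):
    """log p(y_1..y_T' | x) = sum over t of log p(y_t | v, y_1..y_{t-1}).

    step_logits[t] is assumed to hold the decoder's unnormalized scores
    over the vocabulary at step t, already conditioned on v and y_<t.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one score vector per target token")
    total = 0.0
    for logits, y in zip(step_logits, target_ids):
        total += math.log(softmax(logits)[y])
    return total


# Toy, made-up scores for a 3-word vocabulary and a 3-token target.
step_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
target = [0, 1, 2]
log_p = sequence_log_prob(step_logits, target)
```

Working in log space avoids underflow when many small probabilities are multiplied for long sentences.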

The actual model differs from the above description in three ways:

  1. we used two different LSTMs: one for the input sequence and another for the output sequence, because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously.

  2. we found that deep LSTMs significantly outperformed shallow LSTMs, so we chose an LSTM with four layers.

  3. we found it extremely valuable to reverse the order of the words of the input sentence. So for example, instead of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM is asked to map c, b, a to α, β, γ, where α, β, γ is the translation of a, b, c.
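The source-reversal trick in point 3 is a one-line preprocessing step. The sketch below shows it, plus a toy calculation of why it creates short-term dependencies. Both helper names are hypothetical, and `alignment_lags` assumes an idealized one-to-one, same-order word alignment with the target fed right after the source.

```python
def reverse_source(pairs):
    """Reverse each source sentence; target sentences are left unchanged."""
    return [(list(reversed(src)), list(tgt)) for src, tgt in pairs]


def alignment_lags(n, reversed_source):
    """Time lag between source word i and target word i.

    Assumes n source words followed immediately by n target words, with
    source word i aligned to target word i (a toy assumption).
    """
    lags = []
    for i in range(n):
        src_pos = (n - 1 - i) if reversed_source else i
        tgt_pos = n + i
        lags.append(tgt_pos - src_pos)
    return lags


pairs = [(["a", "b", "c"], ["α", "β", "γ"])]
reversed_pairs = reverse_source(pairs)
# reversed_pairs == [(["c", "b", "a"], ["α", "β", "γ"])]

plain = alignment_lags(3, reversed_source=False)   # [3, 3, 3]
flipped = alignment_lags(3, reversed_source=True)  # [1, 3, 5]
```

Under this toy alignment the average lag is the same in both cases (3), but after reversal the first few source words sit right next to their translations, which gives the optimizer short-range dependencies to latch onto first.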