At each decoding time step, the decoder receives two inputs:
- the encoder output: this is computed once and fed to every decoder layer at every decoding time step, serving as the keys ($K_{endec}$) and values ($V_{endec}$) of the encoder-decoder attention blocks.
- the target tokens decoded up to the current decoding step: at the first step, the matrix contains only a special token in its first position, normally </s>. After each decoding step $k$, the token produced by the decoder at position $k$ is written into the target tokens matrix at position $k+1$, and then the next decoding step takes place (see the sketch below).
For instance, in the fairseq implementation of decoding, you can see how they create the target tokens matrix and fill it with padding here, and then how they place an EOS token (</s>) at the first position here.
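To make this concrete, here is a minimal greedy-decoding sketch in PyTorch. It is not fairseq's actual code; `model.encode`, `model.decode`, `pad_id` and `eos_id` are hypothetical names standing in for whatever your model and vocabulary provide.

```python
import torch

def greedy_decode(model, src_tokens, max_len, pad_id, eos_id):
    """Minimal greedy decoding sketch; `model.encode` / `model.decode`
    are hypothetical method names, not a specific library's API."""
    batch_size = src_tokens.size(0)

    # 1) Encoder output: computed once, reused at every decoding step
    #    as K/V for the encoder-decoder attention blocks.
    enc_out = model.encode(src_tokens)

    # 2) Target tokens matrix: pre-filled with padding, with the special
    #    start token (</s>) placed at position 0.
    tgt = torch.full((batch_size, max_len + 1), pad_id, dtype=torch.long)
    tgt[:, 0] = eos_id

    for k in range(max_len):
        # Decode using only the tokens generated so far (positions 0..k).
        logits = model.decode(tgt[:, :k + 1], enc_out)  # (batch, k+1, vocab)
        # The token predicted at step k is written at position k+1
        # before the next decoding step takes place.
        tgt[:, k + 1] = logits[:, -1, :].argmax(dim=-1)

    return tgt[:, 1:]  # strip the initial </s>
```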
As you have tagged your question with the bert tag, you should know that what I described above only applies to the sequence-to-sequence transduction use of the Transformer (i.e. when it is used for machine translation), and this is not how BERT works. BERT is trained with a masked language model loss, which makes its use at inference time very different from that of the NMT Transformer.
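For contrast, here is a minimal sketch of masked-language-model inference with BERT using the HuggingFace transformers library (my choice of library, not something referenced above): the whole input, including the [MASK] token, is processed in a single forward pass, with no autoregressive loop.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# The full sentence (with [MASK]) goes through the model in one pass;
# there is no step-by-step generation as in the NMT decoder above.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Read off the prediction at the masked position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # e.g. "paris"
```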
Comments:

- Are [MASK] tokens used at inference time for BERT? I thought it was just during pretraining, so at inference time I only had to worry about [CLS], [SEP] and possibly [UNK]. Also, I was not able to find much info on the size of vocabulary to use. I thought "unique words in train_text + 20 or 30 thousand of the most common words from the Wiktionary top 100,000 words" would suffice, but then I read "use vocab size V"; but what is an ideal V? – mLstudent33 May 12 '19 at 02:44
- "The model is constructed in modeling.py (class BertModel) and is pretty much identical to a vanilla Transformer encoder." So I assumed that the [MASK] was only applied during pretraining and that info would already be embedded in the vectorized representation I obtain from BERT. – mLstudent33 May 18 '19 at 04:02
- […] <s>, but most simply reuse the end of sequence token </s>, as it works the same. – noe Apr 06 '21 at 06:43