At each decoding time step, the decoder receives two inputs:
- the encoder output: this is computed once and fed to every decoder layer at every decoding time step, serving as the keys ($K_{endec}$) and values ($V_{endec}$) of the encoder-decoder attention blocks.
- the target tokens decoded up to the current decoding step: at the first step, the matrix contains only a special token in its first position, normally </s>. After each decoding step $k$, the token produced by the decoder at position $k$ is written into the target tokens matrix at position $k+1$, and then the next decoding step takes place (see the sketch below).
For instance, in the fairseq implementation of decoding, you can see how they create the target tokens matrix and fill it with padding here, and then how they place an EOS token (</s>) at the first position here.
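To make this concrete, here is a minimal greedy-decoding sketch in PyTorch. It is not fairseq's actual code; `model.encode`, `model.decode`, `pad_id` and `eos_id` are hypothetical names standing in for whatever your model and vocabulary provide.

```python
import torch

def greedy_decode(model, src_tokens, max_len, pad_id, eos_id):
    """Minimal greedy decoding sketch; `model.encode` / `model.decode`
    are hypothetical method names, not a specific library's API."""
    batch_size = src_tokens.size(0)

    # 1) Encoder output: computed once, reused at every decoding step
    #    as K/V for the encoder-decoder attention blocks.
    enc_out = model.encode(src_tokens)

    # 2) Target tokens matrix: pre-filled with padding, with the special
    #    start token (</s>) placed at position 0.
    tgt = torch.full((batch_size, max_len + 1), pad_id, dtype=torch.long)
    tgt[:, 0] = eos_id

    for k in range(max_len):
        # Decode using only the tokens generated so far (positions 0..k).
        logits = model.decode(tgt[:, :k + 1], enc_out)  # (batch, k+1, vocab)
        # The token predicted at step k is written at position k+1
        # before the next decoding step takes place.
        tgt[:, k + 1] = logits[:, -1, :].argmax(dim=-1)

    return tgt[:, 1:]  # strip the initial </s>
```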
As you have tagged your question with the bert tag, you should know that what I described above only applies to the sequence-to-sequence transduction use of the Transformer (i.e. when it is used for machine translation), and this is not how BERT works. BERT is trained with a masked language model loss, which makes its use at inference time very different from that of the NMT Transformer.
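For contrast, here is a minimal sketch of masked-language-model inference with BERT using the HuggingFace transformers library (my choice of library, not something referenced above): the whole input, including the [MASK] token, is processed in a single forward pass, with no autoregressive loop.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# The full sentence (with [MASK]) goes through the model in one pass;
# there is no step-by-step generation as in the NMT decoder above.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Read off the prediction at the masked position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # e.g. "paris"
```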
Comments:

- Are [MASK] tokens used at inference time for BERT? I thought it was just during pretraining, so at inference time I only had to worry about [CLS], [SEP] and possibly [UNK]. Also, I was not able to find much info on the size of vocabulary to use. I thought "unique words in train_text + 20 or 30 thousand of the most common words from the Wiktionary top 100,000 words" would suffice, but then I read "use vocab size V"; but what is an ideal V? – mLstudent33 May 12 '19 at 02:44
- "The model is constructed in modeling.py (class BertModel) and is pretty much identical to a vanilla Transformer encoder." So I assumed that the [MASK] was only applied during pretraining and that info would already be embedded in the vectorized representation I obtain from BERT. – mLstudent33 May 18 '19 at 04:02
- […] <s>, but most simply reuse the end of sequence token </s>, as it works the same. – noe Apr 06 '21 at 06:43