
I am experimenting with the GPT2-XL model and trying to understand its internal structure. While I understand most of the components and how they affect the size of the activation tensors (such as multi-head self-attention), I do not fully understand how the embeddings are processed.

When extracting the embeddings at a specific point of a forward pass, i.e. between two transformer layers, I get a tensor of shape token length × embedding dimension (so, for example, 4 vectors of length 1600 for the first forward pass with an input of 4 tokens, since GPT2-XL's hidden size is 1600).
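
For reference, this is roughly how I am extracting those activations (a minimal sketch using the Hugging Face transformers library; my actual hook point may differ, and the example sentence is just an assumed 4-token input):

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2Model.from_pretrained("gpt2-xl")

# An input that tokenizes to 4 tokens
inputs = tokenizer("The quick brown fox", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the initial embeddings plus the output of each block
hidden = outputs.hidden_states
print(len(hidden))       # 49 for GPT2-XL (embeddings + 48 transformer blocks)
print(hidden[10].shape)  # torch.Size([1, 4, 1600]) -> (batch, tokens, hidden size)
```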

I understand that each token is represented by an embedding vector. But I do not understand how these are then processed. Are they processed one after another within each transformer layer before the new 4×1600 tensor is passed to the next layer? Or is each embedding vector processed in its own forward pass? If so, how does the previous forward pass feed into the next one? Or are they computed in parallel? If so, do they share weights? This is quite confusing to me at the moment (see the toy sketch below for what I mean).
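
To make the question concrete, here is a toy sketch of the two alternatives I am imagining, using a plain linear layer as a hypothetical stand-in for one transformer block (real blocks also contain self-attention, where tokens interact, which is exactly where I get lost):

```python
import torch
import torch.nn as nn

hidden_size = 1600
layer = nn.Linear(hidden_size, hidden_size)  # stand-in for one transformer block

x = torch.randn(4, hidden_size)  # 4 token embeddings

# Alternative A: all 4 embeddings go through the same weights in one pass
out_parallel = layer(x)  # shape (4, 1600)

# Alternative B: each embedding is processed separately (with shared weights?)
out_sequential = torch.stack([layer(x[i]) for i in range(4)])

# For a purely per-token operation like this, the two are identical
print(torch.allclose(out_parallel, out_sequential))  # True
```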

Thanks!
