This video provides a detailed explanation of the Transformer model, contrasting it with Recurrent Neural Networks (RNNs). RNNs are slower for long sequences and suffer from vanishing/exploding gradients, hindering their ability to capture long-range dependencies. The Transformer overcomes these limitations using an encoder-decoder architecture with self-attention and multi-head attention mechanisms. The video thoroughly explains these components, including positional encoding, layer normalization, and the masked multi-head attention used in the decoder. Finally, it details the training and inference processes, highlighting the single-step nature of Transformer training versus the iterative approach of RNNs.

Recurrent neural networks existed a long time before the Transformer, and they allowed us to map one sequence of inputs to another sequence of outputs. In this case, our input is X, and we want an output sequence Y. What we did before is that we split the sequence into single items. We gave the recurrent neural network the first item as input, so x1, along with an initial state, usually made up of only zeros, and the recurrent neural network produced an output, let's call it y1. This happened at the first time step. Then we took the hidden state of the network from the previous time step, along with the next input token, x2, and the network had to produce the second output token, y2. Then we did the same procedure at the third time step, in which we took the hidden state of the previous time step along with the input token at time step 3, and the network had to produce the next output token, y3. If you have n input tokens, you need n time steps to map an n-token input sequence into an n-token output sequence. This worked fine for a lot of tasks, but it had some problems; let's review them.

The problems with recurrent neural networks: first of all, they are slow for long sequences, because the process we did before is kind of like a for loop in which we do the same operation for every token in the input. The longer the sequence, the longer this computation, and this made the network not easy to train for long sequences. The second problem is vanishing or exploding gradients. Now, you may have heard these expressions on the internet or in other videos, but I will try to give you a brief insight into what they mean on a practical level. As you know, frameworks like PyTorch convert our networks into a computation graph. So suppose we have a computation graph; this is not a neural network, I will be making a computation graph that is very simple and has nothing to do with neural networks, but it will show the problems that we have. Imagine we have two inputs, x and another input, let's call it y. Our computation graph first multiplies these two numbers, so we have a first function, let's call it f(x, y) = x * y, and the result, let's call it z, is given to another function, let's call it g(z) = z^2. What PyTorch does, for example, is the following: usually we have a loss function, and PyTorch calculates the derivative of the loss function with respect to each weight. In this case, we just calculate the derivative of the g function.
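To make this concrete, here is a tiny PyTorch sketch (my own illustration, not taken from the video) of this toy graph and of how a long chain of factors smaller than one, as discussed next, makes a gradient vanish:

import torch

# The toy computation graph from above: f(x, y) = x * y, then g(z) = z ** 2
x = torch.tensor(0.5, requires_grad=True)
y = torch.tensor(0.5, requires_grad=True)
z = x * y             # f(x, y)
g = z ** 2            # g(z)
g.backward()          # PyTorch applies the chain rule automatically
print(x.grad.item())  # dg/dx = dg/dz * dz/dx = (2 * z) * y = 0.25

# A long chain of factors smaller than 1 shrinks the gradient toward zero.
w = torch.tensor(1.0, requires_grad=True)
out = w
for _ in range(100):
    out = out * 0.5   # like flowing through 100 time steps
out.backward()
print(w.grad.item())  # about 7.9e-31: the gradient has effectively vanished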
So we calculate the derivative of the output function with respect to each of its inputs: the derivative of g with respect to x is equal to the derivative of g with respect to f, multiplied by the derivative of f with respect to x, and these intermediate terms kind of cancel out. This is called the chain rule. Now, as you can see, the longer the chain of computation, so the more nodes we have one after another, the longer this multiplication chain. Here we have two factors, because the distance from this node to this one is two, but imagine you have 100 or 1000. Now, imagine this number is 0.5 and this number is also 0.5. The result of multiplying them together is a number that is smaller than both initial numbers: it's 0.25, because one half multiplied by one half is one fourth. So if we have two numbers that are smaller than one and we multiply them together, they produce an even smaller number, and if we have two numbers that are bigger than one and we multiply them together, they produce a number that is bigger than both of them. So if we have a very long chain of computation, it will eventually become either a very big number or a very small number. And this is not desirable, first of all because our CPU or GPU can only represent numbers up to a certain precision, let's say 32-bit or 64-bit, and if the number becomes too small, its contribution to the output becomes very small. So when PyTorch, or whatever framework we use, calculates how to adjust the weights, the weights will move very, very slowly, because the contribution of this product will be a very small number. This means the gradient is vanishing. Or, in the other case, it can explode and become a very big number. And this is a problem.

The next problem is the difficulty in accessing information from long ago. What does it mean? As you remember from the previous slide, the first input token is given to the recurrent neural network along with the first state. Since the recurrent neural network is a long graph of computation, it produces a new hidden state; then we use the new hidden state along with the next token to produce the next output. If we have a very long input sequence, the hidden state at the last token will have almost no contribution left from the first token, because of this long chain of multiplications. So the last token will not depend much on the first token, and this is not good, because we know as humans that in a quite long text, the context that we saw, let's say, 200 words before is still relevant to the context of the current words. And this is something the RNN could not capture.

So the Transformer solves these problems of recurrent neural networks, and we will see how. The structure of the Transformer can be divided into two macro blocks. The first macro block is called the encoder, and it's this part here. The second macro block is called the decoder, and it's this second part here. The third part you see on the top is just a linear layer, and we will see why it's there and what its function is. The two blocks, so the encoder and the decoder, are connected by this connection you can see here, in which some output of the encoder is sent as input to the decoder. Let's start our journey through the Transformer by looking at the encoder. The encoder starts with the input embeddings.
So what is an input embedding? First of all, let's start with our sentence. We have a sentence of, in this case, six words. What we do is we tokenize it: we transform the sentence into tokens. What does it mean to tokenize? We split it into single words. It is not necessary to always split the sentence into single words; we can even split the sentence into parts that are smaller than a single word. So we could split this sentence into, let's say, 20 tokens by splitting each word into multiple sub-word tokens. This is usually done in most modern Transformer models, but we will not be doing it here, otherwise it's really difficult to visualize. So let's suppose we have this input sentence, we split it into tokens, and each token is a single word.

The next step is to map these words into numbers, and these numbers represent the position of these words in our vocabulary. So imagine we have a vocabulary of all the possible words that appear in our training set; each word occupies a position in this vocabulary. For example, one word will occupy position 105, the word "cat" will occupy position 6500, etc. And as you can see, this "cat" here has the same number as this "cat" here, because they occupy the same position in the vocabulary. We take these numbers, which are called input IDs, and we map each of them into a vector of size 512: a vector made of 512 numbers, and we always map the same word to the same embedding. However, these numbers are not fixed; they are parameters of our model. Our model will learn to change these numbers in such a way that they represent the meaning of the word. So the input IDs never change, because our vocabulary is fixed, but the embedding will change along with the training process of the model: the embedding numbers will change according to the needs of the loss function. So the input embedding is basically mapping each single word into an embedding of size 512, and we call this quantity 512 "d_model", because that's the name also used in the paper "Attention Is All You Need".
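As a rough sketch of the tokenization and embedding step in PyTorch (the vocabulary size and most of the input IDs below are made up for illustration; only 105 and 6500 come from the example above):

import torch
import torch.nn as nn

d_model = 512
vocab_size = 10000                    # assumed vocabulary size

# "your cat is a lovely cat" -> input IDs (positions in the vocabulary).
# Both occurrences of "cat" get the same ID.
input_ids = torch.tensor([[105, 6500, 310, 7, 4021, 6500]])  # (batch=1, seq_len=6)

embedding = nn.Embedding(vocab_size, d_model)  # learned along with the rest of the model
word_embeddings = embedding(input_ids)         # (1, 6, 512)
print(word_embeddings.shape)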
Let's look at the next layer of the encoder, which is the positional encoding. So what is positional encoding? We want each word to carry some information about its position in the sentence, because so far we have built a matrix of word embeddings, but they don't convey any information about where that particular word is inside the sentence. This is the job of the positional encoding. We want the model to treat words that appear close to each other as close and words that are distant as distant; we want the model to see the spatial information that we see with our eyes. For example, when we see the sentence "what is positional encoding", we know that the word "what" is farther from the word "encoding" than it is from the word "is", because we have this spatial information given by our eyes, but the model cannot see this. So we need to give the model some information about how the words are spatially distributed inside the sentence. And we want the positional encoding to represent a pattern that the model can learn; we will see how. Imagine we have our original sentence, "your cat is a lovely cat". What we do is first convert it into embeddings using the previous layer, the input embeddings, and these are embeddings of size 512. Then we create some special vectors, called positional encoding vectors, that we add to these embeddings.

So this vector we see here in red is a vector of size 512, which is not learned: it's computed once and it's not learned along with the training process, it's fixed. And this vector represents the position of the word inside the sentence. Adding it should give us an output that is, again, a vector of size 512, because we are summing this number with this number, this number with this number, so the first dimension with the first dimension, the second dimension with the second dimension: we get a new vector of the same size as the input vectors.

So how are these positional encodings computed? You may have seen the following expressions from the paper. What we do is create a vector of size d_model, so 512, and for each position in this vector we calculate the value using these two expressions. The first argument, pos, indicates the position of the word inside the sentence; so the word "your" occupies position zero. For the even dimensions of this vector, so 0, 2, 4, up to 510, we use the first expression, the sine, and for the odd positions of this vector we use the second expression, the cosine. And we do this for all the words inside the sentence. So, for example, this value here is calculated as PE(1, 0): the 1 is the argument pos for this word and the 0 is the argument 2i, so we use the sine. And PE(1, 1) means the same word but dimension one, so we use the cosine, with pos equal to 1 and with 2i + 1 equal to 1. And we do this for the other words too. If we have another sentence, we will not have different positional encodings: we will have the same vectors, even for different sentences, because the positional encodings are computed once and reused for every sentence that our model will see during inference or training. So we only compute the positional encodings once, when we create the model, we save them, and then we reuse them; we don't need to compute them every time we feed a sentence to the model.

So why did the authors choose the sine and cosine functions to represent positional encodings? Let's watch the plot of these two functions. You can see the plot is by position, so the position of the word inside the sentence, and this depth is the dimension along the vector, so the 2i that you saw before in the previous expressions. If we plot them, we as humans can see a pattern here, and we hope that the model can also see this pattern.
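A sketch of how these positional encodings can be computed, following the paper's formulas PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)); the exp/log form below is just the usual numerically convenient way to write the same division:

import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)   # pos: (seq_len, 1)
    # 1 / 10000^(2i / d_model) for the even dimensions 0, 2, 4, ...
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use the sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use the cosine
    return pe                                      # computed once, fixed, never trained

pe = positional_encoding(seq_len=6, d_model=512)     # the same for every sentence of length 6
# encoder_input = word_embeddings + pe.unsqueeze(0)  # shape stays (1, 6, 512)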
Okay, we will not go directly into the multi-head attention. First we will visualize the single-head attention, so the self-attention with a single head, and let's do it. So what is self-attention? Self-attention is a mechanism that existed before the Transformer was introduced; the authors of the Transformer just extended it into a multi-head attention. So how did self-attention work? Self-attention allows the model to relate words to each other. We had the input embeddings that capture the meaning of the word, then we have the positional encoding that gives the information about the position of the word inside the sentence; now we want this self-attention to relate words to each other. Imagine we have an input sequence of six words with a d_model of size 512, which can be represented as a matrix that we will call Q, K and V. So our Q, K and V are the same matrix representing the input: six words, each represented by a vector of size 512. We basically apply this formula we saw here from the paper to calculate the attention, the self-attention in this case. Why self-attention? Because each word in the sentence is related to other words in the same sentence.

So we start with our Q matrix, which is the input sentence. Let's visualize it: we have six rows, and on the columns we have 512 columns. They are really difficult to draw, but let's say we have 512 columns, and here we have six rows. Now, according to this formula, we multiply it by the same sentence but transposed, so the transpose of K, which is again the same input sequence, we divide by the square root of 512, and then we apply the softmax. As we saw before in the initial matrix notation, when we multiply a 6-by-512 matrix with another matrix that is 512 by 6, we obtain a new matrix that is 6 by 6. Each value in this matrix represents a dot product: this one is the dot product of the first row with the first column, this one is the dot product of the first row with the second column, etc. The values here are actually randomly generated, so don't concentrate on the values. What you should notice is that the softmax rescales all these values in such a way that each row sums up to one. So this row here, for example, sums up to one, this other row also sums up to one, etc. And this value we see here is the dot product of the embedding of the first word with itself; this value here is the dot product of the embedding of the word "your" with the embedding of the word "cat"; and this value here is the dot product of the embedding of the word "your" with the embedding of the word "is". Each value represents, somehow, a score for how intense the relationship is between one word and another.

Let's go ahead with the formula. For now we just multiplied Q by K transposed, divided by the square root of d_k and applied the softmax, but we didn't multiply by V. So let's go forward: we multiply this matrix by V, and we obtain a new matrix which is 6 by 512. If we multiply a matrix that is 6 by 6 with another that is 6 by 512, we get a new matrix that is 6 by 512, and one thing you should notice is that the dimension of this matrix is exactly the dimension of the initial matrix from which we started. What does it mean? We obtain a new matrix with six rows and 512 columns, in which the rows are our words: we have six words, and each word has an embedding of dimension 512. So now this embedding here represents not only the meaning of the word, which was given by the input embedding, and not only the position of the word, which was added by the positional encoding, but it is somehow a special embedding whose values also capture the relationship of this particular word with all the other words. And this particular embedding of this word here also captures not only its meaning and its position inside the sentence, but also the relationship of this word with all the other words. I want to remind you that this is not yet the multi-head attention; we are just watching the self-attention with one head. We will see later how this becomes the multi-head attention.
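A compact sketch of this single-head self-attention step, with a random tensor standing in for the embedded sentence:

import math
import torch

seq_len, d_model = 6, 512
x = torch.randn(seq_len, d_model)        # embeddings + positional encodings of the six words

Q = K = V = x                            # in self-attention, Q, K and V are the same input matrix
scores = Q @ K.T / math.sqrt(d_model)    # (6, 6): one score for every pair of words
weights = torch.softmax(scores, dim=-1)  # each row now sums to 1
print(weights.sum(dim=-1))               # tensor of six ones

output = weights @ V                     # (6, 512): same shape as the input we started from
print(output.shape)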
Self-attention has some properties that are very desirable. First of all, it's permutation invariant. What does it mean to be permutation invariant? It means that if we have a matrix of, let's say, four words, so a, b, c and d, and suppose that by applying the formula from before it produces this particular matrix, in which there is a new special embedding for the word a, a new special embedding for the word b, a new special embedding for the word c and one for d, let's call them a prime, b prime, c prime and d prime: if we swap the position of two input rows, the values will not change, only the positions of the output rows will change accordingly. So the values of b prime will not change, it will just change position, and c prime will also change position, but the values inside each vector will not change. This is a desirable property.

Self-attention, as presented so far, requires no parameters. I didn't introduce any parameter that is learned by the model: I just took the initial sentence of, in this case, six words, we multiplied it by itself, we divided it by a fixed quantity, which is the square root of 512, and then we applied the softmax, which does not introduce any parameters. So for now the self-attention didn't require any parameters, except for the embeddings of the words. This will change later, when we introduce the multi-head attention.

Also, because each value in the softmax matrix is the dot product of a word's embedding with itself and with the other words, we expect the values along the diagonal to be the maximum, since each of them is the dot product of a word with itself. And there is another property of this matrix: suppose we don't want the words "your" and "cat" to interact with each other, or we don't want the words "is" and "lovely" to interact with each other. What we can do is, before we apply the softmax, replace this value with minus infinity, and also this value with minus infinity, and when we apply the softmax, the softmax will replace minus infinity with zero. Because, as you remember, the softmax is based on e to the power of x: if x goes to minus infinity, e to the power of minus infinity becomes very, very close to zero. So basically, setting a value to minus infinity before the softmax prevents those two words from interacting.

Now let's see the multi-head attention. We take the input matrix, make three copies of it, and multiply them by three parameter matrices W_Q, W_K and W_V, and we will call the results Q prime, K prime and V prime. Our next step is to split these matrices into smaller matrices. Let's see how: we could split the matrix Q prime by the sequence dimension or by the d_model dimension. In the multi-head attention, we always split by the d_model dimension, so every head will see the full sentence but a smaller part of the embedding of each word. If we have an embedding of, let's say, 512, it will become smaller embeddings of 512 divided by 4. We call this quantity d_k, so d_k is d_model divided by h, where h is the number of heads; in our case we have h equal to 4. Then we calculate the attention between these smaller matrices, so Q1, K1 and V1, using the expression taken from the paper, and this results in a small matrix called head 1, and likewise head 2, head 3 and head 4. The dimension of head 1 up to head 4 is (sequence, d_v). What is d_v? It's basically equal to d_k; it's just called d_v because the last multiplication is done by V, and in the paper they call it d_v, so I am also sticking to the same names.
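A sketch of the split along d_model into h = 4 heads and the per-head attention (random stand-ins for Q prime, K prime and V prime; shapes match the running example, without a batch dimension):

import math
import torch

seq_len, d_model, h = 6, 512, 4
d_k = d_v = d_model // h                 # 128: each head gets a slice of every word's embedding

q_prime = torch.randn(seq_len, d_model)  # stand-ins for Q', K', V'
k_prime = torch.randn(seq_len, d_model)
v_prime = torch.randn(seq_len, d_model)

# (seq_len, d_model) -> (h, seq_len, d_k): every head sees all six words,
# but only 128 of the 512 dimensions of each word.
q_heads = q_prime.view(seq_len, h, d_k).transpose(0, 1)
k_heads = k_prime.view(seq_len, h, d_k).transpose(0, 1)
v_heads = v_prime.view(seq_len, h, d_k).transpose(0, 1)

scores = q_heads @ k_heads.transpose(-2, -1) / math.sqrt(d_k)  # (h, seq_len, seq_len)
heads = torch.softmax(scores, dim=-1) @ v_heads                # (h, seq_len, d_v)
print(heads.shape)                                             # torch.Size([4, 6, 128])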
Our next step is to combine these matrices, these small heads, by concatenating them along the d_v dimension, just like the paper says. So we concatenate all these heads together and we get a new matrix that is (sequence, h * d_v), and since d_v is equal to d_k, h * d_v is equal to d_model. So we get back the initial shape: it's (sequence, d_model) here. The next step is to multiply the result of this concatenation by W_O, and W_O is a matrix that is h * d_v, so d_model, on one dimension, with the other dimension also being d_model. The result is a new matrix, the output of the multi-head attention, which is (sequence, d_model). So the multi-head attention, instead of calculating the attention directly between these matrices here, so Q prime, K prime and V prime, splits them along the d_model dimension into smaller matrices and calculates the attention between these smaller matrices. Each head is watching the full sentence, but a different aspect of the embedding of each word.
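Continuing the previous sketch, the concatenation of the heads and the W_O projection described above would look like this (W_O is random here; in the real model it is a learned parameter matrix):

import torch

seq_len, d_model, h = 6, 512, 4
d_v = d_model // h
heads = torch.randn(h, seq_len, d_v)     # the four heads from the previous sketch

# (h, seq_len, d_v) -> (seq_len, h * d_v) = (seq_len, d_model)
concat = heads.transpose(0, 1).reshape(seq_len, h * d_v)

w_o = torch.randn(h * d_v, d_model)      # W_O: (h * d_v, d_model) = (512, 512)
mha_output = concat @ w_o                # (6, 512): same shape as the input sequence
print(mha_output.shape)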
Why do we want this? Because we want each head to watch a different aspect of the same word. For example, in the Chinese language, but also in other languages, one word may be a noun in some cases, maybe a verb in other cases, maybe an adverb in yet other cases, depending on the context. So what we want is that one head maybe learns to relate that word as a noun, another head maybe learns to relate that word as a verb, and another head learns to relate it as an adjective or adverb. This is why we want a multi-head attention.

Now, you may also have seen online that the attention can be visualized, and I will show you how. When we calculate the attention between the Q and the K matrices, so when we do this operation, the softmax of Q multiplied by K transposed, divided by the square root of d_k, we get a new matrix, just like we saw before, which is (sequence, sequence), and it represents a score for the intensity of the relationship between two words. We can visualize this, and it will produce a visualization similar to this one, which I took from the paper, in which we see how all the heads work. For example, if we concentrate on this word, "making", we can see that "making" is related to the word "difficult" by different heads, so the blue head, the red head and the green head, but the violet head, let's say, is not relating these two words together. So "making" and "difficult" are not related by the violet head or the pink head; they relate the word "making" to other words, for example to this word, "2009". Why is this the case? Because maybe this pink head could see a part of the embedding that the other heads could not see, and that made this interaction possible between these two words.

You may also be wondering why these three matrices are called query, keys and values. To give a simple analogy, imagine a kind of database of movies: the keys are the categories of movies, and the values are the movies belonging to that category. In my case I just put one value per category, so we have the romantic category, which includes Titanic, we have action movies, which include The Dark Knight, etc. Imagine we also have a user that makes a query, and the query is "love". Because we are in the Transformer world, all these words are actually represented by embeddings of size 512. So what our Transformer will do is convert this word "love" into an embedding of size 512; all these keys and values are already embeddings of size 512. Then it will calculate the dot product between the query and all the keys, just like in the formula. As you remember, the formula is the softmax of the query multiplied by the transpose of the keys, divided by the square root of d_model. So we are doing the dot product of the query with all the keys, in this case the word "love" with all the keys, one by one, and this results in a score that amplifies some values and not others. In this case, our embeddings may be such that the words "love" and "romantic" are related to each other; the words "love" and "comedy" are also related to each other, but not as intensely as "love" and "romantic", so it's a less strong relationship; and maybe the words "horror" and "love" are not related at all, so their softmax score is very close to zero.

Our next layer in the encoder is the Add & Norm (see the paper: https://arxiv.org/abs/1706.03762 and this article: https://medium.com/@geetkal67/attention-networks-a-simple-way-to-understand-self-attention-f5fb363c736d). The Add & Norm uses layer normalization. Imagine a batch of n items. Each of these items will have some features; it could be an embedding, so for example a feature vector of size 512, but it could also be a very big matrix of thousands of features, it doesn't matter. What we do is calculate the mean and the variance of each of these items independently from the others, and we replace each value with another value given by this expression, so basically we normalize so that the new values of each item have zero mean and unit variance. Actually, we also multiply this new value by a parameter called gamma, and then we add another parameter called beta. Gamma and beta are learnable parameters, and the model should learn to multiply and add these parameters so as to amplify the values that it wants amplified and not amplify the values it doesn't want amplified. So we don't just normalize, we actually introduce some parameters. I found a really nice visualization from paperswithcode.com in which we see the difference between batch norm and layer norm. As we can see, in layer normalization, if N is the batch dimension, we compute the statistics over all the values belonging to one item of the batch, while in batch norm we compute the same feature across all the items of the batch, so we are mixing, let's say, values from different items of the batch. In layer normalization we treat each item in the batch independently: each item has its own mean and its own variance.
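A minimal sketch of layer normalization with the learnable gamma and beta (a small epsilon is added inside the square root for numerical stability, as usual):

import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 6, 512
x = torch.randn(batch, seq_len, d_model)

# Statistics are computed per item, over its own features only.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
gamma = torch.ones(d_model)   # learnable multiplicative parameter
beta = torch.zeros(d_model)   # learnable additive parameter
x_norm = gamma * (x - mean) / torch.sqrt(var + 1e-5) + beta

# The built-in module does the same thing (gamma = weight, beta = bias).
layer_norm = nn.LayerNorm(d_model)
print(torch.allclose(x_norm, layer_norm(x), atol=1e-5))   # True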
Let's look at the decoder now. In the decoder, the input embeddings are called output embeddings, but the underlying working is the same as in the encoder. Also, we have the positional encoding, and it is the same as in the encoder. The next layer is the masked multi-head attention, and we will see it now. We also have a multi-head attention here, and here we should note that the encoder produces an output that is sent to the decoder in the form of keys and values, while the query, so this connection here, is the query coming from the decoder. So this multi-head attention is not a self-attention anymore, it's a cross-attention, because we are taking two sentences: one is sent from the encoder side, so let's write encoder, in which we provide the output of the encoder and use it as keys and values, while the output of the masked multi-head attention is used as the query in this multi-head attention. The masked multi-head attention is the self-attention of the input sentence of the decoder. So we take the input sentence of the decoder, we transform it into embeddings, we add the positional encoding, we give it to this masked multi-head attention in which the query, key and values are the same input sequence, and we do the add & norm. Then we send this as the queries of the multi-head attention, while the keys and the values come from the encoder, and then we do the add & norm again. I will not be showing the feed forward, which is just a fully connected layer. We then send the output of the feed forward to the add & norm, and finally to the linear layer, which we will see later.

So let's have a look at the masked multi-head attention. We want each word to only look at the words that come before it, or at the word itself. So we don't want this interaction, or this one, or this one, and the same for the other words, etc. As you can see, we are replacing all the values that are above this diagonal here, the principal diagonal of the matrix: we want all the values above this diagonal to be replaced with minus infinity, so that the softmax will replace them with zero. Let's see at which stage of the multi-head attention this mechanism is introduced. When we calculate the attention between the smaller matrices, so Q1, K1 and V1, before we apply the softmax, we replace these values, so this one, this one, this one, etc., with minus infinity; then we apply the softmax, and the softmax will take care of transforming these values into zeros. So basically we don't want these words to interact with each other, and if we don't allow this interaction, the model will learn to not make them interact, because the model will not get any information from this interaction.
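A sketch of this causal mask: scores above the principal diagonal are set to minus infinity before the softmax, and the softmax turns them into zeros:

import math
import torch

seq_len, d_k = 6, 512
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)

scores = q @ k.T / math.sqrt(d_k)                                  # (6, 6)
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool() # True above the diagonal
scores = scores.masked_fill(mask, float('-inf'))                   # hide the "future" words
weights = torch.softmax(scores, dim=-1)
print(weights[0])  # the first word attends only to itself: [1., 0., 0., 0., 0., 0.]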
Now let's look at how inference and training work for a Transformer model. As I said previously, we will be dealing with a translation task, because it's easy to visualize and it's easy to understand all the steps. Let's start with the training of the model. We will go from an English sentence, "I love you very much", to an Italian sentence. It's a very simple sentence, it's easy to describe, so let's go. We start with our description of the Transformer model, and we start with our English sentence, which is sent to the encoder. To our English sentence here we prepend and append two special tokens: one is called start of sentence and one is called end of sentence. These two tokens are taken from the vocabulary, so they are special tokens in our vocabulary that tell the model what the start of a sentence is and what the end of a sentence is. The output of the decoder is projected by the linear layer back into our vocabulary, and the position corresponding to the output word will have the maximum score after the softmax. This is how we know which word to select from the vocabulary, and this hopefully produces the first output token, which is "ti", if the model has been trained correctly. This, however, happens at time step one. When we train the Transformer model, everything happens in one pass: we have one input sequence and one output sequence, we give them to the model, we do it in one time step, and the model learns from it. When we inference, however, we need to do it token by token, and we will also see why this is the case.

At time step 2 we don't need to recompute the encoder output again, because our English sentence didn't change, so the encoder should produce the same output for it. What we do is take the output of the previous step, "ti", append it to the input of the decoder, and then feed it to the decoder again along with the output of the encoder from the previous step. This produces an output sequence from the decoder side, which we again project back into our vocabulary, and we get the next token, which is "amo". As I said before, we are not recalculating the output of the encoder at every time step, because our English sentence didn't change at all; what is changing is the input of the decoder, because at every time step we append the output of the previous step to the input of the decoder. We do the same for time step 3, and we do the same for time step 4, and hopefully we stop when we see the end-of-sentence token, because that's how the model tells us to stop inferencing. And this is how the inference works; this is why we needed four time steps when we inference a model like this translation model.

There are many strategies for inferencing. The one we used is called the greedy strategy: at every step, we take the word with the maximum softmax value. This strategy usually works not badly, but there are better strategies, and one of them is called beam search. In beam search, instead of always greedily taking the maximum softmax value (that's why the first one is called greedy), we take the top B values, then for each of these choices we inference what the next possible tokens are, and at every step we keep only the B most probable sequences.
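To tie the inference description together, here is a schematic greedy-decoding loop. The model.encode, model.decode and model.project methods, and the token IDs, are hypothetical placeholders for illustration, not an API defined in the video:

import torch

def greedy_decode(model, src_ids, sos_id, eos_id, max_len=50):
    # The encoder output is computed once and reused at every time step.
    encoder_output = model.encode(src_ids)

    decoder_input = torch.tensor([[sos_id]])               # start with the SOS token
    for _ in range(max_len):
        decoder_output = model.decode(decoder_input, encoder_output)
        logits = model.project(decoder_output[:, -1])      # linear layer -> vocabulary scores
        next_token = logits.softmax(dim=-1).argmax(dim=-1, keepdim=True)  # greedy choice
        decoder_input = torch.cat([decoder_input, next_token], dim=1)     # append and repeat
        if next_token.item() == eos_id:                    # the model tells us to stop
            break
    return decoder_input                                   # SOS + generated tokens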
The main problems with RNNs that led to the development of Transformers were their inability to effectively handle long-range dependencies and the vanishing/exploding gradient problem. In long sequences, the impact of earlier inputs on later outputs diminishes significantly in RNNs due to the repeated matrix multiplications, making it difficult to capture crucial contextual information needed for tasks like machine translation. Transformers addressed these issues through their architecture, enabling parallel processing and better handling of long-range dependencies.