This lecture introduces sequence modeling, focusing on Recurrent Neural Networks (RNNs) and their application to sequential data. RNNs process data step by step, updating an internal state to maintain information across time steps. However, RNNs suffer from vanishing/exploding gradients and limitations in handling long sequences. The lecture then introduces the attention mechanism, a more efficient approach that identifies and weighs important parts of the input sequence, forming the basis of powerful architectures like Transformers used in large language models and other applications.

Recall from the first lecture: our perceptron, our single neuron, operates on a set of inputs to produce an output. It takes those inputs, applies its weights in a linear combination, applies a nonlinear activation function, and generates the output. We also saw how we can stack perceptrons on top of each other to create what we call a layer, where we take an input, compute on it with this layer of neurons, and generate an output as a result. Still, we don't have any real notion of sequence or of time; what I'm showing you is just a static, single-input, single-output mapping. We can think about collapsing the neurons in this layer into a simpler diagram, where I've taken those neurons and simplified them into this green block. In this input-output mapping, we can think of the input as arriving at a particular time step, just one time step t, and our neural network is trying to learn a mapping between input and output at that time step.

Okay, now I've been saying that sequence data is data over time. What if we took this very same model and applied it over and over again to all the individual time steps in a data point? What would happen? All I've done here is take that same diagram and flip it 90 degrees so it's now vertical: we have an input vector of numbers, our neural network computes on it, and we generate an output. Now let's say we have some sequential data, so we no longer have just a single time step; we have multiple individual time steps. We start from x_0, the first time step in our sequence, and we could take that same model and apply it stepwise, step by step, to the other slices, the other time steps, in the sequence.

What could be a potential issue that arises from treating our sequential data in this kind of isolated, step-by-step view? Yes. So I heard some comments back that there is inherently a dependence in the sequence, and this diagram is completely missing it: there's no link between time step zero and time step two. Indeed. In this setting we're treating the time steps in isolation. But I think we can all appreciate that the output at a later time step should depend on the inputs and observations we saw prior. By treating the time steps in isolation, we're completely missing out on the inherent structure of the data and the patterns we're trying to learn. So the key idea is this: what if we build our neural network to explicitly model that time-step-to-time-step relation? One idea is to take this model and link the computation between the time steps together. We can do this mathematically by introducing a variable we call h of t, which stands for a notion of the state of the neural network. That state is actually learned and computed by the neurons in this layer.
The state is then passed on and propagated from time step to time step, iteratively and sequentially updated. And so what you can see, as we start to build out this modeling diagram, is that we can now produce a relationship where the output at a time step t depends on both the input at that time step and the state from the prior time step that was just passed forward. This is a really powerful idea: it is an abstraction we can capture in the neural network, this notion of a state capturing something about the sequence, and we iteratively update it as we make observations in the sequence data. This idea of passing the state forward through time is the basis of what we call a recurrent cell, or neurons with recurrence. What that means is that the function and the computation of the neuron is a product of both the current input and this past memory of previous time steps, and that's reflected in this state variable. On the right-hand side of this slide, what you're seeing is basically that neural network model unrolled, or unwrapped, across the individual time steps. But importantly, it's just one model that still has this relation back to itself. This is the slightly mind-warpy part: thinking about how we unroll, visualize, and reason about the model operating over these individual time steps, or as having this recurrence relation with respect to itself.

So this is the core idea. The key point is that we have the state h of t, and it's updated at each time step as we process the sequence; that update is captured in what we call the recurrence relation. And this is a standard neural network operation, just like we saw in lecture one: we have the cell state variable h of t, we learn a set of weights W, and the state update is a function, parameterized by those weights, of both the input at a particular time step and the information passed on from the prior time step in the state variable. What is really important to keep in mind is that for a particular RNN layer, we have the same set of weight parameters, which are simply updated as the model is learned; same function, same set of weights. The difference is just that we're processing the data time step by time step.

We can also think of this from another angle, in terms of how we would actually implement an RNN. We begin by initializing the hidden state and initializing an input sentence, broken up into individual words, that we want this RNN to process. All we're going to do is iterate through each of the individual words, the individual time steps in the sentence, update the hidden state, and generate an output prediction as a function of the current word and the hidden state. And then at the very end, we take that updated hidden state and generate the prediction for what word comes next at the end of the sentence. This is the idea of how the RNN includes both a state update and, finally, an output that we can generate per time step.
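The loop just described can be sketched in a few lines of Python. This is only a toy illustration: the vocabulary, the sentence, and the embedding table and weight matrices below are random stand-ins for a trained RNN, invented for the example, not part of the lecture code.

    import numpy as np

    # Toy stand-in for a trained RNN cell: random weights, tanh state update.
    rng = np.random.default_rng(0)
    vocab = {"I": 0, "love": 1, "recurrent": 2, "neural": 3, "networks": 4}
    embed = rng.normal(size=(len(vocab), 8))     # one 8-dim vector per word (illustrative)
    W_xh = rng.normal(size=(8, 8))               # input-to-hidden weights
    W_hh = rng.normal(size=(8, 8))               # hidden-to-hidden weights
    W_hy = rng.normal(size=(len(vocab), 8))      # hidden-to-output weights

    def my_rnn(word, hidden_state):
        x = embed[vocab[word]]
        hidden_state = np.tanh(W_hh @ hidden_state + W_xh @ x)  # state update
        prediction = W_hy @ hidden_state                         # output at this time step
        return prediction, hidden_state

    hidden_state = np.zeros(8)                   # initialize the hidden state
    sentence = ["I", "love", "recurrent", "neural"]

    # Iterate through the words, updating the hidden state and predicting at each step.
    for word in sentence:
        prediction, hidden_state = my_rnn(word, hidden_state)

    # After the last word, the final prediction scores the likely next word.
    next_word = max(vocab, key=lambda w: prediction[vocab[w]])
    print(next_word)

The point is the structure of the loop: the same cell is applied at every time step, carrying the hidden state forward and producing a prediction each time.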
To walk through this component: we have the input vector x of t, and we can use a mathematical description, based on a nonlinear activation function and a set of neural network weights, to update the hidden state h of t. While this may seem complicated, it's really very similar to what we saw before. All we're doing is learning matrices of weights: one for transforming the previous hidden state and one for transforming the input. We multiply each by its respective input, add them together, apply a nonlinearity, and use the result to update the state variable h of t. Finally, we can output a prediction at that time step as a function of that updated internal state: the RNN has updated its state, we apply another weight matrix, and we generate an output prediction. (This also connects to the earlier question about the nonlinear activation functions applied at each of these steps.)

The idea is that you have an input at a particular time step, and you can visualize how that input and the output prediction occur at each individual time step in your sequence. Making the weight matrices explicit, we can see that this ultimately leads to both updates to the hidden state and predictions of the output. Furthermore, to re-emphasize: it's the same weight matrices, for the input-to-hidden-state transformation and the hidden-state-to-output transformation, that are effectively reused and re-applied across these time steps.

Now, this gives us a sense of how we can go forward through the RNN to compute predictions. To actually learn the weights of this RNN, we have to compute a loss and use the technique of backpropagation to adjust our weights based on that loss. And because we compute things time step by time step, we can simply take the individual loss from each time step, sum them all together, and get a total value of the loss across the whole sequence.

One question: how does this differ from the bias? The bias is something that comes in separately from the x at that particular time step; is this different from the bias? Yes. What I'm describing here is specifically how the learned weights are updated as the model is trained, and how the weight matrix itself is applied to, say, the input and transforms it in this visualization. In the equations we showed, we abstracted away the bias term, but the important thing to keep in mind is that the matrix multiplication is the learned weight matrix multiplied against the input or the hidden state.

Okay, so similarly, here is a little more detail on the inner workings of how we can implement an RNN layer from scratch using code in TensorFlow. As we introduced, the RNN itself is a layer, a neural network layer. What we start by doing is first initializing those three sets of weight matrices that are key to the RNN computation.
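The slide code itself isn't reproduced in this transcript, but a from-scratch layer along the lines described here might look like the following sketch. The class name MyRNNCell, the use of tanh, the default weight initialization, and the column-vector shapes are illustrative assumptions, not the exact lecture code.

    import tensorflow as tf

    class MyRNNCell(tf.keras.layers.Layer):
        def __init__(self, rnn_units, input_dim, output_dim):
            super().__init__()
            # Initialize the three weight matrices:
            # input-to-hidden, hidden-to-hidden, and hidden-to-output.
            self.W_xh = self.add_weight(shape=(rnn_units, input_dim))
            self.W_hh = self.add_weight(shape=(rnn_units, rnn_units))
            self.W_hy = self.add_weight(shape=(output_dim, rnn_units))
            # Initialize the hidden state to zeros.
            self.h = tf.zeros([rnn_units, 1])

        def call(self, x):
            # x is a column vector of shape (input_dim, 1) for one time step.
            # Update the hidden state: apply the weight matrices, sum, nonlinearity.
            self.h = tf.math.tanh(self.W_hh @ self.h + self.W_xh @ x)
            # Compute the output as a transformation of the updated hidden state.
            output = self.W_hy @ self.h
            # Return both the output prediction and the hidden state for this time step.
            return output, self.h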
That's what's done in the initialization: we create those weight matrices, and we also initialize the hidden state. The next thing we have to do to build up an RNN from scratch is define how we actually make a prediction, a forward pass, a call to the model. What that amounts to is taking the hidden state update equation and translating it into Python code that applies the weight matrices, applies the nonlinearity, and then computes the output as a transformation of the result. Finally, at each time step, both the updated hidden state and the predicted output are returned by the call function of the RNN. This gives you a sense of the inner workings and the computation translated into code. But in the end, TensorFlow and other machine learning frameworks abstract a lot of this away, so that you can simply define the dimensionality of the RNN you want and use built-in functions and built-in layers to define it in code (a short sketch of this appears at the end of this passage).

For more on backpropagation in RNNs, see: https://towardsdatascience.com/backpropagation-in-rnn-explained-bdf853b4e1c2

Sequence modeling problems involve processing sequential data, such as words in a sentence or data points over time. Unlike standard deep learning tasks that often deal with single inputs and outputs (like binary classification), sequence modeling handles sequences of data, requiring models to reason about the order and relationships within those sequences. For example, determining the sentiment of a sentence requires understanding the order of words, which is a sequence modeling problem.

In Recurrent Neural Networks (RNNs), the vanishing gradient problem arises during backpropagation through time (BPTT). As gradients are calculated and propagated back through many time steps, they can become increasingly small, making it difficult to update the weights of earlier layers effectively. This hinders the RNN's ability to learn long-term dependencies in the sequence. One solution is to use gated recurrent units, such as LSTMs (Long Short-Term Memory networks), which employ gating mechanisms to selectively retain or forget information, thus mitigating the vanishing gradient problem. These gating mechanisms allow for better control of the flow of information through the network, enabling the learning of long-range dependencies.

The next thing in terms of how we tackle this sequence modeling problem is that we need a way to handle sequences of differing length: a sentence of four words, a sentence of six words; the network needs to be able to handle that. The issue that comes with handling variable sequence lengths is that, as your sequences get longer, your network needs the ability to capture information from early in the sequence and incorporate it into an output that may come much later. This is the idea of a long-term dependency, or of memory in the network, and it is another fundamental problem in sequence modeling that you'll encounter in practice. The other aspect we'll touch on briefly is the intuition behind order: the whole point of a sequence is that things appearing in a particular, defined order capture something meaningful. So even if we have the same set of words, if we flip the order around, the network's representation and modeling of that should be different, capturing that dependence on order.
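As a concrete illustration of the built-in-layer route mentioned above, a small next-word-prediction style model could be assembled in a few lines. This is a sketch, not code from the lecture; the vocabulary size, embedding dimension, and number of hidden units are arbitrary choices for illustration.

    import tensorflow as tf

    vocab_size = 10000    # illustrative vocabulary size
    embedding_dim = 128   # illustrative embedding dimension
    rnn_units = 64        # illustrative RNN dimensionality

    model = tf.keras.Sequential([
        # Map word indices to dense embedding vectors (handles variable-length index sequences).
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        # The built-in recurrent layer: you only specify its dimensionality.
        tf.keras.layers.SimpleRNN(rnn_units),
        # Project the final hidden state to scores over the vocabulary for next-word prediction.
        tf.keras.layers.Dense(vocab_size),
    ])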
All this is to say that this example from natural language, the question of next-word prediction, highlights why this is a challenging problem for a neural network to learn and model, and why we should keep these criteria in the back of our minds as we implement, test, and build these algorithms and models in practice.

One quick question: for a large embedding, how do you know what dimension of space you're supposed to use to group things together? This is a fantastic question about how large we set that embedding space. You can envision that as the number of distinct things in your vocabulary increases, you might first think a larger space would be useful, but it's not true that strictly increasing the dimensionality of the embedding space leads to a better embedding. The reason is that it gets sparser the bigger you go, and effectively you're just building a lookup table that's closer and closer to a one-hot encoding, so you're defeating the purpose of learning the embedding in the first place. The idea is to strike a balance: a dimensionality small enough to be efficient, so the embedding actually gives you an efficient bottleneck and representation, but large enough that you have the capacity to map all the diversity and richness in the data. That's a design choice, and there are works that study what makes an effective embedding space for language, say, but that's the balance we keep in mind. I'm going to keep going for the sake of time, and we'll have time for questions at the end.

Okay, so that gives us RNNs, how they work, and where we are with these sequence modeling problems. When we backpropagate, when we try to update the weights based on the loss, we go backwards and propagate the gradients through the network, back towards the input, to adjust these parameters and minimize the loss. The whole concept is that we have our loss objective, and we're just trying to shift the parameters of the model, the weights, to minimize that objective. With RNNs there's now a wrinkle: the loss is computed time step by time step as we do this sequential computation, and then added up at the very end to get a total loss. What that means is that when we make our backward pass, we have to backpropagate the gradients per time step, and then, finally, across all the time steps, from the end all the way back to the beginning of the sequence. This is the idea of backpropagation through time: the errors are additionally backpropagated along the time axis, back to the beginning of the data sequence. Now, you can maybe see why this can get a little hairy. If we take a closer look at how this computation actually works, backpropagation through time means that, as we go stepwise, time step by time step, we have to do a repeated computation involving the weight matrix: weight matrix, weight matrix, weight matrix, and so on.
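To make that repeated computation explicit, here is a sketch of the gradient in standard notation; this derivation is not worked through in the lecture itself. With total loss L summed over time steps and the state update h_t = tanh(W_hh h_{t-1} + W_xh x_t), the gradient of the loss at time step t with respect to the recurrent weights involves a product of factors reaching back through time:

    L = \sum_t L_t, \qquad h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)

    \frac{\partial L_t}{\partial W_{hh}} = \sum_{k=0}^{t} \frac{\partial L_t}{\partial \hat{y}_t} \, \frac{\partial \hat{y}_t}{\partial h_t} \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right) \frac{\partial h_k}{\partial W_{hh}}

Each factor \partial h_i / \partial h_{i-1} contains W_{hh} and the derivative of the activation function, so the parenthesized product is exactly the repeated weight-matrix computation being described: many factors larger than one compound and explode, while many factors smaller than one compound and vanish.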
The reason this can be very problematic is that, with this repeated computation, if those values are very large and you multiply or take derivatives with respect to them over and over, the gradients can grow excessively and uncontrollably large and explode, such that learning is no longer really tractable. So one thing that's done in practice is to effectively clip these gradients back, scaling them down so that learning can proceed. You can also have the opposite problem: if your values start out very small and you do these repeated matrix multiplications, the values can shrink very quickly and become diminishingly small. This is also quite bad, and there are strategies we can employ in practice to mitigate it as well.

The reason this notion of diminishing or vanishing gradients is a very real problem for learning an effective model is that we're shooting ourselves in the foot in terms of our ability to model long-term dependencies. As your sequence grows longer, you need a larger memory capacity to track those longer-term dependencies. But if your sequence is very long and has long-term dependencies while your gradients keep vanishing, you lose the ability, as you go further out in time, to learn anything useful and keep track of those dependencies. The network's capacity to model that dependency is reduced or destroyed. So we need real strategies to mitigate this within the RNN framework, because of its inherent sequential processing of the data.

In practice, going back to one of the earlier questions about how we select activation functions, one very common thing done in RNNs is to choose the activation functions wisely to help mitigate this shrinking-gradient problem, using activation functions whose derivative is either zero or one, namely the ReLU activation function. Another strategy is to initialize the weights, the actual starting values of the weight matrices, smartly, so that once we start making updates we're less likely to run into the vanishing gradient problem as we do those repeated matrix multiplications. The final idea, and the most robust one in practice, is to build a more robust neural network layer, a more robust recurrent cell itself. This is the concept of gating, which effectively introduces additional computations within the recurrent cell that selectively keep, or selectively remove and forget, aspects of the information being input into the recurrent unit. We're not going to go into detail about how gating works mathematically, for the sake of time and focus.

RNNs and attention-based models like Transformers, while prominently used in Natural Language Processing (NLP), find applications in diverse fields beyond NLP. In biology, they are used to model DNA or protein sequences, predicting three-dimensional protein structures from sequence information. They also have applications in computer vision, where architectures like Vision Transformers utilize self-attention mechanisms to process image data.
Furthermore, these models are used in time series analysis, such as financial forecasting or weather prediction, where sequential data points need to be processed to make predictions. Other applications include speech recognition, machine translation, and music generation.

Self-attention is a mechanism that allows a model to weigh the importance of different parts of an input sequence when processing it. It doesn't process the sequence step by step like an RNN, but rather considers all parts of the sequence simultaneously. The process involves three key components:

Query (Q): Represents the current element in the sequence that we're focusing on. Think of it as a question: "What information is relevant to me?"

Key (K): Represents all elements in the sequence, providing context. These are like labels or identifiers for each element in the sequence.

Value (V): Represents the actual information associated with each element in the sequence. These are the values that will be weighted and combined based on the attention scores.

The process works by calculating attention weights between the query and each key. This is often done using a similarity measure like the dot product, followed by a softmax function to normalize the weights into a probability distribution. These weights are then used to create a weighted sum of the values, effectively focusing on the most relevant parts of the sequence for the current query. This weighted sum is the output of the self-attention mechanism for the given query. In essence, self-attention allows the model to attend to different parts of the input sequence in a data-driven way, focusing on the most relevant information for each position in the sequence.

Word embeddings transform words into dense, low-dimensional vector representations that capture semantic meaning. Instead of representing words as one-hot vectors (sparse and high-dimensional), word embeddings capture relationships between words. Words with similar meanings have vectors that are closer together in the vector space. This allows neural networks to understand semantic relationships between words, improving performance on tasks like sentiment analysis, machine translation, and text classification. Popular techniques for generating word embeddings include Word2Vec, GloVe, and fastText. These methods learn embeddings by analyzing large text corpora, capturing contextual information and relationships between words. The resulting word embeddings are then used as input features for neural networks, enabling the networks to process and understand text data more effectively.
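To make the query/key/value description above concrete, here is a minimal sketch of scaled dot-product self-attention in TensorFlow. The sequence length, model dimension, and the random inputs and projection matrices are purely illustrative; in a real model the projections would be learned parameters.

    import tensorflow as tf

    # Illustrative dimensions: a sequence of 5 token embeddings, each of size 16.
    seq_len, d_model = 5, 16
    x = tf.random.normal([seq_len, d_model])       # token embeddings (stand-in input)

    # Linear projections produce queries, keys, and values (random here, learned in practice).
    W_q = tf.random.normal([d_model, d_model])
    W_k = tf.random.normal([d_model, d_model])
    W_v = tf.random.normal([d_model, d_model])
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    # Similarity between each query and every key, scaled by sqrt(d_model).
    scores = Q @ tf.transpose(K) / tf.sqrt(tf.cast(d_model, tf.float32))

    # Softmax normalizes the scores into attention weights (a distribution over positions).
    weights = tf.nn.softmax(scores, axis=-1)

    # Each output is a weighted sum of the values: the model attends to the most relevant positions.
    output = weights @ V                           # shape: [seq_len, d_model]

Because every position attends to every other position in one shot, there is no step-by-step recurrence and no repeated multiplication through time, which is what makes this mechanism attractive compared to the RNN computation discussed earlier.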