This video explains Long Short-Term Memory (LSTM) networks, an improvement over Recurrent Neural Networks (RNNs). RNNs suffer from the vanishing/exploding gradient problem, which hinders their ability to use information from earlier in a sequence to make later predictions. LSTMs mitigate this by introducing a "cell state" that can selectively remember information over long sequences via "gates" (forget, input, output) that control information flow. This allows LSTMs to maintain the long-term dependencies crucial for tasks like natural language processing.

So let's say we're trying to predict the next word, given this sequence of words here: "the dog ran up to the orange cat and did what?" There could be any number of things in this blank, but I think most of us would say that it's important to keep in mind that we're talking about the dog in this moment. Because if you follow the sentence again: there's a dog, it ran up to this orange cat, and, well, what did the dog do? Did it bark at it? Did it chase it? There's going to be something here, but it pertains to the fact that we're talking about a dog. The interesting thing with this sentence is that between the thing we're talking about and the thing we're trying to predict, there are actually quite a few words to get through. And we'll see that this is exactly where the vanishing or exploding gradient problem arises. But we'll get there in just a moment.

First, let's recap the structure of the simple recurrent neural network. Some of you might have noticed already that I've simplified it even further than we did in the original video, and there are two reasons for that. The first is that when we implemented recurrent neural networks in Python using the Keras package, we found that this is actually the structure they use, so this will be more in line with the code. The other reason is the difference you probably noticed: there's no longer a separate hidden state and output state (there are no y's anymore); we're just using the hidden state H as the output state itself. That's exactly the choice the LSTM, or long short-term memory, makes as well, so it'll make our transition from simple RNNs to LSTMs easier.

Again, the question is: how does X2 affect H9? Well, the only thing X2 affects first is H2, so we'll write that gradient first: X2 has some effect on H2. Now, the only thing H2 affects is H3, so next we'll write how H2 affects H3. The only thing H3 affects is H4, so that's the derivative of H4 with respect to H3, and so on and so on. There's not too much interesting going on; we're just following these hidden states, or output states, down until we get to H9. The final piece is how H8 affects H9, signified by this arrow right here.

So now we should write out the forms for each of these. I'm not going to dwell too much on how exactly I got these forms; you can go back to the previous video and look at the exact mathematical forms for how each of these H's is computed from the current x and the H before it. You'll find that the first derivative here is the only unique one, because it involves an X and an H instead of just a pair of H's: it's the matrix U, the matrix that multiplies any given x, times this function D of H2. And since I'm using a sigmoid activation here, this D of anything is simply a diagonal matrix whose diagonal is given by H, this weird symbol here, times 1 minus H.
And this weird symbol, which is going to come up again later in the video, is called element-wise multiplication. It just means that if I have one vector of size 5 and another vector of size 5, to element-wise multiply them is to create a new vector whose first element is the product of the first elements of my two vectors, whose second element is the product of the second elements, and so on. And diag just means: take this vector and turn it into a matrix by putting that vector along the diagonal of the new matrix, with everything else equal to zero. This really doesn't matter too much for this video; I just wanted to be mathematically rigorous about what I'm writing down. You could also use a different activation function instead of sigma here: you could use tanh, or even a linear activation function might have been a better idea so we wouldn't have to talk about this so much. But anyway, don't worry too much if this doesn't make sense. The key insight is that the first derivative is U times that D of H2.

Now things get easier, because there's a repeatable pattern. What's the derivative of H3 with respect to H2? That's going to be V, the matrix that multiplies each of the H's, times the same D function of H3. What's the derivative of H4 with respect to H3? It's the same thing, except we're putting an H4 there instead. So all of these derivatives between pairs of H's are pretty much the same; the only difference is that the D function takes whatever H it's looking at at the moment.

So we have this big, kind of nasty-looking derivative, and I've written all the dimensions here in case you want to trace it. We said that our X's live in a k-dimensional space and our H's live in a z-dimensional space, so by writing the dimensions under each of our matrices you can see that the final product lives in a k-by-z-dimensional space. Does that pass a sanity check? Well, we're asking for the derivative of H9 with respect to X2. X2 is a k-dimensional vector, so it makes sense that the first dimension should be k, and H9 is a z-dimensional vector, so the second dimension should be z. So at least all the dimensions check out, and we've sanity-checked that we're not making mistakes here.

But what's the point? The point is not the math itself; the point is to look at the kinds of symbols contributing to this gradient. The key thing to notice is that there are tons of these V's in this multiplication. There's actually one V for every arrow here, going along this chain of hidden states, or output states, until we get to the hidden state we're trying to predict. And that's exactly where our story continues next. There are lots of multiplications by V, and now you see where things start going wrong the more distance we put between the thing we care about and the thing we're trying to predict, which references that thing we care about. Because what's going to happen in that giant gradient we just computed is that you basically get V to the power of n, where n is the number of time steps between the thing you care about and the thing you're trying to predict.
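To make that chain rule concrete, here is a minimal numpy sketch of it for a toy simple RNN of the form h_t = sigmoid(U x_t + V h_{t-1}). The dimensions, the random weights, and the z-by-k layout of the Jacobian (the transpose of the k-by-z layout written in the video) are my own illustrative choices, not anything from the video itself.

```python
import numpy as np

rng = np.random.default_rng(0)
k, z, T = 4, 6, 9          # x's live in R^k, h's live in R^z, 9 time steps

# Hypothetical weights of a simple RNN: h_t = sigmoid(U @ x_t + V @ h_{t-1})
U = rng.normal(scale=0.5, size=(z, k))
V = rng.normal(scale=0.5, size=(z, z))
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Forward pass on random inputs, keeping every hidden state
xs = rng.normal(size=(T + 1, k))      # x_1 ... x_T (index 0 unused for clarity)
hs = [np.zeros(z)]                    # h_0 = 0
for t in range(1, T + 1):
    hs.append(sigmoid(U @ xs[t] + V @ hs[t - 1]))

# D(h) from the video: a diagonal matrix with h * (1 - h) on the diagonal,
# where * is the element-wise multiplication being described
D = lambda h: np.diag(h * (1 - h))

# Chain rule for dh_9/dx_2: start with dh_2/dx_2, then multiply one
# factor per arrow h_2 -> h_3 -> ... -> h_9.  Note the repeated V.
J = D(hs[2]) @ U                      # the one "unique" factor, involving U
for t in range(3, T + 1):
    J = D(hs[t]) @ V @ J              # every further step contributes another V

print(J.shape)                        # (z, k): derivative of h_9 w.r.t. x_2
print(np.linalg.norm(J))              # small already at 9 steps with these weights
```

With these small random weights the norm of that Jacobian shrinks as the chain gets longer, because every extra arrow contributes another factor of V times a sigmoid derivative that is at most 0.25. That is the V-to-the-n effect in code.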
Now, what happens mathematically to the entries of this matrix V if they are less than one? Well, if you take something less than one and raise it to larger and larger powers of n, that thing vanishes geometrically; it goes to zero. That's the vanishing gradient problem. Conversely, what happens if something in V is greater than one, even just 1.02? Given enough time, given a large enough n, that thing is going to explode, or go to infinity. And now we see what we hand-waved in the last video about why we get this vanishing or exploding gradient problem so badly in recurrent neural networks: because we're taking the same matrix V and multiplying by it over and over and over again. And even though, theoretically, a small number multiplied by itself many times never quite reaches zero, computationally our computers do have a limit, both on the lower side near zero and on the upper side near infinity, for what they're able to represent. Therefore, when we actually go to code these recurrent neural networks and we have a long enough sequence, we're going to get this vanishing gradient problem or this exploding gradient problem.

Now, I want to be clear: this was a toy example. We only had about 10 words in our sentence, and it's probably not going to happen at 10 words. But you can imagine tons of real-world applications where you have many more than 10 things in your input sequence: you're inputting a whole paragraph, or an entire book, all at once, and on page 500 of the book you're trying to reference some character who showed up on page one. So you can see this is not a crazy scenario, and it could be a real problem for us.

So, in a nutshell, what's the problem? The problem is that in this simple or vanilla RNN (you'll hear these terms used interchangeably; simple RNN or vanilla RNN just means the thing we studied in the last video), it's hard for past inputs to impact future predictions because of this vanishing or exploding gradient problem. So we're going to extend the simple RNN with the goal of letting the network pass information along unchanged. Because if you go back to this architecture, the whole reason this is happening is that when you ask how X2 changes H9, there's only one way for the information from X2 to get to H9: via this highway here. And this highway is full of these V's, which are basically destroying your information little by little by little, until there's no way for H9 to know what this word even was. So the solution should be some way of letting the network pass that X2 unchanged all the way down to H9, or H100, or H1000, if it's helpful for H9 or H100 or H1000 to have access to that information. And that's exactly the goal we'll have in mind as we start explaining, step by step, how we build up these LSTM networks.
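Before we build up the LSTM, here is one more quick numerical illustration (my own, not from the video) of the two effects above: a factor slightly below one vanishing, a factor slightly above one exploding, and the hard floating-point limits our computers impose.

```python
import numpy as np

# A number a little below 1, raised to larger and larger n, vanishes;
# a number a little above 1 explodes. That is all V**n is really doing.
for n in (10, 100, 1_000, 10_000):
    print(n, 0.9 ** n, 1.02 ** n)
# ...by n = 10_000 the first column has already underflowed to exactly 0.0.

# And floating point really does have hard limits on both ends:
print(np.finfo(np.float64).tiny)   # smallest "normal" positive double, ~2.2e-308
print(np.finfo(np.float64).max)    # largest representable double,     ~1.8e+308
```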
So the pieces of information we have at each time step that you already know about are X, the new piece of information you're considering, and H, the hidden or output state from the previous time step; those we already know about. We're now going to add the crux, the key, the secret of LSTMs, which is this new vector C, the cell state. The purpose, the goal, the job of the cell state is to remember long-term information if it's useful to the network at some point in the future. That's exactly what we were missing: right now there's no mechanism in the simple RNN to remember this short-term information long term if it's helpful, but now we're giving it that ability. So that is extension number one: we're adding the cell state. Now let's talk about what the cell state is, how it gets set, and what factors go into it.

So let's go through the six steps for building up the LSTM at any time step t. We now have three things we're allowed to play with, so let's just name them here: we have the current new input (the new word), we have the hidden or output state from the previous time step, and now we have the cell state from the previous time step. We're still not exactly sure what this is or how we're going to use it, but hopefully through the six-step process it'll become clear what it's trying to do and how it's trying to do it.

The main drawback of simple or vanilla RNNs discussed in the video is the vanishing or exploding gradient problem. This problem makes it difficult for past inputs to influence future predictions, especially in long sequences: when dealing with long sequences like paragraphs or books, the RNN struggles to connect information from early parts of the sequence to later parts because of the way gradients are propagated during training. A Long Short-Term Memory (LSTM) network is a type of recurrent neural network (RNN) architecture designed to address the vanishing/exploding gradient problem inherent in standard RNNs, which prevents them from effectively learning long-range dependencies in sequential data. LSTMs solve this with a mechanism that allows them to selectively remember information over extended time periods, enabling them to connect information from earlier parts of a sequence to later parts more effectively. This is achieved through the use of gates and a cell state that regulate the flow of information.

The LSTM architecture addresses the vanishing gradient problem by employing a cell state and a system of gates (input, output, and forget gates). The cell state acts as a kind of "conveyor belt" that runs through the entire chain of the LSTM. Information can be added to or removed from the cell state via the gates, which selectively control the flow of information: the forget gate decides what information to discard from the cell state, the input gate determines what new information to add, and the output gate decides what parts of the cell state to output. This mechanism allows the LSTM to selectively retain relevant information from earlier time steps, even over long sequences, preventing that information from being washed away by the vanishing gradient problem. The ability to pass this information through the time steps largely unchanged is crucial to connecting early inputs to later predictions, solving the problem that simple RNNs face. The cell state (C) in an LSTM is therefore a crucial component in mitigating the vanishing gradient problem.
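Here is a minimal numpy sketch of one LSTM time step combining the three pieces of information just listed (the new input x, the previous hidden/output state h, and the previous cell state c), using the standard forget/input/output gate equations. The weight names (Wf, Wi, Wc, Wo), the concatenation of x and h, and the omission of bias terms are my own simplifications for illustration, not the video's six-step notation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: combine the new input x_t, the previous
    hidden/output state h_prev, and the previous cell state c_prev."""
    Wf, Wi, Wc, Wo = params["Wf"], params["Wi"], params["Wc"], params["Wo"]

    v = np.concatenate([x_t, h_prev])          # stack the two familiar inputs

    f = sigmoid(Wf @ v)                        # forget gate: what to erase from c
    i = sigmoid(Wi @ v)                        # input gate: what new info to write
    c_tilde = np.tanh(Wc @ v)                  # candidate new cell contents
    o = sigmoid(Wo @ v)                        # output gate: what to reveal in h

    c_t = f * c_prev + i * c_tilde             # cell state: mostly additive update
    h_t = o * np.tanh(c_t)                     # hidden/output state
    return h_t, c_t

# Tiny usage example with random weights (biases omitted to keep it short)
rng = np.random.default_rng(1)
k, z = 4, 6                                    # input dim k, state dim z
params = {name: rng.normal(scale=0.1, size=(z, k + z))
          for name in ("Wf", "Wi", "Wc", "Wo")}
h, c = np.zeros(z), np.zeros(z)
for _ in range(9):                             # run a 9-step sequence
    h, c = lstm_step(rng.normal(size=k), h, c, params)
```

The key line is the cell-state update `c_t = f * c_prev + i * c_tilde`: the forget gate scales the old memory element-wise, and new information is added on top, rather than the old memory being pushed through yet another multiplication by a shared matrix V.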
Unlike standard RNNs, where the hidden state is updated at each time step and can lose information because of the multiplicative nature of those updates, the LSTM's cell state acts as a kind of conveyor belt. Information is added to or removed from the cell state via the forget and input gates, while the output gate controls what is read out of it; the information itself flows relatively unchanged through the cell state. This allows the network to maintain long-term dependencies, because the gradient doesn't need to be repeatedly multiplied across many time steps to reach earlier parts of the sequence. The gradients related to the cell state can flow directly back through time, bypassing the vanishing gradients that plague standard RNNs. The gates then selectively control which information from the cell state is used in the network's outputs at each time step. In essence, the cell state provides a pathway for information to persist through time without being significantly altered or diminished by the recurrent updates, thus addressing the vanishing gradient problem.
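In equation form (using the usual LSTM notation rather than symbols defined in the video), the reason the gradient can flow back through the cell state is that its update is mostly additive:

```latex
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\qquad\Longrightarrow\qquad
\left.\frac{\partial c_t}{\partial c_{t-1}}\right|_{\text{direct cell-state path}}
  = \operatorname{diag}(f_t),
\qquad\text{versus}\qquad
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}\!\big(h_t \odot (1 - h_t)\big)\, V
  \;\;\text{(simple RNN, same } V \text{ every step).}
```

So along the direct cell-state path the per-step factor is just a diagonal of forget-gate values, which the network can learn to keep near 1, instead of another copy of the same matrix V whose repeated powers vanish or explode.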