This video explains the LoRA (Low-Rank Adaptation) method for fine-tuning large language models. LoRA leverages the hypothesis that the weight updates made during fine-tuning have intrinsically low rank, allowing them to be represented by pairs of much smaller matrices. This significantly reduces memory usage and training time, and it enables efficient switching between different downstream tasks, even at inference time, with minimal latency. A follow-up video will cover a practical Python implementation.

LoRA means low-rank adaptation, and that name alone isn't especially illuminating, so we need to understand a few more things before we get into exactly what LoRA is doing. The first thing I want to talk about is the idea of a weight matrix and the idea of matrix decomposition. Loosely speaking, a matrix is an a-by-b structure. Let's pretend both dimensions are 100, so we have a 100-by-100 structure with a total of 10,000 entries it's keeping track of. That's a weight matrix.

Now, one of the key insights this paper builds on is that pre-trained models have very low intrinsic dimension. Essentially, that means they can be described as accurately, or almost as accurately, using far fewer dimensions than they actually have. Instead of needing a full 100 dimensions, maybe we could get away with, say, 90. There's a lot of redundancy, a lot of extra stuff hanging around that we can get rid of.

The LoRA paper goes one step further: it hypothesizes that the weight updates also have a low intrinsic rank during adaptation. What does that mean? If we go back to our diagram, this matrix has dimensions a by b, but the rank of a matrix isn't necessarily equal to its dimensions; it's equal to the number of linearly independent rows or columns. So this matrix might be 100 by 100 and yet have rank 70, or rank 4. What that's saying is that you don't need all of those dimensions to accurately describe everything that's going on.

So we use a process called matrix decomposition to represent this very large matrix as a product of smaller matrices, in this case two factors, W_A and W_B, whose product gives us back the original. The key insight is that W_A and W_B can be much smaller than the original matrix while representing the same thing. We have an a-by-r matrix and an r-by-b matrix, and those smaller matrices are what we use to represent delta W. That's exactly what the LoRA paper says: you have an initial weight matrix W_0 of size d by k, and during fine-tuning you would add delta W, the full weight-update matrix. Instead, we represent delta W as the product B times A, where B is a d-by-r matrix, A is an r-by-k matrix, and r is much smaller than the dimensions d and k of the original matrix. Again, the reason we can do that is the hypothesis that the actual rank of the update is much lower than its dimensions suggest: the updates are intrinsically low rank, so we don't need as many parameters to represent them.
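To make that parameter saving concrete, here is a small sketch comparing a full delta-W against its B-times-A factorization. The 100-by-100 size and the rank r = 4 are just illustrative numbers for this example, not values tied to any particular model.

```python
import numpy as np

# Illustrative sizes: a 100-by-100 weight update and a chosen rank r = 4
# (these numbers are assumptions for the example, not taken from a real model).
d, k, r = 100, 100, 4

full_update_params = d * k        # a full delta-W stores d*k = 10,000 values
low_rank_params = d * r + r * k   # B (d x r) plus A (r x k) store only 800 values

# Build a delta-W from its two low-rank factors.
B = np.random.randn(d, r)
A = np.random.randn(r, k)
delta_W = B @ A                   # shape (d, k), but rank at most r

print(full_update_params, low_rank_params)   # 10000 800
print(np.linalg.matrix_rank(delta_W))        # 4 (cannot exceed r)
```

The point of the sketch: the product B @ A is a full d-by-k matrix, yet it is fully determined by far fewer numbers than delta-W would need on its own.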
And so we use r as a hyperparameter to indicate what rank we want these decomposed matrices to be. You might be thinking to yourself, hang on a second: can we really just represent the update with these smaller decomposed matrices and save ourselves all that compute? For a lot of practical cases, that actually seems to hold. The earlier paper on intrinsic dimensionality makes the point that many of these pre-trained models have intrinsically low dimension, so it's plausible that the weight updates would have intrinsically low rank as well.

This paper, though, focuses on a specific application: they only replace the attention weights, so they're not touching the rest of the transformer architecture, just attention. In fact, in the paper they mostly focus on, I believe, Q and V, the query and value projections from the query-key-value attention mechanism in "Attention Is All You Need." There are some simplifications here, and they back up their decisions rather well in the paper. I don't want to get too deep into that; I'd recommend reading the paper, it's absolutely fantastic. But we don't need to go that deep here. We just want to understand, at a high level, how this works and why it works.

So why is it so good? We accept that these weight updates have intrinsically low rank, so we can decompose the update matrix into these very small matrices and still retain most of the information. What does that buy us? When we use those small matrices, we don't need to update the weight matrices of the entire model anymore; everything else is frozen. We don't need optimizer states for the frozen parameters; we just need these small injections we've made, those matrix pairs, which is absolutely fantastic.

They give a lot of numbers here: roughly a 10,000x reduction in checkpoint size, training memory going from 1.2 terabytes down to 350 gigabytes, and about a 25% speed-up during training. Lots of optimizations come along with the deal: if you have 96 transformer layers and, instead of needing their full weight matrices, you only need these smaller versions, it adds up to a ton of memory saved, and we can compute faster because we're computing on smaller matrices.

Their results also suggest that delta W really does have very low intrinsic rank. But moving on from that, let's look at some more results. These are kind of my favorite: LoRA performs better than, or at least as well as, prefix-based approaches given the same number of parameters.
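Before going further, here is a minimal sketch of what "inject small trainable matrices and freeze everything else" could look like as a PyTorch layer. This is not the paper's official implementation: the class name, the alpha/r scaling convention, and the q_proj/v_proj attribute names in the usage comment are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-style linear layer (illustrative, not loralib)."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        # Freeze the pre-trained weight and bias: no gradients, no optimizer state.
        for p in self.base.parameters():
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        # Trainable low-rank factors: B is d x r, A is r x k.
        # B starts at zero so the wrapped layer behaves exactly like the base
        # layer at the start of fine-tuning.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scale = alpha / r

    def forward(self, x):
        # Output = W0 x + (B A) x, without ever materializing the full delta-W.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Usage sketch: wrap only the query and value projections of an attention block,
# assuming hypothetical attributes `attn.q_proj` / `attn.v_proj` on your model.
# attn.q_proj = LoRALinear(attn.q_proj, r=8)
# attn.v_proj = LoRALinear(attn.v_proj, r=8)
```

Only self.A and self.B show up as trainable parameters, which is why the optimizer state and the per-task checkpoint stay tiny.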
I really think this is cool, because there's a downside to a lot of those other adapter-style methods: we're adding something permanent to the model. With LoRA, we're just injecting these matrices; that's the only addition we're making, and we don't even need to keep it, we can simply remove it.

Now, one of the coolest things I haven't touched on yet: because all we inject, and all we train, is this B-A pair, we can just combine it with the pre-trained weights. Imagine a situation where you have one base model and you've trained six different downstream tasks using LoRA. All you have to do to swap between those tasks is replace the B-A pair, and you can do that even at inference time. You could have customers choose which version they want at inference time, so you don't have to keep all of the models running at once; you swap based on what task needs doing, which is huge. I think that's the biggest advantage of LoRA, and it's certainly much better than the other solutions on this front, because it doesn't impose a significant inference penalty. We get the ability to toggle and pick and choose from our toolbox.

One clarification on r: choosing r doesn't literally mean the update has rank r; it means the update can have rank at most r. One of the general truths about matrices is that a matrix's rank can't exceed its smallest dimension. So in a 5-by-100 matrix the maximum rank is five, and similarly, in a 100-by-5 matrix the maximum rank is five. We're artificially capping the maximum rank here; we're not claiming the rank needs to be as high as 100, we're betting that it's much lower, so we can express the information almost as richly using a much smaller pair of matrices.

And that's how this process goes. Now, you will notice there is some inefficiency here: during training we have to compute this extra low-rank branch as data flows through, and doing that forever would be inefficient at inference time. But at inference time we can just merge these updates into the pre-trained weights, which means there's potentially zero added inference latency, which is incredible. There's another really important fact we can exploit about this that we'll get into in a second, but for now I just want to get everybody on the same page.
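Here is a small sketch of what merging an adapter into the base weights, and swapping between per-task adapters, could look like. The function names and the adapters dictionary are hypothetical; the arithmetic, W = W0 + B @ A, is exactly the merge described above.

```python
import torch

@torch.no_grad()
def merge_lora(base_weight: torch.Tensor, B: torch.Tensor, A: torch.Tensor,
               scale: float = 1.0) -> torch.Tensor:
    """Fold a trained LoRA update into the frozen weight: W = W0 + scale * (B @ A).
    After merging, inference is a single matmul, so no extra latency is added."""
    return base_weight + scale * (B @ A)

@torch.no_grad()
def unmerge_lora(merged_weight: torch.Tensor, B: torch.Tensor, A: torch.Tensor,
                 scale: float = 1.0) -> torch.Tensor:
    """Subtract the update back out to recover W0, e.g. before switching tasks."""
    return merged_weight - scale * (B @ A)

# Task-swapping sketch, assuming a hypothetical dict of per-task (B, A) pairs
# that all share the same frozen base weight W0:
# W_task1 = merge_lora(W0, *adapters["task1"])
# W0_back = unmerge_lora(W_task1, *adapters["task1"])
# W_task2 = merge_lora(W0, *adapters["task2"])
```

Because only the small (B, A) pairs differ between tasks, swapping tasks is just an add and a subtract on the existing weights rather than loading a whole new model.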