This video explains low-rank adaptation (LoRA), a parameter-efficient fine-tuning method for large language models. It begins by discussing techniques for handling large models, including reduced precision (e.g., using half-precision floats) and quantization. LoRA, which uses rank decomposition to update only a small subset of model weights, is then detailed, highlighting its benefits in terms of reduced computational cost, memory usage, and inference speed compared to other methods like adapter layers and prefix tuning. The video concludes with a practical demonstration using the Hugging Face PEFT library.

So, what does this mean? In computer science studies, you learn how a computer internally represents floating-point numbers using only zeros and ones. This is done by reserving bits for the sign, the exponent, and the fraction. Here is an example of 7.5 and some additional digits in 32 bits. One obvious thing that has been done is to lower the precision using other data types. For example, we can switch to half precision, which has only half of the bits to represent the number and therefore requires only half of the memory. The downside is that we lose precision: as you can see in this example, we cannot represent as many digits as in the 32-bit case, and this loss in precision, in the form of rounding errors, can accumulate quickly.

Given this information, it's straightforward to relate parameter counts to actual model size in terms of gigabytes. As an example, BLOOM is a 176-billion-parameter model, which corresponds to roughly 350 gigabytes of memory for inference. This means you need several large GPUs to run this model. The impact of using other levels of precision for model training has been evaluated intensively in the literature. These methods do not simply drop half of the bits, which would lead to an information loss, but instead calculate a quantization factor that allows the level of precision to be largely maintained. Here's an example of a half-precision matrix converted to int8 using quantization (a short sketch of the idea follows below). How exactly this works is content for another video, as several different quantization techniques exist at this point.

The traditional way of transfer learning was to simply freeze all weights and add a task-specific fine-tuning head. The downside of this, however, is that we only get access to the output embeddings of the model and can't learn on internal model representations. An extension of this is adapter layers, presented in a Google Research paper from 2019, which insert new modules between the layers of a large model and then fine-tune only those. In general, this is a great approach; however, it leads to increased latency during inference, and the computational efficiency is generally lower. A very different idea, specifically designed for language models, is prefix tuning, presented by Stanford researchers. This is a very lightweight alternative to fine-tuning, which simply optimizes the input vector for language models. Essentially, this is a way of prompting by prepending specific vectors to the input of a model; the idea is to add context to steer the language model. Of course, prefix tuning only allows controlling the model to some extent, so sometimes a certain degree of parameter tuning is necessary. This finally leads us to LoRA, probably the most commonly used fine-tuning approach, which we will discuss in more detail in the following minutes. It performs a rank decomposition on the weight update matrices.
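Before getting into LoRA itself, here is the short sketch of the precision and quantization ideas mentioned above, written in PyTorch. The absmax scheme shown is only one simple quantization variant, used purely for illustration, not necessarily the scheme used by any particular library or model.

```python
import torch

# Half precision: the same number stored in 16 instead of 32 bits.
x32 = torch.tensor([7.5123456789], dtype=torch.float32)
x16 = x32.to(torch.float16)
print(x32.item(), x16.item())  # the float16 value keeps fewer significant digits

# Rough memory estimate: 176 billion parameters at 2 bytes each (half precision).
print(f"{176e9 * 2 / 1e9:.0f} GB")  # roughly 350 GB, as in the BLOOM example above

# Simple absmax int8 quantization: choose a scale so the largest absolute value
# maps to 127, round to integers, and keep the scale for dequantization.
w = torch.randn(4, 4, dtype=torch.float16)
scale = 127 / w.abs().max().float()
w_int8 = torch.round(w.float() * scale).to(torch.int8)
w_back = w_int8.float() / scale            # dequantized values
print((w.float() - w_back).abs().max())    # only a small rounding error remains
```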
So, what does low-rank adaptation actually mean? The rank of a matrix tells us how many independent row or column vectors exist in the matrix; more specifically, it's the number of linearly independent rows or, equivalently, columns. Here's an example. This number is an important property in various matrix calculations, from solving equations to analyzing data. Now, low rank simply means that the rank is smaller than the number of dimensions. In this example, we have three dimensions but a rank of two. Low-rank matrices have several practical applications because they provide a compact representation and reduce complexity. And finally, adaptation simply refers to the fine-tuning process of models.

Now, what's the motivation behind LoRA? LoRA is motivated by a paper published in 2021 by Facebook research that discusses the intrinsic dimensionality of large models. The key point is that there exists a low-dimensional reparametrization that is as effective for fine-tuning as the full parameter space. Basically, this means certain downstream tasks don't need to tune all parameters; instead, transforming a much smaller set of weights is enough to achieve good performance. Here is an example for fine-tuning BERT: they show that using a certain subset of parameters, namely around 200, it's possible to achieve ninety percent of the accuracy of full fine-tuning. Using such a threshold is how they define the intrinsic dimension, so basically the number of parameters needed to achieve a certain accuracy. Another interesting finding, evaluated on different data sets, is that the larger the model, the lower the intrinsic dimension. This means in theory that these large foundation models can be tuned on very few parameters to achieve good performance.

More formally, this is done through rank decomposition, as expressed by the equation W0 + delta W = W0 + B·A. W0 are the original model weights, which stay untouched. B and A are both low-rank matrices, and their product is exactly the change in model weights, delta W. An important note: it's not that we find a decomposition of an existing delta W into B and A; rather, we care about the other direction, and construct delta W by multiplying B and A. That also means they need to be initialized in such a way that delta W equals zero at the start of training. This is done by setting B to zero, while the weights in A are sampled from a normal distribution.

Let's have a look at an example. The shape of this weight update matrix is 4 by 4. It's constructed as the product B times A, where B and A are both low-rank matrices with rank 2, so B is 4 by 2 and A is 2 by 4. In a transformer, this is typically applied to the attention weights. In the forward pass, the input is then multiplied with both the original model weights and the rank decomposition matrices, and the two outputs are simply added together. Because of this, the implementation of LoRA is fairly easy.

But why is a scaling factor used? Looking at the details in the paper, we can see that the output of B and A is scaled with alpha divided by the rank. The rank in the denominator corresponds to the intrinsic dimension, which means to what extent we want to decompose the matrices; typical numbers range from 1 to 64 and express the amount of compression on the weights. Alpha is a scaling factor; it simply controls the amount of change that is added to the original model weights. Therefore, it balances the knowledge of the pre-trained model and the adaptation to a new task. Both the rank and alpha are hyperparameters.
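To make the forward pass and the alpha-over-rank scaling concrete, here is a minimal sketch of a LoRA-style linear layer in PyTorch. This is not the reference implementation from the paper or from any library; the class name LoRALinear and the initialization constants are made up for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen linear layer."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        # The original weights W0 stay untouched.
        for p in self.base.parameters():
            p.requires_grad = False

        in_dim, out_dim = base.in_features, base.out_features
        # B starts at zero and A is sampled from a normal distribution,
        # so delta W = B @ A is zero at the start of training.
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W0 x plus the scaled low-rank update: (alpha / rank) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Example: wrap a 512 x 512 projection with rank 2; only A and B are trainable.
layer = LoRALinear(nn.Linear(512, 512), rank=2, alpha=4)
out = layer(torch.randn(8, 512))
print(out.shape, sum(p.numel() for p in layer.parameters() if p.requires_grad))
```

One nice property that follows from this construction: after training, the product B·A can be merged into the original weights, so, unlike adapter layers, LoRA adds no extra latency at inference time.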
I found this GIF, which shows an example of scaling the ratio from 0 to 1 for an image generation model. Using zero, it produces the output of the original model, and using one, the fully fine-tuned model. In practice, if you want to fully apply LoRA, this ratio should be 1.

What is the optimal rank to choose? In the LoRA paper, different experiments have been conducted that show that a very small rank already leads to pretty good performance. Increasing the rank does not necessarily improve the performance, most likely because the data has a small intrinsic rank. But this certainly depends on the data set. A good question to ask when choosing the rank is: did the foundation model already see similar data, or is my data set substantially different? If it's different, a higher rank might be required. Different experiments run by the authors indicate that LoRA significantly outperforms other fine-tuning approaches on many tasks.

In practice, a lot of work has been done by the Hugging Face team to enable easy usage of this technique. The repository PEFT, which stands for Parameter-Efficient Fine-Tuning, provides implementations for all popular fine-tuning techniques, including LoRA. So, luckily, we don't have to manually apply a low-rank decomposition to every single layer. Instead, we can make use of the function get_peft_model, which does this job for us. In a config, we can even specify certain target modules, for example, the key, query, and value matrices of transformers. Here you also find the mentioned hyperparameters, alpha and the rank. We can then call a function that prints the total number of parameters and the trainable LoRA parameters (a minimal usage sketch follows at the end of this section). In this example, we can see that only 0.19 percent of the original model weights will be trained. So overall, this is a very convenient library and allows training huge models on a single GPU.

Going back to the beginning of this video, where I talked about quantization: it's obvious that fewer digits mean less memory, but it's not only less memory; smaller-precision models are also faster to train on most GPUs because it takes less time to read the data. Halving the precision typically gives around two times speed improvement in terms of FLOPS during training. FLOPS stands for floating point operations per second and is a common measure to compare the speed of hardware; it's the maximum number of floating point operations, like multiplications, that the hardware is capable of. Here you can see that the performance of GPUs has been increasing over time in terms of FLOPS. This means that the hardware is capable of executing faster matrix multiplications, which are needed for deep learning. Here's an example from an NVIDIA benchmark that shows that smaller precisions increase the FLOPS. Alright, now we know two ways to make huge models more manageable: lower precision and parameter-efficient fine-tuning with LoRA.
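Coming back to the PEFT usage described above, here is a minimal sketch of how applying LoRA with the library might look. The model name, target module names, and hyperparameter values are example choices rather than the ones used in the video, and the printed trainable-parameter percentage will differ depending on the model and config.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Load a small base model; any causal LM from the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

config = LoraConfig(
    r=8,                                 # rank of the decomposition
    lora_alpha=16,                       # scaling factor alpha
    target_modules=["query_key_value"],  # attention projections; names depend on the architecture
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

# Inject the low-rank matrices into the target modules and freeze everything else.
model = get_peft_model(model, config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts
```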