This video details reproducing GPT-2 (124M) in PyTorch, comparing it to OpenAI's model. It covers model architecture, weight loading, positional embeddings, optimizations (AdamW, gradient clipping, learning rate scheduling), and distributed training using multiple GPUs. Performance is benchmarked against GPT-2 and GPT-3, highlighting speed improvements through techniques like TensorFloat-32, `torch.compile`, and FlashAttention. The author trains a model on the FineWeb dataset, achieving comparable results to GPT-3 (124M) with significantly fewer training tokens.

At 8 minutes and 4 seconds, the speaker explains: positional embeddings are learned parameters (weights, floats) within the Transformer model. When visualized, these embeddings show a distinct structure. Each row in the visualization represents a fixed, absolute position within the sequence (0 up to the block size of 1,024). The structure arises because the embeddings learn sinusoid- and cosine-like patterns to represent these positions. These embeddings help the Transformer understand the relative positions of tokens and attend to them based on their position, not just their content. (A visualization sketch follows below.)

Note that the normalizations are inside the residual stream. You see how the feed-forward is applied and this arrow goes through and through the normalization. So that means that your residual pathway has normalizations inside it, and this is not very desirable. You actually prefer to have a single, clean residual stream all the way from the supervision down to the inputs, the tokens. And this is very desirable and nice, because of the gradients that flow from the top: if you remember from your micrograd, addition just distributes gradients during the backward pass to both of its branches equally. So addition is a branching point for the gradients. And that means that the gradients from the top flow straight to the inputs, the tokens, through the residual pathway unchanged. But then in addition to that, the gradient also flows through the blocks, and the blocks contribute their own contribution over time, kicking in and changing the optimization over time. But basically a clean residual pathway is desirable from an optimization perspective.

And then this is the pre-normalization version, where you see that x first goes through the layer normalization and then the attention, and then goes back out to go to layer normalization number two and the multi-layer perceptron, sometimes also referred to as a feed-forward network or an FFN, and then that goes into the residual stream again. And one more thing that is kind of interesting to note: recall that attention is a communication operation. It is where all the tokens, and there are 1,024 tokens lined up in a sequence, communicate. This is where they exchange information. So attention is an aggregation function. It's a pooling function, it's a weighted-sum function, it is a reduce operation. Whereas the MLP here happens at every single token individually; there's no information being collected or exchanged between the tokens. So the attention is the reduce and the MLP is the map, and what you end up with is that the Transformer just ends up being a repeated application of map-reduce, if you want to think about it that way. So this is where they communicate and this is where they think individually about the information that they gathered, and every one of these blocks iteratively refines the representation in the residual stream.
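As a rough illustration of the positional-embedding visualization mentioned above (not code from the video itself), one could load OpenAI's released GPT-2 (124M) weights through the Hugging Face `transformers` package and plot the learned position-embedding matrix; the specific column indices chosen below are arbitrary:

```python
import matplotlib.pyplot as plt
from transformers import GPT2LMHeadModel

# Load the pretrained 124M checkpoint; the learned positional embeddings
# live in transformer.wpe with shape (1024, 768): one row per absolute position.
model = GPT2LMHeadModel.from_pretrained("gpt2")
wpe = model.transformer.wpe.weight.detach().numpy()

# Full matrix: each row is a position 0..1023, each column an embedding channel.
plt.imshow(wpe, cmap="viridis", aspect="auto")
plt.xlabel("embedding dimension")
plt.ylabel("position")
plt.show()

# Individual columns trace out noisy sinusoid-like curves over position,
# which is the learned structure the transcript refers to.
for col in (150, 200, 250):
    plt.plot(wpe[:, col], label=f"channel {col}")
plt.xlabel("position")
plt.legend()
plt.show()
```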
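A minimal sketch of the pre-normalization block described above, in the spirit of the code developed in the video; it assumes `CausalSelfAttention` and `MLP` modules defined elsewhere and a `config` object with an `n_embd` field:

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm Transformer block: LayerNorm sits on the branch, so the
    residual stream itself stays a clean additive path from tokens to loss."""

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)  # communication across tokens ("reduce")
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)                   # per-token computation ("map")

    def forward(self, x):
        # Residual additions: gradients flow straight through the `+` unchanged,
        # while each branch adds its own contribution.
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```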
So this is our block, slightly modified from this picture. Okay, so let's now move on to the MLP. So the MLP block is implemented as follows. It is relatively straightforward: we basically have two linear projections here that are sandwiched in between the GELU nonlinearity, `nn.GELU(approximate='tanh')`. Now when we swing over to the PyTorch documentation, this is nn.GELU, and it has this format, and it has two versions: the original version of GELU, which we'll step into in a bit, and the approximate version of GELU, which we can request using tanh. So as you can see, just as a preview here, GELU is basically like a ReLU, except there's no exactly flat tail at zero.
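Here is a minimal sketch of that MLP block, assuming the same `config` object with an `n_embd` field; the 4x expansion and the layer names `c_fc`/`c_proj` follow the GPT-2 naming convention:

```python
import torch.nn as nn

class MLP(nn.Module):
    """Two linear projections sandwiching a GELU nonlinearity
    (tanh approximation, matching the original GPT-2)."""

    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4 * config.n_embd)  # expand 4x
        self.gelu   = nn.GELU(approximate='tanh')                   # approximate GELU
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # project back down

    def forward(self, x):
        # Applied to every token position independently: no cross-token exchange.
        return self.c_proj(self.gelu(self.c_fc(x)))
```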