model already has a ton of knowledge from its pre-training on the internet, so it has probably seen a ton of conversations about Paris, about landmarks, about the kinds of things people like to see. It's that pre-training knowledge, combined with the post-training dataset, that results in this kind of imitation. So that's roughly how you can think about what's happening behind the scenes here, in this statistical sense. Okay, now I want to turn to the topic of what I like to call LLM psychology: the emergent cognitive effects of the training pipeline we have for these models. The first one I want to talk about is, of course, hallucinations. You might be familiar with model hallucinations: it's when LLMs make stuff up, when they totally fabricate information. It's a big problem with LLM assistants. It was a problem to a large extent with early models from a few years ago, and I think it has gotten quite a bit better, because there are some mitigations that I'm going to go into in a second. For now, let's just try to understand where these hallucinations come from. Here's a specific example of three conversations that you might have in your training set, and these are pretty reasonable conversations you could imagine being in there. For example: "Who is Tom Cruise?" Well, Tom Cruise is a famous American actor and producer, etc. "Who is John Barrasso?" This turns out to be a US Senator. "Who is Genghis Khan?" Well, Genghis Khan was, blah blah blah. That's what your conversations could look like at training time. Now, the problem is that when the human data labeler writes the correct answer for the assistant in each of these cases, the labeler either knows who the person is or researches them on the internet, and then writes a response that has the confident tone of an answer. What happens at test time is that when you ask about someone like "Orson Kovats", a totally random name that I just came up with (I don't think this person exists; as far as I know, I generated it randomly), the assistant will not just tell you "oh, I don't know." Even if the language model itself might, somewhere inside its features, inside its activations, inside its "brain", know that this is not a person it's familiar with, the model will not say "I don't know who this is", because it statistically imitates its training set, and in the training set, questions of the form "who is X?" are confidently answered with the correct answer. So it takes on the style of a confident answer and does its best: it gives you the statistically most likely guess, and it basically makes stuff up. Again, as we just discussed, these models don't have access to the internet and they're not doing research; they are, as I call them, statistical token tumblers, just trying to sample the next token in the sequence.
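To make "sampling the statistically most likely guess" a bit more concrete, here is a toy sketch of next-token sampling. The distribution below is invented purely for illustration; a real model produces probabilities over its whole vocabulary from a forward pass of the neural network.

```python
# Toy sketch of next-token sampling; these probabilities are made up.
import random

# Hypothetical next-token distribution after "Orson Kovats is"
next_token_probs = {
    " an": 0.30, " a": 0.25, " the": 0.10, " best": 0.05, " known": 0.05,
    # ...the rest of the vocabulary shares the remaining probability mass
}

def sample_next_token(probs):
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Each call can pick a different continuation, which is why re-running the
# same question yields different, equally confident-sounding answers.
print(sample_next_token(next_token_probs))
```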
So let's take a look at what this looks like in practice. I have here what's called the inference playground from Hugging Face, and I am on purpose picking on a model called Falcon 7B, which is an old model from a few years ago, so it suffers from hallucinations; as I mentioned, this has improved over time. Let's ask "Who is Orson Kovats?" of Falcon 7B Instruct and run it. "Orson Kovats is an American author and science fiction writer." Okay, this is totally false; it's a hallucination. Let's try again; these are statistical systems, so we can resample. This time: "Orson Kovats is a fictional character from a 1950s TV show." Total nonsense. Let's try again: "he's a former minor league baseball player." So basically the model doesn't know, and it gives us lots of different answers because it's just sampling from these probabilities. The model starts with the tokens "Who is Orson Kovats?" plus the assistant turn, then it gets probabilities for the next token, samples from them, and just comes up with stuff. The stuff it comes up with is statistically consistent with the style of the answers in its training set, but you and I experience it as made-up factual knowledge. Keep in mind that the model basically doesn't know; it's just imitating the format of an answer, and it's not going to go off and look anything up. So how can we mitigate this? For example, when I go to ChatGPT and ask "Who is Orson Kovats?", I'm now asking a state-of-the-art model from OpenAI, and this model is actually smarter about it: you saw very briefly that it said "searching the web" (we're going to cover tool use later), and it came up with some kind of story from the search. But if I ask "Who is Orson Kovats? Do not use any tools", so that it cannot do a web search, it tells me there is no well-known historical or public figure named Orson Kovats. So this model is not going to make stuff up; it knows that it doesn't know, and it tells you that this doesn't appear to be a person it knows about. So somehow we have improved hallucinations, even though they are clearly an issue in older models, and it makes total sense why you would get those kinds of answers if this is what your training set looks like. So how do we fix it? Clearly, we need some examples in our dataset where the correct answer for the assistant is that the model doesn't know about some particular fact, but we only want those answers in the cases where the model actually doesn't know. So the question is: how do we know what the model knows and doesn't know? Well, we can empirically probe the model to figure that out. Let's take a look, for example, at how Meta dealt with hallucinations for the Llama 3 series of models. In the paper they published, we can go to the section on hallucinations, which they call factuality, and they describe the procedure by which they interrogate the model to figure out what it knows and doesn't know, to find the boundary of its knowledge. Then they add examples to the training set where, for the things the model doesn't know, the correct answer is that it doesn't know them. That sounds like a very easy thing to do in principle.
And it roughly fixes the issue. The reason it works is that the model might actually have a pretty good model of its own knowledge inside the network. Remember, we looked at the network and all the neurons inside it; you might imagine there's a neuron somewhere that lights up when the model is uncertain. The problem is that the activation of that neuron is not currently wired up to the model actually saying, in words, that it doesn't know. So even though the internals of the neural network "know", in the sense that some neurons represent that uncertainty, the model will not surface it; it will instead take its best guess so that it sounds confident, just like what it sees in the training set. So we need to interrogate the model and allow it to say "I don't know" in the cases where it doesn't know. Let me take you through roughly what Meta does. Here I have an example: Dominik Hasek is the featured article on Wikipedia today, so I just went there randomly. What they do is take a random document from the training set, take a paragraph, and then use an LLM to construct questions about that paragraph. For example, I did that with ChatGPT here: I said, "Here's a paragraph from this document; generate three specific factual questions based on this paragraph, and give me the questions and the answers." LLMs are already good enough to create and reframe this information: if the information is in the context window of the LLM, this works pretty well, because it doesn't have to rely on its memory; the text is right there in the context window, and it can reframe it with fairly high accuracy. So it generates questions for us like "For which team did he play?" (here's the answer), "How many Stanley Cups did he win?", and so on. Now that we have some questions and answers, we want to interrogate the model. Roughly speaking, we take our questions and go to our model, which in Meta's case would be Llama, but let's just interrogate Mistral 7B here as an example, another model. Does this model know the answer? Let's take a look: "he played for the Buffalo Sabres", right, so the model knows. And the way you can decide this programmatically is to take the answer from the model and compare it to the correct answer, and again, models are good enough to do this automatically, so there are no humans involved here.
We take the answer from the model and use another LLM judge to check whether it is correct according to the reference answer, and if it is, that means the model probably knows. We'll do this maybe a few times: okay, it says Buffalo Sabres; let's run it again, Buffalo Sabres; one more time, Buffalo Sabres. We asked three times about this factual question and the model seems to know, so everything is great. Now let's try the second question: how many Stanley Cups did he win? Again we interrogate the model, and the correct answer is two. Here the model claims that he won four times, which is not correct; it doesn't match two, so the model doesn't know and is making stuff up. Let's try again: here the model is again making stuff up. One more time: it says he did not win one during his career. So obviously the model doesn't know, and the way we can tell programmatically is that we interrogate the model three times, or five times, whatever it is, and compare its answers to the correct answer. If they don't match, we know the model doesn't know this fact. Then we take this question and create a new conversation for the training set, where the question is "How many Stanley Cups did he win?" and the answer is "I'm sorry, I don't know" or "I don't remember." That is the correct answer for this question, because we interrogated the model and saw that this is the case. If you do this for many different types of questions across many documents, you are giving the model the opportunity, in its training set, to refuse based on its own knowledge. Even with just a few examples of this in your training set, the model gets the chance to learn the association between this kind of knowledge-based refusal and that internal neuron of uncertainty that we presume exists somewhere in its network (and empirically this turns out to be probably the case). It can learn the association: hey, when this uncertainty neuron is high, I actually don't know, and I'm allowed to say "I'm sorry, I don't think I remember this." If you have these examples in your training set, this is a large mitigation for hallucinations, and that's roughly why ChatGPT is able to respond this way as well. These are the kinds of mitigations people have implemented, and they have improved the factuality issue over time. Okay, so that's mitigation number one for the hallucination issue.
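As a recap, here is a hedged sketch of that probing-and-refusal pipeline as described above. This is not Meta's actual code; all of the function names are assumptions, and the "model" and "judge" here are toy stand-ins so the sketch runs at all; in reality both would be LLM calls.

```python
# Hedged sketch of the knowledge-probing loop; every name is an assumption.
import random

def model_knows(ask_model, judge_same, question, correct_answer, n_tries=3):
    """Ask the same question several times; an LLM judge compares each attempt
    to the known correct answer. 'Knows' here means every attempt checks out."""
    return all(judge_same(ask_model(question), correct_answer)
               for _ in range(n_tries))

def build_refusal_examples(ask_model, judge_same, qa_pairs):
    """For facts the model reliably gets wrong, emit an 'I don't know'
    conversation to add to the SFT mixture."""
    examples = []
    for question, answer in qa_pairs:
        if not model_knows(ask_model, judge_same, question, answer):
            examples.append({"user": question,
                             "assistant": "I'm sorry, I don't know."})
    return examples

# Toy stand-ins so the sketch runs; in reality these are calls to LLMs.
ask_model = lambda q: random.choice(["He played for the Buffalo Sabres.",
                                     "He won four Stanley Cups."])
judge_same = lambda attempt, ref: ref.lower() in attempt.lower()
qa = [("For which team did Dominik Hasek play?", "Buffalo Sabres"),
      ("How many Stanley Cups did he win?", "two")]
print(build_refusal_examples(ask_model, judge_same, qa))
```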
Now we can actually do much better than that. Instead of just having the model say it doesn't know, we can introduce mitigation number two: give the LLM an opportunity to be factual and actually answer the question. What do you and I do if someone asks us a factual question and we don't know the answer? We go off and do some search, use the internet, figure out the answer, and then report it. We can do the exact same thing with these models. Think of the knowledge inside the neural network, inside its billions of parameters, as a vague recollection of the things the model saw during the pre-training stage, a long time ago; think of the knowledge in the parameters as something you read a month ago. If you keep reading something, you will remember it, and the model remembers things it has seen often; but if something is rare, you probably don't have a good recollection of it. What you and I do then is just go and look it up. When you look something up, you're basically refreshing your working memory with information, and then you're able to retrieve it and talk about it. So we need some equivalent that allows the model to refresh its memory, its recollection, and we can do that by introducing tools for the model. The way we approach this is that instead of just saying "I'm sorry, I don't know", the model can attempt to use a tool. We create a mechanism by which the language model can emit special tokens, new tokens that we introduce, together with a format or protocol for how the model is allowed to use them. For example, here I've introduced two tokens: instead of answering directly when it doesn't know, the model has the option of emitting the special token SEARCH_START, then the query that will go to, say, Bing in the case of OpenAI, or Google Search, or something like that, and then the token SEARCH_END. Then the program that is sampling from the model, the program running the inference, does the following: when it sees the special token SEARCH_END, instead of sampling the next token in the sequence, it pauses generation, opens a session with bing.com, pastes the search query in, takes all the text that comes back, maybe wraps it in some other special tokens, and copy-pastes that text into the context window (that's what I tried to show with the brackets). So the text from the web search is now inside the context window and will feed into the neural network. You should think of the context window as the working memory of the model: data in the context window is directly accessible; it feeds straight into the neural network, so it's no longer a vague recollection. When the model samples new tokens afterwards, it can very easily reference the data that has been copy-pasted in there. That's roughly how these tools function: you introduce new tokens and a schema by which the model can use those tokens to call special functions like web search. And how do you teach the model to use these tools correctly, with SEARCH_START, SEARCH_END and so on? Again, through the training set: we need a bunch of conversations that show the model, by example, in what settings to use a search, how to start one and end one, and what that looks like.
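Here is a minimal, hypothetical sketch of the inference-side loop that handles these special tokens. The token names, the toy search function, and the formatting are all assumptions for illustration, not any particular vendor's actual protocol.

```python
# Hypothetical sketch of an inference loop that handles search tokens.
SEARCH_START, SEARCH_END, END = "<SEARCH_START>", "<SEARCH_END>", "<END>"

def toy_web_search(query):
    # Stand-in for a real call to Bing/Google; returns retrieved page text.
    return f"[retrieved web text for query: {query!r}]"

def generate_with_tools(sample_next_token, context):
    """Keep sampling tokens; when the model closes a search query, pause,
    run the search, and paste the results back into the context window."""
    while True:
        token = sample_next_token(context)
        context.append(token)
        if token == SEARCH_END:
            start = len(context) - 1 - context[::-1].index(SEARCH_START)
            query = " ".join(context[start + 1:-1])
            # The retrieved text enters the context window (working memory),
            # so tokens sampled after this point can reference it directly.
            context.append(toy_web_search(query))
        if token == END:
            return context

# Toy driver: a scripted "model" that emits a fixed stream of tokens.
scripted = iter(["Searching.", SEARCH_START, "who", "is", "Orson", "Kovats",
                 SEARCH_END, "Based", "on", "these", "results", "...", END])
print(generate_with_tools(lambda ctx: next(scripted), []))
```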
If you have a few thousand examples of tool use like that in your training set, the model will actually do a pretty good job of understanding how this tool works, and it will know how to structure its queries. Of course, because of the pre-training dataset and its understanding of the world, it already kind of understands what a web search is, so it has a pretty good native sense of what makes a good search query. You just need a few examples to show it how to use the new tool, and then it can lean on it to retrieve information and put it in the context window. That's the equivalent of you and I looking something up: once it's in the context, it's in working memory and is very easy to access and manipulate. That's what we saw a few minutes ago when I searched on ChatGPT for "Who is Orson Kovats": the ChatGPT language model decided that this is some kind of rare individual, and instead of answering from its memory, it sampled a special token to do a web search. We briefly saw something flash, like "using the web tool"; we waited a couple of seconds, and then it generated this answer, and you can see it's creating references and citing sources. What happened is that it went off, did a web search, found these sources and URLs, and the text of those web pages was stuffed in between; it's not shown here, but it's basically pasted in as text. Now it sees that text and references it: okay, it could be these people (citation), or those people (citation), and so on. That's what happened here, and that's also why, when I asked "Who is Orson Kovats?", I could add "don't use any tools", and that's enough to convince ChatGPT to not use tools and rely only on its memory and recollection. I also went and asked ChatGPT: how many Stanley Cups did Dominik Hasek win? ChatGPT decided that it knows the answer and has the confidence to say that he won twice, so it relied on its memory, presumably because it has enough confidence in its weights, its parameters and activations, that this is retrievable from memory alone. But you can also, conversely, ask it to use web search to make sure; for the same query, it then goes off, searches, finds a bunch of sources, all of that gets copy-pasted into the context, and then it answers again with citations, and it actually cites the Wikipedia article, which is the source of this information for us as well. So that's tools: web search, where the model determines when to search, and this is an additional mitigation for hallucinations and factuality. I want to stress this very important psychological point one more time: knowledge in the parameters of the neural network is a vague recollection; knowledge in the tokens that make up the context window is the working memory.
And it works, roughly speaking, kind of like it works for us in our brains: the stuff we remember is our parameters, and the stuff we just experienced a few seconds or minutes ago you can imagine as being in our context window, a context window that is built up as you have a conscious experience of the world around you. This has a bunch of implications for your use of LLMs in practice. For example, I can go to ChatGPT and say something like: "Can you summarize chapter 1 of Jane Austen's Pride and Prejudice?" This is a perfectly fine prompt, and ChatGPT actually does something relatively reasonable here. The reason it does is that ChatGPT has a pretty good recollection of a famous work like Pride and Prejudice: it has probably seen a ton of material about it, forums about the book, versions of the book itself, and it kind of remembers, the same way that if you had read the book or articles about it, you'd have enough of a recollection to say all this. But usually, when I interact with LLMs and want them to recall specific things, it works better if you just give the material to them. So I think a much better prompt would be something like: "Can you summarize for me chapter 1 of Jane Austen's Pride and Prejudice? I am attaching it below for your reference," then a delimiter, and then I paste the chapter in; I just copy-pasted chapter 1 from some website I found. I do that because when the text is in the context window, the model has direct access to it: it doesn't have to recall it, it just reads it. So this summary can be expected to be of significantly higher quality than the one from memory, simply because the text is directly available to the model. I think you and I would work the same way: you would produce a much better summary if you had just reread the chapter before summarizing it. That's basically what's happening here, or the equivalent of it. The next psychological quirk I'd like to talk about briefly is the knowledge of self. What I see very often on the internet is people asking LLMs things like "What model are you?" and "Who built you?" This question is a little bit nonsensical, and the reason I say that is, as I tried to explain with some of the under-the-hood fundamentals, this thing is not a person. It doesn't have a persistent existence in any way: it boots up, processes tokens, and shuts off, and it does that for every single conversation; it just builds up a context window of conversation and then everything gets deleted. So this entity is restarted from scratch every single conversation, if that makes sense. It has no persistent self, no sense of self; it's a token tumbler that follows the statistical regularities of its training set. So it doesn't really make sense to ask it who it is or what built it, and by default, if you just ask out of nowhere, you're going to get some pretty random answers. For example, let's pick on Falcon, which is a fairly old model, and see what it tells us.
So first it evades the question, saying it was built by "talented engineers and developers"; here it says "I was built by OpenAI, based on the GPT-3 model." It's totally making that up. Now, a lot of people would take the fact that it says it was built by OpenAI as evidence that this model was somehow trained on OpenAI data, or something like that. I don't think that's necessarily true. The reason is that if you don't explicitly program the model to answer these kinds of questions, what you get is its statistical best guess at an answer. This model had an SFT data mixture of conversations, and during that fine-tuning, the model comes to understand that it is taking on the personality of a helpful assistant, but it wasn't told exactly what label to apply to itself; it just takes on this persona of a helpful assistant. And remember that the pre-training stage took documents from the entire internet, where ChatGPT and OpenAI are extremely prominent. So I think what's actually likely happening is that this is just its hallucinated label for what it is: its self-identity is "ChatGPT by OpenAI", and it says that only because there is a ton of data on the internet of answers like this that actually do come from ChatGPT. That becomes its label for itself. Now, as a developer you can override this, and there are a few ways to do it. For example, there is the OLMo model from the Allen Institute for AI. It's not a top-tier LLM or anything like that, but I like it because it is fully open source: the paper for OLMo and everything else is completely open, which is nice. Here we are looking at its SFT mixture, the data mixture for the fine-tuning, the conversations data. The way they solve this for the OLMo model is that among the roughly one million conversations in the mixture there is a hardcoded subset: 240 conversations. If you look at these 240 conversations, they are hardcoded things like: "Tell me about yourself," says the user, and the assistant answers "I'm OLMo, an open language model developed by Ai2, the Allen Institute for AI... I'm here to help," blah blah blah. "What is your name?" "OLMo, from the OLMo project." So these are cooked-up, hardcoded questions about OLMo 2 and the correct answers to give in those cases. If you take 240 conversations like this, put them into your training set, and fine-tune on them, the model will parrot this stuff later; if you don't give it this, it will probably say it's ChatGPT by OpenAI. There is one more way to sometimes do this: in these conversations between human and assistant, there is sometimes a special message, called the system message, at the very beginning of the conversation. So it's not just human and assistant; there's also a system turn, and in the system message you can hardcode and remind the model: hey, you are a model developed by OpenAI, your name is ChatGPT-4o, you were trained on this date, and your knowledge cutoff is such and such.
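Here is a hedged sketch of what both mechanisms could look like as data. The wording is made up, loosely modeled on the hardcoded OLMo examples and a generic system message; none of this is the exact text used by Ai2 or OpenAI.

```python
# Mechanism 1: hardcoded identity conversations mixed into the SFT data
# (the OLMo 2 mixture reportedly has ~240 of these; the wording here is invented).
identity_sft_examples = [
    {"user": "Tell me about yourself.",
     "assistant": "I'm OLMo, an open language model developed by Ai2, "
                  "the Allen Institute for AI. I'm here to help."},
    {"user": "What is your name?",
     "assistant": "My name is OLMo."},
]

# Mechanism 2: a hidden system message prepended to every conversation.
system_message = ("You are ChatGPT-4o, a model trained by OpenAI. "
                  "Knowledge cutoff: <date>. Today's date: <date>.")
conversation = [
    {"role": "system", "content": system_message},   # invisible to the user
    {"role": "user", "content": "What model are you?"},
]
```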
This system message basically documents the model a little bit, and it gets inserted into your conversations: when you go on ChatGPT you see a blank page, but the system message is hidden in there, and its tokens are in the context window. So those are the two ways to program models to talk about themselves: either through data like this, or through a system message and similar invisible tokens in the context window that remind the model of its identity. But it's all just cooked up and bolted on; it's not really deeply there in any real sense, the way it would be for a human. I now want to continue to the next section, which deals with the computational capabilities, or I should say the native computational capabilities, of these models in problem-solving scenarios. In particular, we have to be very careful with these models when we construct our example conversations, and there are a lot of sharp edges here that are, let's say, elucidative (is that a word?); they're interesting to look at when we consider how these models think. So consider the following prompt from a human, and suppose that we are building a conversation to enter into our training set of conversations, so we're going to train the model on this; we're teaching it how to solve simple math problems. The prompt is: "Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost is $13. What is the cost of each apple?" A very simple math question. Now, there are two candidate answers, on the left and on the right. They are both correct: they both say the answer is $3. But one of the two is a significantly better answer for the assistant than the other. If I were a data labeler creating one of these, one of them would be a really terrible answer for the assistant and the other would be okay. I'd like you to potentially pause the video and think through why one of the two is significantly better. If you use the wrong one, your model could actually end up really bad at math, with bad outcomes, and this is something you would be warned about in your labeling documentation when training people to create the ideal responses for the assistant. Okay, so the key to this question is to realize and remember that when the models are training, and also at inference, they work on a one-dimensional sequence of tokens from left to right. This is the picture I often have in my mind: I imagine the token sequence evolving from left to right, and to produce the next token in the sequence, we feed all of the previous tokens into the neural network, which then gives the probabilities for the next token in the sequence. This picture is the exact same one we saw before, up here; it comes from the web demo I showed earlier. It's the calculation that takes the input tokens at the top, performs the operations of all these neurons, and gives you the probabilities for what comes next. Now, the important thing to realize is that, roughly speaking, there is a finite number of layers of computation that happens here.
For example, this little model here has only one, two, three layers of what's called attention, and MLPs; a typical modern state-of-the-art network would have more like a hundred layers or so, but still, there are only on the order of a hundred layers of computation to go from the previous token sequence to the probabilities for the next token. So there is a finite amount of computation that happens for every single token, and you should think of it as a fairly small amount, roughly fixed for every token in the sequence. That's not exactly true, because the more tokens you feed in, the more expensive the forward pass of the network becomes, but not by much; as a mental model, it's good to think of a fixed amount of compute happening in this box for every one of these tokens. And this amount of compute can't possibly be very large, because there aren't that many layers going from top to bottom; there's just not that much computation happening per token. So you can't expect the model to do arbitrary computation in a single forward pass to produce a single token. What that means is that we have to distribute our reasoning and our computation across many tokens, because every single token only gets a finite amount of computation spent on it. We can't expect too much computation out of the model in any single individual token. That's why this answer here is significantly worse. To see why, imagine going from left to right (I've copy-pasted it right here: "The answer is $3...", etc.), with the model having to emit these tokens one at a time. It is expected to say "The answer is", space, dollar sign, and then right there we're expecting it to cram all of the computation of this problem into a single token: it has to emit the correct answer, 3. Once it has emitted the 3, we expect it to say all the remaining tokens, but at that point the answer has already been produced and is already in the context window, so everything that follows is just post-hoc justification of why this is the answer; the answer has already been created, it's already in the context window, it's not actually being calculated there. So if you answer the question directly and immediately, you are training the model to try to guess the answer in a single token, and that's just not going to work because of the finite amount of computation that happens per token. That's why the answer on the right is significantly better: we are distributing the computation across the answer; we're getting the model to slowly come to the answer from left to right, producing intermediate results. We say, okay, the total cost of the oranges is $4, so 13 minus 4 is 9, and so on: we create intermediate calculations, and each one of them is by itself not that expensive.
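To make that concrete, here is a hedged reconstruction of the two labels (the exact slide wording isn't in the transcript). The weaker label leads with the result: "The answer is $3. This is because the 2 oranges cost $4, so the 3 apples cost $9, and 9 / 3 = 3." The stronger label works left to right and only states the result at the end: "The 2 oranges cost 2 × $2 = $4. So the 3 apples cost $13 − $4 = $9, which is $9 / 3 = $3 per apple. The answer is $3." The arithmetic is identical; what differs is where in the token sequence the final answer has to be produced.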
So we're basically matching, a little bit, the difficulty the model is capable of handling in any single one of these individual tokens: there can never be too much computational work in any one token, because then the model won't be able to do it at test time. We're teaching the model to spread out its reasoning and its computation over the tokens, so that it only has to solve very simple problems in each token, and those can add up; by the time it's near the end, it has all the previous intermediate results in its working memory, and it's much easier for it to determine the answer, and here it is: 3. So this is a significantly better label for our computation; the other one is really bad, because it teaches the model to try to do all the computation in a single token. That's an interesting thing to keep in mind for your prompts, although you usually don't have to think about it explicitly, because the labelers at OpenAI and elsewhere worry about this and make sure the answers are spread out, so ChatGPT will usually do the right thing. When I ask this question of ChatGPT, it actually goes quite slowly: it says, okay, let's define our variables, set up the equation, and it creates all these intermediate results. Those are not for you; they're for the model. If the model doesn't create these intermediate results for itself, it's not going to be able to reach the 3. I also wanted to show you that it's possible to be a bit mean to the model: we can just demand things. As an example, I gave it the exact same prompt and said, "Answer the question in a single token; just immediately give me the answer, nothing else." It turns out that for this simple prompt it was actually able to do it in a single go; well, I think this is two tokens, because the dollar sign is its own token, so it didn't quite give me a single token, it gave me two, but it still produced the correct answer in a single forward pass of the network. That's because the numbers here are very simple. So I made it a bit more difficult, to be a bit mean to the model: I said Emily buys 23 apples and 177 oranges, and made the numbers bigger, making the task harder by asking for more computation in a single token. I asked the same thing, and this time it gave me 5, which is not correct: the model failed to do all of that calculation in a single forward pass of the network; in a single go through the network it couldn't produce the result. Then I said, okay, now don't worry about the token limit and just solve the problem as usual, and it goes through all the intermediate results, it simplifies, and every one of these intermediate calculations is much easier for the model; it's not too much work per token. All of the tokens here are correct, and it arrives at the solution, which is 7. It just couldn't squeeze all of that work into a single forward pass of the network.
So I think that's kind of a cute example, and something to think about; again, it's elucidative in terms of how these models work. The last thing I would say on this topic is that if I were actually trying to solve this in my day-to-day life, I might not trust that the model gets all the intermediate calculations right. What I would probably do is something like this: I would come here and say "use code", because code is one of the possible tools ChatGPT can use. Instead of it having to do mental arithmetic, which I don't fully trust (especially as the numbers get really big, there's no guarantee the model will do it correctly; any one of those intermediate steps might in principle fail), we can lean on a tool. We're using neural networks to do mental arithmetic, kind of like you doing mental arithmetic in your head; it might just screw up some intermediate result. It's actually kind of amazing that it can do this kind of mental arithmetic at all (I don't think I could do it in my head), but I don't trust it, so I want it to use tools. So you can say something like "use code" (I'm not sure what happened there; "use code" again). Like I mentioned, there's a special tool: the model can write code, and I can inspect that the code is correct; then it's not relying on mental arithmetic but on the Python interpreter to calculate the result. I would personally trust this a lot more, because the answer came out of a Python program, which has much stronger correctness guarantees than the mental arithmetic of a language model. So that's another hint: if you have these kinds of problems, you may want to ask the model to use the code interpreter. Just like we saw with web search, the model has special tokens for calling this tool: instead of generating the answer tokens directly from the language model, it writes a program, that program gets sent to a different part of the system that actually runs it, the result comes back, and the model gets access to it and can tell you, okay, the cost of each apple is $7. So that's another tool, and I would use it in practice myself; it's just less error-prone. That's why I called this section "models need tokens to think": distribute your computation across many tokens, ask models to create intermediate results, or, whenever you can, lean on tools and tool use instead of letting the models do everything in their memory. If they try to do it all in their "head", I don't fully trust it, and I prefer to use tools whenever possible.
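For illustration, here is a hypothetical sketch of the kind of program the model might write when told to "use code" for the original, simple version of this problem. The variable names are made up; the point is that the arithmetic runs in the interpreter tool rather than in the model's head.

```python
# Hypothetical code-interpreter solution for: 3 apples, 2 oranges,
# each orange costs $2, total is $13 -> what does each apple cost?
num_apples, num_oranges = 3, 2
orange_price, total_cost = 2, 13

apple_price = (total_cost - num_oranges * orange_price) / num_apples
print(apple_price)  # 3.0 -> each apple costs $3
```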
I want to show you one more example of where this comes up, and that's counting. Models are actually not very good at counting, for the exact same reason: you're asking for way too much in a single individual token. Let me show you a simple example. I asked "How many dots are below?" and then pasted in a bunch of dots, and ChatGPT says "There are..." and tries to solve the problem in a single token. In a single token, it has to count the number of dots in its context window, and it has to do that in a single forward pass of the network, and as we talked about, there's not that much computation that can happen in a single forward pass; think of it as very little. If we look at what the model actually sees, and go to the tokenizer (TikTokenizer), it sees "How many dots are below", and then it turns out that the dots get chunked: a group of, I think, 20 dots is a single token, then another group is another token, and for some reason they break up unevenly; this has to do with the details of the tokenizer. So the model basically sees a few token IDs, and from those token IDs it's expected to count the dots, and spoiler alert: the answer it gave, 161, is not right; I believe it's actually 177. Here's what we can do instead: we can say "use code." You might ask why this should work, and it's actually kind of subtle and interesting. When I say "use code", I actually expect it to work; let's see: okay, 177, correct. What happens here is that, even though it doesn't look like it, I've broken the problem down into pieces that are easier for the model. I know the model can't do mental counting, but I also know that it's pretty good at copy-pasting. When I say "use code", it creates a string in Python containing the dots, and the task of copy-pasting my input into that string is very simple for the model, because it sees the input as just those few tokens, and it simply unpacks them into dots in the string. Then it calls the Python routine .count, and the Python interpreter does the counting; it's not the model's mental arithmetic, and it comes up with the correct answer. So this is again a simple example of "models need tokens to think": don't rely on their mental arithmetic, and that's also why the models are not very good at counting. If you need them to do counting tasks, always ask them to lean on the tool.
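Here is a hypothetical sketch of what the model's "use code" output for the dot-counting task could look like; the string below is a short stand-in rather than the actual 177 dots from the demo.

```python
# The model only has to copy the user's dots into a string (easy for it:
# basically copy-pasting a few token IDs); the interpreter does the counting.
dots = "...................."  # stand-in for the dots pasted from the prompt
print(dots.count("."))         # the Python interpreter, not the model, counts
```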
Now, the models also have many other little cognitive deficits here and there; these are sharp edges of the technology to be aware of. As an example, the models are not very good at all kinds of spelling-related tasks, and I told you we would loop back around to tokenization: the reason is that the models don't see characters, they see tokens. Their entire world is tokens, these little text chunks, so they don't see characters the way our eyes do, and very simple character-level tasks often fail. For example, I gave it the string "ubiquitous" and asked it to print only every third character, starting with the first one: so we start with "u", and then every third character after that. What it printed is not correct. My hypothesis is that, number one, the mental arithmetic of stepping through positions is failing a little bit, but, number two and more importantly, if you go to TikTokenizer and look at "ubiquitous", you'll see that it is three tokens. You and I see "ubiquitous" and can easily access the individual letters, because we see them; when the word is in the working memory of our visual field, we can easily index into every third letter, and I can do that task. But the model doesn't have access to the individual letters; it sees these three tokens, and remember, these models are trained from scratch on the internet, so the model has to discover, from data, how many of which letters are packed into each of these tokens. The reason we even use tokens is mostly efficiency; a lot of people are interested in getting rid of tokens entirely, and having true character-level or byte-level models, but that would create very long sequences, and people don't currently know how to handle that well. So as long as we live in the token world, any kind of spelling task is not actually expected to work super well. Because I know that spelling is not a strong suit due to tokenization, I can again ask the model to lean on tools: I can just say "use code". I expect this to work, because the task of copy-pasting "ubiquitous" into the Python interpreter is easy, and then we're leaning on Python to manipulate the characters of the string. And indeed, when I say "use code", it indexes into every third character and prints "u q t s", which looks correct to me. So, again, an example of a spelling-related task not working well without tools. A very famous recent example of this is "How many r's are there in 'strawberry'?", which went viral many times. The models now get it correct (they say there are three r's in "strawberry"), but for a very long time all the state-of-the-art models insisted there are only two, and this caused a lot of ruckus: why are the models so brilliant, able to solve math olympiad questions, but unable to count the r's in "strawberry"? The answer, which I've built up to slowly, is that, number one, the models don't see characters, they see tokens, and number two, they are not very good at counting. So here we're combining the difficulty of seeing characters with the difficulty of counting, and that's why the models struggled with this. (By now, honestly, I think OpenAI may have hardcoded the answer, or I'm not sure what they did, but this specific query now works.) So models are not very good at spelling, and there's a bunch of other little sharp edges; I don't want to go through all of them, I just want to show a few examples of things to be aware of when you use these models in practice. I don't intend a comprehensive analysis of all the ways the models fall short; the point is just that there are some jagged edges here and there. We've discussed a few of them, and a few of them make sense, but some of them will not make as much sense; you're left scratching your head even if you understand in depth how these models work. A good recent example of that is the following: the models are not very good at very simple questions like this one, and this is shocking to a lot of people, because these models can solve complex math problems.
They can answer PhD-grade physics, chemistry, and biology questions much better than I can, but sometimes they fall short on super simple problems like this. So here we go: it says that 9.11 is bigger than 9.9 and justifies it in some way, which is obviously wrong, and then at the end it actually flips its decision. This isn't very reproducible: sometimes it flips its answer around, sometimes it gets it right, sometimes it gets it wrong. Let's try again: "even though it might look larger..." okay, so here it doesn't even correct itself in the end. If you ask many times, sometimes it gets it right. But how is it that the model can do so well on olympiad-grade problems and then fail on something this simple? As I mentioned, this one is a bit of a head-scratcher. It turns out that a bunch of people have studied this in depth; I haven't actually read the paper, but what I was told by the team is that when you scrutinize the activations inside the neural network, when you look at which features and neurons turn on and off, a bunch of neurons light up that are usually associated with Bible verses. The model seems to be reminded that these look like Bible verse markers, and in a Bible-verse setting, 9.11 would come after 9.9. So basically the model finds it cognitively distracting that, as Bible verses, 9.11 would be "greater"; even though it tries to justify it and work it out mathematically, it still ends up with the wrong answer. It doesn't fully make sense, and it's not fully understood, and there are a few jagged issues like that. So treat this technology as what it is: a stochastic system that is really magical, but that you can't fully trust. You want to use it as a tool, not as something you let rip on a problem and then just copy-paste the results from. Okay, so we have now covered two major stages of the training of large language models. We saw that the first stage, called the pre-training stage, is basically training on internet documents. When you train a language model on internet documents, you get what's called a base model, which is basically an internet document simulator. We saw that this is an interesting artifact; it takes many months to train on thousands of computers, and it's a kind of lossy compression of the internet. It's extremely interesting, but it's not directly useful, because we don't want to sample internet documents; we want to ask questions of an AI and have it respond. For that we need an assistant, and we saw that we can construct an assistant in the process of post-training, specifically in what we call supervised fine-tuning. In this stage, everything is algorithmically identical to pre-training; nothing changes except the dataset. Instead of internet documents, we now create and curate a very nice dataset of conversations: millions of conversations on all kinds of diverse topics between a human and an assistant. Fundamentally, these conversations are created by humans: humans write the prompts, humans write the ideal responses, and they do that based on labeling documentation.
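Since the point that "only the dataset changes" is easy to miss, here is a tiny, hypothetical illustration of it (PyTorch assumed). The model below is a toy single-token predictor and the data streams are random stand-ins; the only point is that pre-training and supervised fine-tuning share the same next-token training loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 256, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_on(token_stream, steps):
    """The one shared loop: predict token t+1 from token t, whatever the data."""
    for _ in range(steps):
        x, y = token_stream[:-1], token_stream[1:]
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()

internet_documents = torch.randint(0, vocab_size, (1000,))     # stand-in stream
curated_conversations = torch.randint(0, vocab_size, (1000,))  # stand-in stream

train_on(internet_documents, steps=100)      # pre-training -> "base model"
train_on(curated_conversations, steps=100)   # supervised fine-tuning -> "assistant"
```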
Now, in the modern stack, this isn't actually done fully manually by humans anymore; they have a lot of help from these tools, and we can use language models to help us create these conversations.