Linear Discriminant Analysis (LDA) is a dimensionality reduction technique similar to Principal Component Analysis (PCA), but instead of focusing on overall data variation, LDA maximizes the separability between known categories. It achieves this by creating new axes and projecting the data onto them, maximizing the distance between category means while minimizing within-group scatter. LDA is particularly useful when visualizing high-dimensional data with multiple categories, and in such scenarios it often gives clearer separation than PCA. We'll start with the main ideas, and then we'll talk about the details of how it works.

Imagine that we have a cancer drug, and that cancer drug works great for some people, but for other people it just makes them feel worse. We want to figure out who to give the drug to: we want to give it to people it's going to help, but we don't want to give it to people it might harm. And since I'm a geneticist and I work in a genetics department, the way I answer all my questions is to look at gene expression. Maybe gene expression can help us decide.

Here's an example using one gene to decide who gets the drug and who doesn't. We've got a number line: on the left side we've got fewer transcripts, and on the right side we've got more transcripts. The dots represent individual people. The green dots are people the drug works for, and the red dots represent people the drug just makes feel worse. We can see that, for the most part, the drug works for people with low transcription of gene X, and for the most part the drug does not work for people with high transcription of gene X. In the middle, we see that there's overlap and that there's no obvious cutoff for who to give the drug to. In summary, gene X does an okay job of telling us who should take the drug and who shouldn't. Can we do better? What if we used more than one gene to make the decision?

Here's an example of using two genes to decide who gets the drug and who doesn't. On the x-axis we have gene X, and on the y-axis we have gene Y. Now that we have two genes, we can draw a line that separates the two categories: the green, where the drug works, and the red, where the drug doesn't work. And we can see that using two genes does a better job of separating the two categories than just using one gene. However, it's not perfect. Would using three genes be even better?

Here I've got an example where we're trying to use three genes to decide who gets the drug and who doesn't. Gene Z is on the z-axis, which represents depth. So imagine a line going through your computer screen and into the wall behind it. The big circles, the big samples, are the ones that are closer to you, and the smaller circles, the smaller samples, are the ones that are further away along the z-axis. When we have three dimensions, we use a plane to try to separate the two categories. Now, I'll be honest: I drew this picture, but even for me it's hard to tell if this plane separates the two categories correctly. It's hard for us to visualize three dimensions on a flat computer screen; we need to be able to rotate the figure and look at it from different angles to really know, and that's tedious. What if we need four or more genes to separate two categories? The first problem is that we can't draw a four-dimensional graph, or a 10,000-dimensional graph. We just can't draw it, and that's a bummer.
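To make the two-gene idea concrete, here's a minimal sketch, not the data in the figures, using made-up "gene X" and "gene Y" values and scikit-learn to find a straight-line boundary between the two groups. (Once we go beyond three genes, there's no picture like this to draw, which is exactly the problem discussed next.)

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic transcript counts for "gene X" and "gene Y" in two groups
works = rng.normal(loc=[3.0, 4.0], scale=1.0, size=(50, 2))  # drug works (green)
fails = rng.normal(loc=[7.0, 8.0], scale=1.0, size=(50, 2))  # drug doesn't work (red)

X = np.vstack([works, fails])
y = np.array([0] * 50 + [1] * 50)

# With two genes, a straight line can separate the groups; LDA finds one such line
clf = LinearDiscriminantAnalysis().fit(X, y)
print("training accuracy:", clf.score(X, y))

# The boundary is w . x + b = 0; coef_ and intercept_ describe that line
print("line coefficients:", clf.coef_, "intercept:", clf.intercept_)
```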
We ran into the same problem when we talked about principal component analysis, or PCA, and if you don't know about principal component analysis, be sure to check out the StatQuest on that subject. It's got a lot of likes, and it's helped a lot of people understand how it works and what it does. PCA, if you remember, reduces dimensions by focusing on the genes with the most variation. This is incredibly useful when you're plotting data with a lot of dimensions, or a lot of genes, onto a simple XY plot. However, in this case we're not super interested in the genes with the most variation; instead, we're interested in maximizing the separability between the two groups so that we can make the best decisions.

Linear discriminant analysis, LDA, is like PCA: it reduces dimensions. However, it focuses on maximizing the separability among the categories. Let's repeat that to emphasize the point: linear discriminant analysis, LDA, is like PCA, but it focuses on maximizing the separability among the known categories.

We're going to start with a super simple example: we're just going to try to reduce a two-dimensional graph to a one-dimensional graph. That is to say, we want to take this two-dimensional graph, aka an XY graph, and reduce it to a one-dimensional graph, aka a number line, in such a way that maximizes the separability of the two categories. What's the best way to reduce the dimensions? To answer that, let's start by looking at a bad way and understanding what its flaws are. One bad option would be to ignore gene Y. If we did that, we would just project the data down onto the x-axis. This is bad because it ignores the useful information that gene Y provides. Projecting the data onto the y-axis, i.e. ignoring gene X, isn't any better. LDA provides a better way.

Here, we're going to reduce this two-dimensional graph to a one-dimensional graph using LDA. LDA uses the information from both genes to create a new axis, and it projects the data onto this new axis in a way that maximizes the separation of the two categories. So the general concept is that LDA creates a new axis and projects the data onto that new axis in a way that maximizes the separation of the two categories. Now let's look at the nitty-gritty details and figure out how LDA does that.

How does LDA create the new axis? The new axis is created according to two criteria that are considered simultaneously. The first criterion is that, once the data is projected onto the new axis, we want to maximize the distance between the two means. Here we have a green mu, the Greek character representing the mean for the green category, and a red mu representing the mean for the red category. The second criterion is that we want to minimize the variation within each category, which LDA calls scatter and represents with s-squared. On the left side we see the scatter around the green dots, and on the right side we see the scatter around the red dots. And this is how we consider those two criteria simultaneously: we have a ratio of the difference between the two means, squared, over the sum of the scatter. The numerator is squared because we don't know whether the green mu will be larger than the red mu or the red mu will be larger than the green mu, and we don't want that number to be negative, we want it to be positive. So whatever the difference is to begin with, negative or positive, we square it and it becomes a positive number.
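That ratio is often written as (mu_green - mu_red)^2 / (s_green^2 + s_red^2). As a rough sketch of how it could be computed, assuming NumPy and made-up data (this is an illustration, not the figures from the video), here's the criterion along with the standard closed-form two-class LDA direction:

```python
import numpy as np

def fisher_criterion(projected, labels):
    """The ratio LDA tries to maximize on the new axis:
    (difference between the two means)^2 / (sum of the scatter)."""
    a = projected[labels == 0]
    b = projected[labels == 1]
    d_squared = (a.mean() - b.mean()) ** 2                              # numerator: d^2
    scatter = ((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum() # denominator
    return d_squared / scatter

def lda_axis(X, labels):
    """Closed-form two-class LDA direction: w proportional to S_W^-1 (mu0 - mu1)."""
    mu0 = X[labels == 0].mean(axis=0)
    mu1 = X[labels == 1].mean(axis=0)
    # Within-class scatter matrix: summed scatter around each category's own mean
    S_w = (np.cov(X[labels == 0].T, bias=True) * (labels == 0).sum()
           + np.cov(X[labels == 1].T, bias=True) * (labels == 1).sum())
    w = np.linalg.solve(S_w, mu0 - mu1)
    return w / np.linalg.norm(w)

# Example: two genes, two categories (synthetic data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 3], 1, (40, 2)), rng.normal([6, 7], 1, (40, 2))])
labels = np.array([0] * 40 + [1] * 40)

w = lda_axis(X, labels)
print("criterion, projecting onto the x-axis :", fisher_criterion(X[:, 0], labels))
print("criterion, projecting onto the LDA axis:", fisher_criterion(X @ w, labels))
```

Projecting onto the LDA axis should give a noticeably larger ratio than projecting onto the x-axis alone, which is the whole point of the two criteria.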
Now, ideally the numerator would be very large: there'd be a big difference, or a big distance, between the two means. And ideally the denominator would be very small, in that the scatter, the variation of the data around each mean in each category, would be small. I know this isn't a very complicated equation, but to make things simpler later on in this discussion, let's call the difference between the two means d, for distance, so we can replace the difference between the two means with d.

Now I want to show you an example of why both the distance between the two means and the scatter are important. Here's a new data set. We still have just two categories, green and red. In this case, there's a little bit of overlap on the y-axis but lots of spread along the x-axis. If we only maximize the distance between the means, then we'll get something like this, and the result is a lot of overlap in the middle. This isn't great separation. However, if we optimize both the distance between the means and the scatter, then we get nice separation. Here, the means are a little closer to each other than they were in the graph on the top, but the scatter is much less. So if we optimize both criteria at the same time, we can get good separation.

So what if we have more than two genes? That is to say, what if we have more than two dimensions? The good news is that the process is the same: we create a new axis that maximizes the distance between the means for the two categories while minimizing the scatter. Here's an example of doing LDA with three genes. We've got that three-dimensional graph that I showed you earlier. Here we've created a new axis, and the data are projected onto the new axis. This new axis was chosen to maximize the distance between the means of the two categories while minimizing the scatter.

What if we have three categories? In this case, two things change, but just barely. Here's a plot that has two genes, but now we have three categories. The first difference between having three categories, as opposed to just two categories like we had before, is how we measure the distances among the means. Instead of just measuring the distance between the two means, we first find a point that is central to all of the data. Then we measure the distances between a point that is central in each category and the main central point. Now we want to maximize the distance between each category and the central point, while minimizing the scatter for each category. And here's the equation that we want to optimize: it's the same equation as before, but now there are terms for the blue category.

The second difference is that LDA creates two axes to separate the data. This is because the three central points, one for each category, define a plane. Remember from high school: two points define a line, and three points define a plane. That is to say, we create new x and y axes, but these are now optimized to separate the categories. When we only use two genes, this is no big deal; the data started out on an XY plot, and plotting them on a new XY plot doesn't change all that much. But what if we use data from 10,000 genes? That would mean we need 10,000 dimensions to draw the data. Suddenly, being able to create two axes that maximize the separation of the three categories is super cool. It's way better than drawing a 10,000-dimensional figure that we can't even imagine what it would look like.
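Here's a minimal sketch of that three-category, many-gene case, assuming scikit-learn and synthetic data (the gene counts and class structure here are made up for illustration; the same idea applies at 10,000 genes):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
n_genes = 50  # stand-in for "lots of genes"

# Three categories, each centered at a different point in gene-expression space
centers = rng.normal(0, 2, size=(3, n_genes))
X = np.vstack([rng.normal(c, 1.0, size=(30, n_genes)) for c in centers])
y = np.repeat([0, 1, 2], 30)

# With 3 categories, LDA creates at most 3 - 1 = 2 new axes, chosen to spread the
# category means apart while keeping each category's scatter small
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)  # shape (90, 2): one point per sample on the new axes

print(X_2d.shape)
```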
Here's an example using real data. I'm trying to separate three categories, and I've got 10,000 genes. Plotting the raw data would require 10,000 axes. We used LDA to reduce that number to two, and although the separation isn't perfect, it is still easy to see three separate categories. Now let's use that same data set to compare LDA to PCA. Here's the LDA plot that we saw before, and now we've applied PCA to the exact same set of genes. PCA doesn't separate the categories nearly as well. We can see lots of overlap between the black and the blue points. However, PCA wasn't even trying to separate those categories; it was just looking for the genes with the most variation. So we've seen the differences between LDA and PCA, but now let's talk about some of the similarities. The first similarity is that both methods rank the new axes that they create in order of importance. PC1, the first new axis that PCA creates, accounts for the most variation in the data.
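To see that LDA-versus-PCA difference in separation for yourself, here's a small sketch on synthetic data (scikit-learn assumed; the numbers are made up, not the real data set above). The class differences live in a few genes, while a few other genes have high variance but carry no class information, so PCA's top axes tend to miss the categories:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
n_genes = 100  # stand-in for the 10,000-gene case

X = rng.normal(0, 1, size=(180, n_genes))
y = np.repeat([0, 1, 2], 60)
X[:, :5] += y[:, None] * 2.0                    # genes that actually separate the categories
X[:, 5:10] += rng.normal(0, 8, size=(180, 5))   # high variance, but no class information

lda_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
pca_2d = PCA(n_components=2).fit_transform(X)

def nearest_mean_accuracy(Z, y):
    """Rough measure of separation in a 2-D plot: assign each point to the closest
    category center and check how often that matches the true category."""
    means = np.array([Z[y == k].mean(axis=0) for k in np.unique(y)])
    pred = np.argmin(((Z[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
    return (pred == y).mean()

print("separation on the LDA axes:", nearest_mean_accuracy(lda_2d, y))
print("separation on the PCA axes:", nearest_mean_accuracy(pca_2d, y))
```

On data like this, the LDA axes keep the categories well apart, while the first two principal components mostly track the high-variance noise genes, which mirrors the overlap seen in the PCA plot above.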