Comprehensive YouTube statistics tutorials covering descriptive & inferential statistics, hypothesis testing (z-tests, t-tests, chi-squared), probability distributions, correlation, and Python applications. Includes concepts like mean, median, mode, variance, standard deviation, p-values, confidence intervals, and sampling techniques.

This segment clearly defines descriptive and inferential statistics, differentiating their purposes in data analysis. It sets the stage for understanding the core concepts of statistical analysis within the context of data science.

This segment lists and briefly explains several key probability distributions (Gaussian, log-normal, binomial, Bernoulli, Pareto, standard normal), highlighting their importance in data science applications. Understanding these distributions is fundamental for many data science tasks.

This segment details various sampling techniques (simple random, stratified, systematic, and convenience sampling), explaining their applications and limitations. Understanding these techniques is essential for designing effective data collection strategies and interpreting results.

This segment uses the example of exit polls to illustrate the concepts of population and sample, crucial for understanding statistical inference and the limitations of drawing conclusions from a subset of data. It also addresses common student confusion regarding data types used in examples.

This segment introduces inferential statistics, focusing on Z-tests, T-tests, ANOVA tests, and chi-square tests. It emphasizes the importance of hypothesis testing in drawing conclusions from data.

This segment clearly defines and differentiates between quantitative and qualitative variables, providing numerous examples to illustrate the concepts. It further categorizes quantitative variables into discrete and continuous variables, explaining the distinctions with practical examples, making the concepts easily understandable for viewers.
This segment explains convenience sampling, highlighting its characteristics and contrasting it with other sampling methods. Real-world examples, such as exit polls and household surveys, illustrate how different sampling techniques are chosen based on the specific research question and target population. The discussion emphasizes the importance of selecting the appropriate sampling method for accurate data collection.

This segment delves into the complexities of sampling for drug testing, emphasizing the importance of considering the target population and relevant factors like age. It showcases how the choice of sampling method directly impacts the validity and reliability of the results, underscoring the need for careful consideration of the research question and potential confounding variables.

This segment clearly explains the concepts of the arithmetic mean for a population and a sample, emphasizing the importance of correct notation in data science and professional settings. It also introduces central tendency, defined as a measure used to determine the center of a data distribution, covering mean, median, and mode.

This segment contrasts bar graphs and histograms, explaining their appropriate uses for discrete and continuous data, respectively. It also introduces probability density functions (PDFs) as a smoothed representation of histograms, providing a visual explanation and mentioning the underlying technique (kernel density estimation).

This segment introduces four types of variable measurements (nominal, ordinal, interval, and ratio), focusing on nominal and ordinal data. It explains the differences between these two types, emphasizing the importance of understanding the order and value of data in different measurement scales. Real-world examples are used to clarify the concepts.
This segment introduces the concepts of variance and standard deviation as measures of dispersion, explaining how they quantify the spread of data around the mean. It uses visual examples to illustrate how variance reflects the data's dispersion and connects standard deviation to the shape of the data distribution.

This segment uses a practical example to demonstrate how outliers significantly affect the mean, highlighting the importance of the median as a more robust measure of central tendency when dealing with outliers. The discussion emphasizes the adverse impact of outliers on data analysis and the need for careful consideration.

This segment provides a step-by-step explanation of how to calculate the median, differentiating between cases with odd and even numbers of data points. It further reinforces the robustness of the median against outliers compared to the mean, making it a preferred measure in certain situations.

This segment defines and explains the mode as the most frequent element in a dataset. It highlights the usefulness of the mode, particularly when dealing with categorical variables and handling missing data, providing practical examples to illustrate its application.

This segment explains the characteristics of a Gaussian or normal distribution, emphasizing its symmetrical bell-curve shape, where the mean, median, and mode coincide. The speaker highlights the importance of this distribution in inferential statistics and its use in deriving conclusions from datasets.

This segment discusses the applications of box plots in determining outliers and provides an overview of the topics to be covered in the subsequent part of the video, including various probability distributions and data visualization techniques.

Now let's go ahead and try to see: I got my variance as 1.81. Now my standard deviation is nothing but the root of the variance — that is, the root of 1.81.
So if I go and open my calculator and take the root of 1.81, what I actually get is 1.345. Now see what the standard deviation basically means. What is the mean in this particular case? The mean is 2.83. From this mean your data will be distributed, because the mean specifies your measure of central tendency: it says where the center of that specific distribution is. So from here, if I go one standard deviation to the right, the range will be 2.83 plus 1.345, which is about 4.17. That basically means that in this distribution, whatever elements are present between 2.83 and 4.17 fall within the first standard deviation to the right. If I consider the same thing towards the left, that is one standard deviation to the left, I subtract 1.345, which gives about 1.49. So any element that falls between 1.49 and 2.83 falls in that region, one standard deviation to the left. Similarly, for the second standard deviation to the right, 4.17 plus 1.345 gives about 5.52, and you continue the same calculation on both sides. Now, the standard deviation here is a very small number, and if I try to construct a graph it will look something like this: this region that you see, with the tip in the middle, is basically called a bell curve.
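The arithmetic above can be checked with a short sketch. The mean 2.83 and variance 1.81 are taken as given from the transcript; the underlying data set itself is not shown:

```python
import math

mean = 2.83
variance = 1.81

std = math.sqrt(variance)      # standard deviation = root of variance, ~1.345
one_right = mean + std         # edge of the 1st standard deviation to the right
one_left = mean - std          # edge of the 1st standard deviation to the left
two_right = mean + 2 * std     # edge of the 2nd standard deviation to the right

print(round(std, 3))           # ~1.345
print(round(one_left, 2), round(one_right, 2), round(two_right, 2))
```

The printed ranges reproduce the boundaries sketched on the bell curve (roughly 1.49, 4.17, and 5.52).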
And based on the standard deviation and variance, you will be able to decide two important things. With the help of variance you will be able to understand how the data is spread, and with standard deviation you will be able to understand what range of data may fall between one standard deviation to the right and to the left. So standard deviation is nothing but the square root of variance. That basically means: from the mean, how far can an element be? Let's consider 5. If you try to calculate, it may fall somewhere here, and you will say that it falls 1.5 standard deviations from the mean. That is the kind of statement you will be able to make: how far a specific number is from the mean, measured in a unit called the standard deviation. Variance, on the other hand, specifically talks about spread: if the variance is high, the spread of the data is very high. Now let's understand some amazing basic things called percentiles and quartiles. This is the first step in finding outliers — we are going to discuss how, and you can also do it with the help of code. Before understanding percentiles, you basically need to understand percentages. Suppose I have a distribution: 1, 2, 3, 4, 5. My question is: what is the percentage of numbers that are odd? How do you apply a formula here? I can say percentage equals the number of odd numbers divided by the total numbers. If I count the odd numbers — 1, 3, 5 — there are 3. So 3 divided by 5 is nothing
but 0.6, which is 60 percent. Very simple — this is how we calculate a percentage, and I hope everybody knows this. Now let's understand a very, very important topic called the percentile. You have probably heard about percentiles in many places — if you have given the GATE exam, CAT exam, GMAT exam, or SAT exam. One real-life example I'll show you is related to my YouTube ranking on Social Blade. Here you can see the education rank: if you hover over it, it shows the 96.1 percentile; hovering elsewhere it shows the 94.98 percentile, and over here the 94.958 percentile. So we'll discuss these percentiles right now. First of all, the definition: a percentile is a value below which a certain percentage of observations lie. If I say a number is the 25th percentile, that basically says that 25 percent of the entire distribution is less than that particular value. Let me take a very good example and show it to you. Suppose I have a data set with elements like 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 11, 11, 12. Let's consider that these are the elements I actually have. Now, within this set of elements, my question is: what is the percentile ranking of 10? We solve this problem using a simple formula. I want to find the percentile rank of 10, so let's consider x = 10; here I'm specifically going to write x.
So my formula will basically be: the number of values below x, divided by small n, which is my sample size, multiplied by 100. What is n over here? Counting the elements — 1, 2, 3, and so on up to 20 — 20 is my sample size. Now, how many values are below x = 10? Counting them gives 16. So this becomes 16 divided by 20, multiplied by 100, which works out to 80. So the percentile ranking of the value 10 is 80. Now understand the main meaning of this — please listen very carefully: 80 percent of the entire distribution is less than 10. That is the real meaning you should take from it. Now, quickly: what is the percentile ranking of the value 11? How many elements are present below 11? 17, so 17 divided by 20 multiplied by 100 gives the 85th percentile. Now let's do the reverse: from this particular distribution, what value exists at a percentile ranking of 25? For this you use a very simple formula: index = (percentile divided by 100) multiplied by (n + 1). Now, guys, I'm not going to derive why it is n + 1 here, or why sample variance uses n − 1 — I'll discuss the n − 1 later — but we really need to understand what we are doing and how we are using it for a specific purpose. So here it is 25 / 100 multiplied by 21, which gives 5.25. Now understand this: 5.25 is the index position. This is very important to understand — it is not the value, it is the index position.
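Both directions of the percentile calculation just walked through can be sketched in Python. The function names are my own, and the fractional-index rule follows the transcript's method of averaging the two neighbouring elements (rather than the linear interpolation that tools like NumPy use by default):

```python
def percentile_rank(data, x):
    """Percentage of values in data that lie strictly below x."""
    below = sum(1 for v in data if v < x)
    return below / len(data) * 100

def value_at_percentile(data, p):
    """Value at percentile p using the (p/100) * (n + 1) index rule."""
    data = sorted(data)
    pos = p / 100 * (len(data) + 1)   # 1-based index, possibly fractional
    i = int(pos)
    if pos == i:
        return data[i - 1]
    # fractional index: average the two surrounding elements
    return (data[i - 1] + data[i]) / 2

data = [2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 11, 11, 12]
print(percentile_rank(data, 10))      # 80.0
print(percentile_rank(data, 11))      # 85.0
print(value_at_percentile(data, 25))  # 5.0 (index 5.25 -> average of 5th and 6th)
print(value_at_percentile(data, 75))  # 9.0 (index 15.75 -> average of 15th and 16th)
```

The outputs reproduce the hand calculations: 80th and 85th percentile ranks, and the values 5 and 9 at the 25th and 75th percentiles.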
Now I will go and find out what sits at index 5.25. This is my first element, then the second index, third index, fourth index, fifth index — and 5.25 falls in between, but right now there is no element exactly there. So what we do is take the fifth and sixth index, average them, and that gives us the value. In this particular case my answer will be 5, so 5 is the value at the 25th percentile. Try to find out the 75th percentile: 75 divided by 100 multiplied by 21 gives 15.75 as the index position. Now go and count from the top — 1, 2, 3, up to 15 — 15.75 falls between the 15th and 16th elements, so we average those two numbers, and my answer is 9. Now let's go and discuss a new topic called the five-number summary. In the five-number summary we need to discuss five things: the first is the minimum; the second is the first quartile, also denoted Q1; the third is the median; the fourth is the third quartile, also denoted Q3; and the fifth is the maximum. With the help of these values we will remove the outliers. So let's take one example and see how, with the help of the five-number summary, we remove an outlier — a very important technique, which involves something called the IQR. Now let's consider that I have one data set like this: 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27. Now, from this distribution, guys, what do you think is the outlier? Obviously you'll be saying that 27 is the outlier. But always understand, guys:
Whenever we need to remove an outlier, we really need to define a lower fence and a higher fence. The valid values will lie between the lower fence and the higher fence: all the numbers above the higher fence will be treated as outliers, and all the numbers below the lower fence will also be treated as outliers. It works on both sides. If I had one more element, say −50, would −50 be an outlier for this distribution? The answer is definitely yes: −50 is below the lower fence line, so it can be treated as an outlier. In order to define the fences, we write two very simple formulas: lower fence = Q1 − 1.5 × IQR, and upper fence = Q3 + 1.5 × IQR. Now, what exactly is IQR? You really need to understand this. IQR is the interquartile range, and it is given by the formula Q3 − Q1, where Q3 is the 75th percentile and Q1 is the 25th percentile. Now quickly check this distribution and try to find the 25th percentile, using the simple index formula: (25 divided by 100) multiplied by (n + 1). Counting the elements, small n is 19, so n + 1 is 20, and 25 / 100 multiplied by 20 is 5. This 5 is the index position. So what is the element at the fifth index position? Counting: 1, 2, 3, 4,
5 — the fifth element is 3. Is everybody getting it? The 25th percentile, Q1, is 3. Similarly, if you try to find Q3, you will get the 15th index position, and the 15th element is 7, so Q3 is 7. Now let's compute the interquartile range: 7 minus 3, which is 4. So we have computed the IQR, Q3, and Q1 — everything is in place. Now let's go ahead and compute the lower fence. The lower fence formula is Q1 − 1.5 × IQR. What is Q1 in this particular case? I have computed it: Q1 is 3, and Q3 is 7. So I'm going to write 3 − 1.5 × 4, which is 3 − 6, which is −3. So the lower fence value is −3. Now let's compute the higher fence: Q3 + 1.5 × IQR. Q3 is 7, so 7 plus 6 is equal to 13. So my range from lower fence to higher fence is −3 to +13. Now tell me, which is the outlier here? Anything greater than 13 is considered an outlier, and anything less than −3 is considered an outlier. So which number should we remove? We should remove 27, because 27 is greater than 13, the higher fence. Now let me write the distribution once again for all of you, after removing the outlier. The data was 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27, and as I told you, 27 is removed because it is an outlier. Now, what is the minimum value out of all these numbers? The minimum value is 1. What is Q1, my first quartile? We have computed it over here: Q1 is 3. The median — you calculate and tell me quickly. Then you have Q3, which is 7, and the maximum number after removing the outlier is nothing
but 9. So here you are getting your five-number summary. Now quickly compute the median and tell me what it is: the median is nothing but 5. Now let's draw a plot called a box plot. With this specific data you can definitely draw a box plot. How does a box plot get drawn? You will have an x-axis, and let's consider that on this x-axis you have values like −2, 0, 2, 4, 6, 8, 10. Now find where the minimum element falls: the minimum will fall at 1. Q1 will fall at 3, the median at 5, Q3 at 7, and the max at 9 — so those are your lines. Now all you have to do is join these lines, and that exactly is your box plot. If I had kept 27 as an element, I would have had to extend this axis much further and put 27 somewhere out there as a single dot. Have you seen this kind of plot? This value is the minimum, this is my Q1, this is my median, this is my Q3, and this is my max. And this technique of removing an outlier with respect to the lower fence and higher fence is the IQR method.

This segment introduces the empirical rule, explaining how data points are distributed within one, two, and three standard deviations from the mean in a normal distribution (approximately 68%, 95%, and 99.7%, respectively). Real-world examples like height, weight, and the Iris dataset are used to illustrate the practical application of this rule.

Let's take an example. Suppose I have a data set where my mean is 4 and my standard deviation is 1. If I have these two pieces of information, can I construct a distribution? Suppose this is 4; then to the right come 5, 6, 7, 8, and to the left 3, 2, 1, and 0. So I will be able to create this, and let's consider that the data is basically following this kind of distribution.
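The IQR fence calculation and five-number summary walked through above can be collected into one sketch. The quartile helper mirrors the transcript's (n + 1) index rule, so it reproduces Q1 = 3 and Q3 = 7 exactly (NumPy's `np.percentile` interpolates differently and would give slightly different quartiles):

```python
def quartile(data, p):
    """Percentile via the (p/100) * (n + 1) index rule from the transcript."""
    data = sorted(data)
    pos = p / 100 * (len(data) + 1)
    i = int(pos)
    return data[i - 1] if pos == i else (data[i - 1] + data[i]) / 2

data = [1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27]

q1, q3 = quartile(data, 25), quartile(data, 75)   # 3 and 7
iqr = q3 - q1                                     # 4
lower_fence = q1 - 1.5 * iqr                      # -3.0
upper_fence = q3 + 1.5 * iqr                      # 13.0

cleaned = [v for v in data if lower_fence <= v <= upper_fence]

# Five-number summary, with the quartiles from the original calculation
# as in the transcript: min, Q1, median, Q3, max
five_number = (min(cleaned), q1, quartile(cleaned, 50), q3, max(cleaned))
print(cleaned)        # 27 is gone
print(five_number)    # (1, 3, 5.0, 7, 9)
```

For the picture itself, matplotlib's `plt.boxplot(data)` draws the same box plot from raw data.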
So this basically follows this kind of distribution. Now understand: the middle one is your mean — the mean is 4 and the standard deviation is 1. Now see one thing, guys. If I talk about 4.5, my question is: where does 4.5 fall in terms of standard deviations? You may be thinking, okay, 4.5 is somewhere here. Obviously, when I say 5 is the first standard deviation to the right, that basically means 4.5 is plus 0.5 standard deviations to the right. Understand: 0.5 standard deviations — one full standard deviation takes you to 5, so 4.5 is 0.5 standard deviations. Now similarly, if I ask where 4.75 falls, it becomes much more difficult to do the calculation by eye. That is the reason we can use a concept called the Z score. The Z score basically helps you find out, for any value, how many standard deviations away it is from the mean. The formula is (x_i − mu) divided by the standard deviation. For 4.75 I just write (4.75 − 4) / 1, where mu is 4 and the standard deviation is 1, so I actually get 0.75. So 4.75 is 0.75 standard deviations to the right — why to the right? Because the value is positive. Now, if I give you the same question — where does 3.75 fall, in terms of standard deviations? — you apply the same formula: Z score = (3.75 − 4) / 1, which is −0.25. Whenever a minus comes, that basically means you have to check on the left side, and it is saying that 3.75 will be falling somewhere here.
That is nothing but 0.25 standard deviations to the left. Now let's go to the next thing. Suppose I consider the same graph. You understood that if I really want to find out how many standard deviations to the right or left a value is, I can definitely use the Z score. I'm using the same bell curve: this is my 4, this is my 5, this is my 6, this is my 3, this is my 2, this is my 1. You know that my mean is 4 and standard deviation is 1. Now let's apply the Z score to every value. What will happen? What is the Z score formula? (x_i − mu) divided by the standard deviation. You know the mean is 4 and the standard deviation is 1. Initially my distribution was 1, 2, 3, 4, 5, 6, 7. Now, after applying the Z score, what will my distribution become? Apply it for 1 first: (1 − 4) / 1 is −3. Can I say 1 is getting converted to −3? 1 is converted to −3. Then apply the Z score to the next element, 2: (2 − 4) / 1 is −2, so here I'm actually getting −2. Then apply the Z score to 3: (3 − 4) / 1 is −1, so 3 gets converted to −1. Then 4 gets converted to 0, and the rest get converted to 1, 2, 3. Now understand the main magic in this: with the help of the Z score, is this not exactly how many standard deviations each of these elements sits from the mean? After applying the Z score — initially, my data set was like this,
then I got this: this element falls at −3 standard deviations, this element falls at −2 standard deviations. So here you can definitely see that I am able to read off the standard deviations. Now, what is happening? See one beautiful thing over here. I had a distribution which was 1, 2, 3, 4, 5, 6, 7. After I applied the Z score, this got converted to −3, −2, −1, 0, 1, 2, 3. Now, what is this distribution called? Initially this was a normal distribution, or Gaussian distribution. After applying the Z score, what kind of distribution are we actually getting? This distribution is called the standard normal distribution. One of the most important properties of the standard normal distribution is that the mean is 0 and the standard deviation is 1. Is that property satisfied here or not? It is being satisfied, right? So can I write that a random variable x or y belongs to the standard normal distribution, where specifically your mean will be 0 and standard deviation will be 1? So after applying a Z score, we are able to get a different distribution, called the standard normal distribution. Now the question arises: why do we do this? What is the use? Let's go ahead with one practical application — we do this in machine learning, in most of the algorithms. Suppose I have a data set; let's consider that I am solving a machine learning problem statement and I have features like age, salary, and weight — these three columns. Now, understand one thing: age we measure in years; salary we may measure in rupees or dollars; weight we may measure in kilograms. Understand these units.
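The z-score transformation just described can be sketched with NumPy. Note the transcript takes sigma = 1 as given, which maps 1..7 to −3..3; standardizing with the data's own mean and standard deviation gives different z-values, but the essential property — mean 0 and standard deviation 1 afterwards — always holds:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)

# With the transcript's given mean = 4 and sigma = 1:
z_given = (data - 4) / 1
print(z_given)                 # [-3. -2. -1.  0.  1.  2.  3.]

# Standardization proper: use the data's own mean and (population) std
z = (data - data.mean()) / data.std()
print(z.mean(), z.std())       # 0.0 and 1.0 — the standard normal property
```

Either way, the transformed distribution is centered at 0 with unit spread, which is exactly what standardization delivers.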
These are basically units — units of measurement. Now, whenever I have values like age 24, 25, 26, 27; salary maybe 40k, 50k, 60k, 70k; weight maybe 70 kg, 80 kg, 55 kg, 45 kg — when you have this kind of data, you can obviously see the units are completely different. Our main target should be to bring the data into a form where the mean is 0 and the standard deviation is 1; at that point I can definitely work with the standard normal distribution. That basically means I can take this entire column, apply the Z score, and convert it into a standard normal distribution. Similarly, I can take the next column, apply the Z score, and convert it as well. This process is basically called standardization — very super important. Many people will also talk about normalization; I'll cover the difference between standardization and normalization. Whenever we talk about standardization, in short, internally the Z score formula is being applied. So standardization is a process where I convert a distribution into a standard normal distribution, with the property that the mean is 0 and the standard deviation is 1. Now let's go ahead towards something called normalization. In standardization, the data gets converted so that the mean is 0 and the standard deviation is 1. In normalization, you have an option: you can say that you want to shift all of your values into a range, for example between 0 and 1. In that particular case, I would definitely apply normalization. Now, how do we do normalization?
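A minimal sketch of this rescaling, using salary-like numbers assumed for illustration: each value is shifted and divided so the minimum maps to 0 and the maximum to 1, and the result can then be stretched to any other range:

```python
import numpy as np

salary = np.array([40_000, 50_000, 60_000, 70_000], dtype=float)

# Min-max normalization to the range [0, 1]
normalized = (salary - salary.min()) / (salary.max() - salary.min())
print(normalized)                    # [0.  0.333..  0.666..  1.]

# For an arbitrary range [a, b], stretch and shift the [0, 1] result:
a, b = -1, 1
rescaled = a + normalized * (b - a)  # now between -1 and +1
print(rescaled)
```

scikit-learn's `MinMaxScaler(feature_range=(0, 1))` implements the same idea for whole feature columns.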
There is a very common tool for this, called the min-max scaler. In the min-max scaler you just have to provide the range 0 to 1, and this kind of normalization happens automatically — and yes, I will show you practically, don't worry. If I want to shift the values between −1 and +1 instead, I can apply that as well. So normalization gives you a process where you can define the lower bound and the upper bound and convert your data to lie between them.

This segment explains how to use Z-tables to find the area under a normal distribution curve corresponding to a given Z-score. The speaker emphasizes the importance of understanding whether to use the left or right tail of the Z-table depending on the question being asked, and clarifies the difference between left and right Z-tables and how to interpret the values obtained from them.

This segment provides a step-by-step guide on how to use a Z-table to calculate the percentage of scores above a specific value. The speaker demonstrates how to find the area under the curve using the Z-score and then explains how to interpret this area as a percentage, again clarifying the use of the left and right tails of the Z-table.

This segment demonstrates how to visually compare a team's final scores from 2020 and 2021 using bell curves, illustrating the concept of standard deviation and its impact on performance analysis. The speaker uses the mean and standard deviation of each year's scores to create bell curves, showing where a specific score (240) falls within each distribution. This helps in understanding the relative performance of the team in both years.

This segment introduces a practical application of Z-scores. The speaker presents a scenario where scores are distributed, and the goal is to determine the percentage of scores above a certain value (4.25).
The speaker calculates the Z-score for 4.25 and explains how to use a Z-table to find the corresponding area under the curve, representing the percentage of scores above 4.25.

This segment focuses on the interpretation of Z-scores and their relationship to the area under a normal distribution curve. The speaker explains how the Z-score indicates the number of standard deviations a data point is from the mean and how this relates to finding the area (percentage) of the curve above or below that point. The concepts of the "tail" and "body" of the curve are introduced to clarify the area calculation.

Can you name some numbers that look like outliers in this? The first thing we are going to do is find outliers using the Z score. How do you find outliers using the Z score? Let me explain. You know about the normal distribution — till now we have discussed so many things about it. You know the mean, and the first, second, and third standard deviations to the right and to the left, and that they cover approximately 68 percent, 95 percent, and 99.7 percent of the data. Can I consider that, in some scenarios, if my data is normally distributed, whatever lies beyond the third standard deviation is an outlier? Yes or no? Yes. Just think it over: most of the time, if values lie beyond the third standard deviation, they can be treated as outliers. So first we'll try to implement this. Now, the first thing I am actually going to do over here is make a list.
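The z-table lookups discussed in the segments above can be reproduced programmatically. Assuming the earlier example's mean 4 and standard deviation 1, and using `scipy.stats.norm` in place of a printed z-table (both give the same numbers):

```python
from scipy.stats import norm

mu, sigma = 4, 1
x = 4.25

z = (x - mu) / sigma       # 0.25 standard deviations to the right of the mean
left_area = norm.cdf(z)    # area to the LEFT of z (what a left z-table lists)
above = 1 - left_area      # right tail: fraction of scores ABOVE 4.25

print(round(z, 2))               # 0.25
print(round(above * 100, 2))     # ~40.13 percent of scores lie above 4.25
```

The left/right distinction the speaker stresses shows up here directly: `norm.cdf` gives the left-tail area, and subtracting it from 1 gives the right tail.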
Okay, so here I'm just calling it outliers. I'm going to create it as a list and put all the outliers inside. Now, how do you check this using the Z-score? With the Z-score we can find out how many data points actually fall within the third standard deviation. So here I'm going to create a function, detect_outliers. This will be my function, and here I'm going to pass my data. The first thing I will create is a threshold; my threshold will be three standard deviations. Anything that falls beyond three standard deviations I will treat as an outlier. And I hope everybody remembers the formula. The Z-score formula is nothing but x_i minus mu divided by the standard deviation. We sometimes also see a version of this formula with root n in the denominator (the standard error), but I'll talk later about why I'm not using root n here; for now I'll just use this formula. So I have to implement this formula in the Python programming language. First of all, obviously, I need to compute the mean and the standard deviation. You know how to compute the mean, right? So here I will write mean equals np.mean and pass my data points, which gives me the mean. Then for the standard deviation I can write np.std of that data, and I get the standard deviation. So I have got my mean and standard deviation. Now, for each point in my dataset, I will apply the Z-score formula. So I'll write: for i in data, z_score equals i minus mean, divided by the standard deviation; here i plays the role of x_i.
So this is my Z-score formula, and for every item I am computing the Z-score. The Z-score tells you how many standard deviations a point is away from the mean. So I can write one condition to check whether it falls beyond the third standard deviation or not. I can use np.abs, which gives us the absolute value of the Z-score, and check whether the absolute Z-score is greater than the threshold. If it is greater than the threshold, what does that mean? I have already defined the threshold, right? Oh, sorry, the variable is dataset, I'm extremely sorry, dataset. Now tell me: if np.abs of the Z-score is greater than the threshold, what should we do? I think now it is clear, right? It means that point is an outlier, because it is falling beyond the third standard deviation. So what I can do, since I have created a list, is write outliers.append and append that value. And note: I am appending the i value, not the Z-score, because i is the actual data point from the dataset. And then finally I'm just going to return the outliers. Let's see whether it works or not; I'm also trying it for the first time. So this function has been executed. I will just execute one more cell: threshold equal to three, which defines our third standard deviation cutoff.
If you want to check what this distribution looks like, I can call plt.hist on the dataset. Oh, "plt is not defined", why? Okay, I had typed it wrong; it should be plt. I'm not checking here whether it is normally distributed or not, but you can already see there are some definite outliers. Now, what has changed? Just the variable inside the for loop: this should be data, the data I'm passing in. See, the threshold here is my third standard deviation cutoff. If you want the dataset, I can paste the whole thing in the chat; I've already given it to you all. Now let's go and execute it. Now what I'm going to do is call detect_outliers and pass the dataset; and np.abs means the absolute value function. Once I execute it, you will see that it returns three outliers. Are these my outliers or not? Guys, the for loop is very simple: for i in data, I compute the Z-score for every value in the list, and I check whether the Z-score is greater than 3 or not. If it is greater than 3, I consider it an outlier. Here you can see all the outliers; an outlier means an extreme value, right? If you have not attended the previous session, guys, you may struggle to follow, because this is a seven-day live session. Now I have got the outliers. So this is one way we can use the Z-score; this was an example of Z-score computation. And basically, we have done it.
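Cleaned up, the detect_outliers function the speaker builds looks like the sketch below. The sample dataset here is illustrative, not the one shared in the session chat.

```python
import numpy as np

def detect_outliers(data, threshold=3):
    """Return values lying more than `threshold` standard deviations from the mean."""
    outliers = []
    mean = np.mean(data)
    std = np.std(data)
    for i in data:
        z_score = (i - mean) / std       # z = (x_i - mu) / sigma
        if np.abs(z_score) > threshold:  # beyond the third standard deviation
            outliers.append(i)
    return outliers

data = list(range(1, 101)) + [500]       # hypothetical data with one extreme value
print(detect_outliers(data))             # [500]
```

Note that `np.std` computes the population standard deviation by default; pass `ddof=1` if you want the sample version.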
Now let's go towards the IQR. IQR means the interquartile range. So what kind of code will I write for the interquartile range? Always understand what we are doing in IQR. First of all we need to find Q1, which is the 25th percentile. Then we have Q3, which is the 75th percentile. If I subtract the 25th percentile from the 75th percentile, I get the IQR. And always understand what we do with the IQR: we find the lower fence and the upper fence. That is what we really need to find in the case of IQR. So how do I write the code? The theory has already been explained, so I'll write down all the steps required. The first step is that I want to sort the data. The second step is that I will calculate Q1 and Q3; Q1 and Q3 are pretty important in this case. I'll just move this up, and copy and paste it over here. So: first, sort the data; then calculate Q1 and Q3. Then we need to find the IQR, which is the third step, nothing but Q3 minus Q1. Then, fourth, find the lower fence; I hope everybody knows the formula. It is nothing but Q1 minus 1.5 multiplied by IQR. Then find the upper fence, where I will use Q3 plus 1.5 multiplied by IQR. So these are the steps I am planning for, and based on these steps I will implement it. These are the steps I will perform in order to find the outliers with the help of the IQR.
Now, first of all, if I want the sorted dataset, how do I get it? I will just write dataset equals sorted of the dataset; sorted is an inbuilt function which will sort all the numbers. So right now I have a dataset which is completely sorted, and my first step is done. Now the second step: I need to calculate Q1 and Q3. So I will write Q1 comma Q3 equals np.percentile, give my dataset, and along with it two values, 25 comma 75. Once I execute it, you can see it has run. Now I am going to print Q1 comma Q3, and here you can see: this is my 25th percentile, and this is my 75th percentile. Now that we have these, let's go ahead and compute the lower fence and the upper fence. I'll write the comment "find the lower fence and upper fence". The lower fence is Q1 minus 1.5 multiplied by IQR, and before that I need to compute the IQR itself; that is why an error came up when I printed IQR before defining it. So let's say IQR equals Q3 minus Q1. Now if I execute this, you will see that the IQR is three. For the upper fence I write Q3 plus 1.5 multiplied by IQR. Once I execute it I know my lower fence and upper fence, so I print lower_fence and upper_fence: it is 7.5 to 19.5. Now, the further part I think you can do comfortably: based on this lower fence and upper fence, you can write a condition and remove all the elements outside them.
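The five IQR steps above can be collected into one small function. The dataset below is hypothetical; the speaker's own dataset (which gives IQR 3 and fences 7.5 to 19.5) was shared in the session chat and is not reproduced here.

```python
import numpy as np

def iqr_outliers(data):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    data = sorted(data)                        # step 1: sort the data
    q1, q3 = np.percentile(data, [25, 75])     # step 2: 25th and 75th percentiles
    iqr = q3 - q1                              # step 3: interquartile range
    lower_fence = q1 - 1.5 * iqr               # step 4: lower fence
    upper_fence = q3 + 1.5 * iqr               # step 5: upper fence
    return [x for x in data if x < lower_fence or x > upper_fence]

print(iqr_outliers([1, 12, 13, 14, 15, 30]))   # [1, 30]
```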
This segment applies the Z-score concept to a real-world problem involving IQ distribution. The speaker calculates the Z-score for an IQ of 85, given a mean IQ of 100 and a standard deviation of 15. The speaker then uses the Z-table to determine the percentage of the population with an IQ lower than 85, demonstrating the practical application of Z-scores in statistical analysis. This segment clearly explains the difference between mutually exclusive and non-mutually exclusive events and demonstrates how to apply the addition rule to calculate probabilities in both scenarios using examples of dice rolls and card draws. This segment shows how to use box plots to visualize outliers in a dataset. The speaker demonstrates the use of the `seaborn` library in Python to create a box plot, highlighting how the plot visually represents the data distribution and clearly shows outliers. If I want to give a definition, what exactly is probability? Here you can say that probability is a measure of the likelihood of an event. The reason I write out all these definitions, guys, is that you really need to think about what exactly is happening here. If you can remember a definition through an easy example, that helps, and that is why I also give you a lot of examples. Let's say that I am rolling a die. What are my possible sample outcomes? You know that they are one, two, three, four, five, six. Now, if I ask you a question: what is the probability, when I roll a die, of getting a 6? If this is my question, how will you calculate the probability, and what is the answer? Obviously, you will say one by six, right? It's very simple. So how do we define probability?
I'll say: the number of ways an event can occur, divided by the number of possible outcomes. This is the exact definition. Now, in this scenario I am trying to find the probability of getting a six when I roll a die. In how many ways can that event occur? Only one. And what is the total number of possible outcomes? It is six. So this is how we find it. Similarly, one more example: let's say I want to toss a coin. Obviously I know my sample space: head and tail. What is the probability of getting heads? You will say 1 by 2, because the sample space has 2 outcomes and the number of ways the event can occur is 1. So you say the probability of heads is one by two. Now let's go one step beyond basic probability, to what is called the addition rule. This is super important; you will probably use this addition rule in your aptitude tests. We also call it the "or" rule of probability. Now, in order to understand the addition rule, you need to understand two things. The first is mutually exclusive events. What are mutually exclusive events? Two events are mutually exclusive if they cannot occur at the same time. Let's see an example: rolling a die. When I roll a die, at a given time I can get 1, or 2, or 3, or 4, or 5, or 6. You cannot get 1 and 2 at the same time, and you can't get one, two, three, and four
at the same time. In one roll of the die, a single experiment, you will only get one number; you will not get two numbers. So this is an example of mutually exclusive events. Another example is tossing a coin: you either get heads or tails. You cannot get both, unless your coin is standing on its edge like in the movies. I hope you know which movie I am talking about; you can consider good movies like Sholay, with its famous coin. Normally only one of the two outcomes occurs each time. Now let's discuss non-mutually exclusive events. Obviously, you understood what mutually exclusive means; with non-mutually exclusive events, two or more events can occur at the same time. Let's take one example: a deck of cards, a very simple example. Consider two events when I pull out a card from the deck: the card is a king, and the card is a heart (a red card). Here, both things can happen at once: I can pick up a king, and that same card can be a heart. Multiple things are happening together, so these two events are obviously not mutually exclusive. This is a perfect example of non-mutually exclusive events. Now, based on this, there are some interesting problem statements that you can solve.
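The two event types just described can be checked by exhaustively enumerating a 52-card deck; the rank and suit labels here are just an illustrative encoding, and the subtraction of the overlap for the non-mutually-exclusive pair anticipates the general addition rule P(A or B) = P(A) + P(B) − P(A and B).

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]   # all 52 cards

def prob(event):
    """P(event) = favourable outcomes / total outcomes."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

is_king = lambda c: c[0] == "K"
is_queen = lambda c: c[0] == "Q"
is_heart = lambda c: c[1] == "hearts"

# Mutually exclusive: a card cannot be a king and a queen at once,
# so P(king or queen) = 4/52 + 4/52 = 8/52.
print(prob(lambda c: is_king(c) or is_queen(c)))  # 2/13

# Not mutually exclusive: the king of hearts sits in both events,
# so P(king or heart) = 4/52 + 13/52 - 1/52 = 16/52.
print(prob(lambda c: is_king(c) or is_heart(c)))  # 4/13
```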
Suppose I toss a coin. My first question: if I toss a coin, which again involves mutually exclusive events, what is the probability of the coin landing on heads or tails? Whenever you get this kind of problem statement, you first need to think about whether the events are mutually exclusive or not. Yes, obviously, heads and tails are mutually exclusive. Now I need to find the probability of getting heads or tails from this event. So let me write the general rule: the probability of A or B, where A and B are events, is equal to the probability of A plus the probability of B. Whenever you have mutually exclusive events, you can apply this rule, which is called the addition rule for mutually exclusive events. Now, what is the probability of A here? You know it is 1 by 2, plus 1 by 2 for B, so the answer will be 1: the probability of heads or tails is exactly one. These are some very, very important things. This segment provides a practical application of the multiplication rule, showcasing how to calculate probabilities for both independent (dice rolls) and dependent (card draws) events. The concept of conditional probability is further elaborated within the context of dependent events. The speaker distinguishes between independent and dependent events, illustrating the concepts with examples of dice rolls and drawing marbles from a bag. The explanation includes the introduction of conditional probability and its relevance to dependent events. This segment clearly explains Type I and Type II errors in the context of hypothesis testing. Using the example of a criminal trial (innocent vs.
guilty), it illustrates the consequences of each type of error: Type I (rejecting a true null hypothesis – convicting an innocent person) and Type II (accepting a false null hypothesis – acquitting a guilty person). The speaker connects these errors to real-world scenarios and decision-making. This segment introduces the fundamental concepts of null and alternate hypotheses in hypothesis testing, illustrating with the example of testing whether a coin is fair or not. It explains the process of setting up the hypotheses, performing experiments, and using the results to either reject or accept the null hypothesis. The speaker visually represents the concept using a normal distribution curve, showing how experimental results relate to the mean and standard deviation. This segment defines the significance value (alpha) and its relationship to the confidence interval. It explains how alpha (e.g., 0.05) translates to a 95% confidence interval, visually demonstrating the area under the normal distribution curve representing this interval. The speaker clarifies the meaning of the confidence interval in the context of accepting or rejecting the null hypothesis. This segment focuses on interpreting experimental results in relation to the confidence interval. It explains that if the experimental result falls within the 95% confidence interval, the null hypothesis is accepted; otherwise, it's rejected. The speaker uses the example of coin tosses to illustrate how the number of heads obtained influences the decision-making process. Now let's discuss something called permutation and combination, a very small topic; I will probably be able to complete it in five minutes. First of all, let's discuss permutation.
Let's say that I have taken some students on a school trip and we have gone to a chocolate factory, where many types of chocolates are made. So I catch hold of a student and say: okay, I'll give you an assignment. Let's say that in this chocolate factory six different types of chocolates are created: Dairy Milk, 5 Star, Milkybar, Eclairs, a plain toffee, and let's say one more category, Dairy Milk Silk. So these six chocolates are there. I give the student this assignment: there are six chocolates being made in this factory; once you enter, write down in your diary the names of the first three chocolates you see, in the order you see them, and then come back to me. So that student went inside the factory. Now, in the first instance, how many different options does this student have for the first chocolate seen? Six, because the first chocolate seen could be any of the six, out of which he writes one name. In the next instance, how many chocolates remain? Five, so he has five options for the second name he writes. Then finally, when he comes to write the third name, he has four options. Now, if I multiply six by five by four, it is nothing but 120.
Now, what is this 120? It is all the possible permutations of chocolate names he may write down: he may see them as Dairy Milk, toffee, Milkybar, or in a different order, Milkybar, toffee, Dairy Milk. All the possible ordered options come to 120. This is what permutation is. How do you write the permutation formula? Let's go back to school days, where we used to memorize all the formulas: nPr equals n factorial divided by (n minus r) factorial. Here n is the total number of chocolates, and r is how many names I told the student to write. So you get 6 factorial divided by (6 minus 3) factorial, which is 6 times 5 times 4 times 3 factorial, divided by 3 factorial. The 3 factorials cancel, so the total answer is 120. That is permutation. Now, where does combination come in, and what is the difference between permutation and combination? Always understand: in permutation, the same elements in a different order count as a different arrangement. In combination, once I have used the set Dairy Milk, toffee, and Eclairs, I cannot re-shuffle it into a different order and call it a new selection; a combination is unique with respect to the elements used, not their order. So for this, there is another formula, which focuses on the uniqueness of the objects you pick: nCr, which is n factorial divided by r factorial times (n minus r) factorial. What is n factorial? 6 factorial. What is r factorial? 3 factorial. And (6 minus 3) factorial is 3 factorial.
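The two formulas can be checked directly with Python's standard library; the factorial form below mirrors the formulas as written (recent Python versions also ship `math.perm` and `math.comb` helpers that do the same thing).

```python
from math import factorial

def nPr(n, r):
    """Ordered selections: n! / (n - r)!"""
    return factorial(n) // factorial(n - r)

def nCr(n, r):
    """Unordered selections: n! / (r! * (n - r)!)"""
    return factorial(n) // (factorial(r) * factorial(n - r))

print(nPr(6, 3))  # 120 ordered ways to list the first three chocolates
print(nCr(6, 3))  # 20 unique sets of three chocolates
```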
So here you will see: 6 factorial is 6 times 5 times 4 times 3 factorial, the 3 factorial cancels with the one in the denominator, and what remains is (6 times 5 times 4) divided by (3 times 2 times 1), which is 120 divided by 6. So you can have 20 unique combinations. Now, the first topic that we are going to discuss next is something called the p-value, a super, super important topic that many people get confused about. Let's take one example. Everybody uses a laptop; let's say this is my laptop and this is my trackpad, with the right button and the left button to click. On the trackpad, you move your fingers over here. Now think: most of the time, when you're moving your fingers, you will be moving them in this central region, not in the corners; you will hardly ever touch a corner. Why am I drawing this? Because this specifies your distribution of touches, and most of the time your distribution of touches will look something like a bell. Understand one thing: why is this central area bulged? It is bulged because most of the time you touch here, and the corner areas are low because you hardly ever touch there. Now let's say the p-value I assign to this central position is 0.8. What does this 0.8 mean? Let's say I touch this trackpad 100 times; then out of those 100 touches, 80 times I touch this specific region. I hope everybody understood this.
Every 100 times I touch this mousepad, the probability of touching this central region is 80 percent. Similarly, suppose the value over here, near a corner, is 0.01: that means out of every 100 touches, I touch there only once. You can consider any region; the broadest central region may have p equal to 0.9, which means that out of every 100 touches, 90 land there, while a corner gets only one. So I hope you are getting an intuition for this kind of probability value: it expresses, for a specific experiment, how likely an outcome in that region is. Now let's go ahead; I'm going to combine multiple topics. The first is hypothesis testing, and within that I am going to combine the confidence interval, the significance value, and many other things. Let's say I am solving a problem. My problem is: I have a coin, and I want to test whether this coin is a fair coin or not by performing 100 tosses. Simple problem statement. Now we are entering inferential statistics, which is very, very important. When do you think a coin is a fair coin? Obviously, when the probability of heads is 0.5 and the probability of tails is 0.5. If you have these two conditions, you will definitely say that the coin is fair. But if you have a Sholay coin, the famous trick coin from the movie, then the probability of heads is 1, and for that kind of coin you will definitely not say that it is a fair coin. Now, in order to test this, I am performing 100 experiments; 100 experiments means 100 tosses. So 100 tosses will be performed.
Now, within these 100 tosses, let's say I'm just focusing on heads. What count should I expect? If the coin is fair, the number of times I should get heads out of 100 tosses is about 50. If I get roughly 50 heads after performing 100 tosses, I can reasonably say the coin is fair. This segment clearly explains the concept of point estimates as a single value estimating a parameter, contrasting it with the range provided by confidence intervals to account for the inherent uncertainty in estimating population parameters from sample data. The explanation is enhanced by a simple example illustrating the difference between sample mean and population mean. This segment introduces the formula for confidence intervals (point estimate ± margin of error) and emphasizes its importance in estimating population means. It highlights the role of margin of error in acknowledging the uncertainty associated with using sample data to estimate population parameters. The discussion also touches upon the factors influencing the choice of formula and the interpretation of results. This segment differentiates between one-tailed and two-tailed hypothesis tests. Using the example of college placement rates, it shows how the phrasing of the research question (e.g., "Is the placement rate different?" vs. "Is the placement rate greater than 85%?") determines whether a one-tailed or two-tailed test is appropriate. The speaker visually represents the difference in the critical regions on the normal distribution curve for each type of test.
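The coin-fairness experiment described above can be written as a two-tailed z-test for a proportion; this is a sketch under the normal approximation, the 1.96 cutoff corresponds to alpha = 0.05 (a 95% confidence interval), and the head counts used are illustrative.

```python
from math import sqrt

def fair_coin_z_test(heads, n=100, p0=0.5, z_crit=1.96):
    """Two-tailed z-test for H0: the coin is fair (P(heads) = p0)."""
    p_hat = heads / n
    standard_error = sqrt(p0 * (1 - p0) / n)  # sqrt(0.25/100) = 0.05 for n = 100
    z = (p_hat - p0) / standard_error
    return "reject H0" if abs(z) > z_crit else "fail to reject H0"

print(fair_coin_z_test(55))  # 55 heads: z = 1.0, inside the 95% interval
print(fair_coin_z_test(70))  # 70 heads: z = 4.0, far outside it
```

Getting 55 heads is consistent with a fair coin; getting 70 heads is evidence against it.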
This segment details the crucial steps in hypothesis testing, explaining the decision rule based on the alpha value and the calculation of the Z-test statistic using the formula (X̄ - μ) / (σ / √n), emphasizing the significance of standard error in large samples. Let me just solve one very simple problem and give it to you; it is not difficult at all. So this is my question: on the quant test of the CAT exam (I hope everybody knows the CAT exam), the population standard deviation is known to be 100. A sample of 25 test-takers has a mean score of 520. The question is: construct a 95 percent confidence interval about the mean. Now, let's see what information is given here. First, the population standard deviation is given: it is 100. What is your sample size n? It is 25. What is your alpha for this confidence interval? 0.05. And what is your mean? The sample mean x̄, which is 520. Is all this information given in the question? Obviously it is. Now my graph looks something like this: my mean is 520, my alpha value is 0.05, so I have 2.5 percent in each tail, and in between is my 95 percent confidence interval. Now I need to find what this range is.
Basically, if I say that I want to construct a 95 percent confidence interval about the mean, from what value to what value will it range? That is what I need to find out. That is my problem statement, and the population standard deviation is given. Why is alpha 0.05? See, I have given the question as a 95 percent confidence interval, right? So alpha is nothing but 1 minus 0.95, which is 0.05. Alpha and the confidence interval are interlinked; very simple. Now, when the population standard deviation is given, we apply a particular test. Which test? Here, I know my interval will be the point estimate plus or minus the margin of error; this is my confidence interval formula. The point estimate is obviously your x̄, and whenever you have the population standard deviation, you apply a Z-test: you write z of alpha by 2, and the margin uses standard deviation divided by root n. This term, standard deviation divided by root n, is called the standard error. A second point is when we should use this formula to find the confidence interval: here you can see I have taken a sample of 25, but usually for a Z-test the sample size should be greater than or equal to 30. I have taken 25 just for the example, so don't fight with me about why I took 25; we still have to do the calculation, and these two conditions, known population standard deviation and a large enough sample, suit this kind of problem statement. For a Z-test, these two conditions generally need to be satisfied. This Z-test statistic is nothing but a Z-score; finding the Z-score
That is what the Z-test is used for. So the full formula for the confidence interval, when the population standard deviation is given and the sample size is at least 30, is x̄ ± z(α/2) · σ/√n. Now let's solve this problem by splitting the equation into two parts. The upper bound of the confidence interval is x̄ + z(0.025) · 100/√25, since α/2 = 0.05/2 = 0.025, σ = 100, and √25 = 5. Similarly, the lower bound is x̄ − z(0.025) · 100/√25. (Now you understand why I took n = 25: the calculation becomes easier. Don't fight with me, guys; I don't have the energy to fight nowadays.) How do I find z(0.025)? Go and open a Z-table in your browser; use one that shows positive values, not only negative ones. Always understand this about the Z-table: the entire area under the curve is 1, so the area to the left of the upper critical value is 1 − 0.025 = 0.975. That 0.975 is what I have to look up in the Z-table.
0.975 corresponds to a specific cell in the Z-table: it sits in the row for 1.9 and the column for 0.06, which means the Z value is 1.96. So that becomes my Z-score: z(0.025) = 1.96. Now go and calculate: 520 ± 1.96 · (100/5) = 520 ± 39.2, so the 95% confidence interval is (480.8, 559.2). If a value does not fall within this range, it falls in the rejection region, and we reject the null hypothesis. Now, the next question we are going to look at: what if the population standard deviation is not given? In that scenario you need to use something called a T-test. Let me show you a very good example, and we will solve that too. Say it is the same question, but the population standard deviation is not given; only the sample standard deviation is. The question: on the quant test of the CAT exam, a sample of 25 test-takers has a mean score of 520 with a standard deviation of 80, where this standard deviation is the sample standard deviation. Construct a 95% confidence interval about the mean. Here you can see the population standard deviation is not given, so in this case I have to use the T-test condition. First, let's see what is given: n = 25, x̄ = 520, the sample standard deviation s = 80, and α = 0.05.
So when you look here, the population standard deviation has not been given, and I can write that down as the condition: when the population standard deviation is not given, we use the T-test. Let's go and compute it. The same formula applies, point estimate ± margin of error, but the margin-of-error formula changes. Instead of writing z(α/2), you write t(α/2), and the standard error uses the sample standard deviation: x̄ ± t(α/2) · s/√n. Substituting, the upper bound is x̄ + t(0.025) · s/√n. First things first: to look up the t value you need to find something called the degrees of freedom, because the t-table asks for it. The degrees-of-freedom formula is n − 1, just like in the sample variance with Bessel's correction, so here it is 25 − 1 = 24. Now I will go to my browser and open a t-table. With respect to degrees of freedom, look at the row for 24; I hope everybody can see it. Then look for the column for 0.025.
That 0.025 column corresponds to a cumulative area of 0.975. Reading along the row for 24 degrees of freedom, the value is 2.064, so t(0.025, 24) = 2.064. Is everybody getting it? Now the next step: x̄ = 520 and s/√n = 80/√25 = 80/5 = 16. The upper bound is 520 + 2.064 · 16 = 553.024, and the lower bound is 520 − 2.064 · 16 = 486.976. So the lower bound of the confidence interval is 486.976 and the upper bound is 553.024. Wow, I have written so much today; with this we have finished confidence intervals. This segment clearly explains the one-sample t-test, highlighting its application when the population standard deviation is unknown. It meticulously outlines the steps involved, including calculating degrees of freedom and establishing the decision rule using the t-distribution. This segment focuses on the calculation of the t-statistic and its interpretation within the context of the decision rule. It demonstrates how to reach a conclusion about the null hypothesis and interpret the results in the context of the problem. This segment details the step-by-step process of conducting a chi-square test, including defining null and alternate hypotheses, determining degrees of freedom, identifying the decision boundary using a chi-square table, and interpreting the results to reject or fail to reject the null hypothesis based on the calculated chi-square value. This segment walks through the step-by-step calculation of a chi-square test, creating tables to organize observed and expected frequencies.
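Both confidence-interval calculations in the walkthrough above can be checked programmatically. Here is a minimal Python sketch (variable names are my own) that uses `scipy.stats` to look up the critical values instead of a printed Z- or t-table:

```python
import math
from scipy import stats

# --- Case 1: population standard deviation known -> Z interval ---
n, x_bar, sigma = 25, 520, 100
alpha = 1 - 0.95                          # 0.05 for a 95% confidence level
z_crit = stats.norm.ppf(1 - alpha / 2)    # area 0.975 -> approx 1.96
se = sigma / math.sqrt(n)                 # standard error = 100 / 5 = 20
z_interval = (x_bar - z_crit * se, x_bar + z_crit * se)
print(z_interval)                         # approx (480.8, 559.2)

# --- Case 2: only the sample standard deviation known -> t interval ---
s = 80
df = n - 1                                # degrees of freedom = 24
t_crit = stats.t.ppf(1 - alpha / 2, df)   # approx 2.064
se_t = s / math.sqrt(n)                   # 80 / 5 = 16
t_interval = (x_bar - t_crit * se_t, x_bar + t_crit * se_t)
print(t_interval)                         # approx (486.98, 553.02)
```

`stats.norm.ppf(0.975)` plays the role of the Z-table lookup, and `stats.t.ppf(0.975, 24)` the t-table lookup.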
It emphasizes the comparison between observed and expected values to determine whether to reject the null hypothesis. This segment provides a concise definition of the chi-square test, emphasizing its use as a non-parametric test for categorical data and its importance in addressing questions about population proportions. It also highlights the practical application of the test in interview settings. This segment presents a real-world problem involving a chi-square test, focusing on analyzing changes in population age distribution over time. It explains the concept of non-parametric tests and their relevance when dealing with population proportions. This segment clarifies the relationship between p-values and significance levels in hypothesis testing. It explains how to interpret p-values in the context of rejecting or failing to reject the null hypothesis, emphasizing the importance of comparing the p-value to the significance level (alpha) to make a decision. This segment introduces the concept of covariance as a method for quantifying the relationship between two variables. It explains how positive, negative, and zero covariance values indicate different types of relationships (positive correlation, negative correlation, and no correlation, respectively) and provides illustrative examples. This segment delves deeper into the interpretation of covariance, explaining how positive and negative covariance values represent different relationships between variables. It also highlights a key limitation of covariance: its scale-dependent nature, which makes it difficult to compare covariances across different datasets. This segment demonstrates a Z-test using Python's `statsmodels` library. It explains how to perform the test, interpret the resulting Z-statistic and p-value, and make a decision about the null hypothesis based on comparing the p-value to a chosen significance level (alpha). The example uses IQ scores to illustrate the process.
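The `statsmodels` Z-test described above can be sketched as follows. The IQ scores here are simulated stand-ins for the video's data, and a null value of 100 is the conventional population mean IQ:

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

# Simulated IQ scores for a hypothetical high-performing group; H0: mean IQ = 100.
rng = np.random.default_rng(42)
iq_scores = rng.normal(loc=120, scale=15, size=50)

z_stat, p_value = ztest(iq_scores, value=100)  # two-sided one-sample Z-test
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

`ztest` returns the Z-statistic and the p-value; the decision rule is simply whether the p-value falls below the chosen alpha.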
This segment presents a real-world application of the t-test, comparing the mean age of a sample of college students to the population mean. The presenter demonstrates the process, interprets the results, and clarifies the decision-making process based on p-values and significance levels, correcting an earlier error in the interpretation. This segment showcases a practical application of the one-sample t-test using Python code. The presenter demonstrates how to perform the test, interpret the p-value, and make decisions based on the results, illustrating the process with multiple iterations and varying sample sizes. This segment provides a clear explanation of Spearman rank correlation, highlighting its use for non-linear relationships, contrasting it with Pearson correlation, and detailing the formula and calculation process with a practical example. The explanation of ranking data is particularly valuable. This segment addresses common confusion surrounding p-values and their interpretation in hypothesis testing. The presenter clarifies the relationship between p-values, significance levels (alpha), and the decision to accept or reject the null hypothesis, correcting previous inconsistencies in explanations. This segment provides a detailed explanation of the relationship between p-values and significance levels (alpha) in hypothesis testing. The presenter uses clear language and examples to explain how to interpret p-values in the context of decision-making, emphasizing the importance of understanding the confidence interval. This segment demonstrates the use of Seaborn in Python for visualizing correlations within a dataset (Iris dataset). The presenter shows how to generate correlation matrices and pair plots to visually inspect relationships between variables.
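The one-sample t-test demo can be sketched with `scipy.stats.ttest_1samp`. The ages below are hypothetical, not the presenter's dataset:

```python
import numpy as np
from scipy import stats

# Hypothetical ages of ten sampled college students; H0: population mean age = 21.
ages = np.array([22, 24, 21, 23, 25, 20, 22, 23, 24, 26])

t_stat, p_value = stats.ttest_1samp(ages, popmean=21)  # two-sided by default
alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

Here the sample mean is 23, giving t = (23 − 21) / (s/√10) ≈ 3.46 with 9 degrees of freedom, so the p-value falls below 0.05 and we reject the null hypothesis.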
This segment demonstrates how to use a Z-table to find the area under the curve corresponding to a calculated Z-score, emphasizing the importance of considering whether the test is one-tailed or two-tailed when determining the p-value and making a decision about the null hypothesis. This segment details the process of determining a p-value from a Z-score, comparing it to the significance level (0.05), and making a decision about rejecting or failing to reject the null hypothesis based on the p-value. The explanation clarifies the decision-making process in hypothesis testing. This segment walks through a complete example of a two-tailed hypothesis test, showing how to calculate the Z-score, determine the critical values, and interpret the results in the context of accepting or rejecting the null hypothesis. The example reinforces the concepts explained earlier. This segment provides a clear explanation of Bernoulli distribution, including its definition, characteristics (two outcomes, probability of success 'p' and failure 'q'), and how to represent it graphically using a probability mass function (PMF). The explanation is concise and easy to follow. This segment introduces the Pareto distribution, its relationship to the power law, and the 80/20 rule. It provides real-world examples and connects it to log-normal distribution, highlighting the practical applications and mathematical relationships between these distributions. This segment provides a clear and concise explanation of the Central Limit Theorem, illustrating how taking multiple samples from any distribution (regardless of its original form) and calculating their means will result in a normal distribution, provided the sample size is sufficiently large (n ≥ 30). The explanation is enhanced by the use of visual examples and a step-by-step approach, making it easy to understand even for those with limited statistical background. 
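The Central Limit Theorem demonstration described above can be reproduced in a few lines of NumPy. The exponential population here is just one example of a skewed starting distribution:

```python
import numpy as np

# Start from a clearly non-normal (right-skewed) population: exponential, mean 2.
rng = np.random.default_rng(0)

# Draw 5,000 samples of size n = 30 and record each sample's mean.
samples = rng.exponential(scale=2.0, size=(5_000, 30))
sample_means = samples.mean(axis=1)

# Per the CLT, the sample means cluster around the population mean (2.0),
# with spread close to sigma / sqrt(n) = 2 / sqrt(30), roughly 0.365,
# and their histogram looks approximately normal despite the skewed source.
print(sample_means.mean())
print(sample_means.std())
```

Plotting `sample_means` as a histogram (e.g. with seaborn's `histplot`) shows the bell shape even though the underlying population is heavily skewed.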
The speaker emphasizes the importance of sample size and the implications for data analysis. The provided text mentions simple random sampling and stratified sampling. Simple random sampling: every member of the population has an equal chance of being selected. Example: picking names from a hat for a survey, or randomly selecting participants for a medical trial. Stratified sampling: the population is divided into non-overlapping groups (strata), and a random sample is taken from each group. Example: dividing a survey population by gender (male/female) to ensure representation from both groups. The text also briefly mentions convenience sampling (selecting readily available participants) and implies other techniques exist depending on the use case.
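As a rough sketch of the difference between simple random and stratified sampling (the population and strata below are made up for illustration):

```python
import random

# Made-up survey frame: 60 males and 40 females.
population = [{"id": i, "gender": "M" if i < 60 else "F"} for i in range(100)]
random.seed(7)

# Simple random sampling: every member has an equal chance of selection.
simple_sample = random.sample(population, k=10)

# Stratified sampling: split into non-overlapping strata (here, by gender),
# then sample each stratum proportionally so both groups are represented.
strata = {"M": [p for p in population if p["gender"] == "M"],
          "F": [p for p in population if p["gender"] == "F"]}
stratified_sample = []
for members in strata.values():
    k = round(10 * len(members) / len(population))  # proportional allocation
    stratified_sample.extend(random.sample(members, k=k))

print(len(stratified_sample))  # 6 males + 4 females = 10
```

The stratified draw guarantees 6 males and 4 females, matching the 60/40 split in the frame, whereas the simple random draw may over- or under-represent either group by chance.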