data science and machine learning is the hottest job of the 21st century, with 
an average salary of a hundred and twenty thousand dollars per year. according to linkedin, 
the data scientist job profile is among the top five jobs in the entire world. 
if you want to foray into the world of data science, you need to have a good command 
of statistics, as it forms the basis of all data science concepts. 
with the help of statistics, you can make predictions such as: new 
york will be hit with multiple tornadoes at the end of this month, or 
the stock market is going to crash by this weekend. now, all of this sounds magical, 
doesn't it? well, to be honest, it's just statistics and not magic, and you don't 
really need a crystal ball to see into the future. so keeping the importance of statistics 
in mind, we have come up with this comprehensive course by dr. abhinanda 
sarkar. dr. abhinanda sarkar has his phd in statistics from stanford 
university. he has taught applied mathematics at the massachusetts institute of 
technology, been on the research staff at ibm, led quality 
engineering development and analytics functions at general electric, and has co-founded 
omix labs. we are uploading this high-quality classroom session 
by dr. abhinanda sarkar from great learning's business analytics and 
business intelligence course, which has been ranked the number one analytics 
program consecutively for the past four years. this tutorial 
will be on youtube for only a limited period of time so that learners across 
the world can have access to high quality content. so please do subscribe 
to great learning's youtube channel and share the video with your peers so that everyone 
can learn from the best. now, without further delay, let's have a quick glance 
at the agenda. we'll start by understanding the difference between statistics 
and machine learning. then we'll go through the different types of statistics, 
which are descriptive, predictive and prescriptive. after 
that, we will understand the different types of data available. going 
ahead, we'll understand the concepts of correlation and covariance comprehensively, 
following which we'll head on to probability and learn how to implement conditional 
probability with bayes' theorem. and finally, we'll look at two types 
of probability distribution: the binomial distribution and the poisson distribution. so 
let's start off with the session. 
what you now need to do is be able to get the data to solve 
this problem. so therefore 
the statistical way of thinking typically says you formulate a problem 
and then you get the data to solve that problem. the machine learning 
way of looking at things typically says: here is the data, tell 
me what that data is telling you. many of my colleagues, and i myself, have run into 
this problem when going through interviews, etc. statisticians 
sometimes say that we're not getting jobs 
out there. so i go to the people who are hiring and ask, 
why don't you hire statisticians? and i reach an interesting conclusion from this 
entire discussion. it sometimes comes down to the way the interviewer 
who's interviewing the statistician for a data scientist job asks the question: here 
is my data, what can you say? and 
the statistician answers with something like, what do you want to know? and the business 
guy says, but that's why i want to hire you. and the statistician says, 
if you don't tell me what you want to know, how do i know what to tell you? and 
this goes round and round, right? no one's happy about this entire process. 
so there's a difference in the way these two communities approach things. 
my job is not to resolve that. because in 
the world that you will face, you'll see a lot more of this 
kind of thinking than you'll see of this. because 
in this world, the data is cheap and the question is expensive, 
and you're paid for asking the question. in 
this world, the question is cheap and the data is expensive, and you're 
paid for collecting the data. so 
sometimes you will be in a situation where this is going to be important. 
for example, let's suppose you're trying to understand who's going 
to buy my product. you're asking the question. let's 
say that my products aren't selling, and you want to find out 
why. what will you do? you'll get 
what data? so let's say that you're selling, i don't know, what do you want to sell? 
watches, say. so let's suppose people aren't buying watches anymore, which is 
a reality, correct? so you're a watch company. who buys watches? the entire 
business model of a watch is disappearing. do you have watches? some of you 
have. actually, a surprising number of 
you have. maybe they do different things these days. 
that seems like a fitness device, not really 
a watch at all. something like this came up 
with my daughter at lunch today. she got something like this. 
my wife, who's an entrepreneur and runs her own company, came back from delhi with two of these. i don't 
know where she picked them up. the first thing my daughter did was take one of these 
and take the device out, because she thought the whole wristband was an unnecessary 
idea. that didn't occur 
to her. i mean, that's a separate thing. it's a nice little beautiful red wristband, 
etc. so that's a different thing. but let's say that you are a watch company. nobody's buying your watches, 
or fewer people are buying your watches. now, how are you going to solve this problem, or how are you 
going to process this information? what do you want to do? what do you want to know? 
remember, i'm asking this question also from an analytical perspective. 
so when you say that to check the model and see what is not so let us use the whole data question. 
so you so first order you see sense. for whom 
and when and how how do you structure your data? how will you how 
will you arrange the problem? okay, 
that makes the problem even harder, because now you're going to look for data that is not with you. no, 
no, he's right. he's right. maybe people are not buying watches because they buy 
something else. that's a reasonable thing. well, 
let's keep the problem simple. let's consider only data that is within. you will go outside, 
not to worry, but let's say that i am looking at my data. what data 
do i want to see, and what questions do i want to ask of it? so 
sales, year by year, by type. and then what comparisons 
do i want to do here? region 
wise. with what purpose? what question am 
i asking the data? what sections of customers 
are buying my product, or what section of customers 
is buying my product compared to what? what 
is my biggest set of customers? so that's also about what's happening: who are my biggest 
customers? okay, that's a very interesting question to ask, except 
that that question implies that i need to know who my biggest set of customers 
could have been. but it's a good point. where 
is the bulk of my sales coming from? then someone else says something about time here: 
is it going down? so you can look at things like 
for which group of customers are my sales going down 
the most. for example, you could ask that. i'm not saying that's the question, 
but that's a possible question to ask. so let's suppose you follow that approach. then i'm 
trying to understand. i know that my sales are going down. that's an obvious 
thing. my ceo is telling me, my cfo is telling me, if we don't stop this we're all going to be out of 
a job. correct? the hmt 
factories in bangalore are not in good shape. one of them, i think, has become 
the income tax office, somewhere in the polish 
forum area. so that's what is going to happen to me if 
i don't do this well. so i know my sales are going down, but i don't know by how much, 
and particularly for whom. so there are segments for which the 
sales are going down. in which segments are sales going down the most? 
in which segments are they going down a little bit? how fast are they going down? i 
can ask questions of that sort. now 
what conclusions do i want to be able to draw at the end of this? how 
do i want to use this information? now, 
for this you usually follow something like a three-step process, and you may have seen 
this. it covers both these sides, and these words should be 
familiar to some extent. the first is called descriptive, the 
second is called predictive, and 
the third is called prescriptive. have 
these words been introduced to you, at least in this context? at least you've read them. i'm sure 
you all cruise the web and look at blogs and things like that. 
nothing new in this, i'm sure, but i just want to set a context, because i'm going to talk a little 
bit about what descriptive is. so, 
descriptive, predictive and prescriptive. now, what is the 
descriptive problem? the descriptive problem is a problem that says: 
describe for me where i'm losing my sales and when i'm losing my sales. 
it just describes the problem for me. it tells me where the problem is. it locates 
it, it isolates it. the 
predictive problem says: look at this data and 
give me an idea as to what might happen, or 
what would happen if i change this, that or the other. so let's suppose i do 
the following kind of thing. i say, let me relate my sales 
to my prices. let me try to understand: if i reduce 
the prices of my watches, will more people buy them? conversely, 
if i make my watches luxury items, increase the price 
of a watch, remove the low-end brands, make a watch 
an aspirational thing, a decorative item, a luxury item, 
a brand item, so that people wear a watch not just to see the time but also as 
a prestige statement, as a fashion statement, whatever it is, if 
i do this, then what will happen? that's predictive. 
i'm trying to predict something based on it. i'm trying to see, if something happens to, let's 
say, one part of my data, what will happen to the other part of my data. and then, based 
on that, the doctor carries out a predictive analysis of you: 
because i see this, i now think you have this issue. you 
have this thing going on. let's say i'm diagnosing you as being pre-diabetic. 
you're not yet diabetic, but you're happily on the way to becoming a diabetic. now, 
because of this, i now have to issue you a prescription. i 
now should tell you what to do. so 
this is the data that comes from you. the data in some way is modeled 
using the domain knowledge that the doctor has, and 
that model is translated into an action. that 
action is designed to do something. typically 
it is designed to do something actually fairly complicated. the 
first thing the doctor tries to do is, let's say, do no harm, the hippocratic oath. first 
let me make sure that i don't do any unnecessary harm to the patient. then 
let me, shall i say, optimize his or her welfare by making sure 
that i control the blood sugar the best i can, that i postpone the onset of diabetes as best 
as i can. it's a complex optimization problem of 
some sort. in a business also, it's a complex optimization problem, right? 
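
A minimal sketch of how the predictive step (relating sales to prices) can feed a prescriptive step (choosing an action) for the watch company. All figures below are invented for illustration, and the straight-line fit and the single profit objective are assumptions made for this example, not the lecturer's method; a real prescription would have to respect many more requirements than profit alone.

```python
import numpy as np

# Hypothetical history (made-up numbers): price per watch and units sold.
prices = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
units = np.array([900.0, 800.0, 700.0, 600.0, 500.0])
unit_cost = 30.0  # assumed cost of making one watch

# Predictive step: fit a straight line, units = slope * price + intercept.
slope, intercept = np.polyfit(prices, units, 1)


def predicted_units(price):
    """Predicted units sold at a given price, under the fitted line."""
    return slope * price + intercept


# Prescriptive step: among candidate prices, pick the one with the
# highest predicted profit = (price - cost) * predicted units.
candidates = np.arange(40.0, 121.0, 5.0)
profits = (candidates - unit_cost) * predicted_units(candidates)
best_price = candidates[np.argmax(profits)]

print("fitted slope:", round(float(slope), 2))
print("price with highest predicted profit:", best_price)
```

The fitted line answers a "what would happen if" question; the search over candidate prices turns that answer into a recommended action.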
i need to be able to sell more watches, but i also need to be able 
to make money doing so. i can increase my sales, but if i increase 
my sales and my profits go down, or my earnings 
go down because of the cost, then that's a problem. but at the 
same time, if i try to run a profitable business and nobody buys my product, that 
also is not a particularly good idea. and there are other 
issues. we've been running the company. i've got employees that i want to keep on the boat. 
how do we run the company in such a way that it maintains that particular labor force? i 
have finances to take care of. i have loans to repay. how do i get the cash 
flow in order to repay the bank loans that i have? so the prescription 
has to meet lots and lots of requirements. if 
you are building an autonomous vehicle, you'll have situations where the car has to do this, but 
it also has to follow certain other rules. for example, if it sees 
someone crossing the road, it should stop. but 
it shouldn't stop very suddenly, because if it stops very suddenly 
it will hurt the car and is also probably going to hurt the driver. so 
it needs to stop, but without stopping too suddenly. 
it has to follow the rules of the road, because 
otherwise the computer will simply say: oh, you want me to avoid the person crossing 
the road? i'm just going to go behind the person. and you have 
to go in and tell the car, please don't do that, because there's a house next to it. you 
can't just do that. oh, you didn't tell me that. you just told me to avoid the person. you didn't tell me about 
the house. okay, we'll put that in as a constraint in our program and 
see how well it goes. so prescription is problematic. another 
simple way of putting it might be to say that description is: how many centuries 
has virat kohli scored? look up cricinfo and it will give you the answer. prediction 
might be: try to guess how many centuries virat kohli will score in the world cup. 
prescription might be: how do we get virat kohli to score more centuries in the world cup? and 
as you can figure out, you're going from a purely data-based version of the problem 
to something that's only notionally about the data. 
data will help you, but there's a lot more than the data when it gets to that. what 
we'll do today, what we'll do now, once i finish talking to you, is 
take a look at the descriptive part of analytics. so 
the descriptive part of analytics is about simply describing 
the data without necessarily trying to build any 
prediction or any models into it. simply telling you 
the way it is. this is hard. this is in itself 
not necessarily an easy thing to do, because you need to know very well how to 
do that, and what are the ways in which one looks at data. this 
is a skill in itself. so for example, let's suppose 
that you, i mean, you're not a doctor, you go to the doctor, and 
the doctor is looking at you, looking at your symptoms, and the doctor recommends a 
blood test. now, how does the doctor know what blood test to 
recommend? based on the symptoms. but 
remember that potentially there's an enormous amount of information in you. 
all of us, as biological things, carry an enormous amount of information, you know, in 
our blood, in our neurons, in our genes, or whatever. if 
you're talking about big data, as i said, there's two meters of dna inside every cell, and there are a few billion 
neurons in your head. you don't need to go far to see big data. you are big 
data. you're one walking example of big data. we all 
are. in that big data, what 
little data does the doctor 
know to see? that's the descriptive analytics problem. the doctor is 
not doing any inference on it. the doctor's not building a conclusion, and the doctor is not building 
any ai system on it. but it's still a hard problem. out of 
the vast amount of data that the doctor could potentially see, 
the doctor needs to know that this is interesting to me, and this is interesting 
to me, and this is interesting to me in this particular way. for 
example, a blood test. let's suppose that i draw 
blood from you for a particular purpose, let's say for blood sugar. okay, leaving 
aside the biology of how much blood to draw, etc. 
are any of you doctors? no doctors in the room, 
so i can say whatever i want. 
i'm old enough that this is a real problem for me. so you 
have a large amount of blood that's flowing through you. 
we all do. this blood carries nutrients. what 
that means is that every time there is a nutrient inflow, the blood looks a little different. 
so if you eat, your blood looks a little different, because that's your blood's 
job. the blood's job is to carry nutrients. if you want to run, you want 
to walk, if i'm walking around, my legs are getting energy from 
somewhere. the energy 
needed by my legs is being carried by the blood, and it is being generated through inputs 
that i get: some of it from the air that i breathe, from where it gets the oxygen 
to burn things, and some of it from the food that i have eaten, the nice lunch that i had, from where 
it gets the calories to do that. so therefore, based on what my energy 
requirements are and based on what i've eaten, my 
blood is not constant. my blood content 
is what is known as a random variable. what's 
random about it? it looks different all 
the time. your blood at twelve o'clock 
at midnight is going to look a little different from 
twelve o'clock at noon, because it's doing something 
a little different. the same phenomena are everywhere. 
if i were to, for example, measure the temperature of the oil in your 
car or in your two-wheeler, what do you think the temperature will be? it 
depends. first of all, it depends on whether the car is running or not. it depends on whether it has run 
or not. it depends on how much oil there is. it depends on how you drive. 
it depends on the temperature of the car. the answer is: it depends. and 
the same is true for your bodily fluids. so 
this becomes a hard problem, because if it is random, then 
from a random quantity, how do i conclude what your blood sugar is? 
how does a doctor reach a conclusion of any sort? an average? 
an average of what? an average over a particular duration? so there 
are multiple averages that you can get. first of all, there is the question of, 
if i take blood from you, how is the blood usually collected? the phlebotomist 
comes in and usually takes it with a needle from one point. let's 
say by some strange accident, 
two different people are drawing blood from two 
hands at the same time. do not try this at home. well, let's suppose 
they do do this. will they get the same blood? ideally, 
yes, at 
the same time. as i say, do not do this at home. but at the same time you are getting two different 
samples. it's not just a question of time. your 
blood is not going to look the same even within your body at one point in 
time. even from the left hand 
and from the right hand at exactly the same time, it is not going to look the same. there 
is a slight problem. you're told that, you know, your 
heart is in the middle. your heart is actually in the middle, but it beats to the 
left. why? because the heart is both 
a pump and a suction device. the pump side is on the left. the suction side 
is on the right. so your blood pours out from your left side and it 
goes back in on the right side. so there is a slight asymmetry in your body between 
left and right. on one side it tends to go out, 
on the other side it tends to come in. it's slight; it all mixes up in the middle. so 
one sampling idea is that i'm taking a sample of blood from 
you, and it is just one sample. the 
second question, as you're saying, is a question of time. so you can average over 
time. if you average over time, at least here you can say i'm going to do this maybe 
before eating, after eating, a while after 
eating. if you have a blood pressure test, sorry, a blood sugar 
test, for example, once they ask you to do it fasting and 
then they ask you to do it some two hours after eating. do 
they tell you what to eat? sometimes 
glucose, sometimes they don't. this is a way to say that, based on what you naturally 
eat, let me figure out how you are processing it. they expect you to eat a typical 
meal and not go and eat, you know, large amounts of kfc, if that is not what 
you normally eat. just eat what you normally eat. if you're vegetarian, eat a normal vegetarian 
meal. eat normal food and then figure it out. let's see how good your body is at sorting it 
out. do a normal thing and i'll take another normal sample. 
then one of you said something very interesting: the average evens things out. but 
what does the averaging do? neutralize. an 
interesting word to use: it neutralizes things. provides 
context. context of what? context is 
a good point. so what is the doctor trying to do? 
so let's simplify things a little bit and say, let's suppose that the doctor has a 
threshold. let's give it a number. let's say the doctor says that 
if your blood sugar is above 140, i'm going to do something; if your blood sugar is 
less than 140, i'm not going to do anything. i don't know whether this is the right number or not, but let's just make 
it up. now the doctor is going to see from 
you a number. it may 
be a single reading. it may be an average. it may be a number of things. how is the doctor 
going to translate what they see 
from you and compare it to the 140? how is that comparison going to 
be made? so 
let's suppose i have just one reading. 
suppose that reading, oh, i don't know, is 135. i've 
just got one reading from you: 135. what does that tell me? 
one argument is simple. let's take a very machine learning, 
computer science view of this: 135 is less than 140. fine. 
but now we say, yeah, but you know what? let's 
say this is 135 and another guy is, say, 120. there should be something 
that says that this 135 is a little bit more trouble than 120, 
closer to the threshold, as he says. so maybe, in other 
words, this threshold isn't quite as 
simple as i thought it was. so 
i can solve this problem in one of two ways. one way to do this is 
to make this 140 a little 
range. this is sometimes called fuzzy logic. in 
other words, the question you're asking becomes fuzzy, not as crisp. 
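
The contrast between a crisp threshold and a fuzzy one can be sketched as follows. The threshold of 140 and the readings 120 and 135 come from the discussion above; the margin of 10 and the linear ramp are assumptions chosen only for illustration, and the function name `concern_level` is hypothetical.

```python
def is_above(reading, threshold=140.0):
    """Crisp rule: a reading is either above the threshold or it is not."""
    return reading > threshold


def concern_level(reading, threshold=140.0, margin=10.0):
    """Fuzzy rule: widen the threshold into a band and grade the reading.

    Returns 0.0 at or below (threshold - margin), 1.0 at or above
    (threshold + margin), and a value in between inside the band.
    The margin is an assumed value, not a clinical one.
    """
    low, high = threshold - margin, threshold + margin
    level = (reading - low) / (high - low)
    return min(1.0, max(0.0, level))


for reading in (120, 135, 157):
    print(reading, is_above(reading), concern_level(reading))
```

Under the crisp rule 120 and 135 look identical, since both are below 140; under the graded rule 135 registers as closer to trouble than 120.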
you're not fiddling with the data, you're fiddling with the boundary. you're fiddling with 
the standard. the other way to do that is 
to create a little uncertainty, or create a little plus-minus, around 
the reading itself, around 135. say this is 135, 
and let's suppose that i go and get another reading, and the second reading that i get 
is, say, 130. and the third reading that i get on 
the day after that is, say, 132. and 
i'll say, okay, seems to be fine. i 
might say that. but let's suppose after 135 the guy 
goes, and i do my usual thing and i measure it again, and this time it 
comes out as 157. and 
i do it again and it comes out as 128. and 
i do it again and it comes out to be 152. so 
in both cases 135 is probably a good number. but in one case the 135 
was varying very little, and in the other case the 135 was varying a lot, which 
gives me different ideas as to how to process it. so 
what descriptive analytics talks about, essentially, is 
trying to understand certain things about data that 
help me get to conclusions of this kind a little more rigorously. now, 
to be able to quantify what these plus-minuses are is 
going to take us a little bit of time, and we will not get there in this residency. we will 
get there next residency. in order 
to say i saw 135, 135 plus-minus something, that question now needs to be 
answered. but to do that, i need to have two particular 
instruments at my disposal. one instrument that i need to have 
at my disposal is to be able to know what to measure. i need to say 
what an error means. i need a statement that says that maybe 
i'm 95% confident that something is happening, i'm 95% sure that this 
is below 140. i need a way to express it, and that is the 
language of probability. so 
what we will do tomorrow is introduce a little bit of the language of probability. 
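
As a purely descriptive first look at the two sets of readings mentioned above, one can compare their averages and how much they vary. This is a minimal sketch: the readings are the ones quoted in the lecture, while using the sample standard deviation as the "plus-minus" is an assumption made here for illustration; the lecture defers the proper quantification to a later session.

```python
import statistics

threshold = 140
steady = [135, 130, 132]        # first scenario from the lecture
erratic = [135, 157, 128, 152]  # second scenario from the lecture

for name, readings in (("steady", steady), ("erratic", erratic)):
    mean = statistics.mean(readings)
    spread = statistics.stdev(readings)  # sample standard deviation
    print(
        f"{name}: mean={mean:.1f}, stdev={spread:.1f}, "
        f"min={min(readings)}, max={max(readings)}, "
        f"readings above {threshold}: {sum(r > threshold for r in readings)}"
    )
```

Both sets start from the same reading of 135, but the description of the second set shows far more variation, which is the kind of information a single number cannot carry.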
it will be unrelated to what we're doing today, so there's going to be a little bit of a disconnect. but 
what we're going to do is create two sets of instruments: one set of instruments that is purely descriptive 
in nature, and one set of instruments which is purely mathematical 
in nature, so that i can put a mathematical statement on top 
of a description. and the reason i need to do that is 
because the pure description is not helping me solve the problem that i have 
set. so 
therefore what will happen is, you will see in certain medical tests, 
you will not see points like this. you will see intervals: your 
numbers should be between this and this. your cholesterol number, your hdl, whatever, 
should be between this and this. you won't see a number, you will see a range 
that typifies the variation. and in certain cases you will see thresholds, 
maybe just a lower limit and an upper limit, but you'll also see a recommendation 
that says please do this again. i'm 
going to compare; i can't compare one number to one number. comparing one 
number to one number is typically a very bad place for any kind of analyst to be in, because 
you've got no idea which is error-prone and where the error is. so 
therefore what happens is you try to improve one of those numbers, 
either by fiddling around with the range or by getting more measurements. 
and you'll see that as we go along a little later. so 
this is a context for what we have 
in terms of 
data. let's see. so this is a set of files that has 
been loaded. it's a very standard set of files; it's 
not mine, to be honest. i just want to make sure that i'm doing what i'm supposed 
to be doing. so for reasons that are more to do with security, my 
understanding is that the notebook will not access your drives, so 
keep it on your desktop and don't complicate life. so 
there is this notebook. it's 
called cardio good fitness, descriptive statistics. the word statistics refers to the idea 
that this comes from the statistical way of thinking, which, as 
i said, as opposed to the machine learning way of thinking, tends to be a little more problem 
first, data next, which means we worry about things like hypotheses 
and populations and sampling and questions like that. and 
the descriptive part refers to the fact that it is not doing any inference. it is not predicting 
anything. it's not prescribing anything. it is simply telling 
you what is there, with respect to certain 
questions that you might possibly ask of it. what 
is the context of the case? the market research team at a company 
is assigned the task of identifying the profile of the typical customer for each treadmill product 
offered by the company. the market research team decides to investigate whether there are differences 
across the product lines with respect to customer characteristics. exactly 
what you guys were suggesting that i should do with respect to the watch: understand 
who does what. entirely logical. the team decides 
to collect data on individuals who purchased a treadmill at a particular store during 
the past three months. like watches, we're now looking at data 
for treadmills. and that is in the file, 
in the csv file. so what you should have is, you 
should have a csv file in the same directory, 
and through the magic of python you don't have to worry about things like the path. before 
we get there, remember, because 
we're looking at this statistically, before we get the data we should have a rough idea as to what we're trying 
to do. and so they say that here are the kinds of data that we are looking at: the 
kinds of products, the gender, the age in years, education in years, relationship status, 
annual household income, the average number of times the customer plans to use 
the treadmill each week, the average number of miles the customer expects to run or walk each 
week, and a self-rated fitness on a scale of 1 to 5, where 1 is in poor shape and 5 is 
in excellent shape. some of this is data, some of this is opinion, some 
of this is opinion masquerading as data, like for example the number 
of times a customer plans to use a treadmill. hopeful, 
wishful thinking. it's still data. you're asking someone, how 
many times will you use it? oh, daily. no 
problem, seven times a week. we'll see. but 
it's still data. it's come from somewhere. so 
what has happened, the way to think about this, is to say that i want to understand 
a certain something, and that certain something 
has to do with the characteristics of customers, customer characteristics. 
and to do this you can either take, 
let's say, a marketing point of view, who buys it, or a product engineering 
point of view, what sells, what kind of product 
should i make, etcetera. in business, as you probably know, for those of you who are, any 
of you entrepreneurs? one 
hand up. there's actually one hand up, and a few closet 
entrepreneurs, from what i could figure out. sometimes it's unclear what that word means. 
in other words, you think you are, or you're not confident enough to call yourself one, or you're doing 
that in the it space. if you are an entrepreneur, for example in the 
physical product space or even in the software space, one of the things you often think about is what's called 
the product-market fit, which is, you're making something; how 
do you match what you can make and what people will buy? 
because if you make something that people do not buy, that doesn't make any sense. on 
the other hand, if you identify what people buy and you can't make it, that also doesn't 
make too much sense. so the conclusion 
that we will draw on this, we will not draw today, but the purpose is to be 
able to go towards conclusions of that kind: either isolate products, or isolate 
customers, and try and figure out what they tell us. pandas 
generally has a fair amount of statistics built into it. 
that's what it was originally built for. numpy 
is something that was built more for mathematical problems than anything else. 
so some of the mathematical algorithms that are needed are there. there are other 
things: plots in matplotlib, seaborn, and many other things that you've seen already. python 
is still figuring out how to arrange these libraries well enough. the, 
shall we say, programming bias sometimes shows through in 
the libraries. i for one do not remotely 
know this well enough to know what to import up front, but in a good session 
you know what to import up front, and you do all this up front so you don't get stuck with what 
you want to do. the naming is up to you. if you like the names as they are, 
then that's fine. you want a standard set of names. so 
when you load the data set, if this is in the path, just this will work: 
the file name dot csv. it's usually smart enough 
to convert excel formats into csv. 
if you have this as an xls file and things like that, it's usually smart enough, but if it isn't, 
then just go in and save the xls file as a csv file 
and operate that way, in case it doesn't do it on its own. 
more often than not what you see is that when 
jupyter sees it, it will read an xls file like a csv 
file, or you go and make the change yourself. 
you can have other statements in it as well. you can change 
the arguments inside it and you can figure out how much to show. what this tells you is the head and the tail 
of the data. this is simply to give you a visualization of what the data 
is. this gives a sense of what variables 
are available to it, and what kinds of variables they 
are. we'll see a little bit of a summary after this, etc. 
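
A minimal sketch of the load, head and tail steps described above. The real treadmill file is not reproduced here, so the rows below are made up, and the column names, the product labels and the variable name `mydata` are assumptions for illustration. The data is read from an in-memory string so that the example runs on its own; with a file sitting in the notebook's directory, the call would take the file name instead. As a side note, `pandas.read_csv` expects delimited text, and Excel workbooks are normally read with `pandas.read_excel`.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file; column names are assumed.
csv_text = """Product,Gender,Age,Education,MaritalStatus,Usage,Fitness,Income,Miles
A,Male,23,14,Single,3,4,30000,110
A,Female,31,16,Partnered,2,3,45000,60
B,Male,28,16,Partnered,4,3,52000,95
B,Female,35,14,Single,3,2,48000,75
C,Male,26,18,Single,5,5,90000,180
C,Female,40,16,Partnered,4,4,85000,150
"""

# With a real file in the same directory this would be a call such as
# pd.read_csv("some_file.csv"), where the file name is a placeholder.
mydata = pd.read_csv(io.StringIO(csv_text))

print(mydata.head(3))   # first three rows
print(mydata.tail(2))   # last two rows
print(mydata.dtypes)    # which columns were read as numbers, which as text
```

The argument to `head` and `tail` controls how many rows are shown, and `dtypes` gives a first indication of which variables are numeric and which are categories.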
so for example, some of these are numbers. income, what is income? 
income is annual household income. that's a number. 
gender, for example, male or female, is a categorical variable. this 
is not entered as a number. it is entered as a text field. 
if you are in excel, for example, right at the top, if you go in, you'll see that it'll tell you how many distinct 
entries there are, how many distinct settings there are. so 
usually what happens at the beginning, when a data frame like this 
is created, is that the 
software knows whether it is talking about a number or whether it is talking 
about categories. the 
second challenge, and you can see one particular challenge here: 
what does this 180 mean? counts. 
why do you think there are so many decimal places here? 14 
years of experience, 16 years of experience. why is it showing 0 0 0 0 0 0? 
yes, it does this because for other numbers 
those decimal places are needed. so what 
any software typically does when it sees data is 
sort of ask, at what granularity do i need to store the data? 
sometimes this is driven by the computer, whether you're 64-bit or 32-bit and things like that. 
but what it does is it means that the data is stored in the data frame to 
certain digits. usually you don't see that you'll see it in this way, 
but sometimes for example when you see include equal to or any and you ask for a full description 
the data comes out in this slightly irritating rain. because 
of something here because let's say the income figure any of that now 
when it recommends when it looks at the descriptions of this, what is 
the description that it is reporting and how does it choose to report out 
the description? this particular situation. so 
let's take a little bit of a closer look at this one thing here. look
at the way it's done here: count, unique, top, frequency,
and then there are certain things here: mean, standard deviation, minimum, 25%, 50%, 75%
and max. when it sees a variable like gender, it
reports out lots and lots of nans. what
does that tell you off the bat? it can't do
that, which means it's not a number. this is
not a number. in other words, if you asked me to find the mean of something
and you're giving me male and female as inputs, i don't know what to do, which
is an entirely reasonable stand to take for any reasonable algorithm, right? it
requires another kind of description for it to work.
the problem with describe is that this syntax is asking for the same description for all of
them, whether it's in significant digits, whether it's in columns, etc.
having chosen this description, it says that that's all that i'm going to give you. but
where it makes sense, let's say for example i look at age, now
for age i've got a hundred eighty observations, and
it is calculating certain descriptions for it. correct.
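a small sketch of what asking for the full description looks like, assuming a made-up five-row table with one numeric column and one text column. the data and the names are hypothetical; the point is only that the numeric rows come back empty for the text column and the categorical rows come back empty for the numeric column.

```python
import io
import pandas as pd

# Made-up rows for illustration; not the course data.
csv_text = """Age,Gender
18,Male
24,Female
26,Male
33,Female
50,Male
"""
mydata = pd.read_csv(io.StringIO(csv_text))

# include="all" asks for the same set of descriptions for every column.
summary = mydata.describe(include="all")
print(summary)

age_mean = summary.loc["mean", "Age"]        # a number
gender_mean = summary.loc["mean", "Gender"]  # NaN: no mean for text
gender_top = summary.loc["top", "Gender"]    # most frequent category
age_unique = summary.loc["unique", "Age"]    # NaN: not reported for numbers
```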
so what are the descriptions it is calculating? let's look at these. it is calculating a description like a
minimum. minimum is what? 18. maximum
is 50. these are easy to understand. then
let's look at something a little more interesting. suppose i want to report one
number, one representative age, for
this data set. this is like asking the question,
how do i get a representative blood sugar number for you? i
can give you a minimum and a maximum, but to get the minimum and the maximum i need
to draw blood many, many times from you. but
let's suppose i want one representative number for you. somebody asks
you what your blood sugar is, and you want to give them one number. similarly,
somebody's looking at this data and asks the question, give me a representative
age. how old is your typical user?
or what age do you want to build it for? or you're
even asking, let's say, a product question. you're a product designer,
a product designer building a treadmill. now,
how do you design a product, those of you who are engineers? based
on the weight. very good. what weight? whose
weight? who's the user? what
is the weight of the user? he's got a good point. as a design engineer i need to know
what weight will be on that treadmill. now,
what is your answer to that question? max?
so there's a question of saying that if i want to describe a variable by one
number, how should i even frame that question? what
makes sense? what is the one number? average? no?
max? in this particular case you might argue the max is the number, because
i want to be able to say that if i can support the max i can
support anyone. but there's also a downside to that. i've now engineered
that product so that you could argue that i have, shall i say, over-engineered that product.
i'm sorry. okay. so
let's suppose that you're doing this for a mattress. you all sleep
on mattresses. we're all relatively wealthy, based on the fact that we are here, so
we probably sleep on a mattress. not everyone is fortunate enough to sleep
on a mattress. but let's suppose you do sleep on a mattress. how much weight should that mattress be designed
to bear? if you over-engineer it, what will happen is that
for a reasonable weight, let's say a weight a lot below that, the
mattress is not going to sink. let's say that you design it for a
hundred kilos. now if you are 50 kilos or 60 kilos that mattress
is not going to sink for you. it is going to be comfortable for someone who is
a hundred kilos. someone who's 50 kilos is just
going to bounce on it. you're not going to feel the softness
or whatever it is you want to feel from the mattress. it won't work. so
what to do? that's a hard problem.
the description too is a hard problem. who do i engineer it for? and
so therefore people have different ranges of
what they mean to represent it. so here's one version of
it. this is what is called a 5-point summary. i
report out the minimum, the 25% point, the 50%
point, the 75% point and the maximum.
variable by variable, i report five numbers. i
report the lowest. what does 25% mean? 25%
of my data set, of the people, are younger than 24.
the youngest is 18. 25%, or a quarter of them,
are between 18 and 24, a quarter between 24
and 26, a quarter between 26 and 33,
and a quarter are between 33 and 50. this
is what is known as a distribution. statisticians
love distributions. they capture the variability in the data and you can do all kinds of
things with it. so i'm going to draw a typical shape of a distribution. we
will make more sense of it later on. this is a theoretical distribution.
for example, let's say it has a minimum, it has
a maximum, it has a 25% point, it has
a fifty percent point, it has a 75% point. in
terms of probabilities, this is 25% here, 25%
here, 25% here, 25% here.
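a minimal sketch of the five-point summary using only the standard library. the nine ages are made up so that the quartiles land on the lecture's 24, 26 and 33; the "inclusive" method is an assumption chosen because it interpolates between sorted observations.

```python
import statistics

# Hypothetical ages, chosen so the cut points are easy to read off.
ages = [18, 21, 24, 25, 26, 29, 33, 41, 50]

# n=4 asks for the three cut points that split the sorted data into quarters.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")

# minimum, 25% point, 50% point, 75% point, maximum
five_point = (min(ages), q1, median, q3, max(ages))
print(five_point)
```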
if you want to think in terms of pure description, this is not a probability, it is just a proportion.
if you want to think in terms of probabilities, what this means is that out of a hundred
eighty people, if i draw one person at
random, there's
a 25% chance that that person's age is going to be below,
we know, 24. so age 24, correct. if you
want to think in terms of probabilities, we'll do that tomorrow. but
this is a description. so what this description does is
it gives you an idea as to what value
to use in which situation. so for example,
you could say, i'm going to use 26 as
my representative age. if
i do that, what is the logic i'm using?
this 50% point, so to speak,
is called the median.
median
means the age of the average person.
sort them, pick the middle person and ask,
how old are you? that is the age of
the average person. i could also ask
for the average age of the person, which
is what? which is the mean, which
is one over n times x1 plus
and so on up to xn. now, this is algebra. what
you have to do is put n equal to 180. this
is the first age, the second age, the third age, up to the 180th.
one by one eighty times age one plus
and so on up to age 180. this is called the mean.
this value is what? twenty eight point seven nine. the
average age is about 28 years, or twenty eight and a half years, twenty
eight point eight years. the age of the average person is 26.
yes, that's the median, and that's the
difference between the two. so
i described the median as the age of the average person,
and i described the mean as the average age of a person. so
he's looking at me like, you have to be kidding me, that's confusing.
i admit to it. the easy way to understand it could be this. what
is the mean? add them all up, divide by how many there are. what
is the median? sort them from the smallest to the largest,
pick the middle. if there's an even number,
what do you do? you take the average of the two middle ones.
if they're the same it will be the same number. if they're not it will be a number between
them. so sometimes the median may show up with a point five or something
like that for that reason, even if the values are integers, when
there is an even number of them. now,
which do you think is better? you're
giving the answer, it depends. you figured out that i like that answer.
they both make sense. they both make sense. it depends on what context
you're going to use it for. in certain cases, yes.
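a short sketch of the two recipes just described, on made-up ages. the second list has an even number of values, so the median is the average of the two middle ones and shows up with a point five.

```python
import statistics

ages_odd = [18, 24, 26, 33, 50]        # hypothetical, five values
ages_even = [18, 24, 26, 29, 33, 50]   # hypothetical, six values

# mean: add them all up, divide by how many there are
mean_by_hand = sum(ages_odd) / len(ages_odd)

# median: sort, then pick the middle one
median_odd = statistics.median(ages_odd)

# even count: average of the two middle values, here 26 and 29
median_even = statistics.median(ages_even)

print(mean_by_hand, median_odd, median_even)
```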
okay, if you're talking in terms of parameters. so he used an interesting term.
he's asking, what is the parameter i'm after? parameter is an interesting word. a parameter
refers to something that is generating the population. it's an unknown thing that
i'm trying to get at. for example, blood sugar is a parameter. it
exists, but i don't know it. i'm trying to get
a handle on it. correct. so if i'm thinking in terms
of parameters, then these
are different parameters. so let's look at a distribution here.
i'm not sure whether this is picking things up. i hope so. so
the median is a parameter such
that on this side i have fifty percent
and on this side i have fifty percent. this
is the median. the
mean is what is called the first moment. what
that means is, think of this as a plate of metal, and
i want to balance it on something. where do i put my finger so that
it balances? it is the cg of the data, the center of gravity of
the data. you can understand the difference
between these two now. if, for example, i push the data out
to the right, what happens to the median? nothing happens to the
median, because the 50-50 split remains the same. but if i push the data
out to the right, the mean will change. it will move to the right. it's the lever
principle: if there's more weight on one side, i have to move my finger
in order to counterbalance that weight. so these are two different parameters.
if the distribution, for example, is what is called symmetric, and symmetric means it
looks the same on the left as on the right, then these two will be
equal, because the idea of going half to the left and half to the right will
be the same as the idea of where do i balance, because the left is equal to the right. so
when the mean is not equal to the median, that's a signal that the left is not equal to
the right. and when the mean is a little more than the median
it says that there is some data that has been pushed to the right. and
that should be something that you can guess here, because the mean and the median to some
extent are what, 24, 26, et cetera. the lowest is 18. that's
about six years, eight years less than that. but what is the maximum?
50. that's 25 years beyond. the data is pushed to the right.
the technical term is right skewed. there
are, shall i say, more people who are not average
on the older side than on the younger side. there
was a hand up somewhere. yes. yes.
so therefore one reason that the median often doesn't move is because
it is not that sensitive to outliers. so let's suppose, for
example, we look at us and we ask ourselves, what is our mean
income or our median income? each of us makes a certain
amount of money. we can sort that and so on and put that in. now, let's suppose that
mr. mukesh ambani walks into the room. now what is going to happen
to these numbers? he
alone probably makes a very large multiple of all our incomes put together, possibly,
so the mean will jump. i don't know how much you make. i know how much i make. but
what's going to happen to the median? it's going to stay
almost the same. the typical person may move by at most half a position.
what is the typical person going to be? the typical person is going to be an actual individual in the room, or maybe an average
of two individuals in the room, and that person is not going to change.
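a minimal sketch of this thought experiment. all the income figures are invented, including the very large one; the only point is how differently the mean and the median react when one extreme value walks into the room.

```python
import statistics

# Hypothetical incomes of the people already in the room.
incomes = [40_000, 45_000, 50_000, 55_000, 60_000]

# One extremely large, made-up income joins them.
with_outlier = incomes + [1_000_000_000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)  # average of two people in the room

print(mean_before, mean_after)      # the mean is dragged far to the right
print(median_before, median_after)  # the median barely moves
```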
yes. yes, that's one
conclusion we can draw from this. there are other plots below which will also show the same thing. you
won't be able to draw that conclusion fully yet, for a good logical reason: i haven't shown you the full data.
we'll see the histogram and do that, so hold on to that question. the conclusion
that was drawn is this. there are two things to
see here. one is, if i simply look at this without
seeing any more graphics, where is the middle of the data from a median perspective?
at 26. good. now from 26,
look at the difference between 26 and the smallest, 18.
between 18 and 26, that's eight years. this
eight years contains 90 observations, because there are a hundred eighty in total.
now what is on the opposite side of this? 26 to 50.
that's how many years? 24 years. this 24
years now contains how many observations? the same 90. so
there are 90 observations between 18
and 26 and 90 observations between 26
and 50. so if i were to draw a picture,
what would that picture look like? yes, exactly
as you are drawing it, right. this
usually by definition is called skewed. is this
left skewed or right skewed?
as a word, it's called right skewed. more data
to the right. sorry, more data is a dangerous word
here. no, there's the same number of observations. i'll say the data is pushed
to the right, or there is more
variation on the right side. that is probably a safer way of putting it. yes.
so skewness is measured in various ways. one measure of skewness is typically,
for example, mean minus median. if mean minus median is positive,
it usually corresponds to right skewness. if mean minus median is negative, it usually
corresponds to left skewness. this is a statistical
rule of thumb, but sometimes it is used as a definition for skewness.
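a small sketch of the mean minus median check, on made-up ages and on their mirror image. the helper name `mean_minus_median` is hypothetical, and this is the rough rule from the lecture, not a formal skewness coefficient.

```python
import statistics

def mean_minus_median(values):
    """Rough skewness indicator: positive suggests right skew, negative left skew."""
    return statistics.mean(values) - statistics.median(values)

# Hypothetical ages with a long tail on the older side.
ages = [18, 21, 24, 25, 26, 29, 33, 41, 50]

# The same shape reflected, so the long tail is on the lower side.
mirrored = [68 - a for a in ages]

right_gap = mean_minus_median(ages)      # mean sits above the median
left_gap = mean_minus_median(mirrored)   # mean sits below the median

print(right_gap, left_gap)
```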
there are many definitions for skewness. skewed data sometimes causes
difficulties in analysis, because what happens is the idea of variation
changes: variation on one side means something
really different from variation on the other side. by the way, what's happening
to you with respect to things like books? are you getting books or not getting books? i have no idea what
the books are. you got one book, which is what?
which is the statistics book? okay, i'll take a look at that book later. so
this book, right? okay, show me the book. okay.
comment one: a very nice book. comment two: not a python book. right,
that doesn't make it a bad book. so if you're looking for help
on how to code things up, this is not the book. get
a book like think stats or something like that. but if
you want to understand the statistics side of it, it is an excellent book. everything that i'm talking
about is going to be here. i might talk about which chapters and things like that at
some point, and i might talk about how to use
the book. so for example, at the back of this book there are tables.
there are tables at the back of this book, which we'll learn how to use,
and then i'll try to convince you that you shouldn't use them. but
remember, many of these methods are designed for situations in
which either you don't have access to computers,
or if you do have access to computers, you don't have them, shall we
say, at runtime, in other words when i want to run the application.
i can build a model using a computer, but i can't run it with one. the
runtime environment for statistics is often one where there are no computers around.
the build environment can include computers but the runtime environment can't. a lot of statistics is done
under that kind of situation, even now, probably. yes,
very much so, very much so. okay,
so for definitions of skewness and things like that, do it in the way you
usually use a book, which means you go to the index and see if the word is there,
then you go back and figure it out, and we'll give you some ideas as to how that works. it's a nice book.
it is one of the best books that you have in business statistics, but it's not necessarily
a book that will tell you how to code things up. that is not a deficiency of the book.
not every book can do things of that sort. there are other books around that will tell you how to code
things up, but will not explain what you are doing. it's
important to know what you are doing. it is also important to know why you're doing it. but
books can't often be written with everything in them, i guess. the
thinking is here. i think this is good for thinking. i would actually recommend this book
on the thinking side. yes.
yes, and that answer i think is very, very good. here, where you won't
get help is that it will say do this and it won't give you the python syntax to do it.
that will not be here. so you have to solve that through some other means. i
used to have a colleague in corporate life who had a very big sticker on his board.
it said google search is not research. now
nobody agrees with him anymore. so i suppose that
when in doubt you do what homo sapiens normally do today, which is you google for an answer.
so one possibility is that you understand
something from a book such as this, and if you want to understand the syntax, you google for the term, say
python and that term, whatever, and it'll probably give you the code. things are very
well organized these days. there's
also the question, and i should give you a very slight warning here, not
to discourage you from anything. in the next
nine months or thereabouts, the duration of your program, there's going to be
a fair amount of material that will be thrown at you. correct.
the look and feel will sometimes be like what we would often call at mit drinking from
a fire hose. you can if you want
to, but you will get very wet. so
therefore pick your battles. if you want to understand
the statistics side of it, please, please go into the depth of it. but
if you try to get into equal depth on every topic that you want to learn, that
will take up a lot of your professional time. now
the reason we do the statistics first is, one, it's a little easier
from a computational perspective, although harder from a conceptual perspective. so we begin
it this way. but hold on to that idea, and then as you keep going
see if this is something that you want to learn more on, and if so you're welcome to.
just write to us, let us know, and we'll
get the references to you. but for the first residency, please read
the book and see what happens if there are doubts. yes, it's a well-written book,
and since its instructor is one of our colleagues here, you know, if you want,
he can also help explain things. so
this is a summary. what did the summary tell you? this summary
gives you what's called the five numbers, five numbers that help
you describe the data: minimum, 25, 50, 75, max.
we'll see another, graphical description of this. it also described for you a
mean. there is also another number here, and
this number is indicated by the letters std.
std refers to standard
deviation. and
what is the formula for a standard deviation? std
is equal to the square root of,
well, it's a bit of a mess, but take it in steps.
step 1: calculate the average. step
2: take the distance from the average for every observation.
ask the question, how far is every data point from the middle?
if it is very far from the middle, say that the deviation is more. if it
is not far from the middle, say the deviation is less. deviation is being used
as a synonym for variation. variation can be more,
variation can be less, more than the average or less than the average.
if someone is much older than average, there's variation. if someone
is much younger than average, there is variation. so therefore
both of these are variation. so what i do is, when i take the difference from the average, i
square it, so more than x bar becomes positive and less than x
bar also becomes positive. then i add them up
and average them. there's a small question as to why it is n minus 1, and
that is because i'm taking a difference from an average that is itself
calculated from the data. now i have squared. when
i squared, my original unit was age, so when
i have squared, this has become age squared. so i take the square
root in order to get my measure back into the scale of years. so the standard deviation
is a measure of how far a typical observation is
from the average. it is a standard deviation,
where a deviation is how far from the average you are, and
because of the squaring you need to work with a square
root. in modern machine learning
people sometimes use something called a mean absolute deviation,
mad, very optimistically called. so with mad
you don't take a square, you take an absolute value, and
then you do not have a square root outside it. and that is sometimes
used as a measure of how much variability there is. so
why is it, why do we square it?
because we want to look at both positive and negative deviations.
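the steps above can be sketched directly, on five made-up ages. it also shows why the squaring is there: the raw deviations add up to zero, so without the square they would neutralize each other.

```python
import math
import statistics

ages = [18, 24, 26, 33, 50]  # hypothetical ages
n = len(ages)

# step 1: calculate the average
x_bar = sum(ages) / n

# step 2: distance of every observation from the average
deviations = [x - x_bar for x in ages]
raw_total = sum(deviations)  # positives and negatives cancel out

# square so both sides count as variation, then average with n - 1
squared = [d * d for d in deviations]
variance = sum(squared) / (n - 1)  # units: years squared

# square root brings the measure back to years
std = math.sqrt(variance)

print(raw_total, std, statistics.stdev(ages))
```

`statistics.stdev` uses the same n minus 1 divisor, so the two printed standard deviations should agree.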
if i didn't square it, what would happen is it would cancel out. what was the word
that one of you used? neutralize, right? i love that term. your
positive deviations would neutralize your negative deviations. the squared number
is going to be positive. so let's look at the first number here. so
if i look at the head command here, when i did the head command here, what is the head?
what did the head command give me? the first few observations. now this is an eighteen-year-old.
this is probably sorted by age. this is an 18 year old, correct. now i'm
trying to explain the variability of this data with respect to this 18 year old.
why is there variation? this 18
is not the same as 28, and 18 is less than 28.
so what i want to do is i want to go 18 minus 28.7.
what i'm interested in is this 10, this
10-year difference between the two. now
the oldest person in this data set is how old? 50. when
i get to that row, this 50 will also differ from this 28, by
22 years. so i'm interested in that 10,
and i'm interested in the 22. i'm not interested in the minus 10
or a minus 22. i can
do that in other ways. what i can do is i can represent
18 minus 28 as 10 and i can represent twenty eight minus
fifty as 22, and that is this, as i said: one over
n minus 1 times absolute x1 minus x bar plus
and so on plus absolute xn minus x bar. that is this, with
n minus 1. and this is what is called mean
absolute deviation, and many machine learning algorithms
use this. you are correct, in
today's world this is simpler. now
when standard deviations came up first, this was actually harder,
but people did argue about this. i think
a hundred fifty years ago, maybe more, i forget my history, there were two famous mathematicians,
one named gauss and one named laplace, who
argued as to whether to use this or whether to use this.
laplace said you should use this, the absolute value, and gauss said you should use this, the square.
now the reason gauss won was simply because gauss found
it easy to do calculations. why is this
easy to calculate with? because calculus had come along,
you know, a century or so before that. and so for example, let's suppose that
you want to minimize variability, which is something that we often need
to do in analytics, which means you need to minimize things like the standard deviation,
which means you need to differentiate this function. the square
function is differentiable. you can minimize it using calculus. this one is not.
so therefore what happened was that gauss could do calculations, but
laplace could not, and laplace lost, and
gauss won the definition of the standard deviation.
the 25 percent or 75? okay,
okay, why do we not do that? so today
this entire argument makes no sense, because today how do
we minimize anything? with a computer
program. you don't use any calculus. you
run fmin or something of that sort. you run a program to do it. so
therefore the argument is gone. you can do calculations equally well with
this as with that. so today what is happening is that laplace's
way of thinking is being used more and more. this one is a lot less sensitive to outliers.
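a rough sketch of the sensitivity point, on made-up ages with one observation far from the rest. the mean absolute deviation here divides by n, which is an assumption (the board version in the lecture used n minus 1); the comparison of shares does not depend on that choice.

```python
import math
import statistics

ages = [24, 25, 26, 27, 28, 50]  # hypothetical; 50 is the far-away value
n = len(ages)
x_bar = sum(ages) / n

abs_devs = [abs(x - x_bar) for x in ages]   # laplace's way
sq_devs = [(x - x_bar) ** 2 for x in ages]  # gauss's way

mad = sum(abs_devs) / n                   # mean absolute deviation (divisor n assumed)
std = math.sqrt(sum(sq_devs) / (n - 1))   # sample standard deviation

# How much of the total comes from the single largest deviation?
share_abs = max(abs_devs) / sum(abs_devs)
share_sq = max(sq_devs) / sum(sq_devs)

print(mad, std)
print(share_abs, share_sq)  # the squared version is dominated by the outlier
```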
this one, what it does is, if an observation is far away, the 22 squares
to 484 or something like that, which is a large number. so
the standard deviation is often driven by very large deviations.
the larger the deviation, the more it blows up. and
so therefore this is often heavily criticized. if you read, for example, the finance literature,
there is an author called taleb, nassim taleb. he wrote books called the black swan and
fooled by randomness, and he heavily criticized the standard deviation as a measure of anything.
so today this argument doesn't make a great deal of sense, and when in practice
something like this makes sense, it's often used. so
a lot of this is historical. it looks this
way because of a certain historical
definition, and then it is
hard to change. so today, you know, centuries
after gauss, people like me are trying to explain it and having trouble doing it,
because there was a logic to it, and that logic doesn't hold at all anymore now. yes?
[inaudible question]
how far, on the average, is
an observation from the average. a confusing statement again. he's going
to be unhappy. but how far on the average is an observation
from the average? if that answer is 0, that means everything is at the average.
but you're asking the question, how far from the average is an observation,
on the average? if i take your blood pressure, how far from your average
blood pressure is this reading? if this is exactly
equal, then i don't need to worry about variability. every time i measure blood pressure i'll see the same thing. what
is your average bank balance? don't tell me that, but
you know what i mean. you have an average bank balance. your
bank account manager or your bank actually tracks what your average bank
balance is. but your
balance is almost never, or very, very rarely, equal to your average
bank balance. it's more and it's less. how
much more, how much less, is something that the bank is also interested in, in order
to try and figure out, you know, how much of your money, so to speak, to lend out.
the bank is going to make money by lending it out, correct? but when it lends it out
it can't give it to you. so it makes an assessment as to how much
money to keep. i don't want to teach finance now, but you get the drift.
so therefore it is a measure of that. it is not the only measure of that.
so for example, here's another measure. so remember this 25
number and the 75 number that you were asking about. let's say that
i calculate a number that looks like this: 33
minus 24. so 33
minus 24. let's say this is my 24 and this is my 33. between
these, how much data lies? fifty percent.
why? because this is 25% and this is 25%,
so this now contains 50%. this is sometimes
called the interquartile range. interquartile
range, big word, i know.
why is it called an interquartile range? the reason is because
sometimes this is called q3 and this is called q1.
q3 stands for upper quartile. you can understand
quartile, quarter. so this is the upper
quartile and this is the lower quartile,
and the difference between the upper quartile and the lower quartile is sometimes called the interquartile range.
why is it called a range? because what is the actual range of the data?
the range of the data in this particular case is 50 minus 18.
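a minimal sketch of the two dispersion measures, on the same kind of made-up ages (chosen so q1 is 24 and q3 is 33, as in the lecture's summary). the "inclusive" quantile method is an assumption.

```python
import statistics

ages = [18, 21, 24, 25, 26, 29, 33, 41, 50]  # hypothetical ages

q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")

iqr = q3 - q1                       # interquartile range: upper minus lower quartile
data_range = max(ages) - min(ages)  # range: maximum minus minimum

print(iqr, data_range)
```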
and fifty minus 18, which is your max minus your min,
is sometimes simply called the range. range
is maximum minus minimum, interquartile range is upper
quartile minus lower quartile, and these measures are
used. they do see certain uses based on
certain applications. you can see certain advantages
to this. for example, let's suppose that i calculate my five point
summary. with my five point summary i can now give you a measure of location,
which is my median, and i can give you two measures of dispersion,
which are my interquartile range and my range. so
those five numbers have now been twisted to
give me a summary number, which is the median, and
a range number. interestingly,
i can also draw mental conclusions from that. for example, i can
draw conclusions from these five numbers in the following way: 24
and 33. half my customers are between
24 and 33. so
if i want to deal with half my customers, i
need to be able to deal with a range of about nine years. this nine
years is all that i'm interested in, to get this straight. so
if i'm building my machine, i'm going to make
sure, let's say, that the 33 year old is okay with this and the
24 year old is okay with this. will the 50 year old be
okay with this? maybe, maybe not. i
might get the 50 year old okay with this and have trouble with the 18 year old. so
i can do a lot with even these five numbers. we'll
see more descriptive statistics as we go along. by
the way, this is only for each i can do this for you know usage. i can 
do this for fitness. i can do this for income. i can 
do this for miles income is interesting. here is the median income. $50,000 
and the mean income about 53,000 dollars. if you see income in 
almost all real cases, the mean income is going to be more than the median 
income the per capita income of india is 
more than the income of the typical indian say, what does 
this command do if i say my data dot info what 
this is doing is my data. first of all is a data frame that i've created just to review i read 
the pdf file this this is a described. 
and this here is info now describing the info in 
english language are similar things description information. 
this is interpreted in the software has two completely different things information 
is like your variable settings, like your integer field or your real field. settings
like that give you information on the data as data. the word data means different things
to different people. to a statistician, what does data mean? numbers. to an it professional,
what does data mean? bytes, information. you know, i've lost my data: i don't particularly
care what the data is, just replace my data. so this is that kind of information. it tells you
about the data as an object. the description is int64, an integer stored in 64 bits, or it
is object, which tells you about numeric versus categorical. it tells you about the kind of
data that's available in the fields; in other words, there are objects, integer fields, etc.
there are so many integer types, which are stored at 64 bits because this computer works at
64 bits, and there are three categorical variables. this is, shall we say, a data object
summary of what is there in that data frame, not a statistical summary. it is useful in its
own way, particularly if you're processing it and storing it. for those of you who are going
to go into data curation kinds of careers, this kind of a database is a nightmare, because
typically what happens is that when you store real data, in addition to the data you often
store what's called a data dictionary. sometimes that's referred to as metadata, data about
the data, because simply storing a bunch of numbers is not enough; you have to say what the
numbers are about. this adds a layer of complexity to the metadata. you now have to store not
only what the variable is about but what kind of a variable it is. so what many professional
organizations say is that archival data should never be a mixture of both numerical and
categorical objects, and they pay a price for that: numerical things should become categorical
or categorical things become numerical. but if you are storing large volumes of it and
archiving it and making it available for people who have not seen it before, it sometimes gets
convenient. so therefore views like this are often useful to see how
big a problem you have now.
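
a minimal sketch of the kind of data object summary described above, assuming pandas. the data
frame, its column names and its values are made up for illustration and are not the course data.
info reports each column's storage type rather than any statistics; text columns typically show
up as object (or as a dedicated string type in newer pandas versions).

```python
import pandas as pd

# hypothetical data frame, made up for illustration (not the course data)
mydata = pd.DataFrame({
    "product": ["195", "498", "798"],
    "age": [18, 25, 33],
    "gender": ["male", "female", "male"],
    "marital_status": ["single", "partnered", "partnered"],
    "income": [29562, 50028, 64809],
})

# a summary of the data as an object: column names, non-null counts, storage types
mydata.info()

# the same type information as a plain column -> dtype listing
print(mydata.dtypes)

# how many columns are stored as numbers and how many are not
n_numeric = mydata.select_dtypes(include="number").shape[1]
n_other = mydata.shape[1] - n_numeric
print(n_numeric, "numeric columns,", n_other, "non-numeric columns")
```

a listing like this is one cheap way to check what a data dictionary would have to record about
each variable.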
i want to plot a few things. you can plot anything; more of
that is coming later. but this plot
is from the matplotlib library, and
it is plotted through a command called hist. hist
means histogram, which you have already seen. you've covered
histograms, right? i think you've seen histograms. so this is a histogram.
now, hist has a syntax: it has bin
sizes and figure sizes. so what you can do is play
around with these and see the differences in what the histogram does. but
there is a certain default that shows up, and the default is quite good. and here
is a histogram, the distribution of the age. this is not a set of numbers. this
is a picture. what does this picture have? this picture
has a set of bins, and it has a count within each bin.
between these two numbers, between say 10 and
whatever this is, let's say 22 or thereabouts, i have a count of, let's say,
17. so it gives a count, and
it does this by getting a sense of how many bins there are and plotting
this shape. it is a little bit of an art to write a histogram program.
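
a minimal sketch of what sits behind such a picture: equal-width bins and a count in each bin,
written in plain python. the ages and the choice of five bins are made up for illustration; a
plotting library chooses its own default bins, so this is not a reproduction of the plot above.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical, so the bin width is positive.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # the maximum value goes into the last bin rather than one past the end
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    edges = [lo + k * width for k in range(n_bins + 1)]
    return edges, counts


# hypothetical ages, made up for illustration (not the course data)
ages = [18, 19, 21, 22, 23, 23, 24, 25, 25, 26, 28, 30, 33, 35, 38, 40, 44, 48]
edges, counts = histogram_counts(ages, n_bins=5)

# a text version of the picture: one row per bin, one mark per observation
for left, right, c in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * c} ({c})")
```

changing n_bins changes the shape that comes out, which is the sense in which binning is a bit of
an art.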
there's a python book out there in which
roughly the first one third of the book is how to write histogram
code. it's a wonderful book, but because it uses this example
it got terrible reviews. reviewers said, why do i want to learn how
to code a histogram? and the book's author says, i'm teaching
you how to write code, and a histogram is an example of how
to do that. and i tend to agree. if you want to test
your understanding of data and your understanding of any programming language
and any visualization language, code a histogram in it, and
have fun. so it's a nice
challenge from many perspectives: the data challenge, the language challenge, the visualization
challenge, all of it. yes,
companies do that; they want archival data to be of only one data form,
only one format. why is that so? because,
as i said, when you store data, how do you store it? let's say that
you've generated an analysis. the analysis is done, correct, and you've
decided not to destroy the data. you're going to keep the data
in your company's databases or in your own database. how will you keep it? you can take a technology.
let's pick an example: sql, excel, whatever.
let's say i keep it in excel. now if i keep it
in excel, what will i do? let's say i have an excel spreadsheet with my cardio
data set, this data set. now in addition to the data, what
do i need to store with it? yes,
so one possibility is i can have a text file like the one i had at the top of this,
describing all of this, which is typically what happens in such storage.
there's one file called dot data and another
file called dot describe, or something of that sort, which describes the variables,
and the idea is that they have the same name: one extension gives you the data and
the other extension gives you the description of the variables that are
in this data. correct. now this is good. now
what's going to happen? on that data certain code is going to be
run. that code is going to assume certain things
about the data. what do you want that code to assume
about that data? whatever you want that code to assume about that
data should be available in the data dictionary. now
if that code is stable enough to say that whatever field you give me, i will run on it,
that's cool. but suppose that code requires you to know what kind
of data is being used, let's say discrete data or continuous data. in
the future you'll be doing things like linear regression and logistic regression. linear
regression will make sense if the variable is a number; logistic regression will make sense if
the variable is a 0 or a 1. if
you have that problem, then in the metadata you need to be able to tell not only
what business information this variable contains but also what
kind of a computational object it is, so
the code can run. so therefore what people often say is, i'm going to make
it very simple and i'm going to assume that my entire data frame consists of only one kind of
variable, so that when i run any algorithm on it, i know exactly
what kind of data input that algorithm is going to get. what i'm giving you is a practical
answer that many companies often have. i worked in a couple of companies, and there
was one company where this was very seriously done. so
when we put data back in, we had to convert it, and
in the situation that i was in, it wanted everything in categories. so
what we would do is take continuous data and do what's called fine classing, which
means that we would divide it not into four pieces but into ten pieces:
decile one, decile two, decile three, decile four, and so on. and every variable
was stored now not in its original numbers but as ten, nine, eight, seven,
six, five, four, three, two, one. so let's suppose that i tell you his income
is 9. what that means is i know he's in the ninth decile: 10%
of the people or more have income more than him, 80% have less
than him. he's in that bracket. and all variables are stored
that way. now what happens is every algorithm knows that every variable
is going to be stored that way, and you can keep
writing algorithms. otherwise what would have to happen is every algorithm would need to be
written differently. let's say you're doing credit scoring, let's say you're doing crm models,
something of this sort, and you have built a very sophisticated crm model that tracks your
customers, and it works. now suddenly you've got a new variable coming in,
a twitter feed, and suddenly nothing works. what
do you do? go back and rebuild that entire model? that's
going to set you back three or four months. that's going to set you back a few thousand
dollars. so you say,
no, any variable that has to go in has to go in in this form, and
if it goes in in this form, my algorithm can deal with it.
might it not affect the efficiency of the model that we generate? yes.
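
a minimal sketch of the decile coding (fine classing) described above, assuming pandas. the
incomes are made up for illustration, and qcut is one convenient way to cut at quantiles; it is
not necessarily how the company in the story did it.

```python
import pandas as pd

# hypothetical incomes, made up for illustration: 100 evenly spaced values
income = pd.Series(range(10_000, 110_000, 1_000), name="income")

# cut at the quantiles so that each of the ten classes holds the same number of rows;
# labels=False returns codes 0..9, so add 1 to get deciles 1 (lowest) to 10 (highest)
income_decile = pd.qcut(income, q=10, labels=False) + 1

coded = pd.DataFrame({"income": income, "income_decile": income_decile})

# each decile covers a band of incomes; only the decile number would be archived
print(coded.groupby("income_decile")["income"].agg(["min", "max", "count"]))
```

the price of this convenience is visible in the output: everyone inside a band becomes
indistinguishable once only the decile is kept.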
yes. and in fact, i'm going far away from the topic now, but in
practice a professional analyst has to struggle between
doing the right thing badly and the wrong thing well. you
want to do the right thing, but the right thing is going to cost you time,
money, data and everything. so you struggle between saying
that i'm going to get a
flawed model quickly built on a new data set, or i'm going to get
an inefficient answer on a model that's already been built, and let's see how far
it goes. and so these are more
cultural issues with how an analytical solution is often
deployed in companies. they vary very much from industry to industry. they
vary very much from company to company, from the culture
of one company to the culture of another. they depend on regulatory environments.
in certain environments an auditor-like entity comes in and insists
on seeing your data: show me your data. in finance
this sometimes happens with regulatory agencies. the reserve bank of india goes into a bank
and says, show me your data, show me your order book, show
me your loan book. and now that has to be done, and
the decisions you made have to be made in a way that makes it patently clear why you've done
this. so very often people say, i don't want to make the best risk
decision, i want to make the most obvious risk decision, which
may not be the same thing at all. but i'm being
audited. so
that's a practical question and i don't have a clean answer to that. but
i do know what happens. is it right? no,
it's not, but we
live in a world that has a kind of imperfection. one of my teachers,
his name was jerry friedman, and you'll see some of his work later on. he came up with
algorithms: projection
pursuit, cart, mars, gradient boosting. he created many of the
algorithms that we'll be studying. he was one of my teachers at stanford, and when he ran
our consulting classes he would say this: solve
the problem assuming you had an infinitely smart client and
an infinitely fast computer. after
you've done that, solve the real problem, when
you do not have an infinitely smart client and you do not have an infinitely fast computer.
this was in the early 1990s. computer speeds were a lot slower. we
wouldn't have powerful machines like this around. so
a lot of this is done in that kind of situation, where
you're struggling for continuity while you're figuring
it out. imagine yourself as an analytics manager, and i hope many of you will be,
and you have an analytics team sitting in front of you. correct? you're looking
them in the eye, and you know how much you're paying them, and you know that half
of them are going to leave at the end of the year. what
are you going to do with regard to the modeling and things like that? your
first order of business is going to be that things should continue in some form. keep
it simple, right? keep it simple. keep it obvious
for the next bunch of people who are going to come in. and
for that you'd be willing to trade a little bit of making it right. so
the new person coming in will not want to solve a very complicated kind of situation.
this is not where you want to be, and i do not want to depress you
on day one, but it's also the fun part of the profession. it's also
what makes it
interesting and exciting. it's not all bad. okay.
so the hist command gives summaries; these histograms each give you a
sense of what the distribution is. and as you can see from most of these pictures, most
of these variables, when they do have a skew, tend to have a right skew. maybe
education has a little bit of a left skew,
in that a few people are less educated and most people are here. but even
so, it's an interesting
plot. matplotlib
has this as well, but seaborn has a better version of it. this is what's called
a box plot. you've seen a box plot? this
is a box plot. people
are unsure as to where this box came from.
who's used this before? this box
came from what used to be called a box-and-whisker plot. these are the whiskers. this
is the median. this
is the upper quartile, the top edge of the box. the
bottom edge of the box is the lower quartile. the
end of the whisker is one point five times the
interquartile range above the box, if
you want the formula.
the length of the whisker is 1.5 times the iqr.
should we have a break now? in a little bit, maybe. so
we'll break at 3:45; i will go up to there. i
haven't stopped, i just got distracted. so, 1.5
times the iqr: the whisker goes up to that. if
a point lies outside it, the
point is shown outside. if
the data ends before it, the whisker also ends.
correct? what is the whisker? okay, what is the whisker? all
right, let me explain another way. the whisker is the maximum:
the top of the whisker is the maximum, the bottom of the whisker is the minimum. okay?
not okay? okay, this
point here. what is this plot here? it's
for males. so what this means is this is the minimum, 18 or whatever
it is, and this is the maximum, 48 or whatever it is: the minimum and the maximum.
so if you see nothing else on the box plot, no other
points other than just the box and whiskers, then
your five-point summary is sitting there. that's
it. now what happens if you see points like this? outliers.
what is an outlier? an outlier is a point that lies
more than one point five times the interquartile range above the
box. so this whisker will not extend indefinitely.
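
a minimal sketch of the quantities a box plot draws, using python's statistics module. the ages
are made up for illustration, and the quartile rule used here (linear interpolation, the
"inclusive" method) is one common convention; other software may use a slightly different one. the
argument name whis is borrowed from the box plot functions in matplotlib and seaborn, where the
same multiplier defaults to 1.5.

```python
import statistics


def box_plot_summary(values, whis=1.5):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    # each whisker stops at the last data point that is still inside its fence
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# hypothetical ages, made up for illustration; 50 sits far above the rest
ages = [18, 20, 22, 23, 24, 25, 26, 28, 30, 33, 35, 50]
summary = box_plot_summary(ages)
print(summary)
```

with no outliers, the whisker ends are just the minimum and the maximum, and the dictionary
reduces to the five-point summary.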
it will go up to 1.5 times the height of this box, and
then it just stops, and if any points are still left outside, it will
show them as dots. you can treat this as a definition of
what an outlier is. say
the same thing in the other direction; the logic is
symmetric. that
means the data has ended here. the data
has ended here, i
suppose. and you can change it. i won't now, but you can go to the boxplot
syntax and change that. so you can go to the boxplot syntax
and you can change that one point five. it's not
hard coded into the algorithm.
i'm, i think, 95% sure, a statistician is never sure about anything, but
it's a parameter
that you pass in the function, and the
default is 1.5. you should be
able to change it. what's the colored
part? the median? which color, this one?
these two colors are because i've asked for two things. i've
asked for male and female. if i had three of them, i would get three colors. okay, this
one here? the lower
is q1 and the upper is q3.
so for males, from the bottom
whisker to the bottom of the box is a quarter of your data, the
box is half your data, and from the top of the box to the end of the whisker
is a quarter of your data. so
the line in the middle is the median.
there is also an option in boxplot you can play with which will
give you a dot, and that is the mean.
you can ask boxplot to do that. but
the mean is not a standard component of the five-point summary. it's
a different calculation, not a sort. but
if you want to, you can make boxplot give a dot at the mean as well.
yes. so
mean, median. so half the data is between 24
and 34 or whatever that is. half
of all the men in my sample are between those
two numbers. i
think boxplot doesn't allow you to change the shape of the box. i
think that is set. that's sort of central
to the idea of a box plot. it does allow you to fiddle with the size
of the whisker; i don't think it allows you to fiddle with the size of the box. now,
what if you change that to something else? let's say the 20%
point to the 80% point, the 80/20 rule. that's no longer a
box plot; it's another interesting plot. the significance
of it is exactly as we have seen before. the significance of it
is that the data looks like this. it's
right skewed. think of the picture. so this
is your q1 and this is your q3, and this is your q2, or the median. then
the median is going to be closer to q1 than it is to q3, in
the same way that the minimum will be closer to the median than the maximum. same idea. this
is a summarization for numbers. if
you want to summarize categorical data, you use what's
called a crosstab, or a cross tabulation. this
is simply how many there are in each product category: 195, 498
and 798. they've got three kinds of treadmills and they're trying to understand
who is using what kind of treadmill. the business problem is to understand
who is using what products. this is a crosstab. what is this? this
is something that will be used for categorical variables. no box
plot will make sense here. there are no numbers. so
now you can ask interesting questions here if you want to, and you can think about how to answer
them. for example, you can ask the question: is there a difference between the preferences
of men and women? possibly. is
there a difference in the products that they prefer? irrespective
of gender, is there a product that they prefer? you can ask
all kinds of interesting questions and you can find ways to answer them, which
we will do, not in this residency but next time around, for categories.
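
a minimal sketch of a cross tabulation, assuming pandas. the purchases below are made up for
illustration and only borrow the three product labels mentioned in the lecture; the counts are
not the course data.

```python
import pandas as pd

# hypothetical purchases, made up for illustration (not the course data)
purchases = pd.DataFrame({
    "product": ["195", "195", "498", "798", "195", "498", "798", "195"],
    "gender": ["male", "female", "female", "male", "female", "male", "male", "male"],
})

# rows are product categories, columns are genders, cells are counts of customers
table = pd.crosstab(purchases["product"], purchases["gender"])
print(table)

# margins=True adds row and column totals under the label "All"
with_totals = pd.crosstab(purchases["product"], purchases["gender"], margins=True)
print(with_totals)
```

the table only describes the data as it is; deciding whether the rows really differ is a separate
inferential step.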
so this is simply, once again, descriptive. all
this has done is it has simply told you the data as it is. what
i'm saying is that if you want to do a little more analysis on it, you
now have to reach a conclusion based on it. so for example,
one question to ask is: do men and women
have the same preferences when it comes to the fitness product they use?
now that's a question. to answer that question it is not enough to
look at the data, for just looking at it will not give me the answer. i need
to be able to find a statistic,
a statistic that does what? that in some way measures that difference,
let's say measures the difference between men and women. or what we will do
is not measure that; what we'll do is ask, if there was no difference
between men and women, what should this table have looked like? and
then we'll compare the difference between these counts and that table. but
that's the interesting part of a statistical test, which we will do. that's called the
chi-square
test. it's coming up in the next residency. but that's the prediction part
or the inference part of this description. this is just the description. you
can do a similar thing here. this, for example, is for
marital status and product. does the
product you use vary depending on whether you are partnered
or single? or does it have to do with age, or maybe they are correlated?
should we use one as opposed to the other? okay. you
can use counts as well. instead
of doing it this way, instead of seeing it as a table, if you want to see
it as a plot you can ask for counts. so
there are things like count plots and bar plots, which allow you
to do counts. in the lab you will probably do a few more of these. this
is simply another visualization of the same thing. this is for
those of you who like things like pivot tables in excel.
microsoft has made, you know, wonders of us all in
corporate life. you know, you can have a bachelor's or a master's
in anything, engineering is good, etc., and
it's nice if you have, you know, phds in a few areas, but
what you really need is a phd in powerpoint engineering, and
that's a necessary qualification for success. so certain
tools have been used, and therefore those tools have been implemented in
many of these software packages as well. this is the pivot table version of the same data set.
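
a minimal sketch of a pivot table, assuming pandas. the customers, column names and ages are made
up for illustration; a count plot from a plotting library would draw the counts computed here as
bars.

```python
import pandas as pd

# hypothetical customers, made up for illustration (not the course data)
customers = pd.DataFrame({
    "product": ["195", "195", "498", "798", "195", "498"],
    "marital_status": ["partnered", "single", "partnered", "partnered", "partnered", "single"],
    "age": [24, 22, 31, 40, 28, 26],
})

# counts of customers for each product and marital status, as in a spreadsheet pivot table;
# combinations that never occur are filled with 0
counts = customers.pivot_table(
    index="product", columns="marital_status", values="age", aggfunc="count", fill_value=0
)
print(counts)

# the same layout with a different summary in each cell: the mean age
mean_age = customers.pivot_table(
    index="product", columns="marital_status", values="age", aggfunc="mean"
)
print(mean_age)
```

the layout is the same as a crosstab; the difference is that a pivot table can put any summary of
a numeric column in the cells, not only counts.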
this is the last, well, not the last, but still, this is a plot.
let me show you this plot and then we'll end; we'll take a break. this
is a plot that is a very popular plot because
it is a very lazy plot. this plot requires extremely
little thinking: pair plot of a data frame, right?
you don't care what the variables are. you're telling it nothing about
the plots. you're simply saying, figure out a way
to plot them pair by pair, and it does that. so
for example, how would you read this plot on this side?
it creates a matrix. the rows are
variables and the columns are variables. what
is this? this is age versus age.
age versus age makes no sense, so what it plots there is a histogram
of age.
it doesn't like the gap. nature abhors a vacuum; i suppose python
does as well.
what it should have plotted is age
versus age. you're right, it should have been a 45-degree line.
but a 45-degree line is a useless graphic, particularly when the same 45-degree
line shows up in all the diagonals. so to
make a more interesting graphic, it plots the histogram there.
this kind of analysis sometimes has a name
associated with it. the name is univariate. univariate
means looking at it variable by variable,
one variable at a time. when i'm looking at age, i'm only looking at age.
so univariate analysis is just a word: uni as in uniform,
same form; unicycle, a cycle with one wheel; things like that. univariate.
it will replicate the same nature of the data. there'll be a histogram
here again.
yes, so remember
this graph, the nature of the graph. so let's see this.
where is gender here? where
is gender here? is
it there? is
gender in my data? it
is. so when i did pair plot of my data,
what did it do with gender? yes,
remember info? when we did info here,
remember how it stored the data?
it had product, gender and marital status. it had identified them
as objects in the data frame.
so now what does
that tell you about the command, the
pair plot command? yes,
it will ignore those objects. so
in answer to your question, if the data frame has
captured a column as int64, if it is integers or
numeric, it will plot it. if
it's only objects, it probably won't give a plot. yeah. say
again? no, not
like that. this is the histogram. this is the same plot. this plot is
the same as which plot? this
one is the same as this one here. no,
this is not age versus age. this is just age. age versus
age would have been a 45-degree line, but it is not plotting
that in the diagonal. it is not plotting age versus age in the diagonal. it
is simply plotting age's own distribution. yes,
with the counts. what it is doing is essentially running hist
on age, over all the observations, and putting
it on the diagonal. yes.
there is a bin, and for
each bin there is a count. it's
a count of the number of people who are in that age group. here, this
is, no, this is miles. this
is age. this is age. so say
between, let's say, 40.5
and 43.5, or whatever these numbers are, there are three
people. remember, the histogram is a visual thing. you
can examine a histogram if you want to, which means you
can find out what those bins and counts are, and you can see inside the histogram.
just ask for a summary and it will give you what the features are of that histogram. but the
histogram is not meant to be used that way. it's meant to be used as
an optical device, to see the shape, to see the counts. it's
an art to do a histogram. if you change the bins a little bit,
the histogram will look a little different. so i would suggest that
unless you've got a lot of experience in this or you really enjoy the programming, do
not fiddle with the histogram; its shape will change. i'll
show you later, after the break, not how to change the histogram but what shape is.
no, not in the default. you can go in and change
the bin size, but
the bin widths etc. of the histogram take a little more to change.
there is stuff out here; you can find other things with which you can
play with this. so there are ways to do it. okay,
so, quickly ending, we are losing our food. so
these are different plots, and we'll continue after the break. the rest of it is simply
an x versus y. so, for example, this is age versus education.
this is age versus education from the first
one. is this one rotated? yes. if this
is education on the y-axis and age on the x-axis,
or vice versa, then these two plots, one-and-two and two-and-one, are just
mirror images of each other. rotated or
mirrored depends on where you look, on where you put the mirror, but yes, mirrors.
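
a minimal sketch of which columns a pair plot has to work with and how many distinct panels
there really are, assuming pandas. the data frame is made up for illustration; the sketch only
counts panels and does not draw them.

```python
from itertools import combinations

import pandas as pd

# hypothetical data frame, made up for illustration (not the course data)
mydata = pd.DataFrame({
    "product": ["195", "498", "798", "195"],
    "age": [18, 25, 33, 48],
    "gender": ["male", "female", "male", "female"],
    "education": [14, 16, 18, 16],
    "miles": [112, 85, 200, 66],
})

# only the numeric columns can be put on an x or y axis; the text columns are left out
numeric_columns = list(mydata.select_dtypes(include="number").columns)
print(numeric_columns)

# a k-by-k grid has k diagonal panels (one distribution per variable) and
# k * (k - 1) off-diagonal panels, but each pair appears twice as mirror images
k = len(numeric_columns)
pairs = list(combinations(numeric_columns, 2))
print(k * k, "panels,", k, "on the diagonal,", len(pairs), "distinct scatter plots")
```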
so i remember when i was a kid, mirrors would
confuse me. i would ask the question like this: when i look in a mirror, left
and right get switched but top and bottom don't. i
never understood why. you know, is it due
to gravity? left and right get switched but top and
bottom don't. i would come up to the mirror, and then i thought it was something to do with my
eyes, you know, maybe
because they are left and right, so i looked at it this way, and that didn't help. so.
yes. it's an important point about symmetry. it's a good catch. so realize
that although there are so many plots, there are actually only half
as many plots, because the plot on this side of the histograms and the plot on the opposite
side of the histograms
are the same. there's another question that one could ask, which is that many of these seem to
look like rows and columns, in the sense that,
what are these rows now? what is this row? look, what does this mean? it
means that this variable, fitness,
actually has very few numbers in it. it
has the numbers one, two, three, four and five. now, why
is that? because remember how i defined fitness: it is my perception of whether i was
fit or not. in my original definition of the
variables, here you go: self-rated fitness on a one to five
scale, where 1 is in poor shape and 5 is in excellent shape. this was the created
data. so in this data set, i now have this
variable in it. these kinds of variables sometimes
cause difficulty, and there's a word
for them. these are sometimes called ordinal variables. so sometimes
data is looked at as, you know, numerical and
categorical, and categorical
is sometimes split into nominal and
ordinal. nominal means it's
a name: the name of a person; north, south, east and west;
gender, male, female; place, etc. the variable
is essentially a name. ordinal is also categorical, but there is a sense
of order: satisfied, dissatisfied,
very dissatisfied. so there's an order, and therefore ordinal.
this variable, the fitness variable, can, if you wish, be treated as
an ordinal categorical variable. so for
example, the likert scale is that, the seven-point scale:
very dissatisfied, dissatisfied, moderately dissatisfied, neutral,
moderately satisfied, satisfied, very satisfied.
mark one. this generates the data on a scale of, say, 1 to 7 or 0
to 6, so it will show up in your database as a number. for example, here
you can see that instead of one to five you could have very
unfit, moderately unfit, okay, relatively
fit, very fit. instead of giving one to five you give it that way, or you code
it up this way. your choice. so
sometimes when you have data that looks like this,
python or any database will recognize it as a number
because you've entered it as a number, but you analyze it as if it is a category.
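
a minimal sketch of treating a 1 to 5 rating as an ordinal category rather than a number,
assuming pandas. the ratings are made up for illustration; the labels are the ones suggested in
the lecture.

```python
import pandas as pd

# hypothetical self-rated fitness scores on a 1 to 5 scale, made up for illustration
fitness = pd.Series([3, 5, 1, 4, 3, 2], name="fitness")

# the labels, listed from lowest to highest
labels = {
    1: "very unfit",
    2: "moderately unfit",
    3: "okay",
    4: "relatively fit",
    5: "very fit",
}

# an ordered categorical keeps the sense of order without pretending the gaps are equal
fitness_cat = pd.Categorical(
    fitness.map(labels), categories=list(labels.values()), ordered=True
)

print(fitness_cat)
print(fitness_cat.min(), "<", fitness_cat.max())

# counts per category make sense; an average of the labels does not
print(pd.Series(fitness_cat).value_counts(sort=False))
```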
so the opposite problem also sometimes exists, in
that sometimes you get to see a categorical variable show
up as a number, but you know it's a categorical variable. a zip code is an example.
a zip code shows up as a number, but it's obviously not. you
can't add up zip codes. you
take two places in bangalore and you want to find a place between them; that's not the average
of the zip codes. you might
be close, but you can't do arithmetic with zip codes. the other
difficulty with zip codes is that there can be many of them, which
means that as your data set grows, the number of zip codes
also grows. so the number of values that a variable can take grows
with the data, and this sometimes causes a difficulty, because
in the statement of the definition of the variable you now cannot say how
many categories there will be.
you know that there will be more zip codes coming; you just don't know how many more will be
coming. but
you also know it's a categorical variable; you can't treat it like a number. and
so there are some special types of, you know, problems like zip codes that require special
types
of solutions. so the plot itself is a very
computational plot: if it recognizes a variable as a number, it plots it. if
you don't want it plotted as a number, change it to a character. most
software, including python, will allow you to do that. now
this is in some ways a graphical representation of it. before the
end of this session we can talk a little bit about the numbers associated with it. so
here, for example, is my data, age. you can also go,
you know, dot age if you want to, and things like that. this
is the mean. twenty seven point seven
eight eight eight is the mean. so you can extract it. there are variations
of the mean that can be computed, like trimming, etc. if you want to,
you can calculate the standard deviation. remember the standard deviation formula, that strange
formula that i wrote on the board? this is the standard deviation formula. if you want to
calculate
the standard deviation, you can do this for other variables
as well. this is an interesting plot. so
i don't want to go too far into this plot, but it's an interesting plot from
seaborn. there's a warning on the code. this
is called, or what they're referring to is, a distribution plot. so
this is a plot that tries to look
at not what the data is but what the distribution
is. so remember i was drawing these odd pictures, pictures like this, and
drawing lines on them? those were distributions. so what this is trying
to do is go after the distribution of the data. now,
what does this mean? it says
that there is an underlying distribution of
the age variable. this distribution is a distribution
you do not know. however, you have
a sample from that distribution. how big a sample? about
a hundred eighty observations. from that sample of a
hundred eighty, can you guess at what that distribution is? in
other words, can you give me a curve? it's
an answer to that problem, and it gives a curve. why
is the raw data not enough? so
the raw data is not enough, and this goes to the heart of what the statistical problem is.
it is because i am interested not in the age
of this particular group of people. i'm
interested in the corresponding age of another,
very similar group of people. why? what
is the problem i'm trying to solve? i'm trying to solve the problem of who is going to buy
my cardio equipment. now, when are these people going
to buy my cardio equipment? at
some point in the future. okay. now, what is my data?
for whom is this data? it is for people who
have already bought. so i have a problem. my problem is
i want to reach a conclusion about my future customers based
on my old customers. how do i do this? what
mathematical logic allows me to say something about the future
based on the past? yes,
in short, the way to do this is to assume
that there is what is called a population, which we'll talk more about later, or at
this stage to assume a distribution: to
assume that there is a distribution, and from this distribution i
have seen a sample today. from
this distribution i'll see a sample tomorrow. the
people are not the same, because the people who used my
cardio equipment yesterday are not the ones who are going to use it next year. if
they were the same, i would never have a growing business. there's
no point analyzing data on customers unless i want them to buy
more things or i have new customers coming in. so
what is common between my observed set of data
and the data for my new customers? that
commonality is what you can think of as a distribution. so
it asks: from this, can you give me a sense of what this distribution
is? and from this distribution i can think of other
people coming. so
what we'll do tomorrow is talk about a few distributions, certain specific distributions,
and how to do calculations with those distributions. for now, what this graphic
does is it simply calculates that distribution for you.
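
a minimal sketch of one standard way to turn a sample into a smooth curve, a gaussian kernel
density estimate, in plain python. this is an illustration of the idea and not the plotting
library's own code: the ages are made up, and the bandwidth of 3.0 is an arbitrary assumption,
whereas a plotting library picks its own by a rule.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a density function: the average of one normal bump centred on each point."""
    n = len(sample)
    scale = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return scale * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# hypothetical ages, made up for illustration (not the course data)
ages = [18, 19, 21, 22, 23, 23, 24, 25, 25, 26, 28, 30, 33, 35, 38, 40, 44, 48]
density = gaussian_kde(ages, bandwidth=3.0)

# the curve is higher where observations are dense and close to zero far from the data
print(density(25), density(60))

# a density should enclose an area of about 1; check with a simple sum on a grid from 0 to 80
step = 0.1
area = sum(density(i * step) for i in range(800)) * step
print(round(area, 3))
```

a wider bandwidth gives a smoother curve and a narrower one follows the sample more closely,
which is the same kind of judgment call as choosing histogram bins.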
i'll explain very, very briefly how it does that; i won't go into
too many details. what it does is it takes averages
of points. yes? i'm
saying that for a sample. why is this so? why
is the sample not the distribution itself?
it's a good question. why am i not saying ignore the curve?
why am i not saying that the original histogram, which you have seen three or four times
before,
is itself the distribution? that's similar
to the following question. let's suppose that you have had a blood test done, and
you've got a few measurements. you've been tested twice today, let's
say pre-prandial, you know, before and after breakfast. next week also you have done this. let's
say you've done this once a week
for a month. so now you've got four readings; no, four into two,
eight readings. now, these eight readings: is
that the distribution? in a sense,
yes. if i want to understand what my blood sugar
is and what it will be going forward if i do not get treated, then certainly
there's a relationship between this and what will happen in the future. for
example, if i behave exactly the same way, if i eat exactly the same way,
exercise or not exercise exactly the same way, if i smoke, if my lifestyle
is exactly the same as it is, i would expect my readings to be the same. but
what about it is going to be the same and what about it is different, i don't quite know. yes,
it is
true that those eight numbers are a representation
of the distribution, but they're not the distribution itself. if they were the distribution
itself, i would be forced to say that in the next month i
would have exactly these eight readings. but
i know that's not true. but i also know that
from these eight readings i can say something about what will happen next month.
it's not that there's useless information there. so
if my readings this month are, for example, let's say a hundred and ten, a hundred and twenty,
a hundred and fifteen, a hundred and twenty-five (assuming good health), a hundred and thirty, etcetera,
and these are my eight readings, then i know with some confidence
that if things remain the same next month, they will not start becoming
220, 210, 215, 230. they will not become that. how
do i know that? because i have these readings this month. so
the idea of a distribution is to be able to separate, in
the data, the random part and the systematic
part, and the systematic part is what remains as the distribution.
around it there's going to be random variation, and the random
variation is going to exist from data set to data set, like this month and next month,
like this set of customers and another set of customers who buy cardio equipment,
maybe from another branch of my store. if
i am, for example, running a chain of stores, let's say, not
to pick names, but let's say reliance fresh or something of that sort, and i want to understand
how my stores are doing, let's say i take five or six stores and i study them extensively.
how do i know that those results are going to apply to the remainder of my 500 or 600 stores?
what is common between these five and the remainder? how
are they representative of it? what part of it applies to the rest, and what part
of it does not? how do i extend it? how do we extend your blood pressure
readings to the next blood pressure readings? how do you figure this out?
that is the heart of statistics. it is called statistical inference: to
abstract away from the data certain things that remain the same and
certain things that do not. so this distribution plot is an estimate of that
underlying true distribution of age. and
so it's not as rough as the histogram, it is smoother. how
smooth is something that the plot figured out on its own,
like the histogram bins, but you are free to change it.
there are options within
this plot. it's a fairly sophisticated function
with which you can do many things, and
there are many, many options available within it. so for example, this
bins option of the histogram allows you to say where the
boundaries of the histogram bars should be. there are options for whether
you want to plot the histogram or not, and whether you want to plot what i was calling
the distribution. the gaussian kernel density estimate is a sophisticated way of
saying the same thing. and there are
functions available to put into that, so you can change this. it's one
of the most sophisticated plotting functions that you will be able to see. i
wouldn't suggest doing it now, until you have more experience in doing this, but yes.
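as a rough illustration of the histogram bins and the gaussian kernel density estimate mentioned above, here is a minimal sketch in plain python. it is not the plotting function used in the session: the ages are made-up numbers, and the names `kde_curve` and `histogram_density`, the bin boundaries and the two bandwidth values are assumptions chosen only for this example.

```python
import math
import random

def kde_curve(data, bandwidth):
    """Gaussian kernel density estimate: an average of small bell curves,
    one centred on each data point."""
    scale = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return scale * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in data)

    return density

def histogram_density(data, edges):
    """Bar heights scaled like a density (total bar area is 1 when every
    value falls inside the edges)."""
    heights = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        count = sum(1 for v in data if lo <= v < hi)
        heights.append(count / (len(data) * (hi - lo)))
    return heights

random.seed(0)
ages = [random.gauss(29, 7) for _ in range(180)]   # hypothetical ages

edges = list(range(0, 65, 5))                      # bin boundaries chosen by hand
bars = histogram_density(ages, edges)

wiggly = kde_curve(ages, bandwidth=1.0)            # little smoothing
smooth = kde_curve(ages, bandwidth=4.0)            # more smoothing

grid = [x / 2 for x in range(-40, 200)]            # -20.0 to 99.5 in steps of 0.5
area = sum(smooth(x) * 0.5 for x in grid)          # numerical area under the curve
peak_wiggly = max(map(wiggly, grid))
peak_smooth = max(map(smooth, grid))
print(f"area under smooth curve: {area:.3f}")
print(f"peak height, wiggly vs smooth: {peak_wiggly:.4f} vs {peak_smooth:.4f}")
```

a larger bandwidth gives a flatter, smoother curve with fewer wiggles, and a smaller one follows this particular sample more closely. both curves are still estimates built from the same 180 numbers.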
the gaussian part will not make too much of a difference. what
will happen is, if there is more smoothing here, it will look
like a normal distribution. these little wiggles will go. we
will discuss later, maybe tomorrow,
when it's a good idea to do that and when it is not. just
hold on to that question a little bit. we haven't talked about the gaussian distribution yet. i'll
deal with that when the gaussian distribution comes. for now, what this does
is give you a visual representation, simply a descriptive representation,
of the underlying distribution. hence the name: the distplot is a distribution
plot. one of the examples
is: samples were taken earlier, and then you are taking a current sample.
that's correct. so if my distribution is correct, let's
suppose, in an ideal world, if my distribution thinking is correct, then here's what would
happen. if i take my old data and i do
a histogram, sorry, i do a distplot, and in the new data i do
a distplot again, these two should be very similar. the
histograms may be different, but the distributions should be similar if
i've done my analysis correctly. does that mean the variance
is the same? i wouldn't use the word variance; it is variability. it means that
there is something called sampling variability. in other words, it is
variability that is due to the fact that you've taken a sample. there
is an underlying truth, but you're not seeing that truth because you're taking a sample.
there is an underlying level of your blood sugar, but you're not seeing that
because you have taken only a very small sample of your blood, a few milliliters
when there are liters flowing around, and only at a few seconds
in time, when there are many hours in the day. there
are so many other things that your reading could have been. but if
it is a good sample, then what would happen is that i will be able to cover this
variability. see, if i want to get a sense of what your
blood sugar actually is, how do i want to sample this? well, what i will do is i'll take samples
in different kinds of situations. one thing they cover, for example, is before
eating and after eating. that they cover. or maybe
i want to cover other things as well. for example, in certain kinds of diseases
they are very conscious as to where to take the blood from, because
the metabolism in the blood changes based on the disease and where the blood is going. for example, you
draw the blood near the liver. the liver is the body's filtration system. so
essentially you want to figure out the nature of the blood when it flows into the liver
and then after it flows out of the liver, in order to understand whether the liver is filtering your blood correctly or not.
now to do that you need to draw the blood in very specific places. so
in order to do that, therefore, your experimentation should cover all of that. what does that
mean, for example, in business terms? let's say that you're looking at sales data and you want to understand
your sales distribution. well, don't focus on certain salespeople. look at
your bad salespeople. look at your good salespeople. look at your high-selling
products. look at your low-selling products. cover the range of possibilities.
if you do not cover the range of possibilities, you will not see the distribution. if you do not see
the distribution, you will not know where the future data will come from. and
if you don't know that, you'll not be able to do any prediction or prescription for that. the
histogram is just a summary. this is also just a summary, but
the histogram summary applies to just this data set. this distribution
is pretending to apply to a little bit more. the definition
of a distribution doesn't apply
to the data. the distribution function, so to speak, is just this. for example,
it is sometimes defined this way: F(x) = P(X <= x), the probability that
the variable X is less than or equal to the value x. this is sometimes called a distribution function. F
of x is equal to the chance that age is less than or equal to 15, age
is less than or equal to 16, age is less than or equal to 17. and now let me confuse
you even more. small f of x,
which is the derivative of capital F of x,
is the density function, and probabilities are areas under its curve. this is called the density
function, which is what this plot is plotting.
so the distribution function is the integral
of the density function, and the density function is the derivative of the distribution
function, if you're very mathy about all of this. so
what they're plotting is the density function. i should
be clear: this is actually called the density function. the reason i'm calling it a distribution
function is because it says distribution here. i was hoping not to confuse you. clearly i have
failed. go ahead. yes,
that's the idea. yes. now
you see, now you've hit the problem of statistics bang on the head. how
do i get an idea of a distribution that applies to everyone, based
on only one sample sitting in front of me? that
is the million dollar question, and that is why people like me exist.
that is the whole point of the subject. and
it is a hard problem. it is a hard problem because
you are trying to draw conclusions outside
your data.
nobody is interested in your data.
right? everybody is interested in their data, or in their problem.
[laughter] yeah, nobody's interested in your data. but
you still have to analyze the data that is in front of you and reach a conclusion that makes sense to
them. the bank has to look at its, you know, portfolio
and figure out what its strategy should be. the clothing store needs
to look at its sales and figure out what clothes it should make. great learning
has to look at its course reviews and figure out which faculty members to keep. you
have to look at your expenditures and figure out how much salary to negotiate
for. how will you do all of this?
how do you do all of this, by the way? based on some sense
of a distribution. so when you go and you negotiate for a salary,
you're not going to negotiate for, you know, a hundred crores. you might, but you say no one's going to give me that anyway.
maybe you're good enough, i don't know. but
what you do is roughly something like the following: you
figure out how much money you need and how much money you are expecting, and
that, in turn, to some extent is based on your expenditure
and what you want to do. your expenditure is also based on that: you have a certain income, and
you're spending based on this. you're doing all of this on a regular basis.
you're standing on the road, correct? you're standing on the road and you're trying
to decide whether to cross it. how do you decide? experience.
you've got past data, and that data is telling you: please cross the road. how?
that data has not seen that car. car number
fifty three, three six one nine, and its driver, have not been seen by your data set. how
are you crossing? because you're making the assumption that while
i have not seen him, i have seen many others like him. so
there's this story, right? so, you know, a
taxi driver is going at night on the road, etcetera, etcetera, and he's just running
red light after red light.
he just keeps going, and the passenger is getting very scared.
he asks the driver to stop. the driver says in hindi (i apologize for the language)
something like: i am the lion of the road, who will stop me?
he goes through all the red lights, and then there's a green, and then he stops.
and then the passenger says: why did you stop now? the driver says: there may be another lion
coming from the other side. so he's logical, right?
his data is saying that there are people who cross the red
light. so therefore, if i'm standing at a green light, with a red light on
the other side, cars will be crossing that red light, right? very logical. so
therefore, and we do this all the time: while we are not trained statisticians, that is, normal
people are not, they behave like one, based on experience.
now your objective, the objective of an analytics professional, is to translate this logic
into an algorithm, into a procedure
that the company and the computer understand, and
that is not easy. for starters, let's
say that you are here, and you say that this is an average, right?
this is the mean age. this is twenty eight point seven eight eight eight,
and you could say that this is an estimate of the mean
of the distribution.
this is the mean of the data, but you are not interested in the mean of the data. why
are you not interested in the mean of the data? because you're not interested in this particular set of
a hundred and eighty people, but you are interested in the average age
of my customers. so now the question becomes: what does the
age of my new customers have to do with the number 28? are
they related? yes, you say, they are
like a copy of what i have. i think christie
said that they are like a copy of what i have. then when i ask, will it be twenty
eight point seven eight again, you'll say probably not. they are a
copy, but not that much of a copy. most likely, and now we're asking,
you know, how likely is most likely, and what about it is going to be the
same and what about it is going to be different? the y axis
of the distribution, you say, is the same also. for example, you
could say that this 28.78 is an estimate of
the population mean, which means that yes, it comes from
the histogram, it comes from the data itself, but
it also comes from this distribution. but there's
also this nagging feeling that i do not know for sure. i do not know.
i don't know what this new data is going to be. so what will happen is we will not give
the answer 28.7.
we will give an answer that is like 28.7 plus or minus something. you say: i do not know
what the population mean is, but i'm going to guess it is around
28. i know it is not going to be exactly 28, but
28 isn't useless for me either. it's going to be around
28. how much
around 28? now certain criteria come in. what
will this depend on? it will depend on the variation of the data,
the standard deviation. if the data is very variable, this plus or minus will be large.
then, yes, it will depend upon how many things i'm
averaging over. if this was a hundred and eighty, i'm
so sure. if this was eighteen thousand, i'll
be even more sure. if this was 18, i'll
be less sure. so it depends on how much data is being averaged
over. the more data i have, the surer i am about the repeatability
of it, the surer i am that i will see something similar again. and it
depends upon how sure i want to be: if i want to be 95% sure,
if i want to be 99% sure, if i want to be 99.9% sure. the more sure i want to be,
the bigger the tolerance i must have on
my answer. those are things we'll get to. those are also
descriptions, but those descriptions are heading towards being able to predict.
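the plus or minus idea above can be sketched with the usual normal approximation: margin = z * (standard deviation) / sqrt(n). this is only one common way of doing it and not necessarily the method the course uses later. the standard deviation of 7 is an assumed value, and `margin_of_error` is a name made up for this example.

```python
import math
from statistics import NormalDist

def margin_of_error(std_dev, n, confidence):
    """Half-width of a normal-approximation interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # e.g. about 1.96 for 95%
    return z * std_dev / math.sqrt(n)

mean_age = 28.79   # the sample mean quoted in the session, rounded
std_dev = 7.0      # assumed spread of the data, for illustration only

for n in (18, 180, 18000):
    for confidence in (0.95, 0.99):
        m = margin_of_error(std_dev, n, confidence)
        print(f"n={n:>6}  confidence={confidence:.0%}  ->  {mean_age:.2f} +/- {m:.2f}")
```

the three things named above all show up in the formula: more variable data widens the margin, more data narrows it, and asking to be more sure widens it again.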
so now if i give you this twenty eight point seven eight, i've given you a description of
the data, but i've not given you a prediction.
whereas if i give you 28.78 plus or minus something, i have now begun to give
you a prediction. today is
about descriptive analytics. we're not predicting anything. we will get there. but
this plot is in some way a first step towards
looking at this idea of a population and of a distribution associated
with the population. yes?
if the variation is less, the curve will be sharp.
flat means variation is more. if the curve
goes this way, it means that there is a lot of variation. i'm unsure about the middle.
it's harder, you need more data. the variation itself would not
necessarily go down, but the variation of the average would. so
let's suppose that you have no control over your diet. i'm not accusing you of anything.
it happens to humans. but let's suppose that you are doing a job in
which your lifestyle is very varied. you travel from place
to place, you eat in different hotels, sometimes you don't eat at all, sometimes
you stress out a lot, sometimes you're running after trains and
sometimes you're sleeping for 12 hours in a row. your life is highly variable.
there's nothing wrong with that; many people have very varied lives.
well, suppose i'm now trying to measure the blood sugar of such a person. what
must i do?
i could try to relate it to other variables, or at the very least, if i simply want to get a good blood pressure measurement,
i have to measure it under many different circumstances. or i could
argue: either i don't control your circumstances, or i can control your circumstances.
so i can say, for example, go and measure it at this time, or
take a glucometer and before going to bed do this, or
after you've just had a very hard day do this. i
can give certain instructions to cover all the corners, or i can simply say:
i don't know, but what you need to do is measure your blood pressure, or sorry,
your blood sugar, say every six hours and then tell me what
happens. but you need to do this often, because i expect your blood sugar to
be highly variable, simply because your
body is being put through an enormous amount of variation. in
a business situation, let's suppose that you've introduced a new product. you
do not know if this new product is going to sell or not. what
will you do? i mean, how will you measure? you've just introduced it. based
on past data, you say. but you've just released it. you can measure current data. no,
this is a different situation. i've just released a product. all that is over. i
have now just released this watch in the market. what
typically happens is people track the market very, very closely:
the number of sales made, everything. the reason is that they're not
sure how much this will sell. see, the question is what changed. what changed
was your product release. another competitor could be reacting immediately. my
point is not that there are many things to look at, which you should. my
point is that when there is a change in the distribution, when there is an unknown
distribution coming in front of you whose variation you do not know, you
tend to get more data. you
sample more frequently. you get more data. you figure this
out. we do this all the time. for example, let's suppose, for those
of you who have kids, that your kid is going to a new school. what
will you do? you'll ask lots of questions. you'll get more
data. you'll find out what is happening in school, what the teachers are like.
as i said, that's because there's too much variability standing in front of you. now,
with those answers, and after you do a few trips to the school, you are a little more, you know,
you may like it, you may not like it, but you are at least more
informed. the distribution is now known to you. so
you get more and more data. that's
how you get the experience. that's why you start getting that experience.
if you have that experience already, in other words, if you know the distribution very, very well,
you're comfortable with it. but it will take time to get there,
and that's why this big data world is becoming so interesting: by the
time you've understood a problem, the problem is not important anymore. there's
a new problem now. this is good, right? that's why you guys have jobs. but
it also means that when you have new data, you solve
a different problem. you don't solve the older problem better, which
is what a statistician to some extent is trained to do: as you get more and more data,
get the distribution better, get a better idea of the unknown.
make a better product. but the alternative view is: make another
product. solve a different problem if
you have more data. so the ceo is now saying: i have more data, give
me more. more of what? solve another
problem for me. give me new customers that i can go after, and things like
that. so therefore that is a problem
that statisticians and big data people often face, and it's not an easy problem to go after.
as you have more and more data coming in, how do you utilize
it? how do you make efficient use of
this information? do you get tighter
estimates of what you're going after? say you're doing sentiment analysis of text. when you do text
analysis you write, you know, twitter code, etcetera, and you will do latent semantic analysis.
you will look at positive and negative, you know, net sentiment scores and things like
that. and now the question will be, given that you know this is going to change,
that people's opinions are going to change:
over what granularity do you expect people's opinions to stay the same? do
they change every day? if they change every day, there is no point looking at a person's
average over days, because the average means nothing. every
day is a different opinion. on the other hand, if their opinions change, let's say, on a monthly
basis, then you can look at daily values and average them, and you will get a better estimate
of that monthly rating. so
you also have to make a guess as to whether i'm estimating a changing
thing or whether i'm estimating a stable thing better. and
that's not an easy thing to do.
i know it's happened to me. i don't
know whether this has happened to you or not, but there have been times in my life where i have simply
not had haircuts. what that means is
i've gone six months, eight months, and a haircut has been like a weight loss program.
i did not care what i looked like. i'm not sure i do now, but, you know, when things become
very unhygienic, i go and get a haircut. there have also been
times in my life when i've been a lot more conscious of what others think about me. you can imagine
at what points in my life. then i groom, i'm very careful, i get my
hair done, and you know, all of that. i'm getting my hair cut much more regularly.
now what am i doing? in the second case, what i'm doing is i'm trying to make sure that i'm
reaching a certain standard. there is a certain target
distribution that i have, and i'm interested in getting there, and i'm intolerant of variability. i'm saying
that i'm going to estimate this distribution and i'm going to stay close to it. in the first
case i was not; i was perfectly okay with the variability. in certain
cases you will be okay with the variability and you'll not want to estimate the distribution all the time. and
in certain cases you will want to estimate it very, very well. you
want your hair to be done very correctly. you will want your product to be targeted
to a very specific age group. you will want to know, when i
am targeting this particular age group, what advertisements do i want to show?
you want to advertise it on television, and you will want to know
who is watching the program on which you're advertising this. are
they college people? are they professionals? are they old
people sitting at home? who will use this, and therefore where will i advertise my
cardio product? sometimes you will want to know this very, very precisely, or
as precisely as you can. so therefore
this number, this mean,
from a description perspective
is perfectly okay. it is just the average. but from an inferential
perspective it is just the beginning of the journey. it is just one number, and we're going to have to put a
few more bells and whistles around this. go ahead. you have lots of questions, clearly. okay,
so we haven't talked about normal distributions. we will do that tomorrow. but
statisticians need to make assumptions about data. one
of the assumptions is what he's talking about: it assumes a certain distribution.
it says that i'm going to assume that the data has a normal distribution. that is an assumption. but
why do statisticians make assumptions like that? one reason they make assumptions like that is
because the calculation becomes easier. now, just because the
calculation becomes easier doesn't mean the calculation is correct. if the assumption
is wrong, the calculation is also going to be wrong. but because
of the assumption you can do many of these calculations, and if you don't make those
assumptions, these calculations become difficult or even impossible given
the data at hand. so a lot of the tests, a lot of the procedures that
we'll be talking about, are going to make certain assumptions. we'll see one in about an hour or so.
if the assumption is correct, i will have a strong model. but if the assumption
is wrong, i will still have a model that
is indicative. so there was an economist, i think paul samuelson,
or i'm not sure who, someone who said, no, george box, i think, the box of
what is called box-cox. he said that all models are wrong, but
some models are useful. so
the answer is that it may still be useful. in many
cases the distribution is expressly
allowed to be not normal. the domain tells you that. let's
say that you are in an engineering domain. you know the data has a certain shape;
the engineering domain tells you what the shape is, and the shape is sometimes called a weibull distribution. what
that means is that if you are reporting out, let's say, the failures of something, you're reporting
out the failures of gas turbine blades (i spent a number of years doing that), we had
to report out the weibull distribution. we didn't report out a normal distribution, we reported a weibull
distribution. in the finance industry you report out a log-normal distribution,
the means and variances of it. every industry has its own favorite distribution
because every industry has its own generic data form.
now even within the industry a particular data set could violate that rule, and
then it becomes interesting. there, as a statistician, you now use a higher powered
tool set, a more powerful tool set, to solve that. this
leads to certain complexities. the first complexity one often runs into is:
which one? and do i do it differently from someone else?
it is like a doctor who looks at a patient and says: you
know what, the textbook says that i should do it this way, but
this guy looks different. i've never seen anyone
like him before. so let me ignore the textbook and treat him this way. i think he'll get better. now,
could i be right? i could, but i'm taking a risk.
so every time you're making an assumption on your own and following through on it, you're taking a similar risk.
you could be right for that particular case, but you have far fewer precedents
to go on, as a result of which, when you extend it
to someone else, you're going to find it hard to do so. so
therefore people often make assumptions about distributions in a sort of, you
know, historical sense: they know that this has worked moderately
well over a period of time, and they are very hesitant
to change it for particular cases. sometimes they are allowed to; in
regulatory terms they are often not allowed to. any accountants here? accountants,
in accountancy, you know this. if you are an accountant, you have to do your books in
a certain way. now let's say that you are measuring cash
flow. there is a certain way in which you will measure cash flow. now
you may say that in this particular month your business was done a little
differently, so i'm going to show a better cash flow this way. if
you do, you know you're running into trouble. now you
may be right, in the sense that that may actually be a
better way of doing it. but as soon as you go out of the cfa,
as soon as you go out of a very standard way of doing things, things
will be a problem. and the same kind of logic applies
to statistical analysis as well. as a result of which, like
an accountant, you are doing
the thing approximately correctly most of the time. in
machine learning there is a term that you might see, a term
like supervised or unsupervised, etcetera. it is called pac learning.
pac learning is a deeply technical field,
and pac stands for probably approximately correct.
probably approximately correct: i'm not telling
you anything for certain. if i'm wrong, don't believe me, but
i'm probably approximately correct. and
the probably part comes from statistical thinking, the approximately part comes from
machine learning thinking. it's a deep field, a serious
field, but it puts a probabilistic statement
on an approximation. so therefore, at the end of the day, whatever method you use,
there has to be a sense of how generalizable it is.
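one way to get a sense of how generalizable a model is: build it on one data set and score it on data it has never seen. the sketch below uses made-up data and two hypothetical models (`memorizer` and `straight_line` are names invented here), so treat it as an illustration of the idea rather than a recipe.

```python
import random

random.seed(1)

def make_data(n):
    """Made-up data: y is roughly 2 * x, plus random noise."""
    rows = []
    for _ in range(n):
        x = random.uniform(0, 10)
        rows.append((x, 2.0 * x + random.gauss(0, 3)))
    return rows

def memorizer(train):
    """Predict by copying the y of the closest x seen in training."""
    def predict(x):
        return min(train, key=lambda row: abs(row[0] - x))[1]
    return predict

def straight_line(train):
    """Least-squares straight line fitted to the training data."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    return lambda x: mean_y + slope * (x - mean_x)

def mean_squared_error(predict, rows):
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)

your_data = make_data(200)   # the data set the model is built on
my_data = make_data(200)     # a second data set the model never sees

scores = {}
for name, fit in (("memorizer", memorizer), ("straight line", straight_line)):
    model = fit(your_data)
    scores[name] = (mean_squared_error(model, your_data),
                    mean_squared_error(model, my_data))
    print(f"{name}: error on your data {scores[name][0]:.2f}, "
          f"on my data {scores[name][1]:.2f}")
```

the memorizer has zero error on the data it was built on, because it has copied the noise as well, and it does worse than the straight line on the unseen data. being very good on the data in hand is not the same thing as generalizing.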
you will do that fairly soon, in
a couple of months, when you do your first hackathon. your first hackathon,
all your hackathons, will have a certain feel to them. a common theme for a hackathon is: i'm going to give you a data
set, you build your model on that data set, and i am going to have a data set that
i'm not going to show you, and i'm going to tell you how well your model has done on my
data set. and
you have a day or six hours or whatever to fiddle around with your data set and show
improvement on my data set. this
is what you'll do. you'll do it twice, i think, in your schedule. what
does that mean? it means that being very good on your data set
doesn't necessarily mean you are successful. you
have to be good on my data set, but i'm not going to give you my data set. this
is not as impossible as it sounds. it is a very standard problem, a typical problem. you will
not find this hard. you'll find it very easy by the time
you get there. not a problem. you all
will get ninety, ninety-six, ninety-whatever percent accuracy, not to worry. technically
this is not hard. yes? right, so there are two answers to that.
one is, if the mean is different from the median, then which do you use? no,
no. the mean being equal to the median, in a distribution
sense, means this: these are the two numbers. okay,
say the distribution looks like this and i have a parameter mu.
we're going to do this later. when statisticians use a greek letter, they're
referring to something that they do not know.
it is all greek to them. so mu is a population parameter. it
exists, but
it is unknown. now, if the distribution is nice and symmetric like this, then
this unknown thing in the middle can be estimated using a mean or it can be
estimated using a median. now the question becomes which
is better, and the answer to that, roughly speaking, is this:
if there are many outliers, if this distribution tends to sort of spread out to the tails,
then use a median, for the reason that i gave: the median
is stable to outliers. if
this distribution has more of a bell-shaped curve of this particular kind, the mean is more efficient;
it gives a better answer. what if the shape of the distribution
is not that, but is like this? then the median may be here
and the mean may be here. now you're asking a different question. now it is not
a statistics question. it's a common sense or a science question: which one are you interested
in? are you interested in per capita income, or are you interested in the income
of the typical indian? correct.
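the rule of thumb above (median when there are outliers, mean when the curve is bell-shaped) can be checked with a small simulation. the sample size, the number of repeats and the way the outliers are generated are all assumptions made for this sketch; the true middle is 0 in both cases.

```python
import random
import statistics

random.seed(2)

def bell_shaped(n=25):
    return [random.gauss(0, 1) for _ in range(n)]

def with_outliers(n=25):
    # About one value in ten comes from a much wider distribution.
    return [random.gauss(0, 50) if random.random() < 0.1 else random.gauss(0, 1)
            for _ in range(n)]

def spread(draw_sample, estimator, repeats=2000):
    """How much an estimator varies from sample to sample (standard deviation)."""
    return statistics.pstdev([estimator(draw_sample()) for _ in range(repeats)])

results = {
    (shape.__name__, est.__name__): spread(shape, est)
    for shape in (bell_shaped, with_outliers)
    for est in (statistics.mean, statistics.median)
}
for key, value in results.items():
    print(key, round(value, 3))
```

a smaller spread means the estimator lands closer to the unknown middle from sample to sample. on the bell-shaped samples the mean is the steadier of the two; once outliers are mixed in, the median is.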
for example, let me ask you this: how much time, give me one
number, one representation of the amount of time that you spend
on a website? i'm
asking for one number. don't tell me the number, but think in your head as to how you would answer this.
how much time do you spend on a website? by website what i mean is
this. yes, the average, but what does the average mean? so
how would i do this? here's what i'm asking. you're cruising
the web every day, let's say. what i'm asking for is a number like this:
the amount of time that you spend on a website. you go to different websites and you spend a variable
amount of time on each of these websites, whatever be your purpose. sometimes you're just passing
through, sometimes you're seeing a video, sometimes you're sending an email, blah blah blah, whatever. and every
session i'm thinking of as a different website. if you go to google twice,
then i'm thinking that is two sites. so session-wise, so
to speak. now i'm asking for a representative number. so
how would you come up with that number? what's a fair answer to that? i 
mean see if i do the mean here is how i would do it on a given day. 
i would so the first website i've gone to and find out how much time i spend their 
second how much time i spend 1/3 how much time i spend their fourth how much time i spent 
it and it had this up and i divide that's the mean right? 
what would be the median? the median would be 
i look at all those times and i sought it and it put this in which is going to be larger. 
it depends, is correct. but in this particular case, 
think of your typical browsing habits. now everyone's browsing habit is different, 
but you think of it; and network people, who deal with network traffic, deal with this problem on a regular 
basis. so here is what usually happens. 
most of your sessions are actually quite short. for 
example, a query: you go to a website and you 
post a query, or you go to your gmail and you check whether there's 
been a new email, or you go to a favorite website, a news 
site, and see whether something new is there or not. most of the 
actual pages you visit you don't spend a lot of time on 
but sometimes you go to a website and you spend a lot of 
time on it. let's say you write an email. let's say you see a video. 
so what does your data look like? many small numbers and 
a few big numbers. this is 
what is called a heavy-tailed distribution. the distribution, or the histogram, sometimes looks like 
this: heavy 
tailed. this end is called a tail of a distribution. 
a tail to a statistician is not an animal thing; a tail 
usually refers to the end of a distribution. this is some 
kind of heavy-tailed distribution, and network traffic is a 
typical example of heavy-tailed. so now here is what happens: 
in this particular case the mean and the median are carrying very different kinds 
of information. the median is essentially 
saying, for a typical website that you go to, how 
much time do you spend on it. now 
if that number is low, that is an indication that 
most of the time you are, shall we say, cruising 
or browsing. on the other 
hand, if you're looking at the mean and that number is high, 
then you know that you're spending a lot of time on certain very specific websites. and 
this points to two very different kinds of people. so 
the mean and the median are carrying different kinds of information with them both useful. 
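a small sketch of the browsing example above, using made-up session lengths (the numbers are hypothetical and only meant to have the heavy-tailed shape described: many short sessions and a few long ones). it uses only python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical session lengths in seconds for one day of browsing:
# many short visits and a few long ones (a heavy right tail).
session_seconds = [5, 8, 10, 12, 15, 20, 25, 30, 40, 1800, 3600]

typical = median(session_seconds)   # middle of the sorted values
average = mean(session_seconds)     # total time divided by number of sessions

print(f"median session: {typical} s")
print(f"mean session:   {average:.1f} s")

# Dropping the two long sessions barely moves the median
# but collapses the mean.
short_only = session_seconds[:-2]
median_short = median(short_only)
mean_short = mean(short_only)
print(f"median without long sessions: {median_short} s")
print(f"mean without long sessions:   {mean_short:.1f} s")
```

the median describes the typical session, while the mean is pulled up by the few long ones, which is why the two numbers answer different questions here.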
so, to answer your question: it depends on what you're going after. and 
in certain things you will see one of them naturally used as opposed 
to the other. there's also a third one called the mode, which is actually 
harder. when we were students it was mean, median, mode. and 
the mode is the peak of the distribution: what is most likely. and 
the reason the mode isn't talked about much is because the mode actually algorithmically is 
very hard to get at. the mean is a very simple algorithm. 
the median is a very simple algorithm. the mode is a harder algorithm. you 
can think about how to write a program for the mode if you want to. it's 
a much harder algorithm. so the mode essentially what is the mode of this distribution, 
for example? so let's take a look at one of 
them. this is income 
for men. what is the modal income? the 
modal income is here. it's somewhere around 55,000, 
where this maximum is. correct? 
for women it's here, maybe just less than 50,000. so 
you understand what the mode is: it is the highest frequency 
or the most common value. but in practice 
that's actually really difficult to do if i give you a set of numbers. how will you 
calculate the mode? but 
when you see a spike, what is the spike? so i'll give you all your ages. how 
would you calculate the mode? one possibility is you look 
at the ages and ask which age is the commonest, with 
a count of the ages. but that almost means that your data is 
not numerical. you're almost 
thinking of the data as being categorical, because you're counting how many observations there are at each value. 
the idea of a numeric is that it is sort of continuous. it's not chunked that way. so 
for data that is chunked up or categorical you can easily calculate the mode; for 
something that is not, you cannot. and so the mode therefore has become less fashionable 
because it's not a very easy thing to go after. when we were in college, 
the mode was something actually quite easy to calculate. here's the way we would calculate the mode: here 
is a histogram, and the way we would calculate the 
mode is this. we draw a line from here to here, we draw a line from here 
to here. take the highest class, 
draw this cross line, draw this cross line, and here is the mode. this 
is the way we would do the mode in the pre-computer era. i 
went to college where we didn't have any laptops and things like that; running 
a program meant running to the computer center with pieces of paper. so 
many of these things were done by hand, and this is something that is easy to do manually. this is 
not that easy to do on a computer. the 
logic is twisted. you have to figure out what the bin width 
is, and therefore one person's 
estimation of the mode and another person's estimation of the mode will be different from the same data set. that 
is not true for the mean or the median. and 
as soon as two different people find different answers to the same question, 
you know there's a problem with the statistic. so therefore 
the mode isn't done as 
much these days. these are the histograms, sort of the histograms of my data. 
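to make the bin-width problem with the mode concrete, here is a minimal sketch. the ages are invented, and `histogram_mode` is a hypothetical helper that simply reports the midpoint of the fullest bin; it is not the hand construction with crossed lines described above.

```python
from collections import Counter

# Hypothetical ages; the values are made up for illustration.
ages = [22, 23, 24, 24, 25, 26, 27, 28, 28, 29, 31, 33, 36, 38, 41, 45]

def histogram_mode(values, bin_width):
    """Midpoint of the histogram bin that holds the most values."""
    # Each value is assigned to the bin starting at a multiple of bin_width.
    counts = Counter((v // bin_width) * bin_width for v in values)
    fullest_start = max(counts, key=counts.get)
    return fullest_start + bin_width / 2

mode_width_5 = histogram_mode(ages, 5)
mode_width_10 = histogram_mode(ages, 10)
mode_width_4 = histogram_mode(ages, 4)

print(mode_width_5, mode_width_10, mode_width_4)
```

the same data gives three different modal values depending only on the bin width chosen, whereas the mean and the median of these ages do not depend on any such choice.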
this is a way of separating out the histograms. in other words, looking at these histograms 
by a different variable: column equal to income essentially says which variable, 
the by says which gender. and they go side by side because they essentially tell you 
as to what the difference in the distributions is. so what does this tell you? 
i could have plotted this in one plot as well, or the code could have, 
but this says that there is a little bit of a difference between the male and the female distributions 
in shape as well as in the actual value, so to speak. and so from a descriptive 
perspective, you can keep doing analysis of this kind to see whether there is a difference 
not just in the gender variable, but in other variables as well. do 
people travel the same amount of miles on different 
devices? a plot like this will tell you how to compare 
these. what we can do in lecture, 
or you can do as an assignment after that, is you can see: is there a statistical 
difference between the miles that 
are traveled between the different products? 
in other words, is there a difference between these three products in terms 
of how much usage they see? you can compare three distributions and 
we will compare three distributions in time. okay. 
now the last thing i want to talk about today: 
we've talked mostly about univariate, which is one variable. we 
saw a little bit of a plot, but i want to talk about, shall we say, bivariate. 
bivariate means two variables at a time. if you want 
to talk about many variables at a time, that's called multivariate. but 
before we get to many, let's get to two. 
we've looked at two notions. we looked 
at the notion of location. location means 
that if there is a distribution, what is its middle, and that can be mean or median. 
we have looked at variation, like standard deviation, range 
and interquartile range. but when i look at distributions of two variables, 
there's a little bit more to it. there is a relationship 
between the two variables that i want to be able to capture: a sense 
of relation or a sense of correlation. how do i measure 
whether one variable is related to the other variable or not? remember, 
i'm still describing. i'm still trying to find a number, 
like a mean, like a standard deviation. i'm trying to simply describe with a number: 
if that number is this, correlation is high; if that number is this, correlation is 
low. what should that number be? there are many, many ways of defining 
such a number. here is one, and it is there in 
the book. so i'm going to do this slightly 
abstractly. so i've got numbers that look 
like this: x1, y1. these 
are my points. so for example if 
i look at, say, a plot, here, 
take one of these. this 
is, say, miles and income: the amount of exercise done 
and income. each of these points has an x-coordinate 
and a y-coordinate. these coordinates i'm 
calling x1, y1; x2, y2; x3, y3; 
x4, y4; up to xn, yn. you understand, these are 
pairs of observations. this 
is, say, x1, y1. this is, 
say, x2, y2. this is, 
say, x3, y3: pairs of observations this way. x 
bar is what? 1 
over n times x1 plus dot dot dot plus 
xn. and i'm going to write this, simply because i'm going to try something a little more complicated 
now, as 1 over n summation, i equal to 1 to n, of x 
i. if you don't like the sigma notation, that's fine. 
you can write it with dots. why add 
a little complexity here? y bar, which is the average, 
is similarly 1 over n summation, i equal to 1 to n, of y i. 
i'm going to write something here. i'm going to write summation of x i 
minus x bar, times y i minus 
y bar. i'm 
going to write that down. i'll tell you why i'm writing that down. but look 
at that. what 
is x i minus x bar? it's 
sort of like a variation or a spread of x i from its 
average. similarly y i minus y 
bar, okay. when is this 
product, x i minus x bar times y i minus 
y bar, when is it positive? when both of these are positive 
or both of them are negative. now, both 
of them are positive means what? both of them are positive means x i is 
above average and y i is above average. 
both negative means x i is below average 
and y i is below average. so 
imagine a data set that looks like this. 
where is x bar and y bar? somewhere in the middle here. here 
is one line and here is another line. for 
all the points here x i is 
above its average and y i is above its average. for 
all the points here x i is below its average and 
y i is below its average, which means most of these terms 
are going to be positive. i may still have a point, for example 
say this point, where it is negative. but 
when my data looks like this, this number will be positive. 
what happens when my data looks like this? when 
the data looks like this, then 
y i is above its average and x i is below its average; 
that is, one of these is positive and one of these is negative. that means this guy is 
negative. so when the data looks like 
this, this becomes negative. what 
happens if the data looks like this? 
the positives and the negatives will cancel out. this number being negative means 
when one is high and the other is low. for example, let's say 
height and weight. height and weight means what? the taller you are, the heavier 
you are: a positive relationship between the two. my 
doctor says that i am about four or five kilos overweight. i 
say no, doctor, i'm about two inches too short. i 
don't have a weight problem, i have a height problem. it's a matter of interpretation. so 
if you want to get a statistic 
that captures whether your variables are moving together or 
in opposite directions; opposite directions, for example, might be something like, say, 
weight of a car and mileage of a car. bigger cars 
have less mileage, which means 
that if you have an above average weight car, that 
probably has a lower than average mileage. 
so this particular measure, this 
summation, when i divide it by n minus 1 
to take an average effect, this thing is called the covariance 
of x and y. this 
is called the covariance of x and y. covariances are 
very heavily used in certain areas. they're heavily 
used, for example, in dimension reduction, in principal components; 
you'll see that in time. they're used in finance, 
in portfolio management and things of that sort. this is called the covariance of x and 
y. what is the covariance of x and x? 
the covariance of x and x, which 
means instead of y i'll just put x: this 
becomes 1 over n minus 1 summation, i equal to 1 to n, of x 
i minus x bar into x i minus x bar, which 
means x i minus x bar squared, which 
is the square of the standard deviation. this is sometimes called 
the variance of x, which 
is the same as the standard deviation of x squared. 
so the thing before i took the square root, that's called the variance. 
with the square root it is called the standard deviation; without the square root it is called 
the variance. by the way, it's all there in the book. so 
in case you didn't get it, you can see the video or you can read the book. these are very standard definitions. 
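written out, the quantities just described on the board are the following. the symbols `cov`, `var` and `s_x` are notation chosen here for the covariance, the variance and the standard deviation; the board itself is not visible in the transcript.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i

\operatorname{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

\operatorname{cov}(x, x) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
                         = \operatorname{var}(x) = s_x^2
```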
so the covariance is a measure of the nature 
of the relationship between x and y if the covariance is positive, they're 
moving in the same direction if the covariance is negative, they're 
moving in opposite directions if the covariance is 0 then 
many things can happen. either the data looks like this, there's no 
relation, or maybe the data looks like this. this is not 
a normal distribution. this is not a distribution at all; this is price and 
profit, for example. what is 
usually the relationship between price and profit? by the way, this 
is a typical relationship between price and profit. on this side, as price goes up, your 
profit increases because you're getting more money per product, and 
on this side, with an even higher price, fewer people buy your product, 
so your profit goes down. now for such 
a thing, the average is somewhere here, so the correlation also becomes 
0. another way to think of it is that it is positive on this side and negative on this side. so 
if this is 0, it doesn't mean that there is no relationship. it could mean that there is a 
complicated relationship, something that is positive on one side and negative on the 
other side. and that's it. i 
once remember doing an analysis in which we were trying to find out, 
this is about attrition, why people leave companies. and inside it there was a model 
in which we were, for some reason, trying to find out the relationship, or 
trying to understand, where people stay: do they stay 
close to the office or do they stay far away from the office? and 
what do you think is the relationship between, say, experience and 
distance to home? we 
had normalized for that; in other words, think of it 
as just experience. we were looking at populations in which experience loosely translates 
to age, but there could be people who join the company very old; i 
agree with that. but simplify and say that you have a data set in which 
you have experience, and here's what we found: 
that early on in their careers people live close by, in 
the middle they moved away, and towards the end they again became 
closer. now this was an observation. 
there's no science to this. this was just simply seen: in that particular 
company this particular thing would happen. but remember, the 
point is not just to describe. the point is also to predict, to understand, and things like that. so we 
had to build a story around this when we went to the cmd and said, you know, here's what 
we had done. so you can make some story around this, and the story 
we made up, correctly or incorrectly, i don't know, 
is that in the beginning, to some extent, people have low dependencies, 
typically: you're unmarried, a bachelor, et cetera. you also need 
to work a lot harder, so staying close by is convenient. you 
get a pg or you get an apartment, you stay close to work, 
because staying far away from work gets you no particular benefit; it is just 
inconvenient. but as you reach, in some way, middle age, 
so to speak, things become very complicated. there is a spouse, who 
may have a job. there are kids, there are schools, there are the 
kinds of houses that you can afford. and so this becomes a more complicated optimization problem 
and you may not be able to find a solution to that problem close to work. but 
people who survived even longer in the company earn enough to solve this problem 
through other means, and then what 
happens is they move back close to work again, you know, buy a villa 
close by, et cetera, and now there are multiple cars to take 
people elsewhere. kids are often grown up, so the number 
of dependencies is a lot less. you may agree with the story, you may disagree with the story, 
but the point is that there's a complicated relationship you're trying to explain based on 
what the data is now the use of it. i will talk about much. so 
this number is a number whose sign, positive 
or negative, tells you about the nature of the relationship. but 
only the sign tells you; the value is much harder to interpret. 
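a minimal sketch of how the sign of the covariance behaves, assuming the n minus 1 definition given earlier. all the data below is invented for illustration, and `covariance` is a helper written here, not a function from the lecture's code.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance: sum of (x - x_bar)(y - y_bar), divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical data, invented for illustration.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 77, 88]               # rises with height
car_weight_kg = [900, 1100, 1300, 1500, 1700]
mileage_kmpl = [22, 19, 16, 13, 11]            # falls as the car gets heavier

price = [1, 2, 3, 4, 5]
profit = [-(p - 3) ** 2 + 10 for p in price]   # rises, peaks at price 3, then falls

cov_height_weight = covariance(height_cm, weight_kg)
cov_car = covariance(car_weight_kg, mileage_kmpl)
cov_price_profit = covariance(price, profit)

print(cov_height_weight)   # positive: the two move together
print(cov_car)             # negative: the two move in opposite directions
print(cov_price_profit)    # zero, even though profit clearly depends on price
```

the last case is the price and profit picture: the positive products on the rising side cancel the negative products on the falling side, so a zero does not mean there is no relationship.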
the reason is because i can measure these things in whatever units i want. 
suppose i am measuring, say, height and 
weight, and i measure height in centimeters and 
weight in kilograms. that's one answer. but i can measure height 
in feet and weight in pounds and get a different answer. i can even 
make this number much higher by measuring height in millimeters and 
weight in milligrams. why would i do that? but i could. so 
this as a value is entirely dependent upon the units of measurement, 
which makes it a problem. so what statisticians do when they reach this situation 
is they normalize things: they make the unit go away. the way 
the unit goes away is you divide this by the standard deviation of 
x and you divide this by the standard deviation of y. 
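the unit problem and its fix can be checked numerically. this is a sketch with invented heights and weights; `covariance` and `correlation` are helpers defined here, with the normalized quantity computed exactly as described: covariance divided by the two standard deviations.

```python
from statistics import mean, stdev

def covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by both standard deviations, so the units cancel."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical measurements, invented for illustration.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 77, 88]

# The same people measured in millimeters and milligrams.
height_mm = [h * 10 for h in height_cm]
weight_mg = [w * 1_000_000 for w in weight_kg]

cov_small = covariance(height_cm, weight_kg)
cov_big = covariance(height_mm, weight_mg)
r_small = correlation(height_cm, weight_kg)
r_big = correlation(height_mm, weight_mg)

print(cov_small, cov_big)   # the second is ten million times the first
print(r_small, r_big)       # the same number, whatever the units
```

changing units rescales the covariance by the product of the conversion factors, while the normalized value stays put.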
now i can do this on the board without writing anything again, but i would suggest you write the whole formula 
again. when i divide this by the standard deviation 
of x and the standard deviation of y, now the units cancel out. now 
this value becoming one means x i is one 
standard deviation above average, in whatever be its units, 
and y i is, say, two standard deviations above average, in whatever be its units. the 
unit has gone away. this number is called the correlation between x and 
y. and 
the correlation between x and y is a number between, i'm sorry, is a number between minus 
1 and 1. the correlation 
is between minus 1 and 1. if 
the data looks like this, then it is 1; if 
the data looks like this, then it is minus 1. this 
is the correlation. it 
is a measure of the relationship between two variables measured 
in this very peculiar way. it is not just a measure of the relationship. 
it is a measure of what i would say is the linear relationship between x 
and y. a nonlinear relationship or a strange relationship could 
cancel out positive and negative and end up with zero or a low number. so 
if the correlation is close to plus 1, there is a strong 
positive relationship between the two. strong positive relation means what? if one of the variables 
is above average then the other is also very likely to be above 
average, and vice versa. so 
what i can do is this: 
when i take my data and i do .corr as a 
function, this 
gives what is called the correlation matrix. again, 
it will calculate it only for the things with numbers. 
in other words, if you give it a data frame and this doesn't happen, then just 
make sure that you only take the subset of it which has only the numbers. do not 
calculate correlations for things that aren't numbers. if they're not numbers, there are other 
ways to calculate association; we'll see that later as well. now 
based on this, what do you see? first of all, the correlation 
between age and age is 1. why? well, 
it is the 45 degree line, right? by definition it is one. 
this is a number that comes from one data set with 
one kind of relationship. does that say anything about the practical 
world? that is another way of saying what i have been saying all along: how 
does your data have anything to say about these relationships 
outside the data? the 
problem is clearer here, maybe, but the problem exists for anything. so 
for example, there is a correlation of .28 between 
education and age. point 
two eight means that there is a positive relationship but 
not a very strong one. where is that? 
where is that graph? this is age. education 
was the second one, right? so this 
one, right? or this one, whichever way. this 
shows that there is a weak positive relationship between them. 
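the lecture's `.corr` call presumably produces this kind of table directly from the data frame. as a library-free sketch of what such a correlation matrix contains, the block below builds one by hand for three numeric columns. the column values are invented and are not the lecture's data, so the numbers will not match the .28 quoted above.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Covariance (n - 1 divisor) divided by both standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Hypothetical numeric columns; non-numeric columns would be left out.
columns = {
    "age":       [22, 25, 31, 38, 45, 52],
    "education": [14, 16, 15, 18, 16, 17],
    "usage":     [3, 4, 2, 5, 3, 4],
}

names = list(columns)
matrix = {a: {b: correlation(columns[a], columns[b]) for b in names} for a in names}

# Print the matrix with a header row, two decimals per entry.
print(" " * 10 + "".join(f"{name:>10}" for name in names))
for a in names:
    print(f"{a:>10}" + "".join(f"{matrix[a][b]:>10.2f}" for b in names))
```

every diagonal entry is 1, because a variable plotted against itself is the 45 degree line, and the table is symmetric, because the formula treats x and y the same way.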
when one goes up, the other does have a slight tendency of going 
up. now i should warn you that there is no sense of causation 
here. there is no sense that if x goes up then 
y goes up, because the correlation of x and y is the same as the correlation of 
y and x. by definition this 
is a symmetric concept. it makes no attempt at causation. that's 
a different thing altogether. so 
this is a positive relationship. it's a 
weakly positive relationship. 
usage and education is about point 
four, income and education is about .62, 
miles and usage is about .48, miles 
and fitness is about point seven eight. let's see miles and fitness. 
this is miles and fitness. nothing 
in this data set has a negative correlation, but you might have seen it if one was negatively 
correlated to the other. close 
to zero: you're looking for low correlations, 
right? so age and usage, for example, is a very low number. so 
is age and miles. in other words, age doesn't seem to have much to do with 
things, shall we say, other than income, 
but age and income doesn't really have much to do with your product per se. 
it will be useful when you do clustering later on; variables like that are useful to find 
segments. rich 
old people, always an interesting segment. yes. 
what does a correlation of zero mean? that there is no relationship between the variables? 
it could mean, for example, a plot that looks like this. let's 
take a variable. so closest to 0 is what? age 
and usage. so 
age and usage is where? usage, 
and this one. so 
this is age and miles; that also is something low, probably. this 
one. now, there's no relationship between 
them, in the sense that, well, there probably is a relationship in the variability; 
in other words, there is more variability here than here. but if i want to draw a line through this, 
the line doesn't have a positive slope or a negative slope. there's 
no idea that says that if one of them is above average the other is also likely to 
be above average. so low correlation means that there is no 
sense that one being above average relates to the other being above average: neither 
increasing nor decreasing. correlations 
in toto are really hard numbers to interpret, but they're also very useful summaries, 
particularly for large data. the 
question that he asks, as to does this make any sense in the real world, has two components to it. 
component 1 is: is the relationship between the two related 
to a linear concept? so 
for example, i was talking about height and weight. what in the real world 
should be the relationship between height and weight? linear? 
so if i plot height versus weight, i should see a straight line? okay, 
now she's going to say not necessarily. removing 
outliers; we're all outliers, aren't we? okay, have any of you heard 
of a concept called the bmi, body mass index? in this 
day and age, we've all heard of body mass index. kavitha, what is body mass index? 
weight by height; no, 
weight by height squared. 
weight by height squared, now. so bmi 
is weight by height squared. so if bmi 
is weight by height squared, what does that tell you about the human body? 
weight by height squared is what is called bmi, and 
this number, let's say, should be around 25 if you are healthy. what 
does that tell you? if you are taller, what will happen? 
how should your weight increase? no, there's a square 
here. 
so roughly, let's say that this is correct. 
if this is correct, what does that mean? it means that weight 
is approximately 25 into height squared if you are healthy. that 
means if i see a bunch of very healthy people and i plot weight versus 
height, i should see a curve like that, not 
a straight line. she's 
figuring this is interesting. now why, if i'm twice as tall, should i not be twice as 
heavy? yes, if you want to give it 
a fancy name, correct, when you plot that, it is a parabola, undoubtedly true. 
so you could argue, why is it weight by height squared? 
that's a different question. why isn't it weight by height? so 
let's suppose, so weight by height means what? let's suppose that these 
two, they're not the same, but suppose that they are. 
so these two: this is a certain height, this is a certain height. 
if i put this on top of this, what happens to the weight? if 
these two are exactly the same, this is going to double. i don't have two of 
them, i apologize. but anyway, okay, so here's one more. so if i 
put this on top of this, it doubles. so therefore if i look at objects such 
as this, then by doubling the height the weight doubles, 
so weight by height remains a constant, correct? so 
if i'm looking at bmi for bottles this way, it should be weight by height. so 
if you are a bottle, your bmi would be weight by height. 
okay. now imagine that you are a football. now 
if you're a football and your height doubles, how much bigger 
would you be? you 
understand the problem: the football is now twice as high. how 
much heavier is it? by a factor 
of what? the height is double; the volume has gone up by what? 
no. no. the volume is what? 
4 by 3 pi r cubed. it has gone up by a cubic 
factor. so now for 
a ball the bmi should be weight by height cubed. so 
you're not growing like a cylinder, and you're not growing like a football. 
you're growing like something between a cylinder and a football. we all are, not 
you personally. so which is why 
it looks like that. babies 
do grow like cylinders. we don't. 
we don't grow like cylinders; if we grew like cylinders we would be a lot thinner. think 
of yourself, imagine yourself when you were, you know, five or six, and now at your current 
height. you'd be looking a lot thinner, right? 
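the scaling argument in this passage (cylinder versus football) can be checked with a few lines. this is a sketch under stated assumptions: unit density, a cylinder of fixed radius that grows only in height, a ball whose diameter is its height, and the standard definition of bmi as weight in kilograms divided by height in metres squared. the function names are made up for this illustration.

```python
import math

def cylinder_weight(height, radius=1.0, density=1.0):
    """Weight of a cylinder that grows only in height (radius stays fixed)."""
    return density * math.pi * radius ** 2 * height

def ball_weight(height, density=1.0):
    """Weight of a ball whose diameter equals its height."""
    radius = height / 2
    return density * (4 / 3) * math.pi * radius ** 3

def constant_bmi_weight(height, bmi=25.0):
    """Weight implied by a fixed BMI, taking BMI = weight / height**2."""
    return bmi * height ** 2

shapes = [
    ("cylinder", cylinder_weight),
    ("ball", ball_weight),
    ("constant bmi", constant_bmi_weight),
]

factors = {}
for name, weight_of in shapes:
    # How much heavier does the object get when its height doubles?
    factors[name] = weight_of(2.0) / weight_of(1.0)
    print(f"{name}: doubling height multiplies weight by {factors[name]:.0f}")
```

the constant-bmi rule sits between the cylinder (power 1) and the ball (power 3), which is the sense in which people grow like something between the two.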
similarly, you don't grow like a football either. imagine yourself 
at five or six and now imagine you grew in every dimension in the same way. you'd be a lot 
fatter than you are now. so therefore this relationship 
depends on the empirical relationship between height and weight for the data that is available, 
which is of humans growing. and so empirically people have discovered 
that this is the object that should be invariant. this 
is an example of what's called dimension reduction: two variables 
are being combined into one which is carrying information for you. but 
it relies on a nonlinear relationship between the two that is not going to be picked 
up by the correlation. so the correlation 
goes so far and no further. it is not one of the more 
analytically useful things. very often we do test a hypothesis: 
is the correlation 0 versus is the correlation not zero, to ask whether the 
correlation is real or what is often called spurious. and in a later 
class, i think about two or three residencies from now, you will spend some time on things like spurious 
correlations: in other words, i find a relationship between x and y, but is it real 
or is it due to something else? it's 
hard as a basis for a decision. it gives you some summary of the data. it 
is at best a descriptive measure of association. 
sometimes people want to see it in another form. this is what's called a heat map. 
it is exactly the same thing as a correlation matrix, except that in a heat map 
it gives you nice colors. it 
gives you nice colors and you can change those colors, so to speak. 
here's the index of what the color is: minus one 
is pale blue, et cetera, and positive is in the other direction. so 
it gives you a sense of what the color is. this is too few a set of variables for a heat map to be useful; 
it helps when you have lots and lots of variables. so 
for example, let's suppose that you're looking at a product catalog, a 
few thousand products, and you're trying to find the correlation between sales 
of those products across time and across geographies, and you give 
a display of that, and so you do a heat map and you find those regions where the 
products are sort of clustering up. we often do it in medicine through what is called microarrays. 
essentially we look at data from genes, and let's say there are thousands of genes, 
and you look at the expression levels of each of these genes and you say these are the genes that have been expressed 
and these are the genes that haven't been. so if you are doing correlations of thousands of 
variables, hundreds of variables, often a nicely arranged 
set of variables with a heat map gives you a good picture of the data. so 
a heat map in this form is exactly the same as the correlation, except 
that it adds colors to the numbers so that you're not looking at the numbers; you get 
a visual picture. so 
the traditional choice of it is hot is related. so 
red is related, and white is not. but 
there are many ways in which you can change the coding of the heat map's colors. 
okay, now comes, to some extent, a tool that is descriptive. 
however, it is the first predictive tool that you will see. i 
will not want to use it like a predictive tool but i'll still show it. 
so let me show you what the end product is. the 
end product is i want to summarize the relationship between, 
say, miles, usage and fitness, variables like 
this. now, 
to predict, in relationships of variables of this kind, here's an equation: 
miles is equal to minus five point seven five plus 
20 into usage plus 27 into fitness. 
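reading the equation above as miles = -5.75 + 20 × usage + 27 × fitness, with the coefficients taken as transcribed, here is a small sketch of what it says. the function name and the example inputs are hypothetical; fitness is assumed to be the self-rated 1 to 5 score mentioned in the lecture.

```python
def predicted_miles(usage, fitness):
    """Evaluate the transcribed equation for a given usage and fitness score."""
    return -5.75 + 20 * usage + 27 * fitness

# A hypothetical customer: usage of 3 and a fitness score of 3.
baseline = predicted_miles(usage=3, fitness=3)
print(baseline)

# Reading the coefficients: one more unit of usage adds 20 miles,
# one more point of fitness adds 27 miles, holding the other fixed.
extra_usage = predicted_miles(usage=4, fitness=3) - baseline
extra_fitness = predicted_miles(usage=3, fitness=4) - baseline
print(extra_usage, extra_fitness)
```

used this way the equation is only a compact description of the data; nothing in it says that usage or fitness causes the miles.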
this is, shall we say, a targeted equation. what is this equation? 
as far as i'm concerned today, this is a description of the 
data. but the description of the data will be used 
in order to predict how many miles my instrument will run. so 
think of what the instrument is. the instrument is going to be, 
say, an engineer-designed instrument, and i'm trying to figure out how much it will be used: how many miles 
it will be used. to do that, i will figure out whether people 
consider themselves fit or not and how frequently they use it, and 
using that i want to get an equation for the number 
of miles this will run. is 
there a descriptive way of getting at that equation? so 
this kind of an equation is what's called 
a linear regression model. this 
is your first model. this 
is going from descriptive to predictive. 
i haven't done it yet. i haven't done it yet. i'm just saying what i'm trying to do. 
self-rated fitness on a 1 to 5 scale. 
i'll get there. okay, so maybe i shouldn't have shown you the output. always dangerous to show 
people output. never show output: moral of the story. so 
what i want to do is keep it deliciously vague. 
so y is equal to beta naught plus beta 1 x 
1 plus beta 2 x 2. i want to fit an equation of that 
type. why do we have multiple variables? i can do it with one variable. maybe 
life is simpler with one variable. you have given me at nasa data via a key 
variable a kia here. so you 
can you can in the code as you see you can have one variable you can have two variables. you can 
see where he's going to be number of variables. i think they've chosen to to say that you know, i 
once had a few we're going from bivariate distribution multivariate distribution 
seated on a bivariate distributions, and then at the end he said now put two equal to n and 
we were telling i'm sorry. it doesn't work. that way if you do it for n, i can put n equal to 2, 
but if you want me to put you through to equal to n and which to do i put with k. so 
he saying i'll show you it for two. but if i show it for to you can do it for one and then you can 
do it for three you can do it for any but we can cite for one also if you want to so 
let's look at that and what what what am i trying to put here and 
trying to put miles here. and 
i'm trying to put usage here. and i'm trying to put fitness 
here. i forget which was where but anyway, these two variables. how 
am i using it to describe? you want to think of it this way: if i give
you three variables, how do i describe the relationship between them?
there are three variables. an equation in the form of something
like that is one way of doing it. now, does
that mean that in reality, as he might say, there is
a relationship between these three things? not necessarily.
maybe, i don't know. correct,
not necessarily. so when you do
linear regression in the future, or any regression for that matter,
there will be three uses of it. use one: it
is simply descriptive. it will simply describe
the nature of the relationship to you. it will make no causal inference,
no sense that this causes this. it will give you no predictive model.
it simply describes, and we'll discuss how it describes. use two: it
predicts. predict means that when i put in another value of x 1
and another value of x 2, i will get a different value of y. which means
that i've looked at data from all of you, and a new person comes into the room with
a new x 1 and x 2, and i'll put her numbers in and
i will predict her y. that is a predictive use of
a model. third, prescriptive: in
order to get a different targeted y, what
changes should i make in my x 1 and x 2? to
get different usage of the equipment, what behavioral changes do i need to make in people
to get them to use it more? an even more complicated use of the same thing.
so the same model, the same principle, can be used for different uses. i
am using it simply as a description, simply
as a way to summarize. not univariate,
not bivariate, but trivariate or multivariate.
i can do that with a three-by-three correlation matrix. but
if i choose to do it this way, now,
what number am i looking for? fitness
is here, then the average number of miles the customer expects to walk or run, and the average
number of times the customer plans to use the equipment. so i'm going to give
it this variable and this variable and try
and get an outcome for the middle one.
the way to do it is something that i won't talk about too much.
so there's sklearn,
which is, you know, one of the learning modules that you use for supervised learning.
from sklearn you import linear model,
and there is a function here called linear regression, which comes
from linear model. you're giving it a y. what
is the y? the thing on the left hand side of the equation. what
is the x? the things on the right hand side of the equation. what
is regr fit? regr fit means regression fit, and
this fits my x and y. does this output something? it doesn't
output anything at this point in time. now
i have my regression coefficients, and my regression coefficients are 20 and
27. my regression intercept is minus 56, and
my equation is miles equals minus 56.54 plus 20 usage plus 27
fitness. how is this interpreted from a purely descriptive perspective?
it means that, for example, if usage
remains the same and my fitness goes up by
1 unit, then my miles goes up by 27.
if my fitness remains
the same and my usage goes up by, say, one hour or one unit,
then my miles goes up by 20. what
does minus 56 mean? if you don't use it at all and
you have zero fitness, you have done minus 56 miles. makes
no sense, but neither does 0 fitness. so
the model is not necessarily written in a way in which this intercept makes sense,
which is why in the software the intercept is not treated as a coefficient.
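a minimal sketch of the fitting step, assuming numpy is available. the data here is synthetic, generated around the equation quoted above, because the course data set is not reproduced in this transcript; the variable names, sample size and noise level are all assumptions made for illustration. it fits miles on usage and fitness by least squares and keeps the intercept apart from the two coefficients that get interpreted.

```python
import numpy as np

# synthetic stand-in for the customer data (an assumption, not the course file)
rng = np.random.default_rng(0)
n = 200
usage = rng.integers(2, 8, size=n).astype(float)     # planned uses, 2 to 7
fitness = rng.integers(1, 6, size=n).astype(float)   # self-rated, 1 to 5
miles = -56.0 + 20.0 * usage + 27.0 * fitness + rng.normal(0.0, 5.0, size=n)

# design matrix: a column of ones for the intercept, then the two x variables
X = np.column_stack([np.ones(n), usage, fitness])
beta, *_ = np.linalg.lstsq(X, miles, rcond=None)
intercept, coef_usage, coef_fitness = beta


def describe(u, f):
    """value of the fitted equation at usage u and fitness f."""
    return intercept + coef_usage * u + coef_fitness * f


# holding usage fixed, one more unit of fitness moves miles by coef_fitness
step_fitness = describe(3, 4) - describe(3, 3)
# holding fitness fixed, one more unit of usage moves miles by coef_usage
step_usage = describe(4, 3) - describe(3, 3)

print("intercept:", intercept)
print("coefficients:", coef_usage, coef_fitness)
```

the same fit could be obtained from a regression routine in a library; the point of the sketch is only that the two slopes are read as "change in miles per unit change in one variable, with the other held fixed", while the intercept is just where the plane sits.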
the intercept is a part of the equation but is not one of the coefficients that you interpret. this
is pure description. how does
it do it, in case you're asking, and
i hope you don't? what it does is this.
it looks at the data, and what is my data? my data
is, say, y, x 1, x 2. and it
looks at y 1, sorry, y i minus beta naught
minus beta 1 x 1 i minus beta
2 x 2 i, whole squared, summed over all the data points. this part,
beta naught plus beta 1 x 1 plus beta
2 x 2, is my prediction from the equation. this y is my actual. which prediction
is the closest to my actual? in what sense? find the difference
between the prediction and the actual, square it, and then minimize
it with respect to beta naught, beta 1
and beta 2. so what are beta naught, beta 1 and beta 2? they
are parameters that are estimated in such a way that,
if i estimate them this way, this plane is the closest to the
data. in what sense? in the sense that
the squared difference between the predicted and the actual is the smallest.
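a small sketch of that criterion, assuming numpy is available. the six data points are made up for illustration. the function adds up the squared gaps between actual and predicted miles for any choice of beta naught, beta 1 and beta 2, and the least squares fit is then compared against a few other hand-picked choices.

```python
import numpy as np

# six made-up customers (illustrative numbers, not the course data)
usage = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 6.0])
fitness = np.array([2.0, 3.0, 4.0, 3.0, 5.0, 4.0])
miles = np.array([41.0, 83.0, 113.0, 101.0, 181.0, 172.0])


def sum_of_squares(b0, b1, b2):
    """sum over all points of (actual - predicted) squared."""
    predicted = b0 + b1 * usage + b2 * fitness
    return float(np.sum((miles - predicted) ** 2))


# least squares estimates of beta naught, beta 1, beta 2
X = np.column_stack([np.ones_like(usage), usage, fitness])
fitted, *_ = np.linalg.lstsq(X, miles, rcond=None)
best = sum_of_squares(*fitted)

# any other plane is at least as far from the data in this squared sense
others = [(-56.0, 20.0, 27.0), (0.0, 20.0, 27.0),
          (-56.0, 25.0, 27.0), (-60.0, 20.0, 30.0)]
for b0, b1, b2 in others:
    print((b0, b1, b2), sum_of_squares(b0, b1, b2), ">=", best)
```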
don't worry, you'll do this again. you will do this
again. this is a very important thing in supervised learning, in
prediction mode. in description mode, all
that is necessary is that it describes the nature
of the relationship between miles, usage and fitness. describes
in what way? in addition to the interpretation of the numbers, there's
also something else interesting here: the positive sign. what does the positive
sign mean? it means that as fitness goes up,
miles goes up; as usage goes up, miles
goes up. it summarizes the relationship between three
variables, treating one of them as an output.
this is a descriptive use of linear regression, as
a way to describe data. is the description
real? to be decided, to
be confirmed, to be analyzed, to be understood.
right, you do not know. it is empirical. it
is based on data. why is it necessarily true? is
there a logical reason why this is to be the case? yes, you can do
it with one. you can remove it. if
i remove it, what happens? you guys can do it: where
it's there, you would remove it here, and instead
of usage and fitness just have one of them there. i have not given
you any idea as to whether the description is good. i've
not told you whether this model is a good model or a good
equation, in the same way that i did not tell you whether the correlation was good
or whether the mean was good. i've not given any quality
assessment to anything. these are ways to describe.
the quality of the model, how accurate is my mean, how good
is my prediction: these are things that belong to inference,
and we'll come to them. we can't answer those questions before we get to probability and have a sense
of the language for it. yes? hmm.
fitness and usage, huh?
that's true. so you're saying
that it doesn't make sense for certain values, which is true.
as i said, i am not saying that this is a good predictive model. what will happen is
you will study a model like this and you will
ask certain questions. what questions might you ask?
for example, here's a question that you would ask:
if i fit a model like that, is this coefficient
that is in front of this variable actually equal to zero? because
if it is actually equal to zero, then there is no relationship between
the output and that variable. so
what we do is we ask for a statement of this kind. if, say, y
i is equal to beta naught plus beta 1 x 1 i plus beta 2
x 2 i, i ask for the statement: is beta 1 equal to 0?
and these are called hypotheses. because
if beta 1 is 0, then this variable should not
be in the model, and therefore this variable has no predictive power over this
variable, which is where the analytics part becomes interesting. but
to answer that question, i need to have a sense of how i know whether
this is 0 or not. and to answer that, i need
to have a sense of what the error around that number is. so this number is
not 20. it is 20 plus minus something, in
the same way that my mean age of 28 was not 28,
it was 28 plus minus something. this is also similarly not 20.
it is 20 plus minus something, and if that plus minus something includes
0, then i can't say that this is not zero. if
on the other hand that plus minus does not include 0, i can say it is
not zero. that is the predictive model, and that's coming. but for now, this is simply a
way to describe data. like
for means, like for correlations, like for standard deviations, so
for linear regression: all of these will now see an inferential
phase to them. the mean must see a plus minus. the regression
coefficient must now see a test: is it equal to 0 or is it not equal to 0?
all these models, all these estimates, will now be put into
an inferential test, an interpretive test: how useful is it for new data? because
just describing current data is not going to be good enough for me. i'm writing an equation
like this. i
want to write miles is equal to
beta naught plus beta 1 into usage
plus beta 2 into fitness.
the code now
tells me what these numbers are. this
number is minus 56, this number is plus 20 and
the third number is plus 27. that's
it. you can call it intercept, bias, what you like.
whatever your term is. yes? yes,
in x just put in another variable, add another variable. it
can be any number. feel free to try it out, and you can do it
now. i won't go on with this on the screen. could i plot it?
if i could plot it i would, but remember there are three variables.
why am i doing this? because
with two variables i can plot it. i can also look at many variables
two at a time and see a correlation. but if i have three variables, plotting
things becomes difficult. if i have 4 variables, plotting things
becomes even more difficult, but you can still do it. i think you have tableau in
your curriculum, maybe, i'm not sure, but
visualization techniques can help you. but if you're going to 10 variables, then plotting is not a way to do
it. so, how do i express the relationship between 10 variables?
by equations like this. what
does it mean? this intercept is: if this is 0 and
this is 0, what is this? but as we have said,
this zero doesn't make sense and this zero doesn't make sense. but
this is simply a line that goes through the data. if i have data that looks like
this, for example, all it does is it fits a straight line. what is
the intercept, where it cuts the axis? does it make any sense? maybe,
maybe not. this is the place where the data makes sense, but the equation is written
so that the line cuts the axis here. right. if
i find a relationship between height and weight and i write the equation
as, say, weight is
equal to beta naught plus beta 1 into height, what
is beta naught? beta naught is the weight of someone who has height
0. makes no sense,
but giving me the freedom to have a beta naught here allows me to get a much better line,
because i can move this line up and down in order to get the best fit. it
allows me extra flexibility. don't worry. in
fitting good models, you will have enough experience in doing this. my purpose
is just to show it to you as a way to describe three
variables in one shot. again, i'm not building a model. you can do it for two of
them, just miles and usage: an equation of
this kind with one x variable. with only one
variable in total you wouldn't do this, because there is nothing to model; you can't write
an equation between one variable and itself. and what is the criterion for doing this? remember
my purpose is not to use this to select which variables
to model. when i'm calculating means and standard deviations and correlations,
i'm not using them to select anything. i'm not saying that i will measure your
mean because you're important, or i'll measure your standard deviation because you're low.
i'm using this as a tool to summarize three or four variables. which
variables to use in predictive mode? you can look for high correlations,
and there are many other techniques that you will learn in order to figure it out. so just
like a mean is a way to do analytics, a correlation
is a way to do analytics, and a standard deviation similarly, so is this. so
what we had done yesterday is we had spoken essentially about
descriptive statistics. descriptive statistics
is the taking of data
and simply describing it, with the later
purpose of either visualizing it or writing a report
or using it for inference and prediction in
later courses or later applications. it is compared
with predictive statistics, or predictive analytics, and
then prescriptive. describing is simply a task
of summarizing a given set of numbers. you will do sessions in
visualization in due course. prediction
is a task that is often a machine
learning or data mining professional's requirement: to
say that if something changes, then what happens? i
should have made a comment that there are two english language words that mean more or less
the same thing. one is forecasting and one is prediction. in
the machine learning world, these words are used a little differently.
forecasting is usually in the context of time: something has
happened in the past, what will happen in the future? i'm
giving you this week, what will happen next week? forecasting is about the future. prediction
is usually used without any sense of time. prediction is like:
i'm giving you an x, you give me a y. i'm giving you one variable,
you give me another variable. so predictive analytics
doesn't necessarily forecast anything, despite
the fact that prediction itself is a word for forecasting. so the words
mean slightly different things. it's a little like, you
know, price and worth mean more or less the same thing, but priceless and worthless
mean different things.
so the words are used in slightly different contexts. so in descriptive statistics,
we had looked at certain ways of doing things. for example, we had looked at what is called univariate
data. univariate means one
variable. for univariate distributions, we had seen
certain kinds of descriptive statistics. some of them were about,
shall we say, location. location meant where is
the distribution, and we had seen, for example, things like means
and medians, which
talked about where the distribution is located. we talked
about things on variation, where
we talked about standard deviation (we will talk more about things like this today),
range, interquartile
range. here
also we had terms like, you
know, the quartiles: the upper quartile, the lower quartile. these
are parameters that are used in order to convey a message to someone,
saying what the data is about. so for example, a
five-point summary talks about the minimum, the 25% point,
the 50% point, the 75% point and the maximum. irrespective
of the number of data points: you could have 10 of them, you could have a hundred of them,
you could have a million of them, you could have a billion of them. it doesn't matter, it is still five numbers. sometimes
those five numbers tell a lot. they tell about location. they tell about
spread. they tell about skewness: is the distribution sort of
tilted towards one side, is there more data on this side than on the other, in terms
of the data spreading out towards the tails.
and there are plots associated with this as well. we will talk a little bit about the plots later. then
we went, towards the end, to the idea of, let's say, bivariate
data. bivariate means that there are two variables,
on which we didn't spend a lot of time. we talked about covariance and
correlation. covariance
is a sense of the variability of two variables together. its
univariate version is the variance, which is the square of the standard deviation.
the scaled version of covariance is the correlation. if the correlation
is close to plus 1, then it means that there is a strong positive
relationship between the variables. positive means if one goes up the other also goes up,
if one goes down the other also goes down. minus means the opposite: as
one goes up, the other goes down. correlation is not to be confused with causation.
there is nothing in the description that says that this causes this. there is no science
to this. this is simple description. the science to it
and the logic to it and the use of it for inference, for
business logic and things like that, will come a little later. for now, we are simply describing.
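the summaries recapped above can be sketched in a few lines, assuming numpy is available. the seven usage and miles values are made up for illustration. the sketch computes a five-point summary, shows that the covariance of a variable with itself is its variance, and rebuilds the correlation as the covariance scaled by the two standard deviations.

```python
import numpy as np

# made-up values for two variables (illustrative, not the course data)
usage = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0])
miles = np.array([40.0, 75.0, 85.0, 100.0, 120.0, 170.0, 200.0])

# five-point summary: minimum, 25% point, 50% point, 75% point, maximum
five_point = np.percentile(miles, [0, 25, 50, 75, 100])

# covariance: how the two variables vary together
cov_um = np.cov(usage, miles)[0, 1]

# the univariate version: covariance of a variable with itself is its variance
var_u = np.cov(usage, usage)[0, 1]
sd_u = np.std(usage, ddof=1)
sd_m = np.std(miles, ddof=1)

# correlation is the covariance scaled by the two standard deviations
corr_um = cov_um / (sd_u * sd_m)

print("five-point summary:", five_point)
print("covariance:", cov_um, "correlation:", corr_um)
```

a correlation near plus 1 here only says that the two made-up columns move together; as the passage stresses, nothing in these numbers says that one causes the other.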
then we had taken an even briefer and perhaps even more confusing
look at a multivariate summary, our first multivariate summary, where we looked at the idea
of linear regression. a
linear regression is an equation of the form y is equal to, say,
beta naught plus beta 1 x 1 plus, and so on, plus
beta p x p, where one variable is written as an equation
of the others. this is merely
done to describe the nature of the relationship between the variables.
correct. it can be used for prediction. it can be
used for prescription if you wanted to, but that is not the purpose here.
our purpose is simply to describe a relationship. why
is this useful? because let's say that you've got three variables, four variables,
10 variables. you need a mechanism to say how these variables
are connected. how do you describe 10 things at a time?
there are graphics out there, famous graphics in history, where you have
many variables being represented on one plot
or one visualization. so visualizing
things is itself a way. for example, we looked at certain kinds of plots. we
looked at, for example, histograms. we
looked at box plots. we looked at pair plots,
which were essentially what are called scatter plots.
these are things
for the human eye to see data, and they have limitations
because we can only see data in a certain way. we can't
see very high-dimensional data visually. we can see up to three dimensions, maybe. for
those of you who are interested in such things, or if any of you are in the graphics world,
you spend a lot of time asking, how can i make people
see things? so how many dimensions can you actually plot? python
itself is good at it, but there are other devices.
for example, let's say that you're plotting. you can have, of course, one variable
as x and one variable as y. another dimension can be
maybe the size of the point: the point becomes
bigger as another variable z becomes larger. it can
be a color, like a heat map: a fourth variable, if it
is low can be blue and if it is high can be red. another may be
the shape of it: lower values are circles, higher values are more
pointy. so there are many ways in which you can get summarization
to be done. when
you do visualization, you'll see other ways of summarizing it. but if you want
to do it as a number, then something like an equation that looks like
this is often a good representation.
how one gets at these beta 1s and beta ps i explained very briefly.
what happens is you form this equation and you take
those values of beta naught, beta 1 and beta p that are closest
in some sense to the data. so if i draw a picture of, say, two
of them, y on x, and i say give me
a line, which line should i take? take
the distance from the line to the points and
make this distance the smallest. get
a line that goes through the data with the smallest distance to the points.
how is small measured? small is measured by the
square of these distances, because distance above the line and distance below
the line are equivalent. so if this
is my beta naught plus beta 1 x and this is my y,
what i do is i look at y i
minus beta naught minus beta 1 x i, squared, summed over i
equal to 1 to n, my n points. this is the sum of the squared
distances from the line, and then i minimize this with
respect to beta naught and beta 1. that is how i get the numbers. but
if you are simply interested in what python or r does, then the program
will simply give you what the numbers are. so what
will you get from it? you will get the value of
beta naught and beta 1: find the values of beta naught and
beta 1 such that this is the smallest.
for different values of beta naught and beta 1 this distance will be different.
for different lines this distance from the line will be different.
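written out, the quantity being described for a single x variable is the following. the symbol S is just a label introduced here for the sum; the betas, x and y are as in the spoken equation.

```latex
S(\beta_0, \beta_1) \;=\; \sum_{i=1}^{n} \bigl( y_i - \beta_0 - \beta_1 x_i \bigr)^2 ,
\qquad
(\hat\beta_0, \hat\beta_1) \;=\; \arg\min_{\beta_0,\,\beta_1} S(\beta_0, \beta_1)
```

each line, that is each pair of beta naught and beta 1, gives one value of S, and the fitted line is the pair for which S is smallest.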
which line will i take? the line such that this is the smallest. how
to get the betas? so find
this: y i minus beta naught minus beta 1 x i, squared. on a
plot, the points already exist.
the points are here. which
line is the line i'm trying to find? here
is a point, here is a point, here is a point, here is a point, here
is a point, right? let's say five points. i
want to describe the relationship between these five points. therefore
what i need to do is i need to find a line that goes through these points.
i want to write an equation of the type y is equal to, say,
drop the betas, y is equal to a plus b x. i want a line
like a plus b x going through those points. there are many lines.
this is one line. this is one line. this is one line. this
is one line. there are many lines. which line will i use to
represent the relationship between y and x? i need a
criterion. so what i do is i say,
let me take a line, let's say that this is the line, and
find out how good it is at
describing the data. now when is it good
at describing the data? when it passes close
to the points, because that is its purpose, to describe the data.
i want to say that this line, without any data points,
is a description of the data. the line's position, that's what
i'm talking about. so i need the value of a and b.
correct. so how do i find the value of a and b? for
every such line, every a and b, i find the
distance of the points to the line.
how many points do i have here? i've got five points, so i've got 5 distances. what
are the points? this is the point x 1, y 1. this is the point, say,
x 2, y 2. then x 3, y 3, x
4, y 4 and x 5, y
5. these are my five points. how
far is the point x 1, y 1 from the line? this
distance. and what is this distance?
this point is y 1, minus,
what is this point?
a plus b x
1, that's
the point on the line. okay,
i can stop here. but if i stop here, what will happen is
that this will become a negative distance and
this will become a positive distance and they will cancel,
or neutralize, as you say. no,
this is the equation of
the line. you
want to know what a is? a is the intercept, b
is the slope, and i want
to find a and b. so
this equation is a plus b x, this point
is y 1, and the length of this segment is y 1
minus a minus b x 1. square it.
plus, for the second one, what is it?
y 2 minus a minus b x 2, whole
squared. do this
five times. correct.
for every line you will get this number. if you want to, you can
take a square root. for every line you will get this number.
this number is the sum of squared distances of the line
from the data. it tells you how far the line is
from the data. the larger
this number is, the further the line is from the data. the
smaller this number is, the closer it is to the data.
if the data is on the line, what is the value of this? 0.
so if every point is on the line, or if the data is itself
a straight line, then this will be 0. so
i have formed this. now i find
the value of a and b such that it is the smallest.
for every a and b i will get a value like this. if i take another
line i will get another value of this. for
every choice of a and b i will get a different distance
from the data. which a and b will i pick? that
a and b such that this distance becomes the smallest.
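the five-point picture can be tried out directly. this is a plain python sketch with five made-up points; it scores a few candidate lines by their sum of squared distances and then, purely for illustration, tries every a and b on a coarse grid and keeps the pair with the smallest score. a real fitting routine does not search a grid like this.

```python
# five made-up points (x, y), chosen only for illustration
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]


def sum_squared_distance(a, b):
    """sum over the points of (y - (a + b*x)) squared for the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)


# a few candidate lines: each one gets its own number
for a, b in [(0.0, 1.0), (1.0, 1.0), (2.0, 0.5), (1.3, 0.9)]:
    print(f"a={a}, b={b}: {sum_squared_distance(a, b):.2f}")

# brute force: try a in 0.0..3.0 and b in 0.0..2.0 in steps of 0.1
best_score, best_a, best_b = min(
    (sum_squared_distance(i / 10, j / 10), i / 10, j / 10)
    for i in range(0, 31)
    for j in range(0, 21)
)
print("smallest:", best_score, "at a =", best_a, "b =", best_b)
```

for these five points the grid search lands on a = 1.3 and b = 0.9, which happens to be the exact least squares line because that pair lies on the grid.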
so, don't you believe
this? choose a and b to
minimize this,
and that is the a and b that the software gives you. this
is called linear regression. does it have
a problem? does it always have an answer? this is why gauss was so successful
and laplace was not: you will get a unique solution, as long as the x values are not all the same. this is called a
convex problem, and this is a convex optimization,
because of the squaring. if you have modulus
values here, there is a possibility that you will not get a single answer. but
because of this square, and because of the nice bowl-shaped
curve that the square function gives you, you will find a unique solution
to this. now,
the system doesn't do it that way. the way the system does it is that
it differentiates this with respect to a and b:
differentiates with respect to a and sets it equal to 0, differentiates with respect to b
and sets it equal to zero, and solves those two equations. it doesn't minimize
by trying out lines. when this becomes very high-dimensional, this
minimization, this differentiating and solving, becomes a very interesting problem
in mathematics and numerical analysis. to
do that you typically need linear algebra. in cases such as this, in
machine learning books, you will often find chapters
on optimization and linear algebra at the beginning of the book, because to represent this, or
something similar to this, as a problem you
often need a matrix representation, and to get a good
learned solution you need an optimization. so
most machine learning algorithms are built that way. for example,
i was saying yesterday that you're going to tell someone to do something: i'm going to tell a car
to behave itself on the road. yesterday while going back, or today
morning, i read that bmw and daimler are setting up a,
you know, 1 billion euro r and d operation somewhere in europe for self-driving
cars, etc. two different industries are trying to go
towards making cars that don't need people: the automobile
industry, as well as the ride-hailing industry, people
like uber and lyft and all these companies. so
now the car is going to have to be told when to go and when to stop.
but how does the car know that it has a good rule? how
does it know that it has learned? and what is good learning as opposed
to bad learning? what is enough learning as opposed to not enough learning? a
computer is stupid. all a computer can do is store a lot of
data and do calculations quickly. computers aren't intelligent. to
make the computer intelligent, you have to give it an intelligent function. you have to say, okay,
run your algorithm such that this thing becomes the highest it can be,
or this thing becomes the lowest it can be, which is an optimization problem.
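the earlier remark that the software differentiates with respect to a and b, sets both derivatives to zero and solves the two equations can be sketched in plain python. the five points are made up for illustration. setting the derivative with respect to a to zero gives n a + b (sum of x) = (sum of y), and setting the derivative with respect to b to zero gives a (sum of x) + b (sum of x squared) = (sum of x y); the code solves that pair of equations directly.

```python
# five made-up points (x, y), chosen only for illustration
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_xx = sum(x * x for x, _ in points)
sum_xy = sum(x * y for x, y in points)

# the two equations from setting the derivatives to zero:
#   n * a     + sum_x  * b = sum_y
#   sum_x * a + sum_xx * b = sum_xy
# solved here with cramer's rule for a 2 by 2 system
det = n * sum_xx - sum_x * sum_x   # zero only if all x values are the same
a = (sum_y * sum_xx - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

# at the solution, both derivative conditions hold (up to rounding)
residuals = [y - (a + b * x) for x, y in points]
check_a = sum(residuals)
check_b = sum(r * x for r, (x, _) in zip(residuals, points))

print("a =", a, "b =", b)
```

the determinant makes the uniqueness condition visible: the two equations have exactly one solution unless every x is the same value. with many x variables the same idea becomes a matrix equation, which is where the linear algebra comes in.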
so what machine learning algorithms almost invariably do is they say:
here is an input and here is an output. give me an algorithm
such that, based on the input, you can come closest to the output.
for example, object recognition. if i'm teaching computer vision,
or if i'm teaching text recognition, let's say text recognition,
i'm trying to understand what the word is.
so the computer is reading something, let's say in handwriting, and
trying to identify that as an english language or a kannada or hindi phrase.
so i'm going to write something down, in my horrible handwriting i'll write
something, and the camera has to recognize what i wrote and
transcribe it into something that you can read. now
how does it know it's done a good job? what it needs to know is:
this is what i think the word is, and this is what the correct word is,
now tell me whether i'm close. if i'm close,
i'm good. if i'm not close, i'm not good. but this
has to work not just for one word. this has to work for thousands of words, so
i must be close to thousands of words at the same time. therefore
i need to measure the distance between my prediction and my actuality over
many, many data points. so what all these
algorithms do is they take your prediction and compare it with the actuality, and
they find a distance between them, and they minimize the totality of the distance
between the prediction and the actual. an algorithm that minimizes that distance is
a good algorithm. it has learned well. so
they all do something like this, where this is the prediction
and this is the actual, and a and b are the parameters
in the prediction. in other words, find a prediction such
that it is closest to the actual.
this algorithm has become very popular. it's probably the single
most popular fitting algorithm out there. this is called least squares.
squares because
there's a square, least because you're minimizing. this is called
least squares, and the least squares algorithm is a very standard way of doing things.
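a small plain python sketch of the idea that the squared-distance measure only needs predictions and actual values. the five points and both prediction rules are hypothetical, made up for illustration: one rule is a straight line and the other is a crude step rule, and the same scoring function is applied to each.

```python
# five made-up (x, y) pairs used as the actual values
data = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]


def total_squared_distance(predict, pairs):
    """sum of (actual - predicted) squared for any prediction rule."""
    return sum((y - predict(x)) ** 2 for x, y in pairs)


# two hypothetical prediction rules of completely different kinds
def line_rule(x):
    return 1.3 + 0.9 * x


def step_rule(x):
    return 3 if x < 3 else 5


score_line = total_squared_distance(line_rule, data)
score_step = total_squared_distance(step_rule, data)

print("line rule:", score_line)
print("step rule:", score_step)
```

the scoring function never looks inside the rule, so the rule could just as well be any other kind of model; the smaller total says which one sits closer to these particular points.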
this has nothing to do with the algorithm itself. the algorithm
can be anything. this itself can be a neural network. it can be a support vector
machine. it can be a random forest. it can be an association rule. it can be any
of your logics. but the question is, if you give me the program,
how do i know whether the program is good? so
i give it what's called training data. training data means i
tell it what the answer is. yes?
so this is a prediction. this line is a prediction, if you want to think of it that way, and the
data points are the actuality. the problem is this:
these data points are also not the actuality. the actuality
is going to come in the future. this
is a training set. it is the data that is being given to the algorithm
to train it, but it is not the data
that the algorithm will actually run on. the car will run on
the road. the car will see its data points, people
and other cars and cows and whatever, for the first time. it will not
have seen that data before, but it will need to know what to do. yes,
so what do you do? what you do is you train the algorithm. what
does training the algorithm mean? training the algorithm means you give it data
for which the car is told what to do. in other words, you give it what they call ground
truth. so you give it the y and you say, here is the data, or
here is a situation, please do the right thing. so
please do it like this: here's a person who's crossing the road, please
stop. here is another person crossing the road, but
he's very far away. calculate the distance, compare it
with your speed and decide what to do. he may
be close enough for you to be able to see, but you may not stop. if
you are driving a car, it's quite possible that you are seeing someone crossing the road,
maybe close to a hundred meters ahead, and you're not slowing down, because you're doing the
calculation that i have a speed, the person crossing the road also has a speed, and by the time
i get there this person would have gone. do not do this at a level
crossing. but we do this all the time.
we do this when crossing the road. i have been doing that: there is a car coming but i'm still
crossing the road. why? because i know that i will be
able to cross the road before it gets there. so the car
needs to be taught how to do these things. so the algorithm is given data like
this and is told, for the training data, get as close as you possibly can to the
training data. and then it's given what's called
test data, and now the algorithm is told, oh, now i'm going to
give you new data, and now you're going to tell me how well you did on
new data. so suppose you
were given a problem like this. i'm not supposed to talk about this here your m and instructors are supposed to 
talk about this but suppose you are given a problem like this. in other words have given 
you a data set and i'm going to tell you that your performance will be judged not 
by this data set but on another data set that i am not giving you. how 
will you what will you do? how will you make your program ready? yes. 
how will you make your program generalizable? so the usual 
way it's done is something interesting you say that okay, you want me to predict 
data that i have not seen i will see if i can do that. so 
what you do is all the data that is available. you take a certain part of it and you keep 
it aside you have it, but you don't use it. and 
now the remaining part of the data you build your algorithm one and now 
you tested on the kind of data that you yourself have but i've kept aside this 
called validation data. and now if your algorithm works on 
your own held out data the data that you are other things not seen you're 
more hopeful that it will work on somebody else's new data. this 
called validation and this entire cycle is often called test validate training 
center or train validate test etc. and you will do this in your 
hackathons. but to do this you'll go to the needs to know 
as to how good it is, and it needs measures like this. there 
are other measures. so for example, if you're classifying, good or bad, positive 
sentiment on a tweet or negative sentiment on a tweet, no numbers, then you do 
not need this. what you need is simply: are you correct or are you incorrect? 
if it is correct, let's say you give yourself zero distance. if it is incorrect, 
let's say you give yourself one distance. in other words, you made one error, and you just count how 
many mistakes you made. but what is a mistake when it comes 
to estimating a number, like say miles and things like that? 
there's no mistake. you are either close or you are far, so you need a measure of how 
close. that's a measure of how close. so 
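the hold-out idea and the two kinds of error measure described above can be sketched in a few lines of python. this is only an illustration under assumptions: the function names, the 25% hold-out fraction and the toy values are made up for the sketch and are not from the lecture's code.

```python
import random

def split_holdout(rows, holdout_fraction=0.25, seed=0):
    """Shuffle a copy of rows and keep a fraction aside as held-out validation data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    # (part used to build the model, part kept aside and not used for building)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def zero_one_error(predicted, actual):
    """Classification: distance 0 if correct, 1 if incorrect, so this counts mistakes."""
    return sum(1 for p, a in zip(predicted, actual) if p != a)

def mean_squared_error(predicted, actual):
    """Numeric prediction: no 'mistake' as such, only a measure of how close you are."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

rows = list(range(20))  # stand-in for 20 rows of data (hypothetical)
train, validation = split_holdout(rows)

# Toy predictions versus ground truth, for each kind of measure.
sentiment_errors = zero_one_error(["pos", "neg", "pos"], ["pos", "pos", "pos"])
miles_error = mean_squared_error([100.0, 120.0], [100.0, 124.0])
print(len(train), len(validation), sentiment_errors, miles_error)
```

the model would be built on `train` only, and the chosen error measure would then be computed on `validation` to get a feel for how it might do on somebody else's new data.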
this descriptive method is used as a criterion for building 
predictive models. this, as is, can be a predictive 
model, but very rarely is it good enough. too 
few things in the world are this simple. 
not all things in the world are simple, yes. but as 
we discussed yesterday, even things like height and weight are not that simple. there 
are complexities to that. so for example, you can have theories. we said, for example, 
you could have, let's say, a savings rate. what is the savings 
rate? the savings rate is the proportion of money that you save. so 
if there is a savings rate, what that would mean is that if 
i take your income data and i take your consumption data, that should form 
a straight line, because you're saving the 
same proportion every month. but it's not. if you go 
home and month by month you figure out what your income was, fairly precisely, 
from your salary or from other sources, etc., and you also plot, again 
as precisely as you can, how much your household spent that month, it 
will have an increasing effect probably, but it is very, very unlikely to be a straight line. 
in certain things you may be going after a law of physics, but the law of physics may 
hold for gravity and not hold for anything else. i remember 
trying to apply this. one-day cricket sort of became 
popular when i was in school or thereabouts, and one calculation was done as 
to how to figure out whether a team is doing well, 
or how well a chase is going. so one possibility is to just simply track 
the score. the other possibility is to say that if you know how 
many runs you're going to get, you tend to begin slow, protect wickets, and then you accelerate. 
so what you do is you build models for that. you build models saying, let me assume that the 
team is going to accelerate constantly, which means that in every 
over that comes later it is going to get better. 
its run rate is going to keep increasing steadily. now if 
its run rate keeps increasing steadily, then when will it reach the halfway 
point? that is the same thing as asking the question: 
if i take a ball and i drop it, how long will it take to get to the halfway point? 
and there is a square root term there. the answer is about 50 divided by the square root of 2, about the 35th 
over or something of that sort. so effectively the logic was: 
if you've reached the halfway point by, let's say, the 
35th over, you are on track. if not, you 
need to accelerate faster. that's using a physical law to 
try and predict something that is not a physical law. 
the laws of physics don't apply to cricket, at least not in this way that i am describing. 
so therefore these laws will get you somewhere, like a straight line, etc., but they 
are approximations. and so what you will do is you will build better versions 
of this when you use it for an actual 
prediction. the same argument holds for things like means, standard deviations 
and many such things. if there's a specific problem you need to solve, you may 
get a better estimate for doing it. yes. someone was 
asking a question. yes. 
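the halfway-point calculation in the cricket example can be worked out as follows. this is a sketch under the falling-ball assumption only: the run rate starts at zero and rises linearly over a 50-over innings, and the symbols r, R and c are introduced here for illustration.

```latex
\begin{align*}
r(t) &= c\,t                       && \text{run rate after } t \text{ overs (assumed linear)} \\
R(t) &= \int_0^t c\,s\,ds = \tfrac{1}{2}\,c\,t^2 && \text{runs scored by over } t \\
R(t_{1/2}) &= \tfrac{1}{2}\,R(50)
  \;\Longrightarrow\; \tfrac{1}{2}\,c\,t_{1/2}^2 = \tfrac{1}{4}\,c\,50^2 \\
t_{1/2} &= \frac{50}{\sqrt{2}} \approx 35.4 \text{ overs}
\end{align*}
```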
yes, yes. yes. so 
there are many ways to do that. one is you just put it in: for 
different values of a and b you find what that number is, and 
then you solve it. if you want to do it the hard way you can still do it the hard way, 
and the hard way will end up being something like this. i'm minimizing, and i 
should not be talking about this, the summation of y i 
minus beta naught minus beta 1 x i, whole square. 
i was using a and b, so say y i minus a minus b x i, whole square. i'm 
going to minimize this with respect to a and b. so essentially what i'm going to do is i'm 
going to call this, let's say, l of a and b. and 
i will say d/da of l is equal to zero, 
d/db of l is equal to zero, 
and i will solve these, and this will give me two interesting equations. and 
my answers will be this. i'll tell you what the answer is. your b hat, 
your estimate of b, will be 
this: summation of x i minus x bar times 
y i minus y bar, divided by summation of 
x i minus x bar whole square. and 
your a hat will be this: y bar 
minus b hat x bar. so if you 
want formulas, these are your formulas, because 
to minimize something is the same thing 
as setting its derivative to zero. now that is also the same as maximizing something, but 
this is where convex optimization comes in: this will have a minimum 
but it won't have a maximum. so by setting it equal to zero, i'm 
going to arrive at the minimum. yes. ignore 
this digression. yes, x and y are fixed, the data is fixed. 
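written out, the minimization described above looks like this. it is the standard least-squares derivation in the lecture's a and b notation, with the data x i and y i held fixed and only a and b varying.

```latex
\begin{align*}
L(a,b) &= \sum_{i=1}^{n} \left(y_i - a - b\,x_i\right)^2 \\
\frac{\partial L}{\partial a} &= -2\sum_{i=1}^{n}\left(y_i - a - b\,x_i\right) = 0 \\
\frac{\partial L}{\partial b} &= -2\sum_{i=1}^{n} x_i\left(y_i - a - b\,x_i\right) = 0 \\
\hat{b} &= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
\hat{a} = \bar{y} - \hat{b}\,\bar{x}
\end{align*}
```

the first equation gives a in terms of b, and substituting that into the second gives the expression for b hat.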
the parameters are varying. x and 
y are fixed for my data, correct. so my a is written 
in terms of b. and so this is a formula. if you 
want a closed form, this can also be written as the covariance of x and y 
divided by the variance of x. so 
if you want to calculate it for two variables, what you need to do is you need to calculate 
the covariance and divide it by the variance. and 
here, y bar minus b x bar, this means 
that the line passes through x bar, y bar. 
the line passes through the middle of the data. we're 
minimizing with respect to a and b. a is a variable. how 
will i do this? so for different 
values of a and b, the distance will look like this. there's a particular value of 
a and b for which the distance will be this. there is another value of a and b for which the distance will be this. 
there's a particular value of a and b for which the distance will be this. of course, 
of course. this is my summation of 
y i minus a minus b x i, whole squared. 
for different values of a and b, i will get this. so 
when i minimize this i get this. you don't need 
to do it. all you need to do, if you want to do it, is this. 
do you have yesterday's code? open 
it. you can do it. that 
now that's it you stop is i think i get the mean then what? 
one use of it is to predict. another use of it is to prescribe. 
there are many uses of it. a third use is to do nothing but simply use it 
to visualize or to summarize the relationship between two variables, correct? 
and we do this all the time. so, for example, 
how do you measure how price-sensitive your product is? 
do you understand the question? you're trying to change the 
price of your product. why would you want to change 
the price of your product? profitability. 
maybe you want to increase it so you get more money. so people 
in marketing often want to understand how sensitive my 
sales are to price. now to do that they come 
up with various kinds of measures. one particular measure is what is called the elasticity of 
demand. elasticity of demand means this: if 
my price changes by 1%, by what percentage 
do my sales change? well, 
if my price goes down, i would expect my demand to go up, but by how much? 
now there are certain assumptions to this. for example, it's assumed 
that the same number works if you increase the price as well as if you decrease the price. so 
this is called the elasticity of demand. 
but what is the elasticity of demand? elasticity of demand is essentially a slope, 
a slope that relates something like this: if i have demand on this 
side, let's say sales on this side, and price on this side, i have this 
negative slope. the slope of this is what the elasticity is. so 
very often you do equations like these in order to simply get at a number that 
has a certain meaning for you. so the slope of a linear 
regression between log sales and log price is 
the elasticity of demand for that product. i mentioned log 
sales and log price because elasticity is defined in terms of percentages, 
a percentage increase in price and a percentage decrease in sales. 
if i don't do it as a percentage there's a problem: now my measure 
depends on my units. is it thousand units 
per rupee or what? it depends on what i'm selling and in what currency, 
and that's not a good measure. so i measure it as percentages, 
but when i measure it as percentages i have to work on the log scale. so there are many models 
like this where the equation itself is used to simply describe 
a parameter, something that tells you a little bit 
about the market, like an elasticity of demand. you're 
not using it to predict anything. you're simply using it as a descriptor, to 
say that this is an inelastic product. if this is an inelastic product, what does that 
mean? it means that if you change its price, there won't be too much of a change in its 
demand. a classic example of that is salt. if 
you change the price of salt a little bit, at least certainly domestic 
salt, there won't be too much of a change in demand. there might be a little bit. but 
certain things are highly elastic: you change the price a little bit and the demand will change 
a lot. and marketing people are very sensitive to this idea, 
saying: is my demand elastic or is it inelastic? 
if i want my prices to go up, then i want the demand to be inelastic, 
because i don't want my demand to go down. 
but if i'm pulling my prices down, then 
i want the demand to be elastic, because i want people to say: your prices are going down, therefore i will buy more. 
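the log-log regression described above can be sketched as follows. the prices and sales are synthetic, generated with an assumed elasticity of -1.5 purely so the answer is known in advance, and the slope helper is a hypothetical stand-in for whatever regression routine you actually use.

```python
from math import log
from statistics import mean

def slope(xs, ys):
    """Least-squares slope: sum of cross-deviations over sum of squared x-deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

ASSUMED_ELASTICITY = -1.5                      # made up for this illustration
prices = [80, 90, 100, 110, 120, 130]          # hypothetical prices in rupees
sales = [5000 * p ** ASSUMED_ELASTICITY for p in prices]

# Slope of log(sales) against log(price) is the elasticity.
elasticity = slope([log(p) for p in prices], [log(s) for s in sales])

# Changing the currency unit (rupees to paise) shifts log(price) by a constant,
# so the log-log slope does not change.
prices_in_paise = [100 * p for p in prices]
elasticity_paise = slope([log(p) for p in prices_in_paise], [log(s) for s in sales])

print(round(elasticity, 6), round(elasticity_paise, 6))
```

on real data the points will not sit exactly on a line, so the fitted slope is an estimate of the elasticity and not an exact value.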
so marketing analytics is very concerned with things like this. so 
therefore sometimes an equation of this kind is built just to describe 
something. so what i'm going to do is go down. and 
since we are going to do this on just two variables, let's pick just two. 
so let's change this to maybe miles, and let me remove 
this. so what 
i'm going to do is i'm just going to do it on one 
of the i suspect one parentheses my den work. i 
suspect this might work x 
has to mean to simply because of the weight centered. because 
it because i have not done anything on this data set now. this 
one is a comment. what do i have as the coefficients 
here? 36 and minus 22. so 
what is my equation based on this? miles 
is equal to minus 
22 plus thirty six point 
two nine, whatever, into usage. okay. 
alright. now let's try to do this manually, if you want to. if 
i want to do this manually, i need to get at each of these things. 
so now i need to find, for example, let's say the covariance between 
miles and usage. how do i do that? tell me. no, 
not a sample. the data is present with me. so now i have things like, for example, 
my data miles, correct? 
so i can calculate things on this. so 
for example i can do, i 
can do this. this is the mean. 
i can do calculations on this. okay. so 
now if i do, say, what 
is the standard deviation syntax? std, 
no dev. okay. this 
is the standard deviation. let's try one more. this 
is the variance. what is the variance? the square of the standard 
deviation. now i want to find 
the covariance of this. how will i find the covariance? 
not necessary. remember i have the correlation function. how did 
i find the correlation function here? i 
found the correlation function from here. so i've got a number of ways to potentially do 
this. one is i can do it with the correlation function, or the covariance 
function in other words. so, for example, i can try doing this. how 
do i write it here? my data. 
this thing here was the correlation. this 
gives the covariance matrix. okay. 
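the summary quantities just mentioned can be sketched with python's standard library. the numbers below are hypothetical and are not the lecture's data set, and the covariance helper is written out by hand here; the lecture's own code appears to call methods on a data frame instead.

```python
from statistics import mean, stdev, variance

# Hypothetical values, NOT the lecture's data set.
usage = [2, 3, 3, 4, 5, 6]
miles = [50, 85, 90, 110, 150, 190]

def covariance(xs, ys):
    """Sample covariance (divides by n - 1, matching stdev and variance above)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

# 2 x 2 covariance matrix, ordered (miles, usage).
# The diagonal entries are the variances; the off-diagonal entry is cov(miles, usage).
cov_matrix = [
    [covariance(miles, miles), covariance(miles, usage)],
    [covariance(usage, miles), covariance(usage, usage)],
]

print(mean(miles), stdev(miles), variance(miles))
print(cov_matrix)
```

the point to notice is that the variance of usage is not a separate calculation: it is already sitting on the diagonal of the covariance matrix.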
now what is the value of b according to my formula? 
covariance of which variables? which are my two variables? 
miles and usage. where is that covariance? is 
it this number, 42.71? right. 
okay. divided by what? variance of what? 
where is it? no, variance of usage. why? 
not y. it isn't 'y', it's 'why' with a question mark. so 
my data usage. 
is 
this also here in the data? it 
is, because what is the diagonal element of this? what 
is this number? usage. 
this is usage. 1.17, 
this number here. this number here is the variance. 
okay. what is in the equation? usage. this 
is my x. this is my y, this 
is my x. so covariance of x and y divided by 
the variance of x. so based on this, what 
is my answer? 
the answer for my coefficient is going to be, where is the covariance here, 
42.71, divided by, 
where is the variance, 1.17. 
say about 36 or something 
of that sort. 36 
point something. where is my value of b? here. 
this is my slope. how 
do i get my intercept? mean 
of y. so 
my data, this 
is the mean of all of them. so what 
do i want? which mean do i want? what is 
my y here? miles. where is my mean of miles? let's 
say 103 point, say, 
one nine. minus, what is my slope? 
say 36 point, whatever, three two, or something of that 
sort, times, what 
is my x? usage, which is three point, 
say, four five. so 
three point 
four five. something will 
work out that way. minus 22. 
what is my coefficient? minus 22. so 
if you want to, you can do this from first principles by 
using that formula. i'm 
not asking you to. you can do it just by running linear regression. 
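the arithmetic just done by hand can be replayed as below. the four inputs are the rounded figures as read out in the lecture, so treat them as approximate; because they are rounded, the results only come out near the fitted coefficients of roughly 36 and minus 22 quoted above.

```python
# Rounded figures as read out in the lecture; approximate by construction.
cov_miles_usage = 42.71   # covariance of miles and usage
var_usage = 1.17          # variance of usage (x)
mean_miles = 103.19       # y bar
mean_usage = 3.45         # x bar

b_hat = cov_miles_usage / var_usage        # slope: cov(x, y) / var(x)
a_hat = mean_miles - b_hat * mean_usage    # intercept: y_bar - b_hat * x_bar

print(round(b_hat, 2), round(a_hat, 2))
```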
but what it is is this: you can also check the units. what is the unit 
of b? the unit of b is the unit of miles divided by the unit 
of usage. what is the unit of the covariance? miles into usage, 
divided by the variance, which is usage into usage. so the ratio of this is miles 
by usage. what is the unit of a? miles. 
this is in miles. this is in miles per usage, and 
this is in usage. so the units all make sense. which 
one? units. what is the unit of the covariance? 
the unit of x into the unit of y. why is that? remember the definition of covariance 
is the product of an x and a y. so this is in the unit of 
the product of x and y. this is in the unit of the square of x. 
so in the product of x and y divided by the square of x, one x cancels out, and this becomes 
the units of y by x, which is what b should be. b should 
be in miles per usage. 36 
means what? it is 36 miles per usage unit. 
that's what b is. b is in some 
units. because b is in some units, we will run into some difficulty 
when we use this in predictive models, because suppose i want to figure out, is 
this number equal to 0 or not? because if this number is equal to 0, 
statistically speaking, then miles doesn't depend on usage. but 
because this is not a dimensionless number, i can make it anything i want 
by simply changing the unit. so that makes the statistics a little hard. 
so i cannot simply look at this number and say whether it is high or low. i can make your height anything 
i want by simply changing the scale. i can make 
your height a million by simply using a small enough unit. so 
simply taking a raw measurement gives you no idea of 
its magnitude. that 
same argument will work for any of those parameters. so therefore when 
we do hypothesis testing, we need to normalize all these numbers by something, and that something 
is typically the standard deviation. so we'll do that. okay. 
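the unit argument above can be checked numerically. this is a sketch on hypothetical data: it shows the slope changing when usage is re-expressed in a unit a thousand times smaller, and then one simple way of removing the units, rescaling by the two standard deviations. that rescaling is an assumption chosen for illustration; the normalization used in hypothesis testing later in the course may be a different one.

```python
from statistics import mean, stdev

def slope(xs, ys):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical values, NOT the lecture's data set.
usage = [2, 3, 3, 4, 5, 6]
miles = [50, 85, 90, 110, 150, 190]

b = slope(usage, miles)                          # miles per usage unit
usage_small_unit = [u * 1000 for u in usage]     # same data, unit 1000 times smaller
b_small = slope(usage_small_unit, miles)         # slope shrinks by a factor of 1000

# Multiplying by sd(x) / sd(y) cancels the units, so both versions agree.
standardized = b * stdev(usage) / stdev(miles)
standardized_small = b_small * stdev(usage_small_unit) / stdev(miles)

print(b, b_small, standardized, standardized_small)
```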
so let's end this. the purpose of this was just to tell you what that regression 
line is. and then there are similar formulas, but as the dimension increases 
it's hard to do this manually. for two of them you can do it manually. for three of 
them it's hard to show manually, which is why i changed it, because i would not have been able 
to do this for two variables. the formula becomes a lot more complicated for two 
variables, which is why people don't 
use the formula for several variables. now 
what i want to do is i want to talk a little bit about probability. this slide deck should also 
be there with you. so you have to cope with uncertainty. 
the idea of probability is to be able to cope with this uncertainty. what is the uncertainty 
that we're talking about here? the uncertainty is that when you observe something 
you're not entirely sure what the value is, not from a measurement perspective, but because 
you do not know what the corresponding population number is. you do not know the truth of the number. another 
sample will give you another number. there is uncertainty. and 
this uncertainty is usually captured by a probability. hmm. 
this is an interesting question. what is the probability that a man lives for a thousand years? the 
empirical probability. what does empirical probability mean? empirical probability means you ask the question, 
has anyone lived for a thousand years? if the answer is no, then you say that 
the answer is 0. if anyone has lived for a thousand years, you'd say, tell me how 
many people have lived a thousand years. so one interpretation of probability 
is simply: you see it. there's 
a criticism of this point of view. one of our teachers, professor 
d. basu, many years ago would say that if you want to find the probability that a little girl 
is going to fall into the river, how many little girls do you want to walk next to the river to find 
out? so in other words, not all probabilities 
can be thought of as, let me just see how often it happened. 
so you need a little bit more than this. so some words. the 
words are often useful to know. probability refers to the chance or 
likelihood of a particular event taking place. an event is an outcome of an 
experiment. an experiment is a process that is performed to understand and observe 
probable outcomes. the set of all outcomes of an experiment is called a sample 
space. this is correct, and it's easy to understand, with one problem. 
who's performing this experiment? well, when you 
use probability, you're in one of two 
modes. in one mode, you are performing the experiment. what does that mean? 
let's say you are running a marketing campaign, or you are designing a portfolio, 
or you are manufacturing a product, or you are recruiting people, or 
you are testing a piece of code. you are doing the experiment. and 
sometimes you are not doing the experiment. somebody else is doing the experiment and you're 
simply observing. the customer is buying or is not buying. the 
product is failing in the field or it is not failing. the portfolio is making 
money or it is not making money. the person you hired 
is staying on or is quitting. it is not your experiment. 
you are simply observing the outcome of it. so 
sometimes you get to do the experiment and sometimes you do not. we 
used to call these things experimental studies and observational studies. an 
experimental study is something in which you begin by designing the experiment 
and you have a handle over how much data you will collect. in an observational study 
you just watch and you see what data comes in. you 
will in your careers mostly be working with observational 
studies because of the nature of data today. there's just a lot more that 
is simply being generated without anyone asking for it. in certain 
very peculiar situations you do experimental studies. for example, nuclear 
explosions, right? why do countries want to test nuclear devices? 
it is primarily to collect data, primarily 
to figure out whether this thing works or not. so they do a little experiment to 
say, boom, let's see what happened. because otherwise it's all computer 
simulations and you've got no idea whether this happens or not. i 
remember running into trouble with my engineering friends on this, working on the 
design of a fairly large aircraft engine. and 
there was a question of, you know, what is the thrust, what is the efficiency of the engine, and 
i stupidly made the observation, why don't we test it out? and 
so they looked at me, this side, that side, etc., as if, you know, how 
do we explain this to this idiot? and eventually 
one of them said to me very kindly, he was a very courtly gentleman, 
older than me, and he took the responsibility of telling me: where 
will it go? so 
his point was that if this engine fires up, it is 
going to want to move. where will it go? pointing 
to the difficulty that i cannot easily do a full-blown 
test of a jet engine. if i do start it, i've got to give it enough room to 
move somewhere. so 
where do you want it to go? so you will not be in a situation to do that 
very often. so when you say experiment, it 
is sometimes your experiment and sometimes it is not. in rare situations will you be 
in an experimental setting, like in a/b testing of websites, for example. it's 
a common job. marketing people are often asked to design websites, and they are asked 
which is a good website. so you do an a/b test. in an a/b test you design 
a website of, say, type a and type b, and you let them loose 
and you find out how people react to the different websites. 
this is a little tricky but i want you to think about this. we will not spend a lot of time on it. 
in a manufacturing unit, three parts of an assembly are selected. we 
are observing whether they're defective or not defective. determine the sample space and 
the event of getting at least two defective parts. what is the 
question that i'm asking? the question that i'm asking is, here's the situation: there are three 
parts. for these three parts we are interested 
in knowing whether they're good or bad. the 
question is asking this: describe for me all the possibilities, 
which is what the sample space is. so what are the possibilities? don't 
talk about probabilities now, just talk about the possibilities of what could happen. we'll talk 
about the probabilities later on. 
good. one way of doing it is this: all are defective, two 
of them are defective, one of them is defective, and none of them are defective. 
if you do it this way, the sample space has four objects 
in it. great, that 
is one way of describing it. one minus? we 
haven't yet gotten to probability. but yes, when you get to it, it will be 1 minus that. 
and the event of getting at least two defective parts means two 
defective or, say, three defective, which is good. so 
this is one way of describing the sample space. this 
is not the way the sample space is typically described. you're not wrong, but 
there's a problem. and the problem is this. so let's 
suppose i described it this way. in other words, 
i describe my sample space as, let's say, zero defective, 
one defective, two defective and three defective. 
let's say these are my possibilities. if i 
do it this way, and this two defective thing is here, 
i will eventually have to get around to calculating probabilities. and 
let's say i want to calculate the probability of, let's say, this event, at least 
two defectives. now, how will i do that calculation? now 
what happens is, the way the probability calculations are done is 
that you try to split this up and say that i'm 
going to find the probability as the sum of the individual 
outcomes, as a sum of individual events. i'm going to split it into individual 
components and then add it up. so therefore i will 
ask you: what is the probability of two defective, and what is the probability of three defective? 
what is the probability, for example, of two defects? so let's say i want to 
find out what is the chance of two defects. how will i find that? how 
will i find the probability of there being two defects in this situation? yes, 
and how will i do that calculation? 
see, you have not allowed me to even think in terms of parts one, 
two and three. there is no one, two and three. there's only zero defective, one defective, two 
defective or three defective. your sample space has lost 
all identity as to which one is defective. so do you want to revise 
your opinion of what the sample space is? how 
do you want to define it now? correct. 
so what you can do is you can define your sample space not in terms of the count of defectives, 
but in terms of whether each individual item is actually defective or not, correct? 
in other words, what you're doing is you're essentially saying, let's say, good, defective, 
good, or g d g. what does this mean? the first is good, the second 
is defective and the third is good. if you do it this 
way, how many elements are there in the sample space? eight, 
because each of these can be good or bad. 
these are your eight possibilities. now 
from this, what can happen is, using these events, 
you can now add them up. now what happens if i am looking at, let's say, 
two defectives? which ones are relevant? say 
this one has two defectives, this has two defectives, 
and so three of them have 
two defectives in them, correct? one of them has no defectives, 
three of them have one defective, three of them have two 
defectives and one of them has all 
defectives. so this is another way of writing the sample 
space. what this will do is this will allow the calculation to 
be a little easier, and your objective is to be able to make the calculation 
a little easier. so 
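the eight-outcome sample space described above can be listed mechanically. a minimal sketch, using the letters g for good and d for defective as in the lecture; the variable names are made up for the illustration.

```python
from collections import Counter
from itertools import product

# Every part is either good ("G") or defective ("D"); three parts give 2**3 outcomes.
sample_space = ["".join(outcome) for outcome in product("GD", repeat=3)]

# How many outcomes have 0, 1, 2 or 3 defectives.
outcomes_by_defect_count = Counter(outcome.count("D") for outcome in sample_space)

# The event "at least two defective parts" is a subset of the sample space.
at_least_two_defective = [o for o in sample_space if o.count("D") >= 2]

print(sample_space)
print(dict(outcomes_by_defect_count))
print(at_least_two_defective)
```

unlike the four-element description by counts, each element here still says which part is the defective one, which is what makes the later probability calculation possible.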
in this particular case, for example, just to get the calculation 
out of the way, let's suppose that the chance of a defective, let's 
suppose the probability of a single defect, let's 
say is 20%. let's 
say 20%, there's a one in five chance that a unit is defective. this 
seems too high, you would not survive. let's 
say ten percent. one in 
ten is defective. if 
one in ten is defective, the probability is now ten 
percent. then what are the chances 
of all of these? i'm 
asking for a common sense answer to the question, we'll get to the concept later on. 
so you understand, the chance that a single one of them is defective is 
10%. and let's say that i want to 
solve this problem. what 
is the problem? the event of getting at least two defective parts. in 
other words, i want to find the probability of, let's 
say, two defectives. what 
is this? let's work it out. it's a good example to work 
out. we will understand many things as we do it. the 
chance of a single defect is 10%. 
i'm asking for the chance that i've drawn three of them and 
i see two of them being defective. this needs a little bit of work. 
let's do this patiently. let's work this out. now 
two defectives can happen in how many ways? 
we just saw it. now let's 
suppose that i want to calculate the chance of two defectives. to 
calculate the chance of two defectives, here is what i claim i can do. i 
can add up the chances of these three. equivalently, 
i can do it this way: probability of two defectives is equal to probability of g d d 
or d g d or 
d d g. is this correct? there 
are only three ways in which i can get two defectives. 
you're okay with this? now 
i'm going to do something really interesting. i'm going to write this as p of 
g d d plus p of 
d g d plus p of d d g. 
let me explain what allows me to do this. what 
allows me to do this is the fact that if this 
happens, these two cannot happen. they 
are mutually disjoint. both cannot 
happen together, which means that if i draw a picture, 
they're like this. so if i want to find the chance of 
being here or here or here, i can simply add 
up the chance of being here plus the chance of being here plus the chance of being here. why 
can i do that? because they're disjoint. there is no common 
thing. if this happens, then this does not happen. these 
are separate things, disjoint. okay. 
now let's look at each of these. probability of g d 
d. this is what event? the first is good, the second 
is defective and the third is defective. i'm going to write 
this as p of g multiplied by p 
of d multiplied by p of d. 
what allows me to multiply? to 
use the technical term here, and the technical term that i want to hear is independent. 
independent means that whether the first is 
good or bad tells me nothing about whether the second 
is good or bad. they are independently good 
or bad. this is an assumption, but i think the problem allows 
me to make that assumption. i'm making a part that is good or bad and making 
another part that is good or bad, and these are independent of each other. 
i'm trying to sell a product to him and to him. whether this one buys or not 
is independent of whether that one buys or not. that's an assumption. it may be true. let's 
say they're from two different neighborhoods. or maybe they're neighbors and i'm going from one 
house to the other. maybe this is not independent. maybe 
if one buys, the other is more inclined to buy. so independence is an assumption. 
in this case, i'm making that assumption. when events 
are independent, i can multiply the probabilities. what 
does that mean? for example, let's say that he will buy a 
product from me, let's say, ten percent of the time. in other words, for 
every 10 people i want to sell my product to, only one person buys. so there's 
a 10% chance that he will buy my product, and let's say there's an independent 10% 
chance that the other one will buy my product. what is the chance that they will both buy my product? 
ten percent multiplied by another 10%. first the first person has to buy, 
and then the second person's 10% will be ten percent of that. so 10 percent into 
10 percent: there's only a 1% chance that they will both buy my product. multiplication 
is allowed when things are independent. so 
independence means that the probability of both of them happening, let 
me rephrase the question in this way. let me write it with one more step. 
let me write this as probability of g 
and d and d. i'm going to write 
this g d d as g and d and d, 
which means the first is good and the second is bad and 
the third is bad. and this i will now write 
as probability of g into probability of d into 
probability of d. in other words, if there's an 'and', 
then i can multiply if 
things are independent. these laws will be clearly written later. 
so if things are independent, i can multiply 
if there is an 'and'. if things are disjoint, 
then i can add when there is an 'or'. common 
sense rules, but they require a little bit of 
logic in calculating. so 
i'm going to take this on take this to the top now. this is going to be what 
this is going to equal p of g into p 
of d into p of d. plus 
p of d into p of g into 
p of b plus p of 
b into p of b into p of g 
now what is p of g? .9. 
90% point nine in two point 
nine into 
point plus. very 
clever thinking into 3. you 
guys are ahead of me here. see dew point 
nine in two point one in two point one. even 
more generally let me we make let me be even smarter than you are. i'm 
going to write this as 3 choose 1. into 
point one to the 
power 2 in 2.9 to 
the power 1 it's a set is sophisticated 
way of writing the same thing. correct. why 
did i write it as 3 choose 1? because why was it three?
in how many ways could there have been two
defectives out of three? so correctly
speaking, i should actually have written this as 3 choose 1 or 3 choose 2, because
either i can say i choose the one good or i can say i choose the two
bad. so 3 choose 2 is the same as 3 choose 1,
whichever way you want to write it. so maybe a slightly better way to do it would be to say this
is 3 choose 2. what are these two? the two
bad, the two defectives out of three. what is
this point one? the chance of one defective.
what is this 2? there were two of them. what is this point
nine? the chance of a good. so
how many bads were there? two. how
many goods were there? one. in how many ways
could i have chosen two bads out of three? three
of them. this is the answer. this
is an example of a distribution called the binomial distribution, which we'll see. and
this calculation you don't need to do yourself. python will
do it for you. like all good things, the look and feel
for many of these classes is: see it once, understand it, and then, you know,
ignore it, because someone else can do the calculation for you.
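
a minimal python sketch of the counting argument, assuming the 10% defect rate from the example. it lists every good/defective pattern for three independent parts, multiplies along each pattern, and compares the total with the 3 choose 2 form. the variable names are illustrative and not from the lecture.

```python
import math
from itertools import product

P_DEFECT = 0.1          # chance a single part is defective (from the example)
P_GOOD = 1 - P_DEFECT   # chance a single part is good

# walk through every good/defective pattern for three independent parts
total = 0.0
for pattern in product("GD", repeat=3):
    if pattern.count("D") == 2:
        # independence lets us multiply the per-part probabilities;
        # the patterns are disjoint, so their probabilities add
        total += math.prod(P_DEFECT if part == "D" else P_GOOD for part in pattern)

# the same number written as (3 choose 2) * 0.1^2 * 0.9^1
formula = math.comb(3, 2) * P_DEFECT**2 * P_GOOD**1

print(round(total, 6), round(formula, 6))
```

both routes count the same three patterns (g d d, d g d, d d g), which is why they agree.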
you need not worry, we'll get to it after the break. but what is the answer? 3 into,
let's say, point 1 into point 1 into point 9. somebody tell
me what that is. i've got no idea. 2.43 percent?
0.0243?
this number, someone verify
on there. 0.027.
okay, then about 2.7 percent, a little below 3 percent.
that is the chance of seeing two defects
when the chance of seeing a single defect was 10%. so
as you can see, this calculation is not about defects
or anything of that sort. this is just a counting argument. this
is a counting argument. so for example, i could have asked the question:
i'm trying to sell a product to three people. my
chance of success is, say, 10%.
what's the chance that i'll be able to sell to two of them? today
i'm a salesperson.
i'm going to three houses today. let's say that i sell children's
books. i've gone to schools and set up my stall
there. i've got three addresses of parents who have been kind enough to say, please
come to my home and, you know, i'm willing to listen to you. so
now i have on my cell phone three
addresses i'm going to go to today. i know that my chances of selling
are not good: optimistically, maybe 10%,
which means that if i try to sell to 10 people, only one will probably
buy. so my chance of success for a single person is about 10%.
so now i can ask myself what is going to happen at the end of today.
how many will i sell? what is the chance that nobody will buy? what is
the chance that a single one will buy? what is the chance that two of them will buy?
and what is the chance that all three of them will buy? so what is the chance that two of them
will buy? two persons: there's
only about a two and a half percent, or roughly two to three percent, chance that i'll end up with two
of these people buying. buying, not not buying. no,
for me, i define the 10% as "buy". so
what i mean is that this calculation does not depend upon whether the event is a defective part
or someone buying anything. what it depends on is the probability of an
event, and you're asking the question: how many times will this happen?
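
a short sketch of the "how many times will this happen" question for the three house visits, assuming each visit is an independent 10% chance of a sale. the helper name `binomial_pmf` is made up for illustration; it is simply the choose-times-powers calculation written as a function.

```python
import math

def binomial_pmf(k, n, p):
    """chance of exactly k successes in n independent tries, each with chance p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

N_VISITS = 3   # three addresses for the day
P_SALE = 0.1   # assumed chance of a sale at any one house

# chance of 0, 1, 2 or 3 sales by the end of the day
distribution = {k: binomial_pmf(k, N_VISITS, P_SALE) for k in range(N_VISITS + 1)}
for k, prob in distribution.items():
    print(f"{k} sales: {prob:.3f}")

# "at least two" means adding the exactly-two and exactly-three cases
at_least_two = distribution[2] + distribution[3]
```

the four probabilities add up to 1, because zero, one, two and three sales are disjoint and cover everything that can happen.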
and that can be a defective part, that can be a sale, that can be the loss of value of
a portfolio, that can be the attrition of a person, that
can be a hit on a website, that can be a click-through, whose rate can be a very small number.
for those of you who are in digital marketing, what's a typical ctr, click-through rate,
if you're in that industry? so what's a typical click-through rate
for you? website clicks and
impressions, but what's a click-through rate? so what a click-through rate typically means is: of the people who pass
through an application, for whom the application
was an impression, as they say, what percentage of them actually click on it? this is
very important for digital marketing. so you're showing ads, all these websites
come with ads etc. someone's paying for those ads and they want to know what the click-through rate is: when they see
the ad, what percentage of people click on the ad? and it's typically a very
small number. have you ever clicked on an ad? no, most normal people
don't. but either way, as an advertiser you
see that the click-through rate is very low. let's say it is point three percent. let's say
three out of a thousand people can be expected to click through.
now, i can ask the question, for example, that if i want to
have, let's say, you know, so how many impressions should i have? that
depends on how many clicks i expect to see. if i want to have,
let's say, a hundred people clicking on my ad, that gives me a rough idea
as to how many i should reach, how many impressions
i should be having. i can also answer a question like this: what is the
chance that i will have less than a
hundred people clicking? so you can ask the question, what is the chance
that less than a hundred people click in a month? how
do i calculate that? with this i can, because what i
need is a reasonable estimate of how many impressions there are in a month. from
that, that is how i get my n, that is my
n. i can then calculate. yes. yes,
i could have, if i wanted to do at least two. right.
if i was solving this question for at least two, then you're absolutely right, i should have
added the last one. the question here, though,
i've written as the probability of exactly two defectives. if i read it as at least
two, you are absolutely right, i should have done that. i could have done that. so how do you
define this? an impression is this:
on a website, if you see the ad, in other words, for a session,
someone has gone to that website and that ad is present on that website at that
point in time, that's an impression. if someone has actually clicked
on that ad in that session, that's a click. so
the click-through rate essentially is: if i'm showing you the impression, are you clicking on it?
so i can look at the number of impressions, because that's the number of times that website
has been viewed and that ad has been on it. i can also calculate the click-through rate, how many of the people
looking at it are clicking. and now we can ask the question: what is the chance that i have less than
this? so then what would i do? this number would
be, let's say, the number of impressions. this
would be, say, a hundred. and
this would be, say, my click-through rate. and this would be
one minus my click-through rate. so my
point zero zero three to the power, say, a hundred, those hundred people who
clicked. then point nine nine seven to the power of
the number of impressions minus hundred, the people who did not click. and
the number of ways in which i can get a hundred people out of, let's
say, i don't know, a million impressions. i do a calculation like that. i
wouldn't do it. i'd get someone to do it for me. so
this is where we are heading. and i'll slow down a little
bit and take you through this conceptually, just to get these terms understood slightly.
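
a rough sketch of the click-through calculation described a moment ago, under stated assumptions: a click-through rate of 0.3% as in the example, and a made-up figure of 30,000 impressions in the month (the lecture only guesses at the impression count). "fewer than a hundred clicks" means adding up the chances of exactly 0, 1, ..., 99 clicks. the terms are computed through logarithms because the choose numbers get far too large to multiply directly; the function names are illustrative only.

```python
import math

def binomial_log_pmf(k, n, p):
    """log of: (n choose k) * p^k * (1 - p)^(n - k)."""
    log_ways = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_ways + k * math.log(p) + (n - k) * math.log1p(-p)

def prob_fewer_than(clicks, impressions, ctr):
    """chance of seeing strictly fewer than `clicks` clicks."""
    return sum(math.exp(binomial_log_pmf(k, impressions, ctr)) for k in range(clicks))

CTR = 0.003            # 0.3% click-through rate, from the example
IMPRESSIONS = 30_000   # hypothetical monthly impressions (an assumption)

p_fewer = prob_fewer_than(100, IMPRESSIONS, CTR)
print(f"chance of fewer than 100 clicks: {p_fewer:.3f}")
```

with these numbers about 90 clicks are expected, so falling short of 100 is the more likely outcome; with a million impressions the same chance would be practically zero.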
so first of all, what is a probability? a probability is a number between 0 and 1.
it is often calculated as a ratio: the number of ways that are favorable
divided by the total number of outcomes. this is not the only way of calculating
probability, and it very rarely works, but it is often a conceptually easy way to
understand. it's a number between 0 and 1. 0 means impossible, 1 means
certain. the probability is a pure number, it doesn't have units. there's the
philosophical question of: is the glass half full or half empty? types of probability,
different ways of doing it. here's what i was talking about as mutually
exclusive events. these are two things that have nothing in common.
mutually exclusive: they exclude each other out. an example: if i'm drawing from
a deck of cards, i can either draw a king or i could draw a queen
or neither of them, but if you are drawing one card, you cannot draw both a king
and a queen. just like if you're drawing
a part, it's either defective or it is not defective. if
you are a physicist you think of schrodinger's cat. and
so physicists have a lot of fun with this. you know the story of schrodinger's cat? schrodinger gave the example
of a cat for things like the position of the electron. so there's a cat
in a box, and this is very unfortunate for the cat,
and there's a vial of poison. now that vial of poison is a little
unsteady. so it could fall down, it could break open, and if it does, then
the box fills up with fumes and the cat dies. so
you know that there's a vial of poison in the box and there's a cat. now
the question is: is the cat dead or alive? it's
a closed box. you know that there's poison in the box and
there is a cat. the question is: is the cat dead or alive?
and the answer is: you do not know until you open the box.
now if you open the box, you can see whether the cat is alive or dead. this is called the
collapse of the wave function in quantum physics, which
means that the event has already happened, but until the wave function has collapsed,
you do not know whether the cat is alive or dead. if you can observe
an electron, the electron is here, but if you're not observing the electron you don't know where the
electron is. it could be here or there or anywhere. so the electron is
buzzing around the room. in physics, this is an important idea, and
a lot of probability theory has come from physical considerations. if things are mutually
exclusive, you can add up the probabilities, as we had said for this king or queen. what
about two independent events? two independent events are events such that
if one of them happens, it in no way influences
the occurrence of the other one. in other words, if he buys, that says nothing about the other one buying. if one of them is defective,
it says nothing about the other thing being defective. let me ask
you a question. let me go back to my previous picture of mutually
disjoint events. are these two events independent? no.
why are they not independent? yes,
if i know that the king is drawn, i know something about whether the queen has been
drawn or not. i know in fact that the queen has not been drawn. these
two are most certainly not independent. so please don't confuse
these two concepts. are we talking
about the same event? i'm talking about king and queen, these two events. so
for example, if i talk about, let's say,
one particular unit being g or d, good or defective, if
i'm talking about one of them, then the picture looks like this. it can
either be g or it can be d. it cannot be both. this
is for one of them. but if i am talking about two of them, then,
say, this can be g 1 and this can be, say, d 2, in other words, the events that
the first is good and the second is defective.
now these two are no longer disjoint. why? because both
these things can happen together. it is quite possible that
the first is good and the second is defective. that's quite
possible. but they're independent. independent
means if i know that the first is good, it tells me nothing
about whether the second is defective or not. so
if your picture sort of intersects, then you
know that you cannot simply add up the probabilities. in fact, you
know a little bit more: you know that if you want to add up the probabilities you can, but you have to somehow take
out this little common part. so
disjoint: two separate things, you add them up,
this or this, not both. when that
situation happens you can add the probabilities up. and you
can also multiply, but for that you
need to assume independence. we will break all these assumptions soon.
this is the simplest possible way to do calculations. we have to get to a little
bit of a nightmare called bayes' theorem. rules for computing probabilities:
this language here, cup and cap, the cup
and cap language, this is from set theory. some
people find it comforting to see that language, others find
it complicating. this one is
called the union. so union
in general means the collection of two things, so you know this:
this is the probability of a or
b. if
there is a common part, then probability of a
or b is equal
to probability of a plus probability of b
minus probability of a and b. the
chance that either happens is the chance that one happens plus the chance
that the second happens minus the chance that they both happen.
if they are disjoint then this becomes 0 because
i know that both cannot happen, but in general
this term stays. this is called the intersection: they
both happen simultaneously. here's an example. what
is the probability that the selected card is a king
or a queen? so this assumes that you know what a
card deck is. so 52 cards, 13
in each of four suits. so how many kings? four kings.
how many queens? four. so what is the probability of a king?
4 by 52, or 1 by 13. what
is the probability of a queen? 1 by 13. what is the probability
of a king or a queen? 1 by 13
plus the probability of a queen, 1 by 13: 2 by 13. the
other way to do it is, if you want to: in how many ways can you get
a king or a queen? 8, which is 8 divided by 52,
which is also the same number. what about
the second one? what is the probability that the selected card is a king or a diamond?
so again, there are two ways of doing this. p of king
or diamond is p of king plus
p of diamond minus p of king and diamond.
let's stay on 52. so
this is 4 by 52, 1 by 13 is also correct, plus
probability of diamond, 13 by 52, minus
king and diamond: there's only one such card, 1 by 52.
this is 16 by 52. another way of doing it
is: in how many ways can you get a king or a diamond? 16 ways:
the whole suit of diamonds, and there are three remaining kings. there
are 13 diamonds, there are four kings,
but i have double counted one of them in both. so
you subtract it once. remind
me what the second statement was, the
question. okay, you're saying
the selected card is a king or a diamond.
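
a small python check of the two card questions, done by simply building a 52-card deck and counting. this is an illustrative sketch; the helper names (`prob`, `is_king`, `is_diamond`) are made up for the example.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs

def prob(event):
    """favorable cards divided by total cards, for one card drawn at random."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_king(card):
    return card[0] == "K"

def is_diamond(card):
    return card[1] == "diamonds"

# king or queen: disjoint events, so the probabilities just add
p_king_or_queen = prob(lambda c: c[0] in ("K", "Q"))

# king or diamond: the king of diamonds is in both, so subtract it once
p_king_or_diamond = prob(lambda c: is_king(c) or is_diamond(c))
by_addition_rule = (prob(is_king) + prob(is_diamond)
                    - prob(lambda c: is_king(c) and is_diamond(c)))

print(p_king_or_queen, p_king_or_diamond, by_addition_rule)
```

counting directly and using the addition rule give the same fraction, 16 out of 52, for king or diamond.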
you draw one card at random from a deck of cards, and you're asking:
is this a king or a diamond? let's say i'm trying to
sell him something. correct? one event that i'm interested in is: is he going to buy my product
or not? the other interesting question is, let's say for example,
is he an it professional or not? correct?
now is there a relationship between these two things? not really.
but i may be interested in the joint probability of them, not because of this event, but
because i want to calculate another event that's interesting to me, which is:
if i know that he's an it professional, can i sell him
something? in other words, suppose
it is not independent. suppose i now know that whether he buys
my product or not depends on whether he's an
it professional or not. let's say i'm trying to sell him a computer peripheral,
and i may be assuming that if he's an it professional he may be more interested in a computer peripheral.
if he's not he may still be interested, but if he's an it professional he may be more interested
in this particular product. in that case what i'm
trying to do is use one seemingly unrelated event as
information about another one. in other words, i'm saying it is not actually unrelated.
so these "and"s become interesting. so effectively
how will my calculation go? my calculation will go this way: if i want
to find the probability that he will buy my product given
that he's an it professional, then my answer will be:
let me find the probability that he both will buy the product and
is an it professional, divided by the probability that he's an it professional. why?
let me first calculate the chance he's an it professional. within that,
let me now find the chance that he will buy my product. probability of a
given b is equal to probability of a and b divided
by probability of b. this trick
is always used in analytics, and we'll do it later. i
have received an email. is it spam or not? which
means: tell me the words and i will tell you whether it is spam.
so now i need to relate the words to spam. so
i have two seemingly unrelated concepts, but what i want to do is i want to say that
if i know one of them, maybe i can get some information
about the other. similarly here, this may be about
a color and this may be about a suit, but if i know about one of them, maybe
it gives me a little bit of information about the other. we'll see examples
later. it can
be both, because i am drawing one card. you're
asking just about this: that's 1 by 52. what
he's challenging is: is this an exclusive or? he's
asking, is this an xor, as in computer science? you
know, he's asking: when i say "or", am i excluding the case
that both are allowed? no.
but the confusion still remains, if he's very picky. so you could say he's making a distinction
between two statements: king or a diamond, and
king or a diamond or both. do i need to specify "both"?
and in that, yes, you are correct.
so his mind works in ways in which the default is the exclusive.
your mind works in ways in which the default is not the exclusive. but
it's a valid criticism to make: that
in the english language, when you use "or", which
"or" do you mean? when i say this
in probability theory, if i say a union b and
if there is an intersection, i include that intersection. set
theory is not confused about this. in set
theory, a union b is just the set, and if there is a common part,
that's in it, and it's in it only once. it is
this region. so what i did was, we
translated this into set theory, and he's saying that maybe i should have been a little more
careful, because there's a difference between this set and the
following set, which is just this part and this part. multiplication
rule: when things are independent, i'm allowed to multiply. example:
there are two subjects. the chance that you will do well in one of them is 70%,
the chance that you will do well in the other is 50%. the
chance that you will do well in both of them, that is, get the corresponding
grades, is the multiplication of the two, which
is 35%. here comes the interesting part. what
happens to events which are not independent?
what happens to the "or", i'm
sorry, what happens to the multiplication? and
there are various ways in which this part is written. so
currently the way the formula is written is: probability of a and b is
equal to probability of
a multiplied by probability of b
given a. this
is the way this expression is written. sometimes it's easy to understand
this way. sometimes it's easy to understand this way: probability
of b given a is equal
to probability of a and b divided
by probability of a. i
want to know what is the chance that b will happen when i've already been
told that a will happen. so first i
find what is the chance that a will happen, and within that
i take the fraction of both a and b happening. this
is the same as saying the top line: a and b is equal to a into b given a. this means
what? this means a and b is: first a happens,
then given that a has happened, b has happened. correct?
if a and b are independent, what do i know? if
a and b are independent then p of a and b is p of a
into p of b. that means that
if a and b are independent,
probability of b given a is
equal to probability of b. stare at that for a while.
if they're independent, then the top will become p of a into p of b, the p of a cancels, and so p
of b given a will equal p of b. but is this not exactly what independence is?
if i tell you that a has happened, i have not changed the chance
of b. that is almost by definition what independence is:
that knowing that one of them has happened has told me nothing about the second one.
knowing that the first unit was defective told me nothing about the
second one. knowing that the first customer bought my product told me
nothing about whether the second one will buy it or not.
these statements are understood in different ways. sometimes
this is a good way to understand it, sometimes this is. but
this is a more general form for doing it. we'll see examples of this. this one needs a little
bit of work to understand. from a pack of cards, two
cards are drawn in succession, one after the other. after every draw
the selected card is not replaced. so you're drawing one, it's like a normal deal, and the
second one now comes after the first one. what is the probability
that in both draws you will get spades? in
other words, you'll get two spades: two
draws, two spades. what is the chance of that? so
here's a structuring of the problem: a is that you get
a spade in the first draw, b is that you get a spade in the second draw.
so what is the chance of a? the chance of a is 13 by 52. that is
the chance that the first one is a spade. now i
want to find a and b, and the way i
do it is this: what is the chance of a, and then what is the chance
of b given a? in other words, i've drawn a spade, and
then what is the chance that i will draw a spade given that i've already
drawn a spade the first time? and the answer to that is one less
of each, because there are now 51 cards left in the deck and there are 12 spades remaining.
so 12 by 51. so the answer is 13
by 52 multiplied by 12 by 51.
what would the answer have been if i had replaced the first card? it
would have been 13 by 52 multiplied by 13 by 52, because of independence.
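
a brief sketch that checks both versions of the two-spades question by listing every ordered pair of draws and counting. the deck representation and the helper `both_spades` are illustrative choices, not something from the lecture.

```python
from fractions import Fraction
from itertools import permutations, product

SUITS = ("spades", "hearts", "diamonds", "clubs")
deck = [(rank, suit) for suit in SUITS for rank in range(1, 14)]  # 52 cards

def both_spades(pair):
    return all(suit == "spades" for _, suit in pair)

# without replacement: the second card must differ from the first
pairs_without = list(permutations(deck, 2))
p_without = Fraction(sum(both_spades(p) for p in pairs_without), len(pairs_without))

# with replacement: the first card goes back, so any pair is possible
pairs_with = list(product(deck, repeat=2))
p_with = Fraction(sum(both_spades(p) for p in pairs_with), len(pairs_with))

# the same answers from the multiplication rule
assert p_without == Fraction(13, 52) * Fraction(12, 51)
assert p_with == Fraction(13, 52) * Fraction(13, 52)

print(p_without, p_with)
```

without replacement the chance is a little smaller, because the first spade has used up one of the thirteen.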
if i put it back, the second draw looks
exactly like the first one, so the knowledge that i had a spade to begin
with has been lost, because i've already put that spade back in.
it is a situation of independent experiments. this
one, however, is the case that
the result of the second depends upon the result of the first. we
are assuming for the second one that we have already picked the first one
as a spade? yes, because that is what is being asked for. what
is the probability that in both the draws you will get spades? so
i'm drawing one and i'm drawing a second one. what is the chance that they're both spades? here's
a similar-ish question. what is the chance
that i will get two adjacent
seats on my flight if i don't pre-book?
yeah, so it's a similar
kind of calculation. why is it a similar kind of calculation? so
you want two adjacent seats, but for two adjacent seats to
be picked by you, there have to be two empty adjacent seats.
now two adjacent empty seats means what?
can you calculate the probability? yes, you can. but when somebody books seats,
let's say that one particular seat has been booked, what happens to the probability
of the seat next to it being booked? so the probability of
a single seat being booked, let's say, i'm
making up a number, let's say the chance of a single seat being booked is
50%. now i'm telling you that one particular seat has been booked.
let's say, you know, 15a has been booked. i'm asking the question:
given that 15a has been booked, what is the chance that 15b will be booked? will it
be 50%? will it be more than 50%? will it be less than 50%? it will
be more than 50%. at least if you're modeling reasonably well, it would be,
because a whole bunch of people will be booking in pairs. they'll
be booking in pairs. so now if i know that one seat has been booked, if i know
that 15a has been booked, now the chance that 15b has been booked
is going to be more than fifty percent, which means that my chances of coming late and
finding two adjacent seats are going to go down, because i'm looking for seats that are
unbooked. the chances
will be more, right? no, the chances will be less, because
people book adjacent seats
more often than at random. so the probability of two
adjacent seats being booked is not the product of the individual seats
being booked. it's more than that. so
the probability of me finding two empty adjacent seats is going to be less. we
are looking for empty seats. so here's an example of
doing this conditional calculation. marginal probability is
a term i'll explain when i do the example. so here's an example.
a survey of 200 families was conducted. information regarding family income per year and whether
the family buys a car is given in the following table. so the 200
data points, 200 surveys, have come and they've been distributed
in a cross tabulation like this. we did a crosstab like this yesterday
as well. this is the crosstab. on one axis is:
did they buy a car or did they not buy a car? and then there is an income statement: income
below 10 lakhs or income greater than 10 lakhs. now why would
i be interested in this data? to
figure out who buys my cars, where
the cars can be sold, and whether that has anything to do with income.
and if it does have anything to do with income, then is high better or is low
better? i don't know. so what i've done is i've arranged my data in this particular way.
and now it's asking a few questions. what is the probability that
a randomly selected family is a buyer
of a car? you don't even need to look at the full table.
this is 80 by 200: probability
of, let's say, car. this is called a marginal probability.
why marginal? because from the picture it's at the margin. it's at the margin
of the table, which is where the term originally came from. this is called a marginal probability.
there are many things going on, but you're asking a question only about one
margin, in this case the margin of the car. you are not interested in the income.
this is called the marginal probability. what is the probability that a randomly selected
family is both a buyer of a car and belonging to income 10 lakhs
or above? both buying a car and income 10
lakhs or above: 42
on 200.
ok. a family selected at random is found to be belonging to
an income of 10 lakhs and above. what is the probability that the family is a buyer of a
car? if the income is
more than 10 lakhs, what is the chance of a car? so
this is probability of car given
greater than 10 lakhs: 42
by 80. interesting, 42 by 80. why
is this 80? you're here, right? this
is 80, that's the sample size. you
understand the logic, but that is exactly the same as this: probability
of car and greater than 10 lakhs divided by probability of
greater than 10 lakhs. why?
because car and greater than 10 lakhs is 42
divided by 200, and greater than 10 lakhs is
80 divided by 200. the 200 and 200 cancel out. this
again becomes 42 by 80. but the logic is absolutely right: this
goes in the denominator because this says out of how many people
am i going to select, and then on the top
is how many are both. this is called a conditional probability.
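
a small sketch of the three numbers from the survey, using only the counts quoted in the discussion (200 families, 80 car buyers, 80 families at 10 lakhs or above, 42 in both groups). the constant names are illustrative.

```python
from fractions import Fraction

TOTAL_FAMILIES = 200
CAR_BUYERS = 80             # margin of the table: bought a car
INCOME_10L_PLUS = 80        # margin of the table: income 10 lakhs or above
BUYERS_AND_10L_PLUS = 42    # inside the table: both at once

# marginal: asks about the car only, income is ignored
p_car = Fraction(CAR_BUYERS, TOTAL_FAMILIES)

# joint: car and high income together
p_car_and_high = Fraction(BUYERS_AND_10L_PLUS, TOTAL_FAMILIES)

# conditional: joint divided by the marginal of what we are told
p_high = Fraction(INCOME_10L_PLUS, TOTAL_FAMILIES)
p_car_given_high = p_car_and_high / p_high

print(float(p_car), float(p_car_and_high), float(p_car_given_high))
```

the 200s cancel in the last step, which is why the conditional is just 42 out of the 80 high-income families.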
this is called a conditional probability, by the way. what is this number? this is,
for example, less than 50%, sorry, greater than 50%. what
is the chance of buying a car? which
is about forty percent. that means if i
did not know your income, i would guess that your chance of
buying a car is 40%. if i did know that your income was
more than 10 lakhs, now your chance of buying a car went up to
over 50%. therefore
it's worth my while to find out whether your income is more than
10 lakhs, because
the sample data tells me that that's going to influence
in a positive direction whether you will buy my product or not. so i'll try to find out.
this, in terms of words, is called the marginal probability, marginal,
and this is a conditional. you might have a little bit
of trouble with these words, but conceptually this is not very hard.
and so this is the calculation that we just did. bayes, when he
originally wrote this paper, as we talked about, nobody understood
him. only after he died did somebody find it in
his papers and say, okay, this is going to take a long time, and then they explained it to others.
let me explain what it tries to do. yes? on
the board, car and greater than 10 lakhs, this
one? okay, this one is a joint probability. this
is a marginal, this is a joint, and this is
a conditional. so a conditional is
a joint divided by a marginal.
a joint is a marginal multiplied by a conditional.
so bayes' theorem's idea is the following. what it
does is it switches which event is being
conditioned on. it switches between a given
b and b given a. now
when would you need to do this? here's an example. you
want to find out whether the email that you're receiving is spam.
do you use gmail? gmail often identifies
things as spam and moves them somewhere. how does it do that?
actually, it looks at the mails and headers and it uses a very, very complicated algorithm. but
let's suppose you are building an application of this sort and you want to do it based just
on the content of the email. so you want the following
kind of program: you want a program that says that if i know the words
of the email i can tell you whether it is spam or
not. which means i want
the following thing. i want the probability of spam given
words. if i tell you the words, can you tell me whether
this is spam or not? this is what i want to do. correct?
but how will i solve the problem? i'll solve the problem by finding the opposite
conditional. what is the opposite conditional? the opposite conditional
is: what is the probability of words given spam?
now why am i interested in this? because this one is easier
for me to do, in the following sense. what
i can do is, let's say in my research lab, i can
collect lots and lots of documents and i can identify
them as spam or not spam. in other words, i can manually
go in and i can tag them. so let's suppose that i have looked at a thousand
of these and have tagged, let's say, 800 of them as spam
and 200 of them as non-spam. or maybe i go after things that are spam and
find five thousand of them and go after things that i know are not spam and find
five thousand of them. now i can solve the opposite problem,
which means that if i know that it is spam, i
know the distribution of words, and if i know
that it is not spam, i know the distribution of words. i can do
this inside my analytics environment.
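
a toy sketch of how tagged emails can be turned around into the probability of spam given a word. the 800 and 200 pile sizes come from the example; everything else is hypothetical: the single word "lottery" and how many tagged emails contain it are made-up numbers, and the overall chance of the word is worked out here from the two tagged piles rather than from a dictionary, which is an assumption of this sketch.

```python
# tagged pile sizes from the example
N_SPAM = 800
N_NOT_SPAM = 200

# hypothetical counts: tagged emails in each pile that contain the word "lottery"
SPAM_WITH_WORD = 240
NOT_SPAM_WITH_WORD = 4

# what the tagged piles give directly
p_spam = N_SPAM / (N_SPAM + N_NOT_SPAM)                    # share of emails that are spam
p_word_given_spam = SPAM_WITH_WORD / N_SPAM                # word given spam
p_word_given_not_spam = NOT_SPAM_WITH_WORD / N_NOT_SPAM    # word given not spam

# overall chance of the word, combining both piles (assumption of this sketch)
p_word = p_word_given_spam * p_spam + p_word_given_not_spam * (1 - p_spam)

# the flip: spam given word = word given spam * spam / word
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"chance of spam before seeing the word: {p_spam:.3f}")
print(f"chance of spam after seeing the word:  {p_spam_given_word:.3f}")
```

as a sanity check, the flipped number matches counting directly: of the 244 tagged emails containing the word, 240 are spam.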
so now i know that if it is spam, this is what the distribution of words looks like.
if it is not spam, this is what the distribution of words looks like. using that,
i will now flip the problem and say: now if you give me the words
i will tell you whether it is spam or not. now,
how do i do that? i do that using this. now
this is a very easy formula to understand. why? because
this formula essentially says this:
why is this equality true? this equality is true
because, let me rewrite it slightly. let me see, what
is the probability of, let's say, spam and
words? what is
the chance of spam and words? in other words,
there is an "and" there. now i'm going to write this like
a and b. but here's the interesting thing:
i can write a and b
in two ways. i can write it as probability of b
multiplied by probability of a given b. but
i can also write it as probability of a multiplied by probability of b
given a. i
have a choice as to which is first and which is second. so
therefore i can write this in two ways. i can write this as
spam given words multiplied by
words, but i can also
write it as words given spam multiplied by
spam. do you understand the trick? but
what does that mean? that means these two things are equal. if
these two things are equal, then rearranging that expression, i now know that
probability of spam given
words is equal to the probability of words
given spam multiplied by probability of spam
divided by probability of words. so
to execute on this, what do i need? words given spam,
which i told you how to get. probability of spam, which is an
estimate of the proportion of emails that are spam. and
probability of words, which has no conditioning
in it. this is what's usually called a lexicon, or
a dictionary. so if you give me a
dictionary of the language, i can give you this denominator. if
you give me, shall we say, an it estimate or a sociological estimate as
to the proportion of emails that end up being spam, i
can give you the probability of spam. if you give me things
tagged as spam, i can find their word distributions. if
you give me things that are tagged as not spam, i can find those too. so
therefore i know the right-hand side, and therefore i know the left-hand
side. and now if you give me the words, i can tell you the probability that
it is spam. so it's either
thought of in the way i just described it, which is sort of flipping
these two probabilities, or sometimes it is described the following way: spam
given words is an update of just the probability
of spam. this probability of spam part is sometimes,
in bayesian language, called a prior, and
spam given words is called a posterior, which means that if i
know the words, i have a better idea as
to whether it is spam or not. if i know his income, if i know he's
an it professional, i have a better idea whether he will buy the product or not. if
i know the income is more than 10 lakhs, i have a better idea
whether you'll buy a car or not. if i know the words, i have a better idea whether
it is spam or not. and to do that, i flip it this
way. and because of applications
like this, bayes' theorem has become very, very central to machine learning. because,
for example, think of the autonomous car. what
is the autonomous car's decision problem? something is crossing the road.
should i stop? in other words, given
cow, should i stop? now
think of the problem that has to be solved to do that. i can flip
it. now, to flip it means what? to flip it means, essentially,
to swap cow and stop. essentially, i now have to tell the program. so
i want stop given cow. so now i have
to solve it by studying cow given stop. so
i need to say: these are the situations in which a car has stopped, and
these are the situations in which a car has not stopped. in a stop
situation, look at what the car saw, and in a not-stop
situation, look at what the car saw. like
spam and not spam. and now i can flip this and say: therefore, if this is what i
saw, i now know whether to stop or not. it's
a neat little logic. so this is
what bayes' theorem essentially does. it
is one way of doing supervised learning. it is one style of doing supervised
learning, and there are supervised learning algorithms that are explicitly
this, for example bayesian belief networks, or bbns. there are
some supervised learning algorithms that are this but aren't explicitly so, for example linear
discriminant analysis. but what you do is you find the posterior distribution
of being in this class given the data. and
so this class given the data is written in terms of, you know,
the data given the class,
and vice versa. so there are at least two of these algorithms that you will study later. discriminant analysis
is one, and i think bbns, i don't know whether that is regular, but in general you will find
it to be a very useful trick. i'll come back and i'll show you the
theory behind it if you're interested, but this is actually all
that needs to be remembered for this application. so
the question is about autonomous cars. his question is: why don't i do the simple thing of saying
that if you see something, stop? now, from
a computer's perspective, following that logic, the computer
now has to know: what should i do when i see something? not just,
if i see something, stop. so you could say, if i see something on the road then stop. you
didn't say what happens if i don't see something: i should keep going. so
this becomes a very simple rule that says that if i see something, stop;
if i don't see anything, keep going.
now, what will this do to the car? okay.
so this is a translation of a rule. the difficulty will be the
following, and you can try doing it. the difficulty will be: what
precisely will the car see, and will it follow that logic explicitly? so
if it sees a car that is quite far ahead, it
will stop. you could say, i'm going to set a threshold: if
the car in front is further away than this, then don't stop,
because you're expected to see a car in front. and so if you see a car in front, please
don't stop just because something is in front. but you now have to encode that.
and so that way of doing things is entirely feasible. so for example, there's
a whole branch of learning called case-based reasoning,
and case-based reasoning essentially relies on that. give me all the cases and
give me the reasonings for all those cases. what happens is that
this sometimes becomes difficult: it becomes very, very difficult to enumerate
all the possible cases. for example, in the spam problem
i have to solve this problem for every conceivable word that
the email might contain, because the filter is going to decide based
on the words. and in a full case-based
approach, if the email has a word that the filter has not seen before, the filter will say,
what do you want me to do? so
typically, when bayesian methods are used, when it sees that word it will do precisely
nothing. in other words, it will say: if certain words are there, i will update my decision.
if those words are not there, it's irrelevant to it. there's no evidence that it has.
the other thing is the probabilistic way of thinking: bayes' theorem,
or any of these relations, is probabilistic learning.
when an autonomous system or any machine learning system decides,
then what does it decide on? you'll often find in data sets
the following situation. i should have had an example; i'll pull it up. all
the x's are the same but the y's are different. all
the x's are the same but the y's are different. two people have
exactly the same characteristics, but one has bought the product and one has not
bought the product. two people applying for a loan have given
you the same information. they come from the same village. they have the same income.
they have the same, you know, family circumstances. they grow the same crops. one
farmer has repaid the loan; the other farmer has not.
a car is being tested, there is someone crossing the road, an identical scene.
one test driver decided to stop; the other test driver decided not
to stop. same x, different y. what
should the computer do?
think of this from the computer's perspective. what is the computer's problem? the computer's problem is:
if you give me an x, i will give you a y.
now, what do you want the computer to do in this particular situation? because in your real data
the same x is leading to different y's. what's an ideal solution here?
what would you do? how do you think through this problem? one
possibility is to give it a probability. that's
one approach to the problem. what that means is this: in your data set,
let's say half your people who have seen this x have given a y of 0 and
half your people who have seen this x have given a y of 1. the computer
literally tosses a coin and decides which one to predict. that's
called a randomized response, and sometimes it's done. it could be a disaster as
well. or, sorry, that could be good. but
give me another alternative. the safest
alternative? we could go for whichever is safe. but how does the computer know that? the
condition is that, given the same x, its input is identical. see,
that consequence has already been worked out in nature. in
nature, if the consequence was there, that would already have been baked in. so
if there is a consequence to it, and if stopping had the good consequence, then the test driver would
have stopped in all cases. the test driver would have stopped. yes?
yes,
that decision would have been made by the test driver as well, would it not have been?
the raw data would also have shown that bias. or
are you teaching a computer to have a sense of value that
the real human did not have? two
doctors look at the identical medical report. one
doctor says cancer. the other doctor says no cancer. you
are building an ai system for medicine. what should it say?
go for another test. you
should see a very nice video of watson. you know
what watson is? you should see the watson videos if you haven't seen them, and if
you want to be an ai professional, a real ml professional, then you should see the watson
videos. wonderful videos, and
you can see the decisions at the bottom.
you can see how watson decides. you know,
these are the jeopardy videos, so, watson playing jeopardy.
and jeopardy is a quiz show in which the answer is given
and you have to sort of say the question, or something of that sort. so when you see the video,
at the bottom you'll see a bar, and the bar is a set of probability
statements as to how likely this is the answer, et cetera, et cetera. and
based on those probabilities watson gives an answer, and sometimes watson does not give
an answer because it is unsure of even its best answer. so
when you watch it, watch the bottom of the screen,
the data that watson is answering based on, this
particular way of doing things. but in general this problem is a hard
problem in machine learning, because in the real world you will have this issue. if
this was not the case, if it was the case that
identical values of x give identical values
of y, the machine learning problem would be a mathematical function-fitting problem.
it would be a problem of simply saying: if this is the x, map it to the y. just
find the rule that maps x to y. it's not, and the reason it is
not is because identical inputs do not lead to identical outputs.
and resolution of that has many, many procedures
and possibilities. one of them is a probabilistic way of doing things,
to answer the following question. i will not tell you whether y is 0 or 1. i
will tell you what is the probability that y is 1. i
will not tell you whether you have cancer or not. i will tell you what is the probability that
you have cancer. i will not tell you whether i will hit something. i will tell you
what the probability of hitting something will be if i continue. it's
not a definite answer. you're asking for a 0 or a 1, and i'm not giving you a 0 or a
1; i'm giving you a probability. so at every time,
the car, when it is driving, is calculating a number: given the scene,
what is the probability that i will hit something? continuously, based
on what it is seeing. now you decide, based on that probability,
whether you should stop or not, based on, you know, your risks etc.
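a minimal sketch of this idea, assuming a made-up loan data set: when the same x shows up with different y's, the estimate of p(y = 1 given x) is just the share of 1s among those rows, and the stop-or-go (here approve-or-reject) decision is applied afterwards with a threshold that the person bearing the risk chooses. the records, the function names and the thresholds are all hypothetical.

```python
from collections import defaultdict

def conditional_probability(rows):
    """Estimate P(y = 1 | x) as the share of 1s among rows sharing the same x."""
    totals = defaultdict(int)
    ones = defaultdict(int)
    for x, y in rows:
        totals[x] += 1
        ones[x] += y
    return {x: ones[x] / totals[x] for x in totals}

# hypothetical records: x = (village, crop), y = 1 means the loan was not repaid
rows = [
    (("village_a", "rice"), 1),
    (("village_a", "rice"), 0),
    (("village_a", "rice"), 0),
    (("village_a", "rice"), 0),
    (("village_b", "wheat"), 1),
    (("village_b", "wheat"), 1),
]

p_default = conditional_probability(rows)

def decide(probability, risk_threshold):
    """The decision sits outside the model and depends on the caller's risk appetite."""
    return "reject" if probability > risk_threshold else "approve"

p = p_default[("village_a", "rice")]
print(p)               # 0.25
print(decide(p, 0.5))  # approve
print(decide(p, 0.2))  # reject
```

the same probability leads to different decisions for different thresholds, which is the point: the learner reports the number and someone else decides what to do with it.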
the learning system does not do that. the learning system does not say
whether you should be diagnosed with cancer. it simply says what is the chance that you have
cancer. now you decide whether that's
enough to state whether you have cancer or not. the learning system
will not say whether you will default
on your loan or not. it will say what is the probability that you will default on the loan.
now you decide how much risk you will bear. that's
one solution to the problem. it doesn't even try to predict the answer.
it simply gives you a distribution on the possible answers. you
decide. as i said, if you see the jeopardy videos you'll see this in action. you'll
see the data on which it decides. "the category is 19th century
novelists. what watson wants to do then is preserve the lead, not take a big risk,
especially with final jeopardy, because just like for humans, final jeopardy is hard for watson." "now
we come to watson." "who is bram stoker?" "i
would have thought that technology like this was years away, but it's here now."
"phenomenal."
look at that. what it's doing is it's giving probabilities
on the answers. these
don't add up to 1. these don't add up to
1, but: what is the chance that this is the answer, what is the chance that this one
is, et cetera, et cetera. this number, if it is below this threshold, watson
will say pass; it won't answer. and that is there in the video a few
times: it doesn't know. but it says that if i am more
sure than a certain threshold, and if i am uniquely sure, i
will answer. also, if multiple of these cross here, which
means both of them are probable and i don't know which is right, they both sound correct to me, again
i might stop. it'll
do that for every question, based on hearing it. so
if probability is done by the python language or any of the machine
learning things, then what are we here for? meaning, what is our
role in deciding that? deep philosophical questions. why are we required? why are we existing
at all? why are we here at all? so, our role?
yeah. so one reason you're there is to provide
test data to the system, or what's called ground truth. that
is, you need to give it spam, and you need to tell it at least once,
this is spam. just like he's saying, i need to tell it to stop. i need
to say that this is a dangerous thing. so a human needs to initiate that. but people are asking
that question a lot: is that human initiation necessary? now
the trouble with that is that the value system that is necessary
to decide that this is a good thing or a bad thing is something that computers do not have,
and it's extremely difficult to encode that. it's not easy
to encode in a computer, in some way, that this is good, or this is one decision and
this is another decision. and also, if you want to encode a cost to it: if
i do this, this is what it costs. reinforcement learning does this. if you take a wrong
decision, there's a penalty function that hurts the computer in terms of
an objective, and the computer knows that if i want to reduce, shall i say, that pain
factor, i should avoid doing this. like babies learn. it is
called reinforcement learning. i don't know whether you will do much reinforcement learning in this course or not,
but you will build algorithms
of that kind. there will come a time when that will not be necessary.
for us, it is not necessary. but even we, even humans, have
to come with genetically coded information. we also cannot begin from scratch. we
already come coded with this. there's a school of thought that says that
that's all that there is, that this information is passing along. in
other words, a hen is an egg's way of making another
egg. so
an egg wants to make another egg. how does an egg make another egg?
through a hen. it makes a hen, and that hen makes another egg,
which means that there is a basic information content. the gene is trying to say, i need to survive.
so the sequence of a's, t's, c's and g's has a survival instinct, and
the only way it can do that is to get another organism to create a copy of it. who
does this brilliantly? there's a big war
going on on planet earth for a few billion years and still continuing. it's
a deadly war, has got no winners, and is going to continue. it is a war between bacteria
and viruses. nobody wins, right? these
two have been at each other for donkey's years because they have two very different
ways of dealing with information. viruses,
retrovirus type things: a virus is just dna with a protein around it. the way
it reproduces is, like certain birds we learn of in mythology, that its information gets
into another organism, typically a bacterium. so a virus forces a bacterium to
make another virus. and obviously the bacterium doesn't
like it. okay, so bacteria over billions of years
have figured out how to prevent this, and viruses have consequently
adapted and have repeatedly kept doing this. and
so information transference has a long, long history in the real world.
in the computing world, the challenge
of saying, how do i input the information, how do i get the machine to learn, is
something that we are rapidly evolving in. the reason
this current generation is so excited about it, and i am
not that old, but even in my career, and i've been doing this for about
25 years or so, roughly speaking, i've seen three or four waves of it.
it goes up, it goes down, it goes up, it goes down. the current version
of it essentially is based on certain deep learning algorithms that have come
and have made it a lot easier to feed back information. so,
you know, recurrent neural networks, all these neural networks now have the ability to feed context
in, to feed information a lot more efficiently, which means this idea that a computer
can pick up context and use it to get better algorithms is
there. and that scares a few people mightily,
because what it means is that as a car keeps driving very well, it's
knowing that it is driving very well and will keep doing certain things. so
there's a school of thought that says that therefore maybe the car should have a few accidents.
just like maybe there should be a few
nuclear explosions. let's suppose that you go and
get an hiv test done. hiv tests are routinely
done. let's say you have surgery or anything like that, et cetera, hiv tests are done. so
let's suppose that, for whatever be the reason, an hiv test gets done and
the test turns out to be positive. i hope it never happens to you, but let's suppose it
just turns out to be positive. the question is, how scared should you be? very?
that's a reasonable answer. but let's work it out.
so to do that, i'm trying to calculate the probability of
hiv given a positive test.
this is what i'm interested in calculating, because my life may depend on it. there
are many ways to do this. here's a suggested route. now
what i'm going to do is i'm going to write this version of the formula,
and you'll see
what it means here. so what i'm going to do is i'm going to write this as probability
of hiv
and positive divided
by probability of positive: p(hiv | positive) = p(hiv and positive) / p(positive). correct?
conditional is joint divided by marginal.
now i'm going to write the numerator as probability
of positive given hiv
multiplied by probability of hiv. i'm
going to twist it. here's why: these
are numbers that are much more available to me. what
is this number? this number means:
if i have hiv, what is the chance that the test will be positive?
that's called the sensitivity of a test. a test maker has to report that. this
is the proportion of people who have hiv. this is the incidence rate. it has nothing to do with me.
it's like my dictionary; it is just the fraction of people who have hiv.
so these are numbers that i know, one from epidemiology
and one from my test manufacturer. divided by
positive, and i'm going to do something very interesting on the positive. i'm
going to write this positive in two ways. there are two ways in
which someone can become positive: hiv
and positive, plus not
hiv and positive.
okay? these are disjoint. there are two disjoint ways
in which i can end up being positive: either i have the disease or i do
not have the disease. now i can write this as this.
this i'd already calculated; it is the same number. probability
of positive given hiv multiplied
by probability of hiv, plus probability
of positive given not hiv multiplied
by probability of not hiv:
p(positive) = p(positive | hiv) × p(hiv) + p(positive | not hiv) × p(not hiv). this
is the formula, just expanded out. we're
going to apply this and see what happens. let's
see what numbers i need. i need a number for probability of hiv.
probability of hiv is an incidence rate for hiv. what's a good number for this?
point one percent? okay, that's
actually very low; the hiv rate is a lot higher. let's say
one percent. one percent of people
have hiv and 99% don't.
what this also means is that probability of not hiv is 99%.
okay. i also need a few other things. i need, for
example, this: probability of positive
given hiv. this is a measure
of how good the test is. if you have hiv, what
is the chance that it will report that you have hiv? what's a good number for this?
99%? 95%? what
does 95% mean? that if you have hiv
there's a 95% chance that i will find it. equivalently, for a hundred people
who have hiv, for 95 of them i will find
it. yes, which
one? as i
said, this is called a sensitivity number.
it comes from the test. a very good test may have this at 99%, 99.9%.
a not very good test or a cheap test may have this at 90%.
i'm assuming that this test is 95%; pick your own number. its
sensitivity is 95%. the other number is sometimes called specificity.
so for example, let's say i go the other way: probability of negative given
not hiv, which means if you do not have hiv, what
is the chance that it will say you do not have hiv? again
95%. again 95. in other words, i have a fairly
simple test, which is 95% accurate. whatever
your disease state is, 95% of the time it will give you the right answer. okay.
now let me re-ask the question. i've given you a test
that is 95% accurate. i am now telling you
that your test is positive. what is the chance that
you are hiv positive? 95%? that's a reasonable guess,
right? let's work it out. negative given
not hiv is 95%, so what is positive given not hiv? five
percent. great. okay. now
i have everything that i need to calculate this. what is positive given
hiv? .95. correct.
into, what is probability of hiv? .01.
we've given it as one percent. downstairs,
again point nine five into point zero one, plus,
what is this positive given not hiv? .05,
multiplied by probability of not hiv, .99. so it is
(0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99).
could someone please work this out, on
a calculator or by hand?
together they cover everything, yes, which means in a
particular case you have hiv or you do not have hiv. there
are no other possibilities. they are exclusive. why? because either
you have hiv or you do not have hiv. and
exhaustive events means that there are no other things. so
this, positive given hiv, is
95. yes, you're asking about
which one, the last one? five percent. this 5%, this is one
minus this. for not hiv, negative was 95%, so positive will be 5%.
what is this number? point...
i seem to have high variance in my answers. anyone
else? 0.16. 0.16.
there's a sixteen percent chance you have hiv if you test positive. why
is that, with a fairly accurate test, a 95%
accurate test? my wife and i have a
company. we're trying to release a product on molecular diagnosis for infectious diseases.
if we get 95% we'd be thrilled. our investors would be thrilled. we'd be in business.
this is not easy to attain, particularly cheaply. we try to keep
the cost of our test fairly low for things like uti and stuff like that. but
so where is the problem? the problem is false positives. the test is 95%
accurate, but there is a problem of false positives
here. so another way of seeing exactly the same calculation,
pretty much exactly the same calculation, is the following thing. so i'm going to leave bayes' theorem
here, which is exactly this. i leave it to you to link this to bayes' theorem, et cetera,
et cetera, but sometimes it's easier to just understand it as an example of how it's done.
i will now show it to you as a picture. i leave this here. and
now let's assume that i begin with a population of maybe a hundred
thousand people. let's suppose that i've got a hundred thousand
people who are being tested, let's say.
now, of these hundred thousand people, some of them have the disease and some of them do not have
the disease. the total is a hundred thousand, and this is my sample space, so to speak. now,
let's say, how many of them have hiv? 1%.
1%, so that's how many? a thousand.
so a thousand of them are here. so these are hiv. and
how many are not hiv? 99,000,
correct? now, of these 1,000, how
many of them test positive? 950.
and how many test negative? 50. okay.
of this 99,000, how many test positive and how many test negative?
so these guys should test negative. so what is 95%,
or what is 5%, of ninety-nine thousand? five percent of ninety-nine thousand is four
nine five zero. so for five percent the test is wrong, which means
4,950 are here. this is 5%
of 99,000. and so how many are now here? 94,000,
about that. this
number won't matter much anyway. so you're okay with the situation here. now,
let's look at all the people who tested positive. where
are all the people who have tested positive? these guys have tested positive
and these guys have tested positive. so how many people have tested positive
in all? 950
+ 4,950. of them, how many have
the disease? 950. calculate this: 950 / (950 + 4,950), which is about 0.16. this
is exactly the same calculation you did before. arithmetically it is the same calculation.
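to check that the formula and the picture really are the same arithmetic, here is a small python sketch that computes the answer both ways, using the numbers from the lecture (1% of people with hiv, 95% sensitivity, 95% specificity, 100,000 people). the function names are only illustrative.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) using Bayes' theorem."""
    true_positive = sensitivity * prior
    false_positive = (1 - specificity) * (1 - prior)
    return true_positive / (true_positive + false_positive)

def posterior_by_counts(population, prior, sensitivity, specificity):
    """Same quantity, obtained by counting people as in the picture."""
    sick = population * prior                        # 1,000 people
    healthy = population - sick                      # 99,000 people
    sick_positive = sick * sensitivity               # 950 true positives
    healthy_positive = healthy * (1 - specificity)   # about 4,950 false positives
    return sick_positive / (sick_positive + healthy_positive)

p_formula = posterior(0.01, 0.95, 0.95)
p_counts = posterior_by_counts(100_000, 0.01, 0.95, 0.95)

print(round(p_formula, 3))  # 0.161
print(round(p_counts, 3))   # 0.161
```

changing the prior in the call, for example to 0.001, shows how quickly the answer drops when the disease is rarer.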
see, here
4,950 is the culprit. what does that mean? it means that
there were a lot of people who had a false positive. now, why
were there a lot of people who had false positives? because there are a lot of people who did not have
the disease. for that large number of people who did not have the disease, only
a small rate of false positives will swamp the positives of the people who had
the disease, which means most of the people who are testing positive
are actually healthy people who have had the misfortune
of the test going wrong on them.
because there were so many of them, it affected
the probability. now,
what is it for you? so what is the moral of the story now? so
therefore what will happen? let's say you go, and
i'm pretty sure this hasn't happened, but if somebody gets a positive hiv
test, what will the doctor say? get
a retest done. why? because
let's suppose this is my test. let's suppose this is my test,
and let's suppose now i've changed the test to saying that i will say you have
hiv only if you test positive twice in a row. you're
tested twice and both times you show a positive. now,
what happens to these numbers? what is now the positive
given hiv, and what is now negative given not hiv? first of all, what
happens to this? what is the chance of a false positive now?
so the chance of a false positive, which was previously 5%, now
becomes, yes, it must go wrong twice. so
point zero five into point zero five, and then
1 minus that. 5% of five percent
is what? it's a quarter of a percent.
and 1 minus that now becomes a very large number. so this number becomes much smaller:
the chance of a false positive becomes much lower, and because
the chance of a false positive becomes much lower, this number of false positives becomes a lot lower. and
now the answer begins to approximate what you think it would be. but for this
to work, i must be able to multiply the two probabilities
that both tests went wrong. that multiplication comes from
independence, which means the second test that you do should be from a
different laboratory, which would have its own biases and have its own problems,
but they will be independent of the first guy, and you can multiply this out and this problem will go away. if
it doesn't multiply out, the
same result happens. in other words, if the same error shows up again, then this probability will not go
down. so
this difficulty shows up in many things.
so if i am trying to detect, let's say,
fraud. i'm going to detect fraud, and i have a fraud detection
algorithm. and i now say, if i see this signal, what is the chance
that it is fraud? by bayes' theorem, that will be low. the
reason that will be low is because most transactions are not fraudulent
transactions. and so even if there is a small
possibility of flagging a non-fraud transaction as a fraud transaction, i
have messed up my algorithm. you
have to do the test independently. running the same program twice will not help you.
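here is a small sketch of why independence matters for the retest, using the same hiv numbers as before. it assumes the rule "call it positive only if both tests are positive" and assumes the two tests are fully independent, so that both the detection rate and the false positive rate multiply; the "same program twice" case is modelled, as a simplification, by a second test that just repeats the first result.

```python
def posterior(prior, p_pos_given_sick, p_pos_given_healthy):
    """P(sick | positive) using Bayes' theorem."""
    numerator = p_pos_given_sick * prior
    return numerator / (numerator + p_pos_given_healthy * (1 - prior))

prior = 0.01                 # 1% of people have the disease
sensitivity = 0.95           # P(positive | sick) for a single test
false_positive_rate = 0.05   # P(positive | healthy) for a single test

one_test = posterior(prior, sensitivity, false_positive_rate)

# two independent tests, both positive: the probabilities multiply
two_independent = posterior(prior, sensitivity ** 2, false_positive_rate ** 2)

# the same test repeated with the same error: nothing new is learned
same_test_twice = posterior(prior, sensitivity, false_positive_rate)

print(round(one_test, 3))         # 0.161
print(round(two_independent, 3))  # 0.785
print(round(same_test_twice, 3))  # 0.161
```

real repeat tests usually sit somewhere between these two extremes, which is why a different laboratory, or fresh data in the machine learning case, helps.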
so in the biological example, you need to run it again with a different test. in
a machine learning situation, what does that mean? it means you have to give it fresh data, different
data from the same situation, shall we say, which is a little harder, but
that's fine. so this is bayes' theorem. so
that last word-and-spam problem, how does it
map to this? okay, well, it looks completely
different, does it not? okay, we'll do it this way. what is the proportion
of spam and non-spam? let's say this is spam and
not spam. why
do i need to know this? this is the proportion of things that
are spam and not spam, independent of what
is in the text. what's the proportion of emails that are spam? what
do you think? 30%
are spam? okay, you guys know your inbox.
it also points to a healthy social life, right? so now
let's suppose that we fix the problem, and i'm going to solve the problem
not for words, but for one word. what's
a spam-like word, for example? buy.
congratulation. congratulation, right? congratulation.
so now i want probability
of congratulation given spam.
what is probability of congratulation given spam?
if it is spam, what is the chance that the word congratulation will be there? a
hundred percent? let's lower this a little. 75%,
let's say. let's say this is 75%. then
what else do i need? because what is the problem i'm trying to solve?
so i'm trying to solve the following problem: trying to find probability of spam
given congratulation. this is
what i want to find. i want to say, if i see the word congratulation, what
is the chance that this email is spam? that is the problem you want to solve.
now to solve that, i'm solving the opposite problem. i'm saying, what is spam, what
is not spam, what is congratulation given spam, and i need one more probability:
congratulation given not spam. what
is this? 25%? not necessarily. it is not one
minus this; this is a separate calculation. but it could be 25% if you want it to. let's
make it 35,
which means if it is a genuine email, if it is not spam, there is a 35%
chance that the word congratulation will be there. now, i don't need to make this
up. as i said, in a laboratory, i can look at all spam things and i can count how many times congratulation
shows up in them. so now let's suppose
this is here. let's suppose i know
this. now, can you do the calculation? you can do it using bayes'
rule. you can do it using the tree diagram if you want
to. just try. what is the answer? congratulation
given not spam is 35%; that is known. so four numbers
are known to you. well, actually, you know, two of these are really the same
number, so three numbers are known to you. if it is spam,
then the chance of congratulation is 75%. if it is not spam, the chance of congratulation is
35%. now i want to find what is the probability of
spam given that there is congratulation. now,
how do i find it? all of these
are known. this is, shall we say, the information
that's available to me. some of you can try using the formula,
some of you can try using the picture. so if i do it using
the formula, what will it look like? spam
given congratulation is
equal to probability of congrats
given spam multiplied by probability of spam,
divided by probability of congrats
given spam multiplied by probability of spam
plus probability of congrats given
not spam multiplied by probability of not
spam:
p(spam | congrats) = p(congrats | spam) × p(spam) / [p(congrats | spam) × p(spam) + p(congrats | not spam) × p(not spam)].
and this is what? congrats given spam is .75,
into probability of spam, 0.3, divided
by 0.75 into 0.3 plus
congrats given not spam, .35, into
not spam, 0.7. that is 0.225 / 0.47, point
four seven, or 0.479 to be more precise. or
you may want to draw a picture like this, like we have drawn before. begin with another typical
number, let's say a hundred thousand. you divide it as spam
or not spam. this is a hundred thousand emails. on the spam side,
how many will there be? 30,000 on this side,
70,000 on this side. how many will have congratulations?
on the spam side, 75% of them will have
congratulations. so
75% of 30,000, that's 22,500.
and the remaining will not have
congratulations. how many here will have congratulations? for
not spam, 35% of 70,000.
what is 35% of 70,000? 24,500.
and so what is my answer? 22,500 divided by
22,500 plus 24,500, which is
about 48%. we could do it this way as well. without
opening the email and seeing the email,
the chance that it is spam is 30%, but if the word congratulation
is there in the email, the chance that it is spam has gone up to about 48%.
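
The spam calculation above can be checked both ways, with the formula and with the 100,000-email picture. This is a minimal sketch using only the numbers quoted in the lecture (prior 0.30, 0.75 and 0.35 for the word given spam and not spam); the function and variable names are illustrative, not from the course notebook.

```python
def posterior_spam(prior_spam, p_word_given_spam, p_word_given_not_spam):
    """Bayes' rule: P(spam | word) from the prior and the two likelihoods."""
    numerator = p_word_given_spam * prior_spam
    evidence = numerator + p_word_given_not_spam * (1.0 - prior_spam)
    return numerator / evidence

# Formula route, with the lecture's numbers.
p_formula = posterior_spam(0.30, 0.75, 0.35)

# Picture route: split 100,000 emails into spam / not spam, then count the word.
total_emails = 100_000
spam_emails = total_emails * 30 // 100             # 30,000
not_spam_emails = total_emails - spam_emails       # 70,000
spam_with_word = spam_emails * 75 // 100           # 22,500
not_spam_with_word = not_spam_emails * 35 // 100   # 24,500
p_counts = spam_with_word / (spam_with_word + not_spam_with_word)

print(f"formula: {p_formula:.4f}  counts: {p_counts:.4f}")
```

Both routes give the same posterior, a little under 0.48, up from the 0.30 prior.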
now you would not do this just for congratulations. you do this for a whole bunch of words,
which means that instead of congratulation, read congratulation and something and something, etc.,
which means that instead of congratulations here, it'll be congratulation and something and something and
something here, which means for these probabilities you will need to
say congratulation and something else. let's say another word. what's another word? offer.
so you now ask what is the probability of spam given congratulation and offer.
now, you would need congratulation and offer given spam, but if you assume independence there,
this can be congratulation given spam times offer given
spam. so word by word the probability can be calculated
and can be put in. this approach, which you will see when studying text mining in your course,
is called the bag of words approach. the words are put into
a bag irrespective of their order and things like that. yes.
yes, yes. 
who is this for? yes, 
so each of these, the e, will then be a new event, and the new
event would be different words. and so those different
words will be thought of as the product of each word. so
the chance that the words congratulations and offer are there
in the email is the chance that congratulations is there in the email
times the chance that offer is there in the email. that's an assumption,
and as i mentioned, that is built into the bag of words model. if
you don't like it, what you have to do is give me the joint probability of
the two words together, and those models are also there. they're called bigram models.
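
A small sketch of the independence assumption just described: each word contributes its own factor, and the factors are multiplied. The probabilities for the word "offer" (0.60 given spam, 0.20 given not spam) are made up for illustration; the lecture gives no numbers for it. The function name is also hypothetical.

```python
from math import prod

def naive_bayes_posterior(prior_spam, likelihoods_spam, likelihoods_not_spam):
    """P(spam | all words), assuming words are independent given the class."""
    joint_spam = prior_spam * prod(likelihoods_spam)
    joint_not_spam = (1.0 - prior_spam) * prod(likelihoods_not_spam)
    return joint_spam / (joint_spam + joint_not_spam)

# "congratulation" uses the lecture's numbers; "offer" uses ASSUMED values.
p_one_word = naive_bayes_posterior(0.30, [0.75], [0.35])
p_two_words = naive_bayes_posterior(0.30, [0.75, 0.60], [0.35, 0.20])

print(f"congratulation only:       {p_one_word:.3f}")
print(f"congratulation and offer:  {p_two_words:.3f}")
```

With these assumed values the second word pushes the spam probability higher; a bigram model would replace the product with a joint probability for the pair.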
no. spam and non-spam are the
b's here. in the b i there are these
two, spam and non-spam. yes.
yes, we are restricting it to two here. there, there are k possibilities,
the number of possibilities in that case.
and there could be other possibilities. here the things i'm deciding between are just
two, spam or not spam. in this formula, the
number of things that i'm deciding between can be many. for example, in
your gmail, how many categories are there? there's social,
there's promotions, and primary. so instead of these being spam and not spam,
i can define them as primary, social and promotions.
so now i need to find what is the probability of primary given congratulation,
promotion given congratulation, and social given congratulation.
there are three of these now, and that you can now apply here.
there's b1, b2 and b3. so
you've all already seen an example of a distribution. i'll
simply tell you what it is: the binomial distribution. what is the binomial
distribution? the binomial distribution is a distribution
of simply counting the number of things: the number
of defective products, the number of customers that receive
services, etc. etc., exactly like the applications that we were talking about.
this is the statement we have already seen. the probability
of getting x successes out of n trials is
p of x is equal to n choose x, p to the power x, 1 minus
p to the power n minus x, where the individual p
is the probability of getting success in one try. you
remember my formula of point 1 to the power 2?
this is that formula. what does this formula say?
this formula says that if p is
the probability of success of a single trial,
then what is the probability of getting x
successes out of n trials? n
trials, p
is the success probability of each
trial. what is the probability
of x successes? n
choose x, p to the power x, 1
minus p to the power n minus x. how do i think this through? what
is a trial? the number of trials is the total number of
attempts that i'm making, the total number of products that i'm making. i'm making three
products. the probability of each product being defective
is point one. what
is the chance that i will get two defects? which
is 3 choose 2 into point 1 to the power 2 into
.9 to the power 1. x
successes is p into p into p, x times. n minus
x failures: what is not a success is a failure, whose
probability is 1 minus p. and
then the number of ways of choosing that x out of the original n.
so in this case, are these trials
with replacement, or are these trials without replacement?
yes, they are with replacement. it's not like there is a population,
so to speak. in other words, an actual draw is not being done.
it's imagined that someone is doing this experiment repeatedly. so
yes, if you want to think of it as replacement, it is replacement. it's a model. for example, here a
bank issues credit card statements to customers under the scheme mastercard. based
on past data, the bank has found that 60% of all accounts pay
on time following the bill. if a sample of seven accounts is selected at
random from the current database, construct the binomial probability distribution
of accounts paying on time. what is the question being asked? the question being asked
is this: i am looking at seven accounts
and i'm trying to understand how many of those accounts are
paying up. how many of those accounts are paying up?
now, what values can it take?
what are the possible values that my x can take? 0,
1, 2, 3, 4, 5,
6 and 7.
zero means none pay on time, one means one pays
on time, seven means all pay on time. the
chance that each one of them individually pays on time is 60%,
and i'm going to make the assumption that these people aren't talking to each other, so
they're behaving independently. the 60% chance applies to everyone
separately, which means whether one person has paid
has no impact on whether another person has paid or not.
correct. let's do one of these calculations. let's say, what
is the probability that, let's say, how
many people? two people
pay on time. so two pay on time. what
is the answer to this? you can use this formula
directly, but two people pay on time means point 6
into point 6, not into 2 but to the power 2, into
.4 to the power 5. these
are the five people who have not paid on time. these are the two
people who have paid on time. so this is point 6 into point 6 into point
4 into point 4 into point 4 into
point 4 into point 4, for the seven people. now, that is
one arrangement. how many such arrangements are possible? 7
choose 2 arrangements are possible. those two could be the first two, they
could be the next two, they could be the first and the last. there are
seven choose two of those. each of them is a pattern: paid,
paid, not paid, not paid, paid, and every time you see a paid,
.6, every time you see a not paid, point 4. the
point six you're going to see twice and the point 4 you're
going to see five times. therefore this formula. 7
choose 2 is a formula which simply says how many ways can i pick two
things out of seven. the formula for it is seven factorial divided by 2
factorial into 5 factorial, which is,
in this case, 7 into 6 divided by
2, which is 21. there are
21 ways to pick 2 out of 7. it is
2 and 5 because i asked for two. i
can do it for any number, and the problem asks for all of them. i've just solved it for
one particular answer. i need to do it for 0, 1, 2, 3, 4, all of them. if i
add them all up, i'll get the answer 1, because
something must happen. no,
the number of trials is 7, the number of outcomes is 8. if
i toss one coin i can see two things. so
there are seven trials, the seven people, so
0 1 2 3 4 5 6 7, that's eight, the eight possible outcomes.
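
The hand calculation above can be reproduced with the standard library alone. This is a sketch; the helper name `binomial_pmf` is illustrative and is not the function used in the course notebook.

```python
from math import comb

def binomial_pmf(x, n, p):
    """n choose x, times p^x, times (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 7, 0.6                       # seven accounts, 60% pay on time

ways = comb(7, 2)                   # arrangements of 2 payers among 7
p_two_pay = binomial_pmf(2, n, p)   # 21 * 0.6^2 * 0.4^5

# One probability per possible outcome 0..7 (eight outcomes for seven trials).
all_probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
total = sum(all_probs)              # "something must happen"

# The earlier three-product example: two defects out of three at p = 0.1.
p_two_defects = binomial_pmf(2, 3, 0.1)

print(ways, round(p_two_pay, 4), round(total, 6), round(p_two_defects, 3))
```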
all right. so now there is a file here. it's called, i think,
binomial distribution example. you import
a few things for plotting and for the stats functions, then
i'm going to set up the problem. how am i going to set up the problem in this particular
case? just by specifying an n and specifying a p. what
is the n in this case? n is the total number of trials.
why is it 7 for me? because there are seven customers. correct?
p is .6. where do we get this point six? here, the
60%. what am i doing here? what
i'm doing here is i'm creating the sample space,
creating the set of numbers for which i want to calculate the probability.
so this one here, the
range function, 0 to 8. when i do this,
it creates an array of eight numbers, zero to seven.
does zero really have a value? of course it does. there is
a reasonable probability that nobody pays on time.
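
A sketch of what the setup described above might look like in code. The actual notebook is not shown in the transcript, so the variable names and the use of `np.arange` are assumptions; `scipy.stats.binom.pmf` is the SciPy call for the binomial probability mass function that the lecture goes on to discuss.

```python
import numpy as np
from scipy import stats

n = 7      # number of trials: seven accounts
p = 0.6    # probability that any one account pays on time

# Sample space: the eight possible counts 0, 1, ..., 7.
k = np.arange(0, 8)

# Probability of each count under the binomial model.
binomial = stats.binom.pmf(k, n, p)

for count, prob in zip(k, binomial):
    print(f"{count} pay on time: {prob:.4f}")

print("sum of probabilities:", binomial.sum())
```

The last line is the checksum: the eight probabilities should add up to 1.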
same place, wherever you got the other
one from. how
does the formula work? x
people have paid, so think of it as p
into p into p, x times, and
think of one minus p into 1 minus p, n
minus x times, because
x people have paid. and what allows me to multiply the probabilities? because
if there's a sixty percent chance you pay and also a 60% chance you pay,
then the chance that both of you pay is going to be point six into point six. and if
he doesn't pay, and i want to multiply those too, point 6 into point 6 into point 4. now
how many point sixes are there? as many successes as i want. how many point fours are
there? as many non-successes as there are. and how many such possibilities
are there? how many ways can i get two successes? that
is what i am calling 7 choose 2, which is 21. why is it 21?
you are going to pick two people out of 7. how many ways can you pick
them? the first person you can pick in seven ways.
the second you can pick in six ways. 7 into
6. but if i pick you first and you second, that is the same as picking
you first and you second, so i've double counted. so divide by 2. so
7 into 6 by 2, which is my 21. this
application, this kind of application, or another kind of application. for example, i can change this
to say, in sales,
i am approaching seven leads. the chance of a conversion
for a lead is 60%. what is my sales distribution? okay,
what do i do with that information? for example, to figure out, let's say,
how much budget i should have for the sales team. for
example, i could say, you know what, i'm
going to approach seven leads and i'm going to get sales. however,
how are those sales going to be made? the sales are going to be made on
the phone, but to confirm the deal i need to be able to
send a salesperson to the person's house and get their signature. this
person is going to take a certain amount of time to travel through the famous city of bangalore
and get stuck in the traffic jam and get there. so i will be able to get at
most three signatures in a day. and if i don't use it,
i lose it. three signatures or so. let's suppose
that therefore i employ one person. is
that good enough? so now i'm
asking the question, what is the probability that i'll end up making more than
three sales in a day? because if i end up making more than three
sales in a day, i'll not be able to close all the sales. so
this becomes a salesforce question. it becomes a question of saying,
given this probability to sell, how big a sales team should i have? if my sales
team is too small, there's a probability that they will not be able to
close out all my sales and i leave money on the table. if my sales team
is too big, i'll be paying for that sales team but
they will not have enough to do. so
yes, the binomial distribution is used in call centers,
in contact centers. yes, it uses the same argument in contact
centers. for example, one reason it is used is how many escalations do you expect. so
in many of these... so how do i execute on this? so
i've created the array. now,
here's the command that you need to know. this command calculates
that formula, that n choose k formula.
that formula is calculated. by
the way, you can manually do this once if you want to, which is your 21
into .6 to the power 2 into 0.4 to the power 5. does anyone want to manually do
it once, just to check? otherwise,
we'll just trust the output. that's fine. but if i do this,
binomial equals stats dot binom dot pmf.
pmf stands for probability mass function, in case
you want to know what that means. probability mass
function. so this thing is called a probability
mass function. probability mass,
clear? mass means it is almost as if you're thinking of it as
a solid material and the probability is a physical
mass. how much mass is in each number? how much mass
is in each number? so
this number, the pmf,
is simply this number. it's a calculation
of this number. so now if i ask for binomial,
if you do it without the equal to, it'll just give it directly. alright,
so run it. it just takes a bit of time. so binomial
is an array. so what is this number here, for
0? what is this in the business context? this is the chance
that nobody pays on time. the number of people who
pay on time is zero. so it's
about point one six percent. what is the chance that one person
pays on time? 1.7 percent. two
people pay on time, about 7.7%. three
people pay on time, about 19%. four
people pay on time, about 29%. five
people pay on time, 26%. six people,
13%. seven people, about 2.8
percent. okay, curiosity question. how many people
would you expect to pay on time? now, remember
there's a 60% chance that each one will pay. yes,
four or five. in fact, the answer is 7 into .6, which
is 4.2. so you'd expect to see
a little more than four people pay on time. and the chance
of four people paying on time is about 29%,
and the chance that five people pay on time is about 26 percent. if you want to
plot this, there is a slightly, you know,
jazzed-up version of a plot here. so the first line says
plot it, you know, it says binomial. then
there's a title, there are the labels, and then finally the plot command
itself. i think that's a plotting artifact. i mean, it tells you what to plot. you can
remove it and see what happens. here's an interesting thing. someone
asked what happens when i add up all the probabilities, which
is what i get here. i don't need it. it's
a checksum. so one possibility of a business outcome is,
what is the probability that, say, more than six people do not pay their
bills on time? now, the collections team in a bank
certainly is interested in that. who will you have to go
after? there's also a question of what is the amount i collect
in a specific month. a bank is going to make money, or say a
telephone company, whoever, is going to make money on the amount of the bill that's actually paid. now
the fact that the bill has been given to a person doesn't necessarily mean they will pay it, like here. so
how much money does the bank actually expect to make? it
has to have an estimate of its revenue per month. how does it get that? by
doing a calculation of this kind. here's a formula, if it helps
you. the average of a binomial distribution is given by
n into p, which is, as discussed, the total number of trials into
the probability, 7 into .6, which
means, for example, that if i think that my
success probability of a sale is 10% and i approach
10 people, the number of sales i expect to make
is 10 into point 1, which is 1. does
it mean i will make one sale? no. the distribution
goes from 0, 1, 2, 3 up to 10, but the average is that
one. similarly, the average of this distribution is
at 4.2. where is it in the picture? where
is the average? it is 4 point 2, somewhere here. somewhere
here is 4 point 2. this is the center of gravity of the distribution. there
is a standard deviation formula, if you want to know: the square root of n
p into 1 minus p. the standard deviation will make
a little more sense when we talk about the normal distribution. i hope i'll get there. now
there's another distribution which is used a little less in practice. you guys
are all very practical types, so how is it used is the
kind of question you asked. so i want to make an estimate, for example, as to how many people will
pay my bills, because based on that i will plan. i
can do it two ways. i can, for example, say what is the number of people i expect to pay my bills? 4.2. what
is the number of sales i expect to make? what is the number of errors
i'd expect to have in my code? what is the number of defective products?
what is the number of expected customer recalls that i have? whichever
industry you're in, there are events that happen in that industry and you are
trying to find an estimate for them. one estimate is an expectation, like we discussed
yesterday. but remember, this one is not coming from data. seven
into .6 is not a calculation based on data. i didn't give you any data on
people paying their bills on time. i gave you a theoretical distribution. this
is an assumption that i made. it's not an
average computed on data. so therefore
when i make the distribution assumption and ask, based on the distribution, what is the expected
number i should see, will i see that all the time? no, that's why there's a distribution.
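
The expected count and its spread come straight from the assumed n and p, not from data. A short sketch, using the standard binomial formulas (mean n·p, standard deviation the square root of n·p·(1 − p)) and cross-checking them against SciPy's frozen distribution object; the variable names are illustrative.

```python
from math import sqrt
from scipy import stats

n, p = 7, 0.6

# Closed-form mean and standard deviation of a binomial.
mean_by_formula = n * p                    # 4.2
sd_by_formula = sqrt(n * p * (1 - p))      # sqrt(1.68)

# The same quantities from SciPy's binomial distribution object.
dist = stats.binom(n, p)
mean_scipy = dist.mean()
sd_scipy = dist.std()

# The expectation is not a guarantee: the most likely single outcome (4)
# still happens less than a third of the time.
p_exactly_four = dist.pmf(4)

print(mean_by_formula, round(sd_by_formula, 3), round(p_exactly_four, 3))
```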
so there was a yes yes 
reality. no, 
so this would be used, and it is often used, where what will
come from the data? one thing that can come from the data
is the p, just the p, not
the distribution itself. yes,
and so, for example, i want to find, next month,
for a new customer or next month, how many people will pay their bills on time.
that's the use case. now here's the way i do it. i ask myself, last month,
what proportion of people paid their bills on time? that may come from
the data. so the p comes from the data, but the calculation
for saying how many people will pay their bills on time
is done for the next month. it makes no sense to do it for this month because i already
have this month's situation. the
probability that we had, right, probability of one person paying, probability
of two. yes, the exact array. yes, yes. out of the
data it already has, because the p has come from the past data. yes,
normally, in a real situation, that probability has to be computed in
a lab based on past data, right. yes. let
me clarify that. yes, it came too quickly. so the complexities: one
is you might be supposing that it changes with time. or
it might be a situation like this: you know what, i
have a collections problem, meaning not
enough people are paying. so i might have a problem that looks like
this: the proportion of people who pay their bills
on time is 60% and i'm saying it's too low. now
we want to increase that. how to increase that? my manager comes in and
says, make it such that the
chance of, let's say, more than five people
not paying on time, this number, must be less than,
let's say, point one percent. that's
the goal. now to do that i need to change my
p. so i'll set my p so that the answer to this question
becomes less than 0.1%. that gives me a target p. now
i must reset my collection process so that that p is attained,
to achieve that p. so i can do that. i can create applications in various ways.
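
One way to turn the manager's goal into a target p is to solve for it numerically. This is a sketch under stated assumptions: the count of accounts not paying is modelled as binomial with probability 1 − p, and the root is found with `scipy.optimize.brentq`; the lecture does not say how the target would be computed, and the function name is hypothetical.

```python
from scipy import stats
from scipy.optimize import brentq

n = 7            # accounts looked at
goal = 0.001     # "less than point one percent"

def prob_more_than_five_late(p_on_time):
    """P(more than 5 of n accounts do NOT pay on time)."""
    # Late count ~ Binomial(n, 1 - p_on_time); sf(5) is P(late > 5).
    return stats.binom.sf(5, n, 1.0 - p_on_time)

# Where things stand today, at 60% paying on time.
current = prob_more_than_five_late(0.6)

# Smallest on-time probability that meets the goal: root of f(p) - goal.
target_p = brentq(lambda p: prob_more_than_five_late(p) - goal, 0.6, 0.999)

print(f"current chance: {current:.4f}, target p: {target_p:.4f}")
```

Any on-time probability above `target_p` keeps the chance below the goal, since the chance falls as p rises.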
give me the p and i will tell you what happens, or give me a situation
that i want to achieve and i'll give you a target p such that
it gets there. what is constant and what is
variable keeps changing. what i want to fix keeps changing. this is a model.
this is a mathematical model. how you use it is up to you. this
is one particular use case, but there'll be many use cases for this. you'll
see one in logistic regression, for example. the poisson distribution is a very
similar distribution, except that the poisson distribution has
a mass function that looks like this. now this
mass function counts, but
does not count relative to a maximum. the binomial
goes from 0 to n: 0, 1,
2, up to n. for the poisson there is no n. there
is no total number of things. for example, i might ask the question, how
many fraud cases do i expect to see? there's
no sort of maximum to that. i could frame it
as saying, tell me the total number of cases there are and that is my n,
and then i'll figure out based on a p how many fraud cases there are. but there are situations where this
maximum is something that doesn't quite make sense. how many fraud cases
are there? how many cracks, how many microfractures are there on
this bottle? it's a count,
right? how many eggs will the chicken lay? it's
a count. it is not in some way a proportion-like thing. so
if you're in a pure count-like situation, you are in the situation
of the so-called poisson distribution, whose mass function has this slightly different
form: e to the power minus lambda, lambda to the power x, by x factorial, where lambda is the average.
if on average six customers arrive every two minutes at a bank during busy
working hours, what is the probability that exactly four customers arrive in
a given minute? what is the probability that more than three customers will arrive in a given
minute? this is slightly different from a binomial. why? the
reason is, in the previous case they were asking how
many customers did not pay, but there was a total universe of customers,
7 customers. there was a sample space. here there isn't.
i'm not telling you how many could have come. there is a series and the series could go up
to anything, so to speak. this is the typical situation
of a poisson distribution, where it's not a question of saying independent trials and
how many were successes. it is simply counting how many there are,
and i have no idea as to how many there could have been, potentially. how many
fraud cases? i do not know. how many microfractures? i do not know. how many customers
could have arrived? i do not know. there is no maximum to it. so
the similar calculation here, for the same thing, if
you open the poisson distribution example file. now
for the poisson distribution, that formula... for the
binomial there were two numbers you needed to put in, the
n and the p. for the poisson there is only one number. there
is only one number and that number is usually called the rate: the rate at which my customers
are arriving, the rate at which i get fraud, or the density
of my cracks. it's a rate number. you can think of this
rate number as a product of n and p,
as the total number of opportunities times the probability, if you want to think of it as that. so
for the poisson i need to be able to specify the rate. and
now i do exactly the same thing again, calculate the poisson probability: stats
dot poisson dot pmf. now for computational
purposes i am setting the range from 0 to 20. i can
set it to any high number. the 20 is not coming from my data. the
20 is coming for a computational reason, because i want to do the calculation for a finite number
of points. and as you'll see, after 20 the numbers are very, very
small. so the 20 is not
there from the problem. the 20 is there for my visualization. i
can make it any other number. if you make it too low you'll
be leaving some probability to the more than 20. if you make it too high you'll be calculating
a lot of zeros. so what is my problem? let's go here. my
problem is, what is the probability that exactly four customers arrive in a given minute?
six customers arrive every two minutes at a bank. what is the
probability that exactly four customers arrive in a given minute? well, i put my
rate as six,
and here is my distribution. this is what? 2 point 4 into 10 to
the power minus 3. so this is point
zero zero two. let's
see what happens. so what is the probability of zero? point zero
zero two. what is it for 1? .01.
for two, for three, .08.
what is it for four? which
is four? this is zero, one, two,
three, four. what is it for 4? .13, 13%.
what is it for five? 16%.
what is it for six? 16%. what
is the average number of customers i expect to see? six. 16%.
what is this? seven, about fourteen percent. for
8, 10 percent. now it starts going down. it'll
go down, and by the time i reach 20 it is already point 0 0 0
0 1. so if
i had gone beyond 20 i would have seen even smaller numbers, but
i could have stopped at, for example, let's say 15. if i stopped at 15, where would this have stopped?
one, two, three, four, five, you would have stopped here, which
is fine. 20
is an approximation. 20 is a guess. here is a distribution plot,
the same thing. this is the plot of the distribution function
whose average is at six. by the way, what is the answer to the question?
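
A sketch of the Poisson calculation with `scipy.stats.poisson`. The first part reproduces the 0-to-20 table at a rate of six per two minutes; the second part answers the per-minute questions by halving the rate to three per minute, which is the conversion the lecture works through right after this point. The variable names are illustrative, not taken from the course notebook.

```python
import numpy as np
from scipy import stats

rate_two_min = 6                 # six customers per two minutes
k = np.arange(0, 21)             # 0..20, a computational cut-off, not a limit

# Probability of each count in a two-minute window.
poisson = stats.poisson.pmf(k, rate_two_min)

# Probability left over beyond the cut-off (should be negligible).
tail_beyond_20 = stats.poisson.sf(20, rate_two_min)

# The questions are per minute, so use a rate of three per minute.
rate_one_min = rate_two_min / 2
p_exactly_four = stats.poisson.pmf(4, rate_one_min)
p_more_than_three = stats.poisson.sf(3, rate_one_min)   # P(X > 3)

print(np.round(poisson[:9], 3))
print(f"exactly 4 in a minute: {p_exactly_four:.4f}")
print(f"more than 3 in a minute: {p_more_than_three:.4f}")
```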
what is the probability that exactly four customers arrive in a given minute? be
slightly careful. six
customers arrive every two minutes. the question asks
for exactly four customers arriving in one minute, which
means, if i'm putting six as
the rate, i have to convert this question to saying what is the probability
that exactly how many customers arrive every two minutes? eight
customers arrive every two minutes. or
what i can do is i can change my rate to 3. this
one is a distribution with which you do most of the calculations. this is
the normal distribution, the distribution that corresponds to age,
to the means and so on, all the continuous variables that we were looking at.
numbers. so if you're dealing with numbers, then
you deal with the distribution that has a shape like that. this is called a normal distribution.
now the normal distribution, the reason i wanted to get to it is because of this
picture. this picture puts the standard
deviation in context. so yesterday we talked about the standard deviation, and a
question often asked is what does the standard deviation mean? what is standard about the standard deviation?
this picture tells you what is standard about the standard deviation. so
this picture means that if i have a normal distribution, then the
chance of being within one standard deviation is 68 percent. as
a numerical quantity, this distribution is a distribution that
has a mean and it has a standard deviation. now
the standard deviation is defined in such a way, and the way the standard
deviation is defined implies, that the chance of being within one
standard deviation is 68%, the chance of
being within two standard deviations is 95%,
the chance of being within three standard deviations is 99.7%.
so now if i tell you something like this, that i'm telling you that
for a group of people the mean height is, say, 5 feet 10 inches,
with a standard deviation of 2 inches, or rather,
so mu,
let's say, is five feet eight inches, and
a standard deviation, sometimes denoted by sigma,
of, say, two inches. i've told you some interesting things.
if you allow me a normal distribution, i have now told you
that sixty-eight percent, or roughly two-thirds of the people, are
between 5 feet 6 inches and 5 feet 10
inches. this is 5 8.
this is one standard deviation, which is two and two. so
this is 5 10 and this is 5 6, and
this is about 68 percent. sometimes it's easy to remember
it as two thirds. close enough. two
out of three are between these two heights. 95%
are between what and what? six feet and five
four. 95% are between these
two heights. one
in 20 are outside this range. so
therefore if i tell you the mean and the standard deviation, i have actually told you
a reasonable amount as to how the data is spread. so
sometimes the mean and the standard deviation are reverse-engineered,
so to speak. so if you are professionals, and i often do this, people
often ask, where is the data? and they say nobody has any data. so
you know, you're trying to figure out what to work with.
so you might ask a question: when do you typically arrive? and someone says, oh, nine
o'clock, thereabouts. what's your earliest arrival time? 8:30.
what is your latest? ten o'clock. so
looking at this you can decide what you should
assume: that the whole range of the distribution is, say, from
8:30 to 10 o'clock. and now this pattern tells you that if i
go for three sigma, covering 99.7%, this whole range is about six standard deviations.
so to find the mean you just take the middle of it, and to find
the standard deviation you take the whole range and divide by 6. so
i can get an idea of what the average is and what the standard deviation is without even getting any data
from you, just by getting a sense of the extremes. it's
a nonsense way of doing things, but what it
does is it allows you to cheat with essentially very minimal information. so
remember this. remember, these pictures are helpful. they give you an idea of what the distribution
is. by the way, these numbers are easy enough to calculate,
so we'll do some calculations. the normal distribution is a bell-shaped distribution.
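
The 68 / 95 / 99.7 figures, the height example and the range-divided-by-six shortcut can all be checked in a few lines. This sketch uses `statistics.NormalDist` from the Python standard library rather than the course notebook; heights are in inches (5 feet 8 inches is 68) and arrival times are in minutes after midnight, both conversions being mine.

```python
from statistics import NormalDist

# Chance of landing within 1, 2 and 3 standard deviations of the mean.
z = NormalDist()   # standard normal: mean 0, sd 1
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

# Heights: mean 5'8" (68 in), sd 2 in.
heights = NormalDist(mu=68, sigma=2)
share_5ft6_to_5ft10 = heights.cdf(70) - heights.cdf(66)   # one sd either side
share_5ft4_to_6ft = heights.cdf(72) - heights.cdf(64)     # two sd either side

# Reverse-engineering from extremes: earliest 8:30, latest 10:00.
earliest, latest = 8 * 60 + 30, 10 * 60
rough_mean = (earliest + latest) / 2     # middle of the range, 9:15
rough_sd = (latest - earliest) / 6       # range spans about six sd

print({k: round(v, 4) for k, v in within.items()})
print(round(share_5ft6_to_5ft10, 3), round(share_5ft4_to_6ft, 3))
print(rough_mean, rough_sd)
```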
so it's symmetrical. the tails could be extended. it depends on
two parameters, mu and sigma. see the power of it: by giving you
two numbers i have given you characteristics like this. so
i can do calculations. and this is the density function, that equation, if you want to think of it. nobody
does anything with this, but then you can do calculations on it. so
here's a calculation. i'm not sure this is a calculation that we had worked on. this
is a calculation that we will actually do in some detail. let's do it. so
the mean weight of a morning breakfast cereal pack
is 0.295 kilograms with a standard deviation of 0.025 kilograms. the
random variable weight of the pack follows a normal distribution. what is the probability that
the pack weighs less than 280 grams? now, why would someone be interested
in this? one possibility
perhaps is that maybe the target for the pack is
something like 300 grams, and
you're trying to understand whether you are within tolerances
or more or less or something of that sort. so
what is the probability that the pack weighs less than 280? so
what do i need to do? what is my picture like? my
average is 295, standard
deviation of 25 on the gram scale,
and i want to find the chance of being to the left of 280.
i need this area. calculating this area is
actually quite easy. so let me calculate that area. so i'm going to
do it this way stats. . 
norm. c. bf 
c d e f stands for cumulative 
distribution function i tell you what's cumulative about it 
cdf now. what is the number that i'm interested in probability 
of being less than 2 8 0 or if 
i want to be very clear about this point sorry 
point to it and i'm gonna do something here comma location 
equal to location means the middle of the distribution for me. 
what is the mean? point 
two nine five coma 
skill is equal to what is the standard deviation? is 
that are the numbers correct? twenty-seven 
percent this one here is 0.28 your 
sanctuary .27. no 
and calculating the answer to this question. what 
is the probability that the pack weighs less than 280 grams? 
this is the question. also this the 
way i set it up was to say what is 
the chance of being less than 280? when the mean 
is 295 and the standard deviation is .25. 
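the calculation just described can be checked with a minimal sketch. the session uses scipy, where the call would be stats.norm.cdf(0.280, loc=0.295, scale=0.025); the sketch below is an assumption on my part and swaps in NormalDist from the python standard library so that it runs without scipy. the name pack_weight is only illustrative.

```python
from statistics import NormalDist

# weights in kilograms, as in the session: mean 0.295, standard deviation 0.025
# (stand-in for scipy's stats.norm.cdf(0.280, loc=0.295, scale=0.025))
pack_weight = NormalDist(mu=0.295, sigma=0.025)

# cdf(x) is the area to the left of x, that is P(weight < x)
p_light = pack_weight.cdf(0.280)
print(f"P(weight < 280 g) = {p_light:.4f}")
```

the printed value should come out at roughly 27 percent, matching the figure quoted in the session.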
because of certain technical aspects of other functions, the
mean here is referred to as location and the standard deviation is
referred to as scale.
if those terms, location and scale, confuse you, just ignore
them. this first term is
the number you are asking about. otherwise,
if this
version here makes more sense to you, go ahead with this. all
right, do you understand how the code works? let's
do the second problem. what is the probability that the pack weighs
more than 350 grams? what do you
think the answer should be? yes, one minus. yes,
one minus what? one
minus stats.norm.cdf,
with 0.350 and
the same mean and standard deviation. about
1.39%: that is the chance of being more than 350 grams.
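a hedged sketch of the "one minus" step, again using the standard library NormalDist as an assumed stand-in for the scipy call in the session (there it would be 1 - stats.norm.cdf(0.350, loc=0.295, scale=0.025)). the variable names are illustrative.

```python
from statistics import NormalDist

pack_weight = NormalDist(mu=0.295, sigma=0.025)  # kilograms

# cdf gives the area to the LEFT of 0.350, so the area to the right
# is whatever remains out of a total probability of 1
p_heavy = 1 - pack_weight.cdf(0.350)
print(f"P(weight > 350 g) = {p_heavy:.4f}")
```

350 grams is 2.2 standard deviations above the mean, which is why the right-hand tail is only a little over one percent.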
clear? so what does cdf do? cdf,
cumulative distribution
function: what
does it do? it calculates the area to the left, the probability of being less
than. therefore, if i want the area to the right, the probability
of being more than, i need to go 1 minus. why? because the
whole probability is 1. what
was the third one? what is the probability that the pack weighs
between 260 grams and 340 grams?
how do we do this? yes,
340. so
i now need to be between 340 and,
what is it, 260. so
less than 340 minus less than 260.
let's get lazy and reuse the same call. what
is this number? 340.
and this is 260, right? eighty-eight
percent. 88% of my packets
are going to lie between 260 grams
and 340 grams. that's under an assumption we're making.
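the "less than 340 minus less than 260" idea can be written as a small helper. this is a sketch under assumptions: prob_between is a hypothetical name, not something from the session, and it uses the standard library NormalDist rather than scipy. working in grams instead of kilograms changes nothing as long as all three numbers share the same unit.

```python
from statistics import NormalDist

def prob_between(low, high, mean, sd):
    """P(low < X < high) for a normal X: area left of high minus area left of low."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(high) - dist.cdf(low)

# grams this time; only the mean and the standard deviation go in
p_mid = prob_between(260, 340, mean=295, sd=25)
print(f"P(260 g < weight < 340 g) = {p_mid:.4f}")
```

the result should be close to the 88 percent quoted in the session.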
remember, there isn't any data at all here.
what numbers am i using? the mean and the
standard deviation. so what is the advantage that i have?
i don't need the data; all i need is this mean and standard deviation. what is the price i pay?
an assumption on the distribution.
now, i could, instead of using norm, have another distribution
sitting there. there's a whole range of other possibilities; binom is
one. there are other distributions if you want them, and you
would decide based on whichever distribution makes
most sense for your application. in certain cases you know what the
nature of those distributions looks like. for example, if you're looking at lifetimes of things,
it's an exponential distribution, a gamma distribution or something of that sort. but
there's a certain advantage to the normal distribution because of something called the central limit
theorem. we'll cover that a little bit; it will be mentioned in
the next residency. the central limit theorem essentially says
that if i take the averages of things, or the totals
of things, i end up with a normal distribution. the
normal distribution is a result of averaging. so
if my observation is a total of many little things, then
the normality assumption is probably a good assumption for it. large
data doesn't necessarily mean normal. but if
your observation is the total or the accumulation of lots of things, it often does. for example, height is often
normal. why? because our height is in some
way a random combination of many things, maybe the height of each of our cells and things of that sort.
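a small simulation can illustrate the "total of little things" idea. this is only a sketch under assumptions that are mine, not the session's: the little things are independent exponential draws, 50 of them are added up, and the check is the share of values within one standard deviation of the mean, which is about 68 percent for a normal distribution. the function names are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def total_of_small_things(n_parts):
    # each part is a skewed, clearly non-normal quantity (exponential with mean 1)
    return sum(random.expovariate(1.0) for _ in range(n_parts))

def share_within_one_sd(values):
    # fraction of values lying within one standard deviation of their mean
    m = statistics.fmean(values)
    s = statistics.stdev(values)
    return sum(1 for v in values if abs(v - m) <= s) / len(values)

single_parts = [total_of_small_things(1) for _ in range(10000)]
totals = [total_of_small_things(50) for _ in range(10000)]

share_single = share_within_one_sd(single_parts)
share_totals = share_within_one_sd(totals)
print(f"single skewed parts: {share_single:.3f} within one sd")
print(f"totals of 50 parts:  {share_totals:.3f} within one sd (normal would be about 0.683)")
```

the single parts should sit well away from the normal figure, while the totals should land close to it, even though nothing normal went in.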
so 
the normal distribution is often used as an assumption, based on the central limit theorem. the
other part of it is that even if the data doesn't look like
a normal distribution, there is a caveat to that:
a sample from a normal distribution doesn't necessarily look like
a sample from a normal distribution. it may not even
show the bell-shaped curve, as we saw yesterday, so it's hard to look at the data
and say that it is not normal. so normality is an assumption
that is often made in the absence of any
other information on the data. it is
obviously wrong in cases where the data has a very strong skew in
one direction or the other. but remember, in many cases you're not even talking about
the data. the question that you're asking is not a
data question. the question that you're asking is a probability question, a situational
question. you're asking, effectively, the following thing:
why is an analysis of this kind done? what
data is it going after, if anything? you're talking
about the data being normal or not normal. what data is it even referring
to? why do i care about the first question, what
is the probability that a pack weighs less than 280 grams? one context for it could be
that if a person buys a pack, what is the chance that they're getting a
light pack, in other words something that is less than 280 grams?
true. but my question is this: where in all of that is the data? where
is the data in this? how do you even think of it? is it a data problem at all? i'm
asking the question, is my product within spec? in
other words, what data are you referring to?
is this a data science issue at all, or is it not? we
are asking the question, is it normal or is it not, but is it a data question? you
say it reaches the customer, yes, and it's the weight of the pack, sure. so
what data is that?
which data? how many data
observations, which data observations, for whom, for which customer, when?
what data? a
quality check, for what? i could argue, for example, that this is
about saying that if someone goes in and buys that breakfast cereal, will they
get something that is below 280 grams? is it value for the price?
yes, but where is the data? in the supermarket there
is none. it's a business question. what data
does it apply to? what i'm trying to say is that this is not a data problem
at all. you can solve it using data;
you can say, i'm going to gather a lot of data to solve the problem.
i'm telling you that this number could come from past data.
so you could say, i'm going to gather the data to get
this number and get this number. that's a good answer. in order to solve my
business problem, i need a mean and a standard deviation so that i can get a handle
on what the chance is that the pack will be underweight. now that mean and standard
deviation have to come from somewhere, and i can say i will use data to get that mean
and that standard deviation. that's a good answer. you will now say,
why do i need data? in order to calculate the mean and standard deviation. why do i need
the mean and standard deviation? because that's the least i need
in order to be able to answer this question, which
is the question i'm interested in answering: will he buy
the product? will my network go down? will i be underweight? there's a business question i'm interested in answering.
there is a tech question that i'm interested in answering. and often that question is posed
independently of the data. so for example,
the car has to stop: autonomous vehicles, right?
the data that the car is going to react to is the scene that the car sees in front of it, but
that's not the data on which the algorithm is going to be based.
the scene that the car sees is what it is
reacting to. similarly, this is reacting to only one number, 280 grams. i
am now solving the 280 grams problem by saying,
i'm giving you a packet and i'm asking the question: is this underweight? does this have less
weight than it should? i'm interested
only in that. i'm not interested in any data.
in hypothesis testing, what we will do when we come back is close out that question
and ask, from data, how do we get to numbers like this?
which now means that i have to put the two pieces of this residency
together. i have to put together the idea
of calculating means and standard deviations from data with the
idea that it is a parameter being estimated to solve a problem.
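putting the two pieces together might look like the sketch below. everything about the data here is an assumption: the session has no data set for this problem, so past_weights is a hypothetical sample simulated with a fixed seed, and the names are illustrative. the first step is the descriptive one (estimate the mean and standard deviation), the second is the probability one (plug them into the normal model).

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is repeatable

# hypothetical past data: 400 simulated pack weights in grams (not real measurements)
past_weights = [random.gauss(295, 25) for _ in range(400)]

# step 1, descriptive: estimate the two parameters from the sample
mean_hat = statistics.fmean(past_weights)
sd_hat = statistics.stdev(past_weights)

# step 2, probability: plug the estimates into the normal model
p_light = NormalDist(mu=mean_hat, sigma=sd_hat).cdf(280)
print(f"estimated mean {mean_hat:.1f} g, sd {sd_hat:.1f} g, P(weight < 280 g) = {p_light:.3f}")
```

because the estimates come from a sample, the answer will be near, but not exactly equal to, the figure obtained from the stated mean and standard deviation.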
so you would say that this 295 comes from data.
that immediately raises an issue: if it comes from data, it comes from a sample,
and if it comes from a sample it's not exact, and if it's not exact,
then how well does it solve my problem? and life keeps going in circles
like that. so this is the probability
side of it, which explains why i need to have means and standard deviations
in order to do a calculation, and the descriptive part
says i have the means and the standard deviations to do the calculations.
so you ask: if it had a normal distribution, am i more relaxed? no.
if it had a normal distribution then maybe i'd be able to get good numbers around
this; the plus and minus would be symmetric. but this calculation doesn't
rely on the normality of the data behind the 295 estimate. this
calculation relies on the normality of the future data, which doesn't exist at all. but
what i'm asking is, will these numbers, the mu and the sigma, be more reliable
if i had a normal distribution? no, not necessarily.
if i have normal distributions, i'll be able to use certain very specific formulas that we'll see; if
it is not normal, those formulas may break down a little bit. so those formulas help me calculate.
so normality helps me calculate how
good these numbers are. it also helps me calculate, using
normality, what the answers to questions such as these are. but
the normality that i used ten minutes ago had nothing
to do with data. and that, to some extent,
is the power of probability: you're able to answer a question like, do
i expect that the weight is going to be less than 280 grams, without
having data in place for it. the simple answer would be,
give me the data and count how many are less than 280 grams. that's the simplest answer.
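the "give me the data and count" answer is easy to sketch when the data exists. here it does not, so the packets below are a hypothetical, simulated stand-in (an assumption of this sketch, with a fixed seed); with real weighings you would replace the list and keep the counting line.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# hypothetical stand-in for a hundred weighed packets, in grams (simulated, not real data)
packets = [random.gauss(295, 25) for _ in range(100)]

# the empirical answer: count the light ones and divide by the total
light = sum(1 for w in packets if w < 280)
share_light = light / len(packets)
print(f"{light} of {len(packets)} packets are under 280 g, share = {share_light:.2f}")
```

with only a hundred packets the share will wobble from sample to sample, which is one reason the model-based calculation is still useful.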
right? what is the chance that the pack weighs less than 280 grams? empirically, go
collect a hundred packets and find out how many of them weigh less than 280
grams. that's the answer to that question.
why are we doing all of this? because you don't have that,
you don't have that data. why don't you have that data? because that's not
the question i'm asking. i'm asking the question, is this one less than 280? again: i'm looking at a computer
program in front of me and i'm asking, what is the chance that there are more than five bugs
in this code? i'm looking at all the computers
in my office and i'm asking, what is the chance that, across all the employees, today
there are going to be more than two hacks or malicious attempts
on my server? there is no data
yet. there will be, but only after the hack has happened. but
i still need those numbers, and i get those numbers using these distributions. to
operationalize those distributions i need certain numbers, and i can get them any way i like. i
can beg for them, i can borrow them, i can steal them, i can estimate them from data, i can ask a friend,
i can read a book, i can look up a standard, i can look at market
research. yes, i can do any number of things to get at those numbers. i
can look at an industry standard. those two pieces we will put together.
somebody is getting a little nervous.
this picture is definitional for the normal distribution.
so if you look at six sigma:
a width of 6 sigma, plus or minus 3 sigma, will cover 99.7 percent, so about 3 per thousand
will lie outside the plus or minus 3 sigma range. not
everything is covered; roughly 3 in a thousand is left out. this
is three sigma. six sigma usually says 3.4
defects per million opportunities, which is actually not six sigma statistically. 3.4 per million
is not 6 sigma; it is 6 minus 1.5, that is 4.5 sigma. so 4.5
sigma is about 3.4 into 10 to the power minus 6. that's
4.5 sigma. so if you look at the six sigma literature, there's a confusion there.
what it says is that, in
order to get 3.4 defects per million at the customer,
you have to be within six sigma, which is about one in a billion.
this picture is at three standard deviations, plus or minus three standard deviations.
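the tail figures quoted here can be checked against the standard normal distribution. this is a sketch, not the session's code: it uses the standard library NormalDist, and it assumes the 3.4 per million and one in a billion figures refer to a single tail, since that is the reading under which they match.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# share lying outside plus or minus 3 sigma (both tails together)
outside_3_sigma = 2 * z.cdf(-3)

# one-sided tails beyond 4.5 sigma and beyond 6 sigma
tail_4_5_sigma = z.cdf(-4.5)
tail_6_sigma = z.cdf(-6)

print(f"outside +/- 3 sigma: {outside_3_sigma:.4f}")   # about 3 per thousand
print(f"beyond 4.5 sigma:    {tail_4_5_sigma:.2e}")    # about 3.4 per million
print(f"beyond 6 sigma:      {tail_6_sigma:.2e}")      # about one in a billion
```

so the 3.4 defects per million figure lines up with 4.5 sigma, not with 6 sigma.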
if i go out to 4.5 standard deviations, the tail on one side is around 3.4
into 10 to the power minus 6. to reach that for the customer, i need
to go to six sigma here, which is about one in a billion. i
must be more accurate on my factory floor than what my customer sees. so
if i reach six sigma, my customer will see 4.5
sigma, and for the customer 4.5 sigma is the 3.4 into 10 to the
power minus 6. so if you look at 3.4 into 10 to the power minus 6, it doesn't correspond
to six sigma. it's a little confusing, but that's
the way the six sigma literature is written. the normal distribution
is just this, as a formula. plus or minus 2
sigma is about 95%; actually, plus or minus 1.96
sigma is 95%, and plus or minus 3 sigma is about 99.7
percent. the curve by
definition goes out to infinity. you want to cover everything? plus or minus
infinite standard deviations.
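the coverage figures can be reproduced the same way. a minimal sketch, assuming the standard library NormalDist in place of scipy; central_coverage is a hypothetical helper name.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal

def central_coverage(k):
    """share of a normal distribution within plus or minus k standard deviations."""
    return z.cdf(k) - z.cdf(-k)

for k in (1.96, 2, 3):
    print(f"+/- {k} sigma covers {central_coverage(k):.4f}")

# going the other way: how many sigmas leave 2.5% in each tail?
k_95 = z.inv_cdf(0.975)
print(f"95% coverage needs +/- {k_95:.2f} sigma")
```

no finite number of standard deviations reaches exactly 100 percent, which is the point about the curve running out to infinity.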
