[MUSIC PLAYING]
BRIAN YU: All right, welcome
back, everyone, to an Introduction
to Artificial Intelligence with Python.
And last time we took a look at how
it is that AI inside of our computers
can represent knowledge.
We represented that knowledge
in the form of logical sentences
in a variety of different
logical languages,
and the idea was we wanted our AI
to be able to represent knowledge
or information and somehow use
those pieces of information
to be able to derive new pieces
of information via inference,
to be able to take some
information and deduce
some additional conclusions
based on the information
that it already knew for sure.
But in reality, when we think about
computers and we think about AI,
very rarely are our machines going
to be able to know things for sure.
Oftentimes there's going to
be some amount of uncertainty
in the information that
our AIs or our computers
are dealing with where it might believe
something with some probability,
as we'll soon discuss what probability
is all about and what it means,
but not entirely for certain.
And we want to use the information that
it has some knowledge about, even if it
doesn't have perfect knowledge, to
still be able to make inferences, still
be able to draw conclusions.
So you might imagine, for example,
in the context of a robot that
has some sensors and is
exploring some environment,
it might not know exactly where
it is or exactly what's around it,
but it does have access to some data
that can allow it to draw inferences
with some probability.
There's some likelihood that
one thing is true or another,
or you can imagine in context
where there is a little bit more
randomness and uncertainty, something
like predicting the weather, where
you might not be able to know for
sure what tomorrow's weather is
with 100% certainty,
but you can probably
infer with some probability
what tomorrow's weather is
going to be based on maybe today's
weather and yesterday's weather
and other data that you
might have access to as well.
And so oftentimes we can distill
this in terms of just possible events
that might happen and what the
likelihood of those events are.
This comes a lot in
games, for example, where
there's an element of chance
inside of those games.
So you imagine rolling the dice.
You're not sure exactly what
the die roll is going to be,
but you know it's going to be one of
these possibilities from one to six,
for example.
And so here, now, we introduce
the idea of probability theory.
And what we'll take a
look at today is beginning
by looking at the mathematical
foundations of probability theory,
getting an understanding for some of
the key concepts within probability,
and then diving into how we can
use probability and the ideas
that we look at mathematically to
represent some ideas in terms of models
that we can put into our computers
in order to program an AI that
is able to use information about
probability to draw inferences,
to make some judgments about
the world with some probability
or likelihood of being true.
So probability ultimately
boils down to this idea
that there are possible
worlds that we're here
representing using this
little Greek letter omega,
and the idea of a possible world
is that, when I roll a die,
there are six possible worlds
that could result from it.
I can roll a 1 or a 2 or
a 3 or a 4 or a 5 or a 6,
and each of those is a possible world,
and each of those possible worlds
has some probability of being true, the
probability that I do roll a 1 or a 2
or a 3 or something else.
And we represent that
probability like this,
using the capital letter P
and then, in parentheses, what
it is that we want the probability of.
So this right here would be the
probability of some possible world
as represented by the
little letter omega.
Now, there are a couple of
basic axioms of probability
that become relevant as we consider
how we deal with probability
and how we think about it.
First and foremost,
every probability value
must range between
zero and one inclusive.
So the smallest value
any probability can
have is the number zero, which
represents an impossible event,
something like rolling a die
and having it come up as a seven.
If the die only has
numbers one through six,
the event that I roll
a seven is impossible,
so it would have probability zero.
And on the other end of
the spectrum, probability
can range all the way up
to the positive number one,
meaning an event is certain to happen,
that I roll a die and the number
is less than 10, for example.
That is an event that is guaranteed
to happen if the only sides on my die
are one through six, for instance.
And then there can range through any
real number in between these two values
where, generally speaking, a
higher value for the probability
means an event is more
likely to take place
and a lower value for
the probability means
the event is less likely to take place.
And the other key rule for probability
looks a little bit like this.
This sigma notation, if
you haven't seen it before,
refers to summation, the idea
that we're going to be adding up
a whole sequence of values.
And this sigma notation's going
to come up a couple of times
today, because as we
deal with probability,
oftentimes we're adding up a
whole bunch of individual values
or individual probabilities
to get some other value.
So we'll see this come
up a couple of times.
But what this notation means is that
if I sum up all of the possible worlds
omega that are in big Omega,
which represents the set of all
the possible worlds, meaning
I take for all of the worlds
in the set of possible worlds and add
up all of their probabilities, what
I ultimately get is the number one.
So if I take all the possible worlds,
add up what each of their probabilities
is, I should get the
number one at the end,
meaning all probabilities
just need to sum to one.
So for example, if I
take dice, for example,
if you imagine I have a fair
die with numbers one through six
and I roll the die,
each one of these rolls
has an equal probability
of taking place,
and the probability is
one over six, for example.
So each of these probabilities is
between zero and one, zero meaning
impossible and one
meaning certain.
And if you add up all
of these probabilities
for all of the possible
worlds, you get the number one.
And we can represent any one of
those probabilities like this.
The probability that we roll the number
two, for example, is just one over six.
Every six times we roll the die, we'd
expect that one time, for instance,
the die might come up as a two.
Its probability is not certain, but
it's a little more than nothing,
for instance.
And so this is all fairly
straightforward for just a single die.
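The single-die example can be sketched in a few lines of Python. This is a minimal illustration, not course code: it just assigns each of the six possible worlds probability 1/6 and checks both axioms.

```python
from fractions import Fraction

# Each of the six possible worlds gets probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# First axiom: every probability lies between 0 and 1 inclusive.
assert all(0 <= p <= 1 for p in die.values())
# Second axiom: the probabilities of all possible worlds sum to 1.
assert sum(die.values()) == 1

print(die[2])  # probability of rolling a 2 -> 1/6
```

Using exact fractions rather than floats keeps the sum-to-one check exact.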
But things get more interesting
as our models of the world
get a little bit more complex.
Let's imagine now that we're not
just dealing with a single die,
but we have two dice, for example.
I have a red die here
and a blue die there,
and I care not just about
what the individual roll is,
but I care about the
sum of the two rolls.
In this case, the sum of the
two rolls is the number three.
How do I begin to now
reason about, what does
the probability look like if, instead
of having one die, I now have two dice?
Well, what we might imagine is
that we could first consider,
what are all of the possible worlds?
And in this case, all
of the possible worlds
are just every combination of the red
and blue die that I could come up with.
For the red die, it could be a 1
or a 2 or a 3 or a 4 or a 5 or a 6,
and for each of those
possibilities, the blue die,
likewise, could also be either
1 or 2 or 3 or 4 or 5 or 6.
And it just so happens that,
in this particular case,
each of these possible
combinations is equally likely.
Equally likely are all of these
various different possible worlds.
That's not always going to be the case.
As you imagine more complex
models that we could try to build
and things that we could try
to represent in the real world,
it's probably not going to be the case
that every single possible world is
always equally likely.
But in the case of fair dice
where, in any given die roll,
any one number has just as good a
chance of coming up as any other number,
we can consider all of these
possible worlds to be equally likely.
But even though all of the
possible worlds are equally likely,
that doesn't necessarily mean that
their sums are equally likely.
So if we consider what the sum is of each
of these pairs-- so 1 plus 1, that's a 2.
2 plus 1 is a 3--
and consider for each of these
possible pairs of numbers
what their sum ultimately
is, we can notice
that there are some patterns here
where it's not entirely the case
that every number comes
up equally likely.
If you consider seven,
for example, and ask, what's
the probability that when I roll
two dice their sum is seven?
Well, there are several ways this can happen.
There are six possible worlds
where the sum is seven.
It could be a one and a six
or a two and a five or a three
and a four, a four and
a three, and so forth.
But if you instead consider,
what's the probability that I
roll two dice and the sum of those two
die rolls is 12, for example, well,
looking at this diagram,
there's only one possible world
in which that can happen,
and that's the possible world
where the red die and the
blue die both come up as sixes
to give us the sum total of 12.
So based on just taking
a look at this diagram,
we see that some of these
probabilities are likely different.
The probability that the sum is a seven
must be greater than the probability
that the sum is a 12.
And we can represent that even
more formally by saying, OK,
the probability that we
sum to 12 is one out of 36.
Out of the 36 equally
likely possible worlds
(six squared, because we have
six options for the red die
and six options for the blue die),
only one of them
sums to 12, whereas, on the
other hand, the probability
that if we take two dice rolls and
they sum up to the number seven,
well, out of those 36
possible worlds, there
were six worlds where the sum was
seven, and so we get six over 36,
which we can simplify as a
fraction to just one over six.
So here, now, we're able to represent
these different ideas of probability,
representing some events
that might be more likely
and then other events that
are less likely, as well.
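The two-dice counting argument above can be sketched directly by enumerating all 36 equally likely possible worlds and counting the favorable ones. This is an illustrative sketch; the names `worlds` and `p_sum` are my own.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely possible worlds, as (red, blue) pairs.
worlds = list(product(range(1, 7), repeat=2))

def p_sum(target):
    """Probability that the red and blue rolls sum to `target`."""
    favorable = sum(1 for red, blue in worlds if red + blue == target)
    return Fraction(favorable, len(worlds))

print(p_sum(12))  # 1/36: only (6, 6) works
print(p_sum(7))   # 1/6: six worlds sum to seven
```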
And these sorts of judgments
where we're figuring out,
just in the abstract, what is the
probability that this thing takes
place, are generally known as
unconditional probabilities,
some degree of belief we have in some
proposition, some fact about the world
in the absence of any other
evidence without knowing
any additional information.
If I roll a die, what's the
chance it comes up as a two,
or if I roll two dice, what's the
chance that the sum of those two
die rolls is a seven?
But usually when we're thinking about
probability, especially when we're
thinking about training an AI to
intelligently be able to know something
about the world and make predictions
based on that information,
it's not unconditional probability
that our AI is dealing with,
but, rather, conditional
probability, probability
where rather than having
no original knowledge,
we have some initial knowledge about the
world and how the world actually works.
So conditional probability
is the degree of belief
in a proposition given some evidence
that has already been revealed to us.
So what does this look like?
Well, it looks like this
in terms of notation.
We're going to represent conditional
probability as probability of a
and then this vertical bar and then b.
And the way to read this is the thing on
the left-hand side of the vertical bar
is what we want the probability of.
Here, now, I want the probability that
a is true, that it is the real world,
that it is the event that
actually does take place.
And then on the right
side of the vertical bar
is our evidence, the
information that we already
know for certain about the world--
for example, that b is true.
So the way to read
this entire expression
is, what is the
probability of a given b,
the probability that a is true given
that we already know that b is true?
And this type of judgment,
conditional probability,
the probability of one thing given
some other fact, comes up quite a lot
when we think about the
types of calculations
we might want our AI to be able to do.
For example, we might care
about the probability of rain
today given that we know
that it rained yesterday.
We could think about the probability
of rain today just in the abstract.
What is the chance that today it rains?
But usually we have some
additional evidence.
I know for certain that
it rained yesterday,
and so I would like to calculate
the probability that it rains today
given that I know that
it rained yesterday,
or you might imagine that I
want to know the probability
that my optimal route to
my destination changes
given the current traffic conditions.
So whether or not traffic
conditions change,
that might change the probability
that this route is actually
the optimal route, or you might
imagine in a medical context
I want to know the probability that a
patient has a particular disease given
some results of some tests that
have been performed on that patient,
and I have some evidence,
the results of that test,
and I would like to know the
probability that a patient has
a particular disease.
So this notion of
conditional probability
comes up everywhere as we
begin to think about what
we would like to reason about,
but being able to reason
a little more intelligently by
taking into account evidence
that we already have.
We're more able to get an
accurate result for what
is the likelihood that
someone has this disease
if we know this evidence,
the results of the test,
as opposed to if we
were just calculating
the unconditional probability of saying,
what is the probability they have
the disease without any evidence
to try and back up our result one
way or the other?
So now that we've got this idea of
what conditional probability is,
the next question we have
to ask is, all right,
how do we calculate
conditional probability?
How do we figure out, mathematically,
if I have an expression like this,
how do I get a number from that?
What does conditional
probability actually mean?
Well, the formula for
conditional probability
looks a little something like this--
the probability of a
given b, the probability
that a is true given that
we know that b is true,
is equal to this
fraction-- the probability
that a and b are true divided by
just the probability that b is true.
And the way to intuitively
try to think about this
is that if I want to know the
probability that a is true given that b
is true, well, I want to consider
all the ways they could both be
true, where the only
worlds that I care about
are the worlds where b is already true.
I can sort of ignore all
the cases where b isn't true
because those aren't relevant
to my ultimate computation.
They're not relevant to what it is
that I want to get information about.
So let's take a look at an example.
Let's go back to that example
of rolling two dice and the idea
that those two dice might
sum up to the number 12.
We discussed earlier that
the unconditional probability
that if I roll two dice and
they sum to 12 is one out of 36,
because out of the 36 possible
worlds that I might care about,
in only one of them is the
sum of those two dice 12.
It's only when red is
six and blue is also six.
But let's say now that I have
some additional information.
I now want to know,
what is the probability
that the two dice sum to 12 given that
I know that the red die was a six?
So I already have some evidence.
I already know the red die is a six.
I don't know what the blue die is.
That information isn't given
to me in this expression.
But given the fact that
I know that the red die
rolled a six, what is the
probability that we sum to 12?
And so we can begin to do the math
using that expression from before.
Here, again, are all
of the possibilities,
all of the possible combinations
of red die being one through six
and blue die being one through six.
And I might consider,
first, all right, what
is the probability of my
evidence, my b variable where
I want to know what is the
probability that the red die is a six?
Well, the probability that the red
die is a six is just one out of six.
So these six worlds
are really the only ones
that I care about here now.
All the rest of them are
irrelevant to my calculation
because I already have this
evidence that the red die was a six,
so I don't need to care about all of the
other possibilities that could result.
So now, in addition to the fact
that the red die rolled as a six
and the probability of
that, the other piece
of information I need to
know in order to calculate
this conditional probability is the
probability that both of my variables,
a and b, are true, the probability
that both the red die is a six
and they all sum to 12.
So what is the probability that
both of these things happen?
Well, it only happens in one possible
case, in one out of these 36 cases,
and it's the case where both the red
and the blue die are equal to six.
This is a piece of information
that we already knew.
And so this probability
is equal to one over 36.
And so to get the
conditional probability
that the sum is 12 given that I know
that the red die is equal to six,
well, I just divide these two values
together, and 1/36 divided by 1/6
gives us this probability of 1/6.
Given that I know that the
red die rolled a value of six,
the probability that the sum of the
two dice is 12 is also one over six.
And that probably makes
intuitive sense for you,
too, because if the red die is a six,
the only way for me to get to a 12
is if the blue die also rolls a six.
And we know that the probability of the
blue die rolling a six is one over six.
So in this case, the
conditional probability
seems fairly straightforward.
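The worked example can be checked by applying the conditional probability formula, P(a | b) = P(a and b) / P(b), to the 36 enumerated worlds. A sketch, with events written as simple predicates on a world:

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))  # (red, blue) pairs

def p(event):
    """Unconditional probability of an event over the 36 worlds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

def p_given(a, b):
    """Conditional probability P(a | b) = P(a and b) / P(b)."""
    return p(lambda w: a(w) and b(w)) / p(b)

sum_is_12 = lambda w: w[0] + w[1] == 12
red_is_6 = lambda w: w[0] == 6

print(p_given(sum_is_12, red_is_6))  # (1/36) / (1/6) = 1/6
```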
But this idea of calculating
a conditional probability
by looking at the probability that
both of these events take place
is an idea that's going to
come up again and again.
This is the definition, now,
of conditional probability,
and we're going to use
that definition as we
think about probability more generally
to be able to draw conclusions
about the world.
This, again, is that formula.
The probability of a given b
is equal to the probability
that a and b take place divided
by the probability of b.
And you'll see this formula sometimes
written in a couple of different ways.
You could imagine, algebraically,
multiplying both sides of this equation
by probability of b to
get rid of the fraction,
and you'll get an expression like this.
The probability of a and b, which
is this expression over here,
is just the probability of b times
the probability of a given b,
or you could represent this equivalently
since a and b, in this expression,
are interchangeable. a and b
is the same thing as b and a.
You could imagine also representing
the probability of a and b
as the probability of a times the
probability of b given a, just
switching all of the a's and b's.
These three are all equivalent
ways of trying to represent
what joint probability means.
And so you'll sometimes
see all of these equations,
and they might be useful to you as
you begin to reason about probability
and to think about what values might
be taking place in the real world.
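The three equivalent formulations can be verified numerically on the dice. This sketch computes the conditional probabilities from their definitions and checks that P(a and b) = P(b) P(a | b) = P(a) P(b | a):

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))

def p(event):
    """Unconditional probability of an event over the 36 worlds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

a = lambda w: w[0] + w[1] == 7   # the sum is seven
b = lambda w: w[0] == 6          # the red die is a six

p_ab = p(lambda w: a(w) and b(w))   # joint probability, 1/36
p_a_given_b = p_ab / p(b)           # P(a | b)
p_b_given_a = p_ab / p(a)           # P(b | a)

# All three formulations of the joint probability agree.
assert p_ab == p(b) * p_a_given_b == p(a) * p_b_given_a
```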
Now, sometimes when we
deal with probability,
we don't just care
about a Boolean event.
Like, did this happen
or did this not happen?
Sometimes we might want the ability
to represent variable values
in a probability space where
some variable might take
on multiple different possible values.
And in probability, we call a
variable in probability theory
a random variable.
A random variable in probability is
just some variable in probability theory
that has some domain of
values that it can take on.
So what do I mean by this?
Well, what I mean is I might
have a random variable that
is just called Roll, for example,
that has six possible values.
Roll is my variable,
and the possible values,
the domain of values that it can
take on, are 1, 2, 3, 4, 5, and 6.
And I might like to know
the probability of each.
In this case, they happen
to all be the same.
But in other random variables,
that might not be the case.
For example, I might
have a random variable
to represent the weather,
for example, where
the domain of values it could take
on are things like sun or cloudy
or rainy or windy or
snowy, and each of those
might have a different probability,
and I care about knowing,
what is the probability
that the weather equals sun
or that the weather equals
clouds, for instance,
and I might like to do some mathematical
calculations based on that information.
Other random variables might
be something like traffic.
What are the odds that there
is no traffic or light traffic
or heavy traffic?
Traffic, in this case,
is my random variable,
and the values that that random
variable can take on are here.
It's either none or light or heavy.
And I, the person doing these
calculations, I, the person encoding
these random variables
into my computer, need
to make the decision as to what
these possible values actually are.
You might imagine, for
example, for a flight,
if I care about whether or not
I make it to a flight on time,
my flight has a couple of possible
values that it could take on.
My flight could be on time.
My flight could be delayed.
My flight could be canceled.
So flight, in this case,
is my random variable,
and these are the values
that it can take on.
And often I'll want to know
something about the probability
that my random variable takes on
each of those possible values.
And this is what we then call
a probability distribution.
A probability distribution
takes a random variable
and gives me the probability for each
of the possible values in its domain.
So in the case of this flight, for
example, my probability distribution
might look something like this.
My probability distribution
says, the probability
that the random variable Flight is
equal to the value on time is 0.6,
or, otherwise, put into more English,
human-friendly terms, the likelihood
that my flight is on
time is 60%, for example.
And in this case, the probability
that my flight is delayed is 30%.
The probability that my flight
is canceled is 10%, or 0.1.
And if you sum up all of these possible
values, the sum is going to be 1.
If you take all of the
possible worlds, here
are my three possible worlds for the
value of the random variable Flight.
Add them all up together.
The result needs to be the number one
per that axiom of probability theory
that we've discussed before.
So this now is one way of representing
this probability distribution
for the random variable Flight.
Sometimes you'll see it
represented a little bit more
concisely, that this is pretty
verbose for really just trying
to express three possible values.
And so often you'll instead
see this same notation
representing using a vector.
And all a vector is is
a sequence of values.
As opposed to just a single value,
I might have multiple values.
And so I could instead
represent this idea this way--
bold P-- so a larger P--
generally meaning the probability
distribution of this variable flight
is equal to this vector
represented in angle brackets.
The probability distribution
is 0.6, 0.3, and 0.1,
and I would just have to know that
this probability distribution is
in the order of on time,
delayed, and canceled
to know how to interpret this vector
to mean the first value in the vector
is the probability that my flight is
on time, the second value in the vector
is the probability that
my flight is delayed,
and the third value in the vector is the
probability that my flight is canceled.
And so this is just an
alternate way of representing
this idea a little more concisely.
But oftentimes you'll see us
just talk about a probability
distribution over a random variable.
And whenever we talk about
that, what we're really doing
is trying to figure out
the probabilities of each
of the possible values that that
random variable can take on,
but this notation is just
a little bit more succinct,
even though it can sometimes
be a little confusing depending
on the context in which you see it.
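Both notations for the Flight distribution can be sketched side by side. The value order in the vector form (on time, delayed, canceled) is purely a convention we have to remember:

```python
# Verbose form: each value of the random variable mapped to its probability.
flight = {"on time": 0.6, "delayed": 0.3, "canceled": 0.1}

# Concise vector form: P(Flight) = <0.6, 0.3, 0.1>, in the agreed order.
flight_vector = [flight["on time"], flight["delayed"], flight["canceled"]]

# The probabilities over all possible values must sum to 1
# (allowing for floating-point rounding).
assert abs(sum(flight_vector) - 1.0) < 1e-9
print(flight_vector)  # [0.6, 0.3, 0.1]
```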
So we'll start to look at examples
where we use this sort of notation
to describe probability and to
describe events that might take place.
A couple of other important
ideas to know with
regards to probability theory--
one is this idea of independence,
and independence refers to the idea
that the knowledge of one
event doesn't influence
the probability of another event.
So for example, in the
context of my two dice
rolls where I had the red die
and the blue die, the probability
that I roll the red
die and the blue die,
those two events, red die and
blue die, are independent.
Knowing the result of the red die
doesn't change the probabilities
for the blue die.
It doesn't give me any
additional information
about what the value of the blue
die is ultimately going to be.
But that's not always
going to be the case.
You might imagine that in the case
of weather, something like clouds
and rain, those are probably not
independent, that if it is cloudy,
that might increase the probability that
later in the day it's going to rain.
So some information informs some other
event or some other random variable.
So independence refers to the idea that
one event doesn't influence the other.
And if they're not independent, then
there might be some relationship.
So mathematically, formally, what
does independence actually mean?
Well, recall this formula from before,
that the probability of a and b
is the probability of a times
the probability of b given a.
And the more intuitive
way to think about this
is that to know how likely
it is that a and b happen,
well, let's first figure out
the likelihood that a happens,
and then given that we
know that a happens,
let's figure out the
likelihood that b happens
and multiply those two things together.
But if a and b were independent, meaning
knowing a doesn't change anything
about the likelihood
that b is true, well,
then the probability of b given a,
meaning the probability that b is true
given that I know a is true,
well, that I know a is true
shouldn't really make a difference
if these two things are independent,
that a shouldn't influence b at all.
So the probability of b given a is
really just the probability of b,
if it is true that a
and b are independent.
And so this right here is one
example of a definition for what
it means for a and b to be independent.
The probability of a and b is
just the probability of a times
the probability of b.
Any time you find two events a and
b where this relationship holds,
then you can say that a
and b are independent.
So an example of that might be the dice
that we were taking a look at before.
Here, if I wanted the probability of
red being a six and blue being a six,
well, that's just the probability
that red is a six multiplied
by the probability that blue is a six.
Both equal to one over 36.
So I can say that these
two events are independent.
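That independence check can be done mechanically over the 36 worlds: the joint probability of red being six and blue being six equals the product of the individual probabilities. A sketch:

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))  # (red, blue) pairs

def p(event):
    """Unconditional probability of an event over the 36 worlds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

red_6 = lambda w: w[0] == 6
blue_6 = lambda w: w[1] == 6

# Independence: P(a and b) == P(a) * P(b).
joint = p(lambda w: red_6(w) and blue_6(w))
assert joint == p(red_6) * p(blue_6) == Fraction(1, 36)
print(joint)  # 1/36
```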
So this, for example,
has a probability of one
over 36, as we talked about before.
But what wouldn't be independent
would be a case like this--
the probability that the red die rolls
a six and the red die rolls a four.
If you just naively took, OK,
red die six, red die four,
well, if I'm only
rolling the die once, you
might imagine the naive approach
is to say, well, each of these
has a probability of one over six.
So multiply them together, and
the probability is one over 36.
But, of course, if you're
only rolling the red die once,
there's no way you could get two
different values for the red die.
It couldn't both be a six and a four.
So the probability should be zero.
But if you were to multiply probability
of red six times probability
of red four, well, that
would equal one over 36.
But, of course, that's
not true because we
know that there is no way, probability
zero, that when we roll the red die
once we get both a six and a four
because only one of those possibilities
can actually be the result.
And so we can say that the event
that red roll is six and the event
that red roll is four, those
two events are not independent.
If I know that the red roll is a six, I
know that the red roll cannot possibly
be a four.
So these things are not independent.
And instead, if I wanted to
calculate the probability,
I would need to use this
conditional probability,
as is the regular definition of the
probability of two events taking place.
And the probability of this, now,
well, the probability of the red
roll being a six, that's one over six.
But what's the probability that the roll
is a four given that the roll is a six?
Well, this is just zero, because
there's no way for the red roll
to be a four given that we already
know the red roll is a six.
And so the value, if we do
all that multiplication,
is we get the number zero.
So this idea of
conditional probability is
going to come up again
and again, especially
as we begin to reason about multiple
different random variables that
might be interacting with
each other in some way.
And this gets us to one of the most
important rules in probability theory,
which is known as Bayes' rule.
And it turns out that just using
the information we've already
learned about probability and just
applying a little bit of algebra,
we can actually derive
Bayes' rule for ourselves.
But it's a very important rule
when it comes to inference
and thinking about probability
in the context of what
it is that a computer can do, or
what a mathematician could do,
by having access to
information about probability.
So let's go back to
these equations to be
able to derive Bayes' rule ourselves.
We know the probability of a and b,
the likelihood that a and b take place,
is the likelihood of b and
then the likelihood of a given
that we know that b is already true.
And likewise, the probability
of a and b
is the probability of a times
the probability of b given
that we know that a is already true.
This is sort of a symmetric
relationship where
it doesn't matter the order: a and
b and b and a mean the same thing.
And so in these equations,
we can just swap out a and b
to be able to represent
the exact same idea.
So we know that these two
equations are already true.
We've seen that already.
And now let's just do a little bit of
algebraic manipulation of this stuff.
Both of these expressions
on the right-hand side
are equal to the probability of a and b.
So what I can do is take these two
expressions on the right-hand side
and just set them equal to each other.
If they're both equal to
the probability of a and b,
then they both must be
equal to each other.
So probability of a
times probability of b
given a is equal to the probability of
b times the probability of a given b.
And now all we're going to do
is do a little bit of division.
I'm going to divide both sides by P of
a, and now I get what is Bayes' rule.
The probability of b given a is
equal to the probability of b
times the probability of a given
b divided by the probability of a.
And sometimes in Bayes'
rule you'll see the order
of these two arguments switched.
So instead of b times a given
b, it'll be a given b times b.
That ultimately doesn't matter
because in multiplication you
can switch the order of the
two things you're multiplying
and it doesn't change the result.
But this here right now is the most
common formulation of Bayes' rule.
The probability of b given a is
equal to the probability of a given
b times the probability of b
divided by the probability of a.
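Bayes' rule is just that one-line formula, and we can sanity-check it on the dice. A sketch, using the values derived earlier: P(sum is 12 | red is 6) = 1/6, P(red is 6) = 1/6, and P(sum is 12) = 1/36.

```python
from fractions import Fraction

def bayes(p_a_given_b, p_b, p_a):
    """Bayes' rule: P(b | a) = P(a | b) * P(b) / P(a)."""
    return p_a_given_b * p_b / p_a

# P(red is 6 | sum is 12), computed from the reverse conditional.
result = bayes(Fraction(1, 6), Fraction(1, 6), Fraction(1, 36))
print(result)  # 1 -- if the sum is 12, the red die must be a six
```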
And this rule, it turns
out, is really important
when it comes to trying to
infer things about the world
because it means you can express
one conditional probability,
the conditional probability
of b given a, using knowledge
about the probability of a
given b, using the reverse
of that conditional probability.
So let's first do a little
bit of an example with this,
just to see how we might use
it, and then explore what
this means a little bit more generally.
So we're going to construct a situation
where I have some information.
There are two events that I care about--
the idea that it's cloudy
in the morning and the idea
that it is rainy in the afternoon.
Those are two different
possible events that
could take place-- cloudy in the
morning, or the AM, rainy in the PM.
And what I care about is, given
clouds in the morning, what
is the probability of rain in the
afternoon, a reasonable question
I might ask.
In the morning, I look
outside, or an AI's camera
looks outside, and sees that
there are clouds in the morning,
and we want to conclude,
we want to figure out,
what is the probability that in the
afternoon there is going to be rain?
Of course, in the abstract,
we don't have access
to this kind of information,
but we can use data
to begin to try and figure this out.
So let's imagine, now, that I have
access to some pieces of information.
I have access to the idea
that 80% of rainy afternoons
start out with a cloudy morning.
And you might imagine that I
could have gathered this data just
by looking at data over
a sequence of time,
that I know that 80% of the time
when it's raining in the afternoon it
was cloudy that morning.
I also know that 40% of
days have cloudy mornings,
and I also know that 10% of
days have rainy afternoons.
And now, using this
information, I would like
to figure out, given
clouds in the morning, what
is the probability that
it rains in the afternoon?
I want to know the probability of
afternoon rain given morning clouds,
and I can do that, in particular,
using this fact, the probability of--
so if I know that 80% of rainy
afternoons start with cloudy mornings,
then I know the probability of cloudy
mornings given rainy afternoons.
So using sort of the reverse conditional
probability, I can figure that out.
Expressed in terms of Bayes' rule,
this is what that would look like--
probability of rain given
clouds is the probability
of clouds given rain times
the probability of rain
divided by the probability of clouds.
Here I'm just substituting
in for the values of a and b
from that equation and
Bayes' rule from before.
And then I can just do the math.
I have this information.
I know that 80% of the time,
if it was raining, then
there were clouds in the
morning-- so 0.8 here.
Probability of rain is 0.1 because 10%
of days were raining and 40% of days
were cloudy.
I do the math and I can
figure out the answer is 0.2.
So the probability that it rains in
the afternoon given that it was cloudy
in the morning is 0.2 in this case.
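As a quick sanity check, this calculation can be sketched in Python; the function name `bayes_rule` is mine, not from the lecture:

```python
# Bayes' rule: P(rain | clouds) = P(clouds | rain) * P(rain) / P(clouds).
# Numbers from the example: 80% of rainy afternoons start with cloudy
# mornings, 10% of days have rainy afternoons, 40% have cloudy mornings.

def bayes_rule(p_b_given_a, p_a, p_b):
    """Return P(a | b) given P(b | a), P(a), and P(b)."""
    return p_b_given_a * p_a / p_b

p_clouds_given_rain = 0.8
p_rain = 0.1
p_clouds = 0.4

p_rain_given_clouds = bayes_rule(p_clouds_given_rain, p_rain, p_clouds)
print(round(p_rain_given_clouds, 4))  # 0.2
```

The same function computes the reverse conditional probability whenever the three quantities on the right-hand side are known.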
And this, now, is an
application of Bayes' rule,
the idea that using one
conditional probability,
we can get the reverse
conditional probability.
And this is often useful when one
of the conditional probabilities
might be easier for us to know about
or easier for us to have data about,
and using that information,
we can calculate
the other conditional probability.
So what does this look like?
Well, it means that knowing the
probability of cloudy mornings given
rainy afternoons, we can calculate the
probability of rainy afternoons given
cloudy mornings, or, for
example, more generally,
if we know the probability of
some visible effect, some effect
that we can see and observe given
some unknown cause that we're not
sure about, well, then we can calculate
the probability of that unknown cause
given the visible effect.
So what might that look like?
Well, in the context of
medicine, for example,
I might know the probability of some
medical test result given a disease.
Like, I know that if someone has a
disease, then x percent of the time
the medical test result will
show up as this, for instance.
And using that information,
then I can calculate,
what is the probability that,
given I know the medical test
result, what is the likelihood
that someone has the disease?
This is the piece of information that
is usually easier to know, easier
to immediately have access to data for.
And this is the information that
I actually want to calculate.
Or I might want to know, for example--
if I know that some percentage
of counterfeit bills
have blurry text around the edges,
because counterfeit printers
aren't nearly as good at
printing text precisely.
So I have some information
about given that something
is a counterfeit bill, x percent of
counterfeit bills have blurry text,
for example.
And using that information, then I can
calculate some piece of information
that I might want to
know, like, given that I
know there's blurry text on a bill, what
is the probability that that bill is
counterfeit?
So given one conditional
probability, I can
calculate the other conditional
probability as well.
So now we've taken a look at a couple
of different types of probability.
We've looked at
unconditional probability
where I just look at what is the
probability of this event occurring
given no additional evidence
that I might have access to,
and we've also looked at
conditional probability
where I have some sort of evidence, and
I would like to, using that evidence,
be able to calculate some
other probability as well.
The other kind of probability that
will be important for us to think about
is joint probability,
and this is when we're
considering the likelihood of multiple
different events simultaneously.
And so what do we mean by this?
Well, for example, I might
have probability distributions
that look a little something like this,
like I want to know the probability
distribution of clouds in the morning,
and that distribution looks like this.
40% of the time, C, which
is my random variable here,
takes on the value cloudy, and
60% of the time it's not cloudy.
So here is just a simple
probability distribution
that is effectively telling me
that 40% of the time it's cloudy.
I might also have a probability
distribution for rain in the afternoon
where 10% of the time,
or with probability 0.1,
it is raining in the afternoon
and with probability 0.9
it is not raining in the afternoon.
And using just these two
pieces of information,
I don't actually have a whole lot
of information about how these two
variables relate to each other.
But I could if I had access
to their joint probability,
meaning for every combination
of these two things--
meaning morning cloudy and afternoon
rain, morning cloudy and afternoon
not rain, morning not cloudy
and afternoon rain, and morning
not cloudy and afternoon not raining--
if I had access to values
for each of those four,
I'd have more information--
so information that'd
be organized in a table like this.
And this, rather than just
a probability distribution,
is a joint probability distribution.
It tells me the probability
distribution of each
of the possible combinations of
values that these random variables
can take on.
So if I want to know, what is the
probability that on any given day
it is both cloudy and
rainy, well, I would say,
all right, we're looking
at cases where it is cloudy
and cases where it is raining and
the intersection of those two,
that row and that column, is 0.08.
So that is the probability that
it is both cloudy and rainy
using that information.
And using this conditional
probability table,
using this joint
probability table, I can
begin to draw other
pieces of information
about things like
conditional probability.
So I might ask a question like, what is
the probability distribution of clouds
given that I know that
it is raining, meaning
I know for sure that it's raining.
Tell me the probability distribution
over whether it's cloudy
or not given that I know already
that it is, in fact, raining.
And here I'm using C to stand
for that random variable.
I'm looking for a distribution,
meaning the answer to this
is not going to be a single value.
It's going to be two values, a vector
of two values where the first value is
probability of clouds, the second value
is probability that it is not cloudy,
but the sum of those two
values is going to be one,
because when you add up
the probabilities of all
of the possible worlds, the result
that you get must be the number one.
And, well, what do we know about how
to calculate a conditional probability?
Well, we know that the probability of
a given b is the probability of a and b
divided by the probability of b.
So what does this mean?
Well, it means that I can
calculate the probability of clouds
given that it's raining as
the probability of clouds
and raining divided by
the probability of rain.
And this comma here for the probability
distribution of clouds and rain,
this comma sort of stands
in for the word "and."
You'll sort of see the logical operator
AND and the comma used interchangeably.
This means the probability
distribution over the clouds
and knowing the fact that it is raining
divided by the probability of rain.
And the interesting thing to
note here and what we'll often
do in order to simplify
our mathematics is
that dividing by the probability
of rain, the probability of rain
here is just some numerical constant.
It is some number.
Dividing by probability
of rain is just dividing
by some constant or, in
other words, multiplying
by the inverse of that constant.
And it turns out that
oftentimes we can just
not worry about what the
exact value of this is
and just know that it is,
in fact, a constant value,
and we'll see why in a moment.
So instead of expressing this as
this joint probability divided
by the probability of rain, sometimes
we'll just represent it as alpha times
the numerator here, the probability
distribution of C, this variable,
and that we know that it
is raining, for instance.
So all we've done here
is said this value of one
over the probability of rain, that's
really just a constant that we're
going to divide by or equivalently
multiply by the inverse of at the end.
We'll just call it alpha for now
and deal with it a little bit later.
But the key idea here now-- and this is
an idea that's going to come up again--
is that the conditional
distribution of C given rain
is proportional to, meaning
just some factor multiplied by,
the joint probability of
C and rain being true.
And so how do we figure this out?
Well, this is going to be the
probability that it is cloudy
and raining, which is
0.08, and the probability that it's not
cloudy and raining, which is 0.02.
And so we get alpha times--
here now is that
probability distribution.
0.08 is clouds and rain.
0.02 is not cloudy and rain.
But, of course, 0.08 and 0.02
don't sum up to the number one.
And we know that in a
probability distribution,
if you consider all of
the possible values,
they must sum up to
a probability of one.
And so we know that we just need to
figure out some constant to normalize,
so to speak, these values, something
we can multiply or divide by
to get it so that all of these
probabilities sum up to one.
And it turns out that if we
multiply both numbers by 10,
then we can get that
result of 0.8 and 0.2.
The proportions are still
equivalent, but now 0.8
plus 0.2, those sum up to the number 1.
So take a look at this and
see if you can understand,
step by step, how it is we're
getting from one point to another.
But the key idea here is that by
using the joint probabilities,
these probabilities that it is both
cloudy and rainy and that it is not
cloudy and rainy, I can take
that information and figure out
the conditional probability--
given that it's raining,
what is the chance that it's
cloudy versus not cloudy--
just by multiplying by some
normalization constant, so to speak.
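A minimal sketch of this normalization trick, using the joint values 0.08 and 0.02 from the example:

```python
# Conditional distribution from joint probabilities by normalization:
# P(C | rain) is proportional to P(C, rain), so alpha is whatever
# constant makes the joint values sum to 1.

joint = {"cloudy": 0.08, "not cloudy": 0.02}  # P(C = c, rain)

alpha = 1 / sum(joint.values())               # here 1 / 0.1 = 10
conditional = {c: alpha * p for c, p in joint.items()}

print({c: round(p, 4) for c, p in conditional.items()})
```

Dividing each value by the total, as done here, is the same as multiplying by alpha; the proportions between the values never change.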
And this is what a
computer can begin to use
to be able to interact with
these various different types
of probabilities.
And it turns out there are a
number of other probability rules
that are going to be useful to us as
we begin to explore how we can actually
use this information to
encode into our computers
some more complex analysis that we
might want to do about probability
and distributions and random variables
that we might be interacting with.
So here are a couple of those
important probability rules.
One of the simplest rules
is just this negation rule.
What is the probability of not event a?
So a is an event that
has some probability,
and I would like to know, what is the
probability that a does not occur?
And it turns out it's just one
minus P of a, which makes sense
because if those are
the two possible cases,
either a happens or a doesn't happen,
then when you add up those two cases,
you must get one, which means P of
not a must just be one minus P of a
because P of a and P of not a
must sum up to the number one.
They must include all
of the possible cases.
We've seen an expression for
calculating the probability of a and b.
We might also reasonably want to
calculate the probability of a or b.
What is the probability that one thing
happens or another thing happens?
So for example, I might
want to calculate,
what is the probability
that if I roll two dice,
a red die and a blue die, what is
the likelihood that a is a six or b
is a six, one or the other?
And what you might imagine you could
do and the wrong way to approach it
would be just to say, all right,
well, a comes up as a six,
the red die comes up as a six
with probability one over six.
The same for the blue die.
It's also one over six.
Add them together and you get
2/6, otherwise known as 1/3.
But this suffers from the
problem of overcounting,
that we've double-counted the
case where both a and b, both
the red die and the blue die,
come up as a six,
and I've counted that instance twice.
So to resolve this,
the actual expression
for calculating the
probability of a or b
uses what we call the
inclusion-exclusion formula.
So I take the probability of a,
add it to the probability of b.
That's the same as before.
But then I need to exclude the
cases that I've double counted.
So I subtract from that the
probability of a and b, and that
gets me the result for a or b.
I consider all the cases where a is
true and all the cases where b is true.
And if you imagine this is
like a Venn diagram of cases
where a is true, cases
where b is true, I just
need to subtract out the
middle to get rid of the cases
that I have over counted by double
counting them inside of both
of these individual expressions.
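Here is a short sketch of the inclusion-exclusion calculation for the two-dice example, using exact fractions; it assumes the two dice are independent, so the probability of both coming up six is 1/36:

```python
from fractions import Fraction

# Inclusion-exclusion: P(a or b) = P(a) + P(b) - P(a and b).
# Two dice: what is the probability the red die or the blue die is a six?

p_red_six = Fraction(1, 6)
p_blue_six = Fraction(1, 6)
p_both_six = p_red_six * p_blue_six   # independent dice, so 1/36

p_either = p_red_six + p_blue_six - p_both_six
print(p_either)  # 11/36
```

Note that the naive answer 2/6 = 12/36 is exactly one case (both sixes) too large, which is the case being subtracted off.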
One other rule that's
going to be quite helpful
is a rule called marginalization.
So marginalization is
answering the question
of how do I figure out the probability
of a using some other variable that I
might have access to, like b?
Even if I don't know additional
information about it,
I know that b, some event,
can have two possible states.
Either b happens or b doesn't happen,
assuming it's a Boolean, true or false.
And well, what that means
is that for me to be
able to calculate the probability
of a, there are only two cases.
Either a happens and b happens or
a happens and b doesn't happen.
And those are two disjoint, meaning
they can't both happen together--
either b happens or b doesn't happen.
They're disjoint or separate cases.
And so I can figure out the probability
of a just by adding up those two cases.
The probability that a is
true is the probability
that a and b is true plus the
probability that a is true
and b isn't true.
So by marginalizing, I've
looked at the two possible cases
that might take place.
Either b happens or b doesn't happen.
And in either of those cases, I look at,
what's the probability that a happens,
and if I add those together,
well, then I get the probability
that a happens as a whole.
So take a look at that rule.
It doesn't matter what b is
or how it's related to a.
So long as I know these
joint distributions,
I can figure out the
overall probability of a.
And this can be a useful way,
if I have a joint distribution,
like the joint distribution
of a and b, to just figure out
some unconditional probability,
like the probability of a,
and we'll see examples
of this soon, as well.
Now, sometimes these might
not just be variables
that are events that either happened
or didn't happen, like b is here.
They might be some broader
probability distribution where
there are multiple possible values.
And so here, in order to use
this marginalization rule,
I need to sum up not just over b and not
b, but for all of the possible values
that the other random
variable could take on.
And so here we'll see a version
of this rule for random variables,
and it's going to include
that summation notation
to indicate that I'm summing up, adding
up, a whole bunch of individual values.
So here's the rule.
Looks a lot more complicated,
but it's actually
the equivalent, exactly the same rule.
What I'm saying here is that if I
have two random variables, one called x
and one called y, well, the probability
that x is equal to some value x sub i--
this is just some value that
this variable takes on--
how do I figure it out?
Well, I'm going to
sum up over j, where j
is going to range over all of the
possible values that y can take on.
Well, let's look at the probability
that x equals xi and y equals yj.
So the exact same rule--
the only difference here
is now I'm summing up over
all of the possible values
that y can take on, saying let's
add up all of those possible cases
and look at this joint
distribution, this joint probability
that x takes on the value I care about
given all of the possible values for y.
And if I add all those up, then I can
get this unconditional probability
of what x is equal to, whether or
not x is equal to some value x sub i.
So let's take a look
at this rule because it
does look a little bit complicated.
Let's try and put a
concrete example to it.
Here, again, is that same
joint distribution from before.
I have cloudy, not
cloudy, rainy, not rainy.
And maybe I want to
access some variable.
I want to know, what is the
probability that it is cloudy?
Well, marginalization says that
if I have this joint distribution
and I want to know, what is the
probability that it is cloudy, well,
I need to consider the other variable,
the variable that's not here,
the idea that it's rainy.
And I consider the two cases, either
it's raining or it's not raining,
and I just sum up the values
for each of those possibilities.
In other words, the
probability that it is cloudy
is equal to the sum of the
probability that it's cloudy
and it's raining and the probability
that it's cloudy and it is not raining.
And so these, now, are
values that I have access to.
These are values that are just inside
of this joint probability table.
What is the probability that
it is both cloudy and rainy?
Well, it's just the intersection
of these two here, which is 0.08,
and the probability that it's cloudy
and not raining is-- all right,
here's cloudy, here's not raining--
it's 0.32.
So it's 0.08 plus 0.32, which
just gives us equal to 0.4.
That is the unconditional probability
that it is, in fact, cloudy.
And so marginalization
gives us a way to go
from these joint distributions to
just some individual probability
that I might care about.
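This marginalization can be sketched over the joint table from the example; the 0.58 entry isn't stated in the transcript, but it's the value implied by the rest of the table summing to 1:

```python
# Marginalization: P(cloudy) = P(cloudy, rain) + P(cloudy, not rain).
# The joint table, keyed by (cloud value, rain value).

joint = {
    ("cloudy", "rain"): 0.08,
    ("cloudy", "no rain"): 0.32,
    ("not cloudy", "rain"): 0.02,
    ("not cloudy", "no rain"): 0.58,  # inferred so the table sums to 1
}

# Sum over every possible value of the other variable (rain).
p_cloudy = sum(p for (c, r), p in joint.items() if c == "cloudy")
print(round(p_cloudy, 4))  # 0.4
```

The same one-liner works for any variable in a joint table: fix the value you care about and sum over everything else.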
And you'll see a little bit later
why it is that we care about that
and why that's actually
useful to us as we
begin doing some of these calculations.
The last rule we'll take a look at before
transitioning into something a little
bit different is this
rule of conditioning--
very similar to the marginalization
rule, but it says that, again,
if I have two events a and b--
but instead of having access
to their joint probabilities,
I have access to their
conditional probabilities,
how they relate to each other.
Well, again, if I want to know the
probability that a happens and I know
that there's some other variable b,
either b happens or b doesn't happen,
and so I can say that
the probability of a
is the probability of a given
b times the probability of b,
meaning b happened, and
given that I know b happened,
what's the likelihood that a happened?
And then I consider the other
case, that b didn't happen.
So here is the probability
that b didn't happen,
and here's the probability
that a happens given
that I know that b didn't happen.
And this is really the equivalent rule,
just using conditional probability
instead of joint probability
where I'm saying,
let's look at both of these
two cases and condition on b.
Look at the case where b happens and
look at the case where b doesn't happen
and look at what probabilities
I get as a result.
And just as in the
case of marginalization
where there was an equivalent
rule for random variables
that could take on multiple possible
values in a domain of possible values,
here, too, conditioning has
the same equivalent rule.
Again, there's a summation to mean I'm
summing over all of the possible values
that some random
variable y could take on.
But if I want to know, what is the
probability that x takes on this value,
then I'm going to sum
up over all the values j
that y could take on and
say, all right, what's
the chance that y takes on
that value, yj, and multiply it
by the conditional probability
that x takes on this value given
that y took on that value yj--
so equivalent rule just using
conditional probabilities
instead of joint probabilities.
And using the equation we know
about joint probabilities,
we can translate between these two.
All right, we've seen a
whole lot of mathematics,
and we've just sort of laid
the foundation for mathematics.
And no need to worry if you haven't
seen probability in too much
detail up until this point.
These are sort of the
foundations of the ideas
that are going to come up as we
begin to explore how we can now
take these ideas from probability
and begin to apply them to represent
something inside of our computer,
something inside of the AI agent
we're trying to design that is able to
represent information and probabilities
and the likelihoods between
various different events.
So there are a number of
different probabilistic models
that we can generate, but the first of
the models we're going to talk about
are what are known as Bayesian networks.
And a Bayesian network
is just going to be
some network of random variables,
connected random variables,
that are going to
represent the dependence
between these random variables.
And odds are most random
variables in this world
are not independent of
each other, that there's
some relationship between things that
are happening that we care about.
If it is raining today, that
might increase the likelihood
that my flight or my train
gets delayed, for example.
There is some dependence
between these random variables,
and a Bayesian network is going to be
able to capture those dependencies.
So what is a Bayesian network?
What is its actual structure,
and how does it work?
Well, a Bayesian network is
going to be a directed graph.
And again, we've seen
directed graphs before.
They are individual nodes
with arrows or edges
that connect one node to another node,
pointing in a particular direction.
And so this directed graph
is going to have nodes,
as well, where each node
in this directed graph
is going to represent a random variable,
something like the weather or something
like whether my train
was on time or delayed.
And we're going to have an
arrow from a node x to a node y
to mean that x is a parent of y.
So that'll be our notation.
If there's an arrow from x to y, x is
going to be considered a parent of y.
And the reason that's important
is because each of these nodes
is going to have a probability
distribution that we're
going to store along with it, which
is the distribution of x given
some evidence, given the parents of x.
So the way to more
intuitively think about this
is the parents are going to be thought
of as sort of causes for some effect
that we're going to observe.
And so let's take a look at an
actual example of a Bayesian network
and think about the types of
logic that might be involved
in reasoning about that network.
Let's imagine, for a moment, that
I have an appointment out of town
and I need to take a train in
order to get to that appointment.
So what are the things
I might care about?
Well, I care about getting
to my appointment on time.
Either I make it to
my appointment and I'm
able to attend it or I
miss the appointment.
And you might imagine that
that's influenced by the train,
that the train is either on time
or it's delayed, for example.
But that train itself
is also influenced.
Whether the train is on time or
not depends maybe on the rain.
Is there no rain?
Is it light rain?
Is there heavy rain?
And it might also be influenced
by other variables, too.
It might be influenced,
as well, by whether
or not there's maintenance on
the train track, for example.
If there is maintenance
on the train track,
that probably increases the
likelihood that my train is delayed.
And so we can represent all of these
ideas using a Bayesian network that
looks a little something like this.
Here I have four nodes
representing four random variables
that I would like to keep track of.
I have one random
variable called Rain that
can take on three possible values in its
domain, either none or light or heavy
for no rain, light rain, or heavy rain.
I have a variable called
Maintenance for whether or not
there is maintenance on the train track,
which has two possible values, just
either yes or no.
Either there is maintenance or there is
no maintenance happening on the track.
Then I have a random variable for
the train indicating whether or not
the train was on time or not.
That random variable has two
possible values in its domain.
The train is either on time
or the train is delayed.
And then, finally, I
have a random variable
for whether I make it to my appointment.
For my appointment down here, I have
a random variable called Appointment
that itself has two possible
values, attend and miss.
And so here are the possible values.
Here are my four nodes,
each of which represents
a random variable, each of which
has a domain of possible values
that it can take on.
And the arrows, the edges
pointing from one node to another,
encode some notion of
dependence inside of this graph,
that whether I make it
to my appointment or not
is dependent upon whether the
train is on time or delayed.
And whether the train is on time or
delayed is dependent on two things,
given by the two arrows
pointing at this node.
It is dependent on whether or not there
was maintenance on the train track,
and it is also dependent
upon whether or not
it was raining, or
whether it is raining.
And just to make things a
little complicated, let's
say, as well, that whether or not
there's maintenance on the track,
this too might be
influenced by the rain.
Then if there's heavier
rain, well, maybe it's
less likely that there's going to
be maintenance on the train track
that day because they're
more likely to want
to do maintenance on the track on days
when it's not raining, for example.
And so these nodes might have
different relationships between them.
But the idea is that we can
come up with a probability
distribution for any of these
nodes based only upon its parents.
And so let's look node by node at what
this probability distribution might
actually look like.
And we'll go ahead and begin with this
root node, this Rain node here, which
is at the top and has no
arrows pointing into it,
which means its probability
distribution is not
going to be a conditional distribution.
It's not based on anything.
I just have some probability
distribution over the possible values
for the Rain random variable.
And that distribution might look
a little something like this.
None, light, and heavy--
each have a possible value.
Here I'm saying the
likelihood of no rain
is 0.7, of light rain is 0.2, of
heavy rain is 0.1, for example.
So here is a probability distribution
for this root node in this Bayesian
network.
And let's now consider the next
node in the network, Maintenance.
Track maintenance is yes or no.
And the general idea of
what this distribution
is going to encode, at
least in this story,
is the idea that the heavier
the rain is, the less likely
it is that there's going to
be maintenance on the track
because the people that are
doing maintenance on the track
probably want to wait until a day
when it's not as rainy in order to do
the track maintenance, for example.
And so what might that probability
distribution look like?
Well, this now is going to
be a conditional probability
distribution, that here are the
three possible values for the Rain
random variable, which I'm here just
going to abbreviate to R, either
no rain, light rain, or heavy rain.
And for each of those possible values,
either there is yes track maintenance
or no track maintenance, and those have
probabilities associated with them,
that I see here that
if it is not raining,
then there is a probability 0.4
that there's track maintenance
and a probability of
0.6 that there isn't.
But if there's heavy
rain, then here the chance
that there is track maintenance
is 0.1 and the chance
that there is not track
maintenance is 0.9.
Each of these rows is going to sum
up to one because each of these
represent different
values of whether or not
it's raining, the three possible values
that that random variable can take on,
and each is associated with its
own probability distribution.
That is ultimately all going
to add up to the number one.
So that there is our distribution for
this random variable called Maintenance
about whether or not there is
maintenance on the train track.
And now let's consider
the next variable.
Here we have a node inside
of our Bayesian network
called Train that has two possible
values, on time and delayed.
And this node is going to be
dependent upon the two nodes that
are pointing towards it, that
whether the train is on time
or delayed depends on whether
or not there is track maintenance,
and it depends on whether
or not there is rain,
that heavier rain probably means
more likely that my train is delayed.
And if there is track
maintenance, that also
probably means it's more likely
that my train is delayed as well.
And so you could construct a
larger probability distribution,
a conditional probability
distribution, that
instead of conditioning on just
one variable, as was the case here,
is now conditioning on two
variables, conditioning
both on rain, represented by R, and
on maintenance, represented by M.
Again, each of these rows has two
values that sum up to the number one,
one for whether the train is on time,
one for whether the train is delayed.
And here I can say
something like, all right,
if I know there was light rain
and track maintenance-- well, OK,
that would be R is light and M is yes--
well, then there is a probability
of 0.6 that my train is on time
and a probability of 0.4
the train is delayed.
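One way to store a conditional probability table like this in Python is a dictionary keyed by the parents' values; only the (light, yes) row comes from the lecture, and the other rows here are placeholders just to show the shape of the table:

```python
# CPT for Train, conditioned on both parents: Rain and Maintenance.
# Keys are (rain, maintenance) pairs; each row is a distribution
# over Train's values and must sum to 1.

train_cpt = {
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},  # from the lecture
    ("light", "no"):  {"on time": 0.7, "delayed": 0.3},  # hypothetical
    ("none", "yes"):  {"on time": 0.8, "delayed": 0.2},  # hypothetical
}

# Look up P(Train = on time | Rain = light, Maintenance = yes).
p = train_cpt[("light", "yes")]["on time"]
print(p)  # 0.6
```

A full table would have one row for every combination of the parents' values, six in this case (three rain values times two maintenance values).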
And you can imagine gathering this data
just by looking at real-world data,
looking at data about, all right,
if I knew that it was light rain
and there was track
maintenance, how often
was a train delayed or
not delayed, and you
could begin to construct this thing.
But the interesting
thing is, intelligently,
being able to try to
figure out, how might
you go about ordering these things?
What things might influence other
nodes inside of this Bayesian network?
And the last thing I care
about is whether or not
I make it to my appointment.
So did I attend or miss the appointment?
And ultimately, whether I
attend or miss the appointment,
it is influenced by track maintenance
because it's indirectly this idea
that, all right, if there
is track maintenance, well,
then my train might
more likely be delayed,
and if my train is more
likely to be delayed,
then I'm more likely
to miss my appointment.
But what we encode in
this Bayesian network
are just what we might consider
to be more direct relationships.
So the train has a direct
influence on the appointment.
And given that I know whether
the train is on time or delayed,
knowing whether there's
track maintenance
isn't going to give me any additional
information that I didn't already have,
that if I know Train, these
other nodes that are up above
aren't really going to
influence the result.
And so here we might represent it
using another conditional probability
distribution that looks a
little something like this, that
train can take on two possible values.
Either my train is on time
or my train is delayed.
And for each of those
two possible values,
I have a distribution
for what are the odds
that I'm able to attend the
meeting, and what are the odds
that I missed the meeting?
And obviously, if my
train is on time, I'm
much more likely to be
able to attend the meeting
than if my train is
delayed, in which case
I'm more likely to miss that meeting.
So all of these nodes
put altogether here
represent this Bayesian network,
this network of random variables
whose values I ultimately
care about and that
have some sort of
relationship between them,
some sort of dependence where these
arrows from one node to another
indicate some dependence, that I can
calculate the probability of some node
given the parents that
happen to exist there.
So now that we've been able to describe
the structure of this Bayesian network
and the relationships
between each of these nodes,
by associating each of the nodes
in the network with a probability
distribution, whether that's
an unconditional probability
distribution in the case
of this root node here,
like Rain, or a conditional
probability distribution,
in the case of all of the other
nodes whose probabilities are
dependent upon the
values of their parents,
we can begin to do some computation
and calculation using the information
inside of that table.
So let's imagine, for
example, that I just
wanted to compute something simple,
like the probability of light rain.
How would I get the
probability of light rain?
Well, light rain-- rain
here is a root node.
And so if I wanted to
calculate that probability,
I could just look at the
probability distribution for rain
and extract from it the
probability of light rain.
It's just a single value that
I already have access to.
But we could also imagine
wanting to compute
more complex joint probabilities,
like the probability
that there is light rain and
also no track maintenance.
This is a joint probability of two
values, light rain and no track
maintenance.
And the way I might do that is first
by starting by saying, all right, well,
let me get the probability
of light rain, but now
I also want the probability
of no track maintenance.
But, of course, this node is
dependent upon the value of rain.
So what I really want is the probability
of no track maintenance given
that I know that there was light rain.
And so the expression
for calculating this idea
that the probability of light
rain and no track maintenance
is really just the probability
of light rain and the probability
that there is no track maintenance
given that I know that there already
is light rain.
So I take the unconditional
probability of light rain,
multiply it by the conditional
probability of no track maintenance
given that I know there is light rain.
And you can continue to do this
again and again for every variable
that you want to add into
this joint probability
that I might want to calculate.
If I wanted to know the
probability of light rain
and no track maintenance
and a delayed train,
well, that's going to be the
probability of light rain multiplied
by the probability of
no track maintenance
given light rain multiplied by the
probability of a delayed train given
light rain and no track maintenance,
because whether the train is on time
or delayed is dependent upon both
of these other two variables,
and so I have two
pieces of evidence that
go into the calculation of
that conditional probability.
And each of these three
values is just a value
that I can look up by looking at
one of these individual probability
distributions that is encoded
into my Bayesian network.
And if I wanted a joint probability
over all four of the variables,
something like the
probability of light rain
and no track maintenance and a delayed
train and I missed my appointment,
well, that's going to be multiplying
four different values, one
from each of these individual nodes.
It's going to be the
probability of light rain,
then of no track maintenance
given light rain,
then of a delayed train given light
rain and no track maintenance.
And then, finally, for this node here
for whether I make it to my appointment
or not, it's not dependent upon
these two variables given that I know
whether or not the train is on time.
I only need to care about
the conditional probability
that I miss my appointment given
that the train happens to be delayed.
And so that's represented here by
four probabilities, each of which
is located inside of one of
these probability distributions
for each of the nodes,
all multiplied together.
And so I can take a variable
like that and figure out
what the joint probability
is by multiplying
a whole bunch of these
individual probabilities
from the Bayesian network.
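As a minimal sketch in plain Python, rather than the pomegranate syntax used later in the lecture, the four-variable joint described above is just a product of one lookup per node. The numbers below are the ones the lecture's network uses, and only the table rows needed for this particular joint are shown.

```python
# Joint probability as a product of per-node lookups, one factor per node:
#   P(rain, maintenance, train, appointment)
#     = P(rain) * P(maintenance | rain)
#       * P(train | rain, maintenance) * P(appointment | train)
# Values come from the network described in this lecture; only the rows
# needed for this one joint probability are included.
p_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}
p_maintenance = {"none": {"yes": 0.4, "no": 0.6}}
p_train = {("none", "no"): {"on time": 0.9, "delayed": 0.1}}
p_appointment = {"on time": {"attend": 0.9, "miss": 0.1}}

# P(no rain, no maintenance, train on time, attend the appointment)
joint = (p_rain["none"]
         * p_maintenance["none"]["no"]
         * p_train[("none", "no")]["on time"]
         * p_appointment["on time"]["attend"])
print(round(joint, 4))  # 0.3402, i.e. about 0.34
```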
But, of course, just as with last
time where what I really wanted to do
was to be able to get new
pieces of information,
here, too, this is what we're going to
want to do with our Bayesian network.
In the context of knowledge, we
talked about the problem of inference.
Given things that I
know to be true, can I
draw conclusions, make deductions
about other facts about the world
that I also know to be true?
And what we're going to do now is apply
the same sort of idea to probability.
Using information about
which I have some knowledge,
whether some evidence or
some probabilities, can
I figure out not other
variables for certain,
but can I figure out the probabilities
of other variables taking
on particular values?
And so here we introduce
the problem of inference
in a probabilistic setting in a case
where variables might not necessarily
be true for sure, but they
might be random variables
that take on different
values with some probability.
So how do we formally define what
exactly this inference problem actually
is?
Well, the inference problem
has a couple of parts to it.
We have some query,
some variable x that we
want to compute the distribution for.
Maybe I want the probability
that I missed my train
or I want the probability that there
is track maintenance, something
that I want information about.
And then I have some evidence variables.
Maybe it's just one piece of evidence.
Maybe it's multiple pieces of evidence.
But I've observed certain
variables for some sort of event.
So for example, I might have
observed that it is raining.
This is evidence that I have.
I know that there is light rain or
I know that there is heavy rain,
and that is evidence I have.
And using that evidence, I want
to know, what is the probability
that my train is delayed, for example?
And that is a query that I might
want to ask based on this evidence.
So I have a query, some
variable, evidence,
which are some other variables
that I have observed inside
of my Bayesian network,
and of course that
does leave some hidden variables, y.
These are variables that are
not evidence variables and not
query variables.
So you might imagine in the case where
I know whether or not it's raining
and I want to know whether my train
is going to be delayed or not,
the hidden variable, the thing I don't
have access to, is something like,
is there maintenance
on the track, or am I
going to make or not make
my appointment, for example?
These are variables that
I don't have access to.
They're hidden because
they're not things I observed,
and they're also not the query,
the thing that I'm asking.
And so ultimately what
we want to calculate
is I want to know the probability
distribution of x given
e, the event that I observed.
So given that I observed some event,
I observed that it is raining,
I would like to know, what is the
distribution over the possible values
of the Train random variable?
Is it on time?
Is it delayed?
What is the likelihood
it's going to be there?
And it turns out we can do
this calculation just using
a lot of the probability rules
that we've already seen in action.
And ultimately, we're going
to take a look at the math
at a little bit of a high
level, at an abstract level,
but ultimately we can allow
computers and programming
libraries that already exist to
begin to do some of this math for us.
But it's good to get a general sense
for what's actually happening when
this inference process takes place.
Let's imagine, for example, that
I want to compute the probability
distribution of the
Appointment random variable
given some evidence, given that I know
that there was light rain and no track
maintenance.
So there's my evidence, these two
variables that I observed the value of.
I observe the value of rain.
I know there's light rain.
And I know that there is no
track maintenance going on today.
And what I care about knowing, my query,
is this random variable Appointment.
I want to know the distribution of
this random variable Appointment.
What is the chance that
I am able to attend
my appointment, what is the
chance that I miss my appointment
given this evidence?
And the hidden variable, the
information that I don't have access to,
is this variable Train.
This is information that is not
part of the evidence that I see,
not something that I observe.
But it is also not the
query that I am asking for.
And so what might this
inference procedure look like?
Well, if you recall back from when we
were defining conditional probability
and doing math with
conditional probabilities,
we know that a
conditional probability is
proportional to the joint probability.
And we remember this by recalling
that the probability of a given b
is just some constant
factor alpha multiplied
by the probability of a and b.
That constant factor
alpha turns up because you're
dividing by the probability
of b, but the important thing
is that it's just some
constant multiplied
by the joint distribution,
the probability
that all of these
individual things happen.
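In code, that constant factor alpha is simply one over the sum of the unnormalized joint values. A tiny sketch, with hypothetical numbers chosen only for illustration:

```python
# P(a | b) is proportional to P(a, b): compute the joint values for each
# value of a, then normalize so they sum to 1. Numbers are hypothetical.
unnormalized = {"attend": 0.30, "miss": 0.10}   # P(a, b) for each value of a
alpha = 1 / sum(unnormalized.values())          # alpha = 1 / P(b)
conditional = {a: alpha * p for a, p in unnormalized.items()}
print({a: round(p, 2) for a, p in conditional.items()})
# {'attend': 0.75, 'miss': 0.25}
```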
So in this case, I can
take the probability
of the Appointment random variable given
light rain and no track maintenance
and say that is just going to be
proportional, some constant alpha,
multiplied by the joint probability,
the probability of a particular value
for the appointment random variable,
and light rain and no track maintenance.
Well, all right, how do I
calculate this, probability
of appointment and light rain
and no track maintenance,
when what I really
care about is knowing--
I need all four of these values to be
able to calculate a joint distribution
across everything, because,
then, a particular appointment
depends upon the value of train.
Well, in order to do
that, here I can begin
to use that marginalization
trick, that there are only
two ways I can get any configuration
of an appointment, light rain,
and no track maintenance.
Either this particular
setting of variables
happens and the train is on time or this
particular setting of variables happens
and the train is delayed.
Those are two possible cases
that I would want to consider.
And if I add those two
cases up, well, then I
get the result just by adding up all
of the possibilities for the hidden
variable, or variables
if there are multiple.
But since there's only
one hidden variable here,
Train, all I need to do is iterate over
all the possible values for that hidden
variable Train and add
up their probabilities.
So this probability
expression here becomes
the joint probability of
Appointment, light rain, no track
maintenance, and the train is on time,
plus the joint probability of
Appointment, light rain,
no track maintenance, and the train
is delayed, for example.
So I take both of the possible values
for train, go ahead and add them up.
These are just
joint probabilities that we saw
earlier how to calculate just
by going parent, parent, parent, parent
and calculating those probabilities
and multiplying them together.
And then you'll need to normalize them
at the end, speaking at a high level
to make sure that everything
adds up to the number one.
So the formula for how you
do this and a process known
as inference by enumeration
looks a little bit complicated,
but ultimately it looks like this.
And let's now try to distill what
it is that all of these symbols
actually mean.
Let's start here.
What I care about knowing is the
probability of x, my query variable,
given some sort of evidence.
What do I know about
conditional probabilities?
Well, a conditional probability is
proportional to the joint probability.
So we had some alpha,
some normalizing constant,
multiplied by this joint
probability of x and evidence.
And how do I calculate that?
Well, to do that, I'm
going to marginalize over
all of the hidden variables.
All the variables that I don't
directly observe the values for,
I'm basically going to iterate
over all of the possibilities
that it could happen and
just sum them all up.
And so I can translate this
into a sum over all y, which
ranges over all the possible
hidden variables and the values
that they could take on, and adds
up all of those possible individual
probabilities.
And that is going to allow me to do this
process of inference by enumeration.
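A compact sketch of inference by enumeration for the query discussed above, P(Appointment | light rain, no track maintenance), written in plain Python. The 0.7/0.3 row for Train given (light, no) is an assumption made for illustration, since this passage doesn't quote that row; the Appointment rows follow the lecture.

```python
# Inference by enumeration: P(X | e) = alpha * sum over hidden y of P(X, e, y).
# Query: Appointment. Evidence: rain = light, maintenance = no. Hidden: Train.
# The (light, no) row for Train is an assumed value for illustration.
p_train = {("light", "no"): {"on time": 0.7, "delayed": 0.3}}  # assumption
p_appointment = {"on time": {"attend": 0.9, "miss": 0.1},
                 "delayed": {"attend": 0.6, "miss": 0.4}}

# Sum out the hidden variable Train for each value of the query variable.
unnormalized = {}
for a in ("attend", "miss"):
    unnormalized[a] = sum(p_train[("light", "no")][t] * p_appointment[t][a]
                          for t in ("on time", "delayed"))

# Normalize with alpha = 1 / sum. (The shared P(light) * P(no | light)
# factor would multiply every term equally, so it cancels and is omitted.)
alpha = 1 / sum(unnormalized.values())
distribution = {a: alpha * p for a, p in unnormalized.items()}
print(distribution)  # attend ≈ 0.81, miss ≈ 0.19
```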
And ultimately, it's pretty
annoying if we as humans have
to do all of this math for ourselves.
But it turns out this is where computers
and AI can be particularly helpful,
that we can program a computer to
understand a Bayesian network to be
able to understand these
inference procedures
and to be able to do these calculations.
And using the information
you've seen here,
you could implement a Bayesian
network from scratch yourself.
But turns out there are a lot
of libraries, especially written
in Python, that allow
us to make it easier
to do this sort of
probabilistic inference
to be able to take a Bayesian network
and do these sorts of calculations
so that you don't need to know and
understand all of the underlying math,
though it's helpful to have a
general sense for how it works.
But you just need to be able to
describe the structure of the network
and make queries in order to
be able to produce the result.
And so let's take a look at
an example of that right now.
It turns out that there are
a lot of possible libraries
that exist in Python for
doing this sort of inference.
It doesn't matter too much
which specific library you use.
They all behave in fairly similar ways.
But the library I'm going to use
here is one known as pomegranate.
And here inside of model.py, I
have defined a Bayesian network
just using the structure and the syntax
that the pomegranate library expects.
And what I'm effectively
doing is just, in Python,
creating nodes to represent each
of the nodes of the Bayesian network
that you saw me describe a moment ago.
So here on line four, after
I've imported pomegranate,
I'm defining a variable called
rain that is going to represent
a node inside of my Bayesian network.
It's going to be a node that
follows this distribution where
there are three possible values--
none for no rain, light for
light rain, heavy for heavy rain.
And these are the probabilities
of each of those taking place.
0.7 is the likelihood of no rain, 0.2
for light rain, 0.1 for heavy rain.
Then, after that, we go
to the next variable,
the variable for track
maintenance, for example, which
is dependent upon that rain variable.
And this, instead of being an
unconditional distribution,
is a conditional distribution, as
indicated by a conditional probability
table here.
And the idea is that this is
conditional on the distribution of rain.
So if there is no rain, then
the chance that there is yes
track maintenance is 0.4.
If there's no rain, the chance that
there is no track maintenance is 0.6.
Likewise, for light rain,
I have a distribution.
For heavy rain, I have
a distribution, as well.
But I'm effectively encoding
the same information
you saw represented
graphically a moment ago,
but I'm telling this Python
program that the maintenance
node obeys this particular
conditional probability distribution.
And we do the same thing for the
other random variables, as well.
Train was a node inside my distribution
that was a conditional probability
table with two parents.
It was dependent not only on rain,
but also on track maintenance.
And so here I'm saying something like,
given that there is no rain and yes
track maintenance, the probability
that my train is on time is 0.8,
and the probability that
it's delayed is 0.2.
And likewise, I can do
the same thing for all
of the other possible values of
the parents of the train node
inside of my Bayesian network by saying,
for all of those possible values,
here is the distribution that
the train node should follow.
And I do the same thing
for an appointment
based on the distribution
of the variable Train.
Then, at the end, what I do is
actually construct this network
by describing what the
states of the network are
and by adding edges between
the dependent nodes.
So I create a new Bayesian
network, add states to it--
one for rain, one for maintenance, one
for train, one for the appointment--
and then I add edges
connecting the related pieces.
Rain has an arrow to maintenance because
rain influences track maintenance,
rain also influences the train,
maintenance also influences the train,
and train influences whether
I make it to my appointment,
and bake just finalizes the model
and does some additional computation.
So the specific syntax of this
is not really the important part.
Pomegranate just happens to be
one of several different libraries
that can all be used
for similar purposes,
and you could describe and
define a library for yourself
that implemented similar things.
But the key idea here
is that someone can
design a library for a
general Bayesian network that
has nodes that are
based upon its parents,
and then all a programmer needs to
do, using one of those libraries,
is to define what those nodes and what
those probability distributions are,
and we can begin to do some
interesting logic based on it.
So let's try doing that
conditional or joint probability
calculation that we saw us do by hand
before by going into likelihood.py
where here I'm importing the model
that I just defined a moment ago.
And here I'd just like to
calculate model.probability,
which calculates the probability
for a given observation,
and I'd like to calculate
the probability of no rain,
no track maintenance,
my train is on time,
and I'm able to attend the meeting--
so sort of the optimal
scenario, that there's
no rain and no maintenance on
the track, my train is on time,
and I'm able to attend the meeting.
What is the probability that
all of that actually happens?
And I can calculate
that using the library
and just print out its probability.
And so I'll go ahead and
run python likelihood.py,
and I see that, OK, the
probability is about 0.34.
So about a third of the time, everything
goes right for me, in this case--
no rain, no track
maintenance, train is on time,
and I'm able to attend the meeting.
But I could experiment with this,
try and calculate other probabilities
as well.
What's the probability that everything
goes right up until the train
but I still miss my meeting-- so no
rain, no track maintenance, train
is on time, but I miss the appointment.
Let's calculate that probability, and
that has a probability of about 0.04.
So about 4% of the time
the train will be on time,
there won't be any rain,
no track maintenance,
and yet I'll still miss the meeting.
And so this is really
just an implementation
of the calculation of the joint
probabilities that we did before.
What this library is
likely doing is first
figuring out the probability
of no rain, then figuring
that the probability
of no track maintenance
given no rain, then the
probability that my train is
on time given both of these
values, and then the probability
that I miss my appointment given that
I know that the train was on time.
So this, again, is the calculation
of that joint probability.
And turns out we can also begin to have
our computer solve inference problems,
as well, to begin to infer,
based on information, evidence
that we see, what is the likelihood
of other variables also being true?
So let's go into inference.py,
for example, where here I'm,
again, importing that exact
same model from before,
importing all the
nodes and all the edges
and the probability distribution
that is encoded there, as well.
And now there's a function for
doing some sort of prediction.
And here, into this model, I pass
in the evidence that I observe.
So here I've encoded into this
Python program the evidence
that I have observed.
I have observed the fact
that the train is delayed,
and that is the value for one
of the four random variables
inside of this Bayesian network.
And using that information,
I would like to be
able to draw inferences
about the values of the
other random variables
that are inside of my Bayesian network.
I would like to make predictions
about everything else.
So all of the actual computational logic
is happening in just these three lines
where I'm making this
call to this prediction.
Down below, I'm just iterating over all
of the states and all the predictions
and just printing them out so that we
can visually see what the results are.
But let's find out, given
the train is delayed,
what can I predict about the values
of the other random variables?
Let's go ahead and run
Python inference.py.
I run that.
And all right, here is
the result that I get.
Given the fact that I know
that the train is delayed--
this is evidence that I have observed--
well, it turns out there is about
a 46% chance that there was no rain,
a 31% chance there
was light rain, and a 23%
chance there was heavy rain.
I can see a probability
distribution over track maintenance
and a probability
distribution over whether I'm
able to attend or miss my appointment.
Now, we know that
whether I attend or miss
the appointment, that is only
dependent upon the train being delayed
or not delayed.
It shouldn't depend on anything else.
So let's imagine, for example, that
I knew that there was heavy rain.
That shouldn't affect the distribution
for making the appointment.
And indeed, if I go up
here and add some evidence,
say that I know that the
value of rain is heavy--
that is evidence that
I now have access to.
I now have two pieces of evidence.
I know that the rain is heavy, and
I know that my train is delayed.
I can calculate the probability by
running this inference procedure again
and seeing the result.
I know that the rain is heavy.
I know my train is delayed.
The probability distribution
for track maintenance changed.
Given that I know that
there is heavy rain,
now it's more likely that there
is no track maintenance, 88% as
opposed to 64% from here before.
And now what is the probability
that I make the appointment?
Well, that's the same as before.
It's still going to be attend the
appointment with probability 0.6,
miss the appointment
with probability 0.4,
because it was only dependent upon
whether or not my train was on time
or delayed.
And so this here is implementing
that idea of that inference algorithm
to be able to figure out,
based on the evidence
that I have, what can we infer about
the values of the other variables that
exist as well?
So inference by enumeration is one
way of doing this inference procedure,
just looping over all of the values
the hidden variables could take on
and figuring out what
the probability is.
Now, it turns out this is
not particularly efficient,
and there are definitely optimizations
you can make by avoiding repeated work
if you're calculating the same
sort of probability multiple times.
There are ways of optimizing the
program to avoid having to recalculate
the same probabilities again and again.
But even then, as the
number of variables
get large, as the number of possible
values those variables could take on
get large, we're going
to start to have to do
a lot of computation,
a lot of calculation,
to be able to do this inference.
And at that point, it might
start to take an unreasonable
amount of time to be able
to do this sort of exact inference.
And it's for that reason
that oftentimes when
it comes towards probability and
things we're not entirely sure about,
we don't always care about
doing exact inference
and knowing exactly
what the probability is.
But if we can approximate
the inference procedure,
do some sort of approximate inference,
that that can be pretty good as well,
that if I don't know
the exact probability
but I have a general
sense for the probability,
that I can get increasingly accurate
with more time, that that's probably
pretty good, especially if I can
get that to happen even faster.
So how could I do approximate
inference inside of a Bayesian network?
Well, one method is through a
procedure known as sampling.
In the process of sampling, I'm going
to take a sample of all of the variables
inside of this Bayesian network here.
And how am I going to sample?
Well, I'm going to sample one of
the values from each of these nodes
according to their
probability distribution.
So how might I take a
sample of all these nodes?
Well, I'll start at the root.
I'll start with rain.
Here's the distribution
for rain, and I'll go ahead
and, using a random number
generator or something like it,
randomly pick one of these three values.
I'll pick none with probability
0.7, light with probability 0.2,
and heavy with probability 0.1.
So I'll randomly just pick one of
them according to that distribution,
and maybe, in this case,
I pick none, for example.
Then I do the same thing
for the other variable.
Maintenance also has a
probability distribution.
And I am going to sample--
now, there are three
probability distributions here,
but I'm only going to
sample from this first row
here because I've observed already in my
sample that the value of rain is none.
So given that rain is none, I'm going to
sample from this distribution to say,
all right, what should the
value of maintenance be?
And in this case, maintenance
is going to be, let's just
say, yes, which happens 40% of the time
in the event that there is no rain,
for example.
And we'll sample all of the
rest of the nodes in this way,
as well, that I want to sample
from the train distribution,
and I'll sample from this first
row here where there is no rain,
but there is track maintenance.
And I'll sample 80% of the time,
I'll say the train is on time.
20% of the time, I'll
say the train is delayed.
And finally, we'll do the same thing
for whether I make it to my appointment
or not.
Did I attend or miss the appointment?
We'll sample based on this distribution
and maybe say that in this case
I attend the appointment, which
happens 90% of the time when
the train is actually on time.
So by going through these nodes, I
can very quickly just do some sampling
and get a sample of the
possible values that
could come up from going through
this entire Bayesian network
according to those
probability distributions.
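The walk just described can be sketched as a direct sampler in plain Python: start at the root, draw a value for each node from its distribution conditioned on the values already drawn for its parents. The distributions mirror the lecture's network; rows this transcript doesn't quote are marked as assumptions.

```python
import random

# Forward (direct) sampling: sample nodes in topological order, each
# conditioned on the values already drawn for its parents.
# Rows not quoted in the transcript are assumptions for illustration.
P_RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
P_MAINTENANCE = {"none": {"yes": 0.4, "no": 0.6},
                 "light": {"yes": 0.2, "no": 0.8},   # assumed row
                 "heavy": {"yes": 0.1, "no": 0.9}}   # assumed row
P_TRAIN = {("none", "yes"): {"on time": 0.8, "delayed": 0.2},
           ("none", "no"): {"on time": 0.9, "delayed": 0.1},
           ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
           ("light", "no"): {"on time": 0.7, "delayed": 0.3},   # assumed row
           ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},  # assumed row
           ("heavy", "no"): {"on time": 0.5, "delayed": 0.5}}   # assumed row
P_APPOINTMENT = {"on time": {"attend": 0.9, "miss": 0.1},
                 "delayed": {"attend": 0.6, "miss": 0.4}}

def draw(dist):
    """Pick one value at random according to a {value: probability} dict."""
    return random.choices(list(dist), weights=dist.values())[0]

def sample():
    rain = draw(P_RAIN)
    maintenance = draw(P_MAINTENANCE[rain])
    train = draw(P_TRAIN[(rain, maintenance)])
    appointment = draw(P_APPOINTMENT[train])
    return {"rain": rain, "maintenance": maintenance,
            "train": train, "appointment": appointment}

# Many samples give an estimate of, say, P(train on time).
samples = [sample() for _ in range(10000)]
on_time = sum(s["train"] == "on time" for s in samples)
print(on_time / len(samples))  # an estimate of P(train on time)
```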
And where this becomes powerful
is if I do this not once,
but I do this thousands or
tens of thousands of times
and generate a whole bunch of
samples, all using this distribution.
I get different samples.
Maybe some of them are the same.
But I get a value for each of the
possible variables that could come up.
And so then, if I'm ever faced
with a question, a question like,
what is the probability
that the train is on time,
you could do an exact
inference procedure.
This is no different than the
inference problem we had before
where I could just marginalize, look
at all the possible other values
of the variables and do the
computation of inference
by enumeration to find out
this probability exactly.
But I could also, if I don't care about
the exact probability, just sample it.
Approximate it to get close.
And this is a powerful
tool in AI where we
don't need to be right 100% of the time
or we don't need to be exactly right.
If we just need to be right
with some probability,
we can often do so more
effectively, more efficiently.
And so here, now, are all
of those possible samples.
I'll sort of highlight the ones
where the train is on time.
I'm ignoring the ones
where the train is delayed.
And in this case,
six out of eight
of the samples have the
train arriving on time.
And so maybe, in this case, I can
say that, in six out of eight cases,
that's the likelihood
that the train is on time.
And with eight samples, that
might not be a great prediction.
But if I had thousands
upon thousands of samples,
then this could be a much
better inference procedure
to be able to do these
sorts of calculations.
So this is a direct sampling method
to just do a bunch of samples
and then figure out what the
probability of some event is.
Now, this from before was an
unconditional probability.
What is the probability
that the train is on time?
And I did that by looking at all
the samples and figuring out,
right here, the ones where
the train is on time.
But sometimes what
I'll want to calculate
is not an unconditional probability,
but rather a conditional probability,
something like, what is
the probability that there
is light rain given that the train
is on time, something to that effect.
And to do that kind of calculation,
well, what I might do is here
are all the samples
that I have, and I want
to calculate a probability
distribution given
that I know that the train is on time.
So to be able to do that, I can
kind of look at the two cases
where the train was delayed
and ignore or reject them,
sort of exclude them from the
possible samples that I'm considering.
And now I want to look
at these remaining cases
where the train is on time.
Here are the cases where
there is light rain.
And now I say, OK, these are two
out of the six possible cases.
That can give me an
approximation for the probability
of light rain given the fact that
I know the train was on time.
And I did that in almost
exactly the same way
just by adding an additional
step, by saying that,
all right, when I take
each sample, let me
reject all of the samples
that don't match my evidence
and only consider the
samples that do match
what it is that I have in
my evidence that I want
to make some sort of calculation about.
And it turns out, using the libraries
that we've had for Bayesian networks,
we can begin to implement
this same sort of idea,
implement rejection sampling, which
is what this method is called,
to be able to figure out some
probability, not via direct inference,
but instead by sampling.
So what I have here is a
program called sample.py--
imports the exact same model.
And what I define first is a
program to generate a sample.
And the way I generate a sample is
just by looping over all of the states.
The states need to be
in some sort of order
to make sure I'm looping
in the correct order.
But effectively, if it is
a conditional distribution,
I'm going to sample
based on the parents.
And otherwise, I'm just going
to directly sample the variable,
like rain, which has no parents-- it's
just an unconditional distribution--
and keep track of all those parent
samples and return the final sample.
The exact syntax of this, again,
not particularly important.
It just happens to be
part of the implementation
details of this particular library.
The interesting logic is done below.
Now that I have the ability
to generate a sample,
if I want to know the distribution of
the appointment random variable given
that the train is delayed, well, then I
can begin to do calculations like this.
Let me take 10,000 samples
and assemble all my results
in this list called data.
I'll go ahead and loop n times--
in this case, 10,000 times.
I'll generate a sample, and I want to
know the distribution of appointment
given that the train is delayed.
So according to rejection
sampling, I'm only
going to consider samples
where the train is delayed.
If the train's not delayed, I'm not
going to consider those values at all.
So I'm going to say, all
right, if I take the sample,
look at the value of the
train random variable,
if the train is delayed, well,
let me go ahead and add to my data
that I'm collecting the value of
the appointment random variable
that it took on in
this particular sample.
So I'm only considering the
samples where the train is delayed
and, for each of those
samples, considering
what the value of appointment is.
And then at the end, I'm using
a Python class called Counter,
which quickly counts up
all the values inside
of a data set so I can
take this list of data
and figure out how many times
was my appointment made,
and how many times was
my appointment missed?
And so this here, with just
a couple of lines of code,
is an implementation
of rejection sampling.
And I can run it by going ahead
and running python sample.py.
And when I do that, here
is the result I get.
This is the result of the counter.
1,251 times the appointment
was attended, and 856 times
the appointment was missed.
And you can imagine, by
doing more and more samples,
I'll be able to get a better and
better, more accurate result.
And this is a randomized process.
It's going to be an
approximation of the probability.
If I run it a different time, you'll
notice the numbers are similar--
1,272 and 905-- but
they're not identical
because there's some randomization,
some likelihood that things
might be higher or lower, and so this
is why we generally want to try and use
more samples so that we can have
a greater amount of confidence
in our result, be more
sure about the result
that we're getting of whether or not
it accurately reflects or represents
the actual underlying
probabilities that are
inherent inside of this distribution.
And so this, then, was an
instance of rejection sampling.
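As a rough sketch of the same idea, here's what rejection sampling might look like in plain Python. The network and its probabilities here are made up for illustration; this is not the lecture's actual sample.py or its library.

```python
# A minimal sketch of rejection sampling on a toy three-variable network.
# All probabilities below are hypothetical, chosen just for illustration.
import random
from collections import Counter

random.seed(0)

P_RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
P_TRAIN = {  # P(Train | Rain)
    "none": {"on time": 0.9, "delayed": 0.1},
    "light": {"on time": 0.6, "delayed": 0.4},
    "heavy": {"on time": 0.3, "delayed": 0.7},
}
P_APPOINTMENT = {  # P(Appointment | Train)
    "on time": {"attend": 0.9, "miss": 0.1},
    "delayed": {"attend": 0.6, "miss": 0.4},
}

def draw(dist):
    """Sample one value from a {value: probability} distribution."""
    values = list(dist)
    return random.choices(values, weights=[dist[v] for v in values])[0]

def generate_sample():
    """Sample every variable in topological order, parents before children."""
    rain = draw(P_RAIN)
    train = draw(P_TRAIN[rain])
    appointment = draw(P_APPOINTMENT[train])
    return {"rain": rain, "train": train, "appointment": appointment}

# Rejection sampling: keep only samples matching the evidence Train = delayed.
data = []
for _ in range(10000):
    sample = generate_sample()
    if sample["train"] == "delayed":
        data.append(sample["appointment"])

print(Counter(data))  # approximates P(Appointment | Train = delayed)
```

The key step is the `if` check: any sample whose train value doesn't match the evidence is simply thrown away.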
And it turns out, there are a
number of other sampling methods
that you could use as well.
One problem that
rejection sampling has is
that if the evidence you're looking
for is a fairly unlikely event, well,
you're going to be
rejecting a lot of samples.
Like, if I'm looking for the
probability of x given some evidence e,
if e is very unlikely to occur-- like,
it occurs maybe once every 1,000 times--
then I'm only going to be considering
one out of every 1,000 samples
that I do, which is a pretty
inefficient method for trying
to do this sort of calculation.
I'm throwing away a lot of samples,
and it takes computational effort
to be able to generate
those samples, so I'd
like to not have to do
something like that.
So there are other sampling methods
that can try and address this.
One such sampling method is
called likelihood weighting.
In likelihood weighting, we follow
a slightly different procedure,
and the goal is to avoid
needing to throw out
samples that didn't match the evidence.
And so what we'll do is we'll start
by fixing the values for the evidence
variables.
Rather than sample
everything, we're going
to fix the values of the evidence
variables and not sample those.
Then we're going to sample all the other
non-evidence variables in the same way,
just using the Bayesian network, looking
at the probability distributions,
sampling all the non-evidence variables.
But then what we need to do is
weight each sample by its likelihood.
If our evidence is
really unlikely, we want
to make sure that we've taken into
account, how likely was the evidence
to actually show up in the sample?
If I have a sample where
the evidence was much more
likely to show up than
another sample, then I
want to weight the
more likely one higher.
So we're going to weight
each sample by its likelihood
where likelihood is just defined as
the probability of all of the evidence.
Given all the evidence we
have, what is the probability
that it would happen in
that particular sample?
So before, all of our samples
were weighted equally.
They all had a weight of one when we
were calculating the overall average.
In this case, we're going
to weight each sample,
multiply each sample by
its likelihood in order
to get the more accurate distribution.
So what would this look like?
Well, if I asked the same question, what
is the probability of light rain given
that the train is on time, when
I do the sampling procedure
and start by trying to
sample, I'm going to start
by fixing the evidence variable.
I'm already going to have in
my sample the train is on time.
That way, I don't have
to throw out anything.
I'm only sampling things where I
know the value of the variables that
are my evidence are what
I expect them to be.
So I'll go ahead and sample
from rain, and maybe this time I
sample light rain instead of no rain.
Then I'll sample from track
maintenance and say maybe, yes, there's
track maintenance.
Then for train, well, I've
already fixed it in place.
Train was an evidence variable, so I'm
not going to bother sampling again.
I'll just go ahead and move on.
I'll move on to appointment and go ahead
and sample from appointment as well.
So now I've generated a sample.
I've generated a sample by
fixing this evidence variable
and sampling the other three.
And the last step is now
weighting the sample.
How much weight should it have?
And the weight is based on how probable
is it that the train was actually
on time, this evidence
actually happened,
given the values of these other
variables, light rain and the fact
that, yes, there was track maintenance?
Well, to do that, I can just
go back to the train variable
and say, all right, if there
was light rain and track
maintenance, the
likelihood of my evidence,
the likelihood that my
train was on time, is 0.6.
And so this particular sample
would have a weight of 0.6.
And I could repeat the sampling
procedure again and again.
Each time, every sample
would be given a weight
according to the probability of the
evidence that I see associated with it.
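A minimal sketch of likelihood weighting, with hypothetical numbers for the rain and train distributions (not the lecture's actual model): fix the evidence that the train is on time, sample only the non-evidence variable, and accumulate each sample's weight.

```python
# A sketch of likelihood weighting for P(Rain | Train = on time).
# The probabilities are illustrative, not the lecture's model.
import random
from collections import defaultdict

random.seed(0)

P_RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
P_TRAIN = {  # P(Train | Rain)
    "none": {"on time": 0.9, "delayed": 0.1},
    "light": {"on time": 0.6, "delayed": 0.4},
    "heavy": {"on time": 0.3, "delayed": 0.7},
}

def draw(dist):
    values = list(dist)
    return random.choices(values, weights=[dist[v] for v in values])[0]

# Fix the evidence (Train = on time), sample the rest,
# and weight each sample by the probability of the evidence.
weights = defaultdict(float)
for _ in range(10000):
    rain = draw(P_RAIN)                  # sample the non-evidence variable
    weight = P_TRAIN[rain]["on time"]    # likelihood of the fixed evidence
    weights[rain] += weight

total = sum(weights.values())
estimate = {r: w / total for r, w in weights.items()}
print(estimate)  # approximates P(Rain | Train = on time)
```

No sample is rejected here; an unlikely piece of evidence just contributes a small weight rather than causing a sample to be thrown out.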
And there are other sampling methods
that exist, as well, but all of them
are designed to try and
get at the same idea,
to approximate the inference
procedure of figuring out
the value of a variable.
So we've now dealt
with probability as it
pertains to particular variables
that have these discrete values.
But what we haven't really considered
is how values might change over time.
We've considered something
like a variable for rain, where rain
can take on values of none or light
rain or heavy rain, but,
in practice, when we deal
with variables like rain,
we usually want to consider how
their values change over time.
What do we deal with when we're dealing
with uncertainty over a period of time?
Which can come up in the context
of weather, for example--
if I have sunny days
and I have rainy days.
And I'd like to know not just what is
the probability that it's raining now,
but what is the probability that
it rains tomorrow or the day
after that or the day after that?
And so to do this,
we're going to introduce
a slightly different kind of model.
But here we're going to have a random
variable, not just one for the weather,
but for every possible time step.
And you can define time
step however you like.
A simple way is just to
use days as your time step.
And so we can define a
variable called x sub t, which
is going to be the weather at time t.
So x sub zero might be
the weather on day zero,
x sub one might be the weather
on day one, so on and so forth,
x sub two is the weather on day two.
But as you can imagine,
if we start to do
this over longer and
longer periods of time,
there's an incredible amount of
data that might go into this.
If you're keeping track of data
about the weather for a year,
now suddenly you might be trying to
predict the weather tomorrow given
365 days of previous pieces
of evidence, and that's
a lot of evidence to have to deal
with and manipulate and calculate.
Probably nobody knows what the
exact conditional probability
distribution is for all of
those combinations of variables.
And so when we're trying to do this
inference inside of a computer, when
we're trying to reasonably
do this sort of analysis,
it's helpful to make some
simplifying assumptions,
some assumptions about the
problem that we can just assume
are true to make our
lives a little bit easier.
Even if they're not totally
accurate assumptions,
if they're close to
accurate or approximate,
they're usually pretty good.
And the assumption we're going to
make is called the Markov assumption,
which is the assumption that the current
state depends only on a finite fixed
number of previous states.
So the current day's weather depends
not on all of the previous day's weather
for all of history, but
the current day's weather I
can predict just based
on yesterday's weather
or just based on the last two days'
weather or the last three days'
weather.
But oftentimes, we're going to deal
with just the one previous state
helping to predict the current state.
And by putting a whole bunch
of these random variables
together, using this
Markov assumption, we
can create what's called a Markov
chain where a Markov chain is just
some sequence of random variables where
each variable's distribution
follows that Markov assumption.
And so we'll do an example of this
where the Markov assumption is
I can predict the weather.
Is it sunny or rainy?
And we'll just consider those
two possibilities for now,
even though there are
other types of weather.
But I can predict each day's weather
just on the prior day's weather.
Using today's weather, I can come
up with a probability distribution
for tomorrow's weather.
And here's what this
weather might look like.
It's formatted in terms of a
matrix, as you might describe it,
as sort of rows and columns of
values where on the left-hand side
I have today's weather, represented
by the variable x sub t.
And then over here in the columns,
I have tomorrow's weather,
represented by the variable x sub t plus
one, t plus one day's weather instead.
And what this matrix is
saying is if today is sunny,
well, then, it's more likely than
not that tomorrow is also sunny.
Oftentimes the weather stays
consistent for multiple days in a row.
And for example, let's say
that if today is sunny,
our model says that tomorrow, with
probability 0.8, it will also be sunny,
and with probability
0.2 it will be raining.
And likewise, if today
is raining, then it's
more likely than not that
tomorrow is also raining.
With probability 0.7, it'll be raining.
With probability 0.3, it will be sunny.
So this matrix, this
description of how it
is we transition from one
state to the next state,
is what we're going to
call the transition model.
And using the transition
model, you can begin
to construct this Markov chain by just
predicting, given today's weather,
what's the likelihood of
tomorrow's weather happening?
And you can imagine
doing a similar sampling
procedure where you
take this information,
you sample what tomorrow's
weather is going to be,
using that you sample
the next day's weather,
and the result of that is you can
form this Markov chain: x zero,
day zero, is sunny,
the next day is sunny,
maybe the next day it changes to
raining, then raining, then raining.
And the pattern that this
Markov chain follows,
given the distribution
that we had access to,
this transition model here,
is that when it's sunny,
it tends to stay sunny
for a little while.
The next couple days
tend to be sunny too.
And when it's raining, it
tends to be raining as well.
And so you get a Markov
chain that looks like this.
And you can do analysis on this.
You can say, given
that today is raining,
what is the probability
that tomorrow it's raining,
or you can begin to ask
probability questions,
like what is the probability of this
sequence of five values-- sun, sun,
rain, rain, rain--
and answer those sorts of questions too.
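Using the transition probabilities from the lecture (sun stays sun with probability 0.8, rain stays rain with probability 0.7, and a 50/50 start), both kinds of questions can be sketched in a few lines of Python. The function names here are just illustrative.

```python
# A sketch of sampling from, and computing probabilities over,
# the sun/rain Markov chain from the lecture.
import random

random.seed(0)

# Transition model: rows are today's weather, columns are tomorrow's.
TRANSITIONS = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def sample_chain(start, days):
    """Sample a chain by repeatedly sampling tomorrow given only today."""
    chain = [start]
    for _ in range(days - 1):
        dist = TRANSITIONS[chain[-1]]
        chain.append(random.choices(list(dist), weights=list(dist.values()))[0])
    return chain

def sequence_probability(seq, start_prob=0.5):
    """P(sequence) = P(first state) times the product of each transition."""
    p = start_prob
    for today, tomorrow in zip(seq, seq[1:]):
        p *= TRANSITIONS[today][tomorrow]
    return p

print(sample_chain("sun", 10))
print(sequence_probability(["sun", "sun", "rain", "rain", "rain"]))
# 0.5 * 0.8 * 0.2 * 0.7 * 0.7 = 0.0392
```

Notice that the Markov assumption is exactly what makes `sequence_probability` a simple product: each factor depends only on the previous state.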
And it turns out there are, again,
many Python libraries for interacting
with models like this of
probabilities that have distributions
and random variables that are
based on previous variables
according to this Markov assumption.
And pomegranate, too, has ways of
dealing with these sorts of variables.
So I'll go ahead and go
into the chain directory
where I have some information
about Markov chains.
And here I've defined a
file called model.py, where I've
defined the model in a very similar syntax.
And again, the exact syntax
doesn't matter so much as the idea
that I'm encoding this
information into a Python program
so that the program has access
to these distributions.
I've here defined some
starting distributions.
So every Markov model begins
at some point in time,
and I need to give it some
starting distribution.
And so we'll just say,
you know what, to start,
you can pick 50/50
between sunny and rainy.
We'll say it's sunny 50% of the
time, rainy 50% of the time.
And then down below, I've here
defined the transition model,
how it is that I transition
from one day to the next.
And here I've encoded that
exact same matrix from before,
that if it was sunny today,
then with probability 0.8
it will be sunny tomorrow, and
it will be raining tomorrow
with probability 0.2.
And I likewise have another distribution
for if it was raining today instead.
And so that alone
defines the Markov model.
You can begin to answer
questions using that model.
But one thing I'll just do is
sample from the Markov chain.
And it turns out there is a method built
into this Markov chain library that
allows me to sample 50
states from the chain,
basically just simulating
50 instances of weather.
And so let me go ahead and
run this: python model.py.
And when I run it, what I get is it is
going to sample from this Markov chain
50 states, 50 days worth of weather
that it's just going to randomly sample.
And you can imagine sampling many
times to be able to get more data
to be able to do more analysis.
But here, for example,
it's sunny two days in
a row, rainy a whole bunch of days in
a row before it changes back to sun.
And so you get this model that follows
the distribution that we originally
described, that follows the
distribution of sunny days
tend to lead to more sunny days, rainy
days tend to lead to more rainy days.
And that, then, is the Markov model.
And Markov models rely on us knowing
the values of these individual states.
I know that today is sunny or that today
is rainy, and using that information,
I can draw some sort of inference about
what tomorrow is going to be like.
But in practice, this
often isn't the case.
It often isn't the case
that I know for certain
what the exact state of the world is.
Oftentimes the exact state of the
world is unknown,
but I'm able to somehow sense
some information about that state.
A robot or an AI doesn't have exact
knowledge about the world around it,
but it has some sort of
sensor, whether that sensor
is a camera or sensors
that detect distance
or just a microphone that is
sensing audio, for example.
It is sensing data, and using
that data, that data is somehow
related to the state of the
world even if it doesn't actually
know, our AI doesn't know, what the
underlying true state of the world
actually is.
And for that, we need to get
into the world of sensor models,
the way of describing how
it is that we translate
what the hidden state, the
underlying true state of the world
is with what the observation, what it is
that the AI knows or the AI has access
to, actually is.
And so for example, a hidden
state might be a robot's position.
If a robot is exploring
new, uncharted territory,
the robot likely doesn't
know exactly where it is.
But it does have an observation.
It has robot sensor
data where it can sense
how far away are possible obstacles
around it, and using that information,
using the observed
information that it has,
it can infer something
about the hidden state,
because what the true hidden state
is influences those observations.
Whatever the robot's
true position is has some effect
upon the sensor data the robot
is able to collect,
even if the robot doesn't
actually know for certain
what its true position is.
Likewise, if you think about a voice
recognition or a speech recognition
program that listens to you and is able
to respond to you, something like Alexa
or what Apple and Google are doing
with their voice recognition as well,
that you might imagine that the
hidden state, the underlying state,
is what words are actually spoken.
The true nature of
the world contains you
saying a particular sequence of words.
But your phone or your smart
home device doesn't know for sure
exactly what words you said.
The only observation that the AI has
access to is some audio wave forms.
And those audio wave forms
are, of course, dependent
upon this hidden state,
and you can infer,
based on those audio wave forms,
what the words spoken likely were,
but you might not know
with 100% certainty what
that hidden state actually is.
And it might be a task
to try and predict:
given this observation,
given these audio waveforms,
can you figure out what the
actual words spoken were?
Likewise, you might
imagine on a website.
True user engagement might be
information you don't directly
have access to, but you can
observe data, like website or app
analytics about how often
was this button clicked
or how often are people interacting
with a page in a particular way.
And you can use that to infer
things about your users as well.
So this type of problem comes up all
the time when we're dealing with AI
and trying to infer
things about the world,
that often AI doesn't really know
the hidden true state of the world.
All that AI has access
to is some observation
that is related to the hidden
true state, but it's not direct.
There might be some noise there.
The audio wave form might
have some additional noise
that might be difficult to parse.
The sensor data might
not be exactly correct.
There's some noise that
might not allow you
to conclude with certainty what
the hidden state is, but can
allow you to infer what it might be.
And so the simple example
we'll take a look at here
imagines the hidden state as the
weather, whether it's sunny or rainy,
and imagines you are programming
an AI inside of a building that maybe
has access to just a camera
inside the building,
and all you have access
to is an observation as to
whether or not employees are bringing
an umbrella into the building.
You can detect whether
someone is carrying an umbrella,
and so you might have an observation
as to whether or not an umbrella is
brought into the building.
And using that information,
you want to predict
whether it's sunny or
rainy, even if you don't
know what the underlying weather is.
So the underlying weather
might be sunny or rainy.
And if it's raining, obviously people
are more likely to bring an umbrella.
And so whether or not
people bring an umbrella,
your observation tells you
something about the hidden state.
And of course, this is a
bit of a contrived example,
but the idea here is to
think about this more generally:
any time you observe something,
that observation has to do with
some underlying hidden state.
And so to try and model this type of
idea where we have these hidden states
and observations, rather than
just use a Markov model, which
has state, state, state,
state, each of which
is connected by that transition
matrix that we described before,
we're going to use what we
call a hidden Markov model--
very similar to a Markov
model, but this is
going to allow us to model a
system that has hidden states
that we don't directly observe
along with some observed event
that we do actually see.
And so in addition to
that transition model
that we still need of saying, given
the underlying state of the world,
if it's sunny or rainy, what's the
probability of tomorrow's weather,
we also need another model,
that given some state is
going to give us an
observation of green,
yes, someone brings an umbrella
into the office, or red,
no, nobody brings
umbrellas into the office.
And so the observation
might be that if it's sunny,
then odds are nobody is going to
bring an umbrella to the office.
But maybe some people
are just being cautious
and they do bring an umbrella
to the office anyways.
And if it's raining, with
much higher probability,
then people are going to bring
umbrellas into the office.
But maybe, if the rain was unexpected,
people didn't bring an umbrella,
and so they might have some
other probability as well.
So using the observations,
you can begin to predict,
with reasonable likelihood,
what the underlying state is
even if you don't actually get
to observe the underlying state,
if you don't get to see what the
hidden state is actually equal to.
This here we'll often
call the sensor model.
It's also often called
the emission probabilities
because the state, the underlying
state, emits some sort of emission
that you then observe.
And so that can be another way
of describing that same idea.
And the sensor Markov assumption
that we're going to use
is this assumption that the evidence
variable, the thing we observe,
the emission that gets produced,
depends only on the corresponding state,
meaning I can predict whether or
not people will bring umbrellas
or not entirely dependent just on
whether it is sunny or rainy today.
Of course, again, this assumption
might not hold in practice,
that in practice it might depend--
whether or not people
bring umbrellas might
depend not just on today's weather,
but also on yesterday's weather
and the day before.
But for simplification
purposes, it can be
helpful to apply this
sort of assumption just
to allow us to be able to reason about
these probabilities a little more
easily.
And if we're able to approximate it, we
can still often get a very good answer.
And so what these hidden Markov
models end up looking like is a little
something like this, where now, rather
than just have one chain of states--
like, sun, sun, rain, rain, rain--
we instead have this upper level, which
is the underlying state of the world,
is it sunny or is it rainy, and those
are connected by that transition
matrix we described before.
But each of these states
produces an emission,
produces an observation that I
see, that on this day it was sunny,
and people didn't bring umbrellas,
and on this day it was sunny,
but people did bring umbrellas,
and on this day it was raining
and people did bring umbrellas,
and so on and so forth.
And so each of these
underlying states, represented
by x sub t for t equals 0,
1, 2, so on and so forth,
produces some sort of
observation or emission,
which is what the E stands for--
E sub 0, E sub 1, E sub
2, so on and so forth.
And so this, too, is a way of
trying to represent this idea.
And what you want to think about
is that these underlying states
are the true nature of the world, the
robot's position as it moves over time,
and that produces some sort of
sensor data that might be observed,
or what people are
actually saying, using
the emission data of the audio
waveforms you detect in order to process
that data and try to figure it out.
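As a sketch of how a hidden Markov model generates data, here is a loop that walks the hidden chain with the transition model and emits an observation from each state. The transition numbers are the lecture's sun/rain matrix; the emission numbers (umbrella with probability 0.2 if sunny, 0.9 if rainy) match the umbrella example. The code itself is illustrative, not any library's implementation.

```python
# A sketch of sampling from a hidden Markov model: walk the hidden chain
# with the transition model, and emit one observation from each state.
import random

random.seed(0)

TRANSITIONS = {"sun": {"sun": 0.8, "rain": 0.2},
               "rain": {"sun": 0.3, "rain": 0.7}}
EMISSIONS = {"sun": {"umbrella": 0.2, "no umbrella": 0.8},
             "rain": {"umbrella": 0.9, "no umbrella": 0.1}}
START = {"sun": 0.5, "rain": 0.5}

def draw(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def sample_hmm(days):
    """Return (hidden states, emissions) for a sampled sequence of days."""
    states, emissions = [], []
    state = draw(START)
    for _ in range(days):
        states.append(state)
        emissions.append(draw(EMISSIONS[state]))  # E_t depends only on X_t
        state = draw(TRANSITIONS[state])          # X_{t+1} depends only on X_t
    return states, emissions

states, emissions = sample_hmm(10)
print(states)     # the hidden weather, which an observer never sees
print(emissions)  # the umbrella observations, which an observer does see
```

The two comments in the loop are the two assumptions at work: the Markov assumption for the transitions and the sensor Markov assumption for the emissions.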
And there are a number of
possible tasks that you might want
to do given this kind of information.
And one of the simplest is
trying to infer something
about the future or the past or
about these sort of hidden states
that might exist.
And so the tasks that you'll often see--
and we're not going to go into
the mathematics of these tasks,
but they're all based on this same
idea of conditional probabilities
and using the probability
distributions we have
to draw these sorts of conclusions.
One task is called filtering, which
is, given observations from the start
until now, calculate the
distribution for the current state,
meaning given information about
from the beginning of time
until now, on which days
did people bring an umbrella
or not bring an
umbrella, can I calculate
the probability of the current
state, that today is it sunny
or is it raining?
Another task that might
be possible is prediction,
which is looking towards the future.
Given observations about
people bringing umbrellas
from the beginning of when we
started counting time until now,
can I figure out the distribution that
tomorrow is it sunny or is it raining?
And you can also go backwards, as
well, by a smoothing where I can say,
given observations from start until
now, calculate the distributions
for some past state.
I know that people brought umbrellas
yesterday and that people brought
umbrellas today, and so given two days' worth
of data of people bringing umbrellas,
what's the probability that
yesterday it was raining?
And the fact that I know that people
brought umbrellas today
might inform that
inference, as well.
It might influence those probabilities.
And there's also a most
likely explanation task,
in addition to other tasks
that might exist as well,
which is, given observations
from the start up until now, figuring
out the most likely sequence of states,
and this is what we're going to
take a look at now, this idea
that if I have all these
observations-- umbrella,
no umbrella, umbrella,
no umbrella-- can I
calculate the most likely states of
sun, rain, sun, rain, and whatnot that
actually represented the true weather
that would produce these observations?
And this is quite common
when you're trying
to do something like voice
recognition, for example, that you have
these emissions of audio wave forms
and you would like to calculate,
based on all of the
observations that you have,
what is the most likely
sequence of actual words
or syllables or sounds that the user
actually made when they were speaking
to this particular device,
or other tasks that
might come up in that context as well.
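The filtering task from this list can be sketched with the forward algorithm. Here is a minimal version, using the umbrella model's numbers and assuming the 50/50 starting distribution describes the state just before the first observation; names like `filter_states` are just illustrative.

```python
# A sketch of filtering (the forward algorithm) on the umbrella model:
# repeatedly predict via the transition model, then update on the observation.
TRANSITIONS = {"sun": {"sun": 0.8, "rain": 0.2},
               "rain": {"sun": 0.3, "rain": 0.7}}
EMISSIONS = {"sun": {"umbrella": 0.2, "no umbrella": 0.8},
             "rain": {"umbrella": 0.9, "no umbrella": 0.1}}
START = {"sun": 0.5, "rain": 0.5}

def filter_states(observations):
    """Return P(state | observations so far) after each observation."""
    belief = dict(START)
    history = []
    for obs in observations:
        # Predict: push the current belief through the transition model.
        predicted = {s: sum(belief[p] * TRANSITIONS[p][s] for p in belief)
                     for s in belief}
        # Update: weight each state by how well it explains the observation,
        # then normalize so the belief sums to one.
        unnormalized = {s: predicted[s] * EMISSIONS[s][obs] for s in predicted}
        total = sum(unnormalized.values())
        belief = {s: u / total for s, u in unnormalized.items()}
        history.append(belief)
    return history

beliefs = filter_states(["umbrella", "umbrella", "no umbrella"])
print(beliefs[-1])  # distribution over sun/rain for the current day
```

Prediction is the same recursion with one extra transition step and no observation update, and smoothing combines this forward pass with a similar backward pass.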
And so we can try this out by
going ahead and going into the HMM
directory, HMM for Hidden Markov Model.
And here what I've done is I've
defined a model where this model first
defines my possible state, sun and
rain, along with their emission
probabilities, the observation
model or the emission model,
where here, given that I know that
it's sunny, the probability that I
see people bring an umbrella is 0.2.
The probability of no umbrella is 0.8.
And likewise, if it's
raining, then people
are more likely to bring an umbrella.
Umbrella has a probability of 0.9.
No umbrella has a probability of 0.1.
So the actual underlying hidden
states, those states are sun and rain.
But the things that I observe,
the observations that I can see,
are either umbrella or no umbrella as
the things that I observe as a result.
So this, then, I also need to add to
it a transition matrix, same as before,
saying that if today is sunny, then
tomorrow is more likely to be sunny,
and if today is rainy, then tomorrow
is more likely to be raining.
As with before, I give it
some starting probabilities,
saying, at first, 50/50 chance
for whether it's sunny or rainy,
and then I can create the model
based on that information.
Again, the exact syntax of
this is not so important
so much as it is the data that I am
now encoding into a program, such
that now I can begin
to do some inference.
So I can give my program, for
example, a list of observations--
umbrella, umbrella, no umbrella,
umbrella, umbrella, so on and so forth,
no umbrella, no umbrella.
And I would like to
calculate, I would like
to figure out, the most likely
explanation for these observations.
What is likely?
Was it raining that whole time,
or is it more likely that one of
those days was actually sunny
and then it switched
back to being rainy?
And that's an interesting question.
We might not be sure
because it might just
be that it just so happened on this
rainy day people decided not to bring
an umbrella or it could be that it
switched from rainy to sunny back
to rainy, which doesn't seem too
likely, but it certainly could happen.
And using the data we give
to the Hidden Markov Model,
our model can begin to predict these
answers, can begin to figure it out.
So we're going to go ahead and
just predict these observations.
And then for each of those
predictions, go ahead
and print out what the prediction is.
And this library just so happens
to have a function called
predict that does this
prediction process for me.
So I run python sequence.py,
and the result I get is this.
This is the prediction based
on the observations of what
all of those states are likely to be,
and it's likely to be rain, then rain.
In this case, it thinks
that what most likely
happened is that it was sunny for a
day and then went back to being rainy.
But in different situations, if
it was rainy for longer, maybe,
or if the probabilities
were slightly different,
you might imagine that it's more likely
that it was rainy all the way through,
and it just so happened on one rainy day
people decided not to bring umbrellas.
And so here, too, Python
libraries can begin
to allow for the sort
of inference procedure.
And by taking what we know and
putting it in terms of these general
tasks that already exist for
Hidden Markov Models, any time
we can take an idea and
formulate it as a Hidden Markov Model,
as something that has hidden
states and observed emissions
that result from those states,
we can take advantage
of the algorithms that are known
to exist for trying
to do this sort of inference.
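The standard algorithm for the most-likely-explanation task is the Viterbi algorithm, and a minimal sketch of it, using the lecture's umbrella model, looks like this. This is the idea behind the library's predict call, not its actual implementation.

```python
# A sketch of the Viterbi algorithm: for each day and each state, keep only
# the most probable hidden-state path that ends in that state.
TRANSITIONS = {"sun": {"sun": 0.8, "rain": 0.2},
               "rain": {"sun": 0.3, "rain": 0.7}}
EMISSIONS = {"sun": {"umbrella": 0.2, "no umbrella": 0.8},
             "rain": {"umbrella": 0.9, "no umbrella": 0.1}}
START = {"sun": 0.5, "rain": 0.5}

def most_likely_states(observations):
    """Return the single most probable hidden state sequence."""
    states = list(START)
    # best[s]: probability of the best path ending in state s;
    # paths[s]: that path itself.
    best = {s: START[s] * EMISSIONS[s][observations[0]] for s in states}
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        new_best, new_paths = {}, {}
        for s in states:
            # Choose the predecessor that maximizes the path probability.
            prev = max(states, key=lambda p: best[p] * TRANSITIONS[p][s])
            new_best[s] = best[prev] * TRANSITIONS[prev][s] * EMISSIONS[s][obs]
            new_paths[s] = paths[prev] + [s]
        best, paths = new_best, new_paths
    return paths[max(states, key=lambda s: best[s])]

print(most_likely_states(["umbrella", "umbrella", "no umbrella"]))
# ['rain', 'rain', 'sun']
```

With these particular numbers, the best explanation for an umbrella-free day in the middle of umbrella days is a single sunny day, much like the behavior the lecture's run showed; dynamic programming keeps this efficient, since any longer best path must extend one of the per-state best paths.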
So now we've seen a couple of ways that
AI can begin to deal with uncertainty.
We've taken a look at probability
and how we can use probability
to describe numerically things that
are likely or more likely or less
likely to happen than other
events or other variables.
And using that information,
we can begin to construct
these standard types of models,
things like Bayesian networks
and Markov chains and
Hidden Markov Models,
that all allow us to be able to
describe how particular events relate
to other events or how the
values of particular variables
relate to other variables, not
for certain, but with some sort
of probability distribution.
And by formulating things in terms
of these models that already exist,
we can take advantage
of Python libraries
that implement these sort of
models already and allow us just
to be able to use them to produce
some sort of resulting effect.
So all of this then
allows our AI to begin
to deal with these sort
of uncertain problems
so that our AI doesn't need
to know things for certain
but can infer based on
information it doesn't know.
Next time, we'll take a look
at additional types of problems
that we can solve by taking
advantage of AI-related algorithms
even beyond the world of the types
of problems we've already explored.
We'll see you next time.
