>> Hello, everyone.
Let's get started.
So first of all, this is mainly for the residents. They haven't had many talks about neural networks and things like that before, so I'll give an introduction about these things first. So bear with me for a few minutes.
Feel free to ask questions at
any point when you have them.
Let's get started.
So I'm Miltos and hopefully,
you're here because
you're interested in learning
about graph neural networks.
First, let's cover some high-level background, machine learning in general, and we'll get to graph neural networks in five minutes.
What's machine learning?
Well, we design the
model, we, the humans,
design the model and the
model has parameters.
The parameters are these
knobs that you turn
slightly left or right and
change the model's behavior.
The model is something we use to represent something in the world. All models are wrong, some are useful.
So this is what machine
learning helps to do,
is we design the models as humans,
we learn the parameters through data.
So, supervised machine learning, which is the most common, most widely used, and the flavor that works best.
We have our model, our spherical cow.
We get the input data, we try
to predict something out.
Our dataset is a set of data points with input x and output y, and these data points are drawn independently from your population. So we assume there is no correlation among them in most cases.
What we want to do is define a loss function, something that measures how good or bad our predictions are. We start with our model f, we give it x, we try to predict y, and we judge how close our prediction f(x) is to y.
So that's supervised learning.
One flavor which is common
in deep learning nowadays
is the following.
We do the learning
with gradient descent.
The idea is as follows.
While we have not converged,
somehow, we compute an estimate
of the derivative of the loss
over our dataset and then
we update our parameters,
this Theta thing, based on how
we're going to minimize the loss.
So we gradually move to a
point which is hopefully good.
This is one computational method for doing this, not the only one, but in any case, that's it.
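As an editorial sketch of the loop just described (none of this is from the talk; the toy data, learning rate, and single-knob model are made up for illustration), here is gradient descent recovering one parameter Theta by minimizing a squared-error loss:

```python
import numpy as np

# Made-up toy problem: recover theta = 3 in the model f(x) = theta * x.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 3.0 * xs

theta = 0.0            # the single "knob" of our model
learning_rate = 0.05

for _ in range(200):                         # "while we have not converged"
    preds = theta * xs
    grad = np.mean(2.0 * (preds - ys) * xs)  # estimate of d(loss)/d(theta)
    theta -= learning_rate * grad            # update theta to reduce the loss

# theta has gradually moved to a point that is hopefully good (close to 3).
```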
What do we hope to get out of machine learning and these modeling things? Well, we want to generalize to new, previously unseen data.
If we had these data points here, we don't want a very simple model that underfits, and we don't want a very complex model that overfits exactly to our points but misses the thing we're trying to model.
So that's machine learning.
If you haven't seen this before,
hopefully, this is good
enough to get started.
Another thing we're going to use is the notion of distributed vector representations.
That's common in natural
language processing,
common in deep learning.
The idea is the following.
We have a set of discrete things.
One way to represent this is
with local representations.
We have a huge sparse vector of mostly zeros with a single one when we represent something like, let's say, a banana or a mango or a dog.
This, however, has the
problem that we don't know
how these two things
correlate or combine.
It's a very discrete representation.
What usually happens in deep learning, with different models doing this in different ways, is that you learn a distributed vector representation.
Distributed because the
meaning is distributed
across the components of this vector.
So maybe banana and mango
share this yellowness thing,
whereas the dog is
somewhat different.
So how do you do that?
How do you learn or compute these things?
You can learn them as
parameters of your model.
The high-level idea is that you start from the one-hot representations, multiply them by this embedding matrix, and you get these distributed vectors. Now, your parameter is this E here, which is an embedding matrix.
Maybe you've heard things like
Word2vec or things like that.
These are the distributed
vector representations.
They are building blocks, let's say,
of the things that
we're trying to get at.
As for how you compute them or how you learn them, there's quite a wide variety of ways of doing this.
But you'll hear me talking about
distributed vector representations
a few times later today.
>> D is a lot smaller than V.
>> Exactly. That's part of the point: V might have 10,000 elements or 10,000 words or something like that, while D would be 128 or 500 or something like that. So it compresses down the space in some form.
That's another way of
seeing these things.
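A tiny sketch of this lookup in code (editorial addition; the sizes and the random matrix are made up, and in a real model E would be learned as a parameter):

```python
import numpy as np

V, D = 5, 3                     # vocabulary size and embedding size, D << V
rng = np.random.default_rng(0)
E = rng.normal(size=(V, D))     # the embedding matrix, a learned parameter

one_hot = np.zeros(V)
one_hot[2] = 1.0                # local one-hot representation of, say, "mango"

# Multiplying the one-hot vector by E just selects row 2 of E:
distributed = one_hot @ E
```

In practice, libraries implement this as a direct row lookup rather than an actual matrix multiplication.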
So graph notation, again,
for the residents,
most of you don't necessarily have
a computer science background.
Graphs look like this.
You have nodes or vertices.
You can see here you have edges,
which can be directed
or undirected edges,
you can see them here.
You can have many
different types of edges.
So here, I'm coloring some edges differently to show this.
What does this graph say?
Well, in this specific one, nothing, but the main idea here is that the nodes encode elements, and the edges encode relationships among those elements.
So this is the high-level thing
and you can encode quite
a few things in graphs.
Hopefully, if the residents
have any questions
about notation or
graphs, let me know.
This is very simplified; there are, of course, tons of things around graphs.
So we got to the point
where I can start
talking about graph neural networks.
So the idea of graph neural
networks is the following.
You start with a graph, a
representation of your problem.
What is this graph?
Where does it come from? That's a modeling decision: you decide that this graph here encodes the problem you're interested in.
It could be a social network,
it could be a molecule,
it could be a program, it could be anything that you're interested in.
So that's the first part.
You, as a human, have designed
a graph representation
of your problem.
The second thing is that for
each of your nodes here,
you need some information.
This information is encoded in a distributed vector representation,
an embedding if you wish, and
this is encoded for each node.
So each node gets somehow
a representation.
You can compute it.
It could be an image.
So B could be an image,
so you compute it from a
convolutional neural network.
It could be a word, it
could be an embedding,
it could be anything
that you wish it to be.
So this is where we start.
This is the input to a
graph neural network.
Now, the question is,
what do we want to do?
Well, the idea is we start from here, this graph representation, and what we'll describe for most of the rest of this hour is the graph neural network.
The output here is going to
be, again, the same graph,
but for each node, you have
a vector representation.
Now, these vectors here
are going to have information
not about node A,
when you had initially here,
but how A belongs within
the context of this graph.
So you are essentially saying,
given the features that
I had initially here
and the structure of the graph
and the neighbors and what the
neighbors had to say or do,
what is the representation here?
So a GNN computes these
representations once it's trained.
Then what you do is you take this,
this representation you
have of your graph,
the output of the GNN,
and then you can pass it to something that is specific to the task you're trying to address.
So it could be the loss
or something else,
but it's something
that is task-specific.
We'll discuss very briefly two tasks
after I explain the
graph neural networks.
>> These funny things attached to each vertex, they're somehow meaningful to the person setting this up? They're not the internals of the GNN, they're a meaningful thing.
>> It's as meaningful
as a vector can be.
>> But it comes from that side, it's not part of the mechanism of machine learning.
>> It's not part of the mechanism
of graph neural networks.
It could be that a neural network somehow processes F, let's say an image, and you get a vector out of it, which could be features like there is an edge here, or there is a face here, or there is a wheel here, or something like that.
It could be that this is a word.
So you have something distributed,
an embedding of a word,
like a natural language processing
word and things like that.
So the GNN is transparent to what the initial representation is.
>> Are they the same kind of thing?
The same length, for
example, identical?
>> They don't have to
be the same length,
but they usually are the same length,
not the same vector, of course.
Otherwise, this would be
an identity function.
But they are a representation.
Again, it's like a feature vector: this node A is central in this part of the graph, or, if it were a variable in a program as we will see later, this variable here acts like a counter, people accumulate things into it. Things like that are learned; we don't explicitly say or know exactly what each component means.
We can find by analogy in
some cases, but that's all.
>> Do they have to be
unique and identifying?
So for instance, if you
have the same type of nodes
but with different edges,
you will have the
same node identifier
but the edges would
make it different?
As in, if I have two vectors that are identical in the same graph, can I assume that they are the same type of node?
>> So here?
>> Yes.
>> Yes, you can.
>> Which means if I have the same, like they don't clash, like an MD5 hash, you can assume that?
>> You can assume anything
you want in this case, yes.
So the simplest scenario here
is that all these things,
all these vectors are identical,
and you want to learn
something about the graph itself.
That's perfectly fine.
There are other cases
where you want to attach
some information to the graph.
>> I see.
>> So I'm trying to map this
diagram to the first slide
you showed about what
machine learning is,
which is you have the model
and then you have the knobs.
So is it accurate to say
that the model here,
the human design part is
the structure of the graph
and then the knobs that you're trying
to learn the parameters
for are the vectors?
>> So the human element is in here, we'll discuss this in the next slide, and you can also think of it as feature extraction. In, let's call it, old-fashioned machine learning, you extract features: is this red or blue? Is this small or large? So you have these zero-one things. Instead, here, you can think of it as feature engineering: you feature-engineer your problem into a graph.
>> I see.
>> So this is part of what we humans do, but there are also components in here that are human decisions. But the parameters, most of them, are here. There might be parameters in how you compute these initial vector representations.
>> Got it. Thanks.
>> Should we assume that the graphs from different examples are of variable size and not necessarily the same structure?
>> Yes.
>> Okay. After you get the final representation, then you have these vectors that you would like to combine into your [inaudible]. How do you do that when you have a variable size?
>> We'll get there eventually.
>> Does the output graph
have the same structure
as the input graph?
>> Yes.
>> So I'm not adding or
removing any nodes or edges?
>> Not in the flavor of
the graph neural networks
we'll be explaining for
the rest of the talk.
In principle, you can do
some of these things,
but let's ignore this for
all practical purposes.
So hopefully, that's good enough,
graph neural networks here.
Let me zoom in on this box.
What happens is the following.
Let's take this small example here.
This node F is connected to E, and D connects to F, through different types of edges. At some point in time, let's say, E and D have some vector representations. The same holds for F.
Now, the idea is as follows.
We start from our neighbors
and we're going to do something
like computing a message.
That's the analogy.
The message here is essentially
get another vector.
The idea is that we get
a function of the input,
the neighboring node,
the actual current node,
and the type of the edge,
and with this function,
which could be many things,
again, that's a design decision about our model, and we'll discuss a few concrete options later, we get a message, meaning again a vector.
The same thing for this edge here.
Then we summarize all the messages
we received from our direct neighbors
and again, summarizing could
be many different things.
We combine the summarized information
with the current state of the
node and we update the node.
So now, essentially, what happened is that F had its state at time t minus 1, and at time t, it has a new state that carries information about itself and about its neighbors.
In some way, this is combined.
There are many, many different
ways to combine them.
We'll get to that very soon. Yes.
>> When you say it depends on both nodes and also the type of edge?
>> Yes.
>> Is that encoded somehow or is
it just directed, undirected?
>> No. The type of edge could encode that, let's say, D is below F and E is above F. So it's a relationship that you encode. Again, the types of edges you have are something that you decide; it's, again, a modeling decision of the human.
>> But then is it simply encoded directly in the vector of the edges?
>> We'll discuss how and when
it's encoded in a second.
So let's look at an equation here,
which is hopefully
colored in a helpful way.
So prepare the messages.
This is a function, which might depend on the time step t, that takes the representation of the node you're currently trying to update, like F in this case; it takes the type of the edge, this k here; and optionally the type and the state information, the vector if you wish, of your input node.
So you do this,
you compute a message for each edge,
for each node that connects to n_j.
Now, I'm using this
fancy operator here,
mostly because I couldn't find
another operator in Office,
but the idea is as follows.
This could be any
commutative operation,
any permutation invariant operation
for combining the messages.
It doesn't matter if the
edge from D to F comes first
or the edge from E to F comes first.
This is an important part
of graph neural networks
because they make them
essentially invariant
to however you may
shuffle your graph,
and that's an important thing.
In some cases, this
could be a summation,
in some cases, this could be a max,
in some cases, this
could be something else.
That's, again, a design
decision of our model,
nothing beyond that, but this is
a general class of operations.
>> Then you should
include all the edges.
>> Yes, but you include only the edges that connect to this node.
>> This is the row n_j.
>> Yes, and then we're currently here
updating only one node, of course.
So we get all our neighbors,
we compute the message,
we summarize it,
and then we take our own
state, this state here,
and we combine this
with some function q,
and we now have a new
state over there.
So that's the high-level thing.
For each node, you can imagine,
we have a previous state,
you get messages from your neighbors,
and you update your state.
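A rough sketch of this per-node update, added editorially: the per-edge-type linear transforms, the sum as the permutation-invariant combination, and the tanh layer standing in for the update function q are all made-up choices for illustration, not the talk's specific model.

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)

# One weight matrix per edge type k, playing the role of the message
# function f; parameters would normally be learned.
edge_weights = {0: rng.normal(size=(d, d)), 1: rng.normal(size=(d, d))}
W_q = rng.normal(size=(2 * d, d))   # parameters of the update function q

def update_node(h_self, neighbors):
    """neighbors: list of (h_neighbor, edge_type) pairs."""
    # Prepare one message per incoming edge, then summarize with a
    # permutation-invariant sum, so neighbor order cannot matter.
    summarized = sum((h_n @ edge_weights[k] for h_n, k in neighbors),
                     np.zeros(d))
    # q: combine your own state with the summarized messages.
    return np.tanh(np.concatenate([h_self, summarized]) @ W_q)

h_f, h_d, h_e = (rng.normal(size=d) for _ in range(3))
new_state = update_node(h_f, [(h_d, 0), (h_e, 1)])
```

Because the messages are summed, shuffling the neighbor list gives exactly the same new state, which is the invariance discussed above.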
>> Should this be going from n_j to n, or is there something off in the notation?
>> Depends on notation.
>> Under the big union?
>> Yeah.
>> Yeah.
>> Yeah, I guess you're right, n_j should have been here.
>> Yes.
>> So this is the process
for updating one node,
how do most of the common graph neural networks work? Well, you have one of these per node, the parameters are shared across all of them, and then there is a clock, very much like a CPU.
At each point, a clock ticks.
When it ticks, every node gets
input from all its neighbors,
compute its messages,
updates the state.
The clock ticks again and the
process repeats again and again.
So how many times do you run this?
That's a hyperparameter, a design decision, whatever you want to call it.
There is no notion of convergence
or other fancy things like that,
but that's how a graph
neural network works.
>> So the new computed value has to be saved separately, and you only update all of the values at the end, because otherwise the order in which you update the nodes would change the result?
>> Yes, you do this in parallel.
>> Okay.
>> Which brings me to the next slide.
You can think of this as unrolling the graph multiple times in time, and now each vector gets updated again and again and again.
So all nodes update their states
simultaneously in parallel.
Each slice here, you see all the
nodes in the graph have one state.
>> As time goes on, how are these f functions evolving as well?
>> By F functions, you mean?
>> The ones that [inaudible] .
>> So once you've trained
your neural network.
>> Show the next slide.
>> Yeah, I'm trying to get there.
>> [inaudible] .
>> Yeah.
>> [inaudible] .
>> Again, a modeling decision. You might decide that f_t depends on t. You might decide that q depends on t. In many cases, they don't. In some cases, they do. That's entirely up to you. But usually, you can think of this as something that depends on the time t but is shared across all time slices.
>> So what I think is missing in this discussion is the parameters: do both f and q get Theta as an input as well?
>> Yeah, so parameters can be in F.
>> So usually, f_t of Theta
and these other things
and it's q of Theta and
these other things.
>> Yes.
>> There's a silent Theta here.
>> Yeah. We'll discuss, I think in two or three slides, concrete choices that people make where the parameters will be more explicit.
So what this means is that in the beginning, each node knows about itself. In the next step, each node learns about itself, well, it knew that already, and its distance-one neighbors. In the next step, its neighbors have also been updated, so now it knows about itself, its neighbors, and its neighbors' neighbors. So you essentially increase the receptive field of what you know about each node, and with more time steps, hopefully, you learn more about how you belong in the graph.
>> Is this like Spanning
Tree Protocol in networking?
>> I don't think so.
I cannot say that I'm very
familiar with that. Okay. No.
Graph neural networks: you started from some state, you got the output, and now you have all of these vectors here that carry information about a node and how it belongs in the graph, and you can do anything you actually care about in your application.
You can say I want to pick one node
from this graph that
has some property.
So you can do that,
this is node selection.
You can label each node,
you can add a class,
for example, and you take
each representation here
and classify it somehow,
or you can decide
to summarize all these
vectors you have here
and do graph classification.
There are different ways to do this,
and it's very application-dependent,
we'll discuss when we go
to some applications.
Let's take a very, very simple
example, not very practical,
but still, it's the simplest
possible you can get.
So you have this h you got out of each node at the final time step, you multiply it with a single-layer MLP, and what you get out of it is a binary decision, a zero or a one, assuming that makes sense for your application. So now, you have a loss, essentially the classical binary cross-entropy loss, saying that I want my decision for each node here to match my value y_n, whether it's true or false, zero or one.
What's the application? I don't know. I just made this up to make it a concrete thing.
Now, what this means
is that I have a loss,
I can compute its gradient,
and with automatic differentiation,
we'll compute the gradient
through the graph neural network.
I can update the parameters here,
the w, everything in the GNN,
everything in the initial layer,
and this is how learning with
graph neural networks would work.
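A minimal sketch of this node-classification readout and loss, added editorially (the values of H stand in for a trained GNN's output; all sizes and labels here are made up):

```python
import numpy as np

n, d = 3, 4
rng = np.random.default_rng(2)
H = rng.normal(size=(n, d))       # GNN output, one vector per node (stand-ins)
w = rng.normal(size=d)            # the single-layer readout, a parameter
y = np.array([1.0, 0.0, 1.0])     # made-up per-node binary labels

probs = 1.0 / (1.0 + np.exp(-(H @ w)))   # sigmoid of each node's score

# Binary cross-entropy: how far each decision is from its label y_n.
# Automatic differentiation would push gradients through w, through H,
# and into all the GNN parameters.
loss = -np.mean(y * np.log(probs) + (1 - y) * np.log(1 - probs))
```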
>> So is this a separate step? Your graph neural network produces this output,
and from this output, you then
try to classify based on it
or is it something
that you incorporate
into the graph neural network?
>> It's all end-to-end. You can think of this as a function, the loss, for a simple example at least, over all the parameters you have in your model, and you back-propagate to all the parameters. In most cases, that's the most common thing to do.
>> But you're doing the back-propagation through the entire stack, so it's a pretty giant expression that you are differentiating.
>> Yes, you really don't want to do the differentiation manually, that's for sure.
>> I think so, yeah.
Also, do you run into problems like vanishing gradients and so on? This sounds like a very hard RNN problem, and RNNs [inaudible].
>> It depends on the architecture.
Again, there are many architectures
that are coming up next.
Some of them look more like CNNs,
and you can think of them
as a stack of CNN layers.
Some of them are RNNs.
Usually, this unfolding happens for, let's say, 10 steps, so you never get into that big of a problem. Again, either you use an RNN cell, like a GRU or something like that, or you get residual layers like in CNNs, and hopefully, you don't run into this problem in most cases.
>> Where's the update
rule for these h things?
>> The update rule is
stochastic gradient descent
or stochastic with momentum.
>> How do you go from
h_t to h_t plus 1?
>> You mean h_t? It's a value in the model, right? You don't update h_t; it's not a parameter, w is a parameter.
>> No, I get that.
Okay. Maybe you say it
in the second slide.
>> I'm not sure what you're asking.
>> Because earlier, you said
there will be an update rule
for each of these states,
for each of these h vectors.
>> So h here is the output h. So maybe this d should have been capital D, the final thing when you go over there.
>> [inaudible]
>> Yeah. That depends on the graph
neural network architecture.
So let's look at two popular graph
neural network architectures.
So the first one is gated
graph neural networks.
They were first invented in this building. I wasn't here at that time, but the idea is as follows.
We want to compute the messages.
We say that the message depends
on the type of the edge k,
on the neighbor and the
state of the neighbor D.
So h of node D at time t minus 1.
This is copy paste, of course,
so that's equally wrong,
and it does not depend
on F, in this case.
So here, this E is a parameter of the model, and you do a matrix multiplication: you transform the input state of your neighbor. Your summarization, the permutation-invariant method, is just a summation. So however you order this sum, it doesn't really matter; it's a sum.
Well, modulo floating point, but don't worry.
Then here, you take your previous state, and you have an RNN cell, a recurrent neural network. This one is called a GRU; it doesn't matter. It has some parameters in it, and it takes the previous state and the message that we just computed, the summarized information, and updates the state.
>> What's k?
>> k is the type of the edge here. So you can say that the white one is zero, the red one is one, something like that, so you have a finite, usually small set of edge types.
>> I guess if your F depends
on the type of [inaudible]?
The function F.
>> The function F, yes.
This is what we previously had as f here. So we have essentially a different matrix for each k, which is n by n, where n is the dimensionality of h.
>> How come there's no bias term for E_k h?
>> For E_k?
>> Yeah, the E_k h term.
>> You could add one.
It depends on your application.
There are variants; I think the original GGNN adds one. You can say that the bias, essentially, will be helpful for counting how many edges come in at some place. That's one option. Okay, next. Nick?
>> What is the gated referring to?
>> Sorry?
>> What's the gated refer to?
>> The GRU, which is a gated recurrent unit, yes. So this is an RNN cell that goes from the previous state to the next state. So this goes back to: you have as much exploding or vanishing gradient as you would have for a GRU with 10 steps, so not much, usually.
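One step of this gated-graph-network-style update can be sketched as follows (an editorial illustration: the graph, the sizes, and the plain tanh cell standing in for the actual GRU are all made up):

```python
import numpy as np

n, d = 3, 4
rng = np.random.default_rng(3)

# One adjacency matrix per edge type k: A[k, i, j] = 1 if i -> j.
A = np.zeros((2, n, n))
A[0, 0, 1] = A[0, 0, 2] = 1.0   # edge type 0
A[1, 1, 2] = 1.0                # edge type 1
E = rng.normal(size=(2, d, d))  # the per-edge-type matrices E_k
W_h, W_m = rng.normal(size=(d, d)), rng.normal(size=(d, d))

H = rng.normal(size=(n, d))     # node states at time t - 1

# Messages: each sender's state transformed by E_k, summed over all
# incoming edges (the permutation-invariant summation).
M = sum(A[k].T @ (H @ E[k]) for k in range(2))

# State update: a plain tanh RNN cell standing in for the GRU.
H_next = np.tanh(H @ W_h + M @ W_m)
```

Note that node 0 has no incoming edges here, so its summarized message is all zeros, which is exactly the situation the backwards-edges trick later in the talk addresses.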
Another variant is graph convolutional networks. That's because they resemble convolutions, and we'll see later how this goes. But the idea is you take the states from all your neighbors, you sum them up together with your own state, and then there's a matrix multiplication and a division for normalization. This model, at least in its original implementation, doesn't support different edge types, so you see there is no dependence on k. Again, you see a permutation-invariant combination and a different way of updating your state.
One thing I didn't really discuss: these are design decisions. All of them are different modeling decisions, and there are tons of them.
You can go online and find
more than a hundred papers
that change something
in this equation,
but the original equation is
still more or less always intact.
You have a way to prepare messages,
a way to summarize them,
a way to update your state.
The last thing is now the following. There is one trick that I have omitted so far.
Let's take, for example, node D.
Node D is not connected; no incoming edges come into it.
So node D has no
opportunity to learn about
how it belongs in the graph.
So a practical trick is that you explicitly add backwards edges. So you double the types of edges you have: for each forward edge, you add a backwards edge. Now, you still have a graph, the same equations apply, but the information is propagated through the whole graph.
That's just a trick that exists.
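In matrix terms, this trick is just a transpose; a small editorial sketch (the graph is made up, mirroring the node-D situation described above):

```python
import numpy as np

n = 3
A_fwd = np.zeros((n, n))
A_fwd[0, 1] = A_fwd[0, 2] = 1.0   # node 0 points at 1 and 2; nothing points at 0

# The trick: for each forward edge type, introduce a new backwards edge
# type whose adjacency matrix is simply the transpose.
A_bwd = A_fwd.T

# Every node now has at least one incoming edge, so information can
# propagate back to node 0 as well.
incoming = A_fwd.sum(axis=0) + A_bwd.sum(axis=0)
```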
>> Isn't this related though to
the assumptions that you were making?
Surely, if you said that D has a directed edge to the rest of the graph and nothing else connects to D,
nothing else connects to D,
then that's a pretty strong statement
about how D relates to the
rest of the [inaudible].
>> Yes.
>> So if this is the input, you
don't want to change it, right?
>> You mean this?
>> Yes.
>> Yes, but it depends
on your application.
If you want to learn
something about D,
how it belongs in there,
then the information has to
somehow propagate back to D.
This is what this tries to solve.
>> Is the edge you replicate the same type, or a new one?
>> Well, you can say it's a different type: if this one means "to the left of E", the backwards one means the opposite of that.
>> So basically, what you want is to get feedback, right? Can you make it weaker or stronger? Or can you change it?
>> You could. People don't.
But yeah, there's nothing
necessarily stopping you.
>> Would it blow up if I just had an edge type for "no edge" here and just tried to basically fully [inaudible]? But the default was [inaudible].
>> There are some people that do something like that, where essentially you imagine a God super-node here that everything connects to and back. It works quite well in many cases. There was a question.
>> I thought someone had
the same question earlier
while you were introducing the
function that prepares the message.
So could you also fix this by changing that function that prepares the message to treat your forward edges and backward edges differently?
>> The way I have currently written this, you take the previous state and you update the state.
>> Yeah.
>> So you still don't do
anything to the neighbors.
You could imagine making
this more symmetrical,
but that's essentially what
the backwards edges are doing.
>> Okay. I see.
>> So let's go one step further.
The equation is nice.
How do we actually implement it with matrix operations? Again, a very basic piece of background: the adjacency matrix.
If you haven't seen it, let me know.
The idea is that you have nodes,
you assign them some ID,
and you say that node 0 is
connected to node 1 and node 2,
and the same thing for node
1 is connected to node 2,
and node 2 does not have
any outbound edges.
So now, imagine that we have
some variable, a, b and c,
just a number for now
for each node here.
Then if we multiply these two
things, we get this thing.
So here, you can think of it, very loosely, as: node 1 received a, because node 0, which had a, points at it. So you see that the second element is a, whereas node 2 here was connected to node 0 and node 1, and you can see from the matrix multiplication that it gets a plus b.
So this suggests how message propagation will start working. The idea is you can repeat this, and creating these adjacency matrices is easy; it's deterministic.
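The adjacency example just walked through can be checked directly in code (editorial addition; a, b, c are arbitrary made-up values):

```python
import numpy as np

# Adjacency matrix from the slide: node 0 -> 1, node 0 -> 2, node 1 -> 2,
# and node 2 has no outbound edges.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)

a, b, c = 1.0, 2.0, 3.0
x = np.array([a, b, c])   # one value per node

# Each node receives the sum of the values of the nodes pointing at it:
# node 0 gets nothing, node 1 gets a, node 2 gets a + b.
received = A.T @ x
```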
So let's look at some
pseudocode-like thing.
You have a matrix where each row is your initial vector representation at time t. This is a matrix, and here is its size: the number of nodes by whatever dimensionality h has.
Now, you want to compute
the messages to be sent.
So you do a multiplication by E_k for each edge type k with H_t, and now you have all the messages that could be sent out from a node, possibly, not necessarily all of them.
Now, in order to
receive the messages,
you multiply this M with your adjacency matrix A for each edge type k, and you have the messages you received.
Now, you can apply
your update function.
So this is the simple mathematical form that you use, and if I wanted to use a vanilla RNN instead, this GRU function would essentially be some matrix multiplication with H_t plus W times R_t, just to make this slightly more concrete, although not very useful.
>> [inaudible]
>> So this M?
>> The b times d.
>> Yes. Then you have
this matrix E_k,
which should be d times n.
Then you have M, and then
this is also something by m.
Then here, you need to play around with W or the GRU input to get this to work, but you can make it work. Usually, m and d are the same in most of our implementations, so you can use it. But in theory, there's nothing stopping us from making this [inaudible].
>> Can you repeat what a is
in the receive messages?
>> A should be A_k, to start with; so it's a typo. But this is the adjacency matrix for edges of type k.
So this is where this comes in,
this adjacency matrix that says that
node 0 is connected to 1 and
2, and things like that.
>> So k is like a finite enumeration, like [inaudible] blue; it's not like 4.7.
>> In most cases, yes,
and in the cases we're seeing
here, definitely, yes.
So yeah, you can get this thing here.
So again, the next step is mostly a very practical one, but I think it's useful for the residents. Now, let's take this next step. We showed the high-level equation, we showed the matrix form. Now, let's look at some pseudocode.
First, what's einsum? I'm going to use this; I felt it would be useful for the residents. The idea is that you express operations over matrices compactly: a is of size b by d, b is of size q by d, and the output you want is b by q. So you express this operation with this Einstein summation here. You can do more complex things. You can loosely think of it as: here, the dimension d is erased, you sum over it, marginalize it, whatever word you want to use; or here, you remove some other dimensions. So you remove a with this, and you can express a complex sum with a simple short notation.
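The 'bd,qd->bq' example just described, run as an editorial sketch (the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.normal(size=(2, 3))   # size (b, d)
b = rng.normal(size=(4, 3))   # size (q, d)

# 'bd,qd->bq': the shared index d is summed away ("erased"), leaving (b, q).
out = np.einsum('bd,qd->bq', a, b)

# This particular pattern is the same thing as an ordinary matrix product
# against the transpose: a @ b.T.
```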
So how does the code look? Well, you say my initial states are somehow given to me; it doesn't really matter. We have a number of steps for which we're going to unfold our graph neural network; at each time step, we compute messages for each type k, which is this message multiplication that we have here.
>> A question on this slide, I think. I'm still stuck on the dimensions. nd times dm makes it n times m.
>> nd?
>> The first messages_k statement, yes, that one.
>> Yes.
>> Nd times dm gives
you the output nm.
>> Yes.
>> That makes sense for
the matrix multiplication.
>> Well, they are swapped, and this should have been [inaudible].
>> [inaudible] on the previous slide, the computation is slightly different.
>> Okay. I have a lot of slides.
We should fix this eventually.
Anyway, this is just notation here, and then you have the received messages, and hopefully, this also gets you there. You sum the received messages and you apply the GRU.
>> The num steps is
the input parameters?
>> Yes, it should have
been an input parameter,
a hyperparameter, a constant;
well, in most cases,
it's a constant for all your
rounds or something like that.
But yes, you're right,
it should also have been included.
So this ties with this here.
So hopefully, this makes
the graph neural network
a bit more concrete.
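Putting the pseudocode together as a minimal NumPy sketch. This is illustrative only: dense per-type adjacency matrices, random untrained weights, and a simple tanh update standing in for the GRU cell the talk actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K, num_steps = 5, 8, 2, 3   # nodes, state size, edge types, unroll steps

h = rng.normal(size=(n, d))                        # initial states, given somehow
E = [rng.normal(size=(d, d)) for _ in range(K)]    # one message matrix per edge type
A = [rng.integers(0, 2, size=(n, n)).astype(float) # one adjacency matrix per edge type
     for _ in range(K)]

for _ in range(num_steps):
    received = np.zeros((n, d))
    for k in range(K):
        # Messages of type k: each node's state times the type-k matrix.
        messages = np.einsum("nd,dm->nm", h, E[k])
        # Route messages along type-k edges and sum them at the targets.
        # Note the distinct letter l for the n-by-n adjacency, as in the talk.
        received += np.einsum("ln,nm->lm", A[k], messages)
    # Simplified state update; the talk uses a GRU cell here instead.
    h = np.tanh(received + h)
```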
The one thing I'm not
going to discuss today
is this specific thing here.
Here, we have the adjacency matrix,
which is n by n, but
in Einstein summation,
you cannot say this.
But this adjacency matrix may
be very sparse, but very large.
So how can you make
this more efficient?
That's something we can
discuss another point in time.
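One common way around the dense n-by-n multiply, sketched here with made-up data, is to store the graph as an edge list and scatter-add messages into their target nodes:

```python
import numpy as np

n, d = 4, 3
messages = np.arange(n * d, dtype=float).reshape(n, d)  # per-node messages

# Edge list (source, target) instead of a dense n-by-n adjacency matrix.
src = np.array([0, 1, 1, 3])
dst = np.array([1, 0, 2, 2])

received = np.zeros((n, d))
# Unbuffered scatter-add: correct even when a target appears multiple times.
np.add.at(received, dst, messages[src])

# Node 2 receives the messages from nodes 1 and 3.
assert np.allclose(received[2], messages[1] + messages[3])
```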
>> [inaudible] It
should be [inaudible]?
>> Where? Here?
>> Yeah.
>> No. This is n, n,
but you cannot write that
in the Einstein summation.
So L equals n for your purposes,
but it needs to be different.
>> [inaudible].
>> Yeah. I want to say
we're part of this.
Yes. I wanted to go
through two applications.
I think we don't have the time to.
So I'll go just very briefly
over one with one slide.
The idea is that you have
molecules, for example,
and you want to predict
some target quantity
out of them.
You can see that molecules
look like graphs;
they are graphs to some extent,
and now, each node is an atom.
You have different types
of nodes like carbons,
or nitrogen, or other kind of things,
I'm not a chemist, and you
have different types of edges.
So some are double bonds,
some are single bonds, and so on.
So now, the question is,
can you get from your molecule
to some target quantity?
What you do is you take
this graph, you get this,
and you try to predict
your target quantity.
That's very high
level, but hopefully,
it's more concrete than
before. That's one option.
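A toy sketch of that molecule setup. The features, weights, and "chemistry" here are made up purely for illustration, and the weights are untrained:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy molecule: 3 atoms with one-hot element features (say C, C, O).
atom_features = np.array([[1., 0.],   # carbon
                          [1., 0.],   # carbon
                          [0., 1.]])  # oxygen
adjacency = np.array([[0., 1., 0.],  # bond C-C
                      [1., 0., 1.],  # bond C-O
                      [0., 1., 0.]])

W_msg = rng.normal(size=(2, 2))  # hypothetical message weights
w_out = rng.normal(size=2)       # hypothetical readout weights

# One propagation round, then a sum "readout" over all atoms,
# then a linear layer to the scalar target quantity.
h = np.tanh(adjacency @ atom_features @ W_msg + atom_features)
graph_vector = h.sum(axis=0)
prediction = graph_vector @ w_out
```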
The other thing, which
I'm not going to discuss,
is about source code,
how we find bugs,
but I don't think that
we have sufficient time
to discuss through the various
design decisions. Yes.
>> So are you just swapping atoms,
so that the colors
represent different types?
>> No. This is how
features may be computed.
This is the same atom
repeating again and again.
The same graph, but the
graph neural network
updates a state, essentially
for each node, for each atom,
which hopefully has
information that is relevant
to the target quantity
you're trying to predict.
Yeah.
>> If you have time, is
it okay to go over it?
>> We can do this. Well, let's see.
Let's talk a bit about
the code-related things.
So the idea is as follows.
We want to find bugs.
Specifically, we want to
do the following task.
Given a blank in a program here,
we want to say which of the things in
the code should be placed in here.
So here, you have first and clazz.
Now, this will help us find
bugs, like copy-paste bugs:
say clazz was copy-pasted
into this line,
but the correct thing would be first.
First thing we need to do,
we want to take a program
and convert to a graph.
How do we do it? Well,
that's a design decision.
That's a feature
extraction if you wish.
The idea is that we
want to take things
like how data flows
around these things,
how control flows around
these things, and get there.
So one possible thing I'm
going to show here is tokens.
Well, a very boring graph.
You can parse code,
so you can essentially, and
deterministically, get a tree out of it,
and you can add extra
nodes and extra edges.
Again, a graph.
Let me hide these edges and take
a simpler program here,
and we can also have data flow edges.
So when was x last written?
Well, when I just entered
the loop, it was here.
If I'm looping, it's itself.
So we're going to keep repeating
adding more and more edges
and more and more different things
that we believe encode
the problem we have.
So by the end of the
day, we have a graph.
For this simple snippet, this
is what the graph looks like.
It's not easy necessarily to parse,
but it's not an uncommon thing
in programming languages.
>> So this design decision,
would you have all of these
edge types into one big graph
and then run your model
rather than have multiple graphs
with different types of edges?
>> No. You have one graph
with multiple types of edges
and you do the propagation
simultaneously.
>> Because this is fixed?
They do affect each other?
>> Exactly. So we're
just showing it for
clarity in the picture because
it will be too crowded, but yes.
>> So this is like you're abstracting
the tree of your compiler,
but then edges are problem-specific.
>> Well, some of them are from
the abstract syntax tree,
some of them are token-based thing,
some of them are data flow
things, things like that.
So how do we go about
doing this?
Well, we started with
this place here,
and what we do is we're going
to say this whole thing,
I'm showing it to you here
as a text, but it's a graph,
I'm going to add a few
extra edges and nodes
saying these two nodes here,
we want first, for example,
to be how would data flow if
first was in the slot here
or how data would flow
around if clazz was here,
and now, we have a big graph.
I'm not showing you all of it.
Now, the idea is we
started from our code,
we got our graph,
we somehow compute
initial representations
and propagate around that part.
Now, our objective is that
we want the output of our graph
neural network to have the h
of the slot node here be as
close as possible to first
and as far as possible from
clazz, which is the wrong thing.
Again, graph neural networks,
we started somehow,
we encoded our problem in the graph,
that was dependent on what we
thought we needed at least.
Our graph neural network says, well,
here is some
information about slot,
about first, about clazz,
and now, this is our objective.
Well, not exactly this,
but this is more or
less our objective.
We can compute a loss,
we can back-propagate and
learn things from there.
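This is not the exact objective from the talk, just a sketch of the flavor: score each candidate by its similarity to the slot representation and apply a cross-entropy loss. The GNN output states here are hypothetical random vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

# Hypothetical output states from the GNN: the slot node and two candidates.
h_slot = rng.normal(size=d)
h_first = rng.normal(size=d)
h_clazz = rng.normal(size=d)

# Score each candidate by similarity to the slot; the correct answer is first.
scores = np.array([h_slot @ h_first, h_slot @ h_clazz])
probs = np.exp(scores - scores.max())  # numerically stable softmax
probs /= probs.sum()

# Cross-entropy against the correct candidate; back-propagate from here.
loss = -np.log(probs[0])
```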
So that's essentially
a second application.
Let's take this simple example
where if the question is,
what is the variable
that should be here?
Well, the answer, as you probably
have not guessed, is full path.
The neural network knew that
in a few milliseconds
once getting this graph.
So that's two applications.
Graphs are very broad.
You can do almost anything
and represent almost
everything with graphs,
but this hopefully makes
the graph neural networks
somewhat more concrete.
I wanted to just emphasize
two special cases
of graph neural networks,
and the first is convolutions.
Imagine you have a
grid of pixels and now,
you take each pixel and you
create the following edges.
One which is top, top left,
top right, left, right, bottom,
and so on, different types of edges,
and you do this, of
course, for all nodes.
In some formulation of
graph neural network,
the graph convolutional
neural network,
for example, this is
equivalent to convolutions.
Of course, it's much more
efficient to do it the way
convolutional neural
networks currently work,
because you can batch
that computation
much more efficiently
than doing this thing.
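To make that equivalence concrete, here is a sketch that computes a 3-by-3 convolution as nine per-offset "edge types", each weighted by one kernel entry, with zero padding at the borders. The image, kernel, and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
H = W = 4
image = rng.normal(size=(H, W))
kernel = rng.normal(size=(3, 3))

# Message passing view: pixel (i, j) receives from neighbor (i+di, j+dj)
# with the "edge weight" kernel[di+1, dj+1]; missing neighbors contribute 0.
out = np.zeros((H, W))
for di in (-1, 0, 1):
    for dj in (-1, 0, 1):
        shifted = np.zeros((H, W))
        src_i = slice(max(0, di), H + min(0, di))
        src_j = slice(max(0, dj), W + min(0, dj))
        dst_i = slice(max(0, -di), H + min(0, -di))
        dst_j = slice(max(0, -dj), W + min(0, -dj))
        shifted[dst_i, dst_j] = image[src_i, src_j]
        out += kernel[di + 1, dj + 1] * shifted

# Interior pixels match a direct 3x3 cross-correlation.
assert np.isclose(out[1, 1], np.sum(kernel * image[0:3, 0:3]))
```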
The other thing is deep sets.
So the idea is that you
get a set of things.
It could be a variable
size set of things.
In this case, it's
images, for example,
and you want to predict
something about this set.
So this set of images are
people with black hair and
rosy cheeks, and so on.
Now, you get these sets of things
and you want to predict
things about them.
How do you encode this problem?
Well, essentially, a set is a
graph that is fully connected.
Everything is connected
to everything.
Again, you can run a
graph neural network,
and at the output of the
graph neural network,
you can predict
something about the set.
So these are two
extremes, if you wish,
about graph neural networks
and some special cases that result in
something that is more or less
similar to the things we have.
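A minimal deep-sets-style sketch, with untrained random weights and hypothetical names, showing the order-invariance that the fully connected view gives you:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6

def encode_set(elements, W_phi, W_rho):
    # Deep-sets style: transform each element, sum (order-invariant),
    # then transform the pooled vector again.
    per_element = np.tanh(elements @ W_phi)
    pooled = per_element.sum(axis=0)
    return np.tanh(pooled @ W_rho)

W_phi = rng.normal(size=(d, d))
W_rho = rng.normal(size=(d, d))

xs = rng.normal(size=(5, d))  # a set of 5 element vectors (e.g., image embeddings)
out1 = encode_set(xs, W_phi, W_rho)
out2 = encode_set(xs[::-1], W_phi, W_rho)  # same set, different order

assert np.allclose(out1, out2)  # permutation invariance
```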
I had the slides to talk
about more advanced GNNs.
Sorry, not a lot of time for that.
But again, we can play
around with how messages
are propagated, rather than
having all nodes update
with all messages
at all points in time.
That's all.
So I wanted to conclude
with the following
more practical, non-graph-neural-network things,
since I expected that
the residents would be
the only ones attending here.
But let's talk about a few things:
how machine learning,
deep learning specifically,
works in practice.
You have your data and now, you
have to do the following steps.
First, from your data,
you compute some form of metadata,
some information that you get
from your data that will inform
what your model looks like.
So how many words
appear in your text,
how many different types of
edges you have in your problem,
things like that.
Now, given your metadata,
you can convert your
data into tensors,
into things that look like matrices,
and encode the problem.
Part of the
metadata, for example,
could be that one word is number
53 and another word is number 54.
So you can create one-hot vectors
for a specific word, for example.
Once you have all
your data in tensors
(you can do this on the fly;
it doesn't matter when you do it),
you can batch those tensors
into smaller groups,
into mini-batches; how you do it
is usually a separate problem.
In most cases, you just add
an extra batch dimension,
and you can essentially
concatenate these tensors.
Then you get your mini-batches
and you pass them into your
machine learning model.
So tensors go in,
and you get a loss.
A loss is a scalar.
You can differentiate the loss,
try to minimize it,
propagate gradients,
update your machine
learning model, and repeat.
So this is a very
high-level picture of how,
let's say, non-computer-vision
machine learning models work.
So you have these steps of metadata,
converting to tensors, and
creating mini-batches.
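A toy sketch of those three steps, with made-up data: build the metadata (a vocabulary), tensorize the examples, and stack them along a batch dimension.

```python
import numpy as np

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]

# Step 1: metadata -- build a vocabulary from the data.
vocab = {}
for sent in sentences:
    for word in sent:
        vocab.setdefault(word, len(vocab))

# Step 2: tensorize -- map each word to its index
# (one-hot rows would work equally well here).
ids = np.array([[vocab[w] for w in sent] for sent in sentences])

# Step 3: mini-batch -- stack examples along a leading batch dimension.
batch = ids  # shape (batch_size, sequence_length)
assert batch.shape == (2, 3)
```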
The next thing is about debugging.
>> So it's notoriously difficult to
actually debug these models
because they're not
terribly interpretable,
but there are high-level
recipes for how to do this,
especially with deep learning.
I'm listing some of
these things here,
probably you'll get access
to the slides at some point.
But common things are,
take a small sample of your dataset,
try to overfit your model.
If your model cannot learn
even your small dataset,
probably you have a bug
somewhere, or try synthetic data.
Encode the problem you're trying
to solve as a very synthetic problem
and try to make your
model actually solve it.
Can it actually solve it?
If not, then maybe,
from how it fails,
you can find where the problem
in the code appears.
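The overfitting check can be sketched like this, with a deliberately tiny linear model and dataset, purely for illustration:

```python
import numpy as np

# Quick "can it overfit?" sanity check: if gradient descent cannot drive
# the loss on a handful of points to near zero, something in the model or
# training code is likely buggy.
X = np.array([[1., 0.], [0., 1.], [1., 1.]])
y = np.array([2., -1., 1.])  # generated by the true weights [2, -1]

theta = np.zeros(2)
lr = 0.1
for _ in range(500):
    grad = 2 * X.T @ (X @ theta - y) / len(y)  # gradient of mean squared error
    theta -= lr * grad

final_loss = np.mean((X @ theta - y) ** 2)
assert final_loss < 1e-12  # a tiny consistent dataset should be fit almost perfectly
```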
There are many cases where you
have optimization issues,
like the gradient descent
not working as you expect.
One idea is you look at
the learning curves.
How does your loss evolve
for training or the test set?
This is usually what you see.
If you don't see something
more or less like that,
you might have a problem.
It might be that you see your model
is never fitting, or
it's underfitting or
overfitting, all
these kinds of things.
You can monitor the gradients.
Do we propagate updates?
Do we update zeros?
Do we propagate zero
information about things?
That's probably indicating that
something is wrong with your model.
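Monitoring gradients can be as simple as checking the norm of each parameter group after a training step; a sketch with hypothetical gradient values, where the all-zeros entry is the kind of red flag to look for:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical gradients from one training step; a parameter group whose
# gradient is all zeros likely never receives any learning signal.
grads = {
    "embedding": np.zeros((10, 4)),
    "output_layer": rng.normal(size=(4, 2)),
}

# Flag any parameter group with a zero gradient norm.
suspicious = {name: float(np.linalg.norm(g)) == 0.0 for name, g in grads.items()}
```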
Of course, there are many
other ways to do things
like visualizing,
doing error analysis.
Someone once told me that
error analysis in machine learning is
what debugging is to
actual programming,
and this is, you look at the
predictions of your model
and you try to see
where does it fail.
Of course, there are many
other things you can do,
many things that are
project-specific,
but I thought that
since it's the first,
I think, deep learning
lecture you get as residents,
this is a good set of suggestions.
So with that, I think I
am mostly out of time.
There are certainly questions.
Feel free to ask. Jack?
>> So you have features on the nodes
and you come up with a
representation for them,
you just have [inaudible].
Do people try to find
representations for the edges?
>> Yes. There are
cases where you have
metadata in edges like a
number or something like that.
People do use this.
You're going to need to
modify the equations in
some form to get that,
but it's possible.
In most cases, in most of
the existing data we have,
the edges are just fixed there
or you have a 0-1 weight.
To which extent does
this edge exist or not?
>> Do you always need to
process the entire graph,
something like a social network?
Is it feasible to say
I have a focus node
and only process within a set
number of links of it?
>> It depends what you mean
by "need" in that sense.
In many cases, you don't need to,
but it's very convenient
because of the data processing,
GPUs, parallelization, all
these kinds of things.
If you want to summarize
something about the full graph,
then you definitely need to do this.
In some other cases,
you just need to focus on something.
So you can say that,
if we take the variable misuse
example, the bug thing,
you're interested in a
very specific place.
So you know that message propagation,
the last step far, far away,
won't affect your result.
You could have dropped
this computation.
Now, writing the code to do this
and doing this faster and doing
everything is a hard problem.
I don't think that someone
has achieved this.
>> So you said that the generalization
of convolutions and deep sets
is graph neural networks?
>> Yes.
>> They're special cases of the
graph neural networks?
>> [inaudible] yes.
>> So what's the message passing
[inaudible] and for instance
the difference in those cases?
>> Something has gone wrong.
So the convolutions thing.
So this is the CNN.
You multiply the state of each pixel,
let's say, with a kernel.
This is essentially the
kernel you would have,
and you update the
state of that node.
Forget max pooling and things
like that for a moment,
but you repeat this.
So the weights of the
edges that you use
is essentially your kernel.
Of course, I'm showing you this node,
but the same exact edges
would appear in this node,
in this node, in this
node, and so on.
I'm not sure I answered
all your question.
>> [inaudible] .
>> In the other example,
here, each image goes through a CNN,
so each image is a vector.
Now, this is the initial state
of each of those things.
The simplest graph neural network
is one without [inaudible] .
So you sum up for everything
or you do a min pooling,
or if you want to do
something more complex then,
it depends on the flavor of the GNN.
>> [inaudible] is what
defines graph analysis?
>> Yeah. So here, you can
say that for each image,
let's say, you have this
initial representation
and you get representations from
all the other images in the set,
and now, you update your
own representation as in,
what are my commonalities
as an image B
with everything else and the
same thing happens with A,
with C, with D, and you can keep
repeating this a few steps.
Then at the end of
the day, hopefully,
each image knows how it's similar
or how it relates with
every other image.
Then you need to summarize the
output of these four in this case,
or all your images which
[inaudible] average would increase,
for example, average of the vectors.
It depends and also, deep sets
have a different formulation.
So it depends on the paper
you look at, but yeah,
you can transform it;
for example, deep sets don't
explicitly say that, and
this is all attention and sums.
Cool. Thanks for coming.
