- Well, my name is Will
I work on the Neo4j
Developer Relations team.
And today I'm going to be
talking a little bit about
graph data science with
Neo4j graph algorithms.
So here's the agenda for
what we're gonna talk about
in the next 30 minutes or so.
First we'll talk a little bit about
what is graph data science,
what are graph algorithms in that context,
we'll talk a bit about the Neo4j
graph data science library.
which is the library of algorithms that lets us run these in Neo4j.
And then hopefully if we have enough time,
we'll spend a fair bit
of the session hands-on
looking at using this
graph data science
library in Neo4j browser.
We'll look at how we can use
the results of these algorithms
in graph visualization with Neo4j Bloom,
sort of as a way to interpret the results
of these algorithms that we run.
And then we'll take a
look at a no-code approach
to graph data science
with the graph algorithms
playground, in Neo4j desktop.
And then we'll talk a little bit about
some resources for getting started.
Cool, so let's jump into it.
So we're talking about graph data science.
So let's back up a step and
understand what we're talking about
when we talk about data science.
So this image is inspired by
the data science Venn diagram
that Drew Conway I think first created.
Probably at least 10 or 15 years ago.
And his point that he
was trying to get across
is that data science is really
this interdisciplinary field
that combines things like
he called it hacking.
But basically being able to write code,
combining that with math and statistics
and then with a healthy
amount of domain knowledge,
for sort of whatever problems
you're trying to solve.
And all of these things together
the intersection of these
three things together
is what data science is all about.
And ultimately we're talking about
answering questions using data.
So then when we're talking
about graph data science,
this is now a subset of data science.
So we're answering questions with data
but specifically we're
looking at connected data.
So this is when we're using
the relationships in our data
in addition to just sort
of discrete data points
to answer questions.
That's fundamentally what
graph data science is all about
and just like data science in general,
it's multidisciplinary and multifunctional as well, right.
So we use different tools together.
We use graph algorithms,
queries, graph visualization.
We use all of these things together
to be able to answer questions
with relationships in our data.
So if you're familiar with
graph databases and Neo4j,
you may realize that there's
really this spectrum between
what we think of as local graph queries
or local traversals and
global graph algorithms.
So when we're writing a Cypher query that says,
find this pattern where
it exists in the graph,
maybe I'm looking for
a movie recommendation.
So I'm going to start from a user node
and traverse out to the
movies that they've rated
from those movies traverse
out to other users
that have rated those same movies.
And then traverse out to find what other movies those users, who have rated the same movies as me, have also rated highly. So this is a standard sort of collaborative filtering recommendation query.
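A minimal sketch of that kind of collaborative filtering query in Cypher, assuming a hypothetical User/Movie schema with RATED relationships (the labels, property names, and the $userId parameter here are illustrative, not from the dataset used later in the talk):

```cypher
// Hypothetical schema: (User)-[:RATED {rating}]->(Movie)
MATCH (me:User {id: $userId})-[:RATED]->(m:Movie)
      <-[:RATED]-(other:User)-[r:RATED]->(rec:Movie)
WHERE NOT (me)-[:RATED]->(rec)
RETURN rec.title AS recommendation,
       count(*) AS sharedRaters,        // how many similar users rated it
       avg(r.rating) AS avgRating
ORDER BY sharedRaters DESC, avgRating DESC
LIMIT 10
```

Note the well-defined starting point (the anchored user node): that anchoring is what makes this a local traversal.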
If I'm doing that kind of query,
that is on the local end of the spectrum
where I have a well-defined starting point
I'm taking advantage
of index free adjacency
to traverse out from there.
And that's the kind of thing that graph databases are optimized for: they're really fast at these traversals.
Now on the other end of that spectrum
you have graph algorithms
that don't have a well-defined starting point that they traverse out from.
These are often iterative
global operations
that are touching every
piece of the graph.
These are things like running PageRank,
on the entire graph or
community detection,
these sorts of things.
And of course we can use these together.
Oftentimes the results of
our graph algorithms then
feed in to local graph queries.
So in our recommendation example,
often times it's useful
to calculate PageRank
or personalized PageRank in the context
of the user's network.
To be able to provide them
with recommendations influenced
by the more influential
nodes in our network.
So this is a spectrum, and these are things we can use together. But in this talk we're gonna be focusing more on the graph algorithms end of the spectrum.
So we're gonna be using the
Neo4j graph data science library today.
You may have previously seen
what was called Neo4j
graph algorithms library.
And the graph data science library is
really the next iteration of that library.
So the graph algorithms library
enabled running these global
graph operations on Neo4j
and the graph data science
library takes that concept
and sort of makes it enterprise grade,
more stable, more optimized.
So think of GDS as we call it
the graph data science library,
really as the next iteration
of the graph algorithms library.
It's a very similar concept, slightly different syntax, and optimized performance under the hood, essentially.
Cool, so there's this graph data science evolution, I guess, as one way to think of this. If we look back at the spectrum between local and global, we can think of knowledge graph queries as more on the local end of this, right.
So going back to our
movie recommendations,
if we have a graph of movies
and the actors and the
genres and things like that
this is the kind of decision-support, knowledge-graph type query we can use.
Now, then when we move to graph analytics
we're talking about looking at statistics
about the network, looking at
counting the degree of every node,
running graph algorithms, like
PageRank community detection
those sorts of things.
And beyond that, we have things like
graph feature engineering,
where we're using
graph algorithms to uncover
some feature that then maybe feeds into a
machine learning algorithm
for classification, maybe
an unsupervised algorithm
something like that.
And then forward looking beyond that,
we have things like graph embeddings
where we have some way to encode or represent a graph that can, again, feed into another algorithm.
You have things like deep
neural nets and so on.
But what we're gonna
be talking about today
is this graph analytics piece.
Where we're talking about using
graph algorithms with graph data sets.
Cool, so there are five types of algorithms available to us in the graph data science library.
And those are: pathfinding and search, centrality (or importance), community detection, link prediction, and similarity.
So pathfinding allows us to find, for example, the shortest path between two points. We may want to take weights into account, right. So these are things like A* for routing or breadth-first search, these sorts of things.
Next category is centrality.
And these are algorithms
that allow us to find
the most important or the most
influential nodes in the network.
So things like PageRank and betweenness centrality.
The next group of algorithms
is community detection.
And here we're interested
in finding clusters
or partitions in the graph.
So, a way to sort of
group similar entities.
Maybe they're users who interact together frequently, maybe they're movies that end up belonging to the same genre, that sort of thing.
The next group of algorithms is link prediction, where we're trying to identify if there's a sort of hidden link between nodes that may not be represented explicitly in the graph.
And then the final category is similarity.
And similarity is interesting
because this is not really,
I should say most of these algorithms
are not really graph algorithms.
They're often based on set comparison.
So things like Jaccard similarity
or cosine similarity where
we're comparing two sets,
but they're really useful
in the context of a graph.
So here, for example, we may have run, let's say, a natural language processing pipeline on a bunch of documents, extracted the entities mentioned in these documents, and represented that in a graph. And then we want to calculate the similarity between these document nodes.
And we can do that by comparing
the overlapping entities
mentioned in the documents.
This would be a way that we could use Jaccard similarity, for example, or things like cosine similarity, where we're comparing maybe the ratings of movies for users in a user-rated-movie graph, something like this.
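As a rough sketch of how that document-to-document comparison might look with the library's Jaccard-based node similarity algorithm (the graph name, labels, and MENTIONS relationship here are hypothetical, and exact procedure names can vary by GDS version):

```cypher
// Hypothetical projection: Document and Entity nodes connected by MENTIONS
CALL gds.graph.create('doc-entities', ['Document', 'Entity'], 'MENTIONS');

// Node similarity compares each document's set of mentioned entities (Jaccard)
CALL gds.nodeSimilarity.stream('doc-entities')
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).title AS doc1,
       gds.util.asNode(node2).title AS doc2,
       similarity
ORDER BY similarity DESC
LIMIT 10
```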
So those are the general categories.
Here's a glance at the
different algorithms
that are available in the
graph data science library,
for these different categories.
I think there are maybe 45 or so of these.
And you'll see some of these are in bold; these are the ones that are the most optimized and that are at the product tier of support. The others may be more experimental or not performance-optimized implementations.
But they're still available
to us nonetheless.
Cool, so how do we actually
use these algorithms with the
graph data science library?
So the graph data science library
gives us procedures that we
can then call within Cypher.
And we'll see hands on how
to use this in a minute.
But I just wanna sort of show the syntax
for these procedure calls.
So the basic syntax is
we use the call keyword.
This indicates that we're calling some procedure. And we say CALL gds; gds for graph data science, that's the namespace for these.
Then we have an optional tier.
So I said earlier that we have
some algorithms that are the product tier.
So these are the most
performance optimized.
These are gonna be in the
product tier namespace.
Then we reference the
algorithm that we want to run.
So this might be like gds.pageRank.
Then we have the execution mode.
So there are three basic execution modes.
We can run the algorithm and
then stream the results back,
without sort of updating
the data in the database.
And that's the stream execution mode.
We can run the algorithm in the write execution mode, where we're running the algorithm and then writing the results back to the database,
either as a node property
or some of the algorithms such
as the similarity algorithms
will create a relationship between nodes,
but either way in that mode,
we're updating the graph.
And then we can also
run this in stats mode,
which will give us some statistics
about the results of the algorithm.
Such as, how long it took to run,
the min and the max, this sort of thing.
And then we can also just estimate the memory requirements for the algorithm. So if we don't actually want to run the algorithm, and instead just want to make sure that our machine is configured appropriately and that we have enough resources to run a specific algorithm with a specific configuration, we can add on this optional estimate mode at the end.
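Putting that together, the general shape of these calls looks something like this (PageRank and the graph name 'my-graph' are just placeholders; exact config keys depend on the algorithm and the GDS version):

```cypher
// General shape: CALL gds[.<tier>].<algorithm>.<mode>(<graphName>, <config>)

// stream: return results without touching the database
CALL gds.pageRank.stream('my-graph') YIELD nodeId, score;

// write: run the algorithm and store results as a node property
CALL gds.pageRank.write('my-graph', {writeProperty: 'pagerank'});

// stats: summary statistics only (run time, min/max, etc.)
CALL gds.pageRank.stats('my-graph');

// estimate: check memory requirements without running the algorithm
CALL gds.pageRank.write.estimate('my-graph', {writeProperty: 'pagerank'});
```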
Now, what are we passing
in to this procedure call?
Well, we're passing in a graph name. These procedures basically allow us to construct an in-memory graph that we're actually using to run these algorithms. So we can name the graph, which can be sort of a subgraph or a projection that we can define with Cypher.
So the graph that we're
running the algorithm on
doesn't actually have to
exist in the database.
So this is a really useful
feature of the GDS library.
We'll see an example of in a minute.
So basically we build
up this in-memory graph
and then we pass along some configuration.
And the configuration here
depends on the algorithm: whether we want the weighted version of label propagation, for example, or specifying source nodes for personalized PageRank, something like that. So it depends on the algorithm we're running what the configuration is going to be.
Okay, so here are the different tiers.
So we have the product-supported tier, and in that case we just leave the tier off; those are just in the gds namespace.
Then we have the beta tier.
These are algorithms that are, I guess, sort of candidates that will be moved into the product-supported tier eventually.
And then we have the alpha tier.
which are the more experimental, less optimized algorithm implementations.
So use these ones sort
of at your own risk.
Cool so how can you make use of this?
Well, first of all
if you're using Neo4j desktop,
the graph data science library is available in the plugins within Neo4j desktop.
So you can just click to add the plugin for a database and you'll see it available. Note that you cannot use both the graph data science library and the Neo4j graph algorithms library together; they're not compatible. So be sure to only install one of those.
You can also use graph data science with the graph app that's called the graph algorithms playground, in Neo4j desktop. We'll take a look at this in a minute.
And then you can also make use of the results of some of these algorithms using Bloom, which we'll also take a look at in a minute. And Bloom is now included in Neo4j desktop by default, so you shouldn't need to do anything to install Bloom now.
If you have the latest version of desktop,
you can just start it as a graph app.
And then if you're not using desktop,
you can download the
graph data science library directly
from neo4j.com/download-center.
There's also a link here
to the documentation.
Cool, so what I wanna do now
is let's jump over to Neo4j sandbox.
So Neo4j sandbox, which we saw earlier, allows us to spin up a Neo4j instance that's hosted for us in the cloud with some preloaded data.
And also includes the
graph data science library,
so we can run these algorithms.
So I had just started this
graph data science project.
So if you click on a new project,
you'll see lots of these
different ones available.
This specifically is graph data science.
And I just clicked on open with browser, and that brings me here into Neo4j Browser.
So the graph data science sandbox,
this includes a data set
that's been extracted
from the George R.R. Martin
Game of Thrones series.
Basically it is looking at character interactions: where we have two names that appear within 15 words of each other, we consider that an interaction.
There's also some more data.
It actually combines a
few different data sets.
Let's take a look; we'll call db.schema.visualization.
And this is a bit difficult
to look at at first
because we have lots of
multiple labels here.
So let's remove
King, knight and dead
from our schema visualization.
Basically, those are additional labels that are added to Person in this case. So we have things like the battles that different characters were in, the house they belong to, which book they appear in, and things like their status; I think this is whether they're still alive or dead at various points in the books.
But then we also have these
person to person
interaction relationships.
And you can see we have INTERACTS1, INTERACTS2, INTERACTS, and so on. And the reason for that is we have different measures of interactions throughout the different books.
But basically that's what
we're going to be looking at.
These person-interacts-with-person pieces of the graph.
So the first thing that I want to do
is create one of these in memory graphs.
And to do this, I say CALL gds.graph.create. We give it a name, got-interactions, and then we give it all of the nodes we want to consider (in this case, the Person nodes), and then we give it the relationships we wanna include in this in-memory graph. I want the INTERACTS relationships, following both directions.
So I'll run this and this will load.
Now we have this GOT interactions graph
that we can refer to later on
when we're actually
running our algorithms.
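That graph creation call looks roughly like this (GDS 1.x syntax; the graph name matches what's used in the talk):

```cypher
// Build a named in-memory graph of Person nodes and INTERACTS relationships,
// treating the relationships as undirected (both directions)
CALL gds.graph.create(
  'got-interactions',
  'Person',
  {INTERACTS: {orientation: 'UNDIRECTED'}}
)
```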
And the first algorithm that
I want to run is PageRank.
So we'll say CALL gds.pageRank.stream; this is going to be the streaming variant of it. We pass in the graph, got-interactions, that we constructed previously.
Then this call is going to
yield a node ID and a score.
So to get back the properties
from the actual nodes,
in this case, we wanna know
the name of this person.
We can use this gds.util.asNode function,
passing in the node ID.
So this bit here is going
to give us the actual node
that we just ran a PageRank on,
and we're grabbing the name
property, and the score.
So this is the PageRank value, and ordering the results by the score should give us the top 10 characters by PageRank.
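The full streaming query looks something like this:

```cypher
// Stream PageRank scores and look up each node's name property
CALL gds.pageRank.stream('got-interactions')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
LIMIT 10
```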
We have Jon Snow, Tyrion Lannister, Cersei. So if we're familiar with the books, that sort of makes sense.
It's also interesting to compare PageRank
to other sorts of centrality measures.
So here we're running PageRank
we're getting back the nodes.
Then here we're calculating
degree and weighted degree.
So degree, this is just the number
of interacts relationships
that each character has.
And then weighted degree
is summing up the weight.
So for each character, for all of their INTERACTS relationships, we have a weight property, and for any given pair of characters this weight is the number of interactions between those two characters throughout all of the books.
Now if we add that up,
then we get a metric called
the weighted degree centrality.
So here we're comparing, and we've ordered by PageRank. So Jon Snow has the highest PageRank score, but note that he does not have the highest degree centrality, or even the highest weighted degree centrality.
So PageRank is giving us a bit different information, because it's this sort of recursive algorithm. We're not just simply looking at degree or weighted degree, which may or may not find the most influential node, depending on how we're defining that.
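One way to sketch that comparison is to stream PageRank and compute degree and weighted degree in the same query (an approximation of what's shown on screen, not necessarily the exact query from the talk):

```cypher
CALL gds.pageRank.stream('got-interactions')
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS p, score
MATCH (p)-[r:INTERACTS]-()
RETURN p.name AS name,
       score AS pagerank,
       count(r) AS degree,              // number of INTERACTS relationships
       sum(r.weight) AS weightedDegree  // total interactions across the books
ORDER BY pagerank DESC
LIMIT 10
```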
Cool, so far we haven't updated the graph.
We've just streamed the results back.
We can now write the PageRank values back to the graph using gds.pageRank.write, and we specify the writeProperty. So this means that we'll calculate PageRank and then we will add a property on the nodes called pagerank. So if we want to now find all of the Person nodes, we can just return p.name and p.pagerank.
So now we can see a person's name and their PageRank score, which is stored in the graph.
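Sketched out, the write call and the follow-up read might look like:

```cypher
// Run PageRank and store the score on each node
CALL gds.pageRank.write('got-interactions', {writeProperty: 'pagerank'});

// Then read it back with plain Cypher
MATCH (p:Person)
RETURN p.name, p.pagerank
ORDER BY p.pagerank DESC
LIMIT 10
```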
Let's run one more algorithm
and then we'll take a look at how
to interpret these results
in a visualization.
Let's take a look at a
community detection algorithm
in this case, we'll use label propagation
which can take into account weights.
So we need to create a
weighted in-memory graph.
The previous one we created
did not specify the properties to include.
So that was an unweighted graph.
So now we're going to
create a new in-memory graph
called GOT interactions weighted.
We want this to look at the Person nodes, following the INTERACTS relationships undirected, but now including the weight property.
So if we run that, we now have
this GOT interactions weighted graph
that we can then pass in to
our label propagation algorithm.
Specifying that we want to consider weights for this algorithm, and then write a property back called community. So this algorithm will assign partitions, or communities, to each node and write that as a property.
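The weighted projection and the label propagation call might look roughly like this (graph name as in the talk; config keys per GDS 1.x):

```cypher
// In-memory graph that also carries the weight property
CALL gds.graph.create(
  'got-interactions-weighted',
  'Person',
  {INTERACTS: {orientation: 'UNDIRECTED', properties: 'weight'}}
);

// Weighted label propagation, writing the partition id to a community property
CALL gds.labelPropagation.write('got-interactions-weighted', {
  relationshipWeightProperty: 'weight',
  writeProperty: 'community'
})
```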
So we can do something like this: match all the persons, get the community property as community, and group by community, counting the persons as the number in the community. And now we can look at each community and the number of nodes assigned to it.
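That grouping query is plain Cypher, something like:

```cypher
// Count how many Person nodes were assigned to each community
MATCH (p:Person)
WITH p.community AS community, count(*) AS members
RETURN community, members
ORDER BY members DESC
LIMIT 50
```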
Let's just look at the top 50 of these; not sure how many we actually generated. Okay.
And we can see here the
sort of distribution
of the size of the number of person nodes
assigned to each community.
So we have community 304 with almost 400 person nodes, then the next has around 100, and so on.
So we have really just a handful
of larger medium sized communities.
Okay, so we've run our algorithms.
We can see how we can kind of query
some of these results using Cypher. But let's take a look now at Neo4j Bloom
to see how we can visualize
some of the results of these algorithms.
So, sandbox includes Neo4j Bloom.
So first I'm just gonna copy my password.
And then we'll open up Bloom here. So Bloom allows us to explore Neo4j graphs visually. And the first thing we'll do is sign in: user neo4j, and that's the password I copied from sandbox.
So Bloom allows us to sort of explore the graph without having to write Cypher.
And it's configured by
what's called a perspective.
So a perspective can define
the type of data that
we want to visualize.
It can give us some styling
for different pieces of data.
So, we loaded that perspective.
That's gonna give us some
styling for our graph.
So the first thing we can do
is let's just search for person nodes.
I'm just gonna zoom out a little bit here, so we can see a bit more. Cool.
So the first thing we notice is that we have 1000 person nodes that have been rendered here, and they are different sizes.
So let's click on one
of the bigger ones here.
And we can see this node
represents Tyrion Lannister.
We can see his PageRank score.
We can see which community he has been assigned to. He has one of the higher PageRank scores, and he's showing up larger in the visualization,
which is great because we can look at this
and very quickly identify
the most important nodes in the network.
And in Bloom, we have what's
called range based styling.
So if we look at the configuration
in the perspective for person nodes
we see that we have this
range based style rule
that says, size nodes relative
to the value of PageRank score.
So where the value is
higher, make nodes bigger.
So that I can quickly see
the more important nodes in
the network, which is great.
And we also notice, though, these different colors.
So if we go back to our person
rule-based styles, we can
see that we have a rule
that is setting the color
based on the community value.
So for all of the nodes that
are in the same community,
they're being assigned to the same color.
So we can quickly look at the graph,
to sort of make sense of which nodes
belong to the same community.
Let's clear the scene here, and now I'm going to run this top characters and connections search phrase.
So bloom allows us to define
what are called search phrases,
which are essentially
parameterized Cypher queries that we've defined within the perspective. I had a typo in there.
There we go, top
characters and connections.
So this is running a Cypher query
that we've defined in the perspective,
which basically says,
find the top person nodes by PageRank
and then find what characters
they've interacted with.
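A search phrase like that could be backed by a Cypher query along these lines (a sketch, not the exact query from the perspective):

```cypher
// Top 10 characters by the pagerank property we wrote earlier,
// plus the characters they've interacted with
MATCH (p:Person)
WITH p ORDER BY p.pagerank DESC LIMIT 10
MATCH path = (p)-[:INTERACTS]-(:Person)
RETURN path
```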
And if we zoom in on this, we can see that not only are the nodes styled by size and color from the results of PageRank and label propagation, but we also have some relationships that are thicker than others.
And this is because, if we go to the relationships shown in the scene, in the perspective configuration we have a rule-based style for relationships that says, basically, style them relative to the weight property.
So where the weight property is larger, make those relationships thicker.
Which we can sort of verify: let's select this node. So this is Robb Stark.
If we look at his relationships, and filter for only the INTERACTS relationships,
we can see (there you go) that we have different weights assigned to these different interactions, and that here, the interaction between Robb Stark and Theon Greyjoy has a much higher weight. So we're seeing a thicker relationship there.
So it's just useful to sort of look at the graph at a glance and be able to interpret some of the results of these algorithms.
So we have a few minutes left.
There's one more thing
that I want to look at.
And that is Neo4j desktop.
Here I have the same data set loaded in Neo4j desktop, and I've installed the graph data science playground graph app.
So let's start that up.
The graph data science playground, or you may also have heard this referred to as NEuler.
This allows us to run graph algorithms
using the graph data science library
but without writing any Cypher statements.
Instead we have this nice
sort of query builder.
So here we want to run, let's say, betweenness centrality on Person nodes following the INTERACTS relationship, and let's write it back to the graph.
We can run this and we get the results.
So here are the betweenness centrality scores. Number one is Jon Snow, followed by Ned Stark.
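Under the hood, the playground builds up a procedure call roughly like this (betweenness centrality sat in the alpha tier in early GDS releases, so the exact namespace depends on your version; this uses the anonymous-projection form):

```cypher
// Run betweenness centrality over Person/INTERACTS and write the score back
CALL gds.alpha.betweenness.write({
  nodeProjection: 'Person',
  relationshipProjection: {INTERACTS: {orientation: 'UNDIRECTED'}},
  writeProperty: 'betweenness'
})
```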
We can also look at a chart visualization to very quickly see the results here.
And we can also look at
graph visualizations.
So here, we're looking
at a section of the graph
where we've styled the graph according to some of the algorithms that we've run. So in this case, the nodes are sized and styled according to betweenness centrality.
So this is really useful if we're not sure about the configuration options for different algorithms, or if we're not sure which algorithms we want to run.
This can be really useful
for just sort of building up
experimentally, some
of the procedure calls
for the algorithms, with their
different configurations.
If we look in the code tab, we can see what Cypher queries we're actually building up, with the configuration for these different algorithms.
And we can even generate a Neo4j Browser guide.
This is useful if we're sort of working with a coworker and we wanna send them just a quick link that has all of the configuration embedded for a certain algorithm that we've created.
Cool, so that is the graph data science playground.
I just want to end on a couple of slides
pointing out some resources here.
So the first thing
I wanna mention is
how you install the graph algorithms playground. So if you go to install.graphapp.io,
You can see lots of different graph apps
that are available to
add into Neo4j desktop.
Here is the graph algorithms playground.
We just click this install link
and that will deep link into
Neo4j desktop to install that.
Here's a list of some other resources
and I'll drop a link to the slides
in the chat here and
we'll share the slides.
So you have the links to all of these.
But the Download Center that we talked about and the graph data science sandbox are on here, and the sort of authoritative resource for all of this is the O'Reilly Graph Algorithms book, which you can download for free.
