>> Okay. So today, we have
the pleasure to
have Will Guss here,
speaking to us about his work
toward Neural Homology Theory.
Will is a PhD candidate in
the machine learning department
at CMU and co-founder of Infoplay AI. He is advised by Dr. Ruslan Salakhutdinov, and his research spans reinforcement learning, natural language processing, and deep learning theory, particularly at the intersection of algebraic topology, computational geometry, and learning theory.
Will completed his Bachelor's in pure mathematics at
UC Berkeley where
he was awarded the
Regents' and
Chancellor's Scholarship.
During his time at Berkeley,
William received
an Amazon Alexa prize grant
for the development
of conversational AI
and co-founded ML at Berkeley.
So Will, take it away.
>> Cool. Yeah, so today,
I'm going to talk to you
about a little bit of work on algebraic topology in relation to neural network theory,
and the way I want
to motivate this is
with the problem of
architecture selection.
So in the past couple of years, deep learning has been really successful, in large part because we've eliminated feature engineering, right? We're not using HOG filters anymore. And as a result, the research focus has really shifted from finding the best features to finding the best architectures.
Like, given two architectures, how do you compare them: which is going to perform better on your test set?
And a large part of
this is really focusing
on how you build inductive bias
into architectures.
So the choice of convolutional or just fully connected architectures, residual connections, etc.
This is really important.
And in computer vision, for example, a large body of work focuses on improving the initial architectural choices of even AlexNet.
And if you start looking
at what architectures look
like in different domains
like Natural
Language Processing,
you start getting
these very hyper-specialized
architectures.
And these are extremely
successful but it sort
of leaves a question,
what is it about the data
that leads to
these architectures?
So there's this issue
that if we keep
approaching it
this way by imparting
inductive bias on
these datasets,
it doesn't allow
deep learning to
scale without expert-level
architecture designers.
So if you want to implement automatic deep learning, et cetera, using this approach: I mean, it's been very successful, but it doesn't really scale.
And there are several
solutions to this, right?
So there's this whole body of work on neural architecture search. There's Bayesian hyperparameter optimization, and it works really well; recently it's been set up extremely well. But the main thing is that it doesn't capture what seems, this year at NIPS, to be the best tool in the deep learning practitioner's toolbox, which is inductive bias.
These methods are sort
of blind to the data.
And it's not really
clear what features of
the data should influence
our architecture selection.
If you look at neural architecture search in general, like the algorithms by Zoph and Le, you kind of see that they're using this reward signal, how well the neural network does on the dataset, as a sort of implicit representation of what the data looks like.
But we think we can do better than this: not just improving architecture selection, but understanding what it is about the data that makes an architecture so successful.
So there's a partial
solution here
and it says use
Neural expressivity theory.
So just in general as a primer,
basically neural expressivity
theory tries to relate
the properties of neural
networks like how many layers,
the width of
the neural networks, to
basically the capacity of
that neural network
to generalize
and express certain functions.
You'll get results like: shallow networks need exponentially many more units than deep neural networks to express certain polynomial functions, etc.
But there's a caveat: if you're trying to use this for architecture selection, you need to know what polynomial function your data expresses, you need to know the curvature of your data. All these things are super important, but you can't get them efficiently without sort of solving the problem in the first place.
And so we're going
to propose a sort
of different lens something
that's more computational here.
So the statement of
architecture selection
is you have this sort of
data set or learning problem,
and you're trying to ask which architectures are suitably regularized and expressive enough to learn and generalize on that problem.
So data-first architecture selection says: if you can come up with some computable measure of data complexity, and then characterize neural networks by their expressivity, their ability to learn subject to that complexity, then you can sort of seed architecture selection.
So one particular example of this: take these two datasets here, right? We've got D1 on the left and D2 on the right. D1 is just two classes; I've omitted the second class, but it is the complement of the first. And it's got two balls, two separate connected components, with uniform sampling on those balls. And the second dataset's got two balls and two rings, the same sort of setup.
Now, if you train a twelve-hidden-unit, single-hidden-layer neural network on this dataset, and a 26-unit one, they both learn; they both at least capture these connected components. But if you try on D2, no matter what initialization of this 12-unit neural network you give, it'll never express these rings perfectly. There will always be these adversarial sets right in the middle.
And this is suggesting something: if you can capture that D2 is somehow more complex than D1, and sort of characterize architectures by this complexity, then if you could compute this complexity, choosing an architecture, or a set of candidate architectures for neural architecture search or Bayesian hyperparameter optimization, would be as simple as computing this sort of prior, this notion of complexity of the data, and imputing it into the algorithm.
So to give you a more visual representation of what's going on here: you have a two-unit neural network, a four-unit, an eight-unit, and a 16-unit on the same dataset. And if you look at the best case, if you sample 100,000 of these neural networks and look at the best sort of convergence, you can see the two-unit neural network, as you'd expect, only captures the one component.
These red things are just a bunch of data points; there are so many data points here that it looks solid, but if you zoom in close enough, it's just data points, and the complement. This is one class, and the other class is sort of omitted here, but it's everything else: whatever's inside of that red ring right there, and outside of that red ring, et cetera.
And if you look at this: no matter how you initialize this four-hidden-unit neural network, it's never going to capture that ring there. But if you go down to the sixteen-hidden-unit neural network, it captures it. And in the eight-unit one, it doesn't actually capture the ring, so there are adversarial examples here. This happens in different orientations, but it's a phenomenon we observed.
Certain units... sorry. Yeah, so I'm just giving you an example with one-layer neural networks, but in the work here, we're going to develop this notion of data-first architecture selection for multi-layer, fully connected neural networks.
So that's sort of the picture
you should have
in your head here
but how did we develop
this notion of geometric
complexity in the data.
And so, what we're
going to use here is
this cool tool
called "Topology".
So the main tool of
Topology is sort of
these continuous maps.
And the classic example
I can give is
sort of what defines
equivalency in Topology.
So in Topology,
you have two sets,
geometric shapes for example,
and you kind of say
these two shapes are
equivalent if you can sort
of continuously deform
one shape into another.
So the classic example here is the doughnut and the coffee mug, and there's a sort of deformation between them. I'm being a little loose here with what it means to be equivalent in topology, but this captures the same notion.
So there are certain
invariants that are
captured like
the number of holes,
the number of connected
components, et cetera.
And it can differentiate sets (and we're going to get to datasets here in a moment); it can differentiate them in a meaningful way that discards certain information that might be even harder to compute.
So what does it mean to have... what is the rotation of MNIST, for example? What is the translation of MNIST relative to the origin? All these things that are other interesting geometric properties, topology discards.
But we find if we use
this geometric notion of
complexity that it
actually kind of
characterizes and
delineates neural networks
in a very, very powerful way.
So just going back to the original example: D1 is not topologically equivalent to D2, because you've got two clusters here and four clusters here, and at some point you'd have to split, you'd have to make a tear in this cluster, to produce at least one of these other four clusters. So in a sense, these datasets are unequal.
And the next thing
is so okay, we've
got a notion of how to separate
datasets topologically
roughly. Yeah.
>> Does topology have to do with solid objects and things?
>> I am assuming that we're talking about the distributions here. Like, if we looked at the support of the distribution up to a certain level set, then you could say that these datasets are different. But we'll get to that; you've heard of persistent homology, I think, from a previous talk, but we'll get into how that plays in here. Roughly speaking, the true distributions of these datasets are different.
So the next tool we're going to use is some way to actually compare the complexity of a dataset. There's this tool called algebraic topology, which roughly takes these topological spaces and converts them to algebraic structures: groups, rings, and other discrete algebraic structures. And the structures we're going to use, we'll boil down to a set of integers which we can compare, which will show that certain datasets are more complex than one another.
So just to give you some background here (I'm going to be very rough from a topological perspective): we're going to find something called homology, and it's roughly going to count the number of holes. The notation we use is Z to the power 1, 2, 3, 4, 5, whatever it is, where the power of the integers is the number of holes in dimension n. To give you just a rough example of this: the number of zero-dimensional holes, with the caveat that this means the number of connected components; the number of one-dimensional holes, which is like the number of rings; and the number of two-dimensional holes, which is the number of cavities. If we look at D1 here, we've got two different connected components, and homology would give you that; it would also give you that you have zero one-dimensional holes and zero two-dimensional holes.
If you look at the second dataset here, homology is going to give you that you have four separated clusters, and then in two of these clusters you have holes, right? So it'll be two one-dimensional holes. And right off the bat, if we apply some reasonable partial ordering to this homology, we could say that D1 is less complex than D2, and this is what we kind of saw. So first of all, D1 is less complex than D2 topologically, and also in terms of the neural networks that can train and actually express the decision boundary properly on these datasets: the simpler dataset is easy for 12 hidden units, and the second dataset is impossible for 12 units.
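To make the notation concrete, this is roughly how the homology of the two toy datasets would be written (a hedged summary, assuming D1 and D2 as drawn):

```latex
% D1: two disks -> two connected components, no rings
H_0(D_1) \cong \mathbb{Z}^{2}, \qquad H_1(D_1) = 0
% D2: two disks and two rings -> four components, two 1D holes
H_0(D_2) \cong \mathbb{Z}^{4}, \qquad H_1(D_2) \cong \mathbb{Z}^{2}
```

Any reasonable partial order on these groups (compare the powers degree by degree) puts D1 below D2.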
So this is basically
the basis of our talk.
We're going to use
this tool called
Homology here and we're going
to characterize the capacity
of neural networks.
And just as a brief aside: this homology tool doesn't capture everything about topology, but if the spaces are equivalent, then you get this equivalence of their homologies. Basically, the powers are the same if the spaces can be smoothly deformed into one another.
Anybody have
any questions at this
point just as a basic primer?
>> This is a high
level question.
It seems like the difficulty in homological space may not correspond to the difficulty in neural network learning space, for a two-connected-component space or whatever.
>> Right, so here's the caveat. Data-first architecture selection says: take some description of your data and let that guide neural architecture search, right? So you don't necessarily use all of your inductive bias on the data in deciding the architecture.
And the same thing holds here. We could just as easily come up with another measure, say the curvature of your dataset, and characterize things that way: if I have a dataset that has variation in its curvature, if I kind of walk around the dataset and there are a lot of spikes and not a lot of regularity, do I need fifty hidden units, or five hundred hidden units, or thirty layers? You can do the same thing there. I think that would be a great piece of follow-up work.
But we're just starting with topology, because it's something that you can approach computationally, and you can also visualize very easily. So, yeah.
>> So I have
a doubt about starting
with a topological space.
I'm thinking about situations
where I have to do
some text processing.
I'm thinking: do I want to treat words as atomic units, or do I want to do a character-level thing, or something of that sort, or try to do something morphological?
And, to me, it feels
like these are not
even the same topological spaces
and so I don't even know
whether I understand.
>> So, what's it like comparing, say, a character-level representation of NLP data versus a word-level one? Right, well, I mean,
one thing you could argue here is that, if you do have a continuous mapping from one to the other, then, depending on what that mapping looks like, some of the topological information is being preserved. One thing with continuous mappings is that if you have a space with a certain number of connected components and you push it through another continuous mapping, you basically do an embedding, or some sort of projection onto, let's say, the word-level representation, and the number of connected components is preserved.
So that might be one way to look at this, I think: sure, these different representations might have different topologies, but if you're kind of continuously going between them, then this notion is still a little bit nice. But what is entirely unclear to me is, for language data, for example, what does this space even look like? Is it a trajectory? I mean, you could imagine it just being a trajectory, and then you're looking at the topology of all the trajectories.
You could sort of come up with a high-level embedding and look at that topology. It's unclear there, and I think there's a lot of work that needs to be done to flesh that out, but at least starting out with this nice real-valued data is our first little test case here.
There's something interesting going on, and you'll see in the empirical study we'll get to in a moment that there's actually a clear delineation. I mean, sure, curvature could play a role, but if you have these datasets of increasing topological complexity, there are neural networks which can and cannot express those topologies.
>> Yeah, this is very good for, like, images, which are kind of a continuous dataset, but for language, the data structure, I guess, is more like a tree-structured topology. In fact, if you only collect and cancel all these holes on a collective basis, then you have difficulty dealing with these trees.
>> Right. Yes, I mean, I don't know; I don't have a solution here. It's entirely unclear how you approach this in an NLP setting, but I do have some sort of faith, because you kind of get these nice, continuous word embeddings. Although the data is discrete, you do get continuous word embeddings, so I kind of have faith that maybe it's not captured by this continuous notion of topology, but there could be approximations. I mean, that's an open question.
You should write a paper on it.
But there is some work, actually, on capturing the persistent homology of NLP data; this came out of CMU, I think, about a year and a half ago, and I'm happy to talk offline about this.
So, yes.
>> So, for the component picture, H(X) equal to H(Y) implies the same components?
>> No.
Let's think about
it this way, right?
You have a topological space,
and persist or
Homology is a sort of
flashlight you cast on
the space, and you
see a certain shadow.
But you could take
a flashlight and move it to
a different angle and
capture different shadow.
So just because
the shadows are the same,
doesn't mean the spaces
are the same, if
that makes any sense.
It's kind of like a course lens
with which we look at topology.
Okay, so we've got this notion of dataset complexity. And now we want to ask: okay, well, what sort of neural networks can express certain topologies? And their decision regions matter; we're just looking at the neural networks now, not the datasets.
So, the first thing that should be apparent is that homology is actually a rather stringent measure for characterizing neural architectures, or topology in general. So if you've got some sort of topological manifold, let's just say this is where your data distribution lives, and for any function we denote the homology of the support, the number of holes in the support of that function, by this H sub s thing, then there's this really natural notion: if none of the functions in your function class express in their support the topology of your space, if they can't express the same number of holes, then there's always going to be some sort of adversarial set. And this is kind of a trivial result, but it tells you that this provides a minimality condition. So if there are, like, 10 holes in my dataset and my function class can never express 10 holes, then there will be an adversarial set. So this is just-
>> By support, you mean where f of x is zero, or not equal to zero?
>> f of x greater than zero; just think of a positive-valued f right now.
>> Just the superlevel set.
>> Yeah.
>> Support usually isn't meant-
>> It should be anything that's nonzero. Yeah, exactly. That's why I've not actually written "support" here, but yeah.
But this is the condition we'll use to say: if I have a bunch of neural networks and none of them express the topology, if I have a dataset that truly has more holes than this neural network class, this architecture, can express regardless of initialization, then there will be adversarial sets.
Okay. So if we frame everything in the lens of topology, we can look at architecture selection in the following way: given some dataset, which architectures express function classes which then have support homologies, positive-support homologies, that are equivalent to that of the dataset? In other words, which architectures can express the same number of holes as our dataset?
This is the question we're going after. So we'll first do an empirical characterization here, right? The first thing we'll do is generate data of increasing homological complexity. We'll apply random perturbations to it, but it's a very simple procedure of generating datasets with an increasing number of holes, et cetera. And we generate a lot of these datasets.
Yes. So the red class is this cloud class, and the blue class is the negative class.
>> So, they're like
two-dimensional?
>> Yeah, and the reason we're going to do two-dimensional datasets is that it directly lets us compute the homology of the decision boundary for the neural network exactly. So, as opposed to persistent-homology approaches, where you're approximating the homology of the decision boundary, we can look at exactly what the homology is here, by essentially plastering a grid on this 2D space; you'll see in a moment how this works out. Yeah?
>> For example, in the last picture, how do you deal with the scatter plot? Does it seem to you that all the points other than the red ones are blue, or in fact-
>> Everything is blue here; I've added some alpha because there's a concentration around the red circle, and this is essentially just to help the optimization. So I'm not necessarily capturing which ones... like, let's say we have a natural dataset, right, with a certain distribution. You often won't find a huge concentration around one class, right? It'll be sort of dispersed. Hopefully, if you take a random sampling of the distribution, it will be sort of dispersed. But this is merely to help the optimization, so we start out with-
>> How about
this very empty place?
Do you think
this blue one is red?
>> This? We'll suppose
it's blue but it's really-
>> Why do you think it's blue?
>> These are synthetic datasets; we're designing the distribution to be this way. So sure, there aren't enough points sampled in this picture, but if you take, like, 100,000 points in this dataset, this region will be filled with blue points. We've designed the distribution, so we start with the distribution, then sample from it.
And the only reason we're doing this is that we're trying to say: okay, guide stochastic gradient descent, with a bunch of different neural networks, to at least express these regions nicely. And this isn't an exact characterization we're about to get here, but what it will do is upper bound the characterization in a nice way.
So we're going to generate a bunch of synthetic datasets with different homological complexities.
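As a rough illustration of what such a generator might look like (a minimal sketch in numpy; the helper names are hypothetical, and the paper's actual procedure may differ), here is one way to sample a 2D dataset whose positive class is a union of disks and annuli, giving a controllable number of connected components and one-dimensional holes:

```python
import numpy as np

def sample_disk(center, radius, n):
    """Uniformly sample n points from a disk (one connected component)."""
    theta = np.random.uniform(0, 2 * np.pi, n)
    r = radius * np.sqrt(np.random.uniform(0, 1, n))
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1) + center

def sample_annulus(center, r_in, r_out, n):
    """Uniformly sample n points from a ring, which adds a 1D hole."""
    theta = np.random.uniform(0, 2 * np.pi, n)
    r = np.sqrt(np.random.uniform(r_in ** 2, r_out ** 2, n))
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1) + center

# A D2-style positive class: two disks plus two annuli,
# so the support has H_0 = Z^4 and H_1 = Z^2.
positive = np.concatenate([
    sample_disk(np.array([-2.0, 0.0]), 0.5, 1000),
    sample_disk(np.array([2.0, 0.0]), 0.5, 1000),
    sample_annulus(np.array([0.0, 2.0]), 0.6, 1.0, 1000),
    sample_annulus(np.array([0.0, -2.0]), 0.6, 1.0, 1000),
])
```

Varying the number of disks and annuli sweeps out datasets of increasing homological complexity.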
And just at first glance, we're going to compute the persistent homology of the data. So this is the question of whether you can actually compute the homology of different datasets, which just basically look like many points. If you zoomed in on this dataset close enough, you'd find that there's a bunch of different connected components here; each point is its own separate connected component. So there really isn't any interesting topology here. But persistent homology gives you a tool, from a statistical perspective, to recover the homology of the original space.
So there's nice work here, and I'll kind of explain the algorithm; you may have heard about this before. Essentially, you take all the points in a class of your dataset, and you grow balls around these points. As these balls start to intersect, you form a graph structure, and in higher dimensions, you form a simplicial structure on the space. And at every single ball size, which we call epsilon, you compute the number of connected components. So in this example, as we grow epsilon more and more: at first these balls are completely disjoint, so there are as many connected components as there are balls in the dataset; eventually, there's a threshold point where some of them start to merge. And if we keep going as epsilon increases, some of them start forming these holes. So here's an example of a hole: at epsilon equals seven, a hole starts to emerge here. There's this one little pixel around which these data points are forming a hole, and if you keep increasing epsilon, you get this persistence diagram, which says: when does a connected component emerge, and when does it die, when is it merged with another connected component? And the same thing for these holes.
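For reference, this Vietoris-Rips construction is implemented in standard TDA libraries; here is a minimal sketch using the ripser package (one option among several, e.g. GUDHI would also work) on a point cloud X:

```python
import numpy as np
from ripser import ripser  # pip install ripser

# X: an (n_points, n_dims) array, e.g. one class of a dataset.
X = np.random.randn(200, 2)

# Persistent homology of the Vietoris-Rips filtration up to dimension 1:
# dgms[0] holds (birth, death) pairs for connected components (H_0),
# dgms[1] holds (birth, death) pairs for 1D holes (H_1).
dgms = ripser(X, maxdim=1)['dgms']

# Long-lived pairs are the features likely to belong to the underlying
# distribution rather than to sampling noise; 0.5 is an arbitrary cutoff here.
for dim, dgm in enumerate(dgms):
    lifetimes = dgm[:, 1] - dgm[:, 0]
    print(f"H_{dim}: {np.sum(lifetimes > 0.5)} features with lifetime > 0.5")
```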
So one question you might ask is: why is this even the right way to compute the homology of the space? And there are two points here. One is from a statistical perspective: there's actually a whole body of literature which describes bootstrapping methods that give confidence bands on the topology of the space. So you start with some distribution, you sample a bunch of points, and if you compute the persistent homology on randomized subsets of those points, you can build a confidence band on how probable it is that the distribution really expresses this homology.
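One hedged sketch of that subsampling idea (this follows the general bootstrap-for-persistence recipe from the statistical TDA literature; the exact procedure and constants here are illustrative, not necessarily the method the speaker is citing):

```python
import numpy as np
from ripser import ripser      # pip install ripser
from persim import bottleneck  # pip install persim

def significant_h1(X, n_boot=50, alpha=0.05):
    """Resample X, measure how much the H_1 diagram moves in bottleneck
    distance, and keep only features whose lifetime clears that band."""
    dgm = ripser(X, maxdim=1)['dgms'][1]
    dists = []
    for _ in range(n_boot):
        idx = np.random.choice(len(X), size=len(X), replace=True)
        dgm_b = ripser(X[idx], maxdim=1)['dgms'][1]
        dists.append(bottleneck(dgm, dgm_b))
    c = np.quantile(dists, 1 - alpha)            # (1 - alpha) confidence radius
    return dgm[(dgm[:, 1] - dgm[:, 0]) > 2 * c]  # holes outside the band
```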
The other thing here is that, for the purposes of architecture selection, we're actually not going to need an exact characterization of the space; we can even tolerate a pessimistic characterization. For example, let's say there's really only one hole in our space, and for some reason, as I grew balls on this thing, another hole emerged here that's not a real hole in the distribution. Well, you can sort of have a computational budget here.
So for architecture selection, say I have 700 GPUs, or I have 100 GPUs; the same thing holds here. If I'm willing to be a little bit more pessimistic about my dataset, and I have more computational resources, then as I grow these balls around these points, I can say: well, there may not be a hole here, but if there is, I'd be damned to choose an architecture that doesn't express it.
So for the purposes of architecture selection, persistent homology is actually sort of okay as an algorithm. Now, just as a sanity check for this empirical characterization: does real data have nontrivial homology?
So just to give you an example here, we take a UCI dataset. This is an eight-dimensional dataset; right here is a two-dimensional embedding of it, just for visualization, but we compute the homology on the dataset in eight dimensions. And if you do this, you can actually see that there are nontrivial connected components merging in the dataset, and nontrivial holes.
And here's a really good example. If you look at the cars class on CIFAR, and you produce a sort of lossless embedding of this cars class into nine-dimensional space (lossless is kind of a rough term here), you can actually see that, if you look at the persistence diagram for this thing, there are rings that emerge. And the intuition behind this is that if you look at the cars class, there are different orientations of the car, and so you should expect that there's a one-parameter family of these cars that forms a circle, a 1D circle.
So this is just sort of
the first inkling here.
Does anyone have
questions at this point?
This is just a sort
of sanity check and I
want to make sure we're
all on the same page.
>> [inaudible] but haven't you embedded the [inaudible] into a nine-dimensional space?
>> So I've tried a few others; there are a few papers on this. I tried t-SNE and LLE and isometric embedding, and I chose the one which had the least loss and was known to preserve homology. There's an empirical characterization I can point you to afterwards which says that certain embeddings better preserve the persistent homology of the original space. So I'm just trying to give a high-level characterization here.
So we've got a way to compute the homology of the data. We generated synthetic datasets. And now we want to ask which neural networks can actually express these datasets. So all we'll do is train neural networks of increasing architectural complexity on these datasets, 5, 7, 9 hidden units, etc., and we're going to train a very large number of these neural networks on these datasets, including increasing the depth of the networks.
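The protocol is roughly a sweep like the following (a minimal PyTorch sketch under stated assumptions: binary labels, 2D inputs; the real experiments sample far more restarts and record the homology of the learned region, not just the error):

```python
import torch
import torch.nn as nn

def best_error(X, y, hidden_units, n_restarts=20, epochs=500):
    """Train single-hidden-layer nets from many random inits and
    return the minimum final training error observed."""
    best = float('inf')
    for _ in range(n_restarts):
        net = nn.Sequential(nn.Linear(2, hidden_units), nn.ReLU(),
                            nn.Linear(hidden_units, 1))
        opt = torch.optim.Adam(net.parameters(), lr=1e-2)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(net(X).squeeze(1), y)
            loss.backward()
            opt.step()
        err = ((net(X).squeeze(1) > 0).float() != y).float().mean().item()
        best = min(best, err)
    return best

# X: float tensor (n, 2); y: float tensor (n,) of 0/1 labels.
# errors = {h: best_error(X, y, h) for h in [1, 2, 3, 4, 6, 8, 12, 16]}
```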
And what you get is a few phenomena. The first thing that pops out is these things called topological phase transitions. So I'm plotting the minimum error, the error of the best-performing network, for a single hidden layer. Yeah, what's up?
>> Non-linearities [inaudible].
>> Non-linearities do matter. In this characterization, I'm using ReLU right now, just for the empirical point, but I can give you theoretical intuition for why they do matter; in fact, ReLU is more expressive than hard thresholding, etc. So just so the key is clear here: blue is one hidden unit on the first layer, green is two hidden units, three is the next, etc. So these are the colors, and you can kind of see it evident in the last picture.
But let me walk you through this. If you generate datasets that have two connected components and nothing else, and you look at the best-performing neural network with one layer and one hidden unit, two hidden units, three, four, five, six, seven, eight, and so on and so forth, then as you increase the number of hidden units, there are these bands where the neural networks never converge, even in the best-performing case. And there are these points, which you'll see in the next picture, which are basically the moments at which the neural network expresses the homology, and they correspond very nicely with how this error convergence works.
So if you look at one layer versus two layers, or even more complex homological components: in the case of two connected components in the dataset, three of these one-layer neural networks can express the homology; but if we go to three connected components, now it's not three but four neural networks that can express the homology. So we're basically going to build up this empirical characterization, and we're going to try this out with increasing homological complexity and an increasing number of neurons. Yeah.
>> For something that goes against conventional [inaudible]: I would have expected, from an appeal to model-capacity arguments, that once I hit enough hidden units, the thing becomes representable.
>> Yes.
>> Then adding more capacity should slow down my generalization. I don't know. So are you reporting test error, or are you reporting training error?
>> I'm sampling a lot of points. The main thing is that I really just want to focus on whether it is going to express the homology.
>> Correct. So I understand that if the four-hidden-unit thing can express it, so will the seven-hidden-unit thing.
>> Yeah.
>> I don't see why, consistently, the seven-hidden-unit thing must train to represent that faster than the original.
>> So the speed here,
I haven't even characterized.
>> Okay.
>> So I'll actually get to a slide here in a second that talks about speed.
Basically, if I have a certain homology in my datasets, so two connected components, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and if I look, with one layer, at how long it takes for the neural network to express the homology, you get these nice bands that emerge. But I can't really say anything more about that.
So basically, with two connected components, the neural network with four hidden units takes 100 times whatever number of iterations, and then it decreases monotonically and kind of stabilizes. But you'll notice, like here, I don't have a statistically significant enough sample; there's more work that needs to be done in this area, but you can see these nice bands emerging. I'll come back to that in a moment.
I just want to express what I think is the most important thing here, which is this homological characterization. So suppose you looked at the data homology, first of all just the number of connected components, and you took one-layer neural networks, and you asked: what is the probability that at some point during training, the neural network expressed the homology of the dataset? And you plotted this against the number of hidden units. So this axis is the number of hidden units minus two; I apologize, the plot axis didn't work out here.
But if you look at, say, homology of three and five hidden units or something like this, you'll start seeing the probability increase in this nice way. And for one-layer neural networks, you can build up this nice linear relationship; for two-layer neural networks, if you go off this graph, it starts curving down in a very nice way.
For three layer neural networks,
you can see it's much easier
to express the Homology,
and for four layer
neural networks,
it is almost trivial to express
these very early
connected components.
>> [inaudible] layer is X over
>> No. I'm just looking at the first layer, just the architectures here. The first layer has these numbers of hidden units, and after that, it's basically the dimension of the dataset, so the topology can be preserved by homeomorphism if that map is full rank.
>> So by [inaudible], when you have only one hidden layer.
>> Yeah, so this.
>> For example, with five units, you can really reduce the training error to zero?
>> No, you can express the homology, so you can express the number of components. For example, take this dataset right here, this squiggly dataset, right? Here's my dataset; I'm going to cut out this ring in the middle. This is one connected component, and you can't expect to find a neural network that expresses this class; this is very, very complicated. This is the curvature argument that he brought up earlier, but what this is saying is that the homology is at least expressed.
>> But what your experiment measures is the training error, so how do you-
>> They measure homology. Every single time I record this training error (this is a lot of data, so it gets up to a petabyte), every time I record one of these ticks, I take the entire decision region, the entire output of the neural network, and then I compute the homology of the level sets: I take the set where the output is greater than zero, and that tells me exactly what the homology is. So, if I go back to this picture over here: I literally take a grid, and this grid is like 600 by 600, though of course here I'll take a grid of 200 by 200, get the exact values, and there's a way that I can compute the number of holes for the values which are greater than zero. That's the decision boundary, and I do that for every single tick. And that's what lets me come up with this nice probabilistic statement: not just about the error, but about the homology, because that's what we really care about. For me to say that I'm going to characterize under which conditions the decision error converges to zero, I'm a little dubious about that claim; but what I'm not dubious about is characterizing the homology.
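A hedged sketch of that grid computation for a 2D input space (assuming a planar decision region; it uses the standard trick that, in the plane, the number of holes equals the number of bounded components of the complement):

```python
import numpy as np
from scipy import ndimage

def decision_region_homology(f, lo=-3.0, hi=3.0, n=200):
    """Approximate Betti numbers of {x : f(x) > 0} on an n-by-n grid."""
    xs = np.linspace(lo, hi, n)
    xx, yy = np.meshgrid(xs, xs)
    mask = (f(np.stack([xx.ravel(), yy.ravel()], axis=1)) > 0).reshape(n, n)

    b0 = ndimage.label(mask)[1]            # connected components of the region
    labels, k = ndimage.label(~mask)       # components of the complement
    touching = set(labels[0, :]) | set(labels[-1, :]) \
             | set(labels[:, 0]) | set(labels[:, -1])
    b1 = k - len(touching - {0})           # bounded complement components = holes
    return b0, b1

# Sanity check: a ring-shaped region has one component and one hole.
ring = lambda x: 1.0 - np.abs(np.linalg.norm(x, axis=1) - 2.0)
print(decision_region_homology(ring))  # roughly (1, 1)
```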
>> So, with how many units can you reduce the training error to zero?
>> Well, I wish I could answer that question. I can tell you something like this, right? This next thing basically says: if I have a certain homology... and mind you, this isn't a statement about all homologies; it's about datasets whose homologies are generated in this very particular way. Like, I have this picture here, which says the datasets look like this, right? But I could stretch this dataset this way, perturb it, and apply a random continuous mapping to it that would change this picture.
>> The reason why I asked this question is that we know that for one-hidden-layer neural networks, we cannot get the global minimum by training; so as long as you reach the number of hidden units that is required to reduce the training error to zero, then I think you can represent their homologies.
>> Yeah, of course.
>> Suppose that's the curvature; likely it will go like this, but then it will be going like this, and then like this, and then suddenly it will all, you know, be on the left side, so there's a big transition there, maybe.
>> I mean, the phase transition for homology is like this, right? Like this point, right? Roughly, these points are going to express the homology; everything above this point is really expressing the homology. If I go to the black regions, you can kind of see.
>> I would have expected the one-hidden-layer picture to look more like the four-layer picture.
>> Well, why do you expect that? So, this is in terms of the number of hidden units, right?
>> Yeah.
>> This is in terms of the homology, right? So, as I increase the homology, you'd expect that I need more hidden units. This is a really classical result in the neural network literature: if I have a certain number of connected regions, I need... you see what I'm saying. So this picture is really: as I increase data complexity, where do these phase transitions happen? That's the better way to phrase this.
Okay, so you get this nice characterization, and we'll go to one-dimensional holes, right? So, for one layer, it's really difficult, probabilistically speaking, to express these one-dimensional holes.
>> Are you defining the datasets explicitly, or is it sampling random ones?
>> I'm sampling these random datasets that I've generated, and in this case, I'm aggregating regardless of the number of connected components. I'm saying: if I have 100 connected components and I have one hole, will it express that hole, etc. Roughly, I'm contracting the number of connected components and making that a very small space.
But the idea is
that with these four hidden
layer neural networks,
you can express these holes
a lot more readily,
and there's a nice
relationship here.
>> How confident are you that your data generation mechanism is representative of that?
>> Of the Homology?
>> Yeah, because you're saying you can probably express each one of these, and you're cutting out and generating data in this particular way, so it is a very conclusive claim.
>> Okay, so, right: we're generating data in this particular way. Well, I guess one thing is that it could be nicer than this. The curve could be over here, right? The data could be even harder to learn on, and so not nice enough for these neural networks to converge to express the homology. I'm not talking about the error, but the data may be even harder for the stochastic gradient algorithms to have these homologies pop up. But at least this is giving you a nicer upper bound, right? Like, at least I'm giving you an upper bound on what this lower bound is.
>> Possibly if you had, like, four circles instead of a circle, maybe this would actually work.
>> Or, a triangle. Right, yeah.
>> Or a square, because the neural network favors the coordinate basis, the standard coordinate basis.
>> Right. Yes, so conceivably, the bounds could be lower here. This is just, at a high level, to illustrate that there is sort of this increasing relationship happening here. And since I'm restricting the geometry of each dataset to be the same (these are circles every single time, right?), the fact that this curve changes as the number of layers increases could be confounded by the fact that curvature is more easily expressed with more layers, right? But at least this is providing an upper bound, right?
>> A priori, I feel like this is kind of what you would expect. What would you say are the unexpected insights you gained after doing the experiment?
>> Well, okay, yes. The unexpected insight is that this relationship for connected components is sort of polynomial, if that makes any sense, and the additional insight here is this characterization of the number of holes. That's essentially what we get, because if you look at the other results characterizing the topology of neural networks, they're about the sum of the Betti numbers. The Bianchini line of work just says the sum of the Betti numbers must be exponential in the units, but it doesn't actually give you an exact characterization. So this is saying that these one-dimensional holes matter, but they really matter in terms of depth, not just the number of hidden units. Well, they do matter in the number of hidden units, but depth plays a much more crucial role here than you would expect. I mean, if you looked at the Bianchini bounds, you'd have no idea whether depth relates to the number of 1D holes, 2D holes, 3D holes. This sort of opens up the insight that these different homological degrees matter.
>> Suppose your dataset is two-dimensional.
>> Yeah. Yeah.
>> So how would you expect this to change if you go to natural [inaudible]?
>> Well, I can give you a theoretical perspective, at least on the number of connected components. To express a connected component, I need-
>> Okay, but for H1,for example.
>> For H1, for example?
>> Yeah.
>> I don't have any idea how that's going to change. The intuition is that if you're expressing a cavity, for example, if you look at the highest homological degree, you can use the same sort of argument. What's more interesting is what happens in between. So for the highest homological degree, to express a cavity, you can imagine this polytope with an empty center; if you increase the number of dimensions, you have to increase the number of units to do that, the same way that you express the number of zero-dimensional homology connected components. But what happens in between, that's a very interesting characterization.
The trouble with doing this sort of thing empirically is that, if you want to go to higher dimensions: I told you that I literally made a grid and then computed the level sets, so your computational complexity increases exponentially. I mean, this is already a very, very intensive experiment to run.
>> Right.
>> But at least it provides some insight that, hey, these intermediate homological degrees sort of matter. It's not just connected components. Yes?
>> So one thing, in H1, and I also was going down the same rabbit hole: we were confused by "probability," and I think it might have been better to call it fraction of types.
>> Because the rabbit hole I went down was: maybe the right measure to think about these probabilities is where you randomly sample neural network weights for each of these things and then compute those proportions.
>> Sure. Yes. So it's
frequency of types.
>> Yes, so a fraction of types.
>> Right. Right. I do
sample like for each
of these squares
>> But you train on a sampled data distribution, so it's not clear what the right probability measure is.
>> Sure. Right. So I would have to go over all the data distributions with a certain homology. For these points, at least, I'm still sampling over very large numbers of weights in the space. Same thing for the different datasets: I have a very large number of different datasets with the same homology. But, yes, there are some interesting comments here; just from an experimental perspective, it's almost impossible. What this is telling you is that these homological degrees that are higher than zero do matter, and their relationship isn't just in the number of units, it's also in the number of layers. This plays out specifically in the ReLU case, too.
Okay. So you can come up with an empirical characterization. You can also ask, at least for these datasets, how long it takes for these neural networks to converge to 0.05 error or 0.01 error, etc., and you get these nice bands that emerge here, as you would expect. So if you increase the homology overall, the time to converge increases in this fashion right here. I don't have any conjectures on what the relationship is exactly, but there is this increasing relationship, and it kind of breaks down as you increase the number of layers; I suspect that's a statistical-insignificance sort of thing, but that's an open question.
So, basically, I made this conjecture: if you extended this diagram all the way off the page, it looks like a polynomial relationship. In particular, if you extend it off, you get that for four layers it looks like the fourth root of the number of connected components, and for three layers it looks like the third root, etc. And so I make this conjecture, which would be interesting to prove: the phase transition point, the point at which the network expresses the homology, is somehow greater than or equal to, lower bounded by, this Lth root of the Betti number.
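Written out (a hedged transcription of the conjecture as stated, where h* is the width at the phase transition, L the depth, and beta_0 the zeroth Betti number of the data):

```latex
h^{*}(\mathcal{D}, L) \;\ge\; \sqrt[L]{\beta_0(\mathcal{D})}
```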
And if you do multi-linear regression on all of the frequencies with which this thing converges, you basically get that it's the first Betti number times this Lth root; that's roughly what pops out. And if you take a bunch of datasets-
>> There's something that's
tripping me up right now.
If I take Homological space
and flip the labels
of all the classes,
say if I look at
the complement of something,
it feels like I have
a different Homological
characterization.
Right? But if you
think about rings
>> Right. So I'm characterizing the decision regions, right? I'm not literally going to the decision boundaries. So that's right. For example, and this is something that comes up, this would be a great next step on the empirical side; I know exactly what you're saying.
So let's say this ball goes off to infinity. On one hand, the number of red components is one; I'm saying one versus all, right? This is one connected component. But if I take the other notion, one versus all on the opposite class, it's one, one.
Right? The characterization is different, but there's this nice machinery that was developed, I think in 2015, which says the right way to characterize the capacity of any given classifier is via the actual decision boundary. What isn't clear is whether there are the nice statistical guarantees that I was talking about earlier.
So what it does, essentially: you've got one class over here, this dot class, and this minus class inside (sorry, these minuses look like dots). What you do is you connect, as epsilon grows, the minus class to the dot class, and you keep connecting as epsilon grows wider and wider. And if there's a mutual connection, you make a triangle out of it, and if there's another mutual connection, you make another triangle, and so on and so forth.
And what is invariant under flipping the classes is that, if you extract the simplicial complex out of this, you get something that looks like this. I'm being a little rough here, but you'll get a bunch of triangles which have an empty point in the middle, and it doesn't matter how you flip the classes, this thing still has a homology of one: there's one connected component and one central hole in it. So there's a way to compute the homology of the decision boundary. Here we're just talking about computing the homology of the decision region for one class, but if you can get the nice statistical guarantees for that algorithm, then I think approaching this work from that perspective is the right way, at least for classification problems. Okay.
>> Can you look at the compactified space, where you glue in the point at infinity? So you put the space inside of a sphere, so it reverses, in particular... I cannot think of the word.
>> Projection.
>> Projection. So if you do that, then there's an isomorphism, for nice spaces like an X inside a sphere and its complement, in terms of homology.
So the precise statement is that if X is a certain nice subset of the sphere and Y is the complement of that, then there's an isomorphism between the homology of X and the homology of Y. So there is a duality there.
You can now look at it over the plane. For example, if the ball stretches to infinity, I think of the point at infinity as sort of wrapping everything around. So if you reverse the stereographic projection, that is a ball, and the outside region is also like this. If you have a sphere, then the disk inside is like a lower hemisphere and the surrounding region is like an upper hemisphere; they have essentially the same homology. So there is a duality there. So if we flipped it, when you compactify to the sphere, you do have this duality.
>> From a practical perspective, though, you want to do the decision boundary, because that's the difficulty of your learning problem. It doesn't matter if your class one has 500 holes or your class two has 300 holes; if it's a linear separation, it's a linear separation. But that's a nice theoretical perspective. Thanks for bringing that up; I totally forgot about that.
>> Let's move a little faster.
>> Okay. So you can take random datasets, come up with a confidence interval on how many holes they have, and then throw that into this sort of model and predict a phase transition point. The caveat here is that it's still a lower bound, right? So there might be datasets with higher curvature. For these two datasets, for example, the persistent homology is used to take the connected components which fall outside of a two-sigma confidence band, and then predict the number of hidden units based on the regression I did on that empirical data; but it's still a lower bound.
So you can get predictions which are far below it, but the idea here is that if you take this as an archetype for seeding architecture selection, it would be a lot nicer to restrict to some subset of the space. So if I gave a higher-dimensional homological characterization, B1, B2, B3, B4, and you somehow understood how neural networks behaved in that case, you'd expect the fidelity of this prediction to improve. At least as a seed for Bayesian optimization, or a seed for neural architecture search, using homological features might be a plausible way forward here.
But this is all empirical characterization, and there are the caveats we talked about; one interesting thing would be to actually understand the exact relationship. So on the exact relationship between the decision region of a neural network and its homology: there's some work that basically sums the Betti numbers; it sums the number of holes in all the dimensions, and you can get different relationships for different activation functions. But the question is, can you get an exact relationship? The method we laid out here was: you get the persistent homology of the data, and that tells you the Betti numbers in each dimension, and maybe the sum of Betti numbers is a weaker characterization. So how do we compute the homology of a neural network?
So we start with a dataset M, and we basically want to answer this neural architecture selection question, right? Which architectures are such that there is some f in the architecture class that expresses the homology? So neural homology theory is broken down like this: you start with some neural network here, and you convert it to a set-algebraic expression. For neural networks with ReLU and half-space activation functions, you can do this; there's actually a very nice way of doing it. So you convert this neural network and its weights to some set-algebraic expression: "A union B union C", or "A union B intersect C", or "A union B intersect the complement of C", etc.
In the Maria Torre
sequence and I'll just
really kinda make this high
school algebra for you,
but to actually kind of count
the number of regions
in your space.
So I'm going to break down each
each of these sort
of components.
Okay. So the first thing you start with is that the decision region of a single hidden unit is a half-space, if you're using ReLU or ReLU-like functions, like the hard-thresholding functions. That's the first given; I hope this is clear to everyone. The next thing is: for a single-layer perceptron, you've always got half-spaces, but if you go deeper in the neural network, what's happening? We're going to restrict ourselves to hard-thresholding functions for now, but you can do the same analysis for ReLU-like functions.
So, just to give you a picture here: we've got a neuron A and a neuron B, and this is the half-space that A produces on the input space, and the half-space that B produces on the input space. And we've got a neuron C, which is up here, acting on the next layer's output space. And all you really have to do is look at this: in the hard-thresholding case, the first layer outputs a binary pattern, 00, 01, 10, or 11; this is the binarization. And if you look at the angle of C's half-space, it'll give you a set expression. So what this thing is saying is: because C has this plane, the neural network is only going to be active if, with the first neuron H_a and the second neuron H_b, either A is active and B is not, A is active and B is active, or B is active and A is not. So if I actually write this out as set algebra, you get A union B. That's the decision region of this neural network.
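A tiny numpy sketch of that picture (with hypothetical weights): two first-layer threshold neurons carve out half-spaces A and B, and a second-layer neuron with an OR-like weight pattern makes the decision region A union B:

```python
import numpy as np

step = lambda z: (z > 0).astype(float)  # hard-thresholding activation

def two_layer(x):
    """x: (n, 2). First layer: half-spaces A = {x_1 > 0}, B = {x_2 > 0}.
    Neuron C fires on the binarizations 01, 10, and 11 (a logical OR),
    so the decision region is the set-algebraic expression A union B."""
    h = step(x)                                  # [1_A(x), 1_B(x)]
    return step(h @ np.array([1.0, 1.0]) - 0.5)  # active iff A or B is active

x = np.array([[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]])
print(two_layer(x))  # [1. 1. 1. 0.]: active exactly on A union B
```

Tilting C instead (for example, the same weights with bias -1.5) would fire only on the 11 pattern and carve out the intersection, A times B.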
And there are these nice tools in homology that will let you look at set expressions of the various half-spaces. Okay, so there's a fun way to describe this; I won't get into it now, but there's some machinery we can talk about afterwards.
But how do we compute the number of holes given a set expression of half-spaces? I'm going to set up some notation first, really quickly; I'm going to show you that it looks like high-school algebra when you work out the math here. So I'm going to write A union B as A plus B, and A intersect B as A times B. And then, if I want to sum the number of holes, I just apply this summation operation: if A has five holes and B has six holes, I'm going to sum them and just write A circle-plus B. And if I want to subtract the number of holes, I'm going to write A divided by B. And I'm going to drop the H for this talk. So an expression like A O_plus B divided by AB, plus C O_plus D, notationally just means the same thing with the H's written out. Everybody good on this notation? H is just the number of holes; just think about it like that.
>> And so plus has operator precedence over the circle-plus?
>> Well, so I haven't said how these things distribute. Yes.
>> [inaudible] plus
for the O_plus.
>> No, but otherwise the thing does not even type-check, right?
>> Yeah. I mean, I'm being kind of loose here with this notation, just to give a high-level picture of what's happening. But I would put some parentheses; I'll put parentheses around this first part.
>> Yeah.
>> No other
interpretation checks
so this has to be
the right interpretation.
>> Okay. Okay, so now I'm going to really just graze over how this works in homology. Let's say we've got the set expression AB plus C, and we want to compute its homology. I'm going to pretend we're in the 1D case right now, because it works out pretty nicely; we're just going to look at the number of connected components. So we know that for any half-space, the homology of that half-space in 1D is one: the number of connected components is one.
We've got this rule, so we can extract: A O_plus B is equal to AB O_plus (A plus B). This is just a nice little rule I'm going to take as given; this is Mayer-Vietoris on torsion-free homology, without the higher-dimensional interactions, but I'm just going to say this is a rule. And then the next rule here, which is also going to require some restrictions on how we're computing homology, says: if C is equal to A O_plus B, then A is equal to C divided by B. These are just nice little algebraic rules. And then the normal set operations work: A times (B plus C) is equal to AB plus AC, and AABC is equal to ABC.
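Collected in one place, the rules just stated look like this (hedged: this is the Betti-number shadow of Mayer-Vietoris, and it only holds under the torsion-free, no-higher-interaction assumptions mentioned above):

```latex
\text{Rule 1: } H(A) \oplus H(B) = H(AB) \oplus H(A + B)
\text{Rule 2: } C = A \oplus B \implies A = C / B
\text{Rule 3: } A(B + C) = AB + AC, \qquad AABC = ABC
\text{Base case (1D): } H(\text{half-space}) = 1
```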
So we've got this sort of high-school-level algebra here, with some rules, and if we have a set expression, we write it as what it is, AB plus C, and we apply rule number one: on the left side, AB plays the role of A in rule one, and on the right side, C plays the role of B. And so we apply rules one and two here, and keep applying these rules; you can just imagine doing high-school algebra. I'm going to skip over how these rules work out in this case, but you can show that the number of holes is equal to three, because A plus B plus C is three, minus whatever the homology of A intersect B intersect C is, plus the homology of A union B.
And this is really nice
because you can
kind of get a nice
upper bound on the number
of connecting points.
Just apply this really
simple Algebraic expression.
So, as a simple question here: can a three-hidden-unit neural network generalize on a dataset with H_0(M) equal to 10? Homology theory says no, and the proof looks like the following. You start with a neural network like this, and you somehow convert this neural network and its weights to a set expression. In this case, the weights are such that you get A union B union C for these half spaces.
Then you apply the set algebra. I'm just going to go through it really quickly, but if you work through the derivation, you can use these very simple Algebraic rules to derive an upper bound. So, one really nice upper bound you can come up with: the 0th-dimensional Homology, or the number of connected components, of a neural network with these Heaviside step functions as activation functions is bounded by N. And if you increase the number of layers, that's bounded by N to the L. That's just a nice result that pops out here.
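That H_0 ≤ N claim is easy to probe empirically in the 1D case. The sketch below is my own illustration, not code from the paper; `heaviside_net` and `count_components` are made-up names, and the grid/trial sizes are arbitrary:

```python
# Sanity-check sketch: a one-hidden-layer Heaviside network on a 1D grid
# should have at most N connected components in its positive region.
import numpy as np

def heaviside_net(x, w, b, v, c):
    """Decision of a one-hidden-layer Heaviside network at each point of x."""
    h = (np.outer(x, w) + b > 0).astype(float)  # (points, N) step activations
    return h @ v + c > 0                        # boolean class per grid point

def count_components(mask):
    """Number of maximal runs of True = connected components on the grid."""
    return int(mask[0]) + int(np.sum(mask[1:] & ~mask[:-1]))

rng = np.random.default_rng(0)
x = np.linspace(-10.0, 10.0, 20_001)   # fine 1D grid standing in for R
for trial in range(200):
    N = int(rng.integers(1, 6))        # hidden width
    w, b, v = rng.normal(size=(3, N))
    c = rng.normal()
    k = count_components(heaviside_net(x, w, b, v, c))
    assert k <= N, (trial, k, N)       # the H_0 <= N bound from the talk
print("H_0 <= N held on all random trials")
```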
But this is very preliminary work, and we haven't really developed a lot of this theory yet; it's sort of the inkling of how you might go about proving these Homological questions. So that's all I have for now. You can check out the paper on arXiv, and if you have any questions, I'm happy to answer them.
>> So your algebra is only for H_0?
>> Well, so I mean I'm sort of omitting things; you're going to have to compute the kernel. We can sit down and talk about this afterwards, but you can work out the Mayer-Vietoris sequence and do something nice here. But in this example, I'm keeping it very simple, because I wanted to explain this to people who haven't seen Homology before.
>> It seems to me that for either large dimension or a large number of neurons, this becomes intractable.
>> You're right. Yes, so one question I have in this case is about Hyperplane Arrangement theory, right? You can compute the Homology, but it's subject to how fast you can compute the intersection lattice. Do certain sparsity conditions in neural networks make the intersection lattice of half spaces easier to compute? Is computing intersections of half spaces easier or harder than intersections of hyperplane arrangements, and so on and so forth? Those questions are going to tell you how intractable this is.
Starting this way, if you wanted to build an algorithm that computed the Homology of a neural network by literally manipulating the set Algebraic expressions, you get exponentially many terms if you go deep and do it naively. So this isn't how you'd write the algorithm, but it's conceivably a way you can prove things in a nicer format. That's sort of where this is going: for proofs, this is a nice mechanism; for an algorithm, I think it's very naive.
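To see where the naive blow-up comes from, note that distributing one layer's unions over the next layer's intersections multiplies term counts (a rough illustration, not from the talk):

```latex
(A_1 + A_2)(B_1 + B_2) = A_1 B_1 + A_1 B_2 + A_2 B_1 + A_2 B_2,
% so L layers of width N can produce on the order of N^L terms
% before any simplification.
```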
>> So I want to propose a question: do you expect an efficient algorithm to solve this problem? So I will give you a neural network architecture, and I will also give you the Betti numbers of the dataset. Can you have an efficient algorithm to see whether this neural network can represent this topology?
>> Yeah. So that's
an open question I hope.
>> But the difficulty is to...
>> The difficulty is going to be in the main mechanism I'm imagining here, which is that, first of all, you have to take these half spaces and somehow compute their intersections. So with the first layers it's pretty easy to do this, but you still get a lot of terms. So you have to compute the intersection of half spaces, and then somehow this will give you the Homology of your space, the number of holes. And as you add more layers, this is going to depend on the following: does the intersection-of-half-spaces problem, as you go deeper, blow up exponentially, or can you reuse in a nice way all the work you've already done in the first layer?
So I could keep enumerating conditions here that make me think this is intractable right off the bat, but it's an open question.
The ideal dream is, if we could compute the Homology of a neural network from its data, there are tons of things you could do. You could do away with persistent Homology. You could come up with a sort of parametric version of Homology computation: you train a neural network, compute the Homology of its decision boundary, and say the dataset has to at least contain these features, with the confidence that you have in your classifier.
Another thing you could do is, say two neural networks are performing really, really well on ImageNet, and they get the same testing error, but one has like 500 connected components in its decision boundary. Those are all adversarial pockets, right? It's getting the same testing error, so how do you tell which neural network is simpler in its decision boundary?
>> From my perspective, maybe the usage of "adversarial examples" is borderline here. In fact, what you mean is that you will get wrong labels, whereas adversarial examples, or perturbations of the samples, are related to the margin of your decision boundary.
>> I got it.
So that's the definition.
>> For example, whether you will jump into another class: if you have a large margin, you do not have adversarial samples, but in your setting you're talking about classification error.
>> There'll be pockets of classification error that you wouldn't measure as recognition error.
One thing, though: if you really are to use this definition of adversarial examples, it's because you're using these linear half spaces instead of some different shape. That's going to give you a certain increase in Homology, and where do those extra Homological components live? Are they really far away from your margin, or are they right on the margin? That's another thing; it could very well be that that happens. But thank you for the correction.
Yeah. So one way to choose an architecture would be: conceivably they're both doing well on testing error, but one has way fewer of these Homological complexities, and Homological complexities are room for misclassifications. So choose the one that's simpler. Yeah.
>> It's a basic, basic question. So are there analogs of these Homological complexities but for manifolds? The reason I ask is because the kinds of models you're studying, these neural networks, do have local Euclidean smoothness, and they do have a smooth inductive bias, at least locally. So manifolds seem to be the natural setting.
>> [inaudible].
>> So it seems like, why not understand the more specialized thing that relies on manifold structure, rather than fight to stay in the more general topological theory? And then the other part: the neural Homological theory is cool, but it seems like it's more broadly applicable after doing all this work, trying to do dataset categorization through topological complexities. Like, Homological complexities could apply to decision trees, too.
>> Yeah, that's right. That's right. Maybe it's even easier to do that. Yeah.
>> Exactly.
>> Yeah. It's more broad. I'm calling it neural Homological theory because this is where it came from. Yeah, absolutely.
>> If you want to
[inaudible] and you're
thinking about some classes
of nonlinearities,
maybe the right thing
to do is to
specialize things further
and think about the
>> Manifold structure.
>> Right.
>> Yeah. The main reason I'm starting with this topological-structure perspective is that in some cases it's actually conceivable that you can compute this on data. Obviously, in high-dimensional cases it breaks down. I mean, it's not clear that if I used certain manifold properties of neural nets and characterized them that way, it would even help in seeding architecture selection. The main point, or the initial point, of this work was justifying why Homology would be something nice to feed into a neural architecture search: that it really does affect which neural networks you should choose. It's one piece of inductive bias you can use in choosing neural networks.
And the main thing is you have to be able to compute that geometric complexity, compute that inductive bias. So from a theoretical perspective, I think making more specialized notions for neural nets is really the next step, because this applies only to half spaces. But from a computational perspective, having this computable topological measure would be really nice.
>> I'm not convinced by focusing on topology. I don't feel like data complexity is a function of topological category. Like, if you have two clusters that are very close together, topologically that's the same as if they're far apart, but I think that's much more difficult to learn.
>> Right. Yeah. I mean, I can come up with a million and one counterexamples for this. But I guess the notion was that topology can't be your only piece of inductive bias; I think it is one piece, though, and we should have at least one piece that we understand well.
>> Okay. Okay.
>> That's another one.
If you can characterize
that and then come up
with an algorithm for it,
then let's do that
characterization, too.
>> There's one remarkable sanity check that would already be really cool to demonstrate, and it involves the rotational invariance of images. We often say that those lie on a much lower-dimensional manifold that basically represents things, and it would be really cool to just actually do it and visualize it: literally take the same object, take a whole
>> There's a great image of this as a manifold embedding of ducks, rotated ducks.
>> I thought you'd say plant photos, for example.
>> Well, no. But the ducks are simpler, because you could literally take a bunch of ducks that are rotated, and it gives you this nice circle when you embed it.
>> Without appealing to an embedding, just starting from the original data representation.
>> Visualize the circle.
>> No. With the [inaudible] persistent Homology thing, you should essentially see that.
>> Oh, okay. Yeah. That would be a nice simple example of capturing those.
>> Yeah. I'm saying, like, don't do any embedding: take the original images, do this persistent Homology thing, and...
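That experiment is simple enough to sketch. The following is a hedged illustration, not something from the talk: the synthetic "duck," the grid sizes, and the use of the ripser package are all my own choices:

```python
# Rotate one synthetic image through 360 degrees, treat each raw image as
# a point in pixel space, and look for a single prominent H_1 class
# (the circle of rotations) in the persistence diagram.
# Assumes `pip install numpy scipy ripser`.
import numpy as np
from scipy.ndimage import rotate
from ripser import ripser

base = np.zeros((32, 32))
base[10:22, 6:14] = 1.0   # asymmetric off-center blob standing in for a duck

angles = np.arange(0, 360, 5)            # 72 rotated copies
points = np.stack([
    rotate(base, float(a), reshape=False, order=1).ravel()
    for a in angles
])                                        # raw pixel vectors, no learned embedding

dgms = ripser(points, maxdim=1)['dgms']   # Vietoris-Rips persistence diagrams
h1 = dgms[1]
lifetimes = np.sort(h1[:, 1] - h1[:, 0])[::-1]
print("top H_1 lifetimes:", lifetimes[:3])
# Expect one lifetime much larger than the rest: the rotation circle,
# recovered directly from the original image representation.
```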
>> He seems [inaudible].
>> Ask Russ. Ask Apple.
>> That's just public.
All right.
>> Cool. Thanks.
