For a lot of these things, it's actually really
easy to make something poisonous.
And as governments and industry have grown to recognize this fact, you just have this recurring thing where all of a sudden you invent a miracle something-or-other. Oh, plastics. Plastics were thought to be the wave of the future in the 1950s. They're also just a type of molecular product. And now we find out that they choke seagulls, they choke baby turtles. There are microplastics everywhere. I think this is a type of generalized toxicity issue: we realize that if you make large quantities of a new substance, the world broadly isn't prepared to digest it.
What happens is 30 years down the line, you're
like, oh, crap, I killed off the trout.
I killed off the eagles.
So it all comes down to the fact that, I think, living systems are extraordinarily complicated, and making something that is tested and safe for a living thing to interact with is actually very challenging.
You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald.
I'm especially excited to talk to Bharath
because he created the DeepChem open source project, which we've seen a lot of our customers at Weights & Biases using, and it seems to be the most popular library for people working on deep learning applied to chemistry and biology.
He also made an open-source dataset called
MoleculeNet, which is a benchmark suite to
facilitate the development of molecular algorithms.
He got his Ph.D. in computer science from Stanford, where he studied deep learning applied to drug discovery, and is the lead author of TensorFlow for Deep Learning: From Linear Regression to Reinforcement Learning.
I'm really excited to talk to him.
It's really exciting to talk to you.
We've been seeing a lot of customers come
in doing drug discovery and other medical
applications, and it's something that I'm
not super familiar with but seems incredibly
meaningful.
We've got a chance to talk to a whole bunch
of our customers and ask them what they're
doing.
And one thing that keeps coming up is actually
the DeepChem library that I think you're the
original author of.
So I really wanted to start off by asking
you about that.
What inspired you to make it, and what problems
were you trying to solve?
Yeah, absolutely.
First of all, thank you for having me on the
show.
I'm glad and excited to chat as well.
Lots of folks I know have been using Weights & Biases to train models and track experiments, so I think it should be a fun conversation, I hope.
A few years ago, basically during my PhD, I did an internship at Google where I used their many computers to train some deep learning models for molecular design broadly.
But I think what happened was, as with all
good things, the internship came to an end,
and I had to head back to Stanford and then
I found out I no longer had access to all
that code.
I couldn't really reproduce my results.
So I think the starting point was I just wanted
to reproduce the results of my own paper.
And to start, it was basically just a few scripts in Theano and Keras at that point.
Then I put it up on GitHub, I mean why not?
Then a few more people did start to use it,
then it just sort of grew slowly and steadily
from there.
I think the original aim of DeepChem was really
to enable answering questions about so-called
small molecules.
So most of the drugs that we take, Tylenol, ibuprofen, things like that, are all small molecules. But over time, I think pharma has actually begun to shift away from them, and so now there are newer classes of medicines.
There are of course things like vaccines.
So nowadays, I think DeepChem is slowly trying
to grow out to enable open-source medicine
discovery across a broader swath of modern
biotech.
So that's just a little bit about the project.
I think there is a very active community of
users.
There's a number of educational materials
and tutorials built up around it.
I think it's also that a lot of medicine discovery is quite proprietary. There are biotech companies whose advertising material we often see, like "our proprietary algorithm," "our proprietary technique," which has worked fine for the industry for a long time.
You know, that's the way most medicine we
know was discovered.
But, of course, as we know in tech there's
just been a shift, in that open source is
increasingly a foundational part of the way
we build companies, we discover things.
So I think part of the goal of DeepChem is
to bring some of this open-source energy to
the biotech drug discovery community and enable
more people to be able to share in these tools.
It seems like you've definitely been successful
at that.
I mean, even before I knew of you, talking to folks at Genentech and GSK, I would say over half of the conversations I've had with pharma companies have mentioned DeepChem. I thought it was pretty cool that they're using the same platform and contributing IP. I didn't know that pharma did that at all.
So that seems really wonderful.
I think it definitely is kind of a new shift
in thinking.
But of course, you know, pharma has seen the fact that TensorFlow is open source, PyTorch is open source.
So I think it is the beginnings of a shift.
At the same time, I think IP considerations
definitely do matter a lot.
So I think a lot of folks find they can't
contribute at some places, which is fine,
I think it's just a policy.
But there is still a culture of caution around
potentially releasing valuable IP.
But I think what helps things a bit is there's
this recognition that oftentimes it's the
actual data that's the core IP.
It's not necessarily the algorithm; that's just calculus. And so I think there are some favorable shifts in the industry, but it's definitely something that's only beginning to happen.
So just taking a step back, because I think
not everyone necessarily knows the field at
all.
I actually didn't, till maybe six months ago
when we started to see our users doing this.
What's the canonical problem here that pharma is trying to solve?
Yeah, I think it's a great question.
At heart, the goal really is to design medicine
for diseases you care about and the reality
is this is an extraordinarily complicated
process.
And I'd say even now, machine learning is
only useful for 10% of it.
And the task here is that say you identify
a disease, then you want to find a hypothesis
for what causes the disease.
Maybe there is a protein that somehow has become misconfigured or mutated in the body. There can be a whole host of disease-causing factors, but you oftentimes take a reductionist approach: you narrow that down to one protein target.
So you say that if I somehow could change
the behavior of this protein, I could potentially
cure this disease.
It's a hypothesis.
It might be right, it might be wrong, but it's a good starting place.
Then you go out and you say, now I know this
protein.
Can I find a molecule that causes it to have
some interaction?
So there are a few mental models for this: you can think about it as a lock and key, or you can think of it as an interacting agent, the drug, that comes in and shifts the behavior of the protein in a way that's favorable. So the computational goal, at a crude level, is: given the description of this problem, design and print out the ideal molecule.
Now, the reason this gets challenging is that finding the ideal molecule is extremely hard. I think one of the hardest problems here is this question of toxicity.
I think the silly example for this is if you
want to kill cancer cells, you can pour bleach
on them.
You can't drink that bleach - that's going
to kill you too.
So a lot of medicine is pretty indistinguishable
from poison.
It's really targeted poison that goes after
one particular part of the body.
So when you're designing medicine, you're often just struggling with this challenge; you're on this very razor-thin design edge between poison and medicine.
You also often don't have a precise model
of whether the potential drug works or not
until you try it in real patients.
So you try to make proxy models for this.
Traditionally you'd have something like a rat that has some variant of the disease, or sometimes it's things like cats or even dogs, and when you think it's safe, you then try it out on real patients.
So this is kind of the clinical trial process: Phase One tests toxicity. Is it safe for humans? Phase Two tests efficacy. Is this actually showing an effect in the group of patients I'm trying this on? And then Phase Three is basically "OK, we think there is an effect; let's make sure in a big trial with lots of people."
And occasionally there are things like Phase
Four, which is after the drug is being used
by real people, let's do more studies, understand
the real effects it's having on patients so
that we can give better guidance to doctors.
So I think the heart of the challenge in applying
machine learning here is that we are dealing
with a lot of unknowns.
We don't know precisely why things become
poisonous.
We know some of the reasons.
But oftentimes you'll get these strange factors
that crop up.
We don't know if a potential medicine actually
treats the disease in question until we try
it.
Just to slow down for a second.
I think it's not even obvious to me necessarily
what the machine learning problem is within
that.
What's the input data and what are we trying
to predict?
That's definitely another great question.
And usually, the challenge here is that you
start with a very narrow sliver of this problem.
So there are, say, limited models for toxicity: given some amount of data, you create a database of compounds labeled "this molecule induces a negative effect." You can train a machine learning model that, given the structure of a new molecule, will predict an output, which is the toxicity label.
The challenge, of course, is generalization.
You know it works on your training set, but if I give you a new molecule, does it actually work? That's often the question, and it's very hard to gauge.
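The supervised setup described here, molecular structure in, toxicity label out, can be sketched in a few lines. This is a toy illustration: the "fingerprints" and labels are invented, and a nearest-neighbor rule stands in for a real model, so it shows the shape of the problem rather than any actual DeepChem pipeline.

```python
# Toy sketch of molecular property prediction: each molecule is a bit-vector
# "fingerprint", and we predict a toxicity label for a new molecule by copying
# the label of the most similar training molecule (1-nearest-neighbor).
# All molecules, features, and labels here are synthetic, for illustration only.

def tanimoto(a, b):
    """Tanimoto similarity between two bit-vector fingerprints."""
    on_a = {i for i, bit in enumerate(a) if bit}
    on_b = {i for i, bit in enumerate(b) if bit}
    if not on_a and not on_b:
        return 1.0
    return len(on_a & on_b) / len(on_a | on_b)

def predict_toxicity(train, query_fp):
    """Predict by copying the label of the most similar training fingerprint."""
    best_fp, best_label = max(train, key=lambda item: tanimoto(item[0], query_fp))
    return best_label

# (fingerprint, is_toxic) pairs -- entirely made up
train = [
    ([1, 1, 0, 0, 1, 0], True),
    ([1, 1, 1, 0, 1, 0], True),
    ([0, 0, 1, 1, 0, 1], False),
    ([0, 1, 1, 1, 0, 1], False),
]

print(predict_toxicity(train, [1, 1, 0, 0, 1, 1]))
```

The generalization problem from the conversation shows up directly here: a similarity-based rule only works when new molecules resemble the training set, which is exactly what fails on genuinely novel chemistry.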
And then how is it possible? Sorry, there are some questions I have.
How would you possibly have enough training
data?
You're not going to keep poisoning cats to
keep finding more and more poisonous molecules,
right?
How does that work?
I think that's another great question, and the real answer is we don't have enough training data.
Which is why I think molecular machine learning
is a bit of an art right now.
Unlike images and speech where there are these
dramatically larger training sets, the datasets
are fundamentally limited.
There are a few approaches people take to
deal with this.
I think one common theme is let's use more
of the fact that we know a lot about physics
and chemistry.
Toxicity, I think, is a very hard problem; it's biology, which is kind of harder.
But in many cases, you'd say that "well, okay,
I know something about the molecule.
I know something about its invariances.
I can encode that into the convolution network."
So now you have increasingly sophisticated
graph convolutional networks that encode more
factors of known molecular structure.
It's definitely not a solved field. I think this entire part of machine learning is far from what I call the ImageNet moment, that point at which the thing just crosses over and breaks out. I think right now it's useful, but it isn't that magic bullet in this field.
I actually really would like to go back to
that but I want to make sure I understand
the core problem here.
So it sounds like you have a molecule and
you want to predict some kind of property?
I think that is definitely the most common
one.
There's a number of variants to this.
Like you might have a protein, then you want
to find a molecule that interacts with it.
One way you can frame this is as a property: does the molecule interact with the protein?
There are also generative models, where you say, okay, given a database of known drugs, use an LSTM or something to just print out a new potential drug.
This tends to get a little hairy.
It's kind of hot research, but it's not safe
to really use in production.
I think there are some raging academic debates about that right now.
Alright.
Sorry, could I ask some more dumb questions?
How do you even represent a molecule?
Text seems kind of obvious to me but I mean,
it seems like molecules have a variable length
and they have some structure.
Is it a graph?
It's actually a great question.
Thankfully there's the field of cheminformatics, where a number of years ago they defined a thing called SMILES, S-M-I-L-E-S.
So SMILES strings are basically a language
that allows you to write down molecules.
It's most often used for small molecules but
you can write pretty big arbitrary molecules
as well.
Many architectures take the SMILES string and convert it into a graph.
And the idea is that the atoms in the molecule
turn into nodes in the graph and bonds usually
turn into edges.
Although sometimes you do something like a distance cutoff, because there are these non-covalent interactions. So you might say all atoms that are close to each other are now connected and have edges in my graph.
And does that completely represent a molecule?
Honestly, not at all.
The real molecules are these very complex
quantum beasts that have orbitals and extremely
complicated wave functions.
In fact, I'd say that when you get past really teensy molecules like helium (there are probably a few slightly more complicated ones), you actually don't know the quantum structure of these things.
Until the quantum computers arrive and we
can run these simulations, we actually do
not really have the ability to grasp the "true
structure" of a molecule in most cases.
So it's an approximation.
It's mostly useful for many purposes though.
But yeah, molecules are more complicated than we understand, in many cases.
So when you talk about an LSTM generating
a molecule, it's generating, literally generating,
a string that gets interpreted as a molecule?
Exactly.
So with the SMILES language I mentioned, precisely what you do is treat it like a sentence generation task, but you're generating in the SMILES language.
And oftentimes the challenge there is that
if you do this naively, you'll generate grammatical
errors.
So it's not an actual molecule. But there's been a lot of research by some groups, at MIT in particular and UToronto, that have worked out ways to constrain the generative models so that they're more likely to generate real molecules.
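The "grammatical errors" mentioned here can be caught at a purely syntactic level with a cheap filter like the sketch below. This only checks surface well-formedness of a SMILES string (balanced parentheses and brackets, ring-closure digits appearing in pairs); real validity checking requires a cheminformatics toolkit that parses the actual chemistry, so treat this as an illustrative toy.

```python
# Minimal surface-level sanity check for generated SMILES strings.
# Catches a few common failure modes of naive character-level generators:
# unbalanced parentheses, unbalanced bracket atoms, unpaired ring-closure digits.
from collections import Counter

def looks_well_formed(smiles):
    depth = 0            # branch parenthesis nesting
    in_bracket = False   # inside a [...] bracket atom
    ring_closures = Counter()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing a branch that was never opened
                return False
        elif ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif ch.isdigit() and not in_bracket:
            ring_closures[ch] += 1  # digits inside brackets are isotopes, skip
    return depth == 0 and not in_bracket and all(
        n % 2 == 0 for n in ring_closures.values()
    )

print(looks_well_formed("c1ccccc1"))  # benzene-like string, balanced
print(looks_well_formed("c1ccccc"))   # unpaired ring closure
```

A generator pipeline might use a filter like this as a cheap first pass before handing candidates to a full chemistry parser for sanitization.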
So I guess this sounds, you know, as an ML
person, this sounds incredibly appealing,
right?
Like a kind of well-formed tricky ML problem
that has the potential of saving lives.
And I guess I wonder how much of this is real
and how much of it is speculative?
Can you point to an example of a drug that
was created through this process or helped
by this process?
So absolutely not, unfortunately.
So this is kind of where it gets really fuzzy.
On average (I think COVID might actually speed up discovery in some cases), it's like 15 years from first starting a project to actually getting to patients.
So there have been simpler computational techniques
in use for decades now.
So there is some degree of evidence that they
help.
But I don't think there's been a smoking gun.
There isn't one molecule they can really point to and say that an AI made that. And I think it's more like, you know, the process of using this program helped, in some fuzzy, hard-to-quantify fashion, the design of this compound.
But it seems like the programs are kind of
suggesting, or at least the framing that I
hear from a lot of our customers is the programs
are like suggesting compounds to try.
Which makes a ton of sense, right?
Because you have to try something.
So I assume that people have some non-random
approach for this.
It seems that there must be evidence now if
these deep learning techniques work better
for this kind of suggestion than other techniques.
That seems pretty quantifiable, or am I missing something?
So.
I think part of the challenge here is that it's hard. There are many steps in the process.
So there is a paper from Google recently where they showed, on one particular task, that when they ran the experiment naively it was like a few percent hit rate, that is, things that actually looked like they might work at that stage. And when they bootstrapped it by training a machine learning model, then making predictions, it was something like 30 percent. And, you know, that sounds like a giant boost, but I think that's like one step out of 20 in the process.
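As a back-of-the-envelope illustration of what those hit rates mean in practice, the sketch below uses the approximate few-percent and 30 percent figures quoted in the conversation, not numbers taken from the paper itself:

```python
# Rough arithmetic: how many compounds would you expect to synthesize and test
# to find a target number of hits, at a given hit rate? Rates are the
# approximate figures from the conversation, used purely for illustration.
import math

def compounds_needed(target_hits, hit_rate):
    """Expected number of compounds to test to reach target_hits."""
    return math.ceil(target_hits / hit_rate)

naive = compounds_needed(10, 0.03)      # few-percent naive hit rate
ml_guided = compounds_needed(10, 0.30)  # ~30% model-guided hit rate
print(naive, ml_guided)
```

The roughly tenfold reduction in compounds to test is a real saving at that one screening stage, which is exactly the point being made: a big win on one step out of twenty.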
So you take the thing that comes out of that and you go to the next stage, where you're like, well, this molecule's good, but it turns out that it gets caught up by the liver. We need to change it somehow so that it avoids that.
And right now, the best way to do that is
still to hire a seasoned team of medicinal
chemists who can guide you through that process.
In the later stages, it gets particularly
gnarly because you have very small amounts
of data.
So like the Google paper, it was at an early
stage where they could generate programmatic
large datasets, like 50 million data points
or something.
But in the later stages, you might have like
a hundred.
And then also you're in that fuzzy no man's land in which machine learning is kind of witchcraft.
So I think that's part of the reason.
Because maybe you started out with something that was AI-generated, but then 10 medicinal chemists came along, tweaked it here, tweaked it there; then what do you have at the end?
And honestly, we don't know, like, I think
10 years from now, maybe there will be a molecule
we can point to.
But for now, I think it's so fuzzy.
It's kind of interesting, what you said. I mean, I totally resonate with the ImageNet moment,
because I definitely remember the ImageNet
moment for vision, because I ran a company
that was selling training data and suddenly,
you know, everyone flipped from wanting text
training data to images because suddenly all
the image applications were working.
But I guess what was kind of interesting was
that I actually feel like the ImageNet moment
came a few years after ImageNet, like not
only did we see vision starting to work, but
it took people a while to realize it.
And then companies started to staff up.
And now, you know, I can go on Pinterest and
click on stuff and buy them right away.
Or I can find out my baby photos on my iPhone.
But like, it seems like this one, the medical
companies have kind of staffed up maybe before
it's clear that it's working.
Because it does seem like deep learning is
now important to basically every Pharma company.
I mean, it seems like this could be set up
for a real serious disappointment also.
I think that's a very insightful observation, and I think you're totally right. If you talk to a pharma veteran, they'll tell you there's this old Fortune magazine from 1980 where they had some pictures of molecules on a computer and they said it's going to be like medicine on the computer, it's going to change everything.
And of course, nothing changed.
And I think, you know, even for the Human Genome Project, there was a lot of hype.
You know, people thought, having access to
the genome would change everything.
But I think the recurring theme of biology
is that billions of years of evolution always
have more tricks behind them.
So I think you're right.
I think deep learning is a useful but not
magical tool in the space right now.
And I think that in some cases that disappointment
has already hit people.
I think in other cases still, my hope is that
people stick with it because I think these
techniques do have a lot to offer.
But I don't think it's going to magically
cure cancer.
I think it'll be one useful tool in the scientist's
toolkit to discover medicine.
But what do you think caused people to feel this optimism? Because machine learning techniques have been around for quite a long time.
And I presume people were trying these on
the same datasets.
Like, is there something special about deep
learning that it sort of feels more promising
in some way?
It's a great question.
You know, I think, you know, we all saw this
amazing wave of just deep learning hype.
Because I think that ImageNet moment spread
out into these other fields.
And I think people started hoping.
I think there are some genuinely new advances
that deep learning on molecules has engendered.
For example, the more predictive models, when you have enough data, actually start working considerably better.
This Google paper I mentioned a while back actually gets a considerable boost over something simpler like a random forest, because it has enough data.
The generative models, they can sometimes
do clever things.
So I think there is some substance in that sort of paper. But there isn't that breakthrough yet; I think there is the hope that it might lead to one.
And just speaking for me personally, when I started working in this field, I didn't really understand any biology or chemistry; I think 9th grade bio class was my last formal training in the subject.
You and me both [laughs]
I had a good 9th grade bio teacher, but yeah.
I think when you come in, you're like, well,
you know, tech can solve many hard problems,
like why can't it solve this?
Why not?
And I think the answer is evolution has had
billions of years and that just builds up
irreducible complexity sometimes.
So I think it's still hopeful.
I think there is real potential and value.
But I think also, once you spend some time in it, you get some humility; the scope of the problem is much grander than I first realized when I was coming into the space.
But yeah, I think it's just that the hype train got ahead of the actual technology, and then it's like the Gartner hype cycle. I think now we're in that trough of disappointment, and then that slope of enlightenment is coming up a few years from now.
Interesting.
People seem fairly optimistic for a trough of disappointment.
It is an interesting perspective.
Yeah maybe we're still coming down, I hope
not.
One problem that I've always found in health
applications is missing data.
Like, are there data sets like ImageNet for
these kinds of applications?
So, honest answer: not really. I kind of started a project called MoleculeNet a number of years back in grad school, along with one of my coauthors.
And our intent was to gather as many datasets
as we could to try to make something like
ImageNet.
And I think the honest answer is we helped
a little bit.
I think there is a useful collection of data
and benchmarks we put together.
But the challenge is that molecules are not... So in computer vision, object detection and object localization don't cover all vision tasks; I think there is some hard frontier of problems still, but you get a pretty big chunk of them.
In molecules, it's more like there's just
an entire range of things people want to do
with them.
You have a little bit of data for each task, and the tasks are often not related.
So if you take, say, a quantum mechanical dataset, you'll find that very different featurizations and algorithms actually work better than if you take a biophysical task or a biological task.
So I think there is a reasonable amount of
data in aggregate.
But it's for different applications and you
can't easily blend it into one ImageNet style
mono data set yet.
Interesting.
It kind of reminds me of natural language processing, with all of its different applications.
I think there is a dream that maybe we can figure out some type of universal pretraining, akin to the GPT-2 models, that actually does get you to that universal molecular model. I think as of now we haven't achieved it, but maybe it's not so crazy to think that we can.
Like, we do know that Schrödinger's equation at some deep level, leaving aside relativity, is the best known model of these molecules we have. So maybe quantum computers will eventually help solve this. But it's a ways off for now.
Interesting.
And the experiments presumably are kind of
expensive to run now.
Yeah, I think there's the rise of mail-order services, things like Enamine or Muchi, where you can pick a molecule out of a catalog, then they'll make it for you and they'll ship it to you. So it's a little easier than it used to be.
You don't actually need to be a bench chemist; at the same time, you do still need to run an experiment. So oftentimes people will, say, use Enamine to buy it, and they'll use a second contract research organization to run the experiment, and they'll just keep track of quality control.
So it is possible to do it, you know, not quite in your basement, I think, but maybe in a well-stocked garage where you can carefully coordinate many e-mail threads or something like that. But, yeah, it's expensive. It'll put you somewhere between a few hundred and a few thousand dollars per compound, depending.
We have a whole bunch of customers that are
startups doing this type of thing, how do
they hope to kind of compete with bigger companies
when they don't have access to these datasets?
That is a great question in many ways.
Maybe I'm not the right person to ask, because I didn't found one of these startups. I think there is some advantage to coming at it with new eyes. When you're a very big company and are trying to introduce a shift in thinking, there is, of course, a lot of cultural inertia. Traditional startup versus BigCo dynamics. I think there is some potential to pick up interesting-looking fruit that people just haven't looked at.
I think there is also, eventually, some potential for mergers and acquisitions.
I think building a talented machine learning
team can be difficult.
And I think if you have a company that has
succeeded and has shown some promise, maybe
it's a good acquisition target.
So I think there are fruitful paths forward
for many of these companies.
I think some of them are actually aiming really
high.
They want to be the next Genentech.
And I think it is possible, but I think that
might end up coming down more to your biologists
than it does to your machine learning people.
And perhaps I'm a bit of a pessimist on that
front.
I think core biology, the really foundational
stuff, is still beyond our current machine
learning and AI techniques.
I think it's beginning to change as you get more genomics data, more kind of biological material that you can feed into machine learning models; there are a lot of companies at that frontier.
But for now, I think it really is that a crack team of scientists might take you further than a crack team of machine learning engineers. Ideally have both, and then you have the best of all worlds.
Though it just seems like the data collection
process is so hard.
It seems you might need to innovate there,
too.
I mean, I'm coming from my own background of data labeling.
It seems so daunting, the idea that you have
to order molecules somehow and run a wet lab.
I guess, again, I have a whole bunch of different questions. One thought I have is probably one of the dumb things that people think of when they first hear about this stuff.
But it seems like if you could model things
about molecules, that's so powerful.
That's like the stuff everything's made out
of.
Like, there must be applications besides biology that might be simpler. Is that true?
I think absolutely.
Now, unfortunately, the challenge is that some of the most interesting applications are in places like batteries. But I think there are other fields.
Like, for example, the crop protection industry.
So if you make pesticides, herbicides, fungicides, it's pretty similar techniques.
Really?
I guess they deal with the properties of molecules.
In fact, this is kind of coming back to that
thin line between poison and medicine.
If you actually take a look at some pesticides, they kind of look like the same small molecules you have in medicine, which might explain a few things about the world.
I think there are also other industrial applications, probably in petrochemicals even. So there are absolutely other cases.
But, I think we in the software industry are
sometimes used to working in our world of
bits.
Whereas when you get into these industries, at the end you have to make something, and I think there is that slowdown.
I think maybe batteries is actually the hardest.
Pharma's a little behind that.
I think some of these agricultural applications
are a little easier to get to market, but
still quite daunting.
I think in general, it just kind of comes down to: for a lot of these things, it's actually really easy to make something poisonous. And as governments and industry have grown to recognize this fact, you just have this recurring thing where all of a sudden you invent a miracle something-or-other. Oh, plastics.
Plastics were thought to be the wave of the
future in the 1950s.
They're also just a type of molecular product. And now we find out that they choke seagulls, they choke baby turtles. There are microplastics everywhere. I think this is a type of generalized toxicity issue: we realize that if you make large quantities of a new substance, the world broadly isn't prepared to digest it.
What happens is 30 years down the line, you're
like, oh, crap, I killed off the trout.
I killed off the eagles.
So it all comes down to the fact that, I think, living systems are extraordinarily complicated, and making something that is tested and safe for a living thing to interact with is actually very challenging.
What about other medical applications? I think you wrote a book on this, right? So, like, what are the other categories of things?
And I guess I'd be curious about your take on how promising they are. It sounds like it's hard to separate the hype, and you've probably thought deeply about this.
I definitely think there is a whole host of
really promising applications.
I think, to name two, microscopy is going to be completely changed by ConvNets.
This is one of those magical places where
ImageNet works, you can actually take an ImageNet
model and stick it on top of a microscope
and start doing pretty sensible things pretty
quickly.
What's an example of a thing that you might
do with microscopy?
One of the kind of interesting things about
this field is that you can pick up a lot more
out of a microscope than you could have thought.
So there are some really interesting papers about this. There are, say, readouts of a cell where traditionally you had to destroy the cell, blow it up, in order to get at them.
But people have started to show that you can instead build a dataset where you take the original cell, then you blow it up and get the readout; then you can train a machine learning model to predict that readout from the raw cell, so you can potentially get non-destructive readouts that enable new things.
This is kind of more basic science.
Like it's not clear what the downstream effect
is.
There are a number of companies, I think,
Recursion Therapeutics is a prominent one
that has been using microscopy and machine
learning broadly to do phenotypic screens.
Earlier, I mentioned you often pick a protein target.
Which I need you to slow down and explain for my 9th grade biology. A phenotypic screen is what?
My apologies.
No, no, I know that a phenotype is like the expression of a gene. Is that right?
Yes, exactly.
So I think one way to think about it is maybe
bottom-up design versus top-down design.
So the targeted drug discovery is maybe bottom-up. You say the human body is complicated; I'm going to be a reductionist. I think this is one magic lever, and if I can switch that lever on and off, I can really change everything.
And that's kind of, you coming from the bottom
and then you hope it makes it all the way
to the top.
The other one, which is actually the more
traditional way of finding medicine, is that,
you know, some really smart doctor, this is
like the penicillin story, notes some effect,
and you have no idea what the effect is caused
by. You don't really understand the intricate
biophysics, the chemistry behind it. But you
see it; it's something that you just observe.
In the famous case of penicillin, wasn't it
the mold on the bread?
But I think for a phenotypic screen like the
ones Recursion does, basically they have these
cell-based assays where they grow cells in
a petri dish. And essentially, you put a little
bit of medicine in there, then you see how
the cell's state changes, and you use the
microscope and the deep learning system on top
of that to pick up those changes. So you can
do this very rapidly.
What would be an example change?
Is that the cells are a different shape?
That's a really good question.
I think it often depends on the disease in
question.
So a common thing, say for cancer, the silly
one is: can you kill the tumor cells? The hard
part there is, can you kill them without it
basically being bleach? So that it's actually
a medicine.
I think, for other readouts really depends
on the disease.
I think the general point there is like diseases
are complicated.
So there are many proxies people use. The
hierarchy of proxies is: if you have a pure
test tube, which is just molecules, that's
the weakest; if you have cells, that's a little
better; if you have a rat, that's a little
better; but the gold standard, of course, is
the human. So you can think of a cell assay
as better than the pure test tube, but absolutely
not the same as a human; it's a useful kind
of proxy.
So, okay. So what the machine learning method
does is kind of find properties based on the
images from the microscope?
The way I like to think about it is that machine
learning is kind of like making a better microscope.
In many ways, if you go back to classical
signal processing, we have all these, you know,
Fourier transforms, high-pass filters, low-pass
filters. And those traditional signal processing
techniques made things like microscopy feasible
in the first place. Well, you had purely optical
microscopes back in the day, but over the last
century a lot of signal processing got attached
to them.
So I think of deep learning in these applications
as signal processing, turned up to eleven.
And so you can pull things out of the image
for which there is no obvious way to write
down that function.
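A toy illustration of that analogy: the hand-designed filters he mentions are just fixed convolution kernels, and a ConvNet layer is the same operation with learned weights instead of hand-picked ones. This sketch, plain NumPy on a synthetic image, applies a classic Laplacian high-pass kernel to pull out an edge:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D convolution (really cross-correlation, as in ConvNets)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A classic hand-designed high-pass (Laplacian) kernel
laplacian = np.array([[0, -1, 0],
                      [-1, 4, -1],
                      [0, -1, 0]], dtype=float)

# A synthetic 8x8 image with a single vertical edge
img = np.zeros((8, 8))
img[:, 4:] = 1.0

edges = conv2d(img, laplacian)
# Flat regions are zeroed out; only the columns straddling the edge respond
print(edges[0])
```

A learned conv layer just replaces `laplacian` with weights fit by gradient descent, which is one sense in which deep learning is "signal processing turned up to eleven."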
So I think right now it's more like this really
fascinating scientific thing; you know there's
got to be something there.
But I want to make sure I'm picturing it, like
I want to have a mental model.
So, like, maybe that was evocative of like,
did I kill that tumor cell?
So is the point that like the machine learning
could tell me if the tumor cells were killed
without me having to actually look at it?
Or is it that the machine learning sees something
deeper that I couldn't figure out even if I
looked at it?
So I'll have to apologize up front because
I'm not an expert at cellular biology, but
I'll try.
So, for example, I might be making this up,
so if there are real biologists that eventually
listen to this, please bear with me.
No, it's a machine-learning audience, you
can pontificate.
By the way, I think machine learning people
will be really familiar with the idea of just
looking at results and not worrying about the
process behind them. So I feel like this is
very appealing to our machine-learning audience.
You know, I do have to say I still have no
idea what happens deep in layer 37 of my ConvNet.
Imagine you have a muscle cell and you can
often measure like the stretchiness of the
muscle cell.
There are often ways to use that as a proxy
for healthiness.
I think the actual thing you measure depends
a lot on the biology of the system.
For example, one common thing is that there
are these things called fluorescent reporters,
and you can engineer the cells so that if you
have the drug and it actually hits something
in the cell that you know about, it sets off
light.
Here, you have to know a little bit about
what's happening inside the cell. You have
to have a guess already.
I think the cruder version might be, you know,
you have this muscle cell you're looking at,
and maybe there's some measure of how stretchy
it is.
Oftentimes it's just like kind of obvious
to the eye.
It's like the traditional "you know a dog when
you see it." You see the healthy cells and
they have some nice geometric shape; it looks
good. And you see the diseased ones and they're
all shriveled up; it just looks bad.
And you can't quite write down that function,
but you know it when you look at it.
Yeah.
So it makes sense that a model could begin
to pick this up.
Right.
And I guess I've seen versions of, like, cancer
cells at kind of different levels. What do
they call them, biopsies? Where you look at
the cells. It's 9th-grade biology. I guess
I can picture what you're saying, that there
are, like, healthy cells.
My question is what is the machine learning
helping with?
Is it sort of like reducing the cost of looking
at this stuff, or is it like pulling out other
signals that are somehow like, useful?
I think it's a bit of both.
So traditionally, the labor model was you'd
have a grad student whose painful job, if they
were unfortunate enough to be stuck in that
lab, was to look at cell 1, 2, 3 ... 10,000.
Now, I think there are a number of readouts
where you just look and you kind of know there's
a difference. So you can train yourself to
read these things.
I think this is, again, an interesting example
you brought up, where you're training the model
to basically pick something out, and you can
do it at a bigger scale. Maybe before I could
only test 10,000 views; you know, the grad
student union would revolt at that point. But
now maybe I can test a billion, or I'm limited
more by my supplies.
I think the second question you asked is actually
the more exciting one.
Is it possible we can pick out something we
didn't know?
So I think there are glimmers that the answer
is yes. I know there are a few papers doing
things like identifying where the organelles
are; you can begin to do some more complex
readouts.
But I think there is sort of almost a chicken
and egg problem here, as in like when you're
discovering something it's like unsupervised
learning, right?
If you know the thing you're looking for,
then you can, like, slot it into buckets pretty
easily.
But then if it's like you want to go deeper
and find something you don't know.
I think yes, there are likely places where
ConvNets act as amplified microscopes and pick
up biology that we don't know. But if I knew
that, I would have gone off and written a Nature
paper about it already.
I'm sure there are a couple that have already
come out of this.
Okay.
So I have to ask you, one of the Nature papers
that blew my mind and I think a lot of people
was the dermatologist's one where they fine-tuned
an ImageNet classifier on cancers.
That was not like under a microscope, that
was just literally just like photos.
And that seemed so amazing.
I mean, should I be as enamored with that
as I felt, or are there some gotchas where
it's not actually like that?
Should we actually still be using doctors
for these diagnoses?
It sort of seemed like from the paper that
it was more accurate than the doctor's diagnosis,
wasn't it?
You know, I think that entire field for sure,
whether it's radiology, or usually it's pathology
or dermatology, where you look at some picture
and then diagnose it, that absolutely is a
place ConvNets will just make a big difference.
And I do think that these models do kind of
achieve a striking advance over what you could
do previously.
So my understanding is that the challenge
there is that sometimes these models pick
up things that are kind of silly.
I remember there's this really excellent blog
post that discussed failed models. There were,
like, scans from different trauma centers,
and the model was doing an amazing job, 99%
accuracy. Any time you see that 99 percent
accuracy, know something is up. It turned out
there was some label at the bottom or something
that the trauma center printed, like light
trauma, heavy trauma. Guess what that model
learned to do right there.
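That shortcut-learning failure is easy to reproduce in miniature. In this hypothetical sketch the "stamp" column plays the role of the trauma-center label printed on the scan: a model that just reads the stamp looks far more accurate than one using the legitimate but noisy clinical signal, while having learned nothing about medicine:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Legitimate but noisy signal: true severity plus measurement noise
severity = rng.normal(size=n)
label = (severity + rng.normal(size=n) > 0).astype(int)

# The artifact: heavy-trauma centers stamp their scans, and the stamp
# tracks the label almost perfectly (1% exceptions)
stamp = label.copy()
flip = rng.random(n) < 0.01
stamp[flip] = 1 - stamp[flip]

# "Model" A just reads the stamp; "model" B thresholds the real signal
acc_stamp = float(np.mean(stamp == label))
acc_signal = float(np.mean((severity > 0).astype(int) == label))
print(acc_stamp, acc_signal)  # stamp reader ~0.99, honest model ~0.75
```

On held-out scans without the stamp, "model" A would collapse to chance, which is why suspiciously high accuracy deserves an audit of what feature the model actually uses.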
So I think it kind of comes down to, what
is the model learning?
Is it a fluke?
Is it kind of an actual thing?
Radiologists are kind of tried and tested.
Like, do you really want to fire your world-class
radiologist?
So I think there's there is a natural caution
there.
I think in part because we don't really understand
what happens deep in layer 37 of the resnet.
So I think the FDA and some companies are
moving forward.
I do think that in places where there aren't
enough doctors, this could potentially be a
revolutionary advance. You could get, you know,
world-class scanning centers available in
clinics throughout the world, and not just
in places where you have excellent hospitals
already.
But I think it will take some time.
I remember a number of years ago, I think
maybe in the 80s, again, there was a whole
wave of hype around expert systems for medicine
and how they could diagnose patients.
And I think it might have been in that same
blog, a retrospective study found that in many
cases, hospitals that deployed expert systems
actually had a fall in patient well-being
afterwards, because there were these complex
interactions that no one thought of in the
first study. And then you find out a number
of years later that there is this unexpected
side effect.
So, yeah, that's a long answer there. I think
it is something to be interested in and excited
about, but it will also take time to really
vet it and make sure that this is something
that improves patient well-being.
Although I guess I do wonder what happened
with the melanoma model, because it does seem
like, you know, doctors are also not perfect.
And I also cannot inspect my doctor's brain
to really know their decision-making process.
So I wonder, is it just that it's seen as
unsafe to change, or was there some real flaw
or some simplification that wasn't obvious?
I don't think there is a flaw in the paper.
My guess is that... this isn't my field, so
I'm projecting a little bit out there. I know
that deploying something in the clinic, on
the healthcare side, is actually even more
complicated than on the new biotech side.
I think you have to work with insurers, and
work with payers, and work with hospitals and
doctors.
You know, the American healthcare system has
many known challenges.
My sense is that this has just been very hard
to actually get out there.
So I think in pharma and biotech, the advantage
is that if you get something to work, there's
actually a very well-known path to get it to
people. For advances like this dermatology
thing, there's a fuzzier, more ill-defined
path to get it out into the wild.
I think there are some real scientific questions,
around whether this is actually robust, that
still need an answer.
But I think there are also harder business
questions about whether this makes sense as
a viable business.
And I'm sure there's like a dozen startups
who are working on this right now, but I just
don't know as much about it.
Actually, my wife runs a healthcare startup,
and she tells me that it's the only industry
where you can literally save money and save
lives simultaneously and not have a viable
business.
I've had a few friends who left health care
and have formed, ostensibly boring but very
successful startups and are much happier with
their lives.
So I sympathize just a little bit.
But, you know, you probably know way more
about this than I do.
Like, it's a little bit outside of my expertise.
Sorry to take you out of your expertise. But
this is what I was hoping for from the podcast,
that I could corner guys like you to ask all
of my dumb questions.
I really appreciate it.
And I think we should kind of wrap up because
I think this might be just getting long for
the format.
But we always end with two questions that I'm
kind of curious about, actually. I always say
this, but really, I am curious how you're going
to answer. What is one really underrated aspect
of machine learning that you think people should
pay more attention to? What comes to mind?
That's a really good question.
I think that "machine learning is amplified
signal processing" is a view that is not as
commonly celebrated. But there are these really
exciting things going on. Machine learning
is finding its way into instruments, into
sequencers, into microscopes. It's a type of
internet of things, but not the consumer version.
I think traditionally new scientific instruments
are the predecessor to fundamental new scientific
discovery.
So I think that when we find deep learning
is making our instruments better and more
capable, then we're actually setting ourselves
up to discover and build fundamental science.
So that's something I'm very excited by.
But it's kind of a longer...
We might have the instrument and we still
need the Einstein or something to come in
and work that and really get us that magical
new understanding about the world.
But I'm excited by that.
That is a totally cool answer. But I guess
they may give so many readings that it's hard
to even interpret, and I guess a good algorithm
would give you a few high-value, what you might
call processed outputs, like that.
I think it's still going to be quite a while
before we see that. We talk a lot about AGI,
and I know there are many ways in which you
could get a general intelligence. But I think
the process of induction, of inferring things
about reality from very few hunches... This
is probably made up, the Newton and the apple
tree story. It probably didn't happen that
way; we know it's a just-so story.
But you could imagine some machine learning
model seeing that. Can it somehow extrapolate
from that out to the universal law of gravitation?
That I think would be amazing.
It just seems far beyond our current science.
I feel like with all these medical applications,
I guess the reason I naively find them exciting
is that, like if you're trying to compete
with the human for navigation and driving.
Our brains are designed for that.
Clearly, like huge part of our brain is just
to navigate the world and not crash into stuff.
But it doesn't seem like our brains are designed
for interpreting molecules that we can't see
and like what effects they might have.
I mean, I'm still trying to visualize it in
my head, and I can't even do it. So it sort
of seems like maybe the bar is lower for a
useful algorithm.
I think that's a really interesting point.
I do think understanding quantum mechanics,
at least, doesn't fit in my head. There are
lots of complicated things going on in that
hidden world. Maybe part of the challenge is
that it's hard to validate a discovery. Many
times a model says something, but after you
spend a while, like 9 times out of 10, you're
like, what bullshit did the system pick up
this time?
And I think the challenge there is, like you
said, maybe we have to make the models robust
enough that there are actually high-quality
signals coming out.
So we're like, oh, that's a clue, or, oh, I
don't know what hiccup happened in, you know,
step two thousand of gradient descent. So I
think that's maybe the challenge we just haven't
solved yet. I think this is beginning to change.
It still feels like discovery, like invention,
is the province of the human and not the machine.
But, you know, maybe that's like, you know,
the antiquated line and 10 years from now
AI will have discovered everything.
And I'll be like, well, that aged poorly there.
It will be an interesting world that comes
to pass.
All right.
So the final question is: right now in 2020,
I guess it's already June, what do you think
is currently the biggest challenge of making
machine learning models work in the real world?
Like in your experience, what are the challenges
that you've run into?
Like what have been the surprising hurdles?
I think the things more specific to me often
involve small data. Like, again, you have 30
data points, and oftentimes it's a very
well-meaning scientist who comes and says,
what can you do for us with 30 data points?
And oftentimes I'm like, oooh, I wish I had
a better answer.
Sometimes you just try seven things: you try
transfer learning, you try multitask learning,
meta-learning, and all the learning fails.
And then at the end, the random forest is like,
yeah, it's not great, but it does something.
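The 30-data-point problem he describes is easy to feel in a toy example: with far more parameters than points, a flexible model memorizes the training set perfectly while saying little out of sample, which is one reason boring, heavily regularized baselines keep winning. This is a synthetic NumPy sketch, not a real molecular benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p = 30, 200, 100

# Only a few of the 100 features actually matter
w_true = np.zeros(p)
w_true[:5] = 1.0
X = rng.normal(size=(n_train + n_test, p))
y = X @ w_true + 0.5 * rng.normal(size=n_train + n_test)
Xtr, ytr, Xte, yte = X[:n_train], y[:n_train], X[n_train:], y[n_train:]

def rmse(yhat, yref):
    return float(np.sqrt(np.mean((yhat - yref) ** 2)))

# Flexible model: 100 parameters, 30 points -> interpolates exactly
w_interp = np.linalg.lstsq(Xtr, ytr, rcond=None)[0]
# Boring baseline: heavily regularized ridge regression
w_ridge = np.linalg.solve(Xtr.T @ Xtr + 10.0 * np.eye(p), Xtr.T @ ytr)

train_err = rmse(Xtr @ w_interp, ytr)  # essentially zero: memorized
test_err = rmse(Xte @ w_interp, yte)   # much worse out of sample
print(train_err, test_err, rmse(Xte @ w_ridge, yte))
```

With 30 points, cross-validated comparisons of the fancy methods are themselves noisy, which compounds the problem he's pointing at.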
So for things I'm excited by: robust transfer
learning that actually works on small data.
This has occurred in NLP, but it has not occurred
for molecules, and I think that would be an
amazing advance for this field.
It's so interesting that it hasn't occurred,
because it's totally happened in vision for
sure, and in NLP now, definitely. It's so
interesting that it doesn't work for molecules.
It might just be data. I think if someone just
found a gigantic trove of molecular measurements
that was high quality, you could build on
them. But collecting it, nobody is just going
to find that, right?
I think this is one thing where a governmental
effort could do amazing work. You know, to
be fair, governmental agencies have actually
put out most of the open-source data that's
out there, so they are already working hard
at this.
But, yeah, maybe it's the sort of thing where,
if you got a $10,000,000 grant or something,
you could make a serious dent at putting together
a high-quality open dataset for this. It is
more expensive than ImageNet, though, and it
will take more resources, because you have
to do the actual experiments.
Great answer, I love it.
Well, thank you so much.
Is there someplace we should tell people to
contact you, or is there a thing you want to
promote? Maybe DeepChem; everyone should try
it.
Absolutely.
I think part of the goal behind DeepChem is
to make open-source drug discovery more feasible.
So we could definitely use more users. In
particular, if you're an engineer who knows
how to handle build processes well, please
get in touch; you know, I am trying to figure
out the Windows builds and so on, and it is
such a pain. I am too much of a scientist.
We could absolutely use more help.
So if you are interested in open science,
please do get involved.
I love it.
Thanks, Bharath.
My pleasure.
Thank you for inviting me.
