Welcome to the AI Show. I'm Scott Stephenson,
co-founder of Deepgram. With me is Jeff Ward,
AKA, Susan.
Hi all.
He's a Navy pilot, acclaimed dad joke writer,
and AI scientist at Deepgram.
On the AI Show we talk about all things AI,
what is it, what can you do with it, how does
it affect you, where is it going? We are live
and ready to answer your questions. Comment
on YouTube and Twitch or tweet @deepgramai
to join in.
Today we're asking the question, the very
big question-
Big question.
... how do you use a neural network in your
business?
Oh.
So for realsies, how do you use it?
Well let's just talk about what people think
of neural networks. Simple ones, there's sort
of the classic one, the very first thing you
ever build when you're learning how to deal
with this stuff is the MNIST digit predictor.
You familiar with this-
Yep.
... Scott?
N-I [crosstalk 00:00:48].
It's like Modified National Institute of Standards
and Technology-
Something like that.
Yeah, something like that, something like
that.
Handwritten digits.
Yeah, handwritten digits.
28 by 28 pixels.
Exactly.
Gray scale.
Gray scale, basically they're handed to you
on a silver platter and centered. Absolutely
useless without a massive ecosystem around
it to feed those digits into you in the perfect
way.
What do you mean?
Well, you don't just take a picture of
a letter and suddenly-
Everything works.
... everything works.
Not that simple, right?
It's definitely not that simple. How do you
take something like a task of digit recognition,
how do you break it down, how can you use
deep learning to actually make an effective,
useful model, create a tool that you can use
in some meaningful way?
In the real world you have a task and you
want to do something with a neural network,
in this case it's like I have a camera and
I want to take a picture of something, essentially,
or maybe a video camera and I want it to figure
out what is written on some letter or something
like that, and you have handwritten digits,
just the digits, zero, one, two, three, four,
five, six, seven, eight, nine, and tell me
what the digits are. Simple task, right? A
human can tell you right off the bat, they
can just read them right off.
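To make the shape of the task concrete, here's a minimal sketch of the kind of network being described, assuming NumPy. The image and weights below are random stand-ins, not trained values; the point is just the input and output shapes: a 28 by 28 grayscale image in, ten digit scores out.

```python
import numpy as np

# A 28x28 grayscale "digit" -- random pixels standing in
# for a real MNIST image (values in [0, 1]).
rng = np.random.default_rng(0)
image = rng.random((28, 28))

# Tiny two-layer network with random (untrained) weights.
w1 = rng.standard_normal((28 * 28, 64)) * 0.01
b1 = np.zeros(64)
w2 = rng.standard_normal((64, 10)) * 0.01
b2 = np.zeros(10)

x = image.reshape(-1)                # flatten 28x28 -> 784 values
h = np.maximum(0, x @ w1 + b1)       # hidden layer with ReLU
scores = h @ w2 + b2                 # one score per digit, 0 through 9

prediction = int(np.argmax(scores))  # the network's guess
print(scores.shape, prediction)
```

With trained weights, the argmax over the ten scores would be the predicted digit; with these random weights the guess is meaningless, which is the whole point of training.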
Exactly. This is sort of the difference between
accuracy in the machine learning world and
utility. You can have the most accurate classifier
in the world-
Yeah, for a specific data set on a specific
task.
... but it's completely useless because you
can't feed it that data in the real world.
It's like if you want to send letters to the
right place at the post office or something
and you want it to be mechanized, but people
have handwritten everything, so hey, a hard
problem, you used to do it with humans, now
you want to do it with a machine.
That's the core idea for today, how do we
actually create a machine learning model or
use neural networks in a real world situation
there. We've got a great example there, digit
recognition on a letter or something like
that, but what about also I guess in the news
they're talking about license plate scanners
and stuff like that, what would it take to
actually build something like that? How do
you actually turn an idea, hey I need this,
and use deep learning in there? What are your
big ideas of what you should be thinking about?
You have to think about what's my data set.
That data set has to be at least kind of close
to the task that you're trying to accomplish.
Is it pictures of handwritten digits that
are centered and perfect or is it pictures
of license plates on cars that are driving
down roads at oblique angles with lots of
light on them or smoke, or something like
that in the way. Okay, if you have those pictures
fine, but do you have them labeled by a human
and are they properly labeled, and are they
centered or not, are they all blown out and
all white, are they too dark, do they have
a big glare in them, et cetera?
Just that first step we've talked about data
so many times getting-
Very important.
Very, incredibly important, but it's important
not only just to have data but data that represents
the production environment that you're going
to be in. It's all well and good to have say
for instance, license plate data, but if it's
not taken in a meaningful way it's staged
with professional cameras and all these different
... Is that going to be as good of a data
set as no kidding I'm actually taking footage
from the real world and from the equipment
that I expect to use and dealing with it that
way, not the pristine version but the version
that's already gone through whatever
codecs have had their hands on.
Been compressed.
It's been compressed, it's been mangled. By
the way, why is it whenever you see video
like that it's just-
It's always crappy.
... absolutely the worst.
Shot by a potato.
Yeah. It's like Bigfoot. Magically, whenever
you see Bigfoot it's on the worst video equipment
ever, but you've got to think, if you're about
to take pictures of Bigfoot you need to
recognize you're going to have that quality.
But you also have to be careful don't try
to boil the ocean. You don't have to get every
single angle and is it snowing, is it raining,
is it whatever. Okay, get verticalized, get
one thing working pretty well first and then
you'll start to see the problems crop up but
maybe 80% or 90% of your solution is already
there and then you tackle those problems later.
You don't have to do everything all at once.
Yeah. That's also another key thing here is
be prepared for iterations on [crosstalk 00:05:51].
Iterate-
Iterate.
... iterate, iterate, iterate.
Get your first hour of data just so you understand
formats and how you might be processing and
dealing with it before you spend tens of thousands
of hours and umpteen millions of dollars collecting
data that you then find out is not quite right.
That's really heartbreaking.
Yeah, you can't just guess the answer from
the outset, it's too hard.
It's gradient descent, right?
Yep.
You [crosstalk 00:06:19] assess, take a step,
assess, take a step, assess, take a step,
it's pretty classic.
What do you mean by gradient descent?
I've heard of this thing, gradient descent,
the basic algorithm that a huge chunk of machine
learning uses to train neural networks. Just
like I said, I was given the example there
for assessing, taking a step, assessing, taking
a step, the assessment stage is using what's
called the gradient and that points in a direction
that might be a good way to go for your weights
[crosstalk 00:06:53].
As an example, if you're walking around in
a hilly terrain and it's your job to find
water you might want to start walking downhill.
Yes.
What do you do, you look around at your feet
and you think oh well the hill is sloping
that way I should go that way, but does that
necessarily mean that water's going to be
that way? Not necessarily, but if you keep
going downhill if there's water around it's
going to be around there. That gets into the
discussion of maybe you're in a local optimum,
meaning a local minimum in this case and you
might need to go over the next ridge and then
find an even lower hole somewhere, but still,
this is gradient descent, you're looking at
the slope and you're moving along that path.
Yes. Gradient descent applies to machine learning,
but it also applies to life. It's a great
process/technique that really works well.
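The assess-take-a-step loop described here can be sketched in a few lines of plain Python. This is a toy one-dimensional "terrain", f(x) = (x - 3)^2, so the slope is known in closed form; in a real network the gradient would be computed by backpropagation instead.

```python
# Walking downhill on a simple "terrain": f(x) = (x - 3)**2.
# At each step, assess the slope (the gradient), then step against it.

def slope(x):
    return 2 * (x - 3)   # derivative of (x - 3)**2

x = 10.0                 # starting position on the hillside
learning_rate = 0.1      # how big each step is

for _ in range(100):
    x -= learning_rate * slope(x)   # assess, take a step

print(round(x, 3))  # ends up near 3, the bottom of the valley
```

The learning rate is the size of each step: too small and you walk forever, too large and you overshoot the valley and bounce around.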
We were talking about data there, gathering
data, and the idea behind not going whole hog
right away. You've got to make sure all that
... But the next step, another thing to think
about along the lines of data, is your inputs
and outputs. When you're designing a system
that's going to be useful you're really actually
thinking how am I going to use this thing.
You've got to think the data you're going
to feed it and the data it needs to be fed
in order to predict some answer that you
can then use to do something with. Just very
briefly think about the MNIST style digit
classifier there, data inputs are a 28 by 28
gray scale, centered-
Pixels.
... digit-
White, and black, and gray.
How do you build that? You've got a whole
ecosystem surrounding that, which has
to find where the digits are at, it's got
to parse them out and do all these different
things. It's probably a harder task than-
You're talking from wild data like-
Yeah well-
... okay, I've got a whole bunch of pictures,
I'm starting out with 10 million pictures
but I'm going to whittle it down to the 60,000
that are actually my data sets that I'm going
to use to train.
Well yeah, I'm just saying, when you're thinking
about a production environment, I've got these
cameras, how am I going to get the data, to
the classifier itself or to the network itself,
into the shape and form that it was trained
on in order to make the prediction that I'm
going to then use back out there? If you find
that the task of doing that is a lot harder
than the model itself you're probably right.
The real world is not well normalized.
If you get your data set right, you get your
tools right, you pick your model architecture
correct, you get your input and output set
correctly the training's actually pretty easy,
you just say, "Go".
Well I mean-
For the most part. Compared to gathering all
the data and all the nitty-gritty things you have
to figure out.
Once you can encapsulate the problem that's
really what we're talking about here, encapsulating-
Define the problem well.
... the problem. Yeah. You need to define
that problem. Going back to the iterative
idea here, you'll find that you started collecting
some data and then you started designing inputs,
and outputs, and a model behind it and you
realize maybe those inputs, and outputs, and
that model can't work with that data so you
need to adjust. You go through this iterative
system, but you always have to have an eye
with I can't do anything that the real world
doesn't support. That's what a lot of people
lose sight of when they're learning how to
do these tools the first time.
It has to work for real in a real setting.
Yeah, they're given these pristine data sets
that have well encapsulated some simple problem,
or even a complex problem. I've personally
spent two weeks working on one data set just
to whip into shape to be usable and it's a
really hard task to get the real world to
be bent and shaped into something that's usable
by your particular model. Keep that in mind
when you're thinking about a usable model.
Let's go from beginning to end for a simple
system. You have a dash cam in your car and
you want it to detect license plate numbers
and display them on a display in your car,
so you're like a police officer or something,
right?
Yeah.
And you-
Officer Susan.
Officer Susan reporting for duty. You're driving
around in your car, you have a dash cam and
you want to get a text or display on your
screen. Of all the license plate numbers around
you how do you build that system? Okay, you
have the camera, then what? Not even then
what, the camera what is it doing? It's looking
at an optical signal in the world. As a lens
it's taking in light and it's digitizing it,
so that's really important. You have to be
able to digitize the thing that you're actually
trying to measure. It's pretty hard to measure
people's thoughts. Other things that you could
think about, very hard to digitize, but taking
a picture, we've got that one, use a camera.
Yeah, use a camera. Like you said, you've
got to digitize that, you got to be able to
put it in some sort of portable processing
system if you're doing this real time.
So maybe it's hooked up to a USB. That dash
cam is hooked up to a USB cable that goes
to a computer and that computer is just saving
an image, 30 of them every second, just saving
all of them and just building up tons of data.
Which gets down to the question about inputs
and outputs. We'll just take a basic one here:
are you going to try to figure out something
over time, or do you treat each image individually?
These are basic, simple-
You have 100 pictures of the same license
plate. Do you want 100 notifications of that
license plate or just 1?
The image classification world has gone light
years, just massive leaps forward since the
original work on MNIST and what everybody's
familiar with, making a simple multi-layer
network to recognize digits. In general, you're
going to have to find some way of taking that
image that you've digitized-
It's a big image.
... you've been able to feed into some engineering
solution that takes a picture in seconds or
as fast as it can be processed and then looks
for the next one. Takes all that, feeds it
in to something that's going to probably normalize
the image for light and do some techniques
for basic image processing to take care of
a whole lot of stuff.
Try to make it not too dark, not too light.
The more you can normalize your data the less
your neural network is going to have to work,
which is a great thing because the accuracy
is going to go up there.
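A minimal sketch of the kind of normalization being mentioned, assuming NumPy: rescale each image to zero mean and unit variance, so dark frames and blown-out frames land on a common footing before the network sees them. A production pipeline would likely do more (glare handling, contrast correction), but the idea is the same.

```python
import numpy as np

def normalize(image):
    """Rescale pixels to zero mean, unit variance -- a simple, common
    way to keep images from being 'too dark' or 'too light' for the net."""
    image = image.astype(np.float64)
    return (image - image.mean()) / (image.std() + 1e-8)

dark = np.random.default_rng(1).random((28, 28)) * 0.1   # a very dark image
out = normalize(dark)
print(round(out.mean(), 6), round(out.std(), 3))
```

After normalization the statistics are the same whether the input came in dark or bright, which is exactly the work the network no longer has to do.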
Sure, but you have a camera and it's got a
pretty big view, and the license plate could
be anywhere inside.
Exactly, so you're probably going to have to go
into something that's going to detect where license
plates are at. A different network-
Yeah so you probably have two systems-
Yeah, at least.
... one that's a license plate detector. It
just says, "I think a license plate is here"
but that's looking at the entire image, it's
looking for the whole thing and then saying,
"Oh, I think a license plate is here". Then
you have another one that says, "I'm going
to snip out only that section and then I'm
going to try and read the digits".
Yep, well it's going to scale it next. It's
going to snip out, scale it, you're going
to make certain assumptions because you know
what license plates look like about how to
scale it and it's actually probably a nice
problem because of that.
A fun problem.
Then finally, you can send it off to your
classifier after you've scaled, and sliced,
and diced. Now you've got something that might
be able to output possible answers that you
then display to the person driving. Hopefully
they're not texting while they're doing it.
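Putting the stages together, here's a rough sketch of the two-system pipeline in Python with NumPy. The detector and reader are hypothetical stubs standing in for trained networks, the box coordinates and plate text are made up, and the rescaling is a crude nearest-neighbor resize; the point is the detect, snip out, scale, read flow.

```python
import numpy as np

# Stage 1 (stub): a detector that returns boxes where it thinks
# plates are. A real one would be a trained network.
def detect_plates(frame):
    return [(100, 200, 60, 120)]   # (top, left, height, width) -- made up

# Stage 2 (stub): a reader that transcribes a cropped, rescaled plate.
def read_plate(crop):
    return "ABC1234"               # made-up plate text

def process_frame(frame, plate_shape=(32, 96)):
    results = []
    for top, left, h, w in detect_plates(frame):
        crop = frame[top:top + h, left:left + w]   # snip out the region
        # Rescale to the fixed size the reader was trained on
        # (nearest-neighbor resize, for simplicity).
        rows = np.linspace(0, crop.shape[0] - 1, plate_shape[0]).astype(int)
        cols = np.linspace(0, crop.shape[1] - 1, plate_shape[1]).astype(int)
        scaled = crop[np.ix_(rows, cols)]
        results.append(read_plate(scaled))
    return results

frame = np.zeros((480, 640))   # one grayscale dash-cam frame
print(process_frame(frame))    # ['ABC1234']
```

Because you know what license plates look like, the fixed target shape for the reader is a reasonable assumption, which is part of what makes this a nice problem.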
To build the data set for that, if you're
starting out it's like I want to build a license
plate reader for that dash cam but I have
no data. What do you start doing, strapping
cameras to the front of cars and driving around,
right?
Yeah.
Then you send it off to crowd source the data
labeling or you do it yourself and you sit
down and look at images, and you draw a box
around the license plates. There's the box
around the license plates and you use those
boxes, the pixel numbers for those boxes,
to say in here there was a license plate,
so that's to get the data to build your first
model that just tells you where the license
plate is. Once you've gone through and made
all those boxes, now those are just images
for your next data set, where you go
in and say can I read these or not, or can
a person read them or not, and then type in what
that license plate is, the numbers or the
letters. Now you actually have a labeled data
set at that point and that's how you train
the models that we're talking about, identify
where the license plate is and then also what
is it, what are the numbers.
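The two labeling passes described here naturally produce two data sets: one for the where-is-the-plate model and one for the what-does-it-say model. A sketch of what those labels might look like, in plain Python; the filenames, box coordinates, and plate text are invented for illustration.

```python
# Pass 1: a human draws boxes around plates (detection labels).
detection_labels = [
    {"image": "frame_0001.jpg",
     "boxes": [(102, 215, 58, 118), (300, 40, 55, 110)]},  # (top, left, h, w)
]

# Pass 2: each snipped-out box gets transcribed (reading labels),
# or marked unreadable so it can be skipped during training.
reading_labels = [
    {"image": "frame_0001.jpg", "box": (102, 215, 58, 118), "text": "7XKP291"},
    {"image": "frame_0001.jpg", "box": (300, 40, 55, 110), "text": None},  # unreadable
]

readable = [r for r in reading_labels if r["text"] is not None]
print(len(detection_labels[0]["boxes"]), len(readable))  # 2 boxes, 1 readable
```

Keeping the unreadable entries around matters too: they tell you how much of your real-world data the reader can never hope to transcribe.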
This is, keep in mind, this is all a very
simplified version of this problem, this is-
We don't have to make it more complicated
though.
Yeah so-
This is a simplified version and it's already
really complicated.
Exactly, exactly. This is a real world use
case. The real world is going to throw all
sorts of kinks and curves at you, for instance,
I don't know, having multiple cars. You start
detecting multiple license plates. What happens
when a motorcycle splits lanes right next
to you, what are you going to look at there,
those kinds of things, shadows hitting you,
those people that put the shields over their
license plates to make them hard to see, which
I don't know the legality of that.
A typical system that would identify either
where a license plate is or what the numbers
are that would be just a typical CNN network,
a convolutional neural network or something
like that. These work really well, but those
things have been done to death. Many academic
papers written about them. You can figure
out how deep should it be, how wide should
it be, which kernels should I use, all these
different settings. You just go download one,
you can just [crosstalk 00:16:52] PyTorch,
TensorFlow, get an example and there it is
for you. Now it might not be trained for exactly
what your task is but you don't have to pick
the model architecture and you don't have
to go through all that whole design process
to figure out what's going to work or not.
You can pretty much take it off the shelf
and just hit train, and maybe adjust a few
parameters, but you spent an hour, five hours,
on that section, maybe a day, and then you
spent two months on the other stuff.
That's a great point because there's a lot
off the shelf stuff that didn't exist before,
especially in the image recognition world.
If you're playing in that world, I don't get
a good chance to go back there too often but
every time I look there's just more and more
amazing tools, especially when it comes to
anything on the road, for obvious reasons
for the autonomous driving revolution that's
happening. Those tools are just getting a
tremendous amount of attention and there's
a lot of great work that's out there. If you're
thinking about building some of these things
look for off-the-shelf solutions first. Come-
There won't be an end-to-end solution for everything
you need to do, but there will be parts of
it where you can save a considerable amount
of time.
But that comes down to if you go with some
off-the-shelf system going into the things
you should be thinking about that might dictate
some hardware that you don't have access to.
It's like this model here is using these tools
and these tools you either have to delve into
them or figure out how to build something
that can mimic them in some way, shape, or
form. That becomes a real concern, especially
something like in a real world you don't have
a lot of processing power available trying
to do this task. This comes back to the difference
between accuracy and usability. If you have
to have a rack of servers sitting in a car
to be able to do the task that's probably
not usable, even if it's accurate.
Maybe a first proof of concept, but this isn't
going to be a real product that you ship.
Driving around with all the fans whirring
behind you.
Yeah, with $100,000 worth of computers in
the back of your car.
It's great, I can read those license plates
now, although probably don't need that much
compute for that task.
We talked about data, super important, we
talked about inputs and outputs, loss function.
This is really determining more the type of
problem you're doing. There's a lot of standard
... When we think about a loss function what
we're talking about is the thing that takes
truth versus prediction and says how close
are they.
What's my error.
Yeah, what's my error.
How I define what my error is.
That loss function has to be crafted in such
a way that it's differentiable, so you have this
ability to, what we call, backpropagate
the error all the way through the model if
you're talking about deep neural networks.
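For a digit or plate-character classifier, one standard choice is cross-entropy loss, which takes the true class and the network's raw scores and returns a single, differentiable error number. A minimal sketch, assuming NumPy; the scores below are made up.

```python
import numpy as np

def cross_entropy(truth, scores):
    """Truth vs. prediction -> one error number.
    truth: the correct class index; scores: raw network outputs."""
    exp = np.exp(scores - scores.max())   # softmax, numerically stable
    probs = exp / exp.sum()
    return -np.log(probs[truth])          # low when confident and correct

scores = np.array([0.1, 0.2, 3.0, 0.1, 0.0])  # network leans toward class 2
print(round(cross_entropy(2, scores), 3))      # small: prediction matches truth
print(round(cross_entropy(0, scores), 3))      # large: prediction is wrong
```

Because every operation here is differentiable, the error can be backpropagated through the softmax and on through the rest of the network.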
What does that do though, this backpropagation?
Backpropagation, what we've got are all these
... When we talk about a model and model structure
the structure is the physical way that the
math is laid out, in other words, this equation
leads to that equation, which leads to that
equation, this is the layers.
These are the layers.
Right, these are the layers. But those layers,
those equations, have a whole bunch of parameters.
It's the simple slope form, mx + b.
Just numbers.
They're just numbers. If you can get those
Ms and Bs just right then you can fit the curve.
And there's just millions of them though.
But there's millions-
There's a lot of them.
... and millions and millions of them. Those
things we call parameters. Well when we-
So all these dials in the network [crosstalk
00:20:34] and they need to be turned.
That's actually one of my favorite images
is a person sitting in front of a switch board
with 10,000 knobs between 0 and 11. Every
single one of those knobs affects every other
knob. You've got inputs over here and you've
got outputs over there. If you could just
twist those knobs to just the right
settings-
There is a correct one that minimizes your
error.
There is a great setting of them, but finding
them out ... So what do you do, this is where
backpropagation, gradient descent and all
these things come into play, you send something
through that model-
You let it make a guess.
... you let it make a guess.
Leave the settings where they are but just
go.
Yeah and you look at the outputs that came
in there, and you look at the truth, and you
have your loss function-
You know the answer to the input that you
gave it and it's like how far away is the
output from the model compared to what the
actual truth is.
Exactly. You've got your loss function that's
going to show you that. Now from that loss
function I can take that error and I can propagate
it backwards through that network and I can
say-
Essentially there's a recipe that says if
we have this error down here then what you
need to do is go back and turn all of these
knobs this much, but it's only a little bit
each of them. It doesn't say put this knob
in this position. It says move this one a
little bit that way, move that one a little
bit that way.
In every single example that goes through
that it's going to say, "Hey, the knob should
have been here" and the knob's going to be
there. When you've got a bunch of these examples-
It needs to be moved a bit this way.
Yeah, you take the average of a bunch of examples
at once, this is what we call a batch, and
now the average says in general this knob
should have gone over here. You do this a
whole lot and eventually [crosstalk 00:22:15].
You don't do this once or 10 times, you do
this millions of times, many, many updates.
Yeah, exactly. In the end, it comes up with
a great setting for those knobs and now the
outputs are getting you pretty close to what
you want.
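The whole guess, measure the error, nudge-every-knob-a-little, average-over-a-batch loop can be sketched end to end on the simplest possible model, the mx + b mentioned earlier. This is plain Python with two knobs instead of millions, and the data is synthetic and noiseless, but the batched gradient descent update is the same idea.

```python
import random

# The "knobs": two parameters of y = m*x + b, at arbitrary starting settings.
m, b = 0.0, 0.0
true_m, true_b = 2.0, -1.0     # the setting we hope training discovers

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(200)]
data = [(x, true_m * x + true_b) for x in xs]   # synthetic labeled examples

lr = 0.1
for _ in range(500):
    batch = random.sample(data, 20)        # a batch of examples
    grad_m = grad_b = 0.0
    for x, y in batch:                     # average the knob nudges
        err = (m * x + b) - y              # prediction minus truth
        grad_m += 2 * err * x / len(batch)
        grad_b += 2 * err / len(batch)
    m -= lr * grad_m                       # nudge each knob a little bit
    b -= lr * grad_b                       # never jump straight to an answer

print(round(m, 2), round(b, 2))
```

Early on the knobs move a lot; as the error shrinks, so do the nudges, which is the slow refinement described above.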
At first there's a lot of movement, the knobs
are moving all over the place and then there's
slow refinement as the model starts to get
trained.
Yeah, and the occasional time where it trips
up and a whole bunch of them start going
off and [crosstalk 00:22:42].
Yeah because they all affect each one of them,
so one ... It has to make up for that change.
That's generally backpropagation. One of the
key skills is coming up with 1,001 ways of thinking
about that for yourself, because the more ways
you start thinking about how this works the
better you understand intuitively what's going
on, and that can help you design these things in
the future.
Constraints help a lot with this, how much
money do you have, what computing resources
do you have, what talent do you have, the
people that know how to operate these systems.
You can go on many, many goose chases here,
a lot of rabbit holes. You could spend the
next 15 years working on a problem and never
come up with something that's actually valuable.
There's still many good things that you're
learning along the way. You have to learn
to cut off certain things and be like, "Good
enough, good enough, good enough". That's
kind of the way that machine learning is now
at least. You have to have some restraint
in order to get a real product out the door.
Yeah, definitely. We've talked a bit about
designing something, but I think a lot of
what people don't realize, again, with the
machine learning world, is not just building
is a challenge, but the world isn't static.
Maintaining a deep neural network is actually
a really big challenge. Even just consider
the license plate problem, every single year
there's 100s of new license plates. Someone
writes in to their local representatives
and says, "Hey, I think the state should have
this picture of my favorite cartoon character
from 1983" and they get enough signatures
and suddenly there's a brand new license plate-
Has a new design.
... in the world. Car designs change, vehicle
designs change, all sorts of things change.
Or in California the first digit kind of just
incrementally goes up. There's a new first
digit just because it's later on. It wasn't
likely before but now it's likely.
The idea that you put all this time and effort
and it stops, maybe there are problems out
there like that, but it's pretty hard to imagine.
We'll just go back to handwriting, the digit
recognition. I can guarantee you that the
average penmanship has changed
