SPEAKER 1: Please give a
warm welcome to Dr. Krishnan.
S.P.T. KRISHNAN: Thank you, [INAUDIBLE].
So good morning, Seattle.
Good morning, everyone.
Thank you for coming
together today.
So as you can see,
I work at Hitachi.
And I have a small team of people, geographically distributed.
We do a lot of AI/ML stuff.
So this is one of the projects
that we have been doing
for the last couple of months.
I mean, just here
on the [INAUDIBLE]
are some of the results
that we have got.
And let's keep the
questions towards the end.
I have a couple of
slides to go through.
And as [INAUDIBLE] introduced
the Google Developer Experts
this morning, I happen to be
GDE for Google Cloud as well.
I've had the privilege of being
that for the last five years.
So if any of you want to
know how to become one,
feel free to talk to
me during lunchtime.
Having said that, let me
just get going into the talk
right away.
Most of the time,
we have some agenda
of what we are going
to do and then I
would show the demo
later on, but I'm
going to flip this
around and say,
hey, what is it that we
actually built that is
very important to you guys?
And all the results come after that.
So in this particular problem,
there was an opportunity that
Hitachi was working
on-- or, let's say,
my company was working on--
where we have to
identify the faces,
we have to detect the faces,
we have to track the people,
and we have to redact people
who are not persons of interest.
So we actually built a solution. We call it a solution because we consumed quite a bit of open source software and did our own innovation on top of that.
And what you are going
to be seeing in here
is the net result that
actually comes out of it.
So let's start with the
actual demo, itself, in here.
So I want you to watch person number 10.
We like the person.
It's as if, magically, she
knows that we are actually
capturing the video.
And, of course, we did
not capture the video.
It was some video
from the internet.
So this is what you are
going to be seeing--
how we actually built a solution that created this video, in terms of not only tracking one person, but tracking multiple people, and also redacting everyone except those who need to be left unredacted.
So it's all OK?
Thank you.
So now let's get into
the agenda in here.
Basically, I'm going to
be defining which category
of people, which
category of market
actually requires this
problem to be solved.
What is the solution approach?
I'm going to walk you gently through a flowchart-- not much on the mathematics side of things, but more on how this problem would logically be solved and what takeaway you can get from there.
I'll just identify four key features.
And, of course,
a few more demos.
Not the first demo, but several more demos that are coming.
And then a sneak peek into--
there's an error.
No, it's not a peak.
It's a p-e-e-k.
We'll fix it later on.
But then, in the sneak peek, it's about what all the other projects are that we are doing.
So first thing, where does
this whole opportunity exist?
So it's in the law enforcement area, wherein there are a lot of people uploading videos on the internet, whether it's from a Ring video doorbell camera or your Nest video camera.
I'm sorry.
This is Google, so we talk
about the Nest video camera.
And then, when these
videos are going
to be presented in
front of a judge,
in front of a District Attorney
in the US, what happens
is that you need to leave the persons of interest tracked, but everybody else should be redacted-- legal requirements, privacy reasons, that kind of thing.
So now, that's the opportunity that we are currently working on.
And take it like version
one of the results there.
So now, some of the things that need to be redacted are faces, number plates, and street signs. All of them.
And again, face
[INAUDIBLE] privacy.
It just happens that
[INAUDIBLE] number plate.
Non-important
number plates should
be detected and redacted.
And, of course, street
signs, if you show them,
are kind of having a negative
impact on the property value
in that area.
So the street signs
also should be redacted.
Then the question comes in, why
do we need a solution for it?
Why can't we do it manually?
Yes, you can do it
manually with old videos,
where it's a 30 frames
per second kind of thing.
But with 4K videos,
with 60 frames
per second, 120
frames per second,
you can't really put a
person to actually do it.
At the same time,
there is so much
amount of video that
is being uploaded
to many of these portals.
So we need an automated
method to do this whole work.
The third one is that
many of the CCTV cameras
these days are not stationary.
They're actually
panning back and forth.
So when you have panning
cameras back and forth,
the challenge
becomes that, hey, I
saw Krishnan in this
particular frame.
Is this the same
person who I now
see with a different side profile, when the camera pans back?
So we need a method
to actually track
people who are non-stationary--
either the camera is not
stationary, or the
person is not stationary.
So this is required. That's the third challenge that we undertook and solved in this whole project.
And the last one is-- yes, there are a lot of open source libraries, as I said. Here, it's more about building, let's say, a working end-to-end solution.
We did a little bit
of coding, but we also
consumed quite a bit of
open source software.
That means each piece of software, by itself, is not a perfect fit for the problem.
They are solving
part of the problem
but not the entire problem.
So we amalgamated
multiple things.
And that's where you see it: no single open source library on the internet that we are familiar with is able to solve this problem.
So we did a lot of
scouting on the internet
and then we started
doing our own work.
So now what's the solution?
The first thing is, we
call our solution iRedact.
There's no link to
any Cupertino company.
Just to make it
very clear, this has
nothing to do with
iPhone or anything else.
I just call it iRedact.
And then it comprises a framework, libraries, and a CLI that you can use.
So if there is an
opportunity to do a workshop,
I'll be happy to show it.
Everything is Python.
Python is the primary
programming language
in my group.
CLI and GUI also integrates
with the web app.
The web app purpose is
primarily to manage the video.
Somebody uploads a
video to a portal,
and here is a law
enforcement officer
who needs to redact or not
redact those other things.
And iRedact itself comprises three components, which are faces, number plates, and street signs.
In this talk, we are going
to talk only about the face.
There is a sneak peek at the number plates and street signs.
And I'm talking a little bit
fast, but just bear with me.
I do want to cover
a couple of slides,
so that there is some value
at the end of the day,
at the end of the session
for everyone who came today.
Now iRedact face-- here
is the algorithm that,
eventually, worked for us.
So first thing was,
here comes a big video.
It could be a 4K video, it
could be a full HD video.
We first have to
chop it into frames.
A very logical step.
Everyone knows about it.
But now, there are multiple methods that can actually do it.
And we tried a
couple of methods.
And we ended up with one
method that was OpenCV.
I have more details later on.
We ended up chopping
it into frames.
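That first step, chopping the video into frames, can be sketched with OpenCV's `VideoCapture` API. The output file naming and the `every_n` sampling knob here are my own illustration, not the iRedact code:

```python
import os

def extract_frames(video_path, out_dir, every_n=1):
    """Chop a video into frames with OpenCV, keeping every Nth frame."""
    import cv2  # deferred import so the sketch loads even without OpenCV

    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if index % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{index:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```

For a 60-frames-per-second 4K video you would typically raise `every_n` to trade accuracy for throughput.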
Now, once we have enough
frames and everything there,
we need to detect how
many people are there,
in the particular frame.
Let's say there is a security
camera running in here.
The camera captures
video of this room.
There are a lot of
people in this room.
So we need to detect
how many faces--
how many human
faces-- are there,
in this particular video.
Now, as I said, no single open source library was able to detect everything, primarily because some of these libraries detect large faces, some of them are good at right-profile faces, left-profile faces, partial faces.
These are all challenging.
So we ended up combining-- or amalgamating-- a couple of libraries together, with a method to say: if there are two bounding boxes in a video, do they refer to the same person or to different persons?
So we'll have more details coming up.
Now once we have all those
faces there, we need to say,
well, there are two
faces, are these two faces
two different persons?
How do we know?
So essentially, for people who are in the OpenCV world or who are in video processing, you have something called a facial embedding.
But for everybody
else, it just means
that we create something
like the equivalent
of a fingerprint for the faces.
So it's just called
facial embedding.
Again, it's using an
open source library
from Google, called FaceNet.
We slap it on top of that. After the frames, once the faces are detected, we create a unique fingerprint-- or face print, if you want to say so-- for all the people detected in the particular frame.
After that, now
the video's moving.
Then we need to track to
see, is this person moving?
One person-- say, Person A--
was in coordinate x and y.
Now the person is going
to move to coordinate,
let's say, x plus
1 and y minus 1.
Is this the same person?
How do we know that?
We repeat the same process
by extracting the faces,
generating the thumbprints--
or, sorry, face prints--
and then comparing them to see if the distance is within the threshold-- whether this is the same person or a different person. So that's what you see in terms of Deep SORT tracking.
I have more details in a second.
And finally, there is a very unique problem when a security camera is panning from left to right--
let's say I'm stationary
on one side of the room
and I'm talking with my friend,
but the camera just went away
and the camera's
now coming back.
Because the camera went
away, I went out of frame.
Now there is a real problem
that when I come back,
I could be given a different ID.
Remember all the IDs that
you saw in the first video--
the demo video-- they
were all system generated.
The first face I detect gets number one, the second face gets number two. I do not assign them manually.
So when you come back in here-- now Krishnan got ID 10 in one part of the video, and Krishnan got 20 in another. Are 10 and 20 the same person or different persons?
So that's what we solved, using the clustering.
We put together all the
steps and created a solution,
so that this solution is now
capable of processing videos.
And it is going to be
automatically generating
what we call a checkpoint
file-- more details soon.
With that, we can now give it to
a law enforcement officer, who
can select some of the frames to
say, keep track of this person
and redact every other person.
Make sense?
Sorry.
Thank you for keeping
with my pace of talk.
I'm from Virginia, by the way.
I got up early in the
morning-- the time zone.
So first thing, extract
frames from video.
That's OpenCV.
That's the method
that actually worked.
Face detection
networks-- as I said,
there is no single
network that actually
worked with every face,
every video we tested.
We ended up selecting three different methods. There was no preset threshold of three or four.
We started experimenting
with a lot of libraries.
And as there was overlap between libraries, and one library was performing better, we started dropping the others. And we eventually ended up with three libraries.
The names are coming
up in the next slide.
Now each of the libraries is going to be detecting. Let's say they're detecting me.
One is going to draw
maybe a rectangular box,
another one is going to draw
similar bounding boxes--
maybe not exactly overlapping.
Now with two of these
overlapping boxes,
we need to see, are these
two different people or is it
the same person?
So we used a non-max
suppression method for that.
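The overlap test between boxes coming from different detectors can be sketched as IoU-based non-max suppression. This is a generic, minimal version of the idea, not the exact code used in iRedact:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.4):
    """Keep the highest-scoring box out of each heavily overlapping group."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep
```

Two near-identical boxes from two detectors collapse to one detection, while a box elsewhere in the frame survives as a separate face.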
Then we generate the facial embeddings using FaceNet, as I mentioned a little bit earlier.
Then we introduce Deep SORT
to actually track the people.
And finally, we cluster it with DBSCAN clustering.
Now with all these things done, we call it pre-processing, because we are not a pure academic group; we are an industrial group, wherein we need to take the solution and be able to scale it.
Maybe not on a Google scale,
but, let's say, scale it
so that we are able to
do a lot of processing.
So we created, in step number
seven, a checkpoint file.
And the checkpoint file is, basically, a JSON object that just says: here is a tracking ID for each unique face, that face appeared in this particular frame, and here are the bounding boxes for that particular person.
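A checkpoint file of that shape might look like the following; the exact field names are my own illustration, not the real iRedact schema:

```python
import json

# Hypothetical checkpoint layout following the description above: one entry
# per tracking ID, listing the frames it appears in and the bounding box
# (x1, y1, x2, y2) in each frame.
checkpoint = {
    "video": "demo.mp4",
    "tracks": [
        {
            "tracking_id": 10,
            "appearances": [
                {"frame": 0, "bbox": [120, 45, 180, 110]},
                {"frame": 1, "bbox": [124, 46, 184, 112]},
            ],
        }
    ],
}

serialized = json.dumps(checkpoint)   # what the pipeline would write once
restored = json.loads(serialized)     # what the redaction UI would read back
```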
So we created all these things.
So through the pipeline-- we
just simply call it an iRedact
pipeline--
in comes a video, out
comes a JSON file.
It's one-time processing.
And we just, literally,
created a full, let's say,
X-ray of every person
in the particular video.
Once this is done, we
then built, basically,
a simple Python UI for LEO--
law enforcement officer--
to select one or more faces.
So now comes the challenge: initially, when we were doing the project, the requirement was,
I'm going to select
only one person
and everybody else
must be redacted.
This requirement kind of changed and said, what if two people in the video are working together with the suspect?
I need you to be tracking
those two people,
but leave everything else out.
So we have to expand it to say,
it doesn't matter whether it's
one person or a
number of persons,
you will now be able to track.
So even if it's a group activity
that an LEO wants to track,
we are now able to do that.
So it's a Python UI.
Then, basically, you give it the raw video and the JSON file with this small little UI tool, and then out comes a redacted video.
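The redaction step itself can be sketched as pixelating each selected bounding box. This dependency-free mosaic on a grayscale image (a nested list standing in for a frame) is a stand-in for whatever blur the actual tool applies:

```python
def pixelate(image, box, block=4):
    """Redact a region of a grayscale image (nested lists) by mosaicking:
    every block x block cell inside the box is replaced by its average."""
    x1, y1, x2, y2 = box
    for by in range(y1, y2, block):
        for bx in range(x1, x2, block):
            ys = range(by, min(by + block, y2))
            xs = range(bx, min(bx + block, x2))
            vals = [image[y][x] for y in ys for x in xs]
            avg = sum(vals) // len(vals)
            for y in ys:
                for x in xs:
                    image[y][x] = avg
    return image
```

In the real pipeline the box coordinates would come straight from the checkpoint JSON for every tracking ID the officer did not select.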
And depending on whether we have
access to CPU, GPU and maybe
TPU, one day-- then we'll
be able to do it faster.
Right now, we have tested
with GPU and also CPUs.
Then, the last one: iRedact is also integrated with a custom web app that is just to manage videos.
That's pretty much
uploading, managing, tagging,
user authentication--
all those things.
So now, with that said, I want to go through four key features.
Maybe it's time to slow down and
give you slightly more details.
The first thing, as I said: the challenge is detecting faces in the wild. It's still a challenge, irrespective of recent progress.
And again, face orientation,
ratio, size, background--
everything can affect it.
Now the net result is that it's almost impossible for a single neural network to identify all faces.
Our approach was to run multiple networks in parallel and amalgamate the results.
And then these are
the three networks.
We saw the name earlier on.
Here is the expansion
of the name.
These are all academic
papers behind it.
There are also
libraries behind it.
So the approach that we took
in building the solution
is to use pre-trained networks
or pre-trained models.
And then we did a little bit
of transfer learning in order
to get the final
result from there.
That's key feature number one.
Key feature number two--
facial embedding generation.
This is, again, using
Google's FaceNet.
Essentially, for every bounding box for a face that we detected, we took it, we ran it through FaceNet, and that gave us a 128-dimensional vector that we stored.
That kind of became not
exactly a tracking ID,
but a unique, let's
say, facial fingerprint
for that particular person.
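The "same person?" comparison on those face prints is a distance test between embeddings. A minimal sketch, with toy low-dimensional vectors standing in for FaceNet's 128-dimensional output, and an assumed illustrative threshold:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_person(emb_a, emb_b, threshold=1.0):
    """Decide identity by embedding distance.

    FaceNet-style embeddings of the same face tend to sit close together;
    the threshold of 1.0 here is an assumption for illustration only.
    """
    return euclidean(emb_a, emb_b) < threshold
```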
Now key feature
number three-- now
we need to be tracking
people across the frames.
So we were hunting
for some methods
for it, academic solutions.
We ended up discovering there is something called SORT. This is Simple Online and Realtime Tracking.
But then there was another paper, Deep SORT, which built on SORT. So we first started off with SORT, then we looked at Deep SORT, and we integrated Deep SORT with the facial embeddings.
So the uniqueness
of the solution
is that first, we
took the images.
We did not stick with one
particular, pre-trained model.
We took multiple
pre-trained models.
We found a method of
amalgamating the results
together.
Then we introduced facial
embedding on top of it.
And finally, we introduced Deep SORT to actually track the person as they go through this whole solution.
So this is the
third key feature.
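The appearance side of that tracking step can be sketched as greedily matching each new detection to the nearest live track by embedding distance. Real Deep SORT also uses a Kalman motion model and Hungarian matching on a combined cost, so this is a deliberate simplification:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def associate(tracks, detections, max_dist=0.9):
    """Greedily match each detection to the nearest unclaimed track.

    tracks: {track_id: embedding}; detections: list of embeddings.
    The max_dist gate is an assumed value for illustration.
    """
    assignments = {}
    used = set()
    for det_idx, demb in enumerate(detections):
        best_id, best_d = None, max_dist
        for tid, temb in tracks.items():
            if tid in used:
                continue
            d = euclidean(demb, temb)
            if d < best_d:
                best_id, best_d = tid, d
        if best_id is not None:
            assignments[det_idx] = best_id
            used.add(best_id)
    return assignments
```

Detections that match no existing track would spawn new tracking IDs, which is exactly how the system-generated numbers in the demo arise.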
Now the fourth one is,
basically, clustering.
And again, clustering
is required
when you have the person
go out of the frame
and then come back
into the frame.
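Re-identifying someone who left the frame and came back reduces to clustering the stored face prints, so IDs 10 and 20 can be merged. The talk uses DBSCAN for this; the greedy single-link grouping below, with a distance threshold `eps`, is a deliberately simplified stand-in that shows the idea in a few lines:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_embeddings(embeddings, eps=0.8):
    """Group face prints whose pairwise distance is below eps.

    Greedy single-link grouping, a simplified stand-in for the DBSCAN
    clustering the actual pipeline uses.
    """
    labels = [-1] * len(embeddings)
    next_label = 0
    for i, emb in enumerate(embeddings):
        for j in range(i):
            if euclidean(emb, embeddings[j]) < eps:
                labels[i] = labels[j]
                break
        if labels[i] == -1:
            labels[i] = next_label
            next_label += 1
    return labels
```

Two tracks whose embeddings land in the same cluster get treated as one identity, even if the tracker assigned them different IDs.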
So these are the features.
So with that, let's say the
theoretical section of it
is a little bit done.
Now I'm going to be
talking about the results.
Of course, you saw
one of the results,
but that didn't include the redaction part of it.
So now I'm going to show
a couple of results.
And for this, whenever we
do any work in my group,
we want to compare
against the best.
And it just happened
that we were
comparing with YouTube Studio.
And then we got this opportunity
to talk at the Google ML
Summit.
So for every one of these use-case videos that we had, we took it and uploaded it into Google's YouTube Studio.
The YouTube Studio has a
capability to redact faces.
So if you're not familiar
with it, go check it out.
It has a feature there.
Let's see how iRedact compared.
So I'm going to be talking
about three demo videos.
The first demo video--
a lot of people, a lot
of different faces,
the camera is going to be
panning back and forth.
There are small faces,
big faces, partial faces.
This is the first demo.
This is the first, rather,
testing video that we took.
So here, this just shows a couple of screenshots where iRedact performed better compared to YouTube.
So in this case, the
video that you're seeing
is President Trump's
inauguration video.
I think it's 2017 or
20-something, January.
And then we just
took the video that's
available from whitehouse.gov.
And then we ran it
through our solution.
You can see, on
the left, that we
were able to detect
and redact more faces.
But then the YouTube version was
not able to detect small faces.
And I have the video
playing in a minute,
but this is just to show
where the differences are.
In this case, we were tracking First Lady Melania Trump-- to leave Melania Trump unredacted, and then detect every other face that is detectable, and also redact those particular faces.
So now, coming back to here again-- here, the circles show the difference: in iRedact, we were able to detect probably the Secret Service people walking behind the President.
We were able to
detect and redact.
But then YouTube was
not able to detect it.
Again, we are not saying we are
perfectly better than YouTube.
We have examples where
YouTube was slightly better.
Let's take a look at the
particular video here.
So as I said, we were
tracking Melania Trump here.
So every other person and human
face we are able to detect,
then we are able
to track, and then
we are able to redact as well.
Now let's go to the next one.
I don't want to play the YouTube video, but it's just present in here to show where the critical points are that it actually missed.
Now interestingly, when
I did the first demo
in an internal presentation, one
of the persons came and said,
are you a Republican?
I'm like, well, I'm a scientist.
In all my projects, I take the
role of a principal research
scientist.
Scientist and
Republican seem to be
the same number of characters.
So I said, we have to dispel this, so that we are not going to be seen that way.
So the next video
that we actually took,
happened to be just
President Obama's video.
It's one of the national
security briefings.
And here, the challenge
and the opportunities--
Obama was giving a talk.
There are going to
be very few people,
the faces are not moving
a lot, but the glass
that you see in front
of President Obama
actually has a
reflection of the face
of probably the video
camera or the person
who is shooting the video.
So that kind of
interfered with it.
Again, here it shows that, in this case, it's an example where YouTube, in exactly one instance, was able to detect a face that we could not.
We tried, multiple times,
but we could not detect
that particular, partial face.
So that is going to be at, I think, the 18th second. So let me just play it.
So it's going to be this
particular person, when she
bends down to take something.
We will be missing that face.
So again, it's the 18th second.
So you can just see that.
It's not even a full face.
It's just probably
half of the face.
And then we'll
take it from there.
So this is about this one.
This is YouTube, which
does a very good job.
They kind of track
all the faces.
And then they're
also redacting it.
You will see the same person.
They do not miss the face.
So if there is somebody
from YouTube Studio,
I'm happy to exchange
notes later on.
This is on the 18th second.
The same video.
The person will come down.
They still maintain
the tracking.
And they're able to show that.
The point here is
it's very important
to know that in the
law enforcement area,
privacy is a very
important thing.
Even if you miss a single face,
your video or video evidence
may not be accepted in a court.
And that's why we take very good care to redact all these faces.
Now demo three-- whether we
are Republican or Democrat,
it doesn't matter.
Most of the time, people
in all the countries,
what do they care about?
Shopping.
So we took a Black Friday video.
A very noisy environment.
Here, we are not perfect,
neither is YouTube.
But then we did a much better
job compared to YouTube.
Here is an extremely
noisy environment
with a lot of people
going back and forth.
And then there's a
lot of racks in there.
We did not do a good job,
but, at the same time,
this is not an evidence video.
We just took it as a challenge.
We ran it through the solution.
You will see that we did miss
a few faces here and there.
We are tracking this
particular person there,
but most of the other faces
we were able to detect.
Yes, there are a few faces on those boxes as well.
So this is to show a third
video that we took in here.
So again, the primary thing
that I wanted to share here
is about a solution that we
painstakingly put together
to show what's possible
with open source libraries
and a little bit of innovation on top of that.
So now this kind of brings an end to the presentation that I wanted to give here.
But I want to give
you a sneak peek.
So the first thing is iRedact.
I told you about it.
This talk is about
face redaction.
Here is number
plate, street sign.
And we also have a very interesting project that we are doing on defect detection in utility pipes. I'm just going to take a couple more slides to show that.
So iRedact number plate--
here is one detecting
the number plate.
And then you know
with high confidence
what the number plate is.
But here is a
video of the number
plates being redacted as well.
You will see that, here,
we processed live video.
And we need to detect any
number of number plates
and we need to redact it.
And this just goes on for seconds, and then you will see the video accelerate and slow down.
But this is still
a work-in-progress.
This is not a fully
baked solution.
So you would see that there
are very few instances where
we do miss the number plates
from some certain [INAUDIBLE]
in there.
So talking about street signs--
street signs are actually
an interesting problem
for us because, again,
as I said, if there is
going to be an evidence
video from CCTV that shows,
let's say, any unfortunate
incident in that neighborhood--
if you don't redact the street
signs, the property value
is going to go down.
And that gets into lawsuits
and stuff like that.
So we need to
detect street signs.
We need to be redacting it.
So here, what you see is
a couple of street signs.
On the first column, you
have just the street signs
on the wall.
There's a street sign directly independent of the wall. We are able to detect the street signs, and then we are able to redact them.
So you might be wondering
at this state, hey,
you're just doing OCR, and then
you're doing bounding boxes,
and then you're
trying to redact it.
Actually, the solution
is not that simple.
Because here is a street
sign from a video.
You can see from the result below that, yes, we do do OCR-- it says "car wash", but we know that that's not a street sign.
So in this case, we
don't need to redact it.
So it's not just OCR; there is some library reference that we do, to say: is this an address, or is this actually some name board lying down somewhere?
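That post-OCR check can be sketched as a heuristic on the recognized text. The suffix list and rule below are entirely hypothetical illustrations, whereas the actual system consults a reference library to decide if the text is an address:

```python
import re

# Hypothetical suffix list; the real system does a library reference lookup
STREET_SUFFIXES = (
    r"(St|Street|Ave|Avenue|Rd|Road|Blvd|Boulevard|Ln|Lane|Dr|Drive|Way|Ct|Court)"
)

def looks_like_street_sign(text):
    """Heuristic: OCR text containing a street suffix is treated as an
    address, and therefore a redaction candidate; other text is left alone."""
    return bool(re.search(rf"\b{STREET_SUFFIXES}\b\.?", text, re.IGNORECASE))
```

Under this rule a sign reading "NE 45th St" would be redacted, while the "car wash" board from the example would correctly be left alone.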
So that's iRedact.
And just one last project. This one is not a video, it's just a frame in here.
You will see that
this is a utility
pipe running underground in
the US, in one of the counties.
And right now, the process is
that they send a robot down it,
and somebody records the video,
painstakingly looks at all
those things, registers
all the cracks.
The picture on the left
just shows a frame.
Let's say at some time we detect a crack on the top. In the next frame, the crack is actually expanding.
This is, again, using
deep neural networks.
This is a very big utility.
And here, because the pipes
in the US are getting old,
we need a more efficient method
of doing this whole stuff.
So this is, again,
an ongoing project.
As I said, I'm not a superman. I just have a team. I'm a principal research scientist.
We go in packs, comprising data scientists, data science engineers, and data engineers-- just to give recognition to my team.
And, of course,
my extended team.
And having talked about inclusion and diversity, we are making conscious efforts-- me, and Hitachi in general.
And you can say, what is
your definition of diversity?
This is my new team.
Right now, they are interns.
We are hoping to hire them from January 2020.
Again, they also contributed to
iRedact street sign and number
plate.
And the previous slide
was about the face.
That really brings
us to the last slide.
If you have anything private,
please, use my Gmail.
I live in Washington,
D.C. metro area,
one mile from Washington
Dulles Airport.
If you have anything
official to talk with,
that's my official
email address.
I'm happy to give you name
cards as well, later on.
With that said, I think
I'm right on time.
The floor is open to questions.
SPEAKER 1: Any questions?
