>> Hello everyone. I
am Prashant Namekar.
So, here I'm representing Gaia Smart Cities. We are a small startup in the IoT space.
What we are doing at Microsoft is trying to develop speech recognition on small-footprint devices. On the edge itself we'll try to do everything, and on the edge itself we'll try to spot the keywords with a very low RAM footprint.
So, here is what we proposed as an objective. The main objective we had is that we will be putting all the machine-learning code on small-footprint devices that have extremely low RAM sizes. Apart from this, we are using audio input mics, so we have to take the input from the mic, process it, and put it into the RAM. All the processing should happen in less than 100 milliseconds, so we will need fast computing, which means the machine-learning model size should be small. Other than that, we will reduce the number of computations required. Also, we will be focusing on the accuracy: what we can get and what more we can achieve. So, this is what we planned.
The tangible outcomes we got in one month of time are: we have developed one generic tool. What the tool does is, you just pass the checkpoint files, like the TensorFlow model files, to it, and it creates microcontroller-independent libraries, that is, .h files and C library files. You can directly put those onto the microcontroller.
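To give a flavor of the output, here is a hypothetical sketch of the kind of .h file such a tool might emit; all names and sizes are illustrative, not the team's actual generated code:

```c
/* model_weights.h -- hypothetical sketch of an auto-generated header.
   Names, dimensions, and values are illustrative only. */
#ifndef MODEL_WEIGHTS_H
#define MODEL_WEIGHTS_H

#define INPUT_DIM   32
#define HIDDEN_DIM  16
#define NUM_CLASSES 7

/* Parameters exported from the TensorFlow checkpoint, flattened row-major. */
static const float W_INPUT[HIDDEN_DIM * INPUT_DIM]   = { 0 /* ...512 floats from the checkpoint... */ };
static const float U_HIDDEN[HIDDEN_DIM * HIDDEN_DIM] = { 0 /* ...256 floats... */ };
static const float DENSE_W[NUM_CLASSES * HIDDEN_DIM] = { 0 /* ...112 floats... */ };
static const float DENSE_B[NUM_CLASSES]              = { 0 /* ...7 floats... */ };

#endif /* MODEL_WEIGHTS_H */
```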
Other than that, we have developed our own machine-learning libraries for the Cortex-M4 and Cortex-M7, for LSTM, FastGRNN, and also GRU. In one month of time, we have run almost 60 comparative experiments on different Recurrent Neural Network models. We have also built an Android app to gather datasets: currently we are using the Google 7-keyword and 30-keyword datasets for our models, and we will be moving to some of the Indian languages. The different ML models we have worked on are DS-CNN, GRU, LSTM, FastGRNN, and FastGRNN with low rank.
So, this is the microcontroller we will be using for our development purposes. The RAM size I spoke of is 128 KB, with 512 KB of flash, and it operates at a CPU speed of 180 MHz. We have developed the same thing for both the M4 and the M7. The main reason for using the M4 and M7 is that they have built-in DSP instruction support, and some Neural Network libraries are supported.
We are using the DSP library for dot products, addition, subtraction, vector copy, and some of the FFT functions. Also, we are storing the wave files in the .h format for offline training and such, and we save the Hamming-windowed frames in the .h file itself.
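For example, the dot products and FFTs mentioned here map naturally onto CMSIS-DSP calls along these lines (a sketch only; the buffer names and FFT length are our assumptions):

```c
#include "arm_math.h"   /* CMSIS-DSP, which exposes the M4/M7 DSP instructions */

#define FFT_LEN 512

/* Dot product of two float vectors using arm_dot_prod_f32. */
float32_t dot(const float32_t *a, const float32_t *b, uint32_t n)
{
    float32_t result;
    arm_dot_prod_f32(a, b, n, &result);
    return result;
}

/* Real FFT of one windowed audio frame (in-place input, packed output). */
void fft_frame(float32_t frame[FFT_LEN], float32_t spectrum[FFT_LEN])
{
    arm_rfft_fast_instance_f32 s;
    arm_rfft_fast_init_f32(&s, FFT_LEN);
    arm_rfft_fast_f32(&s, frame, spectrum, 0 /* 0 = forward FFT */);
}
```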
So, I'll be passing
the mic to Sneha.
>> Next, I will be taking you all through the workflow of keyword spotting. Basically, the data we have is audio, and we cannot directly send the audio to the neural network to get predictions. First of all, we need to extract some features out of the audio. Here you can see the speech signals: we process them to get the speech features, and then put those features into the neural network. The neural network then gives us the predictions for the audio.
For this problem, we have used the dataset given by Google, called Google Speech Commands. It has 30 keywords, like yes and no, and it comes with background-noise files as well. We have trained models for the 30-class and seven-class problems. Also, the train, test, and validation split was given in the dataset itself.
The features we have extracted are the logarithm of the mel-scale filter banks. These are extracted for one second of audio, so we get a feature matrix of dimension 99 x 32. The dataset is also normalized with the mean and standard deviation of the training dataset.
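As a small illustration of that normalization step (array names are ours; the mean and standard deviation come from the training set and would be stored as constants):

```c
/* Normalize the 99x32 log-mel feature matrix per feature, using the
   training-set statistics, as described above. */
#define NUM_FRAMES  99
#define NUM_FILTERS 32

void normalize_features(float feat[NUM_FRAMES][NUM_FILTERS],
                        const float mean[NUM_FILTERS],
                        const float stddev[NUM_FILTERS])
{
    for (int t = 0; t < NUM_FRAMES; ++t)
        for (int f = 0; f < NUM_FILTERS; ++f)
            feat[t][f] = (feat[t][f] - mean[f]) / stddev[f];
}
```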
Next, these are the models we are using: basic RNN models. I'm sure you are all well aware of what RNN models are, so I will not go into them much. We're using LSTM and FastGRNN, which are two different variants of the RNN. We have trained three different models: LSTM, FastGRNN, and FastGRNN with low rank.
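For context, a minimal sketch of a single FastGRNN time step, following the update rule from the FastGRNN paper; the matrix names and dimensions here are illustrative:

```c
#include <math.h>

#define INPUT_DIM  32
#define HIDDEN_DIM 16

/* One FastGRNN step:
     z_t     = sigmoid(W x_t + U h_{t-1} + b_z)
     h_tilde = tanh   (W x_t + U h_{t-1} + b_h)
     h_t     = (zeta * (1 - z_t) + nu) * h_tilde + z_t * h_{t-1}
   where zeta and nu are trained scalars. h is updated in place. */
void fastgrnn_step(const float x[INPUT_DIM], float h[HIDDEN_DIM],
                   const float W[HIDDEN_DIM][INPUT_DIM],
                   const float U[HIDDEN_DIM][HIDDEN_DIM],
                   const float bz[HIDDEN_DIM], const float bh[HIDDEN_DIM],
                   float zeta, float nu)
{
    float pre[HIDDEN_DIM];
    for (int i = 0; i < HIDDEN_DIM; ++i) {
        float s = 0.0f;
        for (int j = 0; j < INPUT_DIM; ++j)  s += W[i][j] * x[j];
        for (int j = 0; j < HIDDEN_DIM; ++j) s += U[i][j] * h[j];
        pre[i] = s;
    }
    for (int i = 0; i < HIDDEN_DIM; ++i) {
        float z  = 1.0f / (1.0f + expf(-(pre[i] + bz[i])));
        float hc = tanhf(pre[i] + bh[i]);
        h[i] = (zeta * (1.0f - z) + nu) * hc + z * h[i];
    }
}
```

Sharing W and U between the gate and the candidate state is what makes FastGRNN so much smaller than a GRU or LSTM of the same hidden size.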
Now, we have trained more than 60 models with different classes, optimizers, hidden-state sizes, and learning rates. Here are some of our results: for each model you can see the learning rate, optimizer, batch size, epochs, weight size, accuracy, and so on.
Then, after doing all these experiments, we finally settled on one model to deploy on the microcontroller. The deployed model had the following specifications: it was a seven-class model with 16 hidden states, an accuracy of 91.1%, and a weight size of 3.6 KB; it was trained with background noise and quantized non-linearities.
Next, I would like to
call upon Shubham.
>> Hi, my name is Shubham. Now we have trained our models in TensorFlow, so the important thing we want to do is get all these models onto our microcontroller. The first thing we need to do is somehow convert the whole TensorFlow model into a format that is recognized by C. For that particular purpose, we decided to create a tool that auto-generates these .h files for us, containing all the important information of the model, such as the hidden weights, the input weights, the dense-layer weights, the dense bias, and other information.
Apart from that, we also populated our .h file with some constants which are very critical for audio processing, such as the filter banks, the Hamming window, et cetera. Something we observed during the course of the workshop was that the filter-bank matrix was sparse, and in the sparsity we observed a very distinctive pattern that allowed us to exploit it.
The pattern we observed was something like this: initially there were some zeros, then some non-zero elements, and then more zeros. So what was actually happening was that each row had a contiguous run of non-zero elements, preceded and succeeded by zeros. This kind of observation made us think: let us see how many non-zero elements we have in this whole matrix.
The results we found were quite surprising. As you can see, the entire matrix is 32 x 257, which comes to 8,224 elements in total. But in the whole matrix, we found there were only 458 non-zero elements. That got us thinking: if we multiply this matrix with a vector of length 257, we have to perform around 8,224 multiplications for the whole matrix, and most of those are multiplications by zero. So instead of that, we said: why can't we represent this in another format and do only the 458 multiplications corresponding to the non-zero elements?
After thinking about how we could do this, we came up with this particular representation: we store, for each row, the length of its run of non-zero elements and the column index at which the non-zero elements start, plus all the non-zero elements packed into a single array of 458 elements.
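A sketch of a multiply using that representation might look as follows (array names are ours; the per-row layout follows the description above):

```c
/* y = M * x, where M (32 x 257) is stored as: for each row, the start column
   of its non-zero run, the run length, and the non-zeros packed row-major.
   Only the ~458 non-zeros are touched instead of all 8,224 entries. */
void sparse_filterbank(const float *x,                  /* spectrum, length 257 */
                       float *y,                        /* output, length 32 */
                       const unsigned short *run_start, /* per row: first non-zero column */
                       const unsigned short *run_len,   /* per row: run length */
                       const float *packed,             /* all non-zeros, 458 values */
                       int num_rows)                    /* 32 filters */
{
    const float *p = packed;
    for (int r = 0; r < num_rows; ++r) {
        float acc = 0.0f;
        const float *xs = x + run_start[r];
        for (int k = 0; k < run_len[r]; ++k)
            acc += p[k] * xs[k];
        y[r] = acc;
        p += run_len[r];
    }
}
```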
So initially, as you can see, we were storing 8,224 elements for the filter banks, but with this kind of representation we have only around 500 elements, 500 floating-point numbers. As an estimate of the size: each floating-point number is four bytes, so the full matrix would have occupied about 32 KB of flash, but with this representation we occupy only about 2 KB of flash. Secondly, the computation is much faster.
Indeed, this proved to be very beneficial to us, because we were using this multiplication every 10 milliseconds. As an estimate of how much time we saved: initially, a model built without any of these optimizations, an LSTM model, took around 1.58 seconds; but when we shifted to a FastGRNN model and performed this optimization and several others, we brought the time down to 89 milliseconds. So there was a huge gain for us from using this kind of matrix representation. I would like to hand over the mic to Yash, and he'll explain the whole pipeline in summary. Thank you.
>> Now we have received the .h file, and we need to do the predictions. In the microcontroller, we have a DMA buffer of 20 milliseconds; it can store a 20 millisecond audio sample. We have a MEMS microphone from which we get the PCM values, and we store them in our DMA buffer. When this DMA buffer fills its first 10 milliseconds, it gives a Half Transfer callback, and we copy that 10 milliseconds of audio into our 30 millisecond buffer, which is used for feature extraction. We made that buffer 30 milliseconds because we need a 25 millisecond window to extract the features. After that, the DMA keeps updating: while we transfer the data to our 30 millisecond buffer, it keeps filling the 10-to-20 millisecond half of the DMA buffer. After completing the full 20 milliseconds, it gives a Full Transfer callback, and I copy the 10-to-20 millisecond audio over, and similarly for 20 to 30.
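As a sketch, with the STM32 HAL and an I2S-driven MEMS microphone, this double-buffering scheme could look roughly like this; the buffer names and the 16 kHz sample rate are our assumptions:

```c
/* Sketch of the double-buffering described above (STM32 HAL, I2S MEMS mic). */
#include <string.h>
#include "stm32f4xx_hal.h"

#define SAMPLES_10MS 160                    /* 10 ms of mono PCM at 16 kHz */
static int16_t dma_buf[2 * SAMPLES_10MS];   /* 20 ms circular DMA buffer */
static int16_t frame_buf[3 * SAMPLES_10MS]; /* 30 ms buffer for feature extraction */

/* Fires when the first 10 ms half of the DMA buffer is full. */
void HAL_I2S_RxHalfCpltCallback(I2S_HandleTypeDef *hi2s)
{
    /* Slide the 30 ms window left by 10 ms, then append the new 10 ms. */
    memmove(frame_buf, frame_buf + SAMPLES_10MS, 2 * SAMPLES_10MS * sizeof(int16_t));
    memcpy(frame_buf + 2 * SAMPLES_10MS, dma_buf, SAMPLES_10MS * sizeof(int16_t));
}

/* Fires when the second 10 ms half is full. */
void HAL_I2S_RxCpltCallback(I2S_HandleTypeDef *hi2s)
{
    memmove(frame_buf, frame_buf + SAMPLES_10MS, 2 * SAMPLES_10MS * sizeof(int16_t));
    memcpy(frame_buf + 2 * SAMPLES_10MS, dma_buf + SAMPLES_10MS, SAMPLES_10MS * sizeof(int16_t));
}
```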
After I fill the 30 millisecond buffer, I use the 25 millisecond window to extract the features and get the 32 log mel filter-bank values, and I initialize the first FastGRNN layer's 16 hidden units with zeros. This is the first time frame of the FastGRNN layer, since we are using zero to 25 milliseconds: this is the input vector, and this is the previous hidden state. We compute the next hidden units in the FastGRNN gate and carry those over, and we use a 10 millisecond stride. So I shift that 30 millisecond buffer, take the next 30-to-40 millisecond audio from the DMA buffer, store it again in our 30 millisecond buffer, extract the mel filter bank again, and do the same for the second time frame of the layer. We do this for 99 steps, since we have trained our model on one second of audio, which has 99 frames of 25 milliseconds with a 10 millisecond stride.
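Putting the loop together, a rough sketch of those 99 steps (helper names are illustrative and assumed to be provided elsewhere in the firmware):

```c
#include <stdint.h>

#define NUM_FRAMES  99
#define NUM_FILTERS 32
#define HIDDEN_DIM  16

extern int16_t frame_buf[];            /* sliding 30 ms audio buffer */
extern void wait_for_next_10ms(void);  /* blocks until the DMA slides the window */
extern void extract_logmel(const int16_t *win, float feat[NUM_FILTERS]);
extern void fastgrnn_step_h(const float x[NUM_FILTERS], float h[HIDDEN_DIM]);
extern int  dense_softmax_argmax(const float h[HIDDEN_DIM]);

int classify_one_second(void)
{
    float h[HIDDEN_DIM] = {0};  /* hidden state initialized with zeros */
    float feat[NUM_FILTERS];

    for (int t = 0; t < NUM_FRAMES; ++t) {  /* 99 frames: 25 ms window, 10 ms stride */
        wait_for_next_10ms();
        extract_logmel(frame_buf, feat);
        fastgrnn_step_h(feat, h);           /* update hidden state in place */
    }
    return dense_softmax_argmax(h);         /* predicted keyword index */
}
```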
Here, we are running 20 FastGRNN layers and computing the predictions. I will show you how it works. If I take a two-second audio, I initialize the first FastGRNN layer at the start, compute that FastGRNN layer, and get a prediction, like six. I take the prediction as the maximum of the softmax probabilities that we obtain after applying the dense layer, and I store it in a buffer. The buffer holds the previous ten predictions.
Then I initialize the second FastGRNN layer after 50 milliseconds: the first FastGRNN layer runs from zero seconds to one second, and the second runs from 50 milliseconds to one second and 50 milliseconds. I do the predictions again, keep the result in my buffer of previous predictions, and do the same again and again. Say keyword three has been predicted five times here: the system counts how many times each particular keyword has been predicted, and if a keyword goes above five times, it declares that keyword and shows the output.
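A minimal sketch of that smoothing logic, assuming the ten-prediction history and the threshold of five mentioned in the talk:

```c
/* Prediction-smoothing buffer as described above; names are ours. */
#define HISTORY_LEN    10
#define VOTE_THRESHOLD 5

static int history[HISTORY_LEN] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1}; /* -1 = empty */
static int history_pos = 0;

/* Record the newest prediction; return the keyword index if it has been
   seen more than VOTE_THRESHOLD times in the window, else -1. */
int vote(int prediction)
{
    history[history_pos] = prediction;
    history_pos = (history_pos + 1) % HISTORY_LEN;

    int count = 0;
    for (int i = 0; i < HISTORY_LEN; ++i)
        if (history[i] == prediction)
            count++;

    return (count > VOTE_THRESHOLD) ? prediction : -1;
}
```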
We run 20 FastGRNN layers here: the 20th layer starts just before one second and ends just before the two-second mark. When the first second is completed, I re-initialize the first FastGRNN layer and start computing its prediction again.
>> Hello. I'm Sheena Joy. As we already discussed, we're converting the CKPT files to .h files so that they can be fed into the microcontroller and used for further processing. This is the tool that helps us do it. First, all you need to do is upload the checkpoint files and the corresponding mean and standard deviation files. In the background, these files get uploaded into an Azure Blob, and then a script starts executing which converts those files into the .h file, which in turn gets uploaded into another Azure Blob. Once this process is done, you are navigated to another page where, by clicking "Download", you can download the generated weights file from the Azure Blob to your local system. From there, you can feed it into the microcontroller.
The best part is that this whole tool can be used with any microcontroller; that is, it is independent of the microcontroller, because it uses only basic data types. So you can use this tool for any other microcontroller. Right now it works only for FastGRNN and LSTM, but again, you can extend it to any other model.
In these three weeks, we had a chance, as Nehal already mentioned, to work with around 60 models. We have trained models using LSTM, FastGRNN, and FastGRNN with low rank; each of these we've trained with the 30-class and the seven-class setups, and again each of those with background noise and without background noise.
We've had some observations. Some of them are pretty evident: the seven-class accuracies are better than the 30-class ones, and the accuracies of the models without background noise are better than those with background noise. Another one concerns the FastGRNN low-rank model: for a higher number of hidden states, that is 128 hidden states, it worked very well and gave us better accuracy, but when we used it with 16 hidden states, it gave us slightly less accuracy.
So we had to narrow it down between FastGRNN and LSTM, and we were only keen on using the 16-hidden-state, with-background-noise models. For the 30-class with-background-noise model, FastGRNN gave us the better accuracy, but for the seven-class with-background-noise model, the LSTM gave slightly better accuracies, maybe around one percent higher or less. But considering the model size, we had to choose the FastGRNN model: it is three times smaller than the LSTM model, and we have to fit this on a constrained device. So we chose the FastGRNN model, and we used the quantized non-linearities for it.
So, finally we had a model size of around 3.6 KB with 91.1 percent accuracy. Apart from that, we had some other observations. Initially we used the tanh and sigmoid functions, but computing them on the microcontroller took a lot of time. When we started using the quantized non-linearities, which use only basic functions like minimum and maximum, the computation time decreased.
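These quantized non-linearities are the piecewise-linear approximations used in Microsoft's EdgeML work; a sketch, built from only min and max:

```c
#include <math.h>

/* quantTanh(x): approximates tanh(x) by clamping x to [-1, 1]. */
static inline float quant_tanh(float x)
{
    return fmaxf(-1.0f, fminf(1.0f, x));
}

/* quantSigmoid(x): approximates sigmoid(x) by clamping (x + 1) / 2 to [0, 1]. */
static inline float quant_sigmoid(float x)
{
    return fmaxf(0.0f, fminf(1.0f, (x + 1.0f) * 0.5f));
}
```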
While training the models, the Adam optimizer worked well for us when we were using the quantized non-linearities, and the momentum optimizer worked well when we used the normal functions. Also, initially we used the log function, but because it takes its input as a double and we were providing a float, there was a type promotion happening which took a lot of time in the computations. When we switched to the logf function, it helped with the computation time.
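A small illustration of that log versus logf point; on the Cortex-M4's single-precision FPU, the double-precision path falls back to slow software arithmetic:

```c
#include <math.h>

/* log() takes a double, so the float argument is promoted and the whole
   computation runs in software double precision on a single-precision FPU. */
float log_energy_slow(float x) { return (float)log(x); }

/* logf() stays in hardware single precision. */
float log_energy_fast(float x) { return logf(x); }
```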
Also, while training, we noticed that after certain epochs the loss spiked up. To overcome that, after a certain number of epochs we halved our learning rate. We've validated the Microsoft FastGRNN model in real time on our microcontroller, which has a single-core processor. I would like to hand it over to [inaudible].
>> I mean, the motivations we built this with were an extremely small footprint, low computation, a battery that has to last over six months, and no connectivity: what Alexa does, we wanted to do completely on the edge. Those were the motivations. Then we realized it could also solve these problems of security, of privacy, of connectivity and power, and that's what we've extended it to. In addition, the models we have built, the framework we have built, is easily scalable across use cases. So, while today we're doing it for a certain number of words, for one particular sector, we believe that with the framework we have, it can be replicated easily.
In fact, the microcontroller we are using is an ARM M4. It's a really small footprint: 180 megahertz, with 256 KB. That way, the power it consumes is just 89 microamperes per megahertz of computation. Right now, with what we have developed, we need just 89 milliseconds for one second of speech analysis. That's faster than you speak; I think we can analyze what you're trying to say before you've finished saying it. This is done at 3.6 KB without quantization, as the team mentioned earlier; when you quantize, it comes to less than 1.0 KB.
So, that's the STM32 F4. This is the Azure IoT Sphere; that's the master controller, which runs at about 500 megahertz on an ARM A7. It's far superior in terms of computation power. We believe that if we increase the vocabulary even to 100 words and don't quantize, we can still get a model size of not more than 9.5 KB. That expands the number of use cases we can apply this to. We're trying to extend it to services and retail, hospitality services, and industrial manufacturing as well. Pretty much, this doesn't require connectivity or power. There are a number of use cases where we believe we can move fast: panic detection in cabs, feedback, even usage in cities; all of that is where we can go.
This one I want to talk about. Yesterday I met this [inaudible] from Intel, and they have a smart-home application which they don't want to be heavily connected to the internet; they want a lot of the processing on the edge itself. They don't want to contact the server for every analysis, like Alexa does. They believe this can be a good use case.
Similarly with the connected car: just for starting a car, they want to move beyond the start button, and if you can give them the ability to start it by talking, with the analysis and the voiceprint done on the edge, it solves a lot of problems for a lot of people. Connectivity, security.
What we are trying to do is increase the number of languages; in fact, we built this model to replicate across languages. We have to work on false positives: we are getting false positives right now. We want to extend it to more use cases. In fact, you will see in the demo that it's a pretty small-footprint thing that can run on a single AA battery; we think it can last for about two months with at least 1,000 usages per day. We want to do beta trials. We also want to add a speaker to it. We have one model now doing just speech recognition; we would like to do gender detection and sentiment analysis. Also, the footprint shouldn't grow too much.
I'll end with this; well, I can't end without thanking the Microsoft team. [inaudible] Harsha, Pratique, Bridj, who yesterday helped us solve the real-time problem. We were doing everything offline, but we got real-time working just last night. We had help from Bridj, Sateesh, Chris, all of them. Thank you so much. Without further ado, I'll call Prashant to give you a demo.
>> Okay. Meanwhile, I can do a demo on the small kit. This is an STM32 M4 microcontroller with a RAM size of 128 KB. Our biggest challenge was to take the audio in real time, process it, and make the prediction, all in a small amount of time. So I will be giving the demo on that. For prototyping, we are using LEDs to display the predictions. Here, if I say yes, this LED should blink. This LED will be blinking continuously just to show that it is on. If I say no, this one should blink, and similarly for the others. One thing: yesterday night, we all cracked this; actually, it only came together yesterday. Okay, I'll give a demo.
>> That's a small power source.
>> Okay, you can see that the red LED is blinking here. This is the power. Yes. [inaudible].
>> One.
>> It blinked on "one".
>> One. One. One.
>> You can say others.
>> Four. [inaudible].
>> But the "yes" is still on.
>> [inaudible] We are pretty confident that we can crack this further; just give us one more month on it.
>> With that, essentially, with the model even quantized, we can do sentiment analysis, and panic can be detected easily on the edge, without even contacting a server. Those are the USPs of what we're trying to do.
>> Actually, we used YouTube buzz noises and such, so we are cracking that as well.
>> It does noise cancellation, and all the background noise is taken away, too.
>> We'll address that in our experiments.
>> [inaudible].
>> Yes, it has some features for that, like the signal-to-noise ratio if I hear something. We haven't studied that; we have been more focused on the model side, achieving and perfecting that. Going forward, we'll focus on the other parts as well.
>> The size is what is
really exciting for us;
size and the power consumption.
>> All this in real time.
>> Yes. All this in
less than 89 milliseconds
is what we calculated.
>> What helped us achieve this: one thing was the sparse matrix for the mel filter bank that he talked about previously. It was taking around 8,000 computations; then we observed that we don't actually need that many, and we reduced it to around 450 or so. Other than that, some DSP instruction sets and such, and that's it. [inaudible].
>> But these are all the-
>> Yes. We can show that.
>> [inaudible].
>> Only we can show you.
>> Yes, you can show me.
>> Okay. Connect it to the monitor and show this one.
>> [inaudible] Are you going to deploy anything next month when you get back to work?
>> If we get the grant, yes.
>> In terms of your deployment, which is the next step?
>> I think there are a couple of things we need to crack before that. One is the false positives; we need to reduce false positives. We'd also like to increase the languages, because in India, the languages-.
>> [inaudible] When is it going to go, when is your big use case?
>> The first use case will be done with the Airports Authority of India; we've got 32 airports signed up for this. It's part of a mission where it's toilets, and the thing with toilets is that people don't want to come out and press a button, because they think it's dirty, so they'd rather say something than press a button. That's why speech came in and voice came in. In fact, they wanted sentiment analysis, too. We've done sentiment analysis, but the footprint is too small to do it today and we couldn't crack it. But these requirements are coming from the market; we haven't invented them.
>> [inaudible] feedback systems are the target applications, right? I would imagine that you would be vulnerable to, like, adversarial inputs.
>> Example?
>> In the sense that [inaudible]. In fact, people have shown attacks against [inaudible] systems: you can do ultrasonic attacks [inaudible] what they're saying. We can do the ultrasonic attacks at a distance [inaudible] very, very low [inaudible].
>> Honestly, no.
>> [inaudible] a feedback
system and I wanted
a teleportly instead of
new and supplying them I would
just [inaudible] no words,
who they want or
what do they want.
>> We haven't thought about that, but what we have thought about is one person giving multiple feedbacks. We did that on the Qualcomm Snapdragon, which has high processing power: we analyzed the voiceprint and tried to detect duplicates there. We haven't done it on this device because of the size, but obviously we can work out how to detect duplicates in the voiceprint. You don't need to know who it is; just a voiceprint analysis can help us remove duplicates.
>> No but this-
>> [inaudible] One of the things that people in adversarial machine learning have found is that the reason a machine-learning model can be fooled is because they can [inaudible]. They're all overfitting. So I wonder whether, because you're building a compact model to fit on these devices, it might as a side effect be more [inaudible] to adversarial inputs.
>> We hope the IoT Sphere has some kind of security checks for that.
>> [inaudible] I'm
quite curious actually.
>> Yes, truly, as an experiment, because people have suggested that the reason this adversarial machinery [inaudible] is because, hey, these things are all overfitting, and a more compact model is naturally robust [inaudible].
>> Hi, this is Manik. We haven't studied adversarial attacks. I doubt we would be able to cope with targeted adversarial attacks, but I think we do get some robustness to sensor noise. So if your sensor is not very good quality, then because of the quantization, etc., the model might be robust to that, but I doubt we would be able to overcome a sustained adversarial attack.
>> Firstly, we can move the problem away and handle it differently. The system works by proximity: if the whole system can work on proximity, you can put in a proximity sensor and make it active only when someone is close to it. That would probably do away with the problem, unless somebody comes close to the-. There are possibly other ways to offset that. Don't keep it switched on; in fact, that helps us save power, too. Don't make the system active without somebody coming close: since your interest is in feedback, a simple proximity sensor can help us solve that.
>> I also have a question,
if you don't mind.
>> Sure.
>> Go ahead.
>> So, if you have hundreds of kilobytes of RAM and flash available, do you really care whether the model is one kilobyte or 20 kilobytes? Can you share a little bit about how much space you are going to allocate for this RNN or LSTM among all the other things you also need to fit onto the microcontroller?
That was one thing. The second thing: I missed a bit of your presentation, so perhaps you already spoke about this, but maybe more than RAM, your battery consumption might be the bigger issue. So, did you test whether the LSTM was faster? For your RNN, how much was the consumption? How much was the computational cost in terms of battery requirements and battery life?
>> So, we calculated it hypothetically; we just didn't measure it experimentally. We didn't have the option to do it experimentally. That's one. Number two, the size does matter if you're actually looking to increase the feature set or other functionality. As we discussed some time back, if you try to do sentiment analysis, or gender detection, or age detection of the person who's talking, these are all things that can come in, and if you want to extend it to include all of these, I guess size does matter.
>> But let's not speak hypothetically. For your concrete application, how much space do you have reserved, or how much space have you allocated or earmarked for the RNN or LSTM?
>> This microcontroller has 128 KB of RAM, and this entire application consumes 123 KB of that RAM for the complete computation, prediction, and taking in the input data. Actually, most of the memory is going to those 10 millisecond buffers.
>> Hold on. The RNN will fit in flash, right? You won't have put it in RAM, I assume.
>> No, no, no. But we may need to have some matrices here in the RAM, so as to take the weights from flash and compute with them in the RAM.
>> Sorry, I didn't follow; maybe I'll take this offline with you.
>> [inaudible] So, you're hypothesizing from the management point, but [inaudible] is asking if the analysis can be done in the environment here. I don't know if you tested it.
>> No, we don't know. We don't know.
>> [inaudible] we realize
it's not here [inaudible]
>> We are running out
of time. So, thank you.
>> Okay. Thank you.
>> Thank you.
