Sound recognition is the art and science of having machines identify sounds, and these are sounds beyond speech and music: dogs barking, glass being broken, babies crying and so on. It is in itself a new academic discipline as well as a new industrial discipline, so it has many elements we understand and some elements that we don't understand yet.
Sound recognition is inherently an artificial intelligence subject, which means the first place you start is with the data, which in this case is going to be the audio recordings themselves. With any sort of AI subject, the important thing to understand is the variability in the data: all the things that make the sounds different. If you take something even as simple as glass break, one of the things you run into immediately is that people have a pre-defined idea of what glass break sounds like, and it often comes from Hollywood movies. Hollywood movies have a specific technique, something called Foley art, where they actually have a room in which they make the sounds to the script as it goes, and they're trying to make it sound like you think it should sound. It's a bit of a self-reinforcing method, and it doesn't necessarily sound like the real thing.
It's fascinating: there's a company out the back of Cambridge, where we're based, who supply a lot of audio into Hollywood movies, and one of the things I know they supply is a car alarm sound effect. Because I'm used to it now, I hear it in every single movie there is, and then everybody thinks car alarms sound like that one car alarm, when in fact there's a diverse range of car alarms on sale; you just get used to it. I think the same is true of a certain scream sound effect, actually. There are examples in music as well, like the famous break that's used across a range of different music; its heritage has been studied, traced through hip-hop into house music and so on, an evolution in its own right. The Amen break: fascinating stuff. And the same kind of thing happens for sound. Now, if you take
something like a glass break, and you imagine the actual thing is a glass window, and the application for the sound recognition is that somebody's trying to break into a property, then you want to know that that's occurring, and to detect it before they're into the property. So they're trying to, you know, smash it with a hammer or a crowbar or whatever else they're trying to smash it with, and that might make the sound different. Then you've got the different types of glass, different sizes of glass, different thicknesses of glass: you can have laminate, plate, wired, tempered glass, and all of them sound slightly different. Imagine two extremes. Straightforward plate glass, which is what most of us have in our houses in our front windows with double glazing: when you break it, it breaks off into large shards, and they dong into each other as they fall down, so you get this 'dong' type sound coming out of it. If you take something like a safety glass, sometimes called laminate glass, it's got a sheet of laminate in the middle that's sticky; when you break it, the whole piece breaks but the shards don't fall off, because it's designed to be safe, because that's what it's there for. Each one of those things contributes completely differently to the way the sound is made up, and all of that, plus all the thicknesses and sizes, makes up this whole thing that we call glass break. So you're going to have to go out and collect all that data first, before you can even decide that that is the thing you want to detect. So that's a big challenge. Then
you've got to collect all the other sounds you don't want it to get confused with. So you sit down and ask: what else is it going to hear? Say it's in a home: maybe it's going to hear people with cutlery, cats knocking glasses off the shelves, which you clearly don't want it to report as a break-in, because it's not, through to just more random stuff like radios left on and things like that. So you get these two big data sets. Okay, so now you've got all these data sets.
You then have to get people involved, because the only way you can realistically train and make sure that your machine is classifying what you want it to is for somebody to go through and say 'that is this'; it has to learn from humans in the first instance. So somebody is going to sit down and mark 'the glass break starts here and stops here', and that's going to be done at an extremely accurate level, you know, better than tens of milliseconds.
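That hand-labelling step implies some machine-readable annotation format with start and stop times per event. A minimal sketch, assuming a hypothetical comma-separated layout of label, start and end in milliseconds (not any particular tool's real format):

```python
# Hypothetical annotation format for hand-labelled audio events:
# one "label,start_ms,end_ms" row per event, timed to ~10 ms accuracy.
def parse_annotations(text):
    """Parse label rows into (label, start_ms, end_ms) tuples."""
    events = []
    for line in text.strip().splitlines():
        label, start, end = line.split(",")
        start_ms, end_ms = int(start), int(end)
        if end_ms <= start_ms:
            raise ValueError(f"event ends before it starts: {line!r}")
        events.append((label, start_ms, end_ms))
    return events

rows = """glass_break,1250,1630
baby_cry,4000,9500"""
events = parse_annotations(rows)
print(events[0])  # ('glass_break', 1250, 1630)
```

Millisecond integers are used here simply because the timing accuracy described is on that order.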
So, really accurate. Something like a glass break, I'm assuming, is going to start very suddenly: the waveform is going to show a very sharp rise in volume very, very quickly. Yeah, if you imagine time going that way and amplitude going up there, and you imagine a glass break, this is your big bang, and then these sort of tinkling sounds, if we're talking about plate glass. If we're talking about the other types, they wouldn't dong as much, so you wouldn't have as much of a tail to the sound. To use the actual
descriptions here: there's a thing called an envelope, which is the way you describe these things. It's made up of an attack, a decay, a bit of a sustain, and then a sort of release phase; ADSR is the way you describe these characteristics, and it can be used for all sounds, whether it's a violin or a piece of glass being broken.
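The ADSR shape just described can be sketched numerically. This is a toy illustration, assuming simple piecewise-linear segments; real sounds are rarely this clean:

```python
import numpy as np

def adsr_envelope(n, sr, attack=0.005, decay=0.05, sustain=0.3, release=0.4):
    """Piecewise-linear ADSR amplitude envelope over n samples at rate sr.
    attack/decay/release are durations in seconds; sustain is a level in [0, 1].
    A plate-glass break would have a very short attack and a long tinkling tail."""
    a, d, r = int(attack * sr), int(decay * sr), int(release * sr)
    s = max(n - a - d - r, 0)  # remaining samples hold the sustain level
    env = np.concatenate([
        np.linspace(0.0, 1.0, a, endpoint=False),      # attack: sharp rise
        np.linspace(1.0, sustain, d, endpoint=False),  # decay down to sustain
        np.full(s, sustain),                           # sustain
        np.linspace(sustain, 0.0, r),                  # release: the tail
    ])
    return env[:n]

sr = 44100
env = adsr_envelope(sr, sr)  # a one-second envelope
print(env.max())  # 1.0 (the peak sits right after the 5 ms attack)
```

Multiplying a noise or tone signal by such an envelope is the classic way to imitate the 'big bang then tinkle' shape described above.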
It looks to me like it's very straightforward to find the beginning of that sound, but isn't it very difficult to find the end of it? Yes, deciding where the end is, is quite difficult. Obviously at some point it hits the noise floor of the room, and you don't want to be telling the computer the noise floor is relevant to what we're trying to learn, because it's not; you know, the noise floor in this room is very different to the noise floor wherever this device is going to be deployed. And by noise floor, you mean the acoustics of the place?
Well, noise floor in itself is a complicated area; sorry, audio always descends into levels of granularity on these subjects. The noise floor is made up of whatever electrical noise there is: the camera you're filming on at the moment, the equipment you film with, has a set of electrical noise and digital noise that gets added to it, then there are the buzzes in the room and the buzzes outside, so you've got this whole chain of noise, and all of that adds up to the overall noise floor. Clearly all of that is just random stuff, and you don't necessarily want to be including it. So yes, it gets quite difficult.
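The 'where does the sound end' problem can be illustrated with a crude energy threshold measured against the quietest frame, standing in for the noise floor. A toy sketch only; real systems are far more robust than this:

```python
import numpy as np

def find_offset(signal, sr, frame_ms=10, floor_db=6.0):
    """Estimate where an event ends: the last frame whose RMS energy
    stays a margin (floor_db) above the quietest frame (the 'noise floor')."""
    hop = int(sr * frame_ms / 1000)
    frames = [signal[i:i + hop] for i in range(0, len(signal) - hop, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) + 1e-12 for f in frames])
    noise_floor = rms.min()
    above = 20 * np.log10(rms / noise_floor) > floor_db
    if not above.any():
        return 0.0
    return (np.flatnonzero(above)[-1] + 1) * hop / sr  # seconds

# toy signal: a 0.3 s tone burst followed by silence
sr = 16000
t = np.arange(sr) / sr
sig = np.where(t < 0.3, np.sin(2 * np.pi * 440 * t), 0.0)
print(round(find_offset(sig, sr), 2))  # 0.3
```

In a real room the 'silence' is never zero, which is exactly why the cut-off point is a judgement call rather than a clean boundary.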
Then if you switch to other sounds, ones with more semantically ambiguous definitions, say we move on to something like baby cry. I've got a little girl: sometimes she's crying, sometimes she's whinging, sometimes she's moaning, you know, and the exact definition can be a bit vague; it's obviously of a continuous nature for us. So some sounds we deal with are less clear about where they start and stop, because they're less discrete events, and that is then a difficulty, because you're saying, well, the machine has to be able to find that somewhere in the labelled data so we can move on to the next phase, which is training the machine to recognize it. To get machines to recognize a specific pattern, you have to look at the nature of that specific pattern.
Is it video, is it speech, is it music? No, this is sound. It's different from speech in that it doesn't have a language model: you know, last time I checked, my windows vibrated, they didn't speak to me. And in music you have some sort of structure, depending (sometimes you don't, and not in my particular favourite kinds of music), but you know, you have some set of things that restricts which sets of sounds you're going to hear in what order. Sounds in general are different from that, so the techniques you need, the artificial intelligence techniques, are very different, and we had to set about building a whole new set of capabilities. A lot of it comes down to: how do you describe the individual components of the sound that you want to recognize? Some of them we had to make things up for, because they simply didn't exist.
So when you say describe it, do you mean in human language, or in terms of...? What I was referring to in that bit was actually what are called features, which are things you use to describe the sound. A really simple feature, which would be practically useless but which we can have a conversation about, would be volume: is it loud or quiet? Well, if you try to distinguish all the sounds in the world by 'is it loud or quiet', you're not going to get very far in identifying a specific sound. You need to keep adding a range of descriptions and features: this could be frequency, this could be, say, the attack of the envelope, all these things that can describe how sounds differ. You need hundreds and hundreds of these things to be able to adequately say 'if it's all of these things, I've got a rough understanding of what it is'. And then you have to deal with how it changes over time, because sound changes over time, so you've got that level of complexity.
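Two of the simple features mentioned, loudness and a frequency-based descriptor, can be sketched per frame. These two are toy examples; as the text says, a real system stacks hundreds of such descriptors and tracks how they change over time:

```python
import numpy as np

def frame_features(signal, sr, frame=1024, hop=512):
    """Per-frame features: loudness (RMS) and brightness (spectral centroid)."""
    feats = []
    for i in range(0, len(signal) - frame, hop):
        x = signal[i:i + frame] * np.hanning(frame)   # windowed frame
        rms = np.sqrt(np.mean(x ** 2))                # loudness
        spec = np.abs(np.fft.rfft(x))
        freqs = np.fft.rfftfreq(frame, 1 / sr)
        centroid = (freqs * spec).sum() / (spec.sum() + 1e-12)  # brightness
        feats.append((rms, centroid))
    return np.array(feats)

sr = 16000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 200 * t)    # dull tone
high = np.sin(2 * np.pi * 4000 * t)  # bright tone
f_low, f_high = frame_features(low, sr), frame_features(high, sr)
print(f_low[:, 1].mean() < f_high[:, 1].mean())  # True: higher tone is 'brighter'
```

Each frame yields one small feature vector; the sequence of those vectors over time is what the machine actually sees.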
See, the language side of it is fascinating. One of the problems we ran into straight away is, if you imagine sort of drawing those things out, you very quickly run out of English, and you very quickly run out of the ability to draw it. So with something even as simple as baby cry, we had to invent new English terms for things, because otherwise you'd be sitting there going 'you know, the bit that goes like boopity-boopity-bob there', and then you'd go 'sorry, I'm not clear what you're talking about'. Yeah, you're going to have a sort of lexicon, a set of terms by which you know what each other means, because otherwise two engineers trying to model this thing will go off in different directions, because they fundamentally don't have the words to talk about it. So do you think that's a question of... I mean, other people must have some of these words; it's a thing that gets talked about in, say, I don't know, audiophile circles and schools, but maybe not so
much in software? No, I think until you get into it... think of it like the people who work on the Oxford English Dictionary or something like that: until you actually need to concretely define a word for a whole set of meanings, there's no reason to come up with the language, because you can just refer to it: it is that, it is this. When you're actually trying to define something, the first thing you do is turn to English (in my case, the language I speak), since it's the most flexible tool we have to describe things. Then you can have a conversation, then you can build back up to, you know, understanding, and then from understanding into trying to teach a computer to understand it, because the two go part and parcel.
Once you've got that, then you've got to move on to the machine learning side of it: you've got to present all of this information to a system, and you've got to use a whole bunch of machine learning techniques, some of which we had to develop ourselves, to adequately characterize all of that complexity down into something incredibly small. A lot of people do this kind of thing, certainly speech recognition, in the cloud: speech recognition and the language model are typically done in the cloud.
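To make the 'characterize the complexity down into something small' step concrete, here is a deliberately tiny classifier: a nearest-centroid model over hypothetical (loudness, brightness) features. This is a stand-in sketch, not the machine-learning techniques the speaker's company actually uses:

```python
import numpy as np

class NearestCentroid:
    """Toy model: classify a feature vector by the closest per-class mean.
    The whole 'model' is just one small vector per class."""
    def fit(self, X, y):
        self.labels = sorted(set(y))
        self.centroids = np.array(
            [X[np.array(y) == label].mean(axis=0) for label in self.labels])
        return self

    def predict(self, X):
        # distance from each sample to each class centroid
        d = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return [self.labels[i] for i in d.argmin(axis=1)]

# hypothetical features: (rms, centroid_hz) — glass breaks loud and bright,
# household background quiet and dull
X = np.array([[0.9, 5000], [0.8, 4500], [0.1, 300], [0.2, 400]])
y = ["glass", "glass", "background", "background"]
clf = NearestCentroid().fit(X, y)
print(clf.predict(np.array([[0.85, 4800]])))  # ['glass']
```

The point of the sketch is the footprint: once trained, all that has to live on the device is a handful of numbers, not the training audio.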
one thing you then run into is you've
got a stream you might have to in our
field stream the audio off the device
24/7 well clearly that's going to have a
battery implication clearly because
you've got to boot up something that
transmits it to something clearly it's
going to have a privacy implication
because you're taking anything that's
being heard and transmitting it up in
the speech world they do what's called
keyword awake word recognition on a
device so you know this hey Siri hey
Google you know whatever kicks it up and
then the rest of that audio a lot of
times being shipped up to the cloud and
then analyzed and sent back down there's
no equivalent of a wait word in our sort
of industry so we're running all the
time which means that you'd have to be
shipping the audio awful time which
doesn't sound like a good idea
so we've had to really understand the
math to make it compact enough to fit on
a relatively small processor
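The always-on, on-device constraint can be sketched as a loop that consumes audio frames continuously and emits only compact event decisions, never the raw audio. The threshold 'classifier' here is a placeholder for a real model, and the class and its parameters are illustrative assumptions:

```python
import numpy as np
from collections import deque

class StreamingDetector:
    """Sketch of an always-on, on-device loop: audio frames arrive
    continuously, and only small event labels (never audio) leave it."""
    def __init__(self, sr, frame_ms=20, threshold=0.5):
        self.frame = int(sr * frame_ms / 1000)
        self.threshold = threshold
        self.events = deque(maxlen=32)   # tiny, fixed memory budget

    def push(self, frame):
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if rms > self.threshold:         # stand-in for a real classifier
            self.events.append("loud_event")
        return rms

sr = 16000
det = StreamingDetector(sr)
quiet = np.zeros(det.frame)
loud = np.ones(det.frame) * 0.9
for f in (quiet, quiet, loud, quiet):
    det.push(f)
print(list(det.events))  # ['loud_event']
```

Because nothing but the event label ever leaves the loop, the battery and privacy costs of 24/7 streaming simply don't arise.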
So yeah, you're not going to be going 'Alexa, is there a burglar at the door?', nor is the burglar going to be going 'Alexa, I've just broken in, can you tell the homeowner?'. No, it doesn't work in that direction. When you get things like glass breaking, or your smoke alarm going off, or your baby crying, these are all things that happen, in these cases, in a house, and if you were sitting downstairs on your couch you would say 'oh, I need to react in this way'. But sometimes you don't hear it, and sometimes, especially in the smart home industry, you want the house to take care of you a little bit and take active, you know, participation in trying to make your life a little bit easier. That's what it does: it makes the house into more of a living being than just a passive observer of all the crazy things that, you know, go on from day to day in various houses. So how do things like Shazam work
for music recognition? Is that a different craft? That's a different discipline of sound identification, called fingerprint recognition. If you want to lay out the landscape: you've got speech recognition, then you've got music identification. Music identification is great at identifying sounds that it has specifically heard before. You know, we've never heard a window in your house break, but our system could detect it, and we've never heard it. I don't know how old your children are, but if they're still young enough to cry: we've never heard them cry, but we can identify that. And that's a very different thing; that's this sound recognition piece, or this artificial audio intelligence, that we provide, sitting in the middle, and it's a very different set of techniques and objectives. I'd like to think I'd be able to hear a window smashing behind me even though I've never heard that exact window break, because I'd want to duck out of the way of
whatever is coming through. Yeah. So is that a fair way of putting it: you're attempting to do it more like a human? Yeah, you're actually teaching it the acoustic components of the sound itself rather than just a high-level description, especially compared with the Shazam-type approach, where it just has, you know, the top 20 songs of all time, which is a hell of an undertaking in itself and obviously a great piece of technology. But in our case we're trying to help the computer understand what that sound means at quite a fundamental level, so that when, as you say, you hear a piece of glass breaking behind you and you think 'better duck, because that's going to hit me in the back of the head at some point', then you can actually take some reasonable action off that. So it's a different thing in that sense.
