In our last video we looked at the
Chomsky hierarchy and the correspondence
between the rules of human language and
the different grammars in the hierarchy.
We also saw that human language has some
context-sensitive rules. In this video
we're going to explore the consequences
of having context-sensitive rules as
part of our models of human language and
what that means to us as programmers.
Spoiler alert: it means that as our systems grow in the number of words and the number of rules, the processing power required is going to shoot up, potentially even exponentially. So this is another presentation of the Chomsky hierarchy.
Last time we looked at forms of grammars, like regular grammars, that can be described through finite state machines. One example of that was basic English sentences, where you have one or more nouns for the subject, a verb, and zero or more nouns for the direct object.
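To make this concrete, here is a toy sketch of that pattern as a regular expression in Python. The word lists are invented for illustration; a real grammar would be far larger.

```python
import re

# Toy regular-language pattern: one or more nouns (subject), a verb,
# then zero or more nouns (direct object). Vocabulary is made up.
NOUN = r"(?:dogs|cats|birds)"
VERB = r"(?:chase|see|like)"
SENTENCE = re.compile(rf"^{NOUN}(?: {NOUN})* {VERB}(?: {NOUN})*$")

print(bool(SENTENCE.match("dogs chase cats")))  # True
print(bool(SENTENCE.match("dogs cats see")))    # True: two nouns, no object
print(bool(SENTENCE.match("chase dogs")))       # False: no subject noun
```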
We also looked at context-free rules, for example center embedding in English, where you have a sentence and you can embed another sentence inside it, but you need to remember how many times you went into the recursion in order to walk back out of it. You don't need to look at the context, though: you don't need to look at other sentences to decide how to build your local sentence.
Finally, we looked at context-sensitive rules like Yoruba emphasis, where you have a string and you need to scan for the last appearance of a certain substring, extract that substring, perform some transformations on it, and then use it in the output. This is a context-sensitive rule because you have to look at different symbols and sounds and search among them.
So human language can sit in different parts of the Chomsky hierarchy: different rules live at different levels. This is also true of music, for example, and of animal vocalization.
For example, these are nightingale songs, and as you can see they are repetitions of the same elements, which we can describe with, say, finite state machines. In the green segments we have repetitions of three elements, beep-beep-beep; in the red segments we have repetitions of more elements, beep-beep-beep-beep-beep; and so forth.
These are spectrograms, by the way, very similar to the ones we'll use for human language; the regions with darker colors are regions with more energy, so that is where the singing is located. As you can see, we could program a finite state machine to accept and to generate the songs of nightingales, at least. So the Chomsky hierarchy is a way to describe phenomena that repeat in some way, or that are rule-based in some way. Why do we care? Why are we even looking at this?
Because the Chomsky hierarchy makes predictions about how long it takes to process a certain rule: how much time we're going to need to run the rule, and how much memory and space we're going to need to process it. And it makes different predictions for different types of formal grammars. In general, the more complex the grammar, the more expensive it will be computationally, so it will take longer to process in terms of time and will need more memory as well. Regular languages need few resources; context-sensitive languages need many. The slide says "languages", by the way, but what that means here is just sets of strings, what we've been calling rules.
Let's look at a regular language modeled through a finite state machine. If you have one finite state machine, the transition between one state and the next is always going to take the same amount of time, regardless of how many elements the automaton has. We say that this processing time is constant, and we're going to use the notation O(1) to denote constant processing time. This notation, by the way, is called Big O; it's a way of abstracting how long something will need to run. So O(1) means that as you double the input, as you double the number of states in the finite state machine, you are going to need the same amount of time to go between transitions.
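A minimal sketch of why each transition is O(1): if the machine is stored as a lookup table, one step is one dictionary lookup, no matter how large the table is. States and symbols here are invented.

```python
# Finite state machine as a transition table: each step is one
# constant-time dictionary lookup, regardless of how many states exist.
TRANSITIONS = {
    ("start", "noun"): "subject",
    ("subject", "noun"): "subject",     # more subject nouns
    ("subject", "verb"): "predicate",
    ("predicate", "noun"): "object",
}

state = "start"
for symbol in ["noun", "noun", "verb", "noun"]:
    state = TRANSITIONS[(state, symbol)]   # O(1) per transition
print(state)  # 'object'
```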
However, if you start adding up finite state machines, one for the roots of the verb, one for the conjugation of the verb, one for the clitics of the verb, each of them is going to take its own time to run, and we're going to need to add up those times. So the more finite state machines you have, the more time the whole program will take. This time is going to grow in a linear fashion: if you have one finite state machine and then double that to two, processing is going to take twice as long; with three finite state machines, three times as long. This type of growth is linear, where growth in the input elements (in this case, finite state machines) corresponds to a proportional growth in processing time: two machines, twice as long; three machines, three times as long.
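A small sketch of that additive picture, with invented stand-ins for the three verb machines:

```python
# Three stand-in "machines" (root, conjugation, clitics), each scanning
# the word once. Running them back to back ADDS their costs, so the
# total work grows linearly with the number of machines: O(k * n).
def run_machine(word):
    steps = 0
    for _ in word:   # one constant-time transition per symbol
        steps += 1
    return steps

word = "cantariamos"
machines = ["root", "conjugation", "clitics"]
total_steps = sum(run_machine(word) for _ in machines)
print(total_steps)  # 3 machines x 11 symbols = 33 steps
```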
What will happen with context-free rules? When you have a context-free rule, you have your finite state machine, but you need some other programming element to help you with the recursion: it could be a variable that remembers how many times you went in, or a stack where you push and then pop. Whatever you do, it's going to take time to perform those operations. A single one of these is going to take approximately O(n) time; it's going to be linear, so as the elements double, as you have more stacks, you're going to need to run twice as long, three times as long for three stacks, four times as long for four stacks, and so on.
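Here is a minimal sketch of that extra machinery: a stack that remembers how deep the embedding goes, with toy "noun"/"verb" tokens standing in for clause openings and closings.

```python
# Toy check for center-embedded strings of the form noun^n verb^n:
# every noun opens an embedding (push), every verb closes one (pop).
def balanced_embedding(tokens):
    stack = []
    for tok in tokens:
        if tok == "noun":
            stack.append(tok)    # going one level deeper
        elif tok == "verb":
            if not stack:
                return False     # a verb with no open clause
            stack.pop()          # walking one level back out
    return not stack             # all embeddings were closed

print(balanced_embedding("noun noun noun verb verb verb".split()))  # True
print(balanced_embedding("noun verb verb".split()))                 # False
```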
When you have this context-free rule interact with other context-free rules (there's one here and then another one inside of it, for example), this is going to multiply the times. This is similar to how, in programming, we can have a for loop (for i = 0, run so many times) and then another for loop inside of it (for j = 0, run so many times). This happens, for example, if we need to look for duplicate elements in a list, or if we need to traverse a list once for every element of that same list: for every element in the first for, you traverse all the elements in the second for.
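The duplicate check is the classic example of this pattern; a quick sketch:

```python
# Nested fors: for every element, scan the whole list again.
# n elements times n comparisons each is roughly n^2 steps: O(n^2).
def has_duplicates(items):
    for i in range(len(items)):
        for j in range(len(items)):
            if i != j and items[i] == items[j]:
                return True
    return False

print(has_duplicates(["hi", "how", "are", "you", "hi"]))  # True
print(has_duplicates(["hi", "how", "are", "you"]))        # False
```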
Notice that this is going to multiply the time it takes to perform the operation. The time becomes polynomial, because now it's whatever time you had before multiplied by that same time, and maybe by a third time if you have more nested fors, and so forth. Polynomial growth is very fast. If you have, for example, O(n^2), then when your data doubles, your time becomes 4 times as much. If you have O(n^3), then when your data doubles, you're going to need 8 times as much time to process. So as you can see, polynomial growth is faster than linear growth.
And we also have context-sensitive rules. For example, in Yoruba you have a string and you need the computer to find not just a certain substring, but the last occurrence of that substring. This type of search, where you may need to consider permutations within a string, takes exponential time to solve: O(2^n). And if you have several of these coming together, several context-sensitive rules that need to interact, you're going to be looking at growth that multiplies that; it could be on the order of 2 raised to however many symbols you have to look through, multiplied across however many rules you have. This processing method is, at the very least, going to be very costly, and in the worst-case scenario it can grow exponentially.
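One way to see where 2^n comes from: if a rule could in principle pick out any subset of positions in the string, the number of candidate sub-patterns it might have to consider doubles with every added symbol. A small illustration (the counting here is just combinatorics, not a model of Yoruba):

```python
from itertools import combinations

# Number of ways to pick a subset of positions in a string of length n:
# this is 2^n, so every extra symbol DOUBLES the candidate space.
def count_position_subsets(n):
    return sum(1 for size in range(n + 1)
                 for _ in combinations(range(n), size))

for n in (4, 8, 16):
    print(n, count_position_subsets(n))  # 16, 256, 65536
```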
Here is an example, just so you can see how bad the times can get. With constant growth, O(1), say one finite state machine, if you double the amount of input, the processing time does not change: things might take, say, 10 milliseconds either way. In a similar operation with linear time, O(n), if you double the amount of data you have to go through, the processing time doubles, so something might now take 3 seconds. In quadratic polynomial growth, O(n^2), if you double your data, the processing time quadruples, and might now take almost 2 minutes to process. If the same algorithm became cubic, polynomial O(n^3), doubling the data multiplies the time by 8, so things might now take 55 minutes, for example. And in exponential time, things can grow to where they take hours or days.
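The same comparison as a quick back-of-the-envelope computation (the input sizes are arbitrary):

```python
# How each growth class reacts when the input size doubles (n: 20 -> 40).
n = 20
growth = [("O(1)",   lambda n: 1),
          ("O(n)",   lambda n: n),
          ("O(n^2)", lambda n: n ** 2),
          ("O(n^3)", lambda n: n ** 3),
          ("O(2^n)", lambda n: 2 ** n)]

for name, f in growth:
    print(f"{name}: time multiplies by {f(2 * n) / f(n):g}")
# O(1): 1, O(n): 2, O(n^2): 4, O(n^3): 8, O(2^n): 1.04858e+06
```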
This problem is called combinatorial explosion, and it's always going to be lurking in the shadows whenever we handle human language: there's always the danger that processes are going to blow up in complexity, in processing time, and in storage needs. So whenever we add new words to our system, or new rules to explain something, this is going to make it grow, but grow much worse than we expected.
Think of a simple example like an ELIZA-type chatbot, where you have regular expressions, let's say nine of them, to provide some simple interactions: "hi, my name is whatever", "hey, how are you", "hi, I want to do this", "why do you want to do this?". Now let's say you want to handle one more word. You would have to edit all the rules so that they account for one more thing, and that's just adding one. If you want to add more words, for example more verbs, you're maybe going to have to create more rules, or the rules are going to have more words to go through. Also, if you wanted the chatbot to become more complex, you're going to have to start adding rules for certain types of verbs, rules for certain types of interactions, rules for certain kinds of sentences, to the point where, before you realize it, you're going to have thousands of rules and the thing goes out of control.
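A miniature sketch of that kind of chatbot (the patterns here are invented, not the original ELIZA script) makes it easy to see why every new word or phrasing touches many rules:

```python
import re

# Each rule is a regular expression plus a response template. Handling
# one more verb or one more phrasing means editing or adding rules.
RULES = [
    (re.compile(r"\bmy name is (\w+)", re.I), "Hi {0}, how are you?"),
    (re.compile(r"\bi want to (.+)", re.I),   "Why do you want to {0}?"),
    (re.compile(r"\bhow are you\b", re.I),    "I'm fine. Tell me about you."),
]

def respond(line):
    for pattern, template in RULES:
        match = pattern.search(line)
        if match:
            return template.format(*match.groups())
    return "Tell me more."

print(respond("Hi, my name is Ada"))      # Hi Ada, how are you?
print(respond("I want to learn Yoruba"))  # Why do you want to learn Yoruba?
```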
So modeling all of human language using rules can be a problem because of combinatorial explosion, because you need so many rules; as a linguist, I can tell you that we don't even know how many rules you would need to describe a language. We cannot even demonstrate that the number of rules is finite.
So, as a summary of this week's topics: we've been looking at computational abstractions, for example finite state machines, and again, these are just abstractions for the kind of code that we regularly write. Simple abstractions cannot model human languages. Something simple like a finite state machine, where you just say "if I get this input I go here, if I get that input I go there", is not enough. Because of recursivity and center embedding, you're always going to need to count things along the way; because of long-distance relationships, you're always going to need some memory dedicated to those relationships; and because of context sensitivity, you're always going to need to go back and forth within the strings to figure out how to apply some of the rules. Even if we got rid of limitations like infinite recursivity, the automaton we would need to describe human syntax, and all of human language, would be incredibly large, probably too large to be useful, and its processing cost would grow exponentially as we add more words and more elements.
This doesn't mean that simple procedures are useless; there are many automata that are very useful. Phonology, for example, has simpler descriptions: something like the structure of English syllables is relatively straightforward, so something like a hidden Markov model can be used to explain sound and to state rules over a sound input. Hidden Markov models are very popular in speech recognition, for example.
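As a sketch of the idea (all probabilities here are invented for illustration), here is a tiny hidden Markov model over syllable positions, decoded with the Viterbi algorithm:

```python
# Toy HMM for syllable structure: hidden states are syllable positions,
# observations are coarse sound classes. All numbers are invented.
STATES = ["onset", "nucleus", "coda"]
START = {"onset": 0.7, "nucleus": 0.3, "coda": 0.0}
TRANS = {"onset":   {"onset": 0.1, "nucleus": 0.9, "coda": 0.0},
         "nucleus": {"onset": 0.0, "nucleus": 0.1, "coda": 0.9},
         "coda":    {"onset": 0.8, "nucleus": 0.0, "coda": 0.2}}
EMIT = {"onset":   {"consonant": 0.9, "vowel": 0.1},
        "nucleus": {"consonant": 0.1, "vowel": 0.9},
        "coda":    {"consonant": 0.9, "vowel": 0.1}}

def viterbi(observations):
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        best = {s: max((best[r][0] * TRANS[r][s] * EMIT[s][obs],
                        best[r][1] + [s]) for r in STATES)
                for s in STATES}
    return max(best.values())[1]

print(viterbi(["consonant", "vowel", "consonant"]))
# ['onset', 'nucleus', 'coda']
```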
But in general, it is very difficult to create models of language that are based purely on rules. This is why people try things like deep learning, for example, where the rules are more opaque and it's more difficult to know what the computer is doing, but which helps us with some of the weaknesses of having to model every single rule. In summary: human languages can be described by rules, but making a computer program with explicit rules would make it too big and would make it run for too long. Next week we're going to start studying language as features, and we're going to slowly progress towards language as input for deep learning methods.
