Today, as ever, we're going to
listen to our subscribers. A lot of
them keep saying: "Well, you haven't really
done regular expressions head-on. You
keep mentioning them as though we
know all about them, but we don't. It would
be nice to have a sort of 'Regular
Expressions Explained', as it were."
So, although I'm gonna do my best to cover
Regular Expressions completely, as a
one-off - no previous experience needed -
for those of you who look at it and think: "Oh,
come on, this relates to something else",
we will give you the link to
previous episodes where you can get more
details about just how it links into
other things. For those of you who think I
must be infinitely old and personal
friends with all these people - well, I am
with some of them, but not all. And today's
hero is called Stephen Kleene.
He's the man who, in the mid-1950s,
invented Regular Expressions. What Kleene
wanted was a way that he didn't have
to draw diagrams like this one but that
he could abbreviate what the automaton
was doing. So I'm going to number these
states 0, 1 and 2, and I'm going to label this
one, over here, as being the 'Finish' state.
And this is going to be an incredibly
simple string fragment. A fragment of a
string of characters that we're going to
recognize. This is the automaton way of
depicting it. At the start you are in state 0.
And we're going to say that, in state 0,
as you are analyzing your input string
- we'll put some input strings down here
that this thing will be able to recognize -
if what you see at the start
of your string, not yet coped
with, is a letter 'a', then you
go to state 1. Let's take the
simplest string I want to recognize: if, in
state 1, you've coped with the 'a' and the
next character is a 'c', then that is
acceptable. So I write a 'c' over that
transition arc here, and that gets me
into state 2.
The finish state is usually
distinguished by drawing a sort of
double circle, like that. So you just
glance at it and say: "Ah, that's it".
So, yeah, what I've drawn there certainly
recognizes the string "ac".
Frankly, electronics engineers, and so on, were
using these - not in the late 50s but in
the early 50s. They used them for creating and
understanding what they called 'state
machines'.  And, as I say,
Kleene sort of liked the idea, liked the
pictorial notation but wanted to see if
he could get it to be more compact. So
let's just finish the story by saying
that if you come around on this loop and
re-enter State 1, you can do so by
recognizing a 'b' in the string. So we've
already shown that it can recognize "ac":
you just come here, you start, you get an
'a', you see a 'c' and you're finished.
But equally, as we now see, it would also
accept "abc". Start here; see an 'a'; if the next
character in your string is a 'b', fine,
you accept it, but you come back into
state 1. Although it doesn't explicitly
involve a stack, this is like a sort of
iterative-recursive re-entry into the
same state. But the only way out of that
state to the finish is by accepting a 'c' at
the end of the string, into state 2.
You can keep going around this 'b'
loop as long as you want. It will
certainly accept "abbc", "abbbc" and
so on. And, effectively, what Kleene came
up with, as being a good way of talking
about these things over the phone, and not
having to draw diagrams, is to say in his
Regular Expression notation that it
accepts ab*c. And this is the first
bit of Kleene-invented notation:
that '*', which, as almost all of us now
know, means "zero or more of" - there we are, then.
The strings that the automaton
accepts are an 'a', followed by zero or more
'b's, finishing with a 'c'.
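In a modern regex engine this is a one-liner. Here is a minimal sketch in Python: the pattern ab*c is the one from the video, and everything else (the test strings, the use of `fullmatch`) is just illustration.

```python
import re

# Kleene's ab*c: an 'a', zero or more 'b's, then a 'c'.
pattern = re.compile(r"ab*c")

# fullmatch() insists the *whole* string is consumed, just as the
# automaton must finish in its double-circled finish state.
for s in ["ac", "abc", "abbc", "abbbc", "ab", "bc"]:
    print(s, bool(pattern.fullmatch(s)))
```

Note that `fullmatch`, rather than `search` or `match`, is the right analogue of the automaton accepting a complete string.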
And the nice thing about this
regular expression notation is just look
at how compact it is! It is just so much
easier to handle in a program than
trying to do elementary computer
graphics and literally draw yourself
automata. Pictorially, though, I
think everybody liked this way of
sketching out what your automaton would accept.
Some of you may have seen it;
if not, follow the link out to
the things we did about 3 years ago.
You'll find in there an automaton that
accepts 25 pence in order to issue a
parking permit. It's all done with quite
complicated, but locally simple, little
transitions on either coins or
characters or whatever. So, just to
underline and re-emphasize why this
regular expression notation of Kleene's
was seized upon - it was so
compact - let's complete the triangle:
One side of the triangle is the automaton
approach. Another side of the triangle we've
now seen is the regular expression approach -
completely equivalent. But if we
look at the Chomsky Grammar approach,
done on the playlist, would that be
compact and nice? Not so much. The rule
in Chomsky grammars, in their purest form,
is that a capital letter
is a so-called non-terminal symbol:
there'll be another rule, with it on the
left-hand side, that develops it further.
Whereas lowercase literally means what
it says. So this means - well, Chomsky
tended to call them all 'sentences' - a
sentence in this limited language is a
little letter 'a' followed by anything
that a 'B' can be. So I have to complete this
by saying: "Well, what can a capital B become?"
Well, again, looking at the automaton
diagram, which is completely
equivalent: what happens with a 'B' is you
can spit out a lowercase letter 'b' and another 'B',
or you can have a much simpler rule that
says a capital B can just become the
letter 'c'. So there we are: instead of one
small [regex] line we've got three lines.
And if you're doing it the grammar
way you might say: "Oh, well, can't you
abbreviate that grammar a bit more?
It's awful having to do three lines like that!
It's very verbose." Yeah, here's the
allowed shortening - but it's
not much: you could say: "All right, I will
allow myself the luxury of the OR bar"
and put a 'c' there. What it's saying is
that a 'B' can be a lowercase 'b' followed by
recursive re-entry into 'B' again, or it can
just be a letter 'c'. So it's down to two
lines now, not three. But it's still nowhere near as
compact and nice as the regular
expression is. And what I must absolutely
emphasize - I can't emphasize it too much; I shall
put double-headed arrows all over
the place here - is that all of these
things - the automaton diagram, the regular
expression, the grammar - are all
completely equivalent to each other.
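As a concrete check on that equivalence, here is a minimal sketch, assuming Python, of the two-line grammar (S becomes 'a' B; B becomes 'b' B, or 'c') turned into a recursive recognizer. It accepts exactly the strings of ab*c. The function names are my own labels, not from the video.

```python
def match_S(s: str) -> bool:
    # S -> a B : consume an 'a', then whatever a B can become.
    return s[:1] == "a" and match_B(s[1:])

def match_B(s: str) -> bool:
    # B -> c : the only way out is a final 'c' ...
    if s == "c":
        return True
    # B -> b B : ... or a 'b' and a recursive re-entry into B.
    return s[:1] == "b" and match_B(s[1:])

for s in ["ac", "abc", "abbbc", "ab"]:
    print(s, match_S(s))
```

Each non-terminal becomes one function, and the "recursive re-entry into B" in the grammar becomes a literal recursive call.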
Those of you who have been through mathematics
courses will know your instructors go on
and on about: "No, I'm not asking
you whether one is a subset of the other,
or whether one is contained within the other
with a little bit extra". No: they are
completely and totally equivalent. There
is not a single thing that one will
accept that the other won't, and vice
versa. So it's just a matter of the
convenience of the notation.
Perhaps, if you're a computer scientist,
for use in a programming context. And I
think the other thing that had to
be coped with is to say: "Well, all right, you
can do it as an automaton diagram, you can
do it as a regular expression, you can do it as a
grammar. The regular expression's looking good
because of the compactness of the
notation, but are we sure that there
aren't some snags in this process of
saying they're equivalent?"
Well, one snag did occur, and was
recognized very early on, in the late 1950s.
Sometimes - let's look at it from the
automaton diagram point of view -
you get a situation which is what's called
'non-deterministic'. In other words, I'm
happy here because there's only one exit
from [state] 0 to get you into [state] 1;
it's labeled 'a'. How would it be if I took
another arrow out of [state] 0 and said an 'a'
could also lead you somewhere else?
So, let's just draw up a little diagram of
what horrors that might bring. What happens,
and what would you do about it, if you
say: "Well, my regular expression - let's
call it E - is actually either "abab" or
"abbb""? Now, you might
say that's useless - how does
that fit into anything? It does
illustrate, though, a very important
point, which is that if you are trying to
build an automaton, or recognizer, for
this, you've got two alternatives here.
You are in state 0 and you want to go
in possibly two completely different
directions. But they both begin with 'a'. How do
you choose?! For the moment I'm saying no
cheating: you're not allowed to look ahead.
You just get given an 'a'. What do you do?
And worse still, of course, it doesn't end there -
it then goes on. But whichever route you
take, you find that the next thing is that
you've got to recognize a 'b'. So there's a
"common factor", if you like - and it is just
like factorizing expressions in algebra:
take x^3 - x^2, pull an x^2
factor out at the front, and it's
x^2(x - 1).
It's similar in principle to that. Now, of
course, the two routes do change eventually,
because this one accepts an 'a' here
and then finishes up by accepting a 'b',
while the next one goes a 'b' and another 'b'
and finally gets down into the 'Finish' state
as well.
OK, well it looks trivial. But what are
you going to do? Here you are - 
the programmer trying to implement this.
You look at the string and you say:
"There's two ways I can cope with this!"
The most amazing thing about this is
that two theoreticians, in the late 50s,
showed that it will always be possible
to turn the non-deterministic one into a
deterministic one - work that later won them
the ACM Turing Award. Now, how would we do that?
In this case, easy. You start off
with [state] 0. You say that, first of all,
you always accept an 'a' and then
a 'b' - that's the "common factor", if you like,
for the start. But then you put your split
point here: one [alternative] continues with an 'a',
and the [lower] line continues with a 'b', to
give the two avenues that you see up there.
So we factorize out the [initial] 'a' and the 'b'.
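The factorization can be checked mechanically. A small sketch, assuming Python's `re` module, confirming that abab|abbb and its factorized form ab(ab|bb) accept exactly the same strings:

```python
import re
from itertools import product

nondet = re.compile(r"abab|abbb")    # the two-way split out of state 0
factored = re.compile(r"ab(ab|bb)")  # common 'ab' prefix pulled out front

# Exhaustively compare the two over all short strings of a's and b's.
for n in range(7):
    for chars in product("ab", repeat=n):
        s = "".join(chars)
        assert bool(nondet.fullmatch(s)) == bool(factored.fullmatch(s))
print("the two expressions agree on all strings up to length 6")
```

The exhaustive comparison is, of course, only over short strings; the Rabin-Scott result discussed below is what guarantees the equivalence in general.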
And the story was: "Well, it's all right for
us humans, who are really intelligent, to
see that you could do that, but will it
always be possible to
turn the non-deterministic one into a
deterministic one, where you force it
through factorization first?" And the
answer from Dana Scott and Michael Rabin,
in the late 1950s, was: "It is
always do-able, so long as you stick to
simple finite-state automata". Don't start
monkeying around with extra RAM or
stacks; stick to them as they are and
you can always do it. But it [the algorithm] can
get very, very hairy. So, who was the first person
who plunged headlong into this and said:
"I'm gonna do it!"?
Answer: Ken Thompson - our hero from UNIX
and Bell Labs and all the rest of it. By the late
60s he was effectively saying: "This
regular expression stuff is great for
pattern matching in editors. I want to
use it, but this problem of
non-determinism - how do I cope with it?"
And it is the most amazingly far-sighted
piece of work. You understand, when you
look at this, why Ken Thompson is a legend.
He basically said: "I know it's not deterministic,
but I'll look at all the possibilities and I'll
start pre-compiling little bits of
assembler code to cope with whichever
one turns up on the day" - as it were.
And everybody was open-mouthed about this. He
also realized - he said: "Look, yeah, if you
don't mind doing pre-processing then, as
Michael and Dana have
shown us, it is possible. I know
how to do that, but on the other hand
it could take some time". Because it's
all very well doing a little toy example
like this, but non-determinism in a big
real-life automaton can be hell to disentangle.
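The standard run-time answer to "no cheating, no look-ahead" is to stop choosing altogether: simulate every state the NFA could possibly be in, all at once. Here is only a sketch of that idea in Python - Thompson's real implementation compiled fragments of assembler, as described above - and the state numbers are my own labels for the abab|abbb machine:

```python
# Transition table: (state, input char) -> set of possible next states.
NFA = {
    (0, "a"): {1, 5},                             # the non-deterministic split
    (1, "b"): {2}, (2, "a"): {3}, (3, "b"): {4},  # the "abab" branch
    (5, "b"): {6}, (6, "b"): {7}, (7, "b"): {8},  # the "abbb" branch
}
START, FINISH = {0}, {4, 8}

def accepts(s: str) -> bool:
    current = set(START)
    for ch in s:
        # Follow every arrow out of every state we might be in - no guessing,
        # no look-ahead, no backtracking.
        current = set().union(*(NFA.get((q, ch), set()) for q in current))
    return bool(current & FINISH)

for s in ["abab", "abbb", "abba"]:
    print(s, accepts(s))
```

Carrying a set of states instead of a single state is exactly what makes the non-determinism harmless: the machine never has to commit to one branch on seeing the initial 'a'.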
>> Sean: What's an example of one of those, then?
>> DFB: Well, let me give you one
example which sounds utterly innocuous,
but which I can give you, Sean, on a
complete page that you can show to the
fans. Brian Kernighan wanted
to get a regular expression recognizer
[by using 'lex'] for all of the keywords and all
of the constructs in his 'pic' language - which,
after all, is a preprocessor for the UNIX
'troff' typesetting system. You just
want to draw simple line diagrams, so
we've got primitives like 'circle', 'line',
'ellipse', 'from', 'to', 'with', 'dotted' and so on,
to show you the nature of the line - all this kind of stuff.
It looks innocuous - biggish, but innocuous.
And [via 'lex'] he put it through one of these NFA-to-DFA
[transformations]: let's stop it being
non-deterministic; we want fast
recognition. We want to build an engine -
we don't care if it has more states than the
non-deterministic one - but, boy, it has got to go
off like a Lamborghini! Right, so be
prepared to spend some pre-processing
time. You can put out his release note
for [early 1980s] 'pic', which says the 'lex' phase
" ... takes an eon - 15 minutes on a VAX 750. Be patient".
Brian is always noted for his Ernest Hemingway
terseness, you know. So that is
what can come back to haunt you: the
deconvoluting and tracing round everything to
get it deterministic. It is worthwhile if
you are going to be putting huge 'pic'
scripts through this - hammering
it over and over and over again - anything
to get it really efficient; otherwise
we'll be here forever while we're using
it. On the other hand, Ken's usage of it,
in the UNIX editor 'ed', took the
pragmatic view - this thing was just 'ed' in the
early days, because Lee McMahon's stream editor [sed]
hadn't been invented yet -
that you can get away with very
simple regular expressions, because
that's all that humans will use. If
you're feeding it stuff that's been
prepared by another program, that's when
your pet schemes tend to die the death,
because stuff prepared by another
program can exploit little wrinkles that
cause you great difficulty. But Ken said: "No,
in 'ed' I think I can get away with
keeping it non-deterministic, but
[with] preparation and look-ahead". He
took the attitude of "I'll
look at which way the cookie crumbles",
to try and work out which is the most
likely [alternative], and precompute pieces
of fast assembler code to put in and
actually execute these things. And it was
the most amazingly flexible system - it
basically adapted to the input. It is one of the
first, and a very good, example of
just-in-time compilation.
While Ken says: "I'll do it as I need
it", Brian, in 'pic', couldn't say that,
because 'pic' is basically much more like a
compiler - and a compiler that's [possibly] being
driven by reams of externally provided
input. So: two approaches to the same
thing. Are you going to make it
deterministic in all cases, ahead of time,
or are you going to cope with it as you go?
