There's a parsing theme, devotees will
know, and there's a Regexp theme. And our
aim, in the end, is to unite these all
into one, at the top, so that you're all
totally happy - or completely unhappy - as
the case may be. We have done Regular
Expressions and the last one I did was
really very simple and straightforward,
and showed you absolutely nothing about
what can go wrong. And some of you out
there, coming up with the usual old jokes,
quite right, about: "Oh! you're implementing
using Regular Expressions. Now you've got
two problems", Yeah! We'll look on in glee
as we now take the story a bit further
and see what can go right and how
powerful Regular Expressions are. But
also a little bit of what can go wrong
and why you've got to be careful in your
use. So what can it be this marvellous new
example that consumes Life the Universe
and Everything and exhibits a whole load
of new features? And you're going to be
quite convinced I'm off my head. 
It's Roman Numbers. Way way back when we first,
here at Nottingham, got Lex and Yacc, in
UNIX Version 7 - it was in the late 70s. 
So, believe me, I have been ... or, back in
those days I was setting Roman Numbers
as an example of what Lex -  remember the
Regular-Expression-based front-end to
your compiler. Does tour identifiers and
all that. I thought, all right, as an
exercise, just write a Regular Expression
that will accept Roman Numbers - all of the
good ones and reject all the bad ones.
Now, I can't emphasize too strongly: that
rider is important! It is very easy, in a
Regular Rxpression, quite easy, to get it
to recognize what you want, but will it [also]
reject all the stuff that's bad?  And it
is sometimes so easy to overlook [things].
I say with shame because, when I'd set that
question, I scratched down a quick and
easy answer - which I now see is full of
holes ....   Because if you input, into my
regular expression MMII - 
again, looking ahead, Roman Numbers
MMII [is] 2002. Fine. But then I
thought ... well I thought now, I didn't think then:
"What happens if I do MIIM? [winces at the thought]?"
[Answer = ]  2002. [It] shouldn't be. There's a strict hierarchy
in the way these work. So, at this stage
we don't know how much, if anything, you
remember about Roman Numerals. So, as an
Extra Bits we've put out a simple
tutorial on Roman Numerals based around
this example, But for those of you that
are totally happy with Roman numerals, we
can just carry on with the main video now.
So, just to locate ourselves then, and
do a few examples to make everybody
comfortable I'm going to write down  MCDXCII. 
Start at the left. We've got a
thousand [M] Then we see CD and, remember,
this is one of these subtraction things
CD is "a hundred less than 500" which is 400.
MCD. So we're up to 1400 so far. XC?
>> Sean: OK That's 90
>> DFB: 90. II?
>> Sean: 2
>> DFB: 1492. What happens in 1492 Sean, in 
Western European folklore?
"Columbus discovered horizons new in
America (1492)
>> Sean: I should have known that one
>> DFB: Yeah! right. And, of course, the year 
we're in at the moment, as we film all of this,
is - that's simple MMXX, 2020.
And one which is dear to me, I'm sure you'll
work out why. Yes, it's got a lot of these
prepended abbreviation things in there look,
MCM;  1900, XL: 10 less than 50 (40) 
IV: one less than 5 (4) -- 1944. 
 244 let's focus
Let's focus ourselves now on how would we do a
Regular Expression for all of this?
In fact, let's make things very very simple
and say: "All I'm gonna do at the moment
is concentrate on getting a Regular
Expression pattern that will either
generate me one or two, or three, or four, or five -
[writes on paper] forgive me I'll carry on down here.
Or six, seven, eight, or nine. And I won't go
any further than that.
Because tens is a new adventure
further to the left. So, those numbers
then, from one to nine, which I've written
know like that now. I hope you can see
that if you're completely simple-minded
about this and don't want to do complex
Regular Expressions you really could
write yourself a great long 'alternation'
as it's called. In other words a great
big sequence of ORs. And you could do
everything from one up to thousands and
just have this walloping great OR clause,
that your Regular Expression parser will
probably go bananas and say: "too many
alternations", or something like this.
So, no, that's not the way to do it. What one
wants to do is to generalize and use,
as it were, elegant Regular Expression
capabilities to shorten this a bit.
So, there we are the Roman numbers from I to IX,
All written out.
First of all let's try and separate
these things into classes, and groups,
that make sense. There are two weird
things which involved the subtraction
rule, you know, like IV (1 less than 5)
IX (one less than 10). I think we will do those
together, because they're weird in the
same way, if you like. The other things
are much more regular like you go from 1
2 and 3, which is with III. And similarly,
here, you get the same pattern but coming
after a V  -  a V on its own, or VI, VII, VIII,
and so on. So, we've got three
groupings: things involving Is with or
without a V in front of them;
and the two special cases IV and IX. So I
would conjecture there's a wonderful
piece of Regular Expression covering so
far what we've done, but something new
coming up - so do pay attention -
would be this: my RE is going to be I
- now this is square 'choice' brackets [], 
We have seen these before.
But whereas, before, I put ranges inside
them - you know like [A-Za-z]
[0-9], any one of those digits.
These aren't ranges; these are just
explicit possibilities inside square
brackets. It's called a 'choice' so I've got
to have an 'I', if I'm on this side of the
expression. But then it could be followed
by either an X or a V, but it must be one
or other of them, and only once.
There's no repetition outside; there's no
star there's no plus; no nothing. So those
are the special cases taken care of, 
the IV and the IX. Now we put a walloping
great big OR bar down the middle. Or if
it's not a special case it's a regular
straightforward case. But we've been
clever with our REs now. Oh yes! look
at this. Now, again, I think this is
something we may have mentioned but
probably not another piece of regular
expression notation is: " ... if you put a
question mark after something it means
0 or 1 of". So, what I'm saying here
is I can either have a V, or I don't need
a V. How can you not need a V? Well,
what's following on here is going to be
something that really is new. This is a
counted repetition in curly braces zero
- [should be 0,3]. Look at that! Bound 
to win an award! Let's just run through it. 
The two special cases are off on that left-hand
side of the OR bar. But if you choose not
to go down that route you go to here and
it says: "Oh well if it's not one of
the two special cases then it's like the
following: anything between 0 and 3 Is.
I mean you may not have Is at all - you
can just have V on its own if you want -
preceded by an optional V,  nought or one
these,
>> Sean: Tthat's giving you the choice of
either 1 2 3, or 5, or 6 7 8, right?
>> DFB: Yeah, yeah, that's right.
But it's all to do with sometimes
having zero of something. Now, hereby
hangs a tale about Regular Expressions.
And those of you that keep on posting
these notices about " ... you've got multiple
problems with Regular Expressions !".  Well. here's
another one. It's the strange case of the 'empty
alternative', which really can gang up to
get you. Because what you're saying here,
in this expression, is: " ... Is it possible for
it to match nothing at all and yet
that's a valid match?" Yes it is! Because
if you choose to ignore what's on the
left of the OR bar here - which you can do,
it's either/or - we come here, and we say: " V,
oh! I can have 0 or 1 of those. So, let's
have 0 of them". The I ?  I can have
anything between 0 to 3 of those. So
let's choose 0 of them. So, in other words
choosing nothing at followed by nothing
is a valid match for a Roman number 
[according to this RE].
And if you think about it it's true. Nothing in
Roman Numbers is compulsory. You can have
M's, or no M.  Yyou can have C's or no C's.
You know all of them. 
It's all optional. But as you start getting down
to the units digits you think: "I've got
to have something !"
The paradox, then, about Roman Numbers
done as a Regular Expression, is: you've got
all sorts of cases that yield you
nothing at all and yet you're going to
be very disappointed if you don't have something!
But saying "... overall you must
have something" That's not a context-free
assertion. That's a context-sensitive
assertion. So we'll live with that for the
moment. And I think also what I'd like to
just show you now, if I can get the
screen fired up ... and I'm using 'awk' here
to demonstrate this, I'm going to add yet
another little piece of notation to all
of this. I'm going to say that you can, if
you want to, insist that what you're
matching must just be a valid Roman
Number, on a line on its own, with no
writing or messing about on either side
of it. Just the number. Start of the number 
(^),  end of the number ($). Finish. Right?
And you denote that by saying start of
string anchor, ^  End of string anchor, $.
You're basically saying whatever 
pattern there is must fit between
the start of the string, on a line on
its own, and the end of the string - the
newline right at the end. Now, this is the key 
thing about this. Those things [^$ markers]
if you put them in, are not options. You
can't have zero of them. If you say
you must match the beginning of the line,
you must match the beginning of the line. If you
say you must match the end of the line,
you cannot ignore it. You can't have zero
of them. So, in a way, this is a kind of
reality check and it's generally the
case that if you use these correctly you
won't end up with null strings being a
pseudo-match. Because this thing is saying: 
"Well, whatever the finite [non-zero] length
string is, it's got to fit between
beginning of line end of line. Here's the
'awk' script that I hacked together, to try
out a few alternatives. And I decided to
go for this simple little piece first.
Can I do everything from 1 to 9 in a
single piece of Regular Expression. And
you'll see here this is classic 'awk'.
You, give a Regular Expression pattern, then
you give an action. It's a C-like syntax,
even in 'awk'. Basically what I'm saying is;
"If I match that then I print out 'Pattern
p1 matched' and so on". Now, if you look
down here four of these things - because
this is just a test-bed - four of them are
commented out, they will not be executed.
The one that will be executed is the one
I've just been discussing with you, here,
which says: "I must match start of string
then here's my Regular Expression
pattern - which I think will work from I
to IX inclusive, and then it must match
end of string". So, I'm hoping that by
uncommenting this that this will work.
And I'll type some good and bad Roman
Numbers at it. And for every one it's happy
with, it will say 'Pattern P0 matched is'
And it will echo back what it your Roman
Number [was] that you put in. So, what I want to
do, now, is actually run that script, that
we've been looking at. 'awk - f ' take your
instructions from a file called 'test-roman.awk'
OK, so when that runs now it
should just hang there waiting for its
input of a Roman number. And remember it is going
to match that beginning of string, piece
of regular expression, end of string.
>> DFB: So. which one do you want me to try first Sean?
>> Sean: OK, let's go really simply and just go 
for a number 5 [V].
>> DFB:  V all on its own [waits]
Look at that! That's a 'P0 matched is V  
>> Sean: All right - a bit trickier 4 [IV] because
it's got that 'minus one' sort of idea. 
Let's put one in that shouldn't work.
>> DFB: Go on that what do you want ?
OK just put in C for Computerphile.
>> DFB: C for Computerphile?
Now you might say that's a valid Roman number.
Yes, but it's it's not being covered by the bit of
Regular Expression I've done, I've done a
piece of Regular Expression that
literally, I hope, does I to IX and
nothing else. Let's see what happens there [with C]
It echoes it back but it doesn't
do anything with it.
So let's now try IX, looking good.
All right how about the ultimate test [which] is
type in something utterly idiotic like HELLO
No, doesn't do anything with it,
doesn't recognize it. So, we're back
showing the 'awk' script again and let me
just remind you that what we've shown is
that as long as we put the anchors in at
the beginning and the end of this
pattern then our minimal testing, but I
think we've tried all the options, it's
recognized V it's recognized IX. IV
all this stuff. That one is absolutely
fine and it rejects rubbish. I've
actually put in, as a comment, 
StackOverflow's assertion that this [points at screen] 
is the thing that's the answer to life the
universe and everything. Let's move right
across to the right and the first thing
you'll see here is the piece we've just
done for the units as it were, I through to IX.
But as you go leftwards can you
see that the beauty of this is ... 
that has handled the units part of the Roman
Numbers. It's dead easy to see, therefore,
that it's like a repetition of the same
idea, but with different symbols? When you
come to the tens range What are the
exceptions there XC (ten less than a hundred)
XL (ten less than fifty). I's just like the 
I's and the X's and V's were before, look.
Similar. X[CL]. Or it is is an optional
L, for 50,  followed by a tens digit, zero
{0,3} of them. You can't say just {1,3}
because L on its own is
perfectly valid as a Roman Number Number.
So, you see that piece of pattern - similar to
the far-right one but just with a
relabeling almost of the digits you're
using-  that copes with the tens. Up here 
[points to left of screen] it's
coping with the hundreds, going on to
the thousands. So, the story finishes off
in here by saying: Ah! you can have CM
(a hundred less than a thousand) i.e. 900
in other words. CD for 400.
And the midpoint of the range,
which everything pivots around. It was V
there, it was L here, it is going to be D
here. I hope you all know that D is
500? So there we are.
how about you see three very similar
pieces of pattern for the hundreds the
tens and the units and right out here at
the far left, M*. And M* means
you can have zero or more 1000s. So, we've
got the basis there for something that
looks like it would do everything. The
only problem is these empty matches. But
that's such a big problem we probably
ought to have an episode devoted to that
all on its own.
