Hi! In this video we're gonna talk about
regular expressions, one of the most
powerful tools in our NLP toolkit.
So we've all had to do this at some point:
you have a huge document and you have to
try to correct some tiny detail, but
doing it by hand would take forever.
We probably want some way to find
complex patterns quickly, like the
pattern in the example here. It has some
sentences that are separated
by two spaces, some that are
separated by three spaces, and some
that are separated by one
space. And we probably want to regularize that.
We want to turn everything into one
space or two spaces, whichever convention you
prefer. But how do you ask the computer
for such a complicated pattern?
This is where regular expressions come in. You
can specify a simple string to get the
computer to look for a complex pattern
in a huge collection of texts. You can
use this to extract emails and phone numbers,
or to find alternate spellings of words, and
then manipulate the text so that
it's more regular for
the computer to learn from, for
example. Let's start at the beginning.
Regular expressions are a notation and
kind of programming language used to
specify text strings. Essentially you
give the program a string and then tell
it to find that within a larger string
or a document. And the computer is very
literal about it: if you give it the
pattern R O B O T, as in robot, it's
gonna search for exactly that. For
example, if you have the document "robot,
robots", it's gonna find the first
word and then only a fragment of the
second word. It's not gonna give you the
s because you did not specify it in your
regular expression. Let me show you how
it would work. I'm going to use one of the
regular expression testers listed on
the lower left of your screen. I'm gonna use
regex101, but you can use any of the
many that are available online.
So for example, I have a document that
has robot, robots, Robot, Robots, in lowercase
and uppercase. If I give it robot in lowercase, it's gonna try to find exactly that,
and you can see that it found it twice:
robot, and robots without the s. Very literal.
As you noticed, regular
expressions are case sensitive: if you
tell the computer to find Robot with an
uppercase R, it's gonna find Robot and Robots
with the uppercase, but it's not going
to see the ones with the lowercase.
For example, if I change this to uppercase
Robot, you can see how it finds
the words in the second line but
not the words in the first line, because
it's literally looking for uppercase R
followed by lowercase obot.
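
If you'd rather try this in code than in an online tester, here is a minimal sketch using Python's built-in re module. Python is not what the video uses, and the two-line sample text and variable names are assumptions meant to mirror the demo.

```python
import re

# Assumed sample document: lowercase words on line one, capitalized on line two.
document = "robot robots\nRobot Robots"

# A plain pattern is matched literally and case-sensitively.
lowercase_hits = re.findall(r"robot", document)
uppercase_hits = re.findall(r"Robot", document)

print(lowercase_hits)  # ['robot', 'robot']  (the s of "robots" is not included)
print(uppercase_hits)  # ['Robot', 'Robot']
```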
We probably want to find words that are in either lowercase or uppercase, because they're
the same word. We can specify more than
one option using square
brackets. So for example, here the pattern
[rR], square bracket, lowercase r, uppercase R,
square bracket, tells the computer that it should
find one of those two characters, but
just one: either the lowercase or the
uppercase.
If you give it the same
string that we had and do this, it's
going to find one character from the
list of lowercase r and uppercase R, and then
literally obot, all lowercase. So this
finds all four cases of robot within our text.
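
Here is a small sketch of the square-bracket pattern in Python's re module. The sample text and variable names are assumptions for illustration.

```python
import re

# Assumed sample document with both lowercase and capitalized forms.
document = "robot robots\nRobot Robots"

# [rR] matches exactly one character: either r or R. Then "obot" is literal.
either_case = re.findall(r"[rR]obot", document)

print(either_case)  # ['robot', 'robot', 'Robot', 'Robot']
```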
You can also specify ranges for the
options within the square brackets. So
for example, if you tell it to look for
any character between a and z and then
for the characters at, as in [a-z]at, it's gonna find
words like cat and bat, but it is not
going to find at by itself, because the pattern
tells it to look for three things: one
character in the range, the a, and the t.
For example, we have here cat, bat, rat,
at. If we tell it to find the a and the t,
it's going to find it in some of them, but
it's going to miss part of the words. If
we tell it to find any character between a
and z followed by an a and a t, it's
gonna find only the lowercase ones, and
it is not going to tag the last word of
the line, because that one is missing a
character: it only has two characters and
the regular expression has three.
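
A quick sketch of the range idea in Python's re module, assuming a one-line sample text with the four words from the example:

```python
import re

# Assumed sample text: three three-letter words and the bare word "at".
text = "cat bat rat at"

# Just "at": found inside every word, but the first letter is left out.
bare_hits = re.findall(r"at", text)

# One character in the range a-z, then a, then t: three characters in total.
range_hits = re.findall(r"[a-z]at", text)

print(bare_hits)   # ['at', 'at', 'at', 'at']
print(range_hits)  # ['cat', 'bat', 'rat']  (the bare "at" has only two characters)
```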
We can of course specify ranges that are
longer. For example, we can tell the
computer to give us any character in the
lowercase a to z and in the uppercase A
to Z, so that it can find cat, bat, and
rat, and it wouldn't matter
if they start with lowercase or uppercase.
So we have lowercase a through z and uppercase A through Z, and it
finds all of those words that have a
letter and that then have
the letters a and t, but it doesn't tag at,
because that one's missing a character.
Notice again that it only finds one
character from the whole list in the
square brackets.
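
As a sketch in Python's re module, with an assumed two-line sample text, combining both ranges in one pair of square brackets looks like this:

```python
import re

# Assumed sample text: a lowercase line and a capitalized line.
text = "cat bat rat at\nCat Bat Rat At"

# Only the lowercase range: capitalized words are skipped.
lower_only = re.findall(r"[a-z]at", text)

# Both ranges in one set: still exactly ONE character from the set, then "at".
both_cases = re.findall(r"[a-zA-Z]at", text)

print(lower_only)  # ['cat', 'bat', 'rat']
print(both_cases)  # ['cat', 'bat', 'rat', 'Cat', 'Bat', 'Rat']
```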
There are other ways of providing
alternatives in regular expressions.
For example, you can use the pipe
character, the 'or' character. If you have a
string like "that cat is great, that
dog is great", you can find either the
word cat or dog. As you can see here,
cat|dog will find you either one of
those, but only in lowercase. We can
use, same as we did before, square
bracket, lowercase c, uppercase C, square bracket, at, or
[dD]og, to find cat or dog in either
uppercase or lowercase. Let me show you.
So that regular expression
finds every instance of cat or dog
regardless of whether it's uppercase or
lowercase.
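
A minimal sketch of the pipe in Python's re module. The sample sentences are an assumption based on the example string.

```python
import re

# Assumed sample text with lowercase and capitalized animal names.
text = "That cat is great. That dog is great.\nThat Cat is great. That Dog is great."

# The pipe means "or": match the literal cat or the literal dog.
plain = re.findall(r"cat|dog", text)

# Combine the pipe with square brackets to accept either case of the first letter.
any_case = re.findall(r"[cC]at|[dD]og", text)

print(plain)     # ['cat', 'dog']
print(any_case)  # ['cat', 'dog', 'Cat', 'Dog']
```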
There are several special characters that you
can use. For example, the question mark
will match the preceding character zero times or one
time. The word color, for
example, is written different ways
throughout the English-speaking world.
Sometimes it's written color, as in the
US, without a u, and sometimes it's written
colour, with the u, like they do in Great
Britain or New Zealand. So colo matches
the letters colo, and then square
bracket u question mark matches
zero or one
occurrences of the letter u, followed by the r.
As you see here. And as you can see, the square brackets are not necessary for that one: colou?r works too.
Color and colour are the same word, and
this regular expression allows us to
find it regardless of its spelling. So
the question mark allows you to find
something either zero times or one time.
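
Here is a short sketch of the question mark in Python's re module. The sample sentence, including the misspelling colouur, is made up for illustration.

```python
import re

# Assumed sample text: both spellings plus a misspelling with two u's.
text = "The color red, the colour blue, and the typo colouur."

# u? means the u may appear zero times or one time, but not twice.
with_brackets = re.findall(r"colo[u]?r", text)
without_brackets = re.findall(r"colou?r", text)

print(with_brackets)     # ['color', 'colour']
print(without_brackets)  # ['color', 'colour']  (same result, brackets are optional here)
```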
The star, also called the Kleene star, lets
you find something zero or more times. So
the expression bo*h, where the Kleene star is
just the asterisk on your keyboard,
will match the string bh, which is b, zero o's, h, and it'll also match boh, booh, boooh, booooh,
and so forth. This is
useful if you're looking for data on
Twitter, for example, where someone might
write what with one a, whaaat with three a's, or
whaaaaaaaaaat with ten a's. They're all the same
word.
So bo*h will match all of
these: bh, with zero o's, and then boh,
booh, and so forth.
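
A small sketch of the star in Python's re module. The candidate lists are assumptions, and re.fullmatch is used so that the whole string has to fit the pattern. The pattern wha*t is my own guess at how the Twitter example could be written; note that it also accepts wht, with zero a's.

```python
import re

# Assumed candidates; "bah" is included as a string that should NOT match.
candidates = ["bh", "boh", "booh", "boooh", "bah"]

# o* means the o may appear zero or more times between b and h.
star_matches = [s for s in candidates if re.fullmatch(r"bo*h", s)]

# Elongated spellings of "what"; a* also allows zero a's, so "wht" matches too.
tweets = ["what", "whaaat", "wht"]
what_matches = [s for s in tweets if re.fullmatch(r"wha*t", s)]

print(star_matches)  # ['bh', 'boh', 'booh', 'boooh']
print(what_matches)  # ['what', 'whaaat', 'wht']
```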
The period, the dot, will match
any character, but just once. So beg.n
matches begin, begun, began, and beg3n,
but it wouldn't match begn, because
you need exactly one character there.
So beg.n matches
beg literally, then any character
once, then the letter n, so begn is not marked as an instance
of the regular expression.
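
The dot can be sketched like this in Python's re module. The word list is taken from the example, and the variable names are just illustrative.

```python
import re

# Words from the example; "begn" has nothing between the g and the n.
words = ["begin", "begun", "began", "beg3n", "begn"]

# The dot stands for exactly one character of any kind (letter, digit, ...).
dot_matches = [w for w in words if re.fullmatch(r"beg.n", w)]

print(dot_matches)  # ['begin', 'begun', 'began', 'beg3n']
```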
The plus sign
allows you to find something one
or more times. So bo+h
will mark boh and booh, but not bh, because you
need the o to occur at least once.
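
For comparison, here is a minimal sketch of the plus sign next to the star in Python's re module, using an assumed list of candidate strings:

```python
import re

# Assumed candidates, including "bh" with no o at all.
candidates = ["bh", "boh", "booh"]

# o+ requires at least one o; o* would also accept zero.
plus_matches = [s for s in candidates if re.fullmatch(r"bo+h", s)]
star_matches = [s for s in candidates if re.fullmatch(r"bo*h", s)]

print(plus_matches)  # ['boh', 'booh']
print(star_matches)  # ['bh', 'boh', 'booh']
```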
Let's practice!
For example, let's build a regular
expression that can find all of
these four phrases. You should
pause the video and then unpause it so
that you can see the solution. Pause it now.
And welcome back. So for Hello, John; Hello, Anne;
Hello, Alberto; and Hello, your name (and of course
insert your own name), the solution would be something
like this: first we need literally the
word Hello, literally the comma,
literally the space, and then any
character as many times as it's needed.
So this matches literally Hello, the comma, the
space, and then any character many times,
which matches the names. This will find
all of the four sentences.
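
One way to check a solution like this is a short Python sketch with the re module. The exact pattern on screen is not in the transcript, so "Hello, .*" is an assumption, and Maria is a hypothetical stand-in for your own name.

```python
import re

# Assumed phrases; "Maria" is a placeholder for "your name".
phrases = ["Hello, John", "Hello, Anne", "Hello, Alberto", "Hello, Maria"]

# Literal "Hello", literal comma, literal space, then any character, many times.
greeting = re.compile(r"Hello, .*")

matched = [p for p in phrases if greeting.fullmatch(p)]

print(len(matched))  # 4
```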
Let's do something more advanced.
How about we
extend it to Hello John without a
comma, Hello, Alberto with the comma, and Hi,
your name. Go ahead and give it a
try, and pause the video.
And welcome back.
So this was a little bit more
challenging. We could first try to
find literally the Hello. We're gonna
face the problem that the comma could
happen zero times or one time, so we're
gonna have a ,? so that
we can have the comma zero times or one
time. Then we need the space, because
there's always a space between the
Hello and the name, and then any character
as many times as it's needed. This finds
three of our sentences and not four, so we
need to provide the option for
the greeting to be either Hello or Hi. So
this matches Hello or Hi, the comma zero
times or one time, a space, and then any
character as many times as is needed.
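
The two attempts can be sketched in Python's re module as follows. The exact phrases and patterns shown on screen are not in the transcript, so the four phrases here, the name Maria, and both patterns are assumptions. The parentheses are needed so that the pipe applies only to the greeting and not to the rest of the pattern.

```python
import re

# Assumed phrases; "Maria" is a hypothetical stand-in for "your name".
phrases = ["Hello John", "Hello, Anne", "Hello, Alberto", "Hi, Maria"]

# First attempt: optional comma, but the greeting is always "Hello".
first_try = re.compile(r"Hello,? .*")

# Second attempt: the greeting may be "Hello" or "Hi".
second_try = re.compile(r"(Hello|Hi),? .*")

first_matched = [p for p in phrases if first_try.fullmatch(p)]
second_matched = [p for p in phrases if second_try.fullmatch(p)]

print(first_matched)   # ['Hello John', 'Hello, Anne', 'Hello, Alberto']
print(second_matched)  # ['Hello John', 'Hello, Anne', 'Hello, Alberto', 'Hi, Maria']
```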
We shall continue with more options for
regular expressions.
