All right.
Hi there, everyone.
I'm Alaina.
And in this lovely pre-dinner slot I will be
telling you the linguistic back story of the
regular expression.
So, regular expressions look a little intimidating.
I mean, imagine being someone who is new to
programming.
Say, programming in JavaScript and seeing
something that looks like this.
When I started coding, I was pretty freaked
out by regular expressions.
I just didn't get them.
If I were a Shiba Inu, I would look like this.
Some of you might relate to that sentiment.
Some of you might wonder how you can get better
at writing regular expressions.
The usual answer is something like StackOverflow.
But people tend to use StackOverflow to copy
and paste code when they're in a pinch rather
than to learn the proper way to actually write
regular expressions.
To truly understand regular expressions, I
propose an unorthodox route.
Do it right now.
No, I mean, I'm being a bit facetious and,
you know, this is a big thing, learning linguistics.
What I really mean to say is, learn how regular
expressions work in the field of linguistics
before trying to use them in JavaScript.
This is what I'll be walking you through today.
Let me take you on a journey that begins with
Noam Chomsky, pictured here, and that deserves a cheer.
Cool.
Chomsky fans in the audience.
Me too.
And in this journey,
we'll start with Chomsky and we'll end with
the regular expression.
So, as some of you may know, Chomsky is not
a computer scientist by profession.
He is a linguist, a philosopher, a political
activist and a ton of other things.
But his contributions to computer science
were pretty accidental.
His major contribution to the field of computer
science is something known as the Chomsky
hierarchy.
The Chomsky hierarchy is an ordering of grammars.
But since most of us in the room are not linguists,
we have to cover what a grammar is before
we can go further.
First thing, a linguistic crash course so
that we can understand Chomsky's hierarchy.
So, grammars.
Grammars provide the rules that build languages.
Some of you might also wonder, what's a language?
Well, a language consists of symbols and
the rules to apply to those symbols.
In other words, languages consist of symbols
and a grammar.
These symbols can be combined to make strings.
The grammar provides us with a set of operations
that can be used on these strings of symbols.
To be a bit more specific, grammars provide
a framework with which input or intermediate
strings of symbols can be converted into some
output string of symbols.
Throughout this presentation, I might refer
to input and intermediate strings as nonterminals
and output strings as terminals.
I'm just telling you this now so that if you
hear these different terms, you'll know that
they mean the same thing.
So, I want to make these terms feel a bit
more tangible because obviously learning linguistics
in two minutes is not a thing that anyone
can do.
So, I'm going to provide an example using
a made-up language from the world of Harry
Potter known as Parseltongue.
So, in Parseltongue, there are three terminal
symbols and three nonterminal symbols.
They're right here.
In Parseltongue, there's also a set of rules
for converting nonterminal symbols into terminal
symbols.
So, now let's go through an example of converting
a nonterminal string into a terminal string.
So, here is our starting string.
It is a nonterminal.
Oops.
I'm sorry.
One slide back.
This is our starting string.
Now, I'm going to make one conversion of the
first nonterminal symbol into a terminal
symbol.
However, this conversion does not make the
entire string a terminal string.
The string is still a nonterminal because
there are still nonterminal symbols, the
words "Am" and "Snake," in there.
With that being said, I'm going to make the
next conversion of "Am" into "S." The string
is still nonterminal.
But the string becomes a terminal when I turn
"Snake" into this "SS."
We produced a string made up entirely of terminals,
a terminal string.
You could call this a grammatically correct
phrase in Parseltongue.
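To make that derivation concrete, here is a minimal sketch in JavaScript. The rules for "Am" and "Snake" are the ones mentioned above; the first symbol "I" and its terminal "SSS" are hypothetical placeholders, because the slide with the full rule set isn't captured here. The names `rules` and `derive` are also made up for illustration.

```javascript
// Rewrite rules: nonterminal symbol -> terminal symbol.
// "Am" -> "S" and "Snake" -> "SS" come from the talk; "I" -> "SSS" is a
// hypothetical placeholder for the first rule, which the transcript omits.
const rules = new Map([
  ["I", "SSS"],
  ["Am", "S"],
  ["Snake", "SS"],
]);

// Convert one nonterminal at a time, left to right, recording every
// intermediate string along the way.
function derive(symbols) {
  const steps = [symbols.join(" ")];
  const current = [...symbols];
  for (let i = 0; i < current.length; i++) {
    if (rules.has(current[i])) {
      current[i] = rules.get(current[i]);
      steps.push(current.join(" "));
    }
  }
  return steps;
}

console.log(derive(["I", "Am", "Snake"]));
```

Each entry in the returned list is one state of the string. Only the last one contains no nonterminals, so only the last one is a terminal string.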
Cool.
Now that we've introduced linguistic terms,
let me reintroduce you to the Chomsky hierarchy.
It looks like this.
In the interest of time, I'm going to hone
in on the regular grammar, the innermost circle.
What is so special about this regular grammar?
In the process of converting a nonterminal
string into a terminal string, regular grammars
have no knowledge of the string's past states.
Once the regular grammar converts one
nonterminal symbol into a terminal symbol, the
grammar essentially forgets what the previous
state of the string was.
We care about regular grammars because they
describe the same languages that regular expressions do.
I mentioned a slide ago that regular grammars
retain no information about intermediate steps
in converting an input string into an output
string.
So, what does that tell us about the regular
expression?
The forgetfulness of regular grammars implies
that translations in one part of a string
do not impact how nonterminals are translated
in future steps.
This means that there's no coordination between
the different parts of the string in creating
the output string.
This gives us some insight into why programmers
first brought regular expressions into code.
Because regular expressions convert input
string to output string piece by piece, they're
wonderful at pattern matching.
In programming, regular expressions convert
a pattern into the set of strings that match
that pattern.
So, in JavaScript, this is what the match
method does.
It takes in a pattern, let's say it's called
regX, like up here.
And this pattern is a regular expression.
And then match will find all substrings of
a string, which I've called S. And then match,
sorry, it will find all substrings that match
that pattern within the string S.
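As a small sketch of that call: `regX` and `s` are the names used above, but the values here are made up for illustration. One detail worth knowing is that `match` only returns every match when the pattern carries the `g` flag, and the matches it returns don't overlap.

```javascript
// `regX` and `s` mirror the names from the talk; their values are invented.
const regX = /an/g; // the g flag asks for every match, not just the first
const s = "banana";

// With the g flag, match returns an array of all non-overlapping matches,
// or null when nothing matches.
const found = s.match(regX);
const nothing = s.match(/xyz/g);

console.log(found);   // [ 'an', 'an' ]
console.log(nothing); // null
```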
So, the linguistic origin of this programming
concept that we know, regular expressions,
is starting to make a bit more sense now.
One question remains.
How else do regular expressions in programming
relate to those that we've seen in
linguistics thus far?
Well, I'll walk you through the similarities
before their paths fork.
Regular expressions of linguistics and programming
use the same three grammatical operations.
These operations are alternation, repetition
and concatenation.
I'll talk you through what these three rules
look like using both words and then some examples.
Alternation effectively gives us this logical
or.
And I'm going to tell you a great story, a
true story, about the 1920s when a lot of
Greeks were emigrating from Greece to the
United States.
My family was among those Greek people.
Given that my forebears didn't speak English,
it was up to the immigration officials to
determine how their Greek surnames would be
spelled using the alphabet that we use in
English.
So, some of my relatives ended up with the
surname Kafkes spelled with e s.
And then others ended up with the last name
Kafkis spelled with i s.
Lest our family reunions turn into something
like these cats, which are really aggressive,
and we don't want that to happen, I can use
alternation to match the last name spelled both ways,
pleasing everyone.
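Here is roughly what that could look like in JavaScript. The exact pattern from the slide isn't reproduced here, so treat `surname` as one plausible way to write it, with `^` and `$` added so that the whole name has to match.

```javascript
// Alternation: (e|i) means "an e or an i" at that position.
// The variable name and the anchors are assumptions for this sketch.
const surname = /^Kafk(e|i)s$/;

const results = ["Kafkes", "Kafkis", "Kafkas"].map((name) => surname.test(name));
console.log(results); // [ true, true, false ]
```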
Now we're going to move on to repetition.
Repetition allows us to match an expression
zero or more times.
Let's look at some bit strings in order to
see repetition in action.
All right.
So, I have this pattern right here and it
is 01 star.
What 01 star will match is 0, plus 0 followed
by any number of 1s.
The star is a metacharacter, basically a
shorthand character, and symbolizes one of
the many forms of repetition that you can
use in JavaScript.
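A quick sketch of 01 star in JavaScript. The anchors `^` and `$` are an addition here, so that the test is about the whole bit string rather than some piece of it, and the sample strings are made up.

```javascript
// 0 followed by zero or more 1s, anchored to the whole string.
const zeroThenOnes = /^01*$/;

const bitStrings = ["0", "01", "0111", "10", "00"];
const matches = bitStrings.filter((bits) => zeroThenOnes.test(bits));
console.log(matches); // [ '0', '01', '0111' ]
```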
All right.
On to concatenation.
Concatenation joins expressions together in
order to create new patterns.
I think that concatenation is pretty useful
when we're trying to see how matching different
conjugations and tenses of English verbs look.
So, we have this regular expression right
here.
It's call.
Call will match many forms of its infinitive
to call.
So, it will match call, calls, called, calling,
all sorts of things.
But as soon as we add another letter, or another
pattern, to that regular expression, only
calls will be matched.
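In code, that difference might look like the sketch below. The pattern call is itself a concatenation of four single-letter patterns, and calls concatenates one more. The variable names and the word list are only for illustration.

```javascript
const words = ["call", "calls", "called", "calling"];

// c, a, l, l joined together: found inside every form of the verb.
const call = /call/;
// One more letter concatenated on the end narrows the match.
const calls = /calls/;

const matchCall = words.filter((word) => call.test(word));
const matchCalls = words.filter((word) => calls.test(word));

console.log(matchCall);  // [ 'call', 'calls', 'called', 'calling' ]
console.log(matchCalls); // [ 'calls' ]
```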
So, yeah.
Now that we've talked through these examples
of the rules that both engineers and linguists
use in terms of regular expressions, it's
time to discuss how regular expressions in code
and JavaScript in particular differ from their
linguistic predecessor.
Regular expressions in programming languages
have two major add-ons.
These are back references and quantifiers.
Once again, I'll talk you through these differences
with some examples along the way.
Back references allow regular expressions
to store these small patterns, or sub patterns.
Back references can be used to match all the
laughing messages that you send your friends.
And I'll show you how.
So, take, for example, the humble word "Ha."
The pattern above denotes "Ha" as a subpattern
using those parentheses.
And then references it using the backslash
one shorthand symbol.
This metacharacter tells us to match for
the subpattern again.
Then the star at the end matches the second "Ha"
repeated zero or more times.
So, to put it more simply, this pattern that
I've written here matches the string "Ha"
repeated over and over again.
One or more times.
I mean, you can test it out if you want, but
it might just look like this right now.
A little bit of forced laughter.
But yeah.
Cool.
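If you do want to test it out, a sketch along these lines should do. It assumes the pattern is written as a group around ha, then backslash one, then a star; the anchors and the lowercase spelling are additions for the sake of the example.

```javascript
// (ha) captures the subpattern; \1 refers back to whatever it captured;
// the star repeats that back reference zero or more times.
const laugh = /^(ha)\1*$/;

const messages = ["ha", "haha", "hahahaha", "hah", ""];
const laughs = messages.filter((message) => laugh.test(message));
console.log(laughs); // [ 'ha', 'haha', 'hahahaha' ]
```

The empty string is rejected because the captured ha has to appear once before the back reference can repeat it.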
So, some of you might notice something a bit
more disturbing about back references.
And it is that they go against the purpose
of regular grammars as it's spelled out in
linguistics.
As a reminder, regular grammars are not meant
to retain information about previous
states or previous places in the string.
So, despite the performance hit that this
violation will cause, programmers enjoy having
a way to match repeated sub strings.
On to quantifiers.
Quantifiers alter the greediness of the regular
expression matching.
What does greediness mean here?
Well, greediness is measured by the question,
how much of that one thing do you want?
Let's look at the star, plus sign and question
mark quantifiers for some examples in code.
All right.
We saw this example a bit earlier when I was
discussing repetition.
If you'll remember, the pattern 01 star matches
0 and then 0 followed by any number of 1s.
The star quantifier matches the expression
to which it is applied, which is the 1 here
zero or more times.
So, I can contrast this with the plus sign
quantifier.
The plus sign quantifier matches the expression
to which it is applied one or more times.
So, this means that from the previous slide
we had zero as a possible match.
Whereas now we do not.
And then one more quantifier that's pretty
cool to look at is the question mark quantifier.
The question mark quantifier matches the expression
to which it's applied zero or one times.
Only those two cases.
This means that the pattern that I have written
above only matches zero and 01.
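Side by side, the three quantifiers can be compared with a sketch like this one. The anchors, the variable names and the three sample strings are assumptions made for the comparison.

```javascript
const samples = ["0", "01", "011"];

const star = /^01*$/;     // the 1 appears zero or more times
const plus = /^01+$/;     // the 1 appears one or more times
const optional = /^01?$/; // the 1 appears zero or one times

const starHits = samples.filter((bits) => star.test(bits));
const plusHits = samples.filter((bits) => plus.test(bits));
const optionalHits = samples.filter((bits) => optional.test(bits));

console.log(starHits);     // [ '0', '01', '011' ]
console.log(plusHits);     // [ '01', '011' ]
console.log(optionalHits); // [ '0', '01' ]
```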
The star, plus sign and question mark quantifiers
I mentioned are just a few of the many
quantifiers out there that JavaScript programmers
use.
Quantifiers are nice because they give programmers
more flexibility to fine tune their regular
expressions than the linguistic rules allow
for.
And now, the sun must set on our linguistic
journey through the land of regular expressions.
Even though the sun in Berlin doesn't set
until 9 p.m. I walked you through from Chomsky's
theory to JavaScript.
I showed you how regular expressions differ
in linguistics and code.
All that cool information aside, I think that
the most important takeaway from this talk
is that regular expressions were put into
programming languages such as JavaScript for
very pragmatic reasons.
They were also tweaked for practical reasons.
Learning about the linguistics behind regular
expressions helps us understand why the people
who brought linguistics into code made these
very particular design decisions.
This goes to show that interdisciplinary learning
helps enhance computer knowledge.
I encourage you to look at engineering or
JavaScript specifically from other disciplines.
Trace them to their roots like I did today,
and you can deepen your understanding of those
concepts.
Thank you, everyone.
Once again, I'm Alaina.
I work at a cool company called Medium and
I love how linguistics and programming intersect.
Feel free to come talk to me after this.
[ Applause ]
