Hi. In the next two videos, we will look
at spell checking, and we will start with
a very simple algorithm called minimum
edit distance. In the last two videos,
we've been looking at how writing is a
technology: a technology to represent
spoken language or sign language in
general. Writing is a technology to
represent natural language, and spelling
rules are a part of that technology.
Somebody invented them for various
reasons, and they have become a part of
using that technology. When we use any
technology, there are moral implications
to how we're using it and how we choose
to deal with problems in it. Now that we have
discussed that, we can take a look at how
the computer can find
variations in your spelling or spelling
mistakes, knowing how loaded those two
words are. Spelling variation can come in
two forms. Sometimes it comes from
non-words: for example, when you're typing
in English, instead of typing the word
giraffe you produce the word graffe
with your keyboard. The word graffe does
not exist in the English language, so
this is a non-word, and we would need to
find the closest real word to that
non-word. Errors can also come from real
words, for example typing a word that
does exist but is out of context. Take
the sentence She's walking across the
desert. What it actually says there is
She's walking across the dessert, with
two s's. Dessert does exist in English;
it just shouldn't be written with two
s's right there. In this video, we will
look at a method to find non-word
corrections.
Let's say we have a beautiful corpus of
English, and from that corpus we select
the unique tokens and we make them into
a big list which we're gonna call our
dictionary. This dictionary has all the
possible words in English and
somewhere in that dictionary, there's the
word intention. Suppose that we have this
beautiful dictionary file, and then we
get some input file, some text, and the
person who wrote the text in one of the
paragraphs wrote "ntention", further down
they wrote "intecntion", and in the final
paragraph they wrote "inxention". None of those
three words exist in the English
language. So we need to find the words
that are closest to them. We could do
this by performing simple operations,
like insertion of letters, deletion of
letters, and substitution of letters. For
example, to transform ntention into
intention, we need to insert an i. To
transform intecntion into intention, we
would need to delete the c. And to
transform inxention into intention, we
would need to substitute the x with a t.
We could assign an arbitrary cost to
each of these operations. For example,
they could all cost the same. Maybe it'll
cost you one unit of effort to insert a
letter, because you insert one character.
Maybe it would cost you one unit of effort
to delete a character, like the c in
intecntion. It could cost you one unit of
effort to substitute something, for
example inxention into intention. Some
implementations of these algorithms
assign substitution a cost of one, and
some assign it a cost of two, because
you can also think of substitution as
two operations: deleting the x and
inserting the t. So individual
implementations may vary, but what's
constant among implementations is that
you need to assign a cost to each
operation. Let's
assign a cost of one to all of them for
this example. Let's say we want to
measure the distance between the word
kitten and the word sitting. In order to
transform kitten into sitting, we would
first need to substitute the letter k
with the letter s and this operation
will cost us one in this example. The i
is fine, the two t's are fine. We need to
substitute the letter e with the letter
i, so that costs us 1 unit of effort. The
n is fine, and then we need to insert a g
because kitten doesn't have one at the
end. So the insertion cost is 1 unit. In
total, transforming kitten into sitting
cost us 3 units of effort. We're gonna
call this the edit distance between
kitten and sitting. So how can we use
this for spell checking? We have our
beautiful dictionary of English here, and
we have an input document with some
words. In this document we find
the sentence: The road to hell is paved
with good intenctions. For each word in
the input document, we're going to
calculate its distance to the words in
the dictionary, and with some luck we
will find them. For example, if we have
the input word road, we will find the
word road in the dictionary: there will
be a word such that the distance between
road in the dictionary and road in the
input is zero, because they're identical.
If we have such a word, then it means
that road is correct and it does not
need correction.
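The procedure described so far can be sketched in Python. This is a minimal illustration, not the textbook's implementation, and it assumes unit costs for insertion, deletion, and substitution, as in the example we just worked through.

```python
def edit_distance(source, target):
    """Minimum edit distance between two words, with unit costs for
    insertion, deletion, and substitution (one possible cost model;
    some implementations give substitution a cost of two instead)."""
    m, n = len(source), len(target)
    # dist[i][j] = cost of transforming source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i              # delete all i characters
    for j in range(n + 1):
        dist[0][j] = j              # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,       # deletion
                             dist[i][j - 1] + 1,       # insertion
                             dist[i - 1][j - 1] + sub) # substitution
    return dist[m][n]

print(edit_distance("kitten", "sitting"))  # 3: k->s, e->i, insert g
print(edit_distance("road", "road"))       # 0: identical words
```

With this function, a word whose distance to some dictionary entry is zero, like road, needs no correction.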
How about when we keep making progress
into the sentence, and then we find
intenctions?
What's going to happen is that you will
not find intenctions in the English
dictionary, so then you need to calculate
which word in the dictionary would be
closest to intenctions according to your
edit distance function. For example,
transforming intenctions into kittens
would have a cost of 8. Transforming
intenctions into flowers will
have a cost of 13, and transforming
intenctions into intentions will have a
cost of 1. So we'll find the candidates
that have the shortest edit distance, and
hopefully there will be a candidate that
has the minimum edit distance: in this
case, intentions for intenctions, and not
flowers or kittens. We will then take that
candidate with the minimum edit distance
and suggest it as a possible correction
so that intenctions can be replaced with
intentions, potentially. So we can use an
algorithm like minimum edit distance to
find suggestions for spelling
corrections and it can correct non word
errors. By the way, the exact
implementation of minimum edit distance
is in your textbook, Jurafsky and
Martin; please write if you have any
questions about it, and of course go
check it out if you're interested in the
details of the implementation. This
algorithm however cannot correct real
word errors. If you get something like
They where here, where the word were has
been typed as where, the computer will
just say: oh, there is a word where in
the dictionary, that's fine. So it will
not see it as a potential error. In our next video we'll
incorporate probabilities into our
calculations so that we can find real
word errors and find that the
collocation they where with an h is
unusual and that it should be replaced
with the more usual collocation they
were without an h.
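Putting it all together, the non-word correction loop described in this video might look like the sketch below. The dictionary here is a tiny made-up word list, and suggest is a hypothetical helper name; a real system would build its dictionary from the unique tokens of a large corpus and would typically return several ranked candidates rather than a single word.

```python
def edit_distance(source, target):
    """Minimum edit distance with unit costs for insertion,
    deletion, and substitution (one possible cost model)."""
    m, n = len(source), len(target)
    prev = list(range(n + 1))               # distances for source[:0]
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + sub)  # substitution
        prev = curr
    return prev[n]

def suggest(word, dictionary):
    """If the word is in the dictionary, it needs no correction;
    otherwise return the dictionary word at minimum edit distance."""
    if word in dictionary:
        return word
    return min(dictionary, key=lambda entry: edit_distance(word, entry))

# A toy dictionary -- a real one would hold every unique token of a corpus.
dictionary = {"the", "road", "to", "hell", "is", "paved", "with",
              "good", "intentions", "kittens", "flowers"}

for token in "the road to hell is paved with good intenctions".split():
    print(token, "->", suggest(token, dictionary))
    # the non-word intenctions comes out as intentions,
    # its closest dictionary word at distance 1
```

As the video notes, this sketch shares the algorithm's limitation: a real-word error like where for were is in the dictionary, so suggest would return it unchanged.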
