We're going to talk about stemming,
an important part of pre-processing text.
Some words share a stem. For
example, consult, consultant,
consultants, and consulted all share a
single stem, consult. This means that all
of them are related in meaning somehow and
that, for example, in a web search where
we search for consultant, we also want
the word consultants to be included in
those searches. This is what stemming can
do for us.
Let's do a short review
of what stemming is. There's an
area of linguistics called morphology,
and it studies the parts of words.
Words can be separated into their
individual components, and we call each
of these components a morpheme if it's a
minimal unit of meaning. Morphemes can be
of two types. They can be stems, which are
the core meaning-bearing units in the
word. For example, in unbearably,
the stem is bear, which
means to support something or to carry
something. By the way, remember that
in English roots and stems are roughly
equivalent. A word has a stem, and it
also has affixes, which carry
grammatical functions and are attached
to the stem. So in unbearably, bear is the stem, un is
a prefix that tells you that something
is the negative of what we're talking
about, able is a suffix that tells you
that you can do something, and y makes
the word an adjective or an adverb.
So why do we need to reduce words to
their stems sometimes? This is because
words that have the same stem are
related and we probably want them to be
classified together. So if we have a
search engine, for example, and we search
for a word like tacos, we probably want the
website to return documents that include
the word tacos like in number three,
three amazing tacos for you, but we also
want the website to return documents
that include the word taco, the stem of
tacos: taco sale tomorrow, best taco recipes. These two words are
related because they share a stem, and we
want the computer to be able to
identify this. So sometimes we need to
transform our text into its stems, that is, to
perform stemming, for the computer to see
these relationships. So again, without
performing stemming, the computer could
not see that taco and tacos refer to the
same object.
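To make the search example concrete, here is a minimal sketch in Python. It is not how a real search engine works: toy_stem is a made-up stand-in that only strips a plural s, and the three documents are the ones from the example above. All function names are hypothetical.

```python
import re

def toy_stem(token):
    """Hypothetical toy stemmer: strip one trailing plural 's' (not a real algorithm)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def exact_search(query, docs):
    """Return documents that contain the query word exactly as typed."""
    return [d for d in docs if query.lower() in tokenize(d)]

def stemmed_search(query, docs):
    """Return documents that share at least one stem with the query."""
    query_stems = {toy_stem(t) for t in tokenize(query)}
    return [d for d in docs if query_stems & {toy_stem(t) for t in tokenize(d)}]

documents = [
    "Three amazing tacos for you",
    "Taco sale tomorrow",
    "Best taco recipes",
]

print(exact_search("tacos", documents))    # only the first document
print(stemmed_search("tacos", documents))  # all three documents
```

Without stemming, only the document that literally contains tacos is returned. With stemming applied to both the query and the documents, the two taco documents are found as well.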
There are several algorithms to implement
stemming. A very common algorithm for
English is called Porter's algorithm.
In the lower left of
your screen you have the website where we're
going to demo the stemming, and the
actual implementation of the
algorithm, in case you want to take a
look at how it's actually implemented.
So what this
algorithm does is it uses regular
expressions to try to decompose words
and get to their stems. So if
it gets a word like caresses, for example,
it will transform the sses
into just ss. Caresses to caress. If it
gets something like ponies, it will
replace the ies with an i. Ponies to poni. If
it gets something like cats, it will
remove the final s. Cats to cat.
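This first step can be sketched as an ordered list of regular-expression rules. It is only a sketch: the rule table follows the published description of Porter's step 1a, in which ies becomes i, so ponies comes out as the truncated stem poni. The names STEP_1A_RULES and step_1a are made up for illustration.

```python
import re

# Ordered (pattern, replacement) rules; the first rule that matches wins.
STEP_1A_RULES = [
    (re.compile(r"sses$"), "ss"),  # caresses -> caress
    (re.compile(r"ies$"), "i"),    # ponies   -> poni
    (re.compile(r"ss$"), "ss"),    # caress   -> caress (left unchanged)
    (re.compile(r"s$"), ""),       # cats     -> cat
]

def step_1a(word):
    """Apply the first matching plural rule, or return the word unchanged."""
    for pattern, replacement in STEP_1A_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

for word in ["caresses", "ponies", "caress", "cats"]:
    print(word, "->", step_1a(word))
```

The order matters: the ss rule sits before the plain s rule so that a word like caress is not cut down to cares.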
It then performs another step, again
with regular expressions, where it removes
ing and ed, but only when what is left
over still contains a vowel, so a word like
sing is left alone. So it replaces the ing in walking
with nothing: it takes walking and makes
walk. It takes plastered and makes
plaster, replacing the ed with nothing.
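Here is a rough sketch of the ing and ed step. It assumes the condition from Porter's published rules, namely that the suffix is removed only when the remaining stem contains a vowel, and it simplifies "vowel" to the letters a, e, i, o, u. The follow-up repairs of the real algorithm, and its separate eed rule, are left out. The name step_1b is hypothetical.

```python
import re

# Simplifying assumption: only a, e, i, o, u count as vowels here.
VOWEL = re.compile(r"[aeiou]")

def step_1b(word):
    """Strip 'ing' or 'ed' when the remaining stem still contains a vowel."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Keep the word intact if stripping would leave no vowel (e.g. sing).
            return stem if VOWEL.search(stem) else word
    return word

for word in ["walking", "plastered", "sing", "bled"]:
    print(word, "->", step_1b(word))
```

The vowel check is what keeps short words such as sing or bled from being reduced to s or bl.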
In another step it takes common suffixes
and simplifies them. So, for example,
it replaces ational with ate,
relational to relate; izer with ize, digitizer to digitize;
and ator with ate, operator to operate. It takes some
other suffixes and removes them
entirely. For example, in revival it removes the al
so that the stem is reviv, adjustable
becomes adjust, and activate becomes activ.
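These later steps can be sketched as two small suffix tables, one that rewrites a suffix and one that deletes it. This is a sketch under assumptions: the real algorithm guards each rule with a "measure" condition on the stem, which is replaced here by a crude minimum stem length, and the tables contain only the suffixes mentioned in this example. All names are hypothetical.

```python
# Suffixes that are rewritten to a shorter form (only the ones from the example).
REWRITE_RULES = [("ational", "ate"), ("izer", "ize"), ("ator", "ate")]

# Suffixes that are removed entirely (only the ones from the example).
REMOVE_SUFFIXES = ["able", "ate", "al"]

# Assumption: a crude stand-in for Porter's measure condition on the stem.
MIN_STEM_LENGTH = 3

def rewrite_suffix(word):
    """Replace the first matching suffix with its shorter form."""
    for suffix, replacement in REWRITE_RULES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem + replacement
    return word

def remove_suffix(word):
    """Delete the first matching suffix, leaving a bare stem."""
    for suffix in REMOVE_SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word

for word in ["relational", "digitizer", "operator"]:
    print(word, "->", rewrite_suffix(word))
for word in ["revival", "adjustable", "activate"]:
    print(word, "->", remove_suffix(word))
```

Notice that the output of these steps does not have to be a real word: reviv and activ are just labels that related words end up sharing.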
So let me show you how the Porter
algorithm works. And
again, this is just one possible
algorithm. This is one of the websites
that we have on the lower left of the
slides, and we're going to write a simple
sentence about tacos, why not. Tacos are
simply amazing. Have you tried tacos? So
we're going to ask it to use the
Porter stemmer and we'll see what happens.
As you can see, tacos is stemmed into
taco. Are stays as are, so it is not
replaced by to be; this is because of the
specific implementation of the Porter stemmer.
Simply is changed to simpli, which relates
it to things like simplify,
simplification, simple and so forth.
Amazing is replaced with amaz, and this
is probably cutting off a little bit
too much, but amaz is the stem
for amazing, and you have other
words like amazement or amaze. Have you
tried: tried is simplified to tri, and
tacos is simplified to taco. So if you
do this with two documents, you will be able
to identify that both of them
contain the word taco even if sometimes the word
is taco and sometimes tacos. Again, the
Porter stemmer is just one algorithm to
implement stemming. Let's try another
English one: the Lancaster algorithm, for
example. As you can see here, we get
slightly different decisions. So ar for
are. We still have amaz and singular taco,
and now we have hav as the stem for
have, without the e, and this makes sense
because the same stem also appears in having,
for example. Tried is simplified to tri, and tacos to taco.
Another very common one is the
Snowball stemmer, and you can see that it
exists for other languages as well. This
is related to the Porter stemmer; as a matter of
fact, it's an evolution of the Porter stemmer,
so the results are identical, I
think. Taco ar simpli amaz
hav you tri taco. So again we can use
this to find further similarities across
text. If you want to look at the specific
implementation of algorithms like Porter,
I leave you with the URL on the
presentation. Here is the part of the
Porter algorithm where you can see that
some of the morphemes are transformed to
shorter forms.
So in summary, stemming transforms tokens
into their stems and this can be useful
when you conduct searches so that the
computer can identify that tacos and taco
are related.
This process is language-dependent, so
notice that I chose English stemmers,
and you do need a specific stemmer for
the language you're working with. This
process might be very complex. We
looked at a very simple example in
English, but imagine what a stemming
algorithm would look like in Arabic. It
would need to figure out what the root
consonants are and get rid of the vowels
in the middle. So this process is heavily
language-dependent, and if
you're working with a small language, there may
be no stemmer available for it. Thank
you for your time and I'll see you in
the chat forum and I'll see you next
week in more videos. Have a good evening,
afternoon, morning.
