We're gonna talk about text tokenization
and normalization. So every now and then
you're gonna get a document that is
absolutely perfect. It has no spelling
mistakes, all the words are where you
need them. And you can just use it as the
input for a machine learning algorithm.
What's gonna happen most of the time is
that you're gonna need to do some
pre-processing to the text. Maybe you're
gonna have to correct a few mistakes here and there
maybe there's some spelling variation
that is actually meaningful, for example
on Twitter you might find something like
what with one a versus whaaaaaaaaaat with ten a's
and you're gonna have to decide if those
two forms are the same or different for
the purposes of your algorithm. This
process is called normalization but
before we can normalize our text, we need
to figure out what the words are in our
text. This process is called tokenization.
So if you want to see your linguist
friends suffer, ask them the simple
question: what is a word? So all
scientific theories have constructs and
there's some constructs that are easier
to justify, and some are more difficult. The construct of a word is a very difficult one to define. It's simple to grasp
intuitively but finding a definition of
what exactly a word is is surprisingly
difficult. We don't have to go very far.
In languages like English, for example
there are forms that could be thought of as one word or two, like contractions: I've, I'd, gonna, finna, for example. In languages like
Spanish, and many Romance languages,
French, Portuguese, Italian, we have forms that
are called clitics. These small particles
can attach to the verb, or they can also
be detached from the verb as a separate
word. So sometimes they're a word and
sometimes they're not. Also, they can dance around the verb: sometimes they're at the end of the verb, sometimes at the beginning. So for example we have sentarse, to sit oneself down, with the clitic attached to the end of the verb, and we have se sentó, he sat himself down, where the clitic comes before the verb and is written separately. So this form is sometimes a word and sometimes it isn't, and it's
the same in other Romance languages. So
just defining what a word is is more
difficult than you think.
And there's other difficulties as well.
For example, expressions like better
suited or ill-defined. Are you gonna
count those as one word or two? You need to figure that out depending on the needs of your algorithm. For example, forms like
this with an uppercase T or a lowercase t
are you gonna count those as different
or the same? Forms like one nine six zero
as in 1960, are you gonna count that as
an English word? Or what are you gonna do
with favour versus favor, written with a u as they do in New Zealand or without one as they do in the US? How about foreign words like pozvoleno, a word in
Russian? Are you gonna count that as a
word in English in your collection of
English texts? Again it depends on the
needs of your algorithm, but because word
is such a difficult concept to define,
we're going to use other concepts. For
example, the concept of token. Now a token
is just a form, like an alphanumeric form
within your collection of documents. For
example, in the text cats are great My cat is the greatest. Cats cats cats! we have 11 different tokens: cats, are, great, my, cat, is, the, greatest, cats, cats, cats.
Each of them is a form, an alphanumeric
form within the text. We can also count
types, or unique tokens. That's the collection of distinct items that occur in the document; maybe we just want to check whether an item exists or not, and that's when we want to count the types. So for example, this document has eight
types: cats are great my cat is the
greatest. Notice that the word cats only appears as one unique token, as one type, even though it actually occurs four times in the document. Also
notice that cat and cats are different
types because they're not identical. They do share the same stem, though, which is why we have the
concept of lemmas. These are forms that
share the same stem within a language. So
for example the lemma cat is the stem
for cat in the singular and cats in the
plural. So one lemma, two tokens. The lemma be, from the verb to be, is the stem of are and is: two tokens, one lemma. Great is the stem of great and greatest: two tokens, one lemma.
My and the are lemmas as well.
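As a rough sketch of counting tokens and types, here's a few lines of Python (an illustration, not something from the lecture); it assumes a naive tokenizer that lowercases the text and keeps only alphanumeric runs, which already bakes in normalization decisions like case folding:

import re
from collections import Counter

text = "Cats are great. My cat is the greatest. Cats cats cats!"

# Naive tokenization: lowercase, keep only alphanumeric runs.
tokens = re.findall(r"[a-z0-9]+", text.lower())

print(len(tokens))              # 11 tokens
print(len(set(tokens)))         # 8 types (unique tokens)
print(Counter(tokens)["cats"])  # the type 'cats' occurs 4 times

Getting from tokens to lemmas (cat/cats to cat, is/are to be) would need a lexicon or a lemmatizer library rather than a few lines like these.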
So in English, for example, the way we get to our tokens would be just to split the text by its spaces: separate all the alphanumeric forms by the spaces they have in between them and ignore the punctuation, for example. We're going to do this not
just on a sentence or a string but on
massive collections of documents and we
call those collections of documents a
corpus. The plural, by the way, is corpora.
So a corpus is a collection of texts and
you're going to use a corpus to examine
the frequency of a word, whether a word
occurs in the corpus or not, and
ultimately you're gonna use the corpus
to try to prove some hypothesis, maybe a
linguistic hypothesis, maybe a hypothesis
within your data set.
There's many corpora for many languages
in the world. English has some of the
largest corpora of course. There's one called Switchboard, a corpus of phone conversations: people were asked to talk on the phone and were recorded while doing so, and this corpus has approximately 2.4 million tokens and some 20,000 unique tokens. All of the works of Shakespeare have approximately 884,000 tokens and about 31,000 unique tokens, or types. There's a really
big corpus that we're gonna study later
in the class called Google Ngrams, which has approximately 1 trillion tokens and about 13 million unique
tokens. As we said, in English, you could
just tokenize a sentence by splitting
the alphanumeric items by spaces, so
something like This is a sentence would
have four tokens: This, is, a, sentence. But
there are many languages where you don't have spaces. For example, Mandarin, in a sentence like wo bu hui shuo zhongwen, I don't speak Mandarin, or Japanese, in doitsugo wa hanasemasen, I can't speak German.
Thai, for example, also doesn't use spaces. So as you can see here, in the Japanese sentence doitsugo wa hanasemasen, there are three words: doitsugo (German), wa (the topic marker), and hanasemasen (cannot speak). But in regular Japanese writing, those would be written together; there would be no spaces in between the characters. So now you have to decide
what the tokens are. How are you going to
split the tokens in the text? We use an
algorithm called MaxMatch. MaxMatch
depends on a dictionary, and in natural
language processing when we say a
dictionary, we usually mean just a
document with the list of words in the
language. So you would have a list of
possible words in Mandarin, and then the
algorithm gets the sentence wo bu hui shuo zhongwen, I don't speak Mandarin.
The algorithm gets the first character, wo, which means I. It puts that character into a string, searches for it in the dictionary, and checks whether it's an item in the dictionary. It will find it, so check! It accepts that as part of a word, and we now have a maximal match of one. It will then take the
second character bu and then join it
into the potential maximal match. So we
have wo bu, I don't, and we search that in the dictionary, and we're gonna fail to find it. It's not gonna be there because it's not a word. Because this maximal match was not possible, the algorithm falls back to the last successful maximal match, which was just one character, wo, and it says: this is a word, check. It makes the boundary there and then continues on to the next character, and
in doing so, it will try to find the maximum number of characters that can fit into a word in the dictionary. This is how you would get the tokens from a language without spaces.
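Here is a minimal sketch of that greedy MaxMatch idea in Python; the romanized toy dictionary and the unsegmented string are just illustrative assumptions, since a real system would work over Chinese or Japanese characters with a full word list:

def max_match(text, dictionary):
    # Greedy left-to-right segmentation: at each position, take the
    # longest substring found in the dictionary; if nothing matches,
    # emit a single character and move on.
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]                    # fall back to one character
        for j in range(len(text), i, -1):  # try the longest span first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens

dictionary = {"wo", "bu", "hui", "shuo", "zhongwen"}
print(max_match("wobuhuishuozhongwen", dictionary))
# ['wo', 'bu', 'hui', 'shuo', 'zhongwen']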
Interestingly, splitting sentences is also not trivial. You would think that to separate sentences you only need to find the periods and that'll be it. But other things in English texts have periods and dots, for example, abbreviations like Dr. and etc., and so we need a more complex algorithm. We can use
something like a decision tree, for example, where we try to figure out whether after our candidate boundary there's a lot of line breaks or blank space, whether the final character of the string is a question mark or an exclamation mark, or whether the final element before the period is a known abbreviation like etc. or Dr. So depending on these
factors, a period is going to either
represent the end of a sentence or it
won't.
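A tiny rule-based version of that decision might look like the following Python sketch; the abbreviation list and the features checked (trailing whitespace, a following capital letter) are simplified assumptions, not a full decision tree:

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}  # toy list

def is_sentence_boundary(text, i):
    # Decide whether the '.', '?', or '!' at position i ends a sentence.
    ch = text[i]
    if ch in "?!":
        return True                      # ? and ! almost always end a sentence
    # The word right before the period, period included.
    prev_word = text[:i + 1].split()[-1].lower()
    if prev_word in ABBREVIATIONS:
        return False                     # known abbreviation: not a boundary
    rest = text[i + 1:]
    # End of text, or whitespace followed by an uppercase letter.
    return rest == "" or (rest[0].isspace() and rest.lstrip()[:1].isupper())

text = "Dr. Smith arrived. She sat down!"
print([i for i, c in enumerate(text)
       if c in ".?!" and is_sentence_boundary(text, i)])
# two boundaries: after 'arrived.' and after 'down!', but not after 'Dr.'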
And of course contemporary systems
take training inputs and use
probabilistic approaches to calculate
whether something is the end of a sentence or not, and we're gonna study
probabilistic approaches in a couple of
weeks.
Finally, let's talk about normalization. Again, you need to figure out whether something that looks like a mistake should actually be corrected or not, something like what versus whaaaaaat.
Let's say you want to do sentiment analysis, to decide whether there's excitement in a sentence or not. You probably want to keep these forms separate, because something with more a's is going to represent more emotion. On the other hand, if you are searching for just the occurrences of the word what, you probably want to normalize these and transform the whaaaaaat into what, because you want to count these two as the same item. How about
forms that are very similar, for example
favour in New Zealand versus favor in
the US, with or without a u?
Again, you're going to have to decide what you need in the algorithm. Maybe you want to study the differences between New Zealand English and American English, and so you're gonna have to keep these separate. Maybe you don't want that, or maybe you just
want to figure out how often people say
favor, and you're gonna need to transform
one into the other so that they're the
same item. There's forms with uppercase and lowercase, such as the, The, and THE, and you're gonna have to decide whether you're gonna transform all of them into, for example, the lowercase form.
There's forms like USA versus U.S.A., 
and you're gonna need to decide
whether they're the same or different. And again, you always need to think before you act. For example, if you took the, The,
and THE, and made all of them lowercase, maybe that
will solve the problem for the, but it
might affect other things. Words like US
with all uppercase versus us in
lowercase are indeed different words, so you
might be introducing some errors into
the system. So again normalization entails
decisions and entails thinking about
these decisions.
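To make those trade-offs concrete, here's a small Python sketch of one possible set of normalization decisions; the elongation collapsing, the spelling map, and the case-folding exceptions are illustrative choices rather than a standard recipe, and note that collapsing whaaaaaat to what throws away exactly the emotion signal mentioned earlier:

import re

SPELLING_MAP = {"favour": "favor"}   # fold regional variants into one form
KEEP_CASE = {"US", "USA", "U.S.A."}  # don't lowercase these: US is not us

def normalize_token(token):
    # Collapse elongations like 'whaaaaaat' (3 or more repeats become 1).
    token = re.sub(r"(.)\1{2,}", r"\1", token)
    # Case-fold, except where case is meaningful.
    if token not in KEEP_CASE:
        token = token.lower()
    # Map spelling variants to a single form.
    return SPELLING_MAP.get(token, token)

print(normalize_token("Whaaaaaat"))  # -> what
print(normalize_token("favour"))     # -> favor
print(normalize_token("THE"))        # -> the
print(normalize_token("US"))         # -> US (kept distinct from 'us')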
In summary, the pre-processing of text will entail decisions about what you consider a word and how you want to represent these words. In order to tokenize a text, we split it into words, and it's not always easy to decide what a word is or how to split things into words; some languages don't use spaces. Sometimes we want tokens, which is our specific term for what a word is within these text collections; sometimes we just want the unique tokens, and sometimes we want to see whether a word is present or absent. And again, normalizing a text involves decisions about which tokens are going to remain different. Thank you for your time.
