Hello and welcome to Accelerated
Computational Linguistics at Dartmouth
College. My name is Rolando and I want to
thank you for your interest in this
class. Here we're going to use the tools
of computer science to model human
language and we're gonna be looking at
two questions. The first one is: Why is human
language such an interesting challenge
to computer science? Why do we have an
entire class devoted to it?
Human language has many layers of
complexity that we will study during this
class. The second question is: How can we
use our tools to model human language so
that we can have software that for
example transcribes what we say to a
mobile phone or finds a document when
we type a query into a search engine. The
first question, then, is: why is human language
such a challenge for computers? There are many answers, but the first one is very
obvious: there are many human languages.
There are about 7,000 of them, and they
all have many differences. Some
of them are very obvious, like having
different words or different writing
systems. Arabic for example is written
from right to left and Chinese and
Japanese are written without spaces
between the words. Beyond those
superficial differences there's
differences in the way languages
configure their words. For example, in
English a sentence like I Love New York
is what we would normally say. However
other languages have other orders. For
example, in Japanese the correct way to
say the sentence would be I New York
love. In other languages like Welsh the
verb would go first and the correct
sentence would be Love I New York. In
languages like Hixkaryána from Brazil the
correct order would be New York I love.
Even in languages that we might be
familiar with, such as English, there's
many challenges awaiting us.
For example, there are sentences that can
have two very different meanings. In the
sentence "the chicken is ready to eat"
maybe it's the food or maybe there's a
chicken that is waiting for the food.
Some sentences have two meanings; on the other hand,
there are sentences that are perfectly
correct but have no meaning at all. For
example, "colorless green ideas sleep
furiously". The sentence is perfect
English and yet you cannot imagine what
it could mean. We are going to study many ways in which language poses a challenge
but we're also going to study how people
have tackled these challenges and we're
going to focus on the two main streams
of solutions that people have tried: one
is rule-based approaches, and the other
is probabilistic and
machine-learning-based approaches. In
rule-based approaches we're gonna study
structures like automata and finite
state machines. We're gonna study regular
expressions. For example, a regular
expression can find email addresses in text.
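The specific regular expression shown in the lecture isn't reproduced in the transcript, so here is a hedged sketch of what a simple email-matching pattern might look like in Python (the pattern and sample text are invented for illustration; real email validation is considerably more involved):

```python
import re

# A deliberately simplified email pattern: a local part, "@", then a
# dotted domain. This is an illustration, not a full validator.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

text = "Contact us at info@example.com or support@cs.dartmouth.edu."
print(EMAIL_RE.findall(text))
# → ['info@example.com', 'support@cs.dartmouth.edu']
```

In practice you would use a more robust pattern or a dedicated library, but this is enough to show how a regular expression picks substrings out of raw text.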
We're going to study parsing rules and
how computers can understand the
structure of a human sentence. On the
other hand, we're also going to study
probabilistic approaches. For example, in
the sentence "we finish each other's ___", what
you would usually say would probably be
something like "sentences". It would be
very unusual for someone to say "we
finish each other's sandwiches". You
have a certain intuition in your mind
about which of the two would be more
likely to occur. We can feed these
statistical properties into the computer
to have it generate new text, for example, or
analyze the properties of
existing text. For example, "how are you" in
English is usually translated as "¿cómo
estás?" in Spanish. Not always, but very
often. In the class we're gonna start
with text processing and normalization;
regular expressions and how to get the
text to a shape where the computer can
operate on it. On week 2 we're going to
study rule-based approaches, automata,
finite state transducers, which are used
in applications such as speech
recognition, and we're going to study the
types of rules that
languages have: why some of them can be
modeled through rules and some
are better modeled using machine
learning. On week 3 we're gonna work with
unsupervised learning. We're gonna study
sentiment analysis, clustering of
documents, and a model called word2vec, which tries to capture the
meaning of a word based on its
neighboring words. On week 4 we're gonna
study n-grams, which model which words
follow one another, and we're gonna look
at spell checking. On week 5 we're gonna
start our machine learning exploration.
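Before diving into machine learning, the n-gram idea mentioned above for week 4 can be sketched as simple bigram counting; the toy corpus below is invented for illustration:

```python
from collections import Counter

# Toy corpus, invented for illustration. In a real system the counts
# would come from millions of sentences.
corpus = ("we finish each other's sentences . "
          "we finish each other's sandwiches . "
          "we finish early").split()

# Count adjacent word pairs (bigrams).
bigrams = Counter(zip(corpus, corpus[1:]))

def most_likely_next(word):
    # Return the most frequent word that follows `word` in the corpus.
    followers = {nxt: n for (w, nxt), n in bigrams.items() if w == word}
    return max(followers, key=followers.get)

print(most_likely_next("finish"))  # "each" (twice) beats "early" (once)
```

A real language model would normalize these counts into probabilities and handle unseen words, but the intuition is the same.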
We're gonna study support vector
machines which can help us classify
structures. For example, they can tell us which
dialect of a language we're listening to,
what language we're
reading in a document, and parts of
speech: whether something is a noun, a
verb or a conjunction. On week 6 we're
gonna study different neural network
architectures, how we can use them to
process language, and we're gonna take a quick look at
machine translation. On week 7 we're
going to study parsing: How to understand
the structure of human sentences and
understand that when you ask the
computer to "Play The Beatles", one word
is the action and the other is the
thing to be played. On week
8 we're gonna study information
extraction: how the computer can get
meaning from documents, so that it can
understand that if you ask for a "restaurant
near me", those words have a precise
mathematical meaning. On week 9 we're
gonna study speech recognition, both in
traditional systems and in the more
contemporary end-to-end systems, which are
based on machine learning, and we're
gonna study how speech recognition works
in languages other than English and
languages with low resources. On week 10
we're gonna focus on our projects and
we're going to discuss where to go from
here. And you can call me Rolando. I am
from Costa Rica. I did my PhD in Tucson,
Arizona which was fairly warm.
After that I worked in New Zealand. I
worked with indigenous languages. I've
worked with them in the Americas: in
Mexico, Bolivia, Costa Rica, and also in
Polynesia. We're gonna have plenty of
examples from a language called Cook Islands
Māori. I work with
natural language processing
in indigenous languages, particularly smaller languages, and I'm
very happy we're going to be doing this
together.
Welcome to Accelerated Computational
Linguistics and I'm looking forward to
working with all of you.
