Hi. So far we have been working with the
assumption that we'll always work
with just one language. For example, we
have calculated our probabilities for
English text, we have calculated
probabilities for generating sentences
in Spanish, and so forth. However, most of
the world uses more than one language in
their daily life. Here we'll look at a
few examples of multilingualism, at a
specific example called a diglossia, and
what the consequences of this
multilingualism might be for our natural
language processing products.
Let's begin by normalizing this idea
that you can have more than one
language in your life. Language duality
is not a problem, but an innate ability.
It is an accurate reflection of a
duality that exists in all of us, a
duality between our mundane life and our
spiritual one. So first I want to begin
by asking you how many languages you use
in your daily life? If you use more than
one, when do you speak one language and
when do you use the other? Do you
sometimes mix them, so maybe start a
sentence in a language and end in a
different one? What do you think of
people who do that, and even if you only
speak one language yourself, have you
ever been in a place where they speak
more than one? Did you notice that some
languages were used for some things and
some languages were used for other
things?
So indeed most people in the world have
more than one language in their lives,
and sometimes one language is used
for some people and tasks, and there are
other languages for when you do something
else, and there are many combinations in
which this can happen. Multilingualism is
one. We're also going to be looking at
code-switching, Creole languages and diglossia.
So the most basic combination is
multilingualism where there's two
languages, and sometimes you're speaking
English and sometimes you're using
something else. For example, in
Switzerland there are four official
languages:
there's German, Italian, French, and Romansh,
which is another Romance language, and in
theory, you should be able to conduct your
entire life in Romansh if you want, and
then the government needs to provide
schools that can educate you in that
language. The government has to be able
to function in that language as well, and
to function in all languages equally.
In New Zealand, for example, there are two
official languages, English and the
indigenous language Maori. And in theory
there are a lot of provisions for them to
have equal footing in society, so that
you could go to school in Maori or in
English. All law needs to be translated
into both languages, needs to be issued
in both languages. This also happens in
Belgium where people speak either French
or Flemish, which is a variant of Dutch.
In practice, of course, these lofty ideals
where all the languages are equal are
very difficult to maintain. In New
Zealand, for example, Maori is legally on
equal footing but in practice there's
always struggles making sure that
there's enough funds to run Maori
language schools, to be sure that signs
in the public space are in both
languages, and so forth. So
multilingualism refers to having more
than one language in society. It doesn't
mean that they will be treated equally,
unfortunately, but at least in theory, it
means that you can conduct your life
in one or in the other.
There's something called code switching
where you might have more than one
language, but then they mix together in
your daily life. For example, in India there's
something called Hinglish, which is a
combination of English and Hindi. You can
sometimes have as in number one and in
number three, hungry kya, which is hunger
and like a question mark, like are you
hungry?
Here in the Snickers commercial, ab
hunger ko karo bye bye, grab a Snickers.
This means, roughly, now say bye bye to
hunger, grab a Snickers. You can have it
like in number two, think hatke, which means
think differently. So as you can see, you can
begin with one language, and end in a
different one, so that you're going in
and out between the two. And this is not
at all a chaotic or random process. It's
the complete opposite: it is a
very precise switching between them, and
as a matter of fact, it mostly happens at
the edge of grammatical phrases,
places where you would breathe to
continue. If you took intro to
linguistics, these are the constituents, so
code switching happens at the edge of
constituents. For example, in Spanglish
from New York: as Little Red Riding Hood
is walking along the forest, se encuentra con un lobo. She runs into a wolf. As
you can see, along the forest, se encuentra
con un lobo. There's very precise
places where you can do the switch
between them. You can also have code
switching between two dialects of the
same language, for example, standard
American English and African American
Vernacular English, as you can see here.
It happens at the same points, where you
would take a breath to keep going. So at
about two o'clock, they wasn't back. You
can see here that some parts of the
speech of this one person are in
standard American English (one
o'clock they weren't back), which are
marked in red, and some parts are in
African American Vernacular English (two
o'clock they wasn't back), which are in blue.
And this speaker goes back and forth between
standard American English and African
American Vernacular English, and
you can see it because they have
different grammatical markers. For
example, the words was and were in
standard American English have only one
form in African American Vernacular
English, was. And this is by no means a
defect of the language, if anything it is
something that many other languages do.
Swedish verbs for example work exactly
like this. There's only one form for the
verb. There's also something called zero
copula, or deleting the verb be, as in the
example below: we just walking around.
Standard American English would have a
form of be there, but in African American
Vernacular English, to be disappears. It's
deleted, but this happens in many languages.
Russian, for example, also has zero
copula in its present tense, and this
doesn't mean that it's defective in
any way. They're perfectly good
languages working in perfectly normal
ways. So as you can see, you can have two
languages or two variants of one
language, where you go back and forth
between them. This is code switching. So
far we have multilingualism, two
languages; code switching, two languages
going back and forth. There's a third
situation called Creole languages, and
before we talk about them, I'm going to
show you an example going from English
from England to different forms of
Creole English from Jamaica. "The north
wind and the sun were disputing which
was the stronger when a traveller came
along wrapped in a warm cloak." "The north
wind and the sun were disputing which
was the stronger
when a traveler came along wrapped in a
warm cloak." "Di naht breeze an di sun ah
awgue bout who stronga dan who wen wan travella cum wrap up inna
sweata." "Di Neat Win an di Son did a kos bout which wan a dem chrangga
wen dem si wan man a kom wel rap up ina wah sitn we luk leka wan winta kluok."
There we go. So as you can see,
there are slight changes in between them,
but if you compare the one called the
Basilect to the English from
Britain on the other side, they are
quite different. These types of languages
are born out of tragedy.
Enslaved people were usually grouped
together so that they wouldn't be able
to communicate and to plot to overthrow
the people who had enslaved them, so
they were placed in groups that were
linguistically varied so that they
couldn't communicate. This gave rise to a
situation where people used a mixture of
the colonial language (English, Spanish,
Dutch, Arabic) with elements from the
languages from Africa that the community
had. There's also a Creole language
in Hawaii, for example. So it was
English with elements from Tagalog and
from other languages of Asia. And so this
gave the situation where you have a kind
of mix between the colonial language and
other languages. And these are not wrong
or broken versions of English, for
example, far from it. They have a lot of
complexity in them, and as you can see
they are full languages. But people
usually go back and forth between the
variety closest to the English language, the
Acrolect, and the one that most
resembles the Creole, the Basilect. So
they go back and forth depending on the
circumstance, and of course the
circumstance is whether someone is from
inside of the community or outside of
the community, and what you want to talk
about. Sometimes if you want to talk with
your family about things from
your home, you're gonna talk in a more
Basilect kind of way. Sometimes if
you're in university, because the English
language is associated with power and
with education, you're gonna talk in a
more Acrolect kind of way, in a way that
most resembles the English from Britain.
And of course, this has to do with
imbalances in power, not because one
language is better than the other.
Talking about this, we have a final
situation, called diglossia. So
far we have multilingualism, which is two
languages; code-switching, which is two
languages or dialects of a language
where you go back and forth between them;
and Creole languages, where you have a
colonial language and a language that is
a mixture of the colonial language and
other languages, and you go back and
forth between them. Diglossia is a situation
where each language has a certain function
in society. So if you want to say very
formal things, you use Portuguese, like in
this sign from Cape Verde: the one in
the upper left says that entrance is
forbidden to people younger than 18 years old.
So if you want to say a very formal
thing, you use Portuguese, and if you want
to say a fun thing, like in XL flavor
and for parties my good friend XL, you
use a creole language. This is from Cape
Verde in western Africa, and Portuguese
has the function of being written and
formal, and the creole language has the
function of being informal and oriented
towards fun, family and so forth. Because
each language has its own function in
society, people essentially
speak one language for some things and
another language for different things.
We call this diglossia.
A very important example of diglossia in
the world is Arabic. So Arabic is made up
of many different languages. There's one
called Modern Standard Arabic which is
the one that's used in the writing of
Arabic, so if you have books in Arabic,
they will be written in this language.
However when people speak to one another,
they're going to use different languages
called colloquial, and they're different
throughout the Middle East. So for
example, the colloquial Arabic of Egypt
is different from the colloquial Arabic
of Morocco, different from the colloquial
Arabic of Yemen, and so different that
people from, for example, Dubai are gonna
have a lot of trouble understanding
people speaking Arabic from Morocco. They
write in Modern Standard Arabic, but each
of them speaks in Moroccan or Emirati
Arabic, and so forth. As a metaphor, I
always like to
imagine if, in Romance languages,
people wrote in Latin but then they
spoke to their families in Spanish and
French and Italian, so that all of our
books were written in Latin but then our
conversations were in Spanish. I
could understand French books because
they'd be in Latin, but I couldn't speak
to French speakers because I speak Spanish
and not French, for example. This is what happens
in the Arab world. For example in this
sentence
I love reading a lot, ana ohebbu alqaratu katheerin, you
have one form in Modern Standard Arabic
which is written in the characters as
you can see, and then you have spoken
forms for the dialects that are very
different from one another.
For example from Tunisia, nhebb naqra barsha, but in Egyptian, ana bahebb al'eraya
awi, but in Jordan, ana katiir bhebb alqira'a.
As you can see, these are very different
from one another, as different as French
and Spanish, for example. So diglossia
again would be a situation where
people write and say very formal things
in Modern Standard Arabic, and where they
speak about things with family and
friends in their colloquial.
What does that have to do with us?
We really have to think about what
kind of input people are providing and
what kind of output we need to
provide in societies that have diglossia,
for example, these kinds of situations
where you need to use more than one
language. This is an example of how
Tunisian works. For example, in writing
you have classical Arabic, you have
Modern Standard Arabic, and you also have
French on top of this, but when people
are, for example, watching
sitcoms on TV, when you get cartoons, when you
read the menus, they're in the
colloquial. So for example, elevated
Tunisian Arabic, but also code switching
between local Tunisian Arabic and Modern
Standard Arabic. So in this game going back and forth,
what kind of training set would you
collect? If Tunisian Arabic is not
written very often, how would you gather
data so that you can then calculate
transition probabilities, and you can
calculate n-grams? If only Modern Standard
Arabic is written, how are you gonna get
the data for colloquial Tunisian Arabic?
Will we have to go to Twitter, where
people write it informally? Will we have to
go to chats? Is there gonna be no writing,
so that you're just gonna have
to get the data from speech recognition?
If there's no writing how are you going
to make the speech recognizer work?
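To make the data question concrete, here is a minimal sketch in Python of estimating bigram transition probabilities separately for each variety. The corpus is only the four romanized example sentences quoted earlier in this lecture, copied as they were romanized there, and the names CORPUS, bigram_probs, and sentence_prob are made up for illustration. It is a toy under those assumptions, not a real training pipeline.

```python
from collections import Counter, defaultdict

# Toy corpus (assumption): the lecture's example sentences, one per variety.
# A real training set would need far more text for each variety.
CORPUS = {
    "msa": ["ana ohebbu alqaratu katheerin"],
    "tunisian": ["nhebb naqra barsha"],
    "egyptian": ["ana bahebb al'eraya awi"],
    "jordanian": ["ana katiir bhebb alqira'a"],
}


def bigram_probs(sentences):
    """Maximum-likelihood bigram transition probabilities P(next | prev)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for prev, followers in counts.items()
    }


def sentence_prob(model, sentence):
    """Probability of a sentence under a bigram model, with no smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob


# One model per variety, each trained only on its own sentences.
models = {variety: bigram_probs(sents) for variety, sents in CORPUS.items()}

for variety, model in models.items():
    print(variety, sentence_prob(model, "nhebb naqra barsha"))
```

With this little text, a model trained on one variety gives zero probability to a sentence from another variety. That is the sparsity problem behind the questions above: if the colloquial is rarely written, there is very little to count.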
There's many challenges that societies
with more than one language present to
us for natural language processing.
For example, what is Siri going to
say in Tunisia? Which of all of these
is Siri going to choose? We would
have to pay attention to the context to
try to figure out if it's more
appropriate to answer in Modern Standard
Arabic or colloquial Tunisian and so forth.
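One way to picture that decision is the sketch below. It guesses the variety of an utterance from a tiny word list and answers formally in Modern Standard Arabic or informally in the user's own variety. The word lists come only from the example sentences in this lecture, and the names LEXICON, guess_variety, and choose_output_variety, along with the formal_context flag, are hypothetical. A real assistant would need a trained classifier and much richer context.

```python
# Hypothetical word lists (assumption): words that appear in only one of the
# lecture's romanized example sentences.
LEXICON = {
    "msa": {"ohebbu", "katheerin"},
    "tunisian": {"nhebb", "naqra", "barsha"},
    "egyptian": {"bahebb", "awi"},
    "jordanian": {"bhebb", "katiir"},
}


def guess_variety(utterance, default="msa"):
    """Pick the variety whose word list matches the most tokens."""
    tokens = utterance.lower().split()
    scores = {
        variety: sum(token in words for token in tokens)
        for variety, words in LEXICON.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default when no listed word was seen at all.
    return best if scores[best] > 0 else default


def choose_output_variety(utterance, formal_context):
    """Formal contexts get Modern Standard Arabic; otherwise mirror the user."""
    if formal_context:
        return "msa"
    return guess_variety(utterance)


print(choose_output_variety("nhebb naqra barsha", formal_context=False))
print(choose_output_variety("nhebb naqra barsha", formal_context=True))
```

The design choice here follows the diglossia idea directly: the function of the situation decides the variety first, and the user's own speech only matters in informal contexts.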
That's a quick summary. Most societies on
Earth use more than one language. It
is a historical accident that
English-speaking societies are more
monolingual than other societies on the
planet. Well, it has historical reasons,
but that's out of the scope of the class.
For societies that have more than one
language, there's many combinations in which it
can happen. You can have multilingualism,
where you have languages that are well
separated; you can have code switching,
where you go back and forth; you can have
pidgins and creoles, where you go back
and forth between high prestige
varieties and varieties that are
more for use in the home; and you can
have diglossia, where you have very clear
edges: this
language does this, this language does
that, and so forth. When you make NLP
software, you might need to get
multilingual training sets just for the
software of that one society. You may
need to make plans for how to handle
trilingual input and how to determine what the best
output would be in each circumstance.
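As one possible first step for handling mixed input, the sketch below tags each word of the Spanglish example from this lecture with a language and reports where the switch happens. The Spanish word list is a hand-made assumption that covers only this one sentence, and tag_tokens and switch_points are illustrative names. Real systems would use a trained word-level language identifier.

```python
# Hand-made word list (assumption): just the Spanish words in the example.
SPANISH = {"se", "encuentra", "con", "un", "lobo"}


def tag_tokens(sentence):
    """Label each token 'es' if it is in the Spanish list, otherwise 'en'."""
    tokens = sentence.replace(",", "").split()
    return [(tok, "es" if tok.lower() in SPANISH else "en") for tok in tokens]


def switch_points(tagged):
    """Indices of tokens whose language differs from the previous token."""
    return [
        i for i in range(1, len(tagged))
        if tagged[i][1] != tagged[i - 1][1]
    ]


sentence = (
    "Little Red Riding Hood is walking along the forest, "
    "se encuentra con un lobo"
)
tagged = tag_tokens(sentence)
points = switch_points(tagged)

for i in points:
    print("switch before:", tagged[i][0])
```

In this sentence the only switch falls right after "along the forest", at the start of a new phrase, which matches the point made earlier that code switching tends to happen at the edges of constituents.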
