OK, so what I want to talk about today is
language change, and I think it’s kind
of one of the basic problems in linguistics.
I mean it’s something that we’ve been
working on as researchers since before linguistics,
right, so philologists have been working on
language change, you have dialectologists
and historical linguists since before the
20th Century working on language change. It’s
really a kind of fundamental and basic area
of linguistics. I mean I’m not sure if this
is true but my hunch would be that it’s
probably the area that’s had more research
on it in the history of linguistics for the
longest period of time. And so you have dialectologists,
you have sociolinguists, you have historical
linguists and typologists, philologists, these
are all really old fields and they’ve been
devoting a lot of time to this for centuries.
But I think at the same time we actually have
a lot of mysteries about language variation
and change. I don’t think we really know
kind of what drives language change in general.
I don’t think we even know kind of what
is the full potential for language change,
like what constrains language change. I think
these are big questions that we don’t know
because we haven’t really had the right
amount of data to do it yet and this is sort
of the big step that’s happening right now
in the last few years, in the last decade,
with increased data we can start looking at
these problems in a new way and really trying
to understand what’s happening, what’s
driving language change, what constrains it,
what are its limits.
I think probably more than anything else,
I think what we know the least about is the
lexicon, the vocabulary of a language. We
know a lot more about, say, how a vowel
system changes. We have things like the Great
Vowel Shift or the Northern Cities Shift,
and there what you have is vowels moving
around together, so we have a system of vowels.
But there’s about forty or fifty vowels
in English, let’s say, whereas we have thousands,
millions of words, so that’s a much more
complicated system that I don’t think we
actually know very much about yet. We have
dictionaries, we’ve done a lot of studies
of individual words but what we don’t kind
of know is how the lexicon as a system changes
over time.
And that’s really because, like I said,
the lexicon is so complex. I mean it’s hundreds
of thousands of words but you can think about
them all being kind of inter-related. They’re
all sort of defined in terms of each other.
On top of that they’re highly polysemous,
so most words have multiple meanings. Pretty
much all words have multiple meanings and
they’re always shifting and this is something
I really want to show today that they shift
really quickly, like in days, not in decades.
But then the other big problem, probably the
biggest one, is that they’re just incredibly
rare. So most words hardly ever occur and
that’s why studying the lexicon is such
a difficult thing to do. This is a very famous
picture, this is from my data but this is
a well-known pattern and this is just the
top 50 words in a sample of tweets. This is
about a billion words of data. I’ll come
back to the corpus that I’m using in a second,
but this is from 2013, these are American
tweets, about a billion total words, about
a hundred million tweets here and this just
shows the frequency of the 50 most common
words in this corpus. So up in the top corner
you have ‘I’ which is the most common
word on Twitter, which isn’t usually the
case. Usually it’s ‘the’ but in this
case ‘I’ is the most common word. Then
you have this big drop down to ‘the’ and
then you have this pretty kind of gradual
fall now and the 50th word is ‘one’. OK?
If we zoom this out a bit, now this is the
top 500 words and you can see what happens,
right, that most of these words are very rare.
You have a few words that account for most
of the frequency in a corpus and you have
all these other words that are quite uncommon.
And then here’s the top 50,000 words. What
you can see is that a few words make up most
of it and then you just have this long tail
of really uncommon words. They’re uncommon
words but these are like fairly common words,
these are all words that we know. So these
are just three of the words in the last 50
of the 50,000 – ‘conditional’, ‘illustrations’,
‘converge’. I mean these are common words
that we all have in our mental lexicon and
yet in this corpus they are at the bottom,
in the last 50 of the 50,000 words. They
all look very rare, so they only occur in
this corpus once per 5 million words.
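The rank-frequency picture described above can be sketched in a few lines. This is only an illustration on a toy sentence, not the corpus pipeline; the function names `rank_frequency` and `coverage` and the whitespace tokenisation are assumptions made for the example.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return [(word, count)] sorted from most to least frequent."""
    return Counter(tokens).most_common()

def coverage(ranked, top_k):
    """Share of all tokens accounted for by the top_k most frequent words."""
    total = sum(count for _, count in ranked)
    return sum(count for _, count in ranked[:top_k]) / total

# Toy input, invented for illustration only.
tokens = "i think i saw the library and i saw the moon".split()
ranked = rank_frequency(tokens)

print(ranked[0])            # the most frequent word and its count
print(coverage(ranked, 3))  # share of tokens covered by the top 3 words
```

On real data the same two functions would show the long tail: a handful of ranks covering most tokens, and tens of thousands of ranks each covering almost none.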
I think I really should put into perspective
how much data you need to study these words,
like if you had the British National Corpus
which was the standard for 10 years, that’s
a hundred million words and you get, you know,
20 occurrences of these words. Well 20 occurrences
of these words, you can’t really see how
they change over time, right? You can’t
see how, I mean they all have different meanings
too so you couldn’t like track all these
different meanings of these words, how they’re
changing. You couldn’t track them like over
space or over social groups and that’s the
problem we have when you don’t have much
data, right? When you have a hundred million
words you can’t look at most words in the
lexicon and that’s something we can do now
with these multi-billion word corpora.
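The arithmetic behind that comparison is simple enough to write down. The sketch below uses only the figures mentioned in the talk (once per 5 million words, a 100 million word corpus, a 10 billion word corpus); the function name `expected_occurrences` is made up for the example.

```python
def expected_occurrences(words_per_occurrence, corpus_size_words):
    """Expected number of tokens of a word that occurs once every
    `words_per_occurrence` words, in a corpus of the given size."""
    return corpus_size_words / words_per_occurrence

rate = 5_000_000                 # once per 5 million words
bnc_size = 100_000_000           # about a hundred million words
twitter_size = 10_000_000_000    # about 10 billion words

print(expected_occurrences(rate, bnc_size))      # 20.0
print(expected_occurrences(rate, twitter_size))  # 2000.0
```

Twenty tokens cannot be split by day, place and meaning; a couple of thousand at least makes that kind of breakdown conceivable.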
So that’s what I want to argue today: it’s
only really become possible to analyse lexical
variation and change on a systemic level, so as
a system not just as individual words but
as a full system over the last decade with
the rise of social media. Social media’s
not the most interesting thing in the world
but it has really changed how we can do linguistics.
So what it means is that we can now do these
really large studies in ways that we couldn’t
come close to doing 5, 10 years ago. And the
big advantage I think more than anywhere else
where you feel this is when you’re looking
at the lexicon. The lexicon is this thing
that’s really hard to study and what social
media lets us do is actually observe lexical
variation and change on a large scale. It also
lets us look across time in really fine detail,
as I’ll show you here in a second. We can
track change over days, not over decades or
over centuries, and it lets you look across
society in really fine detail. I’ll show
you that too, particularly with mapping.
So that’s what I want to look at today and
I want to focus in particular on a corpus
of about a billion tweets, or about 10 billion
words. This is American English Twitter data
from 2013 to 2014. Mostly I’ll be looking
at one corpus; at the end I’ll kind of shift
a little bit but it will still be from this
general period. For every tweet what we know
is exactly when it was posted, down to the
second, so really fine-grained time information,
so we can track change at a really fine
grain, and we also have the longitude
and latitude of the person when they posted
the tweet and that means we can really track
fine-grained variation across space. What I
want to try and do is take advantage of that
here in this talk and look at how we can really
track change over time and change over space
in a way that you couldn’t do a few years
ago.
OK, so I’m going to show you quite a few
of these graphs. This one here is just the
frequency of the word ‘I’ over time. So
this is the most common word in the corpus.
What you can see here along the horizontal
axis is just the time line, right? So it
starts here in October 2013, that’s this
point here, and it runs till November 2014.
So October 2013 to November 2014 and there’s
a few little months there marked. On the vertical
axis you have the frequency, OK, so that
tells you how frequent the word was in the
corpus for that day. So basically what I’ve
done here is taken this 10 billion words and
split it up into about 400 mini corpora, one
corpus per day, OK, and what you can see there
is each of those spikes is basically a day.
So that’s what the relative frequency of the
word was on that day and so you can see some
things here, right? I mean the most obvious
thing right away is that the word ‘I’
is falling in frequency. So ‘I’ became
much less common over the course of 2014,
probably related to just Twitter becoming
a bit more formal over time. What you can
also see here is a weekly pattern, so that’s
these spikes that you get every seven days
or so and that’s showing the day of the
week where ‘I’ is least common and that’s
on Sunday. People tend not to use ‘I’.
[laughing]
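The "one corpus per day" idea can be sketched as follows. This is a minimal illustration, assuming tweets arrive as `(day, text)` pairs and that splitting on whitespace is good enough; neither the data layout nor the function name `daily_relative_frequency` comes from the talk.

```python
from collections import Counter

def daily_relative_frequency(tweets, target):
    """Return {day: occurrences of `target` per million words on that day}.

    `tweets` is an iterable of (day, text) pairs (an assumed layout).
    """
    totals = Counter()  # total number of words posted on each day
    hits = Counter()    # occurrences of the target word on each day
    for day, text in tweets:
        tokens = text.lower().split()
        totals[day] += len(tokens)
        hits[day] += tokens.count(target)
    return {day: 1_000_000 * hits[day] / totals[day]
            for day in sorted(totals) if totals[day] > 0}

# Toy tweets, invented for illustration only.
tweets = [
    ("2013-10-07", "i think i need coffee"),
    ("2013-10-07", "the library is closed"),
    ("2013-10-08", "i love the weekend"),
]
series = daily_relative_frequency(tweets, "i")
print(series)
```

Dividing by the day's total word count is what makes days comparable: a busy day and a quiet day end up on the same per-million scale.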
So that is your basic pattern, right, so a
pretty clear fall over time in the use of ‘I’.
It’s not uncommon, I mean I’ll show you
here the second most common word and you’ll
see it’s still less common than the low
spot for ‘I’ and that’s ‘the’. So
‘the’ you don’t see much of a pattern
here at all. You actually do see a bit of
a weekly pattern; I think it’s actually
the same if I recall which is that ‘the’
tends to be used less on the weekends, but
basically this one’s flat, right? There’s
not much going on here. You might see some
small changes but they’re probably just
kind of random, there’s not much of interest
here, but this is sort of I guess what you’d
expect, right, things to be more or less flat.
And then you have some words that are rising.
So this is the word ‘from’ and it’s on
the rise. It’s not quite as dramatic as
the fall of ‘I’ and there’s obviously not
quite as clear a weekly cycle here but you can see
it rising over time and I think this sort
of contrasts with ‘I’. ‘I’ is a pronoun,
more associated with informal language and
prepositions are more associated with more
formal language, so I think these balance
each other out to an extent.
Now you might wonder if these are real at
all. I have just a couple here, just sort
of proof of concept. So this is ‘Christmas’
and this shows a pretty obvious pattern, right?
This massive spike at Christmas time; I think
this is probably Thanksgiving, so just kind
of Black Friday type stuff, right before it
happens. You can see it’s starting to take
off over here again, right? Here is ‘moon’.
So what you see there are the spikes of full
moons and that spike there was a lunar eclipse
that happened in 2014. I think an interesting
thing here you see is that actually after
the eclipse you then start to get more discussion
of ‘moon’ on these full moons. You get
a bit of a jump, right, like these are considerably
higher than these. So people started to care
more about the moon after the eclipse.
And this is what I often show. This is my
favourite of all because I think it really
does tell a good story. So this is ‘library’.
You can see a weekly pattern here. The weekly
pattern is not surprising, right, libraries
get talked about less on the weekends, and
then these three big kind of troughs you see
here, that’s Thanksgiving, that’s Christmas
and that’s summer holidays. So these are
students. And then you can also see these
spikes, these two spikes in particular. This
is right before final exams, the two sets
of final exams in most American universities.
You’ll see a little one here, I think that’s
spring break with mid-terms. And that’s
just showing you that these time series are
very meaningful. They’re certainly not always
this obvious but these time series hold a lot
of meaning in them and that’s what we’re
going to kind of use here to try to see if
we can understand change in the lexicon better.
So the idea here is that we can make a time
series for every word in this corpus. I showed
you 50,000 words a second ago, I’m going
to keep showing you in these big graphs 50,000
words but actually this is based on the top
100,000 words. These are all the words
that occur at least 500 times
in this 10 billion word corpus. What I want
to do is for each of these words I make two
measurements. The first measurement is how
frequent the word was at the
start of this period of time. So at the end
of 2013, how common was the word? In particular
we’re going to try and find words here that
are very uncommon at the end of 2013, at the start
of the period of time represented by the corpus.
So we can measure that, right, like we can
see that this word is occurring about, the
word ‘library’ is occurring about 100
times per million words at the beginning of
the corpus. The next thing we’re going to
measure is how consistently it rose over time,
or how consistently it fell. And so a word
like ‘I’ fell a lot and a word like ‘from’
showed a big rise. So these are the two measurements
we’re going to make. So for every word in
the corpus we take its time series and we
make these two measures – where does the
time series start and how consistently does
it get to wherever it’s ending, OK? Based
on those two pieces of information, what we
can start to do is we can visualise the entire
lexicon of this corpus, from this corpus,
right? All 50,000 or 100,000 words at once.
So we do this by plotting every word based
on these two measurements – how infrequent
it was at the start of time and how much it
rose or fell over time. So basically what
we’re going to do is take this – this
is 50,000 of those time series all overlaid
on top of each other – and from this information
which doesn’t show much of a pattern here,
we want to pull out a picture of the whole
lexicon, and so this is what it looks like.
So this is the 50,000 words all plotted together
and on the horizontal axis here we have
how much it rose or fell over time. So if
it’s on this end it means it showed quite
a fall over time. And so for example you won’t
be able to see it but here’s ‘I’, right
here. So ‘I’ fell a lot, that’s why
it’s far on the left side. Over here we
have a word like ‘from’, which is farther
on the right side because it rose. ‘The’
is somewhere here in the middle because it’s
pretty flat. So that’s like a full picture
of the lexicon. There’s lots of other ways
one could visualise the lexicon but this is
one way of doing it, right, and so I think
what’s exciting here is that we have all
50,000 words plotted together and we can start
to explore this lexicon in a really principled
way.
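The two measurements behind that picture, where a word's time series starts and how consistently it rises or falls, could be computed along these lines. This is a sketch under stated assumptions: the talk only says "correlation", so the Spearman rank correlation between day number and frequency used here is an assumption, as is averaging the first few days to get a starting frequency. The names `ranks`, `pearson` and `measure` are made up, and the series is assumed to be non-empty.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a perfectly flat series has no trend
    return cov / (vx * vy) ** 0.5

def measure(series, start_days=3):
    """Return (starting frequency, trend) for one word's daily series.

    The trend is the rank correlation between day number and frequency:
    close to +1 for a steady rise, close to -1 for a steady fall.
    """
    head = series[:start_days]
    start = sum(head) / len(head)
    trend = pearson(ranks(list(range(len(series)))), ranks(series))
    return start, trend

# Invented per-million frequencies, for illustration only.
rising = [0.1, 0.2, 0.4, 0.9, 1.5, 2.0]
print(measure(rising))        # low start, trend close to +1
print(measure(rising[::-1]))  # high start, trend close to -1
```

Plotting every word at its (trend, starting frequency) position would give a scatter of the kind described, with rare but steadily rising words in one corner.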
Now particularly the words I’m going to
be interested in today, and I’ll kind of
tell you what they are in a second, are right
down here in this corner. So out of all these
50,000 words – actually, like I said, I’m
only plotting 50,000 but actually there was
100,000 we are looking through, 97,000 – I’m
curious about these ones right down here in the
corner, that’s the research that we’ve
been doing for a while now. The reason I’m
interested in these words, right, is because
they are words that start off incredibly rare,
they’re really low, at the end of 2013 they
were hardly in use at all, but over the course
of the year they really rose and that kind
of little quadrant, that little corner of
this picture, is what we want to zoom in on
and we can only do this because we’ve made
these measurements of all 50,000, of all 100,000
words, otherwise we wouldn’t know where
they are. So we can zoom in on this. So this
is basically all the words that are rising
from infrequent to frequent and we’re going
to zoom down in on this corner. OK? This is
the very bottom corner. This is what, I don’t
know, a couple of hundred words out of 50,000,
right. These are words that are all, like
I said, incredibly uncommon at the start of
the period and that really took off over the
course of the year. You get different types
of words in here. So you get words in here
that are by no means new, words like ‘oppressed’
or ‘welder’, ‘supplemental’, ‘concierge’.
You get some here that are interesting too
like ‘sexism’ is here and ‘feminism’
is in here somewhere. These happened to take
off over this period. It’s not that they
aren’t being used at all, they’re common
words, right, but they weren’t very common
on Twitter until this year. You also get some
proper nouns, so you see ‘Palestinian’
is here, you see ‘Lewinsky’ is there and
there’s a few others kicking around. There’s
‘Lyft’ for example, the ride sharing company.
But you also get some words in here that are
not in any dictionary, they’re not proper
nouns. These are words that were just super-uncommon
and they’re not in the dictionary, but they
just took off over the course of this year.
So words like ‘xans’ or ‘tfw’ – that
feel when – or ‘lordt’.
So that’s the idea. What we want to do is
pull out these words and we call them ‘emerging
words’. So these are words that were hardly
being used at all, that aren’t in any dictionary,
some of them have got into the dictionary
since – cosplay is one of them that is now
in the dictionary but wasn’t back then.
And these are just these very specifically
defined words. It feels like a bit of a rag-tag
set. I’ll show it to you in a second, but
I didn’t pick these words, right, we defined
the properties of these words and then we
extracted them. So this is an unbiased analysis
here and the real nice thing about this is
that we can pull out 54 of these words that
meet these criteria. That means we’re looking
for things that are really infrequent, less
than once per million words at the end of
2013, that show a real clear rise, with this measure
of correlation over .7. We exclude all the
proper nouns and all of these existing words
that are in the dictionary like ‘feminism’
and ‘forensic’, which were rising at the
time too. We also only take one word per lemma,
so for example ‘xans’ and ‘xan’, which
are both short for Xanax, both meet these criteria
but we just take the most common one because
we want to keep things balanced, so we just
took ‘xans’ which was more common and
that leaves us with these 54 words and what
that lets us do then is actually make generalisations
about this process of lexical emergence. If
we just hand-picked a couple, you know, how
could we do that? But this is really saying
in this corpus, in this lexicon from this
corpus, there’s a small set of words that
are doing something really interesting and
let’s see what’s happening with them and
let’s see what we can learn about this process
of new words by looking at these forms.
So here they are. I’m not going to spend
loads of time going over these but let me
run through a couple of them. You get ‘baeless’
and ‘baeritto’, you also see we get ‘unbae’
in a second. These are all formed from the term
‘bae’ which you might not know. It was
a new word maybe two or three years before
this data in 2013. It means something like
your significant other. So something like
‘baeless’ means to be single, without
your significant other. We get these two forms,
‘boolin’ and ‘brazy’. These are kind
of interesting; they’re both blends from
taking the words ‘cooling’ and ‘crazy’
that begin with a ‘c’ and replacing it
with a ‘b’ and the idea here is that this
is done by the Bloods street gang to avoid
using words that begin with ‘c’, which
is associated with the Crips street gang.
And so you get this blend of ‘blood’ with
‘cooling’ gives you ‘booling’, or
‘blood’ with ‘crazy’ gives you ‘brazy’.
You get quite a few words in here that are
associated with technology, computer communication
and stuff like that. So you get here ‘faved’
which is to favourite something on social
media, you also get ‘candids’ and ‘celfie’,
these are types of pictures. And you also
get a number of words you’ll see as we go
through this that are kind of related to anime
culture and to fandom, so you get ‘cosplay’
and later on we’ll get ‘waifu’ and ‘senpai’.
Here’s some more. So ‘(on) fleek’ was
a real one that really took off and you’ll
see just how we were really able to catch
‘(on) fleek’ as it was rising. Most of
these were, you know, it’s not quite as
clear as this one. This one really took off
and it’s a nice case study I’ll show you
in a second. You get words like ‘gainz’
and ‘goalz’. Now these are words that
exist, right, ‘goals’ and ‘gains’
as normally spelt, but these are unique orthographic
forms, so we keep them in our data set. But
they’re doing something kind of interesting,
both of these are. They’re not the standard
meaning of ‘gains’ and ‘goals’; you
can’t just use this wherever you want. You
use this spelling with the ‘z’ when you’re
trying to indicate a new meaning of these
two words. So ‘gainz’ is used mostly by
young men who are trying to beef up by eating
and working out, that’s what ‘gainz’
means almost all the time in this context.
‘Goalz’ is almost always referring to
someone’s life goals, OK? You get these
three forms here, ‘litt’, ‘litty’
and ‘lituation’. They’re from ‘lit’
itself with one ‘t’, it’s very popular,
it’s really taken off over the last few
years, but these are all just basically – they
sometimes mean intoxicated but generally they
just mean ‘good’. You’ll see later on
that ‘litty’, of all these words, ‘litty’
is the one that’s done the best over time,
so we’ll look at some 2016 data at the very
end and it’s one that’s really done well.
You see quite a few of these, like I said,
these words like ‘senpai’ here which is
from anime culture. It’s sort of a mentor,
an older mentor. You get lots of these acronyms
which I think are interesting; we’ll come
back to those too. You get lots of these forms
based off ‘thot’, which means sort of
a promiscuous woman and so you get things
like ‘thotful’ which means
someone who’s sexually generous. ‘Unbae’
like I said, ‘waifu’, ‘xans’.
So now what I want to do is start to take
a look at these 54 words and see what kind
of generalisations we can make about lexical
emergence and I want to just stress this is
something you couldn’t do previously, right,
like you could pick a few of these I guess
but how would you go back in time and get
them? That’s the basic problem we have with
all these approaches. Like you couldn’t
really do a survey on this because you wouldn’t
know what the words were. You couldn’t go
back and look at an old newspaper corpus because
these are all real slang words that you wouldn’t
expect to find in those kinds of corpora.
You really need these massive corpora, really
informal online communication, to make this
work.
OK, so there’s lots of ways we can talk
about these words. So here are just topical
patterns. So like for example in red I have
words that are associated with sex and relationships;
in purple these are words with kind of technology
and online communication; in blue you have
words that are at least in part – a lot
of these are kind of, maybe this is over-simplifying
but some of these would be in multiple categories
– but to do with intoxication; in yellow
you have things related to, say, anime culture.
So you see quite a bit of variation here in
the meanings. These aren’t all from just
one topical area. You’ll see lots of variation
in parts of speech, so here in blue you have
nouns, in red you have verbs or verbal forms;
in green you have adjectives and in yellow
you have miscellaneous or other parts of speech.
And you also get a lot of variation in word
formation processes. So this is how are these
words constructed? In light red there, like
‘amirite’ and ‘cosplay’ and ‘fallback’
you have compounds; you have blends like ‘boolin’
and ‘brazy’ in dark red; you have conversions,
so words that are taken from other parts of
speech, like ‘faved’ and ‘mutuals’
in yellow; you have truncations, shortened
forms, like ‘famo’, ‘notifs’ and ‘xans’
in orange; and you also have a number of these
acronyms in green and creative spellings in
purple. Some people don’t like these words,
they don’t maybe feel that interesting,
but in a corpus these are unique forms and
I think it would be wrong to throw them out
and they also do very interesting things.
Like I said, one of the most interesting things
I think is how a lot of these spellings are
used to mark new meanings. You also see I
have one there in white, that’s ‘fleek’
and I have that there because it’s the only
one that – and this might not be the case
– but to the best of our knowledge it’s
the only true coinage in this data set. All
these other words are all formed by taking
existing words and kind of reconstructing
them, shifting their meaning, respelling them,
compounding them, truncating them. ‘Fleek’
is the only one that we don’t – and maybe
‘figgity’ – that we don’t know what
the derivation is and it appears to just be
someone’s creative use of a new form, but
I think it’s really notable how rare that
is, right, how most of these words are just
reusing existing words.
OK, we can also do the time series on these.
So here is ‘fleek’ and like I said, this
is probably the sharpest one, or definitely the
sharpest one, we have in our data set. It
was hardly used at all, it takes off there
mid/late 2014 and we see this really big jump
in usage. This is the one that we really caught.
I think it was one that really spread
online, to the extent that we see such
a clear pattern and really get the starting
point for it. Most look more like this; so
this is ‘famo’, it means something like
family and friends and we can see this sort
of more gradual but still a pretty clear rise
over time. Here is ‘baeless’ to mean single.
You can see this has a very nice curve on
it, very clear. You can see it starts to fall
off a bit at the end and that’s something
we’ll see continued on that trend. Here
is ‘litty’. Like I said, by 2016, of all
these words, this is the one that’s done the best
and you can see this one’s caught very early.
You can see just how infrequent it is per
million words, it’s not even occurring anywhere
close to once per million words at the start
of this data and even at the end of the time
period. This one’s quite choppy.
What we can also do is take all 54 of those
words and overlay them. These are the time
series for all 54 of those words and what
you can see here is in general this kind of
s-shaped pattern. We could clean this up a
bit more. If you want to look at it in the
paper we have this sort of talked about in
a bit more detail, but I want to keep this
kind of simple and just show the raw data
here and this is just all these scaled in
frequency so you don’t see the difference
in the frequencies of these things, but it’s
just to get the general shape here and I think
you can see a sort of s-shaped curve which
is interesting. I think given all the inconsistency
in the meanings of these words and the grammar
of these words, it’s interesting that they
all follow a fairly consistent pattern over
time.
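Overlaying series with very different frequencies needs some kind of rescaling so that only the shape is compared. The talk does not say which scaling was used; the sketch below assumes simple min-max scaling to the range 0 to 1, and the function name `scale` and the sample values are invented for the example.

```python
def scale(series):
    """Min-max scale a series to the range 0..1 (an assumed choice)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]  # a flat series carries no shape
    return [(v - lo) / (hi - lo) for v in series]

# Two invented series with the same shape but a 100-fold frequency gap.
rare = [0.0, 0.1, 0.4, 0.5]
common = [0.0, 10.0, 40.0, 50.0]

print(scale(rare))
print(scale(common))
```

After scaling, the two toy series coincide, which is the sense in which the overlaid plot hides the difference in frequencies and shows only the general shape.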
So, from this kind of study I think we can
draw some real basic properties of these emerging
words. We see diversity in their meanings
and their grammatical status and their word
formation processes, like they’re a really
diverse set. They all do show these s-shaped
curves. We can see that many of these words
reflect new cultural trends, right, as well
as new communicative contexts. In general
it seems like these words are driven by sort
of, at least to a large extent, by sort of
the need to express new ideas and the need
to communicate in these new contexts. This
next one I just added in this morning because
I was reading something by Lane Greene and
it was about – no, sorry, by John McWhorter
on Lexicon Valley and he was arguing that
new words come from children. I think we can
see this isn’t the case here at all, right?
Like children are not coming up with a bunch
of words for intoxication and sex and online
communication, like these aren’t children’s
words, these are being created by adults.
Like I said, coinage is very rare and I think
that’s an important result. I think it’s
something that surprises us and it’s hard
to know that until we take a big set of these
and look at them together. But almost all
these words are constructed from existing
words. And I also think it’s important to
know here that there’s some highly productive
new word formation processes online. These
include acronyms and creative spellings and
they maybe don’t feel quite as interesting
or as valid as traditional word formation
processes, but they’re doing a lot of stuff
online and I think they’re interesting to
look at and they could be looked at in a lot
more detail. Creative spellings in particular
are doing something really neat and that is
they’re really marking off new meanings
of words.
OK, so what I want to do now then is look
at the other information that’s in this
corpus and what it is, is that we have all
this information about where people were when
they tweeted. So for every single tweet in
this corpus, every single token of any of
these words, we know exactly where the person
was and like we know, you know, like if Akira’s
here and Michaela’s down here, we know that
difference, like that’s how much resolution
we have. We can capture everyone in this room.
We don’t use that but that’s the kind
of information we have. It’s incredibly
precise and that really lets us do some new
stuff, right? What it lets us do for the first
time is really look at both the actuation
and the diffusion of this lexical innovation
and what I mean by actuation is where these
changes come from, like what is the source
of these new forms; and by diffusion how they
spread out. There’s been a lot of work on
diffusion in sociolinguistics and historical
linguistics, but it’s usually based on one form
or a handful of forms so it’s hard to make
generalisations. So just like I was arguing
before, because we have this big unbiased set of 54
words, which I didn’t hand-pick, and there’s
a lot of them, we can see common patterns
of diffusion. That is how do words spread
out, but I think in some ways the actuation
is even more exciting because there’s been
a big argument which is that you can’t look
at the creation of new forms because you’re
never there to see them, right? Like you’re
not there to witness them. By the time a linguist
has noticed them, they’ve already passed,
right? And so obviously most of these words
I don’t believe started on Twitter but we
can still from this information I think you’ll
see in a second, we can actually try to kind
of see where they come from, estimate their
regional source and that’s something we’re
going to use here to try to understand what’s
happening.
Now to do this there’s various ways one
can map them. The simplest approach is just
to map how the frequency changes over time,
OK, and so you get a whole series. So instead
of time series of frequencies you get a time
series of maps and you can see then how the
word spreads out over time. What we also
want to do though is try to compress that
down to a single map, OK, and I’m not going
to go into the details of this but basically
the idea is to measure the number of days
since the word became relatively common in
each location and we can then do some sort
of spatial analysis to find the hotspots.
I’ll show you what this looks like. So here
is ‘fleek’ near the beginning of the corpus
in June 2014. You can see it’s not like
we get it in one spot, we get it spread out
a little bit. This word didn’t start on
Twitter. In fact we know where this word started,
but we can see how it spreads out nevertheless
on Twitter here. So this is June 2014. OK,
I think notably you get three counties quite
dark red – I’m not sure if you can see
that – but these are very high frequency
uses down here in South Carolina. You get
a smattering here across the lower mid-west
and over here in Los Angeles. Here they are
in July. As you can see now it’s starting
to fill in this southern region. Here is August,
where we see a big jump, right? Like you can
think of that time series, how fast that word
took off. Now you can see it’s filled in
a lot of the south and if you kind of squint
your eyes and look back, that’s basically
what the hotspot will do for you in a second,
you can kind of see it’s really concentrated
in the south. That’s Atlanta there.
By September it’s spread across most of
the United States. This is largely empty,
there’s not many people here so we shouldn’t
be too worried about that. And here it is
in October and in November. So you see by
the end of November it’s spread across much
of the United States, OK. You can see it’s
still particularly concentrated in the south
east but you see quite a bit over there in
California for example and up here in the
north east as well. These are the Appalachian
Mountains, these are the Rocky Mountains here,
which is sort of low population there as well.
This then is our attempt to kind of compress
that into one map. So this is a hotspot map
and the way we should think about this is
that it basically tells you how long it has been
since the word became frequent in each
of these locations. So in dark red we have
places where this word was used very frequently
early on and so you get – and not just one
location, you might remember there’s a few
here but there’s quite a few locations in
here that were using them. So this kind of
area here, this is Virginia and the
Carolinas and here’s Atlanta, Houston.
And then this area of sort of lighter pink
is the area where it kind of spread out to
first and then in blue you have the rest of
the country. We saw it spread to everything
eventually but that’s the pathway it took.
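A minimal sketch of the "days since the word became relatively common" idea follows. The talk does not give the details, so everything here is an assumption: the threshold value, the function names, the toy locations A, B and C, and the neighbour averaging, which is only a crude stand-in for whatever spatial hotspot analysis was actually used.

```python
from datetime import date

def days_since_common(daily_rel_freq, threshold, end_day):
    """Days between the first day the word's relative frequency reached
    `threshold` in one location and `end_day`; 0 if it never did."""
    for day in sorted(daily_rel_freq):
        if daily_rel_freq[day] >= threshold:
            return (end_day - day).days
    return 0

def smooth_with_neighbours(scores, neighbours):
    """Average each location's score with its neighbours' scores.
    This is an assumed, simplified substitute for a real hotspot statistic."""
    smoothed = {}
    for loc, score in scores.items():
        values = [score] + [scores[n] for n in neighbours.get(loc, [])]
        smoothed[loc] = sum(values) / len(values)
    return smoothed

# Made-up toy data: relative frequency of one word by day and location.
end = date(2014, 11, 30)
freqs = {
    "A": {date(2014, 6, 1): 5e-6, date(2014, 9, 1): 6e-6},
    "B": {date(2014, 6, 1): 1e-7, date(2014, 8, 1): 4e-6},
    "C": {date(2014, 6, 1): 0.0, date(2014, 11, 1): 1e-7},
}
scores = {loc: days_since_common(f, 1e-6, end) for loc, f in freqs.items()}
hot = smooth_with_neighbours(scores, {"A": ["B"], "B": ["A", "C"], "C": ["B"]})
```

Locations with high smoothed scores are the ones that would show up in dark red: they, and the places around them, started using the word early.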
Here’s another one, this is ‘baeless’.
So you can see it’s kind of a similar pattern,
right? In the south it’s shifted a bit more
into the deep south, so this is kind of the
coastal south here and this is the deep south,
so that’s really big around Atlanta, New
Orleans, Houston, Dallas. And here’s ‘lordt’.
This is basically used as an exclamation like
‘lordt’, and you can see this one here
is also in the south. You can see the general
pattern is very similar but now the hotspot’s
down here around New Orleans and Houston,
Louisiana here. This is the Mississippi River
up here.
OK, we can look at another one here. This
is ‘tfw’, this is another word that’s
done very well, so this is one that is still
being used a lot online. This means ‘that
feel when’, it’s a way of encapsulating
some kind of moment, like that feeling when
you get home from a long day at work or something
like that. You can see this one here, we didn’t
catch it quite as early so it’s a bit more
diffused already but here we can see quite
a few counties here in California that have
this form and you can kind of see this is
where this is most active. So by now it’s
sort of spread to Seattle, Portland, Las Vegas,
Phoenix, Flagstaff. And here it is spreading
out. Again, July, September, November. Now
what I want to note about this one here is
it obviously didn’t do quite as well as
‘fleek’, it’s not had quite as much
coverage, but it also kind of shows almost
the opposite pattern, like this area in the
south that was so strong for ‘fleek’ and
for ‘baeless’ and for ‘lordt’ but
it’s now relatively less active. I mean
you still have lots of activity here but it’s
nowhere near as dense and you see that then
when we can find the hotspots here and you
can see hotspots in California with a few
other kind of urban areas in the north getting
hits. Here’s ‘waifu’, this means sort
of like your anime crush, you know, like who
you might want to be your wife if you were
an anime character. We get this here in California,
in Florida, in Miami, New York City. And here’s
‘litty’ which we can see really does hotspot
mainly right in New York City.
OK, the next thing we want to do is try to
find out, so once we’ve seen these 54 words,
these 54 maps, what are the most common patterns
there? The way I like to think about this
is like sorting these patterns into piles,
right, and we say ‘oh look, well here’s
all the ones that show a California pattern
and here’s all the ones that show a deep
south pattern; here’s the New York pattern’
and you basically stack up these 54 maps and
a map can go in multiple piles sometimes if
it has a combination. Then what you do is
you take the biggest pile and you kind of
average all those maps and that’s the first
basic pattern here. We found five of these
patterns. Within each of these we’ll see
kind of a core area, the hub of lexical innovation,
that’s where the word appears to have come
from, at least on Twitter. And then we’ll
see where it diffused out to, right? Not individual
words but in general where multiple words
come from and seem to spread out to.
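The "sorting into piles" picture can be sketched roughly as below. This is an illustration only, not the method used in the study: it assumes each map is a list of hotspot scores over the same locations, groups maps greedily by Pearson correlation with a made-up cutoff of 0.7, and, unlike the procedure described, puts each map in just one pile. The word names and values are invented.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def sort_into_piles(maps, min_r=0.7):
    """Greedy grouping: a map joins the first pile whose first map it
    correlates with at >= min_r, otherwise it starts a new pile."""
    piles = []
    for word, values in maps.items():
        for pile in piles:
            if pearson(values, maps[pile[0]]) >= min_r:
                pile.append(word)
                break
        else:
            piles.append([word])
    return piles

def average_map(maps, pile):
    """Location-by-location average of all maps in one pile."""
    return [mean(col) for col in zip(*(maps[w] for w in pile))]

# Made-up hotspot scores over four locations: west, south, north east, gulf.
toy_maps = {
    "word_a": [9.0, 1.0, 3.0, 1.0],
    "word_b": [8.0, 2.0, 4.0, 1.0],
    "word_c": [1.0, 9.0, 2.0, 5.0],
}
piles = sort_into_piles(toy_maps)
biggest = max(piles, key=len)
pattern = average_map(toy_maps, biggest)
```

Here `pattern` plays the role of the first basic pattern: the average of the maps in the biggest pile.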
So here’s the first pattern. This has these
words most strongly associated with this pattern.
We get all these anime forms, ‘waifu’,
‘senpai’, we get a lot of words kind of related
to communication online, things like ‘faved’,
technology ‘candids’, ‘rekt’ which
is often used in video game language. We get
quite a few of these acronyms like ‘tbfh’
and ‘tfw’. So this is the pattern. So
this one here is quite a bit like the ‘tfw’
pattern, right, and that’s what we should
expect since it’s associated with this one.
What we get is a really strong kind of west
coast pattern really with San Francisco here
and Los Angeles, and then we get a smattering
in other places, especially around New York
City and around Miami, but then all sort of
various large cities, so this is Minneapolis,
this is Tulsa, this is Dallas, that’s Atlanta,
this is Boston down to Washington DC. OK?
And so what we see here is a general
pattern across much of the United States,
much of the urban United States, and we can
really see this if we look at a map of population
density. So this is basically an urban map
of the United States, right, the darker area
is more people per square mile. You can see
how closely those align. OK? So this is your
general urban pattern.
The second pattern is this one here. You see
no acronyms anymore, you see words like ‘baeless’
and ‘boolin’ and ‘brazy’, ‘fleek’
is here now, ‘famo’, ‘tookah’, ‘traphouse’,
‘xans’, and this is the pattern. That’s
the pattern that these words tend to show.
So this is kind of the opposite, like the one
we saw for ‘fleek’ or something, where
you had the south. In particular it’s hotspotted
here. The real hub of these words is Atlanta.
This also corresponds very closely to a well-known
demographic pattern in the United States.
This is the distribution of African Americans
and you can see just how closely those patterns
align, right, like extremely closely. OK? And
so basically these appear to be African American
words. These words tend to come from major
African American parts of the United States.
Atlanta in particular is often considered
to be the capital of African America.
The third pattern involves these three ‘litt’
forms and ‘balayage’, which is a hairstyle,
and these are more associated with New York
City. They have a bit less of a spread. The
fourth pattern is these words
here, ‘shordy’, ‘sqaud’ – spelt
that way – ‘thottin’, ‘yaas’. These
are associated with this middle Atlantic region
and they seem to be centred here on Washington
DC, the Washington DC Baltimore area spreading
up to Philadelphia. There’s also a secondary
cluster here in Chicago. OK?
And then the final pattern is these ones here
and these are the words most strongly associated
with this sort of Gulf Coast pattern, centred
here on Louisiana and New Orleans.
So, overall what we find then is that emerging
words tend to originate from a pretty small
number, at least on Twitter, of these hubs
of lexical innovation and what I want to point
out is that these are all cities, which I
guess isn’t surprising, but just that these
are really culturally important cities. You
know, New York City, Los Angeles, San Francisco,
I don’t think we should be surprised by
those, right? Washington DC for that matter.
But like Atlanta isn’t that big a city,
it’s the 10th biggest city in the United
States yet it’s the 2nd biggest source of
these new words. It is an incredibly important
city though for African American culture.
New Orleans is the 50th most populated city
in the United States. It’s not a very big
city at all, especially after the hurricane,
but it’s a major cultural centre of America;
it always has been, we can all think of jazz
and food and carnival and stuff from New Orleans.
This seems to be where new words are coming
from, not just from cities but from really
culturally important cities and diverse cities.
We can also see when we look at the diffusion
of these words that in general it seems that
cultural regions majorly constrain the spread
of these words so that cultural borders tend
to slow words down and so what we see here
isn’t that words just jump from one big
city to the next or just spread out radially
from some central location, but that cultural
patterns seem to constrain the spread of these
words. Maybe the clearest spot is around Atlanta,
where these words start in Atlanta and then
actually stop. This is Atlanta and then this
is stopping up here. This is a major demographic
boundary. This is white Georgia and this is
black Georgia.
And, we also see that African American English
appears to be the main source of lexical innovation
on American Twitter. This is all of these
five patterns superimposed on the same map
and here is your map of the African American
population. What I just want you to notice
is that three of these five patterns are really
in this African American area, you know, this
is white America but very little comes from
white America. One pattern in New York City,
there’s one kind of general west coast urban
pattern and you get three very distinctive
patterns here in African American English.
African American English often is presented
as this really monolithic thing but I think
we can see here some regional diversity in
African American English.
What I want to do last then is just to zoom
in on New York City now and so, because we
have so much data, what we can do is zoom
in on one city and see how words are spreading
in just one city. So I’m going to take all
54 of these words and look at New York City
in particular. Maybe those weren’t African
American words, right? So let’s see what’s
happening there. If you recall, the main words
for New York City were ‘balayage’
and the three ‘litt’ forms. The corpus
is a little bit different, it’s just from
2014 so it actually runs to December; it’s
about 20 million tweets and 200 million words
from the five boroughs and rather than counties,
what we’re going to look at here is census
blocks, census tracts, which are just basically
a few blocks in the city, so we can really
get high resolution. I think this is probably
by far and away the highest resolution at which
anyone’s ever mapped language variation, where we’re
just down here in one city at a really detailed
level.
OK, then we can do just what we did in the
last study. So here for example is ‘litty’.
This is in March 2014, here it is in May,
here it is in July, here it is in September,
OK? Let me just quickly say what we’re looking
at here. So this is Manhattan, that’s Central
Park, this is Harlem, this is the Bronx, this
is Staten Island, this is Brooklyn and this
over here is Queens. That’s JFK airport.
A lot of the big kind of things you see are
either parks or airports. That’s LaGuardia,
that’s Rikers Island, the
prison, that’s Coney Island. Now you can
see it’s not very clean but we can do the
same thing we did before and sort of look
at where the hotspots are and this is where
we find the hotspots to be, which doesn’t
necessarily – I think it’s important here
to note that it doesn’t quite look like
that map. That’s because this really highlights
where things occurred early on. OK? Here’s
‘lituation’, so you see we actually get
pretty early usages up here in the Bronx with
one in Lower Manhattan, here’s the spread
over time, here it is in November. When we
get a hotspot map on that too it really shows
the real hotspot up here in the Bronx and
then parts of Brooklyn and Queens. Here’s
a few others. So here’s ‘baeless’ which
you can see has a rather similar pattern;
here’s ‘amirite’ which has a very different
pattern. So this is Lower Manhattan, this
is downtown Brooklyn, this is Park Slope,
one of the richest parts of Brooklyn, this
is Williamsburg which is sort of the hipster
capital of the world. And here’s ‘balayage’,
the hairstyle and we can see this is a very
affluent area here in Manhattan, Upper West
Side.
OK, so when we do the same thing as before we
find two patterns in this data set. The first
pattern has these words and what I’ve highlighted
here are these three because these are the
three that we believe come from New York City,
right? So this appears to be, you know, where
a lot of these new words are coming from.
We said in the earlier study that we think
they’re coming from New York City and now
we can try to say where we think they’re
coming from in New York City. But it also
has these other words which, if you recall,
these are all kind of southern African American
forms, especially from that Atlanta region.
Here’s what it looks like. Now you can see
we haven’t at all kind of located this in
one part of the city, which may be a little
bit disappointing but we have seen that this
is where all these words are coming from more
or less. Now this is a very explainable pattern.
This is the percentage of non-white and non-Asian
people in New York City. So this is basically
a measure of black and Hispanic people in
New York City and you can see how closely
these two are aligned. This is JFK so nobody
really lives there, but you can just see that
this is basically saying that these words,
including ‘litt’, ‘litty’ and ‘lituation’
that we believe come from New York City, are
coming not from white communities, not from
Asian communities – meaning basically Chinese
communities – but from African American
communities and Caribbean communities, particularly
Puerto Ricans and Dominicans.
Here’s a second pattern. We get the one
word, the other word that we’re looking
for, ‘balayage’ is here too and then we
get these other forms. These forms here are
all associated with that more general, western,
northern form. We also get what we’ve only
seen once and that is ‘fleek’ is actually
selected against. It’s like people try to
avoid this form. And here’s the pattern
here. Now I haven’t had chance to get the
kind of corresponding map up but I’ve seen
the map in general – I haven’t had chance
to do it myself – but this basically maps
on very nicely to how rich people are, OK?
And so what we have here is Lower Manhattan,
this is Williamsburg and this is Park Slope,
a very rich part of Brooklyn. This is kind
of the white rich area of Staten Island and
so what we have here is a really big difference.
We have the more white affluent areas, the
hipster areas to some extent, versus the more
poor, African American and Latino areas of New
York City.
OK, so what I think I’m trying to say is
that zooming in on such a small area shows
how important these cultural patterns are,
right? We can also see how important this
African American English is here as well,
but I also wanted to point out that we can
see this kind of competition, right? You have
this one kind of white New York part which
has their own words. This is especially from
Lower Manhattan, Williamsburg and Park Slope
in Brooklyn, and they’re in competition
essentially with this other area of New York
City. We have these sort of two separate cultures
in New York City producing different sets
of words. And they’re not just producing
different sets of words but they’re importing
different sets of words. So the one group
is importing words from the north and from
California, that’s the more white area in
Lower Manhattan and Brooklyn. The more ethnically
diverse area in the Bronx and Brooklyn and Queens
is bringing in new words from the south.
Now I just wanted to do one last thing here
with this competition idea and that is talk
about – this is a pretty famous quote from
Darwin – he says ‘the survival or preservation
of certain favoured words in the struggle
for existence is natural selection’ and
the last thing we’ve done here most recently
was try to look at how we can predict which
of these words will win out over time. And
so as I’ve said a few times, ‘litty’
is the most successful of all these words
and that’s what we want to look at here.
So we want to see what predicts how successful
these words are between 2014 and 2016. So for
this data set, I have another data set of
just basically the same thing of 2016 tweets
and we just want to see how much the frequency
of these words has changed over time. We’re
going to look at four things about the form
and the function of these words to see what
predicts their success. So this is basically
what we’re trying to predict. So in red
we have the words that rose in frequency,
in blue the words that fell in frequency.
So we can see that ‘litty’, ‘brazy’,
‘balayage’, ‘nahfr’, these are most
successful. We see words like ‘baeless’
and ‘fleek’ and ‘baeritto’ did very
badly.
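One simple way to express "how much the frequency of these words has changed" between two corpora of different sizes is a log ratio of relative frequencies. This is a sketch under assumptions: the talk does not say which measure was used, the function name is made up, and the counts below are invented toy numbers, not values from the 2014 or 2016 data.

```python
import math

def log_change(count_early, total_early, count_late, total_late):
    """Log ratio of a word's relative frequency in a later corpus to its
    relative frequency in an earlier one. Positive means it rose.
    Assumes both counts are above zero."""
    rel_early = count_early / total_early
    rel_late = count_late / total_late
    return math.log(rel_late / rel_early)

# Made-up counts: (word count, corpus size in words) for each year.
rose = log_change(100, 1_000_000, 400, 2_000_000)  # relative frequency doubles
fell = log_change(100, 1_000_000, 50, 2_000_000)   # relative frequency drops to a quarter
```

Working with relative rather than raw counts matters here because the two corpora need not be the same size.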
OK, so let’s see what predicts success of
these words, or their failure. First we can
look at word length, OK. So this is just representing
one of the predictors here and that is how
long these words are. OK, so two
and three letter words are very short; these
ones are very long. We have parts of speech
– this is just the slide from before, right,
the parts of speech in our data set. We can
also look at word formation processes. So
I’ve simplified this, so all we have here
is spelling in red, acronyms in yellow and
then all the other kind of standard word formation
processes in blue. And then this is the one
that was the hardest to code, it’s the one
that makes the most difference. This is if
they mark a new meaning and so we have words
like ‘balayage’ which is a new hairstyle,
or ‘cosplay’. There is no other single word
that you can use for these things; these are
the words to use. ‘Cosplay’ is now in the
dictionary for example. Or ‘traphouse’
or ‘waifu’, these are not words that we
have a standard word for. On the other hand,
things like ‘faved’, we have the word
‘favourite’, ‘boolin’ we have the
word ‘relaxing’, ‘notifs’ we have
the word ‘notification’. In the case of
the ones in white or the ones that are done
by spelling, that’s a bit different because
these are all existing words, but there we’ve
marked the ones in red like ‘gainz’ and
‘goalz’, these mark a new meaning, whereas
something like ‘lordt’ just carries a meaning
that we already know, that’s in the dictionary,
the meaning of the word ‘lord’.
OK, so this is what we found. We found that
those four predictors account for a substantial
amount of the variation and change of these
forms, about 35%. What we found by far, the
most important of these predictors was if
it marks new meanings. So emerging words that
fill semantic gaps tend to do much better
over time than emerging words that just replicate
an existing meaning. We also found that words
formed through standard word formation processes
tended to do better. What that means, in some
sense what we’re seeing is that words that
are suitable across different communicative
contexts tend to do better. So an acronym
or a creative spelling, you know, it doesn’t
translate to speech easily. We also see that
short words tend to do a bit better as well
and that maybe tells us about efficiency in
general and maybe especially on Twitter.
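To make "predictors account for about 35% of the variation" concrete, here is a sketch of an ordinary least-squares fit and its R-squared in plain Python. It is an assumption that a linear regression of this kind matches the analysis; the talk does not name the model. The predictor coding (word length, a 0/1 flag for marking a new meaning, a 0/1 flag for a standard word formation process) and all values are invented for illustration, and part of speech is left out to keep the sketch short.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Least-squares fit with an intercept via the normal equations.
    Returns (coefficients, R-squared)."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(x, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(beta, row)) for row in x]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return beta, 1 - ss_res / ss_tot

# Made-up rows: (length, marks_new_meaning, standard_formation) per word.
words = [(5, 1, 1), (8, 0, 1), (3, 0, 0), (7, 1, 1),
         (5, 0, 0), (9, 0, 0), (4, 1, 0), (6, 0, 1)]
# Made-up change in frequency for each word (for example a log ratio).
change = [1.2, -0.4, -0.1, 0.9, -0.6, -1.1, 0.8, 0.1]
beta, r2 = fit_ols(words, change)
```

In this framing `r2` is the share of the variation in frequency change that the predictors account for, and the sign and size of each entry in `beta` indicate which predictors go with success.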
This is just showing the word formation process
and whether or not it’s marking a new meaning.
So the ones on the left of each of those pairs
in circles, we have forms that are marking
new meanings. You can just see how much more
successful say a standard word formation process
is when it marks a new meaning than when it
doesn’t, or when a spelling marks a new
meaning versus when it’s just replicating
an old meaning. We can see quite clearly here
that we get more above this zero line. So
things that are growing tend to be more these
new meanings in general.
OK, so just to conclude, what I want to say
is that lexical emergence is a really kind
of fundamental issue in language change. There’s
lots of stuff written on word formation processes
but it’s been really hard to do this on
a systematic level, just because
words in general, but these
new words in particular, are just so rare
and so dynamic and so elusive. You need these
huge informal corpora to be able to make sense
of them. That’s what we can do now with
these really large and dense Twitter corpora.
There’s nothing that special about Twitter
except that we have so much of it and ten
years from now we’ll have other types of
data too. We’ll have more and more of this
type of data. We can do more and more exciting
things with it. But I think what we can now
do here for the first time is start to look
at what forces are driving and constraining
the development and the spread of these words
in real time. We also see just how diverse
this process is, right? We can see massive
variation in the topics and the form and the
grammar of these words. We can also see that
they almost all involve just repurposing existing
words. We also see how important cultural
patterns are here, both that they come from
really influential cultural groups and they
spread throughout these cultural regions,
and we can also see that success over time
is predicted by how much communicative utility
the word has, and particularly filling in
these gaps in the lexicon.
This is what I think then is what we’ll
see the study of language change becoming
over the next decade as we get more and more
data, as linguists get more and more able
to work with this data, as computer scientists
get more and more interested in this data
as well. I think what we’ll really see is
that linguistics is sort of shifting from
a social science, or one of the arts even, to a data
science. As we get more and more data it really
takes new techniques to be able to make sense
of this data and it really opens up new areas.
I mean I think there’s a lot of talk that
big data is just a trend or a fad, that it doesn’t
really make much difference and that sometimes
more data is more trouble than anything, but
I think in the context of linguistics that’s
not true. I mean it really opens up things
that we just couldn’t do before.
Where I think this research is really going,
at least for myself, is that I think lexical
emergence is important but what we really
see here is semantic change across the lexicon,
so shifts on existing words is a much more
pervasive and dynamic process. We only had
54 words there out of 50,000 or 100,000
words, but almost all those words are shifting
in meaning. Almost all those words are shifting
in usage and that’s something we can start
to process now for the first time and really
look at the lexicon as a system and how lexicons
as a system are changing over time, not just
individual words. So when one word shifts
in meaning, what happens to the other words
in that area? That’s something we can start
to track and we couldn’t track that a couple
of years ago.
The way we can do this is probably through
using what’s called word embedding models.
I haven’t talked about them here but they
really give us a method for modelling the
lexicon. They’re not perfect for this yet,
they’re mostly used for more machine learning
or natural language processing tasks, but
they’re something that linguists can use
to really build full models of a lexicon and
really track change. I also think this is
an important point and that is that often,
when we think of language change, we think of it
as competition between existing forms. I actually
haven’t done this here, I’ve been looking
at individual words, but often in socio-linguistics
we say, you know, is it bath/bath, what’s
rising, what’s taking over? Pop or soda?
Or this grammatical construction or this grammatical
construction? It’s always a competition.
But I think what we can kind of make an argument
for here, and like with that New York data,
is that we actually have competition between
whole systems. It’s not just individual
words that are competing but these whole systems,
and you can start to see that if you have models
of lexicons from different social groups and
you start modelling and trying to understand
how these whole systems are competing with
one another and what that means for language
change.
And then the last thing I want to say, because
I probably could have called this paper the
Future of the Study of Language Change, but
I do want to make a point about the future
of language change itself. Maybe this is controversial,
I mean I think we see it in the data that
is shown and it’s not my idea, but I think
it’s fair to say that the lexicon, and language
in general is evolving, or changing, at least
in part because our culture is evolving and
that means that we get these developments
of new social structures and of new communicative
contexts in particular. We can see this with
lots of the words, right, that they’re related
to new cultural movements, they’re related
to new communicative contexts, they’re related
to changes in our culture. That’s why we
get these new words for the most part. There’s
some need for it in our communicative environment.
We need to express a new meaning, we need
to mark a specific social group, we want to
achieve a specific communicative goal that
we couldn’t do previously.
So what I just want to end with is saying
that if we believe that the population of
English speakers is growing and that the diversity
of our culture and of communicative contexts
in general is growing too – probably not just in English
but across the world – then we should also
expect that language change should be speeding up
too, more and more rapidly. I think we’re
in a really neat spot here, right, because
we both have the speeding up of language change
on the one hand and this access to more data,
to more powerful computational methods and
more computer power, and that means we’re
going to be in a really good spot to see this
change. I think it’s a really exciting time
for studying language change and for observing
language change. Thank you.
[APPLAUSE]
