So it’s a great honour for me to give the Sinclair Lecture today because, as Michaela said, John Sinclair has been a strong influence on my development as a corpus linguist. But I’d like to start the talk with a disclaimer and two promises. The disclaimer is that, unlike all the other talks I’ve given, this talk won’t contain any solid quantitative results. It’s very much programmatic. I’m going to talk about my vision for the future of corpus linguistics; my vision for the corpus linguistics that I want to do in the future. This fits the Sinclair Lecture very well because when I had the pleasure to meet John Sinclair in 2005 at the Tuscan Word Centre and sit on the veranda with him for quite a chat, he had this much broader view of development. I was still fairly young, just after my PhD, focused on methodology and specific questions, and he had this much more general and far-reaching view, and so when I was invited to give this lecture I thought about what my visions for the future are.
OK, now onto the promises. So at last year’s
corpus linguistics conference when I was here,
there seemed to be a trend of putting funny
pictures of animals into your presentations,
so I spent a lot of time thinking about which
animals I’d pick, but in the end I decided
to remain serious, so I can promise you there
will be no dinosaurs, orang-utans, cats, bunnies,
dragons or even chobits in the presentation.
For those of you who have already heard some of my other talks, I have another promise: I’m not going to talk about authorship attribution, collocation evaluation or non-randomness, and I’m not even going to mention Zipf’s Law. So although you’re going to hear something different from me, it won’t be boring.
So, when thinking about the future of corpus
linguistics let’s look at the state of corpus
linguistics in 2018. Well, what are the tools
that we work with most? Concordances, collocations,
key words – frequency analysis. So of course I’m ignoring quite a lot of work that has a more theoretical bent, that focuses on syntactic structures, on proving theories, but looking at last year’s corpus linguistics conference programme, or the programme of the corpus and discourse analysis conference I just attended in Lancaster over the weekend, at least half of the talks rely on these methods and obtain very useful, very good results with these methods. Here’s proof just from the weekend’s conference. What does analysing corpora mean? Concordances, collocations and key words.
OK, so as I said, I’m ignoring all the other work, corpus-based research with a theoretical, cognitive linguistics background that focuses more on, say, alternations and predicting alternations. Some researchers of course have their pet method, so they always work with a particular method that doesn’t fall into this basic toolkit of corpus linguistics.
The prime example of course is Biber’s research
into dimensions of register variation and
then of course you get a few fairly innovative
studies. These are usually inter-disciplinary
collaborations, often involving a computer
scientist or expert statistician, and corpus
linguists who work together. That also means
for many corpus linguists who don’t have
access to a statistician – or at least to
a statistician who is willing to talking to
a corpus linguist – it’s very difficult
to carry out this analysis because you will
rely on what the available tools offer you.
So essentially we’re still working with
methods that were developed in the 1960s and
the 1970s. A lot of these methods were introduced
by John Sinclair in his 1966 paper in the
report from the OSTI Project in 1970 and all
his subsequent work. The main difference is that today we can work with bigger corpora, we can do the analysis faster, and it’s all much more convenient because we’ve got these nice web interfaces and software tools – here is CQPweb – software tools like AntConc or MonoConc that allow you to do exactly these three types of analysis in combination and link between the different analyses. So actually I have to say at this point that I made a slightly narcissistic choice of using only examples from my own research group as illustrations in this talk. That’s my way of making up for
the lack of quantitative results. At least
you get an idea of what we’re working on
at the moment and CQPweb falls under this
because I’m mostly involved in the development
of the underlying indexing software.
So we seem to be working with fairly old-fashioned methods, but we’re very successful at doing this. In the UK we have these big centres such as Birmingham and Lancaster; the Centre for Corpus Approaches to Social Science has been highly successful at Lancaster, addressing a wide range of problems, especially those that have to do with language description, with language teaching, with very strong work in sociolinguistics and on politics and political ideologies, and there’s also very successful work here at Birmingham in literary stylistics, or stylometrics as some people refer to it.
So if you look at modern lexicography it’s
almost unthinkable to write a dictionary without
corpus tools and we’re using exactly the
inventory of methods that corpus linguists
have introduced – collocations, key words, reading concordances. So I’d like to illustrate
these successful applications with some of
the work we are doing at the moment, so that’s
in cooperation with the Department of Political
Science. We are working on discourses around
austerity. So this is based on newspaper articles
from the Guardian and the Daily Telegraph
between 2010 and 2016, which as always we collected through LexisNexis, giving us a total of 18 million tokens after very, very painful de-duplication procedures. For 323
articles we have also scanned images that
appeared in the articles that directly relate
to austerity and we are carrying out analyses
that are inspired by corpus based discourse
analysis, but are combined with the multi-modal
view of these images. So this is one of these
fairly typical applications that I’ve heard
all over the weekend.
In another research project we’re looking
at multiply resistant bacteria. This is done in German, so the idea is not biomedical information mining but rather to work out what the discourses, the public discourses, around multiply resistant bacteria are, because a lot of people are afraid. There’s a lot of reportage in the press on these issues, especially if something happens at a hospital. So we looked at discourses in an online corpus related to multiply resistant bacteria. This uses the traditional key word approach: we carry out a key word analysis, comparing the frequencies of lexical words in the target corpus, which is limited to mass media reportage on multiply resistant bacteria,
comparing this to German newspaper corpora
from the same time and then grouping these
key words into categories that correspond
to the different discourses. Here’s the
distribution of key words, the number of key
words found for each of these categories.
As you can see, quite a number of categories
are very well supported by a key word analysis,
so again corpus based methods are successful
in understanding the discourses there, but
there are some categories, especially metaphors,
where we have some gaps. These categories were actually identified by a detailed qualitative analysis of a small number of texts.
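To make the key word approach just described concrete, here is a minimal sketch – not the project’s actual pipeline – of comparing lexical frequencies in a target corpus against a reference corpus with a log-likelihood score; the words and counts are invented toy data.

```python
from collections import Counter
from math import log

def keywords(target_counts, ref_counts, top_n=20):
    """Rank words by log-likelihood (G2) of over-use in the target corpus."""
    t_total = sum(target_counts.values())
    r_total = sum(ref_counts.values())
    scores = {}
    for word, a in target_counts.items():
        b = ref_counts.get(word, 0)
        # expected frequencies under the null hypothesis of equal relative frequency
        e1 = t_total * (a + b) / (t_total + r_total)
        e2 = r_total * (a + b) / (t_total + r_total)
        g2 = 2 * (a * log(a / e1) + (b * log(b / e2) if b > 0 else 0))
        if a / t_total > b / r_total:   # keep only words over-used in the target corpus
            scores[word] = g2
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]

# toy example with invented counts
target = Counter({"keim": 40, "krankenhaus": 35, "angst": 20, "und": 400})
reference = Counter({"keim": 2, "krankenhaus": 10, "angst": 5, "und": 5000})
print(keywords(target, reference))
```

The resulting key words would then be grouped manually into discourse categories, as described above.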
So that’s also a very successful application
that’s still in the early stages. We’re
now going to improve the key word analysis
and also extend it to other data sources.
This is a cooperation with the palliative care unit at our university hospital. The main interest here is that these people want to be able to communicate better with patients and their care-givers, and in order to do that they want to understand the discourses. So we are also going to analyse the interviews with the patients, which they’ve recorded and digitised. There’s not enough material in these interviews to do a corpus-based analysis on its own, but our idea is to relate what we find in these interviews to the public discourses, because those will influence how the patients and their care-givers conceptualise the problem of multiply resistant bacteria.
A third research project – one that’s not so pleasant, and that’s going to pop up again towards the end of the talk – is called ‘Exploring the Fukushima Effect’. So our interest here is in understanding discourses around nuclear energy, and especially the abolishment of nuclear energy after the Fukushima disaster in 2011. This is a cooperation with, amongst others, the Department of Japanese Studies at my university. So here we have an additional
challenge in that we want to compare discourses
– or political argumentation – not only
across two quite different languages but also
across different cultures. In addition we want to compare the way these issues were reported in the mass media to the way they were discussed on social media, especially on Twitter, which is an extremely popular social network in Japan. The reason why we are interested in this is that in Japan, after the Fukushima incident, at first the incident was downplayed, and then there was a strict government policy to keep using nuclear energy, and they basically stopped newspapers from reporting on, from writing against, nuclear energy, which of course
is quite different to what happened in Germany
where the government immediately decided to
abolish nuclear energy for good after the
incident. So we had really a lot of material
in the newspapers immediately after the incident,
but then it died down fairly quickly when
Germany stopped using, or decided to stop
using, nuclear energy, whereas in Japan at first there was no reporting in the newspapers, but then a grassroots movement formed and organised, mostly via Twitter – an anti-nuclear energy movement – so by the time
of the next election in Japan, the discourses
started actually appearing in the mass media
as well.
So because this is such a difficult setting
we are actually carrying this out in a large
inter-disciplinary collaboration. I think
that’s also fairly typical of successful
work in corpus linguistics, especially recent
work. We try to combine different perspectives.
Corpora give us one perspective on the data,
one perspective of what’s going on, but
then of course the experts, in this case we
need people from communication science, contribute
a more detailed analysis of the different
actors and their interests in such discourses.
Japanese studies and Japanese cultural studies contribute the understanding of Japanese culture. Without this it would be impossible to make any sense of this, even for people who speak a bit of Japanese. And then we have computer science to do the fancy visualisation and network analysis that you need to make any sense of Twitter
data. So we are about midway through the project.
We’ve made fairly slow progress at first
because it’s not so easy to get such different
groups together and find out how we can work
together in an inter-disciplinary team. But
one of the initial results that we’ve got, for instance, is to look at the development of various key words, key phrases, such as ‘nuclear phase-out’, which you can see here in violet. Here in the Japanese Twitter data you can see that at first, after the incident, it takes a while until the discussions of a nuclear phase-out start, but then it’s very active on Twitter, and at that time, if you look at the newspaper data, you get hardly anything, hardly any reporting about a nuclear phase-out. Then there are a few additional
spikes which coincide with elections in Japan.
So at the following two elections people again
started talking about the nuclear phase out,
but then eventually in 2013 things died down
and the movement more or less stopped.
As part of this work we also looked at some
other Japanese data sets and this again shows
nicely how much we can learn with our traditional
corpus linguistic analysis. What we looked at here was a first attempt, a way of getting our hands dirty and learning how to work with Twitter, how to analyse Twitter data. So we looked at the general election in Japan in 2014. We compiled a Twitter data set of just 500,000 tweets – at that point the largest data set we had collected, and a manageable way of tackling these problems. So we collected around 500,000 tweets related to the general election using certain search terms, and then we put the tweet collection into CQPweb and started doing our analysis, collocation analysis
about – so the reason why this was a snap
election was because the parliament was dissolved,
so one of the search terms looked at the dissolution
of parliament. In our collocation analysis we were surprised to see sexual harassment as one of the top results, #sexualharassment, and it turned out that this was actually a frequent recurrence of a particular sequence of hashtags: the Prime Minister’s party, Shinzo Abe’s party; sexual harassment; and dissolution of parliament, always next to each other in this exact sequence. So we took a bit of a closer look at those tweets and found out that they were actually generated by a social bot, by a fake account. So there was an event some
time earlier where a female member of parliament
was harassed in parliament, which sparked quite a bit of discussion, and some groups, actually right-wing groups, set up a fake account, a fake Twitter account, for this politician, which pretended to be this politician’s account but was actually a social bot sending out very many copies of the same tweets. All these tweets started with these three hashtags. There are differences in the text, but we were able to identify all tweets from this bot just by looking at repetition counts
– so that’s what you see here on the x
axis – oh no, now I’ve got quantitative
data, oops! What you see on the x axis are
repetition counts and you can see some of
these tweets are actually repeated, the same
tweet is sent out more than a thousand times.
So that’s very obvious bot behaviour but
when we took a closer look at these kinds
of statistics we were able to identify several
other botnets which show up as clusters in
this visualisation – this one and this one
– and another net that actually shows up as a diagonal line. So this is an untidy cluster, characterised by a certain average number of copies: these are tweets that are sent approximately 30.5 times per user account.
The total number of accounts sending the tweets
and the total number of copies are quite different
but the ratio is surprisingly constant across
the entire line.
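As an illustration of the kind of repetition statistics described here – counting how often an identical tweet text recurs, and how many copies of it each account sends – here is a minimal sketch with invented tweet records; it is not the project’s actual pipeline.

```python
from collections import Counter, defaultdict

# invented records: (account, tweet_text)
tweets = [
    ("bot_account", "#abe #harassment #dissolution vote now"),
    ("bot_account", "#abe #harassment #dissolution vote now"),
    ("user42", "interesting debate about the snap election"),
    ("bot_account", "#abe #harassment #dissolution vote now"),
]

# repetition count per distinct tweet text (the x axis of the plot described above)
repetitions = Counter(text for _, text in tweets)

# copies-per-account ratio: total copies of a text / number of accounts sending it
accounts_per_text = defaultdict(set)
for account, text in tweets:
    accounts_per_text[text].add(account)

for text, total in repetitions.items():
    ratio = total / len(accounts_per_text[text])
    print(f"{total:4d} copies, {ratio:5.1f} per account: {text[:40]}")
```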
So that’s a bit of additional technique that we brought in, as many modern corpus linguistics studies do, but then of course we turned back to the corpus data to take a closer look and find out what the purposes of these botnets were and whether they were really distinct botnets. So interestingly
it turned out that this botnet was overtly
pro-Abe propaganda, just advertising for the
Prime Minister. The other two botnets are
covert pro-Abe propaganda. So they were disguised
as strongly nationalist groups using certain
terms that only Japanese nationalists use,
but when you look at what they actually write,
it was always defending, a defence of Abe
against his detractors. So these were very
interesting findings that came out, as often
in corpus linguistics, by serendipity, by
noticing a certain pattern in the collocation
analysis. What I wanted to emphasise here
is that it’s always important to go back
again to the corpus data, to read the concordances,
because without that we wouldn’t have understood
what these botnets actually represent, which
is something that of course all of you know, but other fields that have become very successful in recent years are less interested in, or less focused on, staying in touch with the data, on looking at concordances.
So one field that I have in mind is digital humanities. What distinguishes digital humanities from corpus linguistics at the current time is that corpus linguistics is still very much based on these old methods and a close connection between the quantitative analysis and the concordances, the actual corpus data, the data studied. Whereas digital humanities has transformed humanities research by making use of all the possibilities, the modern possibilities, of computational analysis of information. If you go to a typical digital humanities conference, they apply a much wider range of methods and visualisations than we usually do in corpus linguistics. Digital humanities has its origins back in the 1950s, but only with the availability of so many new analysis techniques, with very beautiful visualisations, has there been explosive growth in recent years, and at least if you look at the funding in Germany, digital humanities is doing much, much better than corpus linguistics.
So I’m going to focus here on the work in digital humanities that is interested in analysing data, in analysing electronic data. Of course there’s also a lot of work on building digital editions, digital archives, and creating suitable software platforms for this purpose, which I’m not so interested in here. I attended
two recent digital humanities conferences
in Germany and I just collected a few examples
to illustrate what digital humanities is doing
in case you haven’t worked in the field
or been to one of those conferences. So what
you always get at such a conference is my
pet peeve, the ‘word cloud’ or ‘tag
cloud’. That’s the most meaningless visualisation
that you can possibly make. People have suggested improvements – that’s a ‘tag pie’;
you can see it’s a mixture of a pie chart
and a tag cloud which at least allows you
to compare things and doesn’t assign meaningless
colours to the words. But still, I don’t
think that this is a really very useful way
of presenting data.
One of the drawbacks of tag clouds is also that they tend to focus the view on a small number of very salient terms that are printed in a large font. So if you look at this, what you actually get here are the names of the plays, because they are shown in the largest font, but then you just get a few names and a few key words, and then there are lots of other words that you can’t really read. Digital humanities
also likes to work with topic models, which take this separation from the original text one step further. Tag clouds are basically based on a form of key word analysis, but when we apply a topic model we identify semantic topics by their co-occurrence patterns in text. It’s slightly related to collocation analysis, but the result of a topic model is just a list of topics, which allows you to do visualisations such as these, that tell you that some texts are similar to others according to certain topics. These topics are fairly difficult to understand, because the best you can get from a topic model as a description of a topic is a list of the words that have high weight, high probability, in that topic. I’ve talked to various people
in digital humanities who try topic models
and they said ‘well, it looks nice but some
of the topics were meaningful and some other
topics were impossible to interpret at all’,
you just get these words and you don’t see
what the connection is between all those words
really. So as you can see, this is already quite far removed from the object of study, from the text data.
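For those who haven’t used one, here is a minimal sketch of what a topic model actually gives you – just ranked word lists per topic, exactly the output criticised above. It assumes the gensim library and uses a few invented toy documents.

```python
from gensim import corpora, models

# invented toy documents, already tokenised
docs = [
    ["quantum", "theory", "amplitude", "physics"],
    ["election", "parliament", "vote", "party"],
    ["relativity", "quantum", "physics", "theory"],
    ["party", "election", "campaign", "vote"],
]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]

# fit a small LDA model; each "topic" is only a probability distribution over words
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=20, random_state=0)

for topic_id in range(2):
    # the best description you get: the highest-probability words of the topic
    print(topic_id, lda.show_topic(topic_id, topn=4))
```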
The same holds for one of their favourite techniques, network analysis and graph visualisation. So this is a visualisation of controversies on Twitter. Well, you can see that there seem to be a red and a blue group, and they’re rather controversial. Then people have applied the same idea to visualising constellations of characters. So these are actually plays and novels, from this year’s digital humanities conference. For some of the novels, if you could read the names, you might actually be able to figure out who the characters are and how they’re connected to each other, but I guess for Faust you just give up, because all you can say is that there seem to be three groups of characters that have very little to do with each other, which is probably something that you might have known in advance. You
can take this further. That’s a visualisation
of the Bible. That’s the complete history
of German drama across 200 years. So that’s
what people often refer to as ‘distant reading’
in German. That’s sort of the extreme form
of distant reading. I have no problem with
distant reading. The problem is in this case,
I think you completely lose the connection
to the text. It’s just a visualisation and
of course what you do in this visualisation,
if you can make out anything from this, it’s
by picking out graphical patterns. So basically
it’s an aesthetic appreciation of the visualisation
that drives the interpretation of such analyses.
And then of course there are the completely
ridiculous things. So this was a, that’s
a blog post that I’ve always been very angry
about, because what they did was to find certain regular shapes in the development of sentiment across a novel by applying Fourier analysis, which forces the shape to look like this. So basically the analysis forces these shapes onto data that look much more complicated.
I think that’s just a very weird presentation
of the hierarchy of concepts. So you get these
occasionally. Of course, I deliberately picked examples that didn’t seem to make a lot of sense, but my impression, when walking across one of those poster sessions, is that you really have very sophisticated analysis techniques, a wide range of techniques, and very good-looking visualisations, but it’s totally removed from the data, and I don’t think that many of the people there understand what the connection between the visualisation, the analysis result and the original text is. I know that I don’t.
OK, digital humanities also has nicer looking
software. So if you compare this to CQPweb
I think it looks about 20 years newer and
it also shows that there is a wider range of visualisations, a wider range of techniques, that can be applied in these standard software tools.
OK, so that’s one field we may find ourselves
competing with. Here’s a competitor that
earns even more money, which is the field
of artificial intelligence research. So basically
artificial intelligence was a big thing in
the 1970s, and then people found out that you cannot really build an artificially intelligent agent that handles anything more than a very, very small toy domain, so they more or less gave up on artificial intelligence until computers became so powerful that so-called artificial neural networks, a very simple mathematical model of how the human brain works, or how the neurons in the human brain are supposed to work, achieved astonishing results in machine learning. So artificial neural networks are actually a very general machine learning algorithm;
you can design different network structures
that can learn different things and have different
strengths and weaknesses. This field, research
in neural networks, also had its origins back
in the 1950s, but only with recent improvements in processing power did it become possible to actually apply this to real-world problems,
to the point that if you’ve got a fast graphics
card in your computer – so gamers have an
advantage nowadays – you achieve substantial
improvements in many natural language processing
tasks, in language modelling, so basically
predicting the next word in a sentence, various
kinds of text categorisation, solving analogy
questions. Machine translation has been substantially
improved by neural network methods. Visual
object recognition, you just have to search
on YouTube a little to see amazing films where
a neural network recognises objects in movies
in real time and it gets surprisingly many
of these objects right. So even digital humanities
profits from deep learning because optical
character recognition has been improved substantially
and the latest deep learning models are already fairly good at recognising handwriting – and there are various other claims that have been made.
Artificial intelligence is largely driven by industry interest nowadays, so although computer science and computational linguistics do most of their research now with a focus on deep learning, Google, Facebook, Microsoft, Apple and all the other computer companies invest so much more money that they are making much faster progress, and they also have much more computing power than a humble university. Because this field is so industry-driven, there are also very bold claims about what deep learning
can do. So just recently I read in the newspaper
that Google now has an AI that can make phone
calls for users. There’s this famous Obama
lip-synch site where you can actually have
Obama say anything you like and a neural network
automatically modifies video recordings of
him to lip-synch to what you say. Some people
claim that neural networks achieve superhuman
performance. I searched around a bit so there’s,
for instance, one task in relational reasoning
where the neural network actually seems to
perform better than human test subjects. They
even claim that you can do zero-shot learning. Zero-shot learning means that you don’t have any training data. So normally, machine learning works by collecting a lot of training data and then using pattern recognition algorithms to find patterns in the data and make predictions, but deep learning researchers claim they’re able to do this without any training data in certain situations, where they can transfer knowledge they’ve learned on different tasks to the new task at hand.
And there are even efforts at Facebook research to sort of raise, to bring up, a general artificially intelligent agent from scratch. So the idea is to teach the computer like you would teach a child, starting from simple sentences, moving to more complex sentences, to question answering, and so on.
So this sounds really impressive. The problem
for us is that there are also claims that
with neural networks, with deep learning,
we don’t need human understanding anymore
because there’s what they call end-to-end learning, where you basically just take a text as input and a desired output, and you don’t model your domain, you don’t need any linguistic understanding, you just take a very, very deep network and train it for long enough on sufficient amounts of training data. So because deep
learning has become so important I thought
I’d give you at least a brief look at what
a neural network actually looks like. That’s
the simplest possible neural network, so it
takes some inputs, multiplies each input by
a certain weight which represents the importance
of the input and can be positive or negative
so it’s like excitatory and inhibitory neurons. These are summed up in a neuron, and then a transformation function is applied. That’s actually fairly crucial, because it turns this from a very simple linear system into something that can model more complex situations. From this we produce the output, which can either be a yes/no prediction, which would be used in classification, or a numerical value if you have numerical data to predict. So this is actually nothing fancy; that’s the simplest form of machine learning, a so-called linear classifier. There are various other types of linear classifiers; they’re sort of the standard bread and butter of modern machine learning, even before deep learning. So a linear classifier can basically learn only very simple shapes, very simple structures.
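To make that concrete, here is a minimal sketch, using numpy, of the single-neuron network just described: weighted inputs, a sum, and a non-linear transformation producing a yes/no output. The weights and inputs are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    """Non-linear transformation squashing the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def single_neuron(x, weights, bias):
    """The simplest network: multiply inputs by weights, sum, transform."""
    return sigmoid(np.dot(weights, x) + bias)

# invented example: two input features, one positive (exciting) and one negative (inhibiting) weight
x = np.array([0.8, 0.3])
weights = np.array([1.5, -2.0])
prediction = single_neuron(x, weights, bias=0.1)
print("yes" if prediction > 0.5 else "no", prediction)
```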
Now the key trick is that if you add multiple so-called hidden layers between the input and the output layer of the network, then the network becomes much more complex. So if you just have the single layer that we’ve seen before, then you basically get just a linear combination of the inputs. But here we have lots of different linear combinations, each followed by a non-linear transformation. These can pick out different aspects of the input data, which the next hidden layer then uses to pick out higher-level aspects, until you finally produce the outputs. So this type of network is able to learn much more complex distributions, much more complex shapes. What modern deep learning has done is take the network from this to something like this. So you really add dozens of layers. These layers can be very large, with hundreds of thousands of neurons. They can be complicated, they can be recurrent, so-called LSTMs. So if you ever hear the
term ‘LSTM’, that’s the deep learning
network that you would use in language processing,
that anybody who deals with language will
come up against. So you can build enormously
complex networks and because of this complexity,
if you have enough training data 
and if you have enough processing power, then
you can learn very good classifiers and prediction
models.
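And here, in the same spirit, is a minimal sketch of a forward pass with one hidden layer added, showing how each layer combines the previous layer’s outputs and applies a non-linearity; real deep networks just stack many more, much wider layers and learn the weights from data rather than using random ones as here.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """A common non-linear transformation applied between layers."""
    return np.maximum(0.0, z)

def forward(x, layers):
    """Pass the input through a stack of (weights, bias) layers."""
    for weights, bias in layers:
        x = relu(weights @ x + bias)
    return x

# invented toy network: 3 inputs -> hidden layer of 5 neurons -> 2 outputs
layers = [
    (rng.normal(size=(5, 3)), np.zeros(5)),   # hidden layer
    (rng.normal(size=(2, 5)), np.zeros(2)),   # output layer
]
print(forward(np.array([0.2, -0.7, 1.0]), layers))
```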
So that’s the situation at the moment. I’ve
drawn an illustration just because I love
diagrams. This is a corpus linguist in case
you couldn’t tell. So what a corpus linguist does is take the corpus data, perform collocation or frequency analysis, and then check the results by interacting with concordances from the corpus.
So this is quite a contrast to how digital
humanities or some of the research in digital
humanities works because we never just look
at the quantitative results; we will always
check this interactively by not only looking
at concordances but working with the concordances
and I think that’s one of the big strengths
that corpus linguistics has to offer.
As a result we can produce insights and occasionally
we will even be able to use them for applications
that bring in money. So the application here
means anything that’s either good for society
or brings in money. So in this illustration,
digital humanities is more like this. So they
use a much wider range of techniques – network
analysis, clustering, topic modelling, lots
of visualisations and sometimes also supervised
learning. So they will annotate some data,
train a supervised learning model, perhaps
even a deep learning model, and then use it
to automatically annotate large amounts of
data. From this they jump directly to the insights, without ever, or only rarely, going back to the data. Sometimes they combine this with some close reading, but then it’s just selected samples where they read an entire text. There’s not this systematic interaction with the text that John Sinclair always focused on: trust the text, read the concordances, understand what’s really happening in the text. That’s what we do in corpus linguistics.
And finally, deep learning just circumvents all the human insight and understanding and goes directly from the text data to the successful application – that’s an illustration of end-to-end learning – and makes money from that.
Now sometimes when I have nightmares I begin
to wonder whether at some point people will
say ‘do we still need corpus linguistics?’.
So, digital humanities has all these fancy
techniques and makes use of the possibilities
of modern statistics, data analysis and visualisation,
and deep learning doesn’t need human insight
anyway because end to end learning achieves
super-human performance, so what is our future
as corpus linguists? Of course for all of
us it’s clear how important corpus linguistics
is, but the question is whether the funding bodies agree – in the UK it seems to still work quite well; in Germany it’s getting more difficult, and it’s definitely easier to get funding with all these nice methods, digital humanities methods.
So I ask myself why, outside our traditional research focus, corpus linguistics is so important, and I think it’s corpus linguistics and the understanding it brings that are important both for digital humanities and for applications that would now often be approached with deep learning techniques. One of the key drawbacks of digital humanities, as I’ve already said, is that it’s too far removed from the object of study, especially the work that uses these very sophisticated analysis techniques.
It’s usually based on an aesthetic appreciation of the visualisations, and there is no clear methodology for going from these visualisations to the insights, to the interpretation. We
have much clearer methodologies in corpus
linguistics than digital humanities has for
these new techniques, and I believe that’s
where corpus linguists could bring a lot to
the field of digital humanities because we
know how to combine distant and close reading,
and we know how to draw conclusions from quantitative
data, draw meaningful conclusions and if we
don’t know, we at least are aware of the
problems and we can work on these. So that’s
also a direction for corpus linguistics to
develop better ways, or a consistent methodology, for working with quantitative analyses and linking them back to the text.
I don’t know whether you’ve heard about
the Time Machine proposal? That’s a digital
humanities proposal, an FET flagship project
– you know, these half-billion-Euro projects that the European Union offers nowadays? So they do exactly, that’s exactly what I’ve
just shown in the diagram. So this is strongly
based on both digital humanities and large-scale
machine learning, especially deep learning
techniques. The idea is to build a platform
to record our cultural history, European cultural
history, to have massive digitisation efforts
to fill in the data – that’s this part
of what they call the ‘information mushroom’
– but as you go back in time even these
efforts won’t be enough to get complete
or relatively complete coverage of what has
been going on. So they want to use deep learning
and simulation techniques to expand, to extrapolate
beyond the available data and simulate the
past. So I talked to a number of people from
the consortium and they all seemed to agree
that this sounds very good, but nobody actually knows what to do with these simulations and with all the data, and we are lacking a good, consistent methodology for interpreting such quantitative data, especially data that are generated with sophisticated machine learning techniques. So what exactly do these simulations mean? What conclusions can we draw? How can we get from this enormous amount of information, and all the visualisation and pattern analysis, to a meaningful interpretation?
So I think that’s one aspect where corpus
linguistics could contribute a lot.
So I believe that in some cases, machine learning and artificial intelligence would also benefit from human insights, especially those of corpus linguists, because at the end of the day even the fanciest deep learning, the fanciest deep neural network, is still just a machine learning classifier. So it can do two things: either supervised classification, if you have labelled training data in a categorisation task, or it can predict some observable quantity, which allows you to generate arbitrary amounts of training data. So if you want to predict customer behaviour,
if Amazon wants to predict customer behaviour,
they have more or less unlimited amounts of
training data from all their customers, from
the reviews they left on the page, and if
you just throw that onto an end to end learning
network then it can often make surprisingly
good predictions about the behaviour of these
customers. But if you want to look at something
where there is no clear-cut classification
scheme where you don’t know the precise
categories you want to find at the start and
where there’s also not enough training data
for end-to-end learning, then pure machine learning isn’t a solution. I believe that in this case, human understanding is crucial.
So again, I have two examples to try to illustrate
where I believe that we can really make a
contribution to something which traditionally
would be a pure machine learning task. One
is so-called ‘financial narrative processing’.
So the goal of financial narrative processing
is to analyse financial reports written by public companies, by publicly listed companies, and use these to predict the future performance of the company, and especially to identify red flags that might point to a looming bankruptcy or other event for the company. Fortunately, there are not so many bankruptcies that we would have enough training data to just throw this into an end-to-end deep learning network. We have to figure out what the red flags could be from a fairly small number of examples. So there’s quite a lot of interest
in this. The thing is, it’s not just a corpus
linguistic study because if you want to make
money from this, if this is to be actually
useful, it has to be applied to all the available
financial reports from thousands of companies.
So this process has to be automated at some
point. You cannot rely on just doing a corpus study to identify, to recognise, the red flags.
This year we had the first financial narrative
processing workshop at the LREC conference.
In some of our own work we tried to predict
stock prices, short term effects on stock
prices, from so-called ad-hoc disclosures
which companies have to make under certain
circumstances. Whenever something happens
that might affect stock prices they have to
disclose this immediately and by analysing
the text of the disclosures we tried to predict
the development of stock prices, which doesn’t work all that well: we got 48% accuracy in predicting either rise, fall or no effect, but against just a 33% baseline, and with a simple trading strategy we would still have made money with this information. So you can actually learn this, and that’s really end-to-end learning: you just take the text, a bag-of-words machine learning model, and directly try to predict the direction of stock prices.
So that works reasonably well but it seems
to work for the wrong reasons.
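Here is a minimal sketch of what such an end-to-end, bag-of-words set-up looks like, using scikit-learn and a few invented disclosure snippets; our actual experiments of course used real ad-hoc disclosures and labels derived from subsequent stock price movements.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# invented training snippets and price reactions (rise / fall / none)
texts = [
    "earnings exceed expectations, dividend raised",
    "profit warning, revenue falls below forecast",
    "annual general meeting scheduled for May",
    "results improve, outlook raised",
    "writedown announced, guidance lowered",
    "supervisory board confirms auditor",
]
labels = ["rise", "fall", "none", "rise", "fall", "none"]

# bag-of-words features fed straight into a linear classifier: no linguistic modelling at all
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

print(model.predict(["revenue exceeds expectations"]))
```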
Here are some of the positive and negative
key words that the machine learning classifier
identified from the text. A lot of this is just reading the surface: if a company exceeds expectations, the stock will go up, it will improve; if there’s an improvement, if things rise, then we expect the company to do better; if things fall, are weak or lower, then it doesn’t do well. Here you can see Turbon – that’s actually the Turbon Group, probably a German public company. So AG is German for plc, so it has learned that public companies seem to do well in general, so that’s a positive key word. That’s clearly not good reasoning. That’s clearly not valid reasoning; it’s something that happens to be the case in these data, but these kinds of clues are what we expect. Also, if people have reasons for something, that’s a bad sign, and apparently cancer is a good thing, but that’s just because you can make so much money with cancer drugs that any mention of cancer research is picked up as a positive clue. So I believe that even though it makes above-chance predictions, it really makes these predictions for the wrong reasons.
For what we really want to do, identifying red flags, you have to learn to read between the lines, not take everything at face value, because red flags are precisely cases where a company tries to hide problematic developments or problematic information. So, vague reporting:
one thing we found is that if a company focuses
much more on the good performance of its industry
sector rather than on its own performance,
that’s an indication that they’re having
problems. If you suddenly have contradictions
to prior reports or contradictory sentiment
in different parts of the report, that could
also be an indication of a problem. So – narcissistically again, my own research group’s project – this is something we have started to work on, and we are actually hoping to get some industry funding to do this at a larger scale and analyse financial reports. At the moment we are at the digital humanities level: we computed readability and subjectivity scores with standard tools, with standard measures, and plotted them to see that Item 1 and 1A – that’s about risk factors that might affect the company – and Item 9A – the conclusion, the final conclusion of the chairman – differ: as you can see, the final conclusion is much more readable than the risk assessment.
On the other hand, risk assessment is much
more subjective than the final conclusion,
which tells us exactly nothing about red flags
and the future performance of companies. So
our next step will be not to do this – which we also did for a paper – but to go beyond that and actually do a valid corpus linguistic analysis, and then try to operationalise all these red flags so they can be found automatically.
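As an illustration of the ‘digital humanities level’ readability and subjectivity scoring mentioned a moment ago, here is a minimal sketch assuming the textstat and TextBlob libraries; the report sections are invented placeholders, not our data, and the tools are standard off-the-shelf measures rather than anything we developed.

```python
import textstat
from textblob import TextBlob

# invented placeholder sections of an annual report
sections = {
    "Item 1A (risk factors)": "The company may be exposed to risks that could materially affect results.",
    "Item 9A (conclusion)":   "We are confident that our controls are effective and the outlook is strong.",
}

for name, text in sections.items():
    readability = textstat.flesch_reading_ease(text)      # higher = easier to read
    subjectivity = TextBlob(text).sentiment.subjectivity  # 0 = objective, 1 = subjective
    print(f"{name}: readability {readability:.1f}, subjectivity {subjectivity:.2f}")
```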
Another piece of research we’re working on is argumentation mining. Traditional approaches to argumentation mining try to classify the steps of an argument into the claim, the qualifiers that limit the claim, the premises that can be used as reasons, the additional backing provided, and a refutation, an attack on the claim, which consists of the initial refutation and the rebuttal, which would be the argument against the claim.
Another traditional approach to this is to
use knowledge mining techniques, so look at
patterns such as we need A because B, which
would take you from premise B to a claim A.
We want to apply this to social media data,
again Twitter because that’s so easily accessible,
but it’s difficult to apply the standard,
traditional approach of argumentation mining
to Twitter for one thing because people often
don’t argue in a very coherent way, in a
logically coherent way. Tweets are just too
short to make an argument that consists of
multiple steps. There are a few discussions – reply threads – but on the whole there’s a lot more data where people don’t lead a discussion on Twitter but just make statements that could contribute to an argument in more or less non-standard ways.
And so we’ve just started a research project
called ‘Reconstructing Arguments from Noisy
Texts’, which gives us the wonderful acronym,
‘RANT’, where we want to carry out a corpus
linguistic analysis of how people argue on
Twitter, with a special interest in implicit arguments, where the logical connections aren’t made explicit, and in argumentation that’s based more on opinions than on logical reasoning. We believe a lot
of the argumentation on Twitter actually is
that people mistake an opinion for an argument
but they’re still trying to prove a claim
or make their point with this, so that’s
what we want to model in the Twitter data.
We believe that a good approach is to use corpus queries, extended forms of corpus queries, very similar to pattern grammar and corpus pattern analysis, to interactively develop knowledge patterns, mining patterns, that can then be applied automatically. So this pattern would find attributions of statements. In many cases that we’ve seen, this actually means the support for an argument lies in the person who said that something is the case. Or you could look for patterns like ‘can’t be bothered about that’ or ‘is bothered about that’; this clearly is also part of the argument. In this case it’s a corpus of tweets about Brexit, so the arguments for or against Brexit.
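As a very rough illustration of this kind of mining pattern – not the RANT project’s actual query language – here is a sketch of a ‘we need A because B’ pattern as a Python regular expression applied to invented tweets; in practice one would write such patterns as extended corpus queries over annotated data.

```python
import re

# invented tweets about a hypothetical debate
tweets = [
    "we need a second referendum because the first campaign misled voters",
    "can't be bothered about trade deals to be honest",
    "lovely weather in Birmingham today",
]

# premise/claim pattern: "we need <claim> because <premise>"
need_because = re.compile(r"we need (?P<claim>.+?) because (?P<premise>.+)", re.IGNORECASE)

for tweet in tweets:
    match = need_because.search(tweet)
    if match:
        print("claim:  ", match.group("claim"))
        print("premise:", match.group("premise"))
```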
Now before I run out of time, I’d like to
come to my view of the future of corpus linguistics.
So, I hope I managed to convince you, and myself, that corpus linguistics is very important for the future, both of digital humanities and of many applications, and not only for the future of corpus linguistics itself. But what is our next step? How can we develop the field so that it, for instance, makes use of all the fancy techniques that digital humanities has to offer?
So I believe there are three steps that we
need to take, or that I want to take. The
first thing we need to achieve is interoperability
between tools. One of the reasons that we
don’t use these more sophisticated analysis
techniques is that they’re usually separate
software packages or just libraries for some
language like R or even worse. So you would
have to somehow extract the data from the
corpus, feed them to the tool and then you
get the analysis and there’s no link back
to the concordances. So we need interoperability
to include these tools in our usual research
process, which means making them compatible with CQPweb and/or AntConc or any of the other software packages that we typically use. This goes beyond just being able to use these tools and then typing in a word that you’ve found: if you’ve found a topic with a topic model, typing all those words into CQPweb to find the concordances is not an efficient way of working. We need real, interoperable integration, where these tools can be used like any other analysis function in our corpus software.
The next step I believe is to make this – so
you can see the big advantage of the concordance
is that we’re actually interacting with
the corpus data. We don’t interact with
many of the analysis tools except in the indirect
way of changing the parameters when we realise
there’s a problem, trying different parameters
or realising that we should change our query
a little bit to filter out some incorrect
results. So I believe the next step will be
to make these additional tools, the analysis
tools of quantitative analysis, interactive
so that the corpus linguist can interact with
them like they would interact with a concordance.
But now the biggest step, so this still means
that there’s a limited form of interaction
where I can do something within these tools,
the biggest step I believe is that we need
to feed back the insights we draw from that.
So the problem is here that we just have this
one directional process. When we get to interpretation,
when we learn something from the analysis
and the concordances, it doesn’t affect
this analysis. We can change parameters but
that’s all we can do. We cannot feed back our insights into the analysis, and I think that’s the crucial step: we need to find a way to feed back what we have found out by looking at the data, what we have found to be true about the data, into the quantitative analysis, because then we will be able to produce analyses that go beyond what digital humanities and deep learning offer. If we get this far
enough and everything becomes tidily integrated
then finally we will have the hermeneutic
cyborg.
Now if I may talk for another five minutes,
I have a few first ideas how to go about these
steps and these are fairly initial ideas.
I hope that we’ll have time to talk about this over the wine reception, over dinner or in any of the coffee breaks this week. So the first step is to make tools interoperable. I guess most of you work with one of these standard tools that all corpus linguists use, depending on your preferences and your corpora. It might be CQPweb, it might be SketchEngine, it might be AntConc, WordSmith, MonoConc, LancsBox or any other of these tools. It doesn’t
really matter all that much except of course
that everybody likes their own tool best,
because all these tools have more or less
the same basic functionality – they can
do frequency analysis, they can do collocations, key words, and it’s all linked to concordances. Some of these have specific extensions depending on the research interests of the developers, so LancsBox implements collocation networks, but this is very limited and there is very slow adoption of new, innovative techniques, usually because the developer of the tool doesn’t have enough spare time, or they’re just not interested in a specific technique. So
what we need, then, is better interoperability – because there are implementations: when people suggest new algorithms they usually provide software, just not software that’s integrated with these tools. And that doesn’t mean interoperability in terms of a data format. We do have standards for corpus data, for corpus annotation, that a lot of tools can work with. That means we can read the same corpus into CQPweb or SketchEngine or AntConc without any substantial changes to the data format, but since they all have basically the same functionality, that doesn’t gain us much. It also doesn’t mean interoperability at the level of corpus query languages; there’s also some standardisation work towards a corpus query lingua franca, but that just means that it’s easier to switch from one tool to another because you understand the query language.
What this means is that we need interoperability
between the analysis and visualisation software
and the standard corpus tools, ideally in the form of plug-ins, so that they’re linked directly to the concordances. The problem is that if people were to develop plug-ins for tools, they would sort of buy into certain ecosystems. To develop a CQPweb plug-in would be an entirely different task than to develop a plug-in for AntConc, and most method developers don’t have the time not only to invent a new method and produce a reference implementation, but also to write all those plug-ins.
So the poor man’s solution is to achieve a more manual, a more indirect interoperability, where we just find a way for the tools to exchange
the essential information which is quantitative
data from the corpus, which must be sent to
the analysis tool, to the visualisation tool,
and the visualisation must be able to link
back to concordance lines because otherwise
we would again have the nice visualisation
that’s completely remote from the text data.
So the developer of Coquery actually came up with the idea of using a tabular data format – that’s standard in statistics. He used it in the Coquery tool to give users the opportunity to carry out more flexible analyses within the software itself, but I believe that tabular formats could be an easy, minimalistic approach to enabling interoperability. This puts quite a bit of a burden on the user of the tools, because in this minimalistic idea, what the corpus tool – CQPweb or AntConc – has to provide is a very flexible way of producing tabular output, but the user will have to understand what an analysis tool needs and how to generate this output. Then, in addition, we need an API that allows the analysis tool to link back to the concordances. So here’s one example. If you wanted to do a multivariate analysis, you would export a table that contains frequency counts of different features for each text, where the text ID provides the link back to the concordances. So by providing that to the corpus tool we can look at the respective texts for a certain analysis. And even more flexibly, you can do the same with token-level data, so that for instance each query result generates certain output – and both of these are actually already possible in CQPweb, for instance. The thing that’s
missing is a way of using these links. So
these are syntactic relations which we could
use to produce a word sketch, something that
CQPweb doesn’t support out of the box. If
we had a separate tool it could use these
token positions to actually display all the
results for a given collocation in the word
sketch.
This is something that we absolutely have
no time for.
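To illustrate the kind of minimal tabular exchange format meant here, the following is a sketch, with invented feature counts, of a per-text frequency table whose text IDs serve as the link back to the concordances; this is an assumed illustration of the idea, not an existing CQPweb export.

```python
import pandas as pd

# invented per-text frequency counts for a handful of features;
# text_id is the key that lets the analysis tool link back to concordances in the corpus tool
table = pd.DataFrame(
    [
        {"text_id": "guardian_2010_001", "modal_verbs": 14, "passives": 9, "first_person": 3},
        {"text_id": "guardian_2010_002", "modal_verbs": 5, "passives": 21, "first_person": 0},
        {"text_id": "telegraph_2011_007", "modal_verbs": 8, "passives": 12, "first_person": 7},
    ]
)

# the corpus tool would write this out; the external analysis tool reads it back in
table.to_csv("feature_counts.csv", index=False)
print(pd.read_csv("feature_counts.csv"))
```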
I think a nice point to end might be a very
small demo about achieving interactivity.
So we’re currently working on a prototype
to support corpus-based discourse analysis, structuring collocation and key word analysis with an interactive procedure. So what you do in this tool is you select one of the available corpora, type something to search for, and then it creates – basically it carries out a collocation analysis. It takes a moment to create the collocation database, and if it takes several moments then – there we go. So this is different from a traditional collocation analysis in a very obvious way, because it’s not just a list of collocates ordered by their association strength, but rather a semantic map. So we use the simplest form
of deep learning techniques, so-called word
embeddings, to compute semantic similarities
of the collocates to give a pre-grouping,
a pre-structure. That’s also something that
people in digital humanities would often like
to do. So here are the collocates of ‘theory’; you get physics-related ones up here, for instance amplitude, time-dependent, condensate, quantum, grand theory, and you can now interact with this, by making the font a little larger so it’s readable. You can also change the importance weighting of the collocates, whether we want to see many or just a few – so that’s becoming more like a tag cloud now – and you can try different association measures to see how that affects the collocates. So that’s interactivity at the level of analysis parameters. But the interactivity I actually mean is this: the usual process of making sense of these collocations is to group related collocates into categories, and that’s supported in this tool. Do we see related categories? So quantum, relativity, Lorentz belong together. This gives us our category ‘relativity’, which we can also label here as a group. We can also show the collocates – the collocates are automatically hidden because they become part of the category, and the category is visualised according to the average semantics of the collocates that have been assigned to it.
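The pre-grouping itself rests on nothing more exotic than cosine similarity between word embeddings. Here is a minimal sketch with tiny invented vectors (a real tool would load pre-trained embeddings, for example via gensim) that groups collocates by their mutual similarity; the greedy grouping rule is an assumption for illustration, not the prototype’s actual algorithm.

```python
import numpy as np

# invented 3-dimensional embeddings for a few collocates of "theory"
vectors = {
    "quantum":    np.array([0.9, 0.1, 0.0]),
    "relativity": np.array([0.8, 0.2, 0.1]),
    "Lorentz":    np.array([0.85, 0.15, 0.05]),
    "music":      np.array([0.0, 0.9, 0.3]),
    "composer":   np.array([0.1, 0.8, 0.4]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# greedy grouping: put a word into an existing group if it is similar enough to its first member
groups, threshold = [], 0.95
for word, vec in vectors.items():
    for group in groups:
        if cosine(vec, vectors[group[0]]) >= threshold:
            group.append(word)
            break
    else:
        groups.append([word])

print(groups)   # e.g. a physics-related group and a music-related group
```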
Something I find very important is that again
this tool directly links to concordances.
You can always select a collocate and see
the concordances. This is something that should also be part of an interactive tool, so that you always check your analysis against the actual text, the original data.
So the only thing that remains to say is why
we haven’t got further with the third step.
So the third step, the actual integration,
the feedback of results, we have some ideas.
So actually the tool I just showed you is
part of an algorithm we started developing
last summer, just after I got the invitation
for the Sinclair Lecture, so naturally I was
hoping that I would be able to demo the MMD
algorithm and show you how the feedback improves
the analysis, how it can feed back insights
into the analysis. But then this happened
and I was ill for a month and so I can just
thank you for your…
