So I thought I’d just, since we cannot have
him here, Stefan, I just thought I’d show
you a picture. So this is John Sinclair and
that’s a picture we took at one of the Tuscan
workshops. Of course this was one that we
actually did in Saarbrücken. And that is what
we all looked like at the end of the course
and I have a feeling this year will actually
be much worse on Friday!
So, corpus linguistics. I think I first need
to explain a little bit what this is about.
I thought I’d give you the spelling of it
– corpus – and it’s nothing to do with
dead bodies at all. Corpus, just to start
with a definition, is a usually large collection
of samples of naturally occurring language
that is stored in computer readable formats.
When I say this to people, what I get is very
[safe - 0:00:59]. So people will say ‘ah,
you’re a computational linguist’ and I’ll
say ‘no, I’m not really’, or ‘that
means you do distant reading’. I also don’t
do distant reading, ‘so this is stylometry’
and this isn’t stylometry and ‘you are
the digital humanities person’ and I’m
thinking OK, that is probably one I would
give them. So that is kind of the area that
this is in.
So, what do we do in corpus linguistics? So
we like counting, comparing and visualising
a lot and the guy whose language I like looking
at is Charles Dickens, so you’ve got him
here. So one of the things you can do is make
word clouds of Oliver Twist if you
want to. So you can see which words occur
quite frequently, like ‘Oliver’ and ‘Mr’
and ‘replied’ and things like this and
we’ve got these things around commercially
so people play around with word clouds for
everything. You have seen when you come in,
in the English department, there’s a word
cloud. When you go over to Elm’s Road there’s
a word cloud. So it’s very fashionable.
However, you can also do other things. You
can just really count the words and give people
the frequencies. So this is a word list where
you have the most frequent word at the top.
So you start with THE, AND, TO, OF and that
is also Oliver Twist, only Oliver doesn’t
appear that early on. You can also do something
like this and that is a key word list of Oliver
Twist where you take the frequencies of the
words in Oliver Twist and compare them to
a large reference corpus to see which words
are statistically more frequently occurring
in Oliver Twist compared to a corpus of 19th
Century newspaper articles that were published
at the same time and then you see that these
are the words that are striking. Another comparison
you could do is you could take all of Dickens’
novels and then see which words are key in
them individually and then do key key words
to see which words are key shared across Dickens’
novels and then you get this kind of display.
So the point I want to make is: give me something,
and I can show you whatever you like, OK?
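To make the word list and keyword comparison concrete, here is a minimal Python sketch. The mini-texts are invented stand-ins, not Oliver Twist or the newspaper reference corpus, and the log-likelihood score shown is one common keyness statistic, not necessarily the exact one behind the slide:

```python
# Toy word list and keyness comparison; texts are invented stand-ins.
import math
import re
from collections import Counter

def word_list(text):
    """Most frequent words first, like a corpus tool's word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Dunning-style log-likelihood keyness score for one word."""
    expected_t = size_target * (freq_target + freq_ref) / (size_target + size_ref)
    expected_r = size_ref * (freq_target + freq_ref) / (size_target + size_ref)
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / expected_t)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_r)
    return 2 * ll

novel = "oliver asked for more and oliver was punished"
reference = "the market prices rose and the mills kept working"
novel_counts = Counter(re.findall(r"[a-z']+", novel.lower()))
ref_counts = Counter(re.findall(r"[a-z']+", reference.lower()))
n, r = sum(novel_counts.values()), sum(ref_counts.values())

# Rank the novel's words by how "key" they are against the reference.
keywords = sorted(
    novel_counts,
    key=lambda w: log_likelihood(novel_counts[w], n, ref_counts[w], r),
    reverse=True,
)
```

With real corpora you would feed in the full token counts for the novel and the 19th Century newspaper reference corpus; the word that rises to the top here is, predictably, ‘oliver’.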
And then something to think about: don’t trust people
who show you fancy stuff. Understand what
is going on and I want to show you a couple
of these things to go towards this understanding
and in the summer school we’ll work until
Friday to make sure that everyone understands
these things.
So this here is what is called a concordance
and that is a concordance for the word ‘eye’.
A concordance is one of those things that
John Sinclair spoke a lot about because his
point was you only know that much about language.
You understand a lot more what is going on
when you see a lot of language at the same
time, because when you see a lot, you can
actually identify patterns. So if you look
at this you can see that ‘eye’ occurs
together with London Eye and that is then
the ferris wheel. Or ‘eye’ occurs a lot
with eye cream and eye creams and eye shadow
and lots of stuff to do with eye make-up.
You can already tell where this text is coming
from. Then you have things like a man’s
eye gouged out in something where people took
the Sharia law quite literally and did an
eye for an eye and that resulted in this situation.
You have things like eye implant, bionic eye
and there’s a reason why they are there
together, and then you have eye surgeons and
eye tests and I’ll keep that for later,
eye tracking we’ll also talk about. I was
just glad that eye tracking made it into
the Guardian.
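A concordance of this kind, every hit of a node word with a window of context, can be sketched in a few lines. The sample sentence is invented, and real tools align the node word in a fixed-width column:

```python
# A minimal KWIC (key word in context) concordance sketch.
import re

def concordance(text, node, window=4):
    """Return (left context, node, right context) for each hit."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

sample = ("She rode the London Eye at dusk. The eye cream was new. "
          "Investors eye pub expansion across the city.")
for left, node, right in concordance(sample, "eye"):
    print(f"{left:>35} | {node} | {right}")
```

Even in this toy example the three hits show the Ferris wheel, the make-up sense, and the verb use side by side, which is exactly the pattern-spotting point made above.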
And then you can tell that ‘eye’ isn’t
always a noun, but ‘eye’ can also be a
verb. So you can eye corporate sponsorship
or you can eye pub expansion or things like
this. So the point I want to make here is
that when you take a lot of language, you
realise what the patterns are and when you’ve
got the patterns, you get an understanding
of what words mean. So your understanding
of the meaning of ‘eye’ is a lot more
complex than just saying ‘here, this is
an eye’ and preferably you talk about two
eyes. So eye as in the singular is already
quite different from eyes in plural. So what?
That is a question that John Sinclair liked
to ask a lot. So that was actually this publication
that Matthew mentioned where we asked John
to write an introduction and you have to be
really careful to ask John to write an introduction
because he will be very critical. So he said
something in his introduction, ‘empirical
linguists are often criticised for having
their eyes and ears so close to their data,
that they do not tackle any wider issues.
They’re always at risk of having their scholarly
reports greeted with cries of ‘so what?’
and indeed, such a judgement is a professional
hazard for all of us’. OK, so we need to
look at that professional hazard in a bit
more detail and now I understand why at some
point Marcus Walsh made me health and safety
officer at the University of Liverpool! So
professional hazard.
You’ve seen this before haven’t you? Google
n-gram. That is something that is becoming
immensely popular. So what you can do is you
can take some of those combinations of words
that we had in the concordance, so you can
take ‘eye surgery’, ‘eye test’, ‘eye
creams’ and see how the frequency of these
collocations has developed over time. So eye
surgery is much more frequent than eye test
and eye creams and you can also see how 
that developed. However, it’s very important
to then be quite critical about just looking
at frequency data like this and data scientists
and statisticians have been discussing a field,
or the beginning of a field, that is called
‘culturomics’ and there was an article
that got published in Science that said basically
‘this is the application of high throughput
data collection and analysis to the study
of human culture. So the results are a new
type of evidence in the humanities and the
challenge lies in the interpretation of the
evidence’ and then the Science article is
six pages long and then the authors claim
that in these six pages they provide us with
lots of new insights into lexicography, the
evolution of grammar, collective memory, the
early adoption of technology, the pursuit
of fame, censorship and historical epidemiology,
all in six pages with a couple of graphs and
then they said ‘this is the new way to understand
how culture works’.
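The n-gram viewer idea, the relative frequency of a phrase per year, can be sketched like this; the per-year texts are invented toy stand-ins for a diachronic corpus like Google Books:

```python
# Sketch of the n-gram-viewer idea: relative frequency of a phrase
# per year, normalised per million tokens. Toy data, not Google Books.
import re

def phrase_freq_per_million(text, phrase):
    tokens = re.findall(r"\w+", text.lower())
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == target)
    return 1_000_000 * hits / max(len(tokens), 1)

corpus_by_year = {   # invented stand-in for a diachronic corpus
    1990: "an eye test was rare and eye surgery rarer still",
    2010: "eye surgery and laser eye surgery boomed while the eye test stayed routine",
}
trend = {year: phrase_freq_per_million(text, "eye surgery")
         for year, text in corpus_by_year.items()}
```

The per-million normalisation matters: raw counts rise simply because more text gets published, which is one of the criticisms levelled at naive readings of these curves.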
Then, this goes on for a while and then in 2015
some other people – again, no linguists
in that kind of sense but this time mathematicians
– published an article in PloS ONE and said
‘our findings call into question the vast
majority of existing claims drawn from the
Google Books corpus’ and that is referring
exactly to this paper. I just thought I could
have told you so. I mean just ask a linguist.
I mean that’s stuff that we’ve been dealing
with like forever, and we can’t just solve
all those problems by looking at Google n-grams.
However, an important point about
this is that kind of – and I think it’s
a positive thing to take away from this – that
the themes that we do as corpus linguists
got some attention from other fields – there
are data scientists, statisticians,
mathematicians – and they all talk about these
things that we are so interested in. All
we need to do now is teach them how to do
it properly, and then they can use these things
and do stuff with them. So, another relation
to a field is what’s called ‘distant reading’
and that is very popular in circles around
Moretti where distance is described as a condition
of knowledge. So it allows you to focus on
units that are much smaller or much larger than
the text, and if the text itself disappears,
well, it is one of those cases when one can
justifiably say ‘less is more’. So distant
reading is about having large collections
of text and then doing visualisation, finding
graphs and then saying something about the
literature in its entirety, not just fifteen
Dickens’ novels but everything you can talk
about. This is quite in stark contrast to
what people have referred to as close reading,
because close reading focuses on a particular
text. It’s very much related to Richards’
practical criticism. This went out of fashion
a little bit, especially in Victorian studies,
where people explained this with an almost
obsessive focus on all themes big and social,
so that scholars maybe went a bit too far away
from the text and talked too much about the
historical context and the biographical information.
People sit at various points on that spectrum. And then that
is something Moretti then says. He really
doesn’t like close reading at all and he
said it is basically dealing with a very
small canon and is ‘a theological exercise,
very solemn treatment of very few texts taken
very seriously’. So you can see where he
stands on this. Close reading in the digital
age seems to be becoming much more
prominent again. So while it might have gone
a little bit out of fashion, now that we read
things on iPads, on computers, on the phone,
in all sorts of bite sized tweets, blogs or
whatever, people pay again more attention
to close reading and say close reading is
basically almost mindfulness applied to language.
Mindfulness is another very popular term and
that is then used to say you really need to
focus and concentrate on the text. So you
realise I’m just giving you these things
entirely without any bias and totally uncommented
on, just for you to make up your own mind,
and I’m contextualising this and linking
it with what you could look at as the
corpus-theoretical context.
In corpus linguistics there were huge debates
about this corpus linguistic theory or methodology
and there are some people here in the room with whom
I could have a very long discussion, probably
until next Friday, about where we disagree on this.
But I think even if you disagree on these
things, if you look at what other people who
are not corpus linguists are doing, which
are the things that I think are really corpus
linguistics, it is quite important to get
your theoretical basis straight in this regard,
to see why we do these things, because something
that is very dangerous is just doing things
simply because we can. I can do a word cloud
and I can run a word list and I’ll do
it because I can show off that I can press
the right button. But that is not the point.
You need to have a reason for doing these
things. So I think these three points are
quite important in the sense that the way
we look at language is very much that language
is a social phenomenon and that means we use
language to do things. There’s very little
happening without language and the fact that
we do things repeatedly with language leaves
traces or creates evidence. This evidence
is collected in corpora and we can see what
the patterns are. If we produce text, people
can study the text to see what’s going on.
If you then study them you will find collocations,
so things like ‘eye cream’ or ‘eye surgeon’;
you have collocations or textual patterns
or relationships between lexis and textual
position. And in corpus linguistics you very
often give a certain priority to lexis, so
we focus on lexical grammar, or we look at
discourse as representation, or we look at
the terms that people use to talk about the
world.
OK, the point I want to make here is that
it’s very important to take a very critical
approach to the available data and the methods.
So also think about the relationship between
different texts. Are texts maybe related in
a way that you wouldn’t easily see if you
chucked them all together in a corpus and
just counted words? Is there some directionality?
Is there something that is happening within
text? That brings us to reading concordances.
John Sinclair wrote a whole book on how to
read concordances. So for him, reading concordances
was a very fundamental thing and it was very
important in lexicography where you then had
to work out what are the meanings of words
so that you can put them in a dictionary.
A big question, however, if you don’t
compile a dictionary every day, is which words
in which corpus, and with what purpose, you
run concordances on. I was
looking at ‘eye’ because I’ve been looking
at body language in Dickens a lot so then
I’ve done a lot of work on various body
parts to just see what happens with them textually.
But you need to have a purpose to run a concordance.
To give you a little bit of the purpose
behind a lot of the work that I’m doing:
I’m very much interested in fictional
people in Dickens and that is theoretically
situated in a context that looks at characterisation
as a process that kind of happens in the mind
of the reader. And this process links top-down
knowledge, so your knowledge about the
world that you bring along when you read
the novel, where you think about people in
the novel as if they’re people like me. They
kind of behave like normal people. I assume
that everyone in a novel has two eyes, unless
the text tells me that we are on a weird planet
where everyone has just one eye, for no
particular reason. Then you also
have bottom-up signals, the information in
the text that then triggers what happens with
the world knowledge that you bring into the
text. And Peter Stockwell and myself, we’ve
talked about this in the relationship of corpus
linguistics and mind modelling, so the creation
of fictional minds and looking at how fictional
minds are related to the minds of real people.
So what I’m very interested in is identifying
linguistic patterns that link the fictional
and the real world and not in a kind of realism
kind of way but thinking about it in terms
of the language that we use to talk about
the real world and how that language is different
from or similar to the language that we use in
fiction to describe fictional worlds. And amazingly,
this is very similar.
So, the thing I like spending a lot of time
with is the CLiC tool that Matthew already
mentioned, so that is something that we have
here now at the University of Birmingham where
I work together with colleagues on trying
to think about how we can let people do concordance
searches on, first of all, Dickens – because
I like Dickens so much – and then also potentially
other fiction so we’ve got a reference corpus
of 19th Century novels to look at. And what
is so special about our concordance application
is that we don’t just look at texts
in their entirety, but we have already annotated
the text behind this web app so that you can
make a distinction between fictional speech,
so anything in quotation marks that is essentially
said by the characters or thought by the characters,
and then anything that is outside of quotation
marks is more narration. And here in this
slide the red bits are bits within quotation
marks, so that is our approximation to fictional
speech, and then the blue stuff is what we
call non-quotes, so anything that isn’t
in quotation marks is non-quotes. That is
a first very rough distinction and we’re
refining this at the moment and I’ll show
you some examples of this. So what we’ve
got is we’ve got annotation behind this
web application that also very nicely runs
on iPhones and iPads and looks really flash
when you use it there, but don’t use it
all at the same time because that brings our
server down!
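The quote/non-quote approximation described here can be roughed out with a regular expression, assuming speech is whatever sits between double quotation marks. Real annotation, as in CLiC, has to deal with single quotes, nesting and multi-paragraph speech, all of which this sketch ignores:

```python
# Rough quote/non-quote split: speech is what sits inside double
# quotation marks; everything else counts as narration.
import re

def split_quotes(text):
    """Return (quotes, non_quotes) as two lists of strings."""
    quotes = re.findall(r'"([^"]*)"', text)
    non_quotes = [chunk.strip()
                  for chunk in re.split(r'"[^"]*"', text)
                  if chunk.strip()]
    return quotes, non_quotes

passage = '"Please, sir, I want some more," said Oliver. The master turned pale.'
quotes, narration = split_quotes(passage)
```

Pooling the two lists over a whole novel gives you exactly the two sub-corpora, quotes and non-quotes, that the key comparison below runs on.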
So the annotation is what we have behind it
and one of the things that you can then do,
going back to what I showed you at the
beginning about these key comparisons, is to
say that if something is significantly more frequent
in one set of texts than in another set of
texts, we can now do this within texts. So
we can take all of the speech, so everything
in quotation marks, chuck this basically in
one corpus and take everything that isn’t
speech and put it in another corpus and then
we do a key comparison to then see what are
the words, what are the phrases that occur
statistically more frequently in fictional
speech compared to non-speech. So that is
what CLiC can do for you and these phrases
are called n-grams and here we’re back to
the Google n-gram, or they’re called clusters
if you use WordSmith terminology for this,
and we put both up because I tend to use clusters,
but if you have people who are not familiar
with corpus linguistics, you need to call
them n-grams so that they also understand what
those are. So you can then do a key comparison
and what you get is, for instance, that if
you say I want to have all the sequences that
are five words long, this would be then the
table that you get with all the sequences
that occur statistically more frequently in
quotes than in non-quotes. So you have things
like ‘what do you mean by?’, ‘what do
you think of?’, ‘I beg your pardon, sir’,
‘I’m very much obliged to you’, and these
are the kinds of phrases that belong to the
speaker-listener world, so that is this idea that
when you have those little sequences, they
are so frequent going back to our point ‘language
is a social phenomenon’, we do things with
words, we need to be efficient, we can’t
think of everything every time we say it.
If we piece all the words together all the
time, this takes forever. This will then sound
like me speaking Italian or something like
this, which takes a long time. So you need
to build these sequences together and the
idea here is that the speaker-listener world
then has a lot of things that are to do with
pronouns, so ‘what do you mean?’, ‘what
do you think?’, ‘I beg your pardon’
and things like this. And these occur more
in the fictional speech and you don’t have
them as much in the narration. But then in
the narration you have other things.
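Extracting clusters (n-grams) of a fixed length is easy to sketch; the sample ‘speech’ below is invented, and a real key comparison would then score these counts against the counts from the non-quote sub-corpus:

```python
# Cluster (n-gram) extraction: count all five-word sequences,
# the way CLiC counts clusters in quotes vs non-quotes.
import re
from collections import Counter

def ngrams(text, n=5):
    """Counter of all n-token sequences in the text."""
    tokens = re.findall(r"[\w']+", text.lower())
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

quotes = ("what do you mean by that sir "
          "what do you mean by coming here "
          "i am very much obliged to you")
clusters = ngrams(quotes)
top = clusters.most_common(1)[0]
```

Running this over all the speech in a novel and then over all the narration gives the two frequency tables that the keyness statistic compares, which is how sequences like ‘what do you mean by’ surface as characteristic of fictional speech.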
OK, so the reason why certain clusters or
n-grams are more frequent in fictional speech
is to do with the functions that they have
and that is again the speaker listener world.
What do we do in spoken language? We tend
to be polite to other people and we don’t
reinvent being polite every time we want to
do it, so there is very formulaic language
there of things that are said automatically,
‘I’m delighted to see you’, ‘I’m
glad to see you’, ‘how do you find yourself?’
or ‘will you have the goodness to..’,
yeah? And what we find here are phrases in
19th Century fiction that are used as politeness,
which might not be the same politeness formula
that we would use today because this is all
very, it depends on which time period you’re
looking at, but you can get some good insights
there. And importantly also, I mean I get
this a lot when you send an article to a journal
for review: ‘this is quite nice but you
actually really need to compare this to a
real spoken corpus’. And I’m afraid that’s
a very good idea, but can someone show me a
real spoken corpus of the 19th Century? Then
I’ll happily do this. So we can’t
easily do this, and therefore it’s actually quite
a good approximation to say that especially
these clusters bring us quite close to these
functional sequences. And also another thing
that I get a lot is when I talk about speech
and fiction, especially Victorianists – sorry
John, not you obviously – but some Victorianists
who can say ‘ah, we do this boring thing
with these speech tics you know in Dickens
where everyone has these repeated phrases
and then there’s one character that says
the same thing all the time and we know this
without you checking this in a corpus’ and
so that’s exactly the point because I’m
really not interested in these, I’m interested
in phrases that are shared across fictional
people. So not things that individualise one
particular character but phrases that occur
across a whole lot of fictional people so
that you can get an insight into what are
the typical functions that were socially relevant
at the time. And an important point here is
that the discussion of realism and fictional
speech also was something that went a bit
out of fashion because people said fictional
speech is nowhere near like spoken language
because you don’t have these ‘mmms’
and ‘ahs’, people don’t interrupt each
other, it’s just very edited and tidy. So
it’s actually boring to study it because
it is too different. However, Norman Page
said part of the problem is that our understanding
of what spontaneous speech is really like
is quite inadequate, and now that we have spoken
corpora, or contemporary spoken corpora, people
have a much better understanding of what real
spoken language is like, and with this much
better understanding from corpus linguistics,
you can then also go back to fiction and see
what are the similarities. You might not get
exactly the same phrase but we get a lot of
functional similarity.
OK, so one example I wanted to show you here
is – and that’s something you can get
with CLiC as well – if you look at clusters,
so something like ‘and all the rest of it’,
that is something that tends to occur in quotes,
so in these quotation marks, and where it
gets really, really interesting is the one
or two times where it doesn’t occur in
quotation marks, because if something normally
behaves one way and then, very rarely, it
turns up elsewhere, then there’s usually a specific
function or a specific effect in this, and
this is from Bleak House where you have from
the village school of Chesney Wold, intact
as it is this minute, to the whole framework
of society; from the whole framework of society
to the aforementioned frameworks receiving…
blah, blah, blah, blah, blah… and then you
have ‘and what not’ in brackets and then
it continues and you have not minding their
catechism … blah, blah, blah… the first
station in which they happened to find themselves
and from that, to their educating other people
out of their stations, and so obliterating
the landmarks and opening the floodgates and
all the rest of it; this is the swift progress
of the Dedlock mind. What you have here is
you have actually the narrator talking. There
aren’t any quotation marks but there are
so many pointers that are really spoken – ‘and
what not’ and then the ‘THEIR’ in capital
letters to really give you the spoken emphasis
and then you have ‘and all the rest of it’.
So these phrases that you would expect in
the spoken language and then you have it in
the part that really isn’t spoken, Dickens
even tells you – I mean Dickens wasn't very
subtle was he? So he then says ‘this is
the swift progress of the Dedlock mind’,
so he’s actually telling you you’re now
inside Dedlock’s mind and see how he sees
society and what he thinks of the horrible
problems we need to deal with.
OK, so another thing we look at a lot especially
in the CLiC project and in all sorts of other
projects, is particular places of context.
That is what we call ‘suspensions’ where
you have the speech of a character and then
this is interrupted by the narrator coming
in and saying something. So, ‘I would not
mind’, said Agnes, coming to me and speaking
in a low voice, so full of sweet and hopeful
consideration that I hear it now, and then
the speech continues. So what happens in that
way is the narrator can just slip in something
instead of making a prominent big point. The
idea is that you’ve got spoken language
and then you have some circumstance or information
that is just slipped in and then it’s very
interesting to look at what is the kind of
stuff that is slipped in and I started checking
etiquette books and politeness guides of the
time because there's a lot about the tone
and the voice that is always commented on
and I don’t have that much time so I won’t
read the top one, but ‘there is a boisterous
loud voice, loud laughter, a bold deportment
– I don’t know why at a hotel – are
sure signs of vulgar breeding’. So you can
see how what happens in the description in
the novel is actually linked to contextual
information and I think that is something
that, to just go back to the culturomics
people, was probably also missing there. You
can’t just take figures and run something
and observe something, you always need to
contextualise it and think about other complementary
data sets to see whether you get some explanation
for what you’re finding.
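A very rough way to pull out suspensions of the shape just quoted, purely for illustration and far simpler than the annotation the project actually uses, is a regular expression over the quote boundaries:

```python
# Toy extraction of "suspensions": narration slipped in between two
# stretches of quoted speech, as in '"...," said Agnes, ..., "..."'.
# Real suspension annotation is much more involved than this regex.
import re

SUSPENSION = re.compile(r'," (said [^,]+, [^"]+?), "')

def find_suspensions(text):
    return [m.group(1).rstrip(", ") for m in SUSPENSION.finditer(text)]

passage = ('"I would not mind," said Agnes, coming to me and speaking '
           'in a low voice, "if it were only for your sake."')
suspensions = find_suspensions(passage)
```

Only ‘said’ is handled here, and the final clause of the example is my own invented continuation; a fuller pattern would cover other reporting verbs (‘returned’, ‘repeated’, ‘rejoined’) and speech that resumes in a new sentence.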
OK, another thing that you can do is once
you’ve got these suspensions, you can actually
run concordances for character names. So in
Hard Times you can run the concordance for
Mrs Sparsit, and then you can see the things
she does: ‘rejoined Mrs Sparsit with
decent resignation’, ‘said Mrs Sparsit
in a highly superior manner’, or ‘said
Mrs Sparsit with dignity’ or in an impressive
manner or in a lofty sense. So what Dickens
does there is give you quite an idea of
what he thinks Mrs Sparsit is like. But it
isn’t that he has got a sentence that says
‘Mrs Sparsit is a horrible woman, she is
arrogant and she behaves in a really lofty
way’. That’s not what he’s doing. He
lets Mrs Sparsit do her talking and then he
just slips in a couple of snide comments talking
about her and then you also get the idea and
then if readers aren’t too brilliant on
the uptake, there are things like
‘said Mrs Sparsit, making her nose more
Roman and her eyeballs more Coriolanian’
and so he’s also talking a lot about her
nose and then you get into the text where
you have something like ‘what’s the matter
now, ma’am?’ said Mr Bounderby, in a very
short, rough way. ‘Pray, sir’, returned
Mrs Sparsit, ‘don’t bite my nose off’,
‘bite your nose off, ma’am?’ repeated
Mr Bounderby. ‘Your nose!’, meaning, as
Mrs Sparsit conceived, that it was too developed
a nose for that purpose. After which offensive
implication he cut himself a crust of bread
and threw the knife down with a noise. OK,
so that is, if you don’t get the subtle
version then you also get the other bits.
So what we did, and here I’ve got Pablo
in because he came in when we started annotating
suspensions. We are now trying to get some
sense into these suspensions in a more systematic
manner. So we’ve changed CLiC so that
you can run the suspensions, and then this here
is our user annotation. So we’ve got Pablo,
Johan, Michaela, everyone is annotating in
this to just see what are the functions there,
so can we find some patterns? Is there something
that isn’t just Mrs Sparsit, but rather the
typical ways Dickens does horrible things
to his characters, or makes nice descriptions
of things like this?
OK, what we’ve also been trying to do is
to get a handle on these things and that’s
something I’m doing together with Andrew
Solway: we’ve tried out some local grammar
induction algorithms, because what Pablo, Ewan
and I are doing is very labour-intensive work,
because
you read through all of this and then you
try to think what this pattern looks like,
what it could be, and then you have a pattern
and you spend two hours talking about it and
then you go through the whole lot again. So
we thought maybe there’s an algorithm that
could help us sort things out, and what
we’re doing here is
not just doing these clusters where you have
sequences, but we’re trying to look at optionality.
So we have got equivalence clusters where
you could say ‘nodding, shakes, burrowing,
rolling’ and then a possessive head or alternatively
with possessive head, so that you try to group
stuff together to get a little bit of the
optionality in instead of looking at individual
instances and then trying to see what you
can get.
Anyway, this whole business about body language
and looking at body parts, where this is taking
me at the moment is to then also consider
the deformed body, disease and disability,
and here again body parts are really quite
helpful if you know that when you talk about
eyes in literary text, you normally talk about
two. If you then just find one, there is something
to have a closer look at. So here you have
‘Betsy Martin, widow, one child, and one
eye. Goes out charring and washing by the
day, never had more than one eye, but knows
her mother drank bottled stout and shouldn’t
wonder if that caused it (immense cheering).
Thinks it not impossible that if she had
always abstained from spirits she might have
had two eyes by this time’ (tremendous applause).
The things that are really interesting here
is, and that is what I’m trying to do in
the wider context, in literature you can get
quite a lot of information in terms of what
disease and disability was like at the time,
very often in a funny way but there’s a
lot of serious stuff behind this because what
they’re really describing is people’s
alcoholism. There were mothers drinking themselves
into a stupor while they were pregnant and then
people had to deal with this in some kind
of way. And so what I’m trying to look at
is what is in the literature in this regard
and looking at non-standard patterns of body
part descriptions is a very good entry for
this. The way in which we put these patterns
into a wider context is by looking at what
happens in the Times, so we can now access
the Times digital archive, thanks to the
library – and I think Sarah was here somewhere
as well – yeah, there’s Sarah – because
we have got such a good relationship with
the library that we can have the Times digital
archive, not just in this interface that you
access but actually the data behind it so
that we can do corpus work with it. Then you
can see whether any of these things that are
hinted at in the literature have potential
patterns in newspaper articles of the time
so that you can get a sense of what was going
on. And some of the things that I just wanted
to hint at briefly, that is work that we’re
doing together with a lot of people and there’s
Simon, who we’ve done this with, Fiona’s
over there, and we’re trying to get a handle
on whether there is some kind of relationship
between the way in which alcohol is being talked
about and what people see in terms of alcohol and
the health implications. So this is just to
show you – I’m doing exactly the thing
that I was criticising at the beginning – I’m
showing you a nice visualisation of how you
can look at collocation. So which are the
words that various types of alcohol occur
together with, so that you have things like
‘alcohol and drinking’, ‘alcohol problem’
and then you have links over to ‘risk and
disease and cancer’ and all sorts of relationships
that you can find in this way, and that’s
what we’re trying to look at in the Times
digital archive.
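Counting the collocates within a window of a node word, which is the basis of a network display like this, can be sketched as follows; the sample ‘article’ is invented, not the Times data:

```python
# Collocation sketch: count the words that appear within a fixed
# window of a node word, the raw material for a collocation network.
import re
from collections import Counter

def collocates(text, node, window=3):
    """Counter of words found within `window` tokens of the node."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            neighbours = (tokens[max(0, i - window):i]
                          + tokens[i + 1:i + 1 + window])
            counts.update(neighbours)
    return counts

article = ("alcohol and drinking raise the risk of disease "
           "an alcohol problem raises the risk of cancer")
cooc = collocates(article, "alcohol")
```

Real collocation displays usually filter these raw counts with an association measure (mutual information, log-likelihood and so on) so that frequent function words don’t drown out links like ‘alcohol – risk – disease’.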
Another thing, because we have got the topic
of reading – I hope you noticed that! – is
how you look at these patterns that you can
find with the help of corpus methods. How do
you question what these patterns are? How do
you think about – I was saying I’m interested
in mind modelling – how cognitive relationships
and what you see in the text actually work
together? So there was this thing about patterns
and meanings that are associated. Meaning
is made by bringing in background knowledge
and all the rest of it. Something that corpus
linguists say a lot, they say you can find
these patterns and corpus linguistics is so
important because as a speaker of the language
you don’t know what is frequent. You can’t
tell. If I ask you what is frequent or if
I ask you what was the most frequent thing
you’ve been saying over the past two weeks,
no-one will be able to say this because we
are just not very good at judging what’s
happening. So what John Sinclair brought into
that discussion is that part of this has to
do with the fact that we kind of store prefabricated
phrases. So we have these units that we then
use without having to build them up individually
because it’s just faster, it makes it more
efficient. And then also language users are
probably primed a lot, and that is stuff that
Michael Hoey was talking about, primed a lot
about the way in which they speak.
Anyway, what we wanted to find out – and that
is work that I’m doing with Gareth, who’s
over there, and with Kathy Conklin, who’s
there in the picture but in Nottingham (we
just had to get them to take a picture of
the eye tracker) – is how we can make observations on
these patterns that we find in the corpus
by looking at what happens when we read these
patterns. So with the eye tracker you can
do things: we can give people a passage
to read and then you can see where they’re
looking, how long they are spending on
reading something and we’ve done a couple
of studies with this already in our brand
new eye-tracking lab – sorry, a little promotion
thing here. What we did is we looked at mothers
and to give you a little bit more context
here, I’m particularly interested now in
these novels in the 19th Century that were
very much – and particularly Dickens – published
in parts, and some of the things that people
say about the findings on clusters is that
these clusters occur repeatedly because people
got, say, three chapters of the novel,
then there was this break and they
couldn’t read on, and they had
to wait until the next instalment came out.
So they needed something that, once they
got the next instalment, reminded them of
a particular character. So therefore you had
all those repetitions. However, no-one has
really tested that yet. I mean that is something
that goes down very well as something that
you read often and the more often you read
it, the more you believe it, but no-one has
actually tested this – until now, because we
are now looking at exactly this. So one of
the clusters we’ve looked at is ‘with his
head against the wall’, and again
in Bleak House there’s this Mr Jellyby
who ‘during the whole evening sat in the
corner with his head against the wall as if
he were subject to low spirits’. Now there’s something
like a little bit of narrative progression,
because you have several instances of him
doing this, so ‘poor Mr Jellyby who very
seldom spoke and almost always sat when he
was at home with his head against the wall’…
blah, blah, blah. So for him this sitting
with his head against the wall is basically
a summary of his situation: he’s in a very untidy household,
he’s got a horrible wife, he can never say
anything at all, and all he can do is sit with
his head against the wall. So that’s a summary
characterisation of Mr Jellyby. And so you
could then say OK, if that summarises his
life, then if you see ‘with his head
against the wall’, that reminds you, if you
haven’t seen him for a while, that that is
the guy.
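Clusters like ‘with his head against the wall’ are typically found by counting recurring n-grams in a text. A minimal sketch (the helper name and toy text are illustrative, not the actual study materials):

```python
from collections import Counter

def repeated_clusters(tokens, n=5, min_freq=2):
    """Count all n-grams and keep the ones that recur (candidate clusters)."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: freq for gram, freq in grams.items() if freq >= min_freq}

text = ("he sat with his head against the wall and "
        "poor mr jellyby sat with his head against the wall").split()
for gram, freq in repeated_clusters(text).items():
    print(" ".join(gram), freq)
```

On a whole novel you would also filter overlapping n-grams and set `min_freq` much higher, but this is the basic counting step behind cluster lists.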
So what we then did is we did a reading study
where we had undergraduates read basically
an abridged version of Bleak House where I’d
just taken out all the passages that Mr Jellyby
is in. So instead of reading it really as
the novel, we just cut everything else and
had a little story that was just a story around
Mr Jellyby, and we gave this to participants
to read, to then see whether anything happens
if they see this ‘with his head against the
wall’ several times, or at what point you can
see that they’ve actually remembered it,
that they’ve learned it? So there are things
that you can then do. You could take the overall
reading times for the class, so you measure
how long it takes the participants to
read this. What you can then also do is you
can normalise this reading rate per character
and you can see where this is going. If you
normalise this per character you can then
compare what happens, so you need to normalise,
just as with the corpus linguistic counts. Anyway, then
you can also compare this to the reading rate
per character for the whole text. So if you
then see, these here are instances – one,
two, three, four, five, six, seven – these
are the instances of ‘with his head against the
wall’, and you can actually see that there’s
a general trend that the more often you see
it, you kind of read it faster and that goes
together with a lot of stuff in the literature
that says reading time is an indication of
the relationship to memory, have you seen
this before, have you learnt it, and that
is actually some good evidence to see that
there is some learning going on. Obviously
the correlation isn’t entirely clear so
we still need to do a lot more studies and
refine the experiments, but I think there
is enough to show us that there is something
worthwhile investigating. There is a trend
and actually there is a relationship between
what we find on the basis of corpus data and
what we find in other areas of linguistics
that are not necessarily corpus linguistics.
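The normalisation step described above is simple: divide the reading time for each occurrence by its length in characters, so passages of different lengths become comparable. A sketch with made-up numbers (the times here are invented for illustration, not results from the study):

```python
def rate_per_char(reading_time_ms, text):
    """Reading time normalised per character (ms/char)."""
    return reading_time_ms / len(text)

cluster = "with his head against the wall"

# Hypothetical reading times for successive occurrences of the cluster.
times_ms = [2100, 1800, 1500]
rates = [rate_per_char(t, cluster) for t in times_ms]
print(rates)  # a falling rate across repetitions would suggest learning
```

The same per-character rate can be computed for the whole text, giving the baseline against which the cluster occurrences are compared.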
OK, so with this result we can see that there
is some kind of learning, there are some things
in the peripheral aspects of it and again,
I’ll skip over this, we don’t have enough time.
What we also did is we then gave people follow
up questions because we wanted to see is it
just the reading that we observed that gives
us an indication of their learning this, or
can we ask them questions about characters
in the book? So we would ask them something
like how many characters were in the extract?
What did the characters do? Where were they
sitting? Can you describe the character in
terms of adjectives and short phrases? I wanted
them really to say that this is Jellyby and
he is very sad and all these things and it
seems people are OK to give you information
about the gist of it, but people will not
say he was sitting with his head against the
wall until you really point them to it. So
if you have various questions like where is
Mr Jellyby, what is he doing, where might
he be sitting?
And you get there, but it isn’t so – so
it’s kind of: people read it faster, they
learn it, but they’re not aware of it and
they associate it with something else.
OK, so ongoing work – because I don’t have
time I’ll just give you a couple of things
that we’re doing to contextualise these
corpus linguistic methods a lot more. So
we’re looking with people at questionnaires
that have been filled in by vets looking at
practice-based medicine to then see what kind
of things do they talk about? What is it that
they consciously or unconsciously might mention?
So we’ve done key word studies, we’ve
done some topic modelling and things in that
area. With Anna I’ve been doing something
on translation studies so we just finished,
hallelujah, an article on translated cohesion
and then the other project that Hugh is leading
on where we want to look at John Donne’s
sermons, together with Paul. So there’s
a lot of stuff. Anyway, I had a lot of pictures
of people here and I couldn’t give you all
the pictures so I do one slide with a lot
of people who are involved in a lot of this
and then this is one person who would say
‘so what?’. So now I think I’ve been
talking for like 45 minutes and then as with
[0:41:40], ‘so? You’ve done all this?
Where do we take this from here? Does anyone
care?’ and I think people should care because
these fundamentals about corpus linguistics,
how we understand meaning, how discourses
are created, are also important in the wider
context of what has to do with what people
like to call ‘digital humanities’ because
I think one thing to take away from this is
that there are a lot of similarities between
disciplines, but they are sometimes just hidden
underneath disciplinary terminology,
or fancy terms where
people call things differently and then you
can’t see immediately that you’re basically
doing the same thing, only in different fields
and if only you would talk to each other,
you could actually do much better stuff together.
So it’s really important that communication
across fields happens. So I think if the Google
n-grams paper had
been submitted to one of the three corpus
linguistics journals that are at present represented
in this room, we’d have sorted that out, I think.
And then there’s things that yes, the toolbox
has changed. We can do a lot more things but
that doesn’t mean we don’t need good research
questions. So we need to understand what it
is we want to find out and then try and make
tools, try and get methods that answer the
research question and not just do something
because we can and it looks nice and maybe
someone will like it. But we need to have
the questions and then we do what we can to
get it answered. So we also need to be very
critical of ourselves. Are we re-inventing
the wheel? I mean that’s the thing with
the local grammar induction where I’m currently
trying to find out – someone must have done
this, it must be somewhere – so I need to find out
whether that’s around and how you can make
the connection. But we should also be critical of others,
so don’t be fooled by fancy visualisations.
Just because you can do a word cloud, it doesn’t
mean that a word cloud is the answer to everything.
So I think that corpus linguistics as a field
needs to have an impact on digital humanities
more widely because over the past thirty years
we’ve had all these discussions about theory
and method and is a corpus representative
of something? So we would have never just
taken Google Books and then made a claim on
something because people have talked about
balance and design forever. So there are things
that we have learnt, sometimes the hard
way, over a long period of time, but that is
still valuable and maybe that can help if
we then look at developments in digital humanities
that are happening right now. We can say we
had a similar problem, and the way we dealt with
it is as follows. And underlying all of this
is that language is a social phenomenon. So
this is also crucial in our understanding
of the digital world. So a lot of this is
really how do people interact? What is it
we find interesting? How do we conceptualise
the world around us and how do we bring this
together? So I would say that we should really
tackle these challenges and avoid professional
hazards through collaborative work ultimately.
So I think corpus linguistics for me, it’s
very much collaborative and I would hope that quite a lot of people in this room would think the same.
So, thank you.
