Computational linguistics is an interdisciplinary
field concerned with the statistical or rule-based
modeling of natural language from a computational
perspective, as well as the study of appropriate
computational approaches to linguistic questions.
Traditionally, computational linguistics was
performed by computer scientists who had specialized
in the application of computers to the processing
of a natural language.
Today, computational linguists often work
as members of interdisciplinary teams, which
can include regular linguists, experts in
the target language, and computer scientists.
In general, computational linguistics draws
upon the involvement of linguists, computer
scientists, experts in artificial intelligence,
mathematicians, logicians, philosophers, cognitive
scientists, cognitive psychologists, psycholinguists,
anthropologists and neuroscientists, among
others.
Computational linguistics has theoretical
and applied components.
Theoretical computational linguistics focuses
on issues in theoretical linguistics and cognitive
science, and applied computational linguistics
focuses on the practical outcome of modeling
human language use.
The Association for Computational
Linguistics defines computational linguistics
as:
...the scientific study of language from a
computational perspective.
Computational linguists are interested in
providing computational models of various
kinds of linguistic phenomena.
== Origins ==
Computational linguistics is often grouped
within the field of artificial intelligence,
but it predates the development
of artificial intelligence.
Computational linguistics originated with
efforts in the United States in the 1950s
to use computers to automatically translate
texts from foreign languages, particularly
Russian scientific journals, into English.
Since computers can make arithmetic calculations
much faster and more accurately than humans,
it was thought to be only a matter of
time before they could also begin to process
language.
Computational and quantitative methods are
also used historically in attempted reconstruction
of earlier forms of modern languages and subgrouping
modern languages into language families.
Earlier methods, such as lexicostatistics and
glottochronology, proved premature and inaccurate.
However, more recent interdisciplinary studies
that borrow concepts from biology, especially
gene mapping, have produced more sophisticated
analytical tools and more trustworthy results.
When machine translation
(also known as mechanical translation) failed
to yield accurate translations right away,
automated processing of human languages was
recognized as far more complex than had originally
been assumed.
Computational linguistics was born as the
name of the new field of study devoted to
developing algorithms and software for intelligently
processing language data.
The term "computational linguistics" itself
was first coined by David Hays, founding member
of both the Association for Computational
Linguistics and the International Committee
on Computational Linguistics.
When artificial intelligence came into existence
in the 1960s, the field of computational linguistics
became that sub-division of artificial intelligence
dealing with human-level comprehension and
production of natural languages.
In order to
translate one language into another, it was
observed that one had to understand the grammar
of both languages, including both morphology
(the grammar of word forms) and syntax (the
grammar of sentence structure).
In order to understand syntax, one had to
also understand the semantics and the lexicon
(or 'vocabulary'), and even something of the
pragmatics of language use.
Thus, what started as an effort to translate
between languages evolved into an entire discipline
devoted to understanding how to represent
and process natural languages using computers.
Nowadays,
research within the scope of computational
linguistics is done at computational linguistics
departments, computational linguistics laboratories,
computer science departments, and linguistics
departments.
Some research in the field of computational
linguistics aims to create working speech
or text processing systems while others aim
to create a system allowing human-machine
interaction.
Programs meant for human-machine communication
are called conversational agents.
== Approaches ==
Just as computational linguistics can be performed
by experts in a variety of fields and through
a wide assortment of departments, so too can
the research fields broach a diverse range
of topics.
The following sections discuss some of the
literature available across the entire field
broken into four main areas of discourse: developmental
linguistics, structural linguistics, linguistic
production, and linguistic comprehension.
=== Developmental approaches ===
Language is a cognitive skill which develops
throughout the life of an individual.
This developmental process has been examined
using a number of techniques, and a computational
approach is one of them.
Human language development imposes some
constraints that make it harder to apply
computational methods to understanding it.
For instance, during language acquisition,
human children are largely only exposed to
positive evidence.
This means that during the linguistic development
of an individual, only evidence for what is
a correct form is provided, and not evidence
for what is not correct.
Positive evidence alone is insufficient for
a simple hypothesis-testing procedure over
something as complex as language, and it
therefore sets boundaries on computational
approaches to modeling language development
and acquisition in an individual.
Attempts have been made to model the developmental
process of language acquisition in children
from a computational angle, leading to both
statistical grammars and connectionist models.
Work in this realm has also been proposed
as a method to explain the evolution of language
through history.
Using models, it has been shown that languages
can be learned with a combination of simple
input presented incrementally as the child
develops better memory and longer attention
span.
This was simultaneously posed as a reason
for the long developmental period of human
children.
Both conclusions were drawn because of the
strength of the artificial neural network
which the project created.
The ability of infants to develop language
has also been modeled using robots in order
to test linguistic theories.
To let robots learn as children might, a model
was created based on affordances, in which
mappings between actions, perceptions, and
effects were created and linked to spoken
words.
Crucially, these robots were able to acquire
functioning word-to-meaning mappings without
needing grammatical structure, vastly simplifying
the learning process and shedding light on
information which furthers the current understanding
of linguistic development.
It is important to note that this information
could only have been empirically tested using
a computational approach.
As our understanding of the linguistic development
of an individual within a lifetime is continually
improved using neural networks and learning
robotic systems, it is also important to keep
in mind that languages themselves change and
develop through time.
Computational approaches to understanding
this phenomenon have yielded notable results.
Using the Price Equation and Pólya urn dynamics,
researchers have created a system which not
only predicts future linguistic evolution,
but also gives insight into the evolutionary
history of modern-day languages.
This modeling effort achieved, through computational
linguistics, what would otherwise have been
impossible.
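The "rich get richer" character of Pólya urn dynamics can be illustrated with a minimal simulation; the two competing variants and their starting counts below are invented for illustration and are not taken from the cited research:

```python
import random

def polya_urn(counts, steps, rng=random.Random(0)):
    """Simulate Polya urn dynamics: at each step, draw a variant with
    probability proportional to its current count, then add one more
    copy of it (a "rich get richer" process)."""
    counts = dict(counts)
    for _ in range(steps):
        total = sum(counts.values())
        r = rng.uniform(0, total)
        acc = 0.0
        for variant, c in counts.items():
            acc += c
            if r <= acc:
                counts[variant] += 1
                break
    return counts

# Start with one copy of each linguistic variant and run 1000 draws.
final = polya_urn({"variant_a": 1, "variant_b": 1}, steps=1000)
```

Early random draws tend to be amplified, so one variant typically ends up dominating, which is the mechanism such models use to explain how competing forms win out over time.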
The understanding of linguistic development
in individuals, as well as over evolutionary
time, has been substantially improved
by advances in computational linguistics.
The ability to model and modify systems at
will affords science an ethical method of
testing hypotheses that would otherwise be
intractable.
=== Structural approaches ===
In order to create better computational models
of language, an understanding of language's
structure is crucial.
To this end, the English language has been
meticulously studied using computational approaches
to better understand how the language works
on a structural level.
One of the most important prerequisites for
studying linguistic structure is the
availability of large linguistic corpora,
or samples.
This grants computational linguists the raw
data necessary to run their models and gain
a better understanding of the underlying structures
present in the vast amount of data which is
contained in any single language.
One of the most cited English linguistic corpora
is the Penn Treebank.
Derived from widely different sources, such
as IBM computer manuals and transcribed telephone
conversations, this corpus contains over 4.5
million words of American English.
This corpus has been annotated primarily with
part-of-speech tagging and syntactic bracketing,
and it has yielded substantial empirical observations
related to language structure.
Theoretical
approaches to the structure of languages have
also been developed.
These works allow computational linguistics
to have a framework within which to work out
hypotheses that will further the understanding
of the language in a myriad of ways.
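As an illustration of the syntactic bracketing used in corpora such as the Penn Treebank, the sketch below reads a bracketed parse into a nested tree and extracts its part-of-speech tags; the sentence is an invented example in Treebank-style notation, not an actual Treebank entry:

```python
def parse_bracketed(s):
    """Parse a Penn-Treebank-style bracketing such as
    (S (NP (DT the) (NN cat)) (VP (VBZ sits)))
    into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return node()

def pos_tags(tree):
    """Yield (tag, word) pairs from a parsed tree's preterminal nodes."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, children[0])
    else:
        for child in children:
            if isinstance(child, tuple):
                yield from pos_tags(child)

tree = parse_bracketed("(S (NP (DT the) (NN cat)) (VP (VBZ sits)))")
tags = list(pos_tags(tree))  # [('DT', 'the'), ('NN', 'cat'), ('VBZ', 'sits')]
```

Part-of-speech tags and phrase structure recovered this way are the raw material for the empirical observations about language structure mentioned above.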
One of the original theoretical theses on
internalization of grammar and structure of
language proposed two types of models.
In these models, rules or patterns learned
increase in strength with the frequency of
their encounter.
The work also created a question for computational
linguists to answer: how does an infant learn
a specific and non-normal grammar (Chomsky
Normal Form) without learning an overgeneralized
version and getting stuck?
Theoretical efforts like these set the direction
for research to go early in the lifetime of
a field of study, and are crucial to the growth
of the field.
Structural information about languages allows
for the discovery and implementation of similarity
recognition between pairs of text utterances.
For instance, it has recently been shown
that based on the structural information present
in patterns of human discourse, conceptual
recurrence plots can be used to model and
visualize trends in data and create reliable
measures of similarity between natural textual
utterances.
This technique is a strong tool for further
probing the structure of human discourse.
Without the computational approach to this
question, the vastly complex information present
in discourse data would have remained inaccessible
to scientists.
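The idea behind such similarity measures can be sketched in a much-simplified form, assuming bag-of-words cosine similarity between utterances; the published conceptual recurrence method is more sophisticated, and the utterances below are invented:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recurrence_matrix(utterances):
    """Pairwise similarity matrix over a sequence of utterances;
    plotting this matrix gives a simple recurrence plot."""
    bags = [Counter(u.lower().split()) for u in utterances]
    return [[cosine(a, b) for b in bags] for a in bags]

m = recurrence_matrix(["the cat sat", "the dog sat", "a bird flew"])
```

High off-diagonal values in the matrix mark points where the discourse returns to earlier content, which is the trend such plots are used to visualize.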
Information regarding the structural data
of a language is available for English as
well as other languages, such as Japanese.
Using computational methods, Japanese sentence
corpora were analyzed and a pattern of log-normality
was found in relation to sentence length.
Though the exact cause of this log-normality
remains unknown, it is precisely this sort
of intriguing information which computational
linguistics is designed to uncover.
This information could lead to further important
discoveries regarding the underlying structure
of Japanese, and could have any number of
effects on the understanding of Japanese as
a language.
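Fitting a log-normal distribution to sentence lengths amounts to estimating the mean and standard deviation of their logarithms; the lengths below are invented toy data, not drawn from the Japanese corpora discussed above:

```python
import math
import statistics

def lognormal_fit(lengths):
    """Fit a log-normal distribution by taking logarithms of the
    sentence lengths and estimating the mean and standard deviation
    of those logs."""
    logs = [math.log(n) for n in lengths]
    return statistics.mean(logs), statistics.stdev(logs)

# Toy sentence lengths in words; a real study would use corpus counts.
lengths = [5, 8, 12, 7, 20, 9, 15, 6, 11, 30]
mu, sigma = lognormal_fit(lengths)
# exp(mu) is the geometric-mean sentence length implied by the fit.
```

If the lengths truly are log-normal, a histogram of the logs should look approximately normal with these parameters.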
Computational linguistics allows such additions
to the scientific knowledge
base to be made quickly and with great
confidence.
Without a computational approach to the structure
of linguistic data, much of the information
that is available now would still be hidden
under the vastness of data within any single
language.
Computational linguistics allows scientists
to parse huge amounts of data reliably and
efficiently, creating the possibility for
discoveries unlike any seen in most other
approaches.
=== Production approaches ===
The production of language is equally complex,
both in the information it conveys and in
the skills a fluent producer must have.
That is to say, comprehension is only half
the problem of communication.
The other half is how a system produces language,
and computational linguistics has made some
very interesting discoveries in this area.
In a now-famous paper published in 1950, Alan
Turing proposed the possibility that machines
might one day have the ability to "think".
As a thought experiment for what might define
the concept of thought in machines, he proposed
an "imitation game" in which a human subject
has two text-only conversations, one with
a fellow human and another with a machine
attempting to respond like a human.
Turing proposes that if the subject cannot
tell the difference between the human and
the machine, it may be concluded that the
machine is capable of thought.
Today this test is known as the Turing test,
and it remains an influential idea in the
field of artificial intelligence.
One of the earliest and best known examples
of a computer program designed to converse
naturally with humans is the ELIZA program
developed by Joseph Weizenbaum at MIT in 1966.
The program emulated a Rogerian psychotherapist
when responding to written statements and
questions posed by a user.
It appeared capable of understanding what
was said to it and responding intelligently,
but in truth it simply followed a pattern
matching routine that relied on only understanding
a few keywords in each sentence.
Its responses were generated by recombining
the unknown parts of the sentence around properly
translated versions of the known words.
For example, in the phrase "It seems that
you hate me" ELIZA understands "you" and "me"
which matches the general pattern "you [some
words] me", allowing ELIZA to update the words
"you" and "me" to "I" and "you" and replying
"What makes you think I hate you?".
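This keyword-spotting and pronoun-swapping routine can be sketched as follows; this is a toy reimplementation of the idea, not Weizenbaum's actual program:

```python
import re

# Toy ELIZA-style responder: keyword spotting plus pronoun swapping.
PRONOUN_SWAP = {"you": "I", "me": "you", "i": "you",
                "my": "your", "your": "my"}

def reflect(words):
    """Swap first- and second-person pronouns."""
    return [PRONOUN_SWAP.get(w.lower(), w) for w in words]

def respond(sentence):
    """Match the general pattern 'you [some words] me' and echo the
    middle words back as a question, with pronouns swapped."""
    m = re.search(r"\byou\b (.+) \bme\b", sentence, re.IGNORECASE)
    if m:
        middle = " ".join(reflect(m.group(1).split()))
        return f"What makes you think I {middle} you?"
    return "Please go on."  # generic fallback when no pattern matches

print(respond("It seems that you hate me"))
# → What makes you think I hate you?
```

Note that the program never consults the meaning of the middle words; the transformation is purely mechanical, which is exactly the limitation described above.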
In this example ELIZA has no understanding
of the word "hate", but it is not required
for a logical response in the context of this
type of psychotherapy.
Some projects are still trying to solve the
problem that originally launched computational
linguistics as its own field.
However, the methods have become more refined
and clever, and consequently the results generated
by computational linguists have become more
enlightening.
In an effort to improve computer translation,
several models have been compared, including
hidden Markov models, smoothing techniques,
and the specific refinements of those to apply
them to verb translation.
The model which was found to produce the most
natural translations of German and French
words was a refined alignment model with a
first-order dependence and a fertility model[16].
They also provide efficient training algorithms
for the models presented, which can give other
scientists the ability to improve further
on their results.
This type of work is specific to computational
linguistics, and has applications which could
vastly improve understanding of how language
is produced and comprehended by computers.
Work has also been done in making computers
produce language in a more naturalistic manner.
Using linguistic input from humans, algorithms
have been constructed that can modify a system's
style of production based on factors such
as the user's own linguistic style, or more
abstract factors like politeness or any of
the five main dimensions of personality.
This work takes a computational approach via
parameter estimation models to categorize
the vast array of linguistic styles we see
across individuals and simplify it for a computer
to work in the same way, making human-computer
interaction much more natural.
==== Text-based interactive approach ====
Many of the earliest and simplest models of
human-computer interaction, such as ELIZA
for example, involve a text-based input from
the user to generate a response from the computer.
By this method, words typed by a user trigger
the computer to recognize specific patterns
and reply accordingly, through a process known
as keyword spotting.
==== Speech-based interactive approach ====
Recent technologies have placed more of an
emphasis on speech-based interactive systems.
These systems, such as Siri on Apple's iOS,
operate on a pattern-recognition technique
similar to that of text-based systems, but
with the user input collected through speech
recognition.
This branch of linguistics involves the processing
of the user's speech as sound waves and the
interpreting of the acoustics and language
patterns in order for the computer to recognize
the input.
=== Comprehension approaches ===
Much of the focus of modern computational
linguistics is on comprehension.
With the proliferation of the internet and
the abundance of easily accessible written
human language, the ability to create a program
capable of understanding human language would
have many broad and exciting possibilities,
including improved search engines, automated
customer service, and online education.
Early work in comprehension included applying
Bayesian statistics to the task of optical
character recognition, as illustrated by Bledsoe
and Browning in 1959: a large dictionary
of possible letters was generated by "learning"
from example letters, and then the probability
that any one of those learned examples matched
the new input was combined to make a final
decision.
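The flavor of this approach can be sketched with invented 3x3 letter templates, treating each pixel as an independent noisy observation; this is an illustration of the idea, not the Bledsoe and Browning system:

```python
# Toy Bayesian character recognizer: learned letter templates are
# invented 3x3 bitmaps, and each pixel either matches the template
# with probability p_match or is "noise".
templates = {
    "I": (0, 1, 0, 0, 1, 0, 0, 1, 0),
    "L": (1, 0, 0, 1, 0, 0, 1, 1, 1),
}

def likelihood(image, template, p_match=0.9):
    """P(image | letter): each pixel independently agrees with the
    template with probability p_match."""
    p = 1.0
    for px, tx in zip(image, template):
        p *= p_match if px == tx else 1 - p_match
    return p

def classify(image):
    """Pick the letter with the highest likelihood (uniform prior)."""
    return max(templates, key=lambda ltr: likelihood(image, templates[ltr]))

noisy_i = (0, 1, 0, 0, 1, 0, 1, 1, 0)  # an "I" with one flipped pixel
print(classify(noisy_i))  # → I
```

Combining independent per-pixel probabilities into one decision is the essence of the Bayesian treatment described above.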
Other attempts at applying Bayesian statistics
to language analysis included the work of
Mosteller and Wallace (1963) in which an analysis
of the words used in The Federalist Papers
was used to attempt to determine their authorship
(concluding that Madison most likely authored
the majority of the papers).
In 1971, Terry
Winograd developed an early natural language
processing engine capable of interpreting
naturally written commands within a simple
rule governed environment.
The primary language parsing program in this
project was called SHRDLU, which was capable
of carrying out a somewhat natural conversation
with the user giving it commands, but only
within the scope of the toy environment designed
for the task.
This environment consisted of different shaped
and colored blocks, and SHRDLU was capable
of interpreting commands such as "Find a block
which is taller than the one you are holding
and put it into the box." and asking questions
such as "I don't understand which pyramid
you mean." in response to the user's input.
While impressive, this kind of natural language
processing has proven much more difficult
outside the limited scope of the toy environment.
Similarly, a project developed by NASA called
LUNAR was designed to provide answers to naturally
written questions about the geological analysis
of lunar rocks returned by the Apollo missions.
These kinds of problems are referred to as
question answering.
Initial attempts at understanding spoken language
were based on work done in the 1960s and 1970s
in signal modeling where an unknown signal
is analyzed to look for patterns and to make
predictions based on its history.
An initial and somewhat successful approach
to applying this kind of signal modeling to
language was achieved with the use of hidden
Markov models as detailed by Rabiner in 1989.
This approach attempts to determine the
probabilities of the hidden models that could
be generating the speech, as well as the
probabilities of the various words generated
by each of these models.
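The core computation in such a model is the forward algorithm, which sums the probability of an observation sequence over all hidden state paths; the states, transition, and emission probabilities below are invented for illustration, not taken from Rabiner's examples:

```python
# Toy HMM with two hidden states and a tiny vocabulary.
states = ["noun", "verb"]
start = {"noun": 0.6, "verb": 0.4}
trans = {"noun": {"noun": 0.3, "verb": 0.7},
         "verb": {"noun": 0.8, "verb": 0.2}}
emit = {"noun": {"dog": 0.5, "runs": 0.1, "cat": 0.4},
        "verb": {"dog": 0.1, "runs": 0.8, "cat": 0.1}}

def forward(observations):
    """Forward algorithm: alpha[s] = P(obs so far, current state = s).
    Returns the total probability of the observation sequence."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

p = forward(["dog", "runs"])  # → 0.1866
```

Because the sum runs over every possible state path, the same machinery scores competing hidden models against an observed signal, which is what the speech recognition systems described above exploit.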
Similar approaches were employed in early
speech recognition attempts starting in the
late 1970s at IBM, using word/part-of-speech
pair probabilities.
More recently, these kinds
of statistical approaches have been applied
to more difficult tasks such as topic identification
using Bayesian parameter estimation to infer
topic probabilities in text documents.
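A minimal sketch of Bayesian topic identification, using naive Bayes with add-alpha smoothing; the topics and training texts are invented, and real systems use far richer models:

```python
import math
from collections import Counter

# Invented training data: one small word sample per topic.
training = {
    "sports": "goal match team score team win",
    "finance": "market stock price trade market fund",
}

def topic_probabilities(text, alpha=1.0):
    """Posterior over topics for a document, assuming a uniform prior
    and conditionally independent words (naive Bayes)."""
    counts = {t: Counter(w.split()) for t, w in training.items()}
    vocab = set().union(*counts.values())
    log_post = {}
    for topic, c in counts.items():
        total = sum(c.values())
        lp = 0.0
        for w in text.split():
            # add-alpha smoothing so unseen words get nonzero probability
            lp += math.log((c[w] + alpha) / (total + alpha * len(vocab)))
        log_post[topic] = lp
    # normalize the log posteriors into probabilities
    m = max(log_post.values())
    exp = {t: math.exp(lp - m) for t, lp in log_post.items()}
    z = sum(exp.values())
    return {t: v / z for t, v in exp.items()}

probs = topic_probabilities("the team will win the match")
```

Here the word counts act as the Bayesian parameter estimates: topics whose training sample shares more words with the document receive higher posterior probability.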
== Applications ==
Modern computational linguistics is often
a combination of studies in computer science
and programming, math, particularly statistics,
language structures, and natural language
processing.
Combined, these fields most often lead to
the development of systems that can recognize
speech and perform some task based on that
speech.
Examples include speech recognition software,
such as Apple's Siri feature, spellcheck tools,
speech synthesis programs, which are often
used to demonstrate pronunciation or help
the disabled, and machine translation programs
and websites, such as Google Translate.
Computational
linguistics can be especially helpful in situations
involving social media and the Internet.
For example, content filters in chatrooms
and in website searches rely on computational
linguistics.
Chat operators often use filters to identify
words or phrases deemed inappropriate so that
users cannot submit them.
Filters are also used on websites: schools
use them to block pages containing certain
keywords from children, and many parental-control
programs put similar content filters in place.
Computational linguists can also develop programs
that group and organize content through social
media mining.
An example of this is Twitter, in which programs
can group tweets by subject or keywords.
Computational linguistics is also used for
document retrieval and clustering.
When you do an online search, documents and
websites are retrieved based on the frequency
of unique labels related to what you typed
into a search engine.
For instance, if you search "red, large, four-wheeled
vehicle," with the intention of finding pictures
of a red truck, the search engine will still
find the information desired by matching words
such as "four-wheeled" with "car".
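A minimal sketch of frequency-based retrieval of the sort described, ranking invented documents by their overlap with the query's terms (real search engines add weighting, synonym expansion, and much more):

```python
from collections import Counter

# Invented document collection for illustration.
docs = {
    "doc1": "a large red truck with four wheels",
    "doc2": "a small blue bicycle",
    "doc3": "red four-wheeled vehicle for sale",
}

def retrieve(query):
    """Return document names sorted by descending term overlap
    with the query."""
    q = Counter(query.lower().split())
    scores = {}
    for name, text in docs.items():
        c = Counter(text.lower().split())
        scores[name] = sum(min(q[w], c[w]) for w in q)
    return sorted(scores, key=scores.get, reverse=True)

ranking = retrieve("red large four-wheeled vehicle")
# doc3 shares the most query terms, so it ranks first.
```

Matching on shared labels like "red" and "four-wheeled" is what lets the engine surface relevant documents even when no single document repeats the query verbatim.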
== Subfields ==
Computational linguistics can be divided into
major areas depending upon the medium of the
language being processed, whether spoken or
textual; and upon the task being performed,
whether analyzing language (recognition) or
synthesizing language (generation).
Speech recognition and speech synthesis deal
with how spoken language can be understood
or created using computers.
Parsing and generation are sub-divisions of
computational linguistics dealing respectively
with taking language apart and putting it
together.
Machine translation remains the sub-division
of computational linguistics dealing with
having computers translate between languages.
Fully automatic, high-quality translation,
however, has yet to be achieved, and it remains
a notoriously hard branch of computational
linguistics.
Some of the areas of research
that are studied by computational linguistics
include:
Computational complexity of natural language,
largely modeled on automata theory, with the
application of context-sensitive grammar and
linearly bounded Turing machines.
Computational semantics comprises defining
suitable logics for linguistic meaning representation,
automatically constructing them and reasoning
with them
Computer-aided corpus linguistics, which has
been used since the 1970s as a way to make
detailed advances in the field of discourse
analysis
Design of parsers or chunkers for natural
languages
Design of taggers like POS-taggers (part-of-speech
taggers)
Machine translation as one of the earliest
and most difficult applications of computational
linguistics draws on many subfields.
Simulation and study of language evolution
in historical linguistics/glottochronology.
== Legacy ==
The subject of computational linguistics has
had a recurring impact on popular culture:
The 1983 film WarGames features a young computer
hacker who interacts with an artificially
intelligent supercomputer.
A 1997 film, Conceiving Ada, focuses on Ada
Lovelace, considered one of the first computer
scientists, as well as themes of computational
linguistics.
Her, a 2013 film, depicts a man's interactions
with the "world's first artificially intelligent
operating system."
The 2014 film The Imitation Game follows the
life of computer scientist Alan Turing, developer
of the Turing Test.
The 2015 film Ex Machina centers around human
interaction with artificial intelligence.
== See also ==
