Hi, today I’m going to be talking about
Text and Data Mining, an exciting technique
to get lots of information fast so you can
get on with your research.
Text and data mining (TDM) is a process through
which large amounts of information can be
analysed electronically. This allows researchers
to work through far more research content
than they would ever be able to do manually.
Imagine having to go through thousands of
papers looking for a specific term or keyword.
It would take a ridiculously long period of
time and that would be if a single researcher
just looked at abstracts and conclusions,
let alone reading each individual paper cover
to cover.
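The sort of automated keyword search just described can be sketched in a few lines of Python. This is only a toy illustration: the "papers" and the search term are invented for the example, and a real TDM pipeline would work over thousands of full-text files rather than inline strings.

```python
# Minimal sketch of keyword searching across many documents at once,
# the kind of repetitive task TDM automates. The "papers" below are
# invented stand-ins for a real corpus of full-text articles.
import re
from collections import Counter

papers = {
    "paper_one": "Mosquito populations correlate with malaria incidence.",
    "paper_two": "We found no link between diet and malaria outcomes.",
    "paper_three": "This study examines reef biodiversity over time.",
}

def count_term(corpus, term):
    """Count case-insensitive whole-word occurrences of a term per document."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return Counter({name: len(pattern.findall(text))
                    for name, text in corpus.items()})

hits = count_term(papers, "malaria")
# Keep only the documents that mention the term at all:
matching = [name for name, n in hits.items() if n > 0]
```

The same loop that takes a computer milliseconds per document is exactly what would take a human researcher weeks of skim-reading.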
TDM can be used for more than going through
research papers, such as working through
data sets to identify trends and patterns.
TDM is extremely useful in the sciences, especially
in areas such as medical research, where the
literature can be analysed for occurrences
of certain diseases. However, TDM isn't
limited to these sorts of applications
and can be used across disciplines, such as
in linguistics for word-frequency analysis and
in conservation for species tracking or associated content.
A lot of TDM is made possible by individuals coding and building programs
that can carry out automated analysis. If
you saw our video on alternative communities,
you’ll know that GitHub is a goldmine for
finding programs and code written by others,
especially if you do not have the knowledge
to start such a program from scratch.
One local Cambridge project that uses GitHub
as part of its development process is ContentMine.
ContentMine is an ever-developing set of tools and does a lot of work with literature reviews
and other research-based analysis. Headed up by Cambridge chemist Peter Murray-Rust,
who is also funded by the Shuttleworth Foundation,
ContentMine has been used to discover research
papers on the Zika virus, search for species
on the IUCN endangered list, and analyse
clinical trial reviews.
However, there are some limitations to TDM,
the biggest relating to copyright. TDM works
by making a copy of a resource so a program
can analyse it and then typically discards
that copy once the analysis is done. Ordinarily
this type of wholesale copying would be illegal
under UK Copyright Law but there is an exception
in the instance of computational analysis
for non-commercial research, which is good
news for anyone working in academia. Unfortunately
it isn’t all good news.
Currently it is illegal to apply these sorts
of techniques to the services where many researchers
find literature for their work – databases.
Many articles and other research outputs come
via large databases such as JSTOR or ScienceDirect, and due to the infrastructure and hosting
investments on the part of the provider, these are protected by a 'database right', which
means they aren't open for TDM access.
Some providers do offer their own APIs for analysis,
but this means using someone else's program
rather than the one you diligently built for
a specific purpose. All is not lost, however. There are open, peer-reviewed
resources out there that can be used in conjunction
with TDM such as Europe PubMed Central and
open data sets via Wikidata. Also, many within
the academic community are campaigning to
have more open access to data and research
so that TDM can be carried out legally and
with complete access to the world’s literature
without hindrance. It should be noted that
this introduction is deliberately general
to give an overview of a research technique
and some of its limitations. If you wish to
use TDM in your work, we highly recommend
that you ensure you are doing so legally and
that you contact like-minded folk such as the
team at ContentMine to ask for advice.
So that was a quick look at Text and Data
Mining, which I hope introduced something
new and exciting to try out as part
of your research. So until next time, bye!
