Good morning, everyone.
I know we are just on top of 11:15.
I'm sure there are folks who are going to
start coming in, but two minutes ago things
started to get very quiet after the chime
so I thought I would go ahead and get started.
My name is Beth Namachchivaya, I am here on
behalf of the project team that was funded
by the Institute of Museum and Library Services
to host a national forum on text data mining
research, using in-copyright and use-limited
text datasets.
That forum happened last week in Chicago,
and in the room today with us there are at
least a couple of participants who were there
at the one-and-a-half day forum.
So I'm going to ask my colleagues, if you
have input that you would like to add, please
feel free, I would welcome that.
So when we proposed this idea for the forum,
we identified several stakeholder groups that
we felt really needed to come together at
this forum to talk about the challenges and
also the way forward to facilitate text data
mining with in-copyright and licensed text.
So our goal was to situate - and I'm gonna
call it TDM for short - it was to situate
TDM support and education and conversations
by academic libraries within a broader landscape,
because we felt that TDM really appeared as
though it were a niche activity, something
that was really done by an erudite group who
were both technically savvy and also had the
wherewithal to work out, not only the technical
problems, but to also work through the methods
that were necessary to successfully do text
data mining and then incorporate that into
their research.
So we really wanted to situate TDM within
a much broader landscape within research libraries,
we wanted to articulate points of convergence
and divergence among the stakeholder groups,
and we wanted to develop a strategy for libraries
to expand research data services to include
support frameworks for text data mining.
We also wanted to leverage partnerships that
libraries could and have been forging with
researchers, professional and scholarly organizations,
and the legal community to support more open
and accessible TDM.
We felt that that was really a necessary part
of what we were going to pursue together.
So, as you can see, the attendees at the meeting
represented a really impressive scope and
depth of expertise.
We brought together researchers who are at
the heart of text data mining, librarians,
content providers - and this included both
commercial and openly accessible content providers.
We brought together legal experts, and that
included practitioners, both in libraries
and in the broader academic legal networks.
And we also brought together legal researchers
in these groups.
And then we brought together representatives
from professional associations, organizations
that advocate in the space of research and
networked information like CNI.
Professional association members included
a representative from the National Academy
of Sciences, ARL, ACRL, BRDI - the Board of
Research Data and Information - and all in
all we identified a team of 25 experts in
this area who met, as I mentioned, last week
in Chicago.
So with these questions framing the work that
we did, the project team set out throughout
the fall and winter of 2017 and '18, and we
worked on two things: a scoping literature
review that formed the foundation of a pre-forum
discussion paper, and we worked simultaneously
with the project participants to explore their
perspectives on the landscape of TDM.
We asked them to develop SWOT analyses, we
did individual interviews with them, and just
prior to the forum we asked them to write
forum statements.
So the forum attendees were really a diverse
group, they had a wide range of ideas and
opinions.
So when we entered the room last week there
was one statement that we all agreed on at
the start of the forum, and that was the fact
that copyright law and resource licensing
complicate research with text data.
So we started from that point of reference
and we worked forward.
We also wanted to make sure that we framed
a definition of TDM, and really we defined
it as "computational processes for applying
structure to unstructured electronic texts
and employing statistical methods to discover
new information and reveal patterns in that
processed data."
These data might include electronic journal
articles, newspapers, books, or more informal
textual data such as consumer reviews or blog
posts.
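To make that definition concrete, here is a minimal sketch, assuming a toy corpus of three invented snippets. The document names, the texts, and the helper functions are all hypothetical, and TF-IDF is just one simple statistical method chosen for illustration, not something the forum prescribed. The sketch first applies structure to the raw text (word counts per document) and then scores which terms set each document apart.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: invented snippets standing in for unstructured text.
DOCS = {
    "review_1": "The battery life is great and the screen is great.",
    "review_2": "Battery died after a week. Terrible battery.",
    "blog_1": "The screen resolution matters more than battery life.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(docs):
    """Apply structure: map each document id to a Counter of its tokens."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}


def tf_idf(counts):
    """Score each term per document as term frequency times log inverse document frequency."""
    n_docs = len(counts)
    doc_freq = Counter()
    for counter in counts.values():
        doc_freq.update(counter.keys())  # count each term once per document
    scores = {}
    for doc_id, counter in counts.items():
        total = sum(counter.values())
        scores[doc_id] = {
            term: (freq / total) * math.log(n_docs / doc_freq[term])
            for term, freq in counter.items()
        }
    return scores


counts = term_counts(DOCS)
scores = tf_idf(counts)

# Print the three highest-scoring terms for each document.
for doc_id, term_scores in scores.items():
    ranked = sorted(term_scores, key=term_scores.get, reverse=True)
    print(doc_id, ranked[:3])
```

A word that appears in every document, such as "battery" here, scores zero, while words concentrated in one document rise to the top. That is the sense in which a statistical method reveals patterns in the processed data.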
In scoping the project, we set aside numeric
data and non-textual content such as static images,
audio, or video.
We got a lot of pushback about that throughout
the process of the forum and this was a really
interesting tension that we needed to explore,
not only with the attendees, but as we scoped
the national forum proposal and worked with
IMLS and got feedback from the proposal reviewers
who suggested, "We want you to narrow this,
we want you to really focus on text."
And then as we got to the forum, folks in
the forum said, "You know, there are really
relationships between what happens in TDM
and what happens with mining and analyzing
other media, so we understand and we feel
that this is really an important backdrop
for what happens in this space going forward."
So we also struggled with something else,
and that was with what to call this data.
We finally came to terms with this and we
called it "use-limited data."
Coming to terms with it, the intellectual
property dimensions, was really difficult
for us.
In the early proposal development we referred
to these data as "IP-restricted."
We found that term sort of hindered rather
than facilitated our discussion with stakeholders.
And as we submitted the proposal, we settled
on "limited access."
We didn't really feel that encompassed the
full spectrum of challenges that scholars
face throughout the research life cycle when
they're trying to work with these data.
So that's how we came to grips with talking
about use-limited data.
We feel that that better describes the more
restrictive facet of research with these data,
how they may be used, and it encompasses a
spectrum of activities, ranging from modes
of access to redistribution for validation
and reuse.
So we did use several methods, as I mentioned
earlier.
One was the scoping review of the literature,
that was a targeted review of scholarship
on issues relating to mining texts that are
under copyright, subject to licensing agreements,
or otherwise restricted due to intellectual
property.
We looked at works, primarily in English,
for the past 17 years.
We focused primarily on the U.S. but the team
also included scholarship that addressed other
legal jurisdictions including Canada, Australia,
the UK, and the European Union.
We did searches on prominent databases
using terms related to law, library and information
science, computer science, linguistics, e-science,
digital humanities, and computational social
science.
We also did interviews with each of the forum
participants, we reviewed the notes and the
interview transcripts, and we identified prominent
themes.
You might say that we did a lot of mining
of the input that we got from the participants.
We also mined SWOT analyses that we asked
the participants to do.
We asked them to look at and articulate very
succinctly the strengths, weaknesses, opportunities,
and threats in this space.
We also asked them to develop a forum statement,
a very brief and succinct one-to-two-page
statement about what they felt was important
that needed to happen in this area to make
TDM more accessible.
And another - it's not a method, but I feel
like I want to call your attention to something
that we used to really facilitate conversations
and action-focused work at the forum - and
that was a framework called Liberating Structures.
If you Google "Liberating Structures" on the
web you'll find a website which I understand
is cribbed from a publication, very largely
from another publication, that essentially
provides some really good ideas for eliciting
input from groups and essentially getting
groups to interact with each other but also
to focus on the places where you want to go.
I want to talk a little more about the SWOT
analysis because I lived and breathed with
it for a couple of weeks all by myself in
my apartment, and I identified a number of
themes coming out of this SWOT analysis but
I did something that I thought was really
helpful.
Each attendee did a SWOT analysis.
I coded those according to these themes, but
I also coded them according to what stakeholder
group they represented.
And then I took the information that fell
under each of the themes, regardless of which
stakeholder group, and I combined all of that
information.
So under a theme I could see comments on strengths,
weaknesses, opportunities, and threats from
all of these stakeholders together.
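The coding and combining step just described can be sketched roughly as follows. This is a minimal illustration, assuming each coded comment is stored as a tuple of stakeholder group, theme, SWOT category, and text. The entries, labels, and function name are hypothetical placeholders, not the project's actual data or tooling.

```python
from collections import defaultdict

# Hypothetical coded entries: (stakeholder group, theme, SWOT category, comment).
# The comment texts are placeholders, not real forum input.
ENTRIES = [
    ("researcher", "legal and policy", "threat", "placeholder comment 1"),
    ("librarian", "legal and policy", "weakness", "placeholder comment 2"),
    ("content provider", "business models", "opportunity", "placeholder comment 3"),
    ("librarian", "business models", "threat", "placeholder comment 4"),
]


def combine_by_theme(entries):
    """Group comments by theme, then by SWOT category, keeping the stakeholder label."""
    combined = defaultdict(lambda: defaultdict(list))
    for stakeholder, theme, category, comment in entries:
        combined[theme][category].append((stakeholder, comment))
    return combined


by_theme = combine_by_theme(ENTRIES)

# Show every theme with its comments from all stakeholder groups side by side.
for theme, categories in by_theme.items():
    print(theme)
    for category, items in categories.items():
        for stakeholder, comment in items:
            print(f"  {category} ({stakeholder}): {comment}")
```

Keeping the stakeholder label attached to each comment means the combined view under a theme still shows who said what, which is what makes cross-group agreement or tension visible.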
So mining that, I wanted to talk about a couple
of themes that really jumped out, and they
jumped out of the SWOT analysis across these
groups.
There's tension over working with content
and the research process of working with content.
There are differences of opinion among researchers,
librarians, and content providers about the
best ways to provide access to use-restricted
data.
Okay, that's a no-brainer.
But it goes a little bit deeper than that.
All of these groups share the concern that
there really isn't a shared terminology across
the disciplinary and professional boundaries.
There are ad hoc procedures for transferring
data, uneven data quality, idiosyncratic use
of data formats among content providers - all
of these things really hinder greater access
to and deeper analysis of these data.
So there was a lot of shared concern over
this.
Practices for providing access to these data
are all over the map and everybody to a person
agreed that that really doesn't make sense
if we want to promote this research.
Even well-resourced universities struggle
to provide access to content that has been
delivered, say, on a FireWire drive that shows
up in the library with the caveat, "Make sure
you destroy this after you've done your research
with it and don't redistribute it."
We also noticed that there is a chilling effect
of use restrictions in TDM research.
Folks said that researchers continue
to do work with use-restricted data, but they
don't openly communicate their methods and
their data sources.
We're also in this sort of rough area where
researchers really don't have a way to communicate
how other researchers could replicate or repurpose
the findings that they have from their research,
and I think this is one of the things that
everyone agreed was probably a significant
obstacle to TDM uptake throughout
the disciplines.
There were a number of legal and policy issues
that were pointed out.
It is not surprising, I'm sure you will not
be surprised by the fact that this theme was
the most commented-on across all of the stakeholder
groups.
In the United States, the fair use argument
for text mining frequently is grounded on
the concept of non-consumptive research, which,
although it was defined in 2010 in the rejected
settlement agreement in Authors Guild
v. Google, is in practice more complicated
than it first appears.
The boundaries between consumptive use and
non-consumptive research are really not well-developed.
The line between checking results - which
is permissible in, say, the Google Books decision
- and that line between doing that and human
reading is not a bright line, and often that
is the thing that researchers want to do.
They want to look at the summative results
that come out of running algorithms across
one or more text corpora, and then they want
to go back and read portions of that and really
more deeply understand what they're doing.
There was also some interesting conversation
and tension around business models.
There's tension about the role of commercialization
in text mining services.
Some fear that if they haven't already, universities
are going to lose ground to large corporations
- such as Google - who will serve as data
brokers for researchers instead of libraries.
Others noted that publishers' interest in
data mining extends beyond building TDM platforms
and provisioning data access; it also includes
mining journal content for internal business
purposes.
Some folks noted that licensed data sets are
a source of economic viability.
This is a way to extend a thriving publishing
industry while a number of stakeholders are
concerned about further monetizing access
for mining purposes.
So, what were some of our outcomes?
One of the areas around which there was a
pretty broad consensus at the end of a day-and-a-half
of conversation, and thinking both as a group
of the whole but also thinking in small groups
and working through some of the tensions that
exist among stakeholder groups, we all came
to this conclusion that TDM is really part
of a larger conversation.
It's really, if you think about the libraries'
role, about libraries making content more
useful and more usable in the digital age.
We struggled as a group around identifying
words and phrases that would elicit a pretty
high comfort level across our stakeholder
groups.
We talked about, "Is this really about open
access for data mining?"
And we realized that there isn't a strong
comfort level across all of our stakeholder
groups with talking about making everything
open, but there was a comfort level with focusing
on making content more useful and more usable
in the research process.
Building on that, there was a pretty good
shared understanding that more useful and
usable content really does mean it is accessible.
And we need to figure out how to frame the
conversations that, say, libraries have on
behalf of researchers, or libraries and researchers
and content providers have about figuring
out how we really have conversations that
express what it is a researcher wants to do,
how it can be done, and then how it can be
done within the context of working with use-limited
data without raising tensions that sometimes
lead to a unilateral, "No, you can't," or,
"You can only do this, but you can't do that."
We also realized that reading and content
mining, as they were outlined in that 2010
conversation about consumptive and non-consumptive
research, these are really not mutually exclusive
research activities for all researchers, and
probably they're not mutually exclusive for
most researchers.
Because inevitably, if you see something
that's referred to or occurs
in a summative way, you really want to dig
in and understand, what is the context around
that?
How can I apply what I know about this body
of research to these pointers in the non-consumptive
research?
And we also came to this realization that
content mining can drive business models and
revenue if we work on this and if we use it
appropriately.
It doesn't mean that content mining as a revenue-generating
activity needs to be necessarily an add-on
to something that is already fairly expensive
and out of reach for a number of institutions
and groups.
So I mentioned at the beginning of the presentation
that we were really geared toward making commitments,
and we were geared toward asking individuals
to make commitments, but we also worked in
a very focused way on getting groups of people
together, not just according to their stakeholder
groups, but also giving them opportunities
to talk across the stakeholder groups to identify
things that they wanted to do, things that
were actually coming out of their conversations
together.
So at the end of the one-and-a-half days,
we had a number of activities that groups
are working on together.
One group is working on a declaration of
principles around text data mining.
Another group is working on making recommendations
for academic library services, so a pragmatic
approach.
Another group is working on legal infrastructure
for computational research.
Yet another group is developing a grant proposal
to develop legal and intellectual property
workshops for librarians and for researchers.
There are conversations happening around a
pilot TDM service working with HathiTrust,
Portico, publishers, and Crossref.
There are more things happening, but we were
really quite excited about the fact that there
are a number of on-the-ground activities that
working groups are pursuing, and we're actually
in the process of setting up Google Groups
for them to continue to do this.
Next steps in addition to that include a whitepaper,
which ACRL intends to publish this summer,
and I also wanted to extend an invitation.
If anyone wants to get involved, feel free
to contact me, that's my email address.
We will set up a more general email box for
folks who want to know more about the project,
and in the beginning, on my first slide - if
I can page back to it, which I will in a moment
- there is a link to the website for the project
and we're going to use that to keep people
informed.
So I would like to break and take questions,
comments from those in the audience.
If you want to get the URL for the project
site, that is another way to keep track of what's
happening with the project going forward.
Thanks, everyone.
I really appreciate it, thanks.
