- [Mary] Good morning.
Give everybody a minute or two to settle in.
(audience member shushes)
Oh wow, was that a librarian who did that?
(audience laughs)
I mean, that was very
effective, I must say.
I'm Mary Augusta Thomas.
I'm the deputy director of Libraries
and I wanna welcome all of you,
so, good morning and welcome.
Nancy Gwinn, our director
of Libraries is off
as some of you know,
working on the institution's
new strategic plan today,
as is Anne Van Camp,
the director of archives,
and I suspect Ann Speyer,
the CIO of the institution.
Ann, Anne and Nancy are
responsible for bringing you
this series, this lecture series,
which we said this morning
does not actually have a title,
but is the continuation of
something we started last year
bringing in people sort
of on the cutting edge
to give us their thoughts,
and help us think through many issues.
Today is the second in
the '09 series, the third
will be December 11th, keep
watching your Facebook,
Twitter, Yammer, and the calendar,
because we'll be posting more information
as the next event comes up.
So it's my pleasure to welcome Dan Cohen,
who will be introduced
by Martin Kalfatovic,
the Library's assistant
director for digital services
and to welcome you all this morning.
(audience laughs)
- That's my ode to Charlie
Henry Gibson, our late departed.
It's my pleasure to introduce Dan Cohen
here to the Smithsonian and
I have to pause for a moment
to Twitter that, and as
you know probably Dan
is a very effective
twitterer reaching thousands
of people with his
pronouncements and interesting
news related to New
Media, American history,
and other of the new things
that he's interested in.
Dan's currently an
associate professor in the
Department of History and Art History in
George Mason University,
and is the director of the
Center for History and New
Media there at George Mason.
His own research is in
European and American
intellectual history,
the history of science,
particularly mathematics,
and the intersection of
history and computing.
He's the author or coauthor
of numerous publications
including, Digital History:
A Guide to Gathering,
Preserving and Presenting
the Past on the Web,
and also a more recent
publication, which is
Equations from God: Pure
Mathematics and Victorian Faith.
He also has done numerous
articles or book chapters
on the history of
mathematics and religion,
the teaching of history,
and the future of history
in the digital age.
Dan received his bachelor's
degree from Princeton,
a master's from Harvard,
and a doctorate from Yale.
He couldn't seem to, I guess,
find one school that he really liked.
(audience laughs)
At the Center for History and
New Media, which he's also
co-directed in the past before
taking over as director,
he has supervised numerous projects,
including the September 11th archive,
and has been a leader
in developing important
new tools for humanities,
history, et cetera,
in terms of the New Media things.
The key one that I really like is Omeka,
and he's worked with
some Smithsonian staff
on an Omeka installation,
he and his staff.
Also important is Zotero,
which is a reference
manager that you can just
install on your computer
and pull in resources and cite them,
and use them in all kinds
of new and wonderful ways.
In fact, Dan calls
himself the chief Zoteran
on his Twitter biography.
Dan is also a very good
friend of the Smithsonian.
He participated in the
Smithsonian 2.0 meetings
that occurred earlier
this year in January.
He was also a reader for the Smithsonian
Digitization Strategic Plan,
and most recently, he served as a reader
for that strategic plan.
So it's again, a pleasure to
welcome here to the Smithsonian
to talk about Scholars and
the Everywhere Library, Dan.
(audience applauds)
- Thanks very much for that
very kind introduction.
I really think it's
great to be here to talk
about this topic, I think
with everything that's
going on with the digitization
plan, and just overall
strategic planning at the Smithsonian.
I think it's an exciting
time to be at the Smithsonian
and to, if you're part
of the general public who
appreciates the Smithsonian,
to think about the
possibilities for the
institution in the next
five, or 10, or 20 years, and I suspect,
even though things move fairly
rapidly in the digital world
that we are talking about
that size of a project.
There's a lot going on here.
There are many libraries,
there are many museums,
there are many collections,
and I think it's fascinating
to kind of think through
and go through some
thought experiments, which is
really what I wanna do today
about what happens when
you make this transition
from the physical to the digital.
Obviously, I'm a traditionally
trained historian
who was trained to go into
dusty archives and find things,
and write about them on paper.
It's interesting to
think about that process,
of just the scholarly process
and what happens to that
as you encounter new kinds of collections
that may be still partially physical,
but also have finding aids online,
and then ultimately scans
online, and other forms online,
and to think about what
we might do with that.
So again, I really
appreciate the invitation
and I hope, again, that
we can have a discussion.
I know we'll have ample
time at the end of my talk
but you should feel free
to interrupt me if you
see anything objectionable,
or you want me to elaborate on any point,
I'm happy to do that today.
So my main point today
I think will be this.
What I'd like to say is
that we are already in 2009
at a point in which,
few scholars use a single
library or archive.
Everyone is looking at
things online, the browser
is already, has been
probably for many years now,
the entry point for scholarship.
There are of course,
still cases where scholars
have a very specific
collection that they will visit
and that their entire
book, or dissertation,
or article, will be based upon.
But in most cases, we are already
searching for things online,
taking, plucking objects of interest
from multiple collections.
If you can think about
Smithsonian as an umbrella
for multiple collections,
which it certainly is,
you can imagine the scholar coming in and
grabbing materials of interest
for a particular topic
from multiple collections.
It's that process that
we're assessing at the
Center for History and New Media.
I do wanna say that there's
over 50 of us there,
and while I oversee it really,
almost all the work is
done by my incredible staff.
They really, together, we are
thinking about this process,
and to address that process of
scholarship in a digital age
we've done lots of things.
We've built our own collections
that we'll talk about,
I'll talk about the 9/11
collection and what we
might be able to do with that.
We have built educational
tools, learning modules
for how you might use digital
materials in the classroom.
We've also decided that
we had to build software
because we didn't find tools out there
that were really made for
the kinds of scholarly work
that we thought we were
doing, and that ultimately,
everyone would be doing.
And so we did end up building software,
two major open-source software projects,
everything we do is open
access and open source,
and Zotero is for the
scholar, the researcher,
and Omeka is for the content provider,
which is a terrible word for it.
Let's call them museums,
libraries, archives,
places that want to put
content online in a way
that's maximally accessible
and maximally usable
for the process of scholarship.
So we felt that we've had
to cover this entire realm
to address this problem,
and it is a huge problem,
and I think as the
Smithsonian goes through
the growing pains of addressing
how it will exist online,
it will have to think
through these very same
problems that we've gone through
and so I hope I can be
at least modestly helpful
in sort of talking about
some of the ways that
we've addressed these issues.
So, what does it mean to do research
across multiple collections?
What I love about being
in this realm is...
The web let's say, is 20
years old this year, right?
1989, the paper from
Tim Berners-Lee at CERN,
the place where the Large
Hadron Collider exists,
the center for nuclear research in Europe.
You know it's 20 years old and
yet, we exist in this realm
where there are all
these gurus about the web
and social media gurus,
I'm always followed by
new media gurus, and as a
historian I take a long view.
I'm sort of perplexed
by the fact that we can
address a 20-year old medium and say that
we're a guru on it.
I think we're still
really feeling this out,
and I wanna bring some
skepticism into this talk today.
As I go through it, I'll point
out some places of weakness,
so I don't think you
should consider me to be
a pure evangelist for this.
Again, I'm really a pragmatist about it.
I'm a pragmatist about what
we can do in this medium.
There are many advantages to
it, that we'll talk about,
but there are some disadvantages,
and I'll talk about those as well.
Since I'm a historian,
let's just start at the beginning here,
and talk about really what
this transition means.
What I'd like to do is
go back and just look
at the history, the digital
history here of the web.
Here is Yahoo's homepage,
from very early on, I think it's 1995,
and you can tell what
happened here, right?
In fact, one of their first
hires was a librarian,
and so what did Yahoo try to do?
A lot of people don't remember this,
but they tried to catalog the web.
Categorize websites, put
them under categories,
arts, business, economy, et cetera.
So you'd go to arts and humanities,
and then you'd drill
down from that to history,
and then you'd go from
that to American history,
and then to Civil War,
and then to Bull Run,
and all these things...
Someone at some point thought
that this was possible to do.
I mean we can giggle about
it, but again it shows you
how difficult it is to re-conceptualize
this process of being a library,
or being a portal to
research in a digital age.
There's no anticipation
here of the exponential
growth of the web, or
how the casual surfer,
or the intent researcher,
is going to address this.
So we bring in a cataloger,
we try to catalog the whole thing.
Well, we know the history well.
Three years later, this launched.
This is the 1998 first Google homepage,
also, rather comical in its design,
although, not much
different than it is today.
Oh, you can tell it came from Stanford,
and that it was done by
geeks who like Linux,
because they have
special searches on that.
But this is already a rather
large re-conceptualization,
that I think, we should pause and take
the time to appreciate what went on here.
All of a sudden we were
talking about free form,
a full index of the web, a
full text index of the web.
So we've already gone
beyond what librarians like
to call, metadata, or
descriptions about these webpages
to just a full text search.
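A minimal sketch of what a full text index means here, as opposed to a hand-built category tree: every word in every page points back to the pages that contain it. The page texts, the names `build_index` and `search`, and the all-words-must-match rule are assumptions made for illustration, not how any particular search engine works.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages, invented for this example.
pages = {
    "page1": "Letters from the first battle of Bull Run",
    "page2": "A history of the Erie Canal",
    "page3": "Bull Run and the early Civil War",
}
index = build_index(pages)
print(sorted(search(index, "bull run")))  # ['page1', 'page3']
```

Nobody had to decide ahead of time that "Bull Run" was a category; the query is answered from the text itself.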
Of course there were other
search engines out there,
AltaVista and HotBot if
you remember these things.
Google crushed them
all, through some rather
savvy algorithms about how the web works,
particularly the way it's interconnected,
and that those
interconnections form votes about
what's important and what's not.
But still, there is a
re-conceptualization here
that maybe in a digital age,
everything since it's hooked
up to the internet becomes just
part of an undifferentiated
mass, and that most researchers
will prefer to search
across the entirety of it
to find what they want.
They will not care, as much as librarians,
and archivists, and museum
curators would like,
they will not care as much
about the individual locations,
or associations, or affiliations,
of any of those collected items.
They are going to just flit
across these collections
to find objects, and so
this transition here,
from this to this, means that,
on this you would drill down,
you would eventually get to
the Smithsonian, get
to a particular library
and go and look at that
collection, and search
that collection's online
catalog, if there was one.
When we get to here,
we're just already making
the assumption that we're
going to dance across
the various collections as we may.
So, the main point here is
that we've already gotten
a decade ago to the point
where libraries and archives
are not islands in a digital age.
They are connected together,
they are traversed brazenly
by those unappreciative scholars...
We need to think more
about what this means.
So again, I'm gonna walk you through
a series of experiments,
some real, some imagined,
about what this Everywhere
Library implies.
For my research, this
was really a revelation.
This is an early experiment,
it's already five years old,
in which,
three very terrific
historical collections of mathematics,
really in the period
that I'm interested in,
were able to combine online.
So they each have their
own distinct collections
at Michigan, Cornell and Goettingen,
but they have been able to
create a unified project,
a unified search across these collections,
so that when you search for, let's say,
non-Euclidean geometry, you pull up
these books from the scanned books
from all of these sites at one time.
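A unified search of this kind can be sketched as one query fanned out to each collection, with the hits merged into a single list. This is only an illustration under stated assumptions: the records are placeholders invented here, the matching is a plain substring test on titles, and the function names are hypothetical. It does not describe how the three libraries actually built their project.

```python
def search_collection(records, query):
    """Return the records whose title contains the query (case-insensitive)."""
    q = query.lower()
    return [r for r in records if q in r["title"].lower()]


def unified_search(collections, query):
    """Search every collection and merge the hits, tagging each with its source."""
    hits = []
    for source, records in collections.items():
        for record in search_collection(records, query):
            hits.append({**record, "source": source})
    # Sort by year so the merged list reads as one chronology.
    return sorted(hits, key=lambda h: (h["year"], h["title"]))


# Placeholder records, invented for illustration; not real catalog entries.
collections = {
    "Michigan": [
        {"title": "Placeholder Treatise on Non-Euclidean Geometry", "year": 1891},
        {"title": "Placeholder Text on Algebra", "year": 1880},
    ],
    "Cornell": [
        {"title": "Placeholder Lectures on Non-Euclidean Geometry", "year": 1885},
    ],
    "Goettingen": [
        {"title": "Placeholder Notes on Number Theory", "year": 1870},
        {"title": "Placeholder Essay on Non-Euclidean Geometry", "year": 1902},
    ],
}

hits = unified_search(collections, "non-euclidean geometry")
for hit in hits:
    print(hit["year"], hit["source"], hit["title"])
```

Because each hit keeps its `source`, the researcher can still see where an item lives even though the search ignored those boundaries.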
Now we could say this
is just a time-saver,
it's rather nice that
we don't have to go to
these three separate sites,
but you can see already
I think some revelation here, right?
Other collections could link into this
to create a global unified collection of
historical monographs in mathematics.
There are potential search
fine-tuning and other
kinds of efficiencies,
that are gained by this process.
So in using this collection,
I really found myself
exploring the collection in a new way,
because I wasn't going to
these individual collections
and downloading PDFs of the books,
and then reading them in the
normal scholarly sequence.
I was able to think about, for instance,
the spread of, let's say,
non-Euclidean geometry
in the United States,
where it happened, what the
publication scheme was on that,
how many books were
published in each year.
It would have been much
harder to do if I was
just going to each collection,
especially if I was doing that physically.
I wouldn't think about these
things, but I was able to
prospect very quickly to think about,
how many hits I'm going to
get in each of these areas,
where the publications
were, things like that.
Okay, so here's one terrific example,
and I think that of all collections,
here's a fantastic collection that really
all historians of the American South,
particularly the Antebellum South,
go to UNC Chapel Hill to look at the
Southern Historical Collection.
This is their online site,
Documenting the American South.
They're in the process of
digitizing, going through
the same growing pains of
getting their stuff online.
It will be a multi-decade
effort because they've got
a huge number of plantation records.
It's really a collection of collections,
and actually extends beyond UNC,
and they're thinking about
how to put this stuff online.
These are some of their initial attempts,
and again, you can think
about, and look at the baggage
of this process of going online.
This is, by the way, not a criticism,
I think they're doing
incredible things at UNC
and they're really on the
cutting edge in many respects,
but just to show you some of
what they've gone through.
One of the first things
they did is put online
their LC subject headings,
their Library of Congress
official subject headings
for their documents.
This is a very long webpage.
(audience laughs)
There's a lot of, I mean,
you can see just on slavery,
the LC subject headings are...
Like the Dewey Decimal System,
they extend out, right?
So you start with slavery,
and then you might go to a nation state,
and then you might go
from that to history,
and then a century, and then
a social and cultural history,
and then drinking and gambling--
You could go on and on,
and add these things up.
But this is what scholars,
and I certainly grew up
using a card catalog, and this
is what we were trained to do
and so you can think about this baggage of
well, what should we put online,
well here's a finding aid.
We've forgotten about that
Google search box already
in this process that
everyone's used to and
has so much
power
as a single box,
and that user interface,
and what it says about
the process of research in a digital age.
Obviously, they figured out
that this was not so helpful,
and so in fact, they power
their site by Google.
You can get custom Google
hardware or software
to scan your site, and do web searches.
But you can see already
we've run into problems here,
and I think particularly for
history this is a huge problem
as we move away from the web,
I mentioned it offhand before,
but the reason Google search is so good,
the insight that Larry
Page had in PageRank,
is that each link to
another webpage is a vote
for that webpage.
So if a lot of webpages
link to a particular site,
and in that link it says Abraham Lincoln,
Google figures out that if
you wanna know something
about Abraham Lincoln
you should go to the site
cited by those other webpages.
So it's a reputational system.
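The links-as-votes idea can be sketched with a small power-iteration version of PageRank. This is a textbook simplification, not Google's production system: the link graph is invented, the damping factor of 0.85 and the iteration count are conventional assumptions, and the sketch ignores the anchor text (the words "Abraham Lincoln" inside the link) that the talk also mentions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages from a dict mapping each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages all link to one Lincoln site.
links = {
    "blog_a": ["lincoln_site"],
    "blog_b": ["lincoln_site"],
    "blog_c": ["lincoln_site", "blog_a"],
    "lincoln_site": [],
}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best)  # lincoln_site
```

Note that `lincoln_site` itself links to nothing, which is exactly the situation of a scanned archival document: it can receive votes only if something else links to it.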
When you deal with flat
archives that don't have
hyperlinks in them, you
are losing the possibility
of having those connections.
Now wags in the audience
will probably note that,
for instance, books
have footnotes in them,
and in fact, Google is trying
to mine that data as well.
That may be votes, but
there's lots of things
where you don't link to other
things from within a document,
and so it's very hard to figure out--
If I did a search on Booker T. Washington,
what's the most important
document in the 16 million
documents of the Southern
Historical Collection?
That is a very tough
problem for Google to solve.
I think that here is where
scholarly institutions,
like the Smithsonian,
have a chance to really make
a significant contribution
to this realm of digital research.
In that, we do need to think
about what the interfaces are,
what the indexing algorithms are
for presenting information in a way
that scholars are able to find it.
We need to think very carefully about
the discovery process of the scholar.
How do we search across collections,
how do we find things of interest to us?
These things are becoming
incredibly complicated,
they're much more complicated
actually than a Google search,
even though Google is dealing
with 10 billion webpages.
Those webpages are, in some
ways, much easier to deal with
than let's say, the 130
million objects in the
Library of Congress, or
the two million objects
at the Smithsonian.
We have to think about, what
do we do with this process,
they clearly thought through
this and then decided,
well we'll have it related,
we'll use Google's related
process but these also
don't really help out well.
If you do a lot of these
searches and look at, click on--
They added these little
links for related documents,
they don't really help out very much.
So we get to this point,
this is where I'm often
criticized if people think I'm a
cyber enthusiast,
someone who just believes
this is all great.
Scholars will say, well--
And in fact I was at a meeting
at UNC with a bunch of...
I got criticized for
calling them this but,
analog historians, if
I'm a digital historian,
and those analog historians,
without exception,
without exception, all
have that mythical story
of the scholarly process.
It was a warm, humid day in Chapel Hill.
I got my cup of coffee,
I went to the archives.
I opened up this folio,
and within it was this one
document that formed my
Pulitzer Prize winning book.
Already I don't wanna
respond to that moment,
except, I do wanna respond
to it, and I've heard
these stories from
virtually every scholar.
I had a professor at Harvard, Harvey Cox,
who gave this great story
about going into Widener
Library at Harvard.
He was reaching for a book,
and as he pulled it off,
another book fell off the
shelves and hit him on the head,
and he wrote a book about the book
that fell off this shelf
and hit him on the head.
And this for me is that notion
of serendipity, of scholarly
serendipity that we all have,
especially of physical archives.
But of course, it blinds
us to the fact that
for every example of serendipity
in a physical archive,
there are countless other
invisible examples where
you miss something because
it was stuck to the back
of another document, or
the archivist in England
was fussy and wouldn't
let you into the room
to look at some cache of
documents because some lord
didn't want their papers exposed.
Or, you weren't able to
actually physically scan through
the number of documents that was there.
You have five days in the
archive, you do the best
you can, and then you go home
and you've missed something.
So no one ever thinks about
the hidden costs of being
analog versus digital.
So I think we need to
think a lot about that
and also, to think about
ways in which we can use
this medium to create
new forms of serendipity,
to enable scholars to
find things that suddenly
they realize that they should know about.
We're thinking a lot
about that in our projects
and in particular, the Zotero project,
that I'll talk about in a bit,
with its online group collections,
and the way that scholars interact.
But again, discovery in
the digital age becomes
very tricky, but I also
think it presents really
some tremendous kinds of possibilities.
There is a computer
scientist named David Mimno
at the University of
Massachusetts Amherst,
who has done some
fascinating experiments with
indexing digital collections,
scholarly collections,
for instance, of books,
and what he's done is he's
come up with a process
called virtual shelves,
in which, he creates the
experience of a shelf.
So, yes, you can go up
to a screen and sort of,
you're looking at a
set of what should be a
linear count of books.
But he alters the shelf based
on your research interests.
So that, let's say, you're
doing research on the
Erie Canal and you're an economist,
he'll set up the shelves in one way.
If you're a historian of
culture, and let's say,
of the culture of the Erie Canal,
he will set up those
shelves in another way.
But you will still have
the opportunity to scan on
either side of let's say
the book you're looking for,
but you might come across other things.
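To make the idea concrete, here is a rough sketch of a shelf that rearranges itself around a reader's interests and still lets the reader browse to either side of a book. It is not Mimno's actual method, which is not described in the talk: the tag-overlap scoring, the function names, and the book records are all assumptions invented for illustration.

```python
def arrange_shelf(books, interests):
    """Order books so those sharing more tags with the interests come first."""
    def score(book):
        return len(book["tags"] & interests)
    # Ties are broken alphabetically so the ordering is stable and predictable.
    return sorted(books, key=lambda b: (-score(b), b["title"]))


def neighbors(shelf, title, width=1):
    """Return titles within `width` positions on either side of the given book."""
    pos = next(i for i, b in enumerate(shelf) if b["title"] == title)
    start = max(0, pos - width)
    return [b["title"] for b in shelf[start:pos + width + 1] if b["title"] != title]


# Hypothetical books and tags, invented for this example.
books = [
    {"title": "Canal Tolls and Trade", "tags": {"erie canal", "economics", "trade"}},
    {"title": "Songs of the Canal", "tags": {"erie canal", "culture", "music"}},
    {"title": "Financing Public Works", "tags": {"economics", "finance"}},
    {"title": "Folk Life in Upstate New York", "tags": {"culture", "folklore"}},
    {"title": "Erie Canal: A Survey", "tags": {"erie canal"}},
]

economist_shelf = arrange_shelf(books, {"erie canal", "economics"})
culture_shelf = arrange_shelf(books, {"erie canal", "culture"})

print(neighbors(economist_shelf, "Erie Canal: A Survey"))
print(neighbors(culture_shelf, "Erie Canal: A Survey"))
```

The same survey volume ends up next to different books depending on who is browsing, which is the kind of engineered serendipity the talk is pointing at.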
I think these are the kinds of experiments
that need to go on, and I
think scholars need to be part
of that process of saying,
you know this is something
of interest to me, this presents
new possibilities for me.
Okay, we have done archives ourselves,
although I'm often criticized
for calling this an archive
because archivists say
there's no provenance on this
collection, our 9/11 collection which was
collected via the web.
And they're right (laughs),
but, we'll use it in the colloquial way
rather than the professional
way, this archive.
I wanna talk about this
archive because I think, again,
this presents some important
thought experiments
about archives in the digital age,
especially because this
is a born-digital archive.
So this archive, which
we did do as you can see
in the logo on the bottom,
in alliance with the
Smithsonian, as well as
some other institutions,
is really a counterpart to
the physical collections
that the Smithsonian has on 9/11.
Right after 9/11, we actually
met with the Smithsonian
and talked about this process.
They were already getting
calls from people who said,
I have a CD-ROM of photographs, or I have
other digital objects that
I want to give to them.
Meanwhile, after we set up
this project, we actually
started getting physical
objects, so we had a nice
exchange program going on,
but this was a project that we
did in concert with the
American Social History Project
in New York.
They sort of focused on the
New York side of things,
and we focused on the
Washington side of things,
and on some of the technologies here.
I think this again, provides
some interesting examples.
So what we did is, we set
up this website to collect
digital materials about 9/11,
that anyone could come to
the site and contribute
to this collection.
Again, it shows you how hard
and the baggage you have,
how hard it is to think
about what you might collect
and how you might present it on the site.
So initially, when we set
this site up, and I don't
have the original design which
I did and is really horrible
(laughs) because I am not a web designer,
but, when you look at
that design all we had was
stories, which was a big
box that you could type in
what happened to you on 9/11.
We had a box for pictures
and we had a box for email,
and sort of the only
thing we could consider.
Well, right after we launched this,
people just started hacking our site.
They started uploading
artwork, Photoshop artwork
that they had made, they
wanted to upload video, audio,
we were contacted by people
who wanted to dump in
their Blackberry communications
that they had saved.
We had all kinds of things,
we had someone write to us
and say, "Well, can you archive my blog?"
And we hadn't thought about that,
that there are all of these blogs,
this is in the very
early years of blogging
and sure, blogs were really fascinating.
If you were a historian and
again, we were trying to
do this thought experiment,
what would we want in 50, or 100 years,
that we need to be proactive to capture,
because this digital stuff
will get deleted, right?
Blogs will go away, they
won't be anywhere unless
we kind of save them and
archive them in some way.
We ended up getting 3000
blogs through this site.
We thought we'd be happy with
1000 or 2000 digital objects,
we ended up getting 150,000
digital objects in the first
two years before we turned
over a 146-gigabyte hard drive,
that was about the size of a matchbox,
to the Library of
Congress for preservation.
It was from 30,000 people,
individuals who contributed
to the archive, and I think just there
there's something fascinating.
This project is often
compared to, for instance,
oral history projects,
and there were some very
traditional oral history projects,
really the gold standard
of these kinds of histories
at Columbia, which went out and they have
Columbia Oral History Project
that you probably know about,
went out and did 300 interviews,
really in-depth interviews,
sat down, transcribed them, et cetera.
We relied on this sort of
hacked together website
to gather stuff, and
there's certainly a lot of
junk in here, it is not
the high quality of the 300
interviews, and we didn't
have a chance to question
people and ask for more
information about something.
But, there is something about scale.
There is something about having
30,000 stories versus 300.
In that, we're able to
find things in that size
of an archive, and a
fully digital archive,
that you might not be able
to find in a paper archive,
or an archive in the scale
of hundreds rather than
tens of thousands, or
hundreds of thousands.
Let me give some examples
and then I'll show
you some eye candy.
When we looked at the logs
a couple of years later
about the way in which
researchers were coming to
this archive, so rather than contributors,
researchers were coming to this archive.
We were really struck by
how poorly we anticipated,
as the collection maker,
how the general public
and scholars would use this collection.
The best example I have is,
well, let me give you two examples.
One is that, there was an
entire study done about
what cellphone use was like in 2001,
because about 25% of the
stories mention cellphone use.
So, all of a sudden you had an archive,
where social scientists
were really interested in--
Cellphone use was just
spiking at that point,
so it's really a snapshot,
it had nothing to do with 9/11 really.
We thought this was a site
about memorialization,
about social and cultural
history of tragedy.
But here's someone who comes in laterally,
and quite brazenly, who studies
this as an archive about
social practice with cellular phones.
We had nearly 200
linguists come to the site
and use it to study teen slang.
Because we had so many--
What happened on the one year anniversary,
is that a lot of high schools
mandated that their students
write their own stories into this archive,
and we had a lot of teen
slang in this archive
that people were able to use.
This is also, 2001, the
rise of texting language.
Before texting language, really,
instant messaging language,
so LOL, and all that stuff.
So there were linguists
who were really interested
in modern American teen language,
who were able to use this
archive for that purpose.
We've got lots of other
examples like this,
but you can see already what's happened.
That in the process of
putting something online
you really can't anticipate ahead of time.
If we had created the equivalent
of the Library of Congress
subject headings, and said,
well, here's what we think
the categories of research are--
There's no way we would've had a category
for teen slang, or...
History dash dash,
United States dash dash,
21st century dash dash,
technology dash dash,
mobile phone use or
something like that, right?
We would not have come up with that.
So it's only in the
process of a full search,
which we had on the site,
across the collection
that scholars were able to
conceptualize and execute
entirely new questions.
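A question like the cellphone one can be sketched as a single pass over the full text of every story, with no subject heading required. The stories, the term list, and the function name below are invented for illustration, and this is not the code behind the archive's own search.

```python
import re


def share_mentioning(stories, terms):
    """Return the fraction of stories that mention any of the given terms."""
    if not stories:
        return 0.0
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    hits = sum(1 for story in stories if pattern.search(story))
    return hits / len(stories)


# Hypothetical stories, invented for this example.
stories = [
    "I tried my cell phone but could not get a signal.",
    "We watched the news in the classroom all morning.",
    "My sister called my cellphone from the train.",
    "I walked home across the bridge.",
]

print(share_mentioning(stories, ["cellphone", "cell phone"]))  # 0.5
```

Swapping the term list for slang words or anything else changes the research question without changing the collection, which is the flexibility a fixed set of categories cannot offer.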
That was very powerful
when we realized that,
looking at our server logs,
that we needed to put up
collections in as open a way,
because if they're not open
you can't do these kinds
of full scale searches,
and in as flexible a way
as possible, to allow
for new digital research possibilities.
That if we pigeon-holed it,
if we put too much of an imprint
on this online collection,
or said that we're gonna keep the stories
totally separate from the email,
totally separate from
the voicemail, et cetera,
as indeed we originally had,
where they were silos unto themselves,
then people couldn't have
done this full scale research,
and couldn't have come up with
the answers to their questions.
Let me just provide one
more example about this.
Well, actually, let me wait.
I'm gonna bring this up
as we get to more advanced
forms of digital research.
This is the word that Roy Rosenzweig,
my dear colleague who passed
away a couple of years ago,
used for this problem though.
A big negative as we get to collections,
like our September 11th digital archive,
it's just the problem of abundance.
Really, 150,000 objects is not that much.
You can imagine collections
that are far scarier to the historian.
So, the example I'd like to
use is White House emails,
which have proliferated.
We don't have the stats
yet for the Bush years,
but the statistics from NARA,
from the National Archives
and Records Administration,
is that, just on the main
email server in the White House
for the eight years of
Bill Clinton's presidency,
there were 40 million emails.
40 million email messages.
So, they're about 25 kilobytes each,
that's one terabyte of email text.
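The arithmetic behind that figure, worked through as a quick check. It assumes decimal units (1 kilobyte = 1,000 bytes, 1 terabyte = 1,000,000,000,000 bytes) and treats 25 kilobytes as the average size of every message.

```python
EMAIL_COUNT = 40_000_000          # messages on the main server, per the talk
BYTES_PER_EMAIL = 25 * 1000       # about 25 kilobytes each (decimal units assumed)

total_bytes = EMAIL_COUNT * BYTES_PER_EMAIL
total_terabytes = total_bytes / 1000**4

print(total_bytes)       # 1000000000000
print(total_terabytes)   # 1.0
```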
If you're a presidential
historian that's a scary thought.
It really requires you to think about,
well, I need new tools.
How am I going to look through this?
How am I gonna find anything
in 40 million emails?
So, Roy thought that there
was this problem of abundance
in an age of digitization,
and of digital research.
I think he was spot on.
I think--
How do you handle the Everywhere Library?
What are the strategies
that we can come up with
to reduce the number of things
that we're gonna have to look through?
You can no longer do
what the presidential
historian was trained to do,
which is to go to the archive,
and sequentially read all the memos,
and take notes and write a book.
We can't do that with a record
of Bill Clinton's White House.
We won't be able to do
that with any presidency
from here on out, and the number of emails
will just proliferate.
One of our answers, as I
mentioned at the start is--
Well, here's just another
example of the abundance.
Google Books, they're in
the process of digitization,
this slide is a year old.
They're actually, I
think rapidly approaching
10 million volumes, on
their way to 15, or 20.
Here's another example of that.
How do you deal with this massive number
of objects to scan through?
Zotero is one of our projects
that we have used to address this problem.
It's a project where we
literally sat down and said,
as scholars are sitting in
front of the web browser,
for the first stage of research,
how are we going to create the kinds
of personal collections
that are the equivalent
of you know, my old
photocopies from the archives
that I went to in Britain,
and Ireland and America,
for my book, Equations from God.
How do you recreate that experience?
How can you pull from multiple sites?
How can you index it?
How can you search it?
We thought that we needed a
tool that was really embedded
right within the web browser.
We looked at other tools
that were out there,
like EndNote for instance,
which was a standard
application from the '80s,
that you double-click on your desktop,
and it launches an application
and you take notes within it,
while your web browser
sits in another window,
and you kinda go back
and forth between it.
We thought that we needed
a tool that really operated
right within the web browser,
and that recognized when you were viewing
a scholarly object within the web browser,
it can offer to save that object for you
into your personal collection.
So, this was the problem, right?
We had all these windows,
we had handwritten notes,
we had perhaps a desktop
application like EndNote,
you had Word documents, and
you had your web browser,
and these were all separated.
We wanted to just squish this
all onto the web browser,
which we did, because
actually the open-source
Firefox web browser came
out at around that time,
and had tremendous
possibilities for extending it.
There are thousands of
extensions to Firefox.
So we conceptualized Zotero
as an extension of Firefox.
It's actually much more than that now,
and you can access
Zotero in multiple ways,
including within the web browser.
But the main client, or software piece,
that started this all off
was a Firefox extension,
it still is, actually, and
that's how most people use it.
What happens when you
click on the Zotero logo,
in the bottom right of your web browser,
it pulls up an iTunes-like environment
from the bottom of your web browser.
That includes--
This is an interface by the way,
that we brazenly ripped-off from iTunes.
It's an interface, it's very successful.
They're called Miller columns.
There's a principle of going
from general to specific,
from left to right, in
computer interfaces--
(audience member sneezes)
Bless you.
So we have collections on the
left, including My Library,
which is every citation
you've ever grabbed,
every document you've ever grabbed.
Sub-collections.
When you click on a sub-collection,
you pull up references,
books and articles.
From that, when you click
on an object, it gives you--
We didn't want to be
too librarian about it,
so we didn't call it metadata,
we just called it info,
also it takes up less space, fewer pixels.
So it pulls up info in the right pane.
We leverage semantic computing,
which is a kind of new area
in computational science,
that is able to look at what's
going on in the web browser,
and pull out metadata,
pull out information about that object.
If you think about the web, web 1.0,
it's about presenting
documents online, right?
Getting documents online.
But they're really just presentational.
When you go to your
online library catalog,
and look up a book record,
it presents you the
information about that.
What we wanted to do
is be able to actually
extract information from the page,
knowledge that's embedded
within the page, in many cases.
We were able to do that.
By sort of, watching the
webpage be drawn on the page,
Zotero actually interprets the webpage
as you are viewing it.
In this case, we're on JSTOR,
which is the indispensable
online collection of digitized journals for scholars.
So in this case, it's a little
washed out on the screen,
but it recognizes that it's looking at
an article here about Hamlet.
When you click on that
icon in your address bar,
it saves the information,
it saves the PDF into your interface.
We use a series of semantic algorithms,
and other kinds of algorithms,
to pull stuff off the webpage.
If you're on Amazon,
it'll realize that you're
looking at a DVD, or a book.
If you're on the New York Times website,
it'll realize you're
looking at an article,
or a set of articles.
It can save one, or all,
of the objects that are on the page.
As we've gone along, libraries
and museums got into the act
and started creating
Zotero-compatible websites,
which you can do.
Not through our
technologies, but by actually
using standards that are
broadly available on the web
for making metadata available,
in an invisible fashion on the page,
but a visible fashion to software.
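The talk does not name the standards involved. One convention of this kind is COinS, which hides an OpenURL-style citation in the title attribute of an empty span with class "Z3988". The sketch below is only an illustration of how software might read such a span; the sample page and the CoinsParser class are hypothetical, and this is not a description of Zotero's own code.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs


class CoinsParser(HTMLParser):
    """Collect citation metadata from COinS spans (class="Z3988")."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "span" and "Z3988" in classes:
            # The title attribute holds a URL-encoded key/value string.
            fields = parse_qs(attr.get("title") or "")
            self.records.append({key: values[0] for key, values in fields.items()})


# Hypothetical page fragment: a reader sees only the visible text,
# while software can read the citation hidden in the span.
SAMPLE_PAGE = """
<p>Hamlet, by William Shakespeare
<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.btitle=Hamlet&amp;rft.au=William+Shakespeare"></span></p>
"""

parser = CoinsParser()
parser.feed(SAMPLE_PAGE)
print(parser.records[0]["rft.btitle"], "/", parser.records[0]["rft.au"])
```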
Again, we work with thousands of sites,
we can't name them all,
and it grows every day.
And every day we ship out automatically
new site compatibility
to all Zotero users.
This creates what I like to
call, fluidity of bibliography.
So all of a sudden, you're
no longer having to retype
citations, and I know this
is a kind of time-saver.
But there's also something
that happens again in this.
By making it so frictionless,
it's really simple for instance,
to grab some citations,
you can actually produce reports
that you can post on a website.
So here's the--
If you replace the http with zotero://
you're actually able to put
up a webpage with a report
of things you've read,
let's say it's Shakespeare,
and those reports are Zotero-compatible,
and also have standards embedded in them.
It's very easy to grab citations,
post them back to the web,
you can actually drag
and drop within Zotero
to other places, so you can
take a citation from Zotero,
drag it into a Google document online,
that citation will be Zotero-compatible.
Someone can come along, or
your colleague can come along,
grab all the citations
from a Google document,
or you can drag it to a blog
post, or to an email client.
It will automatically format that
in whatever citation style you want.
So, we've just increased the velocity
of all this library
information that was siloed.
We can grab it from anywhere,
we can send it back out to anywhere
through the hub that Zotero represents.
What's really nice here
is that there's a kind of
virtuous circle that happens
when you think about these
kinds of digital research possibilities.
If people adhere to standards,
and we get some
interoperability between sites,
really neat things happen.
Like here's WorldCat, which
is the big online catalog,
sort of umbrella catalog,
that's run by OCLC.
They created something
which I really like,
WorldCat Lists, which are sort of
curated bibliographies on various topics.
You can create an account,
pull things off of
WorldCat and present it.
They knew about Zotero,
and they embedded some
information on WorldCat Lists
that makes them Zotero-compatible.
So, if you go over to WorldCat Lists,
and you're looking at
their Shakespeare list,
you can see that there's
a gold folder on the top,
and through one click you
could grab an entire list
into your Zotero collection.
So all of a sudden, this process--
And so now others are encouraged
to also do this embedding,
and also to allow for the free exchange
of scholarly information
on whatever site you're on,
through these standards,
through interoperability.
We also, I think, have
tapped into another area of
our current digital world.
We've talked about ad
hoc personal collections,
and how powerful they are.
But there's also this
aspect of social computing,
Twitter is one of those aspects,
Facebook, is probably very
famous in social computing...
We have in a kind of
2.0 version of Zotero,
leveraged this ability to share maximally
through our server.
What I'd like to do on this, is talk about
how much you can share,
and I like to do this thought experiment.
So we've talked about sharing citations,
but Zotero collects a lot more than that.
It's the full research process,
from the beginning conceptualization
through publication.
We actually have plugins
for Word and OpenOffice,
and other word processors.
So we are collecting within Zotero
the entire scholarly process,
in a way that wasn't happening before,
and we're collecting it
in a very shareable way.
So this includes things like,
notes, this includes tags,
which, as you often hear, are the
personal application of keywords
to your research, so you might
wanna tag a bunch of things,
non-Euclidean geometry, so
you can search back to that.
Also, people have used
Zotero, for instance,
to rank objects, so
there's people who actually
make value judgments
about the importance of
particular objects in their library.
We're collecting all that within Zotero.
By the way, you don't need to
share if you don't want to.
We're just encouraging it.
So, what are we collecting
in that process?
The example I like to use
is just to think about
one set in the Library of Congress,
there's one million dissertations.
There's 131 million objects
in the Library of Congress,
one million of which, are dissertations.
You think about everything
that went into those dissertations,
and what was left behind in
the act of printing it out,
into double-spaced,
standard 300-page documents.
Those documents have a
lot of other information
other than the narrative, right?
It includes footnotes, a bibliography--
Those are helpful.
But let's think about all
the work that went into it,
and what we've lost in just grabbing,
really what's the tip
of the iceberg, right?
What's leftover at the end,
or what's visible at the end
of this huge scholarly process.
So, the average time to write
a dissertation is four years,
there's about 2,000 work hours in a year,
with a two-week vacation.
I always like to discount
the graduate process,
having been a graduate student,
and having graduate students
working on dissertations,
I know that you have to just
chop off about 1,000 hours.
Coffee drinking, twittering,
these sorts of things.
Okay, but still.
Let's say 1,000 hours of
actual work a year, over four years,
that's four billion hours
of dissertation research,
went into this one collection
in the Library of Congress.
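The four-billion figure works out if the 1,000 hours is read as productive hours per year. A minimal Python sketch of that reading, using only the numbers quoted in the talk:

```python
DISSERTATIONS = 1_000_000         # dissertations in the collection (from the talk)
YEARS_PER_DISSERTATION = 4        # average time to write one (from the talk)
WORK_HOURS_PER_YEAR = 2_000       # nominal working year (from the talk)
DISCOUNT_HOURS_PER_YEAR = 1_000   # the speaker's discount for coffee and Twitter

# Assumption: the discount applies per year, leaving 1,000 productive hours a year.
productive_hours_per_year = WORK_HOURS_PER_YEAR - DISCOUNT_HOURS_PER_YEAR
total_hours = DISSERTATIONS * YEARS_PER_DISSERTATION * productive_hours_per_year

print(f"{total_hours:,} hours of dissertation research")
```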
That is a huge, huge amount
of human computation.
That is a bunch of people
assessing what's important,
pulling things from various archives,
taking notes on those things, right?
They're creating bibliographies
through those sub-collections in Zotero,
that may never see the light of day
because they changed topic,
or their ultimate bibliography
didn't use some documents.
But, they were there, they looked at them,
that might be of interest to other people.
So we have lost all that,
and it often sits in
filing cabinets, frankly,
in paper filing cabinets,
in professors' offices.
So, if we could operationalize
that material some way,
then we could gain quite
a bit of new information.
This is a lot of scholarly
information that is lost
that we can think about
regaining digitally.
So we have a server now that is available.
If you go to Zotero.org, you
can download the 2.0 version,
that includes access to the server.
Within just a few weeks
of launching the server,
we already had over
5,000 scholarly groups,
of every conceivable interest,
natural sciences, social
sciences, humanities...
You can just surf through this,
and there are dozens of
scholars in these groups,
or in some cases, hundreds
of scholars in these groups.
And they are all sharing their
materials through our server,
they've all got their local
collections on their server--
Excuse me, on their computer.
But they've also, are creating,
they're sharing things
that they have found.
So, we've aggregated a
huge amount of scholarship
through this process by
just thinking through
the scholarly process, and
then making connections online.
We have bibliographic
feeds, so you can actually--
If you don't want to use
Zotero, you can just actually
subscribe to a group as an RSS feed,
in Google Reader, or something like that,
and find new references
as they come through.
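To make the feed idea concrete, here is a minimal Python sketch of what a feed reader does with such a subscription. The feed contents and URLs are hypothetical, and a real reader would fetch the feed over HTTP rather than use an inline string.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed contents, standing in for a group's bibliographic feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example research group</title>
    <item><title>Equations from God</title><link>https://example.org/items/1</link></item>
    <item><title>Hamlet and His Problems</title><link>https://example.org/items/2</link></item>
  </channel>
</rss>"""


def new_references(feed_xml, seen_links):
    """Return (title, link) pairs for feed items that have not been seen before."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if link not in seen_links:
            fresh.append((title, link))
    return fresh


already_seen = {"https://example.org/items/1"}
print(new_references(SAMPLE_FEED, already_seen))
```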
API, is just a fancy way of saying that
other software developers can
come in and use our server,
or our client software, to
extend it in any way they want.
We're completely open source,
and so it's very easy for
others to kinda come in
and attach things to it,
or use our information.
We have a recommendation
system that's coming up,
that will be our attempt,
I hope not feeble,
but our attempt to try to
make some serendipitous
connections for you.
Just say, hey, I see you're
doing some work on the rise of
non-Euclidean geometry, here
are some of the things that
other Zotero users have
saved to their collections
that might be of interest to you.
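The talk does not say how the planned recommendation system works. One simple way to get suggestions of this kind is to score items by how much the libraries that contain them overlap with yours. The Python sketch below assumes that approach; the user names, libraries, and the recommend function are all hypothetical.

```python
from collections import Counter

# Hypothetical libraries: user -> set of saved item titles.
LIBRARIES = {
    "alice": {"Non-Euclidean Geometry", "Equations from God", "Riemann Papers"},
    "bob": {"Non-Euclidean Geometry", "Riemann Papers", "Lobachevsky Letters"},
    "carol": {"Hamlet", "King Lear"},
}


def recommend(user, libraries, top_n=3):
    """Suggest items saved by users whose libraries overlap with this user's."""
    mine = libraries[user]
    scores = Counter()
    for other, items in libraries.items():
        if other == user:
            continue
        overlap = len(mine & items)
        if overlap == 0:
            continue  # nothing in common, so this library tells us nothing
        for item in items - mine:
            scores[item] += overlap  # weight by how much the two libraries share
    return [item for item, _ in scores.most_common(top_n)]


print(recommend("alice", LIBRARIES))
```

Carol shares nothing with Alice, so her Shakespeare items are never suggested; Bob shares two items, so his one unseen item is.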
So we're taking this very
powerful, social computing
platform and leveraging it for discovery,
for digital discovery,
because we're assuming that
the power of the crowd is more
impressive than the power of,
let's say, a Google algorithm,
a straight text mining algorithm,
or a Library of Congress
subject heading listing.
So here's our creative lead at the
Center for History and New
Media, Jeremy Boggs.
So he has followers and
people following him,
and he shares his entire
Zotero library online,
and it's really wonderful.
If you're interested, he's
a fantastic web designer,
he thinks a lot about
technology, and history,
and you go to his Zotero page,
and you can just grab whatever
you want from his collection.
He's made it fully available,
he's really categorized it nicely.
He even has all of his
dissertation documents
out in the open, as you could see about
two-thirds of the way down
his page. He's working on
a history of technology dissertation.
And he's made that stuff fully available.
When I hear gasps,
when I talk about this to
scholarly audiences about,
you know, I'm gonna lose my
research to someone else,
I always think this is
a complete red herring.
I mean, maybe it's
because I wrote a book on
19th century pure mathematics,
that very few people have read.
But most academic monographs, trust me,
there are not hundreds of people out there
prospecting around, to take
your research on that topic.
Besides, you're far along
on your research anyway.
So I think this point about,
holding it and being secretive,
is a little bit too much
academic self-importance,
and paranoia, if we can
call it that, appropriately.
There are other forms of
social computing beyond us--
Oh, I'll just also mention that
we've got lots of other things going on
with aggregating documents.
So if you scan in documents,
that are public domain,
we're about to launch an
Internet Archive interface.
It's actually available in
the most recent edition of
Zotero, but we've hidden it
because we're road-testing it,
so don't turn it on--
I won't tell you how to turn it on.
But it is baked into the software now.
This will provide the
ability to drag documents
that you want, into a public
commons at archive.org,
the Internet Archive,
which probably a lot of you
have heard about.
So if you were let's say, an
archeologist, and you have
10 thousand photographs
from your dig in Athens,
either during your research,
or when you're done,
you could just drag
that collection into it,
and it will be preserved
at the Internet Archive in
perpetuity, all those digital photographs,
and will be made searchable
for everyone to use.
So we're creating our own
commons using this system.
So we'll talk a little bit
more about some crowdsourcing
possibilities, that I think
are kind of interesting
for the Smithsonian.
We talked about crowdsourcing quite a bit
at the Smithsonian 2.0
meeting this past spring.
The one I really like, is this one.
It's Galaxy Zoo.
What this is is these are
professional astronomers,
who are using the Sloan
Digital Sky Survey.
There's a huge set of
digitized images of the sky,
they've chopped them up using
computers into small sectors,
and then they've realized,
I think quite truly, that
astronomy is one area where
there are a huge number
of amateurs out there, who know a lot.
In fact, a lot of comets,
as you probably know,
are discovered by amateurs,
not professional astronomers.
They decided to kind of
leverage that community,
because they had hundreds
of thousands of galaxies
that they were interested in studying,
but they didn't know well,
which ones are interesting.
They want to find interesting forms,
galaxies that are colliding,
galaxies that are going supernova,
galaxies with black holes
in the center of them...
They realized it was
impractical for a single scholar
to sit down and leaf through
hundreds of thousands
of Sky Survey images.
So they created this site.
I think it's actually great
that they named it Galaxy Zoo,
not something more academic.
It's kind of a game.
There have been thousands of people
who have come to this site--
You sit down, you register and log in,
and it shows you little
part of the sky, and says,
well, what does this look like to you?
And you do this human assessment of it.
So, does it look smooth,
does it look something else.
I suspect they're interested
in the ones in the middle,
Features or Disk, probably
the most interesting to them.
People go through and catalog
these hundreds of thousands of
galaxies for the astronomers.
I think at last count, they've done a
quarter million galaxies
using this process.
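A common way to turn many volunteer answers into one catalog entry is a majority vote with an agreement threshold. The Python sketch below assumes that approach; the votes, labels, and the 60 percent threshold are hypothetical, and it is not a description of Galaxy Zoo's actual pipeline.

```python
from collections import Counter

# Hypothetical volunteer classifications: (image_id, label).
VOTES = [
    ("img-1", "smooth"), ("img-1", "smooth"), ("img-1", "features or disk"),
    ("img-2", "features or disk"), ("img-2", "features or disk"),
    ("img-3", "smooth"), ("img-3", "features or disk"),
]


def consensus(votes, min_agreement=0.6):
    """Return the majority label per image, or None when volunteers disagree."""
    by_image = {}
    for image_id, label in votes:
        by_image.setdefault(image_id, Counter())[label] += 1

    results = {}
    for image_id, counts in by_image.items():
        label, top_count = counts.most_common(1)[0]
        share = top_count / sum(counts.values())
        # Keep the label only if enough volunteers agree on it.
        results[image_id] = label if share >= min_agreement else None
    return results


print(consensus(VOTES))
```

Images that come back as None are the ones worth sending to more volunteers or to a professional.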
It's really just tremendous.
I think about the Smithsonian,
and there are a lot of
collections in the Smithsonian,
where there are similar communities,
that would love access to this material,
probably know a lot about various things.
You know, Hummel dolls,
Civil War artifacts--
Civil War is one of those
areas, where there's
tremendous amateur knowledge
that can be leveraged.
There's all kinds of areas
where you can imagine
sites like Galaxy Zoo,
adding to the discovery
systems that you set up here,
so that people can find objects
that they might be interested
in for their research.
I did a little experiment
a few months ago,
doing precisely this,
where I just put this image up on a board
in front of a crowd in New York City,
and I also Tweeted it at the same time.
This is my own little
crowdsourcing experiment.
This is an object from a
19th century anthropological journal,
I'll tell you the story in just a second,
and from a young anthropologist,
who found this object in the Midwest,
and brought it back to Washington
because he had no idea what it was,
and went to the Anthropological
Society of Washington
to kind of ask his colleagues,
through the 19th century
version of crowdsourcing,
to figure out what this was in this dig
that he did in Illinois.
So I sort of tried to replicate
this in a digital age,
by putting this up, I put it on my blog.
You can read about the
full outcome of this,
and also the process a bit on my blog.
I threw this up, and
then I tweeted, I said,
look, I've got an image on
my blog, take a look at it,
and you have 30 minutes,
my followers on Twitter,
to figure out what this is for me.
So do the 21st century
version of what this guy did
in the late 19th century.
So it took nine minutes for,
at that point I think I
had about 1,600 followers,
to figure out what this was.
It was incredible, I actually
had behind me on the screen
just the scrolling discussion
online of the 1,600 people.
And there were
anthropologists who came in,
and pointed out, well
there are two holes in it,
it was a necklace, so
something worn around the neck.
Then there were other people who came in,
who were specialists in
Native American religion
and realized, there's a spider on there,
and they talked about
what the spider meant.
Then there were others
who came in and said well,
which tribes had spider iconography,
and people started doing
digital research for me
and pointing out various books online
where you can find spider iconography.
And they ultimately,
completely figured out
what this was in nine minutes.
So, that actually impressed even me.
(chuckles)
I'm kind of jaded but, it was
a kind of fascinating exercise
that I have an unusual community,
in that I've got a lot of
professors following me,
and I'm lucky enough to
actually have a few followers,
so this'll be tough to do
without a lot of followers.
But again, it's just an exercise in what
we might be able to do.
Certainly the Smithsonian
in their Twitter accounts
has a huge number of followers,
the American History Museum does,
Library of Congress does.
There's lots of institutions that
could have artifact Wednesday,
and post something and have
people figure out what it is,
or do something like Galaxy Zoo,
where you could crowdsource
a much larger archive.
- [Woman In Audience] So, what is it?
- So, that's the interesting
part of the story.
The person who found this
was William Henry Holmes,
does anyone know that name from
Smithsonian history, right?
Okay, he was a young man who
was a curator of anthropology
at the Smithsonian when he was in his 20s.
He then went to the Field Museum.
He returned to the Smithsonian,
and actually became
the head curator of the Department
of Anthropology, in 1897.
Then he became the director of the
National Gallery of Art, in 1921.
But he was a fascinating figure.
He was, I think, very smart to kind of use
19th century crowdsourcing
on this material.
It ended up that this was buried in a hill
in Saint Clair County, Illinois.
It was made out of shell,
and there was a tribe
in that area that had...
I flipped on something here...
That had ceremonial shell insignias.
What's interesting about this is that
it's in the upper part of
basically, the Mississippi Valley,
and what was unusual
about it to him was that
this shell clearly came from farther down
in the Mississippi Delta,
or somewhere like that.
It's a ceremonial shamanistic shell,
it was worn around the neck,
or usually right around
the throat like this.
It's part of an earth cycle practice,
the circles, which have to do with the
circle of life, and things like that.
I don't have my full notes on it.
You can find out the full
story again on dancohen.org,
but it was from a tribe,
and I guess that they
had acquired shells through
trading with other tribes
farther down the Mississippi.
So it also was, I think,
really fascinating for him
because he realized that
there was significant...
Sort of, trade, and
trade routes in this area,
that he hadn't realized existed.
Where was I?
Yeah, so I think that there's--
Social computing and crowdsourcing,
they're often belittled
because I think, when we
think about crowdsourcing,
we often think of Wikipedia
and the problems of veracity,
and these sorts of things.
But I think there's ways of doing it,
these are a couple of examples,
where there are
self-correcting mechanisms,
and actually if you looked on
my Twitter stream for this,
people were correcting
each other and saying,
couldn't be from here,
could be from there,
and as long as you had enough activity,
you could really correct
for problems of error.
Let me talk about finally
some further possibilities
and this is I think farther down the road.
So we've talked about
creating discovery systems,
about presenting things
online in ways that present
new forms of digital research.
Computational scientists,
computer scientists, have been
doing this for decades.
There are people at IBM and Google
who have really fancy ways of
mining through lots of data,
finding things, visualizing it.
But, I think a lot of these
methods are still really
immature for history, and
the humanities and the arts.
I mean, a lot of these objects
don't present themselves
very well for the kinds of
algorithms as we discussed before
that all those math PhDs at
Google are interested in.
But I do think there are
some really interesting
possibilities that we can think about,
what a scholar might be doing
in five, or 10, or 20 years,
once we have a mass of
materials online that is open,
that can be mined for new information.
Here's just one example from
Brigham Young, Mark Davies,
digitized Time magazine from 1923 to 2006,
and OCRed it, so transcribed
the text out of this.
So he's got a corpus,
or a collection of about
100 million words
from Time magazine, and this
is a really fascinating site,
he really put it up as a linguist,
so linguists could study
changes in language over time.
I think it's fascinating as an
historian because you can do
these incredible searches for
topics across Time magazine.
Here's a search on race
relations, and it's shocking--
It's almost a perfect bell curve around
the civil rights movement,
which peaks in the '60s of course,
in the Johnson Administration.
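The curve on such a chart is, at bottom, a count of phrase occurrences per decade, scaled by how much text each decade contains. A minimal Python sketch of that calculation follows; the five-sentence corpus is hypothetical and stands in for the real one, and the function name is an assumption made for the illustration.

```python
from collections import defaultdict

# Hypothetical mini-corpus: (year, article text).
ARTICLES = [
    (1955, "Debate over race relations in the South continued."),
    (1964, "Race relations dominated the session. Race relations bills advanced."),
    (1967, "A report on race relations was released."),
    (1983, "The economy was the main story."),
    (1995, "The verdict reopened arguments about race relations."),
]


def hits_per_million_words(articles, phrase):
    """Occurrences of a phrase per million words, grouped by decade."""
    hits = defaultdict(int)
    words = defaultdict(int)
    needle = phrase.lower()
    for year, text in articles:
        decade = year // 10 * 10
        hits[decade] += text.lower().count(needle)
        words[decade] += len(text.split())
    # Scaling by word count keeps decades with more text from dominating.
    return {decade: hits[decade] / words[decade] * 1_000_000 for decade in sorted(words)}


rates = hits_per_million_words(ARTICLES, "race relations")
for decade, rate in rates.items():
    print(f"{decade}s: {rate:,.0f} per million words")
```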
What's interesting is the pattern, right?
So what I think is interesting
in doing digital research,
is searching for patterns,
but also anomalies are interesting.
So here we have an anomaly of the 1990s.
Can anyone figure out what that is?
Why there's a spike in
discussion of race relations
in Time magazine?
(audience chattering)
- [Man In Audience] O. J. Simpson
- [Dan] O. J. Simpson, yeah.
The aftermath of the O. J.
Simpson trial in the mid-'90s.
So, again, thinking
ahead 100, or 200 years,
a historian would want
to come to a collection
and be able to do a
bird's-eye scan in this way,
and then actually, Mark has I think,
done a terrific job here.
It's very easy to actually then
drill down into specific documents.
You can see, oh boy, what
happened in the 1990s, right?
We know what that is,
but someone in 100 years might not.
So they can very quickly go
down to the level of the scan
by clicking on these lines
down here in the lower right,
so it's easy to do.
This is one of my main
points today, to combine
close reading in this way,
with what Franco Moretti,
a literary scholar at Stanford,
has called distant reading,
which I guess is a nice way of putting it.
We've got the distant reading up top,
and we've got the close
reading possibilities below.
I think what scares a lot of scholars
in the arts and humanities,
and social sciences,
is that we're gonna lose
in digital methods the
traditional methods of reading
something really carefully.
I'm a firm believer that we
will not lose close reading,
but we need methods for
discovering documents
that we want to read closely.
We did this with 9/11.
We've really been thinking
about ways to mine
and visualize the information
in the 9/11 archive.
We started off, again,
with the baggage of,
how do you present a lot of information?
And we viewed it just like
we viewed all archives.
Well, you pull a box off the shelf,
and open it up, and leaf through it.
So our initial interface research was
literally browsing through it--
So here's our photograph section--
And you leaf through it, right?
Which doesn't really make a lot of sense.
We've since done some experiments.
We did this experiment with
matching it up with Google Maps,
where we took images and
stories from 9:00 a.m. on 9/11,
and using some services,
there are digital services
that will help you
geolocate, or place on a map,
images, or text stories
based upon metadata,
or based upon text from within the story.
So they'll take 9th and Broadway,
and be able to map that
very specifically for you.
So we ran through some of our collection
and posted it onto a Google map.
This, again, this presents
a new way of you know--
Well let's look at the cluster
of people in Chinatown,
or who were in a specific building.
We can do that very rapidly.
Think about how long this would take
if you were looking at a physical archive.
You'd have to go through
by hand all these objects,
and then look at them, draw
on a map where they are.
In this way, you can start
having people who are
experiencing the same thing
from different angles,
or who are on the same subway car.
You can find all these people very quickly
and do, again, new kinds of prospecting.
Since we have an internet connection,
which I wasn't expecting,
I'll go to this live and just
show you a final example of this.
So here's our same collection,
this is our 9/11 collection.
We've taken the stories
and we've done text mining.
So again, we've had a
computer scan through
for specific keywords, and then also,
packaged up those stories and sent them to
a Yahoo service for geolocation,
so that we could locate
where these stories were taking place.
I'll just give two examples,
or a little tongue-in-cheek--
Let me actually flip off one first...
Okay, so let's just look
at these red dots first.
Again, we have thousands,
and thousands, of stories
of what people were doing on 9/11
from all across the world actually.
We've just taken here
the American stories,
we have geolocated them, and
these are stories of people
who said that they were praying.
So I had the computer look
for variations on the phrase
praying, or I was
praying, things like that.
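A keyword scan of this kind can be sketched with a regular expression. The pattern and sample stories below are hypothetical; the exact phrases used for the archive are not given in the talk.

```python
import re

# Matches pray, prays, prayed, praying, prayer, prayers as whole words.
PRAYER_PATTERN = re.compile(r"\b(?:pray(?:ed|ing|s)?|prayers?)\b", re.IGNORECASE)

# Hypothetical stories standing in for archive contributions.
STORIES = [
    "I was praying the whole drive home.",
    "We watched CNN all morning.",
    "My mother said a prayer for the firefighters.",
    "The crew sprayed water on the street.",
]


def mentions_prayer(text):
    """True if the story contains any variation on the word 'pray'."""
    return PRAYER_PATTERN.search(text) is not None


matching = [story for story in STORIES if mentions_prayer(story)]
print(matching)
```

The word boundaries matter: without them, "sprayed" in the last story would count as a prayer story.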
Then we've put them onto the map,
and the thought experiment here is,
the historian of religion,
let's say, American religion,
who was interested in what
religious practice was like
in 2001, and wanted to
look particularly at,
let's say, more evangelical
forms of religion,
or prayer forms of religion.
Well, you can look at this
collection, and really zoom in,
and look around at different areas--
I haven't done something
here that should be done,
which is to normalize the data,
and we can talk about that
in the discussion period.
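One way to normalize, sketched here in Python, is to divide the prayer stories from each area by all the stories collected from that area, so that places that simply contributed more stories do not look more prayerful. The areas and counts are hypothetical.

```python
# Hypothetical counts per area: stories mentioning prayer, and all stories collected.
PRAYER_STORIES = {"urban core": 10, "suburb": 8, "exurb": 3}
ALL_STORIES = {"urban core": 500, "suburb": 100, "exurb": 30}


def prayer_share(prayer_counts, total_counts):
    """Fraction of each area's stories that mention prayer."""
    return {
        area: prayer_counts.get(area, 0) / total
        for area, total in total_counts.items()
        if total > 0  # skip areas with no stories to avoid dividing by zero
    }


shares = prayer_share(PRAYER_STORIES, ALL_STORIES)
for area, share in shares.items():
    print(f"{area}: {share:.0%} of stories mention prayer")
```

In this made-up data the urban core has the most prayer stories in raw terms but the smallest share, which is the kind of reversal normalization is meant to catch.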
It's very easy to look at various areas,
and really what I noticed
through this thought experiment
is that you very quickly
come to the conclusion that,
prayer is a suburban
and exurban phenomenon.
In fact, if we had an overlay,
you actually can go to Google
Earth and get various overlays
to overlay on this overlay.
So you can look at, for instance,
megachurch construction
in the United States,
and evangelical practice
in the United States.
And it correlates very
closely with these things.
What I find fascinating
is the godless urban core of America.
(audience laughs)
Whereas you go from city to
city, Atlanta's the same way.
In the downtown, we don't have
a lot of stories of prayer.
Let me pull back here and go up to
godless Boston, where I'm from.
But it's really even the same place--
Here, we'll just zoom into--
Actually, let's zoom into
Washington since it's real.
So here's Washington.
We've got a bunch out here
in northern Virginia...
Only one story in Montgomery County.
(audience laughs)
Recently moved from
godless Montgomery County
to northern Virginia.
But you can really get a really
interesting kind of overview
of American religious practice,
again, in a way that you
wouldn't be able to do
with a physical archive.
Just on the tongue-in-cheek
side, I like to overlay
people who are watching CNN...
See, these are CNN viewers.
And they are actually in
the godless urban core.
Let me go back to--
I don't know why the New
York Times is showing up.
You can see in Atlanta, it's
showing these zip codes here.
Let me turn off the side bar.
Austin, no prayer, just CNN viewers.
Dallas-Fort Worth, remember
there was nothing in the center?
Okay, now obviously I do
this in a semi-comical way
to entertain audiences, but
I do think there's something
here about the possibilities.
We wouldn't have gotten
this research possibility
with 300 stories.
We would've gotten something
else with 300 stories
of gold standard oral history,
but we wouldn't have this possibility.
So I think there are
some really fascinating
possibilities coming up,
with doing this kind of
text mining, visualization.
Again, we can go from here actually--
Oh no, I haven't set it up on this.
But I have set this up
before where we can link back
to the specific stories, you
can go in, go from this--
Again, literally, bird's-eye
view, over to specific stories,
read through them, and we
can do this for anything.
This took me a couple of hours,
but now it would take me 30 minutes.
So I can do this for cellphone use, right?
Or some of these other--
I'm sure you're thinking about
research topics right now,
that you'd like to do on this.
And it's only possible
because we've digitized it
in a particular way, we've
made it fully accessible
to other digital services
like, geolocation services,
visualization services, in
this case, Google Earth.
It's trivial to have the
archive spit out a file called a
KML file, which Google Earth reads,
and then you can just start
looking around inside of it.
So, this is certainly a possibility.
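(A minimal sketch of the KML export described above, not the archive's
actual code. It assumes each story has already been geolocated; the
function name stories_to_kml and the record keys 'title', 'lat', and 'lon'
are hypothetical.)

```python
import xml.etree.ElementTree as ET

# Namespace used by KML 2.2 documents, which Google Earth can open.
KML_NS = "http://www.opengis.net/kml/2.2"


def stories_to_kml(stories):
    """Return a KML string with one Placemark per geolocated story.

    `stories` is an iterable of dicts with the hypothetical keys
    'title', 'lat', and 'lon' (illustrative names only).
    """
    kml = ET.Element("kml", xmlns=KML_NS)
    doc = ET.SubElement(kml, "Document")
    for story in stories:
        placemark = ET.SubElement(doc, "Placemark")
        ET.SubElement(placemark, "name").text = story["title"]
        point = ET.SubElement(placemark, "Point")
        # KML expects longitude first, then latitude.
        coords = f"{story['lon']},{story['lat']}"
        ET.SubElement(point, "coordinates").text = coords
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    sample = [{"title": "Story 1", "lat": 38.9, "lon": -77.0}]
    print(stories_to_kml(sample))
```

(Saving the returned string to a file ending in .kml should be enough for
Google Earth to plot the points.)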
Now, there are problems,
and I wanna end on a note of skepticism.
This is one that some of you
might have heard me talk about before.
So this is actually one of the most famous
visualization projects, which
is the Many Eyes project,
that comes out of IBM's labs.
So you got a lot of
computer scientists there,
and this is their
visualization of the Bible,
the Christian Bible.
What they've done here, if you can see is,
the size of the circles are by the
importance of the characters in the Bible.
As I like to joke, it comes
up with the astonishing
conclusion that Jesus was
important in the Bible.
(audience laughs)
Now, I have been criticized
about this and people say,
well there's lines between
the characters which has to do
with how many scenes
they show up in together.
But I just present this as a
kind of opportunity to say,
there need to be experiments.
Maybe there is something
here, I don't know.
But, we need to think through what these
research possibilities will look like.
What are helpful visualizations?
What are visualizations
that just state the obvious?
So there's a lot right now
going on in digital research
to try to separate the
wheat from the chaff,
in terms of this.
'Cause you often do run into--
And I would, in an
attempt at self-criticism,
say that some of that visualization
of the 9/11 collection
is obvious.
Might not be obvious in 100, or 200 years,
but it's certainly obvious right now.
So we need to think about ways in which,
again just to review, we can present new
serendipitous materials to people,
we can allow them to
search across collections
and create ad hoc collections,
we can allow them to take what they found,
and visualize it, to think through it,
to scan through it in ways
that might provide revelations.
Again, I think we are just at
the beginning of this process.
The folks at Many Eyes have
self-congratulated themselves
about this visualization several times,
it's shown up in the New
York Times and other venues.
But I think there's still
a lot of work to be done,
and that's why in some sense,
the Smithsonian may be behind
on digitization, but I think
we're still ahead of the curve
in terms of, what can we
actually do with this?
What do scholars actually
want out of this?
What does the general public
actually want out of this?
I think there's lots
of possibilities there.
I think the best thing to do,
is to do the kinds of things
that we try to do, which is,
experimentation, prototyping,
small-scale tests, like we've seen here.
I think that in the process
of doing those things,
you really make a lot of realizations
about problems that are real,
and problems that are fake,
for instance, we didn't
have a lot of fake stories
in our 9/11 archive because it just--
People who wanna do fake stories,
there's other places to do that.
I think also, with the
Smithsonian insignia there,
the Library of Congress insignia,
people came to that site
and took it very seriously,
so we didn't have a lot of
problems with some of the things
you might worry about with
this kind of crowdsourcing.
Why don't I stop there?
I realize that I've overextended,
and it's already 10 of noon,
and I wanna leave some time for questions,
but thank you very much.
(audience applauds)
- [Dan] Yes.
We've got--
Since we are webcasting,
please wait for the mic.
- [Woman In Audience]
Well thank you very much.
It strikes me, because you're right,
I was thinking of a research project
as I was looking at your 9/11,
that one of the things
though, that you have
as you start out with
a large digitized body
or a natively digital body,
if you wanna then take that
and go to the Civil War and
say, did people go to churches
after something happens in the Civil War,
how are we going to get to that point?
- Right.
So as a historian I do worry about this.
First of all, not everything
will be digitized.
Ever.
I think actually a lot of the
collections I've looked at in
my life will probably never be digitized,
so there's that problem.
You at least hope for
finding aids to go online,
so there's that piece.
I think a big problem we have
right now with, let's say,
pre-20th century material,
even moving beyond just the
born-digital material, which
is great 'cause it's already
scanned, it comes that way, right?
Is handwritten materials,
are highly problematic.
OCR software doesn't really
work right now on handwriting,
although there's a lot of
work being done on handwritten
materials, and it's gonna
get better very rapidly.
But, I think here is a case
where we really rely on
other people who've gone
through the collection
to say what's in those letters.
So let me go back to
just the Zotero example.
Really, when we were
conceptualizing Zotero,
I was thinking about my collection.
So I've read every
letter that George Boole,
the creator of Boolean
algebra and Boolean logic
that's at the heart of computers,
I've read every letter he ever wrote,
which I'm not tooting my horn,
he didn't write that many
for a Victorian writer.
They tend to write a lot.
So, I have a bunch of notes
on what's in those things.
I don't have a transcription,
but I basically created
my own finding aid for that.
In the '90s, I had no way of sharing that,
but I could share that online.
I think there's gonna be a
lot of examples where you're
gonna have to aggregate notes,
tags, these sorts of things.
I think in that case it's gonna
be incredibly complicated.
For the Google Earth version,
I think you're gonna need a transcription.
There's a lot of experiments
going on right now,
with crowdsourced transcription.
We're looking into it for--
We have an archive of,
really, the first 15 years
of American history, the
Papers of the War Department,
is on our site at wardepartmentpapers.org,
where we've scanned in 50,000
letters that came out of the
War Department, which
was basically the entire
federal government for the first 15 years.
We don't have enough money to
go through and transcribe it,
but we think that there's
enough people out there,
amateur enthusiasts,
about the early republic
who love seeing a letter
from Thomas Jefferson
or George Washington would
be willing to go through,
on a split-screen environment,
where one side is the
high quality scan and
another side is a wiki
that we'll set up, that
they could type into,
to transcribe these
materials, and to verify them
by having multiple people
work on them at the same time.
So there's various computational methods
for verification as well, but
that's probably gonna take a
long time for those 50,000
letters to get fully transcribed.
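(One simple way to verify transcriptions with multiple volunteers can be
sketched as follows. This is an illustration under stated assumptions, not
the War Department project's method: the function names and the two-vote
threshold are hypothetical.)

```python
from collections import Counter


def normalize(text):
    """Collapse whitespace and case so trivial differences do not count."""
    return " ".join(text.lower().split())


def verify_transcription(versions, min_agreement=2):
    """Compare independent transcriptions of the same letter.

    `versions` is a non-empty list of strings, one per volunteer.
    Returns (text, agreed): the most common normalized text, and whether
    at least `min_agreement` volunteers produced it. Letters that do not
    reach agreement would go back for human review.
    """
    counts = Counter(normalize(v) for v in versions)
    text, votes = counts.most_common(1)[0]
    return text, votes >= min_agreement


if __name__ == "__main__":
    attempts = ["Dear Sir,  I have", "dear sir, I have", "Dear Sir, I hove"]
    print(verify_transcription(attempts))
```

(Exact matching is deliberately strict. A real system would probably
compare line by line or use a similarity score instead.)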
Yes, if you could just wait for the mic.
Maybe we could pass that down or--
- [Man In Audience] What I wonder about is
now and the future.
By that I mean, if a researcher
now wants to look at the
letters written by
Chekhov, or a scientist,
or something, he has paper to look at.
I'm talking about unpublished material.
How do we know now if we have
some budding genius who's
20 years of age, and is just
writing emails back and forth
to friends or whatever,
that it'll be lost.
And 20 years from now we will say,
gee whiz we wished we had that
stuff, but we don't have it--
Is there any thought being given to that?
- That's a terrific question.
Yes, there is thought given to that,
and the Internet Archive,
which is our partner,
really has been doing the
most thinking about this.
Brewster Kahle, who created
the Internet Archive,
said, I think quite famously that,
somewhere in the archive that he created--
So they do, just to expand upon it
if you don't know what
the Internet Archive does.
They crawl the entire web,
or what they can access
about every 30 to 60 days.
They have two petabytes of
storage, to do this web caching.
You can go to archive.org,
and they have something called
the Wayback Machine, where
you can go back in time
and look at sites as they looked--
And that's where I pulled
those screenshots from.
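(A small sketch of how a Wayback Machine address for a given date can be
put together. The helper name wayback_url is hypothetical. The address
pattern is the public one used at web.archive.org, which redirects to the
capture nearest the requested date if one exists.)

```python
from datetime import date


def wayback_url(url, when):
    """Build a Wayback Machine address for the capture nearest `when`.

    `url` is the original page address and `when` is a datetime.date.
    No network request is made here; this only formats the address.
    """
    timestamp = when.strftime("%Y%m%d")
    return f"https://web.archive.org/web/{timestamp}/{url}"


if __name__ == "__main__":
    print(wayback_url("http://example.com/", date(2001, 9, 12)))
```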
And they just have everyday
sites from normal people.
He likes to point out, Brewster
Kahle likes to point out
that, somewhere in that archive
are the papers of a future president.
But how would you know in advance, right?
You don't know when they're 13
and they put up their website
on skateboarding that that
will be a future president.
So his feeling is, you just
gotta go out there and get
it while it's available, and
that was our feeling too.
That archivists, librarians,
have to be proactive
because this stuff goes away.
It raises a larger problem, doesn't it?
Which is, part of the
problem actually that I have
with closed services, like Facebook.
So right now, we've got a
whole generation that is
texting each other, or emailing
each other within Facebook,
and I'd much rather have
them do it on the public web,
or participate in public
ways that can be archived
by services like, the Internet Archive,
or the Library of Congress.
Because, surely, we are going
to lose stuff that is in
gated areas, like a Facebook,
or on commercial services.
Actually, when we
launched the 9/11 project,
we got these emails, I
remember very early on,
from people who had,
did a lot of discussion
about their feelings on Yahoo forums.
Yahoo made this statement that
because of their storage constraints,
they were going to delete all
their forums after six months.
So on March 11, 2002,
there was this outcry that
they were suddenly gonna delete
all of these recollections.
Now, Yahoo ended up
actually saving that stuff.
But, there is a real problem
here about where stuff resides.
I think if you're history-minded,
you need to think very carefully the ways
in which you participate.
If you use services
where the stuff can't be
extracted from it, then
you will lose your history.
That stuff will be lost.
So, I use Gmail, like a lot of people,
which is a web-based service,
but I archive all my email separately,
and I archive it in a format
that is universally accessible.
It's a very basic format,
that pretty much anything can read.
Not that I think I'm gonna
be a future president.
I'm sure I'm not going
to be a future president,
but I just want that
stuff to be available.
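(A hedged sketch of keeping email in a basic, widely readable format. The
talk does not name the format used; mbox input and plain-text output are
assumptions here, and the function names are hypothetical. Only Python's
standard library is used.)

```python
import mailbox
from pathlib import Path


def plain_text_body(msg):
    """Return the first text/plain part of a message, or '' if none."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def export_mbox_to_text(mbox_path, out_dir):
    """Write each message in an mbox file to its own plain-text file.

    Returns the number of messages written. Attachments are ignored;
    only basic headers and the plain-text body are kept.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for i, msg in enumerate(mailbox.mbox(mbox_path)):
        header = (
            f"From: {msg['From']}\n"
            f"Date: {msg['Date']}\n"
            f"Subject: {msg['Subject']}\n\n"
        )
        target = out / f"message_{i:05d}.txt"
        target.write_text(header + plain_text_body(msg), encoding="utf-8")
        count += 1
    return count
```

(Calling export_mbox_to_text("mail.mbox", "mail_txt") would leave one .txt
file per message, readable by pretty much any program. Both paths are
placeholders.)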
If you look in Digital History,
the book that I wrote with Roy Rosenzweig,
which is available for free online,
although you're welcome to
purchase a copy in paper,
chapter eight of that
book talks about this,
about what you might wanna do yourself
if you want to be sure that
your records will be archivable.
There's basic practices, in
fact, much of it we learned,
Roy and I, learned just
from talking to archivists,
that applied both offline and online.
So documentation about
what's in something,
say you burn a DVD-ROM, using
standards that are again,
universally accessible,
so it's always better to have something
in plain text than to
have it in Word 2007.
There's a lot of things that
come up related to that issue,
and it's complicated.
We kinda go about our merry
digital ways without thinking
about the past, so there's
a consciousness that I think
people are gonna start to have about that,
that they'll be losing all this material.
A lot of what we have, that
I have as a Victorianist,
is what archivists call,
preservation by neglect.
So there were letters, Boole's letters,
that were stuck in a shoebox,
in some cases quite literally,
and then someone finds it in their attic
couple of decades later, or
couple generations later,
and goes to the Bodleian, or
some other preservation body,
and says, oh you know,
I found these things.
Well, it was lucky it was on
paper, and it was in a shoebox,
it was dark and basically
climate-controlled,
but, there's the equivalent
of that in the digital age
of what's your shoebox.
I do worry that we're surely losing a lot
right now, as we speak.
- [Audience Member] Thank you Dan.
You've given us a lot of food for thought,
and some great eye candy as well.
I just actually wanted to
put in a plug for your,
and to say thank you, for
your Digital Campus podcast.
- Oh, thank you.
- [Audience Member] They're great.
Keep it up.
- Okay, thanks.
I should point out again,
the collaborative effort with
Mills Kelly and Tom
Scheinfeldt, digitalcampus.tv.
If you're interested in these issues,
it's something that we just
talk in a freewheeling way.
We had a little summer hiatus,
but we're back recording.
We're recording actually
tomorrow morning again,
for our 44th podcast.
Thank you very much for the plug, yes.
Okay, I've given you
too much to think about.
(audience laughs)
I think the questions are over.
Thank you very much again
for coming, and listening.
If you have specific questions
about your own personal
preservation regimes,
or something like that,
I'm happy to address them off the mic,
and off the webcast as we finish.
- [Woman In Distance] Thank
you for coming this morning,
thank all of you who came
as well, and good luck.
(audience applauds)
- Thank you, thanks very much.
