- [Voiceover] So, good morning. We're going to continue the journey today. Sort of a recap of what we've done so far: we've come a very long way. We talked about establishing and thinking about what good data requirements for archiving are. We talked a little bit about how to define data and how to create organized data in the form of a data model. We looked at a variety of different tools, the PDB deposition system in particular, as an example of how to do data acquisition and deposition, and then further explored how the deposited data can be curated and validated. And then last week we talked a little bit about data standardization post curation.
So, today we're going to go one step further and discuss what we mean by creating a data archive or repository, and some of the additional considerations that bear on setting up a repository and an archive. That discussion will divide up into a couple of areas. First of all, we've already talked a little bit at the very beginning about defining the scope of data content, but now we'll take that a little bit further, in terms of what the implications of those choices are with respect to creating an archive. We'll talk about how to choose identifiers: how you're going to identify the data objects in your repository, the people that are contributing those data objects, and generally what those considerations are. We'll talk about how to organize the data for delivery, in terms of the data file formats or services that you will use to deliver data. And then, finally, we'll wrap things up with a discussion of the full life cycle of data, from beginning to end, with a focus on how each of those steps ultimately impacts archiving.
So, let's first talk a little bit about scope. Again, at the very beginning of the course, there was a discussion about the considerations that bear on what data you decide to collect. And so now the question is to develop a finalization of those requirements, in terms of what it is that you're actually going to store in your repository. You've got a whole pipeline: you have a deposition system, you have collection and data processing, curation and validation. So at the end of the day, what data are you going to be able to store and deliver as part of your archiving operation?
This is, in some ways, a review of considerations that were discussed earlier. It's important to understand the stakeholders and the issues for the contributors at all levels. Since we've been discussing building an experimental archive, it's important to understand exactly what the state of the experimental field is that's being archived. What is the possibility of collecting the information, and how can it be validated? What sort of uncertainties are associated with the information that you're collecting? How will you be able, in your data processing operations, to assess or communicate the reliability of the information? And, perhaps most importantly, what level of automation is there with respect to being able to collect this information?
Increasingly, the examples that have been presented in this course have dealt with experimental structural biology, which has very, very complicated data sets. We have hundreds, or perhaps nearly a thousand, individual data items; that's not something that people are going to be able to manipulate by hand. It's very important to make sure that the information that's being collected, and the data pipelines, can support automation as much as possible, both to protect the archives and for the convenience of the end user.
So, other important considerations are what people will do with the data. What individuals are going to use the data, and what are their requirements? What data will they likely be able to use productively, and what types of questions do you think they're ultimately going to ask about that data? Those considerations have already been discussed somewhat. They need to be answered in order to finalize exactly what's going to be stored and how it's going to be organized.
Then, looking at all those considerations, one has to evaluate how they bear on the ability to create and engineer a pipeline that will move the data from the depositor, in whatever form the depositor has that information, all the way through the curation and validation steps, and then ultimately put that information into what we call the repository. And finally, sustainability has been discussed in a couple of different contexts. The amount of effort required to perform that operation, to support the data and to support that pipeline, is something that needs to be sustained. So you don't want to construct a pipeline that can't be managed in the future, or create difficulties which are practically insoluble.
So, another important consideration is how to identify things, so people can find them. The types of identifiers that we have to address include accession codes: these are the identifiers that are used by the repository to identify data objects. We've seen those when dealing with all of the different repositories, as entry codes and database codes. We need a way to identify the people that contribute to the resource, so these are identifiers for people. And then you need to be able to identify the ultimate data objects themselves, the documents and the actual data sets, in a form that will allow people to reference them in other contexts and to access them using conventional web and internet technologies.
Choosing accession or repository identifiers raises a variety of different questions, and a variety of different approaches have been used to deal with them. There are style issues, that being the question of whether the identifier is simply a number or perhaps provides some information about what's in the underlying data object. This is the issue of opaque versus expressive identifiers. There is selectivity, that being that if you have complex data sets, it's important that you be able to associate a particular identifier with a particular data object, so there's no ambiguity about what that accession points to.
It's important for long-term durability that identifiers have persistence over a period of time: the identifier always points to the same data object, the same content. If the data object is going to change over time, if it's subject to modifications of some sort, then versioning may be required. And then a practical consideration is the ability to recognize the accession code, or the identifier, when it appears in published literature. If someone references a data object from your repository in a publication, is it easy to spot? In other words, if it shows up in an abstract or in keywords, can you easily, textually identify or recognize the identifier, and tell that it's pointing to a data object in your resource?
So, some examples; you've used these resources already. UniProt is a resource which has a variety of different types of identifiers. Historically, they used expressive identifiers, for instance here, for human insulin. In the early days, the PDB did something very similar to that: even though it was a four-character code, sometimes those identifiers reflected something about the molecule. In this case, the early structures of insulin had, sort of, boutique-style IDs. That practice was found not to be very scalable for the PDB and hasn't been pursued. Less expressive identifiers are used by these resources now: for instance, UniProt has accession codes which are basically a single character followed by numbers, and the PDB has four-character codes which now generally have absolutely nothing to do with the underlying data object. GenBank has always had a relatively opaque identifier, but also includes an explicit version index at the end of the identifier, so the .1 will change to .2 if there's a modification. So these are different styles that have been adopted by different resources.
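To make the distinction concrete, here is a minimal sketch, in Python, of how these identifier styles behave in software. The patterns and the helper function are illustrative only; the real accession grammars are defined by each resource's own documentation.

```python
import re

# Illustrative patterns: a PDB-style opaque ID (a digit followed by
# three alphanumerics) and a GenBank-style versioned accession
# (letters and digits, then ".<version>").
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")
VERSIONED = re.compile(r"^(?P<base>[A-Z]{1,2}\d+)\.(?P<version>\d+)$")

def split_versioned(accession: str) -> tuple[str, int]:
    """Split a GenBank-style accession like 'U12345.2' into (base, version)."""
    m = VERSIONED.match(accession)
    if m is None:
        raise ValueError(f"not a versioned accession: {accession!r}")
    return m.group("base"), int(m.group("version"))

print(bool(PDB_ID.match("4hhb")))   # True: opaque, says nothing about content
print(split_versioned("U12345.2"))  # ('U12345', 2): the version is explicit
```

The opaque ID carries no meaning, so it never has to be renamed; the versioned style makes modification history visible in the identifier itself.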
Similar considerations apply to people, identifying contributors. You come to the decision, of course, that people's names just won't cut it for this. In fact, it's such a problem that you can now have the same person's name appear multiple times in a single publication, so it's very difficult to rely on names. For a variety of reasons, identifiers for people are very convenient: people want their contributions to science to be available in an integrated fashion, in terms of assessing their contributions, so it's important that these identifiers work very well with what happens in the publication process.
There needs to be a scheme that integrates well with other data repositories and other types of digital content. It's particularly useful if there's wide adoption and portability of the identifiers. Unlike data set identifiers, identifiers associated with personal information raise privacy and security concerns that have to be addressed. And because of the complexity of these identifier systems, some sort of software support is also often very important.
So, some examples that have survived in this area are a combination of identifiers which have grown out of the publication industry and are largely proprietary in their scope. The Scopus ID is one example of that: just a numerical ID that is assigned to contributors, to authors, within that context. ResearcherID is the Thomson Reuters equivalent, which lives within the Web of Science, so this identifier has quite a lot of traction. And then, more recently, the Open Researcher and Contributor ID (ORCID), which was actually built, in many ways, from infrastructure that was developed for ResearcherID, has now become a very popular open alternative, which also enjoys commercial support.
An ORCID ID is a numerical ID which you can register for at the ORCID site; it has this numerical form, and anyone can go to that site and create their own ORCID ID. Part of that process involves assembling your digital footprint, if you will. They provide a lot of functionality to help you assemble all of your published materials and bind that information, along with some personal identifying information, to that identifier. It also provides software support, so that people can access your ID using the ORCID URL and capture some of the metadata that you've registered.
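As a sketch of that software support, ORCID exposes a public API from which registered metadata can be fetched. The endpoint shape and JSON keys below are my assumptions based on the public documentation, so verify them before use; the iD shown is ORCID's well-known test record.

```python
import json
import urllib.request

def fetch_orcid_record(orcid_id: str) -> dict:
    """Fetch the public record for an ORCID iD via the public API
    (endpoint shape assumed here; check the current ORCID docs)."""
    url = f"https://pub.orcid.org/v3.0/{orcid_id}/record"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

record = fetch_orcid_record("0000-0002-1825-0097")  # ORCID's documented test iD
print(record["orcid-identifier"]["uri"])
```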
Now, the service of providing the information is what keeps this organization in business, so public access to the information is not as rich as the proprietary access that you can pay for. A number of universities have adopted the ORCID ID as an institutional management structure for the contributions of their faculty, in which case they engage in a private relationship with the ORCID foundation to manage that information internally. This identifier is one that the PDB has recently added to what it collects from its depositors. It looks very promising.
So, then there's the question of how you actually identify data objects. Since we're dealing with the web, one very popular way of doing that is through URLs. URLs, unfortunately, are not so enduring, and certainly many of you have encountered the dreaded 404, file not found, when trying to access URLs. One of the reasons that Google has built such a large infrastructure is that they need to re-crawl the web and re-verify all the URLs that they index on a very regular basis, because URLs are so volatile in nature. A recent study of the literature produced by the Supreme Court of the United States revealed that, because the opinions rely on URLs to reference published literature, after two years only about 10% of the URL references in written opinions remain interpretable in any meaningful way. Depending on simple URLs is a very short-sighted approach.
Very early on this was recognized, and the notion of the persistent URL was developed. A persistent URL differs from a regular URL in that one level of indirection is added: you register a persistent URL, and the location it points to can be updated if the underlying resource goes offline or has to change. This was a first development in trying to create identifiers, in the form of simple URLs, that were more enduring.
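The indirection is simple enough to sketch in a few lines; the paths and hosts here are made up purely for illustration.

```python
# A minimal sketch of the one level of indirection behind a persistent
# URL: the persistent name is fixed, and only this mapping is updated
# when the underlying resource moves.
REDIRECTS = {
    "/id/dataset/42": "https://old-host.example.org/files/dataset42.dat",
}

def resolve(persistent_path: str) -> str:
    """Return the current location for a persistent path (the HTTP redirect target)."""
    return REDIRECTS[persistent_path]

# When the data moves, only the registry changes; citations stay valid.
REDIRECTS["/id/dataset/42"] = "https://new-host.example.org/data/42"
print(resolve("/id/dataset/42"))
```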
The identifiers which have perhaps become more mainstream now are digital object identifiers, DOIs. These are identifiers which you now see very commonly on all published literature, and recently on data objects in repositories. For perhaps five years now, the PDB, which was very early to do this, has generated DOIs for all of its data sets, and that's become very much mainstream now. What a DOI adds is not only this level of indirection, in that the DOI itself is a neutral identifier which points to a URL that can be updated within its metadata, but it also carries a variety of metadata which describes the data object itself.
So, unlike a URL, for which you have no idea where it's going to land or what it's pointing at, a DOI has to be registered with a rather detailed set of metadata that describes the authorship and the content of the object that it points to, and that information is available and searchable through the DOI registration agencies. The way a DOI is actually written, it looks like a numerical prefix followed by an often very unintelligible string. What follows the numerical prefix is entirely up to the individual that registers the DOI. Some people have relatively simple suffixes; for instance, the PDB uses a PDB code and then a format ID to follow. But you'll see much published literature with what looks like a long, arbitrary name and string following the prefix that points to a publication. That preference is purely a matter of style. The URL dx.doi.org points to a DOI resolver, operated in conjunction with registration agencies: a company called CrossRef, another called DataCite that provides a similar service, and some other public services that do this as well. This URL will land you on the data object that the DOI points to. The example at the bottom of the slide actually points to the data file for the PDB accession 4HHB.
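Resolving a DOI is just an HTTP redirect chain, which can be sketched as below. The PDB-style DOI string follows the suffix pattern just described, but treat the exact string as illustrative; a robust client would also handle errors and timeouts.

```python
import urllib.request

def resolve_doi(doi: str) -> str:
    """Follow the resolver's redirects and return the final landing URL."""
    req = urllib.request.Request(f"https://doi.org/{doi}",
                                 headers={"User-Agent": "doi-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        return resp.geturl()  # the URL after all redirects

# The suffix style is up to the registrant; the PDB embeds its accession code.
print(resolve_doi("10.2210/pdb4hhb/pdb"))
```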
So, any questions so far?
So, jump in if there's
anything that isn't clear.
- [Voiceover] Right, at your last point: the DOI points to the file itself?
- [Voiceover] Yes.
- [Voiceover] And not to, you know,
a web page that describes it?
- [Voiceover] That's correct.
So there are different ways that we can organize this. In fact, think about how the publication industry usually does it: if you reference a DOI for a publication, you usually go to that publisher's landing page, which has advertisements and a whole bunch of stuff about the publication, and which also provides you with some information about that publication and maybe some additional downloads. That's the style that's commonly used for publications. When we adopted the scheme for referencing data objects, the requirement was that we literally point at the data object that we were registering. It becomes somewhat complex if you want to provide yet another layer on top of this: you would then need to point to files in different formats, or other experimental data associated with the entry, and that really becomes a query of the resource, almost, at that level. So the decision at this point, at least with the PDB, has been to take this approach, but it's certainly not the only way that it can be done.
So, continuing on, another consideration is formatting. Some decision has to be made about how you are going to organize the data in your repository for delivery. At the very essential level, that could mean how files are organized physically, and the formats of those files, for download by individual users. If you think about the PDB, or the sequence databases, as an example: you have an identifier, and you have a download, basically downloading the particular data file corresponding to an accession code. And so the choice of the actual concrete format that the data file is going to be represented in is a choice that has to be resolved.
As we've discussed so far, a conceptual data model for the information that you're archiving doesn't lock you into a particular data format for delivery. We're talking now about how the bytes are organized on disk, for instance, or how you would deliver the information as a service; all of those types of considerations come in here.
Some of the considerations are enumerated in this list. Obviously, the format should be able to express all the information that was in your data model, so it has to be suitable for representing the information that you have to deliver. That would seem like an obvious point, but we'll come to some practical examples where that actually is quite challenging. You also want to pick something that would allow you to grow in the future. If you expect that your repository is going to add new content, or extend, or that the science or technology involved will change, you want to pick a style of representation or delivery that makes that a smooth rather than a disruptive change.
Portability is an important consideration for delivering data across the web. The types of things that work against portability are formats which are only available, or have particular constraints, on a particular computer platform. Binary files, for instance: think about the kinds of portability problems you may have experienced across different versions of Microsoft Windows or Apple computers. If you've used word processors over many years, files that you created, say, five or six years ago may not be supported anymore. That's particularly true of some early Microsoft Word documents, and of a lot of formats that were used in Apple word processors, for instance. So there are lots of things here having to do with future-proofing your choices, and sometimes it's just a matter of picking something simple that doesn't add a lot of platform-specific complexity.
It's useful to pick something that is compatible with your stakeholder community. Picking a file format that's either difficult or unfamiliar for that user community is, perhaps, not a good idea. Software support is crucial for maintaining and fostering adoption of your repository, so make sure that the data file format you choose is well supported in the community software. It helps to have a format whose specification can be documented; if that can be done electronically, so much the better. Increasingly now, delivery of data in the form of individual data files from the traditional FTP repository is perhaps less important than being able to deliver the data through programmatic interaction with a service. You may be accessing information that's pre-processed and delivered to you in a packet that can be displayed on the web, rather than arriving as a complex data file that you then have to parse separately. And it may be that there is no one single solution that supports all of your requirements, so multiple different types of delivery formats may be needed.
So, take the PDB as an example. The PDB archive, as you've heard, has a long history, going back quite a few generations in terms of technology. Initially, PDB data was distributed on punch cards, and later primarily on various types of tape technology. The punch card technology was a 72-character format, and for those of you familiar with the PDB format, you'll notice that the enduring format of the PDB bears a clear resemblance to a fixed-format punch card record. We've also heard, more recently, that other structured formats, which are somewhat more flexible, or have more complicated internal structure, have been adopted: for instance, the PDB uses the mmCIF format as a master format internally, but uses that format to generate other formats. For instance, PDBML is an XML markup of the same content, and an RDF variant of that is used for semantic web delivery. We will look at the features of each of these in the next few slides.
So, again, this is the regular, fixed-spaced, record-oriented PDB format. It is certainly the most popular format delivered by the PDB, because of its long history and its significant software support. The documentation of the PDB format is basically a written document that describes what fields the various elements are in. From an extensibility perspective, the ability to extend it is basically the ability to invent new record formats, so all of the changes that you make to this format are customized, one-off changes. By far the reason why this format has been so popular is, first, because it's actually very simple. It conforms very closely to how people use the data. The coordinate data is very simply represented, very easy to access, and, as I said, very well supported by software. It's not surprising that this has been the popular choice of the community, from a user perspective, for a very long time.
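To give a feel for how simple the record-oriented format is to consume, here is a minimal sketch of parsing one coordinate record by fixed column positions. The column ranges follow the published PDB format documentation, and the sample line is typical of a hemoglobin entry; verify both against the spec before relying on them.

```python
def parse_atom(line: str) -> dict:
    """Parse one fixed-width ATOM/HETATM record from a PDB file.
    Column ranges follow the PDB format spec (1-based, inclusive)."""
    return {
        "serial":  int(line[6:11]),    # cols 7-11
        "name":    line[12:16].strip(),  # cols 13-16
        "resName": line[17:20].strip(),  # cols 18-20
        "chainID": line[21].strip(),     # col 22
        "resSeq":  int(line[22:26]),     # cols 23-26
        "x": float(line[30:38]),         # cols 31-38
        "y": float(line[38:46]),         # cols 39-46
        "z": float(line[46:54]),         # cols 47-54
    }

record = ("ATOM      1  N   VAL A   1       6.204  16.869   4.854"
          "  1.00 49.05           N  ")
print(parse_atom(record))
```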
A format that the PDB uses as its internal master format, and now as its primary archive format, is the mmCIF format. You've already seen examples of the format. It's key-value, where the data model is evident from the names of the data items in the file. It has a relatively simple syntax, in that key-value is very easy to read, and tables of data are also very easy to read. So while there is some additional complexity in this format, it doesn't take you very far from the original character-oriented format, if you think about how the data is laid out. To summarize: the syntax is quite simple; it introduces the idea of having named data objects; and because of its regular syntax, it provides for extensibility in a way that the record-oriented format does not. It's very easy to add new data items with different names; it doesn't change the syntax, and it's just as easy to parse. It has built-in extensibility. It now enjoys rather wide software support. And, as you've seen elsewhere in the course, the documentation for this is built into the data dictionary, which defines almost all of the data names in the archive, as we discussed under data modeling previously. So, a variety of advantages.
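Here is a toy sketch of how readable the key-value portion of an mmCIF-style file is to software. It ignores loop_ tables and multi-line values, which a real mmCIF parser must handle, and the sample values are illustrative.

```python
def read_simple_items(text: str) -> dict:
    """Collect simple '_category.item value' pairs from mmCIF-style text.
    Loop_ tables and multi-line values are deliberately skipped here."""
    items = {}
    for line in text.splitlines():
        if line.startswith("_"):
            parts = line.split(None, 1)
            if len(parts) == 2:
                key, value = parts
                items[key] = value.strip().strip("'\"")
    return items

example = """\
data_4HHB
_struct.title        'THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN'
_cell.length_a       63.150
"""
items = read_simple_items(example)
print(items["_struct.title"])
```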
Perhaps one disadvantage of the mmCIF format is that it's not a mainstream computer engineering format. Much of the world runs on XML-style formats, as presented here; HTML, which is sort of a simplified relative of XML, is the language of web page delivery now. XML is widely used in a variety of applications, primarily because of its tagging scheme. If you look at a tag in XML, you'll always expect to find a closing tag, which is the bracketed tag with a leading slash. So it's a regular syntax. There are no real implied semantics: the names are entirely arbitrary, and they have to be derived from some reference. The way the PDB has used XML is to literally transcribe the organization of the mmCIF, using the data dictionary to present the same information, transliterated into a different syntax, not a different semantics.
And you'll notice one of the features in the way the translation has been done: information which is a unique identifier (if you'll remember, the key of a category) is represented as what's called an XML attribute, which is a value assigned within a tag, whereas all the other values are represented as elements, which are simply enumerated within the enclosing element. One of the disadvantages of this format, from the point of view of structural biology, is that having to associate a rather verbose tag with every piece of data introduces a lot of overhead; it basically adds a lot of space. Typically, for a PDB coordinate file, which has a long list of atom records, that space overhead is about tenfold. So an XML file that has fully marked-up coordinates is about 10 to 12 times larger than the corresponding mmCIF or PDB file.
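The overhead is easy to demonstrate. This sketch marks up a single coordinate record element by element (the tag names are modeled on PDBML's, but treat them as illustrative) and compares its size to a fixed-width line carrying the same values.

```python
import xml.etree.ElementTree as ET

atom = ET.Element("atom_site", id="1")       # the category key as an XML attribute
for tag, value in [("type_symbol", "N"), ("label_comp_id", "VAL"),
                   ("Cartn_x", "6.204"), ("Cartn_y", "16.869"),
                   ("Cartn_z", "4.854")]:
    ET.SubElement(atom, tag).text = value    # all other values as elements

xml_text = ET.tostring(atom, encoding="unicode")
fixed = "ATOM      1  N   VAL A   1       6.204  16.869   4.854"
print(len(xml_text), "bytes of XML vs", len(fixed), "bytes fixed-width")
```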
So, what the PDB has done is divide up the XML delivery into a number of different forms. One has fully marked-up files, where all the records are marked up: every single data element, every atom record, is individually tagged. In another, the metadata, but not the coordinate records, is fully marked up: just the non-coordinate data. And in another form, instead of marking up the individual atom records, the entire block of atom records is treated as a single string, to save markup space. So, while XML brings portability and a wide set of tools, it has turned out not to be particularly popular in the structural biology community, except among people that were already consuming a lot of XML data for other reasons. It is, however, very useful for web delivery, because most of the tools that are used to deliver data on the web are already fluent in XML.
Finally, one step beyond plain XML is the Resource Description Framework, RDF, which, as you can see, has very much the same style. It is designed to serve a slightly different purpose. RDF files are designed to provide specific URL entry points for specific data objects. You can create a URL, if you will, which is almost literally a query, one that can reach down into a data set. The URL here is an example for a particular PDB identifier, able to probe down to an individual data item. So this format was primarily developed to parse complicated data into individual data objects, which are all accessible by individual URLs. And it has become a vehicle by which people can construct queries across different websites, using this kind of syntax. It has all the expressivity, portability, and software support of XML, and it's particularly friendly for web service delivery. Our colleagues at PDBj in Osaka provide what's called an RDF service, where all of the data in the PDB is translated so that it is accessible in this kind of style.
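Queries across such a service are typically expressed in SPARQL over HTTP. The sketch below shows the general shape of one; the endpoint URL, resource URIs, and predicate names are placeholders, so consult the PDBj RDF documentation for the real ones.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://rdf.example.org/sparql"  # placeholder endpoint
QUERY = """
SELECT ?title WHERE {
  <https://rdf.example.org/pdb/4hhb> <http://example.org/schema#title> ?title .
}
"""

# Standard SPARQL-over-HTTP: the query is sent as a URL parameter and the
# results come back in the SPARQL JSON results format.
url = ENDPOINT + "?" + urllib.parse.urlencode(
    {"query": QUERY, "format": "application/sparql-results+json"})
with urllib.request.urlopen(url) as resp:
    results = json.load(resp)
for row in results["results"]["bindings"]:
    print(row["title"]["value"])
```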
To summarize, Maggie prepared a table that highlights the attributes of each of the file formats, and their features, according to the criteria that we originally discussed. mmCIF does very well, except that it would not be the popular choice for web service delivery. The PDB format has rather serious limitations in terms of content, extensibility, and expressivity, although, in fairness to the format, it served very well for more than 40 years; it's perhaps the most enduring and beloved format in science, I think. It's only relatively recently that the molecular structures being deposited to the PDB have grown to the point where the format can no longer represent a single molecule within a PDB file, because of its fixed-field limitations. mmCIF does very well in all the categories, and the translated formats, PDBML and RDF, are designed to support specific applications and do well, except that they are perhaps not the most compatible or popular choice for the user community.
So, finally, I wanted to tie things together and look at archiving from the perspective of the entire life cycle of the process. The focus here is that if you're thinking about archiving, you might be thinking about the whole life cycle; but archiving means you're not thinking about a process that actually comes to an end, you're thinking about more perpetual care. You've seen the graphic of the PDB pipeline, going from deposition through annotation and validation, and then today we're talking about dissemination. We'll divide that up into steps that occur both before and after those stages. We'll begin with what sorts of things have to be prepared in order to support the pipeline before anybody even sends you any data. A lot of this has already been talked about at different points in the course so far.
We talked, in terms of data quality, about the importance of allowing people to check their data prior to deposition, in an anonymous fashion. You need to be able to assist in the assembly of information for the purposes of deposition in a reasonably automated form, so data harvesting services and tools. You need specifications for formats and reference data; in the context of structural biology, this would be the chemical reference data, to be sure, as well as the molecular descriptions. The specification for the data format that the PDB uses, that's used in structural biology, is based on the data dictionaries that we've discussed in the course. And then there are the procedures, the data requirements, the examples, and so forth: there can never be any shortage of helpful documentation that will lead people through the process, or help simplify the process, of assembling data for the purposes of deposition.
So, just to review some of the steps that were part of the deposition process. We need the functionality to capture the data sets that are the data objects of interest, as well as all of their supporting metadata; this is where data harvesting comes in. One of the steps required in the deposition process is to assign the user an accession code for the data that's being deposited; we talked about what some of those requirements for accession codes were a few moments ago. You've heard the lectures already describing the process of data processing, annotation, and validation, and then being able to return data quality diagnostics back to the depositor. This is an iterative process, as you've heard: depending on the outcomes of these validation reports and the completeness of the data set, there may be multiple steps involved.
From the point of view of archiving, something that perhaps hasn't been discussed so far is that all of these steps need to be recorded, in the form of audit records and history, so that it's clear what has actually taken place in the processing of an entry. Maintaining an audit trail with all of that information is a responsibility of the repository. That information may not be conveyed to the public, but it's an important record-keeping aspect.
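A minimal sketch of such an audit trail is an append-only log, one record per processing event. The field names here are illustrative, not the PDB's actual schema.

```python
import json
import time

def record_event(log_path: str, entry_id: str, action: str, actor: str) -> None:
    """Append one audit record; the log is never rewritten, only extended."""
    event = {
        "entry": entry_id,
        "action": action,          # e.g. "deposited", "validated", "released"
        "actor": actor,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")

record_event("audit.jsonl", "4HHB", "validation-report-sent", "annotator-01")
```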
And finally, as you've heard, information that comes in through the deposition pipeline may not be immediately ready to be conveyed to the public. So managing the embargo period, and dealing with communication with any individual, either the depositor or the journal associated with the publication, that may have some dependency on the embargo period, is something that also has to be addressed. Now, I don't think we have talked at all about the release process. As a final step, prior to making the data available, steps have to be taken to ensure that the data objects are prepared in the appropriate formats, and are organized and ready for public dissemination.
If we look at the PDB as an example, that takes place on a weekly schedule, which involves a rather complicated set of steps. Information about the data sets, and the data sets themselves, from all of the different sites in the world that collect PDB data, are assembled in all of the different formats that are produced and that are going to be disseminated into the archive. All of the derivative files describing that information, which are put into the FTP repository, are produced and checked. That information is then added to the repository, and that master copy of the repository is then made available to be copied out to the sites that will distribute the information. There is a set of, basically, final consistency checks that are made before the data goes out into the public view.
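One form such a final consistency check can take is verifying every file against a checksum manifest before the master copy is mirrored out. This is a sketch; the manifest layout (path, tab, MD5 digest) is an assumption for illustration.

```python
import hashlib
from pathlib import Path

def verify_manifest(manifest: Path) -> list[str]:
    """Return the files whose checksum does not match the manifest."""
    failures = []
    for line in manifest.read_text().splitlines():
        path, expected = line.split("\t")
        digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
        if digest != expected:
            failures.append(path)
    return failures

bad = verify_manifest(Path("release-manifest.tsv"))
if bad:
    raise SystemExit(f"release blocked, checksum mismatches: {bad}")
```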
Once the data actually gets into the public view, it's necessary to be able to address what will happen if, for instance, some change needs to be made. If some update is required to a data set, a choice has to be made about how that impacts the accession scheme of the repository. For example, in the case of the PDB, the accession code is a single four-character code. In the event that someone changes a page number in the primary citation, those changes are made directly to the entry; a revision record is placed inside the entry indicating that that revision has been made, and the accession code basically remains unchanged. It's treated as a minor modification. On the other hand, if for some reason the author finds that a substantial error has been made, one that requires changing something in the experiment or the material structure that has been deposited, so that either the sequence has somehow changed or the coordinate model has to be changed, that has traditionally resulted in the obsoleting of that entry, and a re-accession of the modification under a new PDB code.
That provides some level of protection: if someone has pointed to a particular ID in the PDB and performed some calculation on that entry, they can be sure that the calculation they've done will be enduring, in the sense that if for some reason the coordinates they used to perform that calculation have changed, then the ID of that entry will have changed too. The way the PDB has managed this is that entries that become obsoleted through these processes are taken out of the main archive and put into a separate place, where they remain available forever as obsoleted entries. So that's one way of managing the change process: by identifying the level of change that's important, and having a well-defined procedure for dealing with major modifications.
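The policy just described can be sketched as a small piece of decision logic: minor edits revise the entry in place and append a revision record, while major changes obsolete the entry and re-accession the replacement. The field names and the major/minor classification here are toy stand-ins, not the PDB's actual rules.

```python
# Fields whose change is "major" in this toy model.
MAJOR_FIELDS = {"sequence", "coordinates"}

def apply_change(entry: dict, field: str, value, new_id: str) -> dict:
    if field in MAJOR_FIELDS:
        entry["status"] = "obsolete"          # the old entry is kept, but flagged
        return {"id": new_id, "status": "current",
                "supersedes": entry["id"], field: value}
    entry[field] = value                      # minor: edit in place
    entry.setdefault("revisions", []).append(f"updated {field}")
    return entry

entry = {"id": "1ABC", "status": "current", "citation_pages": "10-20"}
entry = apply_change(entry, "citation_pages", "11-21", new_id="2XYZ")
print(entry["revisions"])   # ['updated citation_pages']; same accession code
```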
I should say, in fairness, that the PDB has recently been revising this policy, to make it possible for people to retain the PDB identifier code and create a new version of a particular entry in the event that the contributor makes a change to his own data. It's been a longstanding point of contention with depositors who have published a particular ID in a manuscript and want to make a change, potentially a substantial change, to that entry, and do not want that to result in a change of the identifier. That makes things very difficult to track, and it makes it more difficult for people to understand the history of how an entry has changed; the history is actually kept in revision records in the data file, which is extra complexity for the individual. So, in the future, at least for changes made by the contributor and depositor, a versioning scheme will be provided.
And finally, there are some additional things that need to be taken into consideration as the steward of public data, or data that's been contributed by others: what happens if something terrible happens to your computer infrastructure? There are a variety of different schemes that people have concocted, if you will, for what would constitute best practice in this regard. A relatively typical one, which is tractable today, is to maintain multiple online copies of everything, synchronized at all times, and where possible to do that across physical sites. If one data center has a problem, the data is safe in the other data center. Doing that online is now possible because of the speed of networks and the speed of storage hardware. For absolute safety, the PDB is continuing to archive data on magnetic tape, although that's becoming increasingly less useful, because the speed of tape is not really keeping pace with the size and complexity of the data that we have; but it does serve as a one-level-down kind of safety net in case of a huge catastrophe.
Another option that's available now is to think outside of the local data center scope and take advantage of facilities in a remote hosting context. It's now possible to lease storage at remote sites from large trusted providers like Amazon, or Microsoft, or Rackspace, and store data in a general repository. Amazon, for instance, provides what they call Glacier storage, which is designed for data that is written once, for the purposes of disaster recovery, and is not intended to be read back many times. The cost of that storage is perhaps 50% of what it would cost to buy hardware to store something locally.
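As a hedged sketch, pushing an archive snapshot into that kind of cold storage with Amazon's Python SDK might look like the following. The bucket name is made up, credentials and region are assumed to be configured elsewhere, and the available storage-class options should be checked against the current AWS documentation.

```python
import boto3

s3 = boto3.client("s3")

def archive_snapshot(local_path: str, key: str) -> None:
    """Upload one file to the GLACIER storage class for disaster recovery."""
    with open(local_path, "rb") as data:
        s3.put_object(Bucket="my-archive-backups",  # hypothetical bucket
                      Key=key, Body=data,
                      StorageClass="GLACIER")

archive_snapshot("weekly-release.tar.gz", "snapshots/2016-w07.tar.gz")
```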
The disadvantage of remote hosted storage, from an economic perspective, is that it's a subscription of sorts: you're basically locked into paying a regular monthly or yearly service fee. There's also a variable cost for data access. While the cost to store the data is, sort of, a fixed storage cost, if for some reason you need to access it frequently, that incurs a variable cost which also has to be budgeted. So the economics of this are not necessarily simple, by comparison. There are issues with privacy as well, although these very large providers certainly provide very, very good privacy guarantees.
So, that is it for this part.
I'll be happy to answer
your questions before
we go on to the practical discussion.
- [Voiceover] John, I just wanted to ask you about the roles of the user and your staff here; they basically differ by stage, so what are their responsibilities? At pre-deposition time, is it already the user who keys in all the info, and after that, is each stage the responsibility of your staff, who go through all the changes? For example, if the final entry has errors, typos, can the user fix that directly, or do they need to go through you?
- [Voiceover] Okay, so it depends on whether it's before or after release, in the case of the PDB. The annotation staff, certainly everyone here, the people that we've met from the project, are all responsible for providing the infrastructure to support pre-deposition: the harvesting, the metadata specifications, and all the chemical reference data are all maintained as part of the annotation responsibility. When the depositor is creating his entry, as much as possible we provide tooling to help him do that in the most automated way possible. In structural biology you'll end up with a bunch of large output files from the different processing programs, and tools have been created that will read those files, pull out the data that needs to be deposited, and put it in the right form in an mmCIF file, so that that final file can be uploaded to the deposition session, leaving the depositor with a smaller number of things to enter by hand.
The biological and bibliographic information that is not captured as part of the structural refinement pipeline typically has to be added by hand, or through some templates that we provide the depositor. A lot of groups doing structure determination take advantage of being able to prepare even the more biologically focused and bibliographic information electronically before they deposit, so they have very, very complete data files at deposition time. Now, during the deposition process, it may be found that there's a problem, or something is unexpected, or the depositor finds that he needs another refinement after he has started the deposition process and wants to finish after the change. The deposition system supports changing things up until the point that the file is released. After the file is released, currently, changes are done through a communication protocol which is provided by the deposition system: the contributor conveys what needs to be changed, it's done by the annotation staff, files are returned back to the depositor, and those are re-released as revisions to entries. I will say that it's not unusual for the number of re-released or revised entries on a weekly basis to be, literally, a hundred or more. So the process of post-release revision is a big part of the data processing pipeline.
Often it's nothing more than updating citation information. Entries are often being released at the point of publication, and that's now electronic publication, so the paper page numbers, for instance, aren't available at the point that the entry goes out for the first time. Something like that has to be added after the fact, and the entry re-released. Or there may be other small changes that take place. But that's a big part of the process. And from an engineering perspective, trying to make that as automatic as possible has been a design goal of the new deposition system that has been built; but the post-release modification part is still not where we would like it to be. So, in the future, it would be great if it could be completely electronic, and in the depositors' hands, but we're not at that point yet.
- [Voiceover] Yeah, I had a question I wanted to ask you about risk management. You mentioned that you keep backups to make sure everything is uploaded correctly; if somehow in the database only a couple of records go wrong, what is your process? Are you going to load the whole set in, or are you going to load records individually?
- [Voiceover] So, we have to step back a little bit, because we're getting into what we're going to talk about in the next few minutes. Right now, the data processing that's done in the PDB, and this is the way the PDB organizes its data, is all done in terms of flat files, not in terms of database objects. So the authoritative copy of the data, during the annotation process, is that flat file. If there are changes that need to be made to the data, that's done through an editing session on that flat file, or through a software tool that makes the change. The loading of information into databases for delivery or search is a separate step. So the making of changes and the perfecting of the data, if you will, all takes place on flat files. And the reason for that is that all of the tools that consume structural biology data, that have to deal with coordinate data, are all expecting a flat file input. Moving to a database management structure for this type of data is not very efficient: you basically end up in a situation where you might be storing the information in a database, but you've got to spit it back out into a flat file in order for a program to read it. So our choice is mainly to keep the data in a form that is as close as possible to what the tools consume, and then move to a database infrastructure for the purposes of delivery and search at the end of the process, once the data has been perfected to that extent. But you can reload it any time you want.
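A sketch of that division of labor: the flat files stay authoritative, and the search database is a disposable index rebuilt from them. Everything here (table layout, field names) is illustrative.

```python
import sqlite3

def load_for_search(entries: list[dict], db_path: str = "search.db") -> None:
    """(Re)build a derived search index from parsed flat-file records."""
    con = sqlite3.connect(db_path)
    con.execute("DROP TABLE IF EXISTS entry")   # the index is disposable
    con.execute("CREATE TABLE entry (id TEXT PRIMARY KEY, title TEXT)")
    con.executemany("INSERT INTO entry VALUES (:id, :title)", entries)
    con.commit()
    con.close()

# Records parsed from the flat files, which remain the master copy.
load_for_search([{"id": "4HHB", "title": "human deoxyhaemoglobin"}])
```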
- [Voiceover] Thanks.
- [Voiceover] So, I think we should thank John for giving a very clear description of the life cycle of data. We'll take a 10 minute break, and then we're going to actually learn how to create a database, because that's what you're going to be doing with your own data. So, this is just sort of the grand finale. (laughing) You're going to go through a step-by-step description of what to do, and then you're actually going to start the process yourself. So, this might be the most important class we have, right now. Yeah, it's very important, because this time we're going to do something with the data that you've been looking at, and I think then you'll get a better idea of what happens with data. We can--