- [Voiceover] Okay, so today we're going to talk
about data deposition, which is the process
of getting your data into the archive.
And the two people who are going to talk about this
are Ezra Peisach and Brian Hudson,
both of whom work as biocurators for the PDB.
Both have experience with
(speaking drowned out by audio malfunction)
X-ray, NMR, and EM
(speaking drowned out by audio malfunction)
knowledge of the work that's coming into the PDB,
which is really required in order to
maintain the archive in a professional way.
And so we're very, very lucky
to have the quality of (mumbles) that we have,
and so the first one is going to be Ezra,
who's going to (speaking drowned
out by audio malfunction)
- [Voiceover] Good morning.
And afternoon for those
in the other conference.
So I'm going to start by giving you
an overview of what we mean by data deposition.
We want to have an archive, over here,
an archive that we want to distribute eventually.
So you have to be able to
somehow bring in your data.
You have to add information, add content,
and this is the process of people
bringing their data into the archive.
After you deposit data,
and this is broken down across the various lectures,
after deposition there will be
some form of processing
where we then perform curation
that adds value to the data.
We ensure that there's uniformity,
and we may provide extra (mumbles) back to the content
to relate it to other known databases.
But that will be more next week.
So, why would you want to deposit data?
Various reasons. Among scientists,
you want to share your data
with your colleagues, with your peers.
This allows your work to be compared
to other work that's being done in other fields,
or even by your competitors.
Most journals now require
that you deposit your data.
And there are funding agencies
that say you must share your materials.
This is more of an electronic form,
but if you're working in the biological sciences
in the United States,
for anything that you publish in a paper,
you must make your new data publicly available
to other scientists so they can then
verify your work or build on where you started.
Another reason is that you
want to preserve your data.
Having this on a single hard disk
somewhere in your lab is prone to failure:
could be a fire, could be water,
could just be electrical failure.
Putting it into a redundant archive
ensures that your data will be available
for many generations thereafter.
I would like to comment, though,
that not all archives make
their data publicly available.
There are terms and conditions for every archive;
the ones that we are familiar with here,
and that we generally talk about, do have public access,
but the government does not require it and it's (mumbles).
So the deposition system is essentially
the front door to an archive.
If you think about it, it's sort of like a bank.
You know, you want to go open a bank account.
Well, what are the properties that you want
to have with this bank account?
You want it to be secure.
You want to make sure that you're the only person
who can have access to your account.
I don't want to deposit money
and then have some random other person
come in and say, "Give me his money."
So that's why your bank is secure.
You also want to ensure the privacy of the data.
If I'm depositing data to an archive
before I'm ready to publish it
or make it publicly available,
I want to make sure that it's private.
There may be some patents involved.
There may be material that
would be of interest to competitors.
Whatever the reason,
you want to make sure that you are not
going to open your information
up to people before it's time.
The deposition system ideally
should be very user friendly.
You want to be able to
easily and quickly add the data
and make any additions that you need.
This has not historically been true
of all deposition systems;
some have been very cumbersome.
I can comment that when I deposited
my first data to the Protein Data Bank,
it took me three days to get the data deposited
because there were such strict
formatting requirements.
Things have evolved since then.
We want a deposition system to
catch and block easy errors.
If someone makes a very obvious typo
or says something that clearly
does not agree within the data,
you want the deposition system
to just stop you from continuing on.
You want that type of
check to be done early on
and not down the road
in the processing cycle.
So, what would you want to deposit?
For a structural biology database,
in our case we are always interested
in the model coordinates.
These represent the end result
of determining the structure.
I know that a number of you
have looked at, done some searches
on the Protein Data Bank,
and you have found (mumbles) of interest.
Behind there, there's a model.
We now require the experimental data.
What is the model based on?
What experimental data? For diffraction studies,
what are called structure factors;
for NMR, chemical shifts;
for electron microscopy, electron density maps?
We want to have access to that information
so that you can state this model agrees
with the data and it can be independently verified.
We also want to collect a variety
of what we call metadata.
These are things that might be
relevant to the experiments.
So it might be data collection parameters,
it might be temperature.
These are things which are
not specific to the model,
but are relevant if you want to make
comparisons between your model
and someone else's model.
How you produced the materials
for that study may be important.
And keeping it in one central place
means people don't have to go looking in a journal,
finding the paper, and asking,
"Oh, did they actually put down this information?"
These are the things that are
relevant to the experiment.
The type of data that are collected really
depends on the maturity of
the archive and the field.
When a field, or actually an archive,
is relatively young,
they may not really know
what they want to collect.
They just want to get the data into place,
and then as the field and the archive evolve,
they may discover, hey, it would be very
useful if we added this extra information.
So, we're gonna discuss
later on the three archives.
One of them is the homology model archive,
the Model Archive,
which is a relatively new archive
of theoretically determined models
based on other structures.
And they right now are at the stage
where they just want to get the models in.
They are not interested in very many details
of how you determined the model.
But they will evolve over time.
The Protein Data Bank,
as we've been discussing,
has been around for over 40 years.
We have various different models,
so we are also interested
in extra metadata that is in a format
that can be read by other computers,
that allows you to share information.
The Cambridge Structural Database
has structures of the small molecules,
and there was to be a (mumbles),
but they have the advantage
that there are pretty much one or two pipelines
producing all the data,
so they collect all the information they want
during the process, so by the time
you get to the deposition stage,
it's pretty much ready to go.
So I'm going to outline
what I consider the five stages
of a deposition process.
I've broken 'em down here into
a registration process,
a method to upload your data,
data entry of metadata,
a data check and validation,
and then the final submission.
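To make the flow concrete, here is a minimal sketch of those five stages as a tiny state machine in Python. The stage names come from the lecture; the enum and helper are purely illustrative, not any archive's actual implementation.

```python
# Purely illustrative: the five deposition stages from the lecture,
# modeled as an ordered state machine.
from enum import Enum

class Stage(Enum):
    REGISTRATION = 1
    FILE_UPLOAD = 2
    DATA_ENTRY = 3
    CHECK_AND_VALIDATION = 4
    SUBMISSION = 5

def next_stage(current: Stage) -> Stage | None:
    """Return the stage that follows `current`, or None after submission."""
    members = list(Stage)
    i = members.index(current)
    return members[i + 1] if i + 1 < len(members) else None

print(next_stage(Stage.FILE_UPLOAD))  # Stage.DATA_ENTRY
```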
You want to deposit your data to an archive,
just like you go to a bank.
The first thing they do is say,
show us some ID, give me your phone number,
give me relevant information.
Well, the registration process allows
the biocuration staff to eventually contact
the depositor, and makes sure that we have
a conduit for communicating with them.
Registration can be as simple as providing
an email address and an uploaded file,
or it could be very complicated,
sort of like when you register an account online,
where they send you an email message
and you open up a link in a web browser.
Essentially what you're doing is confirming
that the communication channels will work.
This is also the time when archives
usually inform a depositor
what the policies of the archive are.
They may have to do with data sharing,
they might have to do with privacy,
anything that's important to the archive
and that they feel the depositor
should be well aware of before
they commit to the deposition process.
It might have to do with when
you will release the data:
if we don't hear from you in a year,
we'll release the data without talking to you.
These are policies that are
made clear up front to the user.
The file upload process is where
you actually transfer the data
that you have put together
to the deposition system.
You want to somehow transfer
all your data en masse to the system;
you're not gonna want to type it all in
by hand, one line at a time.
In order to ensure this works,
your uploaded data must adhere
to accepted standards.
So the Protein Data Bank uses a format
that's 40 years old, fixed-column data,
in terms of what has to be in the format.
You want to be sort of generous
in terms of what you accept,
but you've got to be very strict
about what you put out at the end.
But if we can't talk to each other,
if we can't agree on what
the standard format is,
then there's no way that we know
what you're providing us.
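As an illustration of what fixed-column data means in practice, here is a small Python sketch that parses one ATOM record of the legacy PDB coordinate format, where each field lives at fixed character positions rather than between delimiters. The column positions follow the published legacy PDB layout; the function itself is just a toy, not the archive's actual parser.

```python
# Sketch of strict fixed-column parsing, in the spirit of the legacy
# PDB format: every field sits at fixed character positions, so a
# misplaced value is immediately detectable as a format error.
def parse_atom_record(line: str) -> dict:
    """Parse one ATOM/HETATM line of a legacy PDB coordinate file."""
    if not line.startswith(("ATOM  ", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial":   int(line[6:11]),      # columns 7-11
        "name":     line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain":    line[21],             # column 22
        "res_seq":  int(line[22:26]),     # columns 23-26
        "x":        float(line[30:38]),   # columns 31-38
        "y":        float(line[38:46]),   # columns 39-46
        "z":        float(line[46:54]),   # columns 47-54
    }

record = "ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00 10.00           C"
print(parse_atom_record(record))
```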
Some systems will allow you to include
the metadata at upload time
if you have it in a file,
but it is required that you do
full data entry afterwards.
Preparing a file for upload
is sort of like preparing your taxes,
which in the United States
are due in about three weeks.
You need to collect all your information.
This is what you need for any deposition:
you've got to know everything
about what you're depositing,
know the ins and outs,
and have it readily available.
So in the case of a macromolecular database,
like the Protein Data Bank,
you're gonna need to have
your experimental model and your data.
You're gonna need to have any metadata
ready to be entered later on.
So, what do we mean by metadata?
I keep talking about metadata.
This comes from one of the previous talks,
to remind you of some of the things
that we've said in the past.
For an X-ray structure, it might be
what buffer the crystals are sitting in;
it might be how you grew the crystals.
For NMR, what buffer is your sample in?
What type of labeling experiments did you do?
Did you label the nitrogen?
Did you label the carbon?
For electron microscopy,
how did you freeze the sample?
How did you prepare the sample?
These are the types of information
which are useful to have
known later down the road.
While not absolutely required, or essential,
the communities that have
prepared these archives
have felt that this information
is useful to retain,
and that is why we ask for it.
We also allow for what we call data harvesting.
In structural biology, structural biologists
use a variety of different programs
for processing their data and for refinement,
and there's a lot of information to pick up
that needs to go in one place.
Some deposition systems have
a data harvesting system
that combines various data together automatically:
it will interpret the log files
in lots of different formats,
pull them together, extract various metadata,
and allow you to do the
eventual file upload all in one shot.
This can also allow for format conversion.
As I said, we are very strict
in terms of what we require for file uploads,
in terms of the file formats that are allowed.
Well, there may be a variety of other formats
that have been used in the field,
and you can have the data harvesting program
do the format conversion for you,
in a way that's relatively easy
for the outside user.
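As a sketch of the idea only (the log layouts and field names below are invented, not those of any real processing program), data harvesting might look something like this in Python: scan each log, pull out whatever metadata it recognizes, and merge everything for a single upload.

```python
# Toy "data harvesting": scan log files written by different
# (hypothetical) programs, each with its own format, and merge the
# extracted metadata into one dictionary for a single upload.
import re

def harvest(log_text: str) -> dict:
    """Pull selected metadata out of one log file, whatever its layout."""
    meta = {}
    # Hypothetical layout 1: "Resolution range: 48.2 - 1.90 A"
    m = re.search(r"Resolution range:\s*([\d.]+)\s*-\s*([\d.]+)", log_text)
    if m:
        meta["resolution_high"] = float(m.group(2))
    # Hypothetical layout 2: "TEMPERATURE=100"
    m = re.search(r"TEMPERATURE\s*=\s*(\d+)", log_text)
    if m:
        meta["temperature_K"] = int(m.group(1))
    return meta

def harvest_all(logs: list[str]) -> dict:
    combined: dict = {}
    for text in logs:
        combined.update(harvest(text))  # later logs win on conflicts
    return combined

print(harvest_all(["Resolution range: 48.2 - 1.90 A", "TEMPERATURE=100"]))
```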
The next step, after you've uploaded your files,
is going to be data entry: the input of extra
data items that could not be extracted
from the model files.
These might be the metadata
we've been talking about.
It might be other missing data
that we feel is biologically important
and could be useful to know,
such as what the type of your structure is.
That's sort of a useful item.
Hopefully the user-friendly form
or interface will allow for rapid
and easy entry of the information.
And actually, nowadays, filling in all the
extra little information is probably the largest
commitment of the depositor's time,
because the file upload is
relatively painless now.
- [Voiceover] So what you mean by the largest
commitment of depositor time
is that entering the metadata
takes the longest?
- [Voiceover] Right, nowadays the metadata
takes a lot longer than the file upload.
- [Voiceover] Do you think it takes longer
than solving the structure?
- [Voiceover] No.
If it was from automated pipelines, maybe.
(laughter)
Right.
No, right now, if you
know what you're doing,
I had mentioned that it took
me three days the first time,
nowadays I would say that if
you know what you're doing,
you can probably go through
the whole process in an hour,
if you have all your materials with you.
If you have to put it down and start searching,
Word, Acrobat, and what notebook
did I record that in,
gathering all the information could take long.
The deposition systems that
we've been talking about
usually have some form of validation
or checking of the data.
These might be internal consistency checks:
we want to make sure that if you've said
one thing in one part, it matches
another part, for instance.
You may claim that your structure
has a certain resolution,
a certain quality to it,
but if the supporting data that you have
to go with that model doesn't agree with that,
then we can check that and say,
wait, there's something wrong.
Block it and say, is this really correct?
Think about this.
Then there's validation against
community standards.
For biological structures,
there are now a number of
validation task forces
which have been meeting to try to establish
standards for what is
considered a good structure,
and what qualities you should be looking at.
So there might be checks against
these types of standards,
known (mumbles), for instance.
I don't know if you've spoken about
individual amino acids, but these are sort
of the building blocks of life;
we know what they look like.
So if your model says this bond length is five,
you know, it's very large,
instead of what it should normally be,
we could flag that and say
there's something wrong here.
And that can be useful to do
before you deposit the structure,
because then as a depositor
you can go back and say,
let me check this problem.
We can also perform checks against
the archive as a whole.
We have, at least in the Protein Data Bank,
over 110,000 structures.
So we can say something about what,
on average, a typical structure looks like,
and say this is (mumbles) sort of a normal structure,
or there's something very unusual about that.
Again, it could be flagged.
It may be correct, there may
be a scientific reason,
but we allow the depositor
to think about it before submitting it.
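Here is a hedged sketch of both kinds of checks, with illustrative numbers only (the ideal bond length, tolerance, and archive statistics are stand-ins, not actual community-standard values): comparing a geometry value against an ideal, and comparing a quantity against archive-wide statistics as a z-score, flagging rather than blocking.

```python
# Sketch of the two check styles described in the lecture, with made-up
# reference numbers: a bond length compared to a community-standard
# ideal, and a quantity compared to the archive-wide average as a
# z-score, so unusual (but possibly correct) values are only flagged.
def check_bond(length_a: float, ideal_a: float = 1.53, tol_a: float = 0.2) -> str:
    """Flag a C-C bond length far from an ideal value (values illustrative)."""
    if abs(length_a - ideal_a) > tol_a:
        return f"FLAG: bond length {length_a:.2f} A is far from ideal {ideal_a:.2f} A"
    return "ok"

def zscore_flag(value: float, archive_mean: float, archive_sd: float,
                cutoff: float = 3.0) -> str:
    """Flag values more than `cutoff` standard deviations from the archive mean."""
    z = (value - archive_mean) / archive_sd
    if abs(z) > cutoff:
        return f"FLAG: unusual (z = {z:+.1f}); may still be correct, please review"
    return "ok"

print(check_bond(5.0))  # the "bond length of five" example from the lecture
print(zscore_flag(0.35, archive_mean=0.22, archive_sd=0.03))
```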
Data checking can be done
at multiple times throughout the process.
It could be done at the time of file upload,
it could be along with data entry,
or just after data entry,
before you finally say you're ready
to submit this to the archives.
And essentially, any major issues
really need to be fixed
before submission is allowed.
So once you have all your data together,
you've collected it,
and you've agreed to all
the terms of the archive,
now's the time to submit this structure,
sorry, to submit your entry.
This is the point at which you say,
I am done with the data entry,
I'm ready to let someone else look at my data
and say, did I make a mistake, is it ready.
And at this time we usually give it
some form of accession number,
some form of reference that you can use
for a journal publication.
This is sort of gonna be the outside reference
to the data you're preparing.
And it's also a way of saying, I am done.
So what happens after this?
You've deposited data, and there will be
(mumbles) more next week,
oh sorry, two weeks from now.
The data are then processed
and reviewed by biocurators.
And there are two different
forms that can take.
There could be human curation,
which may increase the quality
but takes a lot of time,
or it could be automatic processing,
which is very efficient and quick.
It depends on the needs of the archive
and what they've established
as their standards.
After depositing data, you will eventually,
hopefully, get some form of
corrections and feedback,
and you can then have a back and forth,
maybe through the deposition system,
maybe through email or some other process,
of incorporating corrections
into whatever your deposition is.
And then eventually your deposition
will be released to the public,
or whatever the archive requirements are,
on some form of schedule.
It could be based on publication,
it could be based on
I'm ready to let it go,
it could be years or
it could be immediate.
It depends on the archive's
policies and rules.
And I think that pretty much summarizes
the actual basic steps
of a deposition system.
And I'll take questions now.
- [Voiceover] Yes, so could you describe,
in the deposition validation,
what the deal breakers are?
- [Voiceover] Sure.
Okay, so some of the deal breakers.
Are you talking about inconsistencies?
So, you could say you collected the data
three years from now.
That would be pretty much a deal breaker.
Let's see, are you talking about
the validation reports
or are you talking about
just other types of checks?
- [Voiceover] Well, the validation.
You have a validation during deposition.
Which things will cause
you not to get the ID?
- [Voiceover] It depends
on the system I would say.
- [Voiceover] So, the reason
I'm saying this is that,
for the students in the class,
in order to publish a paper in a journal,
you have to show that you have an ID.
So, the question is,
people do whatever they
need to do to get an ID,
and the question I have is, how?
- [Voiceover] Right, how generous?
- [Voiceover] Yeah, how--
- [Voiceover] Right now in the wwPDB,
the validation report, which should take,
I can go back, let me
go back to this thing.
The internal consistency checks:
if your format or your file
does not match the requirements,
you'll get blocked.
If you have outliers in terms of
cross-checking different data fields
within your model, you'll get blocked.
If you can't describe what
your chemical components
are, you'll be blocked.
But checks against the aggregated archive
would not block you at the moment.
- [Voiceover] Okay, that happens later,
and we'll talk about that in two weeks,
but I just wanted to have the clarity
that during deposition
things are relatively loose,
but they could get
tighter as time goes on.
- [Voiceover] Right, as
the archive develops,
they tend to do more and more checks.
Brian will talk about the
Cambridge Structural Database,
and they have tighter consistency checks
than the wwPDB in terms of
the actual format checking,
before you get to the actual deposition,
before you submit, I mean.
- [Voiceover] Yeah.
- [Voiceover] I think, Cathy, you
were about to say something?
- [Voiceover] Yeah, I was
gonna ask a follow-up question,
which is, if somebody deposited a structure
and a set of structure factors
that were from a completely
different data set,
they would be able to make
a deposition and get an ID?
- [Voiceover] At the moment, yes.
- [Voiceover] Cool.
- [Voiceover] Well, consider the fact that
if there's a major
inconsistency, for instance,
you know, if the unit cells are mismatched
between the model and the structure factors,
that can be blocked.
But if you're gonna be interpreting
whether a ligand is really present or not
in the electron density map,
that's a little more subjective.
- [Voiceover] Just for the students
who are not going to be here
for the second half of the
course, during annotation--
- [Voiceover] They are checked.
- [Voiceover] Everything
is checked very carefully
and flagged if there are problems.
And the real question with deposition is
how strict you want it to be
before you give an ID,
and that will change over time.
It has changed over time,
and hopefully it'll continue to change.
- [Voiceover] Right.
As people decide we want tighter standards,
they can be checked easily
at deposition time as the field advances.
And John, are you trying to
raise your hand back there?
(student speaks off microphone)
Okay.
I can mention that too.
Well, I didn't prepare a slide about this,
but at the wwPDB we also have
a stand-alone validation server,
so that before you actually go to
the deposition process,
you have a way of doing preliminary checks
on your data without anyone else seeing it.
You could (mumbles) and you could do this
over and over again and ask,
what type of feedback am I going to get
before I actually start a deposition?
And we haven't been encouraging people
to go that route; it's up to each individual
whether they do that or not.
Any other questions at this point?
Otherwise I'll turn it over to Brian Hudson.
- [Voiceover] Thank you.
Bear with me for a moment while I
transfer the microphone here.
Let's see, am I audible?
Okay, that sounded very audible.
Okay, that's good.
Better very audible
than not audible enough.
Okay, so my name is Brian Hudson.
I am a biocurator at
the Protein Data Bank.
And I'm going to,
I'm fading out for some
(speaking drowned out
by audio malfunction)
So, I'm going to go over a few examples
of actual existing deposition systems
to give a more concrete foundation to a lot
of the concepts that Ezra has just discussed
in the first half of the lecture.
The three deposition systems
that we'll be talking about
are for three different archives.
The first one is for the Model Archive,
which is an archive of theoretical
3D models of macromolecules,
such as proteins, nucleic acids, et cetera.
It's a fairly young archive:
it was established in 2013
and, as of last week, has
about 1,400 structures in it.
The second is for the Protein Data Bank,
which has a much larger scope,
containing the
three-dimensional structures
of macromolecules that have been solved
by a number of experimental techniques
including crystallography and NMR
and 3D electron microscopy.
The Protein Data Bank is over
40 years old at this point
and has well over
100,000 structures in it.
The third is the Cambridge
Structural Database,
which archives the
three-dimensional structures
of small molecules, though small molecules
are not necessarily really small,
small molecules determined using,
specifically, X-ray crystallography.
It's an older archive than all of the rest,
having been established in
1965, so over 50 years old,
and it has well over 800,000
structures in it at this point.
So our first example is the Model Archive.
It was established as part of
the Protein Model Portal
by the Swiss Institute of Bioinformatics.
It is used to collect the
three-dimensional models
of macromolecules determined
using theoretical methods.
It has a fairly narrow focus,
and it's a fairly young archive.
In fact, it was established in 2013,
a few years after, and specifically because,
somewhere around 2006 I think it was,
the Protein Data Bank,
which had been accepting
theoretical models for a while,
decided that it was going
to narrow its scope slightly
and cease to accept theoretical models.
And so after that, theoretical
models needed a place to go.
And so the Model Archive was born.
So as far as the specifics are concerned,
reaching just a little bit here
at the bottom of the page:
Ezra talked to you about
the five steps of validation,
not validation, excuse me,
the five steps of deposition,
starting with registration,
moving on to file upload.
We're actually gonna break that pattern
immediately here with this example,
because for the Model Archive,
you can actually upload your file
prior to doing any sort of registration.
And this is actually, to some extent,
an example of what we were talking about
just at the end of Ezra's
half of the presentation:
pre-deposition validation.
In other words, being able to check
your data prior to actually
beginning a deposition.
And that's sort of what the Model Archive
has built into its system.
With the Model Archive, one can
upload your file, your structure,
before doing any sort of registration,
and run all of the checks on the files.
I know this page here says
file upload and validation.
I've been told validation
is actually somewhat of a misnomer
in the case of the Model Archive;
what's actually occurring here
upon file upload is
extensive format checks
and then some geometrical checking.
So it's checking bond angles, bond lengths,
against standards just to make sure
that everything is good.
But what happens here is
you do a file upload, it does the checks,
and then afterwards,
if everything is acceptable,
it is then possible to register.
Registering at the Model Archive
is simply a matter of making an account,
just as though you're making
an account at any sort of website
that you might encounter on the web.
Once an account has been registered,
the other pages become available
and one can go about the
process of data entry.
This is the interface.
You can see this is kind of a common way
that interfaces are set up.
The next example will be of
the wwPDB Deposition System,
and there's a very similar look,
where you have the menu on the left
and then the content of
the page on the right.
For the Model Archive you have
several different categories that appear.
Here one is able to enter data,
including citation information
and materials and procedures used.
Now, as I said, the Model Archive
is a rather young archive,
and there's still a lot of variability
in the techniques that get used
for solving these models.
And what you'll notice,
looking at the page,
is that under, say, here,
the section we're given on the procedures
for determining the model,
there's just a big blank text field
where the depositor can enter
any information that they want to.
Now, this is a way of maximizing
the flexibility of the
data that can be entered.
However, in the future,
if a user needs to search these data,
they will have to use more complex
data mining techniques
in order to get information out of them.
And once the depositor has entered
whatever information they want
to provide for the model,
they can then proceed to
submit the deposition;
the submit button is right down
in the lower left-hand
corner of the screen.
The second example is
the wwPDB D&A System.
A lot of letters.
It stands for the
Worldwide Protein Data Bank
Deposition and Annotation System.
It collects the 3D
macromolecular structures
that have been determined using a number
of different experimental techniques,
including X-ray crystallography,
nuclear magnetic resonance spectroscopy,
and electron microscopy.
It has a fairly broad focus,
covering a lot of techniques.
So long as they're experimental,
there's a strong possibility that
the Protein Data Bank will take it.
It covers fields that range
from the rather young, which would be
3D electron microscopy,
to the rather mature X-ray crystallography,
which is a field that's been around
for a while and is fairly
established at this point.
The system is secure with
each deposition session
individually password-protected.
You go to the deposition system
and you'll see something
along the lines of this.
It's slightly different
from the registration you have at,
let's say, the Model Archive.
One of the aspects of the
wwPDB Deposition & Annotation System
is that after deposition there is
a large amount of curation,
a lot of manual curation.
And so this full process,
and you'll hear more about this next week,
the full process is a very
collaborative one between
the depositor and the archive personnel.
And so it's very important that a
communication pipeline
be established very early.
And the system is actually set up
to ensure that a deposition is not made
without communication being possible.
So at this page, in the registration,
when designating the sort of experiment
one is depositing data for,
one supplies an email address.
A message is then sent to that email address
along with a deposition ID,
and the deposition ID can then
be entered into the system.
This ensures that the archive
has a working email address.
There's nothing like receiving a deposition
from a user with a wrong email address;
then there's no way to make contact,
and it becomes rather difficult.
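A minimal sketch of that registration handshake in Python, assuming a hypothetical ID format and in-memory storage (the real wwPDB system of course does more): the system only accepts a deposition ID that it previously mailed out, which proves the email channel works.

```python
# Toy registration handshake (hypothetical ID scheme, not the wwPDB's):
# a deposition can only begin with an ID that was sent to the
# depositor's email address, confirming the communication channel.
import secrets

def register(email: str, pending: dict) -> None:
    """Issue a deposition ID and 'send' it to the given address."""
    dep_id = "D_" + secrets.token_hex(4)  # hypothetical ID format
    pending[dep_id] = email
    # A real system would email this; here we just show the message.
    print(f"To {email}: your deposition ID is {dep_id}")

def begin_deposition(dep_id: str, pending: dict) -> bool:
    """Presenting a valid ID proves the registered address received it."""
    return dep_id in pending

sessions: dict = {}
register("depositor@example.edu", sessions)
```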
So, various features of the wwPDB
Deposition and Annotation System.
All the communication is managed
within the deposition system,
so it's a fully contained system.
By keeping all of the communication
out of general email,
it allows the communication for each
particular deposition
to be kept with that deposition.
So if you go to the deposition,
you have all of the correspondence
right there with it; you don't have
to go anywhere else,
it's all self-contained.
Coordinates and experimental data
can be input in two different ways:
there's file upload and
then there's data entry.
There are very strict rules
for file formats, whereas for data entry
there are guidelines, and then validation,
but not necessarily quite such strict rules.
There's also the ability to provide
some metadata via file upload.
In other words, the formats that are used
for the experimental data can include,
and often do include, metadata
that the wwPDB Deposition
and Annotation System
harvests along with the primary data.
And all of the corrections
and subsequent file uploads
are tracked by the system.
Like I said, it's very self-contained.
So this is generally what the interface
looks like when entering data.
It's very important for a deposition system
that the deposition interface
be user-friendly, easy to figure out.
Someone who is depositing a structure
may be depositing the results
of four or five years of work,
and this may be the only time
they ever see a deposition system.
It's a good idea if they're able to look at it
and figure out, without too much difficulty,
how to get the data in.
And the first step of
deposition is the file upload.
In this case, based on what was designated
during the registration steps,
the system knows what sort
of files will be put into it,
and is able to tell the depositor
what sort of files they should be uploading.
One uploads those files and
designates the formats of those files,
so the system knows what it has,
and then one can continue
with the deposition.
Here is more of the deposition interface;
I'm gonna demonstrate data entry.
As we saw with the Model Archive,
we have the menu along the left-hand side
and then each individual
page on the right.
The system tells the depositor
precisely what they have entered
and what they still need to enter.
There are data items that are required,
mandatory data items,
and there are data items that are requested
but are not mandatory.
And you can see on the right
several of the data items that are mandatory:
they have little red boxes,
while the ones that are not mandatory
have white boxes.
In the left menu you can see
the various data categories.
Some are marked with a green check mark,
some are marked with an exclamation point.
For the ones marked with a green check mark,
all of the mandatory data items
have already been provided.
But those with the red exclamation point
are categories where there are
still mandatory items that
need to be filled in.
Let's see here.
In the PDB deposition system here,
once everything has gone
to a green check mark
in the left-hand menu, that is,
everything mandatory has been filled in,
then the list of requirements
up in the upper left-hand corner
turns into a submit button.
(mumbles) I have a pointer.
This right here becomes a submit button,
and then that can be clicked
and the deposition is submitted.
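Here is a small sketch of that mandatory-versus-optional logic (the category and item names are invented for illustration): a category earns its green check mark only when every mandatory item is filled in, and the submit button appears only when every category is complete.

```python
# Toy version of the mandatory/requested distinction: optional items
# never block submission; missing mandatory items do.
MANDATORY = {
    "citation": ["title", "authors"],
    "sample":   ["molecule_name", "source_organism"],
}

def category_complete(category: str, entered: dict) -> bool:
    """Green check mark: every mandatory item in the category is filled in."""
    return all(entered.get(item) not in (None, "") for item in MANDATORY[category])

def can_submit(entered: dict) -> bool:
    """The submit button appears only when every category is complete."""
    return all(category_complete(c, entered) for c in MANDATORY)

data = {"title": "Structure of X", "authors": "A. Smith",
        "molecule_name": "lysozyme", "source_organism": ""}
print(can_submit(data))  # False: source_organism is still missing
```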
So in addition to data
entry, as I said before,
there is a certain amount of
data harvesting that occurs.
Or rather, that can occur.
This is sort of an illustration of that.
On the left we have some data items
that may have been in
an uploaded mmCIF file,
along with coordinates.
And on the right, you see how those
would be filled in automatically
in the deposition system,
which has the benefit that,
if these data are present in the file,
the depositor does not
have to spend their time
entering all of the information manually.
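As a rough sketch of that kind of harvesting, the toy Python below pulls simple key-value items out of an mmCIF-style file so a form could be pre-filled. The item names shown are real PDBx/mmCIF items, but the parser is deliberately naive (flat "_category.item value" lines only, no loop_ tables) and is not the wwPDB's actual code.

```python
# Toy mmCIF metadata harvesting: read flat key-value items from an
# uploaded file so the deposition form can be pre-filled.
def harvest_mmcif_items(cif_text: str, wanted: set[str]) -> dict:
    found = {}
    for line in cif_text.splitlines():
        parts = line.split(None, 1)  # item name, then the value
        if len(parts) == 2 and parts[0] in wanted:
            found[parts[0]] = parts[1].strip()
    return found

cif = """\
_cell.length_a              79.500
_diffrn.ambient_temp        100
_exptl_crystal_grow.pH      7.4
"""
wanted = {"_diffrn.ambient_temp", "_exptl_crystal_grow.pH"}
print(harvest_mmcif_items(cif, wanted))  # pre-fills the matching form fields
```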
This is also an illustration of
what data entry looks like
in the case of an archive that's very mature,
where some of the data items
that can be collected are fairly specific.
In this case, for X-ray crystallography,
they can be very specific,
so rather than a large sort
of open, free-text field
describing the data collection,
you have instead very specific categories,
some of which have a
controlled vocabulary,
such that there are only
a few possible answers.
Others offer only a few choices,
because those are typically the only choices
that really exist: this field
has been around for a while
and we know what's possible.
Validation occurs during the
course of the deposition.
There are, generally speaking,
two different types of validation that occur
in the wwPDB Deposition
and Annotation System.
One is what we call inline validation.
This is where the system tells you,
right at the moment you enter the data,
whether or not there's a problem.
For instance, if you tell the system
that your data set is twice as large as
the amount of data that you actually uploaded,
the system will tell you.
If you tell the system that you
collected the data during
the Protestant Reformation
or the Renaissance, the
system will tell you
that maybe that is a little early.
This particular example shown is actually
one of the more complex ones.
One of the things that the
Deposition and Annotation System asks for
is sequences: the amino acid sequences
of all the proteins that are in your sample,
and also the nucleic acid sequences
of all the DNA and RNA
that are in the sample.
The depositor provides those sequences.
The system checks those sequences
against the sequences that are present
in the coordinates that
have been uploaded.
And this is an example
of one of those validations,
where it's checked the sequences
against one another.
On the top we have the sequence
that the depositor provided;
on the bottom we have the sequence
that is present in the coordinates,
and the system is telling the depositor
that parts of the coordinate sequence
aren't in the provided sequence.
The depositor would then
have to correct that
before they could continue.
Like I said, all of this inline validation occurs
at the time that each individual
data item is being entered.
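Here is a simplified sketch of that cross-check (a real system would use sequence alignment; substring matching stands in for it here, and the sequences are made up): the sequence read off the coordinates has to be found within the sequence the depositor provided.

```python
# Simplified sequence cross-check: the one-letter sequence derived
# from the uploaded coordinates must occur within the sequence the
# depositor typed in. Coordinates often omit disordered residues at
# the ends, so substring matching is the simple stand-in used here.
def check_sequences(provided: str, from_coords: str) -> list[str]:
    problems = []
    if from_coords not in provided:
        problems.append("coordinate sequence is not found in the provided sequence")
    return problems

provided_seq = "MKTAYIAKQRQISFVKSHFSRQ"
coord_seq    = "TAYIAKQRQISFVKSHF"  # shorter: ends are often disordered
print(check_sequences(provided_seq, coord_seq) or "sequences agree")
```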
Now, there's a second kind of validation.
In the PDB's deposition system,
this validation starts running
the moment that the
coordinates are uploaded,
and this is the generation
of a validation report.
This is a large set of checks
that have been designated by
a group of experts, and it runs
on the coordinates and
the experimental data.
It's rather long.
It takes a while and it's rather involved,
and it runs in the background,
so it's separate from
the data entry process.
And it produces a coherent validation report
that the depositor can then review,
to check and make sure
that the aspects of the data
that they provided are acceptable to them.
The wwPDB validation report
has information such as this:
checks of bond angles,
and various graphical representations
of how the structure
has fared on certain checks.
Our third example is the
Cambridge Structural Database,
which is run by the
Cambridge Crystallographic Data Centre.
This contains model coordinates
and experimental data
files for small molecules.
You see an example up at the top.
So we're talking about things
that generally speaking are
smaller than, say, proteins
and the nucleic acids:
the crystal structures of these molecules,
as determined by X-ray crystallography.
When the findings are deposited,
they're expected to follow
a very strict format.
The field of X-ray crystallography,
and especially X-ray
crystallography of small molecules,
has been around for a very long time;
it's a very mature field,
and the archive has a very narrow focus
on these small molecules.
And so the data pipelines
that feed into the Cambridge
Structural Database
are very well developed.
They were specifically developed so that,
essentially, the deposition process
for one of these structures begins
pretty much at the point
the first data point is collected,
during the solution of the structure.
The database and the
techniques are so established
that they might as well begin
putting everything in the format
that's gonna be required for deposition
from the very beginning.
And so by the time
one of these small molecule
structures has been solved,
it is, in most respects,
ready to be deposited
without much alteration.
Despite the differences
in the maturity of the archives,
the Cambridge Structural Database
has a relatively simple interface,
rather like the Model Archive's,
where initially it's just
a question of registration
and uploading the file.
And that's because the uploaded file
is going to be pretty much
ready for immediate deposition;
a lot of the validation
has already been done
during the structure
determination process.
Any problems that have been found
during validation, say irregularities,
are already inserted into the CIF file.
They consist of a problem
that has been flagged by the system
and the depositor's explanation
of why that problem has the score it does,
or why it may not actually be a problem.
This file is checked, and after the check
the depositor can then
add additional information
that may not be present
in the deposited file,
either as part of the (mumbles)
or as part of the metadata
present in the deposited file;
this includes citation information
and any sort of additional information
that they may wish to include.
You can see here that, for instance,
things like the crystal space group
are already present;
that's required information.
Something like the color of the crystal,
which is not required information,
but possibly useful for the user,
can still be entered.
There's also a visualization tool
within the deposition interface
with which they can survey the structure.
So, in summary,
deposition systems are
a controlled interface
for outside users to
add data to an archive.
It's essentially the place
where the archive asks the depositor
for the information that's going
to be stored in the archive.
If the archive doesn't ask for it,
the archive doesn't get it.
If the archive really
needs the information,
the archive has to make it mandatory.
All these deposition systems
have commonalities.
We saw them: they've all got data entry,
they all require data upload, you register,
there's a mechanism for submission,
and there's usually some sort of checking
or validation involved in every case.
The maturity of the archive is reflected
in the nature of the data.
A young archive will collect a lot of data
that may not be particularly well organized:
at the beginning we don't know
exactly what's going to be important,
or what the prominent way of doing something
is going to end up being,
and so you collect as much
information as possible,
in whatever format can be had.
The more mature the archive,
the more is known about the techniques,
and the more specific the data items
that can be requested,
in a format where they can be
better organized within the archive.
Data are checked either during deposition
or after deposition.
A lot of the checking that
happens is automatic.
Some archives will have
curators who will then
check the data afterwards,
manually, by hand or by eye.
And then, in every case,
there's eventual release of
information to the full archive.
Any questions, or anything
requiring clarification,
please let me know.
- [Voiceover] So you guys are gonna
be creating deposition forms,
so this is a good chance
for you to ask questions
of any of us, Brian,
Ezra, any of the staff,
about how this all happens.
It seems very trivial in many ways
until you actually have to
do it, as you will learn.
So, are there any questions?
Okay.
Thank you, Ezra and Brian.
I thought those were really
clear presentations.
We're gonna take a 10 minute break
and then Cathy is gonna show you
how to make a deposition form
'cause you're gonna be doing that
for your final homework
in this part of the class.
After Cathy goes over the
deposition form with you
and how to actually do this,
we're gonna spend time
going over your homework
that you submitted this last time.
There's some of you who did very well
and some of you perhaps
need a little coaching.
So, we're gonna do the coaching
after you do your (mumbles)
and so let's reconvene in 10 minutes.
And then Cathy will show you
how to make a deposition form.
