- [Voiceover] We're gonna get started.
First of all, for those of
you who did the homework,
in general, they look really good.
I wanna reinforce
what the point of the homework is.
You're building up a set of molecules,
each of you, that eventually
are gonna be turned into,
by yourselves, a database.
It's important that you really understand
that that's what it's about.
Every homework is gonna build on itself.
By the end of the fourth class,
you will have created
a deposition form for those molecules
with the extra annotations,
which we're gonna
teach you how to do.
I think it's really important that you pay
very close attention
to what you actually,
what you're actually
doing with this homework
'cause you're gonna have to
live with these molecules.
If you take the full course,
you're gonna live with these molecules
for the entire semester.
Make sure you love them. (laughs)
- [Voiceover] If I could
just add something.
I checked with the molecular biosciences,
and those of you who are only registered
for the first mini
course can still register
for this one.
You just need to contact
the biosciences office
and let them know.
- [Voiceover] This is gonna be a very,
very practical course where at the end,
you'll have something that you maybe
even can use for your graduate research.
It's not just a hypothetical thing.
We're hoping that you're really loving
your molecules and are
gonna really work with them.
A lot of the responses
that we got indicate
that that's the case.
Today we're gonna start talking about
how to create archive requirements,
which may seem very simple because
most people think, well,
what's there to do?
You could just say what it is.
But as you'll see very soon,
it really requires
understanding the science
behind your archive,
as well as understanding the community
that you're gonna be dealing with.
We're gonna try to go
through how you think about this.
First we'll talk about the
experimental data pipelines
for the three kinds of experiments that
are currently represented
in the Protein Data Bank.
Some of our remote participants
are building,
will be building new
pipelines for new kinds
of biophysical experiments.
This, hopefully, will help.
The second thing we're gonna talk about
is what we mean by how we
involve the stakeholders.
I will talk to you about
what we mean by a stakeholder
'cause I know the first
time I heard that word,
it sounded very businessy
and it didn't really
sound like something that a
scientist would care about,
but in fact, it's
really, really important.
Then finally, I'm gonna talk to you about
deposition annotation
and release policies.
Again,
everything you'll see as
we go through this course,
you have to think about everything and you
have to define everything.
Part of that is an iterative process.
As you think you know what you're doing,
you then discover,
when you put the database into practice,
that there are rules that are unclear
and have to be made clear.
It's an evolutionary
process as we go through it.
We're gonna talk about these three things.
The first things we're gonna talk about
are probably the easiest
for most scientists
to understand, and the others
take a little bit of doing
because it's not the way
we normally think.
In the
PDB currently, there are X-ray structures
that are determined using
X-ray crystallography.
PDB began with those kinds of structures.
They still represent the
biggest part of the PDB.
The second method that began to be used
for structure determination in the '80s
was NMR, nuclear magnetic resonance.
There, there's growth, but not the same
as in X-ray.
The third method, for which structures
began to appear in the PDB
in the '90s, was electron microscopy.
That is a very big growth area.
There are many crystallographers
who are now switching to EM as a way
of doing structure determination.
Then finally, there's a new area, which I
mentioned to you, called
integrative hybrid methods.
This method
is
a combination of different
experimental methods
that are brought together,
and using computational
techniques, models are produced.
This is a relatively new method
of structure determination.
Right now, it is slightly represented
in the PDB.
Then at a certain point, we decided not
to have it continue in the PDB because
we realized we didn't actually know how
to validate these structures.
So there's an active effort now
to do this properly,
and hopefully very soon,
we'll have the first integrative models
that have at least been,
in some way, checked
in a separate subset of
the Protein Data Bank
as an experiment as we
figure out how to do it.
This is new
and is right on the edge
of what we know how to do.
There's certain information
common to all methods
that must be in there
when you try to archive the data.
One is the name of the
person or the people
who did the work, the attribution.
Eventually, the citation of the work.
When people first submit
a structure to the PDB,
it usually hasn't been published yet,
because now most journals require a PDB ID
before it's published,
and so the citation is actually added
after the structure goes into the PDB.
There are specific things
that have to be defined
in order to define the citation.
Then the biological source.
Is this natural, recombinant, synthetic?
What organism does the material come from?
The recombinant source
organism, if that's how
you're doing it, and then the sequence
of the polymer.
All of these things need to be defined
for all methods,
no matter what.
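To make that concrete, here's a minimal sketch in Python of what a record of that method-independent information might look like. The field names are just illustrative, not the actual PDB or mmCIF dictionary items.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class CommonMetadata:
    # Attribution: the person or people who did the work.
    authors: List[str]
    # Citation: usually added after deposition, once the paper exists.
    citation: Optional[dict] = None
    # Biological source: natural, recombinant, or synthetic, and the organisms.
    source_type: str = "recombinant"
    source_organism: Optional[str] = None
    expression_organism: Optional[str] = None
    # Sequence of each polymer in the entry.
    polymer_sequences: Dict[str, str] = field(default_factory=dict)

A real deposition uses controlled dictionary items rather than ad hoc field names, but the idea is the same: these items are required regardless of the experimental method.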
Then there is method-specific information
that has to be defined.
There are four general areas,
and I'm gonna give you some examples.
One, what is the sample,
what is the description of the sample?
Two, what's the experimental
setup and conditions?
Three, what are the experimental measurements
that are actually produced?
And four, what is the
interpretation of the data?
All of that is in there.
I think most people who
use the Protein Data Bank
just go straight to the coordinates
and aren't paying much attention to any
of the other information, but
coordinates without knowing
where they come from
and how they were determined
can be misleading.
All of that is in the PDB file.
As I said, there are three methods
that are actually in the PDB now.
I think we know what we're doing.
The first is X-ray crystallography,
where you start with a crystal.
You take data on some kind of a
data collection apparatus.
Usually these days it's a synchrotron.
You get diffraction spots, and then,
using various kinds of analyses,
you get a model.
The model is shown here.
For those of you who are not familiar with
representation, the way in which you
represent the model has been a whole area
of research for many, many years
to simplify the
description of the molecule
so that you know what you have.
This is a very popular
representation shown here.
For NMR, you have samples in the tube.
It's in solution.
You then take the data,
and this is a picture of the kind
of data you have, and then you get
what's called an ensemble of structures,
some more well defined than others.
For electron microscopy, you have
particles on a grid.
You have an electron
microscope to take the data.
You get this globby-looking
thing, which is a map.
Then from that, you derive the structure.
I'm gonna show you some examples of this.
For X-ray, as I just showed you,
the sample is crystals.
You have to define the buffer conditions
and the crystallization procedure.
I would love to tell you
that this is well defined
in the PDB file, but it's not.
Maybe someday it will be,
and that would be better.
It used to be, and over and over again
you're gonna see this, that
you got whatever you could get, and hoped
that you could get
something from the depositor
of the structure.
You didn't want to press too hard.
We're gonna talk about
that a little bit later.
As it became clear that all
this information was useful,
then all of a sudden the depositor says,
I wanna put more information in.
We have to somehow be
ready for all of that.
In the case of NMR, you have
to define the buffer and
the isotope labeling.
Then in the case of EM, the buffer,
the sample support and, in some cases,
the vitrification, the freezing
of the sample.
The experiment.
Different experiments
for X-ray, NMR and EM.
Again, you wanna talk about
the sample conditions,
the X-ray source, the detector, and then
how exactly were the data collected.
For NMR, the sample
conditions, the spectrometer,
the acquisition parameters.
Then for EM, the sample conditions,
the electron source, the detector
and the imaging parameters.
All of that has to be defined.
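As a rough sketch of how those method-specific sample and experiment requirements might be organized in code, assuming illustrative names rather than the real dictionary items:

REQUIRED_DESCRIPTION = {
    "X-RAY": {
        "sample": ["buffer", "crystallization_procedure"],
        "experiment": ["sample_conditions", "xray_source", "detector",
                       "data_collection_protocol"],
    },
    "NMR": {
        "sample": ["buffer", "isotope_labeling"],
        "experiment": ["sample_conditions", "spectrometer",
                       "acquisition_parameters"],
    },
    "EM": {
        "sample": ["buffer", "sample_support", "vitrification"],
        "experiment": ["sample_conditions", "electron_source", "detector",
                       "imaging_parameters"],
    },
}

def missing_items(method, provided):
    """List the method-specific items a deposition has not yet supplied."""
    wanted = REQUIRED_DESCRIPTION[method]
    return [item
            for group, items in wanted.items()
            for item in items
            if item not in provided.get(group, {})]

For example, missing_items("EM", {"sample": {"buffer": "..."}}) would point out that the sample support, vitrification and all the imaging details still need to be supplied.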
It may seem very simple
when you first start.
I was just at a meeting where
people were talking about this.
They had, for each method, one line
to describe everything.
It's when you get into
the details you realize
how much there is.
I'll show you that very shortly.
In the case of X-ray, the measurements
are the diffraction
images, which are not yet
a requirement, but there's certainly work
going on in different places to actually
collect those diffraction images.
In the PDB, we have what are called
the processed data, the structure factors.
You define the software
that was used to do
the processing and the
various kinds of statistics
that you get as a result
of the data processing.
In the case of NMR, we
have the resonance spectra,
the resonance assignments,
chemical shifts,
the constraints and the
processing protocol.
For EM, there are the particle images,
which are not in the PDB,
they're in another database
called EMPIAR.
We have the 3D map, which is in something
called EMDB, processing software,
the processing protocol
and the resolution.
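The measurements follow the same pattern; a small sketch of how they might be listed per method, with the EM pieces that live in other archives noted, again with invented names:

MEASUREMENT_ITEMS = {
    "X-RAY": ["structure_factors", "processing_software",
              "processing_statistics"],   # diffraction images not yet required
    "NMR":   ["resonance_assignments", "chemical_shifts", "constraints",
              "processing_protocol"],
    "EM":    ["map_3d", "processing_software", "processing_protocol",
              "resolution"],
}

# Pieces that are archived elsewhere, as described above.
EXTERNAL_ARCHIVES = {
    "particle_images": "EMPIAR",   # raw particle images
    "map_3d": "EMDB",              # the reconstructed 3D map
}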
Then finally, there's the thing
that appears to be the only
thing anybody cares about,
though not really, because eventually
you need to understand where it came from:
the interpretation
and the final model.
For X-ray, you need how exactly
the structure was determined,
what method was used,
the refinement method
that was used to refine
the structure, and then the statistics.
For NMR, the structure calculation method,
the refinement software.
For EM, the modeling method,
the model source and the fitting software
need to be defined.
What do we mean by all that?
This is an example of work that was done
by Cathy and her colleagues on
bacterial transcription
complex that consisted of
catabolite activator protein,
which you heard about last week,
RNA polymerase and promoter DNA.
This was a big finish
that needed to have the
structure determined.
The deposition goes into the EM Data Bank,
where the data go
both to something called
EMDB, for the map, as well as
to the PDB, for the model.
They show on the left
this blobby-looking thing,
which had to then be
fit to produce the model
shown on the right.
Here are the kinds of details.
Now, I'm not talking about the format
because that's a separate
thing that's gonna
be discussed, but what
actually is collected, the content.
These are the level of
detail of what is collected
in the
EM Data Bank.
You see the title of
the paper, the journal,
the volume, the first page, the last page,
the year and so on.
All of those things need to be collected
and archived.
Then finally, there's something called the
digital object identifier, which is a way
of being able to find any kind of journal
article, and now data
objects as well, when you need it.
The biological source.
This is probably scaring those of you
looking at this now
because we're talking about
every single thing that's in your sample
defined, I wouldn't say precisely,
but more precisely than
most people believe.
This is what we
have to identify, exactly
where it comes from,
whether it's genetically
manipulated, whether
it's synthetic, what it's actually called,
what its formula weight is and how many
of these things there are.
This is what was in that
sample, the biological source.
Then we talk about the sample preparation
for this particular sample.
Again,
here we're talking about a descriptor,
so this isn't captured in great detail yet,
but it might be someday: the concentration,
the details of the support of the
sample, the state, the
buffer details and the pH.
All that is recorded.
Then this is about the imaging experiment.
Everything I'm talking about now is EM,
and it's Cathy's work.
She can fill in more for you
if you have questions, but again,
this is about an imaging experiment
and what it is that needs to be collected.
A little bit later, we'll talk about
how we decide how much
of this is collected
and how much is important.
There has to be some
balance between the level
of detail that's needed versus what people
are willing to give you.
This is an example.
Then again, the measurements of the data
and, again, all of the information
that is required.
You do not have to
remember all these numbers,
but I just wanted to give you a sense
of what we mean when we talk about
what's called the metadata
that are collected.
This is everything behind the model
so that you can actually, perhaps,
verify what you have.
This is actually related
to current discussion about
reproducibility in science.
I think those of us that
do structural biology
are pretty proud of the fact that we spent
a lot of time over the many
years trying to describe
what we do so that there's a chance that
the experiment can be
reproduced by somebody else.
There are other areas
of science where that
is not as well done.
I think structural biologists
are pretty proud of the fact that they
have spent so much time on this.
There are still errors that will appear
and there are still things that will not
be reproduced, but this is
many steps ahead of other fields.
Then finally, the interpretation,
the final model.
In EM, we're not yet able
to do what I call an ab initio fitting
of the map, although some people
are making steps in that direction.
The most common way of
doing this kind of work
is to take
other structures that are in the PDB
and use those structures as a parts list
to fit the map.
This shows the parts that were in the PDB
that were then used to fit the map.
That's a very quick run-through of
all that's involved
in the kinds of data that you collect.
We picked EM because it's
a relatively new method,
compared to X-ray.
They do the same sort
of thing with different
parameters that are being collected
and different kinds of data items before
you actually get the model.
Stakeholder involvement.
I'm gonna ask you a question.
Does anyone know what we
mean by a stakeholder?
It's okay.
You don't even have to go on
mike if you don't want to.
Anyone wanna try to answer that question?
No one?
Okay.
Linda said anyone who can benefit
from a deposit.
That is one kind of stakeholder, yes, yes.
Any other kind of stakeholder?
What do we mean in
general by a stakeholder?
Yes.
Someone who has an interest, yeah.
Yeah.
Any others?
Okay.
Stakeholders, I think
that's a good answer.
It's anyone who has an interest.
Who might that be?
The depositor of the data
is definitely a stakeholder.
Actually, I heard from someone something
that I never thought of before.
This person was saying the reason that the
PDB became so successful is that the
depositors as stakeholders
actually got something back.
I never thought of that before.
It's not just that the depositor
puts data in for other people
to use; that was part of the initial
objection to having a PDB,
people saying, why should we put our data in
for other people to benefit?
But what happened fairly
early on is that the
depositors themselves benefited:
just minimally, they
wouldn't lose their data,
which, as I told you
last time, was a problem.
People do tend to lose
their data, including me,
where you put your data
on some computer somewhere
and the computer dies.
Then where is it?
Having a PDB just for
that alone is useful,
but then nowadays, having the PDB
validate your data and provide software
for validating the data
is also enormously helpful
for
minimizing embarrassing errors.
Then there's the
scientific user of the data
in a variety of fields.
For example, a biologist who needs
to know more about a
structure, a computer scientist
who wants to have a corpus of data
to do some kind of analysis.
It's surprising, we found out, how many
computer scientists and statisticians use
the PDB because they have
data in a standard format
that they can do various kinds of analyses
with statistics, and then
test their methodologies,
just as a purely theoretical exercise.
That, as opposed to a
computational biologist
or computational chemist
who will use the corpus
of data in the PDB to derive new chemical
and biological insights, and,
as I said, statisticians.
Then, another
stakeholder is the database provider.
There are many databases
that are built on top
of the primary data in the PDB
and make various kinds
of value-added databases.
It's important to
have the PDB as a sort of substrate
for that kind of work.
Then software developers.
There are people who
are developing software
that's directly relevant
to what's in the PDB.
For example, the structure determination
software developers
have a strong interest,
an increasingly strong interest,
in what the PDB is doing.
Then software developers
who are developing
various kinds of analytical methods
will also use the PDB.
Educators
who use the PDB.
Students who learn from the PDB, and then
the funding agencies.
Anyone who's building a database
has to think very hard ahead of time
about who the stakeholders
are likely to be.
You have to decide at some point when
you're building a database,
what's this database for.
Is it for the whole world
or is it for a small
scientific community that
has a very special interest?
All of those things are important.
You need to think about
it when you're doing this.
A long time ago, when I
was involved in trying
to set up the nucleic acid database,
I had never done this before.
I didn't really understand
how to go about this.
I kept worrying about things like, well,
how do we make sure that
it's usable by everybody
on every machine for everything.
John Westbrook, who's sitting in the back
of the room, said to me.
At that time, I was a professor and he
was a graduate student, so
that was a long time ago.
John said, "Just make
it useful for yourself.
"Just get it so it works
for what you wanna do,
"and then it'll go from there."
That's exactly what happened.
When I stopped worrying about
all the different machines
and all the different
things that you might use it for
and just said, well, what do I need it for,
then we started to be able to build it.
Then it grew organically.
That's how the PDB grew, organically.
I think that's really, really important.
As opposed to an approach where you try
to do everything at once, and
then maybe nothing happens.
This is a piece of (mumbles).
The depositor, what are the things we
have to think about?
First of all, what is
the experimental method
that you're trying to describe?
How many data items
are normally collected?
You need to think about that.
For any experiment, there'll
be enormous different kinds
of data that are collected.
How well defined are these data items?
What are the incentives and benefits
for the depositor?
Why would anybody wanna
give you this information?
What's the incentive?
How many of these data
items are quantifiable,
where you could just write
a number, like temperature?
What is the burden of deposition?
All of those things have to be considered.
Again, in an idealistic world,
you'll try to collect everything,
which could be hundreds
or thousands (laughs)
of data items, and then
you might get nothing.
The other thing you need to think about is
this business about quantifiable data.
Most data in science is quantifiable,
but some data are very hard to quantify.
The question is, is it worth the risk
of having text fields that you put in?
One of the prime examples
was crystallization,
where just to collect anything at all,
the PDB has collected these text fields
so people just can write
anything they want.
It's okay, you have this dream:
I can use text mining and all this stuff
to get the information out, but it doesn't
actually always work.
If you can quantify, it's much better.
On the other hand, if the only way
you can figure out how
to get the information
is to have text fields,
then it might be worth doing.
You have to make that balance.
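A sketch of that trade-off: the same crystallization information captured as a single free-text field versus quantified items. The field names and values here are made up for illustration.

# Free text: easy to collect, hard to query reliably.
crystallization_text = "hanging drop, 0.1 M HEPES pH 7.5, 25% PEG 3350, 18 C"

# Quantified: more burden on the depositor, but directly searchable.
crystallization_structured = {
    "method": "hanging drop vapor diffusion",
    "buffer": "HEPES",
    "buffer_concentration_M": 0.1,
    "pH": 7.5,
    "precipitant": "PEG 3350",
    "precipitant_percent": 25.0,
    "temperature_C": 18.0,
}

# A numeric query is trivial on the structured form...
if crystallization_structured["pH"] < 8.0:
    print("acidic-to-neutral crystallization condition")
# ...but against the free text you're back to text mining, which,
# as noted above, doesn't always work.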
In considering the user,
what are the likely queries?
Each type of community
needs to be considered.
When you're building your database,
if it's for yourself,
which is what you guys
are gonna be doing,
you think of the kinds of questions
you want to answer.
Those are the kinds of information that
we then want to make
sure is in the database.
You need to come up with use cases
or what it is that you
want to actually answer
by putting this data in the database.
How will the data be used?
You need to think about that.
What level of precision is required
for the analysis?
How perfect does it have to be?
You have to think about that.
Are you happy with just
getting approximate answers
or do you need, for whatever reason,
very precise answers?
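For the kind of database you're building, a use case can be as simple as a question you want to ask of your own records; here's a tiny sketch in Python with made-up molecules and fields.

# A toy set of entries, standing in for your own molecule database.
entries = [
    {"name": "mol_a", "method": "X-RAY", "resolution": 1.8},
    {"name": "mol_b", "method": "NMR",   "resolution": None},
    {"name": "mol_c", "method": "EM",    "resolution": 3.4},
]

# Use case: which of my structures were determined to better than 2.0 angstroms?
high_res = [e["name"] for e in entries
            if e["resolution"] is not None and e["resolution"] < 2.0]
print(high_res)  # ['mol_a']

If you know you'll ask that question, then resolution had better be a mandatory, numeric field in your database, not something buried in a comment.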
Everybody says, including
some of my good friends,
that the PDB should be like Google,
or that any database should be like Google.
For those of you who think about
what Google is, Google has behind it
incredible amounts of engineering
with lots and lots of people who
are skilled in artificial intelligence.
There's lots and lots of
heuristics behind Google.
It's a very, very sophisticated system.
Even with all that sophistication, I think
that all of us know that
when you do a search
on Google, you'll get a
hint of what is out there.
You won't get every
single thing giving you
the right answer.
Building search engines is a very
difficult and sophisticated process.
Most scientific databases
cannot and do not have the
level of sophistication
that Google has; in fact, none of them do.
But still, by doing all the
defining of your data items, you have a
pretty good chance of
getting what you want.
I just want you to think about that.
The database provider,
the person who's building the database,
and that's gonna be you guys:
what is the scope of the data
resource?
Who is the intended audience?
In your case, for your
database, the intended audience
is you, each of you.
Don't worry about anybody else.
You get that right,
then we can think about
what happens next.
Then what resources are available
for creating the data resource?
How much money do you have?
How much computing power
do you have and so on?
You need to think about those things.
In your case, this is why we,
ahead of time, define
somewhat of the scope
of what you're doing, so that
you didn't get yourself into trouble.
The other thing that
you need to think about is
the external software and
instrument developers.
For, say, the PDB: how much data
are you gonna take in here?
How much data are produced by the various
external resources?
Data is a plural.
Even after all these years, I
make that mistake, too.
How many data are produced by the
various external resources, and
can those data be harvested?
A lot of experiments,
the data are produced by machines.
The machines produce
computer-readable information.
Can that computer-readable information
be harvested in the database?
A significant amount of work
has gone into that area.
In order for that to happen, you also
have to consider the fact that every instrument
may have a completely different format,
and how you harvest from
those different formats
is quite complex.
You'll hear a little bit
about that when John Westbrook talks about
defining the data model and the syntax
and semantics for the data.
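A sketch of what harvesting might look like: per-format readers that map each instrument's own output onto the archive's standard items. The formats and field names here are invented for illustration.

def read_format_a(raw):
    # Hypothetical instrument A calls the wavelength "lambda_A".
    return {"wavelength": raw["lambda_A"], "detector": raw["det"]}

def read_format_b(raw):
    # Hypothetical instrument B nests things differently.
    return {"wavelength": raw["beam"]["wavelength"],
            "detector": raw["hardware"]["detector_model"]}

READERS = {"format_a": read_format_a, "format_b": read_format_b}

def harvest(fmt, raw):
    """Translate one instrument's output into the archive's standard items."""
    return READERS[fmt](raw)

The hard part in practice is not the translation itself but agreeing on the standard items on the right-hand side, which is exactly the data-model work mentioned above.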
When you think about stakeholders,
think about the whole PDB,
what the PDB is considering,
and what its target audience is.
If you think about the history
of the Protein Data Bank,
at the beginning, the
stakeholders were really
simply the depositors.
It took a long time before
that changed.
Now only 25%
of the users of the PDB are depositors.
Somewhere around 75 to 80%,
depending on who you talk to,
are non-depositors of the data.
That is an evolution that
took many, many years.
I'm gonna stop for a minute
if anyone has any questions 'cause I think
there's a lotta material
(speaks too low to hear).
Questions?
Okay.
The next thing that I want to go over
are the deposition annotation
and release policies.
The person who organizes all that
is sitting in the back of the room,
Jasmine Young.
Raise your hand.
Jasmine is the lead
curator for the RCSB PDB
and
has to refer to these policies regularly.
The policies are not made
all at once.
They evolve as situations evolve.
You have to figure out what the policies
need to be to make things
as clear as possible.
How do you go about
developing the requirements
for what goes in a data archive?
What data should be
mandatory for every entry?
What data can be optional
and who should decide?
Those are things that
need to be thought about.
These things will change
as the design changes.
What data should be included
in the data archive?
In order to figure that out, it can't
be decided by the archive.
It has to be decided by the domain experts
with really deep understanding
of the experiments
and the relative importance
of the different data.
If you try to do that
without that understanding,
it won't work.
In terms of an experimental archive,
it's really important
that there are people
on the staff who understand
the science really well,
as well as working very closely with the
external community who
are doing the experiments
to find out what's really
important and what's not.
Any other way of doing it,
I think we run into trouble.
Then the people who are
running the archive,
in this case, we're talking about the PDB,
have to evaluate the
practicality of collecting
the data and how to best organize that.
People may ask for tons of data
that they would like collected.
They kinda have to
present the case of why.
They have to think about why.
Then the archivist has to figure out,
does this make sense or not?
There has to be some kind of a partnership
between the archivist and the
experimentalist to make sure
that this is done properly.
In the case of the PDB, we set up workshops
on a regular basis or some kind
of task force to achieve a consensus.
Then there are followup meetings
and there are email exchanges.
This goes on for years
and years and years.
It's not a one-shot deal where you say,
okay, we're gonna collect
this, and that's it.
Case closed.
We need to keep thinking about it
as the scientists and the science change.
If the data archive is going to be useful,
you have to think about what's going on
in the science.
Then standard annotation
procedures and policies.
The data come in.
You'll learn later about what's involved
in the data coming in.
The data have to be
standardized in some way.
They have to be made so
that from entry to entry,
the definitions of the terms
are the same, and the processes
that are used to process
the data are the same.
We want to produce uniform quality.
We wanna make sure the
archive is consistent.
It would be very, very
nice if you could say
that from day one, the
archive is consistent,
but you'll hear about this later.
Science changes and the
way in which you think
about certain data will change, so there
has to be a periodic review of the data
to make sure that there's consistency.
Then we have to set some kind
of a curation expectation.
One of the things that's
really interesting is
when people start databases
and they talk about curation.
I remember in
the very, very early days,
it could take a month to
curate a single data set,
because if you're a perfectionist
and really understand
the data really well,
you'll be fussing and fussing and fussing.
While the fussing is good and
it might get better results,
you'll also create a backlog that will
be unmanageable.
That's something that needs
to be considered, too.
Setting the policies.
To set the boundary,
what is the scope of data
to be archived?
To govern data policy,
we're gonna talk about that
because when data come in,
until they are released,
they have to be kept secure so that
people don't access the
data before it's ready.
Then what's the minimum data for validation?
Again, that's an ongoing discussion.
Here's some practical things.
The procedure for setting
the standard annotation
procedures and policies:
The wwPDB, as I mentioned
in the first lecture,
is
involved in setting all of these policies.
There are regular meetings
with the wwPDB members
to discuss the details.
Because this is an
international organization,
this is all done by various kinds
of video conferencing.
Documentation.
There are
drafts of the various policies.
They're reviewed by the wwPDB directors.
They're revised and approved.
They're then posted to the public website
and there's a public announcement.
Then all the documents
have to be made public
and made available so everyone
can see what they are.
In the case of the NSF rules,
the National Science Foundation rules,
any major change has to
be posted 60 days ahead
so people have a chance
to know what's going on.
For the wwPDB,
the policies evolve
as the science evolves.
I mentioned that many times.
Take the deposition of the
experimental data, for example.
By that I mean the
structure factors for X-ray,
which became mandatory as
data quality assessment
became more and more important.
It used to be that you could deposit
the structure factors,
but you weren't required to.
It wasn't until 2008
that structure factors
became required.
The reason was that
there were problems about validation
and questions as to whether a structure
was right or wrong.
Unless you validate against
the experimental data,
you don't know whether the
structure's right or wrong.
There also used to be some funny rule
about the size of the peptides
that were in the PDB.
You have the polymers, but
then you have small peptides.
How do you set the rules for the size
of the peptides?
What's a small molecule and big molecule?
It may sound very simple,
but it turned out the discussion
went on for years about what the size
of the peptides should be.
All of the policies are
shown at this website
that I wrote here.
It has the entry
requirements, the authorship,
the rules about release, the assignment
of PDB IDs and ligand codes,
and changes to entries.
What are the requirements for acceptance
of an entry to the PDB?
You must have
three-dimensional coordinates.
There must be information
about the composition
of the structure, the
sequence, the chemistry
and so on.
What experiment was performed,
the details of the structure
determination steps,
and the author contact information.
Also, experimental data for X-ray
and NMR are required.
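As a sketch, a check like this is what those requirements for acceptance boil down to in software; the field names are illustrative, not the real deposition-system ones.

MANDATORY = ["coordinates", "composition", "sequence", "experiment",
             "structure_determination", "author_contact"]
# Experimental data are additionally required for X-ray and NMR.
NEEDS_EXPERIMENTAL_DATA = {"X-RAY", "NMR"}

def can_accept(entry):
    """Return True if a deposition meets the minimum acceptance requirements."""
    if any(entry.get(item) is None for item in MANDATORY):
        return False
    if entry["experiment"] in NEEDS_EXPERIMENTAL_DATA \
            and entry.get("experimental_data") is None:
        return False
    return True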
Again, for NMR, that took years
to come up with the rules
for what should accompany the coordinates
of an NMR-determined structure.
Lots of discussion with the NMR community
and a lotta back and forth.
We made the rules, then we
unmade the rules and so forth.
In the case of the PDB, it
used to be in the early days
that
structures that were determined strictly
in silico were allowed in the PDB.
It then became clear
that that was probably
not a good decision since people
used the PDB
for creating modeling methods.
So if you're putting theoretical models
in with experimentally determined models,
you could have a problem.
So there was a meeting here
11 years ago,
it's hard to (mumbles),
a little more than 10 years ago,
to discuss what should happen.
It was decided that the structure
must be experimentally determined.
It had to be a physical
sample of some sort
in order for the coordinates
to be in the PDB.
We thought, ah, we've
figured this all out,
but it turns out,
as time has gone on, that
we've had to think again
about how this all works.
That's why in the area of what's called
integrative modeling, we're now forming
a different way of thinking about this,
which I won't talk
about now, but I'm happy
to talk about, anyone who'd like to hear
what we finally are coming to.
Right now, we accept coordinate sets
produced by X-ray, NMR,
EM, neutron diffraction,
powder diffraction and fiber diffraction.
Purely in silico models are not accepted.
They are in something called the
Protein Modeling Portal.
At that meeting in 2005,
it was suggested that there should be a
Protein Modeling Portal, and now there is.
People who are responsible for that
are in Switzerland.
I believe they're online right now
if anyone wants to ask them how models
are handled, purely in silico models.
What kind of structures can
be deposited in the PDB?
Polypeptides of various sorts,
polysaccharides and polynucleotides.
As I said, there was an excessive
and extensive discussion about peptides.
I don't think that's ever
gonna fully go away. (laughs)
That's about what is accepted in the PDB.
Now, the release of an entry.
These rules have changed over time.
Once something is published in a journal,
the structure must be
released to the public,
no matter what.
I think there are virtually no journals
that don't have that as a rule.
There are various status
codes for PDB entries:
released, hold, hold for publication,
withdrawn and obsolete.
We'll talk a little bit
about what that means.
There are very detailed deadlines
for requesting the release of entries
because everything gets
released at exactly
the same time around the world.
That's important to many
people and it's important
to the journals, the
exact date of release,
because in some cases, this has (mumbles).
There are all sorts of competitive structures
in the PDB,
and people need to know
for sure when this is
going to be released.
Experimental data and coordinate files
must be released at the same time.
It didn't used to be like that, but again,
as time has gone on, it became clear
we had to do that.
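A sketch of that status life cycle, with an enum for the codes; the transition logic is simplified and the names only approximate the real status codes.

from enum import Enum

class Status(Enum):
    HOLD = "hold"
    HOLD_FOR_PUBLICATION = "hold for publication"
    RELEASED = "released"
    WITHDRAWN = "withdrawn"
    OBSOLETE = "obsolete"

def on_publication(entry):
    """Once the paper is published, the entry must be released,
    and coordinates and experimental data go out together."""
    if entry["status"] in (Status.HOLD, Status.HOLD_FOR_PUBLICATION):
        entry["status"] = Status.RELEASED
        entry["release_coordinates"] = True
        entry["release_experimental_data"] = True
    return entry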
We don't send email addresses around.
In that sense, the privacy is protected.
Who has access to the unreleased data?
This is data that's been submitted,
but hasn't yet been released
and hasn't been published.
Only the authors of the particular entry.
The reviewers of a paper may not
obtain unreleased coordinate
sets from the PDB,
although that has been asked for.
Our view and the view of the PDB
has been that we have a sort of
contract with the author
and we protect their privacy
and we protect their data.
If somebody requests a coordinate set
for review in a journal,
they need to get that from the author.
We don't want to be the
conduit of that information
before publication.
We also strongly advocate that
the reviewers have the validation reports,
which you'll hear about.
They should have those while
they're reviewing the paper,
but we will not send them to the reviewer
'cause we don't have the security systems
that would allow us to do that,
but the author should,
is encouraged, to send this
with their paper to the journal.
That's a relatively new
practice, which is beginning
to be very popular.
What information is available
for unreleased entries?
The information about the
title, the authorship,
the status, experimental data status
and sequence are available.
Now, that's very controversial.
There's a status thing,
and people who are doing
extremely competitive work
do not want any information released
ahead of publication.
They don't wanna be scooped.
So right now, there is a discussion,
I don't know when it's
going to be resolved,
that all of that information
should be suppressed
prior to publication.
The PDB has been way ahead of its time
in making everything
available and all that,
but there are certain limits that
authors have, and certainly
prior to publication,
I can understand why somebody might
want something suppressed.
That's under discussion now.
I don't know the status
of those discussions.
That may change.
Two months?
Okay, the boss says two months,
so that's what'll happen.
A lotta people have asked.
Changes.
This is also changing, but
this is what it is now.
Right now,
before release, anything can be changed.
In other words, while it's
still being processed,
anything can change.
But for changes after release,
the current policy
is that any of the
metadata can be changed.
Citations often change because
before release, there isn't a paper,
and after release, there is.
Certain things
in the experimental data
can be changed, but coordinates
cannot be changed right now
while still maintaining the same ID.
These rules are likely
to change now because
the whole structure of the
PDB is likely to change
in that there'll be versioning.
Right now,
if there's a major change,
such as coordinates or the chemistry,
you have to obsolete your structure,
and then you get a new ID.
A lotta people hate that
because they actually
know their structures by the ID
and they have some affection for the ID,
although we can't understand how
you can have an affection for a
randomly generated number, but they do.
(laughs) There is probably
gonna be versioning.
That is a lot of work;
a lot of software will
have to be developed
to do versioning.
Then these rules are likely to change.
Right now, though, if you
change your coordinates,
you have to obsolete and then supersede
with a new deposition.
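The current change policy, sketched as code; again, a simplification with invented names, not how the deposition system actually implements it.

def apply_change(entry, field_name, new_value, new_id=None):
    """Metadata (e.g. the citation) can change in place after release;
    a coordinate or chemistry change means obsoleting the entry and
    superseding it with a new deposition under a new ID."""
    if field_name in ("coordinates", "chemistry"):
        old_id = entry["pdb_id"]
        entry["status"] = "obsolete"
        return {"pdb_id": new_id, "supersedes": old_id,
                field_name: new_value, "status": "released"}
    entry[field_name] = new_value
    return entry

If versioning comes in, the obsolete-and-supersede branch would instead bump a version number on the same ID, which is why it's a lot of software work.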
Remediation.
See, now this is where
it gets contradictory.
The wwPDB reviews the entire archive
on a regular basis and
remediates the data.
The nature of the changes is described
in a public document:
exactly what it is
that's being changed.
In the case of the wholesale remediation,
the biggest remediation we ever did
was the atomic naming and numbering.
That got changed in order to comply
with IUPAC rules.
In those cases, the individual
authors are not contacted.
We're talking about thousands of entries.
A version number is assigned and a REMARK
with this version number is put on file.
The older version is maintained
so that you could go back to it.
That's really
the summary of some of the policies.
As you can see, a lot of thought goes into
how do we do this, what effect
is this gonna have on people.
When you remediate, on one hand,
people say, oh,
you should fix all the
errors and do it right away
and all that, but if you fix
the entire archive and change all the
(mumbles) and naming,
that means that people
who are using the archive
for their research,
and there are people who
are using the entire archive
for their research, have
to re-do everything.
They have to re-download it.
It is a big perturbation on the system,
so we do not do wholesale remediation
without a lot of thought.
We do change individual errors.
Errors are fixed as soon
as we find out about them,
but this wholesale remediation
is done very carefully
and
infrequently because we
know what the impact is.
The other thing about remediation
is it's a big job for the bio curators
who have to go through every single file
'cause even though you think
you're changing one thing,
who knows?
Something else could change, so it means
re-validation and all that.
So it's not something that
is to be taken lightly.
I think I'm finished with
this part of the lecture.
I welcome any questions now.
Linda.
Yeah, put on the microphone now.
- [Voiceover] My question is,
when policies change about what is,
for example, mandatory or not mandatory,
the structure factor policy changed in 2008,
but what happens to the entries
that came in earlier, for which
it wasn't mandatory, where
the people (speaks too low to hear)
- [Voiceover] Some people
have tried to submit those.
Of course, there are some problems.
There's all sorts of issues,
but it's not required
because we couldn't possibly manage that.
- [Voiceover] (speaks too low to hear)
- [Voiceover] Depositors
do want to send us
structure factors afterwards,
so we occasionally
will receive structure
factors from depositors
who did not deposit
structure factors early on.
- [Voiceover] (speaks too low to hear)
- [Voiceover] Yes, we vary dates.
- [Voiceover] Again,
this is a changing thing.
What happens is, it used to be that
a change was made and
the whole file was not re-validated.
Big mistake.
Now I think we know you
always have to re-validate
because you think you're
making an innocent change
in one part of the file, (clears throat)
something else could happen.
- [Voiceover] Any questions
from our remote participants?
I think we're clear.
- [Voiceover] All right.
So what we're gonna do
now is take a break until
10:15.
Then Cathy is going to
do an exercise with you.
Then after that,
we're gonna start going through your,
well, we'll give you your
new homework assignment
and then go through your
old homework assignment
with you individually to discuss
what you're doing.
I wanna re-emphasize that the data
that you're picking up now,
you're gonna have to put in, and we'll show you
how to do it, so don't be scared,
but you're really gonna
be working with this data
and you're really gonna
be making a database,
so take it seriously.
These homeworks are not one-offs.
They're gonna build into a real project.
Okay?
10 minutes, okay.
