- [Voiceover] Okay, so the
first half of the class,
we talked about everything getting up
to deposition of the
data into the archive,
and you all made deposition
forms for your own data.
And Cathy and I will get
back to you this week
with how you did, in
terms of that assignment.
So now we're gonna go to the next steps,
which is the curation of the data,
and then the distribution of the data.
So today we're going
to talk about curation,
and we have a sizable number
of PDB people here (chuckles)
who know a huge amount about curation.
Jasmine Young is the lead curator
for the Protein Data Bank,
and she will help answer
questions if you have any.
So, PDB, the way it's run is that the
PDB curators and the depositors
of the data work together to figure out
the best representation of the data,
and we do that via various
kinds of task forces,
and the guidelines for curation, again,
are created in collaboration
with a broad community
of experts, and
we are not,
(beeping)
one of the big things that
we feel very strongly about
(audio cuts out)
So what we mean by that is that
some people think, okay,
the data comes into the PDB,
and we should
refuse data
if it's not up
to a certain quality.
We should not let people put data in,
and that sort of thing, and
that is not our philosophy.
Our philosophy is to work
together with the depositors
and make sure that the data
are very well represented
and represent the experiment
as closely as possible,
and if there are errors
we communicate those
errors with the depositors
and together, try to come up
with a very good data file.
So the collaboration is at every level,
from deciding how we
should be doing this
down to doing the detailed curation
of any single entry;
it's meant to be a collaboration.
And some people don't agree with that.
Some people feel that the
PDB should refuse data,
but we don't agree with that.
We're a community database,
we will evolve, and in
time, the data has gotten
better and better through
this kind of collaboration.
So what do we mean by curation?
It's the review of the deposited data
at various different levels.
And so there's different
levels of curation
in general for all databases that exist.
One philosophy is to leave the data as-is,
and it's solely the
author's view of the data.
There could be unintentional errors,
and the problem with
leaving the data as-is
and making no effort to curate is that
the usage by people
other than the depositors
becomes very difficult,
but there is a point of
view out in the community
in general that an archive should not
make any attempt to curate at all.
We do not, PDB does not
subscribe to that point of view.
Then there's what we call medium curation,
or what we've decided
to call medium curation,
which is just to remove formatting
and nomenclature errors,
just make sure that
everything is spelled right,
and make sure that the
atom labels are right
and things like that.
That improves the usage and
allows comparisons of the data,
but again, it's not as good
as doing real curation, so again,
in the early days of the PDB, I would say
things were done at the
medium curation level,
basically formatting,
nomenclature and so on.
And then you have extensive curation,
and this vastly improves
usability of the data,
it prevents errors from
propagating in future studies,
and in order to do extensive curation,
you have to have deep
knowledge of the science
and you have to work
closely with the community.
In the 18 years that the RCSB-PDB
has been here at Rutgers,
we went from having curators who were
mostly people with Bachelor's degrees
and not much training,
which is why they could
only do medium curation,
to a situation where we have
PhD-level curators, many
with post-doc experience
such as Brian back there,
who has experience in X-ray
crystallography, NMR, and cryo-EM,
so those curators are able
to really look at the data
very carefully and understand it,
and work with the depositors to make sure
that the data is
being represented properly.
So at this stage, we have
a very highly skilled
staff of curators at the RCSB,
as well as the other
sites in Europe and Japan.
So things have changed over time,
and part of that has to do
with the community where,
in the beginning the
community didn't really
use the data, didn't really
want to have too much in there,
didn't really care that much.
It was more like a place to
keep the data from getting lost,
but it wasn't a place
where you could really
do extensive analysis, whereas now
people want to do extensive analysis,
they want to have high-quality data.
To have high-quality
data, you have to have
high-quality scientists at both ends:
the people who are doing the
structure determination
as well as the curators at the PDB.
So these are the different levels.
So in terms of the curation pipeline,
we first have the
annotation where we check
incoming data for consistency,
for example, making sure that the sequence
of the thing that is being studied
and the coordinates that
are being shown match,
making sure that there
are no inconsistencies,
for example if somebody
gives a temperature
in the wrong units,
or comes up with impossible results;
all that has to be fixed.
And then the other thing
in terms of annotation
is to make sure that the
data matches information
in other data resources,
so PDB does not stand alone,
we have other data resources
such as UniProt,
GenBank, and so on, and
we want to make sure
that the data match what
is known as best we can,
or, if they don't match, understand why not.
I'll show you that in a minute.
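The sequence-versus-coordinates consistency check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the PDB's actual software: every residue present in the atom records must agree with the deposited sequence, while gaps from unmodeled residues are allowed. The function and data names are invented.

```python
def coordinates_match_sequence(seqres, modeled):
    """Compare modeled residues against the deposited sequence.

    seqres: deposited one-letter sequence (positions are 1-based)
    modeled: list of (position, residue) pairs taken from atom records
    Returns a list of problems; empty means the records are consistent.
    """
    problems = []
    for pos, res in modeled:
        if pos < 1 or pos > len(seqres):
            problems.append((pos, res, None))            # outside the sequence
        elif seqres[pos - 1] != res:
            problems.append((pos, res, seqres[pos - 1])) # residue mismatch
    return problems

seqres = "MKTAYIAKQR"
modeled = [(1, "M"), (2, "K"), (5, "Y"), (6, "I")]  # gap at 3-4 is fine
assert coordinates_match_sequence(seqres, modeled) == []
assert coordinates_match_sequence(seqres, [(3, "G")]) == [(3, "G", "T")]
```

Real curation also has to handle chain mapping, insertion codes, and chemically modified residues, which this sketch leaves out.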
And then there's what we call validation,
and there we're actually checking the data
with respect to what is known about other,
about these kinds of structures,
so for example, in geometric validation,
you want to make sure that the geometry
that is in the structure
matches what we know to be
correct for distances, angles,
conformation and so on.
And so you want to compare the geometry
of the molecule that's coming in
with the geometry of other
molecules of similar type,
and that used to be all that was done.
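A geometric check of that kind can be sketched as a comparison against a library of ideal values. The ideal bond lengths and standard deviations below are standard peptide-backbone numbers; the tiny library and the 4-sigma cutoff are illustrative choices for this sketch, not PDB policy.

```python
IDEAL_BONDS = {              # atom pair -> (ideal length in angstroms, sigma)
    ("C", "N"):  (1.329, 0.014),   # peptide bond
    ("N", "CA"): (1.458, 0.019),
    ("CA", "C"): (1.525, 0.021),
}

def bond_outlier(atom_pair, observed, n_sigma=4.0):
    """Return (is_outlier, z_score) for one observed bond length."""
    ideal, sigma = IDEAL_BONDS[atom_pair]
    z = (observed - ideal) / sigma
    return abs(z) > n_sigma, round(z, 2)

flag, z = bond_outlier(("C", "N"), 1.33)
assert not flag                    # essentially ideal geometry
flag, z = bond_outlier(("C", "N"), 1.45)
assert flag                        # far from ideal: flag for the annotator
```

The same pattern applies to bond angles and torsion-angle (conformation) checks, just with different reference libraries.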
Then about eight years ago or so,
we started checking the model
against the experimental data,
so in the case of X-ray
crystallography, we have
the structure factors,
and we actually check
whether the structure
matches the experimental data
that accompanies it,
to see if there are any errors of that sort.
So that's pretty extensive validation,
and it's pretty unusual for most archives
to do that level of validation,
which we think
is very, very important.
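For the X-ray case, the basic model-versus-data agreement statistic is the crystallographic R-factor, which compares observed structure-factor amplitudes with those calculated from the model. The formula below is the textbook definition; the sample amplitudes are invented for illustration.

```python
def r_factor(f_obs, f_calc):
    """Crystallographic R-factor: sum|Fobs - Fcalc| / sum(Fobs)."""
    numerator = sum(abs(o - c) for o, c in zip(f_obs, f_calc))
    return numerator / sum(f_obs)

f_obs  = [100.0, 80.0, 60.0, 40.0]   # observed amplitudes (made up)
f_calc = [ 95.0, 84.0, 57.0, 42.0]   # amplitudes calculated from the model
r = r_factor(f_obs, f_calc)
assert abs(r - 0.05) < 1e-12         # R = 14/280 = 0.05: good agreement
```

As the lecture notes later, no single number like this can judge a structure on its own; it is one indicator among several.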
So here's the current deposition
and annotation system.
There have been
several systems developed over the years
by the different data centers that handle
structural data, and about
eight years ago or so,
we decided to come up with a single system
that would allow all the
data centers around the world
to be able to check the data using exactly
the same software,
exactly the same systems,
so we could be sure
that the data were being
very carefully checked in the same way
and it didn't matter
who processed the data.
There was a time when you could tell
the data set was processed in Europe
versus if the data set
was processed in America
because there were slightly
different results,
slightly different
characteristics of the files,
and now with the new system in place,
that becomes less and less likely.
So basically we have the deposition,
where the data comes in to a
single portal, and then behind the scenes
the data go to the different data centers,
right now according to the geography
of where the data comes from.
It then goes through
this annotation pipeline
which is made up of a series of modules
that process the ligand
data, the sequence data,
there's various kinds of other
annotations that are done,
then the validation.
I'll talk about each of these
things in a few minutes.
And then when this is all ready to go,
and finished, it's basically
kept in a holding tank
until it's supposed to be released,
and then at some point,
when the paper's published,
or if the author says just to release it
without a publication,
the structure
is released to the public
at the correct time.
So all of this requires
a system that keeps track
of what's supposed to go
out to the public and when,
and occasionally we
have to deal with a big crisis,
where a structure is
released at the wrong time,
and we have to figure
out how to retrieve it,
because once it gets
out to the public, it's there,
so you have to be very careful,
and it's a big responsibility
to make sure that the release
is done at the correct time.
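The release bookkeeping can be thought of as a simple per-entry condition: an entry leaves the holding tank when its paper is published, when the author asks for release without a publication, or when a hold date expires. A toy sketch, with invented field names:

```python
import datetime

def ready_for_release(entry, today):
    """Decide whether a finished, annotated entry may go public today."""
    if entry.get("author_release_now"):      # author asked for release
        return True
    if entry.get("paper_published"):         # the associated paper is out
        return True
    hold_until = entry.get("hold_until")     # embargo expiry date, if any
    return hold_until is not None and today >= hold_until

entry = {"paper_published": False, "hold_until": datetime.date(2020, 1, 1)}
assert not ready_for_release(entry, datetime.date(2019, 6, 1))   # still held
assert ready_for_release(entry, datetime.date(2020, 1, 1))       # hold expired
```

The hard part the speaker describes, making sure nothing goes out early, amounts to checking a condition like this reliably before anything is copied to the public archive.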
This is a little more complicated picture
of the deposition pipeline,
so there's a common deposition interface
that has specific pieces
having to do with X-ray,
cryo-EM, and NMR, and then
data are uploaded, and there's harvesting
from various kinds of
files that also come in,
where you can pick up
automatically some of the
metadata that are required.
The depositor can do some editing,
they can do some pre-validation
to see if there are any
problems with the file,
and then when everything is ready to go,
you actually press Send,
and it goes into the PDB.
Once it goes into the PDB,
well, it doesn't go straight into the archive,
it goes to the annotation
pipeline, and there
the annotators, curators
go through the data
and they check ligands,
they check the sequence,
they do extensive validation,
they do other annotations,
things having to do with
the biological assembly,
and other things, and then eventually
you have the release process.
So this is just a slightly
more detailed picture
of that cartoon that I showed you before,
and again, all of this
is software extensively
built by a team.
Wasn't always the easiest
project to get through,
but there is now a tool that allows you to
do all of this and it's
really quite remarkable
when I think back to the way it used to be
when you sent your coordinates
on a magnetic tape,
and they were then read.
This is prior to our involvement,
where that was the technology at the time,
and then it would take really weeks
to process a file, get it right,
and then all of the
communication with the authors,
there was no web, so it
was all done by letters.
That was before email.
Cathy had to do this, right?
And so it was a completely
different situation.
It wasn't because anybody
was doing anything bad,
but this was the most
you could actually do,
and it was a miracle that you
could get the data in at all.
And then I'll just say,
you know, I've been around a long time,
but when the data finally
got itself into the archive,
there was a newsletter that was hand done,
in Courier font, and every newsletter
had a list of all of the PDB files
that were currently available, typed in,
and they could see what was there.
And that's the way it was.
So that was pre-web,
- [Voiceover] And the
newsletter had a little form,
so you could order your tape,
- [Voiceover] Yes! (laughs)
- [Voiceover] With the archive--
- [Voiceover] Which was
$30 to get the tape,
so you could order a tape.
You sent in a blank tape,
and it cost $30 to put the data sets
onto this tape, and then
you received it that way.
And so that was the difference
between the technology then
and what it is now.
Okay, so annotation and curation.
What the authors provide
are the atomic coordinates,
the sequences, and what is the
(audio muffles speech)
and the definition of that
used to be quite complicated,
and people didn't actually know
how to put in those sequences.
The sequence you're supposed
to put in is the sequence
of the actual molecule
that you were working on,
not what you actually saw
in the electron density map,
and that's a very important distinction.
They also provide the experimental descriptions
and lots of information about
the structure determination.
What the biocurators provide
is the format standardization,
the annotation consistency,
so there are lots of policies
and rules about how exactly
to annotate the file.
There's standard vocabulary
that's been developed
for as much of the file as possible,
and then added annotation
that can go into the file.
In designing the annotation workflow,
for this new system,
the goals were to have
a regular review for the consistent
and highly accurate data,
and maximize the quality
of the individual data sets
and quality across the archive.
That was a big goal.
They wanted to employ procedures
and data handling tools
to achieve the goals, so
accept, organize, and encode deposited data
in a standard form,
add annotation in the
form of classification,
so over the years these data
have been classified by various people
and as much as possible the annotations
follow those classifications,
and then validate the experimental
and structural content.
It's also necessary to manage
the incoming data stream efficiently,
and get the burden down for
the annotator and depositor,
so in the early days it would
take a week, maybe, or so
to prepare a file to come into the PDB,
and then a few weeks to actually annotate
and process that data,
and that's not a sustainable
way of doing things, so we used to have,
or we still have it, something
called the magic number.
The magic number is basically the backlog
of data that needs to be processed,
and there was a time where we had hundreds
of structures in the backlog,
and that's not sustainable.
So you really need to
get it so that you have
as few structures in
the backlog as possible,
and you want to be able
to engage the depositor
during the annotation process,
to be able to resolve any
issues with the depositor.
It has to be again, this collaboration.
And make sure you're leveraging
the depositor's expertise
in vetting the data and the added annotations,
because usually, but not always,
the person who knows the most about
the structure that they've deposited
is the person who did the structure,
and that's why, in the very early days,
people believed that
the PDB should not touch
the data at all, because they felt
that the annotation staff
would not know enough
to know how to process the data,
and so they just wanted it in there
so it wouldn't get lost.
Period, and that view was
held till relatively recently,
and certainly when the
RCSB took over the PDB,
there were many people who were
(beeping)
scared that we would
overprocess the data and take
away the correct meaning.
So it's really, this was a controversy
which I think has finally died down.
I think people are appreciative
that somebody else is taking a look,
and making sure that everything is okay,
and also, because we didn't
become the data police,
they now know that nobody's gonna say,
"Eh, you made a terrible
mistake, and we're going to,"
you know, we never do that.
Other people might do
it, but we don't do it,
and I think that's very important.
So the annotation
workflows are shown here,
at the bottom, and
there are these modules,
which again, we've talked about before,
where there's a review module,
there's a peptide module.
There was a time when it was not clear
at what point something
stopped being a peptide
and became a protein,
so how did you deal with that?
And so there were a significant number
of structures in the PDB that had some
strange annotations for the peptides,
so two years ago the PDB took on the task
of trying to get that straightened out,
but right now, when we bring
in a peptide into the PDB,
it has to be checked, it has to be named,
and annotated as both a
sort of small molecule,
as well as a polymer, and that was
the solution we came up with.
The ligand is checked very
carefully for its chemistry,
the sequence is checked.
Again, these added
annotations, the validation,
communication, author review and then
author submission and resubmission.
So at this stage, when all
of this is being checked,
the author may get some bad news,
they may want to actually
redetermine the structure
because they made a mistake,
or, if something is not quite right,
not redetermine it but re-refine it.
Okay.
So one of the most important
parts of the structure
is the sequence of the polymer,
and so the
author submits the sequence and the
coordinates, and we want to make sure A,
that the sequence matches the sequence
of the sequence databases,
and if it doesn't,
is that deliberate, or is there a mistake,
and then you want to
check that the coordinates
in the atom site records,
I'm having trouble with the mouse here, okay,
so in the atom site records, we have
the actual sequence, and does that match
what is in the sequence records,
and they should match unless there's some
deliberate change in the structure,
in the material that was worked on,
like some kind of a
substitution or mutation.
There, the author provides
source organism information,
and that's checked against
the taxonomy database
and then the database
references are also put in,
and so all of that is checked
against UniProt as well
as the self-consistency
within the file.
And so here's the tool
that allows you to do this,
and you have the author,
the author sequence, the PDB sequence,
the UniProt sequence, and
if there's any discrepancy
that shows up and then the annotator
can take note of that and make sure
that all of this is consistent.
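The comparison step can be sketched as below: line up the deposited sequence against the reference entry (e.g. from UniProt) and report every position that differs, so the annotator can ask whether each difference is a deliberate mutation or an error. Real curation uses proper sequence alignment; this sketch assumes equal-length sequences to stay short.

```python
def sequence_discrepancies(deposited, reference):
    """Return (1-based position, deposited residue, reference residue) tuples."""
    assert len(deposited) == len(reference), "sketch assumes equal lengths"
    return [(i + 1, d, r)
            for i, (d, r) in enumerate(zip(deposited, reference))
            if d != r]

# A deliberate point mutation shows up as a single discrepancy:
assert sequence_discrepancies("MKTAYIA", "MKTGYIA") == [(4, "A", "G")]
assert sequence_discrepancies("MKTAYIA", "MKTAYIA") == []
```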
This was done before this tool existed,
but it was not done in a
uniform way across the sites,
and there were errors and inconsistencies,
and now there are fewer.
There's also a 3D structure view
so you can actually look and see
how everything is connected and again,
tabular view, so there's
lots of different views
that the annotator can see and decide
whether there's a problem or not.
In ligand annotation, we
have a problem that's
been fairly consistent
over the years, and it's been a big issue:
in the old days, most people
who did protein structures
didn't much care about the small molecules
that were in there.
Nowadays we have to be especially careful,
because many of the small molecules
are drugs or potential drugs, and
getting the chemistry right
is really, really
important, and that has been
an ongoing issue for the
PDB for many, many years,
and continues to be an issue.
One thing that was done fairly early on
was to create what's called
the Chemical Component Dictionary, which has
all the aspects of the chemistry
of what's supposed to be in the PDB,
both the individual residues,
the amino acids and the nucleotides,
as well as the small
molecules, in an attempt
to standardize all of the nomenclature;
all of the details of the chemistry
are in this Chemical Component Dictionary,
so when a structure comes into the PDB,
the small molecule ligands
are checked against
the Chemical Component Dictionary
to see if it's already in
there and then whether or not
all the chemistry matches.
If it's not in there, then
you have a new chemical
and that new chemical has to be checked
and made sure to be correct.
And there are errors
in the reporting of these small molecules.
Sometimes it's because the
author isn't quite sure what
he or she has.
Another thing that can happen
is that when you make
the crystal structure,
and you put in a small molecule,
sometimes the small molecule actually
has a chemical reaction and changes,
and you don't really
know that that happens,
so all these things mean that
the small molecules, I would
say, are far more difficult
to cope with than the large molecules,
so there is a tool now that lets you look
and see, first of all, whether or not
this is in the PDB already,
and then whether or not what you have
matches the correct chemistry,
and if it doesn't, what's wrong with it,
and then if it's new,
again, there's a big
struggle over whether it's really new
and whether it's really the right chemical.
So there's a fair amount of effort
that we put into checking ligands.
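The ligand check can be sketched as a dictionary lookup: find the incoming component by its three-letter code, then verify the reported chemistry against the dictionary entry. Here only the molecular formula is compared, and the two-entry dictionary is a stand-in for the real Chemical Component Dictionary, which records full atom, bond, and stereochemistry descriptions.

```python
# Illustrative two-entry stand-in for the Chemical Component Dictionary.
CCD = {
    "ATP": {"name": "adenosine-5'-triphosphate", "formula": "C10H16N5O13P3"},
    "HEM": {"name": "protoporphyrin IX containing Fe", "formula": "C34H32FeN4O4"},
}

def check_ligand(code, reported_formula):
    entry = CCD.get(code)
    if entry is None:
        return "new chemical: needs full review"    # define a new component
    if entry["formula"] != reported_formula:
        return "mismatch: chemistry disagrees with dictionary"
    return "ok"

assert check_ligand("ATP", "C10H16N5O13P3") == "ok"
assert check_ligand("ATP", "C10H15N5O10P2") == "mismatch: chemistry disagrees with dictionary"
assert check_ligand("XYZ", "C6H6") == "new chemical: needs full review"
```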
The other thing that needs to be checked
is what's called the biological assembly,
and in crystallography,
you determine the structure
of what's called the asymmetric unit,
and sometimes the actual biological unit
is, for example, a dimer,
so there's crystallographic
symmetry that allows you
to go from the monomer to the dimer,
to say whether this is
a monomer, a dimer, a tetramer,
and all of that,
that annotation needs to be done.
There are computer
programs that attempt to
say what the biological assembly is.
Sometimes it's really easy,
you can see right away
if something is a dimer,
it makes perfect sense.
Sometimes it's quite ambiguous,
and sometimes, even though
it may look like a dimer,
the author says, "No,
the biologically functioning
thing is a monomer."
So is that a crystallization
artifact, what's going on,
all of that has to be figured out,
and it's the source of
a lot of discussion,
and there's probably not one answer,
depending on who you talk to
there's not one correct answer for that.
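Programs that suggest an assembly typically score each crystal interface, for example by buried surface area: a large interface suggests a biological contact, a small one a mere crystal-packing contact. The 800 square-angstrom cutoff and the crude monomer/dimer rule below are purely illustrative, and, as the speaker says, the author can override the program's call.

```python
def suggest_assembly(interface_areas, cutoff=800.0):
    """interface_areas: buried area (square angstroms) of each crystal interface."""
    strong = [a for a in interface_areas if a >= cutoff]
    if not strong:
        return "monomer"                  # only crystal-packing contacts
    if len(strong) == 1:
        return "dimer"                    # one convincing interface
    return "higher-order assembly"

assert suggest_assembly([250.0, 310.0]) == "monomer"
assert suggest_assembly([1450.0, 300.0]) == "dimer"
assert suggest_assembly([1450.0, 1200.0]) == "higher-order assembly"
```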
Okay, so basically,
there are other added annotations,
which some of the
annotation staff here and I
can tell you about,
but basically those are the
big things that are checked.
So then we have to do validation,
which is establishing
or checking the truth
or accuracy of something,
and there's a quote from Richard Feynman
that "Science is a way of
trying not to fool yourself.
The first principle is that
you must not fool yourself,
and you are the easiest person to fool."
Richard Feynman was a very,
very, very famous physicist
who had a way of reducing
science to very simple ideas,
and I don't know, how
many of you know about the
Challenger shuttle blow-up?
Okay.
So I think it was in
1986, about 30 years ago,
a shuttle was being launched,
and it blew up, and all the astronauts
on board were killed,
and it was a really horrible event, and
there was a big investigation
as to why this blew up,
and they had all this fancy
stuff (audio muffles speech)
and then they had Richard
Feynman testify to the Congress,
and he had what's called
an O-ring, which is
this thing that holds
stuff together on the spaceship,
and he dipped it in some kind
of fluid, liquid nitrogen,
and it cracked, and he said,
"When the O-ring is subjected
to too low a temperature,
it cracks, and it loses its seal."
And he did that at the hearing, and
it was a very simple explanation
as to what had happened,
with respect to why the shuttle blew up,
and it was an example of his
very, very clear thinking.
Unfortunately, (audio cuts out)
(beeping)
shortly after that
hearing, Richard Feynman died.
Apparently he was very,
very ill during that time,
but really wanted to show what happened.
So he was a very clear thinker.
And I think the people,
the group who know me well
know that I have no patience
with complicated thoughts.
It has to be simple,
and that comes from being
raised scientifically,
in an era where Richard
Feynman was a hero.
So validation, you compare your models
to the experimental data
and to prior knowledge.
You need to reproduce the
knowledge and information
used in the construction of the model,
so in the case of say a protein,
it's a polymer, so we know
that a polymer is connected
in a particular way,
and so the polymer that is being submitted
should follow those chemical principles.
You need to be able to predict information
not used in the construction of the model,
and that there are methods
that we use to do that,
which are relatively new.
That wasn't done before.
And we want to look at the model alone,
the data alone, and then
the fit of the model
to the data.
And Cathy is very experienced
in thinking about that,
because while this is
very well established,
how to do this x-ray crystallography,
in cryo-EM, this is not
so well established,
and so you need to learn
how to actually do this.
So you want to inspect
the electron density,
and here is an example of a very
well determined structure on the left,
where everything is in
the electron density,
and you can be pretty sure it's okay,
and on the right is an example
of a not so well determined one,
and what happens is it comes in as,
"Well, I had this and this in my crystal,
so it must be there, and I
can't see it well, too bad,"
but the problem is, maybe it disappeared.
Maybe it broke up or maybe
it didn't crystallize,
and this has been the subject of the biggest
kind of controversy that has existed
in modern structural biology:
analyses done on ligands
and small molecules
that don't actually show up
in the electron density, and so
what does that actually mean?
So I think that's an important thing.
So in 2006 or seven or so,
there was a very scary
set of findings where 12 structures
that were in the PDB were found to be
likely fraudulent.
They were made up data.
We didn't know that at the time, but
various very smart people had
gone in and done some
analysis and realized
that those 12 structures
could not actually be correct.
I'm not talking about a mistake,
I'm talking about data
that was (audio cuts out).
This scared the bejesus out of everybody
who's involved with PDB, because
that meant that maybe
there's more than 12.
Maybe there's 50%.
And I remember really, we
were all very frightened
that we had not caught this.
So we assembled back then a task force,
it was the first validation task force.
It was headed up by Randy Read,
and Randy, well he appointed
some really, really
deep thinkers and critical
people in this area
to come up with the right way to validate
an x-ray structure to make sure that
we would avoid this kind of error.
This task force took about four years
to figure out how to do this,
and they came up with a set of algorithms,
they ran the algorithms
through the entire PDB
to check things, and
came up with standards
for how to check structures in the PDB,
and identify the outliers,
things that didn't
quite fit the way they should.
Luckily, after all was said and done,
first question is, were there any more
fraudulent structures in the PDB?
And there might be one more.
It's not clear.
But basically the structures in the PDB
were honest attempts by honest scientists
who put their structures in and might have
some mistakes in them but
they weren't made up data.
In the case here, this
is alleged made up data,
because this case has been
going on and on and on
and there's no, it's a whole other story,
which if you wanted to
talk about it later,
I'd be happy to tell
you more about it, but
at any rate, they came up
with a whole set of standards,
and then a way of doing validation,
coming up with a validation report,
and Ezra's gonna talk about that, but
there's an overall point about quality,
in that you can't take one
number and say, "What's that?"
You really can't judge a
structure by one number,
say the R-factor.
They came up with a set
of quality indicators.
If a quality indicator is on one side
of this slider, it's okay;
if it's on the other side, it's not okay.
It also checks the structure
against all other structures
of that particular resolution
and that particular time, so any structure
that was put in early
won't necessarily have very good sliders.
So the sliders are okay,
but they can be a problem,
and it doesn't mean that
the structure is wrong,
it means that at that time
it was the best it could be,
so structures appear to be
getting better and better
because everyone is using
these validation indicators
and making sure their
structures are in good shape.
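The sliders are percentile ranks: each quality indicator is placed relative to a comparison set of structures (the whole archive, and structures of similar resolution). A sketch of the percentile computation; the reference clashscore values are invented.

```python
def percentile_rank(value, reference, lower_is_better=True):
    """Percentage of reference structures that this value beats."""
    if lower_is_better:
        beaten = sum(1 for r in reference if value < r)
    else:
        beaten = sum(1 for r in reference if value > r)
    return 100.0 * beaten / len(reference)

# e.g. a clashscore (lower is better) against four comparable structures:
reference_clashscores = [2.0, 5.0, 9.0, 20.0]
assert percentile_rank(4.0, reference_clashscores) == 75.0   # better than 3 of 4
assert percentile_rank(25.0, reference_clashscores) == 0.0   # worse than all
```

This is also why early structures can have poor sliders without being wrong: the comparison set keeps improving around them.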
So that's one thing, and
then the other thing is
this is the sequence of a chain, and
if a residue is green, it means
there's nothing really
seriously wrong with it.
If it's red or orange,
there may be some problems.
The red dots mean there's
a problem with density.
So I personally like to look at this,
'cause right away you can see
all kinds of things about
a structure at a glance.
So that's my favorite.
Ezra will show you more about
what these reports look like.
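The per-residue colouring can be sketched as a simple rule on the number of quality criteria a residue fails (geometry, clashes, and so on), with the density-fit problem shown as a separate dot. The thresholds here are illustrative, not the report's exact definition.

```python
def residue_colour(n_failed_criteria):
    """Map a residue's count of failed quality criteria to a plot colour."""
    if n_failed_criteria == 0:
        return "green"       # nothing seriously wrong
    if n_failed_criteria == 1:
        return "yellow"
    if n_failed_criteria == 2:
        return "orange"
    return "red"             # several problems at once

assert [residue_colour(n) for n in (0, 1, 2, 3)] == \
       ["green", "yellow", "orange", "red"]
```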
Now, because people are so interested
and concerned about quality
and reproducibility,
there's been a move afoot,
which we strongly support,
to make these validation reports
available to journal
reviewers and editors,
but because we feel that our relationship
is with the author, it's
the author who needs to send
those reports to the journal, not us,
because again, we're not the data police.
People have said, "Oh, why
don't you just send it to us?"
No, so we have a sort
of sacred relationship
with the author, and the journal
has to work with the author.
The author has to be the
one to send these reports.
And that's really where we are now.
I'd be happy to, what's gonna happen next
is Ezra's going to go
through a validation report
and show it to you, so
before he does that,
does anyone have any questions
about what we mean by
annotation, curation, validation?
What we actually do?
Any questions?
Brenda.
- [Voiceover] I just
wanted to know what happens
in the last two slides,
where you showed a
case where there's no
ligand density but then
there's a mistake there,
or the last slide where you mentioned
it could be made up data,
fraudulent data.
- [Voiceover] Yeah, as I said,
there's no indication of
any more fraudulent data.
That was one crazy person, okay?
So we're not worried,
and you really, you know,
since science is based
on fundamental honesty,
and searching for truth,
if you have a criminal
who is doing basically a criminal act,
we can't produce a system
that's going to find
some person who's gonna trick the system.
So right now we believe
we can detect all those,
but if somebody really wants to cheat,
we're not there, we're
not the data police.
- [Voiceover] Right, but
an honest mistake, like--
- [Voiceover] Honest mistake
is a whole different thing,
and perhaps Jasmine can
address what happens
in real life when there's
an honest mistake.
- [Voiceover] We first assume
that it was an honest mistake,
and then we go back to the
authors when we find
there is a problem, and we ask the author
to correct it if possible,
or sometimes they come back and say,
"That's just the nature of the data."
And the density is too weak to support
the existence of the ligand,
and so they have two
choices, either they can
remodel it not to include the ligand,
or sometimes they just
leave it the way it is
because they truly believe
the ligand is present,
just not observed.
So in that case,
if they decide to leave
the ligand in there,
then we normally add a caveat record
noting that the density
cannot support the
existence of the ligand.
- [Voiceover] The PDB's job is to
work with the author,
not to punish the author. (chuckles)
'Cause anybody can make a mistake,
and that's why it's good
to have some group there
who's checking.
Now we have software
validation servers out there,
you can check your data
before you ever submit it
to see if there's any problem.
The other thing is that this
is an experimental science.
There's gonna be situations
where you don't see
parts of the structure,
you don't see the ligand.
I mean, that doesn't make it wrong,
it just means that the
methods that we're using
just aren't good enough to see it.
There's a whole other area of work
in disordered proteins where
the proteins are not always compact
and fully ordered, and that
disorder may actually be
very important for its function,
and there's a whole area of
study just based on that,
so that's just the nature of science.
So any other questions?
All right, so we're gonna
take a break for 10 minutes,
and then reconvene and then
Ezra's gonna go through
a validation for you.
