- [Voiceover] This is actually
a pretty important topic,
and it's what
bedevils a data resource,
where everybody wants us to be perfect.
But as you'll see soon,
we all have to be part of the solution.
We're all part of the problem,
we all have to be part of
the solution of the problem.
So why do data
become inconsistent?
Well, one thing is that
databases change over time.
So here's a picture of
what is justifiably a database,
a paper archive of information,
which some of us are very familiar with.
And it's how data was
stored for many, many years,
and in some cases, continues to be stored.
On the right, we show a picture that comes
from the PDB at Brookhaven National Lab,
where in the early days,
all of the archiving or data
curation was done manually,
and using paper files.
And you can see here
the system they had
for figuring out where things were.
So yellow folder means
currently being processed.
Red folder means on hold
that will be processed in date order.
Green folder, hold until publication.
I'm sure that Brian is
amused by this method.
But this was the only way you could do it
at that time, in that
things had to be kept
in some kind of order.
So that was early days,
meaning that the first
25 years of the PDB,
this sort of tracking everything
was done in this manual way,
although the data itself,
and the coordinates,
were kept digitally.
Now, in the current
instantiation of the PDB,
we have our data files,
which are now in a more modern format
than the original legacy format.
And then, more importantly,
we have these very
complex computer systems
that have been put in place,
a very complex infrastructure
to keep and track everything
and make sure that things are not lost,
there's plenty of redundancy, and so on.
So we went from mostly paper
and punch cards and magnetic tapes
to almost totally digital,
and everything tracked on computers.
We do have still the paper archives
from the old days,
because we're required to keep them.
And thanks to the Rutgers Library,
they showed us how to
archive all of the paper
in special boxes that
would not deteriorate.
And for many years, we
had to keep all of that
in a special place,
a special storage place,
so they would not be ruined.
And the funding agencies
were very, very strict
that that would happen.
So that's the reason,
or at least one reason, why the data
will change, or the data
archiving will change.
So why do the databases change?
Well, you need to accommodate
new types of data.
You know, you start out
and you say, well, I'm gonna
collect this information,
this information, and this information.
That seemed perfectly fine in 1974,
but as the science evolves,
kinds of data that you
need to collect changes.
And then there's new relationships
among the data in the archive.
We also need to enable and support
different kinds of queries,
and in order to do that,
we have to have consistent annotation.
As you'll see later,
there is still certain data
that are collected in the PDB
that are just text, hard to parse,
and therefore, difficult to query.
And then we need to be able to support
new organizations and
presentations for browsing.
And, very importantly,
we have to integrate
with different data resources
that are not under our control.
And so this all leads to
subtle and not so subtle changes
in the data over time.
Over time, errors may be introduced.
And one is lack of clear definitions.
So when you first start collecting
particular kinds of data,
you think, well, it's good enough
to just write a sentence or two about it.
But as time goes on
and you want to be able
to query that data,
you really need to have
very clear definitions.
That's part of our dictionary
way of doing things.
We try to make clear definitions.
Sometimes no matter how clear
we think the definitions are,
there are misunderstandings, OK?
Somebody thinks something means something
and somebody thinks it
means something else.
There's also human error.
People make mistakes.
It's normal, and those mistakes creep in.
And unfortunately, because of
the way annotation is done,
once somebody makes a
particular kind of mistake,
that mistake can propagate
for similar entries,
'cause somebody says, oh, let me see
how this was annotated
for a similar entry.
And it may have been
annotated incorrectly,
and then that propagates.
There's machine error,
and then I say there's bloody mindedness.
And I'm thinking of a very specific case
that I will not identify.
But some people decide
that they don't care what the rules are.
They're gonna do it their way,
and they're going to create
their own way of defining things,
even if it is not
consistent with the archive.
And I can think of
at least two very outstanding examples,
one of which has caused lots of problems,
and another which is just plain annoying.
And there are probably others.
And this is not because
somebody made a mistake.
It's because they said, "I don't care.
"I'm doing it my way."
There are other issues that you will hear
in discussing databases.
People will say any
attempt at standardization
is going to throttle creativity.
So they deliberately
do it a different way,
because they say their way is better
and don't really think
about the consequences
of how you can't compare their data
to someone else's data,
because they've made a deliberate change.
And the two cases I'm thinking of,
the authors themselves
are furious at the PDB
because they couldn't
find their own structures.
And the reason they couldn't find it
is that they didn't use
the standards that existed.
And these are very distinguished
and very important scientists.
Very hard to argue with people like that.
And that's what I mean
by bloody mindedness.
So the errors need to be fixed
to improve the data quality,
and we've been calling that remediation.
And we're gonna be talking today
about what that actually means.
So what's the relationship between data in
and data out?
Whatever the data quality is,
however the data are standardized,
whatever annotations exist,
if you improve that
and you try to get things
really standardized,
you are going to improve
query functionality
and you're going to
have more query options.
So it is a cycle and it's not a split
between what we call
the data in to data out.
You really have to have
the data in good shape
in order to do queries,
because very few query engines,
with the exception of Google,
can imagine what it is
that's on your mind.
One of the things that
I am very amused by,
or not so amused,
but Google appears to read your mind
because Google has
behind its search engine
incredible artificial intelligence,
some of the most creative
and brilliant computer scientists,
who have applied all different kinds
of machine learning techniques
to try to guess what it
is that you're thinking.
And they're examining the
patterns of your particular
queries over time.
So they kind of say,
oh, it's Helen Berman,
so collagen is going to be
about protein molecules,
not about face cream.
And it's completely different.
So if you go into somebody else's computer
and use Google, and they've
never seen you before,
you're gonna get different
kinds of answers.
OK?
So Google is really a brilliant,
brilliant search engine.
The other thing about Google is that
it doesn't pretend to give you the answer.
It gives you some choices
of what is like the kinds of answers,
and then you have to look through
what it is you want.
So it's not gonna give
you a definitive answer.
When you use a scientific search engine,
your expectation is very different.
Your expectation with a
scientific search engine
is that you're gonna get the answer,
and the right answer.
OK? And most scientific search engines
do not have the benefit
of millions of dollars,
billions of dollars of research
that allows you to do
the kind of searching
that Google does.
And even with that, Google could not
give you the definitive
answer that you really want.
So there really is a difference between
a typical scientific
search engine and Google.
Is that clear, what I'm getting at?
I've heard people say, well,
Google can do this and this.
For example, one of my pet peeves,
as people in the group know,
is the use of natural language processing,
or text searching for things
and expecting to get
any reasonable result.
Without having AI behind it,
without having machine learning behind it,
you're not going to get
the answers that you want.
What Google does is not...
You don't just type in a word
and then it looks for every
other thing that has that word.
It's looking for context.
And there's a very big difference
between what Google does
and what most scientific
search engines do.
So what are the types of inconsistencies
and errors that we've seen in PDB files?
One has to do with nomenclature.
There was a nomenclature system in place
for many years with PDB,
which was basically PDB-speak.
It had nothing to do with standards.
But it did exist for a very long time,
and so people used it.
And at some point, we
had to make the decision
to standardize the nomenclature.
And that was a massive job,
and not everybody liked us for doing that,
because once you change all the (mumbles)
every single atom name changes.
That has a big impact on
people who use the whole PDB
for their research.
Coordinate frames.
Cathy's gonna talk to you
about coordinate frames.
So that has to do with the standards
of how you represent molecules
in crystallography.
And crystallography has
had for a long, long time
standards for the coordinate frame
based on crystallographic (mumbles).
Some people chose, for whatever reason,
to trace to (mumbles) lab,
which defined a different
kind of coordinate frame
for the purposes of structure analysis.
And there are several entries in the PDB
that are based on a
different coordinate frame
than the crystallographic standard.
So that leads to big
problems with software.
Data harvesting: what I mean by that is
that structures come into the PDB
from structure determination programs,
and people mistakenly
use a particular switch in that program.
The PDB thinks that there's one kind of value
when in fact it's another kind of value,
and that leads to a systematic error.
There's several thousand
structures that this happened to
and needed to be remediated.
And then, finally, representation.
We know how to represent a polymer,
and we know how to represent
small-molecule ligands,
but there are the
intermediate-type structures
which you're gonna hear about from Shuchi.
She's gonna talk to you about peptides.
The same problem exists
with carbohydrates,
where there's inconsistent representation,
non-standard representation
that needed to be fixed.
The peptides were fixed.
Carbohydrates are under study right now.
So the considerations for remediation.
The first is a disruption
caused by changes
of large numbers of entries.
So as I said, in the case when we changed
the nomenclature of the atoms of the PDB,
people had been asking us to do it
for years and years and years,
to make it consistent
with IUPAC nomenclature rules.
But when that happened,
everybody was so used to the old way
that a lot of their software broke,
because as I show in point two,
people have built scripts
to correct known errors.
So for example, they
see the bad nomenclature
but they already have
a way of changing it,
and then you change it and
then their script breaks.
So when you make a change,
you have to think in terms of,
how will it impact the user community?
And if you think that now, you know,
we have more than a million downloads
of coordinate data a day,
and you change
a large fraction of the coordinates
to make some changes,
this is going to impact people's research
in a very strong way.
So you have to think about other people.
You know, we have to have some empathy
to understand what the impact is.
You can't just say, oh, I'm gonna do it
because it's the right thing to do.
Doesn't work that way.
And, of course, not everyone will agree
with the decisions made
about the remediated data.
So when you do remediation,
you have to consult
a broad, large section of the community.
Make sure that at least
you have some backup.
Minimally, your own advisory committee,
but hopefully, larger groups of people,
so you don't break
everything for everyone.
So again, empathy is very much needed
in being able to handle these data.
So what is the process that we go through?
First, to identify the
inconsistencies and errors.
So that's done by staff,
and it's done by all
the people who complain,
"Why do you do this? Bla, bla, bla."
So that all has to take...
You really have to be able to identify
in some systematic way
what the errors are.
Then you have to develop methods
to correct the errors.
You have to implement the corrections,
and you're gonna hear about that today,
this process, from Shuchi and from Cathy.
And then you have to...
You can't just change and
then say, "OK, I did it."
You have to change your curation process
so as to prevent
new entries from having the same errors.
If you don't change the curation,
then you will have fixed
a whole bunch of data
and then you're in trouble again.
You have to also work with the people
who create the software
that are used in structure determination.
I can tell you that it
took many, many, many years
to get the software developers...
What we really wanted to have is
whatever went into the PDB
and whatever came out of the PDB
would be able to do a round-trip
and there would be no inconsistencies.
And it may sound incredible
that we did not have
that round-trip ability,
for a whole lot of reasons.
And I can remember meeting
back maybe 10 years ago
in Salt Lake City, Utah,
where we tried to bribe
the software developers
to please follow some standards
and produce files that
we could actually read.
And the beer didn't help,
and they didn't do it.
But finally, in 2011, there was agreement
that they needed to be part of
the solution to the problem.
And then, they are now
very much part of the PDB family
in being able to produce data
that can be read by the PDB
without doing a whole lot of
complicated format changing.
And then finally, communicate
with all the stakeholders
about the corrections and any amendment
of the processing procedures.
So once you've kind of
figured out what to do,
and you've done it,
you then need to tell...
"Stakeholders" means
there's large numbers
of programs out there,
program applications out there,
that use PDB files.
They need to know that
changes have been made,
and the communication
has to be one-on-one.
It has to be done by email.
It has to be done in every possible way.
And no matter what you do,
there'll be whole bunches of people
who say they've never heard of it.
But you really have to try,
and we do have a 60-day rule,
which says that when there's a big change,
you have to give 60 days notice
to give people some time to sort this out.
But before that 60 days,
it's really important to talk to people.
But as you all know,
talk all you want,
people may or may not listen.
But that is what the process has to be.
OK, so now we're gonna talk about
two specific remediation projects
that were rather large and complicated.
Relatively small number of molecules,
but lots of issues.
And the first one is going
to be about peptides,
and Shuchi Dutta is going to
talk about that, and Shuchi
led an international team
to work on this problem
and come up with a solution,
and it took many, many, many
years to do this right.
And so Shuchi's gonna talk about that.
And then after Shuchi,
Cathy is going to talk about
an even smaller number of molecules
that had all sorts of crazy
problems with the PDB,
virus molecules, and she's gonna show you
how she approached that.
- [Voiceover] What?
OK.
So I'm gonna talk to you about
these peptide-like molecules.
And the catch-word here is "like".
They are neither pure peptides,
nor are they just small molecules,
and that's where the problem lies.
So I'm gonna start with giving you
an understanding of how we represent
small molecules and polymers in the PDB,
to give you a foundation for understanding
what really the problem is
with these peptide-like molecules.
And we'll look at the
problem in some detail,
and then discuss the solution.
And as we did the remediation,
we had to both improve the infrastructure,
update the files,
and as a byproduct of this process,
new dictionaries were created,
and I'll talk a little bit
about these dictionaries
and how that can now be used,
not just by us in the community
for annotation purposes,
but everyone in the community
has access to this now.
So let's start by talking about
building blocks and chemical groups.
So I'm not sure whether this has come up
in previous discussions,
but in the PDB, the Protein Data Bank,
every single polymer,
or every single molecule,
down to the smallest unit,
whether it's an amino acid, a nucleotide,
a drug, an ion, a ligand, whatever,
is described
in a chemical component dictionary
that is maintained
along with the PDB archive.
Now, these chemical
components are sort of...
We use that as a reference.
For example, the amino acid alanine.
Every time alanine
appears in any PDB entry,
we check back with that
chemical component dictionary
to make sure that that alanine atom name,
connectivity, geometry,
all that is what we expected.
If there are deviations,
we need to report that
and say that this atom is missing
or there is a problem
with the atom nomenclature
or you think the atom
nomenclature (mumbles).
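The dictionary check just described can be sketched in a few lines. This is a minimal illustration, not PDB annotation software; the component IDs and atom lists are simplified assumptions.

```python
# Sketch of dictionary-based validation (hypothetical data model).
# Each reference component lists its expected atom names; each residue
# instance in an entry is compared against that list, and any
# deviations are reported back to the depositor.

REFERENCE = {
    "ALA": {"N", "CA", "C", "O", "CB"},  # alanine heavy atoms (simplified)
}

def check_component(comp_id, observed_atoms, reference=REFERENCE):
    """Return (missing, unexpected) atom names for one residue instance."""
    expected = reference[comp_id]
    observed = set(observed_atoms)
    missing = sorted(expected - observed)
    unexpected = sorted(observed - expected)
    return missing, unexpected

missing, unexpected = check_component("ALA", ["N", "CA", "C", "O"])
# missing == ["CB"]: the side-chain atom is absent and gets reported.
```

A real pipeline would also compare connectivity and geometry against the reference, but the principle is the same: the dictionary entry is the single source of truth for each chemical component.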
So here, for example, I'm showing you
a couple of different amino acids.
Note in the amino acids,
we have the amino group,
the carboxyl group over here,
and a carbon in the middle,
which we call the C-alpha carbon,
to which there is some
side-chain attached.
The side-chains vary in
different amino acids.
Notice there is a big, bulky group here.
A yellow atom, this is a sulphur atom,
purple hydro atom, this
is (mumbles) group here.
So these give rise to
different amino acids.
And in the case of nucleotides,
we have a sugar and a phosphate,
but the bases may be different,
and these are different nucleotides.
Other than amino acids and nucleotides,
we also have ligands such
as the (mumbles) group.
This is a cofactor found in proteins like
myoglobin and hemoglobin.
And then there are drugs like Tylenol.
Now, all these may appear
just as a free-floating
ligand in the file,
or they may be part of the polymers.
So for example, the amino
acids and nucleic acids,
when we are representing
a protein molecule,
these amino acids link to each other.
So if we look at this amino acid.
So again, this is the N
C-alpha C of one amino acid,
N C-alpha C of another amino acid.
There is a water molecule that is lost
in this reaction that
forms a peptide bond,
and this process is repeated
over and over and over
to make a large polymer.
The reason I am going over this
is that this is an understanding
shared by the entire community.
So this knowledge is implied.
Meaning if we just say that we
have amino acid one and two,
the fact that they are
linked by a peptide bond
is understood when we talk
about a protein or a peptide.
But this understanding fails when we are
dealing with these special
peptide-like molecules.
So this is something we
need to keep in mind.
Again, a similar thing happens
when we are talking
about a polynucleotide.
So there is a phosphodiester backbone.
This linkage is assumed to be the same
every time we are talking about
a nucleic acid polymer.
And so if we just specify
that these are the bases
that are appearing in our polymer,
this whole molecule can be constructed
just by knowing the chemistry.
And the same is true for
these proteins and peptides.
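The point about implied linkages can be made concrete with a small sketch. Under the assumption of a standard linear peptide, none of the backbone bonds need to be stored; they can all be regenerated from the sequence. The function and tuple layout here are illustrative only.

```python
# Sketch: for a standard peptide, the backbone linkages are implied.
# Each consecutive residue pair is joined C(i) -> N(i+1) by a peptide
# bond, so the full connectivity follows from the sequence alone.

def implied_peptide_bonds(sequence):
    """List the peptide bonds implied by a linear sequence."""
    return [
        (i, "C", i + 1, "N")  # (residue i, atom, residue i+1, atom)
        for i in range(len(sequence) - 1)
    ]

bonds = implied_peptide_bonds(["MET", "ALA", "GLY"])
# Two bonds: MET.C-ALA.N and ALA.C-GLY.N; nothing else is specified.
```

It is exactly this assumption that breaks down for the peptide-like molecules: their extra rings, branches, and modified residues are not recoverable from a sequence, so they must be spelled out.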
But now when we come to
our peptide-like molecules,
we notice that these molecules are,
first of all, not linear.
Many of them are cyclic.
They have branches.
They have all kinds of various
interesting things going on.
We want to keep them in our archive
because they have a lot
of biological importance.
Many of these are antibiotics
or inhibitors that have key
roles in biological processes.
Interestingly, some of
these are ribosomal products
produced by the ribosomal machinery
like any other protein
or peptide would be.
But some of these, this looks
like a peptide molecule,
but this is not a product of the ribosome.
This is produced by a series of enzymes
acting one after the other
to produce this molecule
that looks like a peptide,
but it has some oddities
inside its structure.
And so there is biological significance,
and also there is chemical interest.
Now, if you look at this molecule,
and we'll look at this molecule again
later in the lecture,
the peptide part is here in the center,
but then there are these big,
cyclic arrangements of atoms
where the side-chains are
crossing into each other,
so there are all these other
complicated things going on
which have important
roles in their function.
And there are all these
interesting chemical properties
that help in the formation of
these complicated geometries.
But this is again going outside
of the standard definition
that we talked about
in the previous slides,
of the simple peptide polymer
or the nucleotide polymer.
So here, nothing can be assumed.
So we have to very carefully specify
what the linkages are
and what is going on.
So this is part of the problem.
So let's look at the problem.
In these special peptide-like molecules,
there are non-standard amino acids.
There are perhaps sugars
or other chemical groups
which are not amino acids or nucleotides.
Then the other complexity that we saw
was these molecules are not
necessarily always linear.
They may be cyclic, they may be branched,
they may have all kinds of other
interesting chemistries going on.
The other thing that we also noticed
when we did a careful
analysis of these molecules
is that these molecules,
when they bind to the target protein,
so, for example, if it's an inhibitor
and it interacts with
protease, for example,
they may form covalent
linkages with the target.
So now, not only are we looking
at this complicated geometry
and chemistry of the molecules,
but this is now linked
with a larger protein.
So we have to somehow address
and understand that, too,
and be able to clearly
indicate what would happen.
So before the molecule
binds to the target protein
and after it has bound, there
is a change in chemistry.
Somehow we have to represent that, too.
And then, when we looked at the problem,
we realized that some of these linkages,
some of these chemistries
were either incompletely represented
or incorrectly represented.
Because like Helen was saying earlier,
that when you are a pioneer in the field,
sometimes you say, OK, it's enough to give
one or two statements and comments
that say that, oh, there
is this special chemistry
that happens here and there.
But over time, you realize that
now you have, let's say, 30 structures
with that same molecule,
and people may or may
not have taken the care
to make the annotation that says
that there is a special
chemistry going on over here.
And so, over time, you'll
see just incomplete
and incorrect representation
of the molecules.
And so we have to deal with that.
Microheterogeneity means that sometimes
these specialized molecules
actually may have different flavors.
So in the same experiment,
we may see a molecule
that is sometimes this and sometimes that.
And as an annotator, when you
represent that in the file,
that becomes very confusing for the reader
until and unless it's
very explicitly identified
that this is where the change is,
and these are the two possible
or three possible scenarios.
To a lay reader, this
becomes very confusing.
What are you talking about?
Is it this, or is it that?
So there needs to be some
indication to explain that.
And we also found that these molecules
were just treated as, you
know, a bunch of things.
The fact that these molecules
have natural sources,
some of these are antibiotics
coming from fungi,
or weird other organisms,
that kind of information
was never even captured.
We never even bothered to identify
that these molecules are
biologically derived,
these molecules have important roles
and functions in the system.
So now I'm gonna give
you this as a scenario
for what is going on,
and we'll sort of jump off and see.
So this is the case of thiostrepton.
So this is the molecule.
So it's non-linear, you can see,
and in the archive pre-remediation,
this is how it was represented.
So in this particular entry,
it had a sequence that
covered one, two, three,
four, five, six, seven amino acids.
And the rest of it was
sort of added groups
and weird things.
Now, if we look at this molecule closely
and we follow the peptide backbone,
let's now look over on this side.
These are amino acids
that can be identified.
So you start from one, two, three, four,
so on and so forth.
You go all the way up to 17.
The red lines actually show
where you can cut these molecules
to have different amino acids
comprising this larger molecule, OK?
Now, of course, there are
certain points, like here.
This chemistry is very unusual.
This chemistry results from
two amino acids combining
and losing water and other atoms,
resulting in this special
what we call (mumbles) ring.
And this appears multiple
times in this molecule.
You'll notice one, two, three,
four of these (mumbles) rings appear.
And there is another one
which is a six-member ring
that appears over here.
And then there are these side-chains
or other groups linking up,
forming these larger rings.
So the complicated chemistry
and geometry all now,
after the remediation,
is represented in this much
longer polymer sequence
which has now all those 17 amino acids
that we had noted here,
plus the additional linkages
that are specified explicitly
in the dictionary and
in the specific entry.
And I'll tell you how we
developed the dictionaries
and how we represent it in the files.
So this is just to give you
a flavor for how earlier...
So notice that these different entries
have that same molecule represented
in completely different ways,
and now we make sure that every time
there is a thiostrepton
molecule in the archive,
whether it is alone or in complex
with a larger protein molecule,
we represent it in a consistent way
with this full representation.
And all these little
red lines that you see
are where the molecule is sort of chopped
and represented here,
but it is all patched together.
So the solution that we came up with
in trying to address these molecules
is that we have to do two things.
One is we have to keep the
molecule's detailed composition
because we want to be able to recognize
that this molecule is composed
of these smaller sub-units,
but we also want to keep
the molecule together
so that when we are
looking for thiostrepton,
we don't get just this
part or just that part,
we always get the full molecule, right?
So the solution is that we are
keeping the detailed composition,
but we are also keeping
the full molecule together.
And the decision to use
a dual representation
actually manifested itself
in three different flavors.
So there are some peptide-like molecules
which are very small,
and in these molecules...
Let me take a moment and tell you about
a little convention.
So when we have an amino acid,
it's just an amino acid,
which has an N C-alpha C.
If you have another amino
acid, it's a dipeptide,
OK, if you make a peptide bond.
A third amino acid,
and you've linked it with a peptide bond,
so now you have started forming a peptide.
So in our rules, in our conventions,
we say that two peptide bonds,
two consecutive peptide bonds,
is the definition of a small peptide.
And you can have three,
four, five, as many.
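The convention just stated can be sketched as a one-line rule. This is only an illustration of the cutoff, not actual annotation code.

```python
# Sketch of the convention described above: two consecutive peptide
# bonds make the molecule a small peptide (represented as a polymer);
# anything less is a single component that may carry subcomponents.

def classify(n_consecutive_peptide_bonds):
    """Apply the two-consecutive-peptide-bond cutoff."""
    if n_consecutive_peptide_bonds >= 2:
        return "polymer"
    return "component-with-subcomponents"
```

So a tripeptide (two consecutive peptide bonds) crosses the threshold and is treated as a polymer, while a molecule with only one true peptide linkage, like the example that follows, does not.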
But look at this molecule.
This has one amino acid,
another amino acid.
This is a lysine, this is a proline.
But attached to it are two other groups
which are not amino acids.
So we can't really call this a peptide,
because there is really
one true peptide linkage.
But we do want to be able to recognize
that there is a lysine here
and there is a proline here.
Why? Because there could be
a family of other molecules
where we keep everything the same
but keep changing this amino acid.
So in some instances this is a lysine,
in another instance it's histidine,
another instance it's a tryptophan.
So if we can keep this information intact,
that there is amino acid
one and amino acid two
linked to two other groups,
if we can keep that information intact,
then we can see that there
is a family of molecules
where we are playing with this position
and changing this amino
acid to something else.
So this resulted in what we are calling
a component with subcomponents.
This is small enough to not make
a polymer or a peptide.
So we make it a small molecule,
but within that small molecule
we keep this subcomponent definition
that it is composed of this part,
two amino acids, and this part.
The next class is
just like a regular polymer,
and we'll look at this
again in more detail
in the upcoming slides.
So here we have a sort of polymer
where this is one amino
acid, two amino acid,
three amino acid, and so on,
and then there may be some groups
which are not typical amino acids
but that can be added to this polymer.
And this is a little bit more complicated.
So here, the peptide
part is over here, OK?
But then the peptide cyclizes
and makes these other rings,
and on top of that, these
sugars are attached.
So now you have defined a polymer,
which has all these
complicated chemistries,
and then on top of that
you have these other
groups that are attaching.
So here, we have to deal with it
in a slightly different way.
We are calling them grouped molecules,
and I'll describe them
in detail in a moment.
So this again is the
molecule that we just saw,
where we have two amino
acids and two groups.
And we said that this is too small
to be represented as a peptide,
so we are going to represent
it as a single component.
But we are going to capture
the subcomponent information
so that we can see that there is TFA,
lysine, proline, and isopropylamide.
And here, these two are amino acids
that everybody recognizes.
These two are other chemical groups.
Everybody with me so far?
Any questions?
So now, when we annotate these molecules,
the name of the molecule is listed
like the name of any other small molecule
that is present in the file,
like a heme or whatever.
But in the dictionary,
we capture the subcomponent information
so that the apparent sequence information
is captured in the subcomponent.
Now, if this molecule
is covalently linking
or doing other chemistries
with, let's say,
the protein molecule,
that will be represented
like in any other case,
like any other small
molecule is represented.
There are usually no biological sources
or other information for this,
because these are mostly
chemically synthesized molecules.
And when we are annotating the file,
we define the environment around it
as we would define the environment
around any small molecule ligand
that we encounter in a file.
OK?
So now we are going to
move to the next class,
where we are representing a molecule,
a peptide-like molecule, as a polymer.
So in this case,
notice there is an
aspartic, glutamic, valine.
So there are at least three amino acids.
And this is also an aspartic,
but there is something special about this.
And at the (mumbles)
another chemical group.
So if we follow the chain,
so this is your amino acid one,
amino acid two, amino acid three,
and this is amino acid four,
but there is something
strange going on here
where this carbon does
not have the same geometry
as it would have in an amino acid.
And on top of that, it covalently links
with the enzyme after the reaction,
and therefore this is not
a true amino acid anymore.
It has changed its geometry and (mumbles).
And so this molecule,
we can represent it as a polymer
because there are more
than two peptide linkages,
but there are other groups
that have to be clearly specified.
And in this particular instance,
when we annotate the file,
we have to also specify
the link between the peptide-like molecule
and the protein that it is bound to.
So here, of course,
for the parts that are peptide-like,
the standard peptide bond
information is implied.
Only when there are non-standard groups,
those linkages have to
be specified in the file.
So again, the name of the molecule
is listed as you would list any polymer.
Not any small molecule, but any polymer.
In these types of molecules,
the source information,
if it is available,
if it is a natural product,
the source information is included.
The composition, part of it
may be implied, like we said.
The standard peptide linkages are implied.
But all the non-standard linkages
have to be explicitly defined.
And there may be a
sequence database reference
for these kinds of molecules.
And for some larger molecules,
there may be regions of this polymer
that actually adopt a
specific 3D structure,
like it could be helical in certain parts
and loopy in certain other parts.
And so if that is the case,
then those three-dimensional
details are also captured.
And now, in this kind of molecule,
when we are describing the
environment of the molecule,
we describe the environment
of the whole polymer.
Not just the small ligand,
but the whole polymer.
And the function, if it
is known, is described.
Everybody with me so far?
Any questions?
OK.
So now this is the third group.
We are calling these
the grouped molecules.
So here, if you'll bear with me over here.
So if you follow this green, fuzzy line,
this is where the peptide linkages are.
OK?
So this is N C-alpha C, N C-alpha C,
and so on, so forth.
Now, the side-chains,
they actually link together
to form these other, larger rings.
So you have one, two,
three, four rings here.
Now, on top of that, to these positions
there are a variety of sugars
attached to this molecule.
So it has a peptide core
decorated with various sugars.
And these cores we have
seen in our archive.
We may see this core just as the core,
or we may see it decorated with sugars
in the different flavors in which they appear.
So sometimes there will be only one sugar,
sometimes two sugars,
sometimes multiple sugars.
And so this sort of led us to realize
that there are these families
of grouped molecules that we see.
And I'll show you more in a moment.
But for these molecules,
this is a little complicated,
meaning that the name of the molecule
is the whole molecule's name.
But we have to keep in
mind that the molecule
is comprised of a polymer region
and some non-polymeric group.
The sugars are non-polymeric.
They are not polymers of sugar.
Sometimes we may see polymers of sugars,
or we may see other polymers.
So multiple polymers may come together
to form one single molecule.
And so in these scenarios,
we describe the sequence
of each of these polymers,
as appropriate.
We also list in the composition
all the non-polymer components.
Or if there are multiple polymers,
we describe those, too.
And then we have to be very, very detailed
about the intramolecular linkages
and intermolecular linkages,
because the polymer and
non-polymer linkages
have to be explicitly defined.
And if there is any non-standard linkage
within the polymer,
that has to be defined.
So all that has to be
very carefully defined.
And the binding environment,
when we are describing
the binding environment
of this molecule,
it's the binding environment
of this whole thing,
the polymer and non-polymer,
or polymer and other polymer components,
all together.
That is the molecule, and
the binding environment
is whatever is around it.
OK? Everybody with me so far?
And then, of course, the function
is of the whole molecule,
not just parts of it.
So to summarize, in our annotation now,
we follow this sort of
a flow chart to decide
whether or not the molecule
should be represented
as a small molecule with subcomponents,
as a peptide-like polymer,
or whether it should be represented
as a grouped molecule.
So the key question that we ask is,
can we define two or more
consecutive peptide bonds in this molecule
when we review this molecule?
If we cannot define them,
then the molecule is ruled
to be a single component.
And then we'll try to capture
whatever subcomponent
information there is.
If it can be defined,
the next question that we ask is,
is this a branched polymer
or are there additional small
molecules attached to this?
And if it is, if it is branched
or if it has the additional
components attached to it,
then it becomes a grouped molecule.
Otherwise, it is a polymer.
So this is sort of our decision tree
that we use for our annotation process.
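Just as an illustration, that decision tree can be sketched in a few lines of code. This is only a sketch, not the actual annotation software, and the function and argument names are invented for the example.

```python
def classify_peptide_like(consecutive_peptide_bonds,
                          is_branched,
                          has_attached_components):
    """Sketch of the annotation decision tree (names are illustrative).

    Fewer than two consecutive peptide bonds: single component,
    with subcomponent information captured in the dictionary.
    Two or more, but branched or with extra components attached:
    grouped molecule.
    Otherwise: polymer.
    """
    if consecutive_peptide_bonds < 2:
        return "single component with subcomponents"
    if is_branched or has_attached_components:
        return "grouped molecule"
    return "polymer"
```

So a molecule with several peptide bonds plus attached sugars lands in the grouped category, as in the vancomycin example.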
And in looking at the
remediation that we did,
all the pre-2012 entries
which we analyzed,
so we had about 1,000 entries,
850 inhibitors and 150 antibiotics
were remediated in this manner.
So we had to actually
go through each and every single file,
because the status of whether the molecule
was represented as a
polymer, or as a ligand,
or as a mixture, and so on,
and whether the chemistry
was correct and complete,
was different in every single file.
So we had to go through every single file
and we had to decide what needs to be done
to bring it up to the standard.
And then we appropriately made them
either peptides with a polymer sequence,
or a single component,
or a grouped molecule.
And simultaneously, we
also built dictionaries
and an infrastructure
that helped us make sure
that any new molecules that are coming in
that have the same chemistry
same (mumbles) antibiotic,
are represented in the correct way
according to the standards
that we have set up.
The other thing that we also noticed
in doing this process
is that we started noticing
families of molecules.
So for example, this vancomycin core.
So again, the peptide is over here
and there are these cyclic side-chain
links that are formed.
But notice, to the same core
there is one sugar added,
two sugars added, three sugars added.
They are all decorations of the same core,
meaning that this family,
it has certain properties,
it has certain structural features
and ways in which it behaves
and combines or acts in biology.
But these decorations may change
the specificity of the molecule,
may change the way in which
it interacts with the targets,
and so on and so forth.
But knowing the fact that they are related
helps us understand more
about the molecules.
And so we came up with two dictionaries,
or two kinds of files,
one which we called
a peptide reference
dictionary file, PRD file.
Now, this is where we
capture all the information
that goes into making a molecule,
whether it is a single
component with subcomponents,
whether it is a polymer,
or whether it is a grouped molecule.
In all cases, we will generate a PRD file.
So this is analogous to our
chemical component dictionary.
So whenever we get an
actinomycin or vancomycin,
we will go to this file,
the respective PRD file,
and we will say, does
it have all the atoms?
Are they linked in the correct way?
Are the atoms called whatever
they should be called?
Are all the geometries and
chemistries consistent?
And if they are not, then
we need to ask the author,
are you really working
with a different molecule,
or is this a mistake
that needs to be fixed?
And so in that file, we
capture the molecule name,
the formula, function,
details about the structure,
all the components, all the linkages,
and everything that
describes that molecule,
including if there is information
about the source of the molecule,
and where we learned about that molecule.
Now interestingly, we found
that some of these antibiotics,
for example, are produced
by different organisms.
So different people
working on different sources
or different problems,
they came up and said,
oh, I have derived this
from such and such fungus,
and somebody else said,
oh, I have derived this
from such and such other organism.
So we try to keep track of
all that information, too,
just in case the same molecule
has multiple sources.
And then, on top of this,
another kind of file that we keep
is the family information.
So family, FAM for family.
So the family tells us the organization
of these different PRD entries,
or peptide reference dictionary entries,
that have the same core but
have different decorations,
that have the same infrastructure
but details are different.
And this is where we also
capture additional information,
information that we have gathered around
from various other databases and resources
that tell us what this family
of molecules may be doing.
What their roles are, what
their interactions are,
and so on and so forth.
And this may give the
users information or ideas
about what their molecule of interest
could be also doing, in
addition to whatever they are
specifically working on.
So all this information
is all now available
from the worldwide PDB,
and it's what's used in
our annotation purposes
and also available to
the community for use.
Just to summarize everything,
we talked in the beginning
about components and polymers,
trying to give you a
foundation for why it is
that the peptide-like molecules
were a problem for us.
Then we looked at the
problem and some examples,
and then we also talked
about the solution.
So the solution, in our case,
was the dual representation.
We want to keep the molecule together,
but we also want to keep
the detailed composition
of that molecule
so that we can very quickly see
where there are specific changes
when people make similar molecules
or remake molecules.
And finally, we talked about
how we improved the infrastructure
and how we updated the files
that had these various
problems and issues,
and how we developed new dictionaries
to support future annotation,
and processing and understanding.
And I think later in the class,
we'll go through some specific examples
to give you a first-hand experience
of the kinds of problems we encountered
and how we addressed them.
So thank you.
Any questions, I'm happy to answer
before I give it over to...
Yeah?
- [Voiceover] So these
peptide-like molecules,
when you have a new molecule
which is not already in the dictionary,
do you frequently update it?
Do you encounter it--
- [Voiceover] Yeah, all the time.
- [Voiceover] OK.
- [Voiceover] We are
encountering new molecules
all the time,
and we go through that flow
chart that we showed you
to decide how we are going
to represent this molecule.
And then sometimes it needs
a little bit of research
and digging around to understand
what is the composition of this molecule,
because many times the authors
may or may not provide
in the first instance
the detailed breakdown: this is alanine
linked to some weird group,
linked to three other amino acids,
linked to something else.
So some of this we have to
dig out of the literature.
Some of it, we go back to the authors
and they help us figure
out what is going on.
And then we build that whole molecule,
and that becomes the foundation
for our annotation of the file.
- [Voiceover] That gets
added to the dictionary?
- [Voiceover] Yep. Yeah.
- [Voiceover] Thank you.
- [Voiceover] Any other questions?
OK.
- [Voiceover] So (clears throat)
for the chemists among you,
you can begin to maybe
figure out the kinds of queries
that could be done now that
couldn't be done before
by doing the work that's been done here,
because for example, if you wanted to,
say, find all molecules
in the PDB that have,
say, proline, or have histidine,
and you don't have all of this parsed out,
it becomes very difficult.
If, for example, you're interested in
things that (mumbles).
so this was really important to do,
and it was a very, very difficult project,
as you can begin to see, I hope.
And I think that there's
been a lot of order
brought to the process.
It took way more work
than we ever anticipated,
but I think it was worth it.
So this is an example of remediation,
so it isn't just about fixing
the spelling in the file.
OK, so now, the next tough problem
that we're gonna hear about
are the virus molecules.
And Cathy was our savior in that one,
and she's going to describe
what the problem was
and how the problem was fixed,
and hopefully, how it
continues to not be a problem
as we go forward.
So Cathy?
- [Voiceover] OK. Thank you, Helen.
Hey.
So I'm gonna start by
introducing a concept,
just to make sure everybody
is on the same page
about symmetry.
So I'm gonna be talking about molecules
that have point symmetries
and helical symmetries.
I don't have a slide
for helical symmetries,
but for point symmetries,
just so you're aware,
well, there are molecules in the PDB
that have all these different types
of point symmetries.
And they come in the cyclic,
dihedral, and cubic flavors.
And it turns out that the icosahedral form
of point symmetry, which
actually consists of
five-folds, three-folds, and two-folds,
is a very popular form for viruses.
It allows them to form
a large spherical object
which can contain genomic
DNA, for instance.
And there are other
symmetries that we also find,
and I'll show you one example
of a dihedral symmetry later.
OK, so the icosahedral viruses,
I'm just showing one example here,
which is the rhinovirus.
So this virus has icosahedral symmetry.
There is one single copy
that is actually defined as
coordinates in this file,
1RUG, if you were to go and look at it.
But then, what you really
would like to be able to do
would be to build the whole
assembly from that one copy.
And in order to do that,
rather than give 60
explicit sets of coordinates
with all the atoms in it,
instead you give one copy
and then you give 60
different transformations.
And by doing that, you can
very compactly describe
how to build the whole assembly.
And similarly, for helical viruses,
if you can define the
unique asymmetric unit,
helical asymmetric unit in this case,
and you have a set of parameters,
helical parameters, which are,
in general, a twist and rise
to go between one copy and the next,
then you can build the whole assembly
for a helical virus.
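That twist-and-rise recipe can be sketched as a small helper. This is a toy illustration that assumes the helix runs along the z axis; it is not the PDB's actual generation code.

```python
import math

def helical_copies(points, twist_deg, rise, n_copies):
    """Generate n_copies of a helical asymmetric unit about the z axis.

    Copy i is the original rotated by i * twist_deg about z and
    translated by i * rise along z. points is a list of (x, y, z)
    tuples standing in for the atom coordinates of one copy.
    """
    copies = []
    for i in range(n_copies):
        angle = math.radians(twist_deg * i)
        c, s = math.cos(angle), math.sin(angle)
        copies.append([(c * x - s * y, s * x + c * y, z + rise * i)
                       for x, y, z in points])
    return copies
```

With the twist and rise in hand, the whole helical assembly is just the list of copies.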
Now, so this was the
problem in about 2006.
And we had about 300
structures at the time,
so I say growing number.
But it was becoming problematic.
We were getting people complaining
about not being able to look
at these complex structures
and they were being deposited.
But the problem was that we didn't really
have a standard for
representing them at the time.
There was just no annotation process
for checking and validating
the depositor-provided information.
And so what went out to the public
was often very difficult
for anybody to make use of.
And what we found...
I was asked to take a look at this.
Shuchi was actually my partner
in helping me with this.
And we found that there were
really three major issues.
And so the first one was
that the instructions
for building these full assemblies
was often missing,
or it was incorrect.
And we actually looked at a whole bunch
of automatically generated images
to identify those entries
that had problems.
Then there was also a problem with
inconsistent deposition frames,
and we identified this by a different
image generation process.
And then we also had these
overly complex building instructions.
So I'm gonna show you examples
of each of these.
So the erroneous building instructions.
So typically, I don't
know if you can see here,
the PDB has this remark.
The PDB file format has this remark 350,
which we call the BIOMT record.
And it usually starts with
an identity operation,
which you can identify by having...
This is X, Y, Z.
You have 1 0 0, 0 1 0, 0 0 1.
This is a translation component,
and it's usually zero.
So this is an identity operation.
That actually has quite a bit of importance
in what I'm gonna be telling you
about how we made use of that.
But then there's usually
59 other transformations
that are assigned.
So here you have the first one,
here's the second one.
And here with the second one,
you see this is a non-identity operation.
So it actually says,
"Take these coordinates
"and rotate them according
to these numbers."
And I'm not gonna go into details
about how that happens,
but some of you may know about
how this evolved.
But the important point is that these
define the instructions
for moving the coordinates
from the deposited position
into a new position.
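The effect of one such BIOMT operation is simply x' = Rx + t, a rotation followed by a translation. A minimal sketch (not PDB software) of applying one operation:

```python
def apply_biomt(rotation, translation, coords):
    """Apply one BIOMT operation to a list of (x, y, z) coordinates:
    each point becomes R @ x + t."""
    moved = []
    for x, y, z in coords:
        moved.append(tuple(
            rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z
            + translation[i]
            for i in range(3)
        ))
    return moved

# The identity operation from the remark: rotation rows 1 0 0 / 0 1 0 /
# 0 0 1 and a zero translation, which leaves the coordinates unchanged.
IDENTITY_R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
IDENTITY_T = [0, 0, 0]
```

Building the full assembly is then just applying each of the 60 operations in turn to the one deposited copy.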
And so as you can see, I
show here examples of entries
where we just tried to take
the building instructions
and apply them to the coordinates.
And in very few cases, you
could actually identify
something that looks like a virus.
Here, down here, this
is actually two viruses.
You know, it might be right,
but I think that one wasn't right either.
(light laughter)
So it was a mess, basically.
So now, what is this thing about frames?
So as Helen mentioned before,
it is an expected standard
for crystal structures that
are deposited to the PDB
that the coordinates are deposited
relative to the XYZ axes,
orthogonal XYZ axes,
according to a standard that is defined
for the crystal lattice.
I could go into that in detail,
but I won't do that other than to say that
the standard does exist,
and it is generally expected
for crystal structures
that they will be deposited this way.
But some people have very strong feelings.
And you know, actually in early days
when people were first
determining structures of viruses,
viruses have complicated symmetry.
They have five-folds,
they have three-folds,
they have two-folds.
And you know, computer programs were slow.
You wanted to figure out an
easy way to represent them,
you know, to actually
work with them in the lab.
And so it turned out that
this icosahedral frame
was very typically
preferred by the scientists
who were working on these structures.
And this frame actually had two folds
positioned along the X, Y, and Z axes,
and 5-folds positioned off-axes.
For instance, here is a five-fold,
and here is another five-fold,
and this would be a three-fold,
and then the two-fold down the Z axis.
Now, of course, there were
actually multiple definitions
for how to put the icosahedral frame
with respect to the XYZ axes,
so there's different ways to do it.
So one of the things that we did
was to actually look to
see how these viruses
were deposited with respect
to their frame type.
So here I've plotted the year
the structure was released,
whether it was deposited
in the crystal frame
or whether it was deposited
in one of the icosahedral frames.
And in the early days, most of the time
it was deposited in the icosahedral frame.
The trend, actually, if
I were to extend this out
to present day,
you would see that it is
more conventionally done
crystal frame now.
But we had all these structures
that we had to deal with to figure out
some sensible way to represent them.
One thing that we did
also have to deal with
is there is this rule in the PDB
that if you touch the coordinates,
the XYZ position, that
you actually have to
obsolete the entry and supersede it.
So in addition to wanting to represent
all these things properly,
we didn't really want to
touch the coordinates.
And here is an example of
complex building instructions.
This is a rhinovirus entry,
and I will tell you the author of this.
The senior author of this
entry was Michael Rossmann.
And when I first read
what he put in the entry,
I kind of cribbed it
into a much smaller...
It was a huge, huge comment that he wrote,
"How to build my virus from
these coordinates." (laughs)
And it was very complex,
but at the same time, very elegant.
And I really feel like I learned a lot.
I learned basically the core
of how to develop the process
of remediating the entries
just from understanding
this particular entry.
And so what he said was,
"To generate a viral shell
from the coordinates,
"apply 532 point group
symmetry," icosahedral symmetry,
"elements in the specific order five-fold,
"two-fold, two-fold, three-fold
"about specific axes whose
transformations are given below."
I've triggered the movie
here so you can see some.
So I'm just only showing
you the five-fold.
So let p1 be the coordinates of the entry.
Apply this transformation,
which is down here,
four times to create an entire pentamer.
You apply this to p1 to make p2.
You apply the same
transformation to p2 to make p3,
you apply the same
transformation to p3 to make p4,
and the same transformation
to p4 to make p5.
And there you have your nice pentamer.
And it continues (laughs)
in this very logical manner,
all the way through.
(laughs)
Yeah.
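Rossmann's recipe, applying the same five-fold transformation to each new copy in turn, can be sketched like this. The rotation here is a stand-in 72-degree turn about the z axis; the real entry gives the matrix for a specific axis.

```python
import math

def build_pentamer(p1):
    """Apply one five-fold rotation (72 degrees, here about z as a
    stand-in for the entry's actual axis) four times: p1 makes p2,
    p2 makes p3, and so on, giving the five copies of the pentamer."""
    a = math.radians(72.0)
    c, s = math.cos(a), math.sin(a)

    def rotate(points):
        return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

    copies = [p1]
    for _ in range(4):  # p2, p3, p4, p5
        copies.append(rotate(copies[-1]))
    return copies
```

Applying the same matrix to the newest copy each time is what makes the instruction so compact.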
Yes, and so here's the problem.
OK, so you have one virus entry
and you know, somebody's
interested in looking at it.
This is OK.
Say you have 300 of them and
you wanna look at all of them.
You wanna do some analysis on all of them.
This is unmanageable.
You just can't have that.
You need to have something
that a computer can read, basically.
Say, OK, I take all these numbers
that are the instructions
and I apply them to coordinates
and build the assembly.
So that was basically our goal,
we needed to be able to do that.
Oh, and here's another example
of complex building instructions
that actually is from the same laboratory.
And I'm only showing you this,
the wrong thing.
So they actually gave
us 88 transformations,
and three deposited chains,
and if you apply that,
you'll see this crazy looking thing.
I'm gonna come back to this later.
OK, so to improve the infrastructure,
first of all,
we weren't collecting symmetry
parameter information.
So if something had point group symmetry,
we weren't asking them
what was the point group,
and we realized that that should be given.
For helices, we wanted to make sure
that we had the helical information,
the twist and the rise information.
And we decided on a standard
order for symmetry operations
that would help people
if they wanted to compare
different structures.
So in this example here,
where I have the five-fold,
we decided that for every
icosahedral symmetry,
we would actually give these first,
the five-folds first,
just like Rossmann did.
And that would enable people
to look at pentamer symmetry.
So there would be a standard
order for the symmetry.
And then we wanted to give instructions
for moving assemblies between
all of the different frames.
So there would be a deposited frame,
there would be what we would call
the standard point frame,
and there would be a crystal frame.
So the idea is that we would give
the transformations
required to move between
all these different options.
And usually the identity
operation was assigned
as one of those,
because if it was already
in the deposited frames,
you move to the deposited frame,
you just put the identity operation there.
But if it was provided in
the point symmetry frame
and you wanted to get it to
crystal frame, for instance,
you would have to give
the transformations.
And so we created new dictionary items,
categories and items to hold these
new instructions,
and we actually developed software
in order to automate the production
of the new representation.
So here are symmetry parameters defined.
For icosahedral viruses,
it's just simply this one
item you have to give.
The Schoenflies symbol is I.
Here down at the bottom,
I show a different point symmetry.
And so this is a cyclic symmetry.
It actually has 17-fold symmetry
about the Z axis
and then a two-fold perpendicular.
We also addressed...
There were a few, just a very few
assemblies of this type
that we also addressed
during the remediation.
But for the filamentous virus,
then we also gave the rotation
and the rise definition.
As I said, we imposed a standard order
for the transformations
so we go one, two, three, four, five.
So for any virus that you would look at,
if you applied the first five transformations,
you would get a pentamer.
And this is what I like to
think of as my game board.
This is your starting point.
It's what the depositors gave
you in terms of coordinates,
and you wanna get some other place.
So say you want to look
at the structure in the
standard point frame.
So you have to apply a transformation,
and I'm calling that P here.
Now, once you've gotten here,
there's actually a very
standard set of operations
that would apply to any
structure that you would,
so it's exactly the same
fixed set of 60 operations,
and that gets you from just
the single set of coordinates
to the fully built assembly
in the standard frame.
So let's say you wanna go
back to the deposited frame.
Then you need to apply the
inverse of this transformation
to get back.
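In matrix terms, that round trip is a conjugation: go to the standard point frame with P, apply a standard operation S there, and come back with P inverse. A minimal sketch with plain 3x3 rotation matrices (a toy P, not a real deposition's transformation):

```python
def matmul3(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def in_deposited_frame(p, p_inv, s):
    """Express a standard-frame operation s in the deposited frame:
    move to the point frame with p, apply s, and return with p
    inverse -- i.e. p_inv @ s @ p."""
    return matmul3(p_inv, matmul3(s, p))

# A toy P: a 90-degree rotation about z, with its inverse (its transpose).
P = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
P_INV = [[0, 1, 0], [-1, 0, 0], [0, 0, 1]]
```

So the 60 supplied operations can look like a direct route from the deposited coordinates to the assembly, while each one is really P inverse times a standard operation times P.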
Now, in the data file,
we provide the transformation key.
And in addition, we supply
a set of 60 transformations
in which we combine this process
as well as this process,
so that although it
looks like you're going
directly from the
deposited-frame coordinates
to the assembly in this
author-defined frame,
you're really going
through this other process.
Now, to get to the crystal lattice,
there are additional
operations to be performed.
If the depositor has given you
the coordinates in the crystal frame,
then typically this will be
identity transformation.
In this particular case
that I'm showing here,
there's actually multiple
copies of the virus
in the crystal.
So we provide two transformations.
So we move the virus into one position,
and then we move it
into a second position.
And then the last little bit
that we worked out
is now we've put them
in the crystal lattice,
but it turns out that there
are symmetry operations
within the lattice.
And so it is useful to be able to define
what's the unique content
of the crystal lattice.
So that turns out to be some fraction
of these two copies of the virus.
So there's a three-fold axis here.
And if you were to take
the colored parts here
and apply three-fold symmetry,
you would actually build
both of the full viruses.
So that's what I mean by selecting.
So we're basically selecting
those operations that are unique,
and then when you apply
the crystal symmetry,
you can then build the whole thing.
So then, maybe I'll just stop.
Are there any questions on this?
It's a little...
OK.
So then the next part was about...
So we needed to provide some
computer readable instructions.
I'm not showing you the CIF here,
but this was what is underlying
each row of a CIF table
for an icosahedral structure.
So here's your rhinovirus,
the complete icosahedral assembly.
We have transformations that are numbered
one through 60,
and it turns out there are four chains
that these transformations
need to be applied to.
So the instruction is apply one through 60
to chains one, two, three, and four.
Similarly, we can define a pentamer
by only including one through five
to chains one, two, three, and four.
If we want to see the
point asymmetric unit
in the point frame, we apply the P matrix.
And if we want to see
the whole crystal asymmetric unit,
we apply...
As I was telling you,
there were selections that you would make.
So in this case, you
select one through five,
11 through 15, 21 through
25, 36 through 40.
There's software that figures out
what they should be.
In addition to that, you're applying
this transformation at zero,
which moves the coordinates
into the crystal frame.
Again, often that is an identity,
but we always define it anyway,
just so that we're consistent
across the entire archive.
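The computer-readable instruction is essentially a range expression over numbered transformations, applied to a list of chains. A sketch of how such an expression might be expanded (illustrative only; the archive itself carries this in mmCIF categories):

```python
def expand(expression):
    """Expand a range expression like '1-5,11-15,36-40' into the list
    of transformation numbers it names."""
    numbers = []
    for part in expression.split(","):
        if "-" in part:
            low, high = part.split("-")
            numbers.extend(range(int(low), int(high) + 1))
        else:
            numbers.append(int(part))
    return numbers

def assembly_pairs(oper_expression, chains):
    """All (operation, chain) pairs implied by an instruction such as
    'apply operations 1-60 to chains A, B, C, D'."""
    return [(op, chain)
            for op in expand(oper_expression)
            for chain in chains]
```

So "1-60" over four chains yields 240 placements for the complete icosahedral assembly, while "1-5" yields just the pentamer.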
OK, now back to this crazy thing,
when I told you there
were 88 transformations
that they provided.
So it turns out that in
this particular case,
60 of them defined an icosahedron,
and then there were an additional 28
that were giving what you might call
almost like a two-dimensional lattice
within each icosahedral asymmetric unit.
So the actual icosahedral asymmetric unit,
if you look here, I'm
just going to draw here,
so look at the pointer.
So it goes around here,
and then it includes
this section over here.
So that defines the 28 unique copies.
And I'm showing here, if you apply
three-fold symmetry to this part,
you get this triangle.
If you apply five-fold
symmetry to this part,
you get this other thing.
This big triangle is
called the trisymmetron,
and this other part, this
five-fold symmetry part
is called the pentasymmetron.
Those were defined in the primary citation
for this particular entry.
But in order, then, to be able
to build the full assembly,
what you do is you start
with chains A, B, C,
which are represented in yellow here.
Just these three little blips here.
You apply transformation 61 to 88
to get this outlined
bit I showed you here,
and then when you apply the 60
icosahedral transformations,
then you're able to
build the full assembly.
This is a little bit more
color than is really needed,
but I love the patchwork look to it
so I always show it anyway.
Alright.
Again, so the remediation process,
again, we identified the
errors by visual inspection.
We used a script running in
a program, UCSF Chimera.
I think Shuchi's gonna show you
something with it a little bit later.
It's a program that we frequently use
to look at structures,
and it was extremely essential
to have that capability
in order to be able to quickly pick out
where there were problems.
We also looked to other public databases
where there was curation going on.
So PQS is a database developed at
the European Bioinformatics Institute,
and the VIPERdb you might
think of as a boutique database
that was built off of the PDB,
but they were so concerned about
being able to look at the viruses
that they figured out
a lot of the issues
and made the corrections
in their database,
and they were very happy that we
incorporated them into the PDB.
Then we also looked at
the files themselves,
'cause as I said, there
were often comments in there
that we needed to carefully read.
Sometimes it went to the
primary citation itself,
'cause sometimes you
would actually see tables
where there were rotation
and translation units given.
And we did a lot of checking to make sure
everything was consistent.
Where there were structure factors,
we checked to see if we could
actually get the validations
of the structure against
the experimental data.
And in the end, we were successful,
so all of these structures
were remediated.
And I have this nice poster
that I keep in my office
to remind me of this great success,
which took a lot of effort.
Then, as Shuchi mentioned before,
it's not enough to correct
the existing errors.
You need to prevent them
from happening in the future.
We have an annotation process.
The Pointsuite software is being used.
I think it works maybe 90% of the time.
I could be corrected.
Brian is nodding his head.
And then the other 10% of the time,
I sometimes get some questions,
but it's typically infrequent.
But I always love to
look at new assemblies
coming in anyway.
But the software now
enables the annotators
to build and inspect the full
assemblies automatically,
and so at least it keeps us
on top of these new, larger assemblies
that are coming in.
And then I think it's back to Helen.
- [Voiceover] OK,
so this is just a very quick overview.
I think you'll all agree
that what Shuchi and Cathy have presented,
it's really quite amazing what's involved
in understanding chemistry
and the mathematics of all of this
to make sure you do it right.
The whole community is very, very grateful
to what they have done.
So, in terms of the roles
in preventing errors:
The depositors we hope will use
the most modern version of software
to determine their structures,
because the most modern versions
incorporate some of these things
that we have developed here.
We ask that they review the
deposition before submission,
pre-validate if at all possible,
and after they deposit,
to respond to the curators
if there are problems with the structure.
The external software developers
are now working with the archive managers
to ensure output is
consistent with requirements.
And that is now a
relatively new development,
I mean in the last several years,
and this is very important,
and have an ongoing dialogue
with the archive managers.
And we had this working
group that got set up
after a meeting that we had in 2011,
when all of the software developers
agreed to use the same standard
for creating the files
that come into the PDB.
And as a result of that
particular meeting,
a working group was set up
and John Westbrook is
very heavily involved
in this working group,
where all these kinds of issues
about the data files
are regularly discussed.
And in a way, this is a
way of trying to prevent
the errors from coming in,
by being basically on top of things.
So this is a very important development.
And then the biocuration staff
need to follow best practices
and use the data processing software,
record any errors as
soon as they're noticed.
Now, with our new systems,
this is much easier to do.
It's hard to work around the system
and not follow procedures.
In early days, it was
possible to sort of bypass.
That does not happen now,
or it happens much less frequently.
And then do what I call a sanity check
on the entire file.
So after you finish,
the curators need to look and see,
does this all make sense.
And that's basically the
kind of take-home lesson
of everything that you've heard,
and I think at this point
we'll take some questions,
then take a little break,
and then there'll be
some practical exercises
that we'll do.
So does anyone have any questions
about what you heard today?
So it isn't just about
checking the spelling,
although that's important too (laughs).
OK, so let's take a break,
only for a few minutes,
'cause we still have
quite a bit to go through,
because we want to ensure
that you really understand
what's involved in the remediation.
