- [Voiceover] I'm one of
the course instructors
and I would like to welcome all of you
and I'd like to also tell you a little bit
about how this all came about.
There is an initiative at the NIH
called Big Data to Knowledge,
and the idea is to focus on the ways
in which data are being used
and could be used to
do biological research.
As a result of this initiative,
they put out several calls for proposals
having to do with creating training
and course materials in this area.
And Cathy Lawson, my colleague,
applied for this grant
just about a year ago.
And we decided as a group
to focus on the beginning of the pipeline.
How do you get from your experiments
into a database that will be
(speaking drowned out by beeping).
And we decided that,
as people who work on the PDB
here and on other databases,
we have a kind of unique
knowledge in that area
and we would try to write a
grant that would address that.
And most people just take for granted
that the data come out of thin air somewhere
and somehow appear, and
after that they do analysis,
but how does the data actually arrive,
and how is it organized,
how do you know what
you're getting is reliable?
And so that's what we decided to focus on
and without bragging too
much on Cathy's behalf,
now this was quite competitive
and Cathy got an amazing score
and got money to run a
class, how unusual. (laughs)
So here we are.
I'd like to also tell you that we have
a very large mix of
students in this class.
We have those of you from chemistry
and the molecular bioscience
graduate students who are
taking this course for credit,
and we're very happy about this.
We also have librarians
from the Rutgers system,
three of whom are in the room.
And others coming in via WebEx.
And then finally we have our colleagues
from other places,
Germany and Switzerland,
who we've collaborated with
or hope to collaborate with in the future,
who are trying to kind of
learn more about our point of view
about how the data
pipeline is established.
And so I know they're there,
I heard their voices and I'm really,
really pleased to do this, in this way.
And so as I said,
an unusual class in that way.
So I want to welcome
everybody who has joined us.
And just tell you that we have
a few rules for how we operate,
since we have people
coming from various places
and we want them to be a part of this and
(speaking drowned out by
background conversations)
Please, Maggie.
- [Voiceover] All remote users
please mute your microphones,
we are hearing background disruptions.
- [Voiceover] Okay, hopefully
we'll all get used to this
and it'll work very well.
Maggie Gabanyi has been involved with
many of our collaborators
from around the world
and is very experienced
in running these webinars
and remote type classes
and so I'm inviting Maggie to
talk to you about how we
want to conduct the class.
Maggie?
- [Voiceover] Hello, everyone.
I just want to take a
moment to welcome all of you,
and also our remote participants.
We want to go over some logistics
for the class before we get (beep).
So first of all, the class website is
edsb-pilot.rcsb.org.
Helen will be walking
you through the site
so that you know where
to get lecture materials
and how to submit your homework.
And also, if you did not receive a login
for this site yesterday,
either check your spam mail
or please contact us during the break
and we'll make sure you get some help.
Welcome to our newly
connected remote participants.
How will you communicate with us?
So there are actually a number
of lecturers for the class.
During class we ask some common etiquette.
Please mute your cell phones
if you're here in the room
and also (mumbles) with me,
you will see a couple microphones
that I have over here.
And if the light is red you are muted;
you unmute it with the push button
and the light turns green.
And we ask that you do this
so that our remote participants
can hear your question as well.
I'm gonna mute this one.
Remote participants, you are participating
either through WebEx chat,
web audio, or BT.
And thank you for calling in.
We ask you to keep your microphones muted
because hearing click, clacking
during the lecture can be
disruptive to the students.
And if you do have a question,
'cause obviously you're muted,
please keep your WebEx chat window up.
I am monitoring it through
the course of the class
and I will relay your questions
during our question and answer break.
How to contact us after class.
Obviously, you can email us all
at edsb@rcsb.rutgers.edu
and as Helen will show you,
on our class website we also have
a discussion forum if you
have general questions,
not only for us but for the class,
where you can discuss different
possibilities for answering things
and any other questions you might have.
Are there any questions in terms
of how we're gonna run this course?
I also want to mention that these classes
are being recorded, especially
for our remote participants.
I want to give a shout-out to Jill.
Jill is joining us from Australia
at 1 o'clock in the morning.
In case she doesn't want to be up
at 1 o'clock in the morning,
she and all of you can also watch
our recorded classes to catch up.
If there are no questions, then Helen?
- [Voiceover] Okay, thank you.
Oh, and everybody, this is Ken.
Ken is our IT guru.
If you get into trouble,
like you can't get on to the wireless
because of all the crazy
different addresses
that exist at Rutgers, Ken will help you
as best he can (laughs) and any kind
of technical problem you might
have, he'll be available.
So Ken, thanks for being there.
We have everything about
the classes on the web,
so that you can access all the materials.
All the homework will be
conducted in that way.
And I'm gonna go through the conduct
of the class so that you will
understand what we're doing.
As Maggie said, if you have
any questions whatsoever,
no question is too small or too big,
just ask it using the combined email,
edsb@rcsb.rutgers.edu, we will all see it.
If you only want to talk to one of us
for some reason, that's okay too.
I would suggest you use email,
use our independent email
addresses if you wish to do that.
This is the course description
and what the learning objectives are.
We would like you to be able to
build a simple database
that's usable in areas
of your own interest.
And everything that we do in the exercises,
as I'll tell you a little bit later,
will build on that.
So that's your homework basically,
by the end of the first
part of this course
you will have made a deposition site
for the molecules that you care about.
Or maybe one that you don't care about,
the one that has all
the molecules you hate,
so we'll talk about that later.
The homework, the students
who are taking this for credit
are probably concerned
about grading and homework.
Basically do the homework,
ask as many questions as you wish.
The whole idea is for you
to learn how to do this.
There will be no exams.
There will be no papers.
It will be simply doing the
homeworks one step at a time
and if you follow each
week you'll be fine.
Hopefully at the end you'll have something
that you can hold on to.
So the most often asked question
is how the grading is done,
and the grading will be done
based on your homework assignments.
And those of you who are taking
this for credit are graduate students,
and I would expect you to
complete all your homework assignments.
Just turning them in
wouldn't be enough
to get a good grade in the class,
so be serious about it, okay.
Okay, the syllabus is online.
And this is today's class,
but you can go online,
look ahead, and see all
the different classes.
There's a total of eight classes.
We've divided it into two mini courses
for the molecular bioscience students.
So some of you are only
taking the first four classes
and then others are
taking the whole thing.
So we have all the lectures online.
We have the PowerPoint for the exercises
so that you can see
what it is that we have said.
The reading is not an
exhaustive reading list,
but we've given you some highlights
of the kinds of articles
that you need to look at
to understand what it is
that the lectures are talking about.
And there's lots and lots
and lots of literature
that you could look at,
and we just focused on the two things.
And because this
was a PDB-focused course,
we have given you articles
that many of us have co-authored here
and at our wwPDB partners.
The homework assignments are shown here.
And when you answer your homework,
you just answer in line here and submit it
and then we'll be able to
review your homework, okay?
And I think that will,
so this is the first one,
and we'll go through this with you
at the end of today hopefully.
So I think that is an introduction to
what we have on the website.
I would suggest you read it
and then get back to us
if you have any questions or confusion,
but we tried to keep everything online.
One warning is that while we have prepared
all the lectures, exercises,
and homeworks up front,
they could change, so don't assume
that homework number eight is gonna be
the same as what's written now.
It will be modified based on what we see
and how much the information
is getting across.
So do you have any questions before
Maggie talks about the communication plan?
Are there any questions?
Any problems with how
we're conducting this?
Okay, you have a question.
Push the button so...
- [Voiceover] If we only
registered for half the course,
can we still come for
the rest of the course?
- [Voiceover] Yes.
- [Voiceover] Or is
there a way to register
for the second half?
- [Voiceover] Oh, my god. (laughs)
I'm sure there's a way,
but I don't know it, yes.
- [Voiceover] All right.
- [Voiceover] Okay, we
would love to have you
stay through the whole course.
And you can certainly audit it,
but if you want to get credit,
it's one credit per four classes.
And we're trying to do this in a way that
you don't have to have
much prior knowledge.
We did do a survey and
we're really pleased
to see the results.
Actually one of our librarians
had deposited two structures
to the PDB (laughs)
so we didn't think we would
see (beep) in this pool,
so that's really great.
And so now Maggie is going
to give you the communication plan
and then we'll start the first lecture.
- [Voiceover] Hello again.
We've talked about the emails,
and you should be familiar
with the website at this point.
Those of you with laptops, we
invite you to check it out.
I want to point out this discussion tab
that I had introduced earlier.
This is what it looks like.
Each session has its own section.
You have general comments and questions,
and then one for each session
if you want to keep it
most scientifically driven,
as we always should.
You'll see Cathy did an example.
And to add to it,
you just make your own question
by clicking create forum topic.
And this will be pre-selected
for the area you are in.
And, I hope to learn a lot.
And save.
You'll probably want a title.
And those of you that agree
with me should chime in.
So, it's pretty straightforward to use.
And I'll be monitoring it,
I think all the lecturers
will be monitoring it on a daily basis
to have very quick turnaround
for any questions that should go there.
And of course, the same goes for emails.
Okay, all right, back to Helen.
- [Voiceover] So, one
thing you should know
is that other people in this room
who you may or may not recognize
are members of the RCSB PDB team,
who have decided to find out what it is
we're telling you about
what they do. (laughs)
And they actually are some
of the real experts about
how you archive data.
So what I'm going to do now is
give you a lecture which kind of
takes you through the whole history
of the Protein Data Bank.
And then it takes you through
the topics that we're
going to cover in this course,
and I'm just going to
slide over the top of that
so you get an idea of
all the subject matter
that we're going to cover.
And some of the slides
that you'll see today
you're gonna see again when
we have the in-depth lectures.
And some of the lecturers in the course
are also members of the RCSB PDB team
who have particular expertise
in some of the areas that
we think are very important.
So this first lecture is
meant to be an introduction
and overview so you get the whole picture,
at least in a way that
we think about this.
And so I'm going to start now.
And this is the outline.
I'm gonna talk to you a little bit about
the history of the Protein Data Bank,
and the synergies of science,
technology, and community
because you really can't build
this kind of a scientific database
without those three components.
I'm gonna give a very brief description
of the data pipeline
and all the components
involved in the data pipeline
from the moment you actually collect
your scientific data to the time
you put it into the Protein Data Bank
or any other data resource,
and then what is done with it.
I'm gonna tell you a little bit about
how these data are used.
Although there will not
be that much discussion
of that in this course until the very end.
And then I'm going to raise the subject,
very briefly of what's
called sustainability,
which is the buzzword now
in all the world actually.
Now that we have databases,
and they were paid for in various ways,
how do we keep them alive
in these financial times?
And so I'm gonna give
you a little editorial
comment there just to give
you something to think about.
So the history: you can't
have a database of any sort
without it, and again, we're focusing
on structural biology in this course,
but you could take the same thoughts
that I've had here and
apply them to other fields,
but you always have to start with
where did the data come from?
Who are the pioneers?
There have to be people who
were visionary enough
to even get into a field
and so the first key person in this
is somebody named J.D. Bernal,
who was a British physicist
and crystallographer.
His student, Dorothy Crowfoot Hodgkin,
who ultimately won a
Nobel Prize for her work.
John Kendrew and Max Perutz,
each of whom did the very earliest
crystal structures of proteins.
So they had the persistence and vision
to know that it would
be important some day
to understand the structures
of these molecules.
The early molecules, the
very first one was myoglobin.
It was done by Kendrew.
And the second structure to be
determined was hemoglobin,
by Perutz and Kendrew.
Perutz worked on this for
something like 30 years;
for many, many years, until he died,
he focused on this molecule,
which he correctly thought
was extremely important
for health and disease.
Lysozyme is another early structure.
And you'll notice now the first three
I'm talking about were
all determined in England.
Lysozyme by Phillips.
And then finally in 1967 we saw
the first American structure
which was ribonuclease.
Done in two labs, one in
Buffalo by Kartha and Bello
and the other by Richards
and Wyckoff at Yale.
So that's the backdrop.
This very early science
done in the '50s and '60s,
starting from the first idea
that this would be important in the '30s.
And there were a lot of
young people at the time
who thought that if we have these data
then they need to be somewhere
where we can get ahold of them
so we can understand these structures.
So, remember back in those days,
the data were on punch cards, in boxes,
just very difficult to imagine
how you would transport all these data
around the world from lab to lab.
And so the idea was that
there would be an archive.
Being that it was the '60s,
we had a petition, we had meetings,
we were slightly combative
about the whole thing
because we thought we really need to,
(conversing off microphone)
So there were people meeting
at different society meetings,
who said this needed to happen,
we needed to have some kind of an archive.
You can see my name over here.
I worked at a place called
the Institute For Cancer
Research in Philadelphia.
And then there was a seminal meeting
at Cold Spring Harbor in 1971
where the giants of
the field came together
to talk about their work in
approaching crystallography.
And one of those people
was Walter Hamilton
who at the time was a
40-year-old superstar
of crystallography and this is Wyckoff,
who was one of the people who determined
the structure of ribonuclease.
Various people talked to Walter
and he was at Brookhaven
at the time and he said,
"Do you want a data bank,
we'll make a data bank."
That was it and that was in June.
He immediately flew over to England,
talked to his colleague
Olga Kennard in Cambridge
and together they came up with
this international data bank
called the Protein Data Bank.
And this article was published in October.
So from the time the idea hatched
to the time that it was published,
it all happened within 1971.
And, of course, as you heard, some of you,
there were only seven structures
in the data bank when we first started.
And this is a picture of Walter,
who was the initial head of the PDB.
And this is Tom Koetzle,
who was his post doc,
who took over when Walter died
only about six months after
this picture was taken.
And then this is me in younger days.
We were on a boat somewhere going
to a meeting on refinement
of crystal structures in 1972.
And this is just a picture to show you
the interesting characters
who were at this meeting.
And for those of you
who might be interested,
this is Jane Richardson,
who later became a pioneer
in the representation of structures.
And this is Dorothy Hodgkin over here,
and Alex Rich, who recently died,
and was one of the chief
people in this field.
So there was a Protein Data Bank, so what?
It was there, but there
were no rules that said
anybody had to put
their data in there.
There were scientists.
There was a very important
person named Marvin Cassman,
who was at the NIH who felt
very strongly about data sharing.
So we wrote letters, and
this particular petition was
started by Fred Richards,
who was at Yale and a real activist.
He felt that the data
had to be in the PDB.
Dick Dickerson was at UCLA,
and he made a big fuss,
and there's an article in Science
about missing crystallography data.
So there were all kinds of
committees that got set up.
And by 1989,
an article was published in Acta Cryst
saying that there was gonna be a PDB
and laying out all the rules.
And this whole thing happened
by having lots of people coming together
to talk about how this should be done.
And so the people who
were creating the data
had to also help make the rules by which
these data were distributed.
The next step after that was
to create community standards,
that is how do you represent the data
in a way that is useful.
In 1990 a committee was set up to do this,
and something called the
Macromolecular Crystallographic
Information format, mmCIF, was created.
This is a picture of the early days.
This is Paula Fitzgerald,
who was the chair of this committee.
John Westbrook, who was
hiding in the back of the room
and was really the author
of this dictionary.
Phil Bourne, who's now
the Associate Director
for Data Science at NIH.
Sid Hall, who was a pioneer in CIF
for the small molecules,
and Shoshana Wodak,
who hosted this meeting and is
a well-known computational biologist.
So people worked together
and we created a dictionary;
there were over 3,000 terms
in it when it first began,
but it took many, many years
to get people to buy in
and there was a seminal meeting at the EBI
in 2011 where the software developers
in the field of crystallography agreed
that mmCIF should be the
master format for the PDB.
So this is a lot of work
to think about how to do this,
and it's not just making it up.
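To make the idea of a dictionary-controlled format a bit more concrete, here is a minimal Python sketch of reading simple category.item pairs in the mmCIF style. The entry values and the parser here are invented for illustration; real mmCIF files also contain loop_ tables and multi-line values that a toy parser like this does not handle.

```python
# Illustrative fragment in mmCIF's _category.item style.
# The values are invented for demonstration, not from a real PDB entry.
sample = """
data_demo
_entry.id            DEMO
_struct.title        'Example structure'
_exptl.method        'X-RAY DIFFRACTION'
"""

def parse_simple_items(text):
    """Parse simple key-value pairs (_category.item  value) from an
    mmCIF-style string. Skips data_ headers; ignores loop_ constructs."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_"):
            key, _, value = line.partition(" ")
            items[key] = value.strip().strip("'")
    return items

items = parse_simple_items(sample)
print(items["_exptl.method"])  # X-RAY DIFFRACTION
```

The point of the category.item naming is that every term is defined once, in a shared dictionary, so that all sites interpret a deposited value the same way.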
You have a question?
Nope, okay.
Then as I pointed out when the PDB began,
Walter Hamilton realized that
science is international,
you can't just do things your way,
so he immediately set up this
collaboration with Cambridge.
But everything was done in
a kinda loosey-goosey way.
There were no real documents.
In 2003, we established something
called the world wide PDB,
recognizing that there should be
data centers all over the world.
There was one at EBI, one in Osaka,
one in Wisconsin, and
the one here at the RCSB,
but we needed to have some rules.
And so we wanted to
make sure that the data
were always freely and globally available.
We agreed that we would collaborate
on data processing and annotation,
which we're gonna talk about here
and that each site could offer
different views of the data,
so the websites provide
services and views of the data.
But the core data that's
underneath all those websites
had to be uniform and represented
in a way that we all agreed on.
So that began in 2003.
In 2008, so now we have the data,
it's required, all the journals
require it for publication,
and we have a standard format.
Then we needed something
we're gonna talk about later:
we have to be able to validate
the data against the experiment
and until 2008 putting
in your primary data
was only voluntary
and in 2008 it became mandatory
for X-ray to put in what's
called the structure factors,
which are not quite the primary data,
but more primary than the
models that had come out:
the processed intensities
that lead to the
determination of the structure.
And so that became required,
and then an announcement was made.
Again, lots of articles, lots
of people being involved.
I think you see a theme here,
which is an important thing to remember.
You can't just say this is
how it's gonna be and then it happens.
It would be nice, but
that's not how it works.
And so then we got into the
whole issue of validation.
So how do we know that the structures
that we're getting are right?
So again, the wwPDB
brought together experts in each field,
X-ray, NMR, EM, small angle scattering,
and now hybrid methods
to talk about how to know that
whether a structure is right.
What should be required to archive
and then how do you check the structures?
And so the X-ray Validation Task Force
had its first meeting in 2008,
published a paper in 2011,
and only last year
most of their recommendations
were put into the PDB pipeline.
And step by step this is happening
in the other fields.
But this is a process because
it requires understanding the data,
and it requires
having the community really buy into this.
So that's what happened then.
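As a rough illustration of what one automated validation check might look like, here is a hedged Python sketch that flags bond lengths deviating from an ideal value. The ideal values, tolerance, and coordinates are invented for illustration; the actual task force recommendations cover far more than this (geometry, fit to the experimental data, and so on).

```python
import math

# Hypothetical mini-check in the spirit of geometric validation:
# flag bonds whose length deviates from an ideal value by more than
# a tolerance. Ideal values and coordinates are invented examples.
IDEAL_BOND = {"CA-CB": 1.53, "C-N": 1.33}  # angstroms (approximate)

def bond_length(a, b):
    # Euclidean distance between two 3D coordinates.
    return math.dist(a, b)

def flag_outliers(bonds, tolerance=0.1):
    """bonds: list of (bond_type, coord_a, coord_b) tuples.
    Returns the bond types whose length deviates beyond tolerance."""
    outliers = []
    for kind, a, b in bonds:
        if abs(bond_length(a, b) - IDEAL_BOND[kind]) > tolerance:
            outliers.append(kind)
    return outliers

bonds = [
    ("CA-CB", (0.0, 0.0, 0.0), (1.53, 0.0, 0.0)),  # matches ideal
    ("C-N",   (0.0, 0.0, 0.0), (1.80, 0.0, 0.0)),  # too long
]
print(flag_outliers(bonds))  # ['C-N']
```

In practice such checks run automatically, but as noted above, a human still has to look at the flagged cases and decide what they mean.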
What I've talked about
so far is all historical.
And kind of the take-home lesson
here is that there must be synergy
between the science, the technology,
and the community to make this all happen.
And in the case of crystallography,
and now we have
crystallography, NMR, and EM,
that process takes time to
solidify and to get buy-in.
And so in each case
you have to think about
what's the science going on,
how much can you actually
put into a database
that has any meaning to it?
So at the very beginning
of the PDB in the '70s
there were just a bunch of text strings
that talked about what the data were
and then in about 1990
there was an attempt to
standardize some of it.
And now we have a lot more standardized.
And the question,
and this will come up again in the class,
is how much can the
community take, you know.
How much can they buy into it?
If you make a data resource
that's too complicated,
then nobody's gonna use it.
And if nobody uses it,
they're not gonna rally
to the journals and say this must happen,
so there has to be some kind of
back and forth with the community
and the science has to be ready
and the technology has to be
ready to make this all happen.
So that's what this mantra is,
and I think you're gonna see
some of it when you do
your homework assignments.
A lot of people say,
well why don't you demand
that they do this, this, and this?
Well, you know, you can demand,
but how many people do what
anybody else says?
So that's that, okay.
So this is just a little
post script on this.
Remember I mentioned that J.D. Bernal
was a sort of father of
modern structural biology
and that he trained Dorothy Hodgkin,
but he was a very complex individual
who had very strong views about community
and he had a very strong influence
on the way in which crystallographers
conducted themselves.
And he wrote a book called
The Social Function of Science
and he covers the organization of research
and he focuses a lot on the social issues
and made a statement that
"science is communism."
You may not like to hear that,
but in a way he really believed
that sharing your data
and being open about what you were doing
is extremely important.
And he definitely, I
think about this a lot,
influenced generations
of crystallographers
who just understood
that they had to share.
And so he was one of our heroes,
a little bit of a character
as you might imagine,
but an important person in our field.
I also would like to mention that
there was a woman who died
recently named Elinor Ostrom,
who got a Nobel Prize in economics.
She did game theory and she believed
that the only way that
anything would ever get done
is by a bottom-up collective action
rather than having top-down enforcement.
And so she believed that communities work
by this, basically this method
that I just showed you, by example.
Now I had no idea who Elinor
Ostrom was, I will admit that.
I gave a talk to librarians at Rutgers,
and as soon as I described what happened,
they said, "Oh, you subscribe
"to Elinor Ostrom's theory,"
which I had to then learn.
So I went to one of my buddies
who I drink coffee with in Princeton,
named Avi Dixit, who's an
economist and knew Elinor,
and I asked him what this was all about
and he explained this idea
that the groups have to,
the groups that are involved
have to devise their rules
about what actions are
allowed and not allowed.
It has to be based on local conditions.
It has to be adapted to
changing circumstances.
And there has to be some kind of organized
monitoring and enforcement, okay.
So this was all new to me.
I didn't know that there was a theory
about what we did, but there is.
And I'm very glad about that.
Okay, so that's history.
Now I'm going to talk
about the data pipeline.
What we're gonna be talking about is
outlined here in very simple steps.
First there's data creation,
that is, you're in the lab
and you do experiments
and you get results.
You then have to put the data somewhere.
And we call that data deposition.
Then the data have to be processed.
They have to be put in some kind of a form
that's understandable.
They have to be checked,
that's what we call validation.
And then that, again, archived somehow.
And then eventually distributed.
So this is true for any field.
This is not just about structural biology.
But this is the pipeline
that we're talking about
and that we're going to
talk about in this class.
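The steps just listed, creation, deposition, processing, validation, archiving, and distribution, could be sketched very schematically as a chain of stages. All the names and checks below are invented for illustration; the real pipeline is far more involved.

```python
# Hypothetical sketch of the pipeline stages just described.
# Each stage takes an entry (a dict) and returns an updated one;
# the names and checks are invented for illustration only.

def deposit(entry):
    entry["status"] = "deposited"
    return entry

def process(entry):
    # Put the data into a standard, dictionary-controlled form.
    entry["format"] = "mmCIF"
    entry["status"] = "processed"
    return entry

def validate(entry):
    # Check the model against the experiment (simplified here
    # to a presence check on the experimental data).
    entry["valid"] = "experimental_data" in entry
    entry["status"] = "validated"
    return entry

def archive(entry):
    entry["status"] = "archived" if entry.get("valid") else "on_hold"
    return entry

def run_pipeline(entry):
    for stage in (deposit, process, validate, archive):
        entry = stage(entry)
    return entry

entry = run_pipeline({"id": "XXXX", "experimental_data": "structure factors"})
print(entry["status"])  # archived
```

The ordering matters: validation has to happen against the processed, standardized form of the data before anything is archived or distributed.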
In the beginning of the Protein Data Bank,
the only method that existed
for structure determination
was X-ray crystallography.
So not surprisingly,
the largest number of structures
in the PDB are still from X-ray crystallography.
And they continue to grow,
which always surprises us.
Every year we think, are
we gonna keep growing?
Are we gonna keep growing and yes, we are.
NMR is another field that
allows you to determine structures.
The structures first
started in the late '80s
to come into the PDB,
and they are continuing
to come into the PDB,
although not at as fast a clip as X-ray.
And part of the reason is
that NMR has other features
that allow you to look at
dynamic properties and interaction
properties of macromolecules
and more and more people
are using NMR in that way,
but they're still structures
and they still need to be checked.
EM, electron microscopy,
we saw the very first
structure of that in 1990,
coming out of Cambridge, England.
And now that represents a growth area
in structural biology
where you can look at
very large macromolecular machines
using electron microscopy.
And then there's this new method,
and some of our people
who are coming in remotely
are involved in this,
and that has to do with hybrid
or integrative models.
Now, for X-ray, NMR, and EM,
for the most part, let's say for X-ray,
you have one method and one model,
unless you decide to do
something really wild,
and we may or may not talk about that,
but it's basically one
method leading to a model.
In hybrid methods and integrative methods,
you have a little data
from many different sources
and different biophysical techniques
that have to come together
and from that you have to
derive restraints that allow you
to then define the structure.
And we at the PDB have
in the last several years
begun to see those kinds
of structures come in.
We don't have a clue what to do with them.
And so as a result of that,
we need to come up with methods
for collecting the data,
annotating the data,
validating the models to see
whether they're good or not,
and so that's an area
that is sort of growing
and the best way to create
the pipeline necessary to process the data
is to start doing the work
when there are relatively few structures
so you can get a handle on it.
And so we are beginning that process.
This is just a slide to
give you an idea of the scale of things
that the PDB right now is handling,
from very small molecules
up to things like this, the AIDS virus.
But there are all sorts of other things
that people are doing, let's
say, on light microscopy,
and I don't think
it's out of the question
that in 10 years we'll be seeing
models coming in that will
need to be archived.
Perhaps not in the PDB,
but they're gonna have
to be archived somewhere,
and so you need to think about that.
So basically, you're gonna
hear a lot more about this;
everything that I'm saying now
we're gonna go into in more detail later,
but just so that you understand
this sort of process.
In the case of X-ray,
the sample is a crystal
and you want to know where it comes from,
what the source organism
is, what the sequence is.
We then have the experiment.
We take the data using
one instrument or another
and we'd have to record
the sample conditions,
the X-ray source, the detector,
what protocol was used for collection.
Then you have the actual
experimental data,
the structure factors.
So these spots here are diffraction spots
that come from putting the
crystal in front of the beam.
The intensities of these spots
have to be measured in some way
and analyzed before you actually
go to the main event,
which is to derive a
model from these spots
using a variety of methods
which you'll hear about later.
And then you get a model
that looks like this.
So this is just a representation
of a three dimensional structure
and then you get the representations.
So that's what happens
in the X-ray experiment.
In the NMR experiment you have the sample,
which is in solution.
And you have to record information,
again about the source, the sequence,
the buffer conditions,
the isotope labeling.
Then the experiment that's actually done.
You have to record stuff
about the sample conditions,
the spectrometer,
and the acquisition parameters.
Then you have the experimental data,
the chemical shifts, the constraints,
the processing protocol,
and the resonance assignment,
and then finally, the model.
So again, all of these
things have to be defined
and what subset of all this information
is actually collected is something
that has to be discussed and defined.
For EM, what you have
is a sample that is on a grid.
Again, source and sequence
and buffer conditions,
and how the sample is supported
and the conditions of prior cooling.
You then have the experiment, which is done
with an EM machine, and actually,
if you see a lot of stuff
going on in this building,
we're having an electron microscope
installed right now,
and it's quite a massive undertaking.
And again, you have to talk about
the various conditions involved there.
Then you have the experimental data
which the first thing you get is a map
and this is what the map looks like.
Sort of a blobby thing.
And then you have to record
the information about this map,
and then finally the model,
and the description of the model.
So these are the data that are collected
for the purposes of archiving.
And every one of these things
is discussed as to what it is
that we need to collect
and what we do not need to collect.
This is the experiments
that are being done
and the things that are defined
that will have to be collected
if we make an archive.
And then we need to have a system
that will collect all this information
and put it in to the archive.
Over the last several years,
the worldwide PDB group decided
to create a deposition
and annotation system
that everyone worldwide would use
when they're managing the data.
The idea was to maximize the data quality,
to standardize the file formats
based on a controlled data dictionary.
We wanted to have more data capture.
We wanted to also be able to support
larger and more complex systems
'cause we were seeing them
and they were turning
out to be quite difficult
sometimes to handle the
data that the PDB got.
You know, how do you manage all that data?
We wanted some kind of improved efficiency
because no matter how
automated a system is
there are human beings that
have to look at everything
and decide whether it's right or wrong.
You know, lots of programs are run,
but in the end there
has to be a sanity check
and we needed to make
things much more efficient.
And then there needed to
be workload balancing.
That is, we had four different places
collecting the data,
and three different places
collecting, for example,
all the X-ray data,
and we wanted to have a way
to easily share the load
so that there wouldn't be
select people coming in
and overloading one site,
and then another site not
having any business at all,
and so we wanted to do this
in some kind of a geographical way.
So this system was created.
It is now in production and
handling X-ray, NMR, and EM.
You'll learn more about that.
Data standardization, extremely important.
Now, scientists hate standardization.
Because they say it gets in
the way of their creativity.
They want to say things
however they wanna say it,
but then when you want to find something
and things are not standardized
then you have a big problem.
And so we need to have clear definitions
of all the data items,
the relationships among
the data items have to be clear,
and then there has to be
a well defined syntax.
To do that and to make a data dictionary,
you have to understand what we're doing.
You have to understand the science.
And so a lot of effort has
been spent in that area.
Right now, I'm not gonna dwell on this
and will just tell you that there was
a historic format, the PDB format,
which people really loved because
it was human readable, had
been around for 40 years,
most structures fit into it,
and most software was happy with it.
The disadvantage of this format was that
there were only implicit, not explicit
relationships among the data items,
the definitions were not precise.
And if you talked to
three different people,
they would give you
three different answers
as to what a particular term meant.
It didn't work with large structures
because of the way the format
had restrictions on the
number of atoms in chains,
and it was difficult to extend.
This mmCIF, which we now call PDBx
because it's not just
about crystallography,
it's about everything
structural biology deals with,
is machine readable, it's
extensible to other methods,
it works on large structures,
and all the relationships among
the data items are explicit.
The disadvantages, unless you
are a very special person,
it's not human readable,
and the other disadvantage
is that people just hated it.
And so it took a long
time to get a buy in.
It is what is underneath
the PDB right now.
This just shows you,
this is a PDB file format
for just the coordinates, x, y, z.
This is atom, these are the atom name,
the residue name, the
chain name, and so on.
So those of you who
know structural biology,
you look at this, you say,
well I know what this is.
But if you're a computer
and nobody tells you what these rules are,
then you don't know what it is.
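Those unstated rules are fixed column positions. As a rough illustration only (this is not the PDB's own software; the sample line is invented, and the column ranges follow the legacy format specification as I understand it), here is what a computer has to be told explicitly:

```python
# Minimal sketch: parse one legacy PDB ATOM record by fixed columns.
# Column ranges follow the historic PDB format; the sample line is invented.

def parse_atom_line(line: str) -> dict:
    """Slice a PDB 'ATOM' record into its fields by column position."""
    return {
        "record":   line[0:6].strip(),    # "ATOM"
        "serial":   int(line[6:11]),      # atom serial number
        "name":     line[12:16].strip(),  # atom name, e.g. "CA"
        "res_name": line[17:20].strip(),  # residue name, e.g. "ALA"
        "chain":    line[21],             # chain identifier
        "res_seq":  int(line[22:26]),     # residue sequence number
        "x":        float(line[30:38]),   # Cartesian coordinates (angstroms)
        "y":        float(line[38:46]),
        "z":        float(line[46:54]),
    }

sample = "ATOM      1  CA  ALA A   1      11.104  13.207   9.455  1.00 20.00           C"
atom = parse_atom_line(sample)
print(atom["res_name"], atom["x"])  # ALA 11.104
```

Nothing in the record itself says which columns mean what; all of that knowledge lives in the parser, which is exactly the weakness being described.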
In mmCIF every single one
of these things is defined.
So atom is defined,
the atom site ID,
the different symbols,
the Cartesian coordinates.
Here is Cartesian x, y, z.
In deference to the fact
that people loved this PDB format
so much, there's a style.
Although mmCIF does not require
any style at all, does not require
any spaces or anything like that,
there was an agreement to keep a style
so it would look very
much like the PDB format,
so people would not get too upset.
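The difference is that in mmCIF/PDBx every value travels with a named data item, so generic software can read it without built-in format rules. A toy sketch of that idea (real mmCIF also allows quoting and multi-line values, which this parser ignores; the coordinate values are invented):

```python
# Sketch of the idea behind mmCIF/PDBx: every column is explicitly named
# in a 'loop_' header, so a generic parser needs no format-specific rules.

snippet = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 CA ALA 11.104 13.207 9.455
ATOM 2 CB ALA 12.560 13.948 10.321
"""

def parse_loop(text: str) -> list[dict]:
    """Read item names from the loop_ header, then zip each data row."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    assert lines[0] == "loop_"
    names = [l for l in lines[1:] if l.startswith("_")]
    rows = lines[1 + len(names):]
    return [dict(zip(names, row.split())) for row in rows]

atoms = parse_loop(snippet)
print(atoms[0]["_atom_site.Cartn_x"])  # 11.104
```

Because the names are carried in the file, adding a new data item later just means adding a line to the header, which is why the format extends so easily.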
The software developers
on the data-in side,
structure determination,
have all adopted this, and slowly
the groups of people
who are using PDB data
are beginning to adopt this
and it's especially important
with respect to these
(mumbles) structures.
So this is the
deposition and annotation tool.
We have the deposition pipeline
with the common deposition interface.
We have the data upload and harvesting,
you're gonna hear about all this.
There's review
and then validation.
We always strongly urge people to validate
their data before they
actually even enter the data.
And then they submit the data
and then it's various kinds of tracking.
When the data arrives at the PDB,
there's editing of the
small molecule ligands,
of the sequence, there's
validation of various sorts,
which we'll talk about later.
And then there is calculated annotations
and then release processing, which is all
now working in production.
In annotation and curation,
we add meta data to put the data
in context with the science.
So metadata: you know,
if you just throw data
at something and you
don't define what you have,
then somebody looks at
it and has to guess,
so you need to have the metadata.
It has to be consistent
and we have to have standard
vocabulary and terms.
So ligand annotation: what happens
is you have a small molecule
that is co-crystallized with the protein,
and you need to describe it, and
that's really important,
especially with respect to drug discovery.
So the ligands have to be
very carefully checked.
And one of the problems
with X-ray crystallography
is you may not actually see
all of the ligand in the density,
and if you don't, there's some question
as to whether it's right or wrong.
And so there's a lot of
effort in the annotation
to double check and triple
check the ligands now,
and to build in tools to alert the user
or the depositor when there's a problem.
The other very important
aspect is the sequence.
So the protein and the
nucleic acid each have a sequence.
The depositor puts it in,
you may not actually see
all these side chains
in the protein, but they have to give you
the correct biological sequence.
And so all of this is checked by software
and then again, by the annotators
to be sure that there isn't
something unusual going on.
All of this is part of
the data processing.
Then there's validation,
which is to support or corroborate
on a sound or authoritative basis.
I think it's very important
that things are really checked.
And this has become more and
more important in recent years,
and it's also,
if you read the newspaper
there's a lot of issues
having to do with
reproducibility of science.
And some scientists have
gotten a pretty bad rap
because their results are not reproducible
and so validation is a way of
making sure that it happens.
So you compare the model
to the experimental data
and to prior knowledge.
So in the case
of structural biology,
we check the model alone
to see if it makes sense
by the various standards that we have.
We check the data,
and then we check the fit
of the model to the data.
The validation report,
you'll hear about more later.
To me, the most useful thing
is what's called the residue plot,
which allows you to see whether
the geometry of the
polymer sequence is okay.
And green is really good, red is not good,
and orange is not so good either.
But this is not a bad structure.
And so the little dot over here means
that the sequence is not fitting,
some part of the sequence is not fitting
into the electron density which means
there may be a problem,
but overall this is not too bad
a validation, again.
This has all come in the
last several years:
coming up with validation standards
and then reporting
things to the depositors.
Data quality.
Garbage in, garbage out.
So you want to make sure
that the quality of what's in the database
is as good as it can be.
And if it is, then you're going to improve
the query functionality
and you're gonna be
able to do more queries.
There's a lot of emphasis now
on making sure that there
is high quality data.
And just to brag on Cathy's behalf,
this was the state
of the viruses in the PDB
before we did remediation.
There was nothing here
that was really wrong.
If you talk to each individual scientist
who deposited this, they'd say it's fine,
they just use a different kind of matrix,
or different coordinate thing or whatever,
but that doesn't help anybody
who's getting the data.
So Cathy went through
all of it,
had all the matrices sorted out,
and you'll hear about this,
and came up with beautiful
representations of the viruses
and then new standards
for how these data would
be processed in the future.
Okay, and then there's data distribution.
So you'd probably say,
well you got the data
and you just put it
somewhere and that's it.
Well, that's not the way it works.
So we have, the data goes
into a master archive.
It is then combined in various ways
with external databases,
and various calculations are done.
So this happens
in the RCSB, and the PDBe and the PDBj
do similar things.
And then eventually this goes
into what's called an FTP site,
which has all the data
that people can download
from anywhere in the world.
And there's a wwPDB archive,
and then each of the data centers
involved also mirror this,
and so you can either get it
from the wwPDB site or you can get it
from each of the partners.
This top part shows the various
different things that the
RCSB does with the data.
And as I said, the other partners
do different things with the data.
But this is the same no
matter where you go, okay?
The PDB Archive has the atomic coordinates
and the molecular descriptions,
it has metadata and it
has experimental data.
FTP, the file transfer protocol,
is a protocol for downloading individual files
to the client computer.
And so you'll hear about this
later on in the course.
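As a small sketch of what such a download looks like in practice (the host and path, files.rcsb.org/download, reflect the RCSB download service as I understand it; treat them as an assumption rather than a guaranteed API):

```python
# Sketch: fetching one entry from the PDB archive over HTTPS.
# The files.rcsb.org/download path is assumed, not taken from this lecture.

from urllib.request import urlopen

def entry_url(pdb_id: str, fmt: str = "cif") -> str:
    """Build the download URL for a single PDB entry."""
    return f"https://files.rcsb.org/download/{pdb_id.lower()}.{fmt}"

def fetch_entry(pdb_id: str) -> str:
    """Download one entry and return the file contents as text."""
    with urlopen(entry_url(pdb_id)) as resp:
        return resp.read().decode("utf-8")

print(entry_url("4HHB"))  # https://files.rcsb.org/download/4hhb.cif
```

Bulk users typically mirror the whole archive rather than fetching entry by entry, which is what the FTP site being described here is for.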
The RCSB has its website, its own website.
And it's a portal that has
different functionalities,
the depositions, searching, analysis,
visualization, tabulation, and
the ability to download data.
So PDB usage.
We have something like
500 million,
is that right, 500 million
downloads of data per year.
So that's not web hits,
that's downloads of data.
So that's a lot of usage.
Every statistic that's been done
has shown that the PDB has, you know,
it's among the most well used
biological databases in the world.
It's a very important data resource.
This shows where people
are who deposit the data.
And then the downloads
and where people access.
This shows the distribution
of the download
by the different sites.
No matter how you look at it,
it's a very heavily used resource.
And it doesn't seem to
be getting less use,
it seems to be getting
more use, which is nice.
So what has the PDB enabled?
Safe storage of the data, for sure.
And that was the original thing
that people really worried about,
is whether or not the data would be lost.
No kidding, we got this email
from a very well known
structural biologist,
and she was hysterical because
she had moved from one lab to another,
and she said, "Do you
have my original data?
"Because my computer is in a dumpster
"in back of the building."
And so she had lost everything
that she had collected,
but fortunately we had all of the data.
And so she could reconstruct what she did.
So this is a very
important responsibility.
It's also used for
modeling the structures.
One method, which we're
not gonna talk about here,
but there's a method for
molecular replacement
where you can use the
model from one structure
to model another structure
that's in a different space group
or has different characteristics,
so it's heavily used for that.
It's a parts list for modeling,
especially in EM people grab,
you have this messy map and you have
to fit all these models into the map,
and so this is a way of
getting different structures
that will fit into the map.
So it's used very much like that.
It's also used with
structure-based drug design.
Most famously, well, basically,
every week or every month,
the various pharma companies
download all the data
from the PDB, combine it
with their in-house databases
in order to use the information
for their drug design.
It's also used for
structure classification,
which enables people
to understand the structures
in terms of their different properties,
and then finally, structure prediction.
So what's important?
The science that's being archived must be
important enough for people
to want to access the results.
So if the science is not important,
then nobody's gonna want
to access the results.
The technology for the data archiving
must be continuously evaluated
and changed as IT changes.
So if you have all your
data on a floppy disk
from 1982 or something like that, too bad.
So you can't do that.
So the PDB must always keep
moving the data to whatever
is the newest and best.
You can't just save it once. I'm sure some of you
have the situation where you have files
from a long time ago.
Even with just Word files,
you can't open them.
So it's really important
that the things change regularly.
The creation of an
international organization
recognizes the fact
that science is global.
This is really important.
You need to understand the communities
of the data users and data producers
in order to make sure you're giving them
what they need and want for their science.
This is just a beautiful
picture done by David Goodsell,
who works with us at the RCSB PDB,
showing the structures
of the various kinds of proteins
and macromolecular machines
that can be found in the PDB.
At the RCSB we chose to categorize
according to different topics,
such as health and disease, for example,
or biotechnology and then you can
learn more about these
particular kinds of proteins.
And this is another poster
that David Goodsell did where he went
through the PDB to look
at all the different
examples of drugs that are in the PDB,
bound to proteins, and have given us
a better understanding
about how these drugs work.
Pathways.
We have now enough data in the PDB
that we can begin to see a
structural view of pathways,
although we have to use different models.
We don't have everything from one organism
or anything like that,
but we're getting to a place
where we can at least draw this picture.
And then this is the final point
that I wanted to raise with you,
although you're, some of you are students,
some of you are not students
and are running resources in other places.
And the issue is how do
you sustain a resource?
So how is the resource
funded for the long term?
So I was involved in a white paper.
There are white papers being written
regularly about what to do.
You set up a resource.
It's funded by, say, the government.
Then the government decides
they don't want to fund you anymore.
What do you do?
And so we brought together people
from physics and geology and biology,
and we brought library science people,
the leader in this was a
man named George Alter,
who's at the University of Michigan.
And I was part of this group
to try to figure out
what's the right model
for sustaining these databases?
So the current models are membership,
submission fees, institutional support,
federal funding, and fees for searching.
There's all different ways
depending on which database
you are dealing with.
And the PDB is funded right
now 100% by federal funds
although a different
kind of federal funds.
That's in America, in Europe there's
actually some money from the
Wellcome Trust, the PDBe.
But it's still something where
you're constantly revisiting
whether or not you're meeting the mission
of that organization; it's not easy.
So possible new models
are commercial services, user fees,
the use of overhead from grants,
and something called infrastructure,
the idea that there would be
a set aside by the government
for well-used resources.
So we came up with
a matrix where we evaluated
all these models.
We, people from different fields,
were all so
nervous that we wouldn't
be able to come to consensus, but we did.
The things we needed to worry about:
we all believed in open access.
We wanted for people to be
able to put in their data,
and that there would be some kind
of equity for the
universities and institutions.
And what we came up with was this idea
that for each data
resource, the government, for
example, who's funding science,
would measure how much output there is
by some quantitative measure.
So with the PDB that
would be relatively simple
because for every deposit you could say
what grant paid for this
and then that could be
somehow reported in some way
and then there would be a set aside
for the archiving.
So by all the calculations we did,
it would take no more than
one or two percent
of the funds that are given
to scientific research
to set that aside for archiving.
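The arithmetic of that set-aside model is simple enough to sketch; the rate and the grant amounts below are invented for illustration, not real figures:

```python
# Toy illustration of the proposed set-aside model: a small fixed fraction
# of each grant is earmarked for archiving. All numbers are invented.

SET_ASIDE_RATE = 0.02  # "no more than one or two percent"

# Hypothetical grant awards, in dollars.
grants = {"grant_A": 500_000, "grant_B": 1_200_000}

# Each grant contributes in proportion to its size.
archiving_budget = {g: amount * SET_ASIDE_RATE for g, amount in grants.items()}
total = sum(archiving_budget.values())

print(round(archiving_budget["grant_A"]))  # 10000
print(round(total))                        # 34000
```

The point of tying the set-aside to measured output per grant is that heavily producing fields automatically fund the archives they fill.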
And that's what we
believed should be done.
If the science is important enough to do,
then it should be important enough
to archive in some consistent way
where you measure the output
and then that's the money
that's put aside for archiving.
So that's what this group came up with.
I saw something very recently
on some blog or other
where somebody came up with the same idea,
didn't know anything
about what we had done.
And it does make sense.
So I just thought I should mention this.
It's of no direct concern to the
graduate students here,
except that you should know that
it costs something to archive,
and funding things for archiving
is not always as
straightforward as it should be.
And then finally, the PDB management.
It takes a village.
This is a meeting that we had
at Cold Spring Harbor to celebrate
the 40th anniversary of the PDB.
And we brought together in the audience
people who had been
associated with the PDB
from the beginning,
whoever was still around,
and then we had as the speakers,
people who contributed the
scientific data to the PDB
and they talked about their science,
and it was a very exciting
and wonderful meeting.
So this is the worldwide PDB.
And then the four partners,
and then all the different kind of funding
that we have to assemble
in order to keep this thing running.
It's challenging.
So I leave you with that.
I went from the beginning
to the end of this
as a sort of fast overview of everything
that we have to think about.
And I welcome any questions.
Questions?
Okay.
- [Voiceover] Any questions
from our remote participants?
You were abundantly clear,
Helen, what can I say?
- [Voiceover] Yeah.
(laughter)
So, at this point we're gonna take,
we're gonna take a very short break.
Actually, we're exactly on time.
And then at 10:30 we'll begin the exercise
and actually teach you how
to do some stuff, okay?
- [Voiceover] Okay, the same goes for you
remote participants, come
back in about 15, 12 minutes.
We're gonna pause the recording so that
we don't have to listen to
12 minutes of some mumbling.
