- [Voiceover] I sure hope they do.
This is definitely the
most important lecture
of the course because in this lecture,
you're going to learn what
we mean by a data model,
and how you actually define your terms
so that you can build the database,
and without coherence in your
data model, it can't be done.
So the word data model is probably one
of those mystery words to most
chemists, and biochemists,
and I'm hoping that John
will make this much clearer.
So John Westbrook has been
involved in the Protein Data Bank
for a very long time,
and he's a pioneer in the
area of creating data models
and really understanding the importance
of how you define your data items,
He started doing this in the early '90s,
and his early work is the basis
of the whole Protein Data Bank database.
I think John is a mild mannered,
soft spoken person, with a powerful mind,
so I suggest you listen
very very carefully
to what John has to tell us.
- [Voiceover] Thank you, Helen.
- [Voiceover] One other
thing, John earned a PhD
from Rutgers University about 20 years ago
in Computational Chemistry,
so his background is not
dissimilar from yours
and then he became more
and more involved in the
projects having to do
with the infrastructure.
- [Voiceover] Thank you
for the kind introduction
and we're gonna start the discussion
of infrastructure with data
models as Helen described.
To begin with, what we're aiming
to do today is break down,
at a very high level,
what a data model is.
You can think of a data model
as being a model that
describes the data items
within the domain or the area of interest,
and how these items relate to one another.
These are very general terms,
but they provide a lot of
power if this is done well,
in terms of the kind of
structure it provides,
the capabilities that come
with well-organized data,
and how the data
can ultimately be delivered.
I thought it would be useful
to step back a little bit
and see how data models
have evolved from the past.
Think of the time
when there was no Google,
or perhaps when there were not even
published dictionaries of text.
An early example of building a data model
is, in fact, constructing
a data dictionary,
if you will, for spoken language.
So to follow the chronology of
how this was done for
the English language,
you have to go back to the 16th century
and one of the first
milestones in this development
that's been identified
is the work of Mulcaster
in first identifying a list of words
and so you can think of that
as a controlled vocabulary
or just identifying the scope of terms
that need to be included in the language.
Then if you go into the next century,
this process evolved somewhat.
Perhaps a smaller number of terms,
but now including additional metadata,
and by metadata we mean
enriching the data model
or the description of data to
include textual definitions
and some level of organization.
Cawdrey now provides a
description or a dictionary
which includes definitions
for 3,000 words,
which is now organized
in an alphabetical form.
So now we have some organization,
as well as some more metadata.
So if we advance now to the 18th century,
we go to a much larger dictionary,
now including some technical
terminology and slang.
You can think of slang as synonyms
for your primary vocabulary,
again including definitions,
but also including additional metadata
in the form of pronunciation.
Giving you information about how
to use the information more effectively.
And then finally, in sort
of a historical context,
the Samuel Johnson dictionary
which is sort of a model for the
published dictionaries that would follow,
includes yet more metadata:
the origin or proponents of the terminology;
the type of the word,
which in modern terms could be
thought of as a data type;
part of speech; pronunciation
again; definitions;
and, most importantly,
examples of each of the terms,
which, in this dictionary,
were in the form of quotations
that provided usage for
all of the terminology.
So if we move to the scientific arena
and go back and look at biological data,
the earliest attempt to provide
some structure to biology
was attributed to Aristotle,
and Aristotle provided a data model
for his observations
of biological taxonomy
which included properties
of biological entities
describing where they live,
what they ate, and so forth,
and this style of taxonomy
in biology persisted
largely until the work of
Linnaeus in the 18th century,
who provided a fairly
detailed taxonomy of biology
which included his famous
binomial nomenclature,
that is, genus and species,
as well as a hierarchical classification
built from that genus and
species characterization.
So the work of Linnaeus was represented
in 10 different volumes
which he worked on throughout his life,
so this is perhaps the first instance
of the necessity of some sort of
versioning associated with a data model,
although perhaps not thought of
in those terms at the time.
So data models and the
representation of data
prior to the latter 20th century
were largely in the form of
published volumes of data:
manuscripts and collections of data
that had to be manually curated.
But it wasn't until the latter
part of the 20th century
and the advent of computers,
that automation was brought into
the management of data and data models.
The first significant effort in organizing
data for digital computers is generally
attributed to Codd,
who was at IBM at the time
and was focused on
providing some structure
around how computers could organize data
for the purposes of search.
Codd was a pioneer in
developing relational databases.
Relational databases are tools
that allow you to search efficiently
through data that's organized in tabular form.
From our previous discussion
of defining dictionaries
of language or biology,
the semantic information,
which is the definitions, the examples,
and the relationships among data,
lives in the top part of Codd's model,
which is generally called
a conceptual data model.
Below that, moving down to a lower level
are models which describe an increasing
level of fineness of detail,
additional information which is required
in order to efficiently search
or store and search and
access the information.
Below the conceptual model,
one finds a logical model.
Let me switch slides, since the
text is described here.
So we're sort of talking about a spectrum
of conceptual information
which is where the semantics is,
moving down to a more physical level
which is how the information
could be stored on magnetic hard disks
and accessed by an operating
system or a software service
that's providing access
and search facilities.
So we've already discussed
the conceptual model,
the logical model provides you with
an organization which can support
a particular type of
database model, if you will:
relational, hierarchical, or network,
and we'll talk more later in this course
about the specific implementation
of database services.
And then finally, the
information has to be
organized and stored in a persistent form,
and the physical storage,
how the bytes are organized
on the computer, is the most
concrete or physical level of the model.
So this is a very database centric view
of a data model which was
particularly useful in thinking about
the technology at the
time this was developed,
which was largely to support
relational database services.
This would be company databases
where people were accessing
records and tabulation of data.
A generalization of that data model
which is also very
important in the evolution
of how people think about data models,
came a little bit later with the
introduction of a model
which embraces both
a conceptual model, which
we've been discussing,
as well as an external view.
The external view, if you will,
can be thought of as an additional
layer of presentation,
which provides either a selection
or a slight re-organization of information
which is catered to a particular
application or user need.
The internal view here,
as well as the physical layer here,
are very similar to the model
that was previously described by Codd.
But the conceptual level
pretty much remains here.
This is where you have
the most descriptive view
of the concepts and relationships
in the data set itself.
The external view is some packaging
of that information for
an external application.
And the internal level of this model
is the physical representation
that's required to support the computing
hardware and services on the backend.
All of these representations
include this conceptual level,
which is where all the semantics live.
So now let's shift gears
and look at how all of this
has evolved in the context
of structural biology,
which is the subject of
this particular course.
This is a timeline that shows
the evolution of the data model
which has been developed
in structural biology.
It goes from the early '90s, developing
an essential dictionary of terminology
for diffraction experiments,
all the way to the data model
that's currently used by the PDB
and its latest implementation
of its deposition and annotation system.
So if we start at the beginning here,
you've already heard from
Helen in earlier lectures
that a key characteristic of the community
built around structural science,
particularly crystallography,
is a focus on data sharing and making data
available to a community of users.
And one manifestation of that
was an interest in
streamlining the process
by which experimental data could
move into the publication process
with the least amount of
friction during the review process.
Crystallographic publications have a
very regular sort of organization
and include a relatively
standard set of statistics
that need to be calculated for
each structural experiment.
So it's quite reasonable to
try to package the information
coming out of an experiment
in such a way that
both the textual description
in the manuscripts fashion
as well as the data set itself can move
through the publication process
and that is in fact what the
International Union of Crystallography
set out to do by building
an electronic pipeline
based on a data model
called the Crystallographic
Information File, or CIF,
which provides,
basically, an essential
list of terminology,
with very precise definitions
of what each term means.
Say that you have a table
with specific data that's
very regularly presented
in each publication.
Each of those data items
was given a definition
in a dictionary just like
a regular text dictionary
as well as some additional
information that was important
in how that data was processed
by the publication platform.
And that proved to be an extremely
successful model for this journal.
It reduced the amount of work
in publishing the vast majority
of the manuscripts that
they had to publish,
to the extent that in a
relatively small period of time
the small molecule
crystallographic community
was able to publish almost
90% of the manuscripts
reporting small molecule
crystal structures
in a fully automated pipeline
using a data dictionary.
We're talking here about
macromolecular crystallography
and structural biology,
and the macromolecular
effort was an offshoot
of the success of this
effort for small molecules,
applied to publications in journals
of the International Union
addressing structural biology
or macromolecular
crystallography experiments.
Because this work in small
molecule crystallography
had gone so well, and had taken place
over a relatively short period of time,
the expectation was that this work
in macromolecular crystallography
or broader structural biology would
simply be a task that would be assigned to
a small group of people and could be done
in a relatively short period of time,
and it would just involve adding
a few additional terms to an existing
dictionary describing core diffraction.
That turned out to be
an oversimplification
of what was going to be required.
The process of trying to extend
what was called the core CIF dictionary
to support macromolecular
crystallography,
and a representation of
macromolecular structure
robust enough to
be used by the PDB archive,
actually took quite a long time.
It involved a variety of interactions
with the community, both from
the structural biologists' perspective
as well as people from bioinformatics
and people from core computer science and IT,
over more than a decade,
and it involved an extension of the
type of metadata that was embodied in
the data dictionary, as
well as a significant
extension in the actual
domain-specific information.
Part of this process involved the
recognition that this core dictionary
embodied a significant amount
of implicit information.
Assumptions had been made
about how the information was used,
because the application was so
narrowly focused.
Whether the concepts
here could be generalized to
a more complicated case,
extended from a diffraction experiment
with a single small molecule
to a much more
complicated molecular system,
was something that was found to require
a considerable amount of thought.
And part of this process of involving
a large number of people in
these different disciplines
was strengthening the representation
of the data dictionaries
that were produced,
in such a way that the
implicit information
was made quite explicit.
By making the information
more explicit,
that information became
accessible to computing software
and databases in a way that
made it much more useful.
So let me give you a run through of
how this works in structural biology.
I will apologize that there is
a lot of jargon in this whole area
and I'll try to simplify that
as much as possible as we go.
Basically when we talk about a data model,
and the dictionary, the
dictionary is the vehicle
by which the data model is represented.
If we think about the representation
of that dictionary,
it has a three-layer system.
The most primitive layer is what we call
the dictionary definition language,
and that is effectively,
the list of attributes
which define every single item of data
in the domain of interest.
For instance, in
the English language dictionary
that we discussed at the very beginning,
that's a definition, an
example, a part of speech.
So they're the essential
defining attributes
that apply to every single definition.
Using those attributes,
one then constructs a dictionary
for the domain of interest
which in this case would
be structural biology.
What we now call the PDB
exchange dictionary
is just an elaboration of the dictionary
that was developed on the
prior slide I showed you,
with the timeline, and it
reflects the current content
that's managed by the Protein Data Bank.
And then finally, that dictionary
provides the basis for
the data files, which are
the files that you find
on the FTP repository or
download from the website.
The data names in those
data files are defined in the dictionary,
and the attributes that define
each of those definitions are defined
in the dictionary definition language.
So it's a multi-layer representation.
And to go back to our computer model,
what we're defining here, relative
to the very traditional model
that was first put forward by Codd,
covers both the conceptual
and the logical data models that are
embodied in Codd's representation,
and fully the conceptual model that's
provided in the ANSI/SPARC architecture.
So what
are the attributes that
are part of each definition?
So these are the attributes of
the dictionary definition language.
Some of this should be very familiar.
Everything that's, every data item
requires some semantic description,
and these would be
definitions and examples.
One of the things that happens
in moving from a
largely paper-bound description of data
to a computer-driven description of data
is that computers deal
with things by name,
and the definitions and examples
can get completely lost when one moves
from a description of data
where you're actually looking something up
and reading the definition,
or actually Googling it
and looking at the definition.
We'll come back to that point later,
but the computer organization depends
very much on having a unique
name for each data item,
and understanding its data type.
Computers need to know whether
things are strings or
numbers, that sort of thing.
For more complicated cases,
there's being able to recognize
a particular pattern
that a value satisfies,
like distinguishing a telephone number
from some person's name.
There may be boundary
values that are important.
Things may have to be
positive or negative values
or between a certain range
And, very importantly, it is often true that
the scope of a particular item
is limited to a particular set of values,
a controlled vocabulary,
sometimes referred to as an enumeration.
So these are very important features
that are associated with every definition.
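Taken together, the defining attributes just listed (name, data type, boundary values, enumeration) can be sketched as a small structure. This is a hypothetical illustration in Python; the attribute names are mine, not the actual DDL attribute names:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the defining attributes described above;
# the field names here are illustrative, not the real DDL names.
@dataclass
class ItemDefinition:
    name: str                            # unique data item name
    data_type: type                      # e.g. str or float
    definition: str                      # textual semantic description
    minimum: Optional[float] = None      # boundary values
    maximum: Optional[float] = None
    enumeration: Optional[list] = None   # controlled vocabulary

    def validate(self, value):
        """Check a candidate value against type, range, and vocabulary."""
        if not isinstance(value, self.data_type):
            return False
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        if self.enumeration is not None and value not in self.enumeration:
            return False
        return True

# Example: an occupancy-like item bounded between 0 and 1.
occupancy = ItemDefinition(
    name="_atom_site.occupancy", data_type=float,
    definition="The fraction of the atom type present at this site.",
    minimum=0.0, maximum=1.0)
```

A real dictionary definition carries many more attributes, but the idea is the same: each attribute narrows the set of acceptable values for the item.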
And then, the information needs
to be organized in some way
and for convenience the
approach that's taken
in the CIF model is to organize
things in tables and columns.
The fact that that maps nicely onto
relational database technology
is in fact no accident,
because having this kind
of simple organization
makes moving data back and forth
between databases, between the most
robust sorts of databases,
that much easier.
Also we provide a mechanism
for keeping things
that are related together
in the form of chapters,
or what are called category
groups in this context.
And then most importantly,
keeping track of relationships
among common data attributes
between different tables or categories
is a very important feature
which we'll see some
examples of in a moment.
Other features of associations
include aliases or synonyms,
so keeping track of where things come from
that may be defined in another context,
and then interdependencies
between related items.
You can think of an interdependency
as representing the
fact that you may have
a set of data items
which are only meaningful
as a collection, like the
x, y, and z coordinates
of a point in three dimensional space.
So what does this look like concretely?
You've all at this point, seen
some examples of data files
that you download from one of
the websites or FTP repositories.
So the nomenclature for
the organization is really quite simple.
There are only two different
syntactical constructs that are used.
One is a key-value pair,
which is the data
name followed by the value.
The other is a table, which in this context
is referred to as a loop, or
an iterable data structure,
where at the very
beginning of a set of data
you have the data names that
are going to be repeated
record by record in the data structure.
So if you had a comma-separated
or tab-separated data file,
for instance, or any kind of tabular data,
to represent
that in this syntax
merely involves prepending the names
of the data that would be regularly
repeated in the table structure.
There's also a separation
of table and attribute,
which is basically this dot notation,
and a leading underscore at the beginning
of the data item name, which is a way of
identifying the fact that it is in fact
a data name, a keyword,
relative to a value.
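To make the two constructs concrete, here is a minimal sketch in Python that separates key-value pairs from loop_ tables in a CIF-style fragment. It is illustrative only; real CIF parsing must also handle quoting and multi-line text values, which this sketch ignores:

```python
# Minimal sketch of the two CIF syntactical constructs:
# key-value pairs and loop_ tables. Real CIF parsing (quoting,
# multi-line text fields) is far more involved.
def parse_fragment(text):
    key_values, loops = {}, []
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    i = 0
    while i < len(lines):
        line = lines[i]
        if line == "loop_":
            i += 1
            # collect the data names declared at the top of the loop
            names = []
            while i < len(lines) and lines[i].startswith("_"):
                names.append(lines[i]); i += 1
            # each following line is one record of the table
            rows = []
            while i < len(lines) and not lines[i].startswith(("_", "loop_")):
                rows.append(dict(zip(names, lines[i].split()))); i += 1
            loops.append(rows)
        elif line.startswith("_"):
            # "_table.attribute value": the dot separates table and attribute
            name, _, value = line.partition(" ")
            key_values[name] = value.strip()
            i += 1
        else:
            i += 1
    return key_values, loops

fragment = """
_cell.length_a   61.450
loop_
_atom_site.id
_atom_site.Cartn_x
1 12.337
2 13.211
"""
kv, loops = parse_fragment(fragment)
```

The data names and values in the fragment are invented for illustration; only the syntax shape (leading underscore, dot notation, loop_) follows the description above.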
We'll talk about the
formatting rules in a minute,
but it's interesting to
note that this syntax was
developed by a librarian in 1991.
This syntax is a descendant
of a syntax called STAR,
which is an acronym for
Self-defining Text Archive
and Retrieval format.
This was originally developed as a way
of organizing data in
a human-readable form
within a research library
network in the U.K.,
and the syntax was taken up by the IUCr
in developing what
became the syntax for CIF,
and it's been inherited
by the mmCIF effort.
So here's an example of
a dictionary definition.
And it's a little bit busy
to look at on the slide,
but if you just break it down item by item
it's really quite straight
forward to look at.
The data name being defined here
is the symmetry type
of an entity assembly.
It has a description at the beginning,
a name, a category or
table name, if you will,
some information about whether or not
it's considered mandatory
or required logically
to be part of any
definition of this category.
A data type which is, we
can discuss in some detail
in the session this afternoon,
but these data type codes
are typically things
like num, line, or text.
They're very familiar data types which are
ultimately tied to a more
explicit regular expression
and underlying type.
But the meaning here is this
data item basically exists
and never extends beyond one line
and then some enumeration information here
which provides a list of allowed values.
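The idea that type codes such as num and line are ultimately tied to regular expressions can be sketched as follows. The patterns here are simplified illustrations, not the actual dictionary expressions:

```python
import re

# Illustrative, simplified patterns; the real dictionary ties each
# type code to a more elaborate regular expression.
TYPE_PATTERNS = {
    "num":  r"-?\d+(\.\d+)?([eE][+-]?\d+)?",  # a number, optionally scientific
    "line": r"[^\n]*",                        # never extends beyond one line
    "code": r"\S+",                           # a single token, no whitespace
}

def matches_type(value, type_code):
    """True when the whole value matches the pattern for its type code."""
    return re.fullmatch(TYPE_PATTERNS[type_code], value) is not None
```

With a mapping like this, a checking program can verify every value in a data file against its declared type before the file ever reaches a database.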
A slightly more complicated case
includes additional metadata
describing related item
names, alias information,
and dependent items. This
is the definition of
the x coordinate as reported
in a crystallographic experiment
and its dependency on y and z,
with information about a related
uncertainty, a data type code,
and units of expression, in angstroms.
So again, these elements,
or these definitions,
are simply a tabulation of the attributes
that are important for each case.
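The dependency among related items, like x, y, and z, can be sketched as a simple completeness check: if any member of a dependent group is present, all members must be. This is a hypothetical illustration, not PDB software:

```python
# Hypothetical sketch: items in a dependent group are only
# meaningful as a collection, like the x, y, z coordinates
# of a point in three-dimensional space.
DEPENDENT_GROUPS = [
    {"_atom_site.Cartn_x", "_atom_site.Cartn_y", "_atom_site.Cartn_z"},
]

def check_dependencies(record):
    """Return the missing members of any partially populated group."""
    present = set(record)
    violations = []
    for group in DEPENDENT_GROUPS:
        found = group & present
        if found and found != group:
            violations.append(group - present)  # the missing members
    return violations
```

A record supplying only an x coordinate would be flagged, since y and z are required for the value to mean anything.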
And then finally, an important
element of the definition
is the ability to describe
relationships among
related data items, and these are
called parent-child
relationships in this context,
and they're particularly
important in structural biology
because you have a very
complicated molecular system
which has significant
amount of nomenclature
associated with all
aspects of the structure.
So if you were to take
a look at an example,
this is just a selected diagram
describing relationships
that emanate from
a chemical component definition,
which is the individually distinguishable
molecule from which all
of the other structures
within the protein, either
polymer or nonpolymer, are built,
showing how the identifier
for that component
is shared among all of these
other tables and categories.
Later in the discussion
when we walk through
the dictionary in more detail,
we'll see these diagrams
and how they can be used.
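The parent-child idea can be sketched as a referential-integrity check: every identifier used in a child category must exist among the parent category's identifiers. The function and sample data below are hypothetical, using mmCIF-style names purely for illustration:

```python
# Hypothetical sketch of a parent-child check: every component id
# referenced from a child category must be defined in the parent
# chemical component category.
def check_parent_child(parent_ids, child_rows, key):
    """Return child values of `key` with no matching parent id."""
    parent = set(parent_ids)
    return [row[key] for row in child_rows if row[key] not in parent]

# Parent: the chemical component definitions.
chem_comp_ids = ["ALA", "GLY", "HOH"]

# Child: atom records that reference components by id.
atom_sites = [
    {"comp_id": "ALA"},
    {"comp_id": "GLY"},
    {"comp_id": "XYZ"},   # orphan: not defined in the parent category
]

orphans = check_parent_child(chem_comp_ids, atom_sites, "comp_id")
```

Checks of this kind are what make the relationships in those diagrams enforceable by software rather than merely documented.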
To end here, the current
dictionary that the
PDB is using consists of
around 380 data categories,
or about 4,400 data items.
They're all defined in this syntax,
they can be represented by these diagrams,
and some portion of them actually occur
in the data files which are
present on the repository.
The typical data file
in the PDB repository
may contain three or four
hundred unique data items;
a very well populated data file might
include 1,200 or 1,500 unique data items.
The dictionary itself
contains a significantly larger
number of attributes, which are generally
not populated in actual entries.
So after the break,
we'll be talking a little bit
about the resource where we
maintain these dictionaries
and updates, and a little
bit about the tools
and software that are associated
with the dictionaries.
So now, yeah, so before Kathy
gets started with metadata,
are there any questions?
- [Voiceover] Also out to
our remote participants,
any questions?
So when you're thinking about this,
it is probably, as it
was when we first began,
a very overwhelming concept
that you would have to define things
in quite this way in order
to build a usable database,
but that's really what's involved,
and I'll share with you that
when we first began this process,
we didn't have any
of the relationships
between the data items specified,
because, coming from small molecules,
it was felt that it was not necessary.
The way we did it was
by drawing road maps:
take the data items
and ask, is this one in some
way related to that one?
So we could get some idea of
how the data could be organized,
and this was really key
in building any database
for a particular scientific discipline
where you want to get a reliable answer,
which is different from other
types of search mechanisms.
So it probably is a lot to take in.
What I'm hoping is that
this will be clearer after Kathy describes
how she was a guinea pig in
creating metadata items for
a particular discipline.
The reason I'm
saying anything at all
is that you will actually have to do this,
but we're only asking
you to write definitions
for a handful of things, so that
you can learn how to do this.
So it would be important for
you to ask questions now,
just to get some clarity in this.
Any questions at all?
Okay, because you're going
to be doing this very thing,
so I wanna be sure you
understand what we're on about.
We're basically creating
a language that will
allow you to define
everything that you need to.
So it may help to
think about it that way.
So Kathy why don't you talk?
- [Voiceover] Okay,
just a quick mic check.
I think we're good.
All right.
John described to you, in his section,
the evolution of the data dictionary,
and I liked this particular figure
because it is really
an evolutionary process
and what I'm going to
talk to you about today
is a particular section of the
mmCIF PDB exchange dictionary
that I've been intimately
involved in developing
for now more than a decade.
And that is describing electron
microscopy experiments.
Helen, in her overview has told you about
how there are data terms
that are very similar
or maybe even shared between different
experimental data types,
and then there are very method
specific types of data items.
So in an EM experiment,
the very first step,
the biochemical preparation
is actually very similar
to what goes on in other methods,
but really then when you get
down to sample preparation,
imaging, data collection,
image processing,
reconstruction, and
the structural analysis
there's a lot of things that are going on
that are very specific to that field,
and so I'm going to describe
to you how we came about
developing a dictionary to describe
an experiment from
cryo-electron microscopy,
or 3D electron microscopy.
And I really appreciate
the vision of Helen
in creating this process
when she drew
me into this project,
because the very first day that
I was involved with this project
was a community workshop
that I participated in,
which she had helped organize
with experts in the field.
So we launched into discussions
with expert data producers,
software developers, database experts,
we had journal representation
and funding agencies as well,
and the procedure, again,
was workshops to focus
the community's discussion
on the data dictionary development.
Ideally you get people in a room,
you make them turn off their phones,
or their computers, and they can focus.
When I got involved in this project,
there were not very many structures,
from EM experiments.
There were a few key
ones that really showed
that there was a lot of potential.
So this very first structure
of bacteriorhodopsin
that was done was deposited in the PDB.
It was actually a 2D crystal structure
that used electron microscopy,
and it really was a
groundbreaking structure;
nobody had ever done a
structure by that method before,
and on the database side of things
it was actually difficult to figure out
how to represent that data,
because it was so different from the other
structures that had been determined.
And then we began to see more and more,
or very gradually more structures.
This was the first structure that was
solved with electron microscopy
in a non crystallographic form
so it was just single particles
laying in a field that were
classified and reconstructed
and then we began to see more
structures coming in in 2000.
Ribosomes became very
popular in the mid-2000s,
and actually, if you were to look,
ribosomes are more strongly represented by
EM experiments than by any other method,
because the method is becoming very powerful.
There were multiple
early development workshops.
I'm only going to talk here a little bit
about the workshop at Rutgers,
which was a triangle on
John's earlier slides.
And this, as I mentioned,
was organized by database developers
and experts in the field,
and if you wanna read more about it
there's actually, this is
a link to the description
on our emdatabank.org website.
It's the very first news
item for that website.
So this is a few more of the details.
So we had 30 attendees,
and the current EM dictionary
was reviewed in two different focus groups
and recommendations for revisions
were obtained so there was
an initial development done
before the meeting and
lots of posters printed out
and people were drawing all over them
and making notes and making
changes and suggestions
and there was a lot of revision that
happened during this meeting.
One key thing: I think I didn't
quite make this point earlier,
so I'll actually go back.
So in 2002, an EM map archive
was established at the EBI,
and I just want to point
out why that happened.
Why didn't the PDB just accept maps?
Maps that are produced by
an electron microscopy experiment
are so different from the experimental data
produced by crystallography
that it was just not clear
how best to archive them,
and so it seemed like the best thing
was to create and build
a separate archive.
And at the time of this meeting in 2004,
there were really only a
handful of entries in PDB
where coordinates were being deposited,
and maps were being
archived only in EMDB.
So one key recommendation though,
that came out of this meeting,
was that the experimentalists were not
particularly happy about that situation.
They really wanted to be able to have
a single deposition process,
so they wouldn't be putting their maps
in one archive and their models in
a completely separate archive.
And so this is the recommendation
that ultimately led
to the creation of the unified dictionary
between the two archives
and this dictionary
was the basis for an NIH-funded
international partnership,
the EMDataBank Unified
Data Resource for 3DEM,
which is led by an
experimental specialist,
Wah Chiu at Baylor College of Medicine.
It also involved our colleague
who created the EMDB database
at the European Bioinformatics Institute,
and our group here at Rutgers.
So the workshop followup was that we had
a dictionary development team created.
It involved me, John,
and someone from EBI,
and also somebody from Wah
Chiu's group at Baylor,
and we reviewed and incorporated
all the results of the workshop.
We went through a lot of examples.
And we mocked them up
to make sure that this new
dictionary would actually work
and we asked the people
who attended the workshop
to review the dictionary.
This was the beginning.
We looked at what
data was being collected
by EMDB and PDB, and we discovered
that there was a core that
was being collected by both,
but there were more specific
experimental data items
being collected by EMDB,
and there were some
more modeling-related
data items that were
being collected by PDB.
And so the expansion of the dictionary
and this is a decade ago now,
involved all these new categories
that are shown here in orange.
And those were beyond what
the initial dictionary was
and these were all based on
the workshop recommendations.
Okay, so now I'm going to go into examples
of the dictionary development
and I'm gonna focus on EM imaging
because that's a really important category
In electron microscopy,
I would say it's the single
most important category
that anybody really cares about making
sure that we get good data for.
The experimentalists are
very interested in knowing
what are all the parameters
of the imaging experiment?
I did not
include here the full name of every item
that's shown; entry_id, for example,
in the dictionary is actually
_em_imaging.entry_id.
I just left out the beginning part to
simplify the slide a little bit.
But you have data terms that have to
do with a parent-child relationship,
so the entry_id relates back to
the whole set of data items,
and the specimen_id relates back
to a particular specimen preparation.
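The parent-child relationships just described behave like foreign keys in a database. Here is a minimal sketch in Python; the category and item names follow the slide, but the data values and the `check_parents` helper are invented for illustration:

```python
# Sketch of a parent-child (foreign key) check between dictionary
# categories. All values here are invented for illustration.

# Parent categories: each row is identified by its id.
entry = [{"id": "EMD-0001"}]
specimen = [{"id": "1", "entry_id": "EMD-0001"}]

# Child category: each _em_imaging row points back to its parents.
em_imaging = [
    {"entry_id": "EMD-0001", "specimen_id": "1"},
]

def check_parents(child_rows, key, parent_rows):
    """Return child rows whose `key` value matches no parent id."""
    parent_ids = {row["id"] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_ids]

# Both relationships resolve, so no orphans are reported.
assert check_parents(em_imaging, "entry_id", entry) == []
assert check_parents(em_imaging, "specimen_id", specimen) == []
```

The same check works for any parent-child pair in the model; only the key name and the parent category change.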
You have equipment and basic settings,
so what kind of cryogen was used?
What was the electron source?
There's many different
types that are available.
What was the illumination mode
for the electrons being
passed through the sample?
What was the microscope model and so on,
and then there were a
huge number of parameters
that can vary between experiments,
and it was of great interest to
understand which were most useful.
Accelerating voltage, for instance,
has a huge impact on the ability
to actually see the sample,
and there are a lot of things here
that I won't go into detail
on how they're defined,
because it's very detailed
but I wanted to point out,
I told you what the most
important category was,
and the most important data
item in that category is
the microscope model.
This is a mandatory data item,
and here I put an asterisk next to
those data items that are mandatory.
There aren't that many.
But the microscope model is mandatory
and you use a controlled vocabulary.
I'll show you its current iteration here,
this long list of different
microscope models.
It has gone through many iterations
and there were some iterations where
things were not as well organized as this
but it was deemed very important
to actually be able to produce
entries both in EMDB and in PDB
where you could very clearly determine
what microscope it came from.
So that's how this came about.
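In code, the microscope model rule amounts to two checks: the value must be present (it is mandatory) and it must come from the enumerated list. A minimal Python sketch, where the vocabulary is a short invented excerpt rather than the real list, and `validate_microscope_model` is a hypothetical helper:

```python
# Sketch of validating a mandatory, enumerated data item such as
# _em_imaging.microscope_model. The vocabulary below is a short
# illustrative excerpt, not the actual controlled list.
MICROSCOPE_MODELS = {
    "FEI TITAN KRIOS",
    "FEI TECNAI F20",
    "JEOL 2100F",
}

def validate_microscope_model(value):
    """Return a list of validation problems (empty means the value is OK)."""
    problems = []
    if not value:
        problems.append("microscope_model is mandatory but missing")
    elif value not in MICROSCOPE_MODELS:
        # A non-compliant value is flagged at deposition; the vocabulary
        # itself can be extended in a later dictionary release.
        problems.append(f"'{value}' is not in the controlled vocabulary")
    return problems

assert validate_microscope_model("JEOL 2100F") == []
assert validate_microscope_model("") != []
```

Because the list lives in the dictionary rather than in software, adding a new microscope model is a data change, not a code change.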
So then in terms of sample representation
there were many examples
that we actually work through
to make sure that the
dictionary would work properly.
I'm gonna show you here
the example of T4 bacteriophage,
really a big challenge,
and this is an example that
Michael Rossmann had introduced to us.
He was actually one of the people
who was involved in the original workshop
and he has quite a passion for being
able to determine the molecular structure
of this large complex system.
He's actually, I would say, very close;
if you look at the review
that I referenced down here,
you'll see that almost
every part of this structure
has been determined at
the molecular level.
But it's a very complex virus
and it has a complex anatomy
that you have to appreciate:
it has a head, it has
a neck, it has a tail,
and it has a big baseplate, and a
bunch of fibers sticking out.
One of the things that we
try to collect carefully
is the symmetry, and this is
a particular challenge because most
depositions have a very
easily defined symmetry,
but for this particular shape,
to properly define the symmetry
you would have to recognize
that the head has one type of symmetry,
the tail has a helical symmetry,
and the base has another kind of symmetry.
And so hierarchical representation
is needed for this example
and one thing that we have
introduced more recently
into the dictionary is this capability,
with the _em_entity_assembly category
(that should be an underscore, not a period, on the slide),
that you can define
any part of an assembly
and then you can define
sub parts of that assembly
and then assign them the parent id
so that you can actually
work out the hierarchy
of a very complex assembly
simply by looking at the parent_id
and what id it points to.
So I just show you here for the base plate
you could think of defining the base plate
as short tail fibers
and the long tail fibers
as having different identities
but they all belong to the same parent.
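The parent_id mechanism can be worked out with a few lines of Python; the rows below are an invented fragment modeled on the T4 example, and `children_of` and `print_tree` are hypothetical helpers, just to show how the hierarchy falls out of the id/parent_id pairs:

```python
# Sketch of recovering an assembly hierarchy from _em_entity_assembly-style
# rows. The ids and names are invented to mimic the T4 example.
rows = [
    {"id": 1, "parent_id": 0, "name": "T4 bacteriophage"},
    {"id": 2, "parent_id": 1, "name": "head"},
    {"id": 3, "parent_id": 1, "name": "tail"},
    {"id": 4, "parent_id": 1, "name": "baseplate"},
    {"id": 5, "parent_id": 4, "name": "short tail fibers"},
    {"id": 6, "parent_id": 4, "name": "long tail fibers"},
]

def children_of(parent_id):
    """All rows whose parent_id points at the given id."""
    return [r for r in rows if r["parent_id"] == parent_id]

def print_tree(parent_id=0, depth=0):
    """Print the assembly hierarchy by following parent_id pointers."""
    for row in children_of(parent_id):
        print("  " * depth + row["name"])
        print_tree(row["id"], depth + 1)

print_tree()
# prints the tree: T4 bacteriophage, then head, tail, and baseplate
# indented under it, with the two fiber types indented under baseplate.
```

Because each row carries only its own id and a pointer to its parent, arbitrarily deep assemblies can be described without changing the category itself.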
In addition, there's a
linkage that you need to make
if you have a complex
that you have defined
at a molecular level,
each of those molecules has some kind of
an entity description,
I think we've touched on
this a little bit before,
but in the picture over here,
this is the base plate of T4,
and each color
represents a different entity.
Each of them is present
in at least six copies.
So for instance, this
is the base plate base,
and all of these, GP11, GP10, GP8,
are represented in here,
and so you need to be
able to make the linkage
from those entities to the assembly.
And that's another facet of what is done
in the current EM dictionary.
So, archiving to the present day:
as I mentioned, we have been fortunate
to have funding from NIH
to develop both the
deposition system and annotation,
and once we had deposition
and annotation more or less under control,
around 2010, and you can see here
the models in the PDB and
the maps in the EMDB
starting to ramp up around 2010,
we wanted to begin
to have the EM community
think about what kinds of
validation they may want to have
for their experimental data
that they were depositing
because how would a person who
is not familiar with the method
be able to understand
what they could trust
about what was being
deposited for an EM entry.
Especially in the early days,
a lot of these structures
were very low resolution
and so if you saw a
structure of a ribosome
that came from EM, it was very likely
a much lower resolution structure
than an x-ray structure.
So how would we actually
be able to convey to the user,
somebody who wanted
to use that structure,
that this is a different
level of interpretation
of an experiment than x-ray?
And I'll touch a little bit more on that.
I think on our next slide,
but we have actually been fortunate
to have continued NIH funding in 2014,
there was enough interest
from the EM community
that a raw image data pilot
archive was established
at the European Bioinformatics Institute
by our collaborators.
In 2015 and 2016, we have been running
map and model challenges,
and I'll touch on that as well
and just this year, in 2016,
the separate deposition systems
specifically for EM,
and also for NMR,
have now actually achieved
the holy grail: they are
integrated with x-ray
in a single deposition system
that uses the PDBx/mmCIF dictionary.
Our project has a website,
emdatabank.org.
It has the resources for the scientists
who are involved in this method.
They can come here and
find the information
they need about how to
deposit their structures,
how to search for their structures,
and we also provide news, events,
we try to keep a current software list,
as well as workshops that are
coming up in the community.
Again, this is a collaboration
between the NCMI group at
Baylor College of Medicine,
the group at the European
Bioinformatics Institute
and our group at Rutgers.
As I mentioned, it was kind
of early days for validation.
We saw in the x-ray area
that there was a team of scientists
for x-ray validation,
and they were able to come
up with a really, really
specific list of all kinds of validation
that I think would be really valuable
to add to x-ray entries, but for EM,
all the groups really came
up with the recommendation
that standards needed to be developed,
and so that's really the reason why
we have been running these challenges
to try to encourage thinking in this area.
So there was a modeling challenge
that was run in 2010 where
we initiated this process
and now we're running
new challenges right now.
So before, it was just a modeling challenge,
and it was really the lowest
resolution structures
that were looked at,
and in this case we're
running a map challenge
and a model challenge.
In the map challenge,
we're asking people to
take raw image data sets
and try to produce
the best possible map that they can,
and for the modeling challenge,
we are giving people the maps
that have been produced by the experts
and asking them to try to
build the best models into those maps.
And if you're interested
to see more about that
there's a website where the
challenges are being run.
Challenges.emdatabank.org
So I think these are my last two slides.
As I mentioned, the wwPDB
deposition and annotation system
is now the new place
where we're just beginning
to collect structures from EM.
This is actually the first time
that we've had the capability
to deposit both the map and
the model simultaneously.
In our prior systems, we
had a deposition system,
and you would click a link that took you
to the model deposition system,
and in addition to that,
the metadata that was
produced in that deposition
was copied over to the model deposition,
so the idea of having
a one stop shop was met in principle,
and now with this new system
we really have one stop shop
for people to deposit their whole entry
and they will be issued both
an EMDB ID and a PDB ID,
and we have much further
developed the dictionary
to go along with this change,
or improvement, in deposition.
I'll show you here that we have gone
from 26 categories to 65 categories,
and the number of items now
is about doubled as well.
What I'm showing you on this slide
are all of the categories
that are included
in the current dictionary
for the collection of information
about electron microscopy.
So that is the conclusion
of my part of the lecture
and if there are any questions
I'll be happy to answer them.
- [Voiceover] Are there questions?
Maggie.
- [Voiceover] Thank you.
So this is a question more about
best practices and implementation.
So using the example you gave about
the microscope model, let's say FEI
put out a brand spanking new microscope,
how do you then update your dictionary,
obviously add the name to the list,
but somebody deposited something that uses
that microscope and it's
technically not on the list yet,
so what happens?
Because I assume that data models
obviously are versioned,
and those versions go
out in a release cycle,
so what happens between
when that thing comes in
and when you need your structure
to come out (laughing)
and it's not there?
- [Voiceover] The dictionary
can be updated very quickly,
so we provide for the ability
to have a placeholder
for a non-compliant entry
at deposition time,
and then as new technologies are observed,
and this occurs at all
levels of the data model,
if the particular non-compliant value
is not truly a part of the
existing controlled vocabulary,
then we simply extend it,
and we can do that fast enough to address
the release cycle, which
is on the order of weeks.
It takes two seconds to update the data.
- [Voiceover] Thank you Maggie,
that was a great question.
Are there any other questions
from the people in the
room or external people?
Okay, if not I would like to suggest
we take a 10 minute break
and then we're gonna do a
walk through the dictionary
so you can see what the
dictionary actually looks like,
and then we'll spend time reviewing
what you're gonna have
to do for your homework
because that is going to be challenging:
you're gonna be selecting data items
that you're gonna use
to describe your system,
and you're also gonna
create some new data items,
not too many, but
you'll still have to do that,
and we wanna make sure
you know how to do that.
So let's be back here at 10:30, okay.
- [Voiceover] So we
conclude the lecture here.
Remote participants, we will restart
recording during the exercises.
See you in 10 minutes.
