- [Voiceover] That was how we deliver
structural information to the biologist.
So, on the first slide here,
I show basically a
workflow that we go through
actually, on a weekly basis.
You already heard about the PDB Archive
on the left hand side.
This is what we call Data In,
and Data In includes
everything from deposition,
annotation, validation, and so forth.
And this flowchart shows what happened
here at the PDB in San
Diego on a weekly basis.
So, also I need to mention
in terms of data distribution,
there are really three
distribution centers
out of the worldwide PDB,
that is the PDB in Europe,
the PDB in Japan,
and the RCSB PDB,
which is the US branch
of the worldwide PDB.
So, all three organizations
have their own website.
You may have seen them already.
And each website focuses
on different types
of ways of representing the data,
different ways of integrating
with other data and so on.
So, the different sites,
you can go to all of them,
and you get kind of a
different view of the PDB.
Each site has different
strengths and so forth,
but today we will focus on the RCSB PDB.
So, the major workflow,
the most important one,
is where do we go from Data In,
from the PDB Master
Archive, here in green,
all the way to the right
to the PDB FTP site.
So, this is basically a repository
of all the flat files in the PDB.
It is updated on a weekly basis,
but there's much more that happens.
And let's first look
on the right-hand side
under this column called Dissemination.
There are various ways
of disseminating data,
not just the FTP site.
We have, of course, the website here,
we have a mobile application
and we also have web services
that programmers would use
and we have MyPDB where you can get
weekly reminders about new structures.
So, in between, a
lot of things happen.
So starting again on the left-hand side,
so every week, we will
get new PDB structures,
and first of all, we load
them into our database.
Okay, that's one part,
but then we also go to External Resources,
and enrich that information.
We go to databases such as UniProt,
SCOP, PubMed, et cetera.
There are about 70 resources
where we get additional
value add information
and this information is
also loaded and integrated
with the PDB data itself,
and then finally we also
calculate various types of data.
For example, protein Secondary Structure.
We also cluster proteins
based on their similarity,
either by sequence similarity,
structural similarity, et cetera.
We generate custom images.
As you know, our website
is rich in images.
Most of those images are,
you know, computer-generated
and they are generated every
week as we get new structures.
Okay, let's go into more detail.
First of all, let's
talk about the FTP site.
So this is the main site
where you would download
files from the protein data bank.
The FTP archive is an
archive of flat files
and it contains several kinds of information.
First of all, atomic coordinates
and their descriptions
and that is actually coordinates of both
the macromolecules and the ligands.
It contains metadata
and experimental data.
One thing I haven't mentioned yet,
since I can't see you, I thought
I'll ask you some questions
to make sure you don't fall
asleep on the other end.
So, here's my first question.
This FTP archive contains flat files
and then the question is,
so what is the default file format
for those atomic coordinates?
Does anybody know that?
- [Voiceover] Raise your hand.
What's the format?
- [Voiceover] Anyone?
So it's, to give you the answer,
it's the mmCIF format
and that's the archival format
which is now rich in metadata
plus, you know, it can be,
it's an extensible format where we can add
new information as it's
necessary, going forward.
So my question is what
does FTP mean actually?
FTP means file transfer protocol.
What is protocol?
You can download individual files
and to give you an idea,
on average per year,
about 500 million files are
downloaded from the FTP site.
That's a very large amount of data
that's actually being used in science,
as well as in industry.
And, okay, you can
download individual files,
but some organizations,
for example, pharma companies,
they like to keep a whole
copy of the PDB in house
and then the question is
how to keep the PDB in sync,
because we're getting new structure files,
some files may become
obsolete or withdrawn,
so we need to keep a synchronized copy
and that is done through
the RSYNC protocol,
which looks at timestamps of files
and it will see if a
file becomes out of date
and will automatically keep
a synchronized copy on the file system.
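To make the idea of timestamp-based synchronization concrete, here is a minimal Python sketch. It is not how rsync itself is implemented, only an illustration of the decision it has to make: which files are new or changed upstream, and which local files no longer exist upstream. The file names, sizes, and timestamps are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    size: int     # file size in bytes
    mtime: float  # last-modified time, seconds since the epoch


def plan_sync(remote, local):
    """Compare two {path: FileInfo} listings and decide what to change locally."""
    # Download anything that is missing locally or whose size/timestamp differs.
    to_download = sorted(
        path for path, info in remote.items()
        if path not in local or local[path] != info
    )
    # Delete anything that has disappeared upstream (obsolete or withdrawn).
    to_delete = sorted(path for path in local if path not in remote)
    return to_download, to_delete


# Hypothetical listings; none of these are real archive file names.
remote = {
    "entry_a.cif": FileInfo(1000, 100.0),
    "entry_b.cif": FileInfo(2000, 250.0),  # updated upstream
    "entry_c.cif": FileInfo(1500, 300.0),  # newly released
}
local = {
    "entry_a.cif": FileInfo(1000, 100.0),
    "entry_b.cif": FileInfo(2000, 200.0),  # older timestamp than upstream
    "entry_old.cif": FileInfo(500, 50.0),  # no longer present upstream
}

to_download, to_delete = plan_sync(remote, local)
```

In this toy run, the unchanged file is left alone, the updated and new files are fetched, and the file that vanished upstream is removed from the local copy.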
But, that's the FTP Archive.
This is our number one
distribution mechanism of files.
The second most used one
is obviously the website
and the website has more functionality
than just downloading files,
because if you've seen
this website before,
on the left-hand menu, we
actually give you an overview
of what you can do.
You can first of all,
you can deposit stuff.
We heard a lot about that
in the last few lectures.
Then, more importantly,
you can search the PDB.
You can type in any kind of search term
or if you look on this page
down here under Search,
there are various search options.
You can search by Sequences, for Ligands,
you can find Drug Targets,
or you can use this drill-down here
which you may have used,
where it's categorized
the contents of the PDB,
for example by Organism,
by Experimental Method,
and Release Dates, et cetera, et cetera.
So Search is a core functionality
that you do not have on an FTP Archive.
You just have files.
Once you complete a search,
you may want to analyze
your search in more detail.
We provide various
types of analysis tools.
And then, we have 3D visualization.
You obviously want to visualize
the proteins or DNA or
RNA structures in 3D,
so we provide a number
of visualization options on the site.
And then finally, as you're getting ready,
maybe, to write a paper,
so you may want to tabulate your data
and export them, for example,
as an Excel spreadsheet,
or you want to download
the specific structure
from your search and do
some further analysis on it.
So, this is an overview of the functionality
of the RCSB PDB website.
I see now mobile devices are very popular,
so we also offer a mobile view of the PDB
and the idea here is the quick access
to the PDB on the go.
So, let's say you're at a conference,
you hear about a particular
protein structure,
you want to quickly learn about it,
or you're in a lecture, et cetera,
so this is a quick way to look
up information on your phone
and obviously, this has minimal features,
but you can do basic searches,
and you can also display
molecules in 3D.
This is supported both on the iOS,
I mean, on the Apple
iPhone and iPod, iPad,
as well as on Android devices, okay?
So if you have a phone and interest in it,
this is a good way to
learn more about the PDB.
You know, after this lecture,
you may want to go ahead and try it out.
It's a very good way
to impress your friends
and correlate to what we've
learned in this course.
You can actually show it
to them on your phone.
Okay, so what other
ways do we access data?
So, before we talked about
interactive usage of the PDB,
but there are also programmers who want
to write computer programs
and access information
and this is done through what's
called a RESTful API,
an application programming interface,
that allows communication
between an application
and our back-end database
and here a user can, for example,
download large chunks of
data at the same time,
rather than one piece of
information at a time.
Therefore, anything you
can do on the website,
searching, analyzing
data, or downloading data,
all of that can be done also
through computer programs,
and this is, you know,
used also quite a lot.
Here, it shows how our web services work.
It's actually very easy.
First of all, you have a URL, a query URL.
It is the www.rcsb.org
and then there's some parameters here.
For example, this particular
web service request
is about getting sequence cluster data
and here, in the parameters,
it's specified,
for example, we want
to get a cluster at 40%
sequence identity for a
particular structure, 4hhb.
I'm sure you have seen
those types of URLs before
when you browse the web somewhere,
so this is, it looks like
it's actually very easy
and once you put this in your browser,
you get a response, which is now,
what is important is machine readable,
rather than on the website, where you
interactively look at
something in your browser.
What you get back is
actually machine readable,
for example in the XML
or JSON format here,
and here's my next question for you.
See this response here.
Does anyone know what
type of format that is?
Is it XML or JSON?
Anybody know?
- [Voiceover] Everybody said XML.
- [Voiceover] Oh that's right, yeah.
(female speaker drowned out by male)
Yeah, that's right.
It's XML, okay, great.
Alright, so I guess already
you learned about that.
Let's go further.
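As a rough illustration of the request and response just described, the Python sketch below builds a query URL from parameters and parses an XML reply. The URL path, the parameter names, and the sample XML are all placeholders invented for this example; they are not taken from the slide and should not be read as the actual RCSB PDB web service interface.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
import xml.etree.ElementTree as ET

# Hypothetical endpoint; the real service path is not given in the lecture.
BASE_URL = "https://www.rcsb.org/hypothetical/sequenceCluster"


def build_query_url(structure_id, identity_percent):
    """Assemble a query URL: base address plus key=value parameters."""
    params = {"cluster": identity_percent, "structureId": structure_id}
    return BASE_URL + "?" + urlencode(params)


url = build_query_url("4hhb", 40)

# Invented sample of what a machine-readable XML response could look like.
SAMPLE_RESPONSE = """<sequenceCluster>
  <pdbChain name="4HHB.A" rank="1"/>
  <pdbChain name="4HHB.C" rank="2"/>
</sequenceCluster>"""


def chain_names(xml_text):
    """Extract the chain names from the XML response."""
    root = ET.fromstring(xml_text)
    return [element.get("name") for element in root.findall("pdbChain")]


names = chain_names(SAMPLE_RESPONSE)
```

The point of the sketch is the shape of the exchange: a program fills in parameters, sends the URL, and reads structured fields out of the reply instead of a person reading a web page.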
So, we talked about
downloading information
from the Protein Data Bank,
but that brings up, then, the question about
what is the licensing of the data?
For example, if I'm a commercial entity,
can I use data from the Protein
Databank for my research?
Is that allowed?
Well, let's look a little
bit at the PDB usage policies.
And we have different types of data.
First of all, we have data files
and fortunately, they are freely available
without any copyright restrictions
and both for commercial
and noncommercial usage.
And, you know, a very
important point is here
they allow for commercial usage
and that is being used extensively
in the pharmaceutical industry,
because many of the proteins in the PDB
are also drug targets
so this is very important
in terms of facilitating medical research
and as you probably know,
the PDB, which is actually
almost 45 years old now,
was actually one of the first
open access data resources in biology,
so this is quite remarkable
that already 45 years ago,
the community thought it that important
to make that type of
data freely available.
Other types of data
include molecular images
that we've talked about.
Again, they are available
under the same conditions,
but with two constraints.
First of all, if you use an image,
you should obviously cite the author
that produced the structural data,
and secondly, you should
also cite the RCSB PDB.
If you are familiar with
our Molecule of the Month illustrations,
they are available under
particular license called
CC-BY-4.0 and I will explain
on the next slide what that means.
And finally, we have Molecule
of the Month articles
and those are copyrighted
and yeah, they can be reprinted,
but there's some constraints.
Your permission is required
and it needs to be
attributed to the PDB here.
Alright, so let's talk all about
the different types of licensing.
Let's first talk about this concept
of copyright and copyleft.
I'm sure most of you have
heard about copyright.
It basically protects either
documentation or software
from unauthorized copying or selling
but then there's also
the concept of copyleft,
and here the idea is to make
the information freely available,
software freely available,
and again, distribute, modify,
(garbled audio)
distribute modified versions.
(garbled audio)
Over time, and different people have,
maybe they have a better idea.
They can improve software,
so it's important for software to be free
so it can be improved,
extended, and (mumbles)
copyleft allows you to do that.
Although copyleft does have
some strings attached to it,
that means if the software
was originally licensed
under a particular license,
you need to keep that
license going forward.
You cannot just say, "Okay,
now I take your source code
"and modify and I give it away
"under some different licensing."
So copyleft basically means
you need to keep that
same type of licensing.
Let's talk more about, first of all,
about software licenses,
then we'll talk more about data licenses.
So, about software licenses,
there are many of them,
but here are three popular ones.
The first one is the GPL license,
the GNU General Public License,
and it uses this idea of copyleft,
so you can change the software,
you can distribute, and so on,
but all your derived works
also need to be distributed
under the same GPL license.
So, that's the idea of copyleft.
Now, there are two other licenses:
BSD and the MIT license.
There are fewer restrictions,
you know, how you can use them.
So, okay, let's see ...
Yeah, so, the thing about BSD and MIT
is, for example, you could use them
in proprietary software, and so on,
so, this is especially
important for companies
that want to use publicly
available resources.
There's a less restrictive
license, they can use it.
If you have a more restrictive license,
they may not use it.
So, it's very important what
type of license you choose,
you know, if you really
wanna enable science.
And now, let's talk
about data licenses.
Here, we talk about
Creative Commons licenses.
You may have heard this before.
For example, I mentioned
that some of our content
is licensed under the CC BY
license and what that means
is you can take, you know, the data,
you can distribute them,
you can tweak them,
and you can build upon
that work and so on,
and even commercially,
so this is a very, you
know, unrestricted license.
On the other hand, for example,
if they have a license like CC BY NC,
it's basically the same thing,
but this can only be for
noncommercial usage and so forth.
Another one is CC BY ND.
That means no derivatives,
so that means you can distribute it,
you can use it commercially,
but you cannot actually change it.
You can't make changes to it.
And then finally, I want to mention
the Attribution ShareAlike license,
where people can use the data freely,
but then any derivatives they develop
they also need to have that same license.
This is again, the idea of copyleft.
You can do what you want with it,
except you need to keep
the same licence agreement.
You cannot make it, for
example, more restrictive.
So, I'm not gonna go through
all those licenses.
What they all have in common
is the idea of attribution
that you basically say
where you got the data from,
it needs to be attributed.
Okay.
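The four Creative Commons licenses mentioned above can be summarized as a small lookup table. The Python sketch below is only a simplified summary of what was said in the lecture (all four require attribution); the class and function names are made up for illustration, and it is no substitute for reading the license texts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    commercial_use: bool  # may the work be used commercially?
    derivatives: bool     # may modified versions be distributed?
    share_alike: bool     # must derivatives keep the same license (copyleft)?


# All four licenses below also require attribution.
CC_LICENSES = {
    "CC BY": LicenseTerms(commercial_use=True, derivatives=True, share_alike=False),
    "CC BY-NC": LicenseTerms(commercial_use=False, derivatives=True, share_alike=False),
    "CC BY-ND": LicenseTerms(commercial_use=True, derivatives=False, share_alike=False),
    "CC BY-SA": LicenseTerms(commercial_use=True, derivatives=True, share_alike=True),
}


def allows(license_name, *, commercial, modified):
    """Return True if the intended use is compatible with the license terms."""
    terms = CC_LICENSES[license_name]
    if commercial and not terms.commercial_use:
        return False
    if modified and not terms.derivatives:
        return False
    return True
```

For example, `allows("CC BY-NC", commercial=True, modified=False)` is false, while the same call for "CC BY" is true.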
Alright, so, that was a little bit
about data licensing.
So now, we wanna talk
more about how to query.
- [Voiceover] Peter, Peter?
- [Voiceover] Yes?
- [Voiceover] I just
wanted to stop for a minute
and see if anyone has questions about
how to write a license
since this is a little bit different
than what you normally learn about.
Does anyone have any ...
How 'bout the librarian people?
So, is there a CC0 license used
by the Protein Data Bank?
- [Voiceover] A CC 0?
- [Voiceover] Yeah.
- [Voiceover] I'm not ...
- [Voiceover] It's kinda new,
what they're using for data now.
What is the license that you supply?
- [Voiceover] Yeah, so,
the PDB has a long history.
So right now, some of, for example,
the data are described
in this usage policy
but we have not assigned
a specific license to it.
This might be something that
the PDB could consider in the future.
The PDB already
existed and had a policy
way before the new license
agreements came online.
- [Voiceover] The person
who asked the question
is a librarian
and is pretty knowledgeable
in these areas,
which is why I wanted to stop for a minute
for a discussion here.
- [Voiceover] Right.
- [Voiceover] So, you're right.
We evolved the policy based on history,
but it might be a good idea now
to be very specific about
what the license is because
most of the people who use the PDB
were not born when the PDB was started, so ...
- [Voiceover] Okay, that
was really a good question.
Maybe it's something we can think about.
- [Voiceover] Linda has more to say.
- [Voiceover] So, um, Laura.
- [Voiceover] Laura, sorry.
- [Voiceover] No, it's alright.
So, do you require a data
deposit agreement to be signed
or are you just agreeing
by depositing, basically?
- [Voiceover] John, could you answer that?
- [Voiceover] So, the terms and conditions
for abiding by the principles of
a historic license
are accepted when the files
are deposited to the archive.
So, people have to
press a button when they
accept the backup.
- [Voiceover] So, Peter,
and John, and Mikey
are used to free, going forward
to review all of this again, and--
- [Voiceover] Definitely.
- [Voiceover] Tweak it a bit,
because we're in a different era now,
and it might be a very good idea
to look at the whole thing, you know,
because we have table that's for the data,
which is, you can do anything you want,
and the attitude has been
if somebody can
make money
by reusing our data,
in whatever way they want,
then they should do that.
And, if people, if the market
will give people money,
you know, will pay for something,
then our job is to enable that.
That has been our attitude
from the beginning.
- [Voiceover] Mm-hmm.
- [Voiceover] And I think
it's a good attitude.
- [Voiceover] Yeah.
- [Voiceover] I do think
we ought to formalize
and re-look at all of these things
and make it clear to
the people who are new
who don't know the sort of
folksy history of the PDB,
just that this is the license,
so this is what it's called.
So, we could re-look at this, Kathy.
- [Voiceover] Yeah, I
was just gonna comment.
We could call it the hippie license.
- [Voiceover] Yeah, we have the hippie--
(laughing)
The PDB began in the hippie era,
and we stayed true to that, so ...
- [Voiceover] Mm-hmm, but that was really
a very good question,
so it actually has provoked
some thought for us,
maybe we can look at in more detail.
Alright, if there are no more questions,
I'll talk a little bit
about querying the data now.
So, underlying any queries
in the Protein Data Bank,
we have a relational database
and I heard you actually
already did some exercises
using SQL to query
something in the database,
but I just wanna kind of
go over it one more time,
and show you how this actually
relates to the website.
So, a relational database, based on the name,
says that there are
relations of data in there
and this facilitates querying
that with query language
called SQL.
And here on the bottom are three examples,
of how data can be related to each other.
There can be one-to-one relationship
between a structure, and
a structure description.
Every structure has some description
what the protein is about, for example.
We can have one-to-many relationships,
for example a protein
chain contains amino acids
or nucleic acid residues,
so that one chain contains
multiple residues,
and that's indicated by the
small numbers, one and infinity.
One chain can have one or more residues.
Or, it can be many-to-many,
where one structure
belongs to a particular
structure classification,
but on the other hand,
several structure classifications,
can apply to one structure.
So, this is many-to-many relationships.
By having those relationships,
we can now run queries
about those relationships.
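The three kinds of relationships can be sketched as tables in a small in-memory SQLite database. The table and column names below are invented for illustration and are not the actual PDB schema; the sketch only shows how each relationship type is typically expressed with keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE structure (id TEXT PRIMARY KEY);

-- One-to-one: the primary key allows at most one description per structure.
CREATE TABLE structure_description (
    structure_id TEXT PRIMARY KEY REFERENCES structure(id),
    description  TEXT
);

-- One-to-many: many residue rows point at the same chain.
CREATE TABLE chain (
    id           TEXT PRIMARY KEY,
    structure_id TEXT REFERENCES structure(id)
);
CREATE TABLE residue (
    chain_id TEXT REFERENCES chain(id),
    position INTEGER,
    name     TEXT
);

-- Many-to-many: a link table pairs structures with classifications.
CREATE TABLE classification (id TEXT PRIMARY KEY);
CREATE TABLE structure_classification (
    structure_id      TEXT REFERENCES structure(id),
    classification_id TEXT REFERENCES classification(id),
    PRIMARY KEY (structure_id, classification_id)
);

INSERT INTO structure VALUES ('S1'), ('S2');
INSERT INTO structure_description VALUES ('S1', 'example protein');
INSERT INTO chain VALUES ('S1.A', 'S1');
INSERT INTO residue VALUES ('S1.A', 1, 'MET'), ('S1.A', 2, 'ALA'), ('S1.A', 3, 'GLY');
INSERT INTO classification VALUES ('class_x'), ('class_y');
INSERT INTO structure_classification VALUES
    ('S1', 'class_x'), ('S1', 'class_y'), ('S2', 'class_x');
""")

# One chain, many residues.
residue_count = conn.execute(
    "SELECT COUNT(*) FROM residue WHERE chain_id = 'S1.A'"
).fetchone()[0]

# One structure can carry several classifications ...
classes_of_s1 = [row[0] for row in conn.execute(
    "SELECT classification_id FROM structure_classification "
    "WHERE structure_id = 'S1' ORDER BY classification_id")]

# ... and one classification can apply to several structures.
structures_in_x = [row[0] for row in conn.execute(
    "SELECT structure_id FROM structure_classification "
    "WHERE classification_id = 'class_x' ORDER BY structure_id")]
```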
And what is usually used
is some kind of database
management system,
and there's several open
source packages available,
and that comes back to
this idea of licensing
and free software.
So, there's, for example, MySQL,
which is open source software,
and we use it at the PDB,
and other database like PostgreSQL,
that is another open source database,
and then there are obviously
many commercial databases.
So a, you know, commonly
used one might be Oracle.
Okay, so how that relates to the data
in the Protein Data
Bank, as we have learned,
the data are represented using
the mmCIF dictionary.
And the nice thing about the dictionary,
is there's a one-to-one correspondence.
Each category, in the mmCIF dictionary
corresponds to a
relational database table.
And this is kind of shown
on the right hand side.
Here on the bottom there's a table
called Chemical Component,
which has an ID field in it.
The same ID field is also
used in other tables.
For example, in the Audit table,
or in this Chemical
Component Descriptor table.
The IDs are the same.
It's the Chemical Component ID.
That is the relationship
between those tables,
but we can, for example
given a chemical component,
using the ID, we can go back up
and figure out what details we
can get, such as this description,
based on this common ID here.
That's the idea of our
relational database table.
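Here is a minimal sketch of following a shared ID from one table to another, using SQLite from Python. The table names loosely echo the chemical component tables on the slide, but the exact columns are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chem_comp (
    id   TEXT PRIMARY KEY,
    name TEXT
);
CREATE TABLE chem_comp_descriptor (
    comp_id    TEXT REFERENCES chem_comp(id),  -- the shared chemical component ID
    type       TEXT,
    descriptor TEXT
);
INSERT INTO chem_comp VALUES ('HOH', 'WATER');
INSERT INTO chem_comp_descriptor VALUES ('HOH', 'SMILES', 'O');
INSERT INTO chem_comp_descriptor VALUES ('HOH', 'InChI', 'InChI=1S/H2O/h1H2');
""")


def descriptors_for(comp_id):
    """Join the two tables on the common chemical component ID."""
    return conn.execute(
        """
        SELECT c.name, d.type, d.descriptor
        FROM chem_comp AS c
        JOIN chem_comp_descriptor AS d ON d.comp_id = c.id
        WHERE c.id = ?
        ORDER BY d.type
        """,
        (comp_id,),
    ).fetchall()


water_rows = descriptors_for("HOH")
```

Because both tables carry the same ID value, a single join returns the component name together with each of its descriptors.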
And, it's not just the PDB
who does this like that.
For example, other resources we are using,
UniProt is a sequence database,
and Gene Ontology is a database that
defines protein functions.
For example, all of those resources,
can be mapped to a relational database,
and then we can use SQL
to query all those data.
And on the next slide,
I show an example of how we can use SQL
to query a database.
So, you have done this,
I hear, in your own exercises.
Here, we wanna query, say SELECT *
FROM Structure
WHERE numChains >= 4
AND releaseDate > '2015-01-01'.
So, I know that you guys
already learned SQL.
I wanna ask you what the
different parts mean.
First of all, we have a SELECT *.
Does anybody know what SELECT * means?
- [Voiceover] Select all the database?
- [Voiceover] Yes, so
SELECT * basically means
show me all the fields
in this table called Structure.
So, Structure is a table
in the Protein Data Bank database,
saying basically, okay,
show me all the data
in this table,
but with the condition
that the number of chains
is greater than or equal to 4,
and you probably heard about it before.
A protein can have multiple
protein chains or subunits.
So, here we can basically say,
I wanna hear about those structures
where there are at least four chains,
and where the release date is
after January 1, 2015, yeah.
So, if you see something like
number of chains greater than four,
what does it tell you about relationships
in this database?
Any ideas that we can learn from that?
Okay, I'll give you the answer,
basically the idea here is of
a one-to-many relationship.
One structure may
contain multiple chains.
So, this an idea of a
one-to-many relationship
that you kind of see
through this query here.
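To try the query from the slide, the following Python sketch runs it against a tiny in-memory SQLite table. The table layout and the four rows are invented (the IDs are placeholders, not real PDB entries). One assumption worth stating: dates are stored as ISO text (YYYY-MM-DD) and the date literal is quoted, because an unquoted 2015-01-01 would be evaluated as arithmetic in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Structure ("
    "pdbId TEXT PRIMARY KEY, numChains INTEGER, releaseDate TEXT)"
)
conn.executemany(
    "INSERT INTO Structure VALUES (?, ?, ?)",
    [
        ("AAAA", 4, "2015-06-10"),  # matches both conditions
        ("BBBB", 2, "2015-08-01"),  # too few chains
        ("CCCC", 8, "2014-12-31"),  # released too early
        ("DDDD", 6, "2016-01-20"),  # matches both conditions
    ],
)

# ISO-formatted dates sort correctly when compared as text.
rows = conn.execute(
    """
    SELECT *
    FROM Structure
    WHERE numChains >= 4
      AND releaseDate > '2015-01-01'
    ORDER BY pdbId
    """
).fetchall()
```

Only the rows that satisfy both conditions come back, and SELECT * returns every column of each matching row.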
Okay, so here's the query.
And why is this important?
When you go to the
Protein Data Bank website,
you can run a query.
For example, you can go to
our advanced search interface,
and here you would say,
okay, number of chains, is between four,
and if you don't specify another number,
that basically means okay,
I want four or more chains,
and the release date is
between January 1, 2015,
and today.
So, here you see the
graphical user interface.
What happens behind the scenes
on our web server is that
we run this particular query
on the back end database,
and it will return the structures
that match those conditions.
So now you basically have seen all the way
how this works.
So, you see the user interface.
You see the SQL, how to run the query.
And then before we told you about
the mmCIF data schema,
how each category maps to one table.
So, now I think you see the full picture,
how we go from the data
to a relational database,
how we query the data,
and then how we show the query on,
you know, on the,
on the website itself.
Does that kinda make sense?
Any questions about that?
Alright, if not then I'll
go on to the next slide.
So, there are also alternatives
to SQL databases.
They're called NoSQL databases.
We're using that, for
example on our PDB-101 page,
which is the educational
section of the PDB website,
which is showing on the right-hand side.
So, here you don't really
have a relational database.
What you have here is
different educational content,
which you may search.
So, we don't necessarily need
a relational database here.
The idea of a NoSQL database is that
it can handle much bigger data sets;
sites like Twitter or Facebook
may use a NoSQL database.
An example of those databases,
for example, MongoDB,
which we're using on our PDB-101 website.
And what MongoDB is,
it's, basically, a key-value store.
And what does that mean?
So, basically, you store pairs of data.
You store keys.
So, for example, here we have
the Molecule of the Month,
January 2016.
So, the key might be January 2016,
and the value may be
this short blob of text
that we're displaying on our website.
And since we have more than 100
Molecule of the Month objects
we can store them
as key-value pairs.
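The key-value idea can be sketched with a plain Python dictionary. This is only a stand-in for the concept; it does not use MongoDB or its API, and the keys and article text are made up.

```python
# A dictionary plays the role of the store: each key maps to one value.
store = {}


def put(key, value):
    """Save a value under a key, replacing any earlier value."""
    store[key] = value


def get(key, default=None):
    """Look up a value directly by its key; no query language is needed."""
    return store.get(key, default)


# Hypothetical entries keyed by publication month.
put("2016-01", {"title": "Example molecule A", "text": "Short description ..."})
put("2015-12", {"title": "Example molecule B", "text": "Another description ..."})

january = get("2016-01")
missing = get("1999-01")
```

Retrieval is a direct lookup by key, which is why this model suits content such as articles that are fetched whole rather than queried field by field.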
Then, there are other types
of NoSQL databases,
for example, triple stores.
And triple store, basically
also contains relationships
where we have a subject, a
predicate, and an object.
For example, you would say PDB ID XYZ,
is a kinase.
So, the subject is a particular PDB ID.
The predicate is, is a,
and the object is kinase.
So, it's a simple relationship.
So, a triple store is basically
a list of relationships.
And again, you know, this is useful
for some kinds of data.
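A triple store can be imitated in a few lines of Python with a list of (subject, predicate, object) tuples and a pattern-matching function. The identifiers below are placeholders, and real triple stores add indexing and a query language on top of this idea.

```python
# Each fact is one (subject, predicate, object) triple.
TRIPLES = [
    ("pdb:XYZ", "is_a", "kinase"),
    ("pdb:XYZ", "from_organism", "organism_1"),
    ("pdb:ABC", "is_a", "protease"),
    ("pdb:DEF", "is_a", "kinase"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that agree with the given parts; None means 'anything'."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Which entries are kinases?"
kinases = [s for (s, _, _) in match(TRIPLES, predicate="is_a", obj="kinase")]

# "What do we know about pdb:XYZ?"
facts_about_xyz = match(TRIPLES, subject="pdb:XYZ")
```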
And finally, a generalization
of Triple stores
are Graph databases,
where you not just have a single
simple pair of relationships,
but you have a complex
network of relationships.
That would be called a graph database.
So, it depends, really, on your use case,
you know, what you wanna,
what type of database you want to use.
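For the graph case, a small sketch is a dictionary of nodes and their neighbors, plus a breadth-first walk that follows relationships across several hops. The node names are invented; a real graph database would store typed edges and offer its own query language.

```python
from collections import deque

# Adjacency list: each node maps to the nodes it is directly related to.
GRAPH = {
    "structure:XYZ": ["protein:P1", "ligand:L1"],
    "protein:P1": ["family:kinase"],
    "ligand:L1": ["drug:D1"],
    "family:kinase": [],
    "drug:D1": [],
    "structure:ABC": ["protein:P2"],
    "protein:P2": [],
}


def reachable(graph, start):
    """Breadth-first search: everything connected to `start` by any number of hops."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen


connected_to_xyz = reachable(GRAPH, "structure:XYZ")
```

The difference from a single triple lookup is the multi-hop traversal: starting from one structure, the walk reaches the drug via the ligand and the family via the protein.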
Okay, so let's talk a little bit about
the other types of query interfaces.
You have seen the advanced search,
where you can create a complex search
by adding different
search criteria together,
but we have other search features.
For example, we have an Auto-suggest,
and that is shown on the right hand side.
As you type a name into the
search box on our website,
we give you potential matches here,
which is maybe similar to what you've seen
in other search engines,
maybe in Google or on your
browser, and so forth.
So, if you type Vitamin here, it says,
"Oh, are you looking for
Vitamin D as a molecule name?
"Or are you looking for example
for an enzyme classification
"that contains this sort of vitamin?
"Or, are you looking for
"Vitamin Binding Protein,
et cetera, et cetera?"
Because the user may not
know what's in this database.
But here, we give you some options
and then, you can just,
if I'm really interested
in vitamin D receptor,
then I know there are 95 hits,
so if I click on this,
I will retrieve those 95 hits.
So, that's Auto-suggest.
Similarly, we have drill-down lists.
This is shown on the bottom here.
After you run a query,
we basically take your,
this top set of results,
and categorize it.
For example, if we say, okay,
your results contain,
in terms of organisms,
it contains those two organisms,
it contains a particular
UniProt, et cetera, et cetera.
So, this is what we call a drill-down,
basically taking your
data and categorizing it
and this allows you to
further refine, or drill-down,
into your data.
And basically, all those options
help you filter down your search results.
Usually, when you do a text search,
you may get hundreds of results
and this drill-down will help
you to refine your query.
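The drill-down idea, categorizing a result set and then filtering on one category, can be sketched in Python with a counter. The result records and field names here are made up for illustration.

```python
from collections import Counter

# Hypothetical search results; each record is one matching structure.
results = [
    {"id": "R1", "organism": "Homo sapiens", "method": "X-RAY"},
    {"id": "R2", "organism": "Homo sapiens", "method": "NMR"},
    {"id": "R3", "organism": "Mus musculus", "method": "X-RAY"},
    {"id": "R4", "organism": "Homo sapiens", "method": "X-RAY"},
]


def drill_down(records, field):
    """Count how many results fall into each category of the given field."""
    return Counter(record[field] for record in records)


def refine(records, field, value):
    """Keep only the results in the chosen category."""
    return [record for record in records if record[field] == value]


by_organism = drill_down(results, "organism")
human_only = refine(results, "organism", "Homo sapiens")
human_by_method = drill_down(human_only, "method")
```

Each refinement step shrinks the result set, and the counts can be recomputed on the smaller set to offer the next level of drill-down.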
We also have specific
software to do text searches,
like Lucene and Solr.
And those are somewhat different
from a regular relational database
because they support full-text searching.
For example, usually,
when you search text,
let's say a title of a PDB entry,
you may not know the whole title.
You only know words,
keywords, or parts of words.
It will be able to do a text search.
It can even deal with misspelled words.
So that's the nice thing about it.
You may misspell vitamin,
but it will still suggest some results;
that's the text-specific searching
that's involved there.
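Tolerance for misspelled words can be illustrated with Python's standard difflib module, which ranks candidate terms by string similarity. This is a different and much simpler technique than what Lucene or Solr do internally; it is shown only to make the idea of approximate matching concrete. The vocabulary is invented.

```python
import difflib

# Hypothetical list of terms known to the search index.
VOCABULARY = ["vitamin", "kinase", "protease", "hemoglobin", "insulin"]


def suggest(term, vocabulary=VOCABULARY, cutoff=0.6):
    """Return known terms that are similar enough to the (possibly misspelled) input."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)


suggestions = suggest("vitamn")     # one letter missing
no_suggestions = suggest("qqqqqq")  # nothing similar in the vocabulary
```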
And then, as you go to larger databases,
such as Google, or so on,
they need to parallelise their (mumbles)
If you search, basically,
the entire internet,
you need to do it fast,
and that is done through parallelisation
and one of the mechanisms
that's being used.
It's called MapReduce.
We don't currently use it
in the Protein Databank,
but maybe down the road,
we may implement such methods,
if we have such a large database.
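The MapReduce pattern itself is simple enough to sketch in a single Python process: a map step emits key-value pairs per record, a shuffle step groups them by key, and a reduce step combines each group. In a real deployment the map and reduce steps run in parallel on many machines; this toy version runs them one after another, on invented input.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: combine the values for one key into a single result."""
    return sum(values)


def map_reduce(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    return {key: reduce_phase(values) for key, values in shuffle(pairs).items()}


# Hypothetical entry titles to count words in.
titles = ["Vitamin D receptor", "Vitamin binding protein", "Receptor kinase"]
word_counts = map_reduce(titles)
```

Because each record is mapped independently and each key is reduced independently, both steps can be spread across machines without changing the result.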
Any questions?
Yeah?
Okay, then I'll go on.
The other important part is
that once you did a query,
you wanna save your results
and we offer you tabular reports
and you already did
some exercises for that
where you create a custom report,
you choose the fields you want to export
and then you can export them in a
comma-separated values file,
or as an Excel file, and
analyze that further.
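A tabular report of selected fields can be sketched with Python's csv module. The records and field names are invented, and the output is written to an in-memory buffer rather than a file so the example has no side effects.

```python
import csv
import io

# Hypothetical query results with more fields than we want to export.
results = [
    {"pdb_id": "AAAA", "resolution": 1.2, "method": "X-RAY", "chains": 4},
    {"pdb_id": "BBBB", "resolution": 2.8, "method": "X-RAY", "chains": 2},
]


def export_csv(records, fields):
    """Write only the chosen fields, with a header row, as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


report = export_csv(results, ["pdb_id", "resolution", "method"])
```

The resulting text can be saved with a .csv extension and opened in a spreadsheet program for further analysis.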
Other types of reports
that we can generate
are image galleries, as shown right here.
And again, it's the back-end databases
that allows us to generate those reports.
So far, we've talked
about searching PDB data,
and now we'll talk about
integrating with other resources.
So, why would we link
with other resources?
The idea is to create
new query capabilities
by integrating with complementary
other biological resources
and as I mentioned before,
we're integrating with about 70 other
biological databases.
So what is the advantage of integrating?
This is shown here.
On the left hand side is
the Protein Data Bank.
If we were just able to
query the Protein Data Bank,
our queries are limited,
obviously, to PDB data.
But, we can run a query,
for example, query all X-ray structures
with a resolution better than 1.5 Å.
So, we can run this query
just using PDB data,
but the power comes in if we integrate
with other resources that are shown here.
So, there are other various
biological databases.
We can integrate them if we
have common identifiers,
and I will talk on the next slide
about what we mean by common identifiers.
But, this will allow us
to integrate other data
with the PDB value add data,
and that will expand
our query capabilities.
I'll give an example.
Again, we can say, okay,
give me all X-ray structures
with a resolution better than 1.5 Å
and they should have
a kinase function,
so this an example where we add value
by integrating with other resources.
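The combined query can be sketched as a join between a table of PDB-derived data and a table of externally sourced annotations that share an identifier. Everything in this Python and SQLite example is invented for illustration: the table names, the columns, the placeholder IDs, and the use of a sequence identifier as the linking key. A "better" resolution means a numerically smaller value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Data from the structure archive itself.
CREATE TABLE entry (
    pdb_id      TEXT PRIMARY KEY,
    method      TEXT,
    resolution  REAL,
    sequence_id TEXT   -- identifier shared with the external resource
);
-- Annotation loaded from an external resource.
CREATE TABLE protein_function (
    sequence_id TEXT,
    function    TEXT
);
INSERT INTO entry VALUES
    ('AAAA', 'X-RAY', 1.2, 'SEQ1'),
    ('BBBB', 'X-RAY', 2.5, 'SEQ1'),
    ('CCCC', 'X-RAY', 1.1, 'SEQ2'),
    ('DDDD', 'NMR',   NULL, 'SEQ1');
INSERT INTO protein_function VALUES
    ('SEQ1', 'kinase'),
    ('SEQ2', 'protease');
""")

# X-ray structures, resolution better than 1.5, annotated as kinase.
high_res_kinases = [row[0] for row in conn.execute(
    """
    SELECT e.pdb_id
    FROM entry AS e
    JOIN protein_function AS f ON f.sequence_id = e.sequence_id
    WHERE e.method = 'X-RAY'
      AND e.resolution < 1.5
      AND f.function = 'kinase'
    ORDER BY e.pdb_id
    """
)]
```

The first two conditions use archive data alone; the third is only possible because the external annotation was loaded and linked through the shared identifier.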
So, let's talk a little more
about common identifiers.
How do we actually link
with other resources?
Okay, here is an example.
So, on the left hand side
is the Protein Data Bank
and shown here is one of the proteins
in the Protein Data Bank.
So, the Protein Data Bank
is a 3D structure databank.
We have the 3D structure,
but proteins can also bind drug molecules,
shown here as a drug molecule
that is bound inside of
this protein pocket here
and so we know, in the Protein Data Bank,
we know about drug molecules.
We also know the protein sequence,
which is this one letter code.
Each amino acid has one letter
that makes up a sequence
that's called a protein sequence.
We also capture during
deposition other data.
For example, what is the organism?
Like, in this case, HIV-1.
And all the different types
of data have identifiers.
As you know, the Protein
Data Bank has the PDB ID
as its identifier.
For example, 1OHR is the PDB ID.
So, other resources may use the PDB ID
as an identifier,
and if they do, we can link to them.
The other way to link is small molecules,
like drug molecules.
For drug molecules, we have
an identifier called an InChIKey,
which is shown on the right here.
Again, that's loads of (mumbles) key.
If a molecule in the Protein Data Bank
is present in other resources,
such as maybe PubChem, for example,
similarly, for protein sequences,
protein sequences have a unique ID
called the UniProt ID.
Organisms have a unique ID, a taxonomy ID,
which is assigned by the NCBI.
So, this is just a small example
of common identifiers.
We use those to link to
other data resources.
Here's, again, one question for you.
So, you see the identifiers here
and this InChIKey looks really strange.
Does anybody know what an InChIKey is?
How is that different from
all the other identifiers here?
Any ideas?
- [Voiceover] A whole
bunch of inquisitive looks.
(laughing)
- [Voiceover] Yeah.
- [Voiceover] Nope, I think you
might have to tell them, Peter.
- [Voiceover] Okay.
Yeah, so first of all, it looks like
you can't really pronounce it.
This is not a word, right?
This is a so-called hash code.
This is actually calculated by an algorithm,
where you can take a drug molecule
and it will come up with a unique
signature, a fingerprint,
maybe an idea is this thing
is like a fingerprint, basically,
and we can calculate that.
That's the way an InChIKey
is calculated from the
(speaker drowned out by beep)
there's a one-to-one correspondence
between this drug molecule
and this particular string.
The nice thing about this string,
if you want to try it
out after the lecture,
you take this string,
you paste it into Google, into the text box,
and you will see it.
It will match things.
So, this is the nice thing about it,
that you can use this unique identifier
to find information
anywhere it's being used.
Yeah, so that's a little
bit about the InChIKey.
Whereas the other identifiers,
like PDB ID, UniProt ID,
those are basically serial numbers
that are assigned by the organizations,
you know, that provide those data,
the InChIKey is calculated.
Anybody can take a drug molecule,
calculate the key,
and then we have a unique identifier.
So this is a very unique mechanism
of creating an identifier.
- [Voiceover] Peter?
Can you expand on that just a little bit?
So, what data is used,
or what information
about the drug molecules
is used to create the key?
Is it ...
- [Voiceover] Yeah, so,
maybe if you remember
your chemistry, a drug,
a molecule basically has two components.
It has atoms, like carbon, oxygen,
sulfur, et cetera, et cetera,
that is being used, and molecules also
have bonds that connect elements together.
That's the basic information
that's being used
to calculate this type of a key.
So, there's an algorithm.
You take your molecule,
you can calculate the key from that.
It takes all the information into account,
such as their chemistry and so on.
Okay?
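To make the idea of a calculated identifier concrete, here is a minimal Python sketch. It is not the InChIKey algorithm (the real key comes from the standard InChI software, which first puts the molecule into a canonical form). The function name toy_structure_key and the text encoding of atoms and bonds are assumptions made up for this illustration. It only shows that anyone who starts from the same atoms and bonds computes the same key, with no registry handing out numbers.

```python
import hashlib


def toy_structure_key(atoms, bonds):
    """Toy, hypothetical scheme: hash atoms and bonds into a short key.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (atom_index, atom_index, bond_order) tuples
    Assumes both parties number the atoms the same way; a real
    algorithm canonicalizes the numbering first.
    """
    # Treat (0, 1, 1) and (1, 0, 1) as the same bond, and ignore bond order in the list.
    norm_bonds = sorted((min(i, j), max(i, j), order) for i, j, order in bonds)
    text = ",".join(atoms) + "|" + ";".join(f"{i}-{j}:{o}" for i, j, o in norm_bonds)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest[:16].upper()


# Heavy atoms of ethanol (C-C-O), written down in two different bond orders.
key_a = toy_structure_key(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
key_b = toy_structure_key(["C", "C", "O"], [(2, 1, 1), (1, 0, 1)])
print(key_a == key_b)
```

A serial number like a PDB ID cannot be recomputed this way; it has to be looked up from the organization that assigned it.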
So, here's an example
of how we integrate with
another data resource,
in this case DrugBank.
Now, look on the
left-hand side, under PDB,
we have Chemical structures, for example,
of drug molecules.
We have Protein sequences.
We have 3D structures.
But, we don't really
know if those molecules
are drug molecules,
or what kind of properties they have,
or what names they have.
And here's like a symbiotic relationship
that we have with DrugBank.
DrugBank is hosted in Canada,
at this website drugbank.ca.
So, what they offer,
they also have chemical
structures in there,
in this case, of the drug molecules.
They also have protein sequences,
which are drug targets.
And now the idea is,
okay they have chemical structures,
the PDB has chemical structures,
can we link those two together?
So, we have to provide more information.
So, how will we do that?
Any ideas, how we could
link them together?
It has something to do
with my previous slide.
Think about that, what
I was talking about.
Okay, I'll give you the answer--
- [Voiceover] So, either
it's the InChIKey,
or it's the UniProt ID.
- [Voiceover] Yes, that's right, exactly.
- [Voiceover] Do I get an A on it?
(laughing)
- [Voiceover] Yes, yes. (laughing)
Okay, so yeah, right.
Chemical structures, as we talked about,
we can calculate InChIKeys.
We have an InChIKey.
DrugBank has an InChIKey.
And if the InChIKeys are the same,
that means we're talking
about the same molecule.
In the Protein Data Bank, for example,
we have a protein that
has a drug bound to it.
On the other hand,
DrugBank also has that
molecule in their collection.
And another nice thing,
DrugBank has lots more
information than we have.
They know about names, like
generic and brand names.
They also know a lot more, the
Indication, Pharmacology,
Mechanism of Action,
Route of administration,
et cetera, et cetera.
So, they have a whole host
of additional information,
but using the InChIKey,
we can basically create a link here.
And then as Helen mentioned,
similarly for protein sequences we can use
the UniProt ID as the common identifier
to map proteins in the
Protein Data Bank site
to protein sequences which
represent drug targets.
And let's look at that in more detail now.
So, this is the way it works.
For example, linking drugs
in the PDB to DrugBank,
we take the drug molecule,
and using the atom and bond information,
and the chemistry information,
we calculate this InChIKey.
Then we take the InChIKey,
we look it up in DrugBank,
and actually you can do that yourself.
The only thing you need
to do is copy this text,
the InChIKey, paste it into Google,
and one of the hits that will show up,
is DrugBank here.
So, you can, maybe after
the lecture, try this,
if you're interested.
So, that's the powerful concept of linking data.
Then we take DrugBank,
we extract information, like the PB,
I mean the DrugBank ID.
We wanna link that to the data resource.
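The lookup described here is essentially a join on a shared key. The following Python sketch shows the idea with two small dictionaries. All ligand codes, keys, IDs, and names in it are placeholders invented for this example, and the function name link_by_key is hypothetical; it is not how the RCSB PDB pipeline is actually implemented.

```python
def link_by_key(pdb_ligands, drugbank_entries):
    """Join PDB ligands to external drug records on a shared structure key.

    pdb_ligands: {ligand code: structure key} computed on the PDB side
    drugbank_entries: {structure key: record} provided by the other resource
    Returns {ligand code: record} for every key found in both.
    """
    links = {}
    for ligand_code, key in pdb_ligands.items():
        record = drugbank_entries.get(key)
        if record is not None:
            links[ligand_code] = record
    return links


# Placeholder data only; none of these values are real keys or IDs.
pdb_ligands = {"LIG1": "KEY-AAA", "LIG2": "KEY-BBB"}
drugbank_entries = {
    "KEY-AAA": {"drugbank_id": "DB-EXAMPLE-1", "brand_names": ["ExampleBrand"]},
}

print(link_by_key(pdb_ligands, drugbank_entries))
```

A ligand whose key does not appear on the other side simply stays unlinked, which matches the point that not every molecule in the PDB is a known drug.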
One thing you notice on the
Protein Data Bank website,
is anytime you see orange tags,
that means that it's external data.
So, this is the way we
distinguish data from the PDB,
which usually has some
blue type of color scheme,
versus external data
which we highlight in
this orange color scheme,
plus we also tell you where
we got this data from: DrugBank.
We give you the ID and a link back
to this database as well.
And here we show you various information
such as a description of the drug, synonyms,
indication, pharmacology,
a lot of information we can now retrieve
about this drug target.
So, the basic idea here is the PDB
is not in isolation.
We basically present you a picture of PDB
in the context of biology,
or in this case, pharmacology,
so we can learn more about
this particular structure.
And finally, we can create
what we call the Drug View
in the PDB.
So, for example,
we can search by the generic
or brand name of a drug molecule.
For example, if we type in Lipitor,
we will find,
we will get
this result set here.
And the reason it works
is because in DrugBank,
Lipitor is a brand name.
And because of this
relationship with DrugBank,
we are able to associate Lipitor
with a particular molecule
in the Protein Data Bank,
which is shown in this box here.
And then you see all
the information coming
from the Protein Data Bank
is in those blue boxes.
External information is
indicated in orange boxes,
to ensure that we give proper credit here
to other resources.
And then we also talked about,
we also create images, for example,
as part of our pipeline.
That's shown on the top right.
This is an automatically created image.
It shows how this drug molecule, Lipitor,
interacts with its
protein environment here,
like in dashed lines,
those are so-called
hydrogen bonds, for example.
So, this is an example of
automated image creation
to add value.
And then the bottom right
is one of our 3D viewers,
called Ligand Explorer,
where you can now look
at this drug molecule,
how it binds into a pocket
here on the protein.
So, this is the idea of,
you know, adding value,
something you can't get
from the FTP archive.
So, we link to external resources,
create images, we provide
visualization tools.
We provide search interfaces,
and so on, to give you a rich experience.
Okay, so now talk a little
bit about the proteins
or macromolecules and stuff.
Every protein has some kind of a function,
and here, on a high level,
are five important functions of a protein.
I don't wanna go into details here,
but you may have heard about antibodies.
They're related to your immune system.
Enzymes, they catalyze chemical reactions.
You have messengers,
where messages are sent
around in your body,
and so forth.
So, we also have a lot
of protein annotation
in the Protein Data Bank,
which again, comes from other resources.
And the question is, "How do you search?"
For example, how do you search the Protein Data Bank
by protein function?
So, what we have done is,
that we have used different
kinds of community standards.
And you have already
heard about ontologies,
which are controlled vocabularies.
And we link to ontologies,
or classification schemes,
that the other resources came up with.
For example, there is the
classification scheme
of the Gene Ontology,
which defines biological processes.
And they're listed in
this hierarchical browser.
So the biological process ontology
is basically a hierarchical structure.
For example, it has biological adhesion.
Here of course, it talks
about biological phases.
And then in particular,
a subconcept of biological phases
will be cell cycle phases.
And then cell cycles
have different phases,
like anaphase, for example.
And then this is now linked, to the PDB,
to particular structures.
So, the anaphase annotation
is linked to one PDB ID.
So, there's one entity.
So, there's one linkage here.
Proteins related to the
anaphase of the cell cycle,
we have, for example, 17 entries here.
So this is an example of how we link
in our protein function annotation
to particular PDB IDs,
and then with this hierarchy,
is a nice way to browse for information,
because sometimes you don't know
exactly what you're looking for.
The browser gives you a way
to basically browse through the PDB
and understand you know, what's in there.
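A hierarchical browser like this can be thought of as a tree of terms, where each term collects the entries annotated to it or to anything beneath it. Here is a small Python sketch of that idea. The term names come from the talk, but the tree is heavily cut down, the entry IDs are placeholders, and the names CHILDREN, ANNOTATIONS, and entries_under are assumptions for this example only.

```python
# Parent term -> child terms (a tiny, simplified slice of a hierarchy).
CHILDREN = {
    "biological process": ["biological adhesion", "biological phase"],
    "biological phase": ["cell cycle phase"],
    "cell cycle phase": ["anaphase"],
}

# Term -> entries annotated directly with that term (placeholder IDs).
ANNOTATIONS = {
    "biological adhesion": {"ID01"},
    "cell cycle phase": {"ID02"},
    "anaphase": {"ID03", "ID04"},
}


def entries_under(term):
    """Return all entries annotated with a term or any of its descendants."""
    found = set(ANNOTATIONS.get(term, set()))
    for child in CHILDREN.get(term, []):
        found |= entries_under(child)
    return found


for term in ["anaphase", "biological phase", "biological process"]:
    print(term, len(entries_under(term)))
```

Browsing from the top down, the counts shrink as the terms get more specific, which is what makes this useful when you do not yet know exactly what you are looking for.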
And we provide quite a number of browsers.
See, this is cut off.
We actually have more to look through.
Biological process, or cellular component,
which will tell you which part of a cell
this protein occurs in.
For example, is it in the mitochondria?
Is it in the nucleus,
et cetera, et cetera.
You heard a lot about enzymes.
Enzymes have a classification scheme
called the EC Number Scheme.
So, you can basically browse the PDB
using the Enzyme Classification,
et cetera, et cetera.
One browser we have is
called the MeSH Term browser.
And I was just wondering,
does anyone know what a MeSH Term is?
Maybe the librarian.
- [Voiceover] Medical Subject Headings.
- [Voiceover] That's right.
Yeah, exactly.
So, those are the
medical subject headings.
So, the medical subject
headings are keywords
that MEDLINE or PubMed is
using to index articles.
So, they assign keywords to every article.
Those keywords are not free
text, not free terms,
but they come out of a
prescribed set of words
that's being used.
And MeSH itself also has a hierarchy to it.
So, if you go to the MeSH Term browser,
you can browse the PDB
by those MeSH Terms.
But, MeSH Terms are assigned to articles,
not PDB IDs.
So, what does that mean here?
It's that every Protein
Data Bank structure
is related to a primary citation.
And that primary citation is
related to a PubMed article.
And then for the PubMed
article, we have MeSH Terms.
So, again here you see how
we link to other databases,
and enrich our search capabilities.
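The chain described here (structure, then primary citation, then article, then subject headings) can be sketched as two lookups followed by an inverted index. The Python below is a minimal illustration under that assumption. The entry IDs, article IDs, and terms are placeholders, and pdb_ids_by_mesh_term is a hypothetical helper name.

```python
def pdb_ids_by_mesh_term(primary_citation, mesh_terms):
    """Build an index from subject heading to the structures it leads to.

    primary_citation: {entry ID: article ID of its primary citation}
    mesh_terms: {article ID: list of headings assigned to that article}
    """
    index = {}
    for entry_id, article_id in primary_citation.items():
        for term in mesh_terms.get(article_id, []):
            index.setdefault(term, set()).add(entry_id)
    return index


# Placeholder data; not real entries, articles, or heading assignments.
primary_citation = {"ENTRY1": "ARTICLE-1", "ENTRY2": "ARTICLE-1", "ENTRY3": "ARTICLE-2"}
mesh_terms = {
    "ARTICLE-1": ["Term A", "Term B"],
    "ARTICLE-2": ["Term B"],
}

index = pdb_ids_by_mesh_term(primary_citation, mesh_terms)
print(sorted(index["Term B"]))
```

The headings never mention a structure directly; the link exists only because both sides share the article identifier.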
Okay, so far we were talking about
searching well-defined terms.
Now, we are talking about
things that are more ambiguous.
And one problem the entire
biological community has
has to do with protein naming.
As you may have heard, many people
call the same protein different names.
Everybody has their favorite name
for their favorite protein,
and this causes some confusion,
in terms of searching.
So, to help with that, UniProt,
which is a sequence database,
they provide what is
called a Recommended name.
And here is an example on the board.
The recommended name for
this particular protein
is called Annexin A5,
but then as you see,
there are lots of alternative names.
I don't know, maybe 20 or so
different alternative names.
By integrating with UniProt,
we can provide a search function now.
You can search by the recommended name,
or you can search by any
of the alternative names,
and you will still find the right protein.
So again, this is another example of,
you know, our integration
with other resources,
where they basically
provide synonyms for us,
that we otherwise don't know about.
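One simple way to picture synonym search is an index in which every alternative name points back to the recommended name. The Python sketch below assumes that approach. The recommended name is the one from the slide, while the two alternative names are only an illustrative subset that should be checked against the UniProt record, and build_name_index and find_protein are hypothetical helper names.

```python
def build_name_index(recommended, alternatives):
    """Map every known name (case-insensitive) to the recommended name."""
    index = {recommended.lower(): recommended}
    for name in alternatives:
        index[name.lower()] = recommended
    return index


def find_protein(query, index):
    """Return the recommended name for a query, or None if it is unknown."""
    return index.get(query.strip().lower())


# Alternative names here are illustrative only.
index = build_name_index("Annexin A5", ["Annexin V", "Anchorin CII"])

print(find_protein("annexin v", index))
print(find_protein("Anchorin CII", index))
```

Whatever name a user types, the search resolves to one recommended name, and from there to the same protein.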
Okay, so we talked about
problems with protein names.
We also have problems,
unfortunately with author names.
Many authors have the same names.
You know, that's a big problem.
I think you already heard about ORCID,
which is a unique researcher identifier.
And I hope most of you have an ORCID ID.
If you don't, I suggest
you go and get one today,
because if you wanna publish any papers,
you really wanna make sure that
you get credit for your work.
So, that's what the ORCID is about.
And just to show you the issue that
we have with author names,
down on the bottom,
is an example of our
autocomplete function.
Now, on the website,
you type the name Smith,
and obviously many people
have the last name Smith here,
but then there's Smith
D, which has 158 entries,
but I'm pretty sure is
not a single person.
There are probably multiple people
who share that same initial D.
So, this is part of the problem.
If we have ORCID IDs,
we can then disambiguate those authors.
That's the idea of ORCID,
and as you may have heard,
the new deposition system
of the PDB will in the
future use ORCID IDs,
so we can uniquely identify authors,
and they can get the proper
credit for their work.
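The disambiguation problem is easy to see in code: grouping by name merges different people, while grouping by a unique researcher ID keeps them apart. This Python sketch uses made-up records; the ID values are placeholders rather than real ORCID iDs, and group_entries is a hypothetical helper.

```python
from collections import defaultdict


def group_entries(records, key):
    """Group entry IDs by the given field of each record."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["entry"])
    return dict(groups)


# Placeholder records: two different people who share a name and initial.
records = [
    {"entry": "ENTRY1", "name": "Smith, D.", "researcher_id": "ID-PERSON-A"},
    {"entry": "ENTRY2", "name": "Smith, D.", "researcher_id": "ID-PERSON-B"},
    {"entry": "ENTRY3", "name": "Smith, D.", "researcher_id": "ID-PERSON-A"},
]

print(group_entries(records, "name"))
print(group_entries(records, "researcher_id"))
```

By name, all three entries fall into one bucket; by identifier, each person gets credit only for their own entries.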
Okay, so it's not just protein and author
that can get unique IDs,
basically any digital object
can get a unique identifier.
I am sure you have
heard about DOIs before,
digital object identifiers.
They are used extensively by journals,
but it can also be used for datasets,
and for example the PDB provides DOIs
for every PDB entry.
If you go to a particular
PDB structure summary page,
for example, 2TRX,
you see the first thing after the title,
we show the DOI, which is 10.2210.
That's basically the
first part of the DOI,
that's the organization, that's the wwPDB,
and then comes the PDB code here.
So, every PDB dataset has a
digital object identifier,
so they can be uniquely identified,
and they can be used for example,
in publications to specifically
link to this particular data set.
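Since the DOI is built from an organization prefix plus the PDB code, it can be generated mechanically. The Python sketch below assumes the prefix 10.2210 and the pattern 10.2210/pdbXXXX/pdb for an entry, resolved through https://doi.org/. That pattern is stated here as an assumption to verify against wwPDB documentation, and the function names are hypothetical.

```python
def pdb_doi(pdb_id):
    """Build a DOI string for a 4-character PDB ID (assumed pattern)."""
    code = pdb_id.strip().lower()
    if len(code) != 4 or not code.isalnum():
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return f"10.2210/pdb{code}/pdb"


def doi_url(doi):
    """Turn a DOI into a resolver link that stays stable over time."""
    return f"https://doi.org/{doi}"


doi = pdb_doi("2TRX")
print(doi)
print(doi_url(doi))
```

A publication would cite the DOI or the resolver link rather than the address of a particular web page.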
And that becomes more and more important
as people want to cite data,
data sets.
To get a DOI, you need to
register your organization,
and there's a fee involved
for it to, you know,
for it to be able to assign the DOIs.
And then secondly,
there are other services.
For example, Zenodo in Europe,
they currently offer free
archiving and DOIs as well.
Yeah, so I have a question.
Why would the PDB use DOIs?
I mean, you can just go to our
database and find it there,
but why would you use a DOI
to cite in an article you write?
Any idea?
- [Voiceover] The DOIs
can let people to cite it,
so they can consistently, persistently
be linked in the future?
- [Voiceover] Yeah, that's right.
The thing is about persistence here,
because a URL may change,
but the DOI will always be the same.
So, if you go here in 20 years from now,
the DOI will be still the same.
The specific web address may have changed
in the meantime.
That's really the key about
persistence, long-term persistence,
of identifiers.
Alright, so there's--
- [Voiceover] Peter, I think
there's also the issue,
there's also the issue that
some percentage of the data
never actually gets
published in a journal,
but by having a DOI,
it's a citable object,
and that's gonna be
more and more important.
- [Voiceover] Yeah, right, exactly.
Mm-hmm, yeah.
So, we talked about common identifiers,
but let's talk about the data themselves.
If we wanna make data interoperable,
we need to use common data formats
to make, you know, to
easily exchange data.
So, first of all,
if we're talking about data,
they need to be machine readable,
but ideally, they should also
be readable by humans as well.
So, shown on this page are
three common data formats.
One is called the comma separated value,
or tab separated value format,
which you may have seen before,
like Excel, for example.
It's a type of format where
you have rows of data.
They are separated by commas.
First, there's a header line.
For example, it says the
first column is PDB ID.
The second column is
called alpha and beta.
Et cetera, et cetera, and
then you see the actual data.
So, this is CSV,
which works well with any
kind of spreadsheet software.
Then, we have the XML format,
where the data is tagged
and has relationships.
So a tag is the thing in the
smaller-than and greater-than signs.
Structure, PDB, structure,
those are tags, yeah?
And they clearly identify
the data item by name.
And there are also relationships.
You see that, there's a hierarchy to it.
So, there are relationships in there.
Now, these days, JSON is very popular.
JSON is as you see, is
very similar to XML.
Except it's a lot
more human readable.
And the other advantage,
does anybody know what
the advantage of JSON is?
Why it's so popular these days?
- [Voiceover] Because you can use it
on (speaking off microphone).
- [Voiceover] Yeah, that's right.
Basically, web browsers can natively
interpret the JSON format.
That's why it's so popular these days.
And you see there's really a
simple one-to-one correspondence
between XML and JSON.
You can go back and forth
between those two formats.
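To show how the same record looks in the three formats, here is a short Python sketch using only the standard library. The column names loosely follow the slide (an ID column plus alpha and beta), the row values are invented, and the XML element names structures and structure are assumptions for this example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Invented example data: a header line followed by one row.
CSV_TEXT = "pdb_id,alpha,beta\nID01,0.45,0.10\n"


def csv_to_records(text):
    """Read CSV text into a list of {column name: value} dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def records_to_json(records):
    """Serialize the records as JSON text."""
    return json.dumps(records, indent=2)


def records_to_xml(records):
    """Serialize the records as XML, one tagged element per data item."""
    root = ET.Element("structures")
    for record in records:
        node = ET.SubElement(root, "structure")
        for name, value in record.items():
            ET.SubElement(node, name).text = value
    return ET.tostring(root, encoding="unicode")


records = csv_to_records(CSV_TEXT)
print(records_to_json(records))
print(records_to_xml(records))
```

In CSV the column name appears once in the header, while in XML and JSON every value carries its own name, which is what makes the nested, hierarchical forms possible.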
Now, for the PDB, there's another format
that's commonly used.
Anybody know what that format is called?
Well, I'll give you the answer.
It's the CIF format.
For macromolecular
structures it's a bit different,
so, this is mmCIF, you have
heard a lot about that.
It was actually created way
before XML and JSON even existed
but the idea is pretty much the same.
You tag your data, essentially,
make sure it's both machine readable,
but it's also human readable.
So, the RCSB PDB has come up
with a set of tools for interoperability, so,
those tools are available at
this particular web address.
The mmCIF format, the way it's structured,
can also be easily converted to XML
or it could also be converted
to JSON very easily.
So, this is the idea of
a machine readable form
that's very important if we
want to do any computer work,
rather than do anything the manual way.
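As a rough illustration of why tagged data converts easily, the Python sketch below reads simple one-line "_category.item value" pairs and turns them into a nested dictionary that can be written as JSON. It is deliberately incomplete: it ignores loops, multi-line values, and most of the real syntax, so it is not a substitute for a proper mmCIF parser. The sample text and the function name parse_simple_items are made up for this example.

```python
import json

# Made-up sample using the tag style "_category.item value".
SAMPLE = """data_EXAMPLE
_entry.id EXAMPLE
_struct.title 'Example title'
"""


def parse_simple_items(text):
    """Parse single-line tag/value pairs into {category: {item: value}}."""
    data = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skip the data_ header and anything not a tag line
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # tag without a value on the same line: not handled here
        tag, value = parts
        category, _, item = tag[1:].partition(".")
        data.setdefault(category, {})[item] = value.strip("'\"")
    return data


print(json.dumps(parse_simple_items(SAMPLE), indent=2))
```

Because every value already carries a category and an item name, the step from tagged text to JSON or XML is mostly a change of punctuation.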
Alright, so, so far we talked about
the databases, clearing,
licensing, and so on.
My question is, hosting.
So, we're hosting the Data Bank website.
How can we do that
and what kind of options are
there for hosting a website?
Well, we'll go into that a little bit.
But before I do that, any questions
about the previous section here?
Okay, if not, I'll go ahead
with Data Hosting Options.
Well, there are a number
of ways to do that.
The traditional way is, say,
you have a single server
that serves one particular function
and the idea is here,
for example, let's say,
looking at the website that's
hosting this particular course,
the EDSB course, that course website,
is most likely, I'm not sure about that,
but it's very likely
hosted on a traditional server.
A very small server has
a particular function,
but this is not very scalable.
Let's say now you want to
host 1,000 different websites,
and you would need to buy
1,000 traditional servers
and it gets expensive.
So then, people had the idea,
we can create a virtual machine
and the idea here is you can host
multiple websites or
services on a single machine
and the basic idea, here's my analogy,
traditional server, as
you can think about it,
it's like your home.
It's a single house.
One family lives in each house.
The idea of a virtual machine is more like
an apartment building where you have
multiple tenants, essentially.
That's kind of the basic idea.
It gets more scalable as you go to the right
and finally, we have cloud services
and you may have heard about
Amazon and Google hosting Cloud services.
The idea is it is more
scalable, so for example,
when Amazon hosts their
own software system,
they need to be able to
accommodate holidays,
holiday sales.
They have a huge peak in
traffic during the holidays.
At that time, they need more resources.
They obviously didn't want to work
with the traditional server.
They cannot buy many millions of servers
just to support holidays,
so they need a more scalable model.
This is where Cloud services come in.
So, let me show you a
little more detail here.
So, traditional servers, mostly used
to host a single service,
like a webpage or a blog,
something like that.
On a virtual machine, you
can share multiple websites
or multiple services
on the same machine.
And then, there's the Cloud,
which provides what is called elasticity.
Maybe one day, you need lots of resources,
the other day you need less resources.
On a Cloud, you can basically
adjust your resources as needed.
So, the PDB has basically
gone all the way,
so, like most examples, it started
on a traditional server.
Then, the problem was, we're
getting more and more users.
Just to give you an idea, the
Protein Data Bank website,
or the RCSB PDB website,
is visited by about 30,000 per month.
So, maybe 10 years ago, it
was just half as many people.
In every generation,
we would need to buy
more and more and more
hardware to support that,
and that's obviously not scalable,
so then we went, the current PDB website
runs on virtual machines
where a single machine may host
multiple instances of our websites,
and they can be moved.
For the Cloud infrastructure, we
have our own in-house cloud.
The services that we host are
much more flexible than they used to be.
Well, that gives you a little
bit of an idea on the back end,
you know, how we are hosting our database,
and what ...
Any questions about that, yes.
- [Voiceover] Just a comment that
the EDFB pilot course site
is actually on your cloud.
- [Voiceover] Okay, oh that's good news.
So, that's a good example.
So, we can basically host
many courses in the future.
We have no problem.
Okay, so now I'm gonna
show a little case study.
Again, it's about linking
with other data sources.
This time, we're
talking about new science,
linking with genomic data.
As you know, there's a huge trend
in next-generation sequencing,
where you get patient sequence data,
people with cancer, et cetera,
their genomes are being sequenced,
and we look for any kind of
changes in the DNA, basically.
And you will see, those changes in DNA,
we can relate to 3D protein structures
to gain more understanding
maybe what caused the
disease and so forth.
So, I'm going to go through a few slides,
to give you an example of that.
So, first of all, I need to
tell a little bit about biology.
So, this may be a little bit difficult,
if you haven't, you know,
if you're not a biologist,
but it all starts out with a DNA sequence.
I'm sure you have all seen
the double-stranded DNA.
The DNA is basically
the blueprint of life.
It's a sequence of nucleotides
that basically determines
a lot of, you know, everything about,
about life, essentially.
But your body contains proteins,
and there's a biological process
to get from the blueprint
with the DNA to the protein,
and there's several enzymes involved,
such as RNA polymerase,
which is an enzyme that actually
is also in the Protein Data Bank.
RNA polymerase creates messenger RNA,
which is shown here,
And then RNA ...
And then we create an RNA,
and RNA now needs to be
translated into proteins.
Again, there's an enzyme involved.
It's called the ribosome.
So, the ribosome basically takes the RNA,
and creates a protein out of it.
Basically, there are three
important biological molecules.
There's DNA, RNA, and protein.
And all of those molecules
are in the Protein Data Bank.
Although it's called Protein Data Bank,
it's really a macromolecular data bank.
It contains DNA, RNA, and protein.
That's shown on the right-hand side.
There are structures of DNA.
See the double helix?
We have RNA here.
What is shown here, is
actually not messenger RNA.
It's a different kind of RNA.
There are different types of RNA,
there's ribosomal RNA, messenger RNA,
transfer RNA, et cetera, et cetera.
So, there are different types of RNA here.
And then what we are the most familiar with,
and know the most about,
are protein structures.
So, all of them have 3D shapes,
and those 3D shapes are
in the Protein Data Bank.
And then the 3D structure
determines its function.
So, it determines if a
protein is an enzyme,
or transporter, or messenger,
et cetera, et cetera.
So, this is a key concept,
that you start with a sequence,
which you can get from DNA sequencing,
or protein sequencing.
From that you can
determine the 3D structure,
which ends up in the Protein Data Bank,
and that structure then
determines function,
so it's very important to know structure,
because then we can
make some inferences
about function from it.
And then function is
also related to disease,
which I will talk about.
Basically, what happens is
in some patients there may be
a change in the DNA called a mutation.
At a single point in this long DNA strand,
which is about three billion
nucleotides long,
there can be a single mutation
somewhere along the line,
that will cause a defect
that will show up in
the protein structure,
and we can then look at
the Protein Data Bank,
and see what this mutation might do,
why it may cause a particular disease.
I'll show an example of that.
So, first of all, we need to map
all the way from genome information
to the Protein Data Bank.
So, we have developed a pipeline
where we take a database
of curated human variants.
Variants basically means mutations,
or changes, or defects in the DNA,
coming out of this HGNC database.
We first match that to a protein sequence,
using the UniProt resource,
and then finally we map
the UniProt sequences
to the Protein Data Bank,
which allows us to map changes in DNA
to the protein sequence,
to a 3D structure.
And that allows us now to have some
understanding about protein function.
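The mapping chain can be pictured as two small steps: turn a position in the coding DNA into a residue number, then ask which structures cover that residue. The Python sketch below makes strong simplifying assumptions: positions are counted within the coding sequence only (no introns, strands, or isoforms), and the coverage ranges and entry names are placeholders. Both function names are hypothetical and do not describe the actual RCSB PDB pipeline.

```python
def residue_number(coding_position):
    """Map a 1-based coding-sequence position to a 1-based residue number.

    Every three nucleotides encode one amino acid, so positions 1-3
    belong to residue 1, positions 4-6 to residue 2, and so on.
    """
    if coding_position < 1:
        raise ValueError("positions are 1-based")
    return (coding_position - 1) // 3 + 1


def structures_covering(residue, coverage):
    """Return entries whose residue range (inclusive) contains the residue."""
    return sorted(
        entry for entry, (start, end) in coverage.items() if start <= residue <= end
    )


# Placeholder coverage: which stretch of the sequence each entry contains.
coverage = {"ENTRY1": (1, 300), "ENTRY2": (250, 500)}

residue = residue_number(1277)
print(residue, structures_covering(residue, coverage))
```

If no structure covers the residue, the variant can still be mapped to the sequence, but there is no 3D context to look at.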
So, we provide different kinds of views
on the Protein Data Bank
to see this relationship.
We talked about DNA sequences.
So, what you see here,
those letters, A, C, T, and G,
those are the letters of the human genome.
So, there are about
three billion letters.
That's the human DNA,
and every
three nucleotides
are related actually to one amino acid.
This is the protein sequence here.
You see the relationship
between the DNA sequence
and the protein sequence.
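The relationship between the DNA letters and the protein letters follows the standard genetic code, where each group of three nucleotides maps to one amino acid. This Python sketch builds that table and translates a short, made-up coding sequence. The sequence is not taken from any real gene, and it assumes the reading frame starts at the first letter.

```python
# Standard genetic code, listed with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate coding DNA into one-letter amino acids ('*' marks a stop)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Made-up sequence; the second one differs by a single nucleotide.
print(translate("ATGCTGGCC"))
print(translate("ATGCCGGCC"))
```

Changing one letter in the DNA changes one codon, and in this example that changes one amino acid in the protein.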
And at the same time,
we know this particular
protein structure here,
is actually in the PDB,
so this blue bar here represents an area
where we actually have
a 3D structure there,
which means if at this particular nucleotide,
if there's a mutation in your DNA,
we can tell, at this
position in your protein,
what effect this may have,
and down here, for example,
you see the 3D structure related to
this particular
gene that's being shown here.
Okay, so here's a more zoomed-out view,
as you may know,
the DNA's packaged in chromosomes.
What you see here on top
is a visualization of a chromosome,
which has those different bands here.
And here's a particular
position on the chromosome.
We can map that onto the PDB,
and in this case, there's
no structure there,
but every time you see
a little vertical bar,
that means we have a protein available
in the Protein Data Bank here.
So, we talked about DNA sequences.
DNA can be mapped onto a protein,
and we also have a visualization for that.
So, this is a visual representation
of integrating protein data here.
So, on top in green,
this is information we get
from the UniProt database.
This gray bar represents
the length of a protein sequence here,
which is about in this case,
about 2000 amino acids long.
That's this representation here.
Then we get all kinds of other
annotations from UniProt.
For example, it tells us that
this is the breast cancer type 1
susceptibility protein here.
We get other information
from other databases,
like Pfam, and Forbes website,
and Forbes, et cetera,
which is presented in
a graphical way here.
Where in grey, we have calculated data
also represented here.
And then finally in blue down here,
this is PDB data.
So, every blue bar means
that for this stretch of sequence, whoops.
There's a piece of 3D structure there,
so this maps 3D structure
onto the protein sequence right here.
So, this is a graphical example
how we map information.
And now, this is a very
nice view for biologists,
to look at data in a graphical way,
rather than just looking
at text or web pages.
Okay, so how does that relate to disease?
Well, here's an example.
So, there's a disease called
metachromatic leukodystrophy.
So, this particular disease
is caused by a single mutation
where a single leucine amino acid
is exchanged for proline at
position 426 in this protein.
Just a single change can cause, you know,
can cause a disease,
so this is an example where a
single amino acid change
can cause disease,
and you can see the change.
Here's a close up, you
can see the mutation
here happening in this area here.
What happens is
the wild-type, or the
natural form of this protein
forms an octamer.
That means eight copies of that protein
form this kind of a ring structure here.
But, once you introduce this mutation,
in this highlighted area here,
it can no longer form this shape,
this ring shape of an octamer,
instead it forms what is called a dimer,
which are just two copies
instead of eight copies.
Just this single change
causes basically this particular disease.
Well, this is the idea of
mapping changes in the DNA,
which we may get, you know,
from individual sequencing of patients,
and mapping this all the
way to the protein sequence
through the 3D structure,
and then the 3D structure
will allow in some instances,
we can understand why it's causing
any particular problem.
So this kind of basically
rounds up my presentation.
Basically we have shown you
how to access the data,
how to search the data,
and finally I've shown
you how to integrate
data from other resources to add value.
And as you see, the
value that we add here,
is that we can take patient data,
map it onto a 3D structure,
and begin to understand
how that may be related to disease.
So, this is basically,
you know, the idea to
go from a data resource
to provide more, to provide knowledge,
not just data, but actually
to provide knowledge.
So, again, the Protein Data Bank website
will provide more tools
to analyze those types of,
you know, relationships, essentially.
And okay, I think that
ends my presentation.
Any questions?
- [Voiceover] Any resources?
So, Peter, thank you very much.
That was a really
very good lecture.
(speaker garbled)
And appreciate that and especially,
everybody should know that
it's 7:30 in the morning,
and Peter's been lecturing
since 5:30 in the morning.
And we really do
appreciate that very much.
So, what I think we will do,
is we will take a 10 minute break,
and then Jose is going to go through
an exercise with us.
So, at 10:35,
7:35 your time,
we'll reconvene.
Okay?
- [Voiceover] Okay, thank you.
- [Voiceover] Thank you
very much, Peter, thank you.
