Thank you for participating in the NIH Metabolomics
Scientific Interest Group Webinar Series.
Today’s webinar from Dr. Tom Metz will focus
on “Shifting the Metabolomics Paradigm:
Exploiting Computationally Predicted Metabolite
Reference Data for Comprehensive Metabolomics.”
Before introducing our speaker, I’ll quickly
review a few logistics. Should you experience
technical difficulties, you can contact us
through the question box in the expanded control
panel, or by phone or by email. If you need
to view live closed captioning, please click
on the link that will appear in the chat box.
At the end of Dr. Metz’s presentation, we
will allow time for questions. If at any time
during the presentation you have a question,
please type it into the question box and I
will ask it on your behalf during the Q & A
session. So, it is my pleasure to introduce
Dr. Tom Metz. Dr. Metz earned his bachelor’s
degree in biology and chemistry before earning
a PhD in chemistry from the University of
South Carolina where he studied the role of
Maillard chemistry in the development of diabetic
complications via the chronic cumulative chemical
modification of tissue proteins. In 2003,
he joined Pacific Northwest National Laboratory
for postdoctoral work in mass spectrometry
with Dr. Richard D. Smith, where he focused
on metabolomics. He became a staff scientist
and a principal investigator in the Integrative
Omics group in 2005 and is the metabolomics
lead for a group of scientists that focuses
on development and application of high-throughput
metabolomics and lipidomics methods to various
biological questions. Dr. Metz’s research
has focused primarily on applying MS-based
omics approaches, including proteomics, in
studies of diabetes mellitus and infectious
diseases; this has resulted in over 150 publications
to date. Currently, he is the director of
the Pacific Northwest Advanced Compound Identification
Core within the NIH Common Fund Metabolomics
Program. So, we’re looking forward to hearing
from Dr. Metz today, and welcome.
Thank you, Krista. I appreciate the invitation
and the opportunity to speak today. Hopefully
everyone can hear me. I’m about to share
my slides. So, oops. Let me know if you cannot
see those. I will move this little panel out
of the way a little bit.
Tom, we can see them and we can hear you well,
so, you’re good to go.
All right, great. Yep, thanks again. Again,
I want to thank Krista for the opportunity
to speak. I’d also like to thank Bryce,
who’s our technical support person, who’s
been very helpful in working the logistics
here. As Krista mentioned, I’ll be talking
about what our vision and our current research
are at Pacific Northwest National Laboratory,
and also in the PNACIC—our Compound
Identification Development Core, funded by
Program Phase II. And really what we’re
focusing on is exploiting predicted reference
information—tandem mass spectra, CCS values,
and so forth, anything that we can get ahold
of and that seems to have good predictive
accuracy—and to use that for a confident
metabolite identification. And I also thank
you as the attendees for connecting. And I
hope you’re all staying safe and healthy
and sane in these challenging times. And speaking
of which, I just wanted to lighten the mood
a little bit and ask you to vote whether you
like “Gary Busey Tom” or “Baby Face
Tom.” So the photo on the left was after
several months of not having a haircut or
shaving after we went into our stay-at-home
situation in Washington state, and then finally
my wife said: “You have to clean up.”
And so, I bought a pair of clippers and she
cut my hair. And now I look, according to
all my friends, 10 years younger than I did
before. So cast your votes. And, you
know, I’d be interested to see what you
think. So this is the outline for the talk.
I first want to talk about my personal experiences
that have influenced how I think about metabolomics.
I think that’s an important thing to do,
in general in life, to try and understand
where another person’s coming from before
you pass judgment on them or what they’re
doing. And especially in today’s times I
think it’s very much needed. I’ll then
talk about what we’re doing, the work that
we’re doing under our PNACIC core, and then
I’ll show some initial applications of the
use of in silico reference libraries for compound
identification. So, some background perspective
on things that I’ve experienced through
my research up until now: I performed graduate
research studying Maillard chemistry, which
is non-enzymatic chemistry and was first
characterized by Louis Maillard around 1912.
He did some experiments heating glucose and
glycine, and then looked at the products that
were formed and characterized these as brown
pigmented products or melanoidins. And then
subsequent work through the years further
characterized the detailed molecular chemistry
that occurred during that heating of glucose
and glycine and resulted in the elucidation
of the first stable product of that reaction,
which is known as the Amadori compound. And
this type of chemistry, Maillard chemistry—also
known as browning chemistry—is very important
in food chemistry because these products lead
to the compounds that affect the flavor and
the aroma and the texture of food. And so
all that research was fascinating, of course,
to learn why food tastes the way it does.
But really it didn’t get, you know, that
exciting for the research community until
this type of chemistry was shown to have relevance
in vivo. And in 1968, Samuel Rahbar, who you
can see here, was doing some basic studies
of different types of hemoglobins—different
isoforms that were present in certain populations
in the Mideast—and he identified a fast-moving
form of hemoglobin that eventually was characterized
as HbA1C, and was found to have a molecule,
glucose, bound to the beta-chain amino terminal
valine residue of hemoglobin. And today HbA1C
is actually used for long-term monitoring
of blood glucose control in individuals with
diabetes. You know, most of the kits that you have
for long-term assessment of glucose control
involve looking at HbA1C.
So, I found it fascinating that this non-enzymatic
chemistry that occurs when you cook food also
occurs in people. And my graduate advisor,
John Baynes, at the time said, “Well, people
are basically ovens; they just have a lower
temperature, but a much longer cooking cycle.”
So, you know, these things began to make sense.
And in fact, we become browner as we age.
And what you see here are costal cartilage
samples—the cartilage that is between the
ribs—and you can see that as you age, this
cartilage becomes darker and darker and browner
and browner and begins to take on the characteristic
pigments that are very typical of Maillard
chemistry. And so essentially, the human body
is accumulating these products of non-enzymatic
reactions on long-lived tissue proteins—those
proteins that don’t turn over very frequently.
And part of my project was to assay for a
small molecule inhibitor of these types of
reactions and to look for reactions of that
inhibitor with molecules that had free carbonyl
groups, aldehydes and ketones, and then to
quantify those in the urine of diabetic animals.
And so that was my first adventure into small
molecule characterization. And I really had
a strong appreciation for the role of chemistry,
as it affects biology, coming out of that
research group. So then I moved to Pacific
Northwest National Laboratory. I’m in my
17th year now here, and at PNNL I focused
on mass spectrometry in general, but in terms
of specific applications, it was in untargeted
omics measurements—metabolomics, but also
proteomics. And today I lead a team
within our larger group, the Integrative Omics
group—it’s about 80 people—and my team
focuses on metabolomics. And while at PNNL,
I also had a strong appreciation for the finer
points of different types of instrument development—whether
it’s on ion optics, like you see here with
the ion funnel, or LC developments—these
LC systems are all high-pressure systems and
they were developed several years before Waters
then commercialized the UPLC system, and other
companies have come on now with their own
high-pressure systems. We’ve also developed
software for treating the data. And now the
group largely is moving more and more towards
gas-phase separations, namely ion mobility, and the
IMS work has predominantly been led by Dick
Smith, but also others in the group here as
well. Which brings me to metabolomics and
the current grand challenge of metabolomics.
You might hear Lloyd Sumner give talks and
he’ll talk about the grand challenge of
metabolomics, and it’s really unidentified
metabolites. And I really love to use this
example because I think it really makes the
point of this challenge that we have. This
is data from the Undiagnosed Diseases Network
where we were the metabolomics core from 2015
to 2018. And the purpose of this UDN program
is to perform different types of assays on
individuals who have rare disorders that have
yet to be characterized by medicine. And so
we perform metabolomics analyses. And the
gray dots that you see here are reference
data from our analyses of about 120 people
who had no known metabolic disease, which
we used as background information. And then
we have metabolomics data from a patient in
the UDN, indicated here as the proband, and
then that patient’s mother and father. And
so you can see how far beyond the average
metabolite relative abundance certain metabolites
are for these three individuals. But I draw
your attention to a metabolite here that was
very, very interesting that was four z-scores
beyond the average profile for that molecule
in the background data. And it was high in
all three individuals: the patient, the mother,
and the father. And so, in the context of
the Undiagnosed Diseases Network, we were
very interested in this. But then you scroll
over and it’s a metabolite that we can’t
identify because we don’t have reference
information for it in our reference library.
And so, very frustrating. You know, we don’t
know if this is indicative of the person’s
disease. We don’t know if it’s because
all three individuals had the same thing for
breakfast. We don’t know if they got new
carpet in their house, the carpet is off-gassing,
and they’re accumulating those compounds.
Very frustrating. But this is pretty typical
of a metabolomics study where, depending on
the technology used, you usually identify less
than 10% of the data that you can detect.
And usually a large majority of the things that
you can’t identify happen to be the things
that have high fold changes and low p-values.
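To make the z-score comparison concrete, here is a minimal sketch of that kind of screen of one subject against a reference cohort (toy data, not the UDN pipeline):

```python
import numpy as np

def metabolite_z_scores(reference, sample):
    """Z-scores of one sample's feature abundances against a
    reference cohort (features x subjects). Hypothetical layout."""
    mu = reference.mean(axis=1)             # per-feature cohort mean
    sigma = reference.std(axis=1, ddof=1)   # per-feature cohort SD
    return (sample - mu) / sigma

# Toy example: 1,000 features measured in 120 reference subjects.
rng = np.random.default_rng(0)
ref = rng.lognormal(mean=3.0, sigma=0.5, size=(1000, 120))
proband = ref[:, 0].copy()
proband[42] *= 8                            # spike one unidentified feature
z = metabolite_z_scores(ref, proband)
print("features with |z| > 4:", np.flatnonzero(np.abs(z) > 4))
```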
So it’s frustrating to have such obvious
information tied into phenotypes that we can’t
identify. But this is really a characteristic
of our current metabolomics paradigm and how
we actually confidently identify molecules
in our high-throughput screening studies.
And the way that we do that now is we analyze
pure substances that we buy from chemical
suppliers. We send it through our instrumentation.
We measure, you know, the properties that
we’re interested in, then we put that information
in a database. Then we analyze our samples
of interest—clinical samples for example—on
the same instrument platform. We take
the experimental data, match it to the library,
and for things that match well, we assume
a confident identification.
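In code, that matching step might look like the following minimal sketch (the entries and tolerances are illustrative only, not any particular laboratory's library):

```python
# Minimal sketch of the current library-matching paradigm:
# reference entries measured from purchased standards, then
# experimental features matched by m/z and retention-time tolerance.
library = [
    {"name": "glycine betaine", "mz": 118.0863, "rt": 8.3},
    {"name": "creatinine",      "mz": 114.0662, "rt": 5.1},
]

def match_feature(mz, rt, ppm_tol=10.0, rt_tol=0.5):
    hits = []
    for entry in library:
        ppm = abs(mz - entry["mz"]) / entry["mz"] * 1e6
        if ppm <= ppm_tol and abs(rt - entry["rt"]) <= rt_tol:
            hits.append((entry["name"], round(ppm, 2)))
    return hits

print(match_feature(118.0861, 8.25))   # -> [('glycine betaine', ...)]
```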
So, the problem with this entire paradigm is that
typically any laboratory has a few thousand
pure substances in house that they can use
to build their libraries. And so we, as metabolomics
scientists, pretty much identify the same
molecules over and over again because those
are the same molecules that we can buy from
chemical suppliers. If you look through the
literature at some of these analyses of what
we call chemical space, the expanse of chemicals
that are thermodynamically stable that might
appear in nature—some estimates put that at
anywhere from 10^30 to 10^60 compounds,
and, you know, a lot of those are based on
combinatorial approaches—but the point being,
there’s a lot of chemistry out in the world
that we have an inability right now to confidently
identify in our high-throughput screening
studies. So I want to contrast this to a typical
proteomics study. I’ve conducted a number
of proteomics studies in my time here at PNNL
and so, you know, I love to draw this contrast.
So, we can take protein samples from any type
of tissue or biofluid. We send it through
our sample processing and analytical pipelines
and we can generate really a lot of data.
And in the experiment that I just showed you
previously, which was a study of human islets
stimulated with cytokines to induce a stress
response as a means of identifying potential
biomarkers, we generated almost two million
tandem mass spectra. We searched all of that
data in three to four hours, and, you know,
this will depend on your computational capabilities
and the number of servers and things that
you have available to you for housing proteomics
processing software. We identified 650,000
spectra. It’s about 36% of the data—a
little bit better than metabolomics but still
not, you know, the majority of data being
identified—but importantly, we identified
30,000 peptides that map to 11,000 proteins
at less than 1% false discovery rate. And
from there, we went on and performed additional
follow-up experiments—mouse models, so on
and so forth—and we actually identified
a novel protective factor of beta cells in
vivo. It wasn’t a novel protein; it had
been known before and characterized before,
but the role of that as a protective factor
had not yet been characterized. And I can
tell you that we looked at zero raw data from
this proteomics experiment. We trusted the
computational pipelines and the methods for
limiting false discovery and moved on directly
into our biological interpretation of the
data. And this largely does not happen right now
with metabolomics. We generate our data and
we spend a lot of time going through and curating
the data to make sure that we’re not making
misidentifications. So this leads me into
our core project that Krista mentioned and
what our vision is for the future of metabolomics.
I truly believe that relatively soon we’re
going to get there as a community, but I’ll
tell the story and, you know, we can discuss
it later. So we’re one of five Compound
Identification Development Cores that are
funded under the Phase II Common Fund Metabolomics
program, and you can see us here in southeast
Washington state. The stars indicate the other
CIDCs in this program, and the plus signs
indicate some data and software development
centers. And you might recognize other entities
that have existed in the metabolomics program
for quite a while, such as the Metabolomics
Workbench at UCSD in Shankar’s lab. This
is what our structure looks like. And this
is actually the structure of all of these
cores as required by the call that we responded
to. There’s a computational core and there’s
an experimental core, and there’s tight
coupling between the two. The computational
core predicts molecular properties; the experimental
core measures those and feeds that information
back to the computational core so that the
predictive algorithms can be refined. And
as part of our team we have David Wishart,
who’s participating and providing access
to the CFM-ID algorithm updates as well as
BioTransformer, which predicts metabolism
for metabolites of interest. So, our motivation
was basically: Can we answer two questions?
One: Can we find a methodology that allows
us to confidently identify a molecule without
relying on data from authentic reference materials
that you purchase from chemical suppliers?
Also: Can we identify molecular properties
that facilitate these approaches for confident
identification in a high-throughput, automated
pipeline? And the two questions
are related, and so the two answers will be
related. So for me personally the inspiration
came from proteomics; you may or may not be
familiar with it. And so if you’re not,
I’ll provide a little bit of a background
using this cartoon. We take proteins; we proteolytically
digest these with an enzyme. Usually we use
trypsin because it cuts them into pieces that
are very amenable to reversed-phase liquid
chromatography separation and also a nice
size for tandem mass spectrometry. So, we
perform separations, just like what we do
with metabolomics. We perform MS/MS, just
like what we do with metabolomics. But peptides
fragment very, very characteristically, so
that allows us to predict what they should
look like and then makes identification, you
know, comparatively easy relative to
metabolomics. Why is that? Well, proteins
and peptides have certain characteristics.
Proteins are direct readouts of the genetic
code. So, if you know the genome of the organism
from which your sample came, then you
can predict which proteins you should get.
We’re not necessarily there yet in terms
of being able to do that type of prediction
of the metabolome in an organism confidently,
but there is a lot of good work throughout the
community that is leading to that, but I
think there is still a little bit more to be done.
Peptides also fragment very characteristically
around the amide bond during MS/MS. So you
see the mock peptide down here. And you can
see that you might get ‘a’ or ‘x’
ions, depending on the fragmentation mechanism
that you’re using. You might get ‘b’
or ‘y’ ions; this occurs during CID- and
HCD-type fragmentation. You may get ‘c’
or ‘z’ ions; this occurs during electron
transfer dissociation and so forth. But the
important thing is that the fragmentation
always occurs in this region of the amide
bond. And so, that has allowed the development
of algorithms that can computationally predict
what a peptide dissociation should look like.
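To make that predictability concrete, here is a minimal sketch of computing singly charged b- and y-ion m/z values from monoisotopic residue masses (an illustrative subset of residues, not a production search engine):

```python
# b ions: N-terminal fragments; y ions: C-terminal fragments + water.
RESIDUE = {"P": 97.05276, "E": 129.04259, "T": 101.04768,
           "I": 113.08406, "D": 115.02694}
PROTON, WATER = 1.007276, 18.010565

def b_y_ions(seq):
    masses = [RESIDUE[aa] for aa in seq]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(seq))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(seq))]
    return b, y

b, y = b_y_ions("PEPTIDE")   # example sequence
print("b ions:", [round(m, 4) for m in b])
print("y ions:", [round(m, 4) for m in y])
```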
That enables, essentially, confident identifications
of peptides without the use of authentic reference
peptides. And there are also good methods
for assessing false discovery, and I’ll
show that in the next slide. But essentially,
when one performs an untargeted proteomics
experiment, there’s really never any type
of follow up using standard reference peptides
to confirm the initial identification in
the discovery experiment. We may indeed do
validation studies later, in other cohorts,
using a targeted proteomics experiment to
verify the findings in a replication of the
entire experiment, but there’s never a
comprehensive validation of the identifications
in the initial discovery data set. We rely
on the paradigm and the predictive software
that have been developed for processing the
data. There are also very good methods for
assessing false discovery rate with proteomics.
This isn’t the original paper on the target–decoy
database approach, which was developed by
Steve Gygi, but this is a nice cartoon to
show you that if you give the search algorithms
decoy data—and we create decoy data by sometimes
reversing the protein sequences or scrambling
the protein sequences that we’re searching
against to create essentially fake data—and
then we look at the proportion of the data
as it matches to the target database versus
the decoy database, and you very often see
these nice distributions and you can draw
essentially a line down through the score
on the x-axis and cut off all of these false
hits by this sliding scale of, you know, where
you want your false discovery rate to be.
So there’ve been a few papers out there
on exploring false discovery rate methods
for metabolomics, but we don’t have anything
yet that has been widely adopted by the community.
So a little bit more work to be done there.
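A minimal sketch of the target-decoy FDR estimate just described, using synthetic score distributions:

```python
import numpy as np

def fdr_threshold(target_scores, decoy_scores, fdr=0.01):
    """Target-decoy sketch: estimate FDR at each candidate cutoff as
    (# decoy hits >= cutoff) / (# target hits >= cutoff) and return
    the lowest cutoff meeting the requested FDR."""
    for c in np.sort(np.concatenate([target_scores, decoy_scores])):
        n_t = np.sum(target_scores >= c)
        n_d = np.sum(decoy_scores >= c)
        if n_t and n_d / n_t <= fdr:
            return c
    return None

rng = np.random.default_rng(1)
targets = np.concatenate([rng.normal(3, 1, 800),   # true matches
                          rng.normal(0, 1, 200)])  # false matches
decoys = rng.normal(0, 1, 1000)                    # reversed-sequence hits
print("1% FDR score cutoff:", fdr_threshold(targets, decoys))
```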
So coming back to this slide on the motivation,
we added two requirements to what we’re
looking for in an answer: that methodologies
will involve some type of a priori assumptions
of molecular characteristics or behavior of
the molecules that will allow confident and
accurate predictions of these, and that these
properties should be definitive—meaning
unambiguous when performing identification—and
experimentally measurable in an accurate and
reproducible way. So our aims for this Pacific
Northwest Advanced Compound Identification
Core are: 1) to generate methods that lead
to confident prediction of metabolite properties
that we can then use to generate in silico-derived
reference libraries, and then 2) to develop
and apply a multi-property matching approach
to utilize all of these different properties
that we are predicting in a confident way,
and then 3) make all this available to the
research community. And so what we envision
for the new paradigm is the cartoon I showed
earlier. We replace the upper half, which
is a library building, and instead of performing
analyses of pure compounds we’ve obtained
from chemical suppliers, we predict the properties
that we would like to measure—but this assumes
that we can do so accurately. So, I just wanted
to point out we’re not the only group doing
this. There are many others that are developing
predictive capabilities, whether it’s on
tandem mass spectra prediction, or retention
time prediction, or prediction of collision
cross sections. So, you know, it’s not something
that we thought of de novo. There are many
others that realize this challenge in metabolomics
and that are trying to address it. And I think,
ultimately, we as a community will be able
to take the best of all of these types of
predictive algorithms and combine them in
a comprehensive approach to generate libraries.
I’ll just mention that our approach is based
on ion mobility spectrometry. Ion mobility
has a lot of nice features, I think. It provides
an additional dimension of separation, so
that increases the coverage of the metabolome
by, sort of, inline fractionation—it increases
the dynamic range of the measurement. It’s
hooked up to a mass spectrometer, so, of course,
you get mass and isotopic distribution information.
You can also generate tandem mass spectra
as you normally would. But importantly, you
can measure a physicochemical property of
a molecule called collision cross section,
and I’ll talk a little bit more about that
in the next slide. And I will mention also
that ion mobility has a nice ability to separate
isomers and isobaric compounds that liquid
chromatography and mass spectrometry typically
cannot. And so that’s part of why you have
increased dynamic range and coverage of the
metabolome during measurements. And I personally
like to measure as many properties of a molecule
in a single experiment as possible. And you’ve
probably seen this typical cartoon where blind
scientists are touching different parts of
an elephant and they think it’s different
things. Well, if you use only a few properties,
you might have a similar experience with a
metabolite, and you might think it’s a certain
metabolite because you’re seeing a certain
point of view of the molecule, but you’re
not getting the entire picture and so you
might make a misidentification. And I feel
that the more views we can get of a molecule,
the more confident we’ll be in the chemical
structure. Just a little
bit about collision cross section. I’ll
point out three things here. The collision
cross section is not influenced by chemistry,
so to speak. And what I mean there is that
when you’re performing liquid chromatography
analyses or gas chromatography analyses, you’ve
got partitioning of the metabolite or molecule
between a stationary phase and the liquid
phase. With ion mobility, the molecule is
already in the gas phase in the instrument.
Really, your separation comes down to the
structure of the molecule—the ionized molecule
itself—and the voltages and temperatures
and pressures that are present inside the
instrument. So you’ve got more physics impacting
the readout in an ion mobility experiment
and less chemistry, and so you don’t have
degradation of stationary phases and so forth
that cause shifting in retention times. What
that means, analytically, is that you can
obtain highly reproducible measurements in
your own laboratory across different instruments
in your own laboratory, but also across instruments
in different laboratories. And there was a
nice paper, that I mentioned down here, in
2017, where four laboratories participated
in the measurement of the same compounds using
the same instrument type and the same operating
protocols to collect the data, and they obtained
a relative standard deviation of less than 3%.
It was fantastic. CCS is
also a property that can be accurately calculated
using quantum chemistry from first principles
and it can also be predicted using machine
learning approaches, which I’ll talk about a
little bit coming up. In our core, we use two different
tools to predict collision cross section.
One is based on quantum chemistry—that’s
shown over here on the left—ISiCLE. The
other is a deep learning based tool. And both
of these tools were developed by Ryan Renslow
here at PNNL and his team. And you’ll see
the errors that are generated through these
tools when comparing predicted CCS to experimentally
measured values. The measured values are from
almost 2,000 different conformers. And you also see
the computational time required for some of
these. So, if you can get by with less error
then you can generate predictions relatively
quickly. If you want really, really accurate
predictions, then you’re going to pay for
it in terms of computational time. And these
are node hours on the supercomputer here at
PNNL. We’re also collaborating with Dave
Wishart at University of Alberta on this project
and Dave is lending his expertise in MS/MS
spectral prediction through the CFM-ID algorithm.
Both of these CCS prediction tools have been
published. I’ll just step through this cartoon
to show you, at least for the quantum chemistry
pipeline, how it works. You give it molecular
information in the form of an InChI; you can
also use SMILES. It will convert that to a
three-dimensional chemical structure that’s
readable by a computer. We use molecular dynamics
to model various gas-phase conformers. Those
are then processed through density functional
theory to optimize their structures. We generate
collision cross sections for these, and
then we use Boltzmann weighting to arrive
at one probable collision cross section.
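As a minimal sketch of that final Boltzmann-weighting step (the energies and CCS values below are illustrative, not ISiCLE output):

```python
import numpy as np

KB_KCAL = 0.0019872041          # Boltzmann constant, kcal/(mol*K)

def boltzmann_ccs(energies_kcal, ccs_values, temp_k=298.15):
    """Weight per-conformer CCS values by their DFT relative energies
    to arrive at one probable collision cross section."""
    e = np.asarray(energies_kcal) - np.min(energies_kcal)  # relative energies
    w = np.exp(-e / (KB_KCAL * temp_k))
    w /= w.sum()                                           # normalized weights
    return float(np.dot(w, ccs_values))

# Three hypothetical gas-phase conformers of one metabolite:
print(boltzmann_ccs([0.0, 0.4, 1.2], [152.1, 153.8, 157.0]))  # ~153 A^2
```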
Moving on, here is an example showing that this
has worked very well: a demonstration
case looking at cis/trans isomers of a molecule
that we call diCQA—dicaffeoylquinic acid.
This was a study conducted by Ryan Renslow
and Erin Baker at PNNL; diCQA is an antiviral
compound. It usually occurs in one particular
form once it’s isolated from a plant, but
then if it’s exposed to UV light then it
begins to isomerize. And these are difficult
to separate using conventional approaches,
but using ion mobility—and in this case,
this is data from Structures for Lossless
Ion Manipulations, which I didn’t provide
much background on, but it’s a very
high-resolution gas-phase technique—you can
see the predicted values of these cis/trans
isomers agree very well with the experimental
values. And, in fact, we observed around 0.8%
average error, compared to the experimental
data. But again, this was with the most accurate
version of our quantum chemistry pipeline
that takes a long, long time. So we tend to
accept higher error and use the higher-throughput
techniques to generate our reference information.
And so far, for CCS we’ve generated values
for about 100 million compounds. And these
are all available at this website; you can
check it out and try it and see how well it
works. And there are data there from the different
flavors of the two different predictor platforms
that we have. The majority of values that
are up on the website right now are from our
DarkChem deep learning approach. And then
this just shows some updates from CFM-ID that
you’ll be reading about soon. I think David’s
lab has a paper in review now on the
next-generation CFM-ID. So, I’ll probably move
a little bit more quickly here through these
initial applications of the use of in silico
libraries. The first is our participation
in the EPA ENTACT program, where the EPA provided
different laboratories with chemical mixtures
in a blinded format, and the laboratories
that participated were allowed to use whatever
technique they wanted to, to try and identify
the composition of the mixtures. We used a
combination of ion mobility and Fourier transform
ion cyclotron resonance mass spectrometry
and generated predictive data. We didn’t
use data from analysis of authentic compounds.
We used a reference library composed entirely
of predicted values. And so we predicted
CCS and isotopic signature and mass, the latter
two being more or less trivial predictions compared
to CCS. And our results were, you know, good
and bad. The good part was we had an accuracy
of 96% in determining the correct composition
of the mixtures—at least for things that
we were able to detect. The bad news was that
the false negative rate was high; the false
discovery rate was high. But importantly,
it showed promise and gave us encouragement
to continue down this path. And also a good
sign was that as we increased the total
confidence score that we came up with,
the false discovery rate dropped, which
is what you would like to see. And that paper
has been published as well. You can read the
details there in the Journal of Chemical Information
and Modeling. I’m going to skip through
these slides a little bit in the interest
of time, just so that we can get through to
the Q & A session. But I wanted to share some
data of our initial applications of the ion
mobility platform to a study of sleep deprivation
or altered circadian rhythm in study participants.
And this is a collaboration with Hans van
Dongen and Shobhan Gaddameedhi at WSU. And
just briefly on the study design, we’ve
got people that are on simulated night shift
work, people that are on day shift work. The
people that are on the night shift are supposed
to sleep during the day and be up all night
and vice versa with the day shift. And so
we looked at the blood to see, you know, what
might be going on there. And Hans had already
published some metabolomics data looking at
these samples with a collaborator in Australia
and they used the Biocrates kit. And you can
see how many metabolites they detected, how
many of those were rhythmic. You can see the
patterns here. So definitely there’s circadian
rhythm with small molecule abundances in individuals,
and that can be disrupted when they go on
an altered sleep schedule.
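As a hedged illustration, one common way to score that kind of 24-hour rhythmicity is cosinor regression; the published analyses may well have used different statistics, and the data below are synthetic:

```python
import numpy as np

def cosinor_fit(t_hours, abundance, period=24.0):
    """Fit abundance ~ mesor + a*cos(wt) + b*sin(wt) by least squares."""
    w = 2 * np.pi * t_hours / period
    X = np.column_stack([np.ones_like(w), np.cos(w), np.sin(w)])
    (mesor, a, b), *_ = np.linalg.lstsq(X, abundance, rcond=None)
    amplitude = np.hypot(a, b)
    acrophase = np.arctan2(b, a)   # peak time = acrophase * period / (2*pi)
    return mesor, amplitude, acrophase

t = np.arange(0, 48, 3.0)          # samples every 3 h over 48 h
y = 10 + 2 * np.cos(2 * np.pi * (t - 6) / 24) \
    + np.random.default_rng(3).normal(0, 0.3, t.size)
print(cosinor_fit(t, y))           # mesor ~10, amplitude ~2
```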
We did a complementary lipidomics
analysis—that project was led
by Jennifer Kyle here—looking at the same
exact samples. And sure enough, we can replicate
the cycling behavior of lipids with time,
which gives us confidence that we’re able
to tease out these differences using a conventional
lipidomics analysis. We then applied liquid
chromatography ion mobility tandem mass spectrometry
to the same samples, focusing on the polar
metabolome. A lot of analyses. We analyzed
each sample in positive and negative ionization.
We also used three different collision energies
for the tandem mass spectra, which resulted
in 916 raw data files if you include
blanks and QCs. A lot of data—about 1.3
terabytes in total. And then we were stumped,
really, with how to effectively process that.
And so we decided to develop our own software,
DEIMoS, for multidimensional data extraction
to extract data from the LC dimension, the
ion mobility dimension, and the mass spec
dimension. So, I’m running short on time.
I’m going to skip through these slides on
DEIMoS. It’s just showing some progress
on where we are with developing the software;
it’s almost finished. But as I step through
these slides, I’m not going to read them.
I just want to point out that we are able
now to extract data, the multidimensional
data, from that platform; we’re able to
align it. And so, these top/bottom plots here
are data from two different QC samples that
we had interspersed throughout the run. And
you can see there’s good agreement in the
mass spec domain, as you might expect. There’s
good agreement in alignment in the drift time
domain. And a little bit less agreement in
the LC domain, but still not bad. And we’ve
fixed some of these misalignments using an
SVR algorithm so far.
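A minimal sketch of SVR-based retention-time alignment of the kind just mentioned, using scikit-learn's SVR on synthetic anchor features (DEIMoS's actual implementation may differ):

```python
import numpy as np
from sklearn.svm import SVR

# Fit the drift between matched anchor features in a reference run
# and a query run, then warp the query's retention times.
rng = np.random.default_rng(2)
rt_ref = np.sort(rng.uniform(0, 30, 200))                 # reference run RTs
rt_query = rt_ref + 0.05 * rt_ref + 0.1 * np.sin(rt_ref) \
           + rng.normal(0, 0.02, rt_ref.size)             # drifted query RTs

model = SVR(kernel="rbf", C=10.0, epsilon=0.01)
model.fit(rt_query.reshape(-1, 1), rt_ref)                # learn query -> reference

aligned = model.predict(rt_query.reshape(-1, 1))
print("median |error| after alignment:",
      np.median(np.abs(aligned - rt_ref)))
```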
Deisotoping of the mass spectra is an ongoing
effort. We’re
working to deconvolute the data-independent
tandem mass spectra that are being generated.
The instrument itself can do this to a degree,
but there’s still a need for an algorithmic
improvement to the deconvolution. This type
of deconvolution is very analogous to what
happens with GC/MS data. And you can see the
difference for this spectrum between a windowed
approach, which is what the instrument does
inherently, versus what we can do algorithmically.
And you can see that the algorithmic approach
is removing some ions that likely do not belong
to the molecule of interest but that are retained
by the windowed approach. So that work is in progress
as well. Coming back to that sleep study,
we’ve generated a lot of data: close to
one million features—three-dimensional features—in
positive ionization, and over two million
features in negative ionization. And I’ll
be honest here and say that we’ve set our
signal-to-noise threshold fairly low so as not to miss anything.
And we also haven’t implemented deisotoping
yet. And so we’ve got multiple features
included here from the multiple isotopes.
But importantly, many of these are statistically
significant based on the statistical methods
that we’re using, and you can begin to see
some cycling behavior here. The bold blue
line and the bold dotted line are the averages
from the study participants. So, it’s nice
to see that we’re replicating the data that
we’ve seen from the previous publication
as well as the complementary lipidomics analyses
that we’ve performed. I’ll just quickly
touch on our attempts right now to use all
of our predictive data to confidently identify
metabolites. So, we believe we’ve identified
glycine betaine in these samples. This is
just looking at three representative data
files out of the over 900. We had 15 MS/MS
spectra match to experimental spectra in
a variety of repositories, mainly MoNA and
NIST. There were no experimentally determined
CCS that we could find for this molecule,
and so we matched to our in silico generated
CCS instead. This just shows you two features that
had evidence to support that they could be
glycine betaine. They have the same retention
time, as you can see here, about 8.25 or 8.3
minutes, but they have different drift times.
So likely what we have here are two isomers
or isobaric species that happened to have
the expected mass of glycine betaine. And
so then we move on to tandem mass spectra
to try and identify these, and looking at
the different spectra that were available
through MoNA and NIST, you see different degrees
of match, depending on the collision energy
that was used. We had our best matches to
these at the higher energies. You can see
the dot product scores here. Twenty volts
seemed to do well, and then 40 volts not so
much for this particular reference spectrum,
but for the other two reference spectra—one
from MoNA, one from NIST—we see relatively
high scores as well there. And then for the
feature 314, actually, the predicted spectrum
from CFM-ID at 10 volts matched very well,
whereas 20 volts and 40 volts not so much.
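For reference, the dot-product score used in these comparisons can be sketched as a cosine similarity over binned spectra; the bin width and weighting below are illustrative choices, not the exact scoring used here:

```python
import numpy as np

def spectral_dot_product(spec_a, spec_b, bin_width=0.01):
    """Cosine (dot-product) similarity between two (m/z, intensity)
    spectra after binning onto a common m/z axis."""
    lo = min(spec_a[:, 0].min(), spec_b[:, 0].min())
    hi = max(spec_a[:, 0].max(), spec_b[:, 0].max()) + bin_width
    bins = np.arange(lo, hi, bin_width)
    a, _ = np.histogram(spec_a[:, 0], bins=bins, weights=spec_a[:, 1])
    b, _ = np.histogram(spec_b[:, 0], bins=bins, weights=spec_b[:, 1])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

observed = np.array([[58.065, 100.0], [59.073, 35.0], [118.086, 60.0]])
predicted = np.array([[58.065, 90.0], [59.073, 40.0], [118.086, 70.0]])
print(round(spectral_dot_product(observed, predicted), 3))  # high score
```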
Digging into the details a little bit, it
was interesting to see the data from these
two features, again shown here. This shows
you the number of tandem mass spectra that
had hits for this feature 314 and their dot
product scores. You see the same dot product
scores down here for the 331 feature. These
are summaries of the previous slide that I
showed you. You can see that we had a systematic
mass shift in our mass accuracy through the
course of the analyses—that’s addressable
through recalibration of the data. But, interestingly,
if you go on and look at the CCS error—and
again these are all predicted CCS through
our pipeline—the feature with ID 314 had
a relatively reasonable match to CCS at about
3.2% error, whereas the feature at 331 had
a poor CCS match. And so, although 331 had a few
tandem mass spectra that matched well to experimental
data, the predicted MS/MS spectra did not
match well, and the CCS predictions did not
match well. So, it doesn’t mean that this
isn’t glycine betaine. I think it means
that it’s not the particular gas phase conformer
that we think it is. And so, we’re realizing
that we’ve got a lot of data to go through,
but I think a rich resource to begin to understand
the subtleties of gas-phase conformations
of molecules and how those impact the tandem
mass spectra and the CCS and the other properties
that can be measured by mass spectrometers.
So, that was essentially the last slide. In
summary, I personally believe that in silico
reference libraries are the future of metabolomics.
The proteomics community has been able to
do this in part just because of the inherent
chemistry of a peptide and a protein. I don’t
think it’s impossible for metabolomics.
I just don’t think we’ve arrived yet at
the properties and paradigm for how to do
this, but many groups are working on it,
including ours. I think we’re going to get
there sooner than later. And I tend to be
a relatively impatient person, so I’d like
to have this capability and use it while I’m
still actively doing research. Multidimensional
analyses, such as LC-IMS-MS, provide higher
coverage of the metabolome just because of
the additional separation that occurs during
the measurement. Hopefully
you can begin to appreciate that additional
analytical properties of the same molecule
being measured—for example, a mass and isotopic
distribution, a tandem mass spectrum, a CCS—start
to provide higher confidence for one
identification over another. But this also introduces
complications and other considerations that
we now have to deal with, and primarily it’s:
How do we integrate and weight appropriately
all of these multiple lines of evidence to
provide a molecular identity? It’s something
we’re working on actively now in our project
and our research and it would be interesting
to hear what you might have, you know—what
thoughts you might have on this as well.
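As one hypothetical way to frame that integration problem—explicitly not the PNACIC scoring scheme, which is still under development, and with placeholder weights and error models:

```python
import numpy as np

def gaussian_similarity(error, tolerance):
    """Map a measurement error to [0, 1] given a soft tolerance."""
    return float(np.exp(-0.5 * (error / tolerance) ** 2))

def combined_score(evidence, weights):
    """Weighted average of per-property similarities in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[k] * evidence[k] for k in evidence) / total

evidence = {
    "mass": gaussian_similarity(error=1.5, tolerance=5.0),   # ppm
    "ccs":  gaussian_similarity(error=3.2, tolerance=3.0),   # percent
    "msms": 0.87,                                            # dot product
}
weights = {"mass": 1.0, "ccs": 1.0, "msms": 2.0}
print(round(combined_score(evidence, weights), 3))
```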
So with that, I’ll just thank the project team,
our funding sources—the NIEHS and the Common
Fund—as well as some internal funding for
the data that I showed. And I’ll stop now
and take any questions that you might have.
I see some questions and comments in the chat
box. I can start through there.
Oh, gee, sorry. I am so sorry. I was muted
and I apologize, Tom. Thank you for chiming
in because I would’ve been talking away.
My apologies. There are a number of questions.
First I’m going to address very quickly:
a number of people would like to get your
slides after the webinar. We do post those.
Tom has agreed to post them and it takes about
two to three weeks to get them posted because
there are some procedural issues. And we will
have those posted on the Metabolomics Scientific
Interest Group website. And moving onto a
scientific question: The Cell Metabolism paper
looks interesting. I’m wondering how one
can select a specific target, such as you did
for GDF15, from a large number—over 300—that
were found to be different? At the end of
the day, experiments have to be conducted
to validate the target. So, how can confidence
in picking the correct molecules be high?
Yeah. So if I interpret the question correctly—I
interpret it as: you’ve identified 11,000
proteins. How do you then down select to where
you think the action is in terms of biology?
And, you know, there are a number of ways
you can do that. You can look at, you know,
volcano plots of the molecules of interest
or the data—you can look at the correlation
of fold change and p-values and so forth.
But, you know, for that particular study and
that paper, I really have to credit Ernesto
Nakayasu here in our group, who was the first
author of that paper and who, I feel, had amazing
insights. In his analysis of the data, he
looked at additional cytokines that were turned
on or off through the in vitro cytokine stimulation
of the human islets. And of course we saw
many cytokines that were upregulated, which
you might expect as part of the propagation
of the stress response. But Ernesto keyed
in on three cytokines and/or growth factors
that were decreased in expression. And
he had a hypothesis that because those were
going down, those might be protective in nature.
And we worked with our collaborators, Raghu
Mirmira at Indiana University, and we did
a lot of follow-up studies on, you know, using
mouse models. We were looking at pancreatic
slices from nPOD, a repository of tissues
from individuals with diabetes—a lot of
follow-up studies. So, all of the follow-up
work is really described in nice detail in
that Cell Metabolism paper. But, you know,
sometimes, you know, in addition to all the
normal logical steps that you go through to
parse the data to focus in on things that
you think are interesting, there’s often,
you know, what we say is gut feeling and intuition.
And I think in that case, it was more intuition
and we sort of got lucky in that it panned
out into a nice biological story.
Great, thank you. Next question. It’s well
known that bilirubin can be shifted through
different double bond isomers by blue light.
Is this something in metabolomics we should
be aware of? Does it occur after receiving
material for analysis?
Yeah, that’s a good question. You know,
I’m sure there are all sorts of artifacts
that get introduced into our samples, not
just from exposure to light but, for example,
freeze–thaw cycles and so forth. And there
have been some nice papers on freeze–thaw
cycles in particular that look at metabolite
stability. I would argue that we really can’t
fully address these phenomena or the impact
of these phenomena until we can identify the
majority of our data. Because when you’re
looking at, I don’t know, less than 10%
of the data set as confident identifications,
and the rest are unknown features, it’s
difficult to say the degree of impact that
any type of manipulation or, you know, perturbation
of the sample might have on the chemical composition.
You know, the proteomics community has had
it easy, I would say, through the years just
because of the nature of the molecules involved
and the nice tools that have been developed,
you know. They can do things like looking
at extended gradients and offline fractionation
or enrichment techniques, so on and so forth,
to see how those types of analytical improvements
impact the quality of the proteomics data.
And I would argue we can’t really do that
right now with metabolomics because we just
simply can’t identify the majority of the
data. And so running an extended gradient,
doing offline fractionation, looking at, you
know, effects of freeze–thaw and so forth,
I would argue that we can’t fully address
the impacts of those right now simply because
we can’t, you know, we just don’t have
a good—I’m struggling with the right words—a
good, I would say, representation of the data
in terms of, you know, the total chemical
complexity based on confident identifications.
We’re just not there yet.
Great. Thanks, Tom. Next question: How are
you planning to validate the in silico CCS
DB with experimental CCS data? What is the
rationale for selecting compounds, and will you
use those available from the efforts of other
groups working with IMS?
So, for the papers that I showed in the CCS
section—one was on the in silico, or sorry,
the quantum chemistry tool, which we call
ISiCLE, the other was on the deep learning
tool called DarkChem—we had an experimental
data set available to us that was generated
by Erin Baker during her time here at PNNL.
And it consisted of, I don’t remember how
many unique molecules, but it was around 2,000
different conformers of molecules and adducts
of molecules. So that would include protonated,
sodiated, ammoniated forms of metabolites—all
of which tend to have different drift times
in an ion mobility separation, and then
correspondingly different collision cross sections. So the
plan is essentially to generate additional
CCS values, experimental CCS values, from
analyses of authentic compounds. We’ve got
about 10,000 in house. I’m not sure whether
we need the data from all 10,000, but we certainly
want to experimentally measure compounds that
have diverse chemistries to make sure that
we don’t have bias one way or the other
in either of the predictive platforms, and
that’s our plan. We’ve got some molecules
that Pieter Dorrestein shared with us, which
are microbial secondary metabolites. We’ve
got molecules from the EPA, which are, you
know, these exposure compounds, manmade. They
have very different chemistries, you know,
halogens incorporated and so forth. And so
hopefully, if we analyze a sufficient representation
of chemistry and we see good agreement between
the predictive tools, then, you know, we’ll
be much more confident in using them in our
routine metabolomics analyses. The rationale
to select the compounds is, you know, what
I mentioned already a little bit, you know.
We don’t just want to measure all the metabolites
that are present in central metabolism. You
know, we just don’t want to focus on amino
acids, small organic acids, fatty acids, lipids,
so on and so forth. We want to get, you know,
the gnarly-looking compounds that come from,
say, microbial or plant secondary metabolite
production, or that come from, you know, anthropogenic
sources—flame retardants, you know, pesticides,
pharmaceuticals. We’d like to be as diverse
in the chemistry as possible.
Thank you. So, we have a number of questions
that came in. So this is obviously—people
are very interested in this area and I don’t
know that we can address them all or we might
be here till about, like, well 12:30 my time,
9:30 Tom’s time. But I think maybe we can
go just a couple minutes over?
Okay.
And I’d like to try to ask maybe like three
more questions.
Sure.
One came just a little more broadly as a question.
I think I saw a number of epidemiologists
on the call, so I’m going to ask this one.
For the large-scale studies, epidemiologists
strongly prefer working only with identified
metabolites as it greatly simplifies the overall
workflow. Approximately how many years do
you think it will be before these methods
are available for large-scale studies?
Oh, that’s a good question. I mean, I wouldn’t
limit it to just large-scale studies. I would
reframe the question as, you know, when do
you think these methods are going to be available,
period? I don’t honestly know the answer
to that. We’re working hard on it, as are
others in the community. Again, I’ll point
out that I’m, you know, I tend to be an
inherently impatient person. I would like
to, you know, really grind this out sooner
than later so that I can use it, you know.
Ultimately, you know, my first degree was
in biology because I wanted to do disease
research. And then I got into chemistry to
better understand the molecular mechanisms
of what occurs in the context of biology.
And so, you know, although I’m conducting
analytical research, although I’m in a large
analytical mass spectrometry group, I, at
the end of the day, mainly care about, you
know, what these molecules mean, what these
changes in the molecules mean in the context
of the biology. So, I definitely hear, you
know, the intent of the question, and I definitely
want to get there and have a method that works
well in terms of not only comprehensive identification,
but also assessment of identification false
discovery rate. I want that like, yesterday,
but you know, it’s hard. We’re sort of
trailblazing here. I would like to see, you
know, some reasonably good method within the
next five years. Again, mainly so, you know,
selfishly so that I can use it myself in my
own studies, and, you know, not be frustrated
by where we are with metabolomics relative
to where we are with proteomics and transcriptomics,
for example.
Well, a follow-up to that question, in a sense,
even though it was two different questions,
but I think it might play nicely into kind
of what you’ve just said, is: With many
different research groups taking the lead
on developing in silico reference libraries
for each major property, how do you envision
the community coalescing on the best final
tool and then disseminating them to the broader
metabolomics community? It looks to me like
it will be a leap to get everyday metabolomics
scientists to use these and not just the developers
themselves.
Yeah, that’s another good question. You
know, I look back at what occurred in the
field of proteomics. So it was, I think, the
early ‘90s when John Yates and Jimmy Eng
developed the SEQUEST algorithm to identify
peptides with confidence and relatively high
throughput based on the match of predicted
tandem mass spectra to observed tandem mass
spectra. And, I mean, it seemed like a, you
know, like a light switch in terms of the
community adopting that at that time because
there was nothing else, really, that worked
as well. And, you know, it just progressed
from there. There are many algorithms now
on the proteomics side and many software tools
to identify peptide sequences based on their
tandem mass spectra. They all kind of work
about the same. You know, there are certainly
differences between the algorithms that will
give you slight differences in, you know,
the peptides identified. But at the end of
the day, you kind of get the same data regardless
of the algorithm that you use, and you get
about the same confidence because of the target
decoy approach, again, that was developed
by Steve Gygi. So, I think really what it’s
going to take for the community to adopt all
these and, you know, to coalesce on a method
is, you know, you just have to demonstrate
it. You know, it’d have to be, you know,
some well characterized sample that’s maybe
handed out to all the different laboratories
and we all participate in a ring trial, or,
you know, other type of interlaboratory comparison.
You know, we provide the results and, you
know, hopefully one or more different methods
will, you know, generate many confident metabolite
identifications with associated low false
discovery. And I expect everyone to kind of
switch over to using that. I mean, I’ll
be honest and say if tomorrow such a method
came out, I’m going to use it. If we evaluate
it and it works well in our hands and we trust
it and, you know, it’s not a black box of
algorithms, you know, we can see the details
algorithmically of how the sausage is made,
I would use it tomorrow. Again my focus really
is, you know, ‘What does the data mean in
the context of biology?’ It doesn’t necessarily
matter to me how we get there, as long as
we get there in an analytically accurate and
robust manner. So, I think once a method or
paradigm is demonstrated to work and work
well, I just see the community naturally adopting
it.
Okay, we’re going to do one last question
here. CCS seems like a great measurement for
in silico predictions because it is reproducible
across different instruments and different
labs. Retention times and MS/MS fragmentation,
which are still the most common measurements
because IMS instruments are less common, unfortunately,
and very variable—what is your perspective
on the future of in silico predictions for
reference libraries on these measurements?
Well, I think they’ll still be used. You
know, I’ve had similar conversations with
Oliver Fiehn and others that are also PIs
under the NIH Common Fund Phase II program.
Personally, I would put more weight on evidence
that comes from multiple orthogonal properties
than I would in one or two properties.
And so then I would assume, in that multidimensional,
multi-chemical property space that we’re
using to characterize a molecule, that you
could get by with larger errors in one individual
property using that aggregate of different
chemical properties than what you could get
by with if you’re only using one property.
So, for example, if I had a CCS error of,
you know, 5%, and I was only using CCS, then,
you know, I’m not happy with that 5% error;
I would want it below 1%. But if I have CCS
and say my error is, you know, 2% to 3%—which
most of the predictive methods seem to settle
in on about 2% to 3% error versus experimental
data—if I’ve got MS/MS with a certain
accuracy, if I’ve got retention time prediction
with a certain accuracy, if I’ve got, you
know, whatever property is being measured,
if the errors are a little bit higher in each
one of those individually than what I would
like, what I would hope would be that the
aggregate of all of those properties would
be, you know, relatively strong in terms of
being able to say that feature in the data
is this molecule, and to do so with confidence.
I believe that will be the case but we need
to do additional work. You know, this is what
we refer to as chemical space. You’ve got
this multidimensional space of these many
properties that you’re measuring and you’ve
got, say, a point out there in that space,
I personally would like to know—well, I
think, you know, this particular point in
this multidimensional space is this molecule—I
want to know what are the next closest points
around that point in that multidimensional
space? Maybe it’s a big gap, even with the
relatively high errors in retention time or
MS/MS prediction.
All right. Well, thank you all for your attention
today and your active participation in our
webinar. We would like to take this opportunity
to thank our presenter, Dr. Tom Metz, for
an excellent presentation. And this may have
been one of the most enthusiastic levels of
questions I’ve gotten in a long time. So
there’s obviously a lot of interest in this
area, so that was a lot of fun to see. So
having said that, we would welcome your feedback
to inform future webinars. And again, thank
you so much, Tom, and thank you all for attending
today. Everybody have a great week.
Thanks for hosting and thanks for attending.
Thank you.
