MING LEI: Good afternoon.
And to those of you on the West Coast, in Alaska,
or Hawaii, good morning.
My name is Ming Lei.
I am the division director for Research Capacity
Building at NIGMS, and I am your host today.
It’s a pleasure to welcome you to the fifth
webinar of the NIGMS training webinar series.
This is a difficult time.
The pandemic has disrupted everybody’s life
to a certain extent.
NIGMS created this webinar series to help
keep our training community together with
useful and interesting talks and conversations,
and I hope you are all enjoying them.
A few reminders before we start the presentation.
All webinars in this series are recorded.
Some of them have already been posted on the
NIGMS website, and all of them will be,
so you can view them
at any time, and I would encourage you to
ask your friends to view them when they have time.
Secondly, there will be a Q&A session after
the presentation.
To ask a question, we ask you to type it in
the chat box and send it to me and I will
read it out for a response.
And again, my name is Ming Lei.
And our speaker today is Dr. Susan Gregurick.
Because she will share her scientific journey
with you, I am not going to introduce her
except by sharing with you that as the NIH
associate director for data science, she is the leader at the center of all NIH data science activities.
So with that, Susan, take it away.
SUSAN GREGURICK: Thank you so much, Ming.
And to all my friends at NIGMS, it’s a pleasure
to be here today.
Just a little reality check, Ming.
You hear me OK, correct?
LEI: Yes.
GREGURICK: Awesome.
I’ve been so excited and looking forward
to this particular journey and discussion
with you for a week.
It’s not often that I get to tell young
people about my own personal journey, and
I hope that you see yourself in a little bit
of me and what I’ve done.
I’m going to tell you about what I’ve
done in the computational sciences, which is
my true love, and how this has helped shape biomedicine and my own personal professional choices.
So let me begin by telling you the beginning.
So I want to give you some historical perspective
in the development of computers, computer
science, the internet—I actually watched
the internet’s birth more or less—networking,
and analysis in my own personal journey and
how these have changed my professional life
and have helped me make my career choices.
And then I want to finish on something that’s
relevant to every single one of us around
the world: How have we applied computing,
internet, technologies, analytics to address COVID-19?
And we’re just at the start of COVID-19,
so we have a long way to go.
So we’re going to have to go way
back in the way-back machine to the 1980s.
So the top song of 1982 is “Physical”
by Olivia Newton-John, which you may or may
not have ever heard, but I’m sure you have
seen the movie E.T.
It was the top-grossing movie in 1982.
And I’m living somewhere in a town in Michigan,
and I’m a dancer.
I actually take ballet as well as highland
dancing. I am a total goof-off.
I am probably in and out of school more than most.
I’m a DJ at our local high school radio station.
My name is Susie at that time.
I’m the homecoming queen.
And I’m a total closet geek.
Nobody at my high school knows that I’m
an avid reader of science fiction.
I’m reading Scientific American, which was
about at my level in high school.
I’m taking classes at the local community
college—mostly in genetics and chemistry.
I’m really fascinated with science, but
that was my secret life.
And here’s what computing looked like to
me in 1980.
So popular then was the Commodore 64.
I came from a community and a town where computers
were not very common, so even my high school
did not have any computers, but the local
community college did.
This is a typical computer science room that
I never got to visit when I was in high school,
but I’m pretty familiar with these.
And if you have never seen these before, these
are punch cards.
And so when you write a computer program in
the 1980s and before, you have to translate it
into the punch card system, and then you feed
those punch cards into a machine that’s
not really quite visible in my picture.
And the worry of every computer scientist
at that time was that you dropped those punch cards.
They're a program and they're in order,
and if you drop those punch cards, you will
spend a significant amount of time and worry
trying to get them back in order.
Just imagine trying to debug your program
using a punch card system.
It was so hard to do, so much work.
And when I was in high school, this was one
of the computational biology highlights of 1982.
By the way, 1982 is the year that I graduated
from high school.
This is the story of the protein dynamics
of a small little tiny protein called BPTI,
bovine pancreatic trypsin inhibitor.
It’s approximately 60 amino acids long,
and you can see its ribbon structure on the screen.
Wilfred van Gunsteren and Martin Karplus did
molecular dynamics trajectories of this little
tiny protein for 25 picoseconds in vacuum,
and then they put it in a spherical shell
of 2,647 non-polar waters, and then they fixed
it in a crystal structure and they tried to
understand the dynamics of movement of this
protein in these three scenarios, and that
particular paper and that particular simulation
was a tour de force of computational biology
the year I graduated from high school.
And I was totally amazed that we could actually
do calculations of protein dynamics in these
three different scenarios—in vacuum, in
non-polar solvent, and in crystal image.
So moving a little bit forward in the later
1980s, the top song is “Walk Like an Egyptian.”
That was the song when I was in college as
an undergraduate.
The top-grossing movie of 1987 is Three Men
and a Baby, actually a movie I never saw—it’s
not quite my interest—and I am at the University
of Michigan and I’m an undergraduate.
And I graduate in the year 1987.
I’m a chemistry major and a math major.
It’s not uncommon probably for most of you
to have dual majors.
I am a research assistant.
I am a research assistant in mathematics.
I’m a research assistant in geology. And
I’m also a research assistant in the medical
school, where I am developing hepatic imaging
agents through synthetic organic chemistry,
and that is not my strength.
I do not do any more synthetic organic chemistry
after undergraduate, but at that time I thought
that would be an interesting type of research
to explore.
I’m also spending lots and lots of time
looking for errors in my code.
I want to just make one point to you as many
of you are undergraduates.
One of the most valuable experiences that
you can gain as an undergraduate is to work in a lab.
To work in a lab, to work with graduate students,
to work with other graduate students, to work
with postdoctoral researchers and your mentor,
your PI mentor, will allow you to see what
research is really like. You know, it's hard.
Sometimes you spend a lot of time working
on a project and it doesn’t go anywhere.
There are a lot of false starts.
This is one of the most valuable real-life
learning experiences that you can have, and
I encourage everybody to take at least one
semester and do research in a laboratory.
And, obviously, I am no longer a closet geek;
I am an actual geek at the University of Michigan.
I am known mostly in the chemistry and math
department, but I do have a lot of work that
I do in coding as well.
And what do computers look like for me when
I’m an undergraduate?
This is one of the computers that I worked on.
It’s not my actual computer because I didn’t
take that with me.
This is an IBM PS/2, and you can see that
you can actually play chess on this computer.
This is The Ohio State University, a big competitor
to Michigan, by the way.
This is the supercomputing center in 1987.
They are a powerhouse of supercomputing.
They are not the only ones, but I knew them well.
And this is the birth of programming languages.
You’ve probably heard of Fortran; that was my primary language when I was coding in the late '80s.
C++, certainly, but Perl and these more interpreted and dynamic languages really start developing
in the late '80s.
What’s the computational highlight from
the year I graduated from college, which was 1987?
It is another computer simulation.
This is the diffusion of a substrate in an
active site in an enzyme.
And this particular system is superoxide dismutase.
And what I wanted to show you is that unlike
the last simulation, which is the dynamics
in a trajectory sort of way, these are more
stochastic Brownian dynamics simulations, and
what was really super cool about Kim Sharp
and Barry Honig and Robert Fine’s work is
that they actually put the charges in the
active site of the enzyme into the calculation.
And having the ability to have molecules have
a charge gives you an electrostatic [unintelligible]
for what’s really happening in that active site.
And to me this was just a super cool simulation.
I love the work of Barry Honig.
I’ve followed him for years, and I have watched
the field of electrostatic calculations go
from point charges to probability charges
to all sorts of really innovative work, and
so I just wanted to share with you that one
particular highlight.
Moving to a new decade—1990s.
The top song in 1995 is “Gangsta’s Paradise”
by Coolio, featuring L.V., and the top movie,
which I did see, is Batman Forever.
All those Batman movies are so great.
And I am at the University of Maryland.
And, obviously, I have never left this area.
I am still living in Silver Spring today.
I am defending my PhD thesis in 1995.
Just a side note.
I took two years off between my undergraduate
and my graduate studies, and I worked at the Naval
Research Laboratory, where I was involved
in the physical characterization of organic
molecules used for blood surrogates.
And it was a really wonderful experience because
I got to see what it was like to work in a
very large team at the Naval Research Laboratory,
and I got to become much more proficient at
NMR spectroscopy and IR spectroscopy and Raman
spectroscopy, and I so loved Raman and IR
spectroscopy that you’ll see it popping
up in my future.
You see this character here on the giant steps.
That’s my PhD thesis advisor.
That’s Millard Alexander.
He’s still at the University of Maryland.
I think he might be emeritus at this point.
But what did we work on?
So I studied flux in reactive systems—systems
like boron hydride—and I studied what happens
in those systems when the potential energy
that describes different excitation states
cross and how do you actually calculate curve
crossing or reactions?
That’s really the story of flux.
I developed a new genetic algorithm, which
is a pretty cool algorithm, for optimization
of structures that have multiple potential
energy surfaces, PESs, and obviously I’m
not in computational biology.
I am a serious homebrewer, and I got married
to my colleague in physics.
And this is a later picture, but that is myself
and my husband, Nicholas Phillips.
When I was a graduate student, I wanted to
change careers.
I wanted to think differently about computation
and what we can do with our careers, so I
changed from physics to computational biology.
Here’s what computers looked like in the 1990s.
This is actually a computer that I did most
of my PhD work on.
It’s an Apple Macintosh.
I was so lucky to watch the birth of Mozilla,
Netscape, and a little blurry for you is the
HTML language that most of you probably know
how to program in and you’re very, very efficient in.
But when I was in grad school this was completely
new—and so was this.
At one point, a list came out every day of the
top new websites that had launched that day.
And the first webcam, that’s the coffee
pot at Cambridge, where I actually visited
and did some work as a grad student.
There it is.
You could see the level of the coffee pot
at any particular time and you would know
when you could go down and get some new coffee.
And here I’m going to play for you in the
way-back machine the sound that I will never
forget [dial-up handshake]. There it is. And
that horrible sound goes on and on.
That is how we had to connect to the internet.
That is my dial-up modem.
So I had to sit at home and timeshare the
one computer in our grad school house, dial
up to the internet, and do our work.
And most of us actually played games, and we
had to have a lot of time in order to do our
work and play games.
So you guys have such a wonderful experience—always
connected, always on—but for us, that was
the sound that we heard hour after hour
throughout the night.
Here's something that was super exciting when
I was early in my graduate school days.
This is BLAST—Basic Local Alignment Search
Tool—developed by a number of colleagues,
including David Lipman.
David Lipman is still at NCBI and NLM here at NIH.
This was a new approach to rapidly do sequence
comparison of different sequences by doing
a basic alignment.
And you would get a score, and that would
tell you, for example, where the gaps were,
where the insertions were.
This particular algorithm has revolutionized
the way we do comparative genomics, and now
you can do PSI-BLAST and mpiBLAST,
and there’s just so much work that’s happened.
But yet I bet most of you have used BLAST
or one of its child prodigies in your own
research, and it was just remarkable.
And this is really one of the reasons that
I got inspired to think about bioinformatics
and data science, because I started to realize
when I was in physics that the world of data
and the world of biology and the world of
computing were the next big thing, and I think
that you might agree that that’s actually true.
In the years since 1995, I have traveled to Israel for a postdoctoral fellowship in computational biology.
I was a professor of computational biology
at a university, University of Maryland Baltimore
County, for a number of years, and one of
the projects that I worked on was this super
large protein complex called GroEL-GroES,
that is a protein chaperone complex, and it’s huge.
It’s 14 subunits, but you can’t see it all.
It’s all together as a big complex.
Each subunit is 58 kilodaltons.
I couldn’t even load that complex into memory
in my computer when I was working.
I had to do very large parallel processing
on supercomputers to just do the calculations
for how the GroEL-GroES chaperone complex and the proteins that are inside that are in blue actually work.
I switched.
I became a program director for the Department
of Energy, and I focused fully and totally
on data—data platforms, data computing—for
energy and the environment, very particularly on
bioenergy, translating poplar and other types
of soft woody plants into bioenergy complexes.
I decided to make that career change because
I wanted to have a bigger impact for a larger
amount of science, and I truly, truly am dedicated
to data science.
I was a division director at NIGMS and I worked
for Dr. Jon Lorsch, and I was the director
of Biophysics, Biomedical Technology, and Computational Biosciences, and I really wanted to think
about how we can change the landscape for
technology, incorporating much more new and
innovative technology as well as new ideas
for team science.
And now I am the associate director for data
science, where I am working across NIH and
across the community to make data and data resources
findable, accessible, interoperable, and reusable.
And I also am the mother of two fantastic
young adults, Andrew Phillips, who is a junior
in college studying, of all things, organic
chemistry, and my daughter, Abigail Phillips,
who is finishing high school and hoping someday
to have a career in dance.
And I still brew beer.
Almost every month I have another five-gallon
carboy of beer brewing.
And here we are today.
You have data at your fingertips, and you
have wonderful platforms to access and use that data.
You’re always connected and you’re always
on, and that’s a wonderful thing.
And maybe it’s a curse too, but it’s so
nice to never have to listen to that dial-up sound.
You have supercomputers like we’ve never
seen before that can really address problems
of great complexity.
The problem I showed you, the GroEL-GroES chaperone
complex, could easily be handled today without
any special workarounds with massive parallel
computing.
You have R and R Shiny, and you can write codelets that you can map right onto the bare metal
with Kubernetes.
And you can package up your code into Docker containers and move it around to different
cloud resources.
And you’re working in a community.
You have GitHub.
You share your software.
This is just an example of Jupyter, but there’s
such a great software-sharing community that’s
available to you.
So how can we use all these tools that we have at our hands today to address a pandemic that's significant?
How can we partner with industry for workflows
and tools and analysis?
And how can we provide you the resources so
that you can get your work done?
I want to just give you three or four use
cases of what we’re doing right now at NIH
that you can use to study COVID-19.
And this is an amazing story of two intramural
researchers—one of them from NIDDK and the
other from NCI—so NCI is National Cancer Institute, and
NIDDK is National Institute of Diabetes and
Digestive and Kidney Diseases.
And they, in three weeks, collected specimens
from pathology, created the digital images of
those specimens, de-identified them, partnered
with a company called HALO, and put those
whole-slide images up for you to use for reference so that you can study and understand COVID-19.
Right now we have much more than eight reference cases because our two intramural researchers are
getting more and more samples every day from
hospitals from different countries, so I think
we’re up to 19 reference cases, but
there’s more coming in every day.
And we're going to integrate this particular
resource into a much larger resource in the
near future, but just right now you can go
and do some limited artificial intelligence
algorithm development on these resources.
And we’re partnering with the gaming and
the video company to create processing workflows
for CT images.
CT has been one of the types of images that
you can use to detect COVID-19 in patients,
and so we’re developing those workflows
by using and leveraging gaming computers.
This is a very nice artificial intelligence
classification.
And we are providing high-performance computing
resources to the federal government, to industry,
to academic leaders around the world so that
you can use resources from the national labs,
resources from IBM, from Google, from AWS.
Over 4 million CPU cores are available.
The consortium is taking applications every day.
So if you have an idea that you think would
benefit from high-performance computing, this
consortium is there.
The resources are free for you.
We’ve come a long way since those days of
punch cards and 25 picosecond dynamic simulations
of tiny, tiny proteins, and I’m just wondering
where you, our new and brightest generation
of scientists, will take us in the future.
And with that, I would love to hear from you
your questions, your comments, and your thoughts.
And I’m going to turn it back over to Ming.
LEI: Thank you.
Thank you so much, Susan.
I will say that with computers, beer, and a lovely
family, that is a very exciting life.
So as I mentioned earlier, we are going to
have questions.
Waiting for questions from the audience.
I will ask the first one on behalf of our audience.
So for a biology major interested in a research
career, what would be the key computational
and data science training or skills that the
student should pick up while he/she is in school?
GREGURICK: That is a great question, Ming. I would say that there are a few common ways in which biology is
coming to look at data and look at studies that you
can start to take classes in now, and that
would include getting familiar with the programming
language R, because quite a few software tools
are written for and in R. But if that’s
a bit of a barrier, there are also tools such
as Galaxy, which are a little bit more plug and
play, and so using the tools available in Galaxy or
Jupyter, you can have a lot of different types
of computational software like BLAST and others.
So getting familiar with those platforms and
learning to use those tools and understand
what the results mean for your research would
be a great step forward.
And Coursera is offering many different types
of computational classes available for students.
And I think NIH has offerings to make Coursera
computational data science classes freely
available for NIH students, so we would be
more than happy to point you to those resources.
LEI: Great, great. There is actually a question from one of the students.
Where would I go to apply for access to computer resources?
GREGURICK: From the HPC Consortium.
There is a website, and the application is processed through NSF, through a program called XSEDE.
NSF will route your application to the consortium,
and the proposal is very lightweight.
It’s only, I believe, three pages, so you
can certainly easily apply for those resources,
and then they will match the resource needs
to the application that you put in, so you can
have access to many different types of resources.
LEI: Related to that, NIH has training opportunities
and resources available as well, right?
GREGURICK: Absolutely.
There are a number of different training opportunities
that I did prepare as an extra slide, including
our SRA metadata cloud, BigQuery, and NIAID
bioinformatics training resources.
All of the resources that I told you about
today can be found on our website, including
the high-performance computing application.
And then there’s a number of training opportunities
that we will be having available, including
if you really want to do computing on bare
metal, there’s a Kubernetes engine two-day
course coming up later this month.
There are a number of other opportunities
in the works that could be either working
with Google, GCP, or AWS.
Some new opportunities for machine learning
as well as data engineering later in July.
LEI: Great. Another question is more about your own scientific journey.
How did you decide to change your field, and
how did you update yourself with the new field?
GREGURICK: That is a great question. And it’s sort of a funny story.
I was studying physics, mostly in surface and
gas-phase physics.
And the funding was starting to change when
I was a grad student from that physics/Silicon Valley
type of funding much more into bioinformatics,
and my PhD thesis advisor said, “There are a few
opportunities in your life when you can do
a career change, and from graduate school
to postdoc is one of them.
If you want to make a change to computational
biology,” because he saw all the articles
I was reading, “now is the time you need
to do it.”
So I wrote to a number of people specifically to get
training from people who were prior physicists
who had moved to computational biology, and
that is how I chose my postdoc: by working
with somebody who had also been a physicist
so that we would have some common language.
It was a hard change.
I had taken very, very few biology classes
when I was an undergrad, and obviously no
biology classes when I was a grad student,
so I had a huge lift to retrain myself.
I was lucky that my postdoctoral advisor was
very patient with me as I did have to take
additional training and coursework in biology
in particular.
And I will be the first to admit that I do
not have the strength and background that many
of my colleagues at NIGMS have in biology,
and I often have to look to them for understanding
about the meaning of the systems that I’m
trying to study in much more complex detail
than I have.
Biology is so complicated, but it’s also
so fascinating.
LEI: Great. This follow-up question is from a different angle.
Do you have advice for postdocs not classically
trained in data computational science wanting
to transition into the field?
GREGURICK: Yes, absolutely. I would offer a similar thought: working with an advisor or doing a one-year
sabbatical as a junior assistant professor
with a colleague who has that training in
wet lab experimentation but has also made
a transition to computation will help you a lot.
So you might need to take an additional year
of postdoc or sabbatical to train in computational
sciences but working hands-on in the lab with
other people in the computational field will
give you a lot of insight.
I also took apart a lot of code to learn how
it worked, and a good way to learn
how something works is to take it apart and
then try to learn how to put it back together again.
LEI: OK, this is a closely related one.
What are the computational bioinformatics
opportunities as a prospective postdoc at NIH?
GREGURICK: There are a number of computational
fellowships that one can apply for.
There’s also a lot of funding for new investigators
in computational data science, and you
happen to be looking at the institute that
has, I’d say, the largest amount of computational
and data science funding opportunities, NIGMS,
and so working with them to get funding in
one of their programs is absolutely a wonderful opportunity.
LEI: Another one related to this, what level of math and statistics would you need to be able to take advantage
of the bioinformatic tools you mentioned earlier?
GREGURICK: I would say that having a good basic understanding of mathematics and statistics will always
help you. In fact, when I was looking at majors when
I was in college, I was thinking of double
majoring in computer science and one of my
colleagues told me that it’s much better to major
in math because math is the foundation of
most computer science.
And that’s true, I see that now.
So having a strong mathematics background
can never do you wrong.
But if there’s a little barrier, then having
a good foundation for statistics will definitely
be a very important tool to have in your toolbox.
LEI: Another one, what programming language will be suitable to understand computational biology?
GREGURICK: I have so many favorites, but probably
they’re a little old and outdated now, and
what I see is that people find R and R Shiny
to be very useful, and many of our professional
PIs are writing their programs in R. So if
I had to pick one, it would be R, but if you
ask me what my favorite programming language
is, it’s actually Perl.
I loved Perl so much.
I did not like Java very much and I certainly didn’t
like many of the threaded languages,
but I just absolutely loved Perl, but I don’t
think that’s very useful.
I think R is probably going to be your best bet.
LEI: Great. What would be your advice on gaining
computational skills you want to incorporate
into research rather than enter the field
as a whole?
What would be the best way for an undergrad
to approach a potential mentor?
GREGURICK: So you want to approach a mentor
and gain experience? I'm trying to understand how to parse that.
LEI: The first part is are there ways to gain
computational skills you want to incorporate
into research but not really want to become
a card-carrying data scientist.
GREGURICK: I would say learning some of the
more popular software tools is a great approach; BLAST,
for example, is a great tool.
Just learning how to use it and what those
results mean for your own research would probably
do you very well.
So you would never have to write any or much
code at all using existing software, but it
will really help you if you sort of know the
basics and know the results and know the foundation
of some of those more popular tools.
LEI: OK, here is a specific one, which tools would you recommend for cryo-EM image processing to determine protein structures?
GREGURICK: I am not an expert in that, but
I think there are some tools, something like
Cryolan is one tool that I’ve heard, and
I believe that’s been ported to the cloud
and actually I believe that NVIDIA worked on that
as well for cryo-EM.
There are probably other more popular and better
tools; that’s just one that I know about
because of the partnership with NVIDIA.
LEI: OK, there is also a pretty specific question that is
does NIH have open resources for services
such as sequencing samples from patients?
GREGURICK: Absolutely.
And this may or may not be available to the open community, but our institute NHGRI, our National Human
Genome Research Institute, does do sequencing
on patients, particularly right now for
COVID-19. We also have a national lab
in Frederick, Maryland, the Frederick National
Lab, which is doing sequencing on COVID-19
patients, as well as developing serology testing
and analyzing that data.
I see that CryoSPARC is another popular cryo-EM data
processing tool.
Thank you so much.
That must be Mary Ann Wu who has mentioned that.
So thank you very much.
CryoSPARC is coming up as another popular tool.
LEI: So let's go for another question.
Given that sectors such as banking and insurance
often offer much higher salaries to a student
with that kind of computational data
science training, what would you tell those
students so that they would consider biomedical
research as a rewarding and viable career choice?
GREGURICK: That’s a great question because
it’s always on my mind as well.
I would say that being an investigator and
a researcher in computational biology and
studying and understanding biology is rewarding
for a number of reasons.
The flexibility that you have in your career
and your career choices and the types of work
that you do, those are up to you.
You make the decisions and you are the captain
of your ship, and you make the contributions
to science, unlike in the private industry
where the captain of the ship is the CEO and
the board of directors, and they make a lot
of the decisions and you are implementing.
Here, when you are a researcher in an academic
setting, you are the one who is discovering
and pushing the field forward.
And if that passion for understanding, addressing
questions, using your skills in computer science
or in the wet lab drives you, you will stay
up day and night to do it.
You will find that the passion you have for
research will not be quenched by missing out
on the money that you might have earned
in industry.
LEI: What are some of the big issues you are working on as the NIH associate director for data science?
GREGURICK: Right now the biggest issue we’re
working on is with respect to COVID-19, and
that is that we have to very rapidly create
and move an infrastructure to get the data
and the information to scientists in such
a way that they can use their algorithms to
answer really important questions.
Data science requires data, but it requires
data to be well formatted, to be well curated,
to be annotated, to be in a common model so
that we can look across many different organizations,
and that’s what we’re working on right now.
And we’re spending all of our days, most
of our nights, and even our weekends—and
not just me, many people at NIH—to move the data into a form that researchers can use right now.
LEI: Which language do you think is best to
start learning if someone does not have any prior knowledge
of a programming language?
GREGURICK: I think the best one to begin with
is still probably working in R. I learned
Fortran—they don’t even teach that anymore—in college.
C++ underlies many of the programming languages
that are used, so that’s always a good language
to learn, especially if you want to be a heavy-hitting
computer science person.
But if you’re looking to pick something
up and be pretty proficient quickly, I do
recommend looking at R.
LEI: Is there a specific platform that is
better to take computer science courses online, like Coursera or Udemy? I'm sorry if I botched the names. I'm not familiar with them.
Is one better than the other?
GREGURICK: I’m much more familiar with Coursera,
and we have developed a partnership with them
so that we can provide training for a large
number of colleagues, so that is the one that
I personally know the best and would recommend,
but there probably are others.
My son is very fond of Khan Academy, and
he’s been taking a lot of courses, even when he was in high school, through Khan Academy.
LEI: Here's one question that requires some physician training, Susan.
With the transition from in-person to online, what would you recommend for preventing
your eyes from tiring due to staring at screens
for a long time?
GREGURICK: I don’t know if I’m qualified
to say or not, but my strategy is to take
lots of micro breaks, because I can certainly
understand what you’re saying in terms of eye strain.
And also sitting down all day is not so good
either, so my personal recommendation, and
I’m not a physician at all, I’m a computer
person, is to take micro breaks.
LEI: I think you have a brewer to take care
of, right?
GREGURICK: I do, yes.
LEI: Does NIH work with the big tech companies?
GREGURICK: Indeed. Yes, we do.
We have partnerships through our STRIDES program
with Google and AWS.
We partner with Palantir, which is a very
large analytics platform.
We partner with NVIDIA, which is a company
that develops gaming chips.
We partner with smaller companies.
I don’t know if Halo is super small, but
that’s the platform that we put the website up on.
So we do partner with a number of tech companies.
We’ve talked to a number of folks who are
in the AI space to look at partnerships.
We partner with the national labs and with
other agencies, such as NSF.
We’re looking to partner with sister
agencies such as the VA.
That’s how science moves forward: by working together.
Each partnership offers strengths, and we
have a strength too.
We don’t duplicate each other’s work;
we partner and together we move science forward.
LEI: Do you recommend any data science bootcamps
for more structured training?
GREGURICK: I have to say that I have a colleague
who is in my office, her name is Allissa Dillman,
and she runs a number of codeathons and
bootcamps, and so I would love to encourage
you to take one of her bootcamps.
And in order to see which one is running,
you have to go to my website, and I just now
see that we did not put it up there,
but if you go to the datascience.nih.gov
website, you’ll be able to find the bootcamps
that we’re running.
I’ve done a number of jamborees and bootcamps
in my past, and I’ve always loved the ones
that focused on writing analytic tools for
sequence analysis and metabolic pathway analysis.
Those are my personal favorites, but she runs
bootcamps on sequence analysis; she runs bootcamps
on understanding electronic healthcare record data.
She runs so many very different types of bootcamps.
But I would say that attending one of her bootcamps
would probably be a lot of fun.
She’s young and much more in tune with where
computer science is going than me. I haven’t
coded in more than, I don’t know, 10 years
now, I think.
LEI: All right. I’m interested in learning Python.
Do you have any advice on how I should learn?
I only have some experience with working on R.
GREGURICK: Yeah, Python. I can just tell you that my strategy
for learning was to get code, take it
apart, and then work with it.
Put in new subroutines, new algorithms, and
see if I could get it to do something new.
That’s how I helped my son learn programming,
so I would suggest if you’re interested
in Python, get some code written in Python
from GitHub and see if you can play around with it.
There are great books by O’Reilly on understanding
computer code at a little easier level, and
I would also get one of the O’Reilly books.
The Python book is particularly fun.
We have that at our house.
LEI: I’m interested in bioinformatics with
a biology background.
I don’t have any physics background.
If I want to know more about physics, where
should I start?
GREGURICK: That's a great question. There are a lot of primers that you can get to understand some of the
underlying physics behind the bioinformatics.
Sometimes it’s just helpful to take a paper
that you’re interested in and read some
of the references or some of the underlying methodology.
So you find a paper that you’re interested
in and you see some methodology, then go back
to your textbooks and learn a little bit more
from the methodology that’s in the paper.
Or you could always take a class in physics,
although they tend to be not completely relevant
to the paper that you’re reading. So that would be my suggestion.
LEI: OK, here is one that is more current.
What type of information is available in association with NIH COVID-19 samples?
For example, is there specific phenotypic
information, like GI or cardiovascular symptoms
and the severity?
Or medications that patients were on prior
to infection, such as ACE inhibitors?
Is there proteomic or RNA sequence data associated
with histological samples you mentioned?
GREGURICK: That is a great question, because
COVID-19 is such a hydra of a disease.
It’s been hard for us to get our hands around
it, so we’re looking at understanding some of
the very basic underlying electronic healthcare
record data that will tell you about medications
and about prior conditions, and at making it
available, but in a de-identified way so that
you wouldn’t be able to trace it back to
a particular individual, but you could look
at correlations between how a patient who has
COVID-19 presents when they enter the hospital
and what they have taken in terms of drugs or
what prior conditions they have.
In terms of proteomics and sequences, we have
much less data on that.
It’s hard to get those data.
The healthcare system tends to be a little
bit taxed, and so right now getting proteomic
samples has been more challenging and we are just now getting sequencing samples from COVID-19 patients.
Putting all that information together is our
grand challenge at this point.
We think we can make some of the data available.
As you can tell, it’s coming in a staged way: we have the pathology images available right now.
We don’t even have the CT images available
for researchers.
They are in the queue.
They need to be de-identified.
They need to be associated with the appropriate
standards and metadata so that you can use them.
So even getting those CT images is taking
a long time.
Getting the other data like electronic healthcare
record data de-identified, we hope we can
get that done by this summer, but it’s going
to take some time.
And the sequencing data, that might be even
longer, so you can see the struggles that we
have just to get the data out for researchers to use.
LEI: Great. What do you think of current, state-of-the-art research on protein structure prediction?
GREGURICK: I do have a favorite, and I’ve
been involved in protein structure determination
and prediction for a while.
In terms of determining the structures, certainly
X-ray scattering was a popular way to determine
structure for many, many years.
I certainly worked in X-ray structure as well
as neutron scattering, which is not as refined as X-ray.
Now we see cryo-EM blossoming into a real serious research tool for actual atom-specific structures.
Also, in protein structure prediction,
there was CASP, the Critical Assessment of
Structure Prediction, if you’re familiar
with it, a competition
that was run every two years. So I don’t
know what number we’re up to now, but when
I was working on it, people were doing homology
modeling, so taking a standard and trying
to align an unknown sequence to that standard.
They were working on threading.
I did a lot of threading.
I did a lot of genetic-algorithm protein
structure prediction and some molecular dynamics.
And then there was the work by David Baker
which looked at little tiny windows of protein
and mapped them onto existing structures.
And that approach seems to have been quite successful.
I think the field is still moving in that
direction of micro-threading.
I cannot believe I forgot Rosetta.
Rosetta, that was his program.
I think the field really pushed forward with
his revolutionary work in Rosetta, and now
I imagine what’s happening is much more
looking at artificial intelligence to gain
information about higher structures to even
move further into what those new structures might be.
So out of initial protein structure prediction,
I think the door really opened up with David
Baker’s work, but prior to that there was
an awful lot of BLAST-type algorithms.
LEI: As we move closer to the end of the hour,
the questions are getting more futuristic.
Here is one.
Do you think physically writing code will
be less important in 5 to 10 years when
you can use platforms like Galaxy for basic
and translational biological research?
GREGURICK: Actually, I think you’re kind of right.
I think that people are producing codelets, little
micro bits of code that can be swapped in
and out in a modular way. And so my old way
of taking a giant code and I had to work on Charm,
which is fairly huge, and trying to add subroutines
to it, will change to microcoding codelets
where you just swap out little bits.
So that’s the idea of Galaxy, and platform-based
coding is probably going to be much more standard
for many, many folks in the future.
I think that computer science is moving in
really interesting and fun directions, and
I look forward to watching what you guys do.
LEI: Good. Here's a question. Are there any online training courses that include the biophysics branch of bioinformatics?
GREGURICK: I would think so, but off the top
of my head I don’t have those online courses, although
I do know that through NIGMS we have funded
a number of big data online training courses.
Through the societies there are definitely
training courses, so the Biophysical Society
would be a great place to look for those online
training courses in biophysics.
LEI: Are there more questions?
I’ll wait a little bit.
Going once...going twice...three times.
Thank you so much, Susan.
This was a fantastic hour.
I hope everybody enjoyed it.
GREGURICK: Thank you so much.
It’s been a real pleasure to tell you about
my personal journey in data science and computational
science and where we are now with COVID-19. And I hope that you will take the opportunity
to look at the online training resources that
are available and also look at our website,
and do participate in any one of these training
opportunities offered through our STRIDES
partnerships with AWS and Google, or through
our NCBI courses and webinars and through the
NIAID bioinformatics training resources.
LEI: All right, thank you all.
Stay safe and be well.
GREGURICK: Thank you.
