
English: 
Hello everyone, my name is Cory McLean,
and I'm an engineer on 
the genomics team in Google Brain.
Today I'm excited 
to tell you about Nucleus,
which is a library we've released today
to make it easy to bring 
genomics data to TensorFlow.
So genomics is the study of the structure
and function of genomes.
In every cell in your body 
you have two copies of the genome,
one from each parent.
And these are the strings of DNA,
which is a four-letter alphabet,
and about 3 billion letters in the genome.
So here is a picture of a snapshot
on chromosome 1 - 150,000 letters.
What we can see is there's a number
of known things about this area.
One, there are functional elements,

English: 
like the genes depicted 
in that second row.
Biological measurements allow us 
to analyze the different things
that are active in cells, 
so on that third row
we can see the amount of gene expression
across different tissue types
is quantified there.
And at the bottom, 
through sequencing many people,
we can identify places where 
there's variation across individuals.
There are many different 
computational algorithmic challenges
in developing that image.
This ranges from: 
on the experimental data generation side,
can we better take the output 
of these physical measurements
to get accurate DNA readings?
Or reduce noise in the experiments
that quantify this expression?
Can we take the DNA sequence
and interpret where functional elements 
like these genes are?
Or predict how active 
they are in different tissue types?

English: 
And can we identify places 
where individuals vary
compared to our reference?
And how is that different 
in small variants versus, say, in cancer?
And how do those changes 
influence human traits?
One thing that is really exciting for us
is there are many opportunities
for deep learning in genomics.
Now a lot of that is driven 
by the increase
in the amount of data available.
This graph shows 
the dramatic reduction in cost
to sequence a million bases of DNA 
over the past decade.
But also there's a lot of structure
in these data sets
that is often complex 
and difficult to represent
with relatively simple models.
But this may display 
a convolutional structure
so we can use techniques 
from image classification
as well as sequence models.
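
To make the convolutional idea concrete, here is a minimal sketch in plain Python. It is not code from Nucleus or DeepVariant. It assumes the common practice of one-hot encoding the four-letter DNA alphabet, then slides a small filter along the sequence the way a 1-D convolution would. The names one_hot, conv1d, and tata_kernel are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: one-hot encode DNA and slide a small filter over it.
BASES = "ACGT"


def one_hot(sequence):
    """Return one row per base, with a 1.0 in the column of that base.

    Characters outside A, C, G, T (for example N) become all-zero rows.
    """
    return [[1.0 if base == b else 0.0 for b in BASES]
            for base in sequence.upper()]


def conv1d(encoded, kernel):
    """Valid (no padding) 1-D convolution of a one-hot sequence with a kernel."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores


# A hypothetical filter that responds most strongly to the motif "TATA".
tata_kernel = one_hot("TATA")
scores = conv1d(one_hot("GGTATAGC"), tata_kernel)

# The highest score marks where the motif starts in the sequence.
best_position = scores.index(max(scores))
```

In a trained network the filter weights would be learned rather than written by hand, but the sliding-window scoring is the same operation.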
And there have been 
a number of proven successes
of applying deep learning
to problems in genomics

English: 
such as DeepVariant, 
which is a tool our group developed
to identify small variants
using convolutional neural networks.
So our goals in genomics 
are multifaceted.
One is to make it easy to apply 
TensorFlow to problems in genomics,
and do this by creating libraries
to make it easy to work
with genomics data.
We're also interested in developing tools
and pushing the boundaries
on some of these scientific questions
using those things that we've built,
and then want to make 
all of that publicly available
as tools that can be 
used by the community.
So today I'll focus on 
the first part of making it easy
to bring genomics data to TensorFlow.
So what is a major problem?
One major difficulty is that 
there are many different types of data
that are generated for genomics research.

English: 
You can see here on the right,
a subset of different types used,
and these different file formats
have varying amounts of support
and in general no uniform APIs.
We also have some concerns
about efficiency and language support
where we would like to be able to express
some manipulations in Python
but we need efficient ways 
to go through this data,
which native Python alone 
wouldn't make possible.
So to address these challenges
we developed Nucleus,
which is a C++ and Python library
for reading and writing genomic data
to make it easy to bring 
to TensorFlow models,
and then feed through the tf.data API
that Derek talked about earlier today,
for training models 
for your particular task of interest.
In this release we support the reading
of many of the most common
data formats in genomics

English: 
and provide a unified API 
across the different data types.
So we're able to iterate 
through the different records
of these different types, 
and be able to query
on specific regions of the genome
to access the data there.
The way that we developed this
uses protocol buffers under the hood
so that we can implement 
all of the general parsing in C++
and then make those available 
to other languages, like Python.
For those of you familiar with genomics,
we end up using HTSlib
which is a canonical parser 
for the high-throughput sequencing formats
like the aligned reads and variants,
and then wrap that to generate
the protocol buffers.
And then use CLIF on top of this
to make the data available to Python.

English: 
And finally we use 
some of the TensorFlow core libraries
so that we can write out this data
as TFRecords
so that they can be ingested
by the tf.data API.
So the data types that 
we currently support are the following
ranging from general genome annotation
to reference genomes
and different sequence reads,
whether they're direct off a sequencer
or mapped, as well as genetic variants.
So to give an example of the reading API
it's quite straightforward.
So this is a toy example,
but is essentially similar 
to what is used for DeepVariant,
when we want to train a model
to identify actual genome variations
based on mapped sequence reads
and a reference genome.
So we have three different 
data types that we need.
We import the different reader types

English: 
and then, say, in this region 
that we're interested in
we can issue queries
to each of the different reader types
and then have iterables
of these protocol buffers as output,
which we can then manipulate
and turn into TensorFlow examples.
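
The reading pattern described here can be sketched without the library itself. The code below is a toy, in-memory stand-in and not the Nucleus API. The names Region, Record, and ToyReader are hypothetical, and plain named tuples take the place of the protocol buffers. It only shows the shape of the idea: every reader type offers the same two operations, iterate over everything or query a region, and the results are iterables of records. The talk's example uses three data types (reads, a reference genome, and variants); two are enough to show the pattern.

```python
from collections import namedtuple

# Hypothetical stand-ins for the protocol buffer records described in the talk.
# Coordinates are assumed to be 0-based with an exclusive end.
Region = namedtuple("Region", ["chrom", "start", "end"])
Record = namedtuple("Record", ["kind", "chrom", "start", "end", "payload"])


class ToyReader:
    """In-memory reader exposing the same two operations for every data type."""

    def __init__(self, records):
        self._records = sorted(records, key=lambda r: (r.chrom, r.start))

    def iterate(self):
        """Yield every record in coordinate order."""
        return iter(self._records)

    def query(self, region):
        """Yield records that overlap the half-open interval of `region`."""
        for record in self._records:
            if (record.chrom == region.chrom
                    and record.start < region.end
                    and record.end > region.start):
                yield record


reads = ToyReader([
    Record("read", "chr1", 90, 190, "ACGTAC"),
    Record("read", "chr1", 500, 600, "TTGACA"),
])
variants = ToyReader([
    Record("variant", "chr1", 120, 121, "A>G"),
    Record("variant", "chr2", 120, 121, "C>T"),
])

# One region of interest, queried the same way against each reader type.
region = Region("chr1", 100, 200)
reads_in_region = list(reads.query(region))
variants_in_region = list(variants.query(region))
```

A real reader would use a file index to jump to the region rather than scanning every record, which is part of why the parsing is done in C++.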
On the writing side, 
it's similarly straightforward.
So if we have a list of variants,
for the common VCF format,
we'll have an associated header
which provides metadata about this
and then open a writer with that header
and then just loop through 
the variants and write them.
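
The write pattern (header first, then a loop over variants) can be sketched as follows. This is a hand-rolled illustration and not the Nucleus writer. ToyVcfWriter and the tuple layout for variants are hypothetical. The only parts taken from the VCF format itself are the "##" metadata lines and the eight fixed, tab-separated columns.

```python
import io


class ToyVcfWriter:
    """Writes header lines once on open, then one tab-separated line per variant."""

    COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

    def __init__(self, handle, header_lines):
        self._handle = handle
        # Metadata lines come first, followed by the column header line.
        for line in header_lines:
            self._handle.write(line + "\n")
        self._handle.write("\t".join(self.COLUMNS) + "\n")

    def write(self, variant):
        """Write one (chrom, pos, ref, alt) tuple; unknown fields become '.'."""
        chrom, pos, ref, alt = variant
        fields = [chrom, str(pos), ".", ref, alt, ".", ".", "."]
        self._handle.write("\t".join(fields) + "\n")


header_lines = ["##fileformat=VCFv4.2", "##source=toy_example"]
variants = [("chr1", 121, "A", "G"), ("chr1", 305, "C", "T")]

# Open a writer with the header, then loop through the variants and write them.
output = io.StringIO()
writer = ToyVcfWriter(output, header_lines)
for variant in variants:
    writer.write(variant)

vcf_text = output.getvalue()
```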
And note that we support
writing to block [inaudible] format
which is common for 
the subsequent indexing by other tools.
However, we can also write 
directly to TFRecords
and here we provide 
some convenience methods
to write out sharded data

English: 
which we found helps avoid 
certain hotspots in the genome,
using a very similar API.
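
One way to picture sharded output is the sketch below. It is an assumption-laden illustration, not the library's implementation. The helper names shard_names and assign_round_robin are hypothetical, and the "name-00000-of-00003" file naming is a widely used convention that is assumed here rather than confirmed by the talk. Spreading consecutive records across shards means a dense stretch of the genome does not all land in one file.

```python
def shard_names(base, num_shards):
    """Return file names following the 'base-00000-of-0000N' convention."""
    return ["%s-%05d-of-%05d" % (base, i, num_shards)
            for i in range(num_shards)]


def assign_round_robin(records, num_shards):
    """Distribute records across shards so neighbors go to different shards."""
    shards = [[] for _ in range(num_shards)]
    for index, record in enumerate(records):
        shards[index % num_shards].append(record)
    return shards


names = shard_names("variants.tfrecord", 3)

# Seven consecutive records, for example from one busy region of the genome.
records = ["v%d" % i for i in range(7)]
shards = assign_round_robin(records, 3)
```

Each list in shards would then be written to the matching name in names, and a reader can take in all of the shard files together.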
Finally, we have been working 
with the Google Cloud team
which has some tools
for analyzing variant data
and so they have developed a tool
called Variant Transforms
which allows you to load 
the VCF variant files to BigQuery
using Apache Beam.
And then you can do
structured queries over that data.
And so we're working now on an integration 
that has Nucleus under the hood
providing that generation of the variants.
And to learn more about that tool,
you can go to the link below.
So to summarize, we have developed Nucleus
which is a C++ and Python library
to make it easy to bring 
genomics data to TensorFlow,

English: 
to train your models of interest
for genomic problems.
We have the ability to interoperate
with Cloud Genomics
and are being integrated into 
Variant Transforms at the moment.
This ended up being the foundation
of our CNN-based variant caller
which is also available, 
open-source, at the link below.
So with that I would like to thank you all
for your attention today.
(applause)

