
♪ (music) ♪
I'm Vijay.
Today I'll be talking to you
or hopefully convincing you
that when we try to apply
machine learning to solving problems,
that we should really be thinking
about designing search spaces
over solutions to those problems.
And then we can use
automated machine-learning techniques
in order to evaluate our ideas
much more efficiently.
I think a big reason
why a lot of us are here today
is due to the incredible impact
that machine-learning
can have on practical problems.
Two often-cited reasons for this
are that we have increasing amounts
of compute capability
and access to data to train on.
But I think one other aspect
is all of you-- right?
There's so many more people involved
in machine-learning today
that are contributing
and publishing ideas.
So this graph tries to put this
into perspective by measuring
how many machine-learning papers
are published on arXiv

every year since 2009,
and plotting that against
a Moore's Law exponential growth curve.
As you can see here,
we've been keeping up with Moore's Law
2x every two years very, very well.
And this is demonstrating just
how many new ideas
are being developed in the field.
This is a great thing, right?
So, one concrete way of looking at this
is that in the field of computer vision,
we've seen Top-1 ImageNet accuracy
start from something in the 50% range
from the AlexNet architecture,
which, by the way, revolutionized
the field of image classification.
And every year we've been just
getting better and better up until 2017.
Now, these improvements haven't come
just because we've been training
bigger models, right?
These improvements have also come
from the fact
that we have lots of great ideas, right?
Things like batch normalization,
residual or skip connections,
and various regularization techniques.
Now, each of these points,
like Jeff mentioned earlier,

is the result of years of research effort.
And we build on each other's ideas.
But one of the challenging things
is how do we keep up with so much,
so many ideas that are being produced?
And I also want you
to zoom in a little bit
in terms of the complexity
of some of these models.
So we are going to zoom in a little bit
on InceptionV4 here
and look at the kind of idea
that's embedded inside there.
These are two convolutional cell modules
within the Inception-V4 architecture.
Every single one of these arrows
and operations and even the numbers
was designed by a human.
Somebody wrote some code
in order to specify
all of these little details.
Now, there are high-level reasons
why this kind of architecture
might make sense,
but our theory doesn't really explain
with so much certainty
how every detail seems to matter.
And, as a field, I think
we're definitely working on trying
to improve the theory behind this.

But for many of us, we're just happy
to use this kind of complexity
out of the box, if we can,
because it really helps to solve problems.
Now this isn't too surprising-- we know
that because machine-learning
has had such an impact on real products
that we are going to be
willing to use
anything we possibly can.
And even if we don't understand
all the little, minor details,
as long as it solves our problems well
and, hopefully, that's understandable.
So, given all these ideas,
how can we harness this explosion
of ideas much more efficiently?
So let's step back and kind of ask
a few questions that we might've heard
when we were just trying 
to train machine-learning models.
These are simple questions,
but hard questions.
Like, "What learning rate should I apply
for my optimization?"
Or, if I am training
a Deep Neural Network model,
"What dropout rate should I apply?"
How do we answer this question today?
I think we combine a few
different types of benefits.
One of them is
leveraging researcher intuition
and engineering intuition.

And what this means is that we start
with code or we ask our colleagues,
"Hey, what are good settings
for these fields?"
And if it were the case that there was 
one setting that works for everybody,
we wouldn't even be looking
at these parameters.
But it does matter.
So then we move on
to some trial and error process.
We try a certain setting
and then we see
how well it works on our problem
and then we continue to iterate.
And I think the other aspect, 
which is becoming more common, hopefully,
is increasing access to compute and data
with which we can evaluate these ideas.
So this combination is really ripe
for automation, right?
And not surprisingly, this exists today.
It's called hyperparameter optimization
and in this kind of setup, what we have
is a tuner that's giving out settings
for these hyperparameters.
We have a trainer
that trains our model on our dataset,
and then tries to give some kind of signal
about how good those settings were.
So it might give a validation accuracy
of some value,

and the tuner can then learn
from this feedback to find better points
from the search space.
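This tuner/trainer loop can be sketched in a few lines. Everything below is an illustrative stand-in: the parameter ranges are made up, and the trainer is replaced by a toy scoring function rather than a real model.

```python
import math
import random

# Hypothetical search space: these ranges are illustrative, not from the talk.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-1),
    "dropout_rate": (0.0, 0.7),
}

def sample_settings(rng):
    """Tuner half of the loop: propose one point from the search space."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        # Learning rates are usually sampled on a log scale.
        "learning_rate": 10 ** rng.uniform(math.log10(lo), math.log10(hi)),
        "dropout_rate": rng.uniform(*SEARCH_SPACE["dropout_rate"]),
    }

def train_and_evaluate(settings):
    """Trainer half of the loop. A real trainer fits a model and reports
    validation accuracy; this stand-in just scores closeness to a
    made-up optimum at lr=1e-2, dropout=0.3 (higher is better)."""
    lr_err = (math.log10(settings["learning_rate"]) + 2.0) ** 2
    drop_err = (settings["dropout_rate"] - 0.3) ** 2
    return 1.0 / (1.0 + lr_err + drop_err)

def random_search(num_trials=50, seed=0):
    """Run the tuner/trainer loop and keep the best settings seen."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        settings = sample_settings(rng)
        score = train_and_evaluate(settings)
        if best is None or score > best[0]:
            best = (score, settings)
    return best

best_score, best_settings = random_search()
```

A tuner that learns from feedback (Bayesian optimization, reinforcement learning) would replace `sample_settings` with something that conditions on past scores instead of sampling blindly.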
This is an existing big field
and there are existing systems
like those shown at the very bottom
that can help you to do this.
But now, let's kind of ask a few more
complicated questions
or detailed questions
that I think people do
often ask as well.
"Why do you use batch norm before relu?"
"I switched the order
and it seems to work better."
Or, if you're trying to train
a completely new model,
"Should I use one type of sub-architecture
or another type of sub-architecture?"
Now, if you think about it,
these questions aren't really that
different from hyperparameter settings.
So, if we think
of hyperparameter optimization
as searching over
a specific domain of ideas,
then it seems possible that maybe
we can actually treat the decisions
made in this type of model
as another form of searching
over a domain of ideas.

And we can therefore think 
about de-emphasizing any specific decision
that we make in our architectures,
and, instead, think about the search space
of ideas that we might have.
So let's take a concrete example
of a search space design
that my colleague Derek did,
where he tried to design
a search space for a convolutional cell.
I'll walk you through
how you might design such a search space.
So the first question is--
you have to get your inputs.
So we might say, you have access
to your previous input,
and if you want to have support
for things like skip connections,
you might have
the previous-previous input.
So the first job in the search space
is to define which inputs
I am going to select.
And then, once you have
those inputs selected,
you want to then to figure out
what operation should I apply
to each of those inputs
before summing them together.
So I might select something
like 3x3 convolution,
or 3x3 max pooling and then combine
those operations together.
Now, we can then recursively 
turn that crank

and apply it several more times,
where we use, maybe, different operations
for different inputs.
And you can even use
the intermediate outputs 
of previous decisions in our search.
And then, finally, you take
all of your outputs that are unused,
and you can concatenate them together.
And that is your convolutional cell.
And then finally, if you want to build
your whole model just like ResNet,
you might stack that same model
several times.
Now, I want to point out
that this is just one example.
Like one point
from the search space of ideas.
There are a billion possible ways
that you can construct a cell like this
in the search space,
by changing the set of operations
you have on your list,
the kind of ways that the connections
can be made, and so forth.
So, now, we've defined our search space,
we kind of go back to that type
of parameter-tuning type system
where we have
a program generator on the left
that generates samples
from this search space.
We then train and evaluate
on the task at hand,
and oftentimes it's going
to be a proxy task.

And then we iterate to quickly find
what are the best programs
from our search space.
And the system on the left-- 
this program generator--
can optionally learn from feedback.
So it might use something like
reinforcement learning,
evolutionary algorithms,
or even random search
can work well in certain situations.
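As a toy illustration of a program generator that learns from feedback, here is a minimal evolutionary loop. The bit-string genome and the count-the-ones fitness are stand-ins (my assumptions) for real search-space decisions and proxy-task accuracy.

```python
import random

def evolve(fitness, genome_len=12, pop_size=20, generations=30, seed=0):
    """A minimal evolutionary 'program generator' sketch. Genomes are bit
    lists standing in for search-space decisions; each generation runs a
    two-way tournament and replaces the loser with a mutated copy of the
    winner, so the feedback signal steers what gets sampled next."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        i, j = rng.sample(range(pop_size), 2)
        win, lose = (i, j) if fitness(pop[i]) >= fitness(pop[j]) else (j, i)
        child = list(pop[win])
        child[rng.randrange(genome_len)] ^= 1  # flip one decision
        pop[lose] = child
    return max(pop, key=fitness)

# Toy objective: maximize the number of 1-bits.
best = evolve(sum)
```

Reinforcement-learning generators follow the same outer loop but update a policy from the feedback rather than a population.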
So we did this type of approach.
We took this convolutional cell,
we trained it on a proxy task,
such as CIFAR-10
for a small number of epochs
to make quick progress
on the evaluation of an idea.
And then we took the best candidate-cells
that we found from that search.
We enlarged them in terms of the number of filters
and the number of times we stacked them,
and we applied that
to the ImageNet dataset.
And these are two cells
found from that search.
If you then look at the results
of applying those,
you'll see that we were actually able
to do better than the existing state
of the art models,
in terms of Top-1 accuracy.
Now, this effort was an example
where we took a model

where decisions were pretty complex
and we, honestly, found
another complex model that was better.
But next, I'll show you an example
where we can use this general technique
to find even more interpretable outputs.
So, let's look
at optimization update rules.
Most of you are, probably, familiar
with stochastic gradient descent,
and the update rule is shown on the left
where you take the gradient
and you multiply it by the learning rate
and that's your weight delta.
And then we have
more advanced optimization update rules,
like Adam, and these can be expressed
fairly concisely just by being given
things like the moving average
of the gradient, and so forth.
But we really only have like a handful
of this type of optimization update rules
that we're typically applying
for deep learning, for example.
What if we, instead, treat
these update equation rules
as parts of a larger search space?

And so, you can take these expressions
and turn them
into a kind of data flow graph
that describes the update rule.
By using a small set
of primitive operations
and ways of combining them,
we can express these two update rules
using this simple tree,
but also a lot of other ideas.
And so you can then turn this crank
on this new search space,
and try to find
a better optimization update rule.
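One way to picture that search space is as expression trees over a small set of primitives. This is an illustrative sketch: the primitive set and tree encodings are my assumptions, and `POWER_SIGN_LIKE` only approximates the kind of sign-agreement rule such a search can find.

```python
import math

def evaluate(node, env):
    """Evaluate an update-rule expression tree.

    Leaves are operand names ('lr' = learning rate, 'g' = gradient,
    'm' = moving average of the gradient); internal nodes are tuples
    of (operator, child, child, ...)."""
    if isinstance(node, str):
        return env[node]
    op, *args = node
    vals = [evaluate(arg, env) for arg in args]
    if op == "mul":
        return vals[0] * vals[1]
    if op == "add":
        return vals[0] + vals[1]
    if op == "sign":
        return math.copysign(1.0, vals[0]) if vals[0] else 0.0
    if op == "exp":
        return math.exp(vals[0])
    raise ValueError("unknown op: %s" % op)

# SGD's update rule, lr * g, as a tree:
SGD = ("mul", "lr", "g")

# A sign-agreement rule, lr * g * exp(sign(g) * sign(m)): the step grows
# when the gradient and its moving average point the same way.
POWER_SIGN_LIKE = ("mul", "lr",
                   ("mul", "g",
                    ("exp", ("mul", ("sign", "g"), ("sign", "m")))))

env = {"lr": 0.1, "g": 2.0, "m": 1.5}
sgd_step = evaluate(SGD, env)
```

Turning the crank then means sampling new trees from the primitives, evaluating them as optimizers on a proxy task, and keeping the best.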
So, my colleagues Irwan and Barret
ran this experiment,
where they took
a fixed convolutional model
and tried to search
over the optimization rules.
And they found update equations
that did better than some of the others
I'd shown you on this particular task.
And one nice feature of this search space
is that the results
are much more interpretable.
If you look
at the fourth update rule here,
it's taking the gradient
and multiplying it by some expression
that essentially says
that when the gradient
and the moving average of the gradient
agree in their direction,
then we should take a bigger step
in that direction.
And when they disagree,

that we should take a smaller step.
This is actually a form of momentum.
And so one thing we can get from this is--
maybe we should be designing search spaces
that have more notions of momentum encoded
within the search space of ideas.
We may be able to find
even better results.
So far I've focused on techniques
and search space ideas,
where we care about accuracy.
But what's great about searching
over many ideas
is that we might have the potential
to search over more than just accuracy.
So, for example, a lot of us care
about inference speed, right?
We want to take a model and deploy it
on real hardware, real mobile platform.
And very often we spend a lot of time
trying to figure out how to take one idea
and make it fast enough.
But what if you could, as part
of the search space of ideas,
find ones that balance both speed
and accuracy?
So we tried to do this experiment
where we included the run time

on a real mobile device as part
of the inner loop of the evaluation.
So we tried to optimize for both--
accuracy, as well as inference speed.
And as this process goes on over time,
the program generator is able
to find faster models
while also figuring out how to make
those models even more accurate.
One interesting side effect of this
is that when you run searches over ideas,
the output is actually not just one model,
it's a collection of models
that implicitly encodes this tradeoff.
So, this kind of shows you
that we have points along the space
that provide a tradeoff
between inference speed
on the mobile platform,
and accuracy on the dataset
that we're trying to solve.
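One simple way to fold measured latency into the search objective is to scalarize the two metrics. The talk does not give the exact formulation; the form below (accuracy scaled by a soft latency penalty) is just one common choice, and the target and exponent values are made up.

```python
def reward(accuracy, latency_ms, target_ms=80.0, exponent=-0.07):
    """Scalarized objective: accuracy scaled by a soft latency penalty.

    Models slower than target_ms are discounted, faster ones get a mild
    bonus. target_ms and exponent are illustrative values."""
    return accuracy * (latency_ms / target_ms) ** exponent

# A slightly more accurate but much slower candidate can lose to a
# faster one under the combined objective:
fast = reward(accuracy=0.72, latency_ms=60.0)
slow = reward(accuracy=0.73, latency_ms=120.0)
```

The tuner then maximizes this single number, and the set of candidates it evaluates along the way traces out the speed/accuracy tradeoff curve.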
And so rather than
manually engineering the one point
that I really want to get working,
I can get a result that maybe
can be deployed
on various types of platforms.
So I'll emphasize this
in maybe a slightly different way.
Which is that we could define
a search space of ideas

in TensorFlow and through this automatic
machine-learning process,
we can get models
that have a guaranteed runtime
performance target
on a target platform device.
One of the nice things
about having an integrated ecosystem
like TensorFlow is that you can just
use the libraries that convert
from program to program,
so that you can get
this end to end pipeline
working well together.
There is no manual work required
to specifically tune a model.
So, let me conclude by returning 
to this process of evaluating ideas
in this world, where we're trying
to explore different ideas.
The first is that we design search spaces
to try to test out a large set
of possible ideas.
Now, note that when we designed
the search space,
that required human intuition.
There is still a need for human ingenuity
as part of this process.
So, designing the search space properly
takes a lot of effort
but at least it allows you to evaluate
many more ideas much more quickly.

And when it comes to trial-and-error,
we had to start thinking about
how software should be changed,
so that we can permit
this type of search process.
So, for example, I think many of us
have probably written scripts where
you take things like learning rate
and dropout rate as command-line flags.
What if you wanted to test out
much deeper ideas in your programs?
How do you design it such
that it's much more tunable
at all levels of your program?
I think this is a big question
for us to tackle.
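The familiar flag-based pattern looks something like this; the `--cell_type` flag is a hypothetical example of exposing a deeper, structural decision the same way instead of hard-coding it.

```python
import argparse

def build_parser():
    """Hyperparameters exposed as command-line flags, so that an outer
    search process can tune them without editing the program."""
    parser = argparse.ArgumentParser(description="tunable training script")
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--dropout_rate", type=float, default=0.5)
    # A structural decision exposed as a flag (hypothetical example):
    parser.add_argument("--cell_type", choices=["inception", "residual"],
                        default="residual")
    return parser

# A search process can now drive the program programmatically:
args = build_parser().parse_args(["--learning_rate", "0.1"])
```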
And lastly, we think these ideas
will become increasingly relevant
as many of you get access to more and more
computation capabilities,
such as things like TPU pods.
Imagine a world
where all you have to do
is take your idea,
submit it to an idea bank,
and you have some pod of TPUs
crunching overnight, trying to figure out
which solutions or ideas are the best.
And then waking up in the morning
and it telling you,
"these were the good ideas,
these were the bad ideas," and so forth.
I think part of the reason
this excites me
is that automatic machine-learning

can keep these machines
much more busy than we can.
We have to sleep, but machines
can keep on turning 24/7.
So, with that, thanks for listening!
(applause)
♪ (music) ♪

