
English: 
[MUSIC PLAYING]
So if you look at
panoptic vision,
a synonym for this
term is holistic vision.
So what we're looking
at here is basically
a unification of a
number of vision tasks
that were split before.
So, an example of this is this
kind of semantic segmentation
and instance segmentation.
Those were tasks that were
treated differently before.
So in panoptic vision, you
basically merge them together.
There are two main arguments
to go in this direction.
One is academic, right?
So it's kind of a
coincidence, really,
that these tasks are
treated differently.
That's just the historic fact.
There's no real
reason to do that.
But then there's a way
more pragmatic argument
to go towards this
unified vision system.

English: 
And that's because such systems are
actually much cheaper
to implement.
So in this classical
approach where
you have different neural
networks for every task,
we'll see that this
will actually cost you
a lot in terms of
energy consumption,
or you will not be able
to run that in real time.
So for every set of tasks,
you need a separate model
to execute that.
In panoptic vision, we unify
that in a single model.
And that will enable
us to really go
to real-time embedded vision.
If you're talking
about panoptic segmentation,
we cannot just talk
about accuracy as we do
in classification.
So I'll briefly introduce this kind
of panoptic quality measure.
You don't have to look
at the formulas, per se.
But panoptic
quality is basically
judged as the combination of
your segmentation quality--
so knowing which
pixel is which class--
and your recognition quality,
or knowing which object
is which class.
So it's a combination
of those two things.
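
[Editor's sketch, not part of the talk] The formulas are not reproduced in this transcript. As a rough illustration, the commonly used definition of panoptic quality (PQ) multiplies segmentation quality (SQ, the mean IoU of matched segments) by recognition quality (RQ, an F1-style detection score). The sketch below assumes that definition and assumes a predicted and a ground-truth segment count as matched when their IoU exceeds 0.5. The function name and its inputs are hypothetical, chosen only for illustration.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Return (PQ, SQ, RQ) for a single class.

    matched_ious: IoU values of the true-positive matches, i.e. pairs of
    predicted and ground-truth segments assumed to overlap with IoU > 0.5.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_false_positives + 0.5 * num_false_negatives
    if denom == 0:
        # No segments at all for this class: report zeros (a convention
        # chosen for this sketch; evaluators often skip such classes).
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0  # segmentation quality
    rq = tp / denom                             # recognition quality
    return sq * rq, sq, rq


# Three matched segments, one spurious prediction, one missed object.
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], 1, 1)
print(f"PQ={pq:.2f} SQ={sq:.2f} RQ={rq:.2f}")  # PQ=0.60 SQ=0.80 RQ=0.75
```

Here SQ only looks at how well the matched segments overlap, while RQ penalizes missed and spurious segments, so a model has to do well on both to get a high PQ.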

English: 
[MUSIC PLAYING]
