In this video, we provide a quick walk-through
of our paper: Pixel Adaptive Convolutional
Neural Networks.
Convolution is the fundamental building block
of modern deep neural networks.
A key property of standard convolution is
the spatial sharing of weights.
Although widely considered an advantage,
spatial sharing is not always ideal.
Consider the convolution operation on this
image.
The same set of filters is applied everywhere
in a sliding-window fashion.
Moreover, once learned, the same set of filters
is applied to all images,
even if the images depict very different types
of scenes.
In other words, convolution is content-agnostic.
Filters optimized for one scenario, e.g. sunny
days,
are not necessarily a good fit for rainy days.
A CNN can use a large number of filters to mitigate
this issue,
but that requires more parameters and
potentially also more training data.
In this work, we propose a content-adaptive
operation
that still retains several favorable properties
of standard spatial convolution.
In a standard spatial convolution, a filter
W is multiplied with local patches
in a sliding-window fashion to generate the
output v’.
In our layer, we consider an additional input,
f,
which itself can come from some other network
layers.
A kernel function, K, which we call the “adapting
kernel”, is applied to f,
and compares f at each pixel with all other
pixels within its filtering neighborhood.
The result is then used to modify the filter
weight W for the current position.
Because the filters are adapted differently
across pixel positions,
we call our operation pixel adaptive convolution,
or PAC in short.
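To make the operation concrete, here is a minimal single-channel NumPy sketch of the procedure just described. The function names, the Gaussian choice of K, and all hyper-parameters are our own illustration, not the paper's released code; bias, multiple channels, and learning are omitted.

```python
import numpy as np

def gaussian_kernel(fi, fj, sigma=1.0):
    # One possible adapting kernel: a Gaussian on the feature difference.
    d = fi - fj
    return np.exp(-0.5 * np.dot(d, d) / sigma ** 2)

def pac_naive(v, f, W, kernel=gaussian_kernel):
    """Naive pixel-adaptive convolution (single channel, stride 1).

    v : (H, W) input signal
    f : (H, W, D) adapting features, one D-dim vector per pixel
    W : (k, k) spatial filter, k odd
    """
    H, Wd = v.shape
    k = W.shape[0]
    r = k // 2
    out = np.zeros_like(v, dtype=np.float64)
    for i in range(H):
        for j in range(Wd):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < Wd:
                        # The spatial weight is modulated per pixel pair
                        # by the adapting kernel K(f_i, f_j).
                        acc += kernel(f[i, j], f[ii, jj]) \
                               * W[di + r, dj + r] * v[ii, jj]
            out[i, j] = acc
    return out
```

In the actual layer, W is a learned filter and f comes from the network, but the per-position modulation of W by K(f_i, f_j) is the core idea.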
In its definition, the only modification relative to
a spatial convolution
is the added term representing the adapting
kernel, K.
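Written out (a sketch using the notation above; Ω(i) denotes the filtering neighborhood of pixel i, p_i its position, and b an optional bias):

```latex
\begin{align*}
  \text{spatial convolution:}\quad v'_i &= \sum_{j \in \Omega(i)} W[\,p_i - p_j\,]\, v_j + b \\
  \text{PAC:}\quad v'_i &= \sum_{j \in \Omega(i)} K(f_i, f_j)\, W[\,p_i - p_j\,]\, v_j + b
\end{align*}
```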
Despite being a simple modification,
PAC is highly flexible and can be seen as
a generalization of several widely-used filters.
Here are a few examples.
It is already obvious from the previous discussion
that
spatial convolution is a special case of PAC
with a constant adapting kernel.
By defining f as pixel color, and K and W
both as Gaussians,
PAC can represent bilateral filtering.
PAC can also represent pooling operations
such as standard average pooling,
and more recent techniques such as detail-preserving
pooling,
by defining a different kernel function, K.
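To illustrate how these special cases arise, here are the corresponding kernel choices, reusing the pac_naive sketch from above; again, this is our own illustrative code, not the paper's.

```python
import numpy as np

# Reusing pac_naive from the earlier sketch (illustrative only).

# 1) Constant adapting kernel -> plain spatial convolution.
constant_kernel = lambda fi, fj: 1.0

# 2) f = pixel color, K a Gaussian on color (with W a spatial Gaussian)
#    -> bilateral filtering.
def color_gaussian(fi, fj, sigma_color=0.1):
    d = fi - fj
    return np.exp(-0.5 * np.dot(d, d) / sigma_color ** 2)

# 3) Average pooling: W uniform (all entries 1 / k**2) with a constant K;
#    content-aware choices of K give variants such as
#    detail-preserving pooling.
```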
PAC is flexible, and we expect it to have great
potential in applications.
As a first use case, we demonstrate the use
of the transposed convolution variant of PAC
for joint upsampling.
For depth upsampling, our network can successfully
recover details not visible in the low-res
signal.
Similar results can also be obtained in optical
flow upsampling.
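As a rough sketch of the joint-upsampling idea (not the paper's learned transposed-convolution variant): upsample the low-resolution signal and filter it with PAC, using the high-resolution guidance image as the adapting features f. The code below reuses pac_naive from the earlier sketch; the function name and the nearest-neighbor upsampling are our simplification.

```python
import numpy as np

def joint_upsample_sketch(low_res, guide, W, kernel, scale=4):
    """Illustrative joint upsampling with PAC (reuses pac_naive above).

    low_res : (h, w) low-resolution signal, e.g. depth
    guide   : (h*scale, w*scale, D) high-resolution guidance features
    """
    # Naive nearest-neighbor upsampling to the guidance resolution.
    up = np.kron(low_res, np.ones((scale, scale)))
    # PAC filtering adapted by the high-res guide can restore edges
    # that are not visible in the low-res signal.
    return pac_naive(up, guide, W, kernel)
```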
As a second use case,
we show that PAC can replace the approximate
high-dimensional filtering typically used
in CRFs.
A sparse pairwise connection pattern is achieved
with the dilation option of our layer,
and helps provide faster inference on GPUs.
We can produce finer visual details for some
outputs, and also report improved overall
accuracies.
We also propose a way to use PAC to easily
leverage existing architectures and pre-trained
models,
which we call “hot-swapping”.
Given a pre-trained network,
hot-swapping directly replaces some CONV layers
with their PAC counterparts,
and retains the pre-trained weights as initialization
before further fine-tuning.
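In code, hot-swapping might look roughly like the following PyTorch-style sketch. Here pac_layer_cls is a placeholder for a PAC layer implementation with Conv2d-compatible constructor arguments and weight/bias attributes; this is our illustration, not the paper's exact code.

```python
import torch
import torch.nn as nn

def hot_swap(conv: nn.Conv2d, pac_layer_cls) -> nn.Module:
    """Replace a pre-trained Conv2d with a PAC layer of the same shape,
    keeping the pre-trained weights as initialization (sketch only).

    pac_layer_cls is assumed to accept Conv2d-like constructor arguments
    and to expose .weight / .bias tensors of matching shapes.
    """
    pac = pac_layer_cls(conv.in_channels, conv.out_channels,
                        kernel_size=conv.kernel_size,
                        stride=conv.stride, padding=conv.padding,
                        bias=conv.bias is not None)
    with torch.no_grad():
        # Retain the pre-trained filters (and bias) before fine-tuning.
        pac.weight.copy_(conv.weight)
        if conv.bias is not None:
            pac.bias.copy_(conv.bias)
    return pac
```

The swapped network is then fine-tuned as usual.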
We show that hot-swapping can bring consistent
performance improvements
for semantic segmentation while adding minimal
computation.
In summary, PAC is a content-adaptive filter
that generalizes several existing techniques.
We demonstrate its use cases including joint
upsampling,
efficient CRF inference,
and network layer hot-swapping.
Please refer to the respective sections for
more details.
Thank you!
