Today we are going to review a research paper
about binary neural networks.
So why do we call it a binary neural network? In a binary neural network, all the weights and activations are either +1 or -1.
And why do we actually need binary neural networks?
Today neural networks are doing amazing things, from real-time object detection and image recognition to natural language processing. But these tasks need powerful, power-hungry devices like GPUs and TPUs, because they require thousands or even millions of multiplications and additions of floating-point numbers.
What if we have very limited resources? Say you want to do deep learning or machine learning on your phone, run image recognition on your smart glasses, or run a machine learning model on your watch. In those places, binary neural networks can help.
In a binary neural network, all the weights and activations are either +1 or -1, so we can store each one in just a single bit, which reduces the storage cost on the device. Arithmetic operations like multiplication and addition can then be done with bitwise operations such as XNOR, popcount, and bit shifts, which further reduces the computation cost on the device. This is the main application, and the main advantage, of using a binary neural network over a normal deep neural network.
This paper makes the following contributions.
First, it proposes a method to train a binary neural network.
Second, it runs experiments in the Torch7 and Theano frameworks to show that a BNN can achieve nearly state-of-the-art results on the MNIST and CIFAR-10 datasets.
Third, it shows that during the forward pass a BNN drastically reduces memory consumption and replaces almost all arithmetic operations with bitwise operations, which reduces the hardware complexity by almost 60%.
Fourth, it programs a binary matrix multiplication GPU kernel with which it is possible to run the MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy; a sketch of the idea behind such a kernel follows below.
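To make that last contribution concrete, here is a minimal NumPy sketch of the XNOR-popcount idea such a binary matrix multiplication kernel builds on (my own illustration, not the paper's actual GPU kernel): pack the +1/-1 values into bit words, XNOR them, and count matching bits instead of doing floating-point multiply-accumulates.

import numpy as np

def binary_dot(a_pm1, b_pm1):
    # Dot product of two +/-1 vectors via XNOR and popcount.
    # Encoding: +1 -> bit 1, -1 -> bit 0.
    n = a_pm1.size
    a_bits = np.packbits(a_pm1 > 0)
    b_bits = np.packbits(b_pm1 > 0)
    # XNOR = NOT XOR; a set bit means the two values matched.
    xnor = np.bitwise_not(np.bitwise_xor(a_bits, b_bits))
    matches = int(np.unpackbits(xnor, count=n).sum())  # popcount
    # Each match contributes +1 to the dot product, each mismatch -1.
    return 2 * matches - n

a = np.array([1, -1, 1, 1, -1, 1, -1, -1])
b = np.array([1, 1, -1, 1, -1, -1, -1, 1])
assert binary_dot(a, b) == int(a @ b)  # same result, no float multiplies

In the paper, 32 binary values are concatenated into each 32-bit register on the GPU, and this XNOR kernel is what runs the MNIST BNN 7 times faster than the unoptimized baseline kernel.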
The architecture of a binary neural network is pretty much the same as that of a normal deep neural network, except that all the weights and activations are binarized to +1 or -1.
In this paper, there are mainly two methods of binarizing the weights: deterministic and stochastic binarization.
Stochastic binarization is better than the plain sign function but is harder to implement, so the deterministic sign function is used more often.
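Here is a short NumPy sketch of both binarization functions as defined in the paper (the code itself is my illustration); the stochastic variant uses the hard sigmoid sigma(x) = clip((x + 1) / 2, 0, 1) as the probability of picking +1.

import numpy as np

def binarize_det(x):
    # Deterministic binarization: the Sign function.
    return np.where(x >= 0, 1.0, -1.0)

def hard_sigmoid(x):
    # sigma(x) = clip((x + 1) / 2, 0, 1)
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_stoch(x, rng):
    # Stochastic binarization: +1 with probability sigma(x), else -1.
    return np.where(rng.random(x.shape) < hard_sigmoid(x), 1.0, -1.0)

rng = np.random.default_rng(42)
x = np.array([-1.5, -0.2, 0.0, 0.7, 2.0])
print(binarize_det(x))         # [-1. -1.  1.  1.  1.]
print(binarize_stoch(x, rng))  # random, e.g. [-1. -1.  1.  1.  1.]

The stochastic version requires the hardware to generate random bits on every quantization, which is exactly why it is harder to implement.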
So how is gradient computation done in a binary neural network?
You can see this as a variation of the dropout layer in a normal deep neural network: instead of randomly making neurons inactive, here we binarize the weights and activations of the neurons.
For stochastic gradient descent (SGD) to work properly, we need to keep real-valued weights, which require more than one bit to store; it is necessary to keep sufficient resolution in these accumulators, which at first glance suggests that high precision is absolutely required.
Next, how is backpropagation done in BNNs?
One problem comes up while performing backpropagation: the derivative of the binarization functions is zero almost everywhere.
As a result, the exact gradient of the cost function w.r.t. the quantities before the binarization step would be zero.
This would render SGD useless and our network would never get trained!
To solve this, a "straight-through estimator" with a slight modification is used.
To know more about this, check the link below.
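As a rough sketch of what this modified straight-through estimator does (my NumPy illustration, not the paper's code): the forward pass uses Sign, and the backward pass lets the upstream gradient through unchanged, except that it is cancelled where the pre-binarization value r has saturated, i.e. where |r| > 1.

import numpy as np

def binarize_forward(r):
    # Forward pass: q = Sign(r).
    return np.where(r >= 0, 1.0, -1.0)

def binarize_backward(r, grad_q):
    # Straight-through estimator with saturation: pass the gradient
    # through as-is, but zero it where |r| > 1.
    return grad_q * (np.abs(r) <= 1.0)

r = np.array([-2.0, -0.5, 0.3, 1.7])
grad_q = np.array([0.1, 0.2, -0.3, 0.4])  # gradient w.r.t. q from upstream
print(binarize_backward(r, grad_q))       # [ 0.   0.2 -0.3  0. ]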
While updating the weights, the following is done:
Each real-valued weight, w_r, is constrained to remain between -1 and +1.
If a weight update brings w_r outside [-1, +1], it is clipped.
This is done because otherwise the real-valued weights would grow very large without having any impact on the binary weights, w_b.
The new binary weights are then calculated as w_b = Sign(w_r).
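Putting the update rule together, one SGD step looks like this minimal NumPy sketch (the learning rate and the values are made up for illustration):

import numpy as np

lr = 0.01
w_real = np.array([0.995, -0.4, 0.7])  # real-valued accumulators
grad = np.array([-1.2, 0.5, 0.3])      # gradients obtained via the STE

w_real = w_real - lr * grad                # ordinary SGD update
w_real = np.clip(w_real, -1.0, 1.0)        # constrain w_r to [-1, +1]
w_bin = np.where(w_real >= 0, 1.0, -1.0)   # w_b = Sign(w_r)

print(w_real)  # [ 1.    -0.405  0.697]
print(w_bin)   # [ 1. -1.  1.]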
So the main idea is to treat binarization as noise and harness the network's tolerance to that noise, so that the hardware demands become smaller.
We binarize the network using the binarization techniques above and replace the non-linearity with a binary non-linearity.
Since we use backpropagation, we need to differentiate the binarization function, but the binarization function gives zero on differentiation almost everywhere; that's why we use the straight-through estimator, which simply passes the gradients through as they are while also accounting for the saturation effect.
The remaining multiplications, such as the ones used for batch normalization, are replaced with shift operations, which reduces the hardware demand and also makes the operations more efficient and less power-hungry.
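For those shift-based multiplications, the paper approximates the multiplier by its nearest power of two (the AP2 operation), so multiplying by it reduces to adding exponents, which is a bit shift in hardware. A small NumPy sketch of the idea (my illustration; ap2 assumes nonzero input):

import numpy as np

def ap2(x):
    # Approximate power of 2: round |x| to the nearest power of two,
    # keeping the sign (assumes x != 0).
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x)))

x, y = 5.0, 0.30
print(x * y)       # 1.5   exact floating-point multiply
print(x * ap2(y))  # 1.25  0.30 ~ 2**-2, so this is effectively x >> 2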
Now for the results of using a binary neural network instead of a normal deep neural network.
It requires 32 times less memory, because all the 32-bit floating-point numbers are now stored as 1-bit binary numbers.
And by using numbers with fewer bits for multiplication and addition, the power consumption of the machine is also reduced severalfold.
Finally, the network was able to achieve nearly state-of-the-art performance, with little accuracy loss, on small datasets like MNIST, CIFAR-10, and SVHN; here are the results of the comparison.
But for a large dataset like ImageNet there was some degradation in performance; here is the comparison on the ImageNet dataset. This degradation can be reduced by just adding a few more bits.
The network was also 7 times faster on the GPU at runtime when implemented with SIMD (single instruction, multiple data) techniques.
All the links are provided in the description box or in the first comment of the post, so go and check them out. This video was brought to you by Dockship. Dockship is the marketplace for AI models, where you can explore hundreds of pre-trained models for your use case, so go to dockship.io to explore the models.
