Hey, quick question.
Given a dataset of a billion images, is it
possible for you to correctly classify them into one of
300000 types of plants?
Woah!
Not that easy.
Alright!
Given thousands of images displaying numbers,
is it possible for
you to correctly classify them into one of the
10 digits?
The tasks specified would be intuitively straightforward
for us humans, but
hard to describe formally as a set of
rules for a machine to process
information.
So for a mathematical model to work on problems
like these, its network has to
be loosely analogous to the architecture of
a working brain.
The closest thing to that in today's
image-recognition technology
is the Convolutional Neural Network.
With the aid of Convolutional Neural Networks,
it is possible to classify
images with pretty decent accuracy
compared to other mathematical models.
Here's a simple example, which I'll
later scale up to a bigger one.
The MNIST dataset is a collection of thousands
of size-normalized handwritten
digits, centered in a 28*28 resolution image.
The 784 neurons from the 28*28
pixel image make up the first layer of the
network.
The last layer contains 10
neurons, each representing one of the numerical
digits, from 0 to 9.
Let's start with a grayscale image, i.e., a
range of monochromatic shades from
black to white, represented by numbers
from zero to one.
The input layer consists of a set of neurons
corresponding to each pixel of
the input image.
These neurons generally hold a number between
zero and one which represents the
grayscale value of the pixel.
These numbers are generally called activation
values, and neurons
whose values exceed a threshold
are considered activated.
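As a minimal sketch of that input layer (plain NumPy rather than any particular framework, and the pixel values here are made up), the step from 8-bit grayscale pixels to activation values in [0, 1] looks like:

```python
import numpy as np

# A made-up 28x28 grayscale image with 8-bit pixel values (0 = black, 255 = white).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Scale each pixel into [0, 1]; these become the activations of the input neurons.
activations = image.astype(np.float32) / 255.0

# Flatten to a vector of 784 values, one per input neuron.
input_layer = activations.reshape(-1)
print(input_layer.shape)  # (784,)
```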
The output layer is a set of neurons representing
the desired output for our
problem.
The number in each of these neurons represents
the probability of that neuron being the
desired output.
The layers in between the input and the output
layers are called the hidden
layers.
The neurons in the hidden layers hold certain
values known as weights, which
influence the predictions of the model.
These weights are sensitive to the
cost function, and are updated by backpropagation.
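To make that weight update concrete, here's a tiny gradient-descent sketch for a single neuron. The data, learning rate, and single-weight setup are made-up assumptions for illustration; a real network computes these gradients layer by layer via backpropagation.

```python
# Toy setup: one neuron, y_pred = w * x, squared-error cost (all values made up).
x, y_true = 2.0, 10.0
w = 1.0
lr = 0.1  # learning rate

for _ in range(50):
    y_pred = w * x
    # dCost/dw for cost = (y_pred - y_true)^2, via the chain rule.
    grad = 2 * (y_pred - y_true) * x
    w -= lr * grad  # nudge the weight against the gradient

print(round(w, 3))  # 5.0, since 5.0 * 2 = 10
```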
The types of hidden layers I wanted to talk
about are the convolutional and the
fully connected layers.
In a fully connected layer, each neuron
is connected to every neuron of
the next layer.
Fully connected layers account for about 5-10% of the computation,
about 95% of the
parameters, and have small representations.
Fully connected layers can only deal with
inputs of a fixed size, because they
require a specific number of parameters to
fully connect the input with the
output.
Convolutional layers, on the other hand, just slide the
same filters across the
input, so they can basically deal with input
of any arbitrary spatial size.
The convolution layers cumulatively account for
about 90-95% of the computation,
about 5% of the parameters, and have large
representations.
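A minimal NumPy sketch of this difference (the filter and input sizes here are arbitrary choices for illustration): the same small filter slides over inputs of any spatial size, while a fully connected layer's parameter count is tied to the input size.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one filter across the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.ones((3, 3)) / 9.0  # one 3x3 filter: just 9 parameters

# The same 9-parameter filter handles inputs of any spatial size...
print(conv2d_valid(np.ones((28, 28)), kernel).shape)  # (26, 26)
print(conv2d_valid(np.ones((64, 64)), kernel).shape)  # (62, 62)

# ...whereas a fully connected layer from a 28x28 input to 10 outputs
# needs a weight matrix whose size is fixed by the input size:
print(28 * 28 * 10)  # 7840 parameters, valid only for 28x28 inputs
```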
The penultimate layer, i.e., the layer before
the output layer, is generally a
fully connected layer.
The larger the number of images supplied,
the higher the prediction accuracy of the model.
Since the process is computationally intensive,
it's essential to
analyze the scope for parallelism
in the model.
When the workers are allowed to train on different
data samples, i.e., parallelism across the data dimension,
it is known as data parallelism.
Here, the workers must
synchronize the model parameters (or the parameter
gradients) to ensure that
they are training a consistent model.
Data parallelism is efficient when the amount of
computation per weight is high, because the
weight is the unit being
communicated.
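A toy sketch of that synchronization step, with plain NumPy standing in for what a distributed allreduce would do (the worker count and gradient values are made up):

```python
import numpy as np

# Toy data-parallel step: 4 workers, each computes gradients on its own
# data shard, then they synchronize by averaging.
rng = np.random.default_rng(0)
num_workers = 4
local_grads = [rng.normal(size=10) for _ in range(num_workers)]  # made-up gradients

# Synchronization: every worker ends up with the same averaged gradient,
# so all replicas apply the same update and stay consistent.
synced_grad = np.mean(local_grads, axis=0)

weights = np.zeros(10)
lr = 0.01
weights -= lr * synced_grad  # identical update on every worker
```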
Consider that a neuron can store up to 4 bytes.
For a smaller network as shown,
the amount of storage required for these last
two layers of the model will be
(16*10)*4.
That's 640 bytes of storage space.
Then what about a network model that has to
classify 300000 *types*?
In our case,
we use the standard ResNet-50 model, which
has 1024 neurons in its fully connected
layer.
Therefore, the amount of storage required
for these last two layers of
the model will be (1024*300000)*4.
That's approximately 1.2 gigabytes of storage
space.
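A quick check of that arithmetic (assuming 4 bytes per stored value, as above):

```python
BYTES_PER_VALUE = 4  # assuming 4-byte (float32) storage per value

# Small network from the example: 16 neurons fully connected to 10 outputs.
small = 16 * 10 * BYTES_PER_VALUE
print(small)  # 640 bytes

# ResNet-50-style case: 1024 neurons fully connected to 300000 classes.
large = 1024 * 300_000 * BYTES_PER_VALUE
print(large / 1e9)  # 1.2288, i.e. about 1.2 GB for the final layer alone
```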
And considering only data parallelism in this
case, each worker has to
store the complete model.
And here's where model parallelism comes into
play.
Different workers train
different parts of the model.
Model parallelism is efficient when the amount of computation
per neuron activity is high, because the neuron
activity is the unit being
communicated.
By combining these two kinds of parallelism,
we obtain better
scaling.
Current Convolutional Neural Network
models utilize data parallelism in
order to scale up and handle large training
sets.
When compute time cannot be
reduced further beyond a certain compute
capacity, and larger training data sets are
required, the inclusion of model parallelism helps.
In this ResNet-50 model, the fully connected
layer is split among the workers,
thereby making the computation less intensive.
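A toy sketch of splitting a fully connected layer across workers, with NumPy standing in for the real distributed implementation (the sizes are scaled down from the 300000-class case, and the data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Full fully connected layer: 1024 inputs -> 8 output classes
# (a tiny stand-in for the 300000-class case).
x = rng.normal(size=(1, 1024))
W = rng.normal(size=(1024, 8))
full_output = x @ W

# Model parallelism: split the weight matrix column-wise across 4 workers.
num_workers = 4
shards = np.split(W, num_workers, axis=1)  # each worker holds a 1024x2 slice
partial_outputs = [x @ shard for shard in shards]  # computed independently

# Concatenating the partial results reproduces the full layer's output.
combined = np.concatenate(partial_outputs, axis=1)
print(np.allclose(combined, full_output))  # True
```

Each worker now stores and multiplies only its own slice of the weight matrix, which is what makes the 1.2 GB final layer tractable.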
The entire model is implemented using the TensorFlow
and Horovod libraries in
Python.
In order to measure the run times, the TensorFlow
and Horovod timelines are
used.
For the convolutional and fully connected layers,
the forward and backward
computation times with respect to the layer
inputs, and the backward computation
time with respect to the filters, are tracked separately.
The data parallelism approach
for training the model using up to 8 GPUs
with a weak scaling strategy is
compared with a hybrid model, where each of
the 2 CPUs connects to four GPUs.
