Hello
Do you guys want me to use the mic?
(Yes, laughs)
I had to try
My name is Christian
I'll go through this presentation regarding different architectures for convolutional neural networks
This is an area that usually requires a prior introduction, but I'll try to give a very shallow overview
because one could talk for about an hour on each of these papers that I'm going to show here
I'll always focus on the architecture of the networks and on their applications
I'll show different applications and the different convnet architectures used for these applications
As I already mentioned, my name is Christian, I work at HP and I think that we can skip introductions
because we already introduced ourselves
The agenda here is to talk a little about traditional CNN architectures
then later I'll speak about Siamese Networks
then about Dense Prediction, and later about Localization and Detection.
and finally a short time for questions and answers if we have time
But you can interrupt me anytime if you have any question.
So, the traditional CNN architectures that we have today are usually built from three main blocks
The first block is the convolutional block
the second one is the pooling
and the last one is the fully connected, or dense, block
which is the block that is usually put at the end of the network to do the classification itself
There are some parts here that Thomas already explained earlier, so I'll go through them very fast
Convolutional networks usually have an input image that is represented in RGB color space
thus this image is seen as a volume with each of its color channels
and each one of these channels also has a width and a height, and this is the input of the convolutional network
As Thomas mentioned earlier, here is an example of a convolution
the convolutions are applied over each one of these channels
or when these filters are present in the middle of the network, they're applied over feature maps
and not over the input image channels
These filters slide over the images or feature maps, applying this convolution operation.
I think that Thomas also explained how pooling works, in fact this is the same slide that he used
(laughs)
Here we see an example of pooling
what is important to note about the pooling operation is that
besides doing this dimensionality reduction the deeper it goes
it also provides translation invariance up to a certain degree
For instance, if this pixel moves over there, it will make no difference at all
because a max operation will be applied over this region
so, if you have a lot of pooling layers, you'll get more and more translation invariance the deeper you go
and here we see the fully connected layers
which are layers whose neurons are fully connected to the previous layer
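As a rough illustration of these three building blocks stacked together, here is a minimal sketch in PyTorch (my own example with arbitrary layer sizes, not taken from any of the papers):

    import torch
    import torch.nn as nn

    # A minimal CNN: convolution -> pooling -> fully connected (dense)
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional block over the 3 RGB channels
        nn.ReLU(),
        nn.MaxPool2d(2),                              # pooling: halves width and height
        nn.Conv2d(16, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, 10),                    # dense layer doing the classification itself
    )

    x = torch.randn(1, 3, 32, 32)    # one 32x32 RGB image seen as a volume (channels x height x width)
    print(model(x).shape)            # torch.Size([1, 10]) -> class scores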
So, now I'll start to talk about Siamese Networks
which have an architecture that is a bit different from the traditional architectures
they don't have only one branch of information flowing, but two branches
So, for instance, the input of these networks isn't only one image
also, just to note, Siamese networks aren't only for images, they can be used for other problems too
but in this example I'll assume that we will use a convolutional network
so, in this case we'll have two inputs, called X1 and X2
and you'll have an output, which is the label of your instance, that you want to learn
so for instance, in the problem of face verification
where you want to verify if the image of someone's face shows the same person as another image
but maybe with a little rotation or distortion, or with different illumination problems
you just use these two images as inputs in these two branches
and then this network will learn a metric space
and what this means is that, in our case, we have a Euclidean distance function
that will take the output features of these two branches
and these features will be learned in a way that makes them close if both instances (X1 and X2) belong to the same class
these networks are also trained with some "impostor" pairs of images
where you train with one image of one person and another image of a different person
and in this case, the network will separate these two instances in the Euclidean space that we're learning
There is also an entire area in Machine Learning that handles this kind of problem, which is called Metric Learning
what this network is doing here is learning this metric
using these two branches
What is important to note here is that these two branches are actually not two different networks
They are the same network because they share weights
these two inputs X1 and X2, they don't go through different networks, they go through the same network
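As a rough sketch of that weight sharing, and of a contrastive loss over the Euclidean distance between the two outputs, something along these lines could be used (my own illustration with an arbitrary margin, not the exact loss of any specific paper):

    import torch.nn as nn
    import torch.nn.functional as F

    class SiameseNet(nn.Module):
        def __init__(self, embedding_dim=256):
            super().__init__()
            # A single network: both branches reuse these same weights
            self.branch = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(32, embedding_dim),
            )

        def forward(self, x1, x2):
            return self.branch(x1), self.branch(x2)   # X1 and X2 go through the same weights

    def contrastive_loss(f1, f2, same, margin=1.0):
        # same = 1 for a genuine pair, 0 for an "impostor" pair
        d = F.pairwise_distance(f1, f2)
        return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()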
To make it clear, I'll start by showing some examples of this network
In this application example, they trained a model
where they used product images, such as these iconic product images without background
and images of products in some scene, a decor scene
The problem that they tried to solve in this paper is that many people on social sites related to decor/design
want to search for examples of some product in decor scenes; they want to see examples of how to use this product in their homes
Or even, when users have a product, for instance, they want to search for what the product is and also search for similar products
and also for doing recommendation or any byproduct of this information
So, how this process works:
They train this metric space, this embedding space shown here
Here they projected it into just two dimensions (with t-SNE) for visualization purposes
But each one of these points, representing the instances, has 256 dimensions
or 256 features
And these features are what they learned using a Siamese Network.
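(As a side note, a 2-D projection like the one in that figure can be reproduced with something along these lines, assuming the 256-D embeddings are already available as a NumPy array; the file name is hypothetical:)

    import numpy as np
    from sklearn.manifold import TSNE

    embeddings = np.load("product_embeddings.npy")               # hypothetical file: (n_products, 256)
    points_2d = TSNE(n_components=2).fit_transform(embeddings)   # project 256-D -> 2-D for plotting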
So how is this training done: they use here an image of a product from some scene
And then an image of the same product out of the scene, an iconic image of the product
Then they input the two images into the two Siamese Network branches
Then, what happens is that when you have, as inputs X1 and X2, instances of the same product
the training process will make these two points in the metric space be close to each other
so that products that are the same, even when the product images are rotated, at different scales, etc., will be made close together in this learned space.
They also train this network with "impostor pairs" of products
where, for instance, in X1 there is one product class and in X2 there is a different product class
In this case, the training in this network will act as "forces" separating these two instances in this embedding space
In order to create a space, where when you make a query of a product in this network
you'll be able to search for products that are neighbors in this space,
and they will be all very similar to the product you used as input.
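That query step is essentially a nearest-neighbor search in the 256-D space; a small sketch, assuming the catalog and query embeddings were already extracted with one branch of the Siamese network (file names are hypothetical):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    catalog = np.load("product_embeddings.npy")              # hypothetical: (n_products, 256) features
    index = NearestNeighbors(n_neighbors=12, metric="euclidean").fit(catalog)

    query = np.load("query_embedding.npy").reshape(1, -1)    # 256-D feature of the query image
    distances, neighbor_ids = index.kneighbors(query)        # the most similar products in the space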
They really succeeded very well in this problem,
I had a slide that I had to cut due to time constraints
but the results were very interesting.
And here we have a very good example, where we have this chair,
and they were able to find products even when parts of the product were occluded, etc.
(question: this training using the impostor pairs that you mentioned, is it explicit or is the impostor identified?)
It is explicit, when you train this network, instance after instance,
you'll use pairs of products that belong to the same class and pairs with impostor products
for the network to learn how to bring instances together and also how to separate them in this space
Another very interesting example of Siamese Network, that I personally liked
In this paper, they used two Google datasets,
the first one is the images from Google Street View
and the other one is the satellite images, like images from Google Earth
So, what they do here, they train a Siamese Network
In a way that you can use an image from Street View (or a photo you took of a building on the street)
to search in these satellite images for where the building is located
They used the same training procedure that I mentioned earlier
They create pairs of images using this dataset
and in this dataset they have information of which Street View image belongs to which satellite image
So they train this Siamese Network, here is the training procedure
and then later (in an offline process) they do feature extraction of these aerial images
Thus you can use a photo of some building as a query in your network
and it will search through these features for which points
in this aerial map are visually closest to that query image (building, home, etc.)
It's a very interesting example; here it shows that some aerial images are even occluded
but even when they are occluded, it will still try to find them
making it invariant to these distortions between images that are captured from a parallel or an orthogonal angle
Here we also have some very interesting plots
These two cities above, Chicago and Charleston, were used in the training dataset
and those two other cities below were used as part of the testing dataset
which means that these two cities below weren't used to train this network
The green rectangles are the correct correspondences of the query image
And here it shows, (counting), the twelve most similar instances to the query image
So here we can see this query image that looks like a warehouse
I don't know if you are able to see it clearly
They have this image from Street View, and it returned this image
which is the correct aerial image for this warehouse
And here is the most interesting part, where they have this heatmap
where the color intensity of each one of these pixels
shows how close this region is to this kind of building
So, for instance, you can see that here in Tokyo
where you have a query image from a corner building
The distance query returned images from buildings located at corners
and it ended up activating the heatmap in a lot of crossings in the city
Here you can also see San Diego, where it was able to find buildings that are very similar
and it activated this small region of the city where it contains this kind of architectural similarity
To quickly sum up this architecture, it is a very interesting type of architecture
where you can use it to create what we call "embeddings"
which means that you can embed images or instances that have a high dimensionality
into a metric space that has a low dimensionality, such as 256-D features, where you can do a fast search
where this search is invariant to image distortions, illumination distortions, etc.
Was I clear?
Is there any question?
Now we're going to talk about Dense Prediction
So, what is Dense Prediction?
There are many CNNs that we train where the output isn't always a class or a small regression value
in many cases, the output of the CNN can also be an image
or a "dense output"
sometimes this output has the same dimensions as the input image
but it has a dense characteristic
instead of being a single classification probability
Here we can see a very interesting architecture, because it contains elements developed in the past few years
This is an example from a paper
that does automatic colorization of gray-scale images
This is a paper from 2016, this year
We can see that this architecture does not contain only a single branch, but two branches
They use this gray-scale image as input into both branches
In the initial layers of this network, we can see there is a Siamese Network
but only up to a certain point; we can also see that it is sharing its weights here
So, up to this layer, this is a Siamese Network, with two branches sharing the same weights
After that, they continue in two branches, one that will extract local features
and the other will extract global features
But what are local and global features ?
Global features are information regarding what the image represents, whether it is a garden, etc., the context of the image
and local features are features that have a spatial nature
They contain, for instance, the information that in that region there is a tree, a grass field, sky, etc.
So, these are the two kinds of features that this architecture extracts using different branches
And then, after extracting local and global features,
there is another layer here that does the fusion of those two sets of features
It will concatenate the features that are coming from this global branch with this feature map
that is coming from this local branch
So this concatenation works as a way to fuse these two different kinds of information
that are local and global
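A rough sketch of that fusion step, assuming the local branch outputs a spatial feature map and the global branch a single vector (this is my simplification of the idea, with arbitrary sizes): the global vector is replicated at every spatial position and concatenated along the channel dimension.

    import torch

    local_feats = torch.randn(1, 256, 28, 28)    # local branch: spatial feature map
    global_feats = torch.randn(1, 256)           # global branch: one vector for the whole image

    # Replicate the global vector at every spatial position, then concatenate along channels
    g = global_feats[:, :, None, None].expand(-1, -1, 28, 28)
    fused = torch.cat([local_feats, g], dim=1)   # shape (1, 512, 28, 28)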
After doing this information fusion, they will use a Transposed Convolution
or also called "deconvolution"
These kinds of convolutions
are filters that, instead of doing dimensionality reduction, will upsample the feature maps
But this upsample is not the same as a simple resize
they learn these filters using data from the dataset
these are filters that learn how to increase the spatial dimensions
Because if you look at this architecture, in these first layers there is a dimensionality reduction
the initial part of the network is increasing the depth of the feature maps, but reducing the feature map width/height
so it will have to "undo" this later
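A small sketch of such a learned upsampling layer in PyTorch (arbitrary sizes, just to show that the spatial dimensions grow while the filter weights stay learnable):

    import torch
    import torch.nn as nn

    upsample = nn.ConvTranspose2d(in_channels=128, out_channels=64,
                                  kernel_size=4, stride=2, padding=1)   # learned filters
    feature_map = torch.randn(1, 128, 28, 28)
    print(upsample(feature_map).shape)   # torch.Size([1, 64, 56, 56]) -> width/height doubled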
Then in the end they are doing this prediction of what is called "chrominance"
and you can see that this map has now the size of the original image
So here they used a color space where you have the luminance and chrominance of the pixels
which is different than RGB
Since they already have the original image in gray-scale that already contains the luminance
they only need to merge these two components
together with the original image
and then they have the original image naturally colored
using data to learn this colorization process
This kind of application is very interesting because
the datasets required to train this can usually be obtained without effort,
you just need to take colored images, convert them to grayscale, and then you have your instance and label
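A sketch of how such instance/label pairs can be built from any color photo, assuming a luminance/chrominance space such as Lab (this is my illustration with scikit-image; the paper's exact color space and preprocessing may differ, and the file name is hypothetical):

    from skimage import color, io

    rgb = io.imread("some_photo.jpg")         # any color photograph
    lab = color.rgb2lab(rgb)                  # L = luminance, a/b = chrominance
    luminance = lab[:, :, 0]                  # network input (the gray-scale image)
    chrominance = lab[:, :, 1:]               # network target (what it learns to predict)

    # At test time: merge the original luminance with the predicted chrominance
    # and convert back to RGB (color.lab2rgb) to obtain the colorized image.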
Jointly with this color prediction, they also used an auxiliary task here
Thus, at the same time that it is coloring the image, it is also classifying the image
they use this information about the image to construct the global features, this is not incidental
and the results are really awesome, here you can see the original gray-scale images in the top row
and below you can see that it was able to predict, for each pixel, whether this pixel belongs to asphalt, or is part of a brick from a building
or trees, and it is capturing very subtle changes of color at the tops of the trees
Was it clear? Is there any question?
Here we can see another example,
where we can see that it was even able to detect the skin very well
it is also very good at demarcating where the skin is present, it doesn't overlap regions
In this other example it was even able to detect snow
So it's a very interesting application, where it can colorize an image
based not on an artificial coloring process, but using a network that was trained on natural images
(question: do they explain why they needed that branch extracting global features?)
Yes, I'll explain in a minute the effect of these global features, here they are playing with it
It will become more clear after these examples
So, what they do to color this image, is to use the same image as input of these two branches
But since you have two different branches, you can use different images to extract global/local features from
In the next examples that I'll show, they used one image as the input for the local features, which is the image
that they use to color, and then another different image as input in the global branch
and you'll see the effect of this in different situations
So here in this first image, there is an image of a green field
And for this first image, the same image was used to extract both global and local features
And here there is another different field which is soy I think,
and they also used it as both the global and local image
But here in the last one, they used the green field as the image to extract local features,
and the another one in the middle as input to extract the global features
So, for instance, the global information here is telling the network that this field
is a soy field and not a green grass field
so you can see here that these global features are changing the color of the image, acting as the context of the image
Here in the bottom row, there is also a very interesting experiment
they made the colorization process change the time of day at which the photo appears to have been taken
So here in the first column of this bottom row, they have the same image as input for both branches
And here they have as inputs for the global branch, images that were taken
closer to the sunset and another one closer to the sunrise
So you can see that the atmosphere color and also the color of the rocks are very characteristic
of an image that was taken at sunset or at sunrise
So it's a very interesting experiment, where they were able to change the global information
in order to color the image as if it was shot at different moments of the day
Are there any other questions?
(question: in the slide that you showed before, where they have images in grayscale and also color images...
Which slide ?
(any of them)
Ok
(is there any example of taking the original color image, converting it to grayscale,
(and then comparing the results of the colorization with the original one?)
They have this in the paper, I just don't have it here in the slides
But I'll make this presentation available later
It is also important to note that these images are images with good results, but there are some images
where the results aren't as good as these ones
But what happens sometimes, and you'll see this in the paper, is, for instance, when you have an image of a river,
some trees and then a tent
And then, what happens is that in the original image, this tent is orange
but after the colorization process, it becomes blue
And you'll see that they mention this in the paper, because it is an effect of the nature of the problem itself
where you have ambiguity,
because you can have tents in a wide variety of different colors
but there are other things that have common colors, for instance: sky, snow, skin color. They usually follow
a common pattern that is less diverse than, for instance, the colors of tents.
I'll make these slides available later, so the name of the papers are all cited below the images.
Right now I'm going to talk about the CNNs..
I don't know how much time I still have
I think that it is going to fit
I'll talk about Localization and Detection
This is one of the networks, that was created in 2014,
It's a very well-known network, and there are also a lot of evolutions using this idea
And this network was created to solve a problem, the problem of detecting different objects with different classes in images
Usually, there are a lot of problems that we solve using CNNs that are related to classification
but in this case, they try to solve the problem of not only saying which class the object in the scene belongs to
but also delimiting the boundaries of where this object is located
by using a bounding box (a rectangle)
So here we have an example of the input image,
where we have many rectangles that represent different objects in this scene
and what happens here is that each one of these rectangles is used as input to this CNN
Sorry, just to note that the name of this network is R-CNN
Do not confuse it with Recurrent Networks
It's called R-CNN because it is a Region based CNN
Because it is based on regions, I'll talk about these regions later
So, these different regions of the image,
are warped to a fixed size before entering the CNN
And then each one of these regions go through the CNN
and each one of these regions will produce two outputs
One of these outputs is the features that will be used by an SVM classifier
that will then classify the class of the object in this rectangle
And the other is a regression,
That will adjust the size of the bounding box proposal that we used,
this happens because these rectangles come from a region proposal method
later I'll show you how they work
However, these rectangles may not be the best fit for the object inside them
and this is why we have this regression in here
that will adjust the bounding box coordinates
and improve their final results
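Schematically, the head on top of each warped region produces those two outputs; a minimal sketch (my own stand-in, not the original implementation), where an SVM would later be trained on the extracted features:

    import torch
    import torch.nn as nn

    backbone = nn.Sequential(                   # stand-in for the per-region CNN
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(7), nn.Flatten(),
    )
    bbox_regressor = nn.Linear(32 * 7 * 7, 4)   # adjusts (x, y, w, h) of the proposal

    region = torch.randn(1, 3, 224, 224)        # one region proposal, warped to a fixed size
    features = backbone(region)                 # later fed to an SVM classifier (trained separately)
    bbox_delta = bbox_regressor(features)       # refinement of the bounding box coordinates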
As I already mentioned, you need to crop and warp each one of the regions before using it as input into the CNN,
But how do we define these region proposals ?
So here they used a method called "Selective Search", which performs this image segmentation
It's like a clustering of pixels where at each step you're merging these small clusters
It'll make these pixel regions be merged in a hierarchical way
and then it generates some regions
Each one of these regions here is then represented by its minimum enclosing bounding box
And then these are the region proposals that are used as inputs for the network
Usually these methods generate from 2k up to 3k region proposals,
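For reference, a similar proposal step can be reproduced with OpenCV's contrib implementation of Selective Search (a sketch; it needs the opencv-contrib-python package, it is not the authors' exact code, and the image path is hypothetical):

    import cv2

    image = cv2.imread("scene.jpg")
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()
    proposals = ss.process()    # array of (x, y, w, h) rectangles, typically a few thousand
    print(len(proposals))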
I think that my time slot is over,
I'll talk here very briefly about one of the evolutions over the R-CNN
Since in the R-CNN you need to evaluate the CNN for the 2k rectangle proposals
This usually takes a lot of time because it requires a lot of computational resources to do these forward passes
To solve this, here you can see that they use the entire image as input into the CNN
Then they are able to do the region proposals in the feature maps instead of doing it on the input image
So they don't have to use 2k images as inputs to this network, they just need to input the entire image
together with the list of proposals to do these predictions from the feature maps
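torchvision ships an RoI pooling operator that does exactly this cropping on the feature maps; a small sketch of the idea, with an arbitrary stand-in backbone and hand-written proposals:

    import torch
    import torch.nn as nn
    from torchvision.ops import roi_pool

    backbone = nn.Conv2d(3, 64, 3, stride=4, padding=1)   # stand-in for a real convnet
    image = torch.randn(1, 3, 224, 224)
    feature_map = backbone(image)                          # one forward pass for the whole image

    # Proposals as (batch_index, x1, y1, x2, y2) in image coordinates
    proposals = torch.tensor([[0, 10., 10., 120., 150.],
                              [0, 30., 40., 200., 220.]])
    pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=0.25)
    print(pooled.shape)   # torch.Size([2, 64, 7, 7]) -> fixed size, whatever the proposal size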
My time slot is over, are there any questions?
(question: regarding the creation of the dataset itself..)
For which problem?
(for any, images for instance)
(is there any criterion that says "now I have a representative dataset for my problem"?)
Nowadays, there is no objective criterion for defining what amount of data is enough for your problem
You need to take your dataset and give it a try, this problem is usually solved empirically
Usually, for classification problems for instance, to have a good accuracy,
it's a matter of having 100k, 200k, 500k instances
However, this (having a small dataset) can be addressed using techniques from Transfer Learning
where you train a network, for instance on ImageNet, where you have millions and millions of images
and then, for instance, if you have a network with 10 layers,
these layers will learn how to extract features from these images
but the features that were learned,
especially in the first layers of this network,
will be generic features like edges, or where there are circles,
and the deeper you go in this network,
the more abstract these features become,
like for instance detecting an eye, the wheel of a car, etc.
and these generic features are also useful for other domains,
So if you want to classify an image that you have, you won't train only on your dataset,
You'll train a model in this large dataset from ImageNet
and then you can freeze some layers of this network,
especially the first layers where you have very generic features,
and then you do a fine-tuning in the final layers with your smaller dataset.
By doing this, you'll end up requiring a much smaller dataset
to solve your problem.
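A typical sketch of that freeze-and-fine-tune recipe with a torchvision model pretrained on ImageNet (my example, assuming a hypothetical 5-class problem; newer torchvision versions use the weights= argument instead of pretrained=):

    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(pretrained=True)        # network trained on millions of ImageNet images

    for param in model.parameters():                # freeze everything: generic features (edges, etc.)
        param.requires_grad = False

    model.fc = nn.Linear(model.fc.in_features, 5)   # new final layer for your own 5 classes
    # Only model.fc will be trained now, so a much smaller dataset is enough for fine-tuning.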
But it will depend a lot on the domain of your problem.
Are there any other questions?
Another one there
(question: do you know which model is state-of-the-art in this problem of detection?)
For the localization, as far as I know, after this Fast R-CNN,
there is another model called Faster R-CNN,
There is the R-CNN, the Fast R-CNN and the Faster R-CNN.
which makes sense,
so, since you can choose any kind of CNN architecture,
There is no time to go deeper, but they've created this RoI pooling layer here
which is a pooling layer, because the dense layers at the end of the network require a fixed-size feature map
so this RoI pooling was created to overcome this problem,
so you can plug any convnet in here,
so if you plug a convnet that is SOTA for classification for instance,
for instance a ResNet architecture or another network,
you can plug it here and then you get SOTA also in detection
What I recall being SOTA today, at least in published work, is Faster R-CNN together with ResNet.
Is there any other question?
