So in this video, I'm going to give you a clear and simple explanation
of how DeepSORT works
and why it's so amazing compared to other models like
Tracktor++, TrackR-CNN, and JDE.
But to understand how DeepSORT works
we first have to go back,
waaaaaay back
to understand the fundamentals of object tracking
and the key innovations that had to happen along the way
for DeepSORT to emerge
Before we get started, if you are interested
in developing object tracking apps
then check out my course in the link down below
where I show you how you can fuse
the popular YOLOv4 with DeepSORT
for robust and real-time applications.
Okay so back to Object Tracking
Now let's imagine that you are working for SpaceX
and Mr. Musk has tasked you with ensuring that on launch
the ground camera is always pointing at the Falcon 9
as it thrusts into the atmosphere.
As much as you are excited to be personally chosen by Elon to work on this task
you ask yourself
“How will I go about this?”
Well
given that you have a PTZ, or
pan-tilt-zoom, camera aimed at the rocket,
you will need to implement a way to track the rocket
and keep the rocket at the center of the image.
So far so good…?
Just note that if you do not track it properly
your PTZ motion will stray off the target and you’ll end up with
a really disappointed Musk
And you cannot screw this up
because this is your first job
and you really want Elon Musk to be impressed
I mean who wouldn’t want to, right?
Soo, Question…
how will you track the rocket?
Well, you might say:
"Well Ritz, you did a whole tutorial series on object detection,
why don't we just track by detection,
you know, umm, using something
like YOLOv4 or Detectron2?"
Hahaha… Okay, okay…
let's see what happens if we use this method.
So the Falcon 9 launches on a day with clear blue skies.
You are armed with state-of-the-art detection models
for centering the camera on the rocket.
Everything is going well…
until all of a sudden
a stray pigeon swoops in front of the camera,
occluding the rocket from view,
and just like that, the rocket is out of sight.
…The boss is not happy
 Deep down inside you feel your heart sink
and your soul crushed by the disappointment
But you light up some greens
he chills out and after a smoke or two
he decides to give you another chance.
The high has also given you a chance
to reflect on why this did not work.
You conclude that while detection works great for single frames,
there needs to be a correlation of
tracked features between sequential images of the video.
Otherwise, with any sort of occlusion,
you will lose the detection and your target may slip out of the frame.
So you dig a little deeper in an attempt
not to disappoint Mr. Musk again.
You go back to traditional methods
such as mean shift and optical flow.
Starting with mean shift
you find out that it works by taking our object of interest
which you can visualize as a blob of pixels,
so not just location, but also size.
So in this case, the Falcon 9 rocket
that we are detecting is our blob.
Then you go to the next frame
and search within a larger region of interest,
known as the neighborhood, for the same blob.
You'll want to find the blob of pixels or features
in the next frame that best represents our rocket,
by maximizing a similarity function.
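To make that concrete, here's a minimal, dependency-free Python sketch of a mean-shift search step. The similarity map, window radius, and blob values are all made up for illustration; a real tracker (e.g. OpenCV's meanShift) works on histogram back-projections of the actual image.

```python
# Minimal mean-shift step on a 2-D "similarity map": each cell holds how
# well that pixel matches the target's appearance (1.0 = perfect match).

def mean_shift(sim_map, start, radius, iters=20):
    """Iteratively move `start` (row, col) to the weighted centroid of the
    similarity scores inside a square neighborhood of the given radius."""
    rows, cols = len(sim_map), len(sim_map[0])
    r, c = start
    for _ in range(iters):
        total = wr = wc = 0.0
        for i in range(max(0, r - radius), min(rows, r + radius + 1)):
            for j in range(max(0, c - radius), min(cols, c + radius + 1)):
                w = sim_map[i][j]
                total += w
                wr += w * i
                wc += w * j
        if total == 0:
            break  # nothing similar nearby: the track is lost
        nr, nc = round(wr / total), round(wc / total)
        if (nr, nc) == (r, c):
            break  # converged: the centroid no longer moves
        r, c = nr, nc
    return r, c

# The "rocket" blob sits around (6, 7) in frame 2; starting the search from
# its frame-1 position (5, 5), the window climbs onto the new blob location.
frame2 = [[0.0] * 10 for _ in range(10)]
for i in range(5, 8):
    for j in range(6, 9):
        frame2[i][j] = 1.0
print(mean_shift(frame2, start=(5, 5), radius=3))  # → (6, 7)
```

Note how the search never looks beyond the neighborhood radius, which is exactly the weakness we run into later.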
This strategy makes a lot of sense.
If your dog goes missing, you won't just drive to the countryside,
but instead start by searching your
immediate neighborhood for your best friend.
Unless of course you have a dog like Lassie.
In that case, she'll find you.
The other tool you look into is optical flow
which looks at the motion of features
due to the relative motion across frames
between the scene and camera.
So say, for example, you have your rocket in your image
and it moves up in the image;
you will be able to estimate the motion vectors in frame 2 relative to frame 1.
Now if your object is moving at a certain velocity,
you will be able to use these motion vectors
to track and even predict the trajectory of the object in the next frame.
A popular Optical Flow model that you could use for this is
Lucas Kanaader..
Kanada? Kanade?
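For intuition, here's a toy one-dimensional version of the Lucas-Kanade idea in plain Python: brightness is conserved between frames, so spatial gradient times displacement roughly cancels the temporal change, and we solve for the displacement by least squares over a window. The signals below are invented; the real algorithm works on 2-D image patches (e.g. cv2.calcOpticalFlowPyrLK).

```python
# Lucas-Kanade in one dimension: I_x * u ≈ -I_t, solved for u
# by least squares over the whole signal.

def flow_1d(frame1, frame2):
    """Estimate a single sub-pixel displacement between two 1-D signals."""
    num = den = 0.0
    for x in range(1, len(frame1) - 1):
        ix = (frame1[x + 1] - frame1[x - 1]) / 2.0  # spatial gradient I_x
        it = frame2[x] - frame1[x]                   # temporal gradient I_t
        num += ix * it
        den += ix * ix
    return -num / den if den else 0.0

# A bright bump shifted right by one pixel between frames:
f1 = [0, 0, 1, 2, 1, 0, 0, 0]
f2 = [0, 0, 0, 1, 2, 1, 0, 0]
print(flow_1d(f1, f2))  # → 1.0
```

That recovered motion vector is exactly what you'd use to predict where the rocket will be in the next frame.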
Cool, so now you've got another shot at impressing Mr. Musk.
He was only a little annoyed…
that's right… only a little annoyed…
…that you lost his rocket…
So to save Elon a buck or two,
you decide to model this in simulation
and test the viability of Optical flow and Mean Shift
You find out some interesting things from this experiment.
After running your simulations
you discover that while the traditional methods
have good tracking performance,
they are computationally complex and, in the case
of optical flow, prone to noise.
And mean shift is unreliable if the object happens
to go beyond the neighborhood region of interest.
So move too fast,
lose the track.
And that’s not even considering any type of significant occlusion.
So as much as you want to show this off to Mr. Musk
you have a gut feeling telling you
that you can do better.. way better!!
You go to your shrine and meditate for a bit,
spend some time crunching the numbers
and the reasons why you'd be better off working somewhere else.
But then you stumble across an amazing technique
used almost everywhere,
known as the Kalman filter.
Now I have a whole video on what the Kalman filter is
and how you can use it to catch Pokémon.
But essentially its premise is:
say you are tracking a ball rolling in one dimension.
You can easily detect it within each frame.
That detection is your input signal,
which you can rely on as long as there is a
clear line of sight to the ball, with very low noise.
Now during detection,
you decide to simulate cloudy conditions
using that fog machine
you used at the last office party.
You can still see the ball, but now
your vision sensor has noise in it,
decreasing your confidence
in where the ball is.
Now Lets make it a bit more complex
and throw in another scenario
where the ball travels
behind a box which occludes the ball.
How do you track something
that you can’t see?
Well, this is where the Kalman filter comes in.
Assuming a constant-velocity model
and a Gaussian distribution,
you can guesstimate where
the ball is based on the model of its motion.
When the ball can be seen, you rely more on the
sensor data and thus put more weight on it.
When it is partially occluded, you place weight
on both the motion model and the sensor measurements.
And if it's fully occluded,
you shift most of the weight onto the motion data.
And the best part of the Kalman filter
is that it is recursive,
meaning we take current readings
to predict the current state,
then use the measurements
to update our predictions.
Now of course there is a lot more to
the Kalman filter than we can cover in just one video.
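Here's a bare-bones sketch of that predict/update loop for the rolling ball, in plain Python. The noise values and measurements are made up for illustration; real implementations use matrix libraries, but the recursion is the same: predict from the motion model, then correct with the measurement when you have one.

```python
# A 1-D constant-velocity Kalman filter tracking the rolling ball.
# State is [position, velocity]. When the ball is occluded we simply
# skip the update step and trust the prediction.

class Kalman1D:
    def __init__(self, pos, vel, q=0.01, r=1.0):
        self.x = [pos, vel]                 # state estimate
        self.p = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q = q  # process noise: how much we trust the motion model
        self.r = r  # measurement noise: how much we trust the sensor

    def predict(self, dt=1.0):
        pos, vel = self.x
        self.x = [pos + vel * dt, vel]
        p = self.p
        # P = F P F^T + Q for F = [[1, dt], [0, 1]]
        p00 = p[0][0] + dt * (p[1][0] + p[0][1]) + dt * dt * p[1][1] + self.q
        self.p = [[p00, p[0][1] + dt * p[1][1]],
                  [p[1][0] + dt * p[1][1], p[1][1] + self.q]]
        return self.x[0]

    def update(self, z):
        # The Kalman gain decides how much weight the measurement gets.
        k0 = self.p[0][0] / (self.p[0][0] + self.r)
        k1 = self.p[1][0] / (self.p[0][0] + self.r)
        resid = z - self.x[0]
        self.x = [self.x[0] + k0 * resid, self.x[1] + k1 * resid]
        p = self.p
        self.p = [[(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
                  [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]]]
        return self.x[0]

kf = Kalman1D(pos=0.0, vel=1.0)
for z in [1.1, 1.9, 3.2]:   # noisy detections while the ball is visible
    kf.predict()
    kf.update(z)
for _ in range(2):          # ball rolls behind the box: predict only
    kf.predict()
print(kf.x[0])              # roughly 5: coasting on the motion model
```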
But by now you're probably wondering:
"Ritz, the title of this video is
DeepSORT…"
"What are you going on about Kalman filters
and traditional tracking algorithms from the good ol' days?!"
"What's going on here, man!?"
Hold up, hold up, hold up,
we are getting there, just bear with me.
The Kalman filter is a crucial component
in DeepSORT. Let's explore why.
The next launch is coming up soon,
where multiple projectiles may need to be tracked,
so you are required to find a way
for your camera to track your designated rocket.
The Kalman filter looks promising,
but the Kalman filter alone may not be enough.
Enter SORT:
Simple Online and Realtime Tracking.
You learn that SORT comprises
four core components:
1. Detection, 2. Estimation, 3. Association, and 4. Track identity creation and destruction.
Hmmm, this is where it all starts to come together.
You start with detection.
As you've learned earlier,
detection by itself is not enough for tracking.
However, the quality of detections
has a significant impact on tracking performance.
Bewley et al. used Faster R-CNN (VGG16) back in 2016;
now you can even use YOLOv4.
In our implementation we use YOLOv4,
which you can check out in the link down below.
Estimation
So we've got detections…
now what the f*** do we do with them?
Well, we need to propagate the detections
from the current frame to the next
using a linear constant-velocity model.
Remember the homework you
did earlier on the Kalman filter?
Yes, that time was not wasted.
When a detection is associated with a target,
the detected bounding box
is used to update the target state,
where the velocity components are optimally solved
via the Kalman filter framework.
However, if no detection is
associated with the target,
its state is simply predicted without correction
using the linear constant-velocity model.
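A quick sketch of that propagation step, with made-up numbers (and skipping the covariance bookkeeping a full Kalman filter carries): SORT's target state is the box center (u, v), scale s (area), and aspect ratio r, plus velocities for everything except r.

```python
# SORT-style estimation, sketched: with no associated detection, the state
# [u, v, s, r, du, dv, ds] is just propagated by the constant-velocity model.

def predict(state):
    u, v, s, r, du, dv, ds = state
    return [u + du, v + dv, max(s + ds, 1e-6), r, du, dv, ds]

def to_box(state):
    """Convert [u, v, s, r, ...] back to a corner box (x1, y1, x2, y2)."""
    u, v, s, r = state[:4]
    w = (s * r) ** 0.5
    h = s / w
    return (u - w / 2, v - h / 2, u + w / 2, v + h / 2)

# A target drifting right and up while slowly growing in area:
state = [100.0, 200.0, 400.0, 1.0, 5.0, -3.0, 10.0]
state = predict(state)
print(state[:4])  # → [105.0, 197.0, 410.0, 1.0]
```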
Target Association
In assigning detections to existing targets,
each target’s bounding box geometry
is estimated by predicting its new location in the latest frame.
The assignment cost matrix is then computed as
the intersection-over-union (IOU)
distance between each detection
and all predicted bounding boxes
from the existing targets.
The assignment is solved optimally
using the Hungarian algorithm.
This works particularly well
when one target occludes another.
In your face, swooping pigeon!!
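Here's a small, dependency-free sketch of that association step. SORT itself solves the assignment with the Hungarian algorithm (scipy.optimize.linear_sum_assignment in most implementations); brute force over permutations gives the same optimal answer for the tiny made-up boxes here.

```python
from itertools import permutations

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(detections, targets):
    """Return (det_idx, target_idx) pairs maximizing total IOU.
    Brute force: assumes no more detections than targets (fine for a sketch)."""
    best, best_pairs = -1.0, []
    for perm in permutations(range(len(targets)), len(detections)):
        pairs = list(enumerate(perm))
        score = sum(iou(detections[d], targets[t]) for d, t in pairs)
        if score > best:
            best, best_pairs = score, pairs
    return best_pairs

dets = [(0, 0, 10, 10), (50, 50, 60, 60)]
targets = [(49, 49, 59, 59), (1, 1, 11, 11)]  # predicted Kalman boxes
print(associate(dets, targets))  # → [(0, 1), (1, 0)]
```

Each detection is matched to the predicted box it overlaps most, globally, which is why two targets crossing paths don't simply swap identities on overlap alone.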
Track Identity Life Cycle
When objects enter and leave the image,
unique identities need to be created
or destroyed accordingly
For creating trackers, we consider any detection
with an overlap less than IOUmin
to signify the existence
of an untracked object.
The tracker is initialized using the geometry
of the bounding box, with the velocity set to zero.
Since the velocity is
unobserved at this point,
the covariance of the velocity component
is initialized with large values, reflecting this uncertainty.
Additionally, the new tracker then
undergoes a probationary period
where the target needs
to be associated with detections
to accumulate enough evidence
in order to prevent tracking of false positives.
Tracks are terminated
if they are not detected for TLost frames
(you can specify how many frames TLost should be).
Should an object reappear,
tracking will implicitly resume under a new identity.
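That life cycle can be sketched as a small Python class. The names and thresholds here (a probation of 3 hits, a TLost of 5 frames) are illustrative defaults, not the paper's exact bookkeeping:

```python
# Track identity life cycle, sketched: a new track starts on probation,
# is confirmed once it keeps matching detections, and is destroyed after
# t_lost consecutive frames without a match.

class Track:
    _next_id = 1

    def __init__(self, box, probation=3, t_lost=5):
        self.id = Track._next_id   # unique identity for this track
        Track._next_id += 1
        self.box = box
        self.hits = 0              # consecutive matched frames
        self.misses = 0            # consecutive unmatched frames
        self.probation = probation
        self.t_lost = t_lost
        self.confirmed = False

    def mark_matched(self, box):
        self.box = box
        self.hits += 1
        self.misses = 0
        if self.hits >= self.probation:
            self.confirmed = True  # enough evidence: not a false positive

    def mark_missed(self):
        self.misses += 1
        self.hits = 0

    @property
    def dead(self):
        return self.misses > self.t_lost

t = Track((0, 0, 10, 10))
for _ in range(3):                 # matched for 3 frames: survives probation
    t.mark_matched((1, 1, 11, 11))
print(t.confirmed)                 # → True
for _ in range(6):                 # unmatched for 6 frames: terminated
    t.mark_missed()
print(t.dead)                      # → True
```

If the same rocket reappears after termination, a fresh `Track` (and a fresh id) is created, which is exactly the identity-switch weakness DeepSORT goes on to attack.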
Wow, you are absolutely on fire now.
With all this SORT power consuming you,
you power up even more, surging
to a power level over 9000,
screaming until you transform
from SORT to your ultimate form:
DeepSORT.
Super Saiyans, be proud.
Now, you're almost there.
So now you explore your newfound powers
and learn what separates SORT from the upgraded DeepSORT.
So in SORT we learned that
we use a CNN for detection,
but what makes DeepSORT so different?
Well, if we analyze the full title, it is
Simple Online and Realtime Tracking, or SORT,
with a deep association metric.
"Hmm, okay Ritz…
I really hope you are going to explain
what a deep association metric is."
We'll discuss this in the next video…
hahah…
just kidding,
I can't leave you hanging like that.
Especially when we are so close to
completing the project for the Falcon 9 launch.
Okay, so where is the
deep learning in all of this?
Well, we have an object detector
that provides us detections,
the almighty Kalman filter tracking them
and filling in missing tracks,
and the Hungarian algorithm
associating detections to tracked objects.
You ask:
"So, is deep learning really required here?"
Well, while SORT achieves overall good performance
in terms of tracking precision and accuracy,
despite the effectiveness
of the Kalman filter,
it returns a relatively
high number of identity switches
and struggles to track
through occlusions and different viewpoints.
So, to improve this, the authors of DeepSORT introduced
another distance metric based on the “appearance” of the object.
The Appearance Feature Vector
So a classifier is built on our dataset
and trained meticulously
until it achieves reasonably good accuracy.
Then we take this network and
strip the final classification layer,
leaving behind a dense layer that produces
a single feature vector for each detection.
This feature vector is known
as the appearance descriptor.
Now, how this works is that
after the appearance descriptor is obtained,
the authors use nearest-neighbor queries
in visual appearance space
to establish
the measurement-to-track association.
"Measurement-to-track association", or MTA,
is the process of determining the relation
between a measurement and an existing track.
We also now use the
Mahalanobis distance, as opposed to
the Euclidean distance, for MTA.
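To make those two cues concrete, here's a dependency-free sketch: the (squared) Mahalanobis distance scales each error by the Kalman filter's uncertainty in that direction, and DeepSORT pairs it with a cosine distance between appearance descriptors. The covariance and vectors below are invented purely for illustration.

```python
def mahalanobis2(z, mean, cov):
    """Squared Mahalanobis distance for a 2-D measurement:
    how far z is from the prediction, in units of its uncertainty."""
    dx, dy = z[0] - mean[0], z[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # 2x2 matrix inverse
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

def cosine_distance(u, v):
    """Appearance cue: 0 means identical-looking descriptors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return 1.0 - dot / (nu * nv)

# The same 5-pixel offset is "close" along the direction we are uncertain
# about, and "far" along the direction we are confident about:
cov = [[25.0, 0.0], [0.0, 1.0]]  # very unsure in x, confident in y
print(mahalanobis2((105, 200), (100, 200), cov))  # → 1.0 (plausible match)
print(mahalanobis2((100, 205), (100, 200), cov))  # → 25.0 (very unlikely)

# Two similar-looking appearance descriptors give a small cosine distance:
d_app = cosine_distance([1.0, 0.0, 0.0], [0.9, 0.1, 0.0])
```

This is why Mahalanobis beats plain Euclidean here: both offsets above are 5 pixels apart in Euclidean terms, yet only one is consistent with the filter's uncertainty.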
So while tensions are mounting,
on the dawn of launch day
you quickly run your simulation
and find that the deep extension to the SORT algorithm
reduces the number
of identity switches by 45%
while achieving overall competitive performance
at high frame rates.
Just like that you find yourself
standing alongside Elon
in the bunker moments before
the commencement of the launch
You clench your fists,
you feel the sweat on your brow
heart beating,
saying: "This is it..
this is the moment of truth”
Elon raises the same question
that you have on your mind
“So.. will it work?”
You stammer a little… but
you answer with a confident
"I'm sure it will."
Elon looks forward as the countdown begins
3... 2... 1...
We have lift off!!
Your PTZ camera is set on the target
as the rocket lifts off from the ground…
So far so good, we have a track.
However, the rocket passes through some clouds
that partially occlude the target.
The camera is still on target;
the DeepSORT model is holding up quite well.
Very well, actually,
as you notice the swooping pigeon occlude the camera
on multiple occasions without hindrance to the tracker.
YES!!!
Mission Accomplished
Elon looks at you and extends
his hand outwards to shake yours and says:
“Well done, that was quite impressive.”
You can now relax and pop
some champagne with the team
Job well done!
That was quite an adventure,
through which you have learnt about object tracking,
particularly the DeepSORT model.
Just out of curiosity, you search the net for
DeepSORT alternatives and create a quick comparison.
You find three, which are:
1. Tracktor++ which is pretty accurate,
but one big drawback is that
it is not viable for real-time tracking.
Results show an average execution of 3 FPS.
If real-time execution is not a concern,
this is a great contender.
2. TrackR-CNN, which is nice
because it provides segmentation as a bonus.
But as with Tracktor++,
it is hard to utilize for real-time tracking
having an average execution of 1.6 FPS.
3. JDE, which displayed decent performance
of 12 FPS on average.
It is important to note that
the input size for the model is 1088x608,
so accordingly, we should expect JDE to reach
a lower FPS if the model is trained on Full HD.
Nevertheless, it has great accuracy
and should be a good selection.
DeepSORT is the fastest of the bunch,
thanks to its simplicity. It produced 16 FPS on average
while maintaining good accuracy, definitely making it
a solid choice for multiple object detection and tracking.
If you guys enjoyed this video,
please like, share, and subscribe.
Comment on whether you would use
DeepSORT for your own object tracking applications.
And if you want to learn how to implement
DeepSORT with the robust YOLOv4 model,
then click the link below
to enroll in our YOLOv4 PRO course.
Thank you for watching and we’ll see you in the next video.
