"Hello everyone. Ken here back with
another video for you."
"Today I have a special announcement..."
Oh boy! It's Ken's new video,
just in time for my five minutes of
sitting down and doing nothing.
Oh, let me get started on that first, actually.
[Turns off YouTube, Sits in Silence]
Nothing at all...
Five minutes...
Well, actually, let me just get started on
Ken's video.
Ken: "I'm starting to learn data
science over
from the very beginning. I'll be doing my
best to spend about 70 percent of my
time
applying this knowledge through projects
and practice, and about 30 percent
of the time on theory. Hopefully, seeing
me start again
from the beginning will help inspire you
to pick it up as well.
I think this will be a fun experiment,
and I really hope that you guys
participate in it with me."
Ken, that sounds like a call to action!
Luckily this guy
loves theoretical. Leave it to me!
[Snazzy Intro]
I already spilled the tea.
How's it going, guys?
Welcome to DataLeap. This is a
location where I
want to bring data science down to earth
for you. If you're new here,
it's a pleasure to meet you. My name is
Andrew from DataLeap.
And if you're not new here, welcome back.
Today we are going to have another
Dis or Data.
We are going to explain exactly how
to
isolate outliers.
Oh, hold on, don't
leave, don't leave. Hold on, it's not just
theory.
We're going to have dogs and cats,
puppers and
meowmers. We're going to have a very
interesting, engaging understanding of
two very different algorithms that are
used
in industry to solve million-dollar
problems,
and you are going to understand them. If
you were going to spend five minutes a
day just sitting around, or
maybe doing some crafts... like, I have a
hedgehog collection. If you want to see
that hedgehog collection,
leave a comment down below. So what are
we going to do today? We are going to
figure out exactly
how large enterprise FAANG companies
in Silicon Valley, doing lots of data
science,
lots of advanced modeling and machine
learning,
solve outlier detection:
how they are able to break into a
fraudulent transaction,
how they are able to understand what is
not supposed to be there,
and look into the unlockable.
So in today's Dis or Data we're going to
rate these two models on
two factors. What Ken is trying to do
with the 66 Days of Data is make sure
that you have
five minutes a day, just five, to learn a
little bit of data science
every single day. That's enough time to
do nothing. That's how much time you're
giving up.
You're giving up the opportunity cost of
just sitting there for five minutes
waiting for
a microwave burrito. I think that's
worth it.
Thank you, Ken, for the initiative, and I
will try my best to
help you follow along and stick to the
schedule myself.
That's all thanks to 66 Days of Data
Science with Ken Jee. If you want to be
part
of an exclusive Discord page with Ken Jee
and a lot of like-minded data science
learners like yourself, ask me
and I will add you to the Discord with
the blessing
of Ken Jee himself. All right, once more
into the fray. Let's explain how these
two algorithms
differ on two major metrics. The two
ways that we're going to rate
these two algorithms are, one, trainability:
how well
do they train, how quickly, and with what
size data sets;
and, two, explainability, which is going to be
judged by you.
How well can you understand these two
concepts, five minutes each,
here today in Dis or
Data? The first algorithm we're going to
explain
is the random isolation forest. The
creators of this animation
have done an amazing job. I will link
them down below, but
notice how it's still a little
complicated. You are slicing and dicing
some sort of purple and green dots
with lines, and
it's confusing. But hey,
you're gonna be able to understand this.
I'm gonna do the exact same animation,
but
I'm gonna use furry animals.
Like I promised, there are some puppers,
there are some
meowmers. We have cats and dogs. Well, in
this example
a cat is a fraudulent transaction,
something that is
not supposed to happen or something that
is outside of the normal. Let's say that
you are trying to manufacture
planes, and if a plane is anything
outside
of the normal distribution of randomness,
you are going to have a lot of trouble.
A lot of lives are in your hands. How do
you
use a model to safeguard your reputation
and your work?
Let's see how the random isolation
forest handles it. So right off the bat,
how would you be able to tell,
like explain to a robot, which one is the
cat
versus all the dogs? You could possibly
say, well, it's the one
with the, you know, the triangle nose,
or you can say it's the one with
blue eyes, right? These are obvious
things that are different in this
specific example.
We want to be able to build a model that
is able to explain away
a lot of possible randomness.
not every cat is going to look like this
cat so
we are going to be able to feed in a
bunch of different
features that's what we are able to
measure
over a different series of metrics of
each picture and then plot them let me
give you an example
of what we're talking about let's have
just two features we're gonna have
the amount of indoor versus outdoor of
the picture and
how large the ears are in proportion to
the bodies
and the dog in yellow that's the dog
that we're gonna try
to find him with this
classifier with this model so the way
that the random isolation force works
is it randomly cut the coordinate chart
and see if it has successfully isolated
the data point that we want
to isolate right now is that dog
outlined in yellow
So let's start cutting. The first cut,
that was random. I have no idea where
that cut was going to happen,
but there it is. That cut did not
separate
the dog yet, so we have to cut again.
Okay, that was random too. I can't really
predict where these things are gonna
happen.
So this cut did not separate the dog
either.
Let's keep cutting. Well, that was really
close to the first cut.
Once again, random. Can never predict them.
Okay, now we're getting a little closer,
but this dog is still sharing
this isolated space with the Dalmatian
over there,
so we're not quite there yet. Let's keep
cutting.
Okay, they're still sharing the same
space.
Wow, that was way off, but randomness, you
know.
Okay, still not on target. Let's see if we
can separate these two.
We did it!
Yeah. So
now we know how many cuts it took on this
one random try to find this dog. We do it
multiple times
and then average the counts into what we call an
isolation number.
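The cutting game above can be sketched in a few lines of Python. This is a minimal sketch, not the animation's code: the dog coordinates are made up, the names `cuts_to_isolate` and `isolation_number` are hypothetical, and it assumes each new cut is made only inside the region that still holds the target, which is how isolation trees usually count cuts.

```python
import random

def cuts_to_isolate(points, target, rng):
    """Count random axis-aligned cuts until `target` is alone in its region."""
    region = list(dict.fromkeys(points))   # drop exact duplicates
    cuts = 0
    while len(region) > 1:
        axis = rng.randrange(len(target))  # pick a random feature
        values = [p[axis] for p in region]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue                       # nothing to cut on this feature
        split = rng.uniform(lo, hi)        # random cut position
        keep_low = target[axis] < split    # which side the target is on
        region = [p for p in region if (p[axis] < split) == keep_low]
        cuts += 1
    return cuts

def isolation_number(points, target, tries=500, seed=0):
    """Average the cut count over many independent random tries."""
    rng = random.Random(seed)
    total = sum(cuts_to_isolate(points, target, rng) for _ in range(tries))
    return total / tries

# Made-up (indoor-vs-outdoor, ear-size) values for a pack of dogs.
dogs = [(0.42, 0.50), (0.48, 0.55), (0.51, 0.47), (0.55, 0.52),
        (0.46, 0.44), (0.53, 0.58), (0.49, 0.51), (0.57, 0.46)]
yellow_dog = dogs[6]

print(isolation_number(dogs, yellow_dog))
```

A single try is noisy, which is why the average over many tries is the number worth looking at.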
So in this example, let's say eight cuts
ends up being
the average number of cuts it takes to
get to
this dog. Well, how many cuts would it
take
to get to our outlier, the cat? Let's have
two different features,
just so that I can explain how every
single feature is being considered when
we cut.
When you cut a two-dimensional set of
features, you only need a straight line,
like the ones we've been using.
If you cut a three-dimensional set of
features, then you need an entire plane,
a sheet of paper. Essentially, a wall
would be able to separate
a room. But if you have a four-,
five-, or six-dimensional set of
features,
you need an (n minus one)-dimensional
plane, a hyperplane let's say, to cut
through
that set of features. It's hard for our
minds to comprehend.
But let's keep going with two more
features. Now we've covered
four different features to separate cats
from dogs.
So in this example we have big eyes
versus small eyes, and mouth open versus
mouth closed.
And so we're just judging, you know, maybe
dogs have their mouths open more often,
or maybe
they have smaller eyes in comparison
to the cat.
We see that little puppy over here
in the top left corner
has pretty big eyes, but its mouth is
wide open. So luckily a lot of different
features
will be able to help us better
understand the outlier cat.
Okay, let's try to get this cat cut out
of
the chart, isolate this cat. Let's use, for
the sake of argument, the exact same cuts,
the random cuts that we used last time.
So here's that first cut.
Okay, here's that second cut.
We did it, huh? So
this is using the theory that we need
fewer cuts on
average in order to isolate
outliers, since they are so
far away from the general population
of the inliers. Everyone else is
clustered closer together.
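To see the "fewer cuts for outliers" idea in numbers, here is a small sketch under stated assumptions: the feature values are invented, `average_cuts` is a hypothetical helper, and each cut is restricted to the region that still contains the animal being isolated.

```python
import random

def average_cuts(points, target, tries=1000, seed=42):
    """Average number of random cuts needed to leave `target` alone."""
    rng = random.Random(seed)
    total = 0
    for _ in range(tries):
        region = list(dict.fromkeys(points))   # drop exact duplicates
        while len(region) > 1:
            axis = rng.randrange(len(target))  # random feature
            lo = min(p[axis] for p in region)
            hi = max(p[axis] for p in region)
            if lo == hi:
                continue                       # no spread on this feature
            split = rng.uniform(lo, hi)        # random cut position
            same_side = target[axis] < split
            region = [p for p in region if (p[axis] < split) == same_side]
            total += 1
    return total / tries

# Invented (eye-size, mouth-openness) values: dogs cluster, the cat sits apart.
dogs = [(0.30, 0.80), (0.35, 0.75), (0.28, 0.85), (0.33, 0.78),
        (0.38, 0.82), (0.31, 0.72), (0.36, 0.88), (0.29, 0.77)]
cat = (0.90, 0.10)
animals = dogs + [cat]

cat_cuts = average_cuts(animals, cat)
dog_cuts = average_cuts(animals, dogs[3])
print(f"cat: {cat_cuts:.2f} cuts on average, dog: {dog_cuts:.2f}")
```

Because the cat sits far from the pack on both features, most random cuts land in the empty gap between it and the dogs, so its average comes out lower than that of a dog in the middle of the cluster.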
So what are the caveats to this
algorithm? Well,
there is a little something called data
drift. As you can see, not every single
cat or dog
looks like this, and if we start
introducing cultural changes, like
everyone starts dressing up their cats
differently or starts breeding different
kinds of dogs, then we might actually see
a problem with our algorithm. The problem
with this model
is: what if suddenly dogs start looking
like this
very, very often? And
what if we suddenly start having cats
that look like this?
Now, it is not impossible to use the same
algorithm we used,
at the very least using the features of
indoor/outdoor versus
ear size, and cut out that first cat
that we tried to cut out,
or try to separate this new cat out of
the population.
So that was our very first model. How
easy was that to explain? Well, let me
know in the comments below and I will
put a grade
up here for you depending on your
comment. Did you know that
YouTube's new feature... I can't do that,
but what I can do is comment on your
comment.
So if you want that to happen,
you know what to do. Let's move on to the
trainability. Now, this model
needs more data to be more precise.
The more data that a random isolation
forest gets, the better the model performs.
But it trains very quickly compared to
the one-class SVM.
The model trains 10 to 100 times faster
given the right parameters, which means
you can test different hyperparameters.
And
I will definitely go over what hyperparameters
are if you ask in the
comments below.
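For anyone who wants to try this at home, scikit-learn ships both models as `IsolationForest` and `OneClassSVM`. The sketch below is an illustration under assumptions, not the video's code: the data is synthetic, the helper `timed_fit` is hypothetical, and the hyperparameter values (`n_estimators`, `max_samples`, `nu`) are arbitrary picks to show what testing different hyperparameters can look like. Timings depend on the machine, so none are quoted here.

```python
import random
import time

from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = random.Random(0)
# Synthetic data: 2,000 "dogs" near the origin plus one far-away "cat".
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]
X.append([8.0, 8.0])

def timed_fit(model, data):
    """Fit `model` and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    model.fit(data)
    return time.perf_counter() - start

# Try a few forest sizes; n_estimators is the number of random trees.
for n_trees in (50, 100, 200):
    forest = IsolationForest(n_estimators=n_trees, max_samples=256,
                             random_state=0)
    seconds = timed_fit(forest, X)
    scores = forest.score_samples(X)   # lower score = more abnormal
    print(f"{n_trees} trees: fit in {seconds:.3f}s, cat score {scores[-1]:.3f}")

# One-class SVM on the same data, for a rough timing comparison.
svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01)
print(f"one-class SVM: fit in {timed_fit(svm, X):.3f}s")
```

Here `predict` returns -1 for points the forest treats as outliers and 1 for inliers, so `forest.predict([[8.0, 8.0]])` is a quick way to check whether the cat gets flagged.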
Let's move on to the one-class SVM. No,
we will not move on to the one-class SVM,
because
that is the first part of this Dis or
Data
dissertation. Let me know how you like
this format. I am trying to get this
video out
by a certain time tomorrow, because
analytically
it seems that you guys like to watch my
videos at that time.
And it seems only forty percent of you
are subscribed.
Hey, do you not want corgi content?
He just took a bath and he smells.
Anyway, do you not want to become a data
scientist,
the best job according to this survey
that I'm about to present? I'm just
giving myself more editing work,
but that's okay, because if you like this
content,
I'm happy to make more. Please like,
comment, and
subscribe if you haven't. Ring that bell
button, slam
that bell button. Make sure to gently tap
the like button. There's a little bit of
give and take there.
We'll see you very soon.
What are you talking about?
