[MUSIC PLAYING]
FELIPE HOFFA: Analyzing
data is like cooking.
You pick your data sets, your
ingredients, you clean them,
you slice them.
And then you serve your results.
I'm Felipe Hoffa,
and in this video,
I'm going to travel
to Dresden, Germany,
to meet the team
fighting spam at LOVOO.
LOVOO is an online dating
app, and their engineers
use data to identify
and block spammers
trying to abuse their system.
In this episode,
we're going to learn
how they define features to
train a logistic regression
model to find who's a spammer.
And now they can do all
of this inside BigQuery.
And because we love
food, we are going
to do all this while cooking.
[MUSIC PLAYING]
Here we have Juan
de Dios Santos,
one of the Big Data engineers
fighting spam for LOVOO.
JUAN DE DIOS SANTOS:
It's likely so.
Hi, everyone.
My name is Juan, and I'm
one of the anti-spammers
here at LOVOO.
And my role here is to design
and to develop new techniques
and algorithms to detect the
spammer from our platform
and to avoid the
proliferation of them.
So today, since we're going
to be talking about spam,
I thought that the
most ideal thing would
be to cook something with spam.
So I'm going to be
preparing one typical dish
from my country,
Puerto Rico, which
is called spam guisada,
or just simply jamonilla.
FELIPE HOFFA: Ooh, I'm
getting hungry already.
[SALSA MUSIC]
JUAN DE DIOS SANTOS:
It's really important
to know what exactly spammer
and normal users are doing,
because of course they are going
to behave in different ways.
For example, spammers, they
do normally the same action
quite often.
And normal users-- well,
they just act normal.
FELIPE HOFFA: Yeah, how do
you decide what's normal
and what's not normal?
JUAN DE DIOS SANTOS:
So the tracer
has three different
kind of actions.
We have what is called the
active action, which are
the things the user is doing.
For example, I like you,
I send you one message.
And then we also have
the passive actions,
which are the things that
you are doing up on me.
You are sending me one
message, you are liking me.
And then we have the time, which
is how much time it has been
between each different action.
[SALSA MUSIC]
FELIPE HOFFA: You have your
whole system processing images,
processing events.
And you're also using BigQuery.
I would love to know
how you use BigQuery.
JUAN DE DIOS SANTOS: So
BigQuery-- we use BigQuery
mostly for storing our data.
You know, for storing our
historical data, prediction
data, training set and so on.
However, recently we
start to use BigQuery
to do machine learning on it.
[SILENCE]
FELIPE HOFFA: Can you show
me what are your tables?
How do you do machine
learning inside BigQuery?
JUAN DE DIOS SANTOS: All
right, let's get on it.
So first, here, we
have our data set.
And inside our data set,
we have one small group
of tables, which are
called the sample tables.
And in these sample
tables, I'm going
to keep just that-- just
small sample tiny versions
of our big data set.
So here, let's take a look
at what I call the Sheriff.
Sheriff is one of our
machine learning models.
And this one is going to
predict if you are a spammer
or not based on the
things that you are doing
and some of the characteristics
of your profiles.
And here we have two data sets.
We have the training
one and the testing one.
The training one, of course,
for training the system,
and the testing one
is for testing it.
FELIPE HOFFA: And in
the training data set,
you have all of your columns.
JUAN DE DIOS SANTOS: Yeah,
and here, each column
represent our features.
And, unfortunately, I
need to hide some of them.
So that's why you
only see numbers.
But, however, we can focus
on the one called message
sent, like, and likes created.
And this is saying just that--
these are the messages
that I sent to you,
these are the likes that
I have done to others,
and the people
that have liked me.
FELIPE HOFFA: OK, so this
table has, like, 200,000 rows.
What is each row?
JUAN DE DIOS SANTOS:
Yeah, like here,
each row represent one
observation, so one user.
FELIPE HOFFA: One user.
JUAN DE DIOS SANTOS: Yeah.
FELIPE HOFFA: OK, so, for each
user, just summarize features
that--
JUAN DE DIOS SANTOS: It's
like-- so here-- so like I say,
these data sets, this
training is actually
made of two different things.
The first one is,
what are you doing?
But we calculate this by looking
at the ratio of execution
of events.
Meaning that, of all the
events that you have done,
what is the percentage
of message sent?
And in this case, we have
one user, our user number 5,
and we can see that
the value is around 39%.
Meaning that of all the
event the person have done,
39% of them has
been message sent.
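As a rough sketch of the ratio features Juan describes here, the share of each event type can be computed from a list of a user's events. The function name, the event names, and the sample counts below are assumptions made for illustration; they are not LOVOO's actual schema or data.

```python
from collections import Counter


def event_ratios(events):
    """Return the share of each event type among all of a user's events."""
    counts = Counter(events)
    total = sum(counts.values())
    if total == 0:
        # A user with no events has no ratios to report.
        return {}
    return {event: count / total for event, count in counts.items()}


# Hypothetical event log for one user (event names are made up).
sample_events = (
    ["message_sent"] * 39 + ["like_created"] * 41 + ["profile_view"] * 20
)
ratios = event_ratios(sample_events)
print(f"message_sent: {ratios['message_sent']:.0%}")  # 39 of 100 events -> 39%
```

Using ratios rather than raw counts keeps very active and barely active users on the same scale, which is presumably why the features are expressed as percentages.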
FELIPE HOFFA: Nice.
JUAN DE DIOS SANTOS: Got it?
FELIPE HOFFA: I got it.
JUAN DE DIOS SANTOS: Perfect.
FELIPE HOFFA: And you
determined these features--
like, experimenting?
JUAN DE DIOS SANTOS:
It's like it.
I mean, you try with
this, you try with that.
You lose some and you gain some.
You know, just a matter
of experimentation.
[SALSA MUSIC]
FELIPE HOFFA: How
do you train this?
How do you determine--
JUAN DE DIOS SANTOS:
How do I train this.
Well, we have our
features-- or, I
mean, the BigQuery
feature of how
to do machine learning
inside BigQuery.
And first, we need to go to
our query editor, of course.
And then we are going
to execute our query.
And here we have this
kind of simple query,
and it's called
Create or Replace.
And this is going
to do exactly that.
It's going to create
one model, or in case
the model already exists,
it's going to replace it.
FELIPE HOFFA: Cool.
What are your options here?
JUAN DE DIOS
SANTOS: My options--
so the first one, the
most important one,
is our model type, which is what
kind of model I want to train.
In this case, I want to
use logistic regression.
FELIPE HOFFA:
Logistic regression,
that allows you to--
JUAN DE DIOS SANTOS: To
do binary classification.
Either 0 or 1, true or
false, spammer or hammer.
FELIPE HOFFA: I see,
spammer or hammer.
JUAN DE DIOS SANTOS: Hammer--
oh, yeah, that's how we call it.
FELIPE HOFFA: And, with your--
that's where you
say, I want to get--
to predict this column, spam.
JUAN DE DIOS SANTOS:
It's likely the other.
The second one is
input label columns.
And here we specify which is
the column that is our target--
in this case, going to be spam.
FELIPE HOFFA: And we also
have some options there--
additional--
JUAN DE DIOS SANTOS: Exactly.
So I also have max iteration.
And this one is set to 50.
And this means that
my system is going
to train for 50 iterations.
Normally, in this case, it's
no long-- that's not necessary.
Because I'm really truly kind
of [INAUDIBLE] on five or six.
However, for the next
part, I want to show you
how the loss function decrease.
And for that, I'm
setting this to 50
so we can have more data so
we can see a prettier curve.
Also, for that, we
need to set early stop
to false-- really important.
Otherwise, it's just
going to stop early.
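The query Juan walks through is not shown in the transcript, so the following is only a sketch of what such a statement might look like, assembled as a string in Python. The options `model_type`, `input_label_cols`, `max_iterations`, and `early_stop` are BigQuery ML options matching the ones named above; the model name, table name, and the helper function itself are hypothetical.

```python
def build_create_model_sql(model_name, training_table, label_column,
                           max_iterations=50, early_stop=False):
    """Build a BigQuery ML CREATE OR REPLACE MODEL statement as a string."""
    early_stop_sql = "TRUE" if early_stop else "FALSE"
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        "OPTIONS (\n"
        "  model_type = 'logistic_reg',\n"
        f"  input_label_cols = ['{label_column}'],\n"
        f"  max_iterations = {max_iterations},\n"
        f"  early_stop = {early_stop_sql}\n"
        ") AS\n"
        f"SELECT * FROM `{training_table}`"
    )


# Hypothetical model and table names; only the label column 'spam'
# comes from the conversation.
sql = build_create_model_sql(
    "samples.sheriff_model", "samples.sheriff_training", "spam"
)
print(sql)
```

Note that `SELECT *` treats every non-label column as a feature, so in a real table an identifier column would probably need to be excluded first.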
FELIPE HOFFA: Cool, so let's
start training this model.
[SALSA MUSIC]
We are getting some results.
JUAN DE DIOS SANTOS:
At least, so far,
since we used max
iteration of 50.
We are currently on
iteration number 41.
As you can see, it's taking
around 6 to 7 seconds
to actually-- for
each iteration.
So it's quite fast.
However, we can realize here
that the training loss is not
changing much.
However, like I said before,
I was expecting this.
I just want to have
50 iterations just
to see how the loss
function looks like.
Because I just like
to visualize things.
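To illustrate why early stopping would cut a run like this short, here is a simplified sketch of a relative-improvement stopping rule. It is not BigQuery's actual implementation; the function name, the 1% threshold, and the loss values are all assumptions chosen for illustration and are not the numbers from the video.

```python
def stopping_iteration(losses, min_rel_progress=0.01):
    """Return the 1-based iteration at which a relative-improvement rule stops.

    Training stops at the first iteration whose loss improves on the previous
    one by less than `min_rel_progress` (as a fraction of the previous loss).
    If that never happens, the last iteration is returned.
    Assumes every loss value is positive.
    """
    for i in range(1, len(losses)):
        previous, current = losses[i - 1], losses[i]
        rel_progress = (previous - current) / previous
        if rel_progress < min_rel_progress:
            return i + 1
    return len(losses)


# Made-up loss curve that flattens out after a handful of iterations.
illustrative_losses = [0.60, 0.30, 0.20, 0.16, 0.15, 0.1495, 0.1493]
print(stopping_iteration(illustrative_losses))  # 6
```

With a rule like this, a curve that flattens after five or six iterations never reaches iteration 50, which is why early stopping has to be switched off to plot the full curve.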
FELIPE HOFFA: And how
would you visualize this?
JUAN DE DIOS SANTOS: Well,
let's do it in Data Studio.
One of my favorite
features, let's say,
of BigQuery is that we have
this cool button here--
really powerful button--
which is called
Explore in Data Studio.
And when you click there, this
is going to kind of migrate--
to take your whole data set.
And it's going to move it
to a new Data Studio window.
And here, in Data Studio, then
we can visualize our data.
FELIPE HOFFA: Cool.
So you can just now
select from wherever
the training info [INAUDIBLE]?
JUAN DE DIOS SANTOS: Exactly.
Yeah, yeah.
FELIPE HOFFA: And then
we can get this data out
into Data Studio.
JUAN DE DIOS SANTOS: Exactly.
And here, this is how it looks.
Let me first-- we need
to move a lot of buttons
here and drag and drop.
But essentially,
this is how our--
FELIPE HOFFA: On the
y-axis, we have the loss?
JUAN DE DIOS SANTOS:
Exactly, we have the loss.
And then on the x-axis we
have the number of iteration.
And here, like expected, once
again in around iteration five
or four, the values
was already kind
of starting to get more stable.
FELIPE HOFFA:
Pretty low, and then
you get to a minimum of 0.0--
JUAN DE DIOS SANTOS: Yes,
like it's really, really low.
It's between 0.1 and 0.2.
[LAUGHTER]
Maybe like 0.15, 0.14.
FELIPE HOFFA: That's pretty
good and pretty fast.
Oh, oh, oh, by the
way-- now you get
these visualizations
automatically inside BigQuery
once the training is done.
And we can also look at
the stats of this model.
Or where the
training statistics--
JUAN DE DIOS SANTOS: Yeah,
this is actually pretty cool.
So you know that, to make sure--
I mean, if we want to know
if the model is either good
or not, then we need
to have some metrics.
And here, we have all
the five or six metrics,
just out of the box.
We don't have to
calculate anything.
And for example, we can see
precision, recall, accuracy,
F1 score, the log function--
I mean, the loss function
that we just saw--
and the area under the curve.
And by just looking
at this value,
I will say that
it's pretty good.
We have an accuracy
of almost 96%,
while our F1 score is almost 92
and the area under the curve--
I want to say it's
really excellent.
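For readers who want to see where such metrics come from, here is a minimal sketch that derives precision, recall, accuracy, and F1 score from confusion-matrix counts. The counts are invented for illustration and are not LOVOO's results; the function name is likewise hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute basic binary-classification metrics from confusion counts.

    tp/fp: users flagged as spammers, correctly/incorrectly.
    tn/fn: users flagged as normal, correctly/incorrectly.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1_score": f1,
    }


# Made-up counts for a test set of 1,000 users.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When spammers are a small minority, accuracy alone can look good even for a weak model, so precision, recall, and F1 are the more telling numbers.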
FELIPE HOFFA: I like it.
And once you have your
whole model trained,
what's behind the model?
JUAN DE DIOS SANTOS:
What's behind the model.
So in this case, since we're
using logistic regression,
basically we have a bunch of
things we call the weights,
or the coefficients.
Which, if I explain
quick and easy,
logistic regression is basically
one huge mathematical equation
with a lot of different values.
And these values are
actually what we're
going to learn during training.
So at the end of the day,
all these fancy thing
just transforms into
a couple of numbers--
our coefficients
and the intercept.
FELIPE HOFFA: And with
this SQL function,
we can get our weights out.
JUAN DE DIOS SANTOS:
Actually, good question, yes.
If we would like to
implement this in production
into another system,
then we just only
need to export all
of these values.
And then we should build
the equation ourself.
And then just by
plugging in the values.
And then we have our model.
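A minimal sketch of "building the equation ourselves" from exported coefficients might look like the following. The feature names, weights, and intercept are hypothetical values chosen for illustration, and the sketch ignores any feature preprocessing the real model might apply.

```python
import math


def predict_spam_probability(features, weights, intercept):
    """Apply the logistic regression equation to one user's features.

    z = intercept + sum(weight_i * feature_i), probability = 1 / (1 + e^-z).
    Features missing from `features` are treated as 0.
    """
    z = intercept + sum(
        weight * features.get(name, 0.0) for name, weight in weights.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical exported coefficients and one user's feature values.
weights = {"message_sent_ratio": 4.0, "likes_created_ratio": -1.0}
intercept = -2.0
user = {"message_sent_ratio": 0.39, "likes_created_ratio": 0.41}

probability = predict_spam_probability(user, weights, intercept)
print(f"spam probability: {probability:.3f}")
```

A threshold on the probability (0.5 is a common default) then turns the score into the final spammer-or-not decision.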
FELIPE HOFFA: Well, so
this is really great
because, basically, you gave
a query-- this is my table,
these are my labels.
JUAN DE DIOS SANTOS: Exactly.
FELIPE HOFFA: And just
ran a logistic regression.
JUAN DE DIOS
SANTOS: Yeah, right.
FELIPE HOFFA: And you got a lot
of these just automatically.
And that's one way in which
LOVOO identifies spam--
online dating spam.
Now, with BigQuery,
it's really easy
to start iterating the features
and the training of your model
without leaving
the data warehouse.
But dinner is not ready
until we serve it.
How do we know how
well the anti-spam team
is doing at LOVOO?
Let's go back to the kitchen.
JUAN DE DIOS SANTOS:
Serving here in this case
will be, of course, to serve
the model in production.
However, more than that,
I need to make sure
how the model is performing.
You know, I want to know
how our results are.
FELIPE HOFFA: Exactly.
And how are you doing so far?
JUAN DE DIOS SANTOS:
How are we doing this.
So we have some numbers here.
These are from our
transparency reports.
And, basically, this is
some report that we do,
where we publish--
we make public all
these values so people
can have an idea of how
our system is performing.
And in the third
quarter of 2016,
our percentage of
spammers was only 0.3%.
Meaning that, of all our users,
we only have 0.3% of them
were actually spammers.
It's less than 1%.
And the number is quite stable.
As we can see here, it's
like 0.3, 0.2, and so on.
FELIPE HOFFA: And I guess
the number being stable
is a huge accomplishment.
JUAN DE DIOS SANTOS: It's
good because it means--
it's never going to
drop to 0, of course.
But it's actually, if
the number keeps stable,
if the number doesn't grow, it
means we're doing a good job.
Or so I think.
FELIPE HOFFA: I guess
the spammers keep finding
new ways to exploit the system?
JUAN DE DIOS SANTOS:
Yes, it's actually--
it's an eternal fight.
Like if I sleep--
if I get sloppy or anything,
then they will just go ahead.
FELIPE HOFFA: But yes, you
are keeping them on the edge.
JUAN DE DIOS SANTOS:
And another metric
that I really like
to see is the one
that says the time it
takes to punish someone.
Because, of course,
we are real time
and we want to get
the person now.
So at the beginning, again
in the third quarter of 2016,
we had around 2.2 hours.
After that, the
number went down.
And then after that,
we had a huge drop
where it went from 2.1 to 1.1.
It was like an hour less.
And then suddenly,
something that happens-- you
know, like new
spammer ways or so on.
And then the number increases.
And then after that, it's
going down, down again.
FELIPE HOFFA: Yeah, that's
a huge drop at the end of--
JUAN DE DIOS
SANTOS: It is, yeah.
FELIPE HOFFA: --how
long it takes.
JUAN DE DIOS SANTOS:
I'm really proud of it.
FELIPE HOFFA: Well
done with that.
JUAN DE DIOS SANTOS: Thank you.
FELIPE HOFFA: So, as we serve
our results in a dashboard
and we publish how
good you have been
doing throughout these years--
and this is basically
the end of our process.
JUAN DE DIOS SANTOS:
That's right.
FELIPE HOFFA: Once we
have everything running,
you can visualize it.
You can show everyone
how well you're doing.
And once your dish is
ready, you can serve it.
JUAN DE DIOS SANTOS: Like now.
So here you go.
FELIPE HOFFA: Thank you.
JUAN DE DIOS SANTOS:
I hope you enjoy it.
FELIPE HOFFA: I can
find some spam here.
And I'm glad that you taught
me how to find the spam.
JUAN DE DIOS SANTOS: Likewise,
I'm glad I had this opportunity
to talk to you about anti-spam,
and BigQuery, machine learning,
and most importantly, about
how we fight our spammers.
FELIPE HOFFA: And
that was really cool.
Thanks a lot.
JUAN DE DIOS SANTOS:
Thanks to you.
And enjoy.
So now I'll do the same.
[SALSA MUSIC]
FELIPE HOFFA: I had so
much fun making this video.
Thanks a lot to LOVOO,
to Juan de Dios,
and to you for watching.
Should I do more of this?
Please leave me a
message and let me know.
What would you like to cook?
Thank you very much.
I'm Felipe Hoffa, and
remember to subscribe.
[SALSA MUSIC]
