Sanyam Bhutani: Hey, this is
Sanyam Bhutani and you're
listening to "Chai Time Data
Science", a podcast for data
science enthusiasts, where I
interview practitioners,
researchers, and Kagglers about
their journey, experience, and
talk all things about data
science.
Hello, and welcome to another
episode of the "Chai Time Data
Science" show. In this episode,
I'm honored to be interviewing
Kaggle legend, double Kaggle
Grand Master and former number
one in the competitions ranking,
Gilberto Titericz, also known by
his username Giba on Kaggle. We
talk all about Giba's journey
into the field of data science
broadly speaking, and also on
Kaggle. Giba shares a lot of
great advice about Kaggle and
also connects the dots for us on
how it carries over to the real
world of data science. There is
plenty of advice about Kaggle,
about data science, and about
how to get your first break into
data science. Giba had a
background in electrical
engineering when he got started
on Kaggle, and he shares a very
honest and transparent view of
his journey into data science as
well as Kaggle, so I'm really
excited to be sharing this. A
quick note to
the listeners: this interview
will have manually checked
subtitles uploaded to YouTube.
So if you're a non-native
English speaker, please remember
to enable those for a better
experience. A blog post release
will follow later, and if you're
interested in that, you can find
the links to the website in the
description of this podcast. For
now, here's my interview with
Kaggle legend and former number
one, Giba. All about data
science and Kaggle. Please enjoy
the show.
Hi, everyone, I'm on the call
with the Kaggle legend, Grand
Master, Giba. Grand Master Giba,
thank you so much for taking the
time and joining me on the Chai
Time Data Science podcast.
Gilberto Titericz: Thank you.
Thank you, Sanyam. It's an
honor to be here today and talk
with you and everybody watching
and listening to this video.
Thank you, I really appreciate
it.
Sanyam Bhutani: It's really an
honor to have you on the show.
You're currently in the top four
of the competitions
leaderboard. Could you tell us
about your Kaggle journey, what
got you interested in competing
on Kaggle, and your path to
eventually becoming number one
on the leaderboard?
Gilberto Titericz: Yeah, it's a
long story. But sure. Actually,
I was always involved with
algorithms and coding in my
life, but my degree is in
electronics engineering. Right.
And I worked for around 10
years, no, more than that, I
worked for 15 years as an
engineer before the transition
to data science. Right. But
during that time, as an
engineer, I was involved with
coding and algorithms. Even my
master's was in a field related
to optimization algorithms. I
studied wireless communication
and turbo coding. That is the
correct name; it's a kind of
error-correcting code to correct
bits transmitted via wireless
communication. Right. So I was
always involved with some kind
of algorithms during my entire
life.
Sanyam Bhutani: I was reading
online that you were into
electrical circuits: you had
created your first circuit board
at the age of 10. I also saw a
video of you testing the Segway
that you made yourself.
Gilberto Titericz: Uh huh. Yeah,
that's true. I used to build my
circuit boards by myself when I
was a kid.
Sanyam Bhutani: Okay.
Gilberto Titericz: And then,
yeah, I built that Segway, also
around six or eight years ago,
using microcontrollers. Yeah, it
was very cool.
Sanyam Bhutani: And the coding,
and even the circuitry portion
of the project, everything was
done by you?
Gilberto Titericz: Yeah, yeah,
actually I made the circuit
board using a soldering iron,
you know, melting solder, right,
for the transistors. I used a
microcontroller, and I coded the
firmware of the microcontroller
also. It was pretty cool. I like
that. But actually, most of the
electronics experiments in my
life were with airplanes,
electric-powered planes.
Sanyam Bhutani: I have seen
videos of that also on your
channel, I think the clear
view??? cameras from the drones.
Gilberto Titericz: Exactly. I
have some videos on YouTube. I
built some onboard circuitry and
code for model airplanes. It was
my hobby for a long, long time.
Since electric motors appeared
on the market for electric
airplanes, I used to build
aeromodels. I like it, I like it
a lot.
Sanyam Bhutani: So how did you
Data Science-Sorry.
Gilberto Titericz: I still love
electronics, and it's a hobby.
But in 2016 I moved to data
science, and I stopped doing any
electronics as a hobby. I still
want to do some electronics as a
hobby in the future.
Sanyam Bhutani: Hehe. Okay,
maybe, maybe we'll build a robot
in the future.
Gilberto Titericz: Yeah.
Sanyam Bhutani: How did data
science start to come into the
picture for you? All of these
years after working in the
electronics domain?
Gilberto Titericz: Yes, sure.
Doing electronics, it's a matter
of coding nowadays, right.
Sanyam Bhutani: Yep.
Gilberto Titericz: So when I
figured out that doing machine
learning is something related to
that field, or at least that in
the Internet of Things there is
a way to mix everything, right,
electronics, hardware, software,
I started to get interested in
different kinds of algorithms.
And I found Kaggle in 2012 or
2011, I don't remember exactly.
But I loved it from the first
time.
Sanyam Bhutani: Okay.
Gilberto Titericz: Right. And
when I joined Kaggle in 2012, I
started two competitions at the
same time. There were two
competitions running and I
joined both. They were about
time series; one of them was a
wind forecast challenge. And I
realized that I had to learn a
lot to compete in that kind of
competition. My previous
knowledge about software had
nothing to do with this kind of
competition, right. This was
data science in a time series
competition. So at that time, my
knowledge about data science was
based on my knowledge of MATLAB.
MATLAB, I believe you know it,
is a paid tool mostly used by
universities and students. So I
started to use MATLAB on Kaggle,
especially the MATLAB neural
network toolbox. I used it in
those first two competitions,
and I was a bit lucky: I got a
top 3 and a top 11 in my first
two competitions in Kaggle.
Sanyam Bhutani: Yeah.
Gilberto Titericz: Yeah. And at
that time, it motivated me a lot
to continue learning that
different field. Right. So it
was a bit of luck. But I also
spent a lot of time on those two
competitions, and on Kaggle,
because I found it very
interesting. And yeah, I got a
top 3 and a top 11, and at that
time it was enough to become a
Master in Kaggle.
Sanyam Bhutani: Okay.
Gilberto Titericz: And at that
time Master was the top ranking
in Kaggle, right. So;
Sanyam Bhutani: For the
audience, the Grand Master title
was introduced much later, after
Giba had started. That's a
recent addition to Kaggle.
Gilberto Titericz: Exactly. So I
got the Master title in Kaggle
really soon, and it motivated me
a lot to continue. And I enjoyed
the competitions because Kaggle
has that feeling, that
competition environment, and at
the same time you are learning
and reading forums and kernels.
It's pretty addictive. Right.
You know.
Sanyam Bhutani: How has your
journey evolved from your first
gold medal from seven years ago
to now 47 gold medals. I'm not
even going to count the silver
and bronze. How have your
approach and your views on
Kaggle evolved over the years?
Gilberto Titericz: Yeah, I
believe my approach changed over
time, because my skills changed
over time, my knowledge and even
the tools changed over time,
right. As I told you, I started
with MATLAB, then I moved to R.
Sanyam Bhutani: There was no TF,
no scikit-learn, I believe, back
in 2012.
Gilberto Titericz: Yeah, yeah,
it was at the beginning of these
tools. Right. But yeah, I
started with MATLAB, then moved
to R, then Python. But my first
approach at the beginning of my
journey in Kaggle was very
naive. Actually, I started to
learn by myself in these
competitions, right? I used a
lot of Google search to learn at
that time. And my first
approaches, I can say, were
pretty naive. I actually started
thinking about how data science
and machine learning work by
myself. I tried to define my own
hypotheses, test the hypotheses,
and find my errors. So I found
that building a good CV,
cross-validation, strategy is
essential to win the
competition. I learned from my
errors that trusting the public
leaderboard is a mistake.
Sanyam Bhutani: Multiple problem
multiple times.
Gilberto Titericz: Yeah, yeah,
but my approaches improved over
time. In the beginning I used to
code everything for every new
competition. So I started from
zero for every new competition.
Today I have some libraries I
use when I start a competition,
but at the same time, most of my
code right now I keep
customizing to each specific
competition. It's hard to build
generic code that we can use in
all the competitions, right?
Sanyam Bhutani: Yeah.
Gilberto Titericz: Code or
pipeline. It's hard to have it,
especially for Kaggle, where the
competitions can come from
different areas, different
fields. Maybe if you are
focusing on a specific field,
you can have some code that you
can reuse in all the
competitions, but not much in
Kaggle. Maybe it's easier to do
that in a company, where you can
follow the same way of thinking
every day. Yeah, but my approach
improved over time. Right now I
think I have a good approach
when I start a competition.
Sanyam Bhutani: Talking of
domain-specific competitions, do
you think it's absolutely
necessary to have business
knowledge or domain knowledge
for such competitions, or can
you even win them without having
the knowledge, let's say in a
physics competition, for
example?
Gilberto Titericz: Yeah, no, I
don't think you must have
specific knowledge to win the
competition. But of course
that's not 100% of the time.
Right. So I think Kaggle proved
to us that anyone can win any
competition. Right? I am an
example. I won many competitions
in areas that I had no idea
existed before the competition.
I won, for example, some medical
competitions about cancer,
rental prices in Australia, some
risk competitions. Many
different kinds of competitions.
I had no idea before, and I
joined the competition, I
studied a lot about the topic, I
ran hundreds of experiments, I
found better ways to improve the
accuracy for that specific area.
So, yeah, I believe if you have
expertise in a specific field,
you have an advantage over the
other people in the competition,
but it's not something, yeah,
necessary. It's better: you can
skip a lot of steps in that
competition if you have previous
experience or expertise in the
topic, but it's not absolutely
necessary. That's also the idea
of Kaggle, right: Kaggle shows
the data set and the problem to
everyone and gets the best
solution from the challenge.
That's the idea.
Sanyam Bhutani: So coming to
your journey, you became an
active Kaggler, and I was
reading that no one at your
workplace even knew that you
were number one before you told
them you were shifting to data
science. Most people try to use
Kaggle to transition into the
industry. How was the transition
for you? And did you have to
learn anything outside of Kaggle
after you started working as a
data scientist?
Gilberto Titericz: Yeah, that
transition is not easy. And it's
not fast, right? Actually, that
transition took four years for
me. So during four years, I was
learning data science, I was
learning machine learning, on
Kaggle, using the internet. And
it took four years for me to
change my title from engineer to
data scientist, right. So yeah,
I usually tell people: don't
expect it to be quick, expect to
have a lot of trouble during the
transition, and expect it to be
a slow transition. If it happens
too fast, maybe something is
wrong, or maybe you are not
prepared yet. Data science is an
area with lots of different
topics. It's hard to have full
knowledge of all these different
fields and topics of the area.
So it's good to learn one at a
time, one by one, every
different area of data science
or machine learning. Especially
in Kaggle, you learn a lot of
machine learning. But when you
look outside, at industry, at
companies, you see there's a
broader vision of what data
science is. There are other
areas that are not present in
Kaggle, right. And especially
because in companies you are
challenged every day to build a
Kaggle competition every day.
For example, in my company,
usually you have a data set that
changes every day, every hour.
In Kaggle the data set is fixed
for the entire competition,
right.
Sanyam Bhutani: Yeah.
Gilberto Titericz: So, yeah, the
challenges we face in companies
outside Kaggle are especially
related to data preparation,
data acquisition, problem
formulation, and all the aspects
of building the problem, the
challenge, right? Kaggle is more
about how to solve it using
machine learning.
Sanyam Bhutani: Yeah.
Gilberto Titericz: And during
that transition to data science,
you find that most of the things
that you learn on Kaggle you can
still use outside Kaggle, but
you also find that you don't
know some other areas present in
the daily life of a company,
right, like the ones that I
mentioned. Also, for example,
model deployment is something
not explored in Kaggle, but it's
something very common when you
join a company. And there are
many ways to do that, and many
ways to fail doing that. And,
yeah, so that's why I say, when
you're transitioning to the data
science field, try to do it
slowly, not skipping steps. Try
to learn all the aspects of
Kaggle, and try to understand
the difference between Kaggle
and real-life problems, to make
that transition as smooth as
possible. That is my
recommendation.
Sanyam Bhutani: That's great
advice. Coming to your current
job, you're lead data scientist
at Ople. Could you tell us what
tasks you are working on? Is it
somewhat related to all of the
vast Kaggle experience that
you've gathered over the years?
Gilberto Titericz: Yeah, it is.
It's all related to my Kaggle
machine learning experience.
Ople is building an automated
machine learning product, right,
for everyone. And to automate
machine learning there are a lot
of aspects, a lot of areas, and
a lot of algorithms that make it
possible, right? And having that
experience in Kaggle helps me to
understand the machine learning
algorithms better and provide
better ways to build that AutoML
system. I can't tell exactly
what we are working on now. But
I can tell we are covering all
the aspects of AutoML: data
preparation, data preprocessing,
all the machine learning
algorithms available,
ensembling, stacking,
hyperparameter tuning, model
interpretation, deployment of
models. So these are all fields
that I've been working on at
Ople, trying to make it as good
as possible for the end user,
and all my Kaggle experience can
be used to improve that kind of
machine learning product.
Especially because I have the
skill, or at least I improved my
skills on Kaggle, right. And I
explored the machine learning
algorithms as much as I could in
Kaggle in order to improve the
accuracy of the models. So that
kind of experience, like tuning
hyperparameters, using the best
model possible for each
solution, setting a good
cross-validation strategy, all
this experience from Kaggle can
be used in our AutoML products.
Sanyam Bhutani: Coming to your
active Kaggle days, you're still
very active, but when you were a
hyperactive Kaggler, how did you
balance your time between Kaggle
and family? And you were working
in a completely different
industry. Did you even sleep
during that period? That's one
question that everyone has.
Gilberto Titericz: Yeah, back at
the beginning of my journey in
Kaggle, I had two babies and a
regular job in a different area.
So it was pretty hard to balance
Kaggle time, job and family
time, right. It was not easy for
me at all. Actually, I used to
work all day, then get home and
take care of the children. And
only after the children were
sleeping, I started Kaggle.
Sanyam Bhutani: Okay.
Gilberto Titericz: And I used to
spend three to four hours per
day on Kaggle at that time. And
of course, I used to stay up
very late at night, so I didn't
sleep much during that period.
Maybe I spent 20 to 25 hours per
week on Kaggle during that time.
So it's a lot of time if you
have a regular job and babies at
home.
Sanyam Bhutani: Yeah.
Gilberto Titericz: And there's
no secret or magic to do that.
It's a lot of dedication, right?
You must dedicate yourself to it
if it's really what you want.
There's no secret sauce. But
right now, I'm not spending that
amount of time on Kaggle
anymore, right. I think right
now my job and my family are at
the top. So if there's some time
left, I go for Kaggle or maybe
some video games.
Sanyam Bhutani: Hehe.
Gilberto Titericz: Yeah, but
right now my life is different.
It used to be crazy six years
ago, like also spending the
weekends doing Kaggle. But one
strategy to do Kaggle is to let
your computer work for you,
right. The computer can work 24
hours per day; you can't, right.
So try to think about
experimentation, try to set up
the experiments and leave the
computer working on them,
because most machine learning
algorithms take a lot of time to
run. If you can manage to set up
experiments and leave the
computer running during the
night, it's a big advantage for
you in the competition, right?
So don't lose much time. So,
yeah. That's my advice.
Sanyam Bhutani: As you said, it
takes a lot of dedication. I'm
sure it takes a lot of talent as
well. You have had amazing
finishes in multiple
competitions. If you had to pick
your favorite challenge, or your
favorite battle story if I may,
of one competition, what would
that be?
Gilberto Titericz: Yeah, I have
many favorites.
Sanyam Bhutani: Maybe, maybe
your top three or top?
Gilberto Titericz: Yeah, I
believe my favorite competition
was the one about the leakage,
the Santander one, the Santander
Value Prediction Challenge. It's
about the leakage that I
disclosed on the forum, and I
eventually won that competition.
I think this is my favorite one.
But I also like the Otto Group
Product Classification
competition. It was before I was
number one. But it was the first
competition where I did a really
huge stacking and ensembling
model. It was a multi-class
classification. And it was my
first try at building a huge
stacking ensemble in a solution.
So I like these two: Santander,
because I disclosed the leak and
then won, and Otto, because I
used a very huge stacking to
win. But there are other ones I
love. My first two, about wind
forecasting and energy
forecasting, where I got good
placements. That was very good
because I used MATLAB and I
finished ahead of some people
more experienced than me. I have
many favourites, but if I have
to choose, I would choose the
Santander one, because I spent a
lot of time and disclosed the
leak, so everyone was in the
same place. And then, of course,
it took a bit of luck, but also
a bit of experience and hard
work to win the competition. And
we managed to win the
competition, which was pretty
cool.
Sanyam Bhutani: Yeah. There's a
question from the AMA section.
Carlos asks: you're well known
for spotting target leakage in
competition data. Do you have
any specific techniques, or do
the approaches change for every
competition?
Gilberto Titericz: No, the
approach changes for every
competition, especially because
the leakage changes, right?
Every data set, it doesn't
matter if it's tabular data,
time series, image data, a mix
of everything, or even NLP,
every data set is different. And
every data set can have any kind
of leakage associated with it.
In that previous example, the
Santander competition, the
leakage was about a time series
hidden in the data set, but I've
seen in some image competitions
in the past that there were
timestamps in the metadata of
some images. The leakage can be
different for each problem, and
the way the leakage is hidden in
the data is also different. So
there is no classical approach,
there is no secret. Actually,
the idea is to spend a lot of
time taking a look at the data
set, understanding the data set,
make some hypotheses and test
your hypotheses, right. Because
sometimes you spend a lot of
time on a data set and you find
no leakage. And sometimes you
can find something that is
correlated with the target. It's
a matter of trial and error.
It's also a bit of luck to find
those issues in the data set, so
there is no way to know if you
don't try. And I say it's a bit
of luck because you can try
anything that comes to your
mind, but the leakage may be
hidden in something that you
didn't test. So yeah, there's no
regular approach. Every time
it's a customized solution to
find the leakage.
Sanyam Bhutani: What kind
of Kaggle competitions interest
you today? What challenges do
you look for and how do you
decide to enter a competition
today?
Gilberto Titericz: Yeah,
actually, I try to join all the
competitions right now, but I
only become active in a
competition, really compete in
it, if I feel comfortable with
the data, if I like the data,
and especially right now if
there's no leakage in the data
set, right. But my favorite
competitions right now are the
ones with tabular data or time
series data. So I prefer to join
these two kinds of competitions.
Why? Because I found in the past
that I feel more comfortable
with that kind of data. Right? I
tried every kind of competition
in Kaggle, all the types:
optimization, image
segmentation, tabular, time
series, and I found that these
two, tabular and time series,
are the ones that I like more. I
try to join all competitions of
this kind, of course if my free
time allows me to do that, but
yeah, these are the two types I
prefer right now.
Sanyam Bhutani: Got it. So even
after becoming number one, the
best of Kaggle, and you held the
title for the longest period, I
think for two years, why did you
continue to Kaggle? Even though
you're not hyperactive, you're
still one of the most active
Kagglers. Why do you still
continue to Kaggle?
Gilberto Titericz: Yeah, Kaggle
is addictive, right.
Sanyam Bhutani: Hehe.
Gilberto Titericz: But it's not
about the prize money. It's not
about being addictive. I like to
use competing in Kaggle just to
learn nowadays. I usually read
the forums. I usually take a
look at the kernels to see what
people are doing. But I
especially like new kinds of
competitions. I used to be much
more active in the past, before
2017. Right now I am not so
active, but I am still competing
on Kaggle. I still try to
compete in at least one
competition at a time on Kaggle
right now.
Sanyam Bhutani: Yeah.
Gilberto Titericz: But yeah,
what makes me continue is the
hunger to learn, to learn more.
Because you never learn enough,
right. Every day there's
something new in the data
science area. It's hard to be up
to date on data science. You can
use Kaggle to try new approaches
to different problems. You keep
reading papers, reading new
approaches for new, different
problems, and Kaggle is good to
test everything, right? You can
try the new approach from a
paper to improve your skills
when you want to learn something
new. So I like to use Kaggle as
a playground. That's why I
continue using Kaggle. But I was
much more active in the past,
that's true.
Sanyam Bhutani: Got it. So, now
when you enter a competition and
this is a question by Delai50,
what are your first tips when
starting out on a fresh
competition and has the approach
evolved over the years, has your
outlook on that changed?
Gilberto Titericz: Yeah, the
approach when I enter is, first
I understand the competition
data. Right? Take a look at the
data set. Try to understand the
features, how they correlate
with the target, the types of
features, and what kind of
validation strategy we're going
to build in order to solve that
problem, right? Because
sometimes, to win the
competition, you must mimic the
test set distribution, the
public or private test set
distribution, in your training
set validation. For example, if
the competition is a time series
problem and your private
leaderboard is based on seven
days of data, try to mimic that
pattern in the validation on the
training set. I mean, try to
make the train set and
validation split as close as
possible to the test set setup.
This is, yeah, this is part of
my approach. Usually I spend a
lot of time trying to build a
reliable validation strategy
using the train set and using
the information we have from the
test set. And only after having
that validation strategy, think
about starting to build good
models. By good models I mean
high-accuracy models for that
competition, right. Before that
it makes no sense. If you don't
have a good cross-validation or
validation strategy, it makes no
sense trying to build
high-performance models. So
yeah, set the validation
strategy. Maybe spend a lot of
time doing that. I'm not saying
one or two days; maybe a week or
two weeks just thinking about
that. Yeah, to make it really
reliable, really stable, right?
You want it to generalize as
well as possible to the test
set. And then start building
models using that validation
strategy. And by building models
I mean choosing different
machine learning techniques to
try on the data set, maybe using
different views of the data set
through feature engineering, and
eventually experimenting with
blending different solutions in
order to improve the final
accuracy. This is my very
general approach for a Kaggle
problem right now. I must say,
most of the time it's best to
spend much more time in the
beginning, setting that
validation strategy, than to
have to change everything in the
last week or last two weeks of
the competition. Then it becomes
a mess, right?
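[Illustration, not part of the
interview: a minimal Python
sketch of the seven-day example
above. The function name
last_days_split and the 30-day
toy calendar are assumptions
made up for this sketch; a real
competition would use its own
dates and window.]

```python
from datetime import date, timedelta


def last_days_split(dates, holdout_days=7):
    """Hold out the most recent `holdout_days` days for validation.

    Returns (train_idx, valid_idx) as lists of row positions.
    """
    cutoff = max(dates) - timedelta(days=holdout_days)
    train_idx = [i for i, d in enumerate(dates) if d <= cutoff]
    valid_idx = [i for i, d in enumerate(dates) if d > cutoff]
    return train_idx, valid_idx


# Hypothetical training data: one row per day for 30 days.
dates = [date(2020, 1, 1) + timedelta(days=i) for i in range(30)]

# Assumption: the hidden test set covers the 7 days after training,
# so validation uses the last 7 days of the training period.
train_idx, valid_idx = last_days_split(dates, holdout_days=7)
```

[Every validation row is later
than every training row, which
mirrors how the test period
follows the training period in
this example.]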
Sanyam Bhutani: Hehe, yeah.
Gilberto Titericz: So yeah, it's
good to think that way. And
also, trust your local
validation score. It's very
important. Sometimes you must
look at the public leaderboard
score also. But most of the
time, if you built your
validation strategy correctly,
you can trust your local
metrics. So try to trust your
local score more than the public
leaderboard. That's part of the
approach.
Sanyam Bhutani: I think that's
one of the mistakes that all
beginners make, and you
especially are one of the Grand
Masters who has survived most of
the shakeups. You're always good
at that. Another aspect is
teaming up on Kaggle. You've
teamed up with multiple Kagglers
over the years. What's your
teaming-up strategy, and what's
your advice to beginners who
start looking to team up because
they think it'll help their
performance in the competition?
Gilberto Titericz: Yeah, there
are many different teaming
strategies. If you are a Kaggle
novice, the strategy would be to
try to team up with a more
experienced Kaggler, right.
That's easy, because you will
learn from that more experienced
Kaggler. But if you are a
grandmaster, probably you are
looking for some complementary
approach to the problem. For
example, an approach that's
completely different from your
approach, and that is my main
strategy when I team up. I ask
people, to get an idea of what
approach they are using to solve
the challenge, and if I find
that it is different from mine,
it's very good to team up with
that person or that team in
order to have two different
viewpoints of the problem.
Sanyam Bhutani: Yep.
Gilberto Titericz: Because one
way to improve the final
accuracy of the models is
blending models that use
different approaches. Right. And
having different viewpoints of
the problem is also part of the
ensembling technique.
Sanyam Bhutani: Yea.
Gilberto Titericz: And I think
this is the most critical part
when teaming up with a Grand
Master: having something
different to add to the
solution. Individually, it's
hard to find someone that's
different from you, especially
because if people read too much
of the Kaggle forums, they start
to build very similar solutions
to the problem, or people start
to get biased by the solutions
present on the forum. So the
forum is good, yes, but
sometimes you must stop reading
the forums and try to build your
own customized solution to the
problem, and it can be different
from the rest of the
competition. It can be something
that improves the accuracy of
the final solution, just doing
something different from others,
right. This is also why blending
a GBM model with a neural
network is usually good for
tabular data: they are
completely different approaches,
with different training
algorithms, right.
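[Illustration, not part of the
interview: a minimal Python
sketch of why blending two
different models can help. The
arrays pred_gbm and pred_nn are
simulated stand-ins with
independent errors, not outputs
of real models, and the helper
names rmse and blend are
assumptions for this sketch.]

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction vectors."""
    return weight_a * pred_a + (1.0 - weight_a) * pred_b


rng = np.random.default_rng(0)
y = rng.normal(size=1000)  # synthetic target

# Assumption: each "model" is the target plus its own independent
# noise, standing in for two approaches that make different errors.
pred_gbm = y + rng.normal(scale=0.5, size=1000)
pred_nn = y + rng.normal(scale=0.5, size=1000)

pred_blend = blend(pred_gbm, pred_nn)

print("gbm  :", rmse(y, pred_gbm))
print("nn   :", rmse(y, pred_nn))
print("blend:", rmse(y, pred_blend))
```

[The gain comes from the errors
being different. If both models
made the same mistakes, the
average would be no better than
either one.]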
Sanyam Bhutani: Yeah.
Gilberto Titericz: This is
similar for teaming. Find people
that think differently from you
and have high-accuracy models as
well, right. But it's hard to
find people like that in Kaggle.
Right? Especially because, as I
told you, people try to build
solutions similar to the others
in the forum. And also, it's
hard to ask that kind of people,
what about your solution?
Because before teaming up, you
cannot share information about
the competition, right. So
what's the best way to tell what
you are doing without sharing
exactly what you're doing?
That's another challenge. So
teaming up is not easy. It's not
easy in terms of finding the
right people to team up with,
asking the right questions
before teaming up, and deciding
to team up. But what I can say,
especially for novices: try to
team up with people more
experienced in Kaggle,
especially a Master or Grand
Master, to learn, because this
is how I learned in Kaggle also.
In the beginning I teamed up
with people more experienced
than me and I learned a lot from
those people. I got a lot of
tricks and tips from those
teams. And I recommend teaming a
lot. I believe it's the best way
to learn, because you can share
knowledge in a team. You can
learn from the experience of
more experienced people, and you
have the advantage of each team
member building a different
approach and then blending all
the solutions into one final
solution. I think teaming is
great in Kaggle in all aspects.
Sanyam Bhutani: You're also one
of the people who are really
great at ensembling, and you've
given multiple talks that are on
YouTube; I could have those
linked. But any best advice that
you have on ensembling all of
your models? Has your approach,
again, changed over the years?
Gilberto Titericz: Yeah,
ensembling, it's a technique
that is very easy to overfeed if
you do it the wrong way. So,
especially if you have a team
you have to set some rules to
avoid overfeeding. Not big x and
Sambol?????? and also you must
take care of bugs in the code,
especially regarding the cross
validation or validation
approach because ensembling many
models to make it work, you must
have out-of-fold predictions for
all the models. And when I say
out-of-fold, I mean, do some
validation strategy, like
cross-validation or time series
moving window validation step by
step. So one of my
recommendations is do the model
out-of-fold predictions using
exactly the same folds for all
the models. Why that, why that,
to avoid leakage in the folds,
between folds of the same model.
So if you if you train a model
using exactly the same folds for
all the models in your pipeline,
you make it possible to compare
the performance of all the
models. This is the number one
approach you must take care when
doing ensemble right to make
sure that you are not leaking
information between the folds.
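
A minimal sketch of the out-of-fold idea described above, using only the Python standard library. The data, the two toy models (`fit_mean`, `fit_line`) and the helper names are assumptions made for illustration, not anything from the interview. The point it shows is that the fold assignment is built once and reused by every model, so the out-of-fold scores are comparable and the predictions can be blended row by row.

```python
import random

def make_folds(n_rows, n_folds=5, seed=42):
    """Assign every row index to exactly one fold, once, for all models."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]

def oof_predict(fit, X, y, folds):
    """Out-of-fold predictions: each row is scored by a model that never saw it."""
    oof = [None] * len(y)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            oof[i] = predict(X[i])
    return oof

def fit_mean(X, y):
    """Toy baseline: always predict the training mean."""
    mean = sum(y) / len(y)
    return lambda x: mean

def fit_line(X, y):
    """Toy model: one-feature least-squares line."""
    n = len(y)
    mx, my = sum(X) / n, sum(y) / n
    sxx = sum((x - mx) ** 2 for x in X)
    slope = sum((x - mx) * (t - my) for x, t in zip(X, y)) / sxx
    return lambda x: my + slope * (x - mx)

def rmse(y, pred):
    return (sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)) ** 0.5

# Synthetic data, purely illustrative.
rng = random.Random(0)
X = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 * x + rng.gauss(0, 1) for x in X]

folds = make_folds(len(y))  # built once, shared by every model
oof = {name: oof_predict(fit, X, y, folds)
       for name, fit in [("mean", fit_mean), ("line", fit_line)]}
scores = {name: rmse(y, pred) for name, pred in oof.items()}

# Because the folds match, the predictions line up row by row and can be blended.
blend = [(a + b) / 2 for a, b in zip(oof["mean"], oof["line"])]
print(scores, rmse(y, blend))
```

In a team setting, the same idea would mean agreeing on one saved fold assignment before anyone trains a model.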
Sanyam Bhutani: Yeah.
Gilberto Titericz: And once you
do that, the stacking becomes
something trivial to implement.
Basically, you're going to
ensemble, or merge,
predictions from different
models, right? But I believe
people must take care,
especially with the validation.
Also, there's one technique,
very common in Kaggle nowadays,
that is called target encoding,
where we use the target
information to encode the
categorical variables. Also the
way how we do the target
encoding can leak information
from the folds and this
completely destroys the ensemble.
And so you must know what you
are doing before doing that. And
that kind of approach can
destroy ensemble generalization,
performance on the test set. And
it's something you must take care
of when teaming, teaming up with
other people because people
can have some different
understanding about how to run
those algorithms. And one,
one single leaking model can
destroy an entire ensemble of
hundreds of models. So this is
something that you must take
care of: avoid leakage and
overfitted models in ensembles.
And yeah, I think, I think
ensembling is a pretty
interesting topic. I like
talking about it more. I use it
a lot in Kaggle and the best,
and it's one general way to
improve your score and get to
the top, just blending different
model predictions.
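
To make the target encoding warning concrete, here is a small sketch, again standard-library Python only. The function names, the smoothing parameter `prior_weight` and the tiny data set are assumptions for illustration. The naive version computes each category's target mean on all rows, so a row's own label leaks into its feature; the out-of-fold version computes the mean only on the other folds.

```python
from collections import defaultdict

def target_encode_naive(cats, y):
    """Leaky: every row's own target is part of its encoding."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(cats, y):
        sums[c] += t
        counts[c] += 1
    return [sums[c] / counts[c] for c in cats]

def target_encode_oof(cats, y, folds, prior_weight=1.0):
    """Encode each row with target statistics from the OTHER folds only."""
    encoded = [None] * len(y)
    for held_out in folds:
        held = set(held_out)
        sums, counts = defaultdict(float), defaultdict(int)
        total, n = 0.0, 0
        for i in range(len(y)):
            if i in held:
                continue
            sums[cats[i]] += y[i]
            counts[cats[i]] += 1
            total += y[i]
            n += 1
        prior = total / n  # global mean of the training folds
        for i in held_out:
            c = cats[i]
            # Smooth toward the prior; unseen categories get the prior itself.
            encoded[i] = (sums[c] + prior_weight * prior) / (counts[c] + prior_weight)
    return encoded

# Extreme case: every category appears once (like an ID column).
cats = ["a", "b", "c", "d", "e", "f"]
y = [1, 0, 1, 0, 1, 0]
folds = [[0, 1], [2, 3], [4, 5]]  # should be the same folds used for the models

naive = target_encode_naive(cats, y)
safe = target_encode_oof(cats, y, folds)
print(naive)  # identical to y: the feature is just the label
print(safe)   # the prior for every row: no label information leaks
```

A model trained on the naive feature would look perfect in validation and fail on the test set, which is the kind of single leaking model that can damage a whole ensemble.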
Sanyam Bhutani: It's, for
novices also, like playing
with fire like you said, so you
have to be careful. Another
thing that's like playing with
fire for the novice is hardware
and I'd like to touch upon this
topic that a lot of your
achievements were on a mid range
laptop. Do you think it's still
possible to perform well in
competitions maybe even win on
normal hardware, maybe not the
deep learning competitions but
for the other categories?
Gilberto Titericz: Yeah, I think
it's possible with normal
hardware. I started Kaggle using
a low end laptop, a Core i5 with
four gigabytes of RAM.
Sanyam Bhutani: Okay.
Gilberto Titericz: I used a
laptop for many years. Then I
switched to a better laptop,
an i7 with 32 gigabytes of RAM
but no GPU at that time. It
helped in my life in Kaggle, I
was able to fit more of the data
set in memory. But I don't think
it's necessary to have high end
specs on your laptop or
computer to win a Kaggle
competition. Not not, at least
not a tabular data competition,
right?
Sanyam Bhutani: Yep.
Gilberto Titericz: Also Kaggle
provides the kernels that have a
good amount of memory 16
gigabytes, and four, two or four
cores available, so it's enough
to win the competition. Because
in competitions, people that win
usually are the people that run
more experiments on the data
set. And the competition's about
running a lot of experiments
on the data set. And more
experiments mean more chance
you'll have to win the
competition because you have
more information about the data.
And right now, I use just
servers, two Xeon servers, 18
cores each and a lot of RAM
and at least one GPU
in each one.
Sanyam Bhutani: Okay.
Gilberto Titericz: But it is
not what makes the difference
for me right now. I believe your
skills and your knowledge about
machine learning algorithms are
more important. And, but of
course, if you are joining an
image competition, probably having
more GPUs is a big advantage.
Sanyam Bhutani: Yeah.
Gilberto Titericz: And the,
right now there are a lot of
competitions running for image
classification. So, yeah, it's
if you think that winning the
competitions is running a lot of
experiments and you'll have
dozens of GPUs available. You
can run maybe dozens of
experiments at the same time,
right?
Sanyam Bhutani: Yeah.
Gilberto Titericz: It's
different from tabular data,
where the restrictions are mainly
on the number of CPUs you have.
And a little bit about the RAM
you have available. But I still
think that for tabular data and
time series data, you don't need
to have bigger hardware to win
the competition. You can win
using even the Kaggle kernels to
do that, if you have enough
knowledge and time to invest in
the competition. Yeah, but it's
not the case for image
classification.
Sanyam Bhutani: Hehe, yeah.
Gilberto Titericz: And you
see there's a difference between
those two kinds of competitions.
Sanyam Bhutani: Definitely.
Gilberto Titericz: It was based
on my experiment, experience,
also on both kinds of
competitions, but I still like
image competitions. I have some
GPUs to run, but it's unfair. I
think it's not fair for people
that don't have access to GPUs,
or have only one GPU available.
It's harder to win the
competitions using less
hardware. So yeah, hardware
can play a big role in
winning image classification
competitions or maybe deep
learning competitions, right, if
we're talking about natural
language processing. Also, you
know, for some NLP problems,
it's necessary to use large GPUs
to run the models, or at least
to load the models in GPU
memory.
Sanyam Bhutani: Yeah.
Gilberto Titericz: But it's not
the case for tabular. Also,
that's why I like tabular data.
I can use my laptop and I can go
with my laptop anywhere in the
world and still compete
without the need of a large
amount of GPUs.
Sanyam Bhutani: Yeah. Talking
about experiments, so there are
times for people like me,
especially when you get stuck on
the leaderboard, you've been
running a bunch of experiments,
all are failing. What advice do
you have for that time, when
your score isn't improving, and
the competition is still pretty
active. How should you come up
with new ideas or new
experiments?
Gilberto Titericz: Yeah, that's
the tricky part of data
science because it's, it's about
your previous experience in
solving that kind
of problem. But it's also about
having new ideas, right. And
having new ideas is something,
it's hard to, to improve that,
that kind of skill. And when I'm
running a lot of experiments, I
try to run the basic ideas
that come to my mind, right and
then leave it running. Usually
you have some predetermined,
predefined ideas that are pretty
common between all the types of
challenge, but I used to think
about the competitions, even
during my work time, or
even when I'm sleeping, I'm
thinking about competitions.
Usually some of the ideas
come from that time that you
are not competing, that you are
resting or playing video games,
you're enjoying it. And suddenly
the ideas just happen to appear
in your mind and you're, you're
thinking, okay, maybe I should
experiment with that new idea I
just thought of. There's no
recipe for that. It's about,
it's an art,
right? It's something like an
art. It's about intuition. What,
what is my intuition to solve
that problem? If you have a lot
of ideas, it's easier because
you can prepare a set of
experiments and leave it running
and, taking notes of the
experiment results is also a
pretty important step, to record
what you made, log what you made
in the past, what you already
experimented.
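
One simple way to keep the kind of experiment notes mentioned above is an append-only CSV file. This is only a sketch under assumptions: the column names, the helper functions and the scores below are made up for illustration and do not come from any real competition.

```python
import csv
import os
import tempfile
import time

FIELDS = ["timestamp", "experiment", "params", "cv_score", "notes"]

def log_experiment(path, experiment, params, cv_score, notes=""):
    """Append one row per experiment; write the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
            "experiment": experiment,
            "params": params,
            "cv_score": cv_score,
            "notes": notes,
        })

def read_log(path):
    """Return all logged experiments as a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def best_experiment(path, lower_is_better=True):
    """Pick the row with the best cross-validation score."""
    pick = min if lower_is_better else max
    return pick(read_log(path), key=lambda row: float(row["cv_score"]))

# Dummy entries, illustrative values only.
log_path = os.path.join(tempfile.mkdtemp(), "experiments.csv")
log_experiment(log_path, "baseline_gbm", "depth=6", 0.4521, "same 5 folds")
log_experiment(log_path, "gbm_plus_target_enc", "depth=6", 0.4478, "OOF encoding")
log_experiment(log_path, "crazy_idea_01", "depth=3", 0.4610, "worse alone, keep for blend")

best = best_experiment(log_path)
print(best["experiment"], best["cv_score"])
```

Keeping failed ideas in the same log is the useful part: it records what was already tried, and a model that is weak alone may still help a blend.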
Sanyam Bhutani: That's where the
intuition starts to build,
you start to get a rough idea
that maybe this is not the best
experiment, because this has
failed earlier.
Gilberto Titericz: Yeah, yeah,
that's when you can use your
previous experience to do that.
But sometimes you have crazy
ideas. And I recommend if you
have crazy ideas, try to
implement those crazy ideas.
Sometimes it works. At least it
works with me some, sometimes
crazy ideas help with the final
solution if you're building an
ensemble of many models. But
also there's no recipe for how
to do that. My recommendation is
try anything that comes to your
mind if you have enough time to
do that. Anything, even crazy
ideas and especially if the ideas
are different from the ideas of
other people in Kaggle, because
if you are running the same,
exactly the same approach as
other people in Kaggle, you get
the same results as the other
people. It's very good if you
are learning, if you are a
novice and you are learning,
this is pretty good to try that,
that come across, but you're
only going to win the
competition if you're doing
something different from the
others, right?
Sanyam Bhutani: Mhm.
Gilberto Titericz: You should
take a look at the, the forums,
the prize winners. If you read
the solutions, you see that most
of the time they do things
differently from other people.
They think out of the box. They,
yeah, they code something
completely different. Or
sometimes some, some step is
different, but it's hard to win
a competition using a general
approach right? Break them in
general. So yeah, that's my
recommendation: try
everything and anything that
comes to your mind, even if you
are sleeping and you have an idea
during the dream, in real life,
it happened with me the best.
Sanyam Bhutani: What best advice
would you have for a novice just
getting started on Kaggle, like;
Gilberto Titericz: Okay to a
novice, I recommend to read the
forums, right? Try, try to
attend some online classes,
online courses, or any free
courses or maybe paid courses
online to have an idea of the,
the idea of machine learning and
data science, but at Kaggle try
to read a lot of the forums, try
to replicate or at least
understand the kernels, Kaggle
kernels. And especially try to
team up with people more
experienced at Kaggle. It is
pretty good. Maybe you cannot just
team with a Grand Master but
maybe you can team with an
expert or a Kaggle Master. If
you have some friend that is a
Master or an expert it is easier
to team up. But try to learn
from those people, because you
can learn how Kaggle works and
you can learn how machine
learning works. Two different
things. To learn machine
learning, you first need to
learn how Kaggle works, right?
There are some rules in Kaggle. But
another advice is don't try to
learn everything on Kaggle,
because data science, there are
too many fields. If you try to
learn everything you get crazy
right? Try just to know
that every field exists, but try
to experiment just a bit of
every field and try to find the
one, the field where
you are more comfortable
or you are more interested about
how it works. For example, some
people can find that natural
language processing fits better
with the experience of those
people, other people prefer to
work with image processing or
image classification, other
people image segmentation, other
people like time series. Try to
become an expert in one
specific field before trying to
learn other fields right. Don't
try to learn everything at the
same time, you will get crazy. It
is also impossible. So many, so
many things, so many papers, so
much theory. So try to
specialize in one specific field
and get better on that field
before moving to other fields
and if you like that you, maybe
you should continue to work in
that field. Right. I know a lot
of people that just work on that
field and don't want to change.
That's, that's my advice for
a novice.
Sanyam Bhutani: That's great
advice. Just like this interview
that has been full of great
advice. I have a final
question. And this is, this will
be tough for you. What has been
your all time favorite game and
your current favorite game? And
PS4 or Xbox? Because;
Gilberto Titericz: That's the
hardest question. I have to
think about that. Actually, I
have all the game consoles. I
have an Xbox. I have all the
PlayStations. But, I have a lot
of favorite games, maybe for
PlayStation 4 the ones that I
liked more are the God of
War series, you know? Yeah, I
liked that series a lot,
all four God of War games,
and from the Xbox I love
Gears of War, it's a shooting,
shooting game. Gears of War
1-2-3-4, then a lot of.
Sanyam Bhutani: 5 is also out I
think.
Gilberto Titericz: Yeah, okay.
And but yeah but I have a lot of
games that I liked, and not my
favorites, but I love for
example, Horizon Zero Dawn for
PlayStation 4.
Sanyam Bhutani: Yeah.
Gilberto Titericz: Yeah even
Spider-Man, Uncharted, some
role-playing games. There's,
there's some new, new releases
that's trending right now.
Yeah, there's so many good games
and I have been a huge fan of
games all my life. I won some,
some video game competitions
before when I was a kid also.
Sanyam Bhutani: So you had that
competitive element but if you
have to name one favorite game
of all time throughout your life
that you really like?
Gilberto Titericz: All the time?
Sanyam Bhutani: Yeah.
Gilberto Titericz: Well, that's
the hardest question, right.
Okay, I think the game that I
spent most of my time playing
was a very old game
from PlayStation 1 called Final
Fantasy 7. You know the game?
Sanyam Bhutani: I have played
that game.
Gilberto Titericz: Yeah, I
figured that I spend most of my
time playing the game. Actually,
I spent hundreds of hours
playing that game. I just, one
game, I love it. Yeah. I think I
can choose this. This one, is an
old one. I love it.
Sanyam Bhutani: Awesome. Thank
you so much, Grand Master Giba
for joining me on the podcast
and for all of your
contributions to the kernels and
discussions on the Kaggle
community as well.
Gilberto Titericz: I appreciate
the invitation to be here. And it
was an honor for me. So thank
you also.
Thank you.
Sanyam Bhutani: Thank you so
much for listening to this
episode. If you enjoyed the
show, please be sure to give it
a review or feel free to shoot
me a message. You can find all
of the social media links in the
description. If you like the
show, please subscribe and tune
in each week to "Chai Time Data
Science".
