Sanyam Bhutani: Hey, this is
Sanyam Bhutani and you're
listening to "Chai Time Data
Science", a podcast for data
science enthusiasts, where I
interview practitioners,
researchers, and Kagglers about
their journey, experience, and
talk all things about data
science.
Hello, and welcome to another
episode of the "Chai Time Data
Science" show. In this episode,
I interview Dr. Russ
Wolfinger, who's the Director of
Scientific Discovery and
Genomics at SAS, a department
which he started, and he
leads the R&D of JMP and SAS
software solutions in the domain
of genomics and clinical trials.
In this interview, we talk all
about his journey into the
field and how his stats and
software development evolved
over the years. Russ is also a
Kaggle Grandmaster in the
competitions tier. We talk a lot
about Kaggle: his approach to
Kaggle, his viewpoints on
Kaggle, and his best advice to
new joiners and beginners.
This interview is an
amazing intersection of
statistics, data science, and
data science applied to Kaggle
and the real world. We also talk
a lot about Kaggle as a platform
for learning data science and
applying your skills, and how to
get better at Kaggle and data
science broadly speaking. So
thank you to Russ for sharing
all of that amazing advice. A
quick reminder to the audience.
This episode will have manually
checked and uploaded subtitles,
so please remember to enable the
subtitles on YouTube, or go
watch the video on YouTube, if
you're a non-native speaker. For
now, here's my conversation with
Russ, all about statistics, data
science and Kaggle. Please enjoy
the show.
Hi everyone. Today I have the
privilege to have on the show
another statistician, one of the
best alive today, of the ranks,
if I may say so, of Leland
Wilkinson, whom I already had on
the show; a data scientist; and
a top Kaggler. These are not
three different people. This is
the same person: Dr. Russ
Wolfinger. Russ, thank you so
much for joining me on the Chai
Time Data Science podcast.
Russ Wolfinger: Thanks a lot,
Sanyam, and welcome from North
Carolina here. I do need to
apologize: I was helping my
sister-in-law yesterday with a
construction project, and I have
this sticky glue-like substance
still on my hands. So I don't
want to scare any viewers if
they see something when I happen
to make a gesture.
Sanyam Bhutani: I think you look
great. And we can do some deep
learning magic, hopefully.
Russ Wolfinger: There we go.
Sanyam Bhutani: So I want to
start by talking about how you
got interested in data science.
Today you're the Director of
Scientific Discovery and
Genomics at SAS. If I may,
you've been involved in the
field for close to half a
century now. Could you tell the
listeners how you got started?
Is what you studied during your
university days now data
science? Because it was branded
as stats during your day.
Russ Wolfinger: Yeah, yeah. And
yeah, I'm not quite half a
century.
Yeah, we're
really in this amazing time, and
I feel so fortunate and blessed
to be where I am.
And I think my career path
has been really simple and
canonical. I pretty much went
from kindergarten straight to
the PhD, and then hired on here
at SAS, and I've been here at
SAS for 30 years, so we
could end the show
right there. But over that time
period, though, it's just been
a really amazing series of
changes. And again, I feel
fortunate throughout that whole
path. I had a very
loving home atmosphere with a
mom and dad who just did
everything they could to
encourage us, and I've
got two wonderful sisters. We
didn't have a lot growing up. I
grew up in Ohio, the north
central part, but with a great
school system there that I went
all the way through. And I had
teachers who, since I was
pretty proficient in
mathematics, allowed me
to kind of go at my own pace,
which was really awesome. And
then I did an undergraduate
degree in math at Liberty
University, and then a PhD in
statistics at NC State. The
connection here at SAS really
was somewhat serendipitous. I
really did not know about SAS at
all when I came to NC State.
It just turned out my wife and I
decided to go through
graduate school together. She's
a molecular biologist, PhD. And
so we had a nice period there
where we were going through grad
school together. But I was
one year ahead, and so I
finished up early. I was like,
well, I gotta go get some
computing experience. This place
called SAS down the street looks
pretty reasonable. And they
hired me on, and it's
just been an amazing ride ever
since.
Sanyam Bhutani: Well, talking
about your traditional route, if
I may, why was it important to
complete a PhD right after
your bachelor's? Why did you go
along the route of research
during the 80s?
Russ Wolfinger: I think for me
it was just kismet, maybe. In a
way I was just so well
suited to it. I always was
a good test taker. I liked
school. I liked the classroom
setting. You know, I always
asked a lot of questions. And
again, fortunately, I
always seemed to have really
great teachers and professors.
So it was really a joy to be
around them and to learn from
them. And so for me, it
was a very easy choice. You
know, I get the feeling
it's not as easy for, I'd say,
most people, but for me it
really was a very easy and
natural path. And it seemed to
make sense just to keep going.
And,
you know, again, at
each stage throughout there, I
was able to pick up new
pieces. For example,
even coming out with a
PhD, I didn't have a lot of
strong computing experience. But
then here at SAS, you know,
they are just
an incredible company. They've
been an incredible company to
work for. The
studio that you see me in now
is just one of our facilities,
and shout out to the
studio guys here who are helping
me record this. Over the
years the company has been
extremely successful and
growing, and we've got an
incredibly nice campus here. And
so, for me, again, I
just feel kind of lucky
in a sense, just because, you
know, being born here in the
United States, I just kind of
had this nice little path,
really not that far from home.
And it just worked out nicely
for me. So I do feel a sense
of responsibility, and I
try to give back and help
folks around the world. And
thankfully, with the internet
now, the world is kind of
shrinking, and we can
conduct an interview like this
halfway around the world. Pretty
amazing.
Sanyam Bhutani: Yeah, for sure.
I want to talk about your
research. You did your PhD in
statistics. Could you tell us
more about your research, and
what was it like studying during
those days? Because now I would
shamefully cheat with my
smartphone under my desk. How
did you pick up your concepts
during your days?
Russ Wolfinger: Yeah, you're
right. The internet was just
getting going. My PhD
years were the late
1980s. And I do remember, in my
little
apartment near campus, I was
able to connect online, and we
had, I think it was, 320
baud, so you could literally see
the characters go across the
screen. So it was just after
the time when you had to do
everything on punch cards.
Thankfully, I didn't have to,
but it was just right after
that. And I was so excited that
I could have a little
data set and I could submit some
SAS code. We had a
central computing center called
TUCC, the Triangle
Universities Computing
Center, and I was able to submit
jobs to that, kind of
like an early AWS at the
time, but only running
at 320 baud. I was able to
submit a little piece of code
and get some results back. But
my dissertation was actually all
theory. As statisticians, we
often like to study properties
of estimates as the sample size
grows to infinity; we call it
asymptotic theory. And
the two main theorems that we
have are the laws of large
numbers and the central
limit theorem. For a
certain class of models, I was
able to derive some rules for
bounding the rates. These
bounding rules help as the data
grows larger: they give you
some rules of thumb for
how many parameters you'd want
to use, based on the number of
observations that you have. All
theory; I don't think it's ever
really been used much. In fact,
this
is my only poor
experience with publishing. I
wrote up my dissertation
and a couple of papers and
submitted them. And
unfortunately, I drew what I
consider a fairly
adversarial referee, who
sat on the papers for a year and
a half. We heard zero, and this
guy was stalling. He wouldn't
reply. And I'm
speculating, maybe he must have
been friends with the editor or
something. They just
kind of sat on the papers for a
long time. And the crazy thing
is, by that time, I'd
already hired on at SAS. I'd
learned other areas of
statistics and published other
papers. By the time I got the
reports back, I was like, this
isn't even worth it at this
point, and I had kind of moved
on. But
beyond that, even
though there are some
problems with the peer review
publication process, I do think
overall it's a great way to
disseminate, you know, knowledge
gains and research that people
are doing. Although I must
confess I also
like the computer science model
of, say, submitting to
conferences. That's another nice
way, I think, and of course
arXiv now and things like that
are ways to get around it.
So anyone out there
trying to publish, don't be
discouraged if you happen to
draw some bad referees for
one reason or another.
The process is somewhat
subjective, I'd say.
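For reference, the two asymptotic results Russ names above can be written in their textbook form. This is a standard statement under simple assumptions (independent, identically distributed observations with finite mean and variance), not the specific rate-bounding results from his dissertation, which the conversation does not spell out.

```latex
% Setup (assumed): X_1, ..., X_n i.i.d. with mean \mu and finite variance \sigma^2,
% and sample mean \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.

% Strong law of large numbers: the sample mean converges to the true mean.
\bar{X}_n \xrightarrow{\text{a.s.}} \mu \quad \text{as } n \to \infty

% Central limit theorem: the centered, rescaled sample mean is asymptotically normal.
\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr) \xrightarrow{d} \mathcal{N}\bigl(0, \sigma^2\bigr) \quad \text{as } n \to \infty
```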
Sanyam Bhutani: That's great
advice. Talking about knowledge,
I think you're still an active
learner, one of the best on
Kaggle. What's
your favorite observation of the
change in learning methods over
the years? Maybe one
like and one dislike about how
the way we learn has evolved
over the years?
Russ Wolfinger: Yeah, that's a
good question. And I
realize, even though
I would consider myself a
pretty good book learner or
classical classroom
style learner, my favorite way
to learn is just to have a
problem, you know,
literally have a problem that
focuses your mind. And this is
where Kaggle is just so
appealing: they're
able to set up some type of
predictive modeling problem from
a sponsor, and then your whole
goal is to solve that
problem. So you
start, then you go on a big
search. You look to find
anything that you can possibly
imagine that might have bearing
on solving that problem. And
with that as your goal,
it also helps you filter away a
lot of things because there's so
much out there, right? It's
overwhelming the amount of
material that's out there, but
with a problem focus,
you can see a certain amount of
theory, and although it might be
interesting, you're like,
that's not going to help
me right now. I'll make a note
that it's there, but I'll let it
go. But now here's something
where it's like, oh, I see
immediately how I can use that.
And man, if you do that
several times over, I feel like
your knowledge kind of grows
organically, almost in a
tree form. And then
once you've used a method
and applied it to a problem, you
often have it for life. So I'm
really big on a
problem-focused approach
to learning. And problem
solving, I think, is just maybe
at the heart of what we do.
For me anyway, and from
my observations of others, I
think that's a fairly appealing
way to learn.
Sanyam Bhutani: I think the
other thing that the internet
has imposed on us is imposter
syndrome. Beginners like me
really get scared: maybe I'm not
good enough for this scary
competition, maybe this problem
is completely out of my reach.
How do you think people like us
should approach any problem, for
example, a new Kaggle
competition where we have 3000
people and hundreds of
grandmasters in the
competition?
Russ Wolfinger: Yeah, it's
a good point. There
certainly is a social aspect,
and man, Kaggle
competitions now are so
extremely competitive. In a
way you have to have a
dose of humility about where
you're at, you know, if you
are a relative newcomer. I
went through it myself, though.
In my first
Kaggle competition, I was
fairly new, a PhD in
statistics, and I went in like,
I'm just gonna
win this thing, no problem.
Bunch of joker
data science guys. And I
literally just, you know, kind
of got my tail kicked. I
tried everything I could think
of and did horribly. And I
believe a lot of the top
Kagglers have gone through an
experience like this early on.
Then you just have to suck it up
and say, okay, you know,
let's reassess here. Eat some
crow. Be humble about
it. Reflect on what happened.
And then, man, you know,
Kaggle is just a gold mine of
knowledge. Most times,
winners or top performers are
very gracious in sharing
their solutions. And that's just
an incredible way to learn.
Think about the problem solving
method I mentioned before: you
just pour your heart
and soul into a problem, and you
still can't solve it very well,
maybe you don't finish so well
in a Kaggle competition. But
then go look at what the
winners did. Talk about a rich
learning experience. You can
just see, you're like, oh, you
know, that's what they did. And
I've had many times where I've
actually been very close to what
the winner did, but then I
decided for whatever reason to
go a different way, and I
missed the
magic, as it were.
And I think that's pretty
common. The other thing is
just to be realistic about
Kaggle. I mean,
if you sort
the scores, often
they're separated down in the
third or fourth decimal place,
so the true difference
between finishing fifth and
finishing, you know,
250th is likely negligible. So
you shouldn't put your whole
personal value on your
Kaggle ranking.
Sanyam Bhutani: But the upside
also is that the leaderboard, as
I said, doesn't lie at all.
People from anywhere can join
for free, compete, and maybe
even walk away
with the prize. I know a Robot
AGI, the 18-year-old Kaggle
Grandmaster who's been winning
competitions even before he was
18. So anyone, whether they're
in high school or have done a
PhD, from any walk of life, can
come to Kaggle if they are
interested in participating.
Russ Wolfinger: I know, that's
just amazing. Kudos
to Anthony and the team there.
Kaggle is
just a brilliant
platform that takes advantage of
the modern internet and modern
data science. They now offer
computing facilities for free.
It's just a really amazing
platform, and it's had a
very positive effect on my life
personally over the last several
years. I've felt like I'm
almost back in school
again, you know, learning
and having a lot of fun. And I
do think a really interesting
aspect to it is it does
seem to appeal to males
more than females. And I'm still
not quite sure why. I think
maybe men tend to be a little
more competitive, as it were,
just generally speaking. And so
maybe we're just so naive and so
cheap that we love to see our
name on the leaderboard, you
know. Someone else jumps up, and
I'm like, I gotta get
ahead of that guy. I think
women tend to be much more
sensible and view that kind
of motivation as a
little bit silly.
Sanyam Bhutani: But I'd also
drop a few honorable mentions,
maybe after the interview, to
women Kagglers. The ones
that are there are also at the
top of the leaderboard when they
get down to competing.
Russ Wolfinger: I know, you
should definitely try to
interview some of them, because
there are some brilliant women
data scientists out there. I
know here at SAS, I'd
say it's roughly 50/50. And
to me, I love
having a colleague that almost
just thinks in an orthogonal
direction from what I do,
because then you can
really make progress on
problems. Each of us is
kind of limited by what we know
and the techniques that we have,
and it's easy to get kind of
stuck in a little
intellectual cul-de-sac, as it
were, and a colleague with a
different background and
perspective can often just pull
you out of that and set you on
your way to achieving better
performance again.
Sanyam Bhutani: So we discussed
Kaggle. Kaggle is really
the home of data science; I
don't think we need to debate
that. Many people join it to
maybe get a taste of data
science, or become a good data
scientist, or maybe a good
applied data scientist. Why did
you join Kaggle? Why did you
start competing on Kaggle?
Russ Wolfinger: Yeah. So
I've spent my career
working here in professional
software development, and I've
been working in the JMP division
of SAS. JMP is a nice
point-and-click interface
for handling mostly tabular
data. And then in my
individual group, we work on two
vertical products called JMP
Genomics and JMP Clinical.
JMP Genomics
has a predictive modeling
framework within it, with nice
cross-validation methods and
things like that, and we had
just wrapped up a redesign of
that. And we were looking for
problems to try it on. And it
turned out, I don't know if
you've heard of it, there's
another competitive data science
organization called DREAM,
D-R-E-A-M. I think IBM is
involved closely with them, and
Sage Bionetworks. They put on
competitions mostly focused on
science. And since
we were already doing a lot of
genomics things, we came across
a competition where
we were predicting prostate
cancer survival based on
biomarkers, or actually clinical
data as well. And this looked
like a nice problem. And it
turned out to be a competition,
and I was like, okay, let's try
this competition with our new
framework. And we actually ended
up doing pretty well with that.
But the DREAM
competitions tend to be a fair
bit smaller than Kaggle ones.
And then, I can't remember
exactly how I found out about
Kaggle; somebody told me, and I
got on there. And at that
point I was like, oh my
goodness, this looks so
incredible. And at that point I
was hooked.
Sanyam Bhutani: So with your
first failure on Kaggle, did you
decide to accept the challenge?
CPMP calls it a legal drug. Is
that when you got
addicted to Kaggle?
Russ Wolfinger: Yeah, well,
for me again, maybe this was a
bit of fortunate timing. This
was right when XGBoost was
coming onto the scene. And you
know, Anthony, in your interview
with him, mentioned there was
kind of this transition period
where for a while random forests
were dominating a lot of the
data science competitions. But
then XGBoost came out, and
actually there was
a GBM module in there in
between, and that
was clearly starting to
offer performance and
competitive gains. So, like I
mentioned, I had that
first humbling experience. I
looked back and I went to look
at the winners, and I believe
the top five people all used
XGBoost. And I was like, what?
You know, okay, what is this? I
didn't even know about it.
I was using a lot of
classical statistical methods,
regression and, you know,
various techniques that
I was aware of. I
was familiar with basic decision
trees, but these boosted trees
seemed to offer a little bit
more in terms of functionality.
And so my very next competition,
I'm like, okay, I'm
gonna try XGBoost on
this one. And I didn't even
realize it, but I ended up
getting, I think, a solo gold
in my
second Kaggle competition. It
was the Rossmann one. And
I remember in that
competition I was just amazed
at how well XGBoost
worked. You know,
I had a lot of servers running
with different cross-validation
schemes. I was doing some
feature engineering, trying to
really make sure my model was
going to generalize. And man,
I think maybe that
was a defining moment, because I
think on the public
leaderboard I was like 35th,
you know, and then
when the private was
revealed, I jumped to fourth.
And I was like, oh! I mean,
I literally jumped out
of my chair when I saw that
happening. My wife was like,
what's going on in there? And
I was like, God, look at that.
So I think
maybe that was one of
my defining moments, and
then after that it was just
like, okay, I'm going crazy now.
Sanyam Bhutani: You
definitely hold a record rate in
medalling. I'll mention the
figures for the audience. I
think it's one of the best on
Kaggle: you had 42% top-10
finishes, 61% gold medals, and
an 85% rate of winning a medal
on entered competitions, not on
any other competition. What's
your secret? Is it your
background? Is there any other
secret that led to this?
Russ Wolfinger: Yeah, and thank
you for doing that. I've never
tried to compute those ratios.
There are a couple duds in
there, so
realistically it might even be
better. But maybe it's somewhat
my personal mindset, where I
tend to be pretty
focused. I only ever do one
competition at a time. And
when I commit to it, I mean,
this is kind of a philosophy my
whole life: usually
when I get
interested in something, I want
to do it really well. And I'm
willing to do everything, you
know, just pour my heart and
soul into it, do background
research, try to remove
other distractions that are
preventing me, and
just focus, dive as deep as I
possibly can on a certain
problem. And I think for
something like Kaggle that
helps, where, you know, a new
competition might come out and
I'm like, nope, I'm not even
gonna look at it.
I'm just gonna stay focused on
what I'm doing. I think that
likely gives me maybe a little
bit of an edge over
the typical Kaggler that just
likes to jump around and try
different things.
I know there's only a
few guys I'm aware of that can
really do multiple competitions
well. I don't think I'm one of
them. I'm just too
single-minded, I feel, to pull
that off. And so I guess that's
probably where it comes
from, to a certain degree. And I
am inspired: there
are certain competitors, I've
noticed several from Japan, say,
and then a few others around the
world, who just seem to have
this mindset where they don't
even submit that often, right?
They maybe do one or two
to get a benchmark, and
then you don't hear from them.
But man, there they are at the
private leaderboard right up
there, you know, because
they don't
get caught up in the leaderboard
drama and, you know, trying to
see their name in lights early
on. They tend not to overfit,
and they tend to do
better then in the end. I feel
like those are pretty
general principles that most
people can work off of.
Sanyam Bhutani: I'm sure
winning competitions takes a lot
of focus. There's a lot of
talent and skill as well, but
like you mentioned, it's a lot
of focus for you, focusing on
one single competition. What
competitions do you usually like
to enter? We'll shortly talk
about your recent gold finish,
but what competitions grab your
attention, and which ones do you
like to participate in?
Russ Wolfinger: Well, you know,
with the statistics background,
naturally the
tabular ones are the most
natural ones to me.
When I think of
data, that's what I tend to
think of: a nice table of
numbers. So those are always a
consideration. And I was
fortunate; I was thinking back
to my graduate school days at NC
State, and I was actually
surrounded by some of the best
applied time series
statisticians in the world at
that time: Sastry
Pantula, Peter Bloomfield. My
PhD advisor, Ron Gallant, was a
really well known
econometrician. We had Dave
Dickey, who's got a famous
statistical test for unit
roots. These guys were just my
professors, you know, but I
realize, looking back, for
example, Professor Pantula's
class, I mean, literally in a
one semester course I had a
book's worth of notes, you know;
that killed us. But he was
an expert in all
the classic ARIMA models and
things like that. And
we understood those things
inside and out. And then Peter
Bloomfield was an expert on the
frequency domain side. So we'd
immediately do a Fourier
transform and learn all the
beauties of working in the
frequency domain. So I feel very
comfortable with time series.
And the amazing
thing to me is how well the
boosted tree methods work on
time series, which intuitively
you wouldn't think would be
true, because trees,
you know, they tend to
interpolate and shrink within
the various cells. But boosted
trees actually do work, though
this does rely on
good feature engineering;
you know,
you've got to put in well-formed
lag features and other nice
things. But probably my
favorite, most comfortable
competition would be a
tabular time series. But man,
I'll tell you, I had done
several image-based ones, and
to me they were
just unbelievably fun and
incredible, that you could take,
you know, these modern deep
learning networks. I had access
to some nice Linux servers here,
so I could just get them
cranking away, and just learning
all the crazy
architectures,
you know, Keras, TensorFlow,
PyTorch, trying to learn those
frameworks.
I'd say my favorite's probably
PyTorch right now. It seems to
strike the right balance
between, you know,
encapsulation, but
also giving you some flexibility
to go in and try to play
with the networks a little bit,
which I think, if you're going
to gain that edge in a
competition, you may want to do.
You can do pretty well
just with canned networks, but
then if you're really going
to nail it, you often have to go
in and think hard about the
problem and then make some
tweaks and adjustments. And
that's so much fun. It's
almost like network
engineering, I call it, where
you can go in. You know, if
you've got a good set of
features and then a tuned
network, that's
a formula for success.
Sanyam Bhutani: definitely,
there's this balance that you
just mentioned about knowing the
tools, knowing how to apply them
and knowing the theory. And I
think right now, beginners get
stuck in the trap of continuing
to learn the theory. How do you
suggest they strike the balance,
because there's this tremendous
amount of knowledge out there,
and it's packaged
in a way that you'll always feel
that there's something else
that you need to learn before
you can become better. For
example,
Russ Wolfinger: I know, it's a
tough problem. And you know,
we all only have 24 hours in
a day and a certain amount of
brain cycles that we're going to
cycle through. So
you've got to kind of see
where you're at at the moment.
You know,
if you're just hell-bent on
performing well in a certain
competition, you may
decide to let some of that
theory just go by the wayside
and say, I'm literally going to
only do things that are going to
help me predict better and climb
the leaderboard. Whereas, like,
for example, for me this past
year, 2019, I didn't compete as
heavily as I had in years
before. And it actually was
a little more enjoyable for me.
I just took a few
competitions, and a couple I
didn't even actually
compete in, but I still followed
along. I was
more in a research mind, where I
had some things I was working on
here at SAS, using that data to
try some things. And for me
personally, I easily get
sidetracked, you know, I
see something in a problem where
it's like, oh, now I want to
implement software that we can
use here at SAS to offer our
customers. And so that's
for me personally, but that's
part of my job. So I'm able to
kind of leverage research
questions like that into work.
But each
competition and problem you're
dealing with is different, and
you have to kind of keep
an internal sense of what
your overall objectives are at
that time in your life, you
know, and your other
responsibilities to your family
and other ways you're balancing
your time. But
like you said, there's so many
interesting theoretical research
questions, and
sometimes maybe
you're like, hey, I'm
not so worried about doing
well in this competition. I just
want to learn this theory. I
think that's perfectly fine. Why
not? You know, if that's where
you are in your life at that
point, do it. And, you know,
chase your curiosity. In the
end, maybe that's a more
satisfying path. And I think it
is good. I've had trouble in a
couple competitions, right? I
think I'd definitely
overdone it, in terms of, my
wife has threatened, like,
a family intervention,
you know: take your
fingers off the keyboard, Russ,
you know, we gotta
go talk about this. And so I've
been trying to strike a little
bit more of a balance more
recently. And I'm,
unfortunately, where my children
are now grown and out of the
house. So I have had a little
extra time to spend on things,
but there's still a
danger of overdoing it. So I do
struggle
with that at times, for sure.
Sanyam Bhutani: But you
mentioned you're allowed to
Kaggle at work. I think that's
no secret, is it?
Russ Wolfinger: I have been able
to, and I've just been so
fortunate here at SAS.
Throughout my career, I've
always been in R&D. So
we take, you know, the R,
we take the R seriously. And a
lot of that research involves,
you know, spending some time
with the problem, trying to look
at what the state of the art is,
and leveraging that back. And
I've been fortunate to have a
series of really great managers.
My boss now is one of the
billionaire co-owners of the
company, and he just gives me
pretty much complete freedom to
pursue things. I still do have
to manage our team, and we've
got day-to-day duties with
our software products. But I
have had some freedom to be able
to go off and do some Kaggling,
at the same time
trying to be responsible with
what I'm learning and leveraging
that back into the company
and sharing knowledge both
internally and externally.
Sanyam Bhutani: You just
mentioned what I like to call
the on-Kaggle versus off-Kaggle
data science. Do you have any
examples of how what you learned
on Kaggle impacted your
off-Kaggle data science work?
Did you have any
takeaways from any competition
that heavily impacted your
professional life outside of
Kaggle?
Russ Wolfinger: Well,
it's easy for me because, I
mean, I'm working at a
professional software company.
So literally all the
state-of-the-art
techniques that I learn on
Kaggle have immediate bearing on
things we're doing here at SAS.
We have
a nice suite of data science
tools, where we've been
reengineering a lot of things
to work in the cloud.
And this is
across the company. In my little
group in JMP, I've got certain
things that I'm really
interested in, for example, with
JMP being a mouse-based
product. And I think there's
actually still a lot of room for
improvement with you know, let's
say you get a brand new tabular
data set from kaggle. Where do
you start? And you know, Anthony
was mentioning we don't want
anyone to start with a blinking
cursor. My view is, maybe you
don't even want to, say, start
in a Python notebook. Maybe
there's something better. JMP is
actually my secret weapon for
Kaggle competitions. I open that
table up in JMP, and within
10 minutes I can get a very
clear sense of what's going on
with the data: some problems,
outliers, ideas already turning
for feature engineering. I may
run a little bit of SAS
code with things that I know. So
to me, you know, just where
does one start when
you have a new problem? I think
there's a lot of room there. And
I'm actually pretty big on
the power of the mouse. If you
think about, you know,
little babies, one of the first
things they do is point. You
know, that paradigm extends,
right? We now have the cursor.
And in fact, my boss John Sall
recently had an article entitled
"What's the Point", where he
draws a fun analogy with Harry
Potter and his wand. You know,
Harry Potter can just point that
wand at something and magical
things start to happen. And
that's part of our philosophy
with JMP and throughout
the rest of SAS: we want to
use the best of coding,
writing well-formed code, but
there's still power, I
think, to be tapped into, and
untapped potential, with the
mouse.
Sanyam Bhutani: Awesome. Now,
coming back to competitions, do
you have any favorite, what I
like to call battle stories from
any competition? And how have
your approach and views on
Kaggle evolved as you competed
over the past few years?
Russ Wolfinger: Yeah, there's a
lot I think. I've really enjoyed
just every competition I've been
in. I guess. A lot of the joy
comes with teaming. I've had
marvelous teammates, and they
tend to be small teams. And you
just get to meet people. You
know, one of my good friends
that I've met through Kaggle is
Dmytro Poplavskiy. He's like a
quiet genius guy.
You see him on the street, you
wouldn't even notice, but this
guy is unbelievably good with, I
think, coding image
competitions. I think
maybe the first one we teamed on
was the fisheries
competition, you know, where we
had the pictures of the fish
from up above. And you
had to look at images of fish
and predict what kind it was,
for tuna and other types of
fish. And we teamed on that; we
just had such a fun time, you
know, exchanging ideas and going
back and forth. And then
occasionally when the deadlines
approaching you're scrambling to
try to get your last model to
fit. And I've had a couple cases
where I've literally made my
final submission just like maybe
one minute before the deadline.
So it generates a lot of
excitement and of course there's
always excitement when the
private leaderboard's
revealed. So that kind of thing,
I think, is just overall
fun. And Dmytro
and I and another good friend,
who I haven't got to meet in
person, but a great guy,
Jonathan, we teamed up on the
Zillow competition, and that was
a lot of fun, where we were just
engineering features and
trying things. And
around here, I've got
everybody asking me how much
their house is worth now, and
I'm like, just go to Zillow;
their algorithms
improved as a result of
the two competitions that
they ran.
Sanyam Bhutani: Okay. You're one
of the best statisticians, if I
may say so. Why do you
continue to Kaggle still today?
Russ Wolfinger: Oh, well, I
don't know if I consider myself
that. There
still are kind of these
two cultures going on. And
I've been trying to kind of
bridge between the two, or at
least travel between them. Being
at a commercial software company
is a little bit of an unusual
career path for a statistician,
but there is still a very
vibrant, active statistical
community with some
very brilliant theoreticians,
and there are some really nice
battles going on now over topics
like p-values, and the classic
battle between Bayesian and
frequentist statistics. There
are whole, broader discussions
around reproducibility in
the science literature; you
know, there are a lot of
controversies around reports
that don't reproduce. And so
some of the heat from that is
focused back in on the
statistical methods that are
used. And I love these I love
and I do a lot of background
reading on some of the
foundations. I haven't
contributed a lot personally,
but I at least try to keep up
with what's going on and try to
come up with sensible
perspectives on that, that we
can work into the software. And
occasionally I'll give a talk or
maybe a higher level paper
trying to share those. But some
of these are really deep, some
even deeper philosophical
questions about, you know, what
is probability?
You know, is it a
subjective degree of belief? Is
it describing
real-world phenomena? There are
several other theories around it
and a lot of these questions are
still kind of open and
interesting to pursue. And I
feel a lot of data scientists
think, like, statisticians are
just concerned with stuff that
doesn't really matter that much.
But I don't really share that. I
think there are a lot of
deeper questions in the
interface between the two. And I
personally tend to think of
statistics and data
science still as somewhat
separate. There's a lot of
intersecting going on now.
But at least I know from the
statistical community that they
don't, they don't view
themselves as data scientists.
To a certain degree, they may
feel threatened or even jealous
sometimes, because I do think
data scientists are
likely a
much larger crowd now. Data
scientists tend to be much
better at selling things. Even
terminology, you know, something
like, you want to learn about
dark knowledge, deep
learning, whereas a statistician
is going to be like, what's the
asymptotic theory of a
Mann-Whitney U statistic? I
think some of that is
personality things,
but in a way, it's all good.
We're just in this
wonderful time now where not
just statistics and computer
science, but domain expertise.
People, you know, people from
the physics community, social
science, we're all kind of
converging on
this collection of techniques.
And sometimes it goes by the
term AI, you know, or big
data learning; a lot of
buzzwords floating around. But
in the end, we're trying to
solve these really
interesting problems. And we've
got all the pieces now: we've
got the computing, the
big data, the theory,
the techniques. So it's
just a wonderful time to be
working through these and
solving interesting
problems.
Sanyam Bhutani: And like you
said, while teaming up, there
are no boundaries on Kaggle:
you could team up with a
physicist, you could team up
with someone from a different
nation, someone even outside of
your workforce, and you get
access to their brain, their
code, which I think is
invaluable. That's an aspect
that Kaggle allows.
Russ Wolfinger: I know, and
I love this
principle, you know:
in almost every Kaggle
competition, you need to build
an ensemble, typically, to get
that little extra edge. So
there's power in diversity, and
I love the analogy with
humans. You know, when you're
building a team, to me,
you're often better off
having a diverse team.
You know, if I were going to
team with someone, I don't want
another PhD
in applied statistics that
kind of had
a similar path to me. I want
someone, you know,
coming at it from a totally
different angle. And that's
where you can often make
progress. And it's the
same principle with ensembling:
you know, if you ensemble
two models that are very
similar, you're likely not going
to get much of a boost. But if
you take, you know, a
boosted tree and a neural
network on tabular data, it's
almost guaranteed that if you
ensemble them, and they're both
pretty reasonable, you're going
to get better results.
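
To make the diversity point concrete, here is a minimal simulation sketch. It is not taken from any of Russ's solutions: it assumes two hypothetical models whose predictions are the truth plus unit-variance noise, and the names `simulate` and `shared_weight` are invented for illustration. Averaging helps a lot when the two models' errors are independent and barely at all when they are nearly identical.

```python
import math
import random


def rmse(preds, truth):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth))


def simulate(shared_weight, n=20000, seed=0):
    """Return (rmse_a, rmse_b, rmse_blend) for two hypothetical noisy models.

    shared_weight is the fraction of each model's error variance that comes
    from a source common to both models (so it is also their error correlation).
    """
    rng = random.Random(seed)
    truth, model_a, model_b = [], [], []
    for _ in range(n):
        y = rng.gauss(0, 1)
        shared = rng.gauss(0, 1)  # error component both models share
        err_a = math.sqrt(shared_weight) * shared + math.sqrt(1 - shared_weight) * rng.gauss(0, 1)
        err_b = math.sqrt(shared_weight) * shared + math.sqrt(1 - shared_weight) * rng.gauss(0, 1)
        truth.append(y)
        model_a.append(y + err_a)
        model_b.append(y + err_b)
    blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]  # simple average ensemble
    return rmse(model_a, truth), rmse(model_b, truth), rmse(blend, truth)


if __name__ == "__main__":
    print("similar models:", simulate(shared_weight=0.95))
    print("diverse models:", simulate(shared_weight=0.0))
```

In this toy setup both models are equally accurate on their own, so any gain in the blend comes purely from how different their mistakes are.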
Sanyam Bhutani: What advice or
tips do you
have for beginners who would
like to team up? And what's your
strategy for teaming up on
Kaggle?
Russ Wolfinger: You do want to
put some thought into it because
unfortunately there
are some, you know, quote
unquote cheaters. And, you
know, Kaggle is pretty strict
about it. If you team up with
someone that's broken the rules,
you're going to get disqualified
along with them. And
fortunately that's never
happened with me. But it
has happened with several of my
friends, so you've got to be
careful. And I don't know, it's,
I mean, it sounds crazy. I guess
everyone does this. But if I got
someone that I don't know, I'll
try to do my own little
background check. You know,
maybe they've got a
Facebook page or LinkedIn page;
check them out, what's their
background, you know, try to get
some sense of where they're
coming from, just to make sure
that you've got a good
teammate. And then there's kind
of also that unwritten rule
where you tend to want to
team with someone that's roughly
in the same place as you on the
leaderboard. That's
not always the best, obviously.
If you can team up with
another Grandmaster, that's
often a really good experience,
and a lot of GMs are
very open to doing that. There
are also very sharp Kagglers who
don't have the points yet,
but they're probably, I
mean, in a way, they're likely
better than anyone that's kind
of a quote unquote Grandmaster.
So those are the folks to look
for, if you can find them. And I
think that there are a lot of
these quiet geniuses out there
like Dmytro. If you
can find one, I'd team up with
them in a heartbeat.
Sanyam Bhutani: Okay, now
talking about competitions. How
do you approach a new problem?
And what pipeline, maybe a
general one, or one
specific to any competition, if
you could tell us, do you use
while working on a
competition?
Russ Wolfinger: Yeah, yep.
Pipeline. To me, well, maybe
I'm naive about this, but to me,
the very obvious first thing is
what kind of data is it? You
know, I tend to definitely
make the initial big distinction
between tabular data, and then
image data versus, you know,
maybe if you want to have a
third category, maybe signal or
audio processing, and then
natural language. I tend to view
those as quite distinct
domains that would have
their own pipelines. And
then in some of the competitions
you've got multiple
types of data, so you have to
kind of have multiple pipelines
feeding in. And again, I'm most
comfortable with tabular data.
My secret trick for tabular
data: just open it up in JMP
and start playing around.
And we
don't really promote JMP as a
data science tool. But man, it's
so much fun and easy,
and it's very
satisfying to quickly get in
there and see what's going on
with the data set. Now with
image data, there's no
substitute for
looking at the data. Just
literally take a look at
some of the images, and maybe
have ways of scrolling through a
lot of them quickly, trying to
see what kind of variety you've
got. You know, maybe compute
some summary statistics on each
image, and then you
can convert the
image or NLP data into
statistics. Now you're back in
the tabular domain, and you can
load those into JMP. And again,
maybe this doesn't come across
too well, but JMP or any kind
of facility like that, where you
can quickly take a look at the
data and do things like
distributions, scatter plots,
maybe run some quick
regressions, try a couple of
quick models, just to see what's
going on, that kind of thing.
That's what I tend to do. And
then the hard part really comes
in where typically, especially
with tabular data, you've got to
be able to do really smart
feature engineering, and
that's hard work.
There
are the AutoML routines, which
sometimes can come up
with some decent stuff. They'll
get you partway, but then
there's no substitute for just
really thinking hard about the
problem. You know, what are
you trying to predict?
What's the
underlying data-generating
process that's going on? And I
will also put some time into
thinking about how the data was
created. And this is where a lot
of leaks come in. You know,
for Kaggle you
have to check for leaks.
They do come up. Kaggle
does a pretty good job, I think,
of eliminating those, but
they still pop up in spite of
everyone's best efforts. And
sometimes that can
even kind of ruin a
competition to a certain degree,
or at least change the dynamics.
And if you don't find
the leak phenomenon or
structure, you're not going to
get anywhere on the leaderboard.
So you've got to have some
tricks for that. There's kind of
this whole bag of tricks that
over time you start to develop
and work on. And so, I don't
know, that's probably about as
clear of a pipeline as I have. I
don't have any real
strict one. It's more
getting in there; there's no
substitute for looking. A lot of
people, I think, want to
quickly just start fitting a
neural network, you know;
they don't even look at the
data yet. And so you can
get sideswiped by
outliers and other weird aspects
of the data if you just try to
jump directly into
modeling.
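
As a rough illustration of that "look before you model" step, the sketch below summarizes a single numeric column and flags potential outliers with the common 1.5 x IQR rule. It is only a stand-in for the interactive exploration Russ describes doing in JMP; the function name `quick_summary` and the sample column are made up for this example.

```python
import statistics


def quick_summary(values):
    """Summarize one numeric column, treating None as a missing value."""
    present = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(present, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # conventional outlier fences
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "median": median,
        "max": max(present),
        "outliers": [v for v in present if v < low or v > high],
    }


if __name__ == "__main__":
    # Hypothetical column with one missing value and one suspicious entry.
    column = [10, 12, 11, 13, 12, None, 11, 500]
    for key, value in quick_summary(column).items():
        print(f"{key}: {value}")
```

A pass like this over every column takes seconds and surfaces the kind of outliers that can otherwise sideswipe a model later.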
Sanyam Bhutani: It's called Data
Science for a good reason.
Russ Wolfinger: That's right, it
is. That's where
the science comes in. You've
kind of got to have the mind of
a scientist, inquisitive. And
then of course, you're
always, in the back of your
mind, coming up with some kind
of scheme for validation,
cross-validation. You do not
want to rely on the Kaggle
public leaderboard as your
only source for assessing models
and techniques.
That's a recipe for
disappointment, for sure. You've
got to come
up with a good internal
cross-validation framework. And
there, you know, the basic place
to start would be k-fold. But
then if it's time series,
typically you want to forecast
into the future. So there you
might just do a single
holdout set into the future;
that's often a good way. Or
maybe several different schemes
is often a good thing, not to
just rely on one
cross-validation scheme. That's
something you can do, just
depending on how big the data is
and how much time you have.
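
The two validation schemes mentioned here can be sketched in a few lines of plain Python. This is an illustrative sketch, not Russ's framework: `kfold_indices` and `time_holdout` are hypothetical helper names, and the time-based split assumes the rows are already sorted from oldest to newest.

```python
import random


def kfold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for shuffled k-fold cross-validation."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    for i in range(k):
        valid = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, valid


def time_holdout(n_rows, valid_fraction=0.2):
    """Train on the past, validate on the most recent rows.

    Assumes row order is chronological (oldest first).
    """
    cut = n_rows - max(1, int(n_rows * valid_fraction))
    return list(range(cut)), list(range(cut, n_rows))


if __name__ == "__main__":
    for train, valid in kfold_indices(10, k=5):
        print("k-fold  train:", sorted(train), "valid:", sorted(valid))
    train, valid = time_holdout(10, valid_fraction=0.2)
    print("holdout train:", train, "valid:", valid)
```

Scoring a model under both schemes and checking that they agree is one cheap way to avoid leaning on a single split, or on the public leaderboard.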
Sanyam Bhutani: I also want to
ask you this naive question,
maybe based on
a recent gold finish in the
NFL Big Data Bowl competition.
What does it take to win a
gold medal in a competition, in
terms of both decision making
and effort, for
anyone who's aspiring to become
a grandmaster someday?
Russ Wolfinger: Yeah, I
think getting that solo gold
medal is probably
one of the most
difficult things with Kaggle. I
happened to get lucky. Like I
said, I think I was early, and
the problem is it's getting
tougher and tougher all the
time, right? Because there are
more sharp people on there. I
think you've
got to be smart
about which competition you
choose, you know, if
your goal is to get a solo gold.
First thing is you've got to
filter those competitions and
make sure it's one where you
feel like you can come up
with an edge somewhere, or at
least be able to
learn during the competition
so you can get there. And
maybe it happens to be a domain
where you have some extra
expertise, external expertise.
That can be very valuable. There
have been certain competitions;
for example, there was one on,
you know, physics and
chemistry recently. If you've
got that kind of background,
that's a natural one to go
for, because
understanding
of the underlying problem can
really be helpful in
engineering features. And then
it often just comes down to
feature and network engineering.
And so you just have to, and I
try to do this sometimes,
literally try to sense if
you're getting stuck in a rut,
and be willing, almost in a
sense, to back completely up and
almost start over or go in a new
direction. Because I think all
of us just tend to, the problem
can be intriguing enough that
maybe you're making slow
progress on a certain path, but
then you inevitably are going to
hit a wall and not be
able to get any further. And if
you can somehow just
kind of reboot, you
know, and almost try to start
from a different place,
the beauty of that is
you haven't lost anything. And
you can often come up with a
second model that you can then
ensemble, and then boom, you
know, you've just leveraged your
way up. So it's almost like, you
know, if you're going to go
solo, you want to become a
little bit schizophrenic, you
know, and make
up your own team within your own
mind.
That's the way I
would tend to approach it.
Sanyam Bhutani: Awesome, but I'd
also say that for anyone, please
don't be misled by just becoming
a grandmaster. Even if you're
close to that, or even if you're
producing good models, I
think that's a great takeaway
for applying it to real data
science.
Russ Wolfinger: That's right.
And you do want to keep it in
balance. I mean, becoming a
Kaggle Grandmaster, it's a
nice honor, but it's
not the be-all and end-all. I
mean, in a way there are
much more important things in
life and problems to pursue. And
so if you can keep it in
balance, I do think if you
stick with it, it will
happen. You know, you've just
kind of got to be steady. Don't
freak out. And there's
definitely going to be a lot of
disappointments along the way.
But just stick to your guns and
learn; every competition is
a chance to learn and get
better. And
I think right now
there's a rich enough variety of
competitions where I do think
it's achievable. I know, I think
it was two years ago,
there was one on the
power lines. Remember that one?
Was it VSB? VSB, yes. Yeah. And
for that particular one,
I was asking myself, can I get a
solo gold still, you know, a
couple of years since I got my
first one? So that was kind of
my goal. And I did; I kind of,
like I said, got a little
bit schizo on myself and
tried some things, and I was
able to pull it off. You know,
that competition was a little
bit weird. There were some
unusual things, but in
that one it was really
important to have a nice
cross-validation strategy,
because there
was definitely some shake-up.
And so I think it's doable, but
you've got to
be, in a sense, disciplined
and be very careful with your
time, you know, because like I
said, we all have a limited
amount of time. So you've got to
use it as wisely as you possibly
can. Definitely.
Sanyam Bhutani: What's your take
on the opinion that's
coming up in public that
Kaggle is not equal to
real-world data science? And do
you think there's any aspect
that someone who's very active
on Kaggle, who's trying to seek
a career in data science, might
miss out on? And how can they
build a surrogate for that?
Russ Wolfinger: Yeah, I think
most of that's a, you know,
bunch of malarkey. Basically,
the only
people I would listen to are
those that have actually got in
there and competed in Kaggle
and done pretty well.
You don't really
hear any
Kaggle grandmasters or masters
making those kinds of claims,
because we know how valuable of
a resource it is. And I think
it's
human nature: if you
decide not to do something that
might be important, you tend to
want to minimize it or put it
into a box. I think
we all do that, right? I
mean, some, there's a lot of
fields that I'm really
interested in, but haven't had
the time to explore. And just
it's a natural tendency to want
to make a some kind of
simplifying assumption or, or
maybe knock it down a little bit
and make it not as important.
Just I think it's again, it's
just human nature. So I tend to
discount any claims like that
and just a simple fact look at
the sponsors of the competition.
These are not just some crazy,
you know, off the wall, we got
big, important companies
financial from all different
domains, sports, you know,
finance, Zillow, places like
that, I mean that they are
making direct business decisions
to invest, you know, serious
time and money in the kaggle in
a kaggle competition, they
wouldn't they wouldn't be doing
that if there wasn't value to
them. And it's just it's a
pretty amazing You know, if you
think about it's a pretty
amazing community that's that's
developed around it. So I don't
put any weight on those those
those naysayers
Sanyam Bhutani: What's the
aspect, do you think, that a
beginner might
miss out on by being active on
Kaggle? And how can they
build a substitute
for it?
Russ Wolfinger: Yeah, and again,
everyone's a little bit
different. I'm not sure I can
give perfectly general advice.
But obviously, you
know, you've got your
core values that you're
looking to develop and you've
got your own life situation. And
maybe you've got a family to
support or you're looking to
build your career. So all these
things in a way have a
bearing on how active you might
want to be with Kaggle. You want
to view it as a really good
resource, but be careful about
not getting, you know, overly
consumed with it. But
especially if you can find
Kaggle competitions that align
well with your personal
objectives, or maybe you're
doing a research problem in
school; like I think in one of
the physics ones, one of the
winners was actually a PhD
student in astrophysics or
whatever. And that's a perfect
setup for a
Kaggle competition for them.
That doesn't happen too often.
But if you can kind of stay
active and kind of keep an eye
on what's going on, you can
usually find ones that are
complementary to your goals, and
try to do it that way, so that
things are lining up and you're
kind of accomplishing multiple
things at once, rather than just
jumping on
every Kaggle competition that
you might happen to see.
Sanyam Bhutani: Awesome. Now, I
want to zoom out and go back to
another topic of software
development. You've been
involved in it for quite
a few years. What was software
development like in the early
90s? And what's been your
favorite development in all of
these years in terms of software
development?
Russ Wolfinger: Oh, yeah, I
mean, man, I probably am
happiest when I'm writing code.
I think we kind of
share a similar spirit: we
just love it. It's just so much
fun to write software, and
especially being able to write
software that's going to help
others. That's
probably one of the most
core, you know,
satisfying drives that one has.
And software is pretty amazing,
right? You can write it, you
can send it out for free and
distribute it around the world.
There's literally
zero cost to sending it. And
we've got techniques and the
infrastructure now for doing
that. In a way, though, like
when I first started here at
SAS, we wrote a lot of our
original routines in C, and I
actually worked in this
nice, kind of tricked-out Emacs
environment where I had my C
code on one side and SAS output
on another, and I'd work on some
code, compile it, run it. And in
a way that basic
paradigm hasn't changed too
much. I've been doing some more
coding recently in C++. And
then, of course, SAS is also a
language. And JMP also has its
own language, JSL. And if you're
going to be on Kaggle,
you pretty much need to know
Python or R. And so I think
software
development now is changing to a
certain degree. Of course, we've
had the explosion of open
source, and, you know,
talk about
competition; the whole
landscape has changed.
I feel like for us as a
commercial software vendor,
we've got to be on our toes now
and aware of these big trends.
And you've got the big players
like Facebook and Google dumping
out super high quality code in
the public domain. And we want
to take advantage of that,
actually, when we can. And the
responsibility is on us to make
sure that our software is still
worth it. It is
commercial software, so we do
charge a licensing fee for it,
but we want to make sure that
that fee is worth it, and we
feel like it is. And you know,
no software is free, even
though, you know, there are,
we'll call them, open source
zealots that claim that all
software should be free for
everybody. And I
think that starts to get to
be ridiculous if you kind of
work through the logical
implications of that. We've been
fortunate to have, you know,
an overall
fairly prosperous worldwide
economy and big companies that
are able to fund nice, huge open
source initiatives. So there's
kind of this interesting cycle
of feedback between commercial
companies and the open source
community. I think it benefits
everybody if we can do it
correctly. And of course, there
will be some disagreements about
exactly what you know what, how
things should be promoted or
whatever. But overall, it's a
pretty incredible time
we're in, and, you know, the
techniques we've got are just
really fantastic.
Sanyam Bhutani: Leland also
mentioned this: people often
miss that open source isn't free
in the sense that someone is
paying the coders to write the
code, or maybe they're earning
from somewhere else that gives
them the leverage to be able to
contribute to open source. So
many people do miss out on that
aspect.
Russ Wolfinger: I know. Yeah,
it's nice. And I love
open source. I use it all the
time. And, you know, one nice
little story with Lee:
this was, I don't know, maybe
10-15 years ago.
We all know, you
know, Venn diagrams, right?
He had
written some nice Java code for
doing proportional-area Venns,
where the size of the circle
corresponds to the size of that
set. And it turns out that
creating a
proportional-area Venn diagram
is a non-trivial problem;
there are certain configurations
where it's impossible, but you
can get a best
approximation. Well, he had
written some Java code to solve
the problem and he had a nice
paper, and he was looking for an
outlet for it. And it turned out
right near the same time, I was
working on a JMP add-in to draw
Venn diagrams. And with Lee
there was a beautiful
collaboration. He's like, hey,
why don't you just take our Java
code, you know, load it in,
and then we can distribute all
this out there for free? And we
did it. You know, to me,
that's a
really nice story, where
he was willing enough to share
that code openly. And then I
shared our add-in, even though
you need JMP to run it,
but usually you can get your
hands on that fairly
readily. So just a nice example
of someone like Lee being just
open-hearted, sharing the code.
And a bunch of
people have used that add-in
and complimented us
on the functionality. And
we acknowledge him, you
know, and all that, in
there. So I just think that
that's an example of
the kind of things that really
are possible and can really make
a positive impact on the
community.
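
For a flavor of the geometry involved, here is a minimal sketch of the simplest case only: two sets, where an exact proportional-area layout always exists. It is not the Java code or the JMP add-in discussed above. It assumes circle areas equal the set sizes, uses the standard circle-circle intersection ("lens") area formula, and finds the center distance by bisection; the function names are invented for this example. With three or more circles an exact layout is not always possible, which is where approximation comes in.

```python
import math


def _clamp(x):
    """Keep acos arguments inside [-1, 1] despite floating-point rounding."""
    return max(-1.0, min(1.0, x))


def lens_area(r1, r2, d):
    """Area of overlap of two circles with radii r1, r2 and center distance d."""
    if d >= r1 + r2:
        return 0.0  # circles do not overlap
    if d <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2  # smaller circle lies inside the larger
    a1 = r1 * r1 * math.acos(_clamp((d * d + r1 * r1 - r2 * r2) / (2 * d * r1)))
    a2 = r2 * r2 * math.acos(_clamp((d * d + r2 * r2 - r1 * r1) / (2 * d * r2)))
    kite = (-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2)
    return a1 + a2 - 0.5 * math.sqrt(max(0.0, kite))


def two_set_layout(size_a, size_b, size_both, tol=1e-9):
    """Return (r1, r2, d) so circle areas and overlap match the set sizes."""
    r1 = math.sqrt(size_a / math.pi)
    r2 = math.sqrt(size_b / math.pi)
    lo, hi = abs(r1 - r2), r1 + r2  # overlap shrinks as d grows from lo to hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lens_area(r1, r2, mid) > size_both:
            lo = mid  # too much overlap: move the circles apart
        else:
            hi = mid
    return r1, r2, (lo + hi) / 2


if __name__ == "__main__":
    r1, r2, d = two_set_layout(size_a=100, size_b=50, size_both=20)
    print(f"r1={r1:.3f} r2={r2:.3f} d={d:.3f} overlap={lens_area(r1, r2, d):.3f}")
```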
Sanyam Bhutani: How do you feel
about the fact that things that
maybe you started during your
PhD days can now be done in a
single line of code? Do
you feel salty about that?
Russ Wolfinger: Well, you can
always write a function to
reduce anything. I don't know,
we have
made progress certainly in the
way we think about data
analysis. And there are
still big problems about how
you approach a problem. What are
the most important aspects? What
are the things that can go away?
And I don't know, you know,
counting lines of code can get a
little bit off balance if that's
your only metric. But
certainly, the real question is,
in a language
like Python, you know, what's
the proper scope for functions?
You know, how much should a good
Python function do for you? How
much should it require you to
set
up around it? That's, that's
difficult. But I do feel like
Python seems to have struck a
nice chord with the community
now and it's been pretty amazing
to see its popularity rise. I
mean, R's also got a really
strong tradition all the way
back to Bell Labs and some
really incredibly intelligent
statisticians and other
related folks who had the
original S language, and it kind
of morphed into R, and it's got
its own paradigm too, with so
many interesting factors. In
SAS, the SAS
language actually was, you know,
you hear about online learning,
you know, with data kind of
streaming in. That's actually
the design of the SAS DATA step
back in the 1970s, which a lot
of people don't even realize.
And I'd kind of taken it for
granted, but you kind of look at
it, like, oh. So there
are things that we've learned
that kind of get
reinvented in other terms. So
it's fun to see that happen. And
it tends to validate: concepts
that get reinvented tend to
be strong ones that we
want to pay attention to. And
even though they may
go under different languages or
have slightly different
variations, they tend to be the
things that advance the field.
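
The row-at-a-time idea behind both streaming or online methods and record-by-record processing can be illustrated with a running mean and variance that never holds the full data set in memory. This is plain Python using Welford's update, not SAS code, and it does not reproduce the DATA step itself; the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Running mean and sample variance, updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far (0.0 with under 2 rows)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for rows streaming in
        stats.update(value)
    print(stats.n, stats.mean, stats.variance)
```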
Sanyam Bhutani: Okay, now,
talking about another field that
you have been involved in over
the years: what has been your
favorite development in stats,
and how have you seen the hype
change over the years in
statistics?
Russ Wolfinger: Yeah, I don't
know, we're in this weird
time now. Like I said,
I think for many
years statisticians
didn't really respect us,
if I could
use that as kind of a strong
word; they kind of just felt
like the data scientists were
doing silly
things. They were building these
big predictive models. Now, I
think there's
mutual respect that has
been developed and increased.
And some of the
more brilliant statisticians,
like the guys
from the Stanford department,
Tibshirani and Hastie, Brad
Efron, I think
they were the original pioneers
of kind of building this bridge
between the two communities. And
even things like boosted trees:
you know, we've got Leo Breiman
and Jerome Friedman, who kind of
came up with the original
theory. But then it took the
University of Washington group
to kind of finally get it all
coded in a nice open source
framework that exploded. So we
get these, we get these things
happening. But it's, I mean, for
me personally, I still feel like
there's some tensions there,
which I'd like to, over time, maybe
try to alleviate. And I
actually, I don't mind kind of
living in between the two worlds
because it's kind of, I feel
like it's a rich, it's a rich
area for research. And there's a
lot of really intriguing
problems that come up. And you
gotta admit, the data science
community is a lot of fun, you
know, we're on some slack
channels and stuff and mixing it
up with some silly, silly
silliness. That's the kind of
thing that makes life, and,
you know, day-to-day life,
really enjoyable. And there's
those kind of things in
the stat world as well. And
as statisticians we tend, we
do tend to be very geeky and
nerdy and stuff too, probably
maybe not quite as much as the,
or, we tend to be more
so than the data science guys.
So geek, I guess geeks are going
to take over the world, you
know, eventually anyways, why
not?
Sanyam Bhutani: Talking about
life more generally, you're
also a Brazilian Jiu-Jitsu expert.
How do you find the time and
balance for all of these amazing
things that you are active
across?
Russ Wolfinger: Oh, oh, yeah,
thanks for bringing that up. I
actually haven't been, I
haven't been training in BJJ
for several years now, and I'm
pretty much out of shape and fat.
But I'm in a, I'm in a
bulking phase right now. But I
really, there's actually, and
I've always been pretty, I've
done different competitive
sports and competed in different
things like BJJ over the years,
and I find there's a lot of
competition that's actually kind
of one of the attractive things
for kaggle you know, it's kind
of a different form, but the BJJ
world is really amazing too.
Some of the best
guys I think I've ever met. Not
to mention, those are the guys
you want with you if you're
walking through city streets
somewhere. But there's a lot of
parallels with, like,
learning. BJJ, it's a pretty
amazing art where, for people
that aren't familiar, it
involves no striking.
And it's also designed for a
weaker or, you know,
smaller opponent to be able to
handle someone bigger and
stronger. And to me, those are
really appealing aspects. I
don't really, I'm not interested
in punching somebody, really, nor
do I, I've got a lot invested right
here, I don't want to get punched.
But so it's a grappling art but
it's a lot. It's very technical.
And it's been described
sometimes even as human chess.
And so I like, there's actually a
whole, and when I was into it, I
would spend a lot of time on, I
know there's a lot of
people putting out videos
and content on different new
moves and systems that
you can use and there's a whole
really competitive framework.
And then of course BJJ is one of
the key components of all the
mixed martial arts fighters of
these days. Now there,
you've got to be able to
punch, obviously. I
actually, I do watch
some MMA fights sometimes, and
contrary, I think, to most
fans, I actually like it when
they go to the ground because
when they
get on the ground, there's a
whole new dynamic of leverage
and techniques that you have to
follow. And if you don't, if you
don't understand those
principles you're going to get
you're going to, you're going to
get messed up pretty quickly.
But I do like and I do still
follow a little bit and I've got
in my basement, I've got a mat,
I'll go down there and do some
drills a little bit myself. And
one of these days, I'll
get back into it. But I love,
there's a really nice BJJ
community here in the Raleigh
area. And some of the best guys
I've ever met. A little bit
rough around the edges, but guys
I'd much rather hang out with
than some other crowds.
Sanyam Bhutani: I think it's
also analogous to being
competitive anywhere. If you're
down on the ground, and you
start just throwing your fists
around, that's not going to help
you as much as keeping your
calm and putting in technique.
And this applies even outside of
just professional fights, I
think.
Russ Wolfinger: And I think
about it like, you know, when
something goes poorly in your
Kaggle competition, you just
tend to want to, you know, you
just tend to flail around and do
stupid things, whereas, like you
said, no. So there's all these really
interesting to me really
interesting parallels amongst
human you know, human
competition. And this applies,
you know, not just when we're
talking about data science
competitions and sports, but
also in the
business world. And even more
importantly, maybe even across
the political world, where we've
got really difficult conflicts
and issues, right? I think, you
know, being level-headed
and learning how to do
data science well can, who
knows. I mean,
I feel like as data scientists,
we may be holding, you know,
some of the strongest keys to
the future and be in a very
strong and important position of
responsibility. And we need to
accept that and realize it, and
make sure that we can develop
good techniques for handling
problems and being
good solvers of them.
Sanyam Bhutani: For sure. What
do you foresee for the future of
data science and statistics,
broadly speaking?
Russ Wolfinger: Who knows. I
mean, it's going to
be at the, it's going to
be kind of at the epicenter of,
you know, of things for the next
several decades, probably. We're
just in this period now, where
we've got data flowing in from
everywhere. Anyone that's going
to be a serious player
has to have some quantitative
skills to be able to interpret
data that's coming their way.
And this is, I mean, pretty much
name any discipline
you can think about, everyone's
got data now. You've got to be
able to handle it, you've got to be
able to think, and I do think
there are some differences in
philosophy between, say,
straight predictive modeling
versus what I consider maybe
more general statistical
thinking, which I tend to view,
the latter, as maybe a little more
broad and more targeted towards
problem solving. So those
kind of skills, to be able to
think statistically, and we've
got some, JMP actually has some
new materials around that, I
think those kind of skills are
going to be super valuable and
important into the future. It's
hard to predict exactly where
things are going to go, you know,
because, you know,
breakthroughs happen in jumps.
Yeah. And we've got amazing
things now like self-driving
cars and all these AI,
reinforcement learning things
that are happening. And I think
the matros said he's been
flying. He's been flying some of
the drones through these
race courses and stuff. So I
mean, think about it, that
alone could impact,
you know, how the next, if
there is a war that breaks out,
that's going to be at the heart
of it. So I hope it doesn't come
to that. We've just enjoyed a
relatively reasonable time of
peace, but there's still these,
I worry, there's tensions
growing across the world, even
within the US where we're
pretty, you know, we're split
politically, and we've got an
election coming up later this
year. So I hope we can be
sensible about it and move
forward.
Sanyam Bhutani: I think there's
a flip side also. Maybe data
science becomes so normal,
similar to websites now. No one
thinks of rockstar website
developers. Websites are there,
you're surfing the web without
even realizing that, hey, maybe
someone is putting a lot of effort
towards how good this looks on
your phone.
Russ Wolfinger: No, I know.
Yeah, there's all these
questions like that. And I think
that's great to observe that and
it. I think every
data scientist needs to, you
know, even though we get so in
love with the problems,
we need to kind of focus
out and realize where we are in
the world, and actually the
positions of responsibility that
we hold even though, you know,
maybe you're just a lowly data
scientist at some company doing
something, it turns out you, you
may be the key to that
company's, you know, doing the
right thing. And actually, you
may have a better understanding
of certain key problems than
anyone else in the whole
company, so, you know, take
advantage of it.
Sanyam Bhutani: Okay. Now, this
has been a great interview. If
you were to give one basic
piece of advice for someone who's
just starting their journey in the
world of data science or
Kaggle, what would that be?
Russ Wolfinger: Yeah, I've
actually been thinking about
this a little bit. I was
thinking, I think maybe the
starting point is a little bit
of time of self reflection. And
I think, I get on my kids about it,
they just get so distracted, you
know, they get in the car and
they just turn up the radio and
then they've got their texts
going and friends, and they
often don't take
the time just to find a
quiet moment, maybe even when
you're drifting off to sleep,
to just think about your own
personal values and your, you
know, purpose, things like that,
and get your life kind of
aligned. And then from there,
kind of see how data science
might apply to that. And again,
looping back to what we
mentioned early on about problem
solving. Now, goodness knows we've
got an immense number of problems
in our world, and we need more
people to solve them. We need more
problem solvers than
we do anyone creating
problems. So take a
problem-solving mindset, find a
problem that's interesting to
you and that you're passionate
about, and go, you know, pull all
the data science tools together
to help solve that problem. And
in a way, that's what I
like about kaggle. It kind of
compresses that into a nice
little three-month or six-month
time period where you can do
exercises. But then the important
thing from there would be to go
and tackle something really big
and important, you know, where
you live. Which reminds me, if
you don't mind, I wanted to ask
you something I was interested in. I
saw on Kaggle, I think they
did some demographic statistics,
and the typical Kaggler was
like a, I think, 26-year-old
male. I guess you're a
little bit younger than that,
maybe. But, uh, I was curious,
and maybe
you would be willing to kind of,
what do you see the future of
data science being like in India,
and what's the scene like
there now?
Sanyam Bhutani: I think
I'd be a bad person to answer
this. I don't look 22 but I'm 22,
and I used to freelance a lot
overseas because we used to lack the
right amount of opportunities,
trust me, but I think
that's definitely changing now.
We, H2O.ai has the headquarters
in Chennai in India. That's the
Indian headquarters and we also
have a lot of companies from
overseas establishing their
offices in India, starting their
data science teams. Of course,
the startup scene is pretty big
in, we call it the Silicon
Valley of India, Bangalore. And
I think that reflects on
Kaggle as well, this shared
enthusiasm for data science, for
tech, on Kaggle. India
has a large population. And that
also means we have a large
number of CS engineers or
software engineers, and all of
them that are trying to learn
about data science, they
definitely go to Kaggle, which I
think again speaks of Kaggle's
greatness, that it's also
empowering a large portion of
our country to get into the
field.
Russ Wolfinger: No, I really, I
really like that. Thank you for
that perspective. I'm fascinated
by how, I just feel like
the 2020s are just
going to be
pretty incredible. So I'm
excited to see how it's gonna
play out.
Sanyam Bhutani: I hope the scene
gets better. We're definitely
not there. I think we're much
behind the States, for example,
compared on the number of
opportunities, but I have
personally tried to contribute a
lot to the community. And I say
this not to brag, but to maybe
inspire other people to start
communities, try to talk to your
friends in the Silicon
Valley, bring their culture back
to India. That's what I'll be
doing next when visiting the
Valley, attending as many meetups,
trying to bring back that
culture, just because I'm
fortunate enough to be able to
afford that. I hope that
changes in the future, and I
think it will.
Russ Wolfinger: Thank you. Yeah,
thank you for putting on this
podcast. I think it is a real
service to the community and
best wishes for future
success with it.
Sanyam Bhutani: Thanks so much.
Thanks so much for saying that, and
thank you so much for joining me
on the podcast.
Russ Wolfinger: Thank you,
Sanyam.
Sanyam Bhutani: Thank you so
much for listening to this
episode. If you enjoyed the
show, please be sure to give it
a review or feel free to shoot
me a message. You can find all of
the social media links in
the description. If you like the
show, please subscribe and tune
in each week to "Chai Time Data
Science".
