>> You're not going to want to
miss this episode of the AI Show.
We have Andreas Muller, or
Andy because now we're BFFs,
talking about Scikit-Learn,
machine learning
and he's got a cool new
project up his sleeve.
You're not going to want to
miss it. Make sure you tune in.
[MUSIC].
>> Hello and welcome to
this special edition of
the AI Show where we've got a
special guest Andreas Muller.
He said I could call him Andy,
so now we're actually BFFs.
How you doing my friend?
>> I'm great. How are you doing?
>> Fantastic. So tell us who
you are and what you do.
>> So I recently joined
Microsoft two months ago
as a Principal Research SDE,
I think it's the official
name, it's a bit long.
I worked a lot on software
for machine learning,
in particular on Scikit-Learn,
so I've been a core developer on
that project for a while now.
I also have some other projects like
dabl and I have a book
on machine learning,
introduction to machine
learning with Python.
So I'm trying to
make it easier for people to use
machine learning in Python mostly.
>> So let's start all the way at
the beginning if you don't mind,
I'd love to get a sense for how
you got into machine learning,
how you started getting into some of
the projects that you were doing,
and what the overall
goal is for people.
Because personally I
love Scikit-Learn,
I feel like we talk about deep
learning a lot in the field,
but I'd love to get a sense for
what Scikit-Learn is
and why it's important.
So let's start with
all of those things.
There are so many, because I have
so many questions for you.
How did you get started?
>> All right. So how
did I get started?
So my background is
actually in pure math.
But then after I graduated
with a master's degree,
I was like, okay, so what
am I going to do next,
and then I saw an article
about some people
in the same town doing robot
soccer and I was like,
oh my god, robots,
they seem really cool.
So I started my PhD in that group.
But I ended up not working on robots,
but actually ended up working
on machine learning,
some deep learning and then
some computer vision, structured
prediction, and so on.
So during that time in my PhD,
I started contributing
to open source,
I started playing around with
Scikit-Learn and then yeah,
became a contributor there.
>> So how does one
become a contributor to
a project like that,
like Scikit-Learn?
>> Back then it was quite a
bit smaller and less popular,
I guess, and it was
a smaller community.
But generally, now as then,
you start becoming a developer by
contributing; it's on GitHub.
Like any other community
open source project,
you can just start
making pull requests.
My first pull requests
were fixing typos in
the documentation or clarifying
things in the documentation.
So even small stuff,
it's really appreciated.
Actually often small stuff
is much easier to get into
a project than huge new things.
So just start
contributing, start small,
and then engage in a conversation
with the project about
where things are going.
>> So for those that are new to
machine learning and are super
excited to get started.
Can you frame for us
what's the difference between
the goals of something
like a Scikit-Learn versus a thing
like a PyTorch or a TensorFlow.
Are they the same thing?
How are they different?
What do you say?
>> I think there's
two main differences.
Scikit-learn may be a little
bit more comparable,
in the deep learning space,
to fast.ai or maybe Keras,
which are more like
batteries included.
So Scikit-Learn really gives
you full algorithms and tools
to train the models, to
validate the models, and so on.
Whereas PyTorch and TensorFlow
are both much more low level,
so they allow you to really
tweak what model you're using.
They give you a lot of flexibility,
but also they require
you to do a lot of work,
whereas Scikit-Learn
really has a fixed,
like a toolbox of models
and you can pick which
model you want from
this toolbox and you have
to do very little work
to use them but there's also like
a limited amount of customization.
Then the other, maybe more
obvious, thing is domain,
which is: TensorFlow, PyTorch,
[inaudible] are
all for deep learning,
whereas Scikit-Learn is what is
now called traditional
machine learning,
which is actually newer
than deep learning.
>> So tell me about
that, tell me about
how traditional machine
learning is newer,
Look, I went to grad school when
SVMs were all the
rage, and I'm like you,
I'm a math wonk and I just
loved how SVMs have like
this perfect mathematical
structure and
optimization and I always
gravitated toward those algorithms.
Tell us about that.
>> That's actually also
what I did during my PhD.
So my advisor actually was like
super gung-ho about neural networks,
that was before ImageNet happened.
I was like, no, this is
not working, and then I
was going much more in the
optimization direction.
But it turns out actually
like neural networks were just on
the cusp of becoming
much more important.
But so a lot of the
technology is from the 90s.
The math is mostly from the 90s.
There have been a couple of
very important innovations like
different learning strategies
and residual networks and so on,
and obviously transformers now,
but a lot of the stuff that
people use in deep learning
is very similar to what Yann
LeCun has in his 1989 paper.
Whereas in 1989, random
forests did not exist at all.
>> No, they didn't.
>> Random forests have, I don't
want to misquote the year,
so I'd rather not,
but they're newer than
convolutional neural
networks by quite a bit.
>> It feels like as we look at what's
actually the cool things
in machine learning,
people tend to gravitate
towards deep learning.
Do you think we've over-indexed on
that side of the house as opposed
to some of the other things in
Scikit-Learn or the
newer machine learning
approaches that are
not deep learning.
Do you think we've over-indexed?
>> It depends a lot
on what your application
is and what your goal is.
So in terms of research,
there's so much really,
really amazing research
going on in deep learning,
whereas on the
traditional side,
the innovations that are relevant for
practice are relatively small.
So people love gradient boosting,
people have loved gradient
boosting for the last 5-10 years.
This is really the thing
that people really go to.
Whereas in deep learning,
you have some major
innovations every year
happening that are
important innovations.
So in terms of research,
I think it makes a lot of sense.
In terms of applications,
it depends a lot on two things.
One is, what's your trade-off
between the work you
put in and the accuracy?
The second thing is,
what is your modality?
Let me maybe talk about
the second thing first.
So if you're working on image data,
video data, sound data,
you'd need to do deep learning,
there's no point in not
doing deep learning.
Did I say text data? Text data,
probably these days, it
depends a little bit,
but mostly, you also
want to do deep learning.
If you have tabular data
or potentially text data,
then actually traditional stuff
often fares a little bit better,
but that might also change.
But here, really the trade off
is the work you have to put
in versus the accuracy.
I worked for a while at
Amazon and even at a company
that we perceive as a
very high-tech company,
there were a lot of places where they
could use machine
learning, but they didn't.
Really, the big improvement
comes from going from
a manual process to a
principle data-driven,
potentially machine learning process,
and so the hard work is going from
zero to that or from manual to that,
and you can do that with
logistic regression.
If it's really important for you
to get the best possible results,
you can then do more and
more complex models,
and maybe deep learning,
even on tabular data, will
give you a bit more performance.
But you'll have to spend
a lot of time on that.
I would argue that in many contexts,
it might be better for you to
move on to the next place where
your company is still
at zero and go from
zero to 90 percent accuracy
instead of going from 90 to 95.
>> I see.
>> There are some applications
where, obviously,
if you're doing ad click
prediction for Google,
every smallest percentage
point is millions of dollars.
So clearly, you want to really
optimize everything out of it.
But I think these applications
are actually quite rare.
>> What I'm hearing
from you, and this is
an interesting take and
one that I agree with,
is you're saying instead of trying to
shave a percentage off of the error,
like a tiny bit, do
something that moves
your business forward in
a significant way without
shaving a little bit.
This is the question I
have for you because look,
I was trained as a
machine learning person.
I went to school, the
University of Utah,
and I'm familiar with the math,
and the models, and such.
But I don't have the horse sense
of a data scientist that goes
out and looks at a problem.
When you're advising people
with machine learning problems,
is there a specific path
that you give them?
You gave us a little bit
of that with like, "Hey,
if it's a table, you should
probably not do deep learning."
What's the path that you follow
for solving machine
learning problems?
Obviously, you're not going
to try everything at once,
but what's your path?
Does that make sense?
>> Yeah. The thing
that I start with in
both my lecture series
and all the workshops I give,
is that you should really start
with thinking about your objective.
What is your overall
business objective?
Then how can you measure this?
Or what is a good measure for this?
Then collecting data to measure it.
I think before you start thinking
about doing machine learning,
you should think about,
let's say I do the perfect
algorithm to solve this problem.
What will the actual business impact
be and how can I measure this?
Then I start from a
very simple baseline,
like if I always give answer one,
or if I use a very simple if-statement,
how far can I get to the maximum
business impact that I can get?
Actually, all of the things
that I just said are really,
really hard problems,
because very often,
the problem you're
currently trying to solve
doesn't necessarily directly
translate to your business objective.
I know several years ago,
Spotify used Scikit-Learn to
make their artist radio.
Spotify as a company
obviously wants to
maximize subscription because
that's how they get money.
But how do you relate that to
what songs you recommend
in the artist radio?
This is a super indirect thing,
and this is the same
in most applications,
and so you have to have
a proxy that says,
well, how long did they listen
to this playlist or something?
But then you need to think about,
how does the proxy that I've found
relate to my overarching goal?
Do I have the data to
measure that proxy?
>> If I'm understanding
you right, and let me
say because a baseline I
think is super important.
You say start with a business
objective, number 1,
find a baseline, like rand.next()
less than 0.5 or something,
if I were to just use a column
in my database to predict.
Then what you do is you find
a measurement that maximizes
whatever your objective is,
and sometimes you can't find that,
so you find a proxy, and then you use
the baseline together
with the proxy to
see if the machine learning
outcomes you're using are
actually helping you for
that. Did I get that right?
>> Yes, I think so.
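The always-predict-one-answer baseline described above can be sketched with scikit-learn's DummyClassifier; the data below is synthetic, purely for illustration:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic data: roughly 70% of the labels are class 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.3).astype(int)

# A baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

# Its accuracy is just the majority-class fraction; a real model
# has to beat this number, not zero, to be adding value.
print(baseline.score(X, y))
```

Comparing every candidate model against this score is the cheapest sanity check that the machine learning is actually helping.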
>> The question that I have
for you following this is,
how has Scikit-Learn evolved
to help with these problems?
>> Scikit-learn comes from a much
more scientific perspective,
and so it really focuses on
the modeling aspect and
not on the workflow
that's all around this.
I think it's very hard
to try to solve many of
these problems with software
because they are not
software problems,
they are business problems.
As programmers, we always
like to have technical tools,
and often, that's not
the right solution.
What Scikit-Learn does is,
let's say I have the data, I
want to use machine learning.
I know what the effect of this is
going to be on my overall process.
Now I want to use logistic regression;
it's three lines in Python.
I want to use random forest.
I change this one line and
then I have random forest.
I change this one line again
and I have gradient boosting.
Then here is the code to
plot all the metrics that
are relevant for this.
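As a hedged sketch of the change-one-line workflow just described (using a built-in toy dataset rather than any specific business data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three lines: instantiate, fit, score.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

# Changing a single line swaps in a different model.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))

# ...and again for gradient boosting.
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Because every estimator shares the same fit/predict/score interface, swapping models really is a one-line edit.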
>> I think that's cool because I
saw at the beginning,
you're like, well,
Scikit-Learn isn't designed to
actually solve the hard ones.
But the way you described
the code is it lets
you experiment quickly against
the baseline to show whether or
not you're actually driving change.
That's the thing that I
like about Scikit-Learn,
specifically how it treats models
and how it treats optimization,
and it treats them in a
uniform way across models.
Maybe that's just my
limited thinking of it.
What do you think about it?
Is there a philosophy with
how you present models
and optimization in
Scikit-Learn that's important?
>> First, I want to do a comment
that's slightly tangential,
which is exactly the
thing you just said,
having nice tooling helps you
iterate on the overall process.
This is my pitch for
my new project dabl,
which is trying to
make it even easier.
It just allows you to focus
really on the big picture;
that was the idea of dabl.
>> The question is, is
there a general philosophy?
Now that you've introduced dabl,
we're not going to skip it,
but I'm going to leave it
for a little bit later.
The question I have is,
is there a general philosophy
with Scikit-Learn because some of
the models that are in
there are radically
different in terms of shape.
An SVM and a decision tree,
they're not the same things at all,
but it feels like they're
treated in a similar way.
Is there an underlying
philosophy with
Scikit-Learn that lets you
do this kind of approach?
>> If you ask me, I would say that
our interface is actually super
inconsistent in some ways,
which is mostly for
historical reasons.
We have a lot of users right now,
so changing anything is super-hard.
The underlying philosophy is:
make things easy for the user,
think about what is the code
the user has to write to use
this and use as few
concepts as possible.
Whenever a programmer
introduces a concept,
he or she has to explain the concept
to the user and that's the
hardest part in programming,
is having the user understand
all the constructs
you had in your head.
We try to have not too many of those.
>> I see. But is there a philosophy?
Look, for me personally,
the thing I've liked about Scikit-Learn
in general is that for me,
a model, because look,
I was a programmer for like a decade
and then I went to grad school,
I think of models as functions
that we don't write,
we just lazily write with data.
The way that you separate all
the models out and you show what
they are and this concept
of fitting data to it,
to me, is super-interesting.
Are those same concepts, the
same across all of the models?
Obviously, there's
differences between them.
>> The API is completely consistent
across all of the models.
I recently had a paper
rejected about how you
can frame these concepts,
which is there's basically
two ways to think about it.
One is you have
the stateful object that
consumes the data and then
changes the internal state.
If I rewrote Scikit-Learn now,
I might use
a different concept,
which is that you have
one function that consumes the data
and creates a stateless object,
and this object
is a prediction object.
Because actually if you
look at applications,
in the deployment setting,
you usually care about shipping
the prediction function,
and so Scikit-Learn marries
the training and the prediction
very closely together,
which is very nice from
a teaching perspective and
from I have everything in
my notebook perspective,
but actually from a deployment
infrastructure perspective,
it's a little bit annoying.
>> That's cool. I wrote a
machine learning library in C#,
when I was a C# programmer,
and I had the concept of
an AI trainer and an AI
model which are interfaces.
Because the reality of the matter is,
is that the true business
value of these things is
putting these models
out into production.
Does Scikit-Learn have
a facility to do that?
I know there's a lot
of pickling going on.
How does that work with Scikit-Learn?
>> Basically, you can use joblib
or pickle; that will always work.
But it has the caveat
that you basically need
to have a container with exactly
the same version of everything,
which is not that nice.
The other version is using ONNX.
ONNX, I'm not sure if
everybody is familiar with it.
It's a serialization format that
Microsoft and Nvidia and a
couple of other companies
are working on,
that basically allows
you to serialize
arbitrary machine learning
models and then run
them, there's the runtimes.
There's Microsoft's Scikit-Learn
converter for ONNX so you can take
the model and convert it to ONNX and
ONNX only encapsulates
the prediction function.
ONNX doesn't know
about training models,
ONNX says given data,
this is how you make a
prediction with this blob.
But this blob is now
completely self-contained,
which a pickle file is not.
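A minimal sketch of the two routes just described. The pickle part uses only scikit-learn itself; the ONNX part is left as a comment, since it needs the separate skl2onnx package and its exact call may vary by version:

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Pickle always works, but the caveat above applies: the environment
# that unpickles must have the same versions of everything.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
print((restored.predict(X) == model.predict(X)).all())

# The ONNX alternative (requires the skl2onnx package), which
# serializes only the prediction function into a self-contained blob:
# from skl2onnx import to_onnx
# onnx_model = to_onnx(model, X[:1].astype("float32"))
```

The restored object predicts identically, but only as long as the library versions line up, which is exactly the gap ONNX is meant to close.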
>> That's interesting; I hadn't
heard that, because I thought of
ONNX as like the PDF for
machine learning inference.
But I didn't know it extended to
models such as those in Scikit-Learn.
Is this a new thing? How
long has this been out?
>> I think it's at least
a year or something
since the Scikit-Learn
converters have been launched,
maybe a bit longer.
I'm not sure how
intensively they're used
right now, but it's probably,
it's the best solution
that is not pickles and
the best solution that
is platform-agnostic
and version-independent.
ONNX was originally made more
for deep learning models,
but they can certainly also now
support random forests
and things like this.
I think it's pretty feature complete
and that it can support most
of Scikit-Learn, I think.
>> That's cool. I
hadn't heard that.
I'm excited to look into it.
Now, before we get into Dabble,
do you have anything to
show us to give people
a sense for if they've
never seen Scikit-Learn,
how it works and how one would
go about building something.
>> Sure. Totally. It's even
better, I can do this.
Then I live code all of the same
stuff in dabl in a single line.
>> Let's do the Scikit-Learn thing
first and then we'll come back,
and then we'll talk about
dabl and then we'll come back
and see how it's
better. Let's do that.
>> What you see here is
a Jupyter Notebook that's running
on Azure Machine Learning,
so you could just run
the same Notebook
locally in your Jupyter
or Jupyter lab.
But here I'm just running it on
Azure Machine Learning on
a remote compute instance.
I'm going to use Scikit-Learn here
to do a simple binary
classification problem.
I start by loading the data,
I'm using pandas to read a CSV file.
This is a classical
machine learning data set,
which is the adult census dataset,
which has census data from
around 1990 and some of the
census properties of people.
The goal is to predict
whether their income
will be less than 50k a year or more.
So Scikit-Learn usually
separates the features or say,
independent variables
from the target into
two different [inaudible]
frames or not [inaudible].
Here, income will be
the target column and data features
will be all the other columns,
which will be the input features.
Then I'm actually going
to do just slightly more
advanced Scikit-Learn
way of doing it,
which is using pipelines
and column transformers.
This allows you to encapsulate
all of your machine learning
workflow in a single Python object,
which you could then pickle
and store or send somewhere.
Here, I start off by splitting my
data into training and test sets,
to build and evaluate the model.
The column transformer
here is something that
allows you to apply
different transformations to
different parts of your data set.
I will apply OneHotEncoder to
the categorical variables and the
StandardScaler to
continuous variables.
The way I'm selecting
this is I say, well,
make a column selector and
select dtype object, and make
a column selector that selects
everything but the dtype objects.
The dtype object columns are going
to be interpreted as
categorical and the rest are going to
be interpreted as continuous.
This is my preprocessing and then I
pipeline this together with
logistic regression models,
so this is a simple binary
classification model here.
This basically declares the model.
Another thing I wanted
to briefly show: this is the text
representation of the model,
and here's a thing that we've
recently built in.
If you set sklearn.set_config
to display a diagram, you get
some nice HTML representation
that shows you the model;
you can click on the
"Parameters" and so on.
>> That's cool.
>> That's kind of new, so
I wanted to show this off.
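The diagram display mentioned above can be sketched like this (the pipeline here is just an illustrative stand-in):

```python
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Opt in to the rich HTML diagram representation of estimators.
set_config(display="diagram")

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# In a Jupyter notebook, evaluating `pipe` in a cell now renders an
# interactive diagram; this is the HTML the notebook displays.
html = pipe._repr_html_()
print(html[:50])
```

In a notebook you would simply leave `pipe` as the last expression of a cell rather than pulling out the HTML by hand.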
Then whenever you have a model,
usually you want to
tune parameters and
so logistic regression is
penalized in Scikit-Learn.
Here I'm defining a
parameter space for
logistic regression to tune
the regularization parameter
and then running grid search.
This is where the actual
fitting will take place.
It will run cross-validation
for each possible value of
the regularization parameter,
as specified here.
Then at the end, it will evaluate
the best model it
found on the test set.
>> I see. You're actually
running multiple models
using the parameterization
for regularization there.
>> Yeah.
>> Okay. What is the hyperparameter
that you're putting in?
>> C, that's the
regularization. Yeah.
>> That's cool. Then it just goes
ahead and goes through all of
those things and finds the best model
with that regularization parameter.
>> Yeah, so it does. Right now,
there are seven parameter values.
For each of them, it has
five-fold cross-validation.
So it trains 35 models for that,
and then it finds the best
parameter using cross-validation,
and then it retrains the model
on the full training dataset.
So in total we trained 36 models,
and then evaluated the last
model on the test set.
>> That's cool. So
all of that and just,
I think it's seven
different cells, basically.
>> Yeah.
>> That's cool.
>> This is a relatively
sophisticated example where I show
off all the different parts and
how you can ideally
put them together.
You can obviously
also just instantiate
logistic regression call
fit and call score,
and then you fit your
Scikit-Learn model on
your pre-processed data or something.
>> I see. That's cool.
So tell me about dabl.
For example, you mentioned
a little bit about dabl.
Because I hadn't heard that before,
but it's something you're working on?
>> Obviously, I love Scikit-Learn.
But the thing is that
Scikit-Learn asks you to be very,
very explicit about things.
For example, for the pre-processing
we had up here, you say,
"Well, I want to scale my data,
I want to encode my data."
If I had missing values,
I would need to specify how to
impute them and then I have to say,
"Specifically these are the
parameters I want to tune,
and this is how I want to tune them."
Scikit-learn is not
opinionated at all.
Scikit-learn requires
you to be really,
really precise in what you're doing,
which is great for some settings,
like in a production
setting, you want to
be precise in what you're doing.
You want to make sure you understand
exactly what your model is doing.
But maybe if you want
to iterate quickly
or maybe there's a
default that often works,
it might be nice to
have something that's a
bit more opinionated that
can give you some results quickly
and allows you to
iterate really quickly.
Dabl basically just
wraps Scikit-Learn
and does most of the cells
in this automatically.
There's a couple of things in dabl,
but if I want to do
what I just did above,
then it would be something like this.
Basically, there are three
or four things in dabl.
The simple classifier
detects what the data types are,
automatically detects
what to scale and
whether there's an index
that it should drop.
You can also give it a data frame
and say this is the target column,
and then it does all the scaling
and everything automatically,
then it does cross-validation
and it runs a bunch
of the simple classifiers
like you can see here.
Dummy classifier,
Gaussian naive Bayes,
multinomial naive Bayes, decision tree,
and logistic regression with different
regularization variants.
>> Now we have basically
the same result that
we had above.
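For reference, the one-liners described above look roughly like this in dabl (the "adult.csv" path and "income" column are stand-ins for the dataset from the demo, and API details may differ by dabl version):

```python
import dabl
import pandas as pd

# "adult.csv" and the "income" column are placeholders for the
# census dataset used in the demo above.
data = pd.read_csv("adult.csv")

# One line: a visual summary of the data against the target
# (target distribution, pair plots, PCA, mosaic plots, ...).
dabl.plot(data, target_col="income")

# One line: type detection, preprocessing, cross-validation, and
# a portfolio of simple baseline models.
clf = dabl.SimpleClassifier().fit(data, target_col="income")
```

This is the opinionated wrapper around the explicit scikit-learn cells shown earlier.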
>> Interesting.
>> Only that I only had
to write a single line.
Well, I used the train-test split
from above. It's two lines.
There's another thing it can do,
which is plotting, which I
probably should have done before.
Well, it kind of works
reasonably well on this data set.
So it gives you a visual summary.
In machine learning, we often have
very high dimensional datasets.
So there's actually no
plotting library that's
really targeted as doing
visualizations for machine learning.
So here it starts by giving you
the distribution of the target;
then, since there are
two continuous variables,
age and capital-gain, it
gives you a pair plot for these,
it gives you a PCA plot, and
it gives you a linear discriminant plot.
Then for categorical variables,
it gives you a mosaic
plot that shows you,
I'm sorry, I don't know
how to deal with this,
how that's popping up.
>> Escape. There you go.
>> Yeah. It says, well,
relationship status was
actually the most important,
and it shows for the categorical
variables in this mosaic plot,
both what the most common
relationship statuses are,
but then also the class
balance in each of them.
>> That's really cool. How
many lines was that?
>> This was one line also, again.
You can do this for
regression as well.
It does some type inference
and then it tries to figure out
what are the most interesting
plots to show you.
Then to give you a very,
very quick idea of what's
happening in your data.
>> This is now really cool.
I'm excited, you were saying
and then there's more stuff,
like there's more and that's cool.
>> Yeah. Like these two things
work reasonably well, I think.
There are two more things that are
more experimental; like, all
of this is experimental,
but these are even more so.
One is that there's
a thing called any classifier,
which does some fancy optimization
over all the different models.
So it tries out gradient boosting,
random forests,
support vector machines, and uses
successive halving and a
fixed portfolio, and does
some things that should give you
the simplest version of AutoML,
and it should work reasonably well.
Then there's another one like it,
and it also, again, does all
the pre-processing for you.
Then the last thing is there's an
explain function that is there
to give you all the metrics and
interpretation of your model.
So it gives you feature
importances and
ROC curves and all of
this kind of stuff.
>> This is all super cool.
I just have a couple
more questions just to
finish up because I know I've
taken a lot of your time.
The first question
I have, and I think
you've pointed it out a little bit,
where do you see machine
learning going from here?
It looks like a lot of those opinions
influenced what went into dabl?
>> Yes, definitely. I was
thinking about what
is the next best tool
that I can build that
will help people?
That was my motivation for dabl.
Dabl is really for
the data scientist,
dabbling with the data,
playing around, iterating quickly.
The other part that I
think it's going to be
important is like more
interpretability,
also potentially looking at fairness
and other ethical concerns,
and having infrastructure for
machine learning so that you can
have an experimental repository
and a unified way to access data.
I think some of the
big companies have,
I noticed, at least
Facebook and Google,
they have centralized
ways to do experiments,
to do machine learning experiments or
other data-driven experiments,
where people can go in and
launch their own trials,
and this really gives
them a big advantage.
But this is not something
that most companies do.
So really doing this right is
probably one of the next things
and there's a bunch of things,
like Azure machine learning has
some tools to solve this problem.
Data access tools
will help with this problem.
There's some Kubeflow and
there's a bunch of other things,
but this is a space
where I think we'll
see a lot of
interesting things happening
for machine learning now.
As I said, going from zero to
something is the hardest part.
I think making this part easier from
the infrastructure side is
going to be an important piece.
>> Well, this has been amazing.
Where can people go to find out
more about Scikit-Learn and dabl?
>> Well, dabl.github.io for dabl.
You can obviously find me on
Twitter as like Andreas Muller ML.
The Scikit-Learn website
has so much documentation,
so definitely check that out.
If you want, you can also find
my lecture series I did at
Columbia on applied machine
learning on YouTube.
That's youtube.com/AndreasMuller
or something,
you can probably find it.
That's a semester long
lecture that goes over
Scikit-Learn and Keras
and some other stuff.
>> Well, this has been amazing.
Thank you so much for spending
some time with us Andreas,
Andy, because now we're BFFs forever.
Again, thank you for watching.
It's been glorious spending
some time with you.
Hopefully you've learned
a lot. I know I have.
Thank you so much for watching
and we'll see you next time.
Take care.
[MUSIC]
