Hello world, it's Siraj!
What hyperparameters should you use to train your models?
You will see these magic numbers a lot.
They are the model values that are set before you train on any dataset.
 
A machine learning model is just a formula with a number of parameters that need to be learned from data.
But there are also parameters that can't be directly learned from the regular training process.
We call these higher-level properties hyperparameters.
This could be the number of trees in a random forest, the number of hidden layers in a neural network, or the learning rate for logistic regression.
Choosing them is a process of trial and error, and it is not very intuitive, since we are not great at interpreting high-dimensional data.
Researchers consider the possibility space of hyperparameters their canvas.
But what if these hyperparameters could learn the optimal values for themselves?
That would make life easier, right?
Let's see if we can figure out a really basic strategy for ourselves and then try to improve on it.
I've got a dataset of tweets labeled as positive or negative, perfect for a binary classification problem.
Let's say I build a support vector machine to learn this mapping so it can then classify a new tweet immediately.
This is called sentiment analysis, and it's a really popular task in natural language processing.
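Here's a minimal sketch of that setup, assuming scikit-learn; the tweets and labels below are made-up placeholders rather than the actual dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical stand-ins for the labeled tweet dataset
tweets = ["I love this!", "This is terrible...", "Best day ever", "Worst movie ever"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Turn each tweet into a numeric vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)

# Fit an SVM with the RBF (radial basis function) kernel
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, labels)

# Classify a new tweet immediately
print(clf.predict(vectorizer.transform(["what a great day"])))
```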
If we map out these vectors in 2D space, we can imagine a curly line that separates the positive tweets from the negative ones.
This is a decision boundary; separate but equal.
A support vector machine can help us define this decision boundary.
Since it is non-linear, our SVM will use what is known as the kernel trick.
That means instead of trying to fit a non-linear model, we can map the data from the input space to a new, higher-dimensional space called the feature space by applying a non-linear transformation using a kernel, or similarity function, and then use a linear model in that feature space.
We define our kernel, or similarity function, between tweet vectors as the radial basis function, which takes two vectors as input and outputs a similarity based on the following function.
So the more similar two tweets are, the higher the output value from our function.
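That similarity function is the standard RBF kernel; here's a small sketch of it in Python (the gamma value is just an illustrative placeholder, not something tuned on this data).

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.1):
    # similarity = exp(-gamma * squared Euclidean distance between the vectors)
    sq_dist = np.sum((np.asarray(x1) - np.asarray(x2)) ** 2)
    # Identical vectors give 1.0; the similarity decays toward 0 as they move apart
    return np.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0, maximally similar
print(rbf_kernel([1.0, 2.0], [4.0, 6.0]))  # much closer to 0
```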
There are two hyperparameters that govern how our line is going to be drawn: C and gamma.
Both of these hyperparameters need to be selected very carefully.
They depend on each other in unknown ways, so we cannot just optimize one parameter at a time and then combine the results.
What if we just tried every single combination of hyperparameters?
Assuming we have built our SVM already, we can choose a set of possible values for both of them and create a variable to store our model's accuracy for each setting.
Then we will create a nested for loop: for every value of C, try every value of gamma.
Inside our loop we will initialize our SVM with the hyperparameters at that iteration, train it, and score it, then compare its score to our best score. If it is better, we will update our values accordingly.
This process will run for every combination of hyperparameter values until it finds the optimal one.
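Here's roughly what that nested loop could look like with scikit-learn; synthetic data stands in for the vectorized tweets, and the candidate values are just examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized tweet dataset
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

C_values = [0.1, 1, 10, 100]          # candidate values for C
gamma_values = [0.001, 0.01, 0.1, 1]  # candidate values for gamma

best_score, best_params = 0.0, {}
for C in C_values:
    for gamma in gamma_values:
        # Initialize the SVM with the hyperparameters at this iteration
        model = SVC(kernel="rbf", C=C, gamma=gamma)
        model.fit(X_train, y_train)          # train it
        score = model.score(X_test, y_test)  # score it
        # If it beats our best score so far, update our values accordingly
        if score > best_score:
            best_score, best_params = score, {"C": C, "gamma": gamma}

print(best_score, best_params)
```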
This technique is called grid search.
We essentially made a grid of our search space and then evaluated each hyperparameter setting at the points we introduced, for as many dimensions as necessary.
This was a pretty easy strategy to implement, but it scales pretty poorly as we add more hyperparameters or dimensions, also known as the curse of dimensionality.
I think we can do better than an exhaustive search.
With grid search we tried every combination of a preset list of values for our hyperparameters.
But what if instead we tried random combinations from a range of values, for a number of iterations that we define?
This won't guarantee that we get the best hyperparameter combination like grid search does, but it will take a lot less time.
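A quick sketch of that random search, reusing the synthetic train/test split from the grid-search sketch above; the ranges and iteration count are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_iterations = 16  # how many random combinations we're willing to try

best_score, best_params = 0.0, {}
for _ in range(n_iterations):
    # Sample C and gamma on a log scale between 10^-3 and 10^2
    C = 10 ** rng.uniform(-3, 2)
    gamma = 10 ** rng.uniform(-3, 2)
    model = SVC(kernel="rbf", C=C, gamma=gamma)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    if score > best_score:
        best_score, best_params = score, {"C": C, "gamma": gamma}

print(best_score, best_params)
```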
So manual search, grid search, and random search are fine and dandy, but there's got to be a more intelligent way of doing this that incorporates learning.
One technique that is very popular right now is called Bayesian optimization.
Last episode we talked about how Bayes' theorem is a way to determine conditional probabilities.
It shows us how to update an existing prediction given new evidence.
This forms the basis of the Bayesian way of thinking, as opposed to the frequentist approach.
These are the two different approaches to probability; basically it's like a mathematical gang war between applied statisticians.
Bayesian means probabilistic. It focuses on the probability of the hypothesis given the data.
That means the data is fixed and the hypothesis is random.
The frequentist approach focuses on the probability of the data given the hypothesis.
So the data is random, as in: if we repeat the study, the data might come out differently. But the hypothesis is fixed.
We can apply frequentist or Bayesian methods to pretty much any learning algorithm; they just have different aims.
In the context of hyperparameter optimization, a Bayesian approach takes advantage of the information our model learns during the optimization process.
The idea is that we pick some prior belief about how our hyperparameters will behave, and then search the parameter space by enforcing and updating that prior belief based on our ongoing measurements.
So the tradeoff between exploration, making sure we have visited all the relevant corners of our space, and exploitation, finding the optimal value within a promising region once we have found one, is handled in a more intelligent way.
Bayesian optimization uses previously evaluated points to compute a posterior expectation of what the loss f looks like.
Then it samples the loss at a new point that maximizes some utility of the expectation of f.
That utility tells us which regions of the domain of f are best to sample from.
This two-step process is repeated until convergence.
For the prior distribution, we assume that f can be described by a Gaussian process.
A Gaussian distribution, often called a normal distribution, is described by a bell-shaped curve.
Distributions are equations that link the outcomes of a statistical experiment with their probability of occurrence.
The Gaussian is quite popular: half of the data falls to the left of the mean and half falls to the right, and this is useful in many situations.
A Gaussian process is a generalization of the Gaussian distribution to functions instead of random variables.
While Gaussian distributions are specified by their mean and variance, Gaussian processes are specified by their mean function and covariance function.
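Here's a tiny sketch of that idea, assuming scikit-learn's GaussianProcessRegressor; the observed points are made-up numbers. Instead of a single mean and variance, the fitted GP gives us a mean and an uncertainty at every point in the input space.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A few observed (hyperparameter value, loss) pairs -- made-up numbers
X_observed = np.array([[0.1], [1.0], [10.0]])
y_observed = np.array([0.40, 0.25, 0.35])

gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
gp.fit(X_observed, y_observed)

# Posterior mean and uncertainty (standard deviation) at a new candidate point
mean, std = gp.predict(np.array([[3.0]]), return_std=True)
print(mean, std)
```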
The way we find the best point to sample f from next is to pick the point that maximizes an acquisition function.
This is a function of the posterior distribution over f that describes the utility of all values of the hyperparameters.
The values that have the highest utility will be the values we compute the loss for next.
We'll use the popular expected improvement function, where x* is the current optimal set of hyperparameters.
Maximizing it gives us the point that improves on f the most.
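For reference, expected improvement is usually written like this (here for minimizing the loss f, with x* the current best hyperparameter setting; this is the standard textbook form, which may differ slightly from the exact expression shown on screen):

```latex
\mathrm{EI}(x) = \mathbb{E}\left[\max\left(0,\; f(x^{*}) - f(x)\right)\right]
```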
So, given the observed values f(x), we update the posterior expectation of f using the GP model.
Then we find the new x that maximizes the acquisition function, the expected improvement.
And finally, we compute the value of f for that new x.
Initially the algorithm will explore the parameter space, but it quickly discovers the region with the best performance and samples points in that region.
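Putting the pieces together, here's a condensed sketch of that loop, assuming scikit-learn and SciPy and reusing the synthetic train/test split from the grid-search sketch; only gamma is searched so the example stays one-dimensional, and the loss is defined as 1 minus accuracy.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.svm import SVC

def loss(log_gamma):
    # The black-box function f: train an SVM and return 1 - accuracy
    model = SVC(kernel="rbf", C=1.0, gamma=10 ** log_gamma)
    model.fit(X_train, y_train)
    return 1.0 - model.score(X_test, y_test)

# Seed the prior with a few randomly chosen points
rng = np.random.default_rng(0)
X_obs = rng.uniform(-3, 2, size=(3, 1))
y_obs = np.array([loss(x[0]) for x in X_obs])

candidates = np.linspace(-3, 2, 200).reshape(-1, 1)

for _ in range(10):
    # 1. Update the posterior expectation of f using the GP model
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)

    # 2. Expected improvement over the current best (lowest) observed loss
    best = y_obs.min()
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu) / sigma
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # 3. Evaluate f at the point that maximizes the acquisition function
    x_next = candidates[np.argmax(ei)]
    y_next = loss(x_next[0])
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, y_next)

print("best gamma:", 10 ** X_obs[np.argmin(y_obs), 0], "lowest loss:", y_obs.min())
```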
To summarize: we can optimize our hyperparameters using several strategies, but Bayesian optimization looks the most promising.
Bayesian optimization picks a prior belief about how the hyperparameters will behave, and then searches the parameter space by enforcing and updating that prior belief based on ongoing measurements.
So Bayesians let their prior beliefs influence their predictions; frequentists don't.
