Hello World, it's Siraj!
In this video, we're going to use genetic programming
to identify if some energy is gamma radiation or not.
I'm getting angry. Gamma rays! Augh!
Nah, I wish.
Data science is a way of thinking about discovery.
A data scientist needs to decide the right question to ask, like
"Who's the best candidate to vote for in the US election?,"
then decide what dataset to use,
like tweet history of candidates
and past endorsements of each candidate,
and lastly decide what machine learning model to use
on the data to discover the right answer.
♫ Life goes on! ♫
With the right data, computing power,
and machine learning model, you can discover
a solution to any problem,
but knowing which model to use can be challenging
for new data scientists. There are so many of them!
That's where genetic programming can help.
Genetic algorithms are inspired by
the Darwinian process of natural selection,
and they're used to generate solutions to optimization
and search problems.
They have three properties:
selection, crossover, and mutation.
You have a population of possible solutions
to a given problem and a fitness function.
Every iteration, we evaluate how fit each solution is
with our fitness function.
Then we select the fittest ones and perform crossover
to create a new population.
We take those children and mutate them
with some random modification
and repeat the process until we get
the fittest or best solution.
So take this problem, for instance.
Let's say you want to take a road trip across a bunch of cities.
What's the shortest possible path you could take
to hit up each city once
and then return back to your home city?
This is popularly called the "traveling salesman problem"
in computer science,
and we can use a genetic algorithm to help us solve it.
Let's look at some high-level Python code.
We have the number of generations set to 5,000
and the population size set to 100.
So we start by initializing our population
using our size parameter.
Each individual in our population represents
a different solution path.
Then, for each generation, we compute the fitness
of each solution and store it
in our population fitness array.
Now we'll perform selection
by only taking the top 10% of the population
which are our shortest road trips
and produce offspring from them
by performing crossover.
Then we randomly mutate those offspring
and repeat the process.
As you can see in the animation,
eventually we will get an optimal solution
using this process, unlike Apple Maps.
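In case it helps to see it, here's a rough, runnable sketch of that loop in Python. The random city coordinates, the helper names, and the mutation rate are my own stand-ins for illustration, not the code from the video:

    import math
    import random

    # Made-up city coordinates for illustration.
    CITIES = [(random.random(), random.random()) for _ in range(20)]

    def tour_length(path):
        # Total distance of the round trip, returning to the home city.
        return sum(math.dist(CITIES[path[i]], CITIES[path[(i + 1) % len(path)]])
                   for i in range(len(path)))

    def crossover(a, b):
        # Ordered crossover: keep a slice of parent a, then fill in the
        # remaining cities in the order they appear in parent b.
        i, j = sorted(random.sample(range(len(a)), 2))
        child = a[i:j]
        return child + [c for c in b if c not in child]

    def mutate(path, rate=0.1):
        # With a small probability, swap two random cities.
        path = path[:]
        if random.random() < rate:
            i, j = random.sample(range(len(path)), 2)
            path[i], path[j] = path[j], path[i]
        return path

    GENERATIONS, POP_SIZE = 5000, 100
    # Each individual in the population is a random ordering of the cities.
    population = [random.sample(range(len(CITIES)), len(CITIES))
                  for _ in range(POP_SIZE)]

    for _ in range(GENERATIONS):
        # Selection: keep the top 10%, i.e. the shortest road trips.
        population.sort(key=tour_length)
        parents = population[: POP_SIZE // 10]
        # Crossover and mutation produce the next generation.
        population = [mutate(crossover(*random.sample(parents, 2)))
                      for _ in range(POP_SIZE)]

    print(min(population, key=tour_length))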
Alright, so how does this all fit into data science?
Well, it turns out that choosing
the right machine learning model
and all the best hyperparameters for that model
is itself an optimization problem.
We're going to use a Python library called TPOT,
built on top of scikit-learn,
that uses genetic programming
to optimize our machine learning pipeline.
So after formatting our data properly,
we need to know what features to input to our model
and how we should construct those features.
Once we have those features,
we'll input them into our model to train on,
and we'll want to tune our hyperparameters,
or tuning knobs, to get the optimal results.
Instead of doing this all ourselves through trial and error,
TPOT automates these steps for us
with genetic programming,
and it will output the optimal code for us when it's done
so we can use it later.
So we're going to create a classifier for gamma radiation
using TPOT after installing our dependencies,
and then analyze the results.
TPOT is built on the popular scikit-learn
machine learning library, so we'll want to make sure
that we have that installed first.
Then we'll install pandas to help us analyze our data
and numpy to perform math calculations.
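Assuming you're using pip, the setup might look something like this:

    # From the command line:
    #   pip install numpy pandas scikit-learn tpot
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier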
Our first step is to load our dataset.
We'll use pandas' read_csv() method
and set the parameter to the name of our saved CSV file.
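As a one-line sketch, with a placeholder filename standing in for wherever you saved the data:

    # Load the telescope data into a pandas DataFrame.
    tele = pd.read_csv('MAGIC_Gamma_Telescope_Data.csv')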
This is data collected from a scientific instrument
called a "Cherenkov telescope"
that measures radiation in the atmosphere
and each row is a bunch of features describing whatever
type of radiation it picks up.
Thanks, Putin!
Since the data is already ordered by class,
we'll shuffle it to get a better result.
The iloc indexer on our 'tele' variable
is pandas' way of selecting rows by integer position.
And we'll generate a sequence of random indices
the size of our data using the permutation function
of numpy's 'random' submodule.
Since all the instances are now randomly rearranged,
we'll just reset all these indices so they are ordered
even though the data is now shuffled,
using the reset_index() method of pandas
with the drop parameter set to "True."
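Put together, the shuffle might look like this, assuming the DataFrame is named 'tele':

    # Reorder the rows by a random permutation of their positions,
    # then reset the index so it counts 0, 1, 2... again.
    tele = tele.iloc[np.random.permutation(len(tele))]
    tele = tele.reset_index(drop=True)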
We'll now let our 'tele' variable know
what our two classes are by mapping both of them
to an integer with the map() method.
So 'g' for "gamma" is set to 0;
'h' for "hadron" is set to 1.
Let's store those 'Class' labels,
which we're going to predict,
in a separate variable called 'tele_class'
and use the 'values' attribute to retrieve it.
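As a sketch, assuming the label column is called 'Class':

    # Map the string labels to integers, then pull them out as an array.
    tele['Class'] = tele['Class'].map({'g': 0, 'h': 1})
    tele_class = tele['Class'].values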
Before we train our model, we need to split our data
into training and validation sets.
We'll use the train_test_split() function from scikit-learn
that we imported to create the indices for both.
We'll pass in the indices of our dataset
and set the 'stratify' parameter to our class labels
so both sets keep the same balance of gamma and hadron events.
Then we'll define what percent of our data
we want to be training and testing
with these last two parameters.
We have a 75/25 split now in our data
and we're ready to train our model.
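Something like this, splitting the row indices rather than the rows themselves:

    # 75/25 split, stratified on the class labels so both sets keep
    # the same balance of gamma and hadron events.
    training_indices, validation_indices = train_test_split(
        tele.index, stratify=tele_class, train_size=0.75, test_size=0.25)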
We'll initialize the 'tpot' variable using the TPOTClassifier class
with the number of generations set to 5.
On a standard laptop with 4 gigs of RAM,
it takes five minutes per generation to run
so this will take about 25 minutes.
This is so TPOT's genetic algorithm knows
how many iterations to run for,
and we'll set 'verbosity' to 2,
which just means "Show a progress bar in the terminal
during the optimization process."
Then we can call our fit() method on our training data
to let it perform optimization using genetic programming.
The first parameter is the training feature set
which we'll retrieve from our 'tele' variable
along the first axis for every training index.
The second parameter is our training class set,
which we'll retrieve from our 'tele' variable like so.
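Roughly, assuming the names from earlier:

    # Let TPOT's genetic algorithm search for a pipeline over 5 generations.
    tpot = TPOTClassifier(generations=5, verbosity=2)
    # Features: every column except 'Class', at the training indices.
    tpot.fit(tele.drop('Class', axis=1).loc[training_indices].values,
             tele.loc[training_indices, 'Class'].values)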
We can compute the testing error for validation
using TPOT's score() method with validation feature set
as the first parameter
and the validation class set as the second.
We'll export the computed Python code
to a pipeline.py file using the export() method
and name the file in the parameter as a string.
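And the last two calls, as a sketch:

    # Score on the held-out validation set, then export the winning
    # pipeline as a standalone Python file.
    tpot.score(tele.drop('Class', axis=1).loc[validation_indices].values,
               tele.loc[validation_indices, 'Class'].values)
    tpot.export('pipeline.py')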
Let's demo this thing.
After training, we'll see that after five generations,
TPOT chose scikit-learn's GradientBoostingClassifier
as the most accurate machine learning model to use.
It also shows the optimal hyperparameters
like the learning rate and number of estimators for us.
♫ Yeah, boyyy! ♫
So, to break it down:
with the right amount of data, computing power,
and machine learning model,
you can discover a solution to any problem.
Genetic algorithms replicate evolution
via selection, crossover, and mutation
to find an optimal solution to a problem,
and TPOT is a Python library that uses genetic programming
to help you find the best model and hyperparameters
for your use case.
The winner of the coding challenge from the last video
is Peter Mitrano.
He added some great Deep Dream samples
to his repository, and even Deep Dream'd my own video.
Badass of the week!
And the runner-up is Kyle Jordaan.
Good job stitching all the Deep Dream'd frames
together with one line of code.
The challenge for this video is to use TPOT
and a climate change dataset that I'll provide
to predict the answer to a question you decide.
This will be great practice in learning to think
like a data scientist.
Post your GitHub link in the comments
and I'll announce the winner next time.
For now, I've got to stay fit to reproduce,
so thanks for watching.
