Ok. Perfect!
It is time to set our goals and introduce
the framework we will use.
Creating a machine learning algorithm ultimately
means building a model that outputs correct
information, given that we’ve provided input
data.
For now, think of this model as a black box.
We feed input, and it delivers an output.
For instance, we may want to create a model
that predicts the weather tomorrow, given
meteorological information for the past few
days. The input we’ll feed to the model
could be metrics, such as temperature, humidity,
and precipitation. The output we will obtain
would be the weather forecast for tomorrow.
Now, before we get comfortable and confident
about the model’s output, we must train
the model. Training is a central concept in
machine learning, as this is the process through
which the model learns how to make sense of
the input data. Once we have trained our model,
we can simply feed it with data and obtain
an output.
Alright.
The basic logic behind training an algorithm
involves four ingredients:
• data
• model
• objective function
• and an optimization algorithm
Let’s explore each of them.
First, we must prepare a certain amount of
data to train with. Usually, this is historical
data, which is readily available.
Second, we need a model. The simplest model
we can train is a linear model. In the weather
forecast example, that would mean to find
some coefficients, multiply each variable
with them, and sum everything to get the output.
As we will see later, though, the linear model
is just the tip of the iceberg. Building on
the linear model, deep learning lets
us create complicated non-linear models. They
usually fit the data much better than a simple
linear relationship.
Awesome.
The third ingredient is the objective function.
So far, we took data, fed it to the model,
and obtained an output. Of course, we want
this output to be as close to reality as possible.
That’s where the objective function comes
in. It estimates how correct the model’s
outputs are, on average. The entire machine
learning framework boils down to optimizing
this function. For example, if our function
is measuring the prediction error of the model,
we would want to minimize this error or, in
other words, minimize the objective function.
Okay.
Our final ingredient is the optimization algorithm.
It consists of the mechanics through which
we vary the parameters of the model to optimize
the objective function. For instance, if our
weather forecast model is:
Weather tomorrow equals: W1 times temperature,
plus W2 times humidity, the optimization algorithm
may go through values like:
1.05 times temperature plus 1.2 times humidity,
or
1.05 times temperature minus 1.2 times humidity,
or
1.04 times temperature minus 1.19 times humidity,
and so on.
W1 and W2 are the parameters that will change.
For each set of parameters, we would calculate
the objective function. Then, we would choose
the model with the highest predictive power.
How do we know which one is the best? Well,
it would be the one with an optimal objective
function, wouldn’t it?
Alright. Great!
Later, we’ll reiterate what we’ve said
here because there will be separate lessons
for the ingredients of an algorithm.
Did you notice we said four ingredients, instead
of saying four steps? This is intentional,
as the machine learning process is iterative.
We feed data into the model and compare the
accuracy through the objective function. Then
we vary the model’s parameters and repeat
the operation. When we reach a point after
which we can no longer optimize, or we don’t
need to, we would stop, since we would have
found a good enough solution to our problem.
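To make the iterative idea concrete, here is a tiny Python sketch. All the numbers are made up for illustration: we score a few candidate weights for a toy weather model with an objective function and keep the best-scoring set, a crude stand-in for a real optimization algorithm.

```python
# Toy illustration of the four ingredients (all numbers are hypothetical):
# data, a linear model, an objective function, and a crude "optimizer"
# that tries candidate weights and keeps the best-scoring set.
data = [((20.0, 0.6), 21.0),   # (temperature, humidity) -> tomorrow's temp
        ((25.0, 0.5), 24.5),
        ((18.0, 0.8), 19.4)]

def objective(w1, w2):
    # Mean squared prediction error: lower means a better model.
    return sum((w1 * temp + w2 * hum - target) ** 2
               for (temp, hum), target in data) / len(data)

candidates = [(1.05, 1.2), (1.05, -1.2), (1.04, -1.19), (1.0, 1.0)]
best = min(candidates, key=lambda w: objective(*w))
```

A real optimization algorithm, like the gradient descent we will cover later, generates these candidates systematically instead of checking a fixed list.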
In this lesson, we will explore the concept
of “training the model” through the data.
Let’s start with an example. You have a
coffee machine that doesn’t know how to
make coffee. In a non-machine learning setting,
we would need to incorporate the instructions
in the machine’s electronics, or at least
that’s how all coffee machines I have seen
work. For instance, the engineers may set
the machine’s electronics to:
First, grind 20 grams of coffee beans. Second,
heat the water to 200 degrees Fahrenheit.
Third, pour 100ml through the ground coffee
into the cup.
Now, in a machine learning setting, we won’t
explicitly provide instructions to the machine.
Instead, we just need to state our goals.
In the coffee machine case, that would be:
produce a cup of coffee. Then we would let
the machine work out the problem on its own.
The machine learning process is a kind of
trial-and-error training. The machine would
try various combinations of grinding, heating,
and pouring. Most would not make sense. For
instance, the machine would try heating the
water and pouring it before grinding the coffee,
resulting in a cup of hot water. Or, it may
grind the coffee and pour the water, without
heating the water. Anyhow, after thousands
of trials and errors, the algorithm would
train itself to reach the set goal every time.
It can make a cup of coffee  It is possible
that it will learn to make the best coffee
you’ve ever tried, much better than the
one obtained by following a set of instructions.
That’s because it would have gone through
many more recipes than a human would ever
be able to.
A reasonable optimization algorithm would
not try all combinations, as there are usually
inexhaustibly many options. In the coffee
example: if the coffee machine learns that
grinding the coffee has to go before pouring
the water, it would not waste time attempting
it in the wrong order. We’ll talk more about
optimization algorithms soon; don’t worry.
This example shows us why machine learning
is so powerful. It allows systems to learn
on their own in situations where humans cannot
define a rigid set of rules for the computer
to follow. Even if we can define a set of
rules, an algorithm can probably provide a
better one.
Before wrapping up, let’s explore another
interesting instance. Self-driving cars. Contrary
to what many people think, self-driving cars
don’t follow rules such as: ‘avoid curbs’.
Essentially, they train on thousands of hours
of footage of real people driving and learn
how to mimic them efficiently. It is not the
set of rules they know but the final goal.
The final goal fundamentally is to drive safely
and efficiently - avoid curbs, don’t bump
into other cars, don’t go over the speed
limit, stop at red lights, and so on.
Okay.
There are three major types of machine learning:
• supervised,
• unsupervised,
• and reinforcement.
Supervised learning refers to the case where
we provide the algorithm with inputs and their
corresponding desired outputs. Based on this
information, it learns how to produce outputs
as close as possible to the ones we are looking for. Both
examples, the one about the weather forecast
and the one about preparing coffee, illustrated
supervised learning. That was intentional,
as it is the focus of this course.
Alright.
The second type of machine learning is unsupervised
learning. In unsupervised learning, we feed
inputs but there are no target outputs. This
means we don’t tell the algorithm exactly
what our goal is. Instead, we ask it to find
some sort of dependence or underlying logic
in the data provided.
For instance, imagine we administer the
website cats&doggos.com. Users have the option
to submit photos of their cats or dogs to
the website. Once a photo has been submitted,
we would like it to be automatically classified
in the subpage cats or the subpage dogs.
In supervised learning, we would train the
algorithm on a dataset of, say, 1000 cat photos
and 1000 dog photos. Each would be labeled
‘cat’ or ‘dog’. The model would then
learn how to interpret an input picture as
either a cat or a dog, minimizing mismatches
over the training set.
Sometimes, however, we may not have the resources
or the need to label the whole dataset. For
the previous example, a person had to manually
label all 2000 pictures. Now, imagine
a dataset of 2 million pictures. If it takes
a person 5 seconds to label a picture, a dataset
of 2 million observations would take around
2800 hours or 345 working days to label all
pictures.
With unsupervised learning, though, we can
train the algorithm without labeling the photos
or in an ‘unsupervised way’. We could
simply ask it to split them into two groups
based on visual similarities. The result would
be two groups that are unlabeled. Once we
have obtained that, we can examine them and
say: “Oh, yes, the first set is dogs and
the second one is cats. Thanks, algorithm.”
Unsupervised learning is especially useful
when our goal is to split a dataset into a
certain number of categories, which we do
not know prior to running the algorithm. That,
by the way, is called clustering.
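As a rough illustration of clustering, here is a tiny pure-Python sketch of 2-means on made-up two-dimensional points (standing in for image features; real photo clustering would of course work on far richer representations):

```python
# A minimal 2-means clustering sketch on hypothetical toy data:
# split points into two unlabeled groups based on similarity.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),   # one visual cluster
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]   # another visual cluster

def dist2(a, b):
    # squared Euclidean distance between two 2-D points
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# Initialize the two centroids with two of the points, then iterate:
# assign each point to its nearest centroid, recompute the centroids.
centroids = [points[0], points[3]]
for _ in range(10):
    groups = [[], []]
    for p in points:
        nearest = 0 if dist2(p, centroids[0]) <= dist2(p, centroids[1]) else 1
        groups[nearest].append(p)
    centroids = [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]

print(groups[0])  # first unlabeled group
print(groups[1])  # second unlabeled group
```

Only after looking at the two resulting groups would we, the humans, attach the labels "dogs" and "cats".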
Okay.
The final type of machine learning is reinforcement
learning. Without digging too deep into it,
with reinforcement learning, we would train
a model to act in an environment based on
the rewards it receives. It is much like training
your pet and rewarding it with treats every
time it achieves a goal - sits, rolls over,
or gives you a paw. In the same way, the
machine learning algorithm could be taught
how to play Super Mario by rewarding it for
progressing with an increase in score.
Alright.
As we already said, supervised learning is
the focus of this course. That is because
it is the simplest and the most commonly used.
To complete the picture, supervised learning
can be divided into two sub-types:
classification and regression. The difference
is very straightforward. Classification supervised
learning models provide outputs, which are
categories, such as cats or dogs.
In regression supervised learning models,
the outputs will be of numerical type. For
instance, predicting the euro-dollar exchange
rate will always give us a continuous number,
like 1.21 or 1.19.
In this course, we will create both classification
and regression algorithms.
Stick around and thanks for watching!
So far, we learned what training a model on
a dataset involves and established that we will focus
on supervised learning. It is time for the
second piece of the puzzle – the model.
Let’s consider a variable x. The function
f of x gives us an output - y.
y is a function of x, but we don’t know
this function. We want to make the algorithm
find it on its own. This is done by providing
it with as many pairs of observations of x
and y as possible and following the methodologies
to come.
Alright.
Let’s start with the simplest model possible
– the linear model. Despite appearing oversimplified,
it is extremely important, as it is the basis
for more complicated models, including non-linear
ones.
In the linear model universe, f of x is x
times w plus b.
X is the input we have. In traditional statistical
jargon, we would call w the coefficient of
x, and b would be the intercept. In machine
learning, though, w is called the weight or
weights when we have more than one parameter.
B is called the bias or biases.
There are many ways to define the linear model.
W times x, x times w, x transposed times w,
or W transposed times x. It doesn’t matter.
We will keep the linear model simple and represent
it in the following way:
X times w plus b.
The goal of the machine learning algorithm
would be to find such values for w and b,
so the output of:
x times w plus b
is as close to the observed values as possible.
Okay. Let’s see an example.
Say our goal is to predict the price of an
apartment. We may do that based on its size.
So, the input x is the size. x times w plus
b is the model we are using. The calculation
of this expression gives us the price, or
the output, y.
Let’s input real values. The size of an
apartment is 743 square feet. A possible model
for predicting its price is: x times 336.1
minus 3237.51. The weight is 336.1, and the
bias is minus 3237.51. If we calculate the
output following this model for a 743 square
feet apartment, we would obtain a price of
$ 246,484.79.
Similarly, given an apartment of a different
size, say 1000 square feet, our model would
predict a price of $ 332,862.49.
Knowing the size of any apartment, we can
get a prediction of its price, based on the
linear model.
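By the way, this single-input model is a one-liner in code. Here is a quick Python sketch using the weight and bias from our example (the function name is ours):

```python
def predict_price(size_sq_ft):
    # price = size * weight + bias, with the values from the example
    return size_sq_ft * 336.1 - 3237.51

print(round(predict_price(743), 2))   # 246484.79
print(round(predict_price(1000), 2))  # 332862.49
```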
Note, this example oversimplifies how things
work in practice. But don’t worry! We will
get there soon enough.
Let’s extend our little real estate valuation
model to multiple inputs.
What if we had additional information. Say,
the apartments we are valuing are located
by the sea. The closer an apartment is to
the beach, the higher its price should be,
right? Well, a better linear model would predict
the price, based on both size and proximity
to the beach.
The linear model tackling this issue is:
the price of an apartment equals its size,
times the weight of size, plus the apartment’s
proximity to the beach, times the weight of
proximity, plus the bias.
If we want to state this expression in linear
algebraic terms, we can use the same framework
as before:
x times w, plus b.
This time, though, both x and w would be vectors,
while both y and b will continue to be scalars.
Okay.
X and W are both vectors. X is 1 by 2, and
W is 2 by 1. Multiplying X and W would give
us a scalar, as 1 by 2, times 2 by 1 is equal
to a 1 by 1 shape.
Let’s continue our housing example with
the two input variables.
The output y is equal to x times w plus b.
X has two elements: a size of 743 and a distance
from the beach of 1.21 miles.
W has 2 elements too, one for each input.
The values are: 403.77 and minus 15,512. This
time the bias is plus 1212.45. To calculate
the new projected price, we must multiply
743 times 403.77, then subtract 15,512 times
1.21, and add the bias, 1212.45.
The result is: $ 282,444.04.
Similarly, if the 1000 square feet apartment
is 2 miles away from the beach, its price
becomes $ 373,958.45.
Notice that the weight of the distance from
the beach is negative. This is because we
expect this variable to have a negative impact
on an apartment’s price. The greater the
distance from the beach, the lower the price.
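Here is the same two-input calculation as a quick NumPy sketch, using the weights and bias from the calculation above:

```python
import numpy as np

x = np.array([743.0, 1.21])       # size, distance from the beach (miles)
w = np.array([403.77, -15512.0])  # one weight per input; note the negative sign
b = 1212.45

price = x @ w + b                 # (1 by 2) times (2 by 1) gives a scalar
print(round(float(price), 2))     # 282444.04

x2 = np.array([1000.0, 2.0])      # the second apartment
print(round(float(x2 @ w + b), 2))  # 373958.45
```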
Okay.
Let’s apply the linear model logic for multiple
output variables. Here’s our new situation.
We may be interested in predicting not only
the price of the apartment when buying it,
but also the price for which we can rent it
out.
Our inputs are unchanged: size and proximity
to the beach. This time, though, we have 2
outputs. Therefore, we can create two linear
models:
The price as a function of the size and proximity
to the beach, and
The rent as a function of the size and proximity
to the beach.
Y1 equals: x1 times w11, plus x2 times w21,
plus b1, and
Y2 equals: x1 times w12, plus x2 times w22,
plus b2.
Notice the indices of the weights. The first
number refers to the respective input, while
the second to the output.
We have 2 outputs, 2 inputs, 4 weights, and
2 biases. The number of weights depends on
the inputs and the outputs. There is a different
weight for each input in each equation.
In general, if we have k inputs and m outputs,
the number of weights would be k times m.
The number of biases is equal to the number
of outputs, m.
Let’s see this
in linear algebraic terms.
Y equals x times w plus b. 2 outputs. 2 inputs.
4 weights. And 2 biases.
Notice how the previous example is actually
part of this one. Here are the respective
output, inputs, weights, and bias.
Finally, I would like to point out this was
only one observation. We could extend this
example to many inputs, outputs, and observations.
The output matrix will be n by m, where n
is the number of observations, and m is the
number of output variables. The input matrix
will be n by k, where k is the number of input
variables. The weights’ matrix remains the
same, as the weights don’t change depending
on the number of observations. The same applies
to the biases.
This last bit is extremely important. It shows
us we can feed as much data into our model as
we want to, and it won’t change, as each
model is determined solely by the weights
and the biases. This property will help us
greatly when creating machine learning algorithms.
We vary only the values of the weights and
the biases, but the logic of the model stays
the same.
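You can verify these shapes with a quick NumPy sketch; the sizes below are arbitrary:

```python
import numpy as np

n, k, m = 1000, 2, 2             # observations, inputs, outputs (arbitrary)
rng = np.random.default_rng(0)   # seeded generator for reproducibility

X = rng.random((n, k))           # input matrix: n by k
W = rng.random((k, m))           # weights: k by m -- independent of n
b = rng.random(m)                # biases: one per output
Y = X @ W + b                    # outputs: n by m

print(Y.shape)  # (1000, 2)
```

Changing n changes only the number of rows in X and Y; W and b keep their shapes, which is exactly the property discussed above.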
Let’s stop for a second and illustrate two
situations in which machine learning and neural
networks come in handy.
You are probably wondering how a linear model
can solve our cats and dogs photo problem.
Well, it depends on the data. Here is a scatter
plot of a dataset consisting of animal photos.
Let the blue dots be the dogs’ photos from
one of our previous examples and the orange
ones – the cats.
A linear model, or a straight line, can solve
this problem fairly easily. Everything below
the line is one category, while everything
above it is the other.
In this situation, we have a classification
problem – we are trying to classify the
photos into dogs and cats. Such a model is
called a linear classifier. It looks useful,
right?
Well, that’s true, because the data represented
in the graph is linearly separable.
But what about this case?
We have the same two categories, but we can’t
separate them with a straight line. This data
is not linearly separable. Therefore, we must
use a non-linear model. Well, we will learn
how to handle such problems in the section
where we will learn about deep neural networks.
Finally, to keep our promise, we will
show you a regression picture.
Here is the graph of our apartment price example.
Each point represents an apartment. A linear
model explains the data well, right? This
is one of the well-known linear relationships.
However, different regression problems may
not necessarily be solved by a linear model.
Look at this graph for instance. Totally non-linear.
Once again. That’s a topic for deep neural
networks.
For now, you have learned the most fundamental
modeling block – the linear model.
It is time for the third building block of
the machine learning algorithm – the objective
function.
The objective function is the measure used
to evaluate how well the model’s outputs
match the desired correct values. In this
lesson, we will elaborate on that.
Objective functions are generally split into
two types: loss functions and reward functions.
Loss functions are also called cost functions.
The lower the loss function, the higher the
level of accuracy of the model. Most often,
we work with loss functions. An intuitive
example is a loss function that measures the
error of prediction. We want to minimize the
error of prediction, thus minimize the loss.
Reward functions, on the other hand, are basically
the opposite of loss functions. The higher
the reward function, the higher the level
of accuracy of the model. Usually, reward
functions are used in reinforcement learning,
where the goal is to maximize a specific result.
Remember the algorithm we mentioned earlier?
The one playing Super Mario? The score obtained
by the algorithm while playing the game is
the reward function. Maximizing the final
score would mean maximizing the reward function.
Alright.
When dealing with supervised learning, we
normally encounter loss functions. Therefore,
in this course, we’ll deal mostly with them.
In our next video, we will explore two of
the most common loss functions.
Earlier, we divided supervised learning into
two types – regression and classification.
We will take the same approach here and consider
two of the most common types of loss functions
– each is used with one of the two types
of supervised learning.
Note that the objective function is a separate
block in our framework from the model. That
is to say that what we are going to discuss
now is generally true for all models, regardless
of their linearity.
Okay.
First, we should define another concept, called
the target, denoted by t. The target is essentially
the desired value at which we are aiming.
Generally, we want our output, y, to be as
close as possible to the target, t.
In the cats and dogs example we’ve been
employing so far, the targets would be the
labels we assign to each photo. So, we are
100% sure these values are correct. They are
the values we aspire to.
The y values are the outputs of our model.
The machine learning algorithm aims to find
a function of x that outputs values as close
to the targets as possible.
Using this new notation, the loss function
evaluates how accurate the outputs are with
respect to the targets.
Alright.
Let’s see the two common functions we talked
about.
First, we will talk about regressions. I’d
like to remind you that the outputs of a regression
are continuous numbers. A commonly used loss
function is the squared loss, also called
L2-norm loss in the machine learning realm.
The method for calculating it is the same as
the least squares method used in statistics. Mathematically,
it looks like this:
The sum of the squared differences between
the output values, y, and the targets, t.
Naturally, the lower this sum is, the lower
the error of prediction, therefore, the lower
the cost function.
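As a quick sketch, the L2-norm loss is a one-liner in Python. The outputs and targets below are made-up numbers:

```python
# L2-norm (squared) loss: sum of squared differences between
# the outputs y and the targets t (hypothetical values).
outputs = [246484.79, 332862.49]   # model predictions
targets = [250000.00, 330000.00]   # observed prices

l2_loss = sum((y - t) ** 2 for y, t in zip(outputs, targets))
print(round(l2_loss, 2))  # 20550550.34
```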
Okay.
We will check out a common loss function for
classification in our next lesson.
What about classification? We discussed that
the output of a regression is a number, but
for classification, things are different.
Since the outputs are categories, like cats
and dogs, we need a better suited strategy.
The most common loss function used for classification
is cross-entropy, and it is defined as:
L of y and t equals: minus the sum of (the
targets times the natural log of the outputs).
Time for an example before I lose your interest.
Let’s consider our cats and dogs problem.
This time, we will have a third category – horse.
Here’s an image labeled as “dog”. The
label is the target. But how does it look
in numerical terms? Well, the target vector
t for this photo would be 0,1,0. The first
zero means it is not a cat, the one shows
it is a dog, and the third zero indicates
it is not a horse.
Okay.
Let’s examine a different image. This time,
it will be labeled “horse”. Its target
vector is 0,0,1.
Imagine the outputs of our model for these
two images are 0.4, 0.4, 0.2 for the first
image and 0.1, 0.2, 0.7 for the second. After
some machine learning transformations, these
vectors show the probabilities for each photo
to be a cat, a dog, or a horse. We will learn
how to create these vectors later in the course.
For now, we just need to know how to interpret
them.
The first vector shows that, according to
our algorithm, there is a 0.4 or a 40% chance
that the first photo is a cat, 40% it is a
dog, and 20% it is a horse. So that’s the
interpretation of these vectors. What about
the cross-entropy of each photo?
The cross-entropy loss for the first image
is:
Minus 0 times natural log of 0.4, minus 1
times natural log of 0.4, minus 0 times natural
log of 0.2. This equals approximately 0.92.
The cross-entropy loss for the second image
is:
Minus 0 times natural log of 0.1, minus 0
times the natural log of 0.2, minus 1 times
the natural log of 0.7, which equals approximately
0.36.
As we already know, the lower the loss function,
or the cross-entropy in this case, the more
accurate the model. So, what’s the meaning
of these two cross-entropies? They show the
second loss is lower; therefore, its prediction
is superior.
This is what we expected. For the first image,
the model was not sure if the photo was of
a dog or a cat. There was an equal 40% probability
for both options.
We can contrast this with the second photo,
the model was 70% sure it was a horse; thus,
the cross-entropy was lower.
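You can verify both cross-entropies with a few lines of Python:

```python
from math import log

def cross_entropy(targets, outputs):
    # L(y, t) = minus the sum of t * ln(y), as defined earlier
    return -sum(t * log(y) for t, y in zip(targets, outputs))

print(round(cross_entropy([0, 1, 0], [0.4, 0.4, 0.2]), 2))  # 0.92
print(round(cross_entropy([0, 0, 1], [0.1, 0.2, 0.7]), 2))  # 0.36
```

Because the targets are all zeroes except for a single one, each result is simply minus the log of the probability assigned to the correct category.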
Okay.
An important note is that, with classification,
our target vectors consist of a bunch of zeroes
and a 1, which indicates the correct category.
Therefore, we could simplify the above formulas
to:
“minus the log of the probability of the
output for the correct answer”.
Here’s an illustration of how our initial
formulas would change.
Alright.
Those were examples of commonly used loss
functions for regression and classification.
Most regression and classification problems
are solved using them. But there are other
loss functions that can help us resolve a
problem. We must emphasize that ANY function
that holds the basic property of being higher
for worse results and lower for better results
can be a loss function. We will often use
this observation when coding. It will all
become clear when we see them in action.
We have reached the last piece of the puzzle
before we can start building our first machine
learning algorithm.
So far, we have learned, at least conceptually,
how to input data into a model and measure
how close to the targets are the outputs we
obtain through the objective function.
However, the actual optimization process happens
when the optimization algorithm varies the
model’s parameters, until the loss function
has been minimized. In the context of the
linear model, this implies varying w and b.
Ok?
The simplest and most fundamental optimization
algorithm is the gradient descent.
I would like to remind you that the gradient
is the multivariate generalization of the
derivative concept.
Let’s first consider a non-machine learning
example to understand the logic behind the
gradient descent.
Here’s a function f of x, equal to 5 times
x squared, plus 3 times x, minus 4.
Our goal is to find the minimum of this function
using the gradient descent methodology.
The first step is to find the first derivative
of the function. In our case, it is 10 times
x plus 3.
The second step would be to choose an arbitrary
number, for example, x naught equals 4. X
naught is the proper way to say x zero.
Then we can calculate a different number,
x 1, following the update rule:
x (i plus 1) equals: x i minus eta times the
first derivative of the function at x i.
X1 is equal to 4 minus eta, times, 10 times
4, plus 3. Or 4 minus eta times 43.
So, what is eta? This is the learning rate.
It is the rate at which the machine learning
algorithm forgets old beliefs for new ones.
We choose the learning rate for each case.
By the end of this lecture, the concept of
eta will be clearer.
Using the update rule, we can find x2, x3
and so on.
After conducting the update operation long
enough, the values will eventually stop updating.
That is the point at which we know we have
reached the minimum of the function. This
is because the first derivative of the function
is zero when we have reached the minimum.
So, the update rule:
x (i plus 1) equals x i minus eta, times the
first derivative at x i
will become:
x (i plus 1) equals x i minus 0, or x (i plus
1) equals x i.
Therefore, the update rule will no longer
update.
Let’s illustrate this with an example.
Let’s take an eta of 0.01.
We start descending. X 1 is equal to 3.57,
x2 is equal to 3.18, and so on. Around the
85th iteration, we see our sequence doesn’t
change anymore. It has converged to minus
0.3. Once the minimum is reached, all subsequent
values are equal to it, since our update rule
has become x (i plus 1) equals x i minus 0.
Graphically, the gradient descent looks like
this. We start from an arbitrary point and
descend to the minimum.
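The whole procedure fits in a short Python sketch of the update rule for f of x equals 5 x squared plus 3 x minus 4:

```python
def f_prime(x):
    return 10 * x + 3            # first derivative of 5x^2 + 3x - 4

def descend(x0, eta, iterations):
    x = x0
    for _ in range(iterations):
        x = x - eta * f_prime(x)   # the update rule
    return x

# With eta = 0.01 the sequence converges to the minimum at -0.3.
print(round(descend(4, 0.01, 200), 4))  # -0.3
```

Try it with an eta of 0.1 and a single iteration, or with an eta of 0.001 and a thousand iterations, to reproduce the behaviors described in this lesson.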
Alright.
The speed of minimization depends on the eta.
Let’s try with an eta of 0.1.
We have converged to the minimum of minus
0.3 after the first iteration!
Now, knowing the minimum is minus 0.3, let’s
see an eta of 0.001.
This step is so small that we need approximately
900 iterations before we reach the desired
value!
We descend to the same extremum but in a much
slower manner.
Finally, I’ll try with an eta of 0.2.
We obtain a sequence of 4 and minus 4.6 until
infinity. No matter how many iterations we
execute, our sequence will never reach minus
0.3. We already know minus 0.3 is the desired
value, but if we didn’t, we would be deceived.
This situation is called oscillation. We bounce
around the minimum value, but we never reach
it. We can use 4 or minus 4.6 in the algorithm,
but this won’t be its true minimum.
Graphically, we are stuck at these two points,
never reaching the minimum.
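Here is a quick sketch of the oscillation. With an eta of 0.2, the update rule x minus 0.2 times (10 x plus 3) simplifies to minus x minus 0.6, which is exactly why the sequence bounces between 4 and minus 4.6:

```python
# Gradient descent on 5x^2 + 3x - 4 with a learning rate that is too high.
x = 4.0
trajectory = []
for _ in range(6):
    x = x - 0.2 * (10 * x + 3)   # the update rule with eta = 0.2
    trajectory.append(round(x, 1))

print(trajectory)  # [-4.6, 4.0, -4.6, 4.0, -4.6, 4.0]
```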
Now that we have seen different learning rates
and their performance, let’s state this
rule.
Generally, we want the learning rate to be
high enough so we can reach the closest minimum
in a reasonable amount of time. So, perhaps, 0.001 was too
small for this function. At the same time,
we want eta to be low enough so we are sure
we reach the minimum and don’t oscillate
around it, like in the case where we chose
an eta of 0.2. In the sections in which we
will study deep learning, we will discuss
a few smarter techniques that would allow
us to choose the right rate.
Alright.
There are several key takeaways from this
lesson.
First, using gradient descent, we can find
the minimum value of a function through a
trial and error method. That’s just how
computers think.
Second, there is an update rule that allows
us to cherry-pick the trials, so we can reach
the minimum faster. Each consequent trial
is better than the previous one with a nice
update rule.
Third, we must think about the learning rate,
which has to be high enough so we don’t
iterate forever and low enough so we don’t
oscillate forever.
Finally, once we have converged, we should
stop updating, or as we will see in the coding
example, we should break the loop. One way
to know we have converged is when the difference
between the terms at places i plus 1 and i
becomes smaller than some tiny value, say 0.001.
Once again, that’s a topic we’ll see in more detail later.
Please download and look at the Excel file
associated with the gradient descent, available
in the course resources section. We encourage
you to play around with the learning rate
or the arbitrarily chosen number x naught
and see what happens. This will give you a
good intuition about the learning rate, which
is central to teaching the algorithm.
In the next lesson, we will generalize this
concept to the n-parameter gradient descent.
If we want to create working models that can
be easily adapted to different problems, we
must understand the drivers of a machine learning
algorithm. That is why we’ve covered several
theoretical steps. And this is where the introductory
part ends.
We will build on the one-dimensional gradient
descent concept and explain the gradient descent
used in machine learning.
In addition, we’ll apply what we’ve learned
about linear models and loss functions. It
will all fall in place. Promise 😊
Let’s consider the linear model we have
discussed so far:
The inputs, x, times the weights, w, plus
the biases, b, are equal to the outputs, y.
Now, each output y i can be represented using
the linear model equation, where the input
is just the corresponding x i. The weights
and the bias remain unchanged.
Using our apartment size-price example, y
i would be the price of a single apartment.
The corresponding x i, would be information
we have about this apartment. In essence,
we are taking a single observation.
Therefore:
the output y i is a scalar and is equal to
the corresponding x i, times w, plus the bias.
Naturally, we are interested in the ith target,
so t i. This will be the target to which we
will compare the output y i.
Alright.
Time to pick the loss function we’ll use.
Usually, we denote the loss function with
L, and in brackets we put the outputs and
the targets, as the loss function depends
on these arguments. L is for loss, but we
can have C for cost, E for error, and so on.
Depending on the framework you are using,
notations could differ, but they carry the
same meaning.
Okay.
Since we’ve discussed only two types of
loss functions - the L2-norm loss and the
cross-entropy, obviously, our choice is limited
to them. We will look into a regression example,
so let’s take the L2-norm loss and adjust
it a bit by dividing it by 2. This is conventional
and we will see why in just a minute. A division
by the constant of 2 does not change the nature
of the loss function, as it is still lower
for better predictions. The machine learning
algorithm will not be affected. We emphasized
this in the objective function lecture. Every
function with the general property of being
lower for higher accuracy can serve as a loss function.
Division by some constant changes nothing.
Make sure you remember what the gradient is,
and let’s start working in the multidimensional
space! 😊
To perform the gradient descent, we need old
beliefs, which will be updated on each step,
remember?
Well, the update rule:
x i plus one equals x i minus eta times the
first derivative at x i
becomes
w i + 1 equals w i minus eta times the gradient
of the loss function with respect to w i for
the weights, and
b i + 1 equals b i minus eta times the gradient
of the loss function with respect to b i for
the biases.
It is basically the same but for a matrix
w and a vector b, instead of a number x.
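As a quick illustration, here is the one-dimensional version of this update rule in plain Python. The function being minimized, f of x equals x minus 3 squared, and the learning rate of 0.1 are made-up choices for the sketch:

```python
# A minimal sketch of the update rule x_{i+1} = x_i - eta * f'(x_i),
# applied to the made-up function f(x) = (x - 3) ** 2,
# whose first derivative is f'(x) = 2 * (x - 3).
eta = 0.1   # the learning rate
x = 0.0     # the "old belief" we start from

for _ in range(100):
    gradient = 2 * (x - 3)  # first derivative of f at the current x
    x = x - eta * gradient  # the update rule

print(x)  # converges towards 3, the minimizer of f
```

For the weights and biases, the same logic applies, just with a matrix and a vector in place of the single number x.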
Okay.
We want to minimize the loss function by varying
the weights and the biases. This means we
are trying to optimize the loss function with
respect to w and b.
Mathematically, it looks like this.
The gradient with respect to w of the loss
function is equal to the sum of the gradient
of:
one half, times (y i minus t i) squared, with
respect to w. From the linear model, y i is
equal to x i, times w, plus the bias, where
w and x are matrices, and this is why we’ve
applied bold formatting. So, let’s plug
that in the formula.
Running the operations, we obtain:
the sum of x i, times y i minus t i. Please
take a more detailed look in the course notes.
It is useful to combine y i minus t i into
a new variable, called delta. Delta is often
used to measure differences. This notion will
come in handy when we start coding in Python
and when we start dealing with deeper neural
networks.
The final output becomes the sum over i of
x i times delta i. So, we calculate
that expression for each observation and then
sum them all.
Okay.
Analogously, the gradient of the loss function
with respect to the bias is the sum of delta
i.
Notice that the one half we introduced cancelled
out the two we obtained when differentiating
the square. That’s why we included it - to
get a neater result.
Finally, let’s go back to our update rule.
We said the generalized rule is
w i + 1 equals w i minus eta times the gradient
of the loss function with respect to w i.
Replacing the gradient with what we found
here, we obtain:
w i + 1 equals w i minus eta times the sum
of x i times delta i.
Analogously, the update rule for the biases
is:
b i + 1 equals b i minus eta times the sum
of delta i.
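As a sketch, one such update step could look like this in NumPy. The tiny dataset, the learning rate, and the variable names are made up for illustration:

```python
import numpy as np

# One gradient descent step for the linear model y = xw + b, using the
# gradients derived above: sum of x_i * delta_i for the weights,
# and sum of delta_i for the bias.
x = np.array([[1.0], [2.0], [3.0]])  # 3 observations, 1 variable
t = np.array([[2.0], [4.0], [6.0]])  # targets following the "true" rule t = 2x
w = np.array([[0.0]])                # initial weight
b = np.array([[0.0]])                # initial bias
eta = 0.01                           # learning rate

y = np.dot(x, w) + b                 # outputs of the linear model
deltas = y - t                       # delta_i = y_i - t_i
w = w - eta * np.dot(x.T, deltas)    # w_{i+1} = w_i - eta * sum(x_i * delta_i)
b = b - eta * np.sum(deltas)         # b_{i+1} = b_i - eta * sum(delta_i)
```

Repeating this step many times nudges w towards 2 and b towards 0, the values behind the made-up targets.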
Alright.
This was the generalized gradient descent
of a linear model. We can use it to minimize
the cost function and train our model to enable
it to produce valuable insights from our data.
This is all we promised you at the beginning
of this section. And we’ve delivered. Maybe
a bit more, but nothing less.
Ok. Great!
We are in good shape to create our first machine
learning algorithm in Python!
See you in our next lesson!
Ok. Great!
It is time to create our first machine learning
algorithm!
We will build a model, and we will feed it
with inputs. What we expect from the algorithm
is to learn the underlying relationship of
the data.
Here’s how we are going to approach the problem.
First, we will import the relevant Python
libraries for the problem at hand.
Second, we will generate random data to train
with. We have decided to make this example
with random data, as rerunning the code will
always yield different inputs. However, the
model will remain the same. In this way, you
can see the same methodology applied for a
potentially infinite number of datasets. As
this simple example is the basis for the more
sophisticated machine learning algorithms,
it is more important to see how it works and
prove that it works than to acquire deep insights.
We will leave the wow effect for a bit later,
when we deal with deep learning.
Third, we will create the targets. These are
the correct values. In our apartment example,
they would be the actual prices properties
have. Essentially, we will use “fake”
targets to be sure there is a linear relationship.
In this way, when we train the algorithm,
we will be certain about the dependence it
has to learn. That’s a good way to prove
the optimization is actually working – we
know where we want to get by creating fake
targets, which we would like the algorithm
to figure out on its own. If it does that, then
we can be certain it works.
Fourth, we will plot the training data, so
you can visually see it.
This is the preparation phase of the lesson.
In the second part, we will define the variables
we need. We must create weights, biases, and
set a learning rate.
At the end, we will conclude with an actual
regression.
Along the way, each line of code will be explained
to make sure everything is understood.
Let’s begin!
The relevant Python libraries are NumPy and
matplotlib dot pyplot.
Generally, NumPy contains all the mathematical
operations you will need. Moreover, it is
extremely fast. For these two reasons, it
is heavily used in data science. Let’s import
NumPy as np, which is the conventional approach.
Matplotlib is a library used for plotting
data. Its module, pyplot, provides a nice
interface and requires very few arguments,
which makes it convenient for quick plots.
Let’s import matplotlib.pyplot as plt. Again,
that’s the convention.
Finally, I will also import the Axes3D module
from mpl toolkits dot mplot3d, as it provides
us with the ability to create 3D graphs. This
is a tool we’ll use to visualize the operations
we’re carrying out.
Pyplot and Axes3D are not essential for the
machine learning algorithm; they will just
give us good-looking plots of our data and
an intuition of what’s going on. NumPy on
its own is sufficient for building a nice
algorithm.
Let’s start this lesson by generating the
data we will train on. This step is not part
of the machine learning algorithm. Essentially,
we will create fake data with a linear relationship.
Once again, we will do that to prove the methodology
is working.
Feel free to move on to the next lecture if
you’d like. In real life, we would normally
load the data from some external source. But
that’s a topic we’ll deal with starting
from our next section.
Alright. If you are still here, we can generate
the data.
First, I will declare a variable called ‘observations’.
This variable will contain the size of the
dataset we want to generate. Let’s work
with 1,000 observations. You can do the same
with 100,000 or 10 million numbers if you’d
like. This choice will affect the speed of
the algorithm. When we are done with this
example, you can play around with the number
of observations. Try 1,000 and then 100,000,
and even 10 million. Be careful, as 10 million
observations may cause computers to freeze
😊
Alright.
We are about to create a two-variable linear
model, and the two variables will be called
x and z.
Let’s work on the inputs.
We will use the NumPy method random.uniform.
It draws a random value from a specified interval,
where each number has an equal chance of being
selected. This method takes three arguments:
the lowest point of the interval, the highest
point of the interval, and the size, which
is actually the shape of the array in which
we want to generate the data. Let’s choose
an interval from minus 10 to 10. It doesn’t
matter. Once again, you can play around with
these numbers for homework.
According to our theory, the appropriate size
is the number of observations by the number
of variables. So, observations by 1, as we
are only talking about one variable, x.
We will generate the Zs in the same manner.
So, these are the two input variables we will
feed to the algorithm. Let’s combine them
in one matrix, called inputs. As theory suggested,
the inputs matrix will be of shape:
The number of observations by the number of
variables or 1000 by 2.
The appropriate method to use is np dot
column_stack. It literally stacks the two
columns side by side into a matrix with
dimensions 1000 by 2.
Dimensionality is very important in linear
algebra. We can only multiply matrices that
are compatible. We want to make sure we can
multiply x times w, since our linear model
relies on that. In these lectures, I will
print the shape of each variable, using the
shape attribute, to make sure we are working properly.
The appropriate Python syntax is the object
in question dot shape.
As expected, inputs is a matrix of size 1000
by 2, so we have worked properly.
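For reference, the input generation described so far might be sketched like this. The variable names follow the narration, but treat it as an illustration rather than the exact lesson code:

```python
import numpy as np

# Generate two input variables, x and z, each drawn uniformly from -10 to 10,
# with shape: number of observations by 1.
observations = 1000

xs = np.random.uniform(low=-10, high=10, size=(observations, 1))
zs = np.random.uniform(-10, 10, (observations, 1))

# Stack the two column vectors side by side into a 1000 x 2 inputs matrix.
inputs = np.column_stack((xs, zs))

print(inputs.shape)  # (1000, 2)
```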
Okay.
In supervised learning, we must provide two
major components - inputs and targets. We’ll also
have to generate targets. The weights and
the biases are varied through the algorithm,
and the outputs are the result from the model
employed. We’ll leave them to the computer.
Okay.
Since we have only seen the linear model,
we will do that in a linear model friendly
way. So, say our targets will be defined by
a function f of x and z which is equal to:
2 times x, minus 3 times z, plus 5, plus some
noise.
Conceptually, the algorithm must learn this
is the function. The weights are 2 and minus
3, and the bias is 5. That’s the correct
result. If we don’t get that at the end,
we haven’t worked properly.
You may be wondering about the noise. It is
introduced to randomize our data a bit. Real
data always contains noise – it’s never
perfect. Introducing some small noise will
make the data a bit random, yet the underlying
linear relationship will be retained. Let’s
declare the noise variable, using the random.uniform
method, once again. I’ll constrain it from
minus 1 to plus 1. And its size will be
observations by 1, matching the shape of the targets.
Okay.
The target values will be given according
to the function we wanted. The proper line
of code is:
targets equals: 2 times the Xs, minus, 3 times
the Zs, plus 5, plus the noise.
The targets are a linear combination of two
vectors 1000 by 1, a scalar, and noise 1000
by 1. Their shape should be 1000 by 1. Let’s
check that.
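As a sketch, the target generation could look like this in NumPy; the inputs are regenerated here so the snippet stands on its own:

```python
import numpy as np

# Targets follow the made-up linear rule t = 2x - 3z + 5, plus small
# uniform noise in (-1, 1) shaped like the targets themselves.
observations = 1000
xs = np.random.uniform(-10, 10, (observations, 1))
zs = np.random.uniform(-10, 10, (observations, 1))

noise = np.random.uniform(-1, 1, (observations, 1))
targets = 2 * xs - 3 * zs + 5 + noise

print(targets.shape)  # (1000, 1)
```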
Cool.
We have our inputs and the desired targets.
It is time to create the algorithm.
For those of you who skipped the previous
lesson, I’ll make a quick recap.
We have our input data, which is in a 1000
by 2 matrix. In plain, non-linear-algebra
terms, this is a two-variable problem with
1000 observations. We also have our targets.
I will quickly plot the data, so we see there
is a strong linear relationship.
Here’s the 3D plot.
You don’t need to do that. It’s just that
simple linear problems are quite visual, so
we can afford to plot them. If you download
the notebook file, you will see the same code
with comments. Feel free to inspect it in
more detail.
Alright.
Let’s reexamine the linear model.
y is equal to x times w plus b.
Our algorithm will try to find such values
for w and b so the outputs, y, are closest
to the targets.
Remember, when we performed the gradient descent,
we started from an arbitrary number and then
proceeded? Well, we must do the same thing
now. However, this is tricky. Conventionally,
we don’t start from an arbitrary number
we choose. Rather, we randomly select some
small initial weights. We will talk about
that in more detail later.
For now, let’s declare a variable, called
init_range, and set it to 0.1. That will be
the radius of the range we will use to initialize
the weights and the biases. Our initial weights
and biases will be picked randomly from the
interval minus 0.1 to 0.1.
We will generate them as we did so far - by
using the random.uniform method. The size of
the weights matrix is 2 by 1, as we have two
variables, so there are two weights, one for
each input variable, and a single output.
Let’s declare the bias analogically. The
appropriate shape is 1 by 1.
So, the bias is a scalar.
In machine learning, there are as many biases
as there are outputs. Each bias refers to
an output. If you recall the example we saw
earlier about apartment prices and apartment
rent, it involved two biases, as there were
two outputs.
I’ll print the weights and the biases, so
you can see what they look like. They are small
and close to zero. These are the weights,
and this is the bias.
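A sketch of this initialization, with the same names and the same 0.1 range as in the narration:

```python
import numpy as np

# Small random initial weights and bias, drawn from (-0.1, 0.1).
init_range = 0.1

weights = np.random.uniform(-init_range, init_range, size=(2, 1))  # two inputs, one output
biases = np.random.uniform(-init_range, init_range, size=(1, 1))   # one bias per output

print(weights.shape, biases.shape)  # (2, 1) (1, 1)
```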
Alright.
Finally, we must assign a learning rate, which
we denoted with eta earlier. I’ll simply
select a value of 0.02. I found this learning
rate useful for this demonstration. For homework,
you will have to play around with it, so you
can see how different learning rates affect
the speed of optimization.
So, we are all set.
We have inputs, targets, and arbitrary numbers
for weights and biases. What is left is to
vary the weights and biases so our outputs
are closest to the targets. As we know by
now, the problem boils down to minimizing
the loss function with respect to the weights
and the biases.
And because this is a regression, we’ll
use one half the L2-norm loss function.
Ok. Great! Next:
Let’s make our model learn!
Hi, again!
It is time to train our model. Sounds awesome,
doesn’t it? 😊
Since this is an iterative problem, we must
create a loop, which will apply our update
rule and calculate the loss function.
I’ll use a for loop with 100 iterations
to complete this task.
Let’s see the game plan we’ll follow.
At each iteration, we will calculate the outputs
and compare them to the targets through the
loss function. We will print the loss for
each iteration, so we know how the algorithm
is doing. Finally, we will adjust the weights
and biases to get a better fit of the data.
At the next iteration, these updated weights
and biases will provide different outputs.
Then the procedure will be repeated.
Alright.
Time to create the outputs. They are given,
following the well-known linear model equation.
The outputs are equal to the inputs times
the weights plus the biases.
Multiplying matrices requires the dot method.
I’ll use the NumPy one. So, np dot of inputs
and weights.
Now, the dot product of the inputs times the
weights is 1000 by 2, times 2 by 1, so a 1000
by 1 matrix. When we add the bias, which is
a scalar, Python adds it element-wise. This
means it is added to each element of the output
matrix.
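A tiny check of this element-wise addition, with made-up numbers, so you can see the broadcasting at work:

```python
import numpy as np

# Two observations of two variables, the weights 2 and -3, and a bias of 5.
inputs = np.array([[1.0, 2.0],
                   [3.0, 4.0]])      # 2 x 2 inputs
weights = np.array([[2.0], [-3.0]])  # 2 x 1 weights
biases = np.array([[5.0]])           # 1 x 1 bias, added to every element

outputs = np.dot(inputs, weights) + biases  # 2 x 1; the bias 5 is added to each row
print(outputs)
```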
Okay.
For simplicity, let’s declare a variable,
called deltas, which will record the difference
between the outputs and the targets. We already
introduced such a variable in the gradient descent
lecture. Deltas equals: outputs minus targets.
That’s useful, as it is a part of the update
rule.
Then we must calculate the loss.
We said we will use half the L2-norm loss.
Pythonically speaking, deltas is a 1000 by
1 array. We are interested in the sum of its
terms, squared, following the formula for
the L2-norm loss. There is a NumPy method,
called sum, which will allow us to sum all
the values in the array. The L2 norm requires
these values to be squared, so the code looks
like this: np dot sum of deltas squared.
We then divide the whole expression by 2 to
get the elegant update rules from the gradient
descent.
Let’s further augment the loss by dividing
it by the number of observations we have.
This would give us the average loss per observation
or the mean loss. Similarly to the division
by two, this does not change the logic of
the loss function; it is still lower for
more accurate results.
This little improvement makes the learning
independent of the number of observations.
Instead of adjusting the learning rate, we
adjust the loss. That’s valuable as the
same learning rate should give us similar
results for both 1,000 and 1 million observations.
Once again, that’s something we’ll discuss
in more detail later in the course.
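As a sketch, the scaled loss could be computed like this, with made-up outputs and targets standing in for one iteration of the loop:

```python
import numpy as np

# Mean half-L2-norm loss: sum of squared deltas, divided by 2,
# then divided by the number of observations.
outputs = np.array([[1.0], [2.0], [3.0]])
targets = np.array([[1.5], [2.0], [2.0]])
observations = outputs.shape[0]

deltas = outputs - targets
loss = np.sum(deltas ** 2) / 2 / observations
print(loss)
```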
We’ll print the loss we’ve obtained at
each step. That’s done as we want to keep
an eye on whether it is decreasing as iterations
are performed. If it is decreasing, our machine
learning algorithm functions well!
Finally, we must update the weights and biases
so they are ready for the next iteration.
Using the same rescaling trick, I’ll also
rescale the deltas. This is yet another way
to make the algorithm more universal.
So, the new variable is deltas_scaled and
equals deltas divided by observations.
Let’s update the weights. We will follow
the gradient descent logic.
The new weights are equal to the old weights
minus the learning rate times the dot product
of the inputs and the deltas_scaled.
The shape of the weights is 2 by 1. The shape
of the inputs is 1000 by 2, and that of the
deltas scaled is 1000 by 1. Obviously, we
cannot simply multiply the inputs and the
deltas. This is an issue that may arise occasionally
due to the linear algebra involved. To fix
it, we just transpose the inputs matrix, using
the object dot T method. Now, the matrices
are compatible. 2 by 1000 times 1000 by 1
is equal to 2 by 1.
I’d like to spare an extra thought on that.
Often, when dealing with matrices, you find
the correct way to code it through dimensionality
checks and compatibility errors. However,
transposing matrices doesn’t affect the
information they hold, so we can do it freely.
Alright.
Let’s update the biases.
The new biases are equal to the old biases
minus the learning rate times the sum of the
scaled deltas, as explained in the gradient
descent lecture.
This is the entire algorithm.
Let’s recap what it does.
First, it calculates the outputs for given
weights and biases.
Second, it calculates a loss function that
compares the outputs to the targets.
Third, it prints the loss, so we can later
analyze it.
And fourth, we update the weights and the
biases following the gradient descent methodology.
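Putting these four steps together, a self-contained sketch of the whole algorithm might look like this. It mirrors the narration, though the exact lesson code lives in the notebook:

```python
import numpy as np

# Generate fake data with a known linear relationship: t = 2x - 3z + 5 + noise.
observations = 1000
xs = np.random.uniform(-10, 10, (observations, 1))
zs = np.random.uniform(-10, 10, (observations, 1))
inputs = np.column_stack((xs, zs))
noise = np.random.uniform(-1, 1, (observations, 1))
targets = 2 * xs - 3 * zs + 5 + noise

# Initialize small random weights and bias, and pick a learning rate.
init_range = 0.1
weights = np.random.uniform(-init_range, init_range, (2, 1))
biases = np.random.uniform(-init_range, init_range, (1, 1))
learning_rate = 0.02

for i in range(100):
    outputs = np.dot(inputs, weights) + biases     # 1. calculate the outputs
    deltas = outputs - targets
    loss = np.sum(deltas ** 2) / 2 / observations  # 2. mean half-L2-norm loss
    print(loss)                                    # 3. track the loss
    deltas_scaled = deltas / observations          # 4. gradient descent update
    weights = weights - learning_rate * np.dot(inputs.T, deltas_scaled)
    biases = biases - learning_rate * np.sum(deltas_scaled)

print(weights, biases)  # should approach 2, -3 and 5
```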
Let’s run the code.
What we get is a list of numbers that appears
to be in descending order, right? These are
the values of our averaged loss function.
It started from a high value, and at each
iteration, it became lower and lower, until
it reached a point where it almost stopped
changing.
This means we have minimized or almost minimized
the loss function with respect to the weights
and biases. Therefore, we have found a linear
function that fits the data well.
The weights and biases are optimized, but
so are the outputs. Since the optimization
process has ended, we can check these values.
Here, we observe the values from the last
iteration of the for loop, the one that gave
us the lowest loss function. In the memory
of the computer, the weights, biases, and
outputs variables are optimized as of now.
Congratulations, you learned how to create
your first machine learning algorithm.
Still, let’s spend an extra minute on that.
I’d like to print the weights and the biases.
The weights seem about right. The bias is
close to five, as we wanted, but not quite there.
That’s because we used too few iterations
or an inappropriate learning rate. Let’s
rerun the code for the loop.
This will continue optimizing the algorithm
for another hundred iterations. We can see
the bias improves when we increase the number
of iterations.
We strongly encourage you to play around with
the code and find the optimal number of iterations
for the problem.
Try different values for observations, learning
rate, number of iterations, maybe even initial
range for initializing the weights and biases.
Cool.
Finally, I’d like to show you the plot of
the outputs at the last iteration against
the targets. The closer this plot is to a
45-degree line, the closer the outputs are
to the targets. Obviously, our model worked
like a charm.
Alright.
This was the last lesson from our first big
topic. From next time, we will start with
more complicated stuff. If you have any doubts
about your knowledge so far, please revisit
the lessons and make use of all the extra
resources available, like course notes, exercises,
and Jupyter notebooks. In addition, feel free
to post in the course Q&A section. We love
hearing from you!!
