Accio Data!
Data collection is one of the most important parts of building machine learning models, because no matter how well designed our model is, it won't learn anything useful if the training data is invalid.
It's garbage-in, garbage-out!
Invalid data leads to invalid results.
This is not to say that the training data
needs to be perfect.
Especially if you have millions of samples,
it's inevitable that some samples will have
inaccurate labels.
That can be tolerable.
But if there is a serious flaw in our data
collection strategy, we might end up with
data that is complete garbage.
'Garbage' can mean several things.
For example, the labels can be inaccurate
or the input variables can be inaccurate or
missing.
These flaws are usually obvious and easy to notice after a quick inspection.
A sneakier type of flaw is dataset bias.
When people, or companies, use data-driven
models for decision making, they sometimes
assume that it cannot be biased since it's
the machines that make the decision, and the
decision is based on 'big data'.
The problem is that the data can be biased,
no matter how big it is.
You might have heard about Tay, a chatbot that started to post racist tweets after being left unsupervised for a while.
It's highly unlikely that this behavior was intentional, but human biases sometimes surface in models trained on data collected from humans.
Racism in, racism out.
Sexism in, sexism out.
One should be extra careful when building
models that affect people's lives, such as
the models that are used for medical, financial,
or legal purposes.
Using biased data to make decisions can further
reinforce the biases and lead to unfair discrimination.
Ok, that being said, let's talk about where to find data and what to do with it.
Let's start with the cheapest and easiest
option.
If you are lucky, you might not have to collect
any data at all.
There are many datasets that are freely available
on the web.
Unless you want to build a model that focuses
on a niche topic, it's likely that you will
find the dataset you want by doing a simple
web search.
Let's move to the next cheapest option: web
crawling and scraping.
The internet is an immense source of information.
For example, all of Wikipedia is available for download.
Content from Wikipedia can be used either as-is or as a starting point to collect more data.
Natural language processing models, in particular, can benefit greatly from such a large corpus of text in a particular domain.
Other applications, such as image classifiers, can also benefit from data available on Wikipedia.
For example, if your goal is to classify dog
breeds, you can pull a list of dog breeds
from Wikipedia, iterate over its rows, and
find and download pictures of each breed on
the web.
Some content providers make this easier by
providing an official API that gives programmatic
access to the content.
Web crawling and scraping tools, such as the
Selenium WebDriver and a headless browser,
can also be used to build such custom datasets.
One thing to keep in mind about web scraping: make sure that you comply with the terms of service of the websites you scrape content from.
Also, make sure to add delays between web
requests to avoid putting too much load on
the servers that you pull data from.
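To make this concrete, here is a minimal Python sketch of the dog breed crawl described above, assuming the requests and beautifulsoup4 packages are available. The table selector and the one-second delay are illustrative choices, and the image-download step is left as a stub you would fill in.

```python
import time
import requests
from bs4 import BeautifulSoup

# Wikipedia's list of dog breeds; the parsing below treats the
# page structure as a placeholder and may need adjusting.
LIST_URL = "https://en.wikipedia.org/wiki/List_of_dog_breeds"

response = requests.get(LIST_URL, headers={"User-Agent": "breed-dataset-bot"})
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Pull breed names out of the page's tables (selector is illustrative).
breeds = [link.get_text(strip=True)
          for link in soup.select("table.wikitable td a")]

for breed in breeds:
    # ... search for and download pictures of `breed` here ...
    time.sleep(1.0)  # be polite: delay between requests to avoid overloading servers
```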
In some cases, these 'easy' data collection techniques might not be applicable, particularly if our task requires subjective human judgment.
For example, if our goal is to predict how
a human would rate the quality of a given
image, then we might need a set of images
that are rated by humans.
Actually, for this specific example, there
are publicly available datasets.
But my point is that at some point we might need to ask people to provide us with data.
One way of doing that is to conduct surveys.
If you can structure your study in the form
of a game, that's even better.
What would you prefer: to play a game or to fill out a survey?
Games are a great way to collect data.
You can also utilize crowdsourcing platforms
such as Amazon Mechanical Turk.
In any case, if you need human input to build a dataset, take care to ask unbiased questions, make the user interface easy to use, and make it fun for the subjects.
More importantly, ensure that the entire process
is ethical and be extra cautious if you are
dealing with sensitive data.
Now that we know where to find data, let's
assume that we already have the data, and
talk about how to make it ready for machine
consumption.
We mentioned missing data earlier.
It's best to avoid missing values during data
collection but that might not always be an
option.
So what do we do if we have some missing values here and there?
If only a small portion of the samples have missing values, we can simply discard those samples.
If particular data fields have a lot of missing values, we can drop those columns entirely.
Another option is data imputation.
For time series data, the last valid value
is sometimes carried forward to fill in missing
values.
If this is a global stock market dataset, for example, there might be some missing values for companies in different countries, since their stock markets close for national holidays.
Other basic data imputation methods include
substituting the missing values with the mean
or the median of the column.
Some more sophisticated methods predict the
missing values from what's available by using
another learned model.
One caveat about missing values is that they
might not always be random.
A missing value can have a meaning on its
own too.
There might be a particular reason why some
fields are empty and filling in these fields
might lead to a bias in the dataset.
If a categorical variable is missing some values, it's sometimes best to treat 'missing' as just another category.
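Here is a minimal pandas sketch of the options we just discussed; the columns and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataset with made-up columns, just to illustrate the options.
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, 58_000],
    "city":   ["Paris", None, "Tokyo", "Oslo"],
})

# Option 1: discard rows that have any missing values.
df_rows_dropped = df.dropna()

# Option 2: drop columns that have too many missing values
# (here: keep a column only if at least 80% of its values are present).
df_cols_dropped = df.dropna(axis="columns", thresh=int(0.8 * len(df)))

# Option 3: impute numeric columns with the mean (or median).
df_imputed = df.copy()
df_imputed["age"] = df_imputed["age"].fillna(df["age"].mean())

# For time series, carry the last valid value forward instead.
df_imputed["income"] = df_imputed["income"].ffill()

# For categorical variables, treat 'missing' as its own category.
df_imputed["city"] = df_imputed["city"].fillna("missing")
```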
Fixing a dirty dataset takes a lot of time.
Cleaning a dataset usually involves manual
inspection and corrections in addition to
automatic processes.
If you are at the beginning of the data collection process, it's best to identify the underlying problems and revise the data collection strategy to prevent missing or inconsistent data in the first place, when possible.
Having a clean dataset is not always enough
to train useful models.
There might still be some issues.
For example, if the input variables are in
different orders of magnitude, features having
larger magnitudes can dominate features having
smaller magnitudes during training.
One solution to that is feature scaling, which
is usually done to achieve consistency in
the dynamic range of the variables.
Scaling the variables properly can improve
the results and speed up the convergence.
A very simple way to scale variables is to map them linearly to a specific range.
For example, we can scale all ages to the
range [0, 1] by simply subtracting the
minimum age we have in the dataset from the
values and dividing the result by the difference
between the maximum and minimum values.
Another widely used method is standardization, which makes the variables zero mean and unit variance, preventing a variable with a large variance from dominating the objective function.
This is done by subtracting each feature's mean from its values and dividing the result by the feature's standard deviation.
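Both methods take only a couple of lines in NumPy; here is a minimal sketch using a made-up age column.

```python
import numpy as np

# Toy feature: ages, just for illustration.
ages = np.array([18.0, 25.0, 40.0, 65.0, 90.0])

# Min-max scaling: map values linearly to the range [0, 1].
ages_minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: zero mean, unit variance.
ages_standardized = (ages - ages.mean()) / ages.std()
```

In practice, you would compute the minimum, maximum, mean, and standard deviation on the training set only, and reuse those statistics when scaling validation and test data.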
One more thing that is worth mentioning is
data imbalance.
Certain classes in a dataset can sometimes have relatively few samples compared to the others.
As a result, the learning algorithm might
choose to ignore those underrepresented classes.
For example, if you have a dataset of animal species, the pictures of cats might outnumber the pictures of tigers.
As a result, the learning algorithm might
choose to classify all felines as cats and
get away with it if we use a uniform cost
function.
In such a case, we have a few options.
The first option is to leave it as-is.
It might be acceptable to classify the samples
that are less likely to be observed in the
future with lower accuracy.
If we think that all classes are equally important,
one option is to undersample the larger classes
by throwing out some of their samples.
Personally, I don't recommend this.
Another option is to oversample the underrepresented
classes or to synthesize fake examples for
these classes.
I have seen this technique be useful in traditional machine learning systems, but I have never really used it for training deep models on large-scale data.
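For reference, here is what simple random oversampling might look like with scikit-learn's resample utility; the arrays are toy data, and methods like SMOTE go a step further by synthesizing new minority-class samples rather than duplicating existing ones.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data: 6 'cat' samples vs. 2 'tiger' samples.
X = np.array([[0], [1], [2], [3], [4], [5], [6], [7]])
y = np.array(["cat"] * 6 + ["tiger"] * 2)

X_minority = X[y == "tiger"]

# Randomly resample the minority class (with replacement)
# until it matches the majority class count.
X_upsampled = resample(X_minority, replace=True, n_samples=6, random_state=0)
```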
Perhaps, a better option for deep learning
models is to use a class-weighted cost function,
where a higher cost is assigned to the misclassification
of the underrepresented classes.
These weights might either be hard-coded or
dynamically adjusted based on observed frequencies
of the samples.
For example, we can assign a higher weight to axolotls to compensate for their rare occurrence in the dataset.
Axolotls are awesome, by the way.
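As a sketch, here is one common way to set such weights, inversely proportional to class frequency (the same heuristic as scikit-learn's 'balanced' class weights), plugged into PyTorch's cross-entropy loss. The class counts here are made up.

```python
import torch
import torch.nn as nn

# Made-up class counts: cats vastly outnumber tigers and axolotls.
class_counts = torch.tensor([10_000.0, 500.0, 20.0])  # cat, tiger, axolotl

# Weight each class inversely proportional to its frequency.
weights = class_counts.sum() / (len(class_counts) * class_counts)

# Cross-entropy loss that penalizes mistakes on rare classes more heavily.
criterion = nn.CrossEntropyLoss(weight=weights)
```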
The topics we covered today are usually discussed in the realm of data mining, and not so much in the context of deep learning.
It's true that deep learning models can be robust to some kinds of noise and can learn something useful even from not-so-clean data.
But, personally, I consider data collection
and inspection one of the most important steps
when building models.
If a model doesn't work well, my experience
is that the culprit is more likely to be the
data than the hyperparameters.
That brings us to the end of this video, except for the bloopers at the end.
In the next video, we are gonna shift gears
and talk about convolutional neural networks,
finally!
Convolutional neural networks are amazing
and we will see why.
As always, thanks for watching, stay tuned,
and see you next time.
