Hello and welcome to the mathematical statistics
course.
I’m Taeyoung Park, Professor of Applied
Statistics at Yonsei University.
This semester I’ll be teaching mathematical statistics.
What is mathematical statistics and what do we learn?
First of all, statistics is a field of study
about collecting data, analyzing data, and
inferring from data.
Think of an opinion poll (or election survey)
before an election.
Because the actual opinion of the entire population
is revealed only on election day, the general public is typically interested in knowing the results
well before the actual election day.
Look at the news media. 
The news media is even talking about who is
going to be the next president several years 
before the actual presidential election.
So, everyone wants to predict some future events, but how can it be done?
The answer is statistics.
So, instead of dealing with the entire population,
we use a statistical method to choose and 
interview a smaller group of people out of
this population.
This smaller group of people
is called a sample.
And this sample can represent the entire population.
This first step is about collecting data.
Next, we summarize all the information that
we collected and apply statistical methods
to it, which is about analyzing data.
Last, we use statistical inference to make
a decision or prediction,
which is about inferring from data.
So this is where we make the decision or prediction.
In short, this whole process is about inferring
the opinions of the entire population from
its much smaller portion, which is called
a sample.
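To make this concrete, here is a minimal sketch in Python of the whole process,
assuming a made-up population of voters. The population size, the true support
rate of 0.52, and the sample size of 1,000 are all hypothetical numbers chosen
only for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1 = supports candidate A, 0 = does not.
POPULATION_SIZE = 100_000
TRUE_SUPPORT = 0.52  # assumed value; in practice this is unknown
population = [1 if random.random() < TRUE_SUPPORT else 0
              for _ in range(POPULATION_SIZE)]

# Collecting data: interview a simple random sample of 1,000 people.
sample = random.sample(population, 1000)

# Analyzing and inferring: the sample proportion estimates
# the proportion in the entire population.
sample_support = sum(sample) / len(sample)
population_support = sum(population) / len(population)

print(f"sample estimate:  {sample_support:.3f}")
print(f"population value: {population_support:.3f}")
```

The two printed numbers should be close to each other, even though the sample
is only 1% of this population.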
So I would say statistics is a machine that
transforms data into knowledge, and the whole
process is called statistical learning.
Sometimes this is called machine learning.
These days, however, machine learning is much
more famous than statistical learning.
Essentially these two are the same, but
strictly speaking machine learning is a narrow
field of statistics.
What is the difference between statistics
and machine learning?
As I mentioned, statistics is about collecting
data, analyzing data, and inferring from data.
So there is a process, data collection, data
analysis, and inference.
And can you see a tiny little word, prediction?
Yes, statistics is doing a little bit of prediction,
but most emphasis is on inference.
That is why I made the word prediction intentionally
small.
What about machine learning?
If you use machine learning to make a prediction
for the next president, you crawl the relevant
data from websites.
Because people who use machine learning think
they can collect all the data, they don’t
care about whether the data can represent the
entire population through a collection process.
That is the first difference. Machine learning
just uses data crawling.
Then based on the crawled data, people analyze
the data with deep learning or whatever fancy
method they prefer, and make a prediction.
Most people who use machine learning are interested
in prediction, not inference.
That is the second difference.
So the first difference is about data collection.
The second difference is inference vs. prediction.
One of the research projects that I worked
on was to understand people’s lifestyle
at a certain location.
Because I didn’t know anyone who lives
at the location, what I did was collect
all the Instagram posts that tagged the location.
And what happened is this. This graph is from
the actual data from Instagram.
Here the X-axis is the percentile rank
of an Instagram user, so a percentile of
0.2 means the top 20% in terms of the number of
posts.
The Y-axis is the cumulative relative
frequency of posts, so 0.8 means about 80%
of all the posts that tagged that
location.
So, the top 20% of users are responsible for
what? About 80% of Instagram posts.
Does it sound familiar? Yes, it is the Pareto
principle. This is also known as the 80-20 rule.
So 20% of users are responsible for 80% of
posts.
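If you want to check this kind of concentration yourself, a minimal sketch
could look like this. Note that the post counts here are simulated from a
heavy-tailed Pareto distribution; they are not the actual Instagram data, and
the function name cumulative_share and the shape parameter 1.16 are my own
assumptions for illustration.

```python
import random

random.seed(1)

def cumulative_share(counts, top_fraction):
    """Share of all posts written by the top `top_fraction` of users."""
    ranked = sorted(counts, reverse=True)   # most active users first
    k = int(len(ranked) * top_fraction)     # number of users in the top group
    return sum(ranked[:k]) / sum(ranked)

# Simulated number of posts per user (hypothetical, heavy-tailed).
posts_per_user = [int(random.paretovariate(1.16)) for _ in range(10_000)]

share_top20 = cumulative_share(posts_per_user, 0.2)
print(f"top 20% of users wrote {share_top20:.0%} of the posts")
```

If every user posted equally often, the top 20% would write exactly 20% of the
posts. With heavy-tailed data, the share is much larger than that.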
This implies that what you see on social media
does not represent the opinion of the entire population.
And if you make a decision or prediction based solely
on all the data crawled from the web, you
will have a bias and can make a wrong prediction.
So we should always be careful when we collect
data, which is unfortunately often ignored in machine
learning.
And machine learning cares about prediction, while
statistics is more interested in inference.
Prediction is about the future, but inference
is about the past.
As human beings, we are naturally interested
in what is coming next, not what has just passed.
This is why people are more thrilled about
machine learning than statistics.
However, essentially statistics and machine
learning are similar, and actually machine
learning is a narrow field of statistics.
So, you will be learning statistics this semester,
but what is mathematical statistics?
In the mathematical statistics course, you
will learn the mathematical foundation
of statistics.
Topics include essentials of probability theory,
probability distributions, and an introduction
to statistical inference.
You may ask: Is probability the mathematical
foundation of statistics?
Yes, that is correct.
So, Mathematical Statistics (1) is sometimes
called a probability theory course, and
Mathematical Statistics (2) is called a statistical
inference course.
So basically we are going to talk about probability
theory during this semester, but to make a
connection between this course and Mathematical
Statistics (2), I’ll give you a little
flavor of statistical inference later in the
semester, based on what you have learned
in this course.
Basically in this situation, we are interested
in the characteristics of the population.
However, the characteristics of the population
are typically unknown.
So we are going to take a small portion of
the population, which is called a sample.
And based on the sample, we are going to infer
the characteristics of the population using
statistical inference.
Statistical inference is a very important
topic in statistics, and I’m going to talk
about three things in statistical
inference: point estimation, interval estimation,
and hypothesis testing.
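Just to give you a flavor of these three things, here is a minimal sketch
for an election survey. The numbers (540 supporters out of 1,000 people
interviewed) are made up, and I use the usual normal approximation for a
proportion, which is only one of several possible methods.

```python
import math
from statistics import NormalDist

# Hypothetical survey result: 540 of 1,000 sampled voters support candidate A.
n = 1000
yes = 540

# Point estimation: a single best guess for the population proportion.
p_hat = yes / n

# Interval estimation: an approximate 95% confidence interval.
se = math.sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error
z = NormalDist().inv_cdf(0.975)             # about 1.96
ci = (p_hat - z * se, p_hat + z * se)

# Hypothesis testing: H0: p = 0.5 against H1: p != 0.5.
p0 = 0.5
z_stat = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"point estimate: {p_hat:.3f}")
print(f"95% interval:   ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"p-value:        {p_value:.4f}")
```

A small p-value here would suggest that the support in the population is
different from 50%.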
I said that statistical learning is similar to
machine learning. Yes, once upon a time.
Nowadays things have changed and the paradigm has
shifted.
Statistics faces a challenge caused by the structure
of data and the corresponding methods of
analysis.
Because of advances in technology, data
has become big data.
This does not necessarily mean that the size of the
data is getting bigger.
Rather, the types of data have become more varied.
So, in addition to structured data
that are neatly stored in Excel or SQL, we
confront unstructured data
like text, images, audio, video, web logs, and
mobile data.
Machine learning, such as deep learning, is
doing a very good job of analyzing these unstructured
data.
In essence, machine learning is based
on probability theory, which mathematically
expresses how likely something is.
For example, there is an image of a handwritten
digit as an input, and the actual digit as
an output.
The machine learning algorithm connects
the input and the output with a black box.
In a black-and-white image, here we have 18
by 18 pixels, and each pixel has 256
different intensities from 0 to 255.
So, even though we have this black-and-white
image, it can be turned into an 18 by 18 matrix,
where each entry ranges from 0 to 255.
In a color image, you have red, green, and blue,
and each pixel also has 256 intensities but
now in each of 3 channels. So, it becomes more
complicated.
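As a rough sketch of what this looks like as data, we can build such matrices
in plain Python. The pixel values below are random placeholders, not a real
handwritten digit, and the variable names are only for illustration.

```python
import random

random.seed(2)
HEIGHT, WIDTH = 18, 18

# Black-and-white image: an 18 by 18 matrix of intensities from 0 to 255.
gray = [[random.randint(0, 255) for _ in range(WIDTH)]
        for _ in range(HEIGHT)]

# Color image: each pixel holds three intensities (red, green, blue).
color = [[tuple(random.randint(0, 255) for _ in range(3))
          for _ in range(WIDTH)]
         for _ in range(HEIGHT)]

# Many algorithms take the image as one long vector of numbers.
flat_gray = [value for row in gray for value in row]
flat_color = [value for row in color for pixel in row for value in pixel]

print(len(flat_gray))   # 18 * 18 = 324 numbers
print(len(flat_color))  # 18 * 18 * 3 = 972 numbers
```

So the color image needs three times as many numbers as the black-and-white
image of the same size.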
Anyway, given the data, the machine learning
algorithm will learn the relationship between
inputs and outputs.
After the algorithm is optimized, the machine
learning algorithm takes an image of a handwritten
digit as an input, and tells you what digit
this image corresponds to.
How can it be done?
Internally, the machine learning algorithm
will calculate how likely each digit is and
select the digit with the maximum likelihood
or probability.
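Here is a minimal sketch of that last step. I assume the algorithm has already
produced one raw score per digit, and I use the softmax function to turn the
scores into probabilities. Softmax is one common choice, not the only one, and
the scores below are made-up numbers.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the digits 0, 1, ..., 9.
scores = [0.1, 0.3, 2.5, 0.2, 0.0, 0.4, 0.1, 1.2, 0.3, 0.2]

probs = softmax(scores)

# Select the digit with the maximum probability.
predicted_digit = max(range(10), key=lambda d: probs[d])

print(predicted_digit)  # the digit with the largest score, here 2
```

The probabilities tell you not only which digit is selected but also how
confident the algorithm is about it.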
So, if you want to fully understand what the
machine learning algorithm does, you need
to learn mathematical statistics.
When it comes to analyzing unstructured data,
there is still room to develop new machine
learning algorithms or tune existing machine
learning algorithms.
If you want to make a contribution to the
world, this mathematical statistics course
will help you do so.
