
English: 
Consider now the data
that we're going to use.
We're going to have features that
we've computed, these are things like
Bollinger bands and momentum and
price change and things like that.
We're going to use these features to try
and predict prices or price changes.
So these features are our X data, and
if we've got multiple features,
we've got multiple dimensions in X.
So this might be X1, X2, X3, and so on.
And the prices or price changes are our Y data,
which we're trying to predict.
In order to evaluate our learning
algorithms in a scientific manner,
we need to split this data
into at least two sections:
a training section and
a testing section.
If we trained over the same
data that we tested over,
the results would be suspicious
because we should obviously be able
to do very well if we test over
the same data we trained on.
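
To make that concern concrete, here is a minimal Python sketch. The feature values and the one-nearest-neighbor "model" are made up for illustration and are not from the lecture; the point is only that a model that memorizes its training data scores perfectly when it is tested on that same data.

```python
# Hypothetical toy data: each row of X holds three feature values
# (say X1 = Bollinger band value, X2 = momentum, X3 = price change);
# Y holds the value we want to predict for that row.
X = [
    [0.2, 0.05, 0.01],
    [-0.4, -0.02, -0.03],
    [0.9, 0.10, 0.02],
    [-0.1, 0.01, 0.00],
]
Y = [0.012, -0.020, 0.031, 0.002]


def predict_1nn(X_train, Y_train, query):
    """Return the Y of the training row closest to `query` (squared Euclidean distance)."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(row, query)) for row in X_train
    ]
    nearest = distances.index(min(distances))
    return Y_train[nearest]


# Test on exactly the rows we trained on: every query row finds itself.
in_sample_predictions = [predict_1nn(X, Y, row) for row in X]
in_sample_errors = [abs(p - y) for p, y in zip(in_sample_predictions, Y)]
print(in_sample_errors)  # [0.0, 0.0, 0.0, 0.0]
```

The zero error here says nothing about how the model would do on data it has not seen, which is why a separate testing section is needed.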

English: 
This procedure of separating testing and
training data from one another
is called out-of-sample testing.
This is a very important and
essential technique.
We'll call the X data that we use for
training, Xtrain and
the Y data that we use for
training, Ytrain.
Similarly, the data we'll test
on will be split into X and
Y sections, Xtest and Ytest.
So the general idea here is that
we'll take our Xtrain data and
our Ytrain data, and run them through
our machine learning algorithm, which
might be linear regression or
KNN, to generate a model.
We can then test the accuracy
of that model using the test data.
So, the input to the model is Xtest.
We plug that X data into the model,
and out comes a predicted Y.
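
The flow just described can be sketched in a few lines of Python. This is an illustrative sketch, not the course's implementation: the `KNNLearner` class, its `train` and `query` method names, the choice of k = 2, and the data values are all assumptions made for the example.

```python
class KNNLearner:
    """A tiny k-nearest-neighbors regressor (illustrative, not optimized)."""

    def __init__(self, k=2):
        self.k = k

    def train(self, X_train, Y_train):
        # For KNN, "generating the model" just means storing the training data.
        self.X_train = X_train
        self.Y_train = Y_train

    def query(self, X_test):
        # For each test row, average the Y values of its k closest training rows.
        predictions = []
        for point in X_test:
            by_distance = sorted(
                range(len(self.X_train)),
                key=lambda i: sum(
                    (a - b) ** 2 for a, b in zip(self.X_train[i], point)
                ),
            )
            nearest = by_distance[: self.k]
            predictions.append(sum(self.Y_train[i] for i in nearest) / self.k)
        return predictions


# Made-up data with two features per row.
X_train = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
Y_train = [10.0, 20.0, 30.0, 40.0]
X_test = [[1.4, 1.4], [3.6, 3.6]]

learner = KNNLearner(k=2)
learner.train(X_train, Y_train)      # Xtrain and Ytrain go in, a model comes out
Y_predicted = learner.query(X_test)  # Xtest goes in, some kind of Y comes out
print(Y_predicted)  # [15.0, 35.0]
```

Note that Ytest is never shown to the learner; it is held back so the predictions can be compared against it afterwards.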

English: 
And the question is, is that predicted Y equal to
Ytest, which we know is ground truth?
The more closely the Y that
the model outputs matches
the Ytest data,
the more accurate the model is.
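
One common way to put a number on how closely the predictions match Ytest is root mean squared error (RMSE). The lecture does not name a specific metric at this point, so treat the choice of metric, the `rmse` helper, and the numbers below as assumptions for illustration.

```python
import math


def rmse(Y_predicted, Y_test):
    """Root mean squared error between predictions and ground truth."""
    squared_errors = [(p - t) ** 2 for p, t in zip(Y_predicted, Y_test)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: what a model output for Xtest, and the known ground truth.
Y_predicted = [15.0, 35.0, 22.0]
Y_test = [14.0, 37.0, 22.0]

# Lower means the predictions are closer to Ytest; 0.0 would be a perfect match.
print(rmse(Y_predicted, Y_test))
```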
Something that I didn't mention
is that in this class,
our data is time-oriented.
So, as you move down the rows of the data,
you're going forward in time.
We typically split the data
up according to time.
We train our model on older data and
test it on newer data.
It's generally frowned
upon to do the reverse.
You might argue, well, the test
data is different from the training data.
It's still out of sample.
But there are certain look-ahead
biases that can occur
if you were to train on later data and
test on earlier data.
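
A time-based split might look like the following sketch. The 50/50 split fraction, the `split_by_time` helper, and the sample rows and dates are assumptions for illustration; the only idea taken from the lecture is that the older rows are used for training and the newer rows for testing.

```python
def split_by_time(X, Y, train_fraction=0.5):
    """Split time-ordered rows (oldest first) into older training and newer testing sets."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of rows")
    cutoff = int(len(X) * train_fraction)
    X_train, Y_train = X[:cutoff], Y[:cutoff]
    X_test, Y_test = X[cutoff:], Y[cutoff:]
    return X_train, Y_train, X_test, Y_test


# Made-up rows, already sorted so that moving down means moving forward in time.
dates = ["2020-01-02", "2020-01-03", "2020-01-06",
         "2020-01-07", "2020-01-08", "2020-01-09"]
X = [[0.1, 0.2], [0.0, 0.3], [-0.2, 0.1], [0.4, -0.1], [0.3, 0.0], [0.2, 0.1]]
Y = [0.01, -0.02, 0.03, 0.00, 0.01, -0.01]

X_train, Y_train, X_test, Y_test = split_by_time(X, Y)
print(dates[: len(X_train)])  # training period (older rows)
print(dates[len(X_train):])   # testing period (newer rows)
```

The rows are deliberately not shuffled before splitting, since shuffling would mix later rows into the training set.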
