>> So, my name is
Christian Steinruecken.
I'm going to talk about
compressing human text.
I'm very happy to
be here and to see
so many machine
learners in one place.
I should add that a lot of
data compression isn't done
by machine learners;
there are a lot of
computer scientists
who mainly focus on
writing algorithms.
It's a really difficult thing
to compress text and
to come up with algorithms
that actually work.
I'm very proud to be presenting
joint work with Zoubin Ghahramani
and David MacKay on
a new data compressor.
So, what are we compressing?
So here are randomly sampled
bits; this is incompressible.
There's no way we
could store this file
more compactly than by just
storing the bits as they are,
and that's because the
information content of
that file is exactly
as large as the file itself.
We cannot squeeze
it any further at all.
Then, perhaps a
more realistic sort
of thing could be that data.
So this is a text file
and if you stare at the zeros
and ones a little bit,
you'll see that there's
a lot of repetition.
If you squint your eyes, you will
see lines appear and that's
because most of
the ASCII characters
actually start with
the same signature.
So, there's a lot of structure
and that sort of structure
can be exploited and can
ultimately be compressed.
Of course, we care about
all sorts of other things,
we may want to care
about compressing
the latest viral thing
from YouTube or
other parts of the world.
The question is, how can
we write models that model
these sequences, and how can we
write algorithms that do that?
So, sequences are more or less
ubiquitous in our world
and human text is
just one example.
There are also
computer programs and
genomic data and of
course once we store
something in a computer,
the computer ultimately makes
a sequential representation of
the thing in one form or another.
One important message is that
the sequences are
all very different.
So if we have human
text for example,
it has very different
characteristics from
say a program binary which
is the second box there.
This is the Java binary of
the new compressor that I'm
presenting, and then that's
again different from say,
markup languages or sort of
the thing in between what humans
and computers talk about.
So, the third box there is
the slide source code of
the very slide you're looking
at, and if I
change the slide then
the source code also changes.
So, why is it difficult
to model sequences?
It's difficult because the space
of sequences is very large
and we may be interested in
some sequences that don't
even fit into memory.
So, it's sort of
difficult to think
about compressing such sequences.
Of course, different
sequences also
have very
different properties;
for example, DNA has
very different sorts of
properties than text has.
So the myth of
a universal file compressor
doesn't really hold,
because the sort of
assumptions we make
about one file
may be very wrong
for other files.
Of course, there's no end to
how complex the structure
in a sequence can get.
It's an AI complete task,
so we can always do better.
So, let's talk a little bit about
models that work on sequences.
So here, we're taking
a machine learning approach:
imagine we have
a probability distribution
over sequences.
So, we have discrete
symbols X1 to Xn.
We have n of them, and
essentially we're assigning
a probability to
any particular sequence,
and also implicitly to its length.
What are these models useful for?
Well, obviously compression
because that's what we're
talking about but also
anomaly detection,
we can use it for
smart text entry methods like Dasher,
for example, or the one on your phone.
It may also be useful
for sequence synthesis.
Imagine you have
a cheap DNA reader
and you can only afford to
read the sequence once.
You may want to have
a probabilistic model that
guesses well what the things
are that you're missing.
It's also helpful in
other applications
like OCR and so on.
Of course, you can imagine
any algorithm that takes
a piece of text and then
compresses it somehow.
You can imagine doing all sorts
of string substitutions,
you can run automata,
you can try and find
repetitions and then
do something with it.
No matter what you do,
if you have a mapping from
a sequence to a code,
then that can be interpreted as
a probabilistic model, and that's
because the output has a length,
and e to the minus
of that length, divided by
a normalizing constant Z,
gives you a probability
distribution.
So really anything
you do to sequences,
as long as you preserve
every piece of information,
is really a probability
distribution.
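That correspondence between code length and probability can be sketched in a few lines; this is a hedged illustration of the principle, not any particular compressor's internals:

```python
import math

def implied_probability(code_length_bits: float) -> float:
    """Probability a compressor implicitly assigns to an input it
    encodes in `code_length_bits` bits: p = 2^(-length).
    (The talk phrases it as e^(-length); the base only changes
    the unit from bits to nats.)"""
    return 2.0 ** (-code_length_bits)

# A file that compresses to 8 bits is implicitly 256 times more
# probable under the compressor's model than one needing 16 bits.
print(implied_probability(8) / implied_probability(16))  # 256.0
```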
There are some famous
algorithms that work
on that principle.
For example, gzip,
which I'm sure you all
know; there are various
others as well.
So, the question is how
well can we compress,
so if you look at
a sequence like this
for example, that
should look familiar,
that's the binary
expansion of pi,
and it seems clear that that's
very easy to regenerate,
so you just need
a few lines of code
essentially to be able to
regenerate that sequence.
But how do we know
it's pi when we see
an arbitrary file?
So, that's where essentially
strong AI would come in.
We'd have to be able to crack
any algorithm and
reverse-engineer
the shortest program
that could possibly
generate the data.
No one knows how to do it,
not for lack of trying
but it's just really
hard, and it doesn't make
any difference if you
choose e instead.
And ultimately, we're always
bound by the problem
that if we don't
understand what's in the data
then our model has
no chance of modelling it.
So, here maybe we can crack
the pseudo-random
number generator and
its seed and then
regenerate that sequence
but that's pretty hopeless.
If we manage then we'd have
other things to worry
about I'm sure.
So, how does
a sequence model work?
Well, we assign a probability
to a sequence and its length,
so essentially sequence models
are inherently
non-parametric models.
We can always factorize such
a distribution as follows,
but that doesn't necessarily mean
that, if we come up
with a wacky model,
the second part,
that product of
conditional distributions,
is necessarily
easily accessible or
cheap to compute.
The interesting models for us,
though, make it
cheap to compute.
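The factorization in question is the chain rule, log p(x1..xn) as a sum of log-conditionals; a minimal sketch, where `cond_prob` is a hypothetical interface standing in for any model that exposes its conditionals:

```python
import math

def sequence_log_prob(cond_prob, xs):
    """Chain-rule factorization of a sequence distribution:
    log p(x_1..x_n) = sum_i log p(x_i | x_1..x_{i-1}).
    `cond_prob(context, symbol)` is a hypothetical interface;
    plug in any model's conditional here."""
    return sum(math.log(cond_prob(xs[:i], xs[i])) for i in range(len(xs)))

# A model that always says 1/2 assigns 3 * log(2) nats to "abc",
# i.e. 3 bits: one bit per symbol.
print(sequence_log_prob(lambda ctx, x: 0.5, "abc"))
```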
So let's start simple:
let's say that
every character is just
independently modeled.
We have some probability
distribution over characters and
then we take our sequence
and we model each character
independently of what
has happened before.
Then, of course, real sequences
aren't like that,
and for example English text has
a particular distribution
over letters.
This is what we get when we
just count unigrams, and we'll
see that, for example,
E is the most frequent letter.
That doesn't mean that
we should hard-code this, because
ultimately we may want to
compress other languages too.
So, it's not clear what to do.
What we want to do is
learn that distribution.
We want to learn a
histogram namely,
this histogram and we want
to learn it from the data.
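Learning that histogram from the data is just counting; a minimal sketch of the independent-character baseline:

```python
from collections import Counter

def unigram_model(text: str) -> dict:
    """Learn a character histogram from the data itself, rather
    than hard-coding English letter frequencies; each symbol is
    then modeled independently of its context."""
    counts = Counter(text)
    total = len(text)
    return {sym: n / total for sym, n in counts.items()}

model = unigram_model("abracadabra")
print(model["a"])  # 5 of the 11 characters are 'a'
```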
I'm going to use the notation
curly M to write down
a multiset,
essentially this histogram
of all the symbols and
their occurrence counts.
So, there's something called
the Dirichlet process,
I'm sure you've heard of it.
And it's really nifty
because all it does is,
it collects these counts and it
forms a distribution
over symbols.
So we can condition on, say,
a bunch of symbol observations,
and then that gives
us a probability distribution
over what might happen next;
so it learns, essentially,
a histogram.
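The DP's predictive distribution over symbols can be sketched as follows; a simplified illustration with concentration `alpha` and a base distribution `base` held as plain dicts (the talk's model is more elaborate):

```python
def dp_predictive(counts: dict, alpha: float, base: dict, symbol) -> float:
    """Dirichlet-process predictive probability of `symbol` given
    the multiset of counts seen so far: observed symbols get mass
    in proportion to their counts, and mass `alpha` is reserved
    for the base distribution."""
    n = sum(counts.values())
    return (counts.get(symbol, 0) + alpha * base.get(symbol, 0.0)) / (n + alpha)

base = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}   # uniform base over {a,b,c}
counts = {"a": 3, "b": 1}                      # observations so far
print(dp_predictive(counts, 1.0, base, "a"))   # high: seen often
print(dp_predictive(counts, 1.0, base, "c"))   # small but nonzero: never seen
```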
We can do that, but we
do not actually learn
contextual dependencies
which in English is
what it's all about.
We don't really believe that if
we take our text of Shakespeare,
sorted by letter having
all the A's come first,
and all the Bs, and
Cs, then all Ds,
we don't believe that
that should be the
same as in its original
configuration,
so we want to exploit
the fact that there
is structure and
we do that by putting
in some context.
So, bigram distributions
would work as follows:
we'd say what's the
probability distribution over
the next symbol given that
the previous symbol was
a particular value,
say an E or a Q?
So after a Q in English,
we'd expect to have a spike on U,
even though overall U
would not be
a spike in the histogram.
Then, of course, we can push
that idea a bit further,
and instead of just
conditioning on,
say one letter, we condition
on three letters or more.
Now we're talking,
we're getting to
an interesting
proposition that we
could use to build a compressor.
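Collecting those conditional histograms is straightforward; a hedged sketch with a fixed context depth and no smoothing yet:

```python
from collections import Counter, defaultdict

def train_contexts(text: str, depth: int = 3):
    """For every length-`depth` context in the training text,
    count which symbol follows it. Each context gets its own
    histogram, which is why long contexts leave the tables
    sparse: most contexts are rarely or never seen."""
    tables = defaultdict(Counter)
    for i in range(depth, len(text)):
        tables[text[i - depth:i]][text[i]] += 1
    return tables

tables = train_contexts("the theory of the thing", depth=3)
print(tables["the"])  # symbols observed after the context "the"
```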
It's very nice and flexible,
but more context unfortunately
also means
that each histogram
is a lot sparser, so it
doesn't really get
trained very much.
That ultimately
means that we compress badly.
But not having seen,
say, 'BLI' before
shouldn't mean that we
don't know what to do:
maybe we've seen 'LI'
before, or 'I' before, and can
make use of that
hierarchical information.
So that's what we'll do.
We'll use the fact that
there's some sort of
hierarchical
relationship going on
because at least for human texts,
we fundamentally believe that
the characters that
follow a given context, or
a given sequence of
characters, are similar in
some way to those of the next
shorter context.
That can be expressed by, again,
revisiting our DP by
making it hierarchical.
So, instead of having
a uniform base
distribution at the end,
we will use the
empirical distribution
of the next shorter contexts.
So this way, we are asking
the longest context first
to predict what the next
symbol is going to be.
If the long context doesn't know,
we'll ask the next shorter
context and that's how we're
generating data and
that's how we're
learning all these
hierarchical histograms.
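The back-off idea (ask the longest context first, fall through to shorter ones) can be sketched recursively. This is a plain DP-style hierarchy with a single shared concentration `alpha`, a simplification I'm assuming for illustration, not the exact model from the talk:

```python
def hierarchical_prob(tables, context, symbol, alpha=0.5, alphabet_size=256):
    """Predictive probability under a hierarchy of histograms: the
    base distribution of each context is the predictive
    distribution of the next shorter context, bottoming out in a
    uniform distribution over the alphabet. `tables` maps context
    strings of every length to dicts of symbol counts."""
    if context is None:          # below the empty context: uniform base
        return 1.0 / alphabet_size
    shorter = context[1:] if context else None
    base = hierarchical_prob(tables, shorter, symbol, alpha, alphabet_size)
    counts = tables.get(context, {})
    n = sum(counts.values())
    return (counts.get(symbol, 0) + alpha * base) / (n + alpha)

# With no data at all, every context falls through to the uniform base.
print(hierarchical_prob({}, "ab", "x"))  # 1/256
```

Information is shared exactly as described: a count stored in a short context raises the probability in every longer context that backs off to it.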
So, now, we have
lots of histograms,
and they're sparse, and
information is
shared between them.
So, for example, if in
a longer context we learn,
then we can exploit the fact that
the shorter context also learns.
But a DP isn't really
the best way of doing this.
So, in the past,
people have used something
called a Pitman-Yor process,
which extends the DP by
adding one more parameter.
Don't worry. We
don't have to worry
about these details here.
Just one thing: one
step that led to
this nice new compressor
was to realize
that the Pitman-Yor
process is actually very
expensive to do inference
in once it's hierarchical.
That's because it
has latent state
that isn't really giving us much,
but costs us a lot.
So, we have to do
Gibbs sampling over
latent seating allocations
of restaurants,
and all sorts of things that
we don't really want to do.
The good news is we
can simplify it.
If we don't care exactly
about the table counts,
it does just as well at
compressing in practice,
but we now have a sort of
deterministic allocation, and
that makes it a lot easier.
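A Pitman-Yor-style predictive with a deterministic one-table-per-distinct-symbol allocation might look like this; a sketch under that assumption, not the paper's exact update rule:

```python
def py_predictive(counts, theta, d, base, symbol):
    """Pitman-Yor-flavoured predictive probability with strength
    `theta` and discount `d`. Each distinct symbol is treated as
    exactly one 'table' (the deterministic simplification), so no
    Gibbs sampling over seating arrangements is needed."""
    n = sum(counts.values())
    if n == 0:
        return base[symbol]
    t = len(counts)                      # distinct symbols seen
    c = counts.get(symbol, 0)
    p_seen = max(c - d, 0.0) / (n + theta)
    p_base = (theta + d * t) / (n + theta) * base[symbol]
    return p_seen + p_base

base = {"a": 0.5, "b": 0.5}
counts = {"a": 3, "b": 1}
print(py_predictive(counts, 1.0, 0.5, base, "a"))  # 0.7
print(py_predictive(counts, 1.0, 0.5, base, "b"))  # 0.3
```

The discount `d` skims a little mass off every observed count and redirects it to the base distribution, which is what makes the Pitman-Yor family fit power-law symbol frequencies better than a plain DP.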
Again, if you're interested in
the details, you can
talk to me later.
Yes. One thing that I
also wanted to mention is
there's the question of
how deep we should go.
So, we all agree that
it's a good idea
that we should condition on
more than just one character
or more than just two characters.
We should condition on
context, but how
deep should we go?
It turns out that, so,
this is the compression rate
in bits per symbol
on a long file.
These are all of
Shakespeare's plays.
Unfortunately, he stopped writing
at five megabytes roughly.
One thing that's sort of
interesting to see is
that for example, here,
this green line
does best,
so with context depth five,
we compress the best.
But then in the end,
as we see more data,
depth six starts to have
a better compression rate,
and that's because Shakespeare
has written more plays,
and we've seen more data.
We've populated more histograms.
Of course, if he had
kept writing, hopefully,
then D7 could have been
the next line to win,
but he stopped writing,
so that's the problem.
So, this hierarchical
architecture is
nice because information is
actually shared,
and the simplified
Pitman-Yor process
is really good for
modeling the histograms,
and it's cheap to
do inference in it.
Now, of course, we
should worry about,
how should we set the parameters
because they clearly matter?
So, there was the
strength parameter theta,
and then the discount parameter
who knows how to pronounce
it, and they have to be set.
Perhaps, one thing that
we also believe is that
some histograms are really
different from others.
So, for example, you
can imagine that
at the beginning of a sentence,
you have a very
different distribution
over characters than say,
in the middle of a word.
Maybe we can infer which of
these situations are the case
and do the right thing.
So, how can we do that?
Clearly, we could sort of have
one parameter for
each possible context,
but that's difficult because
the more parameters we have,
the more we need to learn or
the more we need to
know how to set them.
But we could make the parameters
depend on context depth.
That means that the
longer our context is,
we just look at the length,
we choose a different set
of parameters,
different concentration,
different discount.
Then, maybe somehow, we
can learn how to set them.
Then, the second idea is perhaps,
it matters how many
observations we had and
depending on
how many observations
we had in this context,
we should choose
a different parametrization.
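The two ideas combine into a parameter table indexed by context depth and by how much a node has observed; a hypothetical helper whose names and shapes are my illustration, not the paper's:

```python
def lookup_params(param_matrix, depth, fanout):
    """Select a (theta, discount) pair from a matrix indexed by
    context depth (rows) and node fan-out (columns), clamping
    indices to the matrix bounds. The values themselves would be
    set by the gradient optimization described in the talk."""
    i = min(depth, len(param_matrix) - 1)
    j = min(fanout, len(param_matrix[i]) - 1)
    return param_matrix[i][j]

# A toy 2x2 matrix: rows = depth 0 and 1+, cols = fan-out 0 and 1+.
params = [[(1.0, 0.1), (1.0, 0.3)],
          [(0.5, 0.5), (0.5, 0.7)]]
print(lookup_params(params, depth=5, fanout=0))  # clamps to row 1: (0.5, 0.5)
```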
Seems weird, but there's actually
some fairly old research
in NLP where people
have tried such wacky ideas
and they worked really well.
So, here in this work,
we're putting
all these things together
to make a new model
and to do both.
So, we're having a matrix of
different parameters
and depending on
how deep we are and what the fan-out
of our tree node is,
so how many different
unique things we've seen;
depending on that,
we go and compress.
Of course, we need
to know how to set
the parameters and for that,
we do a gradient
optimization, and ideally,
we do so online,
This is the math that
you end up doing.
It's not pretty, but it's
pretty nice what it does.
So, we can do this online
and we get nice results.
So, here are
some famous compressors
that people might know.
I'll delete some that
we don't really need.
So, gzip is pretty famous.
Bzip is also pretty famous.
The previous world best was
an algorithm called PPMII.
PPMII stands for PPM with
Information Inheritance,
by a Russian researcher;
it was unbeaten for
more than 10 years.
Now, we have some new models.
If we just use
a context depth of eight
and just leave it at that,
so note that PPMII uses
a context depth of 16,
which is more; we
already do pretty
well, and this is
without optimization.
Once we add the optimization,
then we're doing even better.
Once we go to depth 16,
then we really start
rocking the boat
except on these three files.
These nasty files that I
should just delete: Kennedy
is a binary spreadsheet,
ptt5 is a picture, and
sum is a computer binary.
All the rest is human text.
We've engineered a model
that's good on human text,
but it pays the price for it by
not being good at other things.
So, yes, we kind of suck at
the others, but we do well.
Note that the Russian algorithm,
which is
fundamentally also
a PPM-like algorithm
and therefore similar,
also doesn't do quite
as well on those files.
The winner there is an
algorithm called LZMA,
known as lzip on Linux or
7-Zip on Windows,
and it really just rocks
on these files, and
sucks on the others.
As I said in
the beginning, there is
no such thing as a general file
compressor except if you
make 100 different models,
and mix them all together,
and that's what PAQ does,
and that's currently the
best compressor in the world,
and it comes at a huge cost
because, in its
full-powered version, it runs
600 models at the same time,
and that's not what
we want to do.
So, if we just want
to run one model,
then I recommend currently
N16 is the model to run.
So, there's a paper
coming out on the matter,
and it's going to be
presented this April at
the data compression
conference in
Utah, feel free to come.
You can also get the pre-print
on my website if
you're interested.
I think it's fairly near the top,
so you click it
and download it, and that's it.
Before I stop, I think
I'm going to talk about
one more little fun thing.
You can also use a
compressor to sample.
So, essentially, we are having
a probability distribution that
gives a probability to the data.
The minus log p is
exactly the file
length that you get
when you compress.
That's the beauty of it.
Now, you can also
condition on data,
which is what the compressor
does anyway when it
learns more about
the file that it is
compressing, and then sample.
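Sampling from the same model that compresses is just repeated draws from the predictive distribution; a toy sketch over simple fixed-depth context tables rather than the full hierarchical model:

```python
import random

def fantasize(tables, seed, length, depth=3,
              alphabet="abcdefghijklmnopqrstuvwxyz "):
    """Generate text by repeatedly sampling the next symbol from
    the model's predictive distribution given the last `depth`
    symbols, falling back to a uniform draw in unseen contexts.
    `tables` maps contexts to dicts of symbol counts."""
    out = seed
    for _ in range(length):
        counts = tables.get(out[-depth:], {})
        if counts:
            symbols, weights = zip(*counts.items())
            out += random.choices(symbols, weights=weights)[0]
        else:
            out += random.choice(alphabet)
    return out

random.seed(0)
print(fantasize({}, "the", 20))  # 23 characters of model fantasy
```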
So, I've done that for
a few things just to show.
So, I trained on Alice in
Wonderland, and this
is what comes out.
This is what
the compressor imagines,
text looks like after all that it
knows is Alice in Wonderland.
So, after reading
Alice in Wonderland,
it just fantasizes data and
that's what it looks like.
Of course, that's
not very compelling.
It wouldn't pass the Turing test,
but how about if I say,
switch to Russian?
Can you tell that it's
not Russian? I can.
This is trained on crime and
punishment by Dostoyevsky.
Or we switch to
traditional Chinese,
this is a Kung Fu novel.
Looks compelling to me.
Or well, I'm not
a native English speaker, but
this is Shakespeare;
we trained on Shakespeare,
all of his plays, and this is
what the model fantasizes.
I really like the name of
the play that it thinks up.
So, it generates
even the white space
and the new lines; it
generated it all.
Best of all, all your
favorite characters
are all in there.
It just mixes them altogether.
So, this must have been
the ultimate play.
Of course, one shouldn't
be too biblical about it.
This is trained on the Bible.
This is what it thinks text
looks like according to the
King James Version of the Bible.
So, to me, this sort of looks
encouraging in the sense
that for example,
it doesn't know
anything about words.
It doesn't know anything
about how the
characters are encoded.
It learns all this. It
learns essentially Unicode.
It learns how to form sentences.
It even makes up
new words by just sort
of combining what it
thinks are likely
characters to follow.
Still, you get some sort of
interesting coherent thing out.
So, it looks to me like this
is Shakespeare on drugs,
the Bible on drugs,
which led me to think
maybe let's actually run
a drug synthesis manual
through the thing
and get random drug
synthesis instructions.
So, this is taken from
'Phenethylamines I
Have Known and Loved'
by the Shulgin couple, who
synthesized all sorts of
hallucinogenic drugs,
tried them in self-experiments,
wrote about it,
and published it on the Internet.
This is what you get when you
train on that and you get
a totally delirious
incomprehensible
synthesis instruction
for the next big drug.
Thank you very much.
>> We have time for questions.
>> [Inaudible] Have you thought
about trying
to build higher-level structure
into the model? At the
moment it's just n-grams based on
characters; how about
higher-level
structures, like words?
>> So, the question is,
have I thought about
putting in support for words?
Yes. I thought about all sorts
of things and indeed,
making it more high level
could be interesting.
Yet, it's sort of compelling
to see how well it
does just with
this n-gram assumption.
So, there doesn't seem
to be a strict need
to model words explicitly.
But of course, I could
imagine it helping.
There are also languages
that don't really
mark words in the text.
For example, Japanese,
where you don't have explicit
spaces, and the question is
where you infer a word
starts and ends.
It's very interesting. If you're
interested to work on
this sort of thing,
feel free to get in touch.
Any other questions? Yes.
>> Is this available?
>> Is this available?
Not yet. It will be.
We've bought a website.
We haven't yet set it up.
We're going to present this
at the data compression
conference and after that,
it will be available. All
right. Thank you very much.
