[MUSIC PLAYING]
ELENA NIEDDU: I'm excited to be here and to talk to you about the In Codice Ratio project-- a project going on at Roma Tre University-- and about how TensorFlow helped us build a model that is able to transcribe ancient manuscripts in the Vatican Secret Archive.
So some introduction first.
This is our team.
On the right, you can see
paleographers and archivists.
And on the left, there is
us, a data science team.
And that's why I think the
name we chose, In Codice Ratio,
reflects us very well.
Because it's a wordplay between the Italian and the Latin meanings of the word "codice." Now, in Latin, "in codice ratio" would mean knowledge through manuscripts.
But the word "codice" in Italian
also means software code,
so it's also knowledge
through software,
which is exactly what
we're planning to do.
And so you might
ask yourselves, what
brings paleographers and
archivists and data scientists
together?
Well, they have one
problem in common.
They both want to discover
knowledge from big data.
We are used to thinking of big data as something that happens on the web. But actually, historical archives are an endless source of historical, cultural, and otherwise important information.
And just to give you a scale of
how large this information can
be, let's just
compare for a second
the size of the
Vatican Secret Archive
to the height of Mount Everest.
Now, if you were to take each shelf of the Vatican Secret Archive and stack them one on top of the other, you would get something 85 kilometers tall. That is about 10 times the height of Mount Everest.
And the content spans the
centuries and the continents.
For example, there are letters coming from China, from Europe, from Africa, and, of course, from the Americas.
So what is our goal?
Our goal is to build tools and technology that enable historians, archivists, and scholars of the humanities in general to perform large-scale analysis on historical archives.
Because right now, the
process, let me tell you,
is entirely manual.
You still have to go there,
consult the documents manually,
and be able to read that
very challenging handwriting.
And then, if you find information that may be linked to another collection, you have to do it all by yourself.
But first, we have to face the very first challenge. When you are dealing with web content-- for example, if you want to extract data from the internet-- well, that's already text. But when we're dealing with historical documents, that's often scans.
And traditional OCR is
fine for printed text.
But then you get to this.
This is medieval handwriting.
It's Latin, a language
nobody uses anymore.
It's a handwriting nobody is
able to write or read anymore,
for that matter.
It's heavily abbreviated.
And still, you want to get text out of it.
So you might want to train a machine learning model. Of course you do.
But then, we come to
the second challenge.
And that is scalability in the
data set collection process.
Now, the graph you see there is on a logarithmic scale. And it might show you something that you already know, known as Zipf's law, which tells you that there are very few words occurring a huge number of times, and then most of the words do not occur that often.
What does that mean for us? It means that if we want to collect data at word level, at vocabulary level, we have to annotate thousands of lines of text, which means hundreds of pages, OK?
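Just to make that concrete, here is a minimal sketch-- not our actual code, and the corpus file name is a made-up placeholder-- that counts word frequencies in a text and prints the rank-frequency head you would expect from Zipf's law.

```python
# Minimal sketch: inspect the Zipf-like rank-frequency distribution of a corpus.
# The corpus file name is a hypothetical placeholder.
from collections import Counter

with open("latin_corpus.txt", encoding="utf-8") as f:
    counts = Counter(f.read().lower().split())

for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    # Under Zipf's law, freq is roughly proportional to 1 / rank:
    # a handful of words dominate, and most words are rare.
    print(rank, word, freq)
```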
And similar systems do exist. They are state-of-the-art systems. But most paleographers, even when they know of these tools, get discouraged from using them, because they say, well, it's not cost-effective for me. It can take up to months, or even years, of work on these documents just to get a transcription system that they will maybe use once or twice-- I don't know-- whereas they would like to do it faster.
So we asked ourselves, how can we scale on this task? And we decided to go by easier, simpler steps. The very first thing that we did was to collect data for single characters. And this enabled us to involve not paleographers, but people with much less experience.
We built a custom
crowdsourcing platform
that worked pretty much
like CAPTCHA solving.
What you see there is an actual
screen from the platform.
So the workers were presented with an image and with a target. And they had to match the target and select the corresponding areas inside the image.
And in this way, we were
able to involve more than 500
high school students.
And in about two
weeks' work, we made
more than 40,000 annotations.
So now that we had the data, we wanted to build a model. When I started working on the project, I was pretty much a beginner in machine learning. And so TensorFlow helped me put into practice what I was studying in theory.
And so it was a
great help that I
could rely on tutorials
and on the community
and, where everything else
failed, even the source code.
So we started experimenting, and we decided to start small first. We didn't want overkill. We wanted the model to fit our data exactly. So we started small and proceeded incrementally, in this phase, in a constant cycle of hyperparameter and model tuning: choosing the best optimizer, the best weight initializers, the number and type of layers, and then evaluating and training again.
Then we used Keras.
It was good for us because
it allowed us to keep
the code small and readable.
And then, this is
what we settled for.
It might look trivial.
But it allowed us to get up
to a 94% average accuracy
on our test characters.
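To give you an idea of what "small" means here, this is a minimal sketch of that kind of character classifier in Keras. It is not our exact architecture-- the 56x56 grayscale input, the 22 character classes, and the layer sizes are illustrative assumptions only.

```python
# A minimal sketch of a small character-classifier CNN in Keras.
# Input size, number of classes, and layer sizes are assumptions,
# not the exact model described in the talk.
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 22  # hypothetical number of medieval character classes

model = keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(56, 56, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_split=0.1, epochs=20)
```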
So where does this fit
in the whole scheme
of the transcription system?
It's there in the middle.
And it's actually, so far, the only [INAUDIBLE] part, but we are planning to expand. And we will see how later.
And so we have the input image. So far, we're relying on an oversegmentation that is a bit old-school, but it allows us to feed single characters or combinations of characters into the classifier, which then produces different candidate transcriptions, which are ranked according to a Latin language model that we also built from publicly available sources.
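Here is a minimal sketch of what that ranking step looks like-- not our actual code; the toy Latin corpus and the candidate readings are made-up placeholders-- using a simple character-bigram language model to pick the most plausible candidate.

```python
# Minimal sketch: rank candidate transcriptions with a character-bigram
# Latin language model. Corpus and candidates are hypothetical.
import math
from collections import Counter

def train_bigram_model(text):
    """Return add-one-smoothed character-bigram log-probabilities."""
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    vocab = len(set(text))
    def logprob(a, b):
        return math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
    return logprob

def score(word, logprob):
    # Sum of bigram log-probabilities: higher means more Latin-like.
    return sum(logprob(a, b) for a, b in zip(word, word[1:]))

latin_text = "in principio erat verbum et verbum erat apud deum"  # toy corpus
logprob = train_bigram_model(latin_text)

# Candidate readings produced by the character classifier for one word.
candidates = ["verbum", "uerbvm", "verbmn"]
best = max(candidates, key=lambda w: score(w, logprob))
print(best)
```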
How good do we get? We get about 65% exact transcriptions. And we can get up to 80% if we consider minor spelling errors, or if the segmentation is perfect. We will see that segmentation can be more challenging.
OK.
So what are our plans for the future?
We're very excited
about the integration
of TensorFlow and Keras.
Because I described the process as being fully Keras. What we actually found out was that sometimes some features were lagging behind, and sometimes we wanted to get part of the features from Keras and part from TensorFlow.
And so we found
ourselves doing lots of--
I don't know if that's
your experience, as well--
but we found ourselves
doing lots of back and forth
between TensorFlow and Keras.
And now, we get the best of both worlds,
so we're very
excited about that.
And so how do we plan to expand
our machine learning system?
First things first, we are trying U-Nets for semantic segmentation. These are the same nets that achieved very good results on medical imaging. And we're planning to use them to get rid of this tricky, old-school computer vision segmentation. And that would also give us classification at the same time, because this is semantic segmentation we're talking about.
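For reference, this is a minimal sketch of a small U-Net-style model in Keras. It is only an illustration-- the 256x256 input size, the depth, and the 23 output classes are assumptions, not our actual model.

```python
# Minimal sketch of a small U-Net-style model in Keras for semantic
# segmentation of manuscript page crops. Sizes and class count are
# illustrative assumptions only.
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

num_classes = 23  # hypothetical: one class per character type, plus background

inputs = layers.Input(shape=(256, 256, 1))     # grayscale page crop
c1 = conv_block(inputs, 32)                    # encoder
p1 = layers.MaxPooling2D()(c1)
c2 = conv_block(p1, 64)
p2 = layers.MaxPooling2D()(c2)
c3 = conv_block(p2, 128)                       # bottleneck
u2 = layers.UpSampling2D()(c3)                 # decoder with skip connections
c4 = conv_block(layers.Concatenate()([u2, c2]), 64)
u1 = layers.UpSampling2D()(c4)
c5 = conv_block(layers.Concatenate()([u1, c1]), 32)
outputs = layers.Conv2D(num_classes, 1, activation="softmax")(c5)  # per-pixel classes

unet = Model(inputs, outputs)
unet.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```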
These are some
preliminary examples
that work particularly well.
Of course, there is still work that we have to do.
And then, of course, since
there could still be ambiguity,
we could do error correction
and then transcription.
But I think this
would be, in itself,
a significant improvement.
And another thing we're
experimenting with
is enlarging our data set.
Because we don't want
to stick to characters.
We want to evolve.
We want to move to word-level, and even sentence-level, annotations.
But still, our
focus is scalability
in the data set collection.
So we want to
involve paleographers
as little as possible.
So for example, these are our generated inputs from a GAN. But we are also planning on using, for example, a variational autoencoder, so that we can grow our data set with as little human interaction as we can.
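To show the flavor of that idea, here is a minimal sketch of a variational autoencoder in Keras that could generate synthetic character images. It is just an illustration under assumptions-- the 56x56 image size, the latent dimension, and the training call are all hypothetical.

```python
# Minimal sketch (illustrative only) of a variational autoencoder in Keras
# for generating synthetic character images to enlarge a data set.
import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 16
flat = 56 * 56          # hypothetical 56x56 grayscale character crops, flattened

class Sampling(layers.Layer):
    """Reparameterization trick: z = mean + sigma * epsilon.
    Also adds the KL-divergence term of the VAE loss."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
        self.add_loss(kl)
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# Encoder: flattened image -> latent sample.
enc_in = layers.Input(shape=(flat,))
h = layers.Dense(256, activation="relu")(enc_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])

# Decoder: latent sample -> reconstructed image.
h = layers.Dense(256, activation="relu")(z)
dec_out = layers.Dense(flat, activation="sigmoid")(h)

vae = Model(enc_in, dec_out)
# Reconstruction loss; the KL term was added inside the Sampling layer.
vae.compile(optimizer="adam", loss="binary_crossentropy")
# vae.fit(images, images, epochs=30)   # images flattened and scaled to [0, 1]
```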
And in the end, this would bring us to actually use sequence models that could take full advantage of sentence-level context, for example, and could even solve things that we couldn't solve with single-character classification-- for example, abbreviations. In this kind of text, many words occur abbreviated, just like when you text: you might say "me too" and use 2, the number, or write "4U." And it's the same with this kind of manuscript.
And that's one of the applications you could have. Also, we are planning to use sequence models to get to a neural language model, because so far we have only experimented with statistical ones.
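As a rough illustration of what such a neural language model could look like-- not something we have built, and with made-up vocabulary and sequence sizes-- here is a minimal character-level sketch in Keras.

```python
# Minimal sketch of a character-level neural language model in Keras,
# the kind of sequence model that could replace a purely statistical one.
# Vocabulary size and context length are illustrative assumptions.
from tensorflow.keras import layers, Model

vocab_size = 30     # hypothetical: Latin letters plus a few symbols
seq_len = 40        # characters of context

inputs = layers.Input(shape=(seq_len,))
x = layers.Embedding(vocab_size, 64)(inputs)
x = layers.LSTM(128)(x)
outputs = layers.Dense(vocab_size, activation="softmax")(x)  # next-character probabilities

lm = Model(inputs, outputs)
lm.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# lm.fit(context_windows, next_chars, epochs=10)
# Trained on Latin text, the model can score how plausible a candidate
# transcription is, character by character.
```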
And one last thing
before I let you go.
I mentioned the people in the team, but there are so many people I would like to thank who were not on that slide. And first of all, Simone, who should have been here but couldn't make it. He was my machine learning Jedi Master.
And then Pi School of AI and
Sébastien Bratiéres and Lukasz
Kaiser for their
amazing mentoring.
And Marica Ascione, who is the high school teacher who actually allowed us to involve those students who were part of the platform. And, of course, all of the graduate and undergraduate students who worked with us and helped us achieve what we have achieved and what we plan to achieve in the future.
And of course, thank
you for your attention.
[APPLAUSE]
[MUSIC PLAYING]
