- Hi everyone, today we'll be talking about the pipeline for state-of-the-art NLP. My name is Anthony, I'm an engineer at Hugging Face and the main maintainer of Tokenizers, and with my colleague Lysandre, who is also an engineer and a maintainer of Hugging Face Transformers, we'll be talking about the pipeline in NLP and how you can use tools from Hugging Face to help you with that. Your feedback is important to us, so don't forget to rate and review the sessions.
Before we start, I'd like to quickly introduce Hugging Face for those of you who have never heard about us. Everything started when we were trying to build your best friend for conversational AI: an AI capable of listening to you and talking with you about any subject. If you've seen the movie "Her", you know what I'm talking about. Let's say it's quite an ambitious goal, and yet, while having a lot of fun trying to achieve it, we started to open source the tools we built along the way, and we've been very surprised by how they've been received by the community. Soon enough, we were spending all our time on these tools, making them better day after day, and here we are today.
Transformers is our most successful project and is the most popular open source library for NLP today: we have close to 30,000 stars on GitHub, more than 1,000 research papers mentioning it, and many companies are using it in production every day. We are now pursuing a new goal, which is to help improve NLP and make it accessible to everybody, and we recently raised a new round of funding to pursue this goal. We now have multiple open source projects that can help you work on many different steps of the NLP pipeline, and we are going to show you two of them.
So on today's menu, we'll talk about transfer learning in NLP and how it applies to transformer networks, then we'll dive into tokenization, followed by Transformers models. Now let's get started with my colleague Lysandre, who will begin with transfer learning.
- All right, thank you, Anthony. We'll take a look at transfer learning, especially transfer learning in NLP, and especially how it applies to transformer models. In a few words, NLP took a gigantic turn in 2018 with the arrival of the transformer architecture: it started with the original Transformer, and immediately after, the GPT and BERT transformers arrived and obtained state-of-the-art results on many different NLP tasks. So first of all, before we dive into how transformers work and how transfer learning makes them so efficient, let's take a quick pause to understand what exactly transfer learning is and how it differs from traditional machine learning. In traditional machine learning, you have multiple tasks and multiple learning systems, each of which needs to learn on its own specific task: if you had three different tasks, you would need three different learning systems. There were no real multitask systems, and you weren't focusing your efforts on the different tasks at once, and this works.
However, some tasks can actually leverage knowledge shared across different tasks. For example, if you're working with text, with two different tasks that both operate on text, the underlying principle is language: if you have one task that understands language, then another task can surely reuse the knowledge that was acquired. That is exactly what transfer learning tries to do: you have one or multiple source tasks on which you train your learning system, and then your target task can use the knowledge acquired while training on those source tasks to get better results with less data.
So how does that apply to NLP? Through sequential transfer learning, which is very similar to what I just explained. It's done in two steps: the first step is pre-training and the second step is the adaptation, also called fine-tuning. Pre-training is a very computationally intensive step where you have a lot of data, you require a lot of compute, and you're basically trying to cram as much knowledge as possible into a system. Those systems can range from word embeddings, like word2vec or GloVe, to the very recent transformer networks, like GPT, BERT or DistilBERT. Once you have this general-purpose model, you're going to try and adapt it to different tasks, and in order to do so you require less data, since you already have a strong knowledge base.
So now, how does this apply to transformer networks? Transformer networks, in a few words, are very, very large neural networks: they went from a few million parameters to billions of parameters. The biggest transformer network, the one with the largest number of parameters, is GPT-3, which was released last week and has 175 billion trainable parameters. With such a large number of parameters you get a very big capacity, and we can train those very big neural networks on very big datasets.
However, one flaw of this approach is that it requires a lot of compute to be trained, and that's where transfer learning really kicks in, that's where it gets really interesting. Training from scratch in order to obtain a general-purpose language model, a model that may be specific to a domain or a language but is completely task-agnostic, requires thousands and thousands of compute hours, a very, very large corpus, and days or even weeks of training just to get to that stage. However, once you have the pre-trained language model, fine-tuning it on different tasks is actually very easy: you only need small datasets, because most of the knowledge is already in the model. You don't need your base model to learn an understanding of language from that small dataset; you're relying entirely on the previous training and just fine-tuning a bit on top of it for that specific task. So transfer learning is really useful in the case of very large models like transformer networks, and that's where model sharing is especially important, because reproducing a pre-training is completely impractical: it costs a lot, and since the resulting model is general-purpose and task-agnostic, it's very important to share that pre-trained model so that other users can just reuse it and fine-tune it on their downstream tasks or on their own datasets. This is something we're very proud of at Hugging Face, where we allow the easy retrieval and distribution of models, entirely for free, so that users can share compute.
So now let's take a deeper look at the inner mechanisms of the transfer learning pipeline for natural language processing: the pre-training and fine-tuning of transformer architectures. The transfer learning pipeline in NLP is composed of two big steps. The first part is the tokenization aspect, which Anthony will present in a bit, and the second part is the prediction aspect. The tokenizer's goal is to take some input, like a sequence, a sentence or a dump of text, and convert it into inputs understandable by the model.
Once the sequences have been converted into understandable inputs, they can be fed to the model, which can then make a prediction on top of them. So how does pre-training a language model work? There are different ways of doing it, but the most frequent and the best way to do it is language modeling: learning to predict text given other text. One huge advantage of this method is that it doesn't require human annotation, and since pre-training requires a lot of data, not requiring human annotation is a very big deal, because you don't want to have to annotate millions of examples when you can just leverage the self-supervised aspect of language modeling.
That's also very interesting in the case of languages that have only a little data, the low-resource languages, because in these languages it's very hard to acquire datasets with enough annotated data to perform a pre-training; however, just obtaining a dump of text is easy enough and can be used to obtain a sufficiently good pre-trained model.
So let's take a quick look at how language modeling operates in NLP; we'll look at two objectives right now. The first one is masked language modeling, or MLM, which is also known as the cloze task and is a very old language modeling method. What we're trying to do here is mask some of the inputs and ask the model to predict what was there instead of the mask; we're going to ask it to fill in the mask. In a traditional NLP pipeline, this would look like the following.
You have a sentence, "the pipeline for state-of-the-art natural language processing", which you first tokenize, so you convert it into tokens, and then you mask some of those tokens. For example, here the token "natural" was masked, it was replaced by a mask token, and you then ask the model to replace that mask token. Here we asked the model to replace it and give its five most probable answers, which were "natural", "artificial", "machine", "processing" and "speech", which all make sense in that context; it was an already pre-trained model that did this completion. At first, of course, the model wouldn't make such correct predictions, but by making it do a lot of these completions and training it on them, that's how you would train a language model.
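To make this concrete, here is a minimal sketch of that masked-word completion using the fill-mask pipeline from transformers; the default checkpoint the pipeline downloads, and therefore the exact predictions and scores, are assumptions rather than the model shown on the slide.

```python
from transformers import pipeline

# Masked language modeling: ask a pretrained model to fill in a masked token.
unmasker = pipeline("fill-mask")
mask = unmasker.tokenizer.mask_token  # use the checkpoint's own mask token

# Ask for the most probable replacements of the masked word.
for prediction in unmasker(f"The pipeline for state-of-the-art {mask} language processing."):
    print(prediction["token_str"], prediction["score"])
```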
The second objective is causal language modeling, the CLM objective. Here, instead of masking some of the inputs, you want the model to predict the next token, the token following the sequence. This is interesting because it actually trains the model to do text generation: since it's always trying to generate the next token, you can chain that and it will, in the end, generate a whole sequence just from the context you gave it at first. So this is interesting, but it only attends to the left-hand-side context, compared to the MLM objective we saw previously, where, since you're masking a token that can be in the middle of the sentence, you have access to both the right context and the left context. Here, you only have access to the left context. So usually CLM doesn't get results on downstream tasks that are as good as MLM's; however, it allows for text generation, which is a very useful feature. All right, so now we'll let Anthony present the tokenization, which is one major aspect of the natural language processing pipeline.
- Yeah, let's dive a little bit more into the tokenization. I want to start by quickly talking about the role of the tokenizer. As we saw, in NLP the inputs, the data that we generally process, are basically some raw text, like in this example, "Jim Henson was a puppeteer", but our models obviously only work with numbers. So we need to find a meaningful way to transform this raw text, these strings, into numbers; that's what the tokenizers do, and there are a lot of different ways to do this, but our goal is generally to find the most meaningful representation, the one that makes the most sense for models, and possibly the smallest one. So let's see some examples of tokenization algorithms and the questions we get to ask ourselves with them.
The first kind of tokenization that comes to mind is simply based on words, and it's really easy to set up and use, generally with just a few rules, and it often gives decent results. In this case, the goal is to split the raw text into words and find a numerical representation for each of them. This usually requires splitting the text somehow, and we have to choose how to do it: do we want to keep the punctuation with the words, or maybe separate it into its own tokens, or follow some other rule? With this kind of tokenizer we generally end up with fairly large vocabularies: a word like "dog" gets a different representation from a word like "dogs", and the same applies to "run" versus "running", for example. Each of these different words gets assigned an ID, generally starting from zero and going up to the size of the vocabulary, that we'll use to identify each word; so the more words we have, the more IDs we need and the bigger the vocabulary. This also means that we generally need some token to represent any word that is not in our vocabulary, which is what we call an out-of-vocabulary token, also known as the unknown token. Now, we don't like this token, because it means we are not able to represent some of the words we might get as input, and we are losing information along the way, so that's something we want to avoid as much as possible.
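As a toy illustration of word-level tokenization and the unknown token, here is a small sketch; the vocabulary and the [UNK] symbol are made up for the example.

```python
# A toy word-level tokenizer: split on whitespace, map each word to an ID,
# and fall back to an unknown token for anything outside the vocabulary.
vocab = {"[UNK]": 0, "the": 1, "dog": 2, "dogs": 3, "run": 4, "running": 5}

def word_tokenize(text):
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

# "is" is not in the vocabulary, so it becomes the unknown token (ID 0).
print(word_tokenize("The dog is running"))  # [1, 2, 0, 5]
```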
One way to reduce the risk of having out-of-vocabulary tokens is to go one level deeper; I'm talking about character-based tokenizers. In this case we split our text into characters, and one of the advantages is that we end up with much smaller vocabularies, and also far fewer out-of-vocabulary tokens, since most words can be built from these characters. Here too we get to ask ourselves: do we want to keep the spaces and punctuation or not? One thing about this is that, intuitively, characters feel like they carry less information, since each character doesn't mean a lot by itself. This is not true for some languages like Chinese, for example, where each character carries more information than in Latin-based languages, but in English, for example, a character doesn't mean much on its own. Another thing to consider is that with such a technique we end up with very large amounts of tokens for our models to process, and this can have an impact on the size of the context we can carry around, for example. Now, we've seen some good results with this kind of tokenizer, so it really is interesting to consider in some cases, and we can also think about a hybrid technique, built from the word-based approach we saw before, with character-based tokenization as a fallback when the word is not in the vocabulary.
But actually, let's see something that is even better than that: I'm talking about subword tokenization. You might have heard about BPE, or byte-pair encoding, before. This algorithm was initially used for compression, introduced in 1994 by Philip Gage, before it got applied to NLP in 2016 by Rico Sennrich and his co-authors in the paper "Neural Machine Translation of Rare Words with Subword Units", and it brought some really interesting improvements.
The idea with BPE is to start by building an alphabet composed of the unicode characters that will serve as the base vocabulary, and then to start building new tokens for the most frequent pairs we find in the original corpus. For example, in English, the letters T and H are often seen together, so they end up being merged into a new token, "th", and later on we might see that "th" is often seen next to the letter E, so they get merged into the token "the", and we keep building up more and more tokens up to some target vocabulary size. Since we start from the characters and build up towards words, when a word is not part of the vocabulary we are generally able to use multiple tokens to represent it: in this example, you can see that "tokenization" can be represented by "token" and "ization", and the same would apply to "standardization", for example, which could be split into "standard" and "ization".
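Here is a tiny toy sketch of a single BPE merge step, just to illustrate the pair-counting idea; a real implementation repeats this until it reaches the target vocabulary size.

```python
from collections import Counter

# Count adjacent symbol pairs in a (tiny) corpus and pick the most frequent one.
corpus = [list("the"), list("then"), list("other"), list("them")]

pair_counts = Counter()
for word in corpus:
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += 1

# The most frequent pair becomes a new token, e.g. ('t', 'h') -> "th".
best_pair = pair_counts.most_common(1)[0][0]
print("merge:", best_pair)
```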
These subwords end up carrying a lot of meaning while also being space efficient: we get very good coverage with relatively small vocabularies and far fewer unknown tokens. There are other techniques out there, like byte-level BPE, which is very interesting: it was introduced with GPT-2 by OpenAI and uses bytes as the base alphabet instead of unicode characters. This means the initial vocabulary fits in 256 different symbols, which is the number of values a byte can take, instead of the more than 100,000 different unicode characters.
There is also WordPiece, which is a form of subword tokenization that is actually very similar to the way BPE works, used by Google in models like BERT, and also the more recent Unigram model, generally implemented in SentencePiece, which brings some improvements over BPE by changing the way tokens are merged. So now, knowing all of this, I can tell you why we decided to build the Tokenizers library and how it works. The first reason is simply performance: our models usually run with frameworks like PyTorch, TensorFlow or even ONNX, and all of these provide great performance, but the tokenization often happens in pure Python and ends up being really slow, sometimes a bottleneck in the NLP pipeline. Having to pre-process your entire dataset before actually training the model shouldn't be a requirement, because it means that whenever you want to change something you have to start over, and you can see how that becomes really cumbersome; so we definitely wanted to improve this experience and provide tokenization fast enough that you don't even notice it. It was also a great occasion to have all the various tokenization algorithms under one roof with a shared API: this makes experimenting and switching tokenizers way easier. We want it to be easy to share your work and access the work of others, and finally, we also want to make it really easy to train a tokenizer, for example on a new language or a new dataset.
So now, let's see how we actually do this. A tokenizer is actually a pipeline: the input text goes through this pipeline, and in the end we get something ready to be fed to the model. The first step is the normalization; that is where we transform our input, generally where we deal with whitespace, lowercase everything, or maybe apply some unicode normalization. Then we have the pre-tokenization: in this step we take care of pre-segmenting the input, which in most cases simply means splitting on whitespace, for languages that use whitespace, but it could be anything that makes sense for our specific use case. Once all of this is done, we are ready to apply the actual tokenization algorithm; this is where BPE, Unigram, word-level or any other tokenization algorithm does its magic. And finally, the last step is the post-processing: here we add the special tokens, like for example the CLS and SEP tokens in BERT, we take care of truncating the input so that it fits the model, pad if necessary, et cetera.
Now let's look at some code and how to build a custom tokenizer. This first example shows how to build a byte-level BPE. We start by building a Tokenizer based on an empty BPE model, and we attach a normalizer, in this case a classic unicode NFKC normalization. We also attach our pre-tokenizer, here a ByteLevel one, which will take care of splitting the raw text into words, where possible, and transforming the input into the right representation so that we can process the text at the byte level. This also means that we need a decoder: this one will take care of transforming the tokens back into readable unicode characters when we want to decode from IDs back to text. And that's it, our tokenizer is ready and we can now train it. For the training, we specify the target vocabulary size as well as the special tokens we plan on using, we give it a bunch of files, and we just let it work. The training step actually trains the model, so when the training is done, it just means that our BPE model is not empty anymore: it now has a vocabulary.
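Here is a sketch of that byte-level BPE setup with the tokenizers library; the training file name, vocabulary size and special tokens are placeholders, not the exact values from the slide.

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers, decoders, trainers
from tokenizers.models import BPE

# Start from an empty BPE model and attach the pipeline components.
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = normalizers.NFKC()              # unicode normalization
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()   # byte-level representation
tokenizer.decoder = decoders.ByteLevel()               # back to readable text

# Train the BPE model: target vocabulary size, special tokens, and a bunch of files.
trainer = trainers.BpeTrainer(vocab_size=30_000, special_tokens=["<unk>", "<pad>"])
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)  # placeholder file name
```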
I also want to show you another example of tokenizer, in this case a WordPiece tokenizer like the one used in BERT. Just like before, we initialize the tokenizer, this time with an empty WordPiece model. This time we use a Sequence as the main normalizer, which is just a utility helper we can use to combine multiple normalizers; in this case we want to strip each string we receive as input and also lowercase everything. We use a simple Whitespace pre-tokenizer that will split on whitespace and keep the punctuation as separate tokens, and we set up our decoder: in this case we need it to decode the IDs while handling the "##" prefixes that WordPiece uses for subword continuations, merging the word pieces back together. Once again, we're ready for training. There is a final step here though, which is about setting up the post-processor, and if you remember the pipeline we described before, this is the part that actually adds the special tokens. We define it here, after the training, because we need to initialize it with a token and a token ID for both the SEP and CLS tokens, and in order to use the token_to_id method of the tokenizer, we can do that only after it has been trained.
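Here is a sketch of that WordPiece setup; again, the training file and vocabulary size are placeholders.

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers, decoders, trainers, processors
from tokenizers.models import WordPiece

# Empty WordPiece model, with strip + lowercase normalization and whitespace
# pre-tokenization that keeps punctuation as separate tokens.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([normalizers.Strip(), normalizers.Lowercase()])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.WordPiece()  # merges the "##" word pieces back together

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000, special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"]
)
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)  # placeholder file name

# The post-processor adds [CLS]/[SEP]; it needs their IDs, which only exist
# after training, hence it is set up last.
tokenizer.post_processor = processors.BertProcessing(
    sep=("[SEP]", tokenizer.token_to_id("[SEP]")),
    cls=("[CLS]", tokenizer.token_to_id("[CLS]")),
)
```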
So after these examples, our tokenizers are ready to be used. We can also specify a maximum length if we want, by enabling truncation; this ensures that the inputs to the model are always of the right size. We can also enable padding, for example so that all the sequences we encode as batches have the same length; in this case, we specify which padding token and which ID we want to use.
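Continuing with the tokenizer from the sketch above, enabling truncation and padding could look like this; the maximum length is an arbitrary example.

```python
# Truncate every encoded sequence to a maximum length (here an arbitrary 512),
# and pad batched sequences to the same length with the [PAD] token.
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(pad_token="[PAD]", pad_id=tokenizer.token_to_id("[PAD]"))
```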
And we can save this tokenizer: everything we defined before gets saved in a single file, the normalizer, the pre-tokenizer, the model, the padding options, everything. This means that whenever we want to use this tokenizer in the future, all we need is a single line of code to load it from that file, ready to use. If you want to try another tokenizer, your code doesn't change: you just change the file from which you load the tokenizer.
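Saving and reloading is a one-liner each way; the file name is just an example.

```python
from tokenizers import Tokenizer

# Everything we configured (normalizer, pre-tokenizer, model, post-processor,
# truncation and padding options) is serialized into a single JSON file.
tokenizer.save("tokenizer.json")

# Later, a single line reloads a fully configured tokenizer; switching
# tokenizers just means pointing this at a different file.
tokenizer = Tokenizer.from_file("tokenizer.json")
```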
Now, of course, each tokenizer can be used to encode some input text, and here we actually support multiple types of inputs. We can encode single sequences, but also pairs of sequences: for example, if you want to do question answering, you would encode the context and the question as a pair of sequences. You can also encode batches of sequences, both single sequences and pairs of sequences, and even mix them when that's relevant, and we even support pre-tokenized inputs, for example if you want to do some entity extraction and your dataset already comes split into words.
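A quick sketch of those different input types, with made-up example sentences:

```python
# A single sequence.
encoding = tokenizer.encode("What is the pipeline for state-of-the-art NLP?")

# A pair of sequences, e.g. question + context for question answering.
pair = tokenizer.encode("Where is Hugging Face based?", "Hugging Face is based in NYC.")

# A batch of sequences.
batch = tokenizer.encode_batch(["First sequence.", "Second sequence."])

# Pre-tokenized input, e.g. for entity extraction on word-split data.
pretok = tokenizer.encode(["Jim", "Henson", "was", "a", "puppeteer"], is_pretokenized=True)
```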
In all of these cases, when you encode you get an Encoding back, and this Encoding contains everything you may need: the IDs, obviously, but also the generated tokens if you want to check what they look like. This example shows what some byte-level tokenization looks like; don't be surprised by the strange character prefixing some tokens, at the byte level this character actually represents a whitespace. We also provide the offsets for each token, so that you can extract text from your original input if needed; that's really helpful in the case of question answering, for example, where you want to extract the answer from the original text and maybe highlight it. You can also find the special tokens mask, the attention mask, and a lot of other things we'll let you discover.
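Here is a small sketch of what you can read off an Encoding:

```python
encoding = tokenizer.encode("The pipeline for state-of-the-art NLP.")

print(encoding.ids)                  # token IDs to feed to the model
print(encoding.tokens)               # the generated tokens, for inspection
print(encoding.offsets)              # (start, end) character offsets into the original text
print(encoding.attention_mask)       # attention mask
print(encoding.special_tokens_mask)  # which positions are special tokens
```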
So yeah, give it a try: it's as easy as pip install tokenizers, or npm install tokenizers if you're working in Node, and we'll keep adding bindings for new languages in the future. Now I'll let Lysandre talk to you about Transformers.
- All right, thank you, Anthony. So Transformers is the other big part of the natural language processing pipeline, because it takes care of the prediction, right after the tokenization. Since 2018 and the arrival of the first transformer architecture, there really has been an explosion in the number of transformer models: just to name a few, GPT-3 was released last week, before that there was Meena from Google, and several others over the past few months, and each of those transformer architectures differs a bit from the previous ones. For example, BERT uses WordPiece tokenization and was trained with masked language modeling and next sentence prediction, while GPT-2 uses byte-level BPE tokenization and was trained with causal language modeling, and it's the same for every single transformer model: even though they're very similar, they have their own quirks and slightly different APIs. What we try to do in Transformers is offer all of those transformer architectures so that they are very simply usable under the exact same API.
While doing so, we try to make it accessible to as many users as possible. We first started with only providing our models in PyTorch; they have all been available in both PyTorch and TensorFlow since last September, and since last month we have our first two models, BERT and RoBERTa, in the JAX/Flax framework from Google. All of those models train and run on CPU, GPU and TPU, with each framework's specific optimizations, like XLA for TensorFlow, TorchScript for PyTorch, half precision, and others. To understand what an inference script looks like using Transformers: it leverages two abstract classes, PreTrainedTokenizer, which is completely based on the Hugging Face Tokenizers library that Anthony just presented, and PreTrainedModel, which is an abstract class encompassing the models.
Both PreTrainedTokenizer and PreTrainedModel offer the exact same API, whichever model and tokenizer pair you choose to use. So for example, if you had written an inference script with BERT, switching to GPT-2 would be as simple as changing "bert" to "gpt2" to completely change the transformer. With Transformers, we publicly host the pre-trained tokenizer vocabularies and model weights on a model hub, which allows easy sharing and use of pre-trained models and tokenizers. Right now, we have more than 1,600 model and tokenizer pairs that you can use very simply, as shown in the code sample here, using just the tokenizer's from_pretrained method and the model's from_pretrained method as well.
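A minimal sketch of that shared from_pretrained API; the checkpoint names are just examples, and swapping one name for another is enough to change the architecture.

```python
from transformers import AutoModel, AutoTokenizer

# Download a pretrained tokenizer/model pair from the model hub.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

# Replacing "bert-base-cased" with "gpt2" (or any other checkpoint name)
# is all it takes to switch to a different transformer.
inputs = tokenizer("Hello, Transformers!", return_tensors="pt")
outputs = model(**inputs)
```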
Another abstraction available in Transformers is the pipeline abstraction, which handles both the tokenization and the prediction. It uses reasonable defaults, which means that you can get state-of-the-art predictions without tuning any settings; however, pipelines are still very customizable, which means that if, for example, you want to use a pipeline in a different language than English, you can simply specify which pre-trained model and tokenizer pair you would like to use, and it will automatically download that pair from the model hub and use it.
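For instance, a pipeline can be pointed at any model/tokenizer pair on the hub; the multilingual checkpoint below is just one example of what such a pair could be.

```python
from transformers import pipeline

# Override the defaults with a specific model/tokenizer pair from the hub.
nlp = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
    tokenizer="nlptown/bert-base-multilingual-uncased-sentiment",
)
print(nlp("Ce film était vraiment superbe !"))
```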
So let's check a few use cases; that's where it gets really interesting. Here we have sentiment analysis, which is a kind of sequence classification. For the next three examples we'll be using the pipeline abstraction, and for the last example we'll be looking at how it works without the pipeline abstraction, just using the tokenizer and model directly, to see what really happens inside the pipeline.
So for example, with the pipeline it's very easy to do sentiment classification: you start by importing the pipeline and initializing it with the task you want to complete, here "sentiment-analysis", then you just call this object, nlp, with the sequence that you want to classify as positive or negative. The result of nlp("I hate you") is a NEGATIVE label with a score of 99.9%, and the result of nlp("I love you") is a POSITIVE label with a score of 99.9%.
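In code, the sentiment-analysis example looks roughly like this; the exact scores depend on the default checkpoint the pipeline downloads.

```python
from transformers import pipeline

nlp = pipeline("sentiment-analysis")

print(nlp("I hate you"))  # e.g. [{'label': 'NEGATIVE', 'score': 0.999...}]
print(nlp("I love you"))  # e.g. [{'label': 'POSITIVE', 'score': 0.999...}]
```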
Now for our next task, question answering, which I find absolutely incredible: the question answering task in natural language processing is, given a context and a question, to extract an answer from the context that answers the question. Here, like in the previous slide, you would initialize the pipeline with the "question-answering" task and then define a context, which in this case I will read out loud: "Extractive question answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the run_squad.py script." Then you call the nlp object with a question and the context I just read. For example, asking "What is extractive question answering?" given the context returns a result with a score and the start and end locations of the answer, in this case "the task of extracting an answer from a text given a question", which is entirely correct, and for the second question, "What is a good example of a question answering dataset?", the answer is "SQuAD dataset", which is correct as well.
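The question-answering example looks roughly like this:

```python
from transformers import pipeline

nlp = pipeline("question-answering")

context = (
    "Extractive question answering is the task of extracting an answer from a text "
    "given a question. An example of a question answering dataset is the SQuAD dataset, "
    "which is entirely based on that task. If you would like to fine-tune a model on a "
    "SQuAD task, you may leverage the run_squad.py script."
)

print(nlp(question="What is extractive question answering?", context=context))
print(nlp(question="What is a good example of a question answering dataset?", context=context))
```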
Next is causal language modeling, or text generation, which is the same thing we saw previously. You first initialize the pipeline, as we've done before, with the "text-generation" string, and then create a sequence that you would like to be completed, for example here "The Spark + AI Summit". We want the model to complete the sentence following that initial context, so we call nlp on the sequence, which results in the following generated text: "The Spark + AI Summit was a special event held in January 2016 to celebrate the emergence of a major and enduring innovation, the Spark... The event opened with a keynote by Dr. Stephen Hawking, creator of AI, although it still has many of...", et cetera.
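The text-generation example looks roughly like this; the generated continuation will differ from run to run.

```python
from transformers import pipeline

nlp = pipeline("text-generation")
sequence = "The Spark + AI Summit"

# The model keeps predicting the next token until it has produced a full
# continuation of the initial context.
print(nlp(sequence))
```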
You can see here that the text is syntactically coherent even though it's factually incorrect. Now, here is an example using the tokenizer and model without relying on the pipeline abstraction. We're going to do sequence classification again, similar to the sentiment analysis, but on a different task. As you can see in the code sample, we're loading the tokenizer and model from the checkpoint bert-base-cased-finetuned-mrpc: "bert-base-cased" means that it is the original BERT base model with the cased vocabulary that was released by Google with BERT, and "finetuned-mrpc" means that this checkpoint was fine-tuned on the MRPC task. MRPC is the Microsoft Research Paraphrase Corpus, a dataset for identifying when two sequences or two sentences are paraphrases of each other. So here we initialize the tokenizer and model from that checkpoint, next we define the classes, "not paraphrase" and "is paraphrase", and we define three sequences: the first one is "The company Hugging Face is based in New York City", the second one is "Apples are especially bad for your health", and the last one is "Hugging Face's headquarters are situated in Manhattan".
The first two sequences are completely unrelated, so definitely not paraphrases of each other; however, the first and last sequences practically mean the same thing, so maybe they're not word-for-word paraphrases, but they're very, very similar to each other, and we want our model to predict that the first and last sequences are paraphrases. We now use the tokenizer to encode the sequence pairs, so that it combines the two sequences into a pair using the model's special tokens and builds tensors that the model can then use to output a prediction; that's why we have return_tensors="pt", which means that we want PyTorch tensors to be returned. So paraphrase and not_paraphrase here are two dictionaries containing PyTorch tensors that the model will use to make its prediction. Now we unpack the paraphrase and not_paraphrase dictionaries into the model and take the first result, the first return value of the model.
Since the model returns a tuple of outputs containing a lot of different data about how the prediction was computed, we only want the first result, which is the classification logits: the paraphrase classification logits and the not-paraphrase classification logits. Now we just pass these results through a softmax, to make sure our results are between zero and one so we can read them as probabilities or percentages, and we print the results. The first and last sequences should be paraphrases, and the result is "not paraphrase" at 10% and "is paraphrase" at 90%, so the model did identify that these two sequences are very similar. The second case should not be a paraphrase, and it comes out as "not paraphrase" at 94% and "is paraphrase" at 6%, so the model correctly identified that the first and second sequences were not paraphrases of each other.
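Here is a sketch close to the code described above; it uses the public bert-base-cased-finetuned-mrpc checkpoint, and the exact percentages will depend on the model version.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-cased-finetuned-mrpc"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company Hugging Face is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "Hugging Face's headquarters are situated in Manhattan"

# Encode each pair of sequences into model-ready PyTorch tensors.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

# The first element of the model output holds the classification logits.
paraphrase_logits = model(**paraphrase)[0]
not_paraphrase_logits = model(**not_paraphrase)[0]

# Softmax turns the logits into probabilities over the two classes.
print(torch.softmax(paraphrase_logits, dim=1))      # should lean towards "is paraphrase"
print(torch.softmax(not_paraphrase_logits, dim=1))  # should lean towards "not paraphrase"
```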
So, using the model and tokenizer directly is still very easy and seamless, since we offer these high-level methods that make converting sequences into model inputs very simple. But Transformers isn't limited to inference, it can also train models, and we offer quite a lot of example scripts in our library, in both TensorFlow and PyTorch, for a few tasks that are listed here: named entity recognition, sequence classification, question answering, language modeling, both fine-tuning and from scratch if you want to pre-train a language model, and multiple choice, all of which train on CPU, GPU and TPU. As well as this, since our models are simply bare-bones PyTorch and TensorFlow models, you can use them with the respective training frameworks: in PyTorch, you can just use them with a simple training loop or with PyTorch Lightning, and in TensorFlow you can use them with the Keras fit method or with the TensorFlow estimators. We also offer a Trainer class that can simply be overridden in Python to build a very simple training loop specific to NLP.
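As an illustration only, here is a minimal PyTorch fine-tuning loop on a tiny made-up dataset; in practice you would use a DataLoader, an evaluation set, and one of the example scripts or the Trainer class.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# Toy training data, just to show the mechanics.
texts = ["I loved this talk!", "This was a terrible experience."]
labels = torch.tensor([1, 0])
inputs = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):
    loss = model(**inputs, labels=labels)[0]  # first output is the loss when labels are given
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```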
With this presentation we've just scratched the surface of Transformers, which offers many, many different possibilities. To name just a few that were added in the past few weeks: we've added the ELECTRA model and the ELECTRA pre-training method, the Reformer for very high efficiency, the Longformer for very, very long sequences, and encoder-decoder architectures for translation and summarization. So now we've seen both the tokenization and the prediction aspects of the full NLP pipeline, but is that really everything, is that really everything that covers the NLP pipeline? Well, not really, because there are still the data part and the metrics part that need to be covered, since obtaining data, feeding it to the tokenizer in a memory-friendly way, and computing metrics afterwards are separate tasks in themselves. That's why we've created the Hugging Face nlp library: it offers memory mapping, computes the metrics automatically, is built on top of Apache Arrow, and provides more than 100 different datasets that can be used very, very simply with the tokenizers and Transformers models.
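A tiny sketch of the dataset side; note that the library was released under the name nlp and has since been renamed datasets, and the call below uses the current name. The same library also handles metric computation.

```python
from datasets import load_dataset

# Datasets are memory-mapped, so even large ones stay light on RAM.
dataset = load_dataset("squad", split="train")
print(dataset[0]["question"])
```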
So that's it for the full pipeline of current natural language processing using Hugging Face tools. Thank you very much for listening to our talk; your feedback is very important to us, so please don't forget to rate and review the session. Thank you from both Anthony and me, thank you very much.
