Okay, so I added two documents to my
Quechua sample.
The first one is the Wikipedia article that
we were looking at earlier.
Generally, Wikipedia articles are a lot of
garbage and a little bit of text.
So what I did was add these two JAPE grammars from
the JAPE grammar folder
to filter that out
and only use the data that's in a paragraph,
because the paragraphs are generally going to be
sentences, and that's interesting to me.
As linguists, we want to look for data in
context,
not just words
all over the place.
So that was one file. The other one was a
Quechua magazine which
I found online.
It's here.
It's a magazine written in Quechua, pretty
interesting, not too long, not too big.
Kind of a reasonable size for
exploring.
I loaded it into GATE, and it came out as pure text
like that.
There is some of the
markup that you have inside the PDF there, also.
And then I ran this script on them.
Basically, the script
goes through the whole document and looks
at the words that are in it.
There are a couple of things you might want to change
in this script. You might want to turn this
on.
Right now it's commented out. This allows
it to go through only a small section of the
tokens, which is useful if you're going to do a little bit
of debugging and don't want to run the whole document.
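
As a rough illustration of that debugging switch, here is a minimal Python sketch. It is not the script shown in the video; the names `DEBUG_LIMIT` and `tokens_to_process` are made up for this example, and the tokens are assumed to be a plain list of strings.

```python
from itertools import islice

# Hypothetical switch: leave as None to process everything, or set a
# number (e.g. 200) to look at only the first few tokens while debugging.
DEBUG_LIMIT = None


def tokens_to_process(tokens, limit=DEBUG_LIMIT):
    """Return all tokens, or only the first `limit` tokens when debugging."""
    if limit is None:
        return list(tokens)
    return list(islice(tokens, limit))
```

Calling `tokens_to_process(tokens, limit=200)` would then give a short run instead of the whole document.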
The other thing you might want to change is the outpath.
You should set this to the repository that's on your
computer.
Go all the way through to the
repository, which is here,
and deep into the source folder,
and use that as the outpath.
That way, when you create the output,
it will go into these folders,
which are in the outpath area. I'll just show you that
here on the terminal.
So if I go right here,
this is where the output goes.
The output goes into the outputtxt folder.
That's where I put the text output.
Any output that uses JavaScript, like
to make a graph or to make a nice picture,
I put in the javascript folder.
And any output that makes
a grammar I put in the japegrammar folder.
That way I can use
my output later for some other purpose,
if that output is useful.
The thing about the output is that
it's named after the document that
you run it on.
So here I added a variable which points to
the name of the document. That way,
the outputs correspond
to the document that they were run on, so that's
kind of helpful too.
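
To make the folder layout concrete, here is a small Python sketch of how output paths could be built from an outpath and a document name. The folder names `outputtxt`, `javascript` and `japegrammar` are the ones mentioned above; the function name, the file extensions and the example paths are assumptions for illustration only, not what the actual script does.

```python
from pathlib import Path


def output_paths(outpath, document_name):
    """Build one output path per folder, named after the input document.

    `outpath` is the source folder inside your local copy of the
    repository; `document_name` is the name of the document being
    processed. The extensions below are illustrative guesses.
    """
    base = Path(outpath)
    stem = Path(document_name).stem  # document name without its extension
    return {
        "text": base / "outputtxt" / f"{stem}.txt",
        "javascript": base / "javascript" / f"{stem}.js",
        "grammar": base / "japegrammar" / f"{stem}.jape",
    }
```

For example, `output_paths("/home/me/repo/src", "magazine.pdf")["text"]` would point at `outputtxt/magazine.txt` under that (hypothetical) source folder, so each document gets its own set of outputs.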
So let's take a look at the output
of the processing.
There are a couple of files here, and there are
two of each, one for each document: one for the
Wikipedia one, one for the magazine.
Let's look at the suffix one first,
because that's the fun one.
And let's pipe that into more so that I can read it
screen by screen.
So here I'm looking at words that are sorted in
rhyming order,
which means they are sorted from the back end to the front.
So there are a lot of words ending in "a"; actually, I see quite
a few words
ending in "cha".
So I might hypothesize that "cha"
is a suffix in the language, go through the data,
look for things that end in "cha", and look for things that
are systematic on the other side,
that are actually stems, for example.
I can find stems and morphology by looking
basically from the suffix in.
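
Rhyming order is easy to sketch in Python: sort each word by its reversed spelling, so words with the same ending land next to each other. This is only an illustration of the idea, not the code of the script itself; the function names are made up, and the sample words are simply ones that come up in this video.

```python
from collections import Counter


def rhyming_order(words):
    """Sort unique words from the back end to the front."""
    return sorted(set(words), key=lambda word: word[::-1])


def ending_counts(words, length):
    """Count how many unique words share each final `length` characters.

    Words that are no longer than `length` are skipped, since they
    would leave no stem behind.
    """
    return Counter(
        word[-length:] for word in set(words) if len(word) > length
    )


sample = ["mana", "wata", "uk", "pukukuna", "taqa", "qa"]
print(rhyming_order(sample))
# ['mana', 'pukukuna', 'qa', 'taqa', 'wata', 'uk']
```

An ending that shows up on many different words, as "cha" seems to in the magazine, is only a candidate suffix; it still has to be checked against the stems it leaves behind.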
So I think "cha" might be an interesting candidate.
"ja" looks moderately interesting also; it could
be just a sequence of phonemes.
"alla",
or /aja/,
could be...
That looks a lot like it's just Spanish.
So of course in a Quechua document you're going to
have quite a few Spanish words,
because Quechua is often spoken by bilinguals,
and of course it's spoken in countries
which speak Spanish.
So a lot of place names and
object names, like fruit or whatever,
are going to be in Spanish.
"taqa" looks like a
potential morpheme, or maybe just "qa",
because here's an example of just "qa" by
itself.
So, you get the idea.
That's one of the output files of the script.
The other one
will show you the frequency order. So cat the words
function file
for
the magazine, and pipe it into more.
So this shows me that the most popular word
in the magazine is "mana".
There are 71
instances of "mana" in the document.
So I would say that's definitely a function word.
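
The frequency listing can be sketched in a few lines of Python. Again, this is an assumption about how such a list could be produced, not the script's actual code; `frequency_order` is a made-up name, and the toy token list is invented so the example is small enough to check by hand.

```python
from collections import Counter


def frequency_order(tokens):
    """Return (word, count) pairs, most frequent first.

    Ties are broken alphabetically so the listing is stable.
    """
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Toy input, not the real magazine counts.
toy_tokens = ["mana", "wata", "mana", "uk", "mana", "wata"]
for word, count in frequency_order(toy_tokens):
    print(count, word)
```

Words at the top of a list like this tend to be function words, which is why "mana" stands out in the magazine.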
If I go down to around 14 repetitions,
I have this word,
/puku-jun/, /pukukuna/.
I don't really know how to pronounce
it; I'd have to check with an informant to have them
pronounce it for me.
That looks like a content word. So we're into
content words already.
Going down a little further,
we're getting
some words that are clearly not Quechua,
obviously Spanish.
Like I said, we have to be aware that
documents
in any language that's a minority language will
generally have other languages in them.
Alright, so let's get out of here. Let's go take
a look and see
what our
JAPE grammar has tagged for us. So let's take a
look,
for example, at
"mana". "mana" is the most frequent word in the document.
"uk" and "wata" are other words that are very
frequent.
Those look like they're going to be function
words, so I'm going to go find out what they do.
So let's go into my list
of Tokens.
I can sort it by type here.
If I just click on Type, it will
allow me to do that. So I'm going to go look at the
ones that are very functional.
Here's an example of "wata".
Here's another one.
Okay, "wata".
"uk" looks suspicious; it could be "UK",
you know, a website.
Yeah, no, it's definitely a Quechua word.
We can see that it generally starts a phrase
or a sentence; it's probably some sort of [discourse particle]
or a conjunction.
Here's "mana".
With these function words, all I really have to do is
look them up in a dictionary.
There's a small list of them, and they're very frequent.
You just look them up, and you can find out
what they are, and you can tag them with their
[dictionary] meanings.
Now the content words, on the other hand, are all over the
place.
And this is where a lot of the trouble is
when you [use statistics] on Quechua, at least in terms
of people who are trying to do
word counting. Words that have lots of
suffixes just aren't going to be repeated,
and you're not going to find them in a dictionary.
So that's when you need to start finding the
morphemes,
and cutting them up,
and discovering what the root is,
and thereby learning what the concepts
in the document really are.
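
Here is a hedged Python sketch of that cutting-up step: strip one candidate suffix from each word and group the words by what is left, so that forms of the same root get counted together. The function names are hypothetical, the suffix list is only the set of candidates spotted earlier, and the words in the usage note are artificial placeholders rather than corpus data.

```python
from collections import defaultdict


def split_candidate_suffix(word, suffixes):
    """Split off the longest matching candidate suffix, if any.

    Returns (stem, suffix); suffix is "" when nothing matches or when
    stripping would leave an empty stem.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""


def group_by_stem(words, suffixes):
    """Map each candidate stem to the set of word forms that share it."""
    groups = defaultdict(set)
    for word in words:
        stem, _ = split_candidate_suffix(word, suffixes)
        groups[stem].add(word)
    return dict(groups)
```

With the placeholder forms `["stem", "stemcha", "stemqa"]` and the candidates `["cha", "qa"]`, all three end up under the single key `"stem"`. A real analysis would need to strip several suffixes in a row and confirm each one with a speaker or a grammar.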
So that's just a quick intro to this
script,
which you can run on the GATE document,
and it will extract some information.
It will make a graph for you,
like this.
So you can see the proportion of function
words versus content words in the document.
We can also
compare that to the Wikipedia article.
And we can see the Wikipedia article
is a very bad article to work on [for word counting]. As you can
see, there are only a hundred and forty-three
words, so you have a very small sample size
that's very skewed, so a lot of things look like they
repeat.
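
The video does not show how the script decides which words count as function words for the graph. Purely as an assumption, the sketch below treats any word that occurs at least `min_function_count` times as a function word and reports the two proportions; both the function name and the threshold idea are illustrative, not the script's method.

```python
from collections import Counter


def function_content_proportion(tokens, min_function_count):
    """Return (function_share, content_share) of all tokens.

    Assumption for this sketch only: a word counts as a function word
    when it occurs at least `min_function_count` times.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    function_total = sum(
        count for count in counts.values() if count >= min_function_count
    )
    return function_total / total, (total - function_total) / total
```

On a very small sample, such as an article of 143 words, a fixed threshold like this is unreliable, which is one way to see why the short article makes a poor basis for word counting.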
And then you can also see some highlighting in
your GATE documents. So if I want to look only at
function words, I mean content words, in Quechua,
all I have to do is turn that on, and then I can take a look at
just the words
that have meaning, we would say, that is, that have some lexical semantics.
