[ Music ]
>> We're going to look at
two programs, two examples.
The first one is
going to be structure.
The slide material you
saw yesterday included information
on the website that
you can go to
to download the structure
package and program.
And what we have here on
the computer is that file.
And once you have it,
if you look in the
contents there is this icon
that has the various
colored bars typical
of structure output.
But you can run structure
and this is the page
that will come up.
You can go under file and
examples of all of these are
in that manual tutorial
whatever you wanna call it
that you also had.
Here, you can go to new project,
click on that and it's going
to ask you for information.
You can name the project,
what I would suggest you
do is use your initials
and call it test something
any arbitrary name.
I used Test 2 because I've
also run a Test 1 before.
You need to tell the program
where the information is. The program
and its necessary components
are on the desktop in this case,
in the structure folder,
and you select that.
The data file, we have
loaded in here a data file.
We will browse also
on the desktop
and in the structure folder we
have a folder called "data."
And because this data set
is related to the examples
in the FROG database that
we'll talk about later,
we call it FROG 1943
and that is entered.
Next, you want to enter
the number of individuals.
Well, we called it FROG 1943,
because that's the
number of individuals.
Everybody is diploid, two
chromosomes; these are all
autosomal.
The number of loci is 39.
And what we have used is minus 9
to represent a missing
data value,
not everybody will have
results for every locus.
At this step 3, we
simply click and note
that we have put the marker
names into the data file.
And here we've also put
in the individual ID.
We put in the putative
population of origin.
It's in fact not putative
because we know where all
of these individuals came from.
And we're going to
click the flag
that will use that for display.
The calculations in structure
do not use population origin
but we can display
by population.
We can look at the data format
and it tells us how many lines
of data, how many
columns, et cetera.
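To make the data settings concrete, here is a minimal Python sketch of reading a table like the one just described. The exact layout is an assumption for illustration (a marker-name row, then two rows per diploid individual with ID, population code, display flag, and one allele per locus), and the function names are hypothetical, not part of structure itself.

```python
MISSING = -9  # missing-data code used in this walkthrough


def parse_genotype_table(text, n_loci, ploidy=2):
    """Parse a whitespace-delimited genotype table.

    Assumed layout (hypothetical): the first line holds marker names;
    each later line is ID, population code, display flag, then one allele
    per locus, with `ploidy` consecutive lines per individual.
    """
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    markers, rows = lines[0], lines[1:]
    if len(markers) != n_loci:
        raise ValueError("marker count does not match n_loci")
    if len(rows) % ploidy:
        raise ValueError("row count is not a multiple of ploidy")
    individuals = []
    for i in range(0, len(rows), ploidy):
        chunk = rows[i:i + ploidy]
        ind_id, pop = chunk[0][0], int(chunk[0][1])
        alleles = [[int(a) for a in row[3:]] for row in chunk]
        if any(len(a) != n_loci for a in alleles):
            raise ValueError(f"wrong allele count for {ind_id}")
        # One tuple per locus: the allele on each chromosome copy.
        genotypes = list(zip(*alleles))
        individuals.append({"id": ind_id, "pop": pop, "genotypes": genotypes})
    return markers, individuals


def missing_fraction(individuals):
    """Fraction of genotypes that contain the missing-data code."""
    total = missing = 0
    for ind in individuals:
        for genotype in ind["genotypes"]:
            total += 1
            missing += any(a == MISSING for a in genotype)
    return missing / total


# Made-up toy data: 2 individuals, 3 loci.
example = """\
rs1 rs2 rs3
A001 1 1 1 2 -9
A001 1 1 2 2 -9
B002 7 1 1 1 2
B002 7 1 1 2 2
"""
markers, inds = parse_genotype_table(example, n_loci=3)
print(len(inds), "individuals,", len(markers), "loci")
print("missing genotype fraction:", missing_fraction(inds))
```

The counts printed here correspond to the number of individuals and number of loci you type into the new-project dialog.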
We finish and here is a
summary of what we've put in,
our project name, the
pathway, the data source,
the number of individuals
and the nature of the data.
And we can proceed and here
is what we see, the data.
We have all of the marker names,
we've got the population ID
and the flag for ordering them.
Now, at this point, we
need to set the parameters.
And there are no
parameters already stored,
it's a new project.
We have no parameters.
Now, the parameters are going
to tell us how long
to run the program.
This works on Markov
chain Monte Carlo,
which means it chooses
values in the parameter space
to get a sense of where the
parameters are best during a
burn-in period.
And then it follows that up
with a much finer sampling
around the better values.
So, because I know this
computer runs reasonably fast,
I'm going to put in 4,000 and
4,000 for a total of 8,000.
In most runs, for really
getting a better idea,
where speed is indeed
important, 10,000 burn-ins and
20,000 MCMC repetitions is sort
of the minimum one would use
but is usually quite sufficient.
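The idea of a burn-in followed by sampling repetitions can be sketched with a toy Metropolis sampler in Python. This is not the model structure fits; the one-dimensional target, the starting point, and the step size are all assumptions chosen only to show why the early iterations are thrown away.

```python
import math
import random


def log_target(x, mean=3.0, sd=1.0):
    """Log density (up to a constant) of a toy normal target."""
    return -0.5 * ((x - mean) / sd) ** 2


def metropolis(burnin, reps, start=-20.0, step=1.0, seed=1):
    """Run burnin + reps iterations and keep only the last `reps` values."""
    rng = random.Random(seed)
    x, logp = start, log_target(start)
    kept = []
    for i in range(burnin + reps):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_target(proposal)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        if i >= burnin:  # burn-in iterations are discarded
            kept.append(x)
    return kept


samples = metropolis(burnin=4000, reps=4000)
print(len(samples), "kept samples, mean", sum(samples) / len(samples))
```

The chain starts far from the good region on purpose, so the discarded burn-in is where it finds its way there; the kept samples then describe the region around the better values.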
So we want to name
this parameter set,
I'm going to give it the same
name as the project but that's
by no means necessary,
it's an arbitrary name
so that you can come back
to this parameter set.
So now, we're really
ready to run the program.
Depending upon your computer,
for reasons I do not understand,
sometimes you can immediately
start to run the project;
other times you actually have
to save at this point, get out,
come back in and
restart the project.
So I'm going to do that, we're
going to save the project
and then we're going to exit
and then we're just
going to come back in.
And we are back here
and now at file instead
of a new project we're going to
open a project and we're going
to the desktop, we're going
into structure and we're going
to look at Test 2 and here
is an SPJ that's the project,
structure project.
And we start it, there
is our data set again.
And now, we can go in
and try to start a job.
And it's going to ask us
what our parameters
are, and the K values.
The K values tell you how many
different clusters you want
to fit these individuals into.
And structure will
then find the best way
to fit these individuals,
given that it's
a Monte Carlo approximation.
So every run may be a
little bit different.
So, we know that these data are
reasonably good at clusters of 6
or 7 so we can try both of them.
And the number of iterations:
because it may settle at a local maximum,
it'll be different
potentially at every single run.
So normally, one would do 10
or 20 runs and find the one
with the highest likelihood
as the examples are
in that tutorial.
>> But here, we're just
going to do 2 replicates.
So we're talking now about
a total of 2 replicates
at K equals 6, 2 at K equals
7, and we're going to start.
And you can see in the lower
left it's actually started,
it's already done 200, 300
of these 4,000 burn-in runs.
So we simply have
to wait a minute.
We're going to do 8,000,
4,000 burn-in 4,000.
So we're already 1/8 of
the way through there
and you may want to--
we're almost at 4,000,
the burn-in has completed.
Now, we're beginning the
finer runs, and here you see
in the right hand column is the
estimated log of the likelihood
of the particular parameters
you see in that line of F1
through F6, and they're
all hovering somewhere
around minus 51,300.
So the first run has
completed and it's now started
on the second run at K equal 6.
But, while that's running,
we can go up here and look
at the results of the first run.
There is a lot of
text here with a lot
of different bits
of information.
I'm just going to scroll down
to this point fairly close
to the front where we have the mean
value of the log likelihood,
which is a value we're
going to want to use
to compare different runs.
We want the run that has
the highest likelihood.
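Comparing replicate runs can be sketched in a few lines of Python. The run records and the log-likelihood numbers below are made-up placeholders, not results from this data set; in practice you would copy the mean log likelihood out of each run's results text.

```python
def best_run_per_k(runs):
    """Return, for each K, the run with the highest mean log likelihood."""
    best = {}
    for run in runs:
        k = run["k"]
        if k not in best or run["mean_lnl"] > best[k]["mean_lnl"]:
            best[k] = run
    return best


# Placeholder values for illustration only.
runs = [
    {"k": 6, "run": 1, "mean_lnl": -1000.0},
    {"k": 6, "run": 2, "mean_lnl": -990.5},
    {"k": 7, "run": 3, "mean_lnl": -985.0},
    {"k": 7, "run": 4, "mean_lnl": -987.2},
]
for k, run in sorted(best_run_per_k(runs).items()):
    print(f"K={k}: keep run {run['run']} (mean lnL {run['mean_lnl']})")
```

Because log likelihoods are negative, the highest one is the value closest to zero.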
As you saw, they were
fluctuating around minus 51,300,
and here is the exact result
for this set of data.
But, what's more interesting
is to just look at it.
And if we go up to the bar plot
and show it, depending upon the
size of screen you've got,
you may be able to extend
this out to the right.
We don't have room for that.
So we'll have to
scroll through it.
These are individuals: each
little vertical line,
and you can see some
of them here
as thin distinct vertical lines,
is a different individual.
And they are colored by which
of the six clusters
they best fit into.
Here they're in their input
order, but we're now going
to group them by
those population IDs.
We just put in numbers, so you
have to use the number code page
that you have in the
folder, where we give the number
and what population
it corresponds to.
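What the grouped bar plot does can be sketched in Python: reorder individuals by population code, then summarize each population's cluster membership. The record layout (`id`, `pop`, `q`) and the numbers are assumptions for illustration, not structure's own output format.

```python
from collections import defaultdict


def order_by_population(individuals):
    """Sort by population code; sorted() is stable, so input order
    is preserved within each population."""
    return sorted(individuals, key=lambda ind: ind["pop"])


def mean_membership(individuals):
    """Average the per-cluster membership values within each population."""
    groups = defaultdict(list)
    for ind in individuals:
        groups[ind["pop"]].append(ind["q"])
    return {
        pop: [sum(col) / len(rows) for col in zip(*rows)]
        for pop, rows in groups.items()
    }


# Made-up membership values for a 2-cluster toy case.
people = [
    {"id": "x1", "pop": 7, "q": [0.9, 0.1]},
    {"id": "x2", "pop": 1, "q": [1.0, 0.0]},
    {"id": "x3", "pop": 7, "q": [0.7, 0.3]},
]
print([p["id"] for p in order_by_population(people)])
print(mean_membership(people))
```

As in the program, the population code is used only for ordering and display here; it plays no part in how the membership values were obtained.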
Here, we've got a
whole bunch of blue.
Let's quickly scroll across.
We're going to have six
colors corresponding
to the
different clusters
that we're trying to group
these individuals into.
You see some populations
are only one color,
others are a mess.
But let's look at
what those are.
The first, the blues,
are the Africans:
1 are the Biaka Pygmies, 7
are the Yoruba from Nigeria,
12 are the Hausa, another
Nigerian people, a group
that does not speak
a Bantu language.
They are slightly
distinct, but that doesn't show
up with these markers.
If you look at the
bottom of the screen,
you see the second
run has finished
and it's rapidly writing
out all of the results,
and then it will start
the third run.
In the meantime, we can
look at this first run,
13 and 14 are the Maasai
from Northern Tanzania
and the Chagga from the
base of Mount Kilimanjaro.
They are the porters for
those who hike up Kilimanjaro.
And 16 are the Sandawe,
a click-speaking
hunter-gatherer group
also in Tanzania.
17 are African-Americans.
And suddenly you can begin
to see individual differences
that can be fairly significant.
And we know African-Americans
are admixed.
We will talk more
about that later.
Here, 20 are the Ethiopians.
If you remember,
or go back and look
at the tree structure I
showed, this population,
these same individuals,
is intermediate
between sub-Saharan
Africa and Southwest Asia.
And here most people are close
to 50/50. That may be admixture,
but it's equally
likely beforehand
that this whole population
is intermediate.
Structure has had
lots of individuals
from Southwest Asia
here on the right,
and lots of sub-Saharan
Africans on the left,
and it doesn't have
enough degrees of freedom
or enough clusters
to allocate these
to their own cluster, so it
simply places them in between.
Whereas in African Americans
we know there is admixture,
and these may very well
on an individual basis represent
the amount of admixture.
One can't say arbitrarily
what is admixture and
what is intermediate.
So 25 and 26 are the Samaritans
and the Yemenite Jews,
28 are the Druze and then
31 are the Ashkenazi Jews.
And they have a lot of green
and a lot of yellow and a lot
of individual variation
but clearly also have a lot
of Southwest Asian
signature, if you will.
Again, how much of this is admixture
on an individual basis
is difficult to say,
but it's not surprising that
there is European gene flow
into Ashkenazi Jews, and
yet as a population
they clearly maintain a signal,
if you will, of Southwest Asia.
39 are the Adygei, a
southeastern European
population, just to the north
of the Caucasus Mountains
on the east shore of the Black Sea;
it's the only Caucasian
population we have.
You heard me say
how much I think
Caucasian is a racist term;
in a geographic sense,
these are Caucasian.
42 are Hungarians,
and as we move
through these Europeans we see
the Irish have a lot of yellow.
We have still got a little
bit of this pink showing
up. The European-Americans
are fairly "admixed,"
and then there is a
little more green as we get
into the northeastern
populations. If we look here
at 49 and 50, those
are Finns and Danes.
51 are the most northeastern
population, the Komi, who live
on both sides of
the Ural Mountains
across the northern
edge of the Urals.
56 are the Khanty, a western
Siberian population that falls
in between the
clusters we have.
57 are the Keralites from
South India, and they have a lot
of this pink, showing similarity
to the Southwest Asians,
the Middle East, and not
to Northern Europe.
Then we get into East Asians,
in light blue; the Pacific, 103
and 105, are not distinguished here,
and we then have the 4
Native American populations
that are quite distinct.
So let's take a quick look now
at other runs. The second run
at K equals 6: we can look and
see if it's a different pattern.
A lot of them are
different patterns.
Here, it looks pretty similar,
we end up with the same colors.
Here we're getting
into northern Europe,
and, quite complex, the
Khanty and the Keralites,
let me move this up, and
then we get into East Asia
and the Native Americans.
So here we got two patterns
that were really quite similar
but very different
patterns can occur.
If we stop here and
now go to K equals 7,
one of them has already
completed
and the other is getting
close to completion.
So now we're allowing
one more cluster.
And here we see all
of the Africans;
there is still roughly half
and half in the Ethiopians.
Now, the Samaritans are
very clear, and things begin
to degrade, meaning structure
cannot allocate these
individuals into a
single group very well.
Now, in part of Europe, we're
getting three colors: red, pink,
actually the light
blue, and an orange.
What does the orange represent?
The orange seems
to be representing far
northwestern Europe, the Irish.
The whole job just completed so
we can look at the second run
at K equals 7 in a moment.
And here the Komi are not orange,
but 50, the Danes, are
very much like the Irish.
Again, this is complex.
East Asia, maybe a hint
that the Pacific are a
little different but not much,
and then the New World.
We can look at the fourth run.
Here, we see very much the same
thing we were seeing: we're
getting 3 different
colors in Europe,
the same sort of pattern.
The colors are different because
each time you run structure it
arbitrarily chooses
a set of colors.
There are ancillary
programs, such as distruct
by Rosenberg, that will allow one
to make all the colors
roughly correspond
when the clusters are the same.
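The relabeling problem can be sketched in Python as finding the column permutation that makes one run's membership table line up best with another's. This brute-force search is a stand-in for illustration, not the algorithm the ancillary programs use, and the membership values are made up.

```python
from itertools import permutations


def align_clusters(q_ref, q_other):
    """Return perm such that column perm[j] of q_other best matches
    column j of q_ref (brute force; fine for K around 6 or 7)."""
    k = len(q_ref[0])

    def score(perm):
        return sum(
            row_ref[j] * row_other[perm[j]]
            for row_ref, row_other in zip(q_ref, q_other)
            for j in range(k)
        )

    return max(permutations(range(k)), key=score)


def relabel(q, perm):
    """Reorder the columns of q so they follow the reference labeling."""
    return [[row[p] for p in perm] for row in q]


# Same clustering, but the second run numbered its clusters differently.
q_run1 = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
q_run2 = [[0.0, 0.9, 0.1], [0.1, 0.1, 0.8], [0.8, 0.0, 0.2]]
perm = align_clusters(q_run1, q_run2)
print(perm, relabel(q_run2, perm) == q_run1)
```

Once the columns are matched, the same color can be assigned to the same cluster in every plot.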
And here we get essentially the
identical pattern we had before.
So unfortunately, in
this run I'm not able
to show how different
the patterns can be.
But they can be very different
and we can email
you this data set
and you can play
with it if you want.
And the data will clearly
in some cases give
very different results.
At the end of the tutorial
package is the example
of eleven different runs
of a superset of this data,
with a few more individuals
in it,
that will allow one
to get a sense
of the different likelihoods
and the different
patterns that can occur.
Now, structure is
not a good method
for assigning an
individual to a population.
You'd have to put your
unknown into a large data set
with a good number
of markers and see
which cluster the person fell
out in. In that long list
of information that comes with
the structure run, you can scroll
down and here you can
see, for each unique ID
of an individual, you
have the probabilities
or the relative clustering
in each of the 7 clusters.
And so, clearly here
are several individuals
from the same population and
they mostly cluster in cluster--
well, 1, 2, 3, 4 cluster 4
of 7, that's population 26
so these are the Yemenite Jews.
But, some of them don't,
here's one that falls primarily
in cluster 1 not cluster 4.
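Reading that per-individual table can be sketched in Python: take each person's largest membership value, find the most common top cluster in each population, and list the people who fall elsewhere. The record layout and the numbers are hypothetical, chosen to mirror the case of one individual landing in cluster 1 while the rest of the population sits in cluster 4.

```python
from collections import Counter


def top_cluster(q_row):
    """1-based index of the cluster with the largest membership value."""
    return max(range(len(q_row)), key=lambda j: q_row[j]) + 1


def flag_outliers(individuals):
    """List (id, pop, own top cluster, population's modal cluster) for
    individuals whose top cluster differs from their population's mode."""
    by_pop = {}
    for ind in individuals:
        by_pop.setdefault(ind["pop"], []).append(ind)
    flagged = []
    for pop, members in by_pop.items():
        tops = [top_cluster(m["q"]) for m in members]
        modal = Counter(tops).most_common(1)[0][0]
        flagged += [
            (m["id"], pop, top, modal)
            for m, top in zip(members, tops)
            if top != modal
        ]
    return flagged


# Made-up membership values across 7 clusters.
people = [
    {"id": "y1", "pop": 26, "q": [0.05, 0.05, 0.05, 0.70, 0.05, 0.05, 0.05]},
    {"id": "y2", "pop": 26, "q": [0.10, 0.05, 0.05, 0.60, 0.10, 0.05, 0.05]},
    {"id": "y3", "pop": 26, "q": [0.65, 0.05, 0.05, 0.10, 0.05, 0.05, 0.05]},
]
print(flag_outliers(people))
```

Individuals flagged this way are the reason a single person's cluster membership is a weak basis for assigning them to a population.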
So instead of using
structure to analyze data,
we find it's very useful for
trying to identify the sets
of markers that are
most powerful
in subdividing populations.
This is a set selected to be
able to make some differences
across Europe as well as
more continental groups
and indeed it does that.
What is better for trying
to look at the ancestry
of a given individual is to
use a likelihood approach
with relative likelihoods, and
that's what we have attempted
to implement in a very early
stage of our FROG database.
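A relative-likelihood calculation of this general kind can be sketched in Python. This is a generic illustration and not necessarily how FROG does it: it assumes biallelic markers, Hardy-Weinberg proportions, independent loci, and allele frequencies strictly between 0 and 1, and the frequencies and population names are made up.

```python
import math


def genotype_prob(n_ref, p):
    """Hardy-Weinberg probability of carrying n_ref (0, 1, or 2) copies
    of the reference allele when its frequency is p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[n_ref]


def log_likelihood(genotypes, freqs):
    """Sum of log genotype probabilities across loci; None means missing."""
    total = 0.0
    for g, p in zip(genotypes, freqs):
        if g is None:
            continue  # a missing genotype contributes nothing
        total += math.log(genotype_prob(g, p))
    return total


def relative_likelihoods(genotypes, pop_freqs):
    """Likelihood in each population divided by the best population's."""
    logs = {pop: log_likelihood(genotypes, f) for pop, f in pop_freqs.items()}
    best = max(logs.values())
    return {pop: math.exp(value - best) for pop, value in logs.items()}


# Hypothetical reference-allele frequencies at 3 loci in 2 populations.
pop_freqs = {"A": [0.9, 0.5, 0.1], "B": [0.2, 0.5, 0.8]}
person = [2, 1, 0]  # reference-allele counts at each locus
print(relative_likelihoods(person, pop_freqs))
```

The best-fitting population always gets a relative likelihood of 1, and the others are read as how many times less likely the genotype is there.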
So let me connect to the
internet, bring up FROG
and then we can start on the
second part of this exercise.
[ Music ]
