[ Music ]
>>My training was in
human population genetics.
And I started working
with DNA polymorphisms
in the early 1980's to study
humans for a variety of reasons.
And became involved in the
late 80's and early 90's
when DNA started
being introduced
into court as an expert witness.
And I testified in many
different places around the US
and in Canada and found
it quite interesting.
It soon became unnecessary
for expert witnesses.
So I basically dropped
out of forensics.
It was never my main
area of research.
And then at the time of the
World Trade Center attacks,
an expert committee was set up,
and I was asked to be on it.
It became obvious that the
standard forensic markers
for individual identification
did not work well
in a sizable number of cases
because of the extreme
degradation of the DNA
and because they could
say nothing about ancestry
of the sample, which
would in itself be a help
in identifying the person.
At that point I realized
that we had all the expertise
and the samples necessary
to do a lot of research.
So I got back involved.
And let me then start the
lecture and talk about some
of what we're doing and why
DNA can be very important both
in ancestry and phenotype.
And I could say that
DNA is going
to make everything you've
learned so far unnecessary,
but I won't go that far.
[Laughter]
>>DNA is not going
to solve all problems
but it already can be
extremely helpful when available
and it's going to be much
more helpful in a few years
because of projects
that are ongoing now.
Caveat. What we know now, as
I hope you will learn,
is often overinterpreted --
that there is the implicit thing
that if a conclusion is
based on DNA it's immutable,
it's true, it's precise.
Ain't so. So let's talk about
the DNA in the human genome.
We've got mitochondrial DNA,
which is a very small part
of the genome -- less than
17,000 base pairs as a circle --
compared to the nuclear
DNA which is
over 3.3 billion
base pairs of DNA.
The autosomes are 22 chromosomal
pairs of varying sizes.
The sex chromosomes, one pair
unmatched, females have two X's.
Males have an X and a Y. The
Y chromosome can be subdivided
into a small part that
recombines with the X chromosome
at the tip of the chromosome.
That is important in segregation
during meiosis forming gametes,
and then the non-recombining
part which is inherited
without any recombination.
So two parts -- they're
both interesting
and have different implications
for how they're studied.
All of these segments of
DNA have polymorphisms.
So what's a polymorphism?
It literally translates
to a part of the DNA
that occurs in many forms.
Depending upon how big
the segment you look at,
one base pair will generally
occur only in two forms,
sometimes three and four.
But in general the least
common form must have a
frequency of at least one
percent, or the most common form
less than 99 percent.
So we try to make a distinction
between a polymorphism --
which in general
must mean normal even
if it's got functional
differences because millions
of people around the world
will have that form of DNA --
and the rare variants
that cause disease.
So the other part of this,
the idea that the site is the
polymorphism that occurs
in different forms, each
of which is an allele.
There's a great tendency to
call an allele a polymorphism.
It's the site that's the
polymorphism, and also SNP,
single nucleotide polymorphism,
that you'll hear about more.
So the types of polymorphisms
are a combination
of how one detects it and
the nature of the variation.
So the restriction fragment
length polymorphisms were one
of the first technologies.
Those can be almost any of the
other types in terms of the DNA.
It's basically a way
of detecting variation.
Pretty much obsolete.
The short tandem
repeat polymorphisms,
I always put the P on.
In forensics you generally
hear of it as STRs.
But it's the polymorphism
part that's important.
They're short tandem repeats.
The VNTRs are generally
longer segments that occur
in tandem repeats, but
similar in concept.
Insertions, deletions -- a
bit of DNA from megabases,
huge deletions in some areas
that seem to be compatible
with normality down to one
base pair more or less.
The single nucleotide
polymorphisms instead
of an adenine or an
adenosine, you've got guanosine,
et cetera just in
the string of DNA.
And then the copy
number variation
where it's not a tandem repeat,
but a segment of DNA may occur
in two copies or three
copies -- sometimes tandem,
sometimes a whole
segment is missing --
analogous to a deletion.
So let's -- mitochondrial DNA.
It's this small loop of DNA.
It occurs in the mitochondrion.
It's the remnant of
an early parasite
that invaded an early cell.
And it's now dependent
upon the nuclear genome
and we're dependent
upon the genes here
as the major energy producing
genes and apparatus of the cell.
It has its own slightly
different DNA code.
And it's got its own transfer
RNA for making proteins
and its own machinery for
making its own proteins.
But most of the proteins are now
made in the nucleus of the cell
and imported into
the mitochondrion.
The relevant point
here, of course, is
that there are many
polymorphisms.
It's almost all coding.
So there are great restrictions
on what variation can occur
because it has to be
compatible with function.
So some variants are recurring.
They have arisen
independently many times
because both alleles are
compatible with functioning.
The other thing is that the
control region is highly
variable because it only
has limited function.
As long as most of it's
the same, the replication
of the mitochondrion
occurs normally.
It doesn't code for a protein.
So you'll see a lot of studies
of hypervariable regions
which are highly polymorphic and
then of the single nucleotide
or other variants around the
rest of the mitochondrion.
An advantage is that for every
cell, there are one, two,
maybe more thousands of
copies of mitochondrial DNA.
So it is much more prevalent in
a sample than is nuclear DNA.
And hence it has been studied
because it doesn't need
quite the sophistication
and characterization
as nuclear DNA.
So a lot of human
ancestry information
and forensic identification has
been based on mitochondrial DNA.
Some of it is very
powerful, but it's not
as powerful as nuclear DNA.
So basically this summarizes
what I've just been saying
about the variation
in both segments.
But at any one site there's not
such a huge number of variants.
Now the relevant
thing for ancestry
and even individual
identification is
that the mitochondrial
DNA is inherited only
through the mother.
Because the little sperm is
only a bundle of nuclear DNA
that gets into the egg.
But the egg comes already with
all of the mother's mitochondria
and hence her mitochondrial DNA.
So even males have
mitochondrial DNA.
They just don't transmit
it to their children.
It's entirely based on what they
inherited from their mother.
So going back five generations,
how many of your ancestors
have your mitochondrial DNA?
[ Pause ]
>>One out of 32, assuming
there's no inbreeding along
the line.
[Pause]
>>My father was inbred because
his mother and father were
at least five generations
removed from the two brothers
who originally settled
in the colonies.
And they met just because
the young soldier coming
through town was asked if
he'd met the Widow Kidd
and her three beautiful
young daughters.
So young soldier, of
course, wanted to meet them
and married one of them.
So within five generations
it is not that uncommon.
So you're not learning
much about your ancestry
from this type of DNA.
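The one-in-32 figure can be checked with a short sketch. The function names are illustrative only, and it assumes no inbreeding, so every ancestor slot is a distinct person.

```python
def ancestors_in_generation(generations_back: int) -> int:
    """Number of ancestor slots in one generation, assuming no inbreeding."""
    return 2 ** generations_back


def uniparental_fraction(generations_back: int) -> float:
    """Fraction of that generation lying on the strictly maternal line.

    Exactly one ancestor per generation sits on the mother's-mother's-...
    line, so the fraction is 1 / 2**generations_back. The same count
    applies to the strictly paternal line.
    """
    return 1 / ancestors_in_generation(generations_back)


for g in range(1, 6):
    print(g, ancestors_in_generation(g), uniparental_fraction(g))
```

Five generations back there are 32 slots, and only one of them carries the mitochondrial lineage.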
What about Y chromosome,
the non-recombining part
of your Y chromosome.
It shows exactly the same
pattern except it's the
paternal lineage.
So again, you're not learning
much about your ancestry --
your overall ancestry
-- from just this.
And I'll add that
some of the companies
that do ancestry testing
will use only Y-chromosome
or only mitochondrial DNA.
And they're not telling you
a lot about your ancestry.
So I mentioned the mitochondrial
variation and the Y chromosome.
There are relatively few genes.
So there's a fair amount of
single nucleotide polymorphism.
And there are also many
STRPs in the Y chromosome
that have a higher
mutation rate.
[Pause]
>>Now let's look at autosomal.
Here is a sample pedigree
where I've colored the
alleles coming down.
So we have ampersand and
at sign alleles in a sister
and we have number sign and
star allele in a brother.
They got the opposite alleles.
Now on this monitor,
the green and the --
the green and the white
don't show up very well.
But what can be said
about the ancestry
who five generations
ago contributed each
of these alleles?
Well, here we have
the tracing back.
And notice that this
blue ampersand allele,
as we go back we
have a homozygote
and another homozygote.
And so we cannot really identify
where this particular
allele came from.
But it has to be one of
those five ancestors.
Similarly the at sign -- here
going back through the white
because here is white --
it could only have come
from here even though
both have the light green.
That means the light green
must have come from here.
So from here, we can go back.
Again a homozygote, it
could have been from either
of those or this ancestor.
So we know roughly
where that came from.
We can go back with the
hash mark, the number sign,
and the star is the
only one where we know
that the father's father's
mother's father's father was the
origin of that particular
allele.
But clearly overall
we're beginning
to get a profile
of the ancestry.
And if we look at
many individual loci,
each locus is going
to tell us something
about some of those ancestors.
And with many loci,
because they're independent,
we get a picture of all of them.
So the next point is
measuring variation.
We know that these polymorphisms
exist in frequencies --
the individual alleles -- in
frequencies in a population.
And a standard measure we
use is called FST --
F is the letter we use for
the inbreeding coefficient,
subpopulation to total, S and T.
So in theory it's related
to random genetic drift.
Has anybody heard of
random genetic drift?
A couple. You know that
if you have two children,
there's a 50 percent
chance that you give to both
of them the same
allele that you have.
And that chance over many people
means that the gene frequency
in a population of children
will not be exactly the same
as the gene frequency
in the parents.
And the smaller the number
of parents and children,
the greater the possible
fluctuation would be.
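A rough way to see why smaller populations fluctuate more is sketched below. It assumes a simple binomial sampling model (each generation draws 2N gene copies at random from the parents), which is a textbook idealization and not something the lecture specifies. The function name is hypothetical.

```python
import math


def drift_sd(p: float, n_individuals: int) -> float:
    """Standard deviation of the allele frequency in the next generation.

    Assumes 2 * n_individuals gene copies are sampled binomially from a
    parental pool with allele frequency p.
    """
    copies = 2 * n_individuals
    return math.sqrt(p * (1 - p) / copies)


for n in (10, 100, 1000, 10000):
    print(n, round(drift_sd(0.5, n), 4))
```

Under this model the expected one-generation fluctuation shrinks with the square root of the population size.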
So among different
populations, over time,
random genetic drift
causes some changes.
And FST is a way, theoretically,
of explaining that.
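One common way to put a number on FST for a two-allele site is sketched here. It uses the heterozygosity-based form, FST = (HT - HS) / HT, and assumes equally sized subpopulations. The lecture does not say which estimator was used, so treat this as an illustration only.

```python
def heterozygosity(p: float) -> float:
    """Expected heterozygosity 2pq at a two-allele site."""
    return 2 * p * (1 - p)


def fst(subpop_freqs: list[float]) -> float:
    """FST = (HT - HS) / HT, assuming equally sized subpopulations.

    HT is the expected heterozygosity at the pooled mean frequency;
    HS is the average expected heterozygosity within subpopulations.
    """
    p_bar = sum(subpop_freqs) / len(subpop_freqs)
    h_t = heterozygosity(p_bar)
    if h_t == 0:
        return 0.0  # the site is not variable anywhere
    h_s = sum(heterozygosity(p) for p in subpop_freqs) / len(subpop_freqs)
    return (h_t - h_s) / h_t


print(fst([0.5, 0.5]))   # identical frequencies
print(fst([0.2, 0.8]))   # moderate divergence
print(fst([0.0, 1.0]))   # fixed for different alleles
```

Identical frequencies give zero, and populations fixed for different alleles give one.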
So here's an example.
I'm going to show
many slides like this.
And I always arrange African
populations on the left,
Middle East, European
populations.
Here's Western Siberia,
East Asia,
a couple of Pacific
Island, Eastern Siberia,
North America, South America.
So here you see in black
one of the two alleles --
the two have a frequency
summing to one.
So the other allele is
one minus this or coming
from the top instead
of the bottom.
And you see two different
polymorphic loci show different
patterns of variation
around the world.
The expectation for any
polymorphism you know nothing
about in advance is
that there's a lot
of gene frequency
variation around the world.
We are all alike
in ethical ways.
But we are all different
genetically.
Even if we're from the same
ethnic group, we're different.
And so those differences
become important.
[Pause]
>>Here's another example,
but these have low
variation around the world.
Same populations, same order.
But they're not identical.
One of the ways, just as an
example, to look at genotypes,
here are a bunch of individuals.
Each dot is an individual.
And this is using a TaqMan assay
which measures fluorescence
as a function of the genotype.
And so across the bottom
you've got the intensity
of fluorescence for one fluor
and on the Y axis you've
got the intensity
of fluorescence for
the other fluor.
We typed 384 individuals
at a time.
That's what each dot is.
And so you can see the blue
here represent individuals
who have only the allele
fluorescing in blue.
They are homozygous,
only fluorescing
in red hence homozygous
for the other allele,
and a bunch of heterozygotes
who fluoresce both colors.
And the controls down here as
black squares and those samples
that did not give an
interpretable result.
And here are some that
were not interpreted.
Here's one for whatever reason
low fluorescence, low intensity,
whatever -- not being
interpreted.
Here's one of the controls that's
not where we'd like it to be.
This is clearly real data.
But it's certainly not up here.
So it's not really
affecting the interpretation.
So that's sort of a
little bit of background.
But now exactly how are we
using some of this DNA variation
and the polymorphism
in forensics?
So it can be used to
identify a criminal.
That's the way it's
classically being used now.
There's DNA from a crime scene.
You've got the suspect's DNA.
They match.
And that's evidence
for identity.
Most of what we're interested
in here is identifying
human remains or maybe
from the crime scene trying
to make some inference
of the ethnicity or ancestry or
the phenotype of the individual
who left that DNA
at the crime scene
on a supposition
that's the criminal.
But DNA is used all the
time in parentage testing.
And in the court system,
the best use of DNA is
to exonerate innocent people.
I was once asked in
cross examination
if I were falsely accused of
a crime and there was DNA,
would I allow the DNA lab
and the Royal Canadian
Mounted Police Forensics Labs
to test my DNA?
And my response was, 'Of course.
It's the surest way I know
to prove I'm innocent.'
At which point the judge said
to the cross examining
defense attorney,
'Don't you think you should
stop working for the prosecution
and excuse this witness?'
[Pause]
>>So identifying human remains.
You may have, based on
what we've experienced
from the World Trade Center
attacks, may have known DNA.
Almost all of the firemen
-- the first responders --
had given samples for
bone marrow donation.
And so there was a known DNA
sample available to test.
Clearly a lot of relatives
brought in toothbrushes,
brought in dirty underwear,
brought in all sorts
of personal objects from
which DNA could be obtained.
And to date -- I forgot to
bring the number with me,
but it's over 1600 of the
individuals have been identified
with at least one
little piece of bone.
[Pause]
>>Determine the phenotypic
characteristics.
What hair color -- natural hair
color -- did the person have?
What skin color?
Can we say anything
about whether it was
thick or thin hair?
Could we say anything
about height?
Or determine the ancestry in
terms of more indigenous --
geographically indigenous
-- origins.
So the forensic question
in matching
to a known person is first
what are the DNA patterns?
So this is a molecular
and a laboratory issue.
Has the DNA been
analyzed correctly?
Have the patterns been
interpreted correctly?
Then, do the two patterns match?
Is the method used
specific enough
that if the results are
the same, you could say
that match for that locus?
Then the statistics,
what are the chances
that two unrelated people
have the same pattern?
Obviously that becomes very
critical assuming the molecular
is done well.
And that's where
databases are needed
because it all depends
upon the allele frequency.
If the frequency of an
allele is 90 percent
and you've got two homozygotes
for that allele, well 81 percent
of the population has
both alleles the same.
That really doesn't
exclude a lot of people
as not being the same.
Not very informative.
And we'll get into that later.
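The 81 percent figure is what an allele frequency of 0.9 gives when squared. The sketch below assumes random mating, so the fraction of people homozygous for an allele is its frequency squared. The function name is made up for illustration.

```python
def common_homozygote_fraction(p: float) -> float:
    """Expected fraction of people with two copies of an allele of
    frequency p, assuming random mating."""
    return p * p


for p in (0.5, 0.9, 0.99):
    shared = common_homozygote_fraction(p)
    print(f"p={p}: {shared:.4f} share the genotype, {1 - shared:.4f} excluded")
```

The closer the allele is to fixation, the fewer people a matching homozygous genotype excludes.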
So the CODIS markers --
the standard short tandem
repeat polymorphisms
used in cases nationwide
now -- or a panel
of individual identification
SNPs
are clearly appropriate
for this kind of question.
The lab methodology
is pretty good.
And there are fair databases.
But individual identification
is not the only type.
And remember I mentioned earlier
the CODIS markers are not good
for ancestry.
They were picked because
they are highly variable,
almost every place.
And so there's not
a lot of difference
in allele frequencies among
different populations.
So I came up with
a classification
of four types a few years ago.
There are individual
identification SNPs.
They have very low probabilities
of two individuals having
the same multisite genotype.
So each SNP is optimized
and the panel is very good.
Ancestry-informative SNPs would
be sort of the opposite --
the high probability that an
individual's ancestry comes
from one part of the
world or maybe admixed
from two parts of the world.
Lineage SNPs are where
we're trying to get
down to individual clans within
a group -- extended families,
organized crime where
it is a family.
And the phenotype-informative
SNP --
SNPs that will, based on allelic
differences that control parts
of the phenotype, will
tell you something
about how a person looks.
[Pause]
>>So there are different
requirements
for these different
purposes of using SNPs.
And I'm concentrating on SNPs
because that's really the
best type of DNA for any
of these applications in
terms of laboratory methods,
numbers of markers available,
and the detailed annotation.
So the importance here is
that for the individual,
the ancestry, and the
clannish or lineage markers,
they represent a small fraction
of all available polymorphisms.
So one wants to search
for and optimize a set
that is particularly appropriate
for one of those purposes.
The phenotype informative SNPs
are also uncommon, but they deal
with specific phenotypes.
And as yet, though there
are good candidates,
they're poorly documented
for exactly how they function
in development of the phenotype.
So there are now
five or six loci
that we know are clearly
involved in the amount
of melanin in the skin, but we
don't know how they interact.
So while we can type them
and make predictions,
the predictions are
based on associations
without a clear understanding
of the interactions
when you look at all of them.
[Pause]
>>So general criteria.
I'm reiterating myself
to an extent.
Readily typable, has a unique
marker, highly informative
for the stated purpose,
and well documented
for such relevant
characteristics
as allele frequencies,
association with
phenotype, biology.
So which ones are
going to be best?
So we want the maximum amount
of information per SNP,
but what do we mean
by information?
And we want SNPs that are not
subject to typing difficulties
and what kind of typing
difficulties exist.
So additional slides
will amplify the first.
Let me verbally amplify
the second.
Almost all of the typing methods
involve using bits of DNA
that are complementary to
either conduct amplification
of a fragment of DNA
or specifically probe
the small region
around where the
known variant is.
But if there are other
variants nearby that interfere
with either of those then one
may not get an accurate reading.
The test fails.
And so if you've
got a heterozygote,
you only detect one
of the alleles.
It's not that the
polymorphism is not valid.
It's that that method does
not detect it accurately.
[Pause]
>>There are other problems
that anybody working
in a laboratory knows about
the phase of the moon, the --
what you ate for
dinner the night before.
All of these are
probably real variables
in a humorous sense at least.
But no method is perfect.
No dataset is error free.
We have to try to minimize them.
And that's where the
prior work will be best.
So in terms of amount of
information, we're talking
about alleles, we're talking
about allele frequencies.
But what we see in the
population is individuals
who have two copies -- one from
the mother, one from the father.
So fortunately back in 1908,
a geneticist asked
a mathematician
about this question.
And Hardy and Weinberg came
up with this very
simple relationship based
on elementary probability.
And as a function of
the gene frequencies,
you can see here the
genotype frequencies.
And basically P squared, 2PQ,
Q squared are the
Hardy-Weinberg ratios.
It's the square of the quantity
P plus Q, the quantity squared
where P and Q are frequencies
of the two different alleles.
So it's very elementary
probability.
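The relationship just described can be written out directly. This is a minimal sketch for a site with two alleles. The function name and the ordering of the returned values are choices made here, not something from the slides.

```python
def hardy_weinberg(p: float) -> tuple[float, float, float]:
    """Expected genotype frequencies (p^2, 2pq, q^2) for a two-allele
    site, where p is the frequency of one allele and q = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p
    return p * p, 2 * p * q, q * q


for p in (0.1, 0.3, 0.5):
    homo_1, het, homo_2 = hardy_weinberg(p)
    print(p, round(homo_1, 4), round(het, 4), round(homo_2, 4))
```

The three values always sum to one, since they are the expansion of (p + q) squared with p + q = 1.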
If we want the most
diversity within a population,
we clearly want the
allele frequencies to be
at point five -- point five.
So for individual
identification,
the lowest probability
of somebody unrelated
being the same is
if the allele frequencies
are equal: P
equals Q equals point five.
But remember I said they always
differ among populations.
So here's where the low
FST and here we're talking
about heterozygosity, the
frequency in this green line
of an individual having
two different alleles being
a heterozygote.
In the zygote they had
two different alleles.
So that's one aspect.
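A short sketch can confirm that equal allele frequencies give the lowest chance of two unrelated people matching at a site. It assumes Hardy-Weinberg proportions and, for the panel calculation, statistically independent sites. The 40-site panel is a hypothetical example, not a description of any real marker set.

```python
def heterozygosity(p: float) -> float:
    """Expected fraction of heterozygotes, 2pq."""
    return 2 * p * (1 - p)


def locus_match_probability(p: float) -> float:
    """Chance that two unrelated people share a genotype at one site:
    the sum of squared genotype frequencies."""
    q = 1 - p
    return (p * p) ** 2 + (2 * p * q) ** 2 + (q * q) ** 2


def panel_match_probability(freqs: list[float]) -> float:
    """Product over sites, assuming the sites are independent."""
    result = 1.0
    for p in freqs:
        result *= locus_match_probability(p)
    return result


grid = [i / 100 for i in range(1, 100)]
best = min(grid, key=locus_match_probability)
print(best, locus_match_probability(best), heterozygosity(best))
print(panel_match_probability([0.5] * 40))  # hypothetical 40-site panel
```

On this grid the minimum falls at 0.5, where heterozygosity is also at its maximum, and multiplying across many such sites drives the match probability down very quickly.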
For ancestry identification,
we want the opposite.
We want one population
to be like this
and the other population
to be like that.
So when we test it,
we've got a distinction
between the populations.
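One way to turn that contrast into an inference is to compare how likely a person's genotypes are under each population's allele frequencies. The sketch below assumes Hardy-Weinberg proportions, independent sites, and frequencies strictly between 0 and 1. The two populations and their frequencies are invented for illustration.

```python
import math


def genotype_probability(copies: int, p: float) -> float:
    """Probability of carrying 0, 1, or 2 copies of an allele with
    frequency p, under Hardy-Weinberg proportions."""
    q = 1 - p
    return (q * q, 2 * p * q, p * p)[copies]


def log_likelihood(genotypes: list[int], freqs: list[float]) -> float:
    """Sum of log genotype probabilities across independent sites."""
    return sum(
        math.log(genotype_probability(g, p)) for g, p in zip(genotypes, freqs)
    )


# Hypothetical allele frequencies at three sites in two populations.
pop_a = [0.9, 0.8, 0.1]
pop_b = [0.1, 0.2, 0.9]

# Hypothetical sample: copies of the counted allele at each site.
sample = [2, 2, 0]

llr = log_likelihood(sample, pop_a) - log_likelihood(sample, pop_b)
print(llr)
```

A positive value favors the first population. With sites whose frequencies barely differ between populations, the value would stay close to zero.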
[Pause]
>>So, let's review now in terms
of ancestry information a little
bit about what we really do know
about modern human evolution.
This is, at its basics,
no longer controversial,
absolutely accepted.
There's of course
infinite argument
about the nitty picky
fine details.
That will always go on.
This is science and these are
humans who are looking at it.
But it's clear.
Modern humans evolved in Africa
roughly 200,000 years ago.
And it's also very clear
that considerable genetic
variation accumulated
in Africa and it's still there.
Where are the shortest
people in the world?
[Pause]
>>African pygmies.
Where are, on average, the
tallest people in the world?
The Nilotics in Africa.
[Pause]
>>And tremendous variation.
In the US, among non
scientists, there tends to be --
and even among some
scientists who don't know much
about human variation --
there tends to be the
assumption Africa is
genetically homogeneous.
Well, no. [Pause]
>>About 100,000 years ago --
and here's where the argument,
some say as recently as 80,000,
some have even said
60,000 years ago --
some individuals left
Africa into southwest Asia.
And the single population
had only a small fraction
of the genetic variation
present in Africa.
And that population
then expanded
to occupy the rest of the world.
[Pause]
>>And here is how I put
it in a pointillist way.
And this has been reproduced
in National Geographic.
And if you see the race
exhibit that's going in museums
around the country
it's currently
at the Smithsonian
in Washington.
>>This is part of the triple A
online also -- The Race Project.
>>Is it online?
Well it's animated
in the museum.
I don't know about online.
But it's clear where just the
different colors represent
generalized genetic variation
that Africa had accumulated
a lot by 100,000 years ago.
But notice it's not
uniformly distributed.
There's a little more red here.
There's a little more
blue and yellow here.
Typical of any widespread
mammalian or any other species,
the fringes of the distribution
don't have all of the variation.
There are little bit.
That's gene flow and
random genetic drift.
Well the last time
I left Africa,
it was in a 747 from
Johannesburg.
A hundred thousand years ago,
the only way out of Africa was
out of Northeast Africa
into Southwest Asia.
And we know that by 40,000 years
ago that population had spread.
And if you look carefully,
there is less variation
out here than there is here.
And yet it's dramatically less
all over than it is in Africa.
So basically if you
wiped everybody
of non-African origin out,
the human species
would still have almost all
of the genetic variation
that it has today.
Non-Africans represent a
subset of genetic variation
and it's characterized
by a loss of variation
as humans have spread out of
Africa with a few exceptions,
but they're the exceptions.
So random genetic drift
can explain most of that,
but selection also occurs.
We all believe in evolution.
That's selection.
These things evolve to get
food into mouth in part.
I have no trouble
using them to eat.
[Pause]
>>How do we detect selection?
We can argue that higher FSTs
indicate selection in one part
of the world than not.
But that's hard.
You have to be very specific.
And it can occur by chance.
So one of the methods of
detecting selection is the idea
that a particular
variant in one part
of the world has
become common quickly
where random genetic drift
to become common would
take many generations.
The result is less recombination
in the DNA flanking it.
So you tend to get around a
variant that's been selected
for an extended part of the
DNA that is all identical.
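The idea of an extended identical stretch around a selected variant can be illustrated with toy data. The sketch below is a simplified version of what is usually called extended haplotype homozygosity. The lecture does not name a specific statistic, and the haplotypes here are made up.

```python
from itertools import combinations


def ehh(haplotypes: list[str], core: int, core_allele: str, distance: int) -> float:
    """Fraction of pairs of haplotypes carrying core_allele that are
    identical across the window core - distance .. core + distance."""
    carriers = [h for h in haplotypes if h[core] == core_allele]
    if len(carriers) < 2:
        return 0.0
    lo = max(0, core - distance)
    hi = min(len(carriers[0]), core + distance + 1)
    pairs = list(combinations(carriers, 2))
    same = sum(1 for a, b in pairs if a[lo:hi] == b[lo:hi])
    return same / len(pairs)


# Toy haplotypes: position 3 is the core site. Carriers of "1" share
# one long identical background; carriers of "0" are all different.
haps = ["0001000", "0001000", "0001000", "1100101", "0110110", "1010011"]

for d in (1, 2, 3):
    print(d, ehh(haps, 3, "1", d), ehh(haps, 3, "0", d))
```

In the toy data the "1" allele keeps full identity out to the edge of the window, which is the pattern expected for a variant that rose in frequency quickly.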
[Pause]
>>For example, what about
lactose tolerance as an adult?
Do you all know about that?
Well I've got the genes for it.
Sorry I just hit the mike.
That is essentially fixed
for one particular
variant in Northern Europe.
And it shows a cline, a
gradient, from low frequency
in Southern Europe to higher
frequency in Northern Europe,
and very strong evidence
of selection.
The plausible hypothesis is that
as the Neolithic moved north,
your cow was in your hut
with you during the winter
when there was little
to eat outside.
And if you could use
the cow's milk fresh,
you would survive
the winter better.
And if you as the hunter
gatherer during the winter died,
your children were going to die.
So there's very strong
survival value
in being able to
drink fresh milk.
In Southern Europe what
happens to fresh milk?
Converted to yogurt.
So yogurt is a varied
part of that diet,
but what makes yogurt?
Lactobacillus that
digests the lactose.
So here we have culture
and selection operating
on a genetic trait.
East Africa has adult
lactose tolerance as well,
but from a different
independent mutation.
There are many ways one can
think of looking at selection,
but they're mostly at
the moment statistical
until there's a solid
biological explanation.
And the one I just gave
you is a good story,
probably makes sense.
But it's not proof.
So there are others.
We know there are
variants in hemoglobin.
Everybody knows about
sickle cell hemoglobin.
There, there is proof.
The different susceptibility
of the different genotype
to infection from the
malaria parasite is clear.
The survival of infants is
clear in a malarial environment.
[ Music ]
