Hi, everyone.
My name is Brian Yu,
and sorry I couldn't be there in person.
But I'm here to present DNA, a programming problem written by myself and David Malan that we use
at CS50, Harvard's introductory class in computer science.
And with CS50,
we draw a lot of students from a number of different domains.
Some students are intending to major or minor in computer science.
We also get a lot of students who come from other backgrounds, such as the natural sciences or
mathematics and statistics, who are trying to learn some computer science tools that they can take with them and
apply in their own domains and their own areas of work as well.
So one of the goals with DNA was to create a problem at the intersection of the sciences
and computation.
And specifically,
what DNA focuses on is the intersection between computation and biology, looking at
how we can take algorithms and the ideas learned in a computer science class
and apply them to biology-inspired problems.
This problem is appropriate for a CS1 or CS2 class and, generally speaking,
familiarity with strings and lists and reading from files
will prove helpful with the assignment.
But certainly,
this programming assignment can be used as an opportunity to give students more exposure to these ideas
as well.
The main topics that we talk about in this particular assignment,
which is all about using computational tools to be able to identify a person based on their
DNA,
are the analysis of algorithms, and how we can write algorithms to do this kind of work,
and computational biology,
in terms of applying computer science skills to biology. And then, in terms of the programming skills that
students learn,
there is some file I/O,
as students read from files, as
we'll soon see; some working with loops, as students loop through a sample of DNA,
for example;
and some string manipulation, as students need to do some analysis of strings to be able to
figure out to whom a particular sample of DNA belongs.
So we introduced this assignment to students by first introducing them to the concept of DNA for those who
might be unfamiliar.
And the idea there is that every human's genetic material is really composed of all of these
units of DNA,
where there are really four different types of units of DNA,
represented here by the letters T, C, A, and G.
And it turns out that inside of a typical human's genome are 6.4 billion of
these units of DNA,
just these four units in some particular sequence.
And so one question that we ask students to consider is how you might store this
information in a database,
for example:
if a government database, like the FBI's,
needs to store information about the human genome,
how much data is it going to take?
How much space is it going to take to store all this information?
And for 6.4 billion units of DNA per person,
that's a fair amount of information.
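To put a rough number on that, here is a back-of-the-envelope sketch. It assumes either one byte per unit (plain text) or two bits per unit (enough to tell four letters apart). These encodings are assumptions for illustration, not a description of how any real database stores genomes.

```python
# Rough storage estimate for one person's genome.
# Assumptions (not from the talk): 1 byte per base as plain text,
# or 2 bits per base when packed, since there are only 4 possible letters.
BASES_PER_GENOME = 6_400_000_000  # the figure quoted in the talk

bytes_as_text = BASES_PER_GENOME * 1      # one ASCII character per base
bytes_packed = BASES_PER_GENOME * 2 // 8  # 2 bits per base, 8 bits per byte

print(f"plain text:   {bytes_as_text / 1e9:.1f} GB")  # plain text:   6.4 GB
print(f"2-bit packed: {bytes_packed / 1e9:.1f} GB")   # 2-bit packed: 1.6 GB
```

Either way, that is gigabytes per person, which is what motivates looking for a much smaller summary of someone's DNA.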
So we might like to come up with a better way,
a more efficient way in terms of both memory and time to be able to identify someone
based on their DNA.
And the solution to this problem turns out to be biologically inspired: there is a
biological answer to the problem of how to efficiently identify someone
based on their DNA.
It turns out that if you take someone's sample of DNA, like the sample of DNA we're looking at now,
and look through that DNA, you'll eventually notice some patterns. If you
look through the sequence of DNA,
you'll come across certain sequences of DNA that repeat consecutively, back to
back to back.
So,
for example,
the very short sequence A G A T.
If you look through someone's DNA,
eventually you'll come across particular locations inside of a person's DNA
where you'll notice that this subsequence A G A T
repeats again and again and again. A G A T repeats once,
and then it repeats again consecutively right after that,
and then again and then again and then again. Here,
it shows up five times in a row consecutively,
and this type of occurrence inside of a person's DNA is known as a
short tandem repeat or an STR for short in the world of biology.
And there are a whole bunch of these STRs that just appear naturally inside of people's DNA,
naturally occurring sequences that just happen to repeat multiple times.
And the interesting fact about STRs, the reason why they're useful and applicable in the
context of trying to identify someone based on their DNA, is that different
people have each of these
STRs repeated a different number of times. One person might have a particular sequence
repeated many times back to back,
another person might have it fewer times, and another person might have it somewhere in between,
for instance.
And so if you know how many times these subsequences repeat inside of a
person's DNA,
then you can use that information to identify someone based on a particular sample of
DNA.
And it turns out that this is actually how governments try to do DNA identification.
The FBI has a database called the Combined DNA Index System,
CODIS for short, and inside of this database,
what the FBI is keeping track of is particular locations inside of
people's DNA
where they find these repeating sequences.
So here's a sample from their documentation of which particular places inside of a person's DNA they're keeping
track of.
And one of the sequences they're looking at,
for example,
is that subsequence A G A T,
a subsequence that repeats multiple times.
And it turns out the FBI is keeping track of 13 such locations, 13
places inside of a person's DNA
where they find these sorts of repeating sequences.
So with just that information about biology, we've introduced students to the part of the problem that they
need to know in order to complete this assignment. What this assignment is ultimately going to be
about is this: given a database of a whole bunch of people and how many times
these sequences repeat in their DNA,
can you now identify someone based on a sample of DNA?
So,
for example,
we give students a DNA database that is in the format of a CSV file, just a comma-separated
file where there's one row for every person,
for example,
and one column for every sequence of DNA that repeats a certain number of
times.
So for this sample input,
for example,
you would interpret this as meaning that
Harry has the subsequence A G A T
repeated twice,
back to back, somewhere in his DNA.
He has the sequence
A A T G
repeated eight times somewhere in his DNA, and then has the subsequence
T C T A G repeated three times consecutively somewhere inside of his
DNA.
And you could make similar analyses for the other people that happen to be inside of this DNA database as well.
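As a minimal sketch of reading a database in that shape, the snippet below uses Python's standard `csv.DictReader`. Harry's row follows the numbers described above. The rows for Ron and Hermione, the `name` header, and the helper `load_database` are made up for illustration, and the data is held in a string rather than a real file so the example is self-contained.

```python
import csv
import io

# Illustrative database. Harry's counts follow the talk;
# the other two rows and the "name" header are assumptions.
DATABASE_CSV = """name,AGAT,AATG,TCTAG
Harry,2,8,3
Ron,4,1,5
Hermione,3,2,5
"""

def load_database(csv_file):
    """Return (list of STR names, dict mapping each person to {STR: count})."""
    reader = csv.DictReader(csv_file)
    strs = reader.fieldnames[1:]  # every column after the name is an STR
    people = {}
    for row in reader:
        # csv gives us strings, so convert each count to an int
        people[row["name"]] = {s: int(row[s]) for s in strs}
    return strs, people

strs, people = load_database(io.StringIO(DATABASE_CSV))
print(strs)             # ['AGAT', 'AATG', 'TCTAG']
print(people["Harry"])  # {'AGAT': 2, 'AATG': 8, 'TCTAG': 3}
```

With a real file, the same function could be called as `load_database(open(path, newline=""))`, where `path` is whatever file name the assignment uses.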
This DNA database is the first input to the problem,
and the second input to the problem that students get is a sample of DNA,
just a text file of a whole bunch of As and Ts and Gs and Cs.
We give students some small sample files that are just 500
characters or so, and then some larger files that are about 5,000 characters or so.
But you could generate these files just by randomly choosing As and Ts and Gs and Cs,
making sure that these match one of the people inside of the database.
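One way such a file could be generated is sketched below: random filler letters with each person's repeats planted in between. The function name `make_sample`, the filler length, and the fixed seed are all assumptions for illustration, not how the CS50 files were actually produced.

```python
import random

def make_sample(counts, filler_length=20, seed=0):
    """Build a DNA string in which each STR appears counts[STR] times in a row,
    with random filler letters before, between, and after the planted runs."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible

    def filler():
        return "".join(rng.choice("ACGT") for _ in range(filler_length))

    pieces = []
    for str_name, n in counts.items():
        pieces.append(filler())
        pieces.append(str_name * n)  # the planted back-to-back repeats
    pieces.append(filler())
    return "".join(pieces)

# Harry's counts as described in the talk.
sample = make_sample({"AGAT": 2, "AATG": 8, "TCTAG": 3})
print(len(sample))  # 135
```

One caveat: random filler can, by chance, extend a planted run or create a longer run elsewhere, so a careful generator would re-check the finished string and regenerate it if the counts come out wrong.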
So the task is for students to then parse through this string, the sequence of As and
Ts and Gs and Cs, looking for the longest run of consecutive
repeats of one of these sequences.
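A straightforward way to compute that longest run is sketched here. It is a brute-force approach that tries every starting position, offered as one possible solution rather than the reference one; the name `longest_run` is an assumption.

```python
def longest_run(sequence, subsequence):
    """Return the largest number of back-to-back copies of subsequence in sequence."""
    k = len(subsequence)
    if k == 0:
        return 0  # an empty pattern has no meaningful run
    best = 0
    for start in range(len(sequence)):
        count = 0
        # Step forward one whole copy at a time for as long as the copies keep matching.
        while sequence[start + count * k : start + (count + 1) * k] == subsequence:
            count += 1
        best = max(best, count)
    return best

# A G A T five times in a row, surrounded by other letters.
print(longest_run("TT" + "AGAT" * 5 + "CG", "AGAT"))  # 5
print(longest_run("AGATCAGATAGAT", "AGAT"))           # 2
```

Because it restarts the count at every position, this does more work than necessary on long runs, which leaves room for students to look for something faster.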
So for A G A T,
for example,
students would find the longest repeated sequence of A G A T
back to back,
and they'd do the same thing for A A T G, looking for the longest sequence where A A T G
repeats again and again and again,
back to back, and then the same thing for each of the other sequences that happen to be
inside of this DNA database. Once they've found all of these longest matches,
students can then compare what they find against the database to conclude that this DNA
belongs to Harry in all likelihood,
or Ron or Hermione.
Or maybe it doesn't belong to anyone in the DNA database,
which is a possibility as well.
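The comparison step can be sketched as a simple lookup that requires every count to match exactly. The function name `find_match` is hypothetical, Harry's counts follow the talk, and the other rows are invented for illustration.

```python
def find_match(profile, people):
    """Return the name of the person whose STR counts all equal profile,
    or None if nobody in the database matches."""
    for name, counts in people.items():
        if counts == profile:  # every STR count must agree
            return name
    return None

# Illustrative database: Harry's row follows the talk, the rest is made up.
people = {
    "Harry": {"AGAT": 2, "AATG": 8, "TCTAG": 3},
    "Ron": {"AGAT": 4, "AATG": 1, "TCTAG": 5},
    "Hermione": {"AGAT": 3, "AATG": 2, "TCTAG": 5},
}

print(find_match({"AGAT": 2, "AATG": 8, "TCTAG": 3}, people))  # Harry
print(find_match({"AGAT": 9, "AATG": 9, "TCTAG": 9}, people))  # None
```

Here `profile` stands for the longest-run counts measured from the sample, one per column of the database.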
And so what we hope that students learn through working on this is,
first, string manipulation:
given this input string,
we want them to be able to figure out an algorithm for finding the longest run of
consecutive repeats of a particular DNA subsequence.
And it turns out that there are various different types of algorithms that could apply. We don't prescribe a particular one,
but we challenge more comfortable students to push themselves and try to come up with a more efficient way
to find these longest runs. And the second goal that we hope students get out of this assignment
is practice with reading files. We give students the DNA sequence
as a text file.
We give students the DNA database as
a CSV file,
and so this is an exercise as well in trying to read those files, parsing them, and figuring out
how to manipulate the data that they're able to read out of those files.
In CS50 we offer this programming assignment in Python,
but really,
it could work for any language where students have the ability to read files and manipulate strings.
It's general purpose enough that it can apply to multiple different languages as well.
All you really need to give as input is some CSV file,
where you've defined, for particular DNA sequences,
how many times they repeat for different people, and then give students some sample DNA
that is just randomly generated As and Ts and Gs and Cs that just so happen to have a certain
number of consecutive repeats somewhere inside of that string.
But with just those ingredients,
you can create a really interesting CSI-style assignment for students to complete,
where they're really tasked with solving a real biological problem,
trying to use algorithms to be able to do something practical with biology,
trying to identify someone based on their DNA.
And we found that students have responded quite positively to trying to work on this.
They feel a real world connection to the work that they're doing,
and they tend to have a lot of fun with it as well.
So hopefully you found that assignment interesting.
All the materials for this are available on the Nifty Assignments website,
and certainly if you have questions about the assignment,
feel free to reach out to either myself or David.
And we're happy to answer questions about DNA.
Thank you all so much for listening to this and enjoy the rest of SIGCSE.
