DAVID MALAN: All right.
Well, let's pick up now where we
left off talking a bit more generally
about technology stacks and
topics like frameworks-- which
is a word that came up earlier--
and libraries, and design patterns,
and approaches to building software,
and bring to bear on this conversation
some examples of data
structures and algorithms
that are part of the
programming process,
but more generally part
of problem solving
and are often handed to programmers
these days in the form of libraries.
Indeed one of the other reasons to
use certain languages over others
is that they just come
with more functionality.
For instance, C, the language with
which I wrote Hello World earlier,
does not come with much functionality.
Anything you want that
language to do, you pretty much
have to implement it yourself, which
means writing lots and lots
of code even to implement
something relatively simple,
like web pages, as I alluded to earlier.
But with something like PHP, or Python,
or Ruby, or Java, or other languages,
you have like a kitchen
sink of functionality
that companies like Oracle
or others have given
to you in the form of that language.
That functionality is often packaged
up in something called a library.
A library is just a set of programs,
or really a set of functionality
that someone else has written that
you can integrate into your own work,
so as not to reinvent those wheels.
And indeed, if you think
about it, even though you
might be the only programmer on
a project or on a small team,
you certainly would not have enough
time to build an entire-- not only a web
application, but the web server that
understands HTTP and spits that out,
let alone the device drivers that I had
been talking about at this lower level
earlier.
And so you want to
stand on the shoulders
of people that have come before you.
And thankfully, especially
thanks to open source software,
there's a huge community of
freely available software
that helps people solve problems.
And let's consider one of the
simplest problems first using
one of the simplest
data structures, whereby
data structure is a way of organizing
data inside your computer's memory.
So you can think of your computer's
memory, for instance, this,
as-- and let's pull up a picture.
RAM DIMM.
A dual in-line memory module.
It's something that looks a
little something like this
and-- Okay, it doesn't look like
that, but might as well, perhaps.
These are DIMMs, dual
in-line memory modules.
Bigger for desktops,
smaller for laptops.
And some of you might have actually
handled or used some of these things
before in the past, but increasingly
these are not serviceable parts.
Apple's known for not letting you
open the computer and add more RAM.
But inside of your Mac and
your PC are things like this.
And those chips, the little black
chips, represent some kind of technology
that allows you to store zeros and
ones to the tune of several gigabytes
typically these days, in total.
So if you think of that as really just
a sequence of bytes-- so if one of those
sticks is one gigabyte,
that's one billion bytes.
Maybe this rectangular region of my
screen represents a billion bytes.
I'm only going to draw
some of those bytes.
But if I have bytes one,
two, three, four, five,
I'm going to draw them each as
just a little cell on the screen.
So I'm just going to have
these rows and columns.
Each of whose cells ultimately
represents one byte.
And we'll pretend that there's actually
a billion such boxes on the screen
as per the dot, dot, dot.
So if each of these boxes represents
one byte, just to be clear,
how many bits does each box represent?
STUDENT: Eight?
DAVID MALAN: Eight bits.
So there are eight times whatever number
of boxes pictured here on the screen.
And I could permute the
zeros and ones however
we want, but bits were
all about yesterday.
Today we're just going to talk in terms
of numbers and letters, and so forth.
So I can just think of this
now as memory storage space.
And this is essentially equivalent
to one of those rectangular things
on the screen there.
So what can I actually do with
this in order to organize memory?
Well, suppose that I'm
implementing a calculator.
And much like an accountant
might want to just keep
hitting numbers, plus numbers, plus
numbers, plus numbers, plus enter
and then get some total,
it stands to reason
that you need to store those numbers
somewhere so that you can actually
do the arithmetic.
And suppose that, for the sake
of discussion, all we want to do
is allow the user to type
in a bunch of dollar amounts
and see how much we've spent
on the business this month.
So the first such number might be-- all
right, well, we spent $100 at first.
And then we spent $50 the next
transaction, and then $150,
and then that's it, for the moment.
Where might I store this data?
Well, if each of these
numbers is less than 255,
I can store it in just one byte.
So I'm just using small
numbers to keep things simple.
Realistically these would be 32-bit chunks
of memory, but then we would have
to use more boxes to represent them.
So we're going to use small
numbers for the sake of discussion.
In terms of hardware, this is where
your information might end up--
underneath the hood, inside
of those little green sticks.
And if I hit the equal sign
on my little calculator,
this might then give me a new value--
namely, 300-- that goes in that box
there.
And in fact, if your computer has
the ability to undo-- like backspace,
backspace, backspace, or
something like that-- this helps
explain why that's possible.
Because if the computer's
remembering all of the things
you previously typed in, clearly there's
a way logically to undo that work,
or see them still on the screen.
And indeed, a calculator like
the one I keep using here-- you
can imagine that when
I do 100, plus 150,
plus 50-- how are each of these
symbols remaining on the screen?
Well, they're stored
somewhere in memory.
And my screen is constantly
refreshing itself with lights
so that I can see it.
And so that's somewhere
underneath the hood.
And that's where we're looking now.
Now in a computer
program, ideally we don't
want to have to think about our
memory at this super, super low level.
But we do want to have the ability
to organize our information.
So for instance, let me just
pull up a representative program.
I'll use C, just because we keep
using that example from earlier.
So suppose that this
is a program written
in C. And suppose that-- actually
let's say-- yeah, we'll use C here.
So suppose I want to store
one number in my calculator.
I might call that first number x, based
on my recollection of algebra and grade
school.
So int x would give me
an integer called x.
And while I'm coding this, I'm going to
call a function get_number_from_user.
So this is just pseudocode now.
It's a mix of C and pseudocode
where this function is somehow
going to get a number from the user.
And then if I do this again, I might
want to store a second number in y.
And if I do this a third time, I
might want to get a third number
and store it in z.
And then maybe I want
to compute the sum.
So I'm going to declare a
fourth variable called sum
and have this be equal
to x plus y plus z.
So at this point in
the story, essentially
I'm treating this as x, this as
y, this as z, and this as sum.
So each of those bytes
represents a variable.
Again, an integer
would typically be bigger.
It'd be 32 bits.
So I'm simplifying.
But it's just eight bits,
or one byte, in this case.
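Spelled out in actual C, that sketch might look like the following. Note that get_number_from_user isn't something C comes with; it's hypothetical, so here it's stubbed in on top of scanf just so the sketch is self-contained:

```c
#include <stdio.h>

// Hypothetical helper: C's library doesn't provide get_number_from_user,
// so this stand-in reads one integer from the keyboard via scanf.
int get_number_from_user(void)
{
    int n = 0;
    scanf("%i", &n);
    return n;
}

// The arithmetic itself: three variables in, one sum out.
int add_three(int x, int y, int z)
{
    return x + y + z;
}

// A main function might then do:
//   int x = get_number_from_user();
//   int y = get_number_from_user();
//   int z = get_number_from_user();
//   int sum = add_three(x, y, z);
```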
This doesn't really scale very well.
I seem to have implemented a calculator
that's only capable of adding maximally
three numbers, which is a
pretty subpar calculator.
I want to be able to store more.
And so, all right, you
want to store more.
Well, I already used x, y, z.
So you know what?
I'm going to just start to use a.
And if you need another
one, I'm going to add b.
And I'm going to add a c and then d.
Or this is a little messy.
You know what?
I could do better than that.
Let me just call this x1.
This will be x2.
This will be x3.
This will be x4.
And the mere fact that I'm literally
copying and pasting while writing code?
Very bad.
It might be reasonable to do it
once, maybe twice, in a program.
But as soon as you find yourself
copying and pasting as a developer,
generally speaking, this
is a bad habit to get into,
because you're duplicating code.
You're creating twice
or thrice as many places
where there might be a mistake now.
Much better to write one line of
code once and somehow reuse it.
And a data structure called an
array allows us to do exactly this.
An array is a feature of
many programming languages
that lets you declare one variable but
store multiple pieces of data inside
of it, back to back to
back to back, literally
underneath the hood like this.
So instead of having this situation
devolve into x1, x2-- I mean,
this is just messy.
This is not descriptive.
It's just kind of
hackish, one might say.
I'm instead going to say give me an
array of four integers, for instance.
And the syntax I'm going to use is that.
So it's a little more
cryptic at first glance.
But most languages would have
you do something like this.
You say, give me an int-- wait a minute,
how many of them?-- in square brackets.
Give me four ints and
collectively call those ints x.
And x is no longer, or really
never was, descriptive.
Let's just call this array numbers.
So what this means underneath
the hood is that-- this
is what's inside of my computer.
This is how it's physically organized.
But so far as the computer's
concerned-- or rather,
so far as my program is concerned, I
can think of this whole block of memory
as one variable called numbers.
And I've abstracted away the
underlying implementation detail.
I don't have to know about
this whole grid of memory.
All I care about is, give me four ints.
Put them wherever you want,
but give me four integers.
And syntactically, in
a language like this,
I would then do this kind of syntax.
Numbers bracket 0 gets--
what did I call it?
get_number_from_user? Yeah,
get_number_from_user.
And now, I would do this.
One, two, three.
And now, I would say int
sum gets numbers bracket 0
plus numbers bracket 1
plus numbers bracket 2.
And now, because I've declared
it of size four total,
I can add four numbers together, which
is one more than I could earlier.
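In C-like code, that array version might be sketched as follows, again with the caveat that get_number_from_user is our own hypothetical helper, not something the language provides:

```c
// Sketch of the array-based version: one variable, numbers,
// storing four ints back to back to back to back in memory.
int sum_four(const int numbers[4])
{
    return numbers[0] + numbers[1] + numbers[2] + numbers[3];
}

// Filling the array might look like:
//   int numbers[4];
//   numbers[0] = get_number_from_user();  // hypothetical helper
//   numbers[1] = get_number_from_user();
//   numbers[2] = get_number_from_user();
//   numbers[3] = get_number_from_user();
//   int sum = sum_four(numbers);
```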
Give me one justification for
counting from zero instead of one
like a normal person.
STUDENT: Because everybody
using C starts indexing at zero?
DAVID MALAN: That is true.
But you're really just
describing the situation.
Explain the situation.
Why do we start counting at zero?
David?
DAVID: I didn't hear it, but I was
going to say that bits-- don't they
start with zero-zero?
DAVID MALAN: That's
the underlying reason.
If we rewind to yesterday-- if
you have some number of bits,
the smallest number you can represent
is just setting all those bits to zero.
And so it's just arguably
natural in computer science
and in programming more generally.
Once you understand
what's going on, it feels
only natural to start counting at zero.
Because if you don't, you're
essentially throwing that value away,
because you still have
that series of bits.
But if you don't use zero
ever, it's not all that useful.
So in this case we
start counting at zero.
There's still four elements.
It just so happens that the
highest numbered one of them
happens to be three, as
a result. This is what's
called zero indexing, so to speak.
OK.
So this looks more
cryptic now, to be sure.
But what's clean about
it fundamentally is
that there's only one variable
for all four of those values.
It's a variable called numbers.
And the type of that variable
is what we would call an array.
An array is a way of
storing data contiguously--
back to back to back to back-- by
just using one descriptor for it,
like the word numbers.
And we could have called
it anything we want.
So there's a few things
that are nice about this.
In fact, let me draw a slightly
larger example of a few numbers here.
Suppose I have the number--
let's say, 7, 4, 3, 2, 5, 1, 6.
OK.
So I've drawn seven
numbers on the board.
They're in random order.
And suppose now, for the
sake of discussion, that I've
stored them in an array like this.
So suppose that this is now a
program whose purpose in life
is to allow me to search an array.
Suppose we're building
a search algorithm.
And we're just using numbers
to keep things simple,
but these could be words
that are in English.
How would I go about finding a value?
Well, it turns out that when
you're programming, even though we
humans can kind of glance
at the board and-- oh, OK.
I see seven numbers there.
A computer cannot do that.
A computer can only look at
one piece of memory at a time.
And in fact, if you recall earlier
when I pulled up some sample code
and I mentioned the load operator,
that's sort of a manifestation of this.
You can load a value.
You can't say go load a whole bunch
of things all at once necessarily.
It has to be more low-level than that.
So even though we, the
humans, see the whole thing,
think of it like a
horse with blinders on.
You can only see one value at a time.
And you can ultimately see all of them,
but you have to iterate, so to speak.
You have to look at
one element at a time.
So what programming construct does
this perhaps remind you of from Scratch
before lunch?
What puzzle piece would allow you to
implement this idea of doing something
again and again?
Yeah, Grace?
GRACE: Repeat.
The repeat.
DAVID MALAN: Yeah.
Like repeat seven, instead
of the default repeat 10.
Why don't we repeat seven times?
Forever feels bad,
because then I'm going
to go way past the edge of the board.
And who knows what might happen?
That's when computers might crash.
But if we repeat seven
times, we can iterate
over looking at each of these elements.
So with that said, looking at this
array, how would you as a computer
go about finding the numeric
equivalent of Mike Smith?
How might you go about finding the
number six in an array of numbers?
Well, turns out by convention,
when you have an array--
you can technically start at
the left or the right end,
but it's almost always the case you
just start arbitrarily, but consistently
at the left end.
Which means you can look at the seven
first, then you can look at the four,
then the three, then the two, then
the five, then the one, then the six.
Aha!
It's there.
And if you go all the way and you
don't find it, then you conclude,
oh, this number's not in the array.
So let's describe the running
time of searching an array.
In other words, how
many steps does it take
to find an arbitrary number
in this kind of array?
STUDENT: As many steps as
there are numbers in the array?
DAVID MALAN: Yeah.
In the worst case, the
number you're looking for,
like six-- deliberately chosen--
might be all the way at the end.
So you've got to search
the whole darn thing.
Now, this is a very arbitrary example,
so describing the efficiency of arrays
as seven is kind of meaningless.
But if you think of the length of
this whole thing in general as n--
just like yesterday with the number of
pages in the phone book-- we would say,
and I used the syntax
yesterday, the running time
of searching an array linearly
left to right is on the order of n.
Maybe it's less than n.
It's not going to be more than n.
Maybe it's a few steps less than
n, depending on where the six is.
Maybe you get really lucky
and it's only one step.
But if we use this special capitalized
O, called big O notation,
that describes in this case an
upper bound on how many steps
it might take us to find this element.
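That left-to-right scan is what's called linear search, and a minimal sketch of it in C might look like this, returning where the target was found, or -1 to signal not found:

```c
// Linear search: look at one element at a time, left to right,
// like a horse with blinders on. Worst case n steps: O(n).
int linear_search(const int *values, int n, int target)
{
    for (int i = 0; i < n; i++)
    {
        if (values[i] == target)
        {
            return i;  // found it at index i
        }
    }
    return -1;  // searched the whole darn thing; it's not there
}
```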
OK, that's all fine and good.
But seven is pretty tenable.
What if it's like
seven billion elements?
Well, that's going to be on the
order of seven billion steps,
and that kind of invites an
opportunity for smarter algorithms.
And what was the smarter
algorithm we introduced yesterday
that solved search problems much
faster than this linear left-to-right
approach?
STUDENT: Divide and conquer.
DAVID MALAN: Yeah, that
divide and conquer strategy.
Can I use that here?
If I think of the phone book
as the analog yesterday,
I might look at the middle
of the array, which I can do.
I don't have to start at the left.
Because what's nice is, because I know
that this is location 0, 1, 2, 3, 4, 5,
6, I know I can go immediately
to the middle of the array
just by knowing its length.
If I know the length is
seven, I can divide it by two.
That unfortunately gives me 3 and 1/2.
But if I round down,
that gives me three.
So where's the middle of the array?
I can look immediately at
the middle of the array.
So the computer doesn't strictly
have to start left to right,
because if it knows its length, it
has what's called random access.
It can jump randomly to any element
just by doing a bit of arithmetic.
Thus was born the name RAM.
Random access memory is
memory that looks like this.
And it's random access in the
sense that using simple arithmetic,
you can jump to any random
location that you might like.
You don't have to search
the whole thing per se.
Now can we leverage simple
arithmetic and divide
and conquer to find
the number six faster?
So I look at the middle.
Aha!
It's the number two.
Six should be to the
right or to the left?
STUDENT 5: To the right.
DAVID MALAN: Well, I
mean, it is to the right.
But most of you are
questioning this logic.
Why?
What's flawed about it?
STUDENT 6: It's not sorted.
DAVID MALAN: It's not sorted.
Right?
The whole beauty of the
phone book and the reason
we could apply divide and
conquer is because it was sorted.
So you knew s was after m and
not in just some random location.
So unfortunately here, if this array--
even though it gives us random access,
that's useless in this case
insofar as it doesn't actually
let us search it any more effectively.
So what if we did have random
access but also a sorted array,
such that-- now I'll use some
different numbers just so it's not just
a silly example with 0 through 6.
Suppose we have the number 5,
10, 20, 32, 42, 50, and 99.
And suppose now I'm searching for
the number 42-- meaning of life,
the universe, everything.
So where do I begin?
It's sorted, and I know that in advance.
So I jump to 3.5, or
round down, so it's 3.
Now I know it's here.
And what's nice now is that
I have a well-defined middle.
I chose the number seven deliberately,
so that it would look nice and pretty.
But 32 is less than 42.
So I know 42 must be, if
anywhere, to the right.
So I can figuratively and literally
throw this half of the problem
away, leaving me now with a nicely
symmetric three element array.
I look at the middle of this one, 50.
Ah, 50 is too big.
It's bigger than 42.
So I can throw this half
of the problem away,
including the number I'm looking
at, and look now to the left.
I see.
Now I got lucky.
But it's also the last
element, so voila!
I'm done.
And this took how many steps?
STUDENT: Three?
DAVID MALAN: Three steps.
I went here, then I went
here, then I went here.
And you can kind of see that
by asking how many times you can
split seven elements.
Log base 2 of 7 will give us,
indeed, 3 in this case, give or take.
But don't worry if
that's unfamiliar math.
But that's how we got there.
How many times can you
divide it again and again?
So this is to say ultimately,
what's the running time of searching
an array if it is sorted?
It's big O of?
STUDENT: [INAUDIBLE]
STUDENT: Log base 2.
DAVID MALAN: Yeah.
So log base 2 of n.
Because again, if unfamiliar
or you don't recall,
logarithm base 2 is just how many
times can you divide, divide, divide,
divide until you get to one.
And frankly, most computer
scientists wouldn't
bother writing constant
numbers, like the number 2.
We would just write it
more simply like this.
But this is smaller than this, certainly
for sufficiently large values of n.
So this is a more appealing algorithm.
And this now is the more formal way of
describing what I described yesterday
as the first approach,
turning one page at a time,
versus the third approach, which
was dividing and conquering.
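That divide-and-conquer approach is binary search, and in C it might be sketched like this, assuming the array is already sorted:

```c
// Binary search on a sorted array: jump to the middle, throw half
// of the problem away, repeat. Needs random access and sortedness.
// Worst case roughly log base 2 of n steps: O(log n).
int binary_search(const int *values, int n, int target)
{
    int low = 0;
    int high = n - 1;
    while (low <= high)
    {
        int middle = (low + high) / 2;  // rounds down, as with 7 / 2 == 3
        if (values[middle] == target)
        {
            return middle;
        }
        else if (values[middle] < target)
        {
            low = middle + 1;   // target, if anywhere, is to the right
        }
        else
        {
            high = middle - 1;  // target, if anywhere, is to the left
        }
    }
    return -1;  // not in the array
}
```

Searching {5, 10, 20, 32, 42, 50, 99} for 42 with this sketch probes 32, then 50, then 42: the same three steps as on the board.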
So this is great.
We seem to have found now an application
of that high-level phone book
idea to actually solving problems
that a computer might solve.
And again, even though
we're using numbers,
imagine this being words or web pages
in a database, or anything that's
actually more interesting than numbers.
Unfortunately, there's a problem.
With an array like
this one here, suppose
I'm dealing not with some
calculator but with sorted data.
Because we seem to be at the point
in the story where keeping your data
sorted seems to be advantageous.
You can search it faster.
So suppose those numbers again are
5, 10, 20, 32, 42, 50, and then,
who knows what else?
After that, it's just undefined.
Question marks everywhere else.
Yeah.
STUDENT: What if they're
completely unsorted?
DAVID MALAN: If they're
completely unsorted,
we're in the former scenario,
where we have to use linear search,
left to right.
So the narrative here is:
if they're completely unsorted,
you can do no better than looking
through the whole darn list.
And maybe you get lucky
and it's early on,
but you have to check the whole list.
If it's somehow sorted, either because
you did it or because the user did it,
then you can leverage
something like binary search.
But that win, which seems to be
an improvement, comes at a cost
by using an array.
Binary search, as it's
called-- bi meaning two,
splitting the list again and again--
assumes that the array is sorted,
but it also assumes that your data
structure gives you random access.
Because how did I know to
go to element 3.5, or 3?
Like arithmetic.
I needed to know the lower
bound and the upper bound,
so that I could jump
right to the middle.
Unfortunately, this comes at a cost
if you're storing all of
your data back to back
to back to back in this way.
Suppose I want to insert the
number 21 into this data structure.
Where does it belong if I
want to keep my data sorted,
which seems to be in my best interest,
because then search is faster?
Where does it have to go?
Yeah, after 20, and before 32.
And while I could cheat
with a whiteboard marker
and just draw it wherever I want,
that's not how memory works.
These are actual 0's and
1's underneath the hood.
So if I want to keep
the list in sorted order
and insert 21 in between 20
and 32, how do we fix this?
And I guess I shouldn't show you this.
How do we fix this?
STUDENT: Delete the
last three and add it?
DAVID MALAN: Yeah.
Delete the last three.
And maybe not just
delete them, but rather
maybe make a copy of the last element
and then erase it or change it.
Then make a copy of the second-to-last
number and move it over.
And notice, the order of
operations is important.
If I blindly do this, damn it.
What do I need to write down?
A computer's just going to forget unless
the computer remembers it somewhere.
So this has to be done
in the right order.
So this might be 32.
Now I can put the number 21 here.
So it's fixable, this problem.
We still have sorted order.
We still have random access.
And we have the ability to
insert new data into this list.
That seems like
everything you might want.
But I would argue that
that was expensive.
That was minorly painful
and certainly would be more painful
if this list were bigger than seven.
What price did I just pay for
keeping all of these features?
STUDENT: Time.
DAVID MALAN: It takes time.
I did it somewhat quickly.
It was only three elements.
But again, if it's a million
elements in the list,
that might take me half a million
steps to make room or move things over.
That's going to cost me and the
computer time, but it does work.
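That copy-things-over maneuver might be sketched in C like this; note the order of operations, working from the right end so nothing gets forgotten before it's copied:

```c
// Insert value into a sorted array of n elements, keeping it sorted.
// Assumes the caller's array has spare room for at least n + 1 ints.
// Worst case this shifts all n elements over: O(n).
void insert_sorted(int *values, int n, int value)
{
    int i = n;
    while (i > 0 && values[i - 1] > value)
    {
        values[i] = values[i - 1];  // copy the bigger element rightward first
        i--;
    }
    values[i] = value;  // the gap we opened up is where value belongs
}
```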
So that's one of the
prices you pay of arrays.
It's again this balancing act,
whereby it feels like win-win-win, ah,
but wait a minute.
Inserting data just got very expensive.
Deleting data is going to get very
expensive, because then I have
to go in and keep everything compact.
I could just start deleting
things and creating holes in my data.
And then when I insert data, I could
just put it in any random location,
but the problem there is we
sacrifice the sortability,
the sorted-ness of the list.
And we sacrifice the
ability to jump predictably
to elements that are in the list.
So how do we fix this?
I feel like I would really
like to be able to insert data
into the list faster without having
to move every darn thing around.
STUDENT: We'd use a directory?
DAVID MALAN: A directory.
OK, yeah.
Actually, let me-- hold that thought.
So that's good, and
that's a nice way of framing
something that's going to be
called a hash table in a bit.
Let me propose that's
even more sophisticated
than we need at this point.
Grace?
GRACE: No, never mind.
DAVID MALAN: Alicia?
ALICIA: Could you just
do an insert command?
DAVID MALAN: An insert command.
OK, but we are implementing the insert
command today, so that's not an option.
Yeah.
Dan?
DAN: Create it as you go?
And just have the list
changing one piece at a time?
DAVID MALAN: Yeah.
Why don't we create the list as we go?
And this will be easier to
look at first in the abstract.
So let's do it this way.
Suppose that the first number I insert
into this list, by whatever sequence
of real world events, is the number 20.
So I'm just going to draw
the number 20 on the screen.
And just for kicks, I'm going to draw
it in a box, so it looks like this.
That's it.
Now suppose the next number I want
to insert into this list is 50.
So I'm going to assume
for now-- I'm just
going to keep track of
where that first element is.
So this small little
dot here on the screen
just represents-- that's the
beginning of my list that I'm making.
So now I insert the number 50.
Where, of course, does it belong?
To the left or to the right of the 20?
So obviously to the right.
So I start here and I follow this.
Oh no, it belongs to the right.
So now, I create a new box.
In other words, I ask Mac OS or Windows,
give me a little more memory please.
I plop a number in it.
And I somehow have to do the
equivalent digitally of this arrow.
But for now, we'll keep it abstract
with just an arrow on the board.
Now I insert the number 32.
Where does that go?
Obviously in between.
And now previously, I would have
drawn a new node, so to speak.
These squares-- we would
generally call nodes.
I would erase this and
then insert the new one.
But because of these arrows,
we don't need to do that.
Because to Dan's point,
why don't we just again ask
Mac OS or Windows for
memory-- and that's
going to give me this chunk
here represented with a box.
But if we're just kind of
allocating, so to speak,
or creating these things as we go, I
don't need to touch the 20 or the 50,
per se.
Why don't I just change
what points to what?
And so in this way,
these numbers literally
don't need to be back to
back to back in this grid.
They could literally be one up
here, one over here, one over here.
And so long as in code,
I have the ability
to draw the equivalent
of arrows on the board,
I can somehow link all
of these things together.
And indeed, version two
of our data structure
would be what most people
would call a linked
list, which is fairly self-descriptive.
And the upside of this,
now, is that I don't
have to do a massive amount
of movement of nodes,
which can be costly and time-consuming.
I still have to find where it belongs.
So there's a bit of a trade off.
I can't jump immediately to the
middle of a linked list, because I
claim the only way you
keep track of a linked list
is by way of this first
node-- this pointer over here.
But that's OK, because
at least I don't have
to do all that deletion
and movement of space.
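In C, one way to sketch such a linked list is a struct per node, where the digital equivalent of each arrow is a pointer to the next node:

```c
#include <stdlib.h>

// One node: a value plus an arrow (pointer) to the next node.
// A NULL pointer marks the end of the list.
typedef struct node
{
    int value;
    struct node *next;
}
node;

// Insert value into a sorted linked list, returning the (possibly new)
// head. No mass movement of data: we only rewire a couple of arrows.
node *insert(node *head, int value)
{
    node *n = malloc(sizeof(node));  // ask the OS for one new chunk
    if (n == NULL)
    {
        return head;  // out of memory; leave the list unchanged
    }
    n->value = value;

    // If the list is empty or value is the new smallest, it becomes the head.
    if (head == NULL || value < head->value)
    {
        n->next = head;
        return n;
    }

    // Otherwise walk to the node the new one should follow, then rewire.
    node *prev = head;
    while (prev->next != NULL && prev->next->value < value)
    {
        prev = prev->next;
    }
    n->next = prev->next;
    prev->next = n;
    return head;
}
```

Inserting 20, then 50, then 32 with this sketch reproduces the board: 32 slots in between without touching the memory of 20 or 50, only their arrows.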
Moreover, even more compellingly, in the
world of arrays-- in this world here,
suppose that this is
the list I start with.
If I have in advance asked
operating system, Mac OS or Windows,
give me space for eight numbers, it
will hand you back the equivalent
of a chunk of memory like this.
Essentially it'll hand
me back instructions.
All right, David, you can store
them at locations 0 through 7.
And that's all the operating
system is committing to.
Because another constraint
of arrays is you
must decide in advance
how big it is going to be.
If you want eight numbers, that's fine.
Tell the OS.
You'll get space for eight numbers.
But the operating system
reserves the right
to plop other stuff right
after your chunk of memory.
So suppose that something else is
going on in your program or computer,
like the user has just
typed in their name.
And so D-A-V-I-D might end up here.
And then some other program
is running, and so the number
88 and then 99 ends up here.
Whatever those mean.
But the point is that your data is now
bumping up against some other data.
So the next time your program
says, hey, operating system.
The user wants to insert the number 22.
I need more space.
What you can't do in an array
is-- all right, damn it.
Uh, 50.
All right, I have a
free space over here.
Let me put the 50 here,
and that moves the 42.
Then let me move the--
let's see, the 32 goes here.
And that frees up space.
Oh, and now I can put the 22.
I mean, I could do that.
But you're violating the
definition of an array.
Why?
It's out of order.
I mean, it's still kind of increasing
in order, left to right, top to bottom.
But it's no longer contiguous.
And the contiguousness property is
requisite for what functionality?
Search, for random access.
Right?
I know my array might be length 9
now, because I can keep track of that
with like a variable,
somewhere in memory.
But the 50 is not next
to the 42, which means
if I'm jumping to the middle of this
list, god knows where I might end up.
I might end up at the i.
And that makes no sense, because
it's a letter, not even a number.
So arrays are problematic, because you
kind of paint yourself into a corner.
Deliberately, it's a plus,
insofar as you get random access.
But you also kind of
shoot yourself in the foot
for future additions to the list.
So what might you do to avoid this?
Well, maybe the first time
I ask the OS for memory,
maybe I shouldn't be so short-sighted
and instead of asking for, like,
eight elements.
Let me just round up.
Give me space for 100 numbers.
Why not just preemptively do that so
that most of the board stays free?
And then if the user
types his name in, it
ends up at least down here, so I
have a little bit of breathing room.
What's bad about that solution?
Grace?
Or Alicia?
STUDENT: It's very expensive.
DAVID MALAN: Yeah.
It's wasteful.
It's extravagant.
Yes, it's self defense, so
that you have room to grow.
But one, you're going
to consume more memory.
And we know from yesterday, you
only have a finite amount of memory.
So now, thanks to
virtual memory, you
might actually be slowing down
the computer as a side effect,
because you're so greedy asking
for memory you might not ever use.
Moreover, it doesn't really
fundamentally solve the problem.
Because eventually, the user's going
to ask potentially for more memory
than you preemptively chose.
So you've just kind of postponed the
problem, which might help a little bit.
But it doesn't solve the problem.
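For what it's worth, that postpone-the-problem approach, asking for a bigger chunk and copying, might be sketched like this; in real C one would typically reach for realloc, which does the same dance and can sometimes grow the chunk in place:

```c
#include <stdlib.h>

// Grow an array from old_size to new_size ints by allocating a brand-new,
// bigger chunk, copying the old values over, and freeing the old chunk.
// Returns the new chunk, or NULL on failure. The copy itself is O(n),
// which is exactly the cost we were hoping to avoid.
int *grow(int *old, int old_size, int new_size)
{
    int *bigger = malloc(new_size * sizeof(int));
    if (bigger == NULL)
    {
        return NULL;
    }
    for (int i = 0; i < old_size; i++)
    {
        bigger[i] = old[i];
    }
    free(old);
    return bigger;
}
```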
So linked lists by contrast,
here, allow us to just plop
our numbers wherever we want in RAM.
And we just have to somehow draw the
digital equivalent of these arrows,
so as to sort of stitch
everything together,
like popcorn on a Christmas tree,
that kind of linked list kind of feel.
But this can't be all good, right?
Literally nothing we've
done yesterday or today ends
with a perfectly happy ending.
So what's the downside of a linked list?
STUDENT: You can't search it.
DAVID MALAN: You can't
search it in the same way.
You can't do binary
search, because you don't
have random access, because
literally the thing might
be all over your computer's memory.
The upside of which though, to be clear,
is you can grow and even shrink it
dynamically without having to worry
about the annoying back-to-back-to-back
property.
But there's another downside.
STUDENT: If one of the arrows breaks?
DAVID MALAN: OK.
What if one of the arrows breaks?
Could be.
So it's not going to break
on its own, to be fair.
But it is indeed the case.
And one of the reasons that so
few-- well, one of the reasons
C has fallen into disfavor as a go-to
language for many applications is it's
so easy to make mistakes.
Because it turns out when you write
code, it's so easy in a language
like we have on the screen here to make
a typographical error or logical error,
such that the arrow breaks because
you put it in the wrong place.
So it's not going to happen on its
own, but human error is very possible.
And a lot of the sources of problems
in C programs are related to memory.
Yeah?
STUDENT: [INAUDIBLE]
DAVID MALAN: True.
That's true.
So you give up random access, but that's
kind of equivalent to no binary search,
because those are reduced
to the same issue.
STUDENT: Is it difficult
to delete one entry?
DAVID MALAN: Is it difficult to delete?
No, I mean you have to
write the code for it.
It's a few steps.
But once you get it, no.
Then it just works.
So not worrisome.
Yeah, Avi?
AVI: [INAUDIBLE]
DAVID MALAN: Yeah, that's the catch.
It's nice and fine to talk
about things in the abstract.
But as soon as we go back to the
lower level-- at the end of the day,
this is our canvas.
Any ideas we come up
with here verbally, we
have to implement at the end
of the day using just this
as our storage mechanism.
So where do I put all of these arrows?
Well, let's see.
Let me go ahead and erase
all of this for just a moment
and consider just the numbers
in question, the three
that we have at the moment.
And those things look like this.
It was number 20, 32, and 50.
And I'll draw them
roughly the same location.
So 20, 50-- so 20, 32, 50.
So that's roughly where they are, though
I kind of blew it up deliberately.
So previously, I had drawn
arrows from one to the other,
but I can't cross
boundaries like that here.
I can't just draw
arrows anywhere I want.
So the only basic primitive
I have is addressing.
Recall that this is element 0, 1, 2,
3, 4, 5, 6-- it'll be tedious for just
a moment-- 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18-- it's going to get
increasingly illegible--
19, 20, 21, 22, 23, 24.
Should've drawn them closer together.
All right.
That's it.
That is our canvas.
That is the extent of the technology
underlying our Macs and PCs
when it comes to memory.
So how do we implement arrows?
STUDENT: Each one takes up
[INAUDIBLE] an extra byte?
DAVID MALAN: An extra byte. Yeah.
Let's do that.
Let's be a little greedy.
And you know what?
I can use memory anywhere I want with
this scheme, so let me just go ahead
and say, all right, this
guy-- I'm just going
to draw the border darker-- that
is now one node, so to speak.
It's going to take up two
bytes instead of one byte.
But why?
What did you want to use this for?
STUDENT: A pointer.
DAVID MALAN: A pointer.
The arrow.
So if all I have in terms
of my vocabulary here
is the ability to write down
numbers-- at the end of the day,
addresses-- well, what number
should I put at location 9?
STUDENT: 32?
DAVID MALAN: Close.
You just pointed me over here.
STUDENT: [INAUDIBLE]
DAVID MALAN: 24.
In other words, if I want to
create a sort of treasure map
that leads from one
node to another, just
put the address of the
destination in the map.
So if the address of this byte-- this
is byte number 24, which zero-indexed
is the 25th byte on the screen.
That's just an implementation detail.
Put this here, because now this is
sufficient information-- a breadcrumb,
if you will-- to lead you to
the next number in the list.
Meanwhile, the next number here.
Let's cheat here-- or rather, not cheat.
Let's just be greedy and steal a second
byte, box it in just to stand out.
What number goes here?
12.
Because we want to go to location 12.
So I'm going to put the number 12 here.
And then finally, this guy
here, just for consistency,
I'm going to give him two bytes.
What should go in his second byte?
STUDENT: Eight?
DAVID MALAN: Eight.
Eight, eight, eight.
OK.
If you want to make it a
circular list, perhaps.
So that's possible.
That is a type of data structure,
a linked list that is circular.
Or what else?
What's an alternative?
Because I worry that's going to
get me into dangerous territory
unless I code that properly, because
I could end up searching forever
and never find something.
STUDENT: Essentially a period?
DAVID MALAN: Yeah.
So the computer
equivalent of just period.
And the convention-- there
is no notion of a period
if all we have are
numbers in this context,
so it's actually common
to just put a zero.
So technically, what this means
and what most computers do
is-- there is an address 0, but
you're never allowed to use it.
We steal at least one byte.
In reality, we steal
a bit more than that.
But we deliberately steal
it so that that is invalid.
This is what would be called, if
you've ever heard the term, null.
N-U-L-L. And it's sort of
a black hole, so to speak.
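The number-plus-arrow scheme just described maps to a small C struct. This is a sketch, not the code from the screen; the names `node` and `build_list` are illustrative. NULL plays the role of the reserved zero address, the "black hole" marking the end of the list:

```c
#include <assert.h>
#include <stdlib.h>

// One node: a number plus the address ("arrow") of the next node.
typedef struct node
{
    int number;
    struct node *next;
} node;

// Build the board's list 20 -> 32 -> 50 by prepending,
// so we insert in reverse order.
node *build_list(void)
{
    int numbers[] = {50, 32, 20};
    node *list = NULL; // empty list: the arrow points nowhere
    for (int i = 0; i < 3; i++)
    {
        node *n = malloc(sizeof(node));
        if (n == NULL)
            return NULL;
        n->number = numbers[i];
        n->next = list; // new node's arrow points at the old front
        list = n;
    }
    return list;
}
```

Each node can live anywhere in memory; only the arrows stitch them together.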
But that's it.
And this is what's kind of cool, I
think, about programming and even just
getting a bit of exposure to it.
As complicated as all of
this might look conceptually
and as ridiculous as the code
might look, at the end of the day
it can't possibly be that
complex, because this
is all it reduces to
underneath the hood,
at least when it comes to storing data.
So there's only so many ways we humans
can come up with ways of storing data.
And we might come up with fancier
ways or more human-friendly ways
of talking about that data.
But it just has to map to fairly
simple primitives, if you will.
All right, it still can't be all good.
What price have we paid?
Well, now to your suggestion earlier--
we had three bytes in use earlier,
now how many do we have?
Six now, or seven if we include null.
So we've doubled the
amount of space required.
Now maybe that's negligible.
Maybe it's not.
Back in the day, it probably
wasn't to just double
the amount of memory you're using.
So a linked list is more of an
extravagance, at least, back
when computers had very,
very little memory.
Nowadays if you've got
two gigabytes of memory,
using a bit more space for a pointer,
so to speak-- each of these addresses
is what people would call a
pointer-- is probably reasonable.
But can we do better?
Well, what is the running time
of searching a linked list?
Previously with an array, we had--
let's keep a running list here.
So previously, an array, we
said, might be as bad as linear,
because it might be in random order.
But we could do better.
It could also be
logarithmic, if it's sorted.
That was good.
A linked list, though, takes how much
time to search in the worst case?
Again, n is the length of
the linked list in this case,
so how many steps might it take
us to find someone or some number
in a linked list?
STUDENT: n times 2?
DAVID MALAN: n times 2?
Why that?
STUDENT: [INAUDIBLE]
DAVID MALAN: OK, yeah.
And it's not quite--
yeah, if we're really
going to count the number
of steps, absolutely.
But it's on the order of-- and
this is where computer scientists
would cut verbal corners.
It's on the order of
n-- 2n, 3n, whatever.
It's a function of n--
n is the variable in the formula.
So I would write it as that.
Wait a minute.
What if the linked
list is sorted though?
Can we do better?
STUDENT: Can't do binary.
DAVID MALAN: No random access.
STUDENT: [INAUDIBLE]
DAVID MALAN: Yeah, so damn it.
If we haven't shot ourselves
in the foot in this sense,
even if we maintain the linked list
as sorted, at the end of the day,
we might have to search the whole thing.
Even though you might
get lucky, where you just
find it at the beginning of the
list, some number, the worst case
it's going to be at the end of the list.
Or the worst, worst case, it's
not even going to be in the list.
And you're only going to know that once
you've gone through the whole thing.
So that's the worst case scenario.
So the upper bound is
instead big O of n now.
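Searching a linked list, then, is just a walk from node to node -- a sketch under the same assumptions as before (the `node` struct name is illustrative). Sorted or not, the worst case visits every node, which is the big O of n just mentioned:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int number;
    struct node *next;
} node;

// No random access: the only way forward is to follow each arrow,
// so the worst case is n steps -- O(n) -- even if the list is sorted.
bool list_search(const node *list, int target)
{
    for (const node *p = list; p != NULL; p = p->next)
        if (p->number == target)
            return true;
    return false;
}
```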
So again, trade off.
Just when it felt like we
were getting somewhere,
now we've kind of hurt ourselves again.
So what might be better than this?
What else could we do?
Well, it turns out there's
other data structures still.
And now we'll move away
from the low level details.
Because again, just like yesterday,
once you get stuck in the weeds,
it very quickly becomes sort
of boring, if not complicated.
Let's just do a picture.
Well, it turns out that a
variation on using a linked list
would be to use this new feature--
these arrows, AKA pointers--
but create the equivalent
of a family tree.
So for instance, if I've
got a whole bunch of numbers
here, like the number 32-- I'm
going to draw it as just a circle
to be different this time, just
to be consistent with convention.
I'm going to draw 32 here.
I'm going to draw let's say, 50 here.
And then see if you can infer
where I'm going with this.
What two numbers should I
put on the left and the right
respectively, if again I'm stealing
those same seven numbers as before?
I heard 42 and then 99.
So let's try that.
Whoops.
Or 49.
42.
99.
And let me fill in the blank here now.
I'm going to put three circles here.
What would you propose I put left,
middle, right, in that order?
5, 10, 20.
Now, in fairness we've just
transcribed left to right everything.
So maybe we're getting lucky.
But there's some logic to this.
What is true about this structure?
There's a property, if you will.
This was either lucky or smart.
Or both.
STUDENT: [INAUDIBLE]
DAVID MALAN: True.
That's in the center, but that's
not really a generalization.
Give me something deeper.
STUDENT: It gets you
anywhere in three steps.
DAVID MALAN: It gets you
anywhere in three steps.
That is true.
But if I start adding more data,
that's not general enough for me.
STUDENT: [INAUDIBLE]
DAVID MALAN: The array,
specifically the--
STUDENT: The log.
DAVID MALAN: Yeah.
We have this logarithmic property.
It's viewed from a different angle,
because if you think of the--
Log n actually describes the height
of this family tree, if you will--
or one, two, three, so
the height of this tree, give or take 1.
And log base 2
of eight elements-- or seven elements--
is going to be 3.
So it's about that.
But there's another
property here. [INAUDIBLE]?
STUDENT: You don't have
to search every node.
The top of the node is the middle one.
DAVID MALAN: Exactly.
Or put more formally,
each of these nodes
represents the root of a binary
tree, as we would call it--
a binary tree meaning a tree
whose nodes have zero, one,
or two children, maximally.
So bi meaning two-- two children max.
And more specifically,
every node in this tree
is the root of a binary search tree,
which has exactly [INAUDIBLE] property, which
is the left child is smaller
than the root and the right child
is greater than the root.
And this was the
general definition I was
looking for whereby I
could make that sentence
hold for every one of
the nodes in the tree,
irrespective of how big the
tree is and irrespective
of the numbers in the tree.
So yes 32's in the top,
but that's uninteresting.
What's interesting is that 32 is bigger
than 10, and 32 is smaller than 50.
42 is smaller than 50,
and 99 is bigger than 50.
That same relationship
holds at every node.
So what does this mean?
If now I only have access-- as would
be the convention in a computer,
I only have the ability to look at
one thing at a time, by default,
I'm always going to look at the top
rather than some arbitrary element.
So if I always have a pointer, so
to speak, to the root of this tree,
how many steps now does
it take me maximally
to find any element in the tree?
Three.
Because I look at the top.
If it's not there, I
go left or I go right.
If it's not there, I
go left or I go right.
And then I'm at the leaves
of the tree, so I'm done.
So it's either there or it isn't.
So now we've borrowed ideas from both.
We still don't have random
access, but because we've
made this a two-dimensional
data structure and not
a one-dimensional data
structure, this depth
is allowing us to bring
back that feature of divide
and conquer by just laying things
out in memory a little more cleverly.
So if we now have what's called,
again, a binary search tree--
BST-- the maximum number of steps it
might take us to find a number is what?
STUDENT: Log n.
DAVID MALAN: Log n, again.
So we've done better now
than the linked list.
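The go-left-or-go-right search just walked through can be sketched in C. The struct and function names are illustrative, and the test below rebuilds the tree from the board (32 at the root, 10 and 50 beneath it, then 5, 20, 42, 99):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct tnode
{
    int number;
    struct tnode *left;  // smaller values live down here
    struct tnode *right; // larger values live down here
} tnode;

// Divide and conquer: each comparison discards half of a balanced tree,
// so searching a balanced BST takes O(log n) steps.
bool bst_search(const tnode *root, int target)
{
    if (root == NULL)
        return false; // fell off the leaves: not in the tree
    if (target < root->number)
        return bst_search(root->left, target);
    if (target > root->number)
        return bst_search(root->right, target);
    return true; // found it
}
```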
So what's the downside?
Always a catch.
What's the catch this time?
STUDENT: How much space does two
pointers from each data point--
DAVID MALAN: Good catch.
Yes.
So three, technically.
So if we look back at our memory, to
implement a binary search tree node,
now we need one, two,
three-- one, two, three.
So we pay a bit of a price.
So more memory for sure.
And what else might
bite us in the end here?
STUDENT: [INAUDIBLE]
DAVID MALAN: Yeah.
How do we add stuff in the middle?
So it's possible.
Suppose I want to add the number 21.
A naive approach might be--
OK, start where the root is.
And I know 21 is smaller,
so I go this way.
I know 21 is bigger than 10.
So I go this way.
I know 21 is bigger than 20.
OK, there's nowhere more to go.
So let me just create a new arrow--
so use that additional byte--
and put 21 here.
But an interesting issue now
is-- suppose I want to insert 22.
Where does it go?
Logically, we can skip the verbal steps.
22's down there.
Where does 23 go?
How about 24?
25?
26?
27?
28?
29?
I'm not going to bother drawing it,
partly because I've run out of space,
because I didn't anticipate this.
But what does this kind of
devolve into eventually?
STUDENT: An array.
DAVID MALAN: An array.
Well, not an array, because
it's still non-contiguous.
A linked list.
So indeed, depending on
the order in which we first
insert these nodes, if we
have the same numbers again--
suppose I start with an
empty binary search tree
and I insert the first number 5.
It's going to go there, because
it's got to go at the top.
And then I insert the number 10.
Where does it go?
Well, to the right.
Well, where do I put the number 20?
Well, to the right.
Where does 32 go?
To the right.
And then so you can very
easily construct a scenario
where it's not even
problematic eventually--
it's a problem from the outset.
If you get unlucky and you have
just a bad sequence of insertions,
you end up really with a linked list.
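That degenerate case can be demonstrated with a naive insert -- a sketch, with illustrative names, that attaches each new node at the first empty spot it finds. Inserting 5, 10, 20, 32 in sorted order produces a tree whose height equals n, which is a linked list in disguise:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct tnode
{
    int number;
    struct tnode *left;
    struct tnode *right;
} tnode;

// Naive BST insert: walk down, hang the new node off the first NULL arrow.
tnode *bst_insert(tnode *root, int number)
{
    if (root == NULL)
    {
        tnode *n = malloc(sizeof(tnode));
        n->number = number;
        n->left = n->right = NULL;
        return n;
    }
    if (number < root->number)
        root->left = bst_insert(root->left, number);
    else
        root->right = bst_insert(root->right, number);
    return root;
}

// Height of the tree: 0 for an empty tree, else 1 + the taller subtree.
int height(const tnode *root)
{
    if (root == NULL)
        return 0;
    int l = height(root->left);
    int r = height(root->right);
    return 1 + (l > r ? l : r);
}
```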
So damn it.
How do we fix this now?
Now as I hear myself
bemoaning the trade-offs,
I feel like no one should
ever go into computer science,
because it's completely frustrating.
But so be it.
How do we fix this?
STUDENT: Make new numbers in the nodes.
DAVID MALAN: Make new
numbers in the nodes.
OK, so we could start
changing what's in the-- OK,
so we can make sure that we always
build a tree that looks structurally
like our original one here--
nicely balanced, so to speak--
and we just move the numbers
around or change them.
So that's fine.
I would argue-- it's
definitely doable in code,
but you start moving
so much stuff around.
It's perhaps not ideal necessarily
to just start moving too much around.
But that's actually halfway to
what's the common solution here.
STUDENT: [INAUDIBLE]
DAVID MALAN: Yeah.
Think about this.
Let's take this simple
example where this
was starting to devolve into
just an annoying linked list.
How can I fix this?
Forget about 23.
We're not there yet.
All we've done is insert 20, 21, and 22.
How could I fix this so that
it doesn't devolve subsequently
back into a linked list?
There's an easy tweak I can make
with an eraser and a marker.
Yeah, Dan?
DAN: We could go to the next
[INAUDIBLE] the empty option,
where you would go to
the left of 42, instead--
DAVID MALAN: The left of 42.
OK.
So but that's going to violate
the recursive property that
says 32 has to be smaller than
all the elements to the right.
And that's not technically
what I said before,
but it follows through transitivity
by looking respectively
at each of the children, which has to be
larger than its parent on that branch.
Yeah?
STUDENT: Instead of
rewriting the tree, just
go to the leaf of the smallest node
that you could just rewrite a subset.
DAVID MALAN: Yeah, so just a subset.
So let's merge your ideas.
And I feel like if I had scissors, I
might snip it right here, for instance,
thereby detaching this little chain.
And who would make a better candidate
as the root of this mini tree?
Yeah, so 21.
So let me just fix this.
And I can use the same
memory, but the eraser's
easier to do mechanically here.
So what if I redraw this as 20,
21, 22, and make 21 the new root.
So this is similar in spirit to what
I think Dan was getting at earlier,
or to what Griff was getting at earlier.
But instead of rewriting
too much of this tree,
we could really just
focus on the subtree
into which we want to insert this node.
And indeed, now it's still getting
a little long on this side,
but we're not in such bad shape yet.
Because if we end up getting more
nodes here and here, then maybe
here and here, and here and
here, here and here-- which
you could imagine coming up with the
numbers to make sure that happens,
eventually the tree gets a little
long and stringy on the left.
So fundamentally what do we
have to do at that point?
Yeah, so instead of using 32 as
the node, maybe what I want to do
is snip, snip there, rotate
the tree a little bit--
and I have to fix some of the pointers.
I have to make a few snips this
time, because the problem is so big.
But it turns out that I can
fix this by really just making
one snip and rotation per
height of the tree-- which
is to say that per our
running time earlier,
it might take me log of n steps to
figure out where the new node goes,
and then I regret it because damn, now
the tree's getting a little stringy.
It's OK.
Just by going back up
the tree log n steps,
I can kind of rotate things
over-- snip, snip, reattach--
in roughly the same number of steps.
So it's two times log of n, which
is surely bigger than just log of n.
But again, to my claim earlier that
a computer scientist would generally
ignore these constant
numbers like two, the reality
is, in practice, doing log
n steps versus two log n
steps-- it's almost the same thing.
It's not worrisome fundamentally.
What's more worrisome is
if it's like n squared or n
cubed, when it's the big number that
depends on the size of the problem.
Constant numbers, like multiplying by
2, we don't really care theoretically.
So the binary search tree can
remain this if we balance it.
And so technically, a binary
search tree is big O of n,
because technically a linked
list is a binary search tree.
It's just a crazy incarnation of one.
But what we're describing
here is something
that's generally called an AVL
tree, which is still a binary tree,
but by its own definition it maintains
balance by doing the snip-snip approach
and rotating as needed in order to
make sure it's not devolving back
into a linked list.
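The "snip, snip, reattach" step being described is what textbooks call a rotation. A minimal sketch, with illustrative names (real AVL code would also track heights and decide when to rotate): a left rotation around a right-leaning node, like the 20 -> 21 -> 22 chain from the board, makes the middle value the new subtree root:

```c
#include <assert.h>
#include <stddef.h>

typedef struct tnode
{
    int number;
    struct tnode *left;
    struct tnode *right;
} tnode;

// Left rotation around x: x's right child y becomes this subtree's new root.
// Snip y's left subtree and reattach it as x's right subtree,
// then snip x itself and reattach it as y's left child.
tnode *rotate_left(tnode *x)
{
    tnode *y = x->right;
    x->right = y->left; // snip: y's left subtree moves under x
    y->left = x;        // reattach: x hangs off y's left
    return y;           // y is the new root of this subtree
}
```

Each rotation is a constant number of pointer moves, which is why rebalancing on the way back up costs only another log n steps.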
All right.
So now we seem to be getting somewhere.
What goals might we still have when
it comes to data and data structures?
How could we do even better?
Yeah?
STUDENT: How about dedicating memory?
I know graphics cards are able to dedicate
memory just for their own operations,
then just leaving integers and
zeros if they're not being used.
So you can just do--
DAVID MALAN: Yeah.
So that's perfectly fine.
It doesn't fundamentally
solve the problem,
because even they have to decide in
advance how much memory to allocate.
And arguably, it's wasteful.
And you might feel that,
because if you're like a gamer
and you install your own
graphics card to speed things up,
you're literally spending more money to
get that feature, so totally possible.
But it's a trade-off.
You're being somewhat
more wasteful as a result.
Good question.
Other thoughts?
I mean, what's better
than big O of log n?
That feels like it's the progression.
We're chipping away at the problem.
So what should come next?
What would be the ideal?
What's the holy grail of search times?
What's the fewest number
of steps you would enjoy
allowing to find some piece of data?
One!
Feels like-- I mean, it can't be
zero, because that makes no sense.
You're doing no work.
You have to do some work.
Well, one step.
So can we get there?
Well, turns out there's a data
structure that aspires to be big O of 1.
This is misleading.
It's not actually typically big
O of 1, but that's the ideal.
And how can we implement
something like that?
Well, turns out that a hash table and--
you mentioned earlier a dictionary.
A dictionary actually
is the high level term
used to describe the functionality
that a hash table gives you.
A hash table would typically do this.
Just as we've seen a
binary search tree--
it can borrow ideas from say,
an array and a linked list.
A hash table, in turn, can borrow
ideas from an array and a linked list,
but in a different way.
What a hash table
typically does is this.
Suppose we're implementing a hash
table for everyone's birthdays
in the room or birth months, let's say.
So we want to put everyone
in this room into a bucket--
January, February, March, April,
May, all the way to December.
So the problem here is that we
could use, for instance, an array.
So let's draw it like this.
[INAUDIBLE]
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
So this is January, February,
March, dot, dot, dot, December.
So this is an array.
And rather than talking about numbers,
we can talk about people's names.
And just for demonstration's sake,
does anyone have a January birthday?
OK, so Shivang?
So Shivang and just Shivang.
How about-- oh, and Dan.
Perfect!
Already we have a problem.
I can only fit one of you in
this data structure, right?
Because if we try to put
Dan in the same bucket--
this is finite amount of memory.
And maybe it's more than one byte now,
because we need to accommodate names.
But we can only put one name
at a time, I claim, today.
So Dan or Shivang,
one of you's gotta go.
So where could we put you?
Where could we put Dan if Shivang's
already taking up that space?
STUDENT: In February.
DAVID MALAN: In February.
So we could.
Unfortunately, we're just kind
of ignoring the problem, which
might get us through.
But unfortunately with 20
plus people and 12 buckets,
somebody's going to be left without
a chair, so to speak, at the end.
For instance, does anyone
have a February birthday?
So Shawn and Owen.
So now, OK.
So how about one of you
moves to March and April?
And so this is actually an example
of a technique called linear probing,
whereby you have a data structure.
You try to put the data where
you intend for it to go.
But if not, you probe progressively
for an additional free spot,
so as to at least fit the
data into the data structure.
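Linear probing can be sketched over the 12-month array from the board. The fixed-size name slots and the `probe_insert` helper are illustrative assumptions, not a standard API; an empty string marks a free slot:

```c
#include <assert.h>
#include <string.h>

#define SLOTS 12 // one per month, January = 0 ... December = 11

// Linear probing: start at the intended bucket and scan forward,
// wrapping around, until a free slot is found.
// Returns the slot used, or -1 if the table is completely full.
int probe_insert(char table[SLOTS][16], int bucket, const char *name)
{
    for (int i = 0; i < SLOTS; i++)
    {
        int slot = (bucket + i) % SLOTS;
        if (table[slot][0] == '\0') // empty slot found
        {
            strncpy(table[slot], name, 15);
            table[slot][15] = '\0';
            return slot;
        }
    }
    return -1; // no chairs left, so to speak
}
```

Notice the failure mode from the discussion: a February collision can push a name into March, April, and in the worst case all the way to December.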
So in the best case,
it's where you expect.
But in the worst case, where
might Dan or Owen actually end up
if we keep running into these
collisions where two names are
trying to go in the same place?
In the worst case, Owen and Dan
might end up all the way in December,
especially if there's more people with
February or March or April birthdays.
So at the end of the day, we're
trying to get to constant time.
One step to find Dan.
One step to find Owen and Shivang.
But if we're arbitrarily as a hack
moving Owen or Shawn over here,
then it's just back to linear.
And we seem to have regressed
back to where we were earlier.
So how do we fix this fundamentally?
It's not good if we can only handle
12 students in a 20-plus-student
classroom.
We want to handle everyone, and we want
to remember everyone's birthday month.
What could we do pictorially
to deal with this?
What would you do?
STUDENT: [INAUDIBLE]
DAVID MALAN: So cut into half.
Unfortunately, if I do
this, we're only going
to fit like "Ow" here and "Sh" for
Shawn, because it's finite memory.
So it's not quite as simple as just
chopping the chunks of memory in half.
What else could we do?
STUDENT: Add a second byte.
DAVID MALAN: Add a second byte.
So we can't necessarily steal
February's byte like before
and merge January and
February, because we'd devolve
right back into the same situation.
We can only do that like six
times before we're out of memory.
But where could we
steal another byte from?
I've only drawn this.
But this is the high
level representation.
At the end of the day,
this is our canvas.
So, so to speak, we could go below it.
And again, the grid here
is just a conceptual thing.
It's just a sequence of bytes.
But we could certainly do this.
And in fact, you know what--
it's not available here.
There's no free memory right
here in our other picture.
But maybe there's some over
here, over here, over here.
So let's borrow some of those ideas.
What do we do to give some
extra space to January?
STUDENT: Make it into
another twelve months?
DAVID MALAN: Make it into a what?
STUDENT: Another twelve months.
DAVID MALAN: Another 12 months.
So we could do that,
but that's overkill.
I just need room for
like Shawn and Owen here.
We just need to create a little
bit of space, not 12 more spaces.
Grace?
GRACE: So if Shawn and Owen, you
each-- you give to each of them
two spaces somewhere else,
and then one is their name,
the other is their
reference back to January.
DAVID MALAN: OK.
So we could have used this
pointer idea from before.
But I daresay we're making it
harder than it needs to be.
This is an array, because we know a
priori there are 12 months in a year.
And we're pretty
comfortable hardcoding that.
That's not going to change.
What is going to change is how
many people are in the room
and how many people we want to
fit into this data structure.
So what is the data structure that
gives us that kind of dynamism?
STUDENT: Search tree.
DAVID MALAN: Oh, OK.
We could do a search tree, but I
don't care so much about search.
What's the simpler incarnation?
Just a linked list, right?
If this is fixed on our horizontal axis
visually, why don't I grab more space,
but if there's a block of
memory over here-- so let's see,
I've already forgotten where we started.
But Shivang was in January.
And then was it Dan?
You're also January?
So Dan is January.
You know, what I'm going to do?
I'm not going to put Shivang here.
Instead I'm going to have an
array of 12 pointers-- arrows
if you will-- that
are initially nothing.
They're just zeros, or
pointing to nothing.
But as soon as I find, ooh,
someone has a January birthday,
I'm going to go ahead and
write his or her name here.
And as soon as we encounter another
such student-- you know what,
I'm just going to draw another
arrow to another block.
And each of these
blocks can be anywhere.
The only one that has to
be contiguous is this one.
So I could put Dan over here.
So we're no longer resorting to
linear probing where we're just
hoping-- I got to find a space for
Dan, got to find a space for Dan--
because that can devolve into him
being way at the end of the list
after all of our other data.
Now at least, we can start to
plot people where they belong,
but just kind of growing
that data structure.
So in an ideal world, we would put just
one person in each of these months,
but that's obviously not realistic.
We have to decide in advance
typically when writing software
like how big a data
structure's going to be,
at least if it's an array like this.
But 12 feels reasonable if we're
bucketizing people, so to speak,
by their birthday months.
But the linked list kind of
gives us the best of both worlds.
We can immediately find
all the January people
and then I can tell you one at
a time who those people are.
I can immediately find all
the December people just using
arithmetic and random access,
and then I can rattle off
who the December people are.
So the fact that there's
a linear aspect here
isn't such a big deal if the goal
is just to remember everyone.
You've got to follow some
kind of bread crumbs.
But if we want to search them,
this will be problematic,
especially since I've put them--
it would seem out of order,
reverse alphabetical order.
So a hash table is this.
This is one of the most common
incarnations of a hash table
whereby for each of
these months, you might
have a different length linked list.
But in an ideal world, each of these
linked lists is as short as possible.
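This array-of-linked-lists incarnation -- hashing with separate chaining -- can be sketched as follows. The names `person`, `ht_insert`, and `ht_lookup` are illustrative; the "hash function" here is trivially the birth month itself:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 12 // one per month, January = 0 ... December = 11

typedef struct person
{
    char name[16];
    struct person *next;
} person;

// The hash table: an array of 12 pointers, initially all NULL.
person *table[BUCKETS];

// Collisions are fine: just prepend to that month's linked list.
void ht_insert(int month, const char *name)
{
    person *p = malloc(sizeof(person));
    strncpy(p->name, name, 15);
    p->name[15] = '\0';
    p->next = table[month];
    table[month] = p;
}

// Jump straight to the right bucket via arithmetic (random access),
// then walk only that month's hopefully short chain.
person *ht_lookup(int month, const char *name)
{
    for (person *p = table[month]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}
```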
And while I've chosen for the
sake of discussion birthday months
as the number of elements in our
array, that was just arbitrary.
And a good hash table will actually
choose a much bigger number than 12.
So if there are 26 people in this
room, hopefully the hash table
is actually going to be
at least 26, if not 200.
So a little bit wasteful,
but that minimizes
the probability of there being these
collisions or reduces the probability.
But even if there are
collisions, it's OK.
We have a plan B. We'll
introduce a linked list.
Now hopefully there won't be too
many collisions, and when there are,
they'll be relatively few.
And so a hash table is technically
not in this form constant time,
instant access.
But it's much closer to instant
access than any of the data structures
we've looked at before.
Because if there's 26 people
in the room, it might take us,
what, four or five steps
if we did a binary search
tree to traverse that many
people-- log base 2 of 26.
If we did linear search,
it might take us 26 steps.
But in this scenario, it looks
like I can find Shivang and Dan
in two steps, one step-- which is
just proof by example, to be fair.
But it is fewer than those ballpark
numbers I mentioned earlier.
So a hash table gets
us much closer to this.
And the knobs a developer
would turn-- someone
who's implementing a library that
we would use in our own software
would generally try to use
some sophisticated math,
and if not some statistics
or just some heuristics,
to decide how big the
hash table should be
and maybe once it even reaches a
certain size how to reorganize the data.
And indeed one of the
features of databases today
like Oracle and MySQL and
PostgreSQL is underneath the hood,
one of the reasons they
are able to give you back
your data faster than just
searching it in like an Excel file,
top to bottom, or over a file
system, a folder of files,
is that they are using fancy structures
like this, like our tree structures.
In fact, the B-tree is a common data
structure that a database would use.
And it's similar in spirit to the BSTs
and the AVLs but just with more nodes.
It's even shorter.
It's just much wider, because each node
has more than two children, typically.
Yeah?
STUDENT: How does it know
which month to start on?
DAVID MALAN: How does it know--
what do you mean, start on?
STUDENT: Compared to the
array, where originally it
would be sort of a habit with binary
search, how does it know to go to March
or go to December?
DAVID MALAN: So in this case,
I'm assuming that these 12
elements have some meaning.
So January will be represented
with 0 and December
will be represented with 11.
And that's actually very common.
In programming languages, if you
want to print out the month December,
you will know or hard code in
your software the number 11,
by convention for exactly that reason.
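That zero-indexed convention is easy to see in a table of month names, a tiny illustrative sketch:

```c
#include <assert.h>
#include <string.h>

// Zero-indexed months, per the convention just described:
// January is 0 and December is 11.
const char *MONTHS[12] = {
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
};
```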
Yeah, Vanessa?
VANESSA: So I can imagine that
whenever you start a project,
you have a particular focus
and a particular data structure
that you think would be suited for
that, and as that project evolves,
there would be even further
complexity or different direction that
might require you to change or
migrate or update your data structure?
DAVID MALAN: Absolutely.
Over time, as your problems change, you
might need to change your underlying
data structures.
However, I would say it's typically
only certain types of companies
that would focus-- well, is that fair?
No, that would be an overstatement.
So short answer, yes.
As your data changes
or as you realize you
have so much data that
your algorithm is slow,
you would absolutely re-engineer
how you're implementing things.
For instance, if you have only a few
hundred things in your-- few hundred
customers or few thousand
customers, frankly,
when you load those into memory,
you can just put them in an array,
search them linearly.
Because a computer can search
a thousand elements super fast.
No one's going to notice.
But eventually-- it's
a bad design decision.
But it's maybe a good
business decision if it
means you can ship the
software faster than it
would take to actually
engineer something fancier,
like we've been discussing here.
But this is how companies
accrue what would
be called technical debt, whereby you
keep cutting a corner, cut a corner.
You go the cheap route,
cheap route, cheap route.
Eventually you're going to
have to really pay the price.
Because when you have
a wonderful success,
like all these customers
suddenly sign up,
now your software doesn't
even work, because you haven't
anticipated the load on your servers.
And then you have to go
back and re-engineer things.
Technical debt can also be accrued in
the sense of-- if you let your team
or yourself just be sloppy with
your code-- you're not commenting
or describing it, or you're
just copying and pasting code
rather than really thinking through
how you can factor things out-- you
accrue technical debt that
eventually you have to pay back.
And that's generally in
the form of time, which
can be bad to postpone in that way.
So absolutely.
It's an evolving cycle.
And I would say that what you
pay companies like Oracle for
is that secret sauce,
among other things--
they're faster than the
competition, for instance,
or so they claim.
Well, there's probably
some science behind that
and some theoretical
arguments to back it up.
Now as an aside, it is
technically possible
to get constant time access
in the following way.
There's another data structure.
Let me wave my hands at
the details and call it
a trie-- which is short for
retrieval, though it's confusingly
pronounced "try."
But a trie does actually give you what
folks would call big O of 1 time.
And it would often be used
for a dictionary of words.
Not a directory, rather a dictionary.
Oh, I misspoke earlier.
I said directory, not dictionary.
Dictionary is the fancy speak for a
hash table, or an incarnation of it.
But a trie could give us
constant time as well.
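As a rough sketch of the idea (my own illustration, not any particular library's API), a trie stores words character by character, so lookup cost depends only on a word's length, not on how many words are stored-- effectively constant time with respect to the size of the dictionary:

```python
class Trie:
    """A minimal trie for a dictionary of words."""

    def __init__(self):
        self.children = {}    # maps a character to a child node
        self.is_word = False  # True if a word ends at this node

    def insert(self, word):
        """Walk/create one node per character, marking the last as a word."""
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word):
        """Follow the word's characters; cost depends on len(word) only."""
        node = self
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word

words = Trie()
words.insert("cat")
words.insert("car")
print(words.contains("cat"))  # True
print(words.contains("ca"))   # False: a prefix, not a stored word
```

Whether the dictionary holds ten words or a million, looking up "cat" takes the same three steps.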
But let's not go too far down
the rabbit hole just yet.
But this is why I would argue
computer science is fun or software
engineering is fun and also hard.
Again, you can cut corners so easily
when your data sets are small,
but the reason that Google and
Microsoft and Facebook and Twitter
have such smart people working
for them is that these are really
non-obvious problems sometimes.
And indeed, what's been new about big
data, so to speak, in recent years,
especially Twitter-- Twitter
was horrible initially
at actually keeping up with
the rate of their success.
And the fail whale used to
be a thing, if you recall?
Thankfully, that's been
decommissioned in recent years.
But they used to
struggle under the load,
so much so that you
couldn't access twitter.com,
and that's because this stuff is hard.
When you have thousands or millions
of transactions per day or per minute,
you actually need to think
about things to this level.
And so these are just some of the
basic building blocks early on.
And people build on these general
ideas to make fancier software still.
Any questions?
All right, well, let's just round this
out with a bit of a laundry list, just
to take it up to a higher
level and make
mention of some of the
tools and ingredients
that we might use in the context
of frameworks and libraries.
So we had this laundry
list of languages earlier.
Earlier, I claimed
that it's unfortunately not
always super easy to use Java out of the
box for web programming, or Python,
or Ruby, and certainly
not C. And so there
exists some very popular
libraries out there
that people have started
to use and popularize,
especially because of open source,
that are just worth knowing about.
So we'll talk more after
our break about jQuery.
But jQuery is a very popular library
for the language called JavaScript.
And this just makes JavaScript
arguably easier to write,
because it comes with a lot more
functionality than the language itself.
So it's worth noting.
jQuery in particular is
so popular that it's not
uncommon to see people
list it on a resume as part
of a list of, I know this language, this
language, this language, and jQuery.
It's not a language.
And frankly, when people
don't quite realize that,
that itself can be kind of a hint of
their sophistication, I would argue.
But it's so popular, that
most people essentially
equate it with the language itself.
It's used in so many places, though
there's a bit of a turn against it
now, because it's become
so big and weighty.
So it's not requisite.
Ruby on Rails is a
very popular framework.
And actually, let me keep these in two
separate columns so as not to mislead.
So we'll have libraries over here--
and frankly, sometimes the line
will get blurry-- frameworks over here.
And this is an infinite list, not
unlike the languages one we saw before.
So jQuery for JavaScript,
Rails for Ruby.
Django is a very popular
framework for Python.
Flask-- and now we have another
category, called microframeworks,
which are things
like Flask, dot, dot, dot.
Let me distill this
distinction in a moment.
There's no way we
could give you the fire
hose of all these possible things,
because these are the things that
are constantly changing.
So let me give some
high-level takeaways.
There's often this ebb and
flow in technology certainly
in recent years as to, what's popular
or the right way to do things?
And this is in part a function of
people learning better techniques.
We live with some languages
for a little while
and we realize, damn, it's annoying to
write certain stuff in this language.
Let's come up with a new language,
or let's write a library of code
that we can make freely
available to other people
to sort of stand on
each other's shoulders.
Or let's change to a
different approach altogether.
So jQuery's an answer
to that first scenario.
Like wow, this is an annoying
language to write in certain patterns.
So thus was born libraries like jQuery.
Frameworks-- like Rails for
Ruby and Django for Python
and yet others-- were
introduced to make it
easier and more
pleasurable to actually use
Ruby and Python and similar
languages for web development.
And they themselves got so popular
that these are still two pretty popular
go-to places.
Node.js-- kind of framework,
I'd say, although it's
a little different from these.
The lines start to get a little murky.
Node.js is a way of using a language
called JavaScript on the server side.
And we'll talk about JavaScript
after our break in a few,
but Node.js has been in
recent years popular for using
a language in a way it wasn't
initially intended for.
But it allows you a different
way of programming still.
Microframeworks are kind of a reaction
to the proliferation of frameworks.
Frameworks are essentially
collections of libraries--
might that be fair to say?
It's perhaps a whole
bunch of libraries you
use to accomplish some goal--
a library for sending email,
a library for talking to a
database, some low-level stuff
often that you yourself
don't want to think about.
You want to build a business
that just needs to send email
and needs to use a database.
Frameworks are collections of libraries
and also with sets of conventions.
And so one of the downsides
of choosing a framework
is that you pretty much have
to follow its documentation
and organize your code--
your files and folders-- in a way that
is compatible with that framework.
And that's fine, because
there's an advantage sometimes
to doing what other people are doing.
But it also creates a
bit of buy-in or lock-in,
whereby if you decide or your
staff changes and you're like,
we're tired of using Rails.
We want to use something else.
You have to do some
nontrivial architecting
or rearchitecting-- redesigning to
actually change over to something else.
So these are the kinds
of decisions that when
first building a prototype of
something or deciding on a technology
stack, so to speak-- a
collection of technologies
your business is going to use--
it's worth spending more time
upfront discussing, researching,
arguing about, whiteboarding
the possible approaches
so that when you dive in,
you actually know what
you're getting yourself into.
And of course, if you hire certain
people who have good experience,
they'll too bring their own biases,
their own expertise to bear as well.
And microframeworks,
to wrap up this part,
are a reaction to frameworks, which
have gotten very bulky-- just so
many libraries and so much code, like
truly the kitchen sink of software.
So microframeworks are a reaction to
that approach, whereby microframeworks
tend to do just one thing,
or one or two things.
And let me put this into context.
For instance, a popular framework
for PHP is called Laravel.
And if I pull up its documentation,
it does a whole bunch of stuff.
So routing, which refers to how you
get your HTTP request to the right
file on your server.
Views, which has to do with
the aesthetics of your site.
How do you present data to users?
Scrolling down here.
Authentication, how you log
users in; authorization,
how you decide whether or not
to allow a certain operation;
billing, how you bill your
customers; caching; encryption,
how you encrypt your data--
and I'm skipping over ones
I'm not even sure of.
Mail, how you send mail.
So when I say kitchen
sink, I just mean-- Laravel
comes with all of this functionality.
And this too creates
all the more lock-in.
And if you start Googling around,
you'll often find performance benchmarks
on libraries or frameworks, whereby
someone will compare A to B to C
and make the claim, credibly or not,
that one library or framework is better
than the other, because
look how much faster it is!
You should always take
that with a grain of salt,
because so often do people run
different code on different platforms.
And some frameworks might be good at
one thing, worse at another thing,
so it's really not
apples to apples always.
But the fact that there's so much
code baked into certain frameworks
means they are a little
bit slower than others.
And so microframeworks do maybe
one thing, but only one thing.
And so you use this
microframework for problem A,
this microframework for
problem B, and so forth.
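To make "routing" concrete-- the one job that a routing-focused microframework handles-- here is a toy sketch in Python. This is my own illustration of the idea, not Flask's or Laravel's actual API: it just maps a request path to the function that handles it:

```python
# A toy router: maps URL paths to handler functions,
# the core job that frameworks call "routing."
routes = {}

def route(path):
    """Decorator that registers a handler function for a given path."""
    def register(handler):
        routes[path] = handler
        return handler
    return register

@route("/")
def index():
    return "Hello, world"

@route("/about")
def about():
    return "About us"

def dispatch(path):
    """Look up the handler for a path, or return a 404-style message."""
    handler = routes.get(path)
    return handler() if handler else "404 Not Found"

print(dispatch("/"))         # Hello, world
print(dispatch("/missing"))  # 404 Not Found
```

A microframework does essentially this-- well-- and leaves email, billing, caching, and the rest of the kitchen sink to other tools.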
So on the one hand, it's both fun
in that the technology is always
changing, on the other hand, it is so
frustrating and so hard to keep up,
or time-consuming to
keep up that you really
have to have passion or a foot in
the game to-- hand in the game?
Skin in the game!
All right.
I'm picking wrong body parts--
in order to stay current
with this kind of stuff.
Griff?
GRIFF: So frameworks are
essentially collections of libraries
and microframeworks are smaller
collections of libraries.
What's the dividing line between
a library and a framework?
DAVID MALAN: It's a good question.
I wouldn't say there's one
more formal definition,
but a library is code
that someone else has
written that accomplishes
some goal that you
were integrating into your own project.
So that might be a library
that gives you functionality,
gives you a puzzle piece
that lets you send email.
That would be a library, for instance.
Or actually, Scratch is not
a bad example as a metaphor.
If we had this palette here, I would
argue that each of these categories
is a library.
This is a library of motion-related
puzzle pieces, a library
of looks-related puzzle pieces.
So that's a good way of
thinking about a library.
It's related pieces of functionality
that you can use in your own projects.
A framework would be
something like Scratch.
If your goal is to
implement a program that
implements a cat saying
hello world, Scratch
is one of the frameworks you
can use to implement that.
Because the framework comes
with a whole bunch of libraries.
It also comes with a set of
conventions-- hey, here's our cat.
Hey, here's our script.
And so forth.
And so if you want to implement a
cat that says hello world, Scratch
is one of the frameworks you can use.
Unfortunately, you have to go all
in and use MIT's application or not,
and so there's that lock-in aspect.
By contrast, you could
use something like OpenGL,
which is a framework and a set of
libraries for actually implementing
graphical software to
a much lower level,
but it's going to be a lot harder.
It's going to be a lot
more arcane than something
like this drag and drop language.
So a microframework, rather--
it's kind of equivalent to a library
with better branding, perhaps.
It's a library that does one
or a few tasks and might also
come with some conventions.
So a microframework like Flask might
prescribe how you organize your code
and how you organize your files
and directories, I would say.
Yeah, Alicia?
ALICIA: So if you [INAUDIBLE]
a framework from a library,
you're kind of locked in.
If you use C or C++ on its own, you
get complete control of the palettes.
DAVID MALAN: That is correct.
ALICIA: So, for
instance, in my industry,
Goldman Sachs is known for coming
up with their own language,
meaning they want
to control everything.
Is that more of a time to market thing?
So if you were a startup
and wanted to get started,
you'd probably want to leverage
frameworks and libraries
and not just be a purist and say we're
going to write every piece of code--
DAVID MALAN: Yeah, a startup should
absolutely use off-the-shelf tools.
And it is an indulgence for sure
to have the resources to come up
with your own language.
That would be a very uncommon case.
And if not that, it's
perhaps a way of explaining
why you have so much legacy code.
You've bought in so much that you have
your own language, because you never
sort of adapted to industry trends.
If it works, it works.
That's fine.
But yeah.
A startup should be using
the kind of laundry list
we've been tossing on the
board yesterday and today.
ALICIA: Are there security
benefits without [INAUDIBLE]?
DAVID MALAN: That is true.
The less you rely on other
people's code, the fewer threats
you expose yourself to, because
there's less code to worry about.
But that assumes you are better
at writing code than other people.
And that's a risky claim to make.
The upside of open source
software too or an upside
is, you have more eyes
in the community on it.
Now those eyes might be
good and bad programmers,
so that may be a good or a bad thing.
But openness of software
is generally a good thing,
because you would think that,
asymptotically, you approach
perfect software as more
people contribute to it.
So it goes both ways.
But even then, a company
would be naive if they think,
well, we're only using
tools we developed in-house.
Because odds are, at the end
of the day, they're absolutely
using a compiler someone else wrote.
Unless they're going back to
the equivalent of punch cards,
there's no way that's even tenable.
Other questions?
All right.
So next up is going to be a
little bit of web programming
and some of the technologies
related thereto.
We'll talk a bit about JavaScript,
which is commonly used on the front end.
And that's where we'll play with it.
But it can be used, as I mentioned,
with Node.js on the back end.
And that will tie everything together.
Hopefully, we'll fill in some blanks.
I'm still thinking about what
my answer is to what's the future.
But let's go ahead now and take a 15
minute break and we'll resume at 3:15.
