>> Sean: We are kicking off a bit of a 
mini-series on things that programmers find
essential, and it'd be great to find out
what you think is essential.
>> BWK: Sure.  Lots of
things that are essential.  Do you want to
narrow the scope of that a little bit?
I mean, coffee is essential.
>> Sean: So within programming, I mean, you want to tell us
about some kind of array -- is that true?
>> BWK: Yes, I guess I would say that I wanted
to talk primarily about associative
arrays which are in other contexts
called hash tables or sometimes in a
language like Python they're called
dictionaries.  The idea is an array which
has as its subscripts not integers from
0 or 1 up to something, but rather any
arbitrary thing is the subscript of the
array.  And so this gives you a great deal
of flexibility; you can build really
interesting things with associative
arrays.  I will call them associative
arrays.  If you think hash table if you're
a Java or JavaScript programmer, let's
say, or if you think dictionary for
Python people, it's the same basic thing.
So the idea is instead of having
subscripts which go from let's call it
zero up to something, you have arbitrary
subscripts, and for simplicity let's just
call them strings of characters, just
text: words like "Sean" or "Dave" or "Steve". 
>> Sean: So
let's get people context there -- we have
our film crew in full effect.
So if I understand rightly, then that
means instead of keeping track of, you
know, number 13 being the number that I
gave to the count of whatever, instead
it can say nails or screws or whatever.
>> BWK: Correct.  One of the classic
examples of this thing is that you
want to keep track of, for example, oh
your groceries that you might buy.  So we
are buying things like beer and pizza
and coffee and chips, this sort of
thing.  And so what you'd like to do is
have an array where the subscripts are
beer and pizza and chips and coffee or
whatever else.  And then the values of
those array elements can be whatever you
want, but they might be for example
how much you spent on beer and pizza and
coffee.  And so you could write a very
simple program that would simply say "I
have pizza ten pounds, beer 20 pounds, coffee two
pounds, beer 20 pounds" and I would like to
just add up all of the things for the
pizza, all of the things for the beer
separately, and so on.  And so I would
simply add numeric values to
what you see in the array, at those
subscripts that, instead of being
numbers, are just the strings: beer, pizza.
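
[Editor's sketch, not part of the interview: the grocery tally described above might look like this with a Python dictionary. The names `purchases` and `totals` are assumptions made for illustration; the amounts echo the figures mentioned in the conversation.]

```python
# Purchases as (item, amount) pairs, in the order they were mentioned.
purchases = [("pizza", 10), ("beer", 20), ("coffee", 2), ("beer", 20)]

totals = {}  # associative array: item name -> running total
for item, amount in purchases:
    # get() supplies 0 the first time an item is seen
    totals[item] = totals.get(item, 0) + amount

for item, total in totals.items():
    print(item, total)
```

[The two beer entries end up added together under the single subscript "beer", while pizza and coffee keep their own totals.]
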
I think it's a lot easier in many
respects and so easy to use this stuff,
very natural I think for many people, and
it turns out the implementation is
actually quite simple as well.  It took I
imagine in computer science a while for
people to realize how to do it well and
now I think people really do understand
how to make effective, efficient
implementations of associative arrays.  So
I don't know whether I could perhaps
draw a picture that would be useful that
would give somebody an idea of how to do
this.  Let me try.  
>> Sean: Sure.
>> BWK: Let's suppose I
have an array, let's just call it X
because I have no imagination and I
would like to say, okay, the element here
is pizza.  And I could say how much pizza
did I buy?  Well 20 pounds worth of pizza or
something like that, or maybe it was 20
pizzas -- doesn't matter.  And so I can say X
of beer equals 10, and so on.  And then I
could later on say well, X of beer equals X
of beer plus 15.  So this is the way
that you would use these in a program
with all of the advantages and they
would look just like the subscripts that you
would find if these were integers
instead, like pizza were 1 and beer were 2
and so on, but this is a lot more natural.
So that's kind of the use of this.  How
would you actually build one of these
things?  The internal representation is
often, and this is where really hash
table is probably the right way to think
of it, suppose we have in memory
somewhere an honest to god array of the
sort that we're familiar with where the
subscripts go from 0, 1, 2,
dot dot dot up to whatever, some number
N like this.  This is going to be the
hash table where you can find things.  And
what happens is if I want to find the
particular place in this table where
elements associated with let's say pizza
are stored, then what I do is I take pizza
and I run it through a hash which
scrambles it up in some way and produces
a number which is between 0 and in this
case let's call it N.   OK?  So the
hashing takes this, scrambles it, you
could imagine doing just by adding up
the letters, treat them as numbers, add
them up, and you get some value, and then
you use the modulus function to reduce
it so that it's in this range.  That tells
you the pizza stuff is going to be
stored here, but you don't store it in the
table; what you do is that you have
typically something that says I'm pizza
and my current value is 20.  This is
getting a little too small.  And this
simply points to it.  Okay?  So now somebody
comes along later on and wants to say
where is pizza stored in the array?  What
I do is same process I say ok take
pizza, run it through the hash function,
gives me a number, let's call it in this
case 2, and I say oh ok the pizza guy is
there.  Okay?  So that's the basic thing. But
what happens if something else by
accident of the hashing collides with
pizza.  Maybe beer collides with pizza.  I
mean they go well together.
So what happens there is that the data
structure is really a linked list and so
if it turns out that by accident beer
also, when run through the hash, comes out
as 2, so this is called a hash collision.
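
[Editor's sketch, not part of the interview: the "add up the letters, then take the modulus" idea can be tried directly. The function name `letter_sum_hash` and the table size of 4 (slots 0 to 3) are assumptions chosen for illustration. With that size, pizza and beer really do both land in slot 2.]

```python
def letter_sum_hash(key: str, table_size: int) -> int:
    """Add up the character codes, then reduce modulo the table size."""
    return sum(ord(ch) for ch in key) % table_size

TABLE_SIZE = 4  # hypothetical size: slots 0, 1, 2, 3

for word in ["pizza", "beer", "coffee", "chips"]:
    print(word, letter_sum_hash(word, TABLE_SIZE))
# pizza 2
# beer 2    <- same slot as pizza: a hash collision
# coffee 0
# chips 3
```

[A plain letter sum is only a teaching device: any two words made of the same letters get the same sum, so practical hash functions mix the characters more thoroughly.]
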
So what you have to do there is to say
well there's really a data structure a
little more complicated and it points off
to something that says beer, which is at
this point 25 let's say.  And so when I
want to look up beer, I simply start at
the front here and I say pizza no that's
not beer; ah there's beer, and now I'm all
set.  And there's also something that says
there's no more in this list.  And if
another thing comes along that happens
to also collide with this, then no
problem, we just make the list longer.  So
then what you have to do, this hashing
function here has to have the property
that you give it a bunch of different
things like beer, pizza, coffee, coke,
whatever, it should spread those things
fairly uniformly across the table so
that you don't get everybody piling up
in this particular element but you get
mostly these little chains of things
about the same length.  And that means
that the access to the information in
this table, it's sort of constant time.
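
[Editor's sketch, not part of the interview: a minimal chained table along the lines described, under two assumptions. A Python list stands in for each linked list, and the class name `ChainedTable` and its methods are hypothetical. The hash is the simple letter sum from the conversation.]

```python
class ChainedTable:
    """Fixed-size hash table; each slot holds a chain of [key, value] pairs."""

    def __init__(self, size=4):
        self.buckets = [[] for _ in range(size)]

    def _slot(self, key):
        # Letter-sum hash reduced to the range 0 .. size-1
        return sum(ord(ch) for ch in key) % len(self.buckets)

    def put(self, key, value):
        chain = self.buckets[self._slot(key)]
        for pair in chain:
            if pair[0] == key:      # key already present: update in place
                pair[1] = value
                return
        chain.append([key, value])  # new key: make the chain longer

    def get(self, key, default=None):
        # Walk the chain: "pizza, no that's not beer; ah, there's beer"
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return default

x = ChainedTable()
x.put("pizza", 20)
x.put("beer", 10)
x.put("beer", x.get("beer") + 15)
print(x.get("beer"))   # 25
print(x.buckets[2])    # [['pizza', 20], ['beer', 25]]
```

[Both keys share slot 2 here, so looking up beer means stepping past pizza first. With short chains that walk is only a step or two.]
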
You do the hash function, it tells you
where to go, and there's usually only one
or two elements in any given chain of
things that happen to have the same hash
value.  You know, suppose that this is
small, maybe N is only 10 and you've got
hundreds of different things in the
table, N has to be bigger.  And so what
you can do at that point is you can
actually grow the table.  You can say okay
let's just rehash everything in sight
and make a new table.  I'm going to switch
to a new page here so we could see it.
>> Sean: Fanfold is going to throw you now.
>> BWK: Fanfold has thrown me.
I remember fanfold from my youth.
So here we have this hash table that
went from 0 up to some old value of N
like this and we've got things in it
like my pizza and my beer and so on.  And
suppose that it's gotten very crowded at
this point; if N was small and I've got
lots of things, it's going to get crowded
necessarily.  So what I can simply do is to
say well let me make a new table, let's say
twice as big, four times -- doesn't matter,
something that's a lot bigger, and simply
take every element that I find here,
whatever it might be,
compute its new hash value and just stick
it into this table instead.  And so with a
different value here, might find that
pizza is here but beer is down here;
they're no longer linked together -- quite
possible because this value would
determine where they're put and so if I
changed so that this is let's say a
table of 2 N or something like that,
they might hash differently.  Might
hash the same too.  And so now you can see
that on the average, the chains are going
to be half as big and only do this once
in a while when the table starts to get
full, and maybe I make it four times as
big.  And so at all times the access to
information in this hash table is pretty
much constant time -- you come in, you hash
it, and very short look down a list and
you're done.  Once you've made this new
table of course, then you can throw away
the information from that one.
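
[Editor's sketch, not part of the interview: growing the table by rehashing every element into a bigger one. The helper names `slot_for` and `rehash`, the sizes 4 and 8, and the values for coke and coffee are assumptions for illustration. Each chain is a Python list of (key, value) pairs.]

```python
def slot_for(key, size):
    """Letter-sum hash reduced to the range 0 .. size-1."""
    return sum(ord(ch) for ch in key) % size

def rehash(old_buckets, new_size):
    """Build a new table of new_size slots and re-insert every element."""
    new_buckets = [[] for _ in range(new_size)]
    for chain in old_buckets:
        for key, value in chain:
            new_buckets[slot_for(key, new_size)].append((key, value))
    return new_buckets

# A crowded table of 4 slots: pizza, beer and coke all share slot 2.
old = [[] for _ in range(4)]
for key, value in [("pizza", 20), ("beer", 25), ("coke", 3), ("coffee", 2)]:
    old[slot_for(key, 4)].append((key, value))

new = rehash(old, 8)  # twice as big
print(max(len(chain) for chain in old))  # 3
print(max(len(chain) for chain in new))  # 2
```

[After doubling, coke moves to a slot of its own while pizza and beer still hash the same, which matches the remark that elements might hash differently or might hash the same. Once `new` is built, `old` can be discarded.]
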
>> Sean:  So the modulus, or the
amount to take the modulus by, will be N.
>> BWK: Yes,
right, exactly.  Yeah, I mean I've been
careless because we started at zero and
we went up to N, and so that's really
N plus one elements.  You have to be a
little cautious there, that's a classic
kind of error that programming,
programmers -- other programmers -- might make.
I would never do that.  But that's the
idea.  And then of course the details
of what's a good hashing function and
how to make that really spread arbitrary
information around -- it's still kind of a
you've got to be careful to get it right,
but there's an awful lot of good
guidance and so for the most part it
just works fine.  And this is very
simple to implement; the amount of code
in pick your favorite programming
language is probably a few tens of lines
of code at most.  It's very compact, very
easy to do -- you do it once and then you
say okay, now I understand how that stuff
works.  You don't have to think about it.
You can use somebody else's implementation.
>> Sean: So you're saying some
of this you write yourself but there
will probably be libraries out there.
>> BWK:  Yeah
right if you're in Java there's a class
what is it, hash table or something like
that I think; I've forgotten exactly.  In
Python there's the subscript dictionary
that looks like that.  In Perl there's a
hash -- literally hash and so on.  These
kinds of things, I've got too many of
them in my head, they blur, especially
under pressure, and so I would have to go
and look to be sure but that's the
essence of it.
Self-publication is a recipe for having things disappear
without a trace.  And so the first
edition did sort of disappear without
much trace.  The second edition was
published by Princeton University Press
who also has an arrangement with Oxford
University Press, and so I'm hopeful that
the book gets a lot more publicity.
