DAVID MALAN: We've seen how, with languages like Python, we can implement business logic on a server and even, if we want, generate web pages. And we've seen, on the browser side, the so-called client side, how you can render information or data to users. And with CSS we can style it, and with JavaScript we can even make it interactive.
But when building a business
or any application that
is interacting with users, where
is all that data being stored?
Well, if you're running
your own small business,
you might simply be using a spreadsheet.
Of course, spreadsheets,
whether they're implemented
as Microsoft Excel, or Google
Spreadsheets, or Apple Numbers,
generally store data in rows and columns.
And if you're like me, you
probably use that first row
to represent the names of the
columns and the types of the data
that you're storing.
And as you continue to add more
and more data to that spreadsheet,
you probably, like me, continue
to add row after row after row.
And meanwhile, if you've got so much
data or so many different types of data
that it doesn't really
belong in one sheet,
you might actually create
a second sheet or a tab
along the bottom of the screen,
for instance, in which you
can store additional data as well.
So indeed, Microsoft Excel allows you
to store data relationally, so to speak.
You might have one sheet with customer
info, another sheet with invoices,
another sheet with products, or
any number of other types of data.
And you can relate those to one another by having some commonality among them, some identifier, some name, some customer ID or the like.
And so data that tends to be
stored in rows and columns
can have these relationships.
And you can do this certainly
in the cloud these days
as well with Google Docs, simply
a web-based version of the same.
But when designing your
spreadsheets, or more
generally, your data's
schema, the design thereof,
there's a number of questions
you need to consider.
What data do you want to store?
And where do you want to store it?
And what type of data is it, in fact?
Because whether using Excel, or
Numbers, or Google Spreadsheets,
odds are you sometimes configure the
columns to be displayed or formatted
in some ways.
Numbers might be displayed
with or without commas.
Numbers might be displayed with
dollar signs or some other symbol.
You might separate
one field from another
using special symbols or punctuation.
So these are all configuration
options in a spreadsheet that might
be among the first decisions you make.
In fact, let's consider
a sample spreadsheet
wherein I might want to represent
users in some web application.
All of us have, of course, registered
for websites or applications
these days.
And what are some of the
questions you're asked?
Well, you might be asked for your
user name by some application.
You might be asked for your actual name.
What else might you be asked for?
Perhaps your email address.
And perhaps if you're buying
something, or registering
for something, or someone that
needs to know a bit more about you,
perhaps they'll even ask you for your
address, maybe your phone number,
or perhaps even your age.
So you can certainly imagine there
being even other fields that you need.
But let's consider now what types
of data each of these fields is.
Now, what should a user name be?
Well, by definition, this should be
a unique value that belongs to you
and only you on a system.
For instance, if I'm
registering for a website
and that website is storing some
data in, for instance, a spreadsheet,
I might ideally try to claim
Malan if it's available.
My name meanwhile, of
course, will be just
David Malan, which someone else
can certainly have as well.
Hopefully my email address is only mine.
And that, too, will be stored
in a column of its own.
And then my address, for instance, here
on campus, 33 Oxford Street, Cambridge,
Massachusetts, 02138.
A phone number by which I
can be reached, 617-495-5000.
And then my age shall be--
well, we'll leave that
blank just for now.
So notice how each of these pieces of
data is just a little bit different.
Some of them seem to be numeric.
Some of them seem to be alphabetic.
Some of them even have
some punctuation therein.
Now, you might or might
not in a spreadsheet
care to display this
data a bit differently.
Indeed, the only number that
I'm sweeping under the rug
here is, in fact, my age.
But hopefully I don't need one
or more commas in that value.
So there isn't really any need
for special formatting here.
But it turns out that
when you're actually
building a website or
software-based application,
you're probably not storing your
data ultimately in just spreadsheets.
You might graduate eventually from being
a small shop to needing more than just
Microsoft Excel, or
Numbers, or even something
cloud-based like Google Spreadsheets.
Why?
Well, you have more rows than those
programs can generally handle.
In fact, on your own
Mac or PC, odds are,
when opening up big files, whether
a spreadsheet or any other document,
sometimes you might actually feel that.
The computer might
start to slow, and you
might start to see a spinning
beach ball or hourglass
because you start to
hit the limits of what
a local computer can do with
just client-side software,
like a spreadsheet tool.
And so eventually you might actually
need to use a proper database.
And a database is really
just a piece of software
that can absolutely
run on your Mac or PC.
But very commonly it runs on
a server or somewhere else
in the cloud to which your
own software connects.
For instance, if you're building
a website or an application
in a language like Python,
you can, in Python,
write code that talks or communicates
with that database to pull data down
and to send data back up.
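That back-and-forth can be sketched in just a few lines of Python using its built-in sqlite3 module; the file name and users table here are hypothetical, just for illustration:

```python
import sqlite3

# Connect to (or create) a local SQLite database file.
db = sqlite3.connect("store.db")

# Send data up: create a hypothetical users table and insert a row.
db.execute("CREATE TABLE IF NOT EXISTS users (username TEXT, name TEXT)")
db.execute("INSERT INTO users (username, name) VALUES (?, ?)",
           ("malan", "David Malan"))
db.commit()

# Pull data down: ask the database for a matching row.
row = db.execute("SELECT name FROM users WHERE username = ?",
                 ("malan",)).fetchone()
print(row[0])  # David Malan

db.close()
```

The same pattern applies, with different driver libraries, when the database runs on a server in the cloud rather than in a local file.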
But spreadsheets are
wonderfully straightforward.
It's just so intuitive how you
store the data in rows and columns
and just more and more rows as
you have more and more data.
And so what's nice about some databases
is that they mimic exactly this design.
There exists in the world what
are called relational databases.
And indeed, this is among the most
common ways to store data relationally.
The jargon is a bit different
in the world of databases.
You no longer call these
things spreadsheets.
We call them databases.
And we don't call the
individual tabs sheets.
We call them tables.
But we continue to call the
structure of the data therein
rows for each additional
piece of data and columns
for the different types
of data that we have.
But in a relational database, such as
Oracle, Microsoft Access, SQL Server,
MySQL, Postgres, or something smaller and lighter-weight called SQLite,
the burden is on you, the
designer of the database,
or the programmer, or
the business person
to actually decide on what types
of data you are going to store.
Because the database, in order
to handle more and more data
than a typical spreadsheet can
support, needs a bit of help from you,
needs to know what type of data you're
storing so that it can search it
more efficiently and
sort it more effectively
and make it easier for you
ultimately to add and remove data
from that particular database.
So in other words, you can do
all of the same operations.
But generally, in a database,
you have so much more data,
you're going to need to have the
database help you help yourself.
So what does that actually mean here?
Well, let's consider some
of these fields here.
Were we to migrate my spreadsheet
with many, many, many rows of users,
for instance, to a proper
database relationally,
I'm going to have to make a
few more decisions as well.
And it turns out that I need to choose
various types for these columns.
And generally, in a database, you
only have a limited menu of options.
So let's take a look
at what those might be.
Here, in many relational databases, are just some of the data types that are available to you: an integer, if you want to represent something like 1, 2, 3, or perhaps even a negative; a real number, otherwise known as a floating-point value, for instance in Python, that actually has a decimal point and perhaps some digits thereafter; numeric, which is more of a catch-all and might handle things like dates and times, which are numbers in some sense but have some more formal structure to them; and then the more general text, when you just have words or phrases, characters or whole paragraphs or more, that you might want to store as well in some column.
And then lastly, you might
have the cutely named BLOB,
or binary large object, when you
actually want to store zeros and ones,
that is to say binary data, like
actual files in your database.
However, you needn't do this typically.
You can actually store files, of course,
on a file system on your own hard drive
somewhere on the server, but
that option exists for you.
Now, this happens to be the
list of data types supported
by the simplest of relational
databases, something called SQLite.
And as its name implies, it actually
supports a very specific language
via which you or anyone can request
data from the database and store
data in the database, or
update it, or delete it.
And that language is called
SQL, structured query language.
SQL is a language via
which you can do exactly
that, retrieve data from a database, put
data into a database, and so much more.
It's an alternative to the pointing and clicking with which you're probably familiar from something like Microsoft Excel, or Google Spreadsheets, or Apple Numbers, where it's really you, the human, doing all the work via a graphical user interface.
Now, to be fair, in all of these spreadsheet programs there are functions or macros that you might be able to write so that you can program in those environments, but with some limitations. And indeed with SQL, and with SQL databases more generally, that is, databases that understand this language, you can achieve far higher performance and store far more data, and nonetheless get at it quickly.
So it's sort of the next evolution
of what you might otherwise do only
within your own computer spreadsheet.
But many other databases, Oracle, SQL Server, Microsoft Access, MySQL, Postgres, and more, support more data types than these, allowing you to help the database help you even more. Because the more the database knows about your data, the smarter decisions it can make, at a lower level below these abstractions, in storing that data so that, when you have a question, it can answer it quickly.
And so how might you refine
what we mean by integer?
Well, in some databases,
you might want to tell it
that this is just a small
integer in this column one
after the other, row by row.
Or maybe it's just an integer
somewhere a little bigger
than that, or perhaps a big int,
which means it can be even larger.
Now, these actually map to
very well-defined values.
If you think back to how we considered
data itself is stored in a computer,
it's ultimately with
zeros and ones or bits.
And indeed, an integer, so
to speak, generally takes up
just 32 bits or four bytes.
But, of course, those few bits actually translate, if you do a bit of the math, to four billion possible values.
So if you were to specify
that a column in your database
is of type integer, that means you
could type in the number from zero
all the way on up to four billion,
or if you want negative numbers,
from negative two billion roughly
to positive two billion as well.
But beyond four billion or so, you're out of luck if you're only using 32 bits.
Now, that might seem huge.
And that is a good problem to have if
I have four billion users, or sales,
or anything in my database.
But for certain large companies these days, having four billion or more records of some sort, transactions or logs, is not at all uncommon.
And for those purposes there exists big int,
a bigger data type that uses 64
bits or eight bytes via which
you can count much, much,
much higher, so high
that I'm not even sure how
to pronounce that number.
But indeed, it should be large enough
for almost all of our purposes.
So ultimately, it's not enough to say that a field in a database is an integer; you must also specify just how big, maximally, that integer can be.
Now, fortunately, I think for my age we
could probably get away with small int.
But indeed, with a small int, you typically have 16 bits. So even then your user could be as old as 65,535 years old.
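The arithmetic behind those ceilings is easy to check for yourself; a quick sketch of the standard 16-, 32-, and 64-bit ranges:

```python
# How many distinct values fit in n bits? Exactly 2**n.
smallint_max = 2**16 - 1    # 65,535: the unsigned 16-bit ceiling
integer_values = 2**32      # 4,294,967,296: "four billion or so" values
signed_int_min = -(2**31)   # -2,147,483,648: "negative two billion roughly"
signed_int_max = 2**31 - 1  #  2,147,483,647: "positive two billion"
bigint_max = 2**63 - 1      # 9,223,372,036,854,775,807: hard to pronounce indeed

print(f"{bigint_max:,}")  # 9,223,372,036,854,775,807
```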
Now, what about real numbers?
These are floating point
values with decimal points.
And here, too, you have fewer choices,
typically, but along the same lines.
A real number would
typically take 32 bits,
which allows you only some amount
of precision, some number of digits
after that decimal point.
If you want to go even further and be ever more precise, you can use twice as much space, 64 bits or eight bytes, and use double precision instead.
Well, it would seem that
we sort of start with an OK
value and an even better value here.
But what's the trade-off,
as there always is?
Well, it's fine if you want to be able to count higher, whether with integers or real numbers.
But you are literally
going to pay a price.
Because if you want to count
higher than four billion
in the context of integers,
you're going to have
to spend twice as much space
from four bytes to eight
in order to store those values.
And if you're never actually going to
use values larger than four billion,
you're just spending twice
as much space as you need to.
And if you don't really
need to store floating point
values to terribly many digits of
precision, you're just wasting space.
Now, for small databases,
this might be inconsequential.
But for the Googles, and Microsofts, and others of the world that deal in gigabytes and terabytes of data, all of these bits and bytes will surely add up.
As for numeric, more of a catch-all numerically, so to speak: here you have Booleans, zeros or ones, or false and true values, and dates, which come in a prescribed format, typically YYYY-MM-DD. Indeed, even though across the world there are different formats in different communities for representing dates, in SQL you specify them exactly in that format only.
Date time adds to that a space followed by HH:MM:SS, whereby if you want to store both a date and a time, you should store them in exactly that format in your database. Even though, thanks to code, whether Python or SQL itself, you might want to display that date in a geographically localized and conventional way, you nonetheless store it in this one way in your database.
Now, sometimes you need to fight
against the imprecision that's
inherent as a risk with real numbers.
And so you can use the so-called numeric data type to specify a so-called precision and scale, that is, two numbers that specify how many digits in total, and how many after the decimal point, you ultimately want to support.
So if you want to store things like dollar amounts, maybe to cents or hundredths of a cent, two or four decimal places, you can specify exactly that, and you will get exactly that much precision.
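Python's built-in decimal module illustrates the same idea; a sketch of why exact decimal precision matters for dollar amounts (the price and tax rate here are made up):

```python
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly.
print(0.1 + 0.2)  # 0.30000000000000004, not 0.3

# A fixed-precision decimal type, akin to SQL's NUMERIC, is exact.
price = Decimal("19.99")
tax = price * Decimal("0.0625")  # hypothetical 6.25% tax rate
total = (price + tax).quantize(Decimal("0.01"))  # keep exactly two places
print(total)  # 21.24
```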
As for time and timestamp, well, time is quite simply HH:MM:SS.
But timestamp's a
little more interesting.
It, too, can be used to represent time.
But as with many languages,
Python and JavaScript among them,
it does so typically by
remembering or counting
some number of milliseconds or seconds
from a particular point in time.
In fact, some years ago, it became conventional to simply start counting time from January 1, 1970, the so-called epoch, as opposed to, say, 0 BC or AD.
Unfortunately, it turns out that you generally represent these timestamps with four bytes or 32 bits which, once one bit is spent on the sign, gives you access to only two billion or so seconds past that starting point.
And it turns out that
in the year 2038, we
will have run out of bits
with which to represent time.
So if you recall the so-called Y2K problem, when we ran into a similar issue around the year 2000: we have rather painted ourselves into a corner in many systems by not allocating enough space to get us past that year.
Well, what's the solution
invariably going to be?
Well, use more space and, thus, cost.
But that is inherent in many systems.
And it's one of these things that we humans will have to address.
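You can compute that deadline yourself: a signed 32-bit count of seconds since the 1970 epoch overflows at 2**31 - 1 seconds, which lands in January 2038:

```python
from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# A signed 32-bit integer tops out at 2**31 - 1 seconds past the epoch.
rollover = epoch + timedelta(seconds=2**31 - 1)
print(rollover)  # 2038-01-19 03:14:07+00:00
```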
How about text? Well, text, too, can come in multiple forms. Whether it's English or any other language, you might have different lengths thereof. And so within the world of text values in databases, you typically need to specify how many characters you want to store in that column.
And this is distinct from something
like a spreadsheet program
where you can probably keep typing and
typing or pasting as many characters
as you'd like.
A database, in so far as its purpose in
life is to give you better performance
and more scalability than
a spreadsheet allows,
wants you to tell it a bit more.
It's going to ask you
to decide whether you
want to store a fixed number of chars.
So char, followed by a number in parentheses, means exactly how many characters you shall store in every cell of that column.
Now, why might that be applicable?
Well, in the US, for
instance, for states,
we have a standardization
of two characters
for every state in the country.
And so you might say char(2), thereby allowing you to store CT for Connecticut, and FL for Florida, and MA for Massachusetts, and so on, because you know that every value is going to be that fixed length, two.
But sometimes you don't
know the max length.
And indeed, in my own table of users,
my own name might have D-A-V-I-D
and a space and M-A-L-A-N, so 11 for me.
But your name might
be shorter or longer.
Or some user that hasn't even
registered for our website
yet might have an even
longer name than that.
So varchar, or variable
number of chars, exists, too,
where you can specify not a
fixed number of characters
to use but a maximum upper
bound, perhaps something like 30,
perhaps something like 300, or some number of characters that you decide is going to be the upper limit for any human whose name might be as long as that. But varchar, and the databases that implement it, are smart, because a value will use at most that many characters.
But it's not going to waste space. In fact, even if you declare varchar(30) for every name in your table, it's only going to store as many characters as are necessary, perhaps plus a bit of overhead or accounting to keep track of that value's length. But that way you save space yet still can handle large text or large strings.
Lastly, if you have particularly large
text, whole paragraphs, or essays,
or documents that
someone might paste in,
or the contents of an entire web
page that you want to search,
well, you can have a text
field more generally,
which tends to support tens of
thousands or more characters in total.
But it does so in a way
that's not quite as efficient.
Rather than keep it right in the column itself, so to speak, it puts it somewhere else on the server, the result of which is that you have more room in the table itself, so to speak. But it takes more time and effort to go search or find that data, so, again, a trade-off of space and time.
But here, too, as with
integers and real,
you seem to have multiple choices, one
of which at first glance is better.
After all, why use char ever if
you could also just say varchar
with that same amount?
For instance, for those two-character codes for states, why not just say varchar(2) instead of char(2)?
Well, it turns out that if you promise the database that every cell in your column will use exactly the same number of bytes, you'll have very straight or justified edges, in some sense, conceptually, for that column.
For instance, if every
cell in a column takes up
exactly the same amount of
space, then you might know,
if you start numbering the bytes
that are represented by those cells,
this might be byte zero up here
and then two characters away
will be address two, and then
four, and then six, and then eight.
And if you think back to how algorithms like, say, binary search are implemented: as soon as you have the ability to know arithmetically where the next value, or the last value, or, best yet, the middle value is, you can jump right to it in so-called constant time, thereby enabling something like binary search and, ultimately, logarithmic running time.
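A sketch of that arithmetic, assuming a hypothetical char(2) column packed end to end in one block of bytes: the i-th value always begins at byte i * 2, so binary search can jump straight to the middle in constant time:

```python
# Hypothetical char(2) column stored as one contiguous block of bytes.
states = b"CACTFLMANY"  # CA, CT, FL, MA, NY: each exactly 2 bytes, sorted
WIDTH = 2

def nth(i):
    # Constant-time jump: no scanning, just arithmetic on byte offsets.
    return states[i * WIDTH:(i + 1) * WIDTH].decode()

def binary_search(target):
    lo, hi = 0, len(states) // WIDTH - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        value = nth(mid)  # possible only because every value is 2 bytes
        if value == target:
            return mid
        elif value < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search("FL"))  # 2
```

With varchar's ragged edges, by contrast, there is no such formula for the i-th value's offset, so the database must scan linearly instead.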
So this is a powerful thing.
If you commit to the database that all of your values will take up the exact same amount of space, the database can search that column faster for you.
Varchar, unfortunately, is rather akin
to having only a left justified column,
whereby you might have data that
looks very straight on one edge
with all of the characters left aligned.
But because the lengths of those values vary, this cell might be this wide, and this one this wide, and this one this wide, and this one this wide.
And as such, because you have a
very ragged edge, so to speak,
you can't just do simple
math and add two and add two
to get to the next value in that column.
So in the worst case, if a
column is only a varchar,
the best the database can do by
default is just linear search,
or big O of n, which, recall,
was notably slower than something
like log N or logarithmic time.
So the more help you can provide to
the database the better off you are.
But of course you don't want to err too far in the other direction and say, oh, well, I'm going to have char(300) for every cell in my Names column, because then you're using 300 characters by definition for just David Malan, who needs just, say, 11.
So lastly there is this thing BLOB, binary large object, which can store binary data, which can be used for images or files. But those, again, tend to be best stored somewhere else on the system.
So with these possibilities
in mind, how might we
go about designing the
schema for this data?
We already have the names of my columns.
And I've already gone ahead
and put in one sample row.
So let's consider in the context
of all of those SQL types
which ones apply here.
Well, for user name,
something like Malan,
it's probably not numeric
because that's not
something I'd be so inclined to type.
But it probably is text.
But within text we have the
choice of char or varchar
or larger blocks of
text, more generally.
So what should this be?
Well, this one depends.
A user name tends to be fairly short.
Because after all, we humans probably
don't want to type very long strings
just to log into some site.
Historically, maximum lengths of eight characters were common. But that tends to be fairly constrained, and you can no longer express very much of your name if you happen to have a long name yourself.
So here we might say something like
char, because we know it will be small.
And we'd like to be able to
search on this field efficiently,
especially to log someone in
fast, but probably not quite as
short as eight, so maybe 16 or 20.
Or if it's hard to guess there
and you want more dynamism,
maybe you would say varchar
of something like 255.
Why that value?
Well, recall that with eight bits you can count 256 possible values. But if you start counting at zero, you can only go as high as 255. And historically many databases have had limits of, say, 255 characters for a varchar, though that's no longer the case. But you'll still see this very commonly.
So what's the best answer?
I don't know.
It depends on your
data or someone else's.
For here, I'll go ahead and
leave it as varchar (255)
just to give myself flexibility.
But a very good argument could be
made for char of some smaller value
or even varchar of some
smaller value, too.
As for name as well, I'm not really
sure what this one should be.
I don't know how many
characters your own name has.
Something like 11 is
probably too few, even
though it satisfies my name just fine.
30 feels a bit tight. And frankly, I bet if we Google longest name in world, something tells me there's someone out there with quite a few characters, hopefully no more than, say, 255.
But there, too, we might want
to do some due diligence.
With email, too, this seems to be easy.
This, too, is just characters,
even though you could certainly
have numbers.
I don't know how long the
maximum email address will be.
But frankly, it's probably
going to be variable in length,
so I'm going to go with a default.
Why 255?
Again, it's probably
way more than we need.
But varchar is smart.
It's not going to waste that many bytes.
It's just a reasonable upper bound
in the absence of better intuition.
At least we're being consistent.
Now, address, that might be decently long, especially if it's going on an envelope on, say, multiple lines.
But here, too, this is probably
just text, and so I'll go ahead here
and say this, too, 255.
Phone number.
Finally, a number by name.
Well, what should a phone number field be?
Well, we had integer, or real, or, more
specifically, small int or integer,
or big int.
But the funny thing here is even
though a phone number is indeed
called by us humans
a number, it's really
just a symbolic representation of a
unique identifier for someone's phone.
And so these hyphens might be
common in one community or culture.
Maybe you'd have parentheses or
even pluses for country codes.
So frankly, very quickly even a phone number becomes not so much a number as a textual string.
So here I have some discretion.
And maybe I could be
presumptuous and assume
I'm only going to have
customers or users for now,
say, in the US for whatever
business constraints.
And so I might say, you know what?
This is a candidate to actually do something like char(10): a three-digit area code, a three-digit exchange, and then four digits thereafter.
But that doesn't leave room for those hyphens, so I could make it char(12).
Or frankly, if they're just
going to be there all the time,
why don't I leave them
as char 10 and just
get rid of those values in my data?
Or alternatively, I could support
parentheses or pluses as well.
It really depends on how
I want to store the data.
But I like the idea of a textual type,
something like char or maybe varchar
as opposed to an integer.
Because in certain communities, or when calling out from some businesses, sometimes you have to dial unusual numbers. In other countries, for instance, if we generalize beyond this data set, you might dial a leading zero to actually connect to someone local.
And the problem with zero is
that, mathematically, it's
meaningless to start a value with zero.
And unfortunately, a
computer takes that to heart.
And if you were to store
phone as an integer
but a number were in some
community to start with a zero,
your database would
probably just get rid of it.
In fact, try that yourself in
Microsoft Excel, or Apple Numbers,
or Google Spreadsheets.
And depending on how you formatted
the column, you can type as many zeros
as you want followed by other digits.
And odds are, when you hit
Enter, those zeros are gone.
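The leading-zero problem is easy to demonstrate; once a phone number passes through an integer type, the zero is gone for good (the number below is made up):

```python
# A hypothetical local number that starts with a dialing prefix of 0.
phone_as_text = "0176123456"

# Storing it as an integer silently drops the mathematically meaningless zero...
phone_as_int = int(phone_as_text)
print(phone_as_int)  # 176123456

# ...so converting back yields a different, shorter string.
print(str(phone_as_int) == phone_as_text)  # False
```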
As for age, here we perhaps finally have
a compelling candidate for a number.
Small int should probably
get the job done.
Integer would work as well, or big
int, but increasingly wasteful.
But you know what?
Even here it's not that obvious.
I probably shouldn't even use
an integer type here at all.
Why?
Well, I might indeed ask a human
for his or her age upon registering.
The catch is that age might
change the next day or the day
after, or the day after, because,
of course, time is advancing.
And unless I also stored
the date and ideally
time at which the user
registered, I can't really
even do any math to figure out,
oh, you registered a year ago.
Let me assume your age is that plus one.
So what would have been better
from the get-go than age?
Probably just something
like date of birth, DOB.
And, of course, in SQL we have the ability to store dates, even date times.
So here we probably
have a better candidate
for exactly one of those
numeric types, so to speak.
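With a date of birth stored instead, age is always derivable from today's date; a minimal sketch, with hypothetical dates:

```python
from datetime import date

def age(dob, today):
    # Subtract one if this year's birthday hasn't happened yet.
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

print(age(date(1990, 6, 15), date(2024, 6, 14)))  # 33: day before birthday
print(age(date(1990, 6, 15), date(2024, 6, 15)))  # 34: on the birthday
```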
Now, we've only just begun to scratch
the surface of available data types
and data we might want to store.
In fact, our spreadsheet or database
could contain even more types.
But now let's just suppose that
we're happy with our types.
And the goal now at hand is
to start searching our data
and storing that data
in an actual database.
I've been using here Google
Spreadsheets just to lay out my schema
sort of as a worksheet, if you will.
But now let's graduate
to actual SQL syntax
and suppose that this
same data is stored
not in Google Spreadsheets or
any other but in some database
elsewhere, a piece of software running
somewhere else on my own computer
or somewhere in the cloud that knows
how to organize this data, still
in rows and columns but in such a way
that I can use this new language, SQL, pronounced S-Q-L or "sequel," in order to access my data.
So what are the
fundamental operations that
SQL supports, or a relational
database, more generally?
It turns out that throughout computer science there is a pattern of operations that many different systems support.
In the world of databases,
you have fairly crassly
what's called CRUD, the ability
to create data, read data, update,
and delete.
But more specifically, in the context of SQL, this new language of ours, you have the ability to CREATE or INSERT data, a.k.a. create; to SELECT data, a.k.a. read; to UPDATE data, a.k.a. update; and to DELETE or DROP data, a.k.a. delete.
So whereas in SQL you have
these very specific terms,
they are just representative
of a class of operations
that you might see
throughout computer science.
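Those operations map onto SQL statements you can try directly; a sketch using Python's sqlite3 module with an in-memory database and a hypothetical users table:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# CREATE: define the table's schema, types included.
db.execute("CREATE TABLE users (username VARCHAR(255), name VARCHAR(255))")

# INSERT: put a row in.
db.execute("INSERT INTO users (username, name) VALUES ('malan', 'David Malan')")

# SELECT: read it back.
print(db.execute("SELECT name FROM users").fetchall())  # [('David Malan',)]

# UPDATE: change it.
db.execute("UPDATE users SET name = 'D. Malan' WHERE username = 'malan'")

# DELETE: remove it.
db.execute("DELETE FROM users WHERE username = 'malan'")
print(db.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 0
```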
So how do we go about using SQL?
Well, short of buying and
installing something like Oracle
or downloading something
free like MySQL,
we can simply use something
that's indeed, by definition,
lighter weight called SQLite.
SQLite is an implementation of SQL, this database language, that you can install on your own Mac or PC; it uses not special software or servers but rather just a file on your own Mac or PC that stores all of those tables and, in turn, rows and columns.
It thereafter creates
an abstraction of sorts
as though you have a
full-fledged server running
on your own machine with which
you can communicate using SQL.
Now, any number of programs can be used
to actually talk to a SQLite database.
You might use a purely textual
interface, so-called command line
interface, using only your keyboard.
Or you might use something graphical,
a GUI, a graphical user interface.
In fact, one of the
easiest programs to use
is this one here, DB Browser,
freely available for Macs,
for PCs running Windows, or Linux,
or any number of operating systems
as well.
In advance, I've gone ahead and opened up a database in a file whose name ends in .sqlite or, also commonly, .db.
This then is, again,
just a file, a binary
file filled with zeros
and ones that collectively
represent patterns of rows and columns
and the tables that contain them.
And if I open this program, I see
all of the tables or, if you will,
sheets from our old spreadsheet world.
This happens to be a
database all about music.
And indeed, I have a table about
musical albums and artists, customers
and employees, and genres, and
invoices, invoice lines that
represent actual songs bought,
the types of media involved,
and playlists and playlist tracks via which users can actually
customize those songs, and then lastly,
the tracks or the songs themselves.
In other words, someone else on the internet has gone to the trouble of aggregating all of this data about music and its artists and organizing it into an SQL database.
Now, how?
Well, if you look to
the right here, you'll
see a number of esoteric commands
all involving CREATE TABLE.
And indeed, CREATE is one of
the four fundamental operations
that SQL supports.
And without going into the weeds
of how these tables were created,
they were created in such a way that the
author had to decide on every table's
columns' names as well as their types.
And that part is interesting.
For instance, let me go into the Albums
table here and expand this one now.
Here I have Album ID, and Title,
and, curiously, Artist ID.
But you'll notice in the Album table
we have no mention of artist's name.
In fact, if we follow
the white rabbit here,
we see in the Artist table
that, oh, in Artist do
we have an artist ID and name.
And indeed, if we keep digging, we'll
see that in each of these tables
there's not quite as much
information as I might like.
In fact, IDs, whatever those
are, seem to be quite prevalent.
And that is because, in
the world of databases,
and perhaps even some of
your own spreadsheets,
it's conventional and daresay best
practice to normalize your data.
You indeed should try to avoid having
redundancies within tables and even
across tables.
Now, what does that mean?
Well, consider our own
spreadsheet of users
that we're trying to
get into a database.
I had a column there called Address.
And I proposed that my own address
was 33 Oxford Street in Cambridge,
Massachusetts 02138.
Now, it turns out that there's quite
a few other people at my workplace,
and so they might have that
exact same address as well.
And in fact, that address
doesn't really lend itself
to very nice printing on
envelopes because all I did
was separate things with a
comma, not actually hit Enter.
And in fact, because everything
was all in one column,
it would seem a little difficult
to search by, say, zip code.
Find for me all of the users from
Cambridge, Massachusetts in 02138.
Well, you could do it sort
of searching free form
all of the values in some column.
But it wouldn't be particularly
clean or efficient because you'd
have to look at and then ignore all
of those other words and numbers, 33
Oxford Street, Cambridge mass,
if you only care about that zip.
So before we forge
ahead, let's see if we
can't infer why this database
is seemingly, at first glance,
more complicated than you might like.
Well, if I go back into my
spreadsheet here, what really should
I have done with Address?
I probably should have
split that value up
into street and then city,
then state and zip code,
not having just one column but several.
In fact, here, what I've gone ahead
and done is separate out Address
into Street and City and State and Zip.
And for each of those have I
specified a very precise type.
I've gone ahead and proposed that Street
is still varchar(255), as is City,
because I don't really know an upper
bound, so we'll at least be consistent.
For state, I've hopefully been smart in
at least assuming users are in the US.
I've said char(2) for just
that two-character code.
And for Zip, too, I'm
preemptively trying
to avoid a mistake with even Cambridge
whose zip codes start with a zero.
Were I to specify instead
that that's just an integer,
I might actually lose,
mathematically, that first digit.
But by storing it as a char with five
characters and no hyphen or four others
can I ensure that that
02138 remains exactly that.
But here, too, we have
a bit of inefficiency.
Imagine if more and more
users from my own building
register for this
particular application.
You might have again and
again these exact same values
if all of my colleagues
are in the same building.
And here, as with
programming, as soon as you
start to see a lot of redundancy,
the same data again and again,
there's surely an opportunity
for better design.
The data is correct, to be fair.
My colleagues might also
live here if we added
their names and their distinct emails.
But they don't necessarily need
to have the same exact text stored
again and again.
So what might you do in this case
even in the world of spreadsheets?
Well, on the sheet, I might just rename
this actually more explicitly to users.
And you know what?
Let me go ahead and create another
sheet in my spreadsheet world
and call this, say, Cities.
And my cities might actually have a
city name and perhaps a state and a zip.
But in this leftmost
column, I could be smart
and start to assign each of these
cities some unique ID or identifier.
And so here might I have just a unique
identifier, typically an integer.
City might again be varchar(255).
And State might, again, be char(2).
Zip Code, meanwhile,
can stay as char(5).
But now what I can do is presume this:
that if Cambridge, Massachusetts,
02138 is in this sheet or, if you will,
table, let's arbitrarily but reasonably
give it a unique identifier of one.
And if I were to add more cities here,
like Allston, Massachusetts and its zip
code, I could assign it
a unique ID of two.
Because now that I
have this data, here's
where you get that relational aspect.
I can relate this sheet or
table to my other as follows.
Back in this table now can I go ahead
and delete much of this redundancy
here and actually go ahead and say city
is not really a city but it's a city
ID.
And here now it can be a proper
integer, because the city
in which all of my colleagues live
is the exact same one as mine.
Now, here there's still an opportunity
for improvement, to be fair.
Because if all of us are
in the same building,
maybe that should be
factored out as well.
And if I really wanted to
go down this rabbit hole,
I could add another
sheet or table called,
say, Buildings and factor out also that
33 Oxford Street, give it a unique ID,
and only store numbers.
So in short, the more
redundancy and the more waste
that you end up having in your
data, the more opportunities there
are to, so to speak, normalize it.
to factor out those
commonalities and create
relations between some
pieces of data and others.
And the convention in computing
is to use, quite simply, numbers.
Why?
Well, it turns out it's a lot more
efficient to just relate some data
to others by relying on small integers.
Four bytes is not all that much.
And in fact, inside
of a computer CPU are
small registers, tiny
pieces of memory that
can be used on the most
basic of operations:
additions, subtractions, and
comparisons for equality.
And so with small values
like integers can you
very quickly reassemble or
relate some data to others.
And so here we have a general
principle of database design
to normalize it by factoring things out.
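As a sketch of that normalization in code, here's what the users-and-cities design might look like using Python's built-in sqlite3 module; the table and column names are illustrative stand-ins for the spreadsheet just described:

```python
import sqlite3

# Hypothetical schema: the address columns factored out into a cities table,
# related back to users by a small integer city_id.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cities (id INTEGER PRIMARY KEY, city TEXT, state TEXT, zip TEXT)")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, street TEXT, city_id INTEGER)")

# Cambridge, MA 02138 is stored exactly once...
db.execute("INSERT INTO cities VALUES (1, 'Cambridge', 'MA', '02138')")

# ...while any number of users can reference it by that one small integer.
db.execute("INSERT INTO users VALUES (1, 'David', '33 Oxford Street', 1)")
db.execute("INSERT INTO users VALUES (2, 'Colleague', '33 Oxford Street', 1)")

# Joining on the ID reassembles each user's full address on demand.
rows = db.execute(
    "SELECT users.name, users.street, cities.city, cities.state, cities.zip "
    "FROM users JOIN cities ON users.city_id = cities.id"
).fetchall()
print(rows)
```

Because both users share city_id 1, the strings Cambridge, MA, and 02138 live in just one row, yet the join reconstitutes everything whenever it's needed.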
And so if we go back into
our musical database,
you can perhaps infer why the author
of this data did that preemptively.
They have their albums having album
IDs, a number like 1, 2, and 3, a title,
which is the actual title of that album.
And then to associate
those albums with artists,
they've used not the
artist's name but an ID.
In this way can an album have
the same artist as another
without storing that
artist's name twice.
Moreover, if the artist happens
to change his or her name,
as is not uncommon in
the musical world, you
can change that name in
just one place and not
have to scour your tables
for multiple copies.
And so if we continue this logic,
we'll see in more and more tables
that we have this principle of naming
the data but then assigning it an ID.
And if you want to relate
some of that data to another,
you simply store the ID,
not the actual values.
Of course, this is decreasingly useful
as we go, because now some of my data
is in this table, and that,
and this other table, and here.
And so while very academically clean
and refined, it doesn't anymore
seem useful to the data scientist
in front of his or her computer
that just wants to answer
questions about data.
And yet it's all over the place,
whereas before, with Excel,
you could just pull up a window.
But that's where SQL itself comes in.
SQL does not just prescribe
how to type your data
but rather how to query it as well.
And in this particular
program here, DB Browser,
I can actually go over to
this tab here for Execute SQL.
And I can begin to execute
those actual commands, SELECT,
and CREATE, and UPDATE, DELETE, and
others and actually see the results.
What SQL allows you to do is
express yourself programmatically
using a fairly small
language, albeit new, that
allows you to create temporary tables,
just the results, just the result
sets, so to speak, that you want,
only those rows that you care about.
So for instance, if I want to see
all of the albums in this database,
well, in the world of spreadsheets, I
would just double click in and peruse.
But in the world of SQL, I'm
actually going to type a command.
I'm going to go ahead here and say
SELECT star, for give me everything,
from the table called Album, and
then semicolon to finish my thought.
I'm going to go ahead and click
the graphical Play button here
to execute this command.
And you'll see suddenly that
here are all of the albums
apparently in this table, 347 of them
in total in this particular database.
And notice that all of the
titles are in one column,
the Album ID is to the left, and the
Artist ID, of course, to the right.
Well, if you're now curious who is
the artist behind the album called
For Those About to Rock,
We Salute You, well, I
can just make a mental note
that the artist ID is one.
And you know what?
With SQL, it turns out
you can use predicates.
You can qualify the data you want.
You don't have to say, give me all.
You can say, give me this.
So how do I do that?
Well, I can actually say SELECT star
from Artist, the other table, where--
and here comes my predicate-- artist ID
equals one, a semicolon again to finish
that thought, hit Play, and voila.
It turns out it's AC/DC, the artist
behind that particular album.
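Those two queries can be reproduced end to end with Python's built-in sqlite3 module; the tiny in-memory tables below are merely a stand-in for the actual music database's Album and Artist tables:

```python
import sqlite3

# A tiny stand-in for the music database, just enough rows to query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
db.execute("CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)")
db.execute("INSERT INTO Artist VALUES (1, 'AC/DC')")
db.execute("INSERT INTO Album VALUES (1, 'For Those About To Rock We Salute You', 1)")

# SELECT * gives back every row and every column of a table.
albums = db.execute("SELECT * FROM Album;").fetchall()

# A WHERE predicate narrows the result set to only the rows we care about.
artist = db.execute("SELECT * FROM Artist WHERE ArtistId = 1;").fetchone()
print(albums, artist)
```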
Of course, this felt a bit manual.
And this seems no better
than a spreadsheet wherein
now more of the work seems to be on me.
But SQL's more expressive than this.
Not only can you create, and
select, and update, and delete data,
you can also join data
from one table and another.
So in fact, let me go ahead and
do this a little more explicitly.
I want to go ahead and select everything
from that original album table,
but I'd like to first join
it with the artist table.
How?
Well, I want to join those two
tables, this one and this one,
kind of like this,
conceptually, so to speak.
On what fields?
Well, in Album, I recall
there's an artist ID.
I want that to be equal
to artist.artistid.
In other words, if you imagine this hand
to be albums and this hand to be artist
and the tips of my fingers each
represent those artist IDs,
we essentially want the
SQL to line up my fingers
so that I have on the left my albums
and I have on the right the artist.
And every one of my rows now has both.
Let me finish my thought with
a semicolon here and hit Play.
And voila, now we see even more
information but all together.
We see that album number one, For
Those About to Rock, We Salute You,
has an artist ID of one and, beside it,
that same artist ID of one but now a name as well.
Well, what's happened?
Well, both of these tables have kind of
been concatenated together but joined
intelligently such that the
artist IDs in both tables
line up on the left and the right.
Of course, at this point, I really
don't care about all these numbers.
And I definitely don't need the
temporary duplication of data,
so I don't have to
just keep saying star,
which is the so-called wild card,
which means give me everything.
I can actually instead just
say, give me the title of albums
and the names of the artists
by specifying with commas
the names of the columns
that I actually want.
And if I now click Play, I
get much simpler results,
just the titles of albums and
just the names of those artists.
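That JOIN, and the subsequent narrowing to just two columns, can be sketched as follows with Python's sqlite3 module, again on a miniature stand-in for the real tables:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
db.execute("CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)")
db.executemany("INSERT INTO Artist VALUES (?, ?)", [(1, 'AC/DC'), (2, 'Accept')])
db.executemany("INSERT INTO Album VALUES (?, ?, ?)", [
    (1, 'For Those About To Rock We Salute You', 1),
    (2, 'Balls to the Wall', 2),
])

# JOIN lines up the ArtistId "fingers" of both tables; listing Title and Name
# (instead of *) keeps only the two columns we actually want.
rows = db.execute(
    "SELECT Title, Name FROM Album "
    "JOIN Artist ON Album.ArtistId = Artist.ArtistId;"
).fetchall()
print(rows)
```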
Now, how else can we
express ourselves with SQL?
Well, there are other keywords
besides WHERE and besides JOIN.
You can also group values by something
and specify that they must all
have something as well.
For instance, let me go back
to my data here and consider
which artists have multiple albums.
Well, if we consider the
results that we had earlier do
we have AC/DC as the artist behind For
Those About to Rock, We Salute You,
but also behind Let There Be Rock.
Moreover, this band Accept
has multiple albums as well.
And if we scrolled further,
we'd probably see others.
So if we'd like to see those
bands, those artists that have
multiple albums, how can we do this?
Well, what if I somehow collapsed all
mentions of AC/DC into just one row,
and next to that row
I put an actual count?
Well, I could refine
this query as follows.
I can say, yes, join these
tables together, but then
collapse them, if you will, based
on grouping some common value.
Let's go ahead now and
group the data by name
so that any artist that
appears multiple times
will effectively be collapsed into one.
But I'd like to remember how many
rows got collapsed into just one.
And so rather than select the
albums themselves this time,
I'm going to select the artist's
name and then the count thereof,
thereby specifying show me the name
and show me how many rows
were grouped under that name.
If I go ahead now and finish
my thought and click Execute,
I'll see that, indeed, AC/DC had a count
of two and Aaron Goldberg a count of one.
And if we keep scrolling, we'd
see all of the bands' names
that had one or more albums and
the count for each of those.
If I want to filter out
now maybe those bands that
only had one hit album in this
database, I can instead say,
you know what, go ahead and
group by the group's name,
but then show me only those
bands having more than one album.
Well, here, too, can I simply filter
my results by saying literally quite
that: HAVING COUNT(Name)
greater than one, semicolon.
Hitting Play now and you see the
results immediately eliminate all
of those bands that had just one album.
And now if I scroll through,
we'll see all those bands
that had two or really more.
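The full query built up here, with its JOIN, GROUP BY, and HAVING, can be sketched like so on a miniature version of the data (the three albums below are illustrative stand-ins):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
db.execute("CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)")
db.executemany("INSERT INTO Artist VALUES (?, ?)", [(1, 'AC/DC'), (2, 'Aaron Goldberg')])
db.executemany("INSERT INTO Album VALUES (?, ?, ?)", [
    (1, 'For Those About To Rock We Salute You', 1),
    (2, 'Let There Be Rock', 1),
    (3, 'Worlds', 2),
])

# Collapse each artist's rows into one, counting how many got collapsed,
# then keep only those artists having more than one album.
rows = db.execute(
    "SELECT Name, COUNT(Name) FROM Album "
    "JOIN Artist ON Album.ArtistId = Artist.ArtistId "
    "GROUP BY Name HAVING COUNT(Name) > 1;"
).fetchall()
print(rows)  # [('AC/DC', 2)]
```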
And so with SQL can you
begin to express yourself
certainly more arcanely
than you could with just
a click of the mouse in a spreadsheet
but ever so much more powerfully.
But the key is to build
these queries up piecemeal.
Indeed, this now already
looks quite complicated.
But we began by just selecting all data
and then refining, refining, refining,
thereby working at this level
and only getting into the weeds
once we knew what we wanted.
Notice now just how fast
these operations have been.
Indeed, at the bottom here do I
see that 56 rows were returned.
How fast?
In one millisecond.
And indeed, even though
it might take me longer
to describe in English
what it is I want, in fact,
the computer can find
this data so quickly.
But that's because we've done
at least one thing already.
This data is already organized
with these proper types.
And it also has a few other
key characteristics as well.
When storing your data
in a SQL database,
you're also asked to
answer some questions.
For every table you're asked
to specify effectively what,
if any, is this table's primary key.
These are keywords in
SQL that you can apply
to some column that says,
hey, database, this column
is my primary value that uniquely
identifies every row in this table.
In the context then of
our user spreadsheet
with which we began this
discussion, that identifier for City
was a primary key.
I might very well have
used a city's name
as unique or perhaps even the zip code.
But far more efficient,
especially if you
want to avoid ambiguities
or duplication,
is to just use that integer.
And so here a primary key is almost
always a numeric value, at least
in the most optimized layouts of data.
But it guarantees to the
database that there will be
no duplicates on this particular value.
But more powerfully, you can define in
one table a column to be a primary key,
but then in another table that same
value to be a so-called foreign key.
In other words, throughout this
example in the actual SQL database,
I had Albums in one table
and Artists in others.
And that Artist table
had an Artist ID column
that was within that table
known as a primary key.
But when you saw that artist
ID in the Albums table,
it was contextually there a foreign key.
Now, beyond semantics, this is
an actual and valuable property.
It ensures that the
database knows how to link
those two columns and how
to do so efficiently.
Moreover, you have even fancier features
available to you when you declare keys.
You can also tell the
database, you know what?
If I ever delete this artist, go ahead
and delete all of that artist's albums
as well.
And you can configure a
database automatically
to have this so-called cascade
effect whereby data is updated
and your data is consistent at the end
of the day based on those relations,
if you will.
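Here is a minimal sketch of declaring such keys in SQLite, including the cascading delete just described; note that SQLite enforces foreign keys only when the foreign_keys pragma is switched on, and the tables here are a stand-in for the music database's:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only if asked

# ArtistId is a primary key here...
db.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")

# ...and a foreign key here, configured to cascade deletes.
db.execute(
    "CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, "
    "ArtistId INTEGER REFERENCES Artist(ArtistId) ON DELETE CASCADE)"
)
db.execute("INSERT INTO Artist VALUES (1, 'AC/DC')")
db.execute("INSERT INTO Album VALUES (1, 'Let There Be Rock', 1)")

# Deleting the artist automatically deletes that artist's albums as well.
db.execute("DELETE FROM Artist WHERE ArtistId = 1")
print(db.execute("SELECT COUNT(*) FROM Album").fetchone())  # (0,)
```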
Now, in columns of data
can you also specify
that every value's got to be unique.
It doesn't necessarily need
to be your primary key,
but it might still be unique.
Like what?
Well, in our Users table that
we were creating on the fly,
an email address might,
by human convention,
be unique, assuming I don't
share it with someone else.
But using an email address, multiple
characters, many possible characters,
tends not to be the most
efficient way to search on data.
So even in my Users table might
I have added for best practice
a numeric column as well, probably
called ID, as my primary key.
But I might still specify when moving
that data from my Google spreadsheet
into a SQL database that, you know what?
Please ensure that this
Email column's unique
so that I or some other
programmer doesn't accidentally
insert duplicate data into this table.
And moreover, the database
then can search it
more efficiently because it
knows how many, if any, there
are of any one value.
There's one and only one maximally.
Lastly, there's this
keyword here, index.
Well, the other feature you get so
powerfully from proper databases
is the ability to search
and sort values efficiently.
But the database doesn't know
a priori how to do either
on the data you care about.
Because only if you tell the database
what data you plan to search on
and sort frequently can
it help you in advance.
And so if when creating a database table
you tell the database server, go ahead
and index this column, what it will
do is use a database structure,
a tree structure not unlike our
so-called binary search trees,
that pulls all the data up in
an artist's rendition thereof,
thereby ensuring that it
doesn't take as many steps
to find some email address
or something else because you
have indexed that particular column.
It won't store it linearly top
to bottom or left to right.
It will store it in a
two-dimensional tree
structure of some sort,
often known as a B-tree, that
allows you to grab the data in hopefully
logarithmic and not linear time.
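A short sketch of both features, a UNIQUE constraint and an explicit index, using a hypothetical users table like the one from the spreadsheet discussion:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# UNIQUE tells the database to reject duplicate values in this column.
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT UNIQUE)")
db.execute("INSERT INTO users (name, email) VALUES ('David', 'malan@harvard.edu')")

# A second insert with the same email violates the UNIQUE constraint.
try:
    db.execute("INSERT INTO users (name, email) VALUES ('Impostor', 'malan@harvard.edu')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# Indexing a column you plan to search or sort on frequently tells the
# database to build a B-tree over it, enabling faster-than-linear lookups.
db.execute("CREATE INDEX name_index ON users (name)")
```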
Well, turns out there are
even more features you
get from actual databases like SQLite.
Well, you have the ability to specify
when creating a table, please go ahead
and auto increment this column.
Well, what does that mean?
Well, I very manually
a moment ago assigned
Cambridge the unique identifier of one.
But why should I, the
programmer, even have
to worry or care about what the
unique values of my inputs are?
I just need that key to exist.
I do not need to care about what
that value is, just that it exists
and it's unique.
So you can tell the database
on its own, please go ahead
and, any time I add a
new row to this table,
increment that value automatically.
You can also specify
to a database, please
ensure that no values in my
database are null, that is, empty,
thereby ensuring that a bug in your
code or some missing user input
doesn't accidentally
put into your database
a row that's only sparsely
filled with real data.
The database can help you with these
things just as Python can as well,
but it's a final layer of
a defense before your data.
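Both features might be sketched like so, revisiting the hypothetical cities table from earlier:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# AUTOINCREMENT lets the database assign each new row's ID by itself;
# NOT NULL makes it reject rows with missing values.
db.execute(
    "CREATE TABLE cities ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "city TEXT NOT NULL, state TEXT NOT NULL, zip TEXT NOT NULL)"
)
db.execute("INSERT INTO cities (city, state, zip) VALUES ('Cambridge', 'MA', '02138')")
db.execute("INSERT INTO cities (city, state, zip) VALUES ('Allston', 'MA', '02134')")
print(db.execute("SELECT id, city FROM cities").fetchall())  # IDs 1 and 2, assigned automatically

# Omitting a NOT NULL column raises an error instead of storing a sparse row.
try:
    db.execute("INSERT INTO cities (city) VALUES ('Somerville')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```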
And then functions as well.
SQL itself is a programming language.
It might not necessarily have as high
a ceiling as something like Python,
as much functionality.
But built into SQL are
any number of functions.
If you want to select
the average revenue
across any number of purchase orders,
you can use the average function.
And in my SQL query can I select data
but pass it into one of these functions
and get back an answer without
having to paste it into, say,
a spreadsheet, let alone calculator.
I can count rows just as I did.
I wanted to count the number of
albums that a given artist had,
and COUNT was a function
supported by SQL.
You can get maxes and mins.
You can get summations
of values and so many more
features built into this language.
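A few of those built-in functions in action, on a hypothetical invoices table with made-up totals:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, total REAL)")
db.executemany("INSERT INTO invoices (total) VALUES (?)", [(1.99,), (3.98,), (5.94,)])

# AVG, SUM, COUNT, MAX, and MIN are computed by the database itself,
# no spreadsheet or calculator needed.
avg, total, count, high, low = db.execute(
    "SELECT AVG(total), SUM(total), COUNT(*), MAX(total), MIN(total) FROM invoices"
).fetchone()
print(avg, total, count, high, low)
```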
And while the tool you
might use might not
be DB Browser, perhaps it's just a
textual interface or even something
even more graphical,
it ultimately is just
executing on your behalf the SQL queries
and handing them off to the database
for execution.
Now, with all of these features
that you get with the database,
it all sounds too good to be true.
You can scale, you can
eliminate redundancy,
and you can still select
all the data you want.
But unfortunately, you have
to start to think harder
about the design of your system.
Because databases are sometimes
vulnerable to mistakes, if you will.
Consider, for instance,
something like Twitter
that tends to keep track of how
many times something's retweeted.
Or consider an app like
Instagram, which keeps
track of how many likes or upvotes
or views some post has gotten.
On the most popular or viral
of media, those counters,
those integers might be getting
incremented ever so quickly.
If you and I both happen to view or
like something at nearly the same time,
well, that interface
from us into the system
might actually trigger some
instruction on some server
somewhere to tell Instagram's
database to increment some value.
But how does a database go
about incrementing a value?
Well, if the value of views
or the value of counts
is somehow stored in a database,
a column of type integer,
and you go ahead and execute a SQL
SELECT in order to get that value,
and, for instance, 100 people
before me have liked some post,
well, the value of that
result comes back as 100.
I then do some math in
my code, perhaps Python.
I increment the 100 to 101, and
then I use a SQL UPDATE, as you can,
to push the data back into the database.
But suppose both you and
I anywhere in the world
both happen to like a post at
the same or nearly the same time,
as can certainly happen when
posts are especially popular.
Unfortunately, a computer can
sometimes do multiple things
at once or at least in
such rapid succession
that it appears to be at the
same time, but a race of sorts
can happen, a race condition,
if you will, as follows.
If both my button and
your button are pressed
at nearly the same time, and that induces
execution of code on Instagram's servers
that selects for both of us
the current count of views,
suppose that both of the threads, so
to speak, both of the SQL operations
that select that data,
come back with the value 100.
Each of the blocks of
code serving our requests
then goes ahead and increments that value
to 101 and, via SQL UPDATE,
pushes that value back to the database.
Unfortunately, because both you and
I induced an evaluation of that math
at nearly the same time, what
the database might end up storing
is not 102 but 101.
In other words, if two people's input
is triggering a race to update data,
the database had better keep track of
who and when asked for that update.
Otherwise, you lose data.
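The two patterns might be contrasted like so; the posts table is hypothetical, and the single-statement UPDATE at the end is one common fix, pushing the read-increment-write into the database itself so it happens as one operation rather than three:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (1, 100)")

# Racy pattern: SELECT the value, do the math in Python, UPDATE it back.
# Two requests that both read 100 would each write back 101, losing a like.
likes = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
db.execute("UPDATE posts SET likes = ? WHERE id = 1", (likes + 1,))

# Safer pattern: let the database read and write in a single statement,
# so concurrent increments can't overwrite one another.
db.execute("UPDATE posts SET likes = likes + 1 WHERE id = 1")
print(db.execute("SELECT likes FROM posts WHERE id = 1").fetchone())  # (102,)
```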
Now, in the case of tweets or likes,
it's probably not all that problematic.
Though, frankly, that is their business.
But you can certainly imagine that with
banks or financial institutions, where
the numbers matter ever
so more, you certainly
don't want to accidentally
lose track of some dollars.
And so how do we go about solving
this in the case of a SQL database?
Well, it turns out that there are fairly
fundamental primitives or solutions
you can use.
Consider a metaphor in the real world,
such as, say, a familiar refrigerator.
And suppose that you and
your significant other
happened to both like something to drink
at the end of the day, like some milk.
And so you go ahead when you get
home, and the other person's not yet there,
and you open the fridge and you
see, oh, darn it, we're out of milk.
And so you close the fridge
and you head downstairs
and you walk to the nearest store.
Meanwhile, that other person
comes home and, wanting some milk,
opens the fridge, and darn it if
we aren't out of milk as well.
And so that person now heads
out, perhaps in a different car,
in a different route, and heads to the
store, some other store to get milk.
Fast forward some amount
of time and both of you
come home, and darn it if you don't now
have twice as much milk as you need,
and it does go bad.
And so you've both ended up buying milk
when really only one of you needed to.
And this is similar in
spirit, though here you've
got more data than you actually wanted;
it's not the right amount of data.
So why did that happen?
Well, both of you, like Instagram,
inspected the state of some value
and made a decision on it
before the other person
was done acting on that information.
So in our very real world
of milk, how could you
go about avoiding that conflict,
that race, to restock the fridge?
Well, you could quite simply grab
a pen and paper and leave a note,
so to speak, on the fridge telling
someone else, gone for milk,
and hopefully they then
would not do the same.
Or perhaps more dramatically you could
lock the refrigerator in some sense
so that they can't even get
into it and inspect that state.
But ultimately, you need your
act of checking the fridge
and restocking it to be
what we'll call atomic.
And databases can and
hopefully do provide atomicity,
that property, the ability to do
multiple things together or not at all
but not be interrupted
by someone else's work.
And in fact, in the
database world, these
are generally known as actual locks
whereby you say to the database,
don't let anyone else
write to this table or row
until I am ready to
release or unlock it.
That, of course, though, tends to
be a very heavy-handed solution.
Say don't let anyone
else touch this data.
Better to exercise
more fine-grained control
so that you don't slow
your whole system down.
And so SQL databases tend to
support what are more finely known
as transactions, whereby you can execute
one or more commands
back to back but
make sure that all of them
go through at once
before, say, the commands
that someone else's input induced
are allowed to get executed.
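As a sketch of a transaction, consider a hypothetical bank transfer, a classic case where two updates must succeed together or not at all (isolation_level=None here puts Python's sqlite3 into autocommit mode so the explicit BEGIN and COMMIT take effect):

```python
import sqlite3

db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])

try:
    # Everything between BEGIN and COMMIT is one atomic unit of work.
    db.execute("BEGIN")
    db.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    db.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
    db.execute("COMMIT")
except sqlite3.Error:
    db.execute("ROLLBACK")  # on any failure, undo both updates together

print(db.execute("SELECT balance FROM accounts ORDER BY id").fetchall())  # [(50,), (50,)]
```

If either UPDATE failed midway, the ROLLBACK would leave both balances untouched, so no dollars are ever lost in between.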
Now, honestly, even
in the most high-tech
of places like Instagram and
Twitter, this is a hard problem.
Because at some point, even
waiting for my operations
to complete before yours
can go in can be a challenge
and a bottleneck for everyone else.
And so in the case of the most viral of
posts, what can systems these days do?
Well, you could just
kind of wait and write
that data back to the database
later, so that it's updated eventually.
And indeed, another
property of databases
is known as just that, eventual
consistency, a property that says,
don't lose any data, but only eventually
make sure it's reflected on the server.
Eventually get the value
right, but do get it right.
And so what Instagram and
Twitter and others might do
is just cache or buffer that data,
waiting until things have quieted down
'til the post is no longer viral
or most users have gone to sleep.
Now, that alone might not
be the best of solutions,
but it avoids having to get the highest
powered and most expensive hardware.
Of course, in other contexts,
that might be the better solution.
In the world of finance,
sometimes it comes
down to the actual length of cables or
distance from some server to another
to ensure that the
data gets there so fast
that you don't run into
these sorts of challenges.
So databases can solve this,
but the developers and designers
that use those databases need to know
how to do it and that they should.
Lastly, there's another
challenge as well,
unfortunately all too
common these days
because folks just don't defend
against it via common mechanisms.
It turns out that a bad actor somewhere
on the internet or your own network
can potentially, if
you're not careful, trick
a database into executing
commands that you didn't intend.
For instance, suppose
in the context of Python
you have some code that
looks a bit like this.
Here is a program written
in a mix of pseudocode
and Python that's designed
to allow a user to input
the title of an album for
which they want to search.
And so here I use the Python function
INPUT to prompt the user for just that.
On the left-hand side do I declare
a variable called title, and then
assign, from right to left, the
user's response to that variable.
Then suppose for the sake of discussion
there is some function called EXECUTE
whose purpose in life is to take input
that itself represents a SQL command.
That SQL command might be
this: SELECT star FROM Album
WHERE Title equals something.
Now, what is that something?
Well, if I have the user's input
in a variable called Title,
I can use the plus operator in
Python, not to add but to concatenate
two strings together, quoting the value
singly and completing that thought.
The problem, though, with SQL is
that the user's not really to be trusted.
And whether the user's input is
coming from a keyboard on a Mac,
or PC, or perhaps, more
compellingly, from an app or website,
you probably should not
trust all your users.
Because suppose that your user typed
in not the album name for which they
want to search, Let There Be Rock,
but rather they type something
like Let There Be Rock,
semicolon, DELETE,
thereby using SQL's apparent
DELETE command in order
to trick your database into
executing not one but two
commands, a SELECT and DELETE.
And indeed, this is what's
known as a SQL injection
attack, the ability for
an adversary, a bad actor
out there, to somehow trick
your database and your code
into executing some command
that you didn't intend.
How is that possible?
Well, some of these characters
are dangerous, so to speak.
A semicolon in SQL tends to
separate one command from another.
It finishes your thought.
And if you yourself don't anticipate
that some human, this bad actor,
might type in themselves a
semicolon when they really
shouldn't be typing SQL at all,
you might mistake that semicolon
for the actual terminus between
one command and another.
And if you just blindly
pass it into your server
and let it execute as usual, you
might execute not just that SELECT
but that DELETE or
anything else as well.
And in this way can an adversary not
only delete data from your database
but maybe select more than you
intended, or update or insert.
It's ultimately up to you to
defend against these threats.
So, how?
Well, it turns out that there are
libraries, code written by other people
that, frankly, via very
easy-to-use functions,
just make it easy to sanitize or
scrub, so to speak, users' input.
What do these libraries or
these functions typically do?
Honestly, they just escape, so to
speak, these dangerous characters.
Something like a semicolon or
perhaps a single apostrophe that
might, in SQL, have some special
and dangerous potential meaning,
they just escape them as by putting a
backslash, a special character in front
of them so that if the human were
to type in Let There Be Rock,
semicolon, DELETE, that would actually
be interpreted safely by your database
as a search for an album called Let
There Be Rock, semicolon, DELETE,
which of course most likely
is not the name of an album.
So that query would probably not
return or select any results.
But more importantly,
it would not be tricked
into executing two SQL commands.
Rather, it would execute only the
SELECT but with a nonsensical value.
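In Python's sqlite3 module, that escaping is exactly what a placeholder buys you; the malicious title below is an illustrative attack string, not something from the source:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT)")
db.execute("INSERT INTO Album (Title) VALUES ('Let There Be Rock')")

# Untrusted input: a bad actor trying to smuggle in a second command.
title = "Let There Be Rock'; DELETE FROM Album; --"

# Safe pattern: the ? placeholder makes the library treat the whole string
# as a value to search for, never as SQL to execute.
rows = db.execute("SELECT * FROM Album WHERE Title = ?", (title,)).fetchall()
print(rows)  # [] -- the nonsensical title matches no album
print(db.execute("SELECT COUNT(*) FROM Album").fetchone())  # (1,) -- nothing was deleted
```

Contrast that with building the query via string concatenation, as in the pseudocode above, where the same input would have been executed as two commands.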
Lastly, consider what a database is.
It's really a piece of software
running on some computer
somewhere, be it on my own Mac, or
PC, or some server in the cloud.
But if you have just one database,
as I seem to keep presuming,
you have this so-called
single point of failure,
again, just as we had in the world
of cloud computing more generally.
And so with something like data
where you don't want to lose it
and you certainly don't want
all of your web servers or apps
to go offline just because one server,
your database server, has gone out,
it's time to revisit the architecture
or topology of systems more generally.
Something tells me that
it's not going to be
sufficient to have just one database.
You probably want two.
But if you have two databases, now how
do you decide where to put the data?
Do you put it here, or over
here, or maybe in both places?
If you put it in both places, though,
you're then using twice as much space,
so already we've opened a can of worms,
all to solve one simple problem: don't
have a single point of failure.
But that's going to cost you some time,
or some money, or certainly space.
So what can you do if
you're architecting
a system that has now not just
web servers but, say, databases?
Well, odds are you're going to want to
have not just the one, pictured here
as a cylinder, as this
convention, but you're probably
going to want to have a second as well.
But of course, if you
have two of them, odds
are it's not sufficient just to
store half of your data on one
and half of your data on the
other, because, of course,
you've not solved the
single point of failure.
You now just have two single points
of failure because half of your data
could be lost here or half of it here.
So odds are you're going to want
to start having backups of data.
But you don't want to necessarily
have to waste time restoring
from backup, especially
if you want to maintain
as many as five nines of uptime.
So odds are you're going to want
to have these databases really
be duplicates of one another.
And whenever you write
data to one database,
you should probably write
it to the other in parallel.
So, yes, admittedly, you have
just spent twice as much space
and, frankly, twice as much money.
But at some point those sorts
of costs are inevitable.
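That dual-write idea might be sketched as follows, assuming two in-memory SQLite databases standing in for two database servers. (A real system would replicate over the network, often asynchronously and in parallel, rather than sequentially in one process.)

```python
import sqlite3

# Two hypothetical databases standing in for a primary and its duplicate.
db1 = sqlite3.connect(":memory:")
db2 = sqlite3.connect(":memory:")
for db in (db1, db2):
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

def write_everywhere(sql, params=()):
    """Apply the same write to both databases so each remains a full copy."""
    for db in (db1, db2):
        db.execute(sql, params)
        db.commit()

write_everywhere("INSERT INTO customers (name) VALUES (?)", ("Alice",))

# Either database can now answer the same query, so losing one loses nothing.
print(db1.execute("SELECT name FROM customers").fetchall())  # [('Alice',)]
print(db2.execute("SELECT name FROM customers").fetchall())  # [('Alice',)]
```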
But there's other ways
to scale here, too.
You can, of course, hit a
ceiling on vertical scaling
even when it comes to databases.
After all, if a database is just
a program running on some server
and there is only so much RAM or
disk space or CPU in that server,
eventually you're not going to
be able to store as much data
or as quickly as you want.
So what can you do?
Well, you could, for
instance, shard your data
and have not just two but maybe four or
more servers and put all of the users
whose names start with A through M on
one half of your cluster of servers,
so to speak, but maybe everyone else,
from N through Z based on, say, their
last name, can go on the others.
To shard a database means to split
the data in some predictable way
that you can repeat again and again.
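That predictable split might be as simple as a function that maps a last name to a shard. The shard names below are hypothetical stand-ins for your actual database servers:

```python
def pick_shard(last_name: str, shards: list) -> str:
    """Predictably map a user to a shard by the first letter of their last name."""
    first = last_name[0].upper()
    return shards[0] if first <= "M" else shards[1]

# Hypothetical names for two halves of a cluster.
shards = ["db-a-through-m", "db-n-through-z"]

print(pick_shard("Malan", shards))   # db-a-through-m
print(pick_shard("Zamyla", shards))  # db-n-through-z
```

Because the function is deterministic, running it again and again for the same user always lands on the same shard.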
But even there, too, even if only the
As through Ms are going to the left,
you want to make sure that you
still have that backup or replica.
And this arrow suggests that they
really should be intercommunicating,
not unlike load balancers we've seen.
But there's another way you can
scale your databases as well.
You don't have to have databases
doing both reading and writing.
To read data from a database or any
server means to take it from its memory
and read it into yours.
And to write means to do
the opposite, to save it.
Well, what you can do actually
in the world of databases
is also replicate your
databases multiple times.
And you might have connected
to these two primary databases
multiple other databases that are just
copies in one direction and not both.
And what you might then do is use these
two primary databases not only to read
but to write, abbreviated here RW.
But these other databases
down here, which
are just copies of the ones
to which they're connected,
are just called read replicas.
They exist solely to be
read from again and again.
When might this make sense?
Well, in some contexts,
like social media,
like Facebook, it's
probably the case that there
are more reads than there are writes.
That is to say, you probably know
people who post more content than you
do, but you probably still
read or look at theirs.
And so if the data for
your business follows
that pattern whereby writes are maybe
common but reads are way more common,
you can do exactly this model and
replicate again and again, honestly,
as a tree structure for
efficiency so that it doesn't all
have to replicate one at a time.
But then you can write software,
be it in Python or something else,
that writes data only to these
one or two servers but reads
from any number of them as well.
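A sketch of that routing logic, assuming one primary and a few in-memory SQLite databases standing in for read replicas. Real replication happens over the network; here each write is simply copied to every replica by hand so that reads can then be spread across them:

```python
import random
import sqlite3

# One primary that accepts writes, plus stand-ins for read replicas.
primary = sqlite3.connect(":memory:")
replicas = [sqlite3.connect(":memory:") for _ in range(3)]
for db in [primary] + replicas:
    db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")

def write(sql, params=()):
    """All writes go to the primary; replication then copies them outward."""
    primary.execute(sql, params)
    primary.commit()
    # Stand-in for real replication: push the same write to each replica.
    for r in replicas:
        r.execute(sql, params)
        r.commit()

def read(sql, params=()):
    """Reads can go to any replica, spreading the load across all of them."""
    return random.choice(replicas).execute(sql, params).fetchall()

write("INSERT INTO posts (body) VALUES (?)", ("Hello, world",))
print(read("SELECT body FROM posts"))  # [('Hello, world',)]
```

Since every replica holds the same copy, it doesn't matter which one a given read happens to hit.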
But this, too, is a bit of a
rabbit hole, because at some point
you want to have this
redundancy not in one
location but others, east coast and
west coast, one country and another.
And at that point, you might actually
run into the limitations of time.
Because after all, it takes a non-zero
number of milliseconds or seconds
for data to travel long distance.
Consider after all
how long it might take
data to go from Cambridge,
Massachusetts to somewhere in Japan.
That's far longer than it might take
to just go down the road to MIT.
So here, too, we can revisit all of the
problems we saw in the world of cloud
computing and servers more generally.
They're back to revisit in
the context of databases.
But with databases, you care ever
more that these things not go down,
or if they do, that you have
spares, backups, or replicas.
Because now that we're storing our
data in this centralized place,
we have to think hard not only about
how we're scaling computationally
but how we're scaling in
terms of our data as well.
So consider where then we began.
We started by laying out data in a
spreadsheet, be it Microsoft Excel,
or Apple Numbers, or
Google Spreadsheets.
From there we considered
what types of data
we might store there so
that if we want to upgrade,
so to speak, from a
spreadsheet to a database,
we know what types we can specify.
And in SQL, whether implemented
in SQLite, Oracle, or MySQL,
or something else, they tend
to be standard types that
tend to be common across platforms,
albeit with some variations,
so that we can think
hard about these types
and then ultimately help the
database help us be performant.
Because if I know that I'm going
to be searching or selecting based
on certain data, I can tell
the database, for instance,
to make it unique or at least index it.
And then using SQL constructions
like SELECT, and INSERT, and UPDATE,
and DELETE, and yet others can
I manipulate that data and get
at it even faster, frankly, than the
human me could with a mere spreadsheet.
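Those pieces might come together as follows in Python's sqlite3, with a hypothetical customers table: a UNIQUE constraint and an explicit index to help the database help us search quickly, and then the four workhorse commands in turn.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# UNIQUE gives the database license to index email, and CREATE INDEX
# speeds up searches on another frequently queried column.
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT UNIQUE, name TEXT)")
db.execute("CREATE INDEX idx_name ON customers (name)")

# INSERT, UPDATE, SELECT, and DELETE, each parameterized with ? placeholders.
db.execute("INSERT INTO customers (email, name) VALUES (?, ?)", ("alice@example.com", "Alice"))
db.execute("UPDATE customers SET name = ? WHERE email = ?", ("Alicia", "alice@example.com"))
print(db.execute("SELECT name FROM customers WHERE email = ?", ("alice@example.com",)).fetchall())  # [('Alicia',)]
db.execute("DELETE FROM customers WHERE email = ?", ("alice@example.com",))
print(db.execute("SELECT COUNT(*) FROM customers").fetchone())  # (0,)
```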
But with the design of any
system, as with databases,
we start to open new cans
of worms and new problems
as we start to explore more
sophisticated challenges.
But here, too, many, if
not all, of these problems
can be solved by simply reducing
the problems to first principles
and considering: what is
the problem to be solved?
How is our data represented?
Where is it stored?
And consider ultimately what business
constraints or legal constraints
we have when manipulating that data,
and what tools of the trade
are available to us.
This then is database design and,
more generally, design unto itself,
not just focusing on the correctness
of implementing solutions to problems
but the efficiency and
scalability thereof as well.
