Welcome again to the NPTEL course on Storage
Systems. In the previous classes, we were
looking at issues relating to the design of
the Google file system and we also looked
at some of the problems in that system, and
how a newer system called BigTable resolves
these issues.
So, just as a summary, we will look at what
we discussed previously. BigTable has 3 main
components - a library that is linked into
every client, a Master server and many Tablet
servers. Each Tablet server holds a number of
tablets - ranges of rows - and the tablet is
the unit of distribution. Tablet servers can
fail, or more rows can come in and you may want
to move a tablet to a new machine. So you need
to accommodate changes in workload; this kind
of dynamic activity with respect to Tablet
servers keeps happening.
Again, you have a Master which assigns Tablets
to Tablet servers, detects which Tablet servers
are available and which are no longer working,
does some load balancing, and also does garbage
collection of files in GFS. Basically, it turns
out that GFS and BigTable together work in a
manner similar to a log-structured file system:
you keep writing information, older information
becomes stale and has to be deleted, so garbage
collection is an important part of the system,
and BigTable actually manages the garbage
collection of files in GFS. As we remember, when
we do a delete in GFS, the file is just marked
as deleted - nothing is done to it immediately;
the actual reclamation happens later as part of
the garbage collection service.
Now, the Master also handles certain other
things, like schema changes. So I think we'll
briefly look at the interaction between GFS and
BigTable one more time. The important thing to
remember is that the persistent state of a
Tablet is stored in GFS. Basically, when writes
come in, you commit them to a commit log; once
committed, they are sorted and buffered in
memory. These sorted, buffered records are
called memTables. Just like in a log-structured
file system, once you are running out of memory,
for example, you can push the reasonably full
memTables out onto persistent storage as
SSTables (Sorted String Tables) - and this is
what gets stored in GFS.
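To make this concrete, here is a minimal sketch of that write path, assuming hypothetical names like TabletWriter; it illustrates the idea, not BigTable's actual code:

```python
class TabletWriter:
    def __init__(self, flush_threshold=4):
        self.commit_log = []   # stands in for the commit log kept in GFS
        self.memtable = {}     # recently committed updates, held in memory
        self.sstables = []     # flushed, immutable sorted runs (newest first)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. commit to the log
        self.memtable[key] = value            # 2. buffer in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. push out to "GFS"

    def flush(self):
        # Sorting at flush time yields a Sorted String Table (SSTable);
        # newer tables go to the front so reads can prefer them.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
```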
So, when reads come in, the information can be
in multiple places. It could be in the memTable,
because that is where recent writes go, or it
could be in one of the older SSTables. And
because old information is not deleted
immediately, stale versions can be present, so
you need to figure out which is the most recent
one: of all the possible memTables and SSTables,
you have to find the most current version. In a
sense, you have to look up all these tables.
Now, because these tables are sorted, you can do
this fairly easily. This is different from LFS,
where you need additional indexing information
to check and look up what you need. Here,
sorting helps: even though looking in several
tables takes time, sorting certainly helps when
it comes to serving reads.
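Continuing the same sketch, a hedged illustration of the read path: probe the memTable first, then each SSTable from newest to oldest, using binary search since every table is sorted (the lookup name is an assumption for illustration):

```python
import bisect

def lookup(writer, key):
    # The memTable holds the most recent updates, so check it first.
    if key in writer.memtable:
        return writer.memtable[key]
    # Then probe SSTables newest-to-oldest; each is sorted, so a
    # binary search suffices instead of a full scan.
    for table in writer.sstables:
        i = bisect.bisect_left(table, (key,))
        if i < len(table) and table[i][0] == key:
            return table[i][1]
    return None   # key not present anywhere
```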
Again, if there are any failures, the persistent
state is sitting in GFS. Basically, to recover a
Tablet, the Tablet server has to read its
metadata: the set of updates stored in SSTables
- the ones which have been made persistent - is
read into memory, and you also have a set of
redo points. These are the points from which the
commit logs may contain data for the Tablet.
That data is read into memory and used to
reconstruct the memTable. As I told you, the
memTable keeps the current set of recently
committed updates, and you can continue from
there.
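As a rough illustration of that recovery step, reusing the TabletWriter sketch from above (redo_point here is simply an index into the toy commit log, an assumption for illustration):

```python
def recover(commit_log, sstables, redo_point):
    writer = TabletWriter()
    writer.sstables = sstables        # persistent state read back from "GFS"
    writer.commit_log = list(commit_log)
    # Replay only the suffix of the log that may not have been flushed yet.
    for key, value in commit_log[redo_point:]:
        writer.memtable[key] = value  # reconstruct the recent updates
    return writer
```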
So, this is how the system has been set up, with
some interaction between GFS and BigTable. GFS
worries mostly about things like efficient
storage (in 64 MB chunks, for example),
reliability, how updates are done, and how to
ensure that atomic appends take place. BigTable,
on the other hand, tries to be closer to the
application and figures out how to make record
structure available to it: buffer multiple
records as they come in, in a log-structured
manner, and keep pushing them out to GFS once
they reach a certain size.
So that's how GFS and BigTable have been designed
- it is an example of what you might call a
cross-layer design. Each party does certain
things, but you try to ensure that the right set
of functionality sits at each layer - that's the
idea. GFS came first and was designed mostly for
handling web search and crawling of data.
BigTable came in because a lot of people were
doing other things with the system as well, and
it provided an easier way to interact with GFS.
So it is slightly closer to the Application
layer, I would say.
So now we'll take a look at some closely related
systems that have been designed. Hadoop is one
that is closely modeled on the GFS/BigTable kind
of design. They say it is a rack-aware file
system. Actually, most of the systems we talk
about are rack-aware (also called Data-Center
aware) - they know exactly which rack a piece of
data is present in, so that when replicating it,
they can keep the replica on a different rack.
Or, if you want to do geographic distribution,
you also ensure that it can be replicated in
places quite far away from one of the Data
Centers. So the same data can be in multiple
distributed Data Centers - this is an important
issue because the latency and bandwidth
characteristics are closely correlated with
whether an access is within a rack, across
racks, within a Data Center, or geographically
distributed.
So essentially all these file systems are
rack-aware, but Hadoop makes the claim
explicitly, just like Cassandra and other such
systems. Basically, the idea is to make sure
that the work being done is close to the data;
if that is not possible, try to be near the same
rack/switch so that you reduce backbone traffic.
Again, HDFS stores large files just as GFS does,
also in 64 MB chunks. Similar to GFS, it also
achieves reliability by replicating data across
multiple hosts; it doesn't use RAID storage on
hosts, since that is expensive and also doesn't
give as much reliability as distributing across
multiple nodes.
Basically, even if you have RAID storage, if the
machine dies you don't have access to the data,
which is why RAID storage is not used - similar
to GFS. Above that, they run application-level
entities like the MapReduce engine, the
JobTracker (which does scheduling), and so on.
Unlike GFS, this is an open-source system; it is
widely used, and many major infrastructure
companies use Hadoop. There are a lot of
differences between HDFS and GFS, but there are
also similarities as far as the design is
concerned.
We'll now briefly look at another design. Yahoo
has come up with a system called PNUTS, which is
a large, geographically distributed data serving
system for web applications. Basically, what
they are trying to support are workloads
consisting mostly of queries that read and write
single records or small groups of records; it is
not for complex queries. If the workloads mostly
deal with single records or small groups of
records, it is also slightly easier to manage
consistency, because systems like BigTable
usually provide atomicity only for single rows.
So if it is single records or small groups of
records, the data is possibly on a single node
and the system can probably give you a lot more
guarantees than is otherwise possible.
Data in this system is organized as hashed or
ordered tables. We'll talk about hashing soon -
it is a different model of storing data. We
looked into it some time back, but we'll go into
additional detail with respect to what is called
consistent hashing shortly. The most important
thing they are trying to do is design for low
latency under large numbers of concurrent
requests, including updates and queries. Because
of this, anything that is high latency is
usually done asynchronously. Another thing they
do is support a scheme whereby, if you are
modifying some records, you can have a Master
for them, so that updating those records becomes
a local operation for you. So essentially you
can cache a record and update it. In some sense
it is like session semantics - you copy it, keep
it with you, update it as much as you want, and
put it back out. You want to do as many
operations locally as possible.
Another thing this system does is give you
per-record consistency guarantees - all replicas
of a given record apply all updates to that
record in the same order. We'll look at this a
bit more. You'll notice that in GFS and similar
systems, updates happen in an order decided by
the primary; the catch is that when one primary
decides a particular ordering of updates and
then fails, the new primary can order them
differently. So there are some issues with
respect to the kinds of updates present: for
example, if you are doing appends, you'll find
they can end up in different orders on different
machines.
What is attempted here is to give additional
control to the application, so that at least
with respect to a single record, all replicas
apply all updates in the same order. Instead of
enforcing ordering at a slightly higher level,
you do it at the record level, similar to what
BigTable gives you with respect to rows.
For example, they have something called the
Timeline consistency model. The basic idea is
that a particular object has a timeline: it got
read here, it got modified here, and so on. So
there are various versions of the object
available - reads are served from a local copy,
but this may be stale with respect to a more
current version somewhere else. The application
API, however, can specifically ask for a
particular version, or for any version greater
than a particular number. For example, I can
ask: give me the most current version, or a
version no older than the 5th version. So
basically, while copies may lag the master
record, every copy goes through the same
sequence of changes. In a sense, there is one
timeline per object, respected at all places.
So you can essentially say the following: I'm
interested in a particular version, or any
version later than a given one. Because of this,
behaviour is essentially consistent across all
systems and clients. You also have test-and-set
writes: you state what you believe the current
version is, and if it really is the current
version, the write is allowed to go through;
otherwise it becomes a no-op. It turns out this
can also be used to facilitate per-record
transactions. Essentially you can roll back -
you try to write, and if it is not possible, you
roll back. Of course, this means you may retry
certain operations multiple times before you
succeed.
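Here is an illustrative sketch - not Yahoo's actual API - of how timeline reads and test-and-set writes on a single record might look; the Record class and method names are assumptions:

```python
class Record:
    def __init__(self):
        self.timeline = [None]   # version 0; index i holds version i

    def read(self, min_version=0):
        # Serve from the local copy as long as it is recent enough.
        if len(self.timeline) - 1 >= min_version:
            return len(self.timeline) - 1, self.timeline[-1]
        raise RuntimeError("local copy too stale; must contact master")

    def test_and_set(self, expected_version, value):
        # The write goes through only if no one else got in first;
        # otherwise it is a no-op and the caller may retry.
        if len(self.timeline) - 1 == expected_version:
            self.timeline.append(value)
            return True
        return False

rec = Record()
assert rec.test_and_set(0, "v1")        # succeeds: current version is 0
assert not rec.test_and_set(0, "v1b")   # no-op: version has moved on to 1
version, value = rec.read(min_version=1)
```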
One other difference between this system and the
previous systems we looked at is that they use a
guaranteed message delivery service rather than
a persistent log. What is that? In the case of
GFS and BigTable, we had certain logs that the
primary wrote, and you had to use those for
recovery. Here, they depend instead on a
guaranteed message delivery service - there are
various methods by which messages can be
delivered with a guarantee that they actually
arrive - and the system they use is a
"publish-subscribe" message system. Another
thing they have done is provide support for
triggers, i.e., applications can register
themselves with the system, saying: if this
particular change takes place, I want to be
notified. This is important for applications
that must invalidate cached copies after some
time. For example, if you are serving ads and
there is a contract which expires after some
period of time, you want to remove all the ads
corresponding to that contract. Or you might
have a system tracking certain websites: as soon
as some modification takes place, I want to be
notified because I want to re-index the site. So
if you are looking at real-time kinds of models,
you need this. Google has also proceeded in this
direction - they have a similar service,
something called Megastore. So you essentially
want to provide a trigger-like capability so
that you get notified whenever things happen,
and that is done by users subscribing to a
stream of updates on a table, over an
asynchronous publish-subscribe message system.
A good thing about this is that anybody can
publish and anybody can subscribe; neither party
needs to know who the other is. Because
everything goes through this message system, the
message system keeps track of who the parties
are - who's subscribing, who's publishing. So,
for example, if you are trying to update
something, one replica doesn't need to know the
locations of the other replicas. You just
publish the update, and the system makes sure
that the subscribers looking for it can get to
it. In a sense, there's a broker in between
figuring out what the parties are interested in
and making sure it happens. And because of this,
it turns out they can also optimize delivery for
geographically distant replicas - you can figure
out the best way to push the updates. This is in
contrast to gossip-like protocols, which might
not be tuned for geographical distribution and
might not take care of this kind of thing. So,
this is one other design.
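A minimal sketch of the decoupling a publish-subscribe broker provides - publishers and subscribers know only the broker, never each other. The Broker class here is purely illustrative, not PNUTS's actual message broker (which also guarantees delivery):

```python
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The broker, not the publisher, knows who the receivers are,
        # so a replica never needs the locations of other replicas.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
broker.subscribe("table:ads", lambda msg: print("replica applies", msg))
broker.publish("table:ads", {"row": "ad42", "op": "invalidate"})
```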
Again, we'll quickly look at another design, the
Windows Azure model. This is a slightly
different system because it is designed as a
scalable cloud storage system, used both by end
users and by Microsoft themselves. For example,
they want to use the system for search, just as
GFS was used by Google, but they also wanted to
offer it as a cloud storage system that anybody
interested could use. Their basic design keeps
data in the form of blobs (user files), tables
(structured storage), or queues (message
delivery). The queues are similar to what we saw
in the case of PNUTS: there, too, you had a kind
of guaranteed message model, and the same role
is played by the queues here instead of
depending on persistent logs.
So basically, in this system you have blobs for
incoming and outgoing data - similar to how, in
BigTable, incoming data goes into memTables and
is stored persistently in SSTables; again, there
is some log structuring going on there. Here,
blobs are used for the incoming and outgoing
data, and the queues carry the messages - each
message essentially tells you the kind of work
that has to happen. So the queues drive the
overall workload processing corresponding to the
blobs, and any intermediate data is stored in
tables or blobs.
They say that, just as PNUTS was using triggers,
they make content publicly searchable within a
very short time of its being published. For
example, they claim that within 15 seconds of
somebody posting something on Facebook or
Twitter, it will show up in their search engine,
Bing. The only way they can do this is by having
somebody push the updates to them as well: they
subscribe to certain APIs that Facebook or
Twitter makes available, so that as soon as
anyone posts something, they get the data, push
it through their own system, and make it
available to the Bing search system, which
re-indexes it so that end users can pick it up.
So this is another thing they talk about.
The Windows Azure system says it provides
"strong consistency", but if you look at it
closely, it is basically the same as what others
provide. For example, what PNUTS or BigTable
gives you is basically consistency within a
Tablet, or within a Data Center or a rack. The
unit Windows Azure uses is called a "stamp" -
something that holds all of a body of data, say
a few tens of terabytes, in one single place
rather than spread across multiple racks.
So they provide strong consistency within a
stamp; across stamps they usually don't. They
say that most of the time, if you do intelligent
partitioning and all the upper-layer protocols
are properly set up, it turns out that almost
all the updates happen within a stamp and you
get strong consistency - that is the claim they
make.
Again, just as in the previous system, within a
stamp (a Data Center or a set of racks) they
usually apply updates synchronously, and across
stamps they apply them asynchronously. As we
know, asynchrony comes into the picture whenever
the CAP theorem comes into the picture; as long
as you apply updates synchronously, there is no
serious issue - that's why they can say they
provide strong consistency. Because it's a cloud
storage system, they also provide a
user-accessible, global, and scalable namespace
and storage. This is in contrast to other
systems, which don't have external users using
the system; here they have to provide a global
and scalable namespace.
I'm not going into too much detail - each of
these systems is quite complex and a
considerable amount of description is available;
for example, Windows Azure is described in some
detail in a recent SOSP 2011 paper, which you
can look up.
If you look at the previous systems like GFS,
PNUTS, etc., they have the model of one
particular entity, like the Master, which keeps
track of where the tablets are, or where the
various 64 MB chunks are kept, and manages all
those things. There's another model for storage
called consistent hashing, in which we try to
avoid having a Master in the picture: you want
to directly take a name, hash it, and figure out
how to get the data. Something similar is also
used in other systems - for example, in a
parallel file system called GPFS, developed by
IBM.
Basically, the idea is that in a parallel file
system, if you want to parallelize all the
activities, you have to parallelize the lookup
part as well. The pathname can be quite long,
and if resolving it becomes sequential, your
throughput can be impacted. Going by Amdahl's
law, if your pathname lookup is serialized, your
chances of being able to scale become that much
smaller. So they had to incorporate the ability
to do name lookup with as little sequentiality
as possible, and the way to do that is by
hashing.
Similarly, another file system called Ceph does
something comparable - it may not use consistent
hashing exactly, but it uses a hashing technique
that takes a name, maps it to a hash value, and
uses that value for lookup. The setting here is
highly scalable storage: people keep adding more
and more storage, and sometimes storage
disappears because of failures, so you want a
way to add storage without having to move things
around for load balancing. Notice that in the
case of GFS and similar models, there is an
explicit load balancer which moves things around
in the background - an issue, since it can cost
quite a bit in terms of time and energy.
So here the idea is to find a way to be
resilient to additions and deletions - that's
the basic idea. The basic model is the
following: you have various objects in the
system and various devices storing those
objects; you hash both objects and devices using
the same hash function, mapping each object to a
point on the edge of a circle. Basically, you
have a circle on which all these things reside,
mapped to various points - you can describe each
as a point or as a specific angle. You also map
each device (in S3, for example, you have the
notion of storage buckets) onto a series of
points around the circle. An object is then
stored by selecting the closest mapped device on
the circle.
Therefore, each device holds the objects mapped
to the angles between its point and the next
smallest angle, so that is where you search. For
each device, some data structures are kept for
doing the lookup. The basic intuition in the
model is: if a device is added or removed, only
nearby objects are remapped; you don't have to
move anything else, and you avoid depending on a
Master as in GFS. Systems of this kind have been
used by major companies - Amazon and Facebook,
for example, and in systems like PNUTS - and
storage built on this model is also offered to
their users as an option.
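A small illustrative consistent-hashing ring along these lines (the Ring class and ring_hash names are assumptions for this sketch; real systems typically map each device to many points - virtual nodes - for better balance):

```python
import hashlib
from bisect import bisect, insort

def ring_hash(name, space=2**32):
    # Position on the circle, derived from a stable hash of the name.
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % space

class Ring:
    def __init__(self):
        self.points = []    # sorted device positions on the circle
        self.devices = {}   # position -> device name

    def add_device(self, name):
        p = ring_hash(name)
        insort(self.points, p)
        self.devices[p] = name

    def remove_device(self, name):
        p = ring_hash(name)
        self.points.remove(p)
        del self.devices[p]

    def locate(self, obj):
        # First device clockwise from the object's point (wrapping around).
        i = bisect(self.points, ring_hash(obj)) % len(self.points)
        return self.devices[self.points[i]]

ring = Ring()
for d in ("dev-a", "dev-b", "dev-c"):
    ring.add_device(d)
owner = ring.locate("photo-123.jpg")
ring.remove_device("dev-b")   # only objects that were on dev-b move
```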
So, again, let's look at hashing in some more
detail. The technique uses what are called
Distributed Hash Tables (DHTs). It goes as
follows: we are trying to map keys to nodes. A
node with ID i(x) owns all keys k(m) for which
i(x) is the closest ID, where some nearness
measure tells you how close a key k(m) is to a
particular node ID i(x).
What you do is hash both keys and node IDs, and
then you can figure out the closest node to
search, with closeness given by some measure.
For example, if you want to store a file with
filename fn and its data in a DHT, you first
hash the filename with some hash function. The
one people often use is SHA-1, because it has
very good collision-resistance properties (even
if one small bit changes, the name maps to
something quite different - an avalanche effect
built into the design). So to store the file,
you compute the hash k, say put(k, data), and
send that to any node in the DHT - you can send
it anywhere. That node in turn forwards it to
other nodes through an overlay network which
connects them, until it reaches the single node
responsible for key k as specified by the
keyspace partitioning. Essentially, things are
forwarded until they reach the right place, and
that's where the file is stored. Then, if you
want to retrieve it, you send get(SHA-1(fn)) to
any node; it is again forwarded, picked up from
wherever it is stored, and returned to the
client through the overlay network. This is
roughly the DHT model.
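And a toy end-to-end sketch of that put/get flow, with the hop-by-hop overlay routing collapsed into a single forwarding step; owner_of and DHTNode are illustrative names, not any real DHT's API:

```python
import hashlib

def owner_of(filename, node_names):
    # SHA-1 the filename to a point on the circle; the responsible node
    # is the first node clockwise from that point (wrapping around).
    h = lambda s: int(hashlib.sha1(s.encode()).hexdigest(), 16) % 2**32
    key = h(filename)
    ring = sorted(node_names, key=h)
    return next((n for n in ring if h(n) >= key), ring[0])

class DHTNode:
    def __init__(self, name, nodes):
        self.name, self.nodes, self.store = name, nodes, {}

    def put(self, filename, data):
        owner = owner_of(filename, self.nodes)
        if owner == self.name:
            self.store[filename] = data            # responsible: store it
        else:
            self.nodes[owner].put(filename, data)  # forward toward owner

    def get(self, filename):
        owner = owner_of(filename, self.nodes)
        if owner == self.name:
            return self.store.get(filename)
        return self.nodes[owner].get(filename)     # forward toward owner

nodes = {}
for name in ("n1", "n2", "n3"):
    nodes[name] = DHTNode(name, nodes)

nodes["n1"].put("report.txt", b"contents")   # may be sent to any node
assert nodes["n3"].get("report.txt") == b"contents"
```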
As I mentioned, this is what major companies use
for storing large amounts of data, so that you
don't have to keep moving things around when
something is deleted or added. Here, too, there
are hotspots and other such issues, but we will
not discuss them here.
Now let's look at an example system that uses a
DHT - Cassandra. Basically, it's similar to
BigTable except that it uses a DHT kind of
model. It is also a distributed
multi-dimensional map indexed by a key; as in
the other models, the key is a byte string and
the value is a highly structured object (with
columns, etc.), just like in BigTable. Just like
BigTable, Cassandra also guarantees that any
operation on a single row key is atomic per
replica, no matter how many columns are being
read or written. And it is also similar to
BigTable in terms of column families, etc.
Data is partitioned across the cluster using
consistent hashing, but with a small tweak -
they use an order-preserving hash function. The
API is pretty simple, with three methods -
insert, get, delete - by which you interact with
the system. I'm not going into too much detail,
since you can find plenty of information on each
of the systems we have seen so far in the
published literature. Basically: insert applies
a row mutation - a change to a particular row,
in a particular table, under a particular key;
get, given a table, a key, and a column name,
returns that column; and delete removes data.
These three methods are the only way you
interact with the system, so I will stop with
respect to details of this kind.
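A hedged sketch of that three-method interface; the method shapes follow the description above, while the in-memory TinyStore implementation is purely illustrative:

```python
from collections import defaultdict

class TinyStore:
    def __init__(self):
        # table -> row key -> column name -> value
        self.tables = defaultdict(lambda: defaultdict(dict))

    def insert(self, table, key, column, value):
        self.tables[table][key][column] = value   # a single row mutation

    def get(self, table, key, column):
        return self.tables[table][key].get(column)

    def delete(self, table, key, column):
        self.tables[table][key].pop(column, None)

store = TinyStore()
store.insert("users", "row42", "name:first", "Ada")
assert store.get("users", "row42", "name:first") == "Ada"
store.delete("users", "row42", "name:first")
```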
Let's just review what we've seen so far. We
looked at GFS, Yahoo's PNUTS, Windows Azure, and
Cassandra. Basically, the understanding you can
take away from all these systems is that they
are trying to provide a model of consistency
that remains manageable with respect to
throughput and the high rates of transactions
(or operations) that must be supported. Let's
also quickly summarize some other kinds of
models - you can have either Server-side
Consistency Models or Client-side Consistency
Models.
Client-side models describe what the client
sees; Server-side models describe how the system
manages updates across its replicas. One
concerns what a server has to do, the other what
the client finally observes. On the Server-side,
in summary, we'll see the following.
Let's say:
N = the number of nodes that store replicas of
the data
W = the number of replicas that must acknowledge
receipt of an update before the update completes
(e.g., with 5 replicas, do you wait for all 5
acks before declaring the write finished, or are
you willing to declare it complete after only
3?)
R = the number of replicas contacted when a data
object is accessed through a read operation
Since in these kinds of systems some replicas
may not be current, it might be your bad luck
that whichever replica you contact is out of
date. You can control this with R. For example,
if you set R = N, you contact all the nodes, so
you can easily see which of them are stale
compared to the others, and then decide what to
do.
On the other hand, if R = 1 or very small,
there's a high chance you'll see a stale
version. Now, if N is the number of replicas and
you call W the write quorum and R the read
quorum, then whenever W + R > N there is some
overlap between the write set and the read set,
and because of this you can guarantee strong
consistency. This is the same trick that Paxos
uses: it ensures that the quorums of any two
ballots always share somebody. That's how Paxos
can say that if something has been committed in
one ballot but is not yet known to everybody,
there is still a way for the information to flow
from one of the members of the earlier ballot's
quorum (where it got committed) to the next
ballot (when it gets started). Therefore,
whatever decision was committed in the first
ballot becomes known to all later ballots, and
that's why you get strong consistency there.
Here also, it is the same story.
If instead W + R <= N, the read set and write
set can be disjoint; the replicas you read might
hold different versions, and you may be served
stale values.
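A small worked check of the quorum rule just described:

```python
# With N replicas, any write quorum W and read quorum R are guaranteed
# to intersect in at least one replica exactly when W + R > N, so every
# read then touches at least one current copy.
def quorums_overlap(n, w, r):
    return w + r > n

# N = 5: W = 3, R = 3 -> 6 > 5, overlap guaranteed (strong reads).
assert quorums_overlap(5, 3, 3)
# N = 5: W = 2, R = 2 -> 4 <= 5, the sets can be disjoint (stale reads).
assert not quorums_overlap(5, 2, 2)
# The two extremes discussed below:
assert quorums_overlap(5, 5, 1)   # W = N, R = 1: slow writes, fast reads
assert quorums_overlap(5, 1, 5)   # W = 1, R = N: fast writes, slow reads
```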
Now, finally, what happens if R = 1 and W = N?
You are saying reads need contact only one
replica, as long as you make sure every replica
has been written before a write is declared a
success. Writes will therefore be slow, but
reads will be fast, because a reader doesn't
have to contact multiple nodes. Whereas if W = 1
and R = N, you are saying that readers take the
trouble of contacting all the replicas and
deciding what to do if they are out of whack;
they might have to execute some protocol to
synchronize them and make them consistent.
The first case, R = 1 and W = N, is the typical
situation in web-type applications. That's why
you'll see that in GFS, any time anything is
written, you have to get acks from all the
secondaries; only then does the primary proceed,
otherwise it won't. This is what most
large-scale web systems follow. The other case,
W = 1 and R = N, is not very common: I'm not
aware of any major application that requires
very fast writes at the cost of penalizing
readers. At least in web-type systems, you don't
find this much on the server side.
So let's summarize. We saw what happens on the
Server-side; we'll now look at the Client-side.
With strong consistency, after an update
completes, any subsequent access returns the
updated value. In the weak case, it may return a
stale value. In the special case of the Eventual
consistency model, the storage system guarantees
that if no new updates are made to the object,
eventually all accesses will return the last
updated value. The classical example is DNS - it
holds information about name mappings and will
eventually converge to the right values, but
there may be intermediate periods when it has
not yet converged and gives out stale
information.
There are various variations. Causal consistency
(CC) is similar to the causal models we talked
about when discussing Group Communication
Systems. The same idea applies here - there is a
causal reason why certain updates must happen
before certain other updates, and that order
needs to be honored. For example, you may keep
some information in tables and have a
requirement that certain people must not see it
once I make certain types of updates. Then it
should be the case that removing those people's
permission to look at the data happens before
the new updates become visible. I want that
guarantee; if it is not given, I cannot use the
system.
So in some cases, causal consistency is
required. Read-your-writes consistency (RyWC)
means that you can cache your data, keep
updating it, and any time you read, you get your
own copy. But this is not synchronized with the
rest of the system - in the computer
architecture world, the analogous notion is
processor consistency (in a processor, you have
a write buffer that you keep writing to;
whenever you read, it looks up the write buffer
and gives you the most recent version, but that
may not be consistent with other parties who
have also cached the data and are writing to
it).
Another model is session consistency, which is
closer to what many file systems provide. You
open a session, and until it closes you are
guaranteed that reads and writes happen the way
you expect; only when you explicitly close the
session are your updates propagated to everybody
else's sessions. There are also simpler models -
Monotonic Read Consistency (MRC) and Monotonic
Write Consistency (MWC). In MRC, if you read
multiple times, successive reads never go
backwards. Basically, what it means is: if I
read something and get the value five, then,
with no intervening write, reading it again
cannot return an earlier value.
It can happen that this is violated in
Internet-scale systems - for example, you read a
cricket score as 350 once, but the next time it
shows 320 again. That's because you are getting
updates over multiple different paths, which do
not preserve Monotonic Read Consistency. In MWC,
if you write something - say "A" - and write "B"
later, the value of "A" is seen by everybody
before the value of "B" goes around.
Essentially, you are ensuring some kind of
barrier between writes: you first write "A"
completely, and only then write "B". Again, if a
client connects only to a single server, certain
things become trivial - for example, RyWC and
MRC become very simple.
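As a small illustration, here is a client-side wrapper that enforces monotonic reads by remembering the highest version it has seen; MonotonicReader is a hypothetical name, and a real client would retry another replica rather than fail:

```python
class MonotonicReader:
    def __init__(self):
        self.last_seen = -1   # highest version this client has observed

    def read(self, replica_version, value):
        # Reject any reply older than what we have already seen.
        if replica_version < self.last_seen:
            raise RuntimeError("stale replica; retry another one")
        self.last_seen = replica_version
        return value

client = MonotonicReader()
client.read(7, 350)     # score at version 7
# client.read(5, 320)   # would raise: older than what we already saw
```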
I'd like to conclude with a slightly different
kind of model. We have been treating consistency
management as the critical concern, but there
are other kinds of solutions where a slightly
different set of things is optimized for
scalability. Facebook has a photo storage system
they call Haystack; for some time they were
using NFS, but they found that reading a single
photo required excessive disk operations.
Basically, it takes one or more disk operations
to translate a filename to an inode number (the
directory lookup), then one to read the inode,
and then one to read the file itself - three
steps are required.
So the basic idea here is to see if you can
avoid multiple disk accesses. One approach is to
somehow reduce the amount of metadata: if you
reduce it enough, all metadata can be kept in
main memory itself and you don't have to go to
disk for it. Here, the problem was that metadata
lookups were going to disk. GFS did something
similar, but with a completely redesigned file
system; what we are talking about here is
keeping mostly the same type of file system -
not too different - while removing the things
that are not useful.
For example, in their system you don't need
“rwx” permissions, since Facebook is not
providing a shared file system to users, only a
photo service application. So the “rwx”
permission information can go. Typical inodes
are around 128 to 512 bytes, which is too big;
if you strip out all the unnecessary fields,
your inodes become smaller - sufficiently small
that the metadata for billions of photos can now
be kept in memory itself. GFS reduces the amount
of metadata by making very big 64 MB chunks;
here, instead of going for very big chunks, you
reduce the size of the metadata itself. Hence,
if you carefully study the design of your system
and what you are using it for, you can eliminate
lots of unnecessary things. They say that by
attending to all such things, they achieve high
throughput, low latency, and at most one disk
operation per read. Basically, you need to
scale, and the way to scale is by keeping all
metadata in memory - which you can't do if the
inode is very big - so you have to look at your
application, see what is really critical, throw
out the useless stuff, and keep the rest in
memory.
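A hedged sketch of that idea: keep one compact in-memory record per photo so that a read costs exactly one disk access to a large append-only volume file. All names and fields here are illustrative, not Haystack's actual layout:

```python
INDEX = {}   # photo id -> (volume path, offset, length), held in RAM

def store_photo(volume_path, photo_id, data):
    # Append the photo bytes to a large volume file and remember where.
    with open(volume_path, "ab") as vol:
        offset = vol.tell()
        vol.write(data)
    INDEX[photo_id] = (volume_path, offset, len(data))

def read_photo(photo_id):
    path, offset, length = INDEX[photo_id]   # metadata lookup: memory only
    with open(path, "rb") as vol:            # exactly one disk access
        vol.seek(offset)
        return vol.read(length)

store_photo("volume_0.dat", 42, b"...jpeg bytes...")
assert read_photo(42) == b"...jpeg bytes..."
```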
The question now is: why does caching not work?
People often say that if you want one disk
operation per read, caching will do the job. Of
course, as we noticed, one issue is that inodes
have become too big. But there is another major
reason why it does not work, and it is the
following. Facebook and others use content
distribution networks (CDNs), which serve hot
photos - recently uploaded and popular ones.
Systems like Akamai are used by companies like
Facebook to distribute that content so that
geographically distributed users can pick up
photos from the closest server. But the problem
with these social networking sites is that they
also see a large number of requests for less
popular and often older content - this is called
the long tail of requests: unpopular items that
are still accessed now and then.
These requests account for a significant amount
of Facebook traffic, and they will miss in the
CDN. By reducing the amount of metadata,
Facebook can make most of these lookups quite
fast and serve them itself, while the hot
content is handled through the Content
Distribution Network. They say they do quite
well and achieve about 4x the reads per second
of an equivalent terabyte on a NAS appliance.
To summarize what we have looked at regarding
scalability: scalability is a critical issue for
web-scale systems, and you really have to look
systematically into cross-layer designs. These
designs are complicated because you are throwing
away the time-tested layering principles and
doing something new, which typically demands
much more careful attention to detail - that is
one part. Another part is how consistency has
been taken up as one area for cross-layer
optimization, and another thing we looked at was
how to design inodes. A further issue is
handling failures: since distribution of data is
a necessity, coordination of updates is also
critical. This is an important issue which we'll
start looking at from the next class.
