As I've said, fast data is the main topic of the day.
We are very glad.
Jörg, you can now come on stage.
Jörg from Mesosphere is here to join us and
give us a keynote speech about elastic data
pipelines using...
I guess something like the SMACK stack.
Yeah, could well be, good guess.
I hope you enjoy the keynote.
After the keynote we have a short break, and if you have any questions you can shoot them at us, wherever you can.
Please enjoy and a warm welcome to Jörg.
[audience applauds]
So I just have to switch laptops for a second...
So as already mentioned we're going to talk
about elastic data pipelines here and we're
going to hopefully figure out what that means
throughout this talk.
In my opinion the entire field of big data,
fast data, is quite overloaded with a lot
of buzzwords.
So I’m actually going to play a little buzzword
bingo throughout this talk.
Let's see how many buzzwords we can check
off.
Just briefly about myself: as mentioned, I'm a Mesos developer at Mesosphere in Hamburg, Germany.
So I actually work on Apache Mesos, which is a big C++ project that we're going to hear a little more about throughout this talk.
And that's what I'm doing on a normal day,
when I don't have the pleasure of being here
and telling you about what I do on a normal
day.
So let's dive in, and let's figure out what
this talk is about.
So in the beginning there was Batch processing.
So that's where it all started.
Hadoop, MapReduce, it was crunching a lot
of data and for a lot of people that was kind
of cool.
It worked nicely, but then data started to move faster... and my screen doesn't really like that resolution.
I hope we're not going to see that on other
slides.
So data started to move faster, and that actually presents us with a number of challenges we are going to see here.
So what actually is fast data?
So before, with big data, we usually had big data silos where we were collecting data.
Then at some point in time, months later, maybe even years later, that data would get looked at, and from what I saw a lot of companies never really touched all the data in those silos, they were just collecting.
So that was the era of this big data Hadoop number crunching.
With fast data it is slightly different.
With fast data we have a lot of incoming data
and we want to process close to real time.
We can discuss a little in more detail what
real time means for us.
For example: if I'm Twitter, I have like ten
thousand tweets every minute.
If I'm Facebook I have 600,000 shares of some
items every minute.
E-mails they're like 200 million every minute.
And if I'm working at Google on YouTube, I
have like 48 hours of new videos being uploaded
every minute, so we can't just simply put
that in a data silo and then hope to work
with that in like a month or so.
We really have to process now or pretty close
to now.
Another big factor for fast data is this entire buzzword IoT, the Internet of Things.
That comes from the fact that a lot of planes, and a lot of other things, now come equipped with sensors, so we actually end up with a lot of new data.
So for example in the A380, each of those
engines is collecting about 10 gigabytes of
data every minute.
So there's actually throughout the flight-
if you're flying across the Atlantic there's
a lot of data collected there.
Also in the health industry, we have a lot of sensors and a lot of data collection, and we are hoping that we can utilize this data to basically predict what might happen.
So we can predict a heart attack before it even happens and assist that person up front.
And therefore we can't just crunch the data a month later, we really need a fast response here, coming back to fast data.
Also, the stock market.
Stock market is basically also a really big
data source where we have millions of transactions
flowing in and we actually also need to do
some processing here to figure out how the
prices are reacting.
Or if I am trying to exploit that, I also
want to react quickly to market changes.
Our modern cities also come with a lot of sensors.
Be it on the street, be it smart traffic lights which adapt to traffic patterns, be it cars coming up which warn each other.
So right now we're working with one big German car manufacturer on a project collecting data from cars talking to each other and exchanging data.
Also one big producer of fast data is actually
all of you, so many of you, you have like
an Apple watch, you have like an Android watch,
you have- most of us have a smartphone.
Also, modern houses come equipped with a lot of sensors, so there are fridges that actually tell you: 'Hey, you have to buy cheese when you're coming home, because your wife just ate the last bit.'
Or something like that.
So there are actually a lot of sensors.
So we're seeing this pattern: there is actually a lot of data we can collect, and the challenge is not just to collect the data, but to make some use of it.
It was really nice on the introduction slide
by you guys, where it said ‘smart data’.
And I believe this is a big trend where we're
trying to move from this big data, collecting
everything, we're now trying to move to smart
data.
Which basically means: utilize and actually extract value out of the part of the data that is valuable for whatever company or whatever purpose I have.
So already these are the buzzwords I want
to check off during this talk so let's see.
We already got batch, we got Hadoop and we already covered IoT, so it seems like we're on a pretty good track concerning the buzzwords here.
Next, I would like to talk a little bit about
the toolbox we need to deal with those large
amounts of data.
So as mentioned there's the SMACK stack, which we're also going to talk about.
But overall, what tools are there, because there's a large number of tools, and which of them can I use to actually deal with this large amount of fast data coming in here?
So let's see, what do we actually all need?
We have the sensors as just seen, so we get data from there, that's part one.
We actually want to store the data somewhere in the end.
Maybe we want to store the raw sensor data.
Maybe we want to store some aggregated results or some events later on, but what we usually need in such a kind of toolbox is something to store data.
We also need to do something with that data,
so we need something to do data processing
and actually extract value out of this data.
This might be some batch processing, for example Hadoop, or Spark as another more batch-like processing tool, but it could also be more streaming-oriented tools, for example Apache Flink, or it could also be Spark Streaming.
But in the end we need something to crunch
this data and make use out of it.
We also need some kind of end-user application which has some interest in this crunched data, because otherwise I can crunch as much data as I like, but I actually need some kind of application which is going to utilize it.
And connecting all of this, these are like
a lot of tools, but I need something in the
middle which can actually deal with all
the incoming data or even like intermediate
data inside this application.
And therefore what I usually need is some
kind of message queue.
So a message queue is really just like a FIFO queue; whatever goes in first comes out first at the end.
And thereby it serves as a buffer between different applications.
So it could for example be that the output of the data processing is put into the message queue, and then the storage layer is going to pull the results from the message queue.
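Just to make that buffering idea concrete, here is a minimal sketch in Scala using a plain in-process blocking queue; the event names are made up, and a real setup would of course use Kafka or similar rather than an in-memory queue:

```scala
import java.util.concurrent.LinkedBlockingQueue

object FifoBufferSketch extends App {
  // Whatever goes in first comes out first; the queue buffers between
  // a producer and a consumer that run at different speeds.
  val queue = new LinkedBlockingQueue[String]()

  val producer = new Thread(() => (1 to 5).foreach(i => queue.put(s"event-$i")))
  val consumer = new Thread(() => (1 to 5).foreach(_ => println(s"consumed ${queue.take()}")))

  producer.start(); consumer.start()
  producer.join(); consumer.join()
}
```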
What we also need is some kind of infrastructure to run this number of apps on top of.
It's not as simple as before, where I simply had one virtual machine or maybe even one physical machine; I actually need a large distributed infrastructure.
And this comes from the fact that fast data is still kind of like big data in most scenarios.
So actually I end up with a number of servers
which I need to utilize for this scenario
we are seeing here.
And that actually, to those of you who have heard of the SMACK stack...
Maybe just a short show of hands, who knows the SMACK stack already?
Cool, some, that's pretty good.
That should sound pretty familiar, because the SMACK stack is one implementation of those requirements here.
And the SMACK stack actually consists of several
elements.
First one is Spark.
Spark is a distributed large-scale data processor,
covering the data processing requirements
we are having.
Mesos, which is a tool I'm working with on
my normal working days, is a cluster resource
manager, so you can view it as an abstraction
layer for your individual cluster nodes.
So you can actually develop your applications
independently, without having to think of individual
nodes running in your cluster, without thinking
of what happens if I have to add nodes, or
what if there are failures in my cluster.
Akka is actually a toolkit allowing you to easily write message-driven applications on top of the JVM.
So this is something where, if you're generating for example events, you need some kind of component which reacts to those events and is actually going to trigger something.
So that's what Akka is, a really nice toolkit to develop such kinds of event-driven applications.
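As a rough idea of what that looks like, here is a minimal Akka sketch in Scala with classic actors; the sensor reading, the threshold and the names are made up for illustration:

```scala
import akka.actor.{Actor, ActorSystem, Props}

case class SensorReading(sensorId: String, value: Double)

// Reacts to incoming events and triggers an action for the interesting ones.
class AlertingActor extends Actor {
  def receive: Receive = {
    case SensorReading(id, v) if v > 100.0 => println(s"ALERT: sensor $id reported $v")
    case _: SensorReading                  => // normal reading, nothing to do
  }
}

object AkkaSketch extends App {
  val system  = ActorSystem("fast-data")
  val alerter = system.actorOf(Props[AlertingActor], "alerter")
  alerter ! SensorReading("engine-1", 140.0)   // message-driven: fire and forget
  Thread.sleep(500)                            // give the actor a moment, then shut down
  system.terminate()
}
```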
Cassandra.
Cassandra is our storage layer; it's a distributed, highly available database.
I put there...
I don't really like to say database, so maybe
just from my background: I did my PhD on distributed
databases and there are a lot of different
notions of databases.
So depending on what you expect from a database, Cassandra might be a database or Cassandra might not be a database.
But if it's about storing your data and then
running analytics on top, it's a pretty good
tool to have in your toolbox.
And the last part connecting all of those
individual components in the SMACK stack is
Kafka.
So Kafka is a distributed highly available
messaging system which we are also going to
see in some more detail in this talk.
And altogether this forms the SMACK stack,
which is now used by a number of companies
to deal with these requirements of partially
both big data, but also fast data using the
message queuing provided by Kafka.
Let's talk a little bit about message queues, because for all those components there is not just the one we actually use in the SMACK stack; as said, the SMACK stack is just one implementation of this toolbox you could have, and it really varies what your requirements are, what you're actually trying to build in the end.
So for message queues we actually have a number
of different implementations we can use.
First there's Kafka, there's RabbitMQ, there's
Disque, there's Fluentd, there's Logstash,
if you're more trying to stay in just providing
logging results those are pretty cool, there's
Akka streams and then if you're trying to
constrain yourselves to cloud environments,
or if you already made the architectural decision
to run everything on AWS anyhow, there are
actually also services which provide that to
you on both Google infrastructure and also
Amazon infrastructure.
And edgeR is actually coming up with something
similar right now, so on basically all the
big public cloud providers you can get something
native, some native message queueing.
You can also check queues.io if you want to
read in more detail.
And as you can see, as said, there is a growing number of different options you can consider when picking your message queue tool.
Kafka.
Kafka basically has this nice scalability.
So with Kafka I can easily scale the brokers we have here in the middle, and that gives me really nice scalability, so I can independently add more producers and I can also add more consumers on top, while scaling out my cluster.
It actually uses ZooKeeper, which for some people is a good thing and for other people a bad thing.
In my opinion ZooKeeper is kind of production proven, but at very large scale it can actually become a bottleneck.
So from an operational point of view it's nice and production tested, for example by LinkedIn, who are running pretty large Kafka clusters.
Also for example Netflix, they also have quite large Kafka clusters running.
A typical use case for Kafka is the decoupling
between producers and consumers.
So imagine you have a scenario... let's just say you have one producer and one consumer.
You just want to store your data in Cassandra, so it's like a single consumer.
You could write it directly into Cassandra from the producer, but that is actually going to run into two problems.
First of all scalability: what happens if you're actually adding more Cassandra instances or independent Cassandra services?
And secondly, what happens if there are failures?
So you usually try to avoid coding against a single endpoint in these kinds of coupled infrastructures.
So a message queue is a really nice way of decoupling them.
You basically put your results into the message queue and then they can be consumed by Hadoop, or Cassandra can pick them out of the message queue in the end.
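To give a feel for the producer side of that decoupling, here is a minimal sketch using the Kafka Java client from Scala; the broker address, topic name and payload are made up for illustration, and the consumer (say, a Cassandra writer) would read from the same topic completely independently:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SensorProducerSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker-1:9092")   // hypothetical broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // The producer only knows the topic, not who consumes it; Cassandra writers,
  // Hadoop jobs or new consumers can be added later without touching this code.
  producer.send(new ProducerRecord[String, String]("sensor-readings", "engine-1", "10.3"))
  producer.close()
}
```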
Fluentd.
It comes basically from the other side, so a lot of its use cases are mostly focused on logs.
If you have a large cluster with a lot of different applications running, you have a lot of different services creating logs.
And Fluentd is a really nice tool for gathering all those logs, similar to Kafka, and then providing them in a filtered way, so you can filter out uninteresting messages directly in this middleware and then provide them to different consumers again.
That could be for example alerting services.
So for example, if you detect something in
your logs which tells you that something is
severely wrong, you want to alert some ops
person in your organization.
Certain parts can be routed to analytics solutions, like Hadoop for example, or it can simply be that you need to archive all your logs for regulatory purposes.
Then you can just route everything to S3 and simply filter the interesting stuff you need immediately out of there; that's going to be routed or duplicated [...] delivery guarantees for messages.
So there are three different kinds of guarantees you can have.
There is: at most once, which basically means
your message might be delivered, but you can
always be sure it's not going to be delivered
twice.
But it might not be delivered at all.
So the second guarantee is: at least once.
In that case you can be sure your message
will end up at the consumer no matter what
happens in your cluster, but it might actually
happen that it comes there twice, three times,
four times.
So you need to deal with potential duplicates
at the consumer end of your stack.
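One common way to deal with that, sketched here under the assumption that every message carries a unique id (the id field and the names are made up for illustration), is to make the consumer idempotent and simply skip ids it has already seen:

```scala
import scala.collection.mutable

case class Message(id: String, payload: String)

class DeduplicatingConsumer {
  // In a real system this set would live in a persistent store that survives restarts.
  private val seenIds = mutable.Set.empty[String]

  def handle(msg: Message): Unit =
    if (seenIds.add(msg.id)) process(msg)   // add returns false if the id was already present
    // else: a duplicate delivery from the at-least-once queue, safely ignored

  private def process(msg: Message): Unit =
    println(s"processing ${msg.id}: ${msg.payload}")
}
```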
What most people actually would like to have is the exactly-once guarantee.
Exactly once basically means a message is sent once and it's only going to be consumed once out of the messaging system.
That's actually really, really hard.
A lot of the messaging systems, if you look
at them, are claiming that they do it.
But if you actually check it, they only do
it in case there are no failures or they have
pretty strict limitations when they actually
can achieve this exactly once guarantee.
So you should always be careful if you read about an exactly-once delivery guarantee.
Of course, usually there's fine print warning you when this guarantee does not hold up.
Why is this so hard?
This is actually one of my favorite laws in distributed systems.
It's basically Murphy's Law adapted to distributed systems.
It says: anything that can go wrong will go wrong, partially.
In distributed systems it's really nice if you have a failure you can actually detect: maybe a node is really gone forever, or the network is down, if you can just detect that.
But it becomes really hard if there are partial failures.
So your node might just not be responding for ten seconds at a time, then respond again, then not respond again because it's under high load, for example.
Or some network packets might be dropped while others are arriving, which makes it really hard to detect that something is wrong in your cluster or with your application.
So failure detection in distributed systems is really hard, and this is why, for instance, an exactly-once guarantee is so difficult: it would need to detect that there has actually been a failure and the message has been dropped.
But in cases of network partitions, or partial network partitions, it's really hard, in my opinion impossible, to actually figure that out while still maintaining availability.
So this is why 'exactly once' is really hard to achieve in distributed systems.
Next topic: stream processing.
So we now are talking about actual data
stream processing, so we're getting some stream
of data into our systems and now we actually
want to extract valuable information out of it.
And over the last years and especially last
year, there have been like a large number
of projects spinning up because for more and
more companies this is becoming really important.
So probably the oldest one is Apache Storm, which has been used for over five years as far as I know, and it's still based on the old MapReduce architecture, but it does a pretty good and reliable job; we're going to see some of its downsides on a later slide.
Next most commonly used tool for stream processing
is actually Spark streaming, which has also
been around for a while, is pretty mature
and it nicely fits into this entire ecosystem
of Spark.
So especially if you're using Spark for other
stuff, Spark might be a good choice there.
Then we have a lot of other projects like Apache Samza, which is a little older, and Apache Apex, an open source project which is rather new; it just became a top-level project in April this year.
So for example Flink is a little older and more mature in that respect, but Apex has actually also been used for a long time internally by large companies, so there's also some value in looking into Apache Apex.
For the really non-Apache world there's for example also Concord, and if you're again in a cloud setting, all major cloud providers have their own streaming solutions, like Kinesis or Google Cloud Dataflow, which allow you to basically have this stream processing set up for your cloud environment in particular.
Overall, I personally tend to recommend people to use the open source versions, because of course you don't really want to tie yourself to any of those cloud providers.
They make it really easy to use when you're
for example on AWS, but once you want to move
partially off or entirely off AWS, it becomes
really hard to rewrite those applications
and there's a tight coupling into the rest
of the application code.
Next slide...
I’m offline...
Okay, I hope I will be back online in a second…
And worst case: do we have a network cable
here?
Okay, I'll just continue presenting from here
until I’m reconnected.
So with stream processing there are actually
two big paradigms you can use.
And one is called micro-batching, which is for example utilized by Spark, and micro-batching actually means...
Spark is like a native batch application, and what it does is create really tiny batches of data, and that's what it uses for stream processing.
The advantage is that it's really good if you don't have low-latency requirements, because usually a system can process a batch of data faster than individual tuples.
On the other hand, there's the downside that if you really need low latency, in like the microsecond range, it's usually somewhat slower, because it first needs to create a batch of tuples and then can process this batch of tuples.
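Just to make the micro-batching side tangible, a minimal Spark Streaming sketch could look like this; the socket source on localhost:9999 and the two-second batch interval are made up for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch").setMaster("local[2]")
    // Every 2 seconds the incoming tuples are collected into one small batch,
    // and each batch is processed as a whole: good throughput, but latency is
    // bounded below by the batch interval.
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```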
On the other hand we have native stream processing, so there's for example Flink, Storm, Apex and those guys that actually do native stream processing; they are really processing one tuple at a time, and that way they can actually reach a much lower latency compared to Spark.
On the other hand, if you want to do batch processing, or if you don't care so much about latency but more about the entirety of the data, those systems might actually be a little slower.
One example is Apache Storm.
Let me just move that.
Can you actually see the slide?
I hope it's going to reconnect soon...
Let's see.
So Apache Storm, as mentioned, is the oldest one of those systems [...] especially like scaling and garbage collection.
So what they actually did: they wrote an API-compatible rewrite in C++, which they called Heron.
It's also open source, and if you're looking into Apache Storm, I can actually recommend also having a look at Heron.
It's a little harder to set up, but overall the latency benefits, and also the memory consumption, are actually a good payoff for this effort.
Spark.
Spark is another toolkit.
So Spark is actually way more than just the stream processing, as you can see.
Spark Streaming is just a really small part of this entire infrastructure picture.
It actually started as a Mesos showcase framework, and now it's mostly used for batch processing.
Many of the people who were running MapReduce jobs before have now switched over to Spark, because it's way more flexible: I'm not constrained to these MapReduce paradigms, but I can actually come up with way more flexible patterns which fit many data analytics applications way better.
There are also large machine learning libraries on top, MLlib for example.
There's also Mahout; those of you who still know it, it used to be the machine learning library on top of MapReduce, and they are also recoding some of their jobs or applications to Spark, simply because of the flexibility.
There's also graph processing, so you can see it's a really large ecosystem of which Spark Streaming is just a small part, but it's actually widely used across many companies, despite the earlier-mentioned drawbacks of this micro-batching approach.
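For contrast with the streaming example earlier, the same kind of counting as a plain batch job is just a few chained transformations; the HDFS path here is made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BatchWordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("batch-sketch").setMaster("local[2]"))
    sc.textFile("hdfs:///logs/app.log")   // hypothetical input path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                 // not locked into one map/reduce pass; chain as many steps as needed
      .take(10)
      .foreach(println)
    sc.stop()
  }
}
```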
This actually brings us to the next topic:
data stores.
So data stores are a huge field, so we would probably need at least one talk just about that, and still...
I really like this topic, as said I did my PhD on basically this picture, but for this talk it's actually a little too much, so I would just like to dive into one particular area of data stores, which is time-series data stores.
Are any of you already using a time-series data store, or evaluating one?
One?
Okay.
So time-series data stores nicely fit with fast data for the following reason: fast data is usually, or often, some kind of sensor data.
And sensor data is basically a timestamp plus some kind of value.
If you're using Cassandra or something like MongoDB for that, they often end up creating a lot of storage overhead, while you can really store time series more efficiently and then also provide better analytics on top.
This is why there's this special category of time-series databases, and I just picked a few here, there are more: InfluxDB, OpenTSDB, KairosDB, and actually Prometheus, which is mostly monitoring, but underneath it has a pretty powerful time-series data store as well.
OpenTSDB is actually built on top of HBase.
So again, as with many of the other implementation choices, this has advantages and disadvantages.
On the one hand it relies on a really well-proven infrastructure component, HBase; on the other hand it brings in all the other parts associated with HBase, for example HBase being based on HDFS, which is append-only, so you need to merge data, and for low latency there's again a limit to how low you can get with HBase, and therefore how low you can get with latency with OpenTSDB.
Otherwise it's really nice: it's really easy to store, index and query data and get metrics out of it, and due to the underlying HBase it's really scalable across many nodes.
So it's good if you want to collect very large amounts of data; for example, in this airplane case there are some airplanes which actually have a small OpenTSDB cluster on board, so that's pretty cool if you really want to collect large amounts of data.
Second one is InfluxDB.
InfluxDB was actually designed when they were unhappy with the performance they got out of OpenTSDB, and it has no dependencies, it's entirely rewritten from scratch in Go, and it also has a different query language.
The query language for InfluxDB is SQL-like, so if you have a database background it's actually quite easy to query InfluxDB.
And you can also run InfluxDB in a distributed mode.
It's mostly used in single-node deployments, but it's actually also using the Raft protocol, or their own Go Raft implementation, so you can also run it in distributed mode.
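As a rough illustration of how simple storing a point can be, here is a sketch that writes one measurement to InfluxDB's 1.x HTTP write endpoint using its line protocol; host, database and measurement names are made up:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object InfluxWriteSketch extends App {
  // Line protocol: measurement,tag=value field=value (timestamp defaults to server time)
  val point = "engine_temp,plane=A380-001,engine=2 value=612.4"

  val request = HttpRequest.newBuilder()
    .uri(URI.create("http://influx.example.com:8086/write?db=sensors"))   // hypothetical host and db
    .POST(HttpRequest.BodyPublishers.ofString(point))
    .build()

  val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
  println(s"write returned HTTP ${response.statusCode()}")
}
```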
So, having looked at some tools in this toolbox, overall there is still a large number of challenges.
Just keep in mind we're still talking about distributed systems, and Murphy's Law basically means that there are going to be failures in that system.
That often makes monitoring and debugging issues quite hard, because they are not reproducible, and it's really not trivial to figure out which component failed, or where, say, a hardware resource failed.
Overall, setting up such a distributed environment is pretty hard, and operating it also requires some shift in thinking from the traditional one-box deployment.
And therefore, what many companies are still doing is, for each of those tools they take like five nodes in their cluster, and those five nodes are their Cassandra cluster; they take another five nodes, and those five nodes are their Kafka cluster; they take another five nodes, that's...
You basically see where this pattern is going.
So they statically partition their cluster into different applications.
And as said before, there are failures, and we're going to see that this partitioning is actually quite problematic.
Hopefully on the next slide.
Cool!
Yeah, we got rid of some more buzzwords here
and this brings us to the next topic which
is the datacenter itself.
So the datacenter has changed quite a bit over the last years, and it's not just what the datacenter looks like, but also what the requirements for a datacenter are.
And this mostly has to do with the evolution of applications we are seeing.
With the first cluster picture, it used to be like a mainframe application, so you had your one huge UNIX system for data and transaction processing.
Then came the client-server architecture, so you had one server standing somewhere, and, say, SAP clients would log onto the server and do their work on there.
The next evolution was basically driven by VMware and those virtual machines, where you still potentially have one large server, but you would have a VM for each of the applications you were running on there.
So you didn't need a dedicated server anymore for your web server, you didn't need a dedicated mail server anymore, you just needed a dedicated mail-server VM, which could run somewhere in your infrastructure.
And now we are actually at applications which
are distributed by themselves, so you can't
really constrain them to a single virtual
machine or even to a single box anymore.
So this is basically, in my opinion, the opposite of what we did with virtual machines.
What we did with virtual machines is we took one big server and we chunked it into smaller boxes.
What we have to do with distributed applications is: we're taking one application and we are distributing it across a number of servers.
So this is quite different and therefore we
actually need this new form factor.
Your app, your application is actually running
against the datacenter and not against a single
virtual machine anymore or against a single
server.
This is kind of the vision we at Mesosphere, and for example also people at Google, have when they are talking about this Data Center Operating System.
Let's look at maybe one more concrete example
of how applications changed over time.
So who still remembers this LAMP stack?
Like Linux, Apache, MySQL, PHP Perl on top,
for running your cool web applications.
Quite long ago.
Back then, many of us were already doing Hadoop.
And then as we just talked about, there's
now this new thing called SMACK stack.
And if we are looking at this evolution of
applications we can actually see that in the
first stack everything was running on a single
box.
So Linux was running on a single box, my Apache
web server on a single box and so on.
With Hadoop we actually ended up having the first distributed app running in our cluster, and therefore what most people did was have their dedicated Hadoop cluster where their Hadoop workload was running.
But if we now look at the SMACK stack, we actually have a large number of distributed applications, and we don't want to end up having a specific, statically partitioned cluster for each of them.
So this is basically the trend, that the number
of distributed applications is really increasing
over time, and therefore also our data center
has to take care of it.
So if we look at what it still often looks like when we go to customers, we have this partitioning: basically this part is the Flink cluster, as mentioned before, this part is for Cassandra, and this is bad for several reasons.
The first reason is that resource utilization gets rather bad in those clusters; usually it ends up at around 20%, maybe 30%, even if we're doing some other tricks, but overall my utilization of those resources is rather bad.
And this is for example driven by the different demands.
So for example, Rails being the user-facing part of the cluster, I really want to provision that for the maximum workload, so I always want to be able to serve customers a good site, and this might actually vary over the time of day.
So at night most of us have fewer people visiting our website, and therefore what we usually want to do at night is utilize the Spark part more for running analytics.
With such a kind of partitioning it's actually rather hard to shift resources between clusters, and therefore I usually overprovision by a large chunk of resources, and that then leads to bad resource utilization.
Second problem is what happens if we have
failures in this scenario.
So what happens if in our Rails part there
are some machines failing.
Then basically an operator has to come there
and shift over resources from another part
of the cluster.
So we're already seeing this is maybe not the best model for running all those distributed applications on your cluster.
The second trend which we are seeing is actually
this container trend, run everything in containers.
How many of you are using or evaluating containers
already?
Most, yeah, I would say that's more than half
of the people.
And I must say I personally also really like containers, running them on my laptop; it's so easy to spin up your first Docker image, to run that Docker image on my laptop, even to transfer this Docker image to another server and run it there, basically because I could package my application into this container.
But it actually becomes way harder if I want to start orchestrating multiple containers and actually run that in production, because then I have dependencies, then I have to deal with stuff like failures.
And this is actually one thing about Docker, if we just switch back to this whale: this whale actually consists of two big parts.
One part is the container on top, so Docker has this nice container format which, if you look at it, is actually rather simple, but it's pretty powerful, because you can pack your application and all its dependencies in there.
The second part, which I would now view basically as the whale itself, is the Docker runtime.
The Docker runtime is then actually responsible for running those containers on a given system.
Those are two independent parts, and I personally really like the image part of Docker, because they really made it easy and popular to create those images, but I don't really like the runtime, because I can tell you at least a hundred ways of how to crash a Docker daemon, and if you read the latest blog posts about the 1.12 release of Docker, there are so many people running into production issues with Docker.
So that basically makes container management, or container orchestration, kind of hard: you have to depend on the runtime, and you end up with the second layer of how do I coordinate all those containers, how do they find each other.
If I have my Cassandra container or my MySQL container running somewhere, my application has to be able to talk to it; that's the topic of service discovery, the topic of load balancing.
So there's actually a lot more going on than I as a developer see when I start that first Docker image, which is a pretty cool experience.
So okay, we also checked off ‘container’
from our buzzword list.
And that actually brings me to my favorite part, because this is what I'm working on, which is Mesos.
So, just looking at the timeline of Mesos.
Mesos was initially created as a graduate project of several master's and PhD students at UC Berkeley.
And for those of you who know Spark quite
well, if you see that name list especially
like Andy and Matei, they're also the creators
of Spark.
So Mesos and Spark were co-created there at
UC Berkeley.
We have some pretty cool pictures where they are at an offsite at Ben's parents' house, where they have a whiteboard and they're basically drawing Mesos and the initial Spark ideas there, which is pretty cool.
And then those guys actually gave a tech talk at Twitter.
Those of you who still remember the Twitter Fail Whale, whenever Twitter was down or couldn't scale to the many users, that's actually why Twitter was quite interested in solving this problem, or actually two underlying problems they were having.
The first of them was, again, low resource
utilization across their clusters.
So they had huge clusters but couldn't really
utilize them to serve customers because of
exactly those different problems, basically
the picture we saw before.
They had a number of applications running
there, they always needed to overprovision,
so they actually ended up with pretty low
resource utilization and they were spending
a lot of money on infrastructure.
The second problem they were having was that
it took them a long time to deploy stuff into
production.
So whenever a developer had finished something,
it usually took them three to four weeks to
move it over into production.
So this was also a problem for them because
it simply took them too long to get their
performance improvements into production.
They actually decided to invest in Mesos
and, what I really like, they also invested
into this idea of open source.
So at the end of 2010, Mesos became an Apache
Incubator project and then actually the next
year, it already graduated from there.
This is kind of nice because now anyone can
basically use it.
The last point on this timeline is actually the DC/OS release.
DC/OS, as we're also going to see in a second, is basically another open source toolkit, an opinionated toolkit around Mesos.
We call it DC/OS, which stands for datacenter operating system; it's built around Mesos as the kernel and provides things like monitoring and best-practice deployment around it.
So it actually makes Mesos really easy to use, because when setting up your first Mesos/Marathon cluster it can be kind of hard to make it production ready or HA ready.
So as mentioned, as of today Apache Mesos
is a top level Apache project.
We usually refer to it as cluster resource
scheduler or cluster resource negotiator.
We're going to see in a second where this
name comes from.
It actually scales to tens of thousands of nodes, so many of those companies on the right side have really large deployments; there's for example Twitter, which has huge Mesos clusters.
Any of you... who has an iPhone here?
Who of you is using Siri?
Oh, not many Siri users... okay, some are.
So actually whenever you use Siri you're also using Mesos, because the entire Siri infrastructure is also built on top of Mesos, and that's also a pretty large cluster they are running there, or clusters actually.
And as you can already see, it's really fault tolerant, because in a 10,000-node cluster you're eventually going to hit every, or most, potential failures there can be, and it's therefore also battle-tested on those large clusters.
What I personally also really like about it: if you're developing a new distributed application, as for example some partners of ours like ArangoDB are doing, it's really easy to use Mesos as an SDK to write distributed frameworks.
During my PhD we were actually writing several distributed applications, and usually it's the same pattern.
You start with: how do I deploy something to run on a node, so basically saying 'application, run on that node'.
How do I detect failures, so how do I figure out this node isn't running anymore?
It's basically all those recurring patterns you see when developing distributed systems, and Mesos actually takes a large chunk of those design patterns and makes it really easy for you to write a new distributed application.
And just to have this buzzword checked off as well: it supports Docker.
There are actually two ways of running Docker on top of Mesos.
The first one is basically using the Docker runtime, and then we have something called the unified containerizer; containerizers in Mesos are just isolating tasks from each other, and you can either use the Docker runtime for Docker images to do that, or you can also use the normal Mesos containerizer, which basically understands how to run those Docker images.
As I said, Docker is image plus runtime, and basically the unified containerizer can run most Docker images.
If you have really exotic options like networking, persistent volumes or the more advanced features, it might not work, but for the eighty percent standard use case you can actually run Docker images without running the Docker daemon and the Docker runtime.
Brief dive into the architecture.
I don't actually want to go too deep on this, but the nice part is that there are two parts.
So Mesos is considered a two-level scheduler, and that actually means Mesos itself is just the allocation part on the far right side here, and what it does is offer resources to framework schedulers, and then the framework scheduler can decide what it wants to do with that.
If we are looking here at this example, we have two framework schedulers.
One is the Marathon scheduler; Marathon is basically a native Docker, or let's say container orchestration, framework on top of Mesos.
And Myriad is basically running YARN on Mesos, so if you still need support for your old YARN apps, you can also run YARN on Mesos.
So in this example we have two applications running on top of this Mesos cluster.
And the agents, those are the basic worker nodes in the cluster; they don't talk directly to the schedulers, they are basically just saying: 'Hey Mesos master, I have 10 CPUs here, can you actually do something meaningful with those 10 CPUs?'
And the Mesos master is then doing its part, which is the resource allocation part, and it's going to decide: 'Hmm, which framework should I offer this to?'
So basically which framework, given all the roles, weights and quota you might have, so basically how you configured which share your different applications should receive, so which application's turn it is to receive those resources.
In this example it's like the Marathon scheduler,
and so the Mesos master is basically asking
the Marathon scheduler: ‘Hey Marathon scheduler,
if you want, you can use 10 CPUs.’
Next, the framework can actually decide: ‘Hmm,
I don't actually need any resources, so I'm
just going to say no.’
It can say: ‘Yeah cool, I can use all of
them and basically start a single task or
even multiple tasks on top of those 10 CPUs.’
Or it can also partially accept them, so basically
say: ‘I don't really need 10 CPUs, but 5
would be really cool, so I'm going to start
tasks on those 5 resources here.’
The Mesos master will then actually take care
of the rest and ensure that those are actually
started on the individual worker nodes below.
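Just to illustrate the shape of this offer cycle, here is a toy sketch of the idea (this is not the real Mesos scheduler API; names and numbers are made up): the framework looks at each offer and answers with zero or more tasks.

```scala
// Toy model of two-level scheduling: the master offers resources, the framework
// answers with the tasks it wants to launch on that offer (an empty list = decline).
case class Offer(agentId: String, cpus: Double, memMb: Int)
case class TaskSpec(name: String, cpus: Double, memMb: Int)

trait FrameworkScheduler {
  def resourceOffer(offer: Offer): List[TaskSpec]
}

class MarathonLikeScheduler(pending: List[TaskSpec]) extends FrameworkScheduler {
  // Accept the offer partially: launch whatever pending tasks fit into it.
  def resourceOffer(offer: Offer): List[TaskSpec] = {
    var (cpusLeft, memLeft) = (offer.cpus, offer.memMb)
    pending.filter { t =>
      val fits = t.cpus <= cpusLeft && t.memMb <= memLeft
      if (fits) { cpusLeft -= t.cpus; memLeft -= t.memMb }
      fits
    }
  }
}

object SchedulingSketch extends App {
  val scheduler = new MarathonLikeScheduler(List(TaskSpec("web", 2, 512), TaskSpec("db", 8, 4096)))
  val launched  = scheduler.resourceOffer(Offer("agent-1", cpus = 10, memMb = 2048))
  println(launched)   // only the "web" task fits; the rest of the offer is effectively declined
}
```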
Once the task is running, the master is also reporting the status back to the Marathon scheduler.
So it can be that the task we just launched through Marathon fails; it might be that this task is running out of disk resources or it's running OOM, and therefore it's being killed, and this is going to be communicated by the agent to the master, and then the master is going to communicate it to the Marathon scheduler.
Or, in the case of, for example, node failures or network partitions, when the agent can't communicate with the master anymore, the master is also going to tell the Marathon scheduler after a time: 'Hey Marathon, those tasks which were running on that node, they're actually in either an unknown state or a failed state, because that node is not reachable anymore or because that node really failed.'
And then I as a Marathon scheduler... and
I fear I'm out of network again... and the
Marathon scheduler then actually doesn't have
to deal with detecting this issue that this
node is not running anymore, it simply has
to decide what it wants to do with that information
and how it wants to react.
Okay, why do we actually need this complex model?
It actually enables pretty nice scenarios.
For example, we often get asked the question: when should I use YARN, when should I use Mesos?
The short answer is: if you're only running stuff that lives in the YARN universe, YARN might be a good choice, but otherwise it's actually pretty cool to run YARN on top of Mesos.
Because what that allows you to do is run multiple YARN clusters; you can scale your YARN clusters running on top of Mesos up and down, and thereby make it really flexible.
Also you can still run other applications
on the side, so your cluster is not just dedicated
to YARN, but if another group wants to start
for example with Kubernetes, it can also run
Kubernetes on top.
Okay this brings me to actually the last buzzword...
How much time do I have left..?
Okay, I'll try to finish quickly.
And that's actually DC/OS.
So DC/OS, as mentioned, is basically like
this surrounding Mesos toolkit, so it gives
you like best practice...
I'm not there anymore?
Can I...
Oh there, I’m back.
Okay.
Was it there for a while and I’ve just been
talking?
Okay good, whoo!
So basically this DC/OS, it's built around Mesos, and it makes using it pretty nice and gives you some, I would say, opinionated best practices.
Some companies might want to choose their own practices, because they have established for example different monitoring solutions or different ways of doing authorization and authentication, but overall it gives you a best-practice, opinionated take on what you want to do.
It's also entirely open source, and actually DC/OS itself is just an umbrella project for a number of other open source projects, for example Mesos, Marathon, and then a diverse number of monitoring tools, and basically DC/OS is the surrounding project which has a common roadmap and common docs and tutorials to build it.
And, as any operating system, or as most operating systems do, it also gives you a UI.
What I find kind of nice about this UI is basically the focus.
As an operator, when I look at my cluster, what do I care about?
I don't necessarily care about individual nodes; what I want to see is basically: are all my applications here healthy?
So this is what we can see, or almost can see, down here.
It's basically telling me: do my applications all respond as healthy to the health checks they have.
Also, what I want to see is: how is my cluster utilization?
In this case it's a demo cluster, so it's pretty bad, but overall I also want alerting on that: if it goes too low, that means either some apps are failing or I can remove some resources, or if it hits, for example, like 90% or 95% of the CPU limits, depending on what kind of workload I'm running in my cluster, then I also care as an operator.
But about the rest, if there's a node failure, I as an operator don't immediately care.
I don't want to be woken up at night simply because a node failed.
The other really nice part of DC/OS is the
Universe.
So Universe is basically like the App Store of
your cluster, and it actually allows you to
install for example Spark with a single click.
I would have shown that, but as we're running
out of time, feel free to ping me later.
I have a DC/OS cluster running, so we can
just try it out and install it.
So you can either use the UI to install those
applications and there's also a pretty large
store where users are contributing stuff or
partners are contributing different frameworks.
And whoever has installed Spark in a cluster knows it can be kind of annoying to install all those different applications.
Here it's basically one click, and one click to uninstall again, or if I really like a CLI and my keyboard, I can also do that with the CLI by basically typing dcos package install spark, for example.
So with that, I actually checked off all my
buzzwords and thereby, I'm also done with my talk.
Thank you very much for listening and time
for questions...
No questions?
One question.
An example?
Companies which I'm allowed to say openly.
So for example Verizon is using DC/OS to run
their infrastructure, Autodesk is using DC/OS
to run their infrastructure...
It's more about what I can say openly.
So there are a number of large clusters running
DC/OS.
The main customers are usually everyone who has a large number of servers.
In private companies it's usually the finance industry, or telecommunications as in the Verizon case, because they really have requirements that their cluster is up all the time, and they usually have a large number of nodes.
And beyond that, in the public sector there are also some people needing a large number of nodes.
As for how most people start, there are like two different kinds of customers.
European customers usually do that when they start a new project, if they for example want to get started with Spark.
In the US, customers are usually a little further along on their roadmap, so they already have a lot of stuff running in their cluster, and they usually start with a small part of the cluster and then, part by part, they're putting more nodes into the DC/OS cluster, and thereby also moving more and more applications over into the DC/OS part of the cluster.
Yeah, that's a question I've got on my sheet here.
Yeah, I would have touched on that...
So the answer is two-fold.
Where is the slide I'm looking for?
It's still reconnecting.
So, in principle you could run all your individual applications as Marathon jobs, but in practice you usually don't want to do that for the following reason: each time you start a new Marathon task, it's actually going to start a new Mesos executor on those nodes.
So that means when starting that job, you have to wait for a matching resource offer from one of those worker nodes, and then you can start your job.
What you want to do in a more Lambda-architecture-like setup is: you have an executor running on one of those workers, and that executor is basically just waiting for jobs.
It's basically holding the resources.
So it's owning 5 CPUs for example, and whenever you're sending in a Lambda job, it can start immediately; it doesn't have to wait for the resource offer and it doesn't have to wait until the executor is spawned on the worker node.
Usually what you want to do, and what people are also looking into and actually doing, though not at production scale yet, is you would actually write a framework to do that.
Last question...
So what you would do...
I think OrientDB is already covered by some people, so let's talk about OrientDB later, but basically what you would do is...
The Universe is owned by us, but it's a public GitHub repo, and you would submit a pull request against that repo, and we would review it, or the DC/OS PMC, starting next month, would review it, and then basically decide and merge it in.
But you can also have your private stores.
Say you have an organization; what you might want to do is have your own internal store.
You can just say dcos package repo add, I think that's the command, please look it up, and basically add your own repo, which can be either just a very simple file and folder structure, or you can actually just clone the Universe on GitHub, submit your own stuff there, merge it there, and then basically add that as an alternative app store.
Alright Jörg, thank you very much for this
keynote!
Give him a big applause.
Sorry for the technical challenges, but I
think we could all see the slides and it worked out.
Besides that, Jörg will be available during
the afternoon, so you can ask him a lot of
questions and in the end, we have little sessions
you can also ask him questions.
So thank you very much again.
Yeah, and if you publish the slides I'm happy
to share them with you.
Thank you very much, Jörg.
We have a break till half past two, then the new sessions will start, so the speakers have time to connect the new laptops.
So, enjoy some drinks at the catering and I'll see you back in 50 minutes.
