Hi, my name is Robert O'Grady. I'm the digital platform lead at Yellow New Zealand. Today I'll be talking about building a serverless, event-driven digital platform. I'd like to thank everybody at ServerlessDays Australia and New Zealand for giving me this great opportunity to speak. I hope everybody gets something out of this today. This is a very high-level overview of what we've been implementing at Yellow over the last few years, so I hope it's beneficial to at least some of you out there.
The first topic I'd like to talk about is the initial challenges we had with serverless. The first challenge we found was that we needed to get company buy-in. Historically we had an ECS container platform that had been extremely stable for us and had never given us any issues in terms of outages, so we decided we needed to run presentations on a weekly basis to get people up to speed with new technologies like serverless. We also realized very quickly that we needed to change our culture and break down silos. More specifically, our development team and our cloud team needed to come together as one, as the idea of two separate teams was just going to hold up deployments, development and progress. We needed to think of new ways of working. Traditional roles would change, and fundamentally the organization would change, which was a great opportunity to redefine how our teams looked. It was also a great chance for people to upskill in serverless, and we found that people were quite eager to hear about serverless and to understand it; once they understood the benefits, they absolutely jumped at the chance, which was a great thing. The time and effort was probably the hardest part: the project timelines didn't change, but the technology did. While people were upskilling and we were running proofs of concept, the project deliverables didn't change, so there was a very tight timeline around all of this. That was one of the major challenges we faced. Once we decided that serverless was indeed the direction we wanted to go, we then needed to start thinking about the core technologies and architectures we would need to focus on to deliver it.
In the past we'd used Amazon for our ECS container platform with great success, and we already had our team skilled in Amazon, so it seemed the natural choice. We also did a lot of research into microservices and ran a lot of proofs of concept around them. Given our long history of monolithic approaches, and the issues we'd faced with them around troubleshooting and development, microservices resonated with us quite a lot, and they became another core pillar in the suite of technologies we would use going forward. We also needed a communications mechanism, and we decided on an event-driven architecture for this. A lot of our microservices would need to talk to each other in a structured, well-formatted way, so we realized from a very early stage that event-driven architecture was the way forward for us. The architecture we finally decided on split naturally into four different layers.
Our top layer was the client layer; this included our ecommerce platform, our self-service platform and any third parties that we wanted to interact with. Our second layer was a very important layer, as this was our API, authorization and authentication layer. This was the layer we exposed to the outside world, so from an Amazon perspective it included technologies like API Gateway and AppSync, and deeply embedded in it was our identity management platform, Cognito. This is the layer that managed interaction with clients and third parties, and it was where contact with Amazon was initiated and where all of our business logic would be triggered from.

Layer three is where the business logic lived. This is pretty much where all of our Lambda functions were created, our Step Functions, any queuing mechanisms, and where we would eventually create something called search services. This is also where all of our events would live, and where a lot of our event-driven architecture was driven out of. This layer is where the traditional developer role would live. It was this layer that talked up to layer two in terms of API integration, and then up to layer one in terms of client integration. So, as you can see, this is how all of our services started to tie together.

Layer four was our destination systems. Essentially this is where our Lambda functions, and in some cases our Step Functions, talked to: our CRM systems, our internal systems, any cloud services or any billing systems. This was a pretty complete picture of the full environment: clients interact with our API layer, that flows into our business logic, and the destination systems are where it all ends up. It seemed to fit exactly what we wanted.

That's how the high-level architecture holds together in those four layers. Now let's get into some of the actual Amazon services that we've used.
From a very early stage we noticed that Step Functions, Lambda functions and API Gateway, when used in combination, were very powerful. So let's have a look at how Step Functions and Lambda functions interact. One of the main issues we had was when a Lambda function fails. For example, if the Lambda function was talking to a third-party API and failed, it didn't really have the ability to re-run by itself. We noticed at an early stage that if we wrapped it in a Step Function we could build some retry logic into that, and for that reason we came up with three strategies to deal with failures, including transient failures. The first strategy deals with transient failures that resolve within seconds or minutes; that was retry logic we built into the Lambda function itself, almost our first line of defense. The second strategy deals with system failures that resolve within minutes, hours or days; this was built into the Step Function, which would have the ability to retry the Lambda function. The third and final strategy was for a code issue or system change that meant the execution would never complete, and that would fundamentally require a developer change.
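To make that concrete, here is a rough sketch, not our exact definition, of how the second strategy can look inside a state machine: a Task state that retries a Lambda invocation with backoff, then routes anything still failing to a failure state where an alarm fires. The ARN, timings and state names are placeholders.

```python
# A minimal sketch of a Step Functions Task state with retry logic (strategy two).
import json

call_billing_api_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:ap-southeast-2:123456789012:function:call-billing-api",
    "Retry": [
        {
            # Retry failures with exponential backoff; with a generous
            # MaxAttempts this can span minutes to hours before giving up.
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 30,
            "BackoffRate": 2.0,
            "MaxAttempts": 8,
        }
    ],
    "Catch": [
        {
            # Anything still failing goes to a failure state where an alarm
            # fires and a developer investigates (strategy three).
            "ErrorEquals": ["States.ALL"],
            "Next": "NotifyFailure",
        }
    ],
    "Next": "NextStep",
}

print(json.dumps(call_billing_api_state, indent=2))
```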
Another approach we decided on at a very early stage was that API Gateway would always initiate a Step Function. Traditionally, in most people's minds, API Gateway would trigger a Lambda function, but we found that API Gateway initiating a Step Function opened up many other avenues for us. With very simple VTL templating we could initiate Step Functions from an API Gateway, which would then run your Lambda function. We also noticed that we could uniquely name these Step Function executions. An example where we found this very useful: when an event passed through an API Gateway into our back end, if there was something unique in that event, we could use that unique ID to name the Step Function execution. The reason this mattered was that when interacting with third parties we sometimes saw duplicate events coming through our API Gateway, which could ultimately mean duplicate orders or two bills going out to the same customer, so we needed a mechanism that gave us a unique execution ID. If, for example, the same event came through twice from the third party, the first Step Function execution would be created with that unique ID, and when the second one tried to start it would fail, because it would also try to use that same unique ID. This was a very powerful mechanism and a really good way for us to manage external events coming through. Again, that just highlights the power of the Step Function, Lambda function and API Gateway combination; that unique execution ID was very powerful.
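In our case the execution name was set in the VTL mapping template on the API Gateway integration, but the same idea expressed in Python with boto3 looks roughly like this; the state machine ARN and the order_id field are placeholders for whatever unique value the incoming event carries.

```python
# A minimal sketch: deduplicating third-party events by reusing a unique
# field from the event as the Step Functions execution name.
import json
import boto3

sfn = boto3.client("stepfunctions")

STATE_MACHINE_ARN = (
    "arn:aws:states:ap-southeast-2:123456789012:stateMachine:process-order"
)  # placeholder ARN


def start_order_execution(event):
    order_id = event["order_id"]  # the unique field in the incoming event
    try:
        response = sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            name=f"order-{order_id}",  # execution names must be unique
            input=json.dumps(event),
        )
        return response["executionArn"]
    except sfn.exceptions.ExecutionAlreadyExists:
        # Duplicate event from the third party: an execution with this name
        # already exists, so we simply don't process it again.
        return None
```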
Another thing we found very useful was the ability for Step Functions to call directly into DynamoDB, Elasticsearch, Glue and so on. A third party could interact with our API Gateway and write directly into our DynamoDB without any business logic in Lambda functions at all, and this gave us the ability to write data into DynamoDB and our other data sources at pace.
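As a sketch of what that looks like in a state machine definition (the table and field names are placeholders), a state using the native DynamoDB service integration writes the item without any Lambda function in the middle:

```python
# A minimal sketch of a Step Functions state that writes directly to DynamoDB.
write_event_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "incoming-events",
        "Item": {
            "eventId": {"S.$": "$.id"},      # values taken from the execution input
            "source": {"S.$": "$.source"},
            "payload": {"S.$": "$.detailJson"},
        },
    },
    "End": True,
}
```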
This is just one example of what a successful execution of a Step Function can look like. At the very top, where it says Start, this is where the API Gateway initiates the Step Function and the business logic. It's a good example because it shows what both a synchronous and an asynchronous path can look like in a Step Function. As it passes down through the successful states, it executes various Lambda functions and what we call Choices in a Step Function. Choices are business logic that lives outside of the Lambda function and actually in the Step Function, which I think is a really good example of how we can deconstruct monolithic programs. We don't necessarily need to put all of the business logic into Lambda functions; we can extract even more of it out and start layering business logic into Step Functions.
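A Choice state is just branching logic declared in the state machine definition. A minimal sketch, with placeholder field and state names, looks like this:

```python
# A minimal sketch of a Choice state: branching business logic that lives in
# the state machine rather than inside a Lambda function.
customer_type_choice = {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.customer.type",
            "StringEquals": "existing",
            "Next": "UpdateExistingCustomer",
        },
        {
            "Variable": "$.customer.type",
            "StringEquals": "new",
            "Next": "CreateNewCustomer",
        },
    ],
    "Default": "HandleUnknownCustomerType",
}
```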
As it passes down, each green box means either a successful Choice or a successful Lambda function execution, until it gets to the very end, where we consider the whole thing a successful execution. This is one of many, many Step Functions we've created over the past year or two, and I think it's a good representation of how complex they can potentially get, although they can be really simple as well.
Once we understood how a Step Function works and how it interacts with Lambda functions, we could understand how our business logic was going to execute. However, we now needed to think about a communications mechanism. The one we decided on was CloudWatch, i.e. CloudWatch Events on an event bus. This gave us the mechanism to create our own events as well as process external events from third parties.
Structure and consistency were crucial to understanding how we were going to work with this, so developers, testers and cloud engineers all understood how these events were constructed. Developers could decide by themselves the event structure they needed to publish onto our event bus. The information in the event given to the Step Function and the Lambda function was crucial, and so was the context of the event: understanding what the event meant, where it came from and what it related to.
We decided to split events into two categories. The first is something that has happened, i.e. it is immutable; immutable alludes to the fact that it can't be changed, so this is an event that has already happened. For example, if you've got a third-party CRM system, maybe you've just created a customer, and the event is published onto the event bus to let everybody else know that we now have a new customer. This event has already happened; we can't do anything about it. The second type of event is something that has to happen or should happen, i.e. the customer updates some properties or details. This is then filtered out onto the event bus, and all the waiting event rules pick the event up and execute whatever business logic is waiting.
The idea of patterns in this environment is crucial. In the next slide we'll talk a bit more about patterns: how to write custom patterns, and how your developers come to understand how to write them.
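As a rough example of the kind of pattern we mean (the source and detail-type values here are placeholders, not our real ones), a rule can be created with a pattern that only matches customer-update events from the CRM:

```python
# A minimal sketch: a CloudWatch Events rule whose event pattern matches
# customer-update events from a hypothetical CRM source.
import json
import boto3

events = boto3.client("events")

pattern = {
    "source": ["crm"],
    "detail-type": ["customer.updated"],
}

events.put_rule(
    Name="crm-customer-updated",
    EventBusName="default",
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)
```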
Another mechanism we decided to put together was to record all of the events travelling through our event bus. We created a rule that picked up every event travelling through the event bus and stored them in DynamoDB, which was then streamed into Elasticsearch. This gave us the ability to audit these events at a later stage. It was a nice, circular mechanism: Elasticsearch presented the events back up to the user plane so that they could be queried later.
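A minimal sketch of that audit target, assuming a hypothetical event-audit table, is a small Lambda function that simply persists whatever event the catch-all rule hands it:

```python
# A minimal sketch of an audit Lambda: the target of a catch-all rule that
# writes every event on the bus into DynamoDB (which streams to Elasticsearch).
import json
import boto3

dynamodb = boto3.resource("dynamodb")
audit_table = dynamodb.Table("event-audit")  # placeholder table name


def handler(event, context):
    # The delivered event already carries id, source, detail-type, time and
    # detail, so we store it largely as-is.
    audit_table.put_item(
        Item={
            "id": event["id"],
            "source": event["source"],
            "detailType": event["detail-type"],
            "time": event["time"],
            "detail": json.dumps(event["detail"]),
        }
    )
```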
On the next slide I'd just like to talk very quickly about the code we put together, very generically, to write these events onto our event bus. Here we can see a piece of Python code which demonstrates how we would construct our events and write them onto the event bus. Initially we create a boto3 client, and then within the program we use the PutEvents API to write the event onto the event bus. The important part here is the content: as you can see, there's a very defined structure to how these events are written. The top piece of information is the version: is it 1.0, is it 1.1, is it 2.0? This was a very beneficial way to upgrade the type of information coming through. The detail type was used as context for the event. The time was written to give more context about when the event actually happened, so that when events went into Elasticsearch we could filter by time. The ID was a unique ID that might pass through and could be of use. The source is another important field, which also gives us more context as to where the event came from. For example, if the Lambda function knew the event was coming from the CRM system or the billing system, it would write the source in here and also give context; if it was from a CRM system you could have the source as CRM customer update. The detail was the payload, which was what the Lambda function used to extract the business logic from, and resources was any additional information, like environment variables. In the next example we'll show you a working example.
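The slide itself isn't reproduced in this transcript, but a minimal sketch of that kind of publisher, with placeholder names and values, looks something like the following. Note that with PutEvents the caller supplies the source, detail type, time, resources and detail payload, while the service adds envelope fields such as the event ID.

```python
# A minimal sketch, not the exact slide code: publishing a structured event
# onto the CloudWatch Events event bus with boto3.
import json
from datetime import datetime, timezone

import boto3

events = boto3.client("events")


def publish_customer_updated(customer):
    events.put_events(
        Entries=[
            {
                "Time": datetime.now(timezone.utc),  # when the event happened
                "Source": "crm.customer.update",     # where it came from
                "DetailType": "customer.updated",    # the context of the event
                "Resources": [],                     # any extra references, e.g. ARNs
                "Detail": json.dumps(
                    {
                        "version": "1.0",            # our own payload versioning
                        "customer": customer,        # the payload the Lambdas consume
                    }
                ),
                "EventBusName": "default",
            }
        ]
    )
```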
In this next slide I'd just like to run through an example of what all of this looks like. If you refer back to the architectural diagram from earlier, we had four layers. The top layer is the client layer; in this slide that's either the Yellow website or an external CRM system. The second layer is our API and authentication layer, so that's our API Gateway. The third layer is our business logic layer, so that's Step Functions, Lambda functions and any events. And in the fourth layer we deal with destination systems.

In this example you can see our CRM system speaking to our API Gateway to notify Amazon that something has happened in CRM. The API Gateway then calls a Step Function, which executes a Lambda function. Remember, we never call a Lambda function directly; we always call a Step Function to manage failure, and that Step Function manages the execution of the Lambda function. The Lambda function then inspects the message and dips back into the CRM system if it requires more information. Once the Lambda function has all the information it requires, it constructs the message we saw in the previous slide and writes it onto the event bus.

As you can see at the bottom, there are three rules listening for events on the event bus. Depending on the rule, the event could trigger all or none of them, but let's just say it triggers all three. Each event rule triggers a Step Function, which then triggers a Lambda function; so here again we applied the pattern of never calling a Lambda function directly, but calling a Step Function which then manages that Lambda function. That Lambda function then transforms the data or event into whatever format the destination system needs. So in this example, if a customer is created in the CRM system, it's published into our API Gateway, the Lambda function pushes that event onto the event bus, all three rules are triggered, and all three Lambda functions write into their destination systems.
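To give a feel for how one of those rules is wired to a Step Function rather than straight to a Lambda function, here is a rough boto3 sketch with placeholder names and ARNs:

```python
# A minimal sketch: attach a Step Functions state machine as the target of an
# event rule, so a matching event starts an execution instead of invoking a
# Lambda function directly.
import boto3

events = boto3.client("events")

events.put_targets(
    Rule="crm-customer-updated",
    EventBusName="default",
    Targets=[
        {
            "Id": "update-billing-system",
            "Arn": "arn:aws:states:ap-southeast-2:123456789012:stateMachine:update-billing",
            # Role that allows CloudWatch Events to start the execution.
            "RoleArn": "arn:aws:iam::123456789012:role/events-invoke-stepfunctions",
        }
    ],
)
```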
This gives us a really flexible mechanism. This is just one of many examples we've used, but we found it very easy to work with. If, for example, we needed to introduce a fourth rule, it's very simple: there's no change to the existing code, it's just adding new rules, new Step Functions and new Lambda functions to write into that fourth destination system.

So, now that we were happy with the architecture, we wondered how we were going to deploy everything. We had over 60 Lambda functions, around 20 to 30 Step Functions, many API Gateway endpoints, DynamoDB tables, Elasticsearch domains, and this was just the very start. We knew we'd run into the situation where we would hit our CloudFormation resource limits. At that point we decided to come up with the idea of domains: how to split, and how to logically group, this business logic together. For example, if a group of Lambda functions interacted with our billing system, they should obviously be deployed as one unit. We also established a customer domain, i.e. any Lambda functions interacting with CRM systems, or CRM systems interacting with API Gateway, would all be deployed via that mechanism. In the end we came up with six domains, and we had six deployments. This seems to work very well, and to this day we continue to add to it as we see fit.
As we all know, the unit of deployment in the CI/CD world is very important, and this natural separation of domains gave us a very good hard boundary between where everything lived. We also encouraged small code changes, so when code was released to production we had a good track record of being able to reverse that change, a very good idea of where the change was made, and a record of who made it.

Our CI/CD system was CodeBuild and CodePipeline. We found CodeBuild very effective for running containers; in our previous ECS environments we'd used containers as our build strategy, so this carried over. Within that we ran automated testing with pytest, both unit and integration testing, we enforced code coverage, and we ran many security checks before a build was considered successful.
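As a flavour of those tests, here is a minimal pytest sketch against a hypothetical build_event() helper; the module and function names are illustrative, not our real code.

```python
# A minimal pytest sketch, assuming a hypothetical events.build_event() helper
# that constructs a PutEvents entry like the one described earlier.
import json

from events import build_event  # hypothetical module under test


def test_build_event_has_required_fields():
    entry = build_event(
        source="crm.customer.update",
        detail_type="customer.updated",
        payload={"customerId": "123"},
    )
    assert entry["Source"] == "crm.customer.update"
    assert entry["DetailType"] == "customer.updated"
    assert json.loads(entry["Detail"])["customerId"] == "123"
```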
Another concept that we toyed with was the idea of deploying directly to production. A developer would create a branch and make their code change, and we gave them a facility to run a specific set of commands that would give them their own CloudFormation stack. They would then run, test and develop their code within that stack, which gave them an array of endpoints, an array of Lambda functions, DynamoDB tables, everything they needed to develop. Once they were happy with the code, they would merge directly into master, which would deploy down through the environments and directly out into production. So, as you can see, automated testing was a key part of all of this, and we encouraged a lot of our testers, as well as our developers, to get involved in creating the automated tests. In the end, the developers wrote the majority of our automated tests, and again this is just another example of how our teams and our organization have changed.
So now we've seen the architecture, and we've just run through how we deploy our code into production. Our code is in production, but the picture wouldn't be complete without talking a little bit about operations and some of the operational challenges we faced.

One of the major issues we faced initially was how to manage failures, i.e. how to manage Lambda functions that failed. As we previously discussed, we resolved this, or tried to, by wrapping Step Functions around our Lambda functions, and we talked a little bit about our strategies. The first strategy deals with transient failures that resolve within seconds or minutes, and that pretty much stays with the Lambda function to manage and retry. The second strategy deals with system failures that resolve within minutes, hours or days; this is left to the Step Function, because remember, Lambda functions can only run for 15 minutes, so anything over 15 minutes needs to be passed back to the Step Function to manage. The third strategy is for a code issue or a system change that means the execution will never complete; this is managed by an alarm and, eventually, the developer fixing the bug.
Another challenge we faced was how to monitor a distributed system. Unlike a traditional infrastructure world, we don't have servers involved; there's no physical infrastructure to monitor. How do we monitor something where Lambda functions talk to API Gateways and API Gateways talk to Step Functions? It's quite difficult. We needed that single pane of glass, and that was a challenge initially. X-Ray has recently presented us with really good options in that space.
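As a small example of the kind of option X-Ray gives us, assuming the aws-xray-sdk package is bundled with a function and active tracing is turned on for it, patching the AWS clients makes the downstream calls show up on the service map:

```python
# A minimal sketch: instrument a Lambda function so its downstream AWS calls
# appear as subsegments in X-Ray traces.
import boto3
from aws_xray_sdk.core import patch_all

patch_all()  # patch boto3 (and other supported libraries) for X-Ray tracing

events = boto3.client("events")


def handler(event, context):
    # Any AWS call made here is now recorded as a subsegment in the trace.
    ...
```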
One of the major things we've seen is that the development teams have become a bit more ops focused. We don't have a traditional cloud team that looks after EC2 instances or manages container platforms; in the world of serverless that has disappeared. So you find that the developers are now closer to these failures and closer to these challenges. A lot of the developers are now sitting in the operational space, looking at alerts and understanding them, and they're the people who write the code and create the fixes. So that was a challenge.

The last challenge we found was around CloudWatch Logs, which can be difficult to use. Initially, about two years ago, we found we needed to push those logs into a third-party aggregator and pay an additional license for that, but as the years have gone by, CloudWatch Logs have become a lot easier to aggregate natively in Amazon, which is really good.
Over the last few years Yellow and I have learned a lot about serverless and serverless technologies. It's taught us that we need to think about how our team structures are set up, how our organization works and how our business thinks. There are three main takeaways from this that I'd like to talk about now.

The first is the importance of patterns. From a very early stage we noticed that we needed to rely on patterns to give us some order and structure, or else this would turn into a bit of a monster. Architecture patterns, and the patterns around API Gateways, Lambda functions and Step Functions, are something we needed to reuse over and over again, and we have done over the years. This gives the ops teams a sense of understanding, as they know what to look for in each flow when they're troubleshooting issues. Patterns around CloudWatch Events were another important piece that we recognized early. Documenting and understanding these patterns, so the cloud people and dev teams were all on the same page, was key.
The second key takeaway was establishing domains early, the early establishment of domains we've already talked about. This set out our repository structures and the structure of our API Gateways. For example, we use /customer in the customer domain, /product in the product domain and /provisioning in the provisioning domain, and these in turn execute Step Functions within those same domains.
Another key takeaway was to innovate where possible. With the freedom to try new technologies and experiment, people were super engaged in what they were doing. We liked to talk about new technologies and new announcements that came through from Amazon; this got people really excited in our Friday meetings too, understanding the possibilities of what they could do the next week. This seemed to be a very key factor in motivating people, driving them forward and driving innovation and technology forward.
I'd like to thank everyone for logging in today and taking the time out of your day to sit down and listen to what we at Yellow have been working on over the last few years around serverless and serverless technologies with Amazon. It's been a great experience. These last few slides have been very short, and each one could probably be a presentation in itself, but if anyone has any questions, or would like to follow up later through LinkedIn, I'll be more than happy to answer. So if anyone has any questions, please let me know now.
