(electronic music)
(audience applauding)
- Hello, and welcome to
Microsoft Mechanics Live.
Coming up, we're gonna have a
look at a special edition of
How We Built It on Microsoft Mechanics,
live from Amsterdam.
We're gonna learn how Symantec,
a global leader in
cybersecurity, was able to shift
from their complex self-managed
database infrastructure
to a geographically
dispersed and low-latency
database solution using Azure Cosmos DB,
all with minimal code
changes and no downtime.
We're also gonna look at how this work
accelerated their move into insights-driven DevOps,
that model for continuous iteration,
and their threat detection services,
all the things they're doing as part of Symantec.
So, I'm joined today by
the technical architect
Michael Shavell from Symantec.
Please give him a warm welcome.
(audience applauding)
- It's really great to be here at Ignite,
and of course in Amsterdam.
- Now, before we go deep into
the work that you've done
and actually the team has
done to move to the cloud
services that you're using now as part
of what you're doing with Symantec,
a lot of people know Symantec
from antivirus protection,
but you do a lot more.
Can you give us a description
of what Symantec does in totality?
- Yeah, so, originally we
were the antivirus company,
but we really have so
many more solutions now.
We're focused on protecting identity
and data across devices and services,
and this is really, as
we move into the cloud
generation, where things are heading.
Norton Security, as many
of you guys might know
from our consumer business
unit, focuses on antivirus,
anti-spam, anti-malware
services to keep your computers
and devices safe.
My team is actually
focused around 43 services,
half of them being security
and the other half delivering
the overall user experience.
These really are the back-end services
that provide file, web, and IP reputation
across the web.
But also really importantly,
now moving into Wi-Fi
and then some other
pieces like authentication
including two factor
authentication, user policy,
and notification.
- All right, let's put this into context
for everyone who's here.
What's the scale of your global operation?
- So, the security services themselves get
about five billion
endpoint requests per day.
We see almost 12 billion in
total across all of them,
and customers are globally
relying on these checks
in order to provide URL, file, Wi-Fi,
and app reputations to
figure out what's secure,
where should I be visiting.
So, the protection has to work
silently behind the scenes.
We can't be impacting productivity.
We can't be right in their face.
So, we have a five second SLA
to understand and intercept threats.
If we don't meet that, we
actually can't protect the user.
So, that's not a position
we would ever wanna be in,
and really, five seconds
is a worst-case scenario.
- And the great thing is
it works if you're offline
or online, but what's the
typical response time?
- Internally, our servers strive
for a two to five millisecond turnaround.
So, when you add worldwide latency,
maybe up to 100 milliseconds for those endpoints,
that's really where we're striving to be.
- All right, so, let's
dig into the latest,
the tech architecture you're using now.
What does that look
like in terms of running
your reputation service?
- Of course you have to have an endpoint
to protect your device.
So, as a user you would have installed
Norton Antivirus or Norton
Security on your device,
maybe Norton Mobile Security,
and what these are doing
is watching for unusual
processes on your devices:
malicious activity,
maybe things like disk encryption,
which is what happened
with the WannaCry outbreak.
So, as I mentioned we
have about five seconds,
actually sub five second SLA,
to deal with these threats.
So, our agent actually
goes and communicates
with our back-end services.
To minimize the latency worldwide,
we leverage Azure Traffic Manager.
This allows us to route the requests
accurately to the closest data center
to get the best performance.
Sitting behind there is
actually Azure VM Scale Sets
that allow us to grow
and contract to the needs
in the given day.
So, just to set context, any one endpoint
that's hitting our
services may make up to 100
queries against Cosmos on the back end.
So, if we take our File Reputation Service
as an example, once you
download a file on your device,
it calls one of our services
and a reputation gets
returned to your machine.
We use machine learning algorithms
among other technologies
to actually determine
the state of the file.
We hold on to it before
it's allowed to be executed
for a very short time.
That's where that five seconds comes in:
until we determine it's safe,
at which point we unlock it
and allow it to be executed.
So, on the back end this
actually communicates
with Azure Cosmos DB in order to decide
if the file or process is known good or bad,
and then feeds that back
to the user very rapidly
so we can decide if it's malicious.
If we do decide that it's a bad file,
we go ahead and we actually
just completely remove it
and stop it.
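To make that lookup concrete, here's a minimal sketch of what a reputation check against Azure Cosmos DB can look like with the Python SDK. The endpoint, key, database and container names, and the `verdict` field are illustrative assumptions, not Symantec's actual schema.

```python
from azure.cosmos import CosmosClient, exceptions

# Placeholder endpoint, key, and container layout, for illustration only.
client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
container = client.get_database_client("reputation").get_container_client("files")

def file_reputation(sha256: str) -> str:
    # Point-read by file hash; assumes the hash is both the id and partition key.
    try:
        item = container.read_item(item=sha256, partition_key=sha256)
        return item["verdict"]  # e.g. "good" or "bad"
    except exceptions.CosmosResourceNotFoundError:
        return "unknown"  # never seen before: hold the file and analyze it
```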
- Very cool, so, what happens
then if it's a brand new
file that's never been
seen before by the service?
- That's a great question.
So, telemetry is super important.
So, every hit to our services is logged
by our eventing system, and then analyzed
to keep up to date with
new and emerging threats.
We need a very short active feedback loop.
So, if a file was weaponized
after it has been downloaded,
behavior analysis occurs and we will
detect that something is going bad.
This anomalous behavior
is very rapidly logged,
such as when the file immediately
turns around and starts
encrypting stuff. Not good.
- Right.
- So, our clients have logic
to understand that
something is not going well,
and we can short circuit
that process and stop it
while we go ahead and analyze those files.
From there, we actually output
these reputation values,
and then in this case we're
gonna call it a bad file,
and we'd run it through our proprietary
reputation system which
actually goes ahead
and injects it back into
Cosmos DB for fast recall
on all those other machines.
So, it's really important
that we continually update
our database models to return
fast, accurate verdicts
that also trigger
protections in the background
to user machines by
blocking the bad processes.
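The write-back side of that loop can be as simple as an upsert; here's a hedged sketch reusing the `container` client from the lookup sketch above, with a made-up document shape.

```python
import time

# `container` as in the lookup sketch above.
def publish_verdict(sha256: str, verdict: str) -> None:
    # Upsert so a re-analyzed file simply overwrites its earlier verdict,
    # making the new reputation instantly recallable by every other machine.
    container.upsert_item({
        "id": sha256,
        "verdict": verdict,           # e.g. "bad" after behavior analysis
        "updated": int(time.time()),
    })
```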
- And it's really cool, all of this
typically happens silently in the background,
usually within two to five milliseconds.
So, it's really amazing in
terms of how fast that is.
- Yeah, and worldwide
we're always striving
to drive this latency down, which is why
we're expanding these
services into so many regions.
- Right, so, you also mentioned WannaCry.
How well were you able to handle
a threat like WannaCry?
- Yeah, so, that was a
really interesting morning.
I was commuting into Boston on my train,
and we saw an elevated level of requests.
So, we were seeing an extra 20 to 30 thousand
requests per second.
The interesting thing was
that our systems at Symantec
automatically shut that entire attack down,
and we were able to just remove it
without any human interaction.
- So, you've made a couple of really
important tech decisions at Symantec
to shift to using managed cloud services
as part of your migration.
- Right, so, four years ago we were
actually in an on-prem data center.
So, we were very restricted
with what we could use,
and there were no managed services.
- Okay, so, where did that all start then?
- Yup, so, the first thing,
telemetry is our lifeblood.
We can see how it's tied into
all these reputation systems.
So, the first service we grabbed
when we moved to the cloud
was Azure Event Hubs,
because it gave us a lot of confidence.
We were able to utilize this service.
It was up.
We got a great SLA, and it just worked,
and we got our detection turnaround down
to minutes, full turnaround in minutes.
So, this combines with
the behavior blocking
on the clients for good
zero-day threat protection.
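As a rough sketch of that telemetry path, sending detection events into Azure Event Hubs with the Python SDK looks something like this; the connection string, hub name, and event payload are placeholders for illustration.

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection string and hub name.
producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-connection-string>", eventhub_name="detections")

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({
        "sha256": "<file-hash>",
        "behavior": "mass-encryption",   # e.g. a WannaCry-style signal
    })))
    producer.send_batch(batch)           # downstream analysis closes the loop
```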
Next, we're constantly
improving our reputation
and notification services.
So, we adopted Azure Application Insights.
So, we're actually gonna take a look
at a dashboard here.
This is one of our actual
dashboards that we run.
It's showing average
availability at 100%, fantastic.
- [Jeremy] That's always
a good thing to see.
It's flat at 100%.
- [Michael] Absolutely, so, as we take
a look at that, fantastic.
We'll go back, we take a
look at server response time,
critically important.
So, this is one of our policy services,
very, very rapid, 60
millisecond response time,
fantastic, and of course
we can see all the errors
in our system.
- [Jeremy] Right.
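For a sense of how a service can feed a dashboard like this, here's a minimal sketch using the azure-monitor-opentelemetry distro to push a custom metric into Application Insights; the connection string, meter name, and metric name are assumptions, not Symantec's actual instrumentation.

```python
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics

# Placeholder connection string; call once at service startup.
configure_azure_monitor(connection_string="InstrumentationKey=<key>")

meter = metrics.get_meter("policy-service")
response_ms = meter.create_histogram("server_response_ms")

# Record each request's server-side latency; the dashboard charts the average.
response_ms.record(60)
```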
- [Michael] So, leveraging this
is a core area of focus.
However, the other piece we really
needed to leverage was Azure Cosmos DB.
- So, how did you go about then making
that transition to Cosmos DB?
- So, we didn't get there overnight.
We had about five database solutions
prior to Azure Cosmos DB.
A core step for us was of
course to move to the cloud.
So, we leveraged Cassandra
running on IaaS infrastructure.
So, this was actually running on raw VMs.
It took a substantial amount
of time to build this out
though, because there's
just a lot of work to do.
You're standing up 30 Cassandra
nodes in every region,
because you have to handle
180,000 requests per second.
So, we have to tune the IO.
We have to patch.
We have to monitor.
Everything is on the
engineering teams to do.
So, we also unfortunately
have to over-plan for capacity,
because we have to handle
those worst-case scenarios.
As an engineering team,
we had to load test
the database because we were responsible.
So, we had to build an
entire load test framework.
- Right, and it looks
like it's fairly complex
to get all of that to run.
You've got all of the kind
of server, OS management,
and all the patching,
all that stuff to do.
I can imagine it was really good
for known threats, in
terms of getting this done,
but what happens if you've
got to really scale up fast
for things like unknown threats?
- Yeah, right.
So, we always had to balance
the risk of the unknown
and planning for it.
So, unfortunately we were
running idle infrastructure
anticipating the next big cyber threat,
because if it's not
ready, we can't handle it.
- Right.
- So, all of that completely
went away with Azure Cosmos DB, though.
So, let's go ahead and take a look here.
So, the interesting thing here
is we can take a look at
the 'Replicate data globally' view.
We can click a button.
I'm not going to, because
this is a production database.
However, as we expand into Amsterdam,
which is our next region,
we actually will be
just clicking a button,
replicating, and moving on.
So, it really can scale to any level
that we're looking for, and
we'd only have to load test
our own code now.
We don't have to load
test the infrastructure
because we can depend upon
Cosmos to handle the load.
- Very cool, so, that shift,
and that database infrastructure,
effectively had to be an
area of risk for Symantec.
Now, you're running services
that can't have any downtime,
and after all, this is AV and
reputation services, et cetera.
How were you able to manage the migration
without really impacting
the services or the users?
- Great, so, yeah, the
interesting thing here
was that you guys offer a Cassandra API.
So, really great, again.
We choose the Cassandra API,
and as that comes up
we go ahead and just create
a new Cassandra account,
and then actually instantiate it.
So now, we can do a very
simple migration over
with very, very, very
minimal code changes,
and that allows us to just shift our code,
point at the new database, and run,
and we did this with our
notification platform.
Now, that service gets about
3.5 billion requests per day.
So, we were able to shift
that, no code change.
So, imagine what that means.
We can task our engineers
with working on the next
big piece, or maybe some other migration
or some other managed service integration.
- Yup, it's really cool stuff,
and basically the app thinks
it's running on Cassandra,
but it's running on Cosmos.
Just a few connection-string changes typically,
and everything just works and scales, right?
- Absolutely, absolutely.
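As a sketch of how small that change can be: with the Python Cassandra driver, pointing an app at Cosmos DB's Cassandra API is mostly a matter of swapping the contact point, port, and credentials. The account name, key, and keyspace below are placeholders, and the relaxed TLS settings are purely illustrative.

```python
import ssl
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Cosmos DB's Cassandra API endpoint listens on port 10350 over TLS;
# the account name and key are placeholders.
auth = PlainTextAuthProvider(username="<account>", password="<account-key>")
ssl_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ssl_ctx.check_hostname = False           # illustrative; tighten for production
ssl_ctx.verify_mode = ssl.CERT_NONE

cluster = Cluster(["<account>.cassandra.cosmos.azure.com"],
                  port=10350, auth_provider=auth, ssl_context=ssl_ctx)
session = cluster.connect("notifications")   # same CQL queries as before
```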
So, now, interesting, we
also had to develop a way
to move our data into other databases.
So, we actually built a really
well-documented database
move migration process
that actually allowed us
to move from database to database.
In the live migration process,
what we do for our active users is this:
they come into our APIs,
and currently their data
is living in Cassandra.
So, when a REST API call hits,
we want to move that data into Cosmos.
Now, this is a high-level
visual of our migration process,
and there's some pieces
we're gonna hit on here
like live migration, common migration,
and background migration,
that are really important.
- [Jeremy] So, how do all these
processes then actually work?
- [Michael] Great, so, in
a live migration process
somebody comes in and hits a REST API.
We don't want everybody
to move in one shot,
because migration is expensive.
You have to go to the old
database, pull the entire
user's record out, perform an ETL process,
and then move it into the new one.
So, you don't want to just
flip the switch in one shot.
So, we take an incremental approach.
So, we start with 1%.
So, 1% of our live users
start migrating over,
and then at that point we
know once the user migrates
from Cassandra into Cosmos in this case
they're now being served
out of Cosmos permanently.
So, we can go ahead and
start to slowly ramp this up
a little bit as our confidence gains.
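One common way to implement that kind of sticky percentage rollout, shown here as an illustrative sketch rather than Symantec's actual logic, is to hash each user ID into a stable bucket:

```python
import hashlib

def in_migration_cohort(user_id: str, rollout_percent: int) -> bool:
    # Hash to a stable bucket 0-99, so a user who migrated at 1%
    # stays migrated as the rollout ramps toward 100%.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

# At 1%, only bucket 0 migrates; raising the knob widens the cohort.
print(in_migration_cohort("user-1234", 1))
```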
- Right, so you started effectively
with migrating 1%.
How do you get past that first 1%
and really start ramping up?
- Yeah, so, the team actually watches,
and our App Insights shows us
that we're in a healthy state.
So, we go ahead and we stay at 1%,
and then very quickly move
to 5%, 10%, 20%, 25%, 50%,
and then finally 100%,
which is the point
when we can actually start running
the background migration process.
- So, how then does the background
migration process actually work?
- Yup, so, not everybody
hits our services every day.
In fact, some of our users we
might not see for six months.
Think about it, I have a couple devices
that sit in a drawer.
Maybe it's a test device, and
we won't see them very often.
So, we have to go ahead and migrate them
so we don't have to
manage the old database.
So, we have a Spark
job that actually scans
the Cassandra DB and then
it gives us back a list
of IDs that we can then migrate.
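A sketch of that scan with PySpark and the spark-cassandra-connector might look like this; the host, keyspace, table, column names, and output path are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("find-unmigrated-ids")
         .config("spark.cassandra.connection.host", "<cassandra-host>")
         .getOrCreate())

# Scan the old Cassandra DB for the user/device IDs still to be migrated.
ids = (spark.read
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="reputation", table="users")
       .load()
       .select("user_id", "device_id"))

ids.write.mode("overwrite").json("/tmp/unmigrated-ids")  # fed to the engine
```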
So, our migration engine lives
within the service itself,
as we showed with that first graphic.
However, what we wanna
do is actually feed it
through that same process.
So, as we see the user and the device ID
we can then go ahead and push them through
and figure out did I migrate this guy?
No, okay, so let me go ahead and do this.
Now, the really important part there,
to minimize the amount of testing,
is that it's the exact same code flow.
You don't wanna have two code flows,
because everything should
be the same in this process.
You do not want two different results.
It would also double your testing,
and your stability requirements,
because now you're looking
at two different code flows.
So, we utilize the same exact code flow
either from an API or a message bus.
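The single-code-flow idea might be sketched like this, with in-memory dicts standing in for Cassandra and Cosmos and hypothetical handler names; it's illustrative only, not Symantec's actual migration engine.

```python
# Illustrative only: both entry points funnel into one migration step.
old_db = {"u1": {"plan": "basic"}}   # stand-in for Cassandra
new_db, migrated = {}, set()         # stand-ins for Cosmos and the marker set

def migrate_if_needed(user_id: str) -> None:
    if user_id in migrated:                      # "did I migrate this guy?"
        return
    record = old_db[user_id]                     # pull the record from the old DB
    new_db[user_id] = {**record, "schema": 2}    # the same ETL either way
    migrated.add(user_id)

def handle_rest_api(user_id: str) -> None:       # live path
    migrate_if_needed(user_id)

def handle_bus_message(user_id: str) -> None:    # background path (Spark feed)
    migrate_if_needed(user_id)
```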
- So, what else did you have to plan for?
Let's talk about things like scale.
What did you have to do
to really set the right
thresholds on scale, so that
you aren't overpaying for resources?
You mentioned before that
you might have to have
180,000 requests per second or something,
and the infrastructure to manage that.
How do I do that in Azure then,
to where I'm not paying
for that kind of overhead
of just keeping resources
at that peak level?
- Yeah, as I mentioned before, we're using
Application Insights tied
into all these applications.
So, we actually combine
this with an Azure Function,
allowing us to do some
pretty interesting things.
So, let's take a look at that.
This function is actually
how we perform our scaling.
So, we grab some environment variables
that are really important.
We figure out how far we wanna scale.
Then we go ahead, calculate that, log it,
and of course write it back
to Application Insights
so we know exactly where our trends are.
So, this is triggered off of 429s,
when we get pushback from the service.
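Here's a hedged sketch of what the body of such a function can look like with the azure-cosmos Python SDK; the 10% step, environment variable names, and database/container IDs are illustrative assumptions, as is the trigger wiring.

```python
import logging
import os
from azure.cosmos import CosmosClient

def scale_throughput(direction: int = 1) -> int:
    # Nudge the container's RUs up (+1) or down (-1) by a configured step.
    step = float(os.environ.get("SCALE_STEP", "0.10"))   # e.g. 10% per move
    client = CosmosClient(os.environ["COSMOS_URL"], os.environ["COSMOS_KEY"])
    container = (client.get_database_client("reputation")
                       .get_container_client("files"))
    current = container.get_throughput().offer_throughput
    target = max(400, int(current * (1 + direction * step)))  # 400 RU floor
    container.replace_throughput(target)
    logging.info("Scaled RUs %d -> %d", current, target)  # lands in App Insights
    return target
```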
Now, let's take a look.
What does that actually look like?
This is our actual running
prod resource utilization.
So, if we look here we can see on the left
that's our scale trend.
That's that function going up and down.
On the right is actually our request RUs
for one of our really important
services that can't go down,
leading into a weekend where
our enterprise customers
get quiet, and then heading
back into Monday morning
where it gets busy again.
- [Jeremy] And when you
think about what you see
there on the left, that
320 bar is probably where
you'd keep idle infrastructure
running on prem.
- Yup.
- So, literally all of that
white space is stuff that
you don't have to pay for,
and one thing to point out,
if you do hit a 429 code
it means you're gonna be throttled.
Services aren't gonna
stop, but the nice thing is
it's kind of a good listening mechanism
to wait for that throttle to happen,
and then you can start bumping up scale
based on those throttles coming in.
So, those are really cool triggers
in terms of being able to kind
of right-size the service.
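On the client side, detecting that pushback is just a status-code check; a minimal sketch with the azure-cosmos SDK, where the account details, document, and reaction are placeholders:

```python
from azure.cosmos import CosmosClient, exceptions

client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
container = client.get_database_client("reputation").get_container_client("files")

try:
    container.upsert_item({"id": "doc-1", "verdict": "good"})
except exceptions.CosmosHttpResponseError as err:
    if err.status_code == 429:
        # Throttled, not down: back off and retry, and let signals like
        # this drive the scale-up function shown earlier.
        pass
```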
But, what's next?
- So, I would say at this point 50%
of our services have been
migrated into Cosmos DB
over in Azure.
So, we wanna continue and
have the rest of our services,
where appropriate, in Azure Cosmos DB
by the end of the year.
- So, really fantastic stuff, Michael.
Any final tips that you wanna impart
to all the folks that are watching today?
- Yup, fear of downtime
shouldn't get in your way
when you're adopting
these managed services.
They are trustworthy, they
work, they function as promised.
Cosmos was the first service with an SLA.
That's a big deal.
So, you do need to take a
workload-focused approach though.
Figure out what is your service.
Can it have downtime?
Probably not, don't do
that, and figure out
what's the appropriate DB, okay?
Expect to overscale.
So, when you're dealing with these systems
during the migration you're doing a full,
full migration--
- Full move,
all the attributes, all
those have to move, right?
- All the extra data.
So, you're gonna overscale,
but the nice part
with Cosmos is as soon
as your migration's done
within hours you go ahead
and pull it back down
to the appropriate levels,
and then you're right-sized
and your cost is accurate
to what you wanna run at,
and then of course
telemetry is absolutely key.
Everything you do through
these migration processes
should be tracked and it
should be well defined
in your dashboards.
- Really cool, so thanks
for joining us today Michael
on How We Built It to
really share your experience
in migrating to Azure Cosmos DB.
Now, if you wanna learn
more about Azure Cosmos DB,
you can check out the link
here shown, and of course,
keep watching Microsoft Mechanics
for the latest tech updates.
Hit subscribe if you haven't already.
We'll see you next time.
(audience applauding)
(electronic music)
