The Amazon Relational Database Service (RDS)
makes it easy to set up, operate, and scale a
relational database in the cloud. It provides cost-efficient
and resizable capacity while automating time-consuming
administration tasks such as hardware provisioning,
database setup, patching, and backups.
When we think about backups, we often think about
disaster recovery or data migration scenarios, but
these aren’t the only use cases.
Let’s consider another common one. Suppose we
have an example database called “bank-demo”
running on a db.r5.large instance and taking automated
nightly snapshots.
Our internal analytics team wants to be able to run a
high volume of computationally intensive queries on
the data whenever they want. Can we make this
possible for them in a
safe, reliable, and cost-effective way?
The answer is “yes!”
In this demo, we’re going to export an Amazon RDS
snapshot from this example RDS database to S3, and
then, with the help of AWS Glue, use Amazon Athena
to perform queries on the exported snapshot.
This will enable us to generate reports, perform
computationally expensive analytical queries,
combine our data from RDS with other sources
to build data lakes, comply with long-term data 
retention requirements, and much more - all without
impacting the performance or availability of our database.
To get started, let's head over to the RDS console, and
open the list of available snapshots for our sample
database. We’ll choose the most recent automated 
system snapshot, but we could also select any
snapshots we’ve taken manually, or even create a
brand new snapshot just for this exercise by using
the “Take snapshot” button. Next, we’ll select
“Export to Amazon S3” from the “Actions”
dropdown menu and start configuring
our export. Let’s call it “bank-demo-export-1”.
Now we need to indicate how much data we’d
like to export. We can either export the entire
database, or we can specify only a few tables by
listing the schema and name for each table that
we want to include. Let’s go ahead and export
everything.
We need to choose a destination on S3 for the export,
but since we don’t have one yet, let’s go create it.
We’ll call it “bank-demo-exports”, keep the
default options, and, as a general best practice,
leave the “Block all public access” setting
enabled. Once we’ve done that, we can
go back to the snapshot export form, refresh the
“S3 Destination” section, and select our new bucket.
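Incidentally, if we preferred to script that bucket setup, a minimal boto3 sketch might look like this (the region is an assumption; the rest mirrors what we just did in the console):

```python
import boto3

# Region is an assumption for this demo.
s3 = boto3.client("s3", region_name="us-east-1")

# Create the destination bucket for our snapshot exports.
s3.create_bucket(Bucket="bank-demo-exports")

# Match the console default we left in place: block all public access.
s3.put_public_access_block(
    Bucket="bank-demo-exports",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```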
Now we need to choose an IAM Role that the
snapshot export process can use to access
our S3 bucket. We don’t have one yet, so let’s create
a new role called “bank-demo-exports”.
The role will automatically be given the IAM policy
shown here, which allows it to read, write, and
manage the snapshot exports for us. To protect
our exported data, we must also provide 
a customer-managed encryption key from the
AWS Key Management Service (KMS).
We can do this by going to the KMS console and
pressing the Create Key button.
We’ll give the key a logical name and description,
but since there are no other users in this demo
account, we don’t need any separate key
administrators or key users at this time, so
we can leave those options alone. We’ll click
“Finish” to create the key, but before leaving
KMS, let’s grab a copy of the key ARN, and then
paste it into the Encryption section back in RDS.
Now we can start the Export.
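For reference, the equivalent API call through boto3 would look roughly like this sketch (every ARN below is a placeholder for your own account, snapshot, role, and key):

```python
import boto3

rds = boto3.client("rds")

# All ARNs below are placeholders -- substitute your own.
rds.start_export_task(
    ExportTaskIdentifier="bank-demo-export-1",
    SourceArn="arn:aws:rds:us-east-1:123456789012:snapshot:rds:bank-demo-2021-01-01-06-00",
    S3BucketName="bank-demo-exports",
    IamRoleArn="arn:aws:iam::123456789012:role/bank-demo-exports",
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    # Optionally export just a few tables instead of everything, e.g.:
    # ExportOnly=["bank_demo.accounts", "bank_demo.customers"],
)
```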
The amount of time it’ll take to finish will depend
on the size of the snapshot itself, but we can look
in the RDS Console to view the current status. For more
granular progress, the `aws rds describe-export-tasks`
command in the AWS CLI will tell us how much data
has been extracted and what percentage of the
overall process has been completed thus far.
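The same information is available through boto3, along the lines of this sketch:

```python
import boto3

rds = boto3.client("rds")

# Look up our export task and report its progress.
task = rds.describe_export_tasks(
    ExportTaskIdentifier="bank-demo-export-1"
)["ExportTasks"][0]

print(task["Status"])                      # e.g. STARTING, IN_PROGRESS, COMPLETE
print(task.get("TotalExtractedDataInGB"))  # how much data has been extracted
print(task.get("PercentProgress"))         # percentage of the overall process
```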
Now that the export is complete, let’s take a
look at what ended up in our S3 bucket.
We have two export info JSON files, and a folder
with the same name as our database. The first
info file is the final report of the export task, and
the second one breaks the status down for us by
individual table, including overall size and the
datatype mappings. Within the “bank_demo”
folder we have separate folders for each individual
table that was exported, and under each of those
we’ll see one or more folders corresponding to how
many partitions were created during the export
process. “transactions”, being the largest of the
three tables, created several hundred partitions,
while the “accounts” and “customers” tables only
required one each. Finally, at the deepest level,
we’ll find the exported data in
Apache Parquet format.
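If we want a quick local peek at one of those files, a short pandas sketch will do (the file name is a placeholder; this assumes the pyarrow package is installed and the file has been copied down from S3, for example with `aws s3 cp`):

```python
import pandas as pd

# File name is a placeholder for one of the exported Parquet part files.
df = pd.read_parquet("part-00000-demo.gz.parquet")
print(df.head())   # first few rows
print(df.dtypes)   # the column types Parquet preserved for us
```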
Compared to text formats, Parquet is up to 2x
faster to export and consumes up to 6x less
storage in Amazon S3, and we can analyze the
exported data with other AWS services like
Amazon Athena, Amazon EMR, and Amazon SageMaker.
Let’s set up Athena for that right now.
First, we need to tell Athena where to find the
data and what it looks like. Thanks to AWS Glue,
though, we don’t need to describe every table and
its properties to Athena ourselves; we can set up
a crawler to discover all of that for us.
Let’s call the crawler “bank_demo” and tell it to
crawl through an S3 bucket in our account under this
path. Next we’ll ask it to create the appropriate
IAM role for us and, for the time being, tell it to
crawl our data only when we ask it to. And finally,
we’ll ask it to organize the schema information it
discovers under the name “bank_demo”.
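The same crawler definition could be created through boto3, roughly like this sketch (the role ARN and the S3 path are assumptions; leaving out a schedule keeps the crawler on demand):

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="bank_demo",
    # Role ARN is a placeholder for the role the console wizard creates.
    Role="arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole-bank_demo",
    DatabaseName="bank_demo",
    # The S3 path is an assumption based on the folder layout we saw earlier.
    Targets={"S3Targets": [{"Path": "s3://bank-demo-exports/bank_demo/"}]},
    # No Schedule argument: the crawler only runs when we ask it to.
)
```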
With that, the crawler is ready to go, but there’s one
more important task that we need to complete before
running it. Remember earlier when we configured
the KMS key policy, but didn’t need to give anybody
else permission to use the key? That’s no longer
the case. This crawler is going to need to crawl
through our exported snapshot data in S3,
which is encrypted, so the crawler, or more
specifically the crawler’s IAM role, needs
permission to use that key. To grant that, we
just go back to KMS and add the new IAM role
as a key user. Now let’s run the crawler!
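If we were scripting this instead, a KMS grant is one programmatic alternative to editing the key policy in the console; here is a rough sketch with placeholder ARNs:

```python
import boto3

kms = boto3.client("kms")
glue = boto3.client("glue")

# Give the crawler's IAM role permission to decrypt the exported data.
kms.create_grant(
    KeyId="arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    GranteePrincipal="arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole-bank_demo",
    Operations=["Decrypt", "DescribeKey"],
)

# Kick off the crawl.
glue.start_crawler(Name="bank_demo")
```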
This will take a few minutes to complete depending
on the size of your data, but you can expect it to
finish considerably faster than the original
export to S3 did.
Now let’s go back to Athena
and run some queries!
Amazon Athena supports a subset of SQL, and
you can refer to the Athena documentation for
a full reference. If you’ve ever worked with SQL
before though, you’ll find the query editor fairly
straightforward. As you can see, we’ve just counted
up all of the accounts, found a random customer,
and looked up the ten most recent transactions
in all of their accounts using the same general
query structure that we’re accustomed to.
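For reference, the queries we just ran look something like this sketch; the table names match what the crawler discovered, while the column names are assumptions about the demo schema:

```python
import boto3

athena = boto3.client("athena")

# Column names (account_id, customer_id, transaction_date) are assumptions.
queries = [
    "SELECT COUNT(*) FROM accounts",
    "SELECT * FROM customers ORDER BY random() LIMIT 1",
    """
    SELECT t.*
    FROM transactions t
    JOIN accounts a ON t.account_id = a.account_id
    WHERE a.customer_id = 42
    ORDER BY t.transaction_date DESC
    LIMIT 10
    """,
]

for query in queries:
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "bank_demo"},
        # Athena needs somewhere to write results; this prefix is a placeholder.
        ResultConfiguration={"OutputLocation": "s3://bank-demo-exports/athena-results/"},
    )
```

Note that `start_query_execution` is asynchronous; we would poll `get_query_execution` for completion and then fetch rows with `get_query_results`.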
The best part is that all of these queries we’re running
are being executed against the exported data that we
have in S3. Whether we need to run a high volume
of analytical queries, or even just a handful of very
slow queries involving columns that aren’t optimized
for search, we know that there will be absolutely
no impact on the database itself.
That’s all great, but we don’t want to export RDS
snapshots to S3 by hand every time.
Let’s automate this!
Amazon RDS uses the Amazon Simple Notification
Service (SNS) to publish notifications when certain
RDS events occur. These notifications can
trigger an AWS Lambda function to start a
snapshot export, and we can then use an
AWS Glue crawler to make the snapshot
data available for querying with Amazon Athena.
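At its core, that Lambda function only needs to do one thing. Here is a minimal sketch of the idea (this is an illustration, not the pipeline’s actual code; the SNS message field and the environment variable names are assumptions):

```python
import json
import os
import boto3

rds = boto3.client("rds")

def handler(event, context):
    """Start a snapshot export whenever an RDS snapshot event arrives via SNS."""
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        snapshot_arn = message["Source ARN"]  # field name is an assumption
        snapshot_id = snapshot_arn.split(":")[-1]
        rds.start_export_task(
            # Export identifiers allow only letters, digits, and hyphens.
            ExportTaskIdentifier=snapshot_id.replace(".", "-"),
            SourceArn=snapshot_arn,
            S3BucketName=os.environ["EXPORT_BUCKET"],
            IamRoleArn=os.environ["EXPORT_ROLE_ARN"],
            KmsKeyId=os.environ["EXPORT_KMS_KEY_ID"],
        )
```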
We’ll use the code provided in the 
`aws-samples/rds-snapshot-export-to-s3-pipeline`
repository to deploy an example of this setup.
After giving it the name of our database and the
name of an S3 bucket to create for the exports,
the AWS Cloud Development Kit (CDK) will
handle deploying everything.
Step-by-step instructions on how to do this are
provided in the repository, so we won’t walk
through all of that here, but we can see that
after a few minutes, CDK has created
everything we need to automatically export the
“bank-demo” database’s automated snapshots to S3.
Now let’s wait a few days and see what happens...
... OK, great!
Our Lambda function has been exporting all of
the snapshots taken after the pipeline was
set up, and we can now run our
Glue crawler to make this data available to us
in Athena. That’s going to take a little longer
this time, since the crawler has more data to go
through, but to avoid the wait in the future, we
could schedule the Glue crawler to run automatically
each morning, shortly after we expect our snapshot
exports to complete.
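A one-line schedule change handles that, sketched here with boto3 (the cron time is an assumption; pick one that trails your snapshot window):

```python
import boto3

glue = boto3.client("glue")

# Crawl every day at 06:00 UTC, after the nightly exports should be done.
glue.update_crawler(
    Name="bank_demo",
    Schedule="cron(0 6 * * ? *)",
)
```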
Our “On Demand” crawler just finished, though,
so let’s see what’s available now in Athena.
Unlike in the previous example, where we had one
table in the Glue data catalog corresponding to
each table that we had in the “bank-demo” database,
we now have several instances of each of those tables,
which correspond to the different days’ snapshots.
This means that we can run queries on not just the
latest snapshot, but also on the earlier versions
of our database in the older snapshots.
These snapshot exports in S3 and the tables in the
AWS Glue Data Catalog will build up over time, but
you can manage the exports automatically with
Amazon S3 object lifecycle policies, which transition
your snapshot exports through the different storage
tiers available, such as S3 Standard-Infrequent Access,
S3 Glacier, or S3 Glacier Deep Archive, as they age.
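Attaching such a policy with boto3 could look like this sketch (the day counts are assumptions to tune against your retention requirements):

```python
import boto3

s3 = boto3.client("s3")

# Age exports into colder storage tiers; the day thresholds are assumptions.
s3.put_bucket_lifecycle_configuration(
    Bucket="bank-demo-exports",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-out-snapshot-exports",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```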
For more information on S3 lifecycle policies,
the RDS Snapshot Export to S3 feature, or any of
the other services mentioned in this demo, please
visit the links in the video description below.
Thanks for watching!
