- Hello, everyone.
Welcome to our Spark + AI session
on Adaptive Query Execution:
Speeding Up Spark SQL at Runtime.
My name is Maryann.
And I'm a software
engineer from Databricks
working on Spark SQL.
I'm also a PMC member
of the Apache Calcite and
Apache Phoenix projects.
Today, it's gonna be a
60 minute co-presentation
by me and my fellow speaker Ke from Intel.
We'll start with what
Adaptive Query Execution,
also known as AQE, is,
and why we need AQE,
or the motivations behind it.
And followed by a high level introduction
to AQE's framework and workflow.
Then we take you through
the major optimizations
of AQE in Spark 3.0 using query examples
and explain the technical details
of how each of these optimizations work.
In the second half of this presentation,
my co-speaker Ke is gonna do a live demo
of all these AQE features
to show the changes
AQE makes to the query
plans, and more importantly,
how these changes accelerate the queries.
Then she will present a
TPC-DS benchmark result
of AQE in Spark 3.0.
Last but not least,
she'll also share Intel's
customer experience
with AQE in production.
Adaptive Query Execution has
been a well-studied problem
in database literature.
Spark first implemented
this idea in version 1.6,
but with rather limited
functionality and extensibility.
The Intel big data team later
prototyped and experimented
with a more advanced version of AQE,
and then worked with Databricks together
to improve and ship this
version in Spark 3.0.
What exactly is adaptive query execution?
Simply put, AQE is a query optimization,
or to be more accurate, a re-optimization,
that occurs during query execution.
Unlike traditional query optimizations,
which are conducted based
on pre-collected statistics
before the query starts,
AQE is a dynamic process
of query optimization that happens
in the middle of the query execution
using runtime statistics
as its optimization input.
In other words, AQE adapts
and adjusts a query plan
according to the actual characteristics
of the data it discovers at runtime.
Now the question is, why do we need AQE
or what are the challenges
that AQE aims to tackle?
Spark has had a cost-based
optimizer since version 2.2,
which tries to pick the best plan
based on the input size
and cardinality estimate.
But the problem is, when pre-collected
stats are out of date
or the desired stats
are not even available,
the estimate can become inaccurate.
And as a result, the Spark
compiler might end up
generating less optimal query plans.
On the other hand, collecting
comprehensive stats
and keeping them up to date
can be an expensive thing to do as well.
So users often have to struggle
between the overhead of stats collection
and the accuracy of cardinality estimate.
But let's say you're fine
with all this overhead
or your dataset doesn't change as often,
so you always have the most
complete and up-to-date stats.
Would you then be free
of all these problems?
Unfortunately, no.
Some predicates are just
hard to estimate using stats.
For example, predicates
containing user defined functions
can be very much just like a
black box, making selectivity
or cardinality estimate
literally impossible.
Some users may try to
rely on optimizer hints,
but they do not always work either,
especially with rapidly evolving data
where previous assumptions
of the data characteristic
might not hold for the latest dataset.
That's why we wanna introduce you to AQE,
which is targeted at solving these issues
by basing optimization decisions
on accurate runtime statistics.
As AQE happens in the
middle of query execution,
one might be curious, what's the timing
of AQE performing optimizations
and adjusting query plans?
Or in other words, at which
exact point does this happen?
To answer this question,
let's first look into the
execution flow of Spark operators.
Spark operators are usually executed
in a completely distributed
and pipelined fashion
except exchange operators
like shuffle or broadcast.
Shuffle or broadcast
exchanges, break the pipeline
and divide a query into query stages.
One query stage usually maps
to a Spark job or Spark stage.
At the end of each query stage,
intermediate results
are materialized on disk
so that they can be copied
and transferred across the network
to form a new distribution
as required by the next query stage.
Because of this inherent property
of being the pipeline break point,
and a materialization point,
we choose query stage boundaries
as the time when AQE performs
runtime optimizations.
This means zero disruption to
the existing execution flow.
And also it's one of the easiest places
to get runtime statistics
both dataset-wise and partition-wise.
The query from the right hand side example
contains two query stages.
The operators, scan and partial aggregate,
in the first stage run
in distributed parallel tasks.
And at the end, the shuffle materializes
the stage's output onto disks,
partitioned in a way that
records having the same group-by key
are co-located in the same partition.
The second stage then
starts with parallel tasks
reading the partitioned outputs
from the first stage,
which form a new distribution.
The final aggregate in
the second stage now
operates on this new distribution,
before the following shuffle
materializes the output again,
partitioned on the
order-by key this time.
Both of these shuffle
materialization points
from the first stage and the second one,
are gonna be where AQE jumps in
and tries optimizing the query plan
based on the statistics
returned from these shuffles.
Now, by putting things all together,
we can get this workflow
of adaptive query execution.
The very first step is to
kick off all leaf stages,
which means stages that do not depend
on the input of any other stages.
Then we wait for any of the running stages
to complete, which then
triggers AQE's next steps.
Because as soon as one
or more stages are done,
it means two things.
First, that part of the query
has runtime statistics coming in
and can provide new insights
for AQE optimizations.
Second, other stages that are dependent
on the finished stages may
now have their dependency
requirements satisfied and
are ready to be started.
So the next two steps are one,
to run the optimization and
update the current query plan.
And two, start the query
stages with dependencies cleared.
We keep repeating these steps
of performing optimizations
and starting new stages
whenever a running stage finishes.
This brings the AQE process to its end
once all the stages have been completed.
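To make this workflow concrete, here is a minimal Python sketch of that loop. It is a toy model, not Spark's actual scheduler: stages run one at a time here, while real Spark launches ready stages concurrently, and the `run_stage` and `reoptimize` callbacks are made-up names for illustration.

```python
def aqe_loop(stages, deps, run_stage, reoptimize):
    """stages: list of stage ids; deps: dict mapping stage -> set of prerequisites.
    run_stage(s) executes a stage and returns its runtime statistics.
    reoptimize(stats) re-plans the query using the statistics gathered so far."""
    finished, stats = set(), {}
    # Step 1: kick off every leaf stage (no dependency on other stages).
    ready = [s for s in stages if not deps[s]]
    while len(finished) < len(stages):
        stage = ready.pop(0)             # real Spark runs ready stages in parallel
        stats[stage] = run_stage(stage)  # materialize output, collect statistics
        finished.add(stage)
        reoptimize(stats)                # re-optimize with accurate runtime stats
        # Start any stage whose prerequisites have now all completed.
        ready += [s for s in stages
                  if s not in finished and s not in ready and deps[s] <= finished]
    return stats
```

The loop mirrors the steps above: start leaf stages, and every time a stage finishes, update the plan and launch whatever became unblocked, until everything is done.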
Now that we've gone through
AQE's framework and workflow,
we'll introduce the
three major optimizations
AQE has implemented in Spark 3.0.
First one is the dynamic coalescing
or combining of shuffle partitions.
Second one is join strategy switching
basically from sort-merge
join to broadcast hash join.
And the last one is the
dynamic skew join optimization.
Let's first look at a dynamic coalescing
of shuffle partitions.
Shuffle exchanges often have a big impact
on query performance and getting the right
partition number is what gets you
the best shuffle performance.
For example, over-partitioning
can make partition sizes too small,
and the Spark jobs will end up
with more frequent and inefficient I/Os
and higher latencies
because of the scheduling
and task setup overhead.
On the other hand, though,
partitions being too large
usually leads to more serious problems,
as there will be greater
memory pressure in each task.
And if it goes beyond the limit,
Spark operators will
resort to disk spilling
and that slows down the query even more.
This is often, though not always,
the result of under-partitioning,
which means your query
works with a large dataset,
but the partition number is not scaled up
to match the data size.
And then what makes
tuning a partition number
even more difficult, is the
fact that the data size changes
at different points of
the query execution.
Some operators for example,
filter or aggregate,
bring the size down, while
other operators like join
can blow it up, which
means you can't cater
to all parts of the query with
just one partition number.
But unfortunately, Spark
uses only one such number
for the entire query.
And that's why AQE does
this dynamic coalescing
of shuffle partitions.
So you can start with an initial
partition number big enough
to accommodate the largest
data blow-up point in your query,
and then rely on AQE to
coalesce the partitions
after each stage, suiting
the partition number
to the actual data size as it
changes throughout the query.
Let's use this example to understand
when and how the query plan
is changed by AQE's partition coalescing.
This is the same example we used earlier
to explain the idea of query stages.
In this query, we select from table T
and group by column X and then in the end,
order by aggregate expression average Y.
The initial plan comes
out as a partial aggregate
on top of a table scan,
followed by a shuffle
on the group-by key.
Then there is a final aggregate
followed by another shuffle
on the order by key.
And finally, the sort.
We set the initial partition number as 50.
So, each shuffle operator
here has partition number 50.
As mentioned earlier, we'll
first start and run leaf stages
and in this case, it's the shuffle stage
on the bottom containing the
partial aggregate and the scan.
Once this stage is complete,
we can see that the overall
data size is 650 megabytes
with each partition
averaging around 13 megabytes,
which is a little too small.
So the AQE optimizer
figures those partitions
can be coalesced to 10
bigger partitions instead.
As a result of the first
round of AQE optimizations,
we can see that this new execution plan
is now added with a coalesce
node on top of the first stage.
Next, we figure that stage two,
containing the final aggregate,
can be triggered.
Meanwhile, the coalesce node in the new plan
means the second stage
will start with 10 tasks
instead of the static pre-planned 50.
Now, as stage two finishes,
we see that the partition sizes,
being six megabytes on average,
are even smaller after
the final aggregate.
Again, AQE applies a new coalesce node
on top of the second stage
with only five partitions.
At this point AQE is done with its job,
as we can see all the shuffle
stages have been completed.
The rest of this query will now
run with five parallel tasks
based on the latest plan generated by AQE.
Next, let's take a peek
inside this coalesce operator
and see what it does to decrease
the number of shuffle partitions.
This is what a regular shuffle process
without coalescing looks like.
For simplicity, we just use five
as the initial partition number
and assume there are only two map tasks.
The first step of a shuffle
is the map side partitioning
based on partition keys.
Since our partition number here is five,
the partition keys will be hashed
into five different groups.
And the two map tasks here
partition their data into five sections
corresponding to each key group.
The second step of a regular shuffle
is to have one reducer for each key group
and each of these reducers will read data
belonging to their respective key group
from all the mappers.
So on the right hand side of this diagram,
you can see that reducer one
reads all the data from
the first key group,
which is marked red,
and reducer two reads data
from the second key group, marked purple,
and so on and so forth.
Now, this is what a
coalesce shuffle looks like.
The first step, the map side partitioning
remains exactly the same as
that of a regular shuffle.
But the difference lies
in the second step.
It is obvious that adjacent
partitions corresponding
to key groups two, three and
four are small partitions.
So that's why we have a coalesce operator
to combine them here.
Now, instead of setting up five
reducers, one for each key group,
the coalesce shuffle combines data
from the key groups two, three and four,
so that there are now only three reducers,
with reducer two reading
from the initial partitions
two, three and four.
Operators that follow this coalesce node
in the same query stage will in turn
run with three partitions instead of five.
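The combining rule just described can be sketched in a few lines of Python. This is a simplified model: real Spark coalesces adjacent partitions toward an advisory size (spark.sql.adaptive.advisoryPartitionSizeInBytes) and also respects a minimum partition number, neither of which is modeled here.

```python
def coalesce_partitions(sizes, target):
    """Greedily merge *adjacent* shuffle partitions until each merged group
    reaches roughly the target size. Returns a list of (start, end) index
    ranges; one coalesced reducer reads each range from all the mappers."""
    groups, start, acc = [], 0, 0
    for i, size in enumerate(sizes):
        acc += size
        if acc >= target:            # group is big enough: close it here
            groups.append((start, i))
            start, acc = i + 1, 0
    if start < len(sizes):           # leftover tail becomes the last group
        groups.append((start, len(sizes) - 1))
    return groups
```

For example, five partitions of sizes [70, 10, 20, 25, 60] with a target of 50 collapse into three groups, [(0, 0), (1, 3), (4, 4)], so three reducers run instead of five, much like key groups two, three and four being merged in the diagram.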
Now, let's come to our second
AQE optimization today,
the dynamic join strategy selection.
We all know that broadcast
hash join runs faster
than the sort-merge join,
and that's why Spark
picks broadcast hash join
over other join strategies
if one side of the join is small enough
to fit well in memory.
Like we said earlier, the estimates
can go wrong for a number of reasons,
for example missing or stale stats,
hard-to-predict predicates, et cetera.
For Spark join strategy selection,
that means the opportunity
of doing broadcast hash joins
can be missed sometimes.
AQE tries to remedy this situation
by switching over to
broadcast join at runtime,
once the query stage from
either side of the join
finishes with a data size
that falls below the broadcast threshold.
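That runtime decision boils down to a size check against the threshold. Here is a hedged sketch: the 10 MB default corresponds to spark.sql.autoBroadcastJoinThreshold, but the function name and return shape are made up for illustration.

```python
def choose_join_strategy(left_bytes, right_bytes,
                         broadcast_threshold=10 * 1024 * 1024):
    """Once a join side's materialized size is known at runtime, switch from
    sort-merge join to broadcast hash join if that side fits under the
    threshold (10 MB being the usual default). Returns the chosen strategy
    and which side, if any, gets broadcast."""
    if min(left_bytes, right_bytes) <= broadcast_threshold:
        side = "left" if left_bytes <= right_bytes else "right"
        return ("broadcast_hash_join", side)
    return ("sort_merge_join", None)
```

With the numbers from this example, a side estimated at 25 megabytes that actually materializes at eight megabytes falls under the 10 megabyte default, so the plan switches.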
This is an example
of how dynamic join
strategy switching works.
This example query has a filter
on one of the join tables
and it's hard to get a good estimate
on the selectivity of that filter.
And for this reason, the
Spark static planning
picks the sort-merge join.
As AQE starts, we have two
leaf stages in this case,
so we'll start both of
them at the same time.
The second stage turns out
to be rather small and finishes quickly.
The estimate was 25 megabytes,
but the actual size is only
eight megabytes, which falls below
the default broadcast
threshold of 10 megabytes.
The completion of this stage
triggers AQE optimizations, in which AQE
changes the query plan
from the originally planned sort-merge join
into a broadcast hash join.
As we can see in the new plan
that shows up in the second step here,
the sort operators
have been taken away.
And meanwhile, there is now
a new broadcast query stage,
which is to be kicked off right afterwards
in the third step when we start
the next batch of stages
that are ready for execution.
The third and last AQE optimization
we'll talk about today is the
dynamic skew join handling.
Data skew is a problem when
data is not evenly distributed
leading to some partitions being
much bigger than the others.
This can cause significant
performance downgrade,
especially with sort-merge join.
Those individual long-running tasks
will become stragglers,
slowing down the entire stage.
And what is worse is that disk spilling
usually happens in those
especially large partitions,
doubling the effect of the slowdown.
Well, thanks to the partition stats
it gets after each stage,
AQE is in a unique position
to detect such skew.
And once a skew is detected,
AQE splits those skewed partitions
into smaller sub-partitions,
which eliminates both the straggler
and the disk spilling problems.
This is how the skew join optimization
happens in the AQE workflow.
Similar to the previous example,
this is a join query
planned as a sort-merge join.
Both stage one and stage
two are leaf stages
and are started simultaneously
in the first step.
When stage two finishes, no
optimizations can be applied
and the plan stays the same.
And when stage one finishes,
it shows an apparent skew by having
one partition over
250 megabytes in size,
and the others around 55 megabytes.
This triggers the skew join optimization
which adds a skew shuffle reader
on both sides of the join.
The next two slides will explain
what a skew shuffle reader does,
and why the non-skew side
also needs such a skew reader.
First, this is how a
regular sort-merge join
without skew optimization works.
Again for the purpose of simplicity,
we use four as the initial
partition number here.
Both join tables, table A and table B
will first be partitioned on the join key.
And the partitions that
fall in the same key group
will be sorted and then joined together
in their respective tasks.
For example, in a diagram here,
partition A0 from table A
will be joined to
partition B0 from table B.
And partition A1 joined to
partition B1, et cetera.
The problem here with A0
being especially large,
is that it will take the first task
a much longer time to complete
than the other three tasks,
which slows down the whole query.
Now in order to solve the skew problem,
skew join optimization splits
the skew partition A0 into
smaller sub-partitions
so that each task containing
the new sub-partitions,
now deals with the input of a smaller size
and can avoid disk spilling.
After the split, A0 now becomes
A0-S0, A0-S1, and A0-S2.
Each of these new sub-partitions
needs to be joined with partition B0
to ensure we
still get a correct result
after changing the
partition layout of table A.
In order to do that, we
create three duplicates
of partition B0 to match the
new partitions from table A.
And that is why we need a skew reader
for the table B side even though
it does not have a skew itself.
After this skew join optimization,
the sort-merge join now
runs with six partitions
from the original four,
which means a more balanced load
and no disk spilling.
And more importantly, a
speed up of the query.
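Putting the detection and the split together, here is a rough Python sketch. The factor-times-median plus absolute-threshold rule mirrors the idea behind spark.sql.adaptive.skewJoin.skewedPartitionFactor and skewedPartitionThresholdInBytes, but the function, parameters and numbers are illustrative only, not Spark's exact defaults.

```python
import statistics

def split_skewed_join(a_sizes, skew_factor=5, threshold=256, target=64):
    """Sketch of skew join handling: a partition counts as skewed when it is
    both larger than skew_factor times the median partition size and larger
    than an absolute threshold. A skewed partition i on side A is split into
    ceil(size / target) sub-partitions; conceptually, partition i on side B
    is duplicated once per sub-partition so every pair is still joined.
    Returns the list of (a_partition, sub_index) join tasks."""
    med = statistics.median(a_sizes)
    tasks = []
    for i, size in enumerate(a_sizes):
        if size > skew_factor * med and size > threshold:
            splits = -(-size // target)          # ceiling division
            tasks += [(i, s) for s in range(splits)]
        else:
            tasks.append((i, 0))
    return tasks
```

With one 700 megabyte partition among 100 megabyte ones and a 250 megabyte split target, the skewed partition becomes three sub-partitions, turning four join tasks into six, just like A0 in the example above.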
All right, this is the
end of my part today.
And you must be excited
to see how AQE works in a real cluster.
So let me hand it over to Ke,
who is going to give you
an awesome AQE live demo.
- Firstly, we need to enable AQE.
It is disabled by default.
And you can use the umbrella configuration
spark.sql.adaptive.enabled to turn it on
when you decide to use AQE features.
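For reference, the AQE-related settings this demo touches look roughly like the following in Spark 3.0, written spark-defaults style; the values here are illustrative, so check the documentation for your own version.

```
# umbrella switch, off by default in Spark 3.0
spark.sql.adaptive.enabled                             true
# per-feature switches, effective once AQE is on
spark.sql.adaptive.coalescePartitions.enabled          true
spark.sql.adaptive.skewJoin.enabled                    true
# demo-specific tuning mentioned later in this talk
spark.sql.shuffle.partitions                           10000
spark.sql.adaptive.coalescePartitions.minPartitionNum  1
```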
We can click the red
button to execute the SQL.
Okay, the SQL is done.
And we have already
enabled AQE in this demo.
Secondly, we need to create two tables
and prepare the test data for this demo.
The first one is the
item table with two columns,
ID and price.
The second one is the
sales table with three columns:
ID, quantity and date.
In order to show the performance
of dynamically optimizing skew join,
the sales table is skewed,
with the value 100 making up almost 80%
of the item ID column.
I have already prepared the
test data for the two tables.
So we don't need to
generate the data again.
Okay, let's begin the first feature,
dynamically coalesce shuffle partitions.
The test SQL is to get
the sum of sales quantity
grouped by sales date.
In order to compare the performance
between enabling and disabling AQE,
we also ran the SQL without
AQE features previously.
And because the execution time is long,
we only show the execution result
and will not run the SQL
without AQE in this demo.
Generally, Spark sets one
common partition number
for the whole query stage,
regardless of the data size inside it.
Then, when the data size is very small
after a filter or aggregation,
the performance will downgrade
if the partition number is too large,
because of the task scheduling
and startup overhead.
In order to demonstrate the overhead,
we set the shuffle
partition numbers to 10,000.
And we can see the execution time
when AQE is disabled is about 14 seconds.
How much time will it
cost when we enable AQE?
Okay, let's begin the SQL with AQE.
And before that, we set
the AQE configurations
for this feature.
Of course, the first one is to enable AQE.
And the second one is to be the same
as the disabled configuration, with 10,000
as the shuffle partition number.
And the last one is to set
the min partition number to one,
because the data size after
aggregation is very small in this demo.
You don't need to set it to
one in your real use case.
The default value is the parallelism
of the cluster you use.
Okay, let's run this SQL.
The SQL is done.
And we can see the execution
time is very short,
only about three seconds,
which is about a four
times performance improvement
compared with running without AQE.
Then we can go to the Spark UI
to see the change in the query plan.
Hold on a moment.
And then, we can go to the SQL tab.
Okay, the query plan is coming.
We can see this query is simple.
The query first scans the sales table
and then executes
an aggregation operator.
We can see the exchange data size
after the aggregation
is only 700 kilobytes,
which is very small.
So AQE will combine the smaller partitions
into new, bigger partitions
to execute the shuffle
operator in this query,
which means there will be fewer IO requests
and less scheduling overhead.
This is the demo of coalescing
the shuffle partition number.
Next, let's go to the second feature,
dynamically switching the join strategies.
For the second feature,
the test SQL we used
is to get the total sales per
month grouped by sales date
for items with the price lower than 10.
Similar to the above approach,
we also run this SQL without AQE features
to compare the performance with AQE.
We can see the execution
time was very long,
about seven minutes.
And we can go to the Spark UI
to see the query plan without AQE.
I have already opened
the query plan earlier.
The SQL first scans and filters
the item table,
then joins the two tables
with a sort-merge join.
And we can find the exchange
data size of the item table
after scan and filter is very
small, only 3.3 megabytes.
In the planning phase, Spark
estimates the data size,
and sometimes the estimation
may be wrong for some reasons.
So the sort-merge join is chosen
to perform the join operator,
which causes the performance to fall
due to the high shuffle operator overhead.
Okay, let's go back
and run the SQL with AQE.
The SQL is running.
We can hold on a moment.
The SQL is done, and then we can see
the execution time is only 31 seconds,
which is about a 13
times performance improvement
compared with running without AQE.
Next, we can go to the Spark UI
to see what changes make
the improvement when we enable AQE.
Okay, we can see the biggest change
is that Spark switches the sort-merge join
to broadcast hash join,
because when you enable AQE,
Spark will re-optimize the query plan
based on the real data size.
And when one join side is smaller
than the broadcast hash join threshold,
Spark will pick
broadcast hash join
to perform the join operator.
And we know the data size of the item table
after filter is only 3.3 megabytes.
So the join is optimized
to broadcast hash join
when you enable AQE,
which can read the shuffle files locally
to reduce the network traffic overhead.
This is the demo of dynamically
switching join strategies.
And last, let's go to the feature of
dynamically optimizing skew join.
The test SQL is to get the total
sales amount grouped by sales date.
And the partition in the sales table
containing the value 100 as item ID
is much larger than the other partitions.
When AQE is disabled, the
execution time is very long
because of the skew in the use case,
which can seriously bring
down the performance
due to the longest task.
And we have already executed this query
without AQE previously.
We don't need to run it again.
And the execution time
is about seven minutes.
Then, let's run this SQL with AQE
and see how much time it
costs when we enable AQE,
and how Spark resolves the skew at runtime.
The SQL is running and
we can hold on a moment.
The SQL is done.
And we can see the execution time
is only 48 seconds, which
gains about an eight times
performance improvement compared
with running without AQE.
Next, we can go to the Spark UI
to see why enabling AQE gains improvement.
Okay, we can see the SQL
is to join the table of sales and item
after scan and filter.
And because the table of item after filter
is about 352 megabytes, which is bigger
than the broadcast hash join threshold,
the query cannot be optimized
to broadcast hash join.
And we know from earlier
that the sales table is skewed.
When we enable AQE,
Spark detects the skewed partition
and then splits it
from one big partition
into 32 smaller partitions at runtime.
We know the skew join issue is
due to the longest task
that makes the whole query slow.
And after this splitting,
the evenly sized tasks take
almost the same execution time.
Then the skew issue can be resolved.
Okay, this is the end of the live demo.
At last, I will show you
the performance of AQE
in TPC-DS and production.
Here is the cluster setup
for the three-terabyte
TPC-DS test environment.
In our test, we have five server nodes
with Intel Xeon CPUs.
Each has 96 cores, 384
gigabytes of memory and seven SSDs.
And the network is 10 gigabit.
And we run the Spark master
with the commit ID 0b6aae.
Okay, we run the TPC-DS
three-terabyte benchmark
and compare the performance
of Spark SQL with and without
Adaptive Query Execution.
And we found many queries
gain performance improvements.
The first two queries can gain
over 1.5 times improvement,
and almost 37 queries can gain
over 1.1 times improvement.
And when we analyze the
queries in more detail,
we can see why we get the
performance improvement.
The first one is dynamically
coalescing shuffle partitions.
And we pick query eight
to show the performance improvement.
Initially, we set a large
partition number of 1,000.
In this query, we can see
that we have two stages;
the data size is not very
large, and the execution times
are respectively five seconds and 11 seconds.
And with AQE, Spark can dynamically
coalesce the shuffle
partition number to about 700.
And the execution times can be
two seconds and six seconds,
which gains about a two
times performance improvement.
This is because we have
less scheduler overhead
and less task startup time.
And it also means there will
be fewer disk IO requests,
because we combine the
contiguous partitions
into one single request.
And we also observe less data
shuffled after the aggregation.
So in this case, we get
the performance improvement
by dynamically coalescing
shuffle partitions.
This feature simplifies the tuning
of shuffle partition number
when running queries.
And you don't even need
to set a proper shuffle
partition number to fit your data size.
The query can pick the proper
shuffle partition number
at runtime once you set a
large enough initial number
for shuffle partitions.
The second feature is dynamically
switching join strategies.
Here I list an example
with one of the queries.
Without AQE, Spark
picks the sort-merge join
to join the two tables in the planning phase
because of the wrongly estimated data size.
So in this query, the
stage took a long time
to finish the join because
of the shuffle operator overhead,
and in total it takes about 30 minutes.
When you enable AQE, Spark can get
the real data size at runtime.
Then it will re-optimize
the sort-merge join
to broadcast hash join
once one join side's data size
is smaller than the broadcast
hash join threshold.
We can see from the
images that one join side
is only about 10 megabytes,
which can fit in memory.
So Spark optimizes the sort-merge join
to broadcast hash join at runtime.
And after the optimization,
the stage only costs about three
minutes to finish the join,
which is about a four times improvement
when you enable the feature
of dynamically switching join strategies.
This feature can read
the shuffle files locally
to save network traffic,
and it also changes random
reads into sequential reads.
Because the data generated in
TPC-DS is distributed evenly,
we don't gain the
performance improvement
from the dynamically optimizing
skew join feature
in the three-terabyte TPC-DS.
But AQE can gain improvement
by resolving skew join
issues in real workloads.
Here, results are shared
by two big companies in China.
The first one is shared by
one of the largest
e-commerce companies in China.
AQE helped them resolve
critical data skew issues
and achieve significant
performance improvements
for their online business queries.
And the best query
can gain about 18 times improvement
in the production environment.
The second one is shared
by one of the largest
internet companies in China.
And AQE can achieve a five
times performance improvement
in the online environment.
We also analyzed one skewed query
in the online environment.
The SQL is to select
the user's invoice details
based on the sales order
ID, invoice accounting ID,
and the invoice type ID.
The execution time of the SQL is
very long without AQE.
After a detailed analysis,
we found the join key
of the sales order ID,
invoice account ID and the invoice type ID
contains skewed partitions.
For example, the average
record count per sales order ID
without null values is almost 2,000.
However, the skewed records with null values
can be up to 15 million records,
which causes serious skew
join issues in the query.
When you enable AQE,
Spark splits the skewed partitions
into smaller partitions; then
those tasks will take roughly
the same execution time,
which can accelerate the whole query.
And we can see the best one
gains about an 18 times improvement.
This is my part, about the AQE demo
and performance both in
TPC-DS and production.
Okay, thanks everyone for
listening to the demo of AQE.
Do you have any questions?
