Benchmark Report - Amazon Redshift


Cloud Analytics

Performance Benchmark:
Amazon Redshift

By David Mariani & Krasimir Kovachki


2021-Q3
TABLE OF CONTENTS

Executive Summary
Introduction
  Leveraging Amazon Redshift for BI and Analytics
  The Power of AtScale and Amazon Redshift

Benchmarking Methodology
  Benchmark Measurements
  Benchmark Dataset
  Benchmark Queries
  Test Harness
  Configuration Tested
  Query Performance for a Single User Test Methodology
  Query Performance with Concurrency Test Methodology
  Compute Cost Calculations

Summary Results
  Query Performance Test Results
  Concurrent Query Performance
  Median Query Time by TPC-DS Query Test Results
  Compute Cost Test Results
  Complexity Test Results

Conclusion

© 2021 AtScale Inc. All rights reserved.


Executive Summary
This benchmarking study was conducted to quantify the benefits of using the AtScale semantic
layer platform with the Amazon Redshift data platform to manage BI and analytics workloads. The
comparative analysis was based on four defined measurements: Query Performance, Concurrent
Query Performance, Compute Cost, and SQL Complexity. Using the standard TPC-DS (10TB)
benchmarking framework, measurements were taken for raw Amazon Redshift and for AtScale on
Amazon Redshift. The results show clear advantages to combining AtScale with Amazon Redshift to
accelerate and optimize BI and analytics programs.

Improvement Factor with AtScale

Test                             Amazon Redshift
Query Performance¹               11x faster
Concurrent Query Performance²    31x faster
Compute Cost³                    3.7x cheaper
Complexity⁴                      76% less complex SQL queries

Figure 1: Improvements with AtScale

This analysis is a refresh of a study first done in 2020 using the same methodology. The results
show that Amazon Redshift's raw performance has improved since then, and that the combined
solution still delivers clear benefits.

¹ Elapsed time for executing 1 query five times
² Elapsed time executing 1 (x5), 5, 25, 50 queries
³ Compute costs for cluster time for user concurrency test
⁴ Complexity score for SQL queries for number of: functions, operations, tables, objects & subqueries (AtScale = 258, TPC-DS = 1,057)



Introduction
The enterprise has entered into a new era of data warehousing. Driven by the increasing popularity of
the public cloud, new cloud-based data platforms have become the dominant choice for enterprises
managing their data. By offering customers the power of a relational, scale-out data platform without
the overhead of managing it, cloud data platforms promise to make more data available at a lower cost
with fewer data management headaches.

Leveraging Amazon Redshift for BI and Analytics


Amazon Redshift is a great choice of cloud data platform for a number of reasons. First, Amazon
Redshift was the first cloud data platform to hit the market, and it is therefore the most mature and
well-known cloud data platform, with a wide range of third-party tooling support. Next, Redshift is the
most economical choice of all the cloud data platforms tested and delivers very good performance for
the money.


Finally, Amazon Redshift offers Redshift Spectrum, which provides direct query access to files on S3
and seamlessly extends Redshift's reach into data lakes.



The Power of AtScale and Amazon Redshift
While cloud data platforms reduce the maintenance cost and scaling headaches of managing data
infrastructure for IT, they don’t make data any easier to understand or access for analytics consumers,
nor do they help IT better predict and control cloud costs. The AtScale platform works natively with
cloud data platforms to deliver an analytics semantic layer for business intelligence (BI) and data
science teams.

The AtScale semantic layer provides the following benefits:

1. It presents a consistent set of business-friendly metrics for BI and data science teams
to consume data with the tools of their choice.

2. It provides an integration layer to support analytics discoverability, governance,
and security.

3. It accelerates end-to-end query performance while optimizing data platform resources
and costs.

By leveraging a graph-based semantic model, the AtScale platform sends queries to Amazon Redshift
using its data virtualization engine and pushes workloads to the Amazon Redshift platform. By
automatically creating and managing aggregate tables on Amazon Redshift based on user query
patterns, AtScale avoids costly atomic table scans and delivers superior query performance by
rewriting queries to access those aggregate tables.
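
To make the aggregate-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not AtScale's implementation and does not run against Amazon Redshift: the table names (`sales`, `agg_sales_by_item_year`), columns, and rows are hypothetical, and the "rewrite" is done by hand. It only illustrates that a query answered from a pre-computed aggregate can return the same result as one that scans the atomic rows.

```python
import sqlite3

# Hypothetical toy schema; stands in for a large fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (item_id TEXT, sold_year INTEGER, amount REAL);
    INSERT INTO sales VALUES
        ('A', 1999, 10.0), ('A', 1999, 5.0), ('B', 1999, 7.5), ('A', 2000, 2.0);

    -- Pre-computed aggregate: one row per (item_id, sold_year).
    CREATE TABLE agg_sales_by_item_year AS
        SELECT item_id, sold_year, SUM(amount) AS total_amount
        FROM sales
        GROUP BY item_id, sold_year;
""")

# The query as a user would express it, scanning the atomic rows.
atomic_sql = """
    SELECT item_id, SUM(amount)
    FROM sales
    WHERE sold_year = 1999
    GROUP BY item_id
    ORDER BY item_id
"""

# The same question, rewritten by hand to read the smaller aggregate table.
rewritten_sql = """
    SELECT item_id, SUM(total_amount)
    FROM agg_sales_by_item_year
    WHERE sold_year = 1999
    GROUP BY item_id
    ORDER BY item_id
"""

atomic_rows = conn.execute(atomic_sql).fetchall()
rewritten_rows = conn.execute(rewritten_sql).fetchall()
print(atomic_rows == rewritten_rows)
```

In this toy case the aggregate has three rows instead of four; on a fact table with billions of rows the difference in data scanned is what would drive the speedup.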

In this study, we compare the performance, complexity and costs of Amazon Redshift with and
without the AtScale platform.



Benchmarking Methodology
Benchmark Measurements
This benchmark uses four key metrics to compare Amazon Redshift to Amazon Redshift + AtScale. The
metrics are designed to answer basic questions relevant to enterprise analytics leaders.

Query Performance
  Question: How fast can the cloud data warehouse answer a query for one user?
  Test: Run 20 TPC-DS queries for 1 user five times and measure the total elapsed time on a
  TPC-DS 10TB dataset.

User Concurrency
  Question: How do multiple users running queries affect performance and stability?
  Test: Run 20 TPC-DS queries for 5, 25 and 50 users one time and measure the total elapsed time
  on a TPC-DS 10TB dataset.

Compute Costs
  Question: How do query workloads and configuration impact your monthly bill?
  Test: Measure the total elapsed time or bytes read for the query and concurrency test on a
  TPC-DS 10TB dataset.

Semantic Complexity
  Question: How difficult is it to write the query to answer the business question?
  Test: Compare the raw TPC-DS SQL queries to the equivalent BI semantic layer queries on a
  TPC-DS 10TB dataset.

Figure 2: Benchmark Testing Topics



Benchmark Dataset
We used the TPC-DS benchmark v2.11.0 from the Transaction Processing Performance Council (TPC)
for our tests. We chose the 10TB (scale factor 10,000) version for this benchmark to better measure
the scalability limits of each platform and to simulate a typical enterprise workload. With its largest
fact table (store_sales) at 28+ billion rows and its largest dimension (customer) at 65 million rows,
this version is a significant scale challenge for most data platforms. In addition, the TPC-DS
benchmark is ubiquitous amongst data warehouse vendors, and we felt it represented a reasonable
real-life analytics schema and set of queries.

Table Name                Row Size (bytes)    Row Count
call_center               305                 54
catalog_page              139                 40,000
catalog_returns           166                 1,440,033,112
catalog_sales             226                 14,399,964,710
customer                  132                 65,000,000
customer_address          110                 32,500,000
customer_demographics     42                  1,920,800
date_dim                  141                 73,049
household_demographics    21                  7,200
income_band               16                  20
inventory                 16                  1,311,525,000
item                      281                 402,000
promotions                124                 2,000
reason                    38                  70
ship_mode                 56                  20
store                     263                 1,500
store_returns             134                 2,879,970,104
store_sales               164                 28,799,983,563
time_dim                  59                  86,400
warehouse                 117                 25
web_page                  96                  4,002
web_returns               162                 720,020,485
web_sales                 226                 7,199,963,324
web_site                  292                 78

THE TPC-DS 10TB DATASET HAS:
1. Multiple fact tables
2. Large fact tables
3. Large dimensions

Figure 3: TPC-DS 10TB Table Sizes
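
As a rough sanity check on the scale factor, the row sizes and row counts in the table can be multiplied out. The sketch below assumes that "Row Size" is an average row size in bytes and uses decimal terabytes (10^12 bytes); both are assumptions on our part, and the result is an estimate of raw data volume, not of storage used on Amazon Redshift.

```python
# (row size, row count) per table, copied from the table sizes listed above.
# Assumption: row size is the average number of bytes per row.
TABLES = {
    "call_center": (305, 54),
    "catalog_page": (139, 40_000),
    "catalog_returns": (166, 1_440_033_112),
    "catalog_sales": (226, 14_399_964_710),
    "customer": (132, 65_000_000),
    "customer_address": (110, 32_500_000),
    "customer_demographics": (42, 1_920_800),
    "date_dim": (141, 73_049),
    "household_demographics": (21, 7_200),
    "income_band": (16, 20),
    "inventory": (16, 1_311_525_000),
    "item": (281, 402_000),
    "promotions": (124, 2_000),
    "reason": (38, 70),
    "ship_mode": (56, 20),
    "store": (263, 1_500),
    "store_returns": (134, 2_879_970_104),
    "store_sales": (164, 28_799_983_563),
    "time_dim": (59, 86_400),
    "warehouse": (117, 25),
    "web_page": (96, 4_002),
    "web_returns": (162, 720_020_485),
    "web_sales": (226, 7_199_963_324),
    "web_site": (292, 78),
}

# Estimated raw bytes per table: row size multiplied by row count.
sizes = {name: row_size * rows for name, (row_size, rows) in TABLES.items()}

total_tb = sum(sizes.values()) / 1e12   # decimal terabytes
largest = max(sizes, key=sizes.get)     # table with the most estimated bytes

print(f"estimated total: {total_tb:.2f} TB, largest table: {largest}")
```

Under these assumptions the total comes out slightly above 10 TB, with the three sales fact tables accounting for most of it.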



Benchmark Queries
We selected a representative set of 20 queries from the 99 TPC-DS queries to keep the run time and
costs of running the benchmarks within reason without having to reduce the data size. The queries
were chosen in no particular order and were selected to eliminate redundancy and to ensure the usage
of most tables. It was imperative to benchmark the cloud data warehouse vendors with the largest
data set we could afford to test, in order to reveal real-life differences in the respective platforms.

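The following 20 TPC-DS queries (listed by TPC-DS query number) were selected for the test:
2, 7, 13, 15, 26, 31, 33, 42, 48, 50, 52, 53, 55, 56, 60, 61, 71, 88, 96, 98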

Figure 4: TPC-DS Test Queries

Test Harness
To ensure consistency for concurrency tests, we ran queries using v5.4.1 of Apache JMeter. The
instructions, documentation, utility scripts, results, and JMeter JMX files can be found in our GitHub
repository and are available upon request.

We designed the JMeter test suites to run the above 20 queries in the following four configurations:
▲ 1 concurrent user, 5 loops (averaging the result to even out cold starts)
▲ 5 concurrent users, 1 loop
▲ 25 concurrent users, 1 loop
▲ 50 concurrent users, 1 loop



We originally planned to run a 100-thread user concurrency test for Amazon Redshift but encountered
challenges at the 100 concurrent user level. Running 100 concurrent users without the help of a
semantic layer like AtScale proved to be a scaling challenge that resulted in extended run times
caused by query queuing. Amazon Redshift has an option for managing user concurrency automatically
by spinning up additional clusters. We did not test this feature because we wanted to keep our
resource level fixed for apples-to-apples comparisons across the different data platform choices.
As a result, we only ran the 100-thread tests with AtScale.

Configuration Tested
The following Amazon Redshift configuration was used for the test:

Vendor             Configuration                              Compute Cost per Hour⁵
Amazon Redshift    dc2.8xlarge (6 nodes at $4.80 per node)    $28.80

Figure 5: Data Platform Configurations

For the test, we used Amazon Redshift’s “out of the box” configuration. We did not manually tune any
of the TPC-DS queries and used the same clustering scheme used in AWS Labs’ TPC-DS benchmark in
GitHub.

Query Performance for a Single User Test Methodology


To test raw query performance, we ran the 20 TPC-DS queries with one concurrent user five times and
calculated the average elapsed time to finish each query. The elapsed time is simply the difference
between the start and end time of the test as reported by JMeter. We disabled Amazon Redshift’s query
caching for this test.

⁵ Storage cost wasn’t factored in (only compute cost)



Query Performance with Concurrency Test Methodology
To test how each data warehouse performs with different levels of user concurrency, we ran each of
the 20 TPC-DS queries with 1, 5, 25 and 50 concurrent users using JMeter. We added a 750ms sleep
between each query start and used a single connection pool that was sized according to the number
of threads for the test. We used 1 loop (iteration) for the 5, 25, and 50 thread tests and 5 loops for the
1 thread test. The elapsed time is simply the difference between the start and end time of each thread
test as reported by JMeter. We disabled Amazon Redshift’s query caching for this test.
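
The shape of this test can be approximated outside JMeter. The following is a minimal Python sketch, not the JMeter suite used for the benchmark: `run_suite` and its parameters are hypothetical names, `run_query` is a caller-supplied function (for example one that executes SQL over a pooled connection), and the pause is applied after each query finishes, which only approximates JMeter's pacing between query starts.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_suite(run_query, queries, users, loops, pause_s=0.75):
    """Run every query `loops` times on each of `users` threads.

    Returns the elapsed wall-clock time in seconds for the whole thread group.
    """
    def one_user(_user_id):
        for _ in range(loops):
            for query in queries:
                run_query(query)
                time.sleep(pause_s)  # 750 ms pause, mirroring the report's setting

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        # list() forces completion and re-raises any exception from a worker.
        list(pool.map(one_user, range(users)))
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in for a real query call; records which queries were issued.
    issued = []
    elapsed = run_suite(issued.append, queries=range(20), users=5, loops=1, pause_s=0.0)
    print(len(issued), "queries issued in", round(elapsed, 3), "seconds")
```

The four configurations in the report would correspond to `(users=1, loops=5)`, `(users=5, loops=1)`, `(users=25, loops=1)` and `(users=50, loops=1)`.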

Compute Cost Calculations


Amazon Redshift charges per hour, per cluster node with options for higher powered node types. We
calculated the compute costs by multiplying the total end-to-end run time as reported by JMeter for the
concurrency test by the cluster compute cost per hour like so:

ConcurrencyRunTimeMinutes / 60 * ComputeCostPerHour

We explicitly excluded storage costs from our calculations. We found that storage cost was nominal
across all platforms and given that it’s a fixed cost, it was not subject to variation in our testing
scenarios.
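
The cost formula above is simple enough to express directly. This is a small sketch with hypothetical function names; the node count and per-node price are the ones listed in the configuration table, and the 60-minute run is an illustrative input rather than a measured result.

```python
def cluster_cost_per_hour(nodes: int, cost_per_node_hour: float) -> float:
    """Hourly compute cost of a cluster: nodes multiplied by the per-node hourly price."""
    return nodes * cost_per_node_hour


def compute_cost(run_time_minutes: float, cost_per_hour: float) -> float:
    """ConcurrencyRunTimeMinutes / 60 * ComputeCostPerHour."""
    return run_time_minutes / 60 * cost_per_hour


# 6 dc2.8xlarge nodes at $4.80 per node per hour.
hourly = cluster_cost_per_hour(6, 4.80)

# Illustrative input: a run lasting exactly one hour costs one hour of cluster time.
one_hour_run = compute_cost(60, hourly)

print(f"${hourly:.2f} per hour, ${one_hour_run:.2f} for a 60-minute run")
```

Because the cluster size is fixed, cost in this model scales linearly with elapsed run time, so any reduction in run time translates directly into a proportional reduction in compute cost.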



Summary Results
We also ran the same 20 TPC-DS queries through the AtScale platform for Amazon Redshift. AtScale’s
Acceleration Structures showed major benefits in accelerating query performance, improving user
concurrency and reducing compute costs. AtScale’s semantic layer also drastically reduced the
complexity of the TPC-DS queries by hiding the joins and calculations from consumers. The illustration
below shows the extent of the benefits AtScale provides on top of the Amazon Redshift data warehouse:

Query Performance⁶      11X faster
User Concurrency⁷       31X faster
Compute Costs⁸          3.7X cheaper
Semantic Complexity⁹    76% less complex SQL queries

Figure 6: Improvements with AtScale

⁶ Elapsed time for executing 1 query five times
⁷ Elapsed time executing 1 (x5), 5, 25, 50 queries
⁸ Compute costs for cluster time for user concurrency test
⁹ Complexity score for SQL queries for number of: functions, operations, tables, objects & subqueries (AtScale = 258, TPC-DS = 1,057)



Query Performance Test Results
For the query performance test, we ran our 20 TPC-DS queries 5 times each using JMeter with a single
thread. Even at a single concurrent user, we saw roughly an order of magnitude improvement using
AtScale on the Amazon Redshift data warehouse in this test (3.483 minutes without AtScale versus
0.330 minutes with AtScale).

[Bar chart: Elapsed Run Time (Minutes), 1 User - Redshift. No AtScale: 3.483; AtScale: 0.330]

Figure 7: Elapsed Run Time for 1 Thread



Concurrent Query Performance
For the user concurrency test, we ran consecutive JMeter suites configured to execute 1, 5, 25, and 50
queries at the same time to simulate user concurrency. Each test ran 1 iteration with the exception of
the 1 thread test which ran 5 iterations sequentially.

In this test, we saw a real impact on query performance under additional user concurrency load. To
be fair, Amazon Redshift offers an option for automated concurrency scaling. This option
dynamically adds more cluster resources to handle concurrency bottlenecks. We chose not to enable
this option in order to quantify performance for a fixed level of resources.

[Bar chart: Elapsed Run Time (Minutes), All Runs - Redshift. No AtScale Q3 2021: 143.0; AtScale Q3 2021: 4.6]

Figure 8: Elapsed Run Time for All Runs



Elapsed Time (Minutes) by Thread Group - Redshift

Concurrent users    No AtScale    AtScale
1                   3.48          0.33
5                   8.55          0.35
25                  37.53         0.78
50                  79.03         1.42

Figure 9: Elapsed Run Time by Thread



Median Query Time by TPC-DS Query Test Results
The following chart (logarithmic scale) illustrates the benefits of AtScale for each of the 20 TPC-DS
queries (by TPC-DS Query number) tested with a median reference line overlay for comparison. This is
the median elapsed query time for all runs (1, 5, 25, 50 concurrent users) so data platform load is taken
into account. Notice that for Amazon Redshift raw (without AtScale), the median query time is almost
2 minutes versus Amazon Redshift on AtScale at a median time of 1.8 seconds. For interactive business
intelligence, elapsed query times over 10 seconds are typically not acceptable to users, which may
force IT to use data extracts or external caching solutions instead.

[Two-panel bar chart: Average Query Time by Query (Seconds), All Runs - Redshift. Y-axis: Elapsed
Time (Seconds), logarithmic scale. X-axis: TPC-DS query number (2, 7, 13, 15, 26, 31, 33, 42, 48,
50, 52, 53, 55, 56, 60, 61, 71, 88, 96, 98). Top panel: No AtScale, median = 104.2 seconds. Bottom
panel: AtScale, median = 1.8 seconds.]

Figure 10: Average query time by TPC-DS query number with median



Compute Cost Test Results
You will also see the value that AtScale can bring to cost predictability. By minimizing the amount of
data scanned, AtScale takes less time to run queries, with fewer resources used, which means more
users can run queries at the same time (higher concurrency) without additional hardware or resources.

[Bar chart: Compute Cost, All Runs - Redshift. No AtScale: $68.39; AtScale: $18.31]

Figure 11: Compute Costs for All Thread Groups
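
The headline improvement factors can be re-derived from the chart values reported in Figures 7, 8 and 11. The sketch below simply divides the "No AtScale" value by the "AtScale" value for each measurement; `improvement_factor` is a hypothetical helper name, and the rounding is our reading of how the published 11x, 31x and 3.7x figures were obtained.

```python
def improvement_factor(without_atscale: float, with_atscale: float) -> float:
    """Ratio of the baseline measurement to the AtScale measurement."""
    return without_atscale / with_atscale


# Figure 7: elapsed run time for 1 user, in minutes.
query_performance = improvement_factor(3.483, 0.330)

# Figure 8: elapsed run time for all runs, in minutes.
concurrency = improvement_factor(143.0, 4.6)

# Figure 11: compute cost for all thread groups, in dollars.
compute_cost_ratio = improvement_factor(68.39, 18.31)

print(f"query performance: {query_performance:.1f}x")
print(f"concurrency:       {concurrency:.1f}x")
print(f"compute cost:      {compute_cost_ratio:.1f}x")
```

Rounded to the precision used in the summary, these ratios match the reported 11x, 31x and 3.7x.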



Complexity Test Results
The TPC-DS benchmark provides a good illustration of just how hard it can be to write SQL to answer
a simple business question. Translating tables and star schemas into business logic is not an easy
task. With today’s BI tools, our business users are spending more and more time dealing with data
engineering tasks rather than getting answers to their business questions.

For example, with query #60 of the TPC-DS benchmark, the business question is fairly straightforward
but the SQL to express it is not.

BUSINESS QUESTION:

What is the monthly sales amount for a specific month in a specific year, for items in a specific
category, purchased by customers residing in a specific time zone?

SQL TO ANSWER BUSINESS QUESTION:

TPC-DS Raw

with ss as (
  select
    i_item_id, sum(ss_ext_sales_price) total_sales
  from
    store_sales,
    date_dim,
    customer_address,
    item
  where
    i_item_id in (select
        i_item_id
      from
        item
      where i_category in ('Jewelry'))
    and ss_item_sk = i_item_sk
    and ss_sold_date_sk = d_date_sk
    and d_year = 1999
    and d_moy = 9
    and ss_addr_sk = ca_address_sk
    and ca_gmt_offset = -6
  group by i_item_id),
cs as (
  select
    i_item_id, sum(cs_ext_sales_price) total_sales
  from
    catalog_sales,
    date_dim,
    customer_address,
    item
  where
    i_item_id in (select
        i_item_id
      from
        item
      where i_category in ('Jewelry'))
    and cs_item_sk = i_item_sk
    and cs_sold_date_sk = d_date_sk
    and d_year = 1999
    and d_moy = 9
    and cs_bill_addr_sk = ca_address_sk
    and ca_gmt_offset = -6
  group by i_item_id),
ws as (
  select
    i_item_id, sum(ws_ext_sales_price) total_sales
  from
    web_sales,
    date_dim,
    customer_address,
    item
  where
    i_item_id in (select
        i_item_id
      from
        item
...

26,640 bytes
Figure 12: TPC-DS Raw SQL to answer question



As you can see, it is not at all obvious what the query is doing, and the same block of joins and
filters is repeated for each sales channel, which makes the query very error-prone.

In response to this challenge, for this benchmark study, we defined an AtScale model that drastically
simplifies user queries by translating the raw tables and schema into a business semantic layer. The
following screenshot is the TPC-DS model expressed in AtScale Design Center:

Figure 13: AtScale TPC-DS Data Model



Instead of writing complex SQL or engineering data models in the BI tool, this business question was
easily answered with Tableau on AtScale as you can see below:

Figure 14: Tableau on AtScale TPC-DS Model for Query #60

The visualization above for TPC-DS query #60 generated the following SQL against AtScale:

AtScale SQL
SELECT
  `d_product_item_id` AS `d_product_item_id`,
  SUM( `Total Ext Sales Price` ) AS `sum_total__ext_sales_price_ok`
FROM
  `tpc-ds benchmark model - snowflake`.`tpc-ds benchmark model` `tpc_ds_benchmark_model`
WHERE
  `I Category` = 'Jewelry'
  AND `Sold Calendar Year` = 1999
  AND `Sold d_month_of_year` = 9
  AND `d_customer_gmt_offset` = -6
GROUP BY 1

18,593 bytes
Figure 15: AtScale SQL to answer question



As you can see, the SQL written against the AtScale semantic model is human readable and
understandable. In addition, this semantic model provides important context for query optimization
which delivers query acceleration, user concurrency improvements and cost reduction.

As a measure of complexity, we used an open-source parser to break down each SQL statement into the
following groups:

1. Number of functions used
2. Number of arithmetic operations
3. Number of tables accessed
4. Number of objects used
5. Number of subqueries needed

Complexity Factor

Configuration    # of Functions    # of Operations    # of Tables    # of Objects    # of Subqueries    Total Score
No AtScale       87                66                 177            700             27                 1,057
AtScale          36                2                  21             198             1                  258

Figure 16: Complexity score for TPC-DS benchmark with and without AtScale Semantic Layer
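
The total scores and the 76% headline figure follow from the per-factor counts by simple arithmetic. The sketch below assumes the total score is the unweighted sum of the five counts, which is consistent with the published totals; the helper names are hypothetical.

```python
FACTORS = ("functions", "operations", "tables", "objects", "subqueries")

# Per-factor counts for the 20 benchmark queries, from the complexity table.
no_atscale = dict(zip(FACTORS, (87, 66, 177, 700, 27)))
atscale = dict(zip(FACTORS, (36, 2, 21, 198, 1)))


def total_score(counts: dict) -> int:
    """Unweighted sum of the five complexity factors (an assumption)."""
    return sum(counts.values())


def reduction_pct(baseline: int, reduced: int) -> float:
    """Percentage by which `reduced` is lower than `baseline`."""
    return (1 - reduced / baseline) * 100


raw_total = total_score(no_atscale)
semantic_total = total_score(atscale)
reduction = reduction_pct(raw_total, semantic_total)

print(raw_total, semantic_total, f"{reduction:.1f}% lower")
```

The sums reproduce the published totals of 1,057 and 258, and the reduction rounds to the 76% quoted in the summary.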



Conclusion
As you can see from the benchmark results, the future for data warehousing is definitely in the cloud.
The cloud data platforms we tested prove that the cloud is a viable alternative with many performance
and management advantages for data warehousing compared to traditional on-premises options.
However, there are key differences in performance, scalability and cost that need to be considered.

We also proved that the inclusion of a semantic layer like AtScale’s can make the cloud data warehouses
even better by:

1. Drastically simplifying queries for users
2. Ensuring all users access the same, secure data
3. Increasing query performance by up to 11x
4. Improving user concurrency by up to 31x
5. Reducing cost by up to 3.7x

ABOUT ATSCALE
AtScale enables smarter decision-making by accelerating the flow of data-driven insights. The company’s semantic layer
platform simplifies, accelerates, and extends business intelligence and data science capabilities for enterprise customers
across all industries.

atscale.com
