Microsoft SQL Database Analytics Paper


White Paper

Accelerating Large-Scale Business Analytics

Using a 100TB dataset powered by the Intel Xeon Processor E7 Family
and Microsoft SQL Server 2016

Authors
Mahmut Aktasoglu, Intel Corporation
Avaneesh Shetty, Intel Corporation
Hamesh Patel, Intel Corporation
Kevin Farlee, Microsoft Corporation
Sadashivan Krishnamurthy, Microsoft Corporation

Executive Summary
The creation, organization, and distribution of content across data centers
continues to rise at an ever-increasing pace. With an increasing amount of data
being stored in modern data warehouses, organizations need solutions that scale
proportionally. Although building clusters of servers and distributing data across
the systems is one of the more common approaches to scale, it is not desirable
from either a performance or TCO point of view.

In this paper, we show how Microsoft SQL Server technology in a single
Intel Xeon processor-based platform can effectively perform advanced analytics
on a large dataset (100TB in this case), using Microsoft SQL Server 2016 and the
Intel Xeon processor E7-8890 v4. [1]

This work is the culmination of a thoroughly tested configuration, in which we
carefully sized the key hardware components and tuned Microsoft SQL Server
parameters to provide meaningful insights on large datasets. Results shared in
this paper include:
• The system loaded a 100TB dataset, the largest scale factor allowed
  by TPC-H
• The SQL Server 2016 columnstore engine compressed the data by 3x and
  loaded the data at a rate of 1.6TB/hour
• Using the Intel Xeon processor E7-8890 v4, we observed up to 1.36x average
  speed up (compared to the Intel Xeon processor E7-8890 v3) for complex
  ad-hoc queries against a 100TB dataset
• Using the Intel AVX2 instruction set in the columnstore engine provides up
  to 2.3x performance boost for in-memory execution
• Test Query 1 ran in less than 2.5 minutes, processing 4.3B rows/second when
  data is in memory
The intended audience for this paper includes data analytics experts, IT executives
and managers, solution architects, and infrastructure planners.
White Paper | Accelerating Large-Scale Business Analytics

Contents
Executive Summary
Technology
  Intel Xeon Processor-Based Platforms
  Intel Data Center SSDs
  Intel RAID Controllers
Microsoft SQL Server 2016: Data Warehousing with Improved Columnstore Technology
System Configuration
Database Creation and Performance Testing
  Building the Database
  Test Queries
Results & Analysis
  The New Intel Xeon Processor E7 v4 Family Performance Combined with Improved
  SQL Server 2016 Columnstore Engine
  SQL Server 2016 Optimized for Intel AVX2
Summary
Acknowledgements
Appendix A
Test Queries
  Test Query 1
  Test Query 2
  Test Query 3
  Test Query 4
  Test Query 5
  Test Query 6
  Test Query 7
  Test Query 8
  Test Query 9
  Test Query 10
  Test Query 11
  Test Query 12
Appendix B
  Building the Database
  Performance Tests with the Intel Xeon Processor E7-8890 v3
In-Memory Performance of Columnstore Engine
Performance Scaling with Data Size
Performance of Running Queries Concurrently

Technology
Intel Xeon Processor-Based Platforms
In our configuration, we rely on the advantage of in-memory computing for fast
insights and decision making. Infrastructure built on the Intel Xeon processor
E7-8800 v4 product family can deliver real-time analytics services and open up
new data-driven business opportunities. Designed for the most mission-critical
workloads and the always-on enterprise, the processor combines large memory
capacities with high performance, reliability, and virtualization capabilities to keep
data centers supplying business advantage without interruption. [2]

Intel Data Center SSDs
To ensure that data centers can keep up with the speed of business, all that's
needed is an I/O-optimized infrastructure using the extraordinary performance
and reliability of Intel Data Center SSDs. [3] By pairing Intel Data Center SSDs with
Intel CPUs, chipsets, firmware, and drivers, we built a seamless system, enabling
large amounts of data to be transferred to processors quickly, eliminating I/O
bottlenecks. Faster data servers give organizations a competitive edge as they
deal with ever-increasing infrastructure demands. [2]

Intel RAID Controllers
SAS-based Intel RAID controllers, featuring the dual-core LSI SAS3108
RAID-on-Chip (ROC) processor, offer significant performance enhancements
for solutions architected with 12Gb/s or 6Gb/s SAS drives. [4] The read/write
performance ideally suits the controllers for a broad range of application
workloads, such as enterprise data center applications, cloud computing, and
content applications. The massive growth of cloud and big data applications
requires enterprise features to manage, optimize, and improve the efficiency of
growing data centers. The 12Gb/s SAS Intel RAID controller StorCLI utility
features server storage management, either locally or remotely, from a single
pane of glass with the RAID Web Console application. This generation of storage,
spurred by the increased adoption and trending lower costs of flash, requires
more bandwidth and I/O per second (IOPS) capability than the previous generation.
In addition to outstanding performance, the Intel 12Gb/s SAS product family
supports advanced features such as FastPath I/O and RAID SSD Cache, dual-level
advanced RAID types, and revertible hot spare. [2]

Microsoft SQL Server 2016: Data Warehousing
with Improved Columnstore Technology [5]
Microsoft SQL Server 2016 has made significant improvements in data
warehousing technologies and performance, including columnstore features,
as well as many other improvements.
Columnstore indices offer great advantages over traditional row stores for analytics
and data warehousing queries. They are ideally suited for star schemas and tables
with billions of rows, which are commonly seen. Among their advantages for
analytics are:

Up to 10x Compression in Data Size
Data warehouses are very large by nature, and the compression offered by
columnstore index technologies offers not only space and cost savings, but also
significantly increased performance. These benefits are due to the dramatically
reduced I/O requirements given by the compression, coupled with the ability to
scan only the specific columns required by each query. This compression also
reduces the amount of memory required to hold a given number of rows from
the source data warehouse.


Additional Indices
SQL Server 2016 adds the capability to create additional (B-Tree) indices on columnstore-based tables, which enables efficient
single-row lookup. In addition to these architectural features, we have further optimized the processing of queries in
columnstore indices in the following ways:

Operator Pushdown
Pushdown refers to moving both filter and aggregation query operations closer to
the data so that many of the filters and calculations can be done in the scan
operators, dramatically reducing the volume of data that needs to be handled
further on in query processing.

Batch-Mode Processing
SQL Server 2016 includes enhancements in batch-mode processing, which processes
many rows at a time rather than doing calculations serially on each individual
row. These batch operations are further optimized by leveraging Single
Instruction, Multiple Data (SIMD) vector-processing CPU instructions in the
Intel architectures.
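
The difference between row-at-a-time execution and batch-mode execution with pushdown can be illustrated with a toy sketch. This is not SQL Server's implementation: plain Python lists stand in for column segments, and the column names, batch size, and function names are all hypothetical. It only shows that filtering and pre-aggregating inside the scan hands one partial result per batch to the next operator instead of one row at a time.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, float]  # (quantity, price): hypothetical columns


def row_mode_sum(rows: Iterable[Row], max_qty: int) -> float:
    """Row-at-a-time: every row is passed to the filter, then to the aggregate."""
    total = 0.0
    for qty, price in rows:
        if qty < max_qty:
            total += price
    return total


def batches(qty: List[int], price: List[float],
            size: int) -> Iterator[Tuple[List[int], List[float]]]:
    """Yield column slices (batches) instead of individual rows."""
    for start in range(0, len(qty), size):
        yield qty[start:start + size], price[start:start + size]


def batch_mode_sum(qty: List[int], price: List[float],
                   max_qty: int, size: int = 1000) -> float:
    """Batch mode with pushdown: the scan filters and pre-aggregates each batch,
    so only one partial sum per batch reaches the final aggregate."""
    partials = []
    for q_batch, p_batch in batches(qty, price, size):
        partials.append(sum(p for q, p in zip(q_batch, p_batch) if q < max_qty))
    return sum(partials)


# Synthetic data, chosen so that all sums are exact in floating point.
qty = [i % 50 for i in range(10_000)]
price = [float(i % 7) for i in range(10_000)]
print(row_mode_sum(zip(qty, price), 25) == batch_mode_sum(qty, price, 25))
```

In a real engine the per-batch loop is where SIMD instructions apply, because the same comparison and addition run over many contiguous column values.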

[Diagram: storage layout showing the Data, Tempdb, Backup 1, Backup 2, and Log
volumes attached to the Intel Xeon processor E7-8800 v3 platform through Intel
RAID controllers with dual SAS SF 8088 connectors.]

Figure 1. Hardware Configuration [6]

System Configuration
The Intel technology-powered white box used is a 4U
Intel Server System that supports four Intel Xeon
processor E7-8890 v4 product family processors.
Attached to the platform are 8 LSI 620J JBOD storage
enclosures (see Figure 1 [6]), which are dual-connected to
the Intel RAID controllers via SAS SF-8088 cables, which
provide 4 external lanes with a 12Gb/s transfer rate per
lane. We used the directly attached storage units for storing
data (i.e., raw data and database files), tempdb (i.e., the SQL
Server system database used for query execution), and
backups. The most crucial performance aspect of our storage
subsystem for our workload was read bandwidth for data and
tempdb drives. Before sizing and configuring the drives, we
ran Iometer tests for 512K sequential reads to measure the I/O
bandwidth of each storage unit, populated by 1.6TB Intel
SSDs. Figure 2 shows the amount of I/O bandwidth that can
be attained as we add more disks to the enclosure.


[Chart: Bandwidth (MB/s) vs. Number of Disks in a Bay. X-axis: number of disks
(0 to 12); y-axis: bandwidth in MB/s (0 to 5,000).]

Figure 2. Bandwidth Scaling with Number of Drives

As can be seen from Figure 2, Intel RAID cards can easily
scale up to an approximate 4.5GB/s data transfer rate with 8
SSDs. Beyond that point, we are limited by SAS transfer rates.
Based on our observation, we used 8 SSDs for our data drives
per enclosure to provide peak I/O bandwidth and ample
storage (about 11TB per data drive, 88TB total capacity). We
measured total I/O bandwidth of over 25GB/s across all data
drives with our configuration detailed below.

Our storage configuration is as follows: Each storage bay is
populated with 24 x 1.6TB 2.5-inch Intel SSD DC S3500 Series
drives. We created RAID0 across 8 SSDs for data drives
and 4 SSDs for tempdb drives. We created RAID6 over
the remaining 12 SSDs to hold our database backup files,
ensuring good performance with adequate redundancy
(i.e., around 14.5TB per bay, which allows us to keep 2 sets of
backups). Lastly, for SQL log files, we included a storage unit
with 24 x 800GB 2.5-inch Intel SSDs daisy-chained to one of the
main storage bays. Figure 1 illustrates the storage layout of
our system.

As for system memory, we fully populated our DIMM sockets
with 64GB DDR4 LRDIMMs, giving us a total memory capacity
of 6TB. To make use of the whole physical address space,
we used Microsoft Windows Server 2016 RTM build, which
supports memory capacity up to 12TB.

Table 1 lists the specifics about the components we used in
our configuration.


CPU
  Processor                       Intel Xeon processor E7-8890 v4
  # Cores/Threads                 24/48
  Last Level Cache                60MB
  Base/Turbo Frequency            2.2GHz/3.4GHz
  QPI                             3 links, 9.6GT/s
  Instruction Set Extensions      AVX2
PLATFORM
  # Sockets                       4
  Chipset                         C600/X79 Express Chipset
  # PCI Express (PCIe) ports      12
  Networking                      2x Intel Network Adapter X540-T2 (10Gb/s)
STORAGE
  SSDs (Data + Tempdb + Backups)  Intel SSD DC S3500 Series
                                  (1.6TB, 2.5-inch SATA 3.0), x192
  SSDs (Log)                      Intel SSD DC S3700 Series
                                  (800GB, 2.5-inch SATA 3.0), x24
  RAID Controllers                Intel RAID Controller RS3SC008,
                                  PCIe x8 Gen3, x8
  Storage Enclosure               LSI LSI00217 / 620J 24-Bay 2U
                                  Rackmount JBOD, x9
MEMORY
  Memory Type                     DDR4
  DIMM Size and Type              64GB LRDIMM, dual rank
  Memory Frequency (MHz)          2133
  DIMMs/Channel                   3
  System Memory Capacity (GB)     6144
SOFTWARE
  Database Software               Microsoft SQL Server 2016 Enterprise RTM
                                  64-bit Edition (build no: 13.0.1601.5)
  OS Distribution                 Microsoft Windows Server 2016 TP5
                                  64-bit (build no: 14300)

Table 1. System Configuration Details
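
The capacity and bandwidth figures quoted in the System Configuration section can be cross-checked with simple arithmetic. The sketch below is a rough check, not the authors' sizing method. It assumes that the quoted "about 11TB" and "around 14.5TB" are binary terabytes as reported by the operating system, and that the 12Gb/s SAS lanes use 8b/10b line encoding; the paper states neither.

```python
TB = 10**12    # decimal terabyte, as printed on drive labels
TIB = 2**40    # binary terabyte (assumption: the unit behind "about 11TB")

DRIVE_BYTES = 1600 * 10**9   # one 1.6TB SSD


def raid0_usable(drives: int, drive_bytes: int) -> int:
    """RAID0 stripes across all drives; nothing is reserved for redundancy."""
    return drives * drive_bytes


def raid6_usable(drives: int, drive_bytes: int) -> int:
    """RAID6 reserves the equivalent of two drives for parity."""
    return (drives - 2) * drive_bytes


data_volume = raid0_usable(8, DRIVE_BYTES)      # data drive in each bay
backup_volume = raid6_usable(12, DRIVE_BYTES)   # backup drive in each bay

# Memory: total capacity divided by DIMM size gives the DIMM count.
dimm_count = 6144 // 64

# SAS link ceiling: 4 lanes x 12Gb/s, with 8 payload bits per 10 line bits
# (assumption: 8b/10b encoding).
sas_link_bytes_per_s = 4 * 12e9 * 8 / 10 / 8

print(f"data volume per bay:   {data_volume / TIB:.2f} TiB")
print(f"backup volume per bay: {backup_volume / TIB:.2f} TiB")
print(f"DIMMs installed:       {dimm_count}")
print(f"SAS 4-lane ceiling:    {sas_link_bytes_per_s / 1e9:.1f} GB/s")
```

Under these assumptions the RAID6 volume comes out near the quoted 14.5TB, the RAID0 data volume somewhat above the quoted 11TB, and the link ceiling slightly above the roughly 4.5GB/s plateau in Figure 2.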


Database Creation and Performance Testing
Our choice of workload to put our system to the test is a 100TB
data warehouse database. We constructed our database based
on the TPC-H schema. The TPC-H benchmark* models a
decision-support system for worldwide sales and distribution
of arbitrary products. It provides guidelines to define size of
business (i.e., scale factor), which in turn dictates the total
size of the data. The size of the database we chose, 100TB, is
the biggest scale factor allowed by TPC-H. There are nearly
600 billion rows in the fact table (i.e., LINEITEM) representing
orders data over the course of 7 years. Table 2 summarizes
cardinality of each table in the database.

TABLE NAME    CARDINALITY
LINEITEM      599,999,969,200
ORDERS        150,000,000,000
PARTSUPP       80,000,000,000
PART           20,000,000,000
CUSTOMER       15,000,000,000
SUPPLIER        1,000,000,000
NATION                     25
REGION                      5

Table 2. Table Cardinality
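
The cardinalities in Table 2 follow from the TPC-H scaling rules, where most tables grow linearly with the scale factor. The sketch below reproduces them, assuming a scale factor of 100,000 for the 100TB database (scale factor 1 corresponding to roughly 1GB of raw data). LINEITEM is left out because its row count depends on the generated data, at roughly 6 million rows per unit of scale factor, and is not an exact multiple.

```python
# Rows per unit of scale factor for the linearly scaling TPC-H tables.
ROWS_PER_SF = {
    "ORDERS": 1_500_000,
    "PARTSUPP": 800_000,
    "PART": 200_000,
    "CUSTOMER": 150_000,
    "SUPPLIER": 10_000,
}
# NATION and REGION have a fixed size at every scale factor.
FIXED_ROWS = {"NATION": 25, "REGION": 5}


def cardinalities(scale_factor: int) -> dict:
    """Return the expected row count per table for a given scale factor."""
    rows = {name: per_sf * scale_factor for name, per_sf in ROWS_PER_SF.items()}
    rows.update(FIXED_ROWS)
    return rows


SCALE_FACTOR = 100_000  # assumption: 100TB of raw data
table_rows = cardinalities(SCALE_FACTOR)
for name, count in table_rows.items():
    print(f"{name:<10}{count:>18,}")
```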


Building the Database
Using SQL Server 2016, we created the file group, consisting of
512 data files of 100GB size, and mapped the database tables
onto this file group. [7] Next, we defined a clustered columnstore
index (CCI) on the tables and used a week-based partitioning
scheme applied to the largest two tables of the schema (i.e.,
LINEITEM and ORDERS). Following database creation, we
generated data in batches of 10TB. Note that we assume
data for the largest two tables is presorted and preprocessed
(i.e., split into subsets of one-week data) to optimize parallel
loading. Generating 10TB worth of data at a time allowed us
to build the database on the same configuration with no extra
storage overhead required to keep the raw data. Once data was
generated, we loaded it into columnstore tables using BULK
INSERT (BI) statements. As data is loaded into the database,
the SQL Server 2016 columnstore engine compresses the
data by 3x, which greatly reduces storage requirements. It
took about 6 hours to load 10TB of data onto our system. This
result shows that SQL Server 2016 is capable of loading 100TB
worth of data into a columnstore database in 60 hours at a
rate of 1.6TB/hour.

Once all data was loaded, we created non-clustered indices
(NCI) in addition to the CCIs on 5 tables (PART, CUSTOMER,
SUPPLIER, NATION, and REGION), which took about 1 hour.
Support for additional indices on tables with columnstore
indices is a new capability introduced in SQL Server 2016.
Having multiple indices, each optimized for different logical
operations, allows the SQL Server query optimizer to come
up with the most efficient query plan, improving query
performance. Furthermore, we generated statistics to improve
estimations made by the query optimizer, which is crucial for
optimized query plans and better performance. After the 4.7
hours of creating the statistics, the ETL phase was complete.
The database consumed about 45TB of storage space.
Finally, we created a database backup so that we could
seamlessly restore the database for repeated measurements,
as shown in Figure 3.

Test Queries
To test performance with the Intel Xeon processor E7-8890
v4, we restored the database created from our backup files
and ran a subset of ad-hoc queries derived from the TPC-H
Benchmark standard (see Appendix A for details) against the
100TB columnstore database. In this paper, we provide results
of three tests to highlight various performance aspects of
the system under test. Brief descriptions of the queries are
provided in Table 3.

[Diagram: raw data (100TB) is bulk-inserted into the CCI-based columnstore
database (total file size: 50TB), which is backed up (35TB backup) and restored
from that backup.]

Figure 3. Load and Backup/Restore Process
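
The load figures reported under Building the Database can be reproduced with a few lines of arithmetic. The sketch below uses only numbers stated in the text (10TB batches, about 6 hours per batch, 3x compression); the variable names are illustrative.

```python
RAW_TB = 100            # total raw data
BATCH_TB = 10           # raw data generated and loaded per batch
HOURS_PER_BATCH = 6     # "about 6 hours to load 10TB"
COMPRESSION_RATIO = 3   # columnstore compression reported in the text

batch_count = RAW_TB // BATCH_TB
total_load_hours = batch_count * HOURS_PER_BATCH
load_rate_tb_per_hour = BATCH_TB / HOURS_PER_BATCH
compressed_tb = RAW_TB / COMPRESSION_RATIO

print(f"batches:           {batch_count}")
print(f"total load time:   {total_load_hours} hours")
print(f"load rate:         {load_rate_tb_per_hour:.2f} TB/hour")
print(f"compressed data:   {compressed_tb:.1f} TB")
```

The compressed size from the 3x ratio alone is below the roughly 45TB the database consumed; the paper does not break down the difference, which would plausibly include the additional indices and statistics.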


Results & Analysis
The New Intel Xeon Processor E7 v4 Family
Performance Combined with Improved SQL Server
2016 Columnstore Engine
The Intel Xeon processor E7 v4 family is the newest member
of data center processors designed for the performance needs
in data platforms. The Intel Xeon processor E7 v4 family
has up to 48 logical cores (a 33% increase compared to its
predecessor) in terms of the maximum degree of parallelism
that can be attained.

In our first experiment, we explore the benefits of the
columnstore engine (first introduced in SQL Server 2012)
in SQL Server 2016. A common characteristic of data
warehouse applications is the need to scan large amounts
of data, in some cases the whole table. The SQL Server
columnstore engine provides technology optimized to perform
such operations. By building a columnstore index, SQL Server
converts row-based data into a compressed columnar format.
Once data is stored in columnstore format, scanning the whole
table becomes very efficient in terms of disk reads. A scan
operation issues I/O requests for only relevant table columns
as opposed to reading the whole row of data. Moreover, all
persistent data is compressed. Once data is read from disk, it
is cached in memory within a columnstore object pool (think
buffer pool for columnstore data). SQL Server decompresses
the cached data as needed using Intel AVX2 instruction set-
optimized batch mode scan operator. These optimizations
enable SQL Server to efficiently access large amounts of data
and keep them in memory for reuse.

TEST QUERY ID   NOTES
TQ1             Pricing summary report
TQ2             Revenue change forecast
TQ3             Important supplier stock identification
TQ4             Promotion effect
TQ5             Top supplier
TQ6             Product type profit measure
TQ7             Returned item report
TQ8             Minimum cost supplier
TQ9             Shipping priority
TQ10            Local supplier volume
TQ11            Shipping modes and order priority
TQ12            Small quantity order revenue

Table 3. Query Descriptions

Figure 4 shows the performance of the first 7 test queries
executed on the same configuration but with different
processors. Each query was run twice, clearing all SQL
Server caches between each run, to warm up the SQL
Server columnstore object pool cache. We plotted the
query run times from the second run in Figure 4, showing
performance gains across the board when using the new
Intel Xeon processor E7-8890 v4.

[Chart: Gen-to-Gen Performance Gains on 100TB. Bars show query run time
(seconds) for TQ1 to TQ7 on the Intel Xeon processor E7-8890 v3 @ 2.5GHz and
the Intel Xeon processor E7-8890 v4 @ 2.2GHz; a second axis shows speed up.]

Figure 4. Query Execution Time Comparison


A comparison of the query execution times between the previous-generation Intel Xeon processor E7-8890 v3 and the new Intel Xeon processor E7-8890 v4;
lower run times are better. Note that y-axis is scaled down to 1,000 seconds.
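
The average gain across the seven queries can be estimated from the speed-up labels printed in Figure 4 (2.21, 1.53, 1.41, 1.22, 1.25, 1.17, 1.06). The sketch below is an estimate under stated assumptions: the mapping of labels to individual queries is not used, and the choice of a geometric mean is an assumption, since the paper names a geometric mean only for the Intel AVX2 comparison in footnote 10.

```python
import math

# Speed-up labels as printed in Figure 4 (v3 run time / v4 run time).
speedups = [2.21, 1.53, 1.41, 1.22, 1.25, 1.17, 1.06]


def geometric_mean(values):
    """Geometric mean, the usual way to average ratios such as speed-ups."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


geo = geometric_mean(speedups)
arith = arithmetic_mean(speedups)
print(f"geometric mean speed-up:  {geo:.2f}x")
print(f"arithmetic mean speed-up: {arith:.2f}x")
```

The geometric mean lands close to the 1.36x average reported in the text, while the arithmetic mean is pulled up by the single 2.21x outlier.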


For the 7 test queries we used in our previous work (see
Appendix B), we observed up to 1.36x average speed up
compared to the Intel Xeon processor E7-8890 v3 for
complex ad-hoc queries against a 100TB dataset. [8]
Of all queries in our test, TQ1 is especially of interest in terms
of representing real-world scenarios: This query scans 97%
of the fact table to calculate an aggregate value, a very
common operation for most DW applications (see Appendix
A for implementation details of TQ1). Figure 5 shows the TQ1
query plan statistics from the scan operator. The plan shows
the number of scanned rows that are passed down to the
aggregate operator and the number of rows that have been
aggregated within the scan operator itself. The total of the
highlighted fields gives the total number of rows processed.
The SQL Server columnstore engine, optimized for
Intel architecture with Intel AVX2 instruction set support,
can run TQ1 in just less than 2.5 minutes, processing
4.3B rows/second when data is in memory.
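
The TQ1 run time and the quoted scan rate are consistent with each other, as a back-of-the-envelope check shows. The sketch below uses the LINEITEM cardinality from Table 2, the 97% scan fraction, and the 4.3B rows/second rate; it is a consistency check, not a measurement.

```python
LINEITEM_ROWS = 599_999_969_200   # Table 2
SCAN_FRACTION = 0.97              # share of the fact table scanned by TQ1
ROWS_PER_SECOND = 4.3e9           # reported in-memory processing rate

rows_scanned = LINEITEM_ROWS * SCAN_FRACTION
scan_seconds = rows_scanned / ROWS_PER_SECOND

print(f"rows scanned: {rows_scanned:.3e}")
print(f"scan time:    {scan_seconds:.0f} s ({scan_seconds / 60:.2f} min)")
```

At that rate the scan takes a little over two minutes, in line with the "just less than 2.5 minutes" stated above.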
Figure 5. TQ1 Execution Plan Properties of the Batch-Mode Columnstore Scan

SQL Server 2016 Optimized for Intel AVX2
Our next experiment demonstrates yet another value of the
Intel Xeon processor family. SQL Server 2016 has different
versions of the same algorithms using vector instructions,
and the latest release includes Intel Streaming SIMD
Extensions (Intel SSE) instruction set support in columnstore
batch-mode operations. This means that when database
software is updated to SQL Server 2016, the database
software will detect all vector instruction set extensions
available on the CPU and select the algorithms that provide
the best performance. [9]
To show the performance benefits of using vector
instructions in SQL Server's columnstore engine, we
implemented a SQL Server 2016 prototype that can override
selection of algorithms via a trace flag. Using this prototype
engine for the experiment, we ran the same test queries on
the same platform, but with a different instruction set.
For the first set of runs, we used a trace flag to turn off
Intel AVX2 instruction set support, choosing algorithms
using a scalar Intel architecture instruction set instead. In
the second run, we let SQL Server 2016 choose the best
version of algorithms, which in this case was the
Intel AVX2 instruction set.
Similar to the previous experiment, we ran the queries twice,
clearing SQL Server caches between each query run. Using
the run time from the second execution of the query, we
plotted the results in Figure 6.
Figure 6 shows a performance boost of up to 2.3x by
utilizing the Intel AVX2 instruction set.
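
A speed-up such as the 2.3x above can also be expressed as a reduction in run time, using the formula given in footnote 10: [1 - (AVX2 run time / scalar run time)] x 100. The function names in the sketch below are illustrative.

```python
def run_time_reduction(scalar_seconds: float, avx2_seconds: float) -> float:
    """Percentage reduction in run time, per the formula in footnote 10."""
    return (1 - avx2_seconds / scalar_seconds) * 100


def reduction_from_speedup(speedup: float) -> float:
    """Same quantity, starting from speed-up = scalar time / AVX2 time."""
    return (1 - 1 / speedup) * 100


print(f"2.3x speed-up = {reduction_from_speedup(2.3):.1f}% shorter run time")
```

The two forms are equivalent: a query that is 2.3x faster spends a little under 57% less time, and a query that is 2x faster spends 50% less.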


[Chart: Intel AVX2 Performance, Query Run Times & Speed Up. Bars show query run
time (seconds) for TQ1, TQ2, TQ8, TQ9, TQ10, TQ11, and TQ12 with scalar and
Intel AVX2 code paths; a second axis shows speed up.]

Figure 6. Query Execution Time Comparison


Test query execution times (in seconds) on 100TB using the Intel Xeon processor E7-8890 v4 with and without Intel AVX2; lower run time is better.
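
The spread of speed-ups in Figure 6 is what Amdahl's law predicts when only part of a query's run time is spent in vectorized code. The sketch below is illustrative only: the kernel speed-up of 2.5x and the fractions of time spent in the vectorized scan are hypothetical values, not measurements from the paper.

```python
def amdahl_speedup(vector_fraction: float, kernel_speedup: float) -> float:
    """Overall speed-up when `vector_fraction` of the run time is accelerated
    by `kernel_speedup` and the rest is unchanged (Amdahl's law)."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / kernel_speedup)


KERNEL_SPEEDUP = 2.5  # hypothetical gain inside the vectorized scan itself
for fraction in (0.9, 0.5, 0.2):  # hypothetical share of time in the scan
    overall = amdahl_speedup(fraction, KERNEL_SPEEDUP)
    print(f"{fraction:.0%} of time in scan -> {overall:.2f}x overall")
```

A scan-dominated query captures most of the kernel gain, while a query with a complex plan, where joins and sorts dominate, sees only a modest overall improvement.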

TQ2 is the perfect value-proposition scenario in our
experiments. It is a query with a relatively simple query plan,
and it spends most of its query execution time in the Intel AVX2-
optimized scan operator. This query calculates an aggregate
value over a dataset from a given year with some additional
predicates (i.e., processing roughly 12 billion rows). Hence,
performance differences in query run times can be directly
attributed to improvements from vectorizing the algorithms.
As query plans get more complicated, measurable benefits
from SIMD vectorization will also reduce
in accordance with Amdahl's law, but are still impactful,
showing a reduction of 25% in query run times on average
with Intel AVX2. [10]

We encourage readers of this paper to explore the
possibilities with Intel SSE for different use cases in
the online Intel 64 and IA-32 Architectures Software
Developer Manuals and easy-to-use Intel Intrinsics
Guide tool.

Summary
In this paper, we presented our findings on running a large
(100TB) data warehousing application using Microsoft
SQL Server 2016 on the Intel Xeon processor E7-8800 v4
product family and enterprise-class Intel SSDs and RAID
controllers. We demonstrated that it is possible to build and
run a 100TB data warehouse on a symmetric multiprocessor
(SMP) configuration using technology that is readily available
today from Intel and Microsoft. Our system with four Intel
Xeon processor E7-8800 v4 product family processors fuels
the advanced business analytics capabilities of Microsoft
SQL Server 2016 to deliver stunning performance, processing
100TB worth of data to bring key business insights in a
matter of minutes. SQL Server 2016, optimized for Intel
architecture, makes use of large system memory that can
fit terabytes of data, eliminating costly I/O operations.
Enterprise-level Intel SSDs ensure fast data access,
delivering performance as needed.

Through continuous innovation, Intel and Microsoft are
committed to bringing exciting technology to customers and
more power to their business analytics solutions. Stay tuned.

Acknowledgements
We would like to acknowledge:
• Lindsey Allen for spearheading the project
• Jamie Reding for his TPC-H benchmark expertise
• Sudin Bhat and Lin Jiang for providing insight on the
  columnstore engine
• Christian Hescott for his guidance and improvements
  around performance experiments and results
• Vic Stelter for his help with system setup and
  hardware expertise
• Raechel Frick for her efforts in enabling the publishing
  of this paper within a short period of time

This paper has been updated to reflect the changes to our test platform and our most recent results using the latest-generation Intel Xeon processor E7-8890 v4. In the earlier version of the
paper, we used the Intel Xeon processor E7-8890 v3 to build and run performance tests. We have retained sections of the earlier paper to guide users on how we set up our experiments and
updated performance results obtained using the latest processors. For previous experiment results, please refer to Appendix B.
Disclaimer: The workload used in this paper is derived from portions of the TPC-H Benchmark* and, as such, is not comparable to published TPC-H Benchmark results.
1. http://www.intel.com/content/www/us/en/processors/xeon/xeon-processor-e7-family.html
2. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system
configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com. For more details, please refer to the Legal
Notice section in this paper.
3. https://www-ssl.intel.com/content/www/us/en/solid-state-drives/data-center-family.html
4. http://www.intel.com/content/www/us/en/servers/raid/raid-controller-rs3sc008.html
5. Microsoft claims based on SQL Server customer case studies. See https://customers.microsoft.com for more details.
6. Images used for illustration purposes only. They may not reflect actual components.
7. Since both the Intel Xeon processor E7-8890 v4 and the final Windows Server 2016 were not publicly available at the time of this test, we used the previous-generation Intel Xeon processor
E7-8890 v3 and an earlier build of Windows Server 2016 for data generation.
8. 36% more performance average on 7 ad-hoc data warehousing queries, comparing the Intel Xeon processor E7-8890 v4 to the Intel Xeon Processor E7-8890 v3 on a 4-socket server with
6TB DDR4 memory, 100TB storage for data (64 Intel SSD DC S3500 Series) running SQL Server 2016 RTM on Windows Server 2016.
9. SQL Server 2016 supports Intel SSE4.2, Intel AVX, and Intel AVX2 instruction set extensions.
10. Up to 2.3x performance improvement and 25% average reduction in query run times on 7 ad-hoc data warehousing queries, comparing customized SQL Server 2016 RTM with Intel AVX2
instruction set support to same engine without Intel AVX2 support, on a 4-socket server with Intel Xeon processor E7-8890 v4 processors, 6TB DDR4 memory, 100TB storage for data
(64 x Intel SSD DC S3500 Series) running on Windows Server 2016. Reduction in query run time calculated as: ([1 - (AVX2 run time / scalar run time)] x 100). Average reduction in run time is
geometric mean of reduction in query run times for the 7 queries.
Copyright 2017 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, the Intel Inside logo, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or
other countries.
*Other names and brands may be claimed as the property of others
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are
measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
For more information, visit intel.com/performance.


Appendix A
Test Queries
In this section, we present the queries used in our tests.

Test Query 1
Test Query 1 creates a pricing summary report over all line items shipped up to 73 days (roughly 2.5 months) before 1998-12-01.
This query is derived from TPC-H Q1.
SELECT L_RETURNFLAG,
       L_LINESTATUS,
       SUM(L_QUANTITY) AS SUM_QTY,
       SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE,
       SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS SUM_DISC_PRICE,
       SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)) AS SUM_CHARGE,
       AVG(L_QUANTITY) AS AVG_QTY,
       AVG(L_EXTENDEDPRICE) AS AVG_PRICE,
       AVG(L_DISCOUNT) AS AVG_DISC,
       COUNT_BIG(*) AS COUNT_ORDER
FROM LINEITEM
WHERE L_SHIPDATE <= dateadd(dd, -73, cast('1998-12-01' as date))
GROUP BY L_RETURNFLAG,
         L_LINESTATUS
ORDER BY L_RETURNFLAG,
         L_LINESTATUS

Test Query 2
Test Query 2 estimates the amount of revenue increase assuming that discounts within a range were not applied within a year.
This query is derived from TPC-H Q6.
SELECT SUM(L_EXTENDEDPRICE*L_DISCOUNT) AS REVENUE
FROM LINEITEM
WHERE L_SHIPDATE >= '1993-01-01' AND
      L_SHIPDATE < dateadd(yy, 1, cast('1993-01-01' as date)) AND
      L_DISCOUNT BETWEEN 0.04 - 0.01 AND 0.04 + 0.01 AND
      L_QUANTITY < 25

Test Query 3
Test Query 3 lists the most important set of suppliers within a nation. This query is derived from TPC-H Q11.
SELECT PS_PARTKEY,
       SUM(PS_SUPPLYCOST*PS_AVAILQTY) AS VALUE
FROM PARTSUPP,
     SUPPLIER,
     NATION
WHERE PS_SUPPKEY = S_SUPPKEY AND
      S_NATIONKEY = N_NATIONKEY AND
      N_NAME = 'INDIA'
GROUP BY PS_PARTKEY
HAVING SUM(PS_SUPPLYCOST*PS_AVAILQTY) >
       ( SELECT SUM(PS_SUPPLYCOST*PS_AVAILQTY) * 0.0000000010
         FROM PARTSUPP,
              SUPPLIER,
              NATION
         WHERE PS_SUPPKEY = S_SUPPKEY AND
               S_NATIONKEY = N_NATIONKEY AND
               N_NAME = 'INDIA'
       )
ORDER BY VALUE DESC


Test Query 4
Test Query 4 monitors the percentage of promotional revenue within a given month. This query is derived from TPC-H Q14.

SELECT 100.00 * SUM ( CASE WHEN P_TYPE LIKE 'PROMO%%'
                           THEN L_EXTENDEDPRICE*(1-L_DISCOUNT)
                           ELSE 0
                      END) / SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS PROMO_REVENUE
FROM LINEITEM,
     PART
WHERE L_PARTKEY = P_PARTKEY AND
      L_SHIPDATE >= '1997-05-01' AND
      L_SHIPDATE < dateadd(mm, 1, cast('1997-05-01' as date))

Test Query 5
Test Query 5 finds the supplier that made the biggest contribution to overall revenue within a quarter.
This query is derived from TPC-H Q15.
CREATE VIEW REVENUE0 (SUPPLIER_NO, TOTAL_REVENUE)
AS
SELECT L_SUPPKEY,
       SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT))
FROM LINEITEM
WHERE L_SHIPDATE >= '1993-07-01' AND
      L_SHIPDATE < dateadd(mm, 3, cast('1993-07-01' as date))
GROUP BY L_SUPPKEY
GO
SELECT S_SUPPKEY,
       S_NAME,
       S_ADDRESS,
       S_PHONE,
       TOTAL_REVENUE
FROM SUPPLIER,
     REVENUE0
WHERE S_SUPPKEY = SUPPLIER_NO AND
      TOTAL_REVENUE = ( SELECT MAX(TOTAL_REVENUE)
                        FROM REVENUE0
                      )
ORDER BY S_SUPPKEY
DROP VIEW REVENUE0


Test Query 6
Test Query 6 displays the amount of profit on certain parts, grouped by suppliers nation for each year.
This query is derived from TPC-H Q9.
SELECT NATION,
O_YEAR,
SUM(AMOUNT) AS SUM_PROFIT
FROM ( SELECT N_NAME AS NATION,
datepart(yy, O_ORDERDATE) AS O_YEAR,
L_EXTENDEDPRICE*(1-L_DISCOUNT)-PS_SUPPLYCOST*L_QUANTITY AS AMOUNT
 FROM PART,
SUPPLIER,
LINEITEM,
PARTSUPP,
ORDERS,
NATION
WHERE S_SUPPKEY = L_SUPPKEY AND
PS_SUPPKEY = L_SUPPKEY AND
PS_PARTKEY = L_PARTKEY AND
P_PARTKEY = L_PARTKEY AND
O_ORDERKEY = L_ORDERKEY AND
S_NATIONKEY = N_NATIONKEY AND
P_NAME LIKE '%smoke%'
) AS PROFIT
GROUP BY NATION,
O_YEAR
ORDER BY NATION,
O_YEAR DESC

Test Query 7
Test Query 7 displays customers that caused high revenue loss due to returns. This query is derived from TPC-H Q10.
SELECT TOP 20
C_CUSTKEY,
C_NAME,
SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE,
C_ACCTBAL,
N_NAME,
C_ADDRESS,
C_PHONE,
C_COMMENT
FROM CUSTOMER,
ORDERS,
LINEITEM,
NATION
WHERE C_CUSTKEY = O_CUSTKEY AND
L_ORDERKEY = O_ORDERKEY AND
O_ORDERDATE >= '1993-05-01' AND
O_ORDERDATE < dateadd(mm, 3, cast('1993-05-01' as date)) AND
L_RETURNFLAG = 'R' AND
C_NATIONKEY = N_NATIONKEY
GROUP BY C_CUSTKEY,
C_NAME,
C_ACCTBAL,
C_PHONE,
N_NAME,
C_ADDRESS,
C_COMMENT
ORDER BY REVENUE DESC

Test Query 8
Test Query 8 finds the lowest-cost supplier for a part in a region. This query is derived from TPC-H Query 2.
SELECT TOP 100
S_ACCTBAL,
S_NAME,
N_NAME,
P_PARTKEY,
P_MFGR,
S_ADDRESS,
S_PHONE,
S_COMMENT
FROM PART,
SUPPLIER,
PARTSUPP,
NATION,
REGION
WHERE P_PARTKEY = PS_PARTKEY AND
S_SUPPKEY = PS_SUPPKEY AND
P_SIZE = 50 AND
P_TYPE LIKE '%COPPER' AND
S_NATIONKEY = N_NATIONKEY AND
N_REGIONKEY = R_REGIONKEY AND
R_NAME = 'AFRICA' AND
PS_SUPPLYCOST = ( SELECT MIN(PS_SUPPLYCOST)
FROM PARTSUPP,
SUPPLIER,
NATION,
REGION
WHERE P_PARTKEY = PS_PARTKEY AND
S_SUPPKEY = PS_SUPPKEY AND
S_NATIONKEY = N_NATIONKEY AND
N_REGIONKEY = R_REGIONKEY AND
R_NAME = 'AFRICA'
)
ORDER BY S_ACCTBAL DESC,
N_NAME,
S_NAME,
P_PARTKEY

Test Query 9
Test Query 9 displays the top 10 unshipped orders with the highest value. This query is derived from TPC-H Query 3.
SELECT TOP 10
L_ORDERKEY,
SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE,
O_ORDERDATE,
O_SHIPPRIORITY
FROM CUSTOMER,
ORDERS,
LINEITEM
WHERE C_MKTSEGMENT = 'BUILDING' AND
C_CUSTKEY = O_CUSTKEY AND
L_ORDERKEY = O_ORDERKEY AND
O_ORDERDATE < '1995-03-05' AND
L_SHIPDATE > '1995-03-05'
GROUP BY L_ORDERKEY,
O_ORDERDATE,
O_SHIPPRIORITY
ORDER BY REVENUE DESC,
O_ORDERDATE

Test Query 10
Test Query 10 lists the revenue volume from local suppliers. This query is derived from TPC-H Query 5.

SELECT 
N_NAME,
SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM CUSTOMER,
ORDERS,
LINEITEM,
SUPPLIER,
NATION,
REGION
WHERE C_CUSTKEY = O_CUSTKEY AND
L_ORDERKEY = O_ORDERKEY AND
L_SUPPKEY = S_SUPPKEY AND
C_NATIONKEY = S_NATIONKEY AND
S_NATIONKEY = N_NATIONKEY AND
N_REGIONKEY = R_REGIONKEY AND
R_NAME = 'MIDDLE EAST' AND
O_ORDERDATE >= '1993-01-01' AND
O_ORDERDATE < DATEADD(YY, 1, cast('1993-01-01' as date))
GROUP BY N_NAME
ORDER BY REVENUE DESC

Test Query 11
Test Query 11 checks whether the shipping mode affects delivery dates of high-priority orders. This query is derived from TPC-H Query 12.
SELECT L_SHIPMODE,
SUM( CASE WHEN O_ORDERPRIORITY = '1-URGENT' OR
O_ORDERPRIORITY = '2-HIGH'
THEN 1
ELSE 0
END) AS HIGH_LINE_COUNT,
SUM( CASE WHEN O_ORDERPRIORITY <> '1-URGENT' AND
O_ORDERPRIORITY <> '2-HIGH'
THEN 1
ELSE 0
END) AS LOW_LINE_COUNT
FROM ORDERS,
LINEITEM
WHERE O_ORDERKEY = L_ORDERKEY AND
L_SHIPMODE IN ('FOB','MAIL') AND
L_COMMITDATE < L_RECEIPTDATE AND
L_SHIPDATE < L_COMMITDATE AND
L_RECEIPTDATE >= '1997-01-01' AND
L_RECEIPTDATE < dateadd(yy, 1, cast('1997-01-01' as date))
GROUP BY L_SHIPMODE
ORDER BY L_SHIPMODE

Test Query 12
Test Query 12 calculates the average yearly revenue from small-shipment orders. This query is derived from TPC-H Query 17.
SELECT SUM(L_EXTENDEDPRICE)/7.0 AS AVG_YEARLY
FROM LINEITEM,
PART
WHERE P_PARTKEY = L_PARTKEY AND
P_BRAND = 'Brand#44' AND
P_CONTAINER = 'JUMBO CASE' AND
L_QUANTITY < ( SELECT 0.2 * AVG(L_QUANTITY)
FROM LINEITEM
WHERE L_PARTKEY = P_PARTKEY
)

Appendix B
Building the Database
It took about 6 hours to load 10TB of data onto our system. This is a very impressive result, showing that SQL Server 2016 is capable of loading 100TB worth of data into a columnstore database in 60 hours at a rate of 1.6 TB/hour.

Table 4 provides a summary of the amount of data processed for each step and how SQL Server 2016 performed for the whole build.

OPERATION              ELAPSED TIME (HH:MM)   APPROX. DATA SIZE (TB)   DATA PROCESSING RATE (TB/HR)
Bulk Insert into CCI   62:30                  100                      1.6
Build NCI              1:05                   2                        2
Create Statistics      4:40                   45                       9.5
Backup                 2:00                   45                       22.5
Restore                3:30                   45                       13.3

Table 4. ETL and Database Management Statistics with Intel Xeon Processors E7-8890 v3
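
The rate column in Table 4 follows from dividing the approximate data size by the elapsed time. The Python sketch below recomputes it from the table values. The helper names `hours` and `rate_tb_per_hour` are ours, not part of the test setup, and because the data sizes are approximate, the recomputed rates match the reported column only roughly for some rows.

```python
def hours(hhmm: str) -> float:
    """Convert an HH:MM elapsed-time string to fractional hours."""
    h, m = hhmm.split(":")
    return int(h) + int(m) / 60


def rate_tb_per_hour(size_tb: float, elapsed: str) -> float:
    """Data processing rate in TB/hour (hypothetical helper)."""
    return size_tb / hours(elapsed)


# (operation, elapsed HH:MM, approx. data size in TB), copied from Table 4
TABLE_4 = [
    ("Bulk Insert into CCI", "62:30", 100),
    ("Build NCI", "1:05", 2),
    ("Create Statistics", "4:40", 45),
    ("Backup", "2:00", 45),
    ("Restore", "3:30", 45),
]

for operation, elapsed, size_tb in TABLE_4:
    print(f"{operation:<22}{rate_tb_per_hour(size_tb, elapsed):6.2f} TB/hr")
```

For the bulk insert, 100TB over 62.5 hours gives the 1.6 TB/hour quoted in the text.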

Performance Tests with the Intel Xeon Processor E7-8890 v3
In this section, we describe three performance tests we conducted with the test queries TQ1 through TQ7 on the Intel Xeon processor E7-8890 v3:
1- In-memory performance of SQL Server columnstore engine: We use this test to demonstrate the performance gains from caching data in memory, which reduces I/O requirements. Note that some queries (even at 100TB data scale) can fit all of their data in memory. In cases where the data is too big to fit in memory, we show that SQL Server can still run queries, utilizing the I/O subsystem efficiently. For this test, we ran each query twice in isolation. Before the first execution, we ensured that no data was cached (i.e., we issued the DBCC DROPCLEANBUFFERS command), so that all data for the first run was read from disk. For the second run, data was cached in the SQL Server columnstore object pool. Comparing the two runs illustrates the benefits of in-memory execution.
2- Performance scaling with respect to data size: We use this test to compare how query execution time scales with increased amounts of data. To test this, we built a reference database with the same schema, sized at 3TB. We then compared second-run times for the queries that run in memory. Note that on the same system, all queries running on a 3TB database will be in memory in a second run, whereas only 5 queries run in memory on the 100TB database.
3- Performance of queries when they are run concurrently: We use this test to loosely simulate a use case where multiple clients interact with the database simultaneously. In this test, we started all queries at the same time and measured the time to complete all requests. We then compared this with how long it took to execute the queries sequentially.
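
The measurement procedure in tests 1 and 3 can be sketched as a small timing harness. This is an illustrative Python sketch under our own assumptions, not the tooling used for the paper: `run_query` and `drop_cache` are hypothetical callables that a reader would wire to a database client (for example, `drop_cache` would issue DBCC DROPCLEANBUFFERS).

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def time_cold_and_warm(run_query: Callable[[], None],
                       drop_cache: Callable[[], None]) -> Tuple[float, float]:
    """Return (cold_seconds, warm_seconds) for one query run in isolation."""
    drop_cache()                        # start with no cached data
    start = time.perf_counter()
    run_query()                         # first run: data is read from disk
    cold = time.perf_counter() - start

    start = time.perf_counter()
    run_query()                         # second run: data may be served from memory
    warm = time.perf_counter() - start
    return cold, warm


def time_concurrent(queries: List[Callable[[], None]]) -> float:
    """Start all queries at once and return the time until the last one finishes."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = [pool.submit(query) for query in queries]
        for future in futures:
            future.result()             # re-raise any query failure
    return time.perf_counter() - start
```

The sequential baseline for test 3 is then the sum of the individual run times, compared against the single value returned by `time_concurrent`.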

In-Memory Performance of Columnstore Engine

Table 5 shows execution time for the test queries on 100TB, demonstrating the benefits of in-memory columnstore technology.

Apart from 2 queries (TQ6 and TQ7), all queries can fit in memory thanks to data compression. Caching data in memory provides up to a 30x performance improvement and 2.5x on average on subsequent operations on the same data. Note that in cases where the I/O subsystem is not as performant as our setup, benefits will further increase with larger system memory.

TEST QUERY ID      1ST RUN EXECUTION TIME (SEC)   2ND RUN EXECUTION TIME (SEC)   SPEED UP
TQ1                604                            161                            3.75
TQ2                160                            5                              30.05
TQ3                799                            493                            1.62
TQ4                175                            105                            1.66
TQ5                92                             79                             1.16
TQ6                6459                           3762                           1.72
TQ7                1029                           914                            1.13
Average Speed Up                                                                 2.54x

Table 5. Query Execution Time
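
Each speed-up in Table 5 is the first-run time divided by the second-run time. The paper does not say how the 2.54x average is formed; the Python sketch below shows that it is consistent with a geometric mean of the speed-up column, whereas an arithmetic mean would give about 5.87. That reading is our inference. The TQ2 second-run time appears rounded to 5 seconds here (Table 6 lists 5.3), so its recomputed ratio differs from the reported 30.05.

```python
import math

# (query, 1st-run seconds, 2nd-run seconds), copied from Table 5
TABLE_5 = [
    ("TQ1", 604, 161),
    ("TQ2", 160, 5),
    ("TQ3", 799, 493),
    ("TQ4", 175, 105),
    ("TQ5", 92, 79),
    ("TQ6", 6459, 3762),
    ("TQ7", 1029, 914),
]

# Speed-up column exactly as reported in Table 5
REPORTED_SPEED_UP = [3.75, 30.05, 1.62, 1.66, 1.16, 1.72, 1.13]


def speed_up(first_run_sec: float, second_run_sec: float) -> float:
    """Cold-run time divided by warm-run time."""
    return first_run_sec / second_run_sec


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


recomputed = {query: speed_up(first, second) for query, first, second in TABLE_5}
arithmetic = sum(REPORTED_SPEED_UP) / len(REPORTED_SPEED_UP)
geometric = geometric_mean(REPORTED_SPEED_UP)

print(f"arithmetic mean: {arithmetic:.2f}x, geometric mean: {geometric:.2f}x")
```

A geometric mean keeps a single outlier such as TQ2 from dominating the summary figure.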

Performance Scaling with Data Size

Our next experiment investigates the impact of data size on performance, showing that SQL Server performance scales with the increasing amount of data. To illustrate scaling efficiency, we compare performance of queries running on databases of different sizes, namely 3TB and 100TB. We ensure that data processed by each query scales linearly by database size (i.e., 33x) and that all queries run in memory (to make a fair comparison). We then calculate data-scaling efficiency as:

\[
\text{Data-Scaling Efficiency} = \frac{\text{Execution Time on 3TB} \times 33}{\text{Execution Time on 100TB}} \times 100
\]

In Table 6, we show the data-scaling efficiency of SQL Server 2016 on our platform. Average scaling efficiency of SQL Server 2016 running our test queries is 122%, which implies that in-memory execution is optimized for large data sets. We observed the lowest scaling in TQ3. Further investigation revealed that TQ3 has a long tail of single-threaded execution waiting for the client buffer to consume the very large result set returned by the query.

TEST QUERY ID                     EXECUTION TIME ON 3TB (SEC)   EXECUTION TIME ON 100TB (SEC)   DATA-SCALING EFFICIENCY
TQ1                               4.6                           161.2                           94%
TQ2                               0.5                           5.3                             311%
TQ3                               9.5                           493.9                           63%
TQ4                               2.7                           105.0                           85%
TQ5                               4.3                           79.2                            179%
Average Data Scaling Efficiency                                                                 122%

Table 6. Data-Scaling Efficiency
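
The per-query efficiencies in Table 6 can be reproduced directly from the formula given in the text. The Python sketch below does so; the function name `data_scaling_efficiency` is ours. How the average is formed is not stated in the paper. A geometric mean of the recomputed values comes to about 123%, close to the reported 122%, while an arithmetic mean would be about 146%, so the geometric reading is our assumption.

```python
import math

SCALE_FACTOR = 33  # ratio of the 100TB to the 3TB database, as stated in the text

# (query, seconds on 3TB, seconds on 100TB), copied from Table 6
TABLE_6 = [
    ("TQ1", 4.6, 161.2),
    ("TQ2", 0.5, 5.3),
    ("TQ3", 9.5, 493.9),
    ("TQ4", 2.7, 105.0),
    ("TQ5", 4.3, 79.2),
]


def data_scaling_efficiency(time_3tb: float, time_100tb: float,
                            scale: float = SCALE_FACTOR) -> float:
    """Efficiency in percent; 100 means run time grew exactly with data size."""
    return time_3tb * scale / time_100tb * 100


efficiency = {q: data_scaling_efficiency(small, large) for q, small, large in TABLE_6}
geometric_avg = math.exp(sum(math.log(e) for e in efficiency.values()) / len(efficiency))

for query, value in efficiency.items():
    print(f"{query}: {value:.0f}%")
print(f"geometric mean: {geometric_avg:.0f}%")
```

Values above 100% mean the 100TB run finished faster than a straight 33x extrapolation of the 3TB time would predict.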

Performance of Running Queries Concurrently

Our final experiment demonstrates a scenario where multiple users run queries on the same system. In this test, we started all ad-hoc queries at the same time and ran them to completion, loosely simulating multiple client connections querying the data warehouse simultaneously.

Figure 7 compares the performance of running all queries concurrently against running them sequentially. Running all ad-hoc queries concurrently takes 2:03 hours (i.e., the execution time of the longest-running query), compared to 2:35 hours when they are run back to back.

[Figure 7: query execution time (sec), axis 0 to 10000, for TQ1 through TQ7 run concurrently vs. sequentially; annotated with a 1.26x speed up.]

Figure 7. Concurrent vs. Sequential Query Execution Time
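
The 1.26x figure can be cross-checked from numbers already in the paper. The Python sketch below assumes that the sequential total is the sum of the first-run times in Table 5. That is our assumption, though the sum does come to the reported 2:35. The concurrent time is taken as the reported 2:03.

```python
# First-run (cold) execution times in seconds, copied from Table 5
FIRST_RUN_SEC = {
    "TQ1": 604, "TQ2": 160, "TQ3": 799, "TQ4": 175,
    "TQ5": 92, "TQ6": 6459, "TQ7": 1029,
}


def hhmm(seconds: int) -> str:
    """Format a duration as H:MM, dropping leftover seconds."""
    hours, minutes = divmod(seconds // 60, 60)
    return f"{hours}:{minutes:02d}"


sequential_sec = sum(FIRST_RUN_SEC.values())   # queries run back to back
concurrent_sec = 2 * 3600 + 3 * 60             # 2:03 hours, as reported
ratio = sequential_sec / concurrent_sec

print(f"sequential {hhmm(sequential_sec)}, concurrent {hhmm(concurrent_sec)}, "
      f"speed up {ratio:.2f}x")
```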

With 7 queries running simultaneously, we put our I/O subsystem to the test. We captured the workload characteristics using Perfmon. The workload produced a 3:2 R/W request ratio, with 130KB reads and 64KB writes on average. Our I/O subsystem delivered 56K IOPS at 6ms latency on average, and 180K IOPS at peak with a maximum latency of 23ms. Table 7 summarizes the Perfmon characteristics of this workload.

PERFORMANCE CHARACTERISTICS OF CONCURRENT RUNS
Avg. CPU Utilization        84%
Avg. Read Size              130KB
Avg. Write Size             64KB
Read/Write Ratio            3:2
Avg. Read Bandwidth         4.4GB/s
Avg. Write Bandwidth        1.3GB/s
Avg. I/O Bandwidth          5.6GB/s
Peak I/O Bandwidth          15.3GB/s
Avg. IOPS (Read + Write)    55,740
Peak IOPS (Read + Write)    179,484
Avg. I/O Latency            6ms
Peak I/O Latency            23ms

Table 7. Performance Characteristics of Concurrent Query Execution
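
As a rough consistency check on Table 7, average bandwidth should be close to IOPS multiplied by request size. The Python sketch below makes two assumptions of ours: the 3:2 ratio splits the average IOPS into reads and writes, and units are decimal (1 GB = 1,000,000 KB). Because the inputs are averages, the result is expected to land near the reported bandwidths, not to match them exactly.

```python
AVG_IOPS = 55_740        # Avg. IOPS (Read + Write), from Table 7
READ_SHARE = 3 / 5       # from the 3:2 read/write request ratio
AVG_READ_KB = 130
AVG_WRITE_KB = 64
KB_PER_GB = 1_000_000    # assumption: decimal units

read_iops = AVG_IOPS * READ_SHARE
write_iops = AVG_IOPS - read_iops

read_gb_per_s = read_iops * AVG_READ_KB / KB_PER_GB
write_gb_per_s = write_iops * AVG_WRITE_KB / KB_PER_GB

print(f"estimated read bandwidth:  {read_gb_per_s:.2f} GB/s (reported 4.4)")
print(f"estimated write bandwidth: {write_gb_per_s:.2f} GB/s (reported 1.3)")
```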
