SQL Server 2016 PolyBase
Henk van der Valk
Oct. 15, 2016
Level: Beginner
www.Henkvandervalk.com
Starting SQL2016 on a server with 24 TB RAM
Thanks to our platinum sponsors:
PASS SQL Saturday Holland - 2016
Thanks to our gold and silver sponsors:
PASS SQL Saturday Holland - 2016
APS Onsite!
Speaker Introduction
 10+ years active in the SQL PASS community!
 10 years of Unisys-EMEA Performance Center
• 2002: largest SQL DWH in the world (SQL 2000)
• Project Real (SQL 2005)
• ETL world record: loading 1 TB within 30 mins (SQL 2008)
• Contributor to SQL performance whitepapers
• Perf tips & tricks: www.henkvandervalk.com
 Schuberg Philis: 100% uptime for mission-critical apps
 Since April 1st, 2011: Microsoft Data Platform!
All info represents my own personal opinion (based upon my own experience) and not that of Microsoft.
@HenkvanderValk
Agenda
SQL Server 2016 as fraud detection scoring engine
https://blogs.technet.microsoft.com/machinelearning/2016/09/22/predictions-at-the-speed-of-data/
HTAP (Hybrid Transactional Analytical Processing)
The Big Data Lake Challenge
How to orchestrate?
Different types of data
 Webpages, logs, and clicks
 Hardware and software sensors
 Semi-structured/unstructured data
Large scale
 Hundreds of servers
Advanced data analysis
 Integration between structured and unstructured data
 Power of both
PolyBase builds the Bridge
(Diagram: bridging SQL Server to Azure Blob Storage)
Just-in-Time data integration
 Across relational and non-relational data
 Fast, simple data loading
Best of both worlds
 T-SQL compatible
 Uses computational power at source
 Opportunity for new types of analysis
PolyBase View in SQL Server 2016
• Execute T-SQL queries against relational data in SQL Server and 'semi-structured' data in HDFS and/or Azure
• Leverage existing T-SQL skills and BI tools to gain insights from different data stores
• Expand the reach of SQL Server to Hadoop (HDFS & WASB)
Remove the complexity of big data
T-SQL over Hadoop
JSON support
(Diagram: a single PolyBase T-SQL query spanning SQL Server and Hadoop)
Simple T-SQL to query Hadoop data (HDFS)
Name         DOB       State
Denny Usher  11/13/58  WA
Gina Burch   04/29/76  WA
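A sketch of what such a query can look like, assuming a hypothetical external table dbo.SensorData over HDFS and the local dbo.Customer table shown above (column names are illustrative, not from the deck):

-- Hypothetical example: join a local SQL Server table with an external
-- table whose data lives in Hadoop; PolyBase resolves the external side.
SELECT c.Name, c.State, s.reading_value
FROM dbo.Customer AS c
JOIN dbo.SensorData AS s   -- external table, data in HDFS
  ON s.customer_name = c.Name
WHERE c.State = 'WA';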
PolyBase use cases
PolyBase: turning raw tweets into information
 Query & store Hadoop data: bi-directional, seamless & fast
(Diagram: Azure Blob Storage)
Setup & Query: SQL Server 2016 & SQL DW PolyBase!
Prerequisites
•An instance of SQL Server (64-bit), Enterprise or Developer Edition.
•Microsoft .NET Framework 4.5.
•Oracle Java SE Runtime Environment (JRE) version 7 update 51 or higher (64-bit); either JRE or Server JRE will work. Go to Java SE downloads.
•Note: The installer will fail if JRE is not present.
•Minimum memory: 4 GB.
•Minimum hard disk space: 2 GB.
•TCP/IP connectivity must be enabled.
Step 2: Install SQL Server
(Diagram: multiple SQL Server instances, each with the PolyBase DLLs installed)
Install one or more SQL Server instances with PolyBase
PolyBase DLLs (Engine and DMS) are installed and registered as Windows Services
Prerequisite: User must download and install JRE (Oracle)
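After installation you can verify that the feature is present; a quick sanity check (SERVERPROPERTY returns 1 when PolyBase is installed):

-- Returns 1 if PolyBase is installed on this instance, 0 otherwise.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;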
Components introduced in SQL Server 2016
PolyBase Engine Service
PolyBase Data Movement Service (with HDFS Bridge)
External table constructs
MR pushdown computation support
How to use PolyBase in SQL Server 2016
- Set up a Hadoop Cluster or Azure Storage blob
- Install SQL Server
- Configure a PolyBase group
- Choose Hadoop flavor
- Attach Hadoop Cluster or Azure Storage
(Diagram: head node and compute nodes; PolyBase T-SQL queries are submitted to the head node, and PolyBase queries can only refer to tables and/or external tables on the head node)
Step 1: Set up a Hadoop Cluster…
Hortonworks or Cloudera Distributions
Hadoop 2.0 or above
Linux or Windows
On-premises or in Azure
Step 1: …Or set up an Azure Storage blob
Azure Storage blob (ASB) exposes an HDFS layer
PolyBase reads and writes from ASB using Hadoop RecordReader/RecordWriter
No compute pushdown support for ASB
(Diagram: multiple Azure Storage volumes)
Step 2: Configure a PolyBase group
(Diagram: a PolyBase scale-out group; the head node runs the PolyBase Engine and DMS, each compute node runs the PolyBase DMS)
The head node is the SQL Server instance to which queries are submitted.
Compute nodes are used for scale-out query processing for data in HDFS or Azure.
Step 3: Choose/select Hadoop flavor
Supported Hadoop distributions:
Cloudera CDH 5.x on Linux
Hortonworks 2.x on Linux and Windows Server
What happens under the covers?
Loading the right client JARs to connect to the Hadoop distribution.
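The flavor is selected with the 'hadoop connectivity' sp_configure option; a minimal sketch (value 7 covered Hortonworks HDP 2.x and Azure blob storage at SQL Server 2016 RTM; verify the value-to-distribution mapping in the documentation for your build):

-- Tell PolyBase which Hadoop distribution to load client JARs for.
sp_configure 'hadoop connectivity', 7;
RECONFIGURE;
-- Restart SQL Server (and the PolyBase services) for the change to take effect.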
Step 4: Attach Hadoop Cluster or Azure Storage
(Diagram: the head node's PolyBase Engine and the DMS instances connect to the Azure Storage volumes)
After Setup
Compute nodes are used for scale-out query processing on external tables in HDFS.
Tables on compute nodes cannot be referenced by queries submitted to the head node.
The number of compute nodes can be dynamically adjusted by the DBA.
Hadoop clusters can be shared between multiple SQL Server 2016 PolyBase groups.
(Diagram: head node and compute nodes, as in the earlier overview)
PolyBase configuration
--1: Create a master key on the database.
-- Required to encrypt the credential secret.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'SQLSat#551';
-- select * from sys.symmetric_keys

-- Create a database scoped credential for Azure blob storage.
-- IDENTITY: any string (this is not used for authentication to Azure storage).
-- SECRET: your Azure storage account key.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'wasbuser',
SECRET = '1abcdEFGb3Mcn0F9UdJS/10taXmr5L17xrEO17rlMRL8SNYg==';
Create External Data Source
--2: Create an external data source.
-- LOCATION: Azure storage account name and blob container name.
-- CREDENTIAL: the database scoped credential created above.
CREATE EXTERNAL DATA SOURCE AzureStorage WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://staging@vault2016.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

-- View the list of external data sources:
SELECT * FROM sys.external_data_sources;
Create External File Format
-- select * from sys.external_file_formats
--3: Create an external file format.
-- FORMAT_TYPE: type of format in Hadoop
-- (DELIMITEDTEXT, RCFILE, ORC, PARQUET).
-- With GZIP:
CREATE EXTERNAL FILE FORMAT TextDelimited_GZIP
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);
Create External Table
--4: Create an external table.
-- The external table points to data stored in Azure storage.
-- LOCATION: path to a file or directory that contains the data (relative to the blob container).
-- To point to all files under the blob container, use LOCATION='/'
CREATE EXTERNAL TABLE [dbo].[lineitem4] (
    [ROWID1] [bigint] NULL,
    [L_SHIPDATE] [smalldatetime] NOT NULL,
    [L_ORDERKEY] [bigint] NOT NULL,
    [L_DISCOUNT] [smallmoney] NOT NULL,
    ..  -- (remaining columns elided)
    [L_COMMENT] [varchar](44) NOT NULL
)
WITH (LOCATION = '/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextDelimited_GZIP,  -- the file format created above
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
Import
------------------------------------
-- IMPORT data from WASB into a NEW table:
------------------------------------
SELECT *
INTO [dbo].[LINEITEM_MO_final_temp]
FROM
(
    SELECT * FROM [dbo].[lineitem1]
) AS Import;
Export data (gzipped)
-- Enable export / INSERT into external table
sp_configure 'allow polybase export', 1;
RECONFIGURE;

CREATE EXTERNAL TABLE [dbo].[lineitem_export] (
    [ROWID1] [bigint] NULL,
    ..  -- (remaining columns elided)
    [L_SHIPINSTRUCT] [varchar](25) NOT NULL,
    [L_COMMENT] [varchar](44) NOT NULL
)
WITH (LOCATION = '/gzipped',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextDelimited_GZIP,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
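With export enabled, rows are written to the external table with a plain INSERT … SELECT; a minimal sketch (assuming a local [dbo].[lineitem] table whose columns match the external table definition):

-- Write local rows out to Azure blob storage as gzipped delimited text.
INSERT INTO [dbo].[lineitem_export]
SELECT * FROM [dbo].[lineitem];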
Manage External Resources
 SSMS / VSTS
New:
- External Tables
- External Resources
- Ext. Data Sources
- Ext. File Formats
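When experimenting, the external objects can be cleaned up in reverse order of creation; a sketch using the names created in the earlier steps:

-- Drop dependents first: external table, then file format, then data source.
DROP EXTERNAL TABLE [dbo].[lineitem4];
DROP EXTERNAL FILE FORMAT TextDelimited_GZIP;
DROP EXTERNAL DATA SOURCE AzureStorage;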
PolyBase query example #1
-- select on external table (data in HDFS)
SELECT * FROM Customer
WHERE c_nationkey = 3 AND c_acctbal < 0;

A possible execution plan:
1. CREATE temp table T: execute on compute nodes
2. IMPORT FROM HDFS: HDFS Customer file read into T
3. EXECUTE QUERY: SELECT * FROM T WHERE T.c_nationkey = 3 AND T.c_acctbal < 0
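Since the plan choice is cost-based, statistics on external table columns help the optimizer; a brief sketch (SQL Server 2016 supports CREATE STATISTICS on external tables; the column name comes from the example above):

-- Statistics on external table columns guide import and pushdown decisions.
CREATE STATISTICS stat_c_nationkey ON Customer (c_nationkey) WITH FULLSCAN;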
PolyBase query example #2
-- select and aggregate on external table (data in HDFS)
SELECT AVG(c_acctbal) FROM Customer
WHERE c_acctbal < 0 GROUP BY c_nationkey;

Execution plan:
1. Run MR job on Hadoop: apply filter and compute aggregate on Customer.

What happens here?
Step 1: The query optimizer (QO) compiles the predicate into Java and generates a MapReduce (MR) job.
Step 2: The engine submits the MR job to the Hadoop cluster. Output is left in hdfsTemp.

hdfsTemp:
<US, $-975.21>
<FRA, $-119.13>
<UK, $-63.52>
PolyBase query example #2 (continued)
-- select and aggregate on external table (data in HDFS)
SELECT AVG(c_acctbal) FROM Customer
WHERE c_acctbal < 0 GROUP BY c_nationkey;

1. The predicate and aggregate are pushed into the Hadoop cluster as a MapReduce job.
2. The query optimizer makes a cost-based decision on which operators to push.

Execution plan:
1. Run MR job on Hadoop: apply filter and compute aggregate on Customer; output left in hdfsTemp.
2. CREATE temp table T: on DW compute nodes.
3. IMPORT hdfsTemp: read hdfsTemp into T.
4. RETURN OPERATION: SELECT * FROM T.

hdfsTemp:
<US, $-975.21>
<FRA, $-119.13>
<UK, $-63.52>
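The cost-based pushdown decision can also be overridden per query with hints; a sketch using the same query (both hints are part of SQL Server 2016 PolyBase):

-- Force the MapReduce pushdown to Hadoop:
SELECT AVG(c_acctbal) FROM Customer
WHERE c_acctbal < 0 GROUP BY c_nationkey
OPTION (FORCE EXTERNALPUSHDOWN);

-- Or disable pushdown and stream the raw rows into SQL Server instead:
SELECT AVG(c_acctbal) FROM Customer
WHERE c_acctbal < 0 GROUP BY c_nationkey
OPTION (DISABLE EXTERNALPUSHDOWN);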
Summary: PolyBase
Query relational and non-relational data with T-SQL, on-premises and in Azure.
(Diagram: apps issue a single T-SQL query spanning SQL Server and Hadoop)
Monitoring PolyBase Queries
Lots of new DMVs:
----------------------------------------
-- Monitoring PolyBase / all DMVs:
----------------------------------------
SELECT * FROM sys.external_tables
SELECT * FROM sys.external_data_sources
SELECT * FROM sys.external_file_formats
SELECT * FROM sys.dm_exec_compute_node_errors
SELECT * FROM sys.dm_exec_compute_node_status
SELECT * FROM sys.dm_exec_compute_nodes
SELECT * FROM sys.dm_exec_distributed_request_steps
SELECT * FROM sys.dm_exec_dms_services
SELECT * FROM sys.dm_exec_distributed_requests
SELECT * FROM sys.dm_exec_distributed_sql_requests
SELECT * FROM sys.dm_exec_dms_workers
SELECT * FROM sys.dm_exec_external_operations
SELECT * FROM sys.dm_exec_external_work
Find the longest running query
-- Find the longest running query
SELECT execution_id, st.text, dr.total_elapsed_time
FROM sys.dm_exec_distributed_requests dr
CROSS APPLY sys.dm_exec_sql_text(sql_handle) st
ORDER BY total_elapsed_time DESC;
Find the longest running step of the distributed query plan
-- Find the longest running step of the distributed query plan
SELECT execution_id, step_index, operation_type, distribution_type,
       location_type, status, total_elapsed_time, command
FROM sys.dm_exec_distributed_request_steps
WHERE execution_id = 'QID1120'
ORDER BY total_elapsed_time DESC;
Details on a step_index
SELECT execution_id, step_index, dms_step_index, compute_node_id,
       type, input_name, length, total_elapsed_time, status
FROM sys.dm_exec_external_work
WHERE execution_id = 'QID1120' AND step_index = 7
ORDER BY total_elapsed_time DESC;
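To see how the data movement itself progressed, the DMS workers for the same execution can be inspected as well; a sketch (the exact column list of sys.dm_exec_dms_workers may vary by build, so treat these columns as an assumption and check the DMV's definition):

-- Per-worker data movement details for one distributed query.
SELECT execution_id, step_index, dms_step_index, compute_node_id,
       rows_processed, bytes_processed, total_elapsed_time, status
FROM sys.dm_exec_dms_workers
WHERE execution_id = 'QID1120'
ORDER BY total_elapsed_time DESC;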
Optimizations
PolyBase: data compression to minimize data movement
http://henkvandervalk.com/aps-polybase-for-hadoop-and-windows-azure-blob-storage-wasb-integration
Enable Pushdown configuration (Hadoop)
Improves query performance.
1. Find the file yarn-site.xml in the installation path of SQL Server:
C:\Program Files\Microsoft SQL Server\MSSQL13.SQL2016RTM\MSSQL\Binn\Polybase\Hadoop\conf\yarn-site.xml
2. On the Hadoop machine, in the Hadoop configuration directory, copy the value of the configuration key yarn.application.classpath.
3. On the SQL Server machine, in the yarn-site.xml file, find the yarn.application.classpath property and paste the value from the Hadoop machine into the value element.
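The property to edit looks like the snippet below; the classpath value shown is only an illustration of what a Hortonworks cluster might report, not a literal value to copy:

<property>
  <name>yarn.application.classpath</name>
  <!-- Paste the value copied from the Hadoop cluster here, for example: -->
  <value>$HADOOP_CONF_DIR,/usr/hdp/current/hadoop-client/*,/usr/hdp/current/hadoop-hdfs-client/*,/usr/hdp/current/hadoop-yarn-client/*</value>
</property>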
Time to Insights: APS Cybercrime video & demo!
Various sources, single query
Further Reading
» Get started with PolyBase:
» https://msdn.microsoft.com/en-us/library/mt163689.aspx
» Data compression tests:
» http://henkvandervalk.com/aps-polybase-for-hadoop-and-windows-azure-blob-storage-wasb-integration
Q&A
Henk.vanderValk@microsoft.com
www.henkvandervalk.com
Please fill in the evaluation forms
Editor's Notes
• #2: http://www.sqlsaturday.com/551/Sessions/Schedule.aspx PolyBase has been a high-end feature of SQL APS and is now also introduced in SQL Server 2016, SQL DB, and SQL DW! It allows you to use regular T-SQL statements for ad-hoc access to data stored in Hadoop and/or Azure Blob Storage from within SQL Server. This session will show you how it works and how to get started!
  • #17: BCP out vs RTC
• #27: Improved PolyBase query performance with scale-out computation on external data (PolyBase scale-out groups), and with faster data movement from HDFS to SQL Server and between the PolyBase Engine and SQL Server.
• #35: Additionally, there is support for exporting data to an external data source via INSERT INTO EXTERNAL TABLE SELECT FROM TABLE, support for push-down computation to Hadoop for string operations (compare, LIKE), and support for the ALTER EXTERNAL DATA SOURCE statement.
• #38: When it comes to key BI investments, we are making it much easier to manage relational and non-relational data. PolyBase technology allows you to query Hadoop data and SQL Server relational data through a single T-SQL query. One of the challenges we see with Hadoop is that there are not enough people knowledgeable in Hadoop and MapReduce, and this technology simplifies the skill set needed to manage Hadoop data. This can also work across your on-premises environment or SQL Server running in Azure.