datablogs: Azure Cloud

Its took little long time to derive our approach tunning on azure data factory with Azure Synapse why because we need to run the system atleast two months validate our approach is smooth

Yes its all running good as expected performance on the ETL loads and Processes

Here are the major things we need to take care on Azure Synapse Dedicated Pool ,

Dedicated SQL Pool Scaling
Dedicated SQL Pool Workload Management

Dedicated SQL Pool Scaling :

We have decide below metrices to optimize the Azure Synapse

Data Load and Processing timelines
Critical Process running timelines
Minimal Processing timelines
Reporting and Visualtions Timelines

Based on above requirements , we can easily scalable the Dedicated SQL Pool based on the load

Consider we need to load the data and process with in 2 hours , we can upscale the Dedicated SQL Pool to as like we want

Key Part to increase the DW is concurrency , Instance Capacity and allocated resource for each queries

Before start our critical process , we can automate upscale process with ADF Pipelines itself . So many blogs available to configure that

Best method configure the authentication method with service principle ,

Dedicated SQL Pool Workload Management :

We have decide below metrices to prepare workload management

Short Queries
Critical Queries
High Intensive Queries
Reporting Queries

Based on the classifications , we have to split workload group for above queries

Step 1 :

We need to create login and user for workload management in Dedicated SQL Pool

--CREATE LOGIN [Analyticsprocess] WITH PASSWORD='xxxxxxxxx'
--CREATE USER [Analyticsprocess] FOR LOGIN [Analyticsprocess]
--GRANT CONTROL ON DATABASE::[sql-datablogs-dw] TO
Analyticsprocess

Step 2 :

Consider you have upscaled instance into DW400c below are the resources allocation for the DW400c instance concurrency requirements

In the workload group --> New workload group --> Click ELT

Consider analytics process user is used for high intensive queries we have to allocate as much as minimum resource for workload group

Click Classifiers --> Add Classifiers --> Name it as ELT --> and specify Member should be the login and Label is important to mention

Once click Add , we will get below concurrency range based on DW400c

By Default , its having system level workload group to handle the queries but its not effective we have to force our workload group

Step 3 :

This is very important to utilize the workload group properly . We need to specify Label on the heavy processing queries so that it will utilized properly

CREATE TABLE rpt_datablogs_finalreport.smgreport WITH (HEAP,DISTRIBUTION = REPLICATE) AS select * into rpt_datablogs_finalreport.smgreport_vw from rpt_datablogs_finalreport.vw_smgreport OPTION (LABEL='highintensiveprocess')

So these two things given much improvement on performance as well as reduced operational cost below 60%

Still few more things to learn to optimize it on Azure Data Factory lets see on that next blog

Happy Troubleshooting !!!

Three hours of long running pipeline is reduced to run in 35 Minutes when we given right kind of top bottom optimization like our Body !!!

Yes Sometimes if we give proper core and cross workouts you can run 10km in 55 Minutes !!!

Its happened to me But Not Now 😆😆😆

Oh Oh we are away from Data ground !!! Lets Move into Azure Synapse tunning,

In the starting phase every piece of code and pipeline process was really trouble us to optimize it , whatever we do on tunning it will run as same time

Waited long time until each process to complete as its running 3 hours !!!

But Reecha blog helped something to start with basic check and given some hope to fine tune it

Remember below piece of monitoring code always help us to too dig more

--Check Long Running Query in Azure Synapse

SELECT *
FROM sys.dm_pdw_exec_requests
WHERE status not in ('Completed','Failed','Cancelled')
  AND session_id <> session_id()
ORDER BY submit_time DESC;

--Find the distributed query plan steps for long running query

SELECT * FROM sys.dm_pdw_request_steps WHERE request_id = 'xxxxxx'  ORDER BY step_index; 

***Replace your request_id from first query result

Sample Running Query :

Initially we thought data volume is huge , so we are expecting this much delay on complete . But distributed query Plans are given more finishing time for each queries in Azure Synapse

We will go step by step process of tunning ,

Multiple ways we can see the Execution plan of your Azure Synapse Queries

Using Azure Console --> Click you Dedicated Pool --> Click Query Activity --> Each Queries you have Query Plans

2.Using SSMS , you can run below query and get your explain plan in XML Format

EXPLAIN select * from dbo_datablogs.vw_fact_transferdata (nolock)

So ,We have figured out the execution plan and further need to fix below things to make it faster

We need to reduce data movement operations ( Shuffle Move , Trim Move , Partition Move ) and needs to have proper indexing on your business tables

Based on your data distribution on tables and indexing , queries will be executed on SQL Pool . We need to distribute the tables accordingly . So Major part of the tunning on table level only

Once we done that in proper way we have achieved your milestone ,

How to we distribute table in Azure Synapse ?

Two types of distribution is available in Azure Synapse . Round Robin and Hash Distribution

Round Robin Distribution

By Default , If you create table it will create in Round Robin Distribution
Table rows are distributed in all distribution
Main purpose of this table to improve loading speed we can use round robin distribution
If your table is just junk or doesn't have proper any constraints and keys use round robin distribution

CREATE TABLE [dbo_datablogs].[ControlTable_blogs](

[Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED NOT ENFORCED,

[SourceObjectSettings] [nvarchar](max) NULL)

WITH ( CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN )

Hash Distribution

If your table is having frequent DML operations ( Insert , Update , Delete ) use hash distribution
It will increase performance 200% of your queries
It will distribute the data to nodes based on your distributed column

CREATE TABLE [dbo_datablogs].[ControlTable_blogs](

[Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED NOT ENFORCED,

[SourceObjectSettings] [nvarchar](max) NULL)

WITH ( HEAP, DISTRIBUTION = HASH(Id))

Replicate

This is another way to storing the table in SQL Pool

Full copy of the table will be available in all distribution to avoid data movement
Table is having less size use replicate option

CREATE TABLE [dbo_datablogs].[ControlTable_blogs](

[Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED NOT ENFORCED,

[SourceObjectSettings] [nvarchar](max) NULL)

WITH ( HEAP, DISTRIBUTION = REPLICATE )

Thumb of Rules of Creating tables

First Level : Find the absolute unique column or else use partial unique column for creating table as Hash Distribution

Second Level : If your table doesn't have proper unique columns , try to create the table with used columns in Join , Group By , Distinct , Over and Having Clauses on Procedures or Select Clauses

Lets examine First Level ,

When creating the table with Round Robin and created table with below script ,

CREATE TABLE dbt_datablogs.fact_transferdata

WITH

(

CLUSTERED COLUMNSTORE INDEX,

DISTRIBUTION = ROUND_ROBIN

)

AS SELECT * FROM dbt_datablogs.vw_fact_transferdata

Its taken 28 Minutes to complete

So we figured out good candidate key and created table with below script

CREATE TABLE dbt_datablogs.fact_transferdata

WITH

(

CLUSTERED COLUMNSTORE INDEX,

DISTRIBUTION = HASH(tranferaddressid)

)

AS SELECT * FROM dbt_datablogs.vw_fact_transferdata

Its taken 4 Minutes to complete

We will see the huge difference when creating proper keys

Lets examine Second Level ,

When we don't have proper keys , based on your business logics or select clause join conditions we can try to evaluate with multiple columns and create proper hash function to improve performance

Lets do the smaller table tunning on this blog , you can try the same in your own environments ,

Below table is small but its having shuffle move and its delaying the process 11 Minutes ,

Then based on shuffle columns , add the proper hash function and recreate the table

CREATE TABLE dbt_datablogs.fact_memberinfo

WITH

(

Heap,

DISTRIBUTION = HASH(infoid)

)

AS SELECT * FROM dbt_datablogs.vw_memberinfo

Once created the table again examined the Execution plan for the same statement and process is finished with in a seconds

This is not an end , Also we have tunned few things on Data Factory and Azure Synapse Workload Management to get expected performance , we will see further on next blog

To learn more in-depth , Please use below references from Microsoft Site as well

References :

1.Cheat Sheet - https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/cheat-sheet?view=azure-sqldw-latest

2.Distributed Tables - https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

3.Replicated Tables - https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-guidance-for-replicated-tables

datablogs

Data will talk to you if you're willing to listen !!!

Categories

Friday, February 17, 2023

How to tune long running Queries on Azure Synapse DW (Dedicated Pool) - Part 2

Sunday, August 28, 2022

How to tune long running Queries on Azure Synapse DW (Dedicated Pool) - Part 1