Note

Access to this page requires authorization. You can try signing in or .

Access to this page requires authorization. You can try .

Troubleshoot a slow query on a dedicated SQL pool

Applies to:   Azure Synapse Analytics

This article helps you identify the reasons and apply mitigations for common performance issues with queries on an Azure Synapse Analytics dedicated SQL pool.

Follow the steps to troubleshoot the issue. The first three steps walk you through collecting telemetry, which describes the lifecycle of a query. The references at the end of the article help you analyze potential opportunities found in the data collected.

Important

Most of the reported performance issues are caused by:

  • Outdated statistics
  • Unhealthy clustered columnstore indexes (CCIs)

To save troubleshooting time, make sure that the statistics are created and up-to-date, and rebuild clustered columnstore indexes in the dedicated SQL pool.

Step 1: Identify the request_id (also known as QID)

The request_id of the slow query is required to research potential reasons for a slow query. Use the following script as a starting point for identifying the query you want to troubleshoot. Once the slow query is identified, note down the request_id value.

First, monitor the active queries. This query is ordered by the newest rows first.

-- Monitor active queries
SELECT *
FROM sys.dm_pdw_exec_requests
WHERE [status] NOT IN ('Completed', 'Failed', 'Cancelled')
 AND session_id <> session_id()
-- AND [label] = '<YourLabel>'
-- AND resource_allocation_percentage is not NULL
ORDER BY submit_time DESC;

Then, find the active queries that have the longest run time, starting with the longest-running queries.

-- Find top 10 longest running queries
SELECT TOP 10 *
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;

To better target the slow queries, use the following tips when you run the script:

  • Sort by either submit_time DESC or total_elapsed_time DESC to have the longest-running queries present at the top of the result set.
  • Use OPTION(LABEL='<YourLabel>') in your queries and then filter the label column to identify them.
  • Consider filtering out any QIDs that don't have a value for resource_allocation_percentage when you know the target statement is contained in a batch. Be cautious with this filter as it might also filter out some queries that are being blocked by other sessions.

Step 2: Determine where the query is taking time

Run the following script to find the step that might cause the performance issue of the query. Update the variables in the script with the values described in the following table. Change the @ShowActiveOnly value to 0 to get the full picture of the distributed plan. Take note of the StepIndex, Phase, and Description values of the slow step identified from the result set.

Parameter Description
@QID The request_id value is obtained in Step 1.
@ShowActiveOnly Setting the value to 0 shows all steps for the query.
Setting the value to 1 shows only the currently active step.
DECLARE @QID AS VARCHAR (16) = '<request_id>', @ShowActiveOnly AS BIT = 1;
-- Retrieve session_id of QID

DECLARE @session_id AS VARCHAR (16) = (SELECT session_id
 FROM sys.dm_pdw_exec_requests
 WHERE request_id = @QID);
-- Blocked by Compilation or Resource Allocation (Concurrency)
SELECT @session_id AS session_id, @QID AS request_id, -1 AS [StepIndex], 'Compilation' AS [Phase],
 'Blocked waiting on '
 + MAX(CASE WHEN waiting.type = 'CompilationConcurrencyResourceType' THEN 'Compilation Concurrency'
 WHEN waiting.type LIKE 'Shared-%' THEN ''
 ELSE 'Resource Allocation (Concurrency)' END)
 + MAX(CASE WHEN waiting.type LIKE 'Shared-%' THEN ' for ' + REPLACE(waiting.type, 'Shared-', '')
 ELSE '' END) AS [Description],
 MAX(waiting.request_time) AS [StartTime], GETDATE() AS [EndTime],
 DATEDIFF(ms, MAX(waiting.request_time), GETDATE())/1000.0 AS [Duration],
 NULL AS [Status], NULL AS [EstimatedRowCount], NULL AS [ActualRowCount], NULL AS [TSQL]
FROM sys.dm_pdw_waits waiting
WHERE waiting.session_id = @session_id
 AND ([type] LIKE 'Shared-%'
 OR [type] IN ('ConcurrencyResourceType', 'UserConcurrencyResourceType', 'CompilationConcurrencyResourceType'))
 AND [state] = 'Queued'
GROUP BY session_id
-- Blocked by another query
UNION ALL
SELECT @session_id AS session_id,
 @QID AS request_id,
 -1 AS [StepIndex],
 'Compilation' AS [Phase],
 'Blocked by ' + blocking.session_id + ':' + blocking.request_id + ' when requesting ' + waiting.type + ' on ' + QUOTENAME(waiting.object_type) + waiting.object_name AS [Description],
 waiting.request_time AS [StartTime],
 GETDATE() AS [EndTime],
 DATEDIFF(ms, waiting.request_time, GETDATE()) / 1000.0 AS [Duration],
 NULL AS [Status],
 NULL AS [EstimatedRowCount],
 NULL AS [ActualRowCount],
 COALESCE (blocking_exec_request.command, blocking_exec_request.command2) AS [TSQL]
FROM sys.dm_pdw_waits AS waiting
 INNER JOIN
 sys.dm_pdw_waits AS blocking
 ON waiting.object_type = blocking.object_type
 AND waiting.object_name = blocking.object_name
 INNER JOIN
 sys.dm_pdw_exec_requests AS blocking_exec_request
 ON blocking.request_id = blocking_exec_request.request_id
WHERE waiting.session_id = @session_id
 AND waiting.state = 'Queued'
 AND blocking.state = 'Granted'
 AND waiting.type != 'Shared'
-- Request Steps
UNION ALL
SELECT @session_id AS session_id,
 @QID AS request_id,
 step_index AS [StepIndex],
 'Execution' AS [Phase],
 operation_type + ' (' + location_type + ')' AS [Description],
 start_time AS [StartTime],
 end_time AS [EndTime],
 total_elapsed_time / 1000.0 AS [Duration],
 [status] AS [Status],
 CASE WHEN estimated_rows > -1 THEN estimated_rows END AS [EstimatedRowCount],
 CASE WHEN row_count > -1 THEN row_count END AS [ActualRowCount],
 command AS [TSQL]
FROM sys.dm_pdw_request_steps
WHERE request_id = @QID
 AND [status] = CASE @ShowActiveOnly WHEN 1 THEN 'Running' ELSE [status] END
ORDER BY StepIndex;

Step 3: Review step details

Run the following script to review the details of the step identified in the previous step. Update the variables in the script with the values described in the following table. Change the @ShowActiveOnly value to 0 to compare all distribution timings. Take note of the wait_type value for the distribution that might cause the performance issue.

Parameter Description
@QID The request_id value is obtained in Step 1.
@StepIndex The StepIndex value is identified in Step 2.
@ShowActiveOnly Setting the value to 0 shows all distributions for the given StepIndex value.
Setting the value to 1 shows only the currently active distributions for the given StepIndex value.
DECLARE @QID VARCHAR(16) = '<request_id>', @StepIndex INT = <StepIndex>, @ShowActiveOnly BIT = 1;
WITH dists
AS (SELECT request_id, step_index, 'sys.dm_pdw_sql_requests' AS source_dmv,
 distribution_id, pdw_node_id, spid, 'NativeSQL' AS [type], [status],
 start_time, end_time, total_elapsed_time, row_count
 FROM sys.dm_pdw_sql_requests
 WHERE request_id = @QID AND step_index = @StepIndex
 UNION ALL
 SELECT request_id, step_index, 'sys.dm_pdw_dms_workers' AS source_dmv,
 distribution_id, pdw_node_id, sql_spid AS spid, [type],
 [status], start_time, end_time, total_elapsed_time, rows_processed as row_count
 FROM sys.dm_pdw_dms_workers
 WHERE request_id = @QID AND step_index = @StepIndex
 )
SELECT sr.step_index, sr.distribution_id, sr.pdw_node_id, sr.spid,
 sr.type, sr.status, sr.start_time, sr.end_time,
 sr.total_elapsed_time, sr.row_count, owt.wait_type, owt.wait_time
FROM dists sr
 LEFT JOIN sys.dm_pdw_nodes_exec_requests owt
 ON sr.pdw_node_id = owt.pdw_node_id
 AND sr.spid = owt.session_id
 AND ((sr.source_dmv = 'sys.dm_pdw_sql_requests'
 AND sr.status = 'Running') -- sys.dm_pdw_sql_requests status
 OR (sr.source_dmv = 'sys.dm_pdw_dms_requests'
 AND sr.status not LIKE 'Step[CE]%')) -- sys.dm_pdw_dms_workers final statuses
WHERE sr.request_id = @QID
 AND ((sr.source_dmv = 'sys.dm_pdw_sql_requests' AND sr.status =
 CASE WHEN @ShowActiveOnly = 1 THEN 'Running' ELSE sr.status END)
 OR (sr.source_dmv = 'sys.dm_pdw_dms_workers' AND sr.status NOT LIKE
 CASE WHEN @ShowActiveOnly = 1 THEN 'Step[CE]%' ELSE '' END))
 AND sr.step_index = @StepIndex
ORDER BY distribution_id

Step 4: Diagnose and mitigate

Compilation phase issues

Execution phase issues

  • Use the following table to analyze the result set in Step 2. Determine your scenario and check the common cause for detailed information and the possible mitigation steps.

    Scenario Common Cause
    EstimatedRowCount/ActualRowCount < 25% Inaccurate estimates
    The Description value indicates BroadcastMoveOperation and the query references a replicated table. Uncached replicated tables
    1. @ShowActiveOnly = 0
    2. High or unexpected number of steps (step_index) is observed.
    3. Data types of joiner columns aren't identical between tables.
    Mismatched data type/size
    1. The Description value indicates HadoopBroadcastOperation, HadoopRoundRobinOperation or HadoopShuffleOperation.
    2. The total_elapsed_time value of a given step_index is inconsistent between executions.
    Ad hoc external table queries
  • Check the total_elapsed_time value obtained in Step 3. If it's significantly higher in a few distributions in a given step, follow the these steps:

    1. Check the data distribution for every table referenced in the TSQL field for associated step_id by running the following command against each:

      DBCC PDW_SHOWSPACEUSED(<table>);
      
    2. If <minimum rows value>/<maximum rows value> > 0.1, go to Data skew (stored).

    3. Otherwise, go to In-flight data skew.

Wait type issues

If none of the above common issues apply to your query, the Step 3 data affords the opportunity to determine which wait types (in wait_type and wait_time) are interfering with query processing for the longest-running step. There are a large number of wait types, and they're grouped into related categories due to similar mitigations. Follow these steps to locate the wait category of your query step:

  1. Identify the wait_type in Step 3 that is taking the most time.
  2. Locate the wait type in wait categories mapping table and identify the wait category it included in.
  3. Expand the section related to the wait category from the following list for recommended mitigations.

Feedback

Was this page helpful?

Additional resources