DDSQL Reference

Available for:

DDSQL Editor | Notebooks

Overview

DDSQL is SQL for Datadog data. It implements several standard SQL operations, such as SELECT, and allows queries against unstructured data. You can perform actions like getting exactly the data you want by writing your own SELECT statement, or querying tags as if they are standard table columns.

You can run DDSQL queries from AI agents using the Datadog MCP Server ddsql toolset (Preview).

This documentation covers the SQL support available and includes:

Syntax

The following SQL syntax is supported:

SELECT (DISTINCT) (DISTINCT: Optional)

Retrieves rows from a database, with DISTINCT filtering out duplicate records.

SELECTDISTINCTcustomer_idFROMorders

JOIN

Combines rows from two or more tables based on a related column between them. Supports FULL JOIN, INNER JOIN, LEFT JOIN, RIGHT JOIN.

SELECTorders.order_id,customers.customer_nameFROMordersJOINcustomersONorders.customer_id=customers.customer_id

GROUP BY

Groups rows that have the same values in specified columns into summary rows.

SELECTproduct_id,SUM(quantity)FROMsalesGROUPBYproduct_id

|| (concat)

Concatenates two or more strings together.

SELECTfirst_name||' '||last_nameASfull_nameFROMemployees

WHERE (Includes support for LIKE, IN, ON, OR filters)

Filters records that meet a specified condition.

SELECT*FROMemployeesWHEREdepartment='Sales'ANDnameLIKE'J%'

CASE

Provides conditional logic to return different values based on specified conditions.

SELECTorder_id,CASEWHENquantity>10THEN'Bulk Order'ELSE'Standard Order'ENDASorder_typeFROMorders

WINDOW

Performs a calculation across a set of table rows that are related to the current row.

SELECTtimestamp,service_name,cpu_usage_percent,AVG(cpu_usage_percent)OVER(PARTITIONBYservice_nameORDERBYtimestampROWSBETWEEN2PRECEDINGANDCURRENTROW)ASmoving_avg_cpuFROMcpu_usage_data

IS NULL / IS NOT NULL

Checks if a value is null or not null.

SELECT*FROMordersWHEREdelivery_dateISNULL

LIMIT

Specifies the maximum number of records to return.

SELECT*FROMcustomersLIMIT10

OFFSET

Skips a specified number of records before starting to return records from the query.

SELECT*FROMemployeesOFFSET20

ORDER BY

Sorts the result set of a query by one or more columns. Includes ASC, DESC for sorting order.

SELECT*FROMsalesORDERBYsale_dateDESC

HAVING

Filters records that meet a specified condition after grouping.

SELECTproduct_id,SUM(quantity)FROMsalesGROUPBYproduct_idHAVINGSUM(quantity)>10

IN, ON, OR

Used for specified conditions in queries. Available in WHERE, JOIN clauses.

SELECT*FROMordersWHEREorder_statusIN('Shipped','Pending')

USING

This clause is a shorthand for joins where the join columns have the same name in both tables. It takes a comma-separated list of those columns and creates a separate equality condition for each matching pair. For example, joining T1 and T2 with USING (a, b) is equivalent to ON T1.a = T2.a AND T1.b = T2.b.

SELECTorders.order_id,customers.customer_nameFROMordersJOINcustomersUSING(customer_id)

AS

Renames a column or table with an alias.

SELECTfirst_nameASnameFROMemployees

Arithmetic Operations

Performs basic calculations using operators like +, -, *, /.

SELECTprice,tax,(price*tax)AStotal_costFROMproducts

INTERVAL value unit

Interval representing a time duration specified in a given unit. Supported units:
- milliseconds / millisecond
- seconds / second
- minutes / minute
- hours / hour
- days / day

Data types

DDSQL supports the following data types:

Data Type	Description
`BIGINT`	64-bit signed integers.
`BOOLEAN`	`true` or `false` values.
`DECIMAL`	Floating-point numbers.
`INET`	Network address values (IPv4 and IPv6, with optional CIDR prefix length).
`INTERVAL`	Time duration values.
`JSON`	JSON data.
`TIMESTAMP`	Date and time values.
`VARCHAR`	Variable-length character strings.

Array types

All data types support array types. See Arrays for array literals, element access, and array functions.

Type literals

DDSQL supports explicit type literals using the syntax [TYPE] [value].

Type	Syntax	Example
`BIGINT`	`BIGINT 'value'`	`BIGINT '1234567'`
`BOOLEAN`	`BOOLEAN 'value'`	`BOOLEAN 'true'`
`DECIMAL`	`DECIMAL 'value'`	`DECIMAL '3.14159'`
`INET`	`INET 'value'`	`INET '192.168.1.5/24'`
`INTERVAL`	`INTERVAL 'value unit'`	`INTERVAL '30 minutes'`
`JSON`	`JSON 'value'`	`JSON '{"key": "value", "count": 42}'`
`TIMESTAMP`	`TIMESTAMP 'value'`	`TIMESTAMP '2023-12-25 10:30:00'`
`VARCHAR`	`VARCHAR 'value'`	`VARCHAR 'hello world'`

The type prefix can be omitted and the type is automatically inferred from the value. For example, 'hello world' is inferred as VARCHAR, 123 as BIGINT, and true as BOOLEAN. Use explicit type prefixes when values could be ambiguous; for example,TIMESTAMP '2025-01-01' would be inferred as VARCHAR without the prefix.

Example

-- Using type literals in queries
SELECTVARCHAR'Product Name: '||nameASlabeled_name,price*DECIMAL'1.08'ASprice_with_tax,created_at+INTERVAL'7 days'ASexpiry_dateFROMproductsWHEREcreated_at>TIMESTAMP'2025-01-01';

Arrays

Arrays are ordered collections of values that all share the same data type. Every DDSQL base type has a corresponding array type.

Array literals

Use the ARRAY[value1, value2, ...] syntax to construct an array literal. The array’s type is automatically inferred from the values.

SELECTARRAY['apple','banana','cherry']ASfruits;-- VARCHAR array
SELECTARRAY[1,2,3]ASnumbers;-- BIGINT array
SELECTARRAY[true,false,true]ASflags;-- BOOLEAN array
SELECTARRAY[1.1,2.2,3.3]ASdecimals;-- DECIMAL array

Element access

Access individual array elements with a 1-based subscript. Accessing an index that is out of bounds returns NULL.

SELECTARRAY['a','b','c'][1];-- Returns 'a'
SELECTARRAY['a','b','c'][2];-- Returns 'b'
SELECTARRAY['a','b','c'][10];-- Returns NULL (out of bounds)

To access elements of an array column, use the same subscript syntax:

SELECTrecipients[1]ASfirst_recipientFROMemails

Array functions

The following functions operate on arrays:

Function	Return Type	Description
`CARDINALITY(array a)`	`BIGINT`	Returns the number of elements in the array.
`ARRAY_POSITION(array a, typeof_array value)`	`BIGINT`	Returns the 1-based index of the first occurrence of `value` in the array, or `NULL` if not found.
`STRING_TO_ARRAY(string s, string delimiter)`	`VARCHAR[]`	Splits a string into an array of strings on the given delimiter.
`ARRAY_TO_STRING(array a, string delimiter)`	`VARCHAR`	Joins array elements into a string with the given delimiter.
`ARRAY_AGG(expression e)`	array of input type	Aggregates values from multiple rows into an array.
`UNNEST(array a [, array b...])`	rows of a [, b…]	Expands one or more arrays into a set of rows. Only valid in a `FROM` clause.

Functions

The following SQL functions are supported. For Window function, see the separate Window function section in this documentation.

Function	Return Type	Description
`MIN(variable v)`	typeof v	Returns the smallest value in a set of data.
`MAX(variable v)`	typeof v	Returns the maximum value across all input values.
`COUNT(any a)`	numeric	Returns the number of input values that are not null.
`SUM(numeric n)`	numeric	Returns the summation across all input values.
`AVG(numeric n)`	numeric	Returns the average value (arithmetic mean) across all input values.
`BOOL_AND(boolean b)`	boolean	Returns whether all non-null input values are true.
`BOOL_OR(boolean b)`	boolean	Returns whether any non-null input value is true.
`CEIL(numeric n)` / `CEILING(numeric n)`	numeric	Returns the value rounded up to the nearest integer. Both `CEIL` and `CEILING` are supported as aliases.
`FLOOR(numeric n)`	numeric	Returns the value rounded down to the nearest integer.
`ROUND(numeric n)`	numeric	Returns the value rounded to the nearest integer.
`POWER(numeric base, numeric exponent)`	numeric	Returns the value of base raised to the power of exponent.
`LOWER(string s)`	string	Returns the string as lowercase.
`UPPER(string s)`	string	Returns the string as uppercase.
`ABS(numeric n)`	numeric	Returns the absolute value.
`COALESCE(args a)`	typeof first non-null a OR null	Returns the first non-null value or null if all are null.
`CAST(value AS type)`	type	Converts the given value to the specified data type.
`LENGTH(string s)`	integer	Returns the number of characters in the string.
`TRIM(string s)`	string	Removes leading and trailing whitespace from the string.
`REPLACE(string s, string from, string to)`	string	Replaces occurrences of a substring within a string with another substring.
`SUBSTRING(string s, int start, int length)`	string	Extracts a substring from a string, starting at a given position and for a specified length.
`REVERSE(string s)`	string	Returns the string with characters in reverse order.
`STRPOS(string s, string substring)`	integer	Returns the first index position of the substring in a given string, or 0 if there is no match.
`SPLIT_PART(string s, string delimiter, integer index)`	string	Splits the string on the given delimiter and returns the string at the given position counting from one.
`EXTRACT(unit from timestamp/interval)`	numeric	Extracts a part of a date or time field (such as year or month) from a timestamp or interval.
`TO_TIMESTAMP(string timestamp, string format)`	timestamp	Converts a string to a timestamp according to the given format.
`TO_TIMESTAMP(numeric epoch)`	timestamp	Converts a UNIX epoch timestamp (in seconds) to a timestamp.
`TO_CHAR(timestamp t, string format)`	string	Converts a timestamp to a string according to the given format.
`DATE_BIN(interval stride, timestamp source, timestamp origin)`	timestamp	Aligns a timestamp (source) to buckets of even length (stride). Returns the start of the bucket containing the source, calculated as the largest timestamp that is less than or equal to source and is a multiple of stride lengths from origin.
`DATE_TRUNC(string unit, timestamp t)`	timestamp	Truncates a timestamp to a specified precision based on the provided unit.
`CURRENT_SETTING(string setting_name)`	string	Returns the current value of the specified setting. Supports the parameters `dd.time_frame_start` and `dd.time_frame_end`, which return the start and end of the global time frame, respectively.
`NOW()`	timestamp	Returns the current UTC timestamp at the start of the current query.
`CARDINALITY(array a)`	integer	Returns the number of elements in the array.
`ARRAY_POSITION(array a, typeof_array value)`	integer	Returns the index of the first occurrence of the value found in the array, or null if value is not found.
`STRING_TO_ARRAY(string s, string delimiter)`	array of strings	Splits the given string into an array of strings using the given delimiter.
`ARRAY_TO_STRING(array a, string delimiter)`	string	Converts an array to a string by concatenating elements with the given delimiter.
`ARRAY_AGG(expression e)`	array of input type	Creates an array by collecting all the input values.
`APPROX_PERCENTILE(double percentile) WITHIN GROUP (ORDER BY expression e)`	typeof expression	Computes an approximate percentile value. The percentile must be between 0.0 and 1.0 (inclusive). Requires the `WITHIN GROUP (ORDER BY ...)` syntax.
`UNNEST(array a [, array b...])`	rows of a [, b…]	Expands arrays into a set of rows. This form is only allowed in a FROM clause.

Regular expressions

Flavor

All regular expression (regex) functions use the International Components for Unicode (ICU) flavor:

Metacharacters
Operators
Set Expressions (Character Classes)
Flag Options for in-pattern flags. Refer to the flags section below for function-level flags.
Find and Replace (using capture groups)

Functions

Function	Return Type	Description
`REGEXP_LIKE(string input, string pattern)`	Boolean	Evaluates whether a string matches a regular expression pattern.
`REGEXP_MATCH(string input, string pattern [, string flags ])`	array of strings	Returns substrings of the first pattern match in the string. This function searches the input string using the given pattern and returns captured substrings (capture groups) from the first match. If no capture groups are present, returns the full match.
`REGEXP_REPLACE(string input, string pattern, string replacement [, string flags ])`	string	Replaces the substring that is the first match to the pattern, or all such matches if you use the optional `g` flag.
`REGEXP_REPLACE (string input, string pattern, string replacement, integer start, integer N [, string flags ] )`	string	Replaces the substring that is the Nth match to the pattern, or all such matches if `N` is zero, starting from `start`.

Function-level flags

You can use the following flags with regular expression functions:

i: Case-insensitive matching
n or m: Newline-sensitive matching
g: Global; replace all matching substrings rather than only the first one

Window functions

This table provides an overview of the supported window functions. For comprehensive details and examples, see the PostgreSQL documentation.

Function	Return Type	Description
`OVER`	N/A	Defines a window for a set of rows for other window functions to operate on.
`PARTITION BY`	N/A	Divides the result set into partitions, specifically for applying window functions.
`RANK()`	integer	Assigns a rank to each row within a partition, with gaps for ties.
`ROW_NUMBER()`	integer	Assigns a unique sequential number to each row within a partition.
`LEAD(column n)`	typeof column	Returns the value from the next row in the partition.
`LAG(column n)`	typeof column	Returns the value from the previous row in the partition.
`FIRST_VALUE(column n)`	typeof column	Returns the first value in an ordered set of values.
`LAST_VALUE(column n)`	typeof column	Returns the last value in an ordered set of values.
`NTH_VALUE(column n, offset)`	typeof column	Returns the value at the specified offset in an ordered set of values.

JSON functions and operators

Name	Return type	Description
json_extract_path_text(text json, text path…)	text	Extracts a JSON sub-object as text, defined by the path. Its behavior is equivalent to the Postgres function with the same name. For example, `json_extract_path_text(col, ‘forest')` returns the value of the key `forest` for each JSON object in `col`. See the example below for a JSON array syntax.
json_extract_path(text json, text path…)	JSON	Same functionality as `json_extract_path_text`, but returns a column of JSON type instead of text type.
json_array_elements(text json)	rows of JSON	Expands a JSON array into a set of rows. This form is only allowed in a FROM clause.
json_array_elements_text(text json)	rows of text	Expands a JSON array into a set of rows. This form is only allowed in a FROM clause.

Network address functions and operators

The inet type represents IPv4 and IPv6 network addresses with an optional CIDR prefix length (for example, 192.168.1.5/24 or ::1). Create inet values with the type literal syntax INET 'value' or by casting a string with CAST(column AS inet).

Functions

Function	Return Type	Description
`host(inet addr)`	`VARCHAR`	Returns the IP address as text, without the prefix length.
`network(inet addr)`	`INET`	Returns the network part of the address, with host bits zeroed.
`netmask(inet addr)`	`INET`	Returns the network mask for the address.
`masklen(inet addr)`	`BIGINT`	Returns the prefix length of the network mask.
`broadcast(inet addr)`	`INET`	Returns the broadcast address of the network.
`family(inet addr)`	`BIGINT`	Returns the address family: `4` for IPv4, `6` for IPv6.

Operators

Operator	Return Type	Description
`inet a << inet b`	`BOOLEAN`	Returns `true` if `a` is strictly contained within `b`.
`inet a <<= inet b`	`BOOLEAN`	Returns `true` if `a` is contained within or equals `b`.
`inet a >> inet b`	`BOOLEAN`	Returns `true` if `a` strictly contains `b`.
`inet a >>= inet b`	`BOOLEAN`	Returns `true` if `a` contains or equals `b`.
`inet a && inet b`	`BOOLEAN`	Returns `true` if the subnets of `a` and `b` overlap.

Table functions

Table functions are used to query logs, metrics, cloud costs, and other data sources.

Function	Description	Example
dd.logs( columns => array < varchar >, filter ? => varchar, indexes ? => array < varchar >, storage ? => varchar, from_timestamp ? => timestamp, to_timestamp ? => timestamp ) AS (column_name type [, ...])	Returns log data as a table. The columns parameter specifies which log fields to extract. Nested fields are accessed using dot notation, and non-core fields need to be prepended by `@`. The AS clause defines the schema of the returned table. Optional: filtering by index or time range. When time is not specified, DDSQL defaults to the global time setting, which in DDSQL Editor is set to the past 1 hour. Optional: specifying the storage to use (for example, `hot`, `flex_tier`). When not specified, the default is hot storage.	`SELECTtimestamp,host,service,message,asset_idFROMdd.logs(filter=>'source:java',columns=>ARRAY['timestamp','host','service','message','@asset.id'])AS(timestampTIMESTAMP,hostVARCHAR,serviceVARCHAR,messageVARCHAR,asset_idVARCHAR)`
dd.metrics_scalar( query varchar, reducer varchar [, from_timestamp timestamp, to_timestamp timestamp] )	Returns metric data as a scalar value. The function accepts a metrics query (with optional grouping), a reducer to determine how values are aggregated (avg, max, etc.), and optional timestamp parameters (default 1 hour) to define the time range.	`SELECTFROMdd.metrics_scalar('avg:system.cpu.user{} by {service}','avg',TIMESTAMP'2025-07-10 00:00:00.000-04:00',TIMESTAMP'2025-07-17 00:00:00.000-04:00')ORDERBYvalueDESC;`
dd.metrics_timeseries( query varchar [, from_timestamp timestamp, to_timestamp timestamp] )	Returns metric data as a timeseries. The function accepts a metrics query (with optional grouping) and optional timestamp parameters (default 1 hour) to define the time range. Returns datapoints over time rather than a single aggregated value.	`SELECTFROMdd.metrics_timeseries('avg:system.cpu.user{} by {service}',TIMESTAMP'2025-07-10 00:00:00.000-04:00',TIMESTAMP'2025-07-17 00:00:00.000-04:00')ORDERBYtimestamp,service;`
dd.cloud_cost_scalar( query varchar, reducer varchar [, from_timestamp timestamp, to_timestamp timestamp] )	Returns Cloud Cost Management data as a scalar value. The function accepts a cloud cost query (with optional grouping), an aggregation reducer (use `sum` for cost data; other reducers such as `avg`, `min`, and `max` are accepted but rarely applicable to cost queries), and optional timestamp parameters (default 1 hour) to define the time range. Note: Cloud cost data is typically delayed by 24-48 hours, so recent timestamps may return no results.	`SELECTFROMdd.cloud_cost_scalar('sum:all.cost{} by {service}','sum',TIMESTAMP'2025-07-10 00:00:00.000-04:00',TIMESTAMP'2025-07-17 00:00:00.000-04:00')ORDERBYvalueDESC;`
dd.cloud_cost_timeseries( query varchar [, from_timestamp timestamp, to_timestamp timestamp] )	Returns Cloud Cost Management data as a timeseries. The function accepts a cloud cost query (with optional grouping) and optional timestamp parameters (default 1 hour) to define the time range. Returns cost datapoints over time rather than a single aggregated value. Note: Cloud cost data is typically delayed by 24-48 hours, so recent timestamps may return no results.	`SELECTFROMdd.cloud_cost_timeseries('sum:all.cost{} by {service}',TIMESTAMP'2025-07-10 00:00:00.000-04:00',TIMESTAMP'2025-07-17 00:00:00.000-04:00')ORDERBYtimestamp,service;`

Name	Return type	Description
tags -> ’text'	Text	Gets the value for a given key. Returns `null` if key is not present.
akeys(hstore tags)	Array of text	Gets the keys of an HSTORE as an array
avals(hstore tags)	Array of text	Gets the values of an HSTORE as an array

Writing efficient queries

Queries that read large amounts of data or run heavy computations can be slow or return resource errors. The patterns below are the most common causes, each with a rewrite that usually resolves the issue.

Filter early and aggregate early. Computations in the query filter or a GROUP BY summary run against the index. Computations in a JOIN or a wide LIMIT are held in memory.

Checklist

Before re-running a slow query, ask the following questions:

Does the query filter include a selective token (not a wildcard like service:* or env:*)?
Is the time range as small as it can be for the question?
Are you selecting only the columns you actually use?
If you have a large LIMIT, could a GROUP BY summary work instead?
If you have many JOIN clauses, could the query be rewritten as a single scan?
When joining across sources, is each side filtered?
Is the JOIN key high-cardinality (user/ request/trace ID)?
Is the same regex being run more than once per row?

Filtering

Filter on the data source

Always include a selective token in your data source query, such as service:, host:, env:, or any @attribute:value.

Wildcard filters such as service:* or env:* match every event, so they don’t actually narrow the data. Treat these filters as equivalent to leaving the filter blank.

Before

-- No filter; scans all logs in the time range
SELECTtimestamp,service,host,messageFROMdd.logs(columns=>ARRAY['timestamp','service','host','message'],from_timestamp=>NOW()-INTERVAL'7 days',to_timestamp=>NOW())AS(timestampTIMESTAMP,serviceVARCHAR,hostVARCHAR,messageVARCHAR)

After

-- filter => narrows the scan to matching events before any SQL runs
SELECTtimestamp,service,host,messageFROMdd.logs(filter=>'service:checkout-api env:prod',columns=>ARRAY['timestamp','service','host','message'],from_timestamp=>NOW()-INTERVAL'7 days',to_timestamp=>NOW())AS(timestampTIMESTAMP,serviceVARCHAR,hostVARCHAR,messageVARCHAR)

Choose a time range that fits the question

Scan time grows with the time range. Consider starting with a window that covers the question (often a few hours or a day) and widening only when the question demands more. For long-term trends, a pre-aggregated metric or a daily-summary query runs faster than re-scanning raw events on every run.

Goal: understand log volume from the billing service over the last month.

Before

-- Scans 31 days of raw events
SELECTtimestamp,messageFROMdd.logs(filter=>'service:billing',columns=>ARRAY['timestamp','message'],from_timestamp=>NOW()-INTERVAL'31 days',to_timestamp=>NOW())AS(timestampTIMESTAMP,messageVARCHAR)

After

-- Aggregate up front for trends; only fetch the timestamp column needed for grouping
SELECTdate_trunc('day',timestamp)ASday,count(*)ASeventsFROMdd.logs(filter=>'service:billing',columns=>ARRAY['timestamp'],from_timestamp=>NOW()-INTERVAL'7 days',to_timestamp=>NOW())AS(timestampTIMESTAMP)GROUPBY1ORDERBY1;

Column selection

Project only the columns you use

Each column in your query is fetched from storage. Trim the column list to what you use downstream. Wide attributes like the raw message or full HTTP headers can slow down the query significantly.

Before

-- columns array fetches every field, including the expensive raw message
SELECT*FROMdd.logs(filter=>'service:checkout-api',columns=>ARRAY['timestamp','service','host','message','@http.url','@http.status_code'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(timestampTIMESTAMP,serviceVARCHAR,hostVARCHAR,messageVARCHAR,http_urlVARCHAR,status_codeVARCHAR)

After

-- Only declare the columns the analysis actually uses
SELECTtimestamp,service,hostFROMdd.logs(filter=>'service:checkout-api',columns=>ARRAY['timestamp','service','host'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(timestampTIMESTAMP,serviceVARCHAR,hostVARCHAR)

Aggregations

Return summaries, not raw rows

When your goal is to understand the data (such as top-N, counts per category, or distributions), a GROUP BY returns a focused result that is faster to compute and easier to work with than scanning millions of raw rows.

Before

SELECT*FROMdd.logs(filter=>'service:orders-api',columns=>ARRAY['timestamp','service','@http.status_code','message'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(timestampTIMESTAMP,serviceVARCHAR,status_codeVARCHAR,messageVARCHAR)LIMIT5000000;

After

-- Returns ~10 rows (one per status_code), answers the actual question
SELECTstatus_code,count(*)AShitsFROMdd.logs(filter=>'service:orders-api',columns=>ARRAY['@http.status_code'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(status_codeVARCHAR)GROUPBYstatus_codeORDERBYhitsDESCLIMIT100;

Note: Aggregating in SQL with GROUP BY is more efficient than fetching raw rows and aggregating in a downstream step. The engine filters and summarizes data at the source.

Narrow the scan before aggregating on a high-cardinality column

When working with high-cardinality columns (such as emails, IPs, or request IDs):

Use SELECT DISTINCT or GROUP BY to return one entry per distinct value across workers. Without a tight filter, the result set grows without bound.
Narrow the data source filter first so the aggregation runs over fewer rows.
For wide cardinalities, pre-aggregate to one row per time bucket and key, then count distinct across buckets.

Goal: find the distinct user emails in the checkout service’s logs over the last 7 days.

Before

-- No service filter; unbounded distinct set across all logs
SELECTDISTINCTuser_emailFROMdd.logs(columns=>ARRAY['@usr.email'],from_timestamp=>NOW()-INTERVAL'7 days',to_timestamp=>NOW())AS(user_emailVARCHAR)

After

-- Pre-aggregate to one row per (day, email), then count distinct across days
SELECTcount(DISTINCTuser_email)ASdistinct_emailsFROM(SELECTdate_trunc('day',timestamp)ASday,user_emailFROMdd.logs(filter=>'service:checkout-api',columns=>ARRAY['timestamp','@usr.email'],from_timestamp=>NOW()-INTERVAL'7 days',to_timestamp=>NOW())AS(timestampTIMESTAMP,user_emailVARCHAR)GROUPBY1,2)daily;

Joins

Combine self-joins into a single scan

Joining one source to itself many times to correlate different events is one of the most common causes of slow queries. Most self-joins can be rewritten as a single scan with CASE expressions, window functions, or GROUP BY ... HAVING.

Before: 4 self-joins

WITHaAS(SELECTuser_idFROMdd.logs(filter=>'service:checkout-api',columns=>ARRAY['@usr.id'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(user_idVARCHAR)),bAS(SELECTuser_idFROMdd.logs(filter=>'service:payment-api',columns=>ARRAY['@usr.id'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(user_idVARCHAR)),cAS(SELECTuser_idFROMdd.logs(filter=>'service:shipping-api',columns=>ARRAY['@usr.id'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(user_idVARCHAR)),dAS(SELECTuser_idFROMdd.logs(filter=>'service:orders-api',columns=>ARRAY['@usr.id'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(user_idVARCHAR))SELECTuser_idFROMaJOINbUSING(user_id)JOINcUSING(user_id)JOINdUSING(user_id);

After: single scan

SELECTuser_idFROMdd.logs(filter=>'service:(checkout-api OR payment-api OR shipping-api OR orders-api)',columns=>ARRAY['service','@usr.id'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(serviceVARCHAR,user_idVARCHAR)GROUPBYuser_idHAVINGcount(DISTINCTservice)=4;

Use a high-cardinality join key

When joining, prefer a high-cardinality key like user_id, request_id, or trace_id. Coarse keys like service or status can expand a moderate input into billions of intermediate rows. For large datasets, also narrow the data source filter and time range on each side of the join.

Filter both sides of a cross-source join

When you JOIN across two data sources, apply a selective filter on each side. Examples include logs + RUM, logs + traces, and feed + logs. An unfiltered side becomes a full scan that has to be held in memory for the join. Where possible, pre-aggregate each source separately and join the summaries.

Before

-- No filter on the logs side; full scan held in memory for the join
SELECTlogs.message,rum.user_idFROM(SELECTmessage,trace_idFROMdd.logs(columns=>ARRAY['message','trace_id'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(messageVARCHAR,trace_idVARCHAR))logsJOINrumONlogs.trace_id=rum.trace_idWHERErum.view_name='cart';

After

-- Both sides filtered before the join
SELECTlogs.message,rum.user_idFROM(SELECTmessage,trace_idFROMdd.logs(filter=>'service:checkout-api status:error',columns=>ARRAY['message','trace_id'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(messageVARCHAR,trace_idVARCHAR))logsJOINrumONlogs.trace_id=rum.trace_idWHERErum.view_name='cart';

Expressions

Run regular expressions once per row

If you call REGEXP_MATCH once for each output column, the same pattern is evaluated against message repeatedly for every row. Run it once into an array, join the captures into a single delimited string, and unpack them with SPLIT_PART in a downstream SELECT.

Before

-- Same regex evaluated 3 times per row
SELECTSPLIT_PART(ARRAY_TO_STRING(REGEXP_MATCH(message,'user_id=(\S+) latency_ms=(\d+) error=(\S+)'),'|||'),'|||',1)ASuser_id,SPLIT_PART(ARRAY_TO_STRING(REGEXP_MATCH(message,'user_id=(\S+) latency_ms=(\d+) error=(\S+)'),'|||'),'|||',2)ASlatency_ms,SPLIT_PART(ARRAY_TO_STRING(REGEXP_MATCH(message,'user_id=(\S+) latency_ms=(\d+) error=(\S+)'),'|||'),'|||',3)ASerror_codeFROMdd.logs(columns=>ARRAY['message'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(messageVARCHAR);

After

-- Regex runs once per row; captures unpacked in the outer SELECT
SELECTSPLIT_PART(matched,'|||',1)ASuser_id,SPLIT_PART(matched,'|||',2)ASlatency_ms,SPLIT_PART(matched,'|||',3)ASerror_codeFROM(SELECTARRAY_TO_STRING(REGEXP_MATCH(message,'user_id=(\S+) latency_ms=(\d+) error=(\S+)'),'|||')ASmatchedFROMdd.logs(columns=>ARRAY['message'],from_timestamp=>NOW()-INTERVAL'1 day',to_timestamp=>NOW())AS(messageVARCHAR))sub;

URL: https://docs.datadoghq.com/ddsql_reference/

⇱ DDSQL Reference

DDSQL Reference

Overview

Syntax

Data types

Array types

Type literals

Example

Arrays

Array literals

Element access

Array functions

Functions

Regular expressions

Flavor

Functions

Function-level flags

Window functions

JSON functions and operators

Network address functions and operators

Functions

Operators

Table functions

Tags

HSTORE functions and operators

Writing efficient queries

Checklist

Filtering

Filter on the data source

Choose a time range that fits the question

Column selection

Project only the columns you use

Aggregations

Return summaries, not raw rows

Narrow the scan before aggregating on a high-cardinality column

Joins

Combine self-joins into a single scan

Use a high-cardinality join key

Filter both sides of a cross-source join

Expressions

Run regular expressions once per row

Further reading