VOOZH about

URL: https://docs.datadoghq.com/llm_observability/experiments/advanced_runs/

⇱ Advanced Experiment Runs


For AI agents: A markdown version of this page is available at https://docs.datadoghq.com/llm_observability/experiments/advanced_runs.md. A documentation index is available at /llms.txt.

Advanced Experiment Runs

This product is not supported for your selected Datadog site. ().

This page discusses advanced topics in running experiments, including multiple experiment runs and setting up experiments in CI/CD.

Run an experiment on a subset of your dataset

First, add tags to your dataset records. These tags can be unique identifiers (for example, name:test_use_case_1) or represent properties of the scenario (for example, difficulty:hard).

Then, use the tags argument of LLMObs.pull_dataset() to filter the dataset to the relevant records and run the experiment.

Example

prod_dataset = LLMObs.pull_dataset(dataset_name="my-dataset", tags=["env:prod", "version:1.0"])
experiment = LLMObs.experiment(
 name="example-experiment",
 dataset=prod_dataset,
 task=topic_relevance,
 evaluators=[exact_match, false_confidence]
)
experiment.run()

Multiple runs

You can run the same experiment multiple times to account for model non-determinism. You can use the Agent Observability Python SDK or Experiments API to specify how many iterations to run; subsequently, each dataset record is executed that many times using the same tasks and evaluators.

Multiple runs with the Python SDK

Requires Agent Observability Python SDK v4.1+.

Define an optional runs parameter in your experiment method, as shown in the following example:

experiment = LLMObs.experiment(
 name="example-experiment",
 dataset=dataset,
 task=topic_relevance,
 evaluators=[exact_match, false_confidence],
 config={"model": "gpt-4o-mini", "temperature": 0.3, "personality": "helpful"},
 runs=5 
)

If you do not define runs, your experiment defaults to running 1 iteration.

Multiple runs with the Experiments API

Include a run_count field in your request body, as shown in the following example:

{
 "data": {
 "type": "experiments",
 "attributes": {
 "project_id": "<project_id>",
 "dataset_id": "<dataset_id>",
 "name": "example-experiment",
 "description": "",
 "run_count": 5,
 "metadata": {
 "team": ""
 },
 "config": {"model": "gpt-4o-mini", "temperature": 0.3, "personality": "helpful"}
 }
 }
}

If you do not define run_count, your experiment defaults to running 1 iteration.

Analyzing multiple-run experiment results in Datadog

Navigate to LLM Experiments in Datadog and select an experiment with multiple runs.

At the Summary level, evaluations and metrics are computed as:

  • For score-based evaluations, the average across record averages
  • For categorical evaluations, the mode across record modes

At the Record level, evaluations and metrics are computed as:

  • For score-based evaluations, the average across runs on the input
  • For categorical evaluations, the mode across runs on the input

Searching and filtering

To filter by evaluations or metrics for multiple runs, use @multi_run_avg for score-based evaluations, or @multi_run_mode for categorical evaluations.

For example:

@multi_run_avg.@evaluations.custom.random_evaluator:>0.5 @multi_run_mode.@evaluations.custom.sentient_evaluator:positive

Queries that mix @multi_run_avg. columns and plain @trace. columns are not supported.

For example:

  • Supported: @multi_run_avg.@trace.total_tokens:>1000 (All records where the average token count is over 1000 tokens)
  • Supported: run_iteration:1 @trace.total_tokens:>1000 (All records in the first run where the total tokens in the first run were over 1000)
  • Unsupported: @multi_run_avg.@trace.total_tokens:>1000 @trace.estimated_total_cost:>0.01 (All records where the average tokens across all runs were over 1000, and the cost of the first run was over 1 cent)

Comparing experiment runs

Similarly to how you can compare single-run experiments, you can compare multiple-run experiments by selecting particular runs. For example, you can compare Run 1 of Experiment A to Run 4 of Experiment B to analyze variability between runs.

Setting up your experiment in CI/CD

You can run an experiment manually or configure it to run automatically in your CI/CD pipelines. For example, run it against your dataset on every change to compare results with your baseline and catch potential regressions.

GitHub Actions

You can use the following Python script and GitHub Actions workflow as templates to run an experiment automatically whenever code is pushed to your repository.

Note: Workflow files live in the .github/workflows directory and must use YAML syntax with the .yml extension.

from ddtrace.llmobs import LLMObs
from ddtrace.llmobs import EvaluatorResult
from typing import Dict, Any, Optional, List
LLMObs.enable(
 api_key="<YOUR_API_KEY>", # defaults to DD_API_KEY environment variable
 app_key="<YOUR_APP_KEY>", # defaults to DD_APP_KEY environment variable
 site="datadoghq.com", # defaults to DD_SITE environment variable
 project_name="<YOUR_PROJECT>" # defaults to DD_LLMOBS_PROJECT_NAME environment variable, or "default-project" if the environment variable is not set
)
dataset = LLMObs.create_dataset(
 dataset_name="capitals-of-the-world",
 project_name="capitals-project", # optional, defaults to project_name used in LLMObs.enable
 description="Questions about world capitals",
 records=[
 {
 "input_data": {
 "question": "What is the capital of China?"
 }, # required, JSON or string
 "expected_output": "Beijing", # optional, JSON or string
 "metadata": {"difficulty": "easy"}, # optional, JSON
 },
 {
 "input_data": {
 "question": "Which city serves as the capital of South Africa?"
 },
 "expected_output": "Pretoria",
 "metadata": {"difficulty": "medium"},
 },
 ],
)
def task(input_data: Dict[str, Any], config: Optional[Dict[str, Any]] = None, metadata: Optional[Dict[str, Any]] = None) -> str:
 question = input_data["question"]
 difficulty = metadata.get("difficulty", "unknown") if metadata else "unknown"
 # Your LLM or processing logic here
 return "Beijing" if "China" in question else "Unknown"
def exact_match(
 input_data: Dict[str, Any], output_data: str, expected_output: str
) -> bool:
 return output_data == expected_output
def overlap(
 input_data: Dict[str, Any], output_data: str, expected_output: str
) -> float:
 expected_output_set = set(expected_output)
 output_set = set(output_data)
 intersection = len(output_set.intersection(expected_output_set))
 union = len(output_set.union(expected_output_set))
 return intersection / union
def fake_llm_as_a_judge(input_data: Dict[str, Any], output_data: str, expected_output: str) -> EvaluatorResult:
 fake_llm_call = "excellent"
 return EvaluatorResult(
 value=fake_llm_call,
 reasoning="the model explains itself",
 assessment="pass", # or fail
 tags={"task": "judge_llm_call"},
 )
def num_exact_matches(inputs, outputs, expected_outputs, evaluators_results):
 return evaluators_results["exact_match"].count(True)
experiment = LLMObs.experiment(
 name="capital-cities-test",
 task=task,
 dataset=dataset,
 evaluators=[exact_match, overlap, fake_llm_as_a_judge],
 summary_evaluators=[num_exact_matches], # optional
 description="Testing capital cities knowledge",
 config={"model_name": "gpt-4", "version": "1.0"},
)
results = experiment.run(jobs=4, raise_errors=True)
print(f"View experiment: {experiment.url}")
name:Experiment SDK Teston:push:branches:- mainjobs:test:runs-on:ubuntu-latestenvironment:protected-main-env# The job uses secrets defined in this environmentsteps:- uses:actions/checkout@v4- name:Set up Pythonuses:actions/setup-python@v5with:python-version:'3.13.0'# Or your desired Python version- name:Install Dependenciesrun:pip install ddtrace>=4.1.0 dotenv- name:Run Scriptrun:python ./experiment_sdk_demo/main.pyenv:DD_API_KEY:${{ secrets.DD_API_KEY }}DD_APP_KEY:${{ secrets.DD_APP_KEY }}

Querying and comparing CI/CD runs via the API

When creating experiments for a CI/CD pipeline, use the name field to include a shared logical pipeline name and a per-run suffix (for example, my-pipeline-<run-id>), and use the metadata field for commit, branch, or other pipeline context (for example, "metadata": {"commit": "abc123", "branch": "feat/foo"}).

To retrieve and compare runs through the API, call GET /api/v2/llm-obs/v1/experiments with these filters:

  • filter[experiment]=<pipeline-name> returns all runs sharing the same logical pipeline name.
  • filter[metadata]={"branch":"main"} returns runs whose metadata contains those key-value pairs.
  • Combine both filters to narrow comparisons, for example: ?filter[experiment]=my-pipeline&filter[metadata]={"commit":"abc123"}.

Each matching experiment includes aggregate_data with pre-computed eval score distributions, token costs, and error rates, so you can compare runs without querying individual spans.

page[cursor] is not supported when filter[experiment] or filter[metadata] is used.

filter[experiment] matches the shared logical pipeline name stored in the experiment column (for example, my-pipeline). This is distinct from the unique per-run name field (for example, my-pipeline-<run-id>).

# Create experiment tagged with pipeline context
POST /api/v2/llm-obs/v1/experiments
{
 "data": {
 "type": "experiments",
 "attributes": {
 "project_id": "<project_id>",
 "dataset_id": "<dataset_id>",
 "name": "my-pipeline-<run-id>",
 "metadata": {"commit": "abc123", "branch": "feat/foo"}
 }
 }
}
# Compare all runs of the same pipeline on main branch
GET /api/v2/llm-obs/v1/experiments?filter[experiment]=my-pipeline&filter[metadata]={"branch":"main"}
# Retrieve a specific commit's run
GET /api/v2/llm-obs/v1/experiments?filter[experiment]=my-pipeline&filter[metadata]={"commit":"abc123"}