The Datadog MCP Server enables AI agents to access your Agent Observability data through the Model Context Protocol (MCP). The llmobs toolset provides tools for searching and analyzing traces, inspecting span details and content, and evaluating experiment results directly from AI-powered clients like Cursor, Claude Code, or OpenAI Codex.
Connect an MCP-compatible client to the Datadog MCP Server with the llmobs toolset enabled.
The MCP Server endpoint depends on your Datadog site. Use the Datadog Site selector to display the endpoint for your site. Append ?toolsets=llmobs,core to enable the Agent Observability and core toolsets.
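As a minimal sketch of how the query string is assembled, the following Python helper appends a toolsets parameter to an endpoint. The helper name build_mcp_url is illustrative, and the example endpoint is the one used in the skills setup command later on this page; substitute the endpoint shown for your own site.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_mcp_url(endpoint: str, toolsets: list[str]) -> str:
    """Return the endpoint with a ?toolsets=a,b query string (illustrative helper)."""
    parts = urlsplit(endpoint)
    # safe="," keeps the comma-separated list readable instead of encoding it as %2C.
    query = urlencode({"toolsets": ",".join(toolsets)}, safe=",")
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    # Example endpoint only; use the endpoint displayed for your Datadog site.
    print(build_mcp_url("https://mcp.datadoghq.com/v1/mcp", ["llmobs", "core"]))
```

Building the URL from a list makes it harder to drop a toolset by accident when you copy the endpoint between client configurations.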
Endpoint for your selected site ():
?toolsets=llmobs,core
Choose remote authentication when possible. Use local binary authentication if your environment blocks the remote OAuth flow.
Remote authentication uses the MCP specification’s Streamable HTTP transport.
Claude Code (command line):
claude mcp add --transport http datadog-mcp "?toolsets=llmobs,core"
Codex CLI (~/.codex/config.toml):
[mcp_servers.datadog]
url = "?toolsets=llmobs,core"
After adding the configuration, run codex mcp login datadog to complete the OAuth flow.
Gemini CLI, Kiro CLI, and other MCP-compatible clients:
{
"mcpServers": {
"datadog": {
"type": "http",
"url": "?toolsets=llmobs,core"
}
}
}
Local binary authentication uses the MCP specification’s stdio transport. Use this method if remote authentication is unavailable.
Install the Datadog MCP Server binary:
curl -sSL https://coterm.datadoghq.com/mcp-cli/install.sh | bash
The binary installs to ~/.local/bin/datadog_mcp_cli.
Complete the OAuth login flow:
datadog_mcp_cli login
Configure your AI client. For Claude Code, add the following to ~/.claude.json, replacing <USERNAME> in the command path:
{
"mcpServers": {
"datadog": {
"type": "stdio",
"command": "/Users/<USERNAME>/.local/bin/datadog_mcp_cli",
"args": [],
"env": {}
}
}
}
Alternatively, add the server with the Claude Code CLI:
claude mcp add datadog --scope user -- ~/.local/bin/datadog_mcp_cli
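If you prefer not to hand-edit the <USERNAME> placeholder, the stdio entry shown above can be generated from your home directory. This is a sketch under the assumption that the binary sits at the default install path; the function name stdio_server_config is illustrative, and the script only prints JSON rather than modifying ~/.claude.json.

```python
import json
from pathlib import Path, PurePath


def stdio_server_config(home: PurePath) -> dict:
    """Build the "mcpServers" entry for the locally installed binary."""
    # Default install location stated above: ~/.local/bin/datadog_mcp_cli
    binary = home / ".local" / "bin" / "datadog_mcp_cli"
    return {
        "mcpServers": {
            "datadog": {
                "type": "stdio",
                "command": str(binary),
                "args": [],
                "env": {},
            }
        }
    }


if __name__ == "__main__":
    # Prints JSON to paste into ~/.claude.json; it does not touch any file.
    print(json.dumps(stdio_server_config(Path.home()), indent=2))
```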
The MCP Server uses OAuth 2.0 by default. If OAuth is unavailable, send a Datadog API key and application key as the DD_API_KEY and DD_APPLICATION_KEY HTTP headers:
{
"mcpServers": {
"datadog": {
"type": "http",
"url": "?toolsets=llmobs,core",
"headers": {
"DD_API_KEY": "<YOUR_API_KEY>",
"DD_APPLICATION_KEY": "<YOUR_APPLICATION_KEY>"
}
}
}
}
For security, scope the API key and application key to a service account with only the required permissions.
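One way to keep keys out of files you commit is to read them from the environment when generating the configuration. The sketch below assumes the keys are exported as environment variables named after the headers, which is a choice made for this example rather than a requirement; MCP_URL and the function name http_server_config are likewise hypothetical. When either key is missing, the entry is emitted without headers so the client falls back to OAuth.

```python
import json
import os


def http_server_config(url: str, env: dict) -> dict:
    """Build an HTTP "mcpServers" entry, adding key headers only when both keys exist."""
    server = {"type": "http", "url": url}
    api_key = env.get("DD_API_KEY")
    app_key = env.get("DD_APPLICATION_KEY")
    if api_key and app_key:
        server["headers"] = {
            "DD_API_KEY": api_key,
            "DD_APPLICATION_KEY": app_key,
        }
    return {"mcpServers": {"datadog": server}}


if __name__ == "__main__":
    # MCP_URL is a hypothetical variable holding the endpoint for your site.
    url = os.environ.get("MCP_URL", "<YOUR_ENDPOINT>?toolsets=llmobs,core")
    # The printed JSON contains the key values; avoid sharing or logging it.
    print(json.dumps(http_server_config(url, dict(os.environ)), indent=2))
```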
Agent skills are prebuilt instruction sets for AI coding agents that automate common Agent Observability workflows. The dd-llmo skill set is available in the Datadog agent-skills repository. It provides six skills for classifying sessions, diagnosing failures, analyzing experiments, generating experiment code with the ddtrace.llmobs SDK, and bootstrapping evaluators against your live production data.
Install the dd-llmo skills with the following command:
npx skills add datadog-labs/agent-skills --skill dd-llmo --full-depth -y
The skills require the llmobs MCP toolset to be connected. If you have not already connected it, run:
claude mcp add --scope user --transport http "datadog-llmo-mcp" \
'https://mcp.datadoghq.com/v1/mcp?toolsets=llmobs'
Restart Claude Code after running both commands for the skills to appear.
| Skill | Invoke with | What it does |
|---|---|---|
| Session classify | /llm-obs-session-classify | Classify whether user intent was satisfied in a session, trace, or batch |
| Trace RCA | /llm-obs-trace-rca | Run root cause analysis on failing production traces |
| Experiment analyzer | /llm-obs-experiment-analyzer | Analyze and compare LLM experiment results |
| Experiment Python codegen | /llm-obs-experiment-py-bootstrap | Generate Python experiment code using the ddtrace.llmobs SDK |
| Eval bootstrap | /llm-obs-eval-bootstrap | Generate evaluator code or publish online LLM-judge evaluators |
| Eval pipeline | /llm-obs-eval-pipeline | End-to-end pipeline: classify → RCA → bootstrap evaluators |
/llm-obs-session-classify classifies whether user intent was satisfied in a given interaction. It draws from up to three signal sources: Agent Observability traces, RUM behavioral data, and Audit Trail events. The skill returns a yes / partial / no verdict with supporting evidence. Confidence improves with each additional signal source.
/llm-obs-session-classify session_id=<SESSION_ID>
/llm-obs-session-classify trace_id=<TRACE_ID>
/llm-obs-session-classify ml_app=my-chatbot --timeframe now-7d
/llm-obs-trace-rca diagnoses why an LLM application is producing poor results. It selects an analysis mode based on the strongest available signal (LLM-judge eval verdicts, runtime errors, or structural anomalies) and compiles a structured RCA report. The report includes a failure taxonomy and concrete BEFORE / AFTER fix proposals grounded in trace evidence.
When Claude Code has access to your codebase, the skill can search for the relevant source files and propose diffs inline.
/llm-obs-trace-rca ml_app=my-chatbot
/llm-obs-trace-rca ml_app=my-chatbot eval_name=faithfulness --timeframe now-24h
/llm-obs-eval-bootstrap analyzes production traces and proposes a suite of evaluators targeting the observed failure modes. It outputs one of three artifacts: Python BaseEvaluator / LLMJudge classes for offline experiments, a framework-agnostic JSON spec, or online LLM-judge evaluators published directly to Datadog.
/llm-obs-eval-bootstrap ml_app=my-chatbot
/llm-obs-eval-bootstrap ml_app=my-chatbot --publish
/llm-obs-eval-bootstrap ml_app=my-chatbot --data-only
/llm-obs-experiment-analyzer retrieves experiment results and surfaces what changed between a candidate and a baseline: which metrics improved, which regressed, and where the candidate underperformed.
/llm-obs-experiment-analyzer experiment_id=<EXPERIMENT_ID>
/llm-obs-experiment-analyzer experiment_id=<CANDIDATE_ID> baseline_id=<BASELINE_ID>
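To make the candidate-versus-baseline comparison concrete, here is a small sketch of the kind of bucketing described above. It is not the skill's implementation: the function name compare_metrics, the flat name-to-score dictionaries, and the assumption that a higher value is better are all illustrative. Metrics such as latency or cost, where lower is better, would need the sign flipped.

```python
def compare_metrics(baseline: dict, candidate: dict, tolerance: float = 0.0) -> dict:
    """Bucket shared metrics into improved, regressed, and unchanged (higher is better)."""
    result = {"improved": {}, "regressed": {}, "unchanged": {}}
    # Only metrics present in both experiments can be compared.
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        if delta > tolerance:
            bucket = "improved"
        elif delta < -tolerance:
            bucket = "regressed"
        else:
            bucket = "unchanged"
        result[bucket][name] = delta
    return result


if __name__ == "__main__":
    # Made-up scores for illustration only.
    baseline = {"accuracy": 0.5, "relevance": 0.75}
    candidate = {"accuracy": 0.75, "relevance": 0.5}
    print(compare_metrics(baseline, candidate))
```

The tolerance parameter lets you ignore differences too small to matter before deciding whether a change regressed.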
/llm-obs-experiment-py-bootstrap generates a self-contained Python experiment client that uses the ddtrace.llmobs SDK. The output is either a runnable .py script or a Jupyter .ipynb notebook matching the canonical reference notebook style. The dataset can come from a local JSON or CSV file, an existing Datadog dataset fetched by name, or a built-in inline sample. Every generated experiment is tagged with generated_by=claude-code so you can identify and filter Claude-generated experiments in the LLM Experiments list.
/llm-obs-experiment-py-bootstrap
/llm-obs-experiment-py-bootstrap --dataset ./data/qa.json --format ipynb
/llm-obs-experiment-py-bootstrap --dataset-name <DATASET_NAME> --project-name <PROJECT_NAME>
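If you want to check a local dataset file before passing it to --dataset, a loader along these lines can confirm that it parses into records. This is a hedged sketch, not the code the skill generates: it assumes a JSON file holds an array of objects and a CSV file has a header row, and the function name load_records is illustrative.

```python
import csv
import json
from pathlib import Path


def load_records(path: str) -> list[dict]:
    """Load dataset rows from a .json (array of objects) or .csv (header row) file."""
    file = Path(path)
    suffix = file.suffix.lower()
    if suffix == ".json":
        data = json.loads(file.read_text(encoding="utf-8"))
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of records")
        return data
    if suffix == ".csv":
        # newline="" lets the csv module handle quoted line breaks correctly.
        with file.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    raise ValueError(f"unsupported dataset format: {suffix}")
```

Note that CSV values are always loaded as strings, so numeric columns need converting if your evaluators expect numbers.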
/llm-obs-eval-pipeline chains session classification, trace RCA, and evaluator bootstrap into a single supervised workflow with user checkpoints between phases. It is the recommended starting point when you have no existing evaluators for an application.
/llm-obs-eval-pipeline my-chatbot
/llm-obs-eval-pipeline my-chatbot --timeframe now-30d --publish
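The control flow of a supervised workflow with checkpoints can be pictured with a short sketch. Nothing here reflects how the skill is actually implemented: run_with_checkpoints, the phase callables, and the confirm callback are hypothetical stand-ins that only show phases running in order, with a user decision between them.

```python
from typing import Callable, Iterable


def run_with_checkpoints(
    phases: Iterable[tuple[str, Callable[[dict], dict]]],
    confirm: Callable[[str, dict], bool],
) -> dict:
    """Run phases in order, asking `confirm` before each phase after the first."""
    state: dict = {}
    for index, (name, phase) in enumerate(phases):
        # Checkpoint: the user sees the accumulated state and may stop here.
        if index > 0 and not confirm(name, state):
            state["stopped_before"] = name
            break
        # Each phase receives the results so far and contributes its own.
        state = {**state, **phase(state)}
    return state


if __name__ == "__main__":
    # Placeholder phases standing in for classify, RCA, and evaluator bootstrap.
    phases = [
        ("classify", lambda s: {"classified": True}),
        ("rca", lambda s: {"rca_done": True}),
        ("bootstrap", lambda s: {"evaluators": True}),
    ]
    print(run_with_checkpoints(phases, lambda name, state: name != "bootstrap"))
```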
For a complete guide to these skills and a recommended end-to-end workflow, see Analyze LLM Applications with Claude Code Skills.
The Agent Observability MCP tools enable AI-assisted workflows for:
The llmobs toolset includes the following tools:
- search_llmobs_spans
- get_llmobs_trace
- get_llmobs_span_details
- get_llmobs_span_content
- find_llmobs_error_spans
- expand_llmobs_spans: […] get_llmobs_trace returns collapsed nodes.
- get_llmobs_agent_loop
- get_llmobs_experiment_summary
- list_llmobs_experiment_events
- get_llmobs_experiment_event
- get_llmobs_experiment_metric_values
- get_llmobs_experiment_dimension_values
- list_llmobs_evals
- list_llmobs_evals_by_ml_app
- get_llmobs_evaluator
- create_or_update_llmobs_evaluator
- delete_llmobs_evaluator
- list_llmobs_pattern_configs: […] id, name, evp_query, sampling settings, and timestamps. Start here to find a config_id.
- get_llmobs_pattern_config
- get_llmobs_pattern_run_status
- list_llmobs_pattern_runs: […] id, status, timestamps, and the config_snapshot used.
- get_llmobs_patterns: […] name, description, and point_count. Omit run_id to read the most recent completed run.
- get_llmobs_patterns_with_points: […] include_metrics=true to also include per-span duration, cost, token counts, and evaluations.
- get_llmobs_pattern_points: […] span_id, session_id, and a span input preview. Pass next_page_token back as page_token to continue paging.

- Use search_llmobs_spans to find traces by ML app, status, span kind, or custom tags.
- Use get_llmobs_trace to see the full span hierarchy tree.
- Use get_llmobs_span_details to get metadata, timing, and evaluations for specific spans.
- Use get_llmobs_span_content to retrieve the actual I/O, messages, or documents.
- Use find_llmobs_error_spans to locate all errors in a trace with propagation context.
- Use expand_llmobs_spans to load children of collapsed spans for deeper exploration.
- Use get_llmobs_agent_loop to see the step-by-step execution flow of an agent span.
- Use get_llmobs_experiment_summary to get overall statistics and discover available metrics and dimensions.
- Use list_llmobs_experiment_events to find events of interest, filtering by dimension or sorting by metric.
- Use get_llmobs_experiment_event to view full details for a specific event.
- Use get_llmobs_experiment_metric_values to get percentile distributions, true/false rates, or compare across dimension segments.
- Use get_llmobs_experiment_dimension_values to find valid filter and segment values.
- Use list_llmobs_pattern_configs to find available Patterns configurations and their config_id values.
- Use get_llmobs_pattern_run_status to verify the most recent run is complete.
- Use get_llmobs_patterns to get the full topic hierarchy with names, descriptions, and coherence scores.
- Use get_llmobs_patterns_with_points to get topics with span IDs inlined, or get_llmobs_pattern_points to page through the spans of a specific topic.
- Use get_llmobs_span_details or get_llmobs_span_content with the span_id values from the previous step to inspect the actual inputs, outputs, and metadata of individual spans within a topic.
- Use list_llmobs_pattern_runs to see historical runs and pass a specific run_id to compare topic distributions over time.

After connecting, try prompts like:
- […] customer-support-bot app over the past week. Summarize the most common failure patterns, how often they occur, and recommend which ones to fix first.
- […] trace-123. Walk me through exactly what happened: what the user asked, what the agent did at each step, and where things went wrong. Suggest a code fix.
- […] exp-456 and generate a markdown table of the worst-performing dimensions broken down by evaluation scores. Include any other relevant columns that help me understand where and why performance is degrading.
- […] exp-123 (baseline) against experiment exp-456. Summarize what improved, what regressed, and by how much. Give me a recommendation on whether the changes are worth shipping.
- […] exp-456 and identify the top 5 lowest-scoring events. For each, show the input, output, and which evaluations failed.

The core toolset included in the setup URL gives your AI agent access to additional Datadog tools that pair naturally with Agent Observability analysis.
The core toolset includes create_datadog_notebook and edit_datadog_notebook, which let your AI agent create Datadog Notebooks directly from analysis results. You can export findings from agent chats into a collaborative, shareable notebook that lives in Datadog alongside your traces and experiments.
Try prompts like:
- […] exp-456, identify the worst-performing dimensions, and export a summary report to a Datadog Notebook with a breakdown by evaluation scores.
- […] customer-support-bot over the past week and create a Datadog Notebook with the findings, including common failure patterns and recommended fixes.

For custom visualizations that go beyond standard Datadog widgets, like comparison charts or quadrant plots, Notebooks also render Mermaid diagrams natively. Try prompts like:
- […] exp-456, compare the accuracy scores across each prompt version, and export the results to a Datadog Notebook that includes a Mermaid bar chart of the average score for each version.
- […] exp-456 and export a Datadog Notebook that plots each prompt version on a Mermaid quadrant chart with relevance on one axis and accuracy on the other. Identify which versions are underperforming on both dimensions.