This page provides instructions on setting up Datadog’s GPU Monitoring on your infrastructure. Follow the configuration instructions that match your operating environment below.
To begin using Datadog’s GPU Monitoring, your environment must meet the following criteria:
If using Kubernetes, the following additional requirements must be met:
If using Google Kubernetes Engine, Container-Optimized OS nodes are not supported due to limitations in the kernel and permission system.
Configuring GPU Monitoring does not require DCGM. You need to opt in to the collection of GPU Monitoring metrics at the Agent. Setup depends on your environment: non-Kubernetes/uniform, Kubernetes cluster, or mixed cluster.
After you’ve enabled the collection of GPU Monitoring metrics, you can opt in to several integrations for more advanced insights:
The following instructions are the basic steps to set up GPU Monitoring in the following environments:
Make sure that the latest version of the Datadog Agent is installed and deployed on every GPU host you wish to monitor.
Modify your DatadogAgent resource with the following parameters:
- gpu.enabled: true
- gpu.patchCgroupPermissions: true. Only for GKE. Enables behavior in system-probe that helps the Agent access GPU devices.
- gpu.requiredRuntimeClassName: <runtime-name>. Example values: nvidia, nvidia-cdi, nvidia-legacy. The default value is nvidia, as that is the default runtime defined by the NVIDIA GPU Operator. In EKS and Oracle Cloud, set this value to the empty string, because the default runtime class already allows GPU device access.
Example datadog-agent.yaml, running on GKE:
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  features:
    gpu:
      enabled: true
      patchCgroupPermissions: true # Only for GKE
      requiredRuntimeClassName: "" # Only for AWS EKS or Oracle Cloud
Apply your changes and restart the Datadog Agent.
Make sure that the latest version of the Datadog Agent is installed and deployed on every GPU host you wish to monitor.
Modify your datadog-values.yaml configuration file with the following parameters:
- gpuMonitoring.enabled: true
- gpuMonitoring.configureCgroupPerms: true. Only for GKE. Enables behavior in system-probe that helps the Agent access GPU devices.
- gpuMonitoring.runtimeClassName: <runtime-name>. Example values: nvidia, nvidia-cdi, nvidia-legacy. The default value is nvidia, as that is the default runtime defined by the NVIDIA GPU Operator. In EKS and Oracle Cloud, set this value to the empty string, because the default runtime class already allows GPU device access.
Example datadog-values.yaml, running on GKE:
datadog:
  gpuMonitoring:
    enabled: true
    configureCgroupPerms: true # Only for GKE
    runtimeClassName: "" # Only for Oracle Cloud and AWS EKS
Upgrade your Helm chart and restart the Datadog Agent.
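The per-environment rules for the Helm parameters can be captured in a small helper. The following is a minimal sketch, not part of Datadog's tooling: the function name `gpu_values_yaml` and the provider labels (`gke`, `eks`, `oracle`) are hypothetical and exist only for this illustration. It prints a values fragment that sets `configureCgroupPerms` only for GKE and an empty `runtimeClassName` only for EKS and Oracle Cloud, leaving the chart default in every other case.

```shell
#!/bin/sh
# gpu_values_yaml PROVIDER
# Prints a datadog-values.yaml fragment that enables GPU Monitoring.
# PROVIDER is a hypothetical label used only by this sketch:
# gke, eks, oracle, or anything else for the defaults.
gpu_values_yaml() {
    provider="$1"
    echo "datadog:"
    echo "  gpuMonitoring:"
    echo "    enabled: true"
    case "$provider" in
        gke)
            # cgroup permission patching is only needed on GKE
            echo "    configureCgroupPerms: true"
            ;;
        eks|oracle)
            # the default runtime class already allows GPU device access
            echo '    runtimeClassName: ""'
            ;;
    esac
}
```

For example, `gpu_values_yaml eks` prints the fragment with an empty `runtimeClassName`. Review the output before merging it into your own values file.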
To enable GPU Monitoring in Docker, use the following configuration when starting the container Agent:
docker run \
--pid host \
--gpus all \
-e DD_API_KEY="<DATADOG_API_KEY>" \
-e DD_GPU_ENABLED=true \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v /proc/:/host/proc/:ro \
-v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
registry.datadoghq.com/agent:latest
Replace <DATADOG_API_KEY> with your Datadog API key.
If using docker-compose, make the following additions to the Datadog Agent service.
version: '3'
services:
  datadog:
    image: "registry.datadoghq.com/agent:latest"
    environment:
      - DD_GPU_ENABLED=true
      - DD_API_KEY=<DATADOG_API_KEY>
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
GPU Monitoring requires configuration in both datadog.yaml and system-probe.yaml. Configuring only datadog.yaml does not load the eBPF module and results in no metrics being collected.
Add the following to /etc/datadog-agent/datadog.yaml:
gpu:
  enabled: true
collect_gpu_tags: true
enable_nvml_detection: true
If /etc/datadog-agent/system-probe.yaml does not exist, create it from the example:
sudo -u dd-agent install -m 0640 /etc/datadog-agent/system-probe.yaml.example /etc/datadog-agent/system-probe.yaml
Add the following to /etc/datadog-agent/system-probe.yaml:
gpu_monitoring:
  enabled: true
Restart both the Agent and system-probe:
sudo systemctl restart datadog-agent
sudo systemctl restart datadog-agent-sysprobe
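Because no metrics are collected unless both files enable GPU Monitoring, a quick consistency check can catch a half-finished setup. The following is a minimal sketch, not a Datadog tool: the function names `has_enabled_block` and `check_gpu_config` are hypothetical, and the check assumes block-style YAML with space indentation, as in the snippets above. It only looks for an `enabled: true` line under the top-level `gpu:` and `gpu_monitoring:` keys and does not fully parse YAML.

```shell
#!/bin/sh
# has_enabled_block FILE KEY
# Succeeds if FILE has a top-level "KEY:" block that contains an
# indented "enabled: true" line. This is a text check, not a YAML parser.
has_enabled_block() {
    awk -v key="$2" '
        # any non-comment line at column 0 starts or ends the block
        /^[^ #]/ { in_block = ($0 ~ "^" key ":[ \t]*(#.*)?$") }
        in_block && /^[ \t]+enabled:[ \t]*true[ \t]*(#.*)?$/ { found = 1 }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# check_gpu_config DATADOG_YAML SYSTEM_PROBE_YAML
# Reports which file, if any, is missing the GPU Monitoring setting.
check_gpu_config() {
    has_enabled_block "$1" gpu || {
        echo "gpu.enabled is not set to true in $1"
        return 1
    }
    has_enabled_block "$2" gpu_monitoring || {
        echo "gpu_monitoring.enabled is not set to true in $2"
        return 1
    }
    echo "GPU Monitoring is enabled in both files"
}
```

A typical call is `check_gpu_config /etc/datadog-agent/datadog.yaml /etc/datadog-agent/system-probe.yaml`, run with enough permissions to read both files.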
In a mixed Kubernetes cluster, some nodes have GPU devices while others do not. Two separate DaemonSets are required: one that uses the runtime class on GPU nodes, and another for non-GPU nodes. This split is due to runtime class requirements for the NVIDIA device plugin for Kubernetes.
The recommended method is the Datadog Operator, version 1.20 or greater. This version provides features that make setup easier. For compatibility, instructions are also provided for Helm installations and older Datadog Operator versions.
To set up GPU Monitoring on a mixed cluster, use the Operator’s Agent Profiles feature. This selectively enables GPU Monitoring only on nodes with GPUs.
Configure the Datadog Operator to enable the Datadog Agent Profile feature in the DatadogAgentInternal mode.
If the Datadog Operator was deployed with Helm directly without a values file, the configuration can be toggled from the command line:
helm upgrade --set datadogAgentProfile.enabled=true --set datadogAgentInternal.enabled=true --set datadogCRDs.crds.datadogAgentProfiles=true --set datadogCRDs.crds.datadogAgentInternal=true <release-name> datadog/datadog-operator
If the Datadog Operator was deployed with a values file, the configuration can be toggled by adding the following settings to the values file:
datadogAgentProfile:
  enabled: true
datadogAgentInternal:
  enabled: true
datadogCRDs:
  crds:
    datadogAgentProfiles: true
    datadogAgentInternal: true
Then re-deploy the Datadog Operator with:
helm upgrade --install <release-name> datadog/datadog-operator -f datadog-operator.yaml
Add the agent.datadoghq.com/update-metadata annotation to your DatadogAgent resource.
The additions to the datadog-agent.yaml file should look like this:
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  annotations:
    agent.datadoghq.com/update-metadata: "true" # Required for the Datadog Agent Internal mode to work.
Apply your changes to the DatadogAgent resource. These changes are safe to apply to all Datadog Agents, regardless of whether they run on GPU nodes.
Create a Datadog Agent Profile that targets GPU nodes and enables GPU Monitoring on these targeted nodes.
In the following example, the profileNodeAffinity selector is targeting nodes with the label nvidia.com/gpu.present=true, because this label is commonly present on nodes with the NVIDIA GPU Operator. You may use another label if you wish.
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgentProfile
metadata:
  name: gpu-nodes
spec:
  profileAffinity:
    profileNodeAffinity:
      - key: nvidia.com/gpu.present
        operator: In
        values:
          - "true"
  config:
    features:
      gpu:
        enabled: true
        patchCgroupPermissions: true # Only for GKE
After you apply this new Datadog Agent Profile, the Datadog Operator creates a new DaemonSet, gpu-nodes-agent.
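Before relying on the profile, it can help to confirm which nodes the label selector actually matches. The following is a minimal sketch using standard `kubectl` label selectors. The function names and the `GPU_NODE_LABEL` variable are hypothetical, and the label is the one used in the example above, so replace it if your GPU nodes carry a different one.

```shell
#!/bin/sh
# Label used in the example profile above; swap in your own if needed.
GPU_NODE_LABEL="nvidia.com/gpu.present"

# Nodes the GPU profile would target (label equals "true").
list_gpu_nodes() {
    kubectl get nodes -l "${GPU_NODE_LABEL}=true" -o name
}

# Nodes left to the default Agent DaemonSet
# (label missing or set to something other than "true").
list_non_gpu_nodes() {
    kubectl get nodes -l "${GPU_NODE_LABEL}!=true" -o name
}
```

If `list_gpu_nodes` returns nothing on a cluster that has GPU nodes, the label is likely absent and the profile would not select any node.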
To set up GPU Monitoring on a mixed cluster, use the Operator’s Agent Profiles feature. This selectively enables GPU Monitoring only on nodes with GPUs.
Make sure that the latest version of the Datadog Agent is installed and deployed on every GPU host you wish to monitor.
Modify your DatadogAgent resource with the following changes:
spec:
  features:
    oomKill:
      # Only enable this feature if there is nothing else that requires the system-probe container in all Agent pods
      # Examples of system-probe features are npm, cws, usm
      enabled: true
  override:
    nodeAgent:
      volumes:
        - name: nvidia-devices
          hostPath:
            path: /dev/null
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
      containers:
        agent:
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
          volumeMounts:
            - name: nvidia-devices
              mountPath: /dev/nvidia-visible-devices
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
        system-probe:
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
          volumeMounts:
            - name: nvidia-devices
              mountPath: /dev/nvidia-visible-devices
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
Apply your changes to the DatadogAgent resource. These changes are safe to apply to all Datadog Agents, regardless of whether they run on GPU nodes.
Create a Datadog Agent Profile that targets GPU nodes and enables GPU Monitoring on these targeted nodes.
In the following example, the profileNodeAffinity selector is targeting nodes with the label nvidia.com/gpu.present=true, because this label is commonly present on nodes with the NVIDIA GPU Operator. You may use another label if you wish.
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgentProfile
metadata:
  name: gpu-nodes
spec:
  profileAffinity:
    profileNodeAffinity:
      - key: nvidia.com/gpu.present
        operator: In
        values:
          - "true"
  config:
    override:
      nodeAgent:
        runtimeClassName: nvidia # Only if not in AWS EKS or Oracle Cloud
        containers:
          # Change system-probe environment variables only for advanced
          # eBPF metrics, or if running in GKE
          system-probe:
            env:
              - name: DD_GPU_MONITORING_ENABLED
                value: "true"
              # cgroup permission patching only for GKE
              - name: DD_GPU_MONITORING_CONFIGURE_CGROUP_PERMS
                value: "true"
          agent:
            env:
              - name: DD_GPU_ENABLED
                value: "true"
              # Only for advanced eBPF metrics
              - name: DD_GPU_MONITORING_ENABLED
                value: "true"
After you apply this new Datadog Agent Profile, the Datadog Operator creates a new DaemonSet, datadog-agent-with-profile-<namespace>-gpu-nodes.
To set up GPU Monitoring on a mixed cluster with Helm, create two Helm deployments. One deployment targets GPU nodes, and the other targets non-GPU nodes.
Make sure that the latest version of the Datadog Agent is installed and deployed on every GPU host you wish to monitor.
Modify your datadog-values.yaml configuration file to target non-GPU nodes.
The following example targets nodes that do not have the label nvidia.com/gpu.present=true, because this label is commonly present on nodes with the NVIDIA GPU Operator. If you wish, you may use another label to exclude GPU nodes.
agents:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: NotIn
                values:
                  - "true"
Create a values file, datadog-gpu-values.yaml, with the following parameters:
- gpuMonitoring.enabled: true
- gpuMonitoring.configureCgroupPerms: true. Only for GKE. Enables behavior in system-probe that helps the Agent access GPU devices.
- gpuMonitoring.runtimeClassName: <runtime-name>. Example values: nvidia, nvidia-cdi, nvidia-legacy. The default value is nvidia, as that is the default runtime defined by the NVIDIA GPU Operator. In EKS and Oracle Cloud, set this value to the empty string, because the default runtime class already allows GPU device access.
Example datadog-gpu-values.yaml:
# GPU-specific datadog-gpu-values.yaml (for GPU nodes)
datadog:
  kubeStateMetricsEnabled: false # Disabled, as you're joining an existing Cluster Agent
  gpuMonitoring:
    enabled: true
    configureCgroupPerms: true # Only for GKE
    runtimeClassName: "" # Only for Oracle Cloud and AWS EKS
agents:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                  - "true"
# Join with existing Cluster Agent
existingClusterAgent:
  join: true
# Disabled datadogMetrics deployment, since it should have been already deployed with the other chart release.
datadog-crds:
  crds:
    datadogMetrics: false
Deploy the Helm chart with your modified datadog-values.yaml.
helm install -f datadog-values.yaml datadog datadog/datadog
Deploy the Helm chart again with GPU-specific overrides.
helm install -f datadog-values.yaml -f datadog-gpu-values.yaml datadog-gpu datadog/datadog