VOOZH about

URL: https://thenewstack.io/how-autonomous-agents-are-changing-infrastructure-management/

⇱ How Autonomous Agents Are Changing Infrastructure Management - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2025-11-25 12:00:50
How Autonomous Agents Are Changing Infrastructure Management
sponsor-duplocloud,sponsored-post-contributed,
AI / AI Agents / Operations

How Autonomous Agents Are Changing Infrastructure Management

Autonomous agents called AI DevOps engineers integrate with production tools to analyze infrastructure, propose actions and secure your systems.
Nov 25th, 2025 12:00pm by Fahmid Kabir
👁 Featued image for: How Autonomous Agents Are Changing Infrastructure Management
Image from Bigzumi on Shutterstock
DuploCloud sponsored this post.

Infrastructure failures have never been more expensive. Recent research estimates the average cost of downtime at $12,900 per minute. This climbs to nearly $24,000 per minute for large enterprises. With this level of pressure, infrastructure and platform teams face a constant trade-off.

You can either firefight urgent issues or push innovation forward.

Now a new model is emerging called AI DevOps engineers.

These are autonomous agents that analyze infrastructure and coordinate with operational tools. They also propose actions in near-real time. Unlike earlier generations of automation or coding assistants, these systems run inside enterprise cloud environments, integrate with production-grade tooling and operate under existing governance frameworks.

The Architecture of Autonomous Infrastructure Agents

These systems differ from developer-focused AI assistants. Instead of generating code in IDEs, AI DevOps engineers integrate directly with:

A core requirement across implementations is data ownership. Many organizations require that infrastructure-related data stay within their cloud accounts. These include businesses in healthcare, government and financial services.

Most solutions, therefore, rely on cloud native large language models (LLM) services like Amazon Bedrock rather than routing data externally.

Common components in modern agent architectures include:

  • Local LLM Integration

Models run inside the organization’s cloud account using cloud native AI services. This supports compliance requirements (HIPAA, SOC 2, PCI-DSS) by keeping logs, metrics and code analysis on trusted infrastructure.

  • Agent Orchestration Layer

This layer coordinates multiple specialized agents. It handles task sequencing, context management, authentication and tool execution across systems like:

  • Kubernetes API/kubectl
  • Jenkins/GitHub Actions
  • Grafana/CloudWatch/OpenTelemetry
  • Container registries
  • Cloud provider CLIs
  • Terraform and Infrastructure as Code tools

The orchestration layer abstracts tool integration complexities and manages errors. It also maintains operational state across agents.

  • Human-in-the-Loop Controls

All actions affecting infrastructure require approval. Approvals are routed through existing platforms like ServiceNow, Jira, Slack or custom ticketing interfaces. This helps you make sure agents can’t bypass organizational governance.

Six Emerging Specialized Roles for AI DevOps Engineers

While implementations vary, organizations are converging on six core agent personas:

1. Kubernetes Agent (Platform Engineering)

Handles pod life cycle analysis, deployment checks, log correlation and environment drift detection.

Example tasks: Diagnosing 5xx errors by correlating metrics, deployment diffs and pod status.

2. Observability Agent (SRE)

Integrates with metrics, logs and event systems to identify root causes across distributed systems.

Example tasks: Linking a memory spike in one service to downstream latency in dependent services.

3. CI/CD Agent (Release Engineering)

Analyzes pipeline failures, interprets logs and proposes fixes.

Example tasks: Identifying dependency conflicts or flaky test patterns automatically.

4. Architecture Agent (Documentation and Infra Mapping)

Builds real-time infrastructure diagrams using cloud APIs and graph databases.

Example tasks: “Show all services dependent on this RDS [Amazon Relational Database Service] instance,” rendered as up-to-date diagrams.

5. Cost Optimization Agent (FinOps)

Surfaces cost anomalies, unused resources or overprovisioned infrastructure using billing data and resource tags.

6. Compliance and Security Agent (Policy Enforcement)

Reviews infrastructure code, checks for misconfigurations and validates policies using LLM reasoning. It does all this while keeping sensitive code within the organization’s cloud.

Why Orchestrating Multiple Agents Is a Pain

Building a single agent is straightforward. Coordinating multiple agents across different tools and contexts is way harder. Modern orchestration layers address the following challenges:

  • Tool Integration Complexity — Each agent interacts with numerous APIs, CLIs and services. Each one has its own authentication model, rate limits and error patterns.
  • Context Management Across Agents — Incidents. They can cause performance issues, failed deployment and/or cost spikes. A unified orchestrator decides when to involve the CI/CD agent, observability agent or FinOps agent and transfers context between them.
  • Model Selection and LLM Coordination — Different tasks require different LLM capabilities. Systems often switch between reasoning-optimized models, lightweight models for pattern detection and domain-specific instruction-tuned models.
  • Operational State Management — Unlike stateless scripts, agents maintain memory of incidents, prior actions and approval patterns.

What Real Teams Do With These Agents Today

Teams piloting AI DevOps engineers report several consistent behaviors:

1. Ticket-Based Interaction as the Primary Interface

Incidents typically follow flows like:

  1. Ticket created (“502 errors on production API”)
  2. Appropriate agent assigned
  3. Automated log/metric correlation
  4. Proposed fix generated
  5. Human approval
  6. Execution and audit logging

2. Fast Analysis Times

Most agents return initial findings in five to 30 seconds, significantly reducing the time engineers spend switching between dashboards and tools.

3. Integration Through Developer Workflows

Common entry points include:

  • Slack commands
  • Ticket submission
  • VS Code extensions
  • Web-based dashboards with full audit trails

4. Approval Hierarchies That Match Organizational Risk

Read-only queries run autonomously; production changes require explicit approval.

Security and Compliance Considerations

Any production-grade use of autonomous agents must support:

  • RBAC (role-based access control) inheritance from existing IAM (identity and access management) systems.
  • Just-in-Time (JIT) permissions for elevated access.
  • Immutable audit trails for every inference and action.
  • Data-boundary guarantees, ensuring no external model training.
  • Integration with SIEM (security information and event management) platforms like Splunk, Datadog or CloudWatch.

These controls make sure AI agents act as trusted extensions of DevOps teams. And you never have to worry, they’re acting as independent actors.

Limitations and Industry Challenges

Across implementations, several constraints remain:

  • Full multicloud support is still early.
  • Many systems lack first-class distributed tracing integration.
  • Multiregion agent coordination is not yet automated.
  • Most interfaces remain English-only.
  • Support for self-hosted or open source models is emerging.

These reflect the broader maturity curve of AI in production operations.

How Organizations Are Adopting This Technology

Successful early adopters tend to share:

  • Strong baseline DevOps and governance practices.
  • Gradual rollout strategies beginning with read-only tasks.
  • Clear approval hierarchies for change-requiring actions.
  • Deep integration across existing toolchains.

The next 12 to 18 months will likely focus on improved orchestration layers, richer context-sharing across agents and deeper integration with developer workflows.

DuploCloud enables teams to deploy AI DevOps engineers within their own cloud environments with built-in governance, ticketing workflows and compliance controls. Learn more or request a demo at duplocloud.com.

DuploCloud offers a DevSecOps software platform for teams that don’t have dedicated DevOps and augments those that do. The platform automates the provisioning of your application to the cloud (AWS, GCP, Azure), integrating cloud ops, SecOps, and security/compliance with 24×7 monitoring and support.
Learn More
The latest from DuploCloud
TRENDING STORIES
Fahmid Kabir leads product and go to market at DuploCloud, an AI-powered DevOps platform. He has worked with deep AI technologies, cloud infrastructure and compliance for the past 18 years.
Read more from Fahmid Kabir
DuploCloud sponsored this post.
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: Real.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.