The role of Testing in AIOps
This article explains why AIOps must go beyond Observability and make Testing the core discipline that turns AI governance from aspiration into operational control.
Written by Pierre Oberholzer
Estimated read time: 24 min
What you’ll learn: How Testing can be considered an essential pillar of AIOps, and how Step can contribute to implementing it in practice.
Ideal profile(s): Test Managers, AI Engineers, AIOps and LLMOps Engineers, AI Leads, QA Leads, DevOps Engineers, Automation Engineers, SREs, Enterprise Architects, Business Analysts.
Abstract
As AI systems and agents move into production, organizations increasingly rely on observability and automation to manage their behavior. Yet visibility alone does not provide control.
This article argues that Testing is the missing pillar of AIOps. By framing AIOps around Testing, Observability, and Operation, it shows how explicit, executable expectations enable AI systems to be steered rather than merely monitored. It proposes Testing as the most effective entry point into AIOps and explains how test orchestration provides a practical abstraction for managing the complexity of agentic AI systems.
Note: In this article, we use “AIOps” in the sense of operating AI systems in production. This is distinct from the more common use of the term to describe applying AI techniques to traditional IT operations.
1. The AIOps paradox: powerful models, fragile systems
AI systems such as AI Agents have never been more capable, and yet getting them reliably into production remains painfully hard.
Hallucinations [1], probabilistic behavior, unclear accountability, and fragile integrations are now well-documented failure modes [2] of contemporary AI systems. In practice, many incidents in production AI systems are not caused by exotic model failures, but by predictable system-level weaknesses. Our recent case study on testing a real-world AI agent illustrates how such weaknesses can be surfaced and addressed early [3].
The result is familiar to many organizations.
AI initiatives reach impressive proof of concept stages, but adoption often stalls shortly after. This pattern, sometimes informally called PoC purgatory, reflects a documented gap between experimentation and scaled operational impact in enterprise AI adoption. Indeed, some studies report that while 88% of companies are diving into AI, only 32% of models ever make it to production [9]. Even when they do, many organizations fail to convert those deployments into measurable business value, and meaningful ROI remains out of reach [11].
At the same time, a new wave of AI systems is emerging. AI agents promise to combine traditional software components such as logic, rules, calculations, and APIs with the semantic understanding and generative capabilities of large language models (LLMs). This hybrid nature dramatically expands what AI systems can do, but it also amplifies their operational complexity. Autonomous AI agents do not introduce entirely new classes of failure, but they amplify existing operational weaknesses by turning local errors into systemic risks.
This is where classical approaches start to break down.
The DevOps, MLOps, LLMOps, and AgentOps disciplines each address part of the problem.
- DevOps: focuses on reliable and automated software delivery.
- MLOps: extends DevOps to manage the training and deployment of machine learning models.
- LLMOps: adapts operational practices to large language models and their lifecycle.
- AgentOps: addresses the orchestration, monitoring, and operation of autonomous AI agents in production.
While they bring structure to deployment, experimentation, monitoring, and iteration, none of them, on their own, fully answer a fundamental question:
How do we bring AI systems to a production grade level of quality without destroying the flexibility that makes them valuable in the first place?
This question is at the core of what we refer to as AIOps.
In this context, AIOps can be understood as the discipline and set of processes building a horizontal bridge between development and operation for AI systems, from shift-left (DevOps-focused) to shift-right (QA-focused), while preserving their adaptive nature.
It is also the discipline that enables organizations to collect evidence about system behavior under defined conditions, whether through controlled test scenarios in pre-production or through observation in real operating contexts. Seen this way, AIOps is not only an operational concern. It is also a critical vertical bridge between AI systems and AI governance, accountability, and trust.
Yet, despite this ambition, most AIOps discussions and tool stacks remain heavily skewed toward one capability: Observability. Dashboards, traces, logs, and metrics have become the default answer to AI reliability.
Observability focuses on what happened. Beyond reactive guardrails, it says little about what should happen. What is largely missing is the capability that historically made complex software systems controllable in the first place: Testing.
2. Why classical DevOps thinking breaks down with AI systems
For more than a decade, DevOps has provided a powerful answer to a fundamental software challenge: how to deliver complex systems to production quickly and reliably. Through automation, CI/CD pipelines, infrastructure as code, and feedback loops, teams learned how to control change at scale.
When machine learning (ML) entered production environments, MLOps naturally extended this model. Training pipelines, model registries, experiment tracking, and deployment automation brought much-needed structure to what had previously been ad hoc processes. More recently, LLMOps and AgentOps have emerged to address the operational realities of large language models and autonomous AI agents, extending these practices toward increasingly agentic systems.
At first glance, it is tempting to believe that this evolution is sufficient, that AIOps is simply DevOps or MLOps with better tooling, with LLMs and agents added to the stack. In practice, this approach must be revisited.
The first reason is that intent becomes a variable input at runtime. In MLOps, data is primarily treated as a training-time dependency and as a relatively stable inference-time input. In agentic systems, user intent, and often upstream agent intent, directly drives behavior during execution. Intent is expressed through prompts, goals, constraints, and contextual instructions, and it can change from one interaction to the next. Current LLMOps and AgentOps practices can capture and log this intent, but they provide limited support for validating whether resulting behaviors remain acceptable under varying or underspecified intents. This makes failures harder to predict, reproduce, and diagnose using existing workflows alone.
The second reason is a new form of non-determinism at the system level. MLOps already accounts for non-determinism in data and models, for example through drift detection, performance monitoring, and retraining pipelines. LLMOps extends this with tracing, prompt versioning, and output evaluation. AgentOps further adds execution management and runtime observability for agent workflows. However, AI agents introduce non-determinism in how whole systems behave over time. Decisions that affect control flow, tool usage, execution paths, or termination conditions are made dynamically at runtime. Two executions with similar inputs may therefore result in different sequences of actions. While current practices can capture and replay these sequences, they typically do not provide a mechanism to explicitly define, validate, or constrain acceptable ranges of system-level behavior across executions. As a result, failures emerge not only from prediction error, but from divergent behaviors unfolding over multiple steps.
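To make this second point concrete, here is a minimal sketch of what explicitly bounding system-level behavior across executions could look like. It is written in plain Python under stated assumptions: `run_agent` is a hypothetical stand-in for whatever runtime executes the agent, and the allowed tool-call sequences are illustrative.

```python
# Sketch: constraining acceptable system-level behavior across repeated executions.
# `run_agent` is a hypothetical placeholder for the real agent invocation; it is
# assumed to return the ordered list of tools the agent called during one run.
from collections import Counter

ALLOWED_SEQUENCES = {
    ("retrieve_policy", "check_eligibility", "draft_answer"),
    ("retrieve_policy", "draft_answer"),  # shortcut path, still acceptable
}

def run_agent(prompt: str) -> list[str]:
    """Placeholder for the real agent runtime (API call, SDK, orchestrator)."""
    raise NotImplementedError

def test_tool_sequences_stay_within_bounds(runs: int = 10) -> None:
    observed = Counter()
    for _ in range(runs):
        sequence = tuple(run_agent("Can I extend my travel insurance abroad?"))
        observed[sequence] += 1
        # Every individual execution must follow an explicitly allowed path.
        assert sequence in ALLOWED_SEQUENCES, f"Unexpected execution path: {sequence}"
    # Optionally, also bound how often the shortcut path may be taken.
    shortcut_rate = observed[("retrieve_policy", "draft_answer")] / runs
    assert shortcut_rate <= 0.5, f"Shortcut path used too often: {shortcut_rate:.0%}"
```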
Third, AI systems introduce human-in-the-loop (HITL) feedback loops that blur the boundary between development and operation. User interactions influence system state, and system state influences future decisions. Ground truth may arrive late, or never. While MLOps and AgentOps acknowledge these feedback loops conceptually, validation activities are still largely organized around pre-deployment checks, with monitoring and remediation occurring after deployment. As a result, validation remains implicitly phase-based, even though system behavior evolves continuously.
Finally, modern AI systems increasingly operate as systems of systems. They combine models, prompts, tools, APIs, decision logic, policies, and orchestration layers, often owned by different teams. Current Ops practices provide strong visibility into individual components and execution traces, but they offer limited support for expressing and validating expected behavior that emerges from interactions across components and execution steps. Failures therefore tend to surface at the system level, even when no single component is clearly defective.
These characteristics do not invalidate DevOps, MLOps, LLMOps, or AgentOps. On the contrary, they remain essential. But in their current implementations, they are not sufficient on their own to ensure production-grade quality in autonomous AI systems.
What breaks is not automation, deployment, or monitoring. It is the widespread assumption in practice that deeply observing systems is enough to control them. Therefore AIOps cannot be reduced to better dashboards or more detailed traces. It requires reintroducing a discipline that has always been central to reliable engineering: the systematic ability to define expectations and verify behavior under controlled conditions.
In other words, Testing.
In the next section, we will look at why observability alone has become the default answer to AI reliability, and why it leaves a critical gap unaddressed.
3. Why AI Observability alone is not AIOps
As AI agents move into production, AI observability has become the dominant response to their growing complexity. While these tools provide much-needed visibility into what systems are doing and how decisions are produced, they rarely validate the full agentic workflow from input to outcome. Today’s “golden datasets”, for example, primarily measure whether the LLM answers well, not whether the end-to-end system behaves correctly in real-world conditions.
This focus is both understandable and necessary. Without observability, AI agents are opaque, impossible to debug, and dangerous to operate at scale.
But observability answers only part of the problem.
First, Observability struggles to express intent.
It can show how a system behaved, but not how it should have behaved. LLMOps metrics primarily capture observed outcomes related to safety and model quality, with limited ability to express user or business intent. Without explicit testable definitions of intent, teams lack a shared, executable understanding of what correct behavior means for an AI system. As a result, systems are often monitored reactively but less frequently tested against mandatory outputs that reflect real-world scenarios, leaving deviations to be interpreted subjectively rather than evaluated against predefined criteria. This reflects current LLMOps practice, which emphasizes observability and safety guardrails over structured, intent-driven validation frameworks [12,13,14].
Second, Observability is fundamentally reactive.
Because behavior can drift without code changes, issues may only be detected after they have already impacted users. While fallback and human escalation mechanisms can contain damage, they rely on post-hoc detection. By the time anomalies become visible in metrics or traces, incorrect decisions may already have been taken at scale, and retrospective explanations do not undo their impact [4]. This reflects a widespread assumption in practice that failures are acceptable as long as they can be detected and analyzed quickly. Also, it does not prevent failures from occurring in the first place. In practice, many of the most damaging failures are not sudden outages, but slow, silent degradations in behavior that remain invisible until they have already affected business outcomes [9], as exemplified below (Figure 1).
This gap becomes particularly critical for AI systems designed to make decisions or recommendations directly within business-critical workflows.
In such systems, acceptable behavior depends on context, constraints, and business rules that must be defined explicitly in advance and cannot be inferred from telemetry alone. While reactive detection may be reasonable for non-critical systems, it does not hold for AI systems whose decisions affect users, finances, or compliance.
None of this diminishes the value of Observability. It remains a foundational capability of AIOps. But on its own, it is reactive by nature.
To move from reaction to control, AIOps requires a complementary discipline. One that allows teams not only to observe behavior after the fact, but to actively validate it against explicit expectations. This shift from passive observation to active validation is essential if AI systems are to be trusted at scale.
In the next section, we will introduce a simple understanding of AIOps based on three complementary pillars, and explain why Testing must be considered a core capability alongside observability and operation.
4. A simple way to think about AIOps
If AIOps is not just better Observability, how should we define it?
Rather than introducing yet another complex reference architecture, it helps to step back and think in terms of capabilities. What are the essential disciplines an organization needs in order to bring AI systems to production and keep them under control over time?
A practical way to approach this is to frame AIOps around three complementary pillars: Testing, Observability, and Operation.
This framing is intentionally simple (see Figure 2). It does not prescribe tools or platforms and does not replace DevOps, MLOps, LLMOps, or AgentOps. Instead, it provides a unifying control loop across them, independent of how responsibilities are split organizationally, and describes what must exist for AI systems to be reliable, governable, and evolvable in real-world environments.
Testing
Testing represents the capability to recurrently validate that an AI system behaves as expected under defined conditions.
In the context of AI systems, Testing means comparing explicit expectations against observed behavior under defined inputs and contexts. Given a specific scenario, the system should produce outputs that are correct, acceptable, or at least bounded within known limits. This builds on established system and end-to-end testing practices, while applying them to AI systems whose behavior is partly probabilistic, context-dependent, and often emergent rather than strictly deterministic.
Testing also includes the definition of test scenarios, the orchestration of test workflows, and the execution of those tests at different stages of the lifecycle. Tests can be run before deployment, during deployment, and repeatedly while the system is operating in production, reflecting the fact that behavior can evolve over time even without code changes.
For complex systems such as AI agents or multi-agent setups, test orchestration becomes critical to systematically exercise and evaluate the interactions from which correct or incorrect behavior emerges. Without structured, extended, scenario-based testing, these interactions remain largely uncontrolled.
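As a minimal illustration of what "explicit and executable" means here, a test scenario can be captured as a small structure that ties inputs, context, and expectations together. The sketch below is plain Python under assumed names (`Scenario`, `run_system`); it is not a prescribed format, only one way to make expectations machine-checkable.

```python
# Sketch: an explicit, executable test scenario for an AI system.
# Field names are illustrative; any structure that makes inputs, context,
# and expectations explicit and machine-checkable serves the same purpose.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Scenario:
    name: str
    inputs: dict[str, Any]                     # user intent, parameters
    context: dict[str, Any]                    # retrieved docs, memory, config
    expectations: list[Callable[[Any], bool]] = field(default_factory=list)

    def evaluate(self, run_system: Callable[[dict, dict], Any]) -> bool:
        """Execute the system under test and check every expectation."""
        output = run_system(self.inputs, self.context)
        return all(check(output) for check in self.expectations)

refund_scenario = Scenario(
    name="refund_request_within_policy",
    inputs={"query": "I want a refund for order 1234"},
    context={"policy_version": "2024-06"},
    expectations=[
        lambda out: "refund" in str(out).lower(),         # topic is addressed
        lambda out: "guarantee" not in str(out).lower(),  # no unauthorized promises
    ],
)
```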
Observability
Observability represents the capability to continuously capture and understand how an AI system actually behaves.
It relies on signals such as logs, traces, metrics, and events produced by the system during development and operation. These signals make decisions inspectable and allow teams to investigate unexpected behavior, performance degradation, or drift.
Observability provides visibility into reality. It answers questions such as what happened, how often, and under which conditions. It is indispensable for debugging and for learning from system behavior, both in development and in production.
However, observability does not define what should happen in a normative or executable way. It describes outcomes, not intent. This distinction becomes important when observability is combined with testing.
Operation
Operation represents the capability to run and act on an AI system in order to keep it performant, reliable, and safe, in development and production.
This includes provisioning and hosting, scaling, access control, security, configuration management, and health monitoring. It also includes runtime control mechanisms that allow teams to intervene when expectations are violated, whether through rollbacks, feature flags, throttling, policy enforcement, or controlled degradation of capabilities.
Operation is where insights turn into action. Without operational control, observability remains passive and testing remains theoretical.
A reinforcing system, not a checklist
These three pillars are not independent. They reinforce each other continuously.
Observability reveals how the system behaves in reality, which often leads to the creation of new test scenarios. Testing formalizes expectations and can trigger operational actions when those expectations are not met. Operational changes, in turn, modify system behavior and generate new signals to observe.
Seen this way, AIOps is not a linear pipeline. It is a control loop.
This framing also explains why Testing plays such a critical role. Testing is the only pillar that makes expectations explicit and executable. Without it, Observability has nothing to compare reality against, and Operation has no objective basis for action.
In the next section, we will look more closely at why Testing is currently the most overlooked pillar of AIOps, and why this gap becomes especially dangerous as AI systems grow more autonomous.
5. Why Testing is the most overlooked pillar of AIOps
If Testing is so central to reliable AI systems, why is it so often missing from AIOps discussions and implementations?
The first reason is historical.
Testing has always been closely associated with deterministic (i.e. rule-based) software. Inputs lead to predictable outputs, and correctness can be asserted with precision. AI systems, by contrast, are probabilistic by nature. Outputs may vary, correctness may be contextual, and expectations may be fuzzy. As a result, many teams implicitly conclude that Testing is either impossible or not worth the effort.
This conclusion is understandable, but incorrect. While assertions in AI systems are often probabilistic, contextual, or range-based rather than binary, they are still assertions. Without them, behavior cannot be evaluated systematically.
Instead, many teams fall back on monitoring and manual inspection.
The second reason is organizational.
In most organizations, Testing is still treated as an activity rooted in traditional IT delivery models. It is typically centralized and closely aligned with operational applications, whether through DevOps practices that emphasize early validation and automation, or through QA functions that focus on late-stage verification. In both cases, Testing is optimized for deterministic systems and stable requirements, and remains largely disconnected from innovation teams, where AI systems are explored, iterated on, and fundamentally reshaped.
The third reason is tooling.
While Observability tooling has evolved rapidly to support AI systems and agentic workflows, Testing tooling has largely remained rooted in traditional software paradigms. Existing Testing platforms are optimized for deterministic systems, fixed interfaces, and stable inputs and outputs. In practice, test scenarios are often authored by business analysts and maintained by test automation engineers using domain-specific languages or low-code tooling, with limited integration into AI development stacks.
The limits of informal testing
As a result, teams often test AI systems informally. They rely on ad hoc prompts, spot checks, or manual reviews. These approaches may work during early experimentation, but they do not scale. They produce no evidence, no repeatability, and no shared understanding of what correct behavior means.
This gap becomes especially dangerous as AI systems evolve toward greater autonomy.
AI agents and agentic workflows introduce decision loops, tool usage, and long-running interactions with their environment. Failures rarely take the form of a single incorrect output. They emerge from sequences of decisions that drift outside acceptable boundaries. By the time observability surfaces a problem, the system may already be operating far from its intended behavior.
Without Testing, teams are left reacting to symptoms rather than controlling causes.
Testing as the foundation of control
Testing changes this dynamic.
It forces teams to make expectations explicit. It creates a shared and reusable language between developers, operators, users, and governance stakeholders. It turns subjective judgments into executable checks. Most importantly, it allows failures to be discovered under controlled conditions, rather than through real user impact.
This is why testing is not simply another AIOps feature. It is the mechanism that transforms observability from passive visibility into actionable insight, and operation from reactive intervention into deliberate control. This is in line with recent AI engineering work that explicitly states that agent quality assurance requires automated testing and validation to prevent failures [9].
In the next section, we will make this concrete by clarifying what it actually means to test an AI system, and why testing models alone is not enough.
6. What it really means to test an AI system
When teams discuss testing AI systems in practice, particularly within agentic workflows, they typically concentrate on evaluating individual model invocations or discrete agent steps. These evaluations often rely on established traditional AI (i.e. machine learning) metrics such as accuracy, precision, and recall, as well as generative AI–specific measures including toxicity, bias, hallucination rates, or task-specific quality scores. While valuable, these approaches are inherently local in scope and provide limited insight into the behavior and performance of the system as a whole.
In production, AI systems rarely fail due to small degradations in a single model or step. Instead, failures more often arise from unexpected system behavior in specific situations, typically through sequences of decisions that compound over time [10].
To make Testing meaningful in AIOps, the unit of testing must therefore be the AI system as a whole, not just models or agent steps in isolation. Testing an AI system means validating end-to-end behavior across multiple components, execution paths, and contexts.
Data and context tests
AI behavior is inseparable from the data and context it receives at runtime. In agentic systems, this “data” is not only a static dataset, but a dynamically assembled context that may include user intent, templated prompts, retrieved information, tool outputs, memory state, and external signals. Testing must therefore include explicit checks on the integrity of this context.
This includes validating schemas, formats, and basic quality constraints, but also testing higher-level properties. Are retrieved documents relevant and bounded? Are critical context elements present? Are inputs drifting outside known or safe boundaries? These tests often execute before or alongside inference. Their purpose is simple: prevent invalid, misleading, or dangerous context from ever reaching the decision logic of the agent.
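A minimal sketch of such checks, executed before the agent's decision logic, is shown below. The field names, document limits, and token budget are illustrative assumptions, not a fixed contract.

```python
# Sketch: data and context integrity checks executed before inference.
# Field names and thresholds are illustrative assumptions.
MAX_CONTEXT_TOKENS = 8_000
REQUIRED_FIELDS = {"user_intent", "retrieved_documents", "policy_constraints"}

def estimate_tokens(text: str) -> int:
    # Rough heuristic; a real pipeline would use the model's tokenizer.
    return len(text.split())

def test_context_integrity(context: dict) -> None:
    # Critical context elements must be present before the agent decides anything.
    missing = REQUIRED_FIELDS - context.keys()
    assert not missing, f"Missing context fields: {missing}"

    # Retrieved documents must be non-empty and bounded in size.
    docs = context["retrieved_documents"]
    assert 0 < len(docs) <= 10, f"Unexpected number of retrieved documents: {len(docs)}"
    total_tokens = sum(estimate_tokens(d["text"]) for d in docs)
    assert total_tokens <= MAX_CONTEXT_TOKENS, "Retrieved context exceeds the safe budget"

    # Inputs must stay within known boundaries (here: a whitelisted language).
    assert context.get("language", "en") in {"en", "de", "fr"}, "Unsupported input language"
```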
Model and component (tool) tests
An agentic workflow combines model-based components, such as ML models and LLMs, with non-AI components, which may constitute the majority of the system. The latter encompass traditional software components such as business logic, data transformations, database access, and API integrations. All components must be subject to dedicated testing using methods appropriate to their role and failure modes.
Rather than focusing solely on global metrics, testing at this level should emphasize regression detection, robustness, and behavior on critical slices. A component that performs well on average may still fail catastrophically for specific inputs, intents, or contexts that matter most.
These tests determine whether the behavior of individual components remains acceptable relative to previous versions or defined baselines, without attempting to validate end-to-end system behavior.
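The sketch below illustrates this idea: a single component is checked against a frozen baseline on a critical slice rather than on a global average. The `classify_claim` component, the slice file, and the baseline figures are hypothetical.

```python
# Sketch: regression test of a single component on a critical slice.
# `classify_claim`, the slice file, and the baseline figures are hypothetical.
import json

def classify_claim(text: str) -> str:
    """Component under test (e.g. an ML model or LLM call behind a stable interface)."""
    raise NotImplementedError

def test_no_regression_on_high_value_claims(slice_path: str = "high_value_claims.jsonl") -> None:
    with open(slice_path) as f:
        cases = [json.loads(line) for line in f]

    correct = sum(classify_claim(c["text"]) == c["expected_label"] for c in cases)
    accuracy = correct / len(cases)

    # The baseline is a frozen figure from the previously approved version.
    baseline_accuracy = 0.93
    tolerance = 0.02
    assert accuracy >= baseline_accuracy - tolerance, (
        f"Regression on critical slice: {accuracy:.2%} vs baseline {baseline_accuracy:.2%}"
    )
```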
System and pipeline tests
While data and component-level tests are necessary, most real-world failures occur at the system level, even when their root causes lie in internal subsystems. System tests validate the behavior of the AI system end to end, including the correct functioning of critical pipelines that assemble context and drive decisions. This includes feature extraction, prompt construction, tool invocation, external API calls, and post-processing logic, as well as non-functional properties such as latency, throughput, and cost constraints. Pipeline tests focus on these internal, behavior-shaping subsystems to ensure that downstream decisions are made on valid, well-formed, and stable intermediate results.
For agentic systems, testing must extend beyond isolated input–output validation to multi-step workflows. Runtime orchestration frameworks such as LangGraph coordinate execution, but testing must deliberately exercise and evaluate these orchestrated sequences across realistic scenarios and variations. Behavior in such systems emerges from interactions between agents, tools, memory, and external services over time. Agents that behave correctly in isolation may still produce unstable or undesirable outcomes when interacting, which makes systematic, scenario-based evaluation of full workflows essential to validate system behavior as a whole.
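A sketch of such an end-to-end test is shown below. It exercises a full multi-step workflow and asserts both functional and non-functional properties. The `run_workflow` entry point, its metadata fields, and the budgets are assumptions; in practice it could wrap a LangGraph graph or any other orchestrator.

```python
# Sketch: end-to-end system test covering functional and non-functional properties.
# `run_workflow` is an assumed entry point wrapping the full orchestrated pipeline
# (prompt construction, tool calls, post-processing); budgets are illustrative.
import time

def run_workflow(request: dict) -> dict:
    """Assumed entry point returning the final answer plus execution metadata."""
    raise NotImplementedError

def test_end_to_end_quote_workflow() -> None:
    request = {"intent": "get_insurance_quote", "age": 42, "coverage": "household"}

    start = time.perf_counter()
    result = run_workflow(request)
    latency_s = time.perf_counter() - start

    # Functional expectations on the final outcome, not on individual steps.
    assert result["quote"]["currency"] == "CHF"
    assert 0 < result["quote"]["premium"] < 10_000

    # The orchestrated sequence must terminate and stay within a bounded depth.
    assert result["steps_taken"] <= 12, "Workflow did not converge within the step budget"

    # Non-functional constraints: latency and cost per execution.
    assert latency_s < 5.0, f"Latency budget exceeded: {latency_s:.1f}s"
    assert result["llm_cost_usd"] < 0.05, "Cost budget exceeded"
```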
Behavioral and business tests
Finally, AI systems must be tested against behavioral, user, and business expectations. These tests encode rules, policies, and constraints that define acceptable system behavior, covering hallucinations, policy violations, and unsafe outputs, as well as the achievement of business objectives.
Correctness at this level is rarely binary. Tests may assert ranges, thresholds, or qualitative acceptability rather than exact matches, using deterministic rules (e.g., regex, schema validation) and, where appropriate, LLM-based evaluators acting as judges to score qualitative outputs (e.g., tone, toxicity, relevance) or quantitative ones (e.g., facts, calculations, logical consistency) against explicitly defined criteria.
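The sketch below combines both techniques: deterministic rules first, then an LLM-based judge scoring against an explicit rubric. The `llm_judge` call is a generic placeholder rather than a specific vendor API, and the rubric and threshold are illustrative assumptions.

```python
# Sketch: behavioral assertion combining deterministic rules with an LLM-as-a-judge.
# `llm_judge` is a generic placeholder for whatever evaluator model client is available;
# the rubric and acceptance threshold are illustrative assumptions.
import re

def llm_judge(rubric: str, answer: str) -> float:
    """Placeholder: ask an evaluator model to score `answer` against `rubric` in [0, 1]."""
    raise NotImplementedError

def test_refund_answer_is_acceptable(answer: str) -> None:
    # Deterministic rules: no leaked internal identifiers, policy reference present.
    assert not re.search(r"\bINTERNAL-\d+\b", answer), "Internal identifier leaked"
    assert re.search(r"refund policy", answer, re.IGNORECASE), "Policy not referenced"

    # Qualitative criteria: thresholds instead of exact matches.
    rubric = (
        "Score 1.0 if the answer is polite, factually consistent with the refund policy, "
        "and makes no unauthorized commitments; score lower otherwise."
    )
    score = llm_judge(rubric, answer)
    assert score >= 0.8, f"Judge score below acceptance threshold: {score:.2f}"
```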
This layer connects Testing directly to governance and accountability by producing explicit evidence that the system behaves within agreed boundaries under defined conditions.
From static validation to recurrent testing
Across all these layers, a key shift is required: to move from static validation to recurrent testing. Testing cannot be a one-time activity performed before deployment. Even without code changes, AI system behavior can evolve due to changes in data, queries, context, memory, feedback, external dependencies, or model updates. Tests must therefore be executed continuously across the system lifecycle, in development, during deployment, and repeatedly in production.
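One minimal way to make a suite recurrent rather than one-off is to re-execute it on a schedule and keep the pass rate as a time series. In the sketch below, `run_suite` and the interval are assumptions; in practice, scheduling is usually delegated to a CI scheduler or a test orchestration platform rather than a hand-rolled loop.

```python
# Sketch: recurrent execution of the same test suite against a live system.
# `run_suite` is an assumed callable returning per-scenario pass/fail results.
import datetime
import json
import time

def run_suite() -> dict[str, bool]:
    """Assumed entry point executing all scenarios against the deployed system."""
    raise NotImplementedError

def recurrent_run(interval_s: int = 3600, history_path: str = "pass_rate_history.jsonl") -> None:
    while True:
        results = run_suite()
        pass_rate = sum(results.values()) / len(results)
        record = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "pass_rate": pass_rate,
            "failed": [name for name, ok in results.items() if not ok],
        }
        with open(history_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        # A dropping pass rate without any code change is the signature of behavioral drift.
        time.sleep(interval_s)
```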
Observability feeds this process by revealing new behaviors that should be tested. Operations close the loop by acting when tests fail.
Seen this way, Testing is not opposed to flexibility. It is what makes flexibility safe.
In the next section, we will look at how Testing fits into a broader AIOps control loop, and why it often becomes the most effective entry point for organizations starting their AIOps journey.
7. How the three pillars reinforce each other
Individually, Testing, Observability, and Operation each provide value. Together, they form a control loop that allows AI systems to be steered rather than merely watched.
This distinction matters. Many organizations today have visibility into their AI systems, but little control over their behavior. Dashboards and traces describe what is happening, but they do not define what should happen next. The difference lies in how these capabilities interact.
Observability is the primary source of learning. By capturing real-world behavior, it reveals patterns, edge cases, and failure modes that were not anticipated during development or captured by pre-deployment testing (e.g. in an IDE). These insights are essential, but on their own they remain descriptive.
Testing turns observation into intent. When observability reveals unexpected behavior, testing allows teams to formalize this insight as an explicit expectation. What was once an incident becomes a test scenario. What was once implicit becomes executable.
Operation runs and acts on defined expectations. Operational mechanisms enforce those expectations by controlling how and when the system runs. Failed tests may block deployments, trigger rollbacks, degrade functionality, throttle execution, or route decisions through safer paths. Operation translates validation into control.
These actions, in turn, change system behavior and generate new telemetry. Observability captures this reality, closing the loop and feeding the next iteration of testing. This continuous cycle is what distinguishes AIOps from a collection of loosely connected tools.
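As a small illustration of this loop, an incident captured by observability can be promoted directly into a regression scenario. The trace field names below are assumptions about what a trace typically contains; the returned structure mirrors the scenario sketch shown earlier.

```python
# Sketch: promoting an observed production incident into an executable test scenario.
# Trace field names are illustrative assumptions about what observability captures.
from typing import Any, Callable

def scenario_from_incident(trace: dict[str, Any]) -> dict[str, Any]:
    """Turn a problematic production trace into a regression scenario definition."""
    failure_pattern: str = trace["bad_pattern"]
    checks: list[Callable[[Any], bool]] = [
        lambda out: failure_pattern not in str(out),  # the observed failure must not reoccur
    ]
    return {
        "name": f"regression_{trace['trace_id']}",
        "inputs": trace["inputs"],              # the request that triggered the incident
        "context": trace["context_snapshot"],   # the context the system actually saw
        "expectations": checks,
    }
```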
This loop also explains why Testing plays a central role. Testing is the only pillar that directly connects what the system should do with what it actually does. Without testing, observability has no reference point and operation has no objective trigger for action.
As AI systems grow more capable, they also grow more complex. Their behavior emerges from interactions between data, models, logic, tools, and environments. In such systems, uncertainty cannot be eliminated. It can only be managed.
This is why AIOps should be understood primarily as a discipline of control. Control does not mean rigidity or frozen behavior. It means being able to define expectations, observe reality, and act deliberately when the two diverge.
The three pillars reflect this clearly. Testing defines intent. Observability reveals behavior. Operation enforces decisions. When these capabilities reinforce each other, AI systems become steerable. Failures become scenarios to anticipate rather than surprises to endure.
This perspective also reframes the role of humans. Rather than supervising every decision, humans define boundaries and decide how systems should react when expectations are not met. Automation executes within those boundaries.
AIOps, in this sense, is not a promise of self-managing AI. It is a commitment to engineering discipline in an uncertain domain. And like in any other engineering discipline, control starts with the ability to test.
Without AIOps, AI governance is aspirational. With AIOps, it becomes operational. Testing is what makes AI intentional.
8. Where to start: Testing as the catalyst
Faced with the breadth of AIOps, many organizations struggle to decide where to begin.
Observability, deployment automation, governance, retraining pipelines, and agent orchestration all appear equally important. Attempting to address everything at once often leads to slow progress, diluted ownership, and architectural decisions made without clear intent.
A more effective approach is to start where learning is fastest and commitment is lowest.
For most organizations, that starting point is Testing.
Testing does not require redesigning production infrastructure. In most organizations, it is already a familiar and established practice. It does not mandate new deployment strategies, operating models, or organizational structures. Instead, it focuses on a deceptively simple question:
What do we expect this AI system to do under defined conditions?
By answering this question, teams immediately surface assumptions, ambiguities, and hidden risks. They discover which behaviors matter, which scenarios are critical, and where current systems are fragile. This learning happens early, often before any operational changes are required.
Testing also creates a shared language. Business analysts, AI engineers, software developers, data scientists, testing experts, and AI governance, risk, and compliance stakeholders can all reason about test scenarios and outcomes. Expectations become explicit rather than implicit. Disagreements surface early, when they are still cheap to resolve.
From a risk perspective, Testing is a low-commitment entry point. It can be introduced incrementally. Teams can start with a small number of high-value scenarios and expand coverage over time. There is no need to test everything at once. Even a limited test suite can significantly improve confidence and decision-making.
Testing also naturally pulls the other pillars into place. As tests are executed repeatedly, teams need observability to understand failures and unexpected behavior. As tests become gating conditions, operational mechanisms are required to act on results. Observability and operation emerge as necessary complements, rather than as abstract prerequisites.
This sequencing matters. Organizations that start with observability often end up with rich dashboards but unclear expectations. Organizations that start with Testing build a concrete foundation for control.
This approach only makes sense if AIOps is understood not as a quest for perfect prediction, but as a discipline focused on control.
9. Implementing the Testing pillar
Up to this point, we have deliberately stayed at the level of principles and capabilities. This is important. The AIOps framing presented in this article should remain valid regardless of specific tools or platforms.
Still, a natural question arises: what does it actually look like to implement the Testing pillar in practice?
To implement the Testing pillar effectively, organizations need tooling that is open, extensible, and capable of handling high concurrency.
A concrete example of a tool supporting this approach is Step.
As an open platform, Step illustrates how the principles of composable, large-scale test orchestration can be applied to AI. Its relevance in the context of AIOps does not come from being an “AI-native” tool, but from acting as a reference implementation for how to treat AI agents as testable, controllable entities. It addresses many of the structural requirements that testing AI systems demands [7], while also supporting business-critical RPA and automation workflows [8].
At its core, Step treats systems — including AI systems — as testable entities that can be exercised under controlled conditions. Test scenarios are defined explicitly. Executions are orchestrated, parallelized, and repeated. Evidence is collected systematically. Failures are observable, comparable, and actionable. These properties are not specific to AI, but they turn out to be particularly well suited to it.
Applied to AI systems, this means that:
- AI agents can be invoked as part of test scenarios, just like any other system component
- Inputs, contexts, and configurations can be varied systematically
- Assertions, including LLM-as-a-judge when appropriate, can encode expectations about behavior, performance, cost, or safety, whether deterministic or probabilistic
- Test orchestration provides a versatile keyword-based abstraction for decomposing and recomposing complex agentic workflows
- Test executions can run at scale, in parallel, on demand, on schedule, or continuously as part of normal operations or delivery pipelines
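To make the keyword abstraction tangible, here is a minimal sketch in plain Python of how an agentic workflow might be decomposed into reusable, independently testable steps and recomposed into plans. It illustrates the concept only; it does not reproduce Step’s actual keyword or plan syntax.

```python
# Sketch: keyword-style decomposition of an agentic workflow into reusable test steps.
# Plain Python illustrating the concept; not Step's actual keyword or plan syntax.
from typing import Any, Callable

KEYWORDS: dict[str, Callable[..., Any]] = {}

def keyword(name: str):
    """Register a reusable, independently testable step under a stable name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        KEYWORDS[name] = fn
        return fn
    return register

@keyword("AskAgent")
def ask_agent(question: str) -> str:
    raise NotImplementedError  # placeholder for the real agent invocation

@keyword("CheckAnswerContains")
def check_answer_contains(answer: str, expected: str) -> None:
    assert expected.lower() in answer.lower(), f"Expected '{expected}' in answer"

def run_plan(plan: list[tuple[str, dict[str, Any]]]) -> None:
    """Execute a plan: an ordered, recomposable sequence of keyword calls."""
    state: dict[str, Any] = {}
    for name, args in plan:
        resolved = {k: state.get(v, v) for k, v in args.items()}
        state["last_result"] = KEYWORDS[name](**resolved)

# A plan recomposed from the same keywords, runnable on demand, on schedule, or at scale.
plan = [
    ("AskAgent", {"question": "Summarize the refund policy for damaged goods."}),
    ("CheckAnswerContains", {"answer": "last_result", "expected": "refund"}),
]
# run_plan(plan) would execute the recomposed workflow end to end.
```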
A case study demonstrating how Step can be used to test AI agents can be found in [3].
Importantly, Step does not replace observability or operational tooling. It complements them. Observability platforms remain essential for understanding real-world behavior. Operational platforms remain essential for deployment and control.
Step focuses on the missing capability: turning expectations into executable tests that can be orchestrated, repeated, and enforced over time.
This separation of concerns matters. It avoids collapsing AIOps into a single monolithic platform. Instead, it allows organizations to compose their AIOps stack around clear responsibilities: testing defines intent, observability reveals reality, and operation enforces decisions.
Seen in this light, Step is not “the AIOps solution”. It is an example of how the Testing pillar can be implemented concretely and integrated into a broader AIOps control loop.
Other implementations are possible, and will certainly emerge. What matters is not the specific tool, but the capability it embodies. Without systematic testing, AIOps remains reactive. With it, AI systems become controllable engineering systems rather than fragile experiments.
That is ultimately the goal of AIOps.
References
[1] A Survey on Hallucination in Large Language Models
[2] International AI Safety Report
[3] Testing AI Agents with Step
[4] AI Observability — Galileo
[5] Accuracy, Precision, Recall — Google ML Crash Course
[6] LLM Evaluation Metrics: A Comprehensive Guide
[7] Unified Testing with Step
[8] Robotic Process Automation for a Swiss Insurance Company
[9] AI Engineering & Ops Advantages — Dataiku
[10] AI Agent Failures: A Taxonomy
[11] The State of AI — McKinsey
[12] Observability Is the Foundation for Trustworthy AI — BARC
[13] LLMOps — Fiddler AI
[14] Enterprise LLMOps Architecture — Fractal AI