Testing AI Agents with Step

This case study demonstrates how Step can be used as a unified execution and testing platform for AI Agents, rather than only as a test orchestrator for externally hosted systems.

Written by Pierre Oberholzer

Estimated read time: 13 min
What you’ll learn: How Exense’s Step platform enables scalable, end-to-end testing of AI Agents, providing a reusable and traceable testing backbone for production-grade AI systems.

Ideal profile(s): Test Managers, AI Engineers, AIOps and LLMOps Engineers, AI Leads, QA Leads, DevOps Engineers, Automation Engineers, SREs, Enterprise Architects, Business Analysts.

Abstract

AI has become ubiquitous across enterprise discussions, strategic roadmaps, and initial use cases, yet tangible return on investment (ROI) remains elusive for many organizations. While large language models (LLMs) and LLM-based applications such as retrieval-augmented generation (RAG) supported chatbots demonstrate impressive capabilities in proofs of concept (PoCs), their limited reliability, lack of explainability, and poor reproducibility introduce significant risks that severely constrain their safe use in business-critical environments.

This article argues that AI Agents, when designed as traceable and testable systems, provide a practical path toward production-grade enterprise AI. However, this is only achievable when they are developed under an evaluation-driven development (EDD) paradigm, which extends traditional quality assurance (QA) practices to AI systems. We show how this paradigm introduces a combined AI testing and DevOps scalability challenge, and how the Step platform from Exense, proven for large-scale testing of non-AI systems for more than a decade, can serve as a robust and reusable testing backbone for AI Agent testing.

The approach is illustrated using a real-world AI Agent, TxAgent from Alpina Analytics, which is further described in a complementary article titled TxAgent: An ISO 20022 AI Agent Industrialized with Massive Parallel Testing [1].


1. Why AI Agents Are Finding Their Way into the Enterprise

Artificial intelligence (AI) is now widely adopted across enterprises; however, achieving sustainable return on investment (ROI) remains challenging [3]. Many initiatives remain at the proof-of-concept (PoC) stage. A key reason is the limited reliability of large language models (LLMs), even when equipped with retrieval-augmented generation (RAG), which frequently produce plausible but incorrect or unverified outputs and often lack explainability [4]. Such behavior is incompatible with business-critical and regulated environments.

AI Agents are increasingly seen as a way to address these limitations. An AI Agent is a system that completes a task by orchestrating language models, deterministic tools, and data sources. It can operate in a semi-autonomous or autonomous manner and can be designed as an expert agent to deliver controlled and observable outcomes in a given domain.

In enterprise settings, AI Agents enable more than conversational use cases. They can support and automate parts of business workflows, including data retrieval, analysis, and decision support. Their improved reliability comes from a hybrid architecture that combines deterministic components with semantic interpretation.

For example, when a user asks for the result of adding two apples to a basket containing one apple, the language model interprets the request, while a deterministic function performs the calculation. The final answer is then presented in natural language. In this setup, the calculation itself is deterministic, in contrast to a fully generative AI usage, where the model predicts the most probable next token, including when the input is a simple expression such as “1 + 2 =”.
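
As a minimal, purely illustrative sketch of this division of labour, the Python snippet below lets a stubbed interpretation step map the request to a structured tool call, while the arithmetic is performed by an ordinary function. The names and the hard-coded interpretation are assumptions made to keep the example self-contained; they do not correspond to any specific agent framework.

from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str   # which deterministic tool to run
    args: dict  # arguments extracted from the user's request

def add_items(current: int, added: int) -> int:
    # Deterministic component: plain arithmetic, repeatable and directly testable.
    return current + added

def interpret_request(user_message: str) -> ToolCall:
    # Semantic component: in a real agent this would be an LLM call that maps
    # free-form text such as "add two apples to my basket of one apple"
    # to a structured tool call. Hard-coded here to keep the sketch runnable.
    return ToolCall(name="add_items", args={"current": 1, "added": 2})

def answer(user_message: str) -> str:
    call = interpret_request(user_message)
    result = add_items(**call.args)  # the calculation never goes through the LLM
    return f"Your basket now contains {result} apples."

print(answer("If I add two apples to a basket with one apple, how many do I have?"))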

This separation of responsibilities reduces the risk of errors and explains why AI Agents are increasingly considered for production use in enterprise environments.


2. AI Reliability as a Testing Challenge

Defining reliability

The objective is to deploy AI Agents in production with a level of reliability that is acceptable for business-critical use. In practice, this means the agent must produce outputs that are correct, repeatable, and complete within operational time constraints, while remaining understandable to engineers, auditors, and stakeholders.

Reliability for many AI Agents typically includes the following expectations:

  • Test scenarios must cover a wide range of use cases and user intents
  • Outputs should not contain ungrounded or fabricated information
  • Results should be deterministic, or at least stable within clearly defined limits
  • Execution should complete within minutes, not hours
  • The execution path should be explainable through traceable steps and intermediate results

Meeting these expectations is not primarily a model selection issue. It is an engineering and validation issue. The agent is a system that combines prompts, LLM calls, code, tools, data access, and orchestration logic. Any change in one of these components can introduce regressions.

As a result, reliability must be continuously demonstrated through structured testing.

Requirements for a Testing Platform

A testing platform for AI Agents must support systematic evaluation and validation at scale. In practice, this is often referred to as evaluation-driven development (EDD). From a quality assurance (QA) perspective, this means being able to continuously and objectively demonstrate system reliability rather than assuming it.

Such a platform should provide:

  • A golden dataset containing validated input and expected output pairs
  • The ability to execute a large number of end-to-end test cases to cover the breadth of expected usage
  • Execution performance that supports fast iteration during development and reflects production constraints
  • Automated and flexible assessment of outputs against expectations, using methods ranging from classical assertions to LLM-based evaluation and custom validation logic
  • Storage and analysis of complete execution traces, including intermediate steps, tool calls, and relevant logs, for debugging, improvement, and audit purposes
  • Versioning of prompts, models, tools, and datasets to ensure reproducibility across runs
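
To make the first two requirements more concrete, the sketch below shows one possible shape for golden dataset entries and a loader that turns them into test inputs. The file format, field names, and the GoldenCase type are illustrative assumptions, not a prescribed Step or TxAgent format.

import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class GoldenCase:
    case_id: str          # stable identifier, used for traceability across runs
    question: str         # validated input (user intent / investigation question)
    expected_answer: str  # expert-validated reference output
    tags: list            # e.g. use case, intent category, regulatory relevance

def load_golden_dataset(path: Path) -> list:
    # One JSON object per line, e.g.:
    # {"case_id": "tx-001", "question": "...", "expected_answer": "...", "tags": ["iso20022"]}
    cases = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                cases.append(GoldenCase(**json.loads(line)))
    return cases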

3. Demonstrator: Step Meets AI Agents

Step is already deployed in mission-critical business applications at large enterprise clients in banking and insurance [2]. To demonstrate the applicability of Step as a testing platform for AI Agents, we present results from an assessment conducted by Alpina Analytics, using TxAgent as the system under test.

TxAgent is an AI Agent designed by Alpina Analytics for banking use cases. It performs autonomous queries and investigations on transactional payment data (based on the ISO 20022 standard). From a technical perspective, TxAgent is a Python application built on a LangGraph-based agent workflow. It orchestrates domain-specific components, retrieves data from graph and tabular databases, and applies semantic interpretation using language models.

TxAgent executes complete investigations from a single input to a single output, while logging key information such as queries, workflow paths, and parameters as side artifacts. The expected outputs are validated by domain experts and stored in a golden dataset, which serves as the reference for automated testing. Additional details are available in the dedicated article TxAgent: An ISO 20022 AI Agent Industrialized with Massive Parallel Testing [1].
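
For testing purposes, TxAgent’s contract can be pictured as a single call that returns the final answer together with these side artifacts. The sketch below is a hypothetical rendering of that contract; run_investigation and InvestigationResult are illustrative names, not TxAgent’s actual API.

from dataclasses import dataclass, field

@dataclass
class InvestigationResult:
    answer: str                                         # single natural-language output
    queries: list = field(default_factory=list)         # database queries issued during the run
    workflow_path: list = field(default_factory=list)   # nodes traversed in the agent workflow
    parameters: dict = field(default_factory=dict)      # model, prompt, and tool settings used

def run_investigation(question: str) -> InvestigationResult:
    # Placeholder for the real agent call; in the study this runs the full
    # LangGraph-based workflow against graph and tabular data sources.
    raise NotImplementedError("wired to the actual agent in the test environment")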


4. Limits of Local Testing

When TxAgent is executed locally, a single end-to-end test takes about 60 seconds to complete. Even with a small test set of 15 cases, a full test run requires approximately 15 minutes. While this may appear manageable during early prototyping, it quickly becomes a constraint during active development.

AI Agent behavior is highly sensitive to changes in prompts, orchestration logic, and tool interfaces. Small modifications can have unexpected effects on reasoning paths and outputs. For this reason, a reliable development workflow requires frequent re-execution of the full test suite to detect regressions early and avoid costly rollbacks.

This creates a structural tension between development speed and reliability (see Figure 1). If testing is performed infrequently to preserve development time, regressions are detected late and risk increases (left-hand illustration). If testing is performed frequently, local execution time starts to dominate the workday and slows down progress (right-hand illustration). Even a modest test suite with a runtime of around 15 minutes can significantly limit the number of feedback cycles an engineer can run in a day.

Figure 1 — Impact of testing time on development pattern

As test coverage grows from a few cases to dozens or hundreds, this trade-off becomes increasingly unfavorable and often leads to reduced test frequency.

When testing becomes too slow, it is performed less frequently, increasing development risk.

The issue is not the correctness of the tests themselves, but the inability to execute them fast and often enough. Developing and validating complex AI Agents therefore requires a dedicated test infrastructure that can scale independently of local development environments.

At this stage, reliability becomes a DevOps concern: feedback speed, automation, and execution scalability directly determine how quickly changes can be integrated and how safely AI Agents can be deployed.


5. Platform Testing with Step

In this study, the full execution of the AI Agent is hosted on the Step platform as part of the test runs (Figure 2). This is a deliberate design choice that uses Step not only as a testing and orchestration platform, but also as the execution environment for the agent itself. By combining hosting and testing within the same platform, this setup simplifies integration and provides tighter control over execution, parallelization, and traceability.

As a result, the measured execution times include additional overhead related to container management and remote execution. This overhead would not be present in scenarios where Step is used solely to orchestrate tests against externally hosted systems.

Figure 2 — Step as the testing platform for an AI Agent (TxAgent)

The workflow separates local development from test execution. The AI engineer continues to work in a familiar local environment, while Step provides the infrastructure required to execute tests at scale.

The workflow can be summarized as follows:

  • The AI engineer develops agent logic and prompts in a local development environment.
  • Business experts and users validate and maintain the golden dataset.
  • Test cases are derived from the golden dataset using a testing framework (e.g. pytest).
  • The AI engineer triggers remote test execution using the Step client.
  • A container image encapsulating the TxAgent code, dependencies, golden dataset, and configuration is built and pushed.
  • Test scenarios derived from the golden dataset are executed remotely and in parallel on the Step platform. Each test stimulates TxAgent with a predefined investigation question and input data, executes the full agent workflow, and applies automated validation combining classical assertions with LLM-as-a-judge techniques to verify content, facts, and evidence beyond wording variations.
  • Logs and execution traces are collected centrally.
  • Results are retrieved in the local development environment.
  • The engineer investigates results and performs debugging based on the collected traces.
  • The development cycle continues.
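
The sketch below outlines, under the same illustrative assumptions as the earlier snippets, how test cases might be derived from the golden dataset with pytest and validated by combining classical assertions with a placeholder for an LLM-as-a-judge check. The judge_semantic_match heuristic stands in for a real language model call, and the module and file names are assumed.

from pathlib import Path

import pytest

# load_golden_dataset and run_investigation as sketched earlier; the module name is assumed
from agent_test_support import load_golden_dataset, run_investigation

GOLDEN_PATH = Path("tests/data/golden_dataset.jsonl")  # assumed location

def judge_semantic_match(expected: str, actual: str) -> bool:
    # Stand-in for an LLM-as-a-judge call that verifies content, facts, and evidence
    # beyond wording variations; reduced here to a naive term-overlap heuristic.
    expected_terms = set(expected.lower().split())
    actual_terms = set(actual.lower().split())
    return len(expected_terms & actual_terms) >= 0.5 * max(len(expected_terms), 1)

@pytest.mark.parametrize("case", load_golden_dataset(GOLDEN_PATH), ids=lambda c: c.case_id)
def test_investigation_matches_golden_case(case):
    result = run_investigation(case.question)
    # Classical assertions on structure and traceability
    assert result.answer, "agent returned an empty answer"
    assert result.workflow_path, "no workflow trace was recorded"
    # Semantic validation against the expert-validated reference answer
    assert judge_semantic_match(case.expected_answer, result.answer)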
Note that the pipeline chosen in this study differs from a classical application delivery pipeline (Figure 3). While the operation of the system under test and its validation are typically treated as separate, post-deployment phases, Step enables an integrated pipeline in which execution and testing are combined into a single, scalable phase.

Figure 3 — DevOps pipeline options

In both cases, decoupling test execution capacity from local resources enables frequent and reliable validation without slowing development, as motivated earlier and quantified hereafter.


6. Experiment

The figure below (Figure 4) presents the results of a dedicated testing campaign designed to assess, from an AI engineer perspective, the benefits of using Step to accelerate test execution, in line with the motivations discussed earlier.

Two test configurations are compared. The upper graph represents a small test set consisting of 10 test scenarios, defined as question and expected answer pairs. The lower graph represents a larger test set consisting of 100 scenarios.

The x-axis represents the level of parallelism, which corresponds to the product of the number of workers per execution instance and the number of execution instances.

Figure 4 — Comparison of local testing and Step-based testing for a small test set of 10 scenarios (top) and a larger test set of 100 scenarios (bottom).

Local Execution

In a local setup, parallelism is limited to the number of workers running on a single machine, typically a developer workstation. Parallel test execution can be achieved using standard tooling (e.g., pytest-xdist with pytest), which distributes test cases across multiple worker processes:

pytest tests/ -n N_WORKERS

This approach is constrained by local hardware resources and cannot scale beyond a single execution instance.

Remote Execution with Step

With Step, parallelism can be increased along two dimensions. In addition to a configurable number of workers per container, Step allows multiple execution instances to run in parallel. The number of instances is defined in the Step test plan.

This enables horizontal scaling that is not possible in a local setup.
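
As a rough way to reason about these two dimensions, the back-of-the-envelope model below estimates total runtime from the number of scenarios, the average duration per scenario (around 60 seconds in this study), and the product of workers per instance and execution instances. The parallelism values used in the example are illustrative, and the model deliberately ignores the container build, provisioning, and result retrieval overhead that the measured results in the next section do include.

import math

def estimated_runtime_minutes(n_scenarios: int,
                              seconds_per_scenario: float,
                              workers_per_instance: int,
                              instances: int) -> float:
    # Idealised model: scenarios are spread evenly across all workers and
    # fixed overheads (image build, provisioning, result retrieval) are ignored.
    parallelism = workers_per_instance * instances
    waves = math.ceil(n_scenarios / parallelism)  # number of "rounds" of tests needed
    return waves * seconds_per_scenario / 60

# Local workstation: a single execution instance with a handful of workers
print(estimated_runtime_minutes(100, 60, workers_per_instance=4, instances=1))  # ~25 min
# Step: the same workers per container, spread over several execution instances
print(estimated_runtime_minutes(100, 60, workers_per_instance=4, instances=5))  # ~5 min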

Results and Interpretation

The evaluated test sets consist of 10 and 100 scenarios, respectively. While these numbers are modest compared to traditional non-AI applications, each scenario represents a complete, end-to-end AI Agent investigation involving multiple tool calls, data retrieval steps, and language model interactions. The selected test sizes reflect a realistic stage in the development of an industrial AI Agent, where test cases are few but highly heterogeneous and validated by domain experts.

For the small test set consisting of 10 scenarios, local execution remains competitive. This is primarily due to the overhead associated with remote execution, including container build and push, infrastructure provisioning, and result retrieval. At this scale, the benefits of distributed execution do not yet outweigh the setup costs.

The situation changes when moving to the larger test set of 100 scenarios. In this configuration, the advantages of using Step become clear. Parallel execution across multiple instances significantly reduces total execution time compared to the fastest achievable local run. As detailed in Appendix A, the observed execution time was reduced by approximately a factor of 2 under the tested conditions.

This reduction in execution time directly translates into higher testing throughput. As shown in Appendix A, the system was able to process approximately 0.4 scenarios per second, corresponding to about 24 scenarios per minute, using a moderately sized execution cluster. These values are indicative and depend on the chosen level of parallelism, infrastructure configuration, and the characteristics of the test scenarios.

Importantly, this evaluation is not intended as a best-case benchmarking study designed to maximize speed-up. Instead, it demonstrates the applicability of Step in a realistic AI Agent development workflow with limited but diverse test coverage. In this study, the full execution of the AI Agent is hosted on the Step platform, rather than limiting Step’s role to parallel test orchestration for externally hosted systems. As test suites grow in size and complexity, higher throughput can be achieved by scaling the number of execution instances and by further optimizing both the agent implementation and the test setup. In cloud-based environments, such scaling is not inherently limited, and the relative advantages of distributed execution are therefore expected to increase over time.

Beyond Execution Time

While local execution may remain appropriate for very small test sets, execution time is not the only decision factor. Centralized execution on Step reduces the load on local development machines and allows engineers to focus on development rather than resource management. It also provides centralized validation, logging, and trace collection, which simplifies debugging, analysis, and audit activities. In addition, a shared execution platform improves collaboration in multi-developer environments by ensuring that tests are executed under consistent conditions across teams. Taken together, these factors make Step a suitable platform for industrial-scale testing of AI Agents, particularly as test coverage and team size grow.


7. From Experimentation to Industrialization

AI Agents provide a practical path for moving enterprise AI beyond proof-of-concept deployments and into critical business workflows. By combining language models with deterministic components and structured orchestration, they improve control and predictability. However, architecture alone is not sufficient. Reliability must be continuously demonstrated through systematic testing and traceability.

As AI Agents increase in complexity and test coverage grows, local testing becomes a limiting factor. Execution time, constrained parallelism, and the need for frequent validation introduce structural bottlenecks. Addressing these constraints requires testing capabilities that scale independently of local development environments and support evaluation-driven development practices.

This article has demonstrated, using the TxAgent use case, how Step can support the transition from experimentation to industrialization by enabling scalable, end-to-end testing of AI Agents. Step makes it possible to execute large regression suites in parallel, apply flexible evaluation logic, and collect centralized execution traces across runs. This approach allows teams to treat reliability as an engineering property that can be measured, monitored, and improved over time, rather than as an assumption based on limited testing or isolated demonstrations.


8. Next Steps for Testing AI Agents

AI engineering teams developing AI Agents for critical business applications should assess whether their current testing setups can support increasing test volumes and frequent validation cycles. When test execution time starts to constrain development, introducing a scalable test platform becomes a technical necessity rather than an optimization.

Step can be integrated into existing workflows to offload test execution, increase feedback frequency, and provide consistent execution conditions across teams. Applying these practices early helps reduce risk and supports the controlled industrialization of AI Agents.

Practical guidance on applying this approach to testing AI Agents with Step, including sample configurations, is available on request.

Please contact contact@exense.ch for more information.


Appendix A. Calculation of Gain and Throughput

This appendix details the calculations used to quantify the performance benefits observed when executing tests on Step compared to local execution.

Execution Time Gain

The execution time gain is defined as the ratio between the minimum execution time measured during local execution and the minimum execution time measured during execution on Step:

gain = T_min(local) / T_min(Step)

Based on the observed results (Figure 4):

gain ≈ 2

This indicates that, under the tested conditions, the total test execution time is reduced by approximately a factor of 2 when using Step.

Throughput

Throughput is defined as the number of test scenarios executed per unit of time during the fastest observed execution on Step:

throughput = N_scenarios / T_min(Step)

For the tested configuration:

throughput ≈ 0.4 scenarios per second

This corresponds to approximately 24 test scenarios per minute.


References

[1] https://medium.com/txagent/txagent-an-iso-20022-ai-agent-industrialized-with-massive-parallel-testing-605d4f243e1e
[2] https://step.dev/blog/whitepaper-unified-testing/
[3] https://www.deloitte.com/global/en/issues/generative-ai/ai-roi-the-paradox-of-rising-investment-and-elusive-returns.html
[4] https://www.mckinsey.com/capabilities/quantumblack/our-insights/building-ai-trust-the-key-role-of-explainability
