
Testing AI-Infused Applications: Strategies for Reliable Automation

AI is transforming the software landscape, with many organizations integrating AI-driven workflows directly into their applications or exposing their functionality to external, AI-powered processes. This evolution brings new and unique challenges for automated testing. Large language models (LLMs), for example, inherently produce nondeterministic outputs, which complicate traditional testing methods that rely on predictable results matching specific expectations. Verifying LLM-based systems on every test run also means calling these models again and again, and if the LLM is provided by a third party, costs can quickly escalate. Additionally, new protocols such as the Model Context Protocol (MCP) and Agent2Agent (A2A) are being adopted, enabling LLMs to gain richer context and execute actions, while agentic systems coordinate between different agents in the environment. What strategies can teams adopt to ensure reliable and effective testing of these new, AI-infused applications in the face of such complexity and unpredictability?

Real-World Examples and Core Challenges

Let me share some real-world examples from our work at Parasoft that highlight the challenges of testing AI-infused applications. For instance, we integrated an AI Assistant into SOAtest and Virtualize, allowing users to ask questions about product functionality or create test scenarios and virtual services using natural language. The AI Assistant relies on external LLMs accessed via OpenAI-compatible REST APIs to generate responses and build scenarios, all within a chat-based interface that supports follow-up instructions from users.

When developing automated tests for this feature, we encountered a significant challenge: the LLM’s output was nondeterministic. The responses presented in the chat interface varied each time, even when the underlying meaning was similar. For example, when asked how to use a particular product feature, the AI Assistant would provide slightly different answers on each occasion, making exact-match verification in automated tests impractical.

Another example is the CVE Match feature in Parasoft DTP, which helps users prioritize which static analysis violations to address by comparing code with reported violations to code with known CVE vulnerabilities. This functionality uses LLM embeddings to score similarity. Automated testing for this feature can become expensive when using a third-party external LLM, as each test run triggers repeated calls to the embeddings endpoint.
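To make the cost concern concrete, here is a minimal sketch of the kind of embeddings-based similarity check involved. It is not Parasoft's actual implementation; the endpoint URL, model name, and code snippets are assumptions for illustration, but it shows why every test run that exercises this path pays for calls to an embeddings endpoint.

```python
import math
import os
import requests

# Hypothetical OpenAI-compatible embeddings endpoint; the URL, model name,
# and API key handling are illustrative assumptions, not the product's
# actual implementation.
EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"
API_KEY = os.environ["OPENAI_API_KEY"]

def embed(text: str) -> list[float]:
    """Request an embedding vector for a snippet of code or text."""
    resp = requests.post(
        EMBEDDINGS_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "text-embedding-3-small", "input": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Score how similar two embedding vectors are (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Compare code with a reported violation against code with a known CVE.
violation_snippet = "strcpy(dest, user_input);"
cve_snippet = "strcpy(buffer, argv[1]);"
score = cosine_similarity(embed(violation_snippet), embed(cve_snippet))
print(f"similarity: {score:.3f}")
```

Every call to embed() in a test run is a billable API request, which is what makes repeatedly exercising this feature against a live third-party LLM expensive.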

Designing Automated Tests for LLM-Based Applications

These challenges can be addressed by creating two distinct types of test scenarios:

  1. Test Scenarios Focused on Core Application Logic
    The primary test scenarios should concentrate on the application’s core functionality and behavior rather than on the unpredictable output of LLMs. Service virtualization is invaluable in this context. Service mocks can be created to simulate the behavior of the LLM, allowing the application to connect to the mock LLM service instead of the live model. These mocks can be configured with a variety of expected responses for different requests, keeping test executions stable and repeatable while still covering a wide range of scenarios.
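As an illustration of the idea, here is a minimal mock of an OpenAI-compatible chat completions endpoint written with Python's standard library. A tool like Parasoft Virtualize provides this capability without hand-written code; the canned responses and matching logic below are purely illustrative.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Canned responses keyed by the last user message; purely illustrative.
CANNED_RESPONSES = {
    "How do I create a REST API test scenario?":
        "Open the Test Scenario wizard and choose REST API as the source.",
}
DEFAULT_RESPONSE = "I'm sorry, I don't have an answer for that."

class MockLLMHandler(BaseHTTPRequestHandler):
    """Simulates an OpenAI-compatible chat completions endpoint."""

    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        last_user_message = body["messages"][-1]["content"]
        answer = CANNED_RESPONSES.get(last_user_message, DEFAULT_RESPONSE)

        reply = {
            "object": "chat.completion",
            "choices": [{"index": 0,
                         "message": {"role": "assistant", "content": answer},
                         "finish_reason": "stop"}],
        }
        payload = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # Point the application under test at http://localhost:8080 instead of the live LLM.
    HTTPServer(("localhost", 8080), MockLLMHandler).serve_forever()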

However, a new challenge arises with this approach: maintaining LLM mocks can become labor-intensive as the application and test scenarios evolve. For example, prompts sent to the LLM may change when the application is updated, or new prompts may need to be handled for additional test scenarios. A service virtualization learning mode proxy offers an effective solution. The proxy routes each request to either the mock service or the live LLM, depending on whether it has seen that request before. Known requests are sent directly to the mock service, avoiding unnecessary LLM calls. New requests are forwarded to the LLM, and the resulting output is captured and added to the mock service for future use. Parasoft development teams have been using this strategy to stabilize tests with mocked responses, keep the mocks up to date as the application changes or new test scenarios are added, and reduce LLM usage and its associated costs.
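A simplified sketch of a learning-mode proxy might look like the following. The recording format, request fingerprinting, and URLs are assumptions for illustration only; commercial virtualization tools match requests far more flexibly than an exact hash of the request body.

```python
import hashlib
import json
import os
import pathlib
import requests
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative learning-mode proxy: requests seen before are answered from a
# local recording file; new requests are forwarded to the live LLM and their
# responses recorded for future runs. The URL, API key handling, and file
# location are assumptions for this sketch.
LIVE_LLM_URL = "https://api.openai.com/v1/chat/completions"
API_KEY = os.environ.get("OPENAI_API_KEY", "")
RECORDINGS = pathlib.Path("llm_recordings.json")

def load_recordings() -> dict:
    return json.loads(RECORDINGS.read_text()) if RECORDINGS.exists() else {}

def save_recording(key: str, response_body: str) -> None:
    recordings = load_recordings()
    recordings[key] = response_body
    RECORDINGS.write_text(json.dumps(recordings, indent=2))

class LearningProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        key = hashlib.sha256(body).hexdigest()  # fingerprint of this request
        recordings = load_recordings()

        if key in recordings:
            # Known request: serve the recorded response, no LLM call.
            response_body = recordings[key]
        else:
            # New request: forward to the live LLM, then record the answer.
            live = requests.post(
                LIVE_LLM_URL,
                headers={"Authorization": f"Bearer {API_KEY}",
                         "Content-Type": "application/json"},
                data=body,
                timeout=60,
            )
            response_body = live.text
            save_recording(key, response_body)

        payload = response_body.encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # The application under test points at this proxy instead of the live LLM.
    HTTPServer(("localhost", 8081), LearningProxyHandler).serve_forever()
```

Because matching here is an exact hash of the request body, even a whitespace change in a prompt would trigger a live call; real learning proxies typically apply more tolerant matching rules.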

  2. End-to-End Tests that Include the LLM
    While mock services are valuable for isolating business logic, achieving full confidence in AI-infused applications requires end-to-end tests that interact with the actual LLM. The main challenge here is the nondeterministic nature of LLM outputs. To address this, teams can use an “LLM judge”: an LLM-based testing tool that evaluates whether the application’s output semantically matches the expected result. The judge LLM is given both the application’s output and a natural language description of the expected behavior, and it determines whether the content is correct even when the wording varies. Validation scenarios can implement this by sending prompts to an LLM via its REST API, or by using specialized testing tools like SOAtest’s AI Assertor.
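Here is a rough sketch of what an LLM judge check can look like when implemented directly against a chat completions API rather than with a dedicated tool like the AI Assertor. The endpoint, model name, and prompt wording are assumptions for illustration.

```python
import os
import requests

# Minimal sketch of an "LLM judge" assertion; the endpoint, model, and prompt
# are illustrative assumptions, not SOAtest's AI Assertor itself.
CHAT_URL = "https://api.openai.com/v1/chat/completions"
API_KEY = os.environ["OPENAI_API_KEY"]

def llm_judge(actual_output: str, expected_behavior: str) -> bool:
    """Ask an LLM whether the output semantically matches the expectation."""
    prompt = (
        "You are verifying an automated test.\n"
        f"Expected behavior: {expected_behavior}\n"
        f"Actual output: {actual_output}\n"
        "Answer with exactly PASS or FAIL."
    )
    resp = requests.post(
        CHAT_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "gpt-4o-mini",
              "temperature": 0,
              "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    verdict = resp.json()["choices"][0]["message"]["content"].strip()
    return verdict.upper().startswith("PASS")

# Example usage inside a test step: the wording of the reply may vary between
# runs, but the judge checks the meaning rather than the exact text.
assert llm_judge(
    actual_output="To add a REST client, open the scenario and click 'Add REST Client'.",
    expected_behavior="The assistant explains how to add a REST client to a scenario.",
)
```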

End-to-end test scenarios also face difficulties when extracting data from nondeterministic outputs for use in subsequent test steps. Traditional extractors, such as XPath or attribute-based locators, may struggle with changing output structures. LLMs can be used within test scenarios here as well: by sending prompts to an LLM’s REST API or using UI-based tools like SOAtest’s AI Data Bank, test scenarios can reliably identify and store the correct values, even as outputs change.
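A similar sketch shows LLM-based extraction: instead of a brittle XPath, the test asks the model for the one value it needs and stores it for later steps. Again, the endpoint, model, and example output are assumptions rather than SOAtest's actual AI Data Bank mechanism.

```python
import os
import requests

# Sketch of using an LLM to pull a specific value out of nondeterministic
# output so later test steps can reuse it; endpoint and model are assumptions.
CHAT_URL = "https://api.openai.com/v1/chat/completions"
API_KEY = os.environ["OPENAI_API_KEY"]

def extract_value(output_text: str, description: str) -> str:
    """Ask the LLM for just the value described, returned as plain text."""
    prompt = (
        f"From the following output, extract {description}. "
        "Reply with only the value and nothing else.\n\n"
        f"{output_text}"
    )
    resp = requests.post(
        CHAT_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "gpt-4o-mini",
              "temperature": 0,
              "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

# The assistant's wording varies between runs, but the scenario name is stable.
chat_reply = "Sure! I've created a test scenario named 'Checkout API smoke test' for you."
scenario_name = extract_value(chat_reply, "the name of the test scenario that was created")
# scenario_name can now be stored and reused by subsequent test steps.
```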

Testing in the Evolving AI Landscape: MCP and Agent2Agent

As AI evolves, new protocols like Model Context Protocol (MCP) are emerging. MCP enables applications to provide additional data and functionality to large language models (LLMs), supporting richer workflows—whether user-driven via interfaces like GitHub Copilot or autonomous via AI agents. Applications may offer MCP tools for external workflows to leverage or rely on LLM-based systems that require MCP tools. MCP servers function like APIs, accepting arguments and returning outputs, and must be validated to ensure reliability. Automated testing tools, such as Parasoft SOAtest, help verify MCP servers as applications evolve.
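As a sketch of what such a check can look like outside a dedicated tool, the test below calls a hypothetical MCP server over HTTP using the protocol's JSON-RPC tools/call request and asserts on the result. The URL and tool name are invented for illustration, and a real client would also perform the MCP initialize handshake (or simply use an MCP client SDK).

```python
import requests

# Hypothetical MCP server exposed over HTTP; the URL and tool name are
# assumptions. The request shape follows MCP's JSON-RPC "tools/call" method,
# but a production test would handle the initialize handshake and streaming
# responses as well.
MCP_URL = "http://localhost:9000/mcp"

def call_tool(name: str, arguments: dict) -> dict:
    request = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    resp = requests.post(MCP_URL, json=request, timeout=30)
    resp.raise_for_status()
    return resp.json()

def test_lookup_order_tool():
    """Verify the MCP tool returns the expected content for a known input."""
    response = call_tool("lookup_order", {"order_id": "12345"})
    assert "error" not in response, response.get("error")
    content = response["result"]["content"]
    assert any("12345" in part.get("text", "") for part in content)
```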

When applications and test scenarios depend on external MCP servers, those servers may be unavailable, under development, or costly to access. Service virtualization is valuable for mocking MCP servers, providing reliable and cost-effective test environments. Tools like Parasoft Virtualize support creating these mocks, enabling testing of LLM-based workflows that rely on external MCP servers.
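In the same spirit, a hand-rolled stand-in for an external MCP server might look like the sketch below, returning canned tool results keyed by tool name. The tool names and payloads are assumptions, and a real mock would also need to answer initialize and tools/list requests, which is where a dedicated virtualization tool earns its keep.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative mock of an external MCP server (a stand-in for what a tool like
# Parasoft Virtualize would provide). It answers "tools/call" requests with
# canned results keyed by tool name.
CANNED_TOOL_RESULTS = {
    "lookup_order": {"content": [{"type": "text", "text": "Order 12345: shipped"}],
                     "isError": False},
}
UNKNOWN_TOOL = {"content": [{"type": "text", "text": "unknown tool"}], "isError": True}

class MockMCPHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        request = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        tool_name = request.get("params", {}).get("name", "")
        result = CANNED_TOOL_RESULTS.get(tool_name, UNKNOWN_TOOL)
        reply = {"jsonrpc": "2.0", "id": request.get("id"), "result": result}

        payload = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # LLM-based workflows under test point here instead of the real MCP server.
    HTTPServer(("localhost", 9000), MockMCPHandler).serve_forever()
```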

For teams building AI agents that interact with other agents, the Agent2Agent (A2A) protocol offers a standardized way for agents to communicate and collaborate. A2A supports multiple protocol bindings (JSON-RPC, gRPC, HTTP+JSON/REST) and operates like a traditional API with inputs and outputs. Applications may provide A2A endpoints or interact with agents over A2A, and all related workflows require thorough testing. Similar to MCP use cases, Parasoft SOAtest can test agent behaviors against various inputs, while Parasoft Virtualize can mock third-party agents, ensuring control and stability in automated tests.
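For completeness, here is a rough sketch of driving a hypothetical agent through A2A's JSON-RPC binding and asserting on stable facts in the reply. The endpoint URL is invented, and the message/send method and message fields follow my reading of the A2A specification, so they should be checked against the protocol version your agent actually implements.

```python
import uuid
import requests

# Hypothetical agent endpoint; the URL is an assumption, and the request shape
# follows the A2A JSON-RPC binding's "message/send" method as I understand it.
AGENT_URL = "http://localhost:7000/a2a"

def send_message(text: str) -> dict:
    request = {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": text}],
                "messageId": str(uuid.uuid4()),
            }
        },
    }
    resp = requests.post(AGENT_URL, json=request, timeout=60)
    resp.raise_for_status()
    return resp.json()

def test_agent_answers_order_status():
    """The agent's wording varies between runs, so assert on stable facts
    (or hand the reply to an LLM judge as described earlier)."""
    response = send_message("What is the status of order 12345?")
    assert "error" not in response, response.get("error")
    assert "12345" in str(response["result"])
```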

Conclusion

As AI continues to reshape the software landscape, testing strategies must evolve to address the unique challenges of LLM-driven and agent-based workflows. By combining advanced testing tools, service virtualization, learning proxies, techniques to handle nondeterministic outputs, and testing of MCP and A2A endpoints, teams can ensure their applications remain robust and reliable—even as the underlying AI models and integrations change. Embracing these modern testing practices not only stabilizes development and reduces risk, but also empowers organizations to innovate confidently in an era where AI is moving to the core of application functionality.

