Why Traditional Testing Fails for AI Systems - Dusanka Lecic

Categories: Podcasts, Software Testing Unleashed

May 28, 2026

Chatbot testing faces challenges like non-determinism and user-centric issues, requiring frameworks like C-H-A-T to manage context, hallucinations, and relevance while emphasizing manual exploration and traceability. Hybrid testing methods combining manual and automated approaches are critical for addressing invisible bugs, edge cases, and evolving tooling limitations in chatbot development.

Software Testing Unleashed

Software Testing Unleashed is hosted by Richard Seidl and features a different guest in each episode. The official show notes contain a comprehensive overview of the episode. Released as audio and video.

Episode Details

Show Notes: https://www.richard-seidl.com/en/podcast/testing-chatbots-invisible-bugs
Published: 2026-05-28T04:00:00Z
Duration: 00:24:32
Author: Richard Seidl | Software Development & Testing Expert

Overview

Chatbot testing presents challenges distinct from traditional software testing. The first is non-determinism: a chatbot may produce varying outputs for the same input, which complicates the definition of pass/fail outcomes. Testing prioritizes user behavior, such as typos or frustrated phrasing, over strict functional correctness, and relies on manual exploration and query analysis to uncover subtle issues like relevance, accuracy, and user frustration points. A key challenge is the lack of robust testing tools, necessitating specialized solutions. Testing strategies emphasize chunking along semantic boundaries, checking retrieval logic and not only the final response, and preserving context in multi-turn conversations to avoid repetition or hallucinations (falsified or misleading answers). The proposed C-H-A-T framework focuses on context retention, hallucination control, accuracy/relevance, and structured testing workflows to trace and retest issues effectively.
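One way to make pass/fail meaningful under non-determinism is to ask the same question several times and assert on properties of the answers instead of exact strings. The sketch below is illustrative only: `check_consistency`, `fake_bot`, and the required-term check are assumptions made for this example, not something described in the episode, and `fake_bot` merely stands in for a real chatbot call.

```python
import random
from typing import Callable, List


def check_consistency(ask: Callable[[str], str], query: str,
                      required_terms: List[str], runs: int = 5) -> float:
    """Return the fraction of runs whose answer contains every required term."""
    passed = 0
    for _ in range(runs):
        answer = ask(query).lower()
        if all(term.lower() in answer for term in required_terms):
            passed += 1
    return passed / runs


# Hypothetical stand-in for a real chatbot: the wording varies from call to call.
def fake_bot(query: str, _rng=random.Random(0)) -> str:
    templates = [
        "You can reset your password from the account settings page.",
        "Open account settings and choose 'reset password'.",
        "Password reset is available under account settings.",
    ]
    return _rng.choice(templates)


# The query contains a typo on purpose; the check looks at meaning-bearing terms only.
rate = check_consistency(fake_bot, "how do i reset my pasword",
                         ["password", "settings"])
print(f"pass rate: {rate:.0%}")
```

A pass-rate threshold (for example, requiring every run to pass) then replaces the single expected output used in traditional tests. A keyword check is a crude proxy for relevance, so it complements manual review and does not replace it.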

The discussion highlights the need for a hybrid testing approach combining manual and automated methods: manual testing ensures deep understanding of queries and responses, while automation streamlines documentation and test scenario creation, though current tools remain limited. Creating repeatable test scenarios, including edge cases like typos, is vital, but documenting workflows and keeping results clear remain difficult. Bugs in chatbots are often “invisible,” arising from retrieval errors or malformed prompts rather than code flaws, making them harder to detect. Traceability through logging queries, responses, and retrieval chunks is critical for debugging and retraining models based on user feedback. Retraining and fallback mechanisms, such as asking users for clarification, are essential to address persistent issues and improve user experience. Future advancements in integrated testing suites may address current tooling gaps, but challenges like infrastructure limitations and the complexity of chatbot logic are likely to persist as the field evolves.
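The fallback idea mentioned above (asking the user for clarification) can be sketched as a guard around retrieval. Everything here is an assumption for illustration: the `Chunk` type, the similarity score in the range 0 to 1, the `min_score` threshold of 0.5, and the wording of the clarification message are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    score: float  # retrieval similarity, assumed to lie in [0, 1]


CLARIFY_MESSAGE = "I'm not sure I understood. Could you rephrase or add more detail?"


def answer_or_clarify(chunks: List[Chunk], min_score: float = 0.5) -> str:
    """Answer from the best chunk, or ask for clarification if retrieval is weak."""
    if not chunks:
        return CLARIFY_MESSAGE
    best = max(chunks, key=lambda c: c.score)
    if best.score < min_score:
        # Weak retrieval: ask instead of risking a hallucinated answer.
        return CLARIFY_MESSAGE
    return best.text


print(answer_or_clarify([Chunk("Refunds take 5 days.", 0.82)]))
print(answer_or_clarify([Chunk("Unrelated text.", 0.12)]))
```

In a test suite, both branches are worth covering: a query that should be answered and one that should trigger the clarification request.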

What If

What if you built a hybrid test scenario generator using AI to automate edge case creation while manually validating hallucination hotspots?
- Move: Create a semi-automated test suite that uses AI (e.g., chatbot training data) to generate typo-laden or ambiguous queries, then manually validate outputs for hallucination or context loss.
- Why Now?: Current tools lack robust automation for user-frustration scenarios, and manual testing is time-consuming. This approach balances speed and human oversight.
- Expected Upside: Rapid identification of hallucination-prone patterns in responses, reducing context-switching errors and improving user trust.
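A minimal sketch of the generation half of this idea follows. It uses a simple rule-based typo generator as a stand-in for an AI-driven one, so it is a swapped-in technique and not the approach named above; the function name `typo_variants` and the three edit operations are assumptions for this example. The fixed seed keeps the scenario repeatable, and the manual validation step still happens outside the code.

```python
import random
from typing import List


def typo_variants(query: str, count: int = 5, seed: int = 42) -> List[str]:
    """Generate typo-laden variants by swapping, dropping, or doubling one character."""
    if len(query) < 2:
        raise ValueError("query must contain at least two characters")
    rng = random.Random(seed)  # fixed seed keeps the generated scenario repeatable
    variants = []
    for _ in range(count):
        chars = list(query)
        i = rng.randrange(len(chars) - 1)
        op = rng.choice(["swap", "drop", "double"])
        if op == "swap":
            # May leave the text unchanged if the two neighbours are identical.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "drop":
            del chars[i]
        else:
            chars.insert(i, chars[i])
        variants.append("".join(chars))
    return variants


for variant in typo_variants("reset my password"):
    print(variant)
```

Each variant would be sent to the chatbot and the responses reviewed by hand for hallucination or context loss.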
What if you implemented a lightweight C-H-A-T framework to track context retention and retrieval accuracy during testing cycles?
- Move: Design a logging system that captures query history, retrieval chunks, and response accuracy for each interaction, flagging deviations from context or relevance thresholds.
- Why Now?: Existing tools focus on code-level bugs, but chatbot failures often stem from retrieval errors. This prioritizes the “A” (accuracy) and “T” (traceability) pillars.
- Expected Upside: Faster root-cause analysis for retrieval issues, enabling targeted retraining or prompt adjustments to reduce misinformation.
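The logging system described here could look roughly like the sketch below. The record fields, the flag names `no_retrieval` and `low_relevance`, and the 0.5 threshold are hypothetical choices for illustration, not part of the C-H-A-T framework as presented in the episode.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TraceRecord:
    turn: int                  # position in the multi-turn conversation
    query: str
    response: str
    chunk_ids: List[str]       # identifiers of the retrieved chunks
    chunk_scores: List[float]  # retrieval scores, assumed to lie in [0, 1]
    flags: List[str] = field(default_factory=list)


def flag_record(record: TraceRecord, min_score: float = 0.5) -> TraceRecord:
    """Attach a flag when retrieval returned nothing or only weak matches."""
    if not record.chunk_scores:
        record.flags.append("no_retrieval")
    elif max(record.chunk_scores) < min_score:
        record.flags.append("low_relevance")
    return record


def to_json_line(record: TraceRecord) -> str:
    """Serialize one record as a JSON line suitable for an append-only log."""
    return json.dumps(asdict(record))


record = flag_record(TraceRecord(
    turn=2,
    query="and what about refunds?",
    response="Refunds take 5 days.",
    chunk_ids=["faq-17"],
    chunk_scores=[0.31],
))
print(to_json_line(record))
```

Storing one JSON line per interaction keeps the query, the response, and the retrieved chunks together, so a flagged answer can be traced back to the retrieval step and retested after a fix.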
What if you stress-tested your chatbot with a curated dataset of extreme user behavior (e.g., 100 variants of typos, slang, or abrupt topic shifts)?
- Move: Compile a dataset of 500+ user inputs mimicking frustration (e.g., “I dunno, help!” or “This is so stupid, fix it!”), then map error rates to retrieval logic and prompt structure.
- Why Now?: Non-determinism and user behavior are core pain points, yet few tools simulate realistic stress scenarios. This exposes weaknesses in chunking strategies.
- Expected Upside: Identifying fragile retrieval patterns early, leading to a 20-30% reduction in user-reported errors due to improved chunk scoring and prompt tuning.
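Mapping error rates to input categories can be sketched as a small aggregation over test results. The category labels and the sample results below are made up for illustration; in practice each result would come from a manually or automatically judged chatbot response.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the failure rate per input category from (category, passed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        if not passed:
            failures[category] += 1
    return {category: failures[category] / totals[category] for category in totals}


# Hypothetical judged results from a stress-test run.
sample = [
    ("typo", True),
    ("typo", False),
    ("slang", False),
    ("topic_shift", True),
]
print(error_rates(sample))
```

Categories with a high failure rate point to where chunking or prompt structure deserves a closer look first.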

Takeaway

Conduct manual exploration testing to identify user frustration points, typos, and edge cases by simulating real-world input variations, ensuring responses remain accurate and relevant.
Implement a hybrid testing approach combining manual evaluation for deep query analysis with automation tools to streamline documentation and scenario creation.
Adopt the C-H-A-T framework (Context, Hallucination control, Accuracy, Traceability) as a structured method to validate retrieval logic, maintain conversation continuity, and trace bugs systematically.
Create repeatable test scenarios with positive and negative cases (e.g., typos, ambiguous queries) to stress-test retrieval systems and uncover hidden issues in response generation.
Enable comprehensive logging of all queries, responses, and retrieved chunks to trace the root cause of errors and refine model training based on real user interactions.
