Episode 97 — Test AI controls with evidence, not opinions or vendor demos (Domain 3B)
In this episode, we focus on a principle that makes an audit credible: you test controls with evidence, not opinions and not vendor demos. When you are brand new to cybersecurity and auditing, it is easy to be impressed by a confident walkthrough where a vendor representative clicks through a polished interface and shows you the best-case scenario. Demos can be useful for understanding what a system is supposed to do, but they are not evidence that controls are working in the real environment, under real conditions, with real data and real users. A I systems make this challenge sharper because many controls are behavioral, such as guardrails against prompt abuse, and behavior is easy to curate in a demonstration. Teams inside the organization can also unintentionally rely on opinions, such as we think the model is safe or we trust our vendor, instead of gathering proof. Your goal in this lesson is to learn how to approach A I control testing in a way that is fair but rigorous, grounded in criteria, and built on artifacts that show what the system actually does. By the end, you should be able to explain what counts as evidence, how to test key A I controls, and how to avoid being distracted by presentations that do not prove anything.
Before we continue, a quick note: this audio course is a companion to our course companion books. The first book is about the exam and provides detailed information on how to pass it best. The second book is a Kindle-only eBook that contains 1,000 flashcards that can be used on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.
The first step is understanding what evidence means in an audit context. Evidence is information that supports a conclusion and can be examined independently of the person making the claim. In other words, evidence should remain convincing even if the speaker is not in the room. Policies, slide decks, and verbal explanations can provide context, but they are weak evidence by themselves because they describe intent, not enforcement. Strong evidence includes configuration settings, access control assignments, logs, change histories, test results, monitoring alerts, incident records, and records of approvals and reviews. In A I systems, strong evidence may also include traces showing which data sources were accessed during a request, records of prompt and system instruction changes, and telemetry showing how the system responds to known abuse patterns. This matters because A I risk is often hidden in small gaps between what people believe is happening and what the system actually does. When you test controls with evidence, you shrink that gap. You also protect the audit from bias, because you are not deciding based on who speaks most confidently.
A useful way to think about control testing is to separate control design from control operation. Design is what the organization intended, such as a rule that only certain roles can change prompts or a policy that sensitive documents cannot be retrieved. Operation is what actually happens in practice, such as whether prompt changes are truly restricted and logged, and whether retrieval truly respects access boundaries. A vendor demo might show design, or it might show a simplified version of operation in a clean environment. An opinion might reflect belief about design or operation, but belief is not proof. Evidence-based testing tries to answer both questions with artifacts. For design, you gather documentation and configuration that define the control. For operation, you gather logs, records, and test results that show the control working over time and under conditions that matter. This approach is especially important in A I because systems change frequently, and a control that was designed well can degrade quietly through updates, new connectors, or new integrations. Evidence of operation is what tells you the control still works today.
Now let’s apply this to a core A I control area: access control for models, data, and keys. An organization might claim that only authorized users can access training data or change model configuration. Evidence-based testing begins by identifying what the sensitive resources are and what roles are supposed to have access. Then you examine the actual access assignments, including human roles and service accounts, and compare them to the expected least privilege design. You also check for segregation between environments, such as development and production, because overbroad access across environments can bypass governance. Next, you look for evidence of operation, such as logs showing who accessed datasets, who changed configuration, and whether access reviews were performed. The strongest test includes a trace of a recent privileged action, such as a prompt change or a deployment, showing that the right person approved it, the right identity performed it, and the action was logged. This is far more meaningful than a screenshot of a role list shown during a demo. It proves that access controls are not only configured but also used and monitored.
Another crucial control area is change management for models, prompts, and integrations, because change is where A I risk often enters. A team may claim that changes are reviewed and tested, but evidence-based testing asks for specific change records and then follows them through the lifecycle. You select a small number of recent high-impact changes, such as enabling a new data connector, updating a system prompt template, changing a model version, or enabling a new tool integration. Then you examine the evidence that shows the change was requested, reviewed, approved, tested, deployed, and validated. You also check whether any required testing included known risk scenarios, such as prompt injection attempts or sensitive data retrieval checks. A vendor demo might show a change management interface, but it does not show whether real changes follow the process. Evidence-based testing forces the organization to demonstrate that the process is not just a diagram but a working practice. This is one of the most reliable ways to evaluate maturity because organizations that truly control change can usually produce clean traces, while organizations that move fast without governance struggle to show consistent records.
Monitoring and detection controls are another place where opinions and demos can be misleading. A vendor might show a dashboard with colorful charts, and a team might say they monitor the system closely. Evidence-based testing asks what signals are monitored, what logs support those signals, what alerts exist, and whether alerts lead to investigation and action. You examine alert rules or configurations, then review a sample of alert events and their associated response records. You also look for coverage of A I-specific abuse signals, such as repeated bypass attempts, unusual retrieval patterns, abnormal tool calls, and query patterns consistent with extraction attempts. The goal is to confirm that monitoring is not just present but effective, meaning it detects the defined misuse patterns and triggers timely response. A dashboard in a demo is not evidence unless it reflects real data, real alert thresholds, and real response workflows. Evidence-based testing emphasizes the chain from telemetry to detection to action. If the chain breaks, the control does not protect the organization, even if the interface looks impressive.
Behavioral guardrails, such as protections against prompt injection and unsafe outputs, are a signature A I control area where evidence-based testing is essential. Teams may claim the model is aligned, safe, or protected, but those words often hide uncertainty. A better approach is controlled testing aligned to criteria. That means you define a set of safe, ethical test cases that represent known abuse patterns and sensitive situations relevant to the system’s use. Then you observe how the system responds, and you capture evidence of that behavior, such as policy block events, refusal behavior, and monitoring alerts generated by the tests. You also test consistency by varying the input phrasing, because evasion often relies on small changes in wording. The goal is not to perform offensive experimentation; it is to confirm that the guardrails function as expected and that the system detects probing patterns. Evidence-based testing also checks whether guardrails are enforced at multiple layers, such as input filtering, output filtering, and integration constraints, rather than relying on a single safety setting. A vendor demo can show a refusal once, but evidence-based testing shows whether the refusal is consistent and whether monitoring responds when someone tries repeatedly.
Data protection controls also benefit from evidence-based testing that focuses on real data flows rather than policy statements. If the system uses retrieval, you test whether the model can access documents outside approved scope, and you validate which repositories are connected and what access enforcement exists. If the system stores prompts and outputs, you examine retention settings, access to logs, and evidence of encryption and access auditing. If the organization claims it minimizes sensitive data in prompts, you examine user guidance, training records, and evidence of enforcement mechanisms or monitoring for sensitive content. If the vendor claims it does not train on your data, you examine contractual language and operational evidence such as data use statements, retention controls, and any provided independent assessments. Evidence-based testing here is about proving where data goes, who can see it, and how long it lives. In A I systems, data is not only what you store in databases; it is also what flows through prompts and outputs, and that can be a blind spot if you rely on policies without verification.
Incident response readiness is another area where organizations often rely on confidence rather than proof. A team may say they have a playbook, but evidence-based testing asks whether the playbook is practiced, whether roles are defined, and whether there is evidence of prior incidents or exercises. For A I, you also want to see evidence that containment levers exist, such as the ability to disable connectors, restrict endpoints, rotate credentials, or disable tool integrations quickly. A useful evidence approach is to review a small number of historical events or exercises and trace how detection occurred, how triage was performed, what containment actions were taken, and what learning actions were implemented afterward. This proves that the organization can respond under real conditions, not just that it can describe a response in theory. Vendor demos rarely cover the messy reality of response, and opinions often underestimate the time it takes to coordinate across teams. Evidence-based testing keeps the audit grounded by showing whether the organization has operational muscle, not just a written plan.
One of the most important skills for beginners is learning how to treat vendor-provided evidence appropriately without being either overly skeptical or overly trusting. Independent assessment reports, control attestations, and detailed technical documentation can be meaningful evidence, but you still need to confirm how they apply to your specific use case. A vendor might have strong controls in general, but your organization might configure the service in a way that increases risk, such as connecting sensitive repositories or using shared keys. Evidence-based auditing therefore includes checking both vendor controls and customer-side controls, especially at the boundary where data and credentials cross. It also includes verifying that contracts reflect the security expectations needed for the business risk, such as retention limits, incident notification, and change communication. The goal is not to prove the vendor is perfect; it is to prove that the combined system, vendor plus customer configuration, meets the criteria and outcomes required. This is how you avoid being sold by a vendor’s strongest story while missing the weak points in integration and usage.
As you wrap up, remember that testing A I controls with evidence is what turns auditing into a disciplined practice rather than a conversation. Evidence is independent, verifiable information that shows what controls are designed to do and what they actually do over time. You test access controls by comparing expected roles to actual permissions and by tracing real privileged actions through logs and approvals. You test change management by tracing real changes from request to deployment to validation. You test monitoring by confirming the chain from telemetry to alerting to response. You test behavioral guardrails with controlled scenarios that reflect known abuse patterns, capturing results and ensuring consistency. You test data protection by tracing real data flows, retention, and access boundaries. You test incident readiness by reviewing exercises and real response records, ensuring containment levers exist and are used. When you focus on evidence rather than opinions or demos, your conclusions become defensible, your recommendations become actionable, and your audit results are far more likely to reduce real A I risk.