The Audit That Almost Never Happens
A regional bank in the Kingdom deploys a credit-scoring model. The model passes internal testing. The relevant teams sign off. A year passes. The model has been silently disadvantaging applicants from certain districts — not because someone programmed it to, but because it learned the pattern from historical data that reflected years of systemic inequity. Nobody caught it because the quarterly review consisted of checking whether the model was still running and whether overall approval rates had shifted dramatically. They had not. The damage was granular, distributed across thousands of individual decisions.
This is not a hypothetical. It is the most common failure mode in AI deployment: a governance process that looks like an audit but functions as a sign-off ritual.
Saudi Arabia's regulatory environment is accelerating. SDAIA has published AI ethics principles and expects organizations to operationalize them. SAMA has model risk management guidance that extends to algorithmic systems in financial services. The NCA's Essential Cybersecurity Controls apply to AI systems that touch critical infrastructure. The Personal Data Protection Law — Royal Decree M/19, with its implementing regulations — creates direct obligations around automated decision-making affecting individuals. What most organizations are doing in response is writing policies and calling it governance. When an actual auditor arrives, that gap becomes visible immediately.
Two Definitions of an AI Audit
There is the audit that satisfies a checklist and the audit that tells you whether your AI systems are actually behaving as intended, fairly, and in compliance with applicable law.
The checklist version typically involves confirming that a model register exists, that someone has signed a policy document, that there is a named owner for each system, and that monitoring dashboards are running. This takes a few days and produces a clean report. It misses almost everything that matters.
A substantive AI audit is an investigative exercise. It asks not whether governance structures exist on paper but whether they function. It probes the data lineage behind a model — can you actually trace which training records produced a specific behavior? It tests for distribution shift: is the population the model sees today materially different from the population it was trained on? It examines the gap between the model's intended use case and how it is actually being used in the field. It checks whether the explainability methodology documented in the model card matches what the operations team actually uses to interpret outputs.
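To make the lineage question concrete: the minimal machinery is a content fingerprint of the training snapshot recorded against each model version, with every production decision carrying a reference to that version. The sketch below is purely illustrative; the names and structure are assumptions for this article, not a prescribed schema.

```python
import hashlib
import json

def dataset_fingerprint(rows: list[dict]) -> str:
    """Content-hash a training snapshot so it can be re-identified later."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# At training time, record which dataset snapshot produced which model version.
model_lineage = {
    "credit-scorer-v3": {
        "training_data_sha256": dataset_fingerprint([{"id": 1, "income": 9000}]),
        "training_window": "2023-01-01/2023-12-31",
        "approved_use": "retail credit limit decisions",
    }
}

# Every production decision references the model version that produced it,
# so a decision ID can be traced back to a specific training snapshot.
decision_log = {
    "DEC-2024-000381": {"model_version": "credit-scorer-v3", "score": 0.62},
}

record = decision_log["DEC-2024-000381"]
print(model_lineage[record["model_version"]]["training_data_sha256"])
```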
These two definitions coexist in the market right now. Organizations that understand the difference are building audit frameworks that will hold up under scrutiny. Organizations that do not are accumulating latent liability.
What SDAIA Actually Requires — and What It Leaves Open
SDAIA's National AI Ethics Principles establish fairness, reliability, transparency, privacy, accountability, and humanity as governing values for AI systems in the Kingdom. These are not aspirational suggestions. They are the criteria against which regulators will evaluate organizations that face AI-related incidents, complaints, or examinations.
The practical question is what operationalizing these principles requires. Transparency, for example, is not satisfied by publishing a vague statement that your company uses AI responsibly. It requires that affected individuals can receive a meaningful explanation of how an automated system reached a decision affecting them. Under PDPL's implementing regulations, automated decisions that produce legal or similarly significant effects create rights of explanation and contestation. A credit refusal, a benefits determination, a risk classification that affects an individual's insurance premium — these are covered. Organizations that cannot reconstruct an explanation for a specific model output at the time of the decision are already non-compliant, regardless of what their governance documentation says.
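What reconstructing an explanation requires in practice is less exotic than it sounds: capture the exact inputs and the per-feature attributions at the moment the decision is issued, keyed by a decision identifier. The sketch below is a minimal illustration with hypothetical field names; it assumes the serving layer already produces attribution values (SHAP or similar) alongside each score.

```python
import json
import time

def record_decision(decision_id: str, model_version: str,
                    features: dict, score: float, outcome: str,
                    attributions: dict, path: str = "decision_log.jsonl") -> None:
    """Append a decision record, including its explanation, at decision time.

    Storing attributions when the decision is made is what allows the
    explanation to be reproduced later, even after the model is retrained.
    """
    entry = {
        "decision_id": decision_id,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model_version": model_version,
        "features": features,          # exact inputs the model saw
        "score": score,
        "outcome": outcome,
        "attributions": attributions,  # per-feature contribution to the score
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

record_decision(
    decision_id="DEC-2024-000381",
    model_version="credit-scorer-v3",
    features={"income": 9000, "tenure_months": 14, "utilization": 0.72},
    score=0.62,
    outcome="declined",
    attributions={"utilization": -0.21, "tenure_months": -0.08, "income": 0.05},
)
```

Because the record is written when the decision is made, answering a PDPL contestation a year later is a lookup, not a forensic reconstruction.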
Reliability under SDAIA's framework means something specific: ongoing validation that a model continues to perform within the parameters for which it was approved. A model approved in Q1 on one data distribution is not automatically reliable in Q4 when the distribution has shifted. SDAIA's expectation is that organizations monitor this. What constitutes adequate monitoring — thresholds, frequency, escalation triggers — is where the framework currently leaves organizations to exercise judgment, and where the gap between compliant-on-paper and genuinely governed is widest.
The Three Layers of a Functional AI Audit Framework
A framework that holds up under regulatory scrutiny operates at three layers simultaneously, each with distinct methods and artifacts.
The first layer is pre-deployment assessment. No AI system that affects individuals, business decisions, or regulated processes should reach production without a structured gate. This means documented evidence of the training data's provenance and any bias assessments conducted on it. It means performance metrics tested not just on aggregate accuracy but on subgroup performance — a model that performs well overall while systematically underperforming for a particular demographic is a fairness problem waiting for a regulator to name it. It means a documented decision about the acceptable explainability methodology, because not every model architecture supports the same interpretability approaches, and the choice has downstream consequences for compliance with PDPL's explanation requirements.
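A minimal version of the subgroup check, assuming a labeled evaluation set and a sensitive attribute available for testing purposes, looks like the sketch below. The 10-point recall gap used as a flag is illustrative; the acceptable tolerance is itself a documented risk decision, not a universal constant.

```python
import numpy as np

def subgroup_recall_gap(y_true, y_pred, groups, max_gap=0.10):
    """Compute recall (true positive rate) per subgroup and flag large gaps.

    A model can look fine on aggregate accuracy while underperforming badly
    for one group; this surfaces that before deployment.
    """
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    recalls = {}
    for g in np.unique(groups):
        mask = (groups == g) & (y_true == 1)
        if mask.sum() == 0:
            continue  # no positive cases for this group in the evaluation set
        recalls[g] = float((y_pred[mask] == 1).mean())
    gap = max(recalls.values()) - min(recalls.values())
    return recalls, gap, gap > max_gap

# Illustrative evaluation run with a synthetic sensitive attribute.
recalls, gap, flagged = subgroup_recall_gap(
    y_true=[1, 1, 0, 1, 1, 0, 1, 1],
    y_pred=[1, 1, 0, 0, 1, 0, 0, 0],
    groups=["A", "A", "A", "A", "B", "B", "B", "B"],
)
print(recalls, f"gap={gap:.2f}", "FLAG" if flagged else "ok")
```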
The second layer is continuous production monitoring. AI systems degrade — through data drift, distribution shift, feedback loops, and model decay — and they do so silently unless you are measuring the right things. Continuous monitoring is not a dashboard that shows the model is running. It is a measurement system that detects when the model's behavior in production diverges materially from its behavior at deployment. The critical metrics depend on the model type and risk classification, but at minimum should capture prediction distribution shift, subgroup performance differentials, and output anomaly rates. For SAMA-regulated institutions, this connects to existing model risk management obligations under the Basel-aligned guidance the authority has issued; AI systems used in credit, fraud, and market functions are model risk, not a separate category.
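One widely used way to quantify prediction distribution shift is the population stability index computed over the model's output scores. The sketch below is illustrative; the 0.10 and 0.25 thresholds are conventional rules of thumb that should be calibrated per system and risk tier, not treated as regulatory values.

```python
import numpy as np

def population_stability_index(reference, production, n_bins=10):
    """PSI between the score distribution at deployment and in production.

    Bins are derived from the reference (deployment-time) distribution so the
    comparison is anchored to the baseline the model was approved against.
    """
    reference = np.asarray(reference, dtype=float)
    production = np.asarray(production, dtype=float)
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    # Clip production scores into the reference range so every score lands in a bin.
    production = np.clip(production, edges[0], edges[-1])
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Small floor avoids division by zero for empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

rng = np.random.default_rng(0)
baseline = rng.beta(2, 5, 10_000)   # scores at deployment
live = rng.beta(2.6, 4, 10_000)     # scores this quarter, drifted
psi = population_stability_index(baseline, live)
print(f"PSI={psi:.3f}",
      "significant shift" if psi > 0.25 else
      "moderate shift" if psi > 0.10 else "stable")
```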
The third layer is periodic comprehensive audit. Continuous monitoring catches gradual drift. Comprehensive audits catch structural problems: governance gaps that have normalized, documentation that no longer matches the deployed system, use-case drift where a model is being applied to decisions beyond its validated scope, and control failures that monitoring metrics cannot see. Comprehensive audits require both technical examination — white-box testing of the model itself, red-team exercises to probe adversarial vulnerability — and procedural examination: interviewing the teams that actually use the model's outputs to understand how the system's recommendations translate into decisions, and whether the humans in the loop are exercising the judgment the governance framework assumes they are.
The Model Registry: Foundation, Not Destination
Virtually every AI audit framework recommends a model registry. Most organizations treat building the registry as the goal. It is not. The registry is the minimum prerequisite for governance to be possible — it tells you what you have. Governance is what you do with that knowledge.
A registry that functions for audit purposes captures, for each AI system in production: its purpose and intended decision scope, the risk classification assigned to it and the rationale for that classification, the training data sources and the date of the last data lineage validation, the current monitoring status and the date of the last alert review, the owner responsible for remediation if the system is flagged, and the audit history including any findings and their resolution status.
The classification schema matters as much as the registry itself. Not all AI systems carry the same risk profile. A model that generates internal content recommendations carries fundamentally different risk from a model that determines who receives a benefit, a service, or an opportunity. The audit cadence, the depth of monitoring, and the escalation thresholds should all follow from the risk classification — and that classification should be reviewed when use cases change, because systems migrate toward higher-risk applications over time without anyone formally upgrading their risk tier.
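A registry entry that can answer these questions is compact. The sketch below uses illustrative field names and an assumed three-tier classification; the point it demonstrates is that audit cadence follows mechanically from the assigned tier rather than being decided case by case.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class RiskTier(Enum):
    LOW = "low"        # e.g. internal content recommendations
    MEDIUM = "medium"  # informs decisions, but a human decides
    HIGH = "high"      # determines access to a benefit, service, or opportunity

# Illustrative cadences; the right values are a documented risk decision.
AUDIT_CADENCE_DAYS = {RiskTier.LOW: 365, RiskTier.MEDIUM: 180, RiskTier.HIGH: 90}

@dataclass
class ModelRegistryEntry:
    system_name: str
    purpose: str                      # intended decision scope
    risk_tier: RiskTier
    risk_rationale: str               # why this tier was assigned
    training_data_sources: list[str]
    last_lineage_validation: date
    monitoring_status: str            # e.g. "active", "degraded", "paused"
    last_alert_review: date
    remediation_owner: str
    audit_findings: list[str] = field(default_factory=list)

    def audit_overdue(self, last_audit: date, today: date) -> bool:
        """Cadence is derived from the risk tier, not chosen per system."""
        return (today - last_audit).days > AUDIT_CADENCE_DAYS[self.risk_tier]

entry = ModelRegistryEntry(
    system_name="credit-scorer-v3",
    purpose="Retail credit limit decisions",
    risk_tier=RiskTier.HIGH,
    risk_rationale="Determines individual access to credit; PDPL automated-decision scope",
    training_data_sources=["core_banking_2019_2023"],
    last_lineage_validation=date(2024, 3, 1),
    monitoring_status="active",
    last_alert_review=date(2024, 5, 20),
    remediation_owner="model-risk@bank.example",
)
print(entry.audit_overdue(last_audit=date(2024, 1, 15), today=date(2024, 6, 1)))
```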
Shadow AI: The Audit Scope Problem
Any AI audit framework built around formally approved systems is auditing a fraction of the actual AI footprint. The proliferation of generative AI tools, no-code ML platforms, and AI-enabled SaaS applications has created a shadow AI layer in virtually every organization. Employees are using these tools to inform decisions, draft communications, analyze data, and synthesize information — activities that, depending on the data involved and the decisions influenced, may create PDPL exposure, SDAIA compliance gaps, and operational risks the organization cannot see because it has not looked.
An audit framework that does not include a shadow AI discovery process is protecting against the risks you know about while remaining blind to the ones that will actually surprise you. Discovery does not require surveillance. It requires combining network traffic analysis for known AI service endpoints, SaaS license inventory review for AI-enabled tools, and structured conversations with business units about what tools their teams are actually using. What you find will typically require a governance response: either formal approval and monitoring integration, or a clear policy on prohibited tools with enforcement mechanisms.
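The network-traffic component of discovery can start with something as simple as matching egress logs against a maintained list of known AI service hostnames. The sketch below assumes proxy records have already been parsed into dictionaries and uses a deliberately short, illustrative endpoint list; its output feeds the structured conversations with business units, it is not an enforcement mechanism.

```python
from collections import Counter

# Illustrative, deliberately incomplete set of hostnames associated with
# public AI services; a real inventory would be maintained and much longer.
KNOWN_AI_ENDPOINTS = {
    "api.openai.com",
    "chat.openai.com",
    "claude.ai",
    "gemini.google.com",
}

def discover_ai_traffic(proxy_records):
    """Count requests to known AI endpoints per business unit.

    `proxy_records` is assumed to be an iterable of dicts with at least
    'host' and 'department' keys, already parsed from the proxy logs.
    """
    hits = Counter()
    for rec in proxy_records:
        if rec["host"] in KNOWN_AI_ENDPOINTS:
            hits[(rec["department"], rec["host"])] += 1
    return hits

sample = [
    {"host": "api.openai.com", "department": "marketing"},
    {"host": "intranet.bank.local", "department": "marketing"},
    {"host": "claude.ai", "department": "credit-ops"},
    {"host": "claude.ai", "department": "credit-ops"},
]
for (dept, host), count in discover_ai_traffic(sample).items():
    print(f"{dept}: {count} requests to {host}")
```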
Sectoral Obligations Beyond SDAIA
SDAIA sets the baseline. Sector regulators apply additional layers.
SAMA's model risk management expectations — which align with Basel Committee guidance while incorporating KSA-specific requirements — treat AI systems used in credit, market, and operational risk functions as model risk requiring validation, independent review, and ongoing monitoring. The board-level oversight requirement is explicit: senior leadership cannot delegate accountability for model risk entirely to technical teams. The audit framework for a SAMA-regulated institution needs to integrate AI audit findings into the broader model risk governance structure rather than running them as a parallel track.
The NCA's Essential Cybersecurity Controls create obligations that intersect directly with AI systems in critical infrastructure contexts. AI systems are attack surfaces. A model serving a critical operational function can be targeted through adversarial inputs, data poisoning, or extraction attacks designed to reverse-engineer proprietary information. Cybersecurity resilience testing for AI — distinct from standard penetration testing because AI systems fail in different ways than traditional software — belongs in the audit methodology for any NCA-regulated organization.
For healthcare organizations, SFDA requirements for AI-assisted diagnostic tools create a validation evidence requirement that goes beyond general AI governance. Clinical validation — demonstrating that a diagnostic or treatment recommendation model performs appropriately on the patient population it will actually serve in the Kingdom — is a prerequisite, not an afterthought. An audit framework in this sector must include clinical validation evidence review as a first-class audit artifact.
What Auditors Actually Test
When a regulator or a serious third-party auditor examines an AI system, they are not reading the governance policy document. They are asking for evidence.
Can you produce the data lineage for a specific model decision made six months ago? Can you show that the model's performance on protected demographic groups was tested before deployment and is monitored in production? Can you demonstrate that the monitoring alerts for this system were reviewed, that someone with decision-making authority saw them, and that any triggered remediations were completed within the timeframes your own policy requires? Can you reconstruct the explanation for a specific model output that an individual is contesting under PDPL?
If the answer to any of these questions is "we would have to reconstruct that," the audit already has its first finding. Documentation that is not contemporaneous, meaning it was assembled in response to the audit rather than created as part of normal operations, does not satisfy regulatory evidence standards.
The organizations that are prepared for this kind of scrutiny built their governance infrastructure so that audit evidence is a byproduct of how the system is operated, not a deliverable that has to be generated on demand.
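One pattern that produces contemporaneous evidence by construction: the alert-review workflow itself writes an append-only record at the moment a review happens, so the audit trail is simply what operations already generated. The field names below are hypothetical.

```python
import json
import time

EVIDENCE_LOG = "alert_reviews.jsonl"

def log_alert_review(alert_id: str, system_name: str, reviewer: str,
                     decision: str, remediation_due: str | None = None) -> None:
    """Append an alert-review record the moment the review happens.

    Because the record is written as part of the review itself, producing
    audit evidence later is a file read, not a reconstruction exercise.
    """
    entry = {
        "alert_id": alert_id,
        "system": system_name,
        "reviewed_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "reviewer": reviewer,
        "decision": decision,          # e.g. "no action", "escalated", "remediation opened"
        "remediation_due": remediation_due,
    }
    with open(EVIDENCE_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

log_alert_review(
    alert_id="PSI-2024-0042",
    system_name="credit-scorer-v3",
    reviewer="model-risk@bank.example",
    decision="remediation opened",
    remediation_due="2024-07-15",
)
```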
The Governance Failure Mode That Kills Frameworks
The most common reason AI audit frameworks fail is not technical. It is organizational. A framework designed by a compliance team, implemented by an IT team, and operationalized by a data science team — without any of those teams having shared accountability for outcomes — will produce documentation and will not produce governance.
Effective AI governance requires that the people who build models understand what auditors will test for and design their documentation and monitoring accordingly from the start. It requires that the people who use model outputs understand their accountability for the human judgment the framework assumes they are applying. It requires that compliance functions understand enough about how AI systems work to recognize when a governance artifact is substantive and when it is a formality.
This is a people and process problem that frameworks cannot solve by themselves. Frameworks define the requirements. Culture determines whether those requirements are met.
When the Auditor Arrives
Regulators across the Kingdom are developing the capacity to examine AI governance directly. SDAIA has signaled active engagement with organizations on AI ethics implementation. SAMA has examined model risk practices in financial institutions with increasing technical sophistication. The NCA conducts assessments that now include AI system security. The PDPL's enforcement mechanism, once fully operational, creates a formal complaint pathway that can trigger regulatory scrutiny of automated decision-making practices.
When an auditor arrives at an organization that treated AI governance as a documentation exercise, the examination will reveal it quickly. The model registry exists but has not been updated since it was created. The monitoring dashboards run but no one reviews the alerts systematically. The pre-deployment checklist was completed but the model in production does not match the version that was assessed. The explainability methodology on paper is SHAP values; the operations team has never seen a SHAP output.
The gap between governance on paper and governance in practice is exactly what auditors are trained to find. Organizations that understand this build frameworks designed to be tested — where the artifacts are current, the monitoring is actionable, and the people responsible can explain what they do. That is not a compliance burden. It is the difference between AI systems you can stand behind and AI systems you are hoping no one examines too closely.
The frameworks that survive scrutiny were not built in anticipation of the audit. They were built because the organization decided, before the auditor arrived, that it needed to actually know whether its AI systems were behaving appropriately. That decision is where the work begins.
Published by PeopleSafetyLab — AI safety and governance research for KSA organizations.