Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Paper · arXiv 2506.08713 · Published June 10, 2025

Ensuring that complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Challenges in this process include the complexity of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multihop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference task for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them with large language models (LLMs). We introduce metrics that measure the coverage and structural consistency of the generated assurance cases. As a case study, we demonstrate the effectiveness of assurance cases generated from GDPR requirements in a multi-hop inference task.

However, ensuring compliance is challenging because legal documents are complex. Navigating privacy policy documents requires expertise in both the legal and the technical domain, and the process can be tedious for a legal expert, an auditor, or an engineer. This motivates many experts to automate compliance detection. Automating the process, however, also means the AI system needs to be transparent, and this demand makes it difficult to adopt black-box AI systems.

To address the transparency and robustness factors of detecting risk and safety, experts use assurance cases (Rhodes et al., 2010; Mansourov and Campara, 2011) as a structured argument approach to audit complex software products, such as those in aircraft or autonomous vehicles, against standard requirements and so ensure compliance with legal and technical standards. An assurance case is a systematic collection of arguments and supporting evidence that demonstrates, with a certain level of confidence, that a product meets particular claims related to its requirements (Netkachova et al., 2015; Piovesan and Griffor, 2017). Assurance cases can be created and updated throughout the development lifecycle (Ross et al., 2022).

• We propose compliance detection based on Natural Language Inference. We formulate the claim-argument-evidence structure as a discourse structure that can be used to develop a multi-hop inference model, so that the compliance detection model’s output can be traced and explained (Figure 1).

• We propose evaluation methods for assurance cases generated by large language models (LLMs), validating their instance-level consistency and structural coherence. We conduct a study in which open-source and proprietary LLMs generate claim-argument-evidence assurance case data, using GDPR requirements from a previous study (Cejas et al., 2023) as the context prompt. We analyse the generated assurance case data using our proposed metrics.

2005). We propose compliance detection that treats the requirements as premises and the evidence as hypotheses. This lets the model learn from the verbalised requirement text rather than from discrete class labels, in contrast to previous studies that framed the task as text classification (Azeem and Abualhaija, 2024).
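The premise/hypothesis formulation can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `NLIPair` type, the `build_pairs` helper, the requirement identifiers, the example sentences, and the `not_entailment` label name do not come from the paper or its dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NLIPair:
    premise: str     # verbalised requirement text
    hypothesis: str  # evidence text, e.g. a sentence from an agreement
    label: str       # "entailment" or "not_entailment" (label names are assumptions)


def build_pairs(requirements, evidence):
    """Pair every requirement with every evidence sentence.

    requirements: dict mapping a hypothetical requirement id to its text.
    evidence: list of (text, requirement id the text is annotated with).
    """
    pairs = []
    for req_id, req_text in requirements.items():
        for ev_text, ev_req_id in evidence:
            # Assumption: a pair is entailed when the evidence is annotated
            # with the same requirement it is being compared against.
            label = "entailment" if ev_req_id == req_id else "not_entailment"
            pairs.append(NLIPair(req_text, ev_text, label))
    return pairs


# Illustrative texts only; not taken from the paper's data.
requirements = {
    "R1": "The processor shall process personal data only on documented "
          "instructions from the controller.",
    "R2": "The processor shall notify the controller of a personal data breach.",
}
evidence = [
    ("The Processor will act solely on the Controller's written instructions.", "R1"),
]

pairs = build_pairs(requirements, evidence)
for pair in pairs:
    print(pair.label, "|", pair.premise[:40], "|", pair.hypothesis[:40])
```

Because the premise carries the full requirement wording, a model trained on such pairs sees the text of the requirement itself rather than an opaque class index.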

For example, in Table 1, the inputs are the GDPR requirements, which serve as the premise, and the Data Processing Agreement (DPA), which serves as the hypothesis. The entailment label means the requirements and the DPA belong to the same GDPR requirement class and comply

As the figure shows, the parent node in the assurance case structure serves as the premise. The child node, which builds upon or supports the parent, becomes the hypothesis. These structured relationships are extracted from assurance cases and transformed into a format suitable for training NLI models. This approach enables our system to perform more complex inferences with a simple NLI model, which serves as a strong baseline.
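The parent-as-premise, child-as-hypothesis extraction can be sketched as a tree traversal. This is a hedged illustration, not the authors' implementation: `Node`, `hops`, `trace`, and the stub predictor are hypothetical names, and the rule that a path is supported only when every hop is entailed is an assumption about how hop-level results might be combined.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator


@dataclass
class Node:
    kind: str   # "claim", "argument", or "evidence"
    text: str
    children: list["Node"] = field(default_factory=list)


def hops(node: Node) -> Iterator[tuple[str, str]]:
    """Yield (premise, hypothesis) = (parent text, child text) for every edge."""
    for child in node.children:
        yield node.text, child.text
        yield from hops(child)


def trace(node: Node, entails: Callable[[str, str], bool], path=()):
    """Yield each root-to-leaf path as a tuple of (premise, hypothesis, result) hops."""
    if not node.children:
        yield path
        return
    for child in node.children:
        hop = (node.text, child.text, entails(node.text, child.text))
        yield from trace(child, entails, path + (hop,))


# Toy assurance case: one claim, one argument, two pieces of evidence.
case = Node("claim", "C", [
    Node("argument", "A", [Node("evidence", "E1"), Node("evidence", "E2")]),
])

# Stub standing in for a trained NLI model (assumption): a lookup of accepted hops.
accepted = {("C", "A"), ("A", "E1")}


def stub_entails(premise: str, hypothesis: str) -> bool:
    return (premise, hypothesis) in accepted


training_pairs = list(hops(case))
paths = list(trace(case, stub_entails))
for path in paths:
    # Assumed aggregation rule: a path is supported only if every hop is entailed.
    supported = all(result for _, _, result in path)
    print(" -> ".join([path[0][0]] + [hyp for _, hyp, _ in path]), supported)
```

Keeping the per-hop results in each path is what makes the decision traceable: a failed path points to the specific parent-child link that was not entailed.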

our best-performing approach performs comparably to previous approaches (Binary-FT). However, our approach offers better explainability and traceability back to the verbalised requirements.