Towards Trustworthy AI in Financial Compliance
A study of explainable Graph Neural Networks on Elliptic2.
BSc (Hons) thesis, TU Dublin, April 2026. Supervisor: Dr. Musfira Jilani
The question
The EU AI Act came into force in 2024. Annex III §5(b) labels anti-money-laundering models as high-risk, and Articles 9, 13, 14, and 15 attach binding obligations on risk management, transparency, human oversight, and robustness. Without explainability, auditability, and stress-testing, an AML model is no longer viable from a regulatory point of view.
The natural model class for graph-structured transaction data is graph neural networks. The Elliptic2 dataset, released by MIT, IBM and Elliptic in 2024, is the largest fully-labelled public AML benchmark, with 121,810 labelled subgraphs on a 196M-edge background graph and a 2.27% positive rate. Bellei et al.'s baselines reach a maximum PR-AUC of 0.208 using only graph structure. They do not assess whether the resulting models can be explained.
This thesis asks one question:
Can an explainable GNN trained on node features only, on the labelled subgraphs of the Elliptic2 dataset, match or exceed structure-based baselines on the AML task, and do the resulting explanations yield meaningful insights for analysts?

Contributions
The thesis makes four contributions.
A feature-aware, subgraph-local benchmark of AML task performance. A two-layer GATv2 with global attention pooling reaches 0.515 PR-AUC on the held-out test set, approximately a 2.47x improvement over the best published structure-only baseline. One caveat: because the input regimes differ (node features here, structure alone in the baselines), the comparison locates the discriminative signal in the node attributes rather than establishing any particular ranking of architectures.
An evaluation of feature-attribution explainers (Integrated Gradients and Kernel SHAP) on the trained model, with rank correlation (ρ = 0.9505) reported across the leading discriminative features. Convergence between two methods built on different assumptions strengthens the trustworthiness claim of the resulting feature ranking.
A broader-than-typical evaluation of structural explainers (GNNExplainer, GATv2 attention, SubgraphX), with mask entropy proposed as a new fidelity diagnostic for the median three-node subgraph case, where standard fidelity metrics collapse.
A mapping of the entire pipeline against the ALTAI seven-pillar framework and the EU AI Act, identifying the gaps in current GNN explainability that prevent compliance-grade deployment.
Why graphs
Money laundering is not a niche compliance problem. The UN Office on Drugs and Crime estimates ~$2 trillion is laundered worldwide each year, and the FATF's most recent annual report flags fraud as the top reported predicate offence in 89% of mutual-evaluation reports.
Criminals chain transactions and intermediaries precisely so that no single transaction looks suspicious to a transaction-by-transaction filter. Bellei et al. formalise this by labelling at the connected-component level rather than the transaction level, and we follow that convention throughout. The Bitcoin substrate is well suited to this kind of analysis: pseudonymous rather than anonymous, with a permissionless, fully reconstructible transaction graph.
Architecture and training
We considered GCN, GraphSAGE, GAT, GATv2, and NNConv, with gated pooling and Jumping-Knowledge concatenation to control over-smoothing. GATv2 was selected for two reasons: its dynamic attention scheme is strictly more expressive than the static attention of vanilla GAT, and its learned attention coefficients over edges provide an interpretable output that we analyse further in Chapter 6.
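For concreteness, the sketch below shows this kind of model in PyTorch Geometric: two GATv2 layers followed by learned attention pooling over each subgraph. The layer sizes, head counts and class name are illustrative placeholders, not the thesis configuration.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATv2Conv
from torch_geometric.nn.aggr import AttentionalAggregation


class SubgraphGATv2(torch.nn.Module):
    """Two GATv2 layers + attentional pooling for subgraph classification (illustrative)."""

    def __init__(self, in_dim, hidden_dim=64, heads=4):
        super().__init__()
        self.conv1 = GATv2Conv(in_dim, hidden_dim, heads=heads)
        self.conv2 = GATv2Conv(hidden_dim * heads, hidden_dim, heads=1)
        # Gate network scores each node; pooling is a learned weighted sum of node embeddings.
        self.pool = AttentionalAggregation(gate_nn=torch.nn.Linear(hidden_dim, 1))
        self.classifier = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x, edge_index, batch):
        x = F.elu(self.conv1(x, edge_index))
        x = F.elu(self.conv2(x, edge_index))
        g = self.pool(x, batch)                 # one embedding per subgraph
        return self.classifier(g).squeeze(-1)   # logit: suspicious vs. licit
```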
We adopt Wang and Zhang's GLASS labelling trick (feeding an indicator of the input subgraph into the model), which recovers the expressive power that vanilla message-passing loses on subgraph tasks.
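A sketch of the idea, assuming the indicator is simply appended as an extra node-feature column (the function and variable names are illustrative, not the thesis code):

```python
import torch

def add_subgraph_indicator(x, subgraph_nodes):
    """Append a 0/1 column marking nodes that belong to the query subgraph.

    x: [num_nodes, num_features] node-feature matrix.
    subgraph_nodes: LongTensor of node indices in the labelled subgraph.
    """
    indicator = torch.zeros(x.size(0), 1, dtype=x.dtype)
    indicator[subgraph_nodes] = 1.0
    return torch.cat([x, indicator], dim=1)     # [num_nodes, num_features + 1]
```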
Training uses stratified train/val/test splits at the subgraph level to prevent structural leakage while preserving the 2.27% positive rate across splits. Hyperparameter search uses Optuna, with PR-AUC as the primary metric, appropriate for the extreme class imbalance where standard accuracy is uninformative.
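The toy numbers below (synthetic labels, not thesis data) illustrate why that choice matters at a 2.27% positive rate: a trivial always-licit classifier already scores roughly 97.7% accuracy, while the PR-AUC of an uninformative scorer sits near the positive rate itself.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.0227).astype(int)          # ~2.27% positives

# A stratified split keeps the positive rate roughly equal across splits.
train_idx, test_idx = train_test_split(
    np.arange(y.size), stratify=y, test_size=0.2, random_state=0)
print(y[train_idx].mean(), y[test_idx].mean())          # both ~0.023

# A classifier that always predicts "licit" looks excellent on accuracy...
print(accuracy_score(y, np.zeros_like(y)))              # ~0.977
# ...while PR-AUC for a random scorer stays near the base rate.
print(average_precision_score(y, rng.random(y.size)))   # ~0.02
```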
Explainability: two questions, two families
The thesis treats explainability as two distinct questions, each with its own methods, costs and validation criteria.
Feature attribution answers what drove the prediction. Integrated Gradients and Kernel SHAP both regard the node-feature vector as the unit of explanation, and we report their cross-method agreement on Elliptic2 at ρ = 0.9505. This convergence is necessary but not sufficient: it is still possible that both methods agree for the wrong reasons, but the probability of coincidental agreement is small.
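As a toy illustration of the agreement check (the attribution values below are made up, not taken from the thesis), the Spearman rank correlation between two attribution vectors for the same prediction can be computed directly:

```python
import numpy as np
from scipy.stats import spearmanr

ig_attr   = np.array([0.42, 0.07, 0.31, 0.01, 0.19])   # Integrated Gradients (toy)
shap_attr = np.array([0.39, 0.05, 0.35, 0.02, 0.15])   # Kernel SHAP (toy)

rho, pval = spearmanr(ig_attr, shap_attr)
print(f"Spearman rho = {rho:.4f} (p = {pval:.3g})")
```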
Structural explainers answer where in the graph the prediction came from. We evaluate GNNExplainer, GATv2 attention weights and SubgraphX on the same model. Standard fidelity metrics break down on Elliptic2's median three-node subgraph: removing any single edge often collapses the input graph entirely. We propose mask entropy as a fidelity diagnostic that is robust to this regime, and report fidelity, sparsity and stability across the resulting test set.
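The thesis defines mask entropy precisely; as a rough sketch of the idea, one natural form normalises the explainer's edge-importance mask into a probability distribution and measures its Shannon entropy, so a peaked, selective mask scores low and an uninformative uniform mask scores high. The function below is an illustrative reading, not the thesis's exact definition.

```python
import numpy as np

def mask_entropy(edge_mask, eps=1e-12):
    """Shannon entropy of a normalised edge-importance mask (illustrative)."""
    p = np.asarray(edge_mask, dtype=float) + eps
    p = p / p.sum()
    return float(-np.sum(p * np.log(p)))

print(mask_entropy([0.90, 0.05, 0.05]))   # peaked mask  -> low entropy
print(mask_entropy([1/3, 1/3, 1/3]))      # uniform mask -> maximal entropy (ln 3)
```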
Evaluating against ALTAI and the EU AI Act
The European Commission's High-Level Expert Group on AI distilled its broader ethics guidelines into seven dimensions: human agency and oversight; technical robustness and safety; privacy and data governance; transparency; diversity, non-discrimination and fairness; societal and environmental well-being; and accountability. We treat these dimensions not as a checklist but as a hypothesis: if the best-performing GNN, equipped with a GNN-native explainer, can satisfy all seven, the case for compliance-grade graph methods is strengthened. Where it cannot, the gaps are themselves a contribution.
The thesis maps the Elliptic2 pipeline against each ALTAI pillar and against Articles 9, 10, 12, 13, 14, 15 and 86 of the EU AI Act, identifying which dimensions current GNN explainability supports and which it does not.
Status
Submitted April 26, 2026. Final corrections complete. The full PDF is linked below.
Repository and reproducibility artefacts to follow.