Domain-Specific Agents: A Cyber Risk Multi-Agent Framework

Background

This is my MSc dissertation prototype: a domain-structured, evidence-grounded multi-agent framework for cyber-risk analysis. The system decomposes the reasoning task into role-specialised agents grounded in structured evidence, and supports two complementary execution modes for clean ablation experiments.

Source: Currently a private dissertation prototype. Architecture, claims, and code structure are described below.

Architecture

Input Case ──▶ [Stage 0 (optional) JELAS Grounding Agent]
                    ↓ JelasResult
              [Stage 1: Evidence Retrieval (LLM, JELAS-aware)]
                    ↓ EvidenceBundle
        ┌───────────┼───────────┐
        ▼           ▼           ▼
   Exposure    Likelihood     Impact
     Agent       Agent         Agent
        └───────────┬───────────┘
                    ▼
   [Stage 3 (conditional): Critic Agent]
   (fires when level disagreement ≥ threshold)
                    ▼
   [Stage 4: Coordinator Agent]
   (confidence-weighted synthesis)
                    ▼
              FinalReport
       (preserves all intermediate outputs)
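
The diagram does not spell out how the Coordinator's confidence-weighted synthesis works. One plausible reading, sketched here under stated assumptions, maps each agent's ordinal level to an integer, takes a confidence-weighted mean, and rounds back to a level. The four-step scale, the `AgentOutput` type, and the `synthesise` function are hypothetical names for illustration, not the prototype's actual code.

```python
from dataclasses import dataclass

# Assumed four-step ordinal scale; the real framework may use a different one.
LEVELS = ["low", "medium", "high", "critical"]


@dataclass
class AgentOutput:
    agent: str         # e.g. "exposure"
    level: str         # one of LEVELS
    confidence: float  # self-reported confidence in [0, 1]


def synthesise(outputs: list[AgentOutput]) -> str:
    """Confidence-weighted mean of ordinal levels, rounded to the nearest level."""
    if not outputs:
        raise ValueError("at least one agent output is required")
    weights = [o.confidence for o in outputs]
    if sum(weights) == 0:
        # No agent reports any confidence: fall back to an unweighted mean.
        weights = [1.0] * len(outputs)
    score = sum(w * LEVELS.index(o.level) for w, o in zip(weights, outputs))
    score /= sum(weights)
    return LEVELS[int(score + 0.5)]  # round half up to the nearest level


outputs = [
    AgentOutput("exposure", "high", 0.9),
    AgentOutput("likelihood", "medium", 0.6),
    AgentOutput("impact", "critical", 0.3),
]
final_level = synthesise(outputs)
```

In this example the low-confidence "critical" vote pulls the weighted mean up only slightly, so the synthesis lands on "high".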

Two execution modes

Both modes share the same run_pipeline(case) entry point. The only switch is whether the case JSON includes a jelas_env_name field, a deliberate choice that keeps A/B ablations clean.

LLM-only pipeline — pure language-model reasoning over a structured case.
JELAS-grounded neuro-symbolic pipeline — adds a deterministic Stage 0 that injects pre-computed knowledge-graph + Datalog risk facts before any LLM call.
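
A minimal sketch of how one entry point can serve both modes, assuming the case arrives as a plain dictionary. Only `run_pipeline(case)` and `jelas_env_name` come from the description above. The `ground_with_jelas` stub, the returned dictionary shape, and the example values are hypothetical.

```python
from typing import Any, Optional


def ground_with_jelas(env_name: str) -> dict[str, Any]:
    """Hypothetical Stage 0 stub: would load pre-computed KG + Datalog facts."""
    return {"env": env_name, "facts": []}


def run_pipeline(case: dict[str, Any]) -> dict[str, Any]:
    """Single entry point; Stage 0 runs only if the case names a JELAS environment."""
    env_name: Optional[str] = case.get("jelas_env_name")
    jelas_result = ground_with_jelas(env_name) if env_name else None
    mode = "jelas-grounded" if jelas_result is not None else "llm-only"
    # Stages 1-4 would follow here and receive jelas_result (possibly None).
    return {"mode": mode, "jelas_result": jelas_result}


# Same case, with and without the switch field.
llm_only = run_pipeline({"case_id": "demo-1"})
grounded = run_pipeline({"case_id": "demo-1", "jelas_env_name": "demo_env"})
```

Because the downstream stages accept an optional grounding result, the two arms of an ablation differ only in the input JSON.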

Five testable claims

Each claim has a matching ablation experiment built into the framework:

#	Claim	Mechanism
C1	Cyber risk is better modelled as structured reasoning than a single opaque prediction.	Five dedicated agents, each with its own schema and prompt.
C2	Domain-aligned roles (Exposure / Likelihood / Impact) yield more interpretable intermediate state than generic Planner / Reviewer roles.	Role-specific Pydantic v2 schemas keep each agent’s contribution typed and inspectable.
C3	Evidence-grounded reasoning improves coherence and trustworthiness.	Shared `EvidenceBundle` + optional JELAS neuro-symbolic facts injected before any analytical agent reasons.
C4	Lightweight conditional validation outperforms unconstrained multi-agent debate in this domain.	Conditional Critic that only fires when analytical agents disagree by ≥ 2 ordinal risk-level steps.
C5	Cross-case “analyst experience” can be reused without retraining the model.	`CaseMemory` retrieving top-k similar past cases via Jaccard × EWMA-recency, adapted from LLMTraveler (Wang et al., 2025).
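
The conditional Critic trigger in C4 can be sketched as a spread check over ordinal levels. This is an illustration under assumptions: the four-step scale and the function names are hypothetical, and only the "≥ 2 ordinal steps" rule comes from the table above.

```python
# Assumed ordinal scale, lowest to highest.
LEVELS = ["low", "medium", "high", "critical"]


def level_spread(levels: list[str]) -> int:
    """Distance in ordinal steps between the lowest and highest reported level."""
    ranks = [LEVELS.index(level) for level in levels]
    return max(ranks) - min(ranks)


def critic_should_fire(levels: list[str], threshold: int = 2) -> bool:
    """True when the analytical agents disagree by at least `threshold` steps."""
    return level_spread(levels) >= threshold


# Exposure, Likelihood, Impact levels for two example cases.
disagreeing = critic_should_fire(["low", "high", "medium"])   # spread of 2
agreeing = critic_should_fire(["medium", "high", "high"])     # spread of 1
```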

Design principles

Domain-aligned, not role-generic. Agents correspond to the conceptual dimensions of cyber-risk reasoning (exposure / likelihood / impact), not to generic software roles like planner / reviewer. This makes the intermediate state interpretable to security analysts.

Schema-first. Every inter-agent message is a typed Pydantic v2 model. This gives the LLM a hard contract, enables auto-normalisation (level strings are lowercased and validated against enums), and keeps outputs structurally identical across ablation runs so they can be compared directly.
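
A sketch of what such a schema could look like in Pydantic v2, showing lowercasing followed by enum validation. The `RiskLevel` enum, the `ExposureAssessment` model, and its fields are hypothetical stand-ins; the prototype's real schemas may differ.

```python
from enum import Enum

from pydantic import BaseModel, Field, ValidationError, field_validator


class RiskLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


class ExposureAssessment(BaseModel):
    level: RiskLevel
    confidence: float = Field(ge=0.0, le=1.0)
    rationale: str

    @field_validator("level", mode="before")
    @classmethod
    def normalise_level(cls, value):
        # Runs before enum validation, so " High " becomes "high" first.
        return value.strip().lower() if isinstance(value, str) else value


# A loosely formatted level string is normalised and accepted.
ok = ExposureAssessment(level=" High ", confidence=0.8, rationale="Public RDP port.")

# A level outside the enum is rejected rather than passed downstream.
try:
    ExposureAssessment(level="severe", confidence=0.8, rationale="n/a")
    rejected = False
except ValidationError:
    rejected = True
```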

Dual-mode by construction. LLM-only and neuro-symbolic modes share one entry point. The presence or absence of jelas_env_name is the only switch. This lets the dissertation present clean A/B comparisons without maintaining two code paths.

Graceful degradation. Missing JELAS files, missing memory entries, and disabled components all degrade silently rather than crashing the pipeline. This matters for unattended experiment sweeps.
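
One way to get this behaviour for missing JELAS files is to have the loader return `None` instead of raising, so later stages simply run without grounding. The sketch below assumes one JSON file per environment. That layout, the `load_jelas_facts` name, and the logger name are assumptions for illustration.

```python
import json
import logging
from pathlib import Path
from typing import Any, Optional

logger = logging.getLogger("pipeline")


def load_jelas_facts(env_name: str, root: Path) -> Optional[dict[str, Any]]:
    """Return pre-computed facts, or None if the file is missing or unreadable."""
    path = root / f"{env_name}.json"
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        # Record the reason at debug level, but never abort the sweep.
        logger.debug("JELAS facts unavailable for %s: %s", env_name, exc)
        return None


# A missing environment yields None rather than an exception.
facts = load_jelas_facts("missing_env", Path("/nonexistent/jelas"))
```

Logging at debug level keeps the degradation quiet during sweeps while still leaving a trace for later diagnosis.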

Block-structured prompts. Each agent’s user message is assembled from independent blocks — Profile / Memory / Case / Evidence / Guidance. Each block can be ablated by a single config flag, which is what makes the sensitivity analysis cheap to run.
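
The block assembly can be sketched as follows, assuming one boolean flag per block. The `PromptConfig` fields, the `build_user_message` function, and the block header format are hypothetical. Only the five block names come from the description above.

```python
from dataclasses import dataclass

BLOCK_ORDER = ["profile", "memory", "case", "evidence", "guidance"]


@dataclass(frozen=True)
class PromptConfig:
    # One flag per block; flipping a flag ablates that block.
    use_profile: bool = True
    use_memory: bool = True
    use_case: bool = True
    use_evidence: bool = True
    use_guidance: bool = True


def build_user_message(blocks: dict[str, str], cfg: PromptConfig) -> str:
    """Join the enabled, non-empty blocks in a fixed order."""
    parts = []
    for name in BLOCK_ORDER:
        text = blocks.get(name, "").strip()
        if getattr(cfg, f"use_{name}") and text:
            parts.append(f"## {name.upper()}\n{text}")
    return "\n\n".join(parts)


blocks = {
    "profile": "Conservative analyst.",
    "memory": "Similar past case: exposed RDP, rated high.",
    "case": "Internet-facing RDP on a finance server.",
    "evidence": "Port 3389 open; no MFA.",
    "guidance": "Answer with a level and a rationale.",
}
full = build_user_message(blocks, PromptConfig())
no_memory = build_user_message(blocks, PromptConfig(use_memory=False))
```

Keeping the blocks independent means an ablated run differs from the full run by exactly one missing section of the prompt.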

Cross-cutting components

agents/components/
   ├── memory.py     # CaseMemory: cross-case experience store + retrieval
   └── profile.py    # AnalystProfile: behavioural calibration (4 axes)

CaseMemory retrieves the top-k most relevant past cases for an incoming case, scored as Jaccard similarity over case features × EWMA-recency.
AnalystProfile captures four behavioural axes (calibration / aggressiveness / domain bias / verbosity) that can be swapped per-experiment to study persona effects.
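
A sketch of the CaseMemory scoring, under one assumed reading of "EWMA-recency": a geometric decay weight (1 - alpha)^age, where age counts how many cases have been stored since. The `PastCase` type, the `alpha` default, and the feature sets are hypothetical, and the prototype's exact recency formula may differ.

```python
from dataclasses import dataclass


@dataclass
class PastCase:
    case_id: str
    features: frozenset[str]
    age: int  # cases stored since this one (0 = most recent)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: frozenset[str], memory: list[PastCase],
             k: int = 2, alpha: float = 0.3) -> list[PastCase]:
    """Top-k past cases by Jaccard similarity times an exponential recency weight."""
    def score(case: PastCase) -> float:
        recency = (1.0 - alpha) ** case.age  # assumed EWMA-style decay
        return jaccard(query, case.features) * recency

    return sorted(memory, key=score, reverse=True)[:k]


memory = [
    PastCase("old-exact", frozenset({"rdp", "vpn", "phishing"}), age=10),
    PastCase("recent-partial", frozenset({"rdp", "vpn", "s3"}), age=0),
    PastCase("unrelated", frozenset({"iot"}), age=1),
]
top = retrieve(frozenset({"rdp", "vpn", "phishing"}), memory)
top_ids = [case.case_id for case in top]
```

Here the recent partial match outranks the older exact match because the decay weight on the old case outweighs its higher similarity.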

Status

In progress — MSc Dissertation, NUS School of Computing (2026.01–present). Source available on request.