Medical Q&A System with LLMs (RAG)

Why a knowledge graph instead of a vector store

Most RAG systems chunk documents, embed them, and pull the top-k nearest neighbours into the prompt. That works fine for free-form prose but breaks down in medicine, where:

Relations carry meaning. “Drug X treats disease Y” and “drug X is contraindicated for disease Y” produce nearly identical embeddings but are clinically opposite.
Hallucinations cost more. A made-up dosage or contraindication is a safety issue, not just a UX issue.
Provenance must be traceable. Every fact in the answer should resolve to a structured source the user can audit.

This project replaces the vector store entirely with a typed knowledge graph on Neo4j, and converts the Q&A loop into: parse the question → resolve entities into KG nodes → run a templated Cypher query → let the LLM phrase the structured result.

Knowledge graph design

The KG is built from the DiseaseKG dataset (Open-KG), modelling Chinese clinical knowledge.

Entities — 8 types, ~44.6k nodes

Type	Count	Notes
Disease	8,808	7 attributes: description, cause, prevention, treatment duration, cure probability, susceptibility, basic-info
Drug	3,828	Linked back to Producers
Food	4,870	Split into “do-eat” / “avoid” via relation type, not separate node types
Check	3,353	Diagnostic exams
Department	54	Clinical departments — used for routing
Producer	17,201	Drug manufacturers (commercial brand layer)
Symptom	5,998	Patient-facing complaints
Cure	544	Therapy / procedure types

Relations — 11 types, ~312k edges

Disease ↔ Symptom · Disease ↔ Common-Drug · Disease ↔ Recommended-Drug · Disease ↔ Required-Check · Disease ↔ Recommended-Diet · Disease ↔ Forbidden-Diet · Disease ↔ Complication · Disease ↔ Therapy · Disease ↔ Department (belongs-to) · Drug ↔ Producer · Drug ↔ On-sale-Drug.

The two-tier “common drug” vs. “recommended drug” split matters: it lets the agent distinguish “what’s typically prescribed” from “what’s clinically optimal” — a distinction LLM-only answers tend to flatten.

Why Neo4j

Cypher pattern matching is the natural query language for “what does the graph say about X?”
Uniqueness constraints on :Disease(name), :Symptom(name), etc. are index-backed, so entity-resolution lookups are index seeks rather than full label scans.
Multi-hop traversals (e.g. disease → complication → symptom) are one query, not a JOIN cascade.
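
A minimal sketch of how the name constraints could be generated, assuming one uniqueness constraint per label on the name property. The labels come from the entity table; the constraint names and the assumption that name is unique for every label are mine, and the statement shape follows Neo4j 5 syntax.

```python
# Labels come from the entity table above; the constraint names are made up.
LABELS = ["Disease", "Drug", "Food", "Check", "Department", "Producer", "Symptom", "Cure"]


def unique_name_constraint(label: str) -> str:
    """Build one idempotent uniqueness-constraint statement for `label`.name."""
    if not label.isidentifier():
        # Labels are interpolated into the statement, so reject anything odd.
        raise ValueError(f"unsafe label: {label!r}")
    return (
        f"CREATE CONSTRAINT {label.lower()}_name IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.name IS UNIQUE"
    )


statements = [unique_name_constraint(label) for label in LABELS]
```

Each statement would be run once at load time; IF NOT EXISTS makes re-running the setup harmless.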

NER for medical entity extraction

A user query like “糖尿病患者能吃苹果吗？” (“Can patients with diabetes eat apples?”) needs to surface two entities (糖尿病, diabetes, as Disease; 苹果, apple, as Food) before the system can pick the right Cypher template. Off-the-shelf Chinese NER misses domain terms, so a dedicated model was trained.

Architecture

Input tokens
  │
  ▼
chinese-roberta-wwm-ext   (contextual embeddings)
  │
  ▼
2-layer BiLSTM            (sequence-level dependencies)
  │
  ▼
Linear classifier         (per-token logits)
  │
  ▼
BIO tags                  (B-Disease, I-Disease, B-Symptom, …, O)

Why this stack: RoBERTa already gives strong contextual representations, but BIO sequence labelling benefits from the explicit sequential bias of a BiLSTM head — and the resulting model is small enough to run on a single GPU.
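
The last step of the stack, turning per-token BIO tags into entity spans, can be sketched as below. This assumes character-level tokens (typical for Chinese BERT-style tokenizers) and drops stray I- tags that do not continue an open span; both choices are assumptions, not a description of the trained model's decoder.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse per-token BIO tags into (entity_type, surface_text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any span that is still open.
            if current_type is not None:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_type is not None:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, "".join(current_tokens)))
    return spans


tokens = list("糖尿病患者能吃苹果吗")
tags = ["B-Disease", "I-Disease", "I-Disease", "O", "O", "O", "O", "B-Food", "I-Food", "O"]
spans = bio_to_spans(tokens, tags)
```

For the example query, spans holds one Disease span (糖尿病) and one Food span (苹果), which is what the alignment step consumes.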

Data augmentation — the F1 lift

Baseline F1 on the held-out test set was 96.77%. Three augmentation strategies pushed it to 97.40%:

Entity replacement. Swap a recognised entity with another of the same type from the KG vocabulary, keeping the BIO labels aligned. Forces the model to rely on context, not memorised surface forms.
Entity masking. Replace entity tokens with [MASK] and let the model recover them — increases robustness to OOV terms.
Entity concatenation. Glue two short single-entity sentences together to create harder multi-entity examples, addressing the rare case where two entities of the same type appear in one query.
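
The three strategies can be sketched on character-level (tokens, tags) pairs as follows. The helper names, the tiny vocabulary, and the separator token are illustrative assumptions; the project's actual augmentation code is not public.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[List[str], List[str]]  # (character tokens, BIO tags)


def find_entity(tags: List[str]) -> Tuple[int, int, str]:
    """Return (start, end, type) of the first entity; end is exclusive."""
    start = next(i for i, t in enumerate(tags) if t.startswith("B-"))
    end = start + 1
    while end < len(tags) and tags[end].startswith("I-"):
        end += 1
    return start, end, tags[start][2:]


def replace_entity(ex: Example, vocab: Dict[str, List[str]], rng: random.Random) -> Example:
    """Swap the first entity for another surface form of the same type."""
    tokens, tags = ex
    start, end, etype = find_entity(tags)
    new = list(rng.choice(vocab[etype]))
    # Regenerate the tags so they stay aligned with the new entity length.
    new_tags = [f"B-{etype}"] + [f"I-{etype}"] * (len(new) - 1)
    return tokens[:start] + new + tokens[end:], tags[:start] + new_tags + tags[end:]


def mask_entity(ex: Example) -> Example:
    """Replace every token of the first entity with [MASK]; tags are unchanged."""
    tokens, tags = ex
    start, end, _ = find_entity(tags)
    return tokens[:start] + ["[MASK]"] * (end - start) + tokens[end:], list(tags)


def concat_examples(a: Example, b: Example, sep: str = "，") -> Example:
    """Join two examples with a separator token tagged O."""
    return a[0] + [sep] + b[0], a[1] + ["O"] + b[1]


ex = (list("糖尿病吃什么"), ["B-Disease", "I-Disease", "I-Disease", "O", "O", "O"])
vocab = {"Disease": ["高血压", "感冒"]}  # stand-in for the KG vocabulary
rng = random.Random(0)
replaced = replace_entity(ex, vocab, rng)
masked = mask_entity(ex)
joined = concat_examples(ex, replaced)
```

In all three cases the token and tag lists keep the same length, which is the invariant that matters for sequence labelling.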

The 0.63-point F1 gain may sound small, but at the upper end of the curve each false positive becomes a wrongly routed Cypher query, so the marginal value is high.
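
One way to see why the gain matters is to express it relative to what was left to gain. The arithmetic below treats 1 − F1 as a rough residual error, which is an interpretive assumption rather than a reported metric; on that reading the augmentation removes about 19.5% of the remaining error.

```python
baseline_f1 = 96.77   # from the text
augmented_f1 = 97.40  # from the text

absolute_gain = augmented_f1 - baseline_f1     # in F1 points
residual_before = 100.0 - baseline_f1          # "1 - F1", in points
relative_reduction = absolute_gain / residual_before

print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative reduction in (1 - F1): {relative_reduction:.1%}")
```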

Entity alignment

NER produces spans, not KG node IDs, and surface forms vary (“II型糖尿病” vs “2型糖尿病”, both meaning type 2 diabetes). Each recognised mention is matched to the closest node of the right type via TF-IDF cosine similarity over node-name + alias text. The approach is cheap and deterministic, and needs no embedding model at inference time.
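
A dependency-free sketch of this matching step, assuming character bigrams as the TF-IDF terms, a smoothed IDF, and one index per entity type. The node IDs, aliases, and n-gram choice are illustrative; the text does not specify the tokenisation actually used.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Character n-grams; strings shorter than n yield themselves."""
    return [text[i:i + n] for i in range(len(text) - n + 1)] or [text]


def build_index(nodes: Dict[str, Tuple[str, List[str]]]):
    """nodes maps node_id -> (name, aliases). Returns (idf, tf-idf vectors)."""
    counts = {
        node_id: Counter(g for s in [name, *aliases] for g in char_ngrams(s))
        for node_id, (name, aliases) in nodes.items()
    }
    df = Counter(g for c in counts.values() for g in c)  # document frequency
    n_docs = len(counts)
    idf = {g: math.log((1 + n_docs) / (1 + d)) + 1.0 for g, d in df.items()}
    vectors = {nid: {g: tf * idf[g] for g, tf in c.items()} for nid, c in counts.items()}
    return idf, vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def align(mention: str, idf, vectors) -> Tuple[str, float]:
    """Return the best-matching node id and its cosine similarity."""
    q = {g: tf * idf[g] for g, tf in Counter(char_ngrams(mention)).items() if g in idf}
    return max(((nid, cosine(q, v)) for nid, v in vectors.items()), key=lambda p: p[1])


# Hypothetical Disease index: node_id -> (name, aliases).
disease_nodes = {
    "d1": ("2型糖尿病", ["II型糖尿病"]),
    "d2": ("1型糖尿病", ["I型糖尿病"]),
    "d3": ("高血压", []),
}
idf, vectors = build_index(disease_nodes)
best_id, score = align("II型糖尿病", idf, vectors)
```

In this toy index the mention resolves to d1 only because the alias is indexed: on names alone, “II型糖尿病” overlaps “1型糖尿病” and “2型糖尿病” equally, which is why the alias text matters.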

Intent recognition without a labelled classifier

Sixteen intent types are needed to cover the question space, e.g.:

ask_symptom · ask_cause · ask_complication · ask_treatment · ask_check · ask_department · ask_drug · ask_drug_producer · ask_diet_do · ask_diet_avoid · ask_prevent · ask_cure_time · ask_cure_prob · ask_susceptible · ask_disease_intro · ask_overview

A trained classifier would mean labelling thousands of Chinese medical queries, which is costly and brittle as the schema evolves. Instead, intent is handled by a 34B LLM via prompt engineering:

A system prompt enumerates the 16 intents with one-line descriptions.
3–5 few-shot examples per intent ground the model in the expected output format.
A chain-of-thought scratch step asks the model to first identify the entities, then reason about which intent best matches, before emitting the final intent label as JSON.
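
The three prompt components above can be sketched as a prompt builder plus a strict parser for the final JSON label. Only two of the sixteen intents are shown; the descriptions, the few-shot example, and the prompt wording are illustrative assumptions, and no model is called here.

```python
import json
import re
from typing import Dict, List, Tuple

# Two of the sixteen intents; the one-line descriptions are illustrative.
INTENTS: Dict[str, str] = {
    "ask_symptom": "the user asks which symptoms a disease causes",
    "ask_diet_avoid": "the user asks which foods to avoid with a disease",
}

# Hypothetical few-shot pair ("What should patients with diabetes not eat?").
FEW_SHOT: List[Tuple[str, str]] = [
    ("糖尿病患者不能吃什么？", "ask_diet_avoid"),
]


def build_prompt(question: str) -> str:
    """Assemble intent list, few-shot examples, and the CoT instruction."""
    lines = ["Classify the question into exactly one intent.", "Intents:"]
    lines += [f"- {name}: {desc}" for name, desc in INTENTS.items()]
    lines.append("Examples:")
    for q, intent in FEW_SHOT:
        lines.append(f"Q: {q}\nA: " + json.dumps({"intent": intent}))
    lines.append(
        "First list the entities, then reason about the best intent, "
        'then end with a JSON object of the form {"intent": "<label>"}.'
    )
    lines.append(f"Q: {question}")
    return "\n".join(lines)


def parse_intent(llm_output: str) -> str:
    """Pull the last JSON object out of the reply and validate the label."""
    candidates = re.findall(r"\{[^{}]*\}", llm_output)
    if not candidates:
        raise ValueError("no JSON object in model output")
    intent = json.loads(candidates[-1]).get("intent")
    if intent not in INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")
    return intent
```

Validating the label against the known intent set keeps a malformed or invented label from ever reaching template selection.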

Trade-off: latency is higher than a classifier, but adding a new intent is a prompt edit, not a retraining job — important for a research artifact that’s still iterating on the schema.

Retrieval and generation flow

User query
   │
   ▼
[1] Intent classifier (LLM + few-shot + CoT)  → intent ∈ {16}
   │
   ▼
[2] BERT-NER (RoBERTa + BiLSTM)               → mentions[]
   │
   ▼
[3] TF-IDF entity alignment                   → KG node IDs[]
   │
   ▼
[4] Templated Cypher per (intent, entity-type) → KG triples
   │
   ▼
[5] LLM answer synthesis (Qwen / Llama)        → grounded response
   │   conditioned on retrieved triples + user query
   ▼
Final answer (with the underlying triples available for inspection)
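
The five stages in the diagram can be wired together roughly as follows. Every stage is passed in as a callable so the sketch stays runnable without a model or database; the Trace structure, the function signatures, and the use of node names in place of node IDs are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


@dataclass
class Trace:
    """Everything the pipeline produced, kept for later inspection."""
    intent: str = ""
    mentions: List[Tuple[str, str]] = field(default_factory=list)  # (type, text)
    node_names: List[str] = field(default_factory=list)
    triples: List[Triple] = field(default_factory=list)
    answer: str = ""


def answer_question(
    query: str,
    classify: Callable[[str], str],
    extract: Callable[[str], List[Tuple[str, str]]],
    align: Callable[[str, str], str],
    retrieve: Callable[[str, str, str], List[Triple]],
    synthesize: Callable[[str, List[Triple]], str],
) -> Trace:
    trace = Trace()
    trace.intent = classify(query)                              # [1] intent
    trace.mentions = extract(query)                             # [2] NER
    for etype, text in trace.mentions:
        name = align(etype, text)                               # [3] alignment
        trace.node_names.append(name)
        trace.triples += retrieve(trace.intent, etype, name)    # [4] Cypher
    trace.answer = synthesize(query, trace.triples)             # [5] synthesis
    return trace


# Stub stages with placeholder data ("food_A" is not a real recommendation).
trace = answer_question(
    "糖尿病患者不能吃什么？",
    classify=lambda q: "ask_diet_avoid",
    extract=lambda q: [("Disease", "糖尿病")],
    align=lambda etype, text: text,
    retrieve=lambda intent, etype, name: [(name, "no_eat", "food_A")],
    synthesize=lambda q, triples: "、".join(t[2] for t in triples),
)
```

Returning the whole trace rather than just the answer is what makes the underlying triples available for inspection.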

Each (intent, entity-type) pair maps to a fixed Cypher template. For example:

// ask_diet_avoid for a Disease
MATCH (d:Disease {name: $name})-[:no_eat]->(f:Food)
RETURN f.name AS food

Templates are deterministic and audited once. The LLM never writes Cypher, and raw user text is never spliced into a query: the only user-derived value is the resolved node name, passed as the bound parameter $name. That closes the usual path from prompt injection to Cypher injection.
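
A sketch of the template lookup, assuming a plain dictionary keyed by (intent, entity type). Only the ask_diet_avoid template comes from the text; the registry shape and the build_query helper are hypothetical.

```python
from typing import Dict, Tuple

# Only the ask_diet_avoid template is taken from the text; a real registry
# would hold one entry per supported (intent, entity type) pair.
TEMPLATES: Dict[Tuple[str, str], str] = {
    ("ask_diet_avoid", "Disease"): (
        "MATCH (d:Disease {name: $name})-[:no_eat]->(f:Food) "
        "RETURN f.name AS food"
    ),
}


def build_query(intent: str, entity_type: str, node_name: str) -> Tuple[str, Dict[str, str]]:
    """Look up the fixed template and bind the resolved name as a parameter."""
    try:
        cypher = TEMPLATES[(intent, entity_type)]
    except KeyError:
        raise LookupError(f"no template for ({intent}, {entity_type})") from None
    # The name travels separately from the query text, so it is never
    # interpolated into the Cypher string.
    return cypher, {"name": node_name}


cypher, params = build_query("ask_diet_avoid", "Disease", "糖尿病")
```

With the official Neo4j Python driver, the pair would typically be handed to session.run(cypher, params), which sends the parameters alongside the query text rather than inside it.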

The synthesis stage is where the LLM earns its keep: turning a list of foods into a fluent answer (“糖尿病患者建议避免：…”, “Patients with diabetes are advised to avoid: …”) while respecting the retrieved set as the source of truth.

Frontend and ops

Streamlit UI with login (user / admin roles), persisted conversation history, and a runtime model switch between Qwen and Llama for side-by-side comparison.
Multi-window conversations — useful when validating answers across different LLMs without losing the previous turn’s context.
Admin view for inspecting raw NER output, resolved KG nodes, and the executed Cypher per question — the “show your work” panel that turned out to be the most valuable debug surface.

What I’d build next

Confidence scoring: per-answer score combining NER confidence, entity-alignment cosine similarity, and Cypher result-set size (empty result ≠ confident “no”).
Conflict detection: when the same patient profile yields contradictory diet recommendations across diseases (do-eat vs. avoid for the same food), surface the conflict instead of silently picking one.
Multilingual NER: extending the model to handle code-switched Chinese / English medical terms — common in Singapore clinical contexts.
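
The conflict-detection idea reduces to a set intersection between one disease's do-eat foods and another disease's avoid foods. The sketch below is one possible shape for it, using placeholder names rather than real clinical data; the function and its input format are assumptions, since this feature is not built yet.

```python
from typing import Dict, List, Set, Tuple


def diet_conflicts(
    do_eat: Dict[str, Set[str]], avoid: Dict[str, Set[str]]
) -> List[Tuple[str, str, str]]:
    """Return (food, recommending_disease, forbidding_disease) triples."""
    conflicts = []
    for rec_disease, rec_foods in do_eat.items():
        for avoid_disease, avoid_foods in avoid.items():
            if rec_disease == avoid_disease:
                continue  # only cross-disease conflicts are of interest here
            for food in sorted(rec_foods & avoid_foods):
                conflicts.append((food, rec_disease, avoid_disease))
    return conflicts


# Placeholder names, not real clinical data.
do_eat = {"disease_A": {"food_1", "food_2"}, "disease_B": {"food_3"}}
avoid = {"disease_A": {"food_3"}, "disease_B": {"food_2"}}
conflicts = diet_conflicts(do_eat, avoid)
```

Each returned triple names the food and both diseases, so the answer can surface the disagreement instead of silently choosing a side.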

Status

Completed (research-assistant role, 2025-01 — 2025-05). Source code is held by the host lab and is not publicly available; architecture is described above.