Hi, my name is

Chunxiao(Elin) Ren.

Ship software at the intersection of MLE and full-stack engineering.

Master of Computing (CS specialization) @ NUS. I ship end-to-end — from data and modeling to deployed UIs on the cloud — across LLM agents, applied ML pipelines, and full-stack web. Open to SWE / MLE / DS internships in Singapore.

Building Climate Policy Evidence Knowledge Graph Platform, Cyber Risk Assessment Platform
Shipping applied ML pipelines (AVM, recommenders) + cloud deploys
Open to SWE / MLE / DS internships in Singapore

01. About Me

I’m a Master of Computing student (Computer Science specialization) at the National University of Singapore, with a double-degree BSc background in Software & System Engineering (LUT, Finland) and Computer Science (HBUT, China). I work as a generalist across MLE and full-stack — owning the path from data and modeling through to deployed UIs on the cloud, rather than handing off a notebook.

Climate Policy Evidence Knowledge Graph Platform (Neo4j + GraphRAG) · Cyber Risk Assessment Platform · multi-agent cyber-risk dissertation (NUS) · vertical code-gen agent at Seeyon · medical Q&A pipeline over a Neo4j disease KG.

02. Research

Climate Change Wiki home — Topic → Policy Instrument → Outcome → Evidence Papers entry cards with the four Drivers root categories visible (Policy and Regulation, Physical Climate Shock, Technology and Market Shifts, Other Drivers) and per-card paper counts.

Research

Climate Policy Evidence Knowledge Graph Platform

[in progress] · 2026-01 — present

Chunxiao Ren

Research Assistant @ National University of Singapore

A research platform that turns climate-policy PDF literature into a queryable, traceable evidence knowledge graph in bulk — letting an LLM Agent answer researcher questions under a whitelist-validated citation gate, active refusal when evidence is thin, and a four-layer anti-hallucination stack. The backend parses PDFs with MinerU, extracts Finding / Evidence / Driver / Outcome nodes with GPT-4 into a Neo4j Aura cloud graph; Query Router v2 single-pass classifies each question into T1–T4, where T3 metadata queries run through 26 deterministic Cypher templates at zero LLM cost, and T1/T2 fan out across five parallel retrieval routes (semantic / hybrid / graph expansion / community summaries / Cypher precise fallback) before a constrained Agent assembles a structured Section answer; confidence is scored by an independent programmatic dual-track system and offline-calibrated via user-feedback-driven OLS, not self-reported by the LLM. The React + Vite frontend renders answers section-by-section over SSE events, with a built-in force-directed graph visualization and interactive neighbor expansion; deployed on a cloud server and exposed over HTTPS through Caddy + ngrok. Architecture, anti-hallucination stack, and screenshots in the deep dive.

Python
Neo4j
MinerU (PDF parsing)
Vector + GraphRAG retrieval
GPT-5.4-mini (keyword extraction)
OLS calibration
React + Vite
react-force-graph-2d
SQLite (feedback)

Read deep dive →

Research

Domain-Specific Agents: A Cyber Risk Multi-Agent Framework

[in progress] · 2026-01 — present

Chunxiao Ren

MSc Dissertation @ NUS School of Computing

A domain-structured, evidence-grounded multi-agent framework for cyber-risk analysis (MSc dissertation prototype). The system decomposes the reasoning task into role-specialised agents — Exposure / Likelihood / Impact / Coordinator / Critic — each with its own typed Pydantic schema and prompt, and supports two execution modes: an LLM-only pipeline, and a JELAS-grounded neuro-symbolic pipeline that injects pre-computed knowledge-graph + Datalog risk facts before any LLM call. The framework targets five testable claims: (C1) cyber risk is better modelled as structured reasoning than a single opaque prediction; (C2) domain-aligned roles yield more interpretable intermediate state than generic planner / reviewer roles; (C3) evidence-grounded reasoning improves coherence; (C4) lightweight conditional validation outperforms unconstrained multi-agent debate; (C5) cross-case "analyst experience" can be reused without retraining via a Jaccard × EWMA-recency CaseMemory adapted from LLMTraveler. Block-structured prompts make every component cheaply ablatable, so each claim has a matching A/B experiment.

Python
Pydantic v2 (typed schemas)
LLM orchestration
JELAS neuro-symbolic engine
Datalog
CaseMemory (Jaccard × EWMA-recency)
Block-ablatable prompts

Read deep dive →

Research

Cyber Risk Assessment Platform

[completed] · 2025-09 — 2026-02

Chunxiao Ren

Research Assistant — Lead Developer @ NUS School of Computing

A web platform that delivers cyber-risk assessments to insurance underwriters and SME operators through a guided intake → analysis → report workflow. As the main in-team developer on the product side, I owned the end-to-end web stack — frontend, backend, database, authentication and role-based access, admin tooling, feedback collection, and cloud deployment — wrapping the team's underlying risk engine into a product an underwriter can complete in roughly ten minutes from a single company URL. Internal, pre-commercial project.

Python
Flask
Tailwind CSS
Authlib (Google OAuth)
bcrypt
SQLite
OpenAI API (SSE)
SentenceTransformers
Jina Reader API
PyKEEN / NetworkX
gunicorn
ngrok
Vagrant

Read deep dive →

Research

Medical Q&A System with LLMs (RAG)

[completed] · 2025-01 — 2025-05

Chunxiao Ren

Research Assistant @ Lappeenranta University of Technology(LUT)

A medical Q&A pipeline that replaces vector-store RAG with structured Cypher retrieval over a Neo4j knowledge graph — to address LLMs' hallucination problem in safety-critical domains. Built on DiseaseKG (~44.6k entities, ~312k edges); NER fine-tunes chinese-roberta-wwm-ext + BiLSTM, intent recognition runs as few-shot prompting on a 34B LLM, and answers synthesise from retrieved triples via Qwen / Llama (UI-switchable). Streamlit frontend with user / admin login. Knowledge-graph schema, augmentation strategies, and the full retrieval flow are in the deep dive.

Python
Neo4j 5.18
chinese-roberta-wwm-ext
BiLSTM (2-layer) + Linear classifier
BIO tagging
TF-IDF entity alignment
34B LLM (intent, few-shot + CoT)
Qwen / Llama
Streamlit

Read deep dive →

03. Where I’ve Worked

AI Large Model Engineer Intern

@ Beijing Seeyon Internet Software

2025-05 — 2025-08 · Beijing, China · CoMi Agent / V5 PaaS

Benchmarked Qwen, GLM, Llama and DeepSeek for enterprise workflow code generation via structured evaluation and prompt engineering, providing the basis for selecting and enhancing Seeyon's in-house model.
Fine-tuned the proprietary LLM 'CoMi' to generate Python business-logic scripts for the V5 PaaS platform, enabling automated OA workflow / template creation with 90%+ executable accuracy and a 20% reduction in manual configuration workload.
Built a high-quality fine-tuning dataset from real workflow documents and introduced Semantic Consistency Loss + AST Loss, improving both syntactic correctness and business-logic reliability of generated scripts.

Python
PyTorch
LLM Fine-tuning
SFT/LoRA
Qwen
GLM
DeepSeek
Prompt Engineering

04. Things I’ve Built

Featured Project

Singapore Public Housing Automated Valuation Model

2025-08 — 2025-12 · Collaborative Development

End-to-end ML pipeline for HDB resale price prediction on a Kaggle dataset (162,691 train / 50,000 test transactions, 2017–2025), augmented with five categories of geospatial POIs (~774 points: MRT, primary schools, secondary schools, malls, hawker centres) pulled from Singapore government open APIs. Engineered ~20 proximity features per sample using sklearn BallTree + Haversine, with dual-radius density counts and tier flags (top primary schools, MRT core lines, flagship malls). Final model: CatBoost + LightGBM + XGBoost stacking with 5-fold OOF and a no-intercept linear meta-learner; monotonic constraints on floor area and remaining lease; 3-seed averaging; a two-stage refiner for the top-10% high-price tail. Validation log-RMSE dropped from 0.061 (v2) to 0.050 (v3) — about 18% improvement.

BallTree + Haversine over ~774 POIs: ~20 proximity features per sample (dual-radius density, nearest distance, KNN-3, tier flags).
Stacking ensemble — CatBoost + LightGBM + XGBoost with 5-fold OOF and a no-intercept linear meta-learner.
Two-stage refinement for the top-10% high-price tail; 3-seed averaging (42 / 100 / 2025).
Validation log-RMSE 0.061 → 0.050 (≈ 18% improvement) across the v2 → v3 evolution.

Python
CatBoost
LightGBM
XGBoost
scikit-learn (BallTree, Haversine)
Stacking + linear meta
pandas
NumPy

Task & data

Predict the resale price of Singapore HDB public housing flats. Main dataset is a Kaggle release of HDB resale transactions; supplementary geospatial data is pulled live from Singapore’s government open-data APIs.

Train: 162,691 transactions, Test: 50,000 transactions
Time range: 2017 — 2025
Target: RESALE_PRICE in SGD — right-skewed, median ≈ S$488k, mean ≈ S$518k, tail extending past S$1.6M
Metric: RMSE in log-price space

The price distribution is heavy-tailed and the time trend is strong: 2017 medians ≈ S$380k vs. 2025 ≈ S$620k (+60%). Both observations directly drive design choices below — the time-decay sample weighting and the high-price two-stage refiner.

Auxiliary geospatial data

Five POI categories pulled from Singapore government open APIs (data.gov.sg, LTA DataMall, OneMap, MOE, NEA), totalling ~774 points:

File	Count	Source
`sg-mrt-stations.csv`	243	LTA DataMall
`sg-primary-schools.csv`	182	data.gov.sg / MOE
`sg-secondary-schools.csv`	153	data.gov.sg / MOE
`sg-shopping-malls.csv`	89	data.gov.sg / OneMap
`sg-gov-hawkers.csv`	107	data.gov.sg / NEA
`sg-hdb-block-details.csv`	9,660	data.gov.sg HDB Property Information

The HDB block file is not a POI — it’s the property index: per-block lat/lon, planning area, and MAX_FLOOR, used to back-fill geocodes onto every transaction (matching by BLOCK only after a merge-validation analysis showed BLOCK + TOWN produced 115% match rate via duplicates and ADDRESS matching produced 0%; deduping the block file by retaining the row with the largest MAX_FLOOR per block resolves this).

Feature engineering

Time features

Parsed from the MONTH field (YYYY-MM):

year, month_num
month_sin / month_cos — cyclic encoding so December and January are neighbours, not opposites
flat_age = year - lease_commence_data
lease_left = 99 - flat_age (HDB leases are 99 years)
is_new flag for ≤ 5-year-old flats

Floor features

FLOOR_RANGE is a string like "07 TO 09". Decomposed into:

floor_num — lower bound
avg_floor — midpoint
floor_range — coarsened bin (Low / Mid-Low / Mid / Mid-High / High / Very High / Top)

Geospatial features — BallTree + Haversine

The core of the feature engineering. For each transaction, build a BallTree over each POI category’s lat/lon under the Haversine metric (great-circle distance on the sphere) and run nearest-neighbour queries.

Why BallTree. Brute-force nearest-neighbour over 162k transactions × 774 POIs is O(N·M); BallTree gives O(N · log M) — roughly 10× faster in this configuration.

Dual-radius design. A single nearest-distance feature can’t tell “one MRT 800m away” apart from “three MRT lines clustering nearby”. Each POI category therefore uses an inner radius for “tightly served” and an outer radius for “broadly served”:

POI	Inner	Outer	Extra
MRT	800 m	1500 m	nearest distance · nearest-station-name (categorical)
Primary school	600 m	1000 m	nearest distance · nearest-school-name (categorical)
Secondary school	800 m	1500 m	nearest distance
Mall	1000 m	2000 m	KNN-3 mean distance · nearest-mall-name (categorical)
Hawker centre	500 m	1200 m	nearest distance

That yields ~20 geo features per sample: 5 nearest distances + 10 dual-radius counts + 2 KNN-3 means + 3 nearest-POI categorical IDs.

Tier flags

A second pass marks whether the nearest POI is “elite” — categorical density alone misses the prestige signal:

nearest_primary_top — is the nearest primary school on the Top Primary list?
nearest_mrt_core_line — is the nearest MRT on a core line (DTL / TEL / NSL / EWL / CCL / NEL)?
nearest_mall_flagship — VivoCity / Ion Orchard / etc.?

Categorical encoding

CatBoost handles categoricals natively via Ordered Target Statistics, so most fields stay raw: town, flat_model, flat_type, floor_range. One additional engineered feature: model_rank, the flat_model ordinal sorted by its market median price — a cheap price-aware encoding for the gradient-boosted models.

ECO_CATEGORY (100% "uncategorized") is dropped. BLOCK and STREET are dropped post-merge in favour of the geocoded coordinates.

Sample weighting — time × price

Two factors compose into one sample weight:

time_weight  = exp(decay_rate × (year − min_year))     # recent transactions weigh more
price_weight = 1 + α × 1[price > price_90th]           # high-price boost
sample_weight = time_weight × price_weight

Time decay reflects the EDA finding that 2023–2025 prices behave differently from 2017–2018. The price boost trains the model to spend extra capacity on the heavy-tailed top decile, which would otherwise be averaged-out by the volume of mid-market 4-room flats.

Monotonic constraints

CatBoost monotone constraints on:

floor_area_sqm — strictly increasing
lease_left — strictly increasing

These prevent the model from learning paradoxical predictions like “+10 sqm → cheaper” that emerge when local correlations between area and other features (e.g. older blocks happen to be larger) flip the marginal effect.

Stacking architecture

Level-0 base learners (5-fold OOF predictions)
  ├── CatBoost      n_estimators=10000, lr=0.033, depth=8, od_wait=350
  ├── LightGBM      n_estimators=5000,  lr=0.035, num_leaves=64
  └── XGBoost       n_estimators=5000,  lr=0.035, max_depth=8

Level-1 meta-learner
  └── Linear regression (no intercept) over OOF base predictions

OOF predictions prevent the meta-learner from seeing the same row’s own training prediction (which would leak the label). The no-intercept choice is intentional: each base learner already produces well-calibrated price estimates, so the meta-learner just learns optimal convex-ish weights rather than re-shifting the mean.

Errors on the top decile (RESALE_PRICE > 90th percentile, ~ S$750k+) are systematically harder. A second CatBoost is fine-tuned on just this segment, then blended:

y_main = main stacking prediction (all data)
y_seg  = high-price-only fine-tune
y_final = w · y_seg + (1 − w) · y_main      # w grid-searched on validation

The blend weight w is selected by grid search on the validation set rather than learned end-to-end, to avoid this segment dominating the main stacker’s gradient.

Multi-seed averaging

Final submission averages predictions across three CatBoost seeds (42, 100, 2025), flag-controlled in the main training script. Reduces variance from a single random initialization without much added compute, since each seed’s OOF folds are independent.

Results

Version	Description	Val log-RMSE
v1	Baseline, no geo features	—
v2	Log-space training, single-radius geo	0.061
v2.5	Price-space + stacking	—
v3 (final)	Dual-radius geo + tier flags + segmented refiner	0.050

About 18% relative improvement over v2.

Feature importance (CatBoost, base feature set)

Feature	Importance
floor_area_sqm	29.36%
town	23.17%
year	18.39%
lease_commence_data	12.84%
flat_model	8.05%
avg_floor	3.17%
flat_type	2.52%
flat_age	1.48%

Area + town + year + lease together account for ~83% of importance. The geo features layer further information on top of town, distinguishing “a flat in Bukit Timah next to the MRT” from “a flat in Bukit Timah on the edge of the planning area”.

Key design decisions

CatBoost as the main learner, not XGBoost — native categorical support via Ordered Target Statistics avoids the encoding noise from manually one-hot / target-encoding town (26 levels), flat_model (21 levels), and flat_type (12 levels after de-duplication).
Price space, not log space, in v3 — log-space regression compresses high-price errors but the leaderboard metric is absolute. Median-price normalization (price / median_price) keeps numerical stability while letting the model optimize the actual error.
Dual-radius density, not single nearest distance — captures both proximity and richness of the surrounding facility set.
Block-only join + de-dup by MAX_FLOOR — chosen after testing BLOCK + TOWN (115% match via duplicates) and ADDRESS (0% match due to format inconsistency); deduping by MAX_FLOOR prefers the geocode anchored at the highest measured floor (better GPS quality).

What I’d build next

Holdout-time validation — current 5-fold splits are random; a time-based split (e.g. train ≤ 2024, validate 2025) would test the model’s forward-projection ability, which is what an AVM in production actually needs.
Quantile regression — predict P10 / P50 / P90 instead of point estimates to surface the uncertainty inherent in HDB pricing.
Lightweight neural retrieval over POIs — replace radius-counts with a learned attention over per-flat POI sets, conditioned on flat type.

Featured Project

Multi-Strategy Movie Recommendation System

2025-01 — 2025-06 · Collaborative Development

A systematic study covering five method families — demographic baseline, content-based recall (TF-IDF + CountVectorizer), KNN collaborative filtering (item / user), SVD matrix factorization with three optimizers (SGD / SGLD / SGHMC), and three hybrid pipelines — evaluated on MovieLens ml-1m. Best single-stage recall: User-CF at 14.54% hit rate; best rating-prediction model: SVD-SGHMC at 0.84117 RMSE.

Nine recommenders end-to-end: demographic, 2× content-based, 2× KNN-CF, 3× SVD optimizers, 3× hybrids.
User-CF reached 14.54% hit rate (878 / 6040, 15% test split) — strongest single-stage recall.
SVD with SGHMC sampler beat SGD and SGLD: 5-fold CV RMSE 0.84117 on MovieLens ml-1m.
Recall-then-rerank hybrids trade raw hit rate for rating-aware ordering.

Python
scikit-learn
TF-IDF / CountVectorizer
KNN (item / user)
SVD
SGD / SGLD / SGHMC
MovieLens ml-1m
TMDB 5000
pandas
NumPy

1. Demographic baseline

A non-personalized popularity prior using the IMDB-style weighted rating:

score = (v / (v + m)) · r + (m / (v + m)) · c

where r is a movie’s mean rating, v its vote count, m the 90th-percentile vote threshold, and c the global mean. Used as a cold-start fallback and a sanity-check leaderboard against the personalized models below.

2. Content-based recall — two variants

Both variants produce item–item nearest-neighbour lists, but draw from different views of the movie:

Text variant — TF-IDF over plot synopses + cosine similarity. Strong when the user has rated narratively-similar films.
Feature variant — CountVectorizer over a “metadata soup” (keywords + genres + top-3 cast + director). Captures categorical taste signals (e.g. “Christopher Nolan films”) that synopsis text alone tends to miss.

3. KNN collaborative filtering — two variants

Operates on the MovieLens ml-1m user × item rating matrix (~1M ratings, 6,040 users, ~3,900 movies).

Item-based KNN — score unseen movies for a user by the similarity-weighted ratings of their already-rated items.
User-based KNN — find the top-k similar users and surface films they rated highly that the target user hasn’t seen.

User-CF reached 14.54% hit rate (878 / 6040) on an 85/15 split — the strongest single-stage recall in the study.

4. SVD matrix factorization

Personalized rating prediction over a shared latent-factor model, comparing three optimizers:

Optimizer	Posterior treatment	5-fold CV RMSE
SGD	MAP point estimate	baseline
SGLD	Stochastic-gradient Langevin dynamics	improved
SGHMC	Stochastic-gradient Hamiltonian Monte Carlo	0.84117

The two Bayesian samplers (SGLD, SGHMC) explore the posterior over latent factors instead of collapsing to a single MAP point. SGHMC’s momentum-based proposals mix faster than SGLD’s pure-noise dynamics, and on ml-1m it delivered the best 5-fold CV RMSE.

5. Hybrid multi-stage pipelines

All three hybrids follow a recall-then-rerank pattern: a cheap recall stage narrows the catalogue, a more expensive rerank stage orders the survivors.

User-KNN → text similarity — KNN-recalled candidates rescored by content similarity to the user’s history.
User-KNN → Movie-KNN — switch from user-similarity recall to item-similarity rerank.
User-KNN → SVD — narrow to candidates from similar users, then re-rank by predicted ratings. Hit rate 9.97% (602 / 6040).

The hybrid hit rates are lower than User-CF alone, but the ordering is rating-aware — a better proxy for what users would actually choose to watch from the candidate set, even if fewer test-set items make the cut.

Datasets

TMDB 5000 — synopsis + metadata (keywords, genres, cast, director) for content-based recall.
MovieLens ml-1m — ~1M ratings from 6,040 users on ~3,900 movies, used for KNN-CF and SVD.

What I’d build next

Implicit-feedback signals — current pipeline only consumes explicit ratings; click / view / watch-time data would expand the candidate space dramatically.
Two-tower neural retrieval — replace TF-IDF + KNN recall with a learned dual-encoder, using SVD ratings as labels for the rerank stage.
Diversity-aware reranking — penalize redundant candidates (same franchise, same director) at the rerank step to widen the recommendation set.

Featured Project

TeamClaw — Local-First Multi-Agent Workspace

2026-01 — 2026-03 · Open-source contributor

Contributed to TeamClaw — a local-first multi-agent workspace that exposes an OpenAI-compatible /v1/chat/completions endpoint and ships a visual orchestration layer (OASIS) supporting sequential, parallel, selector, and DAG workflows. Unifies three agent types under a single Team abstraction: Stateless experts, Stateful sessions, and External-API agents (incl. OpenClaw). Team Creator turns a plain-text task description or discovered SOP pages into roles, personas, and a runnable DAG. Backed by a living GraphRAG memory (SQLite + optional Zep mirror), multimodal I/O, Telegram / QQ bot bridges, and Cloudflare Tunnel for one-click public access.

OpenAI-compatible local endpoint at /v1/chat/completions — drop-in for any OpenAI client.
OASIS engine: sequential / parallel / selector / DAG workflows over unified Stateless / Stateful / External-API agents.
Team Creator turns a task description or SOP page into roles, personas, and a runnable DAG.
Living GraphRAG memory (SQLite + optional Zep mirror) + multimodal I/O + Telegram / QQ bots + Cloudflare Tunnel public access.

Python
FastAPI / Flask
LangGraph
OASIS engine
MCP toolchain
OpenAI-compatible API
GraphRAG (Zep)
SQLite
Cloudflare Tunnel

What TeamClaw is

A local-first multi-agent workspace that brings team-style AI orchestration to a single machine — without forcing the operator into a cloud SaaS. From the outside it looks like an OpenAI-compatible chat endpoint; from the inside it’s a visual orchestration engine, a living memory layer, an auto-team generator, and a set of bridges to bots and the public web.

The design tension the project navigates: hobbyists want a one-binary experience, but real multi-agent work needs persistence, scheduling, and visibility into what each agent did. TeamClaw answers both with a layered architecture you can opt into module by module.

OpenAI-compatible local endpoint

Everything funnels through /v1/chat/completions, which speaks the OpenAI request/response shape. That single decision unlocks:

Drop-in client compatibility — any OpenAI SDK, IDE plugin, or third-party tool that talks to OpenAI’s REST API works against the local instance unchanged.
Model-agnostic backend — the endpoint fans out to whichever model is configured (Antigravity-Manager bridges 67+ models, including long-context backends like MiniMax M2.7 with 1M context).
Stable contract for orchestration — the OASIS engine and the auto-teamer both call models through the same surface, so swapping providers doesn’t ripple into orchestration code.

OASIS orchestration engine

OASIS is the core scheduler. It treats agent collaboration as a graph and supports four workflow shapes:

Shape	When you’d use it
Sequential	Step-wise pipelines (research → draft → review).
Parallel	Independent fan-out (multiple critics scoring the same artifact).
Selector	Routing: one upstream node picks which downstream agent runs.
DAG	Arbitrary directed graphs — the most general case, used by auto-teaming.

State is persisted as a living graph: posts, callbacks, and timeline events are written to SQLite locally and can be mirrored to Zep for cross-session memory. That means you can pause a multi-step team mid-task, restart the workspace tomorrow, and resume from the same state instead of re-running the prompt chain.

A “swarm graph” view in the frontend renders the same engine state visually, so you can see which agent is running, which edges have fired, and which tool calls produced which posts.

Three agent types, one team abstraction

TeamClaw unifies three execution models under a single Team configuration:

Stateless experts — internal lightweight agents that take a prompt + context and return a result. No memory, no session. Cheap to fan out in parallel.
Stateful sessions — agents with persistent context across turns. Backed by the GraphRAG memory layer; useful for long-running roles like “PM” or “ongoing analyst” that need to remember past decisions.
External-API agents — wrappers around OpenClaw runtimes and arbitrary HTTP-API agents. Lets the team include capabilities that don’t live inside TeamClaw, while keeping the orchestration interface uniform.

The unification matters because the auto-teamer (next section) generates DAGs without caring which type each node uses — the OASIS engine resolves that at runtime from the Team config.

Team Creator — auto-teaming from a task description

This is the feature that ties the engine to a non-developer workflow. Given either:

a free-text task description, or
a set of SOP / organization pages (discovered via the TinyFish Web Agent),

Team Creator runs an LLM-driven extraction step that produces:

Roles — what kinds of agents are needed for this task.
Personas — fleshed-out per-role prompts and behavioural specs (editable before import).
An OASIS DAG — wired up between the roles, ready to execute.

The output is a draft Team — meant to be inspected and edited in the frontend before going live. That review checkpoint is deliberate: auto-generated agent graphs are useful as scaffolds, but blindly executing them tends to produce confidently wrong outputs.

OASIS Town — pixel-art swarm visualization

For the larger Team configurations, the same OASIS state can be rendered as a pixel-art “town”, where each agent is a resident. Live activity (a node firing, a tool call, a memory write) shows up as resident animations or “nudges”. Compact mode collapses to the swarm-graph view; the town adds ambient audio and a more intuitive at-a-glance feel for whether the team is making progress.

It’s a UX experiment more than a core capability — but it surfaces an underrated metric: which agents are idle. In manual debugging, dead nodes in a DAG are often the first sign that the auto-teamer mis-extracted a role.

Memory layer — living GraphRAG

Memory is graph-shaped, not vector-store-shaped. Posts, callbacks, and decisions accumulate as nodes; relations between them (caused-by, supersedes, refers-to) accumulate as edges. Two backends:

SQLite — local, zero-config default.
Zep — optional mirror for cross-machine continuity and richer retrieval primitives.

Why graph instead of vectors: TeamClaw’s primary memory access pattern is “who said what to whom and when”, which is fundamentally relational. Vector retrieval is layered on top for free-text recall, but the structural traversal (e.g. “all posts in this thread that the Critic flagged”) is the dominant query, and graphs answer that in a single hop.

Tooling and integrations

The MCP toolchain handles agent ↔ tool plumbing with approval-aware policy hooks (so a tool call can be paused for human approval before execution). The current frontend includes a policy panel for inspecting and authorising calls.

Integrations layered on top:

OpenClaw external runtimes — for capabilities outside TeamClaw’s process.
Telegram / QQ bot bridges — chat-as-a-frontend for the same Teams.
TinyFish Web Agent — used by Team Creator for SOP discovery; also serves a competitor-monitoring use case (web crawls + pricing snapshots feeding into a long-running Stateful agent).
Cloudflare Tunnel — one-click public access without exposing the local IP.
Antigravity-Manager — model fan-out across 67+ providers under a single config.

How it runs

A selfskill/scripts/run.sh|ps1 driver handles setup → configure --batch → start, which brings up the local services and exposes the frontend at http://127.0.0.1:<PORT_FRONTEND>. The default validation path is pytest + GitHub Actions, with Playwright smoke tests for the frontend.

The interesting design choice here: the project also ships a SKILL.md aimed at AI coding agents, so a coding agent (e.g. Claude Code) can install the workspace autonomously without a human running the scripts manually. That’s a small detail but a telling one about who the intended operator looks like in 2026.

What I’d build next

OASIS DAG diffs — show what changed between two auto-teamed drafts so users can iterate on the task description and see the structural delta.
Memory-layer evictions — graph memory grows unbounded; eviction policies (LRU on edges, summary-rollups on subgraphs) would matter at the year-of-usage scale.
Replay-mode for failed runs — given a finished OASIS run, let the user re-execute a single subgraph with edited personas to debug local mistakes without re-running the whole DAG.

Chunxiao(Elin) Ren.

Ship software at the intersection of MLE and full-stack engineering.

01. About Me

02. Research

Climate Policy Evidence Knowledge Graph Platform

Domain-Specific Agents: A Cyber Risk Multi-Agent Framework

Cyber Risk Assessment Platform

Medical Q&A System with LLMs (RAG)

03. Where I’ve Worked

AI Large Model Engineer Intern

Data Scientist Intern

Backend Development Engineer Intern

04. Things I’ve Built

Singapore Public Housing Automated Valuation Model

Multi-Strategy Movie Recommendation System

TeamClaw — Local-First Multi-Agent Workspace