Task & data
Predict the resale price of Singapore HDB public housing flats. Main dataset is a Kaggle release of HDB resale transactions; supplementary geospatial data is pulled live from Singapore’s government open-data APIs.
- Train: 162,691 transactions, Test: 50,000 transactions
- Time range: 2017 — 2025
- Target: RESALE_PRICE in SGD; right-skewed, median ≈ S$488k, mean ≈ S$518k, tail extending past S$1.6M
- Metric: RMSE in log-price space
The price distribution is heavy-tailed and the time trend is strong: 2017 medians ≈ S$380k vs. 2025 ≈ S$620k (+60%). Both observations directly drive design choices below — the time-decay sample weighting and the high-price two-stage refiner.
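For reference, the metric can be written in a few lines. This is a minimal sketch that assumes natural logs applied directly to SGD prices; the writeup does not say whether the competition uses `log` or `log1p`, and the function name `log_rmse` is illustrative.

```python
import numpy as np

def log_rmse(y_true, y_pred):
    """RMSE between log prices; inputs are positive prices in SGD."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# A uniform 5% over-prediction scores log(1.05) at any price level,
# which is why this metric does not over-reward fitting the expensive tail.
score = log_rmse([400_000, 800_000], [420_000, 840_000])
```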
Auxiliary geospatial data
Five POI categories pulled from Singapore government open APIs (data.gov.sg, LTA DataMall, OneMap, MOE, NEA), totalling ~774 points:
| File | Count | Source |
|---|---|---|
| sg-mrt-stations.csv | 243 | LTA DataMall |
| sg-primary-schools.csv | 182 | data.gov.sg / MOE |
| sg-secondary-schools.csv | 153 | data.gov.sg / MOE |
| sg-shopping-malls.csv | 89 | data.gov.sg / OneMap |
| sg-gov-hawkers.csv | 107 | data.gov.sg / NEA |
| sg-hdb-block-details.csv | 9,660 | data.gov.sg HDB Property Information |
The HDB block file is not a POI set; it is the property index: per-block lat/lon, planning area, and MAX_FLOOR, used to back-fill geocodes onto every transaction. The join key is BLOCK alone. A merge-validation analysis showed that BLOCK + TOWN produced a 115% match rate (duplicate rows in the block file fan out the join) and that ADDRESS matching produced 0%. De-duplicating the block file by keeping the row with the largest MAX_FLOOR per block resolves the duplication.
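The de-duplicate-then-join step might look like the sketch below. It runs on toy data; the `LAT` / `LON` column names and all values are assumptions for illustration, and only `BLOCK` and `MAX_FLOOR` come from the description above.

```python
import pandas as pd

# Toy stand-ins for the block index and the transactions (values are made up).
blocks = pd.DataFrame({
    "BLOCK": ["101", "101", "202"],
    "LAT": [1.3001, 1.3002, 1.3501],        # hypothetical column name
    "LON": [103.8001, 103.8002, 103.9001],  # hypothetical column name
    "MAX_FLOOR": [12, 25, 16],
})
tx = pd.DataFrame({"BLOCK": ["101", "202", "101", "999"]})

# Keep one row per BLOCK: the one with the largest MAX_FLOOR.
blocks_dedup = (
    blocks.sort_values("MAX_FLOOR", ascending=False)
          .drop_duplicates(subset="BLOCK", keep="first")
)

# validate="many_to_one" raises if the right side still has duplicate keys,
# which is the situation that pushes a "match rate" above 100%.
merged = tx.merge(blocks_dedup, on="BLOCK", how="left",
                  validate="many_to_one", indicator=True)

match_rate = (merged["_merge"] == "both").mean()  # share of rows geocoded
row_ratio = len(merged) / len(tx)                 # 1.0 means no fan-out
```

Checking `row_ratio` alongside `match_rate` separates "rows were duplicated" from "rows were matched", which a single percentage conflates.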
Feature engineering
Time features
Parsed from the MONTH field (YYYY-MM):
- year, month_num
- month_sin / month_cos: cyclic encoding so December and January are neighbours, not opposites
- flat_age = year - lease_commence_data
- lease_left = 99 - flat_age (HDB leases are 99 years)
- is_new: flag for flats ≤ 5 years old
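A minimal pandas sketch of these time features follows. The angle convention `(month_num - 1) / 12` is an assumption (the writeup only says the encoding is cyclic), and the sample rows are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MONTH": ["2017-01", "2021-06", "2025-12"],
    "lease_commence_data": [1990, 2018, 2000],
})

month = pd.to_datetime(df["MONTH"], format="%Y-%m")
df["year"] = month.dt.year
df["month_num"] = month.dt.month

# Map the 12 months onto a circle so that month 12 sits next to month 1.
angle = 2 * np.pi * (df["month_num"] - 1) / 12
df["month_sin"] = np.sin(angle)
df["month_cos"] = np.cos(angle)

df["flat_age"] = df["year"] - df["lease_commence_data"]
df["lease_left"] = 99 - df["flat_age"]           # HDB leases run 99 years
df["is_new"] = (df["flat_age"] <= 5).astype(int)
```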
Floor features
FLOOR_RANGE is a string like "07 TO 09". Decomposed into:
- floor_num: lower bound
- avg_floor: midpoint
- floor_range: coarsened bin (Low / Mid-Low / Mid / Mid-High / High / Very High / Top)
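The decomposition can be sketched as below. The bin edges are illustrative assumptions, since the actual cut points are not given here; only the seven labels come from the list above.

```python
import pandas as pd

df = pd.DataFrame({"FLOOR_RANGE": ["01 TO 03", "07 TO 09", "40 TO 42"]})

# Pull the two integers out of strings such as "07 TO 09".
bounds = df["FLOOR_RANGE"].str.extract(r"(\d+)\s+TO\s+(\d+)").astype(int)
df["floor_num"] = bounds[0]                       # lower bound
df["avg_floor"] = (bounds[0] + bounds[1]) / 2     # midpoint

# Hypothetical bin edges: 8 edges define the 7 labelled bins.
edges = [0, 3, 6, 9, 12, 20, 30, float("inf")]
labels = ["Low", "Mid-Low", "Mid", "Mid-High", "High", "Very High", "Top"]
df["floor_range"] = pd.cut(df["avg_floor"], bins=edges, labels=labels)
```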
Geospatial features — BallTree + Haversine
The core of the feature engineering. One BallTree is built per POI category over its lat/lon coordinates under the Haversine metric (great-circle distance on the sphere), and every transaction is then queried against it with nearest-neighbour and radius searches.
Why BallTree. Brute-force nearest-neighbour search over 162k transactions × 774 POIs is O(N·M); a BallTree query is O(log M) on average, so the full pass is about O(N · log M), roughly 10× faster in this configuration.
Dual-radius design. A single nearest-distance feature can’t tell “one MRT 800m away” apart from “three MRT lines clustering nearby”. Each POI category therefore uses an inner radius for “tightly served” and an outer radius for “broadly served”:
| POI | Inner | Outer | Extra |
|---|---|---|---|
| MRT | 800 m | 1500 m | nearest distance · nearest-station-name (categorical) |
| Primary school | 600 m | 1000 m | nearest distance · nearest-school-name (categorical) |
| Secondary school | 800 m | 1500 m | nearest distance |
| Mall | 1000 m | 2000 m | KNN-3 mean distance · nearest-mall-name (categorical) |
| Hawker centre | 500 m | 1200 m | nearest distance |
That yields ~20 geo features per sample: 5 nearest distances + 10 dual-radius counts + 2 KNN-3 means + 3 nearest-POI categorical IDs.
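The per-category query can be sketched with scikit-learn's `BallTree`. The helper name `geo_features`, its return keys, and the toy coordinates are assumptions for illustration. The one detail that is easy to get wrong is units: the Haversine metric expects `[lat, lon]` in radians and returns distances in radians, so metres are converted through the Earth's radius.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres

def geo_features(flats_deg, pois_deg, inner_m, outer_m, k=3):
    """flats_deg / pois_deg: arrays of shape (n, 2) holding [lat, lon] in degrees."""
    flats = np.radians(flats_deg)
    tree = BallTree(np.radians(pois_deg), metric="haversine")

    k = min(k, len(pois_deg))
    dist_rad, idx = tree.query(flats, k=k)   # neighbours sorted by distance
    dist_m = dist_rad * EARTH_RADIUS_M

    def count_within(radius_m):
        return tree.query_radius(flats, r=radius_m / EARTH_RADIUS_M, count_only=True)

    return {
        "nearest_m": dist_m[:, 0],
        "nearest_idx": idx[:, 0],            # row index into the POI table
        "knn_mean_m": dist_m.mean(axis=1),
        "count_inner": count_within(inner_m),
        "count_outer": count_within(outer_m),
    }

# One flat and three stations placed due north at about 445 m, 1.1 km and 3.3 km.
flats = np.array([[1.3000, 103.8000]])
pois = np.array([[1.3040, 103.8000], [1.3100, 103.8000], [1.3300, 103.8000]])
feats = geo_features(flats, pois, inner_m=800, outer_m=1500)
```

With the MRT radii from the table, this flat gets one station in the inner ring and two in the outer ring, which is the distinction a single nearest-distance feature cannot express.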
Tier flags
A second pass marks whether the nearest POI is “elite”, because per-category density counts alone miss the prestige signal:
- nearest_primary_top: is the nearest primary school on the Top Primary list?
- nearest_mrt_core_line: is the nearest MRT on a core line (DTL / TEL / NSL / EWL / CCL / NEL)?
- nearest_mall_flagship: VivoCity / Ion Orchard / etc.?
Categorical encoding
CatBoost handles categoricals natively via Ordered Target Statistics, so most fields stay raw: town, flat_model, flat_type, floor_range. One additional engineered feature: model_rank, the flat_model ordinal sorted by its market median price — a cheap price-aware encoding for the gradient-boosted models.
ECO_CATEGORY (100% "uncategorized") is dropped. BLOCK and STREET are dropped post-merge in favour of the geocoded coordinates.
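A sketch of the model_rank encoding described above, on invented rows. The dense-rank choice and the short `flat_model` labels are assumptions. Because the ranking is derived from prices, it is safest to compute it on training rows only and then map it onto validation and test rows.

```python
import pandas as pd

train = pd.DataFrame({
    "flat_model": ["A", "A", "B", "B", "C"],
    "RESALE_PRICE": [400_000, 420_000, 900_000, 950_000, 600_000],
})

# Median price per flat_model, then an ordinal rank (1 = cheapest).
medians = train.groupby("flat_model")["RESALE_PRICE"].median()
model_rank = medians.rank(method="dense").astype(int)

train["model_rank"] = train["flat_model"].map(model_rank)
```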
Sample weighting — time × price
Two factors compose into one sample weight:
time_weight = exp(decay_rate × (year − min_year)) # recent transactions weigh more
price_weight = 1 + α × 1[price > price_90th] # high-price boost
sample_weight = time_weight × price_weight
Time decay reflects the EDA finding that 2023–2025 prices behave differently from 2017–2018. The price boost trains the model to spend extra capacity on the heavy-tailed top decile, which would otherwise be averaged out by the volume of mid-market 4-room flats.
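The weighting formula translates directly to NumPy. In this sketch the default values of `decay_rate` and `alpha` are placeholders, not the tuned values, and the 90th percentile is taken over whatever prices are passed in.

```python
import numpy as np

def sample_weights(year, price, decay_rate=0.1, alpha=1.0):
    """time_weight * price_weight, following the formula above."""
    year = np.asarray(year, dtype=float)
    price = np.asarray(price, dtype=float)
    time_weight = np.exp(decay_rate * (year - year.min()))  # grows with recency
    price_weight = 1.0 + alpha * (price > np.quantile(price, 0.90))
    return time_weight * price_weight

years = np.array([2017] * 5 + [2025] * 5)
prices = np.arange(1, 11) * 100_000.0
w = sample_weights(years, prices)
```

In this toy set the oldest rows keep weight 1.0, 2025 rows are scaled by exp(0.8), and the single row above the 90th percentile is doubled on top of that.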
Monotonic constraints
CatBoost monotone constraints on:
- floor_area_sqm: non-decreasing
- lease_left: non-decreasing
These prevent the model from learning paradoxical predictions like “+10 sqm → cheaper” that emerge when local correlations between area and other features (e.g. older blocks happen to be larger) flip the marginal effect.
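One way to express these constraints is a per-feature list passed through CatBoost's `monotone_constraints` training parameter. The feature order below is an assumed example, and the CatBoost call is left as a comment so the snippet runs without the library installed.

```python
# Column order of the numeric training matrix (an assumed example).
feature_names = ["floor_area_sqm", "lease_left", "avg_floor", "year"]
increasing = {"floor_area_sqm", "lease_left"}

# One entry per feature, in column order:
#   1 = prediction may not decrease as the feature grows, 0 = unconstrained.
monotone_constraints = [1 if name in increasing else 0 for name in feature_names]

params = {
    "loss_function": "RMSE",
    "monotone_constraints": monotone_constraints,
}

# With catboost installed, something along these lines:
#   from catboost import CatBoostRegressor
#   model = CatBoostRegressor(**params).fit(X[feature_names], y)
```

CatBoost applies monotone constraints to numerical features, so categorical columns such as town stay unconstrained.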
Stacking architecture
Level-0 base learners (5-fold OOF predictions)
├── CatBoost n_estimators=10000, lr=0.033, depth=8, od_wait=350
├── LightGBM n_estimators=5000, lr=0.035, num_leaves=64
└── XGBoost n_estimators=5000, lr=0.035, max_depth=8
Level-1 meta-learner
└── Linear regression (no intercept) over OOF base predictions
OOF predictions prevent the meta-learner from seeing the same row’s own training prediction (which would leak the label). The no-intercept choice is intentional: each base learner already produces well-calibrated price estimates, so the meta-learner just learns optimal convex-ish weights rather than re-shifting the mean.
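The OOF-plus-linear-blend pattern can be sketched with scikit-learn alone. The two base learners below are stand-ins for CatBoost, LightGBM and XGBoost so the example runs without those libraries, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + 5.0 + rng.normal(scale=0.1, size=300)

# Stand-in base learners (the real stack uses three boosted-tree libraries).
base_learners = [
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
]

# Each row's prediction comes from a model that never saw that row.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.column_stack([cross_val_predict(m, X, y, cv=cv) for m in base_learners])

# Level-1: linear blend with no intercept, fitted on the OOF matrix.
meta = LinearRegression(fit_intercept=False).fit(oof, y)
stacked = oof @ meta.coef_

def rmse(pred):
    return float(np.sqrt(np.mean((pred - y) ** 2)))
```

At prediction time each base learner is refit on all training rows, and its test predictions are combined with `meta.coef_`.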
Two-stage refinement for the high-price tail
Errors on the top decile (RESALE_PRICE > 90th percentile, roughly S$750k and up) are systematically larger. A second CatBoost is fine-tuned on just this segment, then blended:
y_main = main stacking prediction (all data)
y_seg = high-price-only fine-tune
y_final = w · y_seg + (1 − w) · y_main # w grid-searched on validation
The blend weight w is selected by grid search on the validation set rather than learned end-to-end, to avoid this segment dominating the main stacker’s gradient.
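The grid search over w is small enough to write out in full. This sketch assumes a 21-point grid on [0, 1] and plain RMSE as the selection criterion; it does not address how rows are routed to the high-price segment at inference time, which is not described here.

```python
import numpy as np

def best_blend_weight(y_true, y_main, y_seg, grid=None):
    """Return (w, rmse) minimising RMSE of w * y_seg + (1 - w) * y_main."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)   # assumed step of 0.05
    errors = [
        np.sqrt(np.mean((w * y_seg + (1 - w) * y_main - y_true) ** 2))
        for w in grid
    ]
    best = int(np.argmin(errors))
    return float(grid[best]), float(errors[best])

# Toy validation set: the main stack under-predicts by 40k,
# the segment model over-predicts by 10k.
y_true = np.array([800_000.0, 900_000.0, 1_000_000.0])
y_main = y_true - 40_000
y_seg = y_true + 10_000
w, err = best_blend_weight(y_true, y_main, y_seg)
```

Here the two biases cancel at w = 0.8, which is what the search returns.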
Multi-seed averaging
The final submission averages predictions across three CatBoost seeds (42, 100, 2025), controlled by a flag in the main training script. This reduces the variance of a single random initialization without much added compute, since each seed’s OOF folds are independent.
Results
| Version | Description | Val log-RMSE |
|---|---|---|
| v1 | Baseline, no geo features | — |
| v2 | Log-space training, single-radius geo | 0.061 |
| v2.5 | Price-space + stacking | — |
| v3 (final) | Dual-radius geo + tier flags + segmented refiner | 0.050 |
About 18% relative improvement over v2.
Feature importance (CatBoost, base feature set)
| Feature | Importance |
|---|---|
| floor_area_sqm | 29.36% |
| town | 23.17% |
| year | 18.39% |
| lease_commence_data | 12.84% |
| flat_model | 8.05% |
| avg_floor | 3.17% |
| flat_type | 2.52% |
| flat_age | 1.48% |
Area + town + year + lease together account for ~84% of importance. The geo features layer further information on top of town, distinguishing “a flat in Bukit Timah next to the MRT” from “a flat in Bukit Timah on the edge of the planning area”.
Key design decisions
- CatBoost as the main learner, not XGBoost: native categorical support via Ordered Target Statistics avoids the encoding noise from manually one-hot or target-encoding town (26 levels), flat_model (21 levels), and flat_type (12 levels after de-duplication).
- Price space, not log space, in v3: log-space regression compresses high-price errors but the leaderboard metric is absolute. Median-price normalization (price / median_price) keeps numerical stability while letting the model optimize the actual error.
- Dual-radius density, not single nearest distance — captures both proximity and richness of the surrounding facility set.
- Block-only join + de-dup by MAX_FLOOR: chosen after testing BLOCK + TOWN (115% match via duplicates) and ADDRESS (0% match due to format inconsistency); deduping by MAX_FLOOR prefers the geocode anchored at the highest measured floor (better GPS quality).
What I’d build next
- Holdout-time validation: current 5-fold splits are random; a time-based split (e.g. train ≤ 2024, validate 2025) would test the model’s forward-projection ability, which is what an automated valuation model (AVM) in production actually needs.
- Quantile regression — predict P10 / P50 / P90 instead of point estimates to surface the uncertainty inherent in HDB pricing.
- Lightweight neural retrieval over POIs — replace radius-counts with a learned attention over per-flat POI sets, conditioned on flat type.
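The first item, holdout-time validation, is simple to prototype. The sketch below assumes the split is made on the calendar year parsed from MONTH; the helper name `time_holdout` is hypothetical.

```python
import pandas as pd

def time_holdout(df, cutoff_year, month_col="MONTH"):
    """Train on rows up to and including cutoff_year, validate on later rows."""
    year = pd.to_datetime(df[month_col], format="%Y-%m").dt.year
    return df[year <= cutoff_year], df[year > cutoff_year]

df = pd.DataFrame({"MONTH": ["2023-05", "2024-12", "2025-01", "2025-06"]})
train_df, valid_df = time_holdout(df, cutoff_year=2024)
```

Any price-derived statistic (target encodings, model_rank, the 90th-percentile threshold) would need to be recomputed from `train_df` alone for the split to measure forward projection honestly.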