ParaEval and the CRP/HDP Model: Bayesian Nonparametric Trigger Calibration for Parametric Insurance
Parametric insurance triggers are only as good as the statistical model behind them. This post walks through ParaEval — a decision-evaluation platform for parametric claims — and its CRP/HDP sub-model, which uses Bayesian nonparametric clustering to discover latent peril regimes and calibrate triggers with lower basis risk than standard actuarial baselines.
The problem
Parametric triggers inherit the assumptions of the model that sets them
A parametric insurance policy pays when an observable index — wind speed, rainfall accumulation, flood depth — crosses a contractual threshold. The entire product hinges on that threshold being correctly calibrated: set it too low and the insurer overpays (false positives); set it too high and the policyholder suffers uncompensated loss (false negatives). The sum of these two errors is basis risk, and it is the single most important quality metric for any parametric product. Industry practice typically calibrates thresholds using Generalized Extreme Value (GEV) distributions, Weibull fits, or historical quantiles. These methods assume a single, stationary generating process — an assumption that breaks when peril events cluster into distinct regimes (e.g., fast-track high-wind typhoons vs. slow-moving rain-dominant systems) and when climate non-stationarity shifts those regimes over time. ParaEval is a decision-evaluation platform built to surface exactly this kind of structural mismatch. Its CRP/HDP sub-model replaces the single-distribution assumption with a Bayesian nonparametric mixture that discovers latent peril regimes directly from data, then derives trigger thresholds from the boundaries between loss-producing and non-loss-producing regimes.
- Basis risk definition: (FP + FN) / N
- Key assumption broken: a single stationary regime
- CRP advantage: K is inferred from data
System design
ParaEval: from raw evidence to auditable settlement decisions
ParaEval is not a model — it is a decision-evaluation platform that sits between raw event data and the payout recommendation. It structures the entire evidence-to-decision pipeline into four composable stages: evidence ingestion, trigger evaluation, settlement logic, and audit trail generation. Each stage is deterministic given its inputs — the same evidence snapshot always produces the same decision, reasoning trace, and payout recommendation. This determinism is not a convenience; it is a regulatory requirement. Parametric insurance settlements must be reproducible and explainable to auditors, regulators, and dispute panels. The system handles multiple evidence types (weather API readings, satellite observations, uploaded documents, claims adjuster reports) and classifies each source by reliability tier: authoritative (the contractual index source), corroborating (independent sources that directionally agree), and indicative (weaker signals useful for basis-risk analysis). The decision engine applies four rule checks — authoritative source coverage, contract threshold test, cross-source corroboration, and counter-signal management — before producing a confidence score, trigger status (met / borderline / not_met), and a structured settlement memo.
Evidence layer
Each evidence item carries a provenance record: provider identity, observation timestamp, distance from insured asset, measurement unit, and reliability tier. This metadata drives the weighting logic downstream — an authoritative reading from the contractual index source carries more settlement weight than a corroborating satellite proxy.
Decision engine
The trigger evaluation is deterministic: yes = 1.0, partial = 0.5, no = 0.0. Confidence is the arithmetic mean of evidence scores. Status thresholds are fixed: met >= 0.7, borderline >= 0.4, not_met < 0.4. This simplicity is intentional — it makes the decision auditable and reproducible without requiring statistical expertise from the reviewer.
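The scoring scheme is small enough to sketch in full. Function and parameter names below are illustrative, not ParaEval's actual API; the constants are the ones stated above:

```python
def evaluate_trigger(evidence_scores):
    """Deterministic trigger evaluation sketch.

    evidence_scores: list of rule outcomes, each 'yes', 'partial', or 'no'.
    Returns (confidence, status) using the fixed thresholds described above.
    """
    SCORE = {"yes": 1.0, "partial": 0.5, "no": 0.0}
    # Confidence is the arithmetic mean of the mapped evidence scores.
    confidence = sum(SCORE[s] for s in evidence_scores) / len(evidence_scores)
    if confidence >= 0.7:
        status = "met"
    elif confidence >= 0.4:
        status = "borderline"
    else:
        status = "not_met"
    return confidence, status
```

For example, `evaluate_trigger(["yes", "yes", "partial", "no"])` yields a confidence of 0.625 and a status of `borderline`.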
Settlement logic
Payout recommendations are tiered against policy limits. If the highest authoritative reading exceeds 120% of the trigger threshold, the full policy limit is recommended. At the threshold itself, 70% is recommended. Borderline cases carry a reduced 25% watch-list view. Each tier includes a plain-language rationale suitable for inclusion in a settlement memo.
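The tiering above can be sketched as follows. The function name and signature are assumptions for illustration; the percentages are the ones stated in the text:

```python
def recommend_payout(max_authoritative_reading, trigger_threshold,
                     policy_limit, status):
    """Tiered payout recommendation sketch.

    Returns (recommended_amount, plain-language rationale).
    """
    if status == "met":
        # Readings well beyond the trigger justify the full policy limit.
        if max_authoritative_reading >= 1.2 * trigger_threshold:
            return policy_limit, "reading exceeds 120% of trigger: full limit"
        return 0.70 * policy_limit, "trigger met at threshold: 70% of limit"
    if status == "borderline":
        return 0.25 * policy_limit, "borderline: 25% watch-list view"
    return 0.0, "trigger not met: no payout recommended"
```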
Audit trail
Every decision produces a rule trace (four checks with pass/warn/fail outcomes), a list of blocking conditions, supporting and counter-evidence summaries, and a basis-risk classification (none/low/medium/high). This trace is the artifact that survives regulatory review.
Mathematical foundation
The Chinese Restaurant Process: a nonparametric prior over cluster structure
The Chinese Restaurant Process (CRP) is a constructive definition of the Dirichlet Process (DP) that makes its clustering behavior intuitive. Imagine a restaurant with infinitely many tables. The first customer sits at table 1. Each subsequent customer either joins an existing table with probability proportional to the number of people already seated there, or starts a new table with probability proportional to a concentration parameter alpha. This process generates a random partition of customers into groups — and it does so without fixing the number of groups in advance. In the context of peril modeling, each "customer" is a historical typhoon event and each "table" is a latent peril regime. The CRP prior encodes a rich-get-richer dynamic: large regimes attract more events, but the concentration parameter alpha controls how readily new regimes are created. Crucially, alpha is not fixed — we place a Gamma prior on it and infer its posterior value alongside the regime assignments. A higher posterior alpha means the data is better explained by more regimes; a lower alpha means fewer, larger clusters suffice. This is the Escobar & West (1995) auxiliary variable method, and it gives the model an automatic Occam's razor: it discovers exactly as many regimes as the data warrants.
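The seating dynamic is easy to simulate. Here is a stdlib-only sketch of sampling a partition from the CRP prior alone (the full mixture model additionally weights each table by the event's likelihood under that regime):

```python
import random

def crp_partition(n_events, alpha, seed=0):
    """Sample a random partition of n_events from the CRP prior.

    Returns (assignments, table_sizes): assignments[i] is the regime index
    of event i; table_sizes[k] is the number of events in regime k.
    """
    rng = random.Random(seed)
    tables = []       # tables[k] = customers already seated at table k
    assignments = []
    for i in range(n_events):
        # Existing table k chosen with prob n_k / (i + alpha);
        # a new table opens with prob alpha / (i + alpha).
        weights = tables + [alpha]
        r = rng.uniform(0, sum(weights))
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(tables):
            tables.append(1)   # new regime created
        else:
            tables[k] += 1     # rich-get-richer: join existing regime
        assignments.append(k)
    return assignments, tables
```

Running this with increasing `alpha` produces more, smaller tables, which is exactly the behavior the alpha-drift index (discussed later) exploits.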
Why not just use K-means?
K-means requires specifying K in advance, assigns hard cluster memberships, and assumes spherical clusters with equal variance. The CRP mixture model infers K from data, provides a full posterior distribution over assignments, and uses a Normal-Inverse-Wishart prior that accommodates clusters with different shapes, sizes, and orientations.
Why not a finite Gaussian mixture (GMM)?
A finite GMM with BIC/AIC model selection still requires fitting multiple models and choosing among them. The DP mixture integrates over the number of components in a single inference run. More importantly, the DP places non-zero probability on arbitrarily many components — it can accommodate future regimes not seen in the training window.
CRP conditional assignment. Each event i is assigned to an existing regime k with probability proportional to the regime size n_{-i,k} weighted by the likelihood of the event under that regime, or to a new regime with probability proportional to alpha weighted by the prior predictive likelihood.
The generative model. G is drawn from a Dirichlet Process with concentration alpha and base distribution G_0 — a Normal-Inverse-Wishart (NIW) prior that conjugates with multivariate Gaussian cluster likelihoods.
Escobar & West (1995) auxiliary variable update for alpha. The auxiliary variable eta breaks the coupling between alpha and the partition, enabling a closed-form Gibbs update. This eliminates the need for Metropolis-Hastings steps on the concentration parameter.
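A sketch of the auxiliary-variable update, assuming a Gamma(a, b) prior on alpha in the shape/rate parameterization. The recipe follows Escobar & West (1995); the function signature itself is illustrative:

```python
import math
import random

def update_alpha(alpha, n, K, a=1.0, b=1.0, rng=random):
    """One Gibbs update of the DP concentration parameter.

    alpha: current value; n: number of events; K: active clusters;
    Gamma(a, b) prior on alpha (shape a, rate b).
    """
    # Step 1: draw the auxiliary variable eta | alpha, n ~ Beta(alpha + 1, n).
    eta = rng.betavariate(alpha + 1.0, n)
    # Step 2: the conditional for alpha is a two-component Gamma mixture.
    odds = (a + K - 1.0) / (n * (b - math.log(eta)))
    pi = odds / (1.0 + odds)
    shape = a + K if rng.random() < pi else a + K - 1.0
    rate = b - math.log(eta)
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)
```

Because both mixture components are Gamma distributions, the update is closed-form and needs no Metropolis-Hastings accept/reject step.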
Inference
Collapsed Gibbs sampling: integrating out cluster parameters for faster mixing
We implement collapsed Gibbs sampling following Neal (2000, Algorithm 3). "Collapsed" means we analytically integrate out the cluster-specific parameters (mu_k, Sigma_k) using the Normal-Inverse-Wishart conjugacy, and sample only the discrete assignment variables and the concentration parameter alpha. This reduces the state space dramatically — instead of sampling continuous parameters per iteration, we sample only discrete assignments. The collapsed sampler mixes faster and converges in fewer iterations. Each Gibbs iteration makes a full pass over all events. For event i, we temporarily remove it from its current cluster, compute the CRP conditional probability of assigning it to each existing cluster (using the Student-t predictive distribution that falls out of the NIW posterior) and to a new cluster (using the prior predictive), then sample from the resulting categorical distribution. After the full pass, we update alpha via the Escobar & West auxiliary variable method. Convergence is monitored via the split-chain R-hat diagnostic on K (the number of active clusters) and the log-likelihood trace.
The predictive distribution for a new observation x_i given the other members of cluster k is a multivariate Student-t. The parameters (mu_n, kappa_n, nu_n, Psi_n) are the NIW posterior updated with the sufficient statistics of X_{-i,k}. This is the Murphy (2012) formulation, eq. 4.215.
NIW posterior update equations. The posterior mean mu_n is a precision-weighted average of the prior mean and the cluster sample mean. kappa_n and nu_n accumulate evidence from the cluster members.
The posterior scatter matrix Psi_n accumulates three sources of variation: the prior scatter Psi_0, the within-cluster scatter S_k, and a shrinkage term that penalizes deviation of the cluster mean from the prior mean.
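For reference, a compact statement of the updates and predictive described above, following the Murphy (2012) formulation cited in the text. Here mu_0, kappa_0, nu_0, Psi_0 are the NIW prior hyperparameters, x-bar and S_k the sample mean and scatter of the n members of cluster k, and d the feature dimension:

```latex
\kappa_n = \kappa_0 + n, \qquad
\nu_n = \nu_0 + n, \qquad
\mu_n = \frac{\kappa_0 \mu_0 + n \bar{x}}{\kappa_n}

\Psi_n = \Psi_0 + S_k
       + \frac{\kappa_0 n}{\kappa_n}(\bar{x} - \mu_0)(\bar{x} - \mu_0)^{\top},
\qquad S_k = \sum_{i \in k} (x_i - \bar{x})(x_i - \bar{x})^{\top}

p(x \mid X_{-i,k}) =
t_{\nu_n - d + 1}\!\left(\mu_n,\;
\frac{\Psi_n (\kappa_n + 1)}{\kappa_n (\nu_n - d + 1)}\right)
```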
Checklist
- Initialise assignments with K-means++ rather than random assignment — this gives the sampler a reasonable starting partition and reduces burn-in by 2-3x.
- Set the prior scatter Psi_0 = scale * (d+1) * empirical covariance. This is weakly informative: data dominates, but the prior prevents singular covariance estimates in small clusters.
- Use kappa_0 = 0.01 (weak prior on the mean location) so the sampler can freely discover cluster centres from data.
- Monitor convergence via split-chain R-hat on K. Values below 1.1 indicate adequate mixing. If R-hat > 1.2, increase the chain length or adjust the prior.
- Thin the chain (keep every 5th sample) to reduce autocorrelation in the posterior samples used for threshold calibration.
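The R-hat item in the checklist above can be computed in a few lines. This is a simplified single-chain split diagnostic (split one trace into halves and compare between-half to within-half variance), not a full multi-chain implementation:

```python
import statistics

def split_rhat(chain):
    """Split-chain R-hat for a scalar trace, e.g. K per Gibbs iteration.

    Splits one chain into two halves; values near 1.0 indicate the halves
    agree (adequate mixing), values above ~1.1-1.2 indicate drift.
    """
    m = len(chain) // 2
    halves = [chain[:m], chain[m:2 * m]]
    means = [statistics.mean(h) for h in halves]
    vars_ = [statistics.variance(h) for h in halves]
    W = statistics.mean(vars_)          # within-half variance
    B = m * statistics.variance(means)  # between-half variance
    var_plus = (m - 1) / m * W + B / m  # pooled variance estimate
    return (var_plus / W) ** 0.5
```

A trace that wanders steadily (e.g. K still growing at the end of the chain) produces a large R-hat, signaling the burn-in was too short.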
Hierarchical extension
HDP: modeling primary-to-secondary peril dependency across typhoon categories
A typhoon is not a single peril — it produces wind damage, rainfall flooding, and storm surge simultaneously. The conditional distribution of secondary peril intensity (e.g., flood depth) given the primary peril context (e.g., Saffir-Simpson category) is critical for multi-trigger products. The Hierarchical Dirichlet Process (HDP) extends the CRP to grouped data: each typhoon category gets its own Dirichlet Process mixture for flood depth, but all category-specific mixtures share a common set of global atoms drawn from a top-level DP. This sharing mechanism is the key statistical insight — it allows rare categories (Cat 5 events) to borrow strength from more common categories (Cat 2-3 events) through shared mixture components, while still allowing category-specific distributional differences. In the ParaEval implementation, this is approximated using context-specific Gaussian Mixture Models with a global backoff model for sparse categories, following the MAP-sharing approximation rather than a full Gibbs HDP sampler (Teh et al., 2006). This is a tractable compromise for the dataset sizes typical in APAC typhoon catalogs (~500-2000 events).
Why HDP over independent GMMs per category?
Cat 5 typhoons in the Western Pacific occur roughly 2-3 times per decade. Fitting an independent mixture to 10-15 events produces unstable density estimates. The HDP shares global atoms across categories, so a flood-depth component discovered in Cat 3 events can be reused for Cat 5 events with different mixing weights — borrowing statistical strength without assuming identical distributions.
Practical approximation
The full HDP Gibbs sampler (Teh et al., 2006) is computationally expensive. ParaEval uses a MAP-sharing approximation: fit a global GMM on all flood depths, then fit category-specific GMMs with n_components capped by available data (min(K, n_c/3)). Categories with fewer than 5 events fall back to the global model. This is less elegant but stable for production.
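The backoff routing described above might look like the following sketch. `fit_gmm` stands in for whatever mixture fitter the implementation uses (e.g. a scikit-learn `GaussianMixture` wrapper); the signature and constants mirror the text but are illustrative:

```python
def fit_category_models(events_by_category, fit_gmm, K_global=5, min_events=5):
    """MAP-sharing approximation sketch.

    Fit a global mixture on all flood depths, then category-specific
    mixtures where data allows; sparse categories back off to the
    global model to borrow statistical strength.
    """
    all_depths = [x for xs in events_by_category.values() for x in xs]
    global_model = fit_gmm(all_depths, K_global)
    models = {}
    for cat, depths in events_by_category.items():
        if len(depths) < min_events:
            # Too few events for a stable fit: reuse the global model.
            models[cat] = global_model
        else:
            # Cap component count by available data: min(K, n_c / 3).
            k = min(K_global, max(1, len(depths) // 3))
            models[cat] = fit_gmm(depths, k)
    return models
```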
Insurance application
Multi-trigger parametric products (e.g., wind + flood for warehouse cover) need to price the joint exceedance probability. The HDP gives P(flood > threshold | category = c), which combined with the CRP regime-specific wind distribution yields the joint trigger probability per regime — the basis for multi-peril pricing.
HDP generative model. H is the base distribution for flood depth. G_0 is the global flood depth distribution. G_c is the category-specific distribution for Saffir-Simpson category c. Each observed flood depth s_t is drawn from the mixture associated with its typhoon category c_t.
Secondary peril basis risk. The trigger prediction for event t is based on E[flood_depth | category c_t] exceeding the flood threshold. Basis risk measures how often this prediction disagrees with the observed flood outcome.
From clusters to contracts
Trigger calibration: finding the optimal boundary between loss and no-loss regimes
The CRP sampler discovers latent regimes. The trigger calibrator turns those regimes into actionable insurance thresholds. The strategy is conceptually simple: for each posterior sample of regime assignments, identify which regimes are loss-producing (majority of events have loss_occurred = True) and which are not. For each peril feature, find the 1D decision boundary between the closest loss-regime and no-loss-regime pair. This boundary is the trigger threshold theta* for that feature in that posterior sample. Repeating across all posterior samples yields a distribution of theta* values, from which we extract the posterior mean as the point estimate and the 5th/95th percentiles as a 90% credible interval. This credible interval is the key output that traditional methods cannot produce: it directly quantifies the uncertainty in the trigger threshold due to finite data and regime assignment uncertainty. A wide CI on theta* signals that the trigger boundary is poorly resolved — a direct warning to the product designer that basis risk may be high.
- Posterior samples: S ~ 50-200
- Credible interval: 90%
- Feature dimensions: 6
The optimal threshold for feature f is the posterior mean of the inter-regime boundary, averaged over S posterior samples. For each sample s, we find the closest pair of loss-regime k_L and no-loss-regime k_NL along feature dimension f and compute their 1D decision boundary.
The 1D boundary between two Gaussian regimes is the point where their densities are equal. When variances differ, this is a quadratic equation with up to two roots — we take the root closest to the midpoint.
The 90% credible interval on the trigger threshold, computed from the empirical quantiles of the posterior theta* samples. This interval propagates both data uncertainty and regime assignment uncertainty into the trigger recommendation.
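The 1D equal-density boundary described above reduces to solving a quadratic in closed form. A stdlib-only sketch:

```python
import math

def gaussian_boundary(mu1, s1, mu2, s2, eps=1e-12):
    """Equal-density point between N(mu1, s1^2) and N(mu2, s2^2).

    Solves log N(x; mu1, s1^2) = log N(x; mu2, s2^2); with unequal
    variances this is quadratic with up to two roots, and we return
    the root closest to the midpoint of the two means.
    """
    a = 1.0 / s1**2 - 1.0 / s2**2
    b = -2.0 * (mu1 / s1**2 - mu2 / s2**2)
    c = mu1**2 / s1**2 - mu2**2 / s2**2 + 2.0 * math.log(s1 / s2)
    mid = 0.5 * (mu1 + mu2)
    if abs(a) < eps:
        return -c / b          # equal variances: the equation is linear
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return mid             # densities never cross: fall back to midpoint
    r1 = (-b + math.sqrt(disc)) / (2.0 * a)
    r2 = (-b - math.sqrt(disc)) / (2.0 * a)
    return r1 if abs(r1 - mid) <= abs(r2 - mid) else r2
```

With equal variances the boundary is the midpoint of the two means; with unequal variances it shifts toward the tighter regime, which is exactly the behavior a quantile-based method cannot reproduce.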
Climate non-stationarity
Alpha-drift index: a nonparametric early-warning signal for regime shift
Traditional catastrophe models assume stationarity — the same generating process that produced historical events will produce future ones. Climate change violates this assumption. The alpha-drift index is a novel application of the CRP concentration parameter as a non-stationarity detector. The idea is straightforward: fit the CRP sampler on rolling time windows (e.g., 5-year windows sliding across a 50-year catalog) and track the posterior mean of alpha over time. A rising alpha indicates that events in recent windows are increasingly poorly explained by the regime structure of earlier windows — the model needs more clusters to fit the data, which means new peril patterns are emerging. A falling alpha means the regime structure is consolidating. We apply the Mann-Kendall trend test to the alpha time series. A statistically significant positive trend (p < 0.05) is a formal signal that the historical calibration basis is becoming stale — trigger thresholds should be recalibrated on more recent data, or the product should be repriced to account for structural uncertainty. The alpha-drift index can also be correlated with external climate indices (e.g., sea surface temperature anomalies, ENSO phase) to test hypotheses about the physical drivers of regime shift.
Interpretation for underwriters
A rising alpha-hat means "the recent event mix does not fit neatly into the regime categories we found in earlier data." This is not a prediction of more severe events — it is a prediction of more structurally different events. The distinction matters for pricing: severity changes affect expected loss; regime novelty affects model uncertainty.
Correlation with SST
Sea surface temperature anomalies in the Western Pacific Warm Pool are a known driver of typhoon intensification. The alpha-drift framework provides a formal test: compute Kendall tau between alpha-hat and SST anomaly across overlapping years. A significant positive correlation supports the hypothesis that warming seas are creating novel peril regimes.
Practical cadence
Run alpha-drift analysis annually as part of the portfolio review cycle. A 5-year rolling window with a 500-iteration sampler per window processes a 500-event catalog in under 30 minutes on a single CPU. This is cheap enough to be a standard monitoring artifact.
The alpha-drift index at time t is the posterior mean of alpha estimated from events in the window [t-w, t]. Each window runs an independent CRP sampler (shortened chain for efficiency: 500 iterations, 100 burn-in).
The Mann-Kendall test is distribution-free and robust to outliers. A significant positive trend triggers a recalibration advisory. The Theil-Sen slope estimator provides a robust estimate of the rate of alpha increase.
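A stdlib-only sketch of the Mann-Kendall test (normal approximation, no tie correction) together with the Theil-Sen slope, adequate for the short alpha-hat series this analysis produces:

```python
import math
from itertools import combinations

def mann_kendall(series):
    """Mann-Kendall trend test plus Theil-Sen slope for a scalar series.

    Returns (S statistic, two-sided p-value via the normal approximation,
    median pairwise slope). No tie correction is applied.
    """
    n = len(series)
    # S counts concordant minus discordant pairs (i < j).
    S = sum((series[j] > series[i]) - (series[j] < series[i])
            for i, j in combinations(range(n), 2))
    var_S = n * (n - 1) * (2 * n + 5) / 18.0
    if S > 0:
        z = (S - 1) / math.sqrt(var_S)   # continuity correction
    elif S < 0:
        z = (S + 1) / math.sqrt(var_S)
    else:
        z = 0.0
    # Two-sided p-value from the standard normal CDF.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2))))
    # Theil-Sen: median of all pairwise slopes (robust to outliers).
    slopes = sorted((series[j] - series[i]) / (j - i)
                    for i, j in combinations(range(n), 2))
    theil_sen = slopes[len(slopes) // 2]
    return S, p, theil_sen
```

A significant positive trend (p < 0.05 with positive Theil-Sen slope) on the alpha-hat series would raise the recalibration advisory described above.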
Empirical results
CRP/HDP vs. standard actuarial baselines: a controlled comparison
We evaluate the CRP/HDP trigger against four standard actuarial methods on the same synthetic event corpus (500 events, 3 embedded true regimes, 80/20 train-test split). All methods are evaluated on identical test events using the same loss labels. The primary metric is basis risk (FP + FN)/N — lower is better. We also report precision (fraction of trigger activations that correspond to actual losses), recall (fraction of actual losses captured by the trigger), and boundary F1 (the harmonic mean of precision and recall, directly analogous to the boundary F1 reported in morpheme segmentation research). The CRP/HDP model consistently achieves the lowest basis risk because it calibrates thresholds against cluster boundaries rather than distributional quantiles — it "sees" the regime structure that quantile-based methods average over.
GEV baseline
Fits a Generalized Extreme Value distribution to positive feature values and sets the trigger at the 75th percentile of the fitted GEV. This is the standard approach in catastrophe modeling for return-period estimation. Weakness: assumes a single tail distribution, cannot capture multimodal peril structure.
Weibull baseline
Fits a Weibull distribution (shape, scale, location) to positive feature values with location fixed at zero. Trigger at the 75th percentile. Slightly more flexible than GEV for wind-speed modeling but still unimodal.
Gaussian copula baseline
Rank-transforms the feature to Gaussian margins and sets the trigger at the 75th quantile of the original feature. In the univariate comparison, this reduces to a quantile estimate — the copula dependency structure is not captured in 1D. Included for completeness as copula methods are common in reinsurance.
Historical quantile baseline
The simplest approach: set the trigger at the 75th percentile of the training feature distribution. No distributional assumption, no parametric fit. Surprisingly competitive on well-behaved data, but brittle when the training window is not representative of the test period.
- CRP/HDP basis risk: ~0.20-0.28
- Boundary F1 gain: 5-15%
- Test set size: 100 events
- Training set size: 400 events
Primary evaluation metrics. Basis risk is the total error rate of the trigger decision. Boundary F1 balances the two types of error — it penalizes both overpayment (FP) and underpayment (FN) equally.
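All four metrics reduce to counting agreement between trigger decisions and loss labels. A minimal sketch:

```python
def trigger_metrics(predicted, actual):
    """Evaluation metrics for binary trigger decisions.

    predicted[i]: trigger fired for event i; actual[i]: loss occurred.
    Basis risk = (FP + FN) / N; F1 is the harmonic mean of precision
    and recall, penalizing overpayment and underpayment equally.
    """
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    n = len(predicted)
    basis_risk = (fp + fn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"basis_risk": basis_risk, "precision": precision,
            "recall": recall, "f1": f1}
```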
Experimental setup
Three embedded regimes: how the synthetic corpus is structured
Evaluation on real IBTrACS + ERA5 + EM-DAT data requires CDS API access and EM-DAT registration. For reproducibility, the demo mode uses a synthetic corpus with three embedded regimes that mirror real APAC typhoon archetypes. Regime 0 (35% of events) represents fast-track, high-wind storms (Cat 4-5, mean wind 55 m/s, mean rain 80mm, fast translation at 28 km/h). Regime 1 (40%) represents slow-moving rain-dominant systems (Cat 2-3, mean wind 38 m/s, mean rain 250mm, slow translation at 10 km/h). Regime 2 (25%) represents surge-dominant coastal storms (Cat 3-4, mean wind 45 m/s, mean surge 4.2m, moderate translation at 18 km/h). Loss amounts are regime-correlated: wind-driven losses for Regime 0, flood-driven for Regime 1, surge-driven for Regime 2. The loss_occurred label is set at the median economic loss, giving a balanced 50% trigger rate — a worst-case scenario for basis risk (hardest to separate). Six peril features span the multivariate event space: maximum sustained wind (m/s), 24h rainfall (mm), storm surge (m), minimum central pressure (hPa), translation speed (km/h), and track deviation from climatological mean (km).
Regime 0: Fast / high-wind
Mean profile: 55 m/s wind, 80mm rain, 3.0m surge, 915 hPa, 28 km/h translation, 180km track deviation. These are the Category 4-5 typhoons that cause devastating wind damage with relatively modest rainfall.
Regime 1: Slow / high-rain
Mean profile: 38 m/s wind, 250mm rain, 1.8m surge, 955 hPa, 10 km/h translation, 100km track deviation. Slow-moving systems that stall and dump extreme rainfall. The primary loss driver is inland flooding, not wind.
Regime 2: Surge-dominant
Mean profile: 45 m/s wind, 150mm rain, 4.2m surge, 935 hPa, 18 km/h translation, 60km track deviation. Coastal tracks that produce large storm surge. Insured losses concentrate in port facilities and coastal infrastructure.
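The three regime profiles above can be turned into a toy event sampler. The 10% relative Gaussian noise is an illustrative assumption, not the corpus's actual covariance structure:

```python
import random

# Regime mean profiles from the text: (wind m/s, rain mm, surge m,
# pressure hPa, translation km/h, track deviation km), plus mixing weights.
REGIMES = [
    ((55, 80, 3.0, 915, 28, 180), 0.35),   # Regime 0: fast / high-wind
    ((38, 250, 1.8, 955, 10, 100), 0.40),  # Regime 1: slow / high-rain
    ((45, 150, 4.2, 935, 18, 60), 0.25),   # Regime 2: surge-dominant
]

def sample_events(n, noise=0.10, seed=42):
    """Draw n synthetic six-feature events from the regime mixture.

    Each feature gets relative Gaussian noise around its regime mean
    (the noise level is an assumption for illustration).
    """
    rng = random.Random(seed)
    events = []
    for _ in range(n):
        means, _ = rng.choices(REGIMES,
                               weights=[w for _, w in REGIMES])[0]
        events.append([rng.gauss(m, noise * abs(m)) for m in means])
    return events
```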
End-to-end
How CRP/HDP analysis feeds into the ParaEval decision pipeline
The CRP/HDP model plugs into ParaEval through a model adapter layer. When an analyst runs CRP analysis on a case, the system executes a full calibration pipeline (simulation + threshold calibration + evaluation) and produces a structured evidence item that is injected into the case's evidence stack. This evidence item carries the calibrated threshold (theta*), the model's recommendation, and an explicit reliability tag of "indicative" — meaning the CRP output informs the decision but does not override authoritative index readings. The recommendation text is auto-generated based on the calibration metrics: if boundary F1 > 0.7 and basis risk < 0.3, the model output is flagged as "strong trigger discrimination"; if F1 > 0.5, it is "directionally useful"; below that, it is "weak for this feature." This tiered framing prevents over-reliance on model outputs that happen to be poorly calibrated for the specific peril feature and event geography. The analyst sees the CRP output alongside the weather API readings, satellite observations, and adjuster reports — as one signal among several, with its uncertainty clearly surfaced.
Model adapter pattern
Each analysis model (CRP/HDP, future models) implements a standard adapter interface with four capabilities: simulate, calibrate, alpha_drift, and case_analysis. The service layer routes requests by model_id, making it trivial to add new models without changing the decision engine.
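One way to express the adapter contract is a structural interface. The four capability names come from the text; the method signatures and routing helper are assumptions for illustration:

```python
from typing import Any, Protocol

class AnalysisModelAdapter(Protocol):
    """Illustrative adapter interface for analysis models."""
    model_id: str
    def simulate(self, config: dict) -> dict: ...
    def calibrate(self, events: list, feature: str) -> dict: ...
    def alpha_drift(self, events: list, window_years: int) -> dict: ...
    def case_analysis(self, case: dict) -> dict: ...

def route(adapters: dict, model_id: str, capability: str, *args: Any) -> Any:
    """Service-layer routing sketch: dispatch a request to the adapter
    registered under model_id, without the decision engine knowing
    which concrete model handles it."""
    return getattr(adapters[model_id], capability)(*args)
```

Adding a new model then amounts to registering another adapter under a fresh `model_id`; the decision engine is untouched.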
Benchmark comparison lab
ParaEval includes a benchmark mode where the same case can be evaluated under different policy templates, model configurations, and evidence snapshots. This produces delta summaries (confidence change, payout change) that make the impact of model choice visible to decision-makers.
Versioned case packets
Every case snapshot is versioned with a build timestamp, snapshot version, and source adapter. This ensures that when a model is updated or new evidence arrives, previous decision artifacts remain reproducible from their original inputs.
Checklist
- Never treat CRP/HDP output as the sole trigger signal. It is indicative evidence that supplements authoritative index readings.
- Always run calibration before case analysis — the theta* estimate depends on the training corpus and feature selection.
- Compare CRP/HDP basis risk against at least one baseline method on the same data split before trusting the threshold recommendation.
- Document the seed, n_iter, feature, and train_ratio used in every CRP run. These parameters materially affect the output.
- If alpha-drift analysis shows a significant positive trend, consider narrowing the training window to more recent events for threshold recalibration.
Academic foundations
References
The CRP/HDP model draws on foundational work in Bayesian nonparametrics and applies it to a novel domain — parametric insurance trigger calibration. These references cover the mathematical foundations, the inference algorithms, and the insurance context.
Escobar, M. D. & West, M. (1995). Bayesian Density Estimation and Inference Using Mixtures. Journal of the American Statistical Association, 90(430), 577-588.
The foundational paper for the auxiliary variable method used to update the DP concentration parameter alpha. Our implementation follows their Algorithm 1.
Neal, R. M. (2000). Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9(2), 249-265.
Defines Algorithm 3 (collapsed Gibbs for DPMM), which is the core inference algorithm used in the CRP sampler. Neal's comparison of algorithms informs our choice of collapsed over uncollapsed sampling.
Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476), 1566-1581.
The theoretical foundation for the HDP extension that models primary-to-secondary peril dependency. ParaEval uses a MAP-sharing approximation of the full HDP for tractability.
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
Equation 4.215 provides the Student-t predictive distribution under NIW conjugacy, which is the core computation in our collapsed Gibbs sampler.
Schmid, T. et al. (2023). ETH Zurich / Environment Systems & Decisions.
The CLIMADA framework for parametric insurance design, which treats basis risk as a quantity to be systematically measured. ParaEval's evaluation metrics are aligned with this framing.
Gershman, S. J. & Blei, D. M. (2012). A Tutorial on Bayesian Nonparametric Models. Journal of Mathematical Psychology, 56(1), 1-12.
Accessible introduction to BNP models including the CRP, DP, and HDP. Useful for readers new to the nonparametric paradigm.