ParaEval

Architecture

ParaEval runs as two services deployed via Docker Compose: a Next.js frontend and a FastAPI Python backend. The domain logic is a pure function implemented identically in TypeScript and Python. A shared contract layer (Zod ↔ Pydantic) ensures the language boundary never becomes a data-contract boundary.

Phase 1 is complete: curated demo cases, a live decision engine, a live optional extraction endpoint, regression coverage, Docker Compose deployment, and CI/CD building both images. Phase 2 adds persistent case storage, richer policy logic, and a fuller multi-agent enrichment pipeline.

Full Stack

  • Frontend (Next.js 15 App Router): Server-rendered public pages plus a verdict-first comparison workbench. Next route handlers proxy the public benchmark/session APIs so the browser never talks to the backend service directly.
  • Public API (FastAPI public benchmark/session routes): Read-only benchmark packet routes plus reproducible session create/evaluate/export/import flows. Public routes are separate from maintainer mutation routes.
  • Maintainer API (FastAPI maintainer routes with bearer-token auth): Refresh/build jobs, packet publication, and rollback live on a separate surface guarded by PARAEVAL_BACKEND_SECRET so benchmark mutation cannot leak into public read flows.
  • Packet registry (append-only file store with pointer swaps): Benchmark packets, sessions, and jobs are stored as append-only versioned records. Small current-pointer files are swapped atomically so publication and rollback do not rewrite packet history in place.
  • Evaluation authority (FastAPI + Pydantic v2 + Uvicorn): The backend owns benchmark packet selection and evaluation runs. The Next app may render or cache results, but it does not silently substitute a second production decision engine.
  • Contract layer (Zod 4 + Pydantic v2): TypeScript and Python both serialize the comparison-lab entities in camelCase so session export/import payloads, public packets, and runs stay structurally aligned across the boundary.
  • Auth (Next auth + backend bearer token): The site can still gate maintainers at the Next layer, but the backend independently enforces bearer-token auth for mutation routes. Public benchmark reads remain unauthenticated.
  • Local fixtures (explicit dev/test mode only): Synthetic fixture packets still exist for tests and local development, but production reads are no longer meant to fall back silently to static demo data.
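The pointer-swap mechanic in the packet registry is small enough to sketch. The helper below is a hypothetical illustration (the `publish_packet` name and pointer payload shape are assumptions, not the production code); it relies on `os.replace`, which is atomic when source and destination live on the same filesystem, so readers see either the old pointer or the new one, never a partial write.

```python
import json
import os
import tempfile
from pathlib import Path


def publish_packet(root: Path, case_id: str, version: str) -> None:
    """Point packets/current/{case_id}.json at an already-written version record."""
    pointer = {"caseId": case_id, "version": version}
    current = root / "packets" / "current" / f"{case_id}.json"
    current.parent.mkdir(parents=True, exist_ok=True)
    # Write the new pointer to a temp file in the same directory, then swap it
    # in with os.replace; the version record itself is never rewritten.
    fd, tmp = tempfile.mkstemp(dir=current.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(pointer, f)
        os.replace(tmp, current)
    except BaseException:
        os.unlink(tmp)
        raise
```

Rollback falls out of the same primitive: swapping the pointer back to an earlier version touches one small file and leaves the append-only history untouched.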

Decision Algorithm (Phase 1 — Pinned)

The algorithm is deliberately simple and deliberately pinned. Every evidence item is scored on a 0–1 scale. The policy trigger is evaluated on the average. No weighting, no mandatory source requirements, no peril-specific rules — that is Phase 2. Pinning the algorithm means regression tests are stable: a historical case that returned "met" will always return "met" until a version bump explicitly changes the model.

# evidence item scores
score(supportsTrigger = "yes") = 1.0
score(supportsTrigger = "partial") = 0.5
score(supportsTrigger = "no") = 0.0
# confidence and status
confidence = mean(scores) # 0.0 if no evidence
status = "met" if confidence ≥ 0.70
status = "borderline" if confidence ≥ 0.40
status = "not_met" if confidence < 0.40
# basis risk detection
basisRiskNote = note if any(yes) AND any(no) in evidence
basisRiskNote = null otherwise
# empty evidence guard
if len(evidence) == 0: return not_met, confidence=0.0, basisRiskNote=null
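The pinned rules above translate almost line-for-line into Python. This is a self-contained sketch of the same logic, not the actual decision.py (the function name and the evidence dict shape are assumptions based on the evidence schema):

```python
from statistics import fmean

# pinned Phase 1 scoring: no weights, no peril-specific rules
SCORES = {"yes": 1.0, "partial": 0.5, "no": 0.0}


def evaluate(evidence: list[dict]) -> dict:
    """Mean of per-item scores, fixed thresholds, explicit conflict detection."""
    # empty evidence guard
    if not evidence:
        return {"status": "not_met", "confidence": 0.0, "basisRiskNote": None}
    answers = [item["supportsTrigger"] for item in evidence]
    confidence = fmean(SCORES[a] for a in answers)
    if confidence >= 0.70:
        status = "met"
    elif confidence >= 0.40:
        status = "borderline"
    else:
        status = "not_met"
    # basis risk: at least one "yes" and at least one "no" in the same set
    note = "conflicting evidence" if "yes" in answers and "no" in answers else None
    return {"status": status, "confidence": confidence, "basisRiskNote": note}
```

Because the function is pure and the thresholds are constants, pinning the version is equivalent to freezing this file: golden cases can assert exact outputs.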
What this model gets right
  • Reproducible: same input always returns same output
  • Auditable: the logic fits in 10 lines, readable by non-engineers
  • Basis-risk aware: detects and surfaces yes/no conflict explicitly
  • Regression-safe: golden cases catch any unintended change
Phase 2 upgrades planned
  • Mandatory index condition: at least one gauge or API source required
  • Peril-specific evidence weights (satellite carries more weight for flood)
  • Missing-evidence penalty when expected sources are absent
  • Source quality tiers: authoritative vs. indicative

Python Backend

The backend is a FastAPI service deployed as a separate Docker container. It holds a Python port of the decision engine and serves the same demo data as the TypeScript layer. Cross-runtime parity tests run in CI to verify both engines produce identical outputs on identical inputs. The extraction pipeline (POST /extract) already supports LLM-backed evidence classification when a DeepSeek-compatible API key is configured; it remains deliberately stateless until persistence lands in Phase 2.
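The parity check itself is conceptually simple: feed both engines the same golden inputs and diff normalized JSON outputs. Below is a hedged sketch of that shape, not the real test_decision.py; `evaluate_stub` stands in for the imported Python engine, and the golden record's `expected` half would in practice be generated once by the TypeScript engine and committed.

```python
import json


def evaluate_stub(evidence):
    """Stand-in for the real engine; the actual suite imports app.decision."""
    scores = {"yes": 1.0, "partial": 0.5, "no": 0.0}
    if not evidence:
        return {"status": "not_met", "confidence": 0.0, "basisRiskNote": None}
    vals = [scores[e["supportsTrigger"]] for e in evidence]
    confidence = sum(vals) / len(vals)
    status = "met" if confidence >= 0.70 else "borderline" if confidence >= 0.40 else "not_met"
    conflict = any(e["supportsTrigger"] == "yes" for e in evidence) and any(
        e["supportsTrigger"] == "no" for e in evidence
    )
    return {"status": status, "confidence": confidence,
            "basisRiskNote": "conflicting evidence" if conflict else None}


GOLDEN_CASES = [
    {
        "input": [{"supportsTrigger": "yes"}, {"supportsTrigger": "no"}],
        "expected": {"status": "borderline", "confidence": 0.5,
                     "basisRiskNote": "conflicting evidence"},
    },
]


def test_parity():
    for case in GOLDEN_CASES:
        got = evaluate_stub(case["input"])
        # JSON round-trip normalizes key order before comparison
        assert json.dumps(got, sort_keys=True) == json.dumps(case["expected"], sort_keys=True)
```

Drift in either engine, or in the pinned thresholds, shows up as a failed golden comparison rather than a silent behavior change.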

  • GET /health: Healthcheck endpoint. Returns {"status":"ok"}. Used by the Docker Compose depends_on condition.
  • GET /public/benchmark/config: Returns current benchmark packets, policy templates, and model dossiers for the workbench.
  • POST /public/sessions: Creates a reproducible session pinned to an exact benchmark snapshot version.
  • POST /public/sessions/{id}/evaluate: Appends a new run under the selected policy/model and returns delta-ready session state.
  • GET /public/sessions/{id}/export: Exports the current session payload without changing its pinned snapshot version.
  • POST /maintainer/jobs/refresh: Creates a maintainer packet-build job against a fixed source adapter. Mutation route; bearer-token protected.
  • POST /maintainer/packets/{id}/publish: Atomically promotes a specific packet version by swapping the current pointer file.
# backend/ directory layout
backend/
  app/
    main.py              ← FastAPI app, CORS, middleware
    auth.py              ← Bearer token dependency
    schemas.py           ← Pydantic v2 models (camelCase alias)
    decision.py          ← Python port of decision.ts (pure)
    demo_data.py         ← Python port of demo-data.ts
    routers/
      cases.py           ← GET /cases, GET /cases/{id}
      decide.py          ← POST /cases/{id}/decide (live)
      extract.py         ← POST /cases/{id}/extract (optional LLM-backed extraction)
  tests/
    test_decision.py     ← 14 parity tests vs. decision.ts
    test_api.py          ← 18 API integration tests
  Dockerfile             ← python:3.12-slim, non-root user
  requirements.txt
  requirements-dev.txt

Contract Layer: Zod ↔ Pydantic

The contract layer is the guarantee that the TypeScript frontend and the Python backend speak the same language. Zod schemas define the canonical shape in TypeScript. Pydantic models mirror them exactly. Both sides use camelCase for JSON serialization (TypeScript naturally; Python via alias_generator=to_camel). Any deviation is caught by cross-runtime parity tests that assert identical decision outputs for identical inputs in both engines.

lib/paraeval/schemas.ts (Zod 4)
# canonical source of truth
EvidenceItemSchema
  id: z.string().uuid()
  sourceType: SourceTypeSchema
  rawValue: z.string()
  normalizedValue: z.number().nullable()
  supportsTrigger: SupportsTriggerSchema
  notes: z.string()

backend/app/schemas.py (Pydantic v2)
# mirrors Zod schema exactly
class EvidenceItem(_CamelModel):
    id: UUID
    source_type: SourceType
    raw_value: str
    normalized_value: Optional[float]
    supports_trigger: SupportsTrigger
    notes: str
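What `alias_generator=to_camel` buys can be shown without Pydantic at all: a pure snake_case to camelCase rename applied at the serialization boundary. A dependency-free sketch of the conversion (Pydantic v2 ships an equivalent `to_camel` in `pydantic.alias_generators`; `dump_camel` here is only an illustration of the effect):

```python
def to_camel(name: str) -> str:
    """snake_case -> camelCase, e.g. 'supports_trigger' -> 'supportsTrigger'."""
    head, *rest = name.split("_")
    return head + "".join(word.capitalize() for word in rest)


def dump_camel(model: dict) -> dict:
    """What serializing with camelCase aliases does to a snake_case payload."""
    return {to_camel(k): v for k, v in model.items()}
```

The result is that `supports_trigger` on the Python side and `supportsTrigger` on the TypeScript side are the same field on the wire, so neither runtime needs translation glue in its business logic.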

Deployment: Docker Compose + CI/CD

Both containers are built in CI and pushed to GitHub Container Registry. The production server pulls images via docker compose pull and restarts; no source code lives on the server. The backend is not exposed to the host in production; it is reachable only on the internal Docker network via http://backend:8000. The Next.js container proxies to it from route handlers; the browser never hits the Python service directly.

# CI/CD pipeline (GitHub Actions)
quality-frontend  ← tsc, vitest (192 tests)
quality-backend   ← pytest (32 tests), mypy
docker-frontend   ← build + push ghcr.io/.../personal_website
docker-backend    ← build + push ghcr.io/.../paraeval-backend
deploy            ← needs all four; ssh pull + restart

# docker-compose.yml (production)
web:
  image: ghcr.io/skumyol/personal_website:latest
  depends_on:
    backend:
      condition: service_healthy
backend:
  image: ghcr.io/skumyol/paraeval-backend:latest
  healthcheck: GET /health every 10s
  # not exposed to host in production

Repo Shape

app/paraeval/          ← public pages (landing, cases, architecture, regression)
app/api/paraeval/      ← GET + admin-gated POST stubs → Python backend proxy
components/paraeval/   ← UI components (CaseTabs, StatusBadge, TriggerDecisionCard...)
lib/paraeval/
  schemas.ts           ← Zod 4 contract (canonical source of truth)
  types.ts             ← TS types inferred from Zod
  decision.ts          ← Pure evaluation engine (zero deps)
  formatters.ts        ← Display helpers (formatConfidence, formatStatus...)
  demo-data.ts         ← Synthetic cases (Hong Kong, Manila, Jakarta)
backend/               ← FastAPI service (Python 3.12)
tests/lib/paraeval/    ← 31 unit tests (decision engine + schemas + formatters)
tests/api/paraeval/    ← Route handler tests
tests/app/paraeval/    ← Page render tests

Database Schema (Phase 2+)

The relational schema is not active in Phase 1; its table names and column shapes are defined in the schema design, with SQLite via node:sqlite planned for development and a Postgres-compatible layout for production scale. Until it lands, the equivalent records live in the append-only file store:

packets/versions/{caseId}/{version}.json       ← Immutable benchmark packet records
packets/current/{caseId}.json                  ← Atomic current-version pointer per benchmark packet
sessions/versions/{sessionId}/{revision}.json  ← Append-only exported/imported session revisions
sessions/current/{sessionId}.json              ← Current session pointer for reopen/export flows
jobs/versions/{jobId}/{revision}.json          ← Queued/running/failed/succeeded maintainer job records
jobs/current/{jobId}.json                      ← Latest maintainer job status pointer

Phase 2: Extraction Pipeline

The extraction pipeline is the missing piece. Phase 1 uses static demo data to prove the evaluation and decision surfaces work. Phase 2 replaces static data with a live extraction pipeline that queries real sources: gauge APIs (USGS, JMA, Thai Meteorological Department), satellite-derived indices (MODIS flood extent, NDVI anomaly, GPM precipitation), and document ingestion for field reports and loss adjuster notes.

Planned Python stack:

  • LangGraph — orchestration graph for multi-step extraction runs with retries and partial results
  • vLLM — local inference for document extraction (loss adjuster reports, policy schedule parsing)
  • CLIMADA — catastrophe model context: expected loss at location for peril, used to weight evidence and calibrate basis-risk notes
  • MLflow — extraction run tracking: which sources were queried, what was returned, how long each step took
  • pytest parity tests — cross-runtime verification that the Python decision engine produces identical outputs to the TypeScript engine for all golden cases
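The operational core of that pipeline, retrying each source and keeping whatever succeeded, does not need LangGraph to be illustrated. A framework-free sketch of the pattern (source names, return shapes, and the `run_sources` helper are invented for illustration):

```python
import time
from collections.abc import Callable


def run_sources(
    sources: dict[str, Callable[[], dict]],
    retries: int = 2,
    backoff_s: float = 0.0,
) -> tuple[dict[str, dict], dict[str, str]]:
    """Query each source with retries; return (partial results, per-source errors)."""
    results: dict[str, dict] = {}
    errors: dict[str, str] = {}
    for name, fetch in sources.items():
        for attempt in range(retries + 1):
            try:
                results[name] = fetch()
                break
            except Exception as exc:
                if attempt == retries:
                    # Record the failure but keep going: partial evidence
                    # still feeds the decision engine.
                    errors[name] = f"{type(exc).__name__}: {exc}"
                else:
                    time.sleep(backoff_s * (2 ** attempt))
    return results, errors
```

An orchestration graph adds what this sketch lacks: persisted step state, resumable runs, and per-node rate limiting, which is exactly the operational weight that justifies a separate service.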

The extraction pipeline adds operational complexity — long-running jobs, partial failures, external API rate limits, model inference latency. That complexity belongs in a separate service. The split from the Next.js monolith happens when extraction demand is real, not as a premature architectural decision.