ParaEval

Architecture

ParaEval runs as two services deployed via Docker Compose: a Next.js frontend and a FastAPI Python backend. The domain logic is a pure function implemented identically in TypeScript and Python. A shared contract layer (Zod ↔ Pydantic) ensures the language boundary never becomes a data-contract boundary.

Phase 1 is complete: curated demo cases, a live decision engine, a live optional extraction endpoint, regression coverage, Docker Compose deployment, and CI/CD building both images. Phase 2 adds persistent case storage, richer policy logic, and a fuller multi-agent enrichment pipeline.

Full Stack

  • Frontend (Next.js 15 App Router): Server-rendered public pages plus a verdict-first comparison workbench. Next route handlers proxy the public benchmark/session APIs so the browser never talks to the backend service directly.
  • Public API (FastAPI public benchmark/session routes): Read-only benchmark packet routes plus reproducible session create/evaluate/export/import flows. Public routes are separate from maintainer mutation routes.
  • Maintainer API (FastAPI maintainer routes with bearer-token auth): Refresh/build jobs, packet publication, and rollback live on a separate surface guarded by PARAEVAL_BACKEND_SECRET so benchmark mutation cannot leak into public read flows.
  • Packet registry (append-only file store with pointer swaps): Benchmark packets, sessions, and jobs are stored as append-only versioned records. Small current-pointer files are swapped atomically so publication and rollback do not rewrite packet history in place.
  • Evaluation authority (FastAPI + Pydantic v2 + Uvicorn): The backend owns benchmark packet selection and evaluation runs. The Next app may render or cache results, but it does not silently substitute a second production decision engine.
  • Contract layer (Zod 4 + Pydantic v2): TypeScript and Python both serialize the comparison-lab entities in camelCase so session export/import payloads, public packets, and runs stay structurally aligned across the boundary.
  • Auth (Next auth + backend bearer token): The site can still gate maintainers at the Next layer, but the backend independently enforces bearer-token auth for mutation routes. Public benchmark reads remain unauthenticated.
  • Local fixtures (explicit dev/test mode only): Synthetic fixture packets still exist for tests and local development, but production reads are no longer meant to fall back silently to static demo data.
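The pointer-swap mechanic in the packet registry is small enough to sketch. The helper below is a hypothetical illustration (the `publish_packet` name and pointer payload shape are assumptions, not the production code); it relies on `os.replace`, which is atomic when source and destination live on the same filesystem, so readers see either the old pointer or the new one, never a partial write.

```python
import json
import os
import tempfile
from pathlib import Path


def publish_packet(root: Path, case_id: str, version: str) -> None:
    """Point packets/current/{case_id}.json at an already-written version record."""
    pointer = {"caseId": case_id, "version": version}
    current = root / "packets" / "current" / f"{case_id}.json"
    current.parent.mkdir(parents=True, exist_ok=True)
    # Write the new pointer to a temp file in the same directory, then swap it
    # in with os.replace; the version record itself is never rewritten.
    fd, tmp = tempfile.mkstemp(dir=current.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(pointer, f)
        os.replace(tmp, current)
    except BaseException:
        os.unlink(tmp)
        raise
```

Rollback falls out of the same primitive: swapping the pointer back to an earlier version touches one small file and leaves the append-only history untouched.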

Decision Algorithm (Phase 1 — Pinned)

The algorithm is deliberately simple and deliberately pinned. Every evidence item is scored on a 0–1 scale. The policy trigger is evaluated on the average. No weighting, no mandatory source requirements, no peril-specific rules — that is Phase 2. Pinning the algorithm means regression tests are stable: a historical case that returned "met" will always return "met" until a version bump explicitly changes the model.

# evidence item scores
score(supportsTrigger = "yes") = 1.0
score(supportsTrigger = "partial") = 0.5
score(supportsTrigger = "no") = 0.0
# confidence and status
confidence = mean(scores) # 0.0 if no evidence
status = "met" if confidence ≥ 0.70
status = "borderline" if confidence ≥ 0.40
status = "not_met" if confidence < 0.40
# basis risk detection
basisRiskNote = note if any(yes) AND any(no) in evidence
basisRiskNote = null otherwise
# empty evidence guard
if len(evidence) == 0: return not_met, confidence=0.0, basisRiskNote=null
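The pinned rules above translate almost line-for-line into Python. This is a self-contained sketch of the same logic, not the actual decision.py (the function name and the evidence dict shape are assumptions based on the evidence schema):

```python
from statistics import fmean

# pinned Phase 1 scoring: no weights, no peril-specific rules
SCORES = {"yes": 1.0, "partial": 0.5, "no": 0.0}


def evaluate(evidence: list[dict]) -> dict:
    """Mean of per-item scores, fixed thresholds, explicit conflict detection."""
    # empty evidence guard
    if not evidence:
        return {"status": "not_met", "confidence": 0.0, "basisRiskNote": None}
    answers = [item["supportsTrigger"] for item in evidence]
    confidence = fmean(SCORES[a] for a in answers)
    if confidence >= 0.70:
        status = "met"
    elif confidence >= 0.40:
        status = "borderline"
    else:
        status = "not_met"
    # basis risk: at least one "yes" and at least one "no" in the same set
    note = "conflicting evidence" if "yes" in answers and "no" in answers else None
    return {"status": status, "confidence": confidence, "basisRiskNote": note}
```

Because the function is pure and the thresholds are constants, pinning the version is equivalent to freezing this file: golden cases can assert exact outputs.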
What this model gets right
  • Reproducible: same input always returns same output
  • Auditable: the logic fits in 10 lines, readable by non-engineers
  • Basis-risk aware: detects and surfaces yes/no conflict explicitly
  • Regression-safe: golden cases catch any unintended change
Phase 2 upgrades planned
  • Mandatory index condition: at least one gauge or API source required
  • Peril-specific evidence weights (satellite carries more weight for flood)
  • Missing-evidence penalty when expected sources are absent
  • Source quality tiers: authoritative vs. indicative

Python Backend

The backend is a FastAPI service deployed as a separate Docker container. It holds a Python port of the decision engine and serves the same demo data as the TypeScript layer. Cross-runtime parity tests run in CI to verify both engines produce identical outputs on identical inputs. The extraction pipeline (POST /extract) already supports LLM-backed evidence classification when a DeepSeek-compatible API key is configured; it remains deliberately stateless until persistence lands in Phase 2.
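The parity check itself is conceptually simple: feed both engines the same golden inputs and diff normalized JSON outputs. Below is a hedged sketch of that shape, not the real test_decision.py; `evaluate_stub` stands in for the imported Python engine, and the golden record's `expected` half would in practice be generated once by the TypeScript engine and committed.

```python
import json


def evaluate_stub(evidence):
    """Stand-in for the real engine; the actual suite imports app.decision."""
    scores = {"yes": 1.0, "partial": 0.5, "no": 0.0}
    if not evidence:
        return {"status": "not_met", "confidence": 0.0, "basisRiskNote": None}
    vals = [scores[e["supportsTrigger"]] for e in evidence]
    confidence = sum(vals) / len(vals)
    status = "met" if confidence >= 0.70 else "borderline" if confidence >= 0.40 else "not_met"
    conflict = any(e["supportsTrigger"] == "yes" for e in evidence) and any(
        e["supportsTrigger"] == "no" for e in evidence
    )
    return {"status": status, "confidence": confidence,
            "basisRiskNote": "conflicting evidence" if conflict else None}


GOLDEN_CASES = [
    {
        "input": [{"supportsTrigger": "yes"}, {"supportsTrigger": "no"}],
        "expected": {"status": "borderline", "confidence": 0.5,
                     "basisRiskNote": "conflicting evidence"},
    },
]


def test_parity():
    for case in GOLDEN_CASES:
        got = evaluate_stub(case["input"])
        # JSON round-trip normalizes key order before comparison
        assert json.dumps(got, sort_keys=True) == json.dumps(case["expected"], sort_keys=True)
```

Drift in either engine, or in the pinned thresholds, shows up as a failed golden comparison rather than a silent behavior change.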

  • GET /health: Healthcheck endpoint. Returns {"status":"ok"}. Used by the Docker Compose depends_on condition.
  • GET /public/benchmark/config: Returns current benchmark packets, policy templates, and model dossiers for the workbench.
  • POST /public/sessions: Creates a reproducible session pinned to an exact benchmark snapshot version.
  • POST /public/sessions/{id}/evaluate: Appends a new run under the selected policy/model and returns delta-ready session state.
  • GET /public/sessions/{id}/export: Exports the current session payload without changing its pinned snapshot version.
  • POST /maintainer/jobs/refresh: Creates a maintainer packet-build job against a fixed source adapter. Mutation route; bearer-token protected.
  • POST /maintainer/packets/{id}/publish: Atomically promotes a specific packet version by swapping the current pointer file.
# backend/ directory layout
backend/
  app/
    main.py              ← FastAPI app, CORS, middleware
    auth.py              ← Bearer token dependency
    schemas.py           ← Pydantic v2 models (camelCase alias)
    decision.py          ← Python port of decision.ts (pure)
    demo_data.py         ← Python port of demo-data.ts
    routers/
      cases.py           ← GET /cases, GET /cases/{id}
      decide.py          ← POST /cases/{id}/decide (live)
      extract.py         ← POST /cases/{id}/extract (optional LLM-backed extraction)
  tests/
    test_decision.py     ← 14 parity tests vs. decision.ts
    test_api.py          ← 18 API integration tests
  Dockerfile             ← python:3.12-slim, non-root user
  requirements.txt
  requirements-dev.txt

Contract Layer: Zod ↔ Pydantic

The contract layer is the guarantee that the TypeScript frontend and the Python backend speak the same language. Zod schemas define the canonical shape in TypeScript. Pydantic models mirror them exactly. Both sides use camelCase for JSON serialization (TypeScript naturally; Python via alias_generator=to_camel). Any deviation is caught by cross-runtime parity tests that assert identical decision outputs for identical inputs in both engines.

lib/paraeval/schemas.ts (Zod 4)
# canonical source of truth
EvidenceItemSchema
  id: z.string().uuid()
  sourceType: SourceTypeSchema
  rawValue: z.string()
  normalizedValue: z.number().nullable()
  supportsTrigger: SupportsTriggerSchema
  notes: z.string()

backend/app/schemas.py (Pydantic v2)
# mirrors Zod schema exactly
class EvidenceItem(_CamelModel):
    id: UUID
    source_type: SourceType
    raw_value: str
    normalized_value: Optional[float]
    supports_trigger: SupportsTrigger
    notes: str
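What `alias_generator=to_camel` buys can be shown without Pydantic at all: a pure snake_case to camelCase rename applied at the serialization boundary. A dependency-free sketch of the conversion (Pydantic v2 ships an equivalent `to_camel` in `pydantic.alias_generators`; `dump_camel` here is only an illustration of the effect):

```python
def to_camel(name: str) -> str:
    """snake_case -> camelCase, e.g. 'supports_trigger' -> 'supportsTrigger'."""
    head, *rest = name.split("_")
    return head + "".join(word.capitalize() for word in rest)


def dump_camel(model: dict) -> dict:
    """What serializing with camelCase aliases does to a snake_case payload."""
    return {to_camel(k): v for k, v in model.items()}
```

The result is that `supports_trigger` on the Python side and `supportsTrigger` on the TypeScript side are the same field on the wire, so neither runtime needs translation glue in its business logic.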

Deployment: Docker Compose + CI/CD

Both containers are built in CI and pushed to GitHub Container Registry. The production server pulls images via docker compose pull and restarts; no source code lives on the server. The backend is not exposed to the host in production; it is reachable only on the internal Docker network via http://backend:8000. The Next.js container proxies to it from route handlers; the browser never hits the Python service directly.

# CI/CD pipeline (GitHub Actions)
quality-frontend  ← tsc, vitest (192 tests)
quality-backend   ← pytest (32 tests), mypy
docker-frontend   ← build + push ghcr.io/.../personal_website
docker-backend    ← build + push ghcr.io/.../paraeval-backend
deploy            ← needs all four; ssh pull + restart

# docker-compose.yml (production)
web:
  image: ghcr.io/skumyol/personal_website:latest
  depends_on:
    backend:
      condition: service_healthy
backend:
  image: ghcr.io/skumyol/paraeval-backend:latest
  healthcheck: GET /health every 10s
  # not exposed to host in production

Repo Shape

app/paraeval/          ← public pages (landing, cases, architecture, regression)
app/api/paraeval/      ← GET + admin-gated POST stubs → Python backend proxy
components/paraeval/   ← UI components (CaseTabs, StatusBadge, TriggerDecisionCard...)
lib/paraeval/
  schemas.ts           ← Zod 4 contract (canonical source of truth)
  types.ts             ← TS types inferred from Zod
  decision.ts          ← Pure evaluation engine (zero deps)
  formatters.ts        ← Display helpers (formatConfidence, formatStatus...)
  demo-data.ts         ← Synthetic cases (Hong Kong, Manila, Jakarta)
backend/               ← FastAPI service (Python 3.12)
tests/lib/paraeval/    ← 31 unit tests (decision engine + schemas + formatters)
tests/api/paraeval/    ← Route handler tests
tests/app/paraeval/    ← Page render tests

Database Schema (Phase 2+)

The relational schema is not active in Phase 1; its table names and column shapes are defined in the schema design, with SQLite via node:sqlite planned for development and a Postgres-compatible layout for production scale. Until it lands, the equivalent records live in the append-only file store:

packets/versions/{caseId}/{version}.json       ← Immutable benchmark packet records
packets/current/{caseId}.json                  ← Atomic current-version pointer per benchmark packet
sessions/versions/{sessionId}/{revision}.json  ← Append-only exported/imported session revisions
sessions/current/{sessionId}.json              ← Current session pointer for reopen/export flows
jobs/versions/{jobId}/{revision}.json          ← Queued/running/failed/succeeded maintainer job records
jobs/current/{jobId}.json                      ← Latest maintainer job status pointer

Phase 2: Extraction Pipeline

The extraction pipeline is the missing piece. Phase 1 uses static demo data to prove the evaluation and decision surfaces work. Phase 2 replaces static data with a live extraction pipeline that queries real sources: gauge APIs (USGS, JMA, Thai Meteorological Department), satellite-derived indices (MODIS flood extent, NDVI anomaly, GPM precipitation), and document ingestion for field reports and loss adjuster notes.

Planned Python stack:

  • LangGraph — orchestration graph for multi-step extraction runs with retries and partial results
  • vLLM — local inference for document extraction (loss adjuster reports, policy schedule parsing)
  • CLIMADA — catastrophe model context: expected loss at location for peril, used to weight evidence and calibrate basis-risk notes
  • MLflow — extraction run tracking: which sources were queried, what was returned, how long each step took
  • pytest parity tests — cross-runtime verification that the Python decision engine produces identical outputs to the TypeScript engine for all golden cases
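The operational core of that pipeline, retrying each source and keeping whatever succeeded, does not need LangGraph to be illustrated. A framework-free sketch of the pattern (source names, return shapes, and the `run_sources` helper are invented for illustration):

```python
import time
from collections.abc import Callable


def run_sources(
    sources: dict[str, Callable[[], dict]],
    retries: int = 2,
    backoff_s: float = 0.0,
) -> tuple[dict[str, dict], dict[str, str]]:
    """Query each source with retries; return (partial results, per-source errors)."""
    results: dict[str, dict] = {}
    errors: dict[str, str] = {}
    for name, fetch in sources.items():
        for attempt in range(retries + 1):
            try:
                results[name] = fetch()
                break
            except Exception as exc:
                if attempt == retries:
                    # Record the failure but keep going: partial evidence
                    # still feeds the decision engine.
                    errors[name] = f"{type(exc).__name__}: {exc}"
                else:
                    time.sleep(backoff_s * (2 ** attempt))
    return results, errors
```

An orchestration graph adds what this sketch lacks: persisted step state, resumable runs, and per-node rate limiting, which is exactly the operational weight that justifies a separate service.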

The extraction pipeline adds operational complexity — long-running jobs, partial failures, external API rate limits, model inference latency. That complexity belongs in a separate service. The split from the Next.js monolith happens when extraction demand is real, not as a premature architectural decision.