Production hardening

The demo is a portfolio piece. Production is a regulated SaaS handling pre-contractual finance disclosures for real customers. This page is the checklist between the two: what to lock down before the first real customer sees the surface.

| Area | Check | Done when |
| --- | --- | --- |
| Rate limits | Per-resource KV counters live | 10 quote-creates/min per signed URL, 5 magic-link opens/min per token, 30 admin reads/min per session, all enforced |
| Magic-link signing | Real keys deployed, rotation cadence documented | MAGIC_LINK_SIGNING_KEY_CURRENT is 32+ random bytes; rotation runbook exists |
| Token TTL | MAGIC_LINK_TTL_HOURS set | Default 336 (14 days). Approved by broker compliance. |
| Env vars | All secrets live in Vercel, none in source | git grep for known secret prefixes returns clean |
| HTTPS | Domain serves valid certificate | curl -I shows 200 and HSTS header |
| HSTS preload | Submitted to hstspreload.org | Site appears in the HSTS preload list |
| CSP | Content-Security-Policy header set | See below |
| Other headers | X-Frame-Options, Referrer-Policy, Permissions-Policy | All present |
| Database | Postgres in UK region, backups on | Daily backups, 30-day retention |
| KV | UK region | KV instance in lhr1 |
| Email | Postmark domain verified, SPF/DKIM/DMARC | DMARC at p=quarantine minimum |
| Audit log retention | 7 years on acknowledged, 24 months on others | Retention sweep job scheduled |
| RBAC | Admin / auditor / read-only roles enforced | Test plan confirms each role’s scope |
| DSAR endpoint | Admin /admin/dsar live and tested | One end-to-end DSAR completed in staging |
| Error monitoring | Vercel + Sentry | Sentry catches client and server errors with source maps |
| Uptime monitoring | Status page + alerting | External monitor pings the customer phone surface every 60s |
| Logs | Structured logs to Logflare or Axiom | 30-day retention minimum |
| Alerting | On-call rotation set | PagerDuty or Opsgenie wired to Sentry and the uptime monitor |
| Pen test | Annual external test booked | Scope agreed and signed |
| Backup restore | Tested in staging | One full restore-from-backup performed end-to-end |
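
The signing-key and TTL rows can be verified at boot rather than by inspection. The following is a minimal sketch, not the project's actual code: `validateMagicLinkEnv` is a hypothetical helper, and it assumes the key is stored base64-encoded (the checklist only says 32+ random bytes).

```typescript
// Hypothetical boot-time check. The function name and the base64 encoding
// of the key are assumptions; only the variable names come from the checklist.
type MagicLinkConfig = { keyBytes: number; ttlHours: number };

export function validateMagicLinkEnv(
  env: Record<string, string | undefined>,
): MagicLinkConfig {
  const rawKey = env.MAGIC_LINK_SIGNING_KEY_CURRENT;
  if (!rawKey) throw new Error("MAGIC_LINK_SIGNING_KEY_CURRENT is not set");

  // Decode and count bytes: the checklist requires at least 32.
  const keyBytes = Buffer.from(rawKey, "base64").length;
  if (keyBytes < 32) throw new Error(`signing key is ${keyBytes} bytes; need 32+`);

  // Fall back to 336 hours (14 days) when the variable is absent.
  const ttlHours = Number(env.MAGIC_LINK_TTL_HOURS ?? "336");
  if (!Number.isInteger(ttlHours) || ttlHours <= 0) {
    throw new Error("MAGIC_LINK_TTL_HOURS must be a positive integer");
  }
  return { keyBytes, ttlHours };
}
```

Calling `validateMagicLinkEnv(process.env)` during startup would make a deploy with a missing or short key fail loudly instead of signing links with a weak secret.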

Three resource ceilings, enforced at the API edge via Vercel KV counters.

| Resource | Limit | Window | KV key |
| --- | --- | --- | --- |
| Quote create | 10 | 60s | rl:create:<retailerId> |
| Magic link open | 5 | 60s | rl:open:<tokenHash> |
| Admin read | 30 | 60s | rl:admin:<sessionId> |

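import { kv } from "@vercel/kv";

// Fixed-window counter: the first hit in a window starts the expiry clock.
// HttpError is assumed to be the project's own error class.
async function rateLimit(key: string, max: number, windowSec: number) {
  const count = await kv.incr(key);
  if (count === 1) await kv.expire(key, windowSec);
  if (count > max) {
    throw new HttpError(429, "rate-limited");
  }
}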

Exceeding a limit returns 429 with a Retry-After: 60 header. The customer-facing 429 redirects to a polite “try again in a minute” page.
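
One way to attach the Retry-After header is to have the limiter return the 429 response itself. This is a sketch under stated assumptions: `CounterStore`, `MemoryStore`, and `checkRateLimit` are hypothetical names, and the in-memory store stands in for the KV client so the example runs on its own.

```typescript
// Minimal counter interface; the real KV client is assumed to offer
// equivalent incr/expire operations.
interface CounterStore {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<unknown>;
}

// In-memory stand-in used only to make the sketch self-contained.
class MemoryStore implements CounterStore {
  private counts = new Map<string, number>();
  async incr(key: string): Promise<number> {
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }
  async expire(_key: string, _seconds: number): Promise<void> {
    // No-op here; a real store would drop the key after the window.
  }
}

// Returns null when the request is allowed, or a 429 Response when it is not.
export async function checkRateLimit(
  store: CounterStore,
  key: string,
  max: number,
  windowSec: number,
): Promise<Response | null> {
  const count = await store.incr(key);
  if (count === 1) await store.expire(key, windowSec);
  if (count <= max) return null;
  return new Response("rate-limited", {
    status: 429,
    headers: { "Retry-After": String(windowSec) },
  });
}
```

A route handler could call `checkRateLimit(store, "rl:open:" + tokenHash, 5, 60)` and return the result directly whenever it is not null.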

Content-Security-Policy:
  default-src 'self';
  script-src 'self' 'sha256-<computed>';
  style-src 'self' 'unsafe-inline';
  img-src 'self' data: blob:;
  font-src 'self' data:;
  connect-src 'self' https://vitals.vercel-insights.com;
  frame-ancestors 'none';
  base-uri 'self';
  form-action 'self';

Notes:

  • 'unsafe-inline' on style-src is required for Tailwind’s runtime utilities. This is an accepted trade-off.
  • script-src is 'self' plus a hash for each inline <script> tag Next.js emits. Compute the hashes during the build.
  • frame-ancestors 'none' prevents the surface being iframed elsewhere. Combine with X-Frame-Options: DENY for older browsers.
  • connect-src allows Vercel Web Vitals telemetry. Add additional origins (Sentry, Postmark webhooks, etc.) explicitly.
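
The hash in script-src is the base64-encoded SHA-256 digest of the exact inline script text. A minimal sketch of computing it with Node's built-in crypto module follows; `cspHashFor` and `scriptSrcDirective` are hypothetical helper names, and how the build collects the inline script bodies is left open.

```typescript
import { createHash } from "node:crypto";

// Builds the CSP source expression for one inline script body.
// The input must be the exact text between <script> and </script>.
export function cspHashFor(inlineScript: string): string {
  const digest = createHash("sha256").update(inlineScript, "utf8").digest("base64");
  return `'sha256-${digest}'`;
}

// Joins 'self' with the hashes of every inline script found at build time.
export function scriptSrcDirective(inlineScripts: string[]): string {
  return ["script-src", "'self'", ...inlineScripts.map(cspHashFor)].join(" ");
}
```

Any change to an inline script, including whitespace, changes its hash, so the directive has to be regenerated on every build.
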
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
Referrer-Policy: strict-origin-when-cross-origin
Permissions-Policy: camera=(), microphone=(), geolocation=(), payment=()

Set in next.config.ts:

const securityHeaders = [
  { key: "Strict-Transport-Security", value: "max-age=63072000; includeSubDomains; preload" },
  { key: "X-Frame-Options", value: "DENY" },
  { key: "X-Content-Type-Options", value: "nosniff" },
  { key: "Referrer-Policy", value: "strict-origin-when-cross-origin" },
  { key: "Permissions-Policy", value: "camera=(), microphone=(), geolocation=(), payment=()" },
];

export default {
  async headers() {
    return [{ source: "/:path*", headers: securityHeaders }];
  },
};

| Practice | Implementation |
| --- | --- |
| Secrets only in Vercel env vars | Never in source. .env.local gitignored. |
| Rotation cadence | Magic-link keys quarterly; webhook secrets annually; Postmark token annually |
| On staff exit | Rotate every secret the leaver had access to within 24 hours |
| Audit access | Vercel project access reviewed quarterly; remove unused team members |
| Break-glass account | One root-equivalent account, password in a sealed vault, used for emergency only |

A nightly cron job runs the retention sweep:

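app/api/cron/retention-sweep/route.ts
// sweepExpiredQuotes, purgeOldWebhookDeliveries and purgeOldNonces are
// project helpers assumed to be imported from elsewhere in the codebase.
export const runtime = "nodejs";
export const maxDuration = 300;

export async function GET(req: Request) {
  // Reject when CRON_SECRET is unset; otherwise "Bearer undefined" would pass.
  const secret = process.env.CRON_SECRET;
  if (!secret || req.headers.get("authorization") !== `Bearer ${secret}`) {
    return new Response("unauthorized", { status: 401 });
  }
  await sweepExpiredQuotes(); // unacknowledged > 24 months
  await purgeOldWebhookDeliveries(); // > 90 days
  await purgeOldNonces(); // KV TTL handles this; sweep is a safety net
  return new Response("ok");
}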

Configure in vercel.json:

{
  "crons": [{ "path": "/api/cron/retention-sweep", "schedule": "0 2 * * *" }]
}

The sweep emits a retention-sweep-run log entry with counts. Two consecutive failures page the on-call.
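
The paging rule can be expressed as a small pure function over recent run history. This is an illustrative sketch only: `SweepRun`, `shouldPage`, and `sweepLogEntry` are hypothetical names, and the log entry shape is an assumption (only the `retention-sweep-run` event name comes from the text above).

```typescript
type SweepRun = { ok: boolean };

// True when the most recent `threshold` runs all failed.
export function shouldPage(history: SweepRun[], threshold = 2): boolean {
  if (history.length < threshold) return false;
  return history.slice(-threshold).every((run) => !run.ok);
}

// Assumed log shape: event name, outcome, per-task counts, and a timestamp.
export function sweepLogEntry(counts: Record<string, number>, ok: boolean) {
  return { event: "retention-sweep-run", ok, counts, at: new Date().toISOString() };
}
```

With this shape, a single failed night followed by a success never pages, while two failed nights in a row do.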

Sentry for the application, Vercel’s built-in observability for the platform.

| What | Where |
| --- | --- |
| Unhandled exceptions | Sentry |
| Failed token validation | Logged at WARN, not paged unless rate exceeds threshold |
| Failed webhook deliveries | Logged + visible in admin portal “delivery health” panel |
| 5xx rate | Vercel + Sentry, page on > 1% over 5 minutes |
| p95 page load | Vercel Web Vitals, page on > 3s sustained |

External pings every 60 seconds:

| Endpoint | Expected | If failing |
| --- | --- | --- |
| GET / | 200, “Lending Agent Presenter” in HTML | Page on-call |
| GET /api/health | 200, {"ok": true} | Page on-call |

/api/health checks: server alive, Postgres reachable, KV reachable, signing key loadable. Simple boolean.
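
A health handler of this kind can run each dependency check and collapse the results into one boolean. The sketch below assumes each check is an async function that rejects on failure. `healthResponse` is a hypothetical name, and returning 503 when unhealthy is an assumption, since the text only specifies the healthy response.

```typescript
type Check = () => Promise<void>; // resolves when healthy, rejects otherwise

// Runs every dependency check and collapses the result to a single boolean.
export async function healthResponse(checks: Record<string, Check>): Promise<Response> {
  const results = await Promise.allSettled(
    Object.values(checks).map(async (check) => check()),
  );
  const ok = results.every((r) => r.status === "fulfilled");
  return new Response(JSON.stringify({ ok }), {
    status: ok ? 200 : 503, // 503 on failure is an assumption
    headers: { "Content-Type": "application/json" },
  });
}
```

A route could pass one check per dependency, for example `{ postgres, kv, signingKey }`, where each name is a placeholder for a project-specific probe.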

| What | Frequency | Retention |
| --- | --- | --- |
| Postgres | Daily snapshot, point-in-time recovery enabled | 30 days |
| Vercel Blob (PDFs) | Versioned by Vercel by default | Lifetime of the parent quote (7 years) |
| KV | Volatile by design; nonce blocklist tolerable to lose | None |
| Source | GitHub | Forever |

A full restore drill is run twice a year. Documented in the runbook; on-call rotates through it.

| Severity | Definition | Response |
| --- | --- | --- |
| P1 | Customer cannot acknowledge OR data breach | Page on-call. 15 minute response. Status page update. |
| P2 | Rep cannot send a quote | Page on-call. 1 hour response. |
| P3 | Admin portal degraded | Ticket. Next business day. |
| P4 | Cosmetic or non-blocking | Backlog. |

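A P1 involving a data breach triggers the UK GDPR notification chain: Shermin notifies the broker (the controller) within 24 hours; the ICO is then notified within 72 hours of becoming aware (Article 33), and affected customers are told without undue delay where the breach poses a high risk to them (Article 34).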

A short list, run in staging the week before launch:

  1. Build a quote on the rep tablet. Confirm the email arrives at a real address.
  2. Click the magic link. Confirm the customer surface loads, validates the token, and renders the quote.
  3. Pick an option, tick all four boxes, confirm. Confirm the audit events appear in the admin portal.
  4. Open the admin portal. Run a CSV export. Open a quote detail. Resend the magic link. Confirm new audit event.
  5. Force a token expiry by waiting (or shortening TTL temporarily). Confirm the expiry page renders.
  6. Force a rate limit by hammering the open endpoint. Confirm 429 with Retry-After.
  7. Confirm CSP and security headers via curl -I.
  8. Confirm Sentry catches a deliberate test exception.
  9. Confirm the retention sweep ran in staging (log entry visible).
  10. Confirm the on-call rotation receives a test page from the uptime monitor.

When all ten pass, the surface is ready.
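
Step 7 can be automated with a small check over a response's headers. This is a sketch, not part of the documented tooling: `missingSecurityHeaders` is a hypothetical helper, and the list of required names is taken from the headers this page sets.

```typescript
// Headers this page expects on every response (names are case-insensitive).
const REQUIRED_HEADERS = [
  "strict-transport-security",
  "content-security-policy",
  "x-frame-options",
  "x-content-type-options",
  "referrer-policy",
  "permissions-policy",
];

// Returns the names of required headers that are absent from a response.
export function missingSecurityHeaders(headers: Headers): string[] {
  return REQUIRED_HEADERS.filter((name) => !headers.has(name));
}
```

In a staging script, the `headers` of a `fetch` response for the site root could be passed in. An empty array means every expected header was present.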