Scope: Stage-level App->Initial Underwriting (UW) decision only (excludes appraisal, title, escrow).
Cohort (clean files): AUS-accept, documents complete at submission, no appraisal/title holds.
Windows: Unless noted, Q2-Q3 2023 pilot for stage metrics; 30-day view for DORA.
The Challenge
City National Bank's legacy underwriting suite serialized every stage of mortgage approval. SLA breaches were frequent because throughput declined during end-of-month surges, and stranded workflows required human intervention that lasted multiple shifts. Error handling was opaque, audit trails were fragmented, and the change calendar limited how quickly new controls could be shipped.
The Solution
I led an incremental migration that reframed underwriting as event-driven choreography while protecting production stability and regulator trust.
Modernization Program Timeline
Benchmarked legacy bottlenecks
Instrumented the monolith and traced the highest-risk failure modes while aligning auditors on safety guardrails.
Piloted event-driven Step Functions
Split intake, bureau sync, appraisal, and compliance into isolated express state machines behind safe traffic mirroring.
Productized rollback and replay patterns
Codified compensating SLIs, automated roll-forward handlers, and replayable audits before sunsetting the monolith.
Scaled with multi-signal guardrails
Raised deployment frequency while keeping regulator trust through DORA reporting and burn-rate gated scaling.
Underwriting-Stage Metrics
Intake Validation p95
Document and schema checks stabilized with asynchronous retries and staged rollouts (Q2-Q3 2023 clean-file view).
Credit Bureau Timeout Rate
Multiattempt bureau sync with circuit breakers trimmed daily timeout incidents (Q2-Q3 2023 clean-file view).
Collateral Pre-score SLA
Internal collateral pre-screen kept pre-appraisal checks sub-hour on clean files (Q2-Q3 2023 window).
Compliance Rule Latency
Deterministic rules engine shipped with queue back-pressure telemetry (rolling 6-week operational window).
Scorecard Adoption
Operations leadership consumed standardized underwriting scorecards each morning (daily instrumentation window).
Each underwriting stage now exposes a standardized telemetry envelope:
- Intake validation: median processing time under 90 seconds, p95 under 140 seconds, success rate at 99.3%.
- Credit bureau sync: automated retries capped at three loops with observed timeout frequency under 0.4% per day.
- Collateral pre-score: internal pre-screen median SLA under 45 minutes driven by asynchronous document scoring.
- Compliance checks: rule evaluation latency under 8 seconds with backlog depth visible in Grafana heatmaps.
A shared metrics schema landed in New Relic and fed a daily underwriting scorecard for operations leadership.
"Collateral pre-score" refers to the internal pre-screening workflow prior to any vendor-managed appraisal steps.
Step Functions Recovery Configuration
We decomposed the monolith into Step Functions express workflows partitioned by underwriting stage. Each state machine defined:
- Zonal failure policies that shifted workloads to warm standby clusters when burn-rate SLOs signaled risk.
- Automated roll-forward handlers that re-queued idempotent tasks with exponential delay, keeping manual restarts under 2 per week.
- Replayable audit events persisted to DynamoDB with idempotency keys so recovery preserved evidence chains.
During quarterly game-day exercises, this configuration recovered >=99% of stranded workflows within 15 minutes (p95 <= 15 minutes; p99 <= 30 minutes).
Replay Harness Excerpt
export async function replayExecution({
executionArn,
messageId,
}: {
executionArn: string;
messageId: string;
}) {
const history = await stepFunctions
.getExecutionHistory({ executionArn })
.promise();
const payload = extractPayload(history, messageId);
await stepFunctions
.startExecution({
stateMachineArn: process.env.REPLAY_STATE_MACHINE_ARN!,
name: `replay-${messageId}`,
input: JSON.stringify({
payload,
idempotencyKey: messageId,
}),
})
.promise();
}
The replay runbook pairs this harness with DynamoDB idempotency keys so every retry preserves control evidence.
Event-Driven Orchestration Blueprint
SQS Poison-Pill Compensating SLI
Poison-pill messages previously created cascading failures. We established a compensating SLI that tracked:
- DLQ ingress share <= 0.3% with redrive age p95 <= 5 minutes; alerts triggered on sustained breach across 10-minute (fast) and 1-hour (slow) windows.
- Time-to-neutralize, enforced to stay under 5 minutes through automated quarantining lambdas (rolling 6-week window).
- Duplicate delivery ratio, held under 0.2% with idempotent consumers (rolling 6-week window).
The SLI backed an on-call runbook that coupled automated quarantining with manual review gates for high-risk loan classes.
Multi-Signal ECS Auto-Scaling with Burn-Rate Gates
Auto-scaling moved beyond CPU thresholds. We combined:
- Request concurrency sourced from ALB target tracking.
- Message age pulled from SQS metrics.
- Error budget burn-rate computed from Stage-level SLOs.
Horizontal scaling executed only when multiwindow burn-rate stayed below 1.5 across fast (10-minute) and slow (1-hour) lookbacks with hysteresis, preventing thrash during transient partner outages while keeping recovery within the 30-minute objective.
Burn-Rate Alert Rule (excerpt)
name: underwriting-burn-rate-fast
metric: error_budget_burn_rate
period: 5m
evaluation_periods: 2
threshold: 1.5
comparison: greater
alarm_actions:
- sns: underwriting-oncall
ok_actions:
- sns: underwriting-oncall
datapoints_to_alarm: 2
The fast alarm pairs with a 1-hour lookback twin so on-call received balanced signal without scaling flap.
Platform Control Surfaces
Select any capability above to dig into the stack, metrics, and operating posture that kept regulators and underwriting aligned.
HTTP Revalidation Strategy
Partner underwriting APIs enforced strict rate caps. We introduced an HTTP revalidation strategy featuring:
- Conditional GETs with ETag validators and
Cache-Control: must-revalidateon decisional reads, keeping partner call volume within contractual maxima via 304 responses. - Explicit stale-while-revalidate allowances (
stale-while-revalidate=600) only on non-decisional status reads, labeled in ops tooling. - Circuit breakers that degraded gracefully to cached responses whenever upstream p95 latency exceeded the 1.2-second limit.
30-Day DORA View
We published a rolling 30-day DORA scorecard via Looker:
- Deployment frequency: >=3/day on weekdays (30-day view).
- Lead time for changes: median ~10-12 hours from merge to production.
- Change failure rate: <= 9.6% while meeting control evidence requirements.
- Mean time to recovery: ~27 minutes thanks to Step Functions roll-forward automation.
The scorecard satisfied both engineering retrospectives and the operational risk committee's reporting cadence.
Audit Outcome
The bank's internal audit team reviewed the migration artifacts, runbooks, and control evidence. Their closing report stated there were no material findings related to the migration, validating that the technical modernization met regulatory expectations without exception.
