Systems

City National Mortgage Underwriting Modernization

Role: Lead Software Development Engineer, Technical Lead
Timeline: Jul 2021 - Oct 2024

Re-architected City National Bank's underwriting platform with resilient event orchestration, multi-signal autoscaling, and disciplined observability, shrinking credit-decision latency while holding up to regulatory scrutiny.

Key Results

  • Stage-level: App->Initial UW median ~11->~5 business days (~-55%) on clean files; p90 19->9; p99 outliers down ~60% (Q2-Q3 2023)
  • Step Functions recovery: >=99% of stranded workflows within 15 min during quarterly game-days; p95 <= 15 min; p99 <= 30 min
  • Compensating SLI contained poison-pill impact: DLQ ingress share <= 0.3%, redrive success >99.8%, redrive age p95 <= 5 min (6-week window)

Tech Stack

AWS, Step Functions, ECS, SQS, Node.js, Python, Terraform, New Relic

Scope: Stage-level App->Initial Underwriting (UW) decision only (excludes appraisal, title, escrow).
Cohort (clean files): AUS-accept, documents complete at submission, no appraisal/title holds.
Windows: Unless noted, Q2-Q3 2023 pilot for stage metrics; 30-day view for DORA.

The Challenge

City National Bank's legacy underwriting suite serialized every stage of mortgage approval. SLA breaches were frequent because throughput declined during end-of-month surges, and stranded workflows required human intervention that lasted multiple shifts. Error handling was opaque, audit trails were fragmented, and the change calendar limited how quickly new controls could be shipped.

The Solution

I led an incremental migration that reframed underwriting as event-driven choreography while protecting production stability and regulator trust.

Modernization Program Timeline

1. Stabilize & Observe (Q3 2021): Benchmarked legacy bottlenecks

Instrumented the monolith and traced the highest-risk failure modes while aligning auditors on safety guardrails.

  • Built live dashboards covering 18 underwriting paths and SLA drift
  • Mapped manual runbooks into an automation backlog prioritized by regulatory impact

2. Modularize Workflows (Q1 2022): Piloted event-driven Step Functions

Split intake, bureau sync, appraisal, and compliance into isolated express state machines behind safe traffic mirroring.

  • Delivered a dual-run slice handling 30% of nightly loan volume in four weeks
  • Validated zoned failure policies against regulator-reviewed incident scenarios

3. Automate Recovery (Q4 2022): Productized rollback and replay patterns

Codified compensating SLIs, automated roll-forward handlers, and replayable audits before sunsetting the monolith.

  • Neutralized poison-pill loans automatically with sub-5-minute quarantine cycles
  • Cut stranded workflows from 60 per week to zero during production cutover

4. Scale & Govern (2023 - 2024): Scaled with multi-signal guardrails

Raised deployment frequency while maintaining regulator trust through DORA reporting and burn-rate-gated scaling.

  • Sustained a weekday deploy cadence of three-plus releases with change failure rate held at or below 9.6% for 12 months
  • Automated quarterly audit evidence packaging, with no material migration findings reported

Underwriting-Stage Metrics

  • Intake Validation p95: 140s, down 35%. Document and schema checks stabilized with asynchronous retries and staged rollouts (Q2-Q3 2023 clean-file view).
  • Credit Bureau Timeout Rate: 0.4%, down 60%. Multi-attempt bureau sync with circuit breakers trimmed daily timeout incidents (Q2-Q3 2023 clean-file view).
  • Collateral Pre-score SLA: under 45 min, down 50%. Internal collateral pre-screen kept pre-appraisal checks sub-hour on clean files (Q2-Q3 2023 window).
  • Compliance Rule Latency: 8s, down 70%. Deterministic rules engine shipped with queue back-pressure telemetry (rolling 6-week operational window).
  • Scorecard Adoption: Daily. Operations leadership consumed standardized underwriting scorecards each morning (daily instrumentation window).

Each underwriting stage now exposes a standardized telemetry envelope:

  • Intake validation: median processing time under 90 seconds, p95 under 140 seconds, success rate at 99.3%.
  • Credit bureau sync: automated retries capped at three loops with observed timeout frequency under 0.4% per day.
  • Collateral pre-score: internal pre-screen median SLA under 45 minutes driven by asynchronous document scoring.
  • Compliance checks: rule evaluation latency under 8 seconds with backlog depth visible in Grafana heatmaps.

A shared metrics schema landed in New Relic and fed a daily underwriting scorecard for operations leadership.

"Collateral pre-score" refers to the internal pre-screening workflow prior to any vendor-managed appraisal steps.

Step Functions Recovery Configuration

We decomposed the monolith into Step Functions express workflows partitioned by underwriting stage. Each state machine defined:

  1. Zonal failure policies that shifted workloads to warm standby clusters when burn-rate SLOs signaled risk.
  2. Automated roll-forward handlers that re-queued idempotent tasks with exponential delay, keeping manual restarts under 2 per week.
  3. Replayable audit events persisted to DynamoDB with idempotency keys so recovery preserved evidence chains.

During quarterly game-day exercises, this configuration recovered >=99% of stranded workflows within 15 minutes (p95 <= 15 minutes; p99 <= 30 minutes).
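
Stage Error Handling (sketch)

For orientation, the sketch below expresses that roll-forward and compensation posture for a single stage as an Amazon States Language fragment in TypeScript; the state names, Lambda placeholder, and backoff values are illustrative assumptions rather than the production definition.

// Representative error-handling fragment for one stage, written as an Amazon
// States Language object literal. State names, the function placeholder, and
// backoff values are assumptions for illustration.
const bureauSyncStates = {
  BureauSync: {
    Type: "Task",
    Resource: "arn:aws:states:::lambda:invoke",
    Parameters: {
      FunctionName: "REPLACE_WITH_BUREAU_SYNC_LAMBDA_ARN", // placeholder
      Payload: {
        "loanId.$": "$.loanId",
        "idempotencyKey.$": "$.idempotencyKey",
      },
    },
    // Roll-forward: the idempotent task is retried with exponential delay
    // before anyone is paged.
    Retry: [
      {
        ErrorEquals: ["States.TaskFailed", "States.Timeout"],
        IntervalSeconds: 5,
        MaxAttempts: 4,
        BackoffRate: 2.0,
      },
    ],
    // Compensation: exhausted retries route to a quarantine state instead of
    // stranding the workflow, preserving the error payload for audit replay.
    Catch: [
      {
        ErrorEquals: ["States.ALL"],
        ResultPath: "$.error",
        Next: "QuarantineLoan",
      },
    ],
    Next: "ComplianceCheck",
  },
};

export default bureauSyncStates;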

Replay Harness Excerpt

import { StepFunctions } from "aws-sdk";

const stepFunctions = new StepFunctions();

// extractPayload is a project helper (defined elsewhere in the runbook tooling)
// that locates the original task input for a given message ID in the history.
declare function extractPayload(
  history: StepFunctions.GetExecutionHistoryOutput,
  messageId: string
): unknown;

export async function replayExecution({
  executionArn,
  messageId,
}: {
  executionArn: string;
  messageId: string;
}) {
  // Pull the failed execution's event history to recover the original input.
  const history = await stepFunctions
    .getExecutionHistory({ executionArn })
    .promise();
  const payload = extractPayload(history, messageId);

  // Start a fresh execution on the replay state machine, carrying the source
  // message ID as the idempotency key so duplicate replays are detectable.
  await stepFunctions
    .startExecution({
      stateMachineArn: process.env.REPLAY_STATE_MACHINE_ARN!,
      name: `replay-${messageId}`,
      input: JSON.stringify({
        payload,
        idempotencyKey: messageId,
      }),
    })
    .promise();
}

The replay runbook pairs this harness with DynamoDB idempotency keys so every retry preserves control evidence.
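
Idempotency Guard (sketch)

For context, an idempotency guard of this kind typically reduces to a conditional write; the table and attribute names below are hypothetical, not the actual audit schema.

import { DynamoDB } from "aws-sdk";

const dynamo = new DynamoDB.DocumentClient();

// Record a replay attempt exactly once. Table and attribute names are
// illustrative assumptions, not the production evidence schema.
export async function recordReplayOnce(idempotencyKey: string, executionArn: string) {
  try {
    await dynamo
      .put({
        TableName: process.env.REPLAY_AUDIT_TABLE!, // hypothetical table name
        Item: {
          idempotencyKey,
          executionArn,
          replayedAt: new Date().toISOString(),
        },
        // Reject the write if this key was already replayed, preserving the
        // evidence chain with a single authoritative record per retry.
        ConditionExpression: "attribute_not_exists(idempotencyKey)",
      })
      .promise();
    return true;
  } catch (err: any) {
    if (err.code === "ConditionalCheckFailedException") return false; // duplicate suppressed
    throw err;
  }
}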

Event-Driven Orchestration Blueprint

SQS Poison-Pill Compensating SLI

Poison-pill messages previously created cascading failures. We established a compensating SLI that tracked:

  • DLQ ingress share <= 0.3% with redrive age p95 <= 5 minutes; alerts triggered on sustained breach across 10-minute (fast) and 1-hour (slow) windows.
  • Time-to-neutralize, enforced to stay under 5 minutes through automated quarantining lambdas (rolling 6-week window).
  • Duplicate delivery ratio, held under 0.2% with idempotent consumers (rolling 6-week window).

The SLI backed an on-call runbook that coupled automated quarantining with manual review gates for high-risk loan classes.
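
Quarantine Handler (sketch)

A simplified sketch of the automated quarantining described above: the handler drains a DLQ batch, forwards neutralizable messages to a quarantine queue tagged with the original message ID, and reports high-risk loan classes back to SQS so they wait for the manual review gate. The queue URL and the risk check are assumptions for illustration.

import { SQS } from "aws-sdk";
import type { SQSBatchResponse, SQSEvent } from "aws-lambda";

const sqs = new SQS();

// Stand-in for the manual-review gate on high-risk loan classes; the real
// classification logic is not shown here.
const isHighRiskLoanClass = (body: { loanClass?: string }) =>
  body.loanClass === "jumbo" || body.loanClass === "foreign-national";

export async function handler(event: SQSEvent): Promise<SQSBatchResponse> {
  const heldForReview: { itemIdentifier: string }[] = [];

  for (const record of event.Records) {
    const body = JSON.parse(record.body);

    // High-risk classes stay on the DLQ for the manual review gate
    // (reported as batch item failures so SQS retains them).
    if (isHighRiskLoanClass(body)) {
      heldForReview.push({ itemIdentifier: record.messageId });
      continue;
    }

    // Everything else is neutralized by moving it to the quarantine queue,
    // tagged with the original message ID as an idempotency hint.
    await sqs
      .sendMessage({
        QueueUrl: process.env.QUARANTINE_QUEUE_URL!, // hypothetical queue URL
        MessageBody: record.body,
        MessageAttributes: {
          sourceMessageId: { DataType: "String", StringValue: record.messageId },
        },
      })
      .promise();
  }

  return { batchItemFailures: heldForReview };
}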

Multi-Signal ECS Auto-Scaling with Burn-Rate Gates

Auto-scaling moved beyond CPU thresholds. We combined:

  1. Request concurrency sourced from ALB target tracking.
  2. Message age pulled from SQS metrics.
  3. Error budget burn-rate computed from Stage-level SLOs.

Horizontal scaling executed only when the multi-window burn rate stayed below 1.5 across fast (10-minute) and slow (1-hour) lookbacks with hysteresis, preventing thrash during transient partner outages while keeping recovery within the 30-minute objective.
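
Scaling Gate Logic (sketch)

That gating reduces to a decision function along these lines; the signal names, targets, and hysteresis factor are illustrative assumptions rather than the production policy.

// Signals sampled each evaluation cycle; names and thresholds are illustrative.
interface ScalingSignals {
  albConcurrencyPerTask: number;   // ALB target-tracking input
  oldestMessageAgeSeconds: number; // SQS ApproximateAgeOfOldestMessage
  burnRateFast: number;            // error-budget burn over the 10-minute window
  burnRateSlow: number;            // error-budget burn over the 1-hour window
}

const BURN_RATE_CEILING = 1.5;
const CONCURRENCY_TARGET = 40;
const MESSAGE_AGE_TARGET_SECONDS = 120;
const HYSTERESIS = 0.85; // scale in only once signals drop well below target

export function desiredScaleAction(s: ScalingSignals): "scale-out" | "scale-in" | "hold" {
  // Burn-rate gate: never change capacity while either window burns the error
  // budget faster than 1.5x, which usually signals a dependency problem.
  const burnRateHealthy =
    s.burnRateFast < BURN_RATE_CEILING && s.burnRateSlow < BURN_RATE_CEILING;

  const underPressure =
    s.albConcurrencyPerTask > CONCURRENCY_TARGET ||
    s.oldestMessageAgeSeconds > MESSAGE_AGE_TARGET_SECONDS;

  const clearlyIdle =
    s.albConcurrencyPerTask < CONCURRENCY_TARGET * HYSTERESIS &&
    s.oldestMessageAgeSeconds < MESSAGE_AGE_TARGET_SECONDS * HYSTERESIS;

  if (underPressure && burnRateHealthy) return "scale-out";
  if (clearlyIdle && burnRateHealthy) return "scale-in";
  return "hold"; // hysteresis band or burn-rate breach: avoid thrash
}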

Burn-Rate Alert Rule (excerpt)

name: underwriting-burn-rate-fast
metric: error_budget_burn_rate
period: 5m
evaluation_periods: 2
threshold: 1.5
comparison: greater
alarm_actions:
  - sns: underwriting-oncall
ok_actions:
  - sns: underwriting-oncall
datapoints_to_alarm: 2

The fast alarm pairs with a 1-hour lookback twin so on-call receives a balanced signal without scaling flap.

Platform Control Surfaces


HTTP Revalidation Strategy

Partner underwriting APIs enforced strict rate caps. We introduced an HTTP revalidation strategy featuring:

  • Conditional GETs with ETag validators and Cache-Control: must-revalidate on decisional reads, keeping partner call volume within contractual maxima via 304 responses.
  • Explicit stale-while-revalidate allowances (stale-while-revalidate=600) only on non-decisional status reads, labeled in ops tooling.
  • Circuit breakers that degraded gracefully to cached responses whenever upstream p95 latency exceeded the 1.2-second limit.
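
Conditional Revalidation (sketch)

A minimal sketch of the conditional-GET path, assuming a plain fetch-based client and an in-memory cache; nothing here is the partner SDK, and the caching shape is an assumption for illustration.

// Minimal conditional-GET client. The endpoint and cache shape are illustrative.
interface CachedResponse {
  etag: string;
  body: unknown;
  fetchedAt: number;
}

const cache = new Map<string, CachedResponse>();

export async function revalidatedGet(url: string): Promise<unknown> {
  const cached = cache.get(url);

  // Send the stored validator so an unchanged resource comes back as 304.
  const response = await fetch(url, {
    headers: cached ? { "If-None-Match": cached.etag } : {},
  });

  // 304 Not Modified: reuse the cached representation instead of
  // re-transferring the payload.
  if (response.status === 304 && cached) {
    return cached.body;
  }

  const body = await response.json();
  const etag = response.headers.get("ETag");
  if (etag) {
    cache.set(url, { etag, body, fetchedAt: Date.now() });
  }
  return body;
}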

30-Day DORA View

We published a rolling 30-day DORA scorecard via Looker:

  • Deployment frequency: >=3/day on weekdays (30-day view).
  • Lead time for changes: median ~10-12 hours from merge to production.
  • Change failure rate: <= 9.6% while meeting control evidence requirements.
  • Mean time to recovery: ~27 minutes thanks to Step Functions roll-forward automation.

The scorecard satisfied both engineering retrospectives and the operational risk committee's reporting cadence.

Audit Outcome

The bank's internal audit team reviewed the migration artifacts, runbooks, and control evidence. Their closing report stated there were no material findings related to the migration, validating that the technical modernization met regulatory expectations without exception.

What I'd Do Next

  • Extend underwriting observability to machine learning explainability artifacts
  • Automate regulatory evidence packaging for quarterly audits
  • Pilot proactive credit risk simulations using synthetic workloads