Iris Coleman
Mar 23, 2026 14:21
Paxos reveals how it partitioned a 21TB crypto ledger table with zero downtime, achieving a 371ms cutover while crypto markets ran 24/7.
Paxos, the stablecoin infrastructure company behind PYUSD and USDP, has published technical details on how its engineering team partitioned a 21-terabyte ledger database table without taking systems offline—a feat that required just 371 milliseconds for the final cutover while crypto markets continued trading around the clock.
The company’s ledger table had grown to roughly two-thirds of Aurora’s per-table size limit, giving engineers about a year before writes would start failing. For a firm processing stablecoin transactions that can’t tolerate data loss or delays, traditional migration approaches involving extended maintenance windows weren’t viable.
The Technical Approach
Rather than copying billions of rows into a new structure—the standard partitioning playbook—Paxos built the partitioned architecture around the existing table. The original table became a “history” partition while new time-range partitions caught incoming data. To external systems, nothing changed; Postgres handled routing internally.
The catch? Postgres needs to verify every row satisfies partition constraints before attaching a table, which means a full table scan. On a 21TB table, that’s not quick.
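Paxos split this into two phases. First, they added a CHECK constraint matching the partition bounds as NOT VALID, a fast operation that skips verification. Then they ran VALIDATE CONSTRAINT separately, which allows reads and writes to continue during the scan. With a validated constraint already in place, Postgres can skip its own scan when the table is attached.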
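As a rough illustration of that sequence, the sketch below builds the statements in order. It is not Paxos's actual migration script: the table names (ledger, ledger_history), the partition key (created_at), and the cutoff date are hypothetical, and the partitioned parent table is assumed to exist already.

```python
from datetime import date


def attach_history_plan(parent: str, history: str, key: str, cutoff: date) -> list[str]:
    """Return, in order, the SQL statements for attaching an existing table
    as the "history" partition without a long blocking scan at attach time.

    All identifiers passed in are illustrative assumptions, not names from
    the Paxos write-up.
    """
    constraint = f"{history}_{key}_bound"
    bound = cutoff.isoformat()
    return [
        # 1. Fast: record the constraint without checking existing rows.
        f"ALTER TABLE {history} ADD CONSTRAINT {constraint} "
        f"CHECK ({key} IS NOT NULL AND {key} < '{bound}') NOT VALID;",
        # 2. Slow: full table scan, but reads and writes keep flowing.
        f"ALTER TABLE {history} VALIDATE CONSTRAINT {constraint};",
        # 3. Fast: the validated CHECK matches the partition bounds,
        #    so Postgres can skip its own verification scan.
        f"ALTER TABLE {parent} ATTACH PARTITION {history} "
        f"FOR VALUES FROM (MINVALUE) TO ('{bound}');",
        # 4. Cleanup: the CHECK is now redundant with the partition bound.
        f"ALTER TABLE {history} DROP CONSTRAINT {constraint};",
    ]


if __name__ == "__main__":
    for statement in attach_history_plan(
        "ledger", "ledger_history", "created_at", date(2026, 1, 1)
    ):
        print(statement)
```

Only the second statement is expensive; the rest are metadata changes, which is why the expensive scan can be separated from the final cutover.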
Nine Hours of Pain
The validation scan took just over nine hours. During that time, the lock held by the validation prevented autovacuum from cleaning up dead tuples, causing tail latency to climb steadily. The first attempt failed when write spikes exceeded timeout thresholds.
On the second attempt, Paxos coordinated with market makers to pause trading activity temporarily and relaxed timeout thresholds. P50 latency stayed relatively flat, but P95/P99 degraded significantly as dead tuples accumulated.
“In hindsight, this was the most operationally demanding part of the migration—not the cutover, but the scan that made the cutover possible,” the engineering team wrote.
Why This Matters for Crypto Infrastructure
The migration highlights a growing challenge for crypto infrastructure providers. Unlike traditional finance with scheduled maintenance windows, crypto markets run continuously. Database tables backing ledger systems can’t simply go offline for rebuilding.
Paxos also tackled a subtle partitioning gotcha: uniqueness constraints across partitions. Postgres can enforce a unique constraint across an entire partitioned table only if the constraint includes the partition key. Uniqueness on any other column, such as an idempotency key, can be enforced only within each partition, so two inserts with identical idempotency keys could land in different partitions and both succeed. For a financial ledger, that could mean applying the same transaction twice.
The solution involved idempotency checks outside the balance-update lock, adding less than 5 milliseconds of latency in internal testing.
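The uniqueness gotcha can be seen in a toy model. The sketch below is purely illustrative and is not Paxos's implementation: the class, its monthly partitioning scheme, and the insert_with_global_check method are all hypothetical, and the model ignores concurrency entirely.

```python
from datetime import date


class PartitionedLedger:
    """Toy model in which each partition enforces uniqueness of the
    idempotency key only among its own rows, the way a unique index
    defined per partition would."""

    def __init__(self) -> None:
        # Partition name -> idempotency keys stored in that partition.
        self.partitions: dict[str, set[str]] = {}

    @staticmethod
    def _partition_for(day: date) -> str:
        # Hypothetical scheme: one partition per calendar month.
        return f"{day.year}-{day.month:02d}"

    def insert(self, idempotency_key: str, day: date) -> bool:
        """Insert with only the partition-local uniqueness check."""
        keys = self.partitions.setdefault(self._partition_for(day), set())
        if idempotency_key in keys:
            return False  # rejected by the partition-local unique index
        keys.add(idempotency_key)
        return True

    def insert_with_global_check(self, idempotency_key: str, day: date) -> bool:
        """Hypothetical application-level check across all partitions,
        performed before the insert."""
        if any(idempotency_key in keys for keys in self.partitions.values()):
            return False
        return self.insert(idempotency_key, day)


if __name__ == "__main__":
    ledger = PartitionedLedger()
    print(ledger.insert("tx-1", date(2026, 1, 31)))  # first write
    print(ledger.insert("tx-1", date(2026, 2, 1)))   # retry lands in another partition
```

In the toy model the retry on 1 February is accepted because it routes to a different partition than the original write; routing the same retry through insert_with_global_check rejects it.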
Testing at Production Scale
Staging environments couldn’t replicate the problem. Paxos used Aurora production cloning to create a full-sized test database, then built a “reverse history” SQL generator that replayed real transaction patterns backward to avoid triggering overdraft failures.
The approach reflects a broader industry movement toward zero-downtime database migrations. Similar techniques have been documented for tables from 1.5TB up to many terabytes, often using views to hide the transition from applications.
For Paxos, the partitioned structure now enables one-liner archiving through DETACH PARTITION and removes the looming size ceiling as a constraint. The company says it’s beginning a series of posts on engineering challenges behind its infrastructure—a signal that stablecoin operators are increasingly competing on technical credibility alongside regulatory compliance.
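For reference, the archiving one-liner mentioned above would look roughly like the statement built below. The table and partition names are hypothetical, and the CONCURRENTLY option, which is available in PostgreSQL 14 and later, is shown only as an optional flag; the source does not say whether Paxos uses it.

```python
def detach_partition_sql(parent: str, partition: str, concurrently: bool = False) -> str:
    """Build the statement that detaches one partition from its parent.

    The detached partition remains as a standalone table that can then be
    archived or dropped. Names passed in are illustrative assumptions.
    """
    suffix = " CONCURRENTLY" if concurrently else ""
    return f"ALTER TABLE {parent} DETACH PARTITION {partition}{suffix};"


if __name__ == "__main__":
    print(detach_partition_sql("ledger", "ledger_2026_01"))
```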
