From KVM Postgres to RDS and back: a migration that should not have worked
The database had been running on three KVM domains for nine years. It was provisioned by FAI off a debian-installer preseed that nobody on the current team had written, configured by a Puppet module last meaningfully edited in 2019, and patched only when the SRE on call had the energy. PostgreSQL 9.6, on Debian 8 (jessie), past the end of LTS, past the end of ELTS, past the end of any reasonable explanation. The boxes were fine. They were always fine. They had been fine for so long that nobody touched them on principle.
Then somebody from finance asked why we had three idle Xeons in a rack in Frankfurt, and we got eight months to move it to RDS.
I am writing this in March 2025 about a migration that started in early 2023 and ended, eventually, with the same database back on three new KVM domains in late 2024. We did the round trip. Both directions hurt. This is the long version of what broke.
The cast
What we were moving. About 1.4 TiB of data across two logical
databases, around 6,500 tables. PostgreSQL 9.6.24 on Debian
8.11, PGDG repo rather than stock jessie. One primary, two
streaming replicas, all on libvirt/KVM with raw LVM volumes on
local SSD. pgbouncer in transaction mode in front of the primary,
roughly 2,400 client connections collapsed to about 80 backends. A
Puppet module that managed postgresql.conf, pg_hba.conf, and
the boot-time symlinks that pointed /var/lib/postgresql/9.6/main
at the LVM mount. The same Puppet module also installed a
tablespace on a second LVM volume called fast_ssd for an
analytics schema. Custom code: 11 functions in plperlu
(untrusted), 3 functions in plpython2u, and a single ill-advised
function in pltclu that printed a date in Cyrillic. Three FDW
links: two to other Postgres clusters using postgres_fdw, and one
to a MySQL server via mysql_fdw from the
EnterpriseDB packages. A scheduled job runner that was just
a cron entry on the primary running psql -c 'SELECT
vacuum_analyze_partitions()' every twenty minutes. And an
application written by people who had since left the company and
who used LISTEN/NOTIFY for a job-queue pattern. A small thing.
It mattered later.
What we were moving to. RDS for PostgreSQL, Multi-AZ (one standby),
db.r6g.4xlarge, 2 TiB of gp3 storage, PG 13 to land on and an
eventual plan to upgrade to PG 15 once we were there. RDS Proxy in
front for connection management, replacing pgbouncer.
Why we moved
The reason in the slide deck was consolidation onto cloud-native primitives.
The reason in the room was that nobody under thirty in the company knew
what apt pinning was anymore and we needed to stop pretending the
jessie box was a strategy. Fair. The boxes were also overdue for
hardware replacement and capex was harder to defend than opex that year.
I do not have a strong opinion about that decision. I have very strong opinions about the next twelve weeks.
TO RDS, the things that broke
1. The dump that wouldn’t restore (tablespaces)
First plan was the obvious one: pg_dump --schema-only on production,
restore to RDS, set up DMS for full load + CDC, cut over.
$ psql -h tentomon-prod.xxx.eu-central-1.rds.amazonaws.com -U postgres -f schema.sql
psql:schema.sql:18241: ERROR: permission denied to create tablespace "fast_ssd"
HINT: Must be superuser to create a tablespace.
RDS does not give you superuser. There is no postgres superuser; the
master user gets rds_superuser, and rds_superuser is not a real
superuser, it’s a role with a curated set of grants. You cannot
CREATE TABLESPACE against an arbitrary path because the only path
RDS will accept is under /rdsdbdata/db/base/tablespace, and even
then the RDS docs say outright:
RDS for PostgreSQL supports tablespaces for compatibility purposes, but due to all storage being on a single logical volume, they cannot be used for I/O splitting or isolation.
In other words: you can create a tablespace, but it does literally
nothing. The dump’s TABLESPACE fast_ssd clauses on every CREATE
TABLE in the analytics schema either fail (if you don’t pre-create the
matching tablespace name) or succeed and silently lie (if you do).
pg_dump --no-tablespaces would have spared me, and is what I used in
the end. The fix is one flag. The annoying part was the discovery: I
spent a Friday afternoon convinced our schema dump was somehow
corrupted, looking at it line by line, before I noticed how many
tables had the TABLESPACE clause attached. Puppet had been
provisioning that tablespace at bootstrap for years. None of
us thought about it because none of us had touched it.
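The query I should have run before trusting any schema dump: an inventory of objects pinned to a non-default tablespace. A minimal sketch against the standard catalogs:

-- Tables and indexes assigned to a non-default tablespace.
-- reltablespace = 0 means the database default, so the inner join skips it.
SELECT c.relname, c.relkind, t.spcname
FROM pg_class c
JOIN pg_tablespace t ON t.oid = c.reltablespace
WHERE t.spcname NOT IN ('pg_default', 'pg_global')
ORDER BY t.spcname, c.relname;

Anything this returns will carry a TABLESPACE clause in the dump unless you pass --no-tablespaces.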
2. plperlu and plpython2u: the rewrite trail
Next discovery, courtesy of the same dump. RDS has plperl (trusted)
and plpython3u (depending on engine version). It does not have
plperlu, the untrusted variant, which is precisely where every
function we’d written had landed because the original author needed
require LWP::Simple for a one-line HTTP call from inside the
database. (Yes, I know.) From the Aurora extensions table
explicitly: “some extensions are no longer supported, such as
adminpack, plperlu, pltclu, pageinspect, and xml2.” The same
restriction holds for community RDS PostgreSQL on the relevant
versions.
So the eleven plperlu functions had to be rewritten. plpython2u
was worse: the Python 2 runtime is gone entirely, so moving to
plpython3u on the destination meant porting from Python 2 to Python 3, in 2024,
years after Python 2’s funeral. One of the functions was a homemade
parser for a CSV format produced by a printer in Belgium that nobody
wanted to think about. It worked. It had worked for eight years.
It used print as a statement.
The single pltclu function we deleted. It was never called.
Rewrites in priority order (a sketch of the shape follows the list):
- Functions called from triggers got plpgsql rewrites. Most were simple enough.
- Functions called from cron-style jobs got moved out of the database entirely, into a small Python service running in the same VPC, talking to the new RDS via psycopg. Better placement anyway.
- The Belgian-printer parser got a plpython3u rewrite, tested against a 50,000-line corpus pulled from pg_largeobject, and merged with three reviewers because none of us trusted it.
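What a trigger rewrite looked like, with hypothetical names standing in for the real ones (which were longer and uglier); most of the plperlu bodies reduced to string munging that plpgsql does natively:

-- Before: a plperlu function trimming and lowercasing a label.
-- After: the same logic, no untrusted language required.
CREATE OR REPLACE FUNCTION normalize_label() RETURNS trigger AS $$
BEGIN
  NEW.label := lower(btrim(NEW.label));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER normalize_label_trg
  BEFORE INSERT OR UPDATE ON items
  FOR EACH ROW EXECUTE PROCEDURE normalize_label();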
This consumed seven weeks. It was not the most technically interesting seven weeks of my career.
3. glibc collation, the silent corrupter
The boxes were on Debian 8, glibc 2.19. RDS PostgreSQL runs on a managed Amazon Linux base with a much newer glibc, well past the 2.28 boundary where glibc rewrote its locale collation tables to match ISO 14651:2016 and Unicode 9. This is the now-famous glibc 2.28 collation break that has bitten every serious Postgres operator who’s done a major OS upgrade since 2018.
Symptoms are bad in a specific way. Indexes don’t appear corrupt.
pg_amcheck may not flag them. Queries return wrong results. A
SELECT … WHERE name = 'Müller' finds the row on the source and
misses it on the destination, because the index was built under one
sort order and is being walked under another. The tell is sometimes:
WARNING: index "users_name_idx" contains corrupted page at block 0
DETAIL: Failed to find parent tuple for heap-only tuple at (12, 4)
…but more often, no tell at all. The index just lies.
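The cheapest smoke test, once you know to be suspicious: run the same predicate through the index and through a sequential scan and compare. Table and column here are hypothetical stand-ins:

SET enable_seqscan = off;
SELECT count(*) FROM users WHERE name = 'Müller';  -- index-driven answer
SET enable_seqscan = on;
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SELECT count(*) FROM users WHERE name = 'Müller';  -- ground truth, full scan
-- If the counts differ, the index was built under one collation order
-- and is being walked under another.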
If you build the dump under jessie’s collation and restore it to a
host with newer collation, every index on a text column with a
non-C collation is suspect. The workarounds are all painful. You
can REINDEX everything post-restore. We did. On 1.4 TiB the
initial run took 11 hours wallclock with REINDEX CONCURRENTLY,
with parallelism limited by what RDS would tolerate without the
Performance Insights graph turning red. You can build with
lc_collate=C on the destination, which buys you byte-order sort
and nothing else; often you cannot, because some app somewhere
relies on locale-aware sort. Or you can switch to ICU collations,
which have been available since PG 10 and are versioned, so
Postgres can warn you when the version changes underneath your
data. We could not use ICU on the source because 9.6 had no usable
ICU support, but we did use it on the destination to inoculate
against the next migration. (Foreshadowing.)
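To scope the REINDEX run we generated the suspect list from the catalogs. This sketch deliberately overapproximates, flagging any index that touches a collatable column (including C-collated ones), because we wanted the paranoid set:

-- Every index with at least one column of a collatable type.
SELECT DISTINCT i.indexrelid::regclass AS suspect_index
FROM pg_index i
JOIN pg_attribute a
  ON a.attrelid = i.indrelid AND a.attnum = ANY (i.indkey::int2[])
JOIN pg_type t ON t.oid = a.atttypid
WHERE t.typcollation <> 0
ORDER BY 1;

Feed the output to REINDEX CONCURRENTLY at whatever parallelism the instance tolerates.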
I learned later that PG 15 tracks the version of the database
default collation and emits WARNING: database "x" has a collation
version mismatch in the logs. We weren't going to PG 15 yet. We
were landing on PG 13 first, the most conservative jump that DMS
and pglogical would replicate into cleanly with this much custom
code on the source. So no warning. Just wrong answers.
4. The replication slot that filled the disk
Plan was to use native logical replication from on-prem to RDS for the cutover, with DMS as the fallback. Native logical replication needs PG 10+ on both ends. We were on 9.6 on the source. Fine, says me, we can use pglogical which has a 9.4-compatible decoder and is supported on RDS as an extension.
The setup looks reasonable:
-- on RDS, in the DB parameter group
rds.logical_replication = 1
max_replication_slots = 20
max_wal_senders = 20
-- on source (jessie box)
wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
(Both static. Both require a restart. Rebooting the RDS instance to apply the parameter group change took about 90 seconds; the restart of the jessie primary took about 90 seconds plus thirty minutes of failover choreography, because the Puppet module hadn't been told in 2019 what to do with a logical-replication-shaped postgres.)
The fun part came two days into the initial sync. A pglogical worker
on the destination side hit a malformed row (a bytea column whose
length on source was reported one way and on destination was decoded
another way, an interaction with bytea_output = escape that I won’t
re-litigate here) and wedged. Default behavior: the slot stays open,
the apply worker keeps trying, WAL on the source keeps accumulating
because the slot is alive and unconsumed.
Twelve hours later the on-prem primary’s pg_xlog/ was at 340 GiB.
max_slot_wal_keep_size did not exist in 9.6 (that’s a PG 13
parameter) so there was no soft cap. The disk filled at 04:11 local.
The primary stopped accepting writes. We failed over to the
synchronous replica (itself nearly out of WAL space, because
synchronous, but the failover scripts worked; shoutout to whoever
set those up). The application took ~9 minutes of
errors before fully reconnecting through pgbouncer.
The lesson is simple and the docs say it plainly: an inactive
logical-replication slot retains WAL forever. RDS says it,
the AWS DMS pre-flight assessment checks
max_slot_wal_keep_size, the pglogical README says it.
We knew. We had a runbook entry. The runbook entry was for the
destination, not the source. The destination was the safe one. RDS
will not let WAL eat its own storage past a certain point on PG 13+.
Our source was 9.6 and had no such governor.
After that we wrote a 5-minute cron on the source that paged if any
slot's retained WAL exceeded 50 GiB; on 9.6 that means
pg_xlog_location_diff(pg_current_xlog_location(), confirmed_flush_lsn),
since the pg_wal_lsn_diff spelling only arrived with the
pg_xlog-to-pg_wal rename in PG 10. It paged twice more during the
migration. Both times the apply worker had wedged on something we
hadn't anticipated.
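The check itself, in the 9.6 spelling (substitute pg_current_wal_lsn and pg_wal_lsn_diff on PG 10+):

-- Page when any logical slot is holding back more than 50 GiB of WAL.
-- confirmed_flush_lsn is NULL for physical slots, which this skips.
SELECT slot_name, active,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(),
                                            confirmed_flush_lsn)) AS retained
FROM pg_replication_slots
WHERE pg_xlog_location_diff(pg_current_xlog_location(),
                            confirmed_flush_lsn) > 50 * 1024::numeric ^ 3;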
5. REPLICA IDENTITY (or: why my deletes evaporated)
About week six of CDC, with backfill done and replica caught up to
within a second, somebody on the application team noticed that a
specific cleanup job ran on the source nightly, deleted ~50,000 rows
from a session_log table, and on the RDS side those rows were
still there. Insert lag: zero. Update lag: zero. Delete lag:
infinite, because the deletes simply weren’t being applied.
Logical replication needs a way to identify the row being deleted on the apply side. It looks at the source’s REPLICA IDENTITY setting:
ALTER TABLE session_log REPLICA IDENTITY FULL; -- or DEFAULT, or USING INDEX
DEFAULT means “use the primary key”. The session_log table had no
primary key. Nine years on, it had grown a created_at-based
candidate but never had it promoted. With no PK, pglogical/native
logical replication can decode the DELETE from the WAL but cannot
emit it on the wire because there’s nothing to put in the WHERE
clause on apply. The apply worker either drops it silently or errors,
depending on version and config. In our case, dropped. No log entry.
Nothing.
The DMS pre-flight assessment has a check named “REPLICA IDENTITY FULL” that flags this exact case, described as “Detecting tables using REPLICA IDENTITY FULL and either changing the REPLICA IDENTITY setting or switching to a test_decoding plugin.” Useful for DMS users; we weren't on DMS for this segment, but the same trap applied. We ran:
SELECT n.nspname, c.relname, c.relreplident
FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog','information_schema','pglogical')
  AND (c.relreplident = 'n'
       OR (c.relreplident = 'd' AND NOT EXISTS (
             SELECT 1 FROM pg_index i
             WHERE i.indrelid = c.oid AND i.indisprimary)));
…and then for any row where there was no PK or unique index, set
REPLICA IDENTITY FULL. Which makes every UPDATE/DELETE log the full
old row, which roughly doubles the WAL volume for those tables.
On session_log, which we had been deleting cheerfully at 50k/night
without thinking about it, this turned the next day’s WAL into ~14
GiB of mostly-DELETE traffic, which the apply worker then chewed
through at maybe 4 MiB/s, and we were behind again for two days.
We added a BIGSERIAL PRIMARY KEY column to seven tables, took the
brief locks during a low-traffic window, and went back to DEFAULT
identity. The proper fix had been sitting in the backlog under the
title “add PKs to legacy tables” since 2017. It got merged on a
Tuesday afternoon under the title “fix CDC”.
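The eventual fix, per table. The column name is illustrative, and note that on 9.6 adding a serial column forces a full table rewrite, so “brief locks” is only true in a quiet window:

-- Give the table a real identity, then return to PK-based replica identity.
ALTER TABLE session_log ADD COLUMN id BIGSERIAL PRIMARY KEY;
ALTER TABLE session_log REPLICA IDENTITY DEFAULT;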
6. DMS, briefly, and why we left it
We did try DMS for the analytics schema, separately, because
pglogical didn’t love the size of one particular partitioned table
(2,400 partitions, ~80 GiB). DMS comes with its own family of
problems, all documented if you know where to look. JSONB goes to
CLOB by default. Source has a jsonb column, target also has a
jsonb column, and DMS will happily put a stringified
representation through a CLOB pipe and either truncate or rewrite
whitespace, depending on the LOB mode. Limited LOB mode
caps individual LOBs at 100 MB and pre-allocates memory for them;
Full LOB mode handles arbitrary sizes but is dramatically slower.
We had values up to 8 MB. Limited LOB at 16 MB worked, with a
careful eye on the memory footprint of the replication instance.
Sequences are not migrated. DMS does not transfer sequence current
values; the cutover script has to setval() every sequence on the
target to MAX(id) + buffer before redirecting writes. Forget one
and you hit a primary key collision the moment the app inserts.
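The cutover script boiled down to one of these per sequence, generated from the catalogs; table, column, and buffer size are illustrative:

-- Push the target's sequence safely past anything the source handed over.
SELECT setval('session_log_id_seq',
              COALESCE((SELECT max(id) FROM session_log), 0) + 1000);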
Materialized views must be re-created manually on the target, per the assessment report. We had four. We forgot one. It surfaced three weeks after cutover when a dashboard went blank.
DDL is not replicated unless you turn on a specific event trigger.
We froze schema changes for the duration and added a Slack bot
that yelled when anybody tried to ALTER TABLE on prod.
The combined behavior with our bytea and jsonb columns made me
nervous enough to keep DMS only for that one partitioned analytics
table where the diffs were simpler, and run pglogical for the rest.
Two replication paths, two sets of monitoring, two sets of failure
modes. Worth it for stability of the OLTP path.
7. Cutover night, and pgbouncer to RDS Proxy
Cutover was scheduled for a Saturday at 02:00 UTC. The plan:
00:30 freeze deploys, last sanity check
01:30 switch app pgbouncer to read-only mode
01:45 drain writes, verify pglogical lag < 5s on all subscribers
02:00 flip DNS to point at RDS Proxy endpoint
02:05 sequence resync (setval everything)
02:15 unfreeze, monitor for 60 minutes
03:30 declare victory or roll back
The DNS flip went fine. The application reconnected fine. Latency dashboards looked fine. Then about ten minutes in, we started seeing weird behavior from a Java service: occasional 30-second hangs on queries that should have taken milliseconds, then resumed. Not errors. Hangs.
The Java service was using Hibernate, which by default likes to use
named server-side prepared statements. Hibernate had been talking to
pgbouncer in transaction-pooling mode for years, and pgbouncer in
transaction mode famously breaks server-side prepared statements.
The team had worked around that with prepareThreshold=0 in the JDBC
URL. Fine.
RDS Proxy has its own behavior. Per the
RDS Proxy pinning docs, for PostgreSQL the proxy
will pin a client connection to a backend (effectively turning
off multiplexing for that client) when it sees, among other things:
- SET commands other than transaction-scoped ones,
- prepared statement creation and management,
- temporary tables, sequences, or views,
- declared cursors,
- LISTEN on a notification channel,
- library loads that alter session state.
When a connection gets pinned, it stays bound to one backend for the
rest of the session. Under load, you get a sudden cliff: the proxy
runs out of unpinned connections to multiplex over, and new clients
queue waiting for one. The 30-second hangs were MaxConnectionsPercent
queueing, courtesy of an internal SET work_mem that Hibernate or
some library it pulled in was issuing on every connection acquire to
match what we’d had on pgbouncer.
We rolled back to running pgbouncer in front of RDS Proxy
(yes, pgbouncer in front of RDS Proxy in front of RDS, three
connection layers, deeply unaesthetic, worked) for a week while the
app team excised the per-session SETs and moved them into the
RDS-side default parameter group. After that we removed the pgbouncer
hop. CloudWatch metric to monitor:
DatabaseConnectionsCurrentlySessionPinned. If that stays above zero
for any sustained period you have something issuing a pin trigger.
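Most of the excision amounted to moving session defaults somewhere that doesn't pin; the role name and value here are illustrative, and the same default can live in the RDS parameter group, which is where ours ended up:

-- A default on the role replaces a SET on every connection acquire.
ALTER ROLE app_rw SET work_mem = '64MB';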
8. The LISTEN/NOTIFY queue, ghost edition
This one’s small but I keep telling it because I find it funny in retrospect.
A worker process in the background-jobs service used LISTEN
job_ready on a long-lived connection to receive NOTIFYs from a
trigger on the jobs table. Cute pattern, fine for low scale, this
was low scale.
Post-cutover the worker silently stopped processing jobs. Connection
was up, the LISTEN was registered (we checked pg_listening_channels()),
the trigger was firing on inserts, the NOTIFY was being issued. The
worker just never got a notification.
What it had actually done was: open a connection through RDS Proxy,
issue LISTEN job_ready, get its connection pinned (LISTEN
pins a connection), and sit there. Inside Postgres, delivery was
fine: NOTIFY goes through the shared pg_notification_queue and
reaches every backend with a matching LISTEN, whichever backend the
writer pool happened to grab for the triggering transaction. The
backend the worker was pinned to was receiving the notifications.
What never happened was the last hop: the asynchronous notification
messages were not being relayed through the proxy to the client in
real time, the way they had been over pgbouncer's stable
session-mode connection on the old setup.
Resolution: move the worker to connect directly to the writer endpoint, not through the proxy. RDS Proxy is fundamentally not built for long-lived single-session listeners and the docs say so if you read closely. The fix was four lines of config. Diagnosing it took an afternoon of perplexed staring.
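For the record, the perplexed staring consisted mostly of these two checks, both of which came back healthy, which is exactly why it took an afternoon:

-- On the worker's connection: is the LISTEN registered?
SELECT * FROM pg_listening_channels();
-- From any other session: fire one by hand and watch the worker.
NOTIFY job_ready, 'ping';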
Eight months on RDS
Things were fine. Latency from our app fleet (in our colo, not in
AWS) to RDS in eu-central-1 was around 11ms p50 over Direct Connect,
which was about three times what we’d had on the LAN side, and the
app teams had to retune one or two N+1-prone services, but nothing
broke that wasn’t fixable. Performance Insights was genuinely
useful. The on-call burden dropped because none of us were patching
jessie kernels anymore. Storage autoscaling triggered twice when an
analytics intern ran a SELECT INTO against a 600 GiB table; the
autoscaling docs mention a six-hour cooldown between
scale events which we hit on the second one and ate a brief storage
warning before it cleared. Fine.
What was not fine was the bill.
The headline was the instance. r6g.4xlarge Multi-AZ at on-demand, plus 2 TiB of gp3 with provisioned IOPS, plus the RDS Proxy, plus the read replica we ended up adding, plus the DMS instance we kept around for the analytics schema, plus Direct Connect, plus data transfer to the app fleet, plus snapshots, plus Performance Insights long retention, plus CloudWatch logs at vended-logs pricing. Total monthly came in at roughly 6.4x the fully-loaded TCO of the three KVM boxes including hardware amortization, power, hands, and the SRE fraction. Reserved instances would have brought it closer to 4x, but nobody wanted to commit a 3-year RI on an architecture we weren’t sure was the long-term answer.
That, plus a separate compliance question about data residency from one of our larger customers that turned into a legal review that turned into a data-locality requirement that RDS in Frankfurt technically satisfied but politically did not, ended with a steering-committee decision in April 2024: move it back. New hardware this time, proper Debian 12 (bookworm), Patroni from day one, ZFS snapshots, proper monitoring. The forced round trip had, in the end, done the unmaintained stack a favor.
BACK from RDS, the second set of disasters
If migrating into RDS is hard, migrating out is harder. RDS gives you exactly two ways to get your data out continuously:
- AWS DMS, which we had already learned to fear.
- Native logical replication, where RDS plays the publisher and the on-prem cluster plays the subscriber.
What it does not give you is physical replication out. You cannot
pg_basebackup an RDS instance from outside. You cannot pg_receivewal
its WAL stream. You can issue pg_dump, which we tried first for
sanity, and which on a 1.4 TiB database took eleven hours and left us
needing seventeen hours of CDC catch-up before cutover. That doesn’t
work for low-downtime cutover. So: native logical replication, again,
in reverse.
1. The publisher setup, in reverse
Now RDS is the source. To make RDS publish, you need (we’d already done this, ironically, for the DMS direction):
rds.logical_replication = 1
…in the DB parameter group, plus a publication and a user with
the rds_replication role (per the docs):
GRANT rds_replication TO repl_out;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO repl_out;
CREATE PUBLICATION pub_all FOR ALL TABLES;
Then on the destination (the new on-prem PG 15 box):
CREATE SUBSCRIPTION sub_all
CONNECTION 'host=tentomon-prod.xxx... port=5432 user=repl_out password=... sslmode=require'
PUBLICATION pub_all
WITH (copy_data = true, create_slot = true);
This worked. It was painfully slow. The initial COPY of 1.4 TiB over
Direct Connect at our committed bandwidth came out to about 38 hours
when I’d planned for 20. The bottleneck wasn’t network; it was apply
single-threadedness on the subscriber. PG 16 added parallel apply
for large in-progress transactions behind the streaming = parallel
subscription option; PG 15 can only spill or serialize them, and
neither mode helps the initial COPY anyway. We were stuck with one
apply worker per subscription (plus a handful of table-sync
workers, capped by max_sync_workers_per_subscription) doing the
bulk import.
Workaround: split the publication. Three publications, three subscriptions, three apply workers running in parallel, each responsible for a non-overlapping subset of tables. Cut the COPY time to ~16 hours. Operationally messier; you have to coordinate which tables go where, and foreign keys that cross publication boundaries can see ordering issues during the initial copy. It worked.
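The split, sketched with hypothetical table groupings (the real partitioning was by size, so each subscription's COPY finished in roughly the same wall time):

-- Three disjoint publications on the RDS side...
CREATE PUBLICATION pub_hot  FOR TABLE users, sessions, jobs;
CREATE PUBLICATION pub_bulk FOR TABLE session_log, audit_log;
CREATE PUBLICATION pub_rest FOR TABLE invoices, reports;
-- ...and one subscription per publication on the new primary,
-- giving three apply workers instead of one.
CREATE SUBSCRIPTION sub_hot
  CONNECTION 'host=... dbname=... user=repl_out sslmode=require'
  PUBLICATION pub_hot
  WITH (copy_data = true);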
2. The aws_s3 calls in stored procs
Eight months on RDS had let one team get clever. They’d written a nightly job that exported a reporting view to S3 using the aws_s3 extension:
SELECT aws_s3.query_export_to_s3(
'SELECT * FROM v_daily_report',
aws_commons.create_s3_uri('reports-bucket','daily.csv','eu-central-1'),
options := 'format csv'
);
aws_s3 is RDS-only. It is not a community Postgres extension. It
does not exist on the on-prem destination. The function call sits in
a SQL file that lives in the application repo and has been called
from a cron job for eight months, and on cutover day it would start
returning ERROR: function aws_s3.query_export_to_s3 does not exist.
We ported the export to a small Python service that ran outside the
database, talked to the new on-prem cluster, and wrote to S3 via
boto3. Same shape, same schedule. The migration of one function call
was four hours of work and felt like it should have taken twenty
minutes. There were three other places aws_s3 had crept in. We
found two during the audit and one during the post-cutover smoke
tests when a lambda failed.
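The database-side half of the replacement is nothing exotic; only the upload to S3 moved into the external service. Roughly, with an illustrative file path:

-- psql's \copy streams the view out over the client connection; no
-- server-side filesystem access or aws_s3 extension required.
\copy (SELECT * FROM v_daily_report) TO '/tmp/daily.csv' WITH (FORMAT csv)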
This is a category of damage that is hard to predict before you do the move. RDS-specific extensions are sticky. Once teams discover them they use them, because they are right there and they work, and the assumption that “we’d never go back” calcifies into code.
3. pg_cron, but the wrong pg_cron
Similar shape. We had moved the in-database cron to pg_cron,
RDS-flavored, after the original on-prem cron-on-the-primary pattern
became unworkable on a managed instance you don’t have a shell on.
Community pg_cron exists; the RDS variant has minor behavioral
differences around the cron.database_name config and around how it
handles failed executions. None of them break in obvious ways, but
the on-prem extension we installed (community pg_cron 1.6) parsed
one of our schedule expressions slightly differently and silently
ran a job hourly that had been running every six hours. We noticed
because the daily volume of an audit log shot up by 6x. (pg_cron
1.6 added second-level scheduling in
*/5 * * * * *-style six-field expressions; one of our schedules had
been written assuming the old five-field parser, but the differences
were subtle enough to matter only in one case.)
Diff your schedules across versions before you cut over. Don’t trust that the same string means the same thing.
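The diff is one query on each side, compared with your eyeballs or diff(1); cron.job is pg_cron's own catalog table and has the same name in both variants:

-- Run on both clusters: same jobs, same schedule strings?
SELECT jobid, schedule, command FROM cron.job ORDER BY command;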
4. glibc collation, again, the other way
The on-prem destination was Debian 12, glibc 2.36. RDS’s underlying Amazon Linux had been on glibc 2.34 the last time I’d checked. Both are post-2.28, which is the cliff that matters most, but they are not the same, and any glibc version skew across an index is a risk for non-C, non-ICU collations.
I had learned my lesson the first time. The destination was configured with ICU collations for everything that mattered:
CREATE COLLATION en_us_icu (provider = icu, locale = 'en-US');
…and the migration plan re-declared the relevant columns to use ICU collations during the COPY phase. PG 15 tracks collation versions (ICU collations have carried versions since PG 10) and warns on mismatch, which means even if we screwed up we would be told. We did not screw up. The destination was clean.
But there was still a problem: the data on RDS had been built under glibc collation order for eight months, because a libc default collation was all RDS PostgreSQL 13 offered (ICU as a database default only arrived with PG 15). So the values were sorted on the source according to one rule and were being received by the destination, which would index them under a different rule. As long as the two rules agreed on equality (which they did, for the cases we cared about), CDC apply was fine. Indexes built on the destination were fine. Range scans across the boundary, during the cutover window, were briefly weird. We avoided the worst of it by quiescing reads on the old side before promoting the new side.
In retrospect: if you have any locale-sensitive data, switch to ICU on both sides as early in the migration as you can. Once you’ve got an ICU column on the source, you’ve removed glibc from the trust chain entirely, and you can move between operating systems freely.
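Done early, the switch is one column at a time, and the rewrite cost is mostly the index rebuild you were going to pay anyway. Hypothetical table, the collation from above:

-- Re-declare the column under ICU; dependent indexes are rebuilt as part
-- of the ALTER, removing glibc from this column's trust chain.
ALTER TABLE users ALTER COLUMN name TYPE text COLLATE en_us_icu;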
5. The egress bill
Direct Connect is not free. AWS data transfer to a Direct Connect location is cheaper than internet egress, but it is not zero. Our 1.4 TiB initial COPY plus ~280 GiB of CDC traffic during the catch-up window plus tail traffic during cutover came in at about $190 in data transfer charges, which is fine, plus another $400 of related charges I do not fully understand because the AWS billing dashboard does not always make sense. Round it to $600. This was a forgettable fraction of the savings from getting off the instance. It is not forgettable if you’re moving 100 TiB.
The number that’s more annoying is DMS replication instance time during a hypothetical re-cutover. We kept a DMS instance running at m5.large for two weeks during the dual-running period as a safety net. About $260. Forgettable. Add zeros for larger fleets.
6. The cutover that mostly worked
Cutover was a Saturday at 03:00 UTC, six weeks after we started
the re-migration. By then we had three pglogical-equivalent native
subscriptions all caught up to within 200ms of source, a pgbouncer
cluster fronting the new on-prem primary (configured in
transaction mode at first and then session mode after we re-tested
for the previous LISTEN/NOTIFY issue), a traffic-shifting plan on
the application’s connection-string config that flipped a single
environment variable and cycled pgbouncer pools, and an hourly
script that ran pg_dump --schema-only on both sides and diffed
them in case anybody snuck a DDL in past the freeze. Nobody did.
We checked anyway.
The cutover took 14 minutes from “freeze writes on RDS” to “RDS is read-only, on-prem is primary, app is reconnected”. Sequence resync was the slowest individual step, because we had ~470 sequences and the script ran them serially out of an abundance of paranoia. Could have parallelized; it didn’t matter.
The first hour after, we saw WARNING: collation version mismatch
from PG 15 a couple of times, on objects whose recorded collation
version no longer matched the glibc actually running on the new
box. REINDEX CONCURRENTLY cleaned them up, followed by
ALTER COLLATION ... REFRESH VERSION per the Postgres docs. No data
corruption, just a label update.
Twelve hours in: a dashboard somewhere reported the wrong number because a query was hitting a stale read replica that hadn’t quite caught up. We bumped the replica’s apply worker priority and called it. Twenty-eight hours in: production was production. We deleted the RDS instance ten days later, after a pause that was mostly psychological.
What stuck, what I’d do again, and what I wouldn’t
The bill is gone. The migration was, in absolute terms, a success. The data is intact, the application is running, the new on-prem cluster is properly Patroni-managed and properly puppeted by code that has been written this decade and that I can read.
I still wonder if we should have just modernized the original cluster in place and never moved to RDS at all. The honest answer is: probably yes for the database itself, no for the team and the political situation around it. The migration to RDS forced us to clean up nine years of tablespace cargo-culting, to delete the Cyrillic date function, to write down what every plperlu function actually did, to rebuild our knowledge of the schema. The migration back forced us to learn modern Postgres ops (Patroni, ICU collations, parallel logical replication apply, proper CDC). The round trip cost us roughly fourteen calendar months of one engineer at 60% capacity and another at 30%, plus the cloud bill during the residency. I do not think we would have done either of those things otherwise.
If I had to do this again and could change one thing: switch every
text column to an ICU collation before you migrate anywhere. It
removes glibc from the trust chain, lets you move across operating
systems without REINDEX marathons, and PG tracks the collation
version for you so you find out before a query returns wrong data
instead of after. Everything else on this list is workaroundable.
Silent collation drift is the one thing I am still nervous about, two
migrations on.
If I could change a second thing: don’t use LISTEN/NOTIFY for
anything that has to survive a connection-pooler change. Use a real
queue. The pattern is cute. The next migration will hate it.
The boxes in Frankfurt are gone. The new boxes are in two different data centers, with a third in AWS as a delayed read replica, just in case. The Puppet module is now Ansible. Nobody under thirty knows what FAI is. They don’t need to. I just hope they don’t decide we need to consolidate onto cloud-native primitives again in 2031.