<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Fellow DevOps Blog</title>
  <subtitle>Notes from the digital world</subtitle>
  <link href="https://alien2003.github.io/feed.xml" rel="self"/>
  <link href="https://alien2003.github.io/"/>
  <updated>2026-05-02T17:20:38+00:00</updated>
  <id>https://alien2003.github.io/</id>
  <author>
    <name>alien2003</name>
  </author>
  
  <entry>
    <title>Pod evictions when nothing on the dashboard is wrong</title>
    <link href="https://alien2003.github.io/2025/08/debugging-kubernetes-pod-evictions/"/>
    <updated>2025-08-04T09:30:00+00:00</updated>
    <id>https://alien2003.github.io/2025/08/debugging-kubernetes-pod-evictions/</id>
    <summary type="html">&lt;p&gt;You can usually find the cause of a pod eviction in five minutes.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;requests&lt;/code&gt; not set. JVM heap leak. Log volume past
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ephemeral-storage&lt;/code&gt;. The kubelet’s own log line tells you which.&lt;/p&gt;

&lt;p&gt;This post is about the other kind. Every pod has correct requests
and limits. Dashboards look fine. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;free -m&lt;/code&gt; says there’s plenty of
memory. Pods still die in waves at 14:23 on a Tuesday, and there is
no Friday-night-pager glory to make up for it.&lt;/p&gt;

</summary>
    <content type="html" xml:base="https://alien2003.github.io/2025/08/debugging-kubernetes-pod-evictions/">&lt;p&gt;You can usually find the cause of a pod eviction in five minutes.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;requests&lt;/code&gt; not set. JVM heap leak. Log volume past
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ephemeral-storage&lt;/code&gt;. The kubelet’s own log line tells you which.&lt;/p&gt;

&lt;p&gt;This post is about the other kind. Every pod has correct requests
and limits. Dashboards look fine. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;free -m&lt;/code&gt; says there’s plenty of
memory. Pods still die in waves at 14:23 on a Tuesday, and there is
no Friday-night-pager glory to make up for it.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;A bit of vocabulary before the gory bits, because confusing these two
costs hours.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Eviction API&lt;/strong&gt; is what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl drain&lt;/code&gt; uses. Cooperative,
honors &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PodDisruptionBudget&lt;/code&gt;, runs from the control plane.
&lt;strong&gt;Node-pressure eviction&lt;/strong&gt; is the kubelet on a worker, acting alone,
based on host-level signals it reads from cgroupfs. Different code path,
different policy, different consequences. The kernel &lt;strong&gt;OOM killer&lt;/strong&gt; is a
third thing entirely; it doesn’t know what a pod is.&lt;/p&gt;

&lt;p&gt;When pods die “too fast for the kubelet to log”, that’s the kernel.
When pods die in clean waves of two or three with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Evicted&lt;/code&gt; status,
that’s the kubelet. Both can fire on the same node within seconds.&lt;/p&gt;
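
&lt;p&gt;A quick way to tell which path fired, assuming systemd-managed nodes where
the kubelet logs to the journal (adjust the patterns for your distro):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Kernel OOM kills: the kubelet never saw these coming.
dmesg -T | grep -iE &apos;out of memory|oom-kill&apos; | tail -n 20

# Kubelet-driven evictions over the same window, from the eviction manager.
journalctl -u kubelet --since &quot;1 hour ago&quot; | grep -i &apos;eviction manager&apos; | tail -n 20
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;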

&lt;h2 id=&quot;the-signal-isnt-rss-it-isnt-free-and-dashboards-lie-about-it&quot;&gt;The signal isn’t RSS, it isn’t &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;free&lt;/code&gt;, and dashboards lie about it&lt;/h2&gt;

&lt;p&gt;Defaults from
&lt;a href=&quot;https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/eviction/defaults_linux.go&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pkg/kubelet/eviction/defaults_linux.go&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;memory.available  &amp;lt; 100Mi
nodefs.available  &amp;lt; 10%
nodefs.inodesFree &amp;lt; 5%
imagefs.available &amp;lt; 15%
imagefs.inodesFree &amp;lt; 5%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.available&lt;/code&gt; is computed from the node’s root cgroup as&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;memory.available := capacity[memory] − memory.workingSet
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;workingSet&lt;/code&gt; is read from cgroupfs, not from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/proc/meminfo&lt;/code&gt;. On
v1 it’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.usage_in_bytes − total_inactive_file&lt;/code&gt;; on v2,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.current − inactive_file&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.stat&lt;/code&gt;. The Kubernetes docs ship
a shell reproduction at
&lt;a href=&quot;https://kubernetes.io/examples/admin/resource/memory-available.sh&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/examples/admin/resource/memory-available.sh&lt;/code&gt;&lt;/a&gt;. Keep a
copy on each node. When production goes sideways you want a tool
that prints the same number the kubelet sees.&lt;/p&gt;
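
&lt;p&gt;A stripped-down version of the same arithmetic for a cgroup v1 node, close
to what the linked script does (paths assume the v1 memory controller mounted
at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/sys/fs/cgroup/memory&lt;/code&gt;):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/sh
# memory.available the way the kubelet computes it on cgroup v1:
# capacity - workingSet, where workingSet = usage_in_bytes - total_inactive_file.
capacity_kb=$(awk &apos;/^MemTotal:/ {print $2}&apos; /proc/meminfo)
capacity=$((capacity_kb * 1024))
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
inactive_file=$(awk &apos;/^total_inactive_file / {print $2}&apos; /sys/fs/cgroup/memory/memory.stat)
working_set=$((usage - inactive_file))
echo &quot;memory.available: $(( (capacity - working_set) / 1024 / 1024 )) MiB&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;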

&lt;p&gt;Working set is &lt;strong&gt;bigger than RSS&lt;/strong&gt; by roughly the size of the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active_file&lt;/code&gt; page cache (plus whatever kernel memory the cgroup is
charged for). Which leads to the next part.&lt;/p&gt;

&lt;h2 id=&quot;the-active_file-page-cache-trap&quot;&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active_file&lt;/code&gt; page-cache trap&lt;/h2&gt;

&lt;p&gt;Filed as &lt;a href=&quot;https://github.com/kubernetes/kubernetes/issues/43916&quot;&gt;kubernetes#43916&lt;/a&gt; in 2017, still open.&lt;/p&gt;

&lt;p&gt;The kernel’s page-cache reclaim list has two halves: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inactive_file&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active_file&lt;/code&gt;. Both are page cache, both reclaimable. The kernel
demotes pages from active to inactive under pressure and then reclaims
from the inactive end. The kubelet subtracts &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inactive_file&lt;/code&gt; from
working set on the assumption that “active” means in-use. That
assumption is wrong for any workload that re-reads its working set.&lt;/p&gt;

&lt;p&gt;A second &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read()&lt;/code&gt; of the same page promotes it to active. Index scans,
Parquet readers, Prometheus remote-read, anything mmap-heavy. They
all grow &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active_file&lt;/code&gt; while RSS holds steady. The kubelet then sees a
node under memory pressure, even though the “pressure” is cache the kernel
would happily reclaim if asked.&lt;/p&gt;

&lt;p&gt;Shape of the issue, on a quiet test node:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;awk&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;$1 ~ /^(rss|active_file|total_inactive_file)$/&apos;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    /sys/fs/cgroup/memory/memory.stat
rss 482344960
active_file 12058624
total_inactive_file 9854976

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;dd &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/var/log/syslog &lt;span class=&quot;nv&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/dev/null &lt;span class=&quot;nv&quot;&gt;bs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1M &lt;span class=&quot;nv&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;512 &lt;span class=&quot;nv&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;none
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;dd &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/var/log/syslog &lt;span class=&quot;nv&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/dev/null &lt;span class=&quot;nv&quot;&gt;bs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1M &lt;span class=&quot;nv&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;512 &lt;span class=&quot;nv&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;none

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;awk&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;$1 ~ /^(rss|active_file|total_inactive_file)$/&apos;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    /sys/fs/cgroup/memory/memory.stat
rss 482344960          &lt;span class=&quot;c&quot;&gt;# unchanged&lt;/span&gt;
active_file 530378752  &lt;span class=&quot;c&quot;&gt;# +500MB cache, all &quot;active&quot;&lt;/span&gt;
total_inactive_file 23097344
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That 500 MB now counts against &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.available&lt;/code&gt;. Across a busy node
with several pods doing the same trick, you can lose multiple GB of
apparent capacity to data the kernel would throw away if it needed to.&lt;/p&gt;

&lt;p&gt;Mitigations are all bad. Dropping caches works (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;echo 1 &amp;gt;
/proc/sys/vm/drop_caches&lt;/code&gt;) and hurts every other tenant on the host;
do not put it in cron. Bounding the workload’s own cgroup with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.max&lt;/code&gt; keeps its cache local. cgroup v2 plus the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemoryQoS&lt;/code&gt;
feature gate (alpha as of 1.27, see &lt;a href=&quot;https://kubernetes.io/blog/2023/05/05/qos-memory-resources/&quot;&gt;the announcement&lt;/a&gt;) lets
the kernel apply &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.high&lt;/code&gt; back-pressure before kubelet ever sees
node-level pressure, but you have to be on a recent enough kubelet,
recent enough kernel, and willing to run an alpha gate.&lt;/p&gt;

&lt;h2 id=&quot;tmpfs-emptydir-is-a-stealth-eviction-generator&quot;&gt;tmpfs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;emptyDir&lt;/code&gt; is a stealth eviction generator&lt;/h2&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;emptyDir: {medium: Memory}&lt;/code&gt; is a tmpfs. Files written to it are
anonymous from the cgroup’s perspective, so they count against the
writing container’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.workingSet&lt;/code&gt; (and therefore its
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.limit&lt;/code&gt; if set) and against the node’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.available&lt;/code&gt;
calculation.&lt;/p&gt;

&lt;p&gt;If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sizeLimit&lt;/code&gt; is omitted, the tmpfs grows up to node-allocatable
memory by default. A pod that writes 30 GiB to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/shm&lt;/code&gt; on a 32 GiB
node has, from the kubelet’s perspective, just made the node go full.
Eviction ranks by usage-over-request, more on that next, and the
offender’s “request” is whatever its containers declared, which almost
never includes the tmpfs. Innocent neighbours get evicted. The pod
eating the RAM is fine.&lt;/p&gt;

&lt;p&gt;Both fields, always:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;volumes&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;scratch&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;emptyDir&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;medium&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Memory&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;sizeLimit&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;256Mi&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;app&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;requests&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;memory&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;512Mi&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;}&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;# must include the tmpfs ceiling&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;limits&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;   &lt;span class=&quot;pi&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;memory&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;768Mi&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sizeLimit&lt;/code&gt; is enforced by the kernel via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tmpfs -o size=&lt;/code&gt;, so writes
past it return &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ENOSPC&lt;/code&gt; and the pod gets evicted by the kubelet’s
local-storage logic with a clean reason. If the container’s own memory
limit is below the tmpfs size, the cgroup OOM kills the writing
process before any node-level event fires, and you get a confusing
crash with no &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Evicted&lt;/code&gt; pod state.&lt;/p&gt;
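
&lt;p&gt;You can watch the accounting happen from inside the pod. A sketch, assuming
the manifest above with the volume mounted at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/scratch&lt;/code&gt;, a cgroup v2 node, and a
shell in the image (the pod name is hypothetical):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl exec -it scratch-demo -c app -- sh -c &apos;
  before=$(cat /sys/fs/cgroup/memory.current)
  dd if=/dev/zero of=/scratch/blob bs=1M count=200 status=none
  after=$(cat /sys/fs/cgroup/memory.current)
  df -h /scratch                  # tmpfs, ~200M used, no disk involved
  echo &quot;cgroup charge grew by $(( (after - before) / 1048576 )) MiB&quot;
&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;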

&lt;h2 id=&quot;eviction-ranking-is-not-oom-ranking&quot;&gt;Eviction ranking is not OOM ranking&lt;/h2&gt;

&lt;p&gt;Two algorithms. They use different signals and they routinely target
different pods.&lt;/p&gt;

&lt;p&gt;The kubelet’s pod ordering, from
&lt;a href=&quot;https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/eviction/helpers.go&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pkg/kubelet/eviction/helpers.go&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;orderedBy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exceedMemoryRequests&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;priority&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;memory&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pods&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pods over their memory request go first, then by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PriorityClass&lt;/code&gt;, then
by absolute usage above request. QoS class is not in this comparator.
People assume &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BestEffort&lt;/code&gt; always dies first; it does, but only because
its request is zero so it always exceeds requests by definition once it
uses anything at all.&lt;/p&gt;

&lt;p&gt;The kernel’s ordering is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;oom_score&lt;/code&gt;, which is a function of RSS as a
fraction of total RAM, adjusted by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;oom_score_adj&lt;/code&gt;. The kubelet writes
those values from
&lt;a href=&quot;https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/qos/policy.go&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pkg/kubelet/qos/policy.go&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;KubeletOOMScoreAdj&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;999&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;KubeProxyOOMScoreAdj&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;999&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;guaranteedOOMScoreAdj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;997&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;besteffortOOMScoreAdj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1000&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;// burstable:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;oomScoreAdjust&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1000&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1000&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;containerMemReq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;memoryCapacity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Burstable formula is the part that surprises people. A container
that requests 10% of node memory gets &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;oom_score_adj = 900&lt;/code&gt;. One that
requests 1% gets 990. &lt;strong&gt;Asking for less makes the kernel more likely to
kill you for the same RSS&lt;/strong&gt;, which is the opposite of what most people
expect from “I’ll just request a little, it’s fine”.&lt;/p&gt;

&lt;p&gt;Conversely, a Burstable pod with a large request and currently low RSS
has a small &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;oom_score_adj&lt;/code&gt; and is unlikely to be picked by the
kernel, yet it can still be the very &lt;em&gt;first&lt;/em&gt; candidate for the kubelet’s
manager if its working set, page cache included, has crept past that request.
So under sudden pressure you can lose two pods to the same event: the kernel
takes one, the kubelet wakes up ten seconds later and takes another.&lt;/p&gt;
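
&lt;p&gt;To see what the kubelet actually wrote for a running container, resolve its
PID on the node and read &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/proc&lt;/code&gt;; a sketch using crictl (container name
hypothetical):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# On the node: the adj the kubelet set, and the score the kernel will rank by.
CID=$(crictl ps --name otel-collector -q | head -n1)
PID=$(crictl inspect -o go-template --template &apos;{{.info.pid}}&apos; &quot;$CID&quot;)
cat /proc/&quot;$PID&quot;/oom_score_adj   # what policy.go computed
cat /proc/&quot;$PID&quot;/oom_score       # adj plus the RSS-derived base score
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;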

&lt;h2 id=&quot;the-10-second-polling-gap&quot;&gt;The 10-second polling gap&lt;/h2&gt;

&lt;p&gt;The eviction manager runs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;synchronize()&lt;/code&gt; on a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;monitoringInterval&lt;/code&gt;
that defaults to 10s
(&lt;a href=&quot;https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/eviction/eviction_manager.go&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;eviction_manager.go&lt;/code&gt;&lt;/a&gt;; the historical thread is at
&lt;a href=&quot;https://github.com/kubernetes/kubernetes/issues/30173&quot;&gt;kubernetes#30173&lt;/a&gt;). Between polls it relies on the kernel
memcg notifier (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cgroup.event_control&lt;/code&gt; on v1, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.events&lt;/code&gt; on v2)
to wake it early, but only for thresholds the kubelet wired the
notifier for, which means &lt;strong&gt;only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.available&lt;/code&gt;&lt;/strong&gt; gets the fast
path. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nodefs&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;imagefs&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pid.available&lt;/code&gt; are 10-second polls,
period.&lt;/p&gt;

&lt;p&gt;A spike that exhausts memory faster than the polling interval races
the kernel OOM killer and loses. There’s no fix. Headroom is the
defense: oversize the threshold, size requests honestly, and don’t
pack nodes to 95%.&lt;/p&gt;

&lt;h2 id=&quot;allocatable-quietly-subtracts-the-eviction-hard-threshold&quot;&gt;Allocatable quietly subtracts the eviction-hard threshold&lt;/h2&gt;

&lt;p&gt;From &lt;a href=&quot;https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/&quot;&gt;Reserve Compute Resources&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Allocatable = Capacity − kube-reserved − system-reserved − eviction-hard
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So raising &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.available&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;100Mi&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1Gi&lt;/code&gt; to “be safer” cuts
schedulable memory by 900 MiB on every node in the cluster. Across a
fleet of one hundred 8-core nodes that’s ~88 GiB of capacity that just
disappeared. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl describe node&lt;/code&gt; doesn’t tell you why allocatable
dropped; you have to remember you changed the threshold.&lt;/p&gt;
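
&lt;p&gt;The gap is visible if you ask for it, though; per node, capacity minus
allocatable is the sum of the reservations plus the hard eviction threshold:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CAP_MEM:.status.capacity.memory,\
ALLOC_MEM:.status.allocatable.memory
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;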

&lt;h2 id=&quot;the-ghost-pressure-window&quot;&gt;The ghost pressure window&lt;/h2&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--eviction-pressure-transition-period&lt;/code&gt; defaults to &lt;strong&gt;5 minutes&lt;/strong&gt;.
After pressure clears, the node holds the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemoryPressure=True&lt;/code&gt;
condition (or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DiskPressure&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PIDPressure&lt;/code&gt;) for the full window, which
the scheduler honors as a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NoSchedule&lt;/code&gt; taint.&lt;/p&gt;

&lt;p&gt;In practice you see a node “stuck” in MemoryPressure with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl top
node&lt;/code&gt; reporting 30% use, four minutes after the pressure cleared. A
rolling deploy stalls because half its target nodes are silently
refusing pods. The cluster autoscaler refuses to scale the node &lt;em&gt;down&lt;/em&gt;
because apparently-pressured nodes aren’t removal candidates.&lt;/p&gt;
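
&lt;p&gt;Spotting nodes stuck in the window is a one-liner; a sketch with jq:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get nodes -o json | jq -r &apos;
  .items[]
  | select(.status.conditions[] | select(.type==&quot;MemoryPressure&quot; and .status==&quot;True&quot;))
  | .metadata.name + &quot;  taints=&quot; + ((.spec.taints // []) | map(.key) | join(&quot;,&quot;))
&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;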

&lt;p&gt;The window is there to prevent flapping. Tune it down on noisy nodes
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--eviction-pressure-transition-period=30s&lt;/code&gt;) at the cost of more
scheduler churn.&lt;/p&gt;

&lt;h2 id=&quot;minimum-reclaim-defaults-to-zero-which-is-why-you-see-waves&quot;&gt;Minimum-reclaim defaults to zero, which is why you see waves&lt;/h2&gt;

&lt;p&gt;After evicting a pod, the kubelet checks the signal again. If the
threshold is no longer crossed, it stops. With
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--eviction-minimum-reclaim=0&lt;/code&gt; (the default), the death of one
small pod is enough to satisfy the predicate even if the freed memory
is a rounding error.&lt;/p&gt;

&lt;p&gt;A slow leak in some other workload then trips the threshold every
~30 seconds, the kubelet evicts one pod, the leak resumes, the
threshold trips again, and on it goes. Evictions get spread across
many pods, masking the actual leaker. Set a real reclaim:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;evictionMinimumReclaim&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;memory.available&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;500Mi&quot;&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;nodefs.available&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1Gi&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now each round must produce real headroom, which slows the cadence and
concentrates evictions on fewer larger pods. The leaker surfaces
faster.&lt;/p&gt;

&lt;h2 id=&quot;per-pod-ephemeral-storage-limits-evict-before-nodefs-ever-signals&quot;&gt;Per-pod ephemeral-storage limits evict before nodefs ever signals&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://kubernetes.io/blog/2022/09/19/local-storage-capacity-isolation-ga/&quot;&gt;Local Storage Capacity Isolation&lt;/a&gt; went GA in 1.25.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ephemeral-storage&lt;/code&gt; requests/limits work like memory: exceed yours and
the kubelet evicts your pod, with no node-level signal at all. Per-pod,
summed across container writable layers, stdout/stderr logs, and
non-memory &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;emptyDir&lt;/code&gt; volumes.&lt;/p&gt;

&lt;p&gt;The classic victim is a noisy logger with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;limits.ephemeral-storage:
1Gi&lt;/code&gt;. A burst of error logs to stdout at 50 MB/s blows past it in 20
seconds and the pod dies with reason &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Pod ephemeral local storage
usage exceeds the total limit&lt;/code&gt;. The disk is fine. The other pods are
fine. Look in the pod’s events, not the node’s.&lt;/p&gt;
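
&lt;p&gt;The kubelet’s own accounting for this lives in the stats summary, per pod;
a sketch (node name hypothetical):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get --raw /api/v1/nodes/NODE_NAME/proxy/stats/summary \
  | jq -r &apos;.pods[]
      | [.podRef.namespace, .podRef.name,
         (.[&quot;ephemeral-storage&quot;].usedBytes // 0 | tostring)]
      | join(&quot;  &quot;)&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;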

&lt;h2 id=&quot;hard-eviction-ignores-graceful-shutdown&quot;&gt;Hard eviction ignores graceful shutdown&lt;/h2&gt;

&lt;p&gt;A hard threshold trip is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SIGKILL&lt;/code&gt; after a 0-second grace period. The
container’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preStop&lt;/code&gt; hook does not run, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terminationGracePeriodSeconds&lt;/code&gt;
is ignored, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PodDisruptionBudget&lt;/code&gt; is not consulted. Soft thresholds
respect &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--eviction-max-pod-grace-period&lt;/code&gt;, which is a kubelet flag (not
a pod field) and which &lt;strong&gt;caps&lt;/strong&gt; whatever the pod declared. If your pod
asks for 600s and the kubelet caps at 30s, you get 30s.&lt;/p&gt;

&lt;p&gt;Stateful workloads that need to flush on shutdown have to do it on
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SIGTERM&lt;/code&gt; from the drain path, not the eviction path. The two paths
are not interchangeable, even though the API surface looks similar.&lt;/p&gt;
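
&lt;p&gt;If the flush lives in the entrypoint, make it a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SIGTERM&lt;/code&gt; handler and accept
that hard evictions will skip it entirely; a sketch (the app binary and the
flush step are stand-ins):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/sh
# Graceful path only: drain / soft eviction sends SIGTERM and waits.
# A hard eviction is SIGKILL: none of this runs.
flush_and_exit() {
  echo &quot;flushing state before shutdown&quot;   # stand-in for the real flush
  sync
  exit 0
}
trap flush_and_exit TERM INT
/usr/local/bin/my-app &amp;                    # hypothetical workload binary
wait $!
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;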

&lt;h2 id=&quot;filtering-for-kubelet-driven-evictions-specifically&quot;&gt;Filtering for kubelet-driven evictions specifically&lt;/h2&gt;

&lt;p&gt;Since 1.25, the kubelet sets a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DisruptionTarget&lt;/code&gt; condition on evicted
pods, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reason=TerminationByKubelet&lt;/code&gt;. Per-pod-limit evictions
don’t get this. Those are deemed the pod’s fault and shouldn’t be
retried.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get pods &lt;span class=&quot;nt&quot;&gt;-A&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; json | jq &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;
  .items[]
  | select(.status.conditions[]?
    | select(.type==&quot;DisruptionTarget&quot; and .reason==&quot;TerminationByKubelet&quot;))
  | &quot;\(.metadata.namespace)/\(.metadata.name) on \(.spec.nodeName)&quot;
&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DisruptionTarget&lt;/code&gt; type covers &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PreemptionByScheduler&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EvictionByEvictionAPI&lt;/code&gt;. Different reasons, different remediations.
A useful field in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kube-state-metrics&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;a-real-one-five-things-wrong-at-once&quot;&gt;A real one: five things wrong at once&lt;/h2&gt;

&lt;p&gt;We had a production cluster running maybe forty nodes, m5.4xlarge,
nothing exotic. A data team rolled out a new analytics service that
read large Parquet files (8–12 GB each) off an internal blob store on
an hourly cadence. Memory request 16Gi, limit 24Gi, sized off a load
test. They tested it for a week in staging. Looked great.&lt;/p&gt;

&lt;p&gt;Production rolled the change Tuesday morning. By 14:00 we had pods
flapping on three nodes. Not the analytics pods themselves: random
neighbours. A Go web service. A Prometheus exporter. The cluster’s own
metrics-server, twice.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl describe node&lt;/code&gt; showed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemoryPressure: True&lt;/code&gt;. Fine,
expected when the kubelet evicts. But &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl top node&lt;/code&gt; reported 62%
memory used. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;free -m&lt;/code&gt; on the node, same number. No OOM in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dmesg&lt;/code&gt;.
Pods reported &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;status.reason: Evicted&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;message: The node was low
on resource: memory&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That’s the first head-scratch. The kubelet says low on memory. The
node says it has plenty.&lt;/p&gt;

&lt;p&gt;I’d already learned from §1 to not trust &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;free -m&lt;/code&gt;, so I went to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.workingSet&lt;/code&gt; directly. From the node’s stats summary endpoint:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl get &lt;span class=&quot;nt&quot;&gt;--raw&lt;/span&gt; /api/v1/nodes/agumon-prod-03/proxy/stats/summary &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  | jq &lt;span class=&quot;s1&quot;&gt;&apos;.node.memory&apos;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;time&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;2025-07-22T14:13:22Z&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;availableBytes&quot;&lt;/span&gt;: 89128960,        &lt;span class=&quot;c&quot;&gt;# 85 MiB. yikes&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;usageBytes&quot;&lt;/span&gt;: 67284598784,         &lt;span class=&quot;c&quot;&gt;# 62 GiB, matches free -m&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;workingSetBytes&quot;&lt;/span&gt;: 67195469824,    &lt;span class=&quot;c&quot;&gt;# 62.5 GiB&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;rssBytes&quot;&lt;/span&gt;: 18043871232,           &lt;span class=&quot;c&quot;&gt;# 17 GiB&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;pageFaults&quot;&lt;/span&gt;: ...
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;workingSetBytes&lt;/code&gt; was 62.5 GiB on a 64 GiB node. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rssBytes&lt;/code&gt; was 17 GiB.
The 45 GiB delta is page cache that the kubelet was counting against
me. I’d never seen that wide a gap before. Reading
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/sys/fs/cgroup/memory/memory.stat&lt;/code&gt; confirmed it: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active_file&lt;/code&gt; was
sitting at ~38 GiB.&lt;/p&gt;

&lt;p&gt;So this was §2: active_file trap, exactly the read-heavy Parquet
workload pattern from #43916. The analytics service was reading
the same files multiple times per run (Parquet predicate pushdown +
partition pruning, two passes over each file), promoting pages to
active. Cache the kernel would gladly drop, that the kubelet considered
real load.&lt;/p&gt;

&lt;p&gt;Cool, so why are &lt;em&gt;neighbours&lt;/em&gt; getting evicted instead of the analytics
pod? Section §4. The kubelet’s ranker is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(exceedMemoryRequests,
priority, memory)&lt;/code&gt;. The analytics pod requested 16Gi and was using ~12
GiB RSS, so it was &lt;em&gt;under&lt;/em&gt; request. The Go web service requested 256Mi
and was using ~290 MiB. Over request by a hair, no PriorityClass set,
and absolute usage above request was 35 MiB. First in the order. Goodbye.&lt;/p&gt;

&lt;p&gt;Then there was the second weird thing in the timeline. Out of every
five eviction events, one had no kubelet log entry at all in the
preceding 10 seconds. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dmesg -T&lt;/code&gt; had it:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[Tue Jul 22 14:18:47 2025] oom-kill:constraint=CONSTRAINT_MEMCG,
  nodemask=(null), cpuset=...,task=otel-collector,pid=2891034,
  uid=65532
[Tue Jul 22 14:18:47 2025] Memory cgroup out of memory: Killed process
  2891034 (otel-collector) total-vm:1247892kB, anon-rss:412348kB,
  oom_score_adj:961
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;oom_score_adj: 961&lt;/code&gt;. That’s the Burstable formula from §4 hitting an
otel-collector with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;requests.memory: 100Mi&lt;/code&gt; on a 64 GiB node: the formula
parks any small-request container deep in the 900s
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1000 − 1000 × 100/65536 ≈ 998&lt;/code&gt;), far above every Guaranteed pod and the
kubelet itself. The collector wasn’t
even leaking; it was a bystander whose request had been set conservatively
(§4 again, small requests raise oom_score_adj). The kernel grabbed it
during the 10-second window between kubelet polls (§5) when the active_file
spike pushed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.available&lt;/code&gt; straight through the floor.&lt;/p&gt;

&lt;p&gt;So now I had two kill paths firing for one root cause. We were also
seeing the wave behaviour from §8: the kubelet would evict one
neighbour, free 290 MiB, declare victory, and 12 seconds later the
analytics service would issue another file read and active_file would
climb again. We watched seven separate eviction events on a single
node in two minutes, each freeing a tiny amount.&lt;/p&gt;

&lt;p&gt;And the cluster-autoscaler refused to scale up because the affected
nodes still had &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemoryPressure=True&lt;/code&gt; from §7’s 5-minute window. To the
scheduler those nodes were still off-limits; by the time the autoscaler looked
at them they were back to ~30% real usage. New nodes weren’t coming online;
tainted nodes were rejecting the rescheduled pods. Pods sat Pending for four
minutes at a time.&lt;/p&gt;

&lt;p&gt;The fixes, in priority order. The first three got us through that day:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# 1. minimum-reclaim non-zero, so each eviction round actually does something&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;evictionMinimumReclaim&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;memory.available&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1Gi&quot;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# 2. transition period down so cleared nodes start accepting pods again&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;evictionPressureTransitionPeriod&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;45s&quot;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# 3. analytics pod gets a memory.high via cgroup v2 MemoryQoS so its&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#    page cache can&apos;t blow past its limit&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;featureGates&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;MemoryQoS&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The lasting fix was in the analytics service: a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--cache-dir&lt;/code&gt; flag we
hadn’t been using that pointed Parquet caching at a sized
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;emptyDir: {medium: Memory, sizeLimit: 4Gi}&lt;/code&gt; with the matching memory
request and limit, so the cache was bounded to that container’s
cgroup instead of being node-global page cache. Pages still get cached;
they just no longer eat into other tenants’ eviction budget.&lt;/p&gt;

&lt;p&gt;Total damage: about three hours of production weirdness, six hours of
debugging, and one shared post-mortem with a slide titled &lt;em&gt;“things
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl top&lt;/code&gt; will never tell you”&lt;/em&gt;. The author of the analytics
service had done nothing wrong by any reasonable reading of the
kubernetes docs. The system was correct as designed and pathological
in interaction.&lt;/p&gt;

&lt;h2 id=&quot;what-i-check-first-now&quot;&gt;What I check first now&lt;/h2&gt;

&lt;p&gt;When pods evict in waves and the dashboards say nothing’s wrong, in
this order: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;workingSetBytes&lt;/code&gt; minus &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rssBytes&lt;/code&gt; (cache delta), then
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active_file&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.stat&lt;/code&gt; (the §2 number), then the kubelet log
for the 10s before each eviction (the §5 race), then the kubelet’s
ranker on the surviving versus evicted pods (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exceedMemoryRequests&lt;/code&gt;
first), and then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--eviction-minimum-reclaim&lt;/code&gt; to see if I’ve been
papering over a leak with wave evictions.&lt;/p&gt;
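
&lt;p&gt;The first check comes straight off the stats summary; a sketch (node name
hypothetical):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get --raw /api/v1/nodes/NODE_NAME/proxy/stats/summary \
  | jq &apos;.node.memory
      | {workingSetBytes, rssBytes,
         cacheDeltaBytes: (.workingSetBytes - .rssBytes)}&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;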

&lt;p&gt;If none of that fits and you’ve still got phantom evictions, you’re
probably looking at a memcg accounting bug in your kernel. Bisect the
kernel before you bisect the workload.&lt;/p&gt;

</content>
    <category term="kubernetes"/><category term="sre"/><category term="kubelet"/><category term="cgroups"/><category term="oom"/><category term="debugging"/>
  </entry>
  
  <entry>
    <title>From KVM Postgres to RDS and back: a migration that should not have worked</title>
    <link href="https://alien2003.github.io/2025/03/postgres-kvm-to-rds-and-back/"/>
    <updated>2025-03-12T09:15:00+00:00</updated>
    <id>https://alien2003.github.io/2025/03/postgres-kvm-to-rds-and-back/</id>
    <summary type="html">&lt;p&gt;The database had been running on three KVM domains for nine years. It was
provisioned by &lt;a href=&quot;https://fai-project.org/&quot;&gt;FAI&lt;/a&gt; off a debian-installer preseed that nobody on
the current team had written, configured by a Puppet module last
meaningfully edited in 2019, and patched only when the SRE on call had
the energy. PostgreSQL 9.6, on Debian 8 (jessie), past the end of LTS,
past the end of ELTS, past the end of any reasonable explanation. The
boxes were fine. They were always fine. They had been fine for so long
that nobody touched them on principle.&lt;/p&gt;

&lt;p&gt;Then somebody from finance asked why we had three idle Xeons in a rack
in Frankfurt, and we got eight months to move it to RDS.&lt;/p&gt;

&lt;p&gt;I am writing this in March 2025 about a migration that started in
early 2023 and ended, eventually, with the same database back on three
new KVM domains in late 2024. We did the round trip. Both directions
hurt. This is the long version of what broke.&lt;/p&gt;

</summary>
    <content type="html" xml:base="https://alien2003.github.io/2025/03/postgres-kvm-to-rds-and-back/">&lt;p&gt;The database had been running on three KVM domains for nine years. It was
provisioned by &lt;a href=&quot;https://fai-project.org/&quot;&gt;FAI&lt;/a&gt; off a debian-installer preseed that nobody on
the current team had written, configured by a Puppet module last
meaningfully edited in 2019, and patched only when the SRE on call had
the energy. PostgreSQL 9.6, on Debian 8 (jessie), past the end of LTS,
past the end of ELTS, past the end of any reasonable explanation. The
boxes were fine. They were always fine. They had been fine for so long
that nobody touched them on principle.&lt;/p&gt;

&lt;p&gt;Then somebody from finance asked why we had three idle Xeons in a rack
in Frankfurt, and we got eight months to move it to RDS.&lt;/p&gt;

&lt;p&gt;I am writing this in March 2025 about a migration that started in
early 2023 and ended, eventually, with the same database back on three
new KVM domains in late 2024. We did the round trip. Both directions
hurt. This is the long version of what broke.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;the-cast&quot;&gt;The cast&lt;/h2&gt;

&lt;p&gt;What we were moving. About 1.4 TiB of data across two logical
databases, around 6,500 tables. PostgreSQL &lt;strong&gt;9.6.24&lt;/strong&gt; on Debian
8.11, PGDG repo rather than stock jessie. One primary, two
streaming replicas, all on libvirt/KVM with raw LVM volumes on
local SSD. pgbouncer in transaction mode in front of the primary,
roughly 2,400 client connections collapsed to about 80 backend. A
Puppet module that managed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;postgresql.conf&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_hba.conf&lt;/code&gt;, and
the boot-time symlinks that pointed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/var/lib/postgresql/9.6/main&lt;/code&gt;
at the LVM mount. The same Puppet module also installed a
tablespace on a second LVM volume called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fast_ssd&lt;/code&gt; for an
analytics schema. Custom code: 11 functions in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plperlu&lt;/code&gt;
(untrusted), 3 functions in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plpython2u&lt;/code&gt;, and a single ill-advised
function in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pltclu&lt;/code&gt; that printed a date in Cyrillic. Three FDW
links to other Postgres clusters using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;postgres_fdw&lt;/code&gt;, one of
which pointed at a MySQL via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mysql_fdw&lt;/code&gt; from the
&lt;a href=&quot;https://www.enterprisedb.com/downloads/postgres-postgresql-downloads&quot;&gt;EnterpriseDB packages&lt;/a&gt;. A scheduled job runner that was just
a cron entry on the primary running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;psql -c &apos;SELECT
vacuum_analyze_partitions()&apos;&lt;/code&gt; every twenty minutes. And an
application written by people who had since left the company and
who used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LISTEN&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOTIFY&lt;/code&gt; for a job-queue pattern. A small thing.
It mattered later.&lt;/p&gt;

&lt;p&gt;What we were moving to. RDS for PostgreSQL, Multi-AZ across two
availability zones, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db.r6g.4xlarge&lt;/code&gt;, storage &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gp3&lt;/code&gt; 2 TiB, eventual plan to upgrade to
PG 15 once we got there. RDS Proxy in front for connection
management, replacing pgbouncer.&lt;/p&gt;

&lt;h2 id=&quot;why-we-moved&quot;&gt;Why we moved&lt;/h2&gt;

&lt;p&gt;The reason in the slide deck was &lt;em&gt;consolidation onto cloud-native primitives&lt;/em&gt;.
The reason in the room was that nobody under thirty in the company knew
what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apt pinning&lt;/code&gt; was anymore and we needed to stop pretending the
jessie box was a strategy. Fair. The boxes were also overdue for
hardware replacement and capex was harder to defend than opex that year.&lt;/p&gt;

&lt;p&gt;I do not have a strong opinion about that decision. I have very strong
opinions about the next twelve weeks.&lt;/p&gt;

&lt;h2 id=&quot;to-rds-the-things-that-broke&quot;&gt;TO RDS, the things that broke&lt;/h2&gt;

&lt;h3 id=&quot;1-the-dump-that-wouldnt-restore-tablespaces&quot;&gt;1. The dump that wouldn’t restore (tablespaces)&lt;/h3&gt;

&lt;p&gt;First plan was the obvious one: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_dump --schema-only&lt;/code&gt; on production,
restore to RDS, set up DMS for full load + CDC, cut over.&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ psql -h tentomon-prod.xxx.eu-central-1.rds.amazonaws.com -U postgres -f schema.sql
psql:schema.sql:18241: ERROR:  permission denied to create tablespace &quot;fast_ssd&quot;
HINT:  Must be superuser to create a tablespace.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;RDS does not give you superuser. There is no &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;postgres&lt;/code&gt; superuser; the
master user gets &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rds_superuser&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rds_superuser&lt;/code&gt; is not a real
superuser, it’s a role with a curated set of grants. You cannot
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CREATE TABLESPACE&lt;/code&gt; against an arbitrary path because the only path
RDS will accept is under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/rdsdbdata/db/base/tablespace&lt;/code&gt;, and even
then the &lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Concepts.General.FeatureSupport.Tablespaces.html&quot;&gt;RDS docs&lt;/a&gt; say outright:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;RDS for PostgreSQL supports tablespaces for compatibility purposes,
but due to all storage being on a single logical volume, they cannot
be used for I/O splitting or isolation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words: you can create a tablespace, but it does literally
nothing. The dump’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TABLESPACE fast_ssd&lt;/code&gt; clauses on every CREATE
TABLE in the analytics schema either fail (if you don’t pre-create the
matching tablespace name) or succeed and silently lie (if you do).&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_dump --no-tablespaces&lt;/code&gt; would have spared me, and is what I used in
the end. The fix is one flag. The annoying part was the discovery: I
spent a Friday afternoon convinced our schema dump was somehow
corrupted, looking at it line by line, before I noticed how many
tables had the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TABLESPACE&lt;/code&gt; clause attached. Puppet had been
provisioning that tablespace into the bootstrapping for years. None of
us thought about it because none of us had touched it.&lt;/p&gt;
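
&lt;p&gt;Roughly what the second attempt looked like (host and database names are
placeholders):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Dump the schema without TABLESPACE clauses, check nothing slipped through,
# then restore to RDS.
pg_dump --schema-only --no-tablespaces -h old-primary -U postgres appdb &gt; schema.sql
grep -c &apos;TABLESPACE fast_ssd&apos; schema.sql   # expect 0
psql -h tentomon-prod.xxx.eu-central-1.rds.amazonaws.com -U postgres -f schema.sql
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;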

&lt;h3 id=&quot;2-plperlu-and-plpython2u-the-rewrite-trail&quot;&gt;2. plperlu and plpython2u: the rewrite trail&lt;/h3&gt;

&lt;p&gt;Next discovery, courtesy of the same dump. RDS has &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plperl&lt;/code&gt; (trusted)
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plpython3u&lt;/code&gt; (depending on engine version). It does not have
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plperlu&lt;/code&gt;, the untrusted variant, which is precisely where every
function we’d written had landed because the original author needed
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;require LWP::Simple&lt;/code&gt; for a one-line HTTP call from inside the
database. (Yes, I know.) From the &lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQLReleaseNotes/AuroraPostgreSQL.Extensions.html&quot;&gt;Aurora extensions table&lt;/a&gt;
explicitly: &lt;em&gt;“some extensions are no longer supported, such as
adminpack, plperlu, pltclu, pageinspect, and xml2.”&lt;/em&gt; The same
restriction holds for community RDS PostgreSQL on the relevant
versions.&lt;/p&gt;
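
&lt;p&gt;An inventory query against the catalogs is enough to size that job; a
sketch (connection details are placeholders):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;psql -h old-primary -U postgres -d appdb -c &quot;
  SELECT l.lanname, count(*) AS functions
  FROM pg_proc p
  JOIN pg_language l ON l.oid = p.prolang
  WHERE l.lanname IN (&apos;plperlu&apos;, &apos;plpython2u&apos;, &apos;pltclu&apos;)
  GROUP BY 1
  ORDER BY 2 DESC;&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;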

&lt;p&gt;So the eleven &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plperlu&lt;/code&gt; functions had to be rewritten. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plpython2u&lt;/code&gt;
was worse: the Python 2 variant is gone entirely, so the move to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plpython3u&lt;/code&gt;
on the destination meant porting from Python 2 to Python 3, in 2024,
years after Python 2’s funeral. One of the functions was a homemade
parser for a CSV format produced by a printer in Belgium that nobody
wanted to think about. It worked. It had worked for eight years.
It used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;print&lt;/code&gt; as a statement.&lt;/p&gt;

&lt;p&gt;The single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pltclu&lt;/code&gt; function we deleted. It was never called.&lt;/p&gt;

&lt;p&gt;Rewrites in priority order:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Functions called from triggers got &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plpgsql&lt;/code&gt; rewrites. Most were
simple enough.&lt;/li&gt;
  &lt;li&gt;Functions called from cron-style jobs got moved out of the database
entirely into a small Python service running on the same VPC,
talking to the new RDS via psycopg. Better placement anyway.&lt;/li&gt;
  &lt;li&gt;The Belgian-printer parser got a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plpython3u&lt;/code&gt; rewrite, tested
against a 50,000-line corpus pulled from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_largeobject&lt;/code&gt;, and
merged with three reviewers because none of us trusted it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This consumed seven weeks. It was not the most technically interesting
seven weeks of my career.&lt;/p&gt;

&lt;h3 id=&quot;3-glibc-collation-the-silent-corrupter&quot;&gt;3. glibc collation, the silent corrupter&lt;/h3&gt;

&lt;p&gt;The boxes were on Debian 8, glibc 2.19. RDS PostgreSQL runs on a
managed Amazon Linux base with a much newer glibc, well past the
&lt;strong&gt;2.28&lt;/strong&gt; boundary where glibc rewrote its locale collation tables to
match ISO 14651:2016 and Unicode 9. This is the now-famous
&lt;a href=&quot;https://www.crunchydata.com/blog/glibc-collations-and-data-corruption&quot;&gt;glibc 2.28 collation break&lt;/a&gt; that has bitten every
serious Postgres operator who’s done a major OS upgrade since 2018.&lt;/p&gt;

&lt;p&gt;Symptoms are bad in a specific way. Indexes don’t &lt;em&gt;appear&lt;/em&gt; corrupt.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_amcheck&lt;/code&gt; may not flag them. Queries return wrong results. A
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT … WHERE name = &apos;Müller&apos;&lt;/code&gt; finds the row on the source and
misses it on the destination, because the index was built under one
sort order and is being walked under another. The tell is sometimes:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WARNING:  index &quot;users_name_idx&quot; contains corrupted page at block 0
DETAIL:  Failed to find parent tuple for heap-only tuple at (12, 4)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;…but more often, no tell at all. The index just lies.&lt;/p&gt;

&lt;p&gt;If you build the dump under jessie’s collation and restore it to a
host with newer collation, every index on a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;text&lt;/code&gt; column with a
non-C collation is suspect. The workarounds are all painful. You
can REINDEX everything post-restore. We did. On 1.4 TiB the
initial run took 11 hours wallclock with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REINDEX (CONCURRENTLY)&lt;/code&gt;,
with parallelism limited by what RDS would tolerate without the
Performance Insights graph turning red. You can build with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lc_collate=C&lt;/code&gt; on the destination, which buys you byte-order sort
and nothing else; often you cannot, because some app somewhere
relies on locale-aware sort. Or you can switch to ICU collations,
which were available since PG 10 and are versioned, so Postgres
can warn you when the version changed underneath an index. We
could not use ICU on the source because 9.6 had no usable ICU
support, but we did use it on the destination to inoculate against
the next migration. (Foreshadowing.)&lt;/p&gt;
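
&lt;p&gt;Enumerating what is actually at risk is at least cheap. A rough
catalog query for “indexes that depend on a libc collation”, assuming
the usual setup where the database default collation is a glibc
locale; a sketch, not a guarantee:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- indexes whose indexed columns carry a non-C, non-POSIX collation:
-- the ones to REINDEX (or at least distrust) after a glibc jump
SELECT DISTINCT i.indexrelid::regclass AS suspect_index,
                i.indrelid::regclass  AS on_table,
                co.collname
FROM pg_index i
JOIN pg_attribute a  ON a.attrelid = i.indrelid
                    AND a.attnum = ANY (i.indkey)
JOIN pg_collation co ON co.oid = a.attcollation
WHERE co.collname NOT IN (&apos;C&apos;, &apos;POSIX&apos;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;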

&lt;p&gt;I learned later that PG 15+ records the provider version of the
database default collation and of every named collation, and emits
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WARNING: database &quot;x&quot; has a collation version mismatch&lt;/code&gt; in the
logs when it changes underneath you. We weren’t going to PG 15 yet.
We were doing a like-for-like upgrade to 9.6 on RDS first because
that’s all DMS would let us replicate cleanly with this much custom
code on the source. So no warning. Just wrong answers.&lt;/p&gt;

&lt;h3 id=&quot;4-the-replication-slot-that-filled-the-disk&quot;&gt;4. The replication slot that filled the disk&lt;/h3&gt;

&lt;p&gt;Plan was to use &lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Concepts.General.FeatureSupport.LogicalReplication.html&quot;&gt;native logical replication&lt;/a&gt; from on-prem
to RDS for the cutover, with DMS as the fallback. Native logical
replication needs PG 10+ on both ends. We were on 9.6 on the source.
Fine, says me, we can use &lt;a href=&quot;https://github.com/2ndQuadrant/pglogical&quot;&gt;pglogical&lt;/a&gt; which has a
9.4-compatible decoder and is supported on RDS as an extension.&lt;/p&gt;

&lt;p&gt;The setup looks reasonable:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- on RDS, in the cluster parameter group&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rds&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logical_replication&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;max_replication_slots&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;max_wal_senders&lt;/span&gt;         &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- on source (jessie box)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;wal_level&lt;/span&gt;               &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logical&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;max_replication_slots&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;max_wal_senders&lt;/span&gt;         &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;(Both static. Both require a restart. The reboot of the cluster
parameter group on RDS took about 90 seconds; the reboot on the
jessie primary took about 90 seconds plus thirty minutes of failover
choreography because the puppet module hadn’t been told what to do
with a logical-replication-shaped postgres in 2019.)&lt;/p&gt;

&lt;p&gt;The fun part came two days into the initial sync. A pglogical worker
on the destination side hit a malformed row (a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bytea&lt;/code&gt; column whose
length on source was reported one way and on destination was decoded
another way, an interaction with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bytea_output = escape&lt;/code&gt; that I won’t
re-litigate here) and wedged. Default behavior: the slot stays open,
the apply worker keeps trying, WAL on the source keeps accumulating
because the slot is alive and unconsumed.&lt;/p&gt;

&lt;p&gt;Twelve hours later the on-prem primary’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_xlog/&lt;/code&gt; was at 340 GiB.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_slot_wal_keep_size&lt;/code&gt; did not exist in 9.6 (that’s a PG 13
parameter) so there was no soft cap. The disk filled at 04:11 local.
The primary stopped accepting writes. We failed over to the
synchronous replica (also out of WAL space, because synchronous, but
the failover itself worked, courtesy of Patroni rather than anything
RDS gives you, shoutout to whoever set that up). Application took
~9 minutes of errors before fully reconnecting through pgbouncer.&lt;/p&gt;

&lt;p&gt;The lesson is simple and the docs say it plainly: an inactive
logical-replication slot retains WAL forever. RDS &lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.Replication.ReadReplicas.Mechanisms-versions.html&quot;&gt;says it&lt;/a&gt;,
the AWS DMS &lt;a href=&quot;https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.AssessmentReport.PG.html&quot;&gt;pre-flight assessment&lt;/a&gt; checks
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_slot_wal_keep_size&lt;/code&gt;, the &lt;a href=&quot;https://github.com/2ndQuadrant/pglogical&quot;&gt;pglogical README&lt;/a&gt; says it.
We knew. We had a runbook entry. The runbook entry was for the
destination, not the source. The destination was the safe one. RDS
will not let WAL eat its own storage past a certain point on PG 13+.
Our &lt;strong&gt;source&lt;/strong&gt; was 9.6 and had no such governor.&lt;/p&gt;

&lt;p&gt;After that we wrote a 5-minute cron on the source that paged if any
slot’s retained WAL exceeded 50 GiB. On 9.6 that’s
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_xlog_location_diff(pg_current_xlog_location(), confirmed_flush_lsn)&lt;/code&gt;;
the functions were only renamed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_wal_lsn_diff&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_current_wal_lsn&lt;/code&gt; in PG 10.
It paged twice more during the migration. Both times the apply worker
had wedged on something we hadn’t anticipated.&lt;/p&gt;
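
&lt;p&gt;The check itself is one query against
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_replication_slots&lt;/code&gt;. Roughly what the cron ran; a sketch, with
the 9.6 function names and the 50 GiB threshold inlined:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- any row returned means a slot is holding back more than ~50 GiB of WAL;
-- page a human, because the disk is next
SELECT slot_name,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(),
                                            confirmed_flush_lsn)) AS retained
FROM pg_replication_slots
WHERE pg_xlog_location_diff(pg_current_xlog_location(), confirmed_flush_lsn)
      &amp;gt; 50::numeric * 1024 * 1024 * 1024;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;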

&lt;h3 id=&quot;5-replica-identity-or-why-my-deletes-evaporated&quot;&gt;5. REPLICA IDENTITY (or: why my deletes evaporated)&lt;/h3&gt;

&lt;p&gt;About week six of CDC, with backfill done and replica caught up to
within a second, somebody on the application team noticed that a
specific cleanup job ran on the source nightly, deleted ~50,000 rows
from a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session_log&lt;/code&gt; table, and on the RDS side those rows were
still there. Insert lag: zero. Update lag: zero. Delete lag:
infinite, because the deletes simply weren’t being applied.&lt;/p&gt;

&lt;p&gt;Logical replication needs a way to identify the row being deleted on
the apply side. It looks at the source’s REPLICA IDENTITY setting:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;ALTER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;session_log&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;REPLICA&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IDENTITY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;-- or DEFAULT, or USING INDEX&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEFAULT&lt;/code&gt; means “use the primary key”. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session_log&lt;/code&gt; table had no
primary key. Nine years on, it had grown a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;created_at&lt;/code&gt;-based
candidate but never had it promoted. With no PK, pglogical/native
logical replication can decode the DELETE from the WAL but cannot
emit it on the wire because there’s nothing to put in the WHERE
clause on apply. The apply worker either drops it silently or errors,
depending on version and config. In our case, dropped. No log entry.
Nothing.&lt;/p&gt;

&lt;p&gt;The DMS pre-flight assessment has a
&lt;a href=&quot;https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.AssessmentReport.PG.html&quot;&gt;&lt;em&gt;“REPLICA IDENTITY FULL”&lt;/em&gt;&lt;/a&gt; check that covers
exactly this case, and its remediation advice amounts to either
changing the REPLICA IDENTITY setting or switching to a test_decoding
plugin. Useful for DMS users; we weren’t on DMS for this segment, but
the same trap applied. We ran:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nspname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relreplident&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pg_class&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pg_namespace&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;oid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relnamespace&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relkind&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;r&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relreplident&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;d&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;i&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nspname&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;pg_catalog&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;information_schema&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;pglogical&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;…and then for any row where there was no PK or unique index, set
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REPLICA IDENTITY FULL&lt;/code&gt;. Which makes every UPDATE/DELETE log the full
old row, which roughly &lt;strong&gt;doubles the WAL volume&lt;/strong&gt; for those tables.
On &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session_log&lt;/code&gt;, which we had been deleting cheerfully at 50k/night
without thinking about it, this turned the next day’s WAL into ~14
GiB of mostly-DELETE traffic, which the apply worker then chewed
through at maybe 4 MiB/s, and we were behind again for two days.&lt;/p&gt;
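
&lt;p&gt;Finding the candidates in the first place (tables where
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEFAULT&lt;/code&gt; identity buys you nothing, because there is no primary
key or unique index at all) is another catalog query. A sketch, not
the exact one we ran:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- tables with no primary key and no unique index: UPDATEs and DELETEs on
-- these will not replicate under REPLICA IDENTITY DEFAULT
SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = &apos;r&apos;
  AND n.nspname NOT IN (&apos;pg_catalog&apos;, &apos;information_schema&apos;, &apos;pglogical&apos;)
  AND NOT EXISTS (SELECT 1 FROM pg_constraint pc
                  WHERE pc.conrelid = c.oid AND pc.contype = &apos;p&apos;)
  AND NOT EXISTS (SELECT 1 FROM pg_index i
                  WHERE i.indrelid = c.oid AND i.indisunique);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;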

&lt;p&gt;We added a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BIGSERIAL PRIMARY KEY&lt;/code&gt; column to seven tables, took the
brief locks during a low-traffic window, and went back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEFAULT&lt;/code&gt;
identity. The proper fix had been sitting in the backlog under the
title “add PKs to legacy tables” since 2017. It got merged on a
Tuesday afternoon under the title “fix CDC”.&lt;/p&gt;

&lt;h3 id=&quot;6-dms-briefly-and-why-we-left-it&quot;&gt;6. DMS, briefly, and why we left it&lt;/h3&gt;

&lt;p&gt;We did try DMS for the analytics schema, separately, because
pglogical didn’t love the size of one particular partitioned table
(2,400 partitions, ~80 GiB). DMS comes with its own family of
problems, all documented if you know where to look. JSONB goes to
CLOB by default. Source has a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb&lt;/code&gt; column, target also has a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb&lt;/code&gt; column, and DMS will happily put a stringified
representation through a CLOB pipe and either truncate or rewrite
whitespace, depending on the LOB mode. &lt;a href=&quot;https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.LOBSupport.html&quot;&gt;Limited LOB mode&lt;/a&gt;
caps individual LOBs at 100 MB and pre-allocates memory for them;
Full LOB mode handles arbitrary sizes but is dramatically slower.
We had values up to 8 MB. Limited LOB at 16 MB worked, with a
careful eye on the memory footprint of the replication instance.&lt;/p&gt;

&lt;p&gt;Sequences are not migrated. DMS does not transfer sequence current
values; the cutover script has to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setval()&lt;/code&gt; every sequence on the
target to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MAX(id) + buffer&lt;/code&gt; before redirecting writes. Forget one
and you hit a primary key collision the moment the app inserts.&lt;/p&gt;
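
&lt;p&gt;The cutover-script step looks roughly like this. A sketch that
assumes serial-style columns (sequence owned by its column) and a
single schema; the +1000 buffer is arbitrary:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- bump every owned sequence on the target to MAX(column) + buffer
DO $$
DECLARE r record;
BEGIN
  FOR r IN
    SELECT s.relname AS seq_name,
           t.relname AS tbl_name,
           a.attname AS col_name
    FROM pg_class s
    JOIN pg_depend d    ON d.objid = s.oid
                       AND d.refclassid = &apos;pg_class&apos;::regclass
                       AND d.deptype = &apos;a&apos;
    JOIN pg_class t     ON t.oid = d.refobjid
    JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = d.refobjsubid
    WHERE s.relkind = &apos;S&apos;
  LOOP
    EXECUTE format(&apos;SELECT setval(%L, (SELECT COALESCE(MAX(%I), 0) + 1000 FROM %I))&apos;,
                   r.seq_name, r.col_name, r.tbl_name);
  END LOOP;
END $$;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;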

&lt;p&gt;Materialised views must be re-created manually on the target, per
the assessment report. We had four. We forgot one. It surfaced
three weeks after cutover when a dashboard went blank.&lt;/p&gt;

&lt;p&gt;DDL is not replicated unless you turn on a specific event trigger.
We froze schema changes for the duration and added a Slack bot
that yelled when anybody tried to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALTER TABLE&lt;/code&gt; on prod.&lt;/p&gt;

&lt;p&gt;The combined behavior with our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bytea&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb&lt;/code&gt; columns made me
nervous enough to keep DMS only for that one partitioned analytics
table where the diffs were simpler, and run pglogical for the rest.
Two replication paths, two sets of monitoring, two sets of failure
modes. Worth it for stability of the OLTP path.&lt;/p&gt;

&lt;h3 id=&quot;7-cutover-night-and-pgbouncer-to-rds-proxy&quot;&gt;7. Cutover night, and pgbouncer to RDS Proxy&lt;/h3&gt;

&lt;p&gt;Cutover was scheduled for a Saturday at 02:00 UTC. The plan:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;00:30  freeze deploys, last sanity check
01:30  switch app pgbouncer to read-only mode
01:45  drain writes, verify pglogical lag &amp;lt; 5s on all subscribers
02:00  flip DNS to point at RDS Proxy endpoint
02:05  sequence resync (setval everything)
02:15  unfreeze, monitor for 60 minutes
03:30  declare victory or roll back
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The DNS flip went fine. The application reconnected fine. Latency
dashboards looked fine. Then about ten minutes in, we started seeing
weird behavior from a Java service: occasional 30-second hangs on
queries that should have taken milliseconds, then resumed. Not
errors. Hangs.&lt;/p&gt;

&lt;p&gt;The Java service was using Hibernate, which by default likes to use
named server-side prepared statements. Hibernate had been talking to
pgbouncer in transaction-pooling mode for years, and pgbouncer in
transaction mode famously &lt;a href=&quot;https://www.pgbouncer.org/faq.html#how-to-use-prepared-statements-with-transaction-pooling&quot;&gt;breaks server-side prepared statements&lt;/a&gt;.
The team had worked around that with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prepareThreshold=0&lt;/code&gt; in the JDBC
URL. Fine.&lt;/p&gt;

&lt;p&gt;RDS Proxy has its own behavior. Per the
&lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-proxy-pinning.html&quot;&gt;RDS Proxy pinning docs&lt;/a&gt;, for PostgreSQL the proxy
will &lt;em&gt;pin&lt;/em&gt; a client connection to a backend (effectively turning
off multiplexing for that client) when it sees anything that changes
session state: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SET&lt;/code&gt; commands other than
transaction-scoped ones, server-side prepared statement
creation/management, temporary tables, sequences, or views,
declared cursors, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LISTEN&lt;/code&gt; on a notification channel, or
session-state-altering library loads. Among other things.&lt;/p&gt;

&lt;p&gt;When a connection gets pinned, it stays bound to one backend for the
rest of the session. Under load, you get a sudden cliff: the proxy
runs out of unpinned connections to multiplex over, and new clients
queue waiting for one. The 30-second hangs were clients queueing
against the proxy’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MaxConnectionsPercent&lt;/code&gt; ceiling, courtesy of a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SET work_mem&lt;/code&gt; that Hibernate or some library it pulled in was
issuing on every connection acquire to match what we’d had on
pgbouncer.&lt;/p&gt;

&lt;p&gt;We rolled back to a fronting pgbouncer in front of RDS Proxy
(yes, pgbouncer in front of RDS Proxy in front of RDS, three
connection layers, deeply unaesthetic, &lt;em&gt;worked&lt;/em&gt;) for a week while the
app team excised the per-session SETs and moved them into the
RDS-side default parameter group. After that we removed the pgbouncer
hop. CloudWatch metric to monitor:
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DatabaseConnectionsCurrentlySessionPinned&lt;/code&gt;. If that stays above zero
for any sustained period you have something issuing a pin trigger.&lt;/p&gt;
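
&lt;p&gt;If the per-session &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SET&lt;/code&gt;s only cover a couple of parameters,
there are two boring ways to make them disappear without touching the
proxy. A sketch; the role name and the values are made up:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- bake the setting into the role so no session-level SET is needed at all
ALTER ROLE app_rw SET work_mem = &apos;64MB&apos;;

-- or, where it genuinely is per-query, keep it transaction-scoped;
-- per the pinning docs above, transaction-scoped SETs do not pin
BEGIN;
SET LOCAL work_mem = &apos;256MB&apos;;
-- ...the one expensive query...
COMMIT;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;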

&lt;h3 id=&quot;8-the-listennotify-queue-ghost-edition&quot;&gt;8. The LISTEN/NOTIFY queue, ghost edition&lt;/h3&gt;

&lt;p&gt;This one’s small but I keep telling it because I find it funny in
retrospect.&lt;/p&gt;

&lt;p&gt;A worker process in the background-jobs service used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LISTEN
job_ready&lt;/code&gt; on a long-lived connection to receive NOTIFYs from a
trigger on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jobs&lt;/code&gt; table. Cute pattern, fine for low scale, this
was low scale.&lt;/p&gt;

&lt;p&gt;Post-cutover the worker silently stopped processing jobs. Connection
was up, the LISTEN was registered (we checked &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_listening_channels()&lt;/code&gt;),
the trigger was firing on inserts, the NOTIFY was being issued. The
worker just never got a notification.&lt;/p&gt;

&lt;p&gt;What it had actually done was: open a connection through RDS Proxy,
issue &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LISTEN job_ready&lt;/code&gt;, get its connection &lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-proxy-pinning.html&quot;&gt;pinned&lt;/a&gt; (LISTEN
pins a connection), and sit there. NOTIFY pushes are per-backend. The
backend the worker was pinned to was &lt;em&gt;fine&lt;/em&gt;. The triggers, however,
were running on whichever backend the writer pool happened to grab
for the relevant transaction, which was a &lt;em&gt;different&lt;/em&gt; backend each
time, and notifications don’t cross backend boundaries except via the
shared &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_notification_queue&lt;/code&gt;. Which they do. But not in a way that
fires on a pinned proxy session in real time the way it had on
pgbouncer’s stable session-mode connection on the old setup.&lt;/p&gt;

&lt;p&gt;Resolution: move the worker to connect directly to the writer
endpoint, not through the proxy. RDS Proxy is fundamentally not built
for long-lived single-session listeners and the docs say so if you
read closely. The fix was four lines of config. Diagnosing it took an
afternoon of perplexed staring.&lt;/p&gt;

&lt;h2 id=&quot;eight-months-on-rds&quot;&gt;Eight months on RDS&lt;/h2&gt;

&lt;p&gt;Things were fine. Latency from our app fleet (in our colo, not in
AWS) to RDS in eu-central-1 was around 11ms p50 over Direct Connect,
which was about three times what we’d had on the LAN side, and the
app teams had to retune one or two N+1-prone services, but nothing
broke that wasn’t fixable. Performance Insights was genuinely
useful. The on-call burden dropped because none of us were patching
jessie kernels anymore. Storage autoscaling triggered twice when an
analytics intern ran a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT INTO&lt;/code&gt; against a 600 GiB table; the
&lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIOPS.StorageTypes.html#USER_PIOPS.Autoscaling&quot;&gt;autoscaling docs&lt;/a&gt; mention a six-hour cooldown between
scale events which we hit on the second one and ate a brief storage
warning before it cleared. Fine.&lt;/p&gt;

&lt;p&gt;What was not fine was the bill.&lt;/p&gt;

&lt;p&gt;The headline was the instance. r6g.4xlarge Multi-AZ at on-demand,
plus 2 TiB of gp3 with provisioned IOPS, plus the RDS Proxy, plus the
read replica we ended up adding, plus the DMS instance we kept around
for the analytics schema, plus Direct Connect, plus data transfer to
the app fleet, plus snapshots, plus Performance Insights long
retention, plus CloudWatch logs at vended-logs pricing. Total monthly
came in at roughly &lt;strong&gt;6.4x&lt;/strong&gt; the fully-loaded TCO of the three KVM
boxes including hardware amortization, power, hands, and the SRE
fraction. Reserved instances would have brought it closer to 4x, but
nobody wanted to commit a 3-year RI on an architecture we weren’t
sure was the long-term answer.&lt;/p&gt;

&lt;p&gt;That, plus a separate compliance question about data residency
from one of our larger customers that turned into a legal review
that turned into a data-locality requirement that RDS in Frankfurt
technically satisfied but politically did not, ended with a
steering-committee decision in April 2024: move it back. New hardware this time,
proper Debian 12 (bookworm), Patroni from day one, ZFS snapshots,
proper monitoring. We’d been forced into doing the unmaintained
stack a favor.&lt;/p&gt;

&lt;h2 id=&quot;back-from-rds-the-second-set-of-disasters&quot;&gt;BACK from RDS, the second set of disasters&lt;/h2&gt;

&lt;p&gt;If migrating &lt;em&gt;into&lt;/em&gt; RDS is hard, migrating &lt;em&gt;out&lt;/em&gt; is harder. RDS gives
you exactly two ways to get your data out continuously:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;AWS DMS, which we had already learned to fear.&lt;/li&gt;
  &lt;li&gt;Native logical replication, where RDS plays the publisher and the
on-prem cluster plays the subscriber.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What it does not give you is &lt;strong&gt;physical replication out&lt;/strong&gt;. You cannot
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_basebackup&lt;/code&gt; an RDS instance from outside. You cannot &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_receivewal&lt;/code&gt;
its WAL stream. You can issue &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_dump&lt;/code&gt;, which we tried first for
sanity, and which on a 1.4 TiB database took eleven hours and left us
needing seventeen hours of CDC catch-up before cutover. That doesn’t
work for low-downtime cutover. So: native logical replication, again,
in reverse.&lt;/p&gt;

&lt;h3 id=&quot;1-the-publisher-setup-in-reverse&quot;&gt;1. The publisher setup, in reverse&lt;/h3&gt;

&lt;p&gt;Now RDS is the source. To make RDS publish, you need (we’d already
done this, ironically, for the DMS direction):&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;rds.logical_replication = 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;…in the cluster parameter group, plus a publication and a user with
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rds_replication&lt;/code&gt; role (per &lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Concepts.General.FeatureSupport.LogicalReplication.html&quot;&gt;the docs&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;GRANT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rds_replication&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;repl_out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GRANT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TABLES&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SCHEMA&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;repl_out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PUBLICATION&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pub_all&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FOR&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TABLES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then on the destination (the new on-prem PG 15 box):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SUBSCRIPTION&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sub_all&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;CONNECTION&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;host=tentomon-prod.xxx... port=5432 user=repl_out password=... sslmode=require&apos;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;PUBLICATION&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pub_all&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;copy_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;create_slot&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This worked. It was painfully slow. The initial COPY of 1.4 TiB over
Direct Connect at our committed bandwidth came out to about 38 hours
when I’d planned for 20. The bottleneck wasn’t network; it was apply
single-threadedness on the subscriber. PG 16 can apply large streamed
transactions in parallel behind the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;streaming = parallel&lt;/code&gt; subscription option; PG 15 can only spill
and serialize them, and neither helps the initial COPY anyway. We
were stuck with a single worker per subscription doing the bulk
import.&lt;/p&gt;

&lt;p&gt;Workaround: split the publication. Three publications, three
subscriptions, three apply workers running in parallel, each
responsible for a non-overlapping subset of tables. Cut the COPY time
to ~16 hours. Operationally messier; you have to coordinate which
tables go where, and you cannot have a row referenced across
publication boundaries during initial copy without ordering issues.
It worked.&lt;/p&gt;
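
&lt;p&gt;The shape of the split, with illustrative table names; the real lists
were chosen so each subscription moved a comparable number of bytes:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- on RDS (publisher): three publications over disjoint sets of tables
CREATE PUBLICATION pub_hot  FOR TABLE orders, order_items, customers;
CREATE PUBLICATION pub_logs FOR TABLE session_log, audit_log;
CREATE PUBLICATION pub_rest FOR TABLE invoices, payments, reports;  -- and so on

-- on the PG 15 destination: one subscription, hence one worker, per publication
CREATE SUBSCRIPTION sub_hot  CONNECTION &apos;host=... user=repl_out ...&apos; PUBLICATION pub_hot  WITH (copy_data = true);
CREATE SUBSCRIPTION sub_logs CONNECTION &apos;host=... user=repl_out ...&apos; PUBLICATION pub_logs WITH (copy_data = true);
CREATE SUBSCRIPTION sub_rest CONNECTION &apos;host=... user=repl_out ...&apos; PUBLICATION pub_rest WITH (copy_data = true);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;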

&lt;h3 id=&quot;2-the-aws_s3-calls-in-stored-procs&quot;&gt;2. The aws_s3 calls in stored procs&lt;/h3&gt;

&lt;p&gt;Eight months on RDS had let one team get clever. They’d written a
nightly job that exported a reporting view to S3 using the
&lt;a href=&quot;https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.S3Import.InstallExtension.html&quot;&gt;aws_s3 extension&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aws_s3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query_export_to_s3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;s1&quot;&gt;&apos;SELECT * FROM v_daily_report&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;aws_commons&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create_s3_uri&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;reports-bucket&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;daily.csv&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;eu-central-1&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;options&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;format csv&apos;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws_s3&lt;/code&gt; is RDS-only. It is not a community Postgres extension. It
does not exist on the on-prem destination. The function call sits in
a SQL file that lives in the application repo and has been called
from a cron job for eight months, and on cutover day it would start
returning &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ERROR: function aws_s3.query_export_to_s3 does not exist&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We ported the export to a small Python service that ran outside the
database, talked to the new on-prem cluster, and wrote to S3 via
boto3. Same shape, same schedule. The migration of one function call
was four hours of work and felt like it should have taken twenty
minutes. There were three other places &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws_s3&lt;/code&gt; had crept in. We
found two during the audit and one during the post-cutover smoke
tests when a lambda failed.&lt;/p&gt;
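
&lt;p&gt;The in-database part of the audit was one query; the calls living in
application repos and lambdas still needed a grep. A sketch:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- stored functions on RDS whose body mentions the RDS-only aws_s3 extension
SELECT n.nspname, p.proname
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE p.prosrc ILIKE &apos;%aws_s3.%&apos;
  AND n.nspname NOT IN (&apos;pg_catalog&apos;, &apos;information_schema&apos;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;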

&lt;p&gt;This is a category of damage that is hard to predict before you do
the move. RDS-specific extensions are sticky. Once teams discover
them they use them, because they are right there and they work, and
the assumption that “we’d never go back” calcifies into code.&lt;/p&gt;

&lt;h3 id=&quot;3-pg_cron-but-the-wrong-pg_cron&quot;&gt;3. pg_cron, but the wrong pg_cron&lt;/h3&gt;

&lt;p&gt;Similar shape. We had moved the in-database cron to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_cron&lt;/code&gt;,
RDS-flavored, after the original on-prem cron-on-the-primary pattern
became unworkable on a managed instance you don’t have a shell on.
Community &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_cron&lt;/code&gt; exists; the RDS variant has minor behavioral
differences around the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cron.database_name&lt;/code&gt; config and around how it
handles failed executions. None of them break in obvious ways, but
the on-prem extension we installed (community &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_cron&lt;/code&gt; 1.6) parsed
one of our schedule expressions slightly differently and silently
ran a job hourly that had been running every six hours. We noticed
because the daily volume of an audit log shot up by 6x. (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_cron&lt;/code&gt;
1.6 added &lt;a href=&quot;https://github.com/citusdata/pg_cron&quot;&gt;second-level scheduling&lt;/a&gt; in
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*/5 * * * * *&lt;/code&gt;-style six-field expressions; one of our schedules had
been written assuming the old five-field parser, but the differences
were subtle enough to matter only in one case.)&lt;/p&gt;

&lt;p&gt;Diff your schedules across versions before you cut over. Don’t trust
that the same string means the same thing.&lt;/p&gt;
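
&lt;p&gt;The mechanical part is easy: dump the job table on both sides and
diff the output. A sketch; the column set is per recent community
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_cron&lt;/code&gt;, and older versions lack &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jobname&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- run on the old and the new cluster, save, diff; the column that bites
-- is schedule, or rather how each version parses it
SELECT jobid, jobname, schedule, command, active
FROM cron.job
ORDER BY jobid;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;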

&lt;h3 id=&quot;4-glibc-collation-again-the-other-way&quot;&gt;4. glibc collation, again, the other way&lt;/h3&gt;

&lt;p&gt;The on-prem destination was Debian 12, glibc 2.36. RDS’s underlying
Amazon Linux had been on glibc 2.34 the last time I’d checked. Both
are post-2.28, which is the cliff that matters most, but they are
not the same, and &lt;strong&gt;any&lt;/strong&gt; glibc version skew across an index is a
risk for non-C, non-ICU collations.&lt;/p&gt;

&lt;p&gt;I had learned my lesson the first time. The destination was
configured with &lt;strong&gt;ICU collations&lt;/strong&gt; for everything that mattered:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;COLLATION&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;en_us_icu&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;provider&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;icu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;locale&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;en-US-x-icu&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;…and the migration plan re-declared columns to use ICU collation
during the COPY phase. PG 15+ records the version of every ICU
collation and warns on mismatch when one is used, which means even
if we screwed up we would be told. We did not screw up. The
destination was clean.&lt;/p&gt;

&lt;p&gt;But there was still a problem: the data &lt;strong&gt;on RDS&lt;/strong&gt; had been built
under glibc collation order for eight months, because that’s all RDS
PostgreSQL 13 supported at the time. So the values were sorted on the
source according to one rule and were being received by the
destination, which would index them under a different rule. As long
as the rules agreed on equality (which they do, for the cases we
cared about), CDC apply was fine. Indexes built on the destination
were fine. Range scans across the boundary, during the cutover
window, were briefly weird. We avoided it by quiescing reads on the
old side before promoting the new side.&lt;/p&gt;

&lt;p&gt;In retrospect: if you have any locale-sensitive data, switch to ICU
on &lt;strong&gt;both&lt;/strong&gt; sides as early in the migration as you can. Once you’ve
got an ICU column on the source, you’ve removed glibc from the trust
chain entirely, and you can move between operating systems freely.&lt;/p&gt;
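
&lt;p&gt;Per column, the switch is a one-liner, though it takes a heavy lock
and rebuilds the indexes that depend on the column. Table and column
names here are placeholders:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- move one locale-sensitive column from the libc default to the ICU
-- collation created earlier; repeat per column, ideally in a quiet window
ALTER TABLE customers
  ALTER COLUMN name TYPE text COLLATE &quot;en_us_icu&quot;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;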

&lt;h3 id=&quot;5-the-egress-bill&quot;&gt;5. The egress bill&lt;/h3&gt;

&lt;p&gt;Direct Connect is not free. AWS data transfer to a Direct Connect
location is cheaper than internet egress, but it is not zero. Our
1.4 TiB initial COPY plus ~280 GiB of CDC traffic during the catch-up
window plus tail traffic during cutover came in at about $190 in
data transfer charges, which is fine, plus another $400 of related
charges I do not fully understand because the AWS billing dashboard
does not always make sense. Round it to $600. This was a forgettable
fraction of the savings from getting off the instance. It is not
forgettable if you’re moving 100 TiB.&lt;/p&gt;

&lt;p&gt;The number that’s more annoying is &lt;strong&gt;DMS replication instance time&lt;/strong&gt;
during a hypothetical re-cutover. We kept a DMS instance running at
m5.large for two weeks during the dual-running period as a safety
net. About $260. Forgettable. Add zeros for larger fleets.&lt;/p&gt;

&lt;h3 id=&quot;6-the-cutover-that-mostly-worked&quot;&gt;6. The cutover that mostly worked&lt;/h3&gt;

&lt;p&gt;Cutover was a Saturday at 03:00 UTC, six weeks after we started
the re-migration. By then we had three pglogical-equivalent native
subscriptions all caught up to within 200ms of source, a pgbouncer
cluster fronting the new on-prem primary (configured in
transaction mode at first and then session mode after we re-tested
for the previous LISTEN/NOTIFY issue), a traffic-shifting plan on
the application’s connection-string config that flipped a single
environment variable and cycled pgbouncer pools, and an hourly
script that ran &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_dump --schema-only&lt;/code&gt; on both sides and diffed
them in case anybody snuck a DDL in past the freeze. Nobody did.
We checked anyway.&lt;/p&gt;

&lt;p&gt;The cutover took 14 minutes from “freeze writes on RDS” to “RDS is
read-only, on-prem is primary, app is reconnected”. Sequence resync
was the slowest individual step, because we had ~470 sequences and
the script ran them serially out of an abundance of paranoia. Could
have parallelized; it didn’t matter.&lt;/p&gt;

&lt;p&gt;The first hour after, we saw &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WARNING: collation version mismatch&lt;/code&gt;
from PG 15 a couple of times on indexes that had survived from the
RDS side via the COPY (it was tracking the upstream glibc version on
data it had received). &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REINDEX CONCURRENTLY&lt;/code&gt; cleaned them up.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALTER COLLATION ... REFRESH VERSION&lt;/code&gt; after, per the
&lt;a href=&quot;https://www.postgresql.org/docs/current/collation.html&quot;&gt;Postgres docs&lt;/a&gt;. No data corruption, just a label
update.&lt;/p&gt;

&lt;p&gt;Twelve hours in: a dashboard somewhere reported the wrong number
because a query was hitting a stale read replica that hadn’t quite
caught up. We bumped the replica’s apply worker priority and called
it. Twenty-eight hours in: production was production. We deleted the
RDS instance ten days later, after a pause that was mostly
psychological.&lt;/p&gt;

&lt;h2 id=&quot;what-stuck-what-id-do-again-and-what-i-wouldnt&quot;&gt;What stuck, what I’d do again, and what I wouldn’t&lt;/h2&gt;

&lt;p&gt;The bill is gone. The migration was, in absolute terms, a success.
The data is intact, the application is running, the new on-prem
cluster is properly Patroni-managed and properly puppeted by code
that has been written this decade and that I can read.&lt;/p&gt;

&lt;p&gt;I still wonder if we should have just modernized the original cluster
in place and never moved to RDS at all. The honest answer is:
probably yes for the database itself, no for the team and the
political situation around it. The migration to RDS forced us to
clean up nine years of tablespace cargo-culting, to delete the Cyrillic
date function, to write down what every plperlu function actually
did, to rebuild our knowledge of the schema. The migration back
forced us to learn modern Postgres ops (Patroni, ICU collations,
parallel logical replication apply, proper CDC). The round trip cost
us roughly fourteen calendar months of one engineer at 60% capacity
and another at 30%, plus the cloud bill during the residency. I do
not think we would have done either of those things otherwise.&lt;/p&gt;

&lt;p&gt;If I had to do this again and could change one thing: &lt;strong&gt;switch every
text column to an ICU collation before you migrate anywhere.&lt;/strong&gt; It
removes glibc from the trust chain, lets you move across operating
systems without &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REINDEX&lt;/code&gt; marathons, and PG tracks the collation
version for you so you find out &lt;em&gt;before&lt;/em&gt; a query returns wrong data
instead of after. Everything else on this list is workaroundable.
Silent collation drift is the one thing I am still nervous about, two
migrations on.&lt;/p&gt;

&lt;p&gt;If I could change a second thing: don’t use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LISTEN&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOTIFY&lt;/code&gt; for
anything that has to survive a connection-pooler change. Use a real
queue. The pattern is cute. The next migration will hate it.&lt;/p&gt;

&lt;p&gt;The boxes in Frankfurt are gone. The new boxes are in two different
data centers, with a third in AWS as a delayed read replica, just
in case. The Puppet module is now Ansible. Nobody under thirty knows
what FAI is. They don’t need to. I just hope they don’t decide we
need to consolidate onto cloud-native primitives again in 2031.&lt;/p&gt;

</content>
    <category term="postgres"/><category term="rds"/><category term="dms"/><category term="aws"/><category term="migration"/><category term="devops"/><category term="postmortem"/>
  </entry>
  
  <entry>
    <title>Hello, world (or: why this site looks like Windows 95)</title>
    <link href="https://alien2003.github.io/2024/09/hello-world/"/>
    <updated>2024-09-15T10:00:00+00:00</updated>
    <id>https://alien2003.github.io/2024/09/hello-world/</id>
    <summary type="html">&lt;p&gt;Somewhere to dump notes. Production postmortems, opinions about
tooling that get repeated in Slack often enough to deserve a
permalink, the occasional bit of postgres or kubernetes pathology
worth being able to find again in two years when it bites someone
else. So: a blog. About time.&lt;/p&gt;

</summary>
    <content type="html" xml:base="https://alien2003.github.io/2024/09/hello-world/">&lt;p&gt;Somewhere to dump notes. Production postmortems, opinions about
tooling that get repeated in Slack often enough to deserve a
permalink, the occasional bit of postgres or kubernetes pathology
worth being able to find again in two years when it bites someone
else. So: a blog. About time.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;why-windows-95&quot;&gt;Why Windows 95?&lt;/h2&gt;

&lt;p&gt;The short version: I’m tired of how modern desktop UX has gone, and
this is a small protest in CSS.&lt;/p&gt;

&lt;p&gt;The look pulls me past the screen too. The whole
late-90s/early-2000s media palette. CRT glow, VHS scanlines, the
soft chroma bleed of a tape recording, glossy WordArt logos
slapped onto everything from CD-ROM splash screens to magazine
ads. Same feel, same comfort. For weekend amusement I keep a
Windows 98 SE install in QEMU inside a Distrobox container on my
Steam Deck, browsing &lt;a href=&quot;https://protoweb.org/&quot;&gt;Protoweb&lt;/a&gt; through Netscape
Communicator. It’s silly. It also loads pages in under 100 ms.&lt;/p&gt;

&lt;p&gt;Mobile-first flat design moved into desktop software around 2012
and never left. Material, Metro, Fluent, and whatever the current
Apple language is called this quarter. Hamburger menus on a 27-inch
monitor. Affordances erased in the name of “clean”, which is to
say: you can no longer tell what is clickable without hovering;
tooltips disappeared because they were “cluttered”; the dropdown
that used to be a dropdown is now a slide-out panel three layers
deep. Settings move between releases on the assumption that nobody
had memorised where they were, and as somebody who operates
production systems for a living, I had memorised where they were.
The result is software that looks like a marketing site and
operates like one too. Consistent and cozy beats slick and
rearranged every quarter.&lt;/p&gt;

&lt;p&gt;Older UI got plenty of things right that the industry has agreed
to forget. Title bars tell you what window you’re in. Buttons look
like buttons. Menus stay where you put them. Keyboard navigation
works because the focus ring is actually visible. The visual
vocabulary is small and closed, which means there is no sprint
spent fiddling with gradients instead of writing. So:
&lt;a href=&quot;https://github.com/jdan/98.css&quot;&gt;98.css&lt;/a&gt;, vendored locally, no JS framework, no design
system, no animation budget. Mostly text on grey.&lt;/p&gt;

&lt;p&gt;This is a DevOps blog. Most posts are long-form notes on production
failures, postgres migrations gone sideways, kubelet internals,
things learned the hard way and worth writing down before they’re
forgotten. The chrome is meant to get out of the way of that.&lt;/p&gt;

&lt;h2 id=&quot;whats-under-the-hood&quot;&gt;What’s under the hood&lt;/h2&gt;

&lt;p&gt;Plain Jekyll, no theme gem, hand-built layouts. &lt;a href=&quot;https://github.com/jdan/98.css&quot;&gt;98.css&lt;/a&gt; is
vendored locally so the cloud build doesn’t depend on a CDN that
might disappear. Two plugins, both on the GitHub Pages allowlist
(&lt;a href=&quot;https://github.com/jekyll/jekyll-seo-tag&quot;&gt;jekyll-seo-tag&lt;/a&gt; and &lt;a href=&quot;https://github.com/jekyll/jekyll-sitemap&quot;&gt;jekyll-sitemap&lt;/a&gt;) so the local
build matches what GitHub serves. Icons are real
&lt;a href=&quot;https://github.com/trapd00r/win95-winxp_icons&quot;&gt;Win95/XP &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.ico&lt;/code&gt; files&lt;/a&gt;, not SVGs faking it.&lt;/p&gt;

&lt;p&gt;JavaScript footprint is roughly nothing. A clock in the taskbar
and a start-menu toggle. With JS off, the clock stops and
everything else works.&lt;/p&gt;

&lt;p&gt;The font setup is the one bit of actual fussiness. Window chrome
(title bars, taskbar, top nav) renders in Pixelated MS Sans Serif
at 11–12 px with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image-rendering: pixelated&lt;/code&gt; and font smoothing
off; rounded antialiasing on a chunky pixel font looks wrong. Post
content drops back to Tahoma/Verdana with smoothing on, like a
Notepad/WordPad-era text window. Two stacks, one site, on purpose.&lt;/p&gt;

&lt;h2 id=&quot;smoke-test&quot;&gt;Smoke test&lt;/h2&gt;

&lt;p&gt;Mostly here so a Jekyll upgrade or a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_sass&lt;/code&gt; change that breaks
rendering surfaces immediately. Skip otherwise.&lt;/p&gt;

&lt;p&gt;A bulleted list:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Item one.&lt;/li&gt;
  &lt;li&gt;Item two with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inline_code&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Item three.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A code block:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-euo&lt;/span&gt; pipefail
&lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;ns &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;kubectl get ns &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; name&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do
  &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Checking &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ns&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;...&quot;&lt;/span&gt;
  kubectl &lt;span class=&quot;nt&quot;&gt;--namespace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;#namespace/&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; get pods &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--field-selector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;status.phase&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;Running
&lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A blockquote:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Software is a gas. It expands to fill its container.
Containers should therefore be small.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A table:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Tool&lt;/th&gt;
      &lt;th&gt;Use case&lt;/th&gt;
      &lt;th&gt;Mood&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;All of it&lt;/td&gt;
      &lt;td&gt;Resigned&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;State you can name&lt;/td&gt;
      &lt;td&gt;Cautious&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pulumi&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;State you can debug&lt;/td&gt;
      &lt;td&gt;Hopeful&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;awk&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Text in a pinch&lt;/td&gt;
      &lt;td&gt;Affectionate&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;A keystroke: &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;R&lt;/kbd&gt; reloads the page.&lt;/p&gt;

&lt;p&gt;If the syntax highlighting, table borders, blockquote tooltip-yellow,
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;kbd&amp;gt;&lt;/code&gt; chrome all render right, the theme is fine. Click the
X in the title bar to close the window.&lt;/p&gt;

</content>
    <category term="meta"/><category term="jekyll"/>
  </entry>
  
</feed>
