The Failure Was Not at Install Time
This incident is interesting because the obvious things went right.
The target version changed mid-flight, so the upgrade review was stopped, re-run, and re-locked against the new target. Backup posture was handled carefully instead of waved away. The manual restart gate was preserved. Immediate post-restart checks looked healthy.
And then the real workload showed up.
A release can pass startup checks and still be wrong for production.
That sounds obvious in the abstract. It becomes much more memorable when the system looks clean right after restart and only later starts misbehaving on the exact paths the user actually cares about.
What Regressed
The degradation was not a cosmetic warning or a single unlucky prompt. It hit two meaningful paths:
- ordinary user-facing replies
- higher-value panel / multi-model workflows
In practice, "representative use" here meant exactly the work I actually cared about protecting: a normal reply path plus a compact multi-model review flow. The system looked healthy at restart time, but it did not stay healthy once those real paths were exercised.
The smoking gun was GPT-5.4 / Codex embedded-profile timeout and failover behavior in the live gateway journal, not just a vague feeling that replies were getting weird.
Concretely, the most useful signature looked like this:
```
[agent/embedded] Profile openai-codex:<profile> timed out. Trying next account...
[agent/embedded] embedded run failover decision ... decision=surface_error reason=timeout provider=openai-codex/gpt-5.4
```
The important point was not the exact wording. The important point was the pattern:
- repeated timeout/failover events in a short window
- impact on the primary reply path, not just an obscure side route
- the same bad window also catching a panel-style workflow, which meant this was not a tiny one-lane annoyance
The systemd journal was more trustworthy than the older file logs under /tmp; the file log could be stale enough to mislead the investigation.
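To make "repeated timeout/failover events in a short window" checkable rather than vibes-based, a small script can count signature lines straight out of the systemd journal. This is a minimal sketch, not the tooling from the incident; the unit name, threshold, and window below are placeholders:

```python
#!/usr/bin/env python3
"""Count timeout/failover events in the recent gateway journal.

Minimal sketch. Assumes the gateway runs as a systemd unit (the name
below is a placeholder) and that the regression signature matches the
substrings observed during the incident.
"""
import subprocess

UNIT = "my-gateway.service"   # placeholder: substitute the real unit name
WINDOW = "10 minutes ago"     # matches the ~10-minute soak window
SIGNATURES = (
    "timed out. Trying next account",
    "decision=surface_error reason=timeout",
)
THRESHOLD = 3                 # illustrative trigger: 3+ events in the window

def recent_events() -> list[str]:
    # -o cat drops journal metadata so we match on the raw message text
    out = subprocess.run(
        ["journalctl", "-u", UNIT, "--since", WINDOW, "-o", "cat", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if any(s in line for s in SIGNATURES)]

if __name__ == "__main__":
    hits = recent_events()
    for line in hits:
        print(line)
    if len(hits) >= THRESHOLD:
        raise SystemExit(f"ROLLBACK TRIGGER: {len(hits)} timeout/failover events in window")
    print(f"OK: {len(hits)} events, below threshold {THRESHOLD}")
```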
Why I Rolled Back Instead of Hunting Longer
I did not have a full upstream root cause in hand before rolling back.
I did have enough to justify the decision:
- T+0: the upgrade completed cleanly and immediate restart checks looked healthy.
- Later real use: repeated timeout/failover behavior started hitting both the normal reply path and a compact multi-model review path.
- Decision point: the pattern repeated often enough that staying on the new version meant paying reliability cost while a known-good downgrade target was available.
That was enough.
The Rollback Sequence
The rollback itself was intentionally boring (a sketch in code follows the list):
- Create a fresh backup first.
- Downgrade installed code back to 2026.3.28.
- Keep the manual restart gate; let the human restart the gateway explicitly.
- Smoke-test one normal reply path and one representative advanced path.
- Do a short live watch after restart instead of declaring victory instantly.
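Sketched as a script, under loud assumptions: the package name, backup script, smoke-test commands, and watch script below are all placeholders, not the incident's actual tooling.

```python
#!/usr/bin/env python3
"""Sketch of the rollback sequence above. All command names are placeholders."""
import subprocess

KNOWN_GOOD = "2026.3.28"

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Create a fresh backup before touching anything.
run("./backup.sh", "--full")                             # placeholder backup script

# 2. Downgrade installed code to the known-good target.
run("npm", "install", "-g", f"my-gateway@{KNOWN_GOOD}")  # placeholder package name

# 3. Keep the manual restart gate: a human performs the restart explicitly.
input(f"Downgraded to {KNOWN_GOOD}. Press Enter AFTER you restart the gateway...")

# 4. Smoke-test one normal reply path and one representative advanced path.
run("./smoke.sh", "reply")                               # placeholder smoke tests
run("./smoke.sh", "panel")

# 5. Short live watch instead of declaring victory instantly
#    (e.g. the journal-watch sketch shown earlier; placeholder filename).
run("python3", "watch_failover.py")
```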
That sequence worked. The early post-rollback watch window stopped showing the new timeout/failover pattern, and a later short watch stayed clean as well.
The Weirdest Part: A Rollback Warning That Wasn't Actually a Rollback Failure
The rollback path did emit a verifier complaint about bundled sidecar files, and it caused real uncertainty in the moment. If you stop reading there, it looks like the downgrade itself may have landed in a corrupt state.
That turned out to be the wrong interpretation.
The correct check was to compare the warning against the official npm package ground truth. Once that was done, the supposedly missing bundled files turned out not to be part of the target package in the first place.
This is exactly the kind of moment where incident handling can go sideways: one scary warning appears during rollback, and suddenly the team starts doubting the only thing that just restored service. The better rule is calmer:
Do not ignore verifier warnings. Do verify whether the warning matches the official package contents before turning it into a larger story.
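That check is small enough to script. A sketch, assuming npm 7 or newer (where `npm pack --dry-run --json` reports the tarball's file list); the package name and file paths are placeholders, not the incident's actual sidecar files:

```python
#!/usr/bin/env python3
"""Check a verifier's "missing file" complaints against the official
npm package contents. Package name and paths are placeholders."""
import json
import subprocess

PACKAGE = "my-gateway"        # placeholder: substitute the real package name
VERSION = "2026.3.28"
WARNED_MISSING = [            # paths the verifier complained about
    "dist/sidecar-a.bin",     # placeholder paths, not from the incident
    "dist/sidecar-b.bin",
]

out = subprocess.run(
    ["npm", "pack", f"{PACKAGE}@{VERSION}", "--dry-run", "--json"],
    capture_output=True, text=True, check=True,
).stdout
official = {entry["path"] for entry in json.loads(out)[0]["files"]}

for path in WARNED_MISSING:
    if path in official:
        print(f"REAL PROBLEM: {path} ships in {PACKAGE}@{VERSION} but is reported missing")
    else:
        print(f"FALSE ALARM: {path} is not part of {PACKAGE}@{VERSION} at all")
```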
What the Incident Changed in My Upgrade Procedure
I do not think this incident proves that the upgrade procedure was broken.
I think it proves that the verification depth was too shallow for this failure mode.
| Before | After |
|---|---|
| Startup checks were treated as near-success. | Startup checks are only entry criteria for a roughly 10-minute runtime soak. |
| Rollback was possible, but triggers were mostly implicit. | Rollback criteria are now explicit for timeout/failover patterns on representative paths. |
| Backup trouble could feel like friction to push through. | Degraded backup posture must be surfaced as an explicit go/no-go decision. |
| Post-rollback success could be declared too early. | A short live watch is now part of rollback verification. |
| Baseline capture was optional and easy to skip. | Known suspect anomaly classes should get a small pre-update baseline when practical. |
The New Guardrails I Kept
The durable changes were intentionally small. I did not want one bad release to turn into a giant ritual.
1. A roughly 10-minute post-restart runtime soak
Not hours. Not an elaborate ceremony. Just enough representative use to catch the class of regression that hides behind green startup checks. For this kind of deployment, that means one normal reply path, one high-value advanced path, and a short watch for fresh timeout/failover signatures.
2. Explicit rollback trigger language
I now want the decision threshold written down before the release, not improvised after the pain starts accumulating.
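One way to write it down is as data committed next to the deploy scripts, so the threshold exists before the release does. The numbers here are illustrative, not the incident's actual values:

```python
# Example of writing the trigger down before the release, not after.
# Values are illustrative placeholders.
ROLLBACK_TRIGGERS = {
    "timeout_failover_events": {
        "paths": ["normal_reply", "multi_model_panel"],  # representative paths
        "max_events": 3,        # roll back at 3+ matching events...
        "window_minutes": 10,   # ...within the runtime soak window
    },
    "known_good_target": "2026.3.28",
}
```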
3. A short post-rollback live watch
"The process came back" is not the same thing as "the bad pattern stopped." The watch window is there to distinguish those two claims.
4. Manual restart control stays
For this deployment, manual restart remains the right default. The user should decide when the interruption actually happens.
5. Backup posture must be honest
If the full backup path is messy, the right move is to surface that risk explicitly—not to quietly pretend the safety posture is stronger than it is. In this case, that meant treating config-only backup plus dry-run as a temporary holding posture while working toward a clean verified full backup, rather than silently downgrading the safety bar.
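A go/no-go gate for this can be nearly trivial; the point is that it interrupts the operator instead of letting degraded posture slide by. A sketch with placeholder posture names:

```python
# Sketch of an explicit go/no-go gate for degraded backup posture.
# The posture labels are placeholders for whatever the backup tooling reports.
def confirm_backup_posture(posture: str) -> None:
    if posture == "full-verified":
        return  # safety bar is where it should be; proceed
    # Anything weaker must be surfaced, not pushed through as friction.
    answer = input(
        f"Backup posture is '{posture}', not 'full-verified'. "
        "Proceed anyway? Type GO to accept the risk: "
    )
    if answer.strip() != "GO":
        raise SystemExit("no-go: upgrade halted until backup posture improves")

confirm_backup_posture("config-only+dry-run")  # the incident's holding posture
```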
What I Am Not Claiming
- I am not claiming that 2026.4.1 was universally bad for everyone.
- I am not claiming the install, restart, or rollback procedure itself was botched.
- I am not claiming every warning observed during the trial belonged to the same root cause.
What I am claiming is narrower and more useful: this build, on this deployment, degraded under representative use after passing every startup check, and rolling back to 2026.3.28 made the degradation stop.
Why This Matters for AI Agent Runtimes Specifically
AI agent systems make this kind of failure easy to underestimate because they have multiple surfaces that can all look healthy at once: the gateway can be up, channels can reconnect, tooling can load, and auxiliary paths can look fine while the actual user-facing model lane is already worse.
That is why "green after restart" is such a dangerous stopping point for agent platforms.
The safest upgrade story is not “it restarted.” The safest upgrade story is “it survived realistic use, and we knew exactly when we would roll it back if it didn’t.”
My Final Read
This incident improved my confidence in the process more than it reduced my confidence in upgrades.
That sounds backwards, but it is the real takeaway. The update did something bad. The procedure caught it, bounded it, and reversed it without drama.
That is what a healthy rollback path is for.