Jingxiao Cai's Blog | ML Infrastructure, Distributed Systems, Self-Hosted AI

Technical blog by Jingxiao Cai about ML infrastructure, distributed systems, self-hosted AI agents, debugging, automation, and production engineering.

Blog Posts

When a True Alert Is Still the Wrong Page: An Agent-Ops Threshold Lesson

May 11, 2026

Categories: ai-agents, automation, alerting, reliability, openclaw, agent-ops

A technically true row-count alert became an alert-tuning lesson: record weak proxy crossings, but page only when they combine with real pressure.

When a Coding-Agent Route Drifts: Closing the Loop Without Premature Fixes

May 10, 2026

Categories: ai-agents, coding-agents, openclaw, gemini, reliability, devops

A degraded coding-agent lane is not automatically a local repair task; first classify the state as passive watch, upstream wait, or a narrow adapter fix.

Handling Gemini Capacity Exhaustion: Fallback Lanes for Reliable Agent Workflows

March 29, 2026 · Updated May 10, 2026

Categories: openclaw, ai-agents, gemini, reliability, devops, llm-ops

When Gemini route health drifts, the hard part is not picking a random next model. It is classifying capacity, auth, upstream, and adapter failures before changing fallback policy.

When the Report Exists but Delivery Failed: An Agent-Ops Triage Pattern

May 9, 2026

Categories: ai-agents, automation, debugging, reliability, openclaw, agent-ops

A daily scan job generated its report, but the final delivery side effect failed; the recovery pattern was to replay the saved artifact instead of rerunning the whole workflow.

Local Semantic Memory on a 4-Core ARM VPS: How I Got OpenClaw Memory Search Working Without External APIs

March 19, 2026 · Updated May 9, 2026

Categories: openclaw, ai-agents, self-hosted, memory, embeddings, devops

How I got OpenClaw local memory search working on a small ARM VPS, now with safer rollout, session-list fast-path lessons, source hygiene, and stricter active-memory canary promotion gates.

One Heading Level Broke the Nightly Build: Fixing Markdown Drift in Generated REM Reports

May 8, 2026

Categories: openclaw, automation, markdown, regression-testing, ai-agents, writing

A generated Markdown report failed over one heading-level jump; the durable fix was testing each rendered output surface as its own artifact contract.

Why Custom Skills Don't Load in OpenClaw - A Historical Bug and Follow-Up

February 23, 2026 · Updated May 8, 2026

Categories: openclaw, bugs, ai-agents, generated-markdown

A historical custom-skill loading bug, refreshed with a generated-Markdown regression lesson and a standalone artifact-contract follow-up.

Long-Running Agent Work Needs a Bridge Back, Not Just a Background Thread

May 7, 2026

Categories: ai-agents, automation, discord, reliability, workflow, openclaw

Detaching long-running agent work is useful only when admission, work ownership, and final delivery all have explicit contracts.

When SQLite Looks Empty but Isn’t: Reproducing Corrupt Task Registries Without Touching Prod

May 6, 2026

Categories: sqlite, openclaw, ai-agents, incident-response, debugging

A self-hosted agent-ops debugging story: raw SQLite can still see rows while the runtime registry restore fails, so reproduce on copies before touching production.

Gateway Restart Behavior: What OpenClaw Users Need to Know About Config Changes

March 11, 2026 · Updated May 6, 2026

Categories: openclaw, devops, ai-agents, configuration, gateway, reliability

Some OpenClaw config changes apply live. Others trigger gateway restarts. Now updated with rollback, health-monitor, and task-registry restore-gap lessons.

The 10-Second Session List: Why Prefiltering Before Row Build Matters in Agent Gateways

May 2, 2026

Categories: openclaw, ai-agents, performance, devops, control-plane

A self-hosted agent-gateway performance lesson: if a tiny session-list request builds hundreds of rich rows before filtering, the `limit` is applied too late to save you.

Closing External Threads Cleanly: An Agent-Ops Pattern

May 1, 2026

Categories: ai-agents, automation, workflow, human-in-the-loop, decision-making

A lightweight agent-operations pattern for closing external threads cleanly: make constraints explicit, record the decision, finish the action, and define reopen criteria.

Treating AI Agent Updates Like Production Deployments: The Runbook Keeps Paying Off

April 30, 2026

Categories: openclaw, ai-agents, release-engineering, reliability, rollback, devops

Why self-hosted AI-agent updates need production-deployment discipline: preflight, backup, staged rollout, human activation, adoption scans, and verification.

Design-Tool Integrations Need Capability Gates: Lessons from a Missing LLM Config

April 30, 2026

Categories: tooling, ai-agents, workflow, design-tools, llm-ops, validation

Why design-tool integrations need capability gates before LLM generation: validate inputs, route readiness, model config, and artifact proof early.

Modernizing Agent Skills Without Growing a Skill Jungle

April 29, 2026

Categories: ai-agents, openclaw, skills, workflow, maintenance, governance

How I modernized agent skills with problem-first discovery, intake gates, thin wrappers, package hygiene, capability gates, and consolidation instead of skill sprawl.

Fail-Closing Agent Launches: Why Auth and Readiness Gates Should Block Before Tooling Starts

April 29, 2026

Categories: ai-agents, security, tooling, reliability, openclaw, auth

Why AI-agent tool launches should prove auth intent, isolate ambient credentials, check route readiness, and block before side effects when the launch contract is unhealthy.

Building Fail-Closed Stage Environments for AI Agents on a Small VPS

April 8, 2026 · Updated April 29, 2026

Categories: openclaw, devops, ai-agents, staging, release-engineering, self-hosted

An OpenClaw stage-environment pattern for a small VPS: fail-closed testing, zero-production-secret bootstrap, detect-only catalog refresh, and a mock-to-real-to-higher-risk ladder.

LLM Panel Orchestration in OpenClaw: Config-Backed Routing, Timeout Classes, and Honest Dissent Without Chaos

April 3, 2026 · Updated April 29, 2026

Categories: openclaw, ai-agents, llm, orchestration, devops, multi-model-review

How I turned multi-model consultation into a config-backed OpenClaw workflow with launch guards, watchdog-backed waiting states, bridge-back final delivery contracts, and user-visible dissent that survives partial failure.

VPS OAuth Survival Guide: Google APIs Without a Browser

February 25, 2026 · Updated April 29, 2026

Categories: tutorial, oauth, vps, devops, automation, google-cloud

Complete tutorial on OAuth 2.0 for headless servers, now with a fail-closed readiness gate that links OAuth-backed automation checks to the broader agent-launch gate pattern.

Why AI Cron Jobs Need Exact-Exec Drivers Instead of Freeform Agent Prompts

April 27, 2026

Categories: ai-agents, automation, reliability, cron, devops, openclaw

Why AI cron reliability needs exact-exec drivers: artifact-first execution, explicit timeout budgets, and deterministic delivery instead of freeform agent prompts.

The Nightly Build: How My Agent Runs Security Audits While I Sleep

March 2, 2026 · Updated April 27, 2026

Categories: ai-agents, devops, automation, openclaw, security

My AI agent runs autonomous cron jobs every night: security audits, health checks, and documentation. Now updated with the exact-exec driver lesson that prevents false-negative wrapper alerts from hiding real command success.

Declarative Change Propagation: How I Built a Self-Documenting Cron System

March 27, 2026 · Updated April 8, 2026

Categories: devops, automation, cron, infrastructure-as-code, drift-detection, openclaw

How I built a declarative change propagation system for cron automation: manifest-driven updates, contract-derived documentation blocks, and validation that keeps desired state from quietly drifting. Now with a stage-validation ladder from mock to real to a narrow higher-risk lane.

The Supply Chain Attack on AI Agents: What OpenClaw Users Need to Know

February 23, 2026 · Updated April 3, 2026

Categories: security, openclaw, ai-agents, supply-chain, moltbook

A deep dive into malicious skills in AI agent platforms, now updated with the LiteLLM incident, the first named downstream victim report, LangChain/LangGraph vulnerabilities, exposed Ollama servers, and an approved-but-blocked lesson from Moltbook's unstable post-incident API surface.

When Startup Checks Lie: Rolling Back an OpenClaw Runtime Regression

April 2, 2026

Categories: openclaw, devops, ai-agents, incident-response, rollback, reliability

A clean OpenClaw upgrade passed startup checks but regressed under real use. This incident report covers the rollback, the verifier false alarm, and the upgrade guardrails I kept.

The LiteLLM Supply Chain Attack: What OpenClaw Users Need to Know

March 28, 2026

Categories: security, openclaw, ai-agents, supply-chain, python, incident-response

LiteLLM 1.82.7–1.82.8 were malicious PyPI releases tied to a compromised Trivy CI/CD path. This operator-focused write-up covers the fast audit, rotation, inspection, and pinning checklist for OpenClaw users.

Bigger Embeddings ≠ Better Memory: Why I Chose `text-embedding-3-small` for OpenClaw Remote Memory

March 27, 2026 · Updated March 28, 2026

Categories: openclaw, ai-agents, memory, embeddings, debugging, devops

After proving local memory search worked, I stabilized a remote memory-only lane in OpenClaw. The follow-up reinforced the same lesson: source discipline, lexical anchors, and hybrid retrieval mattered more than another round of model churn.

The Hidden Input Limit: When "202K Context" Doesn't Mean 202K

February 24, 2026 · Updated March 16, 2026

Categories: llm, debugging, bailian, claude, pricing

A debugging story about hidden platform caps, now updated with Anthropic's March 2026 flat 1M Claude pricing change and why cheaper long context still doesn't eliminate API-level ceilings.

The Recovery Problem: Why Your AI Agent Needs an Undo Button

March 7, 2026

Categories: ai-agents, devops, recovery, openclaw, automation, safety

I run autonomous cron jobs with no built-in undo capability. When Moltbook's community started talking about recovery primitives, I realized that unattended automation needs a stronger recovery story.

Blog Post Sanitization Checklist: What to Redact Before Publishing

March 3, 2026

Categories: writing, security, blogging, opsec, technical-writing

I created a sanitization checklist after nearly publishing sensitive deployment details. Here's what to redact and what to keep, plus validation scripts for technical bloggers.