Jingxiao Cai's Blog
Technical blog by Jingxiao Cai about ML infrastructure, distributed systems, self-hosted AI agents, debugging, automation, and production engineering.
Blog Posts
May 11, 2026
Categories: ai-agents, automation, alerting, reliability, openclaw, agent-ops
A technically true row-count alert became an alert-tuning lesson: record weak proxy crossings, but page only when they combine with real pressure.
May 10, 2026
Categories: ai-agents, coding-agents, openclaw, gemini, reliability, devops
A degraded coding-agent lane is not automatically a local repair task; first classify the state as passive watch, upstream wait, or a narrow adapter fix.
March 29, 2026 · Updated May 10, 2026
Categories: openclaw, ai-agents, gemini, reliability, devops, llm-ops
When Gemini route health drifts, the hard part is not picking a random next model. It is classifying capacity, auth, upstream, and adapter failures before changing fallback policy.
May 9, 2026
Categories: ai-agents, automation, debugging, reliability, openclaw, agent-ops
A daily scan job generated its report, but the final delivery side effect failed; the recovery pattern was to replay the saved artifact instead of rerunning the whole workflow.
March 19, 2026 · Updated May 9, 2026
Categories: openclaw, ai-agents, self-hosted, memory, embeddings, devops
How I got OpenClaw local memory search working on a small ARM VPS, now with safer rollout, session-list fast-path lessons, source hygiene, and stricter active-memory canary promotion gates.
May 8, 2026
Categories: openclaw, automation, markdown, regression-testing, ai-agents, writing
A generated Markdown report failed over one heading-level jump; the durable fix was testing each rendered output surface as its own artifact contract.
February 23, 2026 · Updated May 8, 2026
Categories: openclaw, bugs, ai-agents, generated-markdown
A historical custom-skill loading bug, refreshed with a generated-Markdown regression lesson and a standalone artifact-contract follow-up.
May 7, 2026
Categories: ai-agents, automation, discord, reliability, workflow, openclaw
Detaching long-running agent work is useful only when admission, work ownership, and final delivery all have explicit contracts.
May 6, 2026
Categories: sqlite, openclaw, ai-agents, incident-response, debugging
A self-hosted agent-ops debugging story: raw SQLite can still see rows while the runtime registry restore fails, so reproduce on copies before touching production.
March 11, 2026 · Updated May 6, 2026
Categories: openclaw, devops, ai-agents, configuration, gateway, reliability
Some OpenClaw config changes apply live. Others trigger gateway restarts. Now updated with rollback, health-monitor, and task-registry restore-gap lessons.
May 2, 2026
Categories: openclaw, ai-agents, performance, devops, control-plane
A self-hosted agent-gateway performance lesson: if a tiny session-list request builds hundreds of rich rows before filtering, limit is too late to save you.
May 1, 2026
Categories: ai-agents, automation, workflow, human-in-the-loop, decision-making
A lightweight agent-operations pattern for closing external threads cleanly: make constraints explicit, record the decision, finish the action, and define reopen criteria.
April 30, 2026
Categories: openclaw, ai-agents, release-engineering, reliability, rollback, devops
Why self-hosted AI-agent updates need production-deployment discipline: preflight, backup, staged rollout, human activation, adoption scans, and verification.
April 30, 2026
Categories: tooling, ai-agents, workflow, design-tools, llm-ops, validation
Why design-tool integrations need capability gates before LLM generation: validate inputs, route readiness, model config, and artifact proof early.
April 29, 2026
Categories: ai-agents, openclaw, skills, workflow, maintenance, governance
How I modernized agent skills with problem-first discovery, intake gates, thin wrappers, package hygiene, capability gates, and consolidation instead of skill sprawl.
April 29, 2026
Categories: ai-agents, security, tooling, reliability, openclaw, auth
Why AI-agent tool launches should prove auth intent, isolate ambient credentials, check route readiness, and block before side effects when the launch contract is unhealthy.
April 8, 2026 · Updated April 29, 2026
Categories: openclaw, devops, ai-agents, staging, release-engineering, self-hosted
An OpenClaw stage-environment pattern for a small VPS: fail-closed testing, zero-production-secret bootstrap, detect-only catalog refresh, and a mock-to-real-to-higher-risk ladder.
April 3, 2026 · Updated April 29, 2026
Categories: openclaw, ai-agents, llm, orchestration, devops, multi-model-review
How I turned multi-model consultation into a config-backed OpenClaw workflow with launch guards, watchdog-backed waiting states, bridge-back final delivery contracts, and user-visible dissent that survives partial failure.
February 25, 2026 · Updated April 29, 2026
Categories: tutorial, oauth, vps, devops, automation, google-cloud
Complete tutorial on OAuth 2.0 for headless servers, now with a fail-closed readiness gate that links OAuth-backed automation checks to the broader agent-launch gate pattern.
April 27, 2026
Categories: ai-agents, automation, reliability, cron, devops, openclaw
Why AI cron reliability needs exact-exec drivers: artifact-first execution, explicit timeout budgets, and deterministic delivery instead of freeform agent prompts.
March 2, 2026 · Updated April 27, 2026
Categories: ai-agents, devops, automation, openclaw, security
My AI agent runs autonomous cron jobs every night: security audits, health checks, and documentation. Now updated with the exact-exec driver lesson that prevents false-negative wrapper alerts from hiding real command success.
March 27, 2026 · Updated April 8, 2026
Categories: devops, automation, cron, infrastructure-as-code, drift-detection, openclaw
How I built a declarative change propagation system for cron automation: manifest-driven updates, contract-derived documentation blocks, and validation that keeps desired state from quietly drifting. Now with a stage-validation ladder from mock to real to a narrow higher-risk lane.
February 23, 2026 · Updated April 3, 2026
Categories: security, openclaw, ai-agents, supply-chain, moltbook
A deep dive into malicious skills in AI agent platforms, now updated with the LiteLLM incident, the first named downstream victim report, LangChain/LangGraph vulnerabilities, exposed Ollama servers, and an approved-but-blocked lesson from Moltbook's unstable post-incident API surface.
April 2, 2026
Categories: openclaw, devops, ai-agents, incident-response, rollback, reliability
A clean OpenClaw upgrade passed startup checks but regressed under real use. This incident report covers the rollback, the verifier false alarm, and the upgrade guardrails I kept.
March 28, 2026
Categories: security, openclaw, ai-agents, supply-chain, python, incident-response
LiteLLM 1.82.7–1.82.8 were malicious PyPI releases tied to a compromised Trivy CI/CD path. This operator-focused write-up covers the fast audit, rotation, inspection, and pinning checklist for OpenClaw users.
March 27, 2026 · Updated March 28, 2026
Categories: openclaw, ai-agents, memory, embeddings, debugging, devops
After proving local memory search worked, I stabilized a remote memory-only lane in OpenClaw. The follow-up reinforced the same lesson: source discipline, lexical anchors, and hybrid retrieval mattered more than another round of model churn.
February 24, 2026 · Updated March 16, 2026
Categories: llm, debugging, bailian, claude, pricing
A debugging story about hidden platform caps, now updated with Anthropic's March 2026 flat 1M Claude pricing change and why cheaper long context still doesn't eliminate API-level ceilings.
March 7, 2026
Categories: ai-agents, devops, recovery, openclaw, automation, safety
I run autonomous cron jobs with no built-in undo capability. When Moltbook's community started talking about recovery primitives, I realized that unattended automation needs a stronger recovery story.
March 3, 2026
Categories: writing, security, blogging, opsec, technical-writing
I created a sanitization checklist after nearly publishing sensitive deployment details. Here's what to redact, what to keep, and validation scripts for technical bloggers.
February 21, 2026
Categories: ai, automation, gmail
How I built an AI assistant to automate my morning email routine.
February 18, 2026
Categories: ai-agents, troubleshooting
Common issues and solutions when working with AI agent skills.
February 18, 2026
Categories: cloud, onedrive, multcloud
How I migrated my cloud storage using MultCloud.
February 18, 2026
Categories: api, google, tutorial
A comprehensive guide to setting up Google APIs for your projects.
February 17, 2026
Categories: intro, personal
After several years in industry, I'm starting to share my learnings, thoughts, and experiences...
February 17, 2026
Categories: blog, tech
I used to run a blog on WordPress back in 2016-2017 during my PhD years at the University of Oklahoma...