The Hard Part Was Never “Ask More Models”
Running a full panel of LLMs in parallel is not the interesting problem. Spawning the children is easy. Keeping the result coherent is the real work.
The failures I kept hitting were never glamorous:
- one expected panelist quietly did not get launched
- an ACP-backed lane finished, but the answer did not auto-surface to the parent thread
- a partial result arrived, then the orchestrator went silent
- a slow non-core model held the whole panel hostage
- a wildcard dissent looked louder than it deserved because the weighting rules stayed implicit
- late completions arrived after the real answer was already delivered
The real unit of work is not “query 8 models.” It is “finish one user-visible review exactly once, with lineup integrity, visible weighting, and bounded latency.”
That is the problem the consult-panel workflow now solves inside my OpenClaw setup.
The Design Decision That Saved the Whole Thing
The best decision was architectural, not prompt-related.
The public invocation surface stays deliberately small. Natural requests like "consult the panel," "panel review," "have the consultants weigh in," or "run jury mode" map into the same orchestration layer, while the routing truth stays in durable config and ops docs rather than inside the prompt wrapper.
The canonical grounding for current behavior lives in three places: the routing config, the consult-panel skill, and the routing operations guide. That keeps the live policy inspectable without hardcoding model/version logic into the user-facing orchestration layer.
Keep the skill thin. Keep the routing truth in config and ops docs.
The user-facing skill exists to do orchestration:
- detect that the user wants a panel review
- resolve the configured lineup
- shape the review packet
- spawn the panel
- track completions and synthesize the result
Every panelist gets the same review packet. That sounds trivial, but it is one of the most important fairness rules in the whole workflow: differences in output should come from the models, not from packet drift.
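The shared-packet rule can be enforced by construction rather than by discipline. This is a minimal sketch (the type, function, and config names are illustrative, not the real skill's API): the packet is frozen once shaped, and every panelist receives the identical object.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReviewPacket:
    # Frozen: once shaped, no panelist-specific mutation can sneak in.
    question: str
    context: str

def resolve_lineup(config: dict, mode: str) -> list[str]:
    # Lineup is read from config at spawn time, never reconstructed from memory.
    return list(config["modes"][mode]["panelists"])

def spawn_panel(lineup: list[str], packet: ReviewPacket) -> dict[str, ReviewPacket]:
    # Every panelist receives the identical packet object: no packet drift.
    return {role: packet for role in lineup}

config = {"modes": {"jury": {"panelists": ["primary_consultant", "stable_core_critic_a"]}}}
packet = ReviewPacket(question="Ship this change?", context="review packet body")
jobs = spawn_panel(resolve_lineup(config, "jury"), packet)
```

Freezing the dataclass means any later attempt to tweak the packet for one panelist raises an error instead of silently introducing drift.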
What it does not do is hardcode model/version policy. That lives in durable config and the ops guide instead:
- routing config
- operations guide
- lineup resolver
- packet-size gate
That split matters because it keeps model churn from leaking into workflow prose. If the primary lane changes, or the ACP secondary policy changes, or the fixed experimental lane gets replaced, I want to edit config—not rewrite orchestration logic in six places.
consult-panel, not jury. “Jury” is a useful mode name, but it is a bad public skill name because it leaks internal taxonomy into the invocation surface.
What the Current Panel Actually Looks Like
The live jury lineup is not “eight random models.” The default active panel is currently seven voices with different jobs, different timeout leashes, and different interpretive weight.
| Panelist | Role | Class | Why it exists |
|---|---|---|---|
| Primary reasoning consultant | Primary consultant | stable_core | Main decision anchor for full-panel review. |
| Compact-packet challenger | Challenger consultant | stable_core | First stable-core challenger when the packet is compact enough. |
| Stable core critic A | Core critic | stable_core | Reliable large-context core voice and oversize fallback for the challenger. |
| Stable core critic B | Core critic | stable_core | Independent strong core voice. |
| Diversity critic | Diversity critic | stable_core | Semantic diversity lane without inventing a second wildcard slot. |
| Fixed experimental panelist | Fixed experimental panelist | fixed_experimental | Meaningful non-core lane with stable identity that can corroborate consensus or widen a real split. |
| Rotating wildcard endpoint | Rotating wildcard panelist | rotating_wildcard | Exploratory dissent / novelty lane, never decisive by itself. |
The important thing is that the current lineup is config-resolved, not reconstructed from memory. That is how I stopped “full panel” from quietly meaning four models one week and seven the next.
I also stopped pretending a dormant route definition was the same thing as an active default lane. The Gemini ACP path still exists in config, but it is intentionally not part of the default active jury lineup right now because that route became too finicky to treat as a routine full-weight core lane. If it comes back later, it returns explicitly as a nonblocking external lane rather than a hidden substitute.
If the resolved lineup and the accepted spawn set do not match, the run is degraded immediately instead of being presented as a successful full panel. That one rule fixed a surprising amount of quiet orchestration dishonesty.
The lineup is resolved with a single call like `resolve_panel_shape(mode="jury")`. If the expected lineup and the actual spawn set differ, the run is degraded. I do not pretend it was a full panel.
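A minimal version of that integrity rule is just a set comparison at launch time. This sketch uses hypothetical names; the point is that any mismatch flips the run's status immediately rather than being discovered at synthesis.

```python
def check_spawn_integrity(expected: set[str], accepted: set[str]) -> dict:
    # Any mismatch between the resolved lineup and the accepted spawn set
    # downgrades the run immediately; it is never reported as a full panel.
    missing = expected - accepted
    extra = accepted - expected
    status = "full_panel" if not (missing or extra) else "degraded"
    return {"status": status, "missing": sorted(missing), "unexpected": sorted(extra)}

run = check_spawn_integrity(
    expected={"primary", "critic_a", "wildcard"},
    accepted={"primary", "critic_a"},  # wildcard quietly failed to launch
)
```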
The Schema That Matters More Than the Prompt
The routing config now carries three layers of truth:
- role mapping — what each semantic lane currently points to
- mode shape — which panelists belong to `standard`, `jury`, and compatibility aliases like `deepJury`
- panel heuristics — timeout classes, weight bands, checkpoint timing, and benchmark reference guidance
A simplified shape looks like this:
```json
{
  "roles": {
    "primary_consultant": { "preferred": "<primary-reasoning-model>" },
    "optional_external_lane": { "agentId": "<secondary-agent>", "modelAlias": "<stable-secondary-lane>" },
    "fixed_experimental_panelist": { "preferred": "<fixed-experimental-model>" },
    "rotating_wildcard_panelist": { "preferred": "<rotating-wildcard-endpoint>" }
  },
  "modes": {
    "jury": {
      "panelists": [
        "primary_consultant",
        "stable_core_challenger",
        "stable_core_critic_a",
        "stable_core_critic_b",
        "diversity_critic",
        "fixed_experimental_panelist",
        "rotating_wildcard_panelist"
      ]
    }
  },
  "panelHeuristics": {
    "classes": {
      "stable_core": { "nominalWeight": 1.0, "softTimeoutSeconds": 150 },
      "stable_external_nonblocking": { "nominalWeight": 1.0, "softTimeoutSeconds": 150 },
      "fixed_experimental": { "nominalWeight": 0.75, "softTimeoutSeconds": 120 },
      "rotating_wildcard": { "nominalWeight": 0.4, "softTimeoutSeconds": 90 }
    }
  }
}
```
The `diversity_critic` role shown in the schema resolves to the active diversity lane in the live panel.
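A resolver over that schema stays tiny. This is a sketch against the simplified shape above, trimmed to one role and one mode; the key property is that the prompt layer asks for a semantic lane and the config answers with the current endpoint.

```python
import json

# Minimal config in the simplified schema's shape (trimmed to one role/mode).
config = json.loads("""
{
  "roles": {"rotating_wildcard_panelist": {"preferred": "<rotating-wildcard-endpoint>"}},
  "modes": {"jury": {"panelists": ["rotating_wildcard_panelist"]}},
  "panelHeuristics": {"classes": {"rotating_wildcard": {"nominalWeight": 0.4, "softTimeoutSeconds": 90}}}
}
""")

def resolve(mode: str) -> list[dict]:
    # Map each semantic lane in the mode to its currently configured endpoint;
    # the orchestration layer never hardcodes model names.
    return [
        {"role": role, "endpoint": config["roles"][role]["preferred"]}
        for role in config["modes"][mode]["panelists"]
    ]
```

Swapping a model then means editing one `preferred` value, not touching the workflow.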
Why Timeout Policy Had To Change
One of the most useful corrections was embarrassingly simple:
Parallel panel timeout is tail-latency control, not additive latency math.
That sounds obvious once you say it. But it matters operationally.
If six of the seven lanes are done and the only remaining holdout is the rotating wildcard endpoint, I should not keep acting as if the panel is "still incomplete" in the same way it would be if two stable-core critics were missing. Those are different situations.
| Class | Nominal weight | Soft timeout | Practical rule |
|---|---|---|---|
| stable_core | 1.0 | 150s | Default decision anchor. Keep the leash longest. |
| stable_external_nonblocking | 1.0 | 150s | Full interpretive weight if present, but do not block forever. |
| fixed_experimental | 0.75 | 120s | Useful corroboration or dissent, but non-core. |
| rotating_wildcard | 0.4 | 90s | Exploratory only. Drop aggressively when it becomes tail latency. |
This was the shift from “wait for everyone because fairness” to “wait intelligently because the panel has structure.”
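The per-class leash reduces to one comparison per pending lane. This sketch (names hypothetical) uses the soft-timeout values from the table: a lane becomes droppable only once the run has outlived its own class's leash, which is exactly what makes the wildcard fall away first.

```python
def droppable_stragglers(pending: dict[str, str], classes: dict, elapsed_s: float) -> list[str]:
    # pending maps lane -> timeout class; a lane becomes droppable once the
    # run has outlived that class's soft timeout. Core lanes keep the longest leash.
    return sorted(
        lane for lane, cls in pending.items()
        if elapsed_s > classes[cls]["softTimeoutSeconds"]
    )

classes = {
    "stable_core": {"softTimeoutSeconds": 150},
    "rotating_wildcard": {"softTimeoutSeconds": 90},
}
pending = {"critic_a": "stable_core", "wildcard": "rotating_wildcard"}
```

At 100 seconds only the wildcard has exceeded its leash; a stable-core critic at the same elapsed time is still inside its budget.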
Benchmark-Informed Weighting Helped, But Only After I Put It in Its Place
I do use benchmark families as a rough reference band. I do not use them as a vote-counting formula.
The live ops guide records a checked snapshot, but the useful signal is structural rather than brand-specific:
- the stable-core cluster is close enough to anchor the answer
- an external secondary lane can matter a lot when it is present, but it should remain operationally nonblocking instead of being silently required
- the fixed experimental lane is meaningful, but still non-core
- the rotating wildcard should not be benchmark-weighted like a fixed model at all because it is intentionally not stable enough to deserve that treatment
The Async Rules Are Why the Workflow Finally Feels Trustworthy
The strongest part of the current setup is no longer the model lineup. It is the coordination layer around it.
The workflow now explicitly tracks:
- expected panelists
- pending vs completed lanes
- first substantive result arrival
- whether interim checkpoint updates have already been sent
- which panelists are eligible for timeout-drop
- whether ACP needs a targeted follow-up check
- whether the panel is merely closed or fully superseded
The practical control loop looks more like this now:
1. resolve lineup from config
2. spawn all expected panelists
3. track expected vs completed lanes
4. send one bounded checkpoint when the first real result arrives
5. send a second checkpoint only if state changes meaningfully
6. if only ACP remains, inspect ACP directly before waiting blindly
7. after the second checkpoint, timeout-drop non-core stragglers
8. synthesize immediately once all expected lanes are complete or dropped
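One decision step of that loop can be expressed as a pure function over run state, which also makes it trivially testable. The state keys and action names here are illustrative, and this sketch collapses the "state changed meaningfully" check for brevity.

```python
def next_action(state: dict) -> str:
    # One step of the control loop above; state keys are illustrative.
    pending = state["expected"] - state["completed"] - state["dropped"]
    if not pending:
        return "synthesize"                 # terminal: everything done or dropped
    if state["completed"] and state["checkpoints_sent"] == 0:
        return "send_first_checkpoint"      # first substantive result arrived
    if pending == {"acp_lane"}:
        return "inspect_acp_directly"       # ACP is the last blocker: go look
    if state["checkpoints_sent"] >= 2:
        return "timeout_drop_non_core"      # past second checkpoint: drop stragglers
    return "wait"

base = {"expected": {"a", "b"}, "completed": set(), "dropped": set(), "checkpoints_sent": 0}
```

Keeping this as a pure function means every branch of the async machinery calls the same decision logic instead of each recomputing its own.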
That is not glamorous. It is also the part that stops the panel from becoming a reliability anti-feature.
Final Delivery Needs a Bridge Back, Not Just a Background Thread
The long-running-work version of this bug has the same shape as the panel bug: the worker can finish correctly while the user-facing conversation still never receives the final result. Moving a task into the background protects the main turn budget, but it does not automatically solve final delivery.
The reliable pattern is a small delivery contract created before the detached work starts. It records the origin conversation, the work conversation, the single update target, and the delivery mode. Default detached work reports to its bound work thread. Bridge-back to an origin thread is explicit, not something the worker infers from memory or a nearby chat summary.
A background worker should never guess where its final answer belongs.
The practical loop now looks like this:
1. persist a delivery contract before launch
2. launch the worker with the contract reference and final-ready marker
3. wait for an explicit final-ready signal
4. dry-run the delivery plan against the persisted contract
5. send the final answer once through an idempotent ledger
6. split oversized finals deterministically instead of truncating them
7. record delivered message identifiers when the channel returns them
That is deliberately boring distributed-systems hygiene: stable identity, explicit target selection, idempotent side effects, and observable delivery state. The agent-specific twist is that the final answer is prose, not a database row, so duplicate suppression and oversized-message handling need to preserve the user-visible text rather than silently summarize or drop it.
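The contract plus idempotent-send pieces fit in a few lines. In this sketch the field names, the in-memory ledger, and the `outbox` stand-in for the real channel are all hypothetical; the invariant is what matters: the target comes from the persisted contract, and a run's final answer is sent at most once.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeliveryContract:
    # Persisted before the detached work launches; field names are illustrative.
    origin_conversation: str
    work_conversation: str
    update_target: str          # the single place the final answer goes
    mode: str = "work_thread"   # bridge-back to origin is explicit, never inferred

outbox: list[tuple[str, str]] = []  # stand-in for the real delivery channel
ledger: set[str] = set()            # idempotency ledger: one final per run

def deliver_final(contract: DeliveryContract, run_id: str, text: str) -> bool:
    key = f"{run_id}:final"
    if key in ledger:
        return False                # duplicate: suppress instead of resending
    ledger.add(key)
    outbox.append((contract.update_target, text))
    return True

contract = DeliveryContract("origin-1", "work-7", update_target="work-7")
first = deliver_final(contract, "run-42", "final synthesis text")
second = deliver_final(contract, "run-42", "final synthesis text")
```

The worker never chooses a destination at delivery time; it only dereferences the contract it was launched with.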
The general lesson is simple: if a task is long enough to detach, it is long enough to deserve an explicit final-delivery path. Otherwise “background work” just moves the failure from timeout to silence.
Why Watchdog Coverage Had To Become a Launch Invariant
The ugliest real failure was not "a model said something weird." It was simpler: the panel could emit a clean "waiting on X" checkpoint and then never speak again.
That is why I no longer treat “children launched” as a valid waiting state by itself. A multi-child run is only allowed to settle into waiting after two things are true:
- the resolved lineup matches the accepted spawn set
- real follow-up wakes exist to carry the panel to the next checkpoint or terminal synthesis
In my implementation that means a small launch guard plus a watchdog planner. The planner defaults those wakes to visible delivery and rejects internal-only coverage by default, because a hidden watchdog is not much comfort if the user still experiences silence.
Just as important: generated plans are not proof. I only count coverage as real once the follow-up jobs actually exist. When watchdog creation fails, the run is degraded immediately instead of being presented as a trustworthy “waiting” state.
A second late bug was older queued checkpoints leaking out after a terminal answer. The local fix there was much smaller than the bug report made it sound: a tiny monotonic per-run delivery ledger that suppresses stale intermediate checkpoints while still allowing a legitimate retry of the final answer if delivery itself needs one more shot.
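The ledger really is tiny. This is a sketch of the monotonic rule under my reading of the fix: non-terminal checkpoints carry a sequence number and are suppressed at or below the high-water mark (or after the final), while the terminal answer is always admitted so a delivery retry stays possible.

```python
class DeliveryLedger:
    """Per-run monotonic delivery ledger (sketch): suppresses stale intermediate
    checkpoints while still allowing a legitimate retry of the final answer."""

    def __init__(self) -> None:
        self.high_water = -1
        self.final_sent = False

    def admit(self, seq: int, terminal: bool) -> bool:
        if terminal:
            self.final_sent = True
            self.high_water = max(self.high_water, seq)
            return True                      # final (and its retries) always allowed
        if self.final_sent or seq <= self.high_water:
            return False                     # stale intermediate: suppress
        self.high_water = seq
        return True

run_ledger = DeliveryLedger()
sent_first = run_ledger.admit(0, terminal=False)       # first checkpoint goes out
stale_repeat = run_ledger.admit(0, terminal=False)     # replayed checkpoint: blocked
final_ok = run_ledger.admit(5, terminal=True)          # terminal answer delivered
late_checkpoint = run_ledger.admit(3, terminal=False)  # old queued checkpoint: blocked
final_retry = run_ledger.admit(5, terminal=True)       # delivery retry still allowed
```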
Watchdog Hardening: The Bug Was Not Just Silence, It Was Repeated Almost-The-Same Progress
The next reliability problem was more embarrassing than dramatic. The panel was no longer always going silent, but it could still emit repeated user-facing checkpoint blocks that looked meaningfully new only because they arrived on different async branches.
The important diagnosis was that this was not a transport-duplication issue. The real bug lived in orchestration: child completions and later watchdog/finalization wakes could each recompute the same visible non-terminal state and both decide it was worth sending.
The hardening rule that actually helped was small and specific:
For non-terminal updates, validate not only run identity and closure state, but also whether the same visible progress state for this panel run has already been delivered.
In plain English, the panel now remembers the shape of the last user-visible checkpoint for the current run. If a later async branch only regenerates the same visible progress state, the watchdog is supposed to do nothing and return quietly instead of spamming the thread with a cosmetically fresh duplicate.
- duplicate same-state checkpoint → suppress it
- watchdog recomputes the same visible state → suppress it
- material state actually changes → allow the next bounded checkpoint through
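A fingerprint of the last delivered visible state is enough to implement all three rules. This sketch (function and storage names hypothetical) hashes the rendered progress state per run; any async branch that regenerates the same state gets a quiet no-op, while terminal answers bypass the check.

```python
import hashlib

# run_id -> fingerprint of the last user-visible checkpoint actually delivered.
last_fingerprint: dict[str, str] = {}

def should_send_checkpoint(run_id: str, visible_state: str, terminal: bool) -> bool:
    if terminal:
        return True                          # finals are governed by their own ledger
    fp = hashlib.sha256(visible_state.encode()).hexdigest()
    if last_fingerprint.get(run_id) == fp:
        return False                         # cosmetically fresh duplicate: stay quiet
    last_fingerprint[run_id] = fp
    return True

first = should_send_checkpoint("run-1", "waiting on wildcard", terminal=False)
dup = should_send_checkpoint("run-1", "waiting on wildcard", terminal=False)   # other branch
changed = should_send_checkpoint("run-1", "all lanes complete", terminal=False)
```

Hashing the rendered state rather than the raw event means two different async branches that happen to compute the same visible progress are deduplicated regardless of how they got there.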
I trust the fix more because it stopped being just a theory. The deterministic regression suite now passes the full targeted duplicate-state cases, including the same-state duplicate case, the watchdog-recomputation case, and the “state really changed, so the next checkpoint is legitimate” case.
The Two Delivery Bugs That Taught Me the Most
1. ACP completion is not the same as normal subagent completion
One of the nastier failure modes was: Gemini finished, useful text existed, but the parent thread still looked silent. The fix was not “wait harder.” The fix was to treat ACP as a different completion surface and do targeted inspection when it is the last blocker.
2. Fresh user follow-up is not a late duplicate
Another failure mode was subtle: after a final synthesis was already delivered, a true late completion and a fresh user-authored follow-up could land in the same general window. If the orchestrator treated the whole situation as late-result cleanup, it could accidentally suppress the new user instruction.
That is why the current workflow keeps panel-closed and superseded as separate states.
What I Learned About User-Facing Synthesis
I used to think the final answer should hide as much orchestration detail as possible. That turned out to be half-right.
The user does not need raw routing trivia. But they do need visible weighting.
So the synthesis now tries to make one thing explicit:
Did the disagreement come from stable-core lanes, or only from non-core lanes?
That one distinction dramatically improves the trustworthiness of the result. “There was disagreement” is weak. “The stable-core lanes aligned; only the wildcard dissented” is actionable.
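That distinction is cheap to compute once weight classes are attached to votes. A minimal sketch (names and return strings illustrative): collapse positions within the stable core and report dissent by provenance, not just existence.

```python
def describe_disagreement(votes: dict[str, str], classes: dict[str, str]) -> str:
    # votes: panelist -> position; classes: panelist -> weight class.
    if len(set(votes.values())) == 1:
        return "unanimous"
    core_positions = {votes[p] for p, c in classes.items() if c == "stable_core"}
    if len(core_positions) == 1:
        return "stable-core aligned; only non-core lanes dissented"
    return "stable-core lanes split"

votes = {"primary": "approve", "critic_a": "approve", "wildcard": "reject"}
classes = {"primary": "stable_core", "critic_a": "stable_core",
           "wildcard": "rotating_wildcard"}
verdict = describe_disagreement(votes, classes)
```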
What Other People Can Steal From This Pattern
You do not need my exact stack to steal the shape.
If you run any multi-model or multi-agent review workflow, the parts worth copying are:
- semantic roles, not hardcoded versions
- config-resolved lineup before spawn
- same packet for fair comparison
- weight classes that are visible in synthesis
- watchdog coverage as a launch invariant, not a hopeful afterthought
- explicit final-delivery contracts for detached work
- timeout as tail-latency control
- explicit ACP / external-lane delivery handling
- clear distinction between closed, degraded, and superseded runs
What I would not copy is “always ask all models.” Full panel is expensive. It is worth it when the user explicitly asks, or when the decision is ambiguous/costly enough to justify the latency. Reduced modes are part of the design, not a fallback embarrassment.
The Design Rule I Trust Most Now
If I had to compress the whole thing into one line, it would be this:
A good panel workflow does not just collect opinions. It defines which opinions count how much, how long they are allowed to block, and how the final answer reaches the user without ambiguity.
That is what changed the workflow from an interesting prompt trick into a reusable piece of operating infrastructure.