Local Semantic Memory on a 4-Core ARM VPS: How I Got OpenClaw Memory Search Working Without External APIs

March 19, 2026 | By Jingxiao Cai | Updated May 9, 2026
Tags: openclaw, ai-agents, self-hosted, memory, embeddings, devops
This post was co-created with Clawsistant, my OpenClaw AI agent. It helped turn a messy debugging session into a cleaner explanation of what actually broke, what actually fixed it, and why a small self-hosted box was not the limiting factor.
April 8 follow-up: I added a public limitations section covering rebuild-safety limits in the indexing path, the safer shadow-index workaround, more conservative live sync behavior, and the diagnostic split between timeout-style failures and request-path instability.
April 29 follow-up: I added a source-hygiene section about anti-echo retrieval: generated scout reports can suggest candidates, but direct evidence and fresh operator friction should remain the primary sources.
May 9 follow-up: I added the active-memory canary trade-off explicitly: a helper that works only under a long budget may be a valuable diagnostic ceiling, but it is not ready for the normal reply path until scope, query shape, or delivery mode makes it feel boring.

The Goal

I wanted semantic memory search in OpenClaw without paying an external embedding provider and without shipping my memory corpus out to somebody else's API.

The target environment was not glamorous: a 4-core ARM VPS running my personal OpenClaw deployment. In other words, exactly the kind of box where people assume local embeddings are going to be too slow, too fragile, or too annoying to be worth it.

That assumption turned out to be wrong.

Result: local semantic memory search came up successfully with node-llama-cpp + embeddinggemma, indexing 239 files into 1,778 chunks with hybrid search active and no external embedding API required.

Why I Wanted Local Memory in the First Place

OpenClaw memory search is one of those features that gets much more valuable the moment you have real history behind it.

Remote embeddings work, but local mode has three obvious advantages:

| Reason | Why it matters |
|---|---|
| Cost | No per-query or per-reindex embedding bill for your memory corpus |
| Privacy | Your memory notes stay on your own host instead of becoming another API payload |
| Independence | No key management, remote quota surprises, or embedding-provider drift just to search your own notes |

What Failed First

The first version of this story looked like a disappointing cliché: local memory search seemed flaky, the index did not complete cleanly in the interactive workflow I was using, and it would have been very easy to blame the VPS.

But that diagnosis would have been lazy.

The problem was not that ARM was too weak. The problem was that the local runtime was incomplete and the indexing workflow was wrong for the machine.

The actual root cause was a missing optional dependency in the OpenClaw install:

node-llama-cpp

Once that was fixed, the rest of the story changed completely.

The Useful OpenClaw Details

The local docs were clear on two points that mattered: the default local embedding model, and a native-build caveat I come back to below.

The documented default local embedding model is:

hf:ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/embeddinggemma-300m-qat-Q8_0.gguf

That is a very reasonable fit for this use case: small enough to be practical on a modest VPS, but good enough for personal-assistant memory retrieval.

The Config Shape That Matters

This is the important part of the config shape, simplified to the pieces that mattered operationally:

{
  "agents": {
    "defaults": {
      "memorySearch": {
        "enabled": true,
        "provider": "local",
        "fallback": "none",
        "local": {
          "modelPath": "hf:ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/embeddinggemma-300m-qat-Q8_0.gguf"
        }
      }
    }
  }
}

I like the explicit "fallback": "none" here for one reason: if the point of the exercise is "stay local", silently falling back to a remote embedding provider defeats the whole point.

Important nuance: the local docs also note that some installs may need native-build approval for node-llama-cpp. In my case the immediate failure was simpler: the dependency was missing, not just unapproved.

The Operational Fix That Actually Made It Work

Even after local mode was correctly wired, I hit another trap: trying to do a full index build in a foreground, short-lived interactive flow on a small box is a dumb way to judge whether local semantic memory works.

Initial indexing is the expensive part. Search after the index is built is the easy part.

So the fix was operational, not architectural:

  1. fix the missing local runtime dependency
  2. use the default embeddinggemma GGUF path
  3. run the full index as a detached, low-priority background job overnight
  4. verify status the next session instead of babysitting it in a fragile foreground turn

That last step mattered more than it sounds. On a small VPS, background indexing is not a workaround for failure. It's just the sane execution model.
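One way to script that detached, low-priority launch from Python, for anyone who wants it reproducible rather than hand-typed. This is a sketch: the `openclaw memory index` command in the comment is an assumed subcommand, not quoted from the docs, and the function just wraps standard `subprocess` behavior.

```python
import os
import subprocess

def launch_background_index(cmd, logfile):
    """Start a long index build detached from the terminal at low CPU priority,
    so a dropped SSH session or interactive timeout cannot kill it mid-run."""
    with open(logfile, "ab") as log:
        return subprocess.Popen(
            cmd,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,          # detach from the controlling terminal
            preexec_fn=lambda: os.nice(19),  # lowest scheduling priority
        )

# Hypothetical usage (verify your CLI's actual index subcommand first):
# launch_background_index(
#     ["openclaw", "memory", "index", "--agent", "main"],
#     os.path.expanduser("~/openclaw-index.log"),
# )
```

The same effect is achievable with `nohup nice -n 19 … &` in a shell; the point is the execution model, not the tool.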

The Verification

Once the overnight run completed, the status check was exactly what I wanted to see:

openclaw memory status --agent main --deep --json
{
  "files": 239,
  "chunks": 1778,
  "embedding_cache": 1761,
  "provider": "local",
  "searchMode": "hybrid",
  "dirty": false
}

More importantly, semantic search started returning useful results rather than placebo noise.

In plain English: this was not just "the index command finished." It was real semantic recall on a self-hosted memory corpus.

What This Taught Me About Small VPSes

People often compress two different questions into one:

  1. Can this machine run local embeddings at all?
  2. Is foreground full reindexing a pleasant interactive experience on this machine?

The answer I got was yes to the first and no to the second. Those are different statements, and too many self-hosting discussions blur them.

What I’d Recommend to Other OpenClaw Users

| If your goal is… | Recommendation |
|---|---|
| cheapest durable memory search | Try local first |
| best privacy for personal notes | Prefer local and keep fallback explicit |
| fastest large-corpus reindexing with minimal host CPU pressure | Remote embeddings may still be easier operationally |
| testing whether your host is "good enough" | Do not judge from one foreground timeout |

The Real Pattern

The interesting lesson here is broader than memory search:

When self-hosted AI features fail, the first suspect should be wiring and execution model—not automatically the hardware.

In this case, the wiring fix was the missing node-llama-cpp dependency, and the execution-model fix was moving the full index build into a detached background job.

One Honest Caveat

This post is specifically about the local-memory win. Later on I also tested remote embedding providers for different operational reasons. That later work eventually stabilized on a separate memory-only remote lane with text-embedding-3-small after larger or more ambitious candidates kept tripping timeout and indexing-path problems.

I am deliberately not turning that into the main story here, because it answers a different question. The local result already stood on its own: once the dependency and indexing model were fixed, local semantic memory worked on the small ARM VPS.

OpenClaw local semantic memory absolutely worked on this 4-core ARM VPS once the dependency and indexing strategy were corrected.

Retrieval Audit Results (March 2026)

After the original local-memory win, I wanted one more sanity check: was the system merely indexed, or was it actually retrieving the right things cleanly?

The useful answer was yes, with two important caveats about noise and memory hygiene.

| Audit question | What I saw | Operational lesson |
|---|---|---|
| Did English recall work? | Yes: targeted queries about memory-search stabilization returned the expected durable daily-note hits. | Real retrieval quality matters more than a pretty "index complete" message. |
| Did Chinese recall work? | Yes: Chinese queries surfaced Chinese-language content again instead of falling back to generic noise. | Cross-language recall was viable on the same memory corpus once the indexing path was clean again. |
| What reduced noise most? | Dropping session transcripts and keeping the index focused on durable memory notes made retrieval calmer and more relevant. | More source material is not automatically better. Curated memory often beats transcript sprawl. |
| What content stayed noisy? | Transactional, Gmail-style, daily operational content is still much noisier than curated notes and troubleshooting writeups. | If everything becomes memory, nothing feels like memory. |
| What improved transferability? | Clear headings, explicit labels, and disciplined note structure consistently made later recall easier. | Metadata discipline is a retrieval feature, not just a writing preference. |

The biggest practical lesson is boring and useful: embedding quality is only part of memory quality. Index scope, note structure, and whether you keep high-noise material out of the main retrieval path matter just as much.

Later Developments: Why the Remote Story Became Its Own Post

The later remote-memory work taught a different lesson from the original local-memory post. The important instability was not “local bad, remote good.” It was that source scope and indexing-path behavior mattered more than jumping to a bigger model.

That is why the remote stabilization story belongs in the separate follow-up, Bigger Embeddings ≠ Better Memory, instead of being pasted wholesale into this post. This page is about proving that local semantic memory was viable on a modest self-hosted box. The later page is about what happened when I kept experimenting after that proof.

April 2026 Follow-Up: Session Transcript Indexing Turned This Into a Blue/Green Rollout Problem

The next meaningful step after the original local-memory win was not “try another embedding model.” It was add session transcripts without breaking the live lane.

That sounds like a small follow-up. It was not. The first design correction was easy to miss: session transcript indexing is agent-scoped. A brand-new shadow agent would index its own mostly empty session folder, not the real history from my main agent. That meant the honest experiment kept the same agent identity and changed the config and SQLite target instead.

The easy comparison trap: comparing memory-only + 4B against memory + sessions + 0.6B looks tidy, but it changes two variables at once — source set and embedding model. That is not an A/B test. It is an accidental mash-up.

| Lane | Sources | Embedding model | Why I kept it |
|---|---|---|---|
| A | memory + sessions | perplexity/pplx-embed-v1-4b | Isolate the effect of adding session transcripts while holding the stronger known-good model family constant. |
| B | memory + sessions | perplexity/pplx-embed-v1-0.6b | Test the cheaper lane on the same corpus instead of hiding a source-set change inside a model swap. |

The trade-off was real. The smaller 0.6b lane was about 7.5× cheaper per embedded token, used 1024-dimensional vectors instead of 2560, and produced a meaningfully denser on-disk index. The bigger 4b lane still looked stronger, by low-single-digit absolute margins, on several public retrieval benchmarks I checked. That is exactly why I wanted the rollout to separate source-set enablement from model choice instead of hiding both inside one "upgrade".
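The density claim is easy to make concrete. Assuming float32 vectors (a guess about the on-disk format) and ignoring SQLite overhead and metadata, the raw vector storage for the 1,778-chunk corpus works out like this:

```python
def raw_vector_bytes(chunks: int, dims: int, bytes_per_float: int = 4) -> int:
    """Raw embedding storage only: chunks x dimensions x bytes per component."""
    return chunks * dims * bytes_per_float

chunks = 1778  # from the earlier status check
small = raw_vector_bytes(chunks, 1024)  # 0.6b lane
large = raw_vector_bytes(chunks, 2560)  # 4b lane
print(f"{small / 1e6:.1f} MB vs {large / 1e6:.1f} MB")  # → 7.3 MB vs 18.2 MB
```

Roughly a 2.5× difference in vector storage before any index overhead, which matches the "meaningfully denser" observation above.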

The honest operational pattern turned out to be blue/green, not clever live toggling:

  1. build shadow configs first
  2. keep agent-scoped SQLite paths because the session index is agent-scoped too
  3. validate with real memory status and memory search checks
  4. patch the live config deliberately
  5. restart manually on purpose so rollback is obvious if the first live sync goes wrong

Important scope note: I am not claiming that agent-scoped storage is a universal best practice for every retrieval system. It was the honest fit for this design because the session corpus itself was scoped per agent.

My current live lane now resolves to the smaller session-enabled path:

sources: ["memory", "sessions"]
model: perplexity/pplx-embed-v1-0.6b
vector dims: 1024
dbPath: ~/.openclaw/memory/{agentId}-sessions-shadow-pplx06b.sqlite

More importantly, the live checks return real source: "sessions" hits. That is the proof I actually care about. The point was never to win an embedding beauty pageant. The point was to add session recall without lying to myself about what changed.

April 2026 Follow-Up: A Separate Multimodal Lane Worked Better Than Forcing Everything Through Text Memory

The cleanest later example came from a photo-clustering problem, not from memory search itself. I wanted semantically similar stills and GIFs grouped together. Hash-only approaches were fine for exact or near-duplicate cleanup, but they were the wrong primary feature space for “these belong together” ordering.

That turned into a deliberately separate Gemini Embedding 2 pilot instead of another attempt to overload the memory stack:

| Pilot signal | What I saw | Why it matters |
|---|---|---|
| Corpus | 96 items total: 72 stills and 24 GIFs | Large enough to expose whether the workflow was real or just a toy success. |
| Embedding lane | Gemini Embedding 2 returned 3072-dimensional vectors | The API path worked cleanly for a separate multimodal workflow on the same VPS. |
| Top-1 similarity | min 0.6609, max 0.9947, avg 0.8900 | The neighbors were strong enough to justify moving on from hash-only tuning. |
| Sanity check | Top-1 neighbors stayed same-lane (still→still, GIF→GIF) | The results passed a basic modality sanity check instead of collapsing into obvious nonsense. |

The operational pattern mattered as much as the numbers. I did not change the live gateway memory configuration. I created a dedicated local virtual environment and called Gemini Embedding 2 directly from small task-specific scripts, while keeping the vectors and review artifacts local.

python3 -m venv ~/.openclaw/workspace/.venvs/photo-semantic
source ~/.openclaw/workspace/.venvs/photo-semantic/bin/activate
python3 scripts/gemini_embedding2_probe.py
python3 scripts/gemini_embedding2_pilot.py
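For reference, the top-1 similarity numbers in the table come from a computation like this. This is a stdlib sketch of the idea, not the pilot script itself, and the tiny 2-d vectors below are stand-ins for the real 3072-dim Gemini embeddings:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top1_neighbors(vectors):
    """For each embedding, the (index, cosine similarity) of its nearest
    other embedding; min/max/avg of these similarities are the kind of
    numbers reported in the table above."""
    out = []
    for i, v in enumerate(vectors):
        best = max(
            ((j, cosine(v, w)) for j, w in enumerate(vectors) if j != i),
            key=lambda t: t[1],
        )
        out.append(best)
    return out

vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print([j for j, _ in top1_neighbors(vecs)])  # → [1, 0, 1]
```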

Why this matters: the honest architecture is hybrid. Use local embeddinggemma when the problem is text memory on your own notes. Use a separate remote multimodal lane when the problem is image/GIF semantics or another feature space the text-memory lane was never meant to solve.

That distinction matters because the title claim here is still true. Local semantic memory worked on this VPS without external APIs. The later Gemini pilot did not overturn that result. It showed where a separate remote lane was the right tool for a different class of problem.

Known Limitations: A Stable Live Lane Is Not the Same Thing as a Universally Safe Rebuild Path

The session-transcript upgrade made the memory lane more useful, but it did not magically make every indexing path safe. The hardest limit I can now state plainly is this: the built-in full indexing path can still fail against a shorter-than-expected embedding timeout ceiling even when the surrounding status surface suggests a much longer configured timeout.

That distinction matters because two failure families showed up, and they are similar only if you look at them too casually: timeout-style failures, where legitimate work runs past the effective embedding timeout ceiling, and request-path instability, where the indexing request path itself misbehaves.

The practical workaround that actually held up was not “just tune the live lane harder.” The reliable recovery path was a shadow indexing workflow: build or rebuild the index in a reduced-pressure lane first, validate that it is queryable, and only then treat it as a promotion candidate instead of assuming the stock live rebuild path is trustworthy under pressure.

What I would not claim: the current live setup has become calmer, but I would still not describe plain built-in full reindex as a boring solved problem in this environment. The shadow path is still the safety net I trust when a real rebuild matters.

More conservative live sync settings are still worth keeping because they make the system less eager to stampede itself on ordinary session churn.

Those settings help by reducing how often automatic sync fires. They do not prove that a manual full build is now safe. That is the operational split I wish I had written down earlier:

Best current reading: the memory stack is stable enough to use, but not yet honest to describe as “full rebuild is solved.” The right mental model is a useful live lane with a separate lower-risk rebuild path, not one magical knob that fixed everything.

April 2026 Follow-Up: Source Hygiene Is an Anti-Echo Control

The latest memory-search lesson was not another embedding benchmark. It was source hygiene.

Any retrieval or scouting system can accidentally start chasing its own echoes. A generated report summarizes yesterday's evidence. The next scan sees that report and treats it as fresh signal. A later summary then cites the summary of the summary. If you are not careful, the system starts amplifying its own prose instead of discovering new reality.

| Source type | How I treat it now | Why |
|---|---|---|
| Direct artifact or command result | Primary evidence. | It is closest to what actually happened. |
| Durable human- or agent-written note | Useful context, especially when it records outcome and next step. | It preserves intent, but still needs freshness checks. |
| Generated scout or roundup | Candidate source, not proof of novelty. | It may already contain prior rankings, bias, or repeated phrasing. |
| Older scout report | Anti-repeat evidence. | It is good for avoiding duplicates, not for claiming something is newly important. |

That gives me a cleaner rule for local memory and topic scouting:

Use generated summaries to find candidates. Use direct evidence to decide what is true.

This also changes how I interpret retrieval quality. A search result is not good just because it is semantically similar. It is good when the source tier matches the decision. For a new blog idea, fresh daily-memory evidence should outrank an older generated scout. For an operational claim, a direct validation artifact should outrank a polished recap.
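One way to encode that rule mechanically is to make source tier dominate pure similarity when ranking hits. This is a sketch: the tier names mirror the table above, and the dict field names are illustrative, not OpenClaw's schema.

```python
# Lower rank value = more trusted tier; names mirror the source-type table above.
TIER_RANK = {
    "direct_artifact": 0,
    "durable_note": 1,
    "generated_scout": 2,
    "older_scout": 3,
}

def rerank(hits):
    """Sort retrieval hits so source tier dominates, with cosine similarity
    breaking ties inside a tier."""
    return sorted(hits, key=lambda h: (TIER_RANK[h["source_tier"]], -h["similarity"]))

hits = [
    {"id": "scout-042", "source_tier": "generated_scout", "similarity": 0.95},
    {"id": "cmd-log-17", "source_tier": "direct_artifact", "similarity": 0.80},
]
print([h["id"] for h in rerank(hits)])  # → ['cmd-log-17', 'scout-042']
```

The generated scout is more similar but still loses to direct evidence, which is exactly the decision rule in prose form.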

Connection to skill governance: the same anti-echo rule applies to agent skills. A skill radar should not create new skills merely because prior skill-radar prose kept mentioning them. See Modernizing Agent Skills Without Growing a Skill Jungle for the catalog-governance version of this lesson.

May 2026 Follow-Up: Active Memory Canaries Need a Latency Budget

The next memory lesson was not about whether a helper could finish. It was about whether it could finish inside the reply path without making the whole assistant feel stuck.

A 60-second active-memory canary found a working setting in the narrow sense: the helper stopped immediately failing under the previous shorter timeout pattern, and early real runs produced non-empty summaries instead of timing out. But the observed pre-reply latency was still roughly 45-53 seconds.

| Question | Canary answer | Operational interpretation |
|---|---|---|
| Can the helper complete under a larger budget? | Yes, in early real runs. | The failure class changed from immediate timeout to slow success. |
| Is that comfortable for the reply path? | No. | Forty-plus seconds before a response is a ceiling, not a polished default. |
| What should tighten first? | Helper shape and query scope. | Reducing work is safer than blindly shrinking the timeout and rediscovering failure. |

“It completes” is not the same as “it belongs on the critical reply path.”

That is now how I think about active-memory helpers: a successful canary proves a ceiling, not a production tuning. The next step after a slow success is to make the helper smaller, more targeted, or more asynchronous before treating it as part of the normal interaction loop.
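The outcome classes are worth encoding so a canary harness can label runs consistently. A sketch of that labeling, where the `run` callable, the thresholds, and the class names are placeholders of mine rather than OpenClaw API:

```python
import time

def classify_canary(run, budget_s, comfort_s):
    """Label one canary run: 'timeout' (never finished inside the budget),
    'empty' (finished but produced nothing), 'slow-success' (a ceiling,
    not a default), or 'comfortable' (candidate for the normal reply path)."""
    start = time.monotonic()
    try:
        result = run(timeout=budget_s)
    except TimeoutError:
        return "timeout"
    elapsed = time.monotonic() - start
    if not result:
        return "empty"
    return "comfortable" if elapsed <= comfort_s else "slow-success"

# A fake 50 ms helper under a generous budget but a 10 ms comfort threshold:
slow = lambda timeout: time.sleep(0.05) or "non-empty summary"
print(classify_canary(slow, budget_s=60, comfort_s=0.01))  # → slow-success
```

The point of the separate `comfort_s` threshold is exactly the May lesson: completing inside the budget and completing inside the user's patience are different tests.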

The May 7 refinement is mostly about wording and rollout discipline: a slow successful canary should be recorded as working but not yet user-comfortable. That keeps the team from declaring victory merely because the timeout class disappeared.

The May 8 closeout made the promotion gate stricter: I would not treat a long canary budget as a normal product budget. Before this belongs in the ordinary reply path, the helper needs a narrower query contract, a smaller source set, or an asynchronous delivery shape that keeps normal replies responsive.

The May 9 framing is the one I would keep: slow success is a promotion candidate, not a promotion. A useful active-memory canary should answer two questions separately: can the recall path complete at all, and can it complete without becoming the dominant user-visible latency. If the first answer is yes and the second answer is no, the next move is to shrink or relocate the work, not to celebrate the long timeout as a finished setting.

| Canary result | What it proves | What it does not prove |
|---|---|---|
| Slow success | The helper can finish under the larger budget. | The helper belongs synchronously in every reply path. |
| Non-empty summaries | The recall path can produce useful material. | The query shape is already minimal enough. |
| No immediate timeouts | The previous budget was too tight for this lane. | A permanent large timeout is the right product behavior. |

May 2026 Follow-Up: Session Lists Have a Memory-Search Lesson Too

The session-list latency work looked separate at first, but it rhymed with the memory story. The gateway was doing expensive derived-row work before cheap filters had narrowed the candidate set.

In a stage-style filtered probe, the fast-path fix collapsed expensive row construction from 381 rows to 2 rows. Transcript-derived fallback time moved from roughly 953 ms to 28 ms, child-relationship resolution from roughly 109 ms to 2 ms, and total filtered-probe wall time from roughly 1.31 seconds to 216 ms.

The takeaway is the same as the memory-indexing lesson: source scope matters. A tiny query should not behave like a full-corpus operation just because the implementation builds the rich view too early.
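The shape of that fix generalizes. A toy contrast between building rich rows before filtering and filtering cheap ids first; the row-construction cost here is simulated, and none of this is the gateway's actual code:

```python
import time

def expensive_row(session_id):
    """Stand-in for transcript-derived row construction (simulated cost)."""
    time.sleep(0.001)
    return {"id": session_id, "title": f"session {session_id}"}

def slow_list(session_ids, wanted):
    # Builds every rich row first, then filters: 381 expensive builds for 2 hits.
    rows = [expensive_row(s) for s in session_ids]
    return [r for r in rows if r["id"] in wanted]

def fast_list(session_ids, wanted):
    # Filters on cheap ids first, then builds only the survivors: 2 expensive builds.
    hits = [s for s in session_ids if s in wanted]
    return [expensive_row(s) for s in hits]

# Same answer, two orders of magnitude less derived-row work:
assert fast_list(range(381), {7, 42}) == slow_list(range(381), {7, 42})
```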

Related write-up: I expanded the control-plane performance side of this into The 10-Second Session List.

Why I Think This Matters

A lot of OpenClaw users are exactly in this bucket: a small self-hosted box, real accumulated history, and no appetite for shipping personal notes to yet another external API.

For that audience, local semantic memory is not some philosophical purity test. It's a practical pattern: fix the wiring, pick the execution model that fits the machine, and keep the fallback explicit.

Sanitization note: I kept the overall architecture, provider names, model path, and performance/result numbers because those are the useful parts. I intentionally left out deployment-specific IDs, host-specific paths, exact tuning knobs, and any secrets or internal tokens.

About the Author

Jingxiao Cai works on ML infrastructure and spends an unreasonable amount of time turning vague "it should work" self-hosting advice into concrete operational evidence. He likes cheap, boring systems that keep working after the benchmark crowd gets bored and leaves.

If a feature can run privately on a small box with one careful fix and one good operational decision, that is usually more interesting than a flashy demo on oversized hardware.