Zerto Docs MCP — Part 2: Hybrid Search and a Curated Knowledge Layer

Last week I posted about a Zerto Docs MCP server — a hosted MCP that lets Claude, Cursor, or Copilot answer Zerto questions against the actual help.zerto.com docs. That post described the path from “standard dense RAG” to “dense + cross-encoder reranker”, which lifted retrieval metrics about 3× and made the thing actually useful.

What I didn’t know yet: the dense + reranker architecture has its own ceiling. This is the story of finding it, hitting it, and the changes I made on the other side — plus three other things that turned out to need fixing while Claude was in there.

The query that started this

I asked Claude to help me write a Python script to create a Zerto VPG on AWS. The MCP was supposed to find the example code I’d just ingested from ZertoPublic/zerto-api-quickstart. Instead it returned an Admin guide page titled “Example Scripts”: a page that says “see [other page] for details” and contains no actual code.

I checked the index. The Python create_vpg.py example was in there. The chunk text contained “Python”, “create”, “VPG”, “AWS” — every keyword from my query. But the dense retriever ranked it at position #1984 out of ~73,000 chunks. The reranker only sees the top 200. So the matching script never got a chance to be considered.

What dense embeddings actually optimize for

Bi-encoder dense embedding (the workhorse of every “AI search” demo) takes a query and a document, encodes each into a fixed-length vector independently, then compares the two vectors by cosine similarity. The model never sees the query and the document together; it’s hoping the two embeddings end up near each other in vector space because the underlying texts share meaning.

That works wonderfully for queries against discursive prose: “how do I configure SAML” lands near “SAML federation setup procedure”. Different words, same meaning.

It works terribly for queries with rare specific tokens against short code chunks. The user types “python script to create a VPG on AWS”. The dense embedding for that query is dominated by the topic — “create a VPG” — because the model treats “python” and “AWS” as background words that lots of documents contain. Across 73,000 chunks, there are thousands that talk about creating a VPG. The model can’t tell the script apart from the prose.

The reranker is supposed to fix exactly this. A cross-encoder sees the query and document jointly, weighting term-by-term matches properly. But the reranker only scores candidates the dense retriever provides. If the script doesn’t make dense top-200, the reranker doesn’t know it exists.

The fix: hybrid retrieval

Search engines solved this decades ago with lexical matching. BM25 dates from the mid-1990s and has been the default scoring function in Lucene since version 6 (2016). The score is a sum over query terms of term frequency in the document × inverse document frequency × a length normalization. Rare query terms that appear in a document score high. Terms that are common across the corpus score low.

For our failure case: "python" appears in maybe 0.5% of chunks. "create_vpg.py" appears in ~9 chunks total. "AWS" appears in maybe 30% of chunks. BM25 weights "create_vpg.py" enormously, "python" significantly, "AWS" a little. The Python script’s chunk lights up. The Admin guide chunks (which don’t have "python" or "create_vpg.py") score near zero.
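To make that weighting concrete, here is a minimal sketch of the BM25 term weight applied to the rough frequencies above. It assumes the Lucene-style IDF formula and the common defaults k1=1.2 and b=0.75; SQLite’s FTS5 uses its own BM25 variant, so real scores will differ. The document frequencies are back-of-envelope conversions of the percentages quoted above, not measured values.

```python
import math

N_CHUNKS = 73_000  # approximate corpus size quoted in the post


def bm25_idf(doc_freq, n_docs=N_CHUNKS):
    """Lucene-style BM25 IDF: the rarer the term, the larger the weight."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_len, idf, k1=1.2, b=0.75):
    """Score of one query term in one document.

    Term frequency saturates (controlled by k1) and long documents are
    penalized relative to the average length (controlled by b).
    """
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (tf + norm)


# Assumed document frequencies, derived from the rough percentages above.
doc_freqs = {
    "create_vpg.py": 9,
    "python": round(0.005 * N_CHUNKS),
    "AWS": round(0.30 * N_CHUNKS),
}

weights = {term: bm25_idf(df) for term, df in doc_freqs.items()}
for term, idf in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:15s} idf={idf:.2f}")
```

The ordering is the point: the filename outweighs “python”, which in turn outweighs “AWS”.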

Dense retrieval is great at understanding meaning. BM25 is great at term-rarity weighting. Hybrid retrieval runs the same query against both indexes, then fuses the two ranked lists. Many production RAG systems ship some version of this.

The fusion math is called Reciprocal Rank Fusion (RRF). For each document, score = Σ over retrievers of 1/(60 + rank_in_retriever), where 60 is the conventional smoothing constant. RRF uses only each document’s rank in each list, not its raw score, so the two retrievers’ wildly different scoring scales don’t need calibration. Documents that appear in both lists tend to score highest. Documents that appear in only one list still rank well if they’re high in that list.
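A minimal sketch of that formula in Python. It assumes each retriever returns an ordered list of chunk IDs, best first; the function name matches the pseudo-code later in this post, but the body is an illustration, not the server’s actual module.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of IDs with Reciprocal Rank Fusion.

    ranked_lists maps a retriever name to its IDs, best first.
    Returns all IDs sorted by fused score, best first.
    """
    scores = defaultdict(float)
    for hits in ranked_lists.values():
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


fused = rrf_fuse({
    "dense": ["a", "b", "c"],
    "bm25": ["c", "a", "d"],
})
print(fused)
```

Here “a” and “c” appear in both lists, so they land ahead of “b” and “d”, each of which only one retriever found.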

After RRF, the reranker scores the fused pool. Same reranker as before; same setup. Just now it actually sees the relevant documents.

Implementation

I picked SQLite’s FTS5 for BM25. The sqlite3 module is in the Python stdlib, and the SQLite builds bundled with most Python distributions include FTS5, so there are no new dependencies. It indexes 73,000 chunks in three seconds, persists to disk like Chroma, and takes ~167 MB on disk for the full corpus. The whole module is about 200 lines of Python.

Schema is two tables — a regular table for the chunk metadata (so filters like version=10.9 work as SQL WHERE clauses) and an FTS5 virtual table for the indexed text. JOIN them on rowid.

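-- Chunk metadata: plain columns, so filters become SQL WHERE clauses.
CREATE TABLE chunks_meta (
    rowid INTEGER PRIMARY KEY AUTOINCREMENT,
    id TEXT UNIQUE,
    bundle_id TEXT, page_id TEXT, version TEXT, ...
);
-- Full-text index over the chunk text; joined to chunks_meta on rowid.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text,
    tokenize = 'porter unicode61 remove_diacritics 2'
);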
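Here is a runnable sketch of how a schema like this can be queried from Python. It is an illustration under assumptions, not the actual module: the metadata table is trimmed to id and version, the helper names (build_index, to_match_query, bm25_query) are hypothetical, and joining query terms with OR is my guess at a sensible default for retrieval. It also assumes your SQLite build has FTS5 enabled.

```python
import re
import sqlite3

SCHEMA = """
CREATE TABLE chunks_meta (
    rowid INTEGER PRIMARY KEY AUTOINCREMENT,
    id TEXT UNIQUE,
    version TEXT
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text,
    tokenize = 'porter unicode61 remove_diacritics 2'
);
"""


def build_index(chunks, path=":memory:"):
    """chunks: iterable of (chunk_id, version, text) tuples."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    for chunk_id, version, text in chunks:
        cur = conn.execute(
            "INSERT INTO chunks_meta (id, version) VALUES (?, ?)",
            (chunk_id, version),
        )
        # Reuse the metadata rowid so the two tables can be joined on it.
        conn.execute(
            "INSERT INTO chunks_fts (rowid, text) VALUES (?, ?)",
            (cur.lastrowid, text),
        )
    conn.commit()
    return conn


def to_match_query(query):
    """Turn free text into an FTS5 query: quoted terms joined with OR."""
    terms = re.findall(r"[^\W_]+", query)
    return " OR ".join(f'"{term}"' for term in terms)


def bm25_query(conn, query, version=None, n=200):
    match = to_match_query(query)
    if not match:
        return []
    sql = (
        "SELECT m.id FROM chunks_fts "
        "JOIN chunks_meta AS m ON m.rowid = chunks_fts.rowid "
        "WHERE chunks_fts MATCH ?"
    )
    params = [match]
    if version is not None:
        sql += " AND m.version = ?"
        params.append(version)
    # FTS5's bm25() returns lower values for better matches.
    sql += " ORDER BY bm25(chunks_fts) LIMIT ?"
    params.append(n)
    return [row[0] for row in conn.execute(sql, params)]


conn = build_index([
    ("script", "10.9", "Python script create_vpg.py: create a VPG on AWS"),
    ("admin", "10.9", "Example Scripts. See the other page for details."),
    ("old", "9.7", "Python script create_vpg.py: create a VPG on AWS"),
    ("saml", "10.9", "SAML federation setup procedure"),
    ("alarm", "10.9", "Alarm codes reference"),
    ("vra", "10.9", "Installing a VRA on Hyper-V hosts"),
])
hits = bm25_query(conn, "python script to create a VPG on AWS", version="10.9")
print(hits)
```

Quoting every term keeps punctuation and words like OR or NEAR in a user’s query from being parsed as FTS5 syntax.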

The query path in search_docs becomes:

dense_hits = chroma.query(query, n=200, where=filter)
bm25_hits  = bm25.query(query, n=200, where=filter)
fused      = rrf_fuse({"dense": dense_hits, "bm25": bm25_hits})
candidates = fused[:200]
top_k      = reranker.rerank(query, candidates)[:k]

End-to-end the hybrid path adds about 50 ms of latency (SQLite FTS5 is fast). The bigger cost — the reranker round-trip — was already there.

The eval that proved it worked

I built an A/B harness back in Part 1 for the dense → reranked decision. Same harness applies here. 25 hand-curated golden queries committed to the repo, each labeled with expected (bundle_id, page_id) tuples. Run each retriever, compute MRR / Recall@5 / nDCG@5, dump a markdown report with side-by-side top-5.
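For reference, the three metrics can be computed roughly like this. It is a sketch with binary relevance, assuming each retriever’s output has already been collapsed to a deduplicated list of (bundle_id, page_id) tuples; the function names are illustrative and not taken from run_eval.py.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, or 0 if none was returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k=5):
    """Fraction of the expected pages that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranked, relevant, k=5):
    """Binary-gain nDCG: rewards relevant results for appearing early."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


def evaluate(runs, k=5):
    """runs: list of (ranked, relevant) pairs, one per labeled query."""
    n = len(runs)
    if n == 0:
        return {"MRR": 0.0, f"Recall@{k}": 0.0, f"nDCG@{k}": 0.0}
    return {
        "MRR": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        f"Recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        f"nDCG@{k}": sum(ndcg_at_k(r, rel, k) for r, rel in runs) / n,
    }


# Toy example: two expected pages, found at ranks 2 and 4.
ranked = [("b", "x"), ("b", "a"), ("b", "y"), ("b", "c")]
relevant = {("b", "a"), ("b", "c")}
report = evaluate([(ranked, relevant)])
print(report)
```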

The numbers (20 labeled queries of 25):

Retriever                      MRR     Recall@5   nDCG@5
dense (original)               0.087   0.083      0.075
dense + reranker               0.342   0.296      0.280
BM25 alone                     0.325   0.321      0.290
hybrid (BM25 + dense + RRF)    0.260   0.308      0.225

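Counterintuitive read: pure BM25 outperforms hybrid in this bench. The dense contribution dilutes high-confidence BM25 wins: under RRF, a chunk ranked #1 by BM25 but absent from the dense top 200 scores 1/61 ≈ 0.0164, while a chunk ranked a mediocre #50 in both lists scores 2/110 ≈ 0.0182 and lands above it.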

In production the reranker runs on top of the hybrid pool. The reranker (a cross-encoder, trained for exactly this) reorders the merged candidate list properly. Hybrid+rerank surfaces the right Python script for “python script to create a VPG on AWS” as the top-1 hit. The same holds for the curl failover test and the PowerShell monitor task. Three queries that dense+rerank missed, because the dense stage ranked the target at 1000+, are now top-1 with hybrid+rerank.

Net architecture:

query → embed       → Chroma (dense)  ─┐
                                       ├── RRF fuse ── reranker → top-k
query → tokenize    → FTS5 (BM25)     ─┘

Everything is gated behind a HYBRID_SEARCH=true env var. Kill switch is one line in the compose file.

Three more things that needed fixing

While I was rebuilding the retrieval stack I also fixed three other things that were biting users.

KB articles surfacing properly

Zerto’s KB archive (659 articles) was in the corpus from day one but never surfaced for relevant queries. There were two problems. First, every article’s body began with six lines of authoring-tool noise (raw article ID, semicolon-separated version list, source/target hypervisor names) that dominated chunk-0’s embedding without contributing any useful semantics. Second, the actual descriptive title (for example “How to Protect and Recover the ZVMA Server”) lived in a sidecar field that never made it into the chunk text. The embeddings had no real anchor on what each KB article was about.

Now the scraper parses the preamble into structured fields (KB number, affected versions, source hypervisors, target hypervisors), strips it from the markdown, and rebuilds chunk-0 with the descriptive title as H1 plus those structured facets as readable Markdown. Queries like “Hyper-V VRA install troubleshooting” now top-1 the right KB article, where they used to return alarm-code pages.
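A rough sketch of the chunk-0 rebuild step. The post doesn’t show the real preamble layout or the scraper’s field names, so everything here is hypothetical: the function names, the facet labels, and the sample values (including the KB number) are placeholders for illustration only.

```python
def parse_version_list(raw):
    """Split a semicolon-separated version string into clean values."""
    return [part.strip() for part in raw.split(";") if part.strip()]


def build_chunk0(title, kb_number, versions, source_hypervisors,
                 target_hypervisors, body=""):
    """Rebuild chunk-0: descriptive title as H1, facets as readable text."""
    lines = [f"# {title}", "", f"KB article: {kb_number}"]
    if versions:
        lines.append("Affected versions: " + ", ".join(versions))
    if source_hypervisors:
        lines.append("Source hypervisors: " + ", ".join(source_hypervisors))
    if target_hypervisors:
        lines.append("Target hypervisors: " + ", ".join(target_hypervisors))
    if body:
        lines += ["", body.strip()]
    return "\n".join(lines)


# Placeholder values for illustration, not real KB metadata.
chunk0 = build_chunk0(
    title="How to Protect and Recover the ZVMA Server",
    kb_number="KB-0000",
    versions=parse_version_list("10.0; 10.0 U1 ;"),
    source_hypervisors=["vSphere"],
    target_hypervisors=["vSphere"],
    body="Article text with the preamble already stripped.",
)
print(chunk0)
```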

A curated lessons doc

Six weeks of writing Zerto integrations produced a 500-line document of things you’d-only-know-if-you’d-been-bitten. The swagger advertises an OAuth implicit flow that doesn’t work for service-to-service automation (the real path is Keycloak password grant). Every long-running operation returns 202 + an id and requires polling — the swagger documents it as fire-and-forget. applyUpgrade.platform is documented as nullable; the server rejects empty. AWS ZCAs return PlatformInformation: null even though the swagger says the field is always set. Forty-plus gotchas like this.

I made it a first-class MCP tool: zerto_api_lessons(topic="authentication") returns just the auth section; no argument returns the full doc. The tool description explicitly tells the LLM “call this proactively whenever the user asks you to write a script or integrate with the Zerto API” — Claude actually does. The doc is also indexed in the corpus so search_docs finds specific sections via dense + reranker for tangential queries, and a one-line banner appears in search results when the query matches script/API trigger words.
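The topic lookup can be as simple as splitting the doc on its section headings. A sketch, assuming the lessons doc is Markdown with “## ” section headings (the post doesn’t show its layout) and that an unknown topic falls back to the full doc. lessons_for_topic is a stand-in name, not the real tool handler.

```python
import re


def split_sections(markdown):
    """Map each lowercase '## ' heading to its full section text."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        heading = re.match(r"^##\s+(.*)$", line)
        if heading:
            current = heading.group(1).strip().lower()
            sections[current] = [line]
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def lessons_for_topic(doc, topic=None):
    """Return the sections whose heading mentions topic, else the whole doc."""
    if not topic:
        return doc
    wanted = topic.strip().lower()
    matches = [
        body for name, body in split_sections(doc).items() if wanted in name
    ]
    return "\n\n".join(matches) if matches else doc


sample = (
    "# Lessons\n\n"
    "## Authentication\nUse the Keycloak password grant.\n\n"
    "## Long-running operations\nPoll the returned id until done.\n"
)
print(lessons_for_topic(sample, topic="authentication"))
```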

ZertoPublic example repos

Zerto’s public GitHub has two repos full of working API scripts: one for ZVM/ZCA, one for Zerto In-Cloud. 99 scripts across Python, PowerShell, and bash+curl. I wrote a sync script that pulls both repos weekly, wraps each script in a markdown header with task description and source URL, and feeds it into the corpus. When a user asks Claude for an example, it now finds a working script with the github.com source link in the citation. (Took two iterations to get the chunk text right — my first attempt buried task-specific content under a metadata table that didn’t embed well. Lesson: write chunk-0 in natural language that mirrors how users actually ask.)

What I learned

A few things I’d tell myself before starting this.

Dense retrieval has a specific failure mode that no amount of model tuning fixes. When queries contain rare technical tokens and the corpus is dominated by prose, dense retrieval will bury the right answer. The fix is structural (add lexical retrieval), not a different embedding model.

The reranker is not magic. It can only rank what it sees. The reranker quality story from Part 1 is real, but it was always conditional on the dense retriever surfacing the right candidates. Increase pool size or add a complementary retriever; both can help.

Eval discipline is what saved me from shipping each iteration without evidence. 25 hand-curated queries committed to the repo. A run_eval.py that takes 90 seconds. A markdown report that I read every time I changed anything in the retrieval path. Without it I’d be flying blind on whether each “fix” actually helped.

Defensive fallbacks matter. Every stage of the pipeline (BM25 query, RRF fusion, reranker call, even the embed call) can fail. The server should log a warning and fall back to a simpler path — never block a query. Every layer in production has its own kill-switch env var.
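One way to structure that, as a sketch rather than the server’s actual code: wrap each stage in a helper that logs and returns a fallback value. HYBRID_SEARCH is the env var mentioned earlier; the callables passed in (dense, bm25, fuse, rerank) and the helper names are hypothetical stand-ins.

```python
import logging
import os

log = logging.getLogger("search_docs")


def env_flag(name, default="true"):
    """Read a boolean kill-switch from the environment."""
    value = os.environ.get(name, default).strip().lower()
    return value in {"1", "true", "yes", "on"}


def safe_call(stage, fn, *args, default):
    """Run one pipeline stage; on any error, log a warning and fall back."""
    try:
        return fn(*args)
    except Exception:
        log.warning("%s failed; using fallback", stage, exc_info=True)
        return default


def search(query, dense, bm25, fuse, rerank, k=5):
    dense_hits = safe_call("dense", dense, query, default=[])
    bm25_hits = []
    if env_flag("HYBRID_SEARCH"):
        bm25_hits = safe_call("bm25", bm25, query, default=[])

    if dense_hits and bm25_hits:
        # If fusion fails, the dense list alone is still a usable pool.
        pool = safe_call(
            "rrf", fuse, {"dense": dense_hits, "bm25": bm25_hits},
            default=dense_hits,
        )
    else:
        pool = dense_hits or bm25_hits

    # If the reranker is down, return the pool in its pre-rerank order.
    ranked = safe_call("rerank", rerank, query, pool, default=pool)
    return ranked[:k]
```

The worst case is an empty result list, which is still a response rather than a blocked query.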

MCP tool descriptions are user interface. Claude actually reads the tool description and decides whether to call the tool based on the wording. “Use when…” and “Call proactively whenever…” are real surfaces. Treat them as you’d treat a button label.

What’s actually running now

Public endpoint:  https://mcp.jpaul.io/metamcp/zerto-docs/mcp
Tools:            14  (search_docs, get_page, list_cluster,
                       diff_versions, bundle_changelog, corpus_status,
                       list_versions, zerto_api_lessons,
                       interop_check, interop_versions,
                       interop_platforms, interop_categories,
                       find_doc_inconsistencies, submit_doc_bug)
Retrieval:        BM25 + dense + RRF + jina-reranker-v2-base
Corpus:           ~21K pages, ~73K chunks, refreshed weekly
                    • help.zerto.com mirror
                    • 659 KB articles with structured platform metadata
                    • interop matrix
                    • curated lessons doc
                    • 99 working API example scripts (ZertoPublic)
Doc-bug workflow: find_doc_inconsistencies surfaces candidates;
                  LLM drafts; operator confirms; submit_doc_bug posts
                  to Zerto's docs feedback channel (env-gated)
Hosting:          single self-hosted machine in my homelab, MetaMCP gateway,
                  Cloudflare Tunnel, Watchtower auto-updates from CI

Usage is logged as JSONL with 90-day retention. Each record captures the query, filters, hits returned, which retriever surfaced the top result (dense / BM25 / both), and elapsed time. That’s what drives the next iteration.
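A sketch of what one such record could look like. The field names and helper functions are assumptions for illustration; the post only lists what gets captured, not the actual log schema.

```python
import json
import time


def top_source(top_id, dense_hits, bm25_hits):
    """Which retriever(s) had the final top result in their candidate list."""
    in_dense = top_id in dense_hits
    in_bm25 = top_id in bm25_hits
    if in_dense and in_bm25:
        return "both"
    if in_dense:
        return "dense"
    if in_bm25:
        return "bm25"
    return "none"


def log_call(stream, query, filters, hits, dense_hits, bm25_hits, elapsed_ms):
    """Append one JSON object per call to a file-like stream (JSONL)."""
    record = {
        "ts": time.time(),
        "query": query,
        "filters": filters,
        "hits": hits,
        "top_source": top_source(hits[0], dense_hits, bm25_hits) if hits else "none",
        "elapsed_ms": elapsed_ms,
    }
    stream.write(json.dumps(record) + "\n")
    return record
```

In use, stream would be a log file opened in append mode, for example open("usage.jsonl", "a").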

If you’ve got a Zerto question, give it a try. And as before — if it returns the wrong thing for your query, I’d genuinely love to know. The whole point of this project is to capture lived experience that helps the next person.
