Hybrid Search Architecture for Apache Solr 10 and beyond

2026-05-18T00:00:00+00:00

Starting Point

“Hybrid search” today almost always means: lexical search (BM25) plus vector search (dense embeddings, optionally sparse like SPLADE), merged via Reciprocal Rank Fusion (RRF) or weighted linear combination, often with a downstream reranker (cross-encoder, ColBERT). The question is no longer whether an engine can do this — it’s how mature, how operationally expensive, and how scalable.

Equally important is the forward-looking perspective: the most transformative innovations of the coming years are happening not in the index, but in the layers above it — LLM query rewriting, agentic retrieval planning, cross-encoder and LLM reranking, generative answer synthesis, multimodal search, late-interaction models like ColPali for visually rich documents. This matters for engine choice because some of these trends are engine-neutral and some are not.

The serious open-source candidates fall into three categories:

Lucene-based search platforms: Solr, OpenSearch, Elasticsearch
AI-native search engine: Vespa
Vector-first databases with BM25: Qdrant, Weaviate, Milvus

The evaluation covers the following functional areas. Together they decide whether a modern search system merely “works” or is “competitive”:

Full-text and hybrid search — BM25, dense vectors, sparse models, fusion (RRF), cross-encoder and LLM reranking
Filters and selections — structured, deterministic constraints on the result set; in the hybrid path implemented as pre-filter to the KNN search
Facets and aggregations — counted refinements on the current result set, hierarchical or pivot, distinct from OLAP-style aggregations
Autosuggest / search-as-you-type — its own path with its own index, latency class (P99 < 50 ms), and ranking logic
Ranking and personalization — from static boosts through learning-to-rank to real-time multi-phase ranking with user features
Generative SERP layer — RAG answers, agentic query plans, multimodal and late-interaction retrieval, dynamic result composition

The most honest answer to “which engine ages best”: the one you hard-wire the least.

The Candidates in Detail

Apache Solr 10 (available since early 2026)

Solr 10 is no longer the “old classic that can barely do vectors.” The release brings substantial improvements: scalar and binary quantization of dense vectors, optional GPU acceleration via cuVS-Lucene as a pluggable codec, new efSearch parameters for HNSW tuning, feature-vector caching for learning-to-rank, and with SeededKnnVectorQuery and PatienceKnnVectorQuery (early termination) two specific hybrid accelerators. Hybrid retrieval constructions ({!bool should=$lex should=$knn}) and hybrid ranking work — though they are still underdocumented in the reference guide (SOLR-17103). The TextToVectorQParser allows the query to be encoded directly within Solr.

Pros	Cons
No migration needed — schema and ops knowledge stay. Genuine ASF governance, no corporate overlord, Apache 2.0. Very strong faceted search, geo, parallel SQL — if relevant. Learning-to-rank is mature and has been able to use vector similarity as a feature since 9.3.	Hybrid DX is raw — a lot of manual XML/JSON, pagination with BoolQParser + KNN is tricky, RRF not out of the box (actively being worked on). External ZooKeeper dependency for SolrCloud. Java 21 as minimum (operational implication). Community smaller and shrinking relative to Elastic/OpenSearch. Multi-vector / late-interaction fields (ColBERT/ColPali) are not yet first-class in Lucene — the most relevant future weakness.

OpenSearch (3.x)

Apache 2.0, Linux Foundation governance since 2024, Lucene-based. Version 3.2 explicitly expanded “agentic AI” and native hybrid search, supports FAISS and nmslib engines alongside Lucene HNSW, vector dimensions up to 16k. RRF and score normalization are built in as pipeline processors.

Pros	Cons
Truly open source. Security (RBAC, FLS/DLS, audit) is in the free distribution — with Elastic this costs Platinum/Enterprise. Very active push toward AI features. Large ecosystem, Kibana-equivalent dashboards. AWS integration if desired.	Performance benchmarks show it lags 40–140% behind Elasticsearch (vendor benchmarks, read with caution). Operational complexity similar to Elasticsearch. A migration from Solr is a real migration: schema, query language, tooling, configuration. Multi-vector late-interaction shares the Lucene weakness with Solr.

Elasticsearch (8.x / 9.x)

License is OSI-compliant again since 2024 via AGPLv3 option (alongside SSPL and the Elastic License). Mature hybrid search with RRF, ELSER (Elastic’s own sparse model for out-of-domain semantics), built-in reranking API.

Pros	Cons
Probably the most polished hybrid search experience in the Lucene family, excellent documentation, mature ML pipelines, Kibana. Strongest DX for hybrid out of the box.	The licensing nightmare isn’t fully over — AGPLv3 is OSI-compliant but tricky for many enterprise contexts. Many premium features (ML, security tier, RAG API) remain gated. TCO at larger cluster sizes is relevant. If the customer is trying to move away from commercial pressure, this is the wrong signal.

Vespa

Formerly Yahoo, open source under Apache 2.0 since 2017. Unlike the Lucene family, a vector-native architecture: mutable in-memory data structures (no refresh interval), multi-phase ranking on content nodes (not scatter-gather), ONNX/LightGBM executable locally.

Pros	Cons
Clear performance king for hybrid at scale — vendor benchmarks claim 8.5× higher hybrid throughput per core compared to Elasticsearch, 12.9× for pure vector. True real-time visibility. First-class tensor and ranking expressiveness, ColBERT/late-interaction native. The only engine that does retrieval and complex ranking in a single query round trip. Exactly the architecture that supports real-time personalization and multi-phase ranking.	The steepest learning curve in this list — its own configuration language, its own query language (YQL), its own mental model. Smaller community, fewer Stack Overflow answers. Operationally demanding for self-hosting; Vespa Cloud is the pragmatic alternative. Overkill if data volume is "medium" and latency isn’t critical in single-digit ms.

Qdrant / Weaviate / Milvus (Vector-first with BM25)

Qdrant (Rust, Apache 2.0), Weaviate (Go, BSD-3), Milvus (Go/C++, Apache 2.0) are primarily vector DBs but by now all ship usable BM25 + hybrid fusion (RRF, DBSF, alpha-blending). Qdrant integrates IDF calculation into the engine, Weaviate has the with_hybrid(alpha=...) API. ColBERT / ColPali support is first-class here — so they’re strongest precisely where the Lucene family is weakest.

Pros	Cons
Best DX for vector + hybrid when starting greenfield. Quick to set up, clear APIs, small footprints. Reranking hooks (ColBERT, cross-encoder) are first-class. Multimodal workflows (CLIP, SigLIP, ColPali for PDFs/images) come with less friction than the Lucene engines.	Weaker on the "classical" lexical side — tokenizers, analyzers, synonyms, fuzzy matching, phrase slop, highlighting, faceted search, spell-check are not at Lucene level. If you’re running Solr in production today, you almost certainly use features that are missing here or would have to be built. Better suited as a RAG backend than as a universal site/product search.

Candidate Evaluation Matrix

Criterion	Solr 10	OpenSearch	Elasticsearch	Vespa	Qdrant / Weaviate
License (clean OSS)	✔ Apache 2.0	✔ Apache 2.0	⚠ AGPLv3 / SSPL	✔ Apache 2.0	✔ Apache 2.0 / BSD
Hybrid search DX	⚠ raw	✔ good	✔ very good	✔ excellent	✔ excellent
Lexical depth	✔ excellent	✔ excellent	✔ excellent	✔ very good	⚠ basic
Faceting / aggregations	✔ excellent	✔ excellent	✔ excellent	✔ very good	⚠ weak
Autosuggest (e-comm level)	⚠ building blocks	⚠ building blocks	✔ search_as_you_type + LTR	✔ reference	⚠ basic
Vector performance	✔ good (with 10)	✔ good	✔ good	✔ top tier	✔ very good
Late interaction (ColBERT/ColPali)	⚠ weak	⚠ weak	⚠ in progress	✔ native	✔ first-class
Ranking flexibility	✔ LTR mature	✔ good	✔ ML stack	✔ multi-phase	⚠ rerank hook
Operational maturity	✔ high	✔ high	✔ high	⚠ steep	✔ simple
Migration cost (from current)	✔ none	✘ large	✘ large	✘ very large	✘ large
Community momentum	⚠ stable	✔ growing	✔ large	⚠ niche	✔ growing

Where Is the SERP Heading — and What Does That Mean for Engine Choice?

Before deciding, it’s worth looking at the direction of innovation. Six trends I see as defining for the next 2–4 years:

Late-interaction models migrate from reranker to retrieval layer. ColBERT was the start; ColPali/ColQwen are the natural continuation — multi-vector representations per document, MaxSim matching, no OCR pipeline drama with PDFs or images. Vespa, Qdrant and Weaviate support this in production today; Lucene-based engines have a harder time structurally because the index has historically been single-vector-centric.
LLM and cross-encoder rerankers become standard stage 2. The math is uncontested: hybrid retrieval on top-100, then a cross-encoder or LLM reranker on top-10. Voyage, Cohere, Jina, FlashRank, ColBERT-v2 are the building blocks. Engine-neutral, runs externally.
Generative answers and “generative UI” on the SERP. The display becomes dynamic: comparison table when the query looks like one; map when geo; carousel when products; pure answer when FAQ-like. The engine doesn’t decide this; the layer above does.
Agentic search and multi-step query plans. An LLM decomposes the user question into sub-queries, calls retrieval as a tool, checks the results, refines, asks back. MCP is becoming the standard interface here.
Real-time personalization in the ranking stage. Multi-phase ranking where user context, session, embedding similarity to past behavior, and business logic come together. Native in Vespa, via LTR in Lucene-based engines.
Multimodality as default. Image-to-text, text-to-image, mixed queries. CLIP, SigLIP, ColPali are the tools — for visually rich sites a realistic use case in 2–3 years.

What’s engine-relevant, what isn’t? Cross-encoder reranking, generative answers, and agentic orchestration live almost entirely above the engine. Late interaction, real-time multi-phase ranking, and (with caveats) multimodality are the trends where index architecture genuinely makes a difference. That’s where the Lucene family has structural work to do, while Vespa and the vector-first DBs are already ahead.

Recommendation

If the customer were starting greenfield — without the existing Solr investment — and the profile is “classical search engine with hybrid extension, medium-to-large data volume, no megascale RAG,” I would recommend OpenSearch. The hybrid DX is mature, RRF and score normalization are built in, the license is clean, security features are included at no extra cost, and the ecosystem is large enough that for most problems someone has already posted a solution.

However: The customer is not starting greenfield — they’re facing the Solr 9-to-10 upgrade. And here the recommendation flips: Solr 10 is sufficient in the overwhelming majority of cases for hybrid search, and migration cost to OpenSearch would be substantial (schema modeling, query language, indexing pipeline, ops, monitoring, team skills). Solr 10 closes the exact gaps that 9.x still had — with quantization, GPU codec, SeededKnn and PatienceKnn termination.

Looking ahead reinforces this recommendation — with one important caveat. The truly innovative layers of the SERP for the next several years (generative answers, agentic orchestration, query rewriting, LLM reranking) are engine-neutral and live in the application layer. Anyone thinking “I’ll buy engine X and that gets me AI search” is wrong — regardless of which engine. Only three trends are genuinely engine-architecture-relevant: late interaction (ColBERT/ColPali), real-time multi-phase ranking, native multimodality.

Concretely as a two-stage approach:

Now: Run the upgrade to Solr 10. As part of it, build a hybrid retrieval setup as a PoC — DenseVectorField, an embedding model (e.g., multilingual-e5 or bge-m3), {!bool should=$lex should=$knn} with RRF in the application layer, optionally a cross-encoder reranker as a second stage. Within a few weeks you’ll know whether the relevance gain justifies the complexity.
Later: If the PoC hits limits (missing multi-phase ranking on large data sets, a need for native late interaction for visual documents, real-time personalization requirements), the question becomes where to migrate, and the answer depends on the specific bottleneck: Vespa for ranking power and real-time, Qdrant/Weaviate for RAG/multimodal use cases, OpenSearch for a broader platform.

Upgrade to Solr 10. Add hybrid. Build it so the engine stays replaceable.

The worst move would be to migrate off working Solr without a concrete pain point to justify the bill. Every serious engine in 2026 does hybrid search — the real differentiator isn’t “can it?” but ranking quality, embedding choice, reranking strategy, and the evaluation loop. That work is engine-independent. And it’s what will actually shape the SERP of the next few years.

What Follows — A Reader’s Map

The recommendation above is the what. The remainder covers the how — the architecture and the disciplines that turn the recommendation into a working system, built so that today’s deterministic stack can host tomorrow’s agent without a rewrite.

Reader’s Map

A Resilient API Stack — the layered architecture from client to engine, the contract that keeps the engine replaceable, contrasted with the agentic alternative.
Filters and Facets — the second, non-relevance path through the same stack; why pure filter queries should skip half the pipeline.
Autosuggest — its own subsystem, different latency class, different signals, different pool.
The Indexing Path — embeddings, evaluation, the cross-cutting disciplines that decide whether the system improves over time.
Putting It Together — the synthesis, and the one rule that decides whether the engine stays replaceable.

A Resilient API Stack — Architecture Hygiene in Concrete Terms

There are two ways to design a search API today, and a system built well can support both. The first is the deterministic pipeline — query understanding, retrieval, fusion, reranking, composition — laid out as imperative code that runs the same plan every time. The second is the agentic orchestration — a model that owns the plan, calls the engine as a tool, evaluates results, and iterates. The deterministic version is what every search team has been building for two decades; the agentic version is what teams are starting to ship in production.

Query path (synchronous)                   Indexing path (async)
────────────────────────                   ─────────────────────

┌──────────────────────────────┐           ┌──────────────────────────┐
│ Client (Web, App, Agent,MCP) │           │   Source systems / CMS   │
│   speaks stable Domain API   │           │      CDC or Webhook      │
└──────────────┬───────────────┘           └────────────┬─────────────┘
               │                                        │
               ▼                                        ▼
┌──────────────────────────────┐           ┌──────────────────────────┐
│   Search BFF / Domain API    │           │    Indexing pipeline     │
│ /search /suggest /facets …   │           │  Normalize Enrich Chunk  │
│       engine-agnostic        │           └────────────┬─────────────┘
└──────────────┬───────────────┘                        │
               │                                        ▼
               ▼                              ┌──────────────────────────┐
┌──────────────────────────────┐              │    Embedding service     │
│      Query understanding     │◀─────────────│   versioned models,      │
│ Rewrite Expansion Intent     │              │       dual-write         │
│         Sub-queries          │              └────────────┬─────────────┘
└──────────────┬───────────────┘                           │
               │                                           │
               ▼                                           │ Bulk index
┌──────────────────────────────┐                           │
│    Retrieval orchestrator    │                           │
│ Plan Fan-out Fusion(RRF)     │                           │
│            Top-K             │                           │
└──────────────┬───────────────┘                           │
               │                                           │
               ▼                                           │
┌──────────────────────────────┐                           │
│        Engine adapter        │                           │
│ translates Query-IR to engine│                           │
└──────────────┬───────────────┘                           │
               │                                           │
               ▼                                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                            Search engine                             │
│              Solr / OpenSearch / Vespa / Qdrant                      │
└──────────────┬───────────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────┐           ┌──────────────────────────┐
│       Reranking stage        │           │  Eval & experimentation  │
│  Cross-encoder ColBERT LLM   │◀──────────│ Goldset A/B online metr. │
└──────────────┬───────────────┘           └────────────┬─────────────┘
               │                                        │
               ▼                                        │
┌──────────────────────────────┐                        │
│      Result composition      │                        │
│ Hits Facets Highlights       │                        │
│         RAG answer           │                        │
└──────────────┬───────────────┘                        │
               │                                        │
               ▼                                        │
┌──────────────────────────────┐                        │
│    Telemetry & click logs    │◀───────────────────────┘
│structured, linked to Query-IR│
└──────────────────────────────┘

The Agentic Alternative — Thin Primitives, an Orchestrating Agent

The deterministic-pipeline assumption is exactly what the next few years will challenge most directly. Doug Turnbull made the case sharply in a recent post: the “thick search monolith” is being unbundled. In its place: a small set of thin retrieval primitives (basic keyword search, basic embedding search, a few filters), orchestrated by an agent that sees the whole problem rather than executing reductive steps.

Frontier models like GPT-5 and Sonnet already do the 80% case well — they understand most queries with general knowledge, and they can drive a retrieval tool reasonably. But Doug’s central point is about the last 20%: the domain knowledge that isn’t in a frontier model’s training. A furniture store knows that “bistro tables” means small outdoor tables, not restaurant equipment; GPT-5 doesn’t. Specialized agentic search models — SID-1, Glean’s Waldo, startups like Charcoal — get trained on the domain and on search-as-task specifically.

What the agentic shift changes in the layers

The Retrieval Orchestrator becomes the agent’s seat. An LLM occupies this layer and runs a loop: call the engine, evaluate the result, decide whether to refine, filter, expand, or retry. No longer imperative code; a model with tools.
The engine adapter becomes hot. A deterministic pipeline calls the engine once or twice per user query. An agentic orchestrator may call it five or ten times in a loop. The adapter must be idempotent, fast, safe to call repeatedly, with clear failure modes the agent can interpret.
The reranker may shrink. When the agent itself selects across iterations, keeping promising candidates and dropping bad ones, it is effectively reranking, spread across the loop. A dedicated cross-encoder stage may still earn its place for raw quality, but it stops being mandatory.

What the agentic shift doesn’t change

What survives unchanged is the discipline: a stable domain API at the boundary, engine-agnostic hit schemas, an evaluation loop, versioned embeddings, a dedicated suggest path. The agent has to talk to something, and that something is the layered stack.

Design the engine adapter as a tool, not a remote procedure. The orchestrator calling it today is your code. The orchestrator calling it in three years is a model.

The Latency Problem — and How to Pay for the Loop

The honest cost of going agentic is latency. A deterministic pipeline runs one query plan: query understanding (~5 ms) → engine call (~50 ms) → rerank (~50 ms) → compose. Total: ~100 ms for the fast path. A plan-act-analyze loop costs (LLM inference for planning + engine call + LLM inference for analysis) per iteration. With a frontier model at 200–400 ms per call and three iterations, you’re at roughly 750 ms to 1.5 seconds before the user sees anything, even if planning and analysis share a single LLM call per iteration; with separate plan and analyze calls it is closer to 1.4–2.6 seconds. That’s the difference between “feels instant” and “feels broken.”
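The comparison above is simple enough to check by hand. The sketch below does the arithmetic with the stage latencies quoted in the text. The function names and the `llm_calls_per_iteration` parameter are illustrative assumptions, since the text leaves open whether planning and analysis are one LLM call or two.

```python
def deterministic_ms(understand_ms=5, engine_ms=50, rerank_ms=50):
    """One pass: query understanding -> engine call -> rerank."""
    return understand_ms + engine_ms + rerank_ms


def agentic_ms(llm_call_ms, iterations, engine_ms=50, llm_calls_per_iteration=2):
    """Plan-act-analyze loop: each iteration pays LLM inference plus one engine call."""
    per_iteration = llm_calls_per_iteration * llm_call_ms + engine_ms
    return iterations * per_iteration


fast_path = deterministic_ms()
# Optimistic assumption: plan and analyze share one LLM call per iteration.
combined = (agentic_ms(200, 3, llm_calls_per_iteration=1),
            agentic_ms(400, 3, llm_calls_per_iteration=1))
# Separate plan and analyze calls, as in the per-iteration formula.
separate = (agentic_ms(200, 3), agentic_ms(400, 3))

print(fast_path, combined, separate)
# prints: 105 (750, 1350) (1350, 2550)
```

Either way the LLM term dominates the per-iteration cost, which is why the mitigations that follow target the model calls rather than the engine.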

The bottleneck also moves. In the deterministic pipeline the engine dominates and you tune Solr. In the agentic loop the LLM dominates by a factor of 4–10×, and tuning the engine harder buys you almost nothing. Seven mitigations, ordered roughly by impact:

Specialized, smaller orchestrator models. A 50 ms domain-tuned model vs a 300 ms frontier model changes the equation entirely. SID-1, Waldo and similar models are designed to be cheap enough to call multiple times per query. For online search, the single biggest lever.
Speculative parallelism. Fire multiple candidate retrievals in parallel from the first plan and let the analysis step pick. Two iterations of serial latency collapse to one.
Hot-path bypass for simple queries. A small fast classifier decides: simple queries → deterministic pipeline (~100 ms), complex or ambiguous queries → agent (500–1500 ms).
Caching at multiple layers. Query-IR caching, retrieval caching, reranker caching by (query_hash, doc_id). Cache hit rates of 30–60% on hot queries are realistic.
Streaming results during iteration. Start streaming partial output from the first iteration while the agent decides whether to refine. The user perceives latency as “time to first useful content,” not “time to final response.”
Iteration budgets and timeouts. Hard cap on agent iterations: typically 2–3 for online queries.
Deterministic plan and analyze, with LLM escalation. Implement the plan and analyze steps as rules, heuristics, lookups, small fast classifiers — and reach for an LLM only when the deterministic version reports low confidence. You keep the agentic architecture while running it at deterministic-pipeline cost for 90% of queries.
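Two of the mitigations above, the iteration budget and the deterministic loop with LLM escalation, can be sketched together. This is a toy illustration under stated assumptions: `Plan`, `rule_based_plan`, `plan_act_analyze`, the confidence heuristic, and the "drop the last term" refinement are all hypothetical stand-ins for real rules and classifiers.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    query: str
    confidence: float


def rule_based_plan(query: str) -> Plan:
    """Toy planner: short queries count as simple, longer ones as uncertain."""
    normalized = " ".join(query.lower().split())
    confidence = 0.95 if len(normalized.split()) <= 3 else 0.5
    return Plan(normalized, confidence)


def plan_act_analyze(query, retrieve, llm_plan=None,
                     min_confidence=0.7, max_iterations=3, min_hits=1):
    """Deterministic loop with optional LLM escalation and a hard iteration cap."""
    plan = rule_based_plan(query)
    trace = ["rules"]
    # Escalate to the (hypothetical) LLM planner only on low confidence.
    if plan.confidence < min_confidence and llm_plan is not None:
        plan = llm_plan(query)
        trace.append("llm")
    hits = []
    for _ in range(max_iterations):
        hits = retrieve(plan.query)            # act
        trace.append(f"retrieve:{plan.query}")
        if len(hits) >= min_hits:              # analyze: good enough, stop
            break
        tokens = plan.query.split()
        if len(tokens) <= 1:
            break
        # Refine deterministically: relax the query by dropping the last term.
        plan = Plan(" ".join(tokens[:-1]), plan.confidence)
    return hits, trace
```

Here `retrieve` is any callable that takes a query string and returns a list of hits, and `llm_plan` is any callable that returns a `Plan`. The returned `trace` makes it visible per query whether the LLM was involved at all, which is the number to watch when checking how much traffic really needs escalation.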

The agentic architecture and the LLM tax are separable. Build the plan-act-analyze loop deterministically. Open up to LLM-driven plan and analyze selectively, where measurement shows rules can’t carry the load.

The combination matters more than any individual mitigation. A realistic production setup: a deterministic plan-act-analyze loop for every query (~120 ms baseline); hot-path bypass skipping the loop entirely for trivial queries (~80 ms, 40% of traffic); LLM escalation in plan or analyze for genuinely ambiguous queries (+200–300 ms, 8% of traffic); full multi-iteration LLM-driven loop reserved for the hardest cases (~600 ms, 2% of traffic). Weighted average: well under 150 ms.
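The weighted average can be verified with a few lines of arithmetic. The sketch assumes that the remaining 50% of traffic runs the plain deterministic loop at the ~120 ms baseline, and it takes the midpoint (+250 ms) of the escalation range. Both are assumptions made for the calculation, not measured figures.

```python
BASELINE_MS = 120

# (share of traffic, latency in ms) per path, taken from the setup described above.
traffic_mix = {
    "hot_path_bypass": (0.40, 80),
    "deterministic_loop": (0.50, BASELINE_MS),     # remainder of traffic (assumption)
    "llm_escalation": (0.08, BASELINE_MS + 250),   # midpoint of +200-300 ms (assumption)
    "full_llm_loop": (0.02, 600),
}


def weighted_latency_ms(mix):
    """Traffic-weighted mean latency; refuses mixes that do not sum to 100%."""
    total_share = sum(share for share, _ in mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"traffic shares sum to {total_share}, expected 1.0")
    return sum(share * latency for share, latency in mix.values())


average_ms = weighted_latency_ms(traffic_mix)
print(f"weighted average: {average_ms:.1f} ms")
# prints: weighted average: 133.6 ms
```

Note that this is a mean. The 2% of queries on the full LLM loop still take ~600 ms, so tail latency (P95/P99) should be tracked separately.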

Agentic search is a latency tax — and the bill comes due on every iteration. Don’t ship a 1.5-second loop and hope users forgive you.

The Guiding Idea

A resilient stack accepts three truths. First, the engine is the longest-lived component, but not the most valuable one — you swap it maybe once every five years; the layers above grow every year. Second, ranking is its own subsystem, not an engine feature. Third, the SERP is composed in the application layer, not in the index.

Layer 1 — Stable Domain API (the Search BFF)

The most important decision in the whole stack. The client (web, app, later agents via MCP) speaks not with the engine but with its own domain API, formulated in the language of your search — /search?q=...&filter=type:trick&page=2, not /solr/select?q=...&fq=...&rows=10. The response format is also independent: { hits: [...], facets: [...], suggestions: [...], answer?: {...} } — and contains no Solr-specific fields.

Take this seriously, and you can switch engines later without touching the client. Don’t, and in two years you have solrFacetCount in your React code and never get out again.
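One way to enforce this boundary is to give the domain response its own types and to map engine output into them in exactly one place. The following is a minimal sketch, not a prescribed schema: the class names, the `from_solr` helper, and the `title` field are hypothetical, and the mapping assumes Solr's classic `facet_fields` format, a flat list of alternating values and counts.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class Hit:
    id: str
    title: str
    score: float


@dataclass
class FacetBucket:
    value: str
    count: int
    selected: bool = False


@dataclass
class Facet:
    name: str
    buckets: list[FacetBucket] = field(default_factory=list)


@dataclass
class SearchResponse:
    hits: list[Hit] = field(default_factory=list)
    facets: list[Facet] = field(default_factory=list)
    suggestions: list[str] = field(default_factory=list)
    answer: Optional[dict[str, Any]] = None


def from_solr(raw: dict[str, Any]) -> SearchResponse:
    """Map a Solr-style JSON response into the engine-neutral domain response."""
    docs = raw.get("response", {}).get("docs", [])
    hits = [Hit(id=str(d["id"]), title=str(d.get("title", "")),
                score=float(d.get("score", 0.0))) for d in docs]
    facets = []
    facet_fields = raw.get("facet_counts", {}).get("facet_fields", {})
    for name, flat in facet_fields.items():
        # Flat list: [value, count, value, count, ...]
        buckets = [FacetBucket(value=str(v), count=int(c))
                   for v, c in zip(flat[::2], flat[1::2])]
        facets.append(Facet(name=name, buckets=buckets))
    return SearchResponse(hits=hits, facets=facets)
```

Serializing with `asdict(from_solr(raw))` yields the `hits`, `facets`, `suggestions`, `answer` shape described above. A later engine swap then means writing a new `from_...` mapper, while the client contract stays untouched.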

Layer 2 — Query Understanding

This layer turns the incoming query into a structured representation (the internal Query IR). It covers spellcheck/did-you-mean, synonym expansion, language detection, intent classification, and increasingly LLM-based steps: sub-query decomposition, HyDE-style query hypotheses, entity linking to your own vocabulary.

Important: this stage returns an object, not a rewritten string. A good query IR looks like:

{
  "raw": "new tricks for beginners",
  "normalized": "new tricks for beginners",
  "language": "en",
  "intent": "browse",
  "entities": [{"type": "skill_level", "value": "beginner"}],
  "expanded_terms": ["tricks", "stunts", "moves"],
  "embeddings": { "dense": [], "sparse": {} },
  "subqueries": []
}

Vector search is not semantic search. This distinction lives here, in Query Understanding. Vector search embeds the query string and finds documents whose embeddings are close: a similarity operation, not a meaning operation. True semantic search works out what the query actually means and carries that meaning into retrieval, often by rewriting or augmenting the query before it touches the engine.

The classic example is “wireless bras.” A general-purpose embedding model puts “wireless bras” near documents about bras in general — the word “wireless” is a weaker signal in the embedding than the word “bras,” and the model has no domain knowledge that, in this product category, “wireless” means “no underwire.” Pure vector search will happily return underwire bras as top results. True semantic search recognizes the intent — no underwire — and acts on it.

Vector search asks “what’s nearby?” Semantic search asks “what did you mean?” The first is math. The second is domain knowledge — and the document index alone won’t give it to you.
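The “wireless bras” case can be sketched as a domain rule that runs in Query Understanding and returns a structured object rather than a rewritten string. This is a deliberately small illustration: `DOMAIN_RULES`, the `underwire` attribute, and the `understand` function are hypothetical, and a real system would draw such rules from a curated vocabulary or a trained model instead of a single regex.

```python
import re
from dataclasses import dataclass, field

# Hypothetical domain lexicon: pattern -> (replacement text, structured attribute).
DOMAIN_RULES = [
    (re.compile(r"\bwireless\s+(bras?)\b"), r"\1", ("underwire", False)),
]


@dataclass
class QueryIR:
    raw: str
    normalized: str
    entities: list = field(default_factory=list)


def understand(raw: str) -> QueryIR:
    """Turn a raw query into a small Query IR with domain-specific entities."""
    ir = QueryIR(raw=raw, normalized=" ".join(raw.lower().split()))
    for pattern, replacement, (attr, value) in DOMAIN_RULES:
        if pattern.search(ir.normalized):
            # Move the domain meaning out of the text and into a structured entity.
            ir.normalized = pattern.sub(replacement, ir.normalized)
            ir.entities.append({"type": attr, "value": value})
    return ir


ir = understand("Wireless bras")
print(ir.normalized, ir.entities)
# prints: bras [{'type': 'underwire', 'value': False}]
```

The retrieval layers can then decide how to use the `underwire` entity, as a filter or as a boost, instead of hoping that the embedding of the word “wireless” carries the intent.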

Layer 3 — Retrieval Orchestrator

The ranking brain. The orchestrator decides: which retrieval strategies run (BM25, dense, sparse/SPLADE, late interaction), in parallel or sequentially, how to fuse (RRF, linear combination, learned), and how much (top-K). This is where you can later emulate Vespa-style multi-phase ranking even if the engine only delivers phase 1.
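As an illustration of the fusion step, here is a minimal Reciprocal Rank Fusion sketch that merges any number of ranked ID lists in the application layer. The function name and defaults are assumptions for this example. `k=60` is a commonly used constant for RRF, and the right value for a given corpus is something to confirm in evaluation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=10):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, descending; break ties by doc ID for stable output.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


lexical = ["a", "b", "c"]   # e.g. BM25 order
dense = ["b", "d", "a"]     # e.g. KNN order
fused_ids = [doc_id for doc_id, _ in reciprocal_rank_fusion([lexical, dense])]
print(fused_ids)
# prints: ['b', 'a', 'd', 'c']
```

Because RRF only looks at ranks, it needs no score normalization between BM25 and vector similarity, which is why it is a convenient first fusion strategy before trying weighted or learned combinations.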

Layer 4 — Engine Adapter

The adapter translates the internal query IR into what the specific engine understands. Solr gets {!bool should=$lex should=$knn}, OpenSearch gets a hybrid pipeline, Qdrant gets a query_points call with prefetch. The adapter must contain no business logic — that belongs in the orchestrator or reranking. The adapter is dumb and mechanical; that’s its virtue.
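A sketch of what “dumb and mechanical” can look like for Solr: the adapter only turns Query IR fields into request parameters. The `EngineAdapter` protocol, the class and field names (`embedding`, `title body`), and the parameter names `lex`, `knn`, and `user_q` are assumptions for this example. The query syntax uses Solr's `bool`, `edismax`, and `knn` query parsers, but the exact parameters should be checked against the reference guide for the Solr version in use.

```python
from typing import Any, Protocol


class EngineAdapter(Protocol):
    def translate(self, ir: dict[str, Any], top_k: int) -> dict[str, Any]: ...


class SolrAdapter:
    """Translates a Query IR dict into Solr request parameters. No business logic."""

    def __init__(self, vector_field: str = "embedding", lexical_fields: str = "title body"):
        self.vector_field = vector_field
        self.lexical_fields = lexical_fields

    def translate(self, ir: dict[str, Any], top_k: int = 100) -> dict[str, Any]:
        vector = ",".join(repr(float(x)) for x in ir["embeddings"]["dense"])
        return {
            # Hybrid retrieval: lexical and KNN clauses as optional bool clauses.
            "q": "{!bool should=$lex should=$knn}",
            # User text is passed by parameter reference, not spliced into local params.
            "lex": f"{{!edismax qf='{self.lexical_fields}' v=$user_q}}",
            "user_q": ir["normalized"],
            "knn": f"{{!knn f={self.vector_field} topK={top_k}}}[{vector}]",
            "fl": "id,score",
            "rows": top_k,
        }


params = SolrAdapter().translate(
    {"normalized": "bmx tricks", "embeddings": {"dense": [0.1, 0.2]}}, top_k=50
)
print(params["knn"])
# prints: {!knn f=embedding topK=50}[0.1,0.2]
```

An adapter for another engine would implement the same `translate` signature and return that engine's request body. Decisions such as which strategies to run or how to fuse stay in the orchestrator.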

Layer 5 — Reranking as Its Own Stage

The most important additional service in this stack — and the one most teams get the biggest relevance gain from. In practice you run two tracks here: a fast cross-encoder or ColBERT for the default path (~50–100 ms on top-50), and optionally an LLM reranker for high-value queries. Reranker outputs are cacheable by (query_hash, doc_id) pair.
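The caching idea can be sketched as a thin wrapper around whatever scoring function the reranker exposes. Everything here is illustrative: `CachedReranker` is a hypothetical class, the in-memory dict stands in for a real cache, and `overlap_score` is a toy scorer used in place of a cross-encoder.

```python
import hashlib


def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: share of query terms found in the text."""
    q_terms = set(query.lower().split())
    return len(q_terms & set(text.lower().split())) / max(len(q_terms), 1)


class CachedReranker:
    def __init__(self, score_fn):
        self._score_fn = score_fn   # callable: (query, doc_text) -> float
        self._cache = {}            # (query_hash, doc_id) -> score
        self.misses = 0

    @staticmethod
    def _query_hash(query):
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def rerank(self, query, docs, top_k=10):
        """docs is a list of (doc_id, text); returns the top_k (doc_id, score) pairs."""
        query_hash = self._query_hash(query)
        scored = []
        for doc_id, text in docs:
            key = (query_hash, doc_id)
            if key not in self._cache:
                self._cache[key] = self._score_fn(query, text)
                self.misses += 1
            scored.append((doc_id, self._cache[key]))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


reranker = CachedReranker(overlap_score)
docs = [("d1", "bmx tricks for beginners"), ("d2", "road bike maintenance"), ("d3", "advanced bmx")]
print(reranker.rerank("bmx tricks", docs, top_k=2))
# prints: [('d1', 1.0), ('d3', 0.5)]
```

One design point to settle in a real deployment: a key of `(query_hash, doc_id)` goes stale when a document's content changes, so the key usually also needs a document version or the cache needs a TTL.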

Layer 6 — Result Composition

The SERP is assembled here. Hits from the reranker, facets from the engine, highlights, optionally a generative answer (RAG with the top-3 hits as context), possibly a dynamic UI hint. This layer will grow the most in the next several years, because generative UI and AI-Overview-style features dock here. That’s exactly why it must not touch the engine directly.

Filters and Facets — The Second Path

The layers above optimize the relevance path: full-text query in, ranked hits out. Filters and facets are different in nature — deterministic, set-based, and need neither embeddings nor reranking. A future-proof architecture treats them as a second, leaner path through the same stack.

Browse path (filters, no full-text query)
─────────────────────────────────────────

┌──────────────────────────────┐
│ Client (Web, App, Agent,MCP) │
│ Filters+facets, no q         │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│   Search BFF / Domain API    │
│ /search?filter=…&facet=…     │
└──────────────┬───────────────┘
               │ (Query Understanding,
               │  Orchestrator, Reranking
               │  are skipped)
               ▼
┌──────────────────────────────┐
│        Engine adapter        │     ┌──────────────────────────┐
│ Filters as pre-filter        │ ◀── │ Facet-only / Suggest     │
│ Facet aggregations           │     │ (cacheable, separate)    │
└──────────────┬───────────────┘     └──────────────────────────┘
               │
               ▼
┌──────────────────────────────┐
│        Search engine         │
│ Filters + Facet in 1 round   │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│      Result composition      │
│ Hits + Facets + Selections   │
└──────────────────────────────┘

Domain API: Filters and facets must be first-class. /search?q=...&filter[category]=trick&facet[]=brand. The response format needs its own facets section with buckets, counts, and the currently active selection.
Query Understanding does noticeably less here. The exception is natural-language filter extraction: “cheap BMX bikes under 500 euros” should become {q: "BMX bikes", filter: {price: "<500"}}.
Retrieval Orchestrator: the browse path forks from the relevance path. Pure filter queries need no hybrid fusion, no RRF, no embeddings.
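Engine Adapter: the most important technical pitfall is here. Filters must go to the KNN search as pre-filter, not as post-filter: a post-filter runs only after the top-K neighbors have been picked, so a selective filter can leave far fewer than K hits, or none. Solr 10 supports pre-filtering in the knn query parser (fq clauses or the preFilter local parameter), OpenSearch via efficient k-NN filtering (a filter clause inside the knn query), Qdrant via native filter conditions.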
Reranking is skipped in the browse path. Reranking a purely filtered list with no query is pointless.
Result Composition builds the facet UI from the engine’s buckets — and decides which facets are displayed (sticky, conditional, hierarchical).
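The pre-filter versus post-filter pitfall from the Engine Adapter item is easy to reproduce with a toy example. The sketch uses brute-force cosine similarity over four made-up documents instead of an HNSW index, so it shows the logic of the two orderings, not the behavior of any specific engine. All names and data are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec, docs, k):
    return sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)[:k]


def knn_pre_filter(query_vec, docs, k, predicate):
    """Filter first, then take the k nearest among the allowed documents."""
    return top_k(query_vec, [d for d in docs if predicate(d)], k)


def knn_post_filter(query_vec, docs, k, predicate):
    """Take the k nearest overall, then drop whatever the filter rejects."""
    return [d for d in top_k(query_vec, docs, k) if predicate(d)]


docs = [
    {"id": "d1", "vec": [1.0, 0.0], "category": "bike"},
    {"id": "d2", "vec": [0.9, 0.1], "category": "bike"},
    {"id": "d3", "vec": [0.8, 0.2], "category": "trick"},
    {"id": "d4", "vec": [0.1, 0.9], "category": "trick"},
]
is_trick = lambda d: d["category"] == "trick"

pre = [d["id"] for d in knn_pre_filter([1.0, 0.0], docs, 2, is_trick)]
post = [d["id"] for d in knn_post_filter([1.0, 0.0], docs, 2, is_trick)]
print(pre, post)
# prints: ['d3', 'd4'] []
```

With the post-filter, the two nearest neighbors are both bikes, so the user who selected `category=trick` gets an empty page even though matching documents exist.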

Do facets have to come from the search engine? The honest answer: usually yes, but not necessarily. The dividing line is whether the facet refers to the current result set or to the full corpus and analytics data.

All facets that aggregate over the current search/filter result set (the classic refinement facets) should come from the engine. Global counts and analytics-style aggregates don't have to; those belong in an OLAP store like ClickHouse or Druid.

The decisive question: does the facet count the current result set, or the whole corpus? Result set → engine. Corpus → can live elsewhere.

When Filters Cost You Recall — and How LLMs Change the Calculus

There’s a rule every senior search engineer has evangelized at some point: do not pre-select filters from a free-text query. The reasoning is sound. Pre-selecting filters destroys recall in two ways at once. Misclassification: the system infers a filter the user didn’t intend, and correct documents disappear. Missing attribute data: documents that would match are tagged inconsistently or not at all in the filter field. The user sees a thin, wrong result set and walks.

The question is whether LLMs change the calculus, and the honest answer is: yes, but only with deliberate safeguards. Three patterns make filter inference safer:

Filters as boosts, not gates. boost:underwire=false^3.0 instead of filter:underwire=false. Trades a small amount of precision for a meaningful amount of safety.
Confidence-aware filter application. The LLM returns a confidence score with each candidate. High-confidence numeric constraints → filters. Lower-confidence semantic inferences → boosts or omitted.
Agentic iteration with recall checks. Apply the inferred filter, look at the result count. If the result set collapsed below threshold, drop the filter and re-run. The orchestrator detects the failure mode and self-corrects.
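The second and third patterns can be sketched together. This is an illustrative outline, not a reference implementation: the thresholds, the `InferredConstraint` type, and the `search` callable (any function taking a query, a filter list, and a boost list, and returning hits) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferredConstraint:
    field: str
    value: str
    confidence: float  # as reported by the query-understanding step


def plan_constraints(constraints, filter_threshold=0.9, boost_threshold=0.6):
    """Split inferred constraints into hard filters and soft boosts.

    Anything below `boost_threshold` is omitted. Both thresholds are
    hypothetical defaults that would need tuning against real traffic.
    """
    filters = [c for c in constraints if c.confidence >= filter_threshold]
    boosts = [c for c in constraints
              if boost_threshold <= c.confidence < filter_threshold]
    return filters, boosts


def search_with_recall_check(search, query, constraints, min_hits=5):
    """Apply inferred filters, then drop them if the result set collapses."""
    filters, boosts = plan_constraints(constraints)
    hits = search(query, filters, boosts)
    if filters and len(hits) < min_hits:
        # Recall check failed: re-run without the inferred filters.
        hits = search(query, [], boosts)
    return hits
```

A gentler variant would demote the dropped filters to boosts on the second run instead of discarding them, which combines this with the first pattern.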

The old rule “never pre-select filters from a free-text query” wasn’t wrong — it was right for a system that couldn’t recover. With LLM-driven query understanding and an orchestrator that can iterate, the rule becomes “pre-select as boosts, with confidence, with a fallback path.” Same caution, more tools.

Consequences for Engine Choice

Solr and OpenSearch have the most mature faceting engines in the Lucene family. If a use case is heavily browse- and filter-driven (product catalog, classical site search), a Lucene-based engine stays the natural choice. For RAG-centric use cases where facets play a minor role, the weaker faceting of the vector-first DBs is acceptable.

The split between relevance path and browse path belongs in the orchestrator — not the adapter, not the composer. Miss it, and pure filter queries push embeddings through the stack for nothing.
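One way to picture that fork is a small planning function in the orchestrator. The sketch below assumes a request counts as "browse" when it carries no free-text query; the `SearchRequest` shape and the plan keys are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Path(Enum):
    BROWSE = "browse"
    RELEVANCE = "relevance"


@dataclass
class SearchRequest:
    q: str = ""
    filters: dict = field(default_factory=dict)
    facets: list = field(default_factory=list)


def choose_path(request):
    """Pure filter/facet requests take the browse path."""
    if request.q.strip() in ("", "*", "*:*"):
        return Path.BROWSE
    return Path.RELEVANCE


def plan(request):
    """Decide which downstream stages run for this request."""
    relevance = choose_path(request) is Path.RELEVANCE
    return {
        "path": choose_path(request),
        "embed_query": relevance,    # no embedding call on the browse path
        "fuse_with_rrf": relevance,  # nothing to fuse without a query
        "rerank": relevance,         # reranking needs a query to score against
    }
```

Because the decision lives in one place, the adapter and the composer never need to know which path a request took.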

Autosuggest — The Underestimated Lever

Autosuggest is not an afterthought in e-commerce. Vinted reports that over 20% of all search sessions now start with a click on a suggest result — a few years ago it was below 8%. The system handles 4,700 queries per second with P99 of 31 ms against a pool of 125 million suggestions. That’s not UX polish, that’s a direct conversion lever.

Autosuggest Is Not Ordinary Search

Latency class: P99 below ~30–50 ms against a large suggestion pool, on every keystroke. Full-text search may take 200 ms; suggest may not.
Load profile: 5–8 suggest calls per submitted search — suggest QPS is typically 5–10× higher than search QPS.
Its own index, its own ranking logic: we rank queries, not documents. At Vinted, query-log candidates make up only 2% of the pool but generate about half of all clicks.
Ranking signals differ: not BM25 + vector but STR (sell-through rate), suggestion CTR, prefix-level click frequency, and crucially: input length.
Its own fallback logic: progressive relaxation (exact prefix → fuzzy(1) → fuzzy(2)), stopping as soon as 10 results have been collected.
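The progressive relaxation in the last point can be sketched over an in-memory pool. This is a simplification under stated assumptions: fuzzy matching here compares the typed prefix against the first characters of each suggestion using Levenshtein distance, and the pool is a plain list of (suggestion, score) pairs. A production system would use the engine's suggester or an FST rather than a linear scan.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def suggest(prefix, pool, limit=10):
    """Return up to `limit` suggestions, relaxing the match tier by tier.

    Tier 0 is an exact prefix match, tiers 1 and 2 allow that many edits.
    Later tiers only run if earlier tiers did not fill the list.
    """
    prefix = prefix.lower()
    results = []
    for max_edits in (0, 1, 2):
        tier = [
            (text, score) for text, score in pool
            if text not in results
            and edit_distance(prefix, text.lower()[:len(prefix)]) <= max_edits
        ]
        tier.sort(key=lambda item: item[1], reverse=True)
        results.extend(text for text, _ in tier)
        if len(results) >= limit:
            break  # stop as soon as the list is full
    return results[:limit]


pool = [("bmx bike", 5.0), ("bmx helmet", 3.0), ("bmw poster", 1.0)]
print(suggest("bnx", pool))
```

Exact matches always rank above fuzzy ones here because the tiers are appended in order, whatever the scores inside each tier.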

Suggest is its own subsystem — its own index, its own latency class, its own ranking model. Treat it as a setting on full-text search and you build a feature. Architect it as its own path and you build a conversion lever.

What Solr 10 Brings to the Table

Solr has traditionally had a rich suggester infrastructure. The building blocks are solid, but the gap to the Vinted/Vespa reference architecture is real.

Existing building blocks in Solr 10: AnalyzingInfixSuggester and BlendedInfixSuggester (Lucene-based, with a real analyzer chain); FuzzySuggester for Levenshtein-based typo tolerance; WFSTCompletionLookup / FSTCompletionLookup for very fast FST-based lookups (FSTLookupFactory is the new default in 10); EdgeNGram field type as a manual path; context filtering; chained suggesters mapping the tier architecture; mature LTR with vector features since 9.3.

Where Solr 10 falls structurally behind Vespa: LTR in the hot path on every keystroke is possible but uncomfortable. Real-time feature store for user features is missing. Accent tolerance with intent preservation isn’t out-of-the-box. Streaming-mode indexing for suggest-pool updates is doable but not the standard path.

Concrete Recommendation for Building Autosuggest

If the customer today has Solr 9 with rudimentary suggest and wants to raise the level with Solr 10, I would not start with LTR. Vinted’s data are very clear: the biggest lever wasn’t ML reranking, but adding query-log candidates to the pool.

Raise the baseline: BlendedInfixSuggester on a dedicated suggest core, pool from product metadata + search logs, simple heuristic, progressive relaxation in two tiers. A 2–3 week project, probably captures 80% of the Vinted effect.
Build out tier matching and measure: add the third fuzzy tier, set up A/B tests. Tune Solr suggest performance to P99 < 30 ms.
Personalization via reranker service: only then add LightGBM reranking as its own stage. Start with few, high-impact features.
Session awareness and personal history as API features, no engine changes needed.
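The first step, a pool built from product metadata plus search logs with a simple heuristic, might look roughly like this. The weighting formula, the minimum query count, and all names are assumptions for illustration; they are not Vinted's method or a Solr feature.

```python
import math
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so duplicates merge."""
    return " ".join(text.lower().split())


def build_suggest_pool(product_titles, logged_queries, min_query_count=2):
    """Merge catalog-derived and query-log candidates into one pool.

    Returns a dict: suggestion text -> {"source", "weight"}.
    The weight is a hypothetical heuristic: catalog entries get a flat 1.0,
    logged queries get 1.0 plus the log of how often they were issued.
    """
    pool = {}
    for title in product_titles:
        pool[normalize(title)] = {"source": "catalog", "weight": 1.0}
    counts = Counter(normalize(q) for q in logged_queries)
    for query, count in counts.items():
        if count < min_query_count:
            continue  # rare queries are often typos or noise
        pool[query] = {"source": "query_log",
                       "weight": 1.0 + math.log1p(count)}
    return pool
```

Each pool entry would then be indexed as one document in the dedicated suggest core, with the weight stored in the field the suggester uses for ranking.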

The biggest suggest lever isn’t the model — it’s the pool. Real user queries from your search logs beat any personalization you can bolt on top.

The Indexing Path

The indexing pipeline is the place where embedding discipline is decided. Three rules:

Embeddings are versioned. Every embedding carries a model tag (bge-m3-v1, e5-large-v2). When you change the model, dual-write runs: all new documents get both embeddings, the backfill runs in the background, and only when 100% coverage is reached does the query side switch over.
Embedding generation as its own service, not as an engine plugin. Solr 10’s TextToVectorQParser is tempting, but it binds the embedding logic to the engine. Better: a small dedicated service called both from the indexing pipeline and from query understanding. Same model on both sides — that’s the point that often goes wrong.
The pipeline is declarative, ideally CDC-driven. A document update in the CMS → an event → the pipeline normalizes, chunks, embeds, indexes. No cron job, no “reindex button.”
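The first rule (versioned embeddings with dual-write and a coverage-gated switch) can be sketched as follows. The embedders are stand-in callables and the `IndexedDoc` type is hypothetical; in a real pipeline the documents would live in the engine and coverage would be computed with a count query.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    doc_id: str
    text: str
    embeddings: dict = field(default_factory=dict)  # model tag -> vector


def index_document(doc, embedders, active_tags):
    """Dual-write: store one embedding per active model tag."""
    for tag in active_tags:
        doc.embeddings[tag] = embedders[tag](doc.text)
    return doc


def coverage(docs, tag):
    """Fraction of documents that carry an embedding for `tag`."""
    if not docs:
        return 0.0
    return sum(1 for d in docs if tag in d.embeddings) / len(docs)


def query_model_tag(docs, current_tag, next_tag):
    """The query side switches only once every document has the new tag."""
    fully_covered = bool(docs) and all(next_tag in d.embeddings for d in docs)
    return next_tag if fully_covered else current_tag
```

During migration the query side keeps embedding queries with the old model, so query vectors and document vectors never come from different models.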

The Central Cross-Layer — Evaluation

This is the service most teams forget, and it offers the most leverage. Three components:

A goldset with query → expected top-K, maintained by the business side. A nightly job computes NDCG, MRR, recall@K with explicit metric targets.
An A/B infrastructure running two configurations in parallel and measuring online metrics (CTR, position of first click, reformulation rate, zero-result rate).
Structured telemetry linking every click to the query IR active at the time and the displayed hit list. This is simultaneously the training-data pipeline for later learning-to-rank.
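For the nightly goldset job, the three offline metrics can be computed with a few lines of plain Python. This sketch assumes the goldset stores, per query, either a set of relevant document IDs or graded gains, and it uses the linear-gain form of DCG; other variants (for example exponential gain) exist.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Share of relevant documents that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant hit, 0.0 if there is none."""
    relevant = set(relevant)
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over (ranked, relevant) pairs, one pair per goldset query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, gains, k):
    """NDCG with linear gain; `gains` maps doc_id -> graded relevance."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(i + 1)
              for i, doc_id in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging these per-query values over the goldset and comparing them with the explicit targets turns a configuration change into a pass/fail signal.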

Without this layer you can’t measure improvements, and without measurement every change becomes an act of faith.

Putting It Together — What This Means for Solr 10

In the customer’s context: Solr is the engine in the “Search engine” box. Around it sit the embedding service (standalone), query understanding (standalone, initially simple), the retrieval orchestrator (initially thin, just hybrid + RRF), the adapter (Solr-specific), the reranker (standalone, with a cross-encoder), and composition (standalone). The Search BFF sits at the top.
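The "just hybrid + RRF" starting point for the orchestrator is small enough to show in full. This sketch fuses any number of ranked ID lists with Reciprocal Rank Fusion, using the commonly cited constant k = 60. The function name is hypothetical, and the alphabetical tie-break is a choice made here for determinism.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical = ["a", "b", "c"]  # e.g. BM25 order
vector = ["b", "c", "d"]   # e.g. KNN order
print(reciprocal_rank_fusion([lexical, vector]))  # ['b', 'c', 'a', 'd']
```

RRF only needs ranks, not scores, so the lexical and vector result lists can come from different query types or even different engines.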

If you build it this way, a later switch to OpenSearch costs only the adapter swap and a reindex, no replatforming. A switch to Vespa costs more — parts of the orchestrator and reranker migrate into Vespa, because Vespa does this natively — but the domain API and the client stay untouched.
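What "only the adapter swap" means in code is that everything above the adapter depends on a narrow interface, never on Solr types. A minimal sketch, with all type and method names hypothetical and an in-memory stand-in in place of a real engine:

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class EngineQuery:
    text: str
    filters: dict = field(default_factory=dict)
    top_k: int = 10


@dataclass
class Hit:
    doc_id: str
    score: float


class EngineAdapter(Protocol):
    """The only surface the orchestrator is allowed to see."""
    def search(self, query: EngineQuery) -> list[Hit]: ...


class InMemoryAdapter:
    """Stand-in engine; a SolrAdapter would translate EngineQuery to HTTP."""

    def __init__(self, docs):
        self.docs = docs  # doc_id -> {"text": str, plus filterable fields}

    def search(self, query):
        terms = query.text.lower().split()
        hits = []
        for doc_id, doc in self.docs.items():
            if any(doc.get(k) != v for k, v in query.filters.items()):
                continue  # exact-match filters applied before scoring
            words = doc["text"].lower().split()
            score = float(sum(words.count(t) for t in terms))
            if score > 0:
                hits.append(Hit(doc_id, score))
        hits.sort(key=lambda h: (-h.score, h.doc_id))
        return hits[:query.top_k]


def run_search(adapter: EngineAdapter, query: EngineQuery) -> list[str]:
    """Orchestrator-side code: works with any adapter implementation."""
    return [hit.doc_id for hit in adapter.search(query)]
```

Swapping engines then means writing one new class with the same `search` method, while `run_search` and everything above it stay unchanged.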

Keep the domain API, the reranker, and the composer Solr-free. Those three layers clean, everything else is fixable. Those three layers dirty, nothing is.

So the honest answer to “which engine ages best” is: the one you hard-wire the least.

Sources for Deeper Research

Engine documentation and releases

Apache Solr 10 Reference Guide — Suggester
Major Changes in Solr 10
Sease.io blog on Solr vector search and KNN optimization
Vespa.ai documentation and Vespa blog
Qdrant articles — BM42, RRF, DBSF, hybrid reranking patterns
Elasticsearch search-as-you-type field type
