OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

Summary (Overview)

Key Contribution: OmniRetrieval is a novel framework that provides unified retrieval across structurally diverse knowledge sources (unstructured text, relational databases, RDF knowledge graphs, and labeled property graphs) by engaging each source through its native query language, rather than collapsing them into a shared representation.
Core Methodology: The framework operates in three stages: 1) Source Selection identifies relevant knowledge bases from a large pool, 2) Query Formulation generates executable queries in the native language (e.g., SQL, SPARQL, Cypher, or free-text) for each selected source, and 3) Cross-Source Evidence Selection consolidates the heterogeneous results into a unified evidence set.
Main Finding: OmniRetrieval consistently outperforms single-source baselines and a routing baseline on a comprehensive benchmark spanning 13 datasets and 309 distinct knowledge bases, demonstrating effective cross-paradigm retrieval.
Design Insight: A key design principle is deferring the final commitment on the correct answer to the evidence selection stage, which allows for broad exploration at the source selection step and enables recovery from initial mis-selections.
Scalability Advantage: Adding a new knowledge source requires only registration (providing its structural context $c_b$), without retraining a shared encoder or modifying the embedding space, making the framework scalable and extensible.

Introduction and Theoretical Foundation

Real-world information needs require accessing knowledge stored in various structural forms: unstructured text passages, relational tables, RDF knowledge graph triples, and labeled property graph paths. Existing retrieval systems are typically siloed, operating over a single source type using a fixed query language (e.g., document retrievers, text-to-SQL systems). This leaves the broader knowledge landscape fragmented.

A common unification attempt is to project all sources into a shared embedding space or linearized text format. However, this homogenization erases the structural affordances (schemas, joins, traversals, compositional operators) that give each source its expressive power and can introduce biases (e.g., embedding clusters by source type rather than semantic content).

OmniRetrieval adopts the opposite approach: keep each source on its own terms and build a unifying access layer above them. The core idea is to meet each source through its native query language, preserving its structural operators, while coordinating access from a single natural-language user interface.

The retrieval problem, then, is not merely to find relevant content within a source, but to navigate the structural heterogeneity that runs across sources.

Methodology

2.1. Problem Formulation

Let $q$ be a user question and $\mathcal{B} = \{ b_1, ..., b_N \}$ be a pool of independently maintained knowledge sources. Each source $b \in \mathcal{B}$ has:

A native query language (SQL, SPARQL, Cypher, or free-form text).
An execution engine $\text{Exec}(b, \hat{q})$ that accepts a native query $\hat{q}$ and returns results.
An exposed structural context $c_b$ (e.g., a relational schema, an ontology, a corpus descriptor) that an external caller can read to formulate a query.

The task is to provide a set of evidence $\mathcal{E}$ relevant to $q$, drawn from one or more sources in $\mathcal{B}$. The framework must operationalize:

Selection of a subset $\mathcal{S} \subseteq \mathcal{B}$ of sources to engage.
Formulation of an executable query $\hat{q}_b$ in the native language of each $b \in \mathcal{S}$.
Consolidation of the executor outputs $\{ \text{Exec}(b, \hat{q}_b) \}_{b \in \mathcal{S}}$ into a single evidence set $\mathcal{E}$ .
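The formulation above can be made concrete with a small data model. The following is a minimal sketch, not the paper's implementation: the `KnowledgeSource` class, the two factory helpers, and the toy data are all hypothetical names introduced for illustration, and the keyword matcher merely stands in for a real document retriever.

```python
import sqlite3
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class KnowledgeSource:
    """One source b: its native language, structural context c_b, and executor."""
    name: str
    language: str                        # "sql", "sparql", "cypher", or "text"
    context: str                         # c_b, readable by an external caller
    execute: Callable[[str], List[Any]]  # Exec(b, q_hat)


def make_sqlite_source(name: str, ddl: str, rows_sql: str) -> KnowledgeSource:
    """Wrap an in-memory SQLite database as a relational source (illustrative)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl + rows_sql)

    def run(query: str) -> List[Any]:
        return conn.execute(query).fetchall()

    # The schema (DDL) doubles as the exposed structural context c_b.
    return KnowledgeSource(name, "sql", ddl, run)


def make_text_source(name: str, passages: List[str]) -> KnowledgeSource:
    """Wrap passages as a free-text source; keyword overlap stands in for a retriever."""
    def run(query: str) -> List[Any]:
        terms = set(query.lower().split())
        return [p for p in passages if terms & set(p.lower().split())]

    return KnowledgeSource(name, "text", f"corpus of {len(passages)} passages", run)


ddl = "CREATE TABLE city (name TEXT, population INTEGER);"
rows = "INSERT INTO city VALUES ('Oslo', 700000), ('Bergen', 290000);"
pool = [
    make_sqlite_source("geo_db", ddl, rows),
    make_text_source("wiki", ["Oslo is the capital of Norway.", "Cats sleep a lot."]),
]

# Each source is queried on its own terms: SQL for the database, free text for the corpus.
sql_rows = pool[0].execute("SELECT name FROM city WHERE population > 500000")
text_hits = pool[1].execute("capital oslo")
```

Both sources share one interface (`language`, `context`, `execute`) while keeping their own query languages, which is the property the three operations rely on.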

2.2. Source Selection

Given the large, open-ended pool $\mathcal{B}$ with heterogeneous structural contexts $\{c_b\}_{b \in \mathcal{B}}$, a simple embedding-and-ranking approach is restrictive. Instead, OmniRetrieval uses a long-context LLM to read the full catalog of source descriptors jointly with the question $q$ and identify relevant sources.

$$\mathcal{S} = \text{LLM}_{\text{select}}(q, \{c_b\}_{b \in \mathcal{B}}; k) \subseteq \mathcal{B}$$

The LLM returns a ranked subset $\mathcal{S}$ of at most $k$ sources. Returning several candidates accommodates questions that require multiple sources or whose target source is ambiguous, and it defers the final decision to the evidence selection stage.
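As a rough sketch of this step, the selector can be written as a function that places the whole catalog in one prompt and parses a ranked list from the reply. The prompt wording, the reply format, and the names `build_selection_prompt`, `select_sources`, and `stub_llm` are assumptions made for illustration; a real deployment would pass a long-context LLM client in place of the stub.

```python
from typing import Callable, Dict, List


def build_selection_prompt(question: str, catalog: Dict[str, str], k: int) -> str:
    """Place the question and every descriptor c_b in a single prompt."""
    lines = [
        f"Question: {question}",
        f"Name at most {k} relevant sources, best first, one per line.",
        "Catalog:",
    ]
    lines += [f"- {name}: {context}" for name, context in catalog.items()]
    return "\n".join(lines)


def select_sources(question: str, catalog: Dict[str, str], k: int,
                   llm: Callable[[str], str]) -> List[str]:
    """Return a ranked list S of at most k known source names."""
    reply = llm(build_selection_prompt(question, catalog, k))
    ranked: List[str] = []
    for line in reply.splitlines():
        # Strip list markers such as "1. " or "- " before matching the name.
        name = line.strip().lstrip("-0123456789. ").strip()
        if name in catalog and name not in ranked:  # drop unknown names and repeats
            ranked.append(name)
    return ranked[:k]


catalog = {
    "geo_db": "CREATE TABLE city (name TEXT, population INTEGER);",
    "wiki": "corpus of encyclopedia passages",
    "movies_graph": "(:Movie)-[:DIRECTED_BY]->(:Person)",
}


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a long-context LLM call."""
    return "1. geo_db\n2. wiki\n3. unknown_source\n4. movies_graph"


top2 = select_sources("Which cities have over 500000 people?", catalog, 2, stub_llm)
top3 = select_sources("Which cities have over 500000 people?", catalog, 3, stub_llm)
```

Filtering the reply against the catalog keeps hallucinated source names out of $\mathcal{S}$, and truncating to $k$ enforces the size bound.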

2.3. Query Formulation

For each selected source $b \in \mathcal{S}$, the framework generates an executable native query $\hat{q}_b$, conditioned on its structural context $c_b$.

$$\hat{q}_b = \text{Generate}_b(q, c_b) \quad \text{for each} \quad b \in \mathcal{S}$$

The paper instantiates $\text{Generate}_b$ as $\text{LLM}(\mathcal{T}_b(q, c_b))$, where $\text{LLM}$ is a single shared LLM and $\mathcal{T}_b$ is a per-source prompt template that incorporates $q$, $c_b$, and instructions for the specific query language. For unstructured corpora, $q$ itself serves as the retriever query.
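One way to picture the per-source template $\mathcal{T}_b$ is as a lookup of language-specific instructions wrapped around $q$ and $c_b$. The sketch below is illustrative only: the instruction strings, the function names, and the stub model are assumptions, not the paper's actual prompts.

```python
from typing import Callable

# Hypothetical per-language instructions; the paper's real prompts are not reproduced here.
INSTRUCTIONS = {
    "sql": "Write one SQL query that answers the question. Return only the query.",
    "sparql": "Write one SPARQL query that answers the question. Return only the query.",
    "cypher": "Write one Cypher query that answers the question. Return only the query.",
}


def render_template(language: str, question: str, context: str) -> str:
    """T_b(q, c_b): combine language instructions, structural context, and question."""
    return (
        f"{INSTRUCTIONS[language]}\n"
        f"Structural context:\n{context}\n"
        f"Question: {question}"
    )


def formulate_query(language: str, question: str, context: str,
                    llm: Callable[[str], str]) -> str:
    """Generate_b(q, c_b): free-text sources reuse q; others go through the shared LLM."""
    if language == "text":
        return question
    return llm(render_template(language, question, context)).strip()


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for the single shared LLM."""
    return "SELECT name FROM city WHERE population > 500000\n"


question = "Which cities have over 500000 people?"
schema = "CREATE TABLE city (name TEXT, population INTEGER);"
sql_query = formulate_query("sql", question, schema, stub_llm)
text_query = formulate_query("text", question, "corpus of passages", stub_llm)
```

Only the template varies by source; the model behind `llm` stays the same, mirroring the single shared LLM described above.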

2.4. Cross-Source Evidence Selection

The executor outputs $\{ \text{Exec}(b, \hat{q}_b) \}_{b \in \mathcal{S}}$ are heterogeneous in form (rows, triples, paths, passages) and size. This step selects the subset relevant to $q$.

$$\mathcal{E} = \text{Select}(q, \{ \text{Exec}(b, \hat{q}_b) \}_{b \in \mathcal{S}})$$

The paper instantiates $\text{Select}$ as $\text{LLM}(\mathcal{T}_{\text{sel}}(\cdot))$, where $\mathcal{T}_{\text{sel}}$ is a prompt template that verbalizes each executor output in its native form and asks the model to identify outputs relevant to $q$.
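The consolidation step can be sketched as two small functions: one that verbalizes each executor output without converting it to another form, and one that asks a model which numbered outputs to keep. The numbering scheme, the reply format, and all names here are assumptions for illustration rather than the paper's template.

```python
import re
from typing import Any, Callable, Dict, List


def verbalize(outputs: Dict[str, List[Any]]) -> str:
    """Render each executor output in its native form (rows, triples, passages, ...)."""
    blocks = []
    for i, (source, items) in enumerate(outputs.items(), start=1):
        body = "\n".join(f"  {item!r}" for item in items) or "  (empty result)"
        blocks.append(f"[{i}] source={source}\n{body}")
    return "\n".join(blocks)


def select_evidence(question: str, outputs: Dict[str, List[Any]],
                    llm: Callable[[str], str]) -> Dict[str, List[Any]]:
    """Select(q, outputs): keep only the outputs the model marks as relevant."""
    prompt = (
        f"Question: {question}\n{verbalize(outputs)}\n"
        "Reply with the numbers of the relevant outputs, comma-separated."
    )
    names = list(outputs)
    chosen = {int(tok) for tok in re.findall(r"\d+", llm(prompt))}
    # Ignore indices that do not correspond to any verbalized output.
    return {names[i - 1]: outputs[names[i - 1]]
            for i in sorted(chosen) if 1 <= i <= len(names)}


outputs = {
    "geo_db": [("Oslo",)],                   # SQL rows
    "wiki": ["Cats sleep a lot."],           # passages
}


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for the shared LLM; index 3 is deliberately out of range."""
    return "1, 3"


evidence = select_evidence("Which cities have over 500000 people?", outputs, stub_llm)
```

Because the final choice is made here, after execution, a wrong candidate from source selection costs one extra query instead of a wrong answer.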

Empirical Validation / Results

3.1. Datasets and Knowledge Bases

Evaluation was conducted on a benchmark compiled from 13 datasets, spanning 309 distinct knowledge bases across four backend types:

Document Search: 7 datasets from BEIR (NFCorpus, SciFact, FiQA, MS MARCO, FEVER, Natural Questions, HotpotQA).
Relational Databases: Spider (206 databases) and BIRD (80 databases).
RDF Knowledge Graphs: SimpleQuestions, QALD-10, LC-QuAD 2.0 (all using Wikidata).
Labeled Property Graphs: Text2Cypher (15 graphs from Neo4j).

3.2. Methods Compared

Single-Backend Baselines: Four methods pinned to one paradigm (Document Search, Text-to-SQL, Text-to-SPARQL, Text-to-Cypher).
KB Routing: Routes to a single knowledge base per query.
OmniRetrieval: The proposed framework (engages multiple candidates, formulates native queries, consolidates results).
Oracle: Upper bound with perfect source selection (uses gold knowledge base).
Unified-Representation: Methods that collapse sources into a shared representation (Oguz et al., 2022; Ma et al., 2022; Baek et al., 2023), evaluated under a feasibility-constrained setup.

3.3. Evaluation Metrics

Three metrics, macro-averaged across the four paradigms:

Source Selection Accuracy: Measures correct backend and knowledge base selection.
Retrieval Accuracy: NDCG@10 for document search; Execution Match for SQL, SPARQL, Cypher.
LLM-as-a-Judge: Uses GPT-5.4-mini to credit predictions that are semantically equivalent to the gold answer or that are faithfully supported by an alternative knowledge base.
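For readers unfamiliar with these metrics, the sketch below shows how they could be computed. It rests on stated assumptions: NDCG uses linear gain with a $\log_2$ discount, Execution Match compares result rows as order-insensitive multisets, and the macro-average is an unweighted mean over paradigms. The paper's exact evaluation scripts may differ.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def ndcg_at_k(ranked_rels: Sequence[float], all_rels: Sequence[float], k: int = 10) -> float:
    """NDCG@k with linear gain: DCG of the ranking divided by DCG of the ideal ranking."""
    def dcg(rels: Sequence[float]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(all_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0


def execution_match(predicted_rows: Iterable[Sequence], gold_rows: Iterable[Sequence]) -> bool:
    """True when both queries return the same rows, ignoring row order (an assumption)."""
    return Counter(map(tuple, predicted_rows)) == Counter(map(tuple, gold_rows))


def macro_average(per_paradigm: Dict[str, float]) -> float:
    """Unweighted mean over paradigms, so small paradigms count as much as large ones."""
    return sum(per_paradigm.values()) / len(per_paradigm)


# Relevant documents at ranks 1 and 3; two relevant documents exist in total.
score = ndcg_at_k([1, 0, 1], [1, 1, 0])
same = execution_match([("Oslo",), ("Bergen",)], [("Bergen",), ("Oslo",)])
macro = macro_average({"document": 40.0, "sql": 60.0, "sparql": 50.0, "cypher": 30.0})
```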

4. Experimental Results and Analyses

Table 1: Main results (macro-averaged across paradigms)

Method	GPT-5.4	Gemini-3.1 (Pro)	Sonnet-4.6	Qwen-3.5 (27B)	Gemma-4 (31B)	Average
Source Selection Accuracy
Document Search	21.49	22.61	21.90	21.49	21.42	21.78
Text-to-SQL	14.75	16.71	16.00	12.62	13.58	14.73
Text-to-SPARQL	24.92	25.00	24.81	24.94	24.56	24.84
Text-to-Cypher	19.92	20.42	20.83	20.42	19.08	20.13
KB Routing	64.88	68.21	60.40	54.83	59.93	61.65
OmniRetrieval (Ours)	68.58	73.30	66.47	57.81	62.40	65.71
Oracle	100.00	100.00	100.00	100.00	100.00	100.00
Retrieval Accuracy
Document Search	13.42	14.94	13.69	13.23	13.16	13.69
Text-to-SQL	13.51	16.75	15.46	12.78	13.92	14.48
Text-to-SPARQL	18.26	20.71	15.60	16.65	17.93	17.83
Text-to-Cypher	18.06	17.17	18.68	18.58	17.14	17.93
KB Routing	42.07	46.83	38.59	34.28	38.13	39.98
OmniRetrieval (Ours)	46.62	52.69	43.07	38.34	40.97	44.34
Oracle	62.47	65.56	60.27	60.11	60.85	61.85
LLM-as-a-Judge
Document Search	39.93	40.92	40.82	36.31	39.47	39.49
Text-to-SQL	25.61	25.86	29.63	22.00	25.16	25.65
Text-to-SPARQL	28.89	30.29	31.06	23.99	25.70	27.99
Text-to-Cypher	33.09	24.36	34.85	24.42	25.19	28.38
KB Routing	60.26	63.67	61.93	50.37	53.71	57.99
OmniRetrieval (Ours)	69.72	71.10	68.62	60.83	59.13	65.88
Oracle	75.20	76.50	76.29	71.93	72.81	74.55

Main Results: OmniRetrieval outperforms every baseline on all three metrics for each of the five backbones (GPT-5.4, Gemini-3.1, Sonnet-4.6, Qwen-3.5, Gemma-4). Single-backend baselines perform poorly because most queries lie outside their paradigm. KB Routing improves on them but suffers from committing to a single source upfront. The average gap to Oracle narrows from source selection to retrieval to judge metrics (34.29 → 17.51 → 8.67 points), indicating that evidence selection can recover answers from alternative sources.
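The gap figures can be rechecked directly from the Average column of Table 1. The short sketch below does that arithmetic; the dictionary layout and variable names are illustrative, and small differences from quoted figures can arise from rounding in the table.

```python
# Average-column values copied from Table 1.
averages = {
    "source_selection": {"kb_routing": 61.65, "omniretrieval": 65.71, "oracle": 100.00},
    "retrieval":        {"kb_routing": 39.98, "omniretrieval": 44.34, "oracle": 61.85},
    "judge":            {"kb_routing": 57.99, "omniretrieval": 65.88, "oracle": 74.55},
}

# Gap between the Oracle upper bound and OmniRetrieval, per metric.
gap_to_oracle = {
    metric: round(row["oracle"] - row["omniretrieval"], 2)
    for metric, row in averages.items()
}

# Gain of OmniRetrieval over KB Routing, per metric.
gain_over_routing = {
    metric: round(row["omniretrieval"] - row["kb_routing"], 2)
    for metric, row in averages.items()
}

for metric in averages:
    print(f"{metric}: gap to Oracle {gap_to_oracle[metric]:.2f}, "
          f"gain over KB Routing {gain_over_routing[metric]:.2f}")
```

The gap shrinks at each later stage, while the gain over KB Routing is largest on the judge metric, which is consistent with evidence selection recovering answers from alternative sources.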

Analysis on Source Candidate Size ($k$):

OmniRetrieval's performance improves monotonically with $k$ (number of candidate sources), but an Oracle (Evidence Selection) variant improves faster, widening the gap as $k$ grows.
The multi-candidate 1-of-$k$ accuracy of the evidence selector drops from 67.5% at $k=3$ to 62.8% at $k=10$ (Figure 3), pointing to evidence selection as a more impactful lever than simply increasing $k$.

Analysis on Backbone Scale:

Using Qwen-3.5 from 2B to 27B, OmniRetrieval (Top 3) takes a clear lead over OmniRetrieval (Top 1) at larger scales (Figure 4).
This is linked to candidate diversity: at 2B, source selection collapses to a single paradigm; beyond 4B, it produces meaningfully different candidates across paradigms and sources (Figure 5).
The largest gap to the Oracle (Gold Source) ceiling is on Source Selection, underscoring its importance.

Analysis on Cross-Source Evidence Selection:

The gold source is included in the candidate list at a high rate for every backbone (solid segments in Figure 6).
Once included, the evidence-selection step picks it at a rate far above the random baseline (Table 2), with improvements ($\Delta$) ranging from +26.60 to +34.51 percentage points.

Table 2: Evidence-selection accuracy on multi-candidate questions containing the gold source

Backbone	Acc. (%)	Rand. (%)	$\Delta$ (pp)
GPT-5.4	72.81	38.31	+34.51
Gemini-3.1	75.29	43.99	+31.30
Sonnet-4.6	70.44	41.47	+28.98
Qwen-3.5	67.91	35.55	+32.36
Gemma-4	74.33	47.73	+26.60
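To make the Rand. and $\Delta$ columns easier to interpret, here is one plausible way such a baseline could be computed. This is an assumption, since the summary does not state the exact procedure: a random selector picks uniformly among a question's candidates, so its expected accuracy is the mean of $1/n$ over questions, where $n$ is that question's number of candidates. The function names and toy candidate counts are hypothetical.

```python
from typing import Sequence


def random_baseline(candidate_counts: Sequence[int]) -> float:
    """Expected accuracy (%) of picking uniformly at random among each question's candidates."""
    return 100.0 * sum(1.0 / n for n in candidate_counts) / len(candidate_counts)


def delta_pp(accuracy: float, baseline: float) -> float:
    """Improvement over the baseline in percentage points."""
    return accuracy - baseline


# Toy example: four multi-candidate questions with 2, 3, 3, and 2 candidates.
toy_rand = random_baseline([2, 3, 3, 2])

# Recomputing one Table 2 row from its own columns (Gemini-3.1).
gemini_delta = delta_pp(75.29, 43.99)
```

Under this reading, a Rand. value between 33% and 50% corresponds to questions that mostly have two or three surviving candidates.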

Analysis on Cross-Paradigm Coverage:

Document Search has the widest cross-paradigm coverage (off-diagonal mean of 28.2%), largely driven by SPARQL questions where Wikipedia-derived corpora overlap with Wikidata content (Figure 7).

Analysis on Native vs Unified Retrieval:

Under a constrained setup (subsampled to make unified representation feasible), unified methods surpass single-backend baselines but stay far below KB Routing and OmniRetrieval (Table 3).
This points to a fundamental limit: atomic-unit retrieval in a shared space cannot capture the structural composition (joins, traversals) that native queries express.
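A toy example illustrates the point about structural composition. In the sketch below, which uses an invented two-table schema and made-up rows, the answer exists only as a join across tables: no single linearized row mentions both the country and the book title, so retrieving rows as independent atomic units cannot surface it directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE book (title TEXT, year INTEGER, author_id INTEGER);
INSERT INTO author VALUES (1, 'Ada', 'Norway'), (2, 'Ben', 'Chile');
INSERT INTO book VALUES ('Fjords', 2005, 1), ('Andes', 2010, 2), ('Snow', 1995, 1);
""")

# Native query: the join composes facts that live in different tables.
joined = conn.execute("""
    SELECT b.title
    FROM book AS b JOIN author AS a ON a.id = b.author_id
    WHERE a.country = 'Norway' AND b.year > 2000
    ORDER BY b.title
""").fetchall()

# Atomic-unit view: every row linearized to text on its own, as a shared-space index might store it.
rows_as_text = [
    " ".join(map(str, row))
    for table in ("author", "book")
    for row in conn.execute(f"SELECT * FROM {table}")
]

# No single linearized row contains both the filter value and the answer.
single_row_has_answer = any("Norway" in r and "Fjords" in r for r in rows_as_text)
```

Here `joined` holds the one matching title while `single_row_has_answer` is false, which is the gap between composing at query time and retrieving pre-linearized units.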

Table 3: Results for unified-representation methods (constrained setup on GPT-5.4)

Method	Source Sel.	Retrieval	Judge
Document Search	21.49	13.42	39.93
Unified Representation †	31.00	23.00	45.00
KB Routing	64.88	42.07	60.26
OmniRetrieval (Ours)	68.58	46.62	69.72

Theoretical and Practical Implications

Preserving Structural Affordances: OmniRetrieval demonstrates that effective unified retrieval does not require homogenization. By engaging sources through their native query languages, it preserves the structural operators (joins, traversals) that give each source its expressive power, a key advantage over shared-representation methods.
Scalable and Extensible Architecture: The framework's design allows new knowledge sources to be added via simple registration (providing $c_b$), without retraining a shared encoder. This makes it practical for real-world deployments with evolving knowledge pools.
Deferred Commitment Strategy: The analysis shows that broad exploration at source selection combined with accurate final commitment at evidence selection is an effective strategy for handling ambiguity and recovering from initial errors.
Towards a General-Purpose Interface: OmniRetrieval positions itself as a step toward a universal layer that exposes a single natural-language interface to users while preserving the value of each heterogeneous source.
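The registration idea can be sketched as a simple catalog keyed by source name. The `SourceRegistry` class and its methods are hypothetical and only illustrate the property claimed above: adding a source appends one descriptor and leaves every existing entry untouched, with no encoder or index to rebuild.

```python
from typing import Dict, Tuple


class SourceRegistry:
    """Illustrative catalog of sources: name -> (native language, structural context c_b)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, str]] = {}

    def register(self, name: str, language: str, context: str) -> None:
        """Add a source by supplying its descriptor; nothing else is recomputed."""
        if name in self._entries:
            raise ValueError(f"source {name!r} is already registered")
        self._entries[name] = (language, context)

    def catalog(self) -> Dict[str, str]:
        """Descriptors in the form a source selector could read alongside the question."""
        return {name: f"[{lang}] {ctx}" for name, (lang, ctx) in self._entries.items()}


registry = SourceRegistry()
registry.register("geo_db", "sql", "CREATE TABLE city (name TEXT, population INTEGER);")
registry.register("wiki", "text", "corpus of encyclopedia passages")
before = registry.catalog()

# A new labeled property graph joins the pool with a single call.
registry.register("movies_graph", "cypher", "(:Movie)-[:DIRECTED_BY]->(:Person)")
after = registry.catalog()
```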

Conclusion

OmniRetrieval presents a framework for unified retrieval across heterogeneous knowledge sources by engaging each source through its native query language and consolidating results. On a benchmark of 13 datasets and 309 knowledge bases, it outperforms single-backend baselines and KB Routing across five LLM backbones. Its key design principle, deferring commitment to the evidence selection stage, lets it explore several candidate sources and recover from initial mis-selections. This work moves toward a general-purpose retrieval layer that preserves structural distinctions while providing unified access.

Limitations & Future Work: The current instantiation's cross-source evidence selection could be strengthened via supervised fine-tuning or reinforcement learning. Because it uses a single shared LLM across all stages, there is also room for operator-specific specialization.

Ethical Considerations: Outputs depend on the connected knowledge bases and the LLMs' internalized knowledge, so standard safeguards and filtering are recommended to mitigate risks from private, harmful, or biased content in these resources.