OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

Summary (Overview)

  • Key Contribution: OmniRetrieval is a novel framework that provides unified retrieval across structurally diverse knowledge sources (unstructured text, relational databases, RDF knowledge graphs, and labeled property graphs) by engaging each source through its native query language, rather than collapsing them into a shared representation.
  • Core Methodology: The framework operates in three stages: 1) Source Selection identifies relevant knowledge bases from a large pool, 2) Query Formulation generates executable queries in the native language (e.g., SQL, SPARQL, Cypher, or free-text) for each selected source, and 3) Cross-Source Evidence Selection consolidates the heterogeneous results into a unified evidence set.
  • Main Finding: OmniRetrieval consistently outperforms single-source baselines and a routing baseline on a comprehensive benchmark spanning 13 datasets and 309 distinct knowledge bases, demonstrating effective cross-paradigm retrieval.
  • Design Insight: A key design principle is deferring the final commitment on the correct answer to the evidence selection stage, which allows for broad exploration at the source selection step and enables recovery from initial mis-selections.
  • Scalability Advantage: Adding a new knowledge source requires only registration (providing its structural context $c_b$), without retraining a shared encoder or modifying the embedding space, making the framework scalable and extensible.

Introduction and Theoretical Foundation

Real-world information needs require accessing knowledge stored in various structural forms: unstructured text passages, relational tables, RDF knowledge graph triples, and labeled property graph paths. Existing retrieval systems are typically siloed, operating over a single source type using a fixed query language (e.g., document retrievers, text-to-SQL systems). This leaves the broader knowledge landscape fragmented.

A common unification attempt is to project all sources into a shared embedding space or linearized text format. However, this homogenization erases the structural affordances (schemas, joins, traversals, compositional operators) that give each source its expressive power and can introduce biases (e.g., embedding clusters by source type rather than semantic content).

OmniRetrieval adopts the opposite approach: keep each source on its own terms and build a unifying access layer above them. The core idea is to meet each source through its native query language, preserving its structural operators, while coordinating access from a single natural-language user interface.

The retrieval problem, then, is not merely to find relevant content within a source, but to navigate the structural heterogeneity that runs across sources.

Methodology

2.1. Problem Formulation

Let $q$ be a user question and $\mathcal{B} = \{b_1, \dots, b_N\}$ be a pool of independently maintained knowledge sources. Each source $b \in \mathcal{B}$ has:

  • A native query language (SQL, SPARQL, Cypher, or free-form text).
  • An execution engine $\text{Exec}(b, \hat{q})$ that accepts a native query $\hat{q}$ and returns results.
  • An exposed structural context $c_b$ (e.g., a relational schema, an ontology, a corpus descriptor) that an external caller can read to formulate a query.

The task is to provide a set of evidence $\mathcal{E}$ relevant to $q$, drawn from one or more sources in $\mathcal{B}$. The framework must operationalize:

  1. Selection of a subset $\mathcal{S} \subseteq \mathcal{B}$ of sources to engage.
  2. Formulation of an executable query $\hat{q}_b$ in the native language of each $b \in \mathcal{S}$.
  3. Consolidation of the executor outputs $\{\text{Exec}(b, \hat{q}_b)\}_{b \in \mathcal{S}}$ into a single evidence set $\mathcal{E}$.
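
The three steps above can be read as a three-call pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: `Source`, `retrieve`, and the `select`, `formulate`, and `consolidate` callables are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Source:
    """One knowledge source b (hypothetical container, not from the paper)."""
    name: str
    language: str                          # "sql", "sparql", "cypher", or "text"
    context: str                           # c_b: schema, ontology, or corpus descriptor
    execute: Callable[[str], List[Any]]    # Exec(b, q_hat)


def retrieve(
    question: str,
    pool: Sequence[Source],
    select: Callable[[str, Sequence[Source], int], List[Source]],
    formulate: Callable[[str, Source], str],
    consolidate: Callable[[str, Dict[str, List[Any]]], List[Any]],
    k: int = 3,
) -> List[Any]:
    """Run selection, formulation, and consolidation in sequence."""
    chosen = select(question, pool, k)            # step 1: S, a subset of B
    outputs: Dict[str, List[Any]] = {}
    for b in chosen:
        q_hat = formulate(question, b)            # step 2: native query for b
        outputs[b.name] = b.execute(q_hat)        # Exec(b, q_hat_b)
    return consolidate(question, outputs)         # step 3: evidence set E
```

Passing the three stages in as callables keeps the sketch independent of any particular LLM client or database driver.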

2.2. Source Selection

Given the large, open-ended pool $\mathcal{B}$ with heterogeneous structural contexts $\{c_b\}$, a simple embedding-and-ranking approach is restrictive. Instead, OmniRetrieval uses a long-context LLM to read the full catalog of source descriptors jointly with the question $q$ and identify relevant sources.

$$\mathcal{S} = \text{LLM}_{\text{select}}(q, \{c_b\}_{b \in \mathcal{B}}; k) \subseteq \mathcal{B}$$

The LLM returns a ranked subset $\mathcal{S}$ of at most $k$ sources. This accommodates queries that require multiple sources or whose target source is ambiguous, deferring the final decision to the evidence selection stage.
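
A minimal sketch of this selection step follows. The prompt wording, the `select_sources` function, and the `llm` callable (prompt in, text out) are assumptions made for illustration; the source does not reproduce the paper's actual prompt.

```python
from typing import Callable, Dict, List


def select_sources(
    question: str,
    catalog: Dict[str, str],      # source name -> structural context c_b
    llm: Callable[[str], str],    # hypothetical LLM client: prompt in, text out
    k: int = 3,
) -> List[str]:
    """Return a ranked list of at most k source names chosen by the LLM."""
    listing = "\n".join(f"[{name}] {ctx}" for name, ctx in catalog.items())
    prompt = (
        f"Question: {question}\n\nSources:\n{listing}\n\n"
        f"List up to {k} relevant source names, most relevant first, one per line."
    )
    ranked: List[str] = []
    for line in llm(prompt).splitlines():
        name = line.strip().strip("[]")
        # Ignore names that are not in the catalog, and drop duplicates.
        if name in catalog and name not in ranked:
            ranked.append(name)
    return ranked[:k]
```

Filtering the reply against the catalog guards against the model naming a source that was never registered.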

2.3. Query Formulation

For each selected source $b \in \mathcal{S}$, the framework generates an executable native query $\hat{q}_b$, conditioned on its structural context $c_b$.

$$\hat{q}_b = \text{Generate}_b(q, c_b) \quad \text{for each } b \in \mathcal{S}$$

The paper instantiates $\text{Generate}_b$ as $\text{LLM}(\mathcal{T}_b(q, c_b))$, where $\text{LLM}$ is a single shared LLM and $\mathcal{T}_b$ is a per-source prompt template that incorporates $q$, $c_b$, and instructions for the specific query language. For unstructured corpora, $q$ itself serves as the retriever query.
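
The template-per-language idea can be sketched as follows. The `TEMPLATES` strings, `formulate_query`, and the `llm` callable are hypothetical stand-ins; the paper's actual templates are not given in this summary.

```python
from typing import Callable, Dict

# Hypothetical per-language templates standing in for T_b.
TEMPLATES: Dict[str, str] = {
    "sql": "Schema:\n{context}\n\nWrite one SQL query answering: {question}",
    "sparql": "Ontology:\n{context}\n\nWrite one SPARQL query answering: {question}",
    "cypher": "Graph schema:\n{context}\n\nWrite one Cypher query answering: {question}",
}


def formulate_query(
    question: str,
    language: str,
    context: str,                 # c_b for the selected source
    llm: Callable[[str], str],    # the single shared LLM (hypothetical client)
) -> str:
    """Return the native query q_hat_b for one source."""
    if language == "text":
        # For unstructured corpora the question itself is the retriever query.
        return question
    prompt = TEMPLATES[language].format(context=context, question=question)
    return llm(prompt).strip()
```

Only the template changes between backends, which matches the description of one shared model driven by per-source prompts.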

2.4. Cross-Source Evidence Selection

The executor outputs $\{\text{Exec}(b, \hat{q}_b)\}_{b \in \mathcal{S}}$ are heterogeneous in form (rows, triples, paths, passages) and size. This step selects the subset relevant to $q$.

$$\mathcal{E} = \text{Select}(q, \{\text{Exec}(b, \hat{q}_b)\}_{b \in \mathcal{S}})$$

The paper instantiates $\text{Select}$ as $\text{LLM}(\mathcal{T}_{\text{sel}}(\cdot))$, where $\mathcal{T}_{\text{sel}}$ is a prompt template that verbalizes each executor output in its native form and asks the model to identify outputs relevant to $q$.
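
One way to picture this step is sketched below. The numbering scheme, the `max_items` truncation, the reply format, and the `select_evidence` and `llm` names are all assumptions of this sketch rather than details taken from the paper.

```python
from typing import Any, Callable, Dict, List, Tuple


def select_evidence(
    question: str,
    outputs: Dict[str, List[Any]],    # source name -> Exec(b, q_hat_b)
    llm: Callable[[str], str],        # hypothetical LLM client
    max_items: int = 20,              # assumed cap so large results fit the prompt
) -> List[Tuple[str, List[Any]]]:
    """Return the (source, output) pairs the LLM marks as relevant."""
    names = list(outputs)
    blocks = []
    for i, name in enumerate(names, start=1):
        # Show rows, triples, paths, or passages as they came back.
        shown = "\n".join(repr(item) for item in outputs[name][:max_items])
        blocks.append(f"({i}) from {name}:\n{shown or '<empty result>'}")
    prompt = (
        f"Question: {question}\n\n" + "\n\n".join(blocks)
        + "\n\nReply with the numbers of the relevant outputs, comma-separated."
    )
    picked: List[Tuple[str, List[Any]]] = []
    for token in llm(prompt).replace(",", " ").split():
        if token.isdecimal() and 1 <= int(token) <= len(names):
            name = names[int(token) - 1]
            if all(name != chosen for chosen, _ in picked):
                picked.append((name, outputs[name]))
    return picked
```

Out-of-range or repeated numbers in the reply are ignored, so a malformed answer degrades to a smaller evidence set instead of an error.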

Empirical Validation / Results

3.1. Datasets and Knowledge Bases

Evaluation was conducted on a benchmark compiled from 13 datasets, spanning 309 distinct knowledge bases across four backend types:

  • Document Search: 7 datasets from BEIR (NFCorpus, SciFact, FiQA, MS MARCO, FEVER, Natural Questions, HotpotQA).
  • Relational Databases: Spider (206 databases) and BIRD (80 databases).
  • RDF Knowledge Graphs: SimpleQuestions, QALD-10, LC-QuAD 2.0 (all using Wikidata).
  • Labeled Property Graphs: Text2Cypher (15 graphs from Neo4j).

3.2. Methods Compared

  • Single-Backend Baselines: Four methods pinned to one paradigm (Document Search, Text-to-SQL, Text-to-SPARQL, Text-to-Cypher).
  • KB Routing: Routes to a single knowledge base per query.
  • OmniRetrieval: The proposed framework (engages multiple candidates, formulates native queries, consolidates results).
  • Oracle: Upper bound with perfect source selection (uses gold knowledge base).
  • Unified-Representation: Methods that collapse sources into a shared representation (Oguz et al., 2022; Ma et al., 2022; Baek et al., 2023), evaluated under a feasibility-constrained setup.

3.3. Evaluation Metrics

Three metrics, macro-averaged across the four paradigms:

  • Source Selection Accuracy: Measures correct backend and knowledge base selection.
  • Retrieval Accuracy: NDCG@10 for document search; Execution Match for SQL, SPARQL, Cypher.
  • LLM-as-a-Judge: Uses GPT-5.4-mini to credit predictions semantically equivalent to gold or faithful against alternative knowledge bases.
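
For reference, the two retrieval metrics and the macro-average can be sketched as below. This is an illustrative version, not the benchmark's evaluation code: it assumes the linear-gain form of NDCG and an order-insensitive comparison of result rows for Execution Match, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def ndcg_at_k(ranked_ids: Sequence[str], relevance: Dict[str, float], k: int = 10) -> float:
    """NDCG@k with linear gain: DCG = sum(rel_i / log2(i + 1)) over ranks i = 1..k."""
    dcg = sum(
        relevance.get(doc, 0.0) / math.log2(i + 1)
        for i, doc in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def execution_match(predicted: List[Hashable], gold: List[Hashable]) -> bool:
    """True when both queries return the same rows, ignoring order (assumed here)."""
    return Counter(predicted) == Counter(gold)


def macro_average(per_paradigm: Dict[str, float]) -> float:
    """Unweighted mean over paradigms, so each paradigm counts equally."""
    return sum(per_paradigm.values()) / len(per_paradigm)
```

Macro-averaging prevents a paradigm with many test questions from dominating the reported score.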

4. Experimental Results and Analyses

Table 1: Main results (macro-averaged across paradigms)

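| Method | GPT-5.4 | Gemini-3.1 (Pro) | Sonnet-4.6 | Qwen-3.5 (27B) | Gemma-4 (31B) | Average |
|---|---|---|---|---|---|---|
| Source Selection Accuracy | | | | | | |
| Document Search | 21.49 | 22.61 | 21.90 | 21.49 | 21.42 | 21.78 |
| Text-to-SQL | 14.75 | 16.71 | 16.00 | 12.62 | 13.58 | 14.73 |
| Text-to-SPARQL | 24.92 | 25.00 | 24.81 | 24.94 | 24.56 | 24.84 |
| Text-to-Cypher | 19.92 | 20.42 | 20.83 | 20.42 | 19.08 | 20.13 |
| KB Routing | 64.88 | 68.21 | 60.40 | 54.83 | 59.93 | 61.65 |
| OmniRetrieval (Ours) | 68.58 | 73.30 | 66.47 | 57.81 | 62.40 | 65.71 |
| Oracle | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Retrieval Accuracy | | | | | | |
| Document Search | 13.42 | 14.94 | 13.69 | 13.23 | 13.16 | 13.69 |
| Text-to-SQL | 13.51 | 16.75 | 15.46 | 12.78 | 13.92 | 14.48 |
| Text-to-SPARQL | 18.26 | 20.71 | 15.60 | 16.65 | 17.93 | 17.83 |
| Text-to-Cypher | 18.06 | 17.17 | 18.68 | 18.58 | 17.14 | 17.93 |
| KB Routing | 42.07 | 46.83 | 38.59 | 34.28 | 38.13 | 39.98 |
| OmniRetrieval (Ours) | 46.62 | 52.69 | 43.07 | 38.34 | 40.97 | 44.34 |
| Oracle | 62.47 | 65.56 | 60.27 | 60.11 | 60.85 | 61.85 |
| LLM-as-a-Judge | | | | | | |
| Document Search | 39.93 | 40.92 | 40.82 | 36.31 | 39.47 | 39.49 |
| Text-to-SQL | 25.61 | 25.86 | 29.63 | 22.00 | 25.16 | 25.65 |
| Text-to-SPARQL | 28.89 | 30.29 | 31.06 | 23.99 | 25.70 | 27.99 |
| Text-to-Cypher | 33.09 | 24.36 | 34.85 | 24.42 | 25.19 | 28.38 |
| KB Routing | 60.26 | 63.67 | 61.93 | 50.37 | 53.71 | 57.99 |
| OmniRetrieval (Ours) | 69.72 | 71.10 | 68.62 | 60.83 | 59.13 | 65.88 |
| Oracle | 75.20 | 76.50 | 76.29 | 71.93 | 72.81 | 74.55 |
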
  • Main Results: OmniRetrieval consistently outperforms all baselines across five backbones (GPT-5.4, Gemini-3.1, Sonnet-4.6, Qwen-3.5, Gemma-4). Single-backend baselines perform poorly because most queries lie outside their paradigm. KB Routing improves on them but suffers from committing to a single source upfront. On the Average column of Table 1, the gap between OmniRetrieval and Oracle narrows from selection to retrieval to judge metrics (34.29 → 17.51 → 8.67 points), indicating that evidence selection can recover answers from alternative sources.
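
The gap-to-Oracle figures are plain differences between two rows of Table 1. The short sketch below recomputes them from the Average column; the `AVERAGES` dictionary and `oracle_gaps` function are names introduced here for illustration, and the values are copied from the table.

```python
from typing import Dict

# Average-column values copied from Table 1 (OmniRetrieval and Oracle rows).
AVERAGES: Dict[str, Dict[str, float]] = {
    "Source Selection Accuracy": {"OmniRetrieval": 65.71, "Oracle": 100.00},
    "Retrieval Accuracy": {"OmniRetrieval": 44.34, "Oracle": 61.85},
    "LLM-as-a-Judge": {"OmniRetrieval": 65.88, "Oracle": 74.55},
}


def oracle_gaps(averages: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Oracle minus OmniRetrieval for each metric, rounded to two decimals."""
    return {
        metric: round(rows["Oracle"] - rows["OmniRetrieval"], 2)
        for metric, rows in averages.items()
    }


gaps = oracle_gaps(AVERAGES)
for metric, gap in gaps.items():
    print(f"{metric}: {gap:.2f} points below Oracle")
```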

Analysis on Source Candidate Size ($k$):

  • OmniRetrieval's performance improves monotonically with $k$ (number of candidate sources), but an Oracle (Evidence Selection) variant improves faster, widening the gap as $k$ grows.
  • The multi-candidate 1-of-$k$ accuracy of the evidence selector drops from 67.5% at $k=3$ to 62.8% at $k=10$ (Figure 3), pointing to evidence selection as a more impactful lever than simply increasing $k$.

Analysis on Backbone Scale:

  • With Qwen-3.5 backbones ranging from 2B to 27B parameters, OmniRetrieval (Top 3) takes a clear lead over OmniRetrieval (Top 1) at larger scales (Figure 4).
  • This is linked to candidate diversity: at 2B, source selection collapses to a single paradigm; beyond 4B, it produces meaningfully different candidates across paradigms and sources (Figure 5).
  • The largest gap to the Oracle (Gold Source) ceiling is on Source Selection, underscoring its importance.

Analysis on Cross-Source Evidence Selection:

  • The gold source is included in the candidate list at a high rate for every backbone (solid segments in Figure 6).
  • Once included, the evidence-selection step picks it at a rate far above the random baseline (Table 2), with improvements ($\Delta$) ranging from +26.60 to +34.51 percentage points.

Table 2: Evidence-selection accuracy on multi-candidate questions whose candidate list contains the gold source

| Backbone | Acc. (%) | Rand. (%) | $\Delta$ (pp) |
|---|---|---|---|
| GPT-5.4 | 72.81 | 38.31 | +34.51 |
| Gemini-3.1 | 75.29 | 43.99 | +31.30 |
| Sonnet-4.6 | 70.44 | 41.47 | +28.98 |
| Qwen-3.5 | 67.91 | 35.55 | +32.36 |
| Gemma-4 | 74.33 | 47.73 | +26.60 |
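
The summary does not say how the Rand. column is computed. The sketch below shows one plausible reading, stated as an assumption: a random selector picks uniformly among a question's candidates, so its expected accuracy is the mean of 1/n over questions with n candidates. The `selection_vs_random` function and its record format are hypothetical.

```python
from typing import List, Tuple


def selection_vs_random(records: List[Tuple[int, bool]]) -> Tuple[float, float, float]:
    """Compare selector accuracy with a uniform-random baseline.

    Each record is (number of candidates, whether the selector picked the gold source).
    Returns (accuracy %, random baseline %, difference in percentage points).
    """
    n = len(records)
    acc = 100.0 * sum(1 for _, picked in records if picked) / n
    # Assumed baseline: a uniform pick among `size` candidates succeeds with 1/size.
    rand = 100.0 * sum(1.0 / size for size, _ in records) / n
    return acc, rand, acc - rand


# Toy usage with made-up records, not data from the paper.
acc, rand, delta = selection_vs_random([(2, True), (2, False), (4, True), (4, True)])
print(f"Acc. {acc:.2f}%  Rand. {rand:.2f}%  Delta {delta:+.2f} pp")
```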

Analysis on Cross-Paradigm Coverage:

  • Document Search has the widest cross-paradigm coverage (off-diagonal mean of 28.2%), largely driven by SPARQL questions where Wikipedia-derived corpora overlap with Wikidata content (Figure 7).

Analysis on Native vs Unified Retrieval:

  • Under a constrained setup (subsampled to make unified representation feasible), unified methods surpass single-backend baselines but stay far below KB Routing and OmniRetrieval (Table 3).
  • This points to a fundamental limit: atomic-unit retrieval in a shared space cannot capture the structural composition (joins, traversals) that native queries express.

Table 3: Results for unified-representation methods (constrained setup on GPT-5.4)

| Method | Source Sel. | Retrieval | Judge |
|---|---|---|---|
| Document Search | 21.49 | 13.42 | 39.93 |
| Unified Representation † | 31.00 | 23.00 | 45.00 |
| KB Routing | 64.88 | 42.07 | 60.26 |
| OmniRetrieval (Ours) | 68.58 | 46.62 | 69.72 |

Theoretical and Practical Implications

  • Preserving Structural Affordances: OmniRetrieval demonstrates that effective unified retrieval does not require homogenization. By engaging sources through their native query languages, it preserves the structural operators (joins, traversals) that give each source its expressive power, a key advantage over shared-representation methods.
  • Scalable and Extensible Architecture: The framework's design allows new knowledge sources to be added via simple registration (providing cbc_b), without retraining a shared encoder. This makes it practical for real-world deployments with evolving knowledge pools.
  • Deferred Commitment Strategy: The analysis shows that broad exploration at source selection combined with accurate final commitment at evidence selection is an effective strategy for handling ambiguity and recovering from initial errors.
  • Towards a General-Purpose Interface: OmniRetrieval positions itself as a step toward a universal layer that exposes a single natural-language interface to users while preserving the value of each heterogeneous source.
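
The registration-only extensibility described above can be illustrated with a small registry. This is a sketch under assumptions: `SourceRegistry`, `RegisteredSource`, and their methods are hypothetical names, not an interface from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class RegisteredSource:
    language: str                          # native query language of the source
    context: str                           # c_b, the structural context
    execute: Callable[[str], List[Any]]    # Exec(b, q_hat)


class SourceRegistry:
    """Holds registered sources; adding one involves no encoder retraining."""

    def __init__(self) -> None:
        self._sources: Dict[str, RegisteredSource] = {}

    def register(self, name: str, language: str, context: str,
                 execute: Callable[[str], List[Any]]) -> None:
        if name in self._sources:
            raise ValueError(f"source already registered: {name}")
        self._sources[name] = RegisteredSource(language, context, execute)

    def catalog(self) -> Dict[str, str]:
        """Name -> c_b mapping, i.e. what a source-selection prompt would read."""
        return {name: src.context for name, src in self._sources.items()}
```

A newly registered source becomes visible to source selection as soon as it appears in the catalog, since nothing downstream depends on a trained index of the pool.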

Conclusion

OmniRetrieval presents a framework for unified retrieval across heterogeneous knowledge sources by engaging each source through its native query language and consolidating results. Evaluations on a large, diverse benchmark show it outperforms relevant baselines. The key design principle of deferring commitment to the evidence selection stage enables graceful scaling. This work moves towards a general-purpose retrieval layer that preserves structural distinctions while providing unified access.

Limitations & Future Work: The current instantiation's cross-source evidence selection could be strengthened via supervised fine-tuning or reinforcement learning. Because a single shared LLM is used across all stages, there is also room for operator-specific specialization.

Ethical Considerations: Outputs depend on the connected knowledge bases and the LLMs' internalized knowledge, so standard safeguards and filtering are recommended to mitigate risks from private, harmful, or biased content in these resources.