SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research - Summary

Summary (Overview)

Large-Scale, Multi-Disciplinary KG: Introduces SciAtlas, a massive knowledge graph containing over 43 million papers from 26 disciplines, with 157 million entities and 3 billion triplets. It organizes fragmented academic resources into a structured, panoramic scientific evolution network.
Neuro-Symbolic Retrieval Algorithm: Develops a novel retrieval algorithm featuring tri-path collaborative recall (keyword, semantic, title matching) and graph reranking via Random Walk with Restart (RWR). This enables a transition from simple semantic matching to deep topological reasoning.
Topological Cognitive Substrate: Provides a structured schema with 9 entity types and 12 relation types, organizing knowledge at semantic, conceptual, directional, and social levels. This dismantles disciplinary barriers and offers AI agents a global perspective for scientific discovery.
Reduced Reasoning Cost & Hallucination Mitigation: Serves as a deterministic "cognitive map" that supports deep association discovery and complex applications (literature review, idea evaluation, trend prediction) in about 2 minutes. This reduces the computational cost and the risk of logical hallucination relative to iterative LLM-based deep research frameworks.
Open Resource: The interfaces for KG retrieval and downstream tasks are released in the project's GitHub repository.

Introduction and Theoretical Foundation

The exponential growth of global academic output has created an "information explosion," where knowledge is fragmented and unstructured, hindering deep interdisciplinary integration. Current retrieval tools rely on superficial keyword or semantic matching, lacking topological reasoning capabilities. While agentic frameworks attempt deeper search, they are prone to hallucinations and high inference costs due to the absence of a structured cognitive anchor.

SciAtlas is proposed to bridge this gap. It is founded on the principle that a Knowledge Graph (KG) is an indispensable organizational form for scientific discovery. Although LLMs excel at semantic understanding, they are deficient in capturing the logical relationships between knowledge entities, which is paramount for scientific research that transcends mere semantic association. SciAtlas aims to provide a topological cognitive substrate—a deterministic, structured network that weaves fragmented knowledge into a self-explanatory panorama of scientific evolution, furnishing AI agents with a global perspective.

Methodology

1. SciAtlas Construction

The primary data source is OpenAlex, an open-source library of scholarly resources. The construction pipeline involves:

Data Restructuring & Filtering: Extracting entities, normalizing names, deduplicating, and filtering for high-quality English papers.
Keyword Extraction: Using a lightweight LLM (Qwen2-30B-A3B-Instruct) to extract 3-8 core, reusable keywords from paper abstracts, avoiding paper-specific terminology. An importance score is assigned to each keyword.
- Good Keywords: protein structure prediction, idea evaluation, wireless communication
- Bad Keywords: hierarchical dual-path adaptive learning framework, multi-stage cross-modal feature fusion architecture
Semantic Embedding: Embedding paper titles, abstracts, and keyword texts using the bge-large-en-v1.5 model. These vectors are stored as node attributes.
Schema & Deployment: The graph is built with a sophisticated schema (Fig. 2) and deployed using Neo4j.

SciAtlas Schema:

Entities (9 types): Paper, Author, Institution, Keyword, Topic, Field, Subfield, Domain, Source.
Relations (12 types): CITES, RELATED_TO, AUTHORED, COAUTHOR, HAS_KEYWORD, COOCCUR, HAS_TOPIC, AFFILIATED_WITH, PUBLISHED_IN, DOMAIN_OF, FIELD_OF, SUBFIELD_OF.

Table 1: Statistics of SciAtlas

Entity (Total: 157M)	Num	Entity (Total: 157M)	Num	Relation (Total: 3B)	Num
Paper	43.30M	Author	109.70M	(Paper, CITES, Paper)	213.88M
Keyword	3.76M	Institution	0.12M	(Paper, HAS_KEYWORD, Keyword)	101.38M
Topic	4.52K	Source	0.28M	(Author, AFFILIATED_WITH, Institution)	195.94M
Field	26	Domain	4	(Author, AUTHORED, Paper)	149.00M
Subfield	252			(Author, COAUTHOR, Author)	2.06B
				(Keyword, COOCCUR, Keyword)	60.37M
				(Paper, RELATED_TO, Paper)	68.38M
				(Paper, PUBLISH_IN, Source)	40.90M
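
As a quick sanity check on Table 1, the listed counts can be summed and compared with the stated totals. The sketch below only re-adds the numbers from the table (in millions). It suggests the entity counts add up to about 157M, while the eight relation types listed add up to about 2.89B, so the remaining relation types in the schema presumably account for the rest of the roughly 3B triplets.

```python
# Counts copied from Table 1, expressed in millions.
entities_m = {
    "Paper": 43.30, "Author": 109.70, "Keyword": 3.76, "Institution": 0.12,
    "Topic": 0.00452, "Source": 0.28,
    "Field": 26e-6, "Domain": 4e-6, "Subfield": 252e-6,
}
relations_m = {
    "CITES": 213.88, "HAS_KEYWORD": 101.38, "AFFILIATED_WITH": 195.94,
    "AUTHORED": 149.00, "COAUTHOR": 2060.0, "COOCCUR": 60.37,
    "RELATED_TO": 68.38, "PUBLISH_IN": 40.90,
}

entity_total_m = sum(entities_m.values())
relation_total_b = sum(relations_m.values()) / 1000.0

print(f"entities: {entity_total_m:.2f}M")
print(f"listed relations: {relation_total_b:.2f}B")
```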

2. Neuro-Symbolic Retrieval Algorithm

Given a query $q$, the algorithm proceeds in stages:

A. Node Matching (Tri-Path Recall)

Keyword Matching: An LLM extracts keywords $K = \{(k_i, s_i^{llm})\}_{i=1}^m$ from $q$. Each extracted keyword $k_i$ is matched against keyword nodes $g$ in two ways:
- Exact Match: $\text{score}_{exact}(k_i, g) = s_i^{llm}$
- Vector Match: $\text{score}_{vec}(k_i, g) = s_i^{llm} \cdot \text{sim}(k_i, g)$, applied when $\text{sim}(k_i, g) \ge \theta_{kw} = 0.7$
The final keyword node weight is:
$w_g^{kw} = \max_i \left\{ \mathbb{1}[k_i = g] \cdot s_i^{llm},\; \mathbb{1}[\text{sim}(k_i, g) \ge \theta_{kw}] \cdot s_i^{llm} \cdot \text{sim}(k_i, g) \right\}$
The seed set is $K_{seed} = \{(g, w_g^{kw})\}$.
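
The keyword weight above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `keyword_seed_weights`, the toy similarity table, and the choice to drop zero-weight nodes from the seed set are all assumptions. In a real system `sim` would be a cosine similarity over keyword embeddings served by a vector index rather than a scan over all keyword nodes.

```python
from typing import Callable, Dict, List, Tuple


def keyword_seed_weights(
    query_keywords: List[Tuple[str, float]],  # pairs (k_i, s_i^llm)
    graph_keywords: List[str],                # candidate keyword nodes g
    sim: Callable[[str, str], float],
    theta_kw: float = 0.7,
) -> Dict[str, float]:
    """Return {g: w_g^kw} for every graph keyword with a non-zero weight."""
    seeds: Dict[str, float] = {}
    for g in graph_keywords:
        best = 0.0
        for k, s_llm in query_keywords:
            if k == g:                    # exact-match path
                best = max(best, s_llm)
            similarity = sim(k, g)
            if similarity >= theta_kw:    # vector-match path
                best = max(best, s_llm * similarity)
        if best > 0.0:                    # assumption: zero-weight nodes are not seeds
            seeds[g] = best
    return seeds


# Hypothetical similarity table standing in for embedding cosine similarity.
_TOY_SIM = {
    ("graph retrieval", "graph search"): 0.8,
    ("random walk", "random walks"): 0.95,
    ("graph retrieval", "random walks"): 0.3,
}


def toy_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return _TOY_SIM.get((a, b), 0.0)


seeds = keyword_seed_weights(
    [("graph retrieval", 0.9), ("random walk", 0.6)],
    ["graph retrieval", "graph search", "random walks", "protein folding"],
    toy_sim,
)
print(seeds)
```

In this toy run the exact hit keeps the full LLM score, the two near matches are discounted by their similarity, and the unrelated keyword never enters the seed set.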
Semantic Matching: The query $q$ is embedded to $\mathbf{e}_q$. The top 60 papers are retrieved separately by title embedding and by abstract embedding, then reranked using bge-reranker-large. The combined score for a paper $p$ is:
$s_p^{emb} = \frac{0.4 \cdot s_p^{title} + 0.6 \cdot s_p^{abs}}{0.4 \cdot \mathbb{1}[\exists s_p^{title}] + 0.6 \cdot \mathbb{1}[\exists s_p^{abs}]}$
The candidate set is $P_{emb} = \{(p, s_p^{emb})\}$.
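
Because the title and abstract lists are retrieved separately, a paper may carry only one of the two scores. One reading of the indicator terms in the denominator is that they renormalize the weights over whichever scores exist. The sketch below follows that reading; the function name and the use of `None` for a missing score are assumptions.

```python
from typing import Optional


def combined_embedding_score(
    s_title: Optional[float],
    s_abs: Optional[float],
    w_title: float = 0.4,
    w_abs: float = 0.6,
) -> Optional[float]:
    """Weighted average over the scores that exist; None if neither exists."""
    numerator = 0.0
    denominator = 0.0
    if s_title is not None:
        numerator += w_title * s_title
        denominator += w_title
    if s_abs is not None:
        numerator += w_abs * s_abs
        denominator += w_abs
    if denominator == 0.0:
        return None  # the paper was in neither top-60 list
    return numerator / denominator


both = combined_embedding_score(0.8, 0.5)         # 0.4*0.8 + 0.6*0.5
title_only = combined_embedding_score(0.8, None)  # falls back to the title score
abs_only = combined_embedding_score(None, 0.5)    # falls back to the abstract score
print(both, title_only, abs_only)
```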
Title Matching: For queries containing titles, GROBID extracts titles and an LLM assigns confidence scores $c_j$. A fuzzy-matching similarity between each extracted title $t_j$ and a candidate paper $p$ is computed:
$m(t_j, p) = 0.65 \cdot \text{seq}(t_j, p) + 0.35 \cdot \mathrm{token\_overlap}(t_j, p)$
Papers with similarity $\ge \theta_{title}=0.88$ are kept. The score is $s_{j,p}^{title} = c_j \cdot m(t_j, p)$, and the final set is $P_{title} = \{(p, s_p^{title})\}$.
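
The summary does not say how $\text{seq}$ and $\mathrm{token\_overlap}$ are computed. The sketch below fills them in with two common choices, both assumptions: `difflib.SequenceMatcher.ratio()` for the sequence term and Jaccard overlap of lowercase tokens for the token term. The normalization step and all function names are also illustrative.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def _normalize(title: str) -> str:
    # Assumption: lowercase and keep only alphanumeric tokens.
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def seq_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def token_overlap(a: str, b: str) -> float:
    ta, tb = set(_normalize(a).split()), set(_normalize(b).split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)  # Jaccard overlap (assumed definition)


def title_match_score(
    extracted: str, candidate: str, confidence: float, theta_title: float = 0.88
) -> Optional[float]:
    """Return c_j * m(t_j, p), or None when m falls below the threshold."""
    m = 0.65 * seq_similarity(extracted, candidate) + 0.35 * token_overlap(extracted, candidate)
    if m < theta_title:
        return None
    return confidence * m


hit = title_match_score("Attention Is All You Need", "attention is all you need.", 0.9)
miss = title_match_score(
    "Attention Is All You Need", "Deep Residual Learning for Image Recognition", 0.9
)
print(hit, miss)
```

With these definitions a title that differs only in case or punctuation scores $m = 1$, so its final score equals the LLM confidence, while a title sharing no tokens cannot exceed 0.65 and is always filtered out.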
Node Merging: The semantic and title paper sets are merged into $P_{seed}$. Their scores are normalized (MinMaxNorm) and combined with a title bonus $b_p^{pre}$:
$s_p^{pre} = \lambda_{emb} \tilde{s}_p^{emb} + \lambda_{title} \tilde{s}_p^{title} + b_p^{pre}, \quad b_p^{pre} = \begin{cases} 0.35, & \text{exact title hit} \\ 0.10, & \text{fuzzy title hit} \\ 0, & \text{otherwise} \end{cases}$
Defaults: $\lambda_{emb}=0.3$, $\lambda_{title}=0.8$.
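
A minimal sketch of the merging step follows. Several details are assumptions rather than stated in the summary: a paper missing from one set gets a normalized score of 0 for that set, a set whose scores are all equal normalizes to 1.0, and any paper in $P_{title}$ that is not an exact hit counts as a fuzzy hit.

```python
from typing import Dict, Set


def min_max_norm(scores: Dict[str, float]) -> Dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {p: 1.0 for p in scores}  # assumption for the degenerate case
    return {p: (s - lo) / (hi - lo) for p, s in scores.items()}


def merge_seed_papers(
    p_emb: Dict[str, float],
    p_title: Dict[str, float],
    exact_hits: Set[str],
    lam_emb: float = 0.3,
    lam_title: float = 0.8,
) -> Dict[str, float]:
    """Return {p: s_p^pre} over the union of both candidate sets."""
    emb_n, title_n = min_max_norm(p_emb), min_max_norm(p_title)
    merged: Dict[str, float] = {}
    for p in set(p_emb) | set(p_title):
        if p in exact_hits:
            bonus = 0.35
        elif p in p_title:
            bonus = 0.10  # assumption: non-exact title matches are fuzzy hits
        else:
            bonus = 0.0
        merged[p] = lam_emb * emb_n.get(p, 0.0) + lam_title * title_n.get(p, 0.0) + bonus
    return merged


pre_scores = merge_seed_papers(
    p_emb={"A": 0.9, "B": 0.6, "C": 0.3},
    p_title={"A": 0.95},
    exact_hits={"A"},
)
print(pre_scores)
```

Note that with these defaults an exact title hit can push $s_p^{pre}$ above 1 (here paper A reaches 1.45), which ranks explicitly named papers ahead of purely semantic matches.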

B. Weight Setting & Graph Propagation

Seed nodes $S = P_{seed} \cup K_{seed}$ are assigned initial weights. For a seed paper $p$, the weight incorporates a citation-based importance term:

$\text{imp}(p) = \min \left\{ 1, \frac{\log(1 + c_p)}{\log(1 + \max(1, C))} \right\}$

$w_p^{seed} = s_p^{pre} \cdot (1 + \gamma \cdot \text{imp}(p)), \quad \gamma = 0.5$

The initial distribution $\mathbf{s}$ over all nodes is defined by $s_v = w_v^{seed}/Z$ for $v \in S$ and $s_v = 0$ otherwise.
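
The seed weighting can be sketched as below. The summary does not define $C$ or say how keyword seeds are weighted, so this sketch assumes $c_p$ is the citation count of $p$, $C$ is a reference citation count supplied by the caller (for example the maximum among candidates), keyword seeds keep $w_g^{kw}$ as their weight, and $Z$ is the sum of all seed weights. Function names are illustrative.

```python
import math
from typing import Dict


def citation_importance(c_p: int, C: int) -> float:
    """imp(p): log-scaled citation count, capped at 1."""
    return min(1.0, math.log(1 + c_p) / math.log(1 + max(1, C)))


def seed_distribution(
    pre_scores: Dict[str, float],       # s_p^pre for seed papers
    citations: Dict[str, int],          # c_p per paper (assumed meaning)
    keyword_weights: Dict[str, float],  # w_g^kw for seed keywords
    C: int,
    gamma: float = 0.5,
) -> Dict[str, float]:
    """Return the normalized restart distribution s over seed nodes."""
    weights = dict(keyword_weights)  # assumption: keyword seeds keep w_g^kw
    for p, s_pre in pre_scores.items():
        weights[p] = s_pre * (1 + gamma * citation_importance(citations.get(p, 0), C))
    Z = sum(weights.values())
    if Z == 0:
        raise ValueError("all seed weights are zero")
    return {v: w / Z for v, w in weights.items()}


dist = seed_distribution(
    pre_scores={"paper:A": 1.0, "paper:B": 0.5},
    citations={"paper:A": 100, "paper:B": 0},
    keyword_weights={"kw:graph retrieval": 0.9},
    C=100,
)
print(dist)
```

In the example, paper A has the reference citation count, so its weight is boosted by the full factor $1 + \gamma = 1.5$, while the uncited paper B keeps its pre-score unchanged.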

C. Random Walk with Restart (RWR)

A 2-hop subgraph is expanded from the seeds. Edge weights $\omega(u,v)$ are defined per type (see Table 2). The transition probability from node $u$ to neighbor $v$ is:

$P(v|u) = \frac{\omega(u, v)}{\sum_{x \in N(u)} \omega(u, x)}$

The RWR iteration updates node scores $\mathbf{r}^{(t)}$:

$r_v^{(t+1)} = \alpha s_v + (1-\alpha) \sum_u r_u^{(t)} P(v|u)$

with restart probability $\alpha$. Iteration stops when $\|\mathbf{r}^{(t+1)} - \mathbf{r}^{(t)}\|_1 < \varepsilon = 10^{-6}$ or at $T_{max}=50$.
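
The iteration above can be written directly as a power-iteration loop. The sketch below assumes the 2-hop subgraph has already been expanded and is passed in as a weighted adjacency dictionary; it initializes $\mathbf{r}^{(0)} = \mathbf{s}$, which the summary does not specify. The value of $\alpha$ is not given in the summary, so it is a required argument here, and the 0.15 used in the example is purely illustrative, as are the edge weights.

```python
from typing import Dict


def random_walk_with_restart(
    edges: Dict[str, Dict[str, float]],  # edges[u][v] = omega(u, v)
    seed: Dict[str, float],              # restart distribution s
    *,
    alpha: float,
    eps: float = 1e-6,
    t_max: int = 50,
) -> Dict[str, float]:
    nodes = set(edges) | {v for nbrs in edges.values() for v in nbrs} | set(seed)
    r = {v: seed.get(v, 0.0) for v in nodes}  # assumption: r^(0) = s
    for _ in range(t_max):
        nxt = {v: alpha * seed.get(v, 0.0) for v in nodes}
        for u, nbrs in edges.items():
            total = sum(nbrs.values())
            if total == 0:
                continue  # a node without outgoing weight passes nothing on
            for v, w in nbrs.items():
                nxt[v] += (1 - alpha) * r[u] * w / total  # r_u * P(v|u)
        delta = sum(abs(nxt[v] - r[v]) for v in nodes)    # L1 change
        r = nxt
        if delta < eps:
            break
    return r


# Tiny symmetric toy graph: one seed paper, one keyword, two other papers.
toy_edges = {
    "p1": {"k1": 1.0, "p2": 0.5},
    "k1": {"p1": 1.0, "p3": 1.0},
    "p2": {"p1": 0.5},
    "p3": {"k1": 1.0},
}
scores = random_walk_with_restart(toy_edges, {"p1": 1.0}, alpha=0.15)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Because every node in the toy graph has outgoing edges, probability mass is conserved and the scores sum to 1. The seed paper stays on top, and the keyword node ranks above the paper that is reachable only through it.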

D. Final Ranking

The final graph score for a paper is $s_p^{graph} = r_p$. A comprehensive final score is computed:

$s_p^{final} = \min \left\{ 1,\; \lambda_{pre} \tilde{s}_p^{pre} + \lambda_{graph} \tilde{s}_p^{graph} g_p + \lambda_{imp} \text{imp}_{final}(p) \right\}$

where $g_p = \max(0.25, \tilde{s}_p^{pre})$ is a graph support factor. Default weights: $\lambda_{pre}=0.35$, $\lambda_{graph}=0.45$, $\lambda_{imp}=0.20$. The top-20 papers are returned.
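
A short sketch of the final scoring and top-k selection follows. It treats the normalized pre-score, the normalized graph score, and $\text{imp}_{final}(p)$ as inputs already scaled to $[0, 1]$. How $\text{imp}_{final}$ is computed is not described in the summary, so it is simply passed in. Function names are illustrative.

```python
from typing import Dict, List, Tuple


def final_score(
    s_pre_norm: float,
    s_graph_norm: float,
    imp_final: float,
    lam_pre: float = 0.35,
    lam_graph: float = 0.45,
    lam_imp: float = 0.20,
) -> float:
    g_p = max(0.25, s_pre_norm)  # graph support factor
    return min(1.0, lam_pre * s_pre_norm + lam_graph * s_graph_norm * g_p + lam_imp * imp_final)


def rank_papers(
    features: Dict[str, Tuple[float, float, float]], top_k: int = 20
) -> List[Tuple[str, float]]:
    """features[p] = (normalized pre-score, normalized graph score, imp_final)."""
    scored = [(p, final_score(*f)) for p, f in features.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


ranking = rank_papers({
    "direct_hit": (1.0, 1.0, 1.0),   # strong on every signal
    "graph_only": (0.0, 1.0, 0.0),   # reached only through propagation
    "weak": (0.0, 0.0, 0.0),
})
print(ranking)
```

The floor of 0.25 in $g_p$ means a paper discovered only through graph propagation still keeps a quarter of its graph contribution ($0.45 \times 0.25 = 0.1125$ in the example), so it can surface without dominating papers that also matched the query directly.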

Empirical Validation / Results

The paper presents qualitative running examples and a detailed description of the system's capabilities rather than quantitative benchmarks. Key demonstrated results include:

Scale & Coverage: SciAtlas successfully integrates over 43M papers across 26 core disciplines (see Fig. 1), with Medicine (18.56%), Social Sciences (10.70%), and Engineering (9.43%) being the largest. This demonstrates its capacity as a large-scale, multi-disciplinary resource.
Retrieval Efficiency: The entire neuro-symbolic retrieval process is reported to be completed within 2 minutes, a significant speedup compared to iterative LLM-based deep research frameworks.
Application Outputs: The downstream applications (see below) generate coherent, structured outputs, such as:
- Idea Grounding: Successfully matching a target idea's claim about LLM-as-a-Judge limitations with an evidence paragraph from a related paper, identifying both similar and different points.
- Idea Generation: Producing novel, interdisciplinary ideas like "Federated and Privacy-Preserving Knowledge Editing" by combining concepts from distinct domains.
- Trend Prediction: Generating a staged summary of the development of "biologically plausible learning in spiking neural networks" from 2006-2025, along with future directions.
- Researcher Profile: Creating a comprehensive profile summarizing a researcher's trajectory across Knowledge-Enhanced LLMs, Agentic AI Systems, and Model Analysis & Control.

Theoretical and Practical Implications

Theoretical Implications:

KG as Cognitive Map: Argues for the indispensable role of structured knowledge graphs in scientific discovery, as they provide the logical relationship topology that LLMs inherently lack.
Neuro-Symbolic Fusion: Demonstrates a practical framework for combining neural (semantic embeddings, LLMs) and symbolic (graph traversal, deterministic relations) AI, enabling deep reasoning over massive-scale knowledge.

Practical Implications:

Empowering Automated Scientific Research: SciAtlas serves as a foundational infrastructure to accelerate the full loop of AI-driven science, from literature review to peer review, by providing reliable, structured external knowledge.
Reducing Agentic Costs: Offers a deterministic retrieval alternative that drastically cuts the computational cost and latency of LLM-based deep research agents while mitigating hallucinations.
Facilitating Interdisciplinary Research: By dismantling disciplinary barriers in its topological organization, it aids both human researchers and AI agents in gaining a global perspective and fostering interdisciplinary innovation.
Open Resource for Community: The release of the KG and interfaces promotes further research and development in automated scientific discovery and knowledge-grounded AI.

Downstream Applications of SciAtlas

The paper proposes and demonstrates several key applications powered by SciAtlas and its retrieval algorithm:

Literature Review: Customizable retrieval for generating review reports. Hyperparameters can be tuned to emphasize venue prestige, author authority, or institutional reputation.
Idea Grounding and Evaluation: Retrieving relevant papers and paragraphs to ground a new idea, identifying prior similar work, supporting evidence, and true innovations. This provides a basis for evaluating novelty, feasibility, and soundness.
Idea Generation: Using a research direction or paper as an anchor to retrieve diverse papers, then synthesizing concepts to generate novel, interdisciplinary ideas. Search constraints can be relaxed to enhance exploratory retrieval.
Research Trend Prediction: Retrieving influential papers in a field, sorting them chronologically, and using an LLM to summarize the developmental trajectory and predict future directions. Paper citation importance is emphasized.
Related Author Retrieval: Retrieving the most relevant authors for a given research direction, with ranking influenced by author citation counts and authorship order.
Researcher Background Review: Directly matching an author node, collecting their publications, and using an LLM to summarize their academic trajectory and research themes.

Conclusion

SciAtlas is introduced as a large-scale, multi-disciplinary, heterogeneous academic knowledge graph designed as a panoramic scientific evolution network. By integrating massive amounts of structured data, it provides a topological cognitive substrate that breaks down disciplinary silos. The accompanying neuro-symbolic retrieval algorithm enables efficient, deep topological reasoning, moving beyond surface-level semantic matching. The demonstrated applications show that SciAtlas can effectively serve as a "cognitive map" to empower the entire loop of automated scientific research, significantly reducing reasoning costs and hallucination risks compared to purely LLM-based approaches.

Future Work includes: developing CLI tools and agentic skills for easier integration; integrating more knowledge forms (theorems, datasets, code); creating benchmarks for quantitative evaluation; and systematizing real-time KG update mechanisms.