Lingshu-Cell: A Generative Cellular World Model for Transcriptome Modeling toward Virtual Cells
Summary (Overview)
- Core Innovation: Lingshu-Cell is a Masked Discrete Diffusion Model (MDDM) designed as a generative "cellular world model" that learns the distribution of single-cell transcriptomic states and supports conditional simulation under perturbation.
- Key Advantages: It operates directly in a discrete token space, aligning with the sparse, non-sequential nature of scRNA-seq data, and models expression across ~18,000 genes without prior gene selection (e.g., filtering by high variability).
- Unconditional Generation: The model accurately reproduces transcriptomic distributions, marker-gene expression patterns, and cell-subtype proportions across diverse human tissues and multiple species (mouse, rhesus macaque, zebrafish, fly).
- Conditional Perturbation Prediction: By embedding cell type/donor identity with perturbation context, Lingshu-Cell predicts whole-transcriptome responses. It achieves leading performance on the Virtual Cell Challenge H1 genetic perturbation benchmark and accurately predicts cytokine-induced responses in human PBMCs.
- Unified Framework: It establishes a single architecture capable of both high-fidelity cell state generation and perturbation response prediction, moving toward a practical virtual cell model for in silico experimentation.
Introduction and Theoretical Foundation
Modeling cellular states and predicting their responses to perturbations are central challenges in computational biology and the development of virtual cells. While large-scale scRNA-seq datasets have enabled comprehensive characterization, most analyses remain descriptive. The overarching goal is to develop a cellular world model—analogous to AI world models—that learns a compact representation of the transcriptomic state distribution and supports conditional simulation, moving biology beyond static cataloging.
Existing foundation models (e.g., scGPT, Geneformer) are optimized for static representation learning, not generative simulation. Other generative approaches (e.g., scDiffusion, scVI) are limited by continuous data assumptions that misalign with the sparse, discrete nature of scRNA-seq counts. Perturbation-focused methods (e.g., STATE, CellFlow) learn direct mappings but do not model the underlying state distribution.
Lingshu-Cell addresses these limitations by introducing a Masked Discrete Diffusion Model (MDDM). This design is inherently compatible with the orderless structure of gene expression data, avoiding the arbitrary generation order of autoregressive models and the distributional mismatch of continuous denoising diffusion models.
Methodology
4.1 Preliminaries: Masked Discrete Diffusion Models (MDDMs)
Let $x_0 = (x_0^1, \dots, x_0^L)$ be a fully observed discrete sequence of length $L$, where each token $x_0^i$ belongs to a predefined vocabulary $\mathcal{V}$.
- Forward Process: Tokens are independently masked over a continuous time variable $t \in [0, 1]$. The per-token transition probability is $q(x_t^i \mid x_0^i) = (1 - t)\,\mathbf{1}[x_t^i = x_0^i] + t\,\mathbf{1}[x_t^i = [\mathrm{M}]]$, where $[\mathrm{M}]$ is a special mask token.
- Reverse Process & Training: A parametric mask predictor network $p_\theta$ predicts the original tokens at masked positions. The model is optimized using a cross-entropy loss computed exclusively on masked tokens: $\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[\tfrac{1}{t}\sum_{i=1}^{L} \mathbf{1}\!\left[x_t^i = [\mathrm{M}]\right] \log p_\theta(x_0^i \mid x_t)\right]$.
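The forward masking step and the masked-token loss above can be sketched in a few lines of NumPy (the `MASK` sentinel value, vocabulary size, and uniform logits in the sanity check are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

MASK = -1  # stand-in id for the special mask token [M]

def forward_mask(x0, t, rng):
    """Forward process: independently replace each token by [M] with probability t."""
    x0 = np.asarray(x0)
    is_masked = rng.random(x0.shape) < t
    return np.where(is_masked, MASK, x0), is_masked

def mdd_loss(logits, x0, is_masked, t):
    """Cross-entropy on masked positions only, with the 1/t weighting of the MDDM objective."""
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))  # log-softmax
    token_ll = np.take_along_axis(log_probs, np.asarray(x0)[:, None], -1)[:, 0]
    return -(token_ll * is_masked).sum() / t
```

With all-zero (uniform) logits over a vocabulary of size $|\mathcal{V}|$, the loss reduces to $(\#\text{masked}) \cdot \log|\mathcal{V}| / t$, a quick sanity check on the weighting.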
4.2 Representing Single-Cell Data as Discrete Sequences
A cell's UMI count vector $c = (c_1, \dots, c_G)$ is quantized into a finite set of expression levels via a function $q(\cdot)$ to handle the broad dynamic range. The quantization preserves approximately the first two significant digits. For a raw count $c_g$: $q(c_g) = c_g$ if $c_g < 100$, and otherwise $q(c_g) = \mathrm{round}(c_g / 10^{k}) \cdot 10^{k}$ with $k = \lfloor \log_{10} c_g \rfloor - 1$.
The final discrete sequence for a cell is $x_0 = \big(q(c_1), q(c_2), \dots, q(c_G)\big)$.
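The two-significant-digit rule can be written as a small helper; the exact cutoff and rounding behavior here are a plausible reconstruction, not the paper's published code:

```python
import math

def quantize(c: int) -> int:
    """Map a raw UMI count to a coarse level keeping ~2 significant digits.
    Counts below 100 already have at most two digits and are kept exact."""
    if c < 100:
        return c
    k = int(math.floor(math.log10(c))) - 1   # power of ten below the top two digits
    return round(c / 10**k) * 10**k          # note: Python round() uses banker's rounding
```

For example, `quantize(1234)` yields `1200`, collapsing the long tail of raw counts into a compact token vocabulary.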
4.3 Embedding-Space Sequence Compression
To handle long gene sequences (~18,080 genes), a compression module reduces the internal sequence length processed by the Transformer.
- A random permutation $\pi$ is applied to the token embedding sequence $E = (e_1, \dots, e_G)$.
- The reordered sequence is partitioned into $G/P$ groups of size $P$, and each block of $P$ concatenated embeddings is projected to a single $d$-dimensional vector via a shared linear map $W_c$: $z_j = W_c\,[\,e_{\pi((j-1)P+1)}; \dots; e_{\pi(jP)}\,]$, where $W_c \in \mathbb{R}^{d \times Pd}$.
- After Transformer processing, a decompression module $W_d$ maps the compressed representations back to the original gene-level resolution.
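The permute–group–project pipeline can be sketched as follows (the array shapes and the use of plain matrices for $W_c$ and $W_d$ are illustrative assumptions):

```python
import numpy as np

def compress(E, P, Wc, rng):
    """Shuffle the G token embeddings, group them into blocks of P,
    and project each concatenated block to a single d-dim vector."""
    G, d = E.shape
    perm = rng.permutation(G)
    blocks = E[perm].reshape(G // P, P * d)  # each row: P concatenated embeddings
    return blocks @ Wc.T, perm               # Wc has shape (d, P*d)

def decompress(Z, perm, P, Wd):
    """Expand each compressed vector back to P gene-level slots and undo the shuffle."""
    out = (Z @ Wd.T).reshape(len(Z) * P, -1)  # Wd has shape (P*d, d)
    return out[np.argsort(perm)]              # restore the original gene order
```

With $G \approx 18{,}080$ and $P = 32$, the Transformer processes roughly 565 positions instead of ~18k, shrinking quadratic attention cost by about a factor of $P^2$.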
4.4 Conditional Generation
For perturbation prediction, the condition (e.g., cell line + target gene) is encoded as discrete tokens prepended to the expression sequence. These condition tokens are exempt from masking.
- Training: Both perturbed and control cells are included; control cells are assigned a biologically neutral control label.
- Sampling with Classifier-Free Guidance (CFG): To strengthen perturbation-specific generation, guided logits are computed as $\tilde{\ell} = \ell_{\text{uncond}} + w\,(\ell_{\text{cond}} - \ell_{\text{uncond}})$, where $w$ is the guidance scale.
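The guidance step is a one-line extrapolation in logit space (variable names are mine, not the paper's):

```python
import numpy as np

def cfg_logits(l_cond, l_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional logits
    toward the conditional ones; w = 1 recovers the plain conditional model."""
    return l_uncond + w * (l_cond - l_uncond)
```

Setting $w > 1$ pushes sampling further in the perturbation-specific direction, at the usual cost of reduced sample diversity.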
4.5 Inference-Time Biological Prior Injection
To provide directional signals, a perturbation-specific prior gene set $\mathcal{G}_{\text{prior}}$ is constructed from external cell line data (genes identified as downregulated). During sampling initialization, positions corresponding to $\mathcal{G}_{\text{prior}}$ are assigned a low expression value and kept fixed, while all other positions are generated by the model.
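Prior injection only touches sampling initialization; a minimal sketch, where the low-expression token id and the index-based representation of the gene set are assumptions:

```python
import numpy as np

MASK = -1      # stand-in id for the mask token
LOW_EXPR = 0   # token id for a low/zero expression level (assumed)

def init_with_prior(seq_len, prior_positions):
    """Start from an all-masked sequence, but pin genes in the prior set
    to a low expression level; pinned positions are never re-sampled."""
    xt = np.full(seq_len, MASK)
    frozen = np.zeros(seq_len, dtype=bool)
    idx = list(prior_positions)
    xt[idx] = LOW_EXPR
    frozen[idx] = True
    return xt, frozen
```

The sampler then fills in only the non-frozen positions, so the prior acts as a hard constraint rather than a soft hint.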
Model Architecture & Training Details
- Architecture: Bidirectional Transformer backbone with 13 blocks, embedding dimension $d$, 10 attention heads, SwiGLU FFN, and Rotary Position Embeddings (RoPE).
- Training: AdamW optimizer, cosine annealing LR schedule, Exponential Moving Average (EMA) on weights. Trained with distributed data parallel on NVIDIA A800 GPUs.
Empirical Validation / Results
2.2 Unconditional Generation Across Species and Tissues
Lingshu-Cell was trained on the PBS control subset of the PARSE 10M PBMC dataset (629,701 cells).
- Qualitative Results: Generated cells faithfully recapitulated marker-gene expression patterns and cell-type proportions for major PBMC lineages (T, NK, B cells, monocytes, DCs). This held true at both standard (10,000 cells) and large (200,000 cells) generation scales.
- Quantitative Benchmark (PBMCs): Lingshu-Cell was benchmarked against scDiffusion and scVI using five metrics.
Table: Unconditional Generation Performance on PARSE-PBMC Dataset
| Model | Pearson (↑) | Spearman (↑) | MMD (↓) | iLISI (↑) | 1-WD (↓) |
|---|---|---|---|---|---|
| Lingshu-Cell | 1.0000 | 0.9095 | 0.0088 | 0.9990 | 0.0064 |
| scDiffusion | 0.9900 | 0.8966 | 0.0178 | 0.9993 | 0.1594 |
| scVI | 0.9950 | 0.7429 | 0.0343 | 0.9975 | 0.0102 |
Lingshu-Cell achieved the best scores on four of the five metrics (with a near-best iLISI), indicating the most faithful modeling of the real expression distribution.
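As a concrete example of how such distribution-level metrics are typically computed, the Pearson score can be taken between gene-wise mean expression profiles of the real and generated populations (the exact protocol is an assumption; the text does not spell it out):

```python
import numpy as np

def mean_profile_pearson(real, gen):
    """Pearson correlation between the gene-wise mean expression of a real
    and a generated cell population (both are cells x genes matrices)."""
    return np.corrcoef(real.mean(axis=0), gen.mean(axis=0))[0, 1]
```

A score of 1.0 on this metric means the generated population matches the real one in average expression per gene, though not necessarily in cell-to-cell variability, which is why complementary metrics such as MMD and iLISI are reported alongside it.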
- Generalization: The model consistently produced high-quality samples across eight human tissues from the CZ CELLxGENE database and four non-human species (mouse ovary, rhesus macaque lung, zebrafish embryo, fly brain).
Table 1: Unconditional Generation Performance Across Human Tissues and Non-Human Species
| Tissue | Pearson (↑) | Spearman (↑) | MMD (↓) | iLISI (↑) | 1-WD (↓) |
|---|---|---|---|---|---|
| Human - Neocortex | 0.9995 | 0.9991 | 0.0128 | 0.9053 | 0.0105 |
| Human - Heart | 0.9992 | 0.9987 | 0.0196 | 0.8972 | 0.0096 |
| Human - Lung | 0.9967 | 0.9970 | 0.0314 | 0.8906 | 0.0159 |
| Human - Colon | 0.9966 | 0.9960 | 0.0376 | 0.8815 | 0.0152 |
| Mouse - Ovary | 0.9996 | 0.9989 | 0.0116 | 0.9011 | 0.0077 |
| Rhesus macaque - Lung | 0.9985 | 0.9970 | 0.0218 | 0.8926 | 0.0149 |
| Zebrafish - Embryo | 0.9983 | 0.9974 | 0.0143 | 0.9035 | 0.0089 |
| Fly - Brain | 0.9984 | 0.9929 | 0.0163 | 0.8876 | 0.0107 |
2.3 Genetic Perturbation Prediction (Virtual Cell Challenge H1)
Lingshu-Cell was evaluated on the VCC H1 genetic perturbation benchmark. Ablation studies confirmed the contribution of three key strategies:
- Classifier-Free Guidance (CFG): Improved fidelity, with an appropriately tuned guidance scale $w$ giving the best performance.
- Sequence Compression: A patch size of 32 outperformed smaller sizes.
- Biological Prior Injection: Improved perturbation direction similarity and correlation.
Table 2: Genetic Perturbation Prediction on the VCC Leaderboard (Top Teams)
| Team | Avg Rank ↓ | DES ↑ | PDS ↑ | MAE ↓ | Sp. #DEG ↑ | Sp. LFC ↑ | AUPRC ↑ | Pearson-∆ ↑ |
|---|---|---|---|---|---|---|---|---|
| Lingshu-Cell | 8.7 | 0.216 | 0.748 | 0.052 | 0.394 | 0.331 | 0.272 | 0.306 |
| cleopatra | 9.1 | 0.228 | 0.747 | 0.086 | 0.473 | 0.396 | 0.266 | 0.203 |
| xBio | 10.7 | 0.305 | 0.811 | 0.770 | 0.564 | 0.087 | 0.252 | 0.217 |
Lingshu-Cell achieved the best average rank and ranked first in MAE and Pearson-∆ correlation.
2.4 Cytokine Perturbation Prediction in PBMCs
Evaluated on the PARSE 10M PBMC cytokine perturbation dataset (12 donors, 90 conditions). For evaluation, 4 donors were held out, with 70% of their cytokine conditions (63 of 90) used as the test set.
Lingshu-Cell achieved the highest average score across all evaluated methods (PerturbMean, STATE, scGPT, scVI), ranking first in PDS and Pearson-∆ correlation, and also achieving the highest Spearman #DEG correlation.
Theoretical and Practical Implications
- Paradigm Shift: Lingshu-Cell shifts single-cell foundation models from static representation learning to generative simulation, establishing a computational foundation for a cellular world model.
- Alignment with Data Physics: The Masked Discrete Diffusion paradigm naturally aligns with the permutation invariance and zero-inflated sparsity intrinsic to transcriptomic data, avoiding biases and mismatches of previous approaches.
- Unified Modeling: A single architecture successfully handles both unconditional generation of heterogeneous cell states and conditional prediction of responses to diverse perturbation modalities (genetic, cytokine).
- Applications: Enables in silico experimentation for dissecting disease mechanisms, screening therapeutics, and mapping developmental trajectories. It serves as a powerful tool for probabilistic hypothesis generation.
- Limitations & Future Directions:
- Current evaluations rely on population-level metrics and cannot fully assess single-cell biological plausibility.
- High-fidelity generation does not imply biological causality; predictions require wet-lab validation.
- The model currently operates only on transcriptomic data. A complete virtual cell would integrate multi-omic modalities (epigenomic, proteomic, spatial).
- Future directions include extending to complex interventions (drugs, combinatorial), modeling temporal dynamics, and moving toward closed-loop experimentation where model predictions guide targeted perturbations.
Conclusion
Lingshu-Cell demonstrates that Masked Discrete Diffusion Models can serve as a unified generative framework for single-cell transcriptomics. By directly modeling transcriptome-wide expression without prior gene selection, it achieves high-fidelity cell generation across diverse tissues and species and leading performance in predicting cellular responses to both genetic and cytokine perturbations. This work establishes MDDM as a promising paradigm for modeling cellular behavior and marks a critical step toward interactive virtual cells, paving the way for a new paradigm in biological discovery and perturbation screening.