AI Virtual Cell Data Generation: What Data Do These Models Actually Need?

27.03.2026


Every AI model is only as good as the data behind it. Large language models need text. Image generators need training photos. And AI virtual cell (AIVC) models? They need something far more complex — multi-scale, multimodal biological data that captures how cells actually behave.

In 2025, Nature named foundation models for biology (including AIVCs) one of the seven technologies to watch. A landmark Cell review by Bunne et al. (2024) laid out the priorities for building virtual cells with AI, and a 2025 Cell Research editorial by Qian et al. framed the challenge around three data pillars. The consensus is clear: without the right data, AIVC remains a concept rather than a tool.

If you’re wondering what “the right data” actually looks like, this post breaks it down — from the data types that matter, to where public datasets fall short, to how single cell multi-omics data generation fills the gap.

This is the second post in our AIVC series. For an introduction to what AI virtual cell models are and why they matter, see our first post.

What Data Do AI Virtual Cells Need?

An AIVC aims to simulate cell behavior across conditions, tissues, and species. That’s a tall order. The data powering it need to span multiple biological scales and capture both structure and dynamics.

Molecular scale. At the foundation, AIVCs need data on individual molecules — DNA, RNA, proteins, metabolites. These are typically represented as sequences or atomic structures, generated through high-throughput sequencing and structural biology approaches.

Cellular scale. Cells are more than bags of molecules. Data at this level come from single cell RNA sequencing (scRNA-seq), scATAC-seq, and proteomics. Imaging technologies like confocal fluorescence microscopy, live-cell imaging, cryo-electron microscopy, and super-resolution microscopy add spatial information at subcellular resolution. Mass spectrometry and proximity-based labeling methods reveal protein-protein interactions and signaling network dynamics.

Multicellular scale. Cells don’t exist in isolation. Understanding the tissue microenvironment and intercellular interactions requires in situ spatial data — obtained through techniques like H&E staining, immunohistochemistry, spatial transcriptomics, and spatial multi-omics on 2D sections and 3D tissues.

Temporal and perturbation data. Static snapshots are not enough. AIVCs need data that capture dynamic processes — aging, development, cancer progression — as well as responses to genetic, chemical, and physical perturbations. Artificial perturbations are particularly valuable because they generate the diverse cell states a model needs to learn from.

Prior knowledge. Published literature, curated expression databases, and multi-scale imaging atlases encode the fundamental biological mechanisms that give models context. Qian et al. call this the first of their three data pillars: an extensive knowledge base that provides the starting point for AIVC construction, even though it alone isn’t sufficient.

On top of all this, the data need to reflect biological diversity across species and domains, with enough depth to separate true biological signals from technical noise.

Why Public Data Alone Won’t Get Us There

There’s no shortage of publicly available single cell data. Resources like the Human Cell Atlas and CELLxGENE have made millions of cell profiles accessible. So why not just train on what’s already out there?

The short answer: predictive power requires more than volume.

A recent position paper by Dibaeinia et al. (2026), “Virtual Cells Need Context, Not Just Scale,” makes this case directly. The authors show that within a given biological context, simple baseline models perform on par with sophisticated deep learning architectures — and that current AIVCs fail to consistently generalize across different contexts. Their argument: the bottleneck isn’t model size or expressivity; it’s the lack of diverse biological contexts in the training data.
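To make "simple baseline" concrete, here is a minimal sketch of one kind of baseline often used as a sanity check in perturbation prediction: predict an unseen perturbation as the control mean plus the average shift seen across training perturbations. This is an illustration only, not the method evaluated by Dibaeinia et al.; the function names, condition labels, and toy expression values are all hypothetical.

```python
from statistics import fmean

def mean_profile(cells):
    """Average expression per gene across cells (each cell is a list of floats)."""
    return [fmean(gene_values) for gene_values in zip(*cells)]

def mean_shift_prediction(control_cells, perturbed_by_condition):
    """Predict an unseen perturbation as control mean + average training shift."""
    control_mean = mean_profile(control_cells)
    shifts = []
    for cells in perturbed_by_condition.values():
        perturbed_mean = mean_profile(cells)
        shifts.append([p - c for p, c in zip(perturbed_mean, control_mean)])
    average_shift = mean_profile(shifts)  # mean shift per gene across conditions
    return [c + s for c, s in zip(control_mean, average_shift)]

# Toy data (hypothetical): two genes, a few cells per condition.
control = [[1.0, 2.0], [3.0, 2.0]]
training_perturbations = {
    "KO_A": [[4.0, 2.0]],
    "drug_B": [[2.0, 4.0], [2.0, 4.0]],
}
prediction = mean_shift_prediction(control, training_perturbations)
print(prediction)
```

A baseline like this knows nothing about which gene or drug was applied, so if a deep model cannot beat it on held-out contexts, the limiting factor is more likely the training data than the architecture.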

This aligns with what we see in the field more broadly. Most public datasets come from cell lines or well-studied healthy tissues. They’re valuable for pre-training foundation models and learning broad biological patterns. But when you need a model that can predict drug responses in a patient’s tumor, or simulate how a rare immune cell behaves under stress, cell line data have real limitations. They lack the biological context of real clinical samples — the tissue architecture, the patient-specific heterogeneity, the messy reality of disease.

This is where targeted data generation becomes essential. Building AIVCs that are clinically relevant requires real patient samples, carefully annotated clinical metadata, and multi-omic measurements from complex tissues. Generating these data at scale is hard, but it’s the difference between a model that recognizes patterns in idealized conditions and one that makes predictions in the real world.

The Core Data Types Driving AIVC Development

Across the AIVC models published in the past two years, a few single cell data types consistently show up as essential inputs.

Single cell and single nucleus RNA sequencing (scRNA-seq / snRNA-seq). This remains the backbone. Transcriptomic profiles at single cell resolution provide the most widely available high-dimensional data for training and fine-tuning AIVC models.

Perturbation RNA-seq data. Observational data show what exists. Perturbation data show what happens when you intervene — gene knockouts, drug treatments, CRISPR screens. These causal data are critical for training models that predict, not just describe.

Chromatin accessibility (scATAC-seq). Epigenomic data capture regulatory potential and help models understand gene regulation beyond expression alone.

Multi-omic and spatial layers. Proteomics, spatial transcriptomics, and subcellular imaging add the dimensions needed for a truly multi-scale model.

Expanding the Data Landscape with Specialized Assays

The standard scRNA-seq and scATAC-seq profiles are necessary but not sufficient. Some of the most interesting questions — and the biggest gaps in current AIVC models — require specialized data layers.

Immune profiling. Immune responses are central to disease biology and tissue homeostasis. Single cell immune profiling captures T cell receptor (TCR) and B cell receptor (BCR) repertoire diversity, clonal expansion patterns, and antigen-receptor specificity — data that connect transcriptomic states to functional immune responses.

Targeted sequencing. Targeted assays integrate information on viral infection, gene mutations, and fusion genes into single cell transcriptomic data. Single cell CRISPR screening solutions also provide essential perturbation data for model training.

Cell surface glycosylation. Detecting the transcriptome and cell surface glycosylation simultaneously at single cell resolution adds a data layer that most AIVC models haven’t seen yet.

RNA dynamics. Capturing dynamic changes in the transcriptome at single cell resolution provides the temporal information that static snapshots miss.

Full-length RNA sequencing. Capturing full-length transcripts supports splice site analysis and offers a more comprehensive view of splicing than standard 3′ or 5′ methods.

FFPE samples. Single nucleus sequencing can be applied to formalin-fixed paraffin-embedded samples, using unbiased random priming for total transcriptome coverage. This unlocks archival clinical material, a massive and largely untapped resource for AIVC training data.

Cross-species data. Model organisms offer a rich repertoire of biology. They may not fully represent human biology, but they allow systematic perturbations that are impossible in human subjects — a valuable complement for AIVC training.

Scaling Up: Making Data Generation Practical

Generating enough diverse data for AIVC training is expensive. Per-sample costs add up fast, especially when you need large cohorts that capture biological variability.

Multiplexing helps. Labeling each sample with a unique barcode lets multi-sample and cross-species studies run in parallel, which reduces per-sample cost while increasing the biological diversity captured in a single experiment.
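As a rough illustration of what sample barcoding implies downstream, the sketch below assigns each cell back to its sample of origin by counting sample-tag reads. It is a simplified assumption of how demultiplexing can work, not any specific vendor pipeline; the tag sequences, sample names, and the 0.8 dominance threshold are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sample-tag whitelist: tag sequence -> sample name.
SAMPLE_TAGS = {
    "ACGTAC": "patient_01_tumor",
    "TGCATG": "patient_02_tumor",
    "GATCGA": "mouse_spleen",
}

def assign_samples(tag_reads, min_fraction=0.8):
    """Assign each cell to the sample whose tag dominates its tag reads.

    tag_reads: iterable of (cell_barcode, tag_sequence) pairs.
    Cells without a dominant tag are labeled "ambiguous" (possible doublets).
    """
    counts = defaultdict(Counter)
    for cell, tag in tag_reads:
        if tag in SAMPLE_TAGS:  # ignore tags that are not on the whitelist
            counts[cell][SAMPLE_TAGS[tag]] += 1

    assignments = {}
    for cell, tag_counts in counts.items():
        sample, top = tag_counts.most_common(1)[0]
        total = sum(tag_counts.values())
        assignments[cell] = sample if top / total >= min_fraction else "ambiguous"
    return assignments

# Toy reads (hypothetical): CELL_C carries two different tags.
reads = [
    ("CELL_A", "ACGTAC"), ("CELL_A", "ACGTAC"), ("CELL_A", "ACGTAC"),
    ("CELL_B", "GATCGA"), ("CELL_B", "GATCGA"),
    ("CELL_C", "ACGTAC"), ("CELL_C", "TGCATG"),
]
result = assign_samples(reads)
print(result)
```

Cells that carry a mix of tags are flagged rather than forced into a sample, which keeps likely doublets out of the training data.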

This kind of practical efficiency matters. Building the data infrastructure for AI virtual cells isn’t just a scientific problem — it’s a logistics and cost problem, and solutions that make data generation scalable are part of the answer.

From Data Integration to AI-Driven Discovery

The boundaries of single cell technology are being redrawn by AIVC. From targeted sequencing to spatial multi-omics, the field has accumulated more than just “more data.” It has built a multi-dimensional data foundation that is closer to representing true cell states than anything that came before.

When large-scale single cell data meet generative AI, the research paradigm has the potential to shift from observation to prediction. In the next chapter of single cell research, high-quality data integration isn’t just the endpoint of an experimental workflow — it becomes the starting point for AI to understand living systems.

We’re committed to supporting this transition with reliable, diverse single cell and spatial multi-omics solutions that contribute to the data infrastructure AIVC development demands. If you’re thinking through how to design data generation for an AIVC project, our scientific team can walk through the trade-offs with you.

References

Bunne, C. et al. (2024). How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell, 187(25), 7045–7063. https://doi.org/10.1016/j.cell.2024.11.015

Callaway, E. (2025). Can AI build a virtual cell? Scientists race to model life’s smallest unit. Nature, 643(8070), 13–14. https://doi.org/10.1038/d41586-025-02011-0

Dibaeinia, P. et al. (2026). Virtual Cells Need Context, Not Just Scale. bioRxiv. https://doi.org/10.64898/2026.02.04.703804

Qian, L., Dong, Z. & Guo, T. (2025). Grow AI virtual cells: three data pillars and closed-loop learning. Cell Research, 35, 319–321. https://doi.org/10.1038/s41422-025-01101-y

A post by Yingting Wang

Yingting earned her PhD from the National University of Singapore, specializing in cell biology and tissue engineering. She has eight years of laboratory and commercial experience in single cell multi-omics, including roles in R&D, sales, technical support, and scientific communication.
