AIVC Clinical Data

Building AI Virtual Cell Models for Drug Discovery: A Case for Clinical Data

You have the best architecture, a top-notch computational team, and a signed pharma deal that will pay for any solid candidates that your AI Virtual Cell (AIVC) model generates. Now you need the data. Data from the right samples.

The power of AIVC models is undeniable. They are opening up new possibilities for the scale and precision of drug discovery. This post, however, is about a more specific question that is costly to get wrong: the data strategy that determines whether your model learns something real about the biology you are trying to simulate.

This is the third post in our AIVC series. Part 1 introduced the AIVC concept. Part 2 covered the data types that drive model development. This post focuses on what the current data landscape is missing in AI-driven drug discovery — and what a well-designed data strategy requires.

Cell line data do not equal clinical response

Oncology drug discovery has long relied on cancer cell lines. Enormous datasets. Promising results. Many drugs have been developed or validated using these efficient, standardized models.

However, drug discovery is getting harder. It is taking more and more resources to bring a single candidate to market. The development cost per drug has passed $500 million, with a large proportion attributable to the cost of failures (Sertkaya et al. 2024). Lack of clinical efficacy is the primary reason for clinical failure: although candidates are rigorously tested in vitro and in animal models, the discrepancy between these models and human biology makes clinical translation challenging (Sun et al. 2022).

Takamatsu et al. (2024) compared the largest cell line databases with clinical tumor data and found that the response to platinum agents or PARP inhibitors does not match the clinical results. Homologous recombination deficiency (HRD) status, a biomarker that predicts drug sensitivity in patients, was not associated with sensitivity in the cell lines. This discrepancy broadens when cell lines from diverse backgrounds are used.

The differences can be attributed to multiple factors. Cell line models do not fully capture the complexity of the tumor and its microenvironment. Drug treatment is applied over a much shorter time frame, with drastically different delivery and metabolism mechanisms. Cell culture methods introduce artificial signals that are not present in the body.

These issues are fundamental to biology, and will not magically disappear with larger datasets or with the use of AI models. Cell lines lack the heterogeneity and microenvironment signals that govern how real tumors behave. Models trained on that biology learned a version of cancer — not the disease itself. When the drug entered a patient, the biology it encountered bore only partial resemblance to what the model had been trained on.

The field eventually corrected course: patient-derived xenografts, organoids, and ultimately primary tissue became the standard for ground truth.

AIVC does not change this underlying problem. It increases the sophistication of the analysis — not the quality of the data. An AIVC trained primarily on cell line perturbation data or standard healthy-donor atlases will learn cell line biology and healthy-donor biology. That is useful for some purposes, such as generating the most plausible hypothesis. However, it is not sufficient for predicting how a patient with a specific indication will respond to treatment. The model has not seen the biology it is being asked to simulate.

A 2026 preprint (Dibaeinia et al.) analysed a 22-million-cell T cell immunology dataset and found that a state-of-the-art model recovered only approximately 9% of true differentially expressed genes. The remaining 91% of biological signal was missed. The authors frame this as a causal transportability problem: more data from the same distribution does not teach cross-context generalisation. A model can achieve high aggregate similarity scores while being biologically incorrect about what actually changes.
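
The last point can be illustrated with a toy calculation. The sketch below is not the benchmark from the preprint: the gene count, effect size, detection threshold, and the choice of 10 recovered genes out of 100 are arbitrary assumptions chosen only to make the arithmetic visible, and the helper names `pearson` and `de_recall` are hypothetical. It shows how a predicted expression profile can correlate almost perfectly with the true perturbed profile while recovering only a small fraction of the genes that actually change, because the unperturbed baseline dominates the correlation.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation between two expression profiles."""
    return float(np.corrcoef(a, b)[0, 1])


def de_recall(control, true_profile, predicted_profile, threshold=0.5):
    """Fraction of truly changed genes that the prediction also changes."""
    true_de = np.abs(true_profile - control) > threshold
    pred_de = np.abs(predicted_profile - control) > threshold
    return float((true_de & pred_de).sum() / true_de.sum())


# Toy, fully synthetic setup (all numbers are illustrative assumptions).
n_genes = 1000
control = np.linspace(0.0, 10.0, n_genes)   # baseline log-expression per gene

true_de_idx = np.arange(0, n_genes, 10)     # 100 genes truly respond
true_profile = control.copy()
true_profile[true_de_idx] += 1.0            # true perturbation effect

predicted_profile = control.copy()
predicted_profile[true_de_idx[:10]] += 1.0  # model captures only 10 of them

similarity = pearson(predicted_profile, true_profile)
recall = de_recall(control, true_profile, predicted_profile)

print(f"profile-level Pearson r: {similarity:.3f}")
print(f"DE gene recall:          {recall:.2f}")
```

Under these assumptions the profile-level correlation stays above 0.95 even though nine in ten truly changed genes are missed, which suggests that metrics computed on the perturbation effect itself can be more informative than whole-profile similarity.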

What the current data landscape is missing

The majority of AIVC models currently in development are built on cell line data, healthy tissue atlases, or a combination of both. These are the data that exist at scale. They are not the data that matter most for clinical prediction.

Disease-specific clinical samples teach the model what no public dataset can: how cells behave under pathological conditions, how patient-to-patient variation manifests, and how the disease microenvironment shapes cellular state. Disease stage, mutation load, immune infiltration, epigenetic changes, and tissue architecture drive the variation that determines drug response. These dimensions require samples from real patients, in disease, across diverse clinical presentations.

For anyone designing a data strategy: a well-designed disease-specific dataset at moderate scale will outperform a much larger generic dataset for indication-specific prediction. Context relevance matters more than sheer volume.

Standard single cell sequencing captures one layer — disease biology involves more

Standard single cell RNA sequencing (scRNA-seq) captures polyadenylated messenger RNA (mRNA), which derives from the approximately 1–2% of the human genome that codes for protein. That is one layer of cellular information. Additional layers of data exist:

Chromatin accessibility

Gene expression is regulated by which genomic regions are physically accessible for transcription. Single cell ATAC-seq reveals the epigenetic regulatory state of individual cells: which enhancers are active, which transcription factors are engaged, and which genes are primed to respond to a signal. In primary human cancers, single cell chromatin accessibility data has identified malignant regulatory pathways invisible to transcriptomics alone, with direct relevance to drug resistance mechanisms (Sundaram et al., 2024).

Immune repertoire

The clonal diversity and activation state of a patient’s T and B cell populations are central to immuno-oncology and cell therapy. Single cell immune profiling provides the receptor sequences and clonal architecture that transcriptomics alone cannot deliver.

Non-coding RNAs

Standard poly-A capture systematically misses RNAs that lack a poly-A tail, including many non-coding RNAs. Yet protein-coding sequences represent only approximately 1–2% of the human genome. Non-coding RNAs (lncRNAs, miRNAs, and others) play central roles in regulating gene expression. More than 19,000 lncRNAs are annotated in the human genome (Mattick et al., 2023). Whole transcriptome sequencing using random priming captures both mRNA and non-coding RNA, adding the regulatory layer to single cell data.

Somatic mutations

Clonal evolution, mutational heterogeneity within a tumor, and the genotype-to-phenotype relationship at single cell resolution are not captured by standard RNA-seq. Targeted mutation detection or simultaneous DNA-RNA profiling links genotype to gene expression — directly relevant to resistance mechanisms, subclonal dynamics, and precision oncology.

Emerging layers worth watching

Two additional modalities are worth flagging. RNA dynamics — captured by metabolic labeling — adds a temporal dimension to the standard transcriptomic snapshot, distinguishing newly synthesized from pre-existing RNA within the same cell. Cell surface glycosylation is the other: glycans regulate immune checkpoint engagement and cell adhesion, and single cell glycosylation and RNA co-profiling is now technically achievable — opening a layer of biology that standard transcriptomics cannot access.

These additional layers of data are crucial to cell biology, but are nearly absent from public datasets at meaningful scale.

Three things a data strategy needs to get right

For each organization, there is a balance between efficiency, cost, and data quality — the right choices depend on the indication and the R&D goals.

Public data has a defined ceiling. Large public datasets are a legitimate starting point. They provide broad representation of cell types, tissues, and basic gene regulatory networks. However, public data often does not generalize to specific disease states or cover sufficient patient heterogeneity for your indication.

Clinical data is where differentiation happens. Disease-specific samples from real patients teach the model what standard atlases cannot. This data cannot be downloaded. It requires sample processing and library preparation infrastructure capable of handling primary clinical material at the throughput and consistency required for cohort-level training sets.

Multi-omic dimensions should follow the mechanism. Not every AIVC model needs every modality. The question to ask for each additional data type: is this layer directly implicated in the mechanism of action or the known failure modes of my indication? Where the answer is yes, the absence of that layer will be a blind spot in what the model can learn.
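
The question in the third point can be read as a simple inclusion rule. A minimal sketch follows, in which the `Modality` record, the `select_modalities` helper, and the yes/no flags for a hypothetical immuno-oncology program are all illustrative assumptions, not recommendations for any real indication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modality:
    name: str
    in_mechanism_of_action: bool   # layer is implicated in how the drug works
    in_known_failure_modes: bool   # layer is implicated in how the indication fails


def select_modalities(candidates):
    """Keep a layer only if it is implicated in the mechanism or a failure mode."""
    return [
        m.name
        for m in candidates
        if m.in_mechanism_of_action or m.in_known_failure_modes
    ]


# Hypothetical flags; in practice these come from the program's own biology review.
candidates = [
    Modality("chromatin accessibility", False, True),
    Modality("immune repertoire", True, False),
    Modality("non-coding RNA", False, False),
    Modality("somatic mutations", False, True),
]

selected = select_modalities(candidates)
print(selected)
```

Writing the rule down this way keeps the reasoning explicit: a layer that is dropped was dropped because neither flag applied, not because it was overlooked.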

What to look for in a data generation partner

For teams with strong computational capacity but limited wet lab infrastructure, the practical question is which capabilities you actually need from a partner, and whether that partner can process samples at the speed and quality your model demands.

A lesser-known fact is that the bottleneck in building a disease-specific dataset occurs in the steps before sequencing: tissue handling, single cell isolation from primary clinical material, and library preparation that is robust across imperfect, real-world tissues. Tumor biopsies, FFPE archival samples, and rare tissue types present challenges that standard protocols were not designed for.

A few indicators of relevant capability:

  • Clinical-grade processing standards. Most single cell work is done at research grade. If your dataset will ultimately inform a clinical program, the quality standards matter from the start. Singleron’s Matrix NEO™ is the world’s first automated single cell processing system to receive medical device clearance, a shift from research-grade to clinical-grade single cell infrastructure.
  • Experience with difficult primary tissue. The sample types that matter most for AIVC training (small volume biopsies, disease-stage-matched tissue, archival FFPE samples) are the ones most likely to fail with standard protocols. Breadth of validated sample types, demonstrated across thousands of projects, is a more meaningful indicator than instrument specs.
  • Multi-omic capability. The ability to add further layers of information to your data, such as chromatin accessibility, immune repertoire, mutation profiling, RNA dynamics, or glycosylation.
  • Demonstrated capability in supporting AIVC-relevant programs. Singleron is a featured partner in the CERTAINTY consortium, an EU-funded project building a virtual twin for multiple myeloma, and contributes single cell data generation to one of the first clinical implementations of the virtual cell concept.

If you are designing a data strategy for an AIVC program and want to think through which sample types and assay combinations are appropriate for your indication, our scientific team is available to discuss the specifics.

The model learns what the data teaches it

Cell line model systems are good science, but they provide an incomplete picture. AIVC models that train primarily on public atlases and cell line perturbation data are running the same experiment with a more sophisticated and expensive tool.

AIVC models will benefit from diverse data that represent both disease biology and patient heterogeneity: disease-specific clinical samples as the core, multi-omic dimensions where the mechanism demands it, and public data used as a foundation rather than a substitute.

Investing in scale will bring the throughput, but investing in understanding the messy biology will ultimately provide the edge.

References

Bunne, C., et al. (2024). How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell, 187(25), 7045–7063.

Dibaeinia, P., et al. (2026). Virtual cells need context, not just scale. bioRxiv. https://doi.org/10.1101/2026.02.04.703804

Mak, I. W. Y., et al. (2014). Lost in translation: Animal models and clinical trials in cancer treatment. American Journal of Translational Research, 6(2), 114–118. https://pmc.ncbi.nlm.nih.gov/articles/PMC3731677/

Mattick, J. S., et al. (2023). Long non-coding RNAs: Definitions, functions, challenges and recommendations. Nature Reviews Molecular Cell Biology, 24, 430–447. https://doi.org/10.1038/s41580-022-00566-8

Mourragui, S., et al. (2021). Predicting patient response with models trained on cell lines and PDX data. Proceedings of the National Academy of Sciences, 118(49), Article e2106682118. https://doi.org/10.1073/pnas.2106682118

Sertkaya, A., Beleche, T., Jessup, A., & Sommers, B. D. (2024). Costs of drug development and research and development intensity in the US, 2000–2018. JAMA Network Open, 7(6), Article e2415445. https://doi.org/10.1001/jamanetworkopen.2024.15445

Sun, D., Gao, W., Hu, H., & Zhou, S. (2022). Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B, 12(7), 3049–3062. https://doi.org/10.1016/j.apsb.2022.02.002

Sundaram, L., et al. (2024). Single-cell chromatin accessibility reveals malignant regulatory programs in primary human cancers. Science, 385(6713), Article eadk9217. https://doi.org/10.1126/science.adk9217

Takamatsu, S., Murakami, K., & Matsumura, N. (2024). Homologous recombination deficiency unrelated to platinum and PARP inhibitor response in cell line libraries. Scientific Data, 11, Article 171. https://doi.org/10.1038/s41597-024-03018-4