Single-Cell Transcriptomics Databases: A Practical Guide

08.12.2025

15’

Introduction

Cells are the building blocks of life. They form tissues and organs, and their diverse phenotypes work together to sustain health and well-being. Before sequencing, scientists studied cellular phenotypes with fluorescence microscopy, flow cytometry, and immunohistochemistry. Today, single-cell RNA sequencing (scRNA-seq) enables the comprehensive profiling of a single cell’s transcriptional profile. Since the first scRNA-seq study¹, many protocols for characterizing gene expression in single cells have been developed. These efforts have produced thousands of single-cell transcriptomics studies, with data for millions of cells from across the human population.

With such mountains of data comes a growing need to store and access it in a centralized location. Such a place would foster collaborations between research groups and advance the dissemination of important insights at the single-cell level.

Online repositories offer such a platform. They can store data from scRNA-seq studies for cells across disease states, cell types, and physiological ranges. Researchers can then access these databases to build computational models for investigating biological processes in the body². With databases, you can also visualize your datasets in relation to other datasets to identify novel cell clusters and genetic signatures of health and disease (Figure 1). But with so many databases available, you may not know which ones to use for your research. In this blog post, we will cover the components of a scRNA-seq database, the most common databases available, and how to decide which databases to use for your research.

Figure 1: Cell clusters visualized with a Uniform Manifold Approximation and Projection (UMAP) plot. Each point represents one of 42,628 cells from a mouse embryo, analyzed using two SCOPE-chips. The cells are grouped into 26 types, including rare cell types.

What are the components of a scRNA-seq database?

Any robust scRNA-seq database must integrate several pieces of data to produce a cell atlas. These atlases are typically produced in a workflow, such as Singleron’s CeleSCOPE and CeleLens platforms. As is the case with standard workflows, CeleSCOPE processes raw sequencing data into a gene expression matrix, and CeleLens provides an intuitive user interface for identifying differentially expressed genes and enriched pathways and for running advanced analyses that enable deeper cellular characterization with cell atlases.

When the cell atlas is complete, strong scRNA-seq databases will describe each cell’s taxonomy, histology, physiological and homeostatic processes, disease impacts, and molecular mechanisms³. This information is gleaned from the raw and processed scRNA-seq data uploaded to the databases. The uploaded files include raw sequence data, gene expression matrices, metadata, and cluster files. You can observe each file type when you run our test data, which you can obtain here.

Raw sequence data

After a sequencing run, each sample produces one or two FASTQ files, depending on whether the experiment uses single-end or paired-end sequencing. In practice, scRNA-seq libraries, including most single-cell transcriptomes sequenced on an Illumina platform, are generally sequenced with paired-end reads. Most single-cell bioinformatics workflows, including CeleSCOPE, can process paired-end single-cell transcriptomes.

Every FASTQ file contains the nucleotide sequences of the sequenced transcript fragments, each called a sequencing read, along with per-base quality scores. In single-cell sequencing, each read also carries two sets of oligonucleotide sequences derived from library preparation: a cell barcode, which distinguishes individual cells, and a unique molecular identifier (UMI), which distinguishes individual transcripts within the cells.
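To make the read structure concrete, here is a minimal Python sketch that parses FASTQ records and splits out the cell barcode and UMI. The layout it assumes (a 16-base barcode followed by a 10-base UMI at the start of read 1) is hypothetical and chosen only for illustration; actual barcode and UMI positions depend on the library preparation kit, so treat the constants below as placeholders.

```python
import io

# Hypothetical read-1 layout, for illustration only: real kits differ.
BARCODE_LEN = 16  # assumed cell barcode length
UMI_LEN = 10      # assumed UMI length


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        # Drop the leading '@' and anything after the first space.
        yield header[1:].split()[0], seq, qual


def split_barcode_umi(seq):
    """Split a read-1 sequence into (cell barcode, UMI) under the assumed layout."""
    return seq[:BARCODE_LEN], seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]


# Two invented read-1 records: same cell barcode, different UMIs.
toy_fastq = "\n".join([
    "@read1 1:N:0",
    "AAAACCCCGGGGTTTT" + "ACGTACGTAC",
    "+",
    "I" * 26,
    "@read2 1:N:0",
    "AAAACCCCGGGGTTTT" + "TTTTGGGGCC",
    "+",
    "I" * 26,
]) + "\n"

tags = {
    read_id: split_barcode_umi(seq)
    for read_id, seq, _ in read_fastq(io.StringIO(toy_fastq))
}
for read_id, (barcode, umi) in tags.items():
    print(read_id, barcode, umi)
```

In this toy input, both reads come from the same cell (identical barcodes) but from different transcript molecules (different UMIs).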

Metadata

For a bioinformatics workflow to contextualize the data, metadata must accompany the raw sequence data. Without metadata, fellow scientists will be unable to contextualize the raw and processed data with respect to the research questions the study aims to solve. In a scRNA-seq study, two types of metadata must be supplied:

Sample-level: Sample-level metadata describes the samples from which cells originate. It includes donor characteristics, sample identifiers, and protocol parameters describing how cells were isolated and how their genetic material was sequenced.

Cell-level: Cell-level metadata is more granular than sample-level metadata. It comprises annotations describing cell types, the samples from which they were obtained, the algorithms used to cluster cell types, and other biological conditions.
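The relationship between the two metadata levels can be sketched as two record types linked by a sample identifier. This is only an illustration: the field names and values below are invented, and each database defines its own required metadata schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleMetadata:
    """Sample-level record (hypothetical fields)."""
    sample_id: str
    donor_age: int
    tissue: str
    protocol: str


@dataclass(frozen=True)
class CellMetadata:
    """Cell-level record (hypothetical fields), linked to a sample by sample_id."""
    cell_barcode: str
    sample_id: str
    cell_type: str
    clustering_method: str


# Invented example records.
samples = {
    "S1": SampleMetadata("S1", 54, "lung", "paired-end scRNA-seq"),
}
cells = [
    CellMetadata("AAAACCCCGGGGTTTT", "S1", "T cell", "community detection"),
    CellMetadata("TTTTGGGGCCCCAAAA", "S1", "macrophage", "community detection"),
]


def annotate(cell, sample_table):
    """Attach sample-level context to one cell-level record."""
    sample = sample_table[cell.sample_id]
    return {
        "cell_barcode": cell.cell_barcode,
        "cell_type": cell.cell_type,
        "tissue": sample.tissue,
        "donor_age": sample.donor_age,
    }


annotated = [annotate(cell, samples) for cell in cells]
for row in annotated:
    print(row)
```

Keeping the sample identifier on every cell record is what lets a reader trace any annotated cell back to its donor and protocol.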

Gene expression matrix

Gene counts are obtained after the raw sequencing data is processed. These counts are reported in a tabular format to form a gene expression matrix, which is typically normalized before downstream analysis. Each row corresponds to a gene, and each column corresponds to a single cell from which sequences were obtained. Each entry in the matrix represents the number of UMIs mapped to that gene in that cell.
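As a rough sketch of how such a matrix comes together, the Python example below counts distinct UMIs per gene and cell from a handful of invented records. The `(cell, gene, UMI)` input format and the function name are assumptions made for illustration; production pipelines also handle alignment, barcode correction, and UMI error correction, none of which is shown here.

```python
from collections import defaultdict

# Invented, already-aligned reads as (cell barcode, gene, UMI) tuples.
reads = [
    ("cellA", "GeneX", "UMI1"),
    ("cellA", "GeneX", "UMI1"),  # same UMI again: counted once
    ("cellA", "GeneX", "UMI2"),
    ("cellA", "GeneY", "UMI3"),
    ("cellB", "GeneX", "UMI4"),
]


def build_count_matrix(records):
    """Return (genes, cells, matrix) with genes as rows and cells as columns.

    Each entry is the number of distinct UMIs seen for that gene in that cell.
    """
    umis = defaultdict(set)
    for cell, gene, umi in records:
        umis[(gene, cell)].add(umi)
    genes = sorted({gene for gene, _ in umis})
    cells = sorted({cell for _, cell in umis})
    matrix = [[len(umis.get((g, c), ())) for c in cells] for g in genes]
    return genes, cells, matrix


genes, cells, matrix = build_count_matrix(reads)
print("\t".join(["gene"] + cells))
for gene, row in zip(genes, matrix):
    print("\t".join([gene] + [str(v) for v in row]))
```

Counting distinct UMIs rather than reads is what keeps amplification duplicates from inflating the expression values.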

Cluster files

From the gene expression matrix file, researchers can use clustering algorithms to identify cells with similar transcriptional profiles. These methods can be classified into four different categories: k-means clustering, hierarchical clustering, community-detection-based clustering, and density-based clustering⁴. Each algorithm produces cluster files that can then be uploaded for visualization within certain databases.
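To illustrate the first of these categories, here is a small, dependency-free k-means sketch in Python. The two-gene expression values are invented, and the deterministic farthest-first initialization is a design choice made here so the example is reproducible; it is not taken from any particular single-cell tool, and real analyses usually cluster in a reduced-dimension space with dedicated libraries.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point
    farthest from all centroids chosen so far (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, n_iter=50):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Invented cells described by the expression of two genes.
cells = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels, centroids = kmeans(cells, k=2)
print(labels)
```

The resulting list of per-cell labels is, in essence, what a cluster file records: one cluster assignment for each cell barcode.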

Common scRNA-seq databases

As more life sciences laboratories, companies, and organizations integrate scRNA-seq workflows, the wealth of data generated can change how we understand cellular behavior in health and disease. Existing bioinformatics platforms like CeleSCOPE® can process data from online databases to analyze previously unexplored transcriptional mechanisms underlying health and disease. These online databases organize data from diverse cell types, hosts, and diseases. Such efforts facilitate collaborations and make it easier to compare and evaluate data generated by different research groups.

The databases that house single-cell transcriptomics data can be classified into three main categories based on the types of cells they profile.

All-purpose scRNA-seq databases

Many scRNA-seq databases provide data from invertebrate and vertebrate models. The following databases are among the largest scRNA-seq databases online for this reason:

GEO/SRA

The NIH maintains two public repositories for RNA-seq and scRNA-seq data, known as the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA). While only raw data is stored on SRA, GEO requires both raw and processed data to be submitted for approval. SRA and GEO data are also tightly linked to their associated manuscripts to make reproducibility tests easier to perform. Altogether, GEO and SRA form the largest repositories for accessing raw scRNA-seq data.

Single-Cell Expression Atlas (SCEA)

SCEA comprises data from scRNA-seq and single-nucleus RNA-seq (snRNA-seq) experiments⁵. Compiled by EMBL, SCEA takes raw data from GEO and SRA and processes them with standardized analysis pipelines. Each experiment is then visualized using t-distributed stochastic neighbor embedding (t-SNE) plots. These plots show cells grouped by their transcriptional profiles⁶ and help identify the marker genes that best delineate the generated clusters. SCEA also lists each sample’s metadata and the experimental methods that produced the results. The database also recently added externally analyzed data: a small set of high-impact human datasets, such as the Human Lung Cell Atlas, for comparison.

Single-Cell Portal (SCP)

Developed by the Broad Institute, the SCP contains data from 884 studies and 62.5 million single cells and represents one of the largest dedicated scRNA-seq databases available online, covering 18 species. Using the processed data, SCP generates expression-based plots that visualize cells clustered by high or low expression of specific genes. The database also allows researchers to annotate their own cell populations by gene expression, publish them for viewing, and share them for evaluation and refinement.

SynEcoSys

Singleron’s all-purpose database comprises over 46 million cells from 731 datasets⁷. These cells are obtained from animal models, control tissues, and patients with diverse clinical indications and under various treatment regimens. Moreover, SynEcoSys integrates data from samples collected from multiple clinical studies into core datasets that define patient phenotypes for various tissues. For researchers, this minimizes the risk of batch effects when comparing datasets. SynEcoSys also includes an embedded online data visualization tool that contextualizes cell behavior within living systems such as the human body.

Human and mouse-specific scRNA-seq databases

Cells from mouse models help researchers evaluate the similarities and differences between mouse and human physiology. The conclusions drawn from this data can inform researchers on experimental design to guide therapeutic development, study healthy human physiology, and investigate disease pathophysiology. The following databases comprise data exclusively from human and mouse cells.

Chan-Zuckerberg CELLxGENE Discover

The rise of scRNA-seq has not gone unnoticed within the tech industry. The Chan-Zuckerberg Initiative developed the CELLxGENE Discover database for single-cell researchers to download and explore⁸. The database comprises 123.8 million cells from 1,892 datasets and 1,040 cell types.

Human Cell Atlas (HCA)

HCA aims to define the molecular profiles of every human cell type and connect them with cell-level metadata³. The database currently comprises 63.2 million cells from over 10,000 donors and 515 projects by 986 labs. The project collects data from cells across the human body while ensuring equity and involving local partners across all continents⁹. While only some body parts have atlases prepared, the database is actively under development, so researchers can expect to see more atlases emerge soon.

The Tabula Muris Senis and Tabula Sapiens

The Tabula Muris Senis and Tabula Sapiens are first-draft mouse and human cell atlases, both funded by the Chan-Zuckerberg Initiative. The former comprises more than 100,000 cells from 20 organs and tissues in the mouse model¹⁰. The latter comprises nearly 500,000 cells from 400 cell types across 24 tissues and organs across the human body¹¹. Since their publication, the number of cells in these databases has doubled, highlighting the regular updating of these databases for downstream research.

Disease-specific scRNA-seq databases

Many scRNA-seq databases contain cells from healthy human donors and animal models. Other databases are built to store single-cell transcriptomics data for human diseases. These databases aim to help elucidate the transcriptional processes that occur among cells in disease relative to healthy controls.

CancerSEA

CancerSEA is a single-cell RNA-seq database that comprises a cell atlas for 14 cancer-related functional states and 41,900 cancerous single cells¹². These states reflect the various biological processes that cancerous cells participate in, from stemness to quiescence. Building the database began with gathering scRNA-seq data from SRA and GEO. The transcriptomes of cancer cells were then collected, and gene-set variation analysis (GSVA) was used to group cells by the sets of genes enriched for each cancer cell state¹³. End users can search for genes of interest, search functional states across all cancer cell types, and observe the functional activity profiles of every cancer cell in the database.

The Alzheimer’s Cell Atlas (TACA)

TACA is a single-cell multi-omics (including transcriptomics) database that compares single cells from all major brain regions and cell types from healthy controls and Alzheimer’s disease patients¹⁴. The database contains over 1.1 million cells and nuclei from 26 datasets and offers several analytical features. Its cell viewer visualizes cells clustered by cell type, gene expression, and sample metadata. The database also contains a means to explore differentially expressed genes and profile cell-cell and protein-protein interaction networks.

CardioAtlas

CardioAtlas is a single-cell transcriptomics database that compiles data from more than 3 million cells from 63 human and mouse datasets¹⁵. It comprises cardiovascular cells from healthy and diseased patients and models. Researchers can use five different modules to analyze cell clusters and assess gene expression across various cell types.

How do you decide where to upload your data and which datasets to use for your study?

With so many databases available, parsing through them can be a time-consuming procedure without proper direction. To determine which databases you should use for your workflow, consider answering the following questions first:

Do you want to work with preprocessed data or raw data?

Most databases host raw and processed data for downstream analyses. Each has its place in the study of single-cell transcriptomes. Raw data is most useful when you’re evaluating your data processing pipelines. You can also use raw data to reanalyze an existing scRNA-seq dataset and compare multiple datasets.

On the other hand, preprocessed data is most useful when you’re validating downstream statistical methods after quality control. For instance, count normalization is a key step in comparing single-cell transcriptomes because it minimizes technical biases. It scales cell-specific measures to a common range, assuming most genes are not differentially expressed¹⁶.
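One simple form of count normalization can be sketched as follows: scale every cell to the same total count, then log-transform. This library-size approach, the target total of 10,000, and the toy counts are assumptions chosen for illustration; they are not the specific methods reviewed in the cited reference, and many alternative normalization schemes exist.

```python
import math


def normalize_counts(matrix, target_sum=10_000):
    """Library-size normalization followed by log1p.

    `matrix` has genes as rows and cells as columns. Each cell (column) is
    rescaled so its counts sum to `target_sum`, then log(1 + x) is applied.
    """
    n_cells = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(n_cells)]
    return [
        [
            math.log1p(row[j] * target_sum / totals[j]) if totals[j] else 0.0
            for j in range(n_cells)
        ]
        for row in matrix
    ]


# Invented counts: the second cell was sequenced 10x deeper than the first,
# but the relative expression of the three genes is identical.
counts = [
    [10, 100],
    [30, 300],
    [60, 600],
]
normalized = normalize_counts(counts)
for row in normalized:
    print([round(v, 3) for v in row])
```

After normalization the two columns match, which shows how scaling removes the difference in sequencing depth while preserving relative expression.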

Preprocessed data can also serve as a benchmark for evaluating your feature selection protocols. Feature selection is the process of determining which genes are most relevant for differentiating between cells of interest. The choice of feature selection method affects how cells are clustered¹⁷.
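As a minimal illustration of feature selection, the sketch below ranks genes by how much their expression varies across cells and keeps the top few. Ranking by plain variance is a deliberately simple stand-in chosen for this example; the gene names and values are invented, and established tools typically use more refined criteria that account for the relationship between a gene's mean and its variance.

```python
from statistics import pvariance


def top_variable_genes(genes, matrix, n_top):
    """Return the n_top gene names with the highest variance across cells.

    `matrix` has genes as rows and cells as columns (normalized values).
    """
    ranked = sorted(
        zip(genes, matrix),
        key=lambda pair: pvariance(pair[1]),
        reverse=True,
    )
    return [gene for gene, _ in ranked[:n_top]]


# Invented normalized expression for three genes across four cells.
genes = ["Housekeeping", "MarkerA", "MarkerB"]
expression = [
    [5.0, 5.1, 4.9, 5.0],  # nearly constant: uninformative for clustering
    [0.0, 0.1, 6.0, 6.2],  # separates cells 1-2 from cells 3-4
    [3.0, 0.2, 3.1, 0.1],  # separates cells 1 and 3 from cells 2 and 4
]
selected = top_variable_genes(genes, expression, n_top=2)
print(selected)
```

The nearly constant gene is dropped, while the two genes that differ between groups of cells are kept for clustering.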

Are you focusing on a specific host or disease?

Many scRNA-seq databases contain cells from studies of the human body at the single-cell level. Among the studies focused on human physiology, a subset of them focuses on single-cell variation in human disease. If you’re focusing on the tumor microenvironment, consider using a human-focused database or CancerSEA, which focuses on cancer. Other databases for other disease indications are also under development, such as one for genetic diseases developed by EMBL scientists.

What analytical tools are available within the database?

Different databases offer various analytical tools that can help you visualize and compare your data with existing datasets. The Chan-Zuckerberg CELLxGENE Discover database lets users filter cells by ethnicity, publication, sex, or tissue type, and view gene expression data as easy-to-read dot plots. Scientists can benefit from using tools that centralize data analysis. Singleron’s pipeline flows from raw sequencing data to a complete cell atlas in three steps: sequence data processing with CeleSCOPE®, gene expression analyses and data visualization with CeleLens™ Cloud, and comparison against a large reference atlas database with SynEcoSys®.

Conclusion

The burst of interest in scRNA-seq has prompted several research groups to establish online databases for storing and processing transcriptomes from individual cells. These databases include cells from across the animal kingdom or focus on specific cell types and diseases. For scRNA-seq databases to facilitate biomedical research, scientists must reflect on several questions that will guide the selection of databases that best answer their research questions.

Lastly, Singleron’s CeleSCOPE platform can process any paired-end FASTQ file regardless of what databases you download or which datasets you produce. With our pipeline, you would only need to input your paired-end read files to produce a BAM file and gene expression matrix files. You can then use these files to conduct downstream analyses, such as cell clustering and networking analyses.

Contact us to learn more about our array of bioinformatics products and how they can streamline your single-cell transcriptomics research.

References

1. Tang F, Barbacioru C, Wang Y, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009;6:377–382
2. Gondal MN, Shah SUR, Chinnaiyan AM, Cieslik M. Overview of single-cell transcriptomics databases: use cases and limitations. ArXiv. 2024;arXiv:2404.10545v1
3. Regev A, Teichmann SA, Lander ES, et al. The Human Cell Atlas. eLife. 2017;6:e27041
4. Zhang S, Li X, Lin J, et al. Single-cell RNA-seq data clustering for cell-type identification. RNA. 2023;29:517–530
5. Moreno P, Fexova S, George N, et al. Expression Atlas update: gene and protein expression in multiple species. Nucleic Acids Res. 2021;50:D129–D140
6. Kobak D, Berens P. Using t-SNE for single-cell transcriptomics. Nat Commun. 2019;10:5416
7. Zhang Y, Li B, Duan J, et al. SynEcoSys: large-scale single-cell omics data platform. bioRxiv. 2023;2023.02.14.528566
8. CZI Cell Science Program, Abdulla S, Aevermann B, et al. CZ CELLxGENE Discover: single-cell data platform. Nucleic Acids Res. 2025;53:D886–D900
9. Amit I, Ardlie K, Arzuaga F, et al. Human Cell Atlas commitment to humanity. Nat Commun. 2024;15:10019
10. Schaum N, Karkanias J, Neff NF, et al. Single-cell transcriptomics of 20 mouse organs: Tabula Muris. Nature. 2018;562:367–372
11. The Tabula Sapiens Consortium. Tabula Sapiens: multi-organ human single-cell atlas. Science. 2022;376:eabl4896
12. Yuan H, Yan M, Zhang G, et al. CancerSEA: cancer single-cell state atlas. Nucleic Acids Res. 2019;47:D900–D908
13. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis. BMC Bioinformatics. 2013;14:7
14. Zhou Y, Xu J, Hou Y, et al. Alzheimer’s Cell Atlas (TACA): single-cell molecular map for therapeutics. Alzheimer’s & Dementia (NY). 2022;8:1
15. Jiang T, Jin X, Gao Y, et al. CardioAtlas: single-cell transcriptome in cardiovascular tissues. Biomark Res. 2024;12:149
16. Vallejos CA, Risso D, Scialdone A, et al. Normalizing scRNA-seq data: challenges and opportunities. Nat Methods. 2017;14:565–571
17. Zappia L, Richter S, Ramírez-Suástegui C, et al. Feature selection affects scRNA-seq integration and querying. Nat Methods. 2025;22:834–844

A post by Salih Yilmaz
