Single-Cell Transcriptomics Databases: A Practical Guide

Introduction 

Cells are the building blocks of life. They form tissues and organs, and their diverse phenotypes work together to maintain health and well-being. Before sequencing, scientists studied cellular phenotypes with fluorescence microscopy, flow cytometry, and immunohistochemistry. Today, single-cell RNA sequencing (scRNA-seq) enables comprehensive profiling of a single cell’s transcriptome. Since the first scRNA-seq study1, many protocols for characterizing gene expression in single cells have been developed. These efforts have produced thousands of single-cell transcriptomics studies, with data for millions of cells from donors across the human population.

With such mountains of data comes a growing need to store and access it in a centralized location. Such a place would foster collaborations between research groups and advance the dissemination of important insights at the single-cell level.

Online repositories offer such a platform. They can store data from scRNA-seq studies for cells across disease states, cell types, and physiological ranges. Researchers can then access these databases to build computational models for investigating biological processes in the body2. With databases, you can also visualize your datasets in relation to other datasets to identify novel cell clusters and genetic signatures of health and disease (Figure 1). But with so many databases available, you may not know which ones to use for your research. In this blog post, we will cover the components of a scRNA-seq database, the most common databases available, and how to decide which databases to use for your research.

Figure 1: Cell clusters visualized with Uniform Manifold Approximation and Projection (UMAP). Each point represents one of 42,628 cells from the mouse embryo that were analyzed on two SCOPE-chips. The cells are grouped into 26 distinct types, including rare cell types.

What are the components of a scRNA-seq database? 

Any robust scRNA-seq database must integrate several pieces of data to produce a cell atlas. These atlases are typically produced in a workflow, such as the one built on Singleron’s CeleSCOPE and CeleLens platforms. As in standard workflows, CeleSCOPE processes raw sequencing data into a gene expression matrix, and CeleLens provides an intuitive user interface for identifying differentially expressed genes and enriched pathways and for running advanced analyses that enable deeper cellular characterization with cell atlases.

When the cell atlas is complete, strong scRNA-seq databases will describe each cell’s taxonomy, histology, physiological and homeostatic processes, disease impacts, and molecular mechanisms3. This information is gleaned from the raw and processed scRNA-seq data uploaded to the database, which include raw sequence data, gene expression matrices, metadata, and cluster files. You can observe each file type when you run our test data, which you can obtain here.

Raw sequence data 

After a sequencing run is completed, each sample in the run will yield one or two FASTQ files, depending on whether the experiment is single-end or paired-end. In practice, paired-end reads are the most common choice for single-cell transcriptomes sequenced on an Illumina sequencer. Most single-cell bioinformatics workflows, including CeleSCOPE, can process paired-end single-cell transcriptomes.

Every FASTQ file stores sequencing reads: the nucleotide sequence of each sequenced cDNA fragment, along with its per-base quality scores. In single-cell sequencing, each read pair also contains two oligonucleotide sequences introduced during library preparation: a cell barcode that distinguishes individual cells and a unique molecular identifier (UMI) that distinguishes individual transcript molecules within a cell.
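To make the read structure concrete, here is a minimal Python sketch that parses FASTQ records and splits a barcode-carrying read into its cell barcode and UMI. The 16-base barcode, the 12-base UMI, the function names, and the example reads are all illustrative assumptions. Real layouts depend on the library chemistry, and this is not a description of how CeleSCOPE is implemented.

```python
import io

# Assumed, illustrative layout: real barcode/UMI positions depend on the chemistry.
BARCODE_LEN = 16
UMI_LEN = 12


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        yield header[1:].split()[0], sequence, quality


def split_barcode_umi(sequence):
    """Split a barcode read into (cell_barcode, umi) under the assumed layout."""
    barcode = sequence[:BARCODE_LEN]
    umi = sequence[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return barcode, umi


# Two made-up reads from the same cell (same barcode) with different UMIs.
quality_line = "I" * (BARCODE_LEN + UMI_LEN)
example = io.StringIO("\n".join([
    "@read1 1:N:0",
    "AAAACCCCGGGGTTTT" + "ACGTACGTACGT",
    "+",
    quality_line,
    "@read2 1:N:0",
    "AAAACCCCGGGGTTTT" + "TTTTGGGGCCCC",
    "+",
    quality_line,
]) + "\n")

tags = {read_id: split_barcode_umi(seq) for read_id, seq, _ in read_fastq(example)}
```

Both reads share a barcode, so they come from the same cell, while their different UMIs mark them as distinct transcript molecules.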

Metadata 

For a bioinformatics workflow to contextualize the data, metadata must accompany the raw sequence data. Without metadata, fellow scientists cannot interpret the raw and processed data with respect to the research questions the study aims to answer. In a scRNA-seq study, two types of metadata must be supplied:

  • Sample-level: Sample-level metadata includes information about the samples from which the cells are taken. This information can include characteristics about the donor, sample identifiers to distinguish samples from each other, and protocol parameters detailing how the cells were isolated and their genetic material sequenced.  
  • Cell-level: Cell-level metadata is more granular than sample-level metadata. It comprises annotations describing cell types, the samples from which they were obtained, the algorithms used to cluster cell types, and other biological conditions. 
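The relationship between the two levels can be sketched in a few lines of Python: each cell-level record points back to one sample-level record through a sample identifier. The field names and values below are hypothetical and do not follow the schema of any particular database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleMetadata:
    sample_id: str
    donor_sex: str
    tissue: str
    protocol: str


@dataclass(frozen=True)
class CellMetadata:
    cell_barcode: str
    sample_id: str       # links the cell back to its sample
    cell_type: str
    cluster_method: str


# Hypothetical sample-level table, keyed by sample identifier.
samples = {
    "S1": SampleMetadata("S1", "female", "lung", "paired-end scRNA-seq"),
    "S2": SampleMetadata("S2", "male", "lung", "paired-end scRNA-seq"),
}

# Hypothetical cell-level table.
cells = [
    CellMetadata("AAAC", "S1", "T cell", "community detection"),
    CellMetadata("CCCA", "S1", "B cell", "community detection"),
    CellMetadata("GGGT", "S2", "T cell", "community detection"),
]


def annotate(cell, sample_table):
    """Combine a cell-level record with the sample-level record it points to."""
    sample = sample_table[cell.sample_id]
    return {
        "cell_barcode": cell.cell_barcode,
        "cell_type": cell.cell_type,
        "tissue": sample.tissue,
        "donor_sex": sample.donor_sex,
    }


annotated = [annotate(cell, samples) for cell in cells]
```

Keeping the two tables separate avoids repeating donor and protocol details for every cell while still letting each cell be traced to its sample.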

Gene expression matrix 

Gene counts are obtained after the raw sequencing data is processed. These counts are reported in a tabular format to form a gene expression matrix, which is typically normalized before downstream analyses. Each row corresponds to a gene, and each column corresponds to a single cell from which sequences were obtained. Each entry in the matrix is the number of unique UMIs from that cell that mapped to that gene.
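As a rough illustration of how UMIs turn reads into counts, the sketch below builds a small genes-by-cells matrix from made-up aligned reads. The tuple layout and all identifiers are assumptions for the example. Production pipelines also correct sequencing errors in barcodes and UMIs, which is omitted here.

```python
from collections import defaultdict

# Made-up aligned reads as (cell_barcode, umi, gene). The third read repeats
# the first one exactly, so it is a duplicate and must not be counted twice.
reads = [
    ("CELL_A", "UMI1", "GeneX"),
    ("CELL_A", "UMI2", "GeneX"),
    ("CELL_A", "UMI1", "GeneX"),
    ("CELL_A", "UMI3", "GeneY"),
    ("CELL_B", "UMI1", "GeneX"),
]

# Collect the distinct UMIs seen for every (gene, cell) pair.
umis = defaultdict(set)
for cell, umi, gene in reads:
    umis[(gene, cell)].add(umi)

genes = sorted({gene for _, _, gene in reads})
cells = sorted({cell for cell, _, _ in reads})

# Rows are genes, columns are cells, values are unique UMI counts.
matrix = [[len(umis.get((gene, cell), ())) for cell in cells] for gene in genes]
```

Here GeneX in CELL_A is supported by three reads but only two distinct UMIs, so its entry is 2 rather than 3.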

Cluster files 

From the gene expression matrix file, researchers can use clustering algorithms to identify cells with similar transcriptional profiles. These methods can be classified into four different categories: k-means clustering, hierarchical clustering, community-detection-based clustering, and density-based clustering4. Each algorithm produces cluster files that can then be uploaded for visualization within certain databases. 
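For illustration, here is a minimal sketch of the first category, k-means clustering, followed by writing the result as a simple tab-separated cluster file. The expression values, the two-column file layout, and the choice to seed centroids with the first k cells are assumptions made to keep the example small and deterministic. Real analyses usually reduce dimensionality first and use a tested library implementation.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on lists of floats; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each cell joins its nearest centroid.
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up log-expression values for two genes in six cells.
cell_ids = ["c1", "c2", "c3", "c4", "c5", "c6"]
expression = [
    [5.0, 0.1], [0.2, 4.8], [4.7, 0.3],
    [0.1, 5.1], [5.2, 0.0], [0.3, 4.9],
]

labels = kmeans(expression, k=2)

# A simple tab-separated cluster file: one row per cell.
cluster_file = "cell_id\tcluster\n" + "".join(
    f"{cell}\t{label}\n" for cell, label in zip(cell_ids, labels)
)
```

Cells with high expression of the first gene end up in one cluster and cells with high expression of the second gene in the other.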

Common scRNA-seq databases 

As more life sciences laboratories, companies, and organizations adopt scRNA-seq workflows, the wealth of data generated can change how we understand cellular behavior in health and disease. Existing bioinformatics platforms like CeleSCOPE® can process data from online databases to analyze previously unexplored transcriptional mechanisms underlying health and disease. These online databases organize data from diverse cell types, hosts, and diseases, which facilitates collaboration and makes it easier to compare and evaluate data generated by different research groups.

The databases that house single-cell transcriptomics data can be classified into three main categories based on the types of cells they profile.  

All-purpose scRNA-seq databases 

Many scRNA-seq databases provide data from invertebrate and vertebrate models. The following databases are among the largest scRNA-seq databases online for this reason: 

GEO/SRA 

The NIH maintains two public repositories for RNA-seq and scRNA-seq data, known as the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA). While only raw data is stored on SRA, GEO requires both raw and processed data to be submitted for approval. SRA and GEO data are also tightly linked to their associated manuscripts to make reproducibility tests easier to perform. Altogether, GEO and SRA form the largest repositories for accessing raw scRNA-seq data.    
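Both repositories can be searched programmatically through NCBI’s E-utilities. The sketch below only builds ESearch URLs and does not send any request. The helper name and the query text are assumptions for illustration, so adapt the search term to your own organism and assay.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(database, term, max_records=20):
    """Build an E-utilities ESearch URL; 'gds' is GEO DataSets, 'sra' is SRA."""
    params = {"db": database, "term": term, "retmax": max_records, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# The query text is only an example of an Entrez search term.
query = 'single cell RNA-seq AND "Mus musculus"[Organism]'
geo_url = build_search_url("gds", query)
sra_url = build_search_url("sra", query)
```

Opening either URL returns matching record identifiers, which can then be passed to other E-utilities calls to retrieve summaries.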

Single-Cell Expression Atlas (SCEA) 

SCEA comprises data from scRNA-seq and single-nucleus RNA-seq (snRNA-seq) experiments5. Compiled by EMBL, SCEA takes raw data from GEO and SRA and processes them with standardized analysis pipelines. Each experiment’s results are then portrayed as t-distributed stochastic neighbor embedding (t-SNE) plots. These plots show cells grouped by their transcriptional profiles6 and help identify the marker genes that best delineate the resulting clusters. SCEA also lists each sample’s metadata and the experimental methods that produced the results. The database also recently added externally analyzed data: a small set of high-impact human datasets, such as the Human Lung Cell Atlas, for comparison.

Single-Cell Portal (SCP) 

The SCP was developed by the Broad Institute and comprises data from 884 studies and 62.5 million single cells. Covering 18 unique species, it is one of the largest dedicated scRNA-seq databases available online. From the processed data, SCP generates expression-based plots that show cells clustered by high or low expression of specific genes. The database also allows researchers to annotate their own cell populations by gene expression, publish them for viewing, and share them for evaluation and refinement.

SynEcoSys 

Singleron’s all-purpose database comprises over 46 million cells from 731 datasets7. These cells come from animal models, control tissues, and patients with diverse clinical indications under various treatment regimens. Moreover, SynEcoSys integrates data from samples collected in multiple clinical studies into core datasets that define patient phenotypes for various tissues. For researchers, this minimizes the risk of batch effects when comparing datasets. SynEcoSys also includes an embedded online data visualization tool that contextualizes cell behavior within living systems such as the human body.

Human and mouse-specific scRNA-seq databases 

Cells from mouse models help researchers evaluate the similarities and differences between mouse and human physiology. The conclusions drawn from this data can inform researchers on experimental design to guide therapeutic development, study healthy human physiology, and investigate disease pathophysiology. The following databases comprise data exclusively from human and mouse cells. 

Chan-Zuckerberg CELLxGENE Discover 

The rise of scRNA-seq has not gone unnoticed within the tech industry. The Chan-Zuckerberg Initiative developed the CELLxGENE Discover database for single-cell researchers to download and explore8. The database comprises 123.8 million cells from 1,892 datasets and 1,040 cell types.

Human Cell Atlas (HCA) 

HCA aims to define the molecular profiles of every human cell type and connect them with cell-level metadata3. The database currently comprises 63.2 million cells from over 10,000 donors and 515 projects by 986 labs. It aims to cover all cells of the body, with a strong focus on equity for all of humanity and deep involvement by local partners on every continent9. While atlases have been prepared for only some parts of the body, the database is under active development, so researchers can expect to see more atlases emerge soon.

Tabula Muris and Tabula Sapiens

The Tabula Muris and Tabula Sapiens are first-draft mouse and human cell atlases, both funded by the Chan-Zuckerberg Initiative. The former comprises more than 100,000 cells from 20 organs and tissues in the mouse model10. The latter comprises nearly 500,000 cells from 400 cell types across 24 tissues and organs of the human body11. Since their publication, the number of cells in these databases has doubled, which shows that they are updated regularly for downstream research.

Disease-specific scRNA-seq databases 

Many scRNA-seq databases contain cells from healthy human donors and animal models. Other databases are built to store single-cell transcriptomics data for human diseases. These databases aim to help elucidate the transcriptional processes that occur among cells in disease relative to healthy controls.  

CancerSEA 

CancerSEA is a single-cell RNA-seq database that comprises a cell atlas of 14 cancer-related functional states across 41,900 cancerous single cells12. These states reflect the various biological processes that cancer cells participate in, from stemness to quiescence. To build the database, its developers first collected scRNA-seq data from SRA and GEO. They then used gene-set variation analysis (GSVA) to score the transcriptome of each cancer cell against the sets of genes that characterize each functional state13. End users can search for genes of interest, search functional states across all cancer cell types, and observe the functional activity profiles of every cancer cell in the database.
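The idea of scoring cells against a gene set can be sketched with a deliberately simplified stand-in: averaging per-gene z-scores within each cell. This is not the GSVA algorithm, which uses a rank-based enrichment statistic, and the gene names, values, and gene set below are made up for illustration.

```python
from statistics import mean, pstdev

# Made-up expression values: gene -> one value per cell (four cells).
expression = {
    "GeneA": [8.0, 7.0, 1.0, 0.0],
    "GeneB": [6.0, 8.0, 2.0, 0.0],
    "GeneC": [1.0, 0.0, 7.0, 8.0],
}

# Hypothetical gene set standing in for one functional state.
state_gene_set = ["GeneA", "GeneB"]


def zscores(values):
    """Standardize one gene across cells so that genes are comparable."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def score_cells(expr, gene_set):
    """Average the per-gene z-scores of the set's genes within each cell."""
    standardized = [zscores(expr[g]) for g in gene_set if g in expr]
    return [mean(cell_values) for cell_values in zip(*standardized)]


scores = score_cells(expression, state_gene_set)

# Cells with an above-average score show higher activity of the gene set.
high_state_cells = [i for i, s in enumerate(scores) if s > 0]
```

The first two cells express both genes in the set strongly, so they receive positive scores, while the other two score below average.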

The Alzheimer’s Cell Atlas (TACA) 

TACA is a single-cell multi-omics (including transcriptomics) database that compares single cells from all major brain regions and cell types between healthy controls and Alzheimer’s disease patients14. The database contains over 1.1 million cells and nuclei from 26 datasets and offers several analytical features. Its cell viewer visualizes cells clustered by cell type, gene expression, and sample metadata. The database also provides tools to explore differentially expressed genes and profile cell-cell and protein-protein interaction networks.

CardioAtlas 

CardioAtlas is a single-cell transcriptomics database that compiles data from more than 3 million cells from 63 human and mouse datasets15. It comprises cardiovascular cells from healthy and diseased patients and models. Researchers can use five different modules to analyze cell clusters and assess gene expression across various cell types.

How do you decide where to upload your data and which datasets to use for your study? 

With so many databases available, parsing through them can be a time-consuming procedure without proper direction. To determine which databases you should use for your workflow, consider answering the following questions first:  

Do you want to work with preprocessed data or raw data? 

Most databases host raw and processed data for downstream analyses. Each has its place in the study of single-cell transcriptomes. Raw data is most useful when you’re evaluating your data processing pipelines. You can also use raw data to reanalyze an existing scRNA-seq dataset and compare multiple datasets.

On the other hand, preprocessed data is most useful when you’re validating downstream statistical methods after quality control. For instance, count normalization is a key step in comparing single-cell transcriptomes because it minimizes technical biases. It brings cell-specific measures onto common, comparable scales under the assumption that most genes are not differentially expressed16.

Preprocessed data can also serve as benchmarks to evaluate your feature selection protocols. Feature selection is the process of determining which genes are most relevant for differentiating between cells of interest. The feature selection method you choose will affect how cells are clustered17.
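One common form of count normalization is library-size scaling followed by a log transform, sketched below. The target size of 10,000 and the example counts are assumptions, and other normalization strategies exist. Cells with zero total counts would need to be filtered out first, since the function divides by the cell total.

```python
import math

# Made-up UMI counts: one list per cell, one entry per gene.
counts = {
    "cell_1": [10, 0, 30, 60],      # 100 UMIs in total
    "cell_2": [100, 0, 300, 600],   # same proportions, sequenced 10x deeper
}

SCALE = 10_000  # assumed target library size


def normalize(cell_counts, scale=SCALE):
    """Scale a cell to a common library size, then log-transform with log1p."""
    total = sum(cell_counts)
    return [math.log1p(c / total * scale) for c in cell_counts]


normalized = {cell: normalize(values) for cell, values in counts.items()}
```

After normalization the two cells have matching profiles, because they differed only in sequencing depth and not in the relative abundance of each gene.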
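As a simple illustration of feature selection, the sketch below ranks genes by their variance across cells and keeps the most variable ones. Variance ranking is only one possible criterion and is used here as an assumption. The gene names and values are made up, and published methods typically also model the dependence of variance on mean expression.

```python
from statistics import pvariance

# Made-up normalized expression: gene -> one value per cell.
expression = {
    "Housekeeping": [5.0, 5.1, 4.9, 5.0],
    "MarkerA": [0.0, 6.0, 0.1, 5.9],
    "MarkerB": [4.0, 0.2, 4.1, 0.0],
    "Noise": [1.0, 1.2, 0.9, 1.1],
}


def top_variable_genes(expr, n_top):
    """Rank genes by variance across cells and keep the n_top most variable."""
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return ranked[:n_top]


selected = top_variable_genes(expression, n_top=2)
```

The two marker genes vary strongly between cells and are retained, while the uniformly expressed genes carry little information for separating cells and are dropped.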

Are you focusing on a specific host or disease? 

Many scRNA-seq databases are enriched with single-cell studies of the human body. A subset of these studies focuses on single-cell variation in human disease. If you’re focusing on the tumor microenvironment, consider using a human-focused database or CancerSEA, which focuses on cancer. Databases for other disease indications are also under development, such as one for genetic diseases developed by EMBL scientists.

What analytical tools are available within the database? 

Different databases offer various analytical tools that can help you visualize and compare your data with existing datasets. For instance, the Chan-Zuckerberg CELLxGENE Discover database can filter cells by self-reported ethnicity, publication source, sex, and tissue type. The database can also depict gene expression data as dot plots to easily visualize published data within its interface. Scientists may also benefit from using tools that centralize data analysis. Singleron’s pipeline flows from raw sequencing data to a complete cell atlas in three steps: sequence data processing with CeleSCOPE®, gene expression analysis and data visualization with CeleLens Cloud, and comparison against the large reference atlas database in SynEcoSys®.
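Dot plots of gene expression commonly encode two numbers per cell type and gene: the fraction of cells expressing the gene and the mean expression among those cells. The sketch below computes both from made-up values. It illustrates the general idea only and is not a description of how CELLxGENE Discover computes its plots.

```python
from collections import defaultdict

# Made-up cells: (cell_type, {gene: normalized expression}).
cells = [
    ("T cell", {"CD3E": 2.0, "MS4A1": 0.0}),
    ("T cell", {"CD3E": 4.0, "MS4A1": 0.0}),
    ("B cell", {"CD3E": 0.0, "MS4A1": 3.0}),
    ("B cell", {"CD3E": 0.0, "MS4A1": 0.0}),
]

# Group the per-cell expression records by annotated cell type.
grouped = defaultdict(list)
for cell_type, expression in cells:
    grouped[cell_type].append(expression)


def dot_summary(group, gene):
    """Return (fraction of cells expressing the gene, mean among expressing cells)."""
    values = [cell[gene] for cell in group]
    expressing = [v for v in values if v > 0]
    fraction = len(expressing) / len(values)
    mean_expr = sum(expressing) / len(expressing) if expressing else 0.0
    return fraction, mean_expr


dots = {
    (cell_type, gene): dot_summary(group, gene)
    for cell_type, group in grouped.items()
    for gene in ("CD3E", "MS4A1")
}
```

In a plot, the first value of each pair would typically set the dot size and the second its color.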

Conclusion 

The burst of interest in scRNA-seq has prompted several research groups to establish online databases for storing and processing transcriptomes from individual cells. These databases can comprise cells from across the animal kingdom or be specialized towards certain cell types and diseases. For scRNA-seq databases to facilitate biomedical research, scientists must reflect on several questions that will guide the selection of databases that best answer their research questions. 

Regardless of which databases you download from or which datasets you produce, Singleron’s CeleSCOPE platform can process any paired-end FASTQ file. With our pipeline, you only need to input your paired-end read files to produce a BAM file and gene expression matrix files. You can then use these files to conduct downstream analyses, such as cell clustering and network analyses.

Contact us to learn more about our array of bioinformatics products and how they can streamline your single-cell transcriptomics research.   

References 

1. Tang F, Barbacioru C, Wang Y, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009;6(5):377-382. doi:10.1038/nmeth.1315 

2. Gondal MN, Shah SUR, Chinnaiyan AM, Cieslik M. A Systematic Overview of Single-Cell Transcriptomics Databases, their Use cases, and Limitations. ArXiv. Published online April 15, 2024:arXiv:2404.10545v1. 

3. Regev A, Teichmann SA, Lander ES, et al. The Human Cell Atlas. Gingeras TR, ed. eLife. 2017;6:e27041. doi:10.7554/eLife.27041 

4. Zhang S, Li X, Lin J, Lin Q, Wong KC. Review of single-cell RNA-seq data clustering for cell-type identification and characterization. RNA. 2023;29(5):517-530. doi:10.1261/rna.078965.121 

5. Moreno P, Fexova S, George N, et al. Expression Atlas update: gene and protein expression in multiple species. Nucleic Acids Res. 2021;50(D1):D129-D140. doi:10.1093/nar/gkab1030 

6. Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nat Commun. 2019;10(1):5416. doi:10.1038/s41467-019-13056-x 

7. Zhang Y, Li B, Duan J, et al. SynEcoSys: a multifunctional platform of large-scale single-cell omics data analysis. bioRxiv. Preprint posted online February 15, 2023:2023.02.14.528566. doi:10.1101/2023.02.14.528566 

8. CZI Cell Science Program, Abdulla S, Aevermann B, et al. CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Res. 2025;53(D1):D886-D900. doi:10.1093/nar/gkae1142 

9. Amit I, Ardlie K, Arzuaga F, et al. The commitment of the human cell atlas to humanity. Nat Commun. 2024;15:10019. doi:10.1038/s41467-024-54306-x 

10. Schaum N, Karkanias J, Neff NF, et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. 2018;562(7727):367-372. doi:10.1038/s41586-018-0590-4 

11. The Tabula Sapiens Consortium. The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science. 2022;376(6594):eabl4896. doi:10.1126/science.abl4896 

12. Yuan H, Yan M, Zhang G, et al. CancerSEA: a cancer single-cell state atlas. Nucleic Acids Res. 2019;47(D1):D900-D908. doi:10.1093/nar/gky939 

13. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics. 2013;14(1):7. doi:10.1186/1471-2105-14-7 

14. Zhou Y, Xu J, Hou Y, et al. The Alzheimer’s Cell Atlas (TACA): A single-cell molecular map for translational therapeutics accelerator in Alzheimer’s disease. Alzheimers Dement (N Y). 2022;8(1). doi:10.1002/trc2.12350

15. Jiang T, Jin X, Gao Y, et al. CardioAtlas: deciphering the single-cell transcriptome landscape in cardiovascular tissues and diseases. Biomark Res. 2024;12:149. doi:10.1186/s40364-024-00696-5 

16. Vallejos CA, Risso D, Scialdone A, Dudoit S, Marioni JC. Normalizing single-cell RNA sequencing data: Challenges and opportunities. Nat Methods. 2017;14(6):565-571. doi:10.1038/nmeth.4292 

17. Zappia L, Richter S, Ramírez-Suástegui C, et al. Feature selection methods affect the performance of scRNA-seq data integration and querying. Nat Methods. 2025;22(4):834-844. doi:10.1038/s41592-025-02624-3