ChatGPT and Single Cell Data Science

Since its launch in November 2022, ChatGPT (Chat Generative Pre-trained Transformer; developed by OpenAI) has gained huge exposure in the tech sphere around the globe. It has reportedly passed 100 million registered users, of whom 25 million are daily active visitors. Among them are university students, office workers and lab researchers. Why has ChatGPT become so popular in such a short period of time? What is the technique behind this phenomenal success? Could ChatGPT be used for single cell data science as well? Let’s take a closer look and demystify it.


What is ChatGPT?

The current version of ChatGPT is based on GPT-3.5, where GPT stands for Generative Pre-trained Transformer. It belongs to the field of natural language processing (NLP) and uses deep learning methods. In short, GPT starts with generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task (Radford & Narasimhan, 2018). It is a semi-supervised approach that combines two training stages: unsupervised pre-training and supervised fine-tuning. This significantly boosted the performance of the model (a Transformer) on natural language understanding.
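For readers who prefer a formula, the two training stages can be written as objectives. The sketch below follows the notation of Radford & Narasimhan (2018) as closely as we can reproduce it; the symbols k (context window size), Θ (model parameters) and λ (weight of the auxiliary term) come from that paper, not from this article.

```latex
% Stage 1: unsupervised pre-training on an unlabeled token corpus U = {u_1, ..., u_n}
L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)

% Stage 2: supervised fine-tuning on a labeled dataset C of (tokens x^1..x^m, label y)
L_2(\mathcal{C}) = \sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right)

% Combined objective: language modeling kept as an auxiliary term during fine-tuning
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
```

Stage 1 needs no labels at all, since the model only learns to predict the next token from the preceding ones. Labels enter only in stage 2, which is why the overall procedure counts as semi-supervised.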


Machine learning

The past decade has witnessed an explosion of interest in machine learning (ML). In principle, machine learning uses algorithms to extract information from raw data (the training dataset) and represent it in some type of model. This model is then applied to infer things about other data (the testing dataset) that have not yet been modeled (Jones, 2014). ML can help humankind recognize patterns within data and make predictions.
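The training/testing idea can be sketched in a few lines of Python. This is a toy illustration only: the nearest-centroid classifier, the function names and all data values are made up for this example and are not taken from the cited work.

```python
# Toy illustration of the train/test idea: fit a model on labeled
# training data, then infer labels for data the model has not seen.
# All values and names here are hypothetical.

def fit_centroids(samples, labels):
    """Learn one mean vector (centroid) per class from the training set."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))


# Training dataset: the raw data the model is built from.
train_x = [[1.0, 1.2], [0.8, 1.0], [5.0, 5.1], [5.2, 4.9]]
train_y = ["A", "A", "B", "B"]

# Testing dataset: held back and never shown to the model during fitting.
test_x = [[0.9, 1.1], [5.1, 5.0]]
test_y = ["A", "B"]

model = fit_centroids(train_x, train_y)
predictions = [predict(model, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

The "model" here is nothing more than two averaged points, but the workflow is the same as in larger systems: fit on one dataset, then measure how well the result generalizes to another.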

Though it originated in computer science, ML has found more and more applications in life science and medical research. A good example is biomarker discovery in human diseases, where ML was applied to find disease markers and predict outcomes (Mayr et al., 2021).


Single cell data science

Single cell RNA sequencing was first introduced on a dozen cells in 2009 (Tang et al., 2009) and became commercialized in 2015 (Business News, GenomeWeb.com). Since then, single cell techniques have developed rapidly. Multiple tools are now available, spanning transcriptomics (temporal and spatial), genomics, epigenomics and proteomics.

As the cost of Next-Generation Sequencing (NGS) and single cell sequencing drops, we have seen a boom in “omics” data over the years (Kuhn Cuellar et al., 2022). This brings new challenges for bioinformatics. In contrast to the “2D” data handled by conventional bioinformatics analysis, such as sequence alignment, single cell omics data are usually high-dimensional (Lahnemann et al., 2020). Multi-omics data therefore require new approaches to analysis.

Deep learning (DL), part of the broader family of machine learning methods, has great potential in single cell data analysis thanks to its sophisticated artificial neural network architectures. Single cell data often have a limited number of labels and annotations, which can result in model overfitting and poor performance (Ma & Xu, 2022). This resembles the situation in NLP, where most text data are unlabeled. The good news is that in many cases semi-supervised learning (combining a small amount of labeled data with a large amount of unlabeled data) and self-supervised learning (learning representations from unlabeled data, which can then be fine-tuned) can yield insight without requiring extra labels. It can therefore be foreseen that DL, the core technique behind ChatGPT, will play a crucial role in single cell data science in the years ahead.
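To make the semi-supervised idea concrete, here is a minimal self-training (pseudo-labeling) sketch in Python. It is an illustration under stated assumptions, not a method from the cited papers: the two-gene "expression profiles", the cell type names, the `margin` confidence rule and the function names are all hypothetical, and a simple nearest-centroid model stands in for a neural network.

```python
# Toy self-training loop: start from a few labeled cells, then repeatedly
# label the unlabeled cells the current model is confident about.
# Data, names and the confidence rule are hypothetical.

def centroids(points, labels):
    """Mean profile per label."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    return {y: [sum(col) / len(ps) for col in zip(*ps)]
            for y, ps in groups.items()}


def dist2(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def self_train(labeled, labels, unlabeled, margin=2.0, rounds=5):
    """Grow the labeled set with confident pseudo-labels (needs >= 2 classes)."""
    points, ys = list(labeled), list(labels)
    pool = list(unlabeled)
    for _ in range(rounds):
        model = centroids(points, ys)
        remaining = []
        for p in pool:
            ranked = sorted(model, key=lambda y: dist2(p, model[y]))
            best, second = ranked[0], ranked[1]
            # Confident only if the runner-up centroid is much farther away.
            if dist2(p, model[second]) >= margin * dist2(p, model[best]):
                points.append(p)
                ys.append(best)
            else:
                remaining.append(p)
        if len(remaining) == len(pool):
            break  # nothing new was labeled, so stop early
        pool = remaining
    return centroids(points, ys), pool


labeled = [[1.0, 0.0], [0.0, 1.0]]
labels = ["T cell", "B cell"]
unlabeled = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5]]

model, ambiguous = self_train(labeled, labels, unlabeled)
print(model)
print(ambiguous)
```

Only two cells carry labels at the start, yet four more contribute to the final centroids. The profile that sits exactly between the two groups is left unlabeled rather than forced into a class, which is the kind of caution that helps limit error propagation in pseudo-labeling.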

References

Jones, N. (2014). Computer science: The learning machines. Nature, 505(7482), 146-148. https://doi.org/10.1038/505146a

Kuhn Cuellar, L., Friedrich, A., Gabernet, G., de la Garza, L., Fillinger, S., Seyboldt, A., Koch, T., Zur Oven-Krockhaus, S., Wanke, F., Richter, S., Thaiss, W. M., Horger, M., Malek, N., Harter, K., Bitzer, M., & Nahnsen, S. (2022). A data management infrastructure for the integration of imaging and omics data in life sciences. BMC Bioinformatics, 23(1), 61. https://doi.org/10.1186/s12859-022-04584-3

Lahnemann, D., Koster, J., Szczurek, E., McCarthy, D. J., Hicks, S. C., Robinson, M. D., Vallejos, C. A., Campbell, K. R., Beerenwinkel, N., Mahfouz, A., Pinello, L., Skums, P., Stamatakis, A., Attolini, C. S., Aparicio, S., Baaijens, J., Balvert, M., Barbanson, B., Cappuccio, A., . . . Schonhuth, A. (2020). Eleven grand challenges in single-cell data science. Genome Biol, 21(1), 31. https://doi.org/10.1186/s13059-020-1926-6

Ma, Q., & Xu, D. (2022). Deep learning shapes single-cell data analysis. Nat Rev Mol Cell Biol, 23(5), 303-304. https://doi.org/10.1038/s41580-022-00466-x

Mayr, C. H., Simon, L. M., Leuschner, G., Ansari, M., Schniering, J., Geyer, P. E., Angelidis, I., Strunz, M., Singh, P., Kneidinger, N., Reichenberger, F., Silbernagel, E., Bohm, S., Adler, H., Lindner, M., Maurer, B., Hilgendorff, A., Prasse, A., Behr, J., . . . Schiller, H. B. (2021). Integrative analysis of cell state changes in lung fibrosis with peripheral protein biomarkers. EMBO Mol Med, 13(4), e12871. https://doi.org/10.15252/emmm.202012871

Radford, A., & Narasimhan, K. (2018). Improving Language Understanding by Generative Pre-Training.

Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., Wang, X., Bodeau, J., Tuch, B. B., Siddiqui, A., Lao, K., & Surani, M. A. (2009). mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods, 6(5), 377-382. https://doi.org/10.1038/nmeth.1315