Ready for your all-in-one single cell sequencing solution?

Standard differential gene expression analysis. What are we missing?

Standard differential gene expression analysis. What are we missing?

One of the principal challenges in gene expression analysis is differentiating between expression patterns in two or more conditions. Gene expression analysis typically proceeds from count matrix normalization to testing for differences in gene expression. In single cell analysis, we have access to additional dimensions (cells grouped together by type, state, etc.) giving us additional grouping types that allow for a more refined and precise analysis. However, with respect to the primary question of ‘differential’ expression, we can ask: are all differences in expression generic, or the same? What level of resolution is a ‘difference’ in gene expression, and what precisely do we mean by ‘difference’? We might first think that mean/variance is perhaps a good way to distinguish distributions, after all these are the parameters of the normal distribution, however there are some issues as evidenced by the following graphic:

We can see here that the same mean and variance can be shared by distributions with very different shapes.

The distribution on the right is a mixture of gaussians, whereas the one on the left is a single gaussian. The distribution on the right is an important example of what we might see in single cell analysis: the populations of perhaps different cell types, states, or cells of different condition grouped together, a mixture distribution. The examples above are normal distributions, which we do not use to model single cell data, however you can do this same type of demonstration with other families of distributions. If we do not want to make any type of distributional assumption at all, we might then look towards nonparametric tests. The Wilcoxon Rank Sum test is one such nonparametric test often used in single cell analysis.

A popular package for single cell analysis is Seurat [1], where the default method provided for differential gene expression analysis is the Wilcoxon Rank Sum Test [2].

We can think of the Wilcoxon rank sum test as more or less a median test: a test for whether the median of two distributions is the same (in other words where the bulk of the population is located and whether or not that differs between two distributions). But there are problems with this approach, as evidenced by the following two plots:

The plots show hypothetical single cell transcript counts for a gene of interest. We see that the plot on the right has 4 times the counts for 3, corresponding to cells that have 3 transcripts of a gene, vs. the plot on the left where far fewer cells have expressed 3 transcripts. The problem is that from the perspective of the Wilcoxon Rand Sum test, these two distributions do not *significantly* differ, as the p-value is below the typical .05 significance level.

The structure of single cell data is the culprit here, and this should caution us as to what we may be missing in a typical analysis: the large number of zero counts comprising the bulk of the distribution can mask significant differences. In distributions where zero counts dominate, as is typical in single cell data, we can easily miss significant departures between distributions that are not located near the bulk, which in many cases is zero (the median in both distributions here is 0).

To remedy the situation, we can look in detail at the shape of the distributions.

Shape, and particularly tail shape can be significant in single cell populations. This might correspond to a small subset of cells that is responsible for a disease process. When we only look at where the bulk of a population is located, we are more or less looking at the first moment of the population. Let’s look at the first 5 moments:

We can see that there is a lot more to a distribution than just the 1st moment (i.e. mean/median). So how do we capture this additional information? There are various standard ways to calculate the higher moments, however we advocate a particular approach using L-moments [3]. L-moments are linear combinations of order statistics, with certain properties that make them a good fit for characterizing distributions [4]. Libraries for calculating these are available in both python and R [5, 6]. In the setting of patient or disease status classification, we can use the calculated L-moments as features, allowing us to pick up details in distributions for improved classification and biological understanding.

The expanded feature space allows us to pick up finer detail in the differences between distributions for patients in different disease classes, improving our predictive capabilities while simultaneous highlighting relevant biology (i.e. when we utilize SHAP values for explanations of predictions).

At Singleron, we are single cell experts focused on obtaining the highest quality single cell data and getting the most information out of it. Get in touch if you think your research would benefit from working with our team’s experience (info@singleronbio.com).

References

[1] Hao Y, Hao S, Andersen-Nissen E, III WMM, Zheng S, Butler A, Lee MJ, Wilk AJ, Darby C, Zagar M, Hoffman P, Stoeckius M, Papalexi E, Mimitou EP, Jain J, Srivastava A, Stuart T, Fleming LB, Yeung B, Rogers AJ, McElrath JM, Blish CA, Gottardo R, Smibert P, Satija R, “Integrated analysis of multimodal single-cell data.” Cell (2021)

[2]Frank Wilcoxon, “Individual comparisons by ranking methods”, Biometrics Bulletin. 1 (6): 80–83 (1945)

[3] J.R.M. Hosking, “L-Moments: Analysis and Estimation of Distributions Using Linear Combinations of Order Statistics”, Journal of the Royal Statistical Society, Series B, 52 1:105-124 (1990)

[4] J.R.M. Hosking, “On the characterization of distributions by their moments”, Journal of Statistical Planning and Inference 136(1)193-198 (2006)

[5] lmoments3, https://pypi.org/project/lmoments3/

[6] L moments: L-Moments and Quantile Mixtures, https://cran.r-project.org/web/packages/Lmoments/index.html