Distortion-free PCA on sample space for highly variable gene detection from single-cell RNA-seq data

  • Research Article
Frontiers of Computer Science

Abstract

Single-cell RNA-seq (scRNA-seq) measures gene expression in individual cells, which enables the detection of highly variable genes (HVGs) that contribute to cell-to-cell variation within a homogeneous cell population. HVG detection is a necessary step before clustering analysis because it improves the clustering result. scRNA-seq data include some genes that are expressed with a certain probability in all cells, which makes the cells indistinguishable; these genes are referred to as background noise. To remove the background noise and select informative genes for clustering analysis, we propose an effective HVG detection method based on principal component analysis (PCA). The proposed method uses PCA to evaluate the genes (features) on the sample space. The distortion-free principal components are selected, and the distance from the origin to each gene, computed on these components, is used as the weight of that gene. The genes with the greatest distances from the origin are selected for clustering analysis. Experimental results on both synthetic and gene expression datasets show that the proposed method not only removes the background noise and selects informative genes for clustering analysis, but also outperforms existing HVG detection methods.
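
The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the distortion-free principal components are chosen, so the sketch simply keeps a fixed number of leading components (the hypothetical parameter `n_components`). Column centering, the function names, and the toy data generator are likewise assumptions made for illustration.

```python
import numpy as np


def gene_weights(X, n_components=2):
    """Weight each gene by its distance from the origin in PCA score space.

    X is a genes-by-cells expression matrix, so each gene is a point in the
    sample (cell) space. n_components is a hypothetical stand-in for the
    distortion-free component selection, which the abstract does not specify.
    """
    X = np.asarray(X, dtype=float)
    # Standard PCA centering (an assumption here): subtract each cell's mean
    # over genes, so the origin corresponds to the average gene profile.
    Xc = X - X.mean(axis=0, keepdims=True)
    # Thin SVD: U * S holds the gene coordinates on the principal components.
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    # Euclidean distance of each gene from the origin on the kept components.
    return np.linalg.norm(scores, axis=1)


def select_genes(X, n_genes, n_components=2):
    """Return the indices of the n_genes genes farthest from the origin."""
    weights = gene_weights(X, n_components)
    return np.argsort(weights)[::-1][:n_genes]


def make_toy_data(seed=0, n_background=180, n_cells_per_group=30):
    """Hypothetical toy data: 20 group-specific genes, then background genes."""
    rng = np.random.default_rng(seed)
    n_cells = 2 * n_cells_per_group
    informative = np.zeros((20, n_cells))
    informative[:10, :n_cells_per_group] = 5.0  # high in cell group A only
    informative[10:, n_cells_per_group:] = 5.0  # high in cell group B only
    # Background genes: switched on with the same probability in every cell.
    background = (rng.random((n_background, n_cells)) < 0.3).astype(float)
    return np.vstack([informative, background])


if __name__ == "__main__":
    X = make_toy_data()
    print(sorted(select_genes(X, n_genes=20).tolist()))
```

In this toy setting the group-specific genes lie far from the origin on the leading components, while the background genes, which behave the same way in every cell, stay close to it. Real scRNA-seq data would need normalization first and a principled choice of components.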

Acknowledgements

This study was supported in part by the New Energy and Industrial Technology Development Organization (AJD30064) and JST COI-NEXT.

Author information

Corresponding authors

Correspondence to Xiucai Ye or Tetsuya Sakurai.

Additional information

Momo Matsuda is currently pursuing her PhD degree. She is now at the Department of Computer Science, University of Tsukuba, Japan. Her current research interests include dimensionality reduction, machine learning and clustering.

Yasunori Futamura received his PhD degree in Engineering from the University of Tsukuba, Japan in 2014. From April 2014 to September 2014, he was a postdoctoral fellow at the Faculty of Engineering, Information and Systems, University of Tsukuba. Since October 2014, he has been an assistant professor at the Faculty of Engineering, Information and Systems, University of Tsukuba, Japan. He is a member of the Japan Society for Industrial and Applied Mathematics, the Information Processing Society of Japan, and the Society for Industrial and Applied Mathematics.

Xiucai Ye received the PhD degree in computer science from the University of Tsukuba, Japan in 2014. She is currently an assistant professor with the Department of Computer Science, and Center for Artificial Intelligence Research (C-AIR), University of Tsukuba, Japan. Her current research interests include feature selection, clustering, bioinformatics, machine learning and its application fields.

Tetsuya Sakurai is a professor in the Department of Computer Science and the director of the Center for Artificial Intelligence Research (C-AIR) at the University of Tsukuba, Japan. He is also a visiting professor at the Open University of Japan and a visiting researcher at the Advanced Institute of Computational Science at RIKEN. He received a PhD in computer engineering from Nagoya University, Japan in 1992. His research interests include high performance algorithms for large-scale simulations, data and image analysis, and deep neural network computations. He is a member of the Japan Society for Industrial and Applied Mathematics (JSIAM), the Mathematical Society of Japan (MSJ), the Information Processing Society of Japan (IPSJ), and the Society for Industrial and Applied Mathematics (SIAM).

Cite this article

Matsuda, M., Futamura, Y., Ye, X. et al. Distortion-free PCA on sample space for highly variable gene detection from single-cell RNA-seq data. Front. Comput. Sci. 17, 171310 (2023). https://doi.org/10.1007/s11704-022-1172-z

  • DOI: https://doi.org/10.1007/s11704-022-1172-z
