
Randomized algorithms for distributed computation of principal component analysis and singular value decomposition


Abstract

Randomized algorithms provide solutions to two ubiquitous problems: (1) the distributed calculation of a principal component analysis or singular value decomposition of a highly rectangular matrix, and (2) the distributed calculation of a low-rank approximation (in the form of a singular value decomposition) to an arbitrary matrix. Carefully honed algorithms yield results that are uniformly superior to those of the stock, deterministic implementations in Spark (the popular platform for distributed computation); in particular, whereas the stock software will without warning return left singular vectors that are far from numerically orthonormal, a significantly burnished randomized implementation generates left singular vectors that are numerically orthonormal to nearly the machine precision.
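
As a point of reference for the orthonormality claim above, the deviation of computed left singular vectors from orthonormality can be measured directly in Spark. The following is a minimal sketch under stated assumptions (the RDD rows holding the rows of the matrix and the rank k are supplied by the caller, and the helper name is hypothetical); it forms the k x k Gramian of the left singular vectors returned by the stock MLlib SVD and reports the largest entrywise deviation from the identity:

  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.rdd.RDD

  // Minimal sketch (hypothetical helper): measure how far the left singular
  // vectors returned by Spark's stock SVD are from numerical orthonormality.
  def orthonormalityError(rows: RDD[Vector], k: Int): Double = {
    val a = new RowMatrix(rows)
    val svd = a.computeSVD(k, computeU = true)  // stock deterministic SVD in MLlib
    val gram = svd.U.computeGramianMatrix()     // k x k matrix U^T U on the driver
    // Largest entrywise deviation of U^T U from the k x k identity matrix.
    (for (i <- 0 until k; j <- 0 until k)
      yield math.abs(gram(i, j) - (if (i == j) 1.0 else 0.0))).max
  }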



Acknowledgements

We would like to thank the anonymous editor and referees for shaping the presentation.

Author information

Corresponding author

Correspondence to Mark Tygert.

Additional information

Communicated by: Gunnar J. Martinsson

Y. Kluger and H. Li were supported in part by United States National Institutes of Health grant 1R01HG008383-01A1. Y. Kluger is with the Program in Applied Mathematics, the Program in Biological and Biomedical Sciences, the Cancer Center, the Center for Medical Informatics, and the Department of Pathology in the School of Medicine at Yale University.

Appendices

Appendix A: Restricting to ten times fewer executors

Tables 11, 12, 13, 14, 15, 16, 17, and 18 display results analogous to those in Tables 3–5, 6–8, 9, and 10, but with the number of executors, spark.dynamicAllocation.maxExecutors, set to 18 (rather than 180). The results are broadly comparable to those presented earlier. This indicates how the timings scale with the number of machines. Of course, other processing in Spark (not necessarily related to principal component analysis or singular value decomposition) can benefit from having the data stored over more executors, and moving data around the cluster can dominate the overall timings in real-world usage (see also Remark 2 in the introduction of the present paper).
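
For concreteness, the executor cap referred to above can be imposed when the Spark session is created; the following is a minimal sketch (the application name is a placeholder, and the surrounding cluster is assumed to support dynamic allocation):

  import org.apache.spark.sql.SparkSession

  // Minimal sketch: cap dynamic allocation at 18 executors, as in this appendix
  // (the application name is a placeholder; other settings keep their defaults).
  val spark = SparkSession.builder()
    .appName("pca-svd-benchmark")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "18")
    .config("spark.shuffle.service.enabled", "true")  // needed for dynamic allocation
    .getOrCreate()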

Table 11 m = 1,000,000; n = 2,000; restricted to ten times fewer executors
Table 12 m = 100,000; n = 2,000; restricted to ten times fewer executors
Table 13 m = 10,000; n = 2,000; restricted to ten times fewer executors
Table 14 m = 1,000,000; n = 2,000; l = 20; i = 2; restricted to ten times fewer executors
Table 15 m = 100,000; n = 2,000; l = 20; i = 2; restricted to ten times fewer executors
Table 16 m = 10,000; n = 2,000; l = 20; i = 2; restricted to ten times fewer executors
Table 17 Timings for l = 10; i = 2; restricted to ten times fewer executors
Table 18 Errors for l = 10; i = 2; restricted to ten times fewer executors

Appendix B: Another example with ten times fewer executors

Similar to Appendix A, the present appendix presents Tables 19, 20, 21, 22, 23, 24, 25, and 26, reporting results analogous to those in Tables 3–5, 6–8, 9, and 10, with the same setting as in Appendix A of the number of executors, spark.dynamicAllocation.maxExecutors, being 18 (rather than 180). The present appendix follows an anonymous reviewer's suggestion, using for the diagonal entries of Σ in (2) singular values Σj,j from a fractal “Devil's staircase” with many repeated singular values of varying multiplicities; Fig. 1 plots the singular values for Tables 19–21. Specifically, the singular values arise from the following Scala code:

(0 until k).toArray.map(j =>
  Integer.parseInt(Integer.toOctalString(
    Math.round(j * Math.pow(8, 6).toFloat / k)
  ).replaceAll("[1-7]", "1"), 2)
  / Math.pow(2, 6) / (1 - Math.pow(2, -6))
).sorted.reverse

Table 19 m = 1,000,000; n = 2,000; restricted to ten times fewer executors; Appendix B defines the singular values of the matrix being processed
Table 20 m = 100,000; n = 2,000; restricted to ten times fewer executors; Appendix B defines the singular values of the matrix being processed
Table 21 m = 10,000; n = 2,000; restricted to ten times fewer executors; Appendix B defines the singular values of the matrix being processed
Table 22 m = 1,000,000; n = 2,000; l = 20; i = 2; restricted to ten times fewer executors; Appendix B defines the singular values of the matrix being processed
Table 23 m = 100,000; n = 2,000; l = 20; i = 2; restricted to ten times fewer executors; Appendix B defines the singular values of the matrix being processed
Table 24 m = 10,000; n = 2,000; l = 20; i = 2; restricted to ten times fewer executors; Appendix B defines the singular values of the matrix being processed
Table 25 Timings for l = 10; i = 2; restricted to ten times fewer executors; Appendix B defines the singular values of the matrix being processed
Table 26 Errors for l = 10; i = 2; restricted to ten times fewer executors; Appendix B defines the singular values of the matrix being processed
Fig. 1 Singular values Σ1,1, Σ2,2, …, Σ2000,2000, which are the diagonal entries of Σ in (2), when k = n = 2,000, for Tables 19–21 in Appendix B

Here, k = n for Tables 19–21 and k = l for Tables 22–26. Thus, the singular values arise from replacing the octal digits 1–7 with the binary digit 1 (keeping the octal digit 0 as the binary digit 0) for rounded representations of the real numbers between 0 and 1, then rescaling so that the final singular values range from 0 to 1, inclusive.
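
To make the digit replacement concrete, here is a small spot check (not from the paper) evaluating the same formula at the two endpoints j = 0 and j = k − 1 with k = 2,000; it confirms that the smallest and largest singular values are exactly 0 and 1:

  val k = 2000
  // Evaluate the formula above only at the endpoints j = 0 and j = k - 1.
  val endpoints = Array(0, k - 1).map(j =>
    Integer.parseInt(Integer.toOctalString(
      Math.round(j * Math.pow(8, 6).toFloat / k)
    ).replaceAll("[1-7]", "1"), 2)
    / Math.pow(2, 6) / (1 - Math.pow(2, -6))
  )
  // endpoints(0) == 0.0, since j = 0 rounds to 0, whose octal string is "0".
  // endpoints(1) == 1.0, since j = 1999 rounds to 262013 = octal 777575, which
  // becomes the bit string "111111" = 63, and 63 / 2^6 / (1 - 2^-6) = 1 exactly.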

Again, the results are broadly comparable to those presented earlier; in some cases some of the algorithms attain better accuracy on the examples of the present appendix, but otherwise the numbers in the tables are similar.

Appendix C: Timings for generating the test matrices

For comparative purposes, Tables 27, 28, and 29 list the times required to generate (2) with (3) or (5) using the settings in Table 2.

Table 27 Timings for generating (2) with (3)
Table 28 Timings for generating (2) with (5) and l = 20
Table 29 Timings for generating (2) with (5) and l = 10

About this article

Cite this article

Li, H., Kluger, Y. & Tygert, M. Randomized algorithms for distributed computation of principal component analysis and singular value decomposition. Adv Comput Math 44, 1651–1672 (2018). https://doi.org/10.1007/s10444-018-9600-1

