Abstract
Randomized algorithms provide solutions to two ubiquitous problems: (1) the distributed calculation of a principal component analysis or singular value decomposition of a highly rectangular matrix, and (2) the distributed calculation of a low-rank approximation (in the form of a singular value decomposition) to an arbitrary matrix. Carefully honed algorithms yield results that are uniformly superior to those of the stock, deterministic implementations in Spark (the popular platform for distributed computation); in particular, whereas the stock software will without warning return left singular vectors that are far from numerically orthonormal, a significantly burnished randomized implementation generates left singular vectors that are numerically orthonormal to nearly the machine precision.
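For concreteness, "numerically orthonormal" above can be quantified as the largest entry of |UᵀU − I|, where U is the matrix of left singular vectors. The following plain-Scala sketch (no Spark dependency; the object and method names here are illustrative, not taken from the paper's software) computes that deviation for a matrix stored as an array of rows:

```scala
// Illustrative check of how far the columns of a matrix u are from
// orthonormal, measured as the largest entry of |u^T u - I|.
object OrthoCheck {
  def deviation(u: Array[Array[Double]]): Double = {
    val n = u(0).length                  // number of columns
    var worst = 0.0
    for (p <- 0 until n; q <- 0 until n) {
      // Inner product of columns p and q.
      var dot = 0.0
      for (row <- u) dot += row(p) * row(q)
      // Compare against the corresponding entry of the identity.
      val target = if (p == q) 1.0 else 0.0
      worst = math.max(worst, math.abs(dot - target))
    }
    worst
  }
}
```

For a matrix with exactly orthonormal columns the deviation is 0; values near the machine precision (about 2.2 × 10⁻¹⁶ in double precision) indicate columns that are numerically orthonormal in the sense the abstract describes.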
Acknowledgements
We would like to thank the anonymous editor and referees for shaping the presentation.
Additional information
Communicated by: Gunnar J Martinsson
Y. Kluger and H. Li were supported in part by United States National Institutes of Health grant 1R01HG008383-01A1. Y. Kluger is with the Program in Applied Mathematics, the Program in Biological and Biomedical Sciences, the Cancer Center, the Center for Medical Informatics, and the Department of Pathology in the School of Medicine at Yale University.
Appendices
Appendix A: Restricting to ten times fewer executors
Tables 11, 12, 13, 14, 15, 16, 17 and 18 display results analogous to those in Tables 3–5, 6–8, 9 and 10, but with the number of executors, spark.dynamicAllocation.maxExecutors, set to 18 (rather than 180). The results are broadly comparable to those presented earlier. This indicates how the timings scale with the number of machines. Of course, other processing in Spark (not necessarily related to principal component analysis or singular value decomposition) can benefit from having the data stored over more executors, and moving data around the cluster can dominate the overall timings in real-world usage (see also Remark 2 in the introduction of the present paper).
Appendix B: Another example with ten times fewer executors
Similar to Appendix A, the present appendix presents Tables 19, 20, 21, 22, 23, 24, 25 and 26, reporting results analogous to those in Tables 3–5, 6–8, 9 and 10, with the number of executors, spark.dynamicAllocation.maxExecutors, again set to 18 (rather than 180), as in Appendix A. Following an anonymous reviewer's suggestion, the present appendix uses for the diagonal entries Σj,j of Σ in (2) singular values drawn from a fractal "Devil's staircase," with many repeated singular values of varying multiplicities; Fig. 1 plots the singular values for Tables 19–21. Specifically, the singular values arise from the following Scala code:
(0 until k).toArray.map(j =>
  Integer.parseInt(Integer.toOctalString(
    Math.round(j * Math.pow(8, 6).toFloat / k)
  ).replaceAll("[1-7]", "1"), 2)
  / Math.pow(2, 6) / (1 - Math.pow(2, -6))
).sorted.reverse
Here, k = n for Tables 19–21 and k = l for Tables 22–26. Thus, the singular values arise from replacing the octal digits 1–7 with the binary digit 1 (keeping the octal digit 0 as the binary digit 0) for rounded representations of the real numbers between 0 and 1, then rescaling so that the final singular values range from 0 to 1, inclusive.
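The construction above can be packaged as a small self-contained function for inspection; the object and method names below are ours, not from the paper's code base, but the body reproduces the computation just described:

```scala
// Sketch of the "Devil's staircase" singular-value generator described above.
object StaircaseCheck {
  def singularValues(k: Int): Array[Double] =
    (0 until k).toArray.map { j =>
      // Round j/k to the nearest multiple of 8^-6, held in fixed point.
      val x = Math.round(j * Math.pow(8, 6).toFloat / k)
      // Collapse every nonzero octal digit to the binary digit 1.
      val bits = Integer.toOctalString(x).replaceAll("[1-7]", "1")
      // Reinterpret the digits in base 2 and rescale into [0, 1].
      Integer.parseInt(bits, 2) / Math.pow(2, 6) / (1 - Math.pow(2, -6))
    }.sorted.reverse
}
```

Running, say, StaircaseCheck.singularValues(64) confirms the properties the text claims: all values lie in [0, 1], the smallest is exactly 0, and many values repeat (for instance, j = 1 through 7 all collapse to the same singular value, since each single nonzero octal digit becomes the same binary digit 1).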
Again, the results are broadly comparable to those presented earlier; in some cases some of the algorithms attain better accuracy on the examples of the present appendix, but otherwise the numbers in the tables are similar.
Appendix C: Timings for generating the test matrices
For comparative purposes, Tables 27, 28 and 29 list the times required to generate (2) with (3) or (5) using the settings in Table 2.
Cite this article
Li, H., Kluger, Y. & Tygert, M. Randomized algorithms for distributed computation of principal component analysis and singular value decomposition. Adv Comput Math 44, 1651–1672 (2018). https://doi.org/10.1007/s10444-018-9600-1