The Design and Implementation of TIDeFlow: A Dataflow-Inspired Execution Model for Parallel Loops and Task Pipelining

Orozco, Daniel; Garcia, Elkin; Pavel, Robert; Arteaga, Jaime; Gao, Guang

doi:10.1007/s10766-015-0373-6

Published: 21 July 2015

Volume 44, pages 278–307 (2016)
International Journal of Parallel Programming

Abstract

This paper provides an extended description of the design and implementation of the Time Iterated Dependency Flow (TIDeFlow) execution model. TIDeFlow is a dataflow-inspired model that simplifies the scheduling of shared resources on many-core processors. To accomplish this, programs are specified as directed graphs and the dataflow model is extended through the introduction of intrinsic constructs for parallel loops and the arbitrary pipelining of operations. The main contributions of this paper are: (1) a formal description of the TIDeFlow execution model and its programming model, (2) a description of the TIDeFlow implementation and its strengths over previous execution models, such as the ability to natively express parallel loops and task pipelining, (3) an analysis of experimental results showing the advantages of TIDeFlow with respect to expressing parallel programs on many-core architectures and (4) a presentation of the implementation of a low overhead runtime system for TIDeFlow.

Acknowledgments

This research was made possible by the generous support of the NSF through Grants CCF-0833122, CCF-0925863, CCF-0937907, CNS-0720531, and OCI-0904534.

Author information

Authors and Affiliations

University of Delaware, Newark, DE, USA
Daniel Orozco, Elkin Garcia, Robert Pavel, Jaime Arteaga & Guang Gao

Corresponding author

Correspondence to Daniel Orozco.

Additional information

This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

About this article

Cite this article

Orozco, D., Garcia, E., Pavel, R. et al. The Design and Implementation of TIDeFlow: A Dataflow-Inspired Execution Model for Parallel Loops and Task Pipelining. Int J Parallel Prog 44, 278–307 (2016). https://doi.org/10.1007/s10766-015-0373-6

Received: 31 January 2013
Accepted: 08 July 2015
Published: 21 July 2015
Issue Date: April 2016
DOI: https://doi.org/10.1007/s10766-015-0373-6