Abstract
Seismic processing applications are used to identify geological structures where reservoirs of oil and gas may be found. With oil companies seeking better precision over larger geographical regions, these applications require larger clusters to keep execution times reasonable. The combination of longer run times and clusters with greater numbers of components increases the probability of faults during the execution. To address this issue, this paper describes an application-level fault tolerance mechanism that considers node crashes and communication link failures. For this industrial application, experiments show that continued execution with the remaining resources is both feasible and efficient.
Supported by PRONEX E-26/110.552/2010, CNPq, FAPERJ, PETROBRAS.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gonçalves, A., Bersot, M., Bulcão, A., Boeres, C., Drummond, L., Rebello, V. (2011). Fault Tolerance in an Industrial Seismic Processing Application for Multicore Clusters. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2011. Lecture Notes in Computer Science, vol 6960. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24449-0_30
DOI: https://doi.org/10.1007/978-3-642-24449-0_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24448-3
Online ISBN: 978-3-642-24449-0
eBook Packages: Computer Science (R0)