Abstract
The data center network is the core infrastructure of high-performance computing. Because network failures occur frequently, data center networks demand high-performance, robust, and energy-efficient failure recovery mechanisms. Despite progress, existing work still leaves considerable room for improvement in meeting these requirements. Backup-based failure recovery schemes reserve backup paths in advance, which results in high energy consumption under normal network conditions. To address this energy consumption problem, adaptive failure recovery schemes have been proposed that calculate a rerouting path for the traffic on the failed link only when a failure occurs, which reduces energy consumption. However, most adaptive failure recovery schemes rely on multi-path routing to calculate the rerouting path. Because multi-path routing cannot detect path congestion under the asymmetric topology caused by link failures, the network becomes congested and therefore less robust. In view of this, we design and evaluate AFRM, a novel adaptive failure recovery mechanism that overcomes these challenges. AFRM uses congestion-aware asymmetric routing to calculate the rerouting path and is more robust to topological asymmetries than existing schemes. The asymmetric routing dynamically schedules flows to the path with the least marginal cost, which makes AFRM considerably more energy-efficient. Additionally, AFRM achieves fast link failure detection based on hash storage and flow table matching. Evaluations show that, compared with existing schemes, AFRM balances the trade-off between failure recovery time and energy consumption, reduces flow completion time, and increases network throughput.
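To make the idea of scheduling a flow onto the path with the least marginal cost concrete, a minimal sketch follows. It is not AFRM's actual algorithm: the quadratic `link_cost` function, the data layout, and the names `marginal_cost` and `pick_path` are all assumptions introduced here for illustration, since the abstract does not specify the cost model.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]  # (from_switch, to_switch); hypothetical representation


def link_cost(load: float, capacity: float) -> float:
    """Assumed convex cost: the square of link utilization.

    This stands in for whatever congestion/energy cost the paper defines.
    """
    utilization = load / capacity
    return utilization * utilization


def marginal_cost(path: List[Link], loads: Dict[Link, float],
                  capacity: Dict[Link, float], demand: float) -> float:
    """Extra cost incurred by adding `demand` to every link on `path`."""
    total = 0.0
    for link in path:
        current = loads.get(link, 0.0)
        total += (link_cost(current + demand, capacity[link])
                  - link_cost(current, capacity[link]))
    return total


def pick_path(paths: List[List[Link]], loads: Dict[Link, float],
              capacity: Dict[Link, float], demand: float) -> List[Link]:
    """Return the candidate path with the smallest marginal cost."""
    return min(paths, key=lambda p: marginal_cost(p, loads, capacity, demand))


# Illustrative use: two surviving paths between s1 and s2 after a link failure.
path_a = [("s1", "a"), ("a", "s2")]
path_b = [("s1", "b"), ("b", "s2")]
capacity = {link: 10.0 for link in path_a + path_b}
loads = {("s1", "a"): 6.0, ("a", "s2"): 2.0,
         ("s1", "b"): 3.0, ("b", "s2"): 3.0}

chosen = pick_path([path_a, path_b], loads, capacity, demand=2.0)
print(chosen)
```

With a convex cost such as this one, a heavily loaded link contributes a larger increment than a lightly loaded one, so the selection steers rerouted traffic away from links that are already congested. Here both paths carry the same total load, yet the sketch prefers the evenly loaded one, which is the kind of congestion awareness that plain hash-based multi-path routing lacks.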
Acknowledgements
This work was supported in part by the National Key R&D Program of China under Grant 2018YFE0202800, the National Natural Science Foundation of China under Grants 61634004 and 61934002, the Natural Science Foundation of Shaanxi Province for Distinguished Young Scholars under Grant No. 2020JC-26, the Fundamental Research Funds for the Central Universities under Grant No. JB190105, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2019A01, and the China Postdoctoral Science Foundation under Grant No. 2018M633465.
Cite this article
Liu, Y., Gu, H., Wang, K. et al. An adaptive failure recovery mechanism based on asymmetric routing for data center networks. J Supercomput 77, 2103–2123 (2021). https://doi.org/10.1007/s11227-020-03337-4
DOI: https://doi.org/10.1007/s11227-020-03337-4