Automated Analysis of Distributed Tracing: Challenges and Research Directions

Bento, Andre; Correia, Jaime; Filipe, Ricardo; Araujo, Filipe; Cardoso, Jorge

doi:10.1007/s10723-021-09551-5

Published: 25 February 2021

Volume 19, article number 9 (2021)

Journal of Grid Computing

Andre Bento¹ (ORCID: orcid.org/0000-0002-5388-0342),
Jaime Correia¹,
Ricardo Filipe¹,
Filipe Araujo¹ &
Jorge Cardoso¹,²

Abstract

Microservice-based architectures are gaining popularity for their benefits in software development. Distributed tracing can help operators maintain observability in this highly distributed context, find problems such as high latency, and analyse their context and root cause. However, exploring and working with distributed tracing data is sometimes difficult because of its complexity and application specificity, the volume of information, and the lack of tools. The most common general-purpose tools for this kind of data focus on trace-level, human-readable visualisation. These tools do not provide good ways to abstract, navigate, filter and analyse tracing data, nor do they automate or aid trace analysis, leaving that work to administrators. In this paper we propose using tracing data to extract service metrics, dependency graphs and work-flows, with the objective of detecting anomalous services and operation patterns. We implemented and published open-source prototype tools that process tracing data conforming to the OpenTracing standard, and developed anomaly detection methods. We validated our tools and methods against real data provided by a major cloud provider. The results show that there is an underused wealth of actionable information that can be extracted from both the metric and the morphological aspects of tracing data. In particular, our tools were able to detect anomalous behaviour and situate it in terms of the services involved, the work-flows and the time-frame. Furthermore, we identified some limitations of the OpenTracing format, as well as of the industry-accepted tracing abstractions, and we provide suggestions to test trace quality and enhance the standard.
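
As a rough illustration of what extracting service metrics and a dependency graph from tracing data can look like, the following minimal Python sketch derives caller-to-callee edges and mean span durations from a handful of spans. It is not the authors' OpenTracing Processor: the span records, the field names (`trace_id`, `span_id`, `parent_id`, `service`, `duration_ms`) and both helper functions are hypothetical simplifications chosen for this example, loosely inspired by the parent/child span model of OpenTracing.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical, simplified spans. The field names are illustrative
# assumptions, not the OpenTracing wire format or the authors' data.
SPANS = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "gateway", "duration_ms": 120},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "orders", "duration_ms": 80},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "db", "duration_ms": 30},
    {"trace_id": "t2", "span_id": "d", "parent_id": None, "service": "gateway", "duration_ms": 100},
    {"trace_id": "t2", "span_id": "e", "parent_id": "d", "service": "orders", "duration_ms": 60},
]


def dependency_edges(spans):
    """Count caller -> callee service pairs using parent/child span links."""
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    edges = defaultdict(int)
    for span in spans:
        # Root spans have no parent, so the lookup returns None for them.
        parent = by_id.get((span["trace_id"], span["parent_id"]))
        if parent is not None and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return dict(edges)


def mean_duration_by_service(spans):
    """Average span duration per service, a simple service-level metric."""
    durations = defaultdict(list)
    for span in spans:
        durations[span["service"]].append(span["duration_ms"])
    return {service: mean(values) for service, values in durations.items()}


if __name__ == "__main__":
    print(dependency_edges(SPANS))
    print(mean_duration_by_service(SPANS))
```

Tracking how such edge counts and per-service durations change over time is one plausible starting point for the kind of anomaly detection the abstract describes, although the paper's actual methods may differ.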

Data Availability

The data that support the findings of this study are available from Huawei Research, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission of Huawei Research. All our source code is available on GitHub (see Notes).

Notes

OpenTracing Processor (OTP), https://github.com/andrepbento/OpenTracingProcessor

Acknowledgements

This work was produced with the support of INCD funded by FCT and FEDER, under the project 01/SAICT/2016 n 022153 and was partially carried out under the project P2020 - 31/SI/2017: AESOP — Autonomic Service Operation, supported by Portugal 2020 and UE-FEDER. This work was also partially supported by national funds through the FCT - Foundation for Science and Technology, I.P., within the scope of the project CISUC - UID/CEC/00326/2020 and by the European Social Fund, through the Regional Operational Program Centro 2020.

Author information

Authors and Affiliations

CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
Andre Bento, Jaime Correia, Ricardo Filipe, Filipe Araujo & Jorge Cardoso
Huawei Munich Research Center, Munich, Germany
Jorge Cardoso

Corresponding author

Correspondence to Andre Bento.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bento, A., Correia, J., Filipe, R. et al. Automated Analysis of Distributed Tracing: Challenges and Research Directions. J Grid Computing 19, 9 (2021). https://doi.org/10.1007/s10723-021-09551-5

Received: 10 July 2020
Accepted: 06 February 2021
Published: 25 February 2021
DOI: https://doi.org/10.1007/s10723-021-09551-5
