Automated Analysis of Distributed Tracing: Challenges and Research Directions

Journal of Grid Computing

Abstract

Microservice-based architectures are gaining popularity for their benefits in software development. Distributed tracing can help operators maintain observability in this highly distributed context, find problems such as high latency, and analyse their context and root cause. However, exploring and working with distributed tracing data is sometimes difficult because of its complexity, application specificity, volume, and the lack of tools. The most common general-purpose tools for this kind of data focus on trace-level, human-readable data visualisation. Unfortunately, these tools do not provide good ways to abstract, navigate, filter and analyse tracing data. Nor do they automate or aid trace analysis; they rely on administrators to do it themselves. In this paper we propose using tracing data to extract service metrics, dependency graphs and workflows, with the objective of detecting anomalous services and operation patterns. We implemented and published open-source prototype tools to process tracing data conforming to the OpenTracing standard, and developed anomaly detection methods. We validated our tools and methods against real data provided by a major cloud provider. Results show that there is an underused wealth of actionable information that can be extracted from both the metric and the morphological aspects derived from tracing. In particular, our tools were able to detect anomalous behaviour and situate it in terms of involved services, workflows and time frame. Furthermore, we identified some limitations of the OpenTracing format, as well as of the industry-accepted tracing abstractions, and provide suggestions to test trace quality and enhance the standard.
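To make the idea of extracting service metrics and dependency graphs from tracing data concrete, the following is a minimal sketch in Python. It is not the authors' prototype: the span records, their field names (`trace_id`, `span_id`, `parent_id`, `service`, `duration_ms`) and the two helper functions are hypothetical simplifications of an OpenTracing-style span, in which a child span references its parent.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical, simplified spans: real OpenTracing spans carry more fields
# (operation name, start/finish timestamps, tags, logs, typed references).
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "gateway", "duration_ms": 120},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "orders", "duration_ms": 80},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "db", "duration_ms": 30},
    {"trace_id": "t2", "span_id": "d", "parent_id": None, "service": "gateway", "duration_ms": 100},
    {"trace_id": "t2", "span_id": "e", "parent_id": "d", "service": "orders", "duration_ms": 60},
]


def dependency_graph(spans):
    """Count caller -> callee edges between distinct services."""
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    edges = defaultdict(int)
    for s in spans:
        # Root spans have no parent, so the lookup returns None for them.
        parent = by_id.get((s["trace_id"], s["parent_id"]))
        if parent is not None and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return dict(edges)


def mean_duration_by_service(spans):
    """Average span duration per service, in milliseconds."""
    durations = defaultdict(list)
    for s in spans:
        durations[s["service"]].append(s["duration_ms"])
    return {service: mean(values) for service, values in durations.items()}


graph = dependency_graph(spans)
metrics = mean_duration_by_service(spans)
print(graph)
print(metrics)
```

Per-service metrics and call-count edges of this kind, collected over successive time windows, are the sort of derived series to which an anomaly detection method could then be applied.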

Data Availability

The data that support the findings of this study are available from Huawei Research, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Huawei Research. All our source code was made available on GitHub (see Note 1).

Notes

  1. OpenTracing Processor (OTP), https://github.com/andrepbento/OpenTracingProcessor

Acknowledgements

This work was produced with the support of INCD funded by FCT and FEDER, under the project 01/SAICT/2016 n 022153 and was partially carried out under the project P2020 - 31/SI/2017: AESOP — Autonomic Service Operation, supported by Portugal 2020 and UE-FEDER. This work was also partially supported by national funds through the FCT - Foundation for Science and Technology, I.P., within the scope of the project CISUC - UID/CEC/00326/2020 and by the European Social Fund, through the Regional Operational Program Centro 2020.

Author information

Corresponding author

Correspondence to Andre Bento.

About this article

Cite this article

Bento, A., Correia, J., Filipe, R. et al. Automated Analysis of Distributed Tracing: Challenges and Research Directions. J Grid Computing 19, 9 (2021). https://doi.org/10.1007/s10723-021-09551-5

  • DOI: https://doi.org/10.1007/s10723-021-09551-5
