CoPTA: Contiguous Pattern Speculating TLB Architecture

  • Conference paper
Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12471)

Abstract

With the growing size of real-world datasets processed on CPUs, address translation has become a significant performance bottleneck. To translate virtual addresses into physical addresses, modern operating systems perform several levels of page table walks (PTWs) in memory. Translation look-aside buffers (TLBs) are used as caches to keep recently used translation information. However, as datasets increase in size, both the TLB miss rate and the overhead of PTWs worsen, causing severe performance bottlenecks. Using a diverse set of workloads, we show that the PTW overhead consumes an average of 20% of application execution time.
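
To make the cost of a multi-level walk concrete, the following minimal sketch splits a virtual address into the per-level table indices a walker would use. It assumes x86-64 four-level paging with 48-bit virtual addresses and 4 KiB pages (nine index bits per level, twelve offset bits); the abstract does not name a specific architecture, and the function name `walk_indices` is hypothetical.

```python
# Sketch under assumptions: x86-64 four-level paging, 4 KiB pages.
# Each level of the walk consumes 9 bits of the virtual address, so a
# TLB miss can require up to four dependent memory accesses.

PAGE_OFFSET_BITS = 12          # 4 KiB pages
INDEX_BITS = 9                 # 512 entries per page-table level
LEVEL_SHIFTS = (39, 30, 21, 12)  # top-level table first, leaf table last


def walk_indices(vaddr):
    """Return ([index per level, top level first], page offset)."""
    mask = (1 << INDEX_BITS) - 1
    indices = [(vaddr >> shift) & mask for shift in LEVEL_SHIFTS]
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    return indices, offset


if __name__ == "__main__":
    example = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
    print(walk_indices(example))
```

Each index selects an entry in one table, and that entry points to the next table, which is why the accesses are serialized and why hiding their latency matters.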

In this paper, we propose CoPTA, a technique to speculate the memory address translation upon a TLB miss to hide the PTW latency. Specifically, we show that the operating system has a tendency to map contiguous virtual memory pages to contiguous physical pages. Using a real machine, we show that the Linux kernel can automatically defragment physical memory and create larger chunks for contiguous mapping, particularly when transparent huge page support is enabled. Based on this observation, we devise a speculation mechanism that finds nearby entries present in the TLB upon a miss and predicts the address translation of the missed address assuming contiguous address allocation. This allows CoPTA to speculatively execute instructions without waiting for the PTW to complete. We run the PTW in parallel, compare the speculated and the translated physical addresses, and flush the pipeline upon a wrong speculation, using techniques similar to those used for handling branch mispredictions.
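
The speculation idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the TLB and page table are modeled as Python dictionaries from virtual page number to physical page number, pages are assumed to be 4 KiB, and the names `speculate_ppn`, `translate`, and the search radius `window` are hypothetical.

```python
# Illustrative model only: dict-based TLB and page table, 4 KiB pages.
# All names and the `window` parameter are assumptions for this sketch.

PAGE_SHIFT = 12  # assumed 4 KiB base pages


def speculate_ppn(tlb, vpn, window=4):
    """Guess the physical page number of `vpn` from the nearest cached neighbor."""
    for distance in range(1, window + 1):
        for neighbor in (vpn - distance, vpn + distance):
            if neighbor in tlb:
                # Contiguity assumption: the VPN delta equals the PPN delta.
                guess = tlb[neighbor] + (vpn - neighbor)
                return guess if guess >= 0 else None
    return None  # no nearby entry, so no speculation is attempted


def translate(tlb, page_table, vaddr, window=4):
    """Return (physical address, outcome) for one access."""
    vpn = vaddr >> PAGE_SHIFT
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    if vpn in tlb:
        return (tlb[vpn] << PAGE_SHIFT) | offset, "hit"

    guess = speculate_ppn(tlb, vpn, window)
    actual = page_table[vpn]  # stands in for the page table walk
    tlb[vpn] = actual         # fill the TLB once the walk completes

    if guess is None:
        outcome = "no_speculation"
    elif guess == actual:
        outcome = "correct"      # speculative work would be kept
    else:
        outcome = "mispredict"   # speculative work would be flushed
    return (actual << PAGE_SHIFT) | offset, outcome


if __name__ == "__main__":
    page_table = {0x100: 0x500, 0x101: 0x501, 0x102: 0x900}
    tlb = {0x100: 0x500}
    for addr in (0x101010, 0x102008, 0x102008):
        paddr, outcome = translate(tlb, page_table, addr)
        print(hex(addr), hex(paddr), outcome)
```

In this model the walk result is always what gets installed in the TLB, so a wrong guess costs only the discarded speculative work, mirroring the compare-and-flush step in the abstract. The model is sequential; in hardware the walk and the speculative execution would overlap.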

We comprehensively evaluate our proposal using benchmarks from three suites: SPEC CPU 2006 for server-grade applications, GraphBIG for graph applications, and the NAS benchmark suite for scientific applications. Using a trace-based simulation, we show an average address prediction accuracy of 82% across these workloads, resulting in a 16% performance improvement.

Corresponding author

Correspondence to Yichen Yang.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, Y. et al. (2020). CoPTA: Contiguous Pattern Speculating TLB Architecture. In: Orailoglu, A., Jung, M., Reichenbach, M. (eds) Embedded Computer Systems: Architectures, Modeling, and Simulation. SAMOS 2020. Lecture Notes in Computer Science, vol 12471. Springer, Cham. https://doi.org/10.1007/978-3-030-60939-9_5

  • DOI: https://doi.org/10.1007/978-3-030-60939-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60938-2

  • Online ISBN: 978-3-030-60939-9

  • eBook Packages: Computer Science, Computer Science (R0)