CoPTA: Contiguous Pattern Speculating TLB Architecture

Yang, Yichen; Ye, Haojie; Chen, Yuhan; Liu, Xueyang; Talati, Nishil; He, Xin; Mudge, Trevor; Dreslinski, Ronald

doi:10.1007/978-3-030-60939-9_5

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12471)

Included in the following conference series:

International Conference on Embedded Computer Systems

Abstract

With the growing size of real-world datasets running on CPUs, address translation has become a significant performance bottleneck. To translate virtual addresses into physical addresses, modern operating systems perform several levels of page table walks (PTWs) in memory. Translation look-aside buffers (TLBs) are used as caches to keep recently used translation information. However, as datasets increase in size, both the TLB miss rate and the overhead of PTWs worsen, causing severe performance bottlenecks. Using a diverse set of workloads, we show that the PTW overhead consumes an average of 20% of application execution time.

In this paper, we propose CoPTA, a technique to speculate the memory address translation upon a TLB miss to hide the PTW latency. Specifically, we show that the operating system has a tendency to map contiguous virtual memory pages to contiguous physical pages. Using a real machine, we show that the Linux kernel can automatically defragment physical memory and create larger chunks for contiguous mapping, particularly when transparent huge page support is enabled. Based on this observation, we devise a speculation mechanism that finds nearby entries present in the TLB upon a miss and predicts the address translation of the missed address assuming contiguous address allocation. This allows CoPTA to speculatively execute instructions without waiting for the PTW to complete. We run the PTW in parallel, compare the speculated and the translated physical addresses, and flush the pipeline upon a wrong speculation, using techniques similar to those used for handling branch mispredictions.
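The speculation idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the TLB and page table are modeled as plain dictionaries from virtual page number to physical page number, pages are assumed to be 4 KiB, and the names `speculate`, `resolve`, and `max_distance` are hypothetical helpers introduced for illustration.

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def speculate(tlb, vaddr, max_distance=4):
    """Guess a physical address from a nearby TLB entry, or return None.

    Assumes virtual pages near a cached page map to physical pages at
    the same distance (contiguous allocation). `max_distance` is a
    hypothetical search window, not a value from the paper.
    """
    vpn, offset = vaddr >> PAGE_SHIFT, vaddr & OFFSET_MASK
    for distance in range(1, max_distance + 1):
        for neighbor in (vpn - distance, vpn + distance):
            if neighbor in tlb:
                ppn = tlb[neighbor] + (vpn - neighbor)
                if ppn >= 0:
                    return (ppn << PAGE_SHIFT) | offset
    return None


def resolve(tlb, page_table, vaddr):
    """Handle a TLB miss: speculate, then check against the real walk."""
    guess = speculate(tlb, vaddr)
    vpn = vaddr >> PAGE_SHIFT
    ppn = page_table[vpn]            # stands in for the page table walk
    tlb[vpn] = ppn                   # fill the TLB with the real mapping
    actual = (ppn << PAGE_SHIFT) | (vaddr & OFFSET_MASK)
    if guess is None:
        outcome = "no_guess"         # nothing nearby: wait for the walk
    elif guess == actual:
        outcome = "correct"          # speculative work can be kept
    else:
        outcome = "flush"            # wrong guess: discard speculative work
    return actual, outcome


# Pages 0x10 and 0x11 are physically contiguous; page 0x12 is not.
page_table = {0x10: 0x80, 0x11: 0x81, 0x12: 0x200}
tlb = {0x10: 0x80}
first = resolve(tlb, page_table, 0x11ABC)
second = resolve(tlb, page_table, 0x12004)
```

In this toy run, the first miss is predicted correctly from the neighboring entry, while the second lands on a non-contiguous mapping and is reported as a flush, mirroring the compare-and-recover step the abstract describes.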

We comprehensively evaluate our proposal using benchmarks from three suites: SPEC CPU 2006 for server-grade applications, GraphBIG for graph applications, and the NAS benchmark suite for scientific applications. Using a trace-based simulation, we show an average address prediction accuracy of 82% across these workloads, resulting in a 16% performance improvement.

Author information

Authors and Affiliations

University of Michigan, Ann Arbor, MI, 48109, USA
Yichen Yang, Haojie Ye, Yuhan Chen, Xueyang Liu, Nishil Talati, Xin He, Trevor Mudge & Ronald Dreslinski

Corresponding author

Correspondence to Yichen Yang.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
Alex Orailoglu
Department of Electrical and Computer Engineering, Fraunhofer IESE, Kaiserslautern, Germany
Matthias Jung
Department of Computer Science, Friedrich-Alexander University, Erlangen, Germany
Marc Reichenbach

About this paper

Cite this paper

Yang, Y. et al. (2020). CoPTA: Contiguous Pattern Speculating TLB Architecture. In: Orailoglu, A., Jung, M., Reichenbach, M. (eds) Embedded Computer Systems: Architectures, Modeling, and Simulation. SAMOS 2020. Lecture Notes in Computer Science, vol 12471. Springer, Cham. https://doi.org/10.1007/978-3-030-60939-9_5

DOI: https://doi.org/10.1007/978-3-030-60939-9_5
Published: 07 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60938-2
Online ISBN: 978-3-030-60939-9
eBook Packages: Computer Science, Computer Science (R0)
