{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T14:51:20Z","timestamp":1758120680755,"version":"3.41.0"},"reference-count":78,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2020,12,30]],"date-time":"2020-12-30T00:00:00Z","timestamp":1609286400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optimization"],"published-print":{"date-parts":[[2021,3,31]]},"abstract":"<jats:p>Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches and often incurs performance overheads due to the redundancy in memory accesses.<\/jats:p>\n          <jats:p>In this article, we propose Neighbor Data (NeDa), a direct nearest-neighbor data sharing mechanism that uses two registers embedded in each streaming processor (SP) that can be accessed by nearest-neighbor SP cores. The registers are compiler-allocated and serve as a data exchange mechanism to eliminate nearest-neighbor shared accesses. NeDa is embedded carefully with local wires between SP cores so as to minimize the impact on density. We place and route NeDa in an open-source GPU and show a small area overhead of 1.3%. 
The cycle-accurate simulation indicates an average performance improvement of 21.8% and power reduction of up to 18.3% for stencil codes in General-Purpose Graphics Processing Unit (GPGPU) standard benchmark suites. We show that NeDa\u2019s performance is within 13.2% of an ideal GPU with no overhead for nearest-neighbor data exchange.<\/jats:p>","DOI":"10.1145\/3429981","type":"journal-article","created":{"date-parts":[[2020,12,30]],"date-time":"2020-12-30T12:30:51Z","timestamp":1609331451000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Efficient Nearest-Neighbor Data Sharing in GPUs"],"prefix":"10.1145","volume":"18","author":[{"given":"Negin","family":"Nematollahi","sequence":"first","affiliation":[{"name":"Department of Computer Engineering, Sharif University of Technology, Iran"}]},{"given":"Mohammad","family":"Sadrosadati","sequence":"additional","affiliation":[{"name":"School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Iran"}]},{"given":"Hajar","family":"Falahati","sequence":"additional","affiliation":[{"name":"School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Iran"}]},{"given":"Marzieh","family":"Barkhordar","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Sharif University of Technology, Iran"}]},{"given":"Mario Paulo","family":"Drumond","sequence":"additional","affiliation":[{"name":"EPFL University, Switzerland"}]},{"given":"Hamid","family":"Sarbazi-Azad","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Sharif University of Technology, Iran and School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Iran"}]},{"given":"Babak","family":"Falsafi","sequence":"additional","affiliation":[{"name":"EPFL University, 
Switzerland"}]}],"member":"320","published-online":{"date-parts":[[2020,12,30]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_2_1_1_1","DOI":"10.1145\/2934583.2934606"},{"volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909)","author":"Bakhoda Ali","unstructured":"Ali Bakhoda , George L. Yuan , Wilson W. L. Fung , Henry Wong , and Tor M. Aamodt . 2009. Analyzing CUDA workloads using a detailed GPU simulator . In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909) . IEEE, 163--174. Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909). IEEE, 163--174.","key":"e_1_2_1_2_1"},{"key":"e_1_2_1_3_1","article-title":"Enabling GPGPU low-level hardware explorations with MIAOW: An open-source RTL implementation of a GPGPU","volume":"12","author":"Balasubramanian Raghuraman","year":"2015","unstructured":"Raghuraman Balasubramanian , Vinay Gangadhar , Ziliang Guo , Chen-Han Ho , Cherin Joseph , Jaikrishnan Menon , Mario Paulo Drumond , Robin Paul , Sharath Prasad , Pradip Valathol , et\u00a0al. 2015 . Enabling GPGPU low-level hardware explorations with MIAOW: An open-source RTL implementation of a GPGPU . ACM Transactions on Architecture and Code Optimization (TACO) 12 , 2 Article 21 (2015). Raghuraman Balasubramanian, Vinay Gangadhar, Ziliang Guo, Chen-Han Ho, Cherin Joseph, Jaikrishnan Menon, Mario Paulo Drumond, Robin Paul, Sharath Prasad, Pradip Valathol, et\u00a0al. 2015. Enabling GPGPU low-level hardware explorations with MIAOW: An open-source RTL implementation of a GPGPU. 
ACM Transactions on Architecture and Code Optimization (TACO) 12, 2 Article 21 (2015).","journal-title":"ACM Transactions on Architecture and Code Optimization (TACO)"},{"key":"e_1_2_1_4_1","volume-title":"Chung","author":"Bao Siqi","year":"2018","unstructured":"Siqi Bao and Albert C. S . Chung . 2018 . Multi-scale structured CNN with label consistency for brain MR image segmentation. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 6, 1 (2018), 113--117. Siqi Bao and Albert C. S. Chung. 2018. Multi-scale structured CNN with label consistency for brain MR image segmentation. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 6, 1 (2018), 113--117."},{"doi-asserted-by":"publisher","key":"e_1_2_1_5_1","DOI":"10.1109\/HPCA.2015.7056017"},{"doi-asserted-by":"publisher","key":"e_1_2_1_6_1","DOI":"10.1109\/IISWC.2009.5306797"},{"doi-asserted-by":"publisher","key":"e_1_2_1_7_1","DOI":"10.1109\/LCA.2017.2693371"},{"doi-asserted-by":"publisher","key":"e_1_2_1_8_1","DOI":"10.1145\/2159430.2159443"},{"volume-title":"Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Clemons Jason","unstructured":"Jason Clemons , Chih-Chi Cheng , Iuri Frosio , Daniel Johnson , and Stephen W. Keckler . 2016. A patch memory system for image processing and computer vision . In Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916) . IEEE, 1--13. Jason Clemons, Chih-Chi Cheng, Iuri Frosio, Daniel Johnson, and Stephen W. Keckler. 2016. A patch memory system for image processing and computer vision. In Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). IEEE, 1--13.","key":"e_1_2_1_9_1"},{"volume-title":"Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. 
ACM, 63--74","author":"Danalis Anthony","unstructured":"Anthony Danalis , Gabriel Marin , Collin McCurdy , Jeremy S. Meredith , Philip C. Roth , Kyle Spafford , Vinod Tipparaju , and Jeffrey S. Vetter . 2010. The scalable heterogeneous computing (SHOC) benchmark suite . In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. ACM, 63--74 . Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. 2010. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. ACM, 63--74.","key":"e_1_2_1_10_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_11_1","DOI":"10.1109\/ICCD.2018.00080"},{"doi-asserted-by":"publisher","key":"e_1_2_1_12_1","DOI":"10.1145\/3001589"},{"doi-asserted-by":"publisher","key":"e_1_2_1_13_1","DOI":"10.1049\/iet-ipr.2017.1117"},{"doi-asserted-by":"publisher","key":"e_1_2_1_14_1","DOI":"10.1109\/SiPS.2016.57"},{"doi-asserted-by":"publisher","key":"e_1_2_1_15_1","DOI":"10.1145\/2000064.2000093"},{"doi-asserted-by":"publisher","key":"e_1_2_1_16_1","DOI":"10.1145\/3072959.3073592"},{"volume-title":"Edge Detection Methods Based on Generalized Type-2 Fuzzy Logic","author":"Gonzalez Claudia I.","unstructured":"Claudia I. Gonzalez , Patricia Melin , Juan R. Castro , and Oscar Castillo . 2017. Edge detection methods and filters used on digital image processing . In Edge Detection Methods Based on Generalized Type-2 Fuzzy Logic . Springer , 11--16. Claudia I. Gonzalez, Patricia Melin, Juan R. Castro, and Oscar Castillo. 2017. Edge detection methods and filters used on digital image processing. In Edge Detection Methods Based on Generalized Type-2 Fuzzy Logic. 
Springer, 11--16.","key":"e_1_2_1_17_1"},{"key":"e_1_2_1_18_1","first-page":"811","article-title":"Texture state cache","volume":"9","author":"Goodman Benjiman L.","year":"2017","unstructured":"Benjiman L. Goodman , Adam T. Moerschell , and James S. Blomgren . 2017 . Texture state cache . US Patent 9 , 811 ,875. Benjiman L. Goodman, Adam T. Moerschell, and James S. Blomgren. 2017. Texture state cache. US Patent 9,811,875.","journal-title":"US Patent"},{"volume-title":"Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar)","author":"Grauer-Gray Scott","unstructured":"Scott Grauer-Gray , Lifan Xu , Robert Searles , Sudhee Ayalasomayajula , and John Cavazos . 2012. Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar) . IEEE , 1--10. Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar). IEEE, 1--10.","key":"e_1_2_1_19_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_20_1","DOI":"10.1109\/PACT.2019.00028"},{"key":"e_1_2_1_21_1","volume-title":"Scarpazza","author":"Jia Zhe","year":"2018","unstructured":"Zhe Jia , Marco Maggioni , Benjamin Staiger , and Daniele P . Scarpazza . 2018 . Dissecting the NVIDIA volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826. Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. Dissecting the NVIDIA volta GPU architecture via microbenchmarking. 
arXiv preprint arXiv:1804.06826."},{"doi-asserted-by":"publisher","key":"e_1_2_1_22_1","DOI":"10.1109\/IPDPS.2010.5470421"},{"doi-asserted-by":"publisher","key":"e_1_2_1_23_1","DOI":"10.1145\/2980983.2908117"},{"volume-title":"Proceedings of the 47th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201914)","author":"Kayiran Onur","unstructured":"Onur Kayiran , Nachiappan Chidambaram Nachiappan , Adwait Jog , Rachata Ausavarungnirun , Mahmut T. Kandemir , Gabriel H. Loh , Onur Mutlu , and Chita R. Das . 2014. Managing GPU concurrency in heterogeneous architectures . In Proceedings of the 47th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201914) . IEEE, 114--126. Onur Kayiran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, and Chita R. Das. 2014. Managing GPU concurrency in heterogeneous architectures. In Proceedings of the 47th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201914). 
IEEE, 114--126.","key":"e_1_2_1_24_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_25_1","DOI":"10.1109\/ISCA.2018.00073"},{"doi-asserted-by":"publisher","key":"e_1_2_1_26_1","DOI":"10.2200\/S00451ED1V01Y201209CAC020"},{"doi-asserted-by":"publisher","key":"e_1_2_1_27_1","DOI":"10.1109\/ISCA.2008.25"},{"doi-asserted-by":"publisher","key":"e_1_2_1_28_1","DOI":"10.1145\/3123939.3123974"},{"doi-asserted-by":"publisher","key":"e_1_2_1_29_1","DOI":"10.1186\/s12859-016-1434-6"},{"doi-asserted-by":"publisher","key":"e_1_2_1_30_1","DOI":"10.1109\/IPDPS.2018.00024"},{"doi-asserted-by":"publisher","key":"e_1_2_1_31_1","DOI":"10.1145\/3140659.3080239"},{"doi-asserted-by":"publisher","key":"e_1_2_1_32_1","DOI":"10.1145\/2458523.2458538"},{"doi-asserted-by":"publisher","key":"e_1_2_1_33_1","DOI":"10.1145\/2872887.2750418"},{"doi-asserted-by":"publisher","key":"e_1_2_1_34_1","DOI":"10.1145\/2508148.2485964"},{"doi-asserted-by":"publisher","key":"e_1_2_1_35_1","DOI":"10.1145\/3093315.3037709"},{"doi-asserted-by":"publisher","key":"e_1_2_1_36_1","DOI":"10.1145\/3123939.3123941"},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the 1st International Workshop on High-Performance Stencil Computations","author":"Maruyama Naoya","year":"2014","unstructured":"Naoya Maruyama and Takayuki Aoki . 2014 . Optimizing stencil computations for NVIDIA Kepler GPUs . In Proceedings of the 1st International Workshop on High-Performance Stencil Computations , Vienna. 89--95. Naoya Maruyama and Takayuki Aoki. 2014. Optimizing stencil computations for NVIDIA Kepler GPUs. In Proceedings of the 1st International Workshop on High-Performance Stencil Computations, Vienna. 89--95."},{"doi-asserted-by":"publisher","key":"e_1_2_1_38_1","DOI":"10.1109\/TPDS.2016.2549523"},{"unstructured":"Andreas Meister and Gunter Saake. 2016. Challenges for a GPU-accelerated dynamic programming approach for join-order optimization. In GvD. 86--81.  Andreas Meister and Gunter Saake. 2016. 
Challenges for a GPU-accelerated dynamic programming approach for join-order optimization. In GvD. 86--81.","key":"e_1_2_1_39_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_40_1","DOI":"10.1109\/HPCSim.2014.6903695"},{"doi-asserted-by":"publisher","key":"e_1_2_1_41_1","DOI":"10.1504\/IJHPCN.2019.097046"},{"volume-title":"Proceedings of the 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 164--165","author":"Mirhosseini Amirhossein","unstructured":"Amirhossein Mirhosseini , Mohammad Sadrosadati , Behnaz Soltani , Hamid Sarbazi-Azad , and Thomas F. Wenisch . 2017. POSTER: Elastic reconfiguration for heterogeneous NoCs with BiNoCHS . In Proceedings of the 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 164--165 . Amirhossein Mirhosseini, Mohammad Sadrosadati, Behnaz Soltani, Hamid Sarbazi-Azad, and Thomas F. Wenisch. 2017. POSTER: Elastic reconfiguration for heterogeneous NoCs with BiNoCHS. In Proceedings of the 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 164--165.","key":"e_1_2_1_42_1"},{"volume-title":"Proceedings of the 44th Annual IEEE\/ACM International Symposium on Microarchitecture. 308--317","author":"Narasiman Veynu","unstructured":"Veynu Narasiman , Michael Shebanow , Chang Joo Lee , Rustam Miftakhutdinov , Onur Mutlu , and Yale N. Patt . 2011. Improving GPU performance via large warps and two-level warp scheduling . In Proceedings of the 44th Annual IEEE\/ACM International Symposium on Microarchitecture. 308--317 . Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling. In Proceedings of the 44th Annual IEEE\/ACM International Symposium on Microarchitecture. 
308--317.","key":"e_1_2_1_43_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_44_1","DOI":"10.1109\/LCA.2018.2873679"},{"unstructured":"NVIDIA. 2018. CUDA SDK code samples. Retrieved from https:\/\/docs.nvidia.com\/cuda\/archive\/9.1\/pdf\/CUDA_Samples.pdf.  NVIDIA. 2018. CUDA SDK code samples. Retrieved from https:\/\/docs.nvidia.com\/cuda\/archive\/9.1\/pdf\/CUDA_Samples.pdf.","key":"e_1_2_1_45_1"},{"unstructured":"NVIDIA. 2018. Cuda9.0 programming guide. Retrieved from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html.  NVIDIA. 2018. Cuda9.0 programming guide. Retrieved from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html.","key":"e_1_2_1_46_1"},{"unstructured":"NVIDIA. 2018. GeForce GTX 980 Whitepaper\u2014NVIDIA File Downloads. Retrieved from https:\/\/international.download.nvidia.com\/geforce-com\/international\/pdfs\/GeForce_GTX_980_Whitepaper_FINAL.PDF.  NVIDIA. 2018. GeForce GTX 980 Whitepaper\u2014NVIDIA File Downloads. Retrieved from https:\/\/international.download.nvidia.com\/geforce-com\/international\/pdfs\/GeForce_GTX_980_Whitepaper_FINAL.PDF.","key":"e_1_2_1_47_1"},{"unstructured":"NVIDIA. 2018. Profiler user guide. Retrieved from http:\/\/docs.nvidia.com\/cuda\/profiler-users-guide\/index.html.  NVIDIA. 2018. Profiler user guide. Retrieved from http:\/\/docs.nvidia.com\/cuda\/profiler-users-guide\/index.html.","key":"e_1_2_1_48_1"},{"key":"e_1_2_1_49_1","first-page":"595","article-title":"Image processing method for detail enhancement and noise reduction","volume":"9","author":"Olsson Stefan","year":"2017","unstructured":"Stefan Olsson . 2017 . Image processing method for detail enhancement and noise reduction . US Patent 9 , 595 ,087. Stefan Olsson. 2017. Image processing method for detail enhancement and noise reduction. 
US Patent 9,595,087.","journal-title":"US Patent"},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the 17th Workshop em Desempenho de Sistemas Computacionais e de Comunica\u00e7\u00e3o (WPerformance","volume":"17","author":"Pavan Pablo Jos\u00e9","year":"2018","unstructured":"Pablo Jos\u00e9 Pavan , Matheus da Silva Serpa , V\u00edctor Mart\u00ednez , Edson Luiz Padoin , Jairo Panetta , and Philippe O. A. Navaux . 2018. Strategies to improve the performance and energy efficiency of stencil computations for NVIDIA GPUs . In Proceedings of the 17th Workshop em Desempenho de Sistemas Computacionais e de Comunica\u00e7\u00e3o (WPerformance 2018 ), Vol. 17 . SBC. Pablo Jos\u00e9 Pavan, Matheus da Silva Serpa, V\u00edctor Mart\u00ednez, Edson Luiz Padoin, Jairo Panetta, and Philippe O. A. Navaux. 2018. Strategies to improve the performance and energy efficiency of stencil computations for NVIDIA GPUs. In Proceedings of the 17th Workshop em Desempenho de Sistemas Computacionais e de Comunica\u00e7\u00e3o (WPerformance 2018), Vol. 17. SBC."},{"doi-asserted-by":"publisher","key":"e_1_2_1_51_1","DOI":"10.1109\/DCC.2017.56"},{"volume-title":"Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 641--652","author":"Rawat Prashant","unstructured":"Prashant Rawat , Miheer Vaidya , Aravind Sukumaran-Rajam , Atanas Rountev , Louis-No\u00ebl Pouchet , and P. Sadayappan . 2019. On optimizing complex stencils on GPUs . In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 641--652 . Prashant Rawat, Miheer Vaidya, Aravind Sukumaran-Rajam, Atanas Rountev, Louis-No\u00ebl Pouchet, and P. Sadayappan. 2019. On optimizing complex stencils on GPUs. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 
IEEE, 641--652.","key":"e_1_2_1_52_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_53_1","DOI":"10.1109\/JPROC.2018.2862896"},{"doi-asserted-by":"publisher","key":"e_1_2_1_54_1","DOI":"10.1016\/j.jocs.2015.04.023"},{"doi-asserted-by":"publisher","key":"e_1_2_1_55_1","DOI":"10.1145\/2872887.2750410"},{"doi-asserted-by":"publisher","key":"e_1_2_1_56_1","DOI":"10.1145\/3291606"},{"doi-asserted-by":"publisher","key":"e_1_2_1_57_1","DOI":"10.1145\/3173162.3173211"},{"doi-asserted-by":"publisher","key":"e_1_2_1_58_1","DOI":"10.23919\/DATE.2017.7926954"},{"doi-asserted-by":"publisher","key":"e_1_2_1_59_1","DOI":"10.1016\/j.sysarc.2012.10.004"},{"key":"e_1_2_1_60_1","volume-title":"Flexible router architecture for network-on-chip. Computers & Mathematics with Applications 64, 5","author":"Sayed Mostafa S.","year":"2012","unstructured":"Mostafa S. Sayed , Ahmed Shalaby , Mohamed El-Sayed , and Victor Goulart . 2012. Flexible router architecture for network-on-chip. Computers & Mathematics with Applications 64, 5 ( 2012 ), 1301--1310. Mostafa S. Sayed, Ahmed Shalaby, Mohamed El-Sayed, and Victor Goulart. 2012. Flexible router architecture for network-on-chip. Computers & Mathematics with Applications 64, 5 (2012), 1301--1310."},{"doi-asserted-by":"publisher","key":"e_1_2_1_61_1","DOI":"10.1109\/MICRO.2014.31"},{"key":"e_1_2_1_62_1","volume-title":"Geng Daniel Liu, and Wen-mei W. Hwu","author":"Stratton John A.","year":"2012","unstructured":"John A. Stratton , Christopher Rodrigues , I- Jui Sung , Nady Obeid , Li-Wen Chang , Nasser Anssari , Geng Daniel Liu, and Wen-mei W. Hwu . 2012 . Parboil : A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing 127 (2012). John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. 
Center for Reliable and High-Performance Computing 127 (2012)."},{"doi-asserted-by":"publisher","key":"e_1_2_1_63_1","DOI":"10.1109\/IPDPS.2017.106"},{"volume-title":"Proceedings of the International Conference on Supercomputing. ACM, 214--224","author":"Unat Didem","unstructured":"Didem Unat , Xing Cai , and Scott B. Baden . 2011. Mint: Realizing CUDA performance in 3D stencil methods with annotated C . In Proceedings of the International Conference on Supercomputing. ACM, 214--224 . Didem Unat, Xing Cai, and Scott B. Baden. 2011. Mint: Realizing CUDA performance in 3D stencil methods with annotated C. In Proceedings of the International Conference on Supercomputing. ACM, 214--224.","key":"e_1_2_1_64_1"},{"doi-asserted-by":"crossref","unstructured":"Nandita Vijaykumar Eiman Ebrahimi Kevin Hsieh Phillip B. Gibbons and Onur Mutlu. 2018. The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs. ISCA.  Nandita Vijaykumar Eiman Ebrahimi Kevin Hsieh Phillip B. Gibbons and Onur Mutlu. 2018. The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs. 
ISCA.","key":"e_1_2_1_65_1","DOI":"10.1109\/ISCA.2018.00074"},{"doi-asserted-by":"publisher","key":"e_1_2_1_66_1","DOI":"10.5555\/3195638.3195656"},{"doi-asserted-by":"publisher","key":"e_1_2_1_67_1","DOI":"10.1109\/TC.2017.2776908"},{"doi-asserted-by":"publisher","key":"e_1_2_1_68_1","DOI":"10.1109\/MICRO.2018.00013"},{"doi-asserted-by":"publisher","key":"e_1_2_1_69_1","DOI":"10.1109\/ACCESS.2019.2910824"},{"doi-asserted-by":"publisher","key":"e_1_2_1_70_1","DOI":"10.1145\/3287624.3287633"},{"doi-asserted-by":"publisher","key":"e_1_2_1_71_1","DOI":"10.1145\/3297858.3304055"},{"doi-asserted-by":"publisher","key":"e_1_2_1_72_1","DOI":"10.1145\/2967938.2967954"},{"doi-asserted-by":"publisher","key":"e_1_2_1_73_1","DOI":"10.1145\/2830772.2830813"},{"doi-asserted-by":"publisher","key":"e_1_2_1_74_1","DOI":"10.1109\/HPCA.2015.7056023"},{"doi-asserted-by":"publisher","key":"e_1_2_1_75_1","DOI":"10.1145\/1811100.1811104"},{"key":"e_1_2_1_76_1","volume-title":"IEEE High Performance Extreme Computing Conference, HPEC.","author":"Zhang Guangwei","year":"2016","unstructured":"Guangwei Zhang and Yinliang Zhao . 2016 . Modeling the performance of 2.5 D blocking of 3D stencil code on GPUs . In IEEE High Performance Extreme Computing Conference, HPEC. Guangwei Zhang and Yinliang Zhao. 2016. Modeling the performance of 2.5 D blocking of 3D stencil code on GPUs. 
In IEEE High Performance Extreme Computing Conference, HPEC."},{"doi-asserted-by":"publisher","key":"e_1_2_1_77_1","DOI":"10.1109\/P3HPC.2018.00009"},{"doi-asserted-by":"publisher","key":"e_1_2_1_78_1","DOI":"10.1145\/3174243.3174248"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3429981","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3429981","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:31:46Z","timestamp":1750195906000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3429981"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,12,30]]},"references-count":78,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,3,31]]}},"alternative-id":["10.1145\/3429981"],"URL":"https:\/\/doi.org\/10.1145\/3429981","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2020,12,30]]},"assertion":[{"value":"2020-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-12-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}