{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T19:02:24Z","timestamp":1774983744873,"version":"3.50.1"},"reference-count":60,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2025,2,10]],"date-time":"2025-02-10T00:00:00Z","timestamp":1739145600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,2,10]]},"abstract":"<jats:p>There is a growing interest in leveraging GPUs for tasks beyond ML, especially in database systems. Despite the existing extensive work on GPU-based database operators, several questions are still open. For instance, the performance of almost all operators suffers from random accesses, which can account for up to 75% of the runtime. In addition, the group-by operator which is widely used in combination with joins, has not been fully explored for GPU acceleration. Furthermore, existing work often uses limited and unrepresentative workloads for evaluation and does not explore the query optimization aspect, i.e., how to choose the most efficient implementation based on the workload. In this paper, we revisit the state-of-the-art GPU-based join and group-by implementations. We identify their inefficiencies and propose several optimizations. We introduce GFTR, a novel technique to reduce random accesses, leading to speedups of up to 2.3x. We further optimize existing hash-based and sort-based group-by implementations, achieving significant speedups (19.4x and 1.7x, respectively). We also present a new partition-based group-by algorithm ideal for high group cardinalities. We analyze the optimizations with cost models, allowing us to predict the speedup. Finally, we conduct a performance evaluation to analyze each implementation. We conclude by providing practical heuristics to guide query optimizers in selecting the most efficient implementation for a given workload.<\/jats:p>","DOI":"10.1145\/3709689","type":"journal-article","created":{"date-parts":[[2025,2,11]],"date-time":"2025-02-11T15:45:06Z","timestamp":1739288706000},"page":"1-27","source":"Crossref","is-referenced-by-count":7,"title":["Efficiently Processing Joins and Grouped Aggregations on GPUs"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-3390-0695","authenticated-orcid":false,"given":"Bowen","family":"Wu","sequence":"first","affiliation":[{"name":"Department of Computer Science, ETH Z\u00fcrich, Zurich, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-0409-8611","authenticated-orcid":false,"given":"Dimitrios","family":"Koutsoukos","sequence":"additional","affiliation":[{"name":"Department of Computer Science, ETH Z\u00fcrich, Zurich, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4396-6695","authenticated-orcid":false,"given":"Gustavo","family":"Alonso","sequence":"additional","affiliation":[{"name":"Department of Computer Science, ETH Z\u00fcrich, Zurich, Switzerland"}]}],"member":"320","published-online":{"date-parts":[[2025,2,11]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Ian Goodfellow Andrew Harp Geoffrey Irving Michael Isard Yangqing Jia Rafal Jozefowicz Lukasz Kaiser Manjunath Kudlur Josh Levenberg Dandelion Man\u00e9 Rajat Monga Sherry Moore Derek Murray Chris Olah Mike Schuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker Vincent Vanhoucke Vijay Vasudevan Fernanda Vi\u00e9gas Oriol Vinyals Pete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https:\/\/www.tensorflow.org\/ Software available from tensorflow.org."},{"key":"e_1_2_2_2_1","volume-title":"Onesweep: A Faster Least Significant Digit Radix Sort for GPUs. arxiv: 2206.01784 [cs.DC]","author":"Adinets Andy","year":"2022","unstructured":"Andy Adinets and Duane Merrill. 2022. Onesweep: A Faster Least Significant Digit Radix Sort for GPUs. arxiv: 2206.01784 [cs.DC]"},{"key":"e_1_2_2_3_1","unstructured":"AMD. 2023. AMD Instinct MI250X Accelerator. https:\/\/www.amd.com\/en\/products\/server-accelerators\/instinct-mi250x"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544839"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2014.2313874"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452831"},{"key":"e_1_2_2_7_1","unstructured":"Sean Baxter. 2016. moderngpu 2.0. (2016). https:\/\/github.com\/moderngpu\/moderngpu\/wiki."},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3698812"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/s13222-014-0164-z"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.14778\/3632093.3632107"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2021.3061394"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/3303753.3303760"},{"key":"e_1_2_2_13_1","unstructured":"NVIDIA Corporation. 2023a. Cooperative primitives for CUDA C. https:\/\/github.com\/NVIDIA\/cub GitHub repository."},{"key":"e_1_2_2_14_1","unstructured":"NVIDIA Corporation. 2023b. NVIDIA H100 Tensor Core GPU Architecture. https:\/\/resources.nvidia.com\/en-us-tensor-core"},{"key":"e_1_2_2_15_1","volume-title":"Thrust: Parallel Algorithms Library. https:\/\/github.com\/NVIDIA\/thrust GitHub repository.","author":"NVIDIA Corporation","year":"2023","unstructured":"NVIDIA Corporation. 2023c. Thrust: Parallel Algorithms Library. https:\/\/github.com\/NVIDIA\/thrust GitHub repository."},{"key":"e_1_2_2_16_1","unstructured":"RAPIDS cuDF Developers. 2023. cuDF: GPU DataFrame Library. https:\/\/github.com\/rapidsai\/cudf GitHub repository."},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TEST.2011.6139189"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3361682"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3603581.3603590"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183734"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1132960.1132964"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304621"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1362622.1362684"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1620585.1620588"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376670"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551833"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536206.2536216"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517869"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2236584.2236592"},{"key":"e_1_2_2_30_1","volume-title":"Lohman","author":"Karnagel Tomas","year":"2015","unstructured":"Tomas Karnagel, Ren\u00e9 M\u00fcller, and Guy M. Lohman. 2015. Optimizing GPU-accelerated Group-By and Aggregation. In ADMS@VLDB. https:\/\/api.semanticscholar.org\/CorpusID:5017248"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3662010.3663441"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-021-00708-y"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476378"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.14778\/3007328.3007331"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389705"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517911"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2002.1019210"},{"key":"e_1_2_2_38_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Nakandala Supun","year":"2020","unstructured":"Supun Nakandala, Karla Saur, Gyeong-In Yu, Konstantinos Karanasos, Carlo Curino, Markus Weimer, and Matteo Interlandi. 2020. A Tensor Compiler for Unified Machine Learning Prediction Serving. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 899--917. https:\/\/www.usenix.org\/conference\/osdi20\/presentation\/nakandala"},{"key":"e_1_2_2_39_1","unstructured":"NVIDIA. Accessed: 2024-06--16. cuCollections. https:\/\/github.com\/NVIDIA\/cuCollections\/tree\/dev."},{"key":"e_1_2_2_40_1","volume-title":"PyTorch: An Imperative Style","author":"Paszke Adam","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch\u00e9 Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024--8035. http:\/\/papers.neurips.cc\/paper\/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10619-019-07280-z"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915224"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457254"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/3554821.3554829"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3485126"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3329785.3329922"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.14778\/3436905.3436927"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3085504.3085521"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380595"},{"key":"e_1_2_2_50_1","volume-title":"Proceedings of the 20th International Conference on Very Large Data Bases (VLDB 94)","author":"Shatdal Ambuj","unstructured":"Ambuj Shatdal, Chander Kant, and Jeffrey F. Naughton. 1994. Cache Conscious Algorithms for Relational Query Processing. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB 94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 510--521."},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00068"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDEW58674.2023.00034"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3430936"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588709"},{"key":"e_1_2_2_55_1","volume-title":"Boncz","author":"Tom\u00e9 Diego G.","year":"2018","unstructured":"Diego G. Tom\u00e9, Tim Gubner, Mark Raasveldt, Eyal Rozenberg, and Peter A. Boncz. 2018. Optimizing Group-By and Aggregation using GPU-CPU Co-Processing. In ADMS@VLDB. https:\/\/api.semanticscholar.org\/CorpusID:52895287"},{"key":"e_1_2_2_56_1","unstructured":"Peter Van Sandt and Zhe Jia. 2021. Dissecting the Ampere GPU Architecture through Microbenchmarking. online. https:\/\/www.nvidia.com\/en-us\/on-demand\/session\/gtcspring21-s33322\/"},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2017.2677451"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551809"},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536206.2536210"},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3470496.3533044"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3709689","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3709689","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T18:22:22Z","timestamp":1774981342000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3709689"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,10]]},"references-count":60,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,2,10]]}},"alternative-id":["10.1145\/3709689"],"URL":"https:\/\/doi.org\/10.1145\/3709689","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,10]]}}}