{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T17:50:40Z","timestamp":1768240240111,"version":"3.49.0"},"reference-count":44,"publisher":"Association for Computing Machinery (ACM)","issue":"1","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62125110 and U23B2052"],"award-info":[{"award-number":["62125110 and U23B2052"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Science and Technology Major Project of Tibetan Autonomous Region of China","award":["XZ202201ZD0006G04"],"award-info":[{"award-number":["XZ202201ZD0006G04"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,1,31]]},"abstract":"<jats:p>Recently, deep learning-based visual localization has gained significant attention and made remarkable advancements. Although previous visual localization methods have obtained promising performance on indoor or outdoor street scenes, there have been few attempts at visual localization on aerial scenes. In this article, a depth-aware aerial localization transformer (DALTR) is proposed to learn camera poses in real-world aerial scenes assisted by the depth map. To improve the network's ability to perceive aerial scenes, a multi-level depth embedding transformer module is presented that adaptively incorporates depth information into multiple levels of the transformer. In addition, to encourage the piece-wise smooth geometric characteristic of the scene coordinates, a depth-guided smoothness constraint is developed to provide additional supervision for scene coordinate regression. 
Extensive experimental results on aerial localization benchmark datasets demonstrate that the proposed DALTR achieves superior aerial localization performance.<\/jats:p>","DOI":"10.1145\/3773767","type":"journal-article","created":{"date-parts":[[2025,10,29]],"date-time":"2025-10-29T14:06:16Z","timestamp":1761746776000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Depth-Aware Transformer for Aerial Localization"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3171-7680","authenticated-orcid":false,"given":"Jianjun","family":"Lei","sequence":"first","affiliation":[{"name":"School of Electrical and Information Engineering, Tianjin University, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-6558-6379","authenticated-orcid":false,"given":"Duohui","family":"Tu","sequence":"additional","affiliation":[{"name":"School of Electrical and Information Engineering, Tianjin University, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6616-453X","authenticated-orcid":false,"given":"Bo","family":"Peng","sequence":"additional","affiliation":[{"name":"School of Electrical and Information Engineering, Tianjin University, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4081-2073","authenticated-orcid":false,"given":"Jie","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of Electrical and Information Engineering, Tianjin University, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8772-2107","authenticated-orcid":false,"given":"Zhe","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Electrical and Information Engineering, Tianjin University, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8265-2760","authenticated-orcid":false,"given":"Chong","family":"Wu","sequence":"additional","affiliation":[{"name":"EFY Intelligent Control (Tianjin) Technology Company Ltd., 
Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7542-296X","authenticated-orcid":false,"given":"Qingming","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2026,1,12]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"751","volume-title":"European Conference on Computer Vision (ECCV)","author":"Balntas Vassileios","year":"2018","unstructured":"Vassileios Balntas, Shuda Li, and Victor Prisacariu. 2018. RelocNet: Continuous metric learning relocalisation using neural nets. In European Conference on Computer Vision (ECCV), 751\u2013767."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3063297"},{"key":"e_1_3_1_4_2","first-page":"5044","volume-title":"Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Brachmann Eric","year":"2023","unstructured":"Eric Brachmann, Tommaso Cavallari, and Victor Adrian Prisacariu. 2023. Accelerated coordinate encoding: Learning to relocalize in minutes using RGB and poses. In Conference on Computer Vision and Pattern Recognition (CVPR), 5044\u20135053."},{"key":"e_1_3_1_5_2","first-page":"4654","volume-title":"Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Brachmann Eric","year":"2018","unstructured":"Eric Brachmann and Carsten Rother. 2018. Learning less is more\u20146D camera localization via 3D surface regression. In Conference on Computer Vision and Pattern Recognition (CVPR), 4654\u20134662."},{"issue":"9","key":"e_1_3_1_6_2","first-page":"5847","article-title":"Visual camera re-localization from RGB and RGB-D images using DSAC","volume":"44","author":"Brachmann Eric","year":"2021","unstructured":"Eric Brachmann and Carsten Rother. 2021. Visual camera re-localization from RGB and RGB-D images using DSAC. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 9 (2021), 5847\u20135865.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_7_2","first-page":"2616","volume-title":"Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Brahmbhatt Samarth","year":"2018","unstructured":"Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. 2018. Geometry-aware learning of maps for camera localization. In Conference on Computer Vision and Pattern Recognition (CVPR), 2616\u20132625."},{"key":"e_1_3_1_8_2","first-page":"3769","volume-title":"International Conference on Computer Vision Workshops (ICCVW)","author":"Cai Ming","year":"2019","unstructured":"Ming Cai, Huangying Zhan, Chamara Saroj Weerasekera, Kejie Li, and Ian Reid. 2019. Camera relocalization by exploiting multi-view constraints for scene coordinates regression. In International Conference on Computer Vision Workshops (ICCVW), 3769\u20133777."},{"key":"e_1_3_1_9_2","first-page":"3258","volume-title":"Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Chen Kefan","year":"2021","unstructured":"Kefan Chen, Noah Snavely, and Ameesh Makadia. 2021. Wide-baseline relative camera pose estimation with directional learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 3258\u20133268."},{"key":"e_1_3_1_10_2","first-page":"1","article-title":"An oblique-robust absolute visual localization method for GPS-denied UAV with satellite imagery","volume":"62","author":"Chen Yuan","year":"2023","unstructured":"Yuan Chen and Jie Jiang. 2023. An oblique-robust absolute visual localization method for GPS-denied UAV with satellite imagery. 
IEEE Transactions on Geoscience and Remote Sensing 62 (2023), 1\u201313.","journal-title":"IEEE Transactions on Geoscience and Remote Sensing"},{"key":"e_1_3_1_11_2","first-page":"2871","volume-title":"International Conference on Computer Vision (ICCV)","author":"Ding Mingyu","year":"2019","unstructured":"Mingyu Ding, Zhe Wang, Jiankai Sun, Jianping Shi, and Ping Luo. 2019. CamNet: Coarse-to-fine retrieval for camera re-localization. In International Conference on Computer Vision (ICCV), 2871\u20132880."},{"key":"e_1_3_1_12_2","unstructured":"Hugo Germain, Guillaume Bourmaud, and Vincent Lepetit. 2020. S2DNet: Learning accurate correspondences for sparse-to-dense feature matching. arXiv:2004.01673. Retrieved from https:\/\/arxiv.org\/abs\/2004.01673"},{"issue":"3","key":"e_1_3_1_13_2","doi-asserted-by":"crossref","first-page":"5737","DOI":"10.1109\/LRA.2021.3082473","article-title":"Scene coordinate regression network with global context-guided spatial feature transformation for visual relocalization","volume":"6","author":"Guan Peiyu","year":"2021","unstructured":"Peiyu Guan, Zhiqiang Cao, Junzhi Yu, Chao Zhou, and Min Tan. 2021. Scene coordinate regression network with global context-guided spatial feature transformation for visual relocalization. IEEE Robotics and Automation Letters 6, 3 (2021), 5737\u20135744.","journal-title":"IEEE Robotics and Automation Letters"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2806446"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"e_1_3_1_16_2","first-page":"2938","volume-title":"International Conference on Computer Vision (ICCV)","author":"Kendall Alex","year":"2015","unstructured":"Alex Kendall, Matthew Grimes, and Roberto Cipolla. 2015. PoseNet: A convolutional network for real-time 6-DoF camera relocalization. 
In International Conference on Computer Vision (ICCV), 2938\u20132946."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3287500"},{"key":"e_1_3_1_18_2","first-page":"229","volume-title":"European Conference on Computer Vision (ECCV)","author":"Li Xinyi","year":"2022","unstructured":"Xinyi Li and Haibin Ling. 2022. GTCaR: Graph transformer for camera re-localization. In European Conference on Computer Vision (ECCV), 229\u2013246."},{"key":"e_1_3_1_19_2","first-page":"11983","volume-title":"Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Li Xiaotian","year":"2020","unstructured":"Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, and Juho Kannala. 2020. Hierarchical scene coordinate classification and regression for visual localization. In Conference on Computer Vision and Pattern Recognition (CVPR), 11983\u201311992."},{"key":"e_1_3_1_20_2","first-page":"1043","volume-title":"Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Lim Hyon","year":"2012","unstructured":"Hyon Lim, Sudipta N. Sinha, Michael F. Cohen, and Matthew Uyttendaele. 2012. Real-time image-based 6-DoF localization in large-scale environments. In Conference on Computer Vision and Pattern Recognition (CVPR), 1043\u20131050."},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3587467"},{"issue":"1","key":"e_1_3_1_22_2","first-page":"1","article-title":"Robust and accurate mobile visual localization and its applications","volume":"9","author":"Liu Heng","year":"2013","unstructured":"Heng Liu, Tao Mei, Houqiang Li, Jiebo Luo, and Shipeng Li. 2013. Robust and accurate mobile visual localization and its applications. ACM Transactions on Multimedia Computing, Communications, and Applications 9, 1 (2013), 1\u201322.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_1_23_2","unstructured":"Ilya Loshchilov and Frank Hutter. 2017. 
Decoupled weight decay regularization. arXiv:1711.05101. Retrieved from https:\/\/arxiv.org\/abs\/1711.05101"},{"key":"e_1_3_1_24_2","first-page":"1525","volume-title":"International Conference on Intelligent Robots and Systems (IROS)","author":"Naseer Tayyab","year":"2017","unstructured":"Tayyab Naseer and Wolfram Burgard. 2017. Deep regression for monocular camera-based 6-DoF global localization in outdoor environments. In International Conference on Intelligent Robots and Systems (IROS), 1525\u20131530."},{"issue":"3","key":"e_1_3_1_25_2","doi-asserted-by":"crossref","first-page":"673","DOI":"10.1109\/JAS.2023.123660","article-title":"Depth-guided vision transformer with normalizing flows for monocular 3D object detection","volume":"11","author":"Pan Cong","year":"2024","unstructured":"Cong Pan, Junran Peng, and Zhaoxiang Zhang. 2024. Depth-guided vision transformer with normalizing flows for monocular 3D object detection. IEEE\/CAA Journal of Automatica Sinica 11, 3 (2024), 673\u2013689.","journal-title":"IEEE\/CAA Journal of Automatica Sinica"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3663570"},{"key":"e_1_3_1_27_2","first-page":"2102","volume-title":"International Conference on Computer Vision (ICCV)","author":"Sattler Torsten","year":"2015","unstructured":"Torsten Sattler, Michal Havlena, Filip Radenovic, Konrad Schindler, and Marc Pollefeys. 2015. Hyperpoints and fine vocabularies for large-scale location recognition. In International Conference on Computer Vision (ICCV), 2102\u20132110."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2611662"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3677123"},{"key":"e_1_3_1_30_2","first-page":"532","volume-title":"Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Svarm Linus","year":"2014","unstructured":"Linus Svarm, Olof Enqvist, Magnus Oskarsson, and Fredrik Kahl. 2014. 
Accurate localization and pose estimation for large 3D models. In Conference on Computer Vision and Pattern Recognition (CVPR), 532\u2013539."},{"key":"e_1_3_1_31_2","first-page":"1831","volume-title":"Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Tang Shitao","year":"2021","unstructured":"Shitao Tang, Chengzhou Tang, Rui Huang, Siyu Zhu, and Ping Tan. 2021. Learning camera localization via dense scene matching. In Conference on Computer Vision and Pattern Recognition (CVPR), 1831\u20131841."},{"key":"e_1_3_1_32_2","first-page":"627","volume-title":"International Conference on Computer Vision (ICCV)","author":"Walch Florian","year":"2017","unstructured":"Florian Walch, Caner Hazirbas, Laura Leal-Taixe, Torsten Sattler, Sebastian Hilsenbeck, and Daniel Cremers. 2017. Image-based localization using LSTMs for structured feature correlation. In International Conference on Computer Vision (ICCV), 627\u2013637."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i06.6608"},{"key":"e_1_3_1_34_2","first-page":"6209","volume-title":"AAAI Conference on Artificial Intelligence","volume":"37","author":"Wang Sijie","year":"2023","unstructured":"Sijie Wang, Qiyu Kang, Rui She, Wee Peng Tay, Andreas Hartmannsgruber, and Diego Navarro Navarro. 2023. RobustLoc: Robust camera pose regression in challenging driving environments. In AAAI Conference on Artificial Intelligence, Vol. 37, 6209\u20136216."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-023-01982-9"},{"key":"e_1_3_1_36_2","first-page":"21666","volume-title":"Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Wang Yifan","year":"2024","unstructured":"Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou. 2024. Efficient LoFTR: Semi-dense local feature matching with sparse-like speed. 
In Conference on Computer Vision and Pattern Recognition (CVPR), 21666\u201321675."},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3222904"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3622788"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3235495"},{"key":"e_1_3_1_40_2","first-page":"17358","volume-title":"Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Yan Qi","year":"2022","unstructured":"Qi Yan, Jianhao Zheng, Simon Reding, Shanci Li, and Iordan Doytchinov. 2022. CrossLoc: Scalable aerial localization assisted by multimodal synthetic data. In Conference on Computer Vision and Pattern Recognition (CVPR), 17358\u201317368."},{"key":"e_1_3_1_41_2","first-page":"9155","volume-title":"International Conference on Computer Vision (ICCV)","author":"Zhang Renrui","year":"2023","unstructured":"Renrui Zhang, Han Qiu, Tai Wang, Ziyu Guo, Ziteng Cui, Yu Qiao, Hongsheng Li, and Peng Gao. 2023. MonoDETR: Depth-guided transformer for monocular 3D object detection. In International Conference on Computer Vision (ICCV), 9155\u20139166."},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-020-01399-8"},{"key":"e_1_3_1_43_2","first-page":"6165","volume-title":"AAAI Conference on Artificial Intelligence","volume":"35","author":"Zhou Kaichen","year":"2021","unstructured":"Kaichen Zhou, Changhao Chen, Bing Wang, Muhamad Risqi U. Saputra, Niki Trigoni, and Andrew Markham. 2021. VMLoc: Variational fusion for learning-based multimodal camera localization. In AAAI Conference on Artificial Intelligence, Vol. 35, 6165\u20136173."},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3596445"},{"key":"e_1_3_1_45_2","first-page":"32","volume-title":"Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Zhuang Bingbing","year":"2021","unstructured":"Bingbing Zhuang and Manmohan Chandraker. 2021. 
Fusing the old with the new: Learning relative camera pose with geometry-guided uncertainty. In Conference on Computer Vision and Pattern Recognition (CVPR), 32\u201342."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3773767","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T14:30:12Z","timestamp":1768228212000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3773767"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,12]]},"references-count":44,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,31]]}},"alternative-id":["10.1145\/3773767"],"URL":"https:\/\/doi.org\/10.1145\/3773767","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,12]]},"assertion":[{"value":"2024-12-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-23","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}