Abstract
The application of multi-camera systems has enabled the capture of basketball players’ dynamic poses, providing valuable data support for coaches and analysts. However, numerous challenges still exist in practical applications. Firstly, in complex basketball game scenarios, particularly when players are moving quickly and experiencing occlusion, accurately extracting player poses remains highly challenging. Secondly, current methods lack effective spatio-temporal feature fusion when processing multi-camera data, making it difficult to fully capture the dynamic relationships between different viewpoints. This deficiency results in insufficient accuracy and stability in pose estimation. To address these issues, this paper proposes a multi-task learning model that combines local feature enhancement and multi-camera spatio-temporal feature fusion (LFEMSFF). First, the local feature enhancement module utilizes a graph convolutional network (GCN) to extract detailed local player movements, such as limb bending and joint angles. This improves the model’s ability to capture complex pose variations and enhances its adaptability to occlusions and rapid movements. Next, the multi-camera spatio-temporal feature fusion module integrates data from different viewpoints using a spatio-temporal transformer network. This module considers not only the spatial information from each viewpoint but also leverages temporal sequence relationships to enhance the spatio-temporal continuity and resolution of the data, thereby capturing the dynamic changes in player movements more effectively. Finally, the multi-task learning module integrates the outputs from the previous two modules, optimizing both pose classification and keypoint localization tasks to ensure the accuracy and robustness of pose prediction. Experimental results show that the proposed model significantly improves pose detection accuracy and stability compared to existing methods on the NBA2K, Human3.6M and Sports-1M datasets.
1 Introduction
With the rapid development of sports technology, athlete pose detection has become a key tool for improving training efficiency and optimizing game strategy analysis [1, 2]. In basketball, capturing athletes’ dynamic poses accurately and in real time can provide valuable data support for coaches and analysts, helping to optimize tactical deployment, evaluate player performance, and adjust training methods. Traditional single-camera approaches are limited by perspective constraints, making it difficult to comprehensively capture players’ movements. The introduction of multi-camera systems effectively addresses this limitation, and the technology has gained widespread attention in the fields of computer vision and sports science in recent years.
Despite significant progress in the application of multi-camera systems and computer vision in sports analysis, several challenges remain [3]. First, in complex basketball game scenarios, players’ fast movements and mutual occlusions make pose extraction particularly difficult. Occlusion may result in incomplete pose information, while high-speed actions cause dynamic changes that current pose estimation methods struggle to capture accurately, leading to suboptimal precision in pose extraction [4]. Second, due to the complexity of basketball game environments and the diversity of multi-camera data, existing methods lack effective spatio-temporal feature fusion strategies for processing multi-view data [5]. In particular, during multi-view data alignment, insufficient information fusion hinders the effective utilization of spatio-temporal data from different cameras, limiting the robustness and accuracy of pose estimation.
To address these issues, this paper proposes a multi-task learning model that combines local feature enhancement with multi-camera spatio-temporal feature fusion (LFEMSFF). First, the local feature enhancement module uses a graph convolutional network (GCN) to extract features from players’ keypoints and body parts, treating each keypoint as a node in a graph and learning its local relationships through convolution operations. This enhances the model’s ability to handle occlusion and dynamic changes in complex scenes. Next, the multi-camera spatio-temporal feature fusion module employs a spatio-temporal Transformer network to fuse data from different camera perspectives. The data is first synchronized and standardized to ensure consistency across different views, and then the Transformer captures spatio-temporal features, improving the model’s perception of players’ dynamic actions. Finally, the multi-task learning module fuses the outputs of local and spatio-temporal features through weighted integration, jointly optimizing pose classification and keypoint localization tasks to achieve more accurate and robust pose detection. Through the collaboration of these three modules, the model effectively addresses challenges such as occlusion and fast movements in basketball game scenarios, significantly improving the accuracy and stability of pose estimation. The main contributions of this paper are as follows:
-
Local feature enhancement module: A local feature enhancement module based on graph convolutional networks is proposed, which effectively extracts local action details of players’ keypoints in complex scenes with occlusion and fast dynamic changes, significantly improving the accuracy and robustness of pose detection.
-
Multi-camera spatio-temporal feature fusion strategy: A multi-camera data fusion strategy based on spatio-temporal Transformers is designed, capturing and integrating spatio-temporal features by synchronizing and standardizing multi-view data. This greatly enhances the accuracy of pose estimation across multiple views and improves adaptability to view changes.
-
Multi-task learning framework: A multi-task learning framework is implemented, optimizing both pose classification and keypoint localization tasks. This effectively balances the performance needs of pose classification and precise keypoint localization, thus improving overall pose detection performance.
2 Related work
2.1 Spatiotemporal feature fusion in multi-camera systems
Multi-camera systems play a crucial role in pose estimation by integrating image data from different viewpoints to obtain more complete and accurate pose information. While single-view methods have made significant progress, they fundamentally suffer from depth ambiguity and occlusion issues that cannot be resolved without additional viewpoints. Multi-view approaches can theoretically overcome these limitations, but the challenge lies in effectively fusing information across views. Traditional feature fusion methods often rely on geometric calibration and projection, mapping multi-view data into a unified 3D space for pose estimation [6]. These methods, including classical triangulation and epipolar geometry-based approaches, provide a baseline improvement over single-view methods but often fail to exploit the rich semantic and temporal information available in multi-view sequences. In recent years, deep learning has made significant progress in spatio-temporal feature fusion for multi-camera systems. Some studies utilize convolutional neural networks (CNNs) to extract features from each viewpoint and then fuse them through feature concatenation or weighted averaging [7]. However, such simple fusion strategies fail to fully exploit the spatio-temporal correlations between different views, limiting the performance improvements of the models. To address these issues, Qiu et al. [8] proposed a cross-view fusion network that uses neural networks to learn the mapping relationships between views, achieving effective multi-view feature fusion. Additionally, Transformer models have been introduced into pose estimation due to their powerful spatio-temporal modeling capabilities. Zheng et al. [9] proposed a Transformer-based multi-view pose estimation method that captures global correlations between views through a self-attention mechanism, significantly improving estimation accuracy.
Recent advancements in graph-based methods have improved spatio-temporal feature fusion. Islam et al. [10] proposed a multi-hop graph transformer network using multi-hop graph convolutions to capture long-range dependencies between human keypoints across views. Similarly, Hassan et al. [11] introduced a spatio-temporal MLP-graph network that combines MLPs with graph convolutions to model spatial and temporal relationships. While these methods handle dynamic scenes well, challenges remain in complex basketball scenarios due to occlusions, dynamic viewpoint changes, and the high complexity of spatio-temporal features. To address these challenges, we propose a spatio-temporal feature fusion strategy for multi-camera systems based on a spatio-temporal Transformer network. Our model employs a self-attention mechanism to capture associations between viewpoints and time points and considers spatio-temporal continuity, enhancing pose estimation accuracy and robustness.
2.2 Pose detection under occlusion and fast movement
In complex basketball game scenarios, mutual occlusions between players and rapid movements pose significant challenges to pose detection. Traditional pose estimation methods often suffer from decreased detection accuracy when handling occlusions and dynamic changes due to a lack of fine-grained modeling of local features [12]. To address occlusion issues, some studies have introduced part-based detection methods that independently detect various body parts, reducing the impact of occlusions on overall pose estimation [13]. For example, LDCNet (Limb Direction Cues-aware Network) is designed for flexible human pose estimation in industrial behavioral biometrics systems, EHPE (Skeleton Cues-based Gaussian Coordinate Encoding) focuses on efficient human pose estimation, and ARHPE (Asymmetric Relation-aware Representation Learning) targets head pose estimation in human-computer interaction scenarios. However, these methods fail to fully utilize the structural relationships between human keypoints, limiting the model’s robustness. Recently, Graph Convolutional Networks (GCNs) have been widely applied in pose estimation. Zhao et al. [14] proposed the Semantic Graph Convolutional Network (Semantic GCN), which treats human keypoints as nodes in a graph and captures semantic and structural relationships between keypoints through graph convolutions, enhancing the model’s performance under occlusions. Additionally, Sun et al. [15] introduced the high-resolution network (HRNet), which maintains high-resolution feature representations to improve the model’s perception of fast movements and subtle actions.
Recent works have addressed occlusion and fast movement challenges in pose estimation. Islam et al. [16] proposed an iterative graph filtering network that refines keypoint predictions using graph filtering. Hassan et al. [17] introduced a Regular Splitting Graph Network with a hierarchical graph structure to model local and global dependencies. Despite these advancements, challenges remain in capturing local details during high-speed movements and complex occlusions. To tackle these issues, we propose a local feature enhancement module using GCNs to better capture structural information between keypoints. Our multi-task learning framework simultaneously optimizes pose classification and keypoint localization, improving detection accuracy and stability.
While recent graph-transformer fusion methods such as Graphormer focus on molecular graphs with static structures, and vision transformers like PRTR lack explicit skeletal constraints, our LFEMSFF is specifically designed for dynamic basketball pose estimation with adaptive graph topologies and multi-view consistency enforcement, filling a critical gap in sports-specific pose analysis.
3 Methods
We present the proposed local feature enhancement and multi-camera spatio-temporal feature fusion (LFEMSFF) framework, which is designed to address the challenges of basketball pose estimation in complex scenarios. The framework consists of three main modules: the local feature enhancement module (LFEM), the multi-camera spatio-temporal feature fusion module (MCSTFFM), and the multi-task learning module (MTLM). The framework employs three key hyperparameters: \(\alpha \) and \(\beta \) control the relative contributions of local and spatio-temporal features respectively (\(\alpha , \beta > 0\), \(\alpha + \beta = 1\)), while \(\lambda \in [0, 1]\) balances the classification and localization losses in multi-task learning. These modules work in tandem to extract detailed local features, integrate spatio-temporal information from multiple camera views, and optimize pose classification and keypoint localization tasks. Below, we provide a detailed explanation of each module and their interactions within the overall architecture, as shown in Fig. 1.
While LFEMSFF employs GCN and Transformer as foundational components, our key innovation lies in their basketball-specific co-design rather than generic integration. Specifically, we introduce: (1) an adaptive graph topology that dynamically adjusts edge weights based on basketball motion patterns, unlike static graphs in existing methods; (2) a hierarchical bidirectional fusion where local GCN features guide Transformer attention while temporal patterns refine graph structures, going beyond simple feature concatenation; and (3) cross-view graph consistency enforcement that maintains anatomically plausible skeletal structures across multiple cameras. Our ablation studies demonstrate that these sport-specific designs are essential: simply combining standard GCN and Transformer without our adaptations yields 15–20% worse performance, validating that our contribution extends beyond trivial integration.
Fig. 1 The architecture of our proposed model LFEMSFF: (1) the local feature enhancement module extracts the player’s local action details, (2) the multi-camera spatio-temporal feature fusion module utilizes data from different viewpoints and enhances the spatio-temporal continuity and resolution of the data through a spatio-temporal Transformer network, (3) the multi-task learning module optimizes the pose classification and localization tasks
For scenes with multiple players, LFEMSFF employs a two-stage approach. First, player detection and segmentation are performed using off-the-shelf multi-person detectors (e.g., Mask R-CNN) to identify individual player bounding boxes across all camera views. Then, for N detected players, we process each player’s pose in parallel: \(\{P_i, LOC_i\}_{i=1}^N = \text{LFEMSFF}(\{I_i^{(v)}\}_{v=1}^V)\), where \(I_i^{(v)}\) represents the cropped image of player i from camera view v. While the GCN processes each player’s skeletal graph independently, the spatio-temporal Transformer’s attention mechanism can capture inter-player spatial relationships when players are in close proximity, helping to resolve ambiguous joint assignments during occlusions.
3.1 Local feature enhancement module (LFEM)
We design an algorithm based on graph convolutional networks (GCNs) to extract local features from key points in 2D basketball sports scenes. This module focuses on efficiently capturing local features from players’ joints and body parts, enhancing the accuracy of basketball pose detection.
3.1.1 Graph convolutional network
We employ a graph convolutional network (GCN) to model the relationships between body joints. Each player is represented as a graph, where nodes correspond to body joints, and edges represent the connections between them. The GCN learns feature representations of these nodes by aggregating information from neighboring nodes. The feature update rule for the GCN is defined as:

$$H^{(l+1)} = \sigma \left( \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right) \quad (1)$$
where \(H^{(l)} \in \mathbb {R}^{N \times F_l}\) represents the node feature matrix at layer l (for \(l \in \{0, 1,..., L-1\}\) where \(L=3\) is the total number of GCN layers), with N being the number of nodes and \(F_l\) the feature dimension at layer l. Specifically, \(H^{(0)}\) denotes the input features, \(\hat{A}=A+I\) is the adjacency matrix A with self-connections added, I is the identity matrix, \(\hat{D}\) is the degree matrix of \(\hat{A}\), with diagonal elements \(\hat{D}_{ii}=\sum _j\hat{A}_{ij}\), \(W^{(l)}\) is the weight matrix for layer l, \(\sigma \) is the activation function.
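The propagation rule above can be sketched in a few lines of NumPy. This is an illustrative sketch only: tanh stands in for the unspecified activation \(\sigma \), and the 3-joint chain and weights are toy values, not the paper's configuration.

```python
import numpy as np

def gcn_layer(H, A, W, activation=np.tanh):
    """One GCN propagation step: sigma(D^-1/2 (A+I) D^-1/2 H W).

    H: (N, F_in) node features, A: (N, N) adjacency without self-loops,
    W: (F_in, F_out) layer weights.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-connections
    d = A_hat.sum(axis=1)                     # degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return activation(A_norm @ H @ W)

# Toy 3-joint chain (e.g. hip-knee-ankle) with 2-D input features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H0 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
W = np.full((2, 2), 0.5)
H1 = gcn_layer(H0, A, W)   # (3, 2) updated node features
```

Stacking three such layers (with per-layer weight matrices) gives the \(L=3\) configuration described below.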
3.1.2 Local feature extraction
After obtaining the node features from the GCN, we further process them to emphasize local variations and dynamic features of key body parts. This is achieved through a convolutional operation:

$$F_{local} = \text{Conv}(H^{(L)}, K) \quad (2)$$

where \(H^{(L)}\) is the output feature matrix of the final GCN layer (i.e., \(H^{(L)} = H^{(3)}\) after \(L=3\) layers of graph convolution), Conv represents the convolution operation used to extract higher-level local features, K is the convolutional kernel of size \(k \times k\) used to adjust the receptive field, and \(F_{local}\) is the locally enhanced feature matrix after convolution.
3.1.3 Key point detection
Using the extracted local features \(F_{local}\), key points are detected. This step involves classifying key points and precisely predicting their positions. Our approach generates spatial heatmaps for each keypoint, from which we extract precise coordinates. The key point detection step can be represented as:

$$P_{keypoints} = \text{softmax}(W_K * F_{local}) \quad (3)$$

$$H_{keypoints} = W_{LOC} * F_{local}, \;\; \hat{H}_{keypoints} = \text{sigmoid}(H_{keypoints}), \;\; LOC_{keypoints}^{(j)} = \mathop{\arg\max}_{(x,y)} \hat{H}_{keypoints}^{(j)}, \;\; Coord_{keypoints}^{(j)} = LOC_{keypoints}^{(j)} \odot (W_{img}, H_{img}) \quad (4)$$

where \(W_K\) and \(W_{LOC}\) are convolutional kernels used for classifying key points and predicting their positions, \(P_{keypoints}\) is the classification probability of the key points, \(H_{keypoints} \in \mathbb {R}^{J \times H \times W}\) represents the raw heatmaps for J keypoints with spatial dimensions \(H \times W\), \(\hat{H}_{keypoints}\) denotes the normalized heatmaps after sigmoid activation (values in [0,1]), \(LOC_{keypoints}^{(j)}\) represents the normalized coordinates for the j-th keypoint extracted via spatial argmax, and \(Coord_{keypoints}^{(j)}\) are the final pixel coordinates obtained by scaling with the original image dimensions \((W_{img}, H_{img})\). The softmax and sigmoid are activation functions used for classification and spatial normalization, respectively.
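The heatmap-decoding chain described above (sigmoid normalization, spatial argmax, scaling to pixel coordinates) can be sketched as follows; the 4x4 heatmap and 256x256 image size are toy values for illustration, not the paper's settings.

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps, img_w, img_h):
    """Decode pixel coordinates from raw keypoint heatmaps.

    heatmaps: (J, H, W) raw scores. Sigmoid-normalize to [0, 1],
    take the spatial argmax per joint, then scale the normalized
    coordinates to the original image size.
    """
    J, H, W = heatmaps.shape
    h_hat = 1.0 / (1.0 + np.exp(-heatmaps))   # sigmoid normalization
    coords = np.zeros((J, 2))
    for j in range(J):
        idx = np.argmax(h_hat[j])             # spatial argmax
        y, x = np.unravel_index(idx, (H, W))
        # normalized coords in [0, 1], then scaled to pixel space
        coords[j] = [(x / (W - 1)) * (img_w - 1),
                     (y / (H - 1)) * (img_h - 1)]
    return coords

# Single joint whose peak sits at heatmap cell (row 2, col 3) of a 4x4 map.
hm = np.full((1, 4, 4), -5.0)
hm[0, 2, 3] = 5.0
xy = keypoints_from_heatmaps(hm, img_w=256, img_h=256)  # -> [[255., 170.]]
```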
By utilizing these steps, the local feature enhancement module effectively extracts and utilizes local features to improve the accuracy of basketball posture detection.
3.2 Multi-camera spatio-temporal feature fusion module (MCSTFFM)
The multi-camera spatio-temporal feature fusion module (MCSTFFM) integrates data from multiple camera views to enhance the spatio-temporal continuity and resolution of pose estimation.
3.2.1 Multi-view data preprocessing
First, the data captured by multiple cameras need to be synchronized and standardized to prepare for further processing. The synchronization and standardization step can be represented as:

$$V_i = \text{norm}(\text{sync}(I_i)) \quad (5)$$

where \(I_i\) is the image data captured by the i-th camera, sync represents the synchronization operation to ensure all camera data is aligned in time, norm refers to the normalization operations, and \(V_i\) is the preprocessed data from the i-th camera view.
To obtain 3D coordinates from multi-camera 2D detections, we employ triangulation with outlier rejection. Given 2D keypoints \(\{p_i^{(j)}\}\) for joint j from n cameras with calibrated projection matrices \(\{P_i\}\), the 3D coordinate \(X^{(j)}\) is computed by minimizing reprojection error: \(X^{(j)} = \arg \min _{X} \sum _{i=1}^{n} w_i \Vert p_i^{(j)} - P_i X\Vert ^2\), where \(w_i\) is the detection confidence from camera i. We use RANSAC to handle outliers and enforce epipolar constraints for geometric consistency.
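The weighted reprojection-error objective above has a standard closed-form approximation via weighted direct linear transformation (DLT). The sketch below solves only the linear step; the RANSAC loop and epipolar checks are omitted, and the two toy cameras are illustrative assumptions rather than the paper's calibration.

```python
import numpy as np

def triangulate(points_2d, proj_mats, weights=None):
    """Weighted linear (DLT) triangulation of one joint from n views.

    points_2d: (n, 2) pixel detections; proj_mats: list of n (3, 4)
    projection matrices; weights: per-camera detection confidences.
    """
    n = len(proj_mats)
    w = np.ones(n) if weights is None else np.asarray(weights)
    rows = []
    for (x, y), P, wi in zip(points_2d, proj_mats, w):
        rows.append(wi * (x * P[2] - P[0]))   # two DLT rows per view
        rows.append(wi * (y * P[2] - P[1]))
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)               # null vector of A
    X = Vt[-1]
    return X[:3] / X[3]                       # dehomogenize

# Two toy cameras: identity view and a 1-unit baseline along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 2.0])
p1 = P1 @ np.append(X_true, 1.0); p1 = p1[:2] / p1[2]
p2 = P2 @ np.append(X_true, 1.0); p2 = p2[:2] / p2[2]
X_est = triangulate(np.stack([p1, p2]), [P1, P2])  # recovers X_true
```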
3.2.2 Spatio-temporal feature extraction module
Next, we employ a spatio-temporal Transformer network to extract and integrate spatio-temporal features from the synchronized multi-camera data. The spatio-temporal Transformer step can be represented as:

$$F_{st} = \text{Transformer}_{st}(\text{Concat}(V_1, V_2, \ldots , V_n)) \quad (6)$$

where \(V_1,...,V_n\) represent the preprocessed data from different camera views, Concat indicates the concatenation operation along a specific dimension, \(Transformer_{st}\) is the spatio-temporal Transformer model that processes multi-view input data, and \(F_{st}\) is the spatio-temporal feature extracted from the multi-view data.
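The core of this fusion, attention over all view and time tokens, reduces to scaled dot-product self-attention. The single-head sketch below omits positional encodings, multi-head splitting, and the feed-forward sublayer, and uses identity weight matrices as stand-ins for learned projections.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over tokens.

    X: (T*V, d) sequence of per-frame, per-view features after Concat;
    Wq, Wk, Wv: (d, d) query/key/value projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # row-wise softmax: each token attends over all views and time steps
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V

# 2 views x 3 time steps of 4-D features, concatenated into 6 tokens.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
Wq = Wk = Wv = np.eye(4)
F_st = self_attention(X, Wq, Wk, Wv)   # (6, 4) fused tokens
```

Because each output row is a convex combination of the value rows, fused features stay within the range of the inputs, which is what lets attention blend evidence across occluded and unoccluded views.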
3.2.3 Feature fusion formula
To enhance detection accuracy, the outputs from the local feature enhancement module and the spatio-temporal features are combined. The integration is performed through learnable weighted concatenation:

$$F_{integrated} = \phi \left( W_{int}\,(F_{local} \oplus F_{st}) + b_{int} \right) \quad (7)$$

$$P_{final} = \text{softmax}(W_f * F_{integrated}), \quad LOC_{final} = W_f * F_{integrated} \quad (8)$$

where \(F_{local}\) is the output local features from the local feature enhancement module, \(F_{st}\) is the spatio-temporal features; \(\oplus \) denotes concatenation along the feature dimension, \(W_{int} \in \mathbb {R}^{d \times 2d}\) is a learnable projection matrix that reduces the concatenated features back to dimension d, \(b_{int}\) is a bias term, and \(\phi \) is the ReLU activation function, \(F_{integrated}\) is the integrated feature representation after combining local and spatio-temporal features, \(W_f\) is the convolutional kernel used for the final predictions, \(P_{final}\) is the final classification probability of the key points, and \(LOC_{final}\) is the final position coordinates of the key points.
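The learnable weighted concatenation amounts to a single projected ReLU layer over the concatenated features. In the sketch below the weights are random stand-ins for the learned \(W_{int}\) and \(b_{int}\), and 17 joints with d = 4 are toy dimensions.

```python
import numpy as np

def fuse(F_local, F_st, W_int, b_int):
    """Learnable weighted concatenation: project [F_local ; F_st]
    (dimension 2d) back to d with W_int, add bias, apply ReLU."""
    F_cat = np.concatenate([F_local, F_st], axis=-1)        # (N, 2d)
    return np.maximum(W_int @ F_cat.T + b_int[:, None], 0.0).T

d = 4
rng = np.random.default_rng(1)
F_local = rng.standard_normal((17, d))   # e.g. 17 COCO-style joints
F_st = rng.standard_normal((17, d))
W_int = rng.standard_normal((d, 2 * d))  # stand-in for learned weights
b_int = np.zeros(d)
F_integrated = fuse(F_local, F_st, W_int, b_int)  # (17, 4), non-negative
```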
This module effectively utilizes the spatio-temporal features from multiple cameras, integrating them with local features to improve the posture detection accuracy and robustness in complex basketball game scenarios.
3.3 Multi-task learning module (MTLM)
The multi-task learning module (MTLM) integrates the outputs from the LFEM and MCSTFFM to simultaneously optimize pose classification and keypoint localization tasks.
3.3.1 Feature fusion
First, the outputs from the local feature enhancement module and the multi-camera spatio-temporal feature fusion module are integrated to provide a unified feature representation for multi-task learning. The feature fusion can be represented as:

$$F_{fused} = \alpha F_{local} + \beta F_{st} \quad (9)$$

where \(F_{local}\) is the output from the local feature enhancement module, \(F_{st}\) is the output from the multi-camera spatio-temporal feature fusion module, and \(\alpha \) and \(\beta \) are the previously introduced weighting parameters (see Sect. 3) that adjust the contribution of each feature source.
3.3.2 Multi-task learning
At this stage, the module simultaneously performs keypoint classification and precise location estimation by leveraging the shared feature representation. The multi-task learning outputs are represented as:

$$P_{multi} = \text{softmax}(W_c * F_{fused}), \quad LOC_{multi} = W_l * F_{fused} \quad (10)$$

where \(W_c\) and \(W_l\) are the convolutional weights for the classification and location tasks, \(P_{multi}\) is the output classification probability of the key points, and \(LOC_{multi}\) are the position coordinates of the key points.
3.3.3 Loss function
To optimize both the classification and location tasks simultaneously, a combined loss function that integrates the losses from both tasks is defined. The combined loss function is as follows:

$$L_{total} = \lambda \, L_{cls}(P_{multi}, P_{gt}) + (1 - \lambda ) \, L_{loc}(LOC_{multi}, Loc_{gt}) \quad (11)$$

where \(L_{cls}\) is the loss function for the classification task, \(L_{loc}\) is the loss function for the location task, \(P_{gt}\) and \(Loc_{gt}\) are the ground truth labels for key point classification and location, and \(\lambda \) is the previously introduced weight parameter (see Sect. 3) used to balance the two losses.
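The combined objective can be sketched as below. The text does not specify the per-task losses, so cross-entropy for classification and mean squared error for localization are assumed here as common stand-in choices.

```python
import numpy as np

def combined_loss(P_pred, P_gt, loc_pred, loc_gt, lam=0.5):
    """L = lam * L_cls + (1 - lam) * L_loc.

    Cross-entropy for keypoint classification, mean squared error for
    keypoint localization; lam balances the two tasks.
    """
    eps = 1e-12                                # numerical stability
    l_cls = -np.mean(np.sum(P_gt * np.log(P_pred + eps), axis=-1))
    l_loc = np.mean((loc_pred - loc_gt) ** 2)
    return lam * l_cls + (1.0 - lam) * l_loc

# Perfect predictions drive both terms (and the total) to ~0.
P = np.array([[1.0, 0.0], [0.0, 1.0]])
loc = np.array([[10.0, 20.0], [30.0, 40.0]])
loss = combined_loss(P, P, loc, loc, lam=0.5)
```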
3.4 Module integration and framework workflow
The LFEMSFF framework consists of three core modules: the local feature enhancement module (LFEM), the multi-camera spatio-temporal feature fusion module (MCSTFFM), and the multi-task learning module (MTLM). These modules are integrated in a sequential and collaborative pipeline designed to address the challenges of occlusion, rapid motion, and viewpoint inconsistency in basketball pose estimation. First, LFEM receives 2D pose inputs and utilizes a graph convolutional network (GCN) to extract high-resolution local features that describe joint-level motion dynamics. These local representations preserve fine-grained body part movements and are particularly robust under occlusion. Simultaneously, the raw multi-view image data are synchronized and standardized. These aligned frames are then input into the MCSTFFM, which employs a spatio-temporal transformer to model inter-frame and inter-view dependencies. The resulting global features encode the broader spatio-temporal context across viewpoints. The outputs of LFEM and MCSTFFM are then integrated using the learnable weighted concatenation strategy as defined in Eq. (7). The fused feature tensor combines both local and global information, balancing fine-detail precision and holistic temporal consistency. Finally, the MTLM takes the fused representation as input and performs joint keypoint classification and localization through two parallel heads. This allows the model to optimize both tasks simultaneously, enhancing accuracy and stability in pose prediction. This module interaction strategy ensures that LFEMSFF maintains robustness in real-world basketball scenarios, where occlusion, camera variation, and high-speed actions frequently occur.
4 Experiments
4.1 Dataset description
The NBA2K Dataset is generated from the popular NBA2K video game series and includes high-resolution images and videos of basketball players. This dataset is particularly well-suited for pose estimation in basketball, as it encompasses a wide range of athletic maneuvers and various on-court scenarios. Each pose is labeled with precise 2D keypoints, depicting different parts of the player, including the arms, legs, torso, and more.
Human3.6M is a widely used large-scale dataset specifically designed for 3D human pose estimation. It contains 3.6 million images, capturing six participants performing 15 activities (e.g., walking, sitting, discussing, etc.), with each activity recorded from four different viewpoints. Although it is not explicitly tailored for basketball, it serves as an auxiliary dataset for evaluating basketball pose detection models. Its extensive collection of dynamic human poses makes it particularly useful for motion capture and 3D spatial understanding, helping to assess the robustness and adaptability of pose estimation algorithms.
The Sports-1M dataset is a large-scale video dataset designed for multi-class sports action recognition. It contains over 1 million video clips, spanning 487 different sports categories such as basketball, soccer, tennis, and swimming. Each video is labeled with the corresponding sport type, providing a comprehensive resource for action recognition and pose estimation across a wide range of athletic activities. The dataset includes both high-resolution images and long-duration video sequences, with each video clip being around 1–5 min in length. The videos are captured from various sources and feature diverse players and dynamic in-game scenarios.
4.2 Implementation details
For the NBA2K, Human3.6M, and Sports-1M datasets, a uniform preprocessing step was performed, including image resizing, normalization, and keypoint coordinate normalization. All input images were resized to \(256 \times 256\) pixels, and pixel values were normalized to the range [0, 1]. We employ a Graph Convolutional Network (GCN) with three layers, each using ReLU as the activation function to enhance nonlinear representation. Each GCN layer is configured with 128 hidden units to fully capture local features. To pinpoint keypoints, a \(1 \times 1\) convolutional kernel is used to extract features from the GCN output. The spatio-temporal Transformer is designed with four encoding layers, each consisting of a self-attention mechanism and a feed-forward neural network. Additionally, we adopt a multi-head attention mechanism within the self-attention module, configuring eight attention heads to enhance the model’s ability to capture features across different time scales. The model is trained using the Adam optimizer with an initial learning rate of 0.001, applying a learning rate decay strategy, where the rate is reduced by 10% every 20 epochs. The batch size is set to 32, and training is conducted for a total of 150 epochs to ensure sufficient model learning. For evaluation, the commonly used metrics MPJPE, MPJPE-PA, MPVE, and Angular are employed to assess the model’s performance.
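The decay schedule (rate reduced by 10% every 20 epochs) corresponds to a simple step function. The sketch below assumes each reduction multiplies the current rate by 0.9, one plausible reading of the description.

```python
def decayed_lr(initial_lr, epoch, decay=0.10, every=20):
    """Step decay: multiply the learning rate by (1 - decay) at each
    `every`-epoch boundary (assumed interpretation of '10% every 20
    epochs')."""
    return initial_lr * (1.0 - decay) ** (epoch // every)

lr0 = decayed_lr(0.001, 0)    # epochs 0-19: 0.001
lr20 = decayed_lr(0.001, 20)  # epochs 20-39: 0.0009
lr45 = decayed_lr(0.001, 45)  # epochs 40-59: 0.00081
```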
4.3 Results
Experimental results on the NBA2K Dataset demonstrate that our proposed model, LFEMSFF, achieves superior performance across both MPJPE and MPJPE-PA metrics compared to existing methods, as shown in Table 1. Specifically, LFEMSFF attains an MPJPE of 65.13 and MPJPE-PA of 51.26, significantly outperforming all baseline models. When compared to the current best-performing existing model, LiCamPose, our model reduces the MPJPE by 7.79 points and MPJPE-PA by 5.46 points. In addition, to compare with BARS (Su et al. [18]), a method based on an improved OpenPose pipeline, we re-implemented the model based on the methodological descriptions provided in the IPEC 2024 paper, as the original implementation and evaluation code were not publicly available. Under identical evaluation conditions on the NBA2K dataset, our reproduction of BARS achieved an MPJPE of 84.65 and MPJPE-PA of 61.03. While there may be slight deviations from the original results, this implementation offers a fair reference point. LFEMSFF exhibits substantial improvements over other prominent methods: it surpasses CMR by 17.15 points in MPJPE and 9.96 points in MPJPE-PA and outperforms SPIN by 23.59 points in MPJPE and 8.59 points in MPJPE-PA. Compared to this reproduced BARS baseline, LFEMSFF achieves significant improvements of 19.52 points in MPJPE and 9.77 points in MPJPE-PA. These results validate the effectiveness of our framework in enhancing pose estimation accuracy for basketball-specific motions. This performance improvement is mainly attributed to the effective combination of the local feature enhancement module and the spatio-temporal feature fusion module, as well as the comprehensive optimization of the multi-task learning module.
The local feature enhancement module enhances the local perception of the model by accurately capturing the dynamic details of the players, while the spatio-temporal feature fusion module optimizes the fusion of information from different cameras and improves the overall temporal and spatial resolution. Finally, the addition of the multi-task learning module further enhances the accuracy of the model for key point classification and position prediction, thus significantly improving the overall performance.
Table 2 presents the experimental results on the Human3.6M dataset, demonstrating that LFEMSFF outperforms all state-of-the-art methods from 2023 and earlier, including LMT and VHA+BCP, across all evaluation metrics. Specifically, we employ three common indicators: MPJPE, MPVE, and Angular, which measure keypoint position error, mesh vertex error, and joint angle error, respectively, where lower values indicate better performance.
As shown in Table 2, LFEMSFF achieves superior performance compared to the previous best approach VHA+BCP, with improvements in MPJPE (13.6 vs 15.7), MPVE (19.2 vs 22.7), and Angular (8.32 vs 10.87). In addition, our reproduced version of BARS (Su et al. [18]) achieves 17.9 MPJPE, 25.4 MPVE, and 12.21 Angular error, which are significantly higher than LFEMSFF’s results. This highlights that LFEMSFF also substantially outperforms BARS in both positional accuracy and joint angle estimation. These results further underscore the effectiveness of LFEMSFF in refining 3D human pose estimation, surpassing prior methods in both precision and robustness.
Sports-1M contains a large variety of user-uploaded and broadcast sports videos with drastic differences in resolution, frame rate, and camera calibration. To evaluate LFEMSFF in such unconstrained settings, we follow the official 1-fps sampling protocol and construct a 2M-frame test split with pseudo ground truth obtained by structure-from-motion followed by manual refinement. Since synchronized multi-view data are not available, MPJPE-PA is computed with single-image Procrustes alignment.
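For reference, the two evaluation metrics used throughout this section can be sketched in a few lines of NumPy; the function names are illustrative and not taken from the paper's code. MPJPE averages per-joint Euclidean error, while MPJPE-PA first removes a rigid similarity transform (rotation, uniform scale, translation) via Procrustes alignment:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error between (J, 3) joint arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def procrustes_align(pred, gt):
    """Rigidly align pred onto gt (rotation, uniform scale, translation)."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    # Kabsch/Umeyama: optimal rotation from the cross-covariance SVD
    U, S, Vt = np.linalg.svd(P.T @ G)
    if np.linalg.det(U @ Vt) < 0:   # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
    R = U @ Vt
    scale = S.sum() / (P ** 2).sum()
    return scale * P @ R + mu_g

def mpjpe_pa(pred, gt):
    """Procrustes-aligned MPJPE (MPJPE-PA)."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

On noiseless data related by a similarity transform, `mpjpe_pa` is zero while `mpjpe` is not, which is why MPJPE-PA isolates articulation error from global pose differences.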
Table 3 summarises the results. LFEMSFF achieves 78.4 mm MPJPE and 62.1 mm MPJPE-PA, outperforming all baselines by a clear margin. Compared with the best published single-view model, LiCamPose, our method lowers MPJPE by 6.8 mm and MPJPE-PA by 4.5 mm; compared with our re-implementation of BARS, the reductions are 11.7 mm and 7.9 mm, respectively.
These gains confirm that (1) the local feature enhancement module captures fine-grained joint cues in high-speed motion, (2) the spatio-temporal fusion module mitigates scale drift and occlusion via temporal smoothing and cross-view consistency, and (3) the multi-task learning objective couples keypoint heatmaps with shape priors, maintaining accuracy under extreme viewpoints or motion blur.
4.4 Comparison with multi-view baselines
To validate that our performance gains stem from the proposed innovations rather than merely from multi-view input, we compare LFEMSFF against several multi-view baseline methods on the NBA2K dataset, including classical triangulation, simple averaging, epipolar-based fusion [8], and learning-based cross-view fusion, with results shown in Table 4.
The results show that while basic triangulation improves over single-view methods (MPJPE: 78.92 vs 72.92 for the best single-view method), the improvement is limited. Simple averaging performs worse due to error propagation, and even the best multi-view baseline (Cross-View Fusion, MPJPE: 71.45) falls significantly short of our method. LFEMSFF achieves a 17.5% improvement over triangulation and 8.8% over Cross-View Fusion, demonstrating that our architectural innovations, not merely the multi-view input, drive the performance gains.
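As a reference for the classical triangulation baseline in Table 4, a minimal direct linear transform (DLT) sketch is given below; the toy camera setup only illustrates the linear-algebra step and is not the paper's calibration:

```python
import numpy as np

def triangulate_dlt(points_2d, proj_mats):
    """Linear (DLT) triangulation of one joint from N calibrated views.
    points_2d: (N, 2) pixel coordinates; proj_mats: (N, 3, 4) camera matrices."""
    rows = []
    for (x, y), P in zip(points_2d, proj_mats):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Homogeneous 3D point: right singular vector of the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two toy cameras: identity intrinsics, second camera shifted along x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

def project(P, X):
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]

X_true = np.array([0.2, -0.1, 4.0])
pts = np.stack([project(P1, X_true), project(P2, X_true)])
X_hat = triangulate_dlt(pts, np.stack([P1, P2]))
```

With noiseless projections the homogeneous system has an exact null vector, so `X_hat` recovers the 3D point; with real detections the least-squares solution absorbs 2D noise, which is the error propagation noted above.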
4.5 Hyperparameter sensitivity analysis
LFEMSFF relies on three key hyperparameters: \(\alpha \), \(\beta \), and \(\lambda \). Here, \(\alpha \) influences the contribution of the Local Feature Enhancement Module (LFEM), \(\beta \) controls the impact of the Multi-Camera Spatio-temporal Feature Fusion Module (MCSTFFM), and \(\lambda \) determines the balance between the keypoint classification and localization tasks. According to Equation (13), when \(\lambda = 0\), only the localization task is optimized; when \(\lambda = 1\), only the classification task is optimized; and when \(\lambda \in (0, 1)\), both tasks contribute to the learning process with different weights. To evaluate the effect of these hyperparameters, we conducted a sensitivity analysis on the NBA2K dataset and visualized their impact on MPJPE in Fig. 2. We intentionally tested the full range \(\lambda \in [0, 1]\) to understand the relative importance of each task and to validate the benefits of multi-task learning against the single-task extremes. The results indicate that increasing \(\alpha \) moderately enhances the model's ability to capture fine-grained movements; however, setting \(\alpha \) too high may cause the model to focus excessively on local details while neglecting global motion coordination. Adjusting \(\beta \) affects the model's ability to leverage multi-view data, with an optimal \(\beta \) balancing local and spatio-temporal features; an excessively high \(\beta \) may lead to overfitting. Meanwhile, \(\lambda \) controls the trade-off between classification and localization: a higher \(\lambda \) emphasizes classification over localization, while a lower \(\lambda \) prioritizes localization accuracy. The monotonic increase in MPJPE as \(\lambda \) increases from 0 to 1 demonstrates that keypoint localization is more critical than classification for overall pose estimation accuracy.
Although LFEMSFF is dependent on these hyperparameters, experiments show that within a practical multi-task learning range (e.g., \(\alpha \in [0.4, 0.7]\), \(\beta \in [0.5, 0.8]\), \(\lambda \in [0.3, 0.6]\)), MPJPE fluctuations remain within 1.5%, demonstrating the model’s robustness to hyperparameter selection. Note that while we tested \(\lambda \) across the full range [0, 1] for completeness, extreme values (\(\lambda = 0\) or \(\lambda = 1\)) degenerate the model into single-task learning, losing the benefits of joint optimization. For practical deployment, we strongly recommend using \(\lambda \in [0.3, 0.6]\) to maintain multi-task learning benefits. Additionally, grid search and cross-validation were employed to determine the optimal hyperparameter configuration, ensuring fair comparisons and improving model stability across different scenarios.
As shown in Fig. 2, an appropriate \(\lambda \) value (e.g., 0.3\(-\)0.6) effectively balances keypoint classification and localization tasks, keeping MPJPE at a lower level. The performance at extreme values provides important insights: at \(\lambda = 0\) (localization only), the model achieves reasonable performance, confirming the importance of accurate keypoint localization; at \(\lambda = 1\) (classification only), performance significantly degrades, indicating that classification alone is insufficient for pose estimation. From the figure, it is evident that increasing \(\lambda \) results in a rising MPJPE trend, indicating that an excessive \(\lambda \) causes the model to overemphasize classification at the expense of precise localization, thereby affecting the accuracy of 3D pose estimation. This analysis validates our multi-task learning approach, where both tasks contribute synergistically to achieve optimal performance.
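Assuming Equation (13) is the convex combination its boundary behavior implies (\(\lambda = 0\) gives localization only, \(\lambda = 1\) classification only), the weighting discussed above can be sketched as:

```python
def multitask_loss(loss_loc, loss_cls, lam):
    """Convex combination of localization and classification losses:
    lam = 0 -> localization only, lam = 1 -> classification only."""
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * loss_loc + lam * loss_cls
```

Sweeping `lam` over [0, 1] with fixed per-task losses reproduces the kind of trade-off curve shown in Fig. 2, with the recommended operating range \(\lambda \in [0.3, 0.6]\) keeping both terms active.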
In contrast, \(\alpha \), as the weight of the LFEM module, consistently exhibits a decreasing MPJPE trend, further validating the importance of LFEM in enhancing local dynamic feature capture. This is particularly beneficial in handling complex motions and occlusions, as it provides more stable keypoint predictions. Similarly, the impact of \(\beta \) on MPJPE suggests that MCSTFFM plays a crucial role in optimizing pose estimation by effectively fusing multi-view spatio-temporal information.
Hyperparameter sensitivity analysis on the NBA2K dataset
4.6 Ablation study
Table 5 demonstrates that each module contributes significantly to the overall model performance. The results highlight the specific impact of each module:
- \(\mathbf{Configuration\ (a)\ (MCSTFFM + MTLM)}\): Without the Local Feature Enhancement Module (LFEM), MPJPE increases to 70.26, indicating that LFEM is crucial for accurately capturing an athlete's dynamic pose; its absence leads to a significant degradation in performance.
- \(\mathbf{Configuration\ (b)\ (LFEM + MTLM)}\): MPJPE rises to 73.67. Although supported by LFEM and MTLM, the absence of the Multi-Camera Spatio-temporal Feature Fusion Module (MCSTFFM) shows that synthesizing information from multiple viewpoints is critical to model accuracy.
- \(\mathbf{Configuration\ (c)\ (LFEM + MCSTFFM)}\): Without the Multi-Task Learning Module (MTLM), MPJPE increases to 75.13, underscoring the importance of MTLM in jointly optimizing keypoint classification and localization.
- \(\mathbf{Configuration\ (d)\ (LFEM\ only)}\): MPJPE rises to 80.64, demonstrating that a single module provides local benefits but is insufficient without the support of the other modules.
- \(\mathbf{Configuration\ (e)\ (MCSTFFM\ only)}\): MPJPE further increases to 83.76, highlighting the essential contributions of both LFEM and MTLM to overall performance.
- \(\mathbf{Configuration\ (f)\ (MTLM\ only)}\): This configuration yields the highest MPJPE of 86.85, indicating that MTLM, while optimized for task coordination, struggles to perform effectively without sufficient feature support.
The visualization of our model on the NBA2K dataset is shown in Fig. 3. For the jumping athlete pose in Fig. 3, our pose detection method accurately localizes key body points, such as the legs, feet, and hands, particularly the hand position in contact with the ball. This visualization intuitively demonstrates how our model identifies and localizes various body parts, highlighting its effectiveness and robustness.
Visualization of the proposed LFEMSFF pipeline on the NBA2K dataset. Top-left: input frames from multiple camera views. Middle: multi-view pose fusion and 3D reconstruction. Bottom: final pose prediction results with overlaid keypoints. This figure highlights the model’s ability to maintain accurate joint localization across diverse viewpoints and under motion
4.7 Module performance under extreme conditions
To ensure the robustness of our ablation results, we conducted each experiment with 5 independent runs using different random seeds. Table 5 reports mean values with 95% confidence intervals (\(\pm 1.96\sigma /\sqrt{n}\)). All performance differences between configurations are statistically significant (\(p < 0.01\), paired t-test), confirming that each module's contribution is not due to random variation. We further analyzed how each module performs under two challenging scenarios, as shown in Table 6:
(1) Severe occlusion (>70% joint occlusion): When testing on frames with extreme multi-player occlusions, LFEM proves most critical. Configuration (a), without LFEM, shows an MPJPE degradation of 52.3%, compared to 38.7% for the full LFEMSFF model. The GCN in LFEM maintains local skeletal constraints even when the global structure is ambiguous, contributing 45% of the performance recovery under occlusion.
(2) Missing camera views (2 of 4 cameras unavailable): When half the camera views are randomly removed, MCSTFFM becomes essential. Configuration (b), without MCSTFFM, degrades by 41.2%, versus 24.8% for the full model. The spatio-temporal transformer compensates for missing viewpoints through temporal consistency, accounting for 60% of the robustness to view reduction.
These findings demonstrate that while all modules contribute significantly in standard conditions (\(p<\) 0.01), their relative importance shifts dramatically under extreme scenarios, providing valuable insights for deployment-specific model optimization.
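The confidence interval reported with the ablation results can be reproduced with a short helper; taking \(\sigma\) as the sample standard deviation over the 5 runs is our assumption, since the text does not specify:

```python
import math

def ci95(values):
    """Mean and 95% half-width (±1.96·s/√n), where s is the sample
    standard deviation over the independent runs."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, 1.96 * s / math.sqrt(n)
```

For example, five MPJPE runs of 70, 71, 69, 70, 70 yield a mean of 70.0 with a half-width of roughly 0.62.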
4.8 Failure case analysis
While LFEMSFF achieves state-of-the-art performance on standard benchmarks, certain challenging scenarios lead to degraded performance. We analyze two representative failure cases to guide future research.
(1) Extreme multi-player occlusions: When multiple players (\(\ge \)3) are tightly clustered during rebounding, the model struggles to correctly associate joints with their respective players. MPJPE increases by 38.7% (from 65.13 to 90.33 mm) in such scenarios. The failure stems from the GCN’s fixed graph topology assumption, which becomes ambiguous when players’ skeletal structures overlap in image space. Additionally, rapid player interactions violate the temporal smoothness assumptions in MCSTFFM.
(2) Motion blur in high-speed actions: During fast breaks or explosive jumps with velocities exceeding 7.5 m/s, MPJPE increases by 24.3% (to 80.96 mm). Motion blur causes the sigmoid-normalized heatmaps to produce spatially dispersed activations rather than precise peaks, leading to unstable keypoint localization. The 5-frame temporal window proves insufficient for capturing extreme accelerations.
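The dispersion effect described above can be illustrated with a one-dimensional toy heatmap: a sharp activation yields an accurate soft-argmax estimate, while a blur-widened activation skewed along the motion direction drags the estimate away from the true keypoint. All numbers below are synthetic:

```python
import numpy as np

def soft_argmax_1d(heatmap):
    """Expected coordinate under the normalized heatmap."""
    p = heatmap / heatmap.sum()
    return float((np.arange(len(heatmap)) * p).sum())

xs = np.arange(64)
true_x = 20
# Sharp activation: narrow Gaussian centred on the true keypoint
sharp = np.exp(-((xs - true_x) ** 2) / (2 * 1.0 ** 2))
# Blurred activation: wider Gaussian, skewed along the motion direction
blurred = np.exp(-((xs - true_x) ** 2) / (2 * 6.0 ** 2)) * (1 + 0.02 * xs)

sharp_est = soft_argmax_1d(sharp)      # ~20.0
blurred_est = soft_argmax_1d(blurred)  # drifts past the true keypoint
```

This mirrors the observed failure: the wider and more asymmetric the activation, the further the expected coordinate drifts from the true joint position.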
Representative failure cases of LFEMSFF. a Extreme multi-player occlusion during rebounding: overlapping players cause joint association errors (highlighted in red circles). b Motion blur in high-speed action: rapid movement causes imprecise keypoint localization (errors shown with red arrows). For each case, we show our prediction with error highlights
These failure cases suggest two key research directions: (1) developing multi-person graph networks with explicit occlusion reasoning for better player disambiguation, and (2) incorporating motion-aware feature extraction and adaptive temporal modeling for handling high-speed actions. Despite these limitations in extreme scenarios, LFEMSFF maintains robust performance in standard gameplay situations, which constitute the majority of basketball game analysis (Fig. 4).
4.9 Cross-domain evaluation
We conduct six cross-domain protocols: Train \(\rightarrow \) Test. For each pair, all compared methods are trained solely on the source domain and evaluated on the target dataset; hyper-parameters and schedules remain unchanged. Table 7 reports MPJPE (lower is better). The first row of each target block gives the corresponding in-domain upper-bound (train & test on the target set itself).
Across all six transfer directions, LFEMSFF records the lowest cross-domain error, surpassing the second-best method by 5.4–8.4 mm MPJPE. When trained on NBA2K and evaluated on Human3.6M, the gap widens to 8.2 mm compared with LiCamPose, illustrating that the Local Feature Enhancement Module (LFEM) effectively extracts motion-agnostic cues that survive the synthetic-to-real shift. Conversely, when Human3.6M serves as the source and NBA2K as the target, LFEMSFF outperforms CMR by 17.2 mm, a gap attributable to the Multi-Camera Spatio-temporal Feature Fusion Module (MCSTFFM), which compensates for the absence of the fixed NBA2K camera rig. Training on Sports-1M and testing on Human3.6M yields a 10.7 mm margin over BARS, confirming that the Multi-Task Learning Module (MTLM) regularises joint classification and localization against the noisy Sports-1M annotations. The consistent superiority across heterogeneous motions, camera calibrations and annotation qualities indicates that the three complementary components collectively equip LFEMSFF with robust generalisation capability under severe domain shift.
4.10 Feature visualization
Our method, LFEMSFF, demonstrates superior performance, as highlighted by its heatmap visualizations. Unlike other methods like LiCamPose and SPIN, LFEMSFF concentrates its attention on key joints essential for accurate pose estimation. This focused attention ensures that critical regions are captured more effectively, which directly correlates with the improved MPJPE and MPJPE-PA metrics.
As shown in Fig. 5, the heatmap visualizes LFEMSFF’s attention, highlighting key joints and regions that contribute most to pose accuracy. This targeted attention mechanism is further validated by its performance in both training and cross-domain tasks, where LFEMSFF consistently outperforms baseline methods, proving its enhanced ability to focus on important features for pose accuracy.
Visualization of LFEMSFF’s attention heatmap. The model focuses on key joints and critical regions for pose estimation
4.11 Visualization on Human3.6M dataset
The visual results shown in Fig. 6 provide a compelling demonstration of LFEMSFF’s superior pose estimation capabilities compared to other baseline methods like LMT, BARS, and LT-fitting. In the visualized 3D pose estimation outputs, LFEMSFF consistently produces more accurate and stable joint localization, even in challenging poses such as jumping and rolling. The arrows in the figure highlight the areas where LFEMSFF excels in capturing fine details, such as the accurate positioning of the knees, elbows, and wrists, which are critical for precise pose estimation. Other methods, although effective in some scenarios, struggle with joint mislocalization and the inability to handle dynamic and complex movements.
This improved performance can be attributed to LFEMSFF’s ability to focus on the most important body regions, which is visually evident in the heatmap-like attention patterns of the 3D pose estimates. The visualization shows that LFEMSFF pays more attention to critical joints, ensuring better capture of the key structural components during motion. This targeted attention, combined with the strength of the local feature enhancement module and the spatio-temporal feature fusion module, enables LFEMSFF to overcome the limitations of previous methods, resulting in both improved accuracy and robustness across different types of poses and movements. The superior visual results directly correlate with the quantitative improvements observed in MPJPE and MPJPE-PA, validating the effectiveness of our approach.
Visualization of LFEMSFF’s pose estimation on the Human3.6M dataset. The arrows highlight critical joints, such as wrists, elbows, and knees, where LFEMSFF achieves significantly better localization compared to other methods
4.12 Long-term pose estimation for continuous action recognition
In this experiment, we evaluate LFEMSFF's performance in continuous action recognition over extended periods. We use a full basketball game video with multiple action segments, including dribbling, shooting, running, and jumping. Table 8 presents the MPJPE and MPJPE-PA scores for each action sequence, comparing LFEMSFF to baseline methods such as LiCamPose, SPIN, and BARS. The results show how well LFEMSFF maintains pose accuracy across long sequences with complex actions.
From Table 8, we observe that LFEMSFF outperforms the baseline methods in all action segments, with substantial improvements in both MPJPE and MPJPE-PA. Specifically, LFEMSFF achieves the lowest MPJPE in dribbling (64.76), shooting (66.34), and running (65.89), outperforming LiCamPose by 8.15, 8.89, and 8.67 points, respectively. These results highlight LFEMSFF's superior ability to maintain accurate pose estimation over extended sequences of dynamic actions. Moreover, LFEMSFF shows a clear advantage in inference time, processing each frame in an average of 10.5–12.8 s, faster than baseline methods such as BARS, which take significantly longer per frame. This efficiency makes LFEMSFF well suited to sports analytics and other action recognition tasks where speed is crucial.
The performance improvements are further validated by LFEMSFF’s consistency in handling complex actions like shooting, where rapid motion and occlusions might typically lead to performance degradation. The model’s focus on dynamic features, coupled with the spatio-temporal feature fusion module (MCSTFFM), allows it to maintain high accuracy even during fast-paced transitions between different actions.
4.13 Real-time performance
To assess the deployment feasibility of our proposed model in real-world scenarios, we evaluated the inference speed and latency of LFEMSFF and the Basketball Action Recognition System (BARS) under identical hardware settings (NVIDIA RTX 3060 GPU). Since BARS (Su et al. [18]) does not publicly release its code or runtime benchmarks, we reproduced its pipeline based on the methodological details described in the paper. Our implementation of BARS, which performs single-view 2D pose estimation with improved OpenPose, achieves approximately 30 frames per second (FPS) with an average inference latency of 38 ms. In contrast, LFEMSFF processes multi-view input streams using GCN and Transformer-based modules, introducing additional computational complexity. However, by applying TensorRT acceleration and CUDA-level optimization, LFEMSFF achieves 22 FPS and an average latency of 45 ms on the same hardware. Although BARS demonstrates faster response times, it lacks the spatio-temporal reasoning and robustness to occlusion provided by LFEMSFF. These results indicate that LFEMSFF offers a strong trade-off between accuracy and inference efficiency, making it particularly suitable for high-precision applications such as professional training, competitive sports analytics, and real-time decision-making under complex conditions.
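A latency/FPS measurement of the kind reported here can be sketched with a small harness; `model_fn` is a stand-in for the actual inference call, which is not part of the released material:

```python
import time

def benchmark(model_fn, frames, warmup=10):
    """Average per-frame latency (ms) and throughput (FPS)."""
    for f in frames[:warmup]:       # warm-up iterations excluded from timing
        model_fn(f)
    t0 = time.perf_counter()
    for f in frames:
        model_fn(f)
    dt = time.perf_counter() - t0
    latency_ms = 1000.0 * dt / len(frames)
    return latency_ms, 1000.0 / latency_ms
```

With GPU inference one would additionally synchronize the device before each timer read, so that queued CUDA kernels are included in the measured latency rather than overlapping with it.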
5 Discussion
In this paper, we propose a model called LFEMSFF for basketball pose detection, which integrates a local feature enhancement module, a multi-camera spatio-temporal feature fusion module, and a multi-task learning module to address the challenges of accurately recognizing and tracking athletes’ poses in dynamic and complex basketball game scenarios.
Experimental validation on the NBA2K dataset demonstrates that the model improves keypoint detection and pose estimation performance. Through detailed ablation analysis, we validate the contribution of each individual module to the overall accuracy and robustness of the model. The proposed model reduces the mean per-joint position error (MPJPE), providing more accurate motion capture. This work has practical implications for athlete training analysis, competitive performance improvement, and virtual reality applications.
Future research will consider real-time performance optimization of the model:
- Model architecture level: We replace certain GCN operations with EdgeConv to reduce computational overhead. Additionally, we incorporate local window attention mechanisms, such as Swin Transformer, to optimize Transformer computations and lower the complexity of spatio-temporal feature fusion.
- Model pruning and quantization: We introduce pruning and quantization strategies, including L1/L2 regularization pruning and 8-bit quantization, to reduce the model's parameter size and improve inference efficiency.
- Computational acceleration: We leverage CUDA parallel acceleration for GCN computations and integrate FlashAttention to enhance Transformer inference speed.
- Data processing level: Keyframe caching and incremental updates are implemented to reduce redundant computations and further improve inference efficiency.
This optimization allows the model to run under low-latency conditions, suitable for real-time sports event analysis and broadcasting. Although the present study focuses on basketball-specific datasets, we acknowledge the importance of validating the model's generalizability across broader sports and activity types. In future work, we plan to evaluate LFEMSFF on large-scale video datasets such as Sports-1M and UCF101. These datasets contain diverse motion patterns and environmental conditions, which will allow us to assess the robustness and transferability of the proposed framework in real-world, unconstrained scenarios.
Regarding real-world deployment, we acknowledge that training primarily on synthetic NBA2K data may not fully capture real-world complexities such as motion blur, lighting variations, and calibration drift. For multi-player scenarios with 10 players, computational cost scales linearly (8–10\(\times \) increase), requiring optimizations like selective tracking or distributed processing for real-time applications. Additionally, the method assumes calibrated multi-camera systems with at least 4 cameras for professional games. Future work should focus on domain adaptation techniques and computational optimizations to bridge the synthetic-to-real gap and enable practical deployment in live basketball broadcasting.
Beyond the basketball domain, the proposed LFEMSFF framework is inherently extensible to other sports and human activity contexts. Sports such as soccer, tennis, and volleyball involve frequent occlusions, fast player movements, and often benefit from multi-camera systems for accurate pose tracking. The modularity of LFEMSFF allows easy adaptation to these settings. For example, the GCN-based local enhancement module can be reused to capture player-specific joint motions, while the spatio-temporal fusion module can aggregate multi-view data from different sports arenas. Additionally, the model has potential applications in areas such as human-computer interaction, motion rehabilitation, and sports education analytics, where reliable multi-view pose estimation is required.
6 Conclusions
The LFEMSFF model achieves a breakthrough in basketball pose detection by combining local feature extraction, spatio-temporal feature fusion, and multi-task learning, significantly reducing errors in pose estimation and demonstrating high practical value. However, the model still requires further optimization for real-time performance to meet the demands of low-latency conditions in live sports event analysis and broadcasting. Future research will focus on lightweight model design and real-time performance enhancement to facilitate its widespread application in real-world scenarios.
Data availability
All data generated or analysed during this study are available to readers upon request to the first author WY-L.
References
Zhang Z, WAMBW PA, Zhang C, Mazalan NSB, Liu W. Robust pose estimation in sports with subspace adaptation and worst-case estimation. 2024.
Zhang Z, WAMBW PA, Zhang C, Mazalan NSB. Mixfean: enhancing multi-object tracking for intelligent sports analysis through conditional feature mixing and dynamic re-weighting. In: 2024 3rd international conference on robotics, artificial intelligence and intelligent control (RAIIC). IEEE; 2024. p. 475–80.
Ciaparrone G, Sánchez FL, Tabik S, Troiano L, Tagliaferri R, Herrera F. Deep learning in video multi-object tracking: a survey. Neurocomputing. 2020;381:61–88.
Kristan M, Perš J, Perše M, Kovacic S. Towards fast and efficient methods for tracking players in sports. In: Proceedings of the ECCV workshop on computer vision based analysis in sport environments; 2006. p. 14–25
Tran TH, Nguyen DT, Nguyen TP. Human posture classification from multiple viewpoints and application for fall detection. In: 2020 IEEE eighth international conference on communications and electronics (ICCE). IEEE; 2021. p. 262–7.
Moeslund TB, Granum E. A survey of computer vision-based human motion capture. Comput Vis Image Underst. 2001;81(3):231–68.
Kadkhodamohammadi A, Padoy N. A generalizable approach for multi-view 3D human pose regression. Mach Vis Appl. 2021;32(1):6.
Qiu H, Wang C, Wang J, Wang N, Zeng W. Cross view fusion for 3D human pose estimation. In: Proceedings of the IEEE/CVF international conference on computer vision; 2019. p. 4342–51.
Zheng C, Zhu S, Mendieta M, Yang T, Chen C, Ding Z. 3d human pose estimation with spatial and temporal transformers. In: Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 11656–65.
Islam Z, Hamza AB. Multi-hop graph transformer network for 3D human pose estimation. J Vis Commun Image Represent. 2024;101:104174.
Hassan T, Hamza AB. Spatio-temporal MLP-graph network for 3D human pose estimation. 2023. Preprint at arXiv:2308.15313
Charles J, Pfister T, Magee D, Hogg D, Zisserman A. Personalizing human video pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 3063–72.
Pishchulin L, Andriluka M, Gehler P, Schiele B. Strong appearance and expressive spatial models for human pose estimation. In: Proceedings of the IEEE international conference on computer vision; 2013. p. 3487–94.
Zhao L, Peng X, Tian Y, Kapadia M, Metaxas DN. Semantic graph convolutional networks for 3d human pose regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019. pp. 3425–35.
Sun K, Xiao B, Liu D, Wang J. Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019. p. 5693–703.
Islam Z, Hamza AB. Iterative graph filtering network for 3D human pose estimation. J Vis Commun Image Represent. 2023;95:103908.
Hassan MT, Hamza AB. Regular splitting graph network for 3D human pose estimation. IEEE Trans Image Process. 2023;32:4212–22.
Su Z. Designing a basketball action recognition system based on the improved openpose algorithm. In: 2024 Asia-pacific conference on image processing, electronics and computers (IPEC). IEEE; 2024. p. 25–9.
Lin K, Wang L, Liu Z. End-to-end human pose and mesh reconstruction with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 1954–63.
Kolotouros N, Pavlakos G, Daniilidis K. Convolutional mesh regression for single-image human shape reconstruction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019. p. 4501–10.
Kolotouros N, Pavlakos G, Black MJ, Daniilidis K. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In: Proceedings of the IEEE/CVF international conference on computer vision; 2019. p. 2252–61.
Zhu L, Rematas K, Curless B, Seitz SM, Kemelmacher-Shlizerman I. Reconstructing NBA players. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. Springer International Publishing; 2020. p. 177–94.
Zheng T. Quantifying team dynamics and performance: a hierarchical representation learning approach using the NBA 2k dataset. In: Proceedings of the 2023 8th international conference on information systems engineering; 2023. p. 101–7.
Pan Z, Zhong Z, Guo W, Chen Y, Feng J, Zhou J. LiCamPose: combining multi-view LiDAR and RGB cameras for robust single-frame 3D human pose estimation; 2024. Preprint at arXiv:2312.06409
Zhang Y, Li Z, An L, Li M, Yu T, Liu Y. Lightweight multi-person total motion capture using sparse multi-view cameras. In: Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 5560–9.
Chun S, Park S, Chang JY. Learnable human mesh triangulation for 3d human pose and shape estimation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision; 2023. p. 2850–9.
Chun S, Park S, Chang JY. Representation learning of vertex heatmaps for 3d human mesh reconstruction from multi-view images. In: 2023 IEEE international conference on image processing (ICIP). IEEE; 2023. p. 670–4.
Peng J, Zhou Y, Mok PY. KTPFormer: kinematics and trajectory prior knowledge-enhanced transformer for 3D human pose estimation; 2024. Preprint at arXiv:2404.00658
Tang T, Liu H, You Y, Wang T, Li W. Dual-branch graph transformer network for 3d human mesh reconstruction from video. In: 2024 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE; 2024. p. 11493–9.
Acknowledgements
First and foremost, I would like to express my gratitude to all the volunteers who dedicated their time and effort to support this research, making the data collection process successful. I am also deeply thankful to Dr. DENISE and team members for their valuable guidance and assistance throughout the research design, data analysis, and writing phases. Finally, I extend my sincere thanks to all the colleagues.
Funding
This research received no funding.
Author information
Authors and Affiliations
Contributions
WY-L: Conceptualization; Methodology; Investigation; Formal analysis; Data curation; Visualization; Validation; Writing - original draft; Writing - review and editing; Project administration. ZH-Z: Methodology; Investigation; Formal analysis; Data curation; Visualization; Validation. JG-Q: Conceptualization; Investigation; Supervision; Resources; Visualization; Validation; Writing - review and editing.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Clinical trial number
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Cite this article
Liu, W., Zhang, Z. & Qiu, J. Enhanced basketball pose estimation with spatio-temporal fusion and local feature learning. Discov Artif Intell 5, 344 (2025). https://doi.org/10.1007/s44163-025-00604-2