Enhancing Golf and Baseball Swing Markerless Motion Capture Using RTMPose and RTMDet: A Top-Down Approach – Swing Catalyst - Help Center

Abstract

This white paper documents the application of RTMPose and RTMDet for accurate and efficient pose estimation of golf and baseball swings. Leveraging state-of-the-art techniques optimized for real-time performance, these models enable detailed tracking of body movements during golf and baseball swings, a critical feature for improving performance in sports analytics. We highlight the advantages of a top-down approach, in which an off-the-shelf RTMDet detector identifies the golfer or baseball player in each frame and RTMPose estimates the positions of key body joints.

1. Introduction

Pose estimation has become pivotal in sports performance analysis, allowing for precise tracking of athletes' movements. In golf and baseball, capturing the biomechanical data of a player's swing provides valuable insights into swing dynamics, aiding professionals and amateurs alike in refining their techniques. Traditional 2D pose estimation methods often face latency and accuracy challenges, especially in real-time scenarios. This paper proposes a solution using RTMPose and RTMDet within the MMPose framework for detailed pose estimation during golf and baseball swings.

2. Background

The complexity of golf and baseball swings requires precise measurement of body movements. Existing pose estimation methods may not provide the necessary accuracy for real-time performance. Advances in deep learning and computer vision have introduced models like RTMPose and RTMDet, which offer improved accuracy and efficiency.

3. Top-Down Approach with RTMDet and RTMPose

--insert figure--

3. RTMPose: A High-Performance Pose Estimation Model

RTMPose [1] is designed for high-performance, real-time pose estimation, optimized to run efficiently on limited hardware.

Key Features:

Model Architecture and Efficiency: RTMPose utilizes CSPNeXt as its backbone [1, 2], balancing speed and accuracy. CSPNeXt is optimized for dense prediction tasks like pose estimation and object detection, providing high resolution and precision while maintaining computational efficiency.
Keypoint Prediction: Employs a SimCC-based algorithm [1, 3], treating the horizontal and vertical positions of keypoints as separate classification tasks. This compact representation reduces computational load and suits deployment on various devices.

4. RTMDet: The Detection Backbone

RTMDet [4] acts as the detector preceding RTMPose in the top-down pipeline, identifying the golfer's or baseball player’s location within each frame.

Key Features:

Model Architecture and Efficiency: RTMDet utilizes a modified version of CSPDarkNet [5] that is more trainable and precise than many of the YOLO models. The modified version leverages large-kernel depth-wise convolutions to balance complexity and speed and is efficient on both GPU and CPU, which makes it well suited to real-time applications like sports performance tracking.
Versatility: Handles various object detection tasks, including instance segmentation and rotated object detection. Ensures precise localization of the player, even in dynamic scenes.

5. Advantages of using RTMDet and RTMPose in Golf and Baseball Swing Analysis

5.1 Higher Accuracy in Uncrowded Scenes

In typical golf/baseball settings with few individuals in the frame, RTMDet isolates the golfer/baseball player, allowing RTMPose to process each detected person with high accuracy. This avoids the complexity of bottom-up methods, which process all keypoints for all persons in the frame simultaneously. The top-down approach can also include a post-processing step after RTMDet that identifies the correct person (i.e., the golfer or baseball player) before pose estimation is performed. In addition, RTMPose has been pre-trained on extended image material containing

5.2 Efficient Computation and Real-Time Performance

Using lightweight models such as RTMDet and RTMPose maintains low latency, enabling real-time swing analysis on consumer-grade hardware. This is particularly useful for providing immediate live feedback during coaching or training sessions. The Swing Catalyst markerless motion capture system is one of the few studio systems that provide live motion capture feedback to golfers and baseball players.

5.3 Detailed Keypoint Analysis

RTMPose detects a set of 26 body keypoints [6], shown in Figure 1 below, that are essential for analyzing golf and baseball swing kinematics. Halpe26 is an extended setup that includes additional markers on the feet and head compared with the more standard COCO setup of 17 markers.

--Insert Figure--

6. Methodology for Golf and Baseball Swing Markerless Motion Capture

6.1 Detection Phase: RTMDet

Applied to video frames of a golfer or baseball player, RTMDet generates bounding boxes around the player, which are passed to RTMPose. This focuses the pose estimation on relevant image regions, reducing computational load.
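The hand-off from detector to pose model can be sketched as follows. This is a minimal illustration, not the Swing Catalyst or MMPose implementation: `expand_box` is a hypothetical helper, boxes are assumed to be in (x1, y1, x2, y2) pixel format, the pose input is assumed to be 288 pixels wide by 384 pixels high, and the padding factor of 1.25 is an assumed value.

```python
def expand_box(box, target_w=288, target_h=384, padding=1.25):
    """Grow a detection box to the pose model's aspect ratio, then pad it.

    box: (x1, y1, x2, y2) in pixels.
    Returns (cx, cy, w, h): centre and size of the region to crop.
    All default values are assumptions made for illustration.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = float(x2 - x1), float(y2 - y1)
    aspect = target_w / target_h
    # Enlarge one side so that w / h matches the model input; never shrink.
    if w > aspect * h:
        h = w / aspect
    else:
        w = aspect * h
    # Padding keeps extremities such as raised hands inside the crop.
    return cx, cy, w * padding, h * padding


# A tall, narrow detection box is widened to the 3:4 aspect ratio.
print(expand_box((100, 50, 200, 450)))
```

Only the enlarged region is resized and passed to the pose model, which is where the reduction in computational load comes from.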

--Insert Image--

6.2 Pose Estimation Phase: RTMPose

RTMPose estimates keypoint positions within the bounding box. Critical joints for golf and baseball swing analysis include wrists, elbows, shoulders, hips, and knees. These keypoints are used to assess body angles and positions during the phases of the swing: backswing, downswing, and follow-through.
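As an illustration of how a body angle can be derived from three keypoints, the sketch below computes the angle at a joint from 2D image coordinates. The function name and the sample coordinates are hypothetical, and the result is an angle in the image plane, not a true 3D joint angle.

```python
import math


def joint_angle(a, b, c):
    """Angle at point b, in degrees, between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0.0:
        raise ValueError("degenerate joint: two keypoints coincide")
    # Clamp to guard against rounding just outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


# Hypothetical pixel coordinates for one arm in a single frame.
shoulder, elbow, wrist = (320.0, 200.0), (360.0, 260.0), (420.0, 260.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)
print(f"elbow angle: {elbow_angle:.1f} degrees")
```

Evaluating the same function on every frame yields an angle curve over the backswing, downswing, and follow-through.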

--Insert Image--

6.3 Performance Metrics

The general performance of RTMPose is measured using metrics such as Average Precision (AP) on pose estimation benchmarks such as MS COCO. The table below lists the best-ranked models on the commonly used COCO benchmark. On the MS COCO val dataset, RTMPose-X is the best-performing model that can still provide real-time feedback; it achieves up to 75.8% AP with frame rates exceeding ?? FPS on consumer-grade GPUs, making it suitable for high-speed sports analysis.

Rank	Model	Resolution	Params (millions)	AP	Real-time inference
1	Sapiens-2B	1024x768	2000	82.2	No
2	Sapiens-1B	1024x768	1000	82.1	No
3	Sapiens-0.6B	1024x768	600	81.2	No
4	Sapiens-0.3B	1024x768	300	79.6	No
5	ViTPose-H	256x192	632	79.4	No
6	RTMPose-X	384x288	49	78.8	Yes
7	ViTPose-L	256x192	307	78.6	No
8	RTMPose-L	384x288	28	78.3	Yes
9	HRFormer	256x192	43	77.2	No
10	HRNet-UDP	384x288	64	77.2	Yes
11	ViTPose-B	256x192	86	77.0	Yes
12	RTMPose-L	256x192	28	76.7	Yes
13	RTMPose-M	384x288	14	76.6	Yes
14	HRNet	384x288	64	76.3	Yes
15	ViTPose-S	256x192	43	75.8	Yes
16	RTMPose-M	256x192	14	74.9	Yes
17	SimpleBaseline	256x192	60	73.5	Yes
18	FastPose	256x192	79	73.3	Yes

7. Application in Golf Swing Analysis

Applying the RTMPose-X and RTMDet-M framework makes it possible to:

Track Joint Movements Frame-by-Frame: Provides comprehensive data for analyzing each phase of the swing.
Provide Real-Time Feedback: Enables immediate insights into swing posture and form during training sessions.
Compare with Ideal Mechanics: Allows comparison against ideal swing kinematics to identify areas for improvement.

8. Conclusion

The integration of RTMPose-X and RTMDet-M offers a powerful solution for real-time golf swing analysis. With high precision, low latency, and compatibility across various hardware platforms, this top-down approach delivers detailed insights into swing mechanics. It has significant potential to aid both amateur and professional golfers in enhancing their performance.

9. Future Work

Future developments could involve:

Integrating Machine Learning Algorithms: To provide predictive analytics and suggest adjustments for improving swing efficiency.
Expanding to Multi-Person Scenarios: Enhancing applicability in team sports or group training environments.
Developing a User-Friendly Interface: Creating applications or tools that make this technology accessible to coaches and athletes without technical expertise.

Appendix

Detailed Methodology: Top-Down Approach for Golf Swing Pose Estimation Using RTMPose-X and RTMDet-M

Overview

The methodology described here outlines the detailed steps involved in a top-down approach for real-time pose estimation of a golf and baseball swing, leveraging the strengths of RTMPose for keypoint localization and RTMDet for object detection. The process is divided into several stages: detection, keypoint localization, and post-processing, each contributing to the precise and efficient estimation of body joints in a golf swing for biomechanical analysis.

--Insert figure--

1. Detection Phase: Real-Time Localization with RTMDet-M

The first stage of the top-down approach involves detecting the golfer within each frame of the video. In sports scenarios, particularly golf, the scene usually consists of a single player, simplifying the detection task compared to crowd scenes.

1.1 Model Architecture

RTMDet-M is employed as the object detector in the pipeline. It uses a convolutional neural network (CNN) backbone, specifically the CSPNeXt backbone, designed to optimize real-time object detection performance while maintaining a balance between speed and accuracy. Key aspects of the architecture include:

Large-kernel depth-wise convolutions: These are utilized in the backbone and neck layers, increasing the receptive field while maintaining low computational cost.
Feature pyramid network (FPN): A multi-scale feature extraction technique that allows the detection of objects at various scales, ensuring that the golfer can be detected regardless of their distance from the camera.
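The low cost of large-kernel depth-wise convolutions can be checked with simple weight counting. The sketch below compares a standard 3x3 convolution with a 5x5 depth-wise convolution followed by a 1x1 point-wise convolution. The channel count of 256 and the 5x5 kernel size are assumptions chosen for illustration, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depth-wise conv plus a 1 x 1 point-wise conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes the channels
    return depthwise + pointwise


channels = 256  # assumed layer width, for illustration only
standard_3x3 = conv_params(channels, channels, 3)
large_dw_5x5 = depthwise_separable_params(channels, channels, 5)
print(standard_3x3, large_dw_5x5)  # 589824 71936
```

Under these assumptions the larger 5x5 receptive field costs roughly an eighth of the weights of the standard 3x3 layer.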

1.2 Dynamic Label Assignment

RTMDet-M leverages a dynamic label assignment strategy that improves detection accuracy by assigning soft labels to objects based on a combination of classification and localization loss. The label assignment is governed by the SimOTA algorithm, which dynamically selects positive samples based on their likelihood of matching the ground truth object. This method ensures robust detection in varying lighting and environmental conditions often encountered in outdoor golf scenes.

1.3 Bounding Box Prediction

The detector outputs bounding boxes that enclose the golfer in each frame. These bounding boxes provide spatial constraints within which the pose estimation model will operate, reducing the computational load on the subsequent pose estimation phase by focusing only on relevant areas of the frame. In this context, RTMDet-M generates bounding boxes in real-time at over 300 FPS on high-performance hardware, ensuring that it can keep up with the fast dynamics of a golf swing.

1.4 Person Non-Maximum Suppression (NMS)

In multi-person settings (although rare in golf swing analysis), RTMDet-M incorporates a pose Non-Maximum Suppression (NMS) algorithm that eliminates redundant keypoint detections, ensuring that only the most confident detection is retained for each person. This is crucial in instances where overlapping bounding boxes might be detected in crowded scenes or video sequences.
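The general idea of suppression can be sketched with the classic greedy, box-level NMS. This is an illustrative variant operating on bounding boxes, not necessarily the pose-level algorithm used in the pipeline. The function names and the IoU threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68 and is dropped; the third does not overlap and is kept.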

1.5 Training Dataset and Performance

RTMDet-M is trained on a binary classification task on the person instances in the Objects365 dataset.

2. Pose Estimation Phase: RTMPose-X Keypoint Localization

Once the bounding box for the golfer has been established, the next phase involves estimating the precise location of key body joints within this region. RTMPose-X, a high-performance pose estimation model, is utilized for this purpose.

2.1 SimCC-Based Keypoint Localization

RTMPose-X employs the SimCC (Simple Coordinate Classification) algorithm, which treats keypoint localization as a classification problem. In contrast to traditional heatmap-based methods, SimCC divides the x and y coordinates of each keypoint into bins and classifies the exact bin where each keypoint lies. This approach significantly reduces computational complexity and improves inference speed while maintaining high accuracy for human pose estimation tasks.
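The decoding step of a coordinate-classification scheme can be sketched as follows. This is a minimal illustration rather than the MMPose implementation: `simcc_decode` is a hypothetical name, the split ratio of 2.0 (two bins per pixel) is an assumed value, and taking the smaller of the two peak scores as the confidence is also an assumption.

```python
def simcc_decode(x_scores, y_scores, split_ratio=2.0):
    """Decode one keypoint from its two 1D classification vectors.

    x_scores / y_scores: per-bin scores along the horizontal / vertical
    axis. With split_ratio bins per pixel, bin index / split_ratio is the
    coordinate in input-image pixels.
    """
    x_bin = max(range(len(x_scores)), key=x_scores.__getitem__)
    y_bin = max(range(len(y_scores)), key=y_scores.__getitem__)
    # Assumed confidence rule: the weaker of the two axis peaks.
    confidence = min(x_scores[x_bin], y_scores[y_bin])
    return x_bin / split_ratio, y_bin / split_ratio, confidence


# Toy example: a 4-pixel-wide, 2-pixel-high crop gives 8 and 4 bins.
x_scores = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
y_scores = [0.0, 0.9, 0.0, 0.0]
print(simcc_decode(x_scores, y_scores))  # (2.5, 0.5, 0.9)
```

Because more than one bin is used per pixel, the decoded coordinate can fall between pixels without any 2D heatmap being produced.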

2.2 CSPNeXt Backbone

Similar to RTMDet-M, RTMPose-X also uses the CSPNeXt backbone, which is tailored for dense prediction tasks such as pose estimation. The CSPNeXt backbone is advantageous in this scenario for the following reasons:

Lightweight architecture: The model’s architecture is designed to minimize the number of parameters while maximizing the throughput, making it ideal for real-time applications.
Efficient feature extraction: CSPNeXt’s feature extraction layers are optimized to process high-resolution images, which is crucial for detecting small details in fast-moving body parts during a golf swing, such as wrists, elbows, and knees.

2.3 Keypoint Representation

RTMPose-X outputs keypoint locations for all relevant body parts, including:

Upper body joints: shoulders, elbows, wrists, and neck
Lower body joints: hips, knees, and ankles
Additional joints: head, spine, and other key points relevant for swing analysis

The resolution of 384x288 for the input images ensures that even subtle movements in the joints can be captured accurately, while also maintaining the system’s ability to run in real time.

2.4 RTMPose preprocessing: Unbiased Data Processing (UDP)

Before the cropped image is fed into the RTMPose model, an Unbiased Data Processing (UDP) step is performed. UDP addresses critical biases in the data processing of RTMPose during training and testing, specifically in coordinate system and keypoint format transformations. In conventional human pose estimation pipelines, standard operations such as flipping and resizing often misalign outputs, especially due to pixel-based transformations, which lead to precision loss and non-alignment of flipped images. UDP corrects these by establishing an unbiased coordinate system transformation, preserving semantic alignment across different coordinate spaces during essential operations (cropping, resizing, rotating, flipping). UDP also introduces unbiased keypoint format transformation by encoding keypoints into heatmaps without introducing positional bias, further refined through a Gaussian distribution-aware decoding process. This data processing approach systematically improves model performance, as shown in extensive tests on COCO and CrowdPose datasets, where it achieved enhanced accuracy and reduced inference latency across top-down and bottom-up models [Ref].
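The flip misalignment can be illustrated in one dimension. The sketch below assumes a 192-pixel-wide input, a 48-pixel-wide output grid, and a horizontal flip defined as x -> (width - 1) - x; the function names are hypothetical. It compares scaling by pixel count with scaling by the distance between the first and last pixel centres, which is the unit-length idea behind the unbiased transformation.

```python
def scale_by_pixel_count(x, w_in, w_out):
    """Conventional resize: ratio of the two pixel counts."""
    return x * w_out / w_in


def scale_by_unit_length(x, w_in, w_out):
    """Unbiased resize: ratio of the first-to-last pixel-centre spans."""
    return x * (w_out - 1) / (w_in - 1)


def flip(x, width):
    """Horizontal flip of a pixel-centre coordinate."""
    return (width - 1) - x


w_in, w_out, x = 192, 48, 100.0

# Flip then resize versus resize then flip: both orders should agree.
biased_gap = (scale_by_pixel_count(flip(x, w_in), w_in, w_out)
              - flip(scale_by_pixel_count(x, w_in, w_out), w_out))
unbiased_gap = (scale_by_unit_length(flip(x, w_in), w_in, w_out)
                - flip(scale_by_unit_length(x, w_in, w_out), w_out))
print(biased_gap, unbiased_gap)
```

With pixel-count scaling the two orders of operation disagree by 0.75 output pixels in this example, while unit-length scaling keeps them aligned up to floating-point rounding.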

3. Post-Processing and Pose Refinement

Once the keypoints are predicted, several post-processing steps are applied to refine the pose estimation and ensure stability across frames.

3.1 Pose Smoothing

Golf swings involve rapid motion, which can introduce noise or fluctuations in the estimated keypoint positions across frames. To mitigate this, a One-Euro Filter is applied to smooth the keypoint trajectories over time, ensuring that small, non-physical fluctuations in the keypoint predictions are eliminated. The One-Euro Filter operates by adjusting the filter’s bandwidth dynamically based on the speed of the motion, which is ideal for scenarios like golf swings, where the motion varies significantly in speed across different phases (backswing, downswing, and follow-through).
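A compact version of the One-Euro Filter for a single coordinate is sketched below. It follows the commonly published form of the algorithm. The parameter values in the example are assumptions, not the settings used in the Swing Catalyst system, and in practice one filter instance would be kept per keypoint coordinate.

```python
import math


class OneEuroFilter:
    """Adaptive low-pass filter for one scalar signal sampled at `freq` Hz."""

    def __init__(self, freq, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.freq = freq              # sampling rate, e.g. camera FPS
        self.min_cutoff = min_cutoff  # cutoff (Hz) used when motion is slow
        self.beta = beta              # how quickly cutoff grows with speed
        self.d_cutoff = d_cutoff      # cutoff (Hz) for the speed estimate
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * self.freq)

    def __call__(self, x):
        if self.x_prev is None:       # first sample passes through
            self.x_prev = x
            return x
        # Smoothed estimate of the signal's speed.
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Faster motion raises the cutoff, so less smoothing and less lag.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat


# Assumed settings: 120 FPS video, wrist x-coordinate in pixels.
wrist_x = OneEuroFilter(freq=120.0, min_cutoff=1.0, beta=0.01)
smoothed = [wrist_x(v) for v in (400.0, 401.0, 399.5, 400.5, 430.0)]
print(smoothed)
```

A small `min_cutoff` suppresses jitter at address and at the top of the backswing, while a larger `beta` reduces lag during the downswing.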

3.2 Frame Skip Mechanism

For further optimization, a frame skip mechanism is implemented, where detection is performed on keyframes only, and pose estimation is interpolated for intermediate frames. This drastically reduces computational load without sacrificing accuracy in scenarios with limited motion between frames, such as slow-motion analysis of a golf swing.
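One simple way to fill the skipped frames is linear interpolation between the poses of neighbouring keyframes, sketched below. The function names and the stride are hypothetical, and the choice of linear interpolation is an assumption; the source does not specify the interpolation scheme.

```python
def interpolate_keypoints(kpts_a, kpts_b, t):
    """Linearly blend two poses; t=0 gives kpts_a and t=1 gives kpts_b."""
    return [
        (xa + t * (xb - xa), ya + t * (yb - ya))
        for (xa, ya), (xb, yb) in zip(kpts_a, kpts_b)
    ]


def fill_skipped_frames(keyframes, stride):
    """Expand poses computed every `stride` frames into one pose per frame."""
    frames = []
    for a, b in zip(keyframes, keyframes[1:]):
        for step in range(stride):
            frames.append(interpolate_keypoints(a, b, step / stride))
    if keyframes:
        frames.append(keyframes[-1])  # the final keyframe closes the sequence
    return frames


# Two keyframes with a single keypoint each, four frames apart.
keyframes = [[(0.0, 0.0)], [(4.0, 8.0)]]
for pose in fill_skipped_frames(keyframes, stride=4):
    print(pose)
```

Interpolating in this way needs the next keyframe before the intermediate poses can be produced, so it adds up to `stride` frames of delay in a live setting.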

4. Temporal Tracking and Sequence Consistency

Given that golf swings are inherently sequential, maintaining temporal consistency in pose estimation is vital. RTMPose-X addresses this through temporal tracking techniques, which ensure that the keypoint predictions are consistent across consecutive frames. This involves tracking keypoint positions over time and ensuring that their trajectories follow realistic motion patterns based on biomechanical constraints.

4.1 Keypoint Velocity and Acceleration Analysis

In addition to tracking keypoint positions, RTMPose-X also estimates the velocity and acceleration of each keypoint. This information is critical for analyzing the dynamics of a golf swing, providing insight into key performance metrics such as:

Swing speed: Calculated based on wrist velocity during the downswing.
Hip rotation: Analyzed through the rotational velocity of the hip joints.
Club path and head speed: Inferred indirectly from wrist and elbow trajectories.

These metrics can be compared against professional benchmarks to offer feedback on a player’s swing mechanics.
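Velocity and acceleration can be estimated from a keypoint trajectory with finite differences, as in the sketch below. It assumes a constant frame rate and uses central differences; the function name is hypothetical, and the results are in pixels per second and pixels per second squared unless the camera is calibrated to physical units.

```python
def keypoint_derivatives(track, fps):
    """Central-difference velocity and acceleration for one keypoint.

    track: list of (x, y) positions, one per frame.
    Returns (velocities, accelerations) for the interior frames only,
    so each list has len(track) - 2 entries.
    """
    dt = 1.0 / fps
    velocities, accelerations = [], []
    for i in range(1, len(track) - 1):
        (x0, y0), (x1, y1), (x2, y2) = track[i - 1], track[i], track[i + 1]
        velocities.append(((x2 - x0) / (2.0 * dt), (y2 - y0) / (2.0 * dt)))
        accelerations.append(((x2 - 2.0 * x1 + x0) / dt ** 2,
                              (y2 - 2.0 * y1 + y0) / dt ** 2))
    return velocities, accelerations


# Toy trajectory x = t^2 sampled at 1 frame per second: constant acceleration.
track = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (9.0, 0.0)]
vel, acc = keypoint_derivatives(track, fps=1.0)
print(vel)  # [(2.0, 0.0), (4.0, 0.0)]
print(acc)  # [(2.0, 0.0), (2.0, 0.0)]
```

Differentiation amplifies noise, so the trajectories are normally smoothed before these quantities are computed.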

5. Inference and Real-Time Performance

The entire top-down pipeline is optimized for real-time performance, allowing for pose estimation at over 90 FPS on modern GPUs. The use of highly efficient model architectures (CSPNeXt) and fast inference techniques (SimCC) ensures that the system can handle high frame rate video input, making it suitable for real-time feedback during training sessions.

6. Evaluation and Validation

The RTMPose-X and RTMDet-M models are evaluated on standard datasets such as COCO and MPII, showing strong performance with an average precision (AP) of 75.8% on the COCO dataset for body keypoints. These results are validated against ground-truth annotations in golf swing datasets, ensuring the robustness of the model in capturing dynamic sports movements.

6.1 Performance Metrics

Mean Squared Error (MSE): Used to quantify the accuracy of keypoint predictions against ground truth annotations.
Average Precision (AP): Evaluates the overall performance of the pose estimation model.
Frame Processing Time: Benchmarked to ensure the system meets real-time requirements (<10 ms per frame).
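Two of these checks are simple enough to sketch directly. The function names below are hypothetical, the error is computed as the mean squared Euclidean distance per keypoint, and the real-time check uses the 10 ms budget quoted above.

```python
def keypoint_mse(pred, truth):
    """Mean squared Euclidean distance between matching keypoints."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and equal in length")
    total = sum((px - tx) ** 2 + (py - ty) ** 2
                for (px, py), (tx, ty) in zip(pred, truth))
    return total / len(pred)


def meets_realtime(frame_times_ms, budget_ms=10.0):
    """True if every measured frame time is under the latency budget."""
    return bool(frame_times_ms) and max(frame_times_ms) < budget_ms


pred = [(0.0, 0.0), (3.0, 4.0)]
truth = [(0.0, 0.0), (0.0, 0.0)]
print(keypoint_mse(pred, truth))        # 12.5
print(meets_realtime([7.9, 8.4, 9.1]))  # True
```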

7. Conclusion

The top-down approach using RTMPose-X and RTMDet-M provides an efficient and accurate method for real-time pose estimation in sports analytics, specifically for golf swing analysis. With robust keypoint detection, temporal tracking, and real-time inference, this methodology offers detailed biomechanical insights into golf swing dynamics, aiding in performance improvement and injury prevention.

References

[1] RTMPose: https://arxiv.org/pdf/2303.07399

[2] CSPNeXt: https://www.sciencedirect.com/science/article/pii/S0952197624000447

[3] SimCC: https://arxiv.org/abs/2107.03332

[4] RTMDet: https://arxiv.org/pdf/2212.07784

[5] CSPDarkNet

[6] Halpe26

[] AI challenge dataset:

[] MS COCO dataset:

[7] CrowdPose dataset: https://arxiv.org/pdf/1812.00324

[] MPII dataset:

[] sub-JHMDB dataset: