AAAI-2024: 5 Papers on 3D Human Pose

Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation

Paper explanation: http://www.studyai.com/xueshu/paper/detail/240d7ca15d
Paper link (DOI): 10.1609/aaai.v38i7.28549

Abstract

Most Graph Convolutional Network-based 3D human pose estimation (HPE) methods have focused on single-view 3D HPE and utilized predefined spatial graphs, suffering from key problems such as depth ambiguity, insufficient feature representation, or limited receptive fields.
To address these issues, we propose a multi-view 3D HPE framework based on deep semantic graph transformer, which adaptively learns and fuses multi-view significant semantic features of human nodes to improve 3D HPE performance.
First, we propose a deep semantic graph transformer encoder to enrich spatial feature information.
It deeply mines the position, spatial structure, and skeletal edge knowledge of joints and dynamically learns their correlations.
Then, we build a progressive multi-view spatial-temporal feature fusion framework to mitigate joint depth uncertainty.
To enhance the pose spatial representation, deep spatial semantic features are interacted and fused across different viewpoints during monocular feature extraction.
Furthermore, long-range temporal dependencies are modeled, and spatial-temporal information from all viewpoints is fused to provide intermediate supervision of depth.
Extensive experiments on three 3D HPE benchmarks show that our method achieves state-of-the-art results.
It can effectively enhance pose features, mitigate depth ambiguity in single-view 3D HPE, and improve 3D HPE performance without providing camera parameters.
Codes and models are available at https://github.com/z0911k/SGraFormer…
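
The abstract does not detail the encoder internals; as a rough illustration of the general idea, here is a minimal sketch of a transformer encoder layer over joint tokens whose attention is biased by a learnable joint-pair term, so that skeletal correlations are learned dynamically rather than fixed by a predefined graph. The module name, dimensions, and bias mechanism are our assumptions, not the authors' released implementation (see the GitHub link above for that).

```python
# Minimal sketch of a graph-transformer encoder layer over skeleton joints.
# All names and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn

class GraphTransformerLayer(nn.Module):
    def __init__(self, num_joints=17, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Learnable joint-pair attention bias (zero-initialized here; it
        # could instead be initialized from the skeleton adjacency matrix).
        self.adj_bias = nn.Parameter(torch.zeros(num_joints, num_joints))
        self.ff = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(),
                                nn.Linear(dim * 2, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                     # x: (batch, joints, dim)
        h = self.norm1(x)
        # attn_mask is added to the attention logits, so the learned bias
        # strengthens or weakens joint-to-joint correlations adaptively.
        h, _ = self.attn(h, h, h, attn_mask=self.adj_bias)
        x = x + h
        return x + self.ff(self.norm2(x))

x = torch.randn(2, 17, 64)                    # 2 poses, 17 joints, 64-d tokens
print(GraphTransformerLayer()(x).shape)       # torch.Size([2, 17, 64])
```
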

Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser

Paper explanation: http://www.studyai.com/xueshu/paper/detail/2b5b97f02e
Paper link (DOI): 10.1609/aaai.v38i2.27847

Abstract

Recently, diffusion-based methods for monocular 3D human pose estimation have achieved state-of-the-art (SOTA) performance by directly regressing the 3D joint coordinates from the 2D pose sequence.
Although some methods decompose the task into bone length and bone direction prediction based on the human anatomical skeleton to explicitly incorporate more human body prior constraints, the performance of these methods is significantly lower than that of the SOTA diffusion-based methods.
This can be attributed to the tree structure of the human skeleton.
Direct application of the disentangled method can amplify the accumulation of hierarchical errors, which propagate through each level of the hierarchy.
Meanwhile, hierarchical information has not been fully explored by previous methods.
To address these problems, a Disentangled Diffusion-based 3D human Pose Estimation method with Hierarchical Spatial and Temporal Denoiser is proposed, termed DDHPose.
In our approach: (1) We disentangle the 3D pose and diffuse the bone length and bone direction during the forward process of the diffusion model to effectively model the human pose prior.
A disentanglement loss is proposed to supervise diffusion model learning.
(2) For the reverse process, we propose a Hierarchical Spatial and Temporal Denoiser (HSTDenoiser) to improve the hierarchical modeling of each joint.
Our HSTDenoiser comprises two components: the Hierarchical-Related Spatial Transformer (HRST) and the Hierarchical-Related Temporal Transformer (HRTT).
HRST exploits joint spatial information and the influence of the parent joint on each joint for spatial modeling, while HRTT utilizes information from both the joint and its hierarchical adjacent joints to explore the hierarchical temporal correlations among joints.
Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets show that our method outperforms the SOTA disentangled-based, non-disentangled-based, and probabilistic approaches by 10.0%, 2.0%, and 1.3%, respectively…
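
The bone-length/bone-direction disentanglement itself is easy to make concrete. Below is a minimal sketch of decomposing a 17-joint pose into bone lengths and unit directions and reassembling it along the kinematic tree; the parent table is a hypothetical Human3.6M-style skeleton, and the walk through the tree shows why an error at a parent joint propagates to all of its children, the hierarchical error accumulation the paper targets.

```python
# Minimal sketch of the bone-length / bone-direction disentanglement
# described in the abstract. PARENTS is an assumed 17-joint hierarchy;
# the paper's official skeleton definition may differ.
import torch

PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def disentangle(pose):                 # pose: (..., 17, 3) joint coordinates
    joints = torch.arange(1, len(PARENTS))
    parents = torch.tensor(PARENTS[1:])
    bones = pose[..., joints, :] - pose[..., parents, :]   # (..., 16, 3)
    length = bones.norm(dim=-1, keepdim=True)              # bone lengths
    direction = bones / length.clamp(min=1e-8)             # unit directions
    return length, direction

def reassemble(length, direction, root):
    # Walk the kinematic tree from the root, accumulating bone vectors;
    # an error at a parent shifts every descendant joint, which is exactly
    # the hierarchical error accumulation the paper tries to control.
    pose = [root]
    for i, p in enumerate(PARENTS[1:]):
        pose.append(pose[p] + length[..., i, :] * direction[..., i, :])
    return torch.stack(pose, dim=-2)

pose = torch.randn(4, 17, 3)
l, d = disentangle(pose)
rec = reassemble(l, d, pose[..., 0, :])
print(torch.allclose(rec, pose, atol=1e-5))   # True
```
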

PoseGen: Learning to Generate 3D Human Pose Dataset with NeRF

Paper explanation: http://www.studyai.com/xueshu/paper/detail/d5dc0b5ba5
Paper link (DOI): 10.1609/aaai.v38i3.27960

Abstract

This paper proposes an end-to-end framework for generating 3D human pose datasets using Neural Radiance Fields (NeRF).
Public datasets generally have limited diversity in terms of human poses and camera viewpoints, largely due to the resource-intensive nature of collecting 3D human pose data.
As a result, pose estimators trained on public datasets significantly underperform when applied to unseen out-of-distribution (OOD) samples.
Previous works proposed augmenting public datasets by generating 2D-3D pose pairs or rendering a large amount of random data.
Such approaches either overlook image rendering or result in suboptimal datasets for pre-trained models.
Here we propose PoseGen, which learns to generate a dataset (human 3D poses and images) with a feedback loss from a given pre-trained pose estimator.
In contrast to prior art, our generated data is optimized to improve the robustness of the pre-trained model.
The objective of PoseGen is to learn a distribution of data that maximizes the prediction error of a given pre-trained model.
As the learned data distribution contains OOD samples of the pre-trained model, sampling data from such a distribution for further fine-tuning a pre-trained model improves the generalizability of the model.
This is the first work that proposes NeRFs for 3D human data generation.
NeRFs are data-driven and do not require 3D scans of humans.
Therefore, using NeRF for data generation is a new direction for convenient user-specific data generation.
Our extensive experiments show that the proposed PoseGen improves two baseline models (SPIN and HybrIK) on four datasets with an average 6% relative improvement…
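
Stripped of the NeRF rendering, PoseGen's feedback objective amounts to training a generator by gradient ascent on a frozen estimator's error. A minimal sketch with stand-in linear modules; everything here is illustrative, not the paper's pipeline:

```python
# Minimal sketch of a feedback objective: the generator learns a data
# distribution that maximizes a frozen pose estimator's prediction error.
# Both modules are toy stand-ins for the NeRF generator and the real
# pre-trained estimator (e.g. SPIN or HybrIK).
import torch
import torch.nn as nn

estimator = nn.Linear(32, 17 * 3)          # frozen pre-trained pose estimator
for p in estimator.parameters():
    p.requires_grad_(False)

generator = nn.Linear(8, 32)               # stand-in for the NeRF generator
opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

z = torch.randn(16, 8)                     # latent codes
gt_pose = torch.randn(16, 17 * 3)          # poses the generator "renders"

for _ in range(10):
    sample = generator(z)                  # stand-in for rendered images
    pred = estimator(sample)
    # Negated error: minimizing this loss is gradient ascent on the
    # estimator's mistake, pushing samples toward its OOD regions.
    loss = -(pred - gt_pose).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Fine-tuning the estimator on samples drawn from the learned distribution is then what improves its generalizability, per the abstract.
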

Neighborhood-Enhanced 3D Human Pose Estimation with Monocular LiDAR in Long-Range Outdoor Scenes

Paper explanation: http://www.studyai.com/xueshu/paper/detail/ecce3a2ec8
Paper link (DOI): 10.1609/aaai.v38i7.28545

Abstract

3D human pose estimation (3HPE) in large-scale outdoor scenes using commercial LiDAR has attracted significant attention due to its potential for real-life applications.
However, existing LiDAR-based methods for 3HPE primarily rely on recovering 3D human poses from individual point clouds, and the coherence cues present in the neighborhood are not sufficiently harnessed.
In this work, we explore the spatial and contextual coherence cues contained in the neighborhood, which lead to great performance improvements in 3HPE.
Firstly, we investigate the 3D neighbor in the background (3BN), which serves as a spatial coherence cue for inferring reliable motion, since it imposes physical constraints on motion targets.
Secondly, we introduce a novel 3D scanning neighbor (3SN), generated during data collection, which provides structural edge coherence cues.
We use 3SN to overcome the degradation of performance and data quality caused by the sparsity-varying properties of LiDAR point clouds.
In order to effectively model the complementation between these distinct cues and build consistent temporal relationships across human motions, we propose a new transformer-based module called the CoherenceFuse module.
Extensive experiments conducted on publicly available datasets, namely LidarHuman26M, CIMI4D, SLOPER4D, and the Waymo Open Dataset v2.0, showcase the superiority and effectiveness of our proposed method.
In particular, compared with LidarCap on the LidarHuman26M dataset, our method reduces the average MPJPE by 7.08 mm, and reduces the MPJPE for distances beyond 25 meters by 16.55 mm.
The code and models are available at https://github.com/jingyi-zhang/Neighborhood-enhanced-LidarCap…
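
For reference, the MPJPE numbers quoted above are mean per-joint position errors. A minimal NumPy implementation, including a distance-bucketed variant for the beyond-25 m comparison; the bucketing by sensor-to-root distance is our assumption of how such a breakdown is computed:

```python
# Reference implementation of MPJPE and a distance-bucketed variant.
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the same unit as the inputs."""
    # pred, gt: (frames, joints, 3)
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mpjpe_beyond(pred, gt, root_dist, threshold_m=25.0):
    """MPJPE restricted to frames whose subject is beyond threshold_m."""
    mask = root_dist > threshold_m     # (frames,) sensor-to-root distance
    return np.linalg.norm(pred[mask] - gt[mask], axis=-1).mean()

pred = np.random.randn(100, 24, 3)
gt = np.random.randn(100, 24, 3)
dist = np.random.uniform(5, 40, size=100)
print(mpjpe(pred, gt), mpjpe_beyond(pred, gt, dist))
```
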

Lifting by Image – Leveraging Image Cues for Accurate 3D Human Pose Estimation

Paper explanation: http://www.studyai.com/xueshu/paper/detail/fa613bef71
Paper link (DOI): 10.1609/aaai.v38i7.28596

Abstract

The “lifting from 2D pose” method has been the dominant approach to 3D Human Pose Estimation (3DHPE) due to the powerful visual analysis ability of 2D pose estimators.
As is widely known, there exists a depth ambiguity problem when estimating solely from the 2D pose, where one 2D pose can be mapped to multiple 3D poses.
Intuitively, the rich semantic and texture information in images can contribute to a more accurate “lifting” procedure.
Yet, existing research encounters two primary challenges.
Firstly, the distribution of image data in 3D motion capture datasets is too narrow because of the laboratory capture environment, which leads to poor generalization of methods trained with image information.
Secondly, effective strategies for leveraging image information are lacking.
In this paper, we offer new insight into the causes of the poor-generalization problem and into the effectiveness of image features.
Based on that, we propose an advanced framework.
Specifically, the framework consists of two stages.
First, we enable the keypoints to query and select the beneficial features from all image patches.
To reduce the keypoints' attention to inconsequential background features, we design a novel Pose-guided Transformer Layer, which adaptively limits the updates to unimportant image patches.
Then, through a designed Adaptive Feature Selection Module, we prune less significant image patches from the feature map.
In the second stage, we allow the keypoints to further emphasize the retained critical image features.
This progressive learning approach prevents further training on insignificant image features.
Experimental results show that our model achieves state-of-the-art performance on both the Human3.6M dataset and the MPI-INF-3DHP dataset…
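
The two mechanisms the abstract names, keypoint queries attending over image patches and pruning of less significant patches, can be sketched as follows. The layer below scores each patch by the total attention it receives from the keypoints and keeps only the top fraction; the class name, the scoring rule, and the keep ratio are illustrative assumptions, not the authors' Pose-guided Transformer Layer or Adaptive Feature Selection Module.

```python
# Minimal sketch: keypoints query image-patch tokens, then the least-
# attended patches are pruned before the next stage.
import torch
import torch.nn as nn

class PoseGuidedLayer(nn.Module):
    def __init__(self, dim=64, heads=4, keep_ratio=0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, keypoints, patches):
        # keypoints: (B, K, dim) queries; patches: (B, P, dim) image tokens
        out, weights = self.attn(keypoints, patches, patches)
        # Score each patch by the total attention it received from all
        # keypoints, then keep only the top fraction, pruning patches that
        # contribute little (background, in spirit).
        score = weights.sum(dim=1)                     # (B, P)
        k = max(1, int(patches.size(1) * self.keep_ratio))
        idx = score.topk(k, dim=-1).indices            # (B, k)
        kept = torch.gather(
            patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
        return out, kept

layer = PoseGuidedLayer()
kp = torch.randn(2, 17, 64)                # 17 keypoint tokens
pt = torch.randn(2, 196, 64)               # 14x14 image-patch tokens
refined, kept = layer(kp, pt)
print(refined.shape, kept.shape)           # (2, 17, 64) (2, 98, 64)
```

A second such stage over the retained patches would then mirror the progressive two-stage learning the abstract describes.
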
