AITEPose: Learning an End-to-End Monocular 3D Human Pose Estimator via Auxiliary-Information-Driven Training Enhancement

Bowei Xie, Geyuan Liu, Fang Deng, Maobin Lu*

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

Abstract

3D human pose estimation (3DHPE) from a single monocular RGB image is fundamental to many image-related fields, such as virtual reality, motion analysis, and human-computer interaction. To improve estimation accuracy, existing works typically integrate complex networks or divide monocular 3DHPE into multiple stages. However, complicating the estimation process in pursuit of accuracy sacrifices inference speed and limits practical application. To alleviate this, we propose AITEPose, an end-to-end model that achieves higher monocular 3DHPE accuracy with a simpler model structure. Specifically, inspired by online knowledge distillation, we design an Auxiliary-Information-Driven Training Enhancement (AITE) framework. In the AITE framework, during training, an adjustment network is introduced between the prediction network and the loss function to incorporate auxiliary information and enhance the training process. Notably, the adjustment network is constructed from a novel cascaded Disturbance-Correction Module (DCM), which refines the predicted poses using ground-truth bone lengths. Both AITE and DCM are employed only during training, thereby improving training outcomes without complicating the inference process. The AITEPose model achieves state-of-the-art performance for single-frame monocular 3DHPE on Human3.6M, the most comprehensive dataset for this task. To further validate the effectiveness of AITE and DCM, we design a monocular 2DHPE model, AITEPose2D, and conduct extensive ablation experiments on the COCO2017 dataset, demonstrating the robustness and generalizability of our proposed AITEPose.
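The abstract states that the DCM adjusts predicted poses using ground-truth bone lengths during training. As a rough intuition for that idea (a minimal sketch, not the paper's actual DCM: the function name, skeleton representation, and root-outward traversal are all assumptions), one can rescale each predicted bone to its ground-truth length while preserving its direction:

```python
import numpy as np

def correct_bone_lengths(joints, parents, gt_lengths):
    """Hypothetical bone-length correction, illustrating the general idea only.

    joints:     (J, 3) predicted 3D joint positions
    parents:    parents[j] is the parent index of joint j (-1 for the root);
                assumed topologically ordered so that parents[j] < j
    gt_lengths: gt_lengths[j] is the ground-truth length of the bone
                parents[j] -> j (ignored for the root)
    """
    corrected = joints.copy()
    for j in range(len(parents)):
        p = parents[j]
        if p < 0:
            continue  # the root joint has no incoming bone
        direction = joints[j] - joints[p]
        norm = np.linalg.norm(direction)
        if norm > 1e-8:
            direction = direction / norm
        # Keep the predicted bone direction, enforce the ground-truth length,
        # and attach to the already-corrected parent position.
        corrected[j] = corrected[p] + gt_lengths[j] * direction
    return corrected
```

Because this step consumes ground-truth information, it can only run at training time; at inference the prediction network alone produces the output, matching the training-only role of AITE and DCM described above.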
