Introduction

We present HAP, which introduces human structure priors (human parts) into masked image modeling (MIM) pre-training and yields substantial benefits across a range of human-centric perception tasks. HAP uses a plain ViT as the encoder, yet establishes new state-of-the-art performance on 11 human-centric benchmarks and on-par results on one dataset. For example, HAP achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation.

Framework

HAP is model-agnostic and can be integrated into various MIM methods.
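To make the idea of a human-part prior in MIM concrete, the following is a minimal sketch of part-guided mask sampling. It is not the HAP implementation: the function name `part_guided_mask`, the per-patch part-id input, and the parameters `mask_ratio` and `num_parts_to_mask` are all assumptions made for illustration. The sketch assumes each image patch has already been assigned a body-part id (0 for background), masks the patches of a few randomly chosen parts first, and fills the remaining mask budget with uniformly random patches. Because it only produces a patch mask, a sampler of this kind could in principle replace the random mask of any MIM method.

```python
import random


def part_guided_mask(part_ids, mask_ratio=0.75, num_parts_to_mask=2, seed=None):
    """Return a list of booleans over patches (True = masked).

    part_ids: one int per patch; 0 = background, >0 = hypothetical body-part id.
    All names and defaults here are illustrative assumptions, not HAP's settings.
    """
    rng = random.Random(seed)
    n = len(part_ids)
    n_mask = int(round(n * mask_ratio))

    # Pick a few body parts at random; their patches get masking priority.
    parts = sorted({p for p in part_ids if p > 0})
    chosen = set(rng.sample(parts, min(num_parts_to_mask, len(parts))))
    part_patches = [i for i, p in enumerate(part_ids) if p in chosen]
    rng.shuffle(part_patches)
    masked = part_patches[:n_mask]

    # Fill the rest of the mask budget with uniformly random patches.
    masked_set = set(masked)
    rest = [i for i in range(n) if i not in masked_set]
    rng.shuffle(rest)
    masked += rest[: n_mask - len(masked)]

    mask = [False] * n
    for i in masked:
        mask[i] = True
    return mask


# Usage: 8 patches, two parts (ids 1 and 2), mask half of the patches.
example = part_guided_mask([0, 0, 1, 1, 2, 2, 0, 0], mask_ratio=0.5,
                           num_parts_to_mask=2, seed=0)
```

In this usage example both parts are selected and their four patches exactly fill the mask budget, so the background patches stay visible. With a larger budget, the remainder is drawn at random, which keeps the overall mask ratio fixed regardless of how many patches the chosen parts cover.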

Results

With a plain ViT encoder, HAP establishes new state-of-the-art performance on 11 human-centric benchmarks and on-par results on one dataset.

Code

[HAP Github]

HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

Junkun Yuan¹,², Xinyu Zhang²#, Hao Zhou², Jian Wang², Zhongwei Qiu³, Zhiyin Shao⁴, Shaofeng Zhang⁵, Sifan Long⁶, Kun Kuang¹#, Kun Yao², Junyu Han², Errui Ding², Lanfen Lin¹, Fei Wu¹, and Jingdong Wang²#

¹Zhejiang University ²Baidu VIS ³University of Science and Technology Beijing ⁴South China University of Technology ⁵Shanghai Jiao Tong University ⁶Jilin University

*Equal contribution #Corresponding author
