TY - JOUR
T1 - A Low-Resolution Video Action Recognition Approach Based on Multi-Scale Reconstruction and Multi-Modal Fusion
AU - Zheng, Hui
AU - Zhao, Yesheng
AU - Zhang, Bo
AU - Shang, Guoqiang
AU - Al-Shamri, Mohammad Yahya H.
AU - Aldossary, Haya
N1 - Publisher Copyright:
© 1975-2011 IEEE.
PY - 2025
Y1 - 2025
N2 - The challenge of the low-resolution video action recognition task lies in recovering and extracting feature representations that can effectively capture action characteristics from limited semantic information. In this paper, we propose an approach to address this challenge, comprising a multi-scale reconstruction module and a multi-modal fusion module. In the multi-scale reconstruction module, we introduce a frequency-adaptive reconstruction model to recover lost information at multiple scales. For the crucial high-frequency sub-band images, we propose a wavelet-based super-resolution generative adversarial network to restore detailed information. In the multi-modal fusion module, we propose a two-stream Transformer-based network to mine joint spatial-temporal feature representations from the reconstructed video. Additionally, we utilize another Transformer model to fuse features from different modalities, capturing both consistent and complementary representations. Finally, the fused features are fed into a classifier for recognition. Experimental results show that our proposed model outperforms other models for low-quality action recognition on the HMDB51 (16×12: 58.70%, 14×14: 62.25%, 80×60: 68.94%), UCF101 (14×14: 76.74%, 28×28: 84.15%, 80×60: 92.78%), and Tiny-VIRAT (35.63%) datasets.
AB - The challenge of the low-resolution video action recognition task lies in recovering and extracting feature representations that can effectively capture action characteristics from limited semantic information. In this paper, we propose an approach to address this challenge, comprising a multi-scale reconstruction module and a multi-modal fusion module. In the multi-scale reconstruction module, we introduce a frequency-adaptive reconstruction model to recover lost information at multiple scales. For the crucial high-frequency sub-band images, we propose a wavelet-based super-resolution generative adversarial network to restore detailed information. In the multi-modal fusion module, we propose a two-stream Transformer-based network to mine joint spatial-temporal feature representations from the reconstructed video. Additionally, we utilize another Transformer model to fuse features from different modalities, capturing both consistent and complementary representations. Finally, the fused features are fed into a classifier for recognition. Experimental results show that our proposed model outperforms other models for low-quality action recognition on the HMDB51 (16×12: 58.70%, 14×14: 62.25%, 80×60: 68.94%), UCF101 (14×14: 76.74%, 28×28: 84.15%, 80×60: 92.78%), and Tiny-VIRAT (35.63%) datasets.
KW - Action recognition
KW - low-resolution
KW - multi-modal
KW - multi-scale
UR - https://www.scopus.com/pages/publications/85213284584
U2 - 10.1109/TCE.2024.3521512
DO - 10.1109/TCE.2024.3521512
M3 - Article
AN - SCOPUS:85213284584
SN - 0098-3063
VL - 71
SP - 970
EP - 983
JO - IEEE Transactions on Consumer Electronics
JF - IEEE Transactions on Consumer Electronics
IS - 1
ER -