A Low-Resolution Video Action Recognition Approach Based on Multi-Scale Reconstruction and Multi-Modal Fusion

Hui Zheng; Yesheng Zhao; Bo Zhang; Guoqiang Shang; Mohammad Yahya H. Al-Shamri; Haya Aldossary

doi:10.1109/TCE.2024.3521512

Hui Zheng, Yesheng Zhao, Bo Zhang, Guoqiang Shang*, Mohammad Yahya H. Al-Shamri, Haya Aldossary

*Corresponding author for this work

Computer Science Department

Anhui University
Chinese Academy of Sciences
Yunnan University
CAS - Aerospace Information Research Institute
Space Engineering University
King Khalid University

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

The challenge of low-resolution video action recognition lies in recovering and extracting feature representations that effectively capture action characteristics from limited semantic information. In this paper, we propose an approach to address this challenge, which primarily comprises a multi-scale reconstruction module and a multi-modal fusion module. In the multi-scale reconstruction module, we introduce a frequency-adaptive reconstruction model to reconstruct lost information at multiple scales. For the crucial high-frequency sub-band images, we propose a wavelet-based super-resolution generative adversarial network to recover detailed information. In the multi-modal fusion module, we propose a two-stream Transformer-based network to mine joint spatial-temporal feature representations from the reconstructed video. Additionally, we use another Transformer model to fuse features from different modalities, capturing both consistent and complementary representations. Finally, the fused features are fed into a classifier for recognition. Experimental results show that our proposed model outperforms other models for low-quality action recognition on the HMDB51 (16×12: 58.70%, 14×14: 62.25%, 80×60: 68.94%), UCF101 (14×14: 76.74%, 28×28: 84.15%, 80×60: 92.78%), and Tiny-VIRAT (35.63%) datasets.
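To make the notion of "sub-band images" in the abstract concrete, the following is a minimal sketch of a single-level 2D Haar wavelet split of one grayscale frame into a low-frequency approximation and three high-frequency detail bands. The abstract does not state which wavelet, how many decomposition levels, or which normalization the authors use, so the Haar choice, the average/difference scaling, and the names `haar_dwt2` and `haar_idwt2` are all assumptions made for illustration; this is not the paper's reconstruction model.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def haar_dwt2(frame: Matrix) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """Single-level 2D Haar split of a frame with even height and width.

    Illustrative assumption: unnormalized average/difference filters.
    Returns four half-size sub-bands: the low-frequency approximation,
    then three high-frequency detail bands.
    """
    h, w = len(frame), len(frame[0])
    if h % 2 or w % 2:
        raise ValueError("frame height and width must be even")
    approx, d_cols, d_rows, d_diag = [], [], [], []
    for r in range(0, h, 2):
        top, bot = frame[r], frame[r + 1]
        a_row, c_row, r_row, g_row = [], [], [], []
        for c in range(0, w, 2):
            # The four pixels of one 2x2 block.
            p, q, s, t = top[c], top[c + 1], bot[c], bot[c + 1]
            a_row.append((p + q + s + t) / 4.0)  # local mean (low frequency)
            c_row.append((p - q + s - t) / 4.0)  # change across columns
            r_row.append((p + q - s - t) / 4.0)  # change across rows
            g_row.append((p - q - s + t) / 4.0)  # diagonal change
        approx.append(a_row)
        d_cols.append(c_row)
        d_rows.append(r_row)
        d_diag.append(g_row)
    return approx, d_cols, d_rows, d_diag


def haar_idwt2(approx: Matrix, d_cols: Matrix,
               d_rows: Matrix, d_diag: Matrix) -> Matrix:
    """Invert haar_dwt2, rebuilding the full-size frame from its sub-bands."""
    frame: Matrix = []
    for a_row, c_row, r_row, g_row in zip(approx, d_cols, d_rows, d_diag):
        top, bot = [], []
        for a, c, r, g in zip(a_row, c_row, r_row, g_row):
            top += [a + c + r + g, a - c + r - g]
            bot += [a + c - r - g, a - c - r + g]
        frame += [top, bot]
    return frame


# Usage: a synthetic frame with 12 rows and 16 columns (hypothetical values).
frame = [[float((r * 16 + c) % 7) for c in range(16)] for r in range(12)]
approx, d_cols, d_rows, d_diag = haar_dwt2(frame)
restored = haar_idwt2(approx, d_cols, d_rows, d_diag)
```

Each sub-band has half the height and width of the input, and the inverse transform recovers the frame from the four bands. In a super-resolution setting, the detail bands are the part a learned model would have to estimate, since downsampling discards most of that information.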

Original language	English
Pages (from-to)	970-983
Number of pages	14
Journal	IEEE Transactions on Consumer Electronics
Volume	71
Issue number	1
DOIs	https://doi.org/10.1109/TCE.2024.3521512
State	Published - 2025

Keywords

Action recognition
low-resolution
multi-modal
multi-scale

Access to Document

10.1109/TCE.2024.3521512

Cite this

@article{b018efc961e64c21953c7fb16194acbc,

title = "A Low-Resolution Video Action Recognition Approach Based on Multi-Scale Reconstruction and Multi-Modal Fusion",

abstract = "The challenge of low-resolution video action recognition task lies in recovering and extracting feature representations that can effectively capture action characteristics with limited semantic information. In this paper, we propose an approach to address this challenge, which primarily comprises a multi-scale reconstruction module and a multi-modal fusion module. In multi-scale reconstruction module, we introduce a frequency-adaptive reconstruction model to reconstruct lost information from multiple scales. For crucial high-frequency sub-band images, we propose a wavelet-based super-resolution generative adversarial network to recover detailed information. In multi-modal fusion module, we propose a two-stream Transformer-based network to mine spatial-temporal joint feature representations from the reconstructed video. Additionally, we utilize another Transformer model to fuse features from different modalities, capturing both consistent and complementary representations. Finally, the fused features are fed into a classifier for recognition. Experimental results show that our proposed model outperforms other models for low-quality action recognition on HMDB51 (16×12 58.70\%, 14×14 62.25\%, 80×60 68.94\%), UCF101 (14×14 76.74\%, 28×28 84.15 \%, 80×60 92.78\%), and Tiny-VIRAT (35.63\%) datasets.",

keywords = "Action recognition, low-resolution, multi-modal, multi-scale",

author = "Hui Zheng and Yesheng Zhao and Bo Zhang and Guoqiang Shang and Al-Shamri, {Mohammad Yahya H.} and Haya Aldossary",

note = "Publisher Copyright: {\textcopyright} 1975-2011 IEEE.",

year = "2025",

doi = "10.1109/TCE.2024.3521512",

language = "English",

volume = "71",

pages = "970--983",

journal = "IEEE Transactions on Consumer Electronics",

issn = "0098-3063",

number = "1",

}

TY - JOUR

T1 - A Low-Resolution Video Action Recognition Approach Based on Multi-Scale Reconstruction and Multi-Modal Fusion

AU - Zheng, Hui

AU - Zhao, Yesheng

AU - Zhang, Bo

AU - Shang, Guoqiang

AU - Al-Shamri, Mohammad Yahya H.

AU - Aldossary, Haya

PY - 2025

Y1 - 2025

N2 - The challenge of low-resolution video action recognition task lies in recovering and extracting feature representations that can effectively capture action characteristics with limited semantic information. In this paper, we propose an approach to address this challenge, which primarily comprises a multi-scale reconstruction module and a multi-modal fusion module. In multi-scale reconstruction module, we introduce a frequency-adaptive reconstruction model to reconstruct lost information from multiple scales. For crucial high-frequency sub-band images, we propose a wavelet-based super-resolution generative adversarial network to recover detailed information. In multi-modal fusion module, we propose a two-stream Transformer-based network to mine spatial-temporal joint feature representations from the reconstructed video. Additionally, we utilize another Transformer model to fuse features from different modalities, capturing both consistent and complementary representations. Finally, the fused features are fed into a classifier for recognition. Experimental results show that our proposed model outperforms other models for low-quality action recognition on HMDB51 (16×12 58.70%, 14×14 62.25%, 80×60 68.94%), UCF101 (14×14 76.74%, 28×28 84.15 %, 80×60 92.78%), and Tiny-VIRAT (35.63%) datasets.

AB - The challenge of low-resolution video action recognition task lies in recovering and extracting feature representations that can effectively capture action characteristics with limited semantic information. In this paper, we propose an approach to address this challenge, which primarily comprises a multi-scale reconstruction module and a multi-modal fusion module. In multi-scale reconstruction module, we introduce a frequency-adaptive reconstruction model to reconstruct lost information from multiple scales. For crucial high-frequency sub-band images, we propose a wavelet-based super-resolution generative adversarial network to recover detailed information. In multi-modal fusion module, we propose a two-stream Transformer-based network to mine spatial-temporal joint feature representations from the reconstructed video. Additionally, we utilize another Transformer model to fuse features from different modalities, capturing both consistent and complementary representations. Finally, the fused features are fed into a classifier for recognition. Experimental results show that our proposed model outperforms other models for low-quality action recognition on HMDB51 (16×12 58.70%, 14×14 62.25%, 80×60 68.94%), UCF101 (14×14 76.74%, 28×28 84.15 %, 80×60 92.78%), and Tiny-VIRAT (35.63%) datasets.

KW - Action recognition

KW - low-resolution

KW - multi-modal

KW - multi-scale

UR - https://www.scopus.com/pages/publications/85213284584

U2 - 10.1109/TCE.2024.3521512

DO - 10.1109/TCE.2024.3521512

M3 - Article

AN - SCOPUS:85213284584

SN - 0098-3063

VL - 71

SP - 970

EP - 983

JO - IEEE Transactions on Consumer Electronics

JF - IEEE Transactions on Consumer Electronics

IS - 1

ER -
