A Low-Resolution Video Action Recognition Approach Based on Multi-Scale Reconstruction and Multi-Modal Fusion

  • Hui Zheng
  • Yesheng Zhao
  • Bo Zhang
  • Guoqiang Shang*
  • Mohammad Yahya H. Al-Shamri
  • Haya Aldossary

  * Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

The challenge of low-resolution video action recognition lies in recovering and extracting feature representations that effectively capture action characteristics from limited semantic information. In this paper, we propose an approach to address this challenge, which primarily comprises a multi-scale reconstruction module and a multi-modal fusion module. In the multi-scale reconstruction module, we introduce a frequency-adaptive reconstruction model to reconstruct lost information at multiple scales. For the crucial high-frequency sub-band images, we propose a wavelet-based super-resolution generative adversarial network to recover detailed information. In the multi-modal fusion module, we propose a two-stream Transformer-based network to mine joint spatial-temporal feature representations from the reconstructed video. Additionally, we use another Transformer model to fuse features from different modalities, capturing both consistent and complementary representations. Finally, the fused features are fed into a classifier for recognition. Experimental results show that the proposed model outperforms other models for low-quality action recognition on the HMDB51 (16×12: 58.70%, 14×14: 62.25%, 80×60: 68.94%), UCF101 (14×14: 76.74%, 28×28: 84.15%, 80×60: 92.78%), and Tiny-VIRAT (35.63%) datasets.
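
The abstract does not state which wavelet family or how many decomposition levels the reconstruction module uses. As a minimal illustrative sketch, assuming a single-level 2D Haar transform (an assumption, not necessarily the authors' choice), the following shows how a frame can be split into one low-frequency approximation and three high-frequency detail sub-bands, and how the frame is recovered from them. The function names haar_dwt2 and haar_idwt2 are hypothetical.

```python
def haar_dwt2(img):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Returns four sub-bands, each half the input size in both dimensions:
    approx (low frequency), row_diff, col_diff, diag (high-frequency details).
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    approx, row_diff, col_diff, diag = [], [], [], []
    for i in range(0, h, 2):
        r_a, r_r, r_c, r_d = [], [], [], []
        for j in range(0, w, 2):
            # One 2x2 block: a b / c d
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_a.append((a + b + c + d) / 2)  # local average (scaled)
            r_r.append((a + b - c - d) / 2)  # top row minus bottom row
            r_c.append((a - b + c - d) / 2)  # left column minus right column
            r_d.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(r_a)
        row_diff.append(r_r)
        col_diff.append(r_c)
        diag.append(r_d)
    return approx, row_diff, col_diff, diag


def haar_idwt2(approx, row_diff, col_diff, diag):
    """Invert haar_dwt2, rebuilding the full-size image from its sub-bands."""
    out = []
    for r_a, r_r, r_c, r_d in zip(approx, row_diff, col_diff, diag):
        top, bottom = [], []
        for s, r, c, d in zip(r_a, r_r, r_c, r_d):
            top += [(s + r + c + d) / 2, (s + r - c - d) / 2]
            bottom += [(s - r + c - d) / 2, (s - r - c + d) / 2]
        out += [top, bottom]
    return out


# Toy 4x4 "frame" used only for illustration.
frame = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
approx, row_diff, col_diff, diag = haar_dwt2(frame)
restored = haar_idwt2(approx, row_diff, col_diff, diag)
```

In a pipeline of the kind the abstract outlines, the detail sub-bands would be the inputs that a super-resolution network tries to restore. Here the inverse transform only demonstrates that the decomposition itself loses no information.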

Original language: English
Pages (from-to): 970-983
Number of pages: 14
Journal: IEEE Transactions on Consumer Electronics
Volume: 71
Issue number: 1
DOIs
State: Published - 2025

Keywords

  • Action recognition
  • low-resolution
  • multi-modal
  • multi-scale
