Abstract
The challenge of low-resolution video action recognition lies in recovering and extracting feature representations that effectively capture action characteristics from limited semantic information. In this paper, we propose an approach to address this challenge that comprises two main components: a multi-scale reconstruction module and a multi-modal fusion module. In the multi-scale reconstruction module, we introduce a frequency-adaptive reconstruction model to reconstruct lost information at multiple scales. For the crucial high-frequency sub-band images, we propose a wavelet-based super-resolution generative adversarial network to recover detailed information. In the multi-modal fusion module, we propose a two-stream Transformer-based network to mine joint spatial-temporal feature representations from the reconstructed video. Additionally, we use another Transformer model to fuse features from different modalities, capturing both consistent and complementary representations. Finally, the fused features are fed into a classifier for recognition. Experimental results show that our proposed model outperforms other models for low-quality action recognition on the HMDB51 (16×12: 58.70%, 14×14: 62.25%, 80×60: 68.94%), UCF101 (14×14: 76.74%, 28×28: 84.15%, 80×60: 92.78%), and Tiny-VIRAT (35.63%) datasets.
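To illustrate what "high-frequency sub-band images" could look like in practice, the following is a minimal sketch of a single-level 2D wavelet decomposition of one frame. It assumes the Haar wavelet and one decomposition level purely for illustration; the abstract does not state which wavelet or how many levels the authors use, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical. Sub-band naming conventions (LH versus HL) also vary between libraries.

```python
def haar_dwt2(frame):
    """Single-level 2D Haar transform of a frame (list of rows).

    Assumption: Haar is used here only as the simplest wavelet; the
    paper's actual choice is not given in the abstract.
    Returns a dict with LL (low-frequency approximation) and the
    LH, HL, HH detail bands, each at half the input resolution.
    """
    rows, cols = len(frame), len(frame[0])
    if rows % 2 or cols % 2:
        raise ValueError("frame height and width must be even")
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for r in range(0, rows, 2):
        ll, lh, hl, hh = [], [], [], []
        for c in range(0, cols, 2):
            # One 2x2 block: a b / d e
            a, b = frame[r][c], frame[r][c + 1]
            d, e = frame[r + 1][c], frame[r + 1][c + 1]
            ll.append((a + b + d + e) / 2)  # local average (approximation)
            lh.append((a + b - d - e) / 2)  # top row minus bottom row
            hl.append((a - b + d - e) / 2)  # left column minus right column
            hh.append((a - b - d + e) / 2)  # diagonal difference
        bands["LL"].append(ll)
        bands["LH"].append(lh)
        bands["HL"].append(hl)
        bands["HH"].append(hh)
    return bands


def haar_idwt2(bands):
    """Invert haar_dwt2, rebuilding the full-resolution frame."""
    frame = []
    for ll, lh, hl, hh in zip(bands["LL"], bands["LH"], bands["HL"], bands["HH"]):
        top, bottom = [], []
        for s, v, h, x in zip(ll, lh, hl, hh):
            top += [(s + v + h + x) / 2, (s + v - h - x) / 2]
            bottom += [(s - v + h - x) / 2, (s - v - h + x) / 2]
        frame += [top, bottom]
    return frame


if __name__ == "__main__":
    tiny = [[1, 2], [3, 4]]
    sub = haar_dwt2(tiny)
    print(sub)
    print(haar_idwt2(sub))
```

In a pipeline of the kind the abstract describes, the detail bands would be the inputs a super-resolution network tries to restore before the inverse transform is applied; here the inverse simply confirms that the decomposition is lossless.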
| Original language | English |
|---|---|
| Pages (from-to) | 970-983 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Consumer Electronics |
| Volume | 71 |
| Issue number | 1 |
| State | Published - 2025 |
Keywords
- Action recognition
- low-resolution
- multi-modal
- multi-scale
Fingerprint
Dive into the research topics of 'A Low-Resolution Video Action Recognition Approach Based on Multi-Scale Reconstruction and Multi-Modal Fusion'. Together they form a unique fingerprint.