MDA-MAA: A Collaborative Augmentation Approach for Generalizing Cross-Domain Retrieval

Posted on 02/02/2026

by Ming Jin

IEEE Trans Image Process. 2026 Feb 2;PP. doi: 10.1109/TIP.2026.3658223. Online ahead of print.

ABSTRACT

In video-text cross-domain retrieval, the generalization ability of retrieval models is crucial to both their performance and their practical applicability. However, existing retrieval models exhibit significant deficiencies in cross-domain generalization. On the one hand, models tend to overfit the data of their specific training domain, which leads to poor cross-domain matching and significantly reduced retrieval accuracy on data from different, new, or mixed domains. On the other hand, although data augmentation is a vital strategy for enhancing model generalization, most existing methods focus on unimodal augmentation and fail to fully exploit the multimodal correlations between video and text. As a result, the augmented data lack semantic diversity, which further limits the model's ability to understand and perform in complex cross-domain scenarios. To address these challenges, this paper proposes a collaborative augmentation approach named MDA-MAA, which comprises two core modules: the Masked Attention Augmentation (MAA) module and the Multimodal Diffusion Augmentation (MDA) module. The MAA module masks the original video frame features and uses an attention mechanism to predict the masked features, reducing overfitting to the training data and enhancing model generalization. The MDA module generates subtitles from video frames and uses the LLaMA model to infer comprehensive video captions. These captions, together with the original video frames, are fed into a diffusion model for joint learning, which generates semantically enriched augmented video frames. This process leverages the multimodal relationship between video and text to increase the diversity of the training data distribution. Experimental results demonstrate that this collaborative augmentation method significantly improves the performance of video-text cross-domain retrieval models, validating its effectiveness in enhancing model generalization.
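The masking-and-prediction idea behind the MAA module can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the masking ratio, the attention design, or any training objective. The sketch below assumes random frame masking and a single untrained scaled dot-product attention step in which sinusoidal positional encodings stand in for learned queries and keys. All function names and parameters (`choose_masked`, `predict_masked`, `mask_ratio`, `pos_dim`) are hypothetical.

```python
import math
import random


def positional_encoding(pos, dim):
    """Sinusoidal encoding of a frame index (assumed stand-in for learned queries/keys)."""
    enc = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def choose_masked(num_frames, mask_ratio, rng):
    """Randomly choose frame indices to mask, always leaving at least one visible."""
    count = min(num_frames - 1, max(1, round(num_frames * mask_ratio)))
    return sorted(rng.sample(range(num_frames), count))


def predict_masked(features, masked, pos_dim=8):
    """Replace each masked frame feature with an attention-weighted mix of visible ones."""
    visible = [i for i in range(len(features)) if i not in masked]
    keys = [positional_encoding(i, pos_dim) for i in visible]
    output = [list(f) for f in features]
    for m in masked:
        query = positional_encoding(m, pos_dim)
        # Scaled dot-product scores between the masked position and each visible one.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(pos_dim)
            for key in keys
        ]
        weights = softmax(scores)
        output[m] = [
            sum(w * features[v][d] for w, v in zip(weights, visible))
            for d in range(len(features[m]))
        ]
    return output


rng = random.Random(0)
frames = [[float(t), float(t) * 2.0] for t in range(6)]  # toy 2-D frame features
masked = choose_masked(len(frames), mask_ratio=0.5, rng=rng)
augmented = predict_masked(frames, masked)
```

In this sketch the visible frames pass through unchanged, and each masked frame becomes a convex combination of the visible ones. A trained module would presumably learn the query, key, and value projections rather than rely on fixed positional encodings.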

PMID:41628037 | DOI:10.1109/TIP.2026.3658223