Sci Rep. 2026 Mar 18. doi: 10.1038/s41598-026-42912-2. Online ahead of print.
ABSTRACT
Understanding human emotions across multiple modalities, such as text and images, is increasingly important for applications including content personalization, social media analysis, and Human-Computer Interaction (HCI). Conventional sentiment analysis methods often rely on a single modality, overlooking complementary information from other sources. This paper proposes a novel multimodal sentiment analysis framework that integrates text and image data. Text preprocessing includes tokenization, stopword removal, and stemming, while image preprocessing employs object detection. From the preprocessed text, N-grams, emojis, and Normalized Dispersion Coefficient (NDC)-based term frequency-inverse document frequency (TF-IDF) features are extracted. Improved multi-texton and Shape Local Binary Texture (SLBT) features are then derived from the preprocessed images. A hybrid sentiment analysis model is introduced, combining an optimized Deep Maxout model and a Modified Sigmoid (MS)-based Bidirectional Gated Recurrent Unit (MS-Bi-GRU) model. The extracted text features are fed to the optimized Deep Maxout model, while the MS-Bi-GRU model is trained on the image features. Specifically, a transfer learning strategy is employed in the MS-Bi-GRU model, leveraging the knowledge of a pre-trained model to train on the feature set more efficiently. The Deep Maxout weights are optimized using a novel Innovative Beluga Whale Optimization Algorithm (IBwOA). The final sentiment is determined by combining the outcomes of both models. Experimental results demonstrate that the proposed approach outperforms conventional models across multiple performance metrics.
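The text branch of the pipeline can be illustrated with a minimal sketch of the standard steps named above: tokenization, stopword removal, stemming, and TF-IDF feature extraction. Note this uses plain TF-IDF only; the paper's NDC-based weighting, emoji features, and N-gram extraction are not specified in the abstract and are not reproduced here. The stopword list and suffix-stripping stemmer are toy stand-ins for real components (e.g. a Porter stemmer).

```python
import math
import re

# Tiny illustrative stopword list; real systems use a full lexicon.
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def preprocess(text):
    """Tokenize, remove stopwords, and apply a crude suffix-stripping stemmer."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    stemmed = []
    for t in tokens:
        # Naive stemming: strip one common suffix (stand-in for Porter stemming).
        for suf in ("ing", "ed", "es", "s"):
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

def tfidf(raw_docs):
    """Return one {term: tf-idf weight} dict per document (standard smoothed IDF)."""
    docs = [preprocess(d) for d in raw_docs]
    n = len(docs)
    df = {}
    for d in docs:
        for term in set(d):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for d in docs:
        vec = {}
        for term in set(d):
            tf = d.count(term) / len(d)
            idf = math.log(n / df[term]) + 1.0
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors

corpus = ["The movie was amazing and moving",
          "The movie was boring"]
vectors = tfidf(corpus)
```

Terms unique to one document (e.g. the stem of "amazing") receive a higher weight than terms shared across the corpus (e.g. "movie"), which is the discriminative signal the downstream classifier consumes.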
PMID:41851277 | DOI:10.1038/s41598-026-42912-2
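The building block of the Deep Maxout classifier is the maxout unit, which takes the maximum over k affine "pieces": h(x) = max_j (w_j · x + b_j). The abstract does not specify the network's architecture or the IBwOA weight-update rule, so the sketch below shows only the activation itself, in plain Python.

```python
def maxout_unit(x, weights, biases):
    """One maxout unit: h(x) = max over pieces j of (w_j . x + b_j).

    x       -- input vector (list of floats)
    weights -- k weight vectors, one per affine piece
    biases  -- k biases, one per affine piece
    """
    return max(
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )

# With two pieces w=[1], b=0 and w=[-1], b=0, the unit computes |x|,
# illustrating that maxout learns piecewise-linear convex activations.
abs_like = maxout_unit([-2.0], [[1.0], [-1.0]], [0.0, 0.0])  # -> 2.0
```

Stacking layers of such units yields a Deep Maxout network; in the proposed framework its weights would be tuned by the IBwOA rather than by the toy values shown here.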

