Memes, user-generated content that combines image and text, have become a powerful medium for shaping public discourse. Given their growing influence, detecting the persuasive techniques embedded in this multimodal form of communication is crucial for identifying propaganda and combating online disinformation. Persuasion techniques in memes often combine rhetorical elements from both text and image, which poses unique challenges for computational models.
This thesis seeks to determine the impact of multimodal integration on the detection of persuasion techniques in memes and to evaluate how well multimodal models perform compared to single-modality models in this classification task. To achieve this, we developed and fine-tuned several models for text-based and multimodal persuasion detection, drawing on pre-trained language models (BERT, XLM-RoBERTa, mBERT) as well as image and multimodal models (CLIP, ResNet, VisualBERT). A key contribution of this work is the implementation of paraphrase-based data augmentation, which helped address class imbalance and improved the performance of text-only models. For the multimodal approaches, we explored both early fusion and cross-modal alignment strategies.
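As an illustration of the augmentation step, the following is a minimal sketch, assuming an off-the-shelf seq2seq paraphrase model from the Hugging Face Hub is used to generate text variants for under-represented persuasion techniques; the model name, prompt format, label structure, and threshold are illustrative placeholders, not the thesis's exact pipeline.

```python
# Hypothetical sketch of paraphrase-based data augmentation: meme texts from
# under-represented classes are paraphrased and added back to the training set
# with their original labels. Model name and parameters are placeholders.
from collections import Counter
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

PARAPHRASE_MODEL = "Vamsi/T5_Paraphrase_Paws"  # placeholder paraphrase model
tokenizer = AutoTokenizer.from_pretrained(PARAPHRASE_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(PARAPHRASE_MODEL)

def paraphrase(text: str, n: int = 3) -> list[str]:
    """Generate n paraphrases of a meme's text via beam search."""
    inputs = tokenizer(f"paraphrase: {text}", return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs, num_beams=5, num_return_sequences=n, max_length=64
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def augment_minority_classes(samples, min_count: int = 200):
    """Paraphrase (text, label) pairs whose label is under-represented."""
    counts = Counter(label for _, label in samples)
    augmented = list(samples)
    for text, label in samples:
        if counts[label] < min_count:  # class has too few examples
            augmented += [(p, label) for p in paraphrase(text)]
    return augmented
```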
Surprisingly, cross-modal alignment underperformed, likely due to challenges in aligning abstract textual and visual cues. In contrast, the early fusion approach of combining text and image embeddings showed the highest performance, significantly outperforming text-only and image-only models. We also conducted zero-shot experiments with GPT-4 to benchmark its effectiveness in multimodal persuasion detection. Although GPT-4 demonstrated potential in zero-shot settings, the fine-tuned models still outperformed it, particularly when leveraging multimodal integration. This research advances the understanding of multimodal learning for detecting persuasion techniques, with broader implications for disinformation detection in online content.
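To make the early-fusion idea concrete, the sketch below combines a text embedding from XLM-RoBERTa with an image embedding from CLIP's vision encoder by concatenation before a classification head. The specific checkpoints, hidden sizes, label count, and fusion head are assumptions for illustration rather than the exact architecture used in the thesis.

```python
# Minimal early-fusion classifier sketch: concatenate text and image embeddings
# and feed them to a small MLP that outputs one logit per persuasion technique.
import torch
import torch.nn as nn
from transformers import AutoModel, CLIPVisionModel

class EarlyFusionClassifier(nn.Module):
    def __init__(self, num_labels: int = 20):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("xlm-roberta-base")
        self.image_encoder = CLIPVisionModel.from_pretrained(
            "openai/clip-vit-base-patch32"
        )
        fused_dim = (
            self.text_encoder.config.hidden_size      # 768 for XLM-R base
            + self.image_encoder.config.hidden_size   # 768 for CLIP ViT-B/32
        )
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_labels),  # one logit per persuasion technique
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        # Text embedding: [CLS]-token representation of the meme's text.
        text_emb = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        # Image embedding: pooled output of the CLIP vision tower.
        img_emb = self.image_encoder(pixel_values=pixel_values).pooler_output
        # Early fusion: concatenate both embeddings before classification.
        return self.classifier(torch.cat([text_emb, img_emb], dim=-1))
```

For multi-label persuasion-technique detection, such a head would typically be trained with a sigmoid activation and binary cross-entropy over the logits (e.g. `nn.BCEWithLogitsLoss`).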