
Understanding is Seeing: Metaphorical and Visual Reasoning in Multimodal Large Language Models

Sofia Lugli; Carlo Strapparava
2025-01-01

Abstract

Drawing on Conceptual Metaphor Theory and Structure-Mapping Theory, this paper introduces two exploratory works on metaphorical and visual reasoning using vision models and multimodal large language models. (i) The Multimodal Chain-of-Thought Prompting for Metaphor Generation task aimed to generate metaphorical linguistic expressions from non-metaphorical images, using the multimodal LLaVA 1.5 model and a two-step multimodal chain-of-thought prompting approach. The results demonstrated the model's ability to generate metaphorical expressions: 92% of them were classified as metaphors by human evaluators. The evaluation also revealed notable patterns in metaphoricity, familiarity, and appeal scores across the generated metaphors. (ii) The Metaphorical Visual Analogy (MeVA) task consisted of solving visual analogies of the form "source_domain : target_domain :: source_element : ?" by choosing the correct target element among three hard distractors varying in semantic domain and role. All six models and the human participants performed above chance, with only GPT-4o and ConvNeXt exceeding human performance. Moreover, the error analysis showed that the most frequent error in solving the analogies was the selection of distractor 1. These works show encouraging results for future research on metaphorical and visual reasoning, contributing to the broader question of whether AI models can serve as empirical tests of existing cognitive theories.
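The two-step multimodal chain-of-thought setup described in task (i) can be sketched roughly as follows. This is a minimal illustration only: the prompt wordings, the `query_model` stub, and its canned responses are assumptions made for the sketch, not the paper's actual prompts or the LLaVA 1.5 inference code.

```python
# Sketch of two-step multimodal chain-of-thought (CoT) prompting for
# metaphor generation from a non-metaphorical image. The prompts and the
# model stub below are illustrative assumptions, not the paper's pipeline.

def query_model(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a multimodal model such as LLaVA 1.5.

    In a real pipeline this would run the model on the image together
    with the text prompt; here it returns canned strings for the sketch.
    """
    if "describe" in prompt.lower():
        return "A lone tree stands on a windswept hill at dusk."
    return "The tree is a weathered sentinel guarding the empty hill."


def generate_metaphor(image_path: str) -> dict:
    # Step 1: elicit a literal rationale (a plain scene description).
    rationale = query_model(
        image_path,
        "Describe literally what you see in this image.",
    )
    # Step 2: condition the second query on both the image and the
    # rationale, asking for a metaphorical expression.
    metaphor = query_model(
        image_path,
        f"Scene description: {rationale}\n"
        "Now produce a metaphorical expression inspired by this scene.",
    )
    return {"rationale": rationale, "metaphor": metaphor}


result = generate_metaphor("hill.jpg")
print(result["rationale"])
print(result["metaphor"])
```

The point of the two-step structure is that the metaphor in step 2 is grounded in an explicit intermediate rationale rather than produced directly from the image in a single pass.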
Files in this record:
eScholarship UC item 1zd9598p.pdf — Adobe PDF, 3.35 MB (authorized users only; Creative Commons license)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/366507