
Understanding is Seeing: Metaphorical and Visual Reasoning in Multimodal Large Language Models

Sofia Lugli; Carlo Strapparava
2025-01-01

Abstract

Drawing on Conceptual Metaphor Theory and Structure-Mapping Theory, this paper introduces two exploratory works on metaphorical and visual reasoning using vision models and multimodal large language models. (i) The Multimodal Chain-of-Thought Prompting for Metaphor Generation task aimed to generate metaphorical linguistic expressions from non-metaphorical images, using the multimodal LLaVA 1.5 model and a two-step multimodal chain-of-thought prompting approach. The results demonstrated the model's ability to generate metaphorical expressions: 92% of them were classified as metaphors by human evaluators. The evaluation also revealed notable patterns in metaphoricity, familiarity, and appeal scores across the generated metaphors. (ii) The Metaphorical Visual Analogy (MeVA) task consisted of solving visual analogies of the form "source_domain : target_domain :: source_element : ?" by choosing the correct target element among three hard distractors varying in semantic domain and role. All six models and the human participants performed above chance, with only GPT-4o and ConvNeXt exceeding human performance. Moreover, the error analysis showed that the most frequent error in solving the analogies was the selection of distractor 1. These works show encouraging results for future research on metaphorical and visual reasoning, contributing to the broader question of whether AI models can serve as empirical tests of existing cognitive theories.
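The two-step multimodal chain-of-thought setup described in task (i) can be sketched roughly as follows. This is a minimal illustration only: the prompt wordings, the `query_model` stub, and its canned responses are assumptions made for the sketch, not the paper's actual prompts or the LLaVA 1.5 inference code.

```python
# Sketch of two-step multimodal chain-of-thought (CoT) prompting for
# metaphor generation from a non-metaphorical image. The prompts and the
# model stub below are illustrative assumptions, not the paper's pipeline.

def query_model(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a multimodal model such as LLaVA 1.5.

    In a real pipeline this would run the model on the image together
    with the text prompt; here it returns canned strings for the sketch.
    """
    if "describe" in prompt.lower():
        return "A lone tree stands on a windswept hill at dusk."
    return "The tree is a weathered sentinel guarding the empty hill."


def generate_metaphor(image_path: str) -> dict:
    # Step 1: elicit a literal rationale (a plain scene description).
    rationale = query_model(
        image_path,
        "Describe literally what you see in this image.",
    )
    # Step 2: condition the second query on both the image and the
    # rationale, asking for a metaphorical expression.
    metaphor = query_model(
        image_path,
        f"Scene description: {rationale}\n"
        "Now produce a metaphorical expression inspired by this scene.",
    )
    return {"rationale": rationale, "metaphor": metaphor}


result = generate_metaphor("hill.jpg")
print(result["rationale"])
print(result["metaphor"])
```

The point of the two-step structure is that the metaphor in step 2 is grounded in an explicit intermediate rationale rather than produced directly from the image in a single pass.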
Files in this record:
eScholarship UC item 1zd9598p.pdf — Adobe PDF, 3.35 MB (authorized users only; Creative Commons license)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/366507