Iterative In-Context Learning to Enhance LLMs' Abstract Reasoning: The Case Study of Algebraic Tasks
Matteo Zavatteri; Alessandro Sperduti
2026-01-01
Abstract
LLMs face significant challenges in systematic generalization, particularly on reasoning tasks that require compositional rules and involve out-of-distribution examples. To address these challenges, we introduce a few-shot repair methodology aimed at improving the generalization capabilities of general-purpose LLMs. Our approach employs an iterative example selection strategy that incrementally constructs a tailored set of few-shot examples optimized to enhance the model's performance on a given task. As a proof of concept, we apply this methodology to the resolution of algebraic expressions involving non-standard simplification rules, in which the priority of addition and multiplication is swapped. We construct synthetic datasets of varying difficulty, designed to test compositional reasoning, and use them to evaluate how well LLMs simplify these non-standard mathematical expressions. We evaluate multiple prompting strategies, namely zero-shot, few-shot, and Chain-of-Thought prompts. Our findings indicate that LLMs exhibit limited proficiency on these mathematical tasks. We further demonstrate that LLM reasoning benefits from our iterative shot-selection prompting strategy when integrated with explicit reasoning instructions. Interestingly, our experiments reveal that some LLMs generalize better when prompted with simpler few-shot examples rather than with complex ones drawn from the test data distribution. This counterintuitive finding suggests that models may benefit more from clear, easily interpretable patterns that can be abstracted and applied to more complex, out-of-distribution tasks.
Our results confirm the effectiveness and broad applicability of our methodology for systematically improving LLM performance in abstract reasoning tasks with in- and out-of-distribution examples.
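To make the task concrete, the non-standard simplification rules described above can be illustrated with a small evaluator. The sketch below is hypothetical (the paper's exact rule set and expression grammar are not given here); it assumes expressions built from non-negative integers, `+`, `*`, and parentheses, and implements the stated rule that the priority of addition and multiplication is swapped, so `+` binds tighter than `*`:

```python
# Hypothetical sketch of the paper's non-standard algebra: a recursive-descent
# evaluator in which '+' has HIGHER precedence than '*', so 2+3*4 = (2+3)*4.
# The grammar (integers, '+', '*', parentheses) is an assumption for illustration.
import re

def tokenize(expr: str) -> list[str]:
    """Split an expression into integer literals and operator/paren tokens."""
    return re.findall(r"\d+|[+*()]", expr)

def evaluate(expr: str) -> int:
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_product() -> int:
        # '*' is parsed at the OUTERMOST level: lowest precedence here.
        nonlocal pos
        value = parse_sum()
        while peek() == "*":
            pos += 1
            value *= parse_sum()
        return value

    def parse_sum() -> int:
        # '+' is parsed closer to the atoms: highest binary precedence here.
        nonlocal pos
        value = parse_atom()
        while peek() == "+":
            pos += 1
            value += parse_atom()
        return value

    def parse_atom() -> int:
        nonlocal pos
        tok = peek()
        if tok == "(":
            pos += 1                 # consume '('
            value = parse_product()  # parentheses reset precedence as usual
            pos += 1                 # consume ')'
            return value
        pos += 1
        return int(tok)

    return parse_product()

# Under swapped precedence: 2+3*4 = (2+3)*4 = 20, while 2*3+4 = 2*(3+4) = 14.
print(evaluate("2+3*4"))    # → 20
print(evaluate("2*3+4"))    # → 14
print(evaluate("(2*3)+4"))  # → 10
```

Such an evaluator is also how gold answers for the synthetic datasets could be generated programmatically, so that model outputs at each difficulty level can be checked automatically.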
