Vision Language Models as Policy Learners in Reinforcement Learning Environments
Bonetta, Giovanni; Magnini, Bernardo
2024-01-01
Abstract
In various domains requiring general knowledge and agent reasoning, traditional reinforcement learning (RL) algorithms often start from scratch, lacking prior knowledge of the environment. This approach can lead to significant inefficiencies, as agents may require extensive exploration before optimizing their actions. In contrast, in this paper we assume that recent Vision Language Models (VLMs), which integrate visual and textual information, possess inherent knowledge and basic reasoning capabilities, offering a potential solution to the sample-inefficiency problem in RL. The paper explores the integration of VLMs into RL by employing a robust VLM, Idefics-9B, as a policy updated via Proximal Policy Optimization (PPO). Experimental results on simulated environments demonstrate that using VLMs in RL significantly accelerates PPO convergence and improves rewards compared to traditional solutions. Additionally, we propose a streamlined modification to the model architecture for memory efficiency and lighter training, and we release a number of upgraded environments featuring both visual observations and textual descriptions, which, we hope, will facilitate research in VLM and RL applications. Code is available at: https://github.com/giobin/VlmPolicyEsann24
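
To make the setup concrete, the following is a minimal sketch (not the authors' implementation) of the standard PPO clipped surrogate objective as it would be applied to action log-probabilities produced by a VLM-based policy. The name `vlm_action_log_probs` mentioned in the comments is a hypothetical placeholder for scoring the environment's textual action candidates with a model such as Idefics-9B; only the clipped loss itself follows the well-known PPO formulation.

```python
# Minimal sketch: PPO clipped surrogate loss for a VLM-based policy.
# Obtaining `new_log_probs` from the VLM (a hypothetical
# `vlm_action_log_probs(observation, action_candidates)` step) is an
# assumption for illustration, NOT the authors' implementation.
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Dummy tensors standing in for VLM outputs and advantage estimates:
new_lp = torch.tensor([-1.2, -0.7, -2.1], requires_grad=True)  # log pi_new(a|s)
old_lp = torch.tensor([-1.0, -0.9, -2.0])                       # log pi_old(a|s)
adv = torch.tensor([0.5, -0.3, 1.1])                            # advantage estimates
loss = ppo_clipped_loss(new_lp, old_lp, adv)
loss.backward()  # in the real setup, gradients would update the VLM policy's trainable parameters
```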