Vision Language Models as Policy Learners in Reinforcement Learning Environments
Bonetta, Giovanni; Magnini, Bernardo
2024-01-01
Abstract
In various domains requiring general knowledge and agent reasoning, traditional reinforcement learning (RL) algorithms often start from scratch, lacking prior knowledge of the environment. This approach can lead to significant inefficiencies, as agents may require extensive exploration before optimizing their actions. In contrast, in this paper we assume that recent Vision Language Models (VLMs), which integrate visual and textual information, possess inherent knowledge and basic reasoning capabilities, offering a potential solution to the sample-inefficiency problem in RL. The paper explores the integration of VLMs into RL by employing a robust VLM, Idefics-9B, as a policy updated via Proximal Policy Optimization (PPO). Experimental results on simulated environments demonstrate that using VLMs in RL significantly accelerates PPO convergence and improves rewards compared to traditional solutions. Additionally, we propose a streamlined modification to the model architecture for memory efficiency and lighter training, and we release a number of upgraded environments featuring both visual observations and textual descriptions, which, we hope, will facilitate research in VLM and RL applications. Code is available at: https://github.com/giobin/VlmPolicyEsann24
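
To make the setup concrete, the following is a minimal sketch (not the authors' implementation) of the standard PPO clipped surrogate objective as it would be applied to action log-probabilities produced by a VLM-based policy. The name `vlm_action_log_probs` mentioned in the comments is a hypothetical placeholder for scoring the environment's textual action candidates with a model such as Idefics-9B; only the clipped loss itself follows the well-known PPO formulation.

```python
# Minimal sketch: PPO clipped surrogate loss for a VLM-based policy.
# Obtaining `new_log_probs` from the VLM (a hypothetical
# `vlm_action_log_probs(observation, action_candidates)` step) is an
# assumption for illustration, NOT the authors' implementation.
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Dummy tensors standing in for VLM outputs and advantage estimates:
new_lp = torch.tensor([-1.2, -0.7, -2.1], requires_grad=True)  # log pi_new(a|s)
old_lp = torch.tensor([-1.0, -0.9, -2.0])                       # log pi_old(a|s)
adv = torch.tensor([0.5, -0.3, 1.1])                            # advantage estimates
loss = ppo_clipped_loss(new_lp, old_lp, adv)
loss.backward()  # in the real setup, gradients would update the VLM policy's trainable parameters
```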