Can human visual attention help agents perform visual control tasks?
Training robots to perform complex control tasks from high-dimensional pixel input using reinforcement learning (RL) is sample-inefficient, because image observations consist primarily of task-irrelevant information. By contrast, humans are able to visually attend to task-relevant objects and areas. Based on this insight, we introduce Visual Saliency-Guided Reinforcement Learning (ViSaRL).
Using ViSaRL to learn visual representations significantly improves the success rate, sample efficiency, and generalization of an RL agent on diverse tasks, including the DeepMind Control benchmark and robot manipulation both in simulation and on a real robot. We present approaches for incorporating saliency into both CNN- and Transformer-based encoders. We show that visual representations learned using ViSaRL are robust to various sources of visual perturbation, including perceptual noise and scene variations.
The key idea of ViSaRL is to train a visual encoder on both RGB and saliency inputs, and an RL policy that operates over the resulting lower-dimensional image representations. Because the multimodal autoencoder is trained with a self-supervised objective, the learned representations attend to the most salient parts of an image for downstream task learning, making them robust to visual distractors. To circumvent the expensive process of manually annotating saliency maps, we train a state-of-the-art saliency predictor using only a few human-annotated examples and use it to pseudo-label RGB observations with saliency.
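To make the pretraining step concrete, below is a minimal sketch of a multimodal (RGB + saliency) autoencoder with a self-supervised reconstruction objective. The simple convolutional encoder/decoder, 64x64 input resolution, and latent size are illustrative assumptions; the actual ViSaRL encoders (CNN- or Transformer-based) differ in architectural detail.

```python
# Hedged sketch: multimodal autoencoder over RGB + saliency, assuming
# 64x64 inputs in [0, 1] and a simple conv encoder/decoder.
import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        # Encoder consumes 4 channels: 3 RGB + 1 saliency map.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
        # Decoder reconstructs both modalities from the shared latent.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 4, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, rgb, saliency):
        x = torch.cat([rgb, saliency], dim=1)  # (B, 4, H, W)
        z = self.encoder(x)                    # latent used by the policy
        recon = self.decoder(z)                # reconstruct RGB + saliency
        return z, recon

def reconstruction_loss(recon, rgb, saliency):
    # Self-supervised objective: reconstruct both input modalities.
    target = torch.cat([rgb, saliency], dim=1)
    return nn.functional.mse_loss(recon, target)
```

In this sketch the encoder consumes both modalities during pretraining, and the latent `z` is the lower-dimensional image representation handed to the downstream policy.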
We show quantitative results of our approach with two different encoder backbones, CNN and Transformer, across multiple simulated environments, including the Meta-World manipulation and DMC benchmarks.
Insights:
Zero-shot evaluation using a pretrained PiCANet, and even state-of-the-art vision-language models, fails to correctly identify the task-relevant regions of the image, necessitating human annotations to guide saliency training. Thus, we train a new saliency model for each task. We apply random vertical and horizontal flips for data augmentation to prevent overfitting on our small dataset. We measure the quality of the saliency model on two evaluation metrics following prior work: F-measure score (Fβ) and mean absolute error (MAE). On our test dataset, we obtain an Fβ score of 0.78 ± 0.02 and an MAE of 0.004 ± 0.003 averaged across all tasks, consistent with results reported in prior work.
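For reference, the two metrics can be computed as follows. The binarization threshold and the conventional β² = 0.3 weighting of precision are assumptions drawn from common saliency-detection practice, not details specified above.

```python
# Hedged sketch of the saliency evaluation metrics (MAE and F-measure),
# assuming predicted and ground-truth maps are arrays with values in [0, 1].
import numpy as np

def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth saliency."""
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta_sq=0.3, threshold=0.5):
    """F-measure with the conventional beta^2 = 0.3 weighting of precision."""
    pred_bin = pred >= threshold
    gt_bin = gt >= 0.5
    tp = np.logical_and(pred_bin, gt_bin).sum()
    precision = tp / (pred_bin.sum() + 1e-8)
    recall = tp / (gt_bin.sum() + 1e-8)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall + 1e-8)
```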
ViSaRL scales to real-robot tasks and is robust to distractor objects. Even on the easier Pick Apple task, using saliency-augmented representations (RGB+Saliency) improves the success rate over RGB alone. On tasks with distractor objects and longer-horizon tasks such as Put Apple in Bowl, ViSaRL representations nearly double the success rate.
The downstream policy is trained using standard imitation learning. The state representation is the concatenation of the visual embedding and the robot's proprioceptive information (e.g., joint positions), yielding a 271-dimensional state. The policy is an LSTM with 256-dimensional hidden states. The final hidden state of the LSTM is processed by a 2-layer MLP to predict a continuous action. The action space, A ⊂ R7, consists of end-effector deltas Δ(x, y, z, φ, θ, ψ) and a continuous scalar for gripper speed.
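A minimal sketch of this policy is below. The layer sizes follow the text (271-dimensional state, 256-dimensional LSTM hidden state, 7-dimensional action), while the MLP width and activation are assumptions.

```python
# Hedged sketch of the recurrent imitation-learning policy described above;
# dimensions match the text, other details (MLP width, ReLU) are assumed.
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    def __init__(self, state_dim=271, hidden_dim=256, action_dim=7):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        # 2-layer MLP head maps the final hidden state to a continuous action:
        # six end-effector deltas plus a scalar gripper speed.
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, states):
        # states: (B, T, 271) = visual embedding concatenated with proprioception.
        _, (h_n, _) = self.lstm(states)
        return self.head(h_n[-1])  # (B, 7) continuous action
```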
@article{liang2024visarl,
  author  = {Liang, Anthony and Thomason, Jesse and B{\i}y{\i}k, Erdem},
  title   = {ViSaRL: Visual Reinforcement Learning Guided by Human Saliency},
  journal = {IROS},
  year    = {2024},
}