ViSaRL: Visual Reinforcement Learning Guided by Human Saliency

University of Southern California

Can human visual attention help agents perform visual control tasks?

Abstract

Training robots to perform complex control tasks from high-dimensional pixel input using reinforcement learning (RL) is sample-inefficient because image observations consist primarily of task-irrelevant information. By contrast, humans are able to visually attend to task-relevant objects and areas. Based on this insight, we introduce Visual Saliency-Guided Reinforcement Learning (ViSaRL).

Using ViSaRL to learn visual representations significantly improves the success rate, sample efficiency, and generalization of an RL agent on diverse tasks, including the DeepMind Control benchmark and robot manipulation both in simulation and on a real robot. We present approaches for incorporating saliency into both CNN- and Transformer-based encoders. We show that visual representations learned using ViSaRL are robust to various sources of visual perturbation, including perceptual noise and scene variations.

teaser
ViSaRL trains a saliency prediction model from a few human-annotated saliency maps. This model is used to augment an offline image dataset with saliency. A visual encoder is pretrained on the augmented dataset and then used during downstream policy learning to generate latent representations of the agent’s observations.

ViSaRL

The key idea of ViSaRL is to train a visual encoder on both RGB and saliency inputs, and an RL policy that operates over the resulting lower-dimensional image representations. Because the multimodal autoencoder is trained with a self-supervised objective, the learned representations attend to the most salient parts of an image for downstream task learning, making them robust to visual distractors. To circumvent the expensive process of manually annotating saliency maps, we train a state-of-the-art saliency predictor from only a few human-annotated examples and use it to pseudo-label RGB observations with saliency.
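As a concrete illustration of the pseudo-labeling step, the sketch below runs a trained saliency predictor over an offline dataset of RGB observations to produce (RGB, saliency) pairs for encoder pretraining. The names saliency_model and rgb_dataset are hypothetical stand-ins, not the paper's code.

import torch

# Hypothetical sketch: pseudo-label an offline RGB dataset with a trained
# saliency predictor and collect (rgb, saliency) pairs for pretraining.
@torch.no_grad()
def pseudo_label(saliency_model, rgb_dataset, batch_size=64, device="cuda"):
    saliency_model.eval().to(device)
    loader = torch.utils.data.DataLoader(rgb_dataset, batch_size=batch_size)
    pairs = []
    for rgb in loader:                           # rgb: (B, 3, H, W) in [0, 1]
        logits = saliency_model(rgb.to(device))  # (B, 1, H, W) saliency logits
        sal = torch.sigmoid(logits).cpu()        # normalize to [0, 1]
        pairs.extend(zip(rgb, sal))              # augmented training pairs
    return pairs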

mmae
MultiMAE employs a self-supervised objective in which masked patches for both input modalities are reconstructed given only the visible patches. The pretrained model is frozen and used for extracting representations during policy learning.
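A simplified sketch of one such masked-reconstruction step over the two modalities follows. The encoder and decoders arguments are hypothetical ViT-style modules; the actual MultiMAE tokenization and decoder details differ.

import torch
import torch.nn.functional as F

def multimae_step(encoder, decoders, patches, num_visible):
    # patches: {"rgb": (B, N, D), "saliency": (B, N, D)} patch embeddings.
    # num_visible: {"rgb": int, "saliency": int} visible-token budget.
    tokens, targets, masked = [], {}, {}
    for mod, x in patches.items():
        B, N, D = x.shape
        perm = torch.rand(B, N, device=x.device).argsort(dim=1)  # random order
        keep = perm[:, : num_visible[mod]]        # visible patches
        masked[mod] = perm[:, num_visible[mod]:]  # patches to reconstruct
        targets[mod] = x
        tokens.append(torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D)))
    latent = encoder(torch.cat(tokens, dim=1))    # joint encoding of visible tokens
    loss = 0.0
    for mod, decoder in decoders.items():
        pred = decoder(latent)                    # (B, N, D) reconstructed patches
        idx = masked[mod].unsqueeze(-1).expand(-1, -1, pred.shape[-1])
        loss = loss + F.mse_loss(torch.gather(pred, 1, idx),
                                 torch.gather(targets[mod], 1, idx))
    return loss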

Simulation Experiments

We present quantitative results of our approach with two different encoder backbones, CNN and Transformer, across multiple simulated environments, including the Meta-World manipulation benchmark and the DeepMind Control (DMC) suite.

mmae

Learning curves for four robot manipulation tasks in Meta-World evaluated by task success rate. (Top) CNN encoder methods. (Bottom) Transformer encoder methods.

Insights:

  1. Saliency input improves downstream task success rates.
  2. Adding saliency as an extra input channel achieves the best task success rate for the CNN encoder (see the sketch after this list).
  3. Training the encoder with saliency improves RGB-only success rates at inference time.
  4. Using saliency in both pretraining and inference yields the best performance.
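Item 2 corresponds to concatenating the saliency map to the RGB image as a fourth input channel. A minimal sketch of such an encoder, with illustrative layer sizes rather than the paper's exact architecture:

import torch
import torch.nn as nn

class SaliencyChannelCNN(nn.Module):
    # Sketch: the saliency map enters as a fourth channel alongside RGB.
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, rgb, saliency):
        # rgb: (B, 3, H, W), saliency: (B, 1, H, W) -> (B, embed_dim)
        return self.net(torch.cat([rgb, saliency], dim=1))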

Masked Reconstruction with MultiMAE

MultiMAE predictions for different random masks. We visualize the masked predictions for an RGB observation from each of the four tasks. For each input image, we randomly sample three different masks, dividing the visible patches between RGB and saliency uniformly at random (see the sketch following the figure). Even when only a few patches of one modality are left unmasked, the reconstructions remain accurate thanks to cross-modal interaction.
masked_inputs Masked input images and RGB / saliency reconstructions by MultiMAE.
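The split of visible patches between the two modalities can be drawn as in the short sketch below, which divides a fixed visible-token budget uniformly at random (assuming the budget does not exceed the patch count of either modality). Its output matches the num_visible argument of the reconstruction sketch above.

import random

def sample_visible_split(total_visible):
    # Uniformly split a visible-patch budget between RGB and saliency.
    rgb_visible = random.randint(0, total_visible)
    return {"rgb": rgb_visible, "saliency": total_visible - rgb_visible}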

Quality of Saliency Model

Zero-shot evaluation using a pretrained PiCANet, and even state-of-the-art vision-language models, fails to correctly identify the task-relevant regions of the image, necessitating human annotation to guide saliency training. We therefore train a new saliency model for each task. We apply random vertical and horizontal flips for data augmentation to prevent overfitting on our small dataset. We measure the quality of the saliency model with two evaluation metrics following prior work: the F-measure score (Fβ) and Mean Absolute Error (MAE). On our test dataset, we obtain an Fβ score of 0.78 ± 0.02 and an MAE of 0.004 ± 0.003 averaged across all tasks, consistent with results reported in prior work.
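For reference, here is a sketch of both metrics on a single predicted map, assuming binarization at a fixed threshold and the β² = 0.3 weighting common in the saliency literature; the paper's exact evaluation protocol may differ.

import numpy as np

def saliency_metrics(pred, gt, beta_sq=0.3, thresh=0.5):
    # pred, gt: float arrays in [0, 1] of shape (H, W).
    mae = np.abs(pred - gt).mean()       # Mean Absolute Error
    p = (pred >= thresh).astype(float)   # binarized prediction
    g = (gt >= 0.5).astype(float)        # binarized ground truth
    tp = (p * g).sum()
    precision = tp / max(p.sum(), 1e-8)
    recall = tp / max(g.sum(), 1e-8)
    f_beta = ((1 + beta_sq) * precision * recall
              / max(beta_sq * precision + recall, 1e-8))
    return f_beta, mae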

masked_inputs Saliency map predictions for held-out test images in the real-world setup.

Real World Experiments

real_world_envs

Evaluation Tasks. Four Meta-World simulation tasks (top) and four real-robot tabletop manipulation tasks (bottom).

Real World Results

ViSaRL scales to real-robot tasks and is robust to distractor objects. Even on the easier Pick Apple task, using saliency-augmented representations (RGB+Saliency) improves the success rate over RGB alone. On tasks with distractor objects and longer-horizon tasks such as Put Apple in Bowl, ViSaRL representations nearly double the success rate.

real_world_results Task success rates on 10 evaluation rollouts.

Robot Policy Architecture

The downstream policy is trained using standard imitation learning. The state representation is the concatenation of the visual embedding and the robot's proprioceptive information (e.g., joint positions), yielding a 271-dimensional state. The policy is an LSTM with 256-dimensional hidden states. The final hidden state of the LSTM is processed by a 2-layer MLP to predict a continuous action. The action space A ⊆ R^7 consists of end-effector deltas Δ(x, y, z, φ, θ, ψ) and a continuous scalar for gripper speed.
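A sketch of this policy in PyTorch, using the dimensions stated above (class and variable names are ours):

import torch.nn as nn

class LSTMPolicy(nn.Module):
    # 271-dim state -> LSTM(256) -> 2-layer MLP -> 7-dim continuous action.
    def __init__(self, state_dim=271, hidden_dim=256, action_dim=7):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, states, hidden=None):
        # states: (B, T, 271) visual embedding + proprioception per timestep
        out, hidden = self.lstm(states, hidden)
        action = self.head(out[:, -1])  # Δ(x, y, z, φ, θ, ψ) + gripper speed
        return action, hidden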

robot_policy

Real World Rollouts (Left: RGB-only, Right: RGB + Saliency)

Pick Up Apple

pick up apple fail pick up apple success

Pick Up Red Block with Distractor Objects

pick up red block fail pick up red block success

Put Bread on Plate

put bread on plate fail put bread on plate success

Put Apple in Bowl with Distractor Objects

put apple in bowl fail put apple in bowl success

BibTeX

@inproceedings{liang2024visarl,
  author    = {Liang, Anthony and Thomason, Jesse and B{\i}y{\i}k, Erdem},
  title     = {ViSaRL: Visual Reinforcement Learning Guided by Human Saliency},
  booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year      = {2024},
}