ViSaRL: Visual Reinforcement Learning Guided by Human Saliency

University of Southern California

Can human visual attention help agents perform visual control tasks?

Abstract

Training robots to perform complex control tasks from high-dimensional pixel input using reinforcement learning (RL) is sample-inefficient because image observations consist primarily of task-irrelevant information. By contrast, humans are able to visually attend to task-relevant objects and areas. Based on this insight, we introduce Visual Saliency-Guided Reinforcement Learning (ViSaRL).

Using ViSaRL to learn visual representations significantly improves the success rate, sample efficiency, and generalization of an RL agent on diverse tasks, including the DeepMind Control benchmark and robot manipulation both in simulation and on a real robot. We present approaches for incorporating saliency into both CNN- and Transformer-based encoders. We show that visual representations learned using ViSaRL are robust to various sources of visual perturbation, including perceptual noise and scene variations.

teaser
ViSaRL trains a saliency prediction model from a few human-annotated saliency maps. This model is used to augment an offline image dataset with saliency. A visual encoder is pretrained on the augmented dataset and then used during downstream policy learning to generate latent representations of the agent’s observations.

ViSaRL

The key idea of ViSaRL is to train a visual encoder on both RGB and saliency inputs, and an RL policy that operates over the resulting lower-dimensional image representations. Because the multimodal autoencoder is trained with a self-supervised objective, the learned representations attend to the most salient parts of an image for downstream task learning, making them robust to visual distractors. To circumvent the expensive process of manually annotating saliency maps, we train a state-of-the-art saliency predictor from only a few human-annotated examples and use it to pseudo-label RGB observations with saliency.
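As a concrete illustration of the pseudo-labeling step, the sketch below runs a trained saliency predictor over an offline dataset of RGB observations to produce (RGB, saliency) pairs for encoder pretraining. The names saliency_model and rgb_dataset are hypothetical stand-ins, not the paper's code.

import torch

# Hypothetical sketch: pseudo-label an offline RGB dataset with a trained
# saliency predictor and collect (rgb, saliency) pairs for pretraining.
@torch.no_grad()
def pseudo_label(saliency_model, rgb_dataset, batch_size=64, device="cuda"):
    saliency_model.eval().to(device)
    loader = torch.utils.data.DataLoader(rgb_dataset, batch_size=batch_size)
    pairs = []
    for rgb in loader:                           # rgb: (B, 3, H, W) in [0, 1]
        logits = saliency_model(rgb.to(device))  # (B, 1, H, W) saliency logits
        sal = torch.sigmoid(logits).cpu()        # normalize to [0, 1]
        pairs.extend(zip(rgb, sal))              # augmented training pairs
    return pairs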

mmae
MultiMAE employs a self-supervised objective in which masked patches for both input modalities are reconstructed given only the visible patches. The pretrained model is frozen and used for extracting representations during policy learning.
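A simplified sketch of one such masked-reconstruction step over the two modalities follows. The encoder and decoders arguments are hypothetical ViT-style modules; the actual MultiMAE tokenization and decoder details differ.

import torch
import torch.nn.functional as F

def multimae_step(encoder, decoders, patches, num_visible):
    # patches: {"rgb": (B, N, D), "saliency": (B, N, D)} patch embeddings.
    # num_visible: {"rgb": int, "saliency": int} visible-token budget.
    tokens, targets, masked = [], {}, {}
    for mod, x in patches.items():
        B, N, D = x.shape
        perm = torch.rand(B, N, device=x.device).argsort(dim=1)  # random order
        keep = perm[:, : num_visible[mod]]        # visible patches
        masked[mod] = perm[:, num_visible[mod]:]  # patches to reconstruct
        targets[mod] = x
        tokens.append(torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D)))
    latent = encoder(torch.cat(tokens, dim=1))    # joint encoding of visible tokens
    loss = 0.0
    for mod, decoder in decoders.items():
        pred = decoder(latent)                    # (B, N, D) reconstructed patches
        idx = masked[mod].unsqueeze(-1).expand(-1, -1, pred.shape[-1])
        loss = loss + F.mse_loss(torch.gather(pred, 1, idx),
                                 torch.gather(targets[mod], 1, idx))
    return loss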

Simulation Experiments

We present quantitative results of our approach with two different encoder backbones, CNN and Transformer, across multiple simulated environments, including the Meta-World manipulation benchmark and the DeepMind Control (DMC) suite.

mmae

Learning curves for four robot manipulation tasks in Meta-World evaluated by task success rate. (Top) CNN encoder methods. (Bottom) Transformer encoder methods.

Insights:

  1. Saliency input improves downstream task success rates.
  2. Adding saliency as an extra input channel achieves the best task success rate for the CNN encoder (see the sketch after this list).
  3. Training the encoder with saliency improves RGB-only success rates at inference time.
  4. Using saliency in both pretraining and inference yields the best performance.
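Item 2 corresponds to concatenating the saliency map to the RGB image as a fourth input channel. A minimal sketch of such an encoder, with illustrative layer sizes rather than the paper's exact architecture:

import torch
import torch.nn as nn

class SaliencyChannelCNN(nn.Module):
    # Sketch: the saliency map enters as a fourth channel alongside RGB.
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, rgb, saliency):
        # rgb: (B, 3, H, W), saliency: (B, 1, H, W) -> (B, embed_dim)
        return self.net(torch.cat([rgb, saliency], dim=1))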

Masked Reconstruction with MultiMAE

MultiMAE predictions for different random masks. We visualize the masked predictions for an RGB observation from each of the four tasks. For each input image, we randomly sample three different masks, dividing the visible patches between RGB and saliency uniformly at random (see the sketch following the figure). Even when only a few patches of one modality are left unmasked, the reconstructions remain accurate thanks to cross-modal interaction.
masked_inputs Masked input images and RGB / saliency reconstructions by MultiMAE.
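The split of visible patches between the two modalities can be drawn as in the short sketch below, which divides a fixed visible-token budget uniformly at random (assuming the budget does not exceed the patch count of either modality). Its output matches the num_visible argument of the reconstruction sketch above.

import random

def sample_visible_split(total_visible):
    # Uniformly split a visible-patch budget between RGB and saliency.
    rgb_visible = random.randint(0, total_visible)
    return {"rgb": rgb_visible, "saliency": total_visible - rgb_visible}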

Quality of Saliency Model

Zero-shot evaluation using a pretrained PiCANet, and even state-of-the-art vision-language models, fails to correctly identify the task-relevant regions of the image, necessitating human annotation to guide saliency training. We therefore train a new saliency model for each task. We apply random vertical and horizontal flips for data augmentation to prevent overfitting on our small dataset. We measure the quality of the saliency model with two evaluation metrics following prior work: the F-measure score (Fβ) and Mean Absolute Error (MAE). On our test dataset, we obtain an Fβ score of 0.78 ± 0.02 and an MAE of 0.004 ± 0.003 averaged across all tasks, consistent with results reported in prior work.
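For reference, here is a sketch of both metrics on a single predicted map, assuming binarization at a fixed threshold and the β² = 0.3 weighting common in the saliency literature; the paper's exact evaluation protocol may differ.

import numpy as np

def saliency_metrics(pred, gt, beta_sq=0.3, thresh=0.5):
    # pred, gt: float arrays in [0, 1] of shape (H, W).
    mae = np.abs(pred - gt).mean()       # Mean Absolute Error
    p = (pred >= thresh).astype(float)   # binarized prediction
    g = (gt >= 0.5).astype(float)        # binarized ground truth
    tp = (p * g).sum()
    precision = tp / max(p.sum(), 1e-8)
    recall = tp / max(g.sum(), 1e-8)
    f_beta = ((1 + beta_sq) * precision * recall
              / max(beta_sq * precision + recall, 1e-8))
    return f_beta, mae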

masked_inputs Saliency map predictions for held-out test images in the real-world setup.

Real World Experiments

real_world_envs

Evaluation Tasks. Four Meta-World simulation tasks (top) and four real-robot tabletop manipulation tasks (bottom).

Real World Results

ViSaRL scales to real-robot tasks and is robust to distractor objects. Even on the easier Pick Apple task, using saliency-augmented representations (RGB+Saliency) improves the success rate over RGB alone. On tasks with distractor objects and longer-horizon tasks such as Put Apple in Bowl, ViSaRL representations nearly double the success rate.

real_world_results Task success rates on 10 evaluation rollouts.

Robot Policy Architecture

The downstream policy is trained using standard imitation learning. The state representation is the concatenation of the visual embedding and the robot's proprioceptive information (e.g., joint positions), yielding a 271-dimensional state. The policy is an LSTM with 256-dimensional hidden states. The final hidden state of the LSTM is processed by a 2-layer MLP to predict a continuous action. The action space A ⊆ R^7 consists of end-effector deltas Δ(x, y, z, φ, θ, ψ) and a continuous scalar for gripper speed.
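A sketch of this policy in PyTorch, using the dimensions stated above (class and variable names are ours):

import torch.nn as nn

class LSTMPolicy(nn.Module):
    # 271-dim state -> LSTM(256) -> 2-layer MLP -> 7-dim continuous action.
    def __init__(self, state_dim=271, hidden_dim=256, action_dim=7):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, states, hidden=None):
        # states: (B, T, 271) visual embedding + proprioception per timestep
        out, hidden = self.lstm(states, hidden)
        action = self.head(out[:, -1])  # Δ(x, y, z, φ, θ, ψ) + gripper speed
        return action, hidden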

robot_policy

Real World Rollouts (Left: RGB-only, Right: RGB + Saliency)

Pick Up Apple

pick up apple fail pick up apple success

Pick Up Red Block with Distractor Objects

pick up red block fail pick up red block success

Put Bread on Plate

put bread on plate fail put bread on plate success

Put Apple in Bowl with Distractor Objects

put apple in bowl fail put apple in bowl success

BibTeX

@inproceedings{liang2024visarl,
  author    = {Liang, Anthony and Thomason, Jesse and B{\i}y{\i}k, Erdem},
  title     = {ViSaRL: Visual Reinforcement Learning Guided by Human Saliency},
  booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year      = {2024},
}