.. _advanced_usage-guided_policy_learning:
Guided Policy Learning
Here we demonstrate how to use VISTA's local synthesis around a passive dataset, which leads to a learning framework called guided policy learning (GPL). In contrast to imitation learning (IL), GPL actively samples sensor data that lies near, but differs from, the original dataset and couples it with control commands (the labels in supervised imitation learning) that aim to steer the agent back to the nominal human trajectory in the original passive dataset. Overall, GPL can be seen as a data-augmented version of IL that tries to improve the robustness of the model within some deviation from the demonstration (the passive dataset).
Similar to IL, we first initialize the VISTA simulator and a sampler. There are two major differences.
First, at reset, we always initialize the ego agent some distance away from the human trajectory
(demonstration) to actively create scenarios to be corrected from. This is specified by initial_dynamics_fn.
Second, we need a controller that provides ground-truth control commands to correct toward the
demonstration. The controller is allowed to have access to privileged information (e.g., lane
boundaries, ado cars' poses, etc.) as it is only used to provide guidance for policy learning
during training time.
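# build the world from the traces, then attach an agent and a camera
self._world = vista.World(self.trace_paths, self.trace_config)
self._agent = self._world.spawn_agent(self.car_config)
self._camera = self._agent.spawn_camera(self.camera_config)
# reset with initial_dynamics_fn so the agent starts away from the human trajectory
self._world.reset({self._agent.id: self.initial_dynamics_fn})
# sampler and privileged controller used by the data generator
self._sampler = RejectionSampler()
self._privileged_controller = get_controller(self.privileged_control_config)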
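The perturbed reset depends on what initial_dynamics_fn returns. As a minimal sketch, assuming the function receives the nominal state on the human trajectory as (x, y, yaw, steering, speed) and returns the state the agent should start from, it could look like the following. The signature, the perturbation ranges, and the module-level random generator are assumptions for illustration, not details given above.

```python
import numpy as np

# Hypothetical perturbation ranges; tune them to the dataset at hand.
MAX_LATERAL_OFFSET = 1.5  # metres
MAX_YAW_OFFSET = 0.2      # radians

_rng = np.random.default_rng(0)


def initial_dynamics_fn(x, y, yaw, steering, speed):
    """Return a perturbed copy of the nominal initial state.

    Assumption: the simulator calls this with the state on the human
    trajectory and starts the agent from the returned values.
    """
    return [
        x + _rng.uniform(-MAX_LATERAL_OFFSET, MAX_LATERAL_OFFSET),  # shift sideways
        y,
        yaw + _rng.uniform(-MAX_YAW_OFFSET, MAX_YAW_OFFSET),        # rotate heading
        steering,
        speed,
    ]


state = initial_dynamics_fn(0.0, 0.0, 0.0, 0.0, 5.0)
```

Only the lateral position and heading are perturbed here, so each episode starts off the demonstration but in a state a controller can plausibly recover from.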
Next, we implement a data generator that produces a training dataset “around” the original passive dataset used for imitation learning. This encourages the policy to correct itself toward the demonstration (human trajectories).
# Data generator from simulation
self._snippet_i = 0
while True:
    # reset simulator
    if self._agent.done or self._snippet_i >= self.snippet_size:
        self._world.reset({self._agent.id: self.initial_dynamics_fn})
        self._snippet_i = 0

    # privileged control
    curvature, speed = self._privileged_controller(self._agent)

    # step simulator
    sensor_name = self._camera.name
    img = self._agent.observations[sensor_name]  # associate action t with observation t-1
    action = np.array([curvature, speed])
    self._agent.step_dynamics(action)

    val = curvature
    sampling_prob = self._sampler.get_sampling_probability(val)
    if self._rng.uniform(0., 1.) > sampling_prob:  # reject
        self._snippet_i += 1
        continue
    self._sampler.add_to_history(val)

    # preprocess and produce data-label pairs
    img = transform_rgb(img, self._camera, self.train)
    label = np.array([curvature]).astype(np.float32)

    self._snippet_i += 1

    yield {'camera': img, 'target': label}
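The rejection step in the generator calls two sampler methods, get_sampling_probability and add_to_history. The sketch below is one possible way such a sampler could work, assuming the goal is to flatten the distribution of curvature labels. The class name HistogramRejectionSampler, the bin range, and the acceptance rule are hypothetical and are not the RejectionSampler used above.

```python
import numpy as np


class HistogramRejectionSampler:
    """Hypothetical stand-in exposing the two methods the generator calls."""

    def __init__(self, min_val=-0.2, max_val=0.2, n_bins=20):
        self._edges = np.linspace(min_val, max_val, n_bins + 1)
        self._counts = np.zeros(n_bins, dtype=np.int64)

    def _bin(self, val):
        # Out-of-range values are clipped into the first or last bin.
        idx = np.searchsorted(self._edges, val, side="right") - 1
        return int(np.clip(idx, 0, len(self._counts) - 1))

    def get_sampling_probability(self, val):
        # Bins holding more samples than the average are accepted less often.
        count = self._counts[self._bin(val)]
        return float(min(1.0, (self._counts.mean() + 1.0) / (count + 1.0)))

    def add_to_history(self, val):
        self._counts[self._bin(val)] += 1


# Same accept/reject pattern as the generator, on synthetic curvature values.
sampler = HistogramRejectionSampler()
rng = np.random.default_rng(0)
kept = []
for val in rng.normal(0.0, 0.03, size=5000):  # mostly near-zero curvature
    if rng.uniform(0.0, 1.0) > sampler.get_sampling_probability(val):
        continue  # reject
    sampler.add_to_history(val)
    kept.append(val)
```

With this rule, frequent near-zero curvature values are increasingly rejected while rare large-curvature values are always kept, so the accepted labels are spread more evenly than the raw stream.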
As the data generator shows, the only difference from imitation learning is that a privileged controller
produces the correct control commands for states that deviate from the human trajectories. Thus, we get
data in which the agent initially deviates from the demonstration but gradually converges to the human
trajectories. At test time in closed-loop control, such a training scheme allows the policy
to correct itself when it drifts away due to compounding errors. For more details, please check
examples/advanced_usage/gpl_rgb_dataset.py.
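To illustrate how a corrective controller yields trajectories that start off the demonstration and converge back to it, here is a toy sketch that does not use VISTA. It assumes a straight reference path, a simple kinematic model, and a proportional controller on lateral offset and heading error. The function names, gains, sign convention (positive curvature increases the heading error), and model are all hypothetical and are not the privileged controller from the example script.

```python
import numpy as np


def privileged_controller(lateral_offset, heading_error,
                          k_d=0.05, k_psi=0.4, max_curvature=0.2):
    """Toy proportional controller: steer against both error terms."""
    curvature = -k_d * lateral_offset - k_psi * heading_error
    return float(np.clip(curvature, -max_curvature, max_curvature))


def rollout(lateral_offset, heading_error, speed=5.0, dt=0.1, n_steps=100):
    """Simulate the corrected trajectory and record (state, label) samples."""
    samples = []
    for _ in range(n_steps):
        curvature = privileged_controller(lateral_offset, heading_error)
        # The label is the command computed for the current (deviated) state.
        samples.append((lateral_offset, heading_error, curvature))
        # Forward-Euler update relative to a straight reference path.
        lateral_offset += speed * np.sin(heading_error) * dt
        heading_error += speed * curvature * dt
    return np.array(samples)


# Start 1.5 m to the side with a 0.2 rad heading error.
samples = rollout(lateral_offset=1.5, heading_error=0.2)
```

Each row of samples pairs a deviated state with the command that corrects it. In GPL the state would be replaced by the synthesized camera image, so the learned policy sees recovery behavior that a purely passive dataset does not contain.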