.. _advanced_usage-guided_policy_learning:

Guided Policy Learning
======================

Here we demonstrate how to leverage VISTA's local synthesis around a passive dataset, which leads to a learning framework called guided policy learning (GPL). In contrast to imitation learning (IL), GPL actively samples sensor data that is close to but different from the original dataset and couples it with control commands (the labels in supervised imitation learning) that aim to steer the agent back to the nominal human trajectory in the original passive dataset. Overall, GPL can be seen as a data-augmented version of IL that improves the robustness of the model within some deviation from the demonstration (the passive dataset).

Similar to IL, we first initialize the VISTA simulator and a sampler. There are two major differences during reset. First, we always initialize the ego agent some distance away from the human trajectory (the demonstration) to actively create scenarios that need to be corrected; this is specified by ``initial_dynamics_fn``. Second, we need a controller that provides ground-truth control commands to correct the agent back toward the demonstration. The controller is allowed to access privileged information (e.g., lane boundaries, ado cars' poses, etc.), as it is only used to provide guidance for policy learning at training time.

.. code-block:: python

    # Initialize the simulator: world, ego agent, and RGB camera
    self._world = vista.World(self.trace_paths, self.trace_config)
    self._agent = self._world.spawn_agent(self.car_config)
    self._camera = self._agent.spawn_camera(self.camera_config)
    # Reset with a perturbed initial state (away from the human trajectory)
    self._world.reset({self._agent.id: self.initial_dynamics_fn})
    # Rejection sampler used to balance the curvature label distribution
    self._sampler = RejectionSampler()

    # Privileged controller providing corrective ground-truth commands
    self._privileged_controller = get_controller(self.privileged_control_config)
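
For reference, a perturbed-initialization function could look like the sketch below. The assumed interface (a dynamics state ``(x, y, yaw, steering, speed)`` mapped to a perturbed copy) and the perturbation magnitudes are illustrative and should be adapted to your traces and vehicle limits.

.. code-block:: python

    import numpy as np

    _rng = np.random.default_rng()

    def initial_dynamics_fn(x, y, yaw, steering, speed):
        # Perturb the human pose laterally and in heading so that every episode
        # starts off the demonstration; the ranges below are illustrative only.
        return [
            x + _rng.uniform(-1.0, 1.0),
            y,
            yaw + _rng.uniform(-0.1, 0.1),
            steering,
            speed,
        ]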

Next, we implement a data generator that produces training data “around” the original passive dataset used for imitation learning. This encourages the policy to correct itself back toward the demonstration (the human trajectories).

.. code-block:: python

    # Data generator from simulation
    self._snippet_i = 0
    while True:
        # reset simulator
        if self._agent.done or self._snippet_i >= self.snippet_size:
            self._world.reset({self._agent.id: self.initial_dynamics_fn})
            self._snippet_i = 0

        # privileged control
        curvature, speed = self._privileged_controller(self._agent)

        # step simulator
        sensor_name = self._camera.name
        img = self._agent.observations[sensor_name]  # associate action t with observation t-1
        action = np.array([curvature, speed])
        self._agent.step_dynamics(action)

        # rejection sampling on the curvature label to balance its distribution
        val = curvature
        sampling_prob = self._sampler.get_sampling_probability(val)
        if self._rng.uniform(0., 1.) > sampling_prob:  # reject
            self._snippet_i += 1
            continue
        self._sampler.add_to_history(val)

        # preprocess and produce data-label pairs
        img = transform_rgb(img, self._camera, self.train)
        label = np.array([curvature]).astype(np.float32)

        self._snippet_i += 1

        yield {'camera': img, 'target': label}
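
The ``RejectionSampler`` above is used to flatten the curvature (label) distribution, which is otherwise dominated by near-zero curvature on straight roads. A minimal sketch of such a sampler is shown below; the binning scheme and value range are assumptions for illustration, not necessarily the implementation shipped with the example.

.. code-block:: python

    import numpy as np

    class RejectionSampler:
        """ Keeps a histogram of previously accepted values and accepts new
        samples with probability inversely proportional to how crowded their
        bin already is (illustrative sketch). """

        def __init__(self, n_bins=100, value_range=(-0.3, 0.3), smooth=1e-2):
            self._n_bins = n_bins
            self._range = value_range
            self._smooth = smooth
            self._counts = np.zeros(n_bins)

        def _bin(self, val):
            lo, hi = self._range
            return int((np.clip(val, lo, hi) - lo) / (hi - lo) * (self._n_bins - 1))

        def add_to_history(self, val):
            self._counts[self._bin(val)] += 1

        def get_sampling_probability(self, val):
            if self._counts.sum() == 0:
                return 1.0  # accept everything until statistics are available
            density = self._counts / self._counts.sum()
            # rare bins get probability close to 1, common bins get lower probability
            prob = (density.min() + self._smooth) / (density[self._bin(val)] + self._smooth)
            return float(np.clip(prob, 0.0, 1.0))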

As shown above, the only difference from imitation learning is running a privileged controller that produces correct control commands for states that deviate from the human trajectories. As a result, the generated data shows the agent starting away from the demonstration and gradually converging back to the human trajectory. At test time, in closed-loop control, this training scheme allows the policy to correct itself before drifting away due to compounding error. For more details, please check ``examples/advanced_usage/gpl_rgb_dataset.py``.
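
For intuition, the privileged controller can be as simple as a feedback law on the ego pose expressed relative to the human trajectory. The sketch below assumes the agent exposes its pose relative to the human frame (here called ``relative_state``) together with the human's recorded ``human_curvature`` and ``human_speed``; these attribute names, gains, and sign conventions are assumptions to verify against your setup rather than the controller used in the example.

.. code-block:: python

    import numpy as np

    def make_privileged_controller(Kp=0.6, Kd=0.3, curvature_bound=0.3):
        """ Hypothetical corrective controller (sketch): add a PD term on the
        lateral offset and heading error to the human's recorded curvature and
        replay the human's speed. """

        def controller(agent):
            lat_err = agent.relative_state.x        # lateral offset from the human pose (assumed)
            heading_err = agent.relative_state.yaw  # heading error w.r.t. the human pose (assumed)
            curvature = agent.human_curvature - Kp * lat_err - Kd * heading_err
            curvature = float(np.clip(curvature, -curvature_bound, curvature_bound))
            return curvature, agent.human_speed

        return controller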