.. _advanced_usage-guided_policy_learning:

Guided Policy Learning
======================

Here we demonstrate how to leverage the power of local synthesis around a passive dataset
using VISTA, which leads to a learning framework called guided policy learning (GPL). In
contrast to imitation learning (IL), GPL actively samples sensor data that is around but
different from the original dataset and couples it with control commands (the labels in
supervised imitation learning) that aim to correct the agent back toward the nominal human
trajectory in the original passive dataset. Overall, GPL can be seen as a data-augmented
version of IL that aims to improve the robustness of the model within some deviation from
the demonstration (the passive dataset).

Similar to IL, we first initialize the VISTA simulator and a sampler. There are two major
differences during reset. First, we always initialize the ego agent some distance away from
the human trajectory (the demonstration) to actively create scenarios to be corrected from.
This is specified by ``initial_dynamics_fn``. Second, we need a controller that provides
ground-truth control commands to correct toward the demonstration. The controller is allowed
to access privileged information (e.g., lane boundaries, ado cars' poses, etc.) since it is
only used to provide guidance for policy learning at training time. Rough sketches of a
possible ``initial_dynamics_fn`` and a corrective controller are given at the end of this
page.

::

    self._world = vista.World(self.trace_paths, self.trace_config)
    self._agent = self._world.spawn_agent(self.car_config)
    self._camera = self._agent.spawn_camera(self.camera_config)
    self._world.reset({self._agent.id: self.initial_dynamics_fn})
    self._sampler = RejectionSampler()
    self._privileged_controller = get_controller(self.privileged_control_config)

Next, we implement a data generator that produces a training dataset "around" the original
passive dataset used for imitation learning. This encourages the policy to correct itself
toward the demonstration (the human trajectories).

::

    # Data generator from simulation
    self._snippet_i = 0
    while True:
        # Reset the simulator at the end of an episode or a snippet
        if self._agent.done or self._snippet_i >= self.snippet_size:
            self._world.reset({self._agent.id: self.initial_dynamics_fn})
            self._snippet_i = 0

        # Privileged control (corrective labels toward the human trajectory)
        curvature, speed = self._privileged_controller(self._agent)

        # Step the simulator; associate action at time t with observation at time t-1
        sensor_name = self._camera.name
        img = self._agent.observations[sensor_name]
        action = np.array([curvature, speed])
        self._agent.step_dynamics(action)

        # Rejection sampling on the curvature value to balance the label distribution
        val = curvature
        sampling_prob = self._sampler.get_sampling_probability(val)
        if self._rng.uniform(0., 1.) > sampling_prob:  # reject
            self._snippet_i += 1
            continue
        self._sampler.add_to_history(val)

        # Preprocess and produce data-label pairs
        img = transform_rgb(img, self._camera, self.train)
        label = np.array([curvature]).astype(np.float32)

        self._snippet_i += 1

        yield {'camera': img, 'target': label}

As shown above, the only difference from imitation learning is that we run a privileged
controller to produce the correct control commands for states that deviate from the human
trajectories. Thus, we obtain data in which the agent initially deviates from the
demonstration but gradually converges back to the human trajectory. At test time, under
closed-loop control, such a training scheme allows the policy to correct itself from
drifting away due to compounding errors. For more details, please check
``examples/advanced_usage/gpl_rgb_dataset.py``.
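
As a rough illustration of the reset perturbation, an ``initial_dynamics_fn`` can simply
jitter the initial state of the ego agent. The sketch below is illustrative only: it assumes
the callable receives and returns the initial state as ``(x, y, yaw, steering, speed)``, and
the perturbation ranges are arbitrary; the example script may differ in both respects.

::

    import numpy as np

    _rng = np.random.default_rng(0)

    def initial_dynamics_fn(x, y, yaw, steering, speed):
        # Perturb the initial pose so the ego agent starts some distance away
        # from the human trajectory (offset ranges are illustrative only).
        return [
            x + _rng.uniform(-1.5, 1.5),    # positional offset [m]
            y + _rng.uniform(-1.5, 1.5),
            yaw + _rng.uniform(-0.1, 0.1),  # heading offset [rad]
            steering,
            speed,
        ]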
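
The privileged controller itself can be as simple as proportional feedback on the ego agent's
deviation from the human trajectory. The function below is a minimal, self-contained sketch of
the corrective steering command only; the gains, the clipping range, the sign convention
(positive curvature turning left), and how the lateral and heading errors are queried from
privileged simulator state are all assumptions here, not the controller used in the example
script.

::

    import numpy as np

    def corrective_curvature(lateral_err, heading_err,
                             k_lat=0.3, k_head=1.0, max_curvature=0.3):
        # Proportional feedback that steers the agent back toward the human
        # trajectory. `lateral_err` [m] and `heading_err` [rad] denote the ego
        # agent's deviation from the demonstration (left of the demonstration
        # is positive); gains and clipping range are illustrative.
        curvature = -(k_lat * lateral_err + k_head * heading_err)
        return float(np.clip(curvature, -max_curvature, max_curvature))

For instance, an agent 0.5 m to the left of the demonstration with a small positive heading
error receives a negative curvature command, i.e., it is steered back to the right.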
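
Finally, the dictionaries yielded by the generator plug directly into a standard supervised
training loop. The wrapper below is a hypothetical sketch using PyTorch (not part of VISTA);
names such as ``GPLDataset``, ``make_generator``, and ``policy`` are placeholders, and the
image layout conversion assumes HWC inputs.

::

    import torch
    from torch.utils.data import DataLoader, IterableDataset

    class GPLDataset(IterableDataset):
        """Streams (image, target) pairs from the simulation generator."""

        def __init__(self, make_generator):
            self.make_generator = make_generator  # callable returning the generator above

        def __iter__(self):
            for sample in self.make_generator():
                img = torch.as_tensor(sample['camera']).permute(2, 0, 1).float()  # HWC -> CHW
                target = torch.as_tensor(sample['target'])
                yield img, target

    # loader = DataLoader(GPLDataset(make_generator), batch_size=32)
    # for img, target in loader:
    #     loss = torch.nn.functional.mse_loss(policy(img), target)  # regress privileged labels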