.. _advanced_usage-reinforcement_learning:

Reinforcement Learning
======================

The previous section on guided policy learning showed how to actively generate data by initializing the ego agent off the human trajectory and correcting it with a privileged controller. Taking this "activeness" of applying VISTA on top of a passive dataset to its extreme leads to reinforcement learning (RL), where an agent interacts with the environment, collects its own training data, and learns to achieve good task performance. Here we show how to use VISTA to define an RL environment for policy learning.

Implementation-wise, four major components need to be defined: observations, environment dynamics, the reward function, and the terminal condition. Observations are simply sensor data, and the environment dynamics describe how the ego car moves after applying control commands, which is already embedded in the vehicle state update (and in collision checking if a multi-agent scenario is considered). Thus, we only need to further define the reward function and the terminal condition. For example, for a lane-following task, the step function can be written as follows.

def step(self, action, dt=1 / 30.):
    # Step agent and get observation
    agent = self._world.agents[0]
    # Use the curvature command from the policy and follow the speed
    # recorded in the human trajectory
    action = np.array([action[agent.id][0], agent.human_speed])
    agent.step_dynamics(action, dt=dt)
    agent.step_sensors()
    observations = agent.observations

    # Define terminal condition
    done, info_from_terminal_condition = self.config['terminal_condition'](
        self, agent.id)

    # Define reward
    reward, _ = self.config['reward_fn'](self, agent.id,
                                         **info_from_terminal_condition)

    # Get info
    # ...

    # Pack output
    observations, reward, done, info = map(
        self._append_agent_id, [observations, reward, done, info])

    return observations, reward, done, info

Here, self.config['terminal_condition'] can be defined, for example, as the car going off the lane (too far from the lane center) or the car heading deviating too much from the road curvature. Note that, beyond serving as the terminal condition for the lane-following task, these two constraints should be satisfied in any case, since VISTA only supports high-fidelity synthesis locally around the original passive dataset.

def default_terminal_condition(task, agent_id, **kwargs):
    """ An example definition of terminal condition. """
    agent = [_a for _a in task.world.agents if _a.id == agent_id][0]

    def _check_out_of_lane():
        road_half_width = agent.trace.road_width / 2.
        return np.abs(agent.relative_state.x) > road_half_width

    def _check_exceed_max_rot():
        maximal_rotation = np.pi / 10.
        return np.abs(agent.relative_state.theta) > maximal_rotation

    out_of_lane = _check_out_of_lane()
    exceed_max_rot = _check_exceed_max_rot()
    done = out_of_lane or exceed_max_rot or agent.done
    other_info = {
        'done': done,
        'out_of_lane': out_of_lane,
        'exceed_max_rot': exceed_max_rot,
    }

    return done, other_info

We can define a very simple reward function that encourages survival (not going off the lane and not deviating too much in heading from the road curvature) by simply checking whether the current step terminates the episode.

def default_reward_fn(task, agent_id, **kwargs):
    """ An example definition of reward function. """
    reward = -1 if kwargs['done'] else 0  # simply encourage survival

    return reward, {}
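Putting these pieces together, the terminal condition and reward function are injected through the environment config that step reads (self.config['terminal_condition'] and self.config['reward_fn']). The snippet below is a minimal usage sketch; the environment class name LaneFollowing, its constructor arguments, and the random curvature range are assumptions for illustration, so please refer to lane_following.py for the actual interface.

import numpy as np

# Hypothetical wiring of the environment config; the class name and
# constructor arguments are assumptions, see lane_following.py
env_config = {
    'terminal_condition': default_terminal_condition,
    'reward_fn': default_reward_fn,
}
env = LaneFollowing(trace_paths, env_config)  # trace_paths: list of VISTA traces

observations = env.reset()
done = False
while not done:
    # Random curvature command as a stand-in for a learned policy; the
    # sampling range is an arbitrary assumption
    action = {agent_id: np.array([np.random.uniform(-0.1, 0.1)])
              for agent_id in observations}
    observations, reward, done, info = env.step(action)
    # Outputs are packed per agent id by _append_agent_id in step
    done = any(done.values())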

Please check lane_following.py for more details. The implementation roughly follows the OpenAI Gym interface with reset and step functions. However, other attributes and methods such as action_space, observation_space, and render still need to be implemented, which may require objects from gym (e.g., gym.spaces).
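For example, with gym, a continuous curvature action space and an image observation space can be declared as sketched below; the curvature bounds and the camera resolution are assumptions and should be set to match the actual steering limits and sensor configuration.

import numpy as np
import gym

# Assumed curvature limits (1/m); set these to the vehicle's actual
# steering constraints
action_space = gym.spaces.Box(low=np.array([-0.3]),
                              high=np.array([0.3]),
                              dtype=np.float64)

# Assumed RGB camera resolution; set this to the sensor configuration
observation_space = gym.spaces.Box(low=0,
                                   high=255,
                                   shape=(200, 320, 3),
                                   dtype=np.uint8)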