End-to-end Vision-aware Vehicle Decision Making via Imitation Learning

Autonomous vehicles that understand road agents: Detection, tracking, and behavior prediction

Problem Statement

Given raw sensor data (camera, LiDAR, HD map) from the ego vehicle over the last 10 frames, predict the driver behavior for the next n frames. Driver behavior at each future frame is one of:

  1. Left turn

  2. Right turn

  3. Left lane change

  4. Right lane change

  5. Straight
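The input/output setup above can be made concrete with a short sketch. It assumes each drive log is already a per-frame list of action labels; `Action`, `make_samples`, and the parameter names are hypothetical and not taken from the project code. The enum values simply mirror the numbering of the list above.

```python
from enum import IntEnum
from typing import List, Sequence, Tuple


class Action(IntEnum):
    """The five driver-behavior classes, numbered as in the list above."""
    LEFT_TURN = 1
    RIGHT_TURN = 2
    LEFT_LANE_CHANGE = 3
    RIGHT_LANE_CHANGE = 4
    STRAIGHT = 5


def make_samples(
    labels: Sequence[Action],
    n_input: int = 10,
    n_predict: int = 5,
    stride: int = 1,
) -> List[Tuple[range, List[Action]]]:
    """Slide a window over one drive log (hypothetical helper).

    Each sample pairs the indices of `n_input` past frames with the
    action labels of the `n_predict` frames that follow them.
    """
    samples = []
    last_start = len(labels) - n_input - n_predict
    for start in range(0, last_start + 1, stride):
        split = start + n_input
        future = list(labels[split:split + n_predict])
        samples.append((range(start, split), future))
    return samples
```

With an 18-frame log and n=5, this yields four samples; the first one uses frames 0 to 9 as input and the labels of frames 10 to 14 as the target.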

A sample sequence and its output labels with n=5 are given below:

The dataset consists of real-world driving data collected by Argo AI’s self-driving test vehicles in Miami and Pittsburgh.

The data covers different seasons, weather conditions, and times of day to provide a broad range of real-world driving scenarios.

Sensor data:

  • RGB video frames (1920 x 1200 x 3) at 30 Hz

  • LiDAR point cloud at 10 Hz

  • High Definition Map (HD map) with drivable area and lane polygons

  • Vehicle position and pose information from GPS-based and sensor-based localization
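Because the camera runs at 30 Hz and the LiDAR at 10 Hz, the two streams have to be paired before they can be used together. The source does not say how this was done; the sketch below assumes nearest-timestamp matching, with timestamps in seconds and a hypothetical helper name `nearest_frame_indices`.

```python
from bisect import bisect_left
from typing import List, Sequence


def nearest_frame_indices(
    camera_ts: Sequence[float], lidar_ts: Sequence[float]
) -> List[int]:
    """For each LiDAR timestamp, return the index of the closest camera frame.

    Assumes `camera_ts` is non-empty and sorted in ascending order.
    """
    matches = []
    for t in lidar_ts:
        i = bisect_left(camera_ts, t)
        if i == 0:
            matches.append(0)
        elif i == len(camera_ts):
            matches.append(len(camera_ts) - 1)
        else:
            # Pick whichever neighbouring camera frame is closer in time.
            before, after = camera_ts[i - 1], camera_ts[i]
            matches.append(i if after - t < t - before else i - 1)
    return matches
```

At these rates roughly every third camera frame ends up paired with a LiDAR sweep.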

Ground Truth Label Generation & Data Pre-processing

Ground Truth

High-quality ground truth labels are an essential but often overlooked factor in learning-based problems. To generate ground truth action labels, I tried the following two methods:

  • A heuristic algorithm to generate labels using GPS rotation vector and position vector

  • Label every frame manually

Labels generated by the heuristic algorithm were very noisy, so I decided to manually label every frame.
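For illustration, a heuristic of this kind might look like the sketch below. It is not the algorithm actually used. It assumes each pose has been reduced to (x, y, yaw) with yaw in radians and counter-clockwise positive, and it compares the pose at the start and end of a short window; the function name and both thresholds are hypothetical.

```python
import math
from typing import Tuple

Pose = Tuple[float, float, float]  # (x in m, y in m, yaw in rad); assumed layout


def heuristic_label(
    start_pose: Pose,
    end_pose: Pose,
    yaw_thresh_rad: float = math.radians(30.0),  # hypothetical threshold
    lateral_thresh_m: float = 2.0,               # hypothetical threshold
) -> str:
    """Label a window of driving from its first and last pose."""
    x0, y0, yaw0 = start_pose
    x1, y1, yaw1 = end_pose

    # Heading change, wrapped to [-pi, pi) so that 179 deg -> -179 deg is small.
    dyaw = (yaw1 - yaw0 + math.pi) % (2.0 * math.pi) - math.pi
    if dyaw > yaw_thresh_rad:
        return "left turn"
    if dyaw < -yaw_thresh_rad:
        return "right turn"

    # Displacement projected onto the vehicle's left axis at the start pose.
    dx, dy = x1 - x0, y1 - y0
    lateral = -math.sin(yaw0) * dx + math.cos(yaw0) * dy
    if lateral > lateral_thresh_m:
        return "left lane change"
    if lateral < -lateral_thresh_m:
        return "right lane change"
    return "straight"
```

A rule like this depends entirely on hand-picked thresholds and on the quality of the localization signal, so its labels would need to be checked against manual annotation.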

Data Pre-processing

  • Resize RGB video frames

  • LiDAR point cloud coordinate transform

  • Generate LiDAR Bird's Eye View (BEV)

  • Remove ground LiDAR points

  • Parse HD map as a three-channel image, following Uber's approach

  • Align map orientation with ego vehicle heading direction at each time step
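The LiDAR steps in the list above (coordinate transform, ground removal, BEV generation) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the project's pipeline: the ego frame is taken to be x-forward, y-left, z-up with z=0 at road level, ground removal is a plain height threshold, and the grid extent, cell size, and function name are hypothetical.

```python
import numpy as np


def lidar_to_bev(
    points_sensor: np.ndarray,   # (N, 3) points in the sensor frame
    sensor_to_ego: np.ndarray,   # (4, 4) homogeneous transform
    x_range=(-40.0, 40.0),       # hypothetical grid extent in metres
    y_range=(-40.0, 40.0),
    cell_m: float = 0.2,         # hypothetical cell size
    ground_z_m: float = 0.3,     # hypothetical ground threshold
) -> np.ndarray:
    """Rasterise one LiDAR sweep into a binary BEV occupancy grid."""
    # 1. Coordinate transform: sensor frame -> ego frame.
    ones = np.ones((points_sensor.shape[0], 1))
    points_ego = (np.hstack([points_sensor, ones]) @ sensor_to_ego.T)[:, :3]

    # 2. Ground removal: drop everything at or below the height threshold.
    points_ego = points_ego[points_ego[:, 2] > ground_z_m]

    # 3. Keep only points that fall inside the grid.
    x, y = points_ego[:, 0], points_ego[:, 1]
    inside = (
        (x >= x_range[0]) & (x < x_range[1])
        & (y >= y_range[0]) & (y < y_range[1])
    )
    x, y = x[inside], y[inside]

    # 4. Mark every cell that contains at least one point.
    height = int(round((x_range[1] - x_range[0]) / cell_m))
    width = int(round((y_range[1] - y_range[0]) / cell_m))
    rows = np.clip(((x - x_range[0]) / cell_m).astype(int), 0, height - 1)
    cols = np.clip(((y - y_range[0]) / cell_m).astype(int), 0, width - 1)
    bev = np.zeros((height, width), dtype=np.uint8)
    bev[rows, cols] = 1
    return bev
```

A height threshold is the simplest possible ground filter; on sloped roads a plane fit or the HD map's ground height would be a more robust choice.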


Baseline Model: FaF

Adapt the "late fusion" version of Uber’s Fast and Furious (FaF) paper to our problem setup.

Our Model: Fusion Seq2seq

A Seq2seq model with attention that takes the raw sensor data from the last 10 time steps and predicts the vehicle action labels for the next n frames (n=1, 5, 10, 20, 30). At each time step, the RGB video frame, LiDAR BEV image, and HD map are concatenated along the channel dimension.
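The channel-wise concatenation can be sketched as below. This only illustrates the input fusion, not the Seq2seq model itself. It assumes all three modalities have already been resized to the same height and width and stored as height x width x channels arrays; the single-channel BEV and the function names are assumptions.

```python
import numpy as np


def fuse_frame(rgb: np.ndarray, bev: np.ndarray, hd_map: np.ndarray) -> np.ndarray:
    """Stack the three modalities of one time step along the channel axis."""
    if not (rgb.shape[:2] == bev.shape[:2] == hd_map.shape[:2]):
        raise ValueError("all inputs must share the same height and width")
    return np.concatenate([rgb, bev, hd_map], axis=-1)


def fuse_sequence(rgbs, bevs, maps) -> np.ndarray:
    """Fuse every time step and stack the results along a new time axis."""
    return np.stack([fuse_frame(r, b, m) for r, b, m in zip(rgbs, bevs, maps)])
```

With a 3-channel RGB frame, a 1-channel BEV image, and the 3-channel map image, each fused frame has 7 channels, and a 10-step input sequence has shape (10, H, W, 7).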

Model Variant: 3-branch Seq2seq

The 3-branch variant of Fusion Seq2seq uses one CNN branch for each type of raw sensor data.

Evaluation and Analysis

FaF, Input 10, Predict 10, Stride 1

Fusion Seq2seq, Input 10, Predict 10, Stride 1

3-branch Seq2seq, Input 10, Predict 10, Stride 1