End-to-end Vision-aware Vehicle Decision Making via Imitation Learning

Autonomous vehicles that understand road agents: Detection, tracking, and behavior prediction

Problem Statement

Given raw sensor data (camera, LiDAR, HD map) from the ego vehicle over the last 10 frames, predict the driver's behavior n frames into the future. Driver behavior at each future frame is one of:

A sample sequence and its output labels with n=5 are given below:

The real-world driving data were collected by Argo AI’s self-driving test vehicles in Miami and Pittsburgh.

The data covers different seasons, weather conditions, and times of day to provide a broad range of real-world driving scenarios.

Sensor data:

Ground Truth Label Generation & Data Pre-processing

Ground Truth

High-quality ground truth labels are an essential but often overlooked factor in learning-based problems. To generate ground truth action labels, I tried the following two methods:

The labels generated by the heuristic algorithm were very noisy, so I decided to manually label every frame.



Baseline Model: FaF

Adapt the "late fusion" version of Uber’s Fast and Furious (FaF) paper to our problem setup.

Our Model: Fusion Seq2seq

A Seq2seq model with attention that takes the raw sensor data from the last 10 time steps and predicts the vehicle action labels for the next n frames (n=1, 5, 10, 20, 30). At each time step, the RGB video frame, LiDAR BEV image, and HD map are concatenated along the channel dimension.
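
The channel-wise concatenation described above can be sketched in plain Python. This is only an illustration under assumptions: the channel counts (3 for RGB, 1 each for the LiDAR BEV image and the HD map), the 4x4 resolution, and the helper names `blank` and `fuse_channels` are hypothetical and do not come from the original model. The one hard requirement it shows is that all three inputs must share the same height and width before they can be stacked.

```python
from typing import List

# Channel-first image: [channel][row][column]
Image = List[List[List[float]]]


def blank(channels: int, height: int, width: int, value: float = 0.0) -> Image:
    """Build a constant-valued image, used here as placeholder sensor data."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def fuse_channels(*images: Image) -> Image:
    """Concatenate channel-first images along the channel axis."""
    if not images:
        raise ValueError("at least one image is required")
    height, width = len(images[0][0]), len(images[0][0][0])
    fused: Image = []
    for image in images:
        for plane in image:
            # Every channel plane must have the same spatial size.
            if len(plane) != height or any(len(row) != width for row in plane):
                raise ValueError("all inputs must share the same height and width")
            fused.append(plane)
    return fused


# Assumed shapes for one time step (hypothetical values).
rgb = blank(3, 4, 4)
lidar_bev = blank(1, 4, 4, value=1.0)
hd_map = blank(1, 4, 4, value=2.0)

fused = fuse_channels(rgb, lidar_bev, hd_map)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

With these assumed shapes the fused input has 3 + 1 + 1 = 5 channels, and a single CNN would consume it as one image per time step.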

Model Variant: 3-branch Seq2seq

The 3-branch variant of Fusion Seq2seq. It uses one CNN branch for each type of raw sensor data.
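
To contrast this variant with channel-wise fusion, the sketch below routes each sensor type through its own branch and only then joins the results. It is a toy illustration, not the actual architecture: `summarize` is a hypothetical stand-in for a CNN branch, the modality keys are made-up names, and joining the branch outputs by concatenation is an assumption, since the text does not say how the three branches are merged.

```python
from typing import Callable, Dict, List

Features = List[float]
Branch = Callable[[List[float]], Features]

MODALITIES = ("rgb", "lidar_bev", "hd_map")  # hypothetical key names


def summarize(pixels: List[float]) -> Features:
    """Stand-in for a CNN branch: reduce an input to [mean, max]."""
    return [sum(pixels) / len(pixels), max(pixels)]


def three_branch_encode(
    frame: Dict[str, List[float]], branches: Dict[str, Branch]
) -> Features:
    """Encode each modality with its own branch, then join the features.

    Joining by concatenation is an assumption made for this sketch.
    """
    fused: Features = []
    for name in MODALITIES:
        fused.extend(branches[name](frame[name]))
    return fused


# One branch per sensor type (all three reuse the same stand-in here).
branches = {name: summarize for name in MODALITIES}
frame = {"rgb": [0.0, 1.0], "lidar_bev": [2.0, 4.0], "hd_map": [1.0, 1.0]}

features = three_branch_encode(frame, branches)
print(features)
```

The design difference from Fusion Seq2seq is where the modalities meet: there, they are stacked before any feature extraction, while here each modality is processed separately first.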

Evaluation and Analysis

FaF, Input 10, Predict 10, Stride 1

Fusion Seq2seq, Input 10, Predict 10, Stride 1

3-branch Seq2seq, Input 10, Predict 10, Stride 1
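
One plausible reading of the "Input 10, Predict 10, Stride 1" setting shared by the three configurations above is a sliding window over each drive log: 10 consecutive frames as input, the following 10 frames as prediction targets, with the window start advancing one frame at a time. The sketch below follows that reading; the function name `make_windows` and the 25-frame log length are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def make_windows(
    num_frames: int, n_input: int = 10, n_predict: int = 10, stride: int = 1
) -> List[Tuple[range, range]]:
    """Return (input_frames, target_frames) index ranges for one drive log."""
    if n_input < 1 or n_predict < 1 or stride < 1:
        raise ValueError("n_input, n_predict and stride must be positive")
    windows = []
    # Last start index that still leaves room for inputs plus targets.
    last_start = num_frames - (n_input + n_predict)
    for start in range(0, last_start + 1, stride):
        split = start + n_input
        windows.append((range(start, split), range(split, split + n_predict)))
    return windows


# A hypothetical 25-frame log.
windows = make_windows(num_frames=25)
first_inputs, first_targets = windows[0]
print(len(windows), list(first_inputs), list(first_targets))
```

Under this reading, a log shorter than 20 frames yields no samples, and a larger stride trades the number of training samples for less overlap between neighboring windows.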