End-to-end Vision-aware Vehicle Decision Making via Imitation Learning
Given raw sensor data (camera, LiDAR, HD map) from the ego vehicle over the last 10 frames, predict the driver's behavior n frames into the future. The driver behavior at each future frame is one of:
- Left turn
- Right turn
- Left lane change
- Right lane change
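The input/output pairing described above can be sketched as a sliding window over a labeled drive. This is only an illustration of the task setup, not the project's data loader: the function name `make_samples`, the constant `HISTORY`, and the label strings are assumptions introduced here.

```python
from typing import List, Sequence, Tuple

HISTORY = 10  # number of past frames fed to the model (from the task definition)


def make_samples(
    labels: Sequence[str], n: int, history: int = HISTORY
) -> List[Tuple[range, List[str]]]:
    """Pair each window of `history` past frame indices with the next `n` labels."""
    samples = []
    # t is the index of the first future frame to predict.
    for t in range(history, len(labels) - n + 1):
        past_frames = range(t - history, t)   # indices of the input frames
        future_labels = list(labels[t:t + n])  # targets for the next n frames
        samples.append((past_frames, future_labels))
    return samples


# Hypothetical 16-frame drive: a left turn followed by a right lane change.
drive = ["left_turn"] * 12 + ["right_lane_change"] * 4
samples = make_samples(drive, n=5)
```

With 16 labeled frames, 10 frames of history, and n=5, only two windows fit; the first uses frames 0 to 9 as input and frames 10 to 14 as targets.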
A sample sequence and its output labels with n=5 are given below:
Real-world driving data collected by Argo AI’s self-driving test vehicles in Miami and Pittsburgh.
The data covers different seasons, weather conditions, and times of day to provide a broad range of real-world driving scenarios.
- RGB video frames (1920 x 1200 x 3) at 30 Hz
- LiDAR point cloud at 10 Hz
- High Definition Map (HD map) with drivable area and lane polygons
- Vehicle position and pose information from GPS-based and sensor-based localization
Ground Truth Label Generation & Data Pre-processing
High-quality ground truth labels are an essential but often overlooked factor in learning-based problems. To generate ground truth action labels, I tried the following two methods:
- A heuristic algorithm to generate labels using GPS rotation vector and position vector
- Label every frame manually
The labels generated by the heuristic algorithm are very noisy, so I decided to label every frame manually.
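For illustration, a pose-based heuristic of this kind might look like the sketch below. The actual algorithm is not described here, so everything in it is an assumption: the thresholds `YAW_THRESH` and `LATERAL_THRESH`, the function name `heuristic_label`, the label strings, and the convention that yaw is counter-clockwise positive with x forward and y to the left at yaw 0.

```python
import math
from typing import Optional, Tuple

# Hypothetical thresholds; the original heuristic's values are not given.
YAW_THRESH = math.radians(30.0)  # heading change that counts as a turn
LATERAL_THRESH = 2.5             # metres of sideways motion that counts as a lane change


def wrap_angle(angle: float) -> float:
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def heuristic_label(
    start_xy: Tuple[float, float],
    start_yaw: float,
    end_xy: Tuple[float, float],
    end_yaw: float,
) -> Optional[str]:
    """Label a short pose window from its first and last pose."""
    dyaw = wrap_angle(end_yaw - start_yaw)
    if abs(dyaw) >= YAW_THRESH:
        return "left_turn" if dyaw > 0 else "right_turn"

    # Project the displacement onto the vehicle's left axis at the start pose.
    dx = end_xy[0] - start_xy[0]
    dy = end_xy[1] - start_xy[1]
    lateral = -math.sin(start_yaw) * dx + math.cos(start_yaw) * dy
    if abs(lateral) >= LATERAL_THRESH:
        return "left_lane_change" if lateral > 0 else "right_lane_change"
    return None  # no maneuver detected in this window
```

Hard thresholds like these flip labels whenever the pose estimate jitters near a boundary, which is one plausible way such a heuristic ends up noisy.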
- Resize RGB video frames
- LiDAR point cloud coordinate transform
- Generate LiDAR Bird's Eye View (BEV)
- Remove ground LiDAR points
- Parse HD map as a three-channel image, following Uber's approach
- Align map orientation with ego vehicle heading direction at each time step
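Several of the steps above (coordinate transform, ground removal, BEV generation) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the project's pipeline: ground removal is approximated by a fixed height threshold `ground_z`, the BEV is a single-channel binary occupancy grid, and the function names, ranges, and resolution are all hypothetical.

```python
import numpy as np


def world_to_ego(points_xy, ego_xy, ego_yaw):
    """Express world-frame (N, 2) points in the ego frame (x forward, y left)."""
    c, s = np.cos(ego_yaw), np.sin(ego_yaw)
    rot = np.array([[c, s], [-s, c]])  # rotation by -ego_yaw
    return (np.asarray(points_xy, dtype=np.float64) - np.asarray(ego_xy)) @ rot.T


def lidar_to_bev(points, x_range=(-40.0, 40.0), y_range=(-40.0, 40.0),
                 resolution=0.5, ground_z=0.3):
    """Rasterize an (N, 3) ego-frame point cloud into a binary occupancy image."""
    points = np.asarray(points, dtype=np.float64)

    # Crude ground removal (assumption): drop points at or below ground_z metres.
    keep = points[:, 2] > ground_z
    # Crop to the region covered by the BEV image.
    keep &= (points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1])
    keep &= (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1])
    pts = points[keep]

    height = int(round((x_range[1] - x_range[0]) / resolution))
    width = int(round((y_range[1] - y_range[0]) / resolution))

    # Convert metric coordinates to pixel indices; clip to guard against
    # floating-point rounding at the upper edge of the range.
    rows = np.clip(((pts[:, 0] - x_range[0]) / resolution).astype(np.int64), 0, height - 1)
    cols = np.clip(((pts[:, 1] - y_range[0]) / resolution).astype(np.int64), 0, width - 1)

    bev = np.zeros((height, width), dtype=np.uint8)
    bev[rows, cols] = 1
    return bev


# One obstacle point, one ground point, and one point outside the crop.
cloud = np.array([[1.0, 1.0, 1.5], [1.0, 1.0, 0.1], [100.0, 0.0, 1.0]])
bev = lidar_to_bev(cloud)
```

The same `world_to_ego` rotation, applied to map polygons before rasterization, is one way to keep the map aligned with the ego heading at each time step.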
Baseline Model: FaF
Adapt the "late fusion" version of Uber’s Fast and Furious (FaF) paper to our problem setup.
Our Model: Fusion Seq2seq
A Seq2seq model with attention that takes the raw sensor data from the last 10 time steps and predicts the vehicle action labels for the next n frames (n=1, 5, 10, 20, 30). At each time step, the RGB video frame, LiDAR BEV image, and HD map are concatenated along the channel dimension.
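The channel-wise fusion step can be sketched with NumPy as below. It assumes all three inputs have already been resized to the same height and width, uses a channels-first layout, and treats the BEV as a single channel; those choices, along with the names `fuse_frame` and `fuse_sequence`, are assumptions made for illustration.

```python
import numpy as np


def fuse_frame(rgb, bev, hd_map):
    """Concatenate per-frame inputs of shape (C_i, H, W) along the channel axis."""
    if not (rgb.shape[1:] == bev.shape[1:] == hd_map.shape[1:]):
        raise ValueError("all inputs must share the same spatial size (H, W)")
    return np.concatenate([rgb, bev, hd_map], axis=0)


def fuse_sequence(frames):
    """Stack fused frames into a (T, C, H, W) array for the sequence encoder."""
    return np.stack([fuse_frame(rgb, bev, hd_map) for rgb, bev, hd_map in frames], axis=0)


# Dummy data: 10 time steps at a small, hypothetical 8 x 8 resolution.
T, H, W = 10, 8, 8
frames = [
    (
        np.zeros((3, H, W), dtype=np.float32),  # RGB frame
        np.zeros((1, H, W), dtype=np.float32),  # LiDAR BEV (1 channel assumed)
        np.zeros((3, H, W), dtype=np.float32),  # three-channel HD map image
    )
    for _ in range(T)
]
fused = fuse_sequence(frames)  # shape (10, 7, 8, 8)
```

Under these assumptions each time step yields a 7-channel image (3 RGB + 1 BEV + 3 map), so a single CNN encoder can process all modalities at once.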
Model Variant: 3-branch Seq2seq
The 3-branch variant of Fusion Seq2seq uses one CNN branch for each type of raw sensor data.