Autoregressive waypoint prediction via behavioral cloning on 404 real nuScenes expert demonstrations. Lower-triangular causal mask. 666K parameters.
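The lower-triangular causal mask mentioned above can be illustrated with a small NumPy sketch. The function names and the toy size `T = 4` are assumptions for illustration, not the project's code; the point is only that waypoint `i` may attend to waypoints `0..i` and never to later ones.

```python
import numpy as np

def causal_mask(T: int) -> np.ndarray:
    """Boolean (T, T) mask; entry [i, j] is True when step i may attend to step j."""
    return np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax over `scores`; masked-out positions receive zero weight."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy example: uniform scores over 4 waypoints (hypothetical size).
mask = causal_mask(4)
weights = masked_softmax(np.zeros((4, 4)), mask)
```

With uniform scores, each row spreads its weight evenly over the current and earlier steps, and every entry above the diagonal is exactly zero.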
Vectorized BEV Pool Kernel
Replaces the Python for-loop over 6 cameras with a single batched einsum GPU operation. 2.1× speedup on CPU. Runs on Apple MPS.
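To make the loop-to-einsum idea concrete, here is a minimal NumPy sketch. The tensor shapes and the per-camera projection matrix `proj` are hypothetical stand-ins; the project's actual kernel and layout are not shown on this page.

```python
import numpy as np

def bev_pool_loop(feats: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Reference version: accumulate each camera's contribution in a Python loop.

    feats: (N, C, P) per-camera features, proj: (N, P, G) pixel-to-BEV-cell weights.
    """
    bev = np.zeros((feats.shape[1], proj.shape[2]))
    for n in range(feats.shape[0]):
        bev += feats[n] @ proj[n]
    return bev

def bev_pool_einsum(feats: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Same result in one call: contract over cameras (n) and pixels (p)."""
    return np.einsum("ncp,npg->cg", feats, proj)

# Toy sizes (assumed): 6 cameras, 8 channels, 32 pixels, 16 BEV cells.
rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 8, 32))
proj = rng.random((6, 32, 16))
pooled = bev_pool_einsum(feats, proj)
```

Both functions return the same `(C, G)` array; the einsum form simply hands the whole contraction to the backend instead of iterating per camera in Python.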
Sparse Attention Training
SparseCausalTrajHead with strided (64%), local window (73%), and combined (58%) attention patterns. O(T·k) vs. O(T²) dense.
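The three pattern families can be sketched as boolean masks. This is an illustrative construction only: the `window` and `stride` values are arbitrary, and treating "combined" as the union of the two patterns is an assumption, since the page does not define it.

```python
import numpy as np

def _offsets(T: int):
    """Row index i and column index j as broadcastable arrays."""
    return np.arange(T)[:, None], np.arange(T)[None, :]

def local_window_mask(T: int, window: int) -> np.ndarray:
    """Causal mask where step i attends to the last `window` steps, itself included."""
    i, j = _offsets(T)
    return (j <= i) & (i - j < window)

def strided_mask(T: int, stride: int) -> np.ndarray:
    """Causal mask where step i attends to steps i, i - stride, i - 2*stride, ..."""
    i, j = _offsets(T)
    return (j <= i) & ((i - j) % stride == 0)

def combined_mask(T: int, window: int, stride: int) -> np.ndarray:
    """Union of the local and strided patterns (assumed definition of 'combined')."""
    return local_window_mask(T, window) | strided_mask(T, stride)

mask = combined_mask(6, window=2, stride=2)
keys_per_query = mask.sum(axis=1)  # attention cost scales with these counts
```

Attention cost then follows the number of `True` entries per row rather than the full sequence length, which is where the saving over dense attention comes from.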
LLM Fine-tuning
GPT-2 (124M params) fine-tuned on tokenized nuScenes trajectories. 404-token vocabulary. Causal LM objective. Loss 16.97 → 0.0004.
Neural Network Pruning
L1 unstructured pruning: 30% sparsity → 464K params. Zero latency regression. Compatible with downstream quantization.
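L1 unstructured pruning zeroes the individual weights with the smallest absolute value. The pure-Python sketch below shows the idea on a flat list of made-up weights. PyTorch provides `torch.nn.utils.prune.l1_unstructured` for the same operation on modules, though whether the project uses that utility is not stated here.

```python
def l1_unstructured_prune(weights, amount):
    """Return a copy of `weights` with the smallest-|w| fraction `amount` set to 0.0."""
    n_prune = round(amount * len(weights))
    # Indices ordered by absolute value, smallest magnitude first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

# Hypothetical weights; 30% sparsity removes the three smallest magnitudes.
weights = [0.5, -0.05, 0.3, 0.01, -0.8, 0.2, -0.02, 0.9, -0.4, 0.1]
pruned = l1_unstructured_prune(weights, 0.3)
nonzero = sum(w != 0.0 for w in pruned)
```

The pruned tensor keeps its shape; only the count of nonzero parameters drops.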
BEV Occupancy Forecasting
Predicts the next three BEV frames (T+1, T+2, T+3) from T=4 past observations. Same paradigm as UniAD (CVPR 2023). 6.6M params.
Self-Supervised Learning
CameraTrustScorer
Detects degraded cameras with zero fault labels, using purely contrastive self-supervised learning.
Among the compared systems (ProtoOcc, GAFusion, PointBeV), the only one with real-time fault tolerance.
# Contrastive margin loss: the only supervision signal
# No fault labels. No human annotation. Zero manual supervision.
L_trust = max(0, t_faulted - t_clean + 0.2)
# t_clean > t_faulted + 0.2 enforced during training
# Training signal comes from data augmentation, not labels
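As a runnable illustration of the margin loss above, the sketch below evaluates it on scalar trust scores. The function names and the batch-averaging step are assumptions; only the formula and the 0.2 margin come from the snippet.

```python
def trust_margin_loss(t_clean, t_faulted, margin=0.2):
    """Hinge loss that is zero once t_clean exceeds t_faulted by at least `margin`."""
    return max(0.0, t_faulted - t_clean + margin)

def batch_trust_loss(clean_scores, faulted_scores, margin=0.2):
    """Mean margin loss over paired (clean, augmented-faulted) scores."""
    losses = [trust_margin_loss(c, f, margin)
              for c, f in zip(clean_scores, faulted_scores)]
    return sum(losses) / len(losses)

# A well-separated pair contributes nothing; a poorly separated pair is penalized.
separated = trust_margin_loss(0.795, 0.340)
too_close = trust_margin_loss(0.50, 0.45)
mean_loss = batch_trust_loss([0.795, 0.50], [0.340, 0.45])
```

The faulted score in each pair would come from augmenting the clean image (blur, occlusion, and so on), which is why no fault labels are needed.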
| Condition | Trust Score | Drop vs Clean | Category |
|---|---|---|---|
| Clean (baseline) | 0.795 | - | - |
| Blur | 0.340 | -57% | Known |
| Occlusion | 0.310 | -61% | Known |
| Noise | 0.460 | -42% | Known |
| Glare | 0.420 | -47% | Known |
| Rain | 0.491 | -38% | Known |
| Heavy Snow ⚡ | 0.355 | ~-55% | UNSEEN |
| Dense Fog ⚡ | 0.380 | ~-52% | UNSEEN |

⚡ UNSEEN = not in training set. Detected via physics signals (Laplacian + Sobel) that generalize across fault types.
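The Laplacian and Sobel signals named in the footnote are standard image-sharpness cues. The NumPy sketch below computes Laplacian variance and mean Sobel gradient magnitude on a grayscale array. How the project combines such signals into a trust score is not described here, so the function names and the checkerboard test pattern are purely illustrative.

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 3x3 cross-correlation of a 2-D grayscale image."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out

def sharpness_signals(img: np.ndarray):
    """Return (Laplacian variance, mean Sobel gradient magnitude)."""
    lap_var = filter3x3(img, LAPLACIAN).var()
    gx = filter3x3(img, SOBEL_X)
    gy = filter3x3(img, SOBEL_Y)
    edge_mean = np.sqrt(gx ** 2 + gy ** 2).mean()
    return float(lap_var), float(edge_mean)

# Checkerboard test pattern and a 3x3 box-blurred copy of it.
yy, xx = np.indices((16, 16))
sharp = ((yy // 4 + xx // 4) % 2).astype(float)
blurred = filter3x3(sharp, np.full((3, 3), 1.0 / 9.0))
sharp_lap, sharp_edge = sharpness_signals(sharp)
blur_lap, blur_edge = sharpness_signals(blurred)
```

On this pattern the Laplacian variance falls sharply after blurring. Because such cues depend on image statistics rather than on a specific fault label, they can plausibly respond to degradations that never appeared in training.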
Ablation Study
No Trust vs Uniform vs Trust-Aware
Comparing fusion strategies to isolate the benefit of the CameraTrustScorer.
Trust benefit is larger under fault conditions, as designed.
No Trust (baseline): 0.0706. Uniform weights, ignores camera quality.
Uniform Average: 0.0752. Simple mean across all cameras equally.
Trust-Aware (ours) ⭐: 0.0776. Weighted by self-supervised trust score.
Under Fault Conditions (1 camera faulted)
No Trust: 0.0643
Uniform Average: 0.0717
Trust-Aware ⭐: 0.0814 (+26.6% over No-Trust)
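A minimal sketch of trust-weighted fusion next to a uniform average, plus the arithmetic behind the +26.6% figure. Linear normalization of the trust scores and the toy feature vectors are assumptions; the page does not specify how the scores are turned into fusion weights.

```python
def fuse(features, weights):
    """Weighted average of per-camera feature vectors; weights need not sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)]

def relative_gain(new, baseline):
    """Fractional improvement of `new` over `baseline`."""
    return (new - baseline) / baseline

# Hypothetical features: two clean cameras agree, the third is corrupted.
features = [[1.0, 2.0], [1.0, 2.0], [5.0, -2.0]]
trust = [0.795, 0.795, 0.310]  # clean and occlusion scores from the table above

uniform = fuse(features, [1.0, 1.0, 1.0])
trust_aware = fuse(features, trust)
gain = relative_gain(0.0814, 0.0643)  # Trust-Aware vs No Trust, 1 camera faulted
```

The trust-weighted result stays closer to the clean cameras' `[1.0, 2.0]` than the uniform mean does, and `gain` reproduces the reported improvement: (0.0814 - 0.0643) / 0.0643 ≈ 0.266.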
Experimental History
13 Training Experiments
v2: Initial CNN + trust scorer. First working pipeline.
v5: AdamW + CosineAnnealingLR. Loss 26 → 9.5.
v7: Scene-level splits. No data leakage.
v8: Geometry BEV lifter. IoU = 0.136.
v9: LiDAR depth supervision. ADE 2.740 → 2.559 m.
v11: T=4 temporal video fusion + 128×128 BEV. ADE = 2.457 m (best).
v13: 3-class semantic labels. IoU = 0.131 (vehicle).
v14: Full LSS from scratch. Needs more epochs.
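ADE (average displacement error), the trajectory metric quoted in the history above, is the mean Euclidean distance between predicted and ground-truth waypoints. A small sketch with made-up 2-D waypoints in metres:

```python
import math

def average_displacement_error(pred, target):
    """Mean Euclidean distance between paired (x, y) waypoints, in input units."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred, target)]
    return sum(dists) / len(dists)

# Hypothetical two-waypoint trajectory: errors of 0 m and 5 m average to 2.5 m.
ade = average_displacement_error([(0.0, 0.0), (3.0, 4.0)],
                                 [(0.0, 0.0), (0.0, 0.0)])
```

Lower is better, so the v9 and v11 entries each record an improvement over the previous best.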
vs State of the Art
CVPR Paper Comparison
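| System | Speed | Parameters | Trajectory | Fault Tolerance | Hardware |
|---|---|---|---|---|---|
| ProtoOcc (CVPR 2025) | 9.5 FPS | 46.2M | No | No | 8×A100 |
| GAFusion (CVPR 2024) | 8 FPS | ~80M | No | No | 2×3090 |
| PointBeV (CVPR 2024) | ~10 FPS | ~40M | No | No | A100 |
| OpenDriveFM (ours) | 317 FPS | 553K | Yes (ADE = 2.457 m) | Yes (7 fault types) | MacBook |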
Try It Yourself
Run the live demo locally or explore the full codebase