Method

Voxel-Pixel Fusion Network [VPFNet]
TBD

Submitted on 20 Jun. 2021 13:40 by
CHIA-HUNG WANG (National Taiwan University)

Running time: 0.2 s
Environment: 1 core @ 2.5 GHz (C/C++)

Method Description:
Disclaimer:

For consistency, we strongly recommend using the abbreviation "VPFNet" rather than "VoPiFNet" when citing our journal publication. We coined this abbreviation originally, and other researchers have adopted it consistently.
Parameters:
Disclaimer:

Another method, whose results were published on 2021-05-14, was initially named DRF (confirmed on 2021-06-20, when we published our results). It was later renamed VPF (confirmed on 2021-07-15) and then VPFNet (confirmed on 2021-07-17).

The corresponding journal paper, titled "VPFNet: Improving 3D Object Detection with Virtual Point-based LiDAR and Stereo Data Fusion," authored by Hanqi Zhu, Jiajun Deng, Yu Zhang, Jianmin Ji, Qiuyu Mao, Houqiang Li, and Yanyong Zhang from the University of Science and Technology of China (USTC), was published in IEEE Transactions on Multimedia in 2022.

We do not know why they renamed their method to match ours. Moreover, we were asked to rename our method from "VPFNet" to "VoPiFNet" in our journal paper and to cite their paper, since theirs was published earlier. Please take care to cite the correct paper in future work.
Latex Bibtex:
@misc{wang2021vpfnet,
  title={VPFNet: Voxel-Pixel Fusion Network for Multi-class 3D Object Detection},
  author={Chia-Hung Wang and Hsueh-Wei Chen and Li-Chen Fu},
  year={2021},
  eprint={2111.00966},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
@article{wang2024vopifnet,
  author={Wang, Chia-Hung and Chen, Hsueh-Wei and Chen, Yi and Hsiao, Pei-Yung and Fu, Li-Chen},
  journal={IEEE Transactions on Intelligent Transportation Systems},
  title={VoPiFNet: Voxel-Pixel Fusion Network for Multi-Class 3D Object Detection},
  year={2024},
  volume={},
  number={},
  pages={1-11},
  keywords={Three-dimensional displays;Feature extraction;Laser radar;Cameras;Detectors;Object detection;Point cloud compression;Multi-modal;multi-class;cross-modal attention;3D object detection;deep learning},
  doi={10.1109/TITS.2024.3392783}
}

Detailed Results

Object detection and orientation estimation results. Detection results are reported as average precision (AP); joint detection and orientation estimation is reported as average orientation similarity (AOS). A sketch of how these metrics are computed is given after the table.


Benchmark                       Easy      Moderate  Hard
Car (Detection)                 96.06 %   95.17 %   92.66 %
Car (Orientation)               96.03 %   95.01 %   92.41 %
Car (3D Detection)              88.51 %   80.97 %   76.74 %
Car (Bird's Eye View)           93.94 %   90.52 %   86.25 %
Pedestrian (Detection)          75.03 %   65.68 %   61.95 %
Pedestrian (Orientation)        67.96 %   58.63 %   54.77 %
Pedestrian (3D Detection)       54.65 %   48.36 %   44.98 %
Pedestrian (Bird's Eye View)    60.07 %   52.41 %   50.28 %
Cyclist (Detection)             82.60 %   74.52 %   66.04 %
Cyclist (Orientation)           82.08 %   73.62 %   65.27 %
Cyclist (3D Detection)          77.64 %   64.10 %   58.00 %
Cyclist (Bird's Eye View)       80.83 %   67.66 %   61.36 %
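The metrics above follow the KITTI evaluation protocol. As a rough illustration, here is a minimal sketch of interpolated AP and AOS computation, assuming detections have already been matched to ground truth (overlap thresholds, difficulty filtering, and the matching itself are omitted). The function names and the 40-point recall sampling are our assumptions for illustration, not the official evaluation code.

import numpy as np

def kitti_interp_ap(recall, metric, n_points=40):
    # Interpolated average over evenly spaced recall thresholds:
    # at each threshold, take the best metric value achieved at that
    # recall or higher (max interpolation).
    thresholds = np.linspace(1.0 / n_points, 1.0, n_points)
    total = 0.0
    for t in thresholds:
        mask = recall >= t
        total += metric[mask].max() if mask.any() else 0.0
    return total / n_points

def pr_and_aos_curves(scores, is_tp, delta_theta, num_gt):
    # Build precision/recall and orientation-similarity curves from
    # score-ranked detections. `is_tp` flags true positives,
    # `delta_theta` is each detection's orientation error (radians),
    # `num_gt` is the number of ground-truth objects.
    order = np.argsort(-scores)
    tp = is_tp[order].astype(float)
    # Orientation similarity: (1 + cos(dtheta)) / 2 for true
    # positives, 0 for false positives.
    sim = np.where(is_tp[order],
                   (1.0 + np.cos(delta_theta[order])) / 2.0, 0.0)
    ranks = np.arange(1, len(tp) + 1)
    recall = np.cumsum(tp) / num_gt
    precision = np.cumsum(tp) / ranks
    orientation_sim = np.cumsum(sim) / ranks  # similarity-weighted precision
    return recall, precision, orientation_sim

AP is then kitti_interp_ap(recall, precision) and AOS is kitti_interp_ap(recall, orientation_sim); since each true positive contributes a similarity of at most 1, AOS can never exceed the 2D detection AP, which matches the table above (e.g., Car: 96.03 % orientation vs. 96.06 % detection).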


[Figures omitted: precision/recall curves for 2D object detection, orientation estimation, 3D object detection, and bird's eye view results, one set per benchmark class.]


