Semantic Scene Understanding

2D Semantic Segmentation


Our evaluation table ranks all methods according to the confidence-weighted mean intersection-over-union (mIoU). The weighted IoU of one class is defined as \(\text{IoU} = \frac{\sum_{i\in\{\text{TP}\}}c_{i}}{\sum_{i\in\{\text{TP, FP, FN}\}}c_{i}}\), where \(\{\text{TP}\}\) and \(\{\text{TP, FP, FN}\}\) denote the sets of image pixels in the intersection and in the union of the predicted and ground-truth regions of that class, respectively, and \(c_i \in [0, 1]\) is the confidence value at pixel \(i\). In contrast to the standard evaluation, where \(c_i = 1\) for all pixels, this confidence-weighted metric leverages the uncertainty to account for the ambiguity in our automatically generated annotations.
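For illustration, a minimal NumPy sketch of this metric is given below. It assumes dense integer label maps and a per-pixel confidence map; the function names (`weighted_iou`, `weighted_miou`) are ours for illustration, not part of the official evaluation code.

```python
import numpy as np

def weighted_iou(pred, gt, conf, class_id):
    """Confidence-weighted IoU of one class.

    pred, gt : integer label maps of shape (H, W)
    conf     : per-pixel confidence in [0, 1], shape (H, W)
    """
    tp = (pred == class_id) & (gt == class_id)     # intersection: {TP}
    union = (pred == class_id) | (gt == class_id)  # union: {TP, FP, FN}
    denom = conf[union].sum()
    if denom == 0:
        return float("nan")                        # class absent from both maps
    return conf[tp].sum() / denom

def weighted_miou(pred, gt, conf, class_ids):
    """Mean over classes, ignoring classes absent from prediction and GT."""
    return float(np.nanmean([weighted_iou(pred, gt, conf, c) for c in class_ids]))
```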

Rank | Method | Setting | Code | mIoU (class) | mIoU (category) | Runtime | Environment
1 | PSPNet | | code | 64.92 | 82.17 | 0.2 s | 1 core @ 2.5 GHz (C/C++)
H. Zhao, J. Shi, X. Qi, X. Wang and J. Jia: Pyramid Scene Parsing Network. CVPR 2017.
2 | FCN | | | 54.00 | 77.64 | 0.2 s | 1 core @ 2.5 GHz (C/C++)
J. Long, E. Shelhamer and T. Darrell: Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.


2D Instance Segmentation


Our evaluation table ranks all methods according to the Average Precision (AP) averaged over 10 IoU thresholds, ranging from 0.5 to 0.95 with a step size of 0.05. The IoU is weighted by the confidence as \(\text{IoU} = \frac{\sum_{i\in\{\text{TP}\}}c_{i}}{\sum_{i\in\{\text{TP, FP, FN}\}}c_{i}}\), where \(\{\text{TP}\}\) and \(\{\text{TP, FP, FN}\}\) denote the sets of image pixels in the intersection and in the union of the predicted and ground-truth masks of one instance, respectively, and \(c_i \in [0, 1]\) is the confidence value at pixel \(i\). In contrast to the standard evaluation, where \(c_i = 1\) for all pixels, this confidence-weighted metric leverages the uncertainty to account for the ambiguity in our automatically generated annotations.
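The sketch below illustrates how such an AP over multiple IoU thresholds can be computed. It assumes each predicted instance already carries a detection score and a confidence-weighted IoU with its best-matching ground-truth instance, and it omits the one-to-one matching step of the official protocol; all names are illustrative.

```python
import numpy as np

def average_precision(scores, ious, num_gt,
                      thresholds=np.arange(0.5, 1.0, 0.05)):
    """AP averaged over IoU thresholds (AP 50 corresponds to thresholds=[0.5]).

    scores : detection score per predicted instance
    ious   : confidence-weighted IoU of each prediction with its matched GT
    num_gt : number of ground-truth instances
    """
    if len(scores) == 0:
        return 0.0
    order = np.argsort(-np.asarray(scores))        # rank by detection score
    ious = np.asarray(ious)[order]
    aps = []
    for t in thresholds:
        tp = (ious >= t).astype(float)             # hit if weighted IoU >= t
        cum_tp = np.cumsum(tp)
        cum_fp = np.cumsum(1.0 - tp)
        recall = cum_tp / max(num_gt, 1)
        precision = cum_tp / (cum_tp + cum_fp)
        # step-wise area under the precision-recall curve
        area = recall[0] * precision[0]
        area += np.sum((recall[1:] - recall[:-1]) * precision[1:])
        aps.append(area)
    return float(np.mean(aps))
```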

Rank | Method | Setting | Code | AP | AP 50 | Runtime | Environment
1 | Mask R-CNN (Res. 101) | | code | 20.92 | 40.10 | 0.02 s | 1 core @ 2.5 GHz (C/C++)
K. He, G. Gkioxari, P. Dollár and R. Girshick: Mask R-CNN. PAMI 2020.
2 | Mask R-CNN (Res. 50) | | code | 19.51 | 36.25 | 0.02 s | 1 core @ 2.5 GHz (C/C++)
K. He, G. Gkioxari, P. Dollár and R. Girshick: Mask R-CNN. PAMI 2020.


3D Semantic Segmentation


Our evaluation table ranks all methods according to the confidence-weighted mean intersection-over-union (mIoU). The weighted IoU of one class is defined as \(\text{IoU} = \frac{\sum_{i\in\{\text{TP}\}}c_{i}}{\sum_{i\in\{\text{TP, FP, FN}\}}c_{i}}\), where \(\{\text{TP}\}\) and \(\{\text{TP, FP, FN}\}\) denote the sets of 3D points in the intersection and in the union of the predicted and ground-truth regions of that class, respectively, and \(c_i \in [0, 1]\) is the confidence value at point \(i\). In contrast to the standard evaluation, where \(c_i = 1\) for all points, this confidence-weighted metric leverages the uncertainty to account for the ambiguity in our automatically generated annotations.

Rank | Method | Setting | Code | mIoU (class) | mIoU (category) | Runtime | Environment
1 | DeepViewAggregation | | code | 58.25 | 73.66 | - | NVIDIA V100
D. Robert, B. Vallet and L. Landrieu: Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation. CVPR 2022.
2 | MinkowskiNet | | code | 53.92 | 74.08 | - | NVIDIA V100
C. Choy, J. Gwak and S. Savarese: 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. CVPR 2019.
D. Robert, B. Vallet and L. Landrieu: Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation. CVPR 2022.
3 | PointNet++ | | code | 35.66 | 58.28 | - | NVIDIA V100
C. Qi, L. Yi, H. Su and L. Guibas: PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. NeurIPS 2017.
4 | PointNet | | code | 13.07 | 30.42 | - | NVIDIA V100
C. Qi, H. Su, K. Mo and L. Guibas: PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. CVPR 2017.


3D Instance Segmentation


Our evaluation table ranks all methods according to the Average Precision (AP) averaged over 10 IoU thresholds, ranging from 0.5 to 0.95 with a step size of 0.05. The IoU is weighted by the confidence as \(\text{IoU} = \frac{\sum_{i\in\{\text{TP}\}}c_{i}}{\sum_{i\in\{\text{TP, FP, FN}\}}c_{i}}\), where \(\{\text{TP}\}\) and \(\{\text{TP, FP, FN}\}\) denote the sets of 3D points in the intersection and in the union of the predicted and ground-truth masks of one instance, respectively, and \(c_i \in [0, 1]\) is the confidence value at point \(i\). In contrast to the standard evaluation, where \(c_i = 1\) for all points, this confidence-weighted metric leverages the uncertainty to account for the ambiguity in our automatically generated annotations.

Rank | Method | Setting | Code | AP | AP 50 | Runtime | Environment
1 | PointGroup | | code | 34.76 | 53.61 | - | NVIDIA V100
L. Jiang, H. Zhao, S. Shi, S. Liu, C. Fu and J. Jia: PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation. CVPR 2020.
2 | PointNet++ w. clustering | | code | 23.37 | 38.53 | - | NVIDIA V100
C. Qi, L. Yi, H. Su and L. Guibas: PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. NeurIPS 2017.


3D Bounding Box Detection


We evaluate all methods using the mean Average Precision (AP) computed at IoU thresholds of 0.25 and 0.5, respectively. Our evaluation table ranks all methods according to the AP at the IoU threshold of 0.5.
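As a simplified illustration of the IoU test behind AP 25 and AP 50, the sketch below computes the IoU of two axis-aligned 3D boxes. The oriented boxes predicted by the methods above additionally require handling the yaw angle, which is omitted here; the function name is ours.

```python
import numpy as np

def box3d_iou(a_min, a_max, b_min, b_max):
    """IoU of two axis-aligned 3D boxes given by (min corner, max corner)."""
    inter_dims = np.minimum(a_max, b_max) - np.maximum(a_min, b_min)
    inter = np.prod(np.clip(inter_dims, 0.0, None))   # overlap volume
    vol_a = np.prod(a_max - a_min)
    vol_b = np.prod(b_max - b_min)
    return inter / (vol_a + vol_b - inter)

# A unit box against a copy shifted half-way along each axis:
iou = box3d_iou(np.zeros(3), np.ones(3), np.full(3, 0.5), np.full(3, 1.5))
print(iou >= 0.25, iou >= 0.5)  # the two thresholds behind AP 25 / AP 50
```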

Rank | Method | Setting | Code | AP 25 | AP 50 | Runtime | Environment
1 | PBEV+SeaBird | | code | 37.12 | 4.64 | 0.15 s | NVIDIA A100
A. Kumar, Y. Guo, X. Huang, L. Ren and X. Liu: SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects. CVPR 2024.
2 | BoxNet | | code | 23.59 | 4.08 | - | NVIDIA V100
C. Qi, O. Litany, K. He and L. Guibas: Deep Hough Voting for 3D Object Detection in Point Clouds. ICCV 2019.
3 | VoteNet | | code | 30.61 | 3.40 | - | NVIDIA V100
C. Qi, O. Litany, K. He and L. Guibas: Deep Hough Voting for 3D Object Detection in Point Clouds. ICCV 2019.
4 | I2M+SeaBird | | code | 35.04 | 3.14 | 0.02 s | NVIDIA A100
A. Kumar, Y. Guo, X. Huang, L. Ren and X. Liu: SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects. CVPR 2024.
5 | MonoDTR | | code | 39.76 | 3.02 | 0.04 s | NVIDIA A6000
K. Huang, T. Wu, H. Su and W. Hsu: MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer. CVPR 2022.
6 | DEVIANT | | code | 26.96 | 0.88 | 0.04 s | NVIDIA A100
A. Kumar, G. Brazil, E. Corona, A. Parchami and X. Liu: DEVIANT: Depth Equivariant Network for Monocular 3D Object Detection. ECCV 2022.
7 | GUP Net | | code | 27.25 | 0.87 | 0.02 s | NVIDIA A100
Y. Lu, X. Ma, L. Yang, T. Zhang, Y. Liu, Q. Chu, J. Yan and W. Ouyang: Geometry Uncertainty Projection Network for Monocular 3D Object Detection. ICCV 2021.
8 | MonoDLE | | code | 28.99 | 0.85 | 0.04 s | NVIDIA A100
X. Ma, Y. Zhang, D. Xu, D. Zhou, S. Yi, H. Li and W. Ouyang: Delving into Localization Errors for Monocular 3D Object Detection. CVPR 2021.
9 | Cube R-CNN | | code | 15.57 | 0.80 | 0.04 s | NVIDIA A100
G. Brazil, A. Kumar, J. Straub, N. Ravi, J. Johnson and G. Gkioxari: Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild. CVPR 2023.
10 | MonoDETR | | code | 27.13 | 0.79 | 0.4 s | 1 core @ 2.5 GHz (C/C++)
R. Zhang, H. Qiu, T. Wang, X. Xu, Z. Guo, Y. Qiao, P. Gao and H. Li: MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection. ICCV 2023.
11 | GrooMeD-NMS | | code | 16.12 | 0.17 | 0.12 s | 1 core @ 2.5 GHz (Python)
A. Kumar, G. Brazil and X. Liu: GrooMeD-NMS: Grouped Mathematically Differentiable NMS for Monocular 3D Object Detection. CVPR 2021.


Semantic Scene Completion


We evaluate both geometric completion and semantic estimation, and rank the methods according to the confidence-weighted mean intersection-over-union (mIoU). Geometric completion is evaluated via completeness and accuracy at a threshold of 20 cm. Completeness is the fraction of ground-truth points whose distances to their closest reconstructed points fall below the threshold. Accuracy, in turn, measures the percentage of reconstructed points that lie within the distance threshold of the ground-truth points. As our ground-truth reconstruction may itself be incomplete, we avoid penalizing reconstructed points in unobserved regions: the space is divided into observed and unobserved regions based on the unobserved volume of a 3D occupancy map obtained using OctoMap. We further report the F1 score, the harmonic mean of completeness and accuracy.
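A minimal sketch of these geometric metrics is given below. It assumes the reconstruction and ground truth are given as (N, 3) point arrays and, for brevity, ignores the observed/unobserved split derived from the OctoMap occupancy grid; the function name is ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def completion_metrics(recon, gt, threshold=0.20):
    """Accuracy, completeness and F1 at a distance threshold (in meters).

    recon, gt : reconstructed and ground-truth point clouds, shape (N, 3)
    """
    d_gt_to_recon, _ = cKDTree(recon).query(gt)    # distance per GT point
    d_recon_to_gt, _ = cKDTree(gt).query(recon)    # distance per recon point
    completeness = float(np.mean(d_gt_to_recon <= threshold))
    accuracy = float(np.mean(d_recon_to_gt <= threshold))
    f1 = 2 * accuracy * completeness / max(accuracy + completeness, 1e-9)
    return accuracy, completeness, f1
```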

Rank | Method | Setting | Code | Accuracy | Completeness | F1 | mIoU (class) | Runtime | Environment
1 | EncDec | | | 41.36 | 41.23 | 41.29 | 9.07 | - | NVIDIA V100
Y. Liao, J. Xie and A. Geiger: KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D. arXiv 2021.
2 | SimultaneousSampling | | | 80.37 | 29.49 | 43.15 | 3.88 | 3 h | NVIDIA V100
3 | Raw Input | | | 98.24 | 19.07 | 32.35 | 0.00 | - | NVIDIA V100
Y. Liao, J. Xie and A. Geiger: KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D. arXiv 2021.




