The KITTI Vision Benchmark Suite

Method

Point Virtual Transformer V2 [PointVit V2]

Submitted on 8 Aug. 2025 19:59 by
Veerain Sood (Texas A&M)

Running time:		0.006 s
Environment:		1 core @ 2.5 GHz (Python + C/C++)

Method Description:

This is a single-stage transformer architecture
operating directly on voxelized LiDAR points.
PointViT V2 introduces depth-based virtual points
using depth maps from BP-Net,
integrates self-attention with local depthwise
convolutions, and employs a multi-scale voxel
feature encoder.
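
As a rough illustration of what "depth-based virtual points" can mean, the sketch below back-projects a dense depth map into 3D camera-frame points with the standard pinhole model. It is not the authors' implementation: the function name `depth_to_virtual_points`, the `stride` subsampling, and the toy intrinsics are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_virtual_points(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    stride: int = 1,
) -> List[Point]:
    """Back-project depth pixels (metres) to camera-frame 3D points.

    Hypothetical helper: pixels with non-positive depth are treated as
    invalid and skipped; `stride` subsamples the image grid.
    """
    points: List[Point] = []
    for v in range(0, len(depth), stride):
        row = depth[v]
        for u in range(0, len(row), stride):
            z = row[u]
            if z <= 0.0:
                continue
            # Pinhole model: pixel (u, v) at depth z maps to (x, y, z).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with made-up intrinsics (not KITTI calibration).
toy_depth = [[10.0, 0.0], [20.0, 5.0]]
virtual_points = depth_to_virtual_points(toy_depth, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

In a real pipeline the resulting points would still need to be transformed into the LiDAR frame with the dataset calibration before being voxelized alongside the raw scan.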

Parameters:

Num_Heads = 4

LaTeX BibTeX:

@misc{sood2026pointvirtualtransformer,
title={Point Virtual Transformer},
author={Veerain Sood and Bnalin and Gaurav
Pandey},
year={2026},
eprint={2602.06406},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.06406},
}

Detailed Results

Object detection and orientation estimation results. Results for object detection are given in terms of average precision (AP) and results for joint object detection and orientation estimation are provided in terms of average orientation similarity (AOS).
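
The two metrics can be sketched as follows. This is a simplified illustration, not the official evaluation code: the function names are hypothetical, the default of 40 recall positions is an assumption about the benchmark's current sampling, and matching detections to ground truth is assumed to have been done already.

```python
import math
from typing import Sequence


def interpolated_average(
    recalls: Sequence[float],
    values: Sequence[float],
    num_points: int = 40,
) -> float:
    """Average, over evenly spaced recall thresholds, of the best value
    reached at any recall greater than or equal to the threshold."""
    total = 0.0
    for k in range(1, num_points + 1):
        threshold = k / num_points
        candidates = [val for r, val in zip(recalls, values) if r >= threshold]
        total += max(candidates) if candidates else 0.0
    return total / num_points


def orientation_similarity(
    angle_errors: Sequence[float],
    is_true_positive: Sequence[bool],
) -> float:
    """Mean of (1 + cos(error)) / 2 over all detections at one recall level.

    False positives contribute zero, so the score is bounded by precision.
    """
    if not angle_errors:
        return 0.0
    total = sum(
        (1.0 + math.cos(err)) / 2.0
        for err, tp in zip(angle_errors, is_true_positive)
        if tp
    )
    return total / len(angle_errors)


# AP: pass precision values; AOS: pass orientation similarities instead.
ap = interpolated_average([0.5, 1.0], [1.0, 0.5])
similarity = orientation_similarity([0.0, math.pi], [True, True])
```

Because each similarity term is at most 1, AOS can never exceed AP at the same difficulty level, which is consistent with the Car (Detection) and Car (Orientation) rows in the table.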

Benchmark	Easy	Moderate	Hard
Car (Detection)	97.04 %	96.56 %	88.97 %
Car (Orientation)	97.04 %	96.50 %	88.88 %
Car (3D Detection)	89.81 %	80.54 %	74.96 %
Car (Bird's Eye View)	93.59 %	89.67 %	82.12 %

[Figure: 2D object detection results.]

[Figure: Orientation estimation results.]

[Figure: 3D object detection results.]

[Figure: Bird's eye view results.]

A project of Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago
