Method

Language Guidance to Achieve Multi-Modal 3D Multi-Object Tracking [LG3MOT]


Submitted on 24 Feb. 2025 07:33 by
Longhui Hu (USTB)

Running time: 0.26 s
Environment: GPU @ 2.5 GHz (Python)

Method Description:
The proposed Language-Guided Multi-Modal 3D Multi-Object Tracking (LG-MM3DMOT) framework integrates Vision-Language Models (VLMs) to enhance tracking by aligning image regions with textual concepts using RegionCLIP. It introduces a Target Semantic Matching (TSM) module to filter noisy regions, a 3D Feature EMA module for temporal feature fusion, and a Gaussian Confidence Fusion module for weighted trajectory confidence computation. Additionally, an Early Drop Strategy leverages semantic information to manage trajectories efficiently by terminating semantically mismatched ones early. Together, these components improve tracking accuracy and robustness in complex scenarios.
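
The exact formulations behind these modules are not given in this entry, so the following is a minimal, hypothetical Python sketch of how the 3D Feature EMA fusion and the Gaussian Confidence Fusion could work. The function names, the smoothing factor alpha, and the Gaussian width sigma are all illustrative assumptions, not published values.

import numpy as np

def ema_fuse(track_feat, det_feat, alpha=0.9):
    # 3D Feature EMA (sketch): exponentially smooth a track's stored
    # feature with the matched detection's feature, then re-normalize.
    # alpha is an assumed smoothing factor.
    fused = alpha * np.asarray(track_feat) + (1.0 - alpha) * np.asarray(det_feat)
    return fused / (np.linalg.norm(fused) + 1e-12)

def gaussian_conf_fusion(conf_history, sigma=4.0):
    # Gaussian Confidence Fusion (sketch): average a trajectory's recent
    # confidence history with Gaussian weights centered on the newest
    # frame, so recent detections dominate. sigma is assumed.
    h = np.asarray(conf_history, dtype=np.float64)
    ages = np.arange(len(h))[::-1]          # 0 = newest entry
    w = np.exp(-0.5 * (ages / sigma) ** 2)  # Gaussian weights
    return float(np.sum(w * h) / np.sum(w))

# Example: the newest confidence (0.60) receives the largest weight.
trajectory_conf = gaussian_conf_fusion([0.80, 0.85, 0.90, 0.60])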
Parameters:
confidence_thresh=0.1
confidence_his_max=16
max_age=48
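
The three parameters above suggest a standard tracking-by-detection lifecycle. Below is a minimal, hypothetical sketch of how they could plausibly be applied; the Track class and the semantic-mismatch trigger for the Early Drop Strategy are illustrative assumptions, and only the parameter names and values come from this entry.

from collections import deque

CONFIDENCE_THRESH = 0.1   # from the entry: minimum detection score kept
CONFIDENCE_HIS_MAX = 16   # from the entry: length of stored confidence history
MAX_AGE = 48              # from the entry: frames a track may survive unmatched

class Track:
    def __init__(self, track_id):
        self.track_id = track_id
        self.conf_history = deque(maxlen=CONFIDENCE_HIS_MAX)
        self.misses = 0  # consecutive frames without a matched detection

    def update(self, det_conf):
        self.conf_history.append(det_conf)
        self.misses = 0

    def mark_missed(self):
        self.misses += 1

    def should_drop(self, semantic_mismatch=False):
        # Early Drop Strategy (assumed trigger): terminate a semantically
        # mismatched trajectory before it would age out at MAX_AGE.
        return semantic_mismatch or self.misses > MAX_AGE

def filter_detections(detections):
    # Gate detections on the confidence threshold before association.
    return [d for d in detections if d["score"] >= CONFIDENCE_THRESH]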
LaTeX BibTeX:

Detailed Results

From all 29 test sequences, our benchmark computes the commonly used tracking metrics CLEAR MOT, MT/PT/ML, identity switches, and fragmentations [1,2]. The tables below show all of these metrics.
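
For reference, MOTA and MOTP in the first table follow the standard CLEAR MOT definitions from [1]; the benchmark's exact matching protocol, such as the overlap threshold, is specific to its evaluation code:

\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDS}_t\right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}

where GT_t counts ground-truth objects in frame t, d_{t,i} is the matching distance of match i in frame t, and c_t is the number of matches in frame t.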


Benchmark   MOTA      MOTP      MODA      MODP
CAR         86.17 %   86.25 %   86.26 %   88.93 %

Benchmark   recall    precision   F1        TP      FP     FN     FAR       #objects   #trajectories
CAR         94.70 %   93.16 %     93.92 %   36511   2682   2042   24.11 %   46763      1236

Benchmark   MT        PT        ML       IDS   FRAG
CAR         82.31 %   15.23 %   2.46 %   33    374



[1] K. Bernardin, R. Stiefelhagen: Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. JIVP 2008.
[2] Y. Li, C. Huang, R. Nevatia: Learning to associate: HybridBoosted multi-target tracker for crowded scene. CVPR 2009.

