Method

Language Guidance to Achieve Multi-Modal 3D Multi-Object Tracking [LG3MOT]


Submitted on 24 Feb. 2025 07:33 by
Longhui Hu (USTB)

Running time: 0.26 s
Environment: GPU @ 2.5 GHz (Python)

Method Description:
The proposed Language-Guided Multi-Modal 3D Multi-Object Tracking (LG-MM3DMOT) framework integrates Vision-Language Models (VLMs) to enhance tracking by aligning image regions with textual concepts using RegionCLIP. It introduces a Target Semantic Matching (TSM) module to filter noisy regions, a 3D Feature EMA module for temporal feature fusion, and a Gaussian Confidence Fusion module for weighted trajectory confidence computation. Additionally, an Early Drop Strategy leverages semantic information to manage trajectories efficiently by terminating semantically mismatched ones early. Together, these components improve tracking accuracy and robustness in complex scenarios.
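
The exact formulations behind these modules are not given in this entry, so the following is a minimal, hypothetical Python sketch of how the 3D Feature EMA fusion and the Gaussian Confidence Fusion could work. The function names, the smoothing factor alpha, and the Gaussian width sigma are all illustrative assumptions, not published values.

import numpy as np

def ema_fuse(track_feat, det_feat, alpha=0.9):
    # 3D Feature EMA (sketch): exponentially smooth a track's stored
    # feature with the matched detection's feature, then re-normalize.
    # alpha is an assumed smoothing factor.
    fused = alpha * np.asarray(track_feat) + (1.0 - alpha) * np.asarray(det_feat)
    return fused / (np.linalg.norm(fused) + 1e-12)

def gaussian_conf_fusion(conf_history, sigma=4.0):
    # Gaussian Confidence Fusion (sketch): average a trajectory's recent
    # confidence history with Gaussian weights centered on the newest
    # frame, so recent detections dominate. sigma is assumed.
    h = np.asarray(conf_history, dtype=np.float64)
    ages = np.arange(len(h))[::-1]          # 0 = newest entry
    w = np.exp(-0.5 * (ages / sigma) ** 2)  # Gaussian weights
    return float(np.sum(w * h) / np.sum(w))

# Example: the newest confidence (0.60) receives the largest weight.
trajectory_conf = gaussian_conf_fusion([0.80, 0.85, 0.90, 0.60])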
Parameters:
confidence_thresh=0.1
confidence_his_max=16
max_age=48
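
The three parameters above suggest a standard tracking-by-detection lifecycle. Below is a minimal, hypothetical sketch of how they could plausibly be applied; the Track class and the semantic-mismatch trigger for the Early Drop Strategy are illustrative assumptions, and only the parameter names and values come from this entry.

from collections import deque

CONFIDENCE_THRESH = 0.1   # from the entry: minimum detection score kept
CONFIDENCE_HIS_MAX = 16   # from the entry: length of stored confidence history
MAX_AGE = 48              # from the entry: frames a track may survive unmatched

class Track:
    def __init__(self, track_id):
        self.track_id = track_id
        self.conf_history = deque(maxlen=CONFIDENCE_HIS_MAX)
        self.misses = 0  # consecutive frames without a matched detection

    def update(self, det_conf):
        self.conf_history.append(det_conf)
        self.misses = 0

    def mark_missed(self):
        self.misses += 1

    def should_drop(self, semantic_mismatch=False):
        # Early Drop Strategy (assumed trigger): terminate a semantically
        # mismatched trajectory before it would age out at MAX_AGE.
        return semantic_mismatch or self.misses > MAX_AGE

def filter_detections(detections):
    # Gate detections on the confidence threshold before association.
    return [d for d in detections if d["score"] >= CONFIDENCE_THRESH]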
LaTeX BibTeX:

Detailed Results

From all 29 test sequences, our benchmark computes the commonly used tracking metrics CLEAR MOT, MT/PT/ML, identity switches, and fragmentations [1,2]. The tables below show all of these metrics.
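
For reference, MOTA and MOTP in the first table follow the standard CLEAR MOT definitions from [1]; the benchmark's exact matching protocol, such as the overlap threshold, is specific to its evaluation code:

\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDS}_t\right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}

where GT_t counts ground-truth objects in frame t, d_{t,i} is the matching distance of match i in frame t, and c_t is the number of matches in frame t.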


Benchmark   MOTA      MOTP      MODA      MODP
CAR         86.17 %   86.25 %   86.26 %   88.93 %

Benchmark   recall    precision   F1        TP      FP     FN     FAR       #objects   #trajectories
CAR         94.70 %   93.16 %     93.92 %   36511   2682   2042   24.11 %   46763      1236

Benchmark   MT        PT        ML       IDS   FRAG
CAR         82.31 %   15.23 %   2.46 %   33    374



[1] K. Bernardin, R. Stiefelhagen: Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. JIVP 2008.
[2] Y. Li, C. Huang, R. Nevatia: Learning to associate: HybridBoosted multi-target tracker for crowded scene. CVPR 2009.

