From Alignment to Reason: Multi-Agent Debate for Tactical Badminton Video Retrieval

National Yang Ming Chiao Tung University

Query: Instances of winning a point by executing a precise, low cross-court net shot that forces the opponent into a stretched position, leading to a technical error where the shuttle fails to clear the net.

Our Multi-Agent Dialectic Reasoning (MADR) framework pioneers a "Generate-then-Retrieve" paradigm. By translating raw game logs into explicit causal narratives, MADR enables highly accurate and human-interpretable video retrieval, empowering coaches and analysts to easily search for complex tactical behaviors.

Abstract

Retrieving complex tactical moments in specialized domains like badminton is a significant challenge that current Vision-Language Models (VLMs) are not equipped to handle. Trained on massive, generic datasets, VLMs learn pixel-to-text alignments but lack the domain-specific expertise to understand tactical intent or causality. We propose a novel framework that bypasses this VLM alignment, instead leveraging the rich domain knowledge of Large Language Models (LLMs) to interpret structured event data. Our method first employs domain-specific computer vision tools to decompose videos into structured, textual game logs. We demonstrate that these game logs are a remarkably potent asset for retrieval, as semantic search on these logs alone dramatically outperforms state-of-the-art VLM-based systems. To capture the causal reasoning that raw logs lack, we introduce a rigorous multi-agent dialectic reasoning process in which agents collaboratively debate the log, draft and revise a narrative, and verify that it is grounded in the original game log events. This "Generate-then-Retrieve" framework provides a step-function improvement in retrieval accuracy. Our system is not only more accurate but also fully interpretable, providing human-readable, grounded explanations for every result.

Methodology Framework

We propose a Generate-then-Retrieve framework for badminton rally videos, which enables users to search for videos using high-level tactical queries. In the first stage, game logs, such as shuttle trajectories and player positions, are extracted from the videos. A tactical debate team, consisting of multiple agents, analyzes player intentions and tactics based on these logs to transform the videos into tactical narratives. In the second stage, rally videos are retrieved by comparing user queries with the generated narratives.
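The two stages can be illustrated with a minimal Python sketch. It is not the actual implementation: the log format, the `Rally` class, the `SHOT_TYPES` vocabulary, and every function name below are hypothetical. The LLM debate agents are replaced by simple stand-ins (a drafter that joins log events, and a verifier that rejects sentences naming a shot type absent from the log), and the semantic search is replaced by bag-of-words cosine similarity so that the example runs without any external model.

```python
from __future__ import annotations

import math
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical shot vocabulary used only by the toy grounding check.
SHOT_TYPES = {"smash", "drop", "lob", "clear", "drive", "net"}


@dataclass
class Rally:
    rally_id: str
    game_log: list[str]  # one textual event per shot (assumed format)
    narrative: str = ""


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def draft_narrative(game_log: list[str]) -> str:
    # Stand-in for the drafting agent: restates the log as sentences.
    return ". ".join(game_log) + "."


def ungrounded_terms(narrative: str, game_log: list[str]) -> set[str]:
    # Stand-in for the verification agent: shot types that the narrative
    # mentions but the game log never records.
    mentioned = SHOT_TYPES & set(tokenize(narrative))
    logged = SHOT_TYPES & set(tokenize(" ".join(game_log)))
    return mentioned - logged


def revise(narrative: str, unsupported: set[str]) -> str:
    # Stand-in for the revising agent: drops sentences with unsupported claims.
    sentences = [s.strip() for s in narrative.split(".") if s.strip()]
    kept = [s for s in sentences if not (set(tokenize(s)) & unsupported)]
    return ". ".join(kept) + "."


def generate_narrative(game_log, drafter=draft_narrative, max_rounds=3) -> str:
    # Stage 1: draft, then verify and revise until the narrative is grounded.
    narrative = drafter(game_log)
    for _ in range(max_rounds):
        unsupported = ungrounded_terms(narrative, game_log)
        if not unsupported:
            break
        narrative = revise(narrative, unsupported)
    return narrative


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, rallies: list[Rally], top_k: int = 1) -> list[Rally]:
    # Stage 2: rank rallies by similarity between the query and each narrative.
    q = Counter(tokenize(query))
    ranked = sorted(
        rallies,
        key=lambda r: cosine(q, Counter(tokenize(r.narrative))),
        reverse=True,
    )
    return ranked[:top_k]


rallies = [
    Rally("r1", ["player A plays a net shot", "player B lifts a lob",
                 "player A hits a smash and wins the point"]),
    Rally("r2", ["player A hits a drive", "player B hits a drive",
                 "player B hits a drive into the net"]),
]
for rally in rallies:
    rally.narrative = generate_narrative(rally.game_log)

best = retrieve("net shot then smash wins the point", rallies)[0]
print(best.rally_id, "->", best.narrative)
```

In a real system the drafter, verifier, and reviser would be LLM agents and the similarity would come from a text embedding model; the sketch only shows how the narrative, rather than the video itself, becomes the unit that is indexed and returned alongside each result.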

Methodology Framework Overview

Retrieval Results

Category 1: Factual Queries

Queries focused on identifying specific rally attributes, simple actions, or distinct events.

Specific Event

Query: Identify the rallies where the point is won with a drop shot at the end.

Rally Attribute

Query: Identify the rallies where both players only hit drive shots, with no drop shots or smashes.


Category 2: Relational Queries

Queries demanding the understanding of temporal sequences or causal relations between events.

Temporal Sequence

Query: Identify the rallies where an attack is set up through net shots and eventually ends with a smash winning the point.

Causal Relation

Query: Find rallies where a lob from the front court pushes the opponent deep into the backcourt, followed by an aggressive smash resulting in an out-of-bounds error.


Category 3: Strategic Reasoning Queries

Complex queries requiring the deduction of abstract intents or deep tactical performance analysis.

Performance Analysis

Query: Instances of losing a rally due to a shallow defensive clear from the rear court that fails to disrupt the opponent's attacking rhythm, allowing them to execute a decisive wrist smash.

Abstract Intent

Query: Identify rallies where the player defends first and then turns to offense to win the point.