Retrieving complex, tactical moments in specialized domains like badminton is a significant challenge that current Vision-Language Models (VLMs) are not equipped to handle. Trained on massive, generic datasets, VLMs learn pixel-to-text alignments but lack the domain-specific expertise to understand tactical intent or causality. We propose a novel framework that bypasses this VLM alignment, instead leveraging the rich domain knowledge of Large Language Models (LLMs) to interpret structured event data. Our method first employs domain-specific computer vision tools to decompose videos into structured, textual game logs. We demonstrate that these game logs are a remarkably potent asset for retrieval, as semantic search on these logs alone dramatically outperforms state-of-the-art VLM-based systems. To capture the causal reasoning that raw logs lack, we introduce a rigorous, multi-agent dialectic reasoning process where agents collaboratively debate the log, draft and revise a narrative, and verify its grounding to the original game log events. This “Generate-then-Retrieve” framework provides a step-function improvement in retrieval accuracy. Our system is not only more accurate but also fully interpretable, providing human-readable, grounded explanations for every result.
We propose a Generate-then-Retrieve framework for badminton rally videos, which enables users to search for videos using high-level tactical queries. In the first stage, game logs, such as shuttle trajectories and player positions, are extracted from the videos. A tactical debate team, consisting of multiple agents, analyzes player intentions and tactics based on these logs to transform the videos into tactical narratives. In the second stage, rally videos are retrieved by comparing user queries with the generated narratives.
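The second stage described above, comparing a user query against the generated narratives, can be illustrated with a minimal sketch. This is not the authors' implementation: the text specifies semantic search, which would typically rely on learned text embeddings, whereas the sketch below substitutes a simple bag-of-words cosine similarity so that it runs without external dependencies. The function names (`tokenize`, `cosine`, `retrieve`), the rally identifiers, and the two narratives are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, narratives, top_k=1):
    """Rank rally ids by similarity between the query and each narrative.

    Assumption: `narratives` maps a rally id to the tactical narrative
    produced in the first stage. A real system would replace the
    bag-of-words vectors with embeddings from a text encoder.
    """
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(text))), rally_id)
        for rally_id, text in narratives.items()
    ]
    # Highest score first; ties are broken by rally id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [rally_id for _, rally_id in scored[:top_k]]


# Hypothetical narratives, invented for illustration only.
narratives = {
    "rally_001": (
        "The player plays tight net shots to force a weak lift, "
        "then wins the point with a smash."
    ),
    "rally_002": (
        "Both players exchange flat drive shots until one drive lands out."
    ),
}

query = "an attack is set up through net shots and ends with a smash winning the point"
best = retrieve(query, narratives)
```

Because retrieval operates on text rather than pixels, the matched narrative itself can be shown to the user as the explanation for why a rally was returned, which is the interpretability property the framework emphasizes.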
Queries focused on identifying specific rally attributes, simple actions, or distinct events.
Query: Identify the rallies where the point is won with a drop shot at the end.
Query: Identify the rallies where both players only hit drive shots, with no drop shots or smashes.
Queries demanding the understanding of temporal sequences or causal relations between events.
Query: Identify the rallies where an attack is set up through net shots and eventually ends with a smash winning the point.
Query: Find rallies where a lob from the front court pushes the opponent deep into the backcourt, followed by an aggressive smash resulting in an out-of-bounds error.
Complex queries requiring the deduction of abstract intents or deep tactical performance analysis.
Query: Instances of losing a rally due to a shallow defensive clear from the rear court that fails to disrupt the opponent's attacking rhythm, allowing them to execute a decisive wrist smash.
Query: Identify rallies where the player defends first and then turns to offense to win the point.