Overview
System Pipeline
User state, agent state, and scene observations are filtered into multimodal prompts that drive role-aware planning and path generation.
We present role-aware virtual agents for navigational interaction that generate consistent, role-aligned movement behaviors.
Our approach leverages Multimodal Large Language Models (MLLMs) to interpret multimodal inputs, including scene information, user state, and high-level language role instructions, producing discrete navigation decisions and stylized planning paths. This enables virtual agents to behave consistently with their narrative roles and respond to dynamic actions, such as playing hide-and-seek while taking into account the agent's role and the user's likely intention.
Our approach demonstrates how MLLMs can go beyond language-based interaction to support embodied, spatial, and role-aware agent behaviors in immersive environments such as augmented reality.
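To make the decision loop concrete, the sketch below shows one way the role instruction, user state, and scene information could be folded into a prompt that yields a discrete navigation decision. It is a minimal illustration, not our actual system interface: the names (build_prompt, query_mllm, AgentState), the action set, and the state fields are all assumptions for this sketch.

# Illustrative sketch of a role-aware decision step.
# query_mllm() is a hypothetical stand-in for a vision-language model call.
from dataclasses import dataclass

@dataclass
class AgentState:
    position: tuple  # agent position in world coordinates
    role: str        # high-level language role instruction

def build_prompt(role: str, user_state: dict, scene_summary: str) -> str:
    """Fold role instruction, user state, and scene info into one text prompt."""
    return (
        f"You control a virtual agent whose role is: {role}.\n"
        f"User state: {user_state}.\n"
        f"Scene: {scene_summary}.\n"
        "Choose ONE discrete navigation decision from: "
        "move_to_cover, approach_user, retreat, wait."
    )

def query_mllm(prompt: str, images: list) -> str:
    """Placeholder for an MLLM call; returns a stubbed reply for illustration."""
    return "move_to_cover"

def decide(agent: AgentState, user_state: dict, scene_summary: str, views: list) -> str:
    prompt = build_prompt(agent.role, user_state, scene_summary)
    return query_mllm(prompt, views)

# Example: a hider agent reacting to an approaching user.
agent = AgentState(position=(2.0, 0.0, 5.0), role="hider: prioritize stealth and cover")
print(decide(agent, {"position": (0.0, 0.0, 0.0), "moving": True},
             "indoor room with sofa and shelves", views=[]))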
The agent keeps the interaction fun and keeps itself discoverable, encouraging the user to find it without frustration.
The agent prioritizes stealth, distance, and cover to avoid being detected by the user.
The agent guides the user toward the goal using safe and understandable navigation behavior.
Our virtual agent generates different navigation behaviors depending on the assigned role, even within the same environment.
The dog tracks the wolf, blocks likely flanking routes, and pushes the threat away from the herd while keeping the sheep grouped.
The dog guides the herd around environmental hazards, rescues isolated sheep, and recenters the flock to restore cohesion.
The dog handles both environmental hazards and predator pressure at once, balancing defense, rerouting, and herd cohesion in a single behavior.
The MLLM-controlled dog dynamically balanced sheep support, hazard mitigation, and predator blocking in response to changing environmental conditions.
Our framework combines decision filtering with role-aware agent behavior generation.
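As a rough illustration of the decision-filtering stage, the MLLM's raw reply might be checked against a fixed action schema and the currently feasible actions before it is allowed to drive path generation. The action names and the reachable set below are assumptions for this sketch, not our exact filter.

# Hypothetical decision filter: keep only schema-valid, currently feasible actions.
VALID_ACTIONS = {"move_to_cover", "approach_user", "retreat", "wait"}

def filter_decision(raw_reply: str, reachable: set) -> str:
    """Validate an MLLM reply before it drives path generation."""
    action = raw_reply.strip().lower()
    if action not in VALID_ACTIONS:
        return "wait"            # unparseable reply -> safe fallback
    if action not in reachable:  # e.g., no cover point reachable on the navmesh
        return "wait"
    return action

print(filter_decision("Move_to_cover", reachable={"move_to_cover", "retreat"}))
# -> "move_to_cover"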
MLLMs
Each simulation step captures user motion, scene state, and agent context for MLLM-based decision making. The MLLM receives RGB and segmentation views from top-down, agent, and user perspectives.
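For illustration, a single step's inputs could be bundled as below. The field names are assumptions, but they mirror the caption above: user motion, scene state, agent context, and RGB/segmentation renders from the three camera perspectives.

# Hypothetical per-step observation record fed to the MLLM.
from dataclasses import dataclass, field

@dataclass
class StepObservation:
    user_position: tuple                 # user motion and pose
    user_velocity: tuple
    agent_position: tuple                # agent context
    agent_role: str
    scene_objects: list                  # scene state, e.g. obstacle labels
    rgb_views: dict = field(default_factory=dict)  # {"top_down": ..., "agent": ..., "user": ...}
    seg_views: dict = field(default_factory=dict)  # matching segmentation renders

obs = StepObservation(
    user_position=(1.0, 0.0, 2.0),
    user_velocity=(0.4, 0.0, 0.0),
    agent_position=(4.0, 0.0, 6.0),
    agent_role="hider",
    scene_objects=["sofa", "shelf", "doorway"],
)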
Architecture
The model combines visual features from a pre-trained ResNet, sequential inputs processed by LSTMs, and directional embeddings to produce a binary "Yes" or "No" decision.
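A hedged PyTorch sketch of that architecture follows. The caption fixes only the overall structure, so the layer sizes, the choice of ResNet-18, and fusion by concatenation are our assumptions for illustration.

# Sketch: ResNet visual features + LSTM temporal encoding + directional
# embedding, fused into a binary "Yes"/"No" head. Sizes are illustrative.
import torch
import torch.nn as nn
from torchvision import models

class YesNoNet(nn.Module):
    def __init__(self, seq_dim=16, hidden=64, num_directions=8, emb_dim=16):
        super().__init__()
        backbone = models.resnet18(weights=None)  # load pre-trained weights in practice
        backbone.fc = nn.Identity()               # expose 512-d visual features
        self.visual = backbone
        self.temporal = nn.LSTM(seq_dim, hidden, batch_first=True)
        self.direction = nn.Embedding(num_directions, emb_dim)
        self.head = nn.Linear(512 + hidden + emb_dim, 2)  # logits for "No"/"Yes"

    def forward(self, image, sequence, direction_id):
        v = self.visual(image)               # (B, 512)
        _, (h, _) = self.temporal(sequence)  # h: (1, B, hidden)
        d = self.direction(direction_id)     # (B, emb_dim)
        return self.head(torch.cat([v, h[-1], d], dim=-1))

net = YesNoNet()
logits = net(torch.randn(1, 3, 224, 224), torch.randn(1, 10, 16), torch.tensor([3]))
print(logits.argmax(dim=-1))  # 0 = "No", 1 = "Yes"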
Terms
Behavioral terms define how role descriptions shape the agent's navigation strategy.
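One simple way to encode such terms, purely for illustration, is a lookup from role names to behavioral phrases injected into the navigation prompt. The specific roles and wording below paraphrase the role descriptions earlier on this page; they are not the exact terms we use.

# Hypothetical mapping from roles to the behavioral terms that shape navigation.
ROLE_TERMS = {
    "playful_hider":  "stay discoverable, keep the game fun, avoid frustrating the user",
    "stealthy_hider": "prioritize stealth, distance, and cover to avoid detection",
    "guide":          "use safe, understandable navigation to lead the user to the goal",
}

def role_clause(role: str) -> str:
    """Inject a role's behavioral terms into the navigation prompt."""
    return f"Role '{role}': {ROLE_TERMS[role]}."

print(role_clause("stealthy_hider"))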
@article{kim2026roleagent,
  title     = {Role-Aware Virtual Agents for Navigational Interaction guided by a Multimodal Large Language Model},
  author    = {Kim, Minyoung and Li, Changyang and Nguyen, Cuong and Yu, Lap-Fai},
  journal   = {ACM Transactions on Graphics (TOG)},
  volume    = {45},
  number    = {4},
  articleno = {126},
  year      = {2026}
}