DAG: Unleash the Potential of Diffusion Model for Open-Vocabulary 3D Affordance Grounding

1Shanghai Artificial Intelligence Laboratory
2ShanghaiTech University
3Zhejiang University
4Huazhong University of Science and Technology

*Indicates Equal Contribution

+Indicates Corresponding Author

HOI Image to 3D Affordance Grounding


Text-to-image diffusion models can understand how people interact with objects: they are aware of affordance and can generate plausible Human-Object Interaction (HOI) images (Left). Motivated by this, we seek a way to transfer this rich affordance knowledge to 3D affordance grounding (Right).

Abstract

3D object affordance grounding aims to predict the touchable regions on a 3D object, which is crucial for human-object interaction (HOI), embodied perception, and robot learning. Recent advances tackle this problem by learning from demonstration images. However, these methods fail to capture the general affordance knowledge within the images, leading to poor generalization. To address this issue, we propose to extract general affordance knowledge from text-to-image diffusion models, because we find that such models can generate semantically valid HOI images, which demonstrates that their internal representation space is highly correlated with real-world affordance concepts. Specifically, we introduce DAG, a diffusion-based 3D affordance grounding framework that leverages the frozen internal representations of the text-to-image diffusion model and unlocks the affordance knowledge within the diffusion model to perform 3D affordance grounding. We further introduce an affordance block and a multi-source affordance decoder to enable dense 3D affordance prediction. Extensive experimental evaluations show that our model excels over well-established methods and exhibits open-world generalization.

2D Image Attention Visualization


DAG Method


Our framework comprises two sequential components. First, a 3D multimodal large language model (3D MLLM) ingests high-level instructions and object point clouds to generate sequential affordance maps and decompose the high-level task into a sequence of sub-tasks. Second, the diffusion model takes the affordance map and the decomposed task sequence as conditions to synthesize realistic hand-object interaction sequences.
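
The two-stage pipeline can be sketched as follows; `mllm_3d` and `hoi_diffusion` are hypothetical placeholder callables standing in for the two components described above, not real APIs from the released code.

```python
# Hypothetical sketch of the two-stage pipeline described above.
# `mllm_3d` and `hoi_diffusion` are placeholder callables, not real APIs.

def run_pipeline(instruction, point_cloud, mllm_3d, hoi_diffusion):
    # Stage 1: the 3D MLLM turns the high-level instruction and the object
    # point cloud into sequential affordance maps and a list of sub-tasks.
    affordance_maps, sub_tasks = mllm_3d(instruction, point_cloud)

    # Stage 2: the diffusion model is conditioned on each affordance map and
    # sub-task to synthesize a hand-object interaction sequence.
    hoi_sequences = [
        hoi_diffusion(cond_map=a_map, cond_text=task)
        for a_map, task in zip(affordance_maps, sub_tasks)
    ]
    return hoi_sequences
```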

Affordance Block


Given visual embeddings and text tokens, the affordance block fuses them and then applies an average-pooling operation to obtain the affordance embeddings.
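
Below is a minimal PyTorch sketch of a block of this kind, assuming cross-attention fusion followed by average pooling; the layer choices, names, and dimensions are illustrative assumptions, not the released implementation.

```python
import torch.nn as nn

class AffordanceBlock(nn.Module):
    """Illustrative sketch: fuse visual embeddings with text tokens via
    cross-attention, then average-pool to a single affordance embedding."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_emb, text_tokens):
        # visual_emb: (B, N_v, C) visual features, text_tokens: (B, N_t, C)
        fused, _ = self.cross_attn(query=visual_emb, key=text_tokens, value=text_tokens)
        fused = self.norm(fused + visual_emb)   # residual fusion
        affordance_emb = fused.mean(dim=1)      # average pooling over tokens
        return affordance_emb                   # (B, C)
```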

Multi-Source Affordance Decoder


Specifically, we use the proposed fusion blocks to fuse the CLS token, point embeddings, and affordance embeddings; the fused features are then fed into an MLP layer to obtain the affordance mask.
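
A minimal sketch of such a decoder is given below, assuming an attention-based fusion block in which the point embeddings attend to the CLS and affordance tokens before an MLP head predicts the per-point mask; module names and dimensions are assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class MultiSourceAffordanceDecoder(nn.Module):
    """Illustrative sketch: fuse CLS token, point embeddings, and the
    affordance embedding, then predict a per-point affordance mask."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.fusion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp_head = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1)
        )

    def forward(self, cls_token, point_emb, affordance_emb):
        # cls_token: (B, 1, C), point_emb: (B, N, C), affordance_emb: (B, C)
        context = torch.cat([cls_token, affordance_emb.unsqueeze(1)], dim=1)
        fused, _ = self.fusion(query=point_emb, key=context, value=context)
        fused = self.norm(fused + point_emb)
        mask = torch.sigmoid(self.mlp_head(fused)).squeeze(-1)  # (B, N)
        return mask
```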

Upsampling Method


We show the details of the upsampling method. Specifically, we use farthest point sampling (FPS) and feature propagation to obtain hierarchical point cloud features and upsample the sparse features into dense features.
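
The feature propagation step can be sketched as PointNet++-style inverse-distance interpolation from the sparse (FPS-sampled) points back to the dense point set; the function below is an illustrative assumption of that step, with hypothetical names, rather than the official code.

```python
import torch

def propagate_features(dense_xyz, sparse_xyz, sparse_feat, k=3):
    """Sketch of feature propagation: interpolate sparse point features onto
    the dense point set using inverse-distance weights over the k nearest
    sparse neighbours.

    dense_xyz:   (B, N, 3) coordinates of the dense (target) points
    sparse_xyz:  (B, M, 3) coordinates of the sparse (FPS-sampled) points
    sparse_feat: (B, M, C) features attached to the sparse points
    returns:     (B, N, C) features upsampled to the dense points
    """
    dist = torch.cdist(dense_xyz, sparse_xyz)          # (B, N, M)
    dist, idx = dist.topk(k, dim=-1, largest=False)    # k nearest sparse points
    weight = 1.0 / (dist + 1e-8)
    weight = weight / weight.sum(dim=-1, keepdim=True) # (B, N, k)

    # Gather neighbour features: (B, N, k, C)
    B, N, _ = idx.shape
    C = sparse_feat.shape[-1]
    neigh = torch.gather(
        sparse_feat.unsqueeze(1).expand(B, N, -1, C),
        2,
        idx.unsqueeze(-1).expand(B, N, k, C),
    )
    return (neigh * weight.unsqueeze(-1)).sum(dim=2)   # (B, N, C)
```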

Affordance Visualization


DAG achieves more accurate results in both seen and unseen settings.