Diffusion Models are Open-World Affordance Learners: Leveraging Generative Priors for 3D Affordance Learning

1The Hong Kong University of Science and Technology (GZ)
2ShanghaiTech University
3Zhejiang University
4Huazhong University of Science and Technology
5Nanjing University
6East China Normal University
7Nanjing University of Science and Technology

*Indicates Equal Contribution

Indicates Corresponding Author

ECCV 2026

HOI Image to 3D Affordance Grounding

Text-to-image diffusion models can understand how people interact with objects. They are aware of affordances and can generate plausible Human-Object Interaction (HOI) images (left). Motivated by this, we seek to transfer this rich affordance knowledge to 3D affordance grounding (right).

Abstract

3D affordance grounding aims to understand how diverse objects can be manipulated, making it a cornerstone of embodied interaction. However, prior works struggle to generalize to out-of-distribution, open-world scenarios, leaving a critical gap between performance on limited datasets and the needs of real-world applications. Inspired by the saying "What I cannot create, I do not understand," we observe that generative models can produce semantically valid HOI images, which suggests that they inherently encode affordance concepts. Building on this insight, we propose DAG, the first diffusion-based 3D affordance grounding framework that extracts general affordance knowledge from text-to-image diffusion models for 3D affordance prediction. Specifically, we extract HOI-aware affordance priors from a diffusion model, and design an affordance block with a multi-source affordance decoder for dense 3D affordance prediction. Extensive experiments show that DAG consistently outperforms state-of-the-art methods and exhibits strong open-world generalization, even in the challenging one-shot setting.

2D image attention visualization

We conduct several toy experiments. a) Compared with CLIP and DINO, the diffusion model better understands the affordance concept of "hold" and localizes the affordance region more accurately. b) The inherent affordance knowledge in the diffusion model reveals the affordance region in both egocentric (ego) and exocentric (exo) images. c) Even when the interaction area indicated by our instruction is incorrect, the diffusion model still generates correct and reasonable interaction images. This indicates that it internally understands the "how to interact" and "where to interact" aspects of affordance, rather than simply fitting the data distribution.

DAG Method

DAG employs a frozen diffusion U-Net to extract rich affordance knowledge from HOI images. An affordance semantics encoder and an implicit projector are adopted to map the extracted affordance knowledge into the U-Net structure. An aggregation network then constructs a unified affordance bank. Next, Affordance Blocks fuse the text embeddings with the affordance bank, and the fused features are fed into a multi-modal affordance decoder. Finally, the decoder interacts with 3D point cloud embeddings to generate dense 3D affordance masks.
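The last two steps (fusing text embeddings with the affordance bank, then producing a per-point mask) can be illustrated with a minimal sketch. This is not the DAG implementation: it assumes single-head attention for the fusion and a dot product followed by a sigmoid for the mask, and the names `fuse_text_with_bank` and `predict_point_mask` are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_text_with_bank(text_emb, bank):
    """ASSUMPTION: single-head attention where the text embedding is the
    query and each affordance-bank entry serves as both key and value."""
    scale = math.sqrt(len(text_emb))
    attn = softmax([dot(text_emb, entry) / scale for entry in bank])
    attended = [
        sum(a * entry[c] for a, entry in zip(attn, bank))
        for c in range(len(text_emb))
    ]
    # Residual connection keeps the original text signal.
    return [t + v for t, v in zip(text_emb, attended)]

def predict_point_mask(point_embs, query):
    """ASSUMPTION: per-point affordance score = sigmoid(point . query)."""
    return [1.0 / (1.0 + math.exp(-dot(p, query))) for p in point_embs]

# Toy data: 2-D embeddings, a two-entry bank, two points.
text_emb = [1.0, 0.0]
bank = [[1.0, 0.0], [0.0, 1.0]]
point_embs = [[2.0, 0.0], [-2.0, 0.0]]

query = fuse_text_with_bank(text_emb, bank)
mask = predict_point_mask(point_embs, query)
```

In this toy setup the first point is aligned with the fused query and receives a score above 0.5, while the second point receives a score below 0.5.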

Upsampling Method

We show the details of the upsampling method. Specifically, we use farthest point sampling (FPS) and feature propagation to obtain hierarchical point cloud features and to upsample the sparse features into dense features.
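A minimal pure-Python sketch of these two operations follows. It assumes the PointNet++-style formulation, in which feature propagation interpolates with inverse squared distance weights over the k = 3 nearest sparse points; the exact variant used in DAG is not stated here. The function names are hypothetical, and the sketch omits the skip connections and learned layers that usually follow the interpolation.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k point indices that are mutually far apart."""
    n = len(points)
    k = min(k, n)
    chosen = [start]
    # min_dist[i] = distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(n), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen

def propagate_features(dense_points, sparse_points, sparse_feats, k=3, eps=1e-8):
    """Upsample sparse features to dense points with inverse squared
    distance weighting over the k nearest sparse points (an assumption)."""
    dim = len(sparse_feats[0])
    dense_feats = []
    for p in dense_points:
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(sparse_points)
        )[:k]
        weights = [1.0 / (d * d + eps) for d, _ in nearest]
        total = sum(weights)
        dense_feats.append([
            sum(w * sparse_feats[j][c] for w, (_, j) in zip(weights, nearest)) / total
            for c in range(dim)
        ])
    return dense_feats

# Toy point cloud: downsample 5 points to 3, then upsample features back.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5), (0.1, 0, 0)]
idx = farthest_point_sampling(points, k=3)
sparse_points = [points[i] for i in idx]
sparse_feats = [[1.0], [2.0], [3.0]]  # one scalar feature per sampled point
dense_feats = propagate_features(points, sparse_points, sparse_feats)
```

A dense point that coincides with a sampled point recovers that point's feature almost exactly, because its weight dominates the interpolation.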

Affordance Visualization

DAG achieves more accurate results in both seen and unseen settings.