Can visual similarity alone guide robot object retrieval?
Similarity-based retrieval serves text QA well but fails embodied agents: the most visually similar object may be unreachable or locked. Should retrieval systems for robots instead rank by what the agent can physically execute?
Standard multimodal retrieval ranks candidates by visual or semantic similarity to a query — useful for question answering but disastrous for embodied agents, because the most visually similar object may be unreachable, immovable, or behind a closed door. AffordanceRAG adds an affordance reranking step on top of visual retrieval: it builds an affordance-aware memory from images of the explored environment, retrieves objects and locations by visual and regional features, and then reranks them by whether the robot can physically execute an action on them.
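As a concrete illustration, here is a minimal Python sketch of that two-stage pipeline, assuming a flat memory of per-object visual embeddings plus per-action feasibility scores. The names (`MemoryEntry`, `retrieve`, `affordance_rerank`) and the multiply-similarity-by-feasibility scoring are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of retrieve-then-rerank with affordances as a first-class
# signal. NOT the AffordanceRAG implementation; names and scoring are assumed.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class MemoryEntry:
    """One object in the affordance-aware memory built during exploration."""
    name: str
    visual_embedding: np.ndarray            # e.g. a CLIP-style image feature
    # Maps action -> feasibility in [0, 1], estimated from the scene
    # (reachability, articulation state, obstruction, ...).
    affordances: dict[str, float] = field(default_factory=dict)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def retrieve(memory: list[MemoryEntry], query_emb: np.ndarray,
             k: int) -> list[MemoryEntry]:
    """Stage 1: standard visual retrieval, ranked by embedding similarity."""
    ranked = sorted(memory,
                    key=lambda e: cosine(e.visual_embedding, query_emb),
                    reverse=True)
    return ranked[:k]


def affordance_rerank(candidates: list[MemoryEntry], query_emb: np.ndarray,
                      action: str) -> list[MemoryEntry]:
    """Stage 2: rerank by whether the robot can actually execute `action`.

    Feasibility gates the similarity score, so a visually perfect match
    behind a locked door (feasibility near 0) drops to the bottom.
    """
    def score(e: MemoryEntry) -> float:
        return cosine(e.visual_embedding, query_emb) * e.affordances.get(action, 0.0)

    return sorted(candidates, key=score, reverse=True)
```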
The conceptual move is treating affordance — what the agent can do with the object — as a first-class retrieval signal rather than a downstream filter. This matters because the failure modes of visually-similar-but-unactionable retrieval are not easily corrected at action time: by the time the planner discovers the cabinet is locked or the cup is too high, the system has already committed to a plan around it. Reranking by affordance during retrieval prunes these dead ends before they become plans.
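A toy run of the sketch above shows the pruning effect (all values invented): the visually closest cup is out of the arm's reach, so pure similarity ranks it first, while the affordance-gated rerank drops it below a reachable one before any plan is built around it.

```python
rng = np.random.default_rng(0)
query = rng.normal(size=512)

# The visually closest cup sits on a high shelf the arm cannot reach.
memory = [
    MemoryEntry("cup_on_high_shelf", query + 0.01 * rng.normal(size=512),
                {"grasp": 0.05}),
    MemoryEntry("cup_on_table", query + 0.30 * rng.normal(size=512),
                {"grasp": 0.95}),
]

top = retrieve(memory, query, k=2)
print([e.name for e in top])
# similarity alone -> ['cup_on_high_shelf', 'cup_on_table']
print([e.name for e in affordance_rerank(top, query, "grasp")])
# affordance-gated -> ['cup_on_table', 'cup_on_high_shelf']
```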
More broadly, the work argues that RAG for embodied agents needs a different similarity function from RAG for text. The grounding criterion is not "this passage answers the question" but "this object permits the action." Carrying that distinction into the retrieval architecture, rather than treating it as a post-hoc check, is what makes zero-shot mobile manipulation tractable without task-specific training. The general pattern of replacing similarity-based ranking with task-aware ranking also surfaces in "Can rationale-driven selection beat similarity re-ranking for evidence?" (where a rationale replaces semantic similarity) and in "Can reasoning stay grounded without external feedback loops?" (where action feedback corrects model-internal associations).
Source: 12 types of RAG
Related concepts in this collection
- Can rationale-driven selection beat similarity re-ranking for evidence?
  Can LLMs generate search guidance that outperforms traditional similarity-based evidence ranking? This matters because current re-ranking lacks interpretability and is brittle under adversarial attacks.
  extends: same architectural move of replacing similarity scoring with task-grounded scoring (rationale for QA, affordance for embodied action); both keep retrieval but install a different ranking criterion
- Can reasoning stay grounded without external feedback loops?
  Explores whether language models can maintain accurate reasoning through their own internal chains of thought, or whether they need real-world feedback to avoid hallucination and error propagation.
  extends: both use real-world executability rather than model-internal representations to constrain output; AffordanceRAG does this at retrieval time, ReAct does it at reasoning time
- Do embedding dimensions fundamentally limit retrievable document combinations?
  Can single-vector embeddings represent any top-k document subset a user might need? Research using communication complexity theory suggests there are hard geometric limits independent of training data or model architecture.
  supports: motivates why visual-similarity retrieval alone fails in embodied settings — embedding similarity cannot encode action constraints
Original note title: affordance-aware retrieval reranks robot perception by physical executability — visual similarity alone retrieves objects the robot cannot actually act on