Can visual similarity alone guide robot object retrieval?

Similarity-based retrieval works for question answering but fails for embodied agents: the most visually similar object may be unreachable or locked. Should retrieval systems for robots rank by what the agent can physically execute instead?

Note · 2026-05-03 · sourced from 12 types of RAG

Standard multimodal retrieval ranks candidates by visual or semantic similarity to a query — useful for question answering but disastrous for embodied agents, because the most visually similar object may be unreachable, immovable, or behind a closed door. AffordanceRAG adds an affordance reranking step on top of visual retrieval: it builds an affordance-aware memory from images of the explored environment, retrieves objects and locations by visual and regional features, and then reranks them by whether the robot can physically execute an action on them.
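The retrieve-then-rerank flow described above can be sketched in a few lines. This is a minimal illustration, not AffordanceRAG's implementation: the `MemoryEntry` type, the toy two-dimensional embeddings, and the scalar `affordance` score are all hypothetical stand-ins, since the source does not say how visual features or executability are actually computed.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryEntry:
    name: str
    embedding: List[float]  # toy visual feature vector (assumed, not from the source)
    affordance: float       # hypothetical executability score in [0, 1]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: List[float], memory: List[MemoryEntry], k: int) -> List[MemoryEntry]:
    """Stage 1: shortlist the k entries most visually similar to the query."""
    ranked = sorted(memory, key=lambda e: cosine(query, e.embedding), reverse=True)
    return ranked[:k]


def rerank_by_affordance(candidates: List[MemoryEntry]) -> List[MemoryEntry]:
    """Stage 2: reorder the shortlist by how executable an action on each entry is."""
    return sorted(candidates, key=lambda e: e.affordance, reverse=True)


# Toy memory built from an "explored environment" (values are made up).
memory = [
    MemoryEntry("mug_on_high_shelf", [1.0, 0.0], affordance=0.1),
    MemoryEntry("mug_on_table", [0.9, 0.1], affordance=0.9),
    MemoryEntry("plate_on_counter", [0.0, 1.0], affordance=0.8),
]
query = [1.0, 0.0]  # stands in for the embedding of "bring me the mug"

shortlist = retrieve(query, memory, k=2)
reranked = rerank_by_affordance(shortlist)
print([e.name for e in shortlist])
print([e.name for e in reranked])
```

In this toy setup the mug on the high shelf is the closest visual match, but the reranking step moves the reachable mug on the table to the top of the list.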

The conceptual move is treating affordance — what the agent can do with the object — as a first-class retrieval signal rather than a downstream filter. This matters because the failure modes of visually-similar-but-unactionable retrieval are not easily corrected at action time: by the time the planner discovers the cabinet is locked or the cup is too high, the system has already committed to a plan around it. Reranking by affordance during retrieval prunes these dead ends before they become plans.
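The difference between a downstream filter and a retrieval-time signal can be shown with a small contrast. The sketch below is an assumption-laden toy: candidates are `(name, similarity, executable)` tuples, the boolean `executable` flag is a hypothetical simplification of an affordance score, and the "plan" is just a list of strings.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, float, bool]  # (name, visual similarity, executable?), all hypothetical

candidates: List[Candidate] = [
    ("locked_cabinet", 0.95, False),
    ("open_drawer", 0.80, True),
]


def build_plan(target: str) -> List[str]:
    """Toy planner: a fixed two-step plan around the chosen target."""
    return [f"navigate_to:{target}", f"open:{target}"]


def commit_then_check(cands: List[Candidate]) -> Tuple[str, Optional[List[str]]]:
    """Downstream filter: pick the most similar object, plan, then check executability."""
    name, _, executable = max(cands, key=lambda c: c[1])
    plan = build_plan(name)
    if not executable:
        # The dead end is only discovered after the plan was built around it.
        return name, None
    return name, plan


def rerank_then_commit(cands: List[Candidate]) -> Tuple[str, Optional[List[str]]]:
    """Retrieval-time signal: executable candidates rank first, similarity breaks ties."""
    ranked = sorted(cands, key=lambda c: (c[2], c[1]), reverse=True)
    name, _, executable = ranked[0]
    return name, build_plan(name) if executable else None


print(commit_then_check(candidates))
print(rerank_then_commit(candidates))
```

The first function commits to the locked cabinet and ends with no usable plan; the second never plans around it because the unactionable candidate is demoted before a target is chosen.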

More broadly, the work argues that RAG for embodied agents needs a different similarity function from RAG for text. The grounding criterion is not "this passage answers the question" but "this object permits the action." Building that distinction into the retrieval architecture, rather than treating it as a post-hoc check, is what makes zero-shot mobile manipulation tractable without task-specific training. The general pattern of replacing similarity-based ranking with task-aware ranking also surfaces in two related notes: "Can rationale-driven selection beat similarity re-ranking for evidence?" (where rationale replaces semantic similarity) and "Can reasoning stay grounded without external feedback loops?" (where action feedback corrects model-internal associations).


Original note title

affordance-aware retrieval reranks robot perception by physical executability — visual similarity alone retrieves objects the robot cannot actually act on