Why does visual similarity retrieval fail for embodied agents?

This explores why ranking objects by how visually alike they are breaks down when a robot has to actually pick something up and act — not just recognize it.

The short version from the corpus: visual similarity answers "what looks like the query?", whereas an embodied agent needs an answer to "what can I physically do something with right now?" Those are different questions, and embedding-based retrieval only knows how to answer the first.

The sharpest diagnosis comes from work showing that vector embeddings measure semantic *association*, not task relevance (see "Do vector embeddings actually measure task relevance?"). Embeddings encode co-occurrence and resemblance, so a mug and a photo of a mug, or a full cup and an empty one, land close together, even though only one of each pair supports the action the agent intends. This isn't a robotics quirk; it's the same structural failure that haunts retrieval generally, where systems break on semantic-task mismatch rather than on tuning details (see "Where do retrieval systems fail and why?"). The embedding is doing exactly what it was built to do; it just wasn't built to know about executability.

For embodied agents the fix is to re-rank by physics, not appearance. AffordanceRAG keeps visual retrieval as a first pass but reorders candidates by affordance scores (can the robot actually grasp, reach, or manipulate this object given its current state?) so plans don't collapse at execution time (see "Can visual similarity alone guide robot object retrieval?"). The architectural move is the interesting part: similarity becomes a recall stage, and a task-grounded signal becomes the ranking stage. That mirrors a broader pattern in the corpus where routing or restructuring retrieval to fit the task beats uniform similarity search (see "Can routing queries to task-matched structures improve RAG reasoning?").
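The recall-then-rerank split can be sketched in a few lines. This is a toy illustration, not AffordanceRAG's actual implementation: the `Candidate` type, the two-dimensional embeddings, the affordance scores, and the `min_affordance` threshold are all made-up assumptions chosen to make the ordering flip visible.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    embedding: list[float]  # toy visual embedding (hypothetical values)
    affordance: float       # hypothetical score in [0, 1]: can the robot act on it now?


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_by_similarity(query: list[float], candidates: list[Candidate], k: int) -> list[Candidate]:
    # Stage 1: visual similarity only decides who makes the shortlist.
    return sorted(candidates, key=lambda c: cosine(query, c.embedding), reverse=True)[:k]


def rerank_by_affordance(shortlist: list[Candidate], min_affordance: float = 0.5) -> list[Candidate]:
    # Stage 2: drop candidates the robot cannot act on, then order by affordance.
    feasible = [c for c in shortlist if c.affordance >= min_affordance]
    return sorted(feasible, key=lambda c: c.affordance, reverse=True)


query = [1.0, 0.0]  # toy embedding for "a mug"
scene = [
    Candidate("mug_poster", [1.0, 0.05], affordance=0.0),        # looks most like a mug
    Candidate("mug_on_high_shelf", [0.95, 0.1], affordance=0.2),  # real, but out of reach
    Candidate("mug_on_table", [0.9, 0.3], affordance=0.9),        # graspable right now
    Candidate("stapler", [0.0, 1.0], affordance=0.95),            # graspable, but irrelevant
]

shortlist = recall_by_similarity(query, scene, k=3)
ranked = rerank_by_affordance(shortlist)

print([c.name for c in shortlist])
print([c.name for c in ranked])
```

In this toy scene the poster wins on similarity alone, the stapler never reaches the shortlist despite its high affordance, and only the mug on the table survives reranking. Keeping similarity as the recall stage is what stops the affordance score from promoting graspable but irrelevant objects.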

There's a second, quieter failure mode worth knowing: raw visual embeddings are a thin description of the world. Work on zero-shot recognition found that describing an image in natural language first, then retrieving against a text index, bridges the visual-reference gap better than direct embedding similarity (see "Can describing images in text improve zero-shot recognition?"). The lesson generalizes: a direct pixels-to-vector mapping loses the relational and functional facts (what's on top of what, what's reachable, what's occupied) that an embodied plan depends on, and a richer intermediate representation recovers them.
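A minimal sketch of the describe-then-retrieve idea, under loud assumptions: `describe_image` is a stub standing in for a vision-language model call, the captions and index entries are invented, and token-overlap (Jaccard) matching is a deliberately crude substitute for whatever text retriever a real system would use.

```python
def describe_image(image_id: str) -> str:
    # Stand-in for a vision-language model; these captions are hypothetical.
    captions = {
        "frame_001": "an empty mug sitting on the table within reach",
    }
    return captions[image_id]


def tokens(text: str) -> set[str]:
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Text index of known object states. The descriptions carry the relational and
# functional facts (occupied, reachable) that a raw visual vector would blur.
TEXT_INDEX = {
    "mug_empty_reachable": "empty mug on table within reach",
    "mug_full": "mug full of coffee on table",
    "mug_out_of_reach": "mug on high shelf out of reach",
}


def retrieve(description: str, index: dict[str, str]) -> str:
    query_tokens = tokens(description)
    return max(index, key=lambda key: jaccard(query_tokens, tokens(index[key])))


best = retrieve(describe_image("frame_001"), TEXT_INDEX)
print(best)
```

All three index entries describe a mug, so they would sit close together visually. The words "empty" and "within reach" are what separate them here.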

So the deeper takeaway isn't "visual similarity is bad"; it's that for an agent that acts, retrieval has to be grounded in the consequences of action. The thing that looks most like your query is frequently the thing you cannot do anything with. Once you see that, the whole "retrieve then verify against reality" loop, which spans affordance reranking and reflective failure memory (see "Can agents learn from failure without updating their weights?"), reads as one idea: similarity proposes, the world disposes.

Sources 6 notes

Do vector embeddings actually measure task relevance?

Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

Can visual similarity alone guide robot object retrieval?

AffordanceRAG reranks visually retrieved objects by affordance scores, ensuring the robot selects only physically executable actions. This architectural shift from similarity to task-grounded ranking prevents plans that fail at execution time.

Can routing queries to task-matched structures improve RAG reasoning?

StructRAG demonstrates that selecting knowledge structure type based on query demands—via DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks—improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.

Can describing images in text improve zero-shot recognition?

SignRAG demonstrates that describing an unknown image via vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
