INQUIRING LINE

Can models retrieve the right tool without relying on vector similarity?

This explores whether an LLM can pick the correct tool to call without leaning on embedding/vector-similarity matching — and what the corpus offers as alternatives.


The corpus offers a surprisingly rich set of escape routes from embedding matching. The starting problem is sharp: vector embeddings measure *semantic association*, not *task relevance* (see "Do vector embeddings actually measure task relevance?"). They encode co-occurrence, so concepts that are semantically close but play completely different roles look nearly identical. That's fine in a demo and quietly broken in production, where an underspecified query has many wrong-but-associated candidates that the embedding happily ranks high.

The most direct answer is to flip who's in charge. Instead of a retriever passively matching a query to tool descriptions, let the model itself emit structured tool requests and refine them across turns (see "Can models decide better than retrievers which tools to use?"). This sidesteps the colloquial-to-formal vocabulary gap that sinks single-round semantic matching: the model reasons its way to the requirement rather than hoping its phrasing lands near the right embedding. A related move replaces similarity ranking with *reasoning about relevance*. Selecting evidence through generated rationales for why each piece matters yields 33% better accuracy than similarity re-ranking while using 50% fewer chunks (see "Can rationale-driven selection beat similarity re-ranking for evidence?"). The lesson generalizes from evidence selection to tool selection: "why is this relevant" is a different and better question than "what looks similar."
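The request-and-refine loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any cited system: `Tool`, `ToolRequest`, `REGISTRY`, `resolve`, `select_tool`, and `scripted_model` are all hypothetical names, the exact field match stands in for whatever matching a real system would use, and `scripted_model` is a stub where an LLM call would go.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Tool:
    name: str
    domain: str
    capability: str


@dataclass(frozen=True)
class ToolRequest:
    domain: str
    capability: str


# Hypothetical registry, invented for illustration only.
REGISTRY = [
    Tool("read_file", "filesystem", "read"),
    Tool("write_file", "filesystem", "write"),
    Tool("run_query", "database", "read"),
]


def resolve(request: ToolRequest, registry: List[Tool]) -> List[Tool]:
    """Match on explicit structured fields; no embedding lookup is involved."""
    return [
        t for t in registry
        if t.domain == request.domain and t.capability == request.capability
    ]


def select_tool(
    propose: Callable[[Optional[str]], ToolRequest],
    registry: List[Tool],
    max_turns: int = 3,
) -> Optional[Tool]:
    """Let the proposer refine its structured request until one tool matches."""
    feedback = None
    for _ in range(max_turns):
        request = propose(feedback)
        matches = resolve(request, registry)
        if len(matches) == 1:
            return matches[0]
        domains = sorted({t.domain for t in registry})
        feedback = f"{len(matches)} tools matched {request}; known domains: {domains}"
    return None


def scripted_model(feedback: Optional[str]) -> ToolRequest:
    # Stand-in for an LLM: the first guess uses colloquial wording, the
    # second adopts the registry's vocabulary after seeing feedback.
    if feedback is None:
        return ToolRequest(domain="files", capability="read")
    return ToolRequest(domain="filesystem", capability="read")


chosen = select_tool(scripted_model, REGISTRY)
```

The design point is that a vocabulary mismatch ("files" versus "filesystem") produces explicit feedback the model can act on in the next turn, rather than a silently wrong nearest neighbor.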

There are also structural alternatives that avoid similarity search entirely. When the relationships between things are what matters, deterministic graph traversal beats probabilistic vector lookup: you query the structure with something like Cypher instead of guessing at nearest neighbors, at the price of a higher construction cost (see "When do graph databases outperform vector embeddings for retrieval?"). And the model's own uncertainty turns out to be a better signal than external retrieval heuristics for *whether* to reach for a tool at all: calibrated token-probability uncertainty beats multi-call adaptive retrieval on single-hop tasks and matches it on multi-hop tasks, using a fraction of the LM and retriever calls (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). The model's self-knowledge is doing the routing.
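As a rough illustration of uncertainty-based routing, the sketch below gates a tool call on the average entropy of the model's per-token probability distributions. Everything here is an assumption made for illustration: the function names, the use of Shannon entropy as the uncertainty measure, and the threshold of 0.5 nats are hypothetical, and the source notes do not specify an estimator. A real system would calibrate the threshold on held-out data.

```python
import math
from typing import List, Sequence


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    total = 0.0
    for dist in token_distributions:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_distributions)


def should_call_tool(
    token_distributions: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical value; calibrate on held-out data
) -> bool:
    """Reach for a tool only when the model's own draft answer looks uncertain."""
    return mean_token_entropy(token_distributions) > threshold


# Illustrative, made-up distributions: one peaked draft, one diffuse draft.
confident_draft: List[List[float]] = [[0.97, 0.02, 0.01], [0.99, 0.01]]
unsure_draft: List[List[float]] = [[0.40, 0.35, 0.25], [0.50, 0.50]]
```

With these toy inputs, `should_call_tool(confident_draft)` is false and `should_call_tool(unsure_draft)` is true, so only the diffuse draft triggers a tool call. The gate costs one forward pass, which is where the compute saving over multi-call schemes would come from.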

What you didn't ask but might want to know: similarity itself isn't the villain, but *learned* similarity can be. A properly tuned dot product beats an MLP-based similarity function, because an MLP needs large models and datasets just to learn what the dot product provides by construction (see "Why does dot product beat MLP-based similarity in practice?"). So the real fault line isn't "vector vs. not-vector"; it's whether your matching mechanism encodes the right notion of relevance. Two more threads round this out. Small models can be trained to call functions reliably through preference pairs that teach them what a *wrong* call looks like, not just what's plausible (see "Can small models match large models on function calling?"). And models can learn to work with an inventory they never see directly, producing the right search query through closed-loop feedback rather than catalog lookup (see "Can LLMs recommend products without ever seeing the catalog?"). Across all of these, the through-line is the same: relevance is a reasoning and feedback problem, and vector similarity is only one way to approximate it, and often the weakest.
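The preference-pair training mentioned above is named in the source notes as DPO. The sketch below computes the standard DPO loss for a single pair of function calls, given summed log-probabilities of the correct ("chosen") and incorrect ("rejected") call under the policy being trained and under a frozen reference model. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, and the numbers in the usage lines are made up.

```python
import math


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, not taken from the source
) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more the policy favors the chosen call over the rejected
    # one, relative to the reference model's preference.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Made-up log-probabilities for a correct and a malformed function call.
loss_prefers_correct = dpo_pair_loss(-2.0, -9.0, -4.0, -5.0)
loss_prefers_wrong = dpo_pair_loss(-9.0, -2.0, -4.0, -5.0)
```

The loss is lower when the policy shifts probability toward the correct call and away from the malformed one, which is how the explicit negative example enters training instead of being ignored as it would be under plain supervised fine-tuning.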


Sources (8 notes)

Do vector embeddings actually measure task relevance?

Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.

Can models decide better than retrievers which tools to use?

MCP-Zero shows that letting models emit structured tool requests iteratively across conversation turns outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.

Can rationale-driven selection beat similarity re-ranking for evidence?

METEORA uses LLM-generated rationales with flagging instructions to select evidence, achieving 33% better accuracy with 50% fewer chunks than similarity re-ranking across legal, financial, and academic domains. The method also improves adversarial robustness substantially.

When do graph databases outperform vector embeddings for retrieval?

Graph-oriented databases solve vector similarity's failure on aggregate queries by replacing probabilistic similarity search with deterministic graph traversal via Cypher. The tradeoff: higher construction cost but precision and completeness for enterprise use cases where query patterns are relational.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Why does dot product beat MLP-based similarity in practice?

Rendle et al. show that properly tuned dot products substantially beat MLP-based similarity despite MLP universality. Learning a dot product with an MLP requires large models and datasets; dot products also enable efficient retrieval at production scale through maximum inner product search (MIPS) algorithms.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can LLMs recommend products without ever seeing the catalog?

Rec-R1 experiments show that LLMs trained via RL with recommender metrics as rewards can generate effective product search queries without catalog access. The model learns query refinement indirectly through system feedback, paralleling how humans search without knowing platform inventory.
