INQUIRING LINE

How do sharded HNSW indices preserve capability distinctions at scale?

This explores how vector-based indexing (like HNSW) keeps distinct agent capabilities separable when you're matching across many heterogeneous agents — though the corpus addresses the capability-vector and routing machinery more than 'sharding' per se.


This reads as a question about scale-out capability discovery: when you have hundreds of agents or models, how does a vector index keep their distinct competencies from blurring into one another? The corpus has a direct anchor here. The idea of embedding *versioned capability vectors* into an HNSW index treats 'what can this agent do' as a first-class, searchable object, and crucially couples that semantic match with policy and budget constraints, so discovery scales sub-linearly even as the population of agents gets more heterogeneous (see 'Can semantic capability vectors replace manual agent routing?'). The honest caveat: the corpus discusses HNSW capability indexing, not literal sharding of the index. So treat the sharding framing as the deployment wrapper around a deeper question the collection answers well: how you represent capability so distinctions survive matching at scale.
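To make the coupling of semantic match with policy and budget constraints concrete, here is a minimal sketch. It is not the corpus's implementation: the `AgentRecord` schema, the field names, and the example agents are all hypothetical, and an exact cosine scan stands in for the HNSW index so the example stays dependency-free.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRecord:
    # Hypothetical schema: the source does not specify one.
    agent_id: str
    version: str                      # version tag of the capability vector
    vector: tuple[float, ...]         # capability embedding
    allowed_policies: frozenset[str]  # policy labels this agent may serve
    cost_per_call: float


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def discover(records, query, policy, budget, k=2):
    """Filter by policy and budget, then rank the survivors by similarity.

    Assumption: an exact scan replaces the approximate HNSW search here.
    """
    eligible = [
        r for r in records
        if policy in r.allowed_policies and r.cost_per_call <= budget
    ]
    ranked = sorted(eligible, key=lambda r: cosine(r.vector, query), reverse=True)
    return [(r.agent_id, r.version) for r in ranked[:k]]


# Illustrative registry; every value below is made up.
REGISTRY = [
    AgentRecord("summarizer", "v2", (0.9, 0.1, 0.0),
                frozenset({"public", "internal"}), 0.02),
    AgentRecord("pii-redactor", "v1", (0.1, 0.9, 0.1),
                frozenset({"public", "internal", "restricted"}), 0.05),
    AgentRecord("planner", "v3", (0.2, 0.1, 0.9),
                frozenset({"public"}), 0.20),
]

restricted_hits = discover(REGISTRY, (0.0, 1.0, 0.0), "restricted", budget=0.10)
public_hits = discover(REGISTRY, (1.0, 0.0, 0.0), "public", budget=0.10)
```

In this toy registry the restricted query can only reach `pii-redactor`, and the budget cap excludes `planner` from the public query however well its vector matches. A real deployment would replace the scan with an approximate nearest-neighbour index and would have to decide whether to filter before or after the graph search.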

The representation choice is where 'preserving distinctions' is actually won or lost. A single benchmark score collapses an agent into one number, and the corpus argues that's systematically misleading: capability decomposes into at least five separable axes (task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness), and models that top one axis often rank low on another (see 'Does a single benchmark score actually predict agent readiness?'). A scalar index would smear those agents together; a *vector* index is precisely what keeps the privacy-strong-but-task-weak agent distinguishable from the inverse. The geometry is the point.
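A small worked example shows the collapse. The axis names come from the paragraph above; the scores for the two agents are invented purely for illustration.

```python
import math

AXES = (
    "task_success",
    "privacy_compliance",
    "long_horizon_retention",
    "mode_shift_behavior",
    "ecosystem_readiness",
)

# Hypothetical per-axis scores in [0, 1], ordered as in AXES.
task_strong = (0.9, 0.3, 0.6, 0.6, 0.6)     # strong on tasks, weak on privacy
privacy_strong = (0.3, 0.9, 0.6, 0.6, 0.6)  # the inverse profile


def scalar_score(vec):
    """Collapse a capability vector into a single mean score."""
    return sum(vec) / len(vec)


def weighted_match(vec, weights):
    """Dot product of a capability vector with per-axis query weights."""
    return sum(v * w for v, w in zip(vec, weights))


# The scalar view cannot tell the two agents apart.
same_scalar = math.isclose(scalar_score(task_strong), scalar_score(privacy_strong))

# The vector view keeps them far apart.
separation = math.dist(task_strong, privacy_strong)

# A query that only cares about privacy compliance picks the right agent.
privacy_query = (0.0, 1.0, 0.0, 0.0, 0.0)
best_for_privacy = max(
    {"task_strong": task_strong, "privacy_strong": privacy_strong}.items(),
    key=lambda item: weighted_match(item[1], privacy_query),
)[0]
```

Both agents average out to the same scalar, yet they sit well apart in the five-dimensional space, and a privacy-weighted query selects `privacy_strong`. That separation is what a scalar ranking discards.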

Lateral move: this is the same insight that makes routing beat scaling. Avengers-Pro routes each query to the best model per semantic cluster, beating GPT-5-medium by 7% in accuracy or matching it at 27% lower cost, and ten 7B models with routing previously surpassed GPT-4.1 and 4.5 (see 'Can routing beat building one better model?'). Routing only works if the embedding space preserves which model is good at what; collapse the distinctions and you're back to picking one generalist. Capability indexing and cluster routing are two faces of the same bet: selection is a stronger lever than scaling, *provided* your index doesn't flatten the very differences you're selecting on.
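The routing idea can be sketched as a nearest-centroid lookup followed by a per-cluster model choice. This is an illustration of cluster-based routing in general, not Avengers-Pro's actual implementation: the centroids, model names, accuracy and cost figures, and the `alpha` trade-off weight are all hypothetical.

```python
import math

# Hypothetical cluster centroids in a 2-D query-embedding space.
CENTROIDS = {"code": (1.0, 0.0), "math": (0.0, 1.0)}

# Hypothetical per-cluster profile: model -> (accuracy, relative cost).
PROFILE = {
    "code": {"small-coder": (0.82, 1.0), "generalist": (0.80, 4.0)},
    "math": {"small-coder": (0.40, 1.0), "generalist": (0.75, 4.0)},
}


def nearest_cluster(query):
    """Assign the query embedding to the closest centroid."""
    return min(CENTROIDS, key=lambda name: math.dist(CENTROIDS[name], query))


def route(query, alpha=0.8):
    """Pick the model with the best accuracy/cost trade-off in the query's cluster.

    alpha weights accuracy; (1 - alpha) weights cheapness. Both are assumptions.
    """
    cluster = nearest_cluster(query)
    models = PROFILE[cluster]
    max_cost = max(cost for _, cost in models.values())

    def score(item):
        accuracy, cost = item[1]
        return alpha * accuracy + (1 - alpha) * (1 - cost / max_cost)

    best_model = max(models.items(), key=score)[0]
    return cluster, best_model


code_route = route((0.9, 0.2))
math_route = route((0.1, 0.8))
```

With these made-up numbers the code-like query goes to the cheap specialist and the math-like query goes to the generalist. If the two clusters were merged into one averaged profile, the router would lose the per-cluster difference and fall back to a single choice for every query.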

There's a quiet warning under all of this. Identical performance metrics can mask fundamentally different internal representations: models can be perfectly accurate yet internally 'fractured,' fragile to perturbation and distribution shift in ways standard metrics never reveal (see 'Can models be smart without organized internal structure?'). So a capability vector built from benchmark outputs may preserve *measured* distinctions while hiding the ones that break in deployment. Preserving capability distinctions at scale isn't just an indexing problem. It's a question of whether your vectors encode the distinctions that actually matter, or only the ones that are easy to measure.


Sources 4 notes

Can semantic capability vectors replace manual agent routing?

Versioned Capability Vectors embedded in HNSW indices couple semantic matching with policy and budget constraints, making capability discovery a first-class operation that scales sub-linearly as agent heterogeneity increases.

Does a single benchmark score actually predict agent readiness?

Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
