Does scaling data automatically produce compositional reasoning or just better feature encoding?
This explores whether training on more data genuinely produces the ability to combine learned pieces into novel reasoning (composition), or whether it only sharpens the model's encoding of features it has already seen. The corpus is sharply split on the question.
The corpus stages a real argument about this, and the answer hinges on what you count as 'composition.' The optimistic camp says scale alone is enough: standard MLPs reach compositional generalization with no special architecture, *provided* the training distribution covers enough combinations of the underlying task modules (see "Can neural networks learn compositional skills without symbolic mechanisms?"). Notably, that same work uses linear decodability of the building blocks from hidden activations as its success signal, and that is exactly the catch. The skeptical camp argues that linear decodability is precisely what masks the absence of real composition: a model can carry every linearly decodable feature a task needs while its internal organization is fractured and brittle, a weakness that stays invisible to standard metrics until perturbation or distribution shift exposes it (see "Can models be smart without organized internal structure?"). So the two notes that appear to agree (decodable features = good) actually disagree about whether decodable features *mean* anything.
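The optimistic camp's condition, that training must cover enough combinations of the task modules, can be made concrete with a small sketch. The module names (`SHAPES`, `COLORS`), the training set, and the helper functions below are hypothetical and come from none of the cited notes. The sketch measures how much of the combination space a training set covers, and lists the held-out combinations whose parts were each seen in training but never together, which is the kind of split on which the two camps would disagree.

```python
from itertools import product

# Hypothetical task modules: each example picks one module per slot.
SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "green", "blue"]


def all_combinations(*slots):
    """Every possible composition of one module per slot."""
    return set(product(*slots))


def coverage(train_examples, *slots):
    """Fraction of possible module combinations that appear in training."""
    possible = all_combinations(*slots)
    seen = set(train_examples) & possible
    return len(seen) / len(possible)


def novel_combinations(train_examples, *slots):
    """Combinations whose parts were each seen in training, but never together."""
    seen = set(train_examples)
    # For each slot, collect the modules that occur anywhere in training.
    seen_parts = [{example[i] for example in seen} for i in range(len(slots))]
    return sorted(
        combo
        for combo in all_combinations(*slots)
        if combo not in seen
        and all(part in seen_parts[i] for i, part in enumerate(combo))
    )


# Illustrative training set: every shape and every color appears,
# but only six of the nine shape-color pairings do.
train = [
    ("circle", "red"), ("square", "green"), ("triangle", "blue"),
    ("circle", "green"), ("square", "blue"), ("triangle", "red"),
]

print(coverage(train, SHAPES, COLORS))
print(novel_combinations(train, SHAPES, COLORS))
```

A model that scores well on the training pairings but poorly on the novel ones has encoded the features without composing them. How much coverage counts as "enough" is an empirical question this sketch does not answer.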
Sources 7 notes
Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Large reasoning models (LRMs) break not at complexity thresholds but at instance-novelty boundaries. They fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds when the model was trained on similar instances.
DataAlchemy experiments show that chain-of-thought (CoT) reasoning fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning, imitating the form of reasoning without valid underlying logic.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
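The ablation logic behind the pruning result in the last note can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the network is hand-wired rather than learned, the tasks "A" and "B" are placeholders, and the function names are hypothetical. It shows only what the modularity test looks like: mask one group of hidden units and check whether the damage is confined to the corresponding function.

```python
# Toy two-task "network": hidden units 0-1 feed task A, units 2-3 feed task B.
# The wiring is hand-built for illustration; nothing here is learned.

def hidden(x):
    """Four hidden units computed from a two-dimensional input."""
    a, b = x
    return [a + b, a - b, a * 2.0, b * 3.0]


READOUT = {
    "A": [1.0, 1.0, 0.0, 0.0],  # task A reads only units 0 and 1
    "B": [0.0, 0.0, 1.0, 1.0],  # task B reads only units 2 and 3
}


def forward(x, mask=(1, 1, 1, 1)):
    """Run the network, zeroing any hidden unit whose mask entry is 0."""
    h = [unit * keep for unit, keep in zip(hidden(x), mask)]
    return {
        task: sum(w * u for w, u in zip(weights, h))
        for task, weights in READOUT.items()
    }


def ablation_effect(inputs, mask):
    """Mean absolute change in each task's output when units are masked."""
    effect = {task: 0.0 for task in READOUT}
    for x in inputs:
        full, cut = forward(x), forward(x, mask)
        for task in READOUT:
            effect[task] += abs(full[task] - cut[task]) / len(inputs)
    return effect


inputs = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]
print(ablation_effect(inputs, mask=(0, 0, 1, 1)))  # knock out the task-A units
print(ablation_effect(inputs, mask=(1, 1, 0, 0)))  # knock out the task-B units
```

In this hand-built case each ablation changes only its own task's output, which is the signature of isolated subroutines. In a trained network the subnetworks would first have to be discovered, for example by pruning, and the separation would rarely be this clean.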