Can explicit stack tracking improve how transformers learn recursive syntax?
Can adding an explicit stack tape help transformers track recursive structure more efficiently? The question matters because standard transformers struggle with long-tail recursive patterns despite their scale in parameters and training data.
Recursion is fundamental to human language and thought — composing complex objects from simpler constituents. It is also fundamental to mathematical reasoning, programming, and goal-directed planning. Standard self-attention has no explicit mechanism to track recursive state; it relies on hidden representations to implicitly but imperfectly encode stack information. This imperfect encoding limits syntactic generalization, especially for long-tail recursive structures.
Pushdown Layers address this directly: a stack tape tracks the estimated depth of every token in an incremental parse of the observed prefix. The transformer autoregressively updates this stack tape as it predicts new tokens, then uses the depth information to softly modulate attention — for instance, learning to "skip" over closed constituents (completed sub-phrases that are no longer active in the parse).
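To make the stack tape concrete, here is a minimal toy sketch for a bracket language. The function name and the hard-coded push/pop rules are illustrative assumptions: the actual model predicts depth updates jointly with tokens rather than reading them off bracket identities.

```python
def update_stack_tape(depths: list[int], token: str) -> list[int]:
    """Append the stack depth of `token` to the tape (toy bracket grammar).

    Illustrative only: the real model predicts these updates token by
    token; here push/pop is hard-coded from the bracket identity.
    """
    prev = depths[-1] if depths else 0
    if token == "(":                      # push: token opens a new constituent
        depths.append(prev + 1)
    elif token == ")":                    # pop: token closes the current one
        depths.append(max(prev - 1, 0))
    else:                                 # ordinary token: stays at current depth
        depths.append(prev)
    return depths

tape: list[int] = []
for tok in "( a ( b ) c )".split():
    update_stack_tape(tape, tok)
print(tape)  # [1, 1, 2, 2, 1, 1, 0]
```

Under this convention, a head can learn to "skip" a closed constituent by down-weighting keys whose recorded depth exceeds the depth of the current position.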
Results: Pushdown Layers achieve 3-5x more sample-efficient syntactic generalization while matching the perplexity of standard transformers. The improvement is not marginal; it represents a qualitative change in the model's ability to handle recursive structure. The layers are a drop-in replacement for standard self-attention, requiring no changes to the overall architecture.
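As a rough sketch of what such a drop-in layer could look like: standard causal multi-head attention plus an additive logit bias keyed by each key token's recorded depth. The class name, the per-(head, depth) embedding bias, and all hyperparameters below are assumptions for illustration, not the published layer's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PushdownSelfAttention(nn.Module):
    """Causal self-attention softly modulated by stack-tape depths.

    Hypothetical sketch: the per-(head, depth) additive bias is an assumed
    parameterization, not the published layer's exact mechanism.
    """

    def __init__(self, d_model: int, n_heads: int, max_depth: int = 64):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One learned scalar per (depth bucket, head): lets a head learn to
        # down-weight ("skip") tokens buried inside closed constituents.
        self.depth_bias = nn.Embedding(max_depth, n_heads)

    def forward(self, x: torch.Tensor, depths: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model); depths: (B, T) integer stack-tape entries
        B, T, _ = x.shape
        q, k, v = [t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1)]

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5    # (B, H, T, T)
        # Soft modulation: add a bias determined by each *key* token's depth.
        clamped = depths.clamp(max=self.depth_bias.num_embeddings - 1)
        bias = self.depth_bias(clamped).permute(0, 2, 1)         # (B, H, T)
        scores = scores + bias.unsqueeze(2)                      # broadcast over queries

        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        attn = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(y)

# Drop-in usage with the toy tape from the bracket example above.
layer = PushdownSelfAttention(d_model=256, n_heads=8)
x = torch.randn(2, 7, 256)
depths = torch.tensor([[1, 1, 2, 2, 1, 1, 0]] * 2)
y = layer(x, depths)  # (2, 7, 256)
```

Swapping this in for a standard attention block only requires threading the depths tensor through the forward pass, which is what makes the drop-in claim plausible.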
The connection to Why do neural networks fail at compositional generalization? is direct: the binding problem identifies three sub-problems (segregation, representation, composition), and Pushdown Layers specifically address composition by providing an explicit mechanism for tracking constituent structure. Standard transformers attempt to solve this implicitly and fail on the long tail.
The relationship to Can neural networks learn compositional skills without symbolic mechanisms? is nuanced. That finding holds for broad compositional patterns, but Pushdown Layers demonstrate that for recursive structures specifically, explicit mechanisms dramatically improve sample efficiency. Scale can brute-force some recursive patterns, but a lightweight architectural inductive bias reaches the same generalization with a fraction of the data (3-5x less in the reported results).
This also connects to the latent reasoning theme: just as Can models reason without generating visible thinking tokens? adds iterative depth for reasoning, Pushdown Layers add structural depth for language. Both augment the transformer with mechanisms it lacks — recurrence for reasoning, recursion for language.
Source: Cognitive Models Latent
Related concepts in this collection
- Why do neural networks fail at compositional generalization?
  Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects (segregation, representation, and composition), each creating distinct failure modes in how networks handle structured information.
  Relation: Pushdown Layers address the composition sub-problem directly.
- Can neural networks learn compositional skills without symbolic mechanisms?
  Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
  Relation (qualified): recursive structure specifically benefits from explicit mechanisms, even where broader compositional generalization emerges from scale.
- How do language models encode syntactic relations geometrically?
  Do LLM embeddings use distance alone, or also direction, to represent syntax? Understanding whether neural networks can spontaneously develop symbolic-compatible geometric structures.
  Relation (complementary): polar coordinates show syntax is already encoded; Pushdown Layers show it can be encoded more efficiently with explicit structure.
- Can models reason without generating visible thinking tokens?
  Explores whether intermediate reasoning must be verbalized as text tokens, or whether models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
  Relation (parallel): both lines of work add a mechanism transformers lack (recurrence for reasoning, recursion for syntax).
- What formal languages actually help transformers learn natural language?
  Not all formal languages are equally useful for pre-pretraining. This explores which formal languages transfer well to natural language and why, combining structural requirements with what transformers can actually learn.
  Relation: Pushdown Layers shift the architectural learnability boundary. The two-constraint model (Chomsky hierarchy × circuit complexity) applies to standard transformers; an explicit stack tape extends the transformer's computational limits, potentially expanding the set of formal languages that produce positive transfer.
Original note title
pushdown layers with explicit stack tape achieve 3-5x more sample-efficient syntactic generalization by providing recursive state tracking absent in standard transformers