Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which raises the question of how they do so. In this paper, we propose the concept encoding-decoding mechanism to explain ICL by studying how transformers form and use internal abstractions in their representations. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of concept encoding and decoding. As the model learns to encode different latent concepts (e.g., “finding the first noun in a sentence”) into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate the existence of this mechanism across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B). Further, through mechanistic interventions and controlled finetuning, we demonstrate that the quality of concept encoding is causally related to, and predictive of, ICL performance.
We propose the concept encoding-decoding mechanism to explain why and how task vectors emerge in pretrained LLMs. We demonstrate that transformers concurrently learn to map latent concepts into separable representations and develop context-specific decoding algorithms. We validate the generality of this finding across model families and scales, and show that the quality of concept encoding-decoding can predict ICL task performance.
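To illustrate how encoding quality might be quantified, the sketch below trains a linear probe to recover the latent concept from intermediate-layer activations; held-out probe accuracy then serves as a separability score that can be compared against ICL task performance. The function name, array shapes, and probing protocol here are illustrative assumptions rather than the paper's exact procedure.

# Hypothetical sketch: quantifying concept encoding quality with a linear
# probe. `hidden_states` (n_sequences, d_model) are intermediate-layer
# activations collected at the last context token of each sequence;
# `concept_labels` give the latent concept underlying each sequence.
# Higher held-out accuracy means more linearly separable concept
# representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def concept_decodability(hidden_states: np.ndarray,
                         concept_labels: np.ndarray) -> float:
    """Held-out accuracy of a linear probe decoding the latent concept."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, concept_labels, test_size=0.25, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

Under this framing, a layer whose probe accuracy rises during training would indicate that concept encoding is forming there, and that score could be tracked alongside downstream ICL accuracy.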
these task-specific vectors can trigger the desired ICL behavior in many cases, though their effectiveness varies across tasks. Although this is an impactful first observation, questions remain unanswered: why do these task vectors exist in the first place, and why does their effectiveness vary by task? Answering them requires a deeper mechanistic understanding of the internal abstraction behavior of LLMs, one that could explain both the findings on task-specific vectors and the workings of ICL.
In our work, we propose the concept encoding-decoding mechanism as the origin of transformers’ internal abstraction behavior. To study the emergence of abstractions during pretraining, we train a small transformer on a mixture of sparse linear regression tasks. We find that concept encoding emerges as the model learns to map different latent concepts into distinct, separable representation spaces. This geometric structuring of the representation space is coupled with the development of concept-specific ICL algorithms, namely concept decoding. Through causal analysis, we demonstrate that the model associates different algorithms with different learned concepts in the representation space, and that ICL proceeds through this two-step process.
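To make the training setup concrete, the following is a minimal sketch of how such a mixture of sparse linear regression tasks might be constructed, where each latent concept corresponds to a distinct sparsity pattern (support) of the regression weights. The dimensions, noise level, and naming are assumptions for illustration, not the paper's exact configuration.

# Hypothetical sketch of a mixture of sparse linear regression ICL tasks.
# Each latent concept fixes a support of the regression weights; a
# sequence consists of (x, y) pairs drawn from one concept, and the model
# must infer the concept in context to predict well.
import numpy as np

rng = np.random.default_rng(0)

DIM = 16          # input dimensionality (assumed)
N_CONCEPTS = 4    # number of latent concepts, i.e., distinct supports
SPARSITY = 3      # non-zero coordinates per concept
SEQ_LEN = 32      # number of in-context (x, y) pairs per sequence

# Fix one support (sparsity pattern) per latent concept.
supports = [rng.choice(DIM, size=SPARSITY, replace=False)
            for _ in range(N_CONCEPTS)]

def sample_sequence(concept: int):
    """Sample one ICL sequence: weights on the concept's support, then pairs."""
    w = np.zeros(DIM)
    w[supports[concept]] = rng.normal(size=SPARSITY)
    X = rng.normal(size=(SEQ_LEN, DIM))
    y = X @ w + 0.1 * rng.normal(size=SEQ_LEN)  # small observation noise
    return X, y

# A training batch mixes sequences from all concepts uniformly, so the
# latent concept is never given explicitly and must be inferred in context.
batch = [sample_sequence(int(rng.integers(N_CONCEPTS))) for _ in range(8)]

Because the weight values are resampled within each sequence, only the sparsity pattern persists across sequences of the same concept; separable representations of these patterns are what the concept-encoding analysis would look for.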