LLM Architecture
Related topics:
- A Survey on Large Language Models with some Insights on their Capabilities and Limitations. The rapid advancement of artificial intelligence, particularly with the development of Large Language Models (LLMs) built on the transformer architecture, has redefined the capabilities of natural lan…
- Agent Learning via Early Experience. A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data wi…
- Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research. Agentic Reasoning is a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Unlike conventional LLM-based reasoning approaches, which rely solely on i…
- All AI Models are Wrong, but Some are Optimal. AI models that predict the future behavior of a system (a.k.a. predictive AI models) are central to intelligent decision-making. However, decision-making using predictive AI models often resu…
- Are Emergent Abilities in Large Language Models just In-Context Learning? We present a novel theory that explains emergent abilities, taking into account their potential confounding factors, and rigorously substantiate this theory through over 1000 experiments. Our findings…
- Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics. Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a rep…
- Artifacts as Memory Beyond the Agent Boundary. The situated view of cognition holds that intelligent behavior depends not only on internal memory, but on an agent’s active use of environmental resources. Here, we begin formalizing this intuition w…
- Ask, and it shall be given: Turing completeness of prompting. In this work, we show that prompting is in fact Turing-complete: there exists a finite-size Transformer such that for any computable function, there exists a corresponding prompt following which the T…
- Ask-AC: An Initiative Advisor-in-the-Loop Actor-Critic Framework. Despite the promising results achieved, state-of-the-art interactive reinforcement learning schemes rely on passively receiving supervision signals from advisor experts, in the form of either continuo…
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases. Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawi…
- Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning. To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and …
- Beyond neural scaling laws: beating power law scaling via data pruning. Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, thes…
- Can Language Models Serve as Text-Based World Simulators? Broadly speaking, there are two ways to leverage LLMs in the context of world modeling and simulation. The first is neurosymbolic: a number of efforts use language models to generate code in a symboli…
- Compositional Reasoning with Transformers, RNNs, and Chain of Thought. Large language models [Touvron et al., 2023, Anil et al., 2023, Achiam et al., 2023] are increasingly used to perform logical reasoning and other problems that require algorithmic thinking. To underst…
- Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data. One way to address safety risks from large la…
- DataComp-LM: In search of the next generation of training sets for language models. A key challenge in this emerging research area is a lack of controlled comparisons. While the aforementioned proposals generally use the same evaluation datasets, researchers often compare models that…
- DeepNet: Scaling Transformers to 1,000 Layers. In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DEEPNORM) to modify the residual connection i…
- Discovering Latent Concepts Learned in BERT. A large number of studies that analyze deep neural network models and their ability to encode various linguistic and non-linguistic concepts provide an interpretation of the inner mechanics of these m…
- Eliciting Reasoning in Language Models with Cognitive Tools. The recent advent of reasoning models like OpenAI’s o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of r…
- Foundations of Large Language Models. The main part of BERT models is a multi-layer Transformer network. A Transformer layer consists of a self-attention sub-layer and an FFN sub-layer. Both of them follow the post-norm architecture: outp…
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR. Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLV…
- Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability. From creative writing and survey responses to research idea generation (Doshi and Hauser, 2024; Anderson et al., 2024; Moon et al., 2024). For instance, stories written with ChatGPT assistance were mo…
- Hogwild! Inference: Parallel LLM Generation via Concurrent Attention. Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involv…
- Holy Grail 2.0: From Natural Language to Constraint Models. Twenty-seven years ago, E. Freuder highlighted that "Constraint programming represents one of the closest approaches computer science has yet made to the Holy Grail of programming: the user states the…
- Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. Task-oriented dialog systems help a user to accomplish some goal using natural language, such as making a restaurant reservation, getting technical support, or placing a phone call. Historically, these…
- LIMI: Less is More for Agency. We define “Agency” as the emergent capacity of AI systems to function as autonomous agents—actively discovering problems, formulating hypotheses, and executing solutions through self-directed engageme…
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models. Some policy gradient approaches are explained below: Policy Gradient (REINFORCE). The REINFORCE algorithm [114, 115] is a method used to improve decision-making by adjusting the model’s strategy (poli…
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. Large decoder-only language models (LLMs) are the state-of-the-art models on most of today’s NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks,…
- Language Modeling by Language Models. Can we leverage LLMs to model the process of discovering novel language model (LM) architectures? Inspired by real research, we propose a multi-agent LLM approach that simulates the conventional stage…
- Language Modeling is Compression. Information theory and machine learning are inextricably linked and have even been referred to as “two sides of the same coin” (MacKay, 2003). One particularly elegant connection is the essential equi…
- Language Models are Pragmatic Speakers. We propose a generalization of the previous methods called bounded pragmatic speakers with a dual model of thought. A dual model of thought comprises a slow-thinking system for deep reasoning and …
- Large Concept Models: Language Modeling in a Sentence Representation Space. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attem…
- Latent Collaboration in Multi-Agent Systems. Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediatio…
- Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models. Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language m…
- Leveraging Approximate Symbolic Models for Reinforcement Learning via Skill Diversity. Creating reinforcement learning (RL) agents that are capable of accepting and leveraging task-specific knowledge from humans has long been identified as a possible strategy for developing scalable app…
- Looking beyond the next token. The structure of causal language model training assumes that each token can be accurately predicted from the previous context. This contrasts with humans’ natural writing and reasoning process, where …
- MasRouter: Learning to Route LLMs for Multi-Agent Systems. Multi-agent systems (MAS) powered by Large Language Models (LLMs) have been demonstrated to push the boundaries of LLM capabilities, yet they often incur significant costs and face challenges in dynam…
- MatFormer: Nested Transformer for Elastic Inference. Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, …
- Multi-Token Attention. Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and …
- Nested Learning: The Illusion of Deep Learning Architectures. Over the last decades, developing more powerful neural architectures and simultaneously designing optimization algorithms to effectively train them have been the core of research efforts to enhance th…
- Neural Assistant: Joint Action Prediction, Response Generation, and Latent Knowledge Reasoning. Task-oriented dialog presents a difficult challenge encompassing multiple problems including multi-turn language understanding and generation, knowledge retrieval and reasoning, and action prediction.…
- Neuro-Symbolic AI in 2024: A Systematic Review. Taxonomy of Neuro-Symbolic AI: We identified five foundational research areas advancing the state of the art in Neuro-Symbolic AI. This taxonomy was synthesized from a review of six survey papers…
- Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance. Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the “seesaw phenomenon”, where indiscriminate …
- On the Binding Problem in Artificial Neural Networks. In this work, we argue that this underlying cause is the binding problem: the inability of existing neural networks to dynamically and flexibly bind information that is distributed throughout the netw…
- On the Theoretical Limitations of Embedding-Based Retrieval. Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new ben…
- Probing Structured Semantics Understanding and Generation of Language Models via Question Answering. As John McCarthy (McCarthy, 1990, 1959) points out, in order to better understand natural language, it is necessary for an intelligence system to understand the “deep structure” (Chomsky, 2011…
- Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words. We uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgemen…
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models. Our findings indicate that the approach to reasoning the models use is unlike retrieval, and more like a generalisable strategy that synthesises procedural knowledge from documents doing a similar for…
- QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration. The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant e…
- RL + Transformer = A General-Purpose Problem Solver. What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., metalearn)? In this study, we demonstrate that a pre-…
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination. To obtain trustworthy evaluation signals, we introduce a generator that creates fully synthetic arithmetic problems of arbitrary length and difficulty, yielding clean datasets we call RandomCalculatio…
- SParC: Cross-Domain Semantic Parsing in Context. The most prominent context-dependent text-to-SQL benchmark is ATIS, which is set in the flight-booking domain and contains only one database (Hemphill et al., 1990; Dahl et al., 1994). In a real-worl…
- Scaling Laws for Neural Language Models. We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, wit…
- Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges. Large language models often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iterativel…
- Self-reinforcing cascades: A spreading model for beliefs or products of varying intensity or quality. Models of how things spread often assume that transmission mechanisms are fixed over time. However, social contagions (the spread of ideas, beliefs, innovations) can lose or gain in momentum as they spr…
- Sleep-time Compute: Beyond Inference Scaling at Test-time. Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time…
- System 1 vs. System 2 Thinking. It is widely accepted that the human mind is specialized for specific domains. But is there a domain-general mind? Cognitive psychologists concur; however, evolutionary psychologists find this notion …
- Textgrad: Automatic “Differentiation” via Text. To optimize the new generation of AI systems, we introduce TEXTGRAD, automatic differentiation via text. Here we use differentiation and gradients as a metaphor for textual feedback from LLMs. In this…
- The Unreasonable Ineffectiveness of the Deeper Layers. We empirically study a simple layer-pruning strategy for popular families of open weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until af…
- The Vanishing Gradient Problem for Stiff Neural Differential Equations. Neural differential equations have become a transformative tool in machine learning and scientific computing, enabling data-driven modeling of complex, time-dependent phenomena in fields ranging from …
- Thinking Augmented Pre-training. This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for …
- Titans: Learning to Memorize at Test Time. For more than a decade there has been an extensive research effort into how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory…
- Training language models to follow instructions with human feedback. Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpf…
- Unifying Large Language Models and Knowledge Graphs: A Roadmap. Large language models (LLMs), such as ChatGPT and GPT4, are making new waves in the field of natural language processing and artificial intelligence, due to their emergent ability and generalizabilit…
- What are the Goals of Distributional Semantics? As Harnad (1990) discusses, if the meanings of words are defined only in terms of other words, these definitions are circular. One goal for a semantic model is to capture how language relates to the w…
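Several entries above ("Scaling Laws for Neural Language Models", "Beyond neural scaling laws") revolve around the same quantitative idea: loss falling as a power law in model size. A minimal numerical sketch of that relationship; the function name and the constants `n_c` and `alpha` are illustrative assumptions for this note, not values taken from any of the papers listed:

```python
# Sketch of a power-law scaling curve: loss(N) = (N_c / N) ** alpha.
# N_c and alpha are assumed placeholder constants, chosen only to
# illustrate the shape of the curve described in the abstracts above.

def scaling_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted cross-entropy loss as a power law in parameter count N."""
    return (n_c / n_params) ** alpha

# Under a pure power law, doubling N always shrinks the predicted loss
# by the same constant factor, 2 ** -alpha, regardless of where you start.
for n in [1e8, 1e9, 1e10]:
    print(f"N={n:.0e}  predicted loss={scaling_loss(n):.3f}")
```

The constant-ratio property in the final comment is what makes the curve a straight line on a log-log plot, which is how such fits are usually presented.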