LLM Architecture
Related topics:
- A Survey on Large Language Models with some Insights on their Capabilities and Limitations. The rapid advancement of artificial intelligence, particularly with the development of Large Language Models (LLMs) built on the transformer architecture, has redefined the capabilities of natural lan…
- Agent Learning via Early Experience. A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data wi…
- Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research. Agentic Reasoning is a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Unlike conventional LLM-based reasoning approaches, which rely solely on i…
- All AI Models are Wrong, but Some are Optimal. AI models that predict the future behavior of a system (a.k.a. predictive AI models) are central to intelligent decision-making. However, decision-making using predictive AI models often resu…
- Are Emergent Abilities in Large Language Models just In-Context Learning? We present a novel theory that explains emergent abilities, taking into account their potential confounding factors, and rigorously substantiate this theory through over 1000 experiments. Our findings…
- Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics. Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a rep…
- Artifacts as Memory Beyond the Agent Boundary. The situated view of cognition holds that intelligent behavior depends not only on internal memory, but on an agent’s active use of environmental resources. Here, we begin formalizing this intuition w…
- Ask, and it shall be given: Turing completeness of prompting. In this work, we show that prompting is in fact Turing-complete: there exists a finite-size Transformer such that for any computable function, there exists a corresponding prompt following which the T…
- Ask-AC: An Initiative Advisor-in-the-Loop Actor-Critic Framework. Despite the promising results achieved, state-of-the-art interactive reinforcement learning schemes rely on passively receiving supervision signals from advisor experts, in the form of either continuo…
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases. Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawi…
- Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning. To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and …
- Beyond neural scaling laws: beating power law scaling via data pruning. Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, thes…
- Can Language Models Serve as Text-Based World Simulators? Broadly speaking, there are two ways to leverage LLMs in the context of world modeling and simulation. The first is neurosymbolic: a number of efforts use language models to generate code in a symboli…
- Compositional Reasoning with Transformers, RNNs, and Chain of Thought. Large language models [Touvron et al., 2023, Anil et al., 2023, Achiam et al., 2023] are increasingly used to perform logical reasoning and other problems that require algorithmic thinking. To underst…
- Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data. One way to address safety risks from large la…
- DataComp-LM: In search of the next generation of training sets for language models. A key challenge in this emerging research area is a lack of controlled comparisons. While the aforementioned proposals generally use the same evaluation datasets, researchers often compare models that…
- DeepNet: Scaling Transformers to 1,000 Layers. In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DEEPNORM) to modify the residual connection i…
- Discovering Latent Concepts Learned in BERT. A large number of studies that analyze deep neural network models and their ability to encode various linguistic and non-linguistic concepts provide an interpretation of the inner mechanics of these m…
- Eliciting Reasoning in Language Models with Cognitive Tools. The recent advent of reasoning models like OpenAI’s o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of r…
- Foundations of Large Language Models. The main part of BERT models is a multi-layer Transformer network. A Transformer layer consists of a self-attention sub-layer and an FFN sub-layer. Both of them follow the post-norm architecture: outp…
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR. Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLV…
- Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability. From creative writing and survey responses to research idea generation (Doshi and Hauser, 2024; Anderson et al., 2024; Moon et al., 2024). For instance, stories written with ChatGPT assistance were mo…
- Hogwild! Inference: Parallel LLM Generation via Concurrent Attention. Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involv…
- Holy Grail 2.0: From Natural Language to Constraint Models. Twenty-seven years ago, E. Freuder highlighted that "Constraint programming represents one of the closest approaches computer science has yet made to the Holy Grail of programming: the user states the…
- Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. Task-oriented dialog systems help a user to accomplish some goal using natural language, such as making a restaurant reservation, getting technical support, or placing a phone call. Historically, these…
- LIMI: Less is More for Agency. We define “Agency” as the emergent capacity of AI systems to function as autonomous agents—actively discovering problems, formulating hypotheses, and executing solutions through self-directed engageme…
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models. Some policy gradient approaches are explained below: Policy Gradient (REINFORCE). The REINFORCE algorithm [114, 115] is a method used to improve decision-making by adjusting the model’s strategy (poli…
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. Large decoder-only language models (LLMs) are the state-of-the-art models on most of today’s NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks,…
- Language Modeling by Language Models. Can we leverage LLMs to model the process of discovering novel language model (LM) architectures? Inspired by real research, we propose a multi-agent LLM approach that simulates the conventional stage…
- Language Modeling is Compression. Information theory and machine learning are inextricably linked and have even been referred to as “two sides of the same coin” (MacKay, 2003). One particularly elegant connection is the essential equi…
- Language Models are Pragmatic Speakers. We propose a generalization of the previous methods called bounded pragmatic speakers with a dual model of thought. A dual model of thought comprises a slow-thinking system for deep reasoning and …
- Large Concept Models: Language Modeling in a Sentence Representation Space. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attem…
- Latent Collaboration in Multi-Agent Systems. Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediatio…
- Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models. Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language m…
- Leveraging Approximate Symbolic Models for Reinforcement Learning via Skill Diversity. Creating reinforcement learning (RL) agents that are capable of accepting and leveraging task-specific knowledge from humans has long been identified as a possible strategy for developing scalable app…
- Looking beyond the next token. The structure of causal language model training assumes that each token can be accurately predicted from the previous context. This contrasts with humans’ natural writing and reasoning process, where …
- MasRouter: Learning to Route LLMs for Multi-Agent Systems. Multi-agent systems (MAS) powered by Large Language Models (LLMs) have been demonstrated to push the boundaries of LLM capabilities, yet they often incur significant costs and face challenges in dynam…
- MatFormer: Nested Transformer for Elastic Inference. Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, …
- Multi-Token Attention. Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and …
- Nested Learning: The Illusion of Deep Learning Architectures. Over the last decades, developing more powerful neural architectures and simultaneously designing optimization algorithms to effectively train them have been the core of research efforts to enhance th…
- Neural Assistant: Joint Action Prediction, Response Generation, and Latent Knowledge Reasoning. Task-oriented dialog presents a difficult challenge encompassing multiple problems including multi-turn language understanding and generation, knowledge retrieval and reasoning, and action prediction.…
- Neuro-Symbolic AI in 2024: A Systematic Review. Taxonomy of Neuro-Symbolic AI: We identified five foundational research areas advancing the state of the art in Neuro-Symbolic AI. This taxonomy was synthesized from a review of six survey papers…
- Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance. Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the “seesaw phenomenon”, where indiscriminate …
- On the Binding Problem in Artificial Neural Networks. In this work, we argue that this underlying cause is the binding problem: the inability of existing neural networks to dynamically and flexibly bind information that is distributed throughout the netw…
- On the Theoretical Limitations of Embedding-Based Retrieval. Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new ben…
- Probing Structured Semantics Understanding and Generation of Language Models via Question Answering. As John McCarthy (McCarthy, 1990, 1959) points out, in order to better understand natural language, it is necessary for an intelligence system to understand the “deep structure” (Chomsky, 2011…
- Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words. We uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgemen…
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models. Our findings indicate that the approach to reasoning the models use is unlike retrieval, and more like a generalisable strategy that synthesises procedural knowledge from documents doing a similar for…
- QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration. The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant e…
- RL + Transformer = A General-Purpose Problem Solver. What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., metalearn)? In this study, we demonstrate that a pre-…
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination. To obtain trustworthy evaluation signals, we introduce a generator that creates fully synthetic arithmetic problems of arbitrary length and difficulty, yielding clean datasets we call RandomCalculatio…
- SParC: Cross-Domain Semantic Parsing in Context. The most prominent context-dependent text-to-SQL benchmark is ATIS, which is set in the flight-booking domain and contains only one database (Hemphill et al., 1990; Dahl et al., 1994). In a real-worl…
- Scaling Laws for Neural Language Models. We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, wit…
- Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges. Large language models often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iterativel…
- Self-reinforcing cascades: A spreading model for beliefs or products of varying intensity or quality. Models of how things spread often assume that transmission mechanisms are fixed over time. However, social contagions (the spread of ideas, beliefs, innovations) can lose or gain in momentum as they spr…
- Sleep-time Compute: Beyond Inference Scaling at Test-time. Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time…
- System 1 vs. System 2 Thinking. It is widely accepted that the human mind is specialized for specific domains. But is there a domain-general mind? Cognitive psychologists concur; however, evolutionary psychologists find this notion …
- Textgrad: Automatic “Differentiation” via Text. To optimize the new generation of AI systems, we introduce TEXTGRAD, automatic differentiation via text. Here we use differentiation and gradients as a metaphor for textual feedback from LLMs. In this…
- The Unreasonable Ineffectiveness of the Deeper Layers. We empirically study a simple layer-pruning strategy for popular families of open weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until af…
- The Vanishing Gradient Problem for Stiff Neural Differential Equations. Neural differential equations have become a transformative tool in machine learning and scientific computing, enabling data-driven modeling of complex, time-dependent phenomena in fields ranging from …
- Thinking Augmented Pre-training. This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for …
- Titans: Learning to Memorize at Test Time. For more than a decade there has been an extensive research effort into how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory…
- Training language models to follow instructions with human feedback. Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpf…
- Unifying Large Language Models and Knowledge Graphs: A Roadmap. Large language models (LLMs), such as ChatGPT and GPT4, are making new waves in the field of natural language processing and artificial intelligence, due to their emergent ability and generalizabilit…
- What are the Goals of Distributional Semantics? As Harnad (1990) discusses, if the meanings of words are defined only in terms of other words, these definitions are circular. One goal for a semantic model is to capture how language relates to the w…
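Several entries above ("Scaling Laws for Neural Language Models", "Beyond neural scaling laws") revolve around the same quantitative idea: loss falling as a power law in model size. A minimal numerical sketch of that relationship; the function name and the constants `n_c` and `alpha` are illustrative assumptions for this note, not values taken from any of the papers listed:

```python
# Sketch of a power-law scaling curve: loss(N) = (N_c / N) ** alpha.
# N_c and alpha are assumed placeholder constants, chosen only to
# illustrate the shape of the curve described in the abstracts above.

def scaling_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted cross-entropy loss as a power law in parameter count N."""
    return (n_c / n_params) ** alpha

# Under a pure power law, doubling N always shrinks the predicted loss
# by the same constant factor, 2 ** -alpha, regardless of where you start.
for n in [1e8, 1e9, 1e10]:
    print(f"N={n:.0e}  predicted loss={scaling_loss(n):.3f}")
```

The constant-ratio property in the final comment is what makes the curve a straight line on a log-log plot, which is how such fits are usually presented.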