Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Paper · arXiv 2507.06261 · Published July 7, 2025

Sparse MoE models activate a subset of model parameters per input token by learning to dynamically route tokens to a subset of parameters (experts); this allows them to decouple total model capacity from computation and serving cost per token. Developments to the model architecture contribute to the significantly improved performance of Gemini 2.5 compared to Gemini 1.5 Pro (see Section 3). Despite their overwhelming success, large transformers and sparse MoE models are known to suffer from training instabilities (Chowdhery et al., 2022; Dehghani et al., 2023; Fedus et al., 2021; Lepikhin et al., 2020; Liu et al., 2020; Molybog et al., 2023; Wortsman et al., 2023; Zhai et al., 2023; Zhang et al., 2022). The Gemini 2.5 model series makes considerable progress in enhancing large-scale training stability, signal propagation and optimization dynamics, resulting in a considerable boost in performance straight out of pre-training compared to previous Gemini models.
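The routing scheme described above can be illustrated with a minimal sketch. This is a generic top-k MoE forward pass, not Gemini's actual router; `moe_forward`, the gating weights, and the toy experts are all hypothetical stand-ins chosen to show how per-token compute stays fixed at k experts while total parameters scale with the number of experts.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:       (n_tokens, d_model) token activations
    gate_w:  (d_model, n_experts) learned router weights
    experts: list of callables, each mapping (d_model,) -> (d_model,)

    Only k of len(experts) expert FFNs run per token, so compute and
    serving cost per token are decoupled from total model capacity.
    """
    logits = x @ gate_w                          # (n_tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        gates = np.exp(sel - sel.max())
        gates /= gates.sum()                     # softmax renormalized over top-k
        for g, e in zip(gates, topk[t]):
            out[t] += g * experts[e](x[t])       # only k experts executed
    return out

# Toy usage: 4 experts, only 2 of which run per token.
rng = np.random.default_rng(0)
d, n_exp = 8, 4
experts = [lambda v, W=rng.normal(size=(d, d)) / d: W @ v for _ in range(n_exp)]
x = rng.normal(size=(3, d))
y = moe_forward(x, rng.normal(size=(d, n_exp)), experts)
print(y.shape)  # (3, 8)
```

Real MoE layers batch the expert computation and add load-balancing losses so tokens spread evenly across experts; the per-token loop here is for clarity only.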

Gemini 2.5 models build on the success of Gemini 1.5 in processing long-context queries, and incorporate new modeling advances allowing Gemini 2.5 Pro to surpass the performance of Gemini 1.5 Pro in processing long-context input sequences of up to 1M tokens (see Table 3). Both Gemini 2.5 Pro and Gemini 2.5 Flash can process pieces of long-form text (such as the entirety of “Moby Dick” or “Don Quixote”), whole codebases, and long-form audio and video data (see Appendix 8.5). Together with advancements in long-context abilities, architectural changes to Gemini 2.5 vision processing lead to a considerable improvement in image and video understanding capabilities, including the ability to process 3-hour-long videos and to convert demonstrative videos into interactive coding applications (see our recent blog post by Baddepudi et al., 2025).

2.4. Post-training

Since the initial announcement of Gemini 1.5, significant advancements have been made in our post-training methodologies, driven by a consistent focus on data quality across the Supervised Fine-Tuning (SFT), Reward Modeling (RM), and Reinforcement Learning (RL) stages. A key focus has been leveraging the model itself to assist in these processes, enabling more efficient and nuanced quality control.

Furthermore, we have increased the training compute allocated to RL, allowing deeper exploration and refinement of model behaviors. This has been coupled with a focus on verifiable rewards and model-based generative rewards to provide more sophisticated and scalable feedback signals. Algorithmic changes to the RL process have also improved stability during longer training. These advancements have enabled Gemini 2.5 to learn from more diverse and complex RL environments, including those requiring multi-step actions and tool use.
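The verifiable rewards mentioned above can be illustrated with a minimal sketch. The function name and the final-number matching heuristic are illustrative assumptions, not the paper's actual checker; the point is that a programmatic verifier yields a cheap, scalable reward signal that, unlike a learned reward model, cannot drift from ground truth.

```python
import re

def verifiable_math_reward(response: str, reference_answer: str) -> float:
    """Binary reward for RL on math-style tasks (illustrative sketch).

    Returns 1.0 if the last number appearing in the model's response
    matches the reference answer, else 0.0. Checks like this scale to
    millions of rollouts with no human labeling.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == reference_answer else 0.0

print(verifiable_math_reward("... so the total is 42", "42"))  # 1.0
print(verifiable_math_reward("the answer might be 41", "42"))  # 0.0
```

For code tasks, the analogous verifier runs the generated program against unit tests; for both, the reward is objective rather than modeled.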

2.5. Thinking

Past Gemini models produced an answer immediately following a user query. This constrains the amount of inference-time compute (Thinking) that our models can spend reasoning over a problem. Gemini Thinking models are trained with Reinforcement Learning to use additional compute at inference time to arrive at more accurate answers. The resulting models are able to perform tens of thousands of forward passes during a “thinking” stage before responding to a question or query.
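The two-phase decoding this describes can be sketched as follows. Everything here is a hypothetical stand-in: `model_step` represents one forward pass of the LM, and the `<think>`/`</think>` markers are illustrative, not Gemini's actual token vocabulary. The sketch shows how a thinking budget caps the hidden scratchpad before the visible answer is emitted.

```python
def answer_with_thinking(model_step, prompt, max_think_tokens=10000):
    """Spend up to max_think_tokens of inference-time compute 'thinking'
    before committing to an answer.

    model_step(tokens) returns the next token (one forward pass).
    '</think>' ends the thinking phase; '<eos>' ends the answer.
    """
    tokens = list(prompt) + ["<think>"]
    for _ in range(max_think_tokens):      # thinking phase: hidden scratchpad
        nxt = model_step(tokens)
        tokens.append(nxt)
        if nxt == "</think>":
            break
    else:
        tokens.append("</think>")          # budget exhausted: force a stop
    answer = []
    while True:                            # answer phase: visible output
        nxt = model_step(tokens)
        if nxt == "<eos>":
            return answer
        tokens.append(nxt)
        answer.append(nxt)

# Toy model that "thinks" for three steps, then answers "4".
script = iter(["2+2", "carry none", "</think>", "4", "<eos>"])
result = answer_with_thinking(lambda toks: next(script), ["what", "is", "2+2?"])
print(result)  # ['4']
```

RL training then shapes how the model spends that thinking budget, rewarding final-answer accuracy rather than the scratchpad content itself.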

Within the context of generative models, ensuring the factuality of model responses to information-seeking prompts remains a core pillar of Gemini model development. With Gemini 1.5, our research was concentrated on enhancing the model’s world knowledge and its ability to provide answers faithfully grounded in the context provided within the prompt. This effort culminated in the December 2024 release of FACTS Grounding (Jacovi et al., 2025), now an industry-standard benchmark for evaluating an LLM’s capacity to generate responses grounded in user-provided documents. With Gemini 2.0 and 2.5, we have significantly expanded our scope to address multimodal inputs, long-context reasoning, and model-retrieved information. At the same time, the landscape and user expectations for factuality have evolved dramatically, shaped in part by Google’s deployment of AI Overviews and AI Mode (Stein, 2025). To meet these demands, Gemini 2.0 marked a significant leap as our first model family trained to natively call tools like Google Search, enabling it to formulate precise queries and synthesize fresh information with sources. Building on this, Gemini 2.5 integrates advanced reasoning, allowing it to interleave these search capabilities with internal thought processes to answer complex, multi-hop queries and execute long-horizon tasks. The model has learned to use search and other tools, reason about the outputs, and issue additional, detailed follow-up queries to expand the information available to it and to verify the factual accuracy of the response. Our latest models now power the experiences of over 1.5B monthly active users in Google’s AI Overviews and 400M users in the Gemini App.

Gemini as an Agent: Deep Research

Gemini Deep Research (Gemini Team, Google, 2024) is an agent built on top of the Gemini 2.5 Pro model, designed to strategically browse the web and provide informed answers to even the most niche user queries. The agent is optimized to perform task prioritization, and is also able to identify when it reaches a dead-end while browsing. We have massively improved the capabilities of Gemini Deep Research since its initial launch in December 2024. As evidence of that, the performance of Gemini Deep Research on the Humanity’s Last Exam benchmark (Phan et al., 2025) has gone from 7.95% in December 2024 to a SoTA score of 26.9%, and 32.4% with higher compute (June 2025).

Gemini 2.5 Pro Deep Think

To advance Gemini’s capabilities towards solving hard reasoning problems, we developed a novel reasoning approach, called Deep Think, that naturally blends in parallel thinking techniques during response generation. Deep Think enables Gemini to creatively produce multiple hypotheses and carefully critique them before arriving at the final answer, achieving state-of-the-art performance on challenging benchmarks such as Olympiad math (USAMO 2025), competitive coding (LiveCodeBench), and multimodality (MMMU); see Doshi (2025b) for more details. We announced Gemini 2.5 Deep Think at Google I/O and launched an experimental version to trusted testers and advanced users in June 2025.
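The generate-then-critique pattern described above can be sketched as a best-of-N selection loop. This is a simplification under stated assumptions: `generate` and `critique` are hypothetical stand-ins for sampled model calls, and the real Deep Think mechanism interleaves hypothesis generation and critique during decoding rather than running them as two separate passes.

```python
def deep_think(generate, critique, prompt, n_hypotheses=4):
    """Parallel-thinking sketch: sample several candidate solutions,
    score each with a critic, and return the highest-scoring one.

    generate(prompt, seed) -> one candidate answer
    critique(prompt, candidate) -> scalar quality score
    """
    candidates = [generate(prompt, seed=i) for i in range(n_hypotheses)]
    scored = [(critique(prompt, c), c) for c in candidates]
    return max(scored)[1]   # keep the candidate with the best critic score

# Toy usage: candidates are guesses for 6*7; the critic rewards correctness.
gen = lambda prompt, seed: 40 + seed
crit = lambda prompt, c: 1.0 if c == 42 else 0.0
best = deep_think(gen, crit, "6*7")
print(best)  # 42
```

The benefit over single-pass decoding is that weak hypotheses are filtered before any answer is committed, at the cost of N times the generation compute.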

• Supervised Fine-Tuning: For the SFT stage, we source adversarial prompts either by leveraging existing models and tools to probe Gemini’s attack surface, or by relying on human interactions to discover potentially harmful behavior. Throughout this process we strive for coverage of the safety policies described above across common model use cases. When we find that model behavior needs improvement, either because of safety policy violations, or because the model refuses when a helpful, non-policy-violating answer exists, we use a combination of custom data generation recipes loosely inspired by Constitutional AI (Bai et al., 2022), as well as human intervention, to revise responses. The process described here is typically refined through successive model iterations. We use automated evaluations on both safety and non-safety metrics to monitor impact and potential unintended regressions.

• Reinforcement Learning from Human and Critic Feedback (RL*F): The reward signal during RL comes from a combination of a Data Reward Model (DRM), which amortizes human preference data, and a Critic, a prompted model that grades responses according to pre-defined rubrics. We divide our interventions into Reward Model and Critic improvements (RM) and reinforcement learning (RL) improvements. For both RM and RL, similarly to SFT, we source prompts either through human-model or model-model interactions, striving for coverage of safety policies and use cases. For DRM training, given a prompt set, we use custom data generation recipes to surface a representative sample of model responses. Humans then provide feedback on the responses, often comparing multiple potential response candidates for each query. This preference data is amortized in our Data Reward Model. Critics, on the other hand, do not require additional data, and iteration on the grading rubric can be done offline. Similarly to SFT, RL*F steers the model away from undesirable behavior, both in terms of content policy violations, and trains the model to be helpful. RL*F is accompanied by a number of evaluations that run continuously during training to monitor for safety and other metrics.
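The two reward sources in the bullet above can be combined as in the following sketch. The function name, equal weighting, and rubric averaging are illustrative assumptions, not Gemini's actual recipe; the sketch shows only how an amortized preference score (DRM) and per-rubric critic grades blend into one scalar RL reward.

```python
def combined_reward(drm_score, rubric_grades, weights=(0.5, 0.5)):
    """Blend the two RL*F reward sources (illustrative sketch).

    drm_score:     scalar from the Data Reward Model, which amortizes
                   human preference comparisons
    rubric_grades: dict of per-rubric grades from the prompted Critic
    weights:       hypothetical mixing weights, not the production values
    """
    critic_score = sum(rubric_grades.values()) / len(rubric_grades)
    w_drm, w_critic = weights
    return w_drm * drm_score + w_critic * critic_score

r = combined_reward(
    drm_score=0.8,
    rubric_grades={"helpful": 1.0, "safe": 1.0, "grounded": 0.5},
)
print(round(r, 3))  # critic average = 2.5/3 ≈ 0.833, so reward ≈ 0.817
```

Because the Critic is a prompted model graded against a fixed rubric, its half of the reward can be iterated on offline without collecting new human preference data, as the bullet notes.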

Our deceptive alignment evaluations measure stealth capabilities and situational awareness capabilities separately, using two separate sets of challenges.

Stealth: In these challenges, the model is tasked with a devious objective and must achieve that objective without raising suspicion. For example, one challenge is a customer support scenario where the model deals with incoming support tickets. The agent is tasked with resolving as many tickets as possible while complying with the company’s protocols. There are ways to cheat at this (resulting in more tickets resolved), but if the model does decide to cheat, it must not be caught by an overseer. Here we measure performance relative to the score that humans achieve on the same challenge.

Situational awareness: These challenges test whether the model can reason about itself as an agent embedded in the deployment environment, and intervene on that environment in service of a goal. Many of them test whether the model can realize that it needs to give itself more affordances, e.g. by changing configuration files that place constraints on its action space. The agent may need to, say, turn off swear-word filtering in order to write an angry email. For more details of these evaluations, see Phuong et al. (2025).