Tina: Tiny Reasoning Models via LoRA
How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this fundamental question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Notably, Tina demonstrates that substantial reasoning performance can be developed with only minimal resources by applying parameter-efficient updates, via low-rank adaptation (LoRA), during reinforcement learning (RL) to an already tiny 1.5B-parameter base model. This minimalist approach produces models whose reasoning performance is competitive with, and sometimes surpasses, that of SOTA RL reasoning models built on the same base model. Crucially, this is achieved at a tiny fraction of the computational post-training cost incurred by existing SOTA models.
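To make the parameter-efficient updates concrete, recall the standard LoRA formulation (Hu et al., 2022): each adapted weight matrix is kept frozen and augmented with a trainable low-rank product, so RL updates touch only the small factors $A$ and $B$. For a layer with frozen weight $W_0 \in \mathbb{R}^{d \times k}$, the adapted forward pass is

$$ h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k). $$

As an illustrative calculation (the hidden size $d = k = 1536$ is typical of 1.5B-parameter models, not a detail stated above): with rank $r = 16$, LoRA trains $2dr \approx 4.9 \times 10^{4}$ parameters per matrix instead of $d^2 \approx 2.4 \times 10^{6}$, roughly 2% of the original count.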
Language models (LMs) demonstrate increasing proficiency across a variety of tasks, but achieving robust, multi-step reasoning remains a frontier challenge (Wang and Neiswanger, 2025; Xu et al., 2025). Such reasoning abilities are crucial for applications demanding complex problem-solving, from scientific discovery to intricate planning. Enhancing complex reasoning via supervised fine-tuning (SFT) is a well-adopted technique, often via a distillation process (Min et al., 2024; Huang et al., 2024) in which the model learns to mimic reasoning traces (e.g., step-by-step thinking) generated by more advanced models such as o1 (OpenAI, 2024). This approach, while effective, depends on the quality and availability of expert demonstrations, which can be costly to obtain. Furthermore, it risks instilling shallow imitation in the learning model rather than fostering dynamic exploration of reasoning paths. In contrast, reinforcement learning (RL) enables models to learn directly and flexibly from verifiable reward signals derived from curated data (DeepSeek-AI, 2025; Lambert et al., 2025). In doing so, RL can lead the model to explore a greater variety of logical paths and potentially discover more robust solutions. However, RL pipelines are complex and notoriously resource-intensive, typically requiring substantial compute. This raises the fundamental question anchoring our research:
How cost-effectively can one perform RL to efficiently instill reasoning abilities in LMs?
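Before turning to our hypothesis, it helps to make the notion of a verifiable reward signal concrete. The sketch below gives a minimal rule-based reward for math problems; the \boxed{...} answer format and the exact-match check are illustrative assumptions, not the precise reward used in our pipeline.

```python
import re

def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the completion's final \\boxed{...} answer exactly
    matches the reference string, else 0.0 (illustrative sketch only)."""
    answers = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not answers:
        return 0.0  # no parseable final answer
    return 1.0 if answers[-1].strip() == reference.strip() else 0.0

# Hypothetical usage:
print(verifiable_reward(r"Adding gives 40 + 2, so the answer is \boxed{42}.", "42"))  # 1.0
```

Rewards of this rule-based form are cheap to compute and require no learned reward model, which is part of what makes RL on curated, verifiable data comparatively lightweight.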
Rapid Reasoning Format Adaptation Hypothesis. Based on our observations in post-training Tina, we hypothesize that LoRA’s effectiveness and efficiency stem from rapidly adapting the format of reasoning under RL while preserving the base model’s knowledge, a process likely more compute-efficient than the deep knowledge integration of full-parameter training. Partial support comes from studies showing that tiny LMs can reason effectively (Hugging Face, 2025; DeepSeek-AI, 2025), while large LMs store broader world knowledge (Allen-Zhu and Li, 2025). This distinction suggests that reasoning capabilities can be significantly enhanced by focusing on adapting the output format itself, consistent with our hypothesis about LoRA. To test this, we exclusively train LoRA parameters in RL settings, isolating this format-adaptation mechanism; a minimal sketch of such a setup is given below.
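The sketch shows an RL run in which only LoRA parameters receive gradients. It is a minimal illustration using the Hugging Face peft and trl libraries with a GRPO-style trainer; the base checkpoint, LoRA rank, target modules, reward, and toy dataset are all illustrative assumptions, not our exact configuration.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Toy prompt-only dataset; a real run would use a curated, verifiable dataset.
train_dataset = Dataset.from_dict(
    {"prompt": ["What is 6 * 7? Put the final answer in \\boxed{}."]}
)

# LoRA adapter: only the low-rank factors A and B are trainable;
# the 1.5B base model weights stay frozen throughout RL.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
    task_type="CAUSAL_LM",
)

def boxed_answer_reward(completions, **kwargs):
    # Placeholder verifiable reward in the spirit of the earlier sketch.
    return [1.0 if "\\boxed{42}" in c else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # illustrative tiny base model
    reward_funcs=boxed_answer_reward,
    args=GRPOConfig(output_dir="tina-lora-grpo"),
    train_dataset=train_dataset,
    peft_config=lora_cfg,  # restricts the RL update to LoRA parameters only
)
trainer.train()
```

Because the frozen base weights never receive gradients, gradient and optimizer-state memory scale with the adapter size rather than with all 1.5B parameters, which is where the cost savings of this kind of setup come from.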