Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Structured generation, the process of producing content in standardized formats like JSON and XML, is widely used in real-world applications to extract key output information from large language models (LLMs). This study investigates whether such constraints on the generation space impact LLMs’ abilities, including reasoning and domain knowledge comprehension.
Surprisingly, we observe a significant decline in LLMs’ reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.
One common approach to overcoming the difficulty of parsing free-form LLM responses is structured generation, which constrains the model to produce output in standardized formats like JSON or XML. These format restrictions can be implemented in various ways, such as instructing LLMs to adhere to specified formats with format-restricting instructions, or using industrial solutions like JSON mode (OpenAI, 2024; Gemini, 2024), Instructor (Liu, 2024), or Guardrails (PrefectHQ, 2024).
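As a concrete illustration, the two implementation routes differ mainly in where the constraint is enforced: a format-restricting instruction lives entirely in the prompt, whereas JSON mode is a decoding-side option exposed by the provider's API. The sketch below assumes the OpenAI Python SDK (v1+); the model name, placeholder question, and exact instruction wording are illustrative and not the precise prompts used in our experiments.

```python
from openai import OpenAI

client = OpenAI()
question = "A GSM8K-style math word problem goes here."  # placeholder question

# (1) Format-restricting instruction (FRI): the constraint lives entirely in the prompt.
fri_prompt = f"{question}\nReply your answer in JSON format."
fri_response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": fri_prompt}],
)

# (2) JSON mode: the provider additionally constrains decoding so the reply is valid JSON.
#     (OpenAI's JSON mode still requires the prompt itself to mention JSON.)
json_mode_response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": fri_prompt}],
    response_format={"type": "json_object"},
)

print(fri_response.choices[0].message.content)
print(json_mode_response.choices[0].message.content)
```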
To the best of our knowledge, this is the first systematic investigation into the relationship between format-restricting instructions and the quality of generated content. Our contributions are twofold:
• We observe declines in LLMs’ reasoning abilities under format restrictions, with stricter constraints generally leading to greater performance degradation in reasoning tasks.
• We offer insights into why performance degrades due to format constraints and propose simple approaches to mitigate these issues, thereby achieving both consistent formats and optimal performance.
Instead of providing a specific schema (e.g., "Reply your answer in JSON format with the following schema: { "reason": ..., "answer": ... }"), we simply instruct the LLM to output in the target format language (e.g., "Reply your answer in JSON format."). Table 1 illustrates the effects of removing the schema restriction on the GSM8K dataset. We observe significant improvements in average scores and lower standard deviations across different prompt perturbations for Claude 3 Haiku, GPT-3.5 Turbo, and LLaMA 3 8B Instruct.
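As a minimal illustration of the two settings compared in Table 1, the snippet below assembles both prompt variants; the quoted instruction strings come from the text above, while the placeholder question and the way the prompt is assembled are assumptions made for illustration.

```python
question = "A GSM8K-style math word problem goes here."  # placeholder question

# Schema-restricted instruction: fixes both the output format and its key structure.
schema_prompt = (
    f"{question}\n"
    "Reply your answer in JSON format with the following schema: "
    '{ "reason": ..., "answer": ... }'
)

# Looser format-restricting instruction: names the target format but leaves the structure free.
loose_prompt = f"{question}\nReply your answer in JSON format."
```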