Does verbose chain-of-thought actually help multimodal perception tasks?
Extending RLHF to MLLMs through longer rationales follows the playbook that succeeded for text reasoning, but it may backfire on perception tasks. This note explores when and why the standard CoT-and-RL recipe fails.
The default playbook for improving LLM reasoning under RL is well known: longer chains of thought, more intermediate tokens, more verbose rationales. It helps on math, code, and multi-hop reasoning. The Reinforced Attention Learning paper documents a domain where it actively hurts: multimodal perception tasks.
The failure mode is structural. Perception tasks — fine-grained visual question-answering, grounding, attribute identification — depend on the model precisely attending to the right region of the visual input. The bottleneck is not what to say; it is what to look at. Verbose rationales pile text tokens on top of the visual attention task, and the optimization signal flows to those text tokens rather than to the underlying attention. The model becomes more elaborate in its descriptions of what it sees without becoming more accurate about what it sees.
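A toy sketch of the dilution described above, under a strong simplifying assumption: a single sequence-level reward is applied with equal weight to every generated token, as in a plain REINFORCE-style objective. The function name and the token counts are hypothetical and are not taken from the Reinforced Attention Learning paper. The sketch only shows that lengthening the rationale shrinks the share of per-token loss terms that sit on the answer, while none of those terms act on visual attention directly.

```python
def answer_token_share(n_rationale_tokens: int, n_answer_tokens: int) -> float:
    """Fraction of per-token loss terms that fall on answer tokens.

    Assumption (illustration only): one sequence-level advantage is
    applied with equal weight to every generated token.
    """
    total = n_rationale_tokens + n_answer_tokens
    if total <= 0:
        raise ValueError("need at least one generated token")
    return n_answer_tokens / total


# Hypothetical lengths: a 5-token answer after rationales of growing length.
shares = {n: answer_token_share(n, 5) for n in (0, 45, 495)}

for n_rationale, share in shares.items():
    print(f"rationale tokens: {n_rationale:4d}  answer share: {share:.2%}")
```

Real training setups weight tokens in more elaborate ways, so the numbers here are illustrative rather than a measurement of any actual model.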
This contradicts the assumption that the CoT-and-RL recipe is universally beneficial. The recipe works when the bottleneck is reasoning steps that can be made externally visible — math derivations, logical chains, step-by-step planning. It does not work when the bottleneck is the model's internal information-allocation decisions, which are not visible in the token stream and which RL on tokens cannot directly reach.
The diagnostic generalizes. Before applying the verbose-CoT playbook to a new domain, ask: is the bottleneck on this task something the model can verbalize, or is it something happening inside the attention pattern? If verbalizable, verbose CoT and outcome RL help. If internal, they may add noise without addressing the actual problem — and in some cases, as in MLLM perception, may degrade performance by reinforcing the wrong policy object.
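The diagnostic can be written down as a small decision helper. This is a minimal sketch: the `Bottleneck` enum and the `recommend` function are hypothetical names introduced here, and deciding which category a task falls into remains a human judgment that the code does not make.

```python
from enum import Enum


class Bottleneck(Enum):
    # Reasoning that can be made externally visible in the token stream,
    # e.g. math derivations, logical chains, step-by-step planning.
    VERBALIZABLE = "verbalizable"
    # Internal information allocation, e.g. which visual region is attended to.
    INTERNAL = "internal"


def recommend(bottleneck: Bottleneck) -> str:
    """Map the note's diagnostic question to a coarse recommendation."""
    if bottleneck is Bottleneck.VERBALIZABLE:
        return "verbose CoT + outcome RL is a reasonable default"
    return "verbose CoT may add noise or degrade performance; target the internal mechanism"


# Example: MLLM perception is treated in this note as an internal bottleneck.
print(recommend(Bottleneck.INTERNAL))
```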
This connects to the broader pattern of CoT limitations. CoT is constrained imitation of reasoning form; it does not access mechanisms not encoded in token sequences. For tasks whose mechanism lives in attention, in latent state trajectories, or in cross-modal alignment, training the verbalization layer is training the wrong thing.
Related concepts in this collection
- Can optimizing attention patterns improve multimodal RL better than optimizing tokens?
  Standard RL training optimizes token outputs in multimodal models, but the real bottleneck may be where the model attends to visual information. Does steering attention directly outperform indirect optimization through final outputs?
  same paper, the alternative to verbose CoT
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  adjacent: structural limit of CoT that applies broadly
- Does chain of thought reasoning actually explain model decisions?
  When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
  adjacent: CoT degradation in another domain (agentic pipelines)
- Do large language models actually perform iterative optimization?
  Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
  adjacent: another case where verbose CoT does not address the actual bottleneck
Original note title
verbose chain-of-thought degrades MLLM perception tasks — text-token RL is the wrong policy objective when the bottleneck is visual grounding