Does verbose chain-of-thought actually help multimodal perception tasks?

Extending RLHF to MLLMs through longer rationales follows the successful reasoning playbook, but may backfire on perception tasks. This note explores when and why the standard CoT-and-RL recipe fails.

Note · 2026-05-18 · sourced from LLM Architecture

The default playbook for improving LLM reasoning under RL is well known: longer chains of thought, more intermediate tokens, more verbose rationales. This helps on math, code, and multi-hop reasoning. Reinforced Attention Learning documents a domain where it actively hurts: multimodal perception tasks.

The failure mode is structural. Perception tasks — fine-grained visual question-answering, grounding, attribute identification — depend on the model precisely attending to the right region of the visual input. The bottleneck is not what to say; it is what to look at. Verbose rationales pile text tokens on top of the visual attention task, and the optimization signal flows to those text tokens rather than to the underlying attention. The model becomes more elaborate in its descriptions of what it sees without becoming more accurate about what it sees.
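A toy illustration of the point about where the optimization signal lands. It assumes a plain sequence-level policy-gradient loss in which one scalar reward multiplies the log-probability of every sampled token; that setup is an assumption for illustration, not the training objective described in Reinforced Attention Learning. The function name `loss_term_share` and the token counts are hypothetical. It only counts loss terms, not gradient magnitudes.

```python
def loss_term_share(n_rationale_tokens: int, n_answer_tokens: int) -> dict:
    """Share of per-token loss terms that fall on rationale vs. answer tokens.

    Assumption: loss = -reward * sum_t log pi(token_t | context), so every
    sampled token contributes one term weighted by the same scalar reward.
    """
    total = n_rationale_tokens + n_answer_tokens
    if total <= 0:
        raise ValueError("need at least one sampled token")
    return {
        "rationale": n_rationale_tokens / total,
        "answer": n_answer_tokens / total,
    }


# Hypothetical token counts: a terse rationale vs. a verbose one,
# each followed by the same 5-token answer.
terse = loss_term_share(n_rationale_tokens=20, n_answer_tokens=5)
verbose = loss_term_share(n_rationale_tokens=500, n_answer_tokens=5)

print(f"terse:   {terse['rationale']:.1%} of loss terms on rationale text")
print(f"verbose: {verbose['rationale']:.1%} of loss terms on rationale text")
```

Under this simplification, lengthening the rationale raises the share of the token-level objective spent on descriptive text. None of those terms is defined on the attention pattern itself, so attention is shaped only indirectly, through whatever happens to improve the likelihood of the rewarded tokens.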

This contradicts the assumption that the CoT-and-RL recipe is universally beneficial. The recipe works when the bottleneck is reasoning steps that can be made externally visible — math derivations, logical chains, step-by-step planning. It does not work when the bottleneck is the model's internal information-allocation decisions, which are not visible in the token stream and which RL on tokens cannot directly reach.

The diagnostic generalizes. Before applying the verbose-CoT playbook to a new domain, ask: is the bottleneck on this task something the model can verbalize, or is it something happening inside the attention pattern? If verbalizable, verbose CoT and outcome RL help. If internal, they may add noise without addressing the actual problem — and in some cases, as in MLLM perception, may degrade performance by reinforcing the wrong policy object.
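The diagnostic can be written down as a small checklist, sketched here under the assumption that a person classifies the bottleneck by hand. The names `Bottleneck`, `TaskProfile`, and `recommend` are hypothetical, and the returned strings only restate the note's own conclusions.

```python
from dataclasses import dataclass
from enum import Enum


class Bottleneck(Enum):
    VERBALIZABLE = "verbalizable"  # e.g. math derivations, logical chains, planning
    INTERNAL = "internal"          # e.g. attention allocation, cross-modal alignment


@dataclass(frozen=True)
class TaskProfile:
    name: str
    bottleneck: Bottleneck  # a human judgment call, not something this code infers


def recommend(task: TaskProfile) -> str:
    """Restate the note's diagnostic for a manually classified task."""
    if task.bottleneck is Bottleneck.VERBALIZABLE:
        return "verbose CoT and outcome RL are likely to help"
    return "verbose CoT may add noise or degrade performance"


tasks = [
    TaskProfile("multi-step math", Bottleneck.VERBALIZABLE),
    TaskProfile("fine-grained visual grounding", Bottleneck.INTERNAL),
]
for task in tasks:
    print(f"{task.name}: {recommend(task)}")
```

The code does nothing clever on purpose: the hard part of the diagnostic is the classification step, which stays a judgment about where the task's mechanism lives.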

This connects to the broader pattern of CoT limitations. CoT is constrained imitation of reasoning form; it does not access mechanisms not encoded in token sequences. For tasks whose mechanism lives in attention, in latent state trajectories, or in cross-modal alignment, training the verbalization layer is training the wrong thing.

Original note title

verbose chain-of-thought degrades MLLM perception tasks — text-token RL is the wrong policy objective when the bottleneck is visual grounding