Can multimodal telemetry operationalize the attentional component of discourse?
This explores whether the behavioral signals machines can capture during interaction (gaze, hesitation, typing speed) can actually stand in for the 'attentional' layer that discourse theory says comprehension depends on — the moment-to-moment tracking of what's currently salient.
The attentional layer at issue is the one discourse theory describes, not the architecture's notion of 'attention.' The first thing the corpus does is split the question's two halves apart. On the discourse side, comprehension isn't one process: it requires tracking three irreducible layers at once, namely the linguistic segments, the speaker's intentional structure, and the *attentional salience* of what's in focus right now (see: How do readers track segments, purposes, and salience together?). The attentional component isn't a vague 'paying attention'; it's the shifting register of which entities and goals are currently live. The question is whether external behavioral signals can instrument that register.
The optimistic answer comes from work showing AI systems can read cognitive state from interaction patterns alone, treating gaze, pauses, and speed as continuous signals of where a user's mind is, without disruptive explicit probes (see: Can AI systems read cognitive state from interaction patterns alone?). That's exactly the kind of substrate the attentional layer would need: something that updates turn-by-turn and tracks salience without asking. And there's precedent for treating salience as a trackable temporal stream rather than a static property: Conversational DNA encodes relevance and topic coherence as parallel time-series, surfacing patterns flat statistical analysis misses (see: Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?).
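One way to picture "salience as a temporal stream" is a toy sketch like the one below. Everything in it is an assumption made for illustration: the `TurnTelemetry` fields, the `decay` value, and the rule that the gazed-at entity gets a fixed boost are hypothetical and are not specified by the cited notes. It only shows that gaze, hesitation, and speed can be carried as parallel per-turn series alongside a running focus estimate.

```python
from dataclasses import dataclass


@dataclass
class TurnTelemetry:
    """One turn of hypothetical interaction telemetry (illustrative fields)."""
    gaze_target: str      # entity the user looked at most during the turn
    pause_ms: float       # hesitation before the user responded
    chars_per_sec: float  # typing speed during the turn


def update_salience(salience, turn, decay=0.5):
    """Decay every entity's score, then boost the entity gazed at this turn."""
    updated = {entity: score * decay for entity, score in salience.items()}
    updated[turn.gaze_target] = updated.get(turn.gaze_target, 0.0) + 1.0
    return updated


def parallel_streams(turns, decay=0.5):
    """Return per-turn series: estimated focus, pause length, typing speed."""
    salience = {}
    focus, pauses, speeds = [], [], []
    for turn in turns:
        salience = update_salience(salience, turn, decay)
        # The current focus estimate is the highest-scoring entity so far.
        focus.append(max(salience, key=salience.get))
        pauses.append(turn.pause_ms)
        speeds.append(turn.chars_per_sec)
    return {"focus": focus, "pause_ms": pauses, "chars_per_sec": speeds}


if __name__ == "__main__":
    turns = [
        TurnTelemetry("invoice", 200.0, 5.0),
        TurnTelemetry("invoice", 300.0, 4.0),
        TurnTelemetry("refund", 900.0, 2.0),
    ]
    print(parallel_streams(turns))
```

The focus series here is a proxy at best: it records where the signals point, which is not guaranteed to be where the user's attention actually is.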
But the corpus plants two warnings that turn this from a yes into a 'careful.' First, beware conflating the model's *attention mechanism* with discourse attention. Transformer soft attention is structurally biased toward repeated and context-prominent tokens regardless of relevance; it over-weights what's loud, not what's salient to the human (see: Does transformer attention architecture inherently favor repeated content?). So if 'operationalize' quietly means 'let the architecture's attention do the work,' you get a systematic mismatch with the reader's actual focus. Second, and sharper: behavioral signals can detect the *shape* of a phenomenon without the phenomenon itself. Chalmers' behavioral test passes any system producing contextually appropriate output, but the thing it claims to measure requires conditions the behavior alone can't certify (see: Does behavioral speech output prove communicative subjecthood?). Telemetry that correlates with attention is not the same as telemetry that *is* attention; the gap between signal and substance is where the operationalization can quietly fail.
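The first warning can be made concrete with a toy calculation. The sketch below is an illustration under stated assumptions, not the analysis in the cited note: the token labels and raw scores are invented, and it shows only one simple way repetition can inflate weight, namely that softmax normalizes over positions, so a token occupying several positions can collect more total weight than a single higher-scoring token.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mass_by_token(tokens, scores):
    """Sum softmax weights over all positions holding the same token."""
    mass = {}
    for token, weight in zip(tokens, softmax(scores)):
        mass[token] = mass.get(token, 0.0) + weight
    return mass


if __name__ == "__main__":
    # Hypothetical context: one relevant token with the highest raw score,
    # and one less relevant token repeated three times at a lower score.
    tokens = ["relevant", "filler", "filler", "filler"]
    scores = [1.5, 1.0, 1.0, 1.0]
    print(attention_mass_by_token(tokens, scores))
```

In this invented setup the repeated token ends up with the larger share of total weight even though each of its positions scores lower than the relevant token.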
The genuinely uncomfortable finding sits underneath all this: the same multimodal substrate that enables helpful, well-timed responses also enables manipulative profiling (see: Can AI systems read cognitive state from interaction patterns alone?). And there's reason to think coordinated behavioral signals carry more than cognitive load. Linguistic style matching, for instance, measurably *increases* during deception, meaning interaction telemetry encodes relational and strategic state, not just attention (see: Do liars and listeners coordinate their language during deception?). So a system instrumenting your attention is, by construction, instrumenting more than your attention.
The synthesis: yes, multimodal telemetry can plausibly operationalize the attentional layer — it's the one discourse component with a natural behavioral footprint, where the other two (segments, intentions) are more internal. But 'can' hides three traps the corpus names precisely: don't substitute the transformer's attention for the human's, don't mistake correlation-with-attention for attention, and don't forget that the signal which reads focus also reads everything else.
Sources (6 notes)
Discourse processing demands parallel recognition of linguistic segments, intentional structure, and attentional salience—not sequential processing. These three layers constrain each other during comprehension, and failures in any single layer disrupt overall understanding.
Research shows AI systems can instrument multimodal behavioral signals (gaze, hesitation, speed) to read cognitive state during interaction, preserving flow by avoiding disruptive explicit probes. However, the same substrate enables both helpful timing and manipulative profiling.
Conversational DNA encodes four simultaneous dimensions—linguistic complexity, emotional trajectories, topic coherence, and conversational relevance—as temporal streams. The reverse Turing test finding showed expert assessments of AI diverged sharply, suggesting conversational structure shapes interpretation as much as content.
Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.
Chalmers' test passes any system producing contextually appropriate text, but communicative subjecthood requires relational-normative conditions like accountability and evaluative stance. The test is calibrated to the wrong phenomenon, creating false positives, like puppets that produce walk-shaped motion without walking.
Research shows interlocutors' linguistic styles correlate more during false communication than truthful communication, especially when the speaker is motivated to deceive. This coordination serves as a detectable deception signal through the listener's adaptive behavior, not just the liar's language.