The Method of Critical AI Studies, A Propaedeutic
We outline some common methodological issues in the field of critical AI studies, including a tendency to overestimate the explanatory power of individual samples (the benchmark casuistry), a dependency on theoretical frameworks derived from earlier conceptualizations of computation (the black box casuistry), and a preoccupation with a cause-and-effect model of algorithmic harm (the stack casuistry). In the face of these issues, we call for, and point towards, a future set of methodologies that might build on existing strengths in the humanistic close analysis of cultural objects.
How else, they might say, would one conceive of a benchmark of a system of unknown capabilities if not by means of probabilistic sampling under fixed external conditions? And indeed, actual AI benchmarks do exactly that, including the thematically (to us) appropriate BIG-bench (Beyond the Imitation Game benchmark) framework, which comprises over 200 tasks, each measured by running the task several times and noting the likelihood of outcomes (Srivastava et al., 2022). And yet, humanities critiques are dominated by n = 1 experiments, where a single input-output example serves to ‘discredit’ the entire probabilistic system. We argue that this is not exclusively the result of a lack of technical skills or resources, but of a misaligned application of roughly hermeneutically-inspired methods to probabilistic systems. In other words, humanities scholars may ironically find themselves caught red-handed at the scene of a crime they often accuse the AI of committing: misrecognizing everything as a nail by self-identifying as a hammer.
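The methodological contrast can be made concrete with a minimal sketch. The following toy is not any real benchmark or model; `toy_model` and its outcome probabilities are invented purely for illustration of the difference between a single-draw verdict and repeated sampling under fixed conditions:

```python
import random
from collections import Counter

def toy_model(prompt: str, rng: random.Random) -> str:
    """A stand-in for a stochastic generator: the same prompt yields
    different outputs with fixed (here: invented) probabilities."""
    return rng.choices(["unremarkable", "objectionable"], weights=[0.9, 0.1])[0]

rng = random.Random(0)

# The n = 1 approach: a single input-output pair decides the verdict,
# and may well land on the 10% outcome.
single = toy_model("some fixed prompt", rng)

# The benchmark approach: repeat the task under fixed conditions and
# record outcome frequencies, approximating the underlying distribution.
freqs = Counter(toy_model("some fixed prompt", rng) for _ in range(1000))
```

On any run, `freqs` will show the objectionable outcome at roughly its true rate of one in ten, whereas `single` alone licenses no conclusion about that rate at all.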
Where does the humanities’ fixation on the black box come from, then? As so often, it comes from a development that Vinsel (2021) has called ‘criti-hype’: the counterintuitive alignment of hype and critique, in which critics need worrisome developments to exist in order to critique them, while corporations use the same worries to present their products as more powerful than they actually are.
If AI systems are black boxes, the reasoning goes, then probing their inputs and outputs must be enough to arrive at a coherent model of their functioning. The result is a set of folk theories (Kempton, 1986) of artificial intelligence: theories which seem to work in practice (i.e., that seem to give evidence to a critical perspective) but are entirely disconnected from the technical reality of the objects they purport to describe.
One conceptual shift here, at the very least as a heuristic, might go via contending with ‘pipelines’ instead of stacks. A pipeline is many things, but in computation, referencing the fluid pipelines that carry water, oil, and gas, it connotes fluidity, flexibility, infrastructural transition, and material extraction (Dhaliwal, 2025). We suggest contending with pipelining not only because it is a conceptual framework better suited to the material, technical conditions of artificial intelligence operations today—data pipelines feed into, and are mixed during the execution cycles of, instruction pipelines, which may themselves be triggered by model pipelining, where several training datasets (and the weights generated through training) are brought to bear, along with several vector databases, on a multi-step or multi-agentic AI operation—but also because it would more closely match, while still retaining enough critical capacity, a paradigmatic material-semiotic condition deployed by computer scientists and AI experts themselves.
As with the benchmark casuistry described above, the commonly-used chat- or prompt-based interface contributes to the black-box casuistry. The linguistic interaction that forms the linchpin of these interfaces not only helps sustain some of the aforementioned narrativization in humanities critiques but also encourages this very blackboxing; the simple and largely empty chat window demonstrates no real depth, instead safely ensconcing the technical setup behind a plain void. Perhaps a cue can be taken here from the field of Human-Computer Interaction (HCI). As the name indicates, HCI has always been in the business of understanding humans as a part of the computational assemblage; it can be best understood, in this homological verve, as the traditionally “humanities” part of computer science. And when HCI as a field, in its internal critiques (Kapania et al., 2024; Ma et al., 2024; Suh et al., 2024), is clear about the significant limitations of natural language use in AI interfaces, it should, at the very least, give humanistic criticisms of chat-based interactions some pause. Interfaces, as Alexander Galloway reminds us, are not ‘windows’ into content, but effects produced by technology (Galloway, 2012). The chat-box seems to have effected, then, a perplexing condition of interpretive illusion for our disciplinary configuration.
First is an increased openness towards the technical disciplines, taking inspiration, perhaps, from earlier calls for a better integration of philosophical critique and technical development, such as in the work of Phil Agre during the previous artificial intelligence hype cycle (Agre, 1997). Technical benchmarks have progressed beyond the Turing test, as we have outlined above, and so should the methodology of critical AI studies. Second, a ‘humanities’ way of benchmarking would actually need to return to the roots of humanities critique in the close reading of complex systems with multiple dependencies within and outside of the system (which is, after all, what literary texts are).