Can Language Models Represent the Past without Anachronism?

Paper · arXiv 2505.00030 · Published April 28, 2025

We find that prompting a contemporary model with examples of period prose does not produce output consistent with period style. Fine-tuning produces results that are stylistically convincing enough to fool an automated judge, but human evaluators can still distinguish fine-tuned model outputs from authentic historical text. We tentatively conclude that pretraining on period prose may be required in order to reliably simulate historical perspectives for social research.

Anachronism, as we use the term here, is the risk that models trained in the present, however carefully prompted or tuned, will import contemporary knowledge, assumptions, styles, or norms into historical contexts where they ought to be absent.

If historical researchers had to pretrain a cutting-edge model from scratch every time they investigated a new social context, LLM-assisted research could demand an impractically large investment of computation and human labor. Machine unlearning is not likely to be easier to implement, especially in a setting like this one, where the content to be erased is not a single fact but an entire social context [17–24].

2.1 GPT-1914: A pretrained historical language model

We also generated continuations using GPT-1914, a small (774M-parameter) GPT-2-scale prototype trained only on books published between 1880 and 1914, supplied by the HathiTrust Research Center. The model was trained on 26.5B tokens, using code by Andrej Karpathy [25], on 8 A100 GPUs for 90 hours, at an imputed cost of $1,440. The books used to generate prompts were held out from the training corpus, to avoid any risk of memorization. Because this model is quite small, its continuations have less intellectual coherence than those generated by GPT-4o (see the example in figure 2). However, it proved much better than a contemporary language model at reproducing period style.
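As an illustration of how continuations can be sampled from a GPT-2-scale model, the sketch below loads a checkpoint with the Hugging Face transformers library. The checkpoint name gpt-1914, the prompt text, and the sampling parameters are all hypothetical; the paper's actual training and sampling code, built on Karpathy's GPT-2 reproduction [25], is not reproduced here.

```python
# Minimal sketch: sampling a continuation from a GPT-2-style checkpoint.
# "gpt-1914" is a hypothetical checkpoint name, not a published artifact.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # assumes the standard GPT-2 BPE vocabulary
model = GPT2LMHeadModel.from_pretrained("gpt-1914")     # hypothetical historical checkpoint

# Illustrative prompt; in the paper, prompts come from books held out
# of the training corpus, so fidelity cannot be explained by memorization.
prompt = "The question now before the public is one which"
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,      # sample rather than decode greedily
    top_p=0.95,          # illustrative sampling parameters
    temperature=0.8,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```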