A ripple in time: a discontinuity in American history
Abstract—In this technical note we suggest a novel approach to discovering temporal aspects (related and unrelated to language dilation) and personality aspects (authorship attribution) in historical datasets. We exemplify our approach on the State of the Union addresses given by the past 42 US presidents: this dataset is known for its relatively small size and for the high variability of the length and style of its texts. Nevertheless, we achieve about 95% accuracy on the authorship attribution task and pin down the date of writing to a single presidential term.
While it is widely believed that BERT (and its variants) is best suited for NLP classification tasks, we find that GPT-2, in conjunction with nonlinear dimension-reduction methods such as UMAP, provides stronger clustering, making GPT-2 + UMAP an interesting alternative. In our case, no model fine-tuning is required: the pre-trained, out-of-the-box GPT-2 model is enough.
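The note does not specify how the token-level GPT-2 hidden states are pooled into a single vector per document before UMAP is applied; a common choice is mean-pooling over the token axis. A minimal sketch under that assumption (the UMAP call is shown as a comment, since the exact projection settings are not given in the text):

```python
import numpy as np

def document_embedding(hidden_states):
    """Mean-pool token-level hidden states of shape (n_tokens, d_model)
    into a single fixed-size document vector of shape (d_model,)."""
    return hidden_states.mean(axis=0)

# With one such vector per address, a 2-D projection for clustering
# could then be obtained with, e.g.:
#   import umap
#   coords = umap.UMAP(n_components=2).fit_transform(np.stack(doc_vectors))
```

Any pooling that produces one fixed-size vector per document would work here; mean-pooling is merely the simplest baseline consistent with using the model "out of the box".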
We also experimented with determining which President wrote which State of the Union address. The simplest solution, based on pre-trained embeddings (just as above), reached an accuracy of around 80%: not terrible, but not very exciting either. As the next step, we therefore fine-tuned a HuggingFace DistilBERT model for the classification task.
Since the DistilBERT model can only handle 512 tokens (shorter than any of our addresses), we chunked each text with a sliding window of 512 tokens and an overlap of 128 tokens, treating each chunk as a training sample. At test time we did the same, but aggregated the scores over the chunks to obtain a prediction for the whole document.
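The sliding-window chunking described above can be sketched as follows; the function name and the plain-list representation of token ids are illustrative, and in practice the HuggingFace tokenizer's own `stride`/`return_overflowing_tokens` options can produce the same result:

```python
def chunk_token_ids(token_ids, window=512, overlap=128):
    """Split a token-id sequence into overlapping chunks: each chunk is
    `window` tokens long, and consecutive chunks share `overlap` tokens."""
    stride = window - overlap  # 512 - 128 = 384 new tokens per step
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the final (possibly shorter) chunk reaches the end
    return chunks
```

Each resulting chunk is then treated as an independent training sample labeled with the document's author.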
It should be noted that the per-chunk prediction rate was not great: somewhat south of 60%. Not too surprisingly, the model's wrong answers tend to involve temporally adjacent authors (Washington and Adams, or Carter and Ford), so presumably the Zeitgeist and language matter more than the actual politics.
However, instead of scoring individual chunks of text, a better way to capture a document's context and vernacular is to aggregate the logits output by the model over all chunks produced from that text, and only then apply the argmax. In this case the accuracy rises spectacularly, up to ≈ 93%–95%, depending on the number of fine-tuning epochs.
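The text says the logits are aggregated over chunks before the argmax but does not specify the aggregation; summing and averaging give the same argmax, so the sketch below assumes summation:

```python
import numpy as np

def predict_document(chunk_logits):
    """Aggregate per-chunk logits of shape (n_chunks, n_classes) by summing
    over chunks, then take the argmax as the document-level prediction.
    (The mean would yield an identical argmax, since it only rescales.)"""
    return int(np.argmax(chunk_logits.sum(axis=0)))
```

This pools evidence from every window of the address, so a few misclassified chunks no longer flip the document-level label.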