SMILE: Evaluation and Domain Adaptation for Social Media Language Understanding
Social media (SM) plays an increasingly important role in our lives. As of 2021, seven out of ten US adults use at least one social media platform like Facebook, Twitter, Instagram, or Pinterest [3]. That proportion likely underestimates SM use if we broaden the definition of SM to include user-generated content like restaurant reviews, news article comments, and forum discussions. The ever-growing trove of text produced by social media users is both a challenge and an opportunity for natural language processing (NLP). NLP models with a strong grasp of social media language could perform a variety of socially, economically, and politically important tasks. They could, for example, tackle automatic content moderation to improve the quality of online discourse, summarize restaurant reviews to simplify the decision-making process of hungry customers, and detect and stunt disinformation campaigns aimed at sowing societal instability.
However, modeling social media language is inherently challenging. Due to its informal, noisy, and fast-evolving nature, social media language on platforms such as Twitter [14] is different from the language found in books, news publications, and Wikipedia. Additionally, the tasks that organically arise from the social media domain (trend detection, emoji prediction, cyberbullying detection, online marketing, etc.) are qualitatively different from the tasks natural to the domain of standard written language (translation, entailment, grammar checking, etc.). Although the recipe of pretraining language models on massive conventional corpora has been successful in pushing the state-of-the-art of general language understanding [8, 10, 12, 16, 35, 41], it is unclear if this recipe’s success will transfer to the social media domain. The reason for this lack of clarity is that general language understanding benchmarks [7, 33, 42, 44, 45, 50, 51] include neither SM data nor tasks and therefore do not measure social media language understanding.
Towards a more comprehensive understanding of social media language across multiple platforms and multiple application scenarios, we propose a new benchmark and a recipe for training language models that accounts for the divergence between social media language and conventional language. Specifically, we consider social media language understanding in English and make the following contributions:
We conduct a time-aligned comparison between the vocabulary (token) distributions of (1) posts from Twitter and Reddit and (2) mC4 [53], a conventional text corpus used to pretrain language models in many existing works [47, 52, 53]. We observe a substantial difference between the two distributions and find that social media language changes twice as fast as conventional language (Section 3).
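To make the comparison concrete, divergence between two token distributions can be measured with the Jensen-Shannon divergence over unigram frequencies. The sketch below is illustrative only: the helper names (`unigram_dist`, `js_divergence`) and the toy text snippets are our assumptions, not the paper's actual measurement pipeline.

```python
from collections import Counter
from math import log2

def unigram_dist(tokens):
    """Normalize token counts into a probability distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two token distributions.

    Ranges from 0 (identical) to 1 (disjoint support)."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a, b):
        return sum(a[t] * log2(a[t] / b[t]) for t in vocab if a.get(t, 0.0) > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy time-aligned slices (stand-ins for tokenized Twitter/Reddit posts vs. mC4 text).
social = "lol this new meme slaps ngl lol".split()
conventional = "the committee approved the new policy on tuesday".split()

divergence = js_divergence(unigram_dist(social), unigram_dist(conventional))
```

In practice one would compute this over time-aligned monthly corpus slices, so that distribution shift is not confounded with topical drift over time.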
We compile a Social MedIa Language Evaluation (SMILE) benchmark that includes social media language data from four platforms (Twitter, Reddit, Yelp, and Civil Comments) across both classification and generation tasks organically arising from the social media domain. This newly compiled benchmark, coupled with an evaluation protocol, is a well-rounded toolkit for evaluating an LM's social media language understanding (Section 4).
We provide an effective recipe for training LMs for social media language understanding, backed by a large-scale empirical study conducted using the SMILE benchmark and a training regimen for T5-based architectures [41]. Our study suggests that by training a custom tokenizer and pretraining the model from scratch on a corpus of both social media and conventional language, we can improve performance by 4.2 points compared to a similarly-sized baseline model (Section 5). We carry out additional ablation studies in Section 6.
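The mixed-corpus pretraining idea above amounts to drawing each training example from either the social media stream or the conventional stream according to a mixing fraction. The following is a minimal sketch of that sampling step; the function name `mix_corpora`, the 50/50 mixing fraction, and the toy streams are illustrative assumptions, not the paper's implementation.

```python
import random

def mix_corpora(social_stream, conventional_stream, social_frac=0.5, seed=0):
    """Yield pretraining examples, drawing from the social media stream
    with probability `social_frac` and the conventional stream otherwise.

    Stops as soon as either stream is exhausted."""
    rng = random.Random(seed)
    social_it = iter(social_stream)
    conv_it = iter(conventional_stream)
    while True:
        source = social_it if rng.random() < social_frac else conv_it
        try:
            yield next(source)
        except StopIteration:
            return

# Toy streams standing in for tokenized Twitter/Reddit posts and mC4 documents.
social = [f"post {i}" for i in range(100)]
conventional = [f"doc {i}" for i in range(100)]
mixed = list(mix_corpora(social, conventional, social_frac=0.5))
```

The same mixed stream would also be the natural input for training the custom tokenizer, so that its vocabulary reflects both registers of language.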