A Little Human Data Goes A Long Way
Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. We investigate the use of synthetic data in Fact Verification (FV) and Question Answering (QA) by studying the effects of incrementally replacing human generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be reliably improved by including as few as 125 human generated data points. We show that matching the performance gain of just a little additional human data requires an order of magnitude more synthetic data, and we then estimate price ratios at which human annotation would be a more cost-effective solution. Our results suggest that even when human annotation at scale is infeasible, there is great value in having a small proportion of the dataset be human generated.
By incrementally replacing human generated data points with synthetic ones, we can compare the utility of synthetic data to the original human generated data points. Across multiple fine-tuning models, prompt models, and prompting strategies, we find that while increasing the proportion of synthetic data typically causes only minor degradations in model performance, a significant decline occurs at the extreme, i.e., when the percentage of synthetic data exceeds 90%. Focusing on the extremes, we show that purely synthetically trained FV and QA systems can be meaningfully improved by including as few as 125 human generated data points.
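A minimal sketch of this replacement protocol follows; the helper and the placeholder data pools are illustrative assumptions, not the paper's implementation, and only the 90%, 97.5%, and 100% settings are taken from the text.

```python
import random

# Placeholder pools; in practice these are a dataset's human examples and the
# model-generated synthetic examples for the same evidence texts.
human_examples = [{"evidence": f"e{i}", "claim": f"h{i}", "label": "SUPPORTED"} for i in range(1000)]
synthetic_examples = [{"evidence": f"e{i}", "claim": f"s{i}", "label": "REFUTED"} for i in range(1000)]

def mix_training_data(human_data, synthetic_data, synthetic_fraction, seed=0):
    """Swap a `synthetic_fraction` share of the human training set for synthetic
    points while keeping the total training set size fixed."""
    rng = random.Random(seed)
    n_total = len(human_data)
    n_synthetic = int(round(synthetic_fraction * n_total))
    mixed = (rng.sample(human_data, n_total - n_synthetic)
             + rng.sample(synthetic_data, n_synthetic))
    rng.shuffle(mixed)
    return mixed

# Sweep the synthetic proportion, with extra resolution near the extremes.
for frac in [0.0, 0.25, 0.5, 0.75, 0.9, 0.95, 0.975, 1.0]:
    train_set = mix_training_data(human_examples, synthetic_examples, frac)
    # ...fine-tune on train_set and evaluate on a held-out human test set...
```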
Specifically, we use Few-Shot In-Context Learning (Brown et al., 2020) to generate synthetic (claim, label) pairs from an input evidence text. The prompt model is given examples of (evidence text, claim, label) triples from the real training data, and is then queried with the evidence text we seek to generate data for (QA synthetic data is generated analogously).
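As a concrete illustration, the sketch below assembles such a few-shot prompt and parses the completion back into a (claim, label) pair. The prompt wording, the label set, and the generate() backend are assumptions made for illustration rather than the paper's exact setup.

```python
def build_fv_prompt(few_shot_examples, target_evidence):
    """Few-shot prompt: demonstrations of (evidence, claim, label) from the real
    training data, followed by the new evidence text to generate data for."""
    parts = ["Write a claim about the evidence and label it SUPPORTED or REFUTED."]
    for ex in few_shot_examples:
        parts.append(f"Evidence: {ex['evidence']}\nClaim: {ex['claim']}\nLabel: {ex['label']}")
    parts.append(f"Evidence: {target_evidence}\nClaim:")
    return "\n\n".join(parts)

def parse_claim_label(completion):
    """Split a completion of the form '<claim> Label: <label>' into its parts."""
    claim, _, label = completion.partition("Label:")
    return claim.strip(), label.strip()

# Usage with any text-completion backend exposing generate(prompt) -> str:
#   prompt = build_fv_prompt(sampled_train_triples, new_evidence_text)
#   claim, label = parse_claim_label(generate(prompt))
```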
Together, the datasets span a variety of domains (science, news, social media, reasoning, conversation, fiction).
Despite advances in synthetic data generation, human annotation yields more useful data.
We find a significant difference between the performance of models trained on 97.5% and 100% synthetic data. These trends hold robustly across choices of fine-tuning model (Mistral-7B), prompt model (GPT-4), prompting strategy (Chain-of-Thought), and data scale (Appendix A).
On WANLI (Figure 3), more than 17,000 additional synthetic points are needed to achieve the performance gains of 200 human points. If the price of a human generated point for WANLI is less than 73 times the price of a synthetic point, then an incremental amount of human annotation would be the more cost-effective solution.
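The break-even arithmetic behind such price ratios is straightforward; the sketch below uses the round numbers quoted above, which imply a ratio of at least 85 (the 73x figure above is therefore the more conservative threshold).

```python
def break_even_price_ratio(n_synthetic_needed, n_human):
    """Price ratio (human point / synthetic point) below which buying the human
    points is cheaper than buying the performance-matched synthetic set."""
    return n_synthetic_needed / n_human

# WANLI: more than 17,000 synthetic points are needed to match 200 human points,
# so human annotation is more cost-effective whenever a human point costs less
# than roughly this many times the price of a synthetic point.
ratio = break_even_price_ratio(17_000, 200)  # 85.0
```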
Rather than interpret these numbers literally, we take them to suggest that human data could have unique value in some settings, enabling performance levels that are impossible with purely synthetic datasets.
This suggests that synthetic data generation produces data points that are copied more directly from the evidence texts, while human annotators are more likely to rephrase or use vocabulary that differs from the evidence. Surprisingly, we find that synthetic data generation draws question and answer content from more diverse locations in the evidence, whereas human annotators are overwhelmingly more likely to create questions whose answers lie near the start of the evidence texts.
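Properties of this kind can be quantified with simple lexical statistics; the sketch below (whitespace tokenization, exact-match answer search) is one possible operationalization and not necessarily the paper's metric.

```python
def lexical_overlap(generated_text, evidence):
    """Fraction of generated tokens that appear verbatim in the evidence;
    higher values indicate text lifted more directly from the evidence."""
    gen = set(generated_text.lower().split())
    ev = set(evidence.lower().split())
    return len(gen & ev) / max(len(gen), 1)

def relative_answer_position(answer, evidence):
    """Start position of the answer span within the evidence, normalized to
    [0, 1]; values near 0 mean the answer comes from the start of the text."""
    idx = evidence.find(answer)
    return None if idx < 0 else idx / max(len(evidence) - len(answer), 1)
```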