Perturbation CheckLists for Evaluating NLG Evaluation Metrics
We see that, across tasks, the correlation for most pairs of criteria is moderate (between 0.3 and 0.5) to low (< 0.3). The highest correlation, 0.76, is observed between interestingness and enjoyability for dialogue generation. However, other criteria such as avoiding repetition, inquisitiveness, and making sense have low correlations with most of the other criteria. We make similar observations for the correlations between the criteria for other tasks. Even for IC, the correlation between the two criteria of thoroughness and correctness is 0.41. For MT, the commonly used criteria of fluency and adequacy were found to be highly correlated, with a Pearson correlation coefficient of 0.69 (Banchs et al., 2015). This justifies why WMT evaluations now ask humans to give only a single score indicating overall quality. However, given the low to moderate correlations between criteria for other tasks,