X-EVAL: Multi-aspect Text Evaluation
Abstract
Natural Language Generation (NLG) typically involves evaluating the generated text in various
Figure 2: Illustration of our X-EVAL framework. The left section depicts our two-stage training approach: vanilla instruction tuning on diverse tasks and subsequent training on instruction tasks enriched with auxiliary aspects. The right section illustrates the inference pipeline with auxiliary aspects.
that there is no overlap among the datasets used for training and inference. We consider two aspects that have identical aspect names but belong to different NLG tasks as distinct aspects. We include more details about the source datasets, constructed instruction-tuning tasks, and the number of instances of each task in Appendix A.2.

4 X-EVAL

4.1 Two-Stage Instruction Tuning

Figure 2 presents an overview of X-EVAL, which consists of two stages of instruction tuning:

Vanilla Instruction Tuning  The first training stage aims to equip the model with the ability to follow instructions to perform diverse evaluation tasks. We adopt Flan-T5 (Chung et al., 2022), an open-source language model, as the base model for our evaluator. Based on Flan-T5, we further perform standard instruction tuning on a mixture of four types of tasks: scoring, comparison, ranking, and Boolean QA, as elaborated in Section 3.2.

Instruction Tuning with Auxiliary Aspects  Through our study, we discern that certain evaluation aspects can be interrelated. For instance, in dialogue evaluation (Gopalakrishnan et al., 2019), the aspect naturalness usually shows a notable correlation with engagingness: when a dialogue response is not natural, human raters are very likely to consider it not engaging. While these two aspects are not interchangeable given their different definitions, the evaluation of one aspect can offer useful clues for the evaluation of another, potentially related aspect. Motivated by this, we enrich our training regimen with an additional instruction-tuning stage to leverage potential connections to the target evaluation aspect.

More precisely, for each instruction-tuning task detailed in Section 3.2, we augment it based on the ground-truth evaluation results of a predefined set of auxiliary aspects, namely all other aspects collected in the source dataset. To convert the evaluation results of auxiliary aspects into natural language that can be fed into the input, we employ a template-based verbalizer, denoted as v(·), which takes in an aspect a and its evaluation score s for an instance and maps them into a verbalized evaluation h = v(s, a). For example, with the aspect consistency on Data2Text and the evaluation score 0.9 out of 1.0, the verbalized result is phrased as "This sentence is consistent with the source." (see more details in Appendix B). We construct the
set of verbalized results H with the verbalizer for each auxiliary aspect (except for the target aspect). This set H is then concatenated into the additional set of texts in the evaluator's input. The model then undergoes the second training stage on the instruction tasks enriched with these evaluation results.

4.2 Inference with Auxiliary Aspects

At the inference stage, we perform the following steps to evaluate the text on the target aspect. First, we select a set of auxiliary aspects for the target aspect: based on the definitions of the target aspect and a pool of candidate aspects, we employ Sentence-T5 (Ni et al., 2022) to encode the definitions and measure the similarity between the sentence embedding of the target aspect definition and that of each candidate aspect definition. We select the aspects with the top-k similarity scores as the auxiliary aspects to limit inference cost, where k is a hyperparameter. Second, we run an inference process using the Boolean QA task format, where the model predicts either "Yes" or "No", as outlined in Section 3.2, on each auxiliary aspect. We convert the predictions into natural-language results with the verbalizer. These verbalized results, denoted as H, are subsequently integrated into the additional set of texts S for evaluating the target aspect. Finally, given the input enhanced by auxiliary aspects, we adopt the same Boolean QA format to compute the evaluation score c for the target aspect:

    c = P("Yes" | x, S, a) / (P("Yes" | x, S, a) + P("No" | x, S, a))

where P(·) denotes the probability of the model generating a specific word. The pseudo-code of our inference pipeline is in Algorithm 1.

Algorithm 1: Inference Pipeline
Input: Set of evaluation aspects A, target aspect at, NLG system's output x, additional set of texts S, scoring function f(·), evaluation verbalizer v(·), similarity measure sim(·), sentence encoder E
Output: Target score ct
    // Determine top-k auxiliary aspects
1   L ← {(sim(E(a), E(at)), a) | a ∈ A \ {at}}
2   Sort L in descending order of similarity
3   AR ← first k aspects from sorted L
    // Generate verbalized evaluation results for auxiliary aspects
4   Initialize an empty auxiliary evaluation set H
5   for ar ∈ AR do
6       cr ← f(x, Sr, ar)          // Score for auxiliary aspect
7       H ← [H; v(cr, ar)]         // Add verbalized evaluation to the auxiliary evaluation set
8   St ← [St; H]
    // Evaluate the target aspect
9   ct ← f(x, St, at)
10  return ct

5 Experiment Setup

Meta Evaluation  We meta-evaluate our X-EVAL on the test split of ASPECTINSTRUCT, whose details are introduced as follows. For text summarization, we adopt SummEval (Fabbri et al., 2021) and QAGS (Wang et al., 2020b). For dialogue generation, we employ Topical-Chat (Gopalakrishnan et al., 2019) and FED (Mehri and Eskenazi, 2020a). For data-to-text generation, we utilize SFHOT & SFRES (Wen et al., 2015). ASPECTINSTRUCT contains the following unseen aspects: topic depth (DEP), likeability (LIK), understandability (UND), flexibility (FLE), informativeness (INF), inquisitiveness (INQ), interestingness (INT), specificity (SPE), correctness (COR), and semantic appropriateness (SEM). More detailed descriptions of the test splits, as well as the seen and unseen evaluation aspects, can be found in Appendix A.4.

Implementation Details  We adopt Flan-T5-large (with ~780M parameters) as our base language model for subsequent finetuning. Unless otherwise specified, we pick the top-1 aspect during inference, i.e., k = 1. More implementation details can be found in Appendix C.

Baselines  We compare our X-EVAL with the following state-of-the-art NLG evaluation metrics: (1) UniEval (Zhong et al., 2022) is a unified multi-aspect evaluator that reframes the evaluation process as a Boolean QA task; (2) GPTScore (Fu et al., 2023) is a multi-faceted and training-free evaluation framework that utilizes the output probabilities from LLMs to score generated texts; (3) G-Eval (Liu et al., 2023) leverages large language models such as GPT-3.5 or GPT-4 to assess text quality with a form-filling paradigm in a training-free manner; (4) ROUGE-L (Lin, 2004); (5) DynaEval (Zhang et al., 2021); (6) BERTScore (Zhang* et al., 2020); (7) MoverScore (Zhao et al., 2019); (8) USR (Mehri and Eskenazi, 2020b); (9) BARTScore (Yuan et al., 2021). We include more details of baselines (4)-(9) in Appendix C due to the space limit.
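The inference pipeline of Algorithm 1 (aspect selection by definition similarity, Boolean QA scoring of auxiliary aspects, verbalization, then target scoring) can be sketched in plain Python. The encoder, scorer, and verbalization templates below are stand-in stubs, not the paper's actual Flan-T5 / Sentence-T5 components; in the paper, `score` would be derived from P("Yes") / (P("Yes") + P("No")) under a Boolean QA prompt.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def verbalize(score, aspect, threshold=0.5):
    """Template-based verbalizer h = v(s, a); the wording is illustrative."""
    polarity = "good" if score >= threshold else "poor"
    return f"The output is {polarity} in terms of {aspect}."


def evaluate(x, S, target, definitions, encode, score, k=1):
    """Algorithm 1 sketch.

    definitions: aspect -> definition string; encode: definition -> embedding;
    score(x, texts, aspect) -> evaluation score c in [0, 1].
    """
    # Steps 1-3: rank candidate aspects by definition similarity, keep top-k.
    t_emb = encode(definitions[target])
    ranked = sorted(
        ((cosine(encode(d), t_emb), a) for a, d in definitions.items() if a != target),
        reverse=True,
    )
    auxiliary = [a for _, a in ranked[:k]]
    # Steps 4-7: score each auxiliary aspect and verbalize the result.
    H = [verbalize(score(x, S, a), a) for a in auxiliary]
    # Steps 8-9: evaluate the target aspect on the input enriched with H.
    return score(x, S + H, target)
```

Using a dummy encoder and scorer, `evaluate("a reply", ["dialogue history"], "naturalness", definitions, encode, score, k=1)` returns the target-aspect score after the additional texts have been enriched with one verbalized auxiliary evaluation.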
                                            ------- Dialogue-level -------                 -------- Turn-level --------
Metrics                                     DEP    LIK    UND    FLE    INF    INQ    AVG  INT    SPE    COR    SEM    UND    AVG
BARTScore (Yuan et al., 2021)               0.082  0.099  -0.115 0.093  0.092  0.062  0.052  0.159  0.083  0.076  0.100  0.120  0.128
DynaEval (Zhang et al., 2021)               0.498  0.416  0.365  0.383  0.426  0.410  0.416  0.327  0.346  0.242  0.202  0.200  0.263
UniEval (Zhong et al., 2022)                0.046  0.009  -0.024 -0.003 -0.070 0.085  0.030  0.435  0.381  0.125  0.051  0.082  0.215
GPTScore (GPT-3-d01) (Fu et al., 2023)      0.669  0.634  0.524  0.515  0.602  0.503  0.574  0.501  0.214  0.434  0.444  0.365  0.392
GPTScore (GPT-3-d03) (Fu et al., 2023)      0.341  0.184  0.196  0.072  0.317  -0.101 0.168  0.224  0.151  0.428  0.405  0.311  0.304
G-Eval (GPT-3.5)† (Liu et al., 2023)        0.339  0.392  0.123  0.344  0.232  0.101  0.259  0.300  0.280  0.430  0.390  0.274  0.335
G-Eval (GPT-4)† (Liu et al., 2023)          0.583  0.614  0.602  0.587  0.510  0.551  0.573  0.506  0.368  0.522  0.443  0.438  0.455
X-EVAL (Ours)                               0.583  0.436  0.588  0.324  0.480  0.497  0.485  0.421  0.370  0.492  0.376  0.332  0.398
- w/o Training                              0.377  0.387  0.394  0.424  0.370  0.417  0.395  0.250  0.175  0.296  0.289  0.225  0.247
- w/o Instructions                          0.350  0.333  0.495  0.355  0.425  0.435  0.399  0.477  0.353  0.203  0.255  0.211  0.300
- w/o Stage-Two Tuning                      0.388  0.324  0.555  0.384  0.582  0.437  0.445  0.372  0.282  0.418  0.329  0.311  0.342

Table 1: Meta-evaluation on dialogue based on unseen aspects in terms of dialogue-level and turn-level Spearman (ρ) correlations on FED. The best overall results are highlighted in bold. We also highlight the best results excluding GPT-based metrics with underline. †: our re-implementation, where we adopt our annotated instructions and aspect definitions as inputs to OpenAI's API to obtain the performance of G-Eval on FED.
Table 2: Turn-level Pearson (r) and Spearman (ρ) correlations on seen aspects on Topical-Chat. The best overall
results are highlighted in bold. We also highlight the best results excluding GPT-based metrics with underline.
Variants of X-EVAL  We design several variants of X-EVAL for ablation studies: (1) X-EVAL w/o Training denotes the original Flan-T5 (without any further finetuning on our proposed ASPECTINSTRUCT); (2) X-EVAL w/o Instructions: based on Flan-T5, we only conduct prompt-based multi-task training and inference in the same way as Zhong et al. (2022), where we ask the model to answer Boolean questions without using aspect definitions; (3) X-EVAL w/o Stage-Two Tuning: for this variant, we only conduct vanilla instruction tuning in Stage 1 based on Flan-T5; during inference, we directly perform evaluation based on the instructions without using auxiliary aspects.

6 Main Results

We report the main results of dialogue evaluation in Table 1 and Table 2, summarization in Table 3 and Table 9, and data-to-text in Table 4. Each table is divided into three sections: the top section delineates the performance of traditional metrics and evaluators based on lightweight language models; the middle section shows the performance of evaluators based on GPTs (Brown et al., 2020; OpenAI, 2023), which are proprietary and much larger than our approach; and the bottom section shows the performance of X-EVAL and its variants.

Results of Dialogue Evaluation on FED  To assess X-EVAL's ability to generalize to unseen aspects, we present the Spearman correlation on FED in Table 1. X-EVAL surpasses all baselines in the top section. X-EVAL also matches the performance of GPT-based baselines with far fewer parameters. The bottom section of the table highlights the improvement achieved by two-stage tuning, incorporating instructions, and integrating auxiliary aspects. It is worth noting that UniEval achieves notably poor performance on dialogue-level evaluation on FED, which is probably because UniEval overfits to turn-level evaluation and fails to generalize to dialogue-level evaluation.
                                         Coherence      Consistency    Fluency        Relevance      AVG
Metrics                                  ρ      τ       ρ      τ       ρ      τ       ρ      τ       ρ      τ
ROUGE-L (Lin, 2004)                      0.128  0.099   0.115  0.092   0.105  0.084   0.311  0.237   0.165  0.128
MoverScore (Zhao et al., 2019)           0.159  0.118   0.157  0.127   0.129  0.105   0.318  0.244   0.191  0.148
BERTScore (Zhang* et al., 2020)          0.284  0.211   0.110  0.090   0.193  0.158   0.312  0.243   0.225  0.175
BARTScore (Yuan et al., 2021)            0.448  0.342   0.382  0.315   0.356  0.292   0.356  0.273   0.385  0.305
UniEval (Zhong et al., 2022)             0.495  0.374   0.435  0.365   0.419  0.346   0.424  0.327   0.443  0.353
GPTScore (Fu et al., 2023)               0.434  –       0.449  –       0.403  –       0.381  –       0.417  –
G-Eval (GPT-3.5) (Liu et al., 2023)      0.440  0.335   0.386  0.318   0.424  0.347   0.385  0.293   0.401  0.320
G-Eval (GPT-4) (Liu et al., 2023)        0.582  0.457   0.507  0.425   0.455  0.378   0.547  0.433   0.514  0.418
X-EVAL (Ours)                            0.530  0.382   0.428  0.340   0.461  0.365   0.500  0.361   0.480  0.362
- w/o Training                           0.187  0.131   0.193  0.152   0.135  0.104   0.444  0.325   0.240  0.178
- w/o Instructions                       0.458  0.333   0.414  0.328   0.395  0.309   0.496  0.359   0.441  0.333
- w/o Stage-Two Tuning                   0.536  0.385   0.413  0.326   0.455  0.360   0.503  0.363   0.476  0.359

Table 3: Summary-level Spearman (ρ) and Kendall-Tau (τ) correlations of different metrics on SummEval. All aspects are seen aspects. The best overall results are highlighted in bold. We also highlight the best results excluding GPT-based metrics with underline.
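The meta-evaluation above reports rank correlations against human judgments; Spearman's ρ is the Pearson correlation of the two rank vectors, with ties assigned average ranks. A minimal dependency-free sketch (not the evaluation code used in the paper):

```python
def rankdata(xs):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 0-based positions i..j, shifted to 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)
```

Because only ranks matter, any monotonic relationship between predicted scores and human ratings yields ρ = 1, which is why this statistic suits evaluators whose raw score scales differ.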
evaluation dataset. In general, X-EVAL trained on the combination of all forms of evaluation tasks, including scoring, comparison, and ranking, achieves the highest averaged correlation for nearly all tasks.

Effect of the Scale of Language Model Backbones  We adopt the same training and inference pipelines for backbones of different scales to show the effect of model size and to justify the use of Flan-T5-large. Specifically, we additionally experiment with Flan-T5-small (80M), Flan-T5-base (250M), and Flan-T5-xl (3B) as the backbone models, and name the resulting X-EVAL variants accordingly. The results are shown in Figure 3: in general, the evaluators' performance consistently increases with model size. However, when we upgrade the backbone from Flan-T5-large to Flan-T5-xl, the improvement becomes less significant. Given the trade-off between efficiency and performance, we select Flan-T5-large as the default backbone model of X-EVAL in our experiments. We include a more detailed analysis of the effect of language model backbones in Appendix C.

Error Propagation from Auxiliary Aspects during Inference  During inference, X-EVAL may predict inaccurate evaluations for auxiliary aspects. To investigate their impact, we tailor several baselines: (1) directly applying the model after two-stage tuning to evaluate without auxiliary aspects; (2) using the ground-truth ("GT") evaluation results instead of the predicted results for auxiliary aspects (upper bound); and (3) using random evaluation results for auxiliary aspects (lower bound). From Table 6, removing auxiliary aspects makes the overall performance drop. The variant with GT results gains improvement on all aspects, which indicates that errors in the evaluation of auxiliary aspects do impact the performance on target aspects, though not to a large degree. Using random results, on the other hand, deteriorates performance significantly.

Metrics                                      NAT    COH    ENG    GRO    AVG
X-EVAL                                       0.478  0.622  0.593  0.728  0.605
- Inference w/o Auxiliary Aspects            0.462  0.641  0.577  0.723  0.600
- w/ GT RAA (Upper bound)                    0.552  0.651  0.703  0.751  0.664
- w/ Random RAA (Lower bound)                0.468  0.601  0.561  0.628  0.564

Table 6: Analysis of error propagation in auxiliary aspects on Topical-Chat in terms of Spearman correlation. We highlight the best results in bold and the best results without using ground truths with underline. "RAA" denotes the evaluation Results on Auxiliary Aspects.

Effect of Hyperparameter k  We examine the choice of k when selecting the top-k auxiliary aspects during inference. Table 7 shows that inference with the top-1 auxiliary aspect generally achieves better correlation. We speculate that this may stem from error propagation during inference on auxiliary aspects: using more auxiliary aspects potentially introduces more inaccuracies, offsetting their potential performance benefits.

Qualitative Correlation Analysis on Instruction Tuning  To further investigate the effect of instruction tuning, in Figure 4 we visualize the correlation of our X-EVAL and Flan-T5 (i.e., "X-EVAL w/o Training") based on naturalness on Topical-Chat and consistency on SummEval. The red lines are linear regression fits that show how well the predicted scores correlate linearly with human judgments. Before instruction tuning, the predicted scores are more uniformly distributed regardless of the ground-truth scores, which results in poor correlation.
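The red regression lines in this analysis are ordinary least-squares fits of predicted scores against human judgments. A minimal, dependency-free sketch (not the authors' plotting code):

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y ≈ a*x + b; returns the pair (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)          # variance term of x
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # covariance term
    a = sxy / sxx
    return a, my - a * mx
```

A near-zero slope corresponds to the "uniformly distributed regardless of ground-truth scores" behavior described for the untuned model, while a steep positive slope indicates strong linear agreement with human ratings.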
[Figure 4: Dial-Naturalness (X-Eval); Dial-Naturalness (Flan-T5); Similarity of Aspect Definition]
Lishan Huang, Zheng Ye, Jinghui Qin, Liang Lin, and Xiaodan Liang. 2020. GRADE: Automatic graph-enhanced coherence metric for evaluating open-domain dialogue systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pages 9230–9240. Association for Computational Linguistics.

Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, and Wenhu Chen. 2023. TigerScore: Towards building explainable metric for all text generation tasks. CoRR, abs/2310.00752.

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. 2024. Prometheus: Inducing evaluation capability in language models. In The Twelfth International Conference on Learning Representations.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Minqian Liu, Shiyu Chang, and Lifu Huang. 2022. Incremental prompting: Episodic memory prompt for lifelong event detection. In Proceedings of the 29th International Conference on Computational Linguistics, pages 2157–2165, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Minqian Liu and Lifu Huang. 2023. Teamwork is not always good: An empirical study of classifier drift in class-incremental information extraction. In Findings of the Association for Computational Linguistics: ACL 2023, pages 2241–2257, Toronto, Canada. Association for Computational Linguistics.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. CoRR, abs/2303.16634.

Andrea Madotto, Zhaojiang Lin, Zhenpeng Zhou, Seungwhan Moon, Paul Crook, Bing Liu, Zhou Yu, Eunjoon Cho, Pascale Fung, and Zhiguang Wang. 2021. Continual learning in task-oriented dialogue systems. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7452–7467, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Shikib Mehri and Maxine Eskenazi. 2020b. USR: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707.

Shuhaib Mehri and Vered Shwartz. 2023. Automatic evaluation of generative models with instruction tuning. CoRR, abs/2310.20072.

Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang. 2022. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1864–1874.

OpenAI. 2023. GPT-4 technical report. ArXiv, abs/2303.08774.

Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, and Kewei Tu. 2020. Towards holistic and automatic evaluation of open-domain dialogue generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3619–3629, Online. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Jingyuan Qi, Zhiyang Xu, Ying Shen, Minqian Liu, Di Jin, Qifan Wang, and Lifu Huang. 2023. The art of SOCRATIC QUESTIONING: Zero-shot multimodal reasoning with recursive thinking and self-questioning. CoRR, abs/2305.14999.

Ananya B. Sai, Akash Kumar Mohankumar, Siddhartha Arora, and Mitesh M. Khapra. 2020. Improving dialog evaluation with a multi-reference adversarial dataset and large scale pretraining. Transactions of the Association for Computational Linguistics, 8:810–827.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.

Michael Völske, Martin Potthast, Shahbaz Syed, and Benno Stein. 2017. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, Copenhagen, Denmark. Association for Computational Linguistics.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020a. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pages 5008–5020. Association for Computational Linguistics.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020b. Asking and answering questions to evaluate the factual consistency of summaries. arXiv preprint arXiv:2004.04228.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022b. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.

Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Yang Wang, and Lei Li. 2023a. INSTRUCTSCORE: Towards explainable text generation evaluation with automatic feedback. CoRR, abs/2305.14282.

Zhiyang Xu, Ying Shen, and Lifu Huang. 2023b. MultiInstruct: Improving multi-modal zero-shot learning via instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11445–11465, Toronto, Canada. Association for Computational Linguistics.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate problem solving with large language models. CoRR, abs/2305.10601.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277.

Chen Zhang, Yiming Chen, Luis Fernando D'Haro, Yan Zhang, Thomas Friedrichs, Grandee Lee, and Haizhou Li. 2021. DynaEval: Unifying turn and dialogue level evaluation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5676–5689, Online. Association for Computational Linguistics.

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578.

Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022. Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

A More Details on ASPECTINSTRUCT

A.1 Annotation Protocol of Instructions

We depict the annotation process for the instructions in ASPECTINSTRUCT as follows. To curate the definition for each aspect, we first refer to the definition of the aspect in the original annotation guideline. When a definition is absent from the guideline, three human annotators (graduate students studying in computational linguistics or natural language processing areas) construct and revise
Original Annotation
Input:
  Context: "Are you looking for an apartment?", "Yes, I'm interested in a one-bedroom apartment."
  Response: "I think I have just a right one for you."
Human Rating: 0.67 (out of 1)

Scoring
Input: Dialogue relevance is to determine whether the response is relevant to the context. In this task, you are given a dialogue context and a response. Your task is to determine whether the response is relevant to the context. A score of 0 indicates the response is not relevant. A score of 1 indicates the response is relevant. Predict 0 or 1 as your choice.
  Context: "Are you looking for an apartment?", "Yes, I'm interested in a one-bedroom apartment."
  Response: "I think I have just a right one for you."
Output: 1

Boolean QA
Input: Dialogue relevance is to determine whether the response is relevant to the context. In this task, you need to evaluate the quality of the dialogue response based on relevance by answering 'Yes' or 'No' to the following question. Question: Is this response relevant to the given dialogue history?
  Context: "Are you looking for an apartment?", "Yes, I'm interested in a one-bedroom apartment."
  Response: "I think I have just a right one for you."
Output: Yes

Comparison
Input: Dialogue relevance is to determine whether the response is relevant to the context. In this task, you are given a dialogue context and two candidate responses. Your task is to determine which response is more relevant to the context.
  Context: "Are you looking for an apartment?", "Yes, I'm interested in a one-bedroom apartment."
  Response 1: "I think I have just a right one for you."
  Response 2: "Because my friend wants to meet you".
Output: Response 1

Ranking
Input: Dialogue relevance is to determine whether the response is relevant to the context. In this task, you are given a dialogue context and three candidate responses. Your task is to give a ranking for the three responses from the best quality to the worst.
  Context: "Are you looking for an apartment?", "Yes, I'm interested in a one-bedroom apartment."
  Response 1: "I think I have just a right apartment for you.", Response 2: "Because my friend wants to meet you", Response 3: "There is good news."
Output: Response 1 > Response 3 > Response 2

Figure 6: An illustrative example of augmented instruction-tuning tasks from the original annotation. The definition of the aspect is highlighted in purple. The annotated task instructions and the constructed output labels are highlighted in the corresponding colors for each task.
the definition until they reach an agreement. The task descriptions and evaluation protocols are also written by three human annotators under similar annotation protocols.

A.2 Augmenting Instruction-tuning Tasks

We show the seen aspects, their corresponding source datasets from which we collect the training data, the constructed tasks, and the number of training instances for each task in Table 11 and Table 12. When counting the number of aspects, we treat aspects with the same name but in different NLG tasks as different aspects. For example, the naturalness aspect in dialogue evaluation and in data-to-text evaluation is considered different under these two settings, although the aspect name is the same. More specifically, in our ASPECTINSTRUCT dataset, understandability is counted twice, for dialogue-level and turn-level dialogue evaluation; naturalness is counted twice, for turn-level dialogue evaluation and data-to-text evaluation; and informativeness is counted twice, for dialogue-level dialogue evaluation and data-to-text evaluation. We also include an example of how we augment instruction-tuning tasks from the original annotation in Figure 6.

Metrics                              CNN     XSUM    AVG
ROUGE-L (Lin, 2004)                  0.324   -0.011  0.156
BERTScore (Zhang* et al., 2020)      0.505   0.008   0.256
MOVERScore (Zhao et al., 2019)       0.347   0.044   0.195
BARTScore (Yuan et al., 2021)        0.680   0.159   0.420
UniEval (Zhong et al., 2022)         0.662   0.488   0.575
GPTScore (Fu et al., 2023)           0.649   0.238   0.443
G-Eval (GPT-3.5) (Liu et al., 2023)  0.516   0.406   0.461
G-Eval (GPT-4) (Liu et al., 2023)    0.685   0.537   0.611
X-EVAL (Ours)                        0.656   0.500   0.578

Table 9: Spearman correlation on the summarization task based on the consistency aspect on QAGS. The best results are highlighted in bold. We also highlight the best results among lightweight (with <7B parameters) and open-source metrics with underline.

A.3 Aspect Definition

We present the annotated definitions in ASPECTINSTRUCT in the following. We show the definitions of seen aspects for dialogue evaluation in Table 13, unseen aspects for dialogue evaluation in Table 14, and the aspects for summarization in Table 15.
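The augmentation illustrated in Figure 6 can be sketched as a function that derives the four task formats from one annotated example. The instruction wording below is abbreviated, and the 0.5 threshold for binarizing the human rating is an assumption for illustration, not the paper's exact rule:

```python
def build_tasks(definition, context, responses, ratings):
    """Derive scoring, Boolean QA, comparison, and ranking instances from a
    rated example. `responses` and `ratings` are parallel lists; the first
    response is the originally annotated one."""
    preamble = f"{definition} Context: {context} "
    listing = " ".join(f"Response {i + 1}: {r}" for i, r in enumerate(responses))
    order = sorted(range(len(responses)), key=lambda i: -ratings[i])
    label = ratings[0] >= 0.5  # binarize the annotated response's rating
    return {
        "scoring": {
            "input": preamble + f"Predict 0 or 1. Response: {responses[0]}",
            "output": "1" if label else "0",
        },
        "boolean_qa": {
            "input": preamble + f"Answer 'Yes' or 'No'. Response: {responses[0]}",
            "output": "Yes" if label else "No",
        },
        "comparison": {
            "input": preamble + "Which response is more relevant? " + listing,
            "output": f"Response {order[0] + 1}",
        },
        "ranking": {
            "input": preamble + "Rank the responses from best to worst. " + listing,
            "output": " > ".join(f"Response {i + 1}" for i in order),
        },
    }
```

With the Figure 6 example (ratings 0.67, 0.10, 0.30 for the three candidate responses), this yields the labels "1", "Yes", "Response 1", and "Response 1 > Response 3 > Response 2".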
four evaluation dimensions: fluency, coherence, consistency, and relevance.

QAGS (Wang et al., 2020b) is a benchmark for identifying and evaluating hallucinations in the summarization task. It aims to measure the factual inconsistencies of generated summaries.

Topical-Chat (Gopalakrishnan et al., 2019) is a knowledge-grounded human-human conversation dataset. Following Zhong et al. (2022), we utilize the human ratings collected by Mehri and Eskenazi (2020b) for Topical-Chat as the benchmark for evaluating dialog response generation. The assessment considers five aspects: naturalness, coherence, engagingness, groundedness, and understandability.

FED (Mehri and Eskenazi, 2020a) is an evaluation benchmark for fine-grained dialog evaluation. It comprises human annotations evaluated across eighteen dialog aspects at both the turn level and the dialog level.

SFHOT & SFRES (Wen et al., 2015) are evaluation benchmarks for the data-to-text task. They provide information about restaurants and hotels in San Francisco. The generated text is evaluated based on two aspects: informativeness and naturalness.

B More Details on X-EVAL

Pseudo-code of Inference Pipeline We provide the pseudo-code of our proposed inference pipeline for X-EVAL in Algorithm 1.

More Details on Verbalizer v and its Templates We design a template-based verbalizer to convert the evaluation results of auxiliary aspects into natural-language evaluations that can be integrated into the instructions. More formally, the inputs of the verbalizer v are an aspect a and an evaluation score s (in the range of 0-1). We first apply a threshold δ (we set δ = 0.5 throughout all experiments) to obtain a binary label indicating whether the quality is "positive" (if s > δ) or "negative" (if s ≤ δ). Given this label and the aspect a, we map the result into a natural-language template accordingly. The verbalized result is then integrated into the instructions. We construct the templates for each aspect by deriving them from the aspect definition. We apply an annotation protocol in which three human annotators revise the templates together until they reach a consensus. We show the verbalized templates in Table 16 for dialogue evaluation and Table 17 for summarization evaluation.

C More Details on Experiments

More Implementation Details We use the checkpoint released on HuggingFace for Flan-T5-large². In the first training stage, we set the number of epochs to 2, the learning rate to 5e-05, and the maximum source length to 1024. The second training stage shares the same setup except that the number of epochs is set to 1. We set the maximum source length during inference to 2048 and pick the top-1 aspect during inference, i.e., k = 1. We use sentence-T5-large³ to compute the embeddings of aspect definitions for auxiliary aspect selection. All experiments, including both training and inference, are conducted on NVIDIA A40 GPUs.

More Details on Baselines We include more details for the following baselines that are omitted in the main paper due to the page limit: (4) ROUGE-L (Lin, 2004) counts the overlap (i.e., the longest common subsequence) between the text to be evaluated and the reference to indicate text quality; (5) DynaEval (Zhang et al., 2021) adopts a graph convolutional network to model the dialogue's structure to facilitate evaluation; (6) BERTScore (Zhang* et al., 2020) is a similarity-based evaluator. It uses the contextualized representations from BERT (Devlin et al., 2019) to compute the similarity between the generated text and the reference; (7) MoverScore (Zhao et al., 2019) goes beyond BERTScore by utilizing soft alignments and new aggregation methods on the layer-wise information; (8) USR (Mehri and Eskenazi, 2020b) is an unsupervised and reference-free evaluation metric to measure multiple desirable qualities of dialog; (9) BARTScore (Yuan et al., 2021) is a unified evaluator based on BART (Lewis et al., 2019), which uses the average likelihood of the model output as the metric. Note that for all single-aspect metrics, we compute the correlation between the single predicted evaluation and the human rating of each fine-grained aspect, respectively.

More Results on the Effect of the Scale of Language Model Backbones We further conducted an experiment on using another language model

² [Link] flan-t5-large
³ [Link] sentence-transformers/sentence-t5-large
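The verbalizer v described in Appendix B above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function and dictionary names are hypothetical, and the template strings are the Table 16 entries for two aspects.

```python
# Sketch of the verbalizer v: map an auxiliary aspect a and a
# normalized score s in [0, 1] to one of two human-written templates,
# using the paper's threshold delta = 0.5.

DELTA = 0.5

# A small excerpt of the Table 16 templates (dialogue evaluation).
TEMPLATES = {
    "naturalness": {
        "neg": "The response is unnatural.",
        "pos": "The response is natural.",
    },
    "engagingness": {
        "neg": "The response is generic and dull.",
        "pos": "The response is interesting or presents an interesting fact.",
    },
}

def verbalize(aspect: str, score: float, delta: float = DELTA) -> str:
    """Convert an auxiliary-aspect score into a natural-language statement."""
    # s > delta is "positive"; s <= delta is "negative", as in the paper.
    label = "pos" if score > delta else "neg"
    return TEMPLATES[aspect][label]
```

For example, `verbalize("naturalness", 0.8)` returns "The response is natural.", which can then be spliced into the instruction for the target aspect.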
Model | # Parameters | TopicalChat | SummEval | FED-Dialog | FED-Turn | Data2Text | AVG
X-EVAL-large (Default Ver.) | 780M | 0.605 | 0.480 | 0.485 | 0.398 | 0.303 | 0.454
X-EVAL-LLaMA-LoRA | 7B | 0.519 | 0.448 | 0.427 | 0.351 | 0.337 | 0.416

Table 10: Effect of the scale of language model backbones. For each meta-evaluation benchmark, we report the average Spearman correlation over all aspects.
Table 11: The full list of aspects, with the corresponding datasets and tasks, on summarization evaluation collected in the training split of ASPECTINSTRUCT.

Table 12: The full list of aspects, with the corresponding datasets and tasks, on dialogue evaluation collected in the training split of ASPECTINSTRUCT. "NOTA" indicates that the comparison task includes the case of "None Of The Above", where the quality of the two candidates is tied.
backbone. Specifically, we adopt LLaMA-7B-chat (Touvron et al., 2023a) as the backbone model and adopt LoRA parameter-efficient tuning (Hu et al., 2022) during the two-stage instruction tuning. We report the performance in Table 10.
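Table 10 reports, per benchmark, the Spearman correlation averaged over aspects. The aggregation can be sketched as follows; this is a pure-Python illustration that assumes no tied scores (so ranking is a simple permutation), whereas a real evaluation would typically use scipy.stats.spearmanr, which handles ties. The function names are illustrative, not from the paper's codebase.

```python
# Sketch of per-benchmark score aggregation: compute the Spearman
# correlation between predicted and human scores for each aspect,
# then average the correlations over aspects.

def ranks(xs):
    # 1-based rank of each value; assumes all values are distinct.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(pred, gold):
    # Spearman rho via the classic sum-of-squared-rank-differences
    # formula; valid only when there are no ties.
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(pred), ranks(gold)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def benchmark_score(per_aspect_scores):
    # per_aspect_scores: {aspect: (predicted_scores, human_scores)}
    rhos = [spearman(p, g) for p, g in per_aspect_scores.values()]
    return sum(rhos) / len(rhos)
```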
Naturalness: Naturalness in dialogue evaluation refers to the degree to which a response in a conversational context mirrors the characteristics, language use, and structure typical of a human conversational partner.

Coherence: Coherence refers to the logical and consistent interconnection of utterances and exchanges throughout a conversation. It represents the extent to which a dialogue system maintains relevance, consistency, and meaningful progression within the discourse, ensuring that the flow and structure of the conversation align with expected conversational norms and the ongoing context.

Engagingness: Engagingness in the context of dialogue evaluation refers to the degree to which a response fosters continued interaction, maintains or elevates interest, and stimulates a compelling exchange of ideas, emotions, or information between participants.

Groundedness: Dialogue groundedness measures how well the response uses the given fact. A response with weak groundedness does not mention or refer to the fact at all. A response with good groundedness uses the fact well.

Relevance: Relevance in dialogue evaluation refers to the measure of applicability, pertinence, or connection of a given response to the preceding conversational context and/or the explicitly posed question or statement.

Fluency: Fluency in dialogue evaluation refers to the degree of fluidity, coherence, and linguistic correctness in a generated response. It encompasses not only the grammatical and syntactic accuracy but also the seamless flow of ideas, the smooth transition between topics, and the naturalness of the language used, echoing human-like conversation patterns.

Table 13: The full list and definitions of seen aspects on dialogue evaluation collected in ASPECTINSTRUCT.
Topic Depth: Topic depth refers to the ability of a dialogue system to engage in extensive, detailed, and multi-turn discussions on a particular subject.

Likeability: Likeability refers to the degree to which an interactive system presents a pleasant, engaging, and affable conversational style that resonates positively with the user.

Understandability: Understandability reflects the ability of a conversational system to correctly parse and interpret user inputs, reflect an appropriate comprehension of the context, and generate contextually relevant responses.

Flexibility: Flexibility measures the system's capacity to understand and react appropriately to a wide range of conversational scenarios, and not merely those for which it was explicitly programmed or trained. It implies the capacity to engage in a diverse array of topics, offer meaningful responses in unexpected situations, and adjust conversational strategies based on the evolving context or user input.

Informativeness: Informativeness refers to the quality and relevance of the information that a dialogue system provides in response to user inputs. It captures the system's ability to offer novel, detailed, accurate, and appropriate information that aligns with the user's requests or needs.

Inquisitiveness: Inquisitiveness pertains to the consistent exhibition of the capacity to ask meaningful, contextually appropriate, and well-timed questions within a conversation by a dialogue system. This behavior is exhibited in the pursuit of greater comprehension, clarifying ambiguities, furthering the dialogue, or driving deeper engagement with the conversation partner.

Interestingness: Interestingness refers to the degree to which a response stimulates engagement, thought, or emotional reaction in the average user, fostering a desire to continue the conversation or explore the topic further. It is a measure of the response's capacity to capture the user's attention and maintain their engagement over time.

Specificity: Specificity measures to what degree the response is unique, personalized, or pertinent to the specific details of the preceding user inputs or dialogue context, as opposed to being generic, universally applicable, or independent of the conversational specifics.

Correctness: Correctness in dialogue evaluation measures the extent to which a generated response correctly reflects, comprehends, and addresses the salient elements, inferences, and implications in the preceding conversation context.

Semantic Appropriateness: Semantic appropriateness is the measure of the extent to which a response in a dialogue maintains logical, meaningful, and contextually fitting alignment with the preceding discourse elements, while adhering to the rules and principles of the language used in the conversation.

Table 14: The full list and definitions of unseen aspects on dialogue evaluation collected in ASPECTINSTRUCT.
Accuracy: The accuracy aspect measures how accurately the factual information in the summary matches the post. A summary is accurate if it doesn't say things that aren't in the article, it doesn't mix up people, and generally is not misleading. If the summary says anything at all that is not mentioned in the post or contradicts something in the post, it should be considered an inaccurate summary.

Coherence: The coherence aspect measures how coherent the summary is on its own. A summary is coherent if, when read by itself, it's easy to understand and free of English errors. A summary is not coherent if it's difficult to understand what the summary is trying to say. Generally, it's more important that the summary is understandable than that it is free of grammar errors.

Coverage: The coverage aspect measures how well the summary covers the important information in the post. A summary has good coverage if it mentions the main information from the post that's important to understand the situation described in the post. A summary has poor coverage if someone reading only the summary would be missing several important pieces of information about the situation in the post. A summary with good coverage should also match the purpose of the original post (e.g., to ask for advice).

Consistency: The consistency aspect measures the factual alignment between the summary and the summarized source. A factually consistent summary contains only statements that are entailed by the source document. You also need to penalize summaries that contain hallucinated facts.

Fluency: Fluency measures the quality of individual sentences. A fluent summary should have no formatting problems, capitalization errors, or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read.

Relevance: Relevance measures the selection of important content from the source. The summary should include only important information from the source document. You should penalize summaries which contain redundancies and excess information.

Table 15: The full list and definitions of aspects of summarization evaluation collected in ASPECTINSTRUCT.
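Definitions like those above are what X-EVAL embeds (with sentence-T5-large, per Appendix C) to select auxiliary aspects at inference time. The selection can be sketched as top-k ranking by cosine similarity between definition embeddings. The vectors and names below are toy stand-ins for real sentence-T5 embeddings, not the authors' code.

```python
import math

# Sketch of auxiliary-aspect selection: rank candidate aspects by the
# cosine similarity between the target aspect's definition embedding
# and each candidate's, then keep the top-k (the paper uses k = 1).

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_auxiliary(target_emb, candidate_embs, k=1):
    # candidate_embs: {aspect_name: embedding_vector}
    ranked = sorted(candidate_embs.items(),
                    key=lambda kv: cosine(target_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

With toy 2-d vectors where "naturalness" points nearly the same way as the target "engagingness" embedding and "fluency" is nearly orthogonal, the top-1 pick is "naturalness", mirroring the naturalness/engagingness correlation the paper motivates.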
Naturalness
NEG: The response is unnatural.
POS: The response is natural.

Coherence
NEG: The response drastically changes topic or ignores the conversation history.
POS: The response is on topic and strongly acknowledges the conversation history.

Engagingness
NEG: The response is generic and dull.
POS: The response is interesting or presents an interesting fact.

Groundedness
NEG: Given the interesting fact that the response is conditioned on, the response does not mention or refer to the fact at all.
POS: Given the interesting fact that the response is conditioned on, the response uses the fact well.

Relevance
NEG: The response is not relevant to the conversation.
POS: The response is relevant to the conversation.

Fluency
NEG: The response is not fluently written.
POS: The response is fluently written.

Topic Depth
NEG: The system cannot discuss topics in depth.
POS: The system is able to discuss topics in depth.

Likeability
NEG: The system cannot display a likeable personality.
POS: The system is able to display a likeable personality.

Understandability
NEG: The response is difficult to understand. You do not know what the person is trying to say.
POS: The response is understandable. You know what the person is trying to say.

Flexibility
NEG: The system is not flexible and adaptable to the user and their interests.
POS: The system is flexible and adaptable to the user and their interests.

Informativeness
NEG: The system is not informative throughout the conversation.
POS: The system is informative throughout the conversation.

Inquisitiveness
NEG: The system is not inquisitive throughout the conversation.
POS: The system is inquisitive throughout the conversation.

Interestingness
NEG: To the average person, the response is not interesting.
POS: To the average person, the response is interesting.

Specificity
NEG: The response is too generic and not specific to the conversation.
POS: The response is specific to the conversation.

Correctness
NEG: There was a misunderstanding of the conversation.
POS: The response is correct in the context of the conversation.

Semantic Appropriateness
NEG: The response is not semantically appropriate.
POS: The response is semantically appropriate.

Table 16: The full list of verbalizer templates used to convert the evaluation results of auxiliary aspects for dialogue evaluation collected in ASPECTINSTRUCT. "POS" and "NEG" indicate "positive" and "negative", respectively.
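To illustrate how a verbalized template like those above might be spliced into an instruction for the target aspect: the splicing below is a hypothetical stand-in for the paper's actual prompt format, with an illustrative function name and instruction wording.

```python
# Hypothetical sketch: prepend the verbalized auxiliary-aspect result
# to the evaluation instruction for the target aspect. The instruction
# wording is illustrative, not the exact prompt used in the paper.

def build_instruction(target_aspect: str, verbalized_aux: str, response: str) -> str:
    return (
        f"{verbalized_aux} "
        f"Given this, evaluate the {target_aspect} of the following response.\n"
        f"Response: {response}"
    )

prompt = build_instruction(
    "engagingness",
    "The response is natural.",  # verbalizer output for the auxiliary aspect
    "Did you know octopuses have three hearts?",
)
```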
Accuracy
NEG: The factual information in the summary cannot accurately match the post. It says things that aren't in the article, it mixes up people, or generally is misleading.
POS: The factual information in the summary accurately matches the post. It doesn't say things that aren't in the article, it doesn't mix up people, and generally is not misleading.

Coherence
NEG: The summary is not coherent, as it lacks a logical flow and has disjointed information, making it difficult to understand the main topic or argument.
POS: The summary is well-structured and well-organized, and it builds from sentence to sentence into a coherent body of information about a topic.

Coverage
NEG: The summary has poor coverage of the important information in the post, e.g., someone reading only the summary would be missing several important pieces of information about the situation in the post.
POS: The summary has good coverage, since it mentions the main information from the post that's important to understand the situation described in the post and also matches the purpose of the original post.

Consistency
NEG: The summary is not factually consistent with the original post, as it introduces factual inaccuracies or hallucinated facts that are not present in or supported by the original source document.
POS: The summary has good factual alignment with the summarized source. It contains only statements that are entailed by the source document.

Fluency
NEG: The summary is not fluent, as it contains formatting problems, capitalization errors, or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read.
POS: This is a fluent summary, as it generally does not have formatting problems, capitalization errors, or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read.

Relevance
NEG: This summary is not relevant to the source document, as it contains redundancies or excess information.
POS: The summary generally includes relevant content, capturing some key points from the source.

Table 17: The full list of verbalizer templates used to convert the evaluation results of auxiliary aspects for summarization evaluation collected in ASPECTINSTRUCT. "POS" and "NEG" indicate "positive" and "negative", respectively.