X-EVAL: Multi-aspect Text Evaluation

The document presents X-EVAL, a two-stage instruction tuning framework designed for evaluating natural language generation (NLG) outputs across multiple aspects, including both seen and unseen criteria. It introduces ASPECTINSTRUCT, the first dataset tailored for multi-aspect NLG evaluation, and demonstrates that X-EVAL can achieve high correlation with human judgments using a lightweight model. The approach emphasizes generalization, efficiency, and the ability to perform evaluations without requiring reference texts.

X-EVAL: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects

Minqian Liu♠ Ying Shen♠ Zhiyang Xu♠ Yixin Cao♣ Eunah Cho♡ Vaibhav Kumar♡ Reza Ghanadan♡ Lifu Huang♠
♠Virginia Tech ♡Amazon Inc. ♣Fudan University
{minqianliu, yings, zhiyangx, lifuh}@[Link]
{eunahch, kvabh, ghanadan}@[Link] caoyixin2011@[Link]

arXiv:2311.08788v2 [[Link]] 13 Apr 2024

Abstract

Natural Language Generation (NLG) typically involves evaluating the generated text in various aspects (e.g., consistency and naturalness) to obtain a comprehensive assessment. However, multi-aspect evaluation remains challenging as it may require the evaluator to generalize to any given evaluation aspect even if it is absent during training. In this paper, we introduce X-EVAL, a two-stage instruction tuning framework to evaluate text in both seen and unseen aspects customized by end users. X-EVAL consists of two learning stages: a vanilla instruction tuning stage that improves the model's ability to follow evaluation instructions, and an enhanced instruction tuning stage that exploits the connections between fine-grained evaluation aspects to better assess text quality. To support the training of X-EVAL, we collect ASPECTINSTRUCT, the first instruction tuning dataset tailored for multi-aspect NLG evaluation, spanning 27 diverse evaluation aspects with 65 tasks. To enhance task diversity, we devise an augmentation strategy that converts human rating annotations into diverse forms of NLG evaluation tasks, including scoring, comparison, ranking, and Boolean question answering. Extensive experiments across three essential categories of NLG tasks (dialogue generation, summarization, and data-to-text), coupled with 21 aspects in meta-evaluation, demonstrate that X-EVAL enables even a lightweight language model to achieve a comparable, if not higher, correlation with human judgments compared to state-of-the-art NLG evaluators like GPT-4.¹

¹The source code, model checkpoints, and datasets are publicly available at [Link] for research purposes.

Figure 1: Illustration of X-EVAL for multiple seen and unseen fine-grained evaluation aspects across various NLG tasks. The unseen aspect (i.e., Interestingness) is highlighted in italics. The text to be evaluated is highlighted with underline. In this example, each evaluation score is from 0 to 1. A higher score indicates better quality.

1 Introduction

Recent advancements in pre-training (Chung et al., 2022; Touvron et al., 2023a,b), prompting (Brown et al., 2020; Wei et al., 2022b; Wang et al., 2023; Yao et al., 2023; Qi et al., 2023), and instruction tuning (Wei et al., 2022a) have improved the quality of machine-generated texts by a significant degree. Nevertheless, the evaluation of various Natural Language Generation (NLG) tasks still lags far behind compared with the rapid progress of large language models (LLMs). Previous similarity-based metrics such as ROUGE (Lin, 2004), BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and BERTScore (Zhang* et al., 2020) predominantly measure the similarity between the generated and reference text, failing to accurately reflect the quality of generated text (Gehrmann et al., 2023), especially for open-ended generation tasks.

To obtain a more comprehensive assessment of text quality, multi-aspect evaluation (Fabbri et al., 2021) has been proposed to evaluate the generated text from multiple fine-grained evaluation aspects, such as fluency and consistency. While most
existing studies (Mehri and Eskenazi, 2020b; Yuan et al., 2021; Zhong et al., 2022) consider a closed set of aspects, in many realistic scenarios users may need to evaluate the text with their own customized aspects and specifications, calling for an evaluator that can be flexibly extended to any unseen aspect without the need for training data. Recent studies (Fu et al., 2023; Liu et al., 2023) propose to leverage large language models (LLMs) such as GPT-4 (OpenAI, 2023) as NLG evaluators, yielding promising zero-shot performance on unseen aspects. However, such evaluations, especially with proprietary LLMs, are cost-intensive, time-consuming, and pose concerns about data privacy and reproducibility.

In this work, we propose X-EVAL, an automatic evaluation framework that can conduct fine-grained evaluation on both seen and unseen aspects across various NLG tasks with a single model, as illustrated in Figure 1. X-EVAL follows a two-stage training paradigm: we first instruction-finetune an open-source language model to equip it with the capability of following human-written instructions for evaluation. Then, motivated by the observation that evaluation aspects usually exhibit interconnections (Fu et al., 2023) and thus their evaluations can benefit each other, we introduce an additional training stage to finetune the model on the instruction-tuning tasks enriched with the evaluations of a set of auxiliary aspects, which are expected to provide clues for evaluating the target aspect and encourage consistent evaluations across multiple aspects. During training, for each target aspect, we take all the remaining aspects defined in the corresponding dataset as auxiliary aspects and incorporate their gold evaluations into the instructions for the second-stage tuning. During inference, given the target aspect, we first select a set of auxiliary aspects based on the similarity of the aspect definitions and predict the evaluation result for each auxiliary aspect using the trained model. We then re-perform the evaluation for the target aspect by incorporating the results of the auxiliary aspects.

To support the proposed two-stage training of X-EVAL, we construct ASPECTINSTRUCT, the first multi-aspect evaluation instruction tuning dataset, spanning 27 diverse evaluation aspects over 65 tasks. The dataset is anchored around three core categories of NLG tasks: dialogue, summarization, and data-to-text. In light of insights from previous studies in instruction tuning (Wei et al., 2022a; Xu et al., 2023b), which emphasize the advantage of task diversity in enhancing zero-shot generalization, we further augment the dataset by converting the original human rating data into diverse forms of NLG evaluation tasks, including scoring, comparison, ranking, and Boolean question answering. In addition, to incorporate auxiliary aspects, we manually create templates that convert the numerical evaluation scores of each aspect into descriptions in natural language.

The main advantages of our approach are highlighted as follows: (1) Generalization ability: we introduce X-EVAL, which can be flexibly generalized to evaluate unseen NLG tasks or aspects customized by user instructions in a zero-shot manner with a single model; (2) Strong performance with high efficiency: with a significantly smaller number of model parameters (780M), X-EVAL achieves strong performance compared to state-of-the-art LLM-based evaluators (including GPT-4), as demonstrated through comprehensive experiments; (3) Reference-free and open-source: our evaluator does not require a gold reference to perform evaluation, and it is more reliable and transparent thanks to its open-source nature.

2 Related Work

Similarity-based Metrics The previously dominant text evaluation paradigm is to predict one evaluation score, where most metrics are similarity-based, including metrics that measure the surface overlap between the generated and reference text, such as ROUGE (Lin, 2004), BLEU (Papineni et al., 2002), and METEOR (Banerjee and Lavie, 2005), as well as metrics measuring the distance between the contextualized embeddings of the generated text and the reference as the similarity score, such as BERTScore (Zhang* et al., 2020) and MoverScore (Zhao et al., 2019). Although these metrics are widely adopted, they often overlook fine-grained aspects, and a later study (Gehrmann et al., 2023) has shown that they fail to truly capture the quality of text with a coarse-grained score.

Multi-Aspect Metrics To conduct a more holistic evaluation, recent studies (Wang et al., 2020a; Huang et al., 2020) propose to evaluate NLG systems via multiple fine-grained aspects. UniEval (Zhong et al., 2022) proposes to re-frame NLG evaluation into a QA format and perform multi-aspect evaluation with a single model via continual learning (Madotto et al., 2021; Liu et al., 2022; Liu and Huang, 2023). However, UniEval cannot maintain robust performance when generalizing to novel aspects. To obtain an evaluator that can be generalized to customized aspects, some recent studies (Fu et al., 2023; Liu et al., 2023) harness proprietary LLMs to perform fine-grained evaluation in a zero-shot manner. However, due to their closed-source nature, these evaluation metrics suffer from reproducibility issues and are prohibitively expensive. More recently, some concurrent studies (Xu et al., 2023a; Jiang et al., 2023; Mehri and Shwartz, 2023; Kim et al., 2024) propose to extract instruction-following data from proprietary LLMs for finetuning a more lightweight model as the evaluator. Nevertheless, they still incur high costs to call the APIs to obtain a large amount of training data, and it is non-trivial to ensure the data are of high quality. In addition, to the best of our knowledge, we are the first to meticulously curate an instruction-tuning dataset and train an instruction-based evaluator for dialogue evaluation.

3 ASPECTINSTRUCT

3.1 Problem Definition

Multi-aspect automatic text evaluation aims to evaluate the quality of an NLG system's output x given a set of evaluation aspects A (e.g., coherence, naturalness, and so on), and optionally an additional set of texts S (e.g., the source documents for text summarization, or the context for dialogue evaluation). The evaluation task can be formulated as:

c = f(x, S, a)

where a ∈ A is the fine-grained aspect to be evaluated, and f(·) is the scoring function that provides an assessment c w.r.t. the aspect a.

3.2 Data Collection

We aim to build a unified automatic evaluation framework that can assess text quality for both seen and unseen evaluation aspects across various NLG tasks via instruction tuning. To this end, we build an instruction-tuning dataset tailored for multi-aspect evaluation, namely ASPECTINSTRUCT, with the following steps:

Existing Dataset Collection We first collect 10 existing evaluation datasets with human annotations for 3 representative categories of NLG tasks, including dialogue generation (Sai et al., 2020; Gunasekara et al., 2020; Pang et al., 2020; Gopalakrishnan et al., 2019; Mehri and Eskenazi, 2020a), text summarization (Völske et al., 2017; Fabbri et al., 2021; Wang et al., 2020b; Zhong et al., 2022), and data-to-text (Wen et al., 2015).

Task Augmentation The original datasets we collect only contain numerical scores annotated by humans, which severely limits the diversity of instruction-tuning tasks. Thus, we further derive diverse forms of evaluation tasks from the original annotations to enhance task diversity. Denote the ground truth score for text x_i as y_i. We derive four types of tasks based on this annotation: (1) Scoring: we ask the model to directly predict a discrete score (e.g., on a Likert scale), where we map the continuous ground truth y_i into a discrete scale; (2) Comparison: we sample two texts x_i and x_j for an identical context, e.g., two versions of summaries for the same source document, and ask the model to select the text with the higher evaluation score; (3) Ranking: we further extend the comparison task into ranking by sampling three candidates under the same context and asking the model to predict the correct ranking of the candidates based on text quality; (4) Boolean Question Answering: we also formulate evaluation as a Boolean QA task following (Zhong et al., 2022) by asking the model a question such as "Is this response fluent?" and letting the model predict "Yes" or "No".

Instruction Creation Finally, we define a unified instruction format for the tasks included in ASPECTINSTRUCT. Each instruction consists of three parts: (1) a task description that briefly introduces the evaluation task, (2) an aspect definition, and (3) an evaluation protocol that details what the model should output to perform the evaluation. We present the detailed procedure for instruction annotation in Appendix A.1. We provide an example of the original annotation and the derived evaluation tasks along with the curated instructions in Figure 6 in Appendix A.2. The full list of evaluation aspects and the collected instructions can be found in Appendix A.3.

Statistics In total, we construct 65 tasks in ASPECTINSTRUCT, where we split 32 tasks and 14 seen aspects for instruction tuning and 33 tasks and 13 unseen aspects for meta-evaluation. We collect 72,637 instances in total, with 55,602 instances for training and 17,035 instances for inference. Note
[Figure 2 appears here: worked example prompts for the Boolean QA, scoring, ranking, and comparison tasks, shown for Stage 1 (vanilla instruction tuning), Stage 2 (instruction tuning with auxiliary aspects) on seen aspects (e.g., Data2Text-Informativeness), and inference on unseen aspects (e.g., Dialogue-Engagingness).]

Figure 2: Illustration of our X-EVAL framework. The left section depicts our two-stage training approach: vanilla instruction tuning on diverse tasks and subsequent training on instruction tasks enriched with auxiliary aspects. The right section illustrates the inference pipeline with auxiliary aspects.

that there is no overlap among the datasets used for training and inference. We consider two aspects that have identical aspect names but belong to different NLG tasks as distinct aspects. We include more details about the source datasets, the constructed instruction-tuning tasks, and the number of instances of each task in Appendix A.2.

4 X-EVAL

4.1 Two-Stage Instruction Tuning

Figure 2 presents an overview of X-EVAL, which consists of two stages of instruction tuning:

Vanilla Instruction Tuning The first training stage aims to equip the model with the ability to follow instructions to perform diverse evaluation tasks. We adopt Flan-T5 (Chung et al., 2022), an open-source language model, as the base model for our evaluator. Based on Flan-T5, we further perform standard instruction tuning on the mixture of four types of tasks: scoring, comparison, ranking, and Boolean QA, as elaborated in Section 3.2.

Instruction Tuning with Auxiliary Aspects Through our study, we discern that certain evaluation aspects can be interrelated. As evidence, in dialogue evaluation (Gopalakrishnan et al., 2019) the aspect naturalness usually shows a notable correlation with engagingness. When a dialogue response is not natural, it is very likely that humans consider the response to be not engaging. While these two aspects are not interchangeable given their different definitions, the evaluation of one aspect can offer useful clues for the evaluation of another, potentially related, aspect. Motivated by this, we enrich our training regimen with an additional instruction tuning stage to leverage potential connections to the target evaluation aspect.

More precisely, for each instruction-tuning task detailed in Section 3.2, we augment it based on the ground truth evaluation results of a predefined set of auxiliary aspects, which are all the other aspects collected in the source dataset. To convert the evaluation results of auxiliary aspects into natural language that can be fed into the input, we employ a template-based verbalizer, denoted as v(·), which takes in an aspect a and its evaluation score s for an instance, mapping them into a verbalized evaluation h = v(s, a). For example, with the aspect Consistency on Data2Text and the evaluation score 0.9 out of 1.0, the verbalized result is phrased as "This sentence is consistent with the source." (see more details in Appendix B). We construct the
set of verbalized results H with the verbalizer for each auxiliary aspect (except for the target aspect). This set H is then concatenated into the additional set of texts in the evaluator's input. The model then undergoes the second training stage on the instruction tasks enriched with these evaluation results.

4.2 Inference with Auxiliary Aspects

At the inference stage, we perform the following steps to evaluate the text on the target aspect. First, we select a set of auxiliary aspects for the target aspect. Based on the definitions of the target aspect and a pool of candidate aspects, we employ Sentence-T5 (Ni et al., 2022) to encode the definitions and measure the similarity between the sentence embeddings of the target aspect definition and each candidate aspect definition. We select the aspects with the top-k similarity scores as the auxiliary aspects to limit inference cost, where k is a hyperparameter. Second, we run an inference process using the Boolean QA task format, where the model predicts either "Yes" or "No", as outlined in Section 3.2, on each auxiliary aspect. We convert the predictions into natural-language results with the verbalizer. These verbalized results, denoted as H, are subsequently integrated into the additional set of texts S for evaluating the target aspect. Finally, given the input enhanced by auxiliary aspects, we adopt the same Boolean QA format to compute the evaluation score c for the target aspect:

c = P("Yes" | x, S, a) / (P("Yes" | x, S, a) + P("No" | x, S, a))

where P(·) denotes the probability of the model generating a specific word. The pseudo-code of our inference pipeline is given in Algorithm 1.

Algorithm 1: Inference Pipeline
Input: Set of evaluation aspects A, target aspect a_t, NLG system's output x, additional set of texts S, scoring function f(·), evaluation verbalizer v(·), similarity measure sim(·), sentence encoder E
Output: Target score c_t
1:  L ← {(sim(E(a), E(a_t)), a) | a ∈ A \ {a_t}}   // determine top-k auxiliary aspects
2:  Sort L in descending order based on similarity
3:  A_R ← first k aspects from sorted L
4:  Initialize an empty auxiliary evaluation set H   // generate verbalized evaluation results for auxiliary aspects
5:  for a_r ∈ A_R do
6:      c_r ← f(x, S_r, a_r)   // score for auxiliary aspect
7:      H ← [H; v(c_r, a_r)]   // add verbalized evaluation to the auxiliary evaluation set
8:  S_t ← [S_t; H]
9:  c_t ← f(x, S_t, a_t)   // evaluate the target aspect
10: return c_t

5 Experiment Setup

Meta Evaluation We meta-evaluate X-EVAL on the test split of ASPECTINSTRUCT, whose details are introduced as follows. For text summarization, we adopt SummEval (Fabbri et al., 2021) and QAGS (Wang et al., 2020b). For dialogue generation, we employ Topical-Chat (Gopalakrishnan et al., 2019) and FED (Mehri and Eskenazi, 2020a). For data-to-text generation, we utilize SFHOT & SFRES (Wen et al., 2015). ASPECTINSTRUCT contains the following unseen aspects: topic depth (DEP), likeability (LIK), understandability (UND), flexibility (FLE), informativeness (INF), inquisitiveness (INQ), interestingness (INT), specificity (SPE), correctness (COR), and semantic appropriateness (SEM). More detailed descriptions of the test splits, as well as the seen and unseen evaluation aspects, can be found in Appendix A.4.

Implementation Details We adopt Flan-T5-large (with ~780M parameters) as our base language model for subsequent finetuning. Unless otherwise specified, we pick the top-1 aspect during inference, i.e., k = 1. More implementation details can be found in Appendix C.

Baselines We compare X-EVAL with the following state-of-the-art NLG evaluation metrics: (1) UniEval (Zhong et al., 2022) is a unified multi-aspect evaluator that re-frames the evaluation process as a Boolean QA task; (2) GPTScore (Fu et al., 2023) is a multi-faceted and training-free evaluation framework that utilizes the output probabilities from LLMs to score generated texts; (3) G-Eval (Liu et al., 2023) proposes to leverage large language models such as GPT-3.5 or GPT-4 to assess text quality with a form-filling paradigm in a training-free manner; (4) ROUGE-L (Lin, 2004); (5) DynaEval (Zhang et al., 2021); (6) BERTScore (Zhang* et al., 2020); (7) MoverScore (Zhao et al., 2019); (8) USR (Mehri and Eskenazi, 2020b); (9) BARTScore (Yuan et al.,
                                          Dialogue-level                                   Turn-level
Metrics                                   DEP    LIK    UND    FLE    INF    INQ    AVG  | INT    SPE    COR    SEM    UND    AVG
BARTScore (Yuan et al., 2021)             0.082  0.099  -0.115 0.093  0.092  0.062  0.052 | 0.159  0.083  0.076  0.100  0.120  0.128
DynaEval (Zhang et al., 2021)             0.498  0.416  0.365  0.383  0.426  0.410  0.416 | 0.327  0.346  0.242  0.202  0.200  0.263
UniEval (Zhong et al., 2022)              0.046  0.009  -0.024 -0.003 -0.070 0.085  0.030 | 0.435  0.381  0.125  0.051  0.082  0.215
GPTScore (GPT-3-d01) (Fu et al., 2023)    0.669  0.634  0.524  0.515  0.602  0.503  0.574 | 0.501  0.214  0.434  0.444  0.365  0.392
GPTScore (GPT-3-d03) (Fu et al., 2023)    0.341  0.184  0.196  0.072  0.317  -0.101 0.168 | 0.224  0.151  0.428  0.405  0.311  0.304
G-Eval (GPT-3.5)† (Liu et al., 2023)      0.339  0.392  0.123  0.344  0.232  0.101  0.259 | 0.30   0.280  0.430  0.390  0.274  0.335
G-Eval (GPT-4)† (Liu et al., 2023)        0.583  0.614  0.602  0.587  0.510  0.551  0.573 | 0.506  0.368  0.522  0.443  0.438  0.455
X-EVAL (Ours)                             0.583  0.436  0.588  0.324  0.480  0.497  0.485 | 0.421  0.370  0.492  0.376  0.332  0.398
- w/o Training                            0.377  0.387  0.394  0.424  0.370  0.417  0.395 | 0.250  0.175  0.296  0.289  0.225  0.247
- w/o Instructions                        0.350  0.333  0.495  0.355  0.425  0.435  0.399 | 0.477  0.353  0.203  0.255  0.211  0.300
- w/o Stage-Two Tuning                    0.388  0.324  0.555  0.384  0.582  0.437  0.445 | 0.372  0.282  0.418  0.329  0.311  0.342

Table 1: Meta-evaluation on dialogue based on unseen aspects in terms of dialogue-level and turn-level Spearman (ρ) correlations on FED. The best overall results are highlighted in bold. We also highlight the best results excluding GPT-based metrics with underline. †: our re-implementation, where we adopt our annotated instructions and aspect definitions as inputs to OpenAI's API to obtain the performance of G-Eval on FED.
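Table 1 and the tables that follow report rank correlations between metric scores and human judgments. As a reference for how such meta-evaluation numbers are computed, here is a minimal, dependency-free sketch of Spearman's ρ (rank transform with average ranks for ties, then Pearson correlation); this is illustrative, not the paper's evaluation code:

```python
from statistics import mean

def ranks(xs):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the tie group
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)
```

Applied per aspect, `a` would hold the evaluator's scores and `b` the human ratings for the same set of outputs.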

                                     Naturalness      Coherence        Engagingness     Groundedness     AVG
Metrics                              r      ρ         r      ρ         r      ρ         r      ρ         r      ρ
ROUGE-L (Lin, 2004)                  0.176  0.146     0.193  0.203     0.295  0.300     0.310  0.327     0.243  0.244
BERTScore (Zhang* et al., 2020)      0.226  0.209     0.214  0.233     0.317  0.335     0.291  0.317     0.262  0.273
USR (Mehri and Eskenazi, 2020b)      0.337  0.325     0.416  0.377     0.456  0.465     0.222  0.447     0.358  0.403
UniEval (Zhong et al., 2022)         0.480  0.512     0.518  0.609     0.544  0.563     0.462  0.456     0.501  0.535
G-Eval (GPT-3.5) (Liu et al., 2023)  0.532  0.539     0.519  0.544     0.660  0.691     0.586  0.567     0.574  0.585
G-Eval (GPT-4) (Liu et al., 2023)    0.549  0.565     0.594  0.605     0.627  0.631     0.531  0.551     0.575  0.588
X-EVAL (Ours)                        0.417  0.478     0.558  0.622     0.449  0.593     0.734  0.728     0.540  0.605
- w/o Training                       0.054  0.051     0.063  0.073     0.258  0.298     0.427  0.436     0.200  0.214
- w/o Instructions                   0.415  0.452     0.560  0.574     0.397  0.532     0.690  0.701     0.515  0.565
- w/o Stage-Two Tuning               0.396  0.446     0.581  0.642     0.408  0.569     0.725  0.706     0.528  0.592

Table 2: Turn-level Pearson (r) and Spearman (ρ) correlations on seen aspects on Topical-Chat. The best overall results are highlighted in bold. We also highlight the best results excluding GPT-based metrics with underline.
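All X-EVAL variants in these tables ultimately score texts through the Boolean QA format of Section 4.2, where the score is the normalized probability of "Yes". A minimal sketch of that normalization, assuming the log-probabilities of the "Yes" and "No" tokens are available from the model:

```python
import math

def boolean_qa_score(logp_yes: float, logp_no: float) -> float:
    """c = P("Yes") / (P("Yes") + P("No")), computed from token log-probs.

    Algebraically equivalent to a sigmoid over the log-probability
    difference, so the score always lies in (0, 1).
    """
    p_yes = math.exp(logp_yes)
    p_no = math.exp(logp_no)
    return p_yes / (p_yes + p_no)
```

Because only the ratio of the two probabilities matters, any shared normalization constant in the model's output distribution cancels out.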

2021). We include more details of baselines (4)-(9) in Appendix C due to the space limit.

Variants of X-EVAL We design several variants of X-EVAL for ablation studies: (1) X-EVAL w/o Training denotes the original Flan-T5 (without any further finetuning on our proposed ASPECTINSTRUCT); (2) X-EVAL w/o Instructions: based on Flan-T5, we only conduct prompt-based multi-task training and inference in the same way as Zhong et al. (2022), where we ask the model to answer Boolean questions without using aspect definitions; (3) X-EVAL w/o Stage-Two Tuning: for this variant, we only conduct vanilla instruction tuning in Stage 1 based on Flan-T5. During inference, we directly perform evaluation based on the instructions without using auxiliary aspects.

6 Main Results

We report the main results of dialogue evaluation in Table 1 and Table 2, summarization in Table 3 and Table 9, and data-to-text in Table 4. Each table is divided into three sections: the top section delineates the performance of traditional metrics and evaluators based on lightweight language models. The middle section shows the performance of evaluators based on GPTs (Brown et al., 2020; OpenAI, 2023), which are proprietary and much larger than our approach. The bottom section shows the performance of X-EVAL and its variants.

Results of Dialogue Evaluation on FED To assess X-EVAL's ability to generalize to unseen aspects, we present the Spearman correlations on FED in Table 1. X-EVAL surpasses the baselines in the top section. Also, X-EVAL matches the performance of GPT-based baselines with far fewer parameters. The bottom section of the table highlights the improvements achieved by two-stage tuning, incorporating instructions, and integrating auxiliary aspects. It is worth noting that UniEval achieves notably poor performance on dialogue-level evaluation on FED, which is probably due to UniEval being overfitted to turn-level evaluation and failing to generalize to dialogue-level evaluation.
                                     Coherence        Consistency      Fluency          Relevance        AVG
Metrics                              ρ      τ         ρ      τ         ρ      τ         ρ      τ         ρ      τ
ROUGE-L (Lin, 2004)                  0.128  0.099     0.115  0.092     0.105  0.084     0.311  0.237     0.165  0.128
MoverScore (Zhao et al., 2019)       0.159  0.118     0.157  0.127     0.129  0.105     0.318  0.244     0.191  0.148
BERTScore (Zhang* et al., 2020)      0.284  0.211     0.110  0.090     0.193  0.158     0.312  0.243     0.225  0.175
BARTScore (Yuan et al., 2021)        0.448  0.342     0.382  0.315     0.356  0.292     0.356  0.273     0.385  0.305
UniEval (Zhong et al., 2022)         0.495  0.374     0.435  0.365     0.419  0.346     0.424  0.327     0.443  0.353
GPTScore (Fu et al., 2023)           0.434  –         0.449  –         0.403  –         0.381  –         0.417  –
G-Eval (GPT-3.5) (Liu et al., 2023)  0.440  0.335     0.386  0.318     0.424  0.347     0.385  0.293     0.401  0.320
G-Eval (GPT-4) (Liu et al., 2023)    0.582  0.457     0.507  0.425     0.455  0.378     0.547  0.433     0.514  0.418
X-EVAL (Ours)                        0.530  0.382     0.428  0.340     0.461  0.365     0.500  0.361     0.480  0.362
- w/o Training                       0.187  0.131     0.193  0.152     0.135  0.104     0.444  0.325     0.240  0.178
- w/o Instructions                   0.458  0.333     0.414  0.328     0.395  0.309     0.496  0.359     0.441  0.333
- w/o Stage-Two Tuning               0.536  0.385     0.413  0.326     0.455  0.360     0.503  0.363     0.476  0.359

Table 3: Summary-level Spearman (ρ) and Kendall-Tau (τ) correlations of different metrics on SummEval. All aspects are seen aspects. The best overall results are highlighted in bold. We also highlight the best results excluding GPT-based metrics with underline.
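Inference in all of these experiments uses the top-1 auxiliary aspect (k = 1, Section 5). The selection step of Algorithm 1 (lines 1-3) amounts to ranking candidate aspects by the similarity of their definition embeddings; the sketch below uses cosine similarity, with `definition_vecs` standing in for Sentence-T5 embeddings of the aspect definitions (the vectors and helper names here are illustrative, not from the paper's code):

```python
def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def select_auxiliary_aspects(target, definition_vecs, k=1):
    """Pick the k aspects whose definition embeddings are most similar to
    the target aspect's definition embedding (Algorithm 1, lines 1-3).

    definition_vecs: dict mapping aspect name -> definition embedding;
    in the paper these embeddings come from a sentence encoder (Sentence-T5).
    """
    target_vec = definition_vecs[target]
    scored = [(cosine(target_vec, vec), name)
              for name, vec in definition_vecs.items() if name != target]
    scored.sort(reverse=True)  # descending similarity
    return [name for _, name in scored[:k]]
```

Restricting the auxiliary set to the top-k candidates keeps the number of extra model calls per evaluation bounded by k.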

                         SFRES            SFHOT
Metrics                  NAT    INFO      NAT    INFO     AVG
ROUGE-L                  0.169  0.103     0.186  0.110    0.142
BERTScore                0.219  0.156     0.178  0.135    0.172
MoverScore               0.190  0.153     0.242  0.172    0.189
BARTScore                0.289  0.238     0.288  0.235    0.263
UniEval (Summ)           0.333  0.225     0.320  0.249    0.282
GPTScore                 0.190  0.232     0.036  0.184    0.161
G-Eval (GPT-3.5)†        0.144  0.118     0.072  0.102    0.109
G-Eval (GPT-4)†          0.351  0.189     0.338  0.198    0.269
X-EVAL (Ours)            0.316  0.265     0.322  0.310    0.303
- w/o Training           0.240  0.192     0.207  0.262    0.225
- w/o Instructions       0.303  0.255     0.297  0.277    0.283
- w/o Stage-Two Tuning   0.322  0.257     0.311  0.292    0.295

Table 4: Spearman correlation on the data-to-text NLG task. NAT and INFO indicate Naturalness and Informativeness, respectively. The best results are highlighted in bold. †: our re-implementation.

Metrics                  Topic.  FED     Summ.   D2T     AVG
X-EVAL (w/o STT)         0.592   0.375   0.480   0.295   0.436
- w/o Scoring            0.547   0.281   0.438   0.300   0.392
- w/o Comparison         0.554   0.347   0.448   0.293   0.411
- w/o Ranking            0.591   0.354   0.433   0.252   0.408
- w/o QA                 0.579   0.357   0.418   0.284   0.410

Table 5: Ablation study on the stage-one instruction tuning task types (Spearman correlation). "w/o STT" denotes that the model does not use Stage-Two Tuning. The best results are highlighted in bold.

Results of Dialogue Evaluation on Topical-Chat We also evaluate the performance on the seen aspects of Topical-Chat and report the results in Table 2. Notably, in addition to its superior performance over lightweight baselines, X-EVAL also surpasses all GPT-based metrics in averaged Spearman correlation. We notice that the correlation of X-EVAL on groundedness is notably higher than that of other baselines. One plausible reason is that Flan-T5 has been finetuned on related tasks such as natural language inference (Chung et al., 2022), as X-EVAL w/o Training already achieves decent performance without finetuning on ASPECTINSTRUCT.

Results of Summarization Evaluation We use summary-level Spearman and Kendall-Tau correlations to assess various evaluators on SummEval. Note that all the aspects in SummEval are seen aspects. From Table 3, X-EVAL surpasses lightweight evaluators in averaged Spearman correlation and outperforms both GPTScore and G-Eval (GPT-3.5). G-Eval (GPT-4) consistently excels across all aspects. We speculate this may stem from GPT-4's strong ability to handle long input contexts. In addition, we report the results on QAGS in Table 9 in the Appendix due to the space limit.

Results of Unseen NLG Task Evaluation In this experiment, we evaluate X-EVAL on the unseen data-to-text generation task. Table 4 shows that while X-EVAL experiences a slight performance loss in naturalness compared to G-Eval (GPT-4), it consistently excels over all other baselines across all aspects. This underscores the generalization capability of X-EVAL on unseen NLG tasks.

7 Discussions

Ablation Study of Instruction Tuning Tasks We conduct ablation studies to investigate the contribution of incorporating diverse forms of evaluation tasks during instruction tuning. Table 5 shows the averaged Spearman correlation on each meta-
Figure 3: Effect of the scale of language model backbones. For each meta-evaluation benchmark, we report the
average Spearman correlation on all the aspects. X-E VAL-large (780M) is the default backbone language model
throughout all the experiments if there is no specification.

Metrics NAT COH ENG GRO AVG predict inaccurate evaluations for auxiliary aspects.
X-E VAL 0.478 0.622 0.593 0.728 0.605 To investigate their impact, we tailor several base-
- Inference w/o Auxiliary Aspects 0.462 0.641 0.577 0.723 0.600
- w/ GT RAA (Upperbound) 0.552 0.651 0.703 0.751 0.664 lines: (1) directly applying the model after two-
- w/ Random RAA (Lowerbound) 0.468 0.601 0.561 0.628 0.564
stage tuning to evaluate without auxiliary aspects;
Table 6: Analysis of error propagation in auxiliary as-
(2) using the ground truth (“GT”) evaluation results
pects on Topical-Chat in terms of Spearman correlation. instead of predicted results for auxiliary aspects
We highlight the best results in bold and the best results (upperbound), and; (3) using random evaluation
without using ground truths with underline. “RAA” de- results for auxiliary aspects (lowerbound). From
notes the evaluation Results on Auxiliary Aspects. Table 6, removing auxiliary aspects makes the over-
all performance drop. The variant with GT results
gains improvement in all aspects, which indicates
evaluation dataset. In general, X-E VAL trained on
the error in the evaluation of auxiliary aspects does
the combination of all forms of evaluation tasks,
impact the performance of target aspects, but not to
including scoring, comparison, ranking, achieves
a large degree. Using random results, on the other
the highest averaged correlation for nearly all tasks.
hand, deteriorates the performance significantly.
Effect of the Scale of Language Model Back-
bones We adopt the same training and inference Effect of Hyperparameter k We examine the
pipelines for the backbones with different scales to choice of k in selecting top-k auxiliary aspects dur-
show the effect of the models’ size and justify the ing inference. Table 7 shows that inference with
use of Flan-T5-large. Specifically, we additionally the top-1 auxiliary aspect generally achieves better
experiment with Flan-T5-small (80M), Flan-T5- correlation. We speculate that this may stem from
base (250M), and Flan-T5-xl (3B) as the backbone the error propagation during inference on auxiliary
models, and term our X-E VAL respectively. The aspects, where using more auxiliary aspects poten-
results are shown in Figure 3. From Figure 3, the tially introduces more inaccuracies, offsetting their
evaluators’ performance consistently increases as potential performance benefits.
the model size increases in general. However, when
we upgrade the backbone model from Flan-T5- Qualitative Correlation Analysis on Instruction
large to Flan-T5-xl, the performance improvement Tuning To further investigate the effect of instruc-
becomes less significant. Given the trade-off be- tion tuning, in Figure 4, we visualize the correla-
tween efficiency and performance, we select Flan- tion of our X-E VAL and Flan-T5 (i.e., “X-E VAL
T5-large as the default backbone model of X-E VAL w/o Training”) based on naturalness on Topical-
in our experiments. We include a more detailed per- Chat and consistency on SummEval. The red
formance analysis of the effect of language model lines are linear regression fits to show how well
backbones in Appendix C. the predicted scores correlate to human judgments
linearly. Before instruction tuning, the predicted
Error Propagation from Auxiliary Aspects dur- scores are more uniformly distributed regardless
ing Inference During inference, X-E VAL may of ground truth scores, which results in poor cor-
Dial-Naturalness (X-Eval) Dial-Naturalness (Flan-T5) Similarity of Aspect Definition 1.00
1 0.83 0.81 0.78 0.85 0.77 0.78 0.86 0.85

INT UND GRO ENG COH NAT


0.83 1 0.81 0.75 0.87 0.74 0.75 0.85 0.88
0.95
Human Score

0.81 0.81 1 0.75 0.8 0.91 0.77 0.83 0.78

0.78 0.75 0.75 1 0.77 0.74 0.76 0.8 0.78 0.90

0.85 0.87 0.8 0.77 1 0.78 0.82 0.9 0.87


0.85
0.77 0.74 0.91 0.74 0.78 1 0.8 0.76 0.75
Predicted Score Predicted Score
0.78 0.75 0.77 0.76 0.82 0.8 1 0.82 0.79

SEM COR SPE


Summ-Consistency (X-Eval) Summ-Consistency (Flan-T5)
0.80
0.86 0.85 0.83 0.8 0.9 0.76 0.82 1 0.88
Human Score

0.85 0.88 0.78 0.78 0.87 0.75 0.79 0.88 1 0.75


NAT COH ENG GRO UND INT SPE COR SEM

Figure 5: Cosine similarity scores of the sentence em-


beddings of aspect definition in turn-level dialogue
evaluation. Naturalness (NAT), coherence (COH),
Predicted Score Predicted Score engagingness (ENG), and groundedness (GRO) are
seen aspects, while the rest are unseen aspects.
Figure 4: The scatter plots of correlation between hu-
man scores and predicted scores of X-E VAL and Flan-
Selection Topic-Chat FED-Turn AVG
T5, respectively.
All 0.605 0.398 0.502
Seen 0.602 0.399 0.489
Selection Topic. FED Summ. D2T AVG
Unseen 0.608 0.379 0.481
Top-1 0.605 0.434 0.480 0.303 0.456 Random 0.592 0.381 0.475
Top-3 0.602 0.414 0.466 0.278 0.440
Top-5 0.598 0.435 0.463 0.275 0.443 Table 8: Comparison of different pools of candidate aux-
iliary aspects in terms of averaged Spearman correlation
Table 7: Effect of different k in selecting auxiliary as- for turn-level dialogue evaluation. The best results are
pect in terms of averaged Spearman correlation. The highlighted in bold.
best results are highlighted in bold.

mance. Also, we observe a substantial performance


relation. On the contrary, our X-E VAL can predict degradation when the auxiliary aspect is randomly
scores that not only achieve better correlation but selected, which shows the effectiveness of our as-
also are more distinctive (either close to 1 or 0), pect selection strategy.
showing the effectiveness of our instruction tuning.
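Spearman correlation, the metric used throughout these comparisons, is Pearson correlation computed on the ranks of the two score lists. A minimal tie-free sketch (our own illustration, not the paper's evaluation code; the score lists are hypothetical):

```python
def spearman(x, y):
    # Spearman's rho = Pearson correlation of the rank vectors.
    # Assumes no ties, so averaged ranks are not needed.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1.0
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

human = [1.0, 0.67, 0.33, 0.0]  # hypothetical human ratings
pred = [0.9, 0.8, 0.2, 0.1]     # hypothetical predicted scores
print(round(spearman(human, pred), 6))  # -> 1.0
```

Because only ranks matter, a metric can score perfectly here even when its raw scores are far from the human scale, which is why rank correlation is the standard choice for meta-evaluation.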
8 Conclusion
Visualization of Auxiliary Aspect Selection In
Figure 5, we also report the cosine similarity be- In this work, we present X-E VAL, a novel two-
tween the sentence embeddings of the aspect def- stage instruction-tuning framework for text evalu-
initions used in turn-level dialogue evaluation as ation across both seen and unseen aspects. To fa-
the qualitative analysis of our aspect selection strat- cilitate training, we collect A SPECT I NSTRUCT, the
egy. In general, our strategy can select semantically first instruction-tuning dataset for multi-aspect eval-
related aspects for target-aspect evaluation. uation. Extensive experiments on meta-evaluation
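The selection strategy behind Figure 5, ranking candidate aspects by the cosine similarity between the embeddings of their definitions and picking the top k, can be sketched as follows. The toy vectors stand in for real sentence-T5 definition embeddings and are purely hypothetical:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_auxiliary(target, embed, k=1):
    """Rank all non-target aspects by the similarity of their definition
    embeddings to the target aspect's embedding, and keep the top k."""
    t = embed[target]
    ranked = sorted(
        (a for a in embed if a != target),
        key=lambda a: cosine(embed[a], t),
        reverse=True,
    )
    return ranked[:k]

# Toy definition embeddings (hypothetical values, not real sentence-T5 output).
embed = {
    "naturalness": [0.9, 0.1, 0.2],
    "coherence":   [0.8, 0.3, 0.1],
    "consistency": [0.1, 0.9, 0.8],
}
print(top_k_auxiliary("naturalness", embed, k=1))  # -> ['coherence']
```

With k = 1 this returns the single most related aspect, matching the default inference setting discussed above.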
benchmarks demonstrate that with significantly
Analysis of Auxiliary Aspect Selection Strategy
fewer parameters, X-E VAL achieves a compara-
We also experimented to compare the performance
ble if not higher correlation with human judgments
of selecting auxiliary aspects based on seen, un-
compared to the state-of-the-art NLG evaluators.
seen, or all aspects, as well as randomly selecting
aspects regardless of the definitions. We set the 9 Limitations
number of auxiliary aspects to 1 in this experiment.
From Table 8, selecting the auxiliary aspect based Limitation of Data Collection In this work, we
on all the aspects achieves the best overall perfor- mainly target evaluation tasks in English. Future
work can explore evaluation tasks in a more diverse language setting and augment our ASPECTINSTRUCT dataset. In addition, our dataset focuses on a limited subset of NLG tasks including dialogue, summarization, and data-to-text. More NLG tasks can be considered in the future.

Inference Efficiency Our algorithm may require multiple rounds of predictions to generate evaluation results for auxiliary aspects at inference time. While this process imposes additional computational costs, given that the backbone we used is lightweight (with 780M parameters) and efficient, our approach is still significantly more efficient than much larger evaluators, e.g., GPT-4. We leave exploring more efficient inference strategies for future work.

Error Propagation During inference, the evaluation results of auxiliary aspects may contain some errors, which may affect the final evaluation of the target aspect. We leave developing more robust inference algorithms to address the error propagation problem for future work.

Acknowledgments

This research is partially supported by a research award from the Amazon-Virginia Tech Initiative and award No. 2330940 from the Secure and Trustworthy Cyberspace program of the National Science Foundation (NSF). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. CoRR, abs/2005.14165.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models. CoRR, abs/2210.11416.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186.

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391-409.

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. GPTScore: Evaluate as you desire. CoRR, abs/2302.04166.

Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. 2023. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. J. Artif. Intell. Res., 77:103-166.

Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anushree Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-Chat: Towards knowledge-grounded open-domain conversations.

R. Chulaka Gunasekara, Seokhwan Kim, Luis Fernando D'Haro, Abhinav Rastogi, Yun-Nung Chen, Mihail Eric, Behnam Hedayatnia, Karthik Gopalakrishnan, Yang Liu, Chao-Wei Huang, Dilek Hakkani-Tür, Jinchao Li, Qi Zhu, Lingxiao Luo, Lars Liden, Kaili Huang, Shahin Shayandeh, Runze Liang, Baolin Peng, Zheng Zhang, Swadheen Shukla, Minlie Huang, Jianfeng Gao, Shikib Mehri, Yulan Feng, Carla Gordon, Seyed Hossein Alavi, David R. Traum, Maxine Eskénazi, Ahmad Beirami, Eunjoon Cho, Paul A. Crook, Ankita De, Alborz Geramifard, Satwik Kottur, Seungwhan Moon, Shivani Poddar, and Rajen Subba. 2020. Overview of the ninth dialog system technology challenge: DSTC9. CoRR, abs/2011.06486.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu
Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.

Lishan Huang, Zheng Ye, Jinghui Qin, Liang Lin, and Xiaodan Liang. 2020. GRADE: Automatic graph-enhanced coherence metric for evaluating open-domain dialogue systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 9230-9240. Association for Computational Linguistics.

Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, and Wenhu Chen. 2023. TIGERScore: Towards building explainable metric for all text generation tasks. CoRR, abs/2310.00752.

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. 2024. Prometheus: Inducing evaluation capability in language models. In The Twelfth International Conference on Learning Representations.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81.

Minqian Liu, Shiyu Chang, and Lifu Huang. 2022. Incremental prompting: Episodic memory prompt for lifelong event detection. In Proceedings of the 29th International Conference on Computational Linguistics, pages 2157-2165, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Minqian Liu and Lifu Huang. 2023. Teamwork is not always good: An empirical study of classifier drift in class-incremental information extraction. In Findings of the Association for Computational Linguistics: ACL 2023, pages 2241-2257, Toronto, Canada. Association for Computational Linguistics.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. CoRR, abs/2303.16634.

Andrea Madotto, Zhaojiang Lin, Zhenpeng Zhou, Seungwhan Moon, Paul Crook, Bing Liu, Zhou Yu, Eunjoon Cho, Pascale Fung, and Zhiguang Wang. 2021. Continual learning in task-oriented dialogue systems. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7452-7467, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Shikib Mehri and Maxine Eskenazi. 2020a. Unsupervised evaluation of interactive dialog with DialoGPT. arXiv preprint arXiv:2006.12719.

Shikib Mehri and Maxine Eskenazi. 2020b. USR: An unsupervised and reference free evaluation metric for dialog generation. pages 681-707.

Shuhaib Mehri and Vered Shwartz. 2023. Automatic evaluation of generative models with instruction tuning. CoRR, abs/2310.20072.

Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang. 2022. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1864-1874.

OpenAI. 2023. GPT-4 technical report. ArXiv, abs/2303.08774.

Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, and Kewei Tu. 2020. Towards holistic and automatic evaluation of open-domain dialogue generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3619-3629, Online. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.

Jingyuan Qi, Zhiyang Xu, Ying Shen, Minqian Liu, Di Jin, Qifan Wang, and Lifu Huang. 2023. The art of SOCRATIC QUESTIONING: Zero-shot multimodal reasoning with recursive thinking and self-questioning. CoRR, abs/2305.14999.

Ananya B. Sai, Akash Kumar Mohankumar, Siddhartha Arora, and Mitesh M. Khapra. 2020. Improving dialog evaluation with a multi-reference adversarial dataset and large scale pretraining. Transactions of the Association for Computational Linguistics, 8:810-827.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa,
Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.

Michael Völske, Martin Potthast, Shahbaz Syed, and Benno Stein. 2017. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59-63, Copenhagen, Denmark. Association for Computational Linguistics.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020a. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5008-5020. Association for Computational Linguistics.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020b. Asking and answering questions to evaluate the factual consistency of summaries. arXiv preprint arXiv:2004.04228.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. [Link].

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022b. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.

Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Yang Wang, and Lei Li. 2023a. INSTRUCTSCORE: Towards explainable text generation evaluation with automatic feedback. CoRR, abs/2305.14282.

Zhiyang Xu, Ying Shen, and Lifu Huang. 2023b. MultiInstruct: Improving multi-modal zero-shot learning via instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11445-11465, Toronto, Canada. Association for Computational Linguistics.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. CoRR, abs/2305.10601.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263-27277.

Chen Zhang, Yiming Chen, Luis Fernando D'Haro, Yan Zhang, Thomas Friedrichs, Grandee Lee, and Haizhou Li. 2021. DynaEval: Unifying turn and dialogue level evaluation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5676-5689, Online. Association for Computational Linguistics.

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563-578.

Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022. Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023-2038, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

A More Details on ASPECTINSTRUCT

A.1 Annotation Protocol of Instructions

We depict the annotation process for the instructions in ASPECTINSTRUCT as follows. To curate the definition for each aspect, we first refer to the definition of the aspect in the original annotation guideline. When a definition is absent from the guideline, three human annotators (graduate students studying in computational linguistics or natural language processing areas) construct and revise
Original Annotation
  Input: Context: "Are you looking for an apartment?", "Yes, I'm interested in a one-bedroom apartment."
  Response: "I think I have just a right one for you."
  Human Rating: 0.67 (out of 1)

Scoring
  Input: Dialogue relevance is to determine whether the response is relevant to the context. In this task, you are given a dialogue context and a response. Your task is to determine whether the response is relevant to the context. A score of 0 indicates the response is not relevant. A score of 1 indicates the response is relevant. Predict 0 or 1 as your choice.
  Context: "Are you looking for an apartment?", "Yes, I'm interested in a one-bedroom apartment."
  Response: "I think I have just a right one for you."
  Output: 1

Boolean QA
  Input: Dialogue relevance is to determine whether the response is relevant to the context. In this task, you need to evaluate the quality of the dialogue response based on relevance by answering 'Yes' or 'No' to the following question. Question: Is this response relevant to the given dialogue history?
  Context: "Are you looking for an apartment?", "Yes, I'm interested in a one-bedroom apartment."
  Response: "I think I have just a right one for you."
  Output: Yes

Comparison
  Input: Dialogue relevance is to determine whether the response is relevant to the context. In this task, you are given a dialogue context and two candidate responses. Your task is to determine which response is more relevant to the context.
  Context: "Are you looking for an apartment?", "Yes, I'm interested in a one-bedroom apartment."
  Response 1: "I think I have just a right one for you."
  Response 2: "Because my friend wants to meet you".
  Output: Response 1

Ranking
  Input: Dialogue relevance is to determine whether the response is relevant to the context. In this task, you are given a dialogue context and three candidate responses. Your task is to give a ranking for the three responses from the best quality to the worst.
  Context: "Are you looking for an apartment?", "Yes, I'm interested in a one-bedroom apartment."
  Response 1: "I think I have just a right apartment for you.", Response 2: "Because my friend wants to meet you", Response 3: "There is good news."
  Output: Response 1 > Response 3 > Response 2

Figure 6: An illustrative example of augmented instruction-tuning tasks from the original annotation. The definition of the aspect is highlighted in purple. The annotated task instructions and the constructed output labels are highlighted in the corresponding colors for each task.
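The Ranking task above asks the model to emit an ordering such as "Response 1 > Response 3 > Response 2". A small illustrative parser (our own sketch, not part of any released pipeline) that recovers the ranked response indices from that output format:

```python
def parse_ranking(text):
    # "Response 1 > Response 3 > Response 2" -> [1, 3, 2]
    # Split on ">" and keep the trailing integer of each segment.
    return [int(part.strip().split()[-1]) for part in text.split(">")]

order = parse_ranking("Response 1 > Response 3 > Response 2")
print(order)  # -> [1, 3, 2]
```

Parsing the ordering back into indices is what allows a rank-based label like this to be compared against human preference annotations.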

the definition until they reach an agreement. The task descriptions and evaluation protocols are also written by three human annotators under similar annotation protocols.

Metrics                              CNN     XSUM    AVG
ROUGE-L (Lin, 2004)                  0.324   -0.011  0.156
BERTScore (Zhang* et al., 2020)      0.505   0.008   0.256
MOVERScore (Zhao et al., 2019)       0.347   0.044   0.195
BARTScore (Yuan et al., 2021)        0.680   0.159   0.420
UniEval (Zhong et al., 2022)         0.662   0.488   0.575
GPTScore (Fu et al., 2023)           0.649   0.238   0.443
G-Eval (GPT-3.5) (Liu et al., 2023)  0.516   0.406   0.461
G-Eval (GPT-4) (Liu et al., 2023)    0.685   0.537   0.611
X-EVAL (Ours)                        0.656   0.500   0.578

Table 9: Spearman correlation on the summarization task based on the consistency aspect on QAGS. The best results are highlighted in bold. We also highlight the best results among lightweight (with <7B parameters) and open-source metrics with underline.

A.2 Augmenting Instruction-tuning Tasks

We show the seen aspects, the corresponding source datasets from which we collect the training data, the constructed tasks, and the number of training instances for each task in Table 11 and Table 12. When counting the number of aspects, we treat aspects with the same name but from different NLG tasks as different aspects. For example, the naturalness aspect in dialogue evaluation and in data-to-text evaluation are considered different under these two settings, although they share the same aspect name. More specifically, in our ASPECTINSTRUCT dataset, understandability is counted twice for dialogue-level and turn-level dialogue evaluation; naturalness is counted twice for turn-level dialogue evaluation and data-to-text evaluation; and informativeness is counted twice for dialogue-level dialogue evaluation and data-to-text evaluation. We also include an example of how we augment instruction-tuning tasks from the original annotation in Figure 6.

A.3 Aspect Definition

We present the annotated definitions in ASPECTINSTRUCT in the following. We show the definitions of seen aspects for dialogue evaluation in Table 13, unseen aspects for dialogue evaluation in Table 14, and the aspects for summarization in Table 15.

A.4 Source Datasets for Meta Evaluation

SummEval (Fabbri et al., 2021) is an evaluation benchmark for summarization which contains human ratings of 100 summaries along
four evaluation dimensions: fluency, coherence, Table 16 for dialogue evaluation and Table 17 for
consistency, and relevance. summarization evaluation.
QAGS (Wang et al., 2020b) is a benchmark for C More Details on Experiments
identifying and evaluating hallucinations in the
summarization task. It aims to measure the fac- More Implementation Details We use
tual inconsistencies of generated summaries. the checkpoint released on HuggingFace for
Flan-T5-large2 . In the first training stage, we
Topical-Chat (Gopalakrishnan et al., 2019) is a set the number of epochs to 2, the learning rate to
knowledge-grounded human-human conversation 5e-05, and the maximum source length to 1024.
dataset. Following (Zhong et al., 2022), we uti- The second training stage shares the same setup
lize human ratings collected by (Mehri and Eske- except the number of epochs set to 1. We set
nazi, 2020b) for Topical-Chat as the benchmark the maximum source length during inference to
for evaluating dialog response generation. The 2048 and pick the top-1 aspect during inference,
assessment consider five aspects: naturalness, i.e., k = 1. We use sentence-T5-large3 to
coherence, engagingness, groundedness, and compute the embeddings for aspect definition for
understandability. auxiliary aspect selection. All the experiments are
FED (Mehri and Eskenazi, 2020a) is an evalua- conducted on NVIDIA A40 GPUs including both
tion benchmark for fine-grained dialog evaluation. training and inference.
It comprises human annotations evaluated across eighteen dialog aspects at both the turn level and the dialog level.

SFHOT & SFRES (Wen et al., 2015) are evaluation benchmarks for the data-to-text task. They provide information about restaurants and hotels in San Francisco. The generated text is evaluated on two aspects: informativeness and naturalness.

More Details on Baselines We include more details on the following baselines that are omitted from the main paper due to the page limit: (4) ROUGE-L (Lin, 2004) counts the overlap (i.e., the longest common subsequence) between the text to be evaluated and the reference to indicate text quality; (5) DynaEval (Zhang et al., 2021) adopts a graph convolutional network to model the dialogue's structure to facilitate evaluation; (6) BERTScore (Zhang* et al., 2020) is a similarity-based evaluator that uses contextualized representations from BERT (Devlin et al., 2019) to compute the similarity between the generated text and the reference; (7) MoverScore (Zhao et al., 2019) goes beyond BERTScore by utilizing soft alignments and new aggregation methods on the layer-wise information; (8) USR (Mehri and Eskenazi, 2020b) is an unsupervised and reference-free evaluation metric that measures multiple desirable qualities of dialog; (9) BARTScore (Yuan et al., 2021) is a unified evaluator based on BART (Lewis et al., 2019) that uses the average likelihood of the model output as the metric. Note that for all single-aspect metrics, we compute the correlation between the single predicted evaluation and the human rating of each fine-grained aspect, respectively.

B More Details on X-EVAL

Pseudo-code of Inference Pipeline We provide the pseudo-code of our proposed inference pipeline for X-EVAL in Algorithm 1.

More Details on Verbalizer v and its Templates We design a template-based verbalizer to convert the evaluation results of auxiliary aspects into natural-language evaluations that can be integrated into the instructions. More formally, the inputs of the verbalizer v are an aspect a and an evaluation score s (in the range 0-1). We first apply a threshold δ (we set δ = 0.5 throughout all experiments) to obtain a binary label that indicates whether the quality is "positive" (if s > δ) or "negative" (if s ≤ δ). Given this label and the aspect a, we map the result into a natural-language template accordingly. The verbalized result is then integrated into the instructions. We construct the templates for each aspect by deriving them from the aspect definitions, applying an annotation protocol in which three human annotators revise the templates together until they reach a consensus. We show the verbalized templates in Tables 16 and 17.

More Results on the Effect of the Scale of Language Model Backbones We further conducted an experiment on using another language model

2 [Link] flan-t5-large
3 [Link] sentence-transformers/sentence-t5-large
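The verbalizer described above boils down to a threshold followed by a template lookup; a minimal Python sketch (the template strings are taken from Table 16, and only two aspects are shown for brevity):

```python
DELTA = 0.5  # threshold delta used throughout the experiments

# POS/NEG template strings for two aspects, copied from Table 16;
# the full verbalizer has one such pair for every aspect.
TEMPLATES = {
    "naturalness": {"POS": "The response is natural.",
                    "NEG": "The response is unnatural."},
    "fluency":     {"POS": "The response is fluently written.",
                    "NEG": "The response is not fluently written."},
}

def verbalize(aspect, score):
    """Map an auxiliary-aspect score s in [0, 1] to a natural-language verdict."""
    label = "POS" if score > DELTA else "NEG"  # s > delta => positive
    return TEMPLATES[aspect][label]
```

For example, `verbalize("naturalness", 0.8)` returns "The response is natural.", which is then integrated into the evaluation instruction.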
Model                        # Parameters  TopicalChat  SummEval  FED-Dialog  FED-Turn  Data2Text  AVG
X-EVAL-large (Default Ver.)  780M          0.605        0.480     0.485       0.398     0.303      0.454
X-EVAL-LLaMA-LoRA            7B            0.519        0.448     0.427       0.351     0.337      0.416

Table 10: Effect of the scale of language model backbones. For each meta-evaluation benchmark, we report the average Spearman correlation over all aspects.
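The Spearman correlations reported in Table 10 measure rank agreement between metric scores and human ratings; a minimal self-contained sketch (ties are handled with average ranks, and no significance testing is included):

```python
def average_ranks(xs):
    """Assign 1-based ranks to values, averaging the ranks of tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A perfectly monotone relationship between metric scores and human ratings yields rho = 1, regardless of the scales the two are expressed in.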

Aspect (Datasets): Task (# Instances)

Accuracy (TL;DR (Völske et al., 2017)): Scoring (5,000), Boolean QA (5,000), Comparison (898), Ranking (599)
Coherence (TL;DR (Völske et al., 2017), UniEval (Zhong et al., 2022)): Scoring (5,000), Boolean QA (5,000), Comparison (734), Ranking (425)
Coverage (TL;DR (Völske et al., 2017)): Scoring (5,000), Boolean QA (4,354), Comparison (1,028), Ranking (964)
Consistency (UniEval (Zhong et al., 2022)): Boolean QA (15,000)
Fluency (UniEval (Zhong et al., 2022)): Boolean QA (15,000)
Relevance (UniEval (Zhong et al., 2022)): Boolean QA (15,000)

Table 11: The full list of aspects, the corresponding datasets, and tasks on summarization evaluation collected in the training split of ASPECTINSTRUCT.
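The four task formats in Table 11 can all be derived from human-scored instances; the sketch below illustrates one plausible recasting (the prompt wordings and field names are hypothetical, not the actual ASPECTINSTRUCT prompts):

```python
# Illustrative sketch: recasting scored outputs into the four task
# formats (Scoring, Boolean QA, Comparison, Ranking). Wordings and
# field names are made up for illustration.

def to_scoring(aspect, summary, score):
    return {"task": "scoring",
            "instruction": f"Rate the {aspect} of the summary.",
            "target": f"{score:.1f}"}

def to_boolean_qa(aspect, summary, score, threshold=0.5):
    return {"task": "boolean_qa",
            "instruction": f"Is the summary good in terms of {aspect}?",
            "target": "Yes" if score > threshold else "No"}

def to_comparison(aspect, pair_a, pair_b):
    # pair = (summary, score); for dialogue, Table 12 keeps tied pairs
    # as a "None Of The Above" (NOTA) option.
    (_, sa), (_, sb) = pair_a, pair_b
    return {"task": "comparison",
            "instruction": f"Which summary has better {aspect}, A or B?",
            "target": "A" if sa > sb else "B" if sb > sa else "NOTA"}

def to_ranking(aspect, scored_summaries):
    # Order candidate summaries from best to worst by their score.
    order = sorted(scored_summaries, key=lambda p: p[1], reverse=True)
    return {"task": "ranking",
            "instruction": f"Rank the summaries by {aspect}.",
            "target": " > ".join(name for name, _ in order)}
```

The differing instance counts per task in Table 11 would then simply reflect how many scored instances (or non-tied pairs) are available for each recasting.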

Aspect (Datasets): Task (# Instances)

Relevance (DailyDialog++ (Sai et al., 2020)): Scoring (2,000), Boolean QA (2,000), Comparison (2,000), Comparison w/ NOTA (2,000)
Coherence (HolisticDial (Pang et al., 2020); DSTC9 (Gunasekara et al., 2020); UniEval (Zhong et al., 2022)): Scoring (2,400), Boolean QA (17,200)
Consistency (DSTC9 (Gunasekara et al., 2020)): Scoring (2,200), Boolean QA (2,200)
Diversity (DSTC9 (Gunasekara et al., 2020)): Scoring (2,200), Boolean QA (2,200)
Engagingness (UniEval (Zhong et al., 2022)): Boolean QA (15,000)
Groundedness (UniEval (Zhong et al., 2022)): Boolean QA (15,000)
Naturalness (UniEval (Zhong et al., 2022)): Boolean QA (15,000)
Fluency (HolisticDial (Pang et al., 2020)): Scoring (200)

Table 12: The full list of aspects, the corresponding datasets, and tasks on dialogue evaluation collected in the training split of ASPECTINSTRUCT. "NOTA" indicates that the comparison task includes the case of "None Of The Above", where the quality of the two candidates is tied.

backbone. Specifically, we adopt LLaMA-7B-chat (Touvron et al., 2023a) as the backbone model and adopt LoRA parameter-efficient tuning (Hu et al., 2022) during the two-stage instruction tuning. We report the performance in Table 10.
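LoRA (Hu et al., 2022) freezes the pretrained weights and trains only a low-rank additive update; a minimal numpy sketch of the reparameterization (the dimensions, rank, and scaling below are illustrative, not our actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 4, 8  # illustrative sizes; r << d

W = rng.normal(size=(d_in, d_out))     # frozen pretrained weight
A = rng.normal(size=(d_in, r)) * 0.01  # trainable down-projection
B = np.zeros((r, d_out))               # trainable up-projection, zero-init

def lora_forward(x):
    # y = x W + (alpha / r) x A B. Only A and B are updated,
    # r * (d_in + d_out) parameters instead of d_in * d_out; with
    # B = 0 at initialization the layer matches the frozen model.
    return x @ W + (alpha / r) * (x @ A @ B)
```

Here only 4 * (16 + 16) = 128 parameters are trainable versus 256 in W; at the 7B scale of LLaMA this gap is what makes the two-stage tuning affordable.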
Naturalness: Naturalness in dialogue evaluation refers to the degree to which a response in a conversational context mirrors the characteristics, language use, and structure typical of a human conversational partner.

Coherence: Coherence refers to the logical and consistent interconnection of utterances and exchanges throughout a conversation. It represents the extent to which a dialogue system maintains relevance, consistency, and meaningful progression within the discourse, ensuring that the flow and structure of the conversation align with expected conversational norms and the ongoing context.

Engagingness: Engagingness in the context of dialogue evaluation refers to the degree to which a response fosters continued interaction, maintains or elevates interest, and stimulates a compelling exchange of ideas, emotions, or information between participants.

Groundedness: Dialogue groundedness measures how well the response uses the given fact. A response with weak groundedness does not mention or refer to the fact at all; a response with good groundedness uses the fact well.

Relevance: Relevance in dialogue evaluation refers to the measure of applicability, pertinence, or connection of a given response to the preceding conversational context and/or the explicitly posed question or statement.

Fluency: Fluency in dialogue evaluation refers to the degree of fluidity, coherence, and linguistic correctness in a generated response. It encompasses not only grammatical and syntactic accuracy but also the seamless flow of ideas, the smooth transition between topics, and the naturalness of the language used, echoing human-like conversation patterns.

Table 13: The full list and definitions of seen aspects on dialogue evaluation collected in ASPECTINSTRUCT.
Topic Depth: Topic depth refers to the ability of a dialogue system to engage in extensive, detailed, and multi-turn discussions on a particular subject.

Likeability: Likeability refers to the degree to which an interactive system presents a pleasant, engaging, and affable conversational style that resonates positively with the user.

Understandability: Understandability reflects the ability of a conversational system to correctly parse and interpret user inputs, reflect an appropriate comprehension of the context, and generate contextually relevant responses.

Flexibility: Flexibility measures the system's capacity to understand and react appropriately to a wide range of conversational scenarios, not merely those for which it was explicitly programmed or trained. It implies the capacity to engage in a diverse array of topics, offer meaningful responses in unexpected situations, and adjust conversational strategies based on the evolving context or user input.

Informativeness: Informativeness refers to the quality and relevance of the information that a dialogue system provides in response to user inputs. It captures the system's ability to offer novel, detailed, accurate, and appropriate information that aligns with the user's requests or needs.

Inquisitiveness: Inquisitiveness pertains to the consistent exhibition of the capacity to ask meaningful, contextually appropriate, and well-timed questions within a conversation by a dialogue system. This behavior is exhibited in the pursuit of greater comprehension, clarifying ambiguities, furthering the dialogue, or driving deeper engagement with the conversation partner.

Interestingness: Interestingness refers to the degree to which a response stimulates engagement, thought, or emotional reaction in the average user, fostering a desire to continue the conversation or explore the topic further. It is a measure of the response's capacity to capture the user's attention and maintain their engagement over time.

Specificity: Specificity measures to what degree the response is unique, personalized, or pertinent to the specific details of the preceding user inputs or dialogue context, as opposed to being generic, universally applicable, or independent of the conversational specifics.

Correctness: Correctness in dialogue evaluation measures the extent to which a generated response correctly reflects, comprehends, and addresses the salient elements, inferences, and implications in the preceding conversation context.

Semantic Appropriateness: Semantic appropriateness is the measure of the extent to which a response in a dialogue maintains logical, meaningful, and contextually fitting alignment with the preceding discourse elements, while adhering to the rules and principles of the language used in the conversation.

Table 14: The full list and definitions of unseen aspects on dialogue evaluation collected in ASPECTINSTRUCT.
Accuracy: The accuracy aspect measures how accurately the factual information in the summary matches the post. A summary is accurate if it doesn't say things that aren't in the article, it doesn't mix up people, and generally is not misleading. If the summary says anything at all that is not mentioned in the post or contradicts something in the post, it should be considered an inaccurate summary.

Coherence: The coherence aspect measures how coherent the summary is on its own. A summary is coherent if, when read by itself, it's easy to understand and free of English errors. A summary is not coherent if it's difficult to understand what the summary is trying to say. Generally, it's more important that the summary is understandable than that it is free of grammar errors.

Coverage: The coverage aspect measures how well the summary covers the important information in the post. A summary has good coverage if it mentions the main information from the post that's important to understand the situation described in the post. A summary has poor coverage if someone reading only the summary would be missing several important pieces of information about the situation in the post. A summary with good coverage should also match the purpose of the original post (e.g., to ask for advice).

Consistency: The consistency aspect measures the factual alignment between the summary and the summarized source. A factually consistent summary contains only statements that are entailed by the source document. You also need to penalize summaries that contain hallucinated facts.

Fluency: Fluency measures the quality of individual sentences. A fluent summary should have no formatting problems, capitalization errors, or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read.

Relevance: Relevance measures the selection of important content from the source. The summary should include only important information from the source document. You should penalize summaries which contain redundancies and excess information.

Table 15: The full list and definitions of aspects of summarization evaluation collected in ASPECTINSTRUCT.
Naturalness
  NEG: The response is unnatural.
  POS: The response is natural.

Coherence
  NEG: The response drastically changes topic or ignores the conversation history.
  POS: The response is on topic and strongly acknowledges the conversation history.

Engagingness
  NEG: The response is generic and dull.
  POS: The response is interesting or presents an interesting fact.

Groundedness
  NEG: Given the interesting fact that the response is conditioned on, the response does not mention or refer to the fact at all.
  POS: Given the interesting fact that the response is conditioned on, the response uses the fact well.

Relevance
  NEG: The response is not relevant to the conversation.
  POS: The response is relevant to the conversation.

Fluency
  NEG: The response is not fluently written.
  POS: The response is fluently written.

Topic Depth
  NEG: The system cannot discuss topics in depth.
  POS: The system is able to discuss topics in depth.

Likeability
  NEG: The system cannot display a likeable personality.
  POS: The system is able to display a likeable personality.

Understandability
  NEG: The response is difficult to understand. You do not know what the person is trying to say.
  POS: The response is understandable. You know what the person is trying to say.

Flexibility
  NEG: The system is not flexible and adaptable to the user and their interests.
  POS: The system is flexible and adaptable to the user and their interests.

Informativeness
  NEG: The system is not informative throughout the conversation.
  POS: The system is informative throughout the conversation.

Inquisitiveness
  NEG: The system is not inquisitive throughout the conversation.
  POS: The system is inquisitive throughout the conversation.

Interestingness
  NEG: To the average person, the response is not interesting.
  POS: To the average person, the response is interesting.

Specificity
  NEG: The response is too generic and not specific to the conversation.
  POS: The response is specific to the conversation.

Correctness
  NEG: There was a misunderstanding of the conversation.
  POS: The response is correct in the context of the conversation.

Semantic Appropriateness
  NEG: The response is not semantically appropriate.
  POS: The response is semantically appropriate.

Table 16: The full list of verbalizer templates that are used to convert the evaluation results of auxiliary aspects for dialogue evaluation collected in ASPECTINSTRUCT. "POS" and "NEG" indicate "positive" and "negative", respectively.
Accuracy
  NEG: The factual information in the summary cannot accurately match the post. It says things that aren't in the article, it mixes up people, or generally is misleading.
  POS: The factual information in the summary accurately matches the post. It doesn't say things that aren't in the article, it doesn't mix up people, and generally is not misleading.

Coherence
  NEG: The summary is not coherent as it lacks a logical flow and has disjointed information, making it difficult to understand the main topic or argument.
  POS: The summary is well-structured and well-organized, building from sentence to sentence into a coherent body of information about a topic.

Coverage
  NEG: The summary has poor coverage of the important information in the post, e.g., someone reading only the summary would be missing several important pieces of information about the situation in the post.
  POS: The summary has good coverage since it mentions the main information from the post that's important to understand the situation described in the post and also matches the purpose of the original post.

Consistency
  NEG: The summary is not factually consistent with the original post as it introduces factual inaccuracies or hallucinated facts that are not present in or supported by the original source document.
  POS: The summary has good factual alignment with the summarized source. It contains only statements that are entailed by the source document.

Fluency
  NEG: The summary is not fluent as it contains formatting problems, capitalization errors, or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read.
  POS: This is a fluent summary as it generally does not have formatting problems, capitalization errors, or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read.

Relevance
  NEG: This summary is not relevant to the source document as it contains redundancies or excess information.
  POS: The summary generally includes relevant content, capturing some key points from the source.

Table 17: The full list of verbalizer templates that are used to convert the evaluation results of auxiliary aspects for summarization evaluation collected in ASPECTINSTRUCT. "POS" and "NEG" indicate "positive" and "negative", respectively.
