Matics Software Suite Overview
Olivier Galibert, Guillaume Bernard, Agnes Delaborde, Sabrina Lecadre, Juliette Kahn
Laboratoire national de métrologie et d’essais
1 rue Gaston Boissier - 75015 Paris, France
[Link]@[Link]
Abstract
Matics is a free and open-source software suite for exploring annotated data and evaluation results. It proposes a dataframe data model
allowing the intuitive exploration of data characteristics and evaluation results and provides support for graphing the values and running
appropriate statistical tests. The tools already run on several Natural Language Processing tasks and standard annotation formats, and
are under on-going development.
• DATOMATIC: It is designed for the importation and database indexation of corpora and files. The data can be made up of reference data (e. g. labeled by an expert) and hypothesis data (output of an NLP system, automatically labeled). Source data (i. e. unlabeled and/or unstructured) can also be included, such as plain text or audio. The data can be browsed via search features, and visualized according to their types (text, video, audio and the related annotations). The software offers several descriptive statistics (signal duration, number of words/speakers/entities, file or language distribution...). Multi-criteria sub-selections on the corpora can be performed. The resulting corpora can be locally exported to be processed in Evalomatic.

• EVALOMATIC: Evalomatic works exclusively on Datomatic-formatted databases. Evalomatic allows running evaluations, for example comparisons between reference and hypothesis data for speech transcription tasks. The reference and hypothesis data (as well as the evaluation results) are structured as dataframes, which allows performing several manipulations on the data for an evaluation at different levels of granularity. The software offers several standard comparison metrics (e. g. F-measure, Slot Error Rate SER), some of which are specifically designed for NLP (e. g. Word Error Rate WER). Statistical functions are provided (e. g. t-tests or Anova). Data and results can be plotted on graphs (e. g. DET plot, bar chart).

Matics is an on-going work, initially developed to address our team’s evaluation needs. The decision to publicly release it is motivated by our wish to contribute to a thriving development of NLP technologies, and of artificially intelligent systems as a whole. In its earlier stages, the software suite presents some limitations: we do not guarantee it is fully bug-free, many features are left to add, and as of now the interface is only available in French. Evaluation being our core activity, the development of Matics is one of our main priorities, and there are, and will be, constant updates.

2.2. Availability
The Matics suite is free and open-source. It can be downloaded at: [Link]

2.3. Supported NLP Tasks
As of now, Matics allows performing evaluations on NLP systems for these tasks:

2.4. Supported External Formats
Matics supports several standard structured formats, like XML (e. g. Transcriber) or the Tab Delimited Format of XTrans. It also supports annotation formats such as:

• The stm and ctm file formats (used by the sclite program developed by NIST for the evaluation of speech recognizers);

• The CoNLL-X format (Tjong Kim Sang and De Meulder, 2003);

• MUC-7 (Chinchor and Robinson, 1997);

• QUAERO (Galibert et al., 2011).

As of now, unsupported formats need to be externally transformed into a supported format so as to be loadable in Datomatic, but supporting new formats for an already handled task requires a reasonable amount of effort.

2.5. Implemented Metrics
• ASR: WER (Word Error Rate); CER (Character Error Rate); NCE (Normalised Cross-Entropy)

• NER: SER (Slot Error Rate); ETER (Entity Tree Error Rate)

• Speaker verification: EER (Equal Error Rate); Cdet (Cost of DETection); Cllr (Cost Log-Likelihood Ratio)

• General metrics: F-measure; Recall; Precision

These metrics cover the evaluation of the NLP applications described hereinbefore. New metrics will be added along with the expansion of the NLP tasks list.

2.6. Statistical Functions
A toolbox of several standard statistical functions is available. The result of these functions can be used as new columns in the dataframe, meaning that they can be used as a test statistic in the evaluation.

• Descriptive statistics:

– Gaussian statistics: mean, standard deviation, skewness, kurtosis

– Distributional statistics: min, max, median, first and last quartile, first and last decile, mode
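As an illustration of the ASR metrics above, the WER boils down to an edit distance over word sequences. The following minimal Python sketch (illustrative only, not the Matics implementation) shows the idea:

```python
# Illustrative only: a minimal Word Error Rate computation via
# Levenshtein edit distance over word sequences (not the Matics code).

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# 1 substitution + 1 deletion over 6 reference words -> 2/6
print(wer("the cat sat on the mat", "the cat sit on mat"))
```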
3. Matics Capabilities and Concepts
3.1. Data Management
3.1.1. Dataframe
The main underlying concept used in the Evalomatic interface is the Dataframe. It is a table roughly equivalent to a single SQL table or an R data frame. Each column has a two-part name: a group name and a column name. For instance the “speaker” group may have the columns “[Link]”, “[Link]” and “[Link]”. Each column has a type that is built from four traits:

1. The column may contain labels or values: labels are names of categories (file name, speaker name, gender, turn id...) while values are actual values (time, score, word, text segment...).

2. The datatype of the column content can be string or numeric (integer for labels, floating-point for values).

3. The column can store the initial values/labels, or values computed from other columns through expressions.

4. (optional) The column can have a sub-type that tells the interface how to show or interpret the values. Currently defined sub-types are name, time, p-value (for statistical tests) and correlation (for correlation tests).

Non-expression columns actually store data. A stored value often spans multiple lines. For instance, a speech transcription evaluation dataframe has one line per aligned word. In that dataframe, the turn start and end times span all the words of that turn. That spanned information is explicitly stored in the dataframe. In addition, some cells can be empty, which is a different status from zero or an empty string.

3.1.2. Granularity and Foldable Categories
A key capability of that dataframe structure is a variable granularity. Lines can be folded together, and columns optionally have a folding method, called a reduce operation, which defines how the value for the folded lines is computed. A number of reduce operations are already available: min, max, mean, median and sum for numeric values, concatenation for string values. Expression columns either include a reduce operation, in which case they compute their value at the lowest possible granularity and then apply the operation, or do not include a reduce operation and compute their value from the reduced values of the other columns.
To illustrate that capability, two examples can be given. Computing the WER in speech transcription is done by dividing the count of errors by the number of words in the reference. The WER is then an expression column without a reduce operation which divides the value in the error column by the value in the reference word count column. These two source columns, on the other hand, have a “sum” reduce operation to accumulate the counts of errors and words at the required granularity.
In contrast, computing the total speech time is done from the speech duration column, which is an expression defined as turn end time minus turn start time, with a sum reduce operation. In that configuration the durations are computed at the turn level and summed together, giving the total turn time. The spans of the values in the start and end time columns are what sets the duration computation granularity.

3.2. Evaluation capabilities
As detailed in Sections 2.3. and 2.5., Matics can deal with the evaluation of several NLP tasks and implements the corresponding metrics. All the input formats are converted to reference or hypothesis dataframes which are then used to build an evaluation dataframe with a complete alignment of the texts.
The ASR evaluation subsystem, for example, is able to work at the word or character level, and to take the case into account when requested to. It uses Unicode for multilingual support.
The final evaluation dataframe contains the full alignment and the error counts per type, with computed columns added to provide WER/CER (Character Error Rate) and NCE (Normalised Cross-Entropy) at any chosen granularity.

3.3. Statistical analysis capabilities
One aim of the interface is to give fast access to statistical testing capabilities. The list of the currently available functions has been presented in Section 2.6. A uniform, drag-and-drop based interface is proposed to select the data columns the testing applies to.
In the case of the standard descriptive statistics on a value, for example, the user selects a value column to compute the statistic on (for instance WER) and a label column for the granularity (for instance speech turn). They can also optionally select a column as a factor (for instance “System” – the NLP system whose output is being evaluated) to compute a series of statistics instead of a global one. The computation of these statistics allows the user to summarize the distribution of the values and get an idea of how Gaussian and symmetrical they are.
The second available analysis is a very common one: significance of a difference for paired values. The user selects a value (e. g. WER), a pairing/granularity (speech turn) and the factor to analyze (system), and the interface computes, for each system pair, the p-value, i. e. the probability that the WERs are in practice identical and the differences only randomness. It can use either a Student paired t-test if the user considers the values Gaussian (which is rare), or a less powerful but more robust Wilcoxon paired-difference test otherwise.
The third analysis is a correlation test between two value columns to, for instance, check whether the WER is correlated with the turn duration. The user selects the values to compare and the granularity. The interface then computes three standard correlation values: Pearson’s r (linear correlation), Kendall’s τ and Spearman’s ρ (rank correlations).
Finally, a fourth analysis method is implemented: the Anova. It is used to measure the importance of different factors on a result, and measures how much of the variance can be explained by each factor. It should be available by the time the final papers are due.
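The two examples above can be mimicked with an ordinary pandas dataframe. This is a sketch of the concept only, not the Matics data model; the data, file names and column names are invented:

```python
import pandas as pd

# One line per aligned word. "errors" and "ref_words" carry a "sum"
# reduce operation; "WER" is an expression column evaluated AFTER folding.
df = pd.DataFrame({
    "file":       ["a", "a", "a", "b", "b"],
    "turn_start": [0.0, 0.0, 0.0, 1.2, 1.2],  # spans all words of the turn
    "turn_end":   [2.5, 2.5, 2.5, 3.0, 3.0],
    "errors":     [1, 0, 1, 0, 1],
    "ref_words":  [1, 1, 1, 1, 1],
})

# Fold the lines per file: sum the source columns, then evaluate the
# WER expression on the reduced values.
per_file = df.groupby("file")[["errors", "ref_words"]].sum()
per_file["WER"] = per_file["errors"] / per_file["ref_words"]

# Total speech time: duration = end - start is evaluated once per turn
# (the spanned start/end values set the granularity), then summed.
turns = df.drop_duplicates(["file", "turn_start", "turn_end"])
total_time = (turns["turn_end"] - turns["turn_start"]).sum()

print(per_file)
print(total_time)  # total over both turns
```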
4. User Interface
This section presents some views of the user interface offered by Matics.

4.1. Dataframe views

Figure 1: Evaluation dataframe with every line folded but the system name. The WER is updated and represents the global score for all the files of the system.

a dataframe with an appropriate alignment to the original signal.
The audio display and listening are currently available. The interface allows listening to the signal at different levels: the whole signal, per speaker, per sentence, or per word. The segmentation follows the timestamps defined in the corresponding annotation file. Figure 3 shows a screen capture of the interface.

4.4. Graphing
The other main capability of the interface is graphing data, to ease the visualization of data and results.
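As a rough analogue of this graphing capability, a per-system WER bar chart can be produced with matplotlib. The scores below are invented, and this is not the Matics interface itself:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Invented per-system WER scores, purely for illustration.
systems = ["sys1", "sys2", "sys3"]
wers = [0.21, 0.18, 0.25]

fig, ax = plt.subplots()
ax.bar(systems, wers)
ax.set_ylabel("WER")
ax.set_title("Global WER per system")
fig.savefig("wer_per_system.png")
```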
Figure 3: Audio signal and the associated transcription. A click on each token (one per rectangle in the center area) plays
the corresponding audio segment.
Figure 4: Statistical paired difference configuration interface. In blue (e. g. comments, lang, system): labels of the columns;
in red (e. g. start time, end time): values of the columns. Labels in English have been added on the figure to translate the
French items.
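The paired-difference analysis configured in this interface corresponds to standard paired tests; a minimal sketch with scipy (the per-turn WER values of the two systems are invented):

```python
# Sketch of the paired-difference analysis with scipy; the per-turn
# WER values of the two systems below are invented.
from scipy import stats

wer_sys_a = [0.10, 0.25, 0.33, 0.00, 0.50, 0.20, 0.12, 0.40]
wer_sys_b = [0.08, 0.20, 0.30, 0.00, 0.45, 0.15, 0.10, 0.35]

# Student paired t-test: assumes the per-pair differences are Gaussian.
t_stat, t_p = stats.ttest_rel(wer_sys_a, wer_sys_b)

# Wilcoxon signed-rank test: less powerful but distribution-free.
w_stat, w_p = stats.wilcoxon(wer_sys_a, wer_sys_b)

print(f"t-test p = {t_p:.4f}, Wilcoxon p = {w_p:.4f}")
```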
Figure 5: Scatterplot of WER vs. speech duration in a file.

also concerned with a localization feature, to broaden out the system to the non-French speaking community. Although the interface vocabulary may be quite transparent to computer scientists and statisticians, that would be a strong requirement in terms of ergonomics. The localization pro-

6. Bibliographical References
Bernard, G., Galibert, O., Rémi, R., Demeyer, S., and Kahn, J. (2016). LNE-Visu : une plateforme d’exploration et de visualisation de données d’évaluation (LNE-Visu: a platform for the exploration and display of evaluation data). In Proceedings of the JEP-TALN-Recital joint conference, 07.
Brunessaux, S., Giroux, P., Grilhères, B., Manta, M., Bodin, M., Choukri, K., Galibert, O., and Kahn, J.
Figure 6: DET curves for two simulated systems (named h1b and h1) with EER and Cdet decision thresholds.
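For reference, an EER operating point like the ones marked in this figure can be located by sweeping a decision threshold over the scores; below is a simplified numpy sketch with invented scores (not the Matics implementation):

```python
import numpy as np

# Invented detection scores (higher = more target-like), for illustration.
target    = np.array([2.1, 1.7, 0.9, 2.5, 1.2, 1.9])    # genuine trials
nontarget = np.array([0.2, -0.5, 1.0, 0.1, -1.2, 0.6])  # impostor trials

# Sweep every observed score as a decision threshold and keep the one
# where false acceptance and false rejection rates are closest (the EER).
thresholds = np.sort(np.concatenate([target, nontarget]))
best = min(thresholds,
           key=lambda t: abs((nontarget >= t).mean() - (target < t).mean()))
far = (nontarget >= best).mean()  # false acceptance rate
frr = (target < best).mean()      # false rejection rate
eer = (far + frr) / 2

print(f"threshold = {best:.2f}, EER = {eer:.3f}")
```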
speech 2017 computational paralinguistics challenge: Addressee, cold & snoring. In Computational Paralinguistics Challenge (ComParE), Interspeech 2017, pages 3442–3446.
Tjong Kim Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Walter Daelemans et al., editors, Proceedings of CoNLL-2003, pages 142–147. Edmonton, Canada.