
Matics Software Suite: New Tools for Evaluation and Data Exploration

Olivier Galibert, Guillaume Bernard, Agnes Delaborde, Sabrina Lecadre, Juliette Kahn
Laboratoire national de métrologie et d’essais
1 rue Gaston Boissier - 75015 Paris, France
[Link]@[Link]

Abstract
Matics is a free and open-source software suite for exploring annotated data and evaluation results. It proposes a dataframe data model
allowing the intuitive exploration of data characteristics and evaluation results, and provides support for graphing the values and running
appropriate statistical tests. The tools already run on several Natural Language Processing tasks and standard annotation formats, and
are under ongoing development.

1. Introduction

The evaluation of data processing systems is a cornerstone for developers, researchers and users. The evaluation allows positioning a technology with regard to the competition, but also allows assessing the performance of the system in different contexts. Through quantified scores, it orientates the development or guides the user towards the most suitable product. As evidenced by the popularity of evaluation competitions such as the VarDial evaluation campaigns (Malmasi et al., 2016), the Interspeech challenges (Schuller et al., 2017), the CoNLL Shared Tasks (Oepen et al., 2017) or the Evalita evaluation campaign (Pierpaolo et al., 2017), there is a constant need for the rapidly-evolving Natural Language Processing (NLP) technologies to position themselves.

The evaluation of text, speech or multimedia data processing systems relies on large amounts of usually annotated data. Numerous works propose interfaces or frameworks to build, explore and visualize corpora of annotated data. ANVIL (Kipp, 2010), for instance, proposes a well-conceived database-oriented annotation tool where the user can add temporally or spatially grounded elements. The annotated data can be exported to perform statistical analyses in several external systems. UAM (O'Donnell, 2008) emphasizes the project management aspects of a multi-layer text annotation task and offers dedicated statistical analysis tools. Headtalk/Handtalk (Knight et al., 2009) explores the annotation and visualization of multimodal corpora, for the purpose of building datasets suitable for statistical analysis. Some works propose a framework able to explore data in a specific context: (Schmitt et al., 2010), for instance, is dialogue-oriented and presents a multi-level interface including dialog selection from a database, display of the selected dialog, and application and evaluation of integrated prediction models for various characteristics (task completion, anger level, age and gender predictions).

All these systems are mainly dedicated to annotation tasks and/or to specific NLP applications. They offer data exploration features, and can either produce data formatted to perform an evaluation in an external system, or offer statistical analysis specialized for testing coherency or measuring progress on a specific NLP task.

The LNE (laboratoire national de métrologie et d'essais, the French national metrology and testing laboratory) has conducted many evaluations of data-processing systems in projects such as Quaero (Galibert et al., 2011), ETAPE (Gravier et al., 2012), MAURDOR (Brunessaux et al., 2014), PEA-TRAD or REPERE (Giraudel et al., 2012; Galibert and Kahn, 2013). These evaluations concerned various NLP tasks and systems (speech recognition, speaker diarization, speaker identification, named entity recognition, optical character recognition, etc.), which implied dealing with different system output formats, annotation guides, and comparison metrics. A number of commonalities appeared over time in the process of such evaluations, in the pre-processing and exploration of the data and in the computation and viewing of statistical scores, hence the need for a reusable and general framework to carry out the evaluations.

One aspect we are especially interested in is the ability to assess the representativity of the different sub-corpora created (train, development, test) and to identify factors of influence on the performance of the system. Such an analysis is usually done through a mix of independent evaluation tools, ad-hoc data extraction scripts and generic analysis engines (such as R), or with tools dedicated to a specific NLP task (such as the NIST Scoring Toolkit SCTK (NIST, 2015) for speech recognition). This works perfectly well for evaluations on specific applications or on databases of average size; it becomes somewhat burdensome when performing large-scale evaluations on a wide panel of application types.

We thus decided to build a new tool to provide a unified response to our evaluation needs, by first testing some data handling and UI prototypes in a pre-project called LNE-Visu, presented in a demonstration at the French JEP-TALN-Recital joint conference in 2016 (Bernard et al., 2016).

Then, taking the results into account, we started an internal project to build the Matics software suite, to implement the vision we have of such an exploration interface. It integrates evaluation, exploration at varying granularity, graphical representations and statistical testing. All these aspects are presented in this paper.

2. Matics at a Glance

2.1. General Description

Matics comprises two interconnected software tools:

• DATOMATIC: It is designed for the importation and database indexation of corpora and files. The data can be made up of reference data (e.g. labeled by an expert) and hypothesis data (output of an NLP system, automatically labeled). Source data (i.e. unlabeled and/or unstructured), such as plain text or audio, can also be included. The data can be browsed via search features, and visualized according to their types (text, video, audio and the related annotations). The software offers several descriptive statistics (signal duration, number of words/speakers/entities, file or language distribution...). Multi-criteria sub-selections on the corpora can be performed, and the resulting corpora can be locally exported to be processed in Evalomatic.

• EVALOMATIC: Evalomatic works exclusively on Datomatic-formatted databases. Evalomatic allows running evaluations, for example comparisons between reference and hypothesis data for speech transcription tasks. The reference and hypothesis data (as well as the evaluation results) are structured as dataframes, which allows performing several manipulations on the data for an evaluation at different levels of granularity. The software offers several standard comparison metrics (e.g. F-measure, Slot Error Rate SER), some of which are specifically designed for NLP (e.g. Word Error Rate WER). Statistical functions are provided (e.g. t-tests or Anova). Data and results can be plotted on graphs (e.g. DET plot, bar chart).

Matics is an ongoing work, initially developed to address our team's evaluation needs. The decision to publicly release it is motivated by our wish to contribute to a thriving development of NLP technologies, and of artificially intelligent systems as a whole. In its earlier stages, the software suite presents some limitations: we do not guarantee it is fully bug-free, many features remain to be added, and as of now the interface is only offered in French. Evaluation being our core activity, the development of Matics is one of our main priorities, and there are, and will be, constant updates.

2.2. Availability

The Matics suite is free and open-source. It can be downloaded at: [Link]

2.3. Supported NLP Tasks

As of now, Matics allows performing evaluations on NLP systems for these tasks:

• Automatic Speech Recognition (ASR)

• Named Entity Recognition (NER)

• Tokenization

• Lemmatization

• Speaker verification

Note that Matics supports Latin and non-Latin alphabet languages (Chinese, Arabic, Russian...).

2.4. Supported External Formats

Matics supports several standard structured formats, like XML (e.g. Transcriber) or the Tab Delimited Format of XTrans. It also supports annotation formats such as:

• The stm and ctm file formats (used in the sclite program developed by the NIST for the evaluation of speech recognizers);

• The CoNLL-X format (Tjong Kim Sang and De Meulder, 2003);

• MUC-7 (Chinchor and Robinson, 1997);

• QUAERO (Galibert et al., 2011).

As of now, unsupported formats need to be externally transformed into a supported format so as to be loadable in Datomatic, but supporting new formats for an already handled task requires a reasonable amount of effort.

2.5. Implemented Metrics

• ASR: WER (Word Error Rate); CER (Character Error Rate); NCE (Normalised Cross-Entropy)

• NER: SER (Slot Error Rate); ETER (Entity Tree Error Rate)

• Speaker verification: EER (Equal Error Rate); Cdet (Cost of Detection); Cllr (Cost Log-Likelihood Ratio)

• General metrics: F-measure; Recall; Precision

These metrics cover the evaluation of the NLP applications described hereinbefore. New metrics will be added along with the expansion of the list of NLP tasks.

2.6. Statistical Functions

A toolbox of several standard statistical functions is available. The result of these functions can be used as new columns in the dataframe, meaning that they can be used as a test statistic in the evaluation.

• Descriptive statistics:
  – Gaussian statistics: mean, standard deviation, skewness, kurtosis
  – Distributional statistics: min, max, median, first and last quartile, first and last decile, mode

• Significance tests on paired experiments:
  – Gaussian: paired t-test
  – Non-parametric: Wilcoxon

• Correlation tests:
  – Pearson linear correlation
  – Rank correlation (Kendall, Spearman)

• Anova
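As a concrete illustration of the ASR metrics listed in Section 2.5., the Word Error Rate can be sketched as an edit distance between token sequences, normalised by the reference length. This is a minimal sketch, not the Matics implementation; splitting into characters instead of words gives the CER with the same routine.

```python
# Illustrative WER sketch (not Matics code): minimal edit distance between
# reference and hypothesis token sequences, divided by the reference length.
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimal edit cost between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution over four reference words: WER = 0.25
print(word_error_rate("the white cat sleeps", "the wide cat sleeps"))  # 0.25
```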

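Similarly, the Equal Error Rate used for speaker verification can be sketched as the operating point where the false-acceptance and false-rejection rates cross. The scores below are made up for illustration; Matics' actual computation may differ.

```python
# Illustrative EER sketch (not Matics code): sweep candidate decision
# thresholds and return the error rate where false-acceptance (FAR) and
# false-rejection (FRR) rates are closest to equal.
def equal_error_rate(target_scores, impostor_scores):
    best = (1.0, 1.0)  # (|FAR - FRR|, error rate at that threshold)
    for t in sorted(target_scores + impostor_scores):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

targets = [0.9, 0.8, 0.7, 0.6, 0.3]     # hypothetical genuine-speaker scores
impostors = [0.5, 0.4, 0.2, 0.1, 0.35]  # hypothetical impostor scores
print(equal_error_rate(targets, impostors))  # 0.2
```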
3. Matics Capabilities and Concepts

3.1. Data Management

3.1.1. Dataframe

The main underlying concept used in the Evalomatic interface is the Dataframe. It is a table roughly equivalent to a single SQL table or an R data frame. Each column has a two-part name: a group name and a column name. For instance the "speaker" group may have the columns "[Link]", "[Link]" and "[Link]". Each column has a type that is built from four traits:

1. The column may contain labels or values: labels are names of categories (file name, speaker name, gender, turn id...) while values are actual values (time, score, word, text segment...).

2. The datatype of the column content can be string or numeric (integer for labels, floating-point for values).

3. The column can store the initial values/labels, or values computed from other columns through expressions.

4. (optional) The column can have a sub-type that tells the interface how to show or interpret the values. Currently defined sub-types are name, time, p-value (for statistical tests) and correlation (for correlation tests).

Non-expression columns actually store data. A stored value often spans multiple lines. For instance, a speech transcription evaluation dataframe has one line per aligned word. In that dataframe, the turn start and end times span all the words of that turn. That spanned information is explicitly stored in the dataframe. In addition, some cells can be empty, which is a different status from zero or an empty string.

3.1.2. Granularity and Foldable Categories

A key capability of that dataframe structure is variable granularity. Lines can be folded together, and columns optionally have a folding method, called a reduce operation, which defines how the value for the folded lines is computed. A number of reduce operations are already available: min, max, mean, median and sum for numeric values, concatenation for string values. Expression columns either include a reduce operation, in which case they compute their value at the lowest possible granularity and then apply the operation, or do not include a reduce operation and compute their value from the reduced values of the other columns.

To illustrate that capability, two examples can be given. Computing the WER in speech transcription is done by dividing the count of errors by the number of words in the reference. The WER is then an expression column without a reduce operation which divides the value in the error column by the value in the reference word count column. These two source columns, on the other hand, have a "sum" reduce operation to accumulate the counts of errors and words at the required granularity.

In contrast, computing the total speech time is done from the speech duration column, which is an expression defined as turn end time minus turn start time, with a sum reduce operation. In that configuration the durations are computed at the turn level and summed together, giving the total turn time. The spans of the values in the start and end time columns are what sets the granularity of the duration computation.

3.2. Evaluation Capabilities

As detailed in Sections 2.3. and 2.5., Matics can deal with the evaluation of several NLP tasks and implements the corresponding metrics. All the input formats are converted to reference or hypothesis dataframes, which are then used to build an evaluation dataframe with a complete alignment of the texts.

The ASR evaluation subsystem, for example, is able to work at the word or character level, and to take the case into account when requested. It uses Unicode for multilingual support.

The final evaluation dataframe contains the full alignment and the error counts per type, with computed columns added to provide the WER, CER and NCE at any chosen granularity.

3.3. Statistical Analysis Capabilities

One aim of the interface is to give fast access to statistical testing capabilities. The list of the currently available functions has been presented in Section 2.6. A uniform, drag-and-drop based interface is proposed to select the data columns the testing applies to.

In the case of the standard descriptive statistics on a value, for example, the user selects a value column to compute the statistic on (for instance WER) and a label column for the granularity (for instance speech turn). They can also optionally select a label column as a factor (for instance "System", i.e. the NLP system whose output we evaluate) to compute a series of statistics instead of a global one. The computation of these statistics allows the user to summarize the distribution of the values and get an idea of how gaussian and symmetrical they are.

The second available analysis is a very common one: significance of a difference for paired values. The user selects a value (e.g. WER), a pairing/granularity (speech turn) and the factor to analyze (system), and the interface computes, for each system pair, the p-value, i.e. the probability that the WERs are in practice identical and the differences are only randomness. It can use either a Student paired t-test if the user considers the values gaussian (which is rare), or a less powerful but more robust Wilcoxon paired-difference test otherwise.

The third analysis is a correlation test between two value columns, to check, for instance, whether the WER is correlated with the turn duration. The user selects the values to compare and the granularity. The interface then computes three standard correlation values: Pearson's r (linear correlation), Kendall's τ and Spearman's ρ (rank correlations).

Finally, a fourth analysis method is implemented: the Anova. It is used to measure the importance of different factors on a result, and measures how much of the variance can be explained by each factor. It should be available by the time the final papers are due.
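The fold/reduce mechanism of Section 3.1.2. can be approximated with an ordinary dataframe library. The sketch below is not Matics code and the column names are hypothetical: the error and reference word-count columns get a "sum" reduce, and the WER expression is recomputed after folding, so the same table yields per-system or per-file scores.

```python
import pandas as pd

# Hypothetical per-turn evaluation dataframe: system, file, errors, ref_words.
df = pd.DataFrame({
    "system":    ["A", "A", "A", "B", "B", "B"],
    "file":      ["f1", "f1", "f2", "f1", "f1", "f2"],
    "errors":    [2, 1, 3, 4, 2, 5],
    "ref_words": [10, 8, 12, 10, 8, 12],
})

def fold(frame, labels):
    """Fold lines sharing the given label columns ('sum' reduce on the source
    columns), then recompute WER as an expression over the reduced values."""
    folded = frame.groupby(labels, as_index=False)[["errors", "ref_words"]].sum()
    folded["wer"] = folded["errors"] / folded["ref_words"]  # expression column
    return folded

print(fold(df, ["system"]))          # per-system WER: A = 6/30, B = 11/30
print(fold(df, ["system", "file"]))  # per-file WER at a finer granularity
```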

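The paired significance and correlation analyses of Section 3.3. can be reproduced with standard statistical routines. The per-turn WERs and durations below are made up for illustration (this is not Matics code); the sketch mirrors the choice between the paired t-test and the Wilcoxon test, and the three correlation coefficients the interface reports.

```python
from scipy import stats

# Hypothetical per-turn WERs of two systems, paired by speech turn.
wer_sys1 = [0.10, 0.25, 0.30, 0.12, 0.40, 0.22, 0.18, 0.35]
wer_sys2 = [0.15, 0.30, 0.32, 0.20, 0.45, 0.28, 0.25, 0.38]
turn_duration = [3.2, 5.1, 6.0, 2.8, 7.5, 4.4, 3.9, 6.3]

# Significance of the paired difference: Student t-test if the values are
# assumed gaussian, otherwise the more robust Wilcoxon signed-rank test.
t_stat, t_p = stats.ttest_rel(wer_sys1, wer_sys2)
w_stat, w_p = stats.wilcoxon(wer_sys1, wer_sys2)

# Correlation between WER and turn duration: the three coefficients reported.
r, _ = stats.pearsonr(wer_sys1, turn_duration)
tau, _ = stats.kendalltau(wer_sys1, turn_duration)
rho, _ = stats.spearmanr(wer_sys1, turn_duration)

print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
print(f"Pearson r={r:.3f}, Kendall tau={tau:.3f}, Spearman rho={rho:.3f}")
```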
4. User Interface

This section presents some views of the user interface offered by Matics.

4.1. Dataframe Views

The main interaction is done with the dataframe. A dataframe has views on it, where each view has its own state. The user has control over which columns are visible and in which order they appear. The display granularity is implicitly controlled by the visible columns: consecutive lines with identical labels in all the label columns are collapsed. If every column is hidden except for the system name and the WER, then the per-system WER is visible, as can be seen in Figure 1. When the file name column is then shown, the per-file score becomes visible, as in Figure 2. The dataframe can also be sorted on the columns, giving the possibility to get a per-speaker score in a dataframe originally generated with lines in time order. Filtering is also possible, to view a subset of the lines. The active granularity and filtering are taken into account when drawing a graph, while only the filtering is taken into account for statistical tests, where the granularity is requested explicitly.

Figure 1: Evaluation dataframe with every line folded but the system name. The WER is updated and represents the global score for all the files of the system.

Figure 2: Evaluation dataframe for the comparison of two systems (names have been blurred out). Every line is folded but the system and file name. wer ci: case-independent WER.

4.2. Data Visualization

The interface gives the capability to link to source data (audio, video, etc.) and to visualize the annotations present in a dataframe with an appropriate alignment to the original signal.

Audio display and listening is currently available. The interface allows listening to the signal at different levels: the whole signal, per speaker, per sentence, or per word. The segmentation follows the timestamps defined in the corresponding annotation file. Figure 3 shows a screen capture of the interface.

Figure 3: Audio signal and the associated transcription. A click on each token (one per rectangle in the center area) plays the corresponding audio segment.

4.3. Statistical Functions Selection

The selection form can be seen in Figure 4. The user can drag and drop between the column list at the bottom and the configurable fields in the middle.

Figure 4: Statistical paired difference configuration interface. In blue (e.g. comments, lang, system): labels of the columns; in red (e.g. start time, end time): values of the columns. Labels in English have been added on the figure to translate the French items.

4.4. Graphing

The other main capability of the interface is graphing data, to ease the visualization of data and results.

• Bar charts — The histogram graphic category can plot any value. The basic histogram allows graphing one or more value columns with one or more label columns on the x axis. This allows counting the number of different labels in one column, using another for the x axis (for instance counting the speakers in each show), with optionally a third used to color subparts of the histogram (gender for instance). An optional gaussian curve can be overlaid.

• Scatterplots — Scatterplots can be created from two value columns, with color and shape controlled by label columns. An example of a scatterplot showing the lack of correlation between the speech duration in a file and the WER can be seen in Figure 5.

Figure 5: Scatterplot of WER vs. speech duration in a file.

• Boxplots — Visualisation of the distribution of the data, through quartiles and deciles. A same graph can show the boxes for different factors (file, system...).

• Detection Error Tradeoff (DET) curves — For binary classification. The DET curves of several systems can be presented at once for visual comparison, with a visualization of the EER and Cdet decision thresholds. See an example in Figure 6.

Figure 6: DET curves for two simulated systems (named h1b and h1) with EER and Cdet decision thresholds.

5. Conclusion

The Matics software suite offers a unified tool for the evaluation of NLP systems, through two independent tools: Datomatic and Evalomatic. Datomatic allows the manipulation, visualisation and sub-selection of hypothesis and reference corpora; evaluations can be conducted in Evalomatic, with metrics implemented for a range of NLP applications.

Developed by the LNE, which specializes in the evaluation of NLP systems, Matics is free and open-source. While still in the development stage, the tool aims at providing a concrete and fully reusable solution for data exploration and evaluation. New features are expected to be implemented, and regular updates of the system will be offered according to the evolution of our evaluation activities.

For example, an expected upcoming feature is video synchronization with the annotation (for Datomatic). We are also concerned with a localization feature, to broaden the system out to the non-French-speaking community. Although the interface vocabulary may be quite transparent to computer scientists and statisticians, localization would be a strong requirement in terms of ergonomics. The localization process requires some modifications at the core of the system that will be addressed soon.

A longer-term perspective is to give the interface the capability to rewrite the different supported formats, and to use that capability, combined with the statistical analysis possibilities, to select representative subsets of data for train, development and test. This aspect, while quite out of the scope of evaluation, is also part of our mission of accompanying technology developers.

6. Bibliographical References

Bernard, G., Galibert, O., Rémi, R., Demeyer, S., and Kahn, J. (2016). LNE-Visu : une plateforme d'exploration et de visualisation de données d'évaluation (LNE-Visu: a platform for the exploration and display of evaluation data). In Proceedings of the JEP-TALN-Recital joint conference, 07.

Brunessaux, S., Giroux, P., Grilhères, B., Manta, M., Bodin, M., Choukri, K., Galibert, O., and Kahn, J. (2014). The Maurdor Project: Improving Automatic Processing of Digital Documents. In 2014 11th IAPR International Workshop on Document Analysis Systems, pages 349–354, April.

Chinchor, N. and Robinson, P. (1997). MUC-7 named entity task definition. In Proceedings of the 7th Conference on Message Understanding, volume 29.

Galibert, O. and Kahn, J. (2013). The First Official REPERE Evaluation. In First Workshop on Speech, Language and Audio in Multimedia (SLAM'13).

Galibert, O., Rosset, S., Grouin, C., Zweigenbaum, P., and Quintard, L. (2011). Structured and Extended Named Entity Evaluation in Automatic Speech Transcriptions. In Proc. of IJCNLP, Chiang Mai, Thailand, 9-11 November.

Giraudel, A., Carré, M., Mapelli, V., Kahn, J., Galibert, O., and Quintard, L. (2012). The REPERE Corpus: a multimodal corpus for person recognition. In Nicoletta Calzolari, et al., editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May. European Language Resources Association (ELRA).

Gravier, G., Adda, G., Paulson, N., Carré, M., Giraudel, A., and Galibert, O. (2012). The ETAPE corpus for the evaluation of speech-based TV content processing in the French language. In LREC - Eighth International Conference on Language Resources and Evaluation, Turkey.

Kipp, M. (2010). Multimedia annotation, querying and analysis in ANVIL. Multimedia information extraction, 19.

Knight, D., Evans, D., Carter, R., and Adolphs, S. (2009). HeadTalk, HandTalk and the corpus: Towards a framework for multi-modal, multi-media corpus development. Corpora, 4(1):1–32.

Malmasi, S., Zampieri, M., Ljubešić, N., Nakov, P., Ali, A., and Tiedemann, J. (2016). Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 1–14, Osaka, Japan, December.

NIST. (2015). NIST Multimodal Information Group - Tools. [Link] Accessed: 2018-02-21.

O'Donnell, M. (2008). The UAM CorpusTool: Software for corpus annotation and exploration. In Proceedings of the XXVI Congreso de AESLA, pages 1433–1447, 01.

Oepen, S., Øvrelid, L., Björne, J., Johansson, R., Lapponi, E., Ginter, F., and Velldal, E. (2017). The 2017 Shared Task on Extrinsic Parser Evaluation Towards a Reusable Community Infrastructure. EPE 2017, page 1.

Pierpaolo, B., Malvina, N., Patti, V., Rachele, S., and Francesco, C. (2017). EVALITA Goes Social: Tasks, Data, and Community at the 2016 Edition. Italian Journal of Computational Linguistics, 3(1):93–127.

Schmitt, A., Bertrand, G., Heinroth, T., Minker, W., and Liscombe, J. (2010). WITcHCRafT: A Workbench for Intelligent exploraTion of Human ComputeR conversaTions. In Nicoletta Calzolari, et al., editors, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

Schuller, B., Steidl, S., Batliner, A., Bergelson, E., Krajewski, J., Janott, C., Amatuni, A., Casillas, M., Seidl, A., Soderstrom, M., et al. (2017). The Interspeech 2017 computational paralinguistics challenge: Addressee, cold & snoring. In Computational Paralinguistics Challenge (ComParE), Interspeech 2017, pages 3442–3446.

Tjong Kim Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Walter Daelemans et al., editors, Proceedings of CoNLL-2003, pages 142–147. Edmonton, Canada.

