
Misspellings in Responses to Listening Comprehension Questions:

Prospects for Scoring based on Phonetic Normalization

Heike da Silva Cardoso† and Magdalena Wolska∗


† Department of Linguistics ∗ LEAD Graduate School

Eberhard Karls Universität Tübingen, Tübingen, Germany


{hcardoso,[Link]}@[Link]

Heike Da Silva Cardoso and Magdalena Wolska 2015. Misspellings in responses to listening comprehension questions: Prospects for scoring based on phonetic normalization. Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015. NEALT Proceedings Series 26 / Linköping Electronic Conference Proceedings 114: 1–10.

∗ Corresponding author

Abstract

Automated scoring systems which evaluate content require robust ways of dealing with form errors. The work presented in this paper is set in the context of scoring learners' responses to listening comprehension items included in a placement test of German as a foreign language. Based on a corpus of over 3000 responses to 17 questions, by test takers of different language proficiencies, we perform a quantitative analysis of the diversity in misspellings. We evaluate the performance of an off-the-shelf open-source spell-checker on our data, showing that around 45% of the reported non-word errors are not correctly accounted for; that is, they are either falsely identified as misspelt or the spell-checker is unable to identify the intended word.

We propose to address misspellings in computer-based scoring of constructed-response items by means of phonetic normalization. Learner responses transcribed into Soundex codes and into two encodings borrowed from historical linguistics (ASJP and Dolgopolsky's sound classes) are compared to transcribed reference answers using string distance measures. We show that a reliable correlation with teachers' scores can be obtained; however, similarity thresholds are item-specific.

1 Introduction

Form errors are the type of noise in linguistic data that can interfere with computational language analysis already at the preprocessing stage. Form errors in writing range from basic mechanics errors, such as capitalization or punctuation errors, through spelling and word-formation errors (which in many cases cannot be clearly differentiated), up to sentence-structure and syntactic errors. In this paper we address one class of form errors, non-word misspellings, in the context of a semantics-oriented task: the assessment of constructed responses to German as a Foreign Language listening comprehension questions.

In the task of content scoring, misspellings introduce obvious noise. A recently proposed method of addressing the spelling problem in automated scoring involves phonetic normalization based on Soundex, a coarse-granularity sound-based coding. Shedeed (2011) used Soundex in a system for scoring short answers in Arabic. Hahn et al. (2013) used an analogous method for German and showed that a bag-of-Soundex model outperforms other models on unseen data at an accuracy of over 85%.

The work presented here has been motivated by a different approach to content scoring: computer-assisted scoring. In the context of a real-world task, instead of automatically assigning scores, we group responses that are likely to be graded with the same scores, with the goal of streamlining manual scoring (see Wolska et al. (2014)). Identifying responses that are similar at the appropriate level of abstraction is thus crucial here. In the study presented in this paper, we evaluate the prospects for using phonetic string encodings based on sound classes derived in historical linguistics as a preprocessing step for this task.

In historical and comparative linguistics, sound classes are used, among others, to detect cognates, identify relatedness among languages, or detect or explain changes in sound patterns. Phonetic encoding in this case is a normalization step which serves to make languages comparable. In our case, phonetic normalization of type-written responses to listening comprehension items is motivated by the fact that students, especially those of lower proficiency, tend to misspell words to some extent in systematic ways, for instance, related to the properties of their mother tongue (orthography rules or phonological differences between the mother tongue and the target language).

Based on a corpus of learner responses to listening comprehension items, in this paper we answer the following questions:

• What is the extent of the misspellings problem in learner responses to German listening comprehension questions?

• How diverse are the misspellings, that is, to what extent do they diverge from target hypotheses?

• To what extent can an off-the-shelf spell-checking tool "solve" the problem?

• Does grouping responses based on phonetic normalization account for teachers' response scores?

In the context of the last question, we test two linguistically motivated phonetic encodings of different granularity: ASJPcode (Wichmann et al., 2013) and Dolgopolsky's classes (Dolgopolsky, 1986). These are compared to Soundex encoding (Russell, 1918, 1922), a practically motivated indexing method which, as mentioned earlier, had previously been proposed as a preprocessing step in content scoring. We hypothesize that normalization based on the linguistically motivated systems should yield response groups that better reflect the assigned scores than grouping based on Soundex encoding.

2 Related Work

Research into misspellings in learner language has predominantly addressed English as the target language (see, for instance, Flor and Futagi (2012) for a recent overview). Analogous lines of work based on digital corpora have been emerging for German as a Foreign Language. Rimrott and Heift (2008) analyzed the performance of the MS Word spell-checker on learner German and found that around 20% of misspellings were undetected. For single-error words, in over 40% of the cases the correct word was not in the suggestion list, whereas for multiple-error words in about 80% of the cases the spell-checker failed to provide a correction. In a further study, Heift and Rimrott (2008) found that in CALL activities students are influenced by a word's position in the list of suggestions when they select an alternative spelling. Clearly, with incorrect top-level suggestions, only more errors are introduced.

Corpus-based studies into low-level form errors in German learner writing are sparse. Boyd (2010) created a corpus of online workbook exercises and essays submitted by American students learning German and built a subcorpus of around 1200 non-word spelling errors found in this data. The most prominent error-annotated German learner corpus is Falko (Reznicek et al., 2013), and it also includes annotations of target hypotheses for misspellings. Juozulynas (2013) analyzed around 350 German essays written by American college students and found that around 15% of the identified errors were spelling errors. Analysis of the accuracy of robust automated correction was not performed in these studies.

To our knowledge, the only prior work in which explicit phonetic normalization is employed in content scoring is the previously mentioned work by Shedeed (2011) and the subsequent study by Hahn et al. (2013). In both cases Soundex coding is used.

3 Listening Comprehension Corpus

Data collection In this study we used responses to listening comprehension (LC) items collected during placement tests for language courses (four cohorts of students) administered by the Saarland University's International Office centre for Teaching German as a Foreign Language. The tests consisted of three parts: grammar, C-Test, and listening comprehension. The listening part consisted of three audio stimuli of increasing difficulty in terms of linguistic properties and speech tempo. The stimuli were accompanied by up to 11 constructed-response questions each. For each question the teachers provided one or more correct reference answers.

The tests were developed by an experienced teacher of the language centre and conducted using a web-based platform. Students' responses, preprocessed as outlined below, were scored manually – for the most part by one teacher, the head of the centre – also using a web-based platform. Responses were graded on a [0,1] or [0,2] scale; half points were used for partial credit. Approximately 600 students of various proficiencies and mother tongues participated in the tests.
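As a minimal illustration of the writing-mechanics normalization outlined below (lower-casing, removal of clause- and sentence-final punctuation, and transcription of umlauts as the underlying vowel followed by 'e'), a sketch in Python; the function name and the exact punctuation rule are our assumptions, not the platform's implementation:

```python
import re

# Standard transliteration convention mentioned in Section 3
# ('oe' for 'ö', 'ue' for 'ü', etc.).
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue"}

def normalize_response(text: str) -> str:
    """Normalize writing mechanics that are not score-affecting."""
    text = text.strip().lower()
    # drop clause- and sentence-final punctuation
    # (punctuation followed by whitespace or end of string)
    text = re.sub(r"[.!?,;:]+(\s|$)", r"\1", text)
    # transliterate umlauts as underlying vowel + 'e'
    for umlaut, replacement in UMLAUTS.items():
        text = text.replace(umlaut, replacement)
    return text
```

For example, "Österreich." and "oesterreich" collapse to the same preprocessed form, which is one source of the reduction in the number of unique responses reported in Table 1.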

Figure 1: Number of unique responses and unique tokens per question

Figure 2: Response lengths, in tokens, per question

Variable             N
Verbatim responses   7208
Verbatim unique      3794
Preprocessed unique  3146
Tokens               16298
Token types          2429

Table 1: Descriptive corpus information

Preprocessing Certain minor form errors, such as wrong capitalization or irregular punctuation, are irrelevant when assessing comprehension. We exploit this in a scoring platform to reduce the set of responses to score by normalizing spurious writing-mechanics differences which are not considered score-affecting in assessing comprehension. This includes lower-casing and removing clause- and sentence-final punctuation. In order to avoid differences in edit distance due to diacritics use, we also transcribe umlaut characters, using the standard convention, with their underlying vowel followed by 'e' ('ö' as 'oe', 'ü' as 'ue', etc.). Preprocessing reduces the set of responses which teachers need to score by more than 50% for some items. For this study we use responses scored in the preprocessed form. For the analysis presented in this paper we use a subset of the scored preprocessed responses, selected as summarized below.

The corpus Since the number of responses differs from question to question (at least partially due to the different language proficiencies of the test-takers; low-proficiency test-takers are not capable of responding to questions to the more difficult audio prompts) and for some questions it is low (only 29 responses to one of the questions after preprocessing), for the analyses presented in this paper we selected only those questions for which we have at least 100 unique preprocessed responses. We moreover excluded questions which elicited unordered multi-part responses, that is, questions of the type "Name 3 . . . " or "What are . . . ? (2 items)". Our complete data set consists of responses to 17 questions which elicited single-part responses, and each response has been scored at 0, 0.5, or 1 points.

Table 1 shows basic descriptive information about the corpus. The number of verbatim responses is the total number of responses to the 17 questions before preprocessing. "Verbatim unique" is the number of token-identical verbatim responses collapsed to one observation. "Preprocessed unique" is the number of token-identical (unique) responses after preprocessing as described in the previous paragraph. "Tokens" and "Token types" are, respectively, the number of all tokens and unique tokens (types) in the preprocessed responses.

In the remainder of this paper, we refer to the set of preprocessed unique responses. Figure 1 shows the distribution of responses and unique tokens per question for the three items (LC 1, LC 2, LC 3). Figure 2 shows the distribution of response lengths per question. There are more unique responses to the more difficult items, LC 2 and LC 3, and the responses to those items are longer and more diverse (the number of unique tokens is larger than the number of unique responses, that is, there are fewer recurring words than in the easiest item, LC 1). The average response length was 5 tokens.

Examples In order to illustrate the spelling-errors problem, in Figure 3 we show examples of misspellings in responses to three questions which elicited simple one-word key concepts. We will use responses to these questions in one of the analyses (RAs below are reference answers provided by the teachers; a vertical bar separates alternatives):

LC 1.1  Wo wohnt Alexandra?
        'Where does Alexandra live?'
        RA: frankreich

LC 1.6  Woher kommt Elisabeth?
        'Where does Elisabeth come from?'
        RA: oesterreich|wien|wien oesterreich

LC 3.1  Wie beleuchtet die Bundeskanzlerin Angela Merkel ihre Wohnung?
        'How does Chancellor Angela Merkel light her apartment?'
        RA: energiesparlampen
        'energy-saving lamps'

LC 1.1        LC 1.6        LC 3.1
frankreisch   austereich    giespallampe
frankrich     austerreich   energiespaerlaempe
frankriech    oestereicht   energysparen
frankrreisch  oeustreich    energiesparenlampen
frankrreit    ostreich      energiesparlampel
franzoezisch  oesterreisch  energiesparer
franzuezisch  oesttereich   energiespannlampe
freinkreich   oeustreich    energisparelampen
frienkriesch  oeschterich   sparrlampen
frienricht    oessterrisch  energiespaerlaempe

Figure 3: Examples of misspelled responses

Two of the questions (LC 1.1 and LC 1.6) appeared with the first, easiest, listening prompt. Even though identifying the answers within the audio prompts was easy for most test-takers, also low-proficiency ones, spelling the answers correctly turned out to be challenging, even though the elicited key concepts denote two well-known European countries. The third question (LC 3.1) appeared with the last, most difficult, audio prompt and was answered by medium- to high-proficiency learners. Likewise, spelling the word here is challenging. This may be partially due to the fact that "Energiesparlampe" is a compound noun.

Even this small sample illustrates the large variety of spelling errors, the high complexity of the spell-checking task, and the high demands on automated processing. Some misspellings, such as lampel for "lampe" or "lampen", are probably typos, while others are likely to have a phonological source, like frankreisch or oesterreisch, and among those some might be explained by interference of another foreign language or the native language of the student, for instance "au" in austereich or "y" in energysparen. Some errors might be interpreted as wrong morphological forms rather than misspellings, e.g. energisparelampen. In many cases multiple errors are combined.

4 Spell-checking and Normalizations

As shown in Figure 4, the data for analysis was prepared as follows: We created a spelling gold standard semi-automatically by spell-checking preprocessed responses using an off-the-shelf spell-checker (described in more detail in Section 4.1) and then manually annotating (verifying and correcting) the checker's outputs (Section 4.2). Each learner response and reference answer was automatically transcribed into three different phonetically-based encodings which, in the context of the automated scoring task, we treat as spelling normalizations (Section 4.3). In the analysis section we compare the spell-checked and the phonetically transcribed responses with, respectively, the strings or the transcriptions of target hypotheses and reference answers. The methods and tools used for annotation and normalization are outlined below.

Figure 4: Corpus processing

4.1 Spell-checking

For automated spell-checking and spelling correction we use Aspell (Atkinson, 2006), an open-source spell-checker provided by GNU. Aspell supports multiple languages and is frequently used as a reference system in research on spell-checking and writing normalization. Crucially for this work, a large dictionary for the German language compatible with Aspell is freely available, as are implementations of the system itself. Aspell is thus a good candidate for integration into a scoring system, and so a well-motivated choice for an evaluation.

Aspell performs checking and suggests corrections based on a combination of orthographic and phonetic coding, fast dictionary lookup, and an edit distance calculation. Alternative spellings are identified by an algorithm which represents words by their orthographic forms and their "soundslike" equivalents, that is, approximate pronunciations constructed based on phonetic information. Suggestions are ordered by a weighted average of the edit distances between the candidate and the misspelled word and between the "soundslike" encodings of the two words. Aspell language versions differ in their dictionaries and phonetic data, but the underlying edit distance algorithm is the same.

Note that Aspell performs context-insensitive spell-checking, that is, individual words are processed in isolation. Thus, only non-word errors are detected, while real-word errors are not. In this study we do not address real-word errors; however, we are planning to annotate the complete data set manually in the future.

4.2 Annotation

We annotated the learner responses with target hypotheses (hypothesized intended forms) semi-automatically using the Aspell checker. For each non-word, Aspell searches its dictionary and provides a list of suggested replacements. To obtain a spell-checked corpus we processed our data set with Aspell and, for each word which Aspell reported as misspelled, we stored Aspell's first suggestion. Then, we manually checked the first suggestions and corrected them where necessary.

As Figure 3 illustrates, the range of spelling variants includes cases of questionable interpretation and acceptability; consider, for instance, frienricht or giespallampe as misspellings of "frankreich" and "energiesparlampe", respectively. When building the spelling gold standard we did not use the teachers' scores as guides, but rather attempted to accept generously those words which could in good faith be interpreted as misspellings of the expected concepts. Where a good-faith interpretation was impossible or borderline possible, we marked those words as uninterpretable (for instance, frankaise, freikeit, franch in response to LC 1.1 and oestech, busterish, uscraisch, or susthei in response to LC 1.6). We also marked foreign words explicitly (france, francais, austria), as some students answered in English or in their native language.

The annotation was carried out by the authors of this paper. The corpus was divided into parts and single annotation was performed for each misspelled word by one author. The manually corrected spell-checker outputs are used as a spelling gold standard. The spell-checked, annotated corpus contains 2945 responses and 15260 tokens (2898 unique responses, 2173 unique tokens).

4.3 String Normalizations

For this study we used three phonetically-based encodings: the ASJP and Dolgopolsky systems, and Soundex as a baseline.

ASJPcode The Automated Similarity Judgment Program (ASJP) is a procedure originating from comparative and historical linguistics, developed with a view to comparing the world's languages by lexical similarity (Wichmann et al., 2013). Comparisons are based on word lists encoded in a standardized orthography (ASJPcode), a simplified version of the International Phonetic Alphabet (International Phonetic Association, 1999). The ASJP encoding consists of 41 symbols, 7 vowels and 34 consonants, which represent the commonly occurring sounds of the world's languages (for details, see Appendix C of Brown et al. (2008)). The transcription employed in this study was specifically designed to capture the sound representations of German.

Dolgopolsky's sound classes The sound-class coding system of Dolgopolsky (1986) was developed in the context of research analogous to the ASJP project, that of identifying related language families. Dolgopolsky's system groups similar consonants into 10 "sound classes" in such a way that phonetic regularities within a class are more systematic than between classes. Each class is represented with a single character. Vowels are simply marked as such (V). The transcription used in this study was also designed to capture the sound system of German.
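To make the idea of sound-class coding concrete, here is a deliberately simplified encoder in the spirit of Dolgopolsky's system: vowels collapse to V and similar consonants share one class symbol, so phonologically close misspellings receive identical codes. The letter-to-class table below is our own toy illustration for German spelling, not the LingPy transcription used in the study:

```python
# Toy sound-class encoder in the spirit of Dolgopolsky's system.
# The letter-to-class table is a simplified assumption for German
# spelling, NOT the transcription model used in the paper.
CLASSES = {
    "p": "P", "b": "P", "f": "P", "v": "P", "w": "P",  # labials
    "t": "T", "d": "T",                                # dentals
    "s": "S", "z": "S", "c": "S",                      # sibilants
    "k": "K", "g": "K", "q": "K", "x": "K",            # velars
    "m": "M", "n": "N",                                # nasals
    "l": "R", "r": "R",                                # liquids
    "j": "J",
}
VOWELS = set("aeiouy")

def sound_class_code(word: str) -> str:
    """Map letters to class symbols; vowels become V, runs of the
    same consonant class collapse (e.g. 'rr' -> R); letters with no
    class (e.g. 'h') are ignored."""
    out = []
    for ch in word.lower():
        if ch in VOWELS:
            out.append("V")
        elif ch in CLASSES:
            sym = CLASSES[ch]
            if not (out and out[-1] == sym):
                out.append(sym)
    return "".join(out)
```

Under this toy mapping, the misspellings oestereich and austerreich receive the same code while frankreich receives a different one, which is the grouping behavior the normalization is meant to provide.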

Both the ASJP and Dolgopolsky transcriptions were done based on sound classes for German, as implemented in the LingPy package (List and Moran, 2013; List et al., 2013).

Soundex Soundex, originally patented by Russell (1918, 1922), also uses sound classes to represent similar-sounding words with the same encoding; however, it was designed with the practical goal of indexing family names for the census. A Soundex code represents a token with a character followed by three digits. The character denotes the first letter of the word and the digits denote the sound classes of the three following consonants. There are six such sound classes. Vowels, unless word-initial, are ignored, as are the letters H and W. If the word is longer than the four-symbol sequence, the remaining letters are ignored. If it is shorter, zeros are added. Soundex is thus a more general approach than the other two and the most lossy (it abstracts away from the original string to the greatest degree), but it is one of the most frequently employed phonetic encodings and therefore a good baseline for comparison. Soundex has also been used in previous work on short answer scoring as a way of addressing misspellings (Hahn et al., 2013).

String        ASJP         Dolgopolsky  Soundex
frankeriech   fGaNkeGiS    PRVNKVRVS    F652
frankfurt     fGaNkfuGt    PRVNKPVRT    F652
fraenkerisch  fGaENkeGiS   PRVVNKVRVS   F652
fracraich     fGakGaiS     PRVKRVVS     F626
oestarreich   7oEstaGaiS   HVVSTVRVVS   O236
oestereisch   7oEsteGaiS   HVVSTVRVVS   O236
austerreich   7austEGaiS   HVVSTVRVVS   A236
austerreicht  7austEGaiSt  HVVSTVRVVST  A236

Figure 5: Examples of normalizations

To illustrate the selected phonetic normalizations, examples of the encodings are shown in Figure 5. As can be clearly seen, the effect of the normalizations is markedly different and reflects the more linguistically informed basis of the ASJP and Dolgopolsky codes: In the set of responses to LC 1.1, frankeriech, fraenkerisch, and frankfurt are grouped into one sound equivalence class by Soundex – an undesired result – but not by any of the other encodings. In the set of responses to LC 1.6, oestarreich, oestereisch and austerreich, austerreicht form two clusters in Soundex encoding, whereas the ASJP and Dolgopolsky codes yield more intuitive groupings, ASJP being more fine-grained than Dolgopolsky.

5 Results

The following analyses are performed: We start by summarizing the performance of the spell-checker at the word level. Next, we look at the extent of divergence of the misspelled words from the annotated target hypotheses by quantifying divergence in terms of string distances. Then, we relate misspellings and normalizations to scores: For two questions eliciting single key-concept responses, we show how distance to the key concepts affects response scores. Finally, we focus on complete responses and look at relations between scores and distances between normalized learner responses and reference responses.

Two standard string distance measures are used throughout this section: Damerau-Levenshtein distance (nDL), a variant of Levenshtein edit distance which accounts for transposition of adjacent characters (Damerau, 1964; Levenshtein, 1966), and string vector cosine based on n-grams. A length correction on the edit distance is performed in a standard way by dividing the distance by the length of the longer string. Cosine similarity is computed for unigrams, bigrams and trigrams. Because the data is not normally distributed and for some items the number of observations is low, instead of performing statistical analysis we present boxplots to show general tendencies in an informative way.

5.1 Automated Spell-checking

The performance of the Aspell spell-checker against the gold standard is summarized in Table 2. "Valid words" refers to correctly spelled words and "Misspelled words" to non-words. The numbers refer to unique tokens.

                   Valid words  Misspelled words  Row totals
Reported           42           1040              1082
Suggestions found  21           904               925
First Correct      -            583               583
First Wrong        21           321               342
No Suggestions     21           136               157

Table 2: Performance of the Aspell spell-checker

Out of the 2173 unique tokens, Aspell reported around 50% (1082) as misspelled. Since there were 1818 occurrences of misspellings overall, it is clear that many of the same misspellings recur. Out of the 1082 reported misspellings, 42 (4%) were correctly spelled words which Aspell falsely flagged, and for 21 of these it suggested a correction (false positives). Overall, Aspell's precision in identifying misspellings in our data is thus at 96%.¹

¹ We cannot provide recall results at this point since our gold standard includes only non-words identified by Aspell. We are planning to annotate real-word errors in the future.

Now, as far as automated correction is concerned, the correct word was among the first suggestions for not even 60% of the tokens. Out of the 904 misspelled tokens for which suggestions were found, 321 first suggestions were wrong; the first suggestion was thus correct in only about 64% of these cases. With 321 wrong suggestions and 136 cases for which no suggestions were available, about 45% of the non-word misspellings are not correctly accounted for by Aspell. These results are similar to those reported by Rimrott and Heift (2008).

A major issue for Aspell and, as can be expected, for any off-the-shelf German spell-checker, are compound nouns. Two of the listening prompts contained compounds as key concepts: "Marxhaus" in the answer to Where are Peter and Birgit? (RA: 'In front of Marx's birthplace in Trier') and "Energiesparlampen" in the answer to the previously mentioned LC 3.1. "Marxhaus" is not in Aspell's dictionary; the closest suggestions it finds as replacements include Matthäus (Matthew, as in Matthew the Apostle), Parkhaus (car park) or even Hausbar (house bar). Compounds account for all the 21 valid words which Aspell identified as misspellings.

Most of the remaining errors are due to context insensitivity; for instance, to "What did Karl Marx do in Cologne?" (RA: "Leitung der Neuen Rheinischen Zeitung", 'Led the "New Rhinish Newspaper"') a student wrote: radikal demochratisch behzatung ('radical democratic UNINTERPRETABLE'), for which Aspell suggested radikal demokratische Beratung ('radical democratic counseling'), which makes sense considering pure edit distance, but otherwise does not.

5.2 Diversity of Misspellings

Figure 6: Per item distribution of distances between misspelled words and target hypotheses

Figure 6 shows the distribution of cosine and normalized Damerau-Levenshtein distances (nDL) to target hypotheses with linear trend lines. On the x-axis, items within distance-measure groups are ordered as in Figures 1 and 2. As can be seen in the plots, the range of unigram cosine values is large for some items. Thus many misspellings involve more than just letter transpositions. The large ranges in bigram cosines and the many values at 0 for trigrams show that misspellings tend to diverge from the target hypotheses to a large extent.

For the easier questions (left end of the x sub-axes) the ranges of unigram cosine and Levenshtein distance tend to be smaller, while bigram and trigram cosines are larger and also closer to the low end of the scale. This means that in the easy questions, misspellings tend to contain the right letters, but the letters are misplaced. The same can be seen for the difficult questions (except for the last one). The intermediate-difficulty items tend to have the least letter overlap and many trigram similarities at the low end of the scale. These are likely to be the most difficult to correct automatically, but possibly easier to identify as qualifying to be scored at 0.

5.3 Relation to Scores: Misspelled Key Concepts

As mentioned in Section 3, we used responses to two questions which elicited one key concept, LC 1.1 and LC 1.6, to investigate the relation between misspellings and scores. From the LC 1.1-LC 1.6 corpus subset, we extracted responses which contained tokens with a gold standard annotation corresponding to the expected concept: "frankreich" for LC 1.1 and "oesterreich" for LC 1.6. There were 236 and 260 such responses, respectively.

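The two string distance measures used throughout Section 5 can be sketched in a few lines. This is a sketch of the standard textbook versions (the optimal-string-alignment variant of Damerau-Levenshtein, normalized by the longer string, and character n-gram cosine); the paper does not specify its exact implementation:

```python
import math
from collections import Counter

def ndl(a: str, b: str) -> float:
    """Damerau-Levenshtein distance (with adjacent transpositions)
    divided by the length of the longer string."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)] / max(len(a), len(b), 1)

def ngram_cosine(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between character n-gram count vectors."""
    va = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    vb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    dot = sum(va[g] * vb[g] for g in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0
```

For instance, the misspelling frankriech differs from "frankreich" by one transposition of adjacent letters, so its nDL distance is 1/10 even though two substitutions would otherwise be counted.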
Figure 8: Per score distribution of distances between normalized responses and reference responses

Figure 7: Per score distribution of distances between misspelled key concepts and target hypotheses for two items

For these responses, in Figure 7 we show the distribution of the distances to the target hypotheses between score points.

Most of the expected general tendencies can be found in the data: cosine distances for all n-grams increase with the scores, as expected. Levenshtein distance decreases as expected for LC 1.1, but the pattern for LC 1.6 is not clear. Moreover, and more interestingly, the acceptability thresholds for the two questions appear to be different. Responses with misspelled key tokens of lower similarity to the target concept tokens are accepted with partial and full scores in LC 1.1. Also, a larger range of similarity accounts for partial and full points in LC 1.1. This suggests that what counts as acceptable in terms of misspellings could be item-specific and that different thresholds would have to be used for different items.

5.4 Relation to Scores: Normalizations

Finally, we investigate the relation between sound class-based response normalizations and the scores assigned by teachers. Complete preprocessed learner and reference responses have been transcribed into the three encodings described in Section 4.3. Based on Figure 7, the 3-gram cosine distance yields the pattern that best distinguishes between the three score points. Therefore, only 3-gram cosine distances are reported for the normalized responses. We seek to find out which normalization yields the most consistent patterns in terms of the expected relation to the teachers' scores.

The distributions of distances between normalized learner and reference responses for all the items are shown in Figure 8. Items clustered by score point are ordered as in Figures 1 and 2. The distribution of string distances is shown for comparison. Linear trends are overlaid.

Two immediate observations can be made of the results. First, the score-based grouping is not clear-cut and the distance ranges overlap across score levels. Second, the expected pattern of cosine distance (linearly) increasing and normalized

Levenshtein distance (linearly) decreasing can be seen in the distributions of the ASJP and Dolgopolsky normalizations, but less so in the distribution of Soundex distances across items. Soundex transcriptions do not distinguish well between the scores based on Levenshtein distance, and only somewhat better based on cosine; for most items there is little difference between the mean distances for scores 0.5 and 1 on the nLD measure. The ASJP and Dolgopolsky normalizations are more stable in terms of variance, with ASJP, moreover, displaying fewer outliers. This confirms our hypothesis that the more linguistically informed encodings yield clusters which better correspond to the assigned scores. It also suggests that these encodings might result in better performance on the automated scoring task; we plan to investigate this in further work. The ASJP and Dolgopolsky distributions, moreover, better reflect the pattern of the string-based distances than the Soundex distributions do. Finally, the ASJP and Dolgopolsky normalizations appear more stable across items on both distance measures, and the shapes of their distributions are similar. Possibly a combination of both would work best as features for scoring.

6 Conclusions and Further Work

We presented a study of misspellings in a corpus of constructed responses to listening comprehension items used for placement testing in German. Not surprisingly, our data contain a large number of misspellings (around 50% of the unique words that learners used). The first-ranked suggestions of an off-the-shelf spell-checker were correct in fewer than 60% of the cases. This is likely due in part to the fact that the range of divergence from target forms is substantial; it also varies between questions. The majority of false positives were due to compounds specific to the listening prompts. An obvious solution we are pursuing to improve precision and to reduce the false-negative suggestion rate is to construct two dictionaries: one prompt-specific and the other learner-language-specific; the purpose of the latter is to provide prompt-specific frequent invalid forms produced by the learners.

We have also shown that, while the expected general trend in the scoring of misspelled responses can be observed, the acceptability of divergence from target forms appears to be item-specific. Finally, we proposed sound class-based normalizations as a method of grouping noisy responses in terms of their pronunciation similarity, and we related distances between normalized responses and reference answers to the response scores. This served to evaluate the prospects for a normalization-based approach to response clustering. Soundex, the most frequently employed normalization, does not distinguish between responses at different score points, so it can be considered the worst choice for a normalization-based approach. Both of the more elaborate phonetic transcriptions, based on the ASJP and Dolgopolsky codes, perform better than Soundex and are promising directions to pursue. We will experiment with including distances to reference answers based on both representations as features for (semi-)automated scoring.

Acknowledgments

We thank Dr. Kristin Stezano Cotelo from the Saarland University International Office for the collaboration on placement testing which made this research possible. We thank Johannes Dellert for letting us use his code for the sound class-based normalizations. We also thank the three anonymous reviewers for their helpful comments.

This work was funded by the Ministry of Science, Research and the Arts of Baden-Württemberg within the FRESCO project. Magdalena Wolska is supported by the Institutional Strategy of the University of Tübingen (Deutsche Forschungsgemeinschaft, ZUK 63).

References

Kevin Atkinson. 2006. GNU Aspell 0.60.7. http://[Link].

Adriane Boyd. 2010. EAGLE: an Error-Annotated Corpus of Beginning Learner German. In Proceedings of the 7th LREC. [Link]lrec2010/summaries/812.

Cecil H. Brown, Eric W. Holman, Søren Wichmann, and Viveka Velupillai. 2008. Automated classification of the world's languages: a description of the method and preliminary results. STUF - Language Typology and Universals / Sprachtypologie und Universalienforschung, 61(4):285–308. doi:10.1524/stuf.2008.0026.

Fred J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM. doi:10.1145/363958.363994.

Aharon B. Dolgopolsky. 1986. A probabilistic hypothesis concerning the oldest relationships among the language families of northern Eurasia. In Typology, Relationship and Time: A Collection of Papers on Language Change and Relationship by Soviet Linguists, pages 27–50. (Original: 1964. In: Voprosy Jazykoznanija 2).

Michael Flor and Yoko Futagi. 2012. On using context for automatic correction of non-word misspellings in student essays. In Proceedings of the 7th Workshop on Building Educational Applications Using NLP. [Link]org/[Link]?id=2390397.

Michael Hahn, Niels Ott, Ramon Ziai, and Detmar Meurers. 2013. CoMeT: Integrating different levels of linguistic modeling for meaning assessment. [Link]org/anthology/S/S13/[Link].

Trude Heift and Anne Rimrott. 2008. Learner responses to corrective feedback for spelling errors in CALL. System. doi:10.1016/[Link].2007.09.007.

International Phonetic Association, editor. 1999. Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press.

Vilius Juozulynas. 2013. Errors in the compositions of second-year German students: An empirical study for parser-based ICALI. CALICO Journal. [Link]/html/article_578.pdf.

V. I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710.

Johann-Mattis List and Steven Moran. 2013. An open source toolkit for quantitative historical linguistics. In Proceedings of the ACL Conference (System Demonstrations). [Link]P13/[Link].

Johann-Mattis List, Steven Moran, Peter Bouda, and Johannes Dellert. 2013. LingPy. Python Library for Automatic Tasks in Historical Linguistics. [Link][Link].

Marc Reznicek, Anke Lüdeling, and Hagen Hirschmann. 2013. Competing target hypotheses in the Falko corpus. In Automatic Treatment and Analysis of Learner Corpus Data, volume 59 of Studies in Corpus Linguistics, pages 101–123.

Anne Rimrott and Trude Heift. 2008. Evaluating automatic detection of misspellings in German. Language Learning & Technology, 12(3):73–92. [Link]vol12num3/[Link].

Robert C. Russell. 1918, 1922. US Patents No. 1261167 and 1435663. (Retrieved 04/15 via [Link][Link]/netahtml/PTO/[Link]).

Howida A. Shedeed. 2011. A new intelligent methodology for computer based assessment of short answer questions based on a new enhanced Soundex phonetic algorithm for Arabic language. International Journal of Computer Applications. [Link]org/volume34/number10/[Link].

Søren Wichmann, André Müller, Annkathrin Wett, Viveka Velupillai, Julia Bischoffberger, Cecil H. Brown, Eric W. Holman, Sebastian Sauppe, Zarina Molochieva, Pamela Brown, Harald Hammarström, Oleg Belyaev, Johann-Mattis List, Dik Bakker, Dmitry Egorov, Matthias Urban, Robert Mailhammer, Agustina Carrizo, Matthew S. Dryer, Evgenia Korovina, David Beck, Helen Geyer, Pattie Epps, Anthony Grant, and Pilar Valenzuela. 2013. The ASJP Database (version 16). [Link] (Retrieved 04/15).

Magdalena Wolska, Andrea Horbach, and Alexis Palmer. 2014. Computer-assisted scoring of short responses: the efficiency of a clustering-based approach in a real-life task. In Advances in Natural Language Processing (Proceedings of the 9th International Conference on Natural Language Processing (PolTAL-14)). doi:10.1007/978-3-319-10888-9_31.
