Diagnosing Dysarthria With Long Short-Term Memory Networks
September 15-19, 2019, Graz, Austria
Abstract
This paper proposes the use of Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units for determining whether Mandarin-speaking individuals are afflicted with a form of Dysarthria based on samples of syllable pronunciations. Several LSTM network architectures are evaluated on this binary classification task, using accuracy and Receiver Operating Characteristic (ROC) curves as metrics. The LSTM models are shown to significantly improve upon a baseline fully connected network, reaching over 90% area under the ROC curve on the task of classifying new speakers, when a sufficient number of cepstrum coefficients is used. The results show that the LSTM's ability to leverage temporal information within its input makes for an effective step in the pursuit of accessible Dysarthria diagnoses.

Figure 1: From a waveform example X to its classification Y in the proposed model architecture.
Index Terms: Dysarthria, RNN, LSTM, speech processing
X_0 and the last output h_T is provided to the logistic regression layer.

We experimented with several variants of the LSTM model, including adding layers and using bidirectional LSTMs. For the models with one layer, L2 regularization was used. The two-layer model employed dropout [4, 5] between the LSTM layers, as well as between the last LSTM layer and the logistic regression. Bidirectional LSTM networks perform two concurrent passes on the data, left to right and right to left. The output vectors produced by the two passes are concatenated and fed to the logistic regression layer.

3. Evaluation Methodology

The evaluation data consists of samples of syllables recorded from 69 Mandarin-speaking adults, 38 male and 31 female. The number of individuals in each class is presented in Table 1, together with the corresponding number of recorded syllables. The participants were from Jinan University School of Medicine and included 31 native Mandarin-speaking patients (19 males and 12 females) with post-stroke Dysarthria. The age of the dysarthric speakers ranged from 25 to 83 years old [mean ± SD: 56.74 ± 16.40 years]. All participants went through physical examination, Frenchay Dysarthria Assessment, and other auxiliary examinations (such as brain CT, MRI). Before the stroke occurred, all patients had no speech-related impairments and were able to communicate fluently in Mandarin. They had no alexia, visual, or severe auditory comprehension impairments, and had pure-tone thresholds at 500, 1000, and 2000 Hz of ≤ 25 dB HL in at least one ear. The control group included 38 healthy adults (HA) (19 males and 19 females) in a similar age range (21 to 76 years old; mean ± SD: 45.89 ± 13.02 years). Some of the family members of the Dysarthria group were recruited into the HA group. They all had pure-tone thresholds at 500, 1000, and 2000 Hz of ≤ 25 dB HL in at least one ear, with no reported hearing or speech disorders. More details about the demographic information of the participants and the acoustic properties of the speech samples can be found in [6]. Informed consent was obtained from all participants. All research was performed in accordance with relevant guidelines and regulations.

Note that although the positive or negative labels are originally assigned only to speakers, the models are trained and tested using syllables as input. To assign labels to syllables during training, we propagate the speaker label to all the syllables recorded from that speaker. This is bound to introduce label noise in the positively labeled syllables, as not all syllables from a speaker afflicted with Dysarthria exhibit abnormal speech production. If a speaker is seen to correspond to a bag of syllables, the problem corresponds to a multiple instance learning (MIL) setting [7, 8]. In this paper, we use the simple MIL approach of projecting bag labels to all syllable instances in the bag, leaving the use of more sophisticated MIL methods for future work.

The dataset was used for the training and evaluation of four models, as follows:

1. Baseline: A fully connected feedforward neural network with one hidden layer.
2. LSTM-1: Single-layer, unidirectional LSTM.
3. LSTM-2: Double-layer, unidirectional LSTM.
4. BiLSTM-1: Single-layer, bidirectional LSTM.

All models use a hidden layer size of 200. They are trained using Adam [9] for 40 epochs on mini-batches of size 64. The training objective is formulated as the syllable-level cross-entropy loss between the predictions and the ground truth provided by the medical practitioners who collected the data. While only the individuals were manually labeled as having Dysarthria (positive) or not (negative), the label for each individual was also assigned to all the syllables coming from that individual, and the cross-entropy loss was formulated using syllables as examples.

To prevent overfitting, a form of early stopping was employed, where training is stopped when the ratio of the current validation error ε_curr to the lowest error seen thus far ε_min exceeds a threshold 1 + α, i.e. ε_curr / ε_min > 1 + α, where α = 0.075. A grace period is used, such that training is only stopped if the threshold is met for 5 epochs in a row.

4. Experiment I: Known Speakers

We first compared the LSTM models against the baseline fully connected network on the relatively easier task of non-novel speakers. Here, the entire set of syllables from the dataset is shuffled and partitioned into the training, testing, and validation sets using a number-of-syllables ratio of [Link], respectively. Since the dataset is partitioned at syllable level, it is possible for a patient to have their syllables partitioned among the training, validation, and test sets. Thus, syllables that appear at test time may come from patients that have been observed at training time.

We employed the syllable-level accuracy, precision, and recall as metrics to judge the performance of each model [10]. Accuracy is the percentage of correct syllable labels predicted by the system. Precision and recall were also considered due to the medical nature of our experiments. That is, most people do not suffer from Dysarthria, but it is the instances in which one does that are important to classify correctly. Precision is the percentage of correct positive predictions (true positives) out of all the positive predictions (true positives + false positives). Recall is the percentage of correct positive predictions out of all the positive examples (true positives + false negatives). Because individuals who receive a negative prediction (i.e., who do not suffer from Dysarthria) are less likely to seek a second opinion, we are especially interested in a higher recall.

The baseline fully connected model obtains 79.0% accuracy, which is a significant improvement over the 53.4% of the majority classifier. Table 2 also shows the results for each LSTM model.
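To make the architecture concrete, the single-layer unidirectional variant (LSTM-1) can be sketched in PyTorch. This is an illustrative sketch, not the authors' implementation; the class and variable names are ours, while the sizes follow the paper (cepstral feature vectors of 13 coefficients per frame, hidden size 200, last output h_T fed to a logistic regression layer):

```python
import torch
import torch.nn as nn

class LSTM1(nn.Module):
    """Illustrative sketch of LSTM-1: a single-layer, unidirectional LSTM
    over a syllable's sequence of cepstral feature vectors; the last
    output h_T feeds a logistic regression layer."""

    def __init__(self, n_features=13, hidden_size=200):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.logreg = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, T, n_features) -> per-syllable probability of Dysarthria
        outputs, _ = self.lstm(x)    # (batch, T, hidden_size)
        h_T = outputs[:, -1, :]      # last output h_T
        return torch.sigmoid(self.logreg(h_T)).squeeze(-1)
```

The two-layer and bidirectional variants would pass `num_layers=2` (with dropout) or `bidirectional=True` to `nn.LSTM`, with the final layer's input size doubled in the bidirectional case to accommodate the concatenated left-to-right and right-to-left outputs.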
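The early stopping rule (stop once ε_curr / ε_min > 1 + α, with α = 0.075, holds for 5 epochs in a row) can be sketched in plain Python; the function name and the list-based interface are ours, not the authors':

```python
def should_stop(val_errors, alpha=0.075, patience=5):
    """Early stopping sketch: return True once the ratio of the current
    validation error to the lowest error seen so far exceeds 1 + alpha
    for `patience` consecutive epochs."""
    min_err = float("inf")
    streak = 0
    for err in val_errors:
        min_err = min(min_err, err)      # lowest validation error so far
        if err / min_err > 1 + alpha:    # threshold exceeded this epoch
            streak += 1
            if streak >= patience:       # grace period exhausted
                return True
        else:
            streak = 0                   # improvement resets the grace period
    return False
```

For example, a run whose validation error bottoms out at 0.9 and then plateaus at 1.0 (a ratio of about 1.11 > 1.075) triggers the stop after five such epochs, whereas a monotonically decreasing error never does.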
LSTM-1 and LSTM-2 achieve similar performance, outperforming the baseline and making a marginal improvement upon the Bi-LSTM-1 model.

5. Experiment II: Novel Speakers

While the LSTM models were shown to outperform the baseline on the task of classifying syllables from known speakers, we are also interested in their performance on novel speakers. In the experiment from Section 4, the training set and the test set may contain syllables from the same speaker. To more accurately match the application to novel speakers, in the experiment from this section the training and test sets were created by partitioning the set of speakers. As such, an individual's syllables appear either in the test set or in the training set, but not in both.

Because the number of speakers in the dataset is relatively small, we opted to evaluate the models using 10-fold cross-validation. We randomly sampled 9 of the 69 speakers to use as a validation set. The remaining 60 were partitioned into 10 groups of 6. In each of the 10 evaluation rounds, the models were trained on 54 speakers and tested on a different group of 6 novel speakers. The trained models are then evaluated in two scenarios: syllable-level and speaker-level classification.

5.1. Syllable-Level Evaluation

In the syllable-level classification, the trained models are evaluated by how well they classify syllables from the speakers in the test set. This is similar to the experiment from Section 4, except that the test syllables now come from novel speakers. Figure 2 shows the receiver-operating characteristic (ROC) and precision-recall (PR) curves.

Figure 2: Receiver-operating characteristic (ROC) and precision-recall (PR) curves for syllable-level classification.

To gain perspective on the ROC behavior, we consider a model which produces a positive classification with probability p. The majority classifier can be seen as the extreme case, where p = 1. As p increases from 0 to 1, the red ROC line in Figure 2 is produced, with an area under the curve (AUC) of 0.5. The three LSTM models clearly improve upon this baseline for all three methods of inference, with LSTM-1 edging out the other two models. In terms of AUC, the three systems obtain the following scores: 75.4 for LSTM-1, 69.5 for LSTM-2, and 71.4 for Bi-LSTM-1.

5.2. Speaker-Level Evaluation

In the speaker-level classification, we take the logistic regression outputs for all syllables belonging to a test speaker and aggregate them into a single probability score that can be used to classify the speaker. To achieve this, we investigated two aggregation methods: soft-majority and a normalized version of noisy-OR. Given a speaker with m syllables, let σ_k be the logistic regression output for the k-th syllable of that speaker. Then the two aggregation methods compute the speaker-level probability as follows:

    soft-majority = (1/m) Σ_{k=1}^{m} σ_k                         (1)

    noisy-OR = 1 − exp( (1/m) Σ_{k=1}^{m} log(1 − σ_k) )          (2)

Because the traditional noisy-OR calculation would be affected by the number of syllables each speaker has, we computed it in log-space and normalized the probability of the negative class by the number of that speaker's syllables. This allowed us to directly compare the noisy-OR scores between speakers with a varying number of examples.

Figure 3 presents the receiver-operating characteristic (ROC) and precision-recall (PR) curves for the soft-majority method of inference. The noisy-OR method produced identical results, therefore its curves are not shown.

Figure 3: Receiver-operating characteristic (ROC) and precision-recall (PR) curves for speaker-level evaluation.

The LSTM-1 model again obtains the best results. The first line in Table 3 shows the performance of the three LSTM models in terms of the area under the ROC curve (AUC). The speaker-level performance is higher than at syllable level, likely because not all syllables from speakers afflicted with Dysarthria exhibit abnormal speech production. Because an accurate diagnosis cannot be expected to result from a single syllable, the speaker-level method is more appropriate for practical purposes.

Table 3: AUC scores: speaker-level vs. syllable-level.

                 LSTM-1   LSTM-2   Bi-LSTM-1
Speaker-level    85.4     78.5     84.7
Syllable-level   75.4     65.9     71.4

5.3. Effects of Syllable Types on Classification Accuracy

There are three types of syllables in the dataset: (1) syllables with monophthongs, (2) syllables with compound vowels, and (3) syllables with consonant-/a/. In order to evaluate the classification accuracy based on the various types of syllables, we created three combinations of the syllable dataset. The first combination (no prefix) consists of all three types of syllables. The second combination (prefixed with 'cv-') does not contain syllables with compound vowels. The third combination (prefixed with 'c/a/-') does not contain syllables with consonant-/a/. In this experiment, we test how well the models perform when they are trained on the three different combinations of syllable types.
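The speaker-level cross-validation protocol above (9 validation speakers held out, the remaining 60 split into 10 disjoint test groups of 6) can be sketched as follows; the function name and generator interface are ours, not the authors':

```python
import random

def speaker_folds(speakers, n_folds=10, n_val=9, seed=0):
    """Sketch of the speaker-disjoint split: hold out n_val speakers for
    validation, split the rest into n_folds disjoint test groups, and
    yield a (train, val, test) partition for each evaluation round."""
    rng = random.Random(seed)
    pool = list(speakers)
    rng.shuffle(pool)
    val, rest = pool[:n_val], pool[n_val:]
    size = len(rest) // n_folds
    folds = [rest[i * size:(i + 1) * size] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # train on every group except the current test group
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, val, test
```

With 69 speakers this yields 10 rounds of 54 training, 9 validation, and 6 test speakers, and because the split is by speaker, no individual's syllables can leak between training and test.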
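As a concrete reading of the two aggregation rules, the following sketch implements soft-majority and the normalized noisy-OR, assuming the log-space normalization amounts to averaging log(1 − σ_k) over a speaker's m syllables before exponentiating; the function names are ours:

```python
import math

def soft_majority(sigmas):
    """Mean of the per-syllable positive probabilities, as in Eq. (1)."""
    return sum(sigmas) / len(sigmas)

def normalized_noisy_or(sigmas):
    """Noisy-OR computed in log-space, as in Eq. (2): the negative-class
    log-probability is averaged over the speaker's syllables so that
    scores stay comparable across speakers with different numbers of
    examples."""
    m = len(sigmas)
    avg_log_neg = sum(math.log(1.0 - s) for s in sigmas) / m
    return 1.0 - math.exp(avg_log_neg)
```

Note that when every syllable receives the same score σ, both methods reduce to σ, and both stay within [0, 1] regardless of the number of syllables m.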
Figure 4 shows the receiver-operating characteristic (ROC) and precision-recall (PR) curves for the three LSTM models and the three different combinations of syllable types. The corresponding AUC scores are shown in Table 4.

Figure 4: Receiver-operating characteristic (ROC) and precision-recall (PR) curves using all syllables vs. without syllables with compound vowels (prefixed with 'cv-') vs. without syllables with consonant-/a/ (prefixed with 'c/a/-').

Table 4: Speaker-level AUC scores over all syllables (All) vs. without syllables with compound vowels (No 'cv') vs. without syllables with consonant-/a/ (No 'c/a/').

             All    No 'cv'   No 'c/a/'
LSTM-1       85.4   92.3      62.1
LSTM-2       78.5   90.2      84.5
Bi-LSTM-1    84.7   92.0      85.9

The models performed significantly better according to their AUC scores when syllables with compound vowels were removed. All three models scored above 90% when trained without this type of syllable. Therefore, it may be a reasonable heuristic not to include syllables with compound vowels when diagnosing a Dysarthria patient. This intuitively follows from the observation that, even for healthy speakers, these syllables are more difficult to produce and the variability of their acoustic properties is greater than for syllables with monophthongs.

5.4. Varying the Number of Cepstrum Coefficients

While including N = 13 cepstrum coefficients in each feature has produced promising results, there may still be room for improvement by adding more coefficients. To this end, three 10-fold cross-validation evaluations were conducted in the same manner as before, with N = 13, 19, and 25 coefficients used as input, respectively. When fewer than 25 cepstrum coefficients are used, they are taken starting from the cepstrum with the lowest quefrency. Table 5 shows the speaker-level AUC scores for the three LSTM models using the soft-majority inference method.

Table 5: Speaker-level AUC scores for different numbers of cepstrum coefficients.

             N = 13   N = 19   N = 25
LSTM-1       85.4     90.1     81.7
LSTM-2       78.5     88.2     87.1
Bi-LSTM-1    84.7     88.4     90.4

Adding more cepstrum coefficients leads to substantial improvements in the performance of Bi-LSTM-1, matching LSTM-1's best performance. LSTM-1 and LSTM-2 show a similar behavior, in the sense that their maximum performance is achieved for 19 coefficients. When all 25 coefficients are used, their performance decreases, which could be due to a lack of capacity.

6. Related Work

Carmichael et al. [11] employed multilayer perceptrons and decision trees to classify the different forms of Dysarthria, using as input a computerised Frenchay Dysarthria Assessment (CFDA) profile, essentially a vector of articulatory dysfunction values measured using acoustic signal processing techniques. Unlike our work, however, the system is trained and tested on a distribution of English-speaking people already known to have some form of Dysarthria. Prior to this, an effort was made to classify speakers into one of the categories of Dysarthria using a manual Frenchay Dysarthria Assessment of each patient as input [3][12]. The more advanced topic of recognizing speech produced by someone with Dysarthria using RNNs has also been investigated recently for English-speaking individuals, using Elman recurrent neural networks in [13] and a hybrid deep neural network – hidden Markov model (DNN-HMM) architecture in [14]. Wu et al. [15] presented a personalized model adaptation for automatic speech recognition (ASR) targeted at Mandarin-speaking individuals afflicted with articulation disorders due to mild-to-moderate hearing impairment.

7. Conclusion and Future Work

This paper investigated the effectiveness of three LSTM networks, two unidirectional and one bidirectional, for the task of Dysarthria diagnosis based on recordings of syllables from both afflicted and healthy Mandarin speakers. In the first experiment, all LSTM architectures outperformed a fully connected baseline when evaluated using syllable-level accuracy, with the bidirectional variant slightly trailing the unidirectional variants. The second experiment assumes the test syllables come from novel speakers, and evaluates the three LSTM models at both syllable level and speaker level. When the syllables with compound vowels are removed from the dataset, all models obtain over 90% AUC. Furthermore, we found that the LSTM models' performance could be improved by increasing the number of cepstrum coefficients. While these methods may not yet be practical as a stand-alone medical test, they do suggest that LSTM networks may provide a fruitful avenue for the realization of autonomous Dysarthria diagnosis.

ZCA whitening is employed as a pre-processing step in many audio classification tasks [16][17][18]; as such, it is a compelling next step in an effort to improve performance. CNNs can often perform competitively on sequence processing tasks [19], therefore we plan to comparatively evaluate CNNs and long-term recurrent CNNs [20], as well as dilated RNNs [21]. The model presented in [22] takes features much closer to the raw waveform when compared to MFCCs. Applying this approach to Dysarthria classification may also prove to be effective. The type of training data available for speaker classification falls under the multiple instance learning (MIL) setting [7, 23]. Correspondingly, we plan to use LSTMs with models that are specifically designed for the MIL setting.

8. Acknowledgements

This study was supported in part by the NIH NIDCD Grant No. R15-DC014587.
9. References

[1] S. Li, Speech Therapy. People's Health Press, 2013.

[2] P. Enderby and P. Davies, "Communication disorders: Planning a service to meet the needs," International Journal of Language and Communication Disorders, vol. 4, no. 3, pp. 301–331, 1989.

[3] P. Enderby, "Frenchay Dysarthria assessment," British Journal of Disorders of Communication, vol. 15, no. 3, pp. 165–173, 1980.

[4] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, Jan. 2014. [Online]. Available: [Link]

[5] Y. Gal and Z. Ghahramani, "A theoretically grounded application of dropout in recurrent neural networks," in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS'16, 2016, pp. 1027–1035. [Online]. Available: [Link]

[6] Z. Mou, Z. Chen, J. Yang, and L. Xu, "Acoustic properties of vowel production in Mandarin-speaking patients with post-stroke Dysarthria," Scientific Reports, vol. 8, no. 14188, 2018.

[7] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, "Solving the multiple instance problem with axis-parallel rectangles," Artificial Intelligence, vol. 89, no. 1-2, pp. 31–71, Jan. 1997.

[8] S. Ray, S. Scott, and H. Blockeel, "Multiple-instance learning," in Encyclopedia of Machine Learning and Data Mining, 2017, pp. 882–892.

[9] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the 3rd International Conference on Learning Representations (ICLR).

[10] L. Torgo and R. Ribeiro, "Precision and recall for regression," in International Conference on Discovery Science. Springer, 2009, pp. 332–346.

[11] J. Carmichael, V. Wan, and P. D. Green, "Combining neural network and rule-based systems for Dysarthria diagnosis," in Interspeech, 2008, pp. 2226–2229.

[12] J. N. Carmichael, Introducing Objective Acoustic Metrics for the Frenchay Dysarthria Assessment Procedure. University of Sheffield, 2007.

[13] S. S. Nidhyananthan, R. S. S. Kumari, and V. Shenbagalakshmi, "Assessment of Dysarthric speech using Elman back propagation network (recurrent network) for speech recognition," International Journal of Speech Technology, vol. 19, no. 3, pp. 577–583, 2016.

[14] C. España-Bonet and J. A. Fonollosa, "Automatic speech recognition with deep neural networks for impaired speech," in Advances in Speech and Language Technologies for Iberian Languages: Third International Conference, IberSPEECH 2016, Lisbon, Portugal, November 23-25, 2016. Springer, 2016, pp. 97–107.

[15] C.-H. Wu, H.-Y. Su, and H.-P. Shen, "Articulation-disordered speech recognition using speaker-adaptive acoustic models and personalized articulation patterns," ACM Transactions on Asian Language Information Processing, vol. 10, no. 2, pp. 7:1–7:19, Jun. 2011.

[16] Y. L. Gwon, W. M. Campbell, D. Sturim, and H. Kung, "Language recognition via sparse coding," in Interspeech, 2016.

[17] C. Chen, R. Bunescu, L. Xu, and C. Liu, "Tone classification in Mandarin Chinese using convolutional neural networks," in Interspeech, 2016, pp. 2150–2154.

[18] O. Vinyals and L. Deng, "Are sparse representations rich enough for acoustic modeling?" in Interspeech, 2012, pp. 2570–2573.

[19] W. Yin, K. Kann, M. Yu, and H. Schütze, "Comparative study of CNN and RNN for natural language processing," CoRR, vol. abs/1702.01923, 2017.

[20] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 677–691, Apr. 2017. [Online]. Available: [Link]

[21] S. Chang, Y. Zhang, W. Han, M. Yu, X. Guo, W. Tan, X. Cui, M. Witbrock, M. A. Hasegawa-Johnson, and T. S. Huang, "Dilated recurrent neural networks," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 77–87. [Online]. Available: [Link]

[22] J. Lee, J. Park, K. L. Kim, and J. Nam, "Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms," in Proceedings of the 14th Sound and Music Computing Conference (SMC), Espoo, Finland, 2017. [Online]. Available: [Link]

[23] S. Ray, S. Scott, and H. Blockeel, "Multiple-instance learning," in Encyclopedia of Machine Learning and Data Mining, 2014, pp. 1–13.