International Journal of Recent Technology and Engineering (IJRTE)
ISSN: 2277-3878, Volume-7 Issue-4S, November 2018
Speech Emotion Recognition using Deep Learning
Nithya Roopa S., Prabhakaran M, Betty.P
Abstract: Emotion recognition is a part of speech recognition that is gaining popularity, and the need for it is increasing enormously. Although there are methods to recognize emotion using machine learning techniques, this project attempts to use deep learning and image classification methods to recognize emotion and to classify the emotion according to the speech signals. The various datasets investigated and explored for training the emotion recognition model are explained in this paper, and some issues with databases and existing methodologies are addressed. Inception Net is used for emotion recognition with the IEMOCAP dataset. The final accuracy of this emotion recognition model using the Inception Net v3 model is approximately 35%.

Index Terms: speech recognition; emotion recognition; automatic speech recognition; deep learning; image recognition; speech technology; signal processing; image classification

I. INTRODUCTION

In today's digital era, speech signals have become a mode of communication between humans and machines, made possible by various technological advancements. Speech recognition methodologies combined with signal processing techniques led to Speech-to-Text (STT) technology [1], which is used in mobile phones as a mode of communication. Speech recognition is a fast-growing research topic that attempts to recognize speech signals. This in turn makes Speech Emotion Recognition (SER) a growing research topic, in which advances can drive progress in various fields such as automatic translation systems, machine-to-human interaction, and synthesizing speech from text. This paper reviews speech feature extraction, emotional speech databases, and classifier algorithms, and addresses problems in these topics. The paper is organized as follows. Section II gives background information about speech recognition, emotion recognition, and their applications. Section III surveys related work. Section IV describes the proposed methodology, the emotion database, and the Inception model. Section V explains the experimental setup, Section VI presents the results and analysis, and Section VII concludes the paper.

II. BACKGROUND INFORMATION

A. SPEECH RECOGNITION

Speech recognition is the technology concerned with techniques and methodologies for recognizing speech from speech signals. With various technological advancements in artificial intelligence and signal processing, recognizing emotion has become easier and more feasible. Speech recognition is also known as "Automatic Speech Recognition". Voice is expected to be the next medium for communicating with machines, especially computer-based systems, and the need for inferring emotion from spoken utterances is increasing exponentially with the enormous development in the field of voice recognition. Many voice products have been developed, such as Amazon Alexa, Google Home, and Apple HomePod, which function mainly on voice-based commands. It is evident that voice will be a better medium for communicating with machines.

B. EMOTION RECOGNITION

Emotion recognition deals with the study of inferring emotions and the methods used for inferring them. Emotion can be recognized from facial expressions and from speech signals. Various techniques have been developed to detect emotions, drawing on signal processing, machine learning, neural networks, and computer vision, and emotion analysis and emotion recognition are being studied and developed all over the world. Emotion recognition is gaining popularity in research because it is key to solving many problems and makes life easier. Emotion recognition from speech is a challenging task in artificial intelligence, since the speech signal alone is the input to the computer system. Speech Emotion Recognition (SER) is used in various fields: in BPO and call centres to detect emotion and identify customer satisfaction with a product, in IVR systems to enhance speech interaction and resolve language ambiguities, and to adapt computer systems to the mood and emotion of an individual.

C. SPEECH EMOTION RECOGNITION

Speech Emotion Recognition is a research problem that tries to infer emotion from speech signals. Various surveys state that advances in emotion detection will make many systems easier to use, and hence make the world a better place to live. SER has its own applications, which are explained later. Emotion recognition is challenging in several ways: emotion may differ based on environment, culture, and individual facial reactions, leading to ambiguous findings; the available speech corpora are not sufficient to accurately infer emotion; and speech databases are lacking in many languages.

Revised Version Manuscript Received on 25 November, 2018.
Nithya Roopa S., Assistant Professor, Kumaraguru College of Technology, Coimbatore, Tamilnadu, India.
Prabhakaran M., PG Scholar, Kumaraguru College of Technology, Coimbatore, Tamilnadu, India.
Betty P., Assistant Professor, Kumaraguru College of Technology, Coimbatore, Tamilnadu, India.
Retrieval Number: E1917017519
Published By: Blue Eyes Intelligence Engineering & Sciences Publication
The survey on speech emotion recognition in [2] helps greatly in exploring the field.

D. DEEP LEARNING

Deep learning [3] is a family of machine learning techniques in which data models are designed for a specific task. Deep learning with neural networks is used for various tasks such as image recognition, classification, decision making, and pattern recognition [4]. Other deep learning techniques, such as multimodal deep learning, have made feature extraction and image recognition much easier [5].

E. APPLICATIONS OF EMOTION RECOGNITION

Emotion recognition is used in call centres for classifying calls according to emotion [6]. It also serves as a performance parameter for conversational analysis [7], identifying unsatisfied customers, measuring customer satisfaction, and so on. SER is used in in-car board systems, where information about the mental state of the driver can be provided to the system to initiate safety measures and prevent accidents [8].

III. LITERATURE SURVEY

A complete review of speech emotion recognition is given in [9], covering properties of datasets, the study of speech emotion recognition, and classifier choice. Various acoustic features of speech are investigated and some classifier methods are analyzed in [10], which is helpful for further investigation of modern methods of emotion recognition. The work in [11] investigated predicting the next reaction from emotional vocal signals based on recognized emotion, using different categories of classifiers; classification algorithms such as K-NN and Random Forest are used there to classify emotion accordingly. Recurrent neural networks have risen enormously and are applied to many problems in the field of data science; deep RNNs such as LSTMs and bi-directional LSTMs trained on acoustic features are used in [12]. A range of CNNs implemented and trained for speech emotion recognition is evaluated in [13]. Emotion is inferred from speech signals using filter banks and a deep CNN in [14], which shows a high accuracy rate and suggests that deep learning can also be used for emotion detection. Speech emotion recognition can also be performed on image spectrograms with deep convolutional networks, as implemented in [15].

IV. PROPOSED METHODOLOGY

This section explains the proposed methodology, the emotion database used for the research, and the Inception model.

A. EMOTION DATABASE

The IEMOCAP corpus [16], the "interactive emotional dyadic motion capture database" prepared by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC), is used in this paper. Since this data is rarely used, this project explores the dataset in more depth. The corpus consists of ten actors in dyadic sessions with markers on the face, head, and hands, which provide detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions (happiness, anger, sadness, frustration, and the neutral state). The database consists of twelve hours of audio-visual data, divided into five sessions, with audio data in .wav format and video data in .mp4 format. Audio clips from various sessions were chosen; based on annotator judgments, these clips of about 10 seconds each are classified into one of the emotion classes. During the capture sessions, the actors' emotions were evaluated by several annotators into a range of seven emotions, and all these annotations are provided along with the database.

B. TRANSFER LEARNING

Transfer learning is a machine learning approach in which the knowledge gained from solving one problem is incorporated to solve another problem. Transfer learning solves many problems within a short interval of time; it is used whenever there is a need to reduce computation cost or to achieve accuracy with less training.

Fig 1. Inception Net Architecture

C. INCEPTION NET V3 MODEL

The Inception Net v3 model [17] is used to build the emotion recognition model. Inception evolved from the GoogLeNet architecture with some enhancements. The Inception model is used for automatic image classification and for labelling images according to their content; Inception-v3 is used for image classification in Google Image Search, and it achieved a 5.6% top-5 error rate on the ILSVRC 2012 classification challenge validation set. Figure 1 illustrates the complete architecture of the Inception Net v3 model. The model consists of Inception modules, each of which concatenates the outputs of 1×1, 3×3, and 5×5 filters.
Inception Net has a network-in-a-network-in-a-network structure: inception modules are embedded inside the overall Inception architecture, which helps reduce the size of the numerical arrays.
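As an illustration of the concatenation idea described above, here is a toy NumPy/SciPy sketch of an Inception-style module on a single-channel input with random filters. This is only a sketch of the branch-and-concatenate pattern, not the actual Inception v3 module, which also includes pooling branches and 1×1 dimension-reduction convolutions:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

def conv_branch(image, kernel_size):
    """Convolve a single-channel image with a random k x k filter,
    using 'same' mode so every branch keeps the spatial size."""
    kernel = rng.standard_normal((kernel_size, kernel_size))
    return convolve2d(image, kernel, mode="same")

def inception_module(image, kernel_sizes=(1, 3, 5)):
    """Run parallel 1x1, 3x3 and 5x5 branches and concatenate
    their outputs along a new channel axis."""
    branches = [conv_branch(image, k) for k in kernel_sizes]
    return np.stack(branches, axis=-1)  # shape: (H, W, num_branches)

image = rng.standard_normal((32, 32))  # stand-in for one spectrogram channel
out = inception_module(image)
print(out.shape)  # (32, 32, 3)
```

Because every branch preserves the spatial size, the branches differ only in receptive field, and concatenation simply widens the channel dimension.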
D. PREPARATION OF TRAINING DATASET

All the audio clips from the IEMOCAP database are pulled out of the various sessions. Using the emotion evaluation report provided with the database, the .wav files are labeled and categorized into the seven emotion classes mentioned earlier. The speech signals in .wav format are then converted into spectrogram images within each emotion class.
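The .wav-to-spectrogram step can be sketched as follows. Since the IEMOCAP audio cannot be bundled here, a synthetic tone stands in for an utterance; in the real pipeline the array would come from scipy.io.wavfile.read, and the dB image would then be resized and saved as a PNG (e.g. with matplotlib):

```python
import numpy as np
from scipy.signal import spectrogram

# Stand-in for one IEMOCAP utterance: a 440 Hz tone sampled at 16 kHz.
sample_rate = 16_000
t = np.arange(0, 2.0, 1.0 / sample_rate)
audio = np.sin(2 * np.pi * 440.0 * t)

# Short-time power spectrogram; a log scale makes it image-like.
freqs, times, power = spectrogram(audio, fs=sample_rate, nperseg=512)
image = 10 * np.log10(power + 1e-10)  # dB-scaled 2D array

print(image.shape)  # (frequency bins, time frames)
```

Each labeled clip yields one such 2D array, which is what the image classifier is trained on.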
Fig.2 Accuracy rate of the Data Model
V. EXPERIMENTAL SETUP

This section explains the experimental setup and the deep learning libraries used for emotion recognition.
A. SYSTEM SETUP

For the experiment, the system setup consists of a 6th-generation Core i7 3.7 GHz processor, a 512 GB Samsung SSD, and an NVIDIA GeForce GT 730 2 GB GPU card, with Ubuntu 16.04 installed. For deep learning, TensorFlow 1.5 was used to implement the Inception Net model, and TensorBoard was used for visualizing the learning curves, graphs, histograms, and so on.
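The training runs described in the next subsection use a stepwise learning-rate decay (initial rate 0.01, multiplied by 0.1 every 10 epochs). As a plain-Python sketch of that schedule:

```python
def step_decay(initial_rate, decay, step_size, epoch):
    """Learning rate at a given epoch: the initial rate is
    multiplied by `decay` once every `step_size` epochs."""
    return initial_rate * decay ** (epoch // step_size)

# Schedule from the training section: 0.01 decayed by 0.1 every 10 epochs.
for epoch in (0, 9, 10, 19, 20):
    print(epoch, step_decay(0.01, 0.1, 10, epoch))
```

The rate stays at 0.01 for epochs 0–9, drops to 0.001 for epochs 10–19, and so on.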
B. TRAINING METHOD

All images labeled with their respective emotions are prepared for training the model. The proposed CNN model was implemented using TensorFlow. The spectrogram images generated from IEMOCAP are resized to 500 × 300. More than 400 spectrograms were generated from all the audio files in the dataset, and about 500 images per emotion class were collected from the corpus database. The training process was run for 20 epochs with the batch size set to 100. The initial learning rate was set to 0.01 with a decay of 0.1 after every 10 epochs. Training was performed on a single NVIDIA GeForce GT 730 with 2 GB of onboard memory. The training took around 35 minutes, and the best accuracy was achieved after 28 epochs. On the training set, a loss of 0.71 was achieved, whereas a loss of 0.95 was recorded on the test set. An accuracy of 35.95% was achieved per spectrogram. It is important to note that the overall accuracy is very low; this may be due to the use of transfer learning and the small dataset available for each emotion class.

VI. RESULT & ANALYSIS

An accuracy rate of about 35.6% is achieved by the data model for predicting the emotions. It is evident from Fig. 2 that 0.8 is the highest accuracy rate achieved during validation.

Fig.3 Cross Entropy

Some of the reasons for the low accuracy rate are that transfer learning was used to train the model and that there may have been too few spectrograms used for training. The small dataset used for the training process also contributes to this.

VII. CONCLUSION

Various investigations and surveys of emotion recognition and of the deep learning techniques used for recognizing emotions were performed. In the future it will be necessary to make such a system much more reliable, as it has endless possibilities in all fields. This project attempted to use Inception Net to solve the emotion recognition problem; various databases were explored, the IEMOCAP database was used as the dataset for the experiment, and the model was trained using TensorFlow. An accuracy rate of about 38% was achieved. In the future, real-time emotion recognition can be developed using the same architecture.

REFERENCES

1. S. Furui, T. Kikuchi, Y. Shinnaka, and C. Hori, "Speech-to-Text and Speech-to-Speech Summarization," vol. 12, no. 4, pp. 401–408, 2004.
2. M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognit., vol. 44, no. 3, pp. 572–587, 2011.
3. Y. Lecun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
4. J. Schmidhuber, "Deep Learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
5. J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng,
“Multimodal Deep Learning,” Proc. 28th Int. Conf. Mach. Learn.,
pp. 689–696, 2011.
6. F. Dipl and T. Vogt, “Real-time automatic emotion recognition from
speech,” 2010.
7. S. Lugovic, I. Dunder, and M. Horvat, “Techniques and applications
of emotion recognition in speech,” 2016 39th Int. Conv. Inf.
Commun. Technol. Electron. Microelectron. MIPRO 2016 - Proc.,
no. November 2017, pp. 1278–1283, 2016.
8. B. Schuller, G. Rigoll, and M. Lang, “Speech emotion recognition
combining acoustic features and linguistic information in a hybrid
support vector machine - belief network architecture,” Acoust.
Speech, Signal Process., vol. 1, pp. 577–580, 2004.
9. S. G. Koolagudi and S. R. Krothapalli, “Emotion recognition from
speech using sub-syllabic and pitch synchronous spectral features,”
Int. J. Speech Technol., vol. 15, no. 4, pp. 495–511, 2012.
10. J. Rong, G. Li, and Y. P. P. Chen, “Acoustic feature selection for
automatic emotion recognition from speech,” Inf. Process. Manag.,
vol. 45, no. 3, pp. 315–328, 2009.
11. F. Noroozi, N. Akrami, and G. Anbarjafari, “Speech-based emotion
recognition and next reaction prediction,” 2017 25th Signal Process.
Commun. Appl. Conf. SIU 2017, no. 1, 2017.
12. A. Graves, A. Mohamed, and G. Hinton, “Speech Recognition with
Deep Recurrent Neural Networks,” in 2013 IEEE International
Conference on Acoustics, Speech and Signal Processing, 2013, pp.
6645–6649.
13. C.-W. Huang and S. S. Narayanan, “Characterizing Types of
Convolution in Deep Convolutional Recurrent Neural Networks for
Robust Speech Emotion Recognition,” pp. 1–19, 2017.
14. H. M. Fayek, M. Lech, and L. Cavedon, “Evaluating deep learning
architectures for Speech Emotion Recognition,” Neural Networks,
vol. 92, pp. 60–68, 2017.
15. A. M. Badshah, J. Ahmad, N. Rahim, and S. W. Baik, “Speech
Emotion Recognition from Spectrograms with Deep Convolutional
Neural Network,” 2017 Int. Conf. Platf. Technol. Serv., pp. 1–5,
2017.
16. C. Busso et al., “IEMOCAP: Interactive emotional dyadic motion
capture database,” Lang. Resour. Eval., vol. 42, no. 4, pp. 335–359,
2008.
17. C. Szegedy, V. Vanhoucke, J. Shlens, and Z. Wojna, “Rethinking the
Inception Architecture for Computer Vision,” 2014.