Environmental Sound Classification using
Transformers
Yash Vardhan Singh (RA2311003011811), Tanay Dilip Patel (RA2311003011812),
Aditya Raj (RA2311003011812), Anurag Debnath (RA2311003011844)
SRM Institute of Science & Technology, Kattankulathur
Abstract—Environmental Sound Classification (ESC) involves recognizing sounds from real-world environments such as sirens, barking dogs, drilling, or footsteps. Traditional methods using handcrafted features and classical classifiers have struggled in noisy and overlapping sound conditions. Deep learning, especially CNNs and Transformers, has achieved significant performance improvements in this field.

This project focuses on reproducing and improving the Transformer-based ESC model proposed by Jahangir et al. (2023) in "Deep Learning-based Environmental Sound Classification Using Feature Fusion and Data Enhancement". We employed advanced data augmentation, feature fusion, and model optimization techniques to enhance accuracy across benchmark datasets (ESC-10, ESC-50, UrbanSound8K).

Our model achieved a marginal improvement of approximately 0.01% over the reported state-of-the-art accuracy (up to 98.01% on UrbanSound8K). The outcomes demonstrate that augmentation and optimized Transformer architectures can further enhance ESC systems, making them suitable for IoT, autonomous vehicles, and smart-city applications.

Keywords - Environmental Sound Classification, Transformer, Deep Learning, Feature Fusion, Data Augmentation.

I. INTRODUCTION

Environmental sounds form a crucial part of our daily surroundings. Identifying such sounds has applications in smart cities, IoT, surveillance, and autonomous systems. However, environmental sound data is often noisy, overlapping, and highly variable, making classification a challenging task. This project focuses on the supervised classification of environmental sounds using deep learning. The model is trained on public datasets and evaluated for performance improvement. Future scope includes deployment on real-time IoT devices.

This project aims to design a robust and scalable Transformer-based ESC model that generalizes well to real-world audio conditions, paving the way for practical deployment in smart IoT devices and urban sound analysis systems.

II. METHODOLOGY

The proposed model combines multiple feature representations of environmental sounds, including Mel-spectrograms, MFCCs, and chromagrams. These features are fused into a unified representation before being passed to the Transformer encoder. The model's workflow includes audio preprocessing, feature extraction, data augmentation, model training, and evaluation.

The Transformer encoder captures long-term temporal dependencies using multi-head self-attention mechanisms. Augmentation techniques such as SpecAugment, MixUp, and Between-Class Learning enhance the model's generalization capabilities. Hyperparameter tuning, dropout regularization, and learning-rate scheduling are applied for optimization.

III. RESULT AND DISCUSSION

The model was trained on the UrbanSound8K and ESC-50 datasets. The experiments demonstrated strong classification performance with clear improvements over traditional CNN-based approaches. The fused features enabled better sound discrimination, while the augmentations improved robustness under noisy conditions. The proposed model achieved over 94% accuracy on ESC-50 and approximately 98% on UrbanSound8K. The confusion matrix and accuracy curves indicate consistent learning behavior with minimal overfitting.

A. Abbreviations and Acronyms

This paper uses several abbreviations. ESC stands for Environmental Sound Classification. CNN represents Convolutional Neural Network, while MFCC denotes Mel-Frequency Cepstral Coefficients. STFT means Short-Time Fourier Transform, and GPU refers to the Graphics Processing Unit used for model training. MLP stands for Multi-Layer Perceptron, and SNR means Signal-to-Noise Ratio. MSE indicates Mean Squared Error, used for loss calculation, and F1-score combines precision and recall to measure classification performance.
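The fusion and augmentation steps described in the Methodology can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, default mask widths, and the simple concatenation-based fusion are assumptions, and SpecAugment and MixUp are reduced to their core operations.

```python
import numpy as np

def fuse_features(mel, mfcc, chroma):
    """Stack per-frame features (each shaped (bins, frames)) into one
    unified (total_bins, frames) representation."""
    return np.concatenate([mel, mfcc, chroma], axis=0)

def spec_augment(spec, n_freq_masks=2, n_time_masks=2,
                 max_freq_width=8, max_time_width=16, rng=None):
    """SpecAugment-style masking: zero out random frequency bands and
    time spans of a (freq_bins, time_frames) spectrogram."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(n_freq_masks):
        w = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w)))
        out[f0:f0 + w, :] = 0.0          # mask a horizontal frequency band
    for _ in range(n_time_masks):
        w = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w)))
        out[:, t0:t0 + w] = 0.0          # mask a vertical time span
    return out

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """MixUp: convex combination of two examples and their one-hot labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

In the full pipeline these operations would be applied per training batch before the fused representation is passed to the Transformer encoder.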
B. Units

The primary unit explicitly used for performance evaluation is percentage (%), applied to the key classification metric of accuracy (e.g., achieving over 94% accuracy on ESC-50 and approximately 98% on UrbanSound8K). Because this project operates on audio features, other standard units of the signal-processing domain are used implicitly, including the Mel scale for representing perceived pitch differences in Mel-spectrograms and MFCCs, and decibels (dB) for quantifying the magnitude or power of the audio signal. Additionally, the underlying raw audio data is typically measured in seconds (s) for duration.
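The Mel-scale and decibel conversions mentioned above can be written down directly. This is a generic sketch: the HTK-style Mel formula shown here is one common convention and may differ from the exact variant used in the experiments.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style Mel scale: maps frequency in Hz to perceived pitch in mels."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def power_to_db(power, ref=1.0, eps=1e-10):
    """Convert a power quantity to decibels relative to `ref`,
    clamping at `eps` to avoid log(0)."""
    return 10.0 * np.log10(np.maximum(np.asarray(power, dtype=float), eps) / ref)
```

Note that 1000 Hz maps to roughly 1000 mels, which is how the scale is anchored.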
C. Equations

Mathematical modeling provides a formal representation of the learning process used by the proposed Transformer-based sound classification system. It defines the relationships between the input features (audio signals), the model parameters, and the output predictions. This section describes the theoretical foundations, model equations, and optimization strategy used to train the system effectively.

Transformer-based feature encoding: each time-frequency patch s_i from the spectrogram is linearly projected into a fixed-dimensional embedding vector: z_i = W_E * s_i + b_E, where W_E is the learnable embedding matrix and b_E is the bias term. To preserve the sequential structure, positional encodings are added: h_i = z_i + p_i, where p_i is the positional encoding vector.

The multi-head version computes multiple parallel attention heads: MultiHead(Q, K, V) = [head_1; head_2; ...; head_h] * W_O, where each head learns different temporal-frequency dependencies.

The softmax layer produces class probabilities: ŷ = Softmax(W_c * F(h) + b_c).

The model is trained to minimize the categorical cross-entropy loss between the predicted probabilities and the true labels: L = -(1/N) Σ_i Σ_c y_ic log(ŷ_ic), where N is the number of samples, y_ic is the true label, and ŷ_ic is the predicted probability. Parameters are updated with the Adam optimizer: θ_(t+1) = θ_t - η * m̂_t / (√v̂_t + ε).

The final output of the model is the predicted class label: y* = argmax_c(ŷ_c), representing the environmental sound category most likely present.

D. Some Common Mistakes

• Incorrect input preparation: directly adding positional encodings to the raw spectrogram without an initial linear projection, which can disrupt the model's learning process.

• Alignment issues (for sequential tasks): using traditional methods such as Dynamic Time Warping (DTW) instead of relying on the Transformer's self-attention mechanism to implicitly learn better sequence alignment.

• Under-sizing: using an insufficient number of layers or hidden units for the Transformer architecture, which prevents the model from capturing the complex, long-range dependencies in audio data.

E. Figures and Tables

a) The charts show the deep learning model's performance over epochs. The accuracy plot indicates the model is learning well, with training and validation accuracy increasing and staying close. The loss plot shows a good fit, as both training and validation loss decrease, with no severe signs of overfitting or underfitting.

b) The confusion matrix evaluates the Transformer model's classification of 10 sound types. The diagonal shows the number of correct predictions (e.g., 196 air-conditioner sounds). Off-diagonal numbers represent misclassifications (e.g., 20 'children playing' sounds were wrongly predicted as 'street music'). The model performs well overall.

c) Comparison with existing models:

Model                      Architecture                   Accuracy (%)
CNN (Piczak, 2015)         2D CNN                         64.5
CNN + Augmentation (2017)  Deep CNN                       79
ResNet Transfer (2021)     CNN (ResNet)                   92
Transformer (2023)         Transformer + Feature Fusion   98

IV. CONCLUSION AND FUTURE WORK

This study presents an enhanced Transformer-based framework for Environmental Sound Classification. By employing feature fusion and advanced data augmentation, the proposed model outperforms existing methods in terms of accuracy and robustness. Future work may include real-time ESC deployment on embedded systems and exploring lightweight Transformer variants for edge devices.
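As a closing illustration, the forward-pass equations from Section C can be sketched end-to-end in NumPy. All dimensions and weight initializations below are illustrative rather than the paper's configuration, a single attention head stands in for the multi-head version, and the sinusoidal positional encoding is an assumed choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes (not from the paper)
n_patches, patch_dim, d_model, n_classes = 6, 40, 32, 10

# Linear patch embedding: z_i = W_E * s_i + b_E
S = rng.normal(size=(n_patches, patch_dim))        # spectrogram patches s_i
W_E = rng.normal(size=(patch_dim, d_model)) * 0.1  # learnable embedding matrix
b_E = np.zeros(d_model)                            # bias term
Z = S @ W_E + b_E

# Sinusoidal positional encoding: h_i = z_i + p_i
pos = np.arange(n_patches)[:, None]
idx = np.arange(d_model // 2)[None, :]
angles = pos / (10000.0 ** (2.0 * idx / d_model))
P = np.zeros((n_patches, d_model))
P[:, 0::2], P[:, 1::2] = np.sin(angles), np.cos(angles)
H = Z + P

# One self-attention head: Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
Q, K, V = H @ W_q, H @ W_k, H @ W_v
A = softmax(Q @ K.T / np.sqrt(d_model), axis=-1) @ V

# Classification head: y_hat = Softmax(W_c * F(h) + b_c), with mean pooling as F
W_c = rng.normal(size=(d_model, n_classes)) * 0.1
b_c = np.zeros(n_classes)
y_hat = softmax(A.mean(axis=0) @ W_c + b_c)

# Categorical cross-entropy for a one-hot label, and the predicted class
y_true = np.eye(n_classes)[3]
loss = -np.sum(y_true * np.log(y_hat + 1e-12))
y_star = int(np.argmax(y_hat))                     # y* = argmax_c(y_hat_c)
```

A full implementation would add the feed-forward sublayers, residual connections, layer normalization, and the Adam parameter updates, but the sketch above follows the same symbol-by-symbol structure as the equations.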
ACKNOWLEDGMENT
We sincerely thank the SRM Institute of Science and Technology
leadership, including Vice-Chancellor Dr. C. Muthamizhchelvan and
Dean-CET Dr. Leenus Jesu Martin M, for their vital support and
facilities. We are grateful to the School of Computing Chairperson,
Dr. Revathi Venkataraman, Associate Chairpersons, and especially
Dr. G. Niranjana, Head of Department, for her guidance. Our deepest
gratitude goes to our Faculty Advisors, Dr. Bakkialakshmi V S and
Dr. Aswathy K. Cherian, whose mentorship and support were
invaluable. Finally, we thank all staff, students, and our families for
their continuous help and encouragement.
REFERENCES
1. Piczak, K. J. (2015). Environmental sound classification with convolutional neural networks. IEEE MLSP, 1–6.
2. Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3), 279–283.
3. Tokozume, Y., Ushiku, Y., & Harada, T. (2017). Learning from between-class examples for deep sound recognition. arXiv:1711.10282.
4. Guzhov, A., Raue, F., Hees, J., & Dengel, A. (2021). ESResNet: Environmental sound classification based on visual domain models. ICPR 2021, 4933–4940.
5. Jahangir, R., Nauman, M. A., Alroobaea, R., Almotiri, J., Malik, M. M., & Alzahrani, S. M. (2023). Deep learning-based environmental sound classification using feature fusion and data enhancement. Computers, Materials & Continua, 74(1), 1070–1091.