Environmental Sound Classification using
Transformers
Yash Vardhan Singh (RA2311003011811), Tanay Dilip Patel (RA2311003011812),
Aditya Raj (RA2311003011812), Anurag Debnath (RA2311003011844)
SRM Institute of Science & Technology, Kattankulathur
Abstract—Environmental Sound Classification (ESC) involves recognizing sounds from real-world environments such as sirens, barking dogs, drilling, or footsteps. Traditional methods using handcrafted features and classical classifiers have struggled in noisy and overlapping sound conditions. Deep learning, especially CNNs and Transformers, has achieved significant performance improvements in this field.

This project focuses on reproducing and improving the Transformer-based ESC model proposed by Jahangir et al. (2023) in "Deep Learning-based Environmental Sound Classification Using Feature Fusion and Data Enhancement". We employed advanced data augmentation, feature fusion, and model optimization techniques to enhance accuracy across benchmark datasets (ESC-10, ESC-50, UrbanSound8K).

Our model achieved a marginal improvement of approximately 0.01% over the reported state-of-the-art accuracy (up to 98.01% on UrbanSound8K). The outcomes demonstrate that augmentation and optimized Transformer architectures can further enhance ESC systems, making them suitable for IoT, autonomous vehicles, and smart-city applications.

Keywords - Environmental Sound Classification, Transformer, Deep Learning, Feature Fusion, Data Augmentation.

I. INTRODUCTION

Environmental sounds form a crucial part of our daily surroundings. Identifying such sounds has applications in smart cities, IoT, surveillance, and autonomous systems. However, environmental sound data is often noisy, overlapping, and highly variable, making classification a challenging task. This project focuses on the supervised classification of environmental sounds using deep learning. The model is trained on public datasets and evaluated for performance improvement. Future scope includes deployment on real-time IoT devices.

This project aims to design a robust and scalable Transformer-based ESC model that generalizes well to real-world audio conditions, paving the way for practical deployment in smart IoT devices and urban sound analysis systems.

II. METHODOLOGY

The proposed model combines multiple feature representations of environmental sounds, including Mel-spectrograms, MFCCs, and chromagrams. These features are fused into a unified representation before being passed to the Transformer encoder. The model's workflow includes audio preprocessing, feature extraction, data augmentation, model training, and evaluation.

The Transformer encoder captures long-term temporal dependencies using multi-head self-attention mechanisms. Augmentation techniques such as SpecAugment, MixUp, and Between-Class Learning enhance the model's generalization capabilities. Hyperparameter tuning, dropout regularization, and learning-rate scheduling are applied for optimization.

III. RESULT AND DISCUSSION

The model was trained on the UrbanSound8K and ESC-50 datasets. The experiments demonstrated strong classification performance with clear improvements over traditional CNN-based approaches. The fused features enabled better sound discrimination, while the augmentations improved robustness under noisy conditions. The proposed model achieved over 94% accuracy on ESC-50 and approximately 98% on UrbanSound8K. The confusion matrix and accuracy curves indicate consistent learning behavior with minimal overfitting.

A. Abbreviations and Acronyms

This paper uses several abbreviations. ESC stands for Environmental Sound Classification. CNN represents Convolutional Neural Network, while MFCC denotes Mel-Frequency Cepstral Coefficients. STFT means Short-Time Fourier Transform, and GPU refers to the Graphics Processing Unit used for model training. MLP stands for Multi-Layer Perceptron, and SNR means Signal-to-Noise Ratio. MSE indicates Mean Squared Error, used for loss calculation, and F1-score combines precision and recall to measure classification performance.
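The fusion and augmentation steps described in the Methodology can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, default mask widths, and the simple concatenation-based fusion are assumptions, and SpecAugment and MixUp are reduced to their core operations.

```python
import numpy as np

def fuse_features(mel, mfcc, chroma):
    """Stack per-frame features (each shaped (bins, frames)) into one
    unified (total_bins, frames) representation."""
    return np.concatenate([mel, mfcc, chroma], axis=0)

def spec_augment(spec, n_freq_masks=2, n_time_masks=2,
                 max_freq_width=8, max_time_width=16, rng=None):
    """SpecAugment-style masking: zero out random frequency bands and
    time spans of a (freq_bins, time_frames) spectrogram."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(n_freq_masks):
        w = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w)))
        out[f0:f0 + w, :] = 0.0          # mask a horizontal frequency band
    for _ in range(n_time_masks):
        w = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w)))
        out[:, t0:t0 + w] = 0.0          # mask a vertical time span
    return out

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """MixUp: convex combination of two examples and their one-hot labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

In the full pipeline these operations would be applied per training batch before the fused representation is passed to the Transformer encoder.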
B. Units

The primary unit explicitly used for performance evaluation is percentage (%), applied to the key classification metric of accuracy (e.g., achieving over 94% accuracy on ESC-50 and approximately 98% on UrbanSound8K). Because this project operates on audio features, other standard units of the signal-processing domain are used implicitly, including the Mel scale for representing perceived pitch differences in Mel-spectrograms and MFCCs, and decibels (dB) for quantifying the magnitude or power of the audio signal. Additionally, the underlying raw audio data is typically measured in seconds (s) for duration.
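The Mel-scale and decibel conversions mentioned above can be written down directly. This is a generic sketch: the HTK-style Mel formula shown here is one common convention and may differ from the exact variant used in the experiments.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style Mel scale: maps frequency in Hz to perceived pitch in mels."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def power_to_db(power, ref=1.0, eps=1e-10):
    """Convert a power quantity to decibels relative to `ref`,
    clamping at `eps` to avoid log(0)."""
    return 10.0 * np.log10(np.maximum(np.asarray(power, dtype=float), eps) / ref)
```

Note that 1000 Hz maps to roughly 1000 mels, which is how the scale is anchored.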
C. Equations

Mathematical modeling provides a formal representation of the learning process used by the proposed Transformer-based sound classification system. It defines the relationships between the input features (audio signals), the model parameters, and the output predictions. This section describes the theoretical foundations, model equations, and optimization strategy used to train the system effectively.

Transformer-based feature encoding: each time-frequency patch s_i from the spectrogram is linearly projected into a fixed-dimensional embedding vector: z_i = W_E * s_i + b_E, where W_E is the learnable embedding matrix and b_E is the bias term. To preserve the sequential structure, positional encodings are added: h_i = z_i + p_i, where p_i is the positional encoding vector.

The multi-head version computes multiple parallel attention heads: MultiHead(Q, K, V) = [head_1; head_2; ...; head_h] * W_O, where each head learns different temporal-frequency dependencies.

The softmax layer produces class probabilities: ŷ = Softmax(W_c * F(h) + b_c).

The model is trained to minimize the categorical cross-entropy loss between the predicted probabilities and the true labels: L = -(1/N) Σ_i Σ_c y_ic log(ŷ_ic), where N is the number of samples, y_ic is the true label, and ŷ_ic is the predicted probability. Parameters are updated with the Adam optimizer: θ_(t+1) = θ_t - η * m̂_t / (√v̂_t + ε).

The final output of the model is the predicted class label: y* = argmax_c(ŷ_c), representing the environmental sound category most likely present.

D. Some Common Mistakes

• Incorrect input preparation: directly adding positional encodings to the raw spectrogram without an initial linear projection, which can disrupt the model's learning process.

• Alignment issues (for sequential tasks): using traditional methods such as Dynamic Time Warping (DTW) instead of relying on the Transformer's self-attention mechanism to implicitly learn better sequence alignment.

• Under-sizing: using an insufficient number of layers or hidden units for the Transformer architecture, which prevents the model from capturing the complex, long-range dependencies in audio data.

E. Figures and Tables

a) The charts show the deep learning model's performance over epochs. The accuracy plot indicates the model is learning well, with training and validation accuracy increasing and staying close. The loss plot shows a good fit, as both training and validation loss decrease, with no severe signs of overfitting or underfitting.

b) The confusion matrix evaluates the Transformer model's classification of 10 sound types. The diagonal shows the number of correct predictions (e.g., 196 air-conditioner sounds). Off-diagonal numbers represent misclassifications (e.g., 20 'children playing' sounds were wrongly predicted as 'street music'). The model performs well overall.

c) Comparison with existing models:

Model                      Architecture                   Accuracy (%)
CNN (Piczak, 2015)         2D CNN                         64.5
CNN + Augmentation (2017)  Deep CNN                       79
ResNet Transfer (2021)     CNN (ResNet)                   92
Transformer (2023)         Transformer + Feature Fusion   98

IV. CONCLUSION AND FUTURE WORK

This study presents an enhanced Transformer-based framework for Environmental Sound Classification. By employing feature fusion and advanced data augmentation, the proposed model outperforms existing methods in terms of accuracy and robustness. Future work may include real-time ESC deployment on embedded systems and exploring lightweight Transformer variants for edge devices.
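As a closing illustration, the forward-pass equations from Section C can be sketched end-to-end in NumPy. All dimensions and weight initializations below are illustrative rather than the paper's configuration, a single attention head stands in for the multi-head version, and the sinusoidal positional encoding is an assumed choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes (not from the paper)
n_patches, patch_dim, d_model, n_classes = 6, 40, 32, 10

# Linear patch embedding: z_i = W_E * s_i + b_E
S = rng.normal(size=(n_patches, patch_dim))        # spectrogram patches s_i
W_E = rng.normal(size=(patch_dim, d_model)) * 0.1  # learnable embedding matrix
b_E = np.zeros(d_model)                            # bias term
Z = S @ W_E + b_E

# Sinusoidal positional encoding: h_i = z_i + p_i
pos = np.arange(n_patches)[:, None]
idx = np.arange(d_model // 2)[None, :]
angles = pos / (10000.0 ** (2.0 * idx / d_model))
P = np.zeros((n_patches, d_model))
P[:, 0::2], P[:, 1::2] = np.sin(angles), np.cos(angles)
H = Z + P

# One self-attention head: Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
Q, K, V = H @ W_q, H @ W_k, H @ W_v
A = softmax(Q @ K.T / np.sqrt(d_model), axis=-1) @ V

# Classification head: y_hat = Softmax(W_c * F(h) + b_c), with mean pooling as F
W_c = rng.normal(size=(d_model, n_classes)) * 0.1
b_c = np.zeros(n_classes)
y_hat = softmax(A.mean(axis=0) @ W_c + b_c)

# Categorical cross-entropy for a one-hot label, and the predicted class
y_true = np.eye(n_classes)[3]
loss = -np.sum(y_true * np.log(y_hat + 1e-12))
y_star = int(np.argmax(y_hat))                     # y* = argmax_c(y_hat_c)
```

A full implementation would add the feed-forward sublayers, residual connections, layer normalization, and the Adam parameter updates, but the sketch above follows the same symbol-by-symbol structure as the equations.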
ACKNOWLEDGMENT
We sincerely thank the SRM Institute of Science and Technology
leadership, including Vice-Chancellor Dr. C. Muthamizhchelvan and
Dean-CET Dr. Leenus Jesu Martin M, for their vital support and
facilities. We are grateful to the School of Computing Chairperson,
Dr. Revathi Venkataraman, Associate Chairpersons, and especially
Dr. G. Niranjana, Head of Department, for her guidance. Our deepest
gratitude goes to our Faculty Advisors, Dr. Bakkialakshmi V S and
Dr. Aswathy K. Cherian, whose mentorship and support were
invaluable. Finally, we thank all staff, students, and our families for
their continuous help and encouragement.
REFERENCES
1. Piczak, K. J. (2015). Environmental sound classification with convolutional neural networks. IEEE MLSP, 1–6.
2. Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3), 279–283.
3. Tokozume, Y., Ushiku, Y., & Harada, T. (2017). Learning from between-class examples for deep sound recognition. arXiv:1711.10282.
4. Guzhov, A., Raue, F., Hees, J., & Dengel, A. (2021). ESResNet: Environmental sound classification based on visual domain models. ICPR 2021, 4933–4940.
5. Jahangir, R., Nauman, M. A., Alroobaea, R., Almotiri, J., Malik, M. M., & Alzahrani, S. M. (2023). Deep learning-based environmental sound classification using feature fusion and data enhancement. Computers, Materials & Continua, 74(1), 1070–1091.