Drowsiness Detection System Development

This document is an end-of-studies project report submitted by AFILAL Taha to the jury at Abdelmalek Essaâdi University, Faculty of sciences and technologies in Tangier, Morocco. The project involved developing a drowsiness detection system at Capgemini Engineering. The report acknowledges the contributions of academic and professional supervisors. It then provides an abstract that overviews the development of a state-of-the-art drowsiness detection system focused on behavioral metrics using techniques like deep learning, computer vision, and facial landmarks. The table of contents previews that the report will cover the project scoping, methods and techniques used including machine learning and tools.

  • General Introduction
  • Chapter 1: Project Scoping
  • Chapter 2: Methods and Techniques Used in the Project
  • Chapter 3: Project Implementation
  • General Conclusion

Kingdom of Morocco

Abdelmalek Essaâdi University


Faculty of Sciences and Technologies – Tangier
Department of Electrical Engineering

End-of-Studies Project
For the attainment of the Engineering Degree
Field of study: Electrical Engineering and Industrial Management

Subject:

Development of a Drowsiness Detection


System
Realized in:

Capgemini Engineering
By:
AFILAL Taha
Defended on: 21/06/2023 before the jury:
- Pr. EL MOKHTARI Karim (President)
- Pr. BSISS Mohammed (Supervisor)
- Pr. BENLAMLIH Mohammed (Examiner)
- Pr. EL MRABET Mhamed (Examiner)
- Mr. ASLANE Amine (Supervisor)
Academic Year: 2022-2023
Acknowledgements

Upon submitting this final study report, we feel not only necessary but also compelled by
respect and gratitude to express our sincere thanks to all those who have contributed directly or
indirectly to the completion of this work.

We extend our profound and sincere thanks to our academic supervisor, Mr. BSISS Mohammed,
for his substantial academic guidance, keen interest in monitoring the progress of our work,
invaluable guidance, availability, kindness, and relevant remarks that have been highly valuable to
us.

It gives us great pleasure to express our gratitude to our professional mentors, Mr. ASLANE
Amine, System Engineer at Capgemini, and Ms. JADIANI Chadia, Team Manager at Capgemini,
for their availability, sound advice, constructive remarks, and support.

We would also like to extend our sincere appreciation to the members of the jury:

Mr. BSISS Mohammed, Mr. BENLAMLIH Mohammed, Mr. EL MOKHTARI Karim, and
Mr. EL MRABET Mhamed, for their interest in our project, accepting to evaluate our work, and
enriching it with their suggestions.

We would be remiss not to express our gratitude to the entire team at FST Tangier, especially the
department of Electrical Engineering.

May this valuable work be an expression of our utmost esteem and warmest thanks to all the
individuals mentioned above.
Abstract:

In recent years, the alarming rise in driver drowsiness has emerged as a leading cause of serious road accidents, with catastrophic consequences including grievous injuries, fatalities, and substantial economic losses. The need for a dependable driver drowsiness detection system has therefore never been greater: such a system should notify and alert drivers of potential hazards before accidents occur. Extensive research has explored diverse measures and dimensions for determining driver drowsiness, spanning vehicle-based analyses, behavioral evaluations, and physiological assessments. This report provides a comprehensive guide to the development of a state-of-the-art Drowsiness Detection System with an exclusive focus on behavioral metrics, and examines in detail the cutting-edge techniques, methodologies, and strategies employed in its construction and implementation.

Keywords: Deep Learning, Computer vision, Advanced Driver Assistance System.


Table of contents

Chapter 1 Project Scoping:.................................................................................................................4

1.1 Company Presentation:...............................................................................................................4

1.1.1 Introduction:.........................................................................................................................4

1.1.2 Capgemini Engineering’s general presentation:..................................................................4

1.1.3 Location:..............................................................................................................................5

1.1.4 Capgemini Engineering Morocco:.......................................................................................6

1.2 Problem Definition:....................................................................................................................7

1.2.1 Project Context:....................................................................................................................7

1.2.2 Problem definition:..............................................................................................................7

1.2.3 Needs analysis:.....................................................................................................................8

1.2.4 Functional analysis:..............................................................................................................9

1.2.5 Project Chart:.....................................................................................................................10

1.2.6 Gantt Diagram:...................................................................................................................11

1.3 Conclusion:...............................................................................................................................11

Chapter 2 Methods and techniques used in the project:...................................................................13

2.1 Introduction:.............................................................................................................................13

2.2 Drowsiness detection approaches:............................................................................................13

2.2.1 Facial expressions-based measures:...................................................................................13

2.2.2 Physiological measures:.....................................................................................................14

2.2.3 Vehicle-based measures:....................................................................................16

2.3 Approach selection:..................................................................................................................16

2.4 Utilized techniques:..................................................................................................................17

2.4.1 Machine Learning/ Deep Learning:...................................................................................17

Introduction:................................................................................................17
Machine learning types:..............................................................................17

a) Supervised Machine learning:.........................................................................................17

b) Unsupervised Machine learning:.....................................................................................18

Convolutional neural network « CNN »:....................................................19

Transfer Learning and learning from scratch:.............................................26

a) Learning from scratch:.....................................................................................................26

b) Transfer Learning:...........................................................................................................28

c) Methods for transfer learning:.........................................................................................30

Machine Learning process:.........................................................................31

K-Fold cross validation:..............................................................................32

Hyperparameters tuning:.............................................................................33

a) Hyperparameters and parameters:...................................................................................34

b) Grid search:.....................................................................................................................34

c) Random search:...............................................................................................................35

2.4.2 Facial Landmarks:..............................................................................................................35

2.4.3 Tools:.................................................................................................................................36

Programming language:..............................................................................36

Libraries:.....................................................................................................37

a) Opencv:............................................................................................................................37

b) Dlib:.................................................................................................................................37

c) Keras:...............................................................................................................................37

Chapter 3 Project Implementation:...................................................................................................39

3.1 Introduction:.............................................................................................................................39

3.2 System requirements and analysis:...........................................................................................39

3.2.1 Functional requirements:....................................................................................................39

Introduction:................................................................................................39

Identification of facial expressions to be monitored:..................................39


a) Eyes expressions:.............................................................................................................39

b) Mouth expressions:..........................................................................................................40

Accuracy level:............................................................................................41

Response time requirements:.......................................................................41

a) Sensor response time (Camera):......................................................................................41

b) Code running time requirements:....................................................................................42

c) CNN model response time requirements:........................................................................42

3.2.2 Environmental factors:.......................................................................................................43

3.3 Components design:..................................................................................................................43

3.3.1 Introduction:.......................................................................................................................43

3.3.2 Eyes State subsystem:........................................................................................................44

Introduction:................................................................................................44

Convolutional Neural Network model for eyes classification:...................44

a) Data:.................................................................................................................................44

(i) Source:..........................................................................................................................44

(ii) Data cleaning:...........................................................................................................45

b) Data distribution:.............................................................................................................46

c) CNN structure for eyes classification model:..................................................................47

(i) Features-extractor Layers:............................................................................................47

(ii) Classification Layers:...............................................................................................49

(iii) Conclusion:...............................................................................................................52

d) Training phase:................................................................................................................53

(i) EfficientNet-b3 based Eyes Classifier curves:.............................................................54

(ii) DenseNet-121 based Eyes Classifier curves:...........................................................55

(iii) MobileNet-1 based Eyes Classifier curves:..............................................................56

e) Evaluation phase:.............................................................................................................57

(i) k-Cross Validation Curves analysis:............................................................................57


(ii) Test dataset Evaluation:............................................................................................58

Eyes Localization and Classification:.........................................................64

Eyes closure ratio (blinking ratio):..............................................................65

Eyes closure time:.......................................................................................66

3.3.3 Eyes closure percentage subsystem:..................................................................................67

3.3.4 Mouth closure percentage subsystem:...............................................................................68

Yawn expression detection:.........................................................................69

I. Data:..............................................................................................................................................ii

II. Data distribution:.........................................................................................................................iii

III. Model Training:.........................................................................................................................v


Figures List:

Figure 1: Capgemini’s Locations.........................................................................................................5


Figure 2: Functional Needs..................................................................................................................9
Figure 3: Supervised Machine Learning............................................................................................18
Figure 4: Unsupervised Machine Learning........................................................................................18
Figure 5: Image Channels...................................................................................................................19
Figure 6: Convolutional process.........................................................................................................20
Figure 7: Average Pooling................................................................................................................21
Figure 8: Max Pooling........................................................................................................................21
Figure 9: Classification Layers...........................................................................................................22
Figure 10: Sigmoid Activation function.............................................................................................23
Figure 11: ReLu activation function..................................................................................................23
Figure 12: Neuron Scalar value..........................................................................................................24
Figure 13: Forward and back propagations........................................................................25
Figure 14: Convolutional Neural Network Structure.........................................................................26
Figure 15: Training from scratch process...........................................................................................26
Figure 16: Training CNN from scratch requirements........................................................................27
Figure 17: transfer learning process...................................................................................................28
Figure 18: requirement for Transfer learning.....................................................................................28
Figure 19: deep learning basic network.............................................................................................30
Figure 20: Features Transfer..............................................................................................................31
Figure 21: Machine learning process.................................................................................................32
Figure 22: k-fold Cross Validation.....................................................................................................33
Figure 23: Grid search........................................................................................................................34
Figure 24: random search...................................................................................................................35
Figure 25: Facial landmarks...............................................................................................................36
Figure 26: Drowsiness detection system Diagram.............................................................................44
Figure 27: MRL dataset......................................................................................................................45
Figure 28: Dimensions converting algorithm.....................................................................................46
Figure 29: 4-fold cross validation......................................................................................................47
Figure 30: CNN Models benchmark..................................................................................................48
Figure 31: Close and Open Eyes characteristics................................................................................50
Figure 32: Random Layer construction..............................................................................................51
Figure 33: Eyes State Classifier Structures........................................................................................52
Figure 34: EfficientNet-b3 iteration 1................................................................................................54
Figure 35: EfficientNet-b3 iteration 2................................................................................................54
Figure 36: EfficientNet-b3 iteration 3................................................................................................54
Figure 37: EfficientNet-b3 iteration 4................................................................................................54
Figure 38: DenseNet-121 iteration 1..................................................................................................55
Figure 39: DenseNet-121 iteration 2..................................................................................................55
Figure 40: DenseNet-121 iteration 3..................................................................................................55
Figure 41: DenseNet-121 iteration 4..................................................................................................55
Figure 42: MobileNet-1 iteration 1....................................................................................................56
Figure 43: MobileNet-1 iteration 2....................................................................................................56
Figure 44: MobileNet-1 iteration 3....................................................................................................56
Figure 45: MobileNet-1 iteration 4....................................................................................................56
Figure 46: EfficientNet-b3 Graph......................................................................................................57
Figure 47: DenseNet-121 Graph........................................................................................................57
Figure 48: MobileNet-1 Graph...........................................................................................................58
Figure 49: No-Eyes-Glasses dataset...................................................................................................58
Figure 50: Eyes-Glasses dataset.........................................................................................................59
Figure 51: EfficientNet prediction on No-Eyes-Glasses dataset........................................................59
Figure 52: EfficientNet-b3 on Eyes-Glasses dataset..........................................................................60
Figure 53: DenseNet-121 on No-Eyes-Glasses dataset......................................................................61
Figure 54: DenseNet-121 on Eyes-Glasses dataset............................................................................61
Figure 55: MobileNet-1 on No-Eyes-Glasses dataset........................................................................62
Figure 56: MobileNet-1 on Eyes-Glasses dataset..............................................................................63
Figure 57: Eyes landmarks.................................................................................................................64
Figure 58: Eyes localization...............................................................................................................65
Figure 59: Eye Classification.............................................................................................................65
Figure 60: Eyes Closure ratio function...............................................................................................66
Figure 61: Eyes closure time function................................................................................................67
Figure 62: Eyes landmarks.................................................................................................................68
Figure 63: Eyes Closure Percentage function....................................................................................68
Figure 64: Mouth landmarks..............................................................................................................69
Figure 65: Person Laughing (Smiling)...............................................................................................69
Figure 66: Person yawning.................................................................................................................70
Figure 67: Mouth closure percentage function...................................................................................71
Tables List:

Table 1: Functions description.............................................................................................................9


Table 2: Project’s Chart......................................................................................................................10
Table 3: Behavioral Measures............................................................................................................14
Table 4: Physiological Measures........................................................................................................15
Table 5: Vehicle-based measures.......................................................................................................16
Table 6: Sensor required Frames per second......................................................................................42
Table 7: Models comparison..............................................................................................................48
Table 8: Classification block Layers..................................................................................................51
Table 9: K-fold Cross Validation Analysis........................................................................................57
Table 10: Models Confusion Matrix results.......................................................................................64
Table 11: Yawning and Laughing Mouth Aspect Ratio.....................................................................71
Acronyms List:

ADAS: Advanced Driver Assistance System.


CNN: Convolutional Neural Network.
FPS: Frames Per Second.
General Introduction:

Drowsiness and inattentiveness while driving or operating heavy machinery pose serious safety
hazards and risks. According to studies conducted by the National Highway Traffic Safety
Administration, drowsy driving is responsible for over 6,000 fatal crashes in the United States each
year, resulting in thousands of injuries and deaths. Traditional methods for alertness monitoring rely
on subjective self-reporting by individuals, which can be unreliable. An automated system for
detecting driver alertness in real-time is therefore highly desirable to help reduce drowsy driving
incidents.

In this graduation project, we propose a computer vision-based drowsiness detection system which
analyzes facial expressions and cues to identify signs of drowsiness and impaired alertness. Our
system tracks:

- Eyes blinking ratio.


- Eyes closure time.
- Eyes closure percentage.
- Yawning frequency.
- Head orientation changes.

These indicators are empirically linked to levels of drowsiness. By monitoring multiple facial
metrics, we aim to robustly detect drowsiness and issue an alert if the driver shows prolonged or
repeated drowsy behavior. The system uses machine learning algorithms trained on large datasets
of images with annotated facial landmark points and drowsiness levels.
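To make the eye-based metrics above concrete, the following is a minimal illustrative sketch (not this report's actual implementation) of the Eye Aspect Ratio, a standard quantity computed from the six facial-landmark points around one eye that underlies blink-ratio and closure-time measures. The function names and toy coordinates are assumptions made for illustration only.

```python
import math

def euclidean(p, q):
    """Distance between two (x, y) landmark points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def eye_aspect_ratio(eye):
    """eye: list of six (x, y) points [p1..p6] following the usual
    68-point landmark ordering around one eye.
    EAR = (|p2 - p6| + |p3 - p5|) / (2 * |p1 - p4|).
    The ratio stays roughly constant while the eye is open and drops
    toward zero as the eyelids close."""
    vertical = euclidean(eye[1], eye[5]) + euclidean(eye[2], eye[4])
    horizontal = euclidean(eye[0], eye[3])
    return vertical / (2.0 * horizontal)

# Toy example: six hand-picked points sketching an open eye.
open_eye = [(0, 3), (2, 5), (4, 5), (6, 3), (4, 1), (2, 1)]
print(round(eye_aspect_ratio(open_eye), 3))  # prints 0.667
```

In practice the six points would come from a facial-landmark detector rather than hand-picked coordinates, and the ratio would be averaged over both eyes.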

Drowsiness detection is an active area of research with life-saving applications in transportation and
safety. An automated system for monitoring driver alertness in real-time can help prevent tragic
drowsy driving accidents and save lives. The system proposed in this project leverages state-of-the-art
machine learning and computer vision techniques to provide a robust solution for detecting signs of
drowsiness and impaired awareness. Our Drowsiness Detection System can serve as an important
component of Advanced Driver Assistance Systems (ADAS) to improve traffic safety.

ADAS are passive vehicle systems that use cameras and image-processing algorithms to monitor the
driver and issue warnings and alerts. The drowsiness detection module presented in this project can
enhance ADAS by providing real-time assessment of driver alertness and issuing warnings when
signs of drowsiness are detected, helping prevent potential accidents.
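As a sketch of how such a module's alerting logic might work (the threshold values and names below are illustrative assumptions, not this system's tuned parameters), a warning is typically raised only when eye closure persists across many consecutive frames, which filters out ordinary blinks:

```python
# Illustrative alerting logic: flag drowsiness only after sustained
# eye closure, not on isolated blinks. Values are assumed for the sketch.
EAR_THRESHOLD = 0.25   # below this, the eye is treated as closed
CONSEC_FRAMES = 48     # ~2 s of continuous closure at 24 FPS

def drowsiness_alerts(ear_stream):
    """Yield True for each frame where an alert should fire,
    given a per-frame stream of eye-aspect-ratio values."""
    closed_count = 0
    for ear in ear_stream:
        # Count consecutive closed-eye frames; reset on any open frame.
        closed_count = closed_count + 1 if ear < EAR_THRESHOLD else 0
        yield closed_count >= CONSEC_FRAMES

# Example: 10 open-eye frames followed by 50 closed-eye frames.
stream = [0.30] * 10 + [0.10] * 50
alerts = list(drowsiness_alerts(stream))
print(alerts.index(True))  # first alert fires at frame 57
```

A blink of a few frames never reaches the threshold, while prolonged closure does; this is the same consecutive-frame idea behind the eyes-closure-time subsystem described later in the report.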

Chapter 1:
Project Scoping

Chapter 1 Project Scoping:
1.1 Company Presentation:
1.1.1 Introduction:

This work was carried out as part of a final-year internship at Capgemini Engineering Morocco.
This introductory chapter aims to highlight the general framework and environment of the project.
The first section is dedicated to presenting the host organization, emphasizing its areas of activity
and internal structure. This contextualizes the project within the specific framework of Capgemini
Engineering Morocco and provides an understanding of the resources and expertise available within
the organization.

The second section of this chapter focuses on the project’s context. It highlights the challenges and
objectives of the project, providing an overview of the expected outcomes. This helps grasp the
significance of the work being undertaken and defines the expected deliverables. Furthermore, the
project specifications and requirements are presented in the form of a project brief. This document
serves as an essential guide for project completion, outlining the constraints and expected outcomes.

In essence, this introductory chapter situates the work carried out within the context of the final year
internship at Capgemini Engineering Morocco. It sheds light on the host organization, outlines the
project’s challenges and objectives, and presents the project brief that guides its execution. These
elements provide a solid foundation for understanding and advancing the project in the subsequent
chapters.

1.1.2 Capgemini Engineering’s general presentation:

Capgemini Engineering is a global information technology service company that offers engineering
and technology consulting solutions. Established in 1967, it is part of the Capgemini group, one of
the world’s leading providers of digital services and digital transformation.

In the automotive domain, Capgemini Engineering provides a comprehensive range of engineering


and consulting services. This encompasses embedded software development, electronic systems
design, integration of connectivity and infotainment solutions, as well as validation and verification
of automotive systems.

The company is also involved in the development of technologies related to intelligent mobility,
such as autonomous driving. Capgemini Engineering engages in research and development projects
aimed at enhancing safety, efficiency, and driving experience through advanced solutions in

sensors, artificial intelligence, and data processing. By leveraging these cutting-edge technologies,
Capgemini Engineering strives to shape the future of the automotive industry and enable innovative
solutions for the evolving needs of customers and society at large.

1.1.3 Location:

Capgemini is a multinational group that has a strong presence in over 50 countries worldwide. It
maintains offices and service delivery centers in several key regions, including Europe, North
America, South America, Asia-Pacific, the Middle East, and Africa.

Capgemini’s global footprint enables it to serve international clients and leverage extensive
expertise and geographical reach. The strategic placement of Capgemini’s offices and service
delivery centers allows for meeting the needs of both local and international clients across various
sectors and industries. By being present in diverse locations, Capgemini can effectively provide
tailored solutions and support to clients, regardless of their geographical location or industry
specialization. This global presence strengthens Capgemini’s ability to deliver high-quality services
and foster long-term partnerships with clients worldwide.

Figure 1: Capgemini’s Locations

1.1.4 Capgemini Engineering Morocco:

Through its presence in Morocco, Capgemini aimed to establish a nearshore platform to support the
group’s international development in the automotive, aerospace, and transportation sectors. This
platform serves as a means to assist Capgemini’s clients in their innovation strategies, cost
optimization, and international expansion efforts.

The Moroccan entity also strives to be a local partner for major client accounts of Capgemini
operating within the national territory. As part of the Moroccan government’s “Emergence”
strategy, numerous foreign companies with strong growth potential have established themselves in
the country. Capgemini Morocco specifically focuses on those operating in the automotive,
aerospace, and renewable energy sectors.

Furthermore, Capgemini Morocco leverages the offshoring strategy implemented by the Moroccan
government, which offers significant advantages in terms of optimizing the competence-to-cost
ratio. These advantages include specialized training programs for offshoring professions, attractive
salary packages, favorable taxation policies, and more.

By capitalizing on these factors, Capgemini Morocco aims to provide high-quality services, cost
efficiency, and expertise to both international clients and local companies operating in strategic
sectors. The establishment of Capgemini’s presence in Morocco aligns with the country’s economic
development goals and offers a mutually beneficial partnership for growth and success.

1.2 Problem Definition:
1.2.1 Project Context:

Road safety and Advanced Driver Assistance Systems (ADAS) share a strong connection, as ADAS
technologies are designed to enhance road safety and mitigate risks on the highways. ADAS
comprises a wide range of intelligent features and functionalities integrated into vehicles to assist
drivers, monitor the environment, and provide timely warnings or interventions. These systems
leverage various sensors, cameras, and algorithms to detect potential hazards, monitor driver
behavior, and optimize vehicle control. By analyzing data in real time and offering proactive
assistance, ADAS technologies aim to prevent accidents, reduce human errors, and promote safe
driving practices.

In this context, Capgemini Engineering is dedicated to developing a cutting-edge driver assistance
system with the primary objective of accident prevention.

1.2.2 Problem definition:

Based on available statistical data, road accidents continue to have devastating consequences
worldwide. Each year, more than 1.3 million lives are tragically lost, and millions of people suffer
non-fatal injuries. Among the contributing factors, driver drowsiness has emerged as a significant
concern. In the United States alone, conservative estimates by the National Highway Traffic Safety
Administration suggest that approximately 100,000 crashes annually can be directly attributed to
drowsy driving, resulting in around 1,550 fatalities, 71,000 injuries, and significant economic
losses. Disturbingly, a report by the US National Sleep Foundation revealed that 54% of adult
drivers admitted to driving while feeling drowsy, with 28% of them having fallen asleep behind the
wheel. Recognizing the critical role that driver fatigue plays in road safety, we are committed to
leveraging advanced technologies and innovative approaches to detect signs of drowsiness in real-
time. By combining sophisticated sensors, machine learning algorithms, and extensive research, we
aim to create a robust and reliable system that can proactively alert drivers when fatigue levels
become concerning.

1.2.3 Needs analysis:

The purpose of a needs analysis is to ensure a clear understanding of what needs to be achieved and
to define the scope, features and functionalities of a system or solution. It helps establish a
foundation for effective planning, design, and development by providing a comprehensive
understanding of user needs, organizational goals, and technical constraints. To identify the
appropriate needs, a set of questions can be used to gather relevant information and insights. Some
of the main questions to answer are:

Who are the stakeholders involved?

What is the main objective or goal of the project?

What does the project target?

Answering these questions allows us to gain clarity regarding the problem at hand, define its
objectives, and effectively establish the launch of the project.

1.2.4 Functional analysis:

In this phase we are going to identify all the possible functions; these functions may represent
constraints as well as goals of the project.

Figure 2: Functional Needs

Table 1: Functions description.

Functions | Description
FP1 | The system should be able to perform real-time monitoring.
FP2 | The system should be able to predict the drowsiness state of the driver and alert him.
FC1 | The system should be safe in terms of cybersecurity and personal data usage.
FC2 | The system should be intuitive to use and not intrusive.
FC3 | The system should be easy to repair.
FC4 | The system should be open to additional improvements.

1.2.5 Project Chart:

By using a project chart, stakeholders can quickly grasp the project’s overall organization, identify
points of contact, and understand the flow of authority and responsibility. This helps to streamline
communication, improve coordination, and facilitate efficient decision-making throughout the
project lifecycle.

Table 2: Project’s Chart.

Project’s Title: Drowsiness Detection System
Client: Capgemini Engineering
Department: Model-Based Systems Engineering (MBSE)
Project Duration: 01/02/2023 to 15/06/2023
Technical Supervisors: M. ASLANE Amine (System Engineer), Mme. JADIANI Chadia (Team Manager)
Academic Supervisor: M. BSISS Mohammed, Professor and Researcher at the Faculty of Sciences and
Technologies of Tangier
Project Developer: M. AFILAL Taha, internship student in the last year of the Electrical
Engineering and Industrial Management Engineering cycle at the Faculty of Sciences and
Technologies of Tangier
Project’s Objectives: 1. Detection of the driver’s drowsiness. 2. Real-time monitoring
capabilities. 3. Preventing vehicle accidents by alerting the driver.

1.2.6 Gantt Diagram:

Figure 3: Gantt Diagram

1.3 Conclusion:

This chapter has been devoted firstly to the presentation of the organization in which the end-of-
studies internship took place. Then, it describes the project aimed at designing a prototype of an
ADAS system incorporating multiple functionalities. Towards the end, a work schedule is presented
using the Gantt chart to properly schedule and plan the progress towards the desired objectives.

Chapter 2:
Methods and
techniques used in
the project

Chapter 2 Methods and techniques used in the project:
2.1 Introduction:

In this chapter we will mainly discuss and explain all the necessary tools and techniques that will
help us develop the functionalities we aim to create. This includes machine learning tools, methods
for monitoring the drowsiness of a driver, and more.

2.2 Drowsiness detection approaches:

To measure the drowsiness level of a driver, a set of approaches has been studied and developed. In
this section we will cite the three main measures through which we can conclude whether the driver
is drowsy or not.

2.2.1 Facial expressions-based measures:

A person who is drowsy displays several distinct facial movements, such as rapid blinking, nodding
or swinging of the head, and frequent yawning. A wide range of studies have focused on rapid
blinking as well as the eye closure percentage to determine whether a person is drowsy.

Other scientists were more interested in frequent yawning and head orientation (head swinging and
nodding) as the main indicators of a person’s drowsiness.

Overall, the facial expression-based measurement has been found to be a reliable predictor of
drowsiness and has been used in commercial products such as “Seeing Machines”.

In this table we can find some of the facial expressions that can reflect drowsiness and the means
to capture the drowsiness information from them:

Table 3: Behavioral Measures.

Measures | Sensor used | Drowsiness measure | Utilized techniques
PERCLOS (eye closure percentage) | Camera, infra-red camera | Modification of the algebraic distance between the upper and lower eyelashes | Facial landmarks, HOG and CNN
Eye blinking ratio | Camera, infra-red camera | Time between consecutive blinks | CNN
Head nodding and swinging | Camera, infra-red camera | Head orientation angle | CNN
Yawning ratio | Camera, infra-red camera | Time between consecutive yawns | CNN

Although facial expression-based drowsiness detection gives us a direct indication of the driver’s
drowsiness, this approach can face obstacles that affect its effectiveness, such as driving at
night. These measurements are based on the images obtained from the sensor, in this case the
camera, and image quality depends directly on the lighting conditions. Driving at night may
therefore cause major issues if the hardware (camera) and the software are not designed to work at
night as well as in daylight.
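To illustrate how such camera-based measures can be computed in practice, the sketch below estimates the eye aspect ratio (EAR) from six eye landmarks and derives a simple PERCLOS value over a window of frames. The landmark ordering and the 0.2 closed-eye threshold are illustrative assumptions, not values prescribed by this project:

```python
import numpy as np

def eye_aspect_ratio(eye):
    # `eye` is a (6, 2) array of landmark coordinates: indices 0 and 3 are
    # the horizontal eye corners, 1/5 and 2/4 the eyelid points (assumed
    # ordering, following the common 68-point landmark convention).
    a = np.linalg.norm(eye[1] - eye[5])   # first vertical eyelid distance
    b = np.linalg.norm(eye[2] - eye[4])   # second vertical eyelid distance
    c = np.linalg.norm(eye[0] - eye[3])   # horizontal corner distance
    return (a + b) / (2.0 * c)            # small EAR -> eye likely closed

def perclos(ear_history, threshold=0.2):
    # Fraction of recent frames where the eye was closed (EAR below an
    # assumed threshold) -- a simple PERCLOS estimate.
    ears = np.asarray(ear_history)
    return float((ears < threshold).mean())

# A wide-open synthetic eye gives a high EAR...
open_eye = np.array([[0, 0], [1, 2], [2, 2], [3, 0], [2, -2], [1, -2]], dtype=float)
ear_open = eye_aspect_ratio(open_eye)
# ...and a history where the eye is closed half the time gives PERCLOS = 0.5.
closed_ratio = perclos([0.30, 0.10, 0.10, 0.30])
```

In a full system, the landmarks would come from a face-landmark detector running on each camera frame, and an alert would fire when PERCLOS stays above a calibrated level.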

2.2.2 Physiological measures:

As a person transitions between the awake state and the drowsy state, the body starts emitting
specific physiological signals indicating that the driver is entering the sleep phase, which can be
used to detect or measure the drowsiness level of the driver.

Many studies have been conducted in the scientific field generally, and the medical field
specifically, to find out the link between physiological signals and the fatigue state of a person,
and how to measure and interpret these signals.

Some researchers have found that the heart rate (HR) varies significantly between the different
stages of drowsiness, such as alertness and fatigue. Therefore, heart rate, which can be determined
from the ECG (electrocardiogram) signal, can be used to detect the drowsiness of a person. Others
have noted that heart rate variability (HRV), a measure of the beat-to-beat changes in the heart
rate, can be more reliable for detecting drowsiness: the ratio of low frequency to high frequency
in the ECG decreases progressively as the driver progresses from an awake to a drowsy state.

Another way to measure drowsiness is through the electroencephalogram (EEG). The EEG signal has
various frequency bands, including the delta band (0.5-4 Hz), which corresponds to sleep activity;
the theta band (4-8 Hz), which is related to drowsiness; the alpha band (8-13 Hz), which represents
relaxation and creativity; and the beta band (13-25 Hz), which corresponds to alertness. A decrease
in power in the alpha frequency band and an increase in the theta frequency band indicate
drowsiness.

Table 4: Physiological Measures.

Measure | Sensor used | Drowsiness measure | Utilized techniques
Heart rate (HR) | ECG | Heart rate | Neural network, Fast Fourier Transform
Heart rate variability (HRV) | ECG | Heartbeat frequency changes | Neural network, Fast Fourier Transform
Electrical activity of the brain | EEG | Brain signal wavelength | Neural network, Fast Fourier Transform
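As a sketch of how the FFT-based techniques listed above can be applied, the code below estimates the average power of an EEG signal in the theta and alpha bands from a simple periodogram; the 128 Hz sampling rate and the synthetic signal are assumptions chosen for illustration:

```python
import numpy as np

def band_power(signal, fs, low, high):
    # Average spectral power of `signal` in the [low, high] Hz band,
    # estimated with a plain FFT periodogram.
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    band = (freqs >= low) & (freqs <= high)
    return psd[band].mean()

# Synthetic 10-second "EEG" dominated by a 6 Hz (theta) component,
# as might be seen in a drowsy subject.
fs = 128                          # assumed sampling rate in Hz
t = np.arange(0, 10, 1.0 / fs)
eeg = np.sin(2 * np.pi * 6 * t) + 0.3 * np.sin(2 * np.pi * 10 * t)

theta = band_power(eeg, fs, 4, 8)    # drowsiness-related band
alpha = band_power(eeg, fs, 8, 13)   # relaxation band
# A rising theta/alpha ratio suggests increasing drowsiness.
```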

The reliability and accuracy of driver drowsiness detection using physiological signals is very
high compared to other methods. However, the intrusive nature of measuring physiological signals
remains a major issue to be addressed in the context of automotive and Advanced Driver Assistance
Systems.

2.2.3 Vehicle based measures:

Another way to find out if a driver is drowsy or not involves vehicle-based measurements. While
some researchers were interested in studying the human body and its behaviors to detect
drowsiness, others focused on the vehicle behaviors and the link that can be established between
vehicle behaviors and the drowsiness of the driver.

The measurements related to vehicle behaviors are obtained by placing sensors on various vehicle
components; the signals sent by the sensors are then analyzed to determine the level of
drowsiness [1].

The most commonly used vehicle-based measures are the steering wheel movement, the standard
deviation of lane position, and the acceleration pedal.

Steering wheel movement (SWM) is measured using a steering wheel angle sensor mounted on the
steering column, which captures the driver’s steering behavior. When the driver is drowsy, the
number of micro-corrections on the steering wheel decreases compared to normal driving [2].

To eliminate the effect of lane changes, researchers considered only small steering wheel movements
(between 0.5° and 5°), which are needed to adjust the lateral position within the lane [3]. Hence,
based on these small steering wheel movements, it is possible to determine the drowsiness state of
the driver.
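The steering-based criterion above can be sketched as a simple counter of angle changes that fall in the small-movement range cited by the studies; the sample values below are hypothetical:

```python
import numpy as np

def micro_corrections(angles, low=0.5, high=5.0):
    # Count changes between consecutive steering-angle samples (degrees)
    # that fall in the small-movement range [0.5°, 5°] from the text.
    deltas = np.abs(np.diff(angles))
    return int(((deltas >= low) & (deltas <= high)).sum())

angles = [0.0, 1.0, 1.2, 4.0, 4.1, -2.0]   # hypothetical angle samples
count = micro_corrections(angles)          # two deltas (1.0 and 2.8) qualify
```

A drowsiness monitor would track this count over a sliding time window and flag the driver when it drops well below the driver's normal baseline.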

Another measurement that can be used for vehicle-based drowsiness detection is the standard
deviation of lane position [4]. Researchers conducted an experiment to derive numerical statistics
based on the standard deviation of lane position and found that when the Karolinska Sleepiness
Scale (KSS), a verbal rating scale with 9 levels ranging from extremely alert to very sleepy,
increases, the standard deviation of lane position also increases. For example, KSS ratings of 1,
5, 8 and 9 corresponded to standard deviations of lane position of 0.19, 0.26, 0.36 and 0.47
respectively. Standard deviation of lane position measurements can, however, be limited by the
nature, marking, and condition of the road. In summary, many studies have determined that
vehicle-based measures are a poor predictor of the drowsiness state of the driver.

This table summarizes the vehicle-based measurements and the way to implement them:

Table 5: Vehicle-based measures.

Measure | Used sensor | Drowsiness measure | Utilized techniques
Steering wheel movements | Torsion sensor | Steering wheel angles | Neural network, signal processing
Standard deviation of lane position | Camera | Variation of the distance between the lane markings and the car | CNN

2.3 Approach selection:

The previous section gave a quick tour and explanation of the main measurements that we can
implement to detect the drowsiness state of a person. To build a robust drowsiness detection
system, all the cited measurements would ideally be integrated. Due to the project context and the
permissible duration of the project, we would not be able to integrate all the approaches into the
system. Hence, we had to choose the most reliable, efficient, and non-intrusive approach.

As we saw in the previous section, the vehicle-based measurement approach is not suited to be a
main and trusted indicator of drowsiness, which leads us to abandon this approach in favor of the
remaining ones.

Also, the intrusive nature of the physiological measurement approach would make its implementation
in the automotive context very challenging and could overwhelm the driver; as a consequence, the
driver would not use such a system. This makes the physiological approach unsuitable for this
project.

This leaves us with the final approach to implement in this project: the facial expression-based
measurement approach, whose efficiency is high and whose intuitive, non-intrusive nature makes its
implementation affordable.

2.4 Utilized techniques:


2.4.1 Machine Learning/ Deep Learning:
2.4.1.1 Introduction:

Machine learning is a branch of artificial intelligence (AI) that focuses on the development of
algorithms and statistical models to enable computer systems to learn from data and make
predictions or decisions without being explicitly programmed. It involves the study and
construction of systems that can automatically learn and improve from experience, without the need
for explicit instructions. On the other hand, deep learning is a subfield of machine learning that is
based on artificial neural networks, particularly deep neural networks. It involves the development
of algorithms that are inspired by the structure and function of the human brain, extracting
meaningful patterns and features from raw input, and making highly accurate predictions or
classifications.

2.4.1.2 Machine learning types:


a) Supervised Machine learning:

Supervised machine learning is a type of machine learning where the algorithm is trained on labeled
data.

Labeled data refers to input data where the desired output or target variable is known; the
algorithm learns from these labeled examples to make predictions or classify new, unseen data
accurately. In supervised learning, there is a clear relationship between input variables and the
corresponding output, and the algorithm aims to generalize this relationship to make predictions on
new data.

Figure 4: Supervised Machine Learning.

There are two types of supervised machine learning:

 Regression: used when there is a relationship between the input variable and a continuous
output value. It is used for the prediction of continuous variables, such as weather
forecasting, market trends, etc.
 Classification: classification algorithms are used when the output variable is categorical,
with classes such as Yes-No, Male-Female, True-False, etc.

b) Unsupervised Machine learning:

Unsupervised machine learning involves training algorithms on unlabeled data.

In unsupervised machine learning, there are no predefined target variables or labels provided. The
algorithm learns to identify patterns, structures, or relationships within the data on its own. It aims
to uncover hidden patterns, clusters, or associations in the data without any explicit guidance.
Unsupervised learning can be used for tasks such as clustering, dimensionality reduction, or
anomaly detection.

Figure 5: Unsupervised Machine Learning.

There are two types of unsupervised machine learning:

 Clustering: a method of grouping objects into clusters such that objects with the most
similarities remain in one group and have few or no similarities with the objects of another
group. Cluster analysis finds the commonalities between the data objects and categorizes
them according to the presence or absence of those commonalities.
 Association: an association rule is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the sets of items that
occur together in the dataset. Association rules make marketing strategies more effective.
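As a minimal illustration of clustering, the snippet below assigns each point to its nearest centroid, which is the assignment step of the well-known k-means algorithm; the points and centroids are made up for the example:

```python
import numpy as np

def assign_clusters(points, centroids):
    # Distance from every point to every centroid, then pick the nearest:
    # this is the assignment step of k-means clustering.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

points = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.9, 8.2]])
centroids = np.array([[1.0, 1.0], [8.0, 8.0]])
labels = assign_clusters(points, centroids)   # points grouped as [0, 0, 1, 1]
```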

2.4.1.3 Convolutional Neural Network (CNN):

A Convolutional Neural Network (CNN) is a deep learning algorithm that can take in an input
image, assign importance (learnable weights and biases) to various aspects/objects in the image, and
be able to differentiate one from the other. The pre-processing required in a CNN is much lower as
compared to other classification algorithms.

The architecture of a CNN is analogous to that of the connectivity pattern of Neurons in the human
brain and was inspired by the organization of the visual cortex.

A CNN can successfully capture the spatial and temporal dependencies in an image through the
application of relevant filters. The architecture performs a better fitting to the image dataset due to
the reduction in the number of parameters involved and the reusability of weights. In other words,
the network can be trained to understand the sophistication of the image better.

An image in general has 3 dimensions: the length, the width, and the color channels, which are Red,
Green, and Blue.

Figure 6: Image Channels

Computing such an image in a traditional way can become computationally heavy as the length and
width of the image grow.

The role of a CNN is to reduce the images into a form that is easier to process, without losing
features that are critical for getting a good prediction.

This can be done through two main functions (layers):

Convolutional layers (functions): these layers are designed to extract features and capture spatial
relationships within input data, particularly in the context of image and video processing. A
convolutional layer consists of a set of learnable filters, also known as kernels or feature
detectors. These filters are small matrices that are convolved with the input data, performing
element-wise multiplication and summation operations. This operation is repeated for each position
in the image, resulting in a two-dimensional activation map, often called a feature map or a
convolutional feature map.

The size of the feature map depends on several factors, including the size of the input image, the
size of the filters, and the stride (the amount by which the filter moves between each convolutional
operation). The depth of the feature map corresponds to the number of filters in the convolutional
layer.

These feature maps capture different local patterns or high-level representations of the input image.

Figure 7: Convolutional process.

In summary, a convolutional layer has a set of filters; these filters are convolved with the input
image, producing a set of feature maps. Each feature map is the representation of a pattern or
feature in the input image, and the number of feature maps equals the number of filters.

Pooling layers: following the convolutional layer, the pooling layer is responsible for reducing
the spatial size of the convolved feature. This decreases the computational power required to
process the data through dimensionality reduction. Furthermore, it is useful for extracting
dominant features, thus keeping the training of the model effective.

There are different types of pooling layers, such as max pooling and average pooling.

Average pooling divides the input feature map into rectangular regions (often referred to as pooling
regions or pooling windows) and outputs the average value of each region.

Max pooling is the most common type. It divides the input feature map into rectangular regions
(often referred to as pooling regions or pooling windows) and outputs the maximum value within
each region.

The pooling regions are defined by the size of the pooling window, which determines the spatial
extent of the pooling operation. The pooling window does not have any trainable parameters like the
filters in a convolutional layer; instead, it simply defines the size of the region over which the
pooling operation is applied.

During the pooling operation, the pooling window moves across the feature map with a fixed stride,
as in convolution. Unlike convolution, however, there is no element-wise multiplication or
summation involved: pooling layers only select the maximum or average value within each pooling
region, discarding the other values.

Figure 8: Average Pooling.

Figure 9: Max Pooling.

Max pooling is primarily used to reduce the spatial dimensions of the feature map by selecting the
maximum value within each pooling region. It is particularly effective at capturing the most
salient or dominant features within each pooling region. In addition, max pooling helps preserve
the edges and boundaries of features within the feature map: since the maximum value is selected,
it is more likely to be located on an edge boundary, providing a sharper representation of those
boundaries.

Average pooling also helps in downsampling the feature map while preserving the general spatial
information. Another advantage is that it can reduce the impact of noise or outliers in the feature
map: by averaging the values, it smooths out variations and provides a more representative value
for the region.

NB: Due to these advantages, max pooling is the most commonly used pooling type in CNNs.
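A minimal sketch of both pooling operations on a toy 4x4 feature map; the window size and stride of 2 are common choices, assumed here:

```python
import numpy as np

def pool(feature_map, size=2, stride=2, op=np.max):
    # Slide a window over the map and keep one value per region:
    # op=np.max gives max pooling, op=np.mean gives average pooling.
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            region = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = op(region)
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 1, 8]], dtype=float)
max_pooled = pool(fm, op=np.max)    # [[6, 4], [7, 9]]
avg_pooled = pool(fm, op=np.mean)   # [[3.75, 2.25], [4.0, 4.75]]
```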

Classification layers/dense layers: a fully connected layer in a convolutional neural network
(CNN), also known as a dense layer or a fully connected neural network layer, is a type of layer
that connects every neuron from the previous layer to every neuron in the current layer.

Figure 10: Classification Layers.

In CNNs, the fully connected layers are typically located towards the end of the network, following
the convolutional and pooling layers. These layers serve the purpose of learning high-level
representations and making predictions based on the extracted features. Each neuron in the fully
connected layer receives input from every neuron in the previous layer. The outputs of these
neurons are then passed through an activation function, which introduces non-linearity to the
network.

There are two main types of activation functions:

Sigmoid activation function: the sigmoid activation function is a mathematical function that maps
input values to a range between 0 and 1, producing a smooth S-shaped curve. It is commonly used to
introduce non-linearity in neural networks and is particularly useful in binary classification
problems where the output represents probabilities or decision boundaries.

Figure 11: Sigmoid Activation function.

ReLU (Rectified Linear Unit) activation function: ReLU is a non-linear function that sets all
negative input values to zero and leaves positive values unchanged. It is widely used in neural
networks as it introduces sparsity and allows for faster training convergence compared to other
activation functions, such as the sigmoid.

Figure 12: ReLu activation function.
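Both activation functions reduce to one-liners; a minimal sketch:

```python
import math

def sigmoid(x):
    # Maps any real input into (0, 1) -- a smooth S-shaped curve.
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Zeroes out negative inputs, leaves positive inputs unchanged.
    return max(0.0, x)

print(sigmoid(0))    # 0.5 -- the midpoint of the curve
print(relu(-3.0))    # 0.0 -- negative values are clipped
print(relu(2.0))     # 2.0 -- positive values pass through
```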

The purpose of the fully connected layers is to capture complex relationships and dependencies
among the features extracted by the earlier layers. By connecting all neurons, they can model
interactions and combinations of features across the entire input.

When a fully connected layer receives input from the previous layer, the input values are multiplied
by their corresponding weights. These weighted inputs are then summed up, typically with an
additional bias term, to produce a single scalar value for each neuron in the current layer.

Figure 13: Neuron Scalar value.

We can represent this mathematically by the following formula:

z = w1·x1 + w2·x2 + … + wk·xk + b

Where:

w1…wk: the connection weights.

x1…xk: the values of the input neurons.

b: the bias.
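The weighted-sum formula above can be computed directly; the inputs, weights, and bias below are arbitrary example values:

```python
import numpy as np

def neuron_output(x, w, b):
    # z = w1*x1 + w2*x2 + ... + wk*xk + b for one fully connected neuron.
    return float(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # values from the previous layer
w = np.array([0.4, 0.3, 0.1])    # connection weights
b = 0.2                          # bias term
z = neuron_output(x, w, b)       # 0.2 - 0.3 + 0.2 + 0.2 = 0.3
```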

The weights associated with the connections between neurons determine the strength or importance of
each connection. During the training process, these weights are adjusted through a process called
backpropagation, where the network learns to optimize its performance by updating the weights based
on the error in its predictions.

The CNN learns in two phases: the first is called forward propagation, and the second is called
backpropagation.

Forward propagation: forward propagation means that the input data is fed in the forward direction
through the network. Overall, this process covers the data flow from the initial input, passing
through every layer in the network, up to the computation of the prediction error.

Backpropagation: after forward propagation comes backpropagation, which is certainly the essential
part of training. In brief, it is the process of fine-tuning the weights of the network based on
the error, or loss, obtained in the previous epoch (iteration). Proper weight tuning ensures lower
error rates and increases the model’s reliability by enhancing its generalization.

The essential part of the CNN learning process is obtaining the error rate so that we can optimize
the weights of the model. To get the error of the CNN model, we compute what is called a loss
function.

A loss function in machine learning is a mathematical function that quantifies the discrepancy
between predicted outputs and the true or expected outputs. It provides a measure of how well the
model is performing, and the goal is to minimize this value during training. The general form of a
loss function depends on the specific task and can vary. However, a common example is the binary
cross-entropy loss function, which measures the dissimilarity between the predicted value and the
true label:

L = −(Y·log(P) + (1 − Y)·log(1 − P))

Where:

L: binary cross-entropy loss.

Y: true binary label (0 or 1).

P: predicted probability of the positive class.
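The binary cross-entropy formula translates directly into code; the probabilities below are example values showing that confident wrong predictions are penalized much harder than confident correct ones:

```python
import math

def binary_cross_entropy(y, p):
    # y: true label (0 or 1); p: predicted probability of the positive class.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

low = binary_cross_entropy(1, 0.9)    # confident and correct: small loss
high = binary_cross_entropy(1, 0.1)   # confident but wrong: large loss
```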

There are many different algorithms to optimize a CNN model’s weights; one of the most used is
Stochastic Gradient Descent (SGD). In brief, gradient descent is an optimization algorithm used to
minimize the loss function of the neural network by iteratively moving in the direction of the
steepest descent of the function. To find that direction, we calculate the gradients of the loss
function with respect to the weights and biases. We can then update the weights and biases using
the negative gradients.

Figure 14: Forward and back propagations.
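A single gradient descent update on the weights can be sketched as follows; the learning rate and gradient values are illustrative:

```python
def sgd_step(weights, grads, lr=0.1):
    # Move each weight against its gradient, scaled by the learning rate.
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [0.5, -0.2]
grads = [1.0, -2.0]                  # dL/dw for each weight (example values)
weights = sgd_step(weights, grads)   # approximately [0.4, 0.0]
```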

CNN architecture: structuring and stacking the three layer types mentioned above is what forms a
convolutional neural network.

Apart from the fact that the fully connected layers are located at the end of the CNN structure,
preceded by the convolutional and pooling layers, there is no specific rule for how the layers must
be ordered or put together; finding the best-performing structure is the crucial objective we are
looking for.

The overall structure of a CNN is presented in the following figure:

Figure 15: Convolutional Neural Network Structure.
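Putting the pieces together, the following sketch runs a toy image through one conv → ReLU → max-pool stage and a sigmoid dense layer; the random filter and weights stand in for parameters that training would normally learn:

```python
import numpy as np

def conv2d(img, k):
    # "Valid" convolution, stride 1.
    kh, kw = k.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    return np.array([[np.sum(img[i:i + kh, j:j + kw] * k) for j in range(w)]
                     for i in range(h)])

def relu(x):
    return np.maximum(x, 0)

def max_pool(fm, s=2):
    h, w = fm.shape[0] // s, fm.shape[1] // s
    return np.array([[fm[i * s:(i + 1) * s, j * s:(j + 1) * s].max()
                      for j in range(w)] for i in range(h)])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
image = rng.random((6, 6))           # toy grayscale input
kernel = rng.random((3, 3)) - 0.5    # one filter (random, would be learned)

features = max_pool(relu(conv2d(image, kernel)))   # conv -> ReLU -> pool
flat = features.flatten()                           # 2x2 map -> 4 values
w = rng.random(flat.size) - 0.5                     # dense-layer weights
b = 0.0
prediction = sigmoid(np.dot(w, flat) + b)           # probability in (0, 1)
```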

2.4.1.4 Transfer learning and learning from scratch:


a) Learning from scratch:

Training a model from scratch refers to the process of building and training a machine learning or
deep learning model from the ground up, starting with randomly initialized weights. This
approach involves feeding the model with a labeled dataset, iterating through multiple epochs, and
adjusting the model’s parameters to minimize the difference between the predicted outputs and the
ground truth labels.

Figure 16: Training from scratch process.

Training an efficient CNN model from scratch requires large clusters of compute servers, large
amounts of training data, and a large amount of time to train the deep neural network.

Figure 17: Training CNN from scratch requirements.

Pros of Training a model from scratch:

 Flexibility: Training a model from scratch allows complete control over the architecture,
hyperparameters and training process.
 Domain-specific representation: by training a model from scratch, it can learn
representations and features directly from the provided dataset, capturing domain-specific
patterns and nuances that may not be present in pre-trained models.
 Independence from external data: when training from scratch, there is no dependency on
pre-existing large-scale datasets or pre-trained models. This can be advantageous when
working with niche or specialized domains.

Cons of training a model from scratch:

 Resource-intensive: Training a model from scratch can be computationally demanding and time-consuming, especially for complex architectures and large datasets.
 Large data requirements: Training from scratch often necessitates a substantial amount of
labeled training data. In situations where labeled data is scarce or expensive to obtain it may
be challenging to train a model from scratch effectively.
 Initialization sensitivity: Starting with random weights can lead to varying outcomes and
performance, as the model’s initial state can influence the optimization process.

b) Transfer Learning:

The basic idea of transfer learning is to start with a deep learning network that is pre-trained (pre-initialized) on a similar problem.

Figure 18: transfer learning process.

Compared to learning from scratch, transfer learning requires less time, less training data, and less computing power.

Figure 19: Requirements for transfer learning.

Pros of transfer learning:

 Reduces training time and resource requirements: Transfer learning significantly reduces
the time and resources needed to train a model. Instead of starting from random
initialization, the model begins with pre-trained weights, which already capture generic
features and patterns. This can lead to faster convergence and reduces computational costs.
 Improved performance with limited data: Transfer learning enables effective utilization
of pre-existing large-scale datasets, which often contain more diverse and varied data than
what may be available for the specific target task.
 Generalization to new tasks: pre-trained models are trained on diverse datasets and have
learned general features that are transferable across tasks. By starting with these learned
representations, transfer learning allows the model to generalize well to new tasks, even in
domains where labeled data is scarce or unavailable.
 Robustness and regularization: pre-trained models have undergone extensive training on
large datasets, which often helps in regularizing the model and making it more robust to
noisy or incomplete data.

Cons of transfer learning:

 Limited domain specificity: while pre-trained models provide a good starting point, they may not capture task-specific nuances or domain-specific features. If the target task significantly differs from the pre-training data, the transfer learning approach may not be as effective.
 Need for compatible pre-trained models: the availability of suitable pre-trained models that align with the target task or domain can be a limiting factor. In some cases, finding pre-trained models that closely match the desired task may require extensive search or adaptation techniques.
 Fine-tuning challenges: Fine-tuning a pre-trained model requires careful consideration of
hyperparameters, learning rates and regularization techniques. It can be challenging to strike
a balance between preserving the pre-trained knowledge and adapting the model to the new
task.

c) Methods for transfer learning:

One of the most straightforward methods of transfer learning is called feature transfer. In deep learning, the network is made up of many layers; this layered architecture learns different features at different layers. [deep learning basic network] illustrates a sample deep learning network made up of many layers falling into three distinct categories. In this example, the network accepts a 3-D image (width, height, and depth for the color). This constitutes the input layer, which maps the input to the subsequent layer. Next is the feature-extraction stage, which can have many internal layers consisting of convolution and pooling layers. The outputs of the feature-extraction layers are “features” representing aspects of the image, which can then be combined hierarchically into higher-level features. The final classification layer pulls together the features found within the feature-extraction layers and produces a classification.

Figure 20: deep learning basic network.

Note that the classification layer is the layer responsible for determining the object in the image as a function of the detected features. The idea behind feature transfer is then to take the input and feature-extraction layers that have already been trained on a given dataset, freeze their weights (the weight values do not change throughout the whole training process), add on top of these layers a new classification layer for the related problem domain, as illustrated in [Features Transfer], and then re-train the model on the dataset of the related problem.

Figure 21: Features Transfer.
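The freeze-and-retrain idea can be sketched framework-agnostically. In Keras this would correspond to setting `trainable = False` on the pre-trained layers; the toy example below instead uses a hand-written "frozen" feature extractor and a small trainable head so it stays self-contained (all numbers are illustrative):

```python
# Toy feature transfer: frozen extractor + trainable classifier head.
# (Illustrative only; a real model would reuse a pre-trained CNN.)
FROZEN_W = [0.5, -0.2]           # "pre-trained" weights, never updated

def extract_features(x):         # stands in for the frozen conv layers
    return [FROZEN_W[0] * x, FROZEN_W[1] * x]

head_w = [0.0, 0.0]              # new classification head, trainable
lr = 0.05
data = [(1.0, 0.3), (2.0, 0.6)]  # (input, target) pairs

for _ in range(200):             # train only the head by gradient steps
    for x, y in data:
        f = extract_features(x)
        err = head_w[0] * f[0] + head_w[1] * f[1] - y
        head_w[0] -= lr * err * f[0]
        head_w[1] -= lr * err * f[1]

assert FROZEN_W == [0.5, -0.2]   # frozen weights stayed untouched
```

Only the head's weights move during training, exactly as when the feature-extraction layers of a real network are frozen.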

[Link] Machine Learning process:

The process of training a deep learning model consists of five steps:

Import and clean the data: importing and cleaning the data is one of the most crucial steps in training a deep learning model; the data must be suitable for the task we want the model to achieve. Data inspection is also necessary to avoid problems such as missing or incomplete values, or data that has not been transformed or preprocessed to match the model's input.

Splitting data into training, validation and test data: the step that follows importing and cleaning the data is splitting it. This is common practice in machine learning, where the dataset is divided into three subsets: the training set is used to train the model, the validation set is used to tune hyperparameters and evaluate model performance during training, and the test set is reserved for a final unbiased evaluation of the model's performance after training. In general, the amount of data in each set is defined as follows:

Training Data = 80% dataset

Validation data = 15% dataset

Test data = 5% dataset
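The 80/15/5 split above can be implemented with a simple shuffle-and-slice; a sketch (assuming the samples fit in a list; the seed is fixed only for reproducibility):

```python
import random

def split_dataset(samples, train=0.80, val=0.15, seed=42):
    """Shuffle and slice into train/validation/test subsets.
    The test share is whatever remains (here 5%)."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 800 150 50
```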

Create a model: this is the third step, where we define a specific structure for the CNN model; in other words, we define how many layers the model will contain, the ordering of the convolutional and pooling layers, and so on.

Model training: in this step, we tune the hyperparameters of the training session, such as the number of epochs, the learning rate, the choice of optimizer, and so on.

Evaluate the model: this final step is the evaluation phase, where we test the model on the test data to measure its efficiency and check whether it can generalize to new, unseen data.

The [figure] summarizes the machine learning process.

Figure 22: Machine learning process.

[Link] K-Fold cross validation:

Model efficiency evaluation is the decisive phase for judging whether the model is ready to be implemented in the application. There is a set of methods we can use to evaluate a model. One of them, mentioned earlier in the Machine Learning process section, is dividing the dataset into three subsets (train, validation, and test); this is called the “Hold Out method”. This evaluation method is not as reliable as it sounds, since the split happens randomly and we cannot control which observations end up in the training or test subset. The test error rate can therefore vary strongly from one split to another, giving a high-variance estimate. Moreover, only part of the data is used to train the model, which introduces bias; when the dataset is not huge, this tends to produce an overestimation of the test error. The only advantage of the Hold Out method is that it is computationally inexpensive. This leads us to look for another method to evaluate our model.

“K-fold cross validation” is one of the most reliable and efficient methods for judging the efficiency of a trained model. In this resampling technique, the whole dataset is divided into K sets of equal size. The first set is selected as the test set and the model is trained on the remaining K-1 sets; the test error rate is then calculated after fitting the model to the test data.

In the second iteration, the 2nd set is selected as the test set and the remaining K-1 sets are used for training, and the error is calculated again. This process continues for all K sets, as illustrated in [k-fold Cross validation].

Figure 23: k-fold Cross Validation.

The mean of errors from all the iterations is calculated as the CV (Cross validation) test error
estimate.
CV(K) = (1/K) · Σ_{i=1}^{K} MSE_i

The advantages of this method are that each data point ends up in the test set exactly once and in the training set K-1 times; on top of that, as the number of folds K increases, the variance of the estimate decreases.
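The fold construction and the averaging of per-fold errors can be sketched in pure Python; the `error_fn` argument stands in for training and evaluating the real model on each split:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of (near-)equal size."""
    fold_size, rem = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < rem else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

def cross_validate(n, k, error_fn):
    """Average the per-fold test errors to get the CV estimate."""
    errors = []
    for test_idx in kfold_indices(n, k):
        train_idx = [j for j in range(n) if j not in set(test_idx)]
        errors.append(error_fn(train_idx, test_idx))  # fit + score here
    return sum(errors) / k

# Each point lands in a test fold exactly once:
folds = kfold_indices(10, 4)
print([len(f) for f in folds])  # [3, 3, 2, 2]
```

(In practice, a real shuffle of the indices would precede the fold split.)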

[Link] Hyperparameter tuning:

Hyperparameter tuning is one of the most important parts of a machine learning pipeline. A wrong choice of hyperparameter values may lead to wrong results and a model with poor performance.

a) Hyperparameters and parameters:

Both parameters and hyperparameters are part of the machine learning model, although they serve different purposes. Let's look at the differences between them in the context of machine learning.

Parameters are the variables that a machine learning algorithm uses to make predictions from historical input data. The learning algorithm itself estimates them through an optimization procedure; as a result, neither the user nor an expert sets or hard-codes these variables. They are learned during the model training process.

Hyperparameters are variables that the user specifies when constructing the machine learning model. They are therefore set before the parameters; in other words, hyperparameters govern how the model's optimal parameters are found. The defining property of hyperparameters is that the user developing the model decides on their values.

b) Grid search:

There are several ways to perform hyperparameter tuning. One of them is grid search.

Grid search is the simplest algorithm for hyperparameter tuning. Basically, we divide the domain of the hyperparameters into a discrete grid [Grid search]. Then, we try every combination of values in this grid, calculating some performance metric using cross-validation. The point of the grid that maximizes the average value in cross-validation is the optimal combination of values for the hyperparameters.

Figure 24: Grid search.

Grid search is an exhaustive algorithm that spans all the combinations, so it can find the best point
in the domain. The great drawback is that it’s very slow.
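An exhaustive search over a discretized hyperparameter grid can be sketched as follows; the grid values and scoring function below are illustrative stand-ins for a cross-validated metric:

```python
from itertools import product

def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and keep the best scorer."""
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)  # e.g. mean cross-validation accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective peaking at lr=0.01, batch=32 (illustrative values):
grid = {"lr": [0.001, 0.01, 0.1], "batch": [16, 32, 64]}
score = lambda p: -abs(p["lr"] - 0.01) - abs(p["batch"] - 32) / 100
print(grid_search(grid, score))  # ({'batch': 32, 'lr': 0.01}, -0.0)
```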

c) Random search:

Random search is like grid search but, instead of using all the points in the grid, it tests only a randomly selected subset of these points [random search]. The smaller this subset, the faster but the less accurate the optimization.

Figure 25: random search.
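Random search reuses the same kind of discretized grid but scores only a sampled subset of it; a sketch (the names and values are illustrative, and `n_iter` controls the speed/accuracy trade-off described above):

```python
import random
from itertools import product

def random_search(grid, score_fn, n_iter=4, seed=0):
    """Score only a random subset of the grid's combinations."""
    keys = sorted(grid)
    combos = list(product(*(grid[k] for k in keys)))
    sampled = random.Random(seed).sample(combos, min(n_iter, len(combos)))
    best = max(sampled, key=lambda v: score_fn(dict(zip(keys, v))))
    return dict(zip(keys, best))

grid = {"lr": [0.001, 0.01, 0.1], "batch": [16, 32, 64]}
score = lambda p: -abs(p["lr"] - 0.01) - abs(p["batch"] - 32) / 100
best = random_search(grid, score, n_iter=4)
# best is the top scorer among the 4 sampled points (it may miss the optimum)
```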

2.4.2 Facial Landmarks:

Facial landmarks refer to the specific points or features on a person’s face that are used to identify
and locate different regions of the face. These landmarks are key anatomical positions that can be
consistently identified across different faces. By identifying and tracking these landmarks, we can
analyze facial expressions, perform facial recognition, apply filters or effects to specific facial
regions and much more.

Facial landmarks typically include points such as:

Eyes: Landmarks for the eyes include positions like the inner and outer corners of the eyes, the
center of the pupil, and the eyebrow arch points.

Eyebrows: These landmarks mark the starting and ending points of each eyebrow and may include
points along the eyebrow shape.

Nose: landmarks on the nose include the tip of the nose, the base of the nose, and points on the
bridge of the nose.

Mouth: these landmarks mark the corners of the mouth, the center of the lips, and sometimes points
on the upper and lower lip contours.

Jawline and chin: landmarks along the jawline and chin help define the shape and contour of the
face.

Facial landmarks can be represented as coordinate points (x, y) on an image, indicating the precise
location of each landmark. These points provide valuable information about the structure and
geometry of the face allowing for various facial analysis tasks and applications [Facial landmarks].

Figure 26: Facial landmarks.

One of the most efficient ways to detect facial landmarks is to use machine learning to detect the face features and then locate the facial landmarks.
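Once landmark coordinates (x, y) are available (for example from a 68-point shape predictor such as dlib's), derived measures are simple geometry. As an illustration, the widely used eye aspect ratio relates the eye's vertical openings to its width; the six landmark coordinates below are made up for the example:

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """Vertical openings over horizontal width: drops toward 0 as the eye closes.
    p1/p4 are the eye corners; p2,p3 (top) pair with p6,p5 (bottom)."""
    return (dist(p2, p6) + dist(p3, p5)) / (2 * dist(p1, p4))

# Hypothetical (x, y) landmark coordinates:
open_eye = [(0, 0), (2, -2), (4, -2), (6, 0), (4, 2), (2, 2)]
closed_eye = [(0, 0), (2, 0), (4, 0), (6, 0), (4, 0), (2, 0)]
print(eye_aspect_ratio(*open_eye))    # ~0.667
print(eye_aspect_ratio(*closed_eye))  # 0.0
```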

2.4.3 Tools:
[Link] Programming language:

Since this project is mainly focused on deep learning and its application to detecting driver drowsiness, I chose Python as the primary programming language for several compelling reasons. Python has gained significant popularity in the field of deep learning due to its extensive library support and rich ecosystem. The availability of powerful libraries such as TensorFlow, Keras, and PyTorch, all of which have Python interfaces, greatly simplifies the development and implementation of deep learning models. Python's readability and simplicity make it easier to understand and maintain complex code structures, which is crucial when working with intricate deep learning architectures. Moreover, Python's flexibility and compatibility with other languages facilitate seamless integration with existing software and frameworks. The active and vibrant Python community further enhances the development process by providing a wealth of resources, documentation, and support. By choosing Python for this project, I was able to leverage its robust deep learning libraries, intuitive syntax, and collaborative ecosystem to effectively explore and implement cutting-edge techniques in deep learning.

[Link] Libraries:
a) OpenCV:

OpenCV is an open-source computer vision library that provides a comprehensive set of tools and algorithms for image and video processing, allowing for tasks needed in this project such as image manipulation. Its extensive support for multiple platforms and programming languages makes it a go-to choice for developing computer vision applications.

b) Dlib:

Dlib stands out for its capabilities in face detection, facial recognition, and facial landmark detection, which will be the key point in this project for localizing the different parts of the face. This library offers algorithms and models for robust face analysis, making it a popular choice for tasks like emotion detection and facial attribute analysis.

c) Keras:

Keras is a high-level neural network library that simplifies the process of building and training deep learning models. It provides a user-friendly API and abstracts away the complexities of lower-level frameworks like TensorFlow, making it accessible to anyone who wants to implement deep learning in their projects. With Keras, developers can effortlessly experiment with different architectures and optimize their models for a wide range of applications.

Chapter 3:
Project Implementation

Chapter 3 Project Implementation:
3.1 Introduction:

In the context of our “Driver Drowsiness Detection System” project, we have carefully considered various measures and determined that facial-expression-based measures offer the most suitable approach for effectively identifying the drowsiness state.

In this chapter, we will provide a comprehensive guide on how we implemented the drowsiness
detection system based on facial expressions. We will explore the underlying technology,
methodologies, and techniques utilized to achieve accurate and real-time detection. Our approach
combines computer vision algorithms, machine learning models, and image processing techniques
to analyze facial expressions and identify signs of drowsiness.

3.2 System requirements and analysis:


3.2.1 Functional requirements:
[Link] Introduction:

In the following sections, we will dive into the specific functional requirements for our drowsiness
detection system, addressing the identification of facial expressions, the desired accuracy level, the
response time requirements, and the necessary integrations. By capturing these requirements, we
aim to develop a robust and reliable system that contributes to enhancing road safety and preventing
drowsiness-related accidents.

[Link] Identification of facial expressions to be monitored:

Human facial expressions can give explicit evidence about a person's drowsiness state, such as eye closure and yawning. In this project we have aimed to monitor the main face parts that can indicate a drowsiness state, which are:

a) Eyes expressions:

Eyes are the most reliable indicators of a human's drowsiness state; from eye expressions we can extract different types of information that can be monitored. In this project, three eye expressions have been monitored:

 eyes blinking ratio: the eyes blinking ratio, which refers to the frequency and duration of eye blinks, can serve as an important indicator of driver drowsiness. Research has shown that drowsy individuals tend to exhibit changes in their blinking patterns, characterized by an increase in blink rate. This phenomenon occurs due to the increased effort to stay awake and alert. By analyzing the blinking ratio in real time, our drowsiness detection system can detect deviations from the baseline blink rate and identify patterns indicative of drowsiness. A higher blinking ratio, accompanied by more frequent and rapid eye blinks, can provide valuable insights into the driver's level of alertness and indicate the need for intervention or alerting mechanisms. By monitoring and analyzing the blinking ratio, our system aims to contribute to early drowsiness detection and enhance road safety by proactively alerting drivers and encouraging appropriate rest or intervention measures.
 fully closed eyes time: the duration of fully closed eyes, often referred to as “fully closed eyes time”, is another significant parameter that can indicate driver drowsiness. When a driver becomes drowsy, there is a higher likelihood of experiencing longer periods of complete eye closure. This extended duration of fully closed eyes can be attributed to lapses in attention and microsleep episodes, where the driver momentarily loses consciousness. By monitoring and analyzing the fully closed eyes time, our drowsiness detection system can identify instances where the driver's eyes remain shut for abnormally extended periods. This information serves as a crucial indicator of drowsiness levels, prompting the system to issue alerts or intervention measures to mitigate the risks associated with drowsy driving. By effectively capturing and analyzing the fully closed eyes time, our system aims to enhance driver safety and prevent accidents caused by drowsiness-related impairment.
 eyes closure percentage: an increased eye closure percentage signifies a higher likelihood of drowsiness. The rise in the percentage can be attributed to prolonged eye closure, such as when the driver's eyes remain shut or partially closed for a substantial duration. Monitoring the eye closure percentage in real time allows our system to detect deviations from the baseline and promptly identify instances of elevated drowsiness levels. By analyzing this metric, the system can issue timely alerts or intervention measures to mitigate the risks associated with drowsy driving.

b) Mouth expressions:

Mouth expression, specifically the occurrence of yawning, is a notable indicator of driver drowsiness. Yawning is a reflex action that often signifies a need for increased oxygen intake and is commonly associated with fatigue and drowsiness. In the context of our drowsiness detection system, analyzing mouth expressions, particularly the frequency and intensity of yawning, provides valuable insights into the driver's level of alertness.

As drowsiness sets in, individuals are more prone to experiencing frequent and pronounced
yawning episodes. By monitoring and analyzing the occurrence of yawning in real-time, our
drowsiness detection system can identify patterns indicative of drowsiness. An increased frequency
and intensity of yawning indicate a higher likelihood of drowsiness, as it reflects the body’s attempt
to stay awake and combat fatigue.

[Link] Accuracy level:

Ensuring a high accuracy level in the drowsiness detection system is of utmost importance to effectively address the risks associated with drowsy driving. The accuracy level refers to the system's ability to correctly identify drowsiness in drivers with minimal false positives (the system incorrectly identifies a driver as drowsy) and false negatives (the system fails to detect drowsiness). For critical applications such as drowsiness detection in vehicles, it is important to aim for high accuracy to minimize both. Typically, higher accuracy is better for safety; however, 100% accuracy may be unrealistic and expensive to achieve. For this reason, in this project we set the minimum accuracy at 90%. This accuracy ensures low risks of false negatives or false positives.

[Link] Response time requirements:

In this project we will monitor five different indicators, all directly linked to time. Since we are aiming to create a real-time monitoring system, the factor of time is crucial. In this context, the requirements on image preprocessing and treatment are constrained by the external elements we are monitoring.

a) Sensor response time (Camera):

Five elements will be monitored by the system we aim to create. Each of these elements needs to be captured by the system the moment it occurs. This leads us to configure the frame rate at which the system must run to successfully capture all the facial expressions we are monitoring.

In the case of head orientation, the human head tends to stay in a specific orientation for long durations; in general it takes several minutes for a person to change head orientation, especially when drowsy or asleep. This works in our favor, since it does not require the system to have a high frame rate.

In the case of mouth expressions (yawning), a normal person takes approximately 6 to 8 seconds to complete a yawn. This dictates that our monitoring system should capture at least one frame every 6 seconds; as a safe limit, we set this at 1 FPS (frame per second). If the system cannot respect this time interval, it may fail to detect the yawning event.

In the case of the eye blinking rate, a normal person takes approximately 150 to 200 milliseconds to blink. Here the monitoring system should be capable of capturing frames at a high frame rate; the limit is 7 FPS (frames per second), otherwise the system may fail to detect an eye blink event.

Table 6: Sensor required frames per second.

                            Head orientation    Yawning    Blinking
Minimum frames per second   indifferent         1          7
From [Sensor required Frames per second], taking the most demanding monitored element as the reference, the sensor's response time should not exceed 150 milliseconds per frame, which corresponds to roughly 7 FPS.

b) Code running time requirements:

As mentioned in the section “a”, the Camera should be capable of taking a minimum of 7 frames
per second so we can capture any event whether it’s a head orientation, yawning, or eyes blinking.
This means that we need an algorithm (programming Code) that can process 7 frames per second.
This leads us to conclude that the algorithm and the code that we will develop must process one
single frame each 20 milliseconds.
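The per-frame processing budget follows directly from the required frame rate:

```python
# Per-frame processing budget at the required minimum frame rate.
required_fps = 7
budget_ms = 1000 / required_fps
print(round(budget_ms, 1))  # ~142.9 ms available to process each frame
```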

c) CNN model response time requirements:

When developing a real-time application such as our project, the number of model parameters is a crucial factor. Larger models with more parameters tend to be computationally expensive and can slow down inference. To achieve real-time performance, it is important to choose or design models with fewer parameters. Models with a parameter count below 20 million are considered lightweight and can offer fast inference times, making them ideal for real-time applications with strict latency requirements. Such models strike a balance between computational efficiency and accuracy, making them suitable for deployment on devices with limited processing capabilities. For this reason, the models created in our project must not exceed 20 million parameters.
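A model's parameter count can be estimated layer by layer; for a convolutional layer it is (kernel height × kernel width × input channels + 1) × number of filters, the +1 accounting for the per-filter bias. A sketch with illustrative layer sizes:

```python
def conv_params(kh, kw, in_ch, filters):
    """Weights per filter (kh*kw*in_ch) plus one bias per filter."""
    return (kh * kw * in_ch + 1) * filters

def dense_params(in_units, out_units):
    """Fully connected layer: one weight per connection plus biases."""
    return (in_units + 1) * out_units

# e.g. a 3x3 convolution over an RGB input with 32 filters:
print(conv_params(3, 3, 3, 32))  # 896
```

Summing such terms over all layers gives the total against which the 20-million budget is checked (deep learning frameworks report this total directly, e.g. in a model summary).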

3.2.2 Environmental factors:

The project’s objective is to detect drowsiness based on the driver’s facial expressions. To achieve
this, the system will capture images of the driver’s face.

Two significant factors need to be considered in this context: lighting conditions and camera angles.

 Lighting conditions: lighting conditions play a crucial role in the accuracy of the system’s
drowsiness detection. The quality of the captured frames from the camera will be directly
influenced by the lighting conditions. Pictures taken in daylight will generally have higher
quality compared to those taken at night. This difference in picture quality can introduce
bias to the system’s accuracy. The system may perform exceptionally well in daylight but
exhibit lower performance in low-light situations during nighttime.
 Camera angles: camera angles also need to be considered. The positioning and angles at which the camera captures the driver's face can impact the system's ability to accurately detect drowsiness. Optimal camera placement and appropriate angles are essential to capture facial expressions effectively and ensure reliable drowsiness detection.

3.3 Components design:


3.3.1 Introduction:

As mentioned in the functional requirements section, the system will monitor five indicators:

 Eyes blinking ratio
 Fully closed eyes time
 Eyes closure percentage
 Yawning
 Head orientation

To effectively monitor these indicators, the drowsiness detection system will consist of five
principal functions or subsystems. Each function will be responsible for monitoring one of the five
indicators as shown in the [Drowsiness detection system Diagram].

Figure 27: Drowsiness detection system Diagram.

This section provides a full development guide for each function (subsystem).

3.3.2 Eyes State subsystem:


[Link] Introduction:

To develop two subsystems of the drowsiness detection system, “Eyes closure ratio” and “Eyes closure time”, we need an algorithm capable of classifying whether the eyes are closed or open. In this project we achieve that using a Convolutional Neural Network. In this section we will go through all the phases of training a CNN model to classify whether the eyes are closed or open.

[Link] Convolutional Neural Network model for eyes classification:


a) Data:

The first step in training a model is to prepare the right dataset for the problem. In our case, we need a dataset suitable for classifying whether the eyes are closed or open.

(i) Source:

The dataset used to train the model is the MRL (Multimedia Research Lab) dataset. This dataset is a comprehensive and valuable resource for advancing computer vision research and applications related to eye detection and recognition. Consisting of an extensive collection of images and videos, it encompasses a diverse range of eye-related scenarios, including different lighting conditions and orientations. The MRL dataset provides two groups of images: open-eye images and closed-eye images. It is an open-source dataset, available at [MRL dataset], and is the dataset on which we will train the “Eyes Classifier” model for this project. It contains 24 000 images: 12 000 closed-eye images and 12 000 open-eye images.

Figure 28: MRL dataset.

(ii) Data cleaning:

The data cleaning process is a necessary step that the dataset must pass through. Cleaning the data involves:

Data labeling: to train the model there will be two folders, one named “Closed Eyes Folder” and the other named “Open Eyes Folder”. One folder will contain only open-eye images and the other only closed-eye images. This is necessary because these two folders are used to label the dataset: an image found in the “Closed Eyes images” folder is labeled as a closed-eye image, and an image found in the “Open Eyes images” folder is labeled as an open-eye image.

Identifying and rectifying inconsistencies and errors: a prime example in the context of the MRL eyes dataset is that we found some open-eye images in the “Closed Eyes images” folder and vice versa. Left unfixed, this would be a serious error: training the model on a dataset full of labeling mistakes confuses the learning process and severely degrades the model's performance.

Data dimension set-up: in general, the collected dataset can contain images of different dimensions, i.e., different heights, widths, or channel counts. This leads us to unify the dataset dimensions, because when defining the CNN model structure, the input layer accepts only images with one specific set of dimensions chosen by the developers. Hence, images with other dimensions are incompatible with the input layer, which would raise an error and stop the training. In our case, we defined the input layer to accept an image with the following dimensions:

Height=224 pixels

Width=224 pixels

Channels=3

Those dimensions are the standard dimensions for images used to train a CNN model.

In our case, as mentioned above, we have a dataset of 24 000 images, and converting the dimensions of all those images manually would be time-consuming. For this reason we created an algorithm that first verifies whether an image has the right dimensions and, if not, converts it to the right dimensions, as shown in [Dimensions converting algorithm].

Figure 29: Dimensions converting algorithm.
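The verify-then-convert logic of [Dimensions converting algorithm] can be sketched as follows; the real conversion would typically call OpenCV's cv2.resize, which is replaced here by a stand-in so the sketch stays self-contained:

```python
TARGET = (224, 224, 3)  # height, width, channels expected by the input layer

def needs_conversion(shape):
    """True if an image's (height, width, channels) differs from TARGET."""
    return tuple(shape) != TARGET

def ensure_dims(image_shape, resize_fn=lambda s: TARGET):
    """Verify first, convert only when necessary.
    resize_fn stands in for a real call such as cv2.resize."""
    if needs_conversion(image_shape):
        return resize_fn(image_shape)
    return tuple(image_shape)

print(ensure_dims((480, 640, 3)))  # (224, 224, 3) after conversion
print(ensure_dims((224, 224, 3)))  # already conforming, left unchanged
```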

b) Data distribution:

The data distribution follows the K-fold cross validation technique: we split the data into K folds and then train our model across those folds, as described in chapter two, specifically in the “k-fold cross validation” section.

The parameter K represents the number of folds. There is no specific rule that defines how to choose K; in this project we chose K = 4, a standard value for K-fold cross validation.

As shown in [4-fold cross validation], the model is trained on each iteration, and each iteration yields an accuracy specific to that iteration's data split. Finally, the average accuracy of the model is calculated using the formula given in the “k-fold cross validation” section in chapter two. This ensures a realistic estimate of the model's performance.

Figure 30: 4-fold cross validation.
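The 4-fold splitting and averaging described above can be sketched as follows (an illustrative pure-Python version; names are ours, not the project's code):

```python
def kfold_indices(n_samples, k=4):
    """Split sample indices into k folds; yield one (train, validation)
    index pair per cross-validation iteration."""
    fold_size = n_samples // k
    folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]
    folds[-1].extend(range(k * fold_size, n_samples))  # remainder goes to the last fold
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def average_accuracy(per_iteration_accuracies):
    """Final estimate: the mean accuracy over the k iterations."""
    return sum(per_iteration_accuracies) / len(per_iteration_accuracies)
```

Each sample serves exactly once as validation data, which is what makes the averaged accuracy a realistic performance estimate.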

c) CNN structure for eyes classification model:

The CNN structure of a model is the key to obtaining a performant model. Finding the best CNN
structure for the problem we want to solve with a CNN is not an easy task and can take a long
time. For this reason we applied techniques that can find a good structure and accelerate the
model’s learning process. In this project we used two such techniques: Transfer Learning and
Random Search.

To implement transfer learning, we either find a pre-trained model that is already capable of
classifying eye state (Closed Eyes or Open Eyes), or, if none exists, we find suitable pre-trained
models trained for a different purpose, use them as feature extractors, and build on top of them a
new model capable of classifying eye state. The first option is not available, since we could not
find an existing model capable of classifying eye state; this leads us to the second approach:
using pre-trained models to build a new CNN model capable of classifying eye state.

(i) Features-extractor Layers:

A benchmark study by Google [Google research paper] compared the accuracy of the top
publicly available CNN models trained on a general dataset called ImageNet; the results are
shown in [Models benchmark].

Figure 31: CNN Models benchmark.

According to the [project’s requirements], for real-time purposes the CNN model shall not exceed
20 million parameters. This leaves only the following models that can be used as feature
extractors: EfficientNet, MobileNet-1, Inception-v2, DenseNet-121 and Xception.

These models have the following characteristics:

Table 7: Models comparison.

Model            Parameters Number   Accuracy   Comment
EfficientNet-b3  11 million          81 %       High accuracy, high parameters number
MobileNet-1      4 million           74 %       Medium accuracy, low parameters number
Inception-v2     11 million          74 %       Medium accuracy, high parameters number
DenseNet-121     12 million          77,8 %     Medium accuracy, high parameters number
Xception         20 million          79 %       Medium accuracy, high parameters number

As mentioned previously, these models are trained on a dataset called “ImageNet”, which consists
of millions of labeled images across a wide range of categories. The dataset, originally introduced
in 2009 for the ImageNet Large Scale Visual Recognition Challenge, contained around 1.2 million
labeled images across 1000 different classes (different objects such as humans, cars,
animals, …).

In deep learning it is indeed common to explore and compare various model structures to select
the most suitable one. But since trying all the given models would be computationally expensive
and time consuming, we chose three of the models listed in [Models comparison]:

 EfficientNet-b3: this model stands out with the highest accuracy among the compared
models. However, it is worth noting that it has a high number of parameters. Despite its
complexity, its superior performance makes it a strong contender.

 MobileNet-1: compared to the other models, MobileNet-1 has the lowest number of
parameters, which implies reduced computational requirements. While it may have slightly
lower accuracy compared to EfficientNet-b3, its real-time monitoring efficiency and
moderate accuracy make it an appealing choice, especially when resource constraints are a
concern like in our case.
 DenseNet-121: this model falls between EfficientNet-b3 and MobileNet-1 in terms of
accuracy and parameter count. With a higher number of parameters than MobileNet-1,
DenseNet-121 offers a trade-off between model complexity and accuracy.

(ii) Classification Layers:

After the feature extraction process, the classification layers come into play. These layers are
typically fully connected layers that take the extracted features from the previous layers (in our
case either DenseNet, EfficientNet or MobileNet) as input and learn to classify them into different
categories or classes. They use activation functions and weight matrices to map the extracted
features to the appropriate class probabilities.

The design of the classification block is influenced by factors such as the complexity of the
problem, the amount of available data, and the nature of the dataset. The model we want to
design performs binary classification (Closed Eyes or Open Eyes), which can be considered
relatively simple compared to many other classification tasks, for several reasons:

 Data distribution: in most cases, the distribution of open eyes and closed eyes is relatively
distinct, and there is a clear separation between the two classes. Open eyes and closed eyes
tend to exhibit different visual characteristics, such as the presence or absence of eyelids,
shown in [Close and open Eyes characteristics], which can make the classes easier to
distinguish.

Figure 32: Close and Open Eyes characteristics.

 Feature space: The distinguishing features for this problem are typically readily available
and relatively straightforward. Features such as the presence or absence of eyelids, eye
shape, or patterns related to open or closed eyelashes can be used to differentiate between
open and closed Eyes.

For these reasons the number of Dense layers in the classification block could vary from 1 layer to
10 layers.

In our case, we chose to include only 4 Dense layers in the classification block. Keep in mind
that there is no specific rule or formula that determines the exact number of Dense layers in
the classification block; the only way to judge this choice is by evaluating the model’s
performance in the evaluation phase.

To prevent overfitting, we added 2 Dropout layers to the classification block, and we also applied
a kernel regularizer to all the Dense layers.

As we know, all the layers in the classification block have what we call hyperparameters. These
hyperparameters are going to be specified randomly using the Random Search technique.

To perform Random Search, we define what we call a “Numbers Pool”: a range of numbers that
could potentially be used as the value of a hyperparameter. For example, a Dense layer has a
parameter for the number of neurons, which is an integer; in this case we can define a Numbers
Pool of [200, 400] and randomly pick an integer from it. Other layers, such as the Dropout layer,
have a parameter called “Dropout Rate”; this parameter is a floating-point number between 0
and 1, so we can define a Numbers Pool of [0, 1] and randomly pick a float from it. The same
logic is applied to all the layers in the classification block. The classification block structure will
be as follows:

Table 8: Classification block Layers.

Layer Type Numbers Pool


Dense [400, 500]
Dropout [0.2, 0.5]
Dense [200, 300]
Dropout [0.1, 0.2]
Dense [50, 100]
Dense 1
NB: the order of the layers in the classification block is the same as in [Classification block Layers].

For the classification layers, we need an algorithm that builds the layers and assigns their
hyperparameter values. The algorithm takes a layer as input, then verifies the layer’s type,
because each layer has its own hyperparameter type (int, float, …); after that, it randomly picks a
value from the Numbers Pool defined for that layer. See [Random Layer construction].

Figure 33: Random Layer construction.
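The random construction of the classification block from the Numbers Pools of [Classification block Layers] can be sketched as follows (an illustrative version in which Dense pools are treated as integer ranges and Dropout pools as float ranges; names are ours):

```python
import random

# Numbers Pools from Table 8, in order: (layer type, (low, high)).
POOLS = [("dense", (400, 500)), ("dropout", (0.2, 0.5)),
         ("dense", (200, 300)), ("dropout", (0.1, 0.2)),
         ("dense", (50, 100)),  ("dense", (1, 1))]

def sample_hyperparameter(layer_type, pool):
    """Pick a random value of the right type for the layer's hyperparameter."""
    low, high = pool
    if layer_type == "dense":       # neuron count is an integer
        return random.randint(low, high)
    if layer_type == "dropout":     # dropout rate is a float
        return random.uniform(low, high)
    raise ValueError(f"unknown layer type: {layer_type}")

def build_classification_block():
    """One random candidate: a (layer type, hyperparameter value) list."""
    return [(t, sample_hyperparameter(t, p)) for t, p in POOLS]
```

Each call yields one random candidate structure; Random Search repeats this, trains each candidate, and keeps the best-performing one.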

(iii) Conclusion:

Combining the classification block with the feature-extractor models (EfficientNet-b3,
MobileNet-1, and DenseNet-121), we get three different structures, as shown in [Eyes State
Classifier Structures].

Figure 34: Eyes State Classifier Structures.

d) Training phase:

After preparing the dataset and the CNN structures, in this phase we start training the eye-state
classifier. In the previous section we defined three structures to train: one based on EfficientNet,
one based on MobileNet, and one based on DenseNet.

To train the CNN model, we use Keras, part of the TensorFlow framework developed by Google.
In this phase we need to set up the training hyperparameters:

 Epochs: the CNN model we are training is based on transfer learning, which is not very
time consuming; for this reason we set a maximum of 100 epochs and use an EarlyStopping
callback to stop the training when the training metrics plateau.
 Optimizer: a study [5] suggests that the Adam optimizer is a good choice for image
classification problems due to its robust performance, adaptability to different tasks, and
combination of features from other optimizers. Adam combines the advantages of the
AdaGrad and RMSProp optimizers; it has also shown fast convergence and good
generalization across a wide range of tasks, including image classification.
 Learning rate: keeping the learning rate at its default value (0.001) for the first training
session of a CNN model is often considered a good starting point, for two reasons:
generalization, because the default learning rate is typically chosen to balance convergence
speed and generalization ability; and exploration, because during the initial training session
the model is in a relatively unexplored parameter space, and the default learning rate lets
the optimization algorithm explore and search for good regions of that space.
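The plateau rule behind the EarlyStopping callback mentioned above can be sketched in isolation as follows (a simplified illustration of the callback's behaviour; class and parameter names here are ours, not Keras's exact API):

```python
class EarlyStop:
    """Stop training once the monitored loss has failed to improve
    for `patience` consecutive epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def update(self, monitored_loss):
        """Call once per epoch; returns True when training should stop."""
        if monitored_loss < self.best - self.min_delta:
            self.best = monitored_loss   # improvement: reset the counter
            self.wait = 0
        else:
            self.wait += 1               # plateau: one more epoch without improvement
        return self.wait >= self.patience
```

With the 100-epoch cap, this rule ends transfer-learning runs early once the loss curve flattens instead of wasting the remaining epochs.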

In this section we analyze the learning capabilities of the three CNN structures defined previously
and check whether models based on these structures learn and generalize well to unseen data or
overfit the dataset.

(i) EfficientNet-b3 based Eyes Classifier curves:

Figure 35: EfficientNet-b3 iteration 1.

Figure 36: EfficientNet-b3 iteration 2.

Figure 37: EfficientNet-b3 iteration 3.

Figure 38: EfficientNet-b3 iteration 4.

(ii) DenseNet-121 based Eyes Classifier curves:

Figure 39: DenseNet-121 iteration 1.

Figure 40: DenseNet-121 iteration 2.

Figure 41: DenseNet-121 iteration 3.

Figure 42: DenseNet-121 iteration 4.

(iii) MobileNet-1 based Eyes Classifier curves:

Figure 43: MobileNet-1 iteration 1.

Figure 44: MobileNet-1 iteration 2.

Figure 45: MobileNet-1 iteration 3.

Figure 46: MobileNet-1 iteration 4.

e) Evaluation phase:

The evaluation phase is the decisive phase in which we evaluate all the trained models to choose
the most performant CNN model. The models are evaluated based on the validation curves of the
training sessions and on a final evaluation on the test dataset.

(i) k-Cross Validation Curves analysis:


Table 9: K-fold Cross Validation Analysis.

Cross Validation     EfficientNet-b3        DenseNet-121           MobileNet-1
Iterations           Accuracy   Loss        Accuracy   Loss        Accuracy   Loss
Iteration 1          0,958333   0,181669    0,998149   0,020619    0,966145   0,166959
Iteration 2          0,999399   0,004268    0,997121   0,018682    0,999198   0,004487
Iteration 3          0,999398   0,004267    0,998972   0,007433    0,999198   0,004487
Iteration 4          1          0,001921    0,999177   0,007433    0,999599   0,003745
Min                  0,958333   0,001921    0,997121   0,007433    0,966145   0,003745
Max                  1          0,181669    0,999177   0,020619    0,999599   0,166959
Mean                 0,989282   0,048031    0,998355   0,013540    0,991035   0,044920
Standard Deviation   0,017870   0,077162    0,000809   0,006147    0,014371   0,070459

Figure 47: EfficientNet-b3 Graph.

Figure 48: DenseNet-121 Graph.

Figure 49: MobileNet-1 Graph.

Based on the k-cross validation curves [EfficientNet-b3 Graph], [DenseNet-121 Graph], and
[MobileNet-1 Graph], DenseNet-121 is the most suitable structure for eye-state classification: it
has the highest accuracy and the lowest loss across all iterations, and it also has the lowest
standard deviation, which indicates the stability of the model’s predictions.

(ii) Test dataset Evaluation:

To confirm the conclusion drawn from the k-cross validation curves analysis, one final evaluation
must be done: evaluating the model on the test dataset, using the confusion matrix as the
evaluation tool.

We prepared two test datasets. The first, called the No-Eyes-Glasses dataset [No-Glasses
dataset], contains images without glasses; the second, called the Eyes-Glasses dataset [Glasses
dataset], contains images with glasses. The reason for having two datasets, in particular one with
glasses, is to check whether the model can still predict the eye state when the driver is wearing
glasses.

Figure 50: No-Eyes-Glasses dataset.

Figure 51: Eyes-Glasses dataset.

The confusion matrix evaluation of all the three CNN models is:

 EfficientNet-b3:

Figure 52: EfficientNet prediction on No-Eyes-Glasses dataset.

In [EfficientNet prediction on No-Eyes-Glasses dataset], consider the following values:

True Negative = 482, True Positive = 368, False Negative = 135 and False Positive = 5

The calculated metrics (Accuracy and Misclassification Rate) are:

Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Negative + False Positive)

Accuracy = (368 + 482) / (368 + 482 + 135 + 5) ≈ 0,86 = 86 %

Misclassification Rate = (False Negative + False Positive) / (True Positive + True Negative + False Negative + False Positive)

Misclassification Rate = (135 + 5) / (368 + 482 + 135 + 5) ≈ 0,14 = 14 %
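The two metrics used throughout this evaluation follow directly from the confusion-matrix counts, as in this short helper (an illustrative sketch; the function name is ours):

```python
def confusion_metrics(tp, tn, fn, fp):
    """Accuracy and misclassification rate from confusion-matrix counts:
    accuracy = (TP + TN) / total, misclassification = (FN + FP) / total."""
    total = tp + tn + fn + fp
    return (tp + tn) / total, (fn + fp) / total
```

Note that the two rates always sum to 1, which is a quick sanity check on any reported pair of values.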

Figure 53: EfficientNet-b3 on Eyes-Glasses dataset.

In [EfficientNet-b3 on Eyes-Glasses dataset], consider the following values:

True Negative = 392, True Positive = 179, False Negative = 222 and False Positive = 8

The calculated metrics (Accuracy and Misclassification Rate) are:

Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Negative + False Positive)

Accuracy = (179 + 392) / (179 + 392 + 222 + 8) ≈ 0,71 = 71 %

Misclassification Rate = (False Negative + False Positive) / (True Positive + True Negative + False Negative + False Positive)

Misclassification Rate = (222 + 8) / (179 + 392 + 222 + 8) ≈ 0,29 = 29 %

 DenseNet-121:

Figure 54: DenseNet-121 on No-Eyes-Glasses dataset.

In [DenseNet-121 on No-Eyes-Glasses dataset], consider the following values:

True Negative = 480, True Positive = 388, False Negative = 115 and False Positive = 7

The calculated metrics (Accuracy and Misclassification Rate) are:

Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Negative + False Positive)

Accuracy = (388 + 480) / (388 + 480 + 115 + 7) ≈ 0,88 = 88 %

Misclassification Rate = (False Negative + False Positive) / (True Positive + True Negative + False Negative + False Positive)

Misclassification Rate = (115 + 7) / (388 + 480 + 115 + 7) ≈ 0,12 = 12 %

Figure 55: DenseNet-121 on Eyes-Glasses dataset.

In [DenseNet-121 on Eyes-Glasses dataset], consider the following values:

True Negative = 342, True Positive = 383, False Negative = 18 and False Positive = 58

The calculated metrics (Accuracy and Misclassification Rate) are:

Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Negative + False Positive)

Accuracy = (383 + 342) / (383 + 342 + 18 + 58) ≈ 0,91 = 91 %

Misclassification Rate = (False Negative + False Positive) / (True Positive + True Negative + False Negative + False Positive)

Misclassification Rate = (18 + 58) / (383 + 342 + 18 + 58) ≈ 0,09 = 9 %

 MobileNet-1:

Figure 56: MobileNet-1 on No-Eyes-Glasses dataset.

In [MobileNet-1 on No-Eyes-Glasses dataset], consider the following values:

True Negative = 487, True Positive = 373, False Negative = 130 and False Positive = 0

The calculated metrics (Accuracy and Misclassification Rate) are:

Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Negative + False Positive)

Accuracy = (373 + 487) / (373 + 487 + 130 + 0) ≈ 0,87 = 87 %

Misclassification Rate = (False Negative + False Positive) / (True Positive + True Negative + False Negative + False Positive)

Misclassification Rate = (130 + 0) / (373 + 487 + 130 + 0) ≈ 0,13 = 13 %

Figure 57: MobileNet-1 on Eyes-Glasses dataset.

In [MobileNet-1 on Eyes-Glasses dataset], consider the following values:

True Negative = 398, True Positive = 210, False Negative = 191 and False Positive = 2

The calculated metrics (Accuracy and Misclassification Rate) are:

Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Negative + False Positive)

Accuracy = (210 + 398) / (210 + 398 + 191 + 2) ≈ 0,76 = 76 %

Misclassification Rate = (False Negative + False Positive) / (True Positive + True Negative + False Negative + False Positive)

Misclassification Rate = (191 + 2) / (210 + 398 + 191 + 2) ≈ 0,24 = 24 %

 Conclusion:

From [Models Confusion Matrix results], we can conclude that the most performant of the three
CNN models is the DenseNet-121 based model, with an average accuracy of about 89% and an
average misclassification rate of about 10,7% across the two test datasets.

Table 10: Models Confusion Matrix results.

                  No-Eyes-Glasses dataset          Eyes-Glasses dataset
                  Accuracy   Misclassification     Accuracy   Misclassification
EfficientNet-b3   86 %       14 %                  71 %       29 %
DenseNet-121      88 %       12 %                  91 %       9 %
MobileNet-1       87 %       13 %                  76 %       24 %

[Link] Eyes Localization and Classification:

To get the eye closure ratio, we must first localize the eyes in the image; after that we use the
eye classifier we developed to detect whether the eyes blinked or not. To localize the eyes in an
image we use a face-landmarks algorithm. In this project we used the face landmarks provided
by the Dlib library. In Dlib’s face landmarks, the eye landmarks are the following:

Left Eye: the landmarks defining the left Eye region in Dlib are 6 points which are the landmarks
[43, 44, 45, 46, 47, 48] as shown in [Eyes landmarks].

Right Eye: the landmarks defining the right Eye region in Dlib are 6 points which are the
landmarks [37, 38, 39, 40, 41, 42] as shown in [Eyes landmarks].

Figure 58: Eyes landmarks.

Using Dlib’s functionality we can get the coordinates of each eye-landmark point; from these
coordinates we can define a rectangular region around the eyes and use it to crop the eyes from
the image, as shown in [Eyes localization].

Figure 59: Eyes localization.
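The rectangle construction described above can be sketched as follows (landmark numbering as in [Eyes landmarks]; the `margin` padding and the function names are our illustrative additions, and in the real pipeline the landmark coordinates come from Dlib's shape predictor):

```python
# Dlib's 68-point model, using the report's 1-indexed numbering:
# left eye = points 43-48, right eye = points 37-42.
LEFT_EYE = range(43, 49)
RIGHT_EYE = range(37, 43)

def eye_bounding_box(landmarks, indices, margin=5):
    """landmarks: dict mapping a 1-indexed point id -> (x, y).
    Returns the (x0, y0, x1, y1) rectangle used to crop the eye."""
    xs = [landmarks[i][0] for i in indices]
    ys = [landmarks[i][1] for i in indices]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)
```

The resulting rectangle is then used to slice the eye region out of the frame before passing it to the classifier.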

After localizing and cropping the eyes, we can pass them to the classifier to predict the eye state
(Closed Eyes or Open Eyes).

Figure 60: Eye Classification.

[Link] Eyes closure ratio (blinking ratio):

Now that we have developed a tool capable of detecting whether the eyes are closed, we can
easily calculate the driver’s eye closure ratio. A study [6] investigated the relationship between
blinking rate and the fatigue (drowsy) state of an individual; it showed that the average blinking
rate of an alert individual is 24 blinks/minute, and that when an individual starts getting drowsy,
the blinking rate increases from 24 blinks/minute to more than 32 blinks/minute. Based on this
study, we build an algorithm that calculates the blinking rate and monitors the driver’s eyes, so
that when the driver’s blinking rate increases and surpasses 32 blinks/minute, we can alert the
driver.

To ensure this function we are going to use the algorithm shown in [Eyes Closure ratio function].

Figure 61: Eyes Closure ratio function.
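A minimal sketch of such a blink counter is shown below (it counts closed-to-open transitions over one minute of per-frame classifier outputs; the exact algorithm is the one of [Eyes Closure ratio function], and these names are ours):

```python
DROWSY_BLINK_RATE = 32  # blinks/minute threshold from study [6]

def count_blinks(eye_states):
    """eye_states: per-frame predictions over one minute, 'closed' or 'open'.
    A blink is counted when the eyes reopen after being closed."""
    blinks, was_closed = 0, False
    for state in eye_states:
        if state == "closed":
            was_closed = True
        elif was_closed:          # open again after being closed -> one blink completed
            blinks += 1
            was_closed = False
    return blinks

def is_drowsy_by_blink_rate(eye_states_per_minute):
    return count_blinks(eye_states_per_minute) > DROWSY_BLINK_RATE
```

Feeding one minute of classifier outputs at a time turns the per-frame predictions into the blinks/minute rate the study's threshold is defined on.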

[Link] Eyes closure time:

An eye closure time monitoring function has the role of preventing the driver from falling asleep
while driving the vehicle. The function predicts the driver’s eye state; if the driver keeps their
eyes continuously closed for a certain amount of time, the function alerts the driver. To
implement this function we use the algorithm shown in [Eyes closure time function].

Figure 62: Eyes closure time function.
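The closure-time rule can be sketched as follows (the 2-second threshold below is an illustrative assumption of ours, not a value taken from the report):

```python
CLOSURE_ALERT_SECONDS = 2.0  # illustrative threshold (assumption, not from the report)

def closed_too_long(frame_states, fps, threshold=CLOSURE_ALERT_SECONDS):
    """Return True if the eyes stayed continuously closed for at least
    `threshold` seconds, given per-frame states and the camera frame rate."""
    max_run = run = 0
    for state in frame_states:
        run = run + 1 if state == "closed" else 0  # reset on any open frame
        max_run = max(max_run, run)
    return max_run / fps >= threshold
```

Expressing the duration in frames (run length) divided by the frame rate keeps the rule independent of the camera's fps.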

3.3.3 Eyes closure percentage subsystem:

This subsystem monitors the driver’s eye closure through what is called the Eye Aspect Ratio.
The Eye Aspect Ratio is a powerful metric that can be used to detect driver drowsiness and
prevent potential accidents on the road. It is a measure of the eye’s openness, calculated as the
ratio of the vertical and horizontal eye landmarks obtained from the facial-landmark system we
built. By monitoring changes in the Eye Aspect Ratio over time, it is possible to determine the
level of drowsiness exhibited by a driver. Typically, when a person is alert and awake, their eyes
remain relatively open, resulting in a higher Eye Aspect Ratio value; as drowsiness sets in, the
eyes tend to close gradually, leading to a decrease in the Eye Aspect Ratio. By setting a
threshold, an intelligent system can issue warnings or trigger interventions when the eye closure
percentage surpasses the specified level, alerting the driver to their drowsy state. A study [7]
concluded that when the eye closure percentage surpasses 70%, the person tends to be drowsy.
This leads us to use 70% as the threshold for our monitoring system. The formula to calculate
the Eye Aspect Ratio is:

Figure 63: Eyes landmarks.

Eye Aspect Ratio = (‖P2 − P6‖ + ‖P3 − P5‖) / (2 ∗ ‖P1 − P4‖)

To implement this function we use the algorithm shown in [Eyes Closure Percentage function].

Figure 64: Eyes Closure Percentage function.
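The Eye Aspect Ratio formula translates directly to code (landmark numbering P1 to P6 as in Figure 63; the function name is ours):

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def eye_aspect_ratio(p):
    """p: dict mapping the six eye landmarks P1..P6 to (x, y) coordinates.
    EAR = (|P2-P6| + |P3-P5|) / (2 * |P1-P4|)."""
    return (dist(p[2], p[6]) + dist(p[3], p[5])) / (2 * dist(p[1], p[4]))
```

The two vertical distances in the numerator shrink as the eyelids close, while the horizontal distance in the denominator stays roughly constant, which is why the ratio tracks eye openness.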

3.3.4 Mouth closure percentage subsystem:

Monitoring mouth expressions is done through the so-called Mouth Aspect Ratio. The Mouth
Aspect Ratio can be effectively used to detect yawning, a common physiological response
associated with fatigue and drowsiness. Yawning is characterized by a distinctive mouth
movement in which the mouth opens wide, often accompanied by a deep inhalation. By
monitoring changes in the Mouth Aspect Ratio, it is possible to detect yawning occurrences in
real time: during a yawn, the mouth opens significantly wider than during normal speech or other
facial expressions, resulting in a noticeable increase in the Mouth Aspect Ratio value.

[Link] Yawn expression detection:

Mouth expressions are numerous: smiling, laughing, yawning, and so on. Each of these
expressions has its own Mouth Aspect Ratio region. To detect the yawning expression we need
to find its Mouth Aspect Ratio region. The formula to calculate the Mouth Aspect Ratio is:

Figure 65: Mouth landmarks.

Mouth Aspect Ratio = (‖P2 − P8‖ + ‖P3 − P7‖ + ‖P4 − P6‖) / (2 ∗ ‖P1 − P5‖)

This formula compares the height of the mouth opening, represented by
‖P2 − P8‖ + ‖P3 − P7‖ + ‖P4 − P6‖, with the width of the mouth, represented by 2 ∗ ‖P1 − P5‖.

From this formula it is possible to theoretically deduce the region of each expression. Generally,
when someone is laughing or smiling, the width of the mouth increases significantly while the
height of the mouth opening stays relatively the same, as shown in [Person Laughing (Smiling)].

Figure 66: Person Laughing (Smiling).

In [Person Laughing (Smiling)], the mouth width is almost equal to the sum of the three
mouth-height distances.

In the Mouth Aspect Ratio formula we divide the sum of the three mouth-height distances by
twice the mouth width. This means that the Mouth Aspect Ratio of a smiling person should lie in
a region between 0 and 0,7.

In contrast, when someone yawns, the height of their mouth opening increases much more
noticeably than the relatively small increase in mouth width, as shown in [Person yawning].

Figure 67: Person yawning.

In [Person yawning], the sum of the mouth-height distances is almost three times the mouth’s
width.

In the Mouth Aspect Ratio formula we divide the sum of the three mouth-height distances by
twice the mouth width. This means that the Mouth Aspect Ratio of a yawning person should lie in
a region above 1.

To implement this function we use the algorithm shown in [Mouth closure percentage function].

Figure 68: Mouth closure percentage function.

To support our theory about the laughing and yawning regions, we test the Mouth Aspect Ratio
function on 10 images: 5 images of a person laughing and 5 images of a person yawning.

Table 11: Yawning and Laughing Mouth Aspect Ratio.

                     Yawning
Mouth Aspect Ratio   1,37   0,98   0,92   1,29   1,00

                     Laughing
Mouth Aspect Ratio   0,58   0,30   0,38   0,67   0,31

The theoretical supposition we made (someone yawning has a Mouth Aspect Ratio above 1, and
someone laughing has a Mouth Aspect Ratio below 0,7) is broadly confirmed by testing our
function on different people laughing and yawning [Yawning and Laughing Mouth Aspect Ratio],
although two yawning samples fall slightly below 1. Since this function targets the yawning
expression, we set a threshold of 0,9 (lowered from 1 to 0,9 to make sure all yawning expressions
are detected): when the Mouth Aspect Ratio surpasses this threshold, the function considers that
the driver is yawning.
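The Mouth Aspect Ratio formula and this 0,9 threshold translate directly into a short sketch (landmark numbering P1 to P8 as in [Mouth landmarks]; function names are ours):

```python
from math import dist

YAWN_THRESHOLD = 0.9  # lowered from 1 so that all yawning expressions are caught

def mouth_aspect_ratio(p):
    """p: dict mapping the eight mouth landmarks P1..P8 to (x, y).
    MAR = (|P2-P8| + |P3-P7| + |P4-P6|) / (2 * |P1-P5|)."""
    return (dist(p[2], p[8]) + dist(p[3], p[7]) + dist(p[4], p[6])) / (2 * dist(p[1], p[5]))

def is_yawning(p):
    return mouth_aspect_ratio(p) > YAWN_THRESHOLD
```

A wide-open mouth drives the three vertical distances up against a roughly fixed width, pushing the ratio past the threshold.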

General Conclusion:

In conclusion, the Drowsiness Detection project based on facial expressions, specifically focusing
on indicators like Eyes blinking, Eyes closure time, Eyes closure percentage, head orientation, and
yawning, has proven to be a promising approach for addressing the critical issue of driver
drowsiness. Through the implementation of various algorithms and techniques, we successfully
developed a system that can effectively detect and alert drivers when signs of drowsiness are
detected.

The project’s findings highlight the importance of utilizing facial expressions as valuable indicators
of driver drowsiness. By leveraging computer vision and machine learning techniques, we were
able to accurately analyze and interpret these facial features in real-time, providing a proactive
warning system to drivers.

The effectiveness of the system was evaluated through extensive testing and validation, considering
different scenarios and subjects. The results demonstrated a high level of accuracy in detecting
drowsiness episodes, providing reliable alerts to drivers and potentially preventing accidents caused
by drowsiness-related impairment.

While the project achieved significant progress in drowsiness detection based on facial
expressions, there are areas for further improvement. Future enhancements could explore more
robust techniques to handle challenges such as varying lighting conditions and partial occlusions
(for example, glasses), improving the system’s reliability across diverse environments.

Moreover, expanding the dataset used for training the model and considering a broader range of
facial expressions associated with drowsiness could enhance the system’s accuracy and
generalizability. Additionally, integrating other physiological measures or data sources, such as
heart rate or steering behavior, could further strengthen the system’s capability to detect drowsiness.

In conclusion, this project provides a significant contribution towards mitigating the risks associated
with driver drowsiness. The developed drowsiness detection system based on facial expressions
offers a practical solution that can be integrated into existing vehicle safety systems, potentially
saving lives and preventing accidents caused by drowsy driving.

Bibliography:

[1] C. C. Liu, S. G. Hosking, and M. G. Lenné, “Predicting driver drowsiness using vehicle measures:
Recent insights and future challenges”, Journal of Safety Research, vol. 40, no. 4, pp. 239-245,
Aug. 2009, doi: 10.1016/[Link].2009.04.005.
[2] R. Feng, G. Zhang, and B. Cheng, “An on-board system for detecting driver drowsiness based on
multi-sensor data fusion using Dempster-Shafer theory”, in 2009 International Conference on
Networking, Sensing and Control, Okayama, Japan: IEEE, Mar. 2009, pp. 897-902, doi:
10.1109/ICNSC.2009.4919399.
[3] S. Otmani, T. Pebayle, J. Roge, and A. Muzet, “Effect of driving duration and partial sleep
deprivation on subsequent alertness and performance of car drivers”, Physiology & Behavior,
vol. 84, no. 5, pp. 715-724, Apr. 2005, doi: 10.1016/[Link].2005.02.021.
[4] M. Ingre, T. Akerstedt, B. Peters, A. Anund, and G. Kecklund, “Subjective sleepiness, simulated
driving performance and blink duration: examining individual differences”, J Sleep Res, vol. 15,
no. 1, pp. 47-53, Mar. 2006, doi: 10.1111/j.1365-2869.2006.00504.x.
[5] D. Choi, C. J. Shallue, Z. Nado, J. Lee, C. J. Maddison, and G. E. Dahl, “On Empirical Comparisons
of Optimizers for Deep Learning”, arXiv, Jun. 15, 2020. Accessed: Jun. 22, 2023. [Online].
Available: [Link]
[6] Z. A. Haq and Z. Hasan, “Eye-blink rate detection for fatigue determination”, in 2016 1st India
International Conference on Information Processing (IICIP), Delhi, India: IEEE, Aug. 2016, pp. 1-5,
doi: 10.1109/IICIP.2016.7975348.
[7] “PERCLOS: A Valid Psychophysiological Measure of Alertness As Assessed by Psychomotor
Vigilance”.

Annexes:

Annex A: Head orientation Classifier:
Convolutional Neural Network model for head orientation classification:

I. Data:

- Source: the dataset used to train the model is the FEI face dataset. The FEI face
database is a Brazilian face database containing face images taken between June
2005 and March 2006 at the Artificial Intelligence Laboratory of FEI in São Bernardo do
Campo, São Paulo, Brazil. There are 14 images for each of 200 individuals, a total of 2800
images. All images are in colour and taken against a white homogeneous background in an
upright frontal position, with profile rotation of up to about 180 degrees. Scale may vary by
about 10%, and the original size of each image is 640x480 pixels. The faces are mainly
those of students and staff at FEI, between 19 and 40 years old, with distinct
appearance, hairstyles, and adornments. The numbers of male and female subjects are
equal (100 each). The figure below shows some examples of image variations from the
FEI face database.
- Data Cleaning: the data cleaning process is a necessary step that the dataset must pass
through. Cleaning the data involves what is called data labeling: to train the model there
will be two folders, one named “Faced Head Orientation” and the other named “Other
Head Orientation”. One folder will contain only faced-head-orientation images and the
other only other-head-orientation images. This is necessary because these two folders
are used to label the dataset: an image found in the “Faced Head Orientation” folder will
be labeled as a faced-head-orientation image, and an image found in the “Other Head
Orientation” folder will be labeled as an other-head-orientation image.

Figure: Head orientations

II. Data distribution:
The data distribution follows the K-fold cross-validation technique: we split the data into K folds
and then train our model on them, as described in the “k-cross validation” section of chapter two.

The parameter K represents the number of folds. There is no specific rule that defines how to
choose K; in this project we chose the value 4, a commonly used value for K-fold cross-validation.

CNN model structure:

For the same reasons mentioned in chapter 3 for the eye-state classifier, we use the same three
feature extractors: MobileNet-1, EfficientNet-b3 and DenseNet-121.

The head orientation problem is more complicated than the eye-state classifier, because the
feature difference between the faced head orientation and the other head orientations is not
significant, as shown in [Head Orientations]. As the figure shows, head orientation is really
complicated to classify, since the faced head orientation and the other head orientations share
many of the same features.

Figure: Head Orientations

For this reason, we added more Dense layers to the classification block so that during training the
model can better learn to predict head orientations.
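A deepened classification block on top of a frozen backbone can be sketched in Keras as follows. This is only an illustration of the idea: the MobileNet backbone and the 224x224x3 input come from the report, while the specific layer widths (256/128/64) and the sigmoid binary output are assumptions, not the report's exact configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_head_orientation_model():
    # Frozen pre-trained backbone (feature transfer); input fixed at 224x224x3.
    # weights=None keeps the sketch self-contained (no download); the project
    # would load pre-trained "imagenet" weights instead.
    base = keras.applications.MobileNet(
        include_top=False, weights=None, input_shape=(224, 224, 3))
    base.trainable = False
    # Deeper classification block: the extra Dense layers give the model
    # more capacity to separate subtle head-orientation differences.
    return keras.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # faced vs. other orientation
    ])
```

The same head structure would be stacked on the EfficientNet-B3 and DenseNet-121 backbones for the other two variants.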

Figure: CNN model structures for head orientation

III. Model Training:
- EfficientNet-B3 based head orientation classifier curves:

Figure: Head Orientation EfficientNet model Training.

- DenseNet-121 based head orientation classifier curves:

Figure: Head Orientation DenseNet model Training.

- MobileNet-1 based head orientation classifier curves:

Figure: Head Orientation MobileNet model training.

